Re: [PATCH 4/5] watchdog: control hard lockup detection default
>- Original Message - >From: "Ingo Molnar" >To: "Don Zickus" >Cc: a...@linux-foundation.org, kvm@vger.kernel.org, pbonz...@redhat.com, >mi...@redhat.com, "LKML" , "Ulrich >Obergfell" >, "Andrew Jones" >Sent: Monday, August 18, 2014 11:16:44 AM >Subject: Re: [PATCH 4/5] watchdog: control hard lockup detection default > > > * Don Zickus wrote: > >> The running kernel still has the ability to enable/disable at any >> time with /proc/sys/kernel/nmi_watchdog us usual. However even >> when the default has been overridden /proc/sys/kernel/nmi_watchdog >> will initially show '1'. To truly turn it on one must disable/enable >> it, i.e. >> echo 0 > /proc/sys/kernel/nmi_watchdog >> echo 1 > /proc/sys/kernel/nmi_watchdog > > This looks like a bug, why is this so? > > Thanks, > > Ingo This is because the hard lockup detector and the soft lockup detector are enabled and disabled at the same time - there isn't a separate 'knob' for each of them. Both are controlled via the 'watchdog_user_enabled' variable which is 1 by default. lockup_detector_init if (watchdog_user_enabled) watchdog_enable_all_cpus smpboot_register_percpu_thread(&watchdog_threads) At boot time, the above code path lauches a 'watchdog/N' thread for each online CPU. The watchdog_enable() function is executed in the context of these threads, and this attempts to enable the hard lockup detector and the soft lockup detector. [Note: Soft lockup detection is implemented in watchdog_timer_fn().] watchdog_enable hrtimer_init(hrtimer, ...) hrtimer->function = watchdog_timer_fn watchdog_nmi_enable perf_event_create_kernel_counter(..., watchdog_overflow_callback) hrtimer_start(hrtimer, ...) On bare metal systems or in virtual environments where the hypervisor does not emulate a PMU, watchdog_nmi_enable() can fail to allocate and enable a PMU counter. This is reported by a console message: NMI watchdog: disabled (cpu0): hardware events not enabled Hence, we can end up with a situation where the soft lockup detector is enabled and the hard lockup detector is not enabled. However, the output of 'cat /proc/sys/kernel/nmi_watchdog' is 1 because it merely shows the state of the 'watchdog_user_enabled' variable. The above is the behaviour even without the proposed patch. The patch merely adds the following hunk in watchdog_nmi_enable() to 'fake' a -ENOENT error return from perf_event_create_kernel_counter(). +if (!watchdog_hardlockup_detector_is_enabled()) { +event = ERR_PTR(-ENOENT); +goto handle_err; +} The patch does not break the output of 'cat /proc/sys/kernel/nmi_watchdog' since the discrepancy between the output and the actual state of the hard lockup detector is nothing new. Regards, Uli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] watchdog: control hard lockup detection default
> - Original Message - > From: "Andrew Jones" > To: linux-ker...@vger.kernel.org, kvm@vger.kernel.org > Cc: uober...@redhat.com, dzic...@redhat.com, pbonz...@redhat.com, > a...@linux-foundation.org, mi...@redhat.com > Sent: Thursday, July 24, 2014 12:13:30 PM > Subject: [PATCH 2/3] watchdog: control hard lockup detection default [...] > The running kernel still has the ability to enable/disable at any > time with /proc/sys/kernel/nmi_watchdog us usual. However even > when the default has been overridden /proc/sys/kernel/nmi_watchdog > will initially show '1'. To truly turn it on one must disable/enable > it, i.e. > echo 0 > /proc/sys/kernel/nmi_watchdog > echo 1 > /proc/sys/kernel/nmi_watchdog [...] > @@ -626,15 +665,17 @@ int proc_dowatchdog(struct ctl_table *table, int write, > * disabled. The 'watchdog_running' variable check in > * watchdog_*_all_cpus() function takes care of this. > */ > -if (watchdog_user_enabled && watchdog_thresh) > +if (watchdog_user_enabled && watchdog_thresh) { > +watchdog_enable_hardlockup_detector(true); > err = watchdog_enable_all_cpus(old_thresh != > watchdog_thresh); > -else > +} else [...] I just realized a possible issue in the above part of the patch: If we would want to give the user the option to override the effect of patch 3/3 via /proc, I think proc_dowatchdog() should enable hard lockup detection _only_ in case of a state transition from 'NOT watchdog_running' to 'watchdog_running'. | if (watchdog_user_enabled && watchdog_thresh) {| need to add this if (!watchdog_running) <---' watchdog_enable_hardlockup_detector(true); err = watchdog_enable_all_cpus(old_thresh != watchdog_thresh); } else ... The additional 'if (!watchdog_running)' would _require_ the user to perform the sequence of commands echo 0 > /proc/sys/kernel/nmi_watchdog echo 1 > /proc/sys/kernel/nmi_watchdog to enable hard lockup detection explicitly. I think changing the 'watchdog_thresh' while 'watchdog_running' is true should _not_ enable hard lockup detection as a side-effect, because a user may have a 'sysctl.conf' entry such as kernel.watchdog_thresh = ... or may only want to change the 'watchdog_thresh' on the fly. I think the following flow of execution could cause such undesired side-effect. proc_dowatchdog if (watchdog_user_enabled && watchdog_thresh) { watchdog_enable_hardlockup_detector hardlockup_detector_enabled = true watchdog_enable_all_cpus if (!watchdog_running) { ... } else if (sample_period_changed) update_timers_all_cpus for_each_online_cpu update_timers watchdog_nmi_disable ... watchdog_nmi_enable watchdog_hardlockup_detector_is_enabled return true enable perf counter for hard lockup detection Regards, Uli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] watchdog: control hard lockup detection default
>- Original Message - >From: "Paolo Bonzini" >To: "Ulrich Obergfell" >Cc: "Andrew Jones" , linux-ker...@vger.kernel.org, >kvm@vger.kernel.org, dzic...@redhat.com, a...@linux-foundation.org, >>mi...@redhat.com >Sent: Thursday, July 24, 2014 1:45:47 PM >Subject: Re: [PATCH 2/3] watchdog: control hard lockup detection default > >Il 24/07/2014 13:44, Ulrich Obergfell ha scritto: >> > But this means that it is not possible to re-enable softlockup detection >> > only. I think that should be the effect of echo 0 + echo 1, if >> > hardlockup detection was disabled by either the command line or patch 3. >> >> The idea was to give the user two options to override the effect of patch >> 3/3. >> Either via the kernel command line ('nmi_watchdog=') at boot time or via >> /proc >> ('echo 0' + 'echo 1') when the system is up and running. > > I think the kernel command line is enough; another alternative is to > split the nmi_watchdog /proc entry in two. > > Paolo The current behaviour (without the patch) already allows a user to disable NMI watchdog at boot time ('nmi_watchdog=0') and enable it explicitly when the system is up and running ('echo 0' + 'echo 1'). I think it would be more consistent with this behaviour and more intuitive if we would give the user the option to override the effect of patch 3/3 via /proc. By 'intuitive' I mean that the user says: 'I _want_ this to be enabled'. Regards, Uli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] watchdog: control hard lockup detection default
>- Original Message - >From: "Paolo Bonzini" >To: "Ulrich Obergfell" >Cc: "Andrew Jones" , linux-ker...@vger.kernel.org, >kvm@vger.kernel.org, dzic...@redhat.com, a...@linux-foundation.org, >>mi...@redhat.com >Sent: Thursday, July 24, 2014 1:26:40 PM >Subject: Re: [PATCH 2/3] watchdog: control hard lockup detection default > >Il 24/07/2014 13:18, Ulrich Obergfell ha scritto: >>>> >> The running kernel still has the ability to enable/disable at any >>>> >> time with /proc/sys/kernel/nmi_watchdog us usual. However even >>>> >> when the default has been overridden /proc/sys/kernel/nmi_watchdog >>>> >> will initially show '1'. To truly turn it on one must disable/enable >>>> >> it, i.e. >>>> >> echo 0 > /proc/sys/kernel/nmi_watchdog >>>> >> echo 1 > /proc/sys/kernel/nmi_watchdog >>> > >>> > Why is it hard to make this show the right value? :) >>> > >>> > Paolo >> 'echo 1 > /proc/sys/kernel/nmi_watchdog' enables both - hard lockup and >> soft lockup detection. watchdog_enable_all_cpus() starts a 'watchdog/N' >> thread for each CPU. If the kernel runs on a bare metal system where the >> processor does not have a PMU, or when perf_event_create_kernel_counter() >> returns failure to watchdog_nmi_enable(), or when the kernel runs as a >> guest on a hypervisor that does not emulate a PMU, then the 'watchdog/N' >> threads are still active for soft lockup detection. Patch 2/3 essentially >> makes watchdog_nmi_enable() behave in the same way as if -ENOENT would >> have been returned by perf_event_create_kernel_counter(). This is then >> reported via a console message. >> >> NMI watchdog: disabled (cpu0): hardware events not enabled >> >> It's hard say what _is_ 'the right value' (because lockup detection is >> then enabled 'partially'), regardless of whether patch 2/3 is applied >> or not. > > But this means that it is not possible to re-enable softlockup detection > only. I think that should be the effect of echo 0 + echo 1, if > hardlockup detection was disabled by either the command line or patch 3. > > Paolo The idea was to give the user two options to override the effect of patch 3/3. Either via the kernel command line ('nmi_watchdog=') at boot time or via /proc ('echo 0' + 'echo 1') when the system is up and running. Regards, Uli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] watchdog: control hard lockup detection default
> - Original Message - > From: "Paolo Bonzini" > To: "Andrew Jones" , linux-ker...@vger.kernel.org, > kvm@vger.kernel.org > Cc: uober...@redhat.com, dzic...@redhat.com, a...@linux-foundation.org, > mi...@redhat.com > Sent: Thursday, July 24, 2014 12:46:11 PM > Subject: Re: [PATCH 2/3] watchdog: control hard lockup detection default > >Il 24/07/2014 12:13, Andrew Jones ha scritto: >> >> The running kernel still has the ability to enable/disable at any >> time with /proc/sys/kernel/nmi_watchdog us usual. However even >> when the default has been overridden /proc/sys/kernel/nmi_watchdog >> will initially show '1'. To truly turn it on one must disable/enable >> it, i.e. >> echo 0 > /proc/sys/kernel/nmi_watchdog >> echo 1 > /proc/sys/kernel/nmi_watchdog > > Why is it hard to make this show the right value? :) > > Paolo 'echo 1 > /proc/sys/kernel/nmi_watchdog' enables both - hard lockup and soft lockup detection. watchdog_enable_all_cpus() starts a 'watchdog/N' thread for each CPU. If the kernel runs on a bare metal system where the processor does not have a PMU, or when perf_event_create_kernel_counter() returns failure to watchdog_nmi_enable(), or when the kernel runs as a guest on a hypervisor that does not emulate a PMU, then the 'watchdog/N' threads are still active for soft lockup detection. Patch 2/3 essentially makes watchdog_nmi_enable() behave in the same way as if -ENOENT would have been returned by perf_event_create_kernel_counter(). This is then reported via a console message. NMI watchdog: disabled (cpu0): hardware events not enabled It's hard say what _is_ 'the right value' (because lockup detection is then enabled 'partially'), regardless of whether patch 2/3 is applied or not. Regards, Uli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/1] KVM: x86: improve the usability of the 'kvm_pio' tracepoint
> From: "Xiao Guangrong" > To: "Ulrich Obergfell" , kvm@vger.kernel.org > Cc: pbonz...@redhat.com > Sent: Monday, May 5, 2014 9:10:19 AM > Subject: Re: [PATCH 0/1] KVM: x86: improve the usability of the 'kvm_pio' > tracepoint > > On 05/02/2014 11:57 PM, Ulrich Obergfell wrote: >> The current implementation of the 'kvm_pio' tracepoint in >> emulator_pio_in_out() >> only tells us that 'something' has been read from or written to an I/O port. >> To >> improve the usability of the tracepoint, I propose to include the >> value/content >> that has been read or written in the trace output. The proposed patch aims at >> the more common case where a single 8-bit or 16-bit or 32-bit value has been >> read or written -- it does not fully cover the case where 'count' is greater >> than one. >> >> This is an example of what the patch can do (trace of PCI config space >> access). >> >> - on the host >> >># trace-cmd record -e kvm:kvm_pio -f "(port >= 0xcf8) && (port <= 0xcff)" >>/sys/kernel/debug/tracing/events/kvm/kvm_pio/filter >>Hit Ctrl^C to stop recording >> >> - in a Linux guest >> >># dd if=/sys/bus/pci/devices/:00:06.0/config bs=2 count=4 | hexdump >>4+0 records in >>4+0 records out >>8 bytes (8 B) copied, 0.000114056 s, 70.1 kB/s >>000 1af4 1001 0507 0010 >>008 >> >> - on the host >> >># trace-cmd report >>... >>qemu-kvm-23216 [001] 15211.994089: kvm_pio: pio_write >>at 0xcf8 size 4 count 1 val 0x80003000 >>qemu-kvm-23216 [001] 15211.994108: kvm_pio: pio_read >>at 0xcfc size 2 count 1 val 0x1af4 >>qemu-kvm-23216 [001] 15211.994129: kvm_pio: pio_write >>at 0xcf8 size 4 count 1 val 0x80003000 >>qemu-kvm-23216 [001] 15211.994136: kvm_pio: pio_read >>at 0xcfe size 2 count 1 val 0x1001 >>qemu-kvm-23216 [001] 15211.994143: kvm_pio: pio_write >>at 0xcf8 size 4 count 1 val 0x80003004 >>qemu-kvm-23216 [001] 15211.994150: kvm_pio: pio_read >>at 0xcfc size 2 count 1 val 0x507 >>qemu-kvm-23216 [001] 15211.994155: kvm_pio: pio_write >>at 0xcf8 size 4 count 1 val 0x80003004 >>qemu-kvm-23216 [001] 15211.994161: kvm_pio: pio_read >>at 0xcfe size 2 count 1 val 0x10 >> > > Nice. > > Could please check "perf kvm stat" to see if "--event=ioport" > can work after your patch? > > Reviewed-by: Xiao Guangrong I've run a quick test with a local kernel - built from 3.15.0-rc1 source, including the proposed patch - in combination with the 'perf' package that is installed on my test machine. I didn't build a new 'perf' binary from 3.15.0-rc1 source. The following output of the 'perf kvm stat live --event=ioport -d 10' command looks reasonable. 17:10:29.036811 Analyze events for all VMs, all VCPUs: IO Port AccessSamples Samples% Time% Min Time Max Time Avg time 0x177:PIN 3520.00%15.40%1us3us 1.68us ( +- 8.63% ) 0x376:PIN 3017.14%16.37%1us6us 2.08us ( +- 17.15% ) 0x170:POUT 15 8.57%18.99%2us9us 4.83us ( +- 14.34% ) 0xc0ea:POUT 10 5.71% 6.57%2us2us 2.51us ( +- 5.06% ) 0xc0ea:PIN 10 5.71% 6.21%1us6us 2.37us ( +- 23.18% ) 0x176:POUT 10 5.71% 6.69%1us3us 2.55us ( +- 7.59% ) 0x170:PIN 5 2.86% 3.36%2us2us 2.56us ( +- 1.17% ) 0x171:PIN 5 2.86% 1.47%1us1us 1.12us ( +- 0.37% ) 0x171:POUT 5 2.86% 3.26%2us2us 2.49us ( +- 2.25% ) 0x172:PIN 5 2.86% 1.45%1us1us 1.11us ( +- 0.24% ) 0x172:POUT 5 2.86% 2.67%1us2us 2.04us ( +- 3.00% ) 0x173:PIN 5 2.86% 1.46%1us1us 1.11us ( +- 0.29% ) 0x173:POUT 5 2.86% 2.60%1us2us 1.99us ( +- 2.96% ) 0x174:PIN 5 2.86% 1.45%1us1us 1.11us ( +- 0.16% ) 0x174:POUT 5
[PATCH 0/1] KVM: x86: improve the usability of the 'kvm_pio' tracepoint
The current implementation of the 'kvm_pio' tracepoint in emulator_pio_in_out() only tells us that 'something' has been read from or written to an I/O port. To improve the usability of the tracepoint, I propose to include the value/content that has been read or written in the trace output. The proposed patch aims at the more common case where a single 8-bit or 16-bit or 32-bit value has been read or written -- it does not fully cover the case where 'count' is greater than one. This is an example of what the patch can do (trace of PCI config space access). - on the host # trace-cmd record -e kvm:kvm_pio -f "(port >= 0xcf8) && (port <= 0xcff)" /sys/kernel/debug/tracing/events/kvm/kvm_pio/filter Hit Ctrl^C to stop recording - in a Linux guest # dd if=/sys/bus/pci/devices/:00:06.0/config bs=2 count=4 | hexdump 4+0 records in 4+0 records out 8 bytes (8 B) copied, 0.000114056 s, 70.1 kB/s 000 1af4 1001 0507 0010 008 - on the host # trace-cmd report ... qemu-kvm-23216 [001] 15211.994089: kvm_pio: pio_write at 0xcf8 size 4 count 1 val 0x80003000 qemu-kvm-23216 [001] 15211.994108: kvm_pio: pio_read at 0xcfc size 2 count 1 val 0x1af4 qemu-kvm-23216 [001] 15211.994129: kvm_pio: pio_write at 0xcf8 size 4 count 1 val 0x80003000 qemu-kvm-23216 [001] 15211.994136: kvm_pio: pio_read at 0xcfe size 2 count 1 val 0x1001 qemu-kvm-23216 [001] 15211.994143: kvm_pio: pio_write at 0xcf8 size 4 count 1 val 0x80003004 qemu-kvm-23216 [001] 15211.994150: kvm_pio: pio_read at 0xcfc size 2 count 1 val 0x507 qemu-kvm-23216 [001] 15211.994155: kvm_pio: pio_write at 0xcf8 size 4 count 1 val 0x80003004 qemu-kvm-23216 [001] 15211.994161: kvm_pio: pio_read at 0xcfe size 2 count 1 val 0x10 Ulrich Obergfell (1): improve the usability of the 'kvm_pio' tracepoint arch/x86/kvm/trace.h | 20 arch/x86/kvm/x86.c | 4 ++-- 2 files changed, 18 insertions(+), 6 deletions(-) -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/1] KVM: x86: improve the usability of the 'kvm_pio' tracepoint
This patch moves the 'kvm_pio' tracepoint to emulator_pio_in_emulated() and emulator_pio_out_emulated(), and it adds an argument (a pointer to the 'pio_data'). A single 8-bit or 16-bit or 32-bit data item is fetched from 'pio_data' (depending on 'size'), and the value is included in the trace record ('val'). If 'count' is greater than one, this is indicated by the string "(...)" in the trace output. Signed-off-by: Ulrich Obergfell --- arch/x86/kvm/trace.h | 20 arch/x86/kvm/x86.c | 4 ++-- 2 files changed, 18 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index 545245d..33574c9 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -91,16 +91,21 @@ TRACE_EVENT(kvm_hv_hypercall, /* * Tracepoint for PIO. */ + +#define KVM_PIO_IN 0 +#define KVM_PIO_OUT 1 + TRACE_EVENT(kvm_pio, TP_PROTO(unsigned int rw, unsigned int port, unsigned int size, -unsigned int count), - TP_ARGS(rw, port, size, count), +unsigned int count, void *data), + TP_ARGS(rw, port, size, count, data), TP_STRUCT__entry( __field(unsigned int, rw ) __field(unsigned int, port) __field(unsigned int, size) __field(unsigned int, count ) + __field(unsigned int, val ) ), TP_fast_assign( @@ -108,11 +113,18 @@ TRACE_EVENT(kvm_pio, __entry->port = port; __entry->size = size; __entry->count = count; + if (size == 1) + __entry->val= *(unsigned char *)data; + else if (size == 2) + __entry->val= *(unsigned short *)data; + else + __entry->val= *(unsigned int *)data; ), - TP_printk("pio_%s at 0x%x size %d count %d", + TP_printk("pio_%s at 0x%x size %d count %d val 0x%x %s", __entry->rw ? "write" : "read", - __entry->port, __entry->size, __entry->count) + __entry->port, __entry->size, __entry->count, __entry->val, + __entry->count > 1 ? "(...)" : "") ); /* diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8b8fc0b..cd79707 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4486,8 +4486,6 @@ static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size, unsigned short port, void *val, unsigned int count, bool in) { - trace_kvm_pio(!in, port, size, count); - vcpu->arch.pio.port = port; vcpu->arch.pio.in = in; vcpu->arch.pio.count = count; @@ -4522,6 +4520,7 @@ static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt, if (ret) { data_avail: memcpy(val, vcpu->arch.pio_data, size * count); + trace_kvm_pio(KVM_PIO_IN, port, size, count, vcpu->arch.pio_data); vcpu->arch.pio.count = 0; return 1; } @@ -4536,6 +4535,7 @@ static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt, struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); memcpy(vcpu->arch.pio_data, val, size * count); + trace_kvm_pio(KVM_PIO_OUT, port, size, count, vcpu->arch.pio_data); return emulator_pio_in_out(vcpu, size, port, (void *)val, count, false); } -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts
Loss of periodic timer interrupts caused by delayed callbacks and by interrupt coalescing is compensated by gradually injecting additional interrupts during subsequent timer intervals, starting at a rate of one additional interrupt per interval. The injection of additional interrupts is based on a backlog of unaccounted HPET clock periods (new HPETTimer field 'ticks_not_accounted'). The backlog increases due to delayed callbacks and coalesced interrupts, and it decreases if an interrupt was injected successfully. If the backlog increases while compensation is still in progress, the rate at which additional interrupts are injected is increased too. A limit is imposed on the backlog and on the rate. Injecting additional timer interrupts to compensate lost interrupts can alleviate long term time drift. However, on a short time scale, this method can have the side effect of making virtual machine time intermittently pass slower and faster than real time (depending on the guest's time keeping algorithm). Compensation is disabled by default and can be enabled for guests where this behaviour may be acceptable. Signed-off-by: Ulrich Obergfell --- hw/hpet.c | 120 +++- 1 files changed, 118 insertions(+), 2 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index 0428290..bc2a21a 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -31,6 +31,7 @@ #include "hpet_emul.h" #include "sysbus.h" #include "mc146818rtc.h" +#include //#define HPET_DEBUG #ifdef HPET_DEBUG @@ -41,6 +42,9 @@ #define HPET_MSI_SUPPORT0 +#define MAX_TICKS_NOT_ACCOUNTED (uint64_t)5 /* 5 sec */ +#define MAX_IRQ_RATE(uint32_t)10 + struct HPETState; typedef struct HPETTimer { /* timers */ uint8_t tn; /*timer number*/ @@ -334,13 +338,68 @@ static const VMStateDescription vmstate_hpet = { }; /* + * This function resets the driftfix state in the following situations. + * + * - When the guest o/s changes the 'CFG_ENABLE' bit (overall enable) + * in the General Configuration Register from 0 to 1. + * + * - When the guest o/s changes the 'TN_ENABLE' bit (timer N interrupt enable) + * in the Timer N Configuration and Capabilities Register from 0 to 1. + */ +static void hpet_timer_driftfix_reset(HPETTimer *t) +{ +if (t->state->driftfix && timer_is_periodic(t)) { +t->ticks_not_accounted = t->prev_period = t->period; +t->irq_rate = 1; +t->divisor = 1; +} +} + +/* + * This function determines whether there is a backlog of ticks for which + * no interrupts have been delivered to the guest o/s yet. If the backlog + * is equal to or greater than the current period length, then additional + * interrupts will be delivered to the guest o/s inside of the subsequent + * period interval to compensate missed interrupts. + * + * 'ticks_not_accounted' increases by 'N * period' when the comparator is + * being advanced, and it decreases by 'prev_period' when an interrupt is + * delivered to the guest o/s. Normally 'prev_period' is equal to 'period' + * and 'N' is 1. 'prev_period' is different from 'period' if a guest o/s + * has changed the comparator value during the previous period interval. + * 'N' is greater than 1 if the callback was delayed by 'N - 1' periods, + * and 'N' is zero while additional interrupts are delivered inside of an + * interval. + * + * This function is called after the comparator has been advanced but before + * the interrupt is delivered to the guest o/s. Hence, 'ticks_not_accounted' + * is equal to 'prev_period' plus 'period' if there is no backlog. + */ +static bool hpet_timer_has_tick_backlog(HPETTimer *t) +{ +uint64_t backlog = 0; + +if (t->ticks_not_accounted >= t->period + t->prev_period) { +backlog = t->ticks_not_accounted - (t->period + t->prev_period); +} +return (backlog >= t->period); +} + +/* * timer expiration callback */ static void hpet_timer(void *opaque) { HPETTimer *t = opaque; +HPETState *s = t->state; uint64_t diff; - +int irq_delivered = 0; +uint32_t period_count = 0; /* elapsed periods since last callback + * 1: normal case + * > 1: missed 'period_count - 1' interrupts + * due to delayed callback + * 0: callback inside of an interval + * to deliver additional interrupts */ uint64_t period = t->period; uint64_t cur_tick = hpet_get_ticks(t->state); @@ -348,13 +407,48 @@ static void hpet_timer(void *opaque) if (t->config & HPET_TN_32BIT) {
[PATCH v5 2/5] hpet 'driftfix': add driftfix property to HPETState and DeviceInfo
driftfix is a 'bit type' property. Compensation of delayed callbacks and coalesced interrupts can be enabled with the command line option -global hpet.driftfix=on driftfix is 'off' (disabled) by default. Signed-off-by: Ulrich Obergfell --- hw/hpet.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index 6ce07bc..7513065 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -72,6 +72,8 @@ typedef struct HPETState { uint64_t isr; /* interrupt status reg */ uint64_t hpet_counter; /* main counter */ uint8_t hpet_id; /* instance id */ + +uint32_t driftfix; } HPETState; static uint32_t hpet_in_legacy_mode(HPETState *s) @@ -738,6 +740,7 @@ static SysBusDeviceInfo hpet_device_info = { .qdev.props = (Property[]) { DEFINE_PROP_UINT8("timers", HPETState, num_timers, HPET_MIN_TIMERS), DEFINE_PROP_BIT("msi", HPETState, flags, HPET_MSI_SUPPORT, false), +DEFINE_PROP_BIT("driftfix", HPETState, driftfix, 0, false), DEFINE_PROP_END_OF_LIST(), }, }; -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 4/5] hpet 'driftfix': add code in update_irq() to detect coalesced interrupts (x86 apic only)
update_irq() uses a similar method as in 'rtc_td_hack' to detect coalesced interrupts. The function entry addresses are retrieved from 'target_get_irq_delivered' and 'target_reset_irq_delivered'. This change can be replaced if a generic feedback infrastructure to track coalesced IRQs for periodic, clock providing devices becomes available. Signed-off-by: Ulrich Obergfell --- hw/hpet.c | 13 +++-- 1 files changed, 11 insertions(+), 2 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index dba9370..0428290 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -184,11 +184,12 @@ static inline uint64_t hpet_calculate_diff(HPETTimer *t, uint64_t current) } } -static void update_irq(struct HPETTimer *timer, int set) +static int update_irq(struct HPETTimer *timer, int set) { uint64_t mask; HPETState *s; int route; +int irq_delivered = 1; if (timer->tn <= 1 && hpet_in_legacy_mode(timer->state)) { /* if LegacyReplacementRoute bit is set, HPET specification requires @@ -213,8 +214,16 @@ static void update_irq(struct HPETTimer *timer, int set) qemu_irq_raise(s->irqs[route]); } else { s->isr &= ~mask; -qemu_irq_pulse(s->irqs[route]); +if (s->driftfix) { +target_reset_irq_delivered(); +qemu_irq_raise(s->irqs[route]); +irq_delivered = target_get_irq_delivered(); +qemu_irq_lower(s->irqs[route]); +} else { +qemu_irq_pulse(s->irqs[route]); +} } +return irq_delivered; } static void hpet_pre_save(void *opaque) -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 0/5] hpet 'driftfix': alleviate time drift with HPET periodic timers
Hi, This is version 5 of a series of patches that I originally posted in: http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01989.html http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01992.html http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01991.html http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01990.html http://article.gmane.org/gmane.comp.emulators.kvm.devel/69325 http://article.gmane.org/gmane.comp.emulators.kvm.devel/69326 http://article.gmane.org/gmane.comp.emulators.kvm.devel/69327 http://article.gmane.org/gmane.comp.emulators.kvm.devel/69328 Changes since version 4: Added comments to patch part 3 and part 5. No changes in the actual code. Please review and please comment. Regards, Uli Ulrich Obergfell (5): hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only) hpet 'driftfix': add driftfix property to HPETState and DeviceInfo hpet 'driftfix': add fields to HPETTimer and VMStateDescription hpet 'driftfix': add code in update_irq() to detect coalesced interrupts (x86 apic only) hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts hw/apic.c |4 ++ hw/hpet.c | 178 +++-- hw/pc.h | 13 + vl.c | 13 + 4 files changed, 204 insertions(+), 4 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription
The new fields in HPETTimer are covered by a separate VMStateDescription which is a subsection of 'vmstate_hpet_timer'. They are only migrated if -global hpet.driftfix=on Signed-off-by: Ulrich Obergfell --- hw/hpet.c | 42 ++ 1 files changed, 42 insertions(+), 0 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index 7513065..dba9370 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -55,6 +55,19 @@ typedef struct HPETTimer { /* timers */ uint8_t wrap_flag; /* timer pop will indicate wrap for one-shot 32-bit * mode. Next pop will be actual timer expiration. */ +/* driftfix state */ +uint64_t prev_period;/* needed when the guest o/s changes the + * comparator value */ +uint64_t ticks_not_accounted;/* ticks for which no interrupts have been + * delivered to the guest o/s yet */ +uint32_t irq_rate; /* rate at which interrupts are delivered + * to the guest o/s during one period + * interval; if rate is greater than 1, + * additional interrupts are delivered + * to compensate missed interrupts */ +uint32_t divisor;/* needed to determine when the next + * timer callback should occur while + * rate is greater than 1 */ } HPETTimer; typedef struct HPETState { @@ -246,6 +259,27 @@ static int hpet_post_load(void *opaque, int version_id) return 0; } +static bool hpet_timer_driftfix_vmstate_needed(void *opaque) +{ +HPETTimer *t = opaque; + +return (t->state->driftfix != 0); +} + +static const VMStateDescription vmstate_hpet_timer_driftfix = { +.name = "hpet_timer_driftfix", +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.fields = (VMStateField []) { +VMSTATE_UINT64(prev_period, HPETTimer), +VMSTATE_UINT64(ticks_not_accounted, HPETTimer), +VMSTATE_UINT32(irq_rate, HPETTimer), +VMSTATE_UINT32(divisor, HPETTimer), +VMSTATE_END_OF_LIST() +} +}; + static const VMStateDescription vmstate_hpet_timer = { .name = "hpet_timer", .version_id = 1, @@ -260,6 +294,14 @@ static const VMStateDescription vmstate_hpet_timer = { VMSTATE_UINT8(wrap_flag, HPETTimer), VMSTATE_TIMER(qemu_timer, HPETTimer), VMSTATE_END_OF_LIST() +}, +.subsections = (VMStateSubsection []) { +{ +.vmsd = &vmstate_hpet_timer_driftfix, +.needed = hpet_timer_driftfix_vmstate_needed, +}, { +/* empty */ +} } }; -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5 1/5] hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only)
'target_get_irq_delivered' and 'target_reset_irq_delivered' point to functions that are called by update_irq() to detect coalesced interrupts. Initially they point to stub functions which pretend successful interrupt injection. apic code calls two registration functions to replace the stubs with apic_get_irq_delivered() and apic_reset_irq_delivered(). This change can be replaced if a generic feedback infrastructure to track coalesced IRQs for periodic, clock providing devices becomes available. Signed-off-by: Ulrich Obergfell --- hw/apic.c |4 hw/pc.h | 13 + vl.c | 13 + 3 files changed, 30 insertions(+), 0 deletions(-) diff --git a/hw/apic.c b/hw/apic.c index a45b57f..94b1d15 100644 --- a/hw/apic.c +++ b/hw/apic.c @@ -17,6 +17,7 @@ * License along with this library; if not, see <http://www.gnu.org/licenses/> */ #include "hw.h" +#include "pc.h" #include "apic.h" #include "ioapic.h" #include "qemu-timer.h" @@ -1143,6 +1144,9 @@ static SysBusDeviceInfo apic_info = { static void apic_register_devices(void) { +register_target_get_irq_delivered(apic_get_irq_delivered); +register_target_reset_irq_delivered(apic_reset_irq_delivered); + sysbus_register_withprop(&apic_info); } diff --git a/hw/pc.h b/hw/pc.h index bc8fcec..7511f28 100644 --- a/hw/pc.h +++ b/hw/pc.h @@ -7,6 +7,19 @@ #include "fdc.h" #include "net.h" +extern int (*target_get_irq_delivered)(void); +extern void (*target_reset_irq_delivered)(void); + +static inline void register_target_get_irq_delivered(int (*func)(void)) +{ +target_get_irq_delivered = func; +} + +static inline void register_target_reset_irq_delivered(void (*func)(void)) +{ +target_reset_irq_delivered = func; +} + /* PC-style peripherals (also used by other machines). */ /* serial.c */ diff --git a/vl.c b/vl.c index 73e147f..456e320 100644 --- a/vl.c +++ b/vl.c @@ -232,6 +232,19 @@ const char *prom_envs[MAX_PROM_ENVS]; const char *nvram = NULL; int boot_menu; +static int target_get_irq_delivered_stub(void) +{ +return 1; +} + +static void target_reset_irq_delivered_stub(void) +{ +return; +} + +int (*target_get_irq_delivered)(void) = target_get_irq_delivered_stub; +void (*target_reset_irq_delivered)(void) = target_reset_irq_delivered_stub; + typedef struct FWBootEntry FWBootEntry; struct FWBootEntry { -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH v4 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts
Hi Zachary, 1. re.: >> +static void hpet_timer_driftfix_reset(HPETTimer *t) >> +{ >> +if (t->state->driftfix&& timer_is_periodic(t)) { >> +t->ticks_not_accounted = t->prev_period = t->period; >> > > This is rather confusing. Clearly, ticks_not_accounted isn't actually > ticks not accounted, it's a different variable entirely which is based > on the current period. > > If it were actually ticks_not_accounted, it should be reset to zero. hpet_timer_driftfix_reset() is called at two sites before the periodic timer is actually started via hpet_set_timer(). An interrupt shall be delivered at the first callback of hpet_timer(), after t->period amount of time has passed. The ticks are recorded as 'not accounted' until the interrupt is delivered to the guest. The following message explains why t->prev_period is needed in addition to t->period and how it is used. http://lists.gnu.org/archive/html/qemu-devel/2011-05/msg00275.html On entry to hpet_timer(), t->ticks_not_accounted is >= t->prev_period, and is it increased by t->period while the comparator register is being advanced. This is also the place where we can detect whether the current callback was delayed. If the period_count is > 1, the delay was so large that we missed to inject an interrupt (which needs to be compensated). while (hpet_time_after(cur_tick, t->cmp)) { t->cmp = (uint32_t)(t->cmp + t->period); +t->ticks_not_accounted += t->period; +period_count++; } Please note that period_count can be zero. This happens during callbacks that inject additional interrupts 'inside of' a period interval in order to compensate missed and coalesced interrupts. t->ticks_not_accounted is decreased only if an interrupt was delivered to the guest. If an interrupt could not be delivered, the ticks that are represented by that interrupt remain recorded as 'not accounted' (this also triggers compensation of coalesced interrupts). +if (irq_delivered) { +t->ticks_not_accounted -= t->prev_period; +t->prev_period = t->period; +} else { +if (period_count) { +t->irq_rate++; +t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE); +} +} 2. re.: >> +if (period_count) { >> +t->divisor = t->irq_rate; >> +} >> +diff /= t->divisor--; >> > > Why subtracting from the divisor? Shouldn't the divisor and irq_rate > change in lockstep? t->irq_rate is the rate at which interrupts are delivered to the guest, relative to the period length. If t->irq_rate is 1, then one interrupt shall be injected per period interval. If t->irq_rate is > 1, we are in 'compensation mode' (trying to inject additional interrupts 'inside of' an interval). - If t->irq_rate is 2, then two interrupts shall be injected during a period interval (one regular and one additional). - If t->irq_rate is 3, then three interrupts shall be injected during a period interval (one regular and two additional). - etc. A non-zero period_count marks the start of an interval, at which the divisor is set to t->irq_rate. Let's take a look at an example where t->divisor = t->irq_rate = 3. - The current period starts at t(p), the next period starts at t(p+1). We are now at t(p). The first additional interrupt shall be injected at a(1), the second at a(2). Hence, the next callback is scheduled to occur at a(1) = t(p) + diff / 3. t(p) a(1) a(2) t(p+1) ++++ time <--- diff ---> < diff / 3 > - We are now in the callback at a(1). A new diff has been calculated, which is equal to the remaining time in the interval from a(1) to t(p+1). The second additional interrupt shall be injected at a(2). Hence, the next callback is scheduled to occur at a(2) = a(1) + diff / 2. a(1) a(2) t(p+1) +++ time <-- diff ---> < diff / 2 > - We are now in the callback at a(2). A new diff has been calculated, which is equal to the remaining time in the interval from a(2) to t(p+1). The next callback marks the beginning of period t(p+1). a(2) t(p+1) ++ time <-- diff --> < diff / 1 > At t(p), the divisor is set to irq_rate (= 3). diff is divided by 3 and the divisor is decremented by one. At a(1), the divisor is 2. diff is divided by 2 and
[PATCH v4 2/5] hpet 'driftfix': add driftfix property to HPETState and DeviceInfo
driftfix is a 'bit type' property. Compensation of delayed callbacks and coalesced interrupts can be enabled with the command line option -global hpet.driftfix=on driftfix is 'off' (disabled) by default. Signed-off-by: Ulrich Obergfell --- hw/hpet.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index 6ce07bc..7513065 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -72,6 +72,8 @@ typedef struct HPETState { uint64_t isr; /* interrupt status reg */ uint64_t hpet_counter; /* main counter */ uint8_t hpet_id; /* instance id */ + +uint32_t driftfix; } HPETState; static uint32_t hpet_in_legacy_mode(HPETState *s) @@ -738,6 +740,7 @@ static SysBusDeviceInfo hpet_device_info = { .qdev.props = (Property[]) { DEFINE_PROP_UINT8("timers", HPETState, num_timers, HPET_MIN_TIMERS), DEFINE_PROP_BIT("msi", HPETState, flags, HPET_MSI_SUPPORT, false), +DEFINE_PROP_BIT("driftfix", HPETState, driftfix, 0, false), DEFINE_PROP_END_OF_LIST(), }, }; -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts
Loss of periodic timer interrupts caused by delayed callbacks and by interrupt coalescing is compensated by gradually injecting additional interrupts during subsequent timer intervals, starting at a rate of one additional interrupt per interval. The injection of additional interrupts is based on a backlog of unaccounted HPET clock periods (new HPETTimer field 'ticks_not_accounted'). The backlog increases due to delayed callbacks and coalesced interrupts, and it decreases if an interrupt was injected successfully. If the backlog increases while compensation is still in progress, the rate at which additional interrupts are injected is increased too. A limit is imposed on the backlog and on the rate. Injecting additional timer interrupts to compensate lost interrupts can alleviate long term time drift. However, on a short time scale, this method can have the side effect of making virtual machine time intermittently pass slower and faster than real time (depending on the guest's time keeping algorithm). Compensation is disabled by default and can be enabled for guests where this behaviour may be acceptable. Signed-off-by: Ulrich Obergfell --- hw/hpet.c | 70 +++- 1 files changed, 68 insertions(+), 2 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index e57c654..519fc6b 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -31,6 +31,7 @@ #include "hpet_emul.h" #include "sysbus.h" #include "mc146818rtc.h" +#include //#define HPET_DEBUG #ifdef HPET_DEBUG @@ -41,6 +42,9 @@ #define HPET_MSI_SUPPORT0 +#define MAX_TICKS_NOT_ACCOUNTED (uint64_t)5 /* 5 sec */ +#define MAX_IRQ_RATE(uint32_t)10 + struct HPETState; typedef struct HPETTimer { /* timers */ uint8_t tn; /*timer number*/ @@ -324,14 +328,35 @@ static const VMStateDescription vmstate_hpet = { } }; +static void hpet_timer_driftfix_reset(HPETTimer *t) +{ +if (t->state->driftfix && timer_is_periodic(t)) { +t->ticks_not_accounted = t->prev_period = t->period; +t->irq_rate = 1; +t->divisor = 1; +} +} + +static bool hpet_timer_has_tick_backlog(HPETTimer *t) +{ +uint64_t backlog = 0; + +if (t->ticks_not_accounted >= t->period + t->prev_period) { +backlog = t->ticks_not_accounted - (t->period + t->prev_period); +} +return (backlog >= t->period); +} + /* * timer expiration callback */ static void hpet_timer(void *opaque) { HPETTimer *t = opaque; +HPETState *s = t->state; uint64_t diff; - +int irq_delivered = 0; +uint32_t period_count = 0; uint64_t period = t->period; uint64_t cur_tick = hpet_get_ticks(t->state); @@ -339,13 +364,37 @@ static void hpet_timer(void *opaque) if (t->config & HPET_TN_32BIT) { while (hpet_time_after(cur_tick, t->cmp)) { t->cmp = (uint32_t)(t->cmp + t->period); +t->ticks_not_accounted += t->period; +period_count++; } } else { while (hpet_time_after64(cur_tick, t->cmp)) { t->cmp += period; +t->ticks_not_accounted += period; +period_count++; } } diff = hpet_calculate_diff(t, cur_tick); +if (s->driftfix) { +if (t->ticks_not_accounted > MAX_TICKS_NOT_ACCOUNTED) { +t->ticks_not_accounted = t->period + t->prev_period; +} +if (hpet_timer_has_tick_backlog(t)) { +if (t->irq_rate == 1 || period_count > 1) { +t->irq_rate++; +t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE); +} +if (t->divisor == 0) { +assert(period_count); +} +if (period_count) { +t->divisor = t->irq_rate; +} +diff /= t->divisor--; +} else { +t->irq_rate = 1; +} +} qemu_mod_timer(t->qemu_timer, qemu_get_clock_ns(vm_clock) + (int64_t)ticks_to_ns(diff)); } else if (t->config & HPET_TN_32BIT && !timer_is_periodic(t)) { @@ -356,7 +405,22 @@ static void hpet_timer(void *opaque) t->wrap_flag = 0; } } -update_irq(t, 1); +if (s->driftfix && timer_is_periodic(t) && period != 0) { +if (t->ticks_not_accounted >= t->period + t->prev_period) { +irq_delivered = update_irq(t, 1); +if (irq_delivered) { +t->ticks_not_accounted -= t->prev_period; +t->prev_period = t->period; +} else { +if (p
[PATCH v4 4/5] hpet 'driftfix': add code in update_irq() to detect coalesced interrupts (x86 apic only)
update_irq() uses a similar method as in 'rtc_td_hack' to detect coalesced interrupts. The function entry addresses are retrieved from 'target_get_irq_delivered' and 'target_reset_irq_delivered'. This change can be replaced if a generic feedback infrastructure to track coalesced IRQs for periodic, clock providing devices becomes available. Signed-off-by: Ulrich Obergfell --- hw/hpet.c | 13 +++-- 1 files changed, 11 insertions(+), 2 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index 7ab6e62..e57c654 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -175,11 +175,12 @@ static inline uint64_t hpet_calculate_diff(HPETTimer *t, uint64_t current) } } -static void update_irq(struct HPETTimer *timer, int set) +static int update_irq(struct HPETTimer *timer, int set) { uint64_t mask; HPETState *s; int route; +int irq_delivered = 1; if (timer->tn <= 1 && hpet_in_legacy_mode(timer->state)) { /* if LegacyReplacementRoute bit is set, HPET specification requires @@ -204,8 +205,16 @@ static void update_irq(struct HPETTimer *timer, int set) qemu_irq_raise(s->irqs[route]); } else { s->isr &= ~mask; -qemu_irq_pulse(s->irqs[route]); +if (s->driftfix) { +target_reset_irq_delivered(); +qemu_irq_raise(s->irqs[route]); +irq_delivered = target_get_irq_delivered(); +qemu_irq_lower(s->irqs[route]); +} else { +qemu_irq_pulse(s->irqs[route]); +} } +return irq_delivered; } static void hpet_pre_save(void *opaque) -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription
The new fields in HPETTimer are covered by a separate VMStateDescription which is a subsection of 'vmstate_hpet_timer'. They are only migrated if -global hpet.driftfix=on Signed-off-by: Ulrich Obergfell --- hw/hpet.c | 33 + 1 files changed, 33 insertions(+), 0 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index 7513065..7ab6e62 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -55,6 +55,10 @@ typedef struct HPETTimer { /* timers */ uint8_t wrap_flag; /* timer pop will indicate wrap for one-shot 32-bit * mode. Next pop will be actual timer expiration. */ +uint64_t prev_period; +uint64_t ticks_not_accounted; +uint32_t irq_rate; +uint32_t divisor; } HPETTimer; typedef struct HPETState { @@ -246,6 +250,27 @@ static int hpet_post_load(void *opaque, int version_id) return 0; } +static bool hpet_timer_driftfix_vmstate_needed(void *opaque) +{ +HPETTimer *t = opaque; + +return (t->state->driftfix != 0); +} + +static const VMStateDescription vmstate_hpet_timer_driftfix = { +.name = "hpet_timer_driftfix", +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.fields = (VMStateField []) { +VMSTATE_UINT64(prev_period, HPETTimer), +VMSTATE_UINT64(ticks_not_accounted, HPETTimer), +VMSTATE_UINT32(irq_rate, HPETTimer), +VMSTATE_UINT32(divisor, HPETTimer), +VMSTATE_END_OF_LIST() +} +}; + static const VMStateDescription vmstate_hpet_timer = { .name = "hpet_timer", .version_id = 1, @@ -260,6 +285,14 @@ static const VMStateDescription vmstate_hpet_timer = { VMSTATE_UINT8(wrap_flag, HPETTimer), VMSTATE_TIMER(qemu_timer, HPETTimer), VMSTATE_END_OF_LIST() +}, +.subsections = (VMStateSubsection []) { +{ +.vmsd = &vmstate_hpet_timer_driftfix, +.needed = hpet_timer_driftfix_vmstate_needed, +}, { +/* empty */ +} } }; -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 1/5] hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only)
'target_get_irq_delivered' and 'target_reset_irq_delivered' point to functions that are called by update_irq() to detect coalesced interrupts. Initially they point to stub functions which pretend successful interrupt injection. apic code calls two registration functions to replace the stubs with apic_get_irq_delivered() and apic_reset_irq_delivered(). This change can be replaced if a generic feedback infrastructure to track coalesced IRQs for periodic, clock providing devices becomes available. Signed-off-by: Ulrich Obergfell --- hw/apic.c |4 hw/pc.h | 13 + vl.c | 13 + 3 files changed, 30 insertions(+), 0 deletions(-) diff --git a/hw/apic.c b/hw/apic.c index a45b57f..94b1d15 100644 --- a/hw/apic.c +++ b/hw/apic.c @@ -17,6 +17,7 @@ * License along with this library; if not, see <http://www.gnu.org/licenses/> */ #include "hw.h" +#include "pc.h" #include "apic.h" #include "ioapic.h" #include "qemu-timer.h" @@ -1143,6 +1144,9 @@ static SysBusDeviceInfo apic_info = { static void apic_register_devices(void) { +register_target_get_irq_delivered(apic_get_irq_delivered); +register_target_reset_irq_delivered(apic_reset_irq_delivered); + sysbus_register_withprop(&apic_info); } diff --git a/hw/pc.h b/hw/pc.h index 1291e2d..79885f4 100644 --- a/hw/pc.h +++ b/hw/pc.h @@ -7,6 +7,19 @@ #include "fdc.h" #include "net.h" +extern int (*target_get_irq_delivered)(void); +extern void (*target_reset_irq_delivered)(void); + +static inline void register_target_get_irq_delivered(int (*func)(void)) +{ +target_get_irq_delivered = func; +} + +static inline void register_target_reset_irq_delivered(void (*func)(void)) +{ +target_reset_irq_delivered = func; +} + /* PC-style peripherals (also used by other machines). */ /* serial.c */ diff --git a/vl.c b/vl.c index a143250..a2bbc61 100644 --- a/vl.c +++ b/vl.c @@ -233,6 +233,19 @@ const char *prom_envs[MAX_PROM_ENVS]; const char *nvram = NULL; int boot_menu; +static int target_get_irq_delivered_stub(void) +{ +return 1; +} + +static void target_reset_irq_delivered_stub(void) +{ +return; +} + +int (*target_get_irq_delivered)(void) = target_get_irq_delivered_stub; +void (*target_reset_irq_delivered)(void) = target_reset_irq_delivered_stub; + typedef struct FWBootEntry FWBootEntry; struct FWBootEntry { -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 0/5] hpet 'driftfix': alleviate time drift with HPET periodic timers
Hi, This is version 4 of a series of patches that I originally posted in: http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01989.html http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01992.html http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01991.html http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01990.html http://article.gmane.org/gmane.comp.emulators.kvm.devel/69325 http://article.gmane.org/gmane.comp.emulators.kvm.devel/69326 http://article.gmane.org/gmane.comp.emulators.kvm.devel/69327 http://article.gmane.org/gmane.comp.emulators.kvm.devel/69328 Changes since version 3: in patch part 1/5 and part 4/5 - Added stub functions for 'target_reset_irq_delivered' and 'target_get_irq_delivered'. Added registration functions that are used by apic code to replace the stubs. - Removed NULL pointer checks from update_irq(). in patch part 5/5 - A minor modification in hpet_timer_has_tick_backlog(). - Renamed the local variable 'irq_count' in hpet_timer() to 'period_count'. - Driftfix-related fields in struct 'HPETTimer' are no longer being initialized/reset in hpet_reset(). Added the function hpet_timer_driftfix_reset() which is called when the guest . sets the 'CFG_ENABLE' bit (overall enable) in the General Configuration Register. . sets the 'TN_ENABLE' bit (timer N interrupt enable) in the Timer N Configuration and Capabilities Register. Please review and please comment. Regards, Uli Ulrich Obergfell (5): hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only) hpet 'driftfix': add driftfix property to HPETState and DeviceInfo hpet 'driftfix': add fields to HPETTimer and VMStateDescription hpet 'driftfix': add code in update_irq() to detect coalesced interrupts (x86 apic only) hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts hw/apic.c |4 ++ hw/hpet.c | 119 ++-- hw/pc.h | 13 +++ vl.c | 13 +++ 4 files changed, 145 insertions(+), 4 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts
Hi Marcelo, > Other than that, shouldnt reset accounting variables to init state on > write to GLOBAL_ENABLE_CFG / writes to main counter? I'd suggest to initialize/reset the driftfix-related fields in the 'HPETTimer' structure (including the backlog of unaccounted ticks) in the following situations. - When the guest o/s sets the 'CFG_ENABLE' bit (overall enable) in the General Configuration Register. This is the place in hpet_ram_writel(): case HPET_CFG: ... if (activating_bit(old_val, new_val, HPET_CFG_ENABLE)) { ... for (i = 0; i < s->num_timers; i++) { if ((&s->timer[i])->cmp != ~0ULL) { // initialize / reset fields here hpet_set_timer(&s->timer[i]); - When the guest o/s sets the 'TN_ENABLE' bit (timer N interrupt enable) in the Timer N Configuration and Capabilities Register. This is the place in hpet_ram_writel(): case HPET_TN_CFG: ... if (activating_bit(old_val, new_val, HPET_TN_ENABLE)) { // initialize / reset fields here hpet_set_timer(timer); This should cover cases such as ... - first time initialization of HPET & timers during guest o/s boot. (includes kexec and reboot) - if a guest o/s stops and restarts the timer. (for example, to change the comparator register value or to switch a timer between periodic mode and non-periodic mode) Regards, Uli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts
Hi Marcelo, > Whats prev_period for, since in practice the period will not change > between interrupts (OS programs comparator once, or perhaps twice > during bootup) ? 'prev_period' is needed if a guest o/s changes the comparator period 'on the fly' (without stopping and restarting the timer). guest o/s changes period | ti(n-1) |ti(n) ti(n+1) | v | | +-+--+ <--- prev_period ---> <-- period --> The idea is that each timer interrupt represents a certain quantum of time (the comparator period). If a guest o/s changes the period between timer interrupt 'n-1' and timer interrupt 'n', I think the new value should not take effect before timer interrupt 'n'. Timer interrupt 'n' still represents the old/previous quantum, and timer interrupt 'n+1' represents the new quantum. Hence, the patch decrements 'ticks_not_accounted' by 'prev_period' and sets 'prev_period' to 'period' when an interrupt was delivered to the guest o/s. +irq_delivered = update_irq(t, 1); +if (irq_delivered) { +t->ticks_not_accounted -= t->prev_period; +t->prev_period = t->period; +} else { Most of the time 'prev_period' is equal to 'period'. It should only be different in the scenario shown above. Regards, Uli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH v3 1/5] hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only)
> On 2011-04-28 20:51, Blue Swirl wrote: >> On Thu, Apr 28, 2011 at 5:24 PM, Ulrich Obergfell wrote: >>> 'target_get_irq_delivered' and 'target_reset_irq_delivered' contain >>> entry addresses of functions that are utilized by update_irq() to >>> detect coalesced interrupts. apic code loads these pointers during >>> initialization. >> >> I'm not so happy with this approach, but probably then the i386 >> dependencies can be removed from RTC and it can be compiled only once >> for all targets. > > This whole series is really the minimalistic approach. The callbacks > defined here must remain a temporary "shortcut". Just like proper > abstraction of periodic tick compensation for reuse in other timers has > to be added later on. And the limitation to edge-triggered legacy HPET > INTs has to be removed. Since QEMU doesn't have a generic infrastructure to track interrupt delivery, I decided to reuse something that is currently available. The only mechanism that I'm aware of is the one that is utilized by RTC code ('rtc_td_hack'), i.e. apic_get_irq_delivered() etc. The changes that would be introduced by part 1/5 and part 4/5 of this patch series could be replaced if a generic infrastructure to track interrupt delivery becomes available. Regards, Uli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 2/5] hpet 'driftfix': add driftfix property to HPETState and DeviceInfo
driftfix is a 'bit type' property. Compensation of delayed callbacks and coalesced interrupts can be enabled with the command line option -global hpet.driftfix=on driftfix is 'off' (disabled) by default. Signed-off-by: Ulrich Obergfell --- hw/hpet.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index 6ce07bc..7513065 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -72,6 +72,8 @@ typedef struct HPETState { uint64_t isr; /* interrupt status reg */ uint64_t hpet_counter; /* main counter */ uint8_t hpet_id; /* instance id */ + +uint32_t driftfix; } HPETState; static uint32_t hpet_in_legacy_mode(HPETState *s) @@ -738,6 +740,7 @@ static SysBusDeviceInfo hpet_device_info = { .qdev.props = (Property[]) { DEFINE_PROP_UINT8("timers", HPETState, num_timers, HPET_MIN_TIMERS), DEFINE_PROP_BIT("msi", HPETState, flags, HPET_MSI_SUPPORT, false), +DEFINE_PROP_BIT("driftfix", HPETState, driftfix, 0, false), DEFINE_PROP_END_OF_LIST(), }, }; -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 4/5] hpet 'driftfix': add code in update_irq() to detect coalesced interrupts (x86 apic only)
update_irq() uses a similar method as in 'rtc_td_hack' to detect coalesced interrupts. The function entry addresses are retrieved from 'target_get_irq_delivered' and 'target_reset_irq_delivered'. This change can be replaced if a generic feedback infrastructure to track coalesced IRQs for periodic, clock providing devices becomes available. Signed-off-by: Ulrich Obergfell --- hw/hpet.c | 15 +-- 1 files changed, 13 insertions(+), 2 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index 7ab6e62..35466ae 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -31,6 +31,7 @@ #include "hpet_emul.h" #include "sysbus.h" #include "mc146818rtc.h" +#include "sysemu.h" //#define HPET_DEBUG #ifdef HPET_DEBUG @@ -175,11 +176,12 @@ static inline uint64_t hpet_calculate_diff(HPETTimer *t, uint64_t current) } } -static void update_irq(struct HPETTimer *timer, int set) +static int update_irq(struct HPETTimer *timer, int set) { uint64_t mask; HPETState *s; int route; +int irq_delivered = 1; if (timer->tn <= 1 && hpet_in_legacy_mode(timer->state)) { /* if LegacyReplacementRoute bit is set, HPET specification requires @@ -204,8 +206,17 @@ static void update_irq(struct HPETTimer *timer, int set) qemu_irq_raise(s->irqs[route]); } else { s->isr &= ~mask; -qemu_irq_pulse(s->irqs[route]); +if (s->driftfix && target_get_irq_delivered +&& target_reset_irq_delivered) { +target_reset_irq_delivered(); +qemu_irq_raise(s->irqs[route]); +irq_delivered = target_get_irq_delivered(); +qemu_irq_lower(s->irqs[route]); +} else { +qemu_irq_pulse(s->irqs[route]); +} } +return irq_delivered; } static void hpet_pre_save(void *opaque) -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription
The new fields in HPETTimer are covered by a separate VMStateDescription which is a subsection of 'vmstate_hpet_timer'. They are only migrated if -global hpet.driftfix=on Signed-off-by: Ulrich Obergfell --- hw/hpet.c | 33 + 1 files changed, 33 insertions(+), 0 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index 7513065..7ab6e62 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -55,6 +55,10 @@ typedef struct HPETTimer { /* timers */ uint8_t wrap_flag; /* timer pop will indicate wrap for one-shot 32-bit * mode. Next pop will be actual timer expiration. */ +uint64_t prev_period; +uint64_t ticks_not_accounted; +uint32_t irq_rate; +uint32_t divisor; } HPETTimer; typedef struct HPETState { @@ -246,6 +250,27 @@ static int hpet_post_load(void *opaque, int version_id) return 0; } +static bool hpet_timer_driftfix_vmstate_needed(void *opaque) +{ +HPETTimer *t = opaque; + +return (t->state->driftfix != 0); +} + +static const VMStateDescription vmstate_hpet_timer_driftfix = { +.name = "hpet_timer_driftfix", +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.fields = (VMStateField []) { +VMSTATE_UINT64(prev_period, HPETTimer), +VMSTATE_UINT64(ticks_not_accounted, HPETTimer), +VMSTATE_UINT32(irq_rate, HPETTimer), +VMSTATE_UINT32(divisor, HPETTimer), +VMSTATE_END_OF_LIST() +} +}; + static const VMStateDescription vmstate_hpet_timer = { .name = "hpet_timer", .version_id = 1, @@ -260,6 +285,14 @@ static const VMStateDescription vmstate_hpet_timer = { VMSTATE_UINT8(wrap_flag, HPETTimer), VMSTATE_TIMER(qemu_timer, HPETTimer), VMSTATE_END_OF_LIST() +}, +.subsections = (VMStateSubsection []) { +{ +.vmsd = &vmstate_hpet_timer_driftfix, +.needed = hpet_timer_driftfix_vmstate_needed, +}, { +/* empty */ +} } }; -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 0/5] hpet 'driftfix': alleviate time drift with HPET periodic timers
Hi, This is version 3 of a series of patches that I originally posted in: http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01989.html http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01992.html http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01991.html http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01990.html http://article.gmane.org/gmane.comp.emulators.kvm.devel/69325 http://article.gmane.org/gmane.comp.emulators.kvm.devel/69326 http://article.gmane.org/gmane.comp.emulators.kvm.devel/69327 http://article.gmane.org/gmane.comp.emulators.kvm.devel/69328 Changes since version 2: - The vmstate related to 'driftfix' is now in a separate subsection that can be migrated conditionally. - The new fields of the HPETTimer structure are no longer initialized in hpet_init(). This is now done in hpet_reset() only. - Compensation of lost timer interrupts is now based on a backlog of unaccounted HPET clock periods instead of 'irqs_to_inject'. This eliminates the need to scale 'irqs_to_inject' when a guest o/s modifies the comparator register value. Please review and please comment. Regards, Uli Ulrich Obergfell (5): hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only) hpet 'driftfix': add driftfix property to HPETState and DeviceInfo hpet 'driftfix': add fields to HPETTimer and VMStateDescription hpet 'driftfix': add code in update_irq() to detect coalesced interrupts (x86 apic only) hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts hw/apic.c |4 ++ hw/hpet.c | 114 ++-- sysemu.h |3 ++ vl.c |3 ++ 4 files changed, 120 insertions(+), 4 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 1/5] hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only)
'target_get_irq_delivered' and 'target_reset_irq_delivered' contain entry addresses of functions that are utilized by update_irq() to detect coalesced interrupts. apic code loads these pointers during initialization. This change can be replaced if a generic feedback infrastructure to track coalesced IRQs for periodic, clock providing devices becomes available. Signed-off-by: Ulrich Obergfell --- hw/apic.c |4 sysemu.h |3 +++ vl.c |3 +++ 3 files changed, 10 insertions(+), 0 deletions(-) diff --git a/hw/apic.c b/hw/apic.c index a45b57f..eb0f6d9 100644 --- a/hw/apic.c +++ b/hw/apic.c @@ -24,6 +24,7 @@ #include "sysbus.h" #include "trace.h" #include "kvm.h" +#include "sysemu.h" /* APIC Local Vector Table */ #define APIC_LVT_TIMER 0 @@ -1143,6 +1144,9 @@ static SysBusDeviceInfo apic_info = { static void apic_register_devices(void) { +target_get_irq_delivered = apic_get_irq_delivered; +target_reset_irq_delivered = apic_reset_irq_delivered; + sysbus_register_withprop(&apic_info); } diff --git a/sysemu.h b/sysemu.h index 07d85cd..75b0139 100644 --- a/sysemu.h +++ b/sysemu.h @@ -98,6 +98,9 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f); void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f); int qemu_loadvm_state(QEMUFile *f); +extern int (*target_get_irq_delivered)(void); +extern void (*target_reset_irq_delivered)(void); + /* SLIRP */ void do_info_slirp(Monitor *mon); diff --git a/vl.c b/vl.c index 0c24e07..7bab315 100644 --- a/vl.c +++ b/vl.c @@ -233,6 +233,9 @@ const char *prom_envs[MAX_PROM_ENVS]; const char *nvram = NULL; int boot_menu; +int (*target_get_irq_delivered)(void) = 0; +void (*target_reset_irq_delivered)(void) = 0; + typedef struct FWBootEntry FWBootEntry; struct FWBootEntry { -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts
Loss of periodic timer interrupts caused by delayed callbacks and by interrupt coalescing is compensated by gradually injecting additional interrupts during subsequent timer intervals, starting at a rate of one additional interrupt per interval. The injection of additional interrupts is based on a backlog of unaccounted HPET clock periods (new HPETTimer field 'ticks_not_accounted'). The backlog increases due to delayed callbacks and coalesced interrupts, and it decreases if an interrupt was injected successfully. If the backlog increases while compensation is still in progress, the rate at which additional interrupts are injected is increased too. A limit is imposed on the backlog and on the rate. Injecting additional timer interrupts to compensate lost interrupts can alleviate long term time drift. However, on a short time scale, this method can have the side effect of making virtual machine time intermittently pass slower and faster than real time (depending on the guest's time keeping algorithm). Compensation is disabled by default and can be enabled for guests where this behaviour may be acceptable. Signed-off-by: Ulrich Obergfell --- hw/hpet.c | 63 +++- 1 files changed, 61 insertions(+), 2 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index 35466ae..92d5f58 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -32,6 +32,7 @@ #include "sysbus.h" #include "mc146818rtc.h" #include "sysemu.h" +#include //#define HPET_DEBUG #ifdef HPET_DEBUG @@ -42,6 +43,9 @@ #define HPET_MSI_SUPPORT0 +#define MAX_TICKS_NOT_ACCOUNTED (uint64_t)5 /* 5 sec */ +#define MAX_IRQ_RATE(uint32_t)10 + struct HPETState; typedef struct HPETTimer { /* timers */ uint8_t tn; /*timer number*/ @@ -326,28 +330,63 @@ static const VMStateDescription vmstate_hpet = { } }; +static bool hpet_timer_has_tick_backlog(HPETTimer *t) +{ +uint64_t backlog = t->ticks_not_accounted - (t->period + t->prev_period); +return (backlog >= t->period); +} + /* * timer expiration callback */ static void hpet_timer(void *opaque) { HPETTimer *t = opaque; +HPETState *s = t->state; uint64_t diff; - +int irq_delivered = 0; +uint32_t irq_count = 0; uint64_t period = t->period; uint64_t cur_tick = hpet_get_ticks(t->state); +if (s->driftfix && !t->ticks_not_accounted) { +t->ticks_not_accounted = t->prev_period = t->period; +} if (timer_is_periodic(t) && period != 0) { if (t->config & HPET_TN_32BIT) { while (hpet_time_after(cur_tick, t->cmp)) { t->cmp = (uint32_t)(t->cmp + t->period); +t->ticks_not_accounted += t->period; +irq_count++; } } else { while (hpet_time_after64(cur_tick, t->cmp)) { t->cmp += period; +t->ticks_not_accounted += period; +irq_count++; } } diff = hpet_calculate_diff(t, cur_tick); +if (s->driftfix) { +if (t->ticks_not_accounted > MAX_TICKS_NOT_ACCOUNTED) { +t->ticks_not_accounted = t->period + t->prev_period; +} +if (hpet_timer_has_tick_backlog(t)) { +if (t->irq_rate == 1 || irq_count > 1) { +t->irq_rate++; +t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE); +} +if (t->divisor == 0) { +assert(irq_count); +} +if (irq_count) { +t->divisor = t->irq_rate; +} +diff /= t->divisor--; +} else { +t->irq_rate = 1; +} +} qemu_mod_timer(t->qemu_timer, qemu_get_clock_ns(vm_clock) + (int64_t)ticks_to_ns(diff)); } else if (t->config & HPET_TN_32BIT && !timer_is_periodic(t)) { @@ -358,7 +397,22 @@ static void hpet_timer(void *opaque) t->wrap_flag = 0; } } -update_irq(t, 1); +if (s->driftfix && timer_is_periodic(t) && period != 0) { +if (t->ticks_not_accounted >= t->period + t->prev_period) { +irq_delivered = update_irq(t, 1); +if (irq_delivered) { +t->ticks_not_accounted -= t->prev_period; +t->prev_period = t->period; +} else { +if (irq_count) { +t->irq_rate++; +t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE); +} +} +} +} else { +update_irq(t, 1); +} } static void hpet_set_timer(HPE
Re: [Qemu-devel] [PATCH v2 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription
>> vmstate_hpet_timer = { >> VMSTATE_UINT64(fsb, HPETTimer), >> VMSTATE_UINT64(period, HPETTimer), >> VMSTATE_UINT8(wrap_flag, HPETTimer), >> + VMSTATE_UINT64_V(saved_period, HPETTimer, 3), >> + VMSTATE_UINT64_V(ticks_not_accounted, HPETTimer, 3), >> + VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3), >> + VMSTATE_UINT32_V(irq_rate, HPETTimer, 3), >> + VMSTATE_UINT32_V(divisor, HPETTimer, 3), > > We ought to be able to use a subsection keyed off of whether any ticks > are currently accumulated, no? Anthony, I'm not sure if I understand your question correctly. Are you suggesting to migrate the driftfix-related state conditionally / only if there are any ticks accumulated in 'ticks_not_accounted' and 'irqs_to_inject' ? The size of the driftfix-related state is 28 bytes per timer and we have 32 timers per HPETState, i.e. 896 additional bytes per HPETState. With a maximum number of 8 HPET blocks (HPETState), this amounts to 7168 bytes. Hence, unconditional migration of the driftfix-related state should not cause significant additional overhead. Maybe I missed something. Could you please explain which benefit you see in using a subsection ? Regards, Uli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH v2 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription
>> typedef struct HPETState { >> @@ -248,7 +253,7 @@ static int hpet_post_load(void *opaque, int >> version_id) >> >> static const VMStateDescription vmstate_hpet_timer = { >> .name = "hpet_timer", >> - .version_id = 1, >> + .version_id = 3, > > Why jump from 1 to 3? > >> .minimum_version_id = 1, >> .minimum_version_id_old = 1, >> .fields = (VMStateField []) { >> @@ -258,6 +263,11 @@ static const VMStateDescription >> vmstate_hpet_timer = { >> VMSTATE_UINT64(fsb, HPETTimer), >> VMSTATE_UINT64(period, HPETTimer), >> VMSTATE_UINT8(wrap_flag, HPETTimer), >> + VMSTATE_UINT64_V(saved_period, HPETTimer, 3), >> + VMSTATE_UINT64_V(ticks_not_accounted, HPETTimer, 3), >> + VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3), >> + VMSTATE_UINT32_V(irq_rate, HPETTimer, 3), >> + VMSTATE_UINT32_V(divisor, HPETTimer, 3), Anthony, I incremented the version ID of 'vmstate_hpet' from 2 to 3 to make sure that migrations from a QEMU process that is capable of 'driftfix' to a QEMU process that is _not_ capable of 'driftfix' will fail. I assigned version ID 3 to 'vmstate_hpet_timer' and to the new fields in there too to indicate that adding those fields was the reason why the version ID of 'vmstate_hpet' was incremented to 3. As far as the flow of execution in vmstate_load_state() is concerned, I think it does not matter whether the version ID of 'vmstate_hpet_timer' and the new fields in there is 2 or 3 (as long as they are consistent). When the 'while(field->name)' loop in vmstate_load_state() gets to the following field in 'vmstate_hpet' ... VMSTATE_STRUCT_VARRAY_UINT8(timer, HPETState, num_timers, 0, vmstate_hpet_timer, HPETTimer), ... it calls itself recursively ... if (field->flags & VMS_STRUCT) { ret = vmstate_load_state(f, field->vmsd, addr, field->vmsd->version_id); 'field->vmsd->version_id' is the version ID of 'vmstate_hpet_timer' [1]. Hence 'vmstate_hpet_timer.version_id' is being checked against itself ... if (version_id > vmsd->version_id) { return -EINVAL; } ... and the version IDs of the new fields are also being checked against 'vmstate_hpet_timer.version_id' ... if ((field->field_exists && field->field_exists(opaque, version_id)) || (!field->field_exists && field->version_id <= version_id)) { If you want me to change the version ID of 'vmstate_hpet_timer' and the new fields in there from 3 to 2, I can do that. Regards, Uli [1] Ref.: commit fa3aad24d94a6cf894db52d83f72a399324a17bb -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription
Signed-off-by: Ulrich Obergfell --- hw/hpet.c | 14 -- 1 files changed, 12 insertions(+), 2 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index 45847ed..c150da5 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -55,6 +55,11 @@ typedef struct HPETTimer { /* timers */ uint8_t wrap_flag; /* timer pop will indicate wrap for one-shot 32-bit * mode. Next pop will be actual timer expiration. */ +uint64_t saved_period; +uint64_t ticks_not_accounted; +uint32_t irqs_to_inject; +uint32_t irq_rate; +uint32_t divisor; } HPETTimer; typedef struct HPETState { @@ -248,7 +253,7 @@ static int hpet_post_load(void *opaque, int version_id) static const VMStateDescription vmstate_hpet_timer = { .name = "hpet_timer", -.version_id = 1, +.version_id = 3, .minimum_version_id = 1, .minimum_version_id_old = 1, .fields = (VMStateField []) { @@ -258,6 +263,11 @@ static const VMStateDescription vmstate_hpet_timer = { VMSTATE_UINT64(fsb, HPETTimer), VMSTATE_UINT64(period, HPETTimer), VMSTATE_UINT8(wrap_flag, HPETTimer), +VMSTATE_UINT64_V(saved_period, HPETTimer, 3), +VMSTATE_UINT64_V(ticks_not_accounted, HPETTimer, 3), +VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3), +VMSTATE_UINT32_V(irq_rate, HPETTimer, 3), +VMSTATE_UINT32_V(divisor, HPETTimer, 3), VMSTATE_TIMER(qemu_timer, HPETTimer), VMSTATE_END_OF_LIST() } @@ -265,7 +275,7 @@ static const VMStateDescription vmstate_hpet_timer = { static const VMStateDescription vmstate_hpet = { .name = "hpet", -.version_id = 2, +.version_id = 3, .minimum_version_id = 1, .minimum_version_id_old = 1, .pre_save = hpet_pre_save, -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 4/5] hpet 'driftfix': add code in update_irq() to detect coalesced interrupts (x86 apic only)
update_irq() uses a similar method as in 'rtc_td_hack' to detect coalesced interrupts. The function entry addresses are retrieved from 'target_get_irq_delivered' and 'target_reset_irq_delivered'. Signed-off-by: Ulrich Obergfell --- hw/hpet.c | 15 +-- 1 files changed, 13 insertions(+), 2 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index c150da5..b5625fb 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -31,6 +31,7 @@ #include "hpet_emul.h" #include "sysbus.h" #include "mc146818rtc.h" +#include "sysemu.h" //#define HPET_DEBUG #ifdef HPET_DEBUG @@ -176,11 +177,12 @@ static inline uint64_t hpet_calculate_diff(HPETTimer *t, uint64_t current) } } -static void update_irq(struct HPETTimer *timer, int set) +static int update_irq(struct HPETTimer *timer, int set) { uint64_t mask; HPETState *s; int route; +int irq_delivered = 1; if (timer->tn <= 1 && hpet_in_legacy_mode(timer->state)) { /* if LegacyReplacementRoute bit is set, HPET specification requires @@ -205,8 +207,17 @@ static void update_irq(struct HPETTimer *timer, int set) qemu_irq_raise(s->irqs[route]); } else { s->isr &= ~mask; -qemu_irq_pulse(s->irqs[route]); +if (s->driftfix && target_get_irq_delivered +&& target_reset_irq_delivered) { +target_reset_irq_delivered(); +qemu_irq_raise(s->irqs[route]); +irq_delivered = target_get_irq_delivered(); +qemu_irq_lower(s->irqs[route]); +} else { +qemu_irq_pulse(s->irqs[route]); +} } +return irq_delivered; } static void hpet_pre_save(void *opaque) -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/5] hpet 'driftfix': add driftfix property to HPETState and DeviceInfo
driftfix is a 'bit type' property. Compensation of delayed callbacks and coalesced interrupts can be enabled with the command line option -global hpet.driftfix=on driftfix is 'off' (disabled) by default. Signed-off-by: Ulrich Obergfell --- hw/hpet.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index e92b775..45847ed 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -72,6 +72,8 @@ typedef struct HPETState { uint64_t isr; /* interrupt status reg */ uint64_t hpet_counter; /* main counter */ uint8_t hpet_id; /* instance id */ + +uint32_t driftfix; } HPETState; static uint32_t hpet_in_legacy_mode(HPETState *s) @@ -738,6 +740,7 @@ static SysBusDeviceInfo hpet_device_info = { .qdev.props = (Property[]) { DEFINE_PROP_UINT8("timers", HPETState, num_timers, HPET_MIN_TIMERS), DEFINE_PROP_BIT("msi", HPETState, flags, HPET_MSI_SUPPORT, false), +DEFINE_PROP_BIT("driftfix", HPETState, driftfix, 0, false), DEFINE_PROP_END_OF_LIST(), }, }; -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts
Loss of periodic timer interrupts caused by delayed callbacks and by interrupt coalescing is compensated by gradually injecting additional interrupts during subsequent timer intervals, starting at a rate of one additional interrupt per interval. If further interrupts are lost while compensation is in progress, the rate is increased. A limit is imposed on the rate and on the 'backlog' of lost interrupts that are to be injected. If a guest o/s modifies the comparator register value while compensation is in progress, the 'backlog' of lost interrupts that are to be injected is scaled to the new value. Injecting additional timer interrupts to compensate lost interrupts can alleviate long term time drift. However, on a short time scale, this method can have the side effect of making virtual machine time intermittently pass faster than real time (depending on the guest's time keeping algorithm). Compensation is disabled by default and can be enabled for guests where this behaviour is acceptable. Signed-off-by: Ulrich Obergfell --- hw/hpet.c | 58 +- 1 files changed, 57 insertions(+), 1 deletions(-) diff --git a/hw/hpet.c b/hw/hpet.c index b5625fb..5a25a96 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -42,6 +42,9 @@ #define HPET_MSI_SUPPORT0 +#define MAX_IRQS_TO_INJECT (uint32_t)5000 +#define MAX_IRQ_RATE (uint32_t)10 + struct HPETState; typedef struct HPETTimer { /* timers */ uint8_t tn; /*timer number*/ @@ -309,8 +312,11 @@ static const VMStateDescription vmstate_hpet = { static void hpet_timer(void *opaque) { HPETTimer *t = opaque; +HPETState *s = t->state; uint64_t diff; +int irq_delivered = 0; +uint32_t irq_count = 0; uint64_t period = t->period; uint64_t cur_tick = hpet_get_ticks(t->state); @@ -318,13 +324,36 @@ static void hpet_timer(void *opaque) if (t->config & HPET_TN_32BIT) { while (hpet_time_after(cur_tick, t->cmp)) { t->cmp = (uint32_t)(t->cmp + t->period); +irq_count++; } } else { while (hpet_time_after64(cur_tick, t->cmp)) { t->cmp += period; +irq_count++; } } diff = hpet_calculate_diff(t, cur_tick); +if (s->driftfix) { +if (t->saved_period != t->period) { +uint64_t tmp = (t->irqs_to_inject * t->saved_period) + + t->ticks_not_accounted; +t->irqs_to_inject = tmp / t->period; +t->ticks_not_accounted = tmp % t->period; +t->saved_period = t->period; +} +t->irqs_to_inject += irq_count; +t->irqs_to_inject = MIN(t->irqs_to_inject, MAX_IRQS_TO_INJECT); +if (t->irqs_to_inject > 1) { +if (irq_count > 1) { +t->irq_rate++; +t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE); +} +if (irq_count || t->divisor == 0) { +t->divisor = t->irq_rate; +} +diff /= t->divisor--; +} +} qemu_mod_timer(t->qemu_timer, qemu_get_clock(vm_clock) + (int64_t)ticks_to_ns(diff)); } else if (t->config & HPET_TN_32BIT && !timer_is_periodic(t)) { @@ -335,7 +364,22 @@ static void hpet_timer(void *opaque) t->wrap_flag = 0; } } -update_irq(t, 1); +if (s->driftfix && timer_is_periodic(t) && period != 0) { +if (t->irqs_to_inject) { +irq_delivered = update_irq(t, 1); +if (irq_delivered) { +t->irq_rate = MIN(t->irq_rate, t->irqs_to_inject); +t->irqs_to_inject--; +} else { +if (irq_count) { +t->irq_rate++; +t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE); +} +} +} +} else { +update_irq(t, 1); +} } static void hpet_set_timer(HPETTimer *t) @@ -674,6 +718,12 @@ static void hpet_reset(DeviceState *d) timer->config |= 0x0004ULL << 32; timer->period = 0ULL; timer->wrap_flag = 0; + +timer->saved_period = 0; +timer->ticks_not_accounted = 0; +timer->irqs_to_inject = 0; +timer->irq_rate = 1; +timer->divisor = 1; } s->hpet_counter = 0ULL; @@ -734,6 +784,12 @@ static int hpet_init(SysBusDevice *dev) timer->qemu_timer = qemu_new_timer(vm_clock, hpet_timer, timer); timer->tn = i; timer->state = s; + +timer->saved_period = 0; +
[PATCH v2 1/5] hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only)
'target_get_irq_delivered' and 'target_reset_irq_delivered' contain entry addresses of functions that are utilized by update_irq() to detect coalesced interrupts. apic code loads these pointers during initialization. Signed-off-by: Ulrich Obergfell --- hw/apic.c |4 sysemu.h |3 +++ vl.c |3 +++ 3 files changed, 10 insertions(+), 0 deletions(-) diff --git a/hw/apic.c b/hw/apic.c index 00907e0..44d8cb3 100644 --- a/hw/apic.c +++ b/hw/apic.c @@ -24,6 +24,7 @@ #include "sysbus.h" #include "trace.h" #include "kvm.h" +#include "sysemu.h" /* APIC Local Vector Table */ #define APIC_LVT_TIMER 0 @@ -1143,6 +1144,9 @@ static SysBusDeviceInfo apic_info = { static void apic_register_devices(void) { +target_get_irq_delivered = apic_get_irq_delivered; +target_reset_irq_delivered = apic_reset_irq_delivered; + sysbus_register_withprop(&apic_info); } diff --git a/sysemu.h b/sysemu.h index 3f7e17e..b5d4f90 100644 --- a/sysemu.h +++ b/sysemu.h @@ -96,6 +96,9 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f); void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f); int qemu_loadvm_state(QEMUFile *f); +extern int (*target_get_irq_delivered)(void); +extern void (*target_reset_irq_delivered)(void); + /* SLIRP */ void do_info_slirp(Monitor *mon); diff --git a/vl.c b/vl.c index 8bcf2ae..548912a 100644 --- a/vl.c +++ b/vl.c @@ -231,6 +231,9 @@ const char *prom_envs[MAX_PROM_ENVS]; const char *nvram = NULL; int boot_menu; +int (*target_get_irq_delivered)(void) = 0; +void (*target_reset_irq_delivered)(void) = 0; + typedef struct FWBootEntry FWBootEntry; struct FWBootEntry { -- 1.6.2.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/5] hpet 'driftfix': alleviate time drift with HPET periodic timers
Hi, This is version 2 of a series of patches that I originally posted in: http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01989.html http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01992.html http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01991.html http://lists.gnu.org/archive/html/qemu-devel/2011-03/msg01990.html http://article.gmane.org/gmane.comp.emulators.kvm.devel/69325 http://article.gmane.org/gmane.comp.emulators.kvm.devel/69326 http://article.gmane.org/gmane.comp.emulators.kvm.devel/69327 http://article.gmane.org/gmane.comp.emulators.kvm.devel/69328 I implemented the proposed changes (remove configurability and ifdefs, define 'driftfix' as a property of the HPET device) and I've split the patch into smaller parts (originally 3, now 5). I also included a fix related to scaling 't->irqs_to_inject' when the guest o/s modifies the comparator register value. Please review and please comment. Regards, Uli Ulrich Obergfell (5): hpet 'driftfix': add hooks required to detect coalesced interrupts (x86 apic only) hpet 'driftfix': add driftfix property to HPETState and DeviceInfo hpet 'driftfix': add fields to HPETTimer and VMStateDescription hpet 'driftfix': add code in update_irq() to detect coalesced interrupts (x86 apic only) hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts hw/apic.c |4 +++ hw/hpet.c | 90 +--- sysemu.h |3 ++ vl.c |3 ++ 4 files changed, 95 insertions(+), 5 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] alleviate time drift with HPET periodic timers
>> Part 3 of the patch implements the following options for the >> 'configure' script. >> >> --disable-hpet-driftfix >> --enable-hpet-driftfix > > I see no benefit in this configurability. Just make the driftfix > unconditionally available, runtime-disabled by default for now until it > matured and there is no downside in enabling it all the time. Many Thanks Jan, I enclosed the code in '#ifdef CONFIG_HPET_DRIFTFIX ... #endif' so that it can be easily identified (and removed if the generic API would be implemented some day). Since the ifdef's are already there I added the configuration option for convenience. As you don't see any benefit in this option, I can remove that part of the patch. However, I'd suggest to keep the ifdef's and do the following: - Rename to '#ifdef HPET_DRIFTFIX ... #endif' to make it clear that this is not controlled via a configuration option. - Add '#define HPET_DRIFTFIX' to hw/hpet_emul.h. Do you agree ? Regards, Uli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] alleviate time drift with HPET periodic timers
>> Part 1 of the patch implements the following QEMU command line option. >> >> -hpet [device=none|present][,driftfix=none|slew] > > Just define driftfix as property of the hpet device. That way it can be > controlled both globally (-global hpet.driftfix=...) and per hpet block > (once we support instantiating >1 of them). Many Thanks Jan, I started investigating code changes. I'm thinking of ... - adding a new field to the HPETState structure. uint32_t driftfix; - adding the property 'driftfix' to the DeviceInfo structure. DEFINE_PROP_BIT("driftfix", HPETState, driftfix, 0, false) Using a single bit so that the option syntax would be, e.g.: -global hpet.driftfix=on (Default is 'off') - Replace all 'if (hpet_driftfix ...)' by: if ((HPETState)s->driftfix ...) Regards, Uli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] alleviate time drift with HPET periodic timers
Part 3 of the patch implements the following options for the 'configure' script. --disable-hpet-driftfix --enable-hpet-driftfix Signed-off-by: Ulrich Obergfell diff -up ./configure.orig3 ./configure --- ./configure.orig3 2011-02-18 22:48:06.0 +0100 +++ ./configure 2011-03-13 12:43:44.870966181 +0100 @@ -140,6 +140,7 @@ linux_aio="" attr="" vhost_net="" xfs="" +hpet_driftfix="no" gprof="no" debug_tcg="no" @@ -749,6 +750,10 @@ for opt do ;; --enable-rbd) rbd="yes" ;; + --disable-hpet-driftfix) hpet_driftfix="no" + ;; + --enable-hpet-driftfix) hpet_driftfix="yes" + ;; *) echo "ERROR: unknown option $opt"; show_help="yes" ;; esac @@ -2602,6 +2607,7 @@ echo "Trace output file $trace_file- Your SDL version is too old - please upgrade to have SDL support" @@ -2893,6 +2899,10 @@ if test "$rbd" = "yes" ; then echo "CONFIG_RBD=y" >> $config_host_mak fi +if test "$hpet_driftfix" = "yes" ; then + echo "CONFIG_HPET_DRIFTFIX=y" >> $config_host_mak +fi + # USB host support case "$usb" in linux) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] alleviate time drift with HPET periodic timers
Part 2 of the patch compensates loss of timer interrupts caused by ... - delayed timer callback. - interrupt coalescing (x86 only). Lost timer interrupts are compensated by gradually injecting additional interrupts during the subsequent timer intervals, starting at a rate of one additional interrupt per interval. If further interrupts are lost while compensation is still in progress, the rate is increased. A limit is imposed on the rate and on the 'backlog' of lost interrupts that are to be injected. Signed-off-by: Ulrich Obergfell diff -up ./hw/apic.c.orig2 ./hw/apic.c --- ./hw/apic.c.orig2 2011-02-18 22:48:06.0 +0100 +++ ./hw/apic.c 2011-03-13 12:40:56.110991702 +0100 @@ -24,6 +24,9 @@ #include "sysbus.h" #include "trace.h" #include "kvm.h" +#ifdef CONFIG_HPET_DRIFTFIX +#include "sysemu.h" +#endif /* APIC Local Vector Table */ #define APIC_LVT_TIMER 0 @@ -1143,6 +1146,10 @@ static SysBusDeviceInfo apic_info = { static void apic_register_devices(void) { +#ifdef CONFIG_HPET_DRIFTFIX +qemu_get_irq_delivered = apic_get_irq_delivered; +qemu_reset_irq_delivered = apic_reset_irq_delivered; +#endif sysbus_register_withprop(&apic_info); } diff -up ./hw/hpet.c.orig2 ./hw/hpet.c --- ./hw/hpet.c.orig2 2011-02-18 22:48:06.0 +0100 +++ ./hw/hpet.c 2011-03-13 12:40:56.111991655 +0100 @@ -31,6 +31,9 @@ #include "hpet_emul.h" #include "sysbus.h" #include "mc146818rtc.h" +#ifdef CONFIG_HPET_DRIFTFIX +#include "sysemu.h" +#endif //#define HPET_DEBUG #ifdef HPET_DEBUG @@ -41,6 +44,13 @@ #define HPET_MSI_SUPPORT0 +#ifdef CONFIG_HPET_DRIFTFIX +#define MAX_IRQS_TO_INJECT (uint32_t)5000 +#define MAX_IRQ_RATE (uint32_t)10 + +extern int hpet_driftfix; +#endif + struct HPETState; typedef struct HPETTimer { /* timers */ uint8_t tn; /*timer number*/ @@ -55,6 +65,12 @@ typedef struct HPETTimer { /* timers */ uint8_t wrap_flag; /* timer pop will indicate wrap for one-shot 32-bit * mode. Next pop will be actual timer expiration. */ +#ifdef CONFIG_HPET_DRIFTFIX +uint64_t saved_period; +uint32_t irqs_to_inject; +uint32_t irq_rate; +uint32_t divisor; +#endif } HPETTimer; typedef struct HPETState { @@ -169,11 +185,12 @@ static inline uint64_t hpet_calculate_di } } -static void update_irq(struct HPETTimer *timer, int set) +static int update_irq(struct HPETTimer *timer, int set) { uint64_t mask; HPETState *s; int route; +int irq_delivered = 1; if (timer->tn <= 1 && hpet_in_legacy_mode(timer->state)) { /* if LegacyReplacementRoute bit is set, HPET specification requires @@ -198,8 +215,19 @@ static void update_irq(struct HPETTimer qemu_irq_raise(s->irqs[route]); } else { s->isr &= ~mask; +#ifdef CONFIG_HPET_DRIFTFIX +if (hpet_driftfix && qemu_get_irq_delivered + && qemu_reset_irq_delivered) { +qemu_reset_irq_delivered(); +qemu_irq_raise(s->irqs[route]); +irq_delivered = qemu_get_irq_delivered(); +qemu_irq_lower(s->irqs[route]); +} +else +#endif qemu_irq_pulse(s->irqs[route]); } +return irq_delivered; } static void hpet_pre_save(void *opaque) @@ -246,7 +274,11 @@ static int hpet_post_load(void *opaque, static const VMStateDescription vmstate_hpet_timer = { .name = "hpet_timer", +#ifdef CONFIG_HPET_DRIFTFIX +.version_id = 3, +#else .version_id = 1, +#endif .minimum_version_id = 1, .minimum_version_id_old = 1, .fields = (VMStateField []) { @@ -256,6 +288,12 @@ static const VMStateDescription vmstate_ VMSTATE_UINT64(fsb, HPETTimer), VMSTATE_UINT64(period, HPETTimer), VMSTATE_UINT8(wrap_flag, HPETTimer), +#ifdef CONFIG_HPET_DRIFTFIX +VMSTATE_UINT64_V(saved_period, HPETTimer, 3), +VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3), +VMSTATE_UINT32_V(irq_rate, HPETTimer, 3), +VMSTATE_UINT32_V(divisor, HPETTimer, 3), +#endif VMSTATE_TIMER(qemu_timer, HPETTimer), VMSTATE_END_OF_LIST() } @@ -263,7 +301,11 @@ static const VMStateDescription vmstate_ static const VMStateDescription vmstate_hpet = { .name = "hpet", +#ifdef CONFIG_HPET_DRIFTFIX +.version_id = 3, +#else .version_id = 2, +#endif .minimum_version_id = 1, .minimum_version_id_old = 1, .pre_save = hpet_pre_save, @@ -287,7 +329,10 @@ static void hpet_timer(void *opaque) { HPETTimer *t = opaque; uint64_t diff; - +#ifdef CONFIG_HPET_DRIFTFIX +uint32_t irq_count = 0; +int irq_delivered = 0; +#endif uint64_t period = t->period; uint64_t cur_tick = hpet_get_ticks(t->state); @@ -295,1
[PATCH 1/3] alleviate time drift with HPET periodic timers
Part 1 of the patch implements the following QEMU command line option. -hpet [device=none|present][,driftfix=none|slew] Signed-off-by: Ulrich Obergfell diff -up ./qemu-config.c.orig1 ./qemu-config.c --- ./qemu-config.c.orig1 2011-02-18 22:48:06.0 +0100 +++ ./qemu-config.c 2011-03-13 12:38:22.813976639 +0100 @@ -261,6 +261,23 @@ static QemuOptsList qemu_rtc_opts = { }, }; +#ifdef CONFIG_HPET_DRIFTFIX +static QemuOptsList qemu_hpet_opts = { +.name = "hpet", +.head = QTAILQ_HEAD_INITIALIZER(qemu_hpet_opts.head), +.desc = { +{ +.name = "device", +.type = QEMU_OPT_STRING, +},{ +.name = "driftfix", +.type = QEMU_OPT_STRING, +}, +{ /* end of list */ } +}, +}; +#endif + static QemuOptsList qemu_global_opts = { .name = "global", .head = QTAILQ_HEAD_INITIALIZER(qemu_global_opts.head), @@ -462,6 +479,9 @@ static QemuOptsList *vm_config_groups[32 &qemu_netdev_opts, &qemu_net_opts, &qemu_rtc_opts, +#ifdef CONFIG_HPET_DRIFTFIX +&qemu_hpet_opts, +#endif &qemu_global_opts, &qemu_mon_opts, &qemu_cpudef_opts, diff -up ./qemu-options.hx.orig1 ./qemu-options.hx --- ./qemu-options.hx.orig1 2011-02-18 22:48:06.0 +0100 +++ ./qemu-options.hx 2011-03-13 12:38:22.815977096 +0100 @@ -972,6 +972,13 @@ STEXI Disable HPET support. ETEXI +#ifdef CONFIG_HPET_DRIFTFIX +DEF("hpet", HAS_ARG, QEMU_OPTION_hpet, \ +"-hpet [device=none|present][,driftfix=none|slew]\n" \ +"disable or enable HPET, disable or enable drift fix for periodic timers\n", +QEMU_ARCH_ALL) +#endif + DEF("balloon", HAS_ARG, QEMU_OPTION_balloon, "-balloon none disable balloon device\n" "-balloon virtio[,addr=str]\n" diff -up ./vl.c.orig1 ./vl.c --- ./vl.c.orig12011-02-18 22:48:06.0 +0100 +++ ./vl.c 2011-03-13 12:38:35.167984285 +0100 @@ -203,6 +203,9 @@ CharDriverState *parallel_hds[MAX_PARALL CharDriverState *virtcon_hds[MAX_VIRTIO_CONSOLES]; int win2k_install_hack = 0; int rtc_td_hack = 0; +#ifdef CONFIG_HPET_DRIFTFIX +int hpet_driftfix = 0; +#endif int usb_enabled = 0; int singlestep = 0; int smp_cpus = 1; @@ -431,6 +434,41 @@ static void configure_rtc(QemuOpts *opts } } +#ifdef CONFIG_HPET_DRIFTFIX +static void configure_hpet(QemuOpts *opts) +{ +const char *value, *opt; + +opt = "device"; +value = qemu_opt_get(opts, opt); +if (value) { +if (!strcmp(value, "present")) { +no_hpet = 0; +} else if (!strcmp(value, "none")) { +no_hpet = 1; +} else { +goto error_exit; +} +} +opt = "driftfix"; +value = qemu_opt_get(opts, opt); +if (value) { +if (!strcmp(value, "slew")) { +hpet_driftfix = 1; +} else if (!strcmp(value, "none")) { +hpet_driftfix = 0; +} else { +goto error_exit; +} +} +return; + +error_exit: +fprintf(stderr, "qemu: -hpet option '%s': value missing or invalid\n", opt); +exit(1); +} +#endif + /***/ /* Bluetooth support */ static int nb_hcis; @@ -2644,6 +2682,15 @@ int main(int argc, char **argv, char **e case QEMU_OPTION_no_hpet: no_hpet = 1; break; +#ifdef CONFIG_HPET_DRIFTFIX +case QEMU_OPTION_hpet: +opts = qemu_opts_parse(qemu_find_opts("hpet"), optarg, 0); +if (!opts) { +exit(1); +} +configure_hpet(opts); +break; +#endif case QEMU_OPTION_balloon: if (balloon_parse(optarg) < 0) { fprintf(stderr, "Unknown -balloon argument %s\n", optarg); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] alleviate time drift with HPET periodic timers
Hi, By the beginning of February I posted an RFC regarding an approach to alleviate time drift with HPET periodic timers. Ref.: http://article.gmane.org/gmane.comp.emulators.kvm.devel/67346 http://lists.gnu.org/archive/html/qemu-devel/2011-02/msg00327.html During the discussion it became apparent that a generic API that could facilitate (re)injection of lost timer interrupts for any timer device emulation would be preferred over an isolated fix for a specific device. However, at this stage there appears to be no plan to actually implement the generic API. Therefore, we may want to reconsider an improved version of the code that I originally included in the RFC. This improved version compensates the loss of timer interrupts caused by ... - delayed timer callback. - interrupt coalescing (x86 only). Lost timer interrupts are compensated by gradually injecting additional interrupts during the subsequent timer intervals, starting at a rate of one additional interrupt per interval. If further interrupts are lost while compensation is still in progress, the rate is increased. A limit is imposed on the rate and on the 'backlog' of lost interrupts that are to be injected. Compensation of lost timer interrupts can be enabled via a new QEMU command line option [1]. -hpet [device=none|present][,driftfix=none|slew] The code is enclosed in '#ifdef CONFIG_HPET_DRIFTFIX ... #endif' and it is configurable via a new option '--enable-hpet-driftfix' for the 'configure' script. This also makes it easy to identify and to remove if the generic API would be implemented some day. I've tested this code primarily by playing a list of example video files continuously in a Windows Vista(tm) guest. In some cases I observed the guest time fall 15+ seconds behind KVM host time in one minute without the patch, whereas with the patch there was no visible difference between the guest time and KVM host time after several hours of testing [2]. The 'qemu-system-x86_64' test binary image file was built from snapshot ... http://git.kernel.org/?p=virt/kvm/qemu-kvm.git;a=commit;h=e700699e9f6f1e163985a8e4cd76d8568d24c152 ... using only the '--prefix' and '--enable-hpet-driftfix' options of the 'configure' script (other options default). The series of patches is based on the aforementioned snapshot and it consists of three parts. - Part 1 introduces the new QEMU command line option. - Part 2 implements the method to compensate lost timer interrupts. - Part 3 introduces the new option for the 'configure' script. Please review and please comment. Regards, Uli Obergfell [1] Injecting additional timer interrupts to compensate lost interrupts can alleviate long term time drift. However, on a short time scale, this method can have the side effect of making virtual machine time intermittently pass faster than real time (depending on the guest's time keeping algorithm). Compensation is disabled by default and can be enabled for guests where this behaviour is acceptable. [2] Testing required a patch in tight_compress_data(), ui/vnc-enc-tight.c to avoid segmentation fault / core dump of qemu. However, this needs to be discussed in a separate post. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC: 0/2] patch for QEMU HPET periodic timer emulation to alleviate time drift
On 02/03/2011 9:07 PM, Anthony Liguori wrote: > On 02/03/2011 09:28 AM, Jan Kiszka wrote: >> On 2011-02-03 14:43, Ulrich Obergfell wrote: ... >>> This is an RFC through which I would like to get feedback on how the >>> idea of a patch to compensate those lost interrupts would be received: >>> >>> The patch determines the number of lost timer interrupts based on the >>> number of elapsed comparator periods. Lost interrupts are compensated >>> >> That neglects coalescing of the HPET IRQs: If the timer is run regularly >> but the guest is not able to retrieve the injected IRQs, you should >> still see drifts with your patches. >> > > FWIW, this isn't the most common failure scenario. This is only really > prominent when you have rapid reinject like we do with the in-kernel > PIT. This generally shouldn't be an issue with gradual reinjection. > >>> by gradually injecting additional interrupts during the subsequent >>> timer intervals, starting at a rate of one additional interrupt per >>> interval. If further interrupts are lost while compensation is still >>> in progress, the rate is increased. The algorithm imposes a limit on >>> the rate and on the 'backlog' of lost interrupts to be injected. Anthony, Jan, many thanks for your feedback. I'm sorry for not responding earlier. I used the following KVM kernel module trace point in __apic_accept_irq() to trace coalesced interrupts related to HPET timer 0: result = !apic_test_and_set_irr(vector, apic); trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode, trig_mode, vector, !result); While running the test outlined in my original email, I observed between 5 and 15 of these events occurring during one minute: qemu-system-x86-7632 [000] 3755.888909: kvm_apic_accept_irq: apicid 0 vec 209 (Fixed|edge) (coalesced) kvm->arch->vioapic.redirtbl[2] indicated that IRQ 2 was routed to vector number 209. With a comparator period of 15.6 milliseconds, the coalesced interrupts apparently contribute less than 250 milliseconds of drift. However, the total drift that I observed during one minute was up to 15 seconds. I think the patch could possibly handle coalesced interrupts too: - In update_irq(), use a similar method as in RTC emulation to detect coalesced interrupts. Return a delivery status to hpet_timer(). - In hpet_timer(), decrement t->irqs_to_inject (number of interrupts remaining to inject) only if the interrupt was not coalesced. Regards, Uli Obergfell -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC: 0/2] patch for QEMU HPET periodic timer emulation to alleviate time drift
Anthony, in reply to: > My point is that there's really low hanging fruit and while for some > curious reason I don't actually see this patch, I believe that a patch > like this probably can help us quite a lot in the short term. I've sent the patch in two separate emails: - code part 1 (introduces the qemu command line option) http://article.gmane.org/gmane.comp.emulators.kvm.devel/67347 - code part 2 (implements compensation of lost interrupts) http://article.gmane.org/gmane.comp.emulators.kvm.devel/67348 Regards, Uli Obergfell -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC: 2/2] patch for QEMU HPET periodic timer emulation to alleviate time drift (code part 2)
code part 2 --- Implements compensation of lost interrupts for HPET periodic timers. diff -up ./hw/hpet.c.orig2 ./hw/hpet.c --- ./hw/hpet.c.orig2 2011-01-21 23:34:47.0 +0100 +++ ./hw/hpet.c 2011-02-01 19:20:24.619247214 +0100 @@ -41,6 +41,13 @@ #define HPET_MSI_SUPPORT0 +#ifdef TARGET_I386 +#define MAX_IRQS_TO_INJECT (uint32_t)100 +#define MAX_IRQ_RATE (uint32_t)10 + +extern int hpet_driftfix; +#endif + struct HPETState; typedef struct HPETTimer { /* timers */ uint8_t tn; /*timer number*/ @@ -55,6 +62,13 @@ typedef struct HPETTimer { /* timers */ uint8_t wrap_flag; /* timer pop will indicate wrap for one-shot 32-bit * mode. Next pop will be actual timer expiration. */ +#ifdef TARGET_I386 +/* for lost interrupt compensation */ +uint64_t saved_period; +uint32_t irqs_to_inject; +uint32_t irq_rate; +uint32_t divisor; +#endif } HPETTimer; typedef struct HPETState { @@ -248,7 +262,11 @@ static int hpet_post_load(void *opaque, static const VMStateDescription vmstate_hpet_timer = { .name = "hpet_timer", +#ifdef TARGET_I386 +.version_id = 3, +#else .version_id = 1, +#endif .minimum_version_id = 1, .minimum_version_id_old = 1, .fields = (VMStateField []) { @@ -258,6 +276,13 @@ static const VMStateDescription vmstate_ VMSTATE_UINT64(fsb, HPETTimer), VMSTATE_UINT64(period, HPETTimer), VMSTATE_UINT8(wrap_flag, HPETTimer), +#ifdef TARGET_I386 +/* for lost interrupt compensation */ +VMSTATE_UINT64_V(saved_period, HPETTimer, 3), +VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3), +VMSTATE_UINT32_V(irq_rate, HPETTimer, 3), +VMSTATE_UINT32_V(divisor, HPETTimer, 3), +#endif VMSTATE_TIMER(qemu_timer, HPETTimer), VMSTATE_END_OF_LIST() } @@ -265,7 +290,11 @@ static const VMStateDescription vmstate_ static const VMStateDescription vmstate_hpet = { .name = "hpet", +#ifdef TARGET_I386 +.version_id = 3, +#else .version_id = 2, +#endif .minimum_version_id = 1, .minimum_version_id_old = 1, .pre_save = hpet_pre_save, @@ -289,7 +318,9 @@ static void hpet_timer(void *opaque) { HPETTimer *t = opaque; uint64_t diff; - +#ifdef TARGET_I386 +uint32_t irq_count = 0; +#endif uint64_t period = t->period; uint64_t cur_tick = hpet_get_ticks(t->state); @@ -297,13 +328,88 @@ static void hpet_timer(void *opaque) if (t->config & HPET_TN_32BIT) { while (hpet_time_after(cur_tick, t->cmp)) { t->cmp = (uint32_t)(t->cmp + t->period); +#ifdef TARGET_I386 +/* count one irq per period */ +++irq_count; +#endif } } else { while (hpet_time_after64(cur_tick, t->cmp)) { t->cmp += period; +#ifdef TARGET_I386 +/* count one irq per period */ +++irq_count; +#endif } } diff = hpet_calculate_diff(t, cur_tick); +#ifdef TARGET_I386 +if (hpet_driftfix) { +/* + * If the period value changed since the previous callback, + * scale the number of irqs to inject to the new period value + * and save the new period value. + */ +if (t->saved_period != t->period) { +t->irqs_to_inject = (t->irqs_to_inject * t->saved_period) + / t->period; +t->saved_period = t->period; +} +/* + * Add the irq count of the current callback to the number + * of irqs to inject. Make sure the result does not exceed + * the limit. + */ +t->irqs_to_inject += irq_count; +if (t->irqs_to_inject > MAX_IRQS_TO_INJECT) +t->irqs_to_inject = MAX_IRQS_TO_INJECT; +/* + * One irq will be injected during the current callback. If + * the number of irqs to inject is greater than 1, there is + * is a backlog of 'irqs_to_inject - 1' interrupts that were + * lost and need to be compensated. + */ +if (t->irqs_to_inject > 1) { +/* + * If the irq count of the current callback is greater + * than 1, 'irq_count - 1' interrupts were lost since + * the previous callback. Increase the rate at which + * additional irqs will be injected to compensate the + * lost interrupts. Make sure the rate does not exceed + * the limit. + */ +if (irq_count > 1) { +t->irq_rate++; +if (t->irq_rate > MAX_IRQ_RATE) +t->irq_rate = MAX_IRQ_RATE; +} +
[RFC: 1/2] patch for QEMU HPET periodic timer emulation to alleviate time drift (code part 1)
code part 1 --- Introduces the '-hpet [device=none|present][,driftfix=none|slew]' option. diff -up ./qemu-config.c.orig1 ./qemu-config.c --- ./qemu-config.c.orig1 2011-01-21 23:34:47.0 +0100 +++ ./qemu-config.c 2011-02-01 19:38:26.665250920 +0100 @@ -255,6 +255,21 @@ static QemuOptsList qemu_rtc_opts = { }, }; +static QemuOptsList qemu_hpet_opts = { +.name = "hpet", +.head = QTAILQ_HEAD_INITIALIZER(qemu_hpet_opts.head), +.desc = { +{ +.name = "device", +.type = QEMU_OPT_STRING, +},{ +.name = "driftfix", +.type = QEMU_OPT_STRING, +}, +{ /* end of list */ } +}, +}; + static QemuOptsList qemu_global_opts = { .name = "global", .head = QTAILQ_HEAD_INITIALIZER(qemu_global_opts.head), @@ -456,6 +471,7 @@ static QemuOptsList *vm_config_groups[32 &qemu_netdev_opts, &qemu_net_opts, &qemu_rtc_opts, +&qemu_hpet_opts, &qemu_global_opts, &qemu_mon_opts, &qemu_cpudef_opts, diff -up ./qemu-options.hx.orig1 ./qemu-options.hx --- ./qemu-options.hx.orig1 2011-01-21 23:34:47.0 +0100 +++ ./qemu-options.hx 2011-02-01 16:01:26.063249119 +0100 @@ -972,6 +972,11 @@ STEXI Disable HPET support. ETEXI +DEF("hpet", HAS_ARG, QEMU_OPTION_hpet, \ +"-hpet [device=none|present][,driftfix=none|slew]\n" \ +"disable or enable HPET, disable or enable drift fix for periodic timers (x86 only)\n", +QEMU_ARCH_ALL) + DEF("balloon", HAS_ARG, QEMU_OPTION_balloon, "-balloon none disable balloon device\n" "-balloon virtio[,addr=str]\n" diff -up ./vl.c.orig1 ./vl.c --- ./vl.c.orig12011-01-21 23:34:47.0 +0100 +++ ./vl.c 2011-02-01 19:39:22.163249205 +0100 @@ -203,6 +203,7 @@ CharDriverState *parallel_hds[MAX_PARALL CharDriverState *virtcon_hds[MAX_VIRTIO_CONSOLES]; int win2k_install_hack = 0; int rtc_td_hack = 0; +int hpet_driftfix = 0; int usb_enabled = 0; int singlestep = 0; int smp_cpus = 1; @@ -431,6 +432,39 @@ static void configure_rtc(QemuOpts *opts } } +static void configure_hpet(QemuOpts *opts) +{ +const char *value, *opt; + +opt = "device"; +value = qemu_opt_get(opts, opt); +if (value) { +if (!strcmp(value, "present")) { +no_hpet = 0; +} else if (!strcmp(value, "none")) { +no_hpet = 1; +} else { +goto error_exit; +} +} +opt = "driftfix"; +value = qemu_opt_get(opts, opt); +if (value) { +if (!strcmp(value, "slew")) { +hpet_driftfix = 1; +} else if (!strcmp(value, "none")) { +hpet_driftfix = 0; +} else { +goto error_exit; +} +} +return; + +error_exit: +fprintf(stderr, "qemu: -hpet option '%s': value missing or invalid\n", opt); +exit(1); +} + /***/ /* Bluetooth support */ static int nb_hcis; @@ -2608,6 +2642,13 @@ int main(int argc, char **argv, char **e case QEMU_OPTION_no_hpet: no_hpet = 1; break; +case QEMU_OPTION_hpet: +opts = qemu_opts_parse(qemu_find_opts("hpet"), optarg, 0); +if (!opts) { +exit(1); +} +configure_hpet(opts); +break; case QEMU_OPTION_balloon: if (balloon_parse(optarg) < 0) { fprintf(stderr, "Unknown -balloon argument %s\n", optarg); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC: 0/2] patch for QEMU HPET periodic timer emulation to alleviate time drift
Hi, I am observing severe backward time drift in a MS Windows Vista(tm) guest running on a Fedora 14 KVM host. I can reproduce the problem with the following steps: 1. Use 'vncviewer' to connect to the guest's desktop. 2. Click on the menu title bar of a window on the guest's desktop. 3. Move that window around on the guest's desktop. While I keep on moving the window around for one minute, the guest time falls up to 15 seconds behind host time. The problem is caused by delayed callbacks of hpet_timer(). A timer interrupt is injected into the guest during each callback. However, interrupts are lost if delays are greater than a comparator period. This is an RFC through which I would like to get feedback on how the idea of a patch to compensate those lost interrupts would be received: The patch determines the number of lost timer interrupts based on the number of elapsed comparator periods. Lost interrupts are compensated by gradually injecting additional interrupts during the subsequent timer intervals, starting at a rate of one additional interrupt per interval. If further interrupts are lost while compensation is still in progress, the rate is increased. The algorithm imposes a limit on the rate and on the 'backlog' of lost interrupts to be injected. The patch can be enabled via a qemu command line option. -hpet [device=none|present][,driftfix=none|slew] The 'device=none' option is equivalent to the '-no-hpet' option, and the 'driftfix=slew' option enables the patch (similar to RTC). The second and third part of this series of email contain the patch: - Code part 1 introduces the qemu command line option. - Code part 2 implements compensation of lost interrupts. Please review and please comment. With the patch enabled, I'm not observing any drift after one minute of testing (as outlined above). I'm not using a fine grained method to check the drift, just looking at the hrs:min:sec of the guest's and the host's clock. If there is still some drift, it's apparently less than a second. Regards, Uli Obergfell -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
- "Anthony Liguori" wrote: > On 12/14/2010 06:09 AM, Ulrich Obergfell wrote: [...] > > Parts 1 thru 4 of this RFC contain experimental source code which > > I recently used to investigate the performance benefit. In a Linux > > guest, I was running a program that calls gettimeofday() 'n' times > > in a loop (the PM Timer register is read during each call). With > > in-kernel PM Timer, I observed a significant reduction of program > > execution time. > > > > I've played with this in the past. Can you post real numbers, > preferably, with a real work load? Anthony, I only experimented with a gettimeofday() loop. With this test scenario I observed that in-kernel PM Timer reduced the program execution time to roughly half of the execution time that it takes with userspace PM Timer. Please find some example results below (these results were obtained while the host was not busy). The relative difference of in-kernel PM Timer versus userspace PM Timer is high, whereas the absolute difference per call appears to be low. So, the benefit much depends on how frequently gettimeofday() is called in a real work load. I don't have any numbers from a real work load. When I began working on this, I was motivated by the fact that the Linux kernel itself provides an optimization for the gettimeofday() call ('vxtime'). So, from this I presumed that there would be real work loads which would benefit from the optimization of the gettimeofday() call (otherwise, why would we have 'vxtime' ?). Of course, 'vxtime' is not related to PM based time keeping. However, the experimental code shows an approach to optimize gettimeofday() in KVM virtual machines. Regards, Uli - host: # grep "model name" /proc/cpuinfo | sort | uniq -c 8 model name : Intel(R) Core(TM) i7 CPU Q 740 @ 1.73GHz # uname -r 2.6.37-rc4 - guest: # grep "model name" /proc/cpuinfo | sort | uniq -c 4 model name : QEMU Virtual CPU version 0.13.50 - test program ('gtod.c'): #include #include struct timeval tv; main(int argc, char *argv[]) { int i = atoi(argv[1]); while (i-- > 0) gettimeofday(&tv, NULL); } - example results with in-kernel PM Timer: # for i in 1 2 3 > do > time ./gtod 2500 > done real0m44.302s user0m1.090s sys 0m43.163s real0m44.509s user0m1.100s sys 0m43.393s real0m45.290s user0m1.160s sys 0m44.123s # for i in 1000 5000 1 > do > time ./gtod $i > done real0m17.981s user0m0.810s sys 0m17.157s real1m27.253s user0m1.930s sys 1m25.307s real2m51.801s user0m3.359s sys 2m48.384s - example results with userspace PM Timer: # for i in 1 2 3 > do > time ./gtod 2500 > done real1m24.185s user0m2.000s sys 1m22.168s real1m23.508s user0m1.750s sys 1m21.738s real1m24.437s user0m1.900s sys 1m22.517s # for i in 1000 5000 1 > do > time ./gtod $i > done real0m33.479s user0m0.680s sys 0m32.785s real2m50.831s user0m3.389s sys 2m47.405s real5m42.304s user0m7.319s sys 5m34.919s -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/4] KVM in-kernel PM Timer implementation
- "Avi Kivity" wrote: > On 12/14/2010 03:40 PM, Glauber Costa wrote: > > > > > > What is the motivation for this? Are there any important guests that > > > use the pmtimer? > > Avi, > > > > All older RHEL and Windows, for example, would benefit for this. > > They only benefit from it because we don't provide HPET. If we did, the > guests would use HPET in preference to pmtimer, since HPET is so much > better than pmtimer (yet still sucks in an absolute sense). > > > > If anything I'd expect hpet or the Microsoft synthetic timers to be a > > > lot more important. > > > > True. But also a lot more work. > > Implementing just the pm timer counter - not the whole of it - in > > kernel, gives us a lot of gain with not very much effort. Patch is > > pretty simple, as you can see, and most of it is even code to turn it > > on/off, etc. > > > > Partial emulation is not something I like since it causes a fuzzy > kernel/user boundary. In this case, transitioning to userspace when > interrupts are enabled doesn't look so hot. Are you sure all guests > that benefit from this don't enable the pmtimer interrupt? What about > the transition? Will we have a time discontinuity when that happens? Avi, the idea is to use the '-kvm-pmtmr' option (in code part 4) only with guests that do not enable the 'timer carry interrupt'. Guests that need to enable the 'timer carry interrupt' should rather use the PM Timer emulation in qemu userspace (i.e. they should not be started with this option). If a guest is accidentally started with this option, the in-kernel PM Timer (in code part 1) detects if the guest attempts to enable the 'timer carry interrupt' and falls back to PM Timer emulation in qemu userspace (in-kernel PM Timer disables itself automatically). So, this is not a combination of in-kernel PM Timer register emulation and qemu userspace PM Timer interrupt emulation. Regards, Uli -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 4/4] KVM in-kernel PM Timer implementation (experimental code part 4)
experimental code part 4 (qemu userspace) - This code introduces the new qemu command line option '-kvm-pmtmr'. qemu only creates and configures in-kernel PM Timer if this option is specified on the command line. diff -up ./qemu-kvm.c.orig4 ./qemu-kvm.c --- ./qemu-kvm.c.orig4 2010-12-10 10:50:42.857811776 +0100 +++ ./qemu-kvm.c2010-12-10 11:45:23.783748044 +0100 @@ -54,6 +54,9 @@ int kvm_irqchip = 1; int kvm_pit = 1; int kvm_pit_reinject = 1; int kvm_nested = 0; +#ifdef KVM_CAP_PMTMR +int kvm_pmtmr = 0; +#endif KVMState *kvm_state; @@ -186,7 +189,7 @@ int kvm_init(int smp_cpus) kvm_context->no_irqchip_creation = 0; kvm_context->no_pit_creation = 0; #ifdef KVM_CAP_PMTMR -kvm_context->no_pmtmr_creation = 0; +kvm_context->no_pmtmr_creation = 1; #endif #ifdef KVM_CAP_SET_GUEST_DEBUG @@ -241,6 +244,11 @@ void kvm_disable_pit_creation(kvm_contex } #ifdef KVM_CAP_PMTMR +void kvm_enable_pmtmr_creation(kvm_context_t kvm) +{ +kvm->no_pmtmr_creation = 0; +} + void (*kvm_arch_pmtmr_handler)(kvm_context_t kvm); /* * This handler is called by @@ -1654,6 +1662,11 @@ static int kvm_create_context(void) if (!kvm_pit) { kvm_disable_pit_creation(kvm_context); } +#ifdef KVM_CAP_PMTMR +if (kvm_pmtmr) { +kvm_enable_pmtmr_creation(kvm_context); +} +#endif if (kvm_create(kvm_context, 0, NULL) < 0) { kvm_finalize(kvm_state); return -1; diff -up ./qemu-kvm.h.orig4 ./qemu-kvm.h --- ./qemu-kvm.h.orig4 2010-12-10 11:26:43.726790319 +0100 +++ ./qemu-kvm.h2010-12-10 11:47:50.074805792 +0100 @@ -124,6 +124,18 @@ void kvm_disable_irqchip_creation(kvm_co */ void kvm_disable_pit_creation(kvm_context_t kvm); +#ifdef KVM_CAP_PMTMR +/*! + * \brief Enable the in-kernel ACPI PM Timer register creation + * + * In-kernel ACPI PM Timer register is disabled by default. + * If in-kernel is to be used, this should be called prior to kvm_create(). + * + * \param kvm Pointer to the kvm_context + */ +void kvm_enable_pmtmr_creation(kvm_context_t kvm); +#endif + /*! * \brief Create new virtual machine * @@ -706,6 +718,9 @@ extern int kvm_irqchip; extern int kvm_pit; extern int kvm_pit_reinject; extern int kvm_nested; +#ifdef KVM_CAP_PMTMR +extern int kvm_pmtmr; +#endif extern kvm_context_t kvm_context; struct ioperm_data { diff -up ./qemu-options.hx.orig4 ./qemu-options.hx --- ./qemu-options.hx.orig4 2010-12-02 15:15:20.0 +0100 +++ ./qemu-options.hx 2010-12-06 11:27:57.273648509 +0100 @@ -2330,6 +2330,9 @@ DEF("no-kvm-pit-reinjection", 0, QEMU_OP QEMU_ARCH_I386) DEF("enable-nesting", 0, QEMU_OPTION_enable_nesting, "-enable-nesting enable support for running a VM inside the VM (AMD only)\n", QEMU_ARCH_I386) +DEF("kvm-pmtmr", 0, QEMU_OPTION_kvm_pmtmr, +"-kvm-pmtmr enable KVM kernel mode ACPI PM Timer register emulation\n", +QEMU_ARCH_I386) DEF("nvram", HAS_ARG, QEMU_OPTION_nvram, "-nvram FILE provide ia64 nvram contents\n", QEMU_ARCH_ALL) DEF("tdf", 0, QEMU_OPTION_tdf, diff -up ./vl.c.orig4 ./vl.c --- ./vl.c.orig42010-12-10 10:34:55.388997058 +0100 +++ ./vl.c 2010-12-10 11:50:20.566810444 +0100 @@ -2474,6 +2474,12 @@ int main(int argc, char **argv, char **e kvm_nested = 1; break; } +#ifdef KVM_CAP_PMTMR + case QEMU_OPTION_kvm_pmtmr: { + kvm_pmtmr = 1; + break; + } +#endif #endif case QEMU_OPTION_usb: usb_enabled = 1; -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 3/4] KVM in-kernel PM Timer implementation (experimental code part 3)
experimental code part 3 (qemu userspace) - This code utlizes the new ioctl commands introduced by code part 2. The KVM_CREATE_PMTMR ioctl command is simply called once when a virtual machine is being created. However, calling KVM_CONFIGURE_PMTMR is more challenging because it involves ... - passing the base address of PM I/O port range to code part 1 - passing the clock offset to code part 1 'timers_state.cpu_clock_offset' gets updated at each vm_start() call. However, the PM I/O port base address is not available at the first vm_start() call. So, configuring the in-kernel PM Timer needs to be postponed until the PIIX4 PCI configuration is initialized. This is facilitated by the new function kvm_pmtmr_handler() which is called by vm_start() and by pm_io_space_update(). kvm_pmtmr_handler() calls architecture-specific code thru a function pointer 'kvm_arch_pmtmr_handler'. kvm_pmtmr_handler() is a 'no-op' if an architecture does not provide or clears this function pointer. The architecture-specific code is responsible for configuring the in-kernel PM Timer. The experimental code provides kvm_arch_configure_pmtmr_wrapper() in qemu-kvm-x86.c. kvm_arch_create_pmtmr() sets 'kvm_arch_pmtmr_handler' to 'kvm_arch_configure_pmtmr_wrapper' after successful completion of the KVM_CREATE_PMTMR ioctl command. kvm_arch_configure_pmtmr_wrapper() requires ACPI PM code to provide a function pointer 'kvm_arch_get_pm_io_base' thru which the PM I/O port base address can be obtained. kvm_arch_configure_pmtmr_wrapper() is a 'no-op' too if ACPI PM code does not provide or clears this function pointer. The experimental code provides piix4_get_pm_io_base() in hw/acpi_piix4.c. pm_io_space_update() sets 'kvm_arch_get_pm_io_base' to 'piix4_get_pm_io_base'. Consider two scenarios ... - during virtual machine creation and startup kvm_arch_create kvm_arch_create_pmtmr ioctl(KVM_CREATE_PMTMR) kvm_arch_pmtmr_handler = kvm_arch_configure_pmtmr_wrapper : vm_start kvm_pmtmr_handler kvm_arch_configure_pmtmr_wrapper 'no-op' because kvm_arch_get_pm_io_base not set yet : pm_io_space_update kvm_arch_get_pm_io_base = piix4_get_pm_io_base kvm_pmtmr_handler kvm_arch_configure_pmtmr_wrapper obtain PM I/O port base thru kvm_arch_get_pm_io_base kvm_arch_configure_pmtmr ioctl(KVM_CONFIGURE_PMTMR) - any other vm_start() call, for example after migration vm_start kvm_pmtmr_handler kvm_arch_configure_pmtmr_wrapper obtain PM I/O port base thru kvm_arch_get_pm_io_base kvm_arch_configure_pmtmr ioctl(KVM_CONFIGURE_PMTMR) diff -up ./hw/acpi_piix4.c.orig3 ./hw/acpi_piix4.c --- ./hw/acpi_piix4.c.orig3 2010-12-02 15:15:20.0 +0100 +++ ./hw/acpi_piix4.c 2010-12-10 11:26:53.943753235 +0100 @@ -23,6 +23,7 @@ #include "acpi.h" #include "sysemu.h" #include "range.h" +#include "qemu-kvm.h" //#define DEBUG @@ -80,6 +81,9 @@ typedef struct PIIX4PMState { static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s); +/* for cpu hotadd (and in-kernel PM Timer if KVM_CAP_PMTMR is defined) */ +static PIIX4PMState *global_piix4_pm_state; + #define ACPI_ENABLE 0xf1 #define ACPI_DISABLE 0xf0 @@ -250,6 +254,19 @@ static void acpi_dbg_writel(void *opaque PIIX4_DPRINTF("ACPI: DBG: 0x%08x\n", val); } +#ifdef KVM_CAP_PMTMR +static uint64_t piix4_get_pm_io_base(void) +{ +PIIX4PMState *s = global_piix4_pm_state; +uint32_t pm_io_base; + +pm_io_base = le32_to_cpu(*(uint32_t *)(s->dev.config + 0x40)); +pm_io_base &= 0xffc0; + +return (uint64_t)pm_io_base; +} +#endif + static void pm_io_space_update(PIIX4PMState *s) { uint32_t pm_io_base; @@ -262,6 +279,16 @@ static void pm_io_space_update(PIIX4PMSt PIIX4_DPRINTF("PM: mapping to 0x%x\n", pm_io_base); iorange_init(&s->ioport, &pm_iorange_ops, pm_io_base, 64); ioport_register(&s->ioport); +#ifdef KVM_CAP_PMTMR +kvm_arch_get_pm_io_base = piix4_get_pm_io_base; +/* + * The base address of the PM I/O port address range is now known. + * The following call is needed to pass the base address to the + * in-kernel PM Timer emulation. Note that 'kvm_arch_get_pm_io_base' + * must be set _before_ this call. + */ +kvm_pmtmr_handler(); +#endif } } @@ -354,14 +381,12 @@ static void piix4_powerdown(void *opaque } } -static PIIX4PMState *global_piix4_pm_state; /* cpu hotadd */ - static int piix4_pm_initfn(PCIDevice *dev) { PIIX4PMState *s = DO_UPCAST(PIIX4PMState, dev, dev); uint8_t *pci_conf; -/* for cpu hotadd */ +/* for cpu hotadd and in-kernel PM Timer */ global_piix4_pm_state = s; pci_conf = s->dev.config; diff -up ./kvm/include/linux/kvm.h.orig3 ./kvm/include/linux/kvm.h --- ./kvm/include/
[RFC 2/4] KVM in-kernel PM Timer implementation (experimental code part 2)
experimental code part 2 (KVM kernel) - This code introduces two new ioctl commands KVM_CREATE_PMTMR and KVM_CONFIGURE_PMTMR plus the new capability KVM_CAP_PMTMR to the ioctl infrastructure of the KVM kernel. This code utilizes some helper functions introduced by code part 1. diff -up ./arch/x86/include/asm/kvm.h.orig2 ./arch/x86/include/asm/kvm.h --- ./arch/x86/include/asm/kvm.h.orig2 2010-12-05 09:35:17.0 +0100 +++ ./arch/x86/include/asm/kvm.h2010-12-10 12:32:47.067686432 +0100 @@ -24,6 +24,7 @@ #define __KVM_HAVE_DEBUGREGS #define __KVM_HAVE_XSAVE #define __KVM_HAVE_XCRS +#define __KVM_HAVE_PMTMR /* Architectural interrupt line count. */ #define KVM_NR_INTERRUPTS 256 diff -up ./arch/x86/kvm/x86.c.orig2 ./arch/x86/kvm/x86.c --- ./arch/x86/kvm/x86.c.orig2 2010-12-05 09:35:17.0 +0100 +++ ./arch/x86/kvm/x86.c2010-12-10 12:24:58.083739549 +0100 @@ -26,6 +26,9 @@ #include "tss.h" #include "kvm_cache_regs.h" #include "x86.h" +#ifdef KVM_CAP_PMTMR +#include "pmtmr.h" +#endif #include #include @@ -1965,6 +1968,9 @@ int kvm_dev_ioctl_check_extension(long e case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: case KVM_CAP_ASYNC_PF: +#ifdef KVM_CAP_PMTMR + case KVM_CAP_PMTMR: +#endif r = 1; break; case KVM_CAP_COALESCED_MMIO: @@ -3274,6 +3280,7 @@ long kvm_arch_vm_ioctl(struct file *filp struct kvm_pit_state ps; struct kvm_pit_state2 ps2; struct kvm_pit_config pit_config; + struct kvm_pmtmr_config pmtmr_config; } u; switch (ioctl) { @@ -3541,6 +3548,23 @@ long kvm_arch_vm_ioctl(struct file *filp r = 0; break; } +#ifdef KVM_CAP_PMTMR + case KVM_CREATE_PMTMR: { + mutex_lock(&kvm->slots_lock); + r = kvm_create_pmtmr(kvm); + mutex_unlock(&kvm->slots_lock); + break; + } + case KVM_CONFIGURE_PMTMR: { + r = -EFAULT; + if (copy_from_user(&u.pmtmr_config, argp, + sizeof(struct kvm_pmtmr_config))) + goto out; + + r = kvm_configure_pmtmr(kvm, &u.pmtmr_config); + break; + } +#endif default: ; diff -up ./include/linux/kvm.h.orig2 ./include/linux/kvm.h --- ./include/linux/kvm.h.orig2 2010-12-05 09:35:17.0 +0100 +++ ./include/linux/kvm.h 2010-12-10 12:30:13.677745093 +0100 @@ -140,6 +140,12 @@ struct kvm_pit_config { __u32 pad[15]; }; +/* for KVM_CONFIGURE_PMTMR */ +struct kvm_pmtmr_config { + __u64 pm_io_base; + __s64 clock_offset; +}; + #define KVM_PIT_SPEAKER_DUMMY 1 #define KVM_EXIT_UNKNOWN 0 @@ -541,6 +547,9 @@ struct kvm_ppc_pvinfo { #define KVM_CAP_PPC_GET_PVINFO 57 #define KVM_CAP_PPC_IRQ_LEVEL 58 #define KVM_CAP_ASYNC_PF 59 +#ifdef __KVM_HAVE_PMTMR +#define KVM_CAP_PMTMR 60 +#endif #ifdef KVM_CAP_IRQ_ROUTING @@ -672,6 +681,8 @@ struct kvm_clock_data { #define KVM_XEN_HVM_CONFIG_IOW(KVMIO, 0x7a, struct kvm_xen_hvm_config) #define KVM_SET_CLOCK _IOW(KVMIO, 0x7b, struct kvm_clock_data) #define KVM_GET_CLOCK _IOR(KVMIO, 0x7c, struct kvm_clock_data) +#define KVM_CREATE_PMTMR _IO(KVMIO, 0x7d) +#define KVM_CONFIGURE_PMTMR _IOW(KVMIO, 0x7e, struct kvm_pmtmr_config) /* Available with KVM_CAP_PIT_STATE2 */ #define KVM_GET_PIT2 _IOR(KVMIO, 0x9f, struct kvm_pit_state2) #define KVM_SET_PIT2 _IOW(KVMIO, 0xa0, struct kvm_pit_state2) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 1/4] KVM in-kernel PM Timer implementation (experimental code part 1)
experimental code part 1 (KVM kernel) - This code introduces the actual emulation of the PM Timer register plus some helper functions to create and configure the in-kernel PM Timer. The emulation utilizes the 'kvm_io_bus' infrastructure. diff -up ./arch/x86/include/asm/kvm_host.h.orig1 ./arch/x86/include/asm/kvm_host.h --- ./arch/x86/include/asm/kvm_host.h.orig1 2010-12-05 09:35:17.0 +0100 +++ ./arch/x86/include/asm/kvm_host.h 2010-12-10 12:14:29.282686691 +0100 @@ -459,6 +459,10 @@ struct kvm_arch { /* fields used by HYPER-V emulation */ u64 hv_guest_os_id; u64 hv_hypercall; + +#ifdef KVM_CAP_PMTMR + struct kvm_pmtmr *vpmtmr; +#endif }; struct kvm_vm_stat { diff -up ./arch/x86/kvm/i8254.c.orig1 ./arch/x86/kvm/i8254.c --- ./arch/x86/kvm/i8254.c.orig12010-12-05 09:35:17.0 +0100 +++ ./arch/x86/kvm/i8254.c 2010-12-10 12:09:36.877729064 +0100 @@ -51,7 +51,7 @@ #define RW_STATE_WORD1 4 /* Compute with 96 bit intermediate result: (a*b)/c */ -static u64 muldiv64(u64 a, u32 b, u32 c) +u64 muldiv64(u64 a, u32 b, u32 c) { union { u64 ll; diff -up ./arch/x86/kvm/Makefile.orig1 ./arch/x86/kvm/Makefile --- ./arch/x86/kvm/Makefile.orig1 2010-12-05 09:35:17.0 +0100 +++ ./arch/x86/kvm/Makefile 2010-12-10 12:07:14.379811121 +0100 @@ -12,7 +12,7 @@ kvm-$(CONFIG_IOMMU_API) += $(addprefix . kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \ - i8254.o timer.o + i8254.o timer.o pmtmr.o kvm-intel-y+= vmx.o kvm-amd-y += svm.o diff -up ./arch/x86/kvm/pmtmr.c.orig1 ./arch/x86/kvm/pmtmr.c --- ./arch/x86/kvm/pmtmr.c.orig12010-12-10 12:05:39.878691941 +0100 +++ ./arch/x86/kvm/pmtmr.c 2010-12-10 12:06:00.987738524 +0100 @@ -0,0 +1,151 @@ +/* + * in-kernel ACPI PM Timer emulation + * + * Note: 'timer carry interrupt' is not implemented + */ + +#include + +#ifdef KVM_CAP_PMTMR + +#include "pmtmr.h" + +static int emulate_acpi_reg_pmtmr(struct kvm_pmtmr *pmtmr, void *data, int len) +{ + s64 tmp; + u32 running_count; + + if (len != 4) + return -EOPNOTSUPP; + + tmp = ktime_to_ns(ktime_get()) + pmtmr->clock_offset; + running_count = (u32)muldiv64(tmp, KVM_ACPI_PMTMR_FREQ, NSEC_PER_SEC); + *(u32 *)data = running_count & KVM_ACPI_PMTMR_MASK; + +#ifdef KVM_ACPI_PMTMR_STATS + pmtmr->read_count++; +#endif + return 0; +} + +/* + * This function returns true for I/O ports in the range from 'PM base' + * to 'PM Timer' (this range contains the PM1 Status and the PM1 Enable + * registers). + */ +static inline int pmtmr_in_range(struct kvm_pmtmr *pmtmr, gpa_t ioport) +{ + return ((ioport >= pmtmr->pm_io_base) && + (ioport <= pmtmr->pm_io_base + KVM_ACPI_REG_PMTMR)); +} + +static inline struct kvm_pmtmr *dev_to_pmtmr(struct kvm_io_device *dev) +{ +return container_of(dev, struct kvm_pmtmr, dev); +} + +static int pmtmr_ioport_read(struct kvm_io_device *this, +gpa_t ioport, int len, void *data) +{ + struct kvm_pmtmr *pmtmr = dev_to_pmtmr(this); + + if (!pmtmr_in_range(pmtmr, ioport)) + return -EOPNOTSUPP; + + switch (ioport - pmtmr->pm_io_base) { + case KVM_ACPI_REG_PMTMR: + /* emulate PM Timer read if in-kernel emulation is enabled */ + if (pmtmr->state == KVM_PMTMR_STATE_ENABLED) + return(emulate_acpi_reg_pmtmr(pmtmr, data, len)); + + /* fall thru */ + default: + /* let qemu userspace handle everything else */ + return -EOPNOTSUPP; + } +} + +static int pmtmr_ioport_write(struct kvm_io_device *this, + gpa_t ioport, int len, const void *data) +{ + struct kvm_pmtmr *pmtmr = dev_to_pmtmr(this); + + if (!pmtmr_in_range(pmtmr, ioport)) + return -EOPNOTSUPP; + + switch (ioport - pmtmr->pm_io_base) { + case KVM_ACPI_REG_PMTMR: + /* ignore PM Timer write */ + return 0; + case KVM_ACPI_REG_PMEN: + if (len == 2) { + u16 val = *(u16 *)data; + /* +* Fall back to qemu userspace PM Timer emulation if +* the VM sets the 'timer carry interrupt enable' bit +* in the PM1 Enable register. +*/ + if (val & KVM_ACPI_PMTMR_TMR_EN) + /* disable in-kernel PM Timer emulation */ + pmtmr->state = KVM_PMTMR_STATE_DISABLED; + } + /* fall thru */ + default: + /* let qemu userspace handle everything
[RFC 0/4] KVM in-kernel PM Timer implementation
Hi, This is an RFC through which I would like to get feedback on how the idea of in-kernel PM Timer would be received. The current implementation of PM Timer emulation is 'heavy-weight' because the code resides in qemu userspace. Guest operating systems that use PM Timer as a clock source (for example, older versions of Linux that do not have paravirtualized clock) would benefit from an in-kernel PM Timer emulation. Parts 1 thru 4 of this RFC contain experimental source code which I recently used to investigate the performance benefit. In a Linux guest, I was running a program that calls gettimeofday() 'n' times in a loop (the PM Timer register is read during each call). With in-kernel PM Timer, I observed a significant reduction of program execution time. The experimental code emulates the PM Timer register in KVM kernel. All other components of ACPI PM remain in qemu userspace. Also, the 'timer carry interrupt' feature is not implemented in-kernel. If a guest operating system needs to enable the 'timer carry interrupt', the code takes care that PM Timer emulation falls back to userspace. However, I think the design of the code has sufficient flexibility, so that anyone who would want to add the 'timer carry interrupt' feature in-kernel could try to do so later on. Please review and please comment. Regards, Uli Obergfell -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html