On Tue, Nov 25, 2008 at 01:52:59PM +0100, Andi Kleen wrote:
> > But yeah - the remapping of HPET timers to virtual HPET timers sounds  
> > pretty tough. I wonder if one could overcome that with a little  
> > hardware support though ...
> 
> For gettimeofday better make TSC work. Even in the best case (no 
> virtualization) it is much faster than HPET because it sits in the CPU,
> while HPET is far away on the external south bridge.

The tsc clock on older Linux 2.6 kernels compensates for lost ticks.
The algorithm uses the latched PIT count to measure the delay between
interrupt generation and handling, and adds that value to the TSC delta
on the next interrupt.
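
A rough sketch of that compensation, as a user-space model rather than
the actual kernel code. The names (tick_state, account_irq), HZ=1000 and
the assumption that the TSC delta has already been converted to
microseconds are all illustrative:

```c
#include <stdint.h>

#define PIT_HZ        1193182u                 /* i8254 input clock */
#define HZ            1000u                    /* assumed guest tick rate */
#define LATCH         ((PIT_HZ + HZ / 2) / HZ) /* PIT reload value per tick */
#define USEC_PER_TICK (1000000u / HZ)

/* Hypothetical per-clock state: delay measured at the previous interrupt. */
struct tick_state {
	uint32_t delay_at_last_irq_usec;
};

/*
 * The PIT counts down from LATCH - 1. The distance already counted is
 * taken as the time between interrupt generation and the handler running.
 */
static uint32_t pit_count_to_delay_usec(uint32_t count)
{
	if (count >= LATCH)
		count = LATCH - 1;
	uint32_t elapsed = (LATCH - 1) - count;
	return (elapsed * USEC_PER_TICK + LATCH / 2) / LATCH;
}

/*
 * Called once per timer interrupt. Adds the delay from the previous
 * interrupt to the TSC delta and returns the number of extra (lost)
 * ticks to account, beyond the one for this interrupt.
 */
static uint32_t account_irq(struct tick_state *st, uint32_t tsc_delta_usec,
			    uint32_t latched_pit_count)
{
	uint32_t delta = tsc_delta_usec + st->delay_at_last_irq_usec;
	uint32_t ticks = delta / USEC_PER_TICK;

	st->delay_at_last_irq_usec = pit_count_to_delay_usec(latched_pit_count);
	return ticks >= 2 ? ticks - 1 : 0;
}
```

In this model a wrong delay value carries over into the next interrupt's
delta, so a bad PIT count can push the sum over a tick boundary.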

Sheng investigated this problem in the discussions before in-kernel PIT
was merged:

http://www.mail-archive.com/kvm-de...@lists.sourceforge.net/msg13873.html

The algorithm overcompensates for lost ticks, so the guest time runs
faster than the host's.

There are two issues:

1) A bug in the in-kernel PIT which miscalculates the count value.

2) For the case where more than one interrupt is lost, and later
reinjected, the value read from PIT count is meaningless for the purpose
of the tsc algorithm. The count is interpreted as the delay until the
next interrupt, which is not the case with reinjection.
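
A toy model of how reinjection and guest-side compensation can interact.
It deliberately leaves out the PIT count term and only shows the broader
effect: the guest infers lost ticks from elapsed TSC time, and the host
then delivers those same ticks again. The function names and the 1 ms
tick are assumptions for illustration, not kernel code:

```c
#include <stdint.h>

#define TICK_USEC 1000u /* assumed guest tick period (HZ=1000) */

/*
 * Ticks the guest accounts for one timer interrupt: one for the
 * interrupt itself, plus any it infers were lost from the elapsed time.
 */
static uint32_t guest_ticks_for_irq(uint32_t usec_since_last_irq)
{
	uint32_t elapsed_ticks = usec_since_last_irq / TICK_USEC;
	uint32_t lost = elapsed_ticks >= 2 ? elapsed_ticks - 1 : 0;

	return 1 + lost;
}

/*
 * The guest is stalled for stalled_ticks periods. The first interrupt
 * after the stall sees the whole gap. With reinjection the remaining
 * interrupts then arrive back-to-back; without it they are dropped.
 */
static uint32_t guest_ticks_after_stall(uint32_t stalled_ticks, int reinject)
{
	uint32_t total = guest_ticks_for_irq(stalled_ticks * TICK_USEC);

	if (reinject) {
		for (uint32_t i = 1; i < stalled_ticks; i++)
			total += guest_ticks_for_irq(0);
	}
	return total;
}
```

For a stall of 3 ticks this model accounts 5 ticks with reinjection and
3 without, so only the non-reinjecting case matches the real elapsed
time.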

As Sheng mentioned in the thread above, Xen pulls back the TSC value
when reinjecting interrupts. VMware ESX has a notion of "virtual TSC",
which I believe is similar in this context.

For KVM I believe the best immediate solution is to provide an option
to disable reinjection, behaving similarly to real hardware. The
advantage is simplicity compared to virtualizing the time sources.

The QEMU PIT emulation has a limit on the rate of interrupt reinjection;
perhaps something similar should be investigated in the future.

The following patch (which contains the bugfix for 1) and disables
reinjection) fixes the severe time drift on RHEL4 with "clock=tsc".
What I'm proposing is to make reinjection conditional on an option
(-kvm-pit-no-reinject or something).
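
A minimal sketch of what such a switch could look like, written as a
user-space model with C11 atomics rather than the kernel's atomic_t. The
struct, the reinject flag and the function names are hypothetical and
are not part of the patch below:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the PIT's pending-interrupt accounting. */
struct pit_model {
	atomic_int pending; /* interrupts waiting to be delivered */
	bool reinject;      /* false would correspond to the proposed option */
};

/* Timer expiry: queue one more interrupt, or coalesce into a single one. */
static void pit_model_tick(struct pit_model *pit)
{
	if (pit->reinject)
		atomic_fetch_add(&pit->pending, 1);
	else
		atomic_store(&pit->pending, 1);
}

/* Delivery: consume one pending interrupt if there is any. */
static bool pit_model_deliver(struct pit_model *pit)
{
	int cur = atomic_load(&pit->pending);

	while (cur > 0) {
		/* On failure cur is reloaded and the loop retries. */
		if (atomic_compare_exchange_weak(&pit->pending, &cur, cur - 1))
			return true;
	}
	return false;
}
```

With reinject set, three expiries while the guest is not running lead to
three deliveries; with it cleared, they collapse into one, which is what
the clamp in the patch does unconditionally.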

Comments or better ideas?


diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index e665d1c..608af7b 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -201,13 +201,16 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
        if (!atomic_inc_and_test(&pt->pending))
                set_bit(KVM_REQ_PENDING_TIMER, &vcpu0->requests);
 
+       if (atomic_read(&pt->pending) > 1)
+               atomic_set(&pt->pending, 1);
+
        if (vcpu0 && waitqueue_active(&vcpu0->wq))
                wake_up_interruptible(&vcpu0->wq);
 
        hrtimer_add_expires_ns(&pt->timer, pt->period);
        pt->scheduled = hrtimer_get_expires_ns(&pt->timer);
        if (pt->period)
-               ps->channels[0].count_load_time = hrtimer_get_expires(&pt->timer);
+               ps->channels[0].count_load_time = ktime_get();
 
        return (pt->period == 0 ? 0 : 1);
 }