On Sunday 11 January 2009 12:55:58 Marcelo Tosatti wrote:
> On Sat, Jan 10, 2009 at 07:21:13PM +0800, Sheng Yang wrote:
> > On Fri, Jan 09, 2009 at 01:57:30PM +0100, Alexander Graf wrote:
> > > Sheng Yang wrote:
> > > > On Friday 09 January 2009 00:36:06 Alexander Graf wrote:
> > > >> While booting Linux in VMware ESX, I encountered a strange effect
> > > >> in the in-kernel lapic implementation: time went backwards!
> > > >>
> > > >> ---
> > > >
> > > > Hi Alexander
> > > >
> > > > I'm a little suspect here:
> > > >> if (unlikely(ktime_to_ns(now) <=
> > > >> ktime_to_ns(apic->timer.last_update))) {
> > > >> /* Wrap around */
> > > >> passed = ktime_add(( {
> > > >> (ktime_t) {
> > > >> .tv64 = KTIME_MAX -
> > > >> (apic->timer.last_update).tv64}; }
> > > >> ), now);
> > > >> apic_debug("time elapsed\n");
> > > >> } else
> > > >> passed = ktime_sub(now, apic->timer.last_update);
> > > >
> > > > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and
> > > > get_time() is really ktime_get() which is almost impossible to wrap
> > > > around. If it's overflow, at least we need a warning. I think this
> > > > piece of code due to clock source change.
> > > >
> > > > So I doubt: due to some reason, now <= apic->timer.last_update, which
> > > > cause a big wrap around operation.
> > > >
> > > > And the most suspect:
> > > >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
> > > >> {
> > > >> struct kvm_lapic *apic = vcpu->arch.apic;
> > > >>
> > > >> if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> > > >> apic->timer.last_update = ktime_add_ns(
> > > >> apic->timer.last_update,
> > > >> apic->timer.period);
> > > >> }
> > > >
> > > > Not sure what's happening, have you tried this? (In fact, I am little
> > > > willing to replace all apic->timer.dev.base->get_time() with more
> > > > explicit ktime_get() as in pit.)
> > >
> > > Yes, this code is the culprit. Using that patch of yours now is always
> > > > last_update. I'd still like to see that while loop go away ;-).
>
> Sheng,
>
> Will separate the discussions on items:
>
>
> 1) On the current code, last_update is assigned by kvm_apic_timer_intr_post
> as the expiration time of the next interrupt, while apic_get_tmcct
> interprets it as "countdown start":
>
> passed = ktime_sub(now, apic->timer.last_update);
>
> This explains why your first patch against kvm_apic_timer_intr_post
> fixed the bug Alexander was experiencing.
>
> BTW, yes, the "overflow" handling is not necessary as you noted.
>
> 2) The current count calculation is not right (apart from any eventual
> bugs), or I don't understand it. It uses the interrupt injection
> time as "countdown start" (or rearm in Linux terminology), in
> kvm_apic_timer_intr_post.
>
> But the host hrtimer starts counting once rearmed (assuming periodic
> configuration). So the inaccuracy of APIC_TMCCT depends on this delay
> between hrtimer rearm and interrupt injection.
>
> What is this scheme supposed to achieve? It seems much simpler, and
> more accurate, to return scaled hrtimer_get_remaining on APIC_TMCCT
> emulation.
Marcelo,
Sorry to late reply...
After rechecked the code, got some thoughts. last_update can indicated rearm
time as following:
The original apic_timer_intr_post() got:
> if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> apic->timer.last_update = ktime_add_ns(
> apic->timer.last_update,
> apic->timer.period);
I think the purpose is let guest see a reasonable TMCCT. Which means:
1. Timer start to fire at start_apic_timer(). last_update=now().
2. Every one LAPIC timer interrupt was injected into the guest, so the time at
least elapsed timer.period(the first alarm set at "period" later), then
last_update is increased by period. Notice last_update's base is set before
timer fire, so it's not indicated the next one, but *the time this interrupt
should be injected*(time can be delayed)... So last_update = n*period + base.
3. If there is pending interrupt, last_update won't be updated, so we have
this:
> while (counter_passed > tmcct)
> counter_passed -= tmcct;
> tmcct -= counter_passed;
(notice tmcct is TMICT here.)
And the returned tmcct value indicated (counter_passed % tmict), a offset
regardless of pending interrupt numbers.
I think now the overflow seems OK, but I am not sure why last_update can be
bigger than ktime_get()? Maybe due to vcpu migration? Suspect some racy or
boundary condition existed...
And base on this, I don't think my quick fix is correct...
>
> 3) And then there's interrupt reinjection. Once interrupts accumulate,
> and we reinject, current count reads become completly bogus. This is
> the reason for time drift on RHEL4-style kernels with "clock=tsc". The
> algorithm calculates the interrupt handling delay by using the PIT
> count (equivalent to APIC_TMCCT). But PIT count emulation uses the
> hrtimer expiration time, which has nothing to do with the accumulated
> interrupts.
>
> So what I propose is to first switch lapic current count emulation to a
> straightforward scaled hrtimer_get_remaining equivalent.
>
> For the reinjection case, maintain an average of the delay between host
> interrupt and interrupt injection (this can be generic, so both PIT
> and LAPIC timer can use it). Return that scaled average on APIC_TMCCT
> emulation whenever pending > 2.
>
> What you think?
1. If we simply use ktime_get() to update last_update, the interval can't be
same between different last_update. I think we may got some problem here, but
not for sure. For example, three delayed interrupt was injected one after one,
last_update would show very little interval, and APIC_TMCCT may got trouble.
Maintain an average of the delay seems a little tricky here, and I am not sure
if it would help the problem... Average is average at most, not the real
affection at the time...
2. For the current mechanism, the interval is the same, so last_update always
equal to n*period + base. If there are more than one pending interrupts, TMCCT
should also can return the relative correct value.
So I am leaning toward to fix current problem, though I haven't find the root
cause yet...
> And as Avi noted, there should be a distinction between accumulated
> interrupts, we only want to reinject the ones caused by host load (and
> there are some interesting optimizations that can be done later, such
> as stopping hrtimer from ticking until guest acks one interrupt, and
> calculating the missed ones). But better start with the basic.
Agree...
--
regards
Yang, Sheng
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html