2010/5/30 Gleb Natapov <g...@redhat.com>:
> On Sat, May 29, 2010 at 08:52:34PM +0000, Blue Swirl wrote:
>> On Sat, May 29, 2010 at 4:32 PM, Gleb Natapov <g...@redhat.com> wrote:
>> > On Sat, May 29, 2010 at 04:03:22PM +0000, Blue Swirl wrote:
>> >> 2010/5/29 Gleb Natapov <g...@redhat.com>:
>> >> > On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
>> >> >> >> There is no code, because we're still at the architecture
>> >> >> >> design stage.
>> >> >> >>
>> >> >> > Try to write test code to understand the problem better.
>> >> >>
>> >> >> I will.
>> >> >>
>> >> > Please do ASAP. This discussion shows that you don't understand
>> >> > the problem that we are dealing with.
>> >>
>> >> Which part of the problem do you think I don't understand?
>> >>
>> > It seems to me you don't understand how Windows uses the RTC for
>> > time keeping and how QEMU solves the problem today.
>>
>> The RTC causes periodic interrupts and the Windows interrupt handler
>> increments jiffies, like Linux?
>>
> Linux does much more complicated things than that to keep time, so the
> only way to fix time drift in Linux was to introduce pvclock. For
> Windows it is not so simple either: since Windows can change the clock
> frequency at any time, it can't calculate time from jiffies; it needs
> to update the clock on each timer tick.
>
>> >> >> >> >> >> guests could also be assisted with special handling
>> >> >> >> >> >> (like the win2k install hack); for example, guest
>> >> >> >> >> >> instructions could be counted (approximately, for
>> >> >> >> >> >> example using TB size or TSC) and interrupts only
>> >> >> >> >> >> injected after at least N instructions have passed.
>> >> >> >> >> > Guest instructions cannot be easily counted in KVM (it
>> >> >> >> >> > can be done more or less reliably using perf counters,
>> >> >> >> >> > maybe).
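[Editor's note: the "inject only after at least N guest instructions" idea above is just a proposal in this thread, never implemented here. A minimal sketch of how such gating could work; the class and method names are invented for illustration, and the instruction counter is assumed to come from something like icount, TB size, or the TSC, as suggested in the quoted text:]

```python
class InstructionGatedInjector:
    """Toy model of instruction-gated IRQ injection (hypothetical).

    Ticks that come due are queued rather than injected immediately;
    one is injected only once the guest has executed at least n_min
    instructions since the previous injection.
    """

    def __init__(self, n_min):
        self.n_min = n_min          # minimum instructions between IRQs
        self.last_inject = 0        # instruction count at last injection
        self.pending = 0            # ticks that came due but not injected

    def tick(self):
        # A timer tick came due; queue it instead of injecting now.
        self.pending += 1

    def maybe_inject(self, icount):
        # Inject one queued tick only if the guest made enough progress.
        if self.pending and icount - self.last_inject >= self.n_min:
            self.pending -= 1
            self.last_inject = icount
            return True
        return False
```

For example, with `n_min=1000`, a tick queued at instruction count 500 is held back and only delivered once the counter reaches 1000 or more.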
>> >> >> >> >>
>> >> >> >> >> Aren't there any debug registers or perf counters which
>> >> >> >> >> can generate an interrupt after some number of
>> >> >> >> >> instructions have been executed?
>> >> >> >> > I don't think debug registers have something like that,
>> >> >> >> > and they are available for guest use anyway. Perf counters
>> >> >> >> > differ greatly from CPU to CPU (even between two CPUs of
>> >> >> >> > the same manufacturer), and we want to keep using them for
>> >> >> >> > profiling guests. And I don't see what problem it would
>> >> >> >> > solve anyway that can't be solved by a simple delay
>> >> >> >> > between IRQ reinjections.
>> >> >> >>
>> >> >> >> This would allow counting the executed instructions and
>> >> >> >> limiting them. Thus we could emulate a 500MHz CPU on a 2GHz
>> >> >> >> CPU more accurately.
>> >> >> >>
>> >> >> > Why would you want to limit the number of instructions
>> >> >> > executed by the guest if the CPU has nothing else to do
>> >> >> > anyway? The problem occurs not when we have spare cycles to
>> >> >> > give to a guest, but in the opposite case.
>> >> >>
>> >> >> I think one problem is that the guest has executed too much
>> >> >> compared to what would happen on real HW with a lesser CPU.
>> >> >> That explains the RTC frequency reprogramming case.
>> >> > You think wrong. The problem is exactly the opposite: the guest
>> >> > hasn't had enough execution time between two timer interrupts. I
>> >> > don't know what RTC frequency reprogramming case you are talking
>> >> > about here.
>> >>
>> >> The case you told me about, where N pending tick IRQs exist but
>> >> the guest wants to change the RTC frequency from 64Hz to 1024Hz.
>> >>
>> >> Let's make this more concrete: a 1 GHz CPU, initially a 100Hz RTC,
>> >> so 10Mcycles/tick or 10ms/tick. At T = 30Mcycles, the guest wants
>> >> to change the frequency to 1000Hz.
>> >>
>> >> The problem for emulation is that for the same 3 ticks, there has
>> >> been so little execution power that the ticks have been coalesced.
>> >> But isn't the guest cycle count then much lower than 30Mcyc?
>> >>
>> >> Isn't it so that the guest must be above 30Mcyc to be able to want
>> >> the change? But if we reach that point, the problem must not have
>> >> been too little execution time, but too much.
>> >>
>> > Sorry, I tried hard to understand what you said above but failed.
>> > What do you mean by "to be able to want the change"? The guest
>> > sometimes wants to get 64 timer interrupts per second and sometimes
>> > it wants to get 1024 timer interrupts per second. It wants this not
>> > as a result of time drift or anything; it's just how the guest
>> > behaves. You seem to be too fixated on the guest frequency change.
>> > It's just something you have to take into account when you reinject
>> > interrupts.
>>
>> I meant that in the scenario, the guest won't change the RTC before
>> 30Mcyc because of some built-in determinism in the guest. At that
>> point, for some reason, the change would happen.
>>
> I still don't understand what you are trying to say here. The guest
> changes frequency because of some event in the guest. It is totally
> independent of what happens in QEMU's RTC emulation.
I'm trying to understand the order of events. In the scenario, the
order of events on real HW would be:

10Mcyc: tick IRQ 1
20Mcyc: tick IRQ 2
30Mcyc: tick IRQ 3
30Mcyc: reprogram timer
31Mcyc: tick IRQ 4
32Mcyc: tick IRQ 5
33Mcyc: tick IRQ 6
34Mcyc: tick IRQ 7

With QEMU, the order could become:

30Mcyc: reprogram timer
30.5Mcyc: tick IRQ 1
31Mcyc: tick IRQ 2
31.5Mcyc: tick IRQ 3
32Mcyc: tick IRQ 4
32.5Mcyc: tick IRQ 5
33Mcyc: tick IRQ 6
34Mcyc: tick IRQ 7

Correct?

>> >> >> >> >> >> > And even if the rate did not matter, the APIC would
>> >> >> >> >> >> > still have to know about the fact that an IRQ is
>> >> >> >> >> >> > really periodic and does not only appear as such for
>> >> >> >> >> >> > a certain interval. This really does not sound like
>> >> >> >> >> >> > simplifying things or even making them cleaner.
>> >> >> >> >> >>
>> >> >> >> >> >> It would: the voodoo would be contained only in the
>> >> >> >> >> >> APIC, and the RTC would be just like any other device.
>> >> >> >> >> >> With the bidirectional IRQs, this voodoo would
>> >> >> >> >> >> probably eventually spread to many other devices. The
>> >> >> >> >> >> logical conclusion of that would be a system where all
>> >> >> >> >> >> devices would be careful not to disturb the guest at
>> >> >> >> >> >> the wrong moment because that would trigger a bug.
>> >> >> >> >> >>
>> >> >> >> >> > This voodoo will be so complex and unreliable that it
>> >> >> >> >> > will make the RTC hack pale in comparison (and I still
>> >> >> >> >> > don't see how you are going to make it actually work).
>> >> >> >> >>
>> >> >> >> >> Implement everything inside the APIC: only coalescing
>> >> >> >> >> and reinjection.
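[Editor's note: the two orderings above can be reproduced with a small toy model of tick delivery. This is illustrative logic, not QEMU code; the 0.5 Mcyc minimum reinjection gap is an assumed value chosen to match the example timings:]

```python
def deliver(scheduled, guest_ready_at, min_gap):
    """Deliver the scheduled tick times (in Mcyc).

    Ticks that come due while the guest is stalled (before
    guest_ready_at) are coalesced and reinjected min_gap apart once
    the guest runs again.
    """
    delivered = []
    prev = guest_ready_at
    for due in scheduled:
        # Never deliver earlier than scheduled, and never closer than
        # min_gap after the previous delivery.
        t = max(due, prev + min_gap)
        delivered.append(t)
        prev = t
    return delivered

# Scheduled ticks: 100 Hz until the reprogram at T=30, then 1000 Hz.
ticks = [10, 20, 30, 31, 32, 33, 34]

real_hw = deliver(ticks, guest_ready_at=0, min_gap=0.5)
# Every tick lands on schedule: 10, 20, 30, 31, 32, 33, 34.

qemu = deliver(ticks, guest_ready_at=30, min_gap=0.5)
# The first three ticks are reinjected back-to-back at 30.5, 31, 31.5;
# the backlog then drains and the last tick lands on schedule at 34.
```

With `guest_ready_at=0` the model reproduces the real-HW ordering; with `guest_ready_at=30` it reproduces the QEMU ordering quoted above.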
>> >> >> >> > The APIC has zero info needed to implement reinjection
>> >> >> >> > correctly, as was shown to you several times in this
>> >> >> >> > thread, and you simply keep ignoring it.
>> >> >> >>
>> >> >> >> On the contrary, the APIC is actually the only source of the
>> >> >> >> IRQ ack information. The RTC hack would not work without the
>> >> >> >> APIC (or the bidirectional IRQ) passing this info to the RTC.
>> >> >> >>
>> >> >> >> What the APIC doesn't have now is the timer frequency or
>> >> >> >> period info. This is known by the RTC and also by the higher
>> >> >> >> levels managing the clocks.
>> >> >> >>
>> >> >> > So the APIC has one bit of information and the RTC everything
>> >> >> > else.
>> >> >>
>> >> >> The information known by the RTC (the timer period) is also
>> >> >> known by higher levels.
>> >> >>
>> >> > What do you mean by higher levels here? vl.c or the APIC?
>> >>
>> >> vl.c, qemu-timer.c.
>> >>
>> >> >> > The current approach (and the proposed patch) brings this one
>> >> >> > bit of information to the RTC; you are arguing that the RTC
>> >> >> > should be able to communicate all its info to the APIC.
>> >> >> > Sorry, I don't see that your way has any advantage: just a
>> >> >> > more complex interface, and it is much easier to get it wrong
>> >> >> > for other time sources.
>> >> >>
>> >> >> I don't think anymore that the APIC should be handling this,
>> >> >> but the generic stuff, like vl.c or exec.c. Then there would be
>> >> >> only information passing from the APIC to higher levels.
>> >> > Handling reinjection in the general timer code makes infinitely
>> >> > more sense than handling it in the APIC.
>> >>
>> >> I'm glad you agree, or did you mean 'less'?
>> >>
>> > Compared to the APIC, I would agree that even putting it in IDE is
>> > a better idea :)
>> >
>> >> > One thing (off the top of my head) that can't be implemented at
>> >> > that level is injection of interrupts back to back (i.e.
>> >> > injecting the next interrupt immediately after the guest
>> >> > acknowledges the previous one to the RTC).
>> >>
>> >> But Jan said this confuses some buggy OSes.
>> >>
>> > You keep calling them buggy, but I don't agree. They are written
>> > with certain assumptions that are true on real HW, but hard to
>> > achieve on virtual HW. Anyway, we use this technique (back-to-back
>> > reinjection); otherwise you can't solve the drift problem if the
>> > guest wants to receive timer interrupts at the maximum frequency
>> > that the host time source supports.
>>
>> Even if this confuses some OSes?
>>
> We make it so that it will not confuse the relevant OSes. It is not
> like you have a better choice.
>
>> >> >> >> I keep ignoring the idea that the current model, where both
>> >> >> >> the RTC and the APIC must somehow work together to make
>> >> >> >> coalescing work, is the only one possible just because it is
>> >> >> >> committed and happens to work in some cases. It would be
>> >> >> >> much better to concentrate this in one place, the APIC or
>> >> >> >> preferably a higher level where it may benefit other timers
>> >> >> >> too. Provided, of course, that the other models can be made
>> >> >> >> to work.
>> >> >> >>
>> >> >> > So write the code and show us. You haven't shown any evidence
>> >> >> > that the RTC is the wrong place. The RTC knows when an
>> >> >> > interrupt was acknowledged to the RTC, it knows when the
>> >> >> > clock frequency changes, it knows when a device reset
>> >> >> > happened. The APIC knows only that an interrupt was
>> >> >> > coalesced. It doesn't even know that the interrupt may be
>> >> >> > masked by the guest in the IOAPIC (interrupts delivered while
>> >> >> > they are masked are not considered coalesced).
>> >> >>
>> >> >> Oh, I thought interrupt masking was the reason for coalescing!
>> >> >> What exactly is the reason then?
>> >> >>
>> >> > The reason is that the guest has no time to process the previous
>> >> > interrupt before it is time to inject the next one.
>> >>
>> >> Because of other host load, or other emulation done by the same
>> >> QEMU process, I suppose?
>> > Yes, both.
>> >
>> >> >> > The time source knows only when the frequency changes, and
>> >> >> > maybe when a device reset happens if the timer is stopped by
>> >> >> > the device on reset. So the RTC is actually a sweet spot if
>> >> >> > you want to minimize the amount of info you need to pass
>> >> >> > between the various layers.
>> >> >> >
>> >> >> >> >> Maybe that version would not bend over backwards as much
>> >> >> >> >> as the current one to cater for buggy hosts.
>> >> >> >> >
>> >> >> >> > You mean "buggy guests"?
>> >> >> >>
>> >> >> >> Yes, sorry.
>> >> >> >>
>> >> >> >> > What guests are not buggy in your opinion? Linux tries
>> >> >> >> > hard to be smart, and as a result the only way to have a
>> >> >> >> > stable clock with it is to go paravirt.
>> >> >> >>
>> >> >> >> I'm not an OS designer, but I think an OS should never
>> >> >> >> crash, even if a burst of IRQs is received. Reprogramming
>> >> >> >> the timer should consider the pending IRQ situation (0 or 1
>> >> >> >> with real HW). Those bugs are one cause of the problem.
>> >> >> > An OS should never crash in the absence of HW bugs? I doubt
>> >> >> > you can design an OS that can run in the face of any HW
>> >> >> > failure. Anyway, here we are trying to solve the guest's
>> >> >> > time-keeping problem, not crashes. Do you think you can
>> >> >> > design an OS that can keep time accurately no matter how
>> >> >> > crazily all the HW clocks behave?
>> >> >>
>> >> >> I think my OS design skills are not relevant to this
>> >> >> discussion, but IIRC there are fault-tolerant operating systems
>> >> >> for extreme conditions, so it can be done.
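[Editor's note: the coalescing-and-reinjection scheme the thread keeps referring to — count the ticks the guest had no time to handle, then reinject one back-to-back each time the guest acks — can be sketched as a toy model. The names are illustrative, not QEMU's actual RTC/APIC code:]

```python
class ReinjectingTimer:
    """Toy model of tick coalescing with back-to-back reinjection."""

    def __init__(self):
        self.irq_pending = False   # an injected IRQ not yet acked
        self.coalesced = 0         # ticks lost while an IRQ was pending

    def tick(self):
        # A periodic tick fires. If the guest has not yet acked the
        # previous one, it had no time to handle it: count it as
        # coalesced instead of injecting.
        if self.irq_pending:
            self.coalesced += 1
        else:
            self.irq_pending = True

    def ack(self):
        # The guest acknowledged the interrupt. If there is a backlog,
        # immediately inject the next one (back-to-back reinjection).
        self.irq_pending = False
        if self.coalesced:
            self.coalesced -= 1
            self.irq_pending = True
```

In this model the guest eventually receives exactly as many interrupts as ticks fired, which is the property the drift fix is after; whether real guests tolerate the back-to-back delivery is exactly the point debated above.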
>> >> >> >> >> >
>> >> >> >> >> > The fact is that a timer device is not "just like any
>> >> >> >> >> > other device" in the virtual world. Any other device is
>> >> >> >> >> > easy: you just implement the spec as closely as
>> >> >> >> >> > possible and everything works. For a time source device
>> >> >> >> >> > this is not enough. You can implement RTC+HPET to the
>> >> >> >> >> > letter and your guest will drift like crazy.
>> >> >> >> >>
>> >> >> >> >> It's doable: a cycle-accurate emulator will not cause any
>> >> >> >> >> drift, without any voodoo. The interrupts would come
>> >> >> >> >> after executing the same instruction as on the real HW.
>> >> >> >> >> For emulating any sufficiently buggy guest in any
>> >> >> >> >> sufficiently desperate low-resource conditions, this may
>> >> >> >> >> be the only option that will always work.
>> >> >> >> >>
>> >> >> >> > Yes, but qemu and kvm are not cycle-accurate emulators and
>> >> >> >> > don't strive to be. On the contrary, KVM runs at native
>> >> >> >> > host CPU speed most of the time, so any emulation done
>> >> >> >> > between two instructions is theoretically noticeable to a
>> >> >> >> > guest. The TSC is passed directly through to the guest
>> >> >> >> > too, so keeping all time sources in perfect sync is also
>> >> >> >> > impossible.
>> >> >> >>
>> >> >> >> That is actually another cause of the problem. KVM gives the
>> >> >> >> guest the illusion that the VCPU speed is equal to the host
>> >> >> >> speed. When they don't match, especially in critical code,
>> >> >> >> there can be problems. It would be better to tell the guest
>> >> >> >> a lower speed, one which also can be guaranteed.
>> >> >> >>
>> >> >> > Not possible. It's that simple. You should take that into
>> >> >> > account at your architecture design stage.
>> >> >> > In the case of KVM, the real physical CPU executes guest
>> >> >> > instructions, and it does this as fast as it can. The only
>> >> >> > way we can hide that from a guest is by intercepting each
>> >> >> > access to the TSC, and at that point we can use bochs
>> >> >> > instead.
>> >> >>
>> >> >> Well, as Paul pointed out, there's also the icount option.
>> >> >>
>> >> > icount is not an option for KVM.
>> >>
>> >> I think the icount timer adjustment model might make sense for
>> >> this work too. We'd then just need some figure for executed CPU
>> >> instructions, TSC cycles, or even kernel scheduler time slice
>> >> information (how much time the process got).
>> >>
>> > And then? icount makes guest time flow dependent on the amount of
>> > emulated instructions. It relies on the fact that all time sources
>> > are synchronized for the guest during emulation (including the
>> > TSC). This is not true for virtualization.
>>
>> So for virtualization, is it OK then to keep time sources
>> unsynchronized?
> No, it is not. But this is the reality that can't be changed for now.
> Maybe when HW virtualization introduces TSC scaling, but even then we
> don't want to run 4 500MHz guests on a 2GHz host; we want to run 4
> 2GHz guests on a 2GHz host, so overcommit occurs and you can't
> guarantee that no coalescing will happen. And I hope you are aware of
> the fact that using icount introduces a ~10% performance penalty in
> QEMU.
>
> --
> Gleb.
>
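[Editor's note: on the TSC-scaling remark above — hardware TSC scaling, which did appear in later CPUs, conceptually just applies a fixed ratio to the host counter so the guest sees a slower TSC without each read being trapped. A minimal sketch of the arithmetic, illustrative only:]

```python
def scale_tsc(host_tsc, host_hz, guest_hz):
    """Present the guest a TSC that advances at guest_hz instead of
    host_hz, the way hardware TSC scaling applies a fixed ratio."""
    return host_tsc * guest_hz // host_hz

# One host second on a 2 GHz CPU appears to a 500 MHz guest as
# 500M cycles, so guest time flows at the advertised lower speed.
guest_view = scale_tsc(2_000_000_000, 2_000_000_000, 500_000_000)
```

Note this only fixes the apparent TSC rate; as Gleb points out, it does nothing about overcommit, so interrupt coalescing can still occur.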