2010/5/30 Gleb Natapov <g...@redhat.com>:
> On Sat, May 29, 2010 at 08:52:34PM +0000, Blue Swirl wrote:
>> On Sat, May 29, 2010 at 4:32 PM, Gleb Natapov <g...@redhat.com> wrote:
>> > On Sat, May 29, 2010 at 04:03:22PM +0000, Blue Swirl wrote:
>> >> 2010/5/29 Gleb Natapov <g...@redhat.com>:
>> >> > On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
>> >> >> >> There is no code, because we're still at the architecture
>> >> >> >> design stage.
>> >> >> >>
>> >> >> > Try to write test code to understand the problem better.
>> >> >>
>> >> >> I will.
>> >> >>
>> >> > Please do ASAP. This discussion shows that you don't understand
>> >> > the problem that we are dealing with.
>> >>
>> >> Which part of the problem do you think I don't understand?
>> >>
>> > It seems to me you don't understand how Windows uses the RTC for
>> > time keeping and how QEMU solves the problem today.
>>
>> The RTC causes periodic interrupts and the Windows interrupt handler
>> increments jiffies, like Linux?
>>
> Linux does much more complicated things than that to keep time, so the
> only way to fix time drift in Linux was to introduce pvclock. For
> Windows it is not so simple either: since Windows can change the clock
> frequency at any time, it can't calculate time from jiffies; it needs
> to update the clock on each timer tick.
>
>> >> >> >> >> >> guests could also be assisted with special handling
>> >> >> >> >> >> (like the win2k install hack); for example, guest
>> >> >> >> >> >> instructions could be counted (approximately, for
>> >> >> >> >> >> example using TB size or TSC) and interrupts only
>> >> >> >> >> >> injected after at least N instructions have passed.
>> >> >> >> >> > Guest instructions cannot be easily counted in KVM (it
>> >> >> >> >> > can be done more or less reliably using perf counters,
>> >> >> >> >> > maybe).
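[Editor's note: the "inject only after at least N guest instructions" idea above is just a proposal in this thread, never implemented here. A minimal sketch of how such gating could work; the class and method names are invented for illustration, and the instruction counter is assumed to come from something like icount, TB size, or the TSC, as suggested in the quoted text:]

```python
class InstructionGatedInjector:
    """Toy model of instruction-gated IRQ injection (hypothetical).

    Ticks that come due are queued rather than injected immediately;
    one is injected only once the guest has executed at least n_min
    instructions since the previous injection.
    """

    def __init__(self, n_min):
        self.n_min = n_min          # minimum instructions between IRQs
        self.last_inject = 0        # instruction count at last injection
        self.pending = 0            # ticks that came due but not injected

    def tick(self):
        # A timer tick came due; queue it instead of injecting now.
        self.pending += 1

    def maybe_inject(self, icount):
        # Inject one queued tick only if the guest made enough progress.
        if self.pending and icount - self.last_inject >= self.n_min:
            self.pending -= 1
            self.last_inject = icount
            return True
        return False
```

For example, with `n_min=1000`, a tick queued at instruction count 500 is held back and only delivered once the counter reaches 1000 or more.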
>> >> >> >> >>
>> >> >> >> >> Aren't there any debug registers or perf counters which
>> >> >> >> >> can generate an interrupt after some number of
>> >> >> >> >> instructions have been executed?
>> >> >> >> > I don't think debug registers have something like that,
>> >> >> >> > and they are available for guest use anyway. Perf counters
>> >> >> >> > differ greatly from CPU to CPU (even between two CPUs of
>> >> >> >> > the same manufacturer), and we want to keep using them for
>> >> >> >> > profiling guests. And I don't see what problem it would
>> >> >> >> > solve anyway that can't be solved by a simple delay
>> >> >> >> > between IRQ reinjections.
>> >> >> >>
>> >> >> >> This would allow counting the executed instructions and
>> >> >> >> limiting them. Thus we could emulate a 500MHz CPU on a 2GHz
>> >> >> >> CPU more accurately.
>> >> >> >>
>> >> >> > Why would you want to limit the number of instructions
>> >> >> > executed by the guest if the CPU has nothing else to do
>> >> >> > anyway? The problem occurs not when we have spare cycles to
>> >> >> > give to a guest, but in the opposite case.
>> >> >>
>> >> >> I think one problem is that the guest has executed too much
>> >> >> compared to what would happen on real HW with a lesser CPU.
>> >> >> That explains the RTC frequency reprogramming case.
>> >> > You think wrong. The problem is exactly the opposite: the guest
>> >> > hasn't had enough execution time between two timer interrupts. I
>> >> > don't know what RTC frequency reprogramming case you are talking
>> >> > about here.
>> >>
>> >> The case you told me about, where N pending tick IRQs exist but
>> >> the guest wants to change the RTC frequency from 64Hz to 1024Hz.
>> >>
>> >> Let's make this more concrete: a 1 GHz CPU, initially a 100Hz RTC,
>> >> so 10Mcycles/tick or 10ms/tick. At T = 30Mcycles, the guest wants
>> >> to change the frequency to 1000Hz.
>> >>
>> >> The problem for emulation is that for the same 3 ticks, there has
>> >> been so little execution power that the ticks have been coalesced.
>> >> But isn't the guest cycle count then much lower than 30Mcyc?
>> >>
>> >> Isn't it so that the guest must be above 30Mcyc to be able to want
>> >> the change? But if we reach that point, the problem must not have
>> >> been too little execution time, but too much.
>> >>
>> > Sorry, I tried hard to understand what you said above but failed.
>> > What do you mean by "to be able to want the change"? The guest
>> > sometimes wants to get 64 timer interrupts per second and sometimes
>> > it wants to get 1024 timer interrupts per second. It wants this not
>> > as a result of time drift or anything; it's just how the guest
>> > behaves. You seem to be too fixated on the guest frequency change.
>> > It's just something you have to take into account when you reinject
>> > interrupts.
>>
>> I meant that in the scenario, the guest won't change the RTC before
>> 30Mcyc because of some built-in determinism in the guest. At that
>> point, for some reason, the change would happen.
>>
> I still don't understand what you are trying to say here. The guest
> changes frequency because of some event in the guest. It is totally
> independent of what happens in QEMU's RTC emulation.
I'm trying to understand the order of events. In the scenario, the
order of events on real HW would be:

10Mcyc: tick IRQ 1
20Mcyc: tick IRQ 2
30Mcyc: tick IRQ 3
30Mcyc: reprogram timer
31Mcyc: tick IRQ 4
32Mcyc: tick IRQ 5
33Mcyc: tick IRQ 6
34Mcyc: tick IRQ 7

With QEMU, the order could become:

30Mcyc: reprogram timer
30.5Mcyc: tick IRQ 1
31Mcyc: tick IRQ 2
31.5Mcyc: tick IRQ 3
32Mcyc: tick IRQ 4
32.5Mcyc: tick IRQ 5
33Mcyc: tick IRQ 6
34Mcyc: tick IRQ 7

Correct?

>> >> >> >> >> >> > And even if the rate did not matter, the APIC would
>> >> >> >> >> >> > still have to know about the fact that an IRQ is
>> >> >> >> >> >> > really periodic and does not only appear as such for
>> >> >> >> >> >> > a certain interval. This really does not sound like
>> >> >> >> >> >> > simplifying things or even making them cleaner.
>> >> >> >> >> >>
>> >> >> >> >> >> It would: the voodoo would be contained only in the
>> >> >> >> >> >> APIC, and the RTC would be just like any other device.
>> >> >> >> >> >> With the bidirectional IRQs, this voodoo would
>> >> >> >> >> >> probably eventually spread to many other devices. The
>> >> >> >> >> >> logical conclusion of that would be a system where all
>> >> >> >> >> >> devices would be careful not to disturb the guest at
>> >> >> >> >> >> the wrong moment because that would trigger a bug.
>> >> >> >> >> >>
>> >> >> >> >> > This voodoo will be so complex and unreliable that it
>> >> >> >> >> > will make the RTC hack pale in comparison (and I still
>> >> >> >> >> > don't see how you are going to make it actually work).
>> >> >> >> >>
>> >> >> >> >> Implement everything inside the APIC: only coalescing
>> >> >> >> >> and reinjection.
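[Editor's note: the two orderings above can be reproduced with a small toy model of tick delivery. This is illustrative logic, not QEMU code; the 0.5 Mcyc minimum reinjection gap is an assumed value chosen to match the example timings:]

```python
def deliver(scheduled, guest_ready_at, min_gap):
    """Deliver the scheduled tick times (in Mcyc).

    Ticks that come due while the guest is stalled (before
    guest_ready_at) are coalesced and reinjected min_gap apart once
    the guest runs again.
    """
    delivered = []
    prev = guest_ready_at
    for due in scheduled:
        # Never deliver earlier than scheduled, and never closer than
        # min_gap after the previous delivery.
        t = max(due, prev + min_gap)
        delivered.append(t)
        prev = t
    return delivered

# Scheduled ticks: 100 Hz until the reprogram at T=30, then 1000 Hz.
ticks = [10, 20, 30, 31, 32, 33, 34]

real_hw = deliver(ticks, guest_ready_at=0, min_gap=0.5)
# Every tick lands on schedule: 10, 20, 30, 31, 32, 33, 34.

qemu = deliver(ticks, guest_ready_at=30, min_gap=0.5)
# The first three ticks are reinjected back-to-back at 30.5, 31, 31.5;
# the backlog then drains and the last tick lands on schedule at 34.
```

With `guest_ready_at=0` the model reproduces the real-HW ordering; with `guest_ready_at=30` it reproduces the QEMU ordering quoted above.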
>> >> >> >> > The APIC has zero info needed to implement reinjection
>> >> >> >> > correctly, as was shown to you several times in this
>> >> >> >> > thread, and you simply keep ignoring it.
>> >> >> >>
>> >> >> >> On the contrary, the APIC is actually the only source of the
>> >> >> >> IRQ ack information. The RTC hack would not work without the
>> >> >> >> APIC (or the bidirectional IRQ) passing this info to the RTC.
>> >> >> >>
>> >> >> >> What the APIC doesn't have now is the timer frequency or
>> >> >> >> period info. This is known by the RTC and also by the higher
>> >> >> >> levels managing the clocks.
>> >> >> >>
>> >> >> > So the APIC has one bit of information and the RTC everything
>> >> >> > else.
>> >> >>
>> >> >> The information known by the RTC (the timer period) is also
>> >> >> known by higher levels.
>> >> >>
>> >> > What do you mean by higher levels here? vl.c or the APIC?
>> >>
>> >> vl.c, qemu-timer.c.
>> >>
>> >> >> > The current approach (and the proposed patch) brings this one
>> >> >> > bit of information to the RTC; you are arguing that the RTC
>> >> >> > should be able to communicate all its info to the APIC.
>> >> >> > Sorry, I don't see that your way has any advantage: just a
>> >> >> > more complex interface, and it is much easier to get it wrong
>> >> >> > for other time sources.
>> >> >>
>> >> >> I don't think anymore that the APIC should be handling this,
>> >> >> but the generic stuff, like vl.c or exec.c. Then there would be
>> >> >> only information passing from the APIC to higher levels.
>> >> > Handling reinjection in the general timer code makes infinitely
>> >> > more sense than handling it in the APIC.
>> >>
>> >> I'm glad you agree, or did you mean 'less'?
>> >>
>> > Compared to the APIC, I would agree that even putting it in IDE is
>> > a better idea :)
>> >
>> >> > One thing (off the top of my head) that can't be implemented at
>> >> > that level is injection of interrupts back to back (i.e.
>> >> > injecting the next interrupt immediately after the guest
>> >> > acknowledges the previous one to the RTC).
>> >>
>> >> But Jan said this confuses some buggy OSes.
>> >>
>> > You keep calling them buggy, but I don't agree. They are written
>> > with certain assumptions that are true on real HW, but hard to
>> > achieve on virtual HW. Anyway, we use this technique (back-to-back
>> > reinjection); otherwise you can't solve the drift problem if the
>> > guest wants to receive timer interrupts at the maximum frequency
>> > that the host time source supports.
>>
>> Even if this confuses some OSes?
>>
> We make it so that it will not confuse the relevant OSes. It is not
> like you have a better choice.
>
>> >> >> >> I keep ignoring the idea that the current model, where both
>> >> >> >> the RTC and the APIC must somehow work together to make
>> >> >> >> coalescing work, is the only one possible just because it is
>> >> >> >> committed and happens to work in some cases. It would be
>> >> >> >> much better to concentrate this in one place, the APIC or
>> >> >> >> preferably a higher level where it may benefit other timers
>> >> >> >> too. Provided, of course, that the other models can be made
>> >> >> >> to work.
>> >> >> >>
>> >> >> > So write the code and show us. You haven't shown any evidence
>> >> >> > that the RTC is the wrong place. The RTC knows when an
>> >> >> > interrupt was acknowledged to the RTC, it knows when the
>> >> >> > clock frequency changes, it knows when a device reset
>> >> >> > happened. The APIC knows only that an interrupt was
>> >> >> > coalesced. It doesn't even know that the interrupt may be
>> >> >> > masked by the guest in the IOAPIC (interrupts delivered while
>> >> >> > they are masked are not considered coalesced).
>> >> >>
>> >> >> Oh, I thought interrupt masking was the reason for coalescing!
>> >> >> What exactly is the reason then?
>> >> >>
>> >> > The reason is that the guest has no time to process the previous
>> >> > interrupt before it is time to inject the next one.
>> >>
>> >> Because of other host load, or other emulation done by the same
>> >> QEMU process, I suppose?
>> > Yes, both.
>> >
>> >> >> > The time source knows only when the frequency changes, and
>> >> >> > maybe when a device reset happens if the timer is stopped by
>> >> >> > the device on reset. So the RTC is actually a sweet spot if
>> >> >> > you want to minimize the amount of info you need to pass
>> >> >> > between the various layers.
>> >> >> >
>> >> >> >> >> Maybe that version would not bend over backwards as much
>> >> >> >> >> as the current one to cater for buggy hosts.
>> >> >> >> >
>> >> >> >> > You mean "buggy guests"?
>> >> >> >>
>> >> >> >> Yes, sorry.
>> >> >> >>
>> >> >> >> > What guests are not buggy in your opinion? Linux tries
>> >> >> >> > hard to be smart, and as a result the only way to have a
>> >> >> >> > stable clock with it is to go paravirt.
>> >> >> >>
>> >> >> >> I'm not an OS designer, but I think an OS should never
>> >> >> >> crash, even if a burst of IRQs is received. Reprogramming
>> >> >> >> the timer should consider the pending IRQ situation (0 or 1
>> >> >> >> with real HW). Those bugs are one cause of the problem.
>> >> >> > An OS should never crash in the absence of HW bugs? I doubt
>> >> >> > you can design an OS that can run in the face of any HW
>> >> >> > failure. Anyway, here we are trying to solve the guest's
>> >> >> > time-keeping problem, not crashes. Do you think you can
>> >> >> > design an OS that can keep time accurately no matter how
>> >> >> > crazily all the HW clocks behave?
>> >> >>
>> >> >> I think my OS design skills are not relevant to this
>> >> >> discussion, but IIRC there are fault-tolerant operating systems
>> >> >> for extreme conditions, so it can be done.
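[Editor's note: the coalescing-and-reinjection scheme the thread keeps referring to — count the ticks the guest had no time to handle, then reinject one back-to-back each time the guest acks — can be sketched as a toy model. The names are illustrative, not QEMU's actual RTC/APIC code:]

```python
class ReinjectingTimer:
    """Toy model of tick coalescing with back-to-back reinjection."""

    def __init__(self):
        self.irq_pending = False   # an injected IRQ not yet acked
        self.coalesced = 0         # ticks lost while an IRQ was pending

    def tick(self):
        # A periodic tick fires. If the guest has not yet acked the
        # previous one, it had no time to handle it: count it as
        # coalesced instead of injecting.
        if self.irq_pending:
            self.coalesced += 1
        else:
            self.irq_pending = True

    def ack(self):
        # The guest acknowledged the interrupt. If there is a backlog,
        # immediately inject the next one (back-to-back reinjection).
        self.irq_pending = False
        if self.coalesced:
            self.coalesced -= 1
            self.irq_pending = True
```

In this model the guest eventually receives exactly as many interrupts as ticks fired, which is the property the drift fix is after; whether real guests tolerate the back-to-back delivery is exactly the point debated above.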
>> >> >> >> >> >
>> >> >> >> >> > The fact is that a timer device is not "just like any
>> >> >> >> >> > other device" in the virtual world. Any other device is
>> >> >> >> >> > easy: you just implement the spec as closely as
>> >> >> >> >> > possible and everything works. For a time source device
>> >> >> >> >> > this is not enough. You can implement RTC+HPET to the
>> >> >> >> >> > letter and your guest will drift like crazy.
>> >> >> >> >>
>> >> >> >> >> It's doable: a cycle-accurate emulator will not cause any
>> >> >> >> >> drift, without any voodoo. The interrupts would come
>> >> >> >> >> after executing the same instruction as on the real HW.
>> >> >> >> >> For emulating any sufficiently buggy guest in any
>> >> >> >> >> sufficiently desperate low-resource conditions, this may
>> >> >> >> >> be the only option that will always work.
>> >> >> >> >>
>> >> >> >> > Yes, but qemu and kvm are not cycle-accurate emulators and
>> >> >> >> > don't strive to be. On the contrary, KVM runs at native
>> >> >> >> > host CPU speed most of the time, so any emulation done
>> >> >> >> > between two instructions is theoretically noticeable to a
>> >> >> >> > guest. The TSC is passed directly through to the guest
>> >> >> >> > too, so keeping all time sources in perfect sync is also
>> >> >> >> > impossible.
>> >> >> >>
>> >> >> >> That is actually another cause of the problem. KVM gives the
>> >> >> >> guest the illusion that the VCPU speed is equal to the host
>> >> >> >> speed. When they don't match, especially in critical code,
>> >> >> >> there can be problems. It would be better to tell the guest
>> >> >> >> a lower speed, one which also can be guaranteed.
>> >> >> >>
>> >> >> > Not possible. It's that simple. You should take that into
>> >> >> > account at your architecture design stage.
>> >> >> > In the case of KVM, the real physical CPU executes guest
>> >> >> > instructions, and it does this as fast as it can. The only
>> >> >> > way we can hide that from a guest is by intercepting each
>> >> >> > access to the TSC, and at that point we can use bochs
>> >> >> > instead.
>> >> >>
>> >> >> Well, as Paul pointed out, there's also the icount option.
>> >> >>
>> >> > icount is not an option for KVM.
>> >>
>> >> I think the icount timer adjustment model might make sense for
>> >> this work too. We'd then just need some figure for executed CPU
>> >> instructions, TSC cycles, or even kernel scheduler time slice
>> >> information (how much time the process got).
>> >>
>> > And then? icount makes guest time flow dependent on the amount of
>> > emulated instructions. It relies on the fact that all time sources
>> > are synchronized for the guest during emulation (including the
>> > TSC). This is not true for virtualization.
>>
>> So for virtualization, is it OK then to keep time sources
>> unsynchronized?
> No, it is not. But this is the reality that can't be changed for now.
> Maybe when HW virtualization introduces TSC scaling, but even then we
> don't want to run 4 500MHz guests on a 2GHz host; we want to run 4
> 2GHz guests on a 2GHz host, so overcommit occurs and you can't
> guarantee that no coalescing will happen. And I hope you are aware of
> the fact that using icount introduces a ~10% performance penalty in
> QEMU.
>
> --
> Gleb.
>
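[Editor's note: on the TSC-scaling remark above — hardware TSC scaling, which did appear in later CPUs, conceptually just applies a fixed ratio to the host counter so the guest sees a slower TSC without each read being trapped. A minimal sketch of the arithmetic, illustrative only:]

```python
def scale_tsc(host_tsc, host_hz, guest_hz):
    """Present the guest a TSC that advances at guest_hz instead of
    host_hz, the way hardware TSC scaling applies a fixed ratio."""
    return host_tsc * guest_hz // host_hz

# One host second on a 2 GHz CPU appears to a 500 MHz guest as
# 500M cycles, so guest time flows at the advertised lower speed.
guest_view = scale_tsc(2_000_000_000, 2_000_000_000, 500_000_000)
```

Note this only fixes the apparent TSC rate; as Gleb points out, it does nothing about overcommit, so interrupt coalescing can still occur.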