2010/5/29 Jan Kiszka <jan.kis...@web.de>: > Gleb Natapov wrote: >> On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote: >>> 2010/5/28 Gleb Natapov <g...@redhat.com>: >>>> On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote: >>>>> 2010/5/27 Gleb Natapov <g...@redhat.com>: >>>>>> On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote: >>>>>>> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kis...@web.de> wrote: >>>>>>>> Blue Swirl wrote: >>>>>>>>> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kis...@web.de> wrote: >>>>>>>>>> Anthony Liguori wrote: >>>>>>>>>>> On 05/25/2010 02:09 PM, Blue Swirl wrote: >>>>>>>>>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kis...@web.de> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> From: Jan Kiszka<jan.kis...@siemens.com> >>>>>>>>>>>>> >>>>>>>>>>>>> This allows to communicate potential IRQ coalescing during >>>>>>>>>>>>> delivery from >>>>>>>>>>>>> the sink back to the source. Targets that support IRQ coalescing >>>>>>>>>>>>> workarounds need to register handlers that return the appropriate >>>>>>>>>>>>> QEMU_IRQ_* code, and they have to propergate the code across all >>>>>>>>>>>>> IRQ >>>>>>>>>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it >>>>>>>>>>>>> can >>>>>>>>>>>>> apply its workaround. If multiple sinks exist, the source may only >>>>>>>>>>>>> consider an IRQ coalesced if all other sinks either report >>>>>>>>>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED. >>>>>>>>>>>>> >>>>>>>>>>>> No real devices are interested whether any of their output lines >>>>>>>>>>>> are >>>>>>>>>>>> even connected. This would introduce a new signal type, >>>>>>>>>>>> bidirectional >>>>>>>>>>>> multi-level, which is not correct. >>>>>>>>>>>> >>>>>>>>>>> I don't think it's really an issue of correct, but I wouldn't >>>>>>>>>>> disagree >>>>>>>>>>> to a suggestion that we ought to introduce a new signal type for >>>>>>>>>>> this >>>>>>>>>>> type of bidirectional feedback. Maybe it's qemu_coalesced_irq and >>>>>>>>>>> has a >>>>>>>>>>> similar interface as qemu_irq. >>>>>>>>>> A separate type would complicate the delivery of the feedback value >>>>>>>>>> across GPIO pins (as Paul requested for the RTC->HPET routing). >>>>>>>>>> >>>>>>>>>>>> I think the real solution to coalescing is put the logic inside one >>>>>>>>>>>> device, in this case APIC because it has the information about irq >>>>>>>>>>>> delivery. APIC could monitor incoming RTC irqs for frequency >>>>>>>>>>>> information and whether they get delivered or not. If not, an >>>>>>>>>>>> internal >>>>>>>>>>>> timer is installed which injects the lost irqs. >>>>>>>>>> That won't fly as the IRQs will already arrive at the APIC with a >>>>>>>>>> sufficiently high jitter. At the bare minimum, you need to tell the >>>>>>>>>> interrupt controller about the fact that a particular IRQ should be >>>>>>>>>> delivered at a specific regular rate. For this, you also need a >>>>>>>>>> generic >>>>>>>>>> interface - nothing really "won". >>>>>>>>> OK, let's simplify: just reinject at next possible chance. No need to >>>>>>>>> monitor or tell anything. >>>>>>>> There are guests that won't like this (I know of one in-house, but >>>>>>>> others may even have more examples), specifically if you end up firing >>>>>>>> multiple IRQs in a row due to a longer backlog. For that reason, the >>>>>>>> RTC >>>>>>>> spreads the reinjection according to the current rate. >>>>>>> Then reinject with a constant delay, or next CPU exit. Such buggy >>>>>> If guest's time frequency is the same as host time frequency you can't >>>>>> reinject with constant delay. That is why current code mixes two >>>>>> approaches: reinject M interrupts in a raw then delay. >>>>> This approach can be also used by APIC-only version. >>>>> >>>> I don't know what APIC-only version you are talking about. I haven't >>>> seen the code and I don't understand hand waving, sorry. >>> There is no code, because we're still at architecture design stage. >>> >> Try to write test code to understand the problem better. >> >>>>>>> guests could also be assisted with special handling (like win2k >>>>>>> install hack), for example guest instructions could be counted >>>>>>> (approximately, for example using TB size or TSC) and only inject >>>>>>> after at least N instructions have passed. >>>>>> Guest instructions cannot be easily counted in KVM (it can be done more >>>>>> or less reliably using perf counters, may be). >>>>> Aren't there any debug registers or perf counters, which can generate >>>>> an interrupt after some number of instructions have been executed? >>>> Don't think debug registers have something like that and they are >>>> available for guest use anyway. Perf counters differs greatly from CPU >>>> to CPU (even between two CPUs of the same manufacturer), and we want to >>>> keep using them for profiling guests. And I don't see what problem it >>>> will solve anyway that can be solved by simple delay between irq >>>> reinjection. >>> This would allow counting the executed instructions and limit it. Thus >>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately. >>> >> Why would you want to limit number of instruction executed by guest if >> CPU has nothing else to do anyway? The problem occurs not when we have >> spare cycles so give to a guest, but in opposite case. >> >>>>>>>> And even if the rate did not matter, the APIC woult still have to now >>>>>>>> about the fact that an IRQ is really periodic and does not only appear >>>>>>>> as such for a certain interval. This really does not sound like >>>>>>>> simplifying things or even make them cleaner. >>>>>>> It would, the voodoo would be contained only in APIC, RTC would be >>>>>>> just like any other device. With the bidirectional irqs, this voodoo >>>>>>> would probably eventually spread to many other devices. The logical >>>>>>> conclusion of that would be a system where all devices would be >>>>>>> careful not to disturb the guest at wrong moment because that would >>>>>>> trigger a bug. >>>>>>> >>>>>> This voodoo will be so complex and unreliable that it will make RTC hack >>>>>> pale in comparison (and I still don't see how you are going to make it >>>>>> actually work). >>>>> Implement everything inside APIC: only coalescing and reinjection. >>>> APIC has zero info needed to implement reinjection correctly as was >>>> shown to you several time in this thread and you simply keep ignoring >>>> it. >>> On the contrary, APIC is actually the only source of the IRQ ack >>> information. RTC hack would not work without APIC (or the >>> bidirectional IRQ) passing this info to RTC. >>> >>> What APIC doesn't have now is the timer frequency or period info. This >>> is known by RTC and also higher levels managing the clocks. >>> >> So APIC has one bit of information and RTC everything else. The current >> approach (and proposed patch) brings this one bit of information to RTC, >> you are arguing that RTC should be able to communicate all its info to >> APIC. Sorry I don't see that your way has any advantage. Just more >> complex interface and it is much easier to get it wrong for other time >> sources. >> >>> I keep ignoring the idea that the current model, where both RTC and >>> APIC must somehow work together to make coalescing work, is the only >>> possible just because it is committed and it happens to work in some >>> cases. It would be much better to concentrate this to one place, APIC >>> or preferably higher level where it may benefit other timers too. >>> Provided of course that the other models can be made to work. >>> >> So write the code and show us. You haven't show any evidence that RTC is >> the wrong place. RTC knows when interrupt was acknowledge to RTC, it >> know when clock frequency changes, it know when device reset happened. >> APIC knows only that interrupt was coalesced. It doesn't even know that >> it may be masked by a guest in IOAPIC (interrupts delivered while they >> are masked not considered coalesced). Time source knows only when >> frequency changes and may be when device reset happens if timer is >> stopped by device on reset. So RTC is actually a sweet spot if you want >> to minimize amount of info you need to pass between various layers. > > This is - according to my current understanding - the proposed > alternative architecture: > > .---------------. > | de-coalescing | > | logic | > '---------------' > ^ | > period,| |IRQ > coalesced| |(or tuned VM clock) > (yes/no)| v > .-------. .--------. .-------. > | RTC |-----IRQ----->| router |-----IRQ---->| APIC | > '-------' '--------' '-------' > | ^ | ^ > | | | | > '-------period-------' '------period-------'
The period information is already known by the higher level clock management, it should be available (or made available) to de-coalescing logic (which should probably be located close to other clock stuff) but otherwise there shouldn't be a need to pass it around. The tuned VM clock of course only affects the RTC. Otherwise correct. > And here is what this patch implements (except for the not yet factored > out de-coalescing logic): > > .---------------. > | de-coalescing | > | logic | > '---------------' > ^ | > period,| |next IRQ date > coalesced| |(or tuned VM clock) > (yes/no)| v > .-------. .--------. .-------. > | RTC |-----IRQ----->| router |-----IRQ---->| APIC | > '-------' '--------' '-------' > ^ | ^ | > | | | | > '------coalesced-----' '-----coalesced-----' Why not route the coalesced signal directly from APIC to de-coalescing logic? Then our designs would match! > > I still don't see how the alternative is supposed to simplify our life > or improve the efficiency of the de-coalescing workaround. It's rather > problematic like Gleb pointed out: The de-coalescing logic needs to be > informed about periodicity changes that can only be delivered along > IRQs. So what to do with the backlog when the timer is stopped? What happens with the current design? Gleb only mentioned the frequency change, I thought that was not so big problem. But I don't think this case should be allowed happen at all, it can't exist on real HW. > Regarding an improved de-coalescing logic: As its final design is widely > independent of how we collect the information, it could perfectly be > done after we laid the elementary foundation. True. But what is the foundation, do we need bidirectional IRQs or not? I still think current IRQs should be enough.