Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

Blue Swirl Sat, 29 May 2010 09:46:01 -0700

2010/5/29 Jan Kiszka <jan.kis...@web.de>:
> Gleb Natapov wrote:
>> On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote:
>>> 2010/5/28 Gleb Natapov <g...@redhat.com>:
>>>> On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>>>>> 2010/5/27 Gleb Natapov <g...@redhat.com>:
>>>>>> On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>>>>>>> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kis...@web.de> wrote:
>>>>>>>> Blue Swirl wrote:
>>>>>>>>> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kis...@web.de> wrote:
>>>>>>>>>> Anthony Liguori wrote:
>>>>>>>>>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>>>>>>>>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kis...@web.de>  
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> From: Jan Kiszka<jan.kis...@siemens.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This allows to communicate potential IRQ coalescing during 
>>>>>>>>>>>>> delivery from
>>>>>>>>>>>>> the sink back to the source. Targets that support IRQ coalescing
>>>>>>>>>>>>> workarounds need to register handlers that return the appropriate
>>>>>>>>>>>>> QEMU_IRQ_* code, and they have to propergate the code across all 
>>>>>>>>>>>>> IRQ
>>>>>>>>>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it 
>>>>>>>>>>>>> can
>>>>>>>>>>>>> apply its workaround. If multiple sinks exist, the source may only
>>>>>>>>>>>>> consider an IRQ coalesced if all other sinks either report
>>>>>>>>>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>>>>>>>>>>
>>>>>>>>>>>> No real devices are interested whether any of their output lines 
>>>>>>>>>>>> are
>>>>>>>>>>>> even connected. This would introduce a new signal type, 
>>>>>>>>>>>> bidirectional
>>>>>>>>>>>> multi-level, which is not correct.
>>>>>>>>>>>>
>>>>>>>>>>> I don't think it's really an issue of correct, but I wouldn't 
>>>>>>>>>>> disagree
>>>>>>>>>>> to a suggestion that we ought to introduce a new signal type for 
>>>>>>>>>>> this
>>>>>>>>>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and 
>>>>>>>>>>> has a
>>>>>>>>>>> similar interface as qemu_irq.
>>>>>>>>>> A separate type would complicate the delivery of the feedback value
>>>>>>>>>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>>>>>>>>>>
>>>>>>>>>>>> I think the real solution to coalescing is put the logic inside one
>>>>>>>>>>>> device, in this case APIC because it has the information about irq
>>>>>>>>>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>>>>>>>>>>> information and whether they get delivered or not. If not, an 
>>>>>>>>>>>> internal
>>>>>>>>>>>> timer is installed which injects the lost irqs.
>>>>>>>>>> That won't fly as the IRQs will already arrive at the APIC with a
>>>>>>>>>> sufficiently high jitter. At the bare minimum, you need to tell the
>>>>>>>>>> interrupt controller about the fact that a particular IRQ should be
>>>>>>>>>> delivered at a specific regular rate. For this, you also need a 
>>>>>>>>>> generic
>>>>>>>>>> interface - nothing really "won".
>>>>>>>>> OK, let's simplify: just reinject at next possible chance. No need to
>>>>>>>>> monitor or tell anything.
>>>>>>>> There are guests that won't like this (I know of one in-house, but
>>>>>>>> others may even have more examples), specifically if you end up firing
>>>>>>>> multiple IRQs in a row due to a longer backlog. For that reason, the 
>>>>>>>> RTC
>>>>>>>> spreads the reinjection according to the current rate.
>>>>>>> Then reinject with a constant delay, or next CPU exit. Such buggy
>>>>>> If guest's time frequency is the same as host time frequency you can't
>>>>>> reinject with constant delay. That is why current code mixes two
>>>>>> approaches: reinject M interrupts in a raw then delay.
>>>>> This approach can be also used by APIC-only version.
>>>>>
>>>> I don't know what APIC-only version you are talking about. I haven't
>>>> seen the code and I don't understand hand waving, sorry.
>>> There is no code, because we're still at architecture design stage.
>>>
>> Try to write test code to understand the problem better.
>>
>>>>>>> guests could also be assisted with special handling (like win2k
>>>>>>> install hack), for example guest instructions could be counted
>>>>>>> (approximately, for example using TB size or TSC) and only inject
>>>>>>> after at least N instructions have passed.
>>>>>> Guest instructions cannot be easily counted in KVM (it can be done more
>>>>>> or less reliably using perf counters, may be).
>>>>> Aren't there any debug registers or perf counters, which can generate
>>>>> an interrupt after some number of instructions have been executed?
>>>> Don't think debug registers have something like that and they are
>>>> available for guest use anyway. Perf counters differs greatly from CPU
>>>> to CPU (even between two CPUs of the same manufacturer), and we want to
>>>> keep using them for profiling guests. And I don't see what problem it
>>>> will solve anyway that can be solved by simple delay between irq
>>>> reinjection.
>>> This would allow counting the executed instructions and limit it. Thus
>>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>>>
>> Why would you want to limit number of instruction executed by guest if
>> CPU has nothing else to do anyway? The problem occurs not when we have
>> spare cycles so give to a guest, but in opposite case.
>>
>>>>>>>> And even if the rate did not matter, the APIC woult still have to now
>>>>>>>> about the fact that an IRQ is really periodic and does not only appear
>>>>>>>> as such for a certain interval. This really does not sound like
>>>>>>>> simplifying things or even make them cleaner.
>>>>>>> It would, the voodoo would be contained only in APIC, RTC would be
>>>>>>> just like any other device. With the bidirectional irqs, this voodoo
>>>>>>> would probably eventually spread to many other devices. The logical
>>>>>>> conclusion of that would be a system where all devices would be
>>>>>>> careful not to disturb the guest at wrong moment because that would
>>>>>>> trigger a bug.
>>>>>>>
>>>>>> This voodoo will be so complex and unreliable that it will make RTC hack
>>>>>> pale in comparison (and I still don't see how you are going to make it
>>>>>> actually work).
>>>>> Implement everything inside APIC: only coalescing and reinjection.
>>>> APIC has zero info needed to implement reinjection correctly as was
>>>> shown to you several time in this thread and you simply keep ignoring
>>>> it.
>>> On the contrary, APIC is actually the only source of the IRQ ack
>>> information. RTC hack would not work without APIC (or the
>>> bidirectional IRQ) passing this info to RTC.
>>>
>>> What APIC doesn't have now is the timer frequency or period info. This
>>> is known by RTC and also higher levels managing the clocks.
>>>
>> So APIC has one bit of information and RTC everything else. The current
>> approach (and proposed patch) brings this one bit of information to RTC,
>> you are arguing that RTC should be able to communicate all its info to
>> APIC. Sorry I don't see that your way has any advantage. Just more
>> complex interface and it is much easier to get it wrong for other time
>> sources.
>>
>>> I keep ignoring the idea that the current model, where both RTC and
>>> APIC must somehow work together to make coalescing work, is the only
>>> possible just because it is committed and it happens to work in some
>>> cases. It would be much better to concentrate this to one place, APIC
>>> or preferably higher level where it may benefit other timers too.
>>> Provided of course that the other models can be made to work.
>>>
>> So write the code and show us. You haven't show any evidence that RTC is
>> the wrong place. RTC knows when interrupt was acknowledge to RTC, it
>> know when clock frequency changes, it know when device reset happened.
>> APIC knows only that interrupt was coalesced. It doesn't even know that
>> it may be masked by a guest in IOAPIC (interrupts delivered while they
>> are masked not considered coalesced). Time source knows only when
>> frequency changes and may be when device reset happens if timer is
>> stopped by device on reset. So RTC is actually a sweet spot if you want
>> to minimize amount of info you need to pass between various layers.
>
> This is - according to my current understanding - the proposed
> alternative architecture:
>
>                                          .---------------.
>                                          | de-coalescing |
>                                          |     logic     |
>                                          '---------------'
>                                                ^   |
>                                         period,|   |IRQ
>                                       coalesced|   |(or tuned VM clock)
>                                        (yes/no)|   v
> .-------.              .--------.             .-------.
> |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
> '-------'              '--------'             '-------'
>    |                    ^    |                   ^
>    |                    |    |                   |
>    '-------period-------'    '------period-------'


The period information is already known by the higher level clock
management, it should be available (or made available) to
de-coalescing logic (which should probably be located close to other
clock stuff) but otherwise there shouldn't be a need to pass it
around. The tuned VM clock of course only affects the RTC. Otherwise
correct.

> And here is what this patch implements (except for the not yet factored
> out de-coalescing logic):
>
>   .---------------.
>   | de-coalescing |
>   |     logic     |
>   '---------------'
>         ^   |
>  period,|   |next IRQ date
> coalesced|   |(or tuned VM clock)
>  (yes/no)|   v
>       .-------.              .--------.             .-------.
>       |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>       '-------'              '--------'             '-------'
>           ^                    |    ^                   |
>           |                    |    |                   |
>           '------coalesced-----'    '-----coalesced-----'

Why not route the coalesced signal directly from APIC to de-coalescing
logic? Then our designs would match!

>
> I still don't see how the alternative is supposed to simplify our life
> or improve the efficiency of the de-coalescing workaround. It's rather
> problematic like Gleb pointed out: The de-coalescing logic needs to be
> informed about periodicity changes that can only be delivered along
> IRQs. So what to do with the backlog when the timer is stopped?

What happens with the current design? Gleb only mentioned the
frequency change, I thought that was not so big problem. But I don't
think this case should be allowed happen at all, it can't exist on
real HW.

> Regarding an improved de-coalescing logic: As its final design is widely
> independent of how we collect the information, it could perfectly be
> done after we laid the elementary foundation.

True. But what is the foundation, do we need bidirectional IRQs or
not? I still think current IRQs should be enough.

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

Reply via email to