Avi Kivity wrote:
> Jan Kiszka wrote:
>>> The complexity, as shown by the decision tree, is exploding, however.
>>>     
>>
>> Yes, I had the same impression while gluing the fix together. And I also
>> wondered how things may look like for SVM once we support NMIs there as
>> well. However, I didn't want to add a new regression source by
>> refactoring this sensitive code at this point.
>>   
> 
> Right.  We want a short-term fix (your patch, or perhaps some simplified
> version thereof) and a long-term fix which will be simpler to understand.
> 
> 
>>> I wonder whether this patch can be simplified, and what to do in the
>>> long term.
>>>
>>> First, this clearly has to be in subarch independent code, it's too
>>> complex to live in vmx.c.  Second, how about this:
>>>
>>> - create an 'event' object, that can be an exception, interrupt, or
>>> nmi.  An event can be injected or pending; its other attributes are
>>> vector number, has_error, and exception code.
>>> - an event has 'inject' and 'check_injected' methods
>>> - events live in a queue, either a real array or simply three entries
>>> for the three types which are checked in order of precedence, like now
>>> - the pre-entry logic looks like
>>>
>>> if (!injected_event && queue_len > 0)
>>>    injected_event = queue_head()
>>> if (injected_event)
>>>    injected_event.inject()
>>>
>>> - the post-exit logic looks like
>>>
>>> if (injected_event)
>>>    if (injected_event.check_injected())
>>>        injected_event = null
>>>        queue_pop()
>>>
>>> The queue is resorted only when injected_event = null.
>>>     
>>
>> Don't understand if I got the last sentence correctly, 
> 
> So long as an event is being injected, we don't want to mess with the
> priority queue.
> 
>> but generally
>> this sounds nice. Let's try to express my understanding:
>>
>> We have an event class of which at most one object is instantiated per
>> type (which are: exception, NMI, IRQ), ie. we will never have more than
>> 3 instances at the same time. These objects are kept in queue, _always_
>> sorted according to their priority (in descending order: exception, NMI,
>> IRQ). On pre-entry, the topmost object is taken and injected to the
>> guest. The queue may only change after the post-exit check (and before
>> the pre-entry injection, of course) so that we can track if some object
>> finally has been successfully injected and can be deleted from the queue
>> (or replaced with a new one, provided by its backing source). Did I get
>> your idea correctly?
>>   
> 
> Yes.
> 
> Reflecting some more, there is some talk in SDM 3A 5.9 about handling
> simultaneous exceptions, and MAY want to use the queue to resolve this
> according to the rules provided there.  On the other hand, we MAY want
> to tear 5.9 our of the manual and live with the existing complexity.

Mmmh, not ever exception is equal - that would add complexity, indeed.

> 
> Even more, we want to be able to remove an event that is in the queue
> but has not been injected yet.  This is useful for svm, where you can
> queue an interrupt when it is masked by eflags.if or apic.tpr, and svm
> will inject it when it becomes unmasked.  In other words, svm implements
> part of our queue in hardware.

That's a smart feature: one reason for vmexits less plus reduced VMM
complexity.

> 
> So when an irq line is asserted, the ioapic locates the victim lapic,
> sets the IRR bit, and IPIs it.  This forces an exit; the uninjected
> interrupt is placed back in the queue.  The lapic code sees a higher
> priority interrupt, yanks the lower interrupt from the queue, replaces
> it with the new interrupt, and we go back to the guest.
> 
>>> Thoughts?  I think this will close the race as well as simplify
>>> things. It's similar to the current logic, except that the if ()s are
>>> taken out
>>> of the context of individual events and placed into a generic context,
>>> once.
>>>
>>> I also want to think about the short term solution.  If you can think of
>>> a simpler way to close the race, I'd like to hear of it.
>>>     
>>
>> Frankly, not at the moment. This patch tries to extend the existing
>> logic only as far as required to fix the issue, while keeping the rest
>> as is to avoid breaking some other corner case. In short: conservative
>> fixing. Unless someone (/me is busy the next weeks) pulls a patch to
>> implement your refactoring idea out of some hat, I would suggest to
>> apply my patch for now, but keep the rework on the todo list.
>>   
> 
> As mentioned above, that's my plan.  I'll see if I can simplify it a
> bit.  Is there an easy way to test it?

Jiajun kindly provided me a RHEL kernel and initrd (2.6.18-53-el5) which
I ran for a while (or booted a few times) to trigger the hang. Basically
you need high IRQ load (preferably via LAPIC, to exploit that un-acked
IRQs will block low-prio IRQs as well) + high NMI load (e.g. via NMI
watchdog).

Jan

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to