Gregory Haskins wrote:
>> If you want condition variables, activate your cryogenically-cooled suit 
>> and post it on lkml.  This cannot be added to the kernel via kvm.
>>     
>
> Hehe.  I wasn't aware of the contention surrounding them, but that does 
> explain why they are notably missing.  I will rip this code out in favor 
> of a pure waitqueue, as Christof suggested, to avoid more controversy.
>
>   

Actually, I am in favor of having well-defined synchronization 
primitives.  The only issues I see with condition variables are:

- there are many variants of mutexes in the kernel (mutex, spinlock, 
spinlock with irqs disabled, ...), so many condvar variants are needed.
- they're very close to waitqueues, so perhaps there's not much 
justification

I'm no synchronization primitive expert, though.  My comment had nothing 
to do with my opinion on condition variables or your implementation 
thereof.  kvm is simply not the place to introduce them.

>>> +
>>>     /*
>>> -    * FIXME: Implement this or the CPU wont notice the interrupt until
>>> -    * the next natural VMEXIT.  Note that this is how the system
>>> -    * has always worked, so nothing is broken here.  This is a future
>>> -    * enhancement
>>> +    * HACK ALERT!
>>> +    *
>>> +    * We want to send a virtual interrupt signal to the task that owns
>>> +    * the guest.  However, the signal will only force a VMEXIT (via
>>> +    * a reschedule IPI) if the task is currently in GUEST mode.  There
>>> +    * is a race condition between the time that we mark the vcpu as
>>> +    * running and the time the system actually enters guest mode.  Since
>>> +    * there doesn't appear to be any way to help with this situation from
>>> +    * the VT hardware, we are forced to wait to make sure the guest
>> We support AMD too.
>>     
>
> My patch actually supports both, and I didn't mean to imply that AMD was 
> skipped.  I was speaking generally when I said "VT".  Is there a better term 
> to refer to the concept of VMX/SVM?
>   

I referred to the comment.  Maybe just "the hardware"?

>   
>>> +    * actually gets interrupted in a reasonable amount of time.  If it
>>> +    * does not, we assume that the IPI failed because it was too early
>>> +    * and must try again until it does.
>> Waiting seems totally broken.  I'm not even sure what we are waiting for 
>> -- certainly you can't wait for re-entry into guest mode, since there's 
>> no guarantee that's coming.  
>>     
>
> Actually, there is ;)  (unless I missed something)   If you look at the other 
> side of this implementation in vcpu_XXX_run(), I only set the vcpu->irq.task 
> when the vcpu is just about to enter guest mode.  And I clear it when it 
> leaves.  In other words, the kvm_vcpu_intr() function is a no-op if the vcpu 
> is not in guest mode (or close to it).
>
>   

Ah, ok -- I misunderstood the whole thing.  The way to avoid the race is 
to disable interrupts before entering the guest.  This way the IPI is 
delayed until you enter guest mode:

    irq_disable();

    spin_lock();
    vcpu->guest_mode = 1;
    check_whether_an_irq_is_pending_and_if_so_inject_it();
    spin_unlock();

    asm ("vm_entry_sequence");

    vcpu->guest_mode = 0;
   
    irq_enable(); // currently in the vm entry sequence

    // recheck here?

If the interrupt happens before the spin_lock(), we'll get a non-IPI 
wakeup and then see it in check_whether().  If it happens after, we'll 
get an IPI which will be ignored until we're snugly in guest mode.

>> If it is elsewhere, we need to use one of the many (alas) standard wakeup 
>> mechanisms to 
>> wake up userspace and let it know the vcpu needs service.
>>     
>
> My code assumes we only need to kick the vcpu if it's in guest mode, because 
> it figures it will just pick up the interrupt on the next run cycle otherwise.  
> I haven't studied QEMU deeply enough to know that it can suspend the execution 
> of the run cycle... I just thought it looped indefinitely.  

In general I find it useful to pretend there are many userspaces being 
written for kvm, otherwise we get locked into qemu's current mode of 
operation.

> This makes sense now that I think about it because something like hlt should 
> cause a suspension of CPU activity until further notice. 

An alternative is to handle hlt in the kernel in a place where we're 
ready for the IPI wakeup.  The downside to that is that we have to be 
prepared for external wakeup sources as well (signal, poll, aio... messy).


>   
>>> +    *
>>> +    * This condvar/spinlock/timeout/retry eliminates the race in a safe
>>> +    * manner, at the expense of making the INTR delivery synchronous
>>>      */
>>> +   spin_lock(&vcpu->irq.lock);
>>> +
>>> +   if (vcpu->irq.task) {
>>> +           struct timespec tmo = {
>>> +                   .tv_sec  = 0,
>>> +                   .tv_nsec = 100000 /* 100us */
>>> +           };
>>> +
>>> +           BUG_ON(vcpu->irq.task == current);
>>> +
>>> +           while (vcpu->irq.task) {
>>> +                   send_sig(SIGSTOP, vcpu->irq.task, 0);
>> SIGSTOP?!  I lost the plot here.
>>     
>
> Well, I needed a signal to send that would reliably IPI the guest.  I 
> originally wrote the code to send RT signal 33 (which I defined as 
> KVM_SIGNAL_VIRTUAL_INTERRUPT).  In both cases, the design calls for sending a 
> signal (purely for the IPI function of the kernel) and then to remove the 
> signal before userspace sees it.  I had some trouble with the code that used 
> dequeue_signal() to strip out the RT signal.  So I then changed to sending 
> SIGSTOP to IPI, and SIGCONT to effectively strip out the pending SIGSTOP.
>
>   

It's best not to use signals internally.  Qemu relies on them and we 
have to support them, but in the kernel we should use existing kernel 
mechanisms.

> Keep in mind that this is really just proof-of-concept code.  The general 
> idea of this POC is to demonstrate how I envisioned the irqdevice API being 
> used to deliver interrupts even once concurrent access is possible (e.g., 
> SMP/PV).  The details of which signal is used, etc., are incidental for now 
> until we agree that the irqdevice API approach is acceptable. 
>   

I was interested in how ->pending() and ->read_vector() and the raise 
callback interact, but got distracted by the, err, uniqueness of the 
signal thing.

-- 
error compiling committee.c: too many arguments to function

