On Tuesday 30 November 2010 22:15:29 Avi Kivity wrote:
> On 11/26/2010 04:35 AM, Yang, Sheng wrote:
> > >  >  Shouldn't kvm also service reads from the pending bitmask?
> > >  
> > >  Of course KVM should service reads from the pending bitmask. For
> > >  assigned devices, it's the kernel that sets the pending bit; but I am
> > >  not sure about virtio. This interface is GET_ENTRY, so reading is fine
> > >  with it.
> 
> The kernel should manage it in the same way.  Virtio raises irq (via
> KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit.
> 
> Note we need to be able to read and write the pending bitmask for live
> migration.

Then it seems we still need a write interface for it. I think we can work 
on that later, since it has no user right now.
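For illustration only, a write counterpart to GET_ENTRY could look roughly 
like the sketch below; the struct layout, field names and the ioctl number 
are hypothetical placeholders, not part of the current patches:

#include <sys/ioctl.h>
#include <linux/types.h>

/* Hypothetical layout of one MSI-X table entry as seen by userspace;
 * field names and the ioctl number below are placeholders only. */
struct kvm_msix_entry {
	__u32 assigned_dev_id;
	__u32 entry_nr;       /* index into the device's MSI-X table */
	__u32 addr_lo;
	__u32 addr_hi;
	__u32 data;
	__u32 ctrl;           /* vector control word, holds the mask bit */
	__u32 pending;        /* pending bit, needed for live migration */
	__u32 padding;
};

/* Placeholder ioctl number on the KVM ioctl type (0xAE). */
#define KVM_SET_MSIX_ENTRY  _IOW(0xAE, 0xff, struct kvm_msix_entry)

/* On the migration target, userspace would push each saved entry,
 * including its pending bit, back into the kernel. */
static int kvm_restore_msix_entry(int vm_fd, const struct kvm_msix_entry *e)
{
	return ioctl(vm_fd, KVM_SET_MSIX_ENTRY, e);
}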
> 
> > >  >  We could have the kernel handle addr/data writes by setting up an
> > >  >  internal interrupt routing.  A disadvantage is that more work is
> > >  >  needed if we emulate interrupt remapping in qemu.
> > >  
> > >  In fact modifying irq routing in the kernel is also the thing I want
> > >  to avoid.
> > >  
> > >  So, the flow would be:
> > >  
> > >  kernel gets the MMIO write, records it in its own MSI table
> > >  KVM exits to QEmu with one specific exit reason
> > >  QEmu knows it has to sync the MSI table, so it reads the entries from
> > >  the kernel
> > >  QEmu finds it is a write, so it needs to reprogram the irq routing
> > >  table using the entries above
> > >  done
> > >  
> > >  But wait, why should qemu read the entries from the kernel? With the
> > >  default exit we already have the information about which entry to
> > >  modify and what to write, so we can use that directly. This way we
> > >  also don't need a specific exit reason - just exiting to qemu in the
> > >  normal way is fine.
> 
> Because we have an interface where you get an exit if (addr % 4) < 3 and
> don't get an exit if (addr % 4) == 3.  There is a gpa range which is
> partially maintained by the kernel and partially in userspace.  It's a
> confusing interface.  Things like 64-bit reads or writes need to be
> broken up and serviced in two different places.
> 
> We already need to support this (for unaligned writes which hit two
> regions), but let's at least make a contiguous region behave sanely.

Oh, I didn't mean to handle this kind of unaligned write in userspace. 
They are illegal according to the PCI spec (the result is undefined). I will 
cover all aligned contiguous writes (32-bit and 64-bit) in the next version 
and discard all illegal writes. 64-bit accesses will be broken up in qemu, 
as you said, as they are currently.

In fact I think we can hand all the data of a 64-bit write over to qemu, 
because it still has to sync the mask bit with the kernel, which makes the 
mask-bit write in userspace redundant and safe to ignore.
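Just to sketch what I mean (the function names here are illustrative, not 
the actual qemu code), the qemu side could split the access like this:

#include <stdint.h>

/* Illustrative stand-in for whatever routine updates one 32-bit word of
 * the MSI-X table and, for the vector control word, syncs the mask bit
 * with the kernel. */
void msix_table_write32(void *opaque, uint64_t offset, uint32_t val);

static void msix_table_write(void *opaque, uint64_t offset,
                             uint64_t val, unsigned size)
{
    if (size == 8 && (offset & 7) == 0) {
        /* Break the 64-bit access into two aligned 32-bit writes,
         * low dword first, as the device would observe them. */
        msix_table_write32(opaque, offset, (uint32_t)val);
        msix_table_write32(opaque, offset + 4, (uint32_t)(val >> 32));
    } else if (size == 4 && (offset & 3) == 0) {
        msix_table_write32(opaque, offset, (uint32_t)val);
    }
    /* Anything else is not dword-aligned; the PCI spec leaves the
     * result undefined, so the access is simply discarded. */
}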
>
> > >  Then it would be:
> > >  
> > >  kernel gets the MMIO write, records it in its own MSI table
> > >  KVM exits to QEmu, indicating an MMIO exit
> > >  QEmu finds it is a write, updates its own MSI table (it may need to
> > >  query the mask bit from the kernel), and reprograms the irq routing
> > >  table using the entries above
> > >  done
> > >  
> > >  Then why should the kernel keep its own MSI table? I think the only
> > >  reason is that we can speed up reads that way - but the reads we want
> > >  to speed up are mostly on the enabled entry (the first entry), which
> > >  is already in the IRQ routing table...
> 
> The reason is to keep a sane interface.  Like we emulate instructions
> and msrs in the kernel and don't do half a job.  I don't think there's a
> real need to accelerate the first three words of an msi-x entry.

Here is a case we've observed with Xen. It can only be reproduced by 
large-scale testing. When the interrupt rate is very high, even newer 
kernels try to throttle it, and in doing so they access the mask bit very 
frequently. In the kernel, msi_set_mask_bit() looks like this:

static void msi_set_mask_bit(struct irq_data *data, u32 flag)
{
        struct msi_desc *desc = irq_data_get_msi(data);

        if (desc->msi_attrib.is_msix) {
                msix_mask_irq(desc, flag);
                readl(desc->mask_base);         /* Flush write to device */
        } else {
                unsigned offset = data->irq - desc->dev->irq;
                msi_mask_irq(desc, 1 << offset, flag << offset);
        }
}

That readl() flush accompanies every MSI-X mask write. It is the only place 
we want to accelerate; otherwise we would still fall back to qemu every time 
the guest masks an MSI-X entry.
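To make the fast path concrete, here is a rough sketch (the struct and 
helper names are made up for illustration, not the actual patch): the 
handler completes only dword writes to the vector control word of an entry 
and punts everything else to qemu, so the frequent mask/unmask toggling 
never leaves the kernel:

#include <linux/types.h>
#include <linux/errno.h>

#define MSIX_ENTRY_SIZE         16   /* bytes per MSI-X table entry */
#define MSIX_ENTRY_VECTOR_CTRL  12   /* offset of the vector control word */

/* Hypothetical per-device shadow of the MSI-X table kept by kvm. */
struct kvm_msix_mmio {
	u32 nr_entries;
	u32 *ctrl;                   /* shadow vector control words */
};

static int msix_table_mmio_write(struct kvm_msix_mmio *mmio,
				 u64 offset, u32 val, int len)
{
	u32 entry = offset / MSIX_ENTRY_SIZE;
	u32 word  = offset % MSIX_ENTRY_SIZE;

	/* Only a dword write to the vector control word is handled here;
	 * everything else still exits to qemu. */
	if (len != 4 || word != MSIX_ENTRY_VECTOR_CTRL ||
	    entry >= mmio->nr_entries)
		return -EOPNOTSUPP;

	mmio->ctrl[entry] = val;
	/* The real patch would also propagate the mask bit to the
	 * physical device here, so the guest's readl() flush observes
	 * the new state without a userspace exit. */
	return 0;
}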

> > >  And for enabled/disabled entries, you can see it like this: the
> > >  entries inside the routing table are considered enabled; otherwise
> > >  they are disabled. Then you don't need to be bothered by
> > >  pci_enable_msix().
> > >  
> > >  So our strategy for accelerating reads can be:
> > >  
> > >  If the entry is contained in the irq routing table, use it; otherwise
> > >  let qemu deal with it. Because it is QEmu who owns the irq routing
> > >  table, the synchronization is guaranteed. We don't need the MSI table
> > >  in the kernel then.
> 
> I agree about letting qemu manage the irq routing table.  It changes
> very rarely.  I just prefer to let it know about the change via
> something other than KVM_EXIT_MMIO.

Well, we can go back to this after we've settled other things.
> 
> > >  And for writes, we just want to cover the mask bit, and nothing
> > >  else.
> > >  
> > >  Is the concept here more acceptable?
> > >  
> > >  The issue here is that the MSI table and the irq routing table hold
> > >  duplicate information for some entries. My initial proposal is to use
> > >  the irq routing table in the kernel, so we don't need to duplicate
> > >  the information.
> > 
> > Avi?
> 
> Sorry about the late reply.
> 
> > And BTW, we can treat the routing table as a kind of *cache*: if the
> > content is in the cache, we fetch it from there; otherwise we need to
> > go back and fetch it from memory (userspace).
> 
> If it's guaranteed by the spec that addr/data pairs are always
> interpreted in the same way, sure.  But there's no reason to do it,
> really, it isn't a fast path.

I don't quite understand the first sentence... But it is a fast path, in 
the case I showed above.

--
regards
Yang, Sheng