Re: Mask bit support's API

2010-12-02 Thread Yang, Sheng
On Friday 03 December 2010 00:55:03 Michael S. Tsirkin wrote:
> On Thu, Dec 02, 2010 at 10:54:24PM +0800, Sheng Yang wrote:
> > On Thu, Dec 2, 2010 at 10:26 PM, Michael S. Tsirkin  wrote:
> > > On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote:
> > >> On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote:
> > >> >>  Which case?  the readl() doesn't need access to the routing table,
> > >> >>  just the entry.
> > >> >
> > >> >One thing that read should do is flush in the outstanding
> > >> >interrupts and flush out the mask bit writes.
> > >> 
> > >> The mask bit writes are synchronous.
> > >> 
> > >> wrt interrupts, we can deal with assigned devices, and can poll
> > >> irqfds.  But we can't force vhost-net to issue an interrupt (and I
> > >> don't think it's required).
> > > 
> > > To clarify:
> > > 
> > >mask write
> > >read
> > > 
> > > it is safe for guest to assume no more interrupts
> > > 
> > > whereas with a simple
> > >mask write
> > > 
> > > an interrupt might be in flight and get delivered shortly afterwards.
> > 
> > I think it's already contained in the current patchset.
> > 
> > >> >>  Oh, I think there is a terminology problem, I was talking about
> > >> >>  kvm's irq routing table, you were talking about the msix entries.
> > >> >> 
> > >> >>  I think treating it as a cache causes more problems, since there
> > >> >> are now two paths for reads (in cache and not in cache) and more
> > >> >> things for writes to manage.
> > >> >> 
> > >> >>  Here's my proposed API:
> > >> >> 
> > >> >>  KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa,
> > >> >>  pending_bitmap_base_gpa)
> > >> >> 
> > >> >>   - called when the guest enables msix
> > >> >
> > >> >I would add virtual addresses so that we can use swappable memory to
> > >> >store the state.
> > >> 
> > >> Right.
> > 
> > Do we need synchronization between kernel and userspace? Any recommended
> > method?
> > 
> > >> >If we do, maybe we can just keep the table there and then
> > >> >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed?
> > >> 
> > >> Still need to let userspace know it needs to reprogram the irqfd
> > >> or whatever it uses to inject the interrupt.
> > > 
> > > Why do we need to reprogram irqfd?  I thought irqfd would map to an
> > > entry within the table instead of address/data as now.
> > > Could you clarify please?
> > > 
> > >> >>  KVM_REMOVE_MSIX_TABLE(table_id)
> > >> >> 
> > >> >>- called when the guest disables msix
> > >> >> 
> > >> >>  KVM_SET_MSIX_ENTRY(table_id, entry_id, contents)
> > >> >> 
> > >> >>- called when the guest enables msix (to initialize it), or
> > >> >> after live migration
> > >> >
> > >> >What is entry_id here?
> > >> 
> > >> Entry within the table.
> > > 
> > > So I think KVM_DEFINE_MSIX_TABLE should be called when msix is
> > > enabled (note: it can not be called at boot anyway since pa
> > > depends on BAR assigned by BIOS).
> > 
> > Don't agree. MMIO can be written regardless of whether MSIX is enabled. If
> > you handle MMIO in the kernel, then handle it all.
> 
> Hmm, does this mean we would need ioctls to enable/disable MSIX?  I
> think passing control from userspace to kernel when MSIX is enabled is
> not too bad. Will think about it.

Anyway we need ioctls to do so, because the kernel knows nothing about PCI
configuration space... And we already have that for assigned devices.
> 
> > I suppose qemu still
> > has control of the BAR?
> 
> Yes. So e.g. if memory is disabled in the device then we must
> disable this as well, and not claim these transactions
> as mmio.
> 
> > Then leaving it in the current place should be fine.
> 
> current place?

assigned_dev_iomem_map().
> 
> At the moment qemu does not notify devices when
> memory is disabled, only when it is enabled.

You mean bit 2 of the Command register? I think assuming the device would only
disable memory when shutting down should be acceptable. Also we can hook at the
disable point as well.
 
> So handling msix enable/disable is more straightforward.

Don't agree. Then you get duplication between kernel and userspace. Also the
semantics of MSI-X MMIO have no relationship with MSI-X enable/disable.
> 
> > >> >>  Michael?  I think that should work for virtio and vfio assigned
> > >> >>  devices?  Not sure about pending bits.
> > >> >
> > >> >Pending bits must be tracked in kernel, but I don't see
> > >> >how we can support polling mode if we don't exit to userspace
> > >> >on pending bit reads.
> > >> >
> > >> >This does mean that some reads will be fast and some will be
> > >> >slow, and it's a bit sad that we seem to be optimizing
> > >> >for specific guests, but I just can't come up with
> > >> >anything better.
> > >> 
> > >> If the pending bits live in userspace memory, the device model can
> > >> update them directly?
> > > 
> > > Note that these are updated on an interrupt, so updating them
> > > in userspace would need get_user_page etc trickery,
> > > and add the overhead of atomics.
> > > 
> > > Further I think it's important to avoid the overhead of updating them
> > > all the time, and only do this when an interrupt is
> > > masked or on pending bits read. Since userspace does not know
> > > when interrupts are masked, this means do update on each read.

Re: Mask bit support's API

2010-12-02 Thread Michael S. Tsirkin
On Thu, Dec 02, 2010 at 10:54:24PM +0800, Sheng Yang wrote:
> On Thu, Dec 2, 2010 at 10:26 PM, Michael S. Tsirkin  wrote:
> > On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote:
> >> On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote:
> >> >>
> >> >>  Which case?  the readl() doesn't need access to the routing table,
> >> >>  just the entry.
> >> >
> >> >One thing that read should do is flush in the outstanding
> >> >interrupts and flush out the mask bit writes.
> >>
> >> The mask bit writes are synchronous.
> >>
> >> wrt interrupts, we can deal with assigned devices, and can poll
> >> irqfds.  But we can't force vhost-net to issue an interrupt (and I
> >> don't think it's required).
> >
> > To clarify:
> >
> >        mask write
> >        read
> >
> > it is safe for guest to assume no more interrupts
> >
> > whereas with a simple
> >        mask write
> >
> > an interrupt might be in flight and get delivered shortly afterwards.
> 
> I think it's already contained in the current patchset.
> >
> >
> >> >>  Oh, I think there is a terminology problem, I was talking about
> >> >>  kvm's irq routing table, you were talking about the msix entries.
> >> >>
> >> >>  I think treating it as a cache causes more problems, since there are
> >> >>  now two paths for reads (in cache and not in cache) and more things
> >> >>  for writes to manage.
> >> >>
> >> >>  Here's my proposed API:
> >> >>
> >> >>  KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa,
> >> >>  pending_bitmap_base_gpa)
> >> >>
> >> >>   - called when the guest enables msix
> >> >
> >> >I would add virtual addresses so that we can use swappable memory to
> >> >store the state.
> >>
> >> Right.
> 
> Do we need synchronization between kernel and userspace? Any recommended 
> method?
> >>
> >> >If we do, maybe we can just keep the table there and then
> >> >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed?
> >>
> >> Still need to let userspace know it needs to reprogram the irqfd
> >> or whatever it uses to inject the interrupt.
> >
> > Why do we need to reprogram irqfd?  I thought irqfd would map to an
> > entry within the table instead of address/data as now.
> > Could you clarify please?
> >
> >
> >> >>  KVM_REMOVE_MSIX_TABLE(table_id)
> >> >>
> >> >>    - called when the guest disables msix
> >> >>
> >> >>  KVM_SET_MSIX_ENTRY(table_id, entry_id, contents)
> >> >>
> >> >>    - called when the guest enables msix (to initialize it), or after
> >> >>  live migration
> >> >
> >> >What is entry_id here?
> >>
> >> Entry within the table.
> >
> > So I think KVM_DEFINE_MSIX_TABLE should be called when msix is
> > enabled (note: it can not be called at boot anyway since pa
> > depends on BAR assigned by BIOS).
> 
> Don't agree. MMIO can be written regardless of whether MSIX is enabled. If
> you handle MMIO in the kernel, then handle it all.

Hmm, does this mean we would need ioctls to enable/disable MSIX?  I
think passing control from userspace to kernel when MSIX is enabled is
not too bad. Will think about it.

> I suppose qemu still
> has control of the BAR?

Yes. So e.g. if memory is disabled in the device then we must
disable this as well, and not claim these transactions
as mmio.

> Then leaving it in the current place should be fine.

current place?

At the moment qemu does not notify devices when
memory is disabled, only when it is enabled.

So handling msix enable/disable is more straightforward.

> >
> >> >>  Michael?  I think that should work for virtio and vfio assigned
> >> >>  devices?  Not sure about pending bits.
> >> >
> >> >Pending bits must be tracked in kernel, but I don't see
> >> >how we can support polling mode if we don't exit to userspace
> >> >on pending bit reads.
> >> >
> >> >This does mean that some reads will be fast and some will be
> >> >slow, and it's a bit sad that we seem to be optimizing
> >> >for specific guests, but I just can't come up with
> >> >anything better.
> >> >
> >>
> >> If the pending bits live in userspace memory, the device model can
> >> update them directly?
> >
> > Note that these are updated on an interrupt, so updating them
> > in userspace would need get_user_page etc trickery,
> > and add the overhead of atomics.
> >
> > Further I think it's important to avoid the overhead of updating them
> > all the time, and only do this when an interrupt is
> > masked or on pending bits read. Since userspace does not know
> > when interrupts are masked, this means do update on each read.
> 
> In fact qemu's accesses to MMIO should be quite rare after moving all
> these things to the kernel. Using an ioctl is also fine with me.
> 
> And how to "do update on each read"?


When a read of the pending bits is detected, we could forward it up to qemu.
Qemu could check internal device state and clear bits that are no longer
relevant.

> -- 
> regards,
> Yang, Sheng
> >
> >> >So maybe just add an ioctl to get and to clear pending bits.
> >> >Maybe set for symmetry.
> >>
> >> For live migration too.  But if they live in memory, no need for
> >> get/set, just specify the address.

Re: Mask bit support's API

2010-12-02 Thread Sheng Yang
On Thu, Dec 2, 2010 at 10:26 PM, Michael S. Tsirkin  wrote:
> On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote:
>> On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote:
>> >>
>> >>  Which case?  the readl() doesn't need access to the routing table,
>> >>  just the entry.
>> >
>> >One thing that read should do is flush in the outstanding
>> >interrupts and flush out the mask bit writes.
>>
>> The mask bit writes are synchronous.
>>
>> wrt interrupts, we can deal with assigned devices, and can poll
>> irqfds.  But we can't force vhost-net to issue an interrupt (and I
>> don't think it's required).
>
> To clarify:
>
>        mask write
>        read
>
> it is safe for guest to assume no more interrupts
>
> whereas with a simple
>        mask write
>
> an interrupt might be in flight and get delivered shortly afterwards.

I think it's already contained in the current patchset.
>
>
>> >>  Oh, I think there is a terminology problem, I was talking about
>> >>  kvm's irq routing table, you were talking about the msix entries.
>> >>
>> >>  I think treating it as a cache causes more problems, since there are
>> >>  now two paths for reads (in cache and not in cache) and more things
>> >>  for writes to manage.
>> >>
>> >>  Here's my proposed API:
>> >>
>> >>  KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa,
>> >>  pending_bitmap_base_gpa)
>> >>
>> >>   - called when the guest enables msix
>> >
>> >I would add virtual addresses so that we can use swappable memory to
>> >store the state.
>>
>> Right.

Do we need synchronization between kernel and userspace? Any recommended method?
>>
>> >If we do, maybe we can just keep the table there and then
>> >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed?
>>
>> Still need to let userspace know it needs to reprogram the irqfd
>> or whatever it uses to inject the interrupt.
>
> Why do we need to reprogram irqfd?  I thought irqfd would map to an
> entry within the table instead of address/data as now.
> Could you clarify please?
>
>
>> >>  KVM_REMOVE_MSIX_TABLE(table_id)
>> >>
>> >>    - called when the guest disables msix
>> >>
>> >>  KVM_SET_MSIX_ENTRY(table_id, entry_id, contents)
>> >>
>> >>    - called when the guest enables msix (to initialize it), or after
>> >>  live migration
>> >
>> >What is entry_id here?
>>
>> Entry within the table.
>
> So I think KVM_DEFINE_MSIX_TABLE should be called when msix is
> enabled (note: it can not be called at boot anyway since pa
> depends on BAR assigned by BIOS).

Don't agree. MMIO can be written regardless of whether MSIX is enabled. If
you handle MMIO in the kernel, then handle it all. I suppose qemu still
has control of the BAR? Then leaving it in the current place should be fine.
>
>> >>  Michael?  I think that should work for virtio and vfio assigned
>> >>  devices?  Not sure about pending bits.
>> >
>> >Pending bits must be tracked in kernel, but I don't see
>> >how we can support polling mode if we don't exit to userspace
>> >on pending bit reads.
>> >
>> >This does mean that some reads will be fast and some will be
>> >slow, and it's a bit sad that we seem to be optimizing
>> >for specific guests, but I just can't come up with
>> >anything better.
>> >
>>
>> If the pending bits live in userspace memory, the device model can
>> update them directly?
>
> Note that these are updated on an interrupt, so updating them
> in userspace would need get_user_page etc trickery,
> and add the overhead of atomics.
>
> Further I think it's important to avoid the overhead of updating them
> all the time, and only do this when an interrupt is
> masked or on pending bits read. Since userspace does not know
> when interrupts are masked, this means do update on each read.

In fact qemu's accesses to MMIO should be quite rare after moving all
these things to the kernel. Using an ioctl is also fine with me.

And how to "do update on each read"?

-- 
regards,
Yang, Sheng
>
>> >So maybe just add an ioctl to get and to clear pending bits.
>> >Maybe set for symmetry.
>>
>> For live migration too.  But if they live in memory, no need for
>> get/set, just specify the address.
>>
>> --
>> error compiling committee.c: too many arguments to function
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


Re: Mask bit support's API

2010-12-02 Thread Michael S. Tsirkin
On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote:
> On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote:
> >>
> >>  Which case?  the readl() doesn't need access to the routing table,
> >>  just the entry.
> >
> >One thing that read should do is flush in the outstanding
> >interrupts and flush out the mask bit writes.
> 
> The mask bit writes are synchronous.
> 
> wrt interrupts, we can deal with assigned devices, and can poll
> irqfds.  But we can't force vhost-net to issue an interrupt (and I
> don't think it's required).

To clarify:

mask write
read

it is safe for guest to assume no more interrupts

whereas with a simple
mask write

an interrupt might be in flight and get delivered shortly afterwards.


> >>  Oh, I think there is a terminology problem, I was talking about
> >>  kvm's irq routing table, you were talking about the msix entries.
> >>
> >>  I think treating it as a cache causes more problems, since there are
> >>  now two paths for reads (in cache and not in cache) and more things
> >>  for writes to manage.
> >>
> >>  Here's my proposed API:
> >>
> >>  KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa,
> >>  pending_bitmap_base_gpa)
> >>
> >>   - called when the guest enables msix
> >
> >I would add virtual addresses so that we can use swappable memory to
> >store the state.
> 
> Right.
> 
> >If we do, maybe we can just keep the table there and then
> >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed?
> 
> Still need to let userspace know it needs to reprogram the irqfd
> or whatever it uses to inject the interrupt.

Why do we need to reprogram irqfd?  I thought irqfd would map to an
entry within the table instead of address/data as now.
Could you clarify please?


> >>  KVM_REMOVE_MSIX_TABLE(table_id)
> >>
> >>- called when the guest disables msix
> >>
> >>  KVM_SET_MSIX_ENTRY(table_id, entry_id, contents)
> >>
> >>- called when the guest enables msix (to initialize it), or after
> >>  live migration
> >
> >What is entry_id here?
> 
> Entry within the table.

So I think KVM_DEFINE_MSIX_TABLE should be called when msix is
enabled (note: it can not be called at boot anyway since pa
depends on BAR assigned by BIOS).

> >>  Michael?  I think that should work for virtio and vfio assigned
> >>  devices?  Not sure about pending bits.
> >
> >Pending bits must be tracked in kernel, but I don't see
> >how we can support polling mode if we don't exit to userspace
> >on pending bit reads.
> >
> >This does mean that some reads will be fast and some will be
> >slow, and it's a bit sad that we seem to be optimizing
> >for specific guests, but I just can't come up with
> >anything better.
> >
> 
> If the pending bits live in userspace memory, the device model can
> update them directly?

Note that these are updated on an interrupt, so updating them
in userspace would need get_user_page etc trickery,
and add the overhead of atomics.

Further I think it's important to avoid the overhead of updating them
all the time, and only do this when an interrupt is
masked or on pending bits read. Since userspace does not know
when interrupts are masked, this means do update on each read.

> >So maybe just add an ioctl to get and to clear pending bits.
> >Maybe set for symmetry.
> 
> For live migration too.  But if they live in memory, no need for
> get/set, just specify the address.
> 
> -- 
> error compiling committee.c: too many arguments to function


Re: Mask bit support's API

2010-12-02 Thread Avi Kivity

On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote:

>
>  Which case?  the readl() doesn't need access to the routing table,
>  just the entry.

One thing that read should do is flush in the outstanding
interrupts and flush out the mask bit writes.


The mask bit writes are synchronous.

wrt interrupts, we can deal with assigned devices, and can poll irqfds.  
But we can't force vhost-net to issue an interrupt (and I don't think 
it's required).



>  Oh, I think there is a terminology problem, I was talking about
>  kvm's irq routing table, you were talking about the msix entries.
>
>  I think treating it as a cache causes more problems, since there are
>  now two paths for reads (in cache and not in cache) and more things
>  for writes to manage.
>
>  Here's my proposed API:
>
>  KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa,
>  pending_bitmap_base_gpa)
>
>   - called when the guest enables msix

I would add virtual addresses so that we can use swappable memory to
store the state.


Right.


If we do, maybe we can just keep the table there and then
KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed?


Still need to let userspace know it needs to reprogram the irqfd or
whatever it uses to inject the interrupt.



>  KVM_REMOVE_MSIX_TABLE(table_id)
>
>- called when the guest disables msix
>
>  KVM_SET_MSIX_ENTRY(table_id, entry_id, contents)
>
>- called when the guest enables msix (to initialize it), or after
>  live migration

What is entry_id here?


Entry within the table.


>  Michael?  I think that should work for virtio and vfio assigned
>  devices?  Not sure about pending bits.

Pending bits must be tracked in kernel, but I don't see
how we can support polling mode if we don't exit to userspace
on pending bit reads.

This does mean that some reads will be fast and some will be
slow, and it's a bit sad that we seem to be optimizing
for specific guests, but I just can't come up with
anything better.



If the pending bits live in userspace memory, the device model can 
update them directly?



So maybe just add an ioctl to get and to clear pending bits.
Maybe set for symmetry.


For live migration too.  But if they live in memory, no need for 
get/set, just specify the address.


--
error compiling committee.c: too many arguments to function



Re: Mask bit support's API

2010-12-02 Thread Michael S. Tsirkin
On Thu, Dec 02, 2010 at 03:09:43PM +0200, Avi Kivity wrote:
> On 12/01/2010 04:36 AM, Yang, Sheng wrote:
> >On Tuesday 30 November 2010 22:15:29 Avi Kivity wrote:
> >>  On 11/26/2010 04:35 AM, Yang, Sheng wrote:
> >>  >  >   >   Shouldn't kvm also service reads from the pending bitmask?
> >>  >  >
> >>  >  >   Of course KVM should service reads from the pending bitmask. For
> >>  >  >   assigned devices, it's the kernel that would set the pending bit;
> >>  >  >   but I am not sure for virtio. This interface is GET_ENTRY, so
> >>  >  >   reading is fine with it.
> >>
> >>  The kernel should manage it in the same way.  Virtio raises irq (via
> >>  KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit.
> >>
> >>  Note we need to be able to read and write the pending bitmask for live
> >>  migration.
> >
> >Then it seems we still need a writing interface for it. And I think we can
> >work on it later since it has no user now.
> 
> I'm not sure.  Suppose a guest starts using pending bits.  Does it
> mean it will not work with kernel msix acceleration?
> 
> If that is the case, then we need pending bit support now.  I don't
> want to knowingly merge something that doesn't conform to the spec,
> forcing users to choose whether they want to enable kernel msix
> acceleration or not.
> 
> 
> 
> >>
> >>  Because we have an interface where you get an exit if (addr % 4)<  3 and
> >>  don't get an exit if (addr % 4) == 3.  There is a gpa range which is
> >>  partially maintained by the kernel and partially in userspace.  It's a
> >>  confusing interface.  Things like 64-bit reads or writes need to be
> >>  broken up and serviced in two different places.
> >>
> >>  We already need to support this (for unaligned writes which hit two
> >>  regions), but let's at least make a contiguous region behave sanely.
> >
> >Oh, I didn't mean to handle this kind of unaligned writing in userspace.
> >They're illegal according to the PCI spec (otherwise the result is undefined
> >according to the spec). I would cover all contiguous writes (32-bit and
> >64-bit) in the next version, and discard all illegal writes. And 64-bit
> >accesses would be broken up in qemu as you said, as they do currently.
> >
> >In fact I think we can hand all data for 64-bit accesses to qemu, because it
> >should still sync the mask bit with the kernel, which makes the maskbit
> >writing in userspace useless and can be ignored.
> 
> What about reads?  Split those as well?
> 
> >>  The reason is to keep a sane interface.  Like we emulate instructions
> >>  and msrs in the kernel and don't do half a job.  I don't think there's a
> >>  real need to accelerate the first three words of an msi-x entry.
> >
> >Here is the case we've observed with Xen. It can only be reproduced by
> >large-scale testing. When the interrupt intensity is very high, even new
> >kernels would try to lower it, so they access the mask bit very frequently.
> >And in the kernel, msi_set_mask_bit() is like this:
> >
> >static void msi_set_mask_bit(struct irq_data *data, u32 flag)
> >{
> > struct msi_desc *desc = irq_data_get_msi(data);
> >
> > if (desc->msi_attrib.is_msix) {
> > msix_mask_irq(desc, flag);
> > readl(desc->mask_base); /* Flush write to device */
> > } else {
> > unsigned offset = data->irq - desc->dev->irq;
> > msi_mask_irq(desc, 1<<  offset, flag<<  offset);
> > }
> >}
> >
> >That flush by readl() accompanies each MSI-X mask write. That is the only
> >place we want to accelerate; otherwise it would still fall back to qemu each
> >time the guest wants to mask the MSI-X entry.
> 
> So it seems we do want to accelerate reads to the entire entry.
> This seems to give more weight to making the kernel own the entire
> entry.
> 
> >>
> >>  >  And BTW, we can take routing table as a kind of *cache*, if the content
> >>  >  is in the cache, then we can fetch it from the cache, otherwise we need
> >>  >  to go back to fetch it from memory(userspace).
> >>
> >>  If it's guaranteed by the spec that addr/data pairs are always
> >>  interpreted in the same way, sure.  But there's no reason to do it,
> >>  really, it isn't a fast path.
> >
> >I don't quite understand the first sentence... But it's a fast path, in the
> >case I showed above.
> 
> Which case?  the readl() doesn't need access to the routing table,
> just the entry.

One thing that read should do is flush in the outstanding
interrupts and flush out the mask bit writes.

> Oh, I think there is a terminology problem, I was talking about
> kvm's irq routing table, you were talking about the msix entries.
> 
> I think treating it as a cache causes more problems, since there are
> now two paths for reads (in cache and not in cache) and more things
> for writes to manage.
> 
> Here's my proposed API:
> 
> KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa,
> 

Re: Mask bit support's API

2010-12-02 Thread Avi Kivity

On 12/01/2010 04:36 AM, Yang, Sheng wrote:

On Tuesday 30 November 2010 22:15:29 Avi Kivity wrote:
>  On 11/26/2010 04:35 AM, Yang, Sheng wrote:
>  >  >   >   Shouldn't kvm also service reads from the pending bitmask?
>  >  >
>  >  >   Of course KVM should service reads from the pending bitmask. For
>  >  >   assigned devices, it's the kernel that would set the pending bit; but
>  >  >   I am not sure for virtio. This interface is GET_ENTRY, so reading is
>  >  >   fine with it.
>
>  The kernel should manage it in the same way.  Virtio raises irq (via
>  KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit.
>
>  Note we need to be able to read and write the pending bitmask for live
>  migration.

Then it seems we still need a writing interface for it. And I think we can work
on it later since it has no user now.


I'm not sure.  Suppose a guest starts using pending bits.  Does it mean 
it will not work with kernel msix acceleration?


If that is the case, then we need pending bit support now.  I don't want 
to knowingly merge something that doesn't conform to the spec, forcing 
users to choose whether they want to enable kernel msix acceleration or not.





>
>  Because we have an interface where you get an exit if (addr % 4)<  3 and
>  don't get an exit if (addr % 4) == 3.  There is a gpa range which is
>  partially maintained by the kernel and partially in userspace.  It's a
>  confusing interface.  Things like 64-bit reads or writes need to be
>  broken up and serviced in two different places.
>
>  We already need to support this (for unaligned writes which hit two
>  regions), but let's at least make a contiguous region behave sanely.

Oh, I didn't mean to handle this kind of unaligned writing in userspace. They're
illegal according to the PCI spec (otherwise the result is undefined according to
the spec). I would cover all contiguous writes (32-bit and 64-bit) in the next
version, and discard all illegal writes. And 64-bit accesses would be broken up
in qemu as you said, as they do currently.

In fact I think we can hand all data for 64-bit accesses to qemu, because it
should still sync the mask bit with the kernel, which makes the maskbit writing
in userspace useless and can be ignored.


What about reads?  Split those as well?


>  The reason is to keep a sane interface.  Like we emulate instructions
>  and msrs in the kernel and don't do half a job.  I don't think there's a
>  real need to accelerate the first three words of an msi-x entry.

Here is the case we've observed with Xen. It can only be reproduced by
large-scale testing. When the interrupt intensity is very high, even new
kernels would try to lower it, so they access the mask bit very frequently.
And in the kernel, msi_set_mask_bit() is like this:

static void msi_set_mask_bit(struct irq_data *data, u32 flag)
{
 struct msi_desc *desc = irq_data_get_msi(data);

 if (desc->msi_attrib.is_msix) {
 msix_mask_irq(desc, flag);
 readl(desc->mask_base); /* Flush write to device */
 } else {
 unsigned offset = data->irq - desc->dev->irq;
 msi_mask_irq(desc, 1<<  offset, flag<<  offset);
 }
}

That flush by readl() accompanies each MSI-X mask write. That is the only
place we want to accelerate; otherwise it would still fall back to qemu each
time the guest wants to mask the MSI-X entry.


So it seems we do want to accelerate reads to the entire entry.  This 
seems to give more weight to making the kernel own the entire entry.



>
>  >  And BTW, we can take routing table as a kind of *cache*, if the content
>  >  is in the cache, then we can fetch it from the cache, otherwise we need
>  >  to go back to fetch it from memory(userspace).
>
>  If it's guaranteed by the spec that addr/data pairs are always
>  interpreted in the same way, sure.  But there's no reason to do it,
>  really, it isn't a fast path.

I don't quite understand the first sentence... But it's a fast path, in the
case I showed above.


Which case?  the readl() doesn't need access to the routing table, just 
the entry.


Oh, I think there is a terminology problem, I was talking about kvm's 
irq routing table, you were talking about the msix entries.


I think treating it as a cache causes more problems, since there are now 
two paths for reads (in cache and not in cache) and more things for 
writes to manage.


Here's my proposed API:

KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, 
pending_bitmap_base_gpa)


 - called when the guest enables msix

KVM_REMOVE_MSIX_TABLE(table_id)

  - called when the guest disables msix

KVM_SET_MSIX_ENTRY(table_id, entry_id, contents)

  - called when the guest enables msix (to initialize it), or after 
live migration


KVM_GET_MSIX_ENTRY(table_id, entry_id, contents)

  - called when the guest updates an msix table (triggered by 
KVM_EXIT_MSIX_ENTRY_UPDATED)


KVM_RUN -> KVM_EXIT_MSIX_ENTRY_UPDATED

  - returned from KVM_RUN when the guest updates an msix table entry

Re: Mask bit support's API

2010-11-30 Thread Yang, Sheng
On Tuesday 30 November 2010 22:15:29 Avi Kivity wrote:
> On 11/26/2010 04:35 AM, Yang, Sheng wrote:
> > >  >  Shouldn't kvm also service reads from the pending bitmask?
> > >  
> > >  Of course KVM should service reads from the pending bitmask. For
> > >  assigned devices, it's the kernel that would set the pending bit; but I am
> > >  not sure for virtio. This interface is GET_ENTRY, so reading is fine
> > >  with it.
> 
> The kernel should manage it in the same way.  Virtio raises irq (via
> KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit.
> 
> Note we need to be able to read and write the pending bitmask for live
> migration.

Then it seems we still need a writing interface for it. And I think we can
work on it later since it has no user now.
> 
> > >  >  We could have the kernel handle addr/data writes by setting up an
> > >  >  internal interrupt routing.  A disadvantage is that more work is
> > >  >  needed if we emulate interrupt remapping in qemu.
> > >  
> > >  In fact modifying irq routing in the kernel is also the thing I want
> > >  to avoid.
> > >  
> > >  So, the flow would be:
> > >  
> > >  kernel get MMIO write, record it in it's own MSI table
> > >  KVM exit to QEmu, by one specific exit reason
> > >  QEmu know it have to sync the MSI table, then reading the entries from
> > >  kernel QEmu found it's an write, so it need to reprogram irq routing
> > >  table using the entries above
> > >  done
> > >  
> > >  But wait, why should qemu read entries from kernel? By default exit we
> > >  already have the information about what's the entry to modify and what
> > >  to write, so we can use them directly. By this way, we also don't
> > >  need an specific exit reason - just exit to qemu in normal way is
> > >  fine.
> 
> Because we have an interface where you get an exit if (addr % 4) < 3 and
> don't get an exit if (addr % 4) == 3.  There is a gpa range which is
> partially maintained by the kernel and partially in userspace.  It's a
> confusing interface.  Things like 64-bit reads or writes need to be
> broken up and serviced in two different places.
> 
> We already need to support this (for unaligned writes which hit two
> regions), but let's at least make a contiguous region behave sanely.

Oh, I didn't mean to handle this kind of unaligned write in userspace. They're illegal according to the PCI spec (the result is undefined). I would cover all contiguous writes (32-bit and 64-bit) in the next version, and discard all illegal writes. And 64-bit accesses would be broken up in qemu as you said, as they are currently.

In fact I think we can hand all the data of a 64-bit write to qemu, because it would still sync the mask bit with the kernel, which makes the mask-bit write in userspace redundant and safe to ignore.
>
> > >  Then it would be:
> > >  
> > >  kernel get MMIO write, record it in it's own MSI table
> > >  KVM exit to QEmu, indicate MMIO exit
> > >  QEmu found it's an write, it would update it's own MSI table(may need
> > >  to query mask bit from kernel), and reprogram irq routing table using
> > >  the entries above done
> > >  
> > >  Then why should kernel kept it's own MSI table? I think the only
> > >  reason is we can speed up reading in that way - but the reading we
> > >  want to speed up is mostly on enabled entry(the first entry), which
> > >  is already in the IRQ routing table...
> 
> The reason is to keep a sane interface.  Like we emulate instructions
> and msrs in the kernel and don't do half a job.  I don't think there's a
> real need to accelerate the first three words of an msi-x entry.

Here is the case we've observed with Xen. It can only be reproduced by large-scale testing. When the interrupt intensity is very high, even new kernels would try to lower it, and then they access the mask bit very frequently. And in the kernel, msi_set_mask_bit() looks like this:

static void msi_set_mask_bit(struct irq_data *data, u32 flag)
{
	struct msi_desc *desc = irq_data_get_msi(data);

	if (desc->msi_attrib.is_msix) {
		msix_mask_irq(desc, flag);
		readl(desc->mask_base);	/* Flush write to device */
	} else {
		unsigned offset = data->irq - desc->dev->irq;
		msi_mask_irq(desc, 1 << offset, flag << offset);
	}
}

That readl() flush accompanies every MSI-X mask write. That is the only place we want to accelerate; otherwise it would still fall back to

Re: Mask bit support's API

2010-11-30 Thread Avi Kivity

On 11/26/2010 04:35 AM, Yang, Sheng wrote:

>  >
>  >  Shouldn't kvm also service reads from the pending bitmask?
>
>  Of course KVM should service reading from pending bitmask. For assigned
>  device, it's kernel who would set the pending bit; but I am not sure for
>  virtio. This interface is GET_ENTRY, so reading is fine with it.


The kernel should manage it in the same way.  Virtio raises irq (via 
KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit.


Note we need to be able to read and write the pending bitmask for live 
migration.



>  >
>  >  We could have the kernel handle addr/data writes by setting up an
>  >  internal interrupt routing.  A disadvantage is that more work is needed
>  >  if we emulator interrupt remapping in qemu.
>
>  In fact modifying irq routing in the kernel is also the thing I want to
>  avoid.
>
>  So, the flow would be:
>
>  kernel get MMIO write, record it in it's own MSI table
>  KVM exit to QEmu, by one specific exit reason
>  QEmu know it have to sync the MSI table, then reading the entries from
>  kernel QEmu found it's an write, so it need to reprogram irq routing table
>  using the entries above
>  done
>
>  But wait, why should qemu read entries from kernel? By default exit we
>  already have the information about what's the entry to modify and what to
>  write, so we can use them directly. By this way, we also don't need an
>  specific exit reason - just exit to qemu in normal way is fine.


Because we have an interface where you get an exit if (addr % 4) < 3 and 
don't get an exit if (addr % 4) == 3.  There is a gpa range which is 
partially maintained by the kernel and partially in userspace.  It's a 
confusing interface.  Things like 64-bit reads or writes need to be 
broken up and serviced in two different places.


We already need to support this (for unaligned writes which hit two 
regions), but let's at least make a contiguous region behave sanely.



>
>  Then it would be:
>
>  kernel get MMIO write, record it in it's own MSI table
>  KVM exit to QEmu, indicate MMIO exit
>  QEmu found it's an write, it would update it's own MSI table(may need to
>  query mask bit from kernel), and reprogram irq routing table using the
>  entries above done
>
>  Then why should kernel kept it's own MSI table? I think the only reason is
>  we can speed up reading in that way - but the reading we want to speed up
>  is mostly on enabled entry(the first entry), which is already in the IRQ
>  routing table...


The reason is to keep a sane interface.  Like we emulate instructions 
and msrs in the kernel and don't do half a job.  I don't think there's a 
real need to accelerate the first three words of an msi-x entry.
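For reference, this is the entry layout being discussed: the "first three words" are the address/data pair, and only the fourth word carries the mask bit (a sketch following the MSI-X layout in the PCI spec; the struct name is invented):

```c
/* MSI-X table entry layout per the PCI spec (16 bytes per entry).
 * Words 0-2 are the address/data pair Avi sees no need to accelerate;
 * word 3 (vector control, mask bit = bit 0) is the hot one. */
#include <stdint.h>

struct msix_entry_hw {
	uint32_t addr_lo;      /* word 0: message address low  */
	uint32_t addr_hi;      /* word 1: message address high */
	uint32_t data;         /* word 2: message data         */
	uint32_t vector_ctrl;  /* word 3: bit 0 = mask bit     */
};
```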



>  And for enabled/disabled entry, you can see it like this: for the entries
>  inside routing table, we think it's enabled; otherwise it's disabled. Then
>  you don't need to bothered by pci_enable_msix().
>
>  So our strategy for reading accelerating can be:
>
>  If the entry contained in irq routing table, then use it; otherwise let
>  qemu deal with it. Because it's the QEmu who owned irq routing table, the
>  synchronization is guaranteed. We don't need the MSI table in the kernel
>  then.


I agree about letting qemu manage the irq routing table.  It changes 
very rarely.  I just prefer to let it know about the change via 
something other than KVM_EXIT_MMIO.




>
>  And for writing, we just want to cover all of mask bit, but none of others.
>
>  I think the concept here is more acceptable?
>
>  The issue here is MSI table and irq routing table got duplicate information
>  on some entries. My initial purposal is to use irq routing table in
>  kernel, then we don't need to duplicate information.

Avi?


Sorry about the late reply.


And BTW, we can take the routing table as a kind of *cache*: if the content is in the cache, we can fetch it from the cache; otherwise we need to go back and fetch it from memory (userspace).


If it's guaranteed by the spec that addr/data pairs are always 
interpreted in the same way, sure.  But there no reason to do it, 
really, it isn't a fast path.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mask bit support's API

2010-11-25 Thread Yang, Sheng
On Wednesday 24 November 2010 09:59:23 Yang, Sheng wrote:
> On Tuesday 23 November 2010 22:06:20 Avi Kivity wrote:
> > On 11/23/2010 03:57 PM, Yang, Sheng wrote:
> > > >  >  Yeah, but won't be included in this patchset.
> > > >  
> > > >  What API changes are needed?  I'd like to see the complete API.
> > > 
> > > I am not sure about it. But I suppose the structure should be the same?
> > > In fact it's pretty hard for me to image what's needed for virtio in
> > > the future, especially there is no such code now. I really prefer to
> > > deal with assigned device and virtio separately, which would make the
> > > work much easier. But seems you won't agree on that.
> > 
> > First, I don't really see why the two cases are different (but I don't
> > do a lot in this space).  Surely between you and Michael, you have all
> > the information?
> > 
> > Second, my worry is a huge number of ABI variants that come from
> > incrementally adding features.  I want to implement bigger chunks of
> > functionality.  So I'd like to see all potential users addressed, at
> > least from the ABI point of view if not the implementation.
> > 
> > > >  The API needs to be compatible with the pending bit, even if we
> > > >  don't implement it now.  I want to reduce the rate of API changes.
> > > 
> > > This can be implemented by this API, just adding a flag for it. And I
> > > would still take this into consideration in the next API purposal.
> > 
> > Shouldn't kvm also service reads from the pending bitmask?
> 
> Of course KVM should service reading from pending bitmask. For assigned
> device, it's kernel who would set the pending bit; but I am not sure for
> virtio. This interface is GET_ENTRY, so reading is fine with it.
> 
> > > >  So instead of
> > > >  
> > > >  - guest reads/writes msix
> > > >  - kvm filters mmio, implements some, passes others to userspace
> > > >  
> > > >  we have
> > > >  
> > > >  - guest reads/writes msix
> > > >  - kvm implements all
> > > >  - some writes generate an additional notification to userspace
> > > 
> > > I suppose we don't need to generate notification to userspace? Because
> > > every read/write is handled by kernel, and userspace just need
> > > interface to kernel to get/set the entry - and well, does userspace
> > > need to do it when kernel can handle all of them? Maybe not...
> > 
> > We could have the kernel handle addr/data writes by setting up an
> > internal interrupt routing.  A disadvantage is that more work is needed
> > if we emulator interrupt remapping in qemu.
> 
> In fact modifying irq routing in the kernel is also the thing I want to
> avoid.
> 
> So, the flow would be:
> 
> kernel get MMIO write, record it in it's own MSI table
> KVM exit to QEmu, by one specific exit reason
> QEmu know it have to sync the MSI table, then reading the entries from
> kernel QEmu found it's an write, so it need to reprogram irq routing table
> using the entries above
> done
> 
> But wait, why should qemu read entries from kernel? By default exit we
> already have the information about what's the entry to modify and what to
> write, so we can use them directly. By this way, we also don't need an
> specific exit reason - just exit to qemu in normal way is fine.
> 
> Then it would be:
> 
> kernel get MMIO write, record it in it's own MSI table
> KVM exit to QEmu, indicate MMIO exit
> QEmu found it's an write, it would update it's own MSI table(may need to
> query mask bit from kernel), and reprogram irq routing table using the
> entries above done
> 
> Then why should kernel kept it's own MSI table? I think the only reason is
> we can speed up reading in that way - but the reading we want to speed up
> is mostly on enabled entry(the first entry), which is already in the IRQ
> routing table...
> 
> And for enabled/disabled entry, you can see it like this: for the entries
> inside routing table, we think it's enabled; otherwise it's disabled. Then
> you don't need to bothered by pci_enable_msix().
> 
> So our strategy for reading accelerating can be:
> 
> If the entry contained in irq routing table, then use it; otherwise let
> qemu deal with it. Because it's the QEmu who owned irq routing table, the
> synchronization is guaranteed. We don't need the MSI table in the kernel
> then.
> 
> And for writing, we just want to cover all of mask bit, but none of others.
> 
> I think the concept here is more acceptable?
> 
> The issue here is MSI table and irq routing table got duplicate information
> on some entries. My initial purposal is to use irq routing table in
> kernel, then we don't need to duplicate information.

Avi?

And BTW, we can take the routing table as a kind of *cache*: if the content is in the cache, we can fetch it from the cache; otherwise we need to go back and fetch it from memory (userspace).

--
regards
Yang, Sheng

> 
> 
> --
> regards
> Yang, Sheng

Re: Mask bit support's API

2010-11-23 Thread Yang, Sheng
On Tuesday 23 November 2010 22:06:20 Avi Kivity wrote:
> On 11/23/2010 03:57 PM, Yang, Sheng wrote:
> > >  >  Yeah, but won't be included in this patchset.
> > >  
> > >  What API changes are needed?  I'd like to see the complete API.
> > 
> > I am not sure about it. But I suppose the structure should be the same?
> > In fact it's pretty hard for me to image what's needed for virtio in the
> > future, especially there is no such code now. I really prefer to deal
> > with assigned device and virtio separately, which would make the work
> > much easier. But seems you won't agree on that.
> 
> First, I don't really see why the two cases are different (but I don't
> do a lot in this space).  Surely between you and Michael, you have all
> the information?
> 
> Second, my worry is a huge number of ABI variants that come from
> incrementally adding features.  I want to implement bigger chunks of
> functionality.  So I'd like to see all potential users addressed, at
> least from the ABI point of view if not the implementation.
> 
> > >  The API needs to be compatible with the pending bit, even if we don't
> > >  implement it now.  I want to reduce the rate of API changes.
> > 
> > This can be implemented by this API, just adding a flag for it. And I
> > would still take this into consideration in the next API purposal.
> 
> Shouldn't kvm also service reads from the pending bitmask?

Of course KVM should service reads from the pending bitmask. For assigned devices, it's the kernel that sets the pending bit; but I am not sure about virtio. This interface is GET_ENTRY, so reading is fine with it.
 
> > >  So instead of
> > >  
> > >  - guest reads/writes msix
> > >  - kvm filters mmio, implements some, passes others to userspace
> > >  
> > >  we have
> > >  
> > >  - guest reads/writes msix
> > >  - kvm implements all
> > >  - some writes generate an additional notification to userspace
> > 
> > I suppose we don't need to generate notification to userspace? Because
> > every read/write is handled by kernel, and userspace just need interface
> > to kernel to get/set the entry - and well, does userspace need to do it
> > when kernel can handle all of them? Maybe not...
> 
> We could have the kernel handle addr/data writes by setting up an
> internal interrupt routing.  A disadvantage is that more work is needed
> if we emulator interrupt remapping in qemu.

In fact modifying irq routing in the kernel is also the thing I want to avoid.

So, the flow would be:

kernel gets MMIO write, records it in its own MSI table
KVM exits to QEmu, with one specific exit reason
QEmu knows it has to sync the MSI table, so it reads the entries from kernel
QEmu finds it's a write, so it needs to reprogram the irq routing table using the entries above
done

But wait, why should qemu read the entries from the kernel? With the default exit we already have the information about which entry to modify and what to write, so we can use it directly. That way, we also don't need a specific exit reason - just exiting to qemu in the normal way is fine.

Then it would be:

kernel gets MMIO write, records it in its own MSI table
KVM exits to QEmu, indicating an MMIO exit
QEmu finds it's a write, updates its own MSI table (it may need to query the mask bit from the kernel), and reprograms the irq routing table using the entries above
done

Then why should the kernel keep its own MSI table? I think the only reason is that we can speed up reads that way - but the reads we want to speed up are mostly on the enabled entry (the first entry), which is already in the IRQ routing table...

And for enabled/disabled entries, you can see it like this: the entries inside the routing table are considered enabled; all others are disabled. Then you don't need to be bothered by pci_enable_msix().

So our strategy for accelerating reads can be:

If the entry is contained in the irq routing table, use it; otherwise let qemu deal with it. Because it's QEmu that owns the irq routing table, synchronization is guaranteed. We don't need the MSI table in the kernel then.

And for writing, we just want to cover the mask bit, but none of the others.

I think the concept here is more acceptable?

The issue here is that the MSI table and the irq routing table hold duplicate information for some entries. My initial proposal is to use the irq routing table in the kernel, so we don't need to duplicate the information.


--
regards
Yang, Sheng


Re: Mask bit support's API

2010-11-23 Thread Michael S. Tsirkin
On Tue, Nov 23, 2010 at 05:24:44PM +0200, Gleb Natapov wrote:
> On Tue, Nov 23, 2010 at 05:11:19PM +0200, Michael S. Tsirkin wrote:
> > On Tue, Nov 23, 2010 at 04:06:20PM +0200, Avi Kivity wrote:
> > > >>
> > > >>  So instead of
> > > >>
> > > >>  - guest reads/writes msix
> > > >>  - kvm filters mmio, implements some, passes others to userspace
> > > >>
> > > >>  we have
> > > >>
> > > >>  - guest reads/writes msix
> > > >>  - kvm implements all
> > > >>  - some writes generate an additional notification to userspace
> > > >
> > > >I suppose we don't need to generate notification to userspace? Because 
> > > >every
> > > >read/write is handled by kernel, and userspace just need interface to 
> > > >kernel to
> > > >get/set the entry - and well, does userspace need to do it when kernel 
> > > >can handle
> > > >all of them? Maybe not...
> > > 
> > > We could have the kernel handle addr/data writes by setting up an
> > > internal interrupt routing.  A disadvantage is that more work is
> > > needed if we emulator interrupt remapping in qemu.
> > 
> > As an alternative, interrupt remapping will need some API rework, right?
> > Existing APIs only pass address/data for msi.
> > 
> IIRC interrupt remapping works with address/data to. It just interpret
> it differently from apic.

Yes. So since our APIs use address/data, this is an argument for doing
the remapping in kernel.

> --
>   Gleb.


Re: Mask bit support's API

2010-11-23 Thread Gleb Natapov
On Tue, Nov 23, 2010 at 05:11:19PM +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 23, 2010 at 04:06:20PM +0200, Avi Kivity wrote:
> > >>
> > >>  So instead of
> > >>
> > >>  - guest reads/writes msix
> > >>  - kvm filters mmio, implements some, passes others to userspace
> > >>
> > >>  we have
> > >>
> > >>  - guest reads/writes msix
> > >>  - kvm implements all
> > >>  - some writes generate an additional notification to userspace
> > >
> > >I suppose we don't need to generate notification to userspace? Because 
> > >every
> > >read/write is handled by kernel, and userspace just need interface to 
> > >kernel to
> > >get/set the entry - and well, does userspace need to do it when kernel can 
> > >handle
> > >all of them? Maybe not...
> > 
> > We could have the kernel handle addr/data writes by setting up an
> > internal interrupt routing.  A disadvantage is that more work is
> > needed if we emulator interrupt remapping in qemu.
> 
> As an alternative, interrupt remapping will need some API rework, right?
> Existing APIs only pass address/data for msi.
> 
IIRC interrupt remapping works with address/data too. It just interprets them differently from the apic.

--
Gleb.


Re: Mask bit support's API

2010-11-23 Thread Michael S. Tsirkin
On Tue, Nov 23, 2010 at 04:06:20PM +0200, Avi Kivity wrote:
> >>
> >>  So instead of
> >>
> >>  - guest reads/writes msix
> >>  - kvm filters mmio, implements some, passes others to userspace
> >>
> >>  we have
> >>
> >>  - guest reads/writes msix
> >>  - kvm implements all
> >>  - some writes generate an additional notification to userspace
> >
> >I suppose we don't need to generate notification to userspace? Because every
> >read/write is handled by kernel, and userspace just need interface to kernel 
> >to
> >get/set the entry - and well, does userspace need to do it when kernel can 
> >handle
> >all of them? Maybe not...
> 
> We could have the kernel handle addr/data writes by setting up an
> internal interrupt routing.  A disadvantage is that more work is
> needed if we emulator interrupt remapping in qemu.

As an alternative, interrupt remapping will need some API rework, right?
Existing APIs only pass address/data for msi.

-- 
MST


Re: Mask bit support's API

2010-11-23 Thread Avi Kivity

On 11/23/2010 03:57 PM, Yang, Sheng wrote:

>  >
>  >  Yeah, but won't be included in this patchset.
>
>  What API changes are needed?  I'd like to see the complete API.

I am not sure about it. But I suppose the structure should be the same? In fact it's pretty hard for me to imagine what's needed for virtio in the future, especially as there is no such code now. I really prefer to deal with assigned devices and virtio separately, which would make the work much easier. But it seems you won't agree on that.


First, I don't really see why the two cases are different (but I don't 
do a lot in this space).  Surely between you and Michael, you have all 
the information?


Second, my worry is a huge number of ABI variants that come from 
incrementally adding features.  I want to implement bigger chunks of 
functionality.  So I'd like to see all potential users addressed, at 
least from the ABI point of view if not the implementation.



>  The API needs to be compatible with the pending bit, even if we don't
>  implement it now.  I want to reduce the rate of API changes.

This can be implemented by this API, just by adding a flag for it. And I would still take this into consideration in the next API proposal.


Shouldn't kvm also service reads from the pending bitmask?


>
>  So instead of
>
>  - guest reads/writes msix
>  - kvm filters mmio, implements some, passes others to userspace
>
>  we have
>
>  - guest reads/writes msix
>  - kvm implements all
>  - some writes generate an additional notification to userspace

I suppose we don't need to generate a notification to userspace? Because every read/write is handled by the kernel, and userspace just needs an interface to the kernel to get/set the entry - and well, does userspace need to do that when the kernel can handle all of them? Maybe not...


We could have the kernel handle addr/data writes by setting up an 
internal interrupt routing.  A disadvantage is that more work is needed 
if we emulator interrupt remapping in qemu.


--
error compiling committee.c: too many arguments to function



Re: Mask bit support's API

2010-11-23 Thread Yang, Sheng
On Tuesday 23 November 2010 20:04:16 Michael S. Tsirkin wrote:
> On Tue, Nov 23, 2010 at 02:09:52PM +0800, Yang, Sheng wrote:
> > Hi Avi,
> > 
> > I've purposed the following API for mask bit support.
> > 
> > The main point is, QEmu can know which entries are enabled(by
> > pci_enable_msix()).
> 
> Unfortunately, it can't I think, unless all your guests are linux.
> "enabled entries" is a linux kernel concept.
> The MSIX spec only tells you which entries are masked and which are
> unmasked.

I can't understand what you are talking about, or how it relates to the guest OS. I was talking about pci_enable_msix() in the host Linux.

--
regards
Yang, Sheng


Re: Mask bit support's API

2010-11-23 Thread Yang, Sheng
On Tuesday 23 November 2010 20:47:33 Avi Kivity wrote:
> On 11/23/2010 10:30 AM, Yang, Sheng wrote:
> > On Tuesday 23 November 2010 15:54:40 Avi Kivity wrote:
> > >  On 11/23/2010 08:35 AM, Yang, Sheng wrote:
> > >  >  On Tuesday 23 November 2010 14:17:28 Avi Kivity wrote:
> > >  >  >   On 11/23/2010 08:09 AM, Yang, Sheng wrote:
> > >  >  >   >   Hi Avi,
> > >  >  >   >   
> > >  >  >   >   I've purposed the following API for mask bit support.
> > >  >  >   >   
> > >  >  >   >   The main point is, QEmu can know which entries are
> > >  >  >   >   enabled(by pci_enable_msix()). And for enabled entries,
> > >  >  >   >   kernel own it, including MSI data/address and mask
> > >  >  >   >   bit(routing table and mask bitmap). QEmu should use
> > >  >  >   >   KVM_GET_MSIX_ENTRY ioctl to get them(and it can sync with
> > >  >  >   >   them if it want to do so).
> > >  >  >   >   
> > >  >  >   >   Before entries are enabled, QEmu can still use it's own MSI
> > >  >  >   >   table(because we didn't contain these kind of information
> > >  >  >   >   in kernel, and it's unnecessary for kernel).
> > >  >  >   >   
> > >  >  >   >   The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to
> > >  >  >   >   query one entry didn't exist in kernel - or we can simply
> > >  >  >   >   return -EINVAL for it.
> > >  >  >   >   
> > >  >  >   >   I suppose it would be rare for QEmu to use this interface
> > >  >  >   >   to get the context of entry(the only case I think is when
> > >  >  >   >   MSI-X disable and QEmu need to sync the context), so
> > >  >  >   >   performance should not be an issue.
> > >  >  >   >   
> > >  >  >   >   What's your opinion?
> > >  >  >   >   
> > >  >  >   >   >#define KVM_GET_MSIX_ENTRY_IOWR(KVMIO,  0x7d,
> > >  >  >   >   >struct kvm_msix_entry)
> > >  >  >   
> > >  >  >   Need SET_MSIX_ENTRY for live migration as well.
> > >  >  
> > >  >  Current we don't support LM with VT-d...
> > >  
> > >  Isn't this work useful for virtio as well?
> > 
> > Yeah, but won't be included in this patchset.
> 
> What API changes are needed?  I'd like to see the complete API.

I am not sure about it. But I suppose the structure should be the same? In fact it's pretty hard for me to imagine what's needed for virtio in the future, especially as there is no such code now. I really prefer to deal with assigned devices and virtio separately, which would make the work much easier. But it seems you won't agree on that.

> 
> > >  >  >   What about the pending bits?
> > >  >  
> > >  >  We didn't cover it here - and it's in another MMIO space(PBA). Of
> > >  >  course we can add more flags for it later.
> > >  
> > >  When an entry is masked, we need to set the pending bit for it
> > >  somewhere.  I guess this is broken in the existing code (without your
> > >  patches)?
> > 
> > Even with my patch, we didn't support the pending bit. It would always
> > return 0 now. What we supposed to do(after my patch checked in) is to
> > check IRQ_PENDING flag of irq_desc->status(if the entry is masked), and
> > return the result to userspace.
> > 
> > That would involve some core change, like to export irq_to_desc(). I
> > don't think it would be accepted soon, so would push mask bit first.
> 
> The API needs to be compatible with the pending bit, even if we don't
> implement it now.  I want to reduce the rate of API changes.

This can be implemented by this API, just by adding a flag for it. And I would still take this into consideration in the next API proposal.
 
> > >  >  >   Also need a new exit reason to tell userspace that an msix
> > >  >  >   entry has changed, so userspace can update mappings.
> > >  >  
> > >  >  I think we don't need it. Whenever userspace want to get one
> > >  >  mapping which is an enabled MSI-X entry, it can check it with the
> > >  >  API above(which is quite rare, because kernel would handle all of
> > >  >  them when guest is accessing them). If it's a disabled entry, the
> > >  >  context inside userspace MMIO record is the correct one(and only
> > >  >  one). The only place I think QEmu need to sync is when MSI-X is
> > >  >  about to disabled, QEmu need to update it's own MMIO record.
> > >  
> > >  So in-kernel handling of mmio would be decided per entry?  I'm trying
> > >  to simplify this, and simplest thing is - all or nothing.
> > 
> > So you would like to handle all MSI-X MMIO in kernel?
> 
> Yes.  Writes to address or data would be handled by:
> - recording it into the shadow msix table
> - notifying userspace that msix entry x changed
> Reads would be handled in kernel from the shadow msix table.
> 
> So instead of
> 
> - guest reads/writes msix
> - kvm filters mmio, implements some, passes others to userspace
> 
> we have
> 
> - guest reads/writes msix
> - kvm implements all
> - some writes generate an additional notification to userspace

I suppose we don't need to generate notification to userspace? Because every 
read/write is handled by kernel, and user

Re: Mask bit support's API

2010-11-23 Thread Michael S. Tsirkin
On Tue, Nov 23, 2010 at 02:47:33PM +0200, Avi Kivity wrote:
> On 11/23/2010 10:30 AM, Yang, Sheng wrote:
> >On Tuesday 23 November 2010 15:54:40 Avi Kivity wrote:
> >>  On 11/23/2010 08:35 AM, Yang, Sheng wrote:
> >>  >  On Tuesday 23 November 2010 14:17:28 Avi Kivity wrote:
> >>  >  >   On 11/23/2010 08:09 AM, Yang, Sheng wrote:
> >>  >  >   >   Hi Avi,
> >>  >  >   >
> >>  >  >   >   I've purposed the following API for mask bit support.
> >>  >  >   >
> >>  >  >   >   The main point is, QEmu can know which entries are enabled(by
> >>  >  >   >   pci_enable_msix()). And for enabled entries, kernel own it,
> >>  >  >   >   including MSI data/address and mask bit(routing table and mask
> >>  >  >   >   bitmap). QEmu should use KVM_GET_MSIX_ENTRY ioctl to get 
> >> them(and
> >>  >  >   >   it can sync with them if it want to do so).
> >>  >  >   >
> >>  >  >   >   Before entries are enabled, QEmu can still use it's own MSI
> >>  >  >   >   table(because we didn't contain these kind of information in
> >>  >  >   >   kernel, and it's unnecessary for kernel).
> >>  >  >   >
> >>  >  >   >   The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to 
> >> query
> >>  >  >   >   one entry didn't exist in kernel - or we can simply return 
> >> -EINVAL
> >>  >  >   >   for it.
> >>  >  >   >
> >>  >  >   >   I suppose it would be rare for QEmu to use this interface to 
> >> get
> >>  >  >   >   the context of entry(the only case I think is when MSI-X 
> >> disable
> >>  >  >   >   and QEmu need to sync the context), so performance should not 
> >> be
> >>  >  >   >   an issue.
> >>  >  >   >
> >>  >  >   >   What's your opinion?
> >>  >  >   >
> >>  >  >   >   >#define KVM_GET_MSIX_ENTRY_IOWR(KVMIO,  0x7d, 
> >> struct
> >>  >  >   >   >kvm_msix_entry)
> >>  >  >
> >>  >  >   Need SET_MSIX_ENTRY for live migration as well.
> >>  >
> >>  >  Current we don't support LM with VT-d...
> >>
> >>  Isn't this work useful for virtio as well?
> >
> >Yeah, but won't be included in this patchset.
> 
> What API changes are needed?  I'd like to see the complete API.
> 
> >>  >  >   What about the pending bits?
> >>  >
> >>  >  We didn't cover it here - and it's in another MMIO space(PBA). Of course
> >>  >  we can add more flags for it later.
> >>
> >>  When an entry is masked, we need to set the pending bit for it
> >>  somewhere.  I guess this is broken in the existing code (without your
> >>  patches)?
> >
> >Even with my patch, we didn't support the pending bit. It would always return 0
> >now. What we supposed to do(after my patch checked in) is to check IRQ_PENDING flag
> >of irq_desc->status(if the entry is masked), and return the result to userspace.
> >
> >That would involve some core change, like to export irq_to_desc(). I don't think
> >it would be accepted soon, so would push mask bit first.
> 
> The API needs to be compatible with the pending bit, even if we
> don't implement it now.  I want to reduce the rate of API changes.
> 
> >>
> >>  >  >   Also need a new exit reason to tell userspace that an msix entry has
> >>  >  >   changed, so userspace can update mappings.
> >>  >
> >>  >  I think we don't need it. Whenever userspace want to get one mapping
> >>  >  which is an enabled MSI-X entry, it can check it with the API
> >>  >  above(which is quite rare, because kernel would handle all of them when
> >>  >  guest is accessing them). If it's a disabled entry, the context inside
> >>  >  userspace MMIO record is the correct one(and only one). The only place I
> >>  >  think QEmu need to sync is when MSI-X is about to disabled, QEmu need to
> >>  >  update it's own MMIO record.
> >>
> >>  So in-kernel handling of mmio would be decided per entry?  I'm trying to
> >>  simplify this, and simplest thing is - all or nothing.
> >
> >So you would like to handle all MSI-X MMIO in kernel?
> 
> Yes.  Writes to address or data would be handled by:
> - recording it into the shadow msix table
> - notifying userspace that msix entry x changed
> Reads would be handled in kernel from the shadow msix table.
> 
> So instead of
> 
> - guest reads/writes msix
> - kvm filters mmio, implements some, passes others to userspace
> 
> we have
> 
> - guest reads/writes msix
> - kvm implements all
> - some writes generate an additional notification to userspace

One small proposal in addition: since all accesses are done from guest
anyway, the shadow table can/should be stored using userspace memory,
reducing the kernel memory overhead of the feature from up to 4K per
MSIX table to just 8 bytes.

Active entries can be cached in kernel memory.

> 
> -- 
> error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mask bit support's API

2010-11-23 Thread Avi Kivity

On 11/23/2010 10:30 AM, Yang, Sheng wrote:

On Tuesday 23 November 2010 15:54:40 Avi Kivity wrote:
>  On 11/23/2010 08:35 AM, Yang, Sheng wrote:
>  >  On Tuesday 23 November 2010 14:17:28 Avi Kivity wrote:
>  >  >   On 11/23/2010 08:09 AM, Yang, Sheng wrote:
>  >  >   >   Hi Avi,
>  >  >   >
>  >  >   >   I've proposed the following API for mask bit support.
>  >  >   >
>  >  >   >   The main point is, QEmu can know which entries are enabled(by
>  >  >   >   pci_enable_msix()). And for enabled entries, kernel own it,
>  >  >   >   including MSI data/address and mask bit(routing table and mask
>  >  >   >   bitmap). QEmu should use KVM_GET_MSIX_ENTRY ioctl to get them(and
>  >  >   >   it can sync with them if it want to do so).
>  >  >   >
>  >  >   >   Before entries are enabled, QEmu can still use it's own MSI
>  >  >   >   table(because we didn't contain these kind of information in
>  >  >   >   kernel, and it's unnecessary for kernel).
>  >  >   >
>  >  >   >   The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query
>  >  >   >   one entry didn't exist in kernel - or we can simply return -EINVAL
>  >  >   >   for it.
>  >  >   >
>  >  >   >   I suppose it would be rare for QEmu to use this interface to get
>  >  >   >   the context of entry(the only case I think is when MSI-X disable
>  >  >   >   and QEmu need to sync the context), so performance should not be
>  >  >   >   an issue.
>  >  >   >
>  >  >   >   What's your opinion?
>  >  >   >
>  >  >   >   >#define KVM_GET_MSIX_ENTRY   _IOWR(KVMIO,  0x7d, struct
>  >  >   >   >kvm_msix_entry)
>  >  >
>  >  >   Need SET_MSIX_ENTRY for live migration as well.
>  >
>  >  Current we don't support LM with VT-d...
>
>  Isn't this work useful for virtio as well?

Yeah, but won't be included in this patchset.


What API changes are needed?  I'd like to see the complete API.


>  >  >   What about the pending bits?
>  >
>  >  We didn't cover it here - and it's in another MMIO space(PBA). Of course
>  >  we can add more flags for it later.
>
>  When an entry is masked, we need to set the pending bit for it
>  somewhere.  I guess this is broken in the existing code (without your
>  patches)?

Even with my patch, we didn't support the pending bit. It would always return 0
now. What we supposed to do(after my patch checked in) is to check IRQ_PENDING flag
of irq_desc->status(if the entry is masked), and return the result to userspace.

That would involve some core change, like to export irq_to_desc(). I don't think
it would be accepted soon, so would push mask bit first.


The API needs to be compatible with the pending bit, even if we don't 
implement it now.  I want to reduce the rate of API changes.



>
>  >  >   Also need a new exit reason to tell userspace that an msix entry has
>  >  >   changed, so userspace can update mappings.
>  >
>  >  I think we don't need it. Whenever userspace want to get one mapping
>  >  which is an enabled MSI-X entry, it can check it with the API
>  >  above(which is quite rare, because kernel would handle all of them when
>  >  guest is accessing them). If it's a disabled entry, the context inside
>  >  userspace MMIO record is the correct one(and only one). The only place I
>  >  think QEmu need to sync is when MSI-X is about to disabled, QEmu need to
>  >  update it's own MMIO record.
>
>  So in-kernel handling of mmio would be decided per entry?  I'm trying to
>  simplify this, and simplest thing is - all or nothing.

So you would like to handle all MSI-X MMIO in kernel?


Yes.  Writes to address or data would be handled by:
- recording it into the shadow msix table
- notifying userspace that msix entry x changed
Reads would be handled in kernel from the shadow msix table.

So instead of

- guest reads/writes msix
- kvm filters mmio, implements some, passes others to userspace

we have

- guest reads/writes msix
- kvm implements all
- some writes generate an additional notification to userspace


--
error compiling committee.c: too many arguments to function



Re: Mask bit support's API

2010-11-23 Thread Michael S. Tsirkin
On Tue, Nov 23, 2010 at 02:09:52PM +0800, Yang, Sheng wrote:
> Hi Avi,
> 
> I've proposed the following API for mask bit support.
> 
> The main point is, QEmu can know which entries are enabled(by pci_enable_msix()).

Unfortunately, it can't, I think, unless all your guests are Linux.
"enabled entries" is a linux kernel concept.
The MSIX spec only tells you which entries are masked and which are unmasked.

-- 
MST


Re: Mask bit support's API

2010-11-23 Thread Yang, Sheng
On Tuesday 23 November 2010 15:54:40 Avi Kivity wrote:
> On 11/23/2010 08:35 AM, Yang, Sheng wrote:
> > On Tuesday 23 November 2010 14:17:28 Avi Kivity wrote:
> > >  On 11/23/2010 08:09 AM, Yang, Sheng wrote:
> > >  >  Hi Avi,
> > >  >  
> > >  >  I've proposed the following API for mask bit support.
> > >  >  
> > >  >  The main point is, QEmu can know which entries are enabled(by
> > >  >  pci_enable_msix()). And for enabled entries, kernel own it,
> > >  >  including MSI data/address and mask bit(routing table and mask
> > >  >  bitmap). QEmu should use KVM_GET_MSIX_ENTRY ioctl to get them(and
> > >  >  it can sync with them if it want to do so).
> > >  >  
> > >  >  Before entries are enabled, QEmu can still use it's own MSI
> > >  >  table(because we didn't contain these kind of information in
> > >  >  kernel, and it's unnecessary for kernel).
> > >  >  
> > >  >  The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query
> > >  >  one entry didn't exist in kernel - or we can simply return -EINVAL
> > >  >  for it.
> > >  >  
> > >  >  I suppose it would be rare for QEmu to use this interface to get
> > >  >  the context of entry(the only case I think is when MSI-X disable
> > >  >  and QEmu need to sync the context), so performance should not be
> > >  >  an issue.
> > >  >  
> > >  >  What's your opinion?
> > >  >  
> > >  >  >   #define KVM_GET_MSIX_ENTRY   _IOWR(KVMIO,  0x7d, struct
> > >  >  >   kvm_msix_entry)
> > >  
> > >  Need SET_MSIX_ENTRY for live migration as well.
> > 
> > Current we don't support LM with VT-d...
> 
> Isn't this work useful for virtio as well?

Yeah, but won't be included in this patchset.
> 
> > >  >  >   #define KVM_UPDATE_MSIX_MMIO  _IOW(KVMIO,  0x7e, struct
> > >  >  >   kvm_msix_mmio)
> > >  >  >   
> > >  >  >   #define KVM_MSIX_TYPE_ASSIGNED_DEV  1
> > >  >  >   
> > >  >  >   #define KVM_MSIX_FLAG_MASKBIT   (1<<   0)
> > >  >  >   #define KVM_MSIX_FLAG_QUERY_MASKBIT (1<<   0)
> > >  >  >   #define KVM_MSIX_FLAG_ENTRY (1<<   1)
> > >  >  >   #define KVM_MSIX_FLAG_QUERY_ENTRY   (1<<   1)
> > >  
> > >  Why is there a need for the flag?  If we simply get/set entire
> > >  entries, that includes the mask bits?
> > 
> > We still want QEmu to cover a part of entries which hasn't been enabled
> > yet(which won't existed in routing table), but kernel would cover all
> > mask bit regardless of if it's enabled. So QEmu can query any entry to
> > check the maskbit, but not address/data.
> 
> Don't understand.  If we support reading/writing entire entries, that
> works for both enabled and disabled entries?
> 
> > >  What about the pending bits?
> > 
> > We didn't cover it here - and it's in another MMIO space(PBA). Of course
> > we can add more flags for it later.
> 
> When an entry is masked, we need to set the pending bit for it
> somewhere.  I guess this is broken in the existing code (without your
> patches)?

Even with my patch, we didn't support the pending bit. It would always return 0
now. What we supposed to do(after my patch checked in) is to check IRQ_PENDING flag
of irq_desc->status(if the entry is masked), and return the result to userspace.

That would involve some core change, like to export irq_to_desc(). I don't think
it would be accepted soon, so would push mask bit first.

> 
> > >  Also need a new exit reason to tell userspace that an msix entry has
> > >  changed, so userspace can update mappings.
> > 
> > I think we don't need it. Whenever userspace want to get one mapping
> > which is an enabled MSI-X entry, it can check it with the API
> > above(which is quite rare, because kernel would handle all of them when
> > guest is accessing them). If it's a disabled entry, the context inside
> > userspace MMIO record is the correct one(and only one). The only place I
> > think QEmu need to sync is when MSI-X is about to disabled, QEmu need to
> > update it's own MMIO record.
> 
> So in-kernel handling of mmio would be decided per entry?  I'm trying to
> simplify this, and simplest thing is - all or nothing.

So you would like to handle all MSI-X MMIO in kernel?

--
regards
Yang, Sheng


Re: Mask bit support's API

2010-11-22 Thread Avi Kivity

On 11/23/2010 08:35 AM, Yang, Sheng wrote:

On Tuesday 23 November 2010 14:17:28 Avi Kivity wrote:
>  On 11/23/2010 08:09 AM, Yang, Sheng wrote:
>  >  Hi Avi,
>  >
>  >  I've proposed the following API for mask bit support.
>  >
>  >  The main point is, QEmu can know which entries are enabled(by
>  >  pci_enable_msix()). And for enabled entries, kernel own it, including
>  >  MSI data/address and mask bit(routing table and mask bitmap). QEmu
>  >  should use KVM_GET_MSIX_ENTRY ioctl to get them(and it can sync with
>  >  them if it want to do so).
>  >
>  >  Before entries are enabled, QEmu can still use it's own MSI table(because
>  >  we didn't contain these kind of information in kernel, and it's
>  >  unnecessary for kernel).
>  >
>  >  The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query one
>  >  entry didn't exist in kernel - or we can simply return -EINVAL for it.
>  >
>  >  I suppose it would be rare for QEmu to use this interface to get the
>  >  context of entry(the only case I think is when MSI-X disable and QEmu
>  >  need to sync the context), so performance should not be an issue.
>  >
>  >  What's your opinion?
>  >
>  >  >   #define KVM_GET_MSIX_ENTRY   _IOWR(KVMIO,  0x7d, struct
>  >  >   kvm_msix_entry)
>
>  Need SET_MSIX_ENTRY for live migration as well.

Currently we don't support LM with VT-d...


Isn't this work useful for virtio as well?


>
>  >  >   #define KVM_UPDATE_MSIX_MMIO  _IOW(KVMIO,  0x7e, struct
>  >  >   kvm_msix_mmio)
>  >  >
>  >  >   #define KVM_MSIX_TYPE_ASSIGNED_DEV  1
>  >  >
>  >  >   #define KVM_MSIX_FLAG_MASKBIT   (1<<   0)
>  >  >   #define KVM_MSIX_FLAG_QUERY_MASKBIT (1<<   0)
>  >  >   #define KVM_MSIX_FLAG_ENTRY (1<<   1)
>  >  >   #define KVM_MSIX_FLAG_QUERY_ENTRY   (1<<   1)
>
>  Why is there a need for the flag?  If we simply get/set entire entries,
>  that includes the mask bits?

We still want QEmu to cover a part of entries which hasn't been enabled yet(which
won't existed in routing table), but kernel would cover all mask bit regardless of
if it's enabled. So QEmu can query any entry to check the maskbit, but not
address/data.


Don't understand.  If we support reading/writing entire entries, that 
works for both enabled and disabled entries?




>  What about the pending bits?

We didn't cover it here - and it's in another MMIO space(PBA). Of course we can
add more flags for it later.


When an entry is masked, we need to set the pending bit for it 
somewhere.  I guess this is broken in the existing code (without your 
patches)?



>
>  Also need a new exit reason to tell userspace that an msix entry has
>  changed, so userspace can update mappings.

I think we don't need it. Whenever userspace want to get one mapping which is an
enabled MSI-X entry, it can check it with the API above(which is quite rare,
because kernel would handle all of them when guest is accessing them). If it's a
disabled entry, the context inside userspace MMIO record is the correct one(and
only one). The only place I think QEmu need to sync is when MSI-X is about to
disabled, QEmu need to update it's own MMIO record.



So in-kernel handling of mmio would be decided per entry?  I'm trying to 
simplify this, and simplest thing is - all or nothing.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: Mask bit support's API

2010-11-22 Thread Yang, Sheng
On Tuesday 23 November 2010 14:17:28 Avi Kivity wrote:
> On 11/23/2010 08:09 AM, Yang, Sheng wrote:
> > Hi Avi,
> > 
> > I've proposed the following API for mask bit support.
> > 
> > The main point is, QEmu can know which entries are enabled(by
> > pci_enable_msix()). And for enabled entries, kernel own it, including
> > MSI data/address and mask bit(routing table and mask bitmap). QEmu
> > should use KVM_GET_MSIX_ENTRY ioctl to get them(and it can sync with
> > them if it want to do so).
> > 
> > Before entries are enabled, QEmu can still use it's own MSI table(because
> > we didn't contain these kind of information in kernel, and it's
> > unnecessary for kernel).
> > 
> > The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query one
> > entry didn't exist in kernel - or we can simply return -EINVAL for it.
> > 
> > I suppose it would be rare for QEmu to use this interface to get the
> > context of entry(the only case I think is when MSI-X disable and QEmu
> > need to sync the context), so performance should not be an issue.
> > 
> > What's your opinion?
> > 
> > >  #define KVM_GET_MSIX_ENTRY   _IOWR(KVMIO,  0x7d, struct
> > >  kvm_msix_entry)
> 
> Need SET_MSIX_ENTRY for live migration as well.

Currently we don't support LM with VT-d...
> 
> > >  #define KVM_UPDATE_MSIX_MMIO  _IOW(KVMIO,  0x7e, struct
> > >  kvm_msix_mmio)
> > >  
> > >  #define KVM_MSIX_TYPE_ASSIGNED_DEV  1
> > >  
> > >  #define KVM_MSIX_FLAG_MASKBIT   (1<<  0)
> > >  #define KVM_MSIX_FLAG_QUERY_MASKBIT (1<<  0)
> > >  #define KVM_MSIX_FLAG_ENTRY (1<<  1)
> > >  #define KVM_MSIX_FLAG_QUERY_ENTRY   (1<<  1)
> 
> Why is there a need for the flag?  If we simply get/set entire entries,
> that includes the mask bits?

We still want QEmu to cover a part of entries which hasn't been enabled yet(which
won't existed in routing table), but kernel would cover all mask bit regardless of
if it's enabled. So QEmu can query any entry to check the maskbit, but not
address/data.
 
> What about the pending bits?

We didn't cover it here - and it's in another MMIO space(PBA). Of course we can 
add more flags for it later.
> 
> > >  struct kvm_msix_entry {
> > >  
> > >  __u32 id;
> > >  __u32 type;
> > >  __u32 entry; /* The index of entry in the MSI-X table */
> > >  __u32 flags;
> > >  __u32 query_flags;
> > >  union {
> > >  
> > >  struct {
> > >  
> > >  __u32 addr_lo;
> > >  __u32 addr_hi;
> > >  __u32 data;
> 
> Isn't the mask bit in the last word?  Or maybe I'm confused about the
> format.

I separated the entry and mask bit as I said above.
> 
> > >  } msi_entry;
> > >  __u32 reserved[12];
> > >  
> > >  };
> > >  
> > >  };
> > >  
> > >  #define KVM_MSIX_MMIO_FLAG_REGISTER (1<<  0)
> > >  #define KVM_MSIX_MMIO_FLAG_UNREGISTER   (1<<  1)
> > >  #define KVM_MSIX_MMIO_FLAG_MASK 0x3
> > >  
> > >  struct kvm_msix_mmio {
> > >  
> > >  __u32 id;
> > >  __u32 type;
> > >  __u64 base_addr;
> > >  __u32 max_entries_nr;
> > >  __u32 flags;
> > >  __u32 reserved[6];
> > >  
> > >  };
> 
> Also need a new exit reason to tell userspace that an msix entry has
> changed, so userspace can update mappings.

I think we don't need it. Whenever userspace want to get one mapping which is an
enabled MSI-X entry, it can check it with the API above(which is quite rare,
because kernel would handle all of them when guest is accessing them). If it's a
disabled entry, the context inside userspace MMIO record is the correct one(and
only one). The only place I think QEmu need to sync is when MSI-X is about to
disabled, QEmu need to update it's own MMIO record.

--
regards
Yang, Sheng


Re: Mask bit support's API

2010-11-22 Thread Avi Kivity

On 11/23/2010 08:09 AM, Yang, Sheng wrote:

Hi Avi,

I've proposed the following API for mask bit support.

The main point is, QEmu can know which entries are enabled(by pci_enable_msix()).
And for enabled entries, kernel own it, including MSI data/address and mask
bit(routing table and mask bitmap). QEmu should use KVM_GET_MSIX_ENTRY ioctl to
get them(and it can sync with them if it want to do so).

Before entries are enabled, QEmu can still use it's own MSI table(because we
didn't contain these kind of information in kernel, and it's unnecessary for
kernel).

The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query one entry didn't
exist in kernel - or we can simply return -EINVAL for it.

I suppose it would be rare for QEmu to use this interface to get the context of
entry(the only case I think is when MSI-X disable and QEmu need to sync the
context), so performance should not be an issue.

What's your opinion?

>  #define KVM_GET_MSIX_ENTRY   _IOWR(KVMIO,  0x7d, struct kvm_msix_entry)


Need SET_MSIX_ENTRY for live migration as well.


>  #define KVM_UPDATE_MSIX_MMIO  _IOW(KVMIO,  0x7e, struct kvm_msix_mmio)
>
>  #define KVM_MSIX_TYPE_ASSIGNED_DEV  1
>
>  #define KVM_MSIX_FLAG_MASKBIT   (1<<  0)
>  #define KVM_MSIX_FLAG_QUERY_MASKBIT (1<<  0)
>  #define KVM_MSIX_FLAG_ENTRY (1<<  1)
>  #define KVM_MSIX_FLAG_QUERY_ENTRY   (1<<  1)
>


Why is there a need for the flag?  If we simply get/set entire entries, 
that includes the mask bits?


What about the pending bits?


>  struct kvm_msix_entry {
>  __u32 id;
>  __u32 type;
>  __u32 entry; /* The index of entry in the MSI-X table */
>  __u32 flags;
>  __u32 query_flags;
>  union {
>  struct {
>  __u32 addr_lo;
>  __u32 addr_hi;
>  __u32 data;


Isn't the mask bit in the last word?  Or maybe I'm confused about the 
format.




>  } msi_entry;
>  __u32 reserved[12];
>  };
>  };
>
>  #define KVM_MSIX_MMIO_FLAG_REGISTER (1<<  0)
>  #define KVM_MSIX_MMIO_FLAG_UNREGISTER   (1<<  1)
>  #define KVM_MSIX_MMIO_FLAG_MASK 0x3
>
>  struct kvm_msix_mmio {
>  __u32 id;
>  __u32 type;
>  __u64 base_addr;
>  __u32 max_entries_nr;
>  __u32 flags;
>  __u32 reserved[6];
>  };


Also need a new exit reason to tell userspace that an msix entry has 
changed, so userspace can update mappings.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Mask bit support's API

2010-11-22 Thread Yang, Sheng
Hi Avi,

I've proposed the following API for mask bit support.

The main point is, QEmu can know which entries are enabled(by pci_enable_msix()).
And for enabled entries, kernel own it, including MSI data/address and mask 
bit(routing table and mask bitmap). QEmu should use KVM_GET_MSIX_ENTRY ioctl to 
get them(and it can sync with them if it want to do so).

Before entries are enabled, QEmu can still use it's own MSI table(because we 
didn't contain these kind of information in kernel, and it's unnecessary for 
kernel). 

The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query one entry didn't
exist in kernel - or we can simply return -EINVAL for it.

I suppose it would be rare for QEmu to use this interface to get the context of 
entry(the only case I think is when MSI-X disable and QEmu need to sync the 
context), so performance should not be an issue.

What's your opinion?

> #define KVM_GET_MSIX_ENTRY   _IOWR(KVMIO,  0x7d, struct kvm_msix_entry)
> #define KVM_UPDATE_MSIX_MMIO  _IOW(KVMIO,  0x7e, struct kvm_msix_mmio)   
>
> #define KVM_MSIX_TYPE_ASSIGNED_DEV  1   
> 
> #define KVM_MSIX_FLAG_MASKBIT   (1 << 0)
> #define KVM_MSIX_FLAG_QUERY_MASKBIT (1 << 0)
> #define KVM_MSIX_FLAG_ENTRY (1 << 1)
> #define KVM_MSIX_FLAG_QUERY_ENTRY   (1 << 1)
> 
> struct kvm_msix_entry { 
> __u32 id;   
> __u32 type; 
> __u32 entry; /* The index of entry in the MSI-X table */
> __u32 flags;
> __u32 query_flags;  
> union { 
> struct {
> __u32 addr_lo;  
> __u32 addr_hi;  
> __u32 data; 
> } msi_entry;
> __u32 reserved[12]; 
> };  
> };  
> 
> #define KVM_MSIX_MMIO_FLAG_REGISTER (1 << 0)
> #define KVM_MSIX_MMIO_FLAG_UNREGISTER   (1 << 1)
> #define KVM_MSIX_MMIO_FLAG_MASK 0x3 
> 
> struct kvm_msix_mmio {  
> __u32 id;   
> __u32 type; 
> __u64 base_addr;
> __u32 max_entries_nr;   
> __u32 flags;
> __u32 reserved[6];  
> };  

--
regards
Yang, Sheng 