Re: Mask bit support's API
On Friday 03 December 2010 00:55:03 Michael S. Tsirkin wrote: > On Thu, Dec 02, 2010 at 10:54:24PM +0800, Sheng Yang wrote: > > On Thu, Dec 2, 2010 at 10:26 PM, Michael S. Tsirkin wrote: > > > On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote: > > >> On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote: > > >> >> Which case? the readl() doesn't need access to the routing table, > > >> >> just the entry. > > >> > > > >> >One thing that read should do is flush in the outstanding > > >> >interrupts and flush out the mask bit writes. > > >> > > >> The mask bit writes are synchronous. > > >> > > >> wrt interrupts, we can deal with assigned devices, and can poll > > >> irqfds. But we can't force vhost-net to issue an interrupt (and I > > >> don't think it's required). > > > > > > To clarify: > > > > > >mask write > > >read > > > > > > it is safe for guest to assume no more interrupts > > > > > > where as with a simple > > >mask write > > > > > > an interrupt might be in flight and get delivered shortly afterwards. > > > > I think it's already contained in the current patchset. > > > > >> >> Oh, I think there is a terminology problem, I was talking about > > >> >> kvm's irq routing table, you were talking about the msix entries. > > >> >> > > >> >> I think treating it as a cache causes more problems, since there > > >> >> are now two paths for reads (in cache and not in cache) and more > > >> >> things for writes to manage. > > >> >> > > >> >> Here's my proposed API: > > >> >> > > >> >> KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, > > >> >> pending_bitmap_base_gpa) > > >> >> > > >> >> - called when the guest enables msix > > >> > > > >> >I would add virtual addresses so that we can use swappable memory to > > >> >store the state. > > >> > > >> Right. > > > > Do we need synchronization between kernel and userspace? Any recommended > > method? 
> > > > >> >If we do, maybe we can just keep the table there and then > > >> >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed? > > >> > > >> Still need to to let userspace know it needs to reprogram the irqfd > > >> or whatever it uses to inject the interrupt. > > > > > > Why do we need to reprogram irqfd? I thought irqfd would map to an > > > entry within the table instead of address/data as now. > > > Could you clarify please? > > > > > >> >> KVM_REMOVE_MSIX_TABLE(table_id) > > >> >> > > >> >>- called when the guest disables msix > > >> >> > > >> >> KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) > > >> >> > > >> >>- called when the guest enables msix (to initialize it), or > > >> >> after live migration > > >> > > > >> >What is entry_id here? > > >> > > >> Entry within the table. > > > > > > So I think KVM_DEFINE_MSIX_TABLE should be called when msix is > > > enabled (note: it can not be called at boot anyway since pa > > > depends on BAR assigned by BIOS). > > > > Don't agree. MMIO can be write regardless of if MSIX is enabled. If > > you handle MMIO to kernel, them handle them all. > > Hmm, does this mean we would need ioctls to enable/disable MSIX? I > think passing control from userspace to kernel when MSIX is enabled is > not too bad. Will think about it. Anyway, we need ioctls to do so because the kernel knows nothing about PCI configuration space... And we already have that for assigned devices. > > > I suppose qemu still > > got control of BAR? > > Yes. So e.g. if memory is disabled in the device then we must > disable this as well, and not claim these transactions > as mmio. > > > Then leave it in the current place should be fine. > > current place? assigned_dev_iomem_map(). > > At the moment qemu does not notify devices when > memory is disabled, only when it is enabled. You mean bit 2 of the Command register? I think it should be acceptable to assume the device only disables memory at shutdown. Also, we can hook the disable point as well.
> So handling msix enable/disable is more straight-forward. Don't agree. Then you get duplication between kernel and userspace. Also, the semantics of MSI-X MMIO have no relationship with MSI-X enable/disable. > > > >> >> Michael? I think that should work for virtio and vfio assigned > > >> >> devices? Not sure about pending bits. > > >> > > > >> >Pending bits must be tracked in kernel, but I don't see > > >> >how we can support polling mode if we don't exit to userspace > > >> >on pending bit reads. > > >> > > > >> >This does mean that some reads will be fast and some will be > > >> >slow, and it's a bit sad that we seem to be optimizing > > >> >for specific guests, but I just can't come up with > > >> >anything better. > > >> > > >> If the pending bits live in userspace memory, the device model can > > >> update them directly? > > > > > > Note that these are updated on an interrupt, so updating them > > > in userspace would need get_user_page etc trickery, > > > and add the overhead of atomics. > > > > > > Further I th
Re: Mask bit support's API
On Thu, Dec 02, 2010 at 10:54:24PM +0800, Sheng Yang wrote: > On Thu, Dec 2, 2010 at 10:26 PM, Michael S. Tsirkin wrote: > > On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote: > >> On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote: > >> >> > >> >> Which case? the readl() doesn't need access to the routing table, > >> >> just the entry. > >> > > >> >One thing that read should do is flush in the outstanding > >> >interrupts and flush out the mask bit writes. > >> > >> The mask bit writes are synchronous. > >> > >> wrt interrupts, we can deal with assigned devices, and can poll > >> irqfds. But we can't force vhost-net to issue an interrupt (and I > >> don't think it's required). > > > > To clarify: > > > > mask write > > read > > > > it is safe for guest to assume no more interrupts > > > > where as with a simple > > mask write > > > > an interrupt might be in flight and get delivered shortly afterwards. > > I think it's already contained in the current patchset. > > > > > >> >> Oh, I think there is a terminology problem, I was talking about > >> >> kvm's irq routing table, you were talking about the msix entries. > >> >> > >> >> I think treating it as a cache causes more problems, since there are > >> >> now two paths for reads (in cache and not in cache) and more things > >> >> for writes to manage. > >> >> > >> >> Here's my proposed API: > >> >> > >> >> KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, > >> >> pending_bitmap_base_gpa) > >> >> > >> >> - called when the guest enables msix > >> > > >> >I would add virtual addresses so that we can use swappable memory to > >> >store the state. > >> > >> Right. > > Do we need synchronization between kernel and userspace? Any recommended > method? > >> > >> >If we do, maybe we can just keep the table there and then > >> >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed? > >> > >> Still need to to let userspace know it needs to reprogram the irqfd > >> or whatever it uses to inject the interrupt. 
> > > > Why do we need to reprogram irqfd? I thought irqfd would map to an > > entry within the table instead of address/data as now. > > Could you clarify please? > > > > > >> >> KVM_REMOVE_MSIX_TABLE(table_id) > >> >> > >> >> - called when the guest disables msix > >> >> > >> >> KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) > >> >> > >> >> - called when the guest enables msix (to initialize it), or after > >> >> live migration > >> > > >> >What is entry_id here? > >> > >> Entry within the table. > > > > So I think KVM_DEFINE_MSIX_TABLE should be called when msix is > > enabled (note: it can not be called at boot anyway since pa > > depends on BAR assigned by BIOS). > > Don't agree. MMIO can be write regardless of if MSIX is enabled. If > you handle MMIO to kernel, them handle them all. Hmm, does this mean we would need ioctls to enable/disable MSIX? I think passing control from userspace to kernel when MSIX is enabled is not too bad. Will think about it. > I suppose qemu still > got control of BAR? Yes. So e.g. if memory is disabled in the device then we must disable this as well, and not claim these transactions as mmio. > Then leave it in the current place should be fine. current place? At the moment qemu does not notify devices when memory is disabled, only when it is enabled. So handling msix enable/disable is more straight-forward. > > > >> >> Michael? I think that should work for virtio and vfio assigned > >> >> devices? Not sure about pending bits. > >> > > >> >Pending bits must be tracked in kernel, but I don't see > >> >how we can support polling mode if we don't exit to userspace > >> >on pending bit reads. > >> > > >> >This does mean that some reads will be fast and some will be > >> >slow, and it's a bit sad that we seem to be optimizing > >> >for specific guests, but I just can't come up with > >> >anything better. > >> > > >> > >> If the pending bits live in userspace memory, the device model can > >> update them directly? 
> > > > Note that these are updated on an interrupt, so updating them > > in userspace would need get_user_page etc trickery, > > and add the overhead of atomics. > > > > Further I think it's important to avoid the overhead of updating them > > all the time, and only do this when an interrupt is > > masked or on pending bits read. Since userspace does not know > > when interrupts are masked, this means do update on each read. > > In fact qemu's accessing to MMIO should be quite rare after moving all > the things to the kernel. Using IOCTL is also fine with me. > > And how to "do update on each read"? When read of pending bits is detected, we could forward it up to qemu. Qemu could check internal device state and clear bits that are no longer relevant. > -- > regards, > Yang, Sheng > > > >> >So maybe just add an ioctl to get and to clear pending bits. > >> >Maybe set for symmetry. > >> > >> For live migration too. But if they live in memory,
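The get_user_page-plus-atomics overhead Michael describes can be pictured with a small userspace sketch. This is not code from the patchset; it just illustrates the per-delivery cost being debated: one atomic read-modify-write on a shared pending bitmap (in the kernel this would additionally require pinning the guest page that backs pending_bitmap_base_gpa).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: a pending bitmap shared between producer
 * (interrupt delivery) and consumer (guest read / migration dump),
 * hence the atomic ops.  All names here are hypothetical. */
#define MSIX_MAX_VECTORS 2048
#define BITS_PER_LONG    (8 * sizeof(unsigned long))

static unsigned long pending[MSIX_MAX_VECTORS / BITS_PER_LONG];

/* Conceptually called when an interrupt arrives for a masked vector. */
static void set_pending(unsigned int vec)
{
        /* This locked RMW per delivery is the overhead in question. */
        __atomic_fetch_or(&pending[vec / BITS_PER_LONG],
                          1UL << (vec % BITS_PER_LONG), __ATOMIC_SEQ_CST);
}

/* Conceptually called when the pending-bit read is serviced. */
static int test_and_clear_pending(unsigned int vec)
{
        unsigned long mask = 1UL << (vec % BITS_PER_LONG);
        unsigned long old = __atomic_fetch_and(&pending[vec / BITS_PER_LONG],
                                               ~mask, __ATOMIC_SEQ_CST);
        return (old & mask) != 0;
}
```

The atomics are only needed because two contexts race on the same word; the thread's suggestion to update lazily (only on mask or on read) is exactly about amortizing them.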
Re: Mask bit support's API
On Thu, Dec 2, 2010 at 10:26 PM, Michael S. Tsirkin wrote: > On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote: >> On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote: >> >> >> >> Which case? the readl() doesn't need access to the routing table, >> >> just the entry. >> > >> >One thing that read should do is flush in the outstanding >> >interrupts and flush out the mask bit writes. >> >> The mask bit writes are synchronous. >> >> wrt interrupts, we can deal with assigned devices, and can poll >> irqfds. But we can't force vhost-net to issue an interrupt (and I >> don't think it's required). > > To clarify: > > mask write > read > > it is safe for guest to assume no more interrupts > > where as with a simple > mask write > > an interrupt might be in flight and get delivered shortly afterwards. I think it's already contained in the current patchset. > > >> >> Oh, I think there is a terminology problem, I was talking about >> >> kvm's irq routing table, you were talking about the msix entries. >> >> >> >> I think treating it as a cache causes more problems, since there are >> >> now two paths for reads (in cache and not in cache) and more things >> >> for writes to manage. >> >> >> >> Here's my proposed API: >> >> >> >> KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, >> >> pending_bitmap_base_gpa) >> >> >> >> - called when the guest enables msix >> > >> >I would add virtual addresses so that we can use swappable memory to >> >store the state. >> >> Right. Do we need synchronization between kernel and userspace? Any recommended method? >> >> >If we do, maybe we can just keep the table there and then >> >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed? >> >> Still need to to let userspace know it needs to reprogram the irqfd >> or whatever it uses to inject the interrupt. > > Why do we need to reprogram irqfd? I thought irqfd would map to an > entry within the table instead of address/data as now. > Could you clarify please? 
> > >> >> KVM_REMOVE_MSIX_TABLE(table_id) >> >> >> >> - called when the guest disables msix >> >> >> >> KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) >> >> >> >> - called when the guest enables msix (to initialize it), or after >> >> live migration >> > >> >What is entry_id here? >> >> Entry within the table. > > So I think KVM_DEFINE_MSIX_TABLE should be called when msix is > enabled (note: it can not be called at boot anyway since pa > depends on BAR assigned by BIOS). Don't agree. MMIO can be written regardless of whether MSIX is enabled. If you handle MMIO in the kernel, then handle them all. I suppose qemu still has control of the BAR? Then leaving it in the current place should be fine. > >> >> Michael? I think that should work for virtio and vfio assigned >> >> devices? Not sure about pending bits. >> > >> >Pending bits must be tracked in kernel, but I don't see >> >how we can support polling mode if we don't exit to userspace >> >on pending bit reads. >> > >> >This does mean that some reads will be fast and some will be >> >slow, and it's a bit sad that we seem to be optimizing >> >for specific guests, but I just can't come up with >> >anything better. >> > >> >> If the pending bits live in userspace memory, the device model can >> update them directly? > > Note that these are updated on an interrupt, so updating them > in userspace would need get_user_page etc trickery, > and add the overhead of atomics. > > Further I think it's important to avoid the overhead of updating them > all the time, and only do this when an interrupt is > masked or on pending bits read. Since userspace does not know > when interrupts are masked, this means do update on each read. In fact qemu's accesses to MMIO should be quite rare after moving all the things to the kernel. Using an IOCTL is also fine with me. And how to "do update on each read"? -- regards, Yang, Sheng > >> >So maybe just add an ioctl to get and to clear pending bits. >> >Maybe set for symmetry. >> >> For live migration too.
But if they live in memory, no need for >> get/set, just specify the address. >> >> -- >> error compiling committee.c: too many arguments to function > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
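Sheng's question above — why reprogram the irqfd at all — comes down to which dwords of an entry actually changed. A hedged sketch of the check userspace would make after re-reading an entry (every identifier is hypothetical; the patchset defines no such helper):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace-side cache of one MSI-X table entry. */
struct msix_entry_state {
        uint32_t addr_lo;
        uint32_t addr_hi;
        uint32_t data;
        uint32_t vector_ctrl;   /* bit 0 is the per-vector mask bit */
};

/* After being told an entry changed, only a change to the address/data
 * pair forces rerouting the irqfd (or KVM routing entry); a mask-bit
 * flip in vector_ctrl is handled in the kernel in the scheme under
 * discussion. */
static int needs_reprogram(const struct msix_entry_state *cached,
                           const struct msix_entry_state *fresh)
{
        return cached->addr_lo != fresh->addr_lo ||
               cached->addr_hi != fresh->addr_hi ||
               cached->data    != fresh->data;
}
```

If irqfds were keyed by (table_id, entry_id) as Sheng suggests, this check would move into the kernel and userspace would see no exit at all for address/data rewrites of an already-routed vector.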
Re: Mask bit support's API
On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote: > On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote: > >> > >> Which case? the readl() doesn't need access to the routing table, > >> just the entry. > > > >One thing that read should do is flush in the outstanding > >interrupts and flush out the mask bit writes. > > The mask bit writes are synchronous. > > wrt interrupts, we can deal with assigned devices, and can poll > irqfds. But we can't force vhost-net to issue an interrupt (and I > don't think it's required). To clarify: mask write read it is safe for guest to assume no more interrupts where as with a simple mask write an interrupt might be in flight and get delivered shortly afterwards. > >> Oh, I think there is a terminology problem, I was talking about > >> kvm's irq routing table, you were talking about the msix entries. > >> > >> I think treating it as a cache causes more problems, since there are > >> now two paths for reads (in cache and not in cache) and more things > >> for writes to manage. > >> > >> Here's my proposed API: > >> > >> KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, > >> pending_bitmap_base_gpa) > >> > >> - called when the guest enables msix > > > >I would add virtual addresses so that we can use swappable memory to > >store the state. > > Right. > > >If we do, maybe we can just keep the table there and then > >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed? > > Still need to to let userspace know it needs to reprogram the irqfd > or whatever it uses to inject the interrupt. Why do we need to reprogram irqfd? I thought irqfd would map to an entry within the table instead of address/data as now. Could you clarify please? > >> KVM_REMOVE_MSIX_TABLE(table_id) > >> > >>- called when the guest disables msix > >> > >> KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) > >> > >>- called when the guest enables msix (to initialize it), or after > >> live migration > > > >What is entry_id here? 
> > Entry within the table. So I think KVM_DEFINE_MSIX_TABLE should be called when msix is enabled (note: it can not be called at boot anyway since pa depends on BAR assigned by BIOS). > >> Michael? I think that should work for virtio and vfio assigned > >> devices? Not sure about pending bits. > > > >Pending bits must be tracked in kernel, but I don't see > >how we can support polling mode if we don't exit to userspace > >on pending bit reads. > > > >This does mean that some reads will be fast and some will be > >slow, and it's a bit sad that we seem to be optimizing > >for specific guests, but I just can't come up with > >anything better. > > > > If the pending bits live in userspace memory, the device model can > update them directly? Note that these are updated on an interrupt, so updating them in userspace would need get_user_page etc trickery, and add the overhead of atomics. Further I think it's important to avoid the overhead of updating them all the time, and only do this when an interrupt is masked or on pending bits read. Since userspace does not know when interrupts are masked, this means do update on each read. > >So maybe just add an ioctl to get and to clear pending bits. > >Maybe set for symmetry. > > For live migration too. But if they live in memory, no need for > get/set, just specify the address. > > -- > error compiling committee.c: too many arguments to function
Re: Mask bit support's API
On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote: > > Which case? the readl() doesn't need access to the routing table, > just the entry. One thing that read should do is flush in the outstanding interrupts and flush out the mask bit writes. The mask bit writes are synchronous. wrt interrupts, we can deal with assigned devices, and can poll irqfds. But we can't force vhost-net to issue an interrupt (and I don't think it's required). > Oh, I think there is a terminology problem, I was talking about > kvm's irq routing table, you were talking about the msix entries. > > I think treating it as a cache causes more problems, since there are > now two paths for reads (in cache and not in cache) and more things > for writes to manage. > > Here's my proposed API: > > KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, > pending_bitmap_base_gpa) > > - called when the guest enables msix I would add virtual addresses so that we can use swappable memory to store the state. Right. If we do, maybe we can just keep the table there and then KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed? Still need to let userspace know it needs to reprogram the irqfd or whatever it uses to inject the interrupt. > KVM_REMOVE_MSIX_TABLE(table_id) > >- called when the guest disables msix > > KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) > >- called when the guest enables msix (to initialize it), or after > live migration What is entry_id here? Entry within the table. > Michael? I think that should work for virtio and vfio assigned > devices? Not sure about pending bits. Pending bits must be tracked in kernel, but I don't see how we can support polling mode if we don't exit to userspace on pending bit reads. This does mean that some reads will be fast and some will be slow, and it's a bit sad that we seem to be optimizing for specific guests, but I just can't come up with anything better. If the pending bits live in userspace memory, the device model can update them directly?
So maybe just add an ioctl to get and to clear pending bits. Maybe set for symmetry. For live migration too. But if they live in memory, no need for get/set, just specify the address. -- error compiling committee.c: too many arguments to function
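Avi's proposal, together with Michael's suggestion to pass virtual addresses so the state can live in swappable memory, might translate into ioctl argument structures along these lines. This is purely a sketch: none of these types or ioctls exist in the patchset under discussion, and every field name here is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical argument for KVM_DEFINE_MSIX_TABLE. */
struct kvm_msix_table_def {
        uint32_t table_id;
        uint32_t nr_entries;
        uint64_t msix_base_gpa;             /* guest-physical table base */
        uint64_t pending_bitmap_base_gpa;
        uint64_t msix_base_hva;             /* the virtual addresses Michael
                                               asked for, so the backing
                                               memory stays swappable */
        uint64_t pending_bitmap_base_hva;
};

/* Hypothetical argument for KVM_SET_MSIX_ENTRY / KVM_GET_MSIX_ENTRY. */
struct kvm_msix_entry {
        uint32_t table_id;
        uint32_t entry_id;                  /* index within the table */
        uint32_t entry[4];                  /* addr lo, addr hi, data,
                                               vector control */
};
```

KVM_REMOVE_MSIX_TABLE would then take just the table_id, and the proposed KVM_EXIT_MSIX_ENTRY_UPDATED exit would carry a (table_id, entry_id) pair for userspace to re-read.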
Re: Mask bit support's API
On Thu, Dec 02, 2010 at 03:09:43PM +0200, Avi Kivity wrote: > On 12/01/2010 04:36 AM, Yang, Sheng wrote: > >On Tuesday 30 November 2010 22:15:29 Avi Kivity wrote: > >> On 11/26/2010 04:35 AM, Yang, Sheng wrote: > >> > > > Shouldn't kvm also service reads from the pending bitmask? > >> > > > >> > > Of course KVM should service reading from pending bitmask. For > >> > > assigned device, it's kernel who would set the pending bit; but I > >> am > >> > > not sure for virtio. This interface is GET_ENTRY, so reading is > >> fine > >> > > with it. > >> > >> The kernel should manage it in the same way. Virtio raises irq (via > >> KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit. > >> > >> Note we need to be able to read and write the pending bitmask for live > >> migration. > > > >Then seems we still need to an writing interface for it. And I think we can > >work > >on it later since it got no user now. > > I'm not sure. Suppose a guest starts using pending bits. Does it > mean it will not work with kernel msix acceleration? > > If that is the case, then we need pending bit support now. I don't > want to knowingly merge something that doesn't conform to the spec, > forcing users to choose whether they want to enable kernel msix > acceleration or not. > > > > >> > >> Because we have an interface where you get an exit if (addr % 4)< 3 and > >> don't get an exit if (addr % 4) == 3. There is a gpa range which is > >> partially maintained by the kernel and partially in userspace. It's a > >> confusing interface. Things like 64-bit reads or writes need to be > >> broken up and serviced in two different places. > >> > >> We already need to support this (for unaligned writes which hit two > >> regions), but let's at least make a contiguous region behave sanely. > > > >Oh, I didn't mean to handle this kind of unaligned writing in userspace. > >They're > >illegal according to the PCI spec(otherwise the result is undefined > >according to > >the spec). 
I would cover all contiguous writing(32-bit and 64-bit) in the > >next > >version, and discard all illegal writing. And 64-bit accessing would be > >broken up > >in qemu as you said, as they do currently. > > > >In fact I think we can handle all data for 64-bit to qemu, because it should > >still > >sync the mask bit with kernel, which make the maskbit writing in userspace > >useless > >and can be ignored. > > What about reads? Split those as well? > > >> The reason is to keep a sane interface. Like we emulate instructions > >> and msrs in the kernel and don't do half a job. I don't think there's a > >> real need to accelerate the first three words of an msi-x entry. > > > >Here is the case we've observed with Xen. It can only be reproduced by large > >scale > >testing. When the interrupt intensity is very high, even new kernels would > >try to > >make it lower, then it would access mask bit very frequently. And in the > >kernel, > >msi_set_mask_bit() is like this: > > > >static void msi_set_mask_bit(struct irq_data *data, u32 flag) > >{ > > struct msi_desc *desc = irq_data_get_msi(data); > > > > if (desc->msi_attrib.is_msix) { > > msix_mask_irq(desc, flag); > > readl(desc->mask_base); /* Flush write to device */ > > } else { > > unsigned offset = data->irq - desc->dev->irq; > > msi_mask_irq(desc, 1<< offset, flag<< offset); > > } > >} > > > >That flush by readl() would complied with each MSI-X mask writing. That is > >the only > >place we want to accelerate, otherwise it would still fall back to qemu each > >time > >when guest want to mask the MSI-X entry. > > So it seems we do want to accelerate reads to the entire entry. > This seems to give more weight to making the kernel own the entire > entry. > > >> > >> > And BTW, we can take routing table as a kind of *cache*, if the content > >> > is in the cache, then we can fetch it from the cache, otherwise we need > >> > to go back to fetch it from memory(userspace). 
> >> > >> If it's guaranteed by the spec that addr/data pairs are always > >> interpreted in the same way, sure. But there no reason to do it, > >> really, it isn't a fast path. > > > >I am not quite understand the first sentence... But it's a fast path, in the > >case I > >showed above. > > Which case? the readl() doesn't need access to the routing table, > just the entry. One thing that read should do is flush in the outstanding interrupts and flush out the mask bit writes. > Oh, I think there is a terminology problem, I was talking about > kvm's irq routing table, you were talking about the msix entries. > > I think treating it as a cache causes more problems, since there are > now two paths for reads (in cache and not in cache) and more things > for writes to manage. > > Here's my proposed API: > > KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, >
Re: Mask bit support's API
On 12/01/2010 04:36 AM, Yang, Sheng wrote: On Tuesday 30 November 2010 22:15:29 Avi Kivity wrote: > On 11/26/2010 04:35 AM, Yang, Sheng wrote: > > > > Shouldn't kvm also service reads from the pending bitmask? > > > > > > Of course KVM should service reading from pending bitmask. For > > > assigned device, it's kernel who would set the pending bit; but I am > > > not sure for virtio. This interface is GET_ENTRY, so reading is fine > > > with it. > > The kernel should manage it in the same way. Virtio raises irq (via > KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit. > > Note we need to be able to read and write the pending bitmask for live > migration. Then seems we still need to an writing interface for it. And I think we can work on it later since it got no user now. I'm not sure. Suppose a guest starts using pending bits. Does it mean it will not work with kernel msix acceleration? If that is the case, then we need pending bit support now. I don't want to knowingly merge something that doesn't conform to the spec, forcing users to choose whether they want to enable kernel msix acceleration or not. > > Because we have an interface where you get an exit if (addr % 4)< 3 and > don't get an exit if (addr % 4) == 3. There is a gpa range which is > partially maintained by the kernel and partially in userspace. It's a > confusing interface. Things like 64-bit reads or writes need to be > broken up and serviced in two different places. > > We already need to support this (for unaligned writes which hit two > regions), but let's at least make a contiguous region behave sanely. Oh, I didn't mean to handle this kind of unaligned writing in userspace. They're illegal according to the PCI spec(otherwise the result is undefined according to the spec). I would cover all contiguous writing(32-bit and 64-bit) in the next version, and discard all illegal writing. And 64-bit accessing would be broken up in qemu as you said, as they do currently. 
In fact I think we can handle all data for 64-bit to qemu, because it should still sync the mask bit with kernel, which makes the maskbit writing in userspace useless, and it can be ignored. What about reads? Split those as well? > The reason is to keep a sane interface. Like we emulate instructions > and msrs in the kernel and don't do half a job. I don't think there's a > real need to accelerate the first three words of an msi-x entry. Here is the case we've observed with Xen. It can only be reproduced by large scale testing. When the interrupt intensity is very high, even new kernels would try to make it lower, then it would access the mask bit very frequently. And in the kernel, msi_set_mask_bit() is like this:

static void msi_set_mask_bit(struct irq_data *data, u32 flag)
{
        struct msi_desc *desc = irq_data_get_msi(data);

        if (desc->msi_attrib.is_msix) {
                msix_mask_irq(desc, flag);
                readl(desc->mask_base);         /* Flush write to device */
        } else {
                unsigned offset = data->irq - desc->dev->irq;
                msi_mask_irq(desc, 1 << offset, flag << offset);
        }
}

That flush by readl() accompanies each MSI-X mask write. That is the only place we want to accelerate; otherwise it would still fall back to qemu each time the guest wants to mask the MSI-X entry. So it seems we do want to accelerate reads to the entire entry. This seems to give more weight to making the kernel own the entire entry. > > > And BTW, we can take routing table as a kind of *cache*, if the content > > is in the cache, then we can fetch it from the cache, otherwise we need > > to go back to fetch it from memory(userspace). > > If it's guaranteed by the spec that addr/data pairs are always > interpreted in the same way, sure. But there no reason to do it, > really, it isn't a fast path. I don't quite understand the first sentence... But it's a fast path, in the case I showed above. Which case? the readl() doesn't need access to the routing table, just the entry.
Oh, I think there is a terminology problem, I was talking about kvm's irq routing table, you were talking about the msix entries. I think treating it as a cache causes more problems, since there are now two paths for reads (in cache and not in cache) and more things for writes to manage. Here's my proposed API: KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, pending_bitmap_base_gpa) - called when the guest enables msix KVM_REMOVE_MSIX_TABLE(table_id) - called when the guest disables msix KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) - called when the guest enables msix (to initialize it), or after live migration KVM_GET_MSIX_ENTRY(table_id, entry_id, contents) - called when the guest updates an msix table (triggered by KVM_EXIT_MSIX_ENTRY_UPDATED) KVM_RUN -> KVM_EXIT_MSIX_ENTRY_UPDATED - returned from KV
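For reference, the split being debated — the kernel servicing the mask bit while the first three words may exit to userspace — follows the MSI-X table layout in the PCI specification: each entry is 16 bytes (address low, address high, data, vector control), and bit 0 of the vector control dword is the per-vector mask bit. A small sketch of classifying a dword-aligned table offset (helper names are mine, not from the patchset):

```c
#include <assert.h>
#include <stdint.h>

#define MSIX_ENTRY_SIZE        16   /* 4 dwords per entry (PCI spec) */
#define MSIX_VECTOR_CTRL_OFF   12   /* last dword of the entry */
#define MSIX_VECTOR_CTRL_MASK  0x1  /* bit 0: per-vector mask */

/* Which entry does a table offset fall into? */
static uint32_t entry_index(uint64_t offset)
{
        return (uint32_t)(offset / MSIX_ENTRY_SIZE);
}

/* Is this dword the vector control word, i.e. the one carrying the
 * mask bit that the kernel wants to service without an exit? */
static int is_vector_ctrl(uint64_t offset)
{
        return (offset % MSIX_ENTRY_SIZE) == MSIX_VECTOR_CTRL_OFF;
}
```

The earlier "(addr % 4) < 3" wording in the thread is this same classification expressed in dword indices: dwords 0-2 of an entry are address/data, dword 3 is vector control.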
Re: Mask bit support's API
On Tuesday 30 November 2010 22:15:29 Avi Kivity wrote: > On 11/26/2010 04:35 AM, Yang, Sheng wrote: > > > > Shouldn't kvm also service reads from the pending bitmask? > > > > > > Of course KVM should service reading from pending bitmask. For > > > assigned device, it's kernel who would set the pending bit; but I am > > > not sure for virtio. This interface is GET_ENTRY, so reading is fine > > > with it. > > The kernel should manage it in the same way. Virtio raises irq (via > KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit. > > Note we need to be able to read and write the pending bitmask for live > migration. Then seems we still need to an writing interface for it. And I think we can work on it later since it got no user now. > > > > > We could have the kernel handle addr/data writes by setting up an > > > > internal interrupt routing. A disadvantage is that more work is > > > > needed if we emulator interrupt remapping in qemu. > > > > > > In fact modifying irq routing in the kernel is also the thing I want > > > to avoid. > > > > > > So, the flow would be: > > > > > > kernel get MMIO write, record it in it's own MSI table > > > KVM exit to QEmu, by one specific exit reason > > > QEmu know it have to sync the MSI table, then reading the entries from > > > kernel QEmu found it's an write, so it need to reprogram irq routing > > > table using the entries above > > > done > > > > > > But wait, why should qemu read entries from kernel? By default exit we > > > already have the information about what's the entry to modify and what > > > to write, so we can use them directly. By this way, we also don't > > > need an specific exit reason - just exit to qemu in normal way is > > > fine. > > Because we have an interface where you get an exit if (addr % 4) < 3 and > don't get an exit if (addr % 4) == 3. There is a gpa range which is > partially maintained by the kernel and partially in userspace. It's a > confusing interface. 
Things like 64-bit reads or writes need to be > broken up and serviced in two different places. > > We already need to support this (for unaligned writes which hit two > regions), but let's at least make a contiguous region behave sanely. Oh, I didn't mean to handle this kind of unaligned write in userspace. They're illegal according to the PCI spec (the result is undefined). I would cover all contiguous writes (32-bit and 64-bit) in the next version, and discard all illegal writes. And 64-bit accesses would be broken up in qemu as you said, as they are currently. In fact I think we can hand all of the data for 64-bit writes to qemu, because it would still have to sync the mask bit with the kernel, which makes the mask-bit write in userspace useless, so it can be ignored. > > > > Then it would be: > > > > > > kernel get MMIO write, record it in it's own MSI table > > > KVM exit to QEmu, indicate MMIO exit > > > QEmu found it's an write, it would update it's own MSI table(may need > > > to query mask bit from kernel), and reprogram irq routing table using > > > the entries above done > > > > > > Then why should kernel kept it's own MSI table? I think the only > > > reason is we can speed up reading in that way - but the reading we > > > want to speed up is mostly on enabled entry(the first entry), which > > > is already in the IRQ routing table... > > The reason is to keep a sane interface. Like we emulate instructions > and msrs in the kernel and don't do half a job. I don't think there's a > real need to accelerate the first three words of an msi-x entry. Here is the case we've observed with Xen. It can only be reproduced by large-scale testing. When the interrupt intensity is very high, even new kernels try to lower it, and then the mask bit is accessed very frequently. 
And in the kernel, msi_set_mask_bit() looks like this:

static void msi_set_mask_bit(struct irq_data *data, u32 flag)
{
        struct msi_desc *desc = irq_data_get_msi(data);

        if (desc->msi_attrib.is_msix) {
                msix_mask_irq(desc, flag);
                readl(desc->mask_base);         /* Flush write to device */
        } else {
                unsigned offset = data->irq - desc->dev->irq;
                msi_mask_irq(desc, 1 << offset, flag << offset);
        }
}

That readl() flush accompanies every MSI-X mask-bit write. That is the only place we want to accelerate; otherwise it would still fall back to userspace.
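The 64-bit splitting mentioned above (qemu breaking a contiguous 64-bit MSI-X table access into two 32-bit accesses) can be sketched like this; the helper names and the in-memory table are illustrative stand-ins, not qemu's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for one 16-byte MSI-X table entry:
 * dword 0 = addr lo, 1 = addr hi, 2 = data, 3 = vector control. */
static uint32_t msix_entry[4];

/* Hypothetical 32-bit MMIO write primitive; real qemu dispatches
 * through its memory-region ops instead. */
static void mmio_write32(uint64_t offset, uint32_t val)
{
    msix_entry[(offset / 4) % 4] = val;
}

/* A contiguous 64-bit guest write is serviced as two aligned 32-bit
 * writes, low dword first -- and each half may land in a different
 * handler (kernel vs userspace), which is the interface complaint
 * made in this thread. */
static void mmio_write64(uint64_t offset, uint64_t val)
{
    mmio_write32(offset, (uint32_t)val);
    mmio_write32(offset + 4, (uint32_t)(val >> 32));
}
```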
Re: Mask bit support's API
On 11/26/2010 04:35 AM, Yang, Sheng wrote: > > > > Shouldn't kvm also service reads from the pending bitmask? > > Of course KVM should service reading from pending bitmask. For assigned > device, it's kernel who would set the pending bit; but I am not sure for > virtio. This interface is GET_ENTRY, so reading is fine with it. The kernel should manage it in the same way. Virtio raises irq (via KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit. Note we need to be able to read and write the pending bitmask for live migration. > > > > We could have the kernel handle addr/data writes by setting up an > > internal interrupt routing. A disadvantage is that more work is needed > > if we emulator interrupt remapping in qemu. > > In fact modifying irq routing in the kernel is also the thing I want to > avoid. > > So, the flow would be: > > kernel get MMIO write, record it in it's own MSI table > KVM exit to QEmu, by one specific exit reason > QEmu know it have to sync the MSI table, then reading the entries from > kernel QEmu found it's an write, so it need to reprogram irq routing table > using the entries above > done > > But wait, why should qemu read entries from kernel? By default exit we > already have the information about what's the entry to modify and what to > write, so we can use them directly. By this way, we also don't need an > specific exit reason - just exit to qemu in normal way is fine. Because we have an interface where you get an exit if (addr % 4) < 3 and don't get an exit if (addr % 4) == 3. There is a gpa range which is partially maintained by the kernel and partially in userspace. It's a confusing interface. Things like 64-bit reads or writes need to be broken up and serviced in two different places. We already need to support this (for unaligned writes which hit two regions), but let's at least make a contiguous region behave sanely. 
> > Then it would be: > > kernel get MMIO write, record it in it's own MSI table > KVM exit to QEmu, indicate MMIO exit > QEmu found it's an write, it would update it's own MSI table(may need to > query mask bit from kernel), and reprogram irq routing table using the > entries above done > > Then why should kernel kept it's own MSI table? I think the only reason is > we can speed up reading in that way - but the reading we want to speed up > is mostly on enabled entry(the first entry), which is already in the IRQ > routing table... The reason is to keep a sane interface. Like we emulate instructions and msrs in the kernel and don't do half a job. I don't think there's a real need to accelerate the first three words of an msi-x entry. > And for enabled/disabled entry, you can see it like this: for the entries > inside routing table, we think it's enabled; otherwise it's disabled. Then > you don't need to bothered by pci_enable_msix(). > > So our strategy for reading accelerating can be: > > If the entry contained in irq routing table, then use it; otherwise let > qemu deal with it. Because it's the QEmu who owned irq routing table, the > synchronization is guaranteed. We don't need the MSI table in the kernel > then. I agree about letting qemu manage the irq routing table. It changes very rarely. I just prefer to let it know about the change via something other than KVM_EXIT_MMIO. > > And for writing, we just want to cover all of mask bit, but none of others. > > I think the concept here is more acceptable? > > The issue here is MSI table and irq routing table got duplicate information > on some entries. My initial purposal is to use irq routing table in > kernel, then we don't need to duplicate information. Avi? Sorry about the late reply. And BTW, we can take routing table as a kind of *cache*, if the content is in the cache, then we can fetch it from the cache, otherwise we need to go back to fetch it from memory(userspace). 
If it's guaranteed by the spec that addr/data pairs are always interpreted in the same way, sure. But there's no reason to do it, really; it isn't a fast path. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
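The pending-bitmask behavior discussed in this thread - the kernel latches a PBA bit when an interrupt arrives for a masked vector, and delivers it on unmask - can be sketched as follows. Every name here is illustrative, not actual KVM code.

```c
#include <assert.h>
#include <stdint.h>

#define NR_VECTORS 32

static uint32_t pba;              /* pending bit array, one bit per vector */
static uint32_t masked;           /* mask bits from the vector control words */
static int injected[NR_VECTORS];  /* interrupts actually delivered to guest */

/* Device (or vhost-net's irqfd / KVM_IRQ_LINE) raises vector n. */
static void msix_fire(unsigned n)
{
    if (masked & (1u << n))
        pba |= 1u << n;           /* latch as pending, don't inject */
    else
        injected[n]++;            /* inject into the guest */
}

/* Guest clears the mask bit: deliver anything latched while masked. */
static void msix_unmask(unsigned n)
{
    masked &= ~(1u << n);
    if (pba & (1u << n)) {
        pba &= ~(1u << n);
        injected[n]++;
    }
}
```

For live migration, userspace would additionally need get/set access to `pba`, which is the read/write pending-bitmask interface Avi asks for above.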
Re: Mask bit support's API
On Wednesday 24 November 2010 09:59:23 Yang, Sheng wrote: > On Tuesday 23 November 2010 22:06:20 Avi Kivity wrote: > > On 11/23/2010 03:57 PM, Yang, Sheng wrote: > > > > > Yeah, but won't be included in this patchset. > > > > > > > > What API changes are needed? I'd like to see the complete API. > > > > > > I am not sure about it. But I suppose the structure should be the same? > > > In fact it's pretty hard for me to image what's needed for virtio in > > > the future, especially there is no such code now. I really prefer to > > > deal with assigned device and virtio separately, which would make the > > > work much easier. But seems you won't agree on that. > > > > First, I don't really see why the two cases are different (but I don't > > do a lot in this space). Surely between you and Michael, you have all > > the information? > > > > Second, my worry is a huge number of ABI variants that come from > > incrementally adding features. I want to implement bigger chunks of > > functionality. So I'd like to see all potential users addressed, at > > least from the ABI point of view if not the implementation. > > > > > > The API needs to be compatible with the pending bit, even if we > > > > don't implement it now. I want to reduce the rate of API changes. > > > > > > This can be implemented by this API, just adding a flag for it. And I > > > would still take this into consideration in the next API purposal. > > > > Shouldn't kvm also service reads from the pending bitmask? > > Of course KVM should service reading from pending bitmask. For assigned > device, it's kernel who would set the pending bit; but I am not sure for > virtio. This interface is GET_ENTRY, so reading is fine with it. 
> > > > > So instead of > > > > > > > > - guest reads/writes msix > > > > - kvm filters mmio, implements some, passes others to userspace > > > > > > > > we have > > > > > > > > - guest reads/writes msix > > > > - kvm implements all > > > > - some writes generate an additional notification to userspace > > > > > > I suppose we don't need to generate notification to userspace? Because > > > every read/write is handled by kernel, and userspace just need > > > interface to kernel to get/set the entry - and well, does userspace > > > need to do it when kernel can handle all of them? Maybe not... > > > > We could have the kernel handle addr/data writes by setting up an > > internal interrupt routing. A disadvantage is that more work is needed > > if we emulator interrupt remapping in qemu. > > In fact modifying irq routing in the kernel is also the thing I want to > avoid. > > So, the flow would be: > > kernel get MMIO write, record it in it's own MSI table > KVM exit to QEmu, by one specific exit reason > QEmu know it have to sync the MSI table, then reading the entries from > kernel QEmu found it's an write, so it need to reprogram irq routing table > using the entries above > done > > But wait, why should qemu read entries from kernel? By default exit we > already have the information about what's the entry to modify and what to > write, so we can use them directly. By this way, we also don't need an > specific exit reason - just exit to qemu in normal way is fine. > > Then it would be: > > kernel get MMIO write, record it in it's own MSI table > KVM exit to QEmu, indicate MMIO exit > QEmu found it's an write, it would update it's own MSI table(may need to > query mask bit from kernel), and reprogram irq routing table using the > entries above done > > Then why should kernel kept it's own MSI table? 
I think the only reason is > we can speed up reading in that way - but the reading we want to speed up > is mostly on enabled entry(the first entry), which is already in the IRQ > routing table... > > And for enabled/disabled entry, you can see it like this: for the entries > inside routing table, we think it's enabled; otherwise it's disabled. Then > you don't need to bothered by pci_enable_msix(). > > So our strategy for reading accelerating can be: > > If the entry contained in irq routing table, then use it; otherwise let > qemu deal with it. Because it's the QEmu who owned irq routing table, the > synchronization is guaranteed. We don't need the MSI table in the kernel > then. > > And for writing, we just want to cover all of mask bit, but none of others. > > I think the concept here is more acceptable? > > The issue here is MSI table and irq routing table got duplicate information > on some entries. My initial purposal is to use irq routing table in > kernel, then we don't need to duplicate information. Avi? And BTW, we can take the routing table as a kind of *cache*: if the content is in the cache, we can fetch it from the cache; otherwise we need to go back and fetch it from memory (userspace). -- regards Yang, Sheng
Re: Mask bit support's API
On Tuesday 23 November 2010 22:06:20 Avi Kivity wrote: > On 11/23/2010 03:57 PM, Yang, Sheng wrote: > > > > Yeah, but won't be included in this patchset. > > > > > > What API changes are needed? I'd like to see the complete API. > > > > I am not sure about it. But I suppose the structure should be the same? > > In fact it's pretty hard for me to image what's needed for virtio in the > > future, especially there is no such code now. I really prefer to deal > > with assigned device and virtio separately, which would make the work > > much easier. But seems you won't agree on that. > > First, I don't really see why the two cases are different (but I don't > do a lot in this space). Surely between you and Michael, you have all > the information? > > Second, my worry is a huge number of ABI variants that come from > incrementally adding features. I want to implement bigger chunks of > functionality. So I'd like to see all potential users addressed, at > least from the ABI point of view if not the implementation. > > > > The API needs to be compatible with the pending bit, even if we don't > > > implement it now. I want to reduce the rate of API changes. > > > > This can be implemented by this API, just adding a flag for it. And I > > would still take this into consideration in the next API purposal. > > Shouldn't kvm also service reads from the pending bitmask? Of course KVM should service reads from the pending bitmask. For an assigned device, it's the kernel that would set the pending bit; but I am not sure about virtio. This interface is GET_ENTRY, so reading is fine with it. > > > So instead of > > > > > > - guest reads/writes msix > > > - kvm filters mmio, implements some, passes others to userspace > > > > > > we have > > > > > > - guest reads/writes msix > > > - kvm implements all > > > - some writes generate an additional notification to userspace > > > > I suppose we don't need to generate a notification to userspace? 
Because > > every read/write is handled by kernel, and userspace just need interface > > to kernel to get/set the entry - and well, does userspace need to do it > > when kernel can handle all of them? Maybe not... > > We could have the kernel handle addr/data writes by setting up an > internal interrupt routing. A disadvantage is that more work is needed > if we emulator interrupt remapping in qemu.

In fact, modifying irq routing in the kernel is also something I want to avoid. So the flow would be:

kernel gets the MMIO write, records it in its own MSI table
KVM exits to QEmu with one specific exit reason
QEmu knows it has to sync the MSI table, so it reads the entries from the kernel
QEmu finds it's a write, so it needs to reprogram the irq routing table using the entries above
done

But wait, why should qemu read the entries from the kernel? With a default exit we already have the information about which entry to modify and what to write, so we can use it directly. This way we also don't need a specific exit reason - just exiting to qemu in the normal way is fine. Then it would be:

kernel gets the MMIO write, records it in its own MSI table
KVM exits to QEmu, indicating an MMIO exit
QEmu finds it's a write, updates its own MSI table (it may need to query the mask bit from the kernel), and reprograms the irq routing table using the entries above
done

Then why should the kernel keep its own MSI table? I think the only reason is that we can speed up reads that way - but the reads we want to speed up are mostly on enabled entries (the first entry), which are already in the IRQ routing table... And for enabled/disabled entries, you can see it like this: the entries inside the routing table are considered enabled; everything else is disabled. Then you don't need to be bothered by pci_enable_msix(). So our strategy for read acceleration can be: if the entry is contained in the irq routing table, use it; otherwise let qemu deal with it. Because it's QEmu that owns the irq routing table, synchronization is guaranteed. We don't need the MSI table in the kernel then. And for writes, we just want to cover the mask bit, and nothing else. I think this concept is more acceptable? The issue here is that the MSI table and the irq routing table hold duplicate information for some entries. My initial proposal is to use the irq routing table in the kernel, so we don't need to duplicate the information. -- regards Yang, Sheng
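The read strategy proposed here - serve reads from the irq routing table when the entry is present, otherwise exit to qemu - amounts to a cache lookup. An illustrative sketch under assumed names (not actual KVM code):

```c
#include <assert.h>
#include <stdint.h>

struct route {
    int valid;                       /* in the routing table => "enabled" */
    uint32_t addr_lo, addr_hi, data;
};

/* Stand-in for the kvm irq routing table, indexed by msix entry. */
static struct route routing_table[32];

/* Return 0 and fill *val if the kernel can answer the read itself;
 * return -1 to mean "exit to userspace and let qemu service it". */
static int msix_table_read(unsigned entry, unsigned dword, uint32_t *val)
{
    const struct route *r = &routing_table[entry];

    if (!r->valid)
        return -1;                   /* cache miss: fall back to qemu */
    switch (dword) {
    case 0: *val = r->addr_lo; return 0;
    case 1: *val = r->addr_hi; return 0;
    case 2: *val = r->data;    return 0;
    default: return -1;              /* vector control handled elsewhere */
    }
}
```

Since qemu owns the routing table, every kernel-served read is consistent with what qemu would have returned, which is the synchronization argument above.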
Re: Mask bit support's API
On Tue, Nov 23, 2010 at 05:24:44PM +0200, Gleb Natapov wrote: > On Tue, Nov 23, 2010 at 05:11:19PM +0200, Michael S. Tsirkin wrote: > > On Tue, Nov 23, 2010 at 04:06:20PM +0200, Avi Kivity wrote: > > > >> > > > >> So instead of > > > >> > > > >> - guest reads/writes msix > > > >> - kvm filters mmio, implements some, passes others to userspace > > > >> > > > >> we have > > > >> > > > >> - guest reads/writes msix > > > >> - kvm implements all > > > >> - some writes generate an additional notification to userspace > > > > > > > >I suppose we don't need to generate notification to userspace? Because > > > >every > > > >read/write is handled by kernel, and userspace just need interface to > > > >kernel to > > > >get/set the entry - and well, does userspace need to do it when kernel > > > >can handle > > > >all of them? Maybe not... > > > > > > We could have the kernel handle addr/data writes by setting up an > > > internal interrupt routing. A disadvantage is that more work is > > > needed if we emulator interrupt remapping in qemu. > > > > As an alternative, interrupt remapping will need some API rework, right? > > Existing APIs only pass address/data for msi. > > > IIRC interrupt remapping works with address/data to. It just interpret > it differently from apic. Yes. So since our APIs use address/data, this is an argument for doing the remapping in kernel. > -- > Gleb.
Re: Mask bit support's API
On Tue, Nov 23, 2010 at 05:11:19PM +0200, Michael S. Tsirkin wrote: > On Tue, Nov 23, 2010 at 04:06:20PM +0200, Avi Kivity wrote: > > >> > > >> So instead of > > >> > > >> - guest reads/writes msix > > >> - kvm filters mmio, implements some, passes others to userspace > > >> > > >> we have > > >> > > >> - guest reads/writes msix > > >> - kvm implements all > > >> - some writes generate an additional notification to userspace > > > > > >I suppose we don't need to generate notification to userspace? Because > > >every > > >read/write is handled by kernel, and userspace just need interface to > > >kernel to > > >get/set the entry - and well, does userspace need to do it when kernel can > > >handle > > >all of them? Maybe not... > > > > We could have the kernel handle addr/data writes by setting up an > > internal interrupt routing. A disadvantage is that more work is > > needed if we emulator interrupt remapping in qemu. > > As an alternative, interrupt remapping will need some API rework, right? > Existing APIs only pass address/data for msi. > IIRC interrupt remapping works with address/data too. It just interprets it differently from the apic. -- Gleb.
Re: Mask bit support's API
On Tue, Nov 23, 2010 at 04:06:20PM +0200, Avi Kivity wrote: > >> > >> So instead of > >> > >> - guest reads/writes msix > >> - kvm filters mmio, implements some, passes others to userspace > >> > >> we have > >> > >> - guest reads/writes msix > >> - kvm implements all > >> - some writes generate an additional notification to userspace > > > >I suppose we don't need to generate notification to userspace? Because every > >read/write is handled by kernel, and userspace just need interface to kernel > >to > >get/set the entry - and well, does userspace need to do it when kernel can > >handle > >all of them? Maybe not... > > We could have the kernel handle addr/data writes by setting up an > internal interrupt routing. A disadvantage is that more work is > needed if we emulator interrupt remapping in qemu. As an alternative, interrupt remapping will need some API rework, right? Existing APIs only pass address/data for msi. -- MST
Re: Mask bit support's API
On 11/23/2010 03:57 PM, Yang, Sheng wrote: > > > > Yeah, but won't be included in this patchset. > > What API changes are needed? I'd like to see the complete API. I am not sure about it. But I suppose the structure should be the same? In fact it's pretty hard for me to imagine what's needed for virtio in the future, especially as there is no such code now. I really prefer to deal with assigned devices and virtio separately, which would make the work much easier. But it seems you won't agree to that. First, I don't really see why the two cases are different (but I don't do a lot in this space). Surely between you and Michael, you have all the information? Second, my worry is a huge number of ABI variants that come from incrementally adding features. I want to implement bigger chunks of functionality. So I'd like to see all potential users addressed, at least from the ABI point of view if not the implementation. > The API needs to be compatible with the pending bit, even if we don't > implement it now. I want to reduce the rate of API changes. This can be implemented by this API, just adding a flag for it. And I would still take this into consideration in the next API proposal. Shouldn't kvm also service reads from the pending bitmask? > > So instead of > > - guest reads/writes msix > - kvm filters mmio, implements some, passes others to userspace > > we have > > - guest reads/writes msix > - kvm implements all > - some writes generate an additional notification to userspace I suppose we don't need to generate a notification to userspace? Because every read/write is handled by the kernel, and userspace just needs an interface to the kernel to get/set the entry - and well, does userspace need to do it when the kernel can handle all of them? Maybe not... We could have the kernel handle addr/data writes by setting up an internal interrupt routing. A disadvantage is that more work is needed if we emulate interrupt remapping in qemu. 
-- error compiling committee.c: too many arguments to function
Re: Mask bit support's API
On Tuesday 23 November 2010 20:04:16 Michael S. Tsirkin wrote: > On Tue, Nov 23, 2010 at 02:09:52PM +0800, Yang, Sheng wrote: > > Hi Avi, > > > > I've purposed the following API for mask bit support. > > > > The main point is, QEmu can know which entries are enabled(by > > pci_enable_msix()). > > Unfortunately, it can't I think, unless all your guests are linux. > "enabled entries" is a linux kernel concept. > The MSIX spec only tells you which entries are masked and which are > unmasked. I can't understand what you are talking about, or how it relates to the guest OS. I was talking about pci_enable_msix() in the host Linux. -- regards Yang, Sheng
Re: Mask bit support's API
On Tuesday 23 November 2010 20:47:33 Avi Kivity wrote: > On 11/23/2010 10:30 AM, Yang, Sheng wrote: > > On Tuesday 23 November 2010 15:54:40 Avi Kivity wrote: > > > On 11/23/2010 08:35 AM, Yang, Sheng wrote: > > > > On Tuesday 23 November 2010 14:17:28 Avi Kivity wrote: > > > > > On 11/23/2010 08:09 AM, Yang, Sheng wrote: > > > > > > Hi Avi, > > > > > > > > > > > > I've purposed the following API for mask bit support. > > > > > > > > > > > > The main point is, QEmu can know which entries are > > > > > > enabled(by pci_enable_msix()). And for enabled entries, > > > > > > kernel own it, including MSI data/address and mask > > > > > > bit(routing table and mask bitmap). QEmu should use > > > > > > KVM_GET_MSIX_ENTRY ioctl to get them(and it can sync with > > > > > > them if it want to do so). > > > > > > > > > > > > Before entries are enabled, QEmu can still use it's own MSI > > > > > > table(because we didn't contain these kind of information > > > > > > in kernel, and it's unnecessary for kernel). > > > > > > > > > > > > The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to > > > > > > query one entry didn't exist in kernel - or we can simply > > > > > > return -EINVAL for it. > > > > > > > > > > > > I suppose it would be rare for QEmu to use this interface > > > > > > to get the context of entry(the only case I think is when > > > > > > MSI-X disable and QEmu need to sync the context), so > > > > > > performance should not be an issue. > > > > > > > > > > > > What's your opinion? > > > > > > > > > > > > >#define KVM_GET_MSIX_ENTRY_IOWR(KVMIO, 0x7d, > > > > > > >struct kvm_msix_entry) > > > > > > > > > > Need SET_MSIX_ENTRY for live migration as well. > > > > > > > > Current we don't support LM with VT-d... > > > > > > Isn't this work useful for virtio as well? > > > > Yeah, but won't be included in this patchset. > > What API changes are needed? I'd like to see the complete API. I am not sure about it. But I suppose the structure should be the same? 
In fact it's pretty hard for me to imagine what's needed for virtio in the future, especially as there is no such code now. I really prefer to deal with assigned devices and virtio separately, which would make the work much easier. But it seems you won't agree to that. > > > > > > What about the pending bits? > > > > > > > > We didn't cover it here - and it's in another MMIO space(PBA). Of > > > > course we can add more flags for it later. > > > > > > When an entry is masked, we need to set the pending bit for it > > > somewhere. I guess this is broken in the existing code (without your > > > patches)? > > > > Even with my patch, we didn't support the pending bit. It would always > > return 0 now. What we supposed to do(after my patch checked in) is to > > check IRQ_PENDING flag of irq_desc->status(if the entry is masked), and > > return the result to userspace. > > > > That would involve some core change, like to export irq_to_desc(). I > > don't think it would be accepted soon, so would push mask bit first. > > The API needs to be compatible with the pending bit, even if we don't > implement it now. I want to reduce the rate of API changes. This can be implemented by this API, just adding a flag for it. And I would still take this into consideration in the next API proposal. > > > > > Also need a new exit reason to tell userspace that an msix > > > > > entry has changed, so userspace can update mappings. > > > > > > > > I think we don't need it. Whenever userspace want to get one > > > > mapping which is an enabled MSI-X entry, it can check it with the > > > > API above(which is quite rare, because kernel would handle all of > > > > them when guest is accessing them). If it's a disabled entry, the > > > > context inside userspace MMIO record is the correct one(and only > > > > one). The only place I think QEmu need to sync is when MSI-X is > > > > about to disabled, QEmu need to update it's own MMIO record. 
> > > > > > So in-kernel handling of mmio would be decided per entry? I'm trying > > > to simplify this, and simplest thing is - all or nothing. > > > > So you would like to handle all MSI-X MMIO in kernel? > > Yes. Writes to address or data would be handled by: > - recording it into the shadow msix table > - notifying userspace that msix entry x changed > Reads would be handled in kernel from the shadow msix table. > > So instead of > > - guest reads/writes msix > - kvm filters mmio, implements some, passes others to userspace > > we have > > - guest reads/writes msix > - kvm implements all > - some writes generate an additional notification to userspace I suppose we don't need to generate a notification to userspace? Because every read/write is handled by the kernel, and userspace just needs an interface to the kernel to get/set the entry - and well, does userspace need to do it when the kernel can handle all of them? Maybe not...
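Avi's "kvm implements all" write path - record every write into the shadow msix table, and flag a userspace notification only for address/data changes - could look roughly like this. This is a sketch under assumed names, not actual KVM code; the boolean stands in for a real KVM exit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSIX_ENTRY_SIZE 16
#define MSIX_CTRL_DWORD 3            /* vector control word, holds mask bit */

static uint32_t shadow[32][4];       /* shadow msix table, one row per vector */
static bool notify_userspace;        /* would become a KVM exit in real code */

static void msix_mmio_write(uint64_t offset, uint32_t val)
{
    unsigned entry = offset / MSIX_ENTRY_SIZE;
    unsigned dword = (offset % MSIX_ENTRY_SIZE) / 4;

    shadow[entry][dword] = val;      /* all writes land in the shadow table */
    if (dword != MSIX_CTRL_DWORD)
        notify_userspace = true;     /* addr/data changed: userspace must
                                        reprogram its routing */
    /* mask-bit writes (dword 3) stay fully in the kernel */
}

static uint32_t msix_mmio_read(uint64_t offset)
{
    return shadow[offset / MSIX_ENTRY_SIZE][(offset % MSIX_ENTRY_SIZE) / 4];
}
```

Reads are always answered from the shadow table, so the whole gpa range behaves uniformly - the "all or nothing" property Avi is after.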
Re: Mask bit support's API
On Tue, Nov 23, 2010 at 02:47:33PM +0200, Avi Kivity wrote: > On 11/23/2010 10:30 AM, Yang, Sheng wrote: > >On Tuesday 23 November 2010 15:54:40 Avi Kivity wrote: > >> On 11/23/2010 08:35 AM, Yang, Sheng wrote: > >> > On Tuesday 23 November 2010 14:17:28 Avi Kivity wrote: > >> > > On 11/23/2010 08:09 AM, Yang, Sheng wrote: > >> > > > Hi Avi, > >> > > > > >> > > > I've purposed the following API for mask bit support. > >> > > > > >> > > > The main point is, QEmu can know which entries are enabled(by > >> > > > pci_enable_msix()). And for enabled entries, kernel own it, > >> > > > including MSI data/address and mask bit(routing table and mask > >> > > > bitmap). QEmu should use KVM_GET_MSIX_ENTRY ioctl to get > >> them(and > >> > > > it can sync with them if it want to do so). > >> > > > > >> > > > Before entries are enabled, QEmu can still use it's own MSI > >> > > > table(because we didn't contain these kind of information in > >> > > > kernel, and it's unnecessary for kernel). > >> > > > > >> > > > The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to > >> query > >> > > > one entry didn't exist in kernel - or we can simply return > >> -EINVAL > >> > > > for it. > >> > > > > >> > > > I suppose it would be rare for QEmu to use this interface to > >> get > >> > > > the context of entry(the only case I think is when MSI-X > >> disable > >> > > > and QEmu need to sync the context), so performance should not > >> be > >> > > > an issue. > >> > > > > >> > > > What's your opinion? > >> > > > > >> > > > >#define KVM_GET_MSIX_ENTRY_IOWR(KVMIO, 0x7d, > >> struct > >> > > > >kvm_msix_entry) > >> > > > >> > > Need SET_MSIX_ENTRY for live migration as well. > >> > > >> > Current we don't support LM with VT-d... > >> > >> Isn't this work useful for virtio as well? > > > >Yeah, but won't be included in this patchset. > > What API changes are needed? I'd like to see the complete API. > > >> > > What about the pending bits? 
> >> > > >> > We didn't cover it here - and it's in another MMIO space(PBA). Of > >> course > >> > we can add more flags for it later. > >> > >> When an entry is masked, we need to set the pending bit for it > >> somewhere. I guess this is broken in the existing code (without your > >> patches)? > > > >Even with my patch, we didn't support the pending bit. It would always > >return 0 > >now. What we supposed to do(after my patch checked in) is to check > >IRQ_PENDING flag > >of irq_desc->status(if the entry is masked), and return the result to > >userspace. > > > >That would involve some core change, like to export irq_to_desc(). I don't > >think > >it would be accepted soon, so would push mask bit first. > > The API needs to be compatible with the pending bit, even if we > don't implement it now. I want to reduce the rate of API changes. > > >> > >> > > Also need a new exit reason to tell userspace that an msix entry > >> has > >> > > changed, so userspace can update mappings. > >> > > >> > I think we don't need it. Whenever userspace want to get one mapping > >> > which is an enabled MSI-X entry, it can check it with the API > >> > above(which is quite rare, because kernel would handle all of them when > >> > guest is accessing them). If it's a disabled entry, the context inside > >> > userspace MMIO record is the correct one(and only one). The only place > >> I > >> > think QEmu need to sync is when MSI-X is about to disabled, QEmu need > >> to > >> > update it's own MMIO record. > >> > >> So in-kernel handling of mmio would be decided per entry? I'm trying to > >> simplify this, and simplest thing is - all or nothing. > > > >So you would like to handle all MSI-X MMIO in kernel? > > Yes. Writes to address or data would be handled by: > - recording it into the shadow msix table > - notifying userspace that msix entry x changed > Reads would be handled in kernel from the shadow msix table. 
> > So instead of > > - guest reads/writes msix > - kvm filters mmio, implements some, passes others to userspace > > we have > > - guest reads/writes msix > - kvm implements all > - some writes generate an additional notification to userspace One small proposal in addition: since all accesses are done from guest anyway, the shadow table can/should be stored using userspace memory, reducing the kernel memory overhead of the feature from up to 4K per MSIX table to just 8 bytes. Active entries can be cached in kernel memory. > > -- > error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mask bit support's API
On 11/23/2010 10:30 AM, Yang, Sheng wrote: On Tuesday 23 November 2010 15:54:40 Avi Kivity wrote: > On 11/23/2010 08:35 AM, Yang, Sheng wrote: > > On Tuesday 23 November 2010 14:17:28 Avi Kivity wrote: > > > On 11/23/2010 08:09 AM, Yang, Sheng wrote: > > > > Hi Avi, > > > > > > > > I've purposed the following API for mask bit support. > > > > > > > > The main point is, QEmu can know which entries are enabled(by > > > > pci_enable_msix()). And for enabled entries, kernel own it, > > > > including MSI data/address and mask bit(routing table and mask > > > > bitmap). QEmu should use KVM_GET_MSIX_ENTRY ioctl to get them(and > > > > it can sync with them if it want to do so). > > > > > > > > Before entries are enabled, QEmu can still use it's own MSI > > > > table(because we didn't contain these kind of information in > > > > kernel, and it's unnecessary for kernel). > > > > > > > > The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query > > > > one entry didn't exist in kernel - or we can simply return -EINVAL > > > > for it. > > > > > > > > I suppose it would be rare for QEmu to use this interface to get > > > > the context of entry(the only case I think is when MSI-X disable > > > > and QEmu need to sync the context), so performance should not be > > > > an issue. > > > > > > > > What's your opinion? > > > > > > > > >#define KVM_GET_MSIX_ENTRY_IOWR(KVMIO, 0x7d, struct > > > > >kvm_msix_entry) > > > > > > Need SET_MSIX_ENTRY for live migration as well. > > > > Current we don't support LM with VT-d... > > Isn't this work useful for virtio as well? Yeah, but won't be included in this patchset. What API changes are needed? I'd like to see the complete API. > > > What about the pending bits? > > > > We didn't cover it here - and it's in another MMIO space(PBA). Of course > > we can add more flags for it later. > > When an entry is masked, we need to set the pending bit for it > somewhere. I guess this is broken in the existing code (without your > patches)? 
Even with my patch, we didn't support the pending bit. It would always return 0 now. What we supposed to do(after my patch checked in) is to check IRQ_PENDING flag of irq_desc->status(if the entry is masked), and return the result to userspace. That would involve some core change, like to export irq_to_desc(). I don't think it would be accepted soon, so would push mask bit first. The API needs to be compatible with the pending bit, even if we don't implement it now. I want to reduce the rate of API changes. > > > > Also need a new exit reason to tell userspace that an msix entry has > > > changed, so userspace can update mappings. > > > > I think we don't need it. Whenever userspace want to get one mapping > > which is an enabled MSI-X entry, it can check it with the API > > above(which is quite rare, because kernel would handle all of them when > > guest is accessing them). If it's a disabled entry, the context inside > > userspace MMIO record is the correct one(and only one). The only place I > > think QEmu need to sync is when MSI-X is about to disabled, QEmu need to > > update it's own MMIO record. > > So in-kernel handling of mmio would be decided per entry? I'm trying to > simplify this, and simplest thing is - all or nothing. So you would like to handle all MSI-X MMIO in kernel? Yes. Writes to address or data would be handled by: - recording it into the shadow msix table - notifying userspace that msix entry x changed Reads would be handled in kernel from the shadow msix table. So instead of - guest reads/writes msix - kvm filters mmio, implements some, passes others to userspace we have - guest reads/writes msix - kvm implements all - some writes generate an additional notification to userspace -- error compiling committee.c: too many arguments to function
Re: Mask bit support's API
On Tue, Nov 23, 2010 at 02:09:52PM +0800, Yang, Sheng wrote: > Hi Avi, > > I've purposed the following API for mask bit support. > > The main point is, QEmu can know which entries are enabled(by > pci_enable_msix()). Unfortunately, it can't I think, unless all your guests are linux. "enabled entries" is a linux kernel concept. The MSIX spec only tells you which entries are masked and which are unmasked. -- MST
Re: Mask bit support's API
On Tuesday 23 November 2010 15:54:40 Avi Kivity wrote: > On 11/23/2010 08:35 AM, Yang, Sheng wrote: > > On Tuesday 23 November 2010 14:17:28 Avi Kivity wrote: > > > On 11/23/2010 08:09 AM, Yang, Sheng wrote: > > > > Hi Avi, > > > > > > > > I've purposed the following API for mask bit support. > > > > > > > > The main point is, QEmu can know which entries are enabled(by > > > > pci_enable_msix()). And for enabled entries, kernel own it, > > > > including MSI data/address and mask bit(routing table and mask > > > > bitmap). QEmu should use KVM_GET_MSIX_ENTRY ioctl to get them(and > > > > it can sync with them if it want to do so). > > > > > > > > Before entries are enabled, QEmu can still use it's own MSI > > > > table(because we didn't contain these kind of information in > > > > kernel, and it's unnecessary for kernel). > > > > > > > > The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query > > > > one entry didn't exist in kernel - or we can simply return -EINVAL > > > > for it. > > > > > > > > I suppose it would be rare for QEmu to use this interface to get > > > > the context of entry(the only case I think is when MSI-X disable > > > > and QEmu need to sync the context), so performance should not be > > > > an issue. > > > > > > > > What's your opinion? > > > > > > > > > #define KVM_GET_MSIX_ENTRY_IOWR(KVMIO, 0x7d, struct > > > > > kvm_msix_entry) > > > > > > Need SET_MSIX_ENTRY for live migration as well. > > > > Current we don't support LM with VT-d... > > Isn't this work useful for virtio as well? Yeah, but won't be included in this patchset. > > > > > > #define KVM_UPDATE_MSIX_MMIO _IOW(KVMIO, 0x7e, struct > > > > > kvm_msix_mmio) > > > > > > > > > > #define KVM_MSIX_TYPE_ASSIGNED_DEV 1 > > > > > > > > > > #define KVM_MSIX_FLAG_MASKBIT (1<< 0) > > > > > #define KVM_MSIX_FLAG_QUERY_MASKBIT (1<< 0) > > > > > #define KVM_MSIX_FLAG_ENTRY (1<< 1) > > > > > #define KVM_MSIX_FLAG_QUERY_ENTRY (1<< 1) > > > > > > Why is there a need for the flag? 
If we simply get/set entire > > > entries, that includes the mask bits? > > > > We still want QEmu to cover a part of entries which hasn't been enabled > > yet(which won't existed in routing table), but kernel would cover all > > mask bit regardless of if it's enabled. So QEmu can query any entry to > > check the maskbit, but not address/data. > > Don't understand. If we support reading/writing entire entries, that > works for both enabled and disabled entries? > > > > What about the pending bits? > > > > We didn't cover it here - and it's in another MMIO space(PBA). Of course > > we can add more flags for it later. > > When an entry is masked, we need to set the pending bit for it > somewhere. I guess this is broken in the existing code (without your > patches)? Even with my patch, we didn't support the pending bit. It would always return 0 now. What we supposed to do(after my patch checked in) is to check IRQ_PENDING flag of irq_desc->status(if the entry is masked), and return the result to userspace. That would involve some core change, like to export irq_to_desc(). I don't think it would be accepted soon, so would push mask bit first. > > > > Also need a new exit reason to tell userspace that an msix entry has > > > changed, so userspace can update mappings. > > > > I think we don't need it. Whenever userspace want to get one mapping > > which is an enabled MSI-X entry, it can check it with the API > > above(which is quite rare, because kernel would handle all of them when > > guest is accessing them). If it's a disabled entry, the context inside > > userspace MMIO record is the correct one(and only one). The only place I > > think QEmu need to sync is when MSI-X is about to disabled, QEmu need to > > update it's own MMIO record. > > So in-kernel handling of mmio would be decided per entry? I'm trying to > simplify this, and simplest thing is - all or nothing. So you would like to handle all MSI-X MMIO in kernel? 
-- regards Yang, Sheng
Re: Mask bit support's API
On 11/23/2010 08:35 AM, Yang, Sheng wrote: On Tuesday 23 November 2010 14:17:28 Avi Kivity wrote: > On 11/23/2010 08:09 AM, Yang, Sheng wrote: > > Hi Avi, > > > > I've purposed the following API for mask bit support. > > > > The main point is, QEmu can know which entries are enabled(by > > pci_enable_msix()). And for enabled entries, kernel own it, including > > MSI data/address and mask bit(routing table and mask bitmap). QEmu > > should use KVM_GET_MSIX_ENTRY ioctl to get them(and it can sync with > > them if it want to do so). > > > > Before entries are enabled, QEmu can still use it's own MSI table(because > > we didn't contain these kind of information in kernel, and it's > > unnecessary for kernel). > > > > The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query one > > entry didn't exist in kernel - or we can simply return -EINVAL for it. > > > > I suppose it would be rare for QEmu to use this interface to get the > > context of entry(the only case I think is when MSI-X disable and QEmu > > need to sync the context), so performance should not be an issue. > > > > What's your opinion? > > > > > #define KVM_GET_MSIX_ENTRY_IOWR(KVMIO, 0x7d, struct > > > kvm_msix_entry) > > Need SET_MSIX_ENTRY for live migration as well. Current we don't support LM with VT-d... Isn't this work useful for virtio as well? > > > > #define KVM_UPDATE_MSIX_MMIO _IOW(KVMIO, 0x7e, struct > > > kvm_msix_mmio) > > > > > > #define KVM_MSIX_TYPE_ASSIGNED_DEV 1 > > > > > > #define KVM_MSIX_FLAG_MASKBIT (1<< 0) > > > #define KVM_MSIX_FLAG_QUERY_MASKBIT (1<< 0) > > > #define KVM_MSIX_FLAG_ENTRY (1<< 1) > > > #define KVM_MSIX_FLAG_QUERY_ENTRY (1<< 1) > > Why is there a need for the flag? If we simply get/set entire entries, > that includes the mask bits? We still want QEmu to cover a part of entries which hasn't been enabled yet(which won't existed in routing table), but kernel would cover all mask bit regardless of if it's enabled. 
So QEmu can query any entry to check the maskbit, but not address/data. Don't understand. If we support reading/writing entire entries, that works for both enabled and disabled entries? > What about the pending bits? We didn't cover it here - and it's in another MMIO space(PBA). Of course we can add more flags for it later. When an entry is masked, we need to set the pending bit for it somewhere. I guess this is broken in the existing code (without your patches)? > > Also need a new exit reason to tell userspace that an msix entry has > changed, so userspace can update mappings. I think we don't need it. Whenever userspace want to get one mapping which is an enabled MSI-X entry, it can check it with the API above(which is quite rare, because kernel would handle all of them when guest is accessing them). If it's a disabled entry, the context inside userspace MMIO record is the correct one(and only one). The only place I think QEmu need to sync is when MSI-X is about to disabled, QEmu need to update it's own MMIO record. So in-kernel handling of mmio would be decided per entry? I'm trying to simplify this, and simplest thing is - all or nothing. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: Mask bit support's API
On Tuesday 23 November 2010 14:17:28 Avi Kivity wrote: > On 11/23/2010 08:09 AM, Yang, Sheng wrote: > > Hi Avi, > > > > I've purposed the following API for mask bit support. > > > > The main point is, QEmu can know which entries are enabled(by > > pci_enable_msix()). And for enabled entries, kernel own it, including > > MSI data/address and mask bit(routing table and mask bitmap). QEmu > > should use KVM_GET_MSIX_ENTRY ioctl to get them(and it can sync with > > them if it want to do so). > > > > Before entries are enabled, QEmu can still use it's own MSI table(because > > we didn't contain these kind of information in kernel, and it's > > unnecessary for kernel). > > > > The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query one > > entry didn't exist in kernel - or we can simply return -EINVAL for it. > > > > I suppose it would be rare for QEmu to use this interface to get the > > context of entry(the only case I think is when MSI-X disable and QEmu > > need to sync the context), so performance should not be an issue. > > > > What's your opinion? > > > > > #define KVM_GET_MSIX_ENTRY_IOWR(KVMIO, 0x7d, struct > > > kvm_msix_entry) > > Need SET_MSIX_ENTRY for live migration as well. Current we don't support LM with VT-d... > > > > #define KVM_UPDATE_MSIX_MMIO _IOW(KVMIO, 0x7e, struct > > > kvm_msix_mmio) > > > > > > #define KVM_MSIX_TYPE_ASSIGNED_DEV 1 > > > > > > #define KVM_MSIX_FLAG_MASKBIT (1<< 0) > > > #define KVM_MSIX_FLAG_QUERY_MASKBIT (1<< 0) > > > #define KVM_MSIX_FLAG_ENTRY (1<< 1) > > > #define KVM_MSIX_FLAG_QUERY_ENTRY (1<< 1) > > Why is there a need for the flag? If we simply get/set entire entries, > that includes the mask bits? We still want QEmu to cover a part of entries which hasn't been enabled yet(which won't existed in routing table), but kernel would cover all mask bit regardless of if it's enabled. So QEmu can query any entry to check the maskbit, but not address/data. > What about the pending bits? 
We didn't cover it here - and it's in another MMIO space(PBA). Of course we can add more flags for it later. > > > > struct kvm_msix_entry { > > > > > > __u32 id; > > > __u32 type; > > > __u32 entry; /* The index of entry in the MSI-X table */ > > > __u32 flags; > > > __u32 query_flags; > > > union { > > > > > > struct { > > > > > > __u32 addr_lo; > > > __u32 addr_hi; > > > __u32 data; > > Isn't the mask bit in the last word? Or maybe I'm confused about the > format. I separated the entry and mask bit as I said above. > > > > } msi_entry; > > > __u32 reserved[12]; > > > > > > }; > > > > > > }; > > > > > > #define KVM_MSIX_MMIO_FLAG_REGISTER (1<< 0) > > > #define KVM_MSIX_MMIO_FLAG_UNREGISTER (1<< 1) > > > #define KVM_MSIX_MMIO_FLAG_MASK 0x3 > > > > > > struct kvm_msix_mmio { > > > > > > __u32 id; > > > __u32 type; > > > __u64 base_addr; > > > __u32 max_entries_nr; > > > __u32 flags; > > > __u32 reserved[6]; > > > > > > }; > > Also need a new exit reason to tell userspace that an msix entry has > changed, so userspace can update mappings. I think we don't need it. Whenever userspace want to get one mapping which is an enabled MSI-X entry, it can check it with the API above(which is quite rare, because kernel would handle all of them when guest is accessing them). If it's a disabled entry, the context inside userspace MMIO record is the correct one(and only one). The only place I think QEmu need to sync is when MSI-X is about to disabled, QEmu need to update it's own MMIO record. -- regards Yang, Sheng
Re: Mask bit support's API
On 11/23/2010 08:09 AM, Yang, Sheng wrote: Hi Avi, I've purposed the following API for mask bit support. The main point is, QEmu can know which entries are enabled(by pci_enable_msix()). And for enabled entries, kernel own it, including MSI data/address and mask bit(routing table and mask bitmap). QEmu should use KVM_GET_MSIX_ENTRY ioctl to get them(and it can sync with them if it want to do so). Before entries are enabled, QEmu can still use it's own MSI table(because we didn't contain these kind of information in kernel, and it's unnecessary for kernel). The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query one entry didn't exist in kernel - or we can simply return -EINVAL for it. I suppose it would be rare for QEmu to use this interface to get the context of entry(the only case I think is when MSI-X disable and QEmu need to sync the context), so performance should not be an issue. What's your opinion? > #define KVM_GET_MSIX_ENTRY_IOWR(KVMIO, 0x7d, struct kvm_msix_entry) Need SET_MSIX_ENTRY for live migration as well. > #define KVM_UPDATE_MSIX_MMIO _IOW(KVMIO, 0x7e, struct kvm_msix_mmio) > > #define KVM_MSIX_TYPE_ASSIGNED_DEV 1 > > #define KVM_MSIX_FLAG_MASKBIT (1<< 0) > #define KVM_MSIX_FLAG_QUERY_MASKBIT (1<< 0) > #define KVM_MSIX_FLAG_ENTRY (1<< 1) > #define KVM_MSIX_FLAG_QUERY_ENTRY (1<< 1) > Why is there a need for the flag? If we simply get/set entire entries, that includes the mask bits? What about the pending bits? > struct kvm_msix_entry { > __u32 id; > __u32 type; > __u32 entry; /* The index of entry in the MSI-X table */ > __u32 flags; > __u32 query_flags; > union { > struct { > __u32 addr_lo; > __u32 addr_hi; > __u32 data; Isn't the mask bit in the last word? Or maybe I'm confused about the format. 
> } msi_entry; > __u32 reserved[12]; > }; > }; > > #define KVM_MSIX_MMIO_FLAG_REGISTER (1<< 0) > #define KVM_MSIX_MMIO_FLAG_UNREGISTER (1<< 1) > #define KVM_MSIX_MMIO_FLAG_MASK 0x3 > > struct kvm_msix_mmio { > __u32 id; > __u32 type; > __u64 base_addr; > __u32 max_entries_nr; > __u32 flags; > __u32 reserved[6]; > }; Also need a new exit reason to tell userspace that an msix entry has changed, so userspace can update mappings. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Mask bit support's API
Hi Avi, I've purposed the following API for mask bit support. The main point is, QEmu can know which entries are enabled(by pci_enable_msix()). And for enabled entries, kernel own it, including MSI data/address and mask bit(routing table and mask bitmap). QEmu should use KVM_GET_MSIX_ENTRY ioctl to get them(and it can sync with them if it want to do so). Before entries are enabled, QEmu can still use it's own MSI table(because we didn't contain these kind of information in kernel, and it's unnecessary for kernel). The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query one entry didn't exist in kernel - or we can simply return -EINVAL for it. I suppose it would be rare for QEmu to use this interface to get the context of entry(the only case I think is when MSI-X disable and QEmu need to sync the context), so performance should not be an issue. What's your opinion? > #define KVM_GET_MSIX_ENTRY _IOWR(KVMIO, 0x7d, struct kvm_msix_entry) > #define KVM_UPDATE_MSIX_MMIO _IOW(KVMIO, 0x7e, struct kvm_msix_mmio) > > #define KVM_MSIX_TYPE_ASSIGNED_DEV 1 > > #define KVM_MSIX_FLAG_MASKBIT (1 << 0) > #define KVM_MSIX_FLAG_QUERY_MASKBIT (1 << 0) > #define KVM_MSIX_FLAG_ENTRY (1 << 1) > #define KVM_MSIX_FLAG_QUERY_ENTRY (1 << 1) > > struct kvm_msix_entry { > __u32 id; > __u32 type; > __u32 entry; /* The index of entry in the MSI-X table */ > __u32 flags; > __u32 query_flags; > union { > struct { > __u32 addr_lo; > __u32 addr_hi; > __u32 data; > } msi_entry; > __u32 reserved[12]; > }; > }; > > #define KVM_MSIX_MMIO_FLAG_REGISTER (1 << 0) > #define KVM_MSIX_MMIO_FLAG_UNREGISTER (1 << 1) > #define KVM_MSIX_MMIO_FLAG_MASK 0x3 > > struct kvm_msix_mmio { > __u32 id; > __u32 type; > __u64 base_addr; > __u32 max_entries_nr; > __u32 flags; > __u32 reserved[6]; > }; -- regards Yang, Sheng