Re: [PATCH] deal with interrupt shadow state for emulated instruction
On Wed, May 06, 2009 at 01:51:04PM +0300, Avi Kivity wrote: > Gleb Natapov wrote: >> On Tue, May 05, 2009 at 02:40:11PM -0400, Glauber Costa wrote: >> >>> diff --git a/arch/x86/include/asm/kvm_host.h >>> b/arch/x86/include/asm/kvm_host.h >>> index 8e680c3..a49d07b 100644 >>> --- a/arch/x86/include/asm/kvm_host.h >>> +++ b/arch/x86/include/asm/kvm_host.h >>> @@ -510,6 +510,8 @@ struct kvm_x86_ops { >>> void (*run)(struct kvm_vcpu *vcpu, struct kvm_run *run); >>> int (*handle_exit)(struct kvm_run *run, struct kvm_vcpu *vcpu); >>> void (*skip_emulated_instruction)(struct kvm_vcpu *vcpu); >>> + void (*set_interrupt_shadow)(struct kvm_vcpu *vcpu, int mask); >>> >> There is .drop_interrupt_shadow() callback. The patch should remove it and >> replace its use by set_interrupt_shadow(). >> > > That would be [PATCH 1/2]. [PATCH 2/2]. Otherwise we will break bisectability, as the pure removal of this function would lead us to a non-functioning kernel for no reason. Avi: if this patch is okay, please apply. I'll send another one later that replaces the existing .drop_interrupt_shadow by the (then) in tree set_interrupt_shadow. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
On Thu, 7 May 2009, Avi Kivity wrote: > What's your take on adding irq context safe callbacks to irqfd? > > To give some background here, we would like to use eventfd as a generic > connector between components, so the components do not know about each other. > So far eventfd successfully abstracts among components in the same process, in > different processes, and in the kernel. > > eventfd_signal() can be safely called from irq context, and will wake up a > waiting task. But in some cases, if the consumer is in the kernel, it may be > able to consume the event from irq context, saving a context switch. > > So, will you consider patches adding this capability to eventfd? Since I received this one after your ACK on the capability of eventfd of triggering callbacks, I assume we're clear on this point, aren't we? - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Marcelo Tosatti wrote: > On Thu, May 07, 2009 at 08:35:03PM -0300, Marcelo Tosatti wrote: > >> Also for PIO/MMIO you're adding this unoptimized lookup to the >> measurement: >> >> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); >> if (pio_dev) { >> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); >> complete_pio(vcpu); >> return 1; >> } >> >> Whereas for hypercall measurement you don't. I believe a fair comparison >> would be have a shared guest/host memory area where you store guest/host >> TSC values and then do, on guest: >> >> rdtscll(&shared_area->guest_tsc); >> pio/mmio/hypercall >> ... back to host >> rdtscll(&shared_area->host_tsc); >> >> And then calculate the difference (minus guests TSC_OFFSET of course)? >> > > Test Machine: Dell Precision 490 - 4-way SMP (2x2) x86_64 "Woodcrest" > Core2 Xeon 5130 @2.00Ghz, 4GB RAM. > > Also it would be interesting to see the MMIO comparison with EPT/NPT, > it probably sucks much less than what you're seeing. > > Agreed. If you or someone on this thread has such a beast, please fire up my test and post the numbers. -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 0/3] generic hypercall support
Marcelo Tosatti wrote: > On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote: > >> Chris Wright wrote: >> >>> * Gregory Haskins (ghask...@novell.com) wrote: >>> >>> Chris Wright wrote: > VF drivers can also have this issue (and typically use mmio). > I at least have a better idea what your proposal is, thanks for > explanation. Are you able to demonstrate concrete benefit with it yet > (improved latency numbers for example)? > > I had a test-harness/numbers for this kind of thing, but its a bit crufty since its from ~1.5 years ago. I will dig it up, update it, and generate/post new numbers. >>> That would be useful, because I keep coming back to pio and shared >>> page(s) when think of why not to do this. Seems I'm not alone in that. >>> >>> thanks, >>> -chris >>> >>> >> I completed the resurrection of the test and wrote up a little wiki on >> the subject, which you can find here: >> >> http://developer.novell.com/wiki/index.php/WhyHypercalls >> >> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why >> reinvent the wheel?" questions. >> >> I will include this information when I publish the updated v2 series >> with the s/hypercall/dynhc changes. >> >> Let me know if you have any questions. >> > > Greg, > > I think comparison is not entirely fair. You're using > KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that > (on Intel) to only one register read: > > nr = kvm_register_read(vcpu, VCPU_REGS_RAX); > > Whereas in a real hypercall for (say) PIO you would need the address, > size, direction and data. > Hi Marcelo, I'll have to respectfully disagree with you here. What you are proposing is actually a different test: a 4th type I would call "PIO over HC". It is distinctly different than the existing MMIO, PIO, and HC tests already present. I assert that the current HC test remains valid because for pure hypercalls, the "nr" *is* the address. It identifies the function to be executed (e.g. VAPIC_POLL_IRQ = null), just like the PIO address of my nullio device identifies the function to be executed (i.e. nullio_write() = null) My argument is that the HC test emulates the "dynhc()" concept I have been talking about, whereas the PIOoHC is more like the pv_io_ops->iowrite approach. That said, your 4th test type would actually be a very interesting data-point to add to the suite (especially since we are still kicking around the notion of doing something like this). I will update the patches. > Also for PIO/MMIO you're adding this unoptimized lookup to the > measurement: > > pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); > if (pio_dev) { > kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); > complete_pio(vcpu); > return 1; > } > > Whereas for hypercall measurement you don't. In theory they should both share about the same algorithmic complexity in the decode-stage, but due to the possible optimization you mention you may have a point. I need to take some steps to ensure the HC path isn't artificially simplified by GCC (like making the execute stage do some trivial work like you mention below). > I believe a fair comparison > would be have a shared guest/host memory area where you store guest/host > TSC values and then do, on guest: > > rdtscll(&shared_area->guest_tsc); > pio/mmio/hypercall > ... back to host > rdtscll(&shared_area->host_tsc); > > And then calculate the difference (minus guests TSC_OFFSET of course)? > > I'm not sure I need that much complexity. I can probably just change the test harness to generate an ioread32(), and have the functions return the TSC value as a return parameter for all test types. The important thing is that we pick something extremely cheap (yet dynamic) to compute so the execution time doesn't invalidate the measurement granularity with a large constant. Regards, -Greg signature.asc Description: OpenPGP digital signature
Re: SCSI block device timeouts post-migration on x86_64 linux guests
Marcelo Tosatti wrote: There's no savevm handler at all: Ahh! Thank you for the explanation. There was an incomplete patch posted to qemu-devel a while ago, which I can't find. I'm guessing that would be "[PATCH] save/restore for LSI SCSI", by Nolan : http://permalink.gmane.org/gmane.comp.emulators.qemu/36045 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SCSI block device timeouts post-migration on x86_64 linux guests
Charles, There's no savevm handler at all: $ grep register_savevm hw/lsi53c895a.c There was an incomplete patch posted to qemu-devel a while ago, which I can't find. On Tue, Apr 28, 2009 at 04:24:02PM -0500, Charles Duffy wrote: > Per subject. Observed on kvm-83 through kvm-85; haven't traced back > further. No longer occurs when switching to IDE. Manifests with messages > similar to the following on restore: > > sd 0:0:0:0: [sda] ABORT operation started > sd 0:0:0:0: ABORT operation timed-out. > sd 0:0:0:0: [sda] ABORT operation started > sd 0:0:0:0: ABORT operation timed-out. > sd 0:0:0:0: [sda] ABORT operation started > sd 0:0:0:0: ABORT operation timed-out. > sd 0:0:0:0: [sda] ABORT operation started -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] msi-x: let drivers retry when not enough vectors
On Fri, 2009-05-08 at 09:25 +0930, Rusty Russell wrote: > On Thu, 7 May 2009 07:49:53 pm Sheng Yang wrote: > > On Thursday 07 May 2009 17:53:02 Matthew Wilcox wrote: > > > Here's a good example. Let's suppose you have a driver which supports > > > two different models of cards, one has 16 MSI-X interrupts, the other > > > has 10. You can call pci_enable_msix() asking for 16 vectors. If your > > > card is model A, you get 16 interrupts. If your card is model B, it says > > > "you can have 10". > > Sheng is absolutely right, that's a horrid API. It's not horrid, though it is tricky - but only for drivers that care, you can still do: if (pci_enable_msix(..)) bail; > If it actually enabled that number and returned it, it might make sense (cf. > write() returning less bytes than you give it). It could do that, but I think that would be worse. The driver, on finding out it can only get a certain number of MSIs might need to make a decision about how it allocates those - eg. in a network driver, sharing them between TX/RX/management. And in practice most of the drivers just bail if they can't get what they asked for, so enabling less than they wanted would just mean they have to go and disable them. > But overloading the return > value to save an explicit call is just ugly; it's not worth saving a few > lines > of code at cost of making all the drivers subtle and tricksy. Looking at just this patch, I would agree, but unfortunately it's not that simple. The first limitation on the number of MSIs the driver can have is the number the device supports, that's what this patch does. But there are others, and they come out of the arch code, or even the firmware. So to implement pci_how_many_msis(), we'd need a parallel API all the way down to the arch code, or a flag to all the existing routines saying "don't really allocate, just find out". That would be horrid IMHO. cheers signature.asc Description: This is a digitally signed message part
Re: [PATCH] msi-x: let drivers retry when not enough vectors
On Fri, May 08, 2009 at 09:25:00AM +0930, Rusty Russell wrote: > Sheng is absolutely right, that's a horrid API. But the API already exists, and is in use by 27 drivers. If this were a change to the API, I'd be against it, but it is the existing API. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] msi-x: let drivers retry when not enough vectors
On Thu, 7 May 2009 07:49:53 pm Sheng Yang wrote: > On Thursday 07 May 2009 17:53:02 Matthew Wilcox wrote: > > Here's a good example. Let's suppose you have a driver which supports > > two different models of cards, one has 16 MSI-X interrupts, the other > > has 10. You can call pci_enable_msix() asking for 16 vectors. If your > > card is model A, you get 16 interrupts. If your card is model B, it says > > "you can have 10". Sheng is absolutely right, that's a horrid API. If it actually enabled that number and returned it, it might make sense (cf. write() returning less bytes than you give it). But overloading the return value to save an explicit call is just ugly; it's not worth saving a few lines of code at cost of making all the drivers subtle and tricksy. Fail with -ENOSPC or something. Rusty. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
On Thu, May 07, 2009 at 08:35:03PM -0300, Marcelo Tosatti wrote: > Also for PIO/MMIO you're adding this unoptimized lookup to the > measurement: > > pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); > if (pio_dev) { > kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); > complete_pio(vcpu); > return 1; > } > > Whereas for hypercall measurement you don't. I believe a fair comparison > would be have a shared guest/host memory area where you store guest/host > TSC values and then do, on guest: > > rdtscll(&shared_area->guest_tsc); > pio/mmio/hypercall > ... back to host > rdtscll(&shared_area->host_tsc); > > And then calculate the difference (minus guests TSC_OFFSET of course)? Test Machine: Dell Precision 490 - 4-way SMP (2x2) x86_64 "Woodcrest" Core2 Xeon 5130 @2.00Ghz, 4GB RAM. Also it would be interesting to see the MMIO comparison with EPT/NPT, it probably sucks much less than what you're seeing. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: bridges
On Thu, May 07, 2009 at 08:57:03AM -0700, Ross Boylan wrote: > I'm trying to understand bridging with KVM, but am still puzzled. > I think that the recommended bridging with TAP means that packets from > the VM will end up going out the host card attached to the default > gateway. But it looks to me as if their IP address is unchanged, which > means replies will never reach me. Is that correct? Do I need to NAT > the packets, or is something already doing that? > > Some documents indicate that I need to bring the interfaces (e.g., eth0) > down before I bring the bridge up, and that afterwards only the bridge > will have an IP address. Is that right? Here's how I think of a Linux "soft" bridge: the bridge consists of an Ethernet switch, and a regular interface (named after the bridge) that is connected to that switch. This is why you "give an IP address to the bridge", because "the bridge" is also a NIC of it's own. If you attach any physical interfaces (eg ethN) to the bridge, they aren't NICs any more, they're just network cables you plug into the switch to pass traffic to other switches. Attaching VMs to the switch is just hooking up more cables between the switch and the VMs. If you want your host to do NAT for your VMs, then you do as you would for any other firewall -- you have one switch (the bridge, in this case) with all of your VMs and the "internal" interface of the host (in this case, the bridge as well) all plugged in, and then a second interface to the outside world (the physical NIC). > Some documents, e.g., > http://ebtables.sourceforge.net/br_fw_ia/br_fw_ia.html, indicate > iptables should "just work" with bridging. Yes, iptables *does* "just work" with bridging, in the sense that iptables can still filter IP packets passing through it's interfaces. What you *can't* do, though, is have some sort of magic iptables filter deep in the bridge that plays with all traffic as it traverses. For that, there's ebtables, which is iptables but for Ethernet (rather than IP) traffic. Personally, I've never used ebtables in my life. > However, I've seen someone > with a 2.6.15 kernel ask about firewalling and be told they needed to > patch the kernel to get it work (don't have the reference handy). > Should it just work? It should Just Work, and if you've got to patch any 2.6 (or even probably 2.4) kernel then you're doing something *very* esoteric. - Matt -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote: > Chris Wright wrote: > > * Gregory Haskins (ghask...@novell.com) wrote: > > > >> Chris Wright wrote: > >> > >>> VF drivers can also have this issue (and typically use mmio). > >>> I at least have a better idea what your proposal is, thanks for > >>> explanation. Are you able to demonstrate concrete benefit with it yet > >>> (improved latency numbers for example)? > >>> > >> I had a test-harness/numbers for this kind of thing, but its a bit > >> crufty since its from ~1.5 years ago. I will dig it up, update it, and > >> generate/post new numbers. > >> > > > > That would be useful, because I keep coming back to pio and shared > > page(s) when think of why not to do this. Seems I'm not alone in that. > > > > thanks, > > -chris > > > > I completed the resurrection of the test and wrote up a little wiki on > the subject, which you can find here: > > http://developer.novell.com/wiki/index.php/WhyHypercalls > > Hopefully this answers Chris' "show me the numbers" and Anthony's "Why > reinvent the wheel?" questions. > > I will include this information when I publish the updated v2 series > with the s/hypercall/dynhc changes. > > Let me know if you have any questions. Greg, I think comparison is not entirely fair. You're using KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that (on Intel) to only one register read: nr = kvm_register_read(vcpu, VCPU_REGS_RAX); Whereas in a real hypercall for (say) PIO you would need the address, size, direction and data. Also for PIO/MMIO you're adding this unoptimized lookup to the measurement: pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); if (pio_dev) { kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); complete_pio(vcpu); return 1; } Whereas for hypercall measurement you don't. I believe a fair comparison would be have a shared guest/host memory area where you store guest/host TSC values and then do, on guest: rdtscll(&shared_area->guest_tsc); pio/mmio/hypercall ... back to host rdtscll(&shared_area->host_tsc); And then calculate the difference (minus guests TSC_OFFSET of course)? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm: Use a bitmap for tracking used GSIs
We're currently using a counter to track the most recent GSI we've handed out. This quickly hits KVM_MAX_IRQ_ROUTES when using device assignment with a driver that regularly toggles the MSI enable bit. This can mean only a few minutes of usable run time. Instead, track used GSIs in a bitmap. Signed-off-by: Alex Williamson --- Applies on top of "kvm: device-assignment: Catch GSI overflow" hw/device-assignment.c |4 ++- kvm/libkvm/kvm-common.h |3 +- kvm/libkvm/libkvm.c | 68 +-- kvm/libkvm/libkvm.h | 10 +++ 4 files changed, 74 insertions(+), 11 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index e06dd08..5bdae24 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -561,8 +561,10 @@ static void free_dev_irq_entries(AssignedDevice *dev) { int i; -for (i = 0; i < dev->irq_entries_nr; i++) +for (i = 0; i < dev->irq_entries_nr; i++) { kvm_del_routing_entry(kvm_context, &dev->entry[i]); +kvm_free_irq_route_gsi(kvm_context, dev->entry[i].gsi); +} free(dev->entry); dev->entry = NULL; dev->irq_entries_nr = 0; diff --git a/kvm/libkvm/kvm-common.h b/kvm/libkvm/kvm-common.h index 591fb53..94f86e5 100644 --- a/kvm/libkvm/kvm-common.h +++ b/kvm/libkvm/kvm-common.h @@ -66,8 +66,9 @@ struct kvm_context { #ifdef KVM_CAP_IRQ_ROUTING struct kvm_irq_routing *irq_routes; int nr_allocated_irq_routes; + void *used_gsi_bitmap; + int max_gsi; #endif - int max_used_gsi; }; int kvm_alloc_kernel_memory(kvm_context_t kvm, unsigned long memory, diff --git a/kvm/libkvm/libkvm.c b/kvm/libkvm/libkvm.c index 2a4165a..43abc7d 100644 --- a/kvm/libkvm/libkvm.c +++ b/kvm/libkvm/libkvm.c @@ -1298,8 +1298,6 @@ int kvm_add_routing_entry(kvm_context_t kvm, new->flags = entry->flags; new->u = entry->u; - if (entry->gsi > kvm->max_used_gsi) - kvm->max_used_gsi = entry->gsi; return 0; #else return -ENOSYS; @@ -1404,20 +1402,72 @@ int kvm_commit_irq_routes(kvm_context_t kvm) #endif } +#ifdef KVM_CAP_IRQ_ROUTING +static inline void set_bit(unsigned int *buf, int bit) +{ + buf[bit >> 5] |= (1U << (bit & 0x1f)); +} + +static inline void clear_bit(unsigned int *buf, int bit) +{ + buf[bit >> 5] &= ~(1U << (bit & 0x1f)); +} + +static int kvm_find_free_gsi(kvm_context_t kvm) +{ + int i, bit, gsi; + unsigned int *buf = kvm->used_gsi_bitmap; + + for (i = 0; i < (kvm->max_gsi >> 5); i++) { + if (buf[i] != ~0U) + break; + } + + if (i == kvm->max_gsi >> 5) + return -ENOSPC; + + bit = ffs(~buf[i]); + if (!bit) + return -EAGAIN; + + gsi = (bit - 1) | (i << 5); + set_bit(buf, gsi); + return gsi; +} +#endif + int kvm_get_irq_route_gsi(kvm_context_t kvm) { #ifdef KVM_CAP_IRQ_ROUTING - if (kvm->max_used_gsi >= KVM_IOAPIC_NUM_PINS) { - if (kvm->max_used_gsi + 1 < kvm_get_gsi_count(kvm)) -return kvm->max_used_gsi + 1; -else -return -ENOSPC; -} else -return KVM_IOAPIC_NUM_PINS; + if (!kvm->max_gsi) { + int i; + + /* Round the number of GSIs supported to a 4 byte +* value so we can search it using ints and ffs */ + i = kvm_get_gsi_count(kvm) & ~0x1f; + kvm->used_gsi_bitmap = malloc(i >> 3); + if (!kvm->used_gsi_bitmap) + return -ENOMEM; + memset(kvm->used_gsi_bitmap, 0, i >> 3); + kvm->max_gsi = i; + + /* Mark all the IOAPIC pin GSIs as already used */ + for (i = 0; i <= KVM_IOAPIC_NUM_PINS; i++) + set_bit(kvm->used_gsi_bitmap, i); + } + + return kvm_find_free_gsi(kvm); #else return -ENOSYS; #endif } + +void kvm_free_irq_route_gsi(kvm_context_t kvm, int gsi) +{ +#ifdef KVM_CAP_IRQ_ROUTING + clear_bit(kvm->used_gsi_bitmap, gsi); +#endif +} #ifdef KVM_CAP_DEVICE_MSIX int kvm_assign_set_msix_nr(kvm_context_t kvm, diff --git a/kvm/libkvm/libkvm.h b/kvm/libkvm/libkvm.h index c23d37b..4e9344c 100644 --- a/kvm/libkvm/libkvm.h +++ b/kvm/libkvm/libkvm.h @@ -856,6 +856,16 @@ int kvm_commit_irq_routes(kvm_context_t kvm); */ int kvm_get_irq_route_gsi(kvm_context_t kvm); +/*! + * \brief Free used GSI number + * + * Free used GSI number acquired from kvm_get_irq_route_gsi() + * + * \param kvm Pointer to the current kvm_context + * \param gsi GSI number to free + */ +void kvm_free_irq_route_gsi(kvm_context_t kvm, int gsi); + #ifdef KVM_CAP_DEVICE_MSIX int kvm_assign_set_msix_nr(kvm_context_t kvm, struct kvm_assigned_msix_nr *msix_nr); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.
Re: [RFC PATCH 0/3] generic hypercall support
On Thursday 07 May 2009, Chris Wright wrote: > > > Chris, is that issue with the non ioread/iowrite access of a mangled > > pointer still an issue here? I would think so, but I am a bit fuzzy on > > whether there is still an issue of non-wrapped MMIO ever occuring. > > Arnd was saying it's a bug for other reasons, so perhaps it would work > out fine. Well, maybe. I only said that __raw_writel and pointer dereference is bad, but not writel. IIRC when we had that discussion about io-workarounds on powerpc, the outcome was that passing an IORESOURCE_MEM resource into pci_iomap must still result in something that can be passed into writel in addition to iowrite32, while an IORESOURCE_IO resource may or may not be valid for writel and/or outl. Unfortunately, this means that either readl/writel needs to be adapted in some way (e.g. the address also ioremapped to the mangled pointer) or the mechanism will be limited to I/O space accesses. Maybe BenH remembers the details better than me. Arnd <>< -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
* Gregory Haskins (gregory.hask...@gmail.com) wrote: > Arnd Bergmann wrote: > > pci_iomap could look at the bus device that the PCI function sits on. > > If it detects a PCI bridge that has a certain property (config space > > setting, vendor/device ID, ...), it assumes that the device itself > > will be emulated and it should set the address flag for IO_COND. > > > > This implies that all pass-through devices need to be on a different > > PCI bridge from the emulated devices, which should be fairly > > straightforward to enforce. Hmm, this gets to the grey area of the ABI. I think this would mean an upgrade of the host would suddenly break when the mgmt tool does: (qemu) pci_add pci_addr=0:6 host host=01:10.0 > Thats actually a pretty good idea. > > Chris, is that issue with the non ioread/iowrite access of a mangled > pointer still an issue here? I would think so, but I am a bit fuzzy on > whether there is still an issue of non-wrapped MMIO ever occuring. Arnd was saying it's a bug for other reasons, so perhaps it would work out fine. thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Arnd Bergmann wrote: > On Thursday 07 May 2009, Gregory Haskins wrote: > >> What I am not clear on is how you would know to flag the address to >> begin with. >> > > pci_iomap could look at the bus device that the PCI function sits on. > If it detects a PCI bridge that has a certain property (config space > setting, vendor/device ID, ...), it assumes that the device itself > will be emulated and it should set the address flag for IO_COND. > > This implies that all pass-through devices need to be on a different > PCI bridge from the emulated devices, which should be fairly > straightforward to enforce. > Thats actually a pretty good idea. Chris, is that issue with the non ioread/iowrite access of a mangled pointer still an issue here? I would think so, but I am a bit fuzzy on whether there is still an issue of non-wrapped MMIO ever occuring. -Greg signature.asc Description: OpenPGP digital signature
[patch 3/4] KVM: introduce kvm_arch_can_free_memslot, disallow slot deletion if cached cr3
Disallow the deletion of memory slots (and aliases, for x86 case), if a vcpu contains a cr3 that points to such slot/alias. This complements commit 6c20e1442bb1c62914bb85b7f4a38973d2a423ba. v2: - set KVM_REQ_TRIPLE_FAULT - use __KVM_HAVE_ARCH_CAN_FREE_MEMSLOT to avoid duplication of stub Signed-off-by: Marcelo Tosatti Index: kvm-pending/arch/x86/kvm/x86.c === --- kvm-pending.orig/arch/x86/kvm/x86.c +++ kvm-pending/arch/x86/kvm/x86.c @@ -1636,6 +1636,29 @@ gfn_t unalias_gfn(struct kvm *kvm, gfn_t return gfn; } +static int kvm_root_gfn_in_range(struct kvm *kvm, gfn_t base_gfn, +gfn_t end_gfn, bool unalias) +{ + struct kvm_vcpu *vcpu; + gfn_t root_gfn; + int i; + + for (i = 0; i < KVM_MAX_VCPUS; ++i) { + vcpu = kvm->vcpus[i]; + if (!vcpu) + continue; + root_gfn = vcpu->arch.cr3 >> PAGE_SHIFT; + if (unalias) + root_gfn = unalias_gfn(kvm, root_gfn); + if (root_gfn >= base_gfn && root_gfn <= end_gfn) { + set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests); + return 1; + } + } + + return 0; +} + /* * Set a new alias region. Aliases map a portion of physical memory into * another portion. This is useful for memory windows, for example the PC @@ -1666,6 +1689,19 @@ static int kvm_vm_ioctl_set_memory_alias spin_lock(&kvm->mmu_lock); p = &kvm->arch.aliases[alias->slot]; + + /* FIXME: either disallow shrinking alias slots or disable +* size changes as done with memslots +*/ + if (!alias->memory_size) { + r = -EBUSY; + if (kvm_root_gfn_in_range(kvm, p->base_gfn, + p->base_gfn + p->npages - 1, + false)) + goto out_unlock; + } + + p->base_gfn = alias->guest_phys_addr >> PAGE_SHIFT; p->npages = alias->memory_size >> PAGE_SHIFT; p->target_gfn = alias->target_phys_addr >> PAGE_SHIFT; @@ -1682,6 +1718,9 @@ static int kvm_vm_ioctl_set_memory_alias return 0; +out_unlock: + spin_unlock(&kvm->mmu_lock); + up_write(&kvm->slots_lock); out: return r; } @@ -4552,6 +4591,15 @@ void kvm_arch_flush_shadow(struct kvm *k kvm_mmu_zap_all(kvm); } +int kvm_arch_can_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) +{ + int ret; + + ret = kvm_root_gfn_in_range(kvm, slot->base_gfn, + slot->base_gfn + slot->npages - 1, true); + return !ret; +} + int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu) { return vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE Index: kvm-pending/include/linux/kvm_host.h === --- kvm-pending.orig/include/linux/kvm_host.h +++ kvm-pending/include/linux/kvm_host.h @@ -200,6 +200,7 @@ int kvm_arch_set_memory_region(struct kv struct kvm_memory_slot old, int user_alloc); void kvm_arch_flush_shadow(struct kvm *kvm); +int kvm_arch_can_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot); gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn); struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn); unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn); Index: kvm-pending/virt/kvm/kvm_main.c === --- kvm-pending.orig/virt/kvm/kvm_main.c +++ kvm-pending/virt/kvm/kvm_main.c @@ -1061,6 +1061,13 @@ static int kvm_vm_release(struct inode * return 0; } +#ifndef __KVM_HAVE_ARCH_CAN_FREE_MEMSLOT +int kvm_arch_can_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) +{ + return 1; +} +#endif + /* * Allocate some memory and give it an address in the guest physical address * space. @@ -1179,8 +1186,13 @@ int __kvm_set_memory_region(struct kvm * } #endif /* not defined CONFIG_S390 */ - if (!npages) + if (!npages) { kvm_arch_flush_shadow(kvm); + if (!kvm_arch_can_free_memslot(kvm, memslot)) { + r = -EBUSY; + goto out_free; + } + } spin_lock(&kvm->mmu_lock); if (mem->slot >= kvm->nmemslots) Index: kvm-pending/arch/x86/include/asm/kvm.h === --- kvm-pending.orig/arch/x86/include/asm/kvm.h +++ kvm-pending/arch/x86/include/asm/kvm.h @@ -18,6 +18,8 @@ #define __KVM_HAVE_GUEST_DEBUG #define __KVM_HAVE_MSIX +#define __KVM_HAVE_ARCH_CAN_FREE_MEMSLOT + /* Architectural interrupt line count. */ #define KVM_NR_INTERRUPTS 256 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the
[patch 3/4] KVM: use smp_send_reschedule in kvm_vcpu_kick
KVM uses a function call IPI to cause the exit of a guest running on a physical cpu. For virtual interrupt notification there is no need to wait on IPI receival, or to execute any function. This is exactly what the reschedule IPI does, without the overhead of function IPI. So use it instead of smp_call_function_single in kvm_vcpu_kick. Also change the "guest_mode" variable to a bit in vcpu->requests, and use that to collapse multiple IPI's that would be issued between the first one and zeroing of guest mode. This allows kvm_vcpu_kick to called with interrupts disabled. Signed-off-by: Marcelo Tosatti Index: kvm-pending/arch/ia64/kernel/irq_ia64.c === --- kvm-pending.orig/arch/ia64/kernel/irq_ia64.c +++ kvm-pending/arch/ia64/kernel/irq_ia64.c @@ -610,6 +610,9 @@ static struct irqaction ipi_irqaction = .name = "IPI" }; +/* + * KVM uses this interrupt to force a cpu out of guest mode + */ static struct irqaction resched_irqaction = { .handler = dummy_handler, .flags =IRQF_DISABLED, Index: kvm-pending/arch/ia64/kvm/kvm-ia64.c === --- kvm-pending.orig/arch/ia64/kvm/kvm-ia64.c +++ kvm-pending/arch/ia64/kvm/kvm-ia64.c @@ -668,7 +668,7 @@ again: host_ctx = kvm_get_host_context(vcpu); guest_ctx = kvm_get_guest_context(vcpu); - vcpu->guest_mode = 1; + clear_bit(KVM_REQ_KICK, &vcpu->requests); r = kvm_vcpu_pre_transition(vcpu); if (r < 0) @@ -685,7 +685,7 @@ again: kvm_vcpu_post_transition(vcpu); vcpu->arch.launched = 1; - vcpu->guest_mode = 0; + set_bit(KVM_REQ_KICK, &vcpu->requests); local_irq_enable(); /* @@ -1879,24 +1879,18 @@ void kvm_arch_hardware_unsetup(void) { } -static void vcpu_kick_intr(void *info) -{ -#ifdef DEBUG - struct kvm_vcpu *vcpu = (struct kvm_vcpu *)info; - printk(KERN_DEBUG"vcpu_kick_intr %p \n", vcpu); -#endif -} - void kvm_vcpu_kick(struct kvm_vcpu *vcpu) { - int ipi_pcpu = vcpu->cpu; - int cpu = get_cpu(); + int me; + int cpu = vcpu->cpu; if (waitqueue_active(&vcpu->wq)) wake_up_interruptible(&vcpu->wq); - if (vcpu->guest_mode && cpu != ipi_pcpu) - smp_call_function_single(ipi_pcpu, vcpu_kick_intr, vcpu, 0); + me = get_cpu(); + if (cpu != me && (unsigned) cpu < nr_cpu_ids && cpu_online(cpu)) + if (!test_and_set_bit(KVM_REQ_KICK, &vcpu->requests)) + smp_send_reschedule(cpu); put_cpu(); } Index: kvm-pending/arch/x86/kernel/smp.c === --- kvm-pending.orig/arch/x86/kernel/smp.c +++ kvm-pending/arch/x86/kernel/smp.c @@ -172,6 +172,9 @@ void smp_reschedule_interrupt(struct pt_ { ack_APIC_irq(); inc_irq_stat(irq_resched_count); + /* +* KVM uses this interrupt to force a cpu out of guest mode +*/ } void smp_call_function_interrupt(struct pt_regs *regs) Index: kvm-pending/arch/x86/kvm/x86.c === --- kvm-pending.orig/arch/x86/kvm/x86.c +++ kvm-pending/arch/x86/kvm/x86.c @@ -3259,6 +3259,9 @@ static int vcpu_enter_guest(struct kvm_v local_irq_disable(); + clear_bit(KVM_REQ_KICK, &vcpu->requests); + smp_mb__after_clear_bit(); + if (vcpu->requests || need_resched() || signal_pending(current)) { local_irq_enable(); preempt_enable(); @@ -3266,13 +3269,6 @@ static int vcpu_enter_guest(struct kvm_v goto out; } - vcpu->guest_mode = 1; - /* -* Make sure that guest_mode assignment won't happen after -* testing the pending IRQ vector bitmap. -*/ - smp_wmb(); - if (vcpu->arch.exception.pending) __queue_exception(vcpu); else @@ -3317,7 +3313,7 @@ static int vcpu_enter_guest(struct kvm_v set_debugreg(vcpu->arch.host_dr6, 6); set_debugreg(vcpu->arch.host_dr7, 7); - vcpu->guest_mode = 0; + set_bit(KVM_REQ_KICK, &vcpu->requests); local_irq_enable(); ++vcpu->stat.exits; @@ -4611,30 +4607,20 @@ int kvm_arch_vcpu_runnable(struct kvm_vc || vcpu->arch.nmi_pending; } -static void vcpu_kick_intr(void *info) -{ -#ifdef DEBUG - struct kvm_vcpu *vcpu = (struct kvm_vcpu *)info; - printk(KERN_DEBUG "vcpu_kick_intr %p \n", vcpu); -#endif -} - void kvm_vcpu_kick(struct kvm_vcpu *vcpu) { - int ipi_pcpu = vcpu->cpu; - int cpu; + int me; + int cpu = vcpu->cpu; if (waitqueue_active(&vcpu->wq)) { wake_up_interruptible(&vcpu->wq); ++vcpu->stat.halt_wakeup; } - /* -* We may be called synchronously with irqs disabled in guest m
[patch 0/4] smp_send_reschedule / assigned dev host intx race v2
Addressing comments. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 2/4] kvm-kmod: smp_send_reschedule compat
smp_send_reschedule was exported (via smp_ops) in v2.6.24. Create a compat function which schedules the IPI to keventd context, in case interrupts are disabled, for kernels < 2.6.24. Signed-off-by: Marcelo Tosatti Index: kvm-kmod/ia64/hack-module.awk === --- kvm-kmod.orig/ia64/hack-module.awk +++ kvm-kmod/ia64/hack-module.awk @@ -1,4 +1,4 @@ -BEGIN { split("INIT_WORK on_each_cpu smp_call_function " \ +BEGIN { split("INIT_WORK on_each_cpu smp_call_function smp_send_reschedule " \ "hrtimer_add_expires_ns hrtimer_get_expires " \ "hrtimer_get_expires_ns hrtimer_start_expires " \ "hrtimer_expires_remaining " \ Index: kvm-kmod/x86/hack-module.awk === --- kvm-kmod.orig/x86/hack-module.awk +++ kvm-kmod/x86/hack-module.awk @@ -1,7 +1,7 @@ BEGIN { split("INIT_WORK desc_struct ldttss_desc64 desc_ptr " \ "hrtimer_add_expires_ns hrtimer_get_expires " \ "hrtimer_get_expires_ns hrtimer_start_expires " \ - "hrtimer_expires_remaining " \ + "hrtimer_expires_remaining smp_send_reschedule " \ "on_each_cpu relay_open request_irq free_irq" , compat_apis); } /^int kvm_init\(/ { anon_inodes = 1 } Index: kvm-kmod/external-module-compat-comm.h === --- kvm-kmod.orig/external-module-compat-comm.h +++ kvm-kmod/external-module-compat-comm.h @@ -224,6 +224,10 @@ int kvm_smp_call_function_mask(cpumask_t #define smp_call_function_mask kvm_smp_call_function_mask +void kvm_smp_send_reschedule(int cpu); + +#define smp_send_reschedule kvm_smp_send_reschedule + #endif /* empty_zero_page isn't exported in all kernels */ Index: kvm-kmod/external-module-compat.c === --- kvm-kmod.orig/external-module-compat.c +++ kvm-kmod/external-module-compat.c @@ -221,6 +221,58 @@ out: return 0; } +#include + +static void vcpu_kick_intr(void *info) +{ +} + +struct kvm_kick { + int cpu; + struct work_struct work; +}; + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,20) +static void kvm_do_smp_call_function(void *data) +{ + int me; + struct kvm_kick *kvm_kick = data; +#else +static void kvm_do_smp_call_function(struct work_struct *work) +{ + int me; + struct kvm_kick *kvm_kick = container_of(work, struct kvm_kick, work); +#endif + me = get_cpu(); + + if (kvm_kick->cpu != me) + smp_call_function_single(kvm_kick->cpu, vcpu_kick_intr, +NULL, 0); + kfree(kvm_kick); + put_cpu(); +} + +void kvm_queue_smp_call_function(int cpu) +{ + struct kvm_kick *kvm_kick = kmalloc(sizeof(struct kvm_kick), GFP_ATOMIC); + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,20) + INIT_WORK(&kvm_kick->work, kvm_do_smp_call_function, kvm_kick); +#else + INIT_WORK(&kvm_kick->work, kvm_do_smp_call_function); +#endif + + schedule_work(&kvm_kick->work); +} + +void kvm_smp_send_reschedule(int cpu) +{ + if (irqs_disabled()) { + kvm_queue_smp_call_function(cpu); + return; + } + smp_call_function_single(cpu, vcpu_kick_intr, NULL, 0); +} #endif /* manually export hrtimer_init/start/cancel */ -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 4/4] KVM: x86: disallow changing a slots size
Support to shrinking aliases complicates kernel code unnecessarily, while userspace can do the same with two operations, delete an alias, and create a new alias. Signed-off-by: Marcelo Tosatti Index: kvm-pending/arch/x86/kvm/x86.c === --- kvm-pending.orig/arch/x86/kvm/x86.c +++ kvm-pending/arch/x86/kvm/x86.c @@ -1669,6 +1669,7 @@ static int kvm_vm_ioctl_set_memory_alias { int r, n; struct kvm_mem_alias *p; + unsigned long npages; r = -EINVAL; /* General sanity checks */ @@ -1690,9 +1691,12 @@ static int kvm_vm_ioctl_set_memory_alias p = &kvm->arch.aliases[alias->slot]; - /* FIXME: either disallow shrinking alias slots or disable -* size changes as done with memslots -*/ + /* Disallow changing an alias slot's size. */ + npages = alias->memory_size >> PAGE_SHIFT; + r = -EINVAL; + if (npages && p->npages && npages != p->npages) + goto out_unlock; + if (!alias->memory_size) { r = -EBUSY; if (kvm_root_gfn_in_range(kvm, p->base_gfn, -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 0/4] set_memory_region locking fixes / cr3 vs removal of memslots v2
Addressing comments. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 4/4] KVM: protect assigned dev workqueue, int handler and irq acker
kvm_assigned_dev_ack_irq is vulnerable to a race condition with the interrupt handler function. It does: if (dev->host_irq_disabled) { enable_irq(dev->host_irq); dev->host_irq_disabled = false; } If an interrupt triggers before the host->dev_irq_disabled assignment, it will disable the interrupt and set dev->host_irq_disabled to true. On return to kvm_assigned_dev_ack_irq, dev->host_irq_disabled is set to false, and the next kvm_assigned_dev_ack_irq call will fail to reenable it. Other than that, having the interrupt handler and work handlers run in parallel sounds like asking for trouble (could not spot any obvious problem, but better not have to, its fragile). CC: sheng.y...@intel.com Signed-off-by: Marcelo Tosatti Index: kvm-pending/include/linux/kvm_host.h === --- kvm-pending.orig/include/linux/kvm_host.h +++ kvm-pending/include/linux/kvm_host.h @@ -346,6 +346,7 @@ struct kvm_assigned_dev_kernel { int flags; struct pci_dev *dev; struct kvm *kvm; + spinlock_t assigned_dev_lock; }; struct kvm_irq_mask_notifier { Index: kvm-pending/virt/kvm/kvm_main.c === --- kvm-pending.orig/virt/kvm/kvm_main.c +++ kvm-pending/virt/kvm/kvm_main.c @@ -42,6 +42,7 @@ #include #include #include +#include #include #include @@ -130,6 +131,7 @@ static void kvm_assigned_dev_interrupt_w * finer-grained lock, update this */ mutex_lock(&kvm->lock); + spin_lock_irq(&assigned_dev->assigned_dev_lock); if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) { struct kvm_guest_msix_entry *guest_entries = assigned_dev->guest_msix_entries; @@ -156,18 +158,21 @@ static void kvm_assigned_dev_interrupt_w } } + spin_unlock_irq(&assigned_dev->assigned_dev_lock); mutex_unlock(&assigned_dev->kvm->lock); } static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id) { + unsigned long flags; struct kvm_assigned_dev_kernel *assigned_dev = (struct kvm_assigned_dev_kernel *) dev_id; + spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags); if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) { int index = find_index_from_host_irq(assigned_dev, irq); if (index < 0) - return IRQ_HANDLED; + goto out; assigned_dev->guest_msix_entries[index].flags |= KVM_ASSIGNED_MSIX_PENDING; } @@ -177,6 +182,8 @@ static irqreturn_t kvm_assigned_dev_intr disable_irq_nosync(irq); assigned_dev->host_irq_disabled = true; +out: + spin_unlock_irqrestore(&assigned_dev->assigned_dev_lock, flags); return IRQ_HANDLED; } @@ -184,6 +191,7 @@ static irqreturn_t kvm_assigned_dev_intr static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian) { struct kvm_assigned_dev_kernel *dev; + unsigned long flags; if (kian->gsi == -1) return; @@ -196,10 +204,12 @@ static void kvm_assigned_dev_ack_irq(str /* The guest irq may be shared so this ack may be * from another device. */ + spin_lock_irqsave(&dev->assigned_dev_lock, flags); if (dev->host_irq_disabled) { enable_irq(dev->host_irq); dev->host_irq_disabled = false; } + spin_unlock_irqrestore(&dev->assigned_dev_lock, flags); } static void deassign_guest_irq(struct kvm *kvm, @@ -615,6 +625,7 @@ static int kvm_vm_ioctl_assign_device(st match->host_devfn = assigned_dev->devfn; match->flags = assigned_dev->flags; match->dev = dev; + spin_lock_init(&match->assigned_dev_lock); match->irq_source_id = -1; match->kvm = kvm; match->ack_notifier.irq_acked = kvm_assigned_dev_ack_irq; -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 1/4] kvm-kmod: nr_cpu_ids compat
Signed-off-by: Marcelo Tosatti Index: kvm-kmod/external-module-compat-comm.h === --- kvm-kmod.orig/external-module-compat-comm.h +++ kvm-kmod/external-module-compat-comm.h @@ -116,6 +116,10 @@ int kvm_smp_call_function_single(int cpu #endif +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,28) +#define nr_cpu_ids NR_CPUS +#endif + #include #ifndef KVM_MINOR #define KVM_MINOR 232 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 1/4] KVM: MMU: protect kvm_mmu_change_mmu_pages with mmu_lock
kvm_handle_hva, called by MMU notifiers, manipulates mmu data only with the protection of mmu_lock. Update kvm_mmu_change_mmu_pages callers to take mmu_lock, thus protecting against kvm_handle_hva. CC: Andrea Arcangeli Signed-off-by: Marcelo Tosatti Index: kvm-pending/arch/x86/kvm/mmu.c === --- kvm-pending.orig/arch/x86/kvm/mmu.c +++ kvm-pending/arch/x86/kvm/mmu.c @@ -2723,7 +2723,6 @@ void kvm_mmu_slot_remove_write_access(st { struct kvm_mmu_page *sp; - spin_lock(&kvm->mmu_lock); list_for_each_entry(sp, &kvm->arch.active_mmu_pages, link) { int i; u64 *pt; @@ -2738,7 +2737,6 @@ void kvm_mmu_slot_remove_write_access(st pt[i] &= ~PT_WRITABLE_MASK; } kvm_flush_remote_tlbs(kvm); - spin_unlock(&kvm->mmu_lock); } void kvm_mmu_zap_all(struct kvm *kvm) Index: kvm-pending/arch/x86/kvm/x86.c === --- kvm-pending.orig/arch/x86/kvm/x86.c +++ kvm-pending/arch/x86/kvm/x86.c @@ -1607,10 +1607,12 @@ static int kvm_vm_ioctl_set_nr_mmu_pages return -EINVAL; down_write(&kvm->slots_lock); + spin_lock(&kvm->mmu_lock); kvm_mmu_change_mmu_pages(kvm, kvm_nr_mmu_pages); kvm->arch.n_requested_mmu_pages = kvm_nr_mmu_pages; + spin_unlock(&kvm->mmu_lock); up_write(&kvm->slots_lock); return 0; } @@ -1786,7 +1788,9 @@ int kvm_vm_ioctl_get_dirty_log(struct kv /* If nothing is dirty, don't bother messing with page tables. */ if (is_dirty) { + spin_lock(&kvm->mmu_lock); kvm_mmu_slot_remove_write_access(kvm, log->slot); + spin_unlock(&kvm->mmu_lock); kvm_flush_remote_tlbs(kvm); memslot = &kvm->memslots[log->slot]; n = ALIGN(memslot->npages, BITS_PER_LONG) / 8; @@ -4530,12 +4534,14 @@ int kvm_arch_set_memory_region(struct kv } } + spin_lock(&kvm->mmu_lock); if (!kvm->arch.n_requested_mmu_pages) { unsigned int nr_mmu_pages = kvm_mmu_calculate_mmu_pages(kvm); kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages); } kvm_mmu_slot_remove_write_access(kvm, mem->slot); + spin_unlock(&kvm->mmu_lock); kvm_flush_remote_tlbs(kvm); return 0; -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 2/4] KVM: take mmu_lock when updating a deleted slot
kvm_handle_hva relies on mmu_lock protection to safely access the memslot structures. Signed-off-by: Marcelo Tosatti Index: kvm-pending/virt/kvm/kvm_main.c === --- kvm-pending.orig/virt/kvm/kvm_main.c +++ kvm-pending/virt/kvm/kvm_main.c @@ -1199,8 +1199,10 @@ int __kvm_set_memory_region(struct kvm * kvm_free_physmem_slot(&old, npages ? &new : NULL); /* Slot deletion case: we have to update the current slot */ + spin_lock(&kvm->mmu_lock); if (!npages) *memslot = old; + spin_unlock(&kvm->mmu_lock); #ifdef CONFIG_DMAR /* map the pages in iommu page table */ r = kvm_iommu_map_pages(kvm, base_gfn, npages); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
On Thursday 07 May 2009, Gregory Haskins wrote: > What I am not clear on is how you would know to flag the address to > begin with. pci_iomap could look at the bus device that the PCI function sits on. If it detects a PCI bridge that has a certain property (config space setting, vendor/device ID, ...), it assumes that the device itself will be emulated and it should set the address flag for IO_COND. This implies that all pass-through devices need to be on a different PCI bridge from the emulated devices, which should be fairly straightforward to enforce. Arnd <>< -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
* Gregory Haskins (gregory.hask...@gmail.com) wrote: > After posting my numbers today, what I *can* tell you definitively that > its significantly slower to VMEXIT via MMIO. I guess I do not really > know the reason for sure. :) there's certainly more work, including insn decoding -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
On Thursday 07 May 2009, Arnd Bergmann wrote: > An easy way to deal with the pass-through case might be to actually use > __raw_writel there. In guest-to-guest communication, the two sides are > known to have the same endianess (I assume) and you can still add the > appropriate smp_mb() and such into the code. Ok, that was nonsense. I thought you meant pass-through to a memory range on the host that is potentially shared with other processes or guests. For pass-through to a real device, it obviously would not work. Arnd <>< -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
On Thursday 07 May 2009, Gregory Haskins wrote: > Arnd Bergmann wrote: > > An mmio that goes through a PF is a bug, it's certainly broken on > > a number of platforms, so performance should not be an issue there. > > > > This may be my own ignorance, but I thought a VMEXIT of type "PF" was > how MMIO worked in VT/SVM. You are right that all MMIOs (and PIO on most non-x86 architectures) are handled this way in the end. What I meant was that an MMIO that traps because of a simple pointer dereference as in __raw_writel is a bug, while any actual writel() call could be diverted to do an hcall and therefore not cause a PF once the infrastructure is there. > I guess the problem that was later pointed out is that we cannot discern > which devices might be pass-through and therefore should not be > revectored through a HC. But I am even less knowledgeable about how > pass-through works than I am about the MMIO traps, so I might be > completely off here. An easy way to deal with the pass-through case might be to actually use __raw_writel there. In guest-to-guest communication, the two sides are known to have the same endianess (I assume) and you can still add the appropriate smp_mb() and such into the code. Arnd <>< -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote: > * Gregory Haskins (gregory.hask...@gmail.com) wrote: > >> What I am not clear on is how you would know to flag the address to >> begin with. >> > > That's why I mentioned pv_io_ops->iomap() earlier. Something I'd expect > would get called on IORESOURCE_PVIO type. Yeah, this wasn't clear at the time, but I totally get what you meant now in retrospect. -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 0/3] generic hypercall support
Arnd Bergmann wrote: > On Thursday 07 May 2009, Gregory Haskins wrote: > >> I guess technically mmio can just be a simple access of the page which >> would be problematic to trap locally without a PF. However it seems >> that most mmio always passes through a ioread()/iowrite() call so this >> is perhaps the hook point. If we set the stake in the ground that mmios >> that go through some other mechanism like PFs can just hit the "slow >> path" are an acceptable casualty, I think we can make that work. >> >> Thoughts? >> > > An mmio that goes through a PF is a bug, it's certainly broken on > a number of platforms, so performance should not be an issue there. > This may be my own ignorance, but I thought a VMEXIT of type "PF" was how MMIO worked in VT/SVM. I didn't mean to imply that the guest nor the host took a traditional PF exception in their respective IDT, if that is what you thought I meant here. Rather, the mmio region is unmapped in the guest MMU, access causes a VMEXIT to host-side KVM of type PF, and the host side code then consults the guest page-table to see if its an MMIO or not. I could very well be mistaken as I have only a cursory understanding of what happens in KVM today with this path. After posting my numbers today, what I *can* tell you definitively that its significantly slower to VMEXIT via MMIO. I guess I do not really know the reason for sure. :) > Note that are four commonly used interface classes for PIO/MMIO: > > 1. readl/writel: little-endian MMIO > 2. inl/outl: little-endian PIO > 3. ioread32/iowrite32: converged little-endian PIO/MMIO > 4. __raw_readl/__raw_writel: native-endian MMIO without checks > > You don't need to worry about the __raw_* stuff, as this should never > be used in device drivers. > > As a simplification, you could mandate that all drivers that want to > use this get converted to the ioread/iowrite class of interfaces and > leave the others slow. > I guess the problem that was later pointed out is that we cannot discern which devices might be pass-through and therefore should not be revectored through a HC. But I am even less knowledgeable about how pass-through works than I am about the MMIO traps, so I might be completely off here. In any case, thank you kindly for the suggestions. Regards, -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Gregory Haskins wrote: >>> Oh yes. But don't call it dynhc - like Chris says it's the wrong >>> semantic. >>> >>> Since we want to connect it to an eventfd, call it HC_NOTIFY or >>> HC_EVENT or something along these lines. You won't be able to pass >>> any data, but that's fine. Registers are saved to memory anyway. >>> >> Ok, but how would you access the registers since you would presumably >> only be getting a waitq::func callback on the eventfd. Or were you >> saying that more data, if required, is saved in a side-band memory >> location? I can see the latter working. > > Yeah. You basically have that side-band in vbus shmem (or the virtio > ring). Ok, got it. > >> I can't wrap my head around >> the former. >> > > I only meant that registers aren't faster than memory, since they are > just another memory location. > > In fact registers are accessed through a function call (not that that > takes any time these days). > > >>> Just to make sure we have everything plumbed down, here's how I see >>> things working out (using qemu and virtio, use sed to taste): >>> >>> 1. qemu starts up, sets up the VM >>> 2. qemu creates virtio-net-server >>> 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx >>> ring, one set for rx ring) >>> 4. qemu connects the six eventfd to the data-available, >>> data-not-available, and kick ports of virtio-net-server >>> 5. the guest starts up and configures virtio-net in pci pin mode >>> 6. qemu notices and decides it will manage interrupts in user space >>> since this is complicated (shared level triggered interrupts) >>> 7. the guest OS boots, loads device driver >>> 8. device driver switches virtio-net to msix mode >>> 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the >>> notify fds as notifyfd >>> 10. look ma, no hands. >>> >>> Under the hood, the following takes place. >>> >>> kvm wires the irqfds to schedule a work item which fires the >>> interrupt. One day the kvm developers get their act together and >>> change it to inject the interrupt directly when the irqfd is signalled >>> (which could be from the net softirq or somewhere similarly nasty). >>> >>> virtio-net-server wires notifyfd according to its liking. It may >>> schedule a thread, or it may execute directly. >>> >>> And they all lived happily ever after. >>> >> >> Ack. I hope when its all said and done I can convince you that the >> framework to code up those virtio backends in the kernel is vbus ;) > > If vbus doesn't bring significant performance advantages, I'll prefer > virtio because of existing investment. Just to clarify: vbus is just the container/framework for the in-kernel models. You can implement and deploy virtio devices inside the container (tho I haven't had a chance to sit down and implement one yet). Note that I did publish a virtio transport in the last few series to demonstrate how that might work, so its just ripe for the picking if someone is so inclined. So really the question is whether you implement the in-kernel virtio backend in vbus, in some other framework, or just do it standalone. -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 0/3] generic hypercall support
* Gregory Haskins (gregory.hask...@gmail.com) wrote: > What I am not clear on is how you would know to flag the address to > begin with. That's why I mentioned pv_io_ops->iomap() earlier. Something I'd expect would get called on IORESOURCE_PVIO type. This isn't really transparent though (only virtio devices basically), kind of like you're saying below. > Here's a thought: "PV" drivers can flag the IO (e.g. virtio-pci knows it > would never be a real device). This means we route any io requests from > virtio-pci though pv_io_ops->mmio(), but not unflagged addresses. This > is not as slick as boosting *everyones* mmio speed as Avi's original > idea would have, but it is perhaps a good tradeoff between the entirely > new namespace created by my original dynhc() proposal and leaving them > all PF based. > > This way, its just like using my dynhc() proposal except the mmio-addr > is the substitute address-token (instead of the dynhc-vector). > Additionally, if you do not PV the kernel the IO_COND/pv_io_op is > ignored and it just slow-paths through the PF as it does today. Dynhc() > would be dependent on pv_ops. > > Thoughts? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: Oh yes. But don't call it dynhc - like Chris says it's the wrong semantic. Since we want to connect it to an eventfd, call it HC_NOTIFY or HC_EVENT or something along these lines. You won't be able to pass any data, but that's fine. Registers are saved to memory anyway. Ok, but how would you access the registers since you would presumably only be getting a waitq::func callback on the eventfd. Or were you saying that more data, if required, is saved in a side-band memory location? I can see the latter working. Yeah. You basically have that side-band in vbus shmem (or the virtio ring). I can't wrap my head around the former. I only meant that registers aren't faster than memory, since they are just another memory location. In fact registers are accessed through a function call (not that that takes any time these days). Just to make sure we have everything plumbed down, here's how I see things working out (using qemu and virtio, use sed to taste): 1. qemu starts up, sets up the VM 2. qemu creates virtio-net-server 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx ring, one set for rx ring) 4. qemu connects the six eventfd to the data-available, data-not-available, and kick ports of virtio-net-server 5. the guest starts up and configures virtio-net in pci pin mode 6. qemu notices and decides it will manage interrupts in user space since this is complicated (shared level triggered interrupts) 7. the guest OS boots, loads device driver 8. device driver switches virtio-net to msix mode 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the notify fds as notifyfd 10. look ma, no hands. Under the hood, the following takes place. kvm wires the irqfds to schedule a work item which fires the interrupt. One day the kvm developers get their act together and change it to inject the interrupt directly when the irqfd is signalled (which could be from the net softirq or somewhere similarly nasty). virtio-net-server wires notifyfd according to its liking. It may schedule a thread, or it may execute directly. And they all lived happily ever after. Ack. I hope when its all said and done I can convince you that the framework to code up those virtio backends in the kernel is vbus ;) If vbus doesn't bring significant performance advantages, I'll prefer virtio because of existing investment. But even if not, this should provide enough plumbing that we can all coexist together peacefully. Yes, vbus and virtio can compete on their merits without bias from some maintainer getting in the way. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Gregory Haskins wrote: >>> Don't - it's broken. It will also catch device assignment mmio and >>> hypercall them. >>> >>> >> Ah. Crap. >> >> Would you be conducive if I continue along with the dynhc() approach >> then? >> > > Oh yes. But don't call it dynhc - like Chris says it's the wrong > semantic. > > Since we want to connect it to an eventfd, call it HC_NOTIFY or > HC_EVENT or something along these lines. You won't be able to pass > any data, but that's fine. Registers are saved to memory anyway. Ok, but how would you access the registers since you would presumably only be getting a waitq::func callback on the eventfd. Or were you saying that more data, if required, is saved in a side-band memory location? I can see the latter working. I can't wrap my head around the former. > > And btw, given that eventfd and the underlying infrastructure are so > flexible, it's probably better to go back to your original "irqfd gets > fd from userspace" just to be consistent everywhere. > > (no, I'm not deliberately making you rewrite that patch again and > again... it's going to be a key piece of infrastructure so I want to > get it right) Ok, np. Actually now that Davide showed me the waitq::func trick, the fd technically doesn't even need to be an eventfd per se. We can just plain-old "fget()" it and attach via the f_ops->poll() as I do in v5. Ill submit this later today. > > > Just to make sure we have everything plumbed down, here's how I see > things working out (using qemu and virtio, use sed to taste): > > 1. qemu starts up, sets up the VM > 2. qemu creates virtio-net-server > 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx > ring, one set for rx ring) > 4. qemu connects the six eventfd to the data-available, > data-not-available, and kick ports of virtio-net-server > 5. the guest starts up and configures virtio-net in pci pin mode > 6. qemu notices and decides it will manage interrupts in user space > since this is complicated (shared level triggered interrupts) > 7. the guest OS boots, loads device driver > 8. device driver switches virtio-net to msix mode > 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the > notify fds as notifyfd > 10. look ma, no hands. > > Under the hood, the following takes place. > > kvm wires the irqfds to schedule a work item which fires the > interrupt. One day the kvm developers get their act together and > change it to inject the interrupt directly when the irqfd is signalled > (which could be from the net softirq or somewhere similarly nasty). > > virtio-net-server wires notifyfd according to its liking. It may > schedule a thread, or it may execute directly. > > And they all lived happily ever after. Ack. I hope when its all said and done I can convince you that the framework to code up those virtio backends in the kernel is vbus ;) But even if not, this should provide enough plumbing that we can all coexist together peacefully. Thanks, -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 0/3] generic hypercall support
On Thursday 07 May 2009, Gregory Haskins wrote: > I guess technically mmio can just be a simple access of the page which > would be problematic to trap locally without a PF. However it seems > that most mmio always passes through a ioread()/iowrite() call so this > is perhaps the hook point. If we set the stake in the ground that mmios > that go through some other mechanism like PFs can just hit the "slow > path" are an acceptable casualty, I think we can make that work. > > Thoughts? An mmio that goes through a PF is a bug, it's certainly broken on a number of platforms, so performance should not be an issue there. Note that are four commonly used interface classes for PIO/MMIO: 1. readl/writel: little-endian MMIO 2. inl/outl: little-endian PIO 3. ioread32/iowrite32: converged little-endian PIO/MMIO 4. __raw_readl/__raw_writel: native-endian MMIO without checks You don't need to worry about the __raw_* stuff, as this should never be used in device drivers. As a simplification, you could mandate that all drivers that want to use this get converted to the ioread/iowrite class of interfaces and leave the others slow. Arnd <>< -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: I think we just past the "too complicated" threshold. And the "can't spel" threshold in the same sentence. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: Don't - it's broken. It will also catch device assignment mmio and hypercall them. Ah. Crap. Would you be conducive if I continue along with the dynhc() approach then? Oh yes. But don't call it dynhc - like Chris says it's the wrong semantic. Since we want to connect it to an eventfd, call it HC_NOTIFY or HC_EVENT or something along these lines. You won't be able to pass any data, but that's fine. Registers are saved to memory anyway. And btw, given that eventfd and the underlying infrastructure are so flexible, it's probably better to go back to your original "irqfd gets fd from userspace" just to be consistent everywhere. (no, I'm not deliberately making you rewrite that patch again and again... it's going to be a key piece of infrastructure so I want to get it right) Just to make sure we have everything plumbed down, here's how I see things working out (using qemu and virtio, use sed to taste): 1. qemu starts up, sets up the VM 2. qemu creates virtio-net-server 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx ring, one set for rx ring) 4. qemu connects the six eventfd to the data-available, data-not-available, and kick ports of virtio-net-server 5. the guest starts up and configures virtio-net in pci pin mode 6. qemu notices and decides it will manage interrupts in user space since this is complicated (shared level triggered interrupts) 7. the guest OS boots, loads device driver 8. device driver switches virtio-net to msix mode 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the notify fds as notifyfd 10. look ma, no hands. Under the hood, the following takes place. kvm wires the irqfds to schedule a work item which fires the interrupt. One day the kvm developers get their act together and change it to inject the interrupt directly when the irqfd is signalled (which could be from the net softirq or somewhere similarly nasty). virtio-net-server wires notifyfd according to its liking. It may schedule a thread, or it may execute directly. And they all lived happily ever after. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote: > * Gregory Haskins (ghask...@novell.com) wrote: > >> Chris Wright wrote: >> >>> * Avi Kivity (a...@redhat.com) wrote: >>> Gregory Haskins wrote: > Cool, I will code this up and submit it. While Im at it, Ill run it > through the "nullio" ringer, too. ;) It would be cool to see the > pv-mmio hit that 2.07us number. I can't think of any reason why this > will not be the case. > > Don't - it's broken. It will also catch device assignment mmio and hypercall them. >>> Not necessarily. It just needs to be creative w/ IO_COND >>> >> Hi Chris, >>Could you elaborate? How would you know which pages to hypercall and >> which to let PF? >> > > Was just thinking of some ugly mangling of the addr (I'm not entirely > sure what would work best). > Right, I get the part about flagging the address and then keying off that flag in IO_COND (like we do for PIO vs MMIO). What I am not clear on is how you would know to flag the address to begin with. Here's a thought: "PV" drivers can flag the IO (e.g. virtio-pci knows it would never be a real device). This means we route any io requests from virtio-pci though pv_io_ops->mmio(), but not unflagged addresses. This is not as slick as boosting *everyones* mmio speed as Avi's original idea would have, but it is perhaps a good tradeoff between the entirely new namespace created by my original dynhc() proposal and leaving them all PF based. This way, its just like using my dynhc() proposal except the mmio-addr is the substitute address-token (instead of the dynhc-vector). Additionally, if you do not PV the kernel the IO_COND/pv_io_op is ignored and it just slow-paths through the PF as it does today. Dynhc() would be dependent on pv_ops. Thoughts? -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote: * Gregory Haskins (ghask...@novell.com) wrote: Chris Wright wrote: * Avi Kivity (a...@redhat.com) wrote: Gregory Haskins wrote: Cool, I will code this up and submit it. While Im at it, Ill run it through the "nullio" ringer, too. ;) It would be cool to see the pv-mmio hit that 2.07us number. I can't think of any reason why this will not be the case. Don't - it's broken. It will also catch device assignment mmio and hypercall them. Not necessarily. It just needs to be creative w/ IO_COND Hi Chris, Could you elaborate? How would you know which pages to hypercall and which to let PF? Was just thinking of some ugly mangling of the addr (I'm not entirely sure what would work best). I think we just past the "too complicated" threshold. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
* Gregory Haskins (ghask...@novell.com) wrote: > Chris Wright wrote: > > * Avi Kivity (a...@redhat.com) wrote: > >> Gregory Haskins wrote: > >>> Cool, I will code this up and submit it. While Im at it, Ill run it > >>> through the "nullio" ringer, too. ;) It would be cool to see the > >>> pv-mmio hit that 2.07us number. I can't think of any reason why this > >>> will not be the case. > >>> > >> Don't - it's broken. It will also catch device assignment mmio and > >> hypercall them. > > > > Not necessarily. It just needs to be creative w/ IO_COND > > Hi Chris, >Could you elaborate? How would you know which pages to hypercall and > which to let PF? Was just thinking of some ugly mangling of the addr (I'm not entirely sure what would work best). -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: bridges
Ross Boylan wrote: On Thu, 2009-05-07 at 11:13 -0600, Cam Macdonell wrote: Ross Boylan wrote: I'm trying to understand bridging with KVM, but am still puzzled. I think that the recommended bridging with TAP means that packets from the VM will end up going out the host card attached to the default gateway. But it looks to me as if their IP address is unchanged, which means replies will never reach me. Is that correct? Do I need to NAT the packets, or is something already doing that? Hi Ross, This is the place to start http://www.linux-kvm.org/page/Networking. I saw that; it gives some recipes but I wasn't sure what their effect was. You want a public bridge. I'm not sure what "their" and "me" mean in your email. In short, with bridging each VM has its own IP and that VM can be accessed directly from the network. "their" = the VM. "me" = my host machine. So if the VM's are running on their own subnet, VMs do not run on their own subnet with bridged networking. e.g., 10.0.2.* (I've been assuming the subnet with TAP is like the one with the User Mode Network stack in 3.7.3 of http://www.nongnu.org/qemu/qemu-doc.html) and my host machine is on another net, e.g., 10.0.8.* then I think the packet will go out with an IP of 10.0.2.2 (say). When some other machine tries to reply to 10.0.2.2, the packet gets lost because the outside network thinks 10.0.2.* is not for it. At least that's my concern. If the return IP address on the packet were 10.0.8.44 (supposing that's the IP of my host machine) then the packets could find their way back. Using bridged networking is very different from the user stack. The user stack is extremely limited and slow. My host machine is on an internal network with a 10.* IP. The example might be clearer if one supposed that the VM's were on a 192.168.* network. I am perhaps being influenced by the fact that I don't want to ask for more IP's, so I don't want to configure the VM's to use an IP on our 10.0.8 network. Then you probably want to use a NAT network. A NAT setup puts all the VMs on their own network within the host machine. iptables is necessary to forward the subnet packets out to the world and back. Here is some older documentation, but not much has changed. Look at the first entry under "Advanced Networking". https://help.ubuntu.com/community/KVMFeisty Does the TAP networking setup a whole subnet like the user mode network stack (e.g., running a DHCP server), or is the idea that I would just give the VM an IP on my subnet (10.0.8.*) in this example? No, bridge networking using taps (one tap per VM) and effectively sits all the VMs on the same network your host is on. You would need to get IPs from sysadmin for each VM. If the latter is the case (I'm now suspecting it is) then I think the solution is clear. I just stick the VM's on a private (to my machine) subnet, like 192.168.*, and I do NAT on the packets as they go out. NAT is a very common solution. Use VDE (vde.sourceforget.net) to create a virtual switch on your host for the VMs. dnsmasq can serve dynamic IPs to the VMs on their own subnet that doesn't bother your sysadmin at all. Use iptables to forward and receive packets through your host's NIC. Cam -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote: > * Avi Kivity (a...@redhat.com) wrote: > >> Gregory Haskins wrote: >> >>> Cool, I will code this up and submit it. While Im at it, Ill run it >>> through the "nullio" ringer, too. ;) It would be cool to see the >>> pv-mmio hit that 2.07us number. I can't think of any reason why this >>> will not be the case. >>> >> Don't - it's broken. It will also catch device assignment mmio and >> hypercall them. >> > > Not necessarily. It just needs to be creative w/ IO_COND > Hi Chris, Could you elaborate? How would you know which pages to hypercall and which to let PF? -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 0/3] generic hypercall support
* Avi Kivity (a...@redhat.com) wrote: > Gregory Haskins wrote: >> Cool, I will code this up and submit it. While Im at it, Ill run it >> through the "nullio" ringer, too. ;) It would be cool to see the >> pv-mmio hit that 2.07us number. I can't think of any reason why this >> will not be the case. > > Don't - it's broken. It will also catch device assignment mmio and > hypercall them. Not necessarily. It just needs to be creative w/ IO_COND -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Gregory Haskins wrote: >> Avi Kivity wrote: >> >>> Gregory Haskins wrote: >>> I guess technically mmio can just be a simple access of the page which would be problematic to trap locally without a PF. However it seems that most mmio always passes through a ioread()/iowrite() call so this is perhaps the hook point. If we set the stake in the ground that mmios that go through some other mechanism like PFs can just hit the "slow path" are an acceptable casualty, I think we can make that work. >>> That's my thinking exactly. >>> >> >> Cool, I will code this up and submit it. While Im at it, Ill run it >> through the "nullio" ringer, too. ;) It would be cool to see the >> pv-mmio hit that 2.07us number. I can't think of any reason why this >> will not be the case. >> > > Don't - it's broken. It will also catch device assignment mmio and > hypercall them. > Ah. Crap. Would you be conducive if I continue along with the dynhc() approach then? -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: Avi Kivity wrote: Gregory Haskins wrote: I guess technically mmio can just be a simple access of the page which would be problematic to trap locally without a PF. However it seems that most mmio always passes through a ioread()/iowrite() call so this is perhaps the hook point. If we set the stake in the ground that mmios that go through some other mechanism like PFs can just hit the "slow path" are an acceptable casualty, I think we can make that work. That's my thinking exactly. Cool, I will code this up and submit it. While Im at it, Ill run it through the "nullio" ringer, too. ;) It would be cool to see the pv-mmio hit that 2.07us number. I can't think of any reason why this will not be the case. Don't - it's broken. It will also catch device assignment mmio and hypercall them. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 3/4] KVM: introduce kvm_arch_can_free_memslot, disallow slot deletion if cached cr3
On Thu, May 07, 2009 at 05:16:35PM +0300, Avi Kivity wrote: > mtosa...@redhat.com wrote: >> Disallow the deletion of memory slots (and aliases, for x86 case), if a >> vcpu contains a cr3 that points to such slot/alias. >> > > That allows the guest to induce failures in the host. I don't understand what you mean. What is the problem with returning errors in the ioctl handlers? The guest can cause an overflow in qemu, overwrite the parameters to KVM_GET_MSR_INDEX_LIST in an attempt to read kernel data, and get -E2BIG. Or pick your combination. > Better to triple-fault the guest instead. Sure can additionally triple fault it, but the kernel might attempt to access the non-existant slot which cr3 points to before TRIPLE_FAULT is processed. So you have to avoid that possibility in the first place, thats why the patch modifies the ioctls to fail. >> +int kvm_arch_can_free_memslot(struct kvm *kvm, struct kvm_memory_slot >> *slot) >> +{ >> +return 1; >> +} >> + >> > > In general, instead of stubs in every arch, have x86 say > KVM_HAVE_ARCH_CAN_FREE_MEMSLOT and define the stub in generic code when > that define is not present. Will fix that. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
qemu: kvm: avoid harmless unhandled wrmsr 0xc0010117 messages
Olders kernel which don't contain kvm.git commit 61a6bd672bda3b9468bf5895c1be085c4e481138 display the following message: kvm: 32301: cpu0 unhandled wrmsr: 0xc0010117 data 0 When kvm_arch_load_regs is called. This is confusing in bug reports. Avoid it by checking whether the host advertises the MSR, similarly to how MSR_STAR is handled. Signed-off-by: Marcelo Tosatti diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index 06ef775..5d19705 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -25,6 +25,7 @@ static struct kvm_msr_list *kvm_msr_list; extern unsigned int kvm_shadow_memory; static int kvm_has_msr_star; +static int kvm_has_vm_hsave_pa; static int lm_capable_kernel; @@ -54,10 +55,14 @@ int kvm_arch_qemu_create_context(void) kvm_msr_list = kvm_get_msr_list(kvm_context); if (!kvm_msr_list) return -1; -for (i = 0; i < kvm_msr_list->nmsrs; ++i) +for (i = 0; i < kvm_msr_list->nmsrs; ++i) { if (kvm_msr_list->indices[i] == MSR_STAR) kvm_has_msr_star = 1; - return 0; +if (kvm_msr_list->indices[i] == MSR_VM_HSAVE_PA) +kvm_has_vm_hsave_pa = 1; +} + +return 0; } static void set_msr_entry(struct kvm_msr_entry *entry, uint32_t index, @@ -260,7 +265,8 @@ void kvm_arch_load_regs(CPUState *env) set_msr_entry(&msrs[n++], MSR_IA32_SYSENTER_EIP, env->sysenter_eip); if (kvm_has_msr_star) set_msr_entry(&msrs[n++], MSR_STAR, env->star); -set_msr_entry(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave); +if (kvm_has_vm_hsave_pa) +set_msr_entry(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave); #ifdef TARGET_X86_64 if (lm_capable_kernel) { set_msr_entry(&msrs[n++], MSR_CSTAR, env->cstar); @@ -435,7 +441,8 @@ void kvm_arch_save_regs(CPUState *env) if (kvm_has_msr_star) msrs[n++].index = MSR_STAR; msrs[n++].index = MSR_IA32_TSC; -msrs[n++].index = MSR_VM_HSAVE_PA; +if (kvm_has_vm_hsave_pa) +msrs[n++].index = MSR_VM_HSAVE_PA; #ifdef TARGET_X86_64 if (lm_capable_kernel) { msrs[n++].index = MSR_CSTAR; -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Gregory Haskins wrote: >> I guess technically mmio can just be a simple access of the page which >> would be problematic to trap locally without a PF. However it seems >> that most mmio always passes through a ioread()/iowrite() call so this >> is perhaps the hook point. If we set the stake in the ground that mmios >> that go through some other mechanism like PFs can just hit the "slow >> path" are an acceptable casualty, I think we can make that work. >> > > That's my thinking exactly. Cool, I will code this up and submit it. While Im at it, Ill run it through the "nullio" ringer, too. ;) It would be cool to see the pv-mmio hit that 2.07us number. I can't think of any reason why this will not be the case. > > Note we can cheat further. kvm already has a "coalesced mmio" feature > where side-effect-free mmios are collected in the kernel and passed to > userspace only when some other significant event happens. We could > pass those addresses to the guest and let it queue those writes > itself, avoiding the hypercall completely. > > Though it's probably pointless: if the guest is paravirtualized enough > to have the mmio hypercall, then it shouldn't be using e1000. Yeah...plus at least for my vbus purposes, all my my guest->host transitions are explicitly to cause side-effects, or I wouldn't be doing them in the first place ;) I suspect virtio-pci is exactly the same. I.e. the coalescing has already been done at a higher layer for platforms running "PV" code. Still a cool feature, tho. -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: I guess technically mmio can just be a simple access of the page which would be problematic to trap locally without a PF. However it seems that most mmio always passes through a ioread()/iowrite() call so this is perhaps the hook point. If we set the stake in the ground that mmios that go through some other mechanism like PFs can just hit the "slow path" are an acceptable casualty, I think we can make that work. That's my thinking exactly. Note we can cheat further. kvm already has a "coalesced mmio" feature where side-effect-free mmios are collected in the kernel and passed to userspace only when some other significant event happens. We could pass those addresses to the guest and let it queue those writes itself, avoiding the hypercall completely. Though it's probably pointless: if the guest is paravirtualized enough to have the mmio hypercall, then it shouldn't be using e1000. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Intel-IOMMU, intr-remap: source-id checking
On Wed, 2009-05-06 at 23:16 -0700, Han, Weidong wrote: > @@ -634,6 +694,44 @@ static int ir_parse_ioapic_scope(struct acpi_dmar_header > *header, > " 0x%Lx\n", scope->enumeration_id, > drhd->address); > > + bus = pci_find_bus(drhd->segment, scope->bus); > + path = (struct acpi_dmar_pci_path *)(scope + 1); > + count = (scope->length - > + sizeof(struct acpi_dmar_device_scope)) > + / sizeof(struct acpi_dmar_pci_path); > + > + while (count) { > + if (pdev) > + pci_dev_put(pdev); > + > + if (!bus) > + break; > + > + pdev = pci_get_slot(bus, > + PCI_DEVFN(path->dev, path->fn)); > + if (!pdev) > + break; ir_parse_ioapic_scope() happens very early in the boot. So, I don't think we can do the pci related discovery here. thanks, suresh > + > + path++; > + count--; > + bus = pdev->subordinate; > + } > + > + if (pdev) { /* PCI discoverable IOAPIC*/ > + ir_ioapic[ir_ioapic_num].bus = > + pdev->bus->number; > + ir_ioapic[ir_ioapic_num].devfn = pdev->devfn; > + } else { /* Not PCI discoverable IOAPIC */ > + if (!bus) > + ir_ioapic[ir_ioapic_num].bus = > + scope->bus; > + else > + ir_ioapic[ir_ioapic_num].bus = > + bus->number; > + ir_ioapic[ir_ioapic_num].devfn = > + PCI_DEVFN(path->dev, path->fn); > + } > + > ir_ioapic[ir_ioapic_num].iommu = iommu; > ir_ioapic[ir_ioapic_num].id = scope->enumeration_id; > ir_ioapic_num++; -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Gregory Haskins wrote: >>> What do you think of my mmio hypercall? That will speed up all mmio >>> to be as fast as a hypercall, and then we can use ordinary mmio/pio >>> writes to trigger things. >>> >>> >> I like it! >> >> Bigger question is what kind of work goes into making mmio a pv_op (or >> is this already done)? >> >> > > Looks like it isn't there. But it isn't any different than set_pte - > convert a write into a hypercall. > > I guess technically mmio can just be a simple access of the page which would be problematic to trap locally without a PF. However it seems that most mmio always passes through a ioread()/iowrite() call so this is perhaps the hook point. If we set the stake in the ground that mmios that go through some other mechanism like PFs can just hit the "slow path" are an acceptable casualty, I think we can make that work. Thoughts? -Greg signature.asc Description: OpenPGP digital signature
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
Davide Libenzi wrote: On Thu, 7 May 2009, Avi Kivity wrote: Davide Libenzi wrote: What's your take on adding irq context safe callbacks to irqfd? To give some background here, we would like to use eventfd as a generic connector between components, so the components do not know about each other. So far eventfd successfully abstracts among components in the same process, in different processes, and in the kernel. eventfd_signal() can be safely called from irq context, and will wake up a waiting task. But in some cases, if the consumer is in the kernel, it may be able to consume the event from irq context, saving a context switch. So, will you consider patches adding this capability to eventfd? Maybe I got lost in the thread, but inside the kernel we have callback-based wakeup since long time. This is what epoll uses, when hooking into the file* f_op->poll() subsystem. Did you mean something else? Do you mean wait_queue_t::func? Yes, it is since long time ago that a wakeup in Linux can lead either to a real wakeup (in scheduler terms), or to a simple function call. Isn't that enough? It's perfect. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: What do you think of my mmio hypercall? That will speed up all mmio to be as fast as a hypercall, and then we can use ordinary mmio/pio writes to trigger things. I like it! Bigger question is what kind of work goes into making mmio a pv_op (or is this already done)? Looks like it isn't there. But it isn't any different than set_pte - convert a write into a hypercall. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Gregory Haskins wrote: >> I completed the resurrection of the test and wrote up a little wiki on >> the subject, which you can find here: >> >> http://developer.novell.com/wiki/index.php/WhyHypercalls >> >> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why >> reinvent the wheel?" questions. >> >> I will include this information when I publish the updated v2 series >> with the s/hypercall/dynhc changes. >> >> Let me know if you have any questions. >> > > Well, 420 ns is not to be sneezed at. > > What do you think of my mmio hypercall? That will speed up all mmio > to be as fast as a hypercall, and then we can use ordinary mmio/pio > writes to trigger things. > I like it! Bigger question is what kind of work goes into making mmio a pv_op (or is this already done)? -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: I completed the resurrection of the test and wrote up a little wiki on the subject, which you can find here: http://developer.novell.com/wiki/index.php/WhyHypercalls Hopefully this answers Chris' "show me the numbers" and Anthony's "Why reinvent the wheel?" questions. I will include this information when I publish the updated v2 series with the s/hypercall/dynhc changes. Let me know if you have any questions. Well, 420 ns is not to be sneezed at. What do you think of my mmio hypercall? That will speed up all mmio to be as fast as a hypercall, and then we can use ordinary mmio/pio writes to trigger things. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: bridges
On Thu, 2009-05-07 at 11:13 -0600, Cam Macdonell wrote: > > Ross Boylan wrote: > > I'm trying to understand bridging with KVM, but am still puzzled. > > I think that the recommended bridging with TAP means that packets > from > > the VM will end up going out the host card attached to the default > > gateway. But it looks to me as if their IP address is unchanged, > which > > means replies will never reach me. Is that correct? Do I need to > NAT > > the packets, or is something already doing that? > > Hi Ross, > > This is the place to start > > http://www.linux-kvm.org/page/Networking. I saw that; it gives some recipes but I wasn't sure what their effect was. > You want a public bridge. > > I'm not sure what "their" and "me" mean in your email. In short, > with > bridging each VM has its own IP and that VM can be accessed directly > from the network. "their" = the VM. "me" = my host machine. So if the VM's are running on their own subnet, e.g., 10.0.2.* (I've been assuming the subnet with TAP is like the one with the User Mode Network stack in 3.7.3 of http://www.nongnu.org/qemu/qemu-doc.html) and my host machine is on another net, e.g., 10.0.8.* then I think the packet will go out with an IP of 10.0.2.2 (say). When some other machine tries to reply to 10.0.2.2, the packet gets lost because the outside network thinks 10.0.2.* is not for it. At least that's my concern. If the return IP address on the packet were 10.0.8.44 (supposing that's the IP of my host machine) then the packets could find their way back. My host machine is on an internal network with a 10.* IP. The example might be clearer if one supposed that the VM's were on a 192.168.* network. I am perhaps being influenced by the fact that I don't want to ask for more IP's, so I don't want to configure the VM's to use an IP on our 10.0.8 network. Does the TAP networking setup a whole subnet like the user mode network stack (e.g., running a DHCP server), or is the idea that I would just give the VM an IP on my subnet (10.0.8.*) in this example? If the latter is the case (I'm now suspecting it is) then I think the solution is clear. I just stick the VM's on a private (to my machine) subnet, like 192.168.*, and I do NAT on the packets as they go out. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: device-assignment: Catch GSI overflow
On Thu, 2009-05-07 at 11:09 -0600, Alex Williamson wrote: > Fix the index at which we return -ENOSPC since the kernel side will > reject a GSI >= KVM_MAX_IRQ_ROUTES. Also, mask as a signed int before > testing for error. Even with this, there still seems to be a fundamental problem with our consumption of GSIs in kvm. For example, every time a guest writes to the MSI capabilities area and enables MSI support, we call kvm_get_irq_route_gsi() to get a new max_used_gsi + 1 value, then call kvm_add_routing_entry(), which updates max_used_gsi. It doesn't take too long before we exhaust the GSI space and the device no longer works. This seems to happen within a minute or two of booting a guest with an e1000e device sitting idle on a busy network. Do we need to keep a bitmap of used GSIs or maybe just attempt to reuse the GSI we've gotten previously for the device? Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm: device-assignment: Catch GSI overflow
Fix the index at which we return -ENOSPC since the kernel side will reject a GSI >= KVM_MAX_IRQ_ROUTES. Also, mask as a signed int before testing for error. Signed-off-by: Alex Williamson --- hw/device-assignment.c |2 +- kvm/libkvm/libkvm.c|2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index a7365c8..e06dd08 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -796,7 +796,7 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev, unsigned int ctrl_pos) pci_dev->cap.start + PCI_MSI_DATA_32); assigned_dev->entry->type = KVM_IRQ_ROUTING_MSI; assigned_dev->entry->gsi = kvm_get_irq_route_gsi(kvm_context); -if (assigned_dev->entry->gsi < 0) { +if ((int)(assigned_dev->entry->gsi) < 0) { perror("assigned_dev_update_msi: kvm_get_irq_route_gsi"); return; } diff --git a/kvm/libkvm/libkvm.c b/kvm/libkvm/libkvm.c index ba0a5d1..2a4165a 100644 --- a/kvm/libkvm/libkvm.c +++ b/kvm/libkvm/libkvm.c @@ -1408,7 +1408,7 @@ int kvm_get_irq_route_gsi(kvm_context_t kvm) { #ifdef KVM_CAP_IRQ_ROUTING if (kvm->max_used_gsi >= KVM_IOAPIC_NUM_PINS) { - if (kvm->max_used_gsi <= kvm_get_gsi_count(kvm)) + if (kvm->max_used_gsi + 1 < kvm_get_gsi_count(kvm)) return kvm->max_used_gsi + 1; else return -ENOSPC; -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote: > * Gregory Haskins (ghask...@novell.com) wrote: > >> Chris Wright wrote: >> >>> VF drivers can also have this issue (and typically use mmio). >>> I at least have a better idea what your proposal is, thanks for >>> explanation. Are you able to demonstrate concrete benefit with it yet >>> (improved latency numbers for example)? >>> >> I had a test-harness/numbers for this kind of thing, but its a bit >> crufty since its from ~1.5 years ago. I will dig it up, update it, and >> generate/post new numbers. >> > > That would be useful, because I keep coming back to pio and shared > page(s) when think of why not to do this. Seems I'm not alone in that. > > thanks, > -chris > I completed the resurrection of the test and wrote up a little wiki on the subject, which you can find here: http://developer.novell.com/wiki/index.php/WhyHypercalls Hopefully this answers Chris' "show me the numbers" and Anthony's "Why reinvent the wheel?" questions. I will include this information when I publish the updated v2 series with the s/hypercall/dynhc changes. Let me know if you have any questions. -Greg signature.asc Description: OpenPGP digital signature
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
On Thu, 7 May 2009, Avi Kivity wrote: > Davide Libenzi wrote: > > > > > What's your take on adding irq context safe callbacks to irqfd? > > > > > > To give some background here, we would like to use eventfd as a generic > > > connector between components, so the components do not know about each > > > other. > > > So far eventfd successfully abstracts among components in the same > > > process, in > > > different processes, and in the kernel. > > > > > > eventfd_signal() can be safely called from irq context, and will wake up a > > > waiting task. But in some cases, if the consumer is in the kernel, it may > > > be > > > able to consume the event from irq context, saving a context switch. > > > > > > So, will you consider patches adding this capability to eventfd? > > > > > > > Maybe I got lost in the thread, but inside the kernel we have callback-based > > wakeup since long time. This is what epoll uses, when hooking into the file* > > f_op->poll() subsystem. > > Did you mean something else? > > > > Do you mean wait_queue_t::func? Yes, it is since long time ago that a wakeup in Linux can lead either to a real wakeup (in scheduler terms), or to a simple function call. Isn't that enough? - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] BIOS changes for configuring irq0->inti2 override(v2)
Avi Kivity wrote: Beth Kon wrote: These patches resolve the irq0->inti2 override issue, and get the hpet working on kvm. They are dependent on Jes Sorensen's recent 0006-qemu-kvm-irq-routing.patch. Override and HPET changes are sent as a series because HPET depends on the override. Win2k8 expects the HPET interrupt on inti2, regardless of whether an override exists in the BIOS. And the HPET spec states that in legacy mode, timer interrupt is on inti2. The irq0->inti2 override will always be used unless the kernel cannot do irq routing (i.e., compatibility with old kernels). So if the kernel is capable, userspace sets up irq0->inti2 via the irq routing interface, and adds the irq0->inti2 override to the MADT interrupt source override table, and the mp table (for the no-acpi case). A couple of months ago, Marcelo was seeing RHEL5 guests complain of invalid checksum with these patches, but later he couldn't reproduce it, and I'm not seeing it now. While all guests still need to be fully tested, everything appears to be in order. I've tested on win2k864, win2k832, RHEL5.3 32 bit, and ubuntu 8.10 64 bit. What are the changes relative to v1? Just merge issues with the changes you put in when moving to the newer bios. I submitted prematurely, incorrectly thinking I was done testing. When I finished, some problems surfaced. @@ -477,6 +480,7 @@ void wrmsr_smp(uint32_t index, uint64_t val) #define QEMU_CFG_SIGNATURE 0x00 #define QEMU_CFG_ID 0x01 #define QEMU_CFG_UUID 0x02 +#define QEMU_CFG_IRQ0_OVERRIDE 0x0e As noted, this should be in the arch local space. The base changes were not in the code yet. As we discussed on IRC, I'll resubmit once they're there. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] Driver for Inter-VM shared memory device for KVM supporting interrupts.
Driver for inter-VM shared memory device that now supports interrupts between two guests. The driver defines a counting semaphore and wait_event queue for different synchronization needs of users. Initializing the semaphore count, sending interrupts and waiting are implemented via ioctl calls. The synchronization mechanisms are simple and rely on existing kernel primitives, but I think they're flexible for synchronization between guests. I'm contemplating more complicated designs that would use the shared memory to store synchronization variables, but thought I would get this initial patch out to get some feedback. Cheers, Cam --- drivers/char/Kconfig |8 + drivers/char/Makefile |2 + drivers/char/kvm_ivshmem.c | 430 3 files changed, 440 insertions(+), 0 deletions(-) create mode 100644 drivers/char/kvm_ivshmem.c diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index 735bbe2..afa7cb8 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -1099,6 +1099,14 @@ config DEVPORT depends on ISA || PCI default y +config KVM_IVSHMEM +tristate "Inter-VM Shared Memory Device" +depends on PCI +default m +help + This device maps a region of shared memory between the host OS and any + number of virtual machines. + source "drivers/s390/char/Kconfig" endmenu diff --git a/drivers/char/Makefile b/drivers/char/Makefile index 9caf5b5..021f06b 100644 --- a/drivers/char/Makefile +++ b/drivers/char/Makefile @@ -111,6 +111,8 @@ obj-$(CONFIG_PS3_FLASH) += ps3flash.o obj-$(CONFIG_JS_RTC) += js-rtc.o js-rtc-y = rtc.o +obj-$(CONFIG_KVM_IVSHMEM) += kvm_ivshmem.o + # Files generated that shall be removed upon make clean clean-files := consolemap_deftbl.c defkeymap.c diff --git a/drivers/char/kvm_ivshmem.c b/drivers/char/kvm_ivshmem.c new file mode 100644 index 000..a20a224 --- /dev/null +++ b/drivers/char/kvm_ivshmem.c @@ -0,0 +1,430 @@ +/* + * drivers/char/kvm_ivshmem.c - driver for KVM Inter-VM shared memory PCI device + * + * Copyright 2009 Cam Macdonell + * + * Based on cirrusfb.c and 8139cp.c: + * Copyright 1999-2001 Jeff Garzik + * Copyright 2001-2004 Jeff Garzik + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define TRUE 1 +#define FALSE 0 +#define KVM_IVSHMEM_DEVICE_MINOR_NUM 0 + +enum { +/* KVM Inter-VM shared memory device register offsets */ +IntrMask= 0x00,/* Interrupt Mask */ +IntrStatus = 0x10,/* Interrupt Status */ +Doorbell= 0x20,/* Doorbell */ +ShmOK = 1/* Everything is OK */ +}; + +typedef struct kvm_ivshmem_device { +void __iomem * regs; + +void * base_addr; + +unsigned int regaddr; +unsigned int reg_size; + +unsigned int ioaddr; +unsigned int ioaddr_size; +unsigned int irq; + +bool enabled; + +} kvm_ivshmem_device; + +static int event_num; +static struct semaphore sema; +static wait_queue_head_t wait_queue; + +static kvm_ivshmem_device kvm_ivshmem_dev; + +static int device_major_nr; + +static int kvm_ivshmem_ioctl(struct inode *, struct file *, unsigned int, unsigned long); +static int kvm_ivshmem_mmap(struct file *, struct vm_area_struct *); +static int kvm_ivshmem_open(struct inode *, struct file *); +static int kvm_ivshmem_release(struct inode *, struct file *); +static ssize_t kvm_ivshmem_read(struct file *, char *, size_t, loff_t *); +static ssize_t kvm_ivshmem_write(struct file *, const char *, size_t, loff_t *); +static loff_t kvm_ivshmem_lseek(struct file * filp, loff_t offset, int origin); + +enum ivshmem_ioctl { set_sema, down_sema, sema_irq, wait_event, wait_event_irq }; + +static const struct file_operations kvm_ivshmem_ops = { +.owner = THIS_MODULE, +.open= kvm_ivshmem_open, +.mmap= kvm_ivshmem_mmap, +.read= kvm_ivshmem_read, +.ioctl = kvm_ivshmem_ioctl, +.write = kvm_ivshmem_write, +.llseek = kvm_ivshmem_lseek, +.release = kvm_ivshmem_release, +}; + +static struct pci_device_id kvm_ivshmem_id_table[] = { +{ 0x1af4, 0x1110, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 }, +{ 0 }, +}; +MODULE_DEVICE_TABLE (pci, kvm_ivshmem_id_table); + +static void kvm_ivshmem_remove_device(struct pci_dev* pdev); +static int kvm_ivshmem_probe_device (struct pci_dev *pdev, +const struct pci_device_id * ent); + +static struct pci_driver kvm_ivshmem_pci_driver = { +.name= "kvm-shmem", +.id_table= kvm_ivshmem_id_table, +.probe = kvm_ivshmem_probe_device, +.remove = kvm_ivshmem_remove_device, +}; + +static int kvm_ivshmem_ioctl(struct inode * ino, struct file * filp, +unsigned int cmd, unsigned long arg) +{ + +int rv; + +switch (cmd) { +case set_sema: +printk("KVM_IVSHMEM: initialize semaphore\n")
bridges
I'm trying to understand bridging with KVM, but am still puzzled. I think that the recommended bridging with TAP means that packets from the VM will end up going out the host card attached to the default gateway. But it looks to me as if their IP address is unchanged, which means replies will never reach me. Is that correct? Do I need to NAT the packets, or is something already doing that? Some documents indicate that I need to bring the interfaces (e.g., eth0) down before I bring the bridge up, and that afterwards only the bridge will have an IP address. Is that right? Some documents, e.g., http://ebtables.sourceforge.net/br_fw_ia/br_fw_ia.html, indicate iptables should "just work" with bridging. However, I've seen someone with a 2.6.15 kernel ask about firewalling and be told they needed to patch the kernel to get it work (don't have the reference handy). Should it just work? I'm running a 2.6.29 kernel on Debian Lenny with kvm 72+dfsg-5~lenny1. Version 84+dfsg-2 is available in experimental. Is there much to be gained by going with the more recent version? Please cc me; I'm not on the list. Thanks. Ross Boylan -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] bios: Use a different mask to size the option ROM BAR
Bit 0 is the enable bit, which we not only don't want to set, but it will stick and make us think it's an I/O port resource. Signed-off-by: Alex Williamson --- bios/rombios32.c |8 +--- 1 files changed, 5 insertions(+), 3 deletions(-) diff --git a/bios/rombios32.c b/bios/rombios32.c index 269f175..616e70a 100644 --- a/bios/rombios32.c +++ b/bios/rombios32.c @@ -948,11 +948,13 @@ static void pci_bios_init_device(PCIDevice *d) int ofs; uint32_t val, size ; -if (i == PCI_ROM_SLOT) +if (i == PCI_ROM_SLOT) { ofs = 0x30; -else +pci_config_writel(d, ofs, 0xfffe); +} else { ofs = 0x10 + i * 4; -pci_config_writel(d, ofs, 0x); +pci_config_writel(d, ofs, 0x); +} val = pci_config_readl(d, ofs); if (val != 0) { size = (~(val & ~0xf)) + 1; -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] Shared memory device with interrupt support
Support an inter-vm shared memory device that maps a shared-memory object as a PCI device in the guest. This patch also supports interrupts between guest by communicating over a unix domain socket. This patch applies to the qemu-kvm repository. This device now creates a qemu character device and sends 1-bytes messages to trigger interrupts. Writes are trigger by writing to the "Doorbell" register on the shared memory PCI device. The lower 8-bits of the value written to this register are sent as the 1-byte message so different meanings of interrupts can be supported. Interrupts are only supported between 2 VMs currently. One VM must act as the server by adding "server" to the command-line argument. Shared memory devices are created with the following command-line: -ivhshmem ,,[unix:][,server] Interrupts can also be used between host and guest as well by implementing a listener on the host. Cam --- Makefile.target |3 + hw/ivshmem.c| 421 +++ hw/pc.c |6 + hw/pc.h |3 + qemu-options.hx | 14 ++ sysemu.h|8 + vl.c| 14 ++ 7 files changed, 469 insertions(+), 0 deletions(-) create mode 100644 hw/ivshmem.c diff --git a/Makefile.target b/Makefile.target index b68a689..3190bba 100644 --- a/Makefile.target +++ b/Makefile.target @@ -643,6 +643,9 @@ OBJS += pcnet.o OBJS += rtl8139.o OBJS += e1000.o +# Inter-VM PCI shared memory +OBJS += ivshmem.o + # Generic watchdog support and some watchdog devices OBJS += watchdog.o OBJS += wdt_ib700.o wdt_i6300esb.o diff --git a/hw/ivshmem.c b/hw/ivshmem.c new file mode 100644 index 000..95e2268 --- /dev/null +++ b/hw/ivshmem.c @@ -0,0 +1,421 @@ +/* + * Inter-VM Shared Memory PCI device. + * + * Author: + * Cam Macdonell + * + * Based On: cirrus_vga.c and rtl8139.c + * + * This code is licensed under the GNU GPL v2. + */ + +#include "hw.h" +#include "console.h" +#include "pc.h" +#include "pci.h" +#include "sysemu.h" + +#include "qemu-common.h" +#include + +#define PCI_COMMAND_IOACCESS0x0001 +#define PCI_COMMAND_MEMACCESS 0x0002 +#define PCI_COMMAND_BUSMASTER 0x0004 + +//#define DEBUG_IVSHMEM + +#ifdef DEBUG_IVSHMEM +#define IVSHMEM_DPRINTF(fmt, args...)\ +do {printf("IVSHMEM: " fmt, ##args); } while (0) +#else +#define IVSHMEM_DPRINTF(fmt, args...) +#endif + +typedef struct IVShmemState { +uint16_t intrmask; +uint16_t intrstatus; +uint16_t doorbell; +uint8_t *ivshmem_ptr; +unsigned long ivshmem_offset; +unsigned int ivshmem_size; +unsigned long bios_offset; +unsigned int bios_size; +target_phys_addr_t base_ctrl; +int it_shift; +PCIDevice *pci_dev; +CharDriverState * chr; +unsigned long map_addr; +unsigned long map_end; +int ivshmem_mmio_io_addr; +} IVShmemState; + +typedef struct PCI_IVShmemState { +PCIDevice dev; +IVShmemState ivshmem_state; +} PCI_IVShmemState; + +typedef struct IVShmemDesc { +char name[1024]; +char * chrdev; +int size; +} IVShmemDesc; + + +/* registers for the Inter-VM shared memory device */ +enum ivshmem_registers { +IntrMask = 0, +IntrStatus = 16, +Doorbell = 32 +}; + +static int num_ivshmem_devices = 0; +static IVShmemDesc ivshmem_desc; + +static void ivshmem_map(PCIDevice *pci_dev, int region_num, +uint32_t addr, uint32_t size, int type) +{ +PCI_IVShmemState *d = (PCI_IVShmemState *)pci_dev; +IVShmemState *s = &d->ivshmem_state; + +IVSHMEM_DPRINTF("addr = %u size = %u\n", addr, size); +cpu_register_physical_memory(addr, s->ivshmem_size, s->ivshmem_offset); + +} + +void ivshmem_init(const char * optarg) { + +char * temp; +char * ivshmem_sz; +int size; + +num_ivshmem_devices++; + +/* currently we only support 1 device */ +if (num_ivshmem_devices > MAX_IVSHMEM_DEVICES) { +return; +} + +temp = strdup(optarg); +snprintf(ivshmem_desc.name, 1024, "/%s", strsep(&temp,",")); +ivshmem_sz=strsep(&temp,","); +if (ivshmem_sz != NULL){ +size = atol(ivshmem_sz); +} else { +size = -1; +} + +ivshmem_desc.chrdev = strsep(&temp,"\0"); + +if ( size == -1) { +ivshmem_desc.size = TARGET_PAGE_SIZE; +} else { +ivshmem_desc.size = size*1024*1024; +} +IVSHMEM_DPRINTF("optarg is %s, name is %s, size is %d, chrdev is %s\n", +optarg, ivshmem_desc.name, +ivshmem_desc.size, ivshmem_desc.chrdev); +} + +int ivshmem_get_size(void) { +return ivshmem_desc.size; +} + +/* accessing registers - based on rtl8139 */ +static void ivshmem_update_irq(IVShmemState *s) +{ +int isr; +isr = (s->intrstatus & s->intrmask) & 0x; + +/* don't print ISR resets */ +if (isr) { +IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n", + isr ? 1 : 0, s
Re: [PATCH 1/4] BIOS changes for configuring irq0->inti2 override (v2)
Beth Kon wrote: These patches resolve the irq0->inti2 override issue, and get the hpet working on kvm. They are dependent on Jes Sorensen's recent 0006-qemu-kvm-irq-routing.patch. Override and HPET changes are sent as a series because HPET depends on the override. Win2k8 expects the HPET interrupt on inti2, regardless of whether an override exists in the BIOS. And the HPET spec states that in legacy mode, timer interrupt is on inti2. The irq0->inti2 override will always be used unless the kernel cannot do irq routing (i.e., compatibility with old kernels). So if the kernel is capable, userspace sets up irq0->inti2 via the irq routing interface, and adds the irq0->inti2 override to the MADT interrupt source override table, and the mp table (for the no-acpi case). A couple of months ago, Marcelo was seeing RHEL5 guests complain of invalid checksum with these patches, but later he couldn't reproduce it, and I'm not seeing it now. While all guests still need to be fully tested, everything appears to be in order. I've tested on win2k864, win2k832, RHEL5.3 32 bit, and ubuntu 8.10 64 bit. What are the changes relative to v1? @@ -477,6 +480,7 @@ void wrmsr_smp(uint32_t index, uint64_t val) #define QEMU_CFG_SIGNATURE 0x00 #define QEMU_CFG_ID 0x01 #define QEMU_CFG_UUID 0x02 +#define QEMU_CFG_IRQ0_OVERRIDE 0x0e As noted, this should be in the arch local space. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
Davide Libenzi wrote: What's your take on adding irq context safe callbacks to irqfd? To give some background here, we would like to use eventfd as a generic connector between components, so the components do not know about each other. So far eventfd successfully abstracts among components in the same process, in different processes, and in the kernel. eventfd_signal() can be safely called from irq context, and will wake up a waiting task. But in some cases, if the consumer is in the kernel, it may be able to consume the event from irq context, saving a context switch. So, will you consider patches adding this capability to eventfd? Maybe I got lost in the thread, but inside the kernel we have callback-based wakeup since long time. This is what epoll uses, when hooking into the file* f_op->poll() subsystem. Did you mean something else? Do you mean wait_queue_t::func? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][KVM][retry 1] Add support for Pause Filtering to AMD SVM
(copying Ingo) Mark Langsdorf wrote: commit 01813db8627e74018c8cec90df7e345839351f23 Author: root Date: Thu May 7 09:44:10 2009 -0500 New AMD processors will support the Pause Filter Feature. This feature creates a new field in the VMCB called Pause Filter Count. If Pause Filter Count is greater than 0 and intercepting PAUSEs is enabled, the processor will increment an internal counter when a PAUSE instruction occurs instead of intercepting. When the internal counter reaches the Pause Filter Count value, a PAUSE intercept will occur. This feature can be used to detect contended spinlocks, especially when the lock holding VCPU is not scheduled. Rescheduling another VCPU prevents the VCPU seeking the lock from wasting its quantum by spinning idly. Experimental results show that most spinlocks are held for less than 1000 PAUSE cycles or more than a few thousand. Default the Pause Filter Counter to 3000 to detect the contended spinlocks. Processor support for this feature is indicated by a CPUID bit. On a 24 core system running 4 guests each with 16 VCPUs, this patch improved overall performance of each guest's 32 job kernbench by approximately 1%. Further performance improvement may be possible with a more sophisticated yield algorithm. -Mark Langsdorf Operating System Research Center AMD Signed-off-by: Mark Langsdorf (please use git format-patch rather than git show, and set up user.name and user.email properly) svm->nested_vmcb = 0; svm->vcpu.arch.hflags = HF_GIF_MASK; + + if (svm_has(SVM_FEATURE_PAUSE_FILTER)) { + control->pause_filter_count = 5000; + control->intercept |= (1ULL << INTERCEPT_PAUSE); + } + } 3000 or 5000? +static int pause_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) +{ + /* Simple yield */ + vcpu_put(&svm->vcpu); + schedule(); + vcpu_load(&svm->vcpu); + return 1; + Ingo, will this do anything under CFS, or will CFS note that nothing has changed in the accounting and reschedule us immediately? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
Gregory Haskins wrote: This is my preferred option. For a virtio-net-server in the kernel, we'd service its eventfd in qemu, raising and lowering the pci interrupt in the traditional way. But we'd still need to know when to lower the interrupt. How? IIUC, isn't that usually device/subsystem specific, and out of scope of the GSI delivery vehicle? For instance, most devices I have seen with level ints have a register in their device register namespace for acking the int. Yes it is. As an aside, this is what causes some of the grief in dealing with shared interrupts like KVM pass-through and/or threaded-isrs: There isn't a standardized way to ACK them. So we'd need a side channel to tell userspace to lower the irq. Another eventfd likely. Note we don't support device assignment for devices with shared interrupts. You may also see some generalization of masking/acking in things like the MSI-X table. But again, this would be out of scope of the general GSI delivery path IIUC. I understand that there is a feedback mechanism in the ioapic model for calling back on acknowledgment of the interrupt. But I am not sure what is how the real hardware works normally, and therefore I am not convinced that is something we need to feed all the way back (i.e. via irqfd or whatever). In the interest of full disclosure, its been a few years since I studied the xAPIC docs, so I might be out to lunch on that assertion. ;) Right, that ack thing is completely internal, used for catching up on time drift, and for shutting down level triggered assigned interrupts. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][KVM][retry 1] Add support for Pause Filtering to AMD SVM
commit 01813db8627e74018c8cec90df7e345839351f23 Author: root Date: Thu May 7 09:44:10 2009 -0500 New AMD processors will support the Pause Filter Feature. This feature creates a new field in the VMCB called Pause Filter Count. If Pause Filter Count is greater than 0 and intercepting PAUSEs is enabled, the processor will increment an internal counter when a PAUSE instruction occurs instead of intercepting. When the internal counter reaches the Pause Filter Count value, a PAUSE intercept will occur. This feature can be used to detect contended spinlocks, especially when the lock holding VCPU is not scheduled. Rescheduling another VCPU prevents the VCPU seeking the lock from wasting its quantum by spinning idly. Experimental results show that most spinlocks are held for less than 1000 PAUSE cycles or more than a few thousand. Default the Pause Filter Counter to 3000 to detect the contended spinlocks. Processor support for this feature is indicated by a CPUID bit. On a 24 core system running 4 guests each with 16 VCPUs, this patch improved overall performance of each guest's 32 job kernbench by approximately 1%. Further performance improvement may be possible with a more sophisticated yield algorithm. -Mark Langsdorf Operating System Research Center AMD Signed-off-by: Mark Langsdorf diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h index 85574b7..1fecb7e 100644 --- a/arch/x86/include/asm/svm.h +++ b/arch/x86/include/asm/svm.h @@ -57,7 +57,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area { u16 intercept_dr_write; u32 intercept_exceptions; u64 intercept; - u8 reserved_1[44]; + u8 reserved_1[42]; + u16 pause_filter_count; u64 iopm_base_pa; u64 msrpm_base_pa; u64 tsc_offset; diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index ef43a18..4279141 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -45,6 +45,7 @@ MODULE_LICENSE("GPL"); #define SVM_FEATURE_NPT (1 << 0) #define SVM_FEATURE_LBRV (1 << 1) #define SVM_FEATURE_SVML (1 << 2) +#define SVM_FEATURE_PAUSE_FILTER (1 << 10) #define DEBUGCTL_RESERVED_BITS (~(0x3fULL)) @@ -575,6 +576,12 @@ static void init_vmcb(struct vcpu_svm *svm) svm->nested_vmcb = 0; svm->vcpu.arch.hflags = HF_GIF_MASK; + + if (svm_has(SVM_FEATURE_PAUSE_FILTER)) { + control->pause_filter_count = 5000; + control->intercept |= (1ULL << INTERCEPT_PAUSE); + } + } static int svm_vcpu_reset(struct kvm_vcpu *vcpu) @@ -2087,6 +2094,15 @@ static int interrupt_window_interception(struct vcpu_svm *svm, return 1; } +static int pause_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) +{ + /* Simple yield */ + vcpu_put(&svm->vcpu); + schedule(); + vcpu_load(&svm->vcpu); + return 1; +} + static int (*svm_exit_handlers[])(struct vcpu_svm *svm, struct kvm_run *kvm_run) = { [SVM_EXIT_READ_CR0] = emulate_on_interception, @@ -2123,6 +2139,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm, [SVM_EXIT_CPUID]= cpuid_interception, [SVM_EXIT_IRET] = iret_interception, [SVM_EXIT_INVD] = emulate_on_interception, + [SVM_EXIT_PAUSE]= pause_interception, [SVM_EXIT_HLT] = halt_interception, [SVM_EXIT_INVLPG] = invlpg_interception, [SVM_EXIT_INVLPGA] = invalid_op_interception, diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 2b73e19..e2b730d 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -710,6 +710,7 @@ void vcpu_load(struct kvm_vcpu *vcpu) kvm_arch_vcpu_load(vcpu, cpu); put_cpu(); } +EXPORT_SYMBOL_GPL(vcpu_load); void vcpu_put(struct kvm_vcpu *vcpu) { @@ -719,6 +720,7 @@ void vcpu_put(struct kvm_vcpu *vcpu) preempt_enable(); mutex_unlock(&vcpu->mutex); } +EXPORT_SYMBOL_GPL(vcpu_put); static void ack_flush(void *_completed) { -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] BIOS changes for KVM HPET (v2)
Just a note here... The number of table_offset_entry entries for the non BX_QEMU case doesn't make sense here. There are only 2 entries. I left it as is, since it does not impact HPET's interraction with it. Actually it seems like dead code since this is in kvm code but with BX_QEMU undefined. It doesn't seem to be a problem. Signed-off-by: Beth Kon diff --git a/kvm/bios/acpi-dsdt.dsl b/kvm/bios/acpi-dsdt.dsl index c756fed..0e142be 100755 --- a/kvm/bios/acpi-dsdt.dsl +++ b/kvm/bios/acpi-dsdt.dsl @@ -308,7 +308,6 @@ DefinitionBlock ( }) } #ifdef BX_QEMU -#ifdef HPET_WORKS_IN_KVM Device(HPET) { Name(_HID, EISAID("PNP0103")) Name(_UID, 0) @@ -328,7 +327,6 @@ DefinitionBlock ( }) } #endif -#endif } Scope(\_SB.PCI0) { diff --git a/kvm/bios/rombios32.c b/kvm/bios/rombios32.c index ddfa828..7441cd7 100755 --- a/kvm/bios/rombios32.c +++ b/kvm/bios/rombios32.c @@ -1293,7 +1293,7 @@ struct rsdt_descriptor_rev1 { ACPI_TABLE_HEADER_DEF /* ACPI common table header */ #ifdef BX_QEMU - uint32_t table_offset_entry [2]; /* Array of pointers to other */ + uint32_t table_offset_entry [3]; /* Array of pointers to other */ // uint32_t table_offset_entry [4]; /* Array of pointers to other */ #else uint32_t table_offset_entry [3]; /* Array of pointers to other */ @@ -1450,8 +1450,8 @@ struct acpi_20_generic_address { } __attribute__((__packed__)); /* - * * HPET Description Table - * */ + * HPET Description Table + */ struct acpi_20_hpet { ACPI_TABLE_HEADER_DEF /* ACPI common table header */ uint32_t timer_block_id; @@ -1591,13 +1591,11 @@ void acpi_bios_init(void) #endif addr += madt_size; #ifdef BX_QEMU -#ifdef HPET_WORKS_IN_KVM addr = (addr + 7) & ~7; hpet_addr = addr; hpet = (void *)(addr); addr += sizeof(*hpet); #endif -#endif acpi_tables_size = addr - base_addr; @@ -1620,10 +1618,10 @@ void acpi_bios_init(void) memset(rsdt, 0, sizeof(*rsdt)); rsdt->table_offset_entry[0] = cpu_to_le32(fadt_addr); rsdt->table_offset_entry[1] = cpu_to_le32(madt_addr); -//rsdt->table_offset_entry[2] = cpu_to_le32(ssdt_addr); #ifdef BX_QEMU -//rsdt->table_offset_entry[3] = cpu_to_le32(hpet_addr); +rsdt->table_offset_entry[2] = cpu_to_le32(hpet_addr); #endif +//rsdt->table_offset_entry[3] = cpu_to_le32(ssdt_addr); acpi_build_table_header((struct acpi_table_header *)rsdt, "RSDT", sizeof(*rsdt), 1); @@ -1723,7 +1721,6 @@ void acpi_bios_init(void) #ifdef BX_QEMU /* HPET */ -#ifdef HPET_WORKS_IN_KVM memset(hpet, 0, sizeof(*hpet)); /* Note timer_block_id value must be kept in sync with value advertised by * emulated hpet @@ -1733,7 +1730,6 @@ void acpi_bios_init(void) acpi_build_table_header((struct acpi_table_header *)hpet, "HPET", sizeof(*hpet), 1); #endif -#endif } -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] Userspace changes for KVM HPET (v2)
Signed-off-by: Beth Kon diff --git a/hw/hpet.c b/hw/hpet.c index c7945ec..47c9f89 100644 --- a/hw/hpet.c +++ b/hw/hpet.c @@ -30,6 +30,7 @@ #include "console.h" #include "qemu-timer.h" #include "hpet_emul.h" +#include "qemu-kvm.h" //#define HPET_DEBUG #ifdef HPET_DEBUG @@ -48,6 +49,43 @@ uint32_t hpet_in_legacy_mode(void) return 0; } +static void hpet_kpit_enable(void) +{ +struct kvm_pit_state ps; +kvm_get_pit(kvm_context, &ps); +kvm_set_pit(kvm_context, &ps); +} + +static void hpet_kpit_disable(void) +{ +struct kvm_pit_state ps; +kvm_get_pit(kvm_context, &ps); +ps.channels[0].mode = 0xff; +kvm_set_pit(kvm_context, &ps); +} + +static void hpet_legacy_enable(void) +{ +if (qemu_kvm_pit_in_kernel()) { + hpet_kpit_disable(); + dprintf("qemu: hpet disabled kernel pit\n"); +} else { + hpet_pit_disable(); + dprintf("qemu: hpet disabled userspace pit\n"); +} +} + +static void hpet_legacy_disable(void) +{ +if (qemu_kvm_pit_in_kernel()) { + hpet_kpit_enable(); + dprintf("qemu: hpet enabled kernel pit\n"); +} else { + hpet_pit_enable(); + dprintf("qemu: hpet enabled userspace pit\n"); +} +} + static uint32_t timer_int_route(struct HPETTimer *timer) { uint32_t route; @@ -475,9 +513,9 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr, } /* i8254 and RTC are disabled when HPET is in legacy mode */ if (activating_bit(old_val, new_val, HPET_CFG_LEGACY)) { -hpet_pit_disable(); +hpet_legacy_enable(); } else if (deactivating_bit(old_val, new_val, HPET_CFG_LEGACY)) { -hpet_pit_enable(); +hpet_legacy_disable(); } break; case HPET_CFG + 4: @@ -560,7 +598,7 @@ static void hpet_reset(void *opaque) { * hpet_reset is called due to system reset. At this point control must * be returned to pit until SW reenables hpet. */ -hpet_pit_enable(); +hpet_legacy_disable(); count = 1; } diff --git a/vl.c b/vl.c index f9a72b3..b860b82 100644 --- a/vl.c +++ b/vl.c @@ -6138,10 +6138,15 @@ int main(int argc, char **argv, char **envp) } if (kvm_enabled()) { - kvm_init_ap(); +kvm_init_ap(); #ifdef USE_KVM if (kvm_irqchip && !qemu_kvm_has_gsi_routing()) { irq0override = 0; +/* if kernel can't do irq routing, interrupt source + * override 0->2 can not be set up as required by hpet, + * so disable hpet. + */ +no_hpet=1; } #endif } -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] Userspace changes for configuring irq0->inti2 override (v2)
Signed-off-by: Beth Kon diff --git a/hw/fw_cfg.c b/hw/fw_cfg.c index e1b19d7..bb74f38 100644 --- a/hw/fw_cfg.c +++ b/hw/fw_cfg.c @@ -279,6 +279,7 @@ void *fw_cfg_init(uint32_t ctl_port, uint32_t data_port, fw_cfg_add_bytes(s, FW_CFG_UUID, qemu_uuid, 16); fw_cfg_add_i16(s, FW_CFG_NOGRAPHIC, (uint16_t)nographic); fw_cfg_add_i16(s, FW_CFG_NB_CPUS, (uint16_t)smp_cpus); +fw_cfg_add_i16(s, FW_CFG_IRQ0_OVERRIDE, (uint16_t)irq0override); register_savevm("fw_cfg", -1, 1, fw_cfg_save, fw_cfg_load, s); qemu_register_reset(fw_cfg_reset, s); diff --git a/hw/fw_cfg.h b/hw/fw_cfg.h index f616ed2..498c1e3 100644 --- a/hw/fw_cfg.h +++ b/hw/fw_cfg.h @@ -15,6 +15,7 @@ #define FW_CFG_INITRD_SIZE 0x0b #define FW_CFG_BOOT_DEVICE 0x0c #define FW_CFG_NUMA 0x0d +#define FW_CFG_IRQ0_OVERRIDE0x0e #define FW_CFG_MAX_ENTRY0x10 #define FW_CFG_WRITE_CHANNEL0x4000 diff --git a/hw/ioapic.c b/hw/ioapic.c index 0b70cf6..2d77a2c 100644 --- a/hw/ioapic.c +++ b/hw/ioapic.c @@ -23,6 +23,7 @@ #include "hw.h" #include "pc.h" +#include "sysemu.h" #include "qemu-timer.h" #include "host-utils.h" @@ -95,14 +96,13 @@ void ioapic_set_irq(void *opaque, int vector, int level) { IOAPICState *s = opaque; -#if 0 /* ISA IRQs map to GSI 1-1 except for IRQ0 which maps * to GSI 2. GSI maps to ioapic 1-1. This is not * the cleanest way of doing it but it should work. */ -if (vector == 0) +if (vector == 0 && irq0override) { vector = 2; -#endif +} if (vector >= 0 && vector < IOAPIC_NUM_PINS) { uint32_t mask = 1 << vector; diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index 8cb6faa..2e52c87 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -879,7 +879,11 @@ int kvm_arch_init_irq_routing(void) return r; } for (i = 0; i < 24; ++i) { -r = kvm_add_irq_route(kvm_context, i, KVM_IRQCHIP_IOAPIC, i); +if (i == 0) { +r = kvm_add_irq_route(kvm_context, i, KVM_IRQCHIP_IOAPIC, 2); +} else if (i != 2) { +r = kvm_add_irq_route(kvm_context, i, KVM_IRQCHIP_IOAPIC, i); +} if (r < 0) return r; } diff --git a/qemu-kvm.h b/qemu-kvm.h index 8226001..c64718c 100644 --- a/qemu-kvm.h +++ b/qemu-kvm.h @@ -165,6 +165,7 @@ void qemu_kvm_cpu_stop(CPUState *env); #define kvm_enabled() (kvm_allowed) #define qemu_kvm_irqchip_in_kernel() kvm_irqchip_in_kernel(kvm_context) #define qemu_kvm_pit_in_kernel() kvm_pit_in_kernel(kvm_context) +#define qemu_kvm_has_gsi_routing() kvm_has_gsi_routing(kvm_context) #define kvm_has_sync_mmu() qemu_kvm_has_sync_mmu() void kvm_init_vcpu(CPUState *env); void kvm_load_tsc(CPUState *env); @@ -173,6 +174,7 @@ void kvm_load_tsc(CPUState *env); #define kvm_nested 0 #define qemu_kvm_irqchip_in_kernel() (0) #define qemu_kvm_pit_in_kernel() (0) +#define qemu_kvm_has_gsi_routing() (0) #define kvm_has_sync_mmu() (0) #define kvm_load_registers(env) do {} while(0) #define kvm_save_registers(env) do {} while(0) diff --git a/sysemu.h b/sysemu.h index 1f45fd6..292bbc3 100644 --- a/sysemu.h +++ b/sysemu.h @@ -93,6 +93,7 @@ extern int graphic_width; extern int graphic_height; extern int graphic_depth; extern int nographic; +extern int irq0override; extern const char *keyboard_layout; extern int win2k_install_hack; extern int rtc_td_hack; diff --git a/vl.c b/vl.c index 6b4b7d2..f9a72b3 100644 --- a/vl.c +++ b/vl.c @@ -207,6 +207,7 @@ static int vga_ram_size; enum vga_retrace_method vga_retrace_method = VGA_RETRACE_DUMB; static DisplayState *display_state; int nographic; +int irq0override; static int curses; static int sdl; const char* keyboard_layout = NULL; @@ -5039,6 +5040,7 @@ int main(int argc, char **argv, char **envp) vga_ram_size = VGA_RAM_SIZE; snapshot = 0; nographic = 0; +irq0override = 1; curses = 0; kernel_filename = NULL; kernel_cmdline = ""; @@ -6135,8 +6137,14 @@ int main(int argc, char **argv, char **envp) } } -if (kvm_enabled()) - kvm_init_ap(); +if (kvm_enabled()) { + kvm_init_ap(); +#ifdef USE_KVM +if (kvm_irqchip && !qemu_kvm_has_gsi_routing()) { +irq0override = 0; +} +#endif +} machine->init(ram_size, vga_ram_size, boot_devices, kernel_filename, kernel_cmdline, initrd_filename, cpu_model); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] BIOS changes for configuring irq0->inti2 override (v2)
These patches resolve the irq0->inti2 override issue, and get the hpet working on kvm. They are dependent on Jes Sorensen's recent 0006-qemu-kvm-irq-routing.patch. Override and HPET changes are sent as a series because HPET depends on the override. Win2k8 expects the HPET interrupt on inti2, regardless of whether an override exists in the BIOS. And the HPET spec states that in legacy mode, timer interrupt is on inti2. The irq0->inti2 override will always be used unless the kernel cannot do irq routing (i.e., compatibility with old kernels). So if the kernel is capable, userspace sets up irq0->inti2 via the irq routing interface, and adds the irq0->inti2 override to the MADT interrupt source override table, and the mp table (for the no-acpi case). A couple of months ago, Marcelo was seeing RHEL5 guests complain of invalid checksum with these patches, but later he couldn't reproduce it, and I'm not seeing it now. While all guests still need to be fully tested, everything appears to be in order. I've tested on win2k864, win2k832, RHEL5.3 32 bit, and ubuntu 8.10 64 bit. Signed-off-by: Beth Kon diff --git a/kvm/bios/rombios32.c b/kvm/bios/rombios32.c index 8684987..07dda73 100755 --- a/kvm/bios/rombios32.c +++ b/kvm/bios/rombios32.c @@ -445,6 +445,9 @@ uint32_t cpuid_ext_features; unsigned long ram_size; uint64_t ram_end; uint8_t bios_uuid[16]; +#ifdef BX_QEMU +uint8_t irq0_override; +#endif #ifdef BX_USE_EBDA_TABLES unsigned long ebda_cur_addr; #endif @@ -477,6 +480,7 @@ void wrmsr_smp(uint32_t index, uint64_t val) #define QEMU_CFG_SIGNATURE 0x00 #define QEMU_CFG_ID 0x01 #define QEMU_CFG_UUID 0x02 +#define QEMU_CFG_IRQ0_OVERRIDE 0x0e int qemu_cfg_port; @@ -518,6 +522,18 @@ void uuid_probe(void) memset(bios_uuid, 0, 16); } +#ifdef BX_QEMU +void irq0_override_probe(void) +{ +if(qemu_cfg_port) { +qemu_cfg_select(QEMU_CFG_IRQ0_OVERRIDE); +qemu_cfg_read(&irq0_override, 1); +return; +} +memset(&irq0_override, 0, 1); +} +#endif + void cpu_probe(void) { uint32_t eax, ebx, ecx, edx; @@ -1160,6 +1176,13 @@ static void mptable_init(void) /* irqs */ for(i = 0; i < 16; i++) { +#ifdef BX_QEMU +/* One entry per ioapic interrupt destination. Destination 2 is covered + * by irq0->inti2 override (i == 0). Source IRQ 2 is unused + */ +if (irq0_override && i == 2) +continue; +#endif putb(&q, 3); /* entry type = I/O interrupt */ putb(&q, 0); /* interrupt type = vectored interrupt */ putb(&q, 0); /* flags: po=0, el=0 */ @@ -1167,7 +1190,12 @@ static void mptable_init(void) putb(&q, 0); /* source bus ID = ISA */ putb(&q, i); /* source bus IRQ */ putb(&q, ioapic_id); /* dest I/O APIC ID */ -putb(&q, i); /* dest I/O APIC interrupt in */ +#ifdef BX_QEMU +if (irq0_override && i == 0) +putb(&q, 2); /* dest I/O APIC interrupt in */ +else +#endif +putb(&q, i); /* dest I/O APIC interrupt in */ } /* patch length */ len = q - mp_config_table; @@ -1550,16 +1578,18 @@ void acpi_bios_init(void) addr = (addr + 7) & ~7; madt_addr = addr; +madt = (void *)(addr); madt_size = sizeof(*madt) + sizeof(struct madt_processor_apic) * MAX_CPUS + -#ifdef BX_QEMU -sizeof(struct madt_io_apic) /* + sizeof(struct madt_int_override) */; -#else sizeof(struct madt_io_apic); +#ifdef BX_QEMU +for (i = 0; i < 16; i++) +if (PCI_ISA_IRQ_MASK & (1U << i)) +madt_size += sizeof(struct madt_int_override); +if (irq0_override) +madt_size += sizeof(struct madt_int_override); #endif -madt = (void *)(addr); addr += madt_size; - #ifdef BX_QEMU #ifdef HPET_WORKS_IN_KVM addr = (addr + 7) & ~7; @@ -1660,23 +1690,20 @@ void acpi_bios_init(void) io_apic->io_apic_id = smp_cpus; io_apic->address = cpu_to_le32(0xfec0); io_apic->interrupt = cpu_to_le32(0); +int_override = (struct madt_int_override*)(io_apic + 1); #ifdef BX_QEMU -#ifdef HPET_WORKS_IN_KVM -io_apic++; - -int_override = (void *)io_apic; -int_override->type = APIC_XRUPT_OVERRIDE; -int_override->length = sizeof(*int_override); -int_override->bus = cpu_to_le32(0); -int_override->source = cpu_to_le32(0); -int_override->gsi = cpu_to_le32(2); -int_override->flags = cpu_to_le32(0); -#endif +if (irq0_override) { +memset(int_override, 0, sizeof(*int_override)); +int_override->type = APIC_XRUPT_OVERRIDE; +int_override->length = sizeof(*int_override); +int_override->source = 0; +int_override->gsi = 2; +int_override->flags = 0; /* conforms to bus specifications */ +int_override++; +} #endif - -int_override = (struct madt_int_override*)(io_apic + 1); for
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
Avi Kivity wrote: > Gregory Haskins wrote: >> One thing I was thinking here was that I could create a flag for the >> kvm_irqfd() function for something like "KVM_IRQFD_MODE_CLEAR". This >> flag when specified at creation time will cause the event to execute a >> clear operation instead of a set when triggered. That way, the default >> mode is an edge-triggered set. The non-default mode is to trigger a >> clear. Level-triggered ints could therefore create two irqfds, one for >> raising, the other for clearing. >> > > That's my second choice option. > >> An alternative is to abandon the use of eventfd, and allow the irqfd to >> be a first-class anon-fd. The parameters passed to the write/signal() >> function could then indicate the desired level. The disadvantage would >> be that it would not be compatible with eventfd, so we would need to >> decide if the tradeoff is worth it. >> > > I would really like to keep using eventfd. Which is why I asked > Davide about the prospects of direct callbacks (vs wakeups). I saw that request. That would be ideal. > >> OTOH, I suspect level triggered interrupts will be primarily in the >> legacy domain, so perhaps we do not need to worry about it too much. >> Therefore, another option is that we *could* simply set the stake in the >> ground that legacy/level cannot use irqfd. >> > > This is my preferred option. For a virtio-net-server in the kernel, > we'd service its eventfd in qemu, raising and lowering the pci > interrupt in the traditional way. > > But we'd still need to know when to lower the interrupt. How? IIUC, isn't that usually device/subsystem specific, and out of scope of the GSI delivery vehicle? For instance, most devices I have seen with level ints have a register in their device register namespace for acking the int. As an aside, this is what causes some of the grief in dealing with shared interrupts like KVM pass-through and/or threaded-isrs: There isn't a standardized way to ACK them. You may also see some generalization of masking/acking in things like the MSI-X table. But again, this would be out of scope of the general GSI delivery path IIUC. I understand that there is a feedback mechanism in the ioapic model for calling back on acknowledgment of the interrupt. But I am not sure what is how the real hardware works normally, and therefore I am not convinced that is something we need to feed all the way back (i.e. via irqfd or whatever). In the interest of full disclosure, its been a few years since I studied the xAPIC docs, so I might be out to lunch on that assertion. ;) -Greg signature.asc Description: OpenPGP digital signature
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
On Thu, 7 May 2009, Avi Kivity wrote: > What's your take on adding irq context safe callbacks to irqfd? > > To give some background here, we would like to use eventfd as a generic > connector between components, so the components do not know about each other. > So far eventfd successfully abstracts among components in the same process, in > different processes, and in the kernel. > > eventfd_signal() can be safely called from irq context, and will wake up a > waiting task. But in some cases, if the consumer is in the kernel, it may be > able to consume the event from irq context, saving a context switch. > > So, will you consider patches adding this capability to eventfd? Maybe I got lost in the thread, but inside the kernel we have callback-based wakeup since long time. This is what epoll uses, when hooking into the file* f_op->poll() subsystem. Did you mean something else? - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
Gregory Haskins wrote: One thing I was thinking here was that I could create a flag for the kvm_irqfd() function for something like "KVM_IRQFD_MODE_CLEAR". This flag when specified at creation time will cause the event to execute a clear operation instead of a set when triggered. That way, the default mode is an edge-triggered set. The non-default mode is to trigger a clear. Level-triggered ints could therefore create two irqfds, one for raising, the other for clearing. That's my second choice option. An alternative is to abandon the use of eventfd, and allow the irqfd to be a first-class anon-fd. The parameters passed to the write/signal() function could then indicate the desired level. The disadvantage would be that it would not be compatible with eventfd, so we would need to decide if the tradeoff is worth it. I would really like to keep using eventfd. Which is why I asked Davide about the prospects of direct callbacks (vs wakeups). OTOH, I suspect level triggered interrupts will be primarily in the legacy domain, so perhaps we do not need to worry about it too much. Therefore, another option is that we *could* simply set the stake in the ground that legacy/level cannot use irqfd. This is my preferred option. For a virtio-net-server in the kernel, we'd service its eventfd in qemu, raising and lowering the pci interrupt in the traditional way. But we'd still need to know when to lower the interrupt. How? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 3/4] KVM: introduce kvm_arch_can_free_memslot, disallow slot deletion if cached cr3
mtosa...@redhat.com wrote: Disallow the deletion of memory slots (and aliases, for x86 case), if a vcpu contains a cr3 that points to such slot/alias. That allows the guest to induce failures in the host. Better to triple-fault the guest instead. +int kvm_arch_can_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) +{ + return 1; +} + In general, instead of stubs in every arch, have x86 say KVM_HAVE_ARCH_CAN_FREE_MEMSLOT and define the stub in generic code when that define is not present. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] virtio_pci: optional MSI-X support
This implements optional MSI-X support in virtio_pci. MSI-X is used whenever the host supports at least 2 MSI-X vectors: 1 for configuration changes and 1 for virtqueues. Per-virtqueue vectors are allocated if enough vectors available. Signed-off-by: Michael S. Tsirkin --- drivers/virtio/virtio_pci.c | 209 +-- include/linux/virtio_pci.h |8 ++- 2 files changed, 190 insertions(+), 27 deletions(-) diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h index cd0fd5d..4a0275b 100644 --- a/include/linux/virtio_pci.h +++ b/include/linux/virtio_pci.h @@ -47,9 +47,15 @@ /* The bit of the ISR which indicates a device configuration change. */ #define VIRTIO_PCI_ISR_CONFIG 0x2 +/* MSI-X registers: only enabled if MSI-X is enabled. */ +/* A 16-bit vector for configuration changes. */ +#define VIRTIO_MSI_CONFIG_VECTOR20 +/* A 16-bit vector for selected queue notifications. */ +#define VIRTIO_MSI_QUEUE_VECTOR 22 + /* The remaining space is defined by each driver as the per-driver * configuration space */ -#define VIRTIO_PCI_CONFIG 20 +#define VIRTIO_PCI_CONFIG(dev) ((dev)->msix_enabled ? 24 : 20) /* Virtio ABI version, this must match exactly */ #define VIRTIO_PCI_ABI_VERSION 0 diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c index f7b79a2..2b6333c 100644 --- a/drivers/virtio/virtio_pci.c +++ b/drivers/virtio/virtio_pci.c @@ -42,6 +42,28 @@ struct virtio_pci_device /* a list of queues so we can dispatch IRQs */ spinlock_t lock; struct list_head virtqueues; + + /* MSI-X support */ + int msix_enabled; + struct msix_entry *msix_entries; + /* Name strings for interrupts. This size should be enough, +* and I'm too lazy to allocate each name separately. */ + char (*msix_names)[256]; + /* Number of vectors configured at startup (excludes per-virtqueue +* vectors if any) */ + unsigned msix_preset_vectors; + /* Number of per-virtqueue vectors if any. */ + unsigned msix_per_vq_vectors; +}; + +/* Constants for MSI-X */ +/* Use first vector for configuration changes, second and the rest for + * virtqueues Thus, we need at least 2 vectors for MSI. */ +enum { + VP_MSIX_CONFIG_VECTOR = 0, + VP_MSIX_VQ_VECTOR = 1, + VP_MSIX_MIN_VECTORS = 2, + VP_MSIX_NO_VECTOR = 0x, }; struct virtio_pci_vq_info @@ -60,6 +82,9 @@ struct virtio_pci_vq_info /* the list node for the virtqueues list */ struct list_head node; + + /* MSI-X vector (or none) */ + unsigned vector; }; /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */ @@ -109,7 +134,8 @@ static void vp_get(struct virtio_device *vdev, unsigned offset, void *buf, unsigned len) { struct virtio_pci_device *vp_dev = to_vp_device(vdev); - void __iomem *ioaddr = vp_dev->ioaddr + VIRTIO_PCI_CONFIG + offset; + void __iomem *ioaddr = vp_dev->ioaddr + + VIRTIO_PCI_CONFIG(vp_dev) + offset; u8 *ptr = buf; int i; @@ -123,7 +149,8 @@ static void vp_set(struct virtio_device *vdev, unsigned offset, const void *buf, unsigned len) { struct virtio_pci_device *vp_dev = to_vp_device(vdev); - void __iomem *ioaddr = vp_dev->ioaddr + VIRTIO_PCI_CONFIG + offset; + void __iomem *ioaddr = vp_dev->ioaddr + + VIRTIO_PCI_CONFIG(vp_dev) + offset; const u8 *ptr = buf; int i; @@ -221,17 +248,110 @@ static irqreturn_t vp_interrupt(int irq, void *opaque) return vp_vring_interrupt(irq, opaque); } - spin_lock_irqsave(&vp_dev->lock, flags); - list_for_each_entry(info, &vp_dev->virtqueues, node) { - if (vring_interrupt(irq, info->vq) == IRQ_HANDLED) - ret = IRQ_HANDLED; +/* the config->free_vqs() implementation */ +static void vp_free_vectors(struct virtio_device *vdev) { + struct virtio_pci_device *vp_dev = to_vp_device(vdev); + int i; + + /* Disable the vector used for configuration */ + iowrite16(VP_MSIX_NO_VECTOR, + vp_dev->ioaddr + VIRTIO_MSI_CONFIG_VECTOR); + + for (i = 0; i < vp_dev->msix_preset_vectors; ++i) + free_irq(vp_dev->msix_entries[i].vector, vp_dev); + + if (!vp_dev->msix_preset_vectors) + free_irq(vp_dev->pci_dev->irq, vp_dev); + + if (vp_dev->msix_enabled) { + vp_dev->msix_enabled = 0; + pci_disable_msix(vp_dev->pci_dev); } - spin_unlock_irqrestore(&vp_dev->lock, flags); +} - return ret; +static int vp_request_vectors(struct virtio_device *vdev, unsigned max_vqs) +{ + struct virtio_pci_device *vp_dev = to_vp_device(vdev); + const char *name = dev_name(&vp_dev->vdev.dev); + unsigned i, vectors; + int err
[PATCH 2/3] virtio_pci: split up vp_interrupt
This reorganizes virtio-pci code in vp_interrupt slightly, so that it's easier to add per-vq MSI support on top. Signed-off-by: Michael S. Tsirkin --- drivers/virtio/virtio_pci.c | 45 +- 1 files changed, 35 insertions(+), 10 deletions(-) diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c index 3671c42..f7b79a2 100644 --- a/drivers/virtio/virtio_pci.c +++ b/drivers/virtio/virtio_pci.c @@ -164,6 +164,37 @@ static void vp_notify(struct virtqueue *vq) iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY); } +/* Handle a configuration change: Tell driver if it wants to know. */ +static irqreturn_t vp_config_changed(int irq, void *opaque) +{ + struct virtio_pci_device *vp_dev = opaque; + struct virtio_driver *drv; + drv = container_of(vp_dev->vdev.dev.driver, + struct virtio_driver, driver); + + if (drv && drv->config_changed) + drv->config_changed(&vp_dev->vdev); + return IRQ_HANDLED; +} + +/* Notify all virtqueues on an interrupt. */ +static irqreturn_t vp_vring_interrupt(int irq, void *opaque) +{ + struct virtio_pci_device *vp_dev = opaque; + struct virtio_pci_vq_info *info; + irqreturn_t ret = IRQ_NONE; + unsigned long flags; + + spin_lock_irqsave(&vp_dev->lock, flags); + list_for_each_entry(info, &vp_dev->virtqueues, node) { + if (vring_interrupt(irq, info->vq) == IRQ_HANDLED) + ret = IRQ_HANDLED; + } + spin_unlock_irqrestore(&vp_dev->lock, flags); + + return ret; +} + /* A small wrapper to also acknowledge the interrupt when it's handled. * I really need an EIO hook for the vring so I can ack the interrupt once we * know that we'll be handling the IRQ but before we invoke the callback since @@ -173,9 +204,6 @@ static void vp_notify(struct virtqueue *vq) static irqreturn_t vp_interrupt(int irq, void *opaque) { struct virtio_pci_device *vp_dev = opaque; - struct virtio_pci_vq_info *info; - irqreturn_t ret = IRQ_NONE; - unsigned long flags; u8 isr; /* reading the ISR has the effect of also clearing it so it's very @@ -187,14 +215,11 @@ static irqreturn_t vp_interrupt(int irq, void *opaque) return IRQ_NONE; /* Configuration change? Tell driver if it wants to know. */ - if (isr & VIRTIO_PCI_ISR_CONFIG) { - struct virtio_driver *drv; - drv = container_of(vp_dev->vdev.dev.driver, - struct virtio_driver, driver); + if (isr & VIRTIO_PCI_ISR_CONFIG) + vp_config_changed(irq, opaque); - if (drv && drv->config_changed) - drv->config_changed(&vp_dev->vdev); - } + return vp_vring_interrupt(irq, opaque); +} spin_lock_irqsave(&vp_dev->lock, flags); list_for_each_entry(info, &vp_dev->virtqueues, node) { -- 1.6.3.rc3.1.g830204 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] virtio: find_vqs/del_vqs virtio operations
This replaces find_vq/del_vq with find_vqs/del_vqs virtio operations, and updates all drivers. This is needed for MSI support, because MSI needs to know the total number of vectors upfront. Signed-off-by: Michael S. Tsirkin --- drivers/block/virtio_blk.c | 11 +++--- drivers/char/hw_random/virtio-rng.c | 11 +++--- drivers/char/virtio_console.c | 27 ++ drivers/lguest/lguest_device.c | 49 +- drivers/net/virtio_net.c| 47 +++--- drivers/s390/kvm/kvm_virtio.c | 64 +- drivers/virtio/virtio_balloon.c | 31 - drivers/virtio/virtio_pci.c | 43 +++ include/linux/virtio_config.h | 29 +++- 9 files changed, 222 insertions(+), 90 deletions(-) diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c index 5d34764..7b7435d 100644 --- a/drivers/block/virtio_blk.c +++ b/drivers/block/virtio_blk.c @@ -197,6 +197,7 @@ static int virtblk_probe(struct virtio_device *vdev) u64 cap; u32 v; u32 blk_size, sg_elems; + virtqueue_callback *callback[] = { blk_done }; if (index_to_minor(index) >= 1 << MINORBITS) return -ENOSPC; @@ -224,11 +225,9 @@ static int virtblk_probe(struct virtio_device *vdev) sg_init_table(vblk->sg, vblk->sg_elems); /* We expect one virtqueue, for output. */ - vblk->vq = vdev->config->find_vq(vdev, 0, blk_done); - if (IS_ERR(vblk->vq)) { - err = PTR_ERR(vblk->vq); + err = vdev->config->find_vqs(vdev, 1, &vblk->vq, callback); + if (err) goto out_free_vblk; - } vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req)); if (!vblk->pool) { @@ -323,7 +322,7 @@ out_put_disk: out_mempool: mempool_destroy(vblk->pool); out_free_vq: - vdev->config->del_vq(vblk->vq); + vdev->config->del_vqs(vdev); out_free_vblk: kfree(vblk); out: @@ -344,7 +343,7 @@ static void virtblk_remove(struct virtio_device *vdev) blk_cleanup_queue(vblk->disk->queue); put_disk(vblk->disk); mempool_destroy(vblk->pool); - vdev->config->del_vq(vblk->vq); + vdev->config->del_vqs(vdev); kfree(vblk); } diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c index 86e83f8..18eabe4 100644 --- a/drivers/char/hw_random/virtio-rng.c +++ b/drivers/char/hw_random/virtio-rng.c @@ -91,16 +91,17 @@ static struct hwrng virtio_hwrng = { static int virtrng_probe(struct virtio_device *vdev) { + virtqueue_callback *callback[] = { random_recv_done }; int err; /* We expect a single virtqueue. */ - vq = vdev->config->find_vq(vdev, 0, random_recv_done); - if (IS_ERR(vq)) - return PTR_ERR(vq); + err = vdev->config->find_vqs(vdev, 1, &vq, callback); + if (err) + return err; err = hwrng_register(&virtio_hwrng); if (err) { - vdev->config->del_vq(vq); + vdev->config->del_vqs(vdev); return err; } @@ -112,7 +113,7 @@ static void virtrng_remove(struct virtio_device *vdev) { vdev->config->reset(vdev); hwrng_unregister(&virtio_hwrng); - vdev->config->del_vq(vq); + vdev->config->del_vqs(vdev); } static struct virtio_device_id id_table[] = { diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index ff6f5a4..1fd5376 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -188,6 +188,8 @@ static void hvc_handle_input(struct virtqueue *vq) * Finally we put our input buffer in the input queue, ready to receive. */ static int __devinit virtcons_probe(struct virtio_device *dev) { + struct virtqueue *vqs[2]; + virtqueue_callback *callbacks[2]; int err; vdev = dev; @@ -199,20 +201,17 @@ static int __devinit virtcons_probe(struct virtio_device *dev) goto fail; } - /* Find the input queue. */ + /* Find the queues. */ /* FIXME: This is why we want to wean off hvc: we do nothing * when input comes in. */ - in_vq = vdev->config->find_vq(vdev, 0, hvc_handle_input); - if (IS_ERR(in_vq)) { - err = PTR_ERR(in_vq); + callbacks[0] = hvc_handle_input; + callbacks[1] = NULL; + err = vdev->config->find_vqs(vdev, 2, vqs, callbacks); + if (err) goto free; - } - out_vq = vdev->config->find_vq(vdev, 1, NULL); - if (IS_ERR(out_vq)) { - err = PTR_ERR(out_vq); - goto free_in_vq; - } + in_vq = vqs[0]; + out_vq = vqs[1]; /* Start using the new console output. */ virtio_cons.get_chars = get_chars; @@ -233,17 +232,15 @@ static int __devinit virtcons_probe(
[PATCHv2 0/3] virtio: add guest MSI-X support
Add optional MSI-X support: use a vector per virtqueue with fallback to a common vector and finally to regular interrupt. Teach all drivers to use it. Signed-off-by: Michael S. Tsirkin --- Here's a draft set of patches for MSI-X support in the guest. It still needs to be tested properly, and performance impact measured, but I thought I'd share it here in the hope of getting some very early feedback/flames. Changelog since v1: - Per Avi's suggestion, let guest configure virtqueue to vector mapping - Per Rusty's suggestion, replace API with find_vqs/del_vqs. Michael S. Tsirkin (3): virtio: find_vqs/del_vqs virtio operations virtio_pci: split up vp_interrupt virtio_pci: optional MSI-X support drivers/block/virtio_blk.c | 11 +- drivers/char/hw_random/virtio-rng.c | 11 +- drivers/char/virtio_console.c | 27 ++-- drivers/lguest/lguest_device.c | 49 ++- drivers/net/virtio_net.c| 47 +++ drivers/s390/kvm/kvm_virtio.c | 64 - drivers/virtio/virtio_balloon.c | 31 ++-- drivers/virtio/virtio_pci.c | 283 ++- include/linux/virtio_config.h | 29 +++- include/linux/virtio_pci.h |8 +- 10 files changed, 440 insertions(+), 120 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
Marcelo Tosatti wrote: > On Thu, May 07, 2009 at 12:48:21PM +0300, Avi Kivity wrote: > >> Davide Libenzi wrote: >> >>> On Wed, 6 May 2009, Gregory Haskins wrote: >>> >>> >>> I think we are ok in this regard (at least in v5) without the callback. kvm holds irqfd, which holds eventfd. In a normal situation, we will have eventfd with 2 references. If userspace closes the eventfd, it will drop 1 of the 2 eventfd file references, but the object should remain intact as long as kvm still holds it as well. When the kvm-fd is released, we will then decouple from the eventfd->wqh and drop the last fput(), officially freeing it. Likewise, if kvm is closed before the eventfd, we will simply decouple from the wqh and fput(eventfd), leaving the last reference held by userspace until it closes as well. Let me know if you see any holes in that. >>> Looks OK, modulo my knowledge of KVM internals. >>> >>> >> What's your take on adding irq context safe callbacks to irqfd? >> >> To give some background here, we would like to use eventfd as a generic >> connector between components, so the components do not know about each >> other. So far eventfd successfully abstracts among components in the >> same process, in different processes, and in the kernel. >> >> eventfd_signal() can be safely called from irq context, and will wake up >> a waiting task. But in some cases, if the consumer is in the kernel, it >> may be able to consume the event from irq context, saving a context >> switch. >> >> So, will you consider patches adding this capability to eventfd? >> > > (pasting from a separate thread) > > >> That's my thinking. PCI interrupts don't work because we need to do >> some hacky stuff in there, but MSI should. Oh, and we could improve >> UIO >> support for interrupts when using MSI, since there's no need to >> acknowledge the interrupt. >> > > Ok, so for INTx assigned devices all you need to do on the ACK handler > is to re-enable the host interrupt (and set the guest interrupt line to > low). > > Right now the ack comes through a kvm internal irq ack callback. > > AFAICS there is no mechanism in irqfd for ACK notification, and > interrupt injection is edge triggered. > > So for PCI INTx assigned devices (or any INTx level), you'd want to keep > the guest interrupt high, with some way to notify the ACK. > > Avi mentioned a separate irqfd to notify the ACK. For assigned devices, > you could register a fd wakeup function in that fd, which replaces the > current irq ACK callback? > One thing I was thinking here was that I could create a flag for the kvm_irqfd() function for something like "KVM_IRQFD_MODE_CLEAR". This flag when specified at creation time will cause the event to execute a clear operation instead of a set when triggered. That way, the default mode is an edge-triggered set. The non-default mode is to trigger a clear. Level-triggered ints could therefore create two irqfds, one for raising, the other for clearing. An alternative is to abandon the use of eventfd, and allow the irqfd to be a first-class anon-fd. The parameters passed to the write/signal() function could then indicate the desired level. The disadvantage would be that it would not be compatible with eventfd, so we would need to decide if the tradeoff is worth it. OTOH, I suspect level triggered interrupts will be primarily in the legacy domain, so perhaps we do not need to worry about it too much. Therefore, another option is that we *could* simply set the stake in the ground that legacy/level cannot use irqfd. Thoughts? -Greg signature.asc Description: OpenPGP digital signature
Re: [KVM PATCH v4 2/2] kvm: add support for irqfd via eventfd-notification interface
On Thu, May 07, 2009 at 12:48:21PM +0300, Avi Kivity wrote: > Davide Libenzi wrote: >> On Wed, 6 May 2009, Gregory Haskins wrote: >> >> >>> I think we are ok in this regard (at least in v5) without the >>> callback. kvm holds irqfd, which holds eventfd. In a normal >>> situation, we will >>> have eventfd with 2 references. If userspace closes the eventfd, it >>> will drop 1 of the 2 eventfd file references, but the object should >>> remain intact as long as kvm still holds it as well. When the kvm-fd is >>> released, we will then decouple from the eventfd->wqh and drop the last >>> fput(), officially freeing it. >>> >>> Likewise, if kvm is closed before the eventfd, we will simply decouple >>> from the wqh and fput(eventfd), leaving the last reference held by >>> userspace until it closes as well. >>> >>> Let me know if you see any holes in that. >>> >> >> Looks OK, modulo my knowledge of KVM internals. >> > > What's your take on adding irq context safe callbacks to irqfd? > > To give some background here, we would like to use eventfd as a generic > connector between components, so the components do not know about each > other. So far eventfd successfully abstracts among components in the > same process, in different processes, and in the kernel. > > eventfd_signal() can be safely called from irq context, and will wake up > a waiting task. But in some cases, if the consumer is in the kernel, it > may be able to consume the event from irq context, saving a context > switch. > > So, will you consider patches adding this capability to eventfd? (pasting from a separate thread) > That's my thinking. PCI interrupts don't work because we need to do > some hacky stuff in there, but MSI should. Oh, and we could improve > UIO > support for interrupts when using MSI, since there's no need to > acknowledge the interrupt. Ok, so for INTx assigned devices all you need to do on the ACK handler is to re-enable the host interrupt (and set the guest interrupt line to low). Right now the ack comes through a kvm internal irq ack callback. AFAICS there is no mechanism in irqfd for ACK notification, and interrupt injection is edge triggered. So for PCI INTx assigned devices (or any INTx level), you'd want to keep the guest interrupt high, with some way to notify the ACK. Avi mentioned a separate irqfd to notify the ACK. For assigned devices, you could register a fd wakeup function in that fd, which replaces the current irq ACK callback? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Jan Kiszka wrote: I'm not asking for full gdbstub support at this point, just for a build workaround at arch-level (or in arch-specific blocks, in gdbstub.c e.g.) instead of patching generic code with global risk of regressions. Oh, I see. I already submitted that to Avi, making gdb_set_stop_cpu() a noop on ia64. Since there's no target-ia64/ directory in mainline qemu.git yet, that fix will have to live in qemu-kvm for the time being. I am fine if you wish to remove all the #ifdef CONFIG_GDBSTUB bits. Cheers, Jes -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/4] qemu: external module: smp_send_reschedule compat
mtosa...@redhat.com wrote: smp_send_reschedule was exported (via smp_ops) in v2.6.24. Create a compat function which schedules the IPI to keventd context, in case interrupts are disabled, for kernels < 2.6.24. #endif +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,28) +#define nr_cpu_ids NR_CPUS +#endif + Doesn't seem related? Please send as a separate patch. diff --git a/kvm/kernel/x86/hack-module.awk b/kvm/kernel/x86/hack-module.awk index 260eeef..f4d14da 100644 --- a/kvm/kernel/x86/hack-module.awk +++ b/kvm/kernel/x86/hack-module.awk @@ -35,6 +35,8 @@ BEGIN { split("INIT_WORK desc_struct ldttss_desc64 desc_ptr " \ { sub(/match->dev->msi_enabled/, "kvm_pcidev_msi_enabled(match->dev)") } +{ sub(/smp_send_reschedule/, "kvm_smp_send_reschedule") } + /^static void __vmx_load_host_state/ { vmx_load_host_state = 1 } There's a bit on the top of hack-module.awk that does this slightly simpler. Please also adjust ia64. This now resides in kvm-kmod.git, please patch against that. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/4] use smp_send_reschedule in vcpu_kick / assigned dev host intx race fix
Sheng Yang wrote: Is there any issue blocked this patchset? Yes, a slow maintainer. I'll go review it now. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Jes Sorensen wrote: > Jan Kiszka wrote: >> As the original change this patch (correctly) fixes broke gdbstub >> support in qemu-kvm for all archs + it is still a deviation from >> upstream, please take the chance and go for ia64 stubs. > > Jan, > > There's far more to do in upstream than just gdbstubs for ia64 to be > usable :-( > > We'll get to gdbstubs at some point, but there's too many other higher > priority items to deal with first. I'm not asking for full gdbstub support at this point, just for a build workaround at arch-level (or in arch-specific blocks, in gdbstub.c e.g.) instead of patching generic code with global risk of regressions. Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Jan Kiszka wrote: As the original change this patch (correctly) fixes broke gdbstub support in qemu-kvm for all archs + it is still a deviation from upstream, please take the chance and go for ia64 stubs. Jan, There's far more to do in upstream than just gdbstubs for ia64 to be usable :-( We'll get to gdbstubs at some point, but there's too many other higher priority items to deal with first. Cheers, Jes -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Jes Sorensen wrote: > Avi Kivity wrote: >> I agree, unless >> >> - we want to make gdbstub support configurable (don't see any >> overwhelming reason for this, but maybe others do) >> - we want to merge ia64 kvm support upstream, and don't want to impose >> gdbstub support (though I'd recommend properly implementing gdbstub) >> >> In any case, I'm okay with dropping the check upstream and applying >> the local fixup. > > Hi, > > Here's a patch that fixes the #ifndef to make it the #ifdef as it was > intended. > > I am quite fine with us trying to drop all the #ifdefs and introduce > noop wrappers for archs that do not provide the gdbstubs (ie. ia64), but > to start with we better just fix the #ifndef to make it behave like it > was originally intended. As the original change this patch (correctly) fixes broke gdbstub support in qemu-kvm for all archs + it is still a deviation from upstream, please take the chance and go for ia64 stubs. BTW, I'm planning to submit a cleanup patch for CONFIG_GDBSTUB to upstream soon. Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix qemu-kvm to build when gdbstub is disabled
Avi Kivity wrote: I agree, unless - we want to make gdbstub support configurable (don't see any overwhelming reason for this, but maybe others do) - we want to merge ia64 kvm support upstream, and don't want to impose gdbstub support (though I'd recommend properly implementing gdbstub) In any case, I'm okay with dropping the check upstream and applying the local fixup. Hi, Here's a patch that fixes the #ifndef to make it the #ifdef as it was intended. I am quite fine with us trying to drop all the #ifdefs and introduce noop wrappers for archs that do not provide the gdbstubs (ie. ia64), but to start with we better just fix the #ifndef to make it behave like it was originally intended. Cheers, Jes Fix incorrect #ifndef for CONFIG_GETSTUB, which should have been an #ifdef. Signed-off-by: Jes Sorensen --- vl.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: qemu/vl.c === --- qemu.orig/vl.c +++ qemu/vl.c @@ -4350,7 +4350,7 @@ } if (cpu_can_run(env)) ret = qemu_cpu_exec(env); -#ifndef CONFIG_GDBSTUB +#ifdef CONFIG_GDBSTUB if (ret == EXCP_DEBUG) { gdb_set_stop_cpu(env); debug_requested = 1;
Re: [PATCH] KVM: Fix assigned device with no irq
On Thursday 07 May 2009 20:28:12 Marcelo Tosatti wrote: > On Thu, May 07, 2009 at 03:24:15PM +0300, Avi Kivity wrote: > > Sheng Yang wrote: > >> Some device like VF of SRIOV only support MSI-X. > >> > >> With this patch, SRIOV can be enabled with KVM assigned device(all > >> kernel space patches are ready). > > > > Just noticed I missed this patch. Is it still needed or is it obsolete? > > I think its obsolete. Yeah, it have been fixed from another direct in userspace: Interrupt pin =0 means it haven't been initialized. -- regards Yang, Sheng > > >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > >> index 4d2be16..ce80f3a 100644 > >> --- a/virt/kvm/kvm_main.c > >> +++ b/virt/kvm/kvm_main.c > >> @@ -217,8 +217,11 @@ static void kvm_free_assigned_irq(struct kvm *kvm, > >>kvm_free_irq_source_id(kvm, assigned_dev->irq_source_id); > >>assigned_dev->irq_source_id = -1; > >> - if (!assigned_dev->irq_requested_type) > >> + if (!assigned_dev->irq_requested_type) { > >> + if (assigned_dev->dev->msi_enabled) > >> + pci_disable_msi(assigned_dev->dev); > >>return; > >> + } > >>/* > >> * In kvm_free_device_irq, cancel_work_sync return true if: > >> @@ -308,6 +311,10 @@ static int assigned_device_update_intx(struct kvm > >> *kvm, if (adev->irq_requested_type & KVM_ASSIGNED_DEV_HOST_INTX) > >>return 0; > >> + /* IRQ 0 means uninitialized here */ > >> + if (airq->host_irq == 0) > >> + return 0; > >> + > >>if (irqchip_in_kernel(kvm)) { > >>if (!msi2intx && > >>(adev->irq_requested_type & KVM_ASSIGNED_DEV_HOST_MSI)) { > >> @@ -529,9 +536,8 @@ static int kvm_vm_ioctl_assign_irq(struct kvm *kvm, > >>r = 0; > >>} else { > >>printk(KERN_WARNING > >> - "kvm: failed to enable MSI device!\n"); > >> - r = -ENOTTY; > >> - goto out_release; > >> + "kvm: device didn't support INTx or MSI!\n"); > >> + r = 0; > >>} > >>} else { > >>/* Non-sharing INTx mode */ > > > > -- > > error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: Chris Wright wrote: * Gregory Haskins (ghask...@novell.com) wrote: Chris Wright wrote: But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count) means hypercall number and arg list must be the same in order for code to call hypercall() in a hypervisor agnostic way. Yes, and that is exactly the intention. I think its perhaps the point you are missing. Yes, I was reading this as purely any hypercall, but it seems a bit more like: pv_io_ops->iomap() pv_io_ops->ioread() pv_io_ops->iowrite() Right. Hmm, reminds me of something I thought of a while back. We could implement an 'mmio hypercall' that does mmio reads/writes via a hypercall instead of an mmio operation. That will speed up mmio for emulated devices (say, e1000). It's easy to hook into Linux (readl/writel), is pci-friendly, non-x86 friendly, etc. It also makes the device work when hypercall support is not available (qemu/tcg); you simply fall back on mmio. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: Fix assigned device with no irq
On Thu, May 07, 2009 at 03:24:15PM +0300, Avi Kivity wrote: > Sheng Yang wrote: >> Some device like VF of SRIOV only support MSI-X. >> >> With this patch, SRIOV can be enabled with KVM assigned device(all kernel >> space >> patches are ready). >> >> > > Just noticed I missed this patch. Is it still needed or is it obsolete? I think its obsolete. >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c >> index 4d2be16..ce80f3a 100644 >> --- a/virt/kvm/kvm_main.c >> +++ b/virt/kvm/kvm_main.c >> @@ -217,8 +217,11 @@ static void kvm_free_assigned_irq(struct kvm *kvm, >> kvm_free_irq_source_id(kvm, assigned_dev->irq_source_id); >> assigned_dev->irq_source_id = -1; >> - if (!assigned_dev->irq_requested_type) >> +if (!assigned_dev->irq_requested_type) { >> +if (assigned_dev->dev->msi_enabled) >> +pci_disable_msi(assigned_dev->dev); >> return; >> +} >> /* >> * In kvm_free_device_irq, cancel_work_sync return true if: >> @@ -308,6 +311,10 @@ static int assigned_device_update_intx(struct kvm *kvm, >> if (adev->irq_requested_type & KVM_ASSIGNED_DEV_HOST_INTX) >> return 0; >> + /* IRQ 0 means uninitialized here */ >> +if (airq->host_irq == 0) >> +return 0; >> + >> if (irqchip_in_kernel(kvm)) { >> if (!msi2intx && >> (adev->irq_requested_type & KVM_ASSIGNED_DEV_HOST_MSI)) { >> @@ -529,9 +536,8 @@ static int kvm_vm_ioctl_assign_irq(struct kvm *kvm, >> r = 0; >> } else { >> printk(KERN_WARNING >> - "kvm: failed to enable MSI device!\n"); >> -r = -ENOTTY; >> -goto out_release; >> + "kvm: device didn't support INTx or MSI!\n"); >> +r = 0; >> } >> } else { >> /* Non-sharing INTx mode */ >> > > > -- > error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: Fix assigned device with no irq
Sheng Yang wrote: Some device like VF of SRIOV only support MSI-X. With this patch, SRIOV can be enabled with KVM assigned device(all kernel space patches are ready). Just noticed I missed this patch. Is it still needed or is it obsolete? diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4d2be16..ce80f3a 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -217,8 +217,11 @@ static void kvm_free_assigned_irq(struct kvm *kvm, kvm_free_irq_source_id(kvm, assigned_dev->irq_source_id); assigned_dev->irq_source_id = -1; - if (!assigned_dev->irq_requested_type) + if (!assigned_dev->irq_requested_type) { + if (assigned_dev->dev->msi_enabled) + pci_disable_msi(assigned_dev->dev); return; + } /* * In kvm_free_device_irq, cancel_work_sync return true if: @@ -308,6 +311,10 @@ static int assigned_device_update_intx(struct kvm *kvm, if (adev->irq_requested_type & KVM_ASSIGNED_DEV_HOST_INTX) return 0; + /* IRQ 0 means uninitialized here */ + if (airq->host_irq == 0) + return 0; + if (irqchip_in_kernel(kvm)) { if (!msi2intx && (adev->irq_requested_type & KVM_ASSIGNED_DEV_HOST_MSI)) { @@ -529,9 +536,8 @@ static int kvm_vm_ioctl_assign_irq(struct kvm *kvm, r = 0; } else { printk(KERN_WARNING - "kvm: failed to enable MSI device!\n"); - r = -ENOTTY; - goto out_release; + "kvm: device didn't support INTx or MSI!\n"); + r = 0; } } else { /* Non-sharing INTx mode */ -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] kvm: bios: Use a different mask to size the option ROM BAR
Alex Williamson wrote: Bit 0 is the enable bit, which we not only don't want to set, but it will stick and make us think it's an I/O port resource. Signed-off-by: Alex Williamson --- kvm/bios/rombios32.c |8 +--- 1 files changed, 5 insertions(+), 3 deletions(-) diff --git a/kvm/bios/rombios32.c b/kvm/bios/rombios32.c index 8684987..6502e63 100755 --- a/kvm/bios/rombios32.c +++ b/kvm/bios/rombios32.c @@ -958,11 +958,13 @@ static void pci_bios_init_device(PCIDevice *d) int ofs; uint32_t val, size ; -if (i == PCI_ROM_SLOT) +if (i == PCI_ROM_SLOT) { ofs = 0x30; -else +pci_config_writel(d, ofs, 0xfffe); +} else { ofs = 0x10 + i * 4; -pci_config_writel(d, ofs, 0x); +pci_config_writel(d, ofs, 0x); +} val = pci_config_readl(d, ofs); if (val != 0) { size = (~(val & ~0xf)) + 1; Looks good. I think it applies upstream (or even upstream's upstream) though, in which case please send it qemu-devel. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] msi-x: let drivers retry when not enough vectors
On Thu, May 07, 2009 at 01:44:38PM +0300, Avi Kivity wrote: > I imagine this loop is present in many drivers. So why not add a helper Let's look! arch/x86/kernel/amd_iommu_init.c : Needs an exact number of vectors. drivers/block/cciss.c : If it doesn't get all 4 vectors, falls back to pin mode (instead of MSI mode -- bug!) drivers/dma/ioat_dma.c : Falls back down to 1 vector if it can't get nvec drivers/infiniband/hw/mthca/mthca_main.c : Reverts to MSI if it can't get 3. drivers/net/benet/be_main.c : Falls back to MSI if it can't get 2. drivers/net/bnx2.c : Falls back to MSI if it can't get 9. drivers/net/bnx2x_main.c : Falls back to MSI if it can't get N. drivers/net/e1000e/netdev.c: Falls back to MSI if it can't get N. drivers/net/enic/enic_main.c: Falls back to MSI if it can't get N. drivers/net/forcedeth.c: Falls back to MSI if it can't get N. drivers/net/igb/igb_main.c: Falls back to MSI if it can't get N. drivers/net/igbvf/netdev.c: Falls back to MSI if it can't get 3. drivers/net/myri10ge/myri10ge.c: Falls back to Pin if it can't get N. drivers/net/netxen/netxen_nic_main.c: Falls back to MSI if it can't get N. drivers/net/qlge/qlge_main.c: Falls back to MSI if it can't get N. drivers/net/s2io.c: Falls back to MSI if it can't get N. drivers/net/vxge/vxge-main.c: Falls back once to a second number. drivers/pci/pcie/portdrv_core.c: Falls back to MSI if it can't get all of them. drivers/scsi/lpfc/lpfc_init.c: Falls back to MSI if it can't get N. drivers/scsi/mpt2sas/mpt2sas_base.c: Only allocates 1. drivers/net/cxgb3/cxgb3_main.c: Actually falls back ... in a bit of a weird way. This one could definitely do with the proposed loop. drivers/net/ixgbe/ixgbe_main.c: Could also be improved with this loop. drivers/net/mlx4/main.c: Nasty goto-based loop. drivers/net/niu.c: Ditto drivers/net/sfc/efx.c: Only falls back once. Would benefit from loop. drivers/scsi/qla2xxx/qla_isr.c: Only falls back once. Would benefit from loop. drivers/staging/sxg/sxg.c: /*Should try with less vector returned.*/ so ... 7 drivers would benefit from this loop. 20 seem like they wouldn't. What a lot of drivers seem to do is fall back either to a single MSI or just pin-based interrupts when they can't get as many interrupts as they would like. They don't try to get a single MSI-X interrupt. I feel an update to the MSI-HOWTO coming on. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] fix irq routing by moving it to the correct place
Avi Kivity wrote: Jes Sorensen wrote: Hi, Some freak :-) put x86 specific irq routing into the generic code path, which obviously won't work on non-x86 systems. This patch creates kvm_arch_init_irq_routing() and moves the x86 gibberish to it's correct location :-) Applied, thanks. Argh, that didn't build on x86. I fixed it up. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html