Re: [PATCH] virtio_pci: properly clean up MSI-X state when initialization fails
Hi Michael,

On Mon, Sep 15, 2014 at 1:32 AM, Michael S. Tsirkin wrote:
> On Sun, Sep 14, 2014 at 08:23:26PM -0700, Anthony Liguori wrote:
>> From: Anthony Liguori
>>
>> If MSI-X initialization fails after setting msix_enabled = 1, then
>> the device is left in an inconsistent state.  This would normally
>> only happen if there was a bug in the device emulation but it still
>> should be handled correctly.
>
> This might happen if host runs out of resources when trying
> to map VQs to vectors, so doesn't have to be a bug.
>
> But I don't see what the problem is:
> msix_used_vectors reflects the number of used vectors
> and msix_enabled is set, thus vp_free_vectors
> will free all IRQs and then disable MSIX.
>
> Where is the inconsistency you speak about?

I missed the fact that vp_free_vectors() conditionally sets
msix_enabled=0.  It seems a bit kludgy, especially since it is called
both before and after setting msix_enabled=1.

I ran into a number of weird problems due to read/write reordering
that was causing this code path to fail.  The impact was
non-deterministic.  I'll go back and try to better understand what
code path is causing it.
>> Cc: Matt Wilson
>> Cc: Rusty Russell
>> Cc: Michael Tsirkin
>> Signed-off-by: Anthony Liguori
>> ---
>>  drivers/virtio/virtio_pci.c | 8 ++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
>> index 9cbac33..3d2c2a5 100644
>> --- a/drivers/virtio/virtio_pci.c
>> +++ b/drivers/virtio/virtio_pci.c
>> @@ -357,7 +357,7 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
>>  		v = ioread16(vp_dev->ioaddr + VIRTIO_MSI_CONFIG_VECTOR);
>>  		if (v == VIRTIO_MSI_NO_VECTOR) {
>>  			err = -EBUSY;
>> -			goto error;
>> +			goto error_msix_used;
>>  		}
>>
>>  		if (!per_vq_vectors) {
>> @@ -369,11 +369,15 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
>>  				  vp_vring_interrupt, 0, vp_dev->msix_names[v],
>>  				  vp_dev);
>>  			if (err)
>> -				goto error;
>> +				goto error_msix_used;
>>  			++vp_dev->msix_used_vectors;
>>  	}
>>  	return 0;
>> +error_msix_used:
>> +	v = --vp_dev->msix_used_vectors;
>> +	free_irq(vp_dev->msix_entries[v].vector, vp_dev);
>>  error:
>> +	vp_dev->msix_enabled = 0;
>
> As far as I can see, if you do this, guest will not call
> pci_disable_msix thus leaving the device with MSIX enabled.

I don't understand this comment.  How is the work done in this path any
different from what's done in vp_free_vectors()?

Regards,

Anthony Liguori

> I'm not sure this won't break drivers if they then
> try to use the device without MSIX, and it
> definitely seems less elegant than reverting the
> device to the original state.
>
>>  	vp_free_vectors(vdev);
>>  	return err;
>>  }
>> --
>> 1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH] xen-gnttab: do not add m2p override entries for blkback mappings
From: Anthony Liguori

Commit 5dc03639 switched blkback to also add m2p override entries when
mapping grant pages but history seems to have forgotten why this is
useful if it ever was.

The blkback driver does not need m2p override entries to exist and
there is significant overhead due to the locking in the m2p override
table.  We see about a 10% improvement in IOP rate with this patch
applied running FIO in the guest.

See http://article.gmane.org/gmane.linux.kernel/1590932 for a full
analysis of current users.

Signed-off-by: Anthony Liguori
---
A backported version of this has been heavily tested but the testing
against the latest Linux tree is light so far.
---
 drivers/block/xen-blkback/blkback.c | 12 -
 drivers/xen/gntdev.c                |  4 +--
 drivers/xen/grant-table.c           | 49 ---
 include/xen/grant_table.h           | 14 --
 4 files changed, 59 insertions(+), 20 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index bf4b9d2..53ea90e 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -286,7 +286,7 @@ static void free_persistent_gnts(struct xen_blkif *blkif, struct rb_root *root,
 		if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST ||
 		    !rb_next(&persistent_gnt->node)) {
 			ret = gnttab_unmap_refs(unmap, NULL, pages,
-						segs_to_unmap);
+						segs_to_unmap, 0);
 			BUG_ON(ret);
 			put_free_pages(blkif, pages, segs_to_unmap);
 			segs_to_unmap = 0;
@@ -322,7 +322,7 @@ static void unmap_purged_grants(struct work_struct *work)
 		if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
 			ret = gnttab_unmap_refs(unmap, NULL, pages,
-						segs_to_unmap);
+						segs_to_unmap, 0);
 			BUG_ON(ret);
 			put_free_pages(blkif, pages, segs_to_unmap);
 			segs_to_unmap = 0;
@@ -330,7 +330,7 @@ static void unmap_purged_grants(struct work_struct *work)
 		kfree(persistent_gnt);
 	}
 	if (segs_to_unmap > 0) {
-		ret = gnttab_unmap_refs(unmap, NULL, pages, segs_to_unmap);
+		ret = gnttab_unmap_refs(unmap, NULL, pages, segs_to_unmap, 0);
 		BUG_ON(ret);
 		put_free_pages(blkif, pages, segs_to_unmap);
 	}
@@ -671,14 +671,14 @@ static void xen_blkbk_unmap(struct xen_blkif *blkif,
 		pages[i]->handle = BLKBACK_INVALID_HANDLE;
 		if (++invcount == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
 			ret = gnttab_unmap_refs(unmap, NULL, unmap_pages,
-						invcount);
+						invcount, 0);
 			BUG_ON(ret);
 			put_free_pages(blkif, unmap_pages, invcount);
 			invcount = 0;
 		}
 	}
 	if (invcount) {
-		ret = gnttab_unmap_refs(unmap, NULL, unmap_pages, invcount);
+		ret = gnttab_unmap_refs(unmap, NULL, unmap_pages, invcount, 0);
 		BUG_ON(ret);
 		put_free_pages(blkif, unmap_pages, invcount);
 	}
@@ -740,7 +740,7 @@ again:
 	}
 	if (segs_to_map) {
-		ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map);
+		ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map, 0);
 		BUG_ON(ret);
 	}

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index e41c79c..9ab1792 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -285,7 +285,7 @@ static int map_grant_pages(struct grant_map *map)
 	pr_debug("map %d+%d\n", map->index, map->count);
 	err = gnttab_map_refs(map->map_ops,
 			      use_ptemod ? map->kmap_ops : NULL,
-			      map->pages, map->count);
+			      map->pages, map->count, 1);
 	if (err)
 		return err;
@@ -317,7 +317,7 @@ static int __unmap_grant_pages(struct grant_map *map, int offset, int pages)
 	err = gnttab_unmap_refs(map->unmap_ops + offset,
 				use_ptemod ? map->kmap_ops + offset : NULL,
 				map->pages + offset,
-				pages);
+				pages, 1);
 	if (err)
 		return err;

diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index c4d2298..081be8d 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -881,7 +881,8 @@ EXPORT_SYMBOL_GPL(gnttab_batch_copy);
 int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
 		    struct gnttab_map_gran
Re: [Xen-devel] [PATCH] grant-table: don't set m2p override if kmap_ops is not set
Ian Campbell writes:

> On Tue, 2013-11-05 at 12:53 -0800, Anthony Liguori wrote:
>> >> Yes, you are completely right, then I have to figure out why blkback
>> >> works fine with this patch applied (or at least it seems to work fine).
>> >
>> > blkback also works for me when testing a similar patch.  I'm still
>> > confused.  One thing with your proposed patch: I'm not sure that you're
>> > putting back the correct mfn.
>
> As long as it is unique and owned by the local domain I guess it doesn't
> matter which mfn goes back?  It's going to get given back to the generic
> allocator so the content is irrelevant?  (all speculation without looking
> at blkback)

Hrm, so m2p_add_override already does
set_phys_to_machine(pfn, FOREIGN_FRAME(mfn)) so this patch doesn't add
anything that isn't already there.  It's just removing the addition to
the m2p override table.

So what happens when the MFN isn't in the m2p override table?  It's
treated as a foreign PFN and m2p lookups fail.  But why is this a
problem for blkback?

>> It's perfectly fine to store a foreign pfn in the m2p table.
>
> No, it's not.  The m2p is host global and maps mfns to the pfns in the
> owning domain.  A domain which grant maps a foreign mfn into its address
> space doesn't get to update the m2p for that mfn.  Perhaps you were
> thinking of the p2m?

Yes, I typoed that bit.

> Depending on how an OS mapping the foreign MFN does things it may well
> be accurate to say that having a foreign pfn in the m2p table is
> harmless, in as much as it will never try and use the m2p for foreign
> pages for anything real, but it depends on the implementation of the
> particular OS.

Note that this patch isn't doing anything to the m2p table.

>> The m2p
>> override table is used by the grant device to allow a reverse lookup of
>> the real mfn to a pfn even if it's foreign.
>
> Classic Xen used an array of struct page * overrides in the struct vma
> for this sort of use case (e.g. in blktap).  That obviously cannot fly
> for pvops...

I don't think the usage from blkback precludes using the gntdev either.
It just so happens that m2p_add_override does extra work but I don't
think it actually benefits blkback in any way.  Am I missing something?

Regards,

Anthony Liguori

>
> Ian.
Re: [Xen-devel] [PATCH] grant-table: don't set m2p override if kmap_ops is not set
On Tue, Nov 5, 2013 at 1:16 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Nov 05, 2013 at 12:53:17PM -0800, Anthony Liguori wrote:
>> Matt Wilson writes:
>>
>> > On Tue, Nov 05, 2013 at 05:03:58PM +0100, Roger Pau Monné wrote:
>> >> On 05/11/13 16:08, Ian Campbell wrote:
>> >> > On Tue, 2013-11-05 at 16:01 +0100, Roger Pau Monné wrote:
>> >> >> On 05/11/13 15:56, Konrad Rzeszutek Wilk wrote:
>> >> >>> On Tue, Nov 05, 2013 at 03:47:08PM +0100, Roger Pau Monné wrote:
>> >> >>>> On 05/11/13 13:36, David Vrabel wrote:
>> >> >>>>> On 05/11/13 11:24, Roger Pau Monne wrote:
>> >> >>>>>> IMHO there's no reason to set a m2p override if the mapping is
>> >> >>>>>> done in kernel space, so only set the m2p override when
>> >> >>>>>> kmap_ops is set.
>> >> >>>>>
>> >> >>>>> Can you provide a more detailed reasoning about why this is safe?
>> >> >>>>
>> >> >>>> To tell the truth, I don't understand why we need to use the m2p
>> >> >>>> override for kernel space only mappings, my understanding is that
>> >> >>>> this m2p override is needed for user space mappings only (where we
>> >> >>>> actually end up doing two mappings, one in kernel space and one in
>> >> >>>> user space).  For kernel space I don't see why we need to do
>> >> >>>> anything else than setting the right p2m translation.
>> >> >>>
>> >> >>> We needed the m2p when doing DMA operations.  As the driver would
>> >> >>> want the bus address (so p2m) and then when unmapping the DMA we
>> >> >>> only get the bus address - so we needed to do a m2p lookup.
>> >> >>
>> >> >> OK, we need a m2p (that we already have in machine_to_phys_mapping),
>> >> >> what I don't understand is why we need the m2p override.
>> >> >
>> >> > The m2p is a host global table.
>> >> >
>> >> > For a foreign page grant mapped into the current domain the m2p will
>> >> > give you the foreign (owner) domain's p from the m, not the local one.
>> >>
>> >> Yes, you are completely right, then I have to figure out why blkback
>> >> works fine with this patch applied (or at least it seems to work fine).
>> >
>> > blkback also works for me when testing a similar patch.  I'm still
>> > confused.  One thing with your proposed patch: I'm not sure that you're
>> > putting back the correct mfn.
>>
>> It's perfectly fine to store a foreign pfn in the m2p table.  The m2p
>> override table is used by the grant device to allow a reverse lookup of
>> the real mfn to a pfn even if it's foreign.
>>
>> blkback doesn't actually need this though.  This was introduced in:
>>
>> commit 5dc03639cc903f887931831d69895facb5260f4b
>> Author: Konrad Rzeszutek Wilk
>> Date:   Tue Mar 1 16:46:45 2011 -0500
>>
>>     xen/blkback: Utilize the M2P override mechanism for GNTMAP_host_map
>>
>> Purely as an optimization.  In practice though due to lock contention it
>> slows things down.
>>
>> I think an alternative would be to use a read/write lock instead of just
>> a spinlock since it's the read path that is the most hot.
>
> The m2p hash table can also be expanded to lower the contention.

That was the first thing I tried :-)  The issue isn't lookup cost as
much as the number of acquisitions and the cost of adding/removing
entries.

>> I haven't tested that yet though.
>
> Looking forward to your patches :-)

I'm really just being thorough in suggesting the rwlock.  I actually
think that not using the m2p override table is the right long term fix.

Regards,

Anthony Liguori

>>
>> Regards,
>>
>> Anthony Liguori
>>
>> >
>> > Adding Anthony to the thread.
>> >
>> > --msw
Re: [Xen-devel] [PATCH] grant-table: don't set m2p override if kmap_ops is not set
Matt Wilson writes:

> On Tue, Nov 05, 2013 at 05:03:58PM +0100, Roger Pau Monné wrote:
>> On 05/11/13 16:08, Ian Campbell wrote:
>> > On Tue, 2013-11-05 at 16:01 +0100, Roger Pau Monné wrote:
>> >> On 05/11/13 15:56, Konrad Rzeszutek Wilk wrote:
>> >>> On Tue, Nov 05, 2013 at 03:47:08PM +0100, Roger Pau Monné wrote:
>> >>>> On 05/11/13 13:36, David Vrabel wrote:
>> >>>>> On 05/11/13 11:24, Roger Pau Monne wrote:
>> >>>>>> IMHO there's no reason to set a m2p override if the mapping is done in
>> >>>>>> kernel space, so only set the m2p override when kmap_ops is set.
>> >>>>>
>> >>>>> Can you provide a more detailed reasoning about why this is safe?
>> >>>>
>> >>>> To tell the truth, I don't understand why we need to use the m2p
>> >>>> override for kernel space only mappings, my understanding is that this
>> >>>> m2p override is needed for user space mappings only (where we actually
>> >>>> end up doing two mappings, one in kernel space and one in user space).
>> >>>> For kernel space I don't see why we need to do anything else than
>> >>>> setting the right p2m translation.
>> >>>
>> >>> We needed the m2p when doing DMA operations.  As the driver would
>> >>> want the bus address (so p2m) and then when unmapping the DMA we
>> >>> only get the bus address - so we needed to do a m2p lookup.
>> >>
>> >> OK, we need a m2p (that we already have in machine_to_phys_mapping),
>> >> what I don't understand is why we need the m2p override.
>> >
>> > The m2p is a host global table.
>> >
>> > For a foreign page grant mapped into the current domain the m2p will
>> > give you the foreign (owner) domain's p from the m, not the local one.
>>
>> Yes, you are completely right, then I have to figure out why blkback
>> works fine with this patch applied (or at least it seems to work fine).
>
> blkback also works for me when testing a similar patch.  I'm still
> confused.  One thing with your proposed patch: I'm not sure that you're
> putting back the correct mfn.
It's perfectly fine to store a foreign pfn in the m2p table.  The m2p
override table is used by the grant device to allow a reverse lookup of
the real mfn to a pfn even if it's foreign.

blkback doesn't actually need this though.  This was introduced in:

commit 5dc03639cc903f887931831d69895facb5260f4b
Author: Konrad Rzeszutek Wilk
Date:   Tue Mar 1 16:46:45 2011 -0500

    xen/blkback: Utilize the M2P override mechanism for GNTMAP_host_map

Purely as an optimization.  In practice though due to lock contention it
slows things down.

I think an alternative would be to use a read/write lock instead of just
a spinlock since it's the read path that is the most hot.  I haven't
tested that yet though.

Regards,

Anthony Liguori

>
> Adding Anthony to the thread.
>
> --msw
Re: kvmtool tree (Was: Re: [patch] config: fix make kvmconfig)
On 02/13/2013 02:56 AM, Pekka Enberg wrote:
> On Wed, Feb 13, 2013 at 10:23 AM, Paolo Bonzini wrote:
>> On 12/02/2013 10:52, Ingo Molnar wrote:
>>> Check the list I gave (unmodified):
>>>
>>> "- Pekka listed new virtio drivers that were done via tools/kvm.
>>
>> vhost-scsi got in first in tools/kvm, but out-of-tree patches had
>> existed for QEMU for more than a year.  It was developed with QEMU.
>
> I think Ingo confused virtio and vhost.  IIRC, Asias developed
> vhost-blk using tools/kvm.

We've done extensive performance analysis of vhost-blk and it's not any
faster than a userspace solution.  This is why it's still not in
mainline.  This wasn't noticed with tools/kvm because its block layer is
too simplistic.

In order to get to the point where we're able to do this well in
userspace in QEMU took tons of bug fixes to the kernel and added
features (like pread64/pwrite64).  That all happened without QEMU being
in the kernel.

Regards,

Anthony Liguori
Re: [PATCH 0/5] Alter steal time reporting in KVM
Glauber Costa writes:

> Hi,
>
> On 11/27/2012 12:36 AM, Michael Wolf wrote:
>> In the case of where you have a system that is running in a
>> capped or overcommitted environment the user may see steal time
>> being reported in accounting tools such as top or vmstat.  This can
>> cause confusion for the end user.  To ease the confusion this patch set
>> adds the idea of consigned (expected steal) time.  The host will separate
>> the consigned time from the steal time.  The consignment limit passed to
>> the host will be the amount of steal time expected within a fixed period
>> of time.  Any other steal time accruing during that period will show as
>> the traditional steal time.
>
> If you submit this again, please include a version number in your series.
>
> It would also be helpful to include a small changelog about what changed
> between last version and this version, so we could focus on that.
>
> As for the rest, I answered your previous two submissions saying I don't
> agree with the concept.  If you hadn't changed anything, resending it
> won't change my mind.
>
> I could of course, be mistaken or misguided.  But I had also not seen any
> wave of support in favor of this previously, so basically I have no new
> data to make me believe I should see it any differently.
>
> Let's try this again:
>
> * Rik asked you in your last submission how does ppc handle this.  You
>   said, and I quote: "In the case of lpar on POWER systems they simply
>   report steal time and do not alter it in any way.  They do however
>   report how much processor is assigned to the partition and that
>   information is in /proc/ppc64/lparcfg."

This is only helpful for static entitlements.  But if we allow dynamic
entitlements--which is a very useful feature, think buying an online
"upgrade" in a cloud environment--then you need to account for
entitlement loss at the same place where you do the rest of the
accounting: in /proc/stat.

> Now, that is a *way* more sensible thing to do.  Much more.  "Confusing
> users" is something extremely subjective.  This is specially true about
> concepts that are known for quite some time, like steal time.  If you
> out of a sudden change the meaning of this, it is sure to confuse a lot
> more users than it would clarify.

I'll bring you a nice bottle of scotch at the next KVM Forum if you can
find me one user that can accurately describe what steal time is.

The semantics are so incredibly subtle that I have a hard time believing
anyone actually understands what it means today.

Regards,

Anthony Liguori

>> ---
>>
>> Michael Wolf (5):
>>       Alter the amount of steal time reported by the guest.
>>       Expand the steal time msr to also contain the consigned time.
>>       Add the code to send the consigned time from the host to the guest
>>       Add a timer to allow the separation of consigned from steal time.
>>       Add an ioctl to communicate the consign limit to the host.
>>
>>
>>  arch/x86/include/asm/kvm_host.h       | 11 +++
>>  arch/x86/include/asm/kvm_para.h       |  3 +-
>>  arch/x86/include/asm/paravirt.h       |  4 +--
>>  arch/x86/include/asm/paravirt_types.h |  2 +
>>  arch/x86/kernel/kvm.c                 |  8 ++---
>>  arch/x86/kernel/paravirt.c            |  4 +--
>>  arch/x86/kvm/x86.c                    | 50 -
>>  fs/proc/stat.c                        |  9 +-
>>  include/linux/kernel_stat.h           |  2 +
>>  include/linux/kvm_host.h              |  2 +
>>  include/uapi/linux/kvm.h              |  2 +
>>  kernel/sched/core.c                   | 10 ++-
>>  kernel/sched/cputime.c                | 21 +-
>>  kernel/sched/sched.h                  |  2 +
>>  virt/kvm/kvm_main.c                   |  7 +
>>  15 files changed, 120 insertions(+), 17 deletions(-)
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Pv-drivers] [PATCH 0/6] VSOCK for Linux upstreaming
On 11/07/2012 12:58 AM, Gerd Hoffmann wrote:
> On 11/05/12 19:19, Andy King wrote:
>> Hi David,
>>
>>> The big and only question is whether anyone can actually use any of
>>> this stuff without your proprietary bits?
>>
>> Do you mean the VMCI calls?  The VMCI driver is in the process of being
>> upstreamed into the drivers/misc tree.  Greg (cc'd on these patches) is
>> actively reviewing that code and we are addressing feedback.
>>
>> Also, there was some interest from RedHat into using vSockets as a
>> unified interface, routed over a hypervisor-specific transport (virtio
>> or otherwise, although for now VMCI is the only one implemented).
>
> Can you outline how this can be done?  From a quick look over the code
> it seems like vsock has a hard dependency on vmci, is that correct?
>
> When making vsock a generic, reusable kernel service it should be the
> other way around: vsock should provide the core implementation and an
> interface where hypervisor-specific transports (vmci, virtio, xenbus,
> ...) can register themselves.

This was already done in a hypervisor neutral way using virtio:

http://lists.openwall.net/netdev/2008/12/14/8

The concept was Nacked and that led to the abomination of virtio-serial.
If an address family for virtualization is on the table, we should
reconsider AF_VMCHANNEL.

I'd be thrilled to get rid of virtio-serial...

Regards,

Anthony Liguori

> cheers,
>   Gerd
Re: [Qemu-devel] Proposal for virtio standardization.
Rusty Russell writes:

> Hi all,
>
>         I've had several requests for a more formal approach to the
> virtio draft spec, and (after some soul-searching) I'd like to try that.
>
>         The proposal is to use OASIS as the standards body, as it's
> fairly light-weight as these things go.  For me this means paperwork and
> setting up a Working Group and getting the right people involved as
> Voting members starting with the current contributors; for most of you
> it just means a new mailing list, though I'll be cross-posting any
> drafts and major changes here anyway.
>
>         I believe that a documented standard (aka virtio 1.0) will
> increase visibility and adoption in areas outside our normal linux/kvm
> universe.  There's been some of that already, but this is the clearest
> path to accelerate it.  Not the easiest path, but I believe that a solid
> I/O standard is a Good Thing for everyone.
>
>         Yet I also want to decouple new and experimental development
> from the standards effort; running code comes first.  New feature bits
> and new device numbers should be reservable without requiring a full
> spec change.
>
> So the essence of my proposal is:
> 1) I start a Working Group within OASIS where we can aim for virtio spec
>    1.0.
>
> 2) The current spec is textually reordered so the core is clearly
>    bus-independent, with PCI, mmio, etc appendices.
>
> 3) Various clarifications, formalizations and cleanups to the spec text,
>    and possibly elimination of old deprecated features.
>
> 4) The only significant change to the spec is that we use PCI
>    capabilities, so we can have infinite feature bits.
>    (see http://lists.linuxfoundation.org/pipermail/virtualization/2011-December/019198.html)

We discussed this on IRC last night.  I don't think PCI capabilities are
a good mechanism to use...

PCI capabilities are there to organize how the PCI config space is
allocated to allow vendor extensions to co-exist with future PCI
extensions.

But we've never used the PCI config space within virtio-pci.
We do everything in BAR0.  I don't think there's any real advantage of
using the config space vs. a BAR for virtio-pci.

The nice thing about using a BAR is that the region is MMIO or PIO.
This maps really nicely to non-PCI transports too.

But extending the PCI config space (especially dealing with capability
allocation) is pretty gnarly and there isn't an obvious equivalent
outside of PCI.

There are very few devices that we emulate today that make use of
extended PCI device registers outside the platform devices (that have
no BARs).

Regards,

Anthony Liguori

> 5) Changes to the ring layout and other such things are deferred to a
>    future virtio version; whether this is done within OASIS or
>    externally depends on how well this works for the 1.0 release.
>
> Thoughts?
> Rusty.
Re: [PATCH 0/3] virtio-net: inline header support
Rusty Russell writes:

> Anthony Liguori writes:
>> Rusty Russell writes:
>>
>>> "Michael S. Tsirkin" writes:
>>>
>>>> Thinking about Sasha's patches, we can reduce ring usage
>>>> for virtio net small packets dramatically if we put
>>>> virtio net header inline with the data.
>>>> This can be done for free in case guest net stack allocated
>>>> extra head room for the packet, and I don't see
>>>> why would this have any downsides.
>>>
>>> I've been wanting to do this for the longest time... but...
>>>
>>>> Even though with my recent patches qemu
>>>> no longer requires header to be the first s/g element,
>>>> we need a new feature bit to detect this.
>>>> A trivial qemu patch will be sent separately.
>>>
>>> There's a reason I haven't done this.  I really, really dislike "my
>>> implementation isn't broken" feature bits.  We could have an infinite
>>> number of them, for each bug in each device.
>>
>> This is a bug in the specification.
>>
>> The QEMU implementation pre-dates the specification.  All of the actual
>> implementations of virtio relied on the semantics of s/g elements and
>> still do.
>
> lguest fix is pending in my queue.  lkvm and qemu are broken; lkvm isn't
> ever going to be merged, so I'm not sure what its status is?  But I'm
> determined to fix qemu, and hence my torture patch to make sure this
> doesn't creep in again.

There are even more implementations out there and I'd wager they all
rely on framing.

>> What's in the specification really doesn't matter when it doesn't agree
>> with all of the existing implementations.
>>
>> Users use implementations, not specifications.  The specification really
>> ought to be changed here.
>
> I'm sorely tempted, except that we're losing a real optimization because
> of this :(

What optimizations?  What Michael is proposing is still achievable with
a device feature.

Are there other optimizations that can be achieved by changing framing
that we can't achieve with feature bits?
As I mentioned in another note, bad framing decisions can cause
performance issues too...

> The specification has long contained the footnote:
>
>     The current qemu device implementations mistakenly insist that
>     the first descriptor cover the header in these cases exactly, so
>     a cautious driver should arrange it so.

I seem to recall this being a compromise between you and I.  I think I
objected strongly to this back when you first wrote the spec and you
added this to appease me ;-)

Regards,

Anthony Liguori

>
> I'd like to tie this caveat to the PCI capability change, so this note
> will move to the appendix with the old PCI layout.
>
> Cheers,
> Rusty.
Re: [PATCH 0/3] virtio-net: inline header support
Rusty Russell writes:

> "Michael S. Tsirkin" writes:
>
> There's a reason I haven't done this.  I really, really dislike "my
> implementation isn't broken" feature bits.  We could have an infinite
> number of them, for each bug in each device.
>
> So my plan was to tie this assumption to the new PCI layout.  And have a
> stress-testing patch like the one below in the kernel (see my virtio-wip
> branch for stuff like this).  Turn it on at boot with
> "virtio_ring.torture" on the kernel commandline.
>
> BTW, I've fixed lguest, but my kvm here (Ubuntu precise, kvm-qemu 1.0)
> is too old.  Building the latest git now...
>
> Cheers,
> Rusty.
>
> Subject: virtio: CONFIG_VIRTIO_DEVICE_TORTURE
>
> Virtio devices are not supposed to depend on the framing of the
> scatter-gather lists, but various implementations did.  Safeguard this
> in future by adding an option to deliberately create perverse
> descriptors.
>
> Signed-off-by: Rusty Russell

Ignoring framing is really a bad idea.  You want backends to enforce
reasonable framing because guests shouldn't do silly things with
framing.

For instance, with virtio-blk, if you want decent performance, you
absolutely want to avoid bouncing the data.  If you're using O_DIRECT
in the host to submit I/O requests, then it's critical that all of the
s/g elements are aligned to a sector boundary and sized to a sector
boundary.

Yes, QEMU can handle it if that's not the case, but it would be insanely
stupid for a guest not to do this.  This is the sort of thing that ought
to be enforced in the specification because a guest cannot perform well
if it doesn't follow these rules.  A spec isn't terribly useful if the
result is guest drivers that are slow.

There's very little to gain by not enforcing rules around framing and
there's a lot to lose if a guest frames incorrectly.

In the rare case where we want to make a framing change, we should use
feature bits like Michael is proposing.  In this case, we should simply
say that with the feature bit, the vnet header can be in the same
element as the data but not allow the header to be spread across
multiple elements.

Regards,

Anthony Liguori

>
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 8d5bddb..930a4ea 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -5,6 +5,15 @@ config VIRTIO
>  	  bus, such as CONFIG_VIRTIO_PCI, CONFIG_VIRTIO_MMIO, CONFIG_LGUEST,
>  	  CONFIG_RPMSG or CONFIG_S390_GUEST.
>
> +config VIRTIO_DEVICE_TORTURE
> +	bool "Virtio device torture tests"
> +	depends on VIRTIO && DEBUG_KERNEL
> +	help
> +	  This makes the virtio_ring implementation creatively change
> +	  the format of requests to make sure that devices are
> +	  properly implemented.  This will make your virtual machine
> +	  slow *and* unreliable!  Say N.
> +
>  menu "Virtio drivers"
>
>  config VIRTIO_PCI
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index e639584..8893753 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -124,6 +124,149 @@ struct vring_virtqueue
>
>  #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
>
> +#ifdef CONFIG_VIRTIO_DEVICE_TORTURE
> +static bool torture;
> +module_param(torture, bool, 0644);
> +
> +struct torture {
> +	unsigned int orig_out, orig_in;
> +	void *orig_data;
> +	struct scatterlist sg[4];
> +	struct scatterlist orig_sg[];
> +};
> +
> +static size_t tot_len(struct scatterlist sg[], unsigned num)
> +{
> +	size_t len, i;
> +
> +	for (len = 0, i = 0; i < num; i++)
> +		len += sg[i].length;
> +
> +	return len;
> +}
> +
> +static void copy_sg_data(const struct scatterlist *dst, unsigned dnum,
> +			 const struct scatterlist *src, unsigned snum)
> +{
> +	unsigned len;
> +	struct scatterlist s, d;
> +
> +	s = *src;
> +	d = *dst;
> +
> +	while (snum && dnum) {
> +		len = min(s.length, d.length);
> +		memcpy(sg_virt(&d), sg_virt(&s), len);
> +		d.offset += len;
> +		d.length -= len;
> +		s.offset += len;
> +		s.length -= len;
> +		if (!s.length) {
> +			BUG_ON(snum == 0);
> +			src++;
> +			snum--;
> +			s = *src;
> +		}
> +		if (!d.length) {
> +			BUG_ON(dnum == 0);
> +			dst++;
> +			dnum--;
> +
Re: [PATCH 0/3] virtio-net: inline header support
Rusty Russell writes: > "Michael S. Tsirkin" writes: > >> Thinking about Sasha's patches, we can reduce ring usage >> for virtio net small packets dramatically if we put >> virtio net header inline with the data. >> This can be done for free in case guest net stack allocated >> extra head room for the packet, and I don't see >> why would this have any downsides. > > I've been wanting to do this for the longest time... but... > >> Even though with my recent patches qemu >> no longer requires header to be the first s/g element, >> we need a new feature bit to detect this. >> A trivial qemu patch will be sent separately. > > There's a reason I haven't done this. I really, really dislike "my > implemention isn't broken" feature bits. We could have an infinite > number of them, for each bug in each device. This is a bug in the specification. The QEMU implementation pre-dates the specification. All of the actual implementations of virtio relied on the semantics of s/g elements and still do. What's in the specification really doesn't matter when it doesn't agree with all of the existing implementations. Users use implementations, not specifications. The specification really ought to be changed here. Regards, Anthony Liguori -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting
Avi Kivity writes: > On 08/27/2012 02:27 PM, Michael Wolf wrote: >> On Mon, 2012-08-27 at 13:31 -0700, Avi Kivity wrote: >> > On 08/27/2012 01:23 PM, Michael Wolf wrote: >> > > > >> > > > How would a guest know what its entitlement is? >> > > > >> > > > >> > > >> > > Currently the Admin/management tool setting up the guests will put it on >> > > the qemu commandline. From this it is passed via an ioctl to the host. >> > > The guest will get the value from the host via a hypercall. >> > > >> > > In the future the host could try and do some of it automatically in some >> > > cases. >> > >> > Seems to me it's a meaningless value for the guest. Suppose it is >> > migrated to a host that is more powerful, and as a result its relative >> > entitlement is reduced. The value needs to be adjusted. >> >> This is why I chose to manage the value from the sysctl interface rather >> than just have it stored as a value in /proc. Whatever tool was used to >> migrate the vm could hopefully adjust the sysctl value on the guest. > > We usually try to avoid this type of coupling. What if the guest is > rebooting while this is happening? What if it's not running Linux at > all? The guest shouldn't need to know it's entitlement. Or at least, it's up to a management tool to report that in a way that's meaningful for the guest. For instance, with a hosting provider, you may have 3 service levels (small, medium, large). How you present whether the guest is small, medium, or large to the guest is up to the hosting provider. > >> > >> > This is best taken care of from the host side. >> >> Not sure what you are getting at here. If you are running in a cloud >> environment, you purchase a VM with the understanding that you are >> getting certain resources. As this type of user I don't believe you >> have any access to the host to see this type of information. So the >> user still wouldnt have a way to confirm that they are receiving what >> they should be in the way of processor resources. 
>> >> Would you please elaborate a little more on this? > > I meant not reporting this time as steal time. But that cripples steal > time reporting. > > Looks like for each quanta we need to report how much real time has > passed, how much the guest was actually using, and how much the guest > was not using due to overcommit (with the reminder being unallocated > time). The guest could then present it any way it wanted to. What I had previously suggested what splitting entitlement loss out of steal time and reporting it as a separate metric (but not reporting a fixed notion of entitlement). You're missing the entitlement loss bit above. But you need to call out entitlement loss in order to report idle time correctly. I think changing steal time (as this patch does) is wrong. Regards, Anthony Liguori > > -- > I have a truly marvellous patch that fixes the bug which this > signature is too narrow to contain. > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Qemu-devel] x86, nops settings result in kernel crash
Alan Cox writes: > On Thu, 16 Aug 2012 14:45:15 -0400 (EDT) > Tomas Racek wrote: > >> - Original Message - >> > On Thu, Aug 16, 2012 at 09:35:12AM -0400, Tomas Racek wrote: >> > > Hi, >> > > >> > > I am writing a file system test which I execute in qemu with kernel >> > > compiled from latest git sources and running it causes this error: >> > > >> > > https://bugzilla.kernel.org/show_bug.cgi?id=45971 >> > > >> > > It works with v3.5, so I ran git bisect which pointed me to: >> > > >> > > d6250a3f12edb3a86db9598ffeca3de8b4a219e9 x86, nops: Missing break >> > > resulting in incorrect selection on Intel >> > > >> > > To be quite honest, I don't understand this stuff much but I tried >> > > to do some debugging and I figured out (I hope) that the crash is >> > > caused by setting ideal_nops to p6_nops (k8_nops was used before >> > > the break statement was added). >> > >> > Maybe I overlooked it or maybe it was implied but did you try >> > reverting >> > the patch and rerunning your test? Does it work ok then? >> > >> >> Yes, if I remove the break statement (introduced by this commit), it works >> fine. > > What version of qemu is this - do we have qemu bug here I wonder. From the cpuinfo, it's 0.15.1. That's old but not ancient. I took a brief look at the kernel code here. The default invocation of qemu presents an idealistic CPU with a very minimal feature-bit set exposed. No processor has ever existed with this feature set. We do this in order to maintain compatibility when migrating from Intel to AMD but also for legacy reasons. From the report, using '-cpu host' solves the problem. '-cpu host' exposes most of the host CPUID to the guest. That said, QEMU really doesn't do anything differently depending on what feature bits are exposed to the guest. So my guess is that the odd combination of CPUID bits that are exposed to the guest is confusing the kernel. Can you post dmesg from the host kernel? Perhaps there's instruction emulation failing in the host KVM?
That would manifest in strange behavior in the guest. Regards, Anthony Liguori > > Alan
Re: [Qemu-devel] [PATCH v8] kvm: notify host when the guest is panicked
Marcelo Tosatti writes: > On Tue, Aug 14, 2012 at 02:35:34PM -0500, Anthony Liguori wrote: >> Marcelo Tosatti writes: >> >> > On Tue, Aug 14, 2012 at 01:53:01PM -0500, Anthony Liguori wrote: >> >> Marcelo Tosatti writes: >> >> >> >> > On Tue, Aug 14, 2012 at 05:55:54PM +0300, Yan Vugenfirer wrote: >> >> >> >> >> >> On Aug 14, 2012, at 1:42 PM, Jan Kiszka wrote: >> >> >> >> >> >> > On 2012-08-14 10:56, Daniel P. Berrange wrote: >> >> >> >> On Mon, Aug 13, 2012 at 03:21:32PM -0300, Marcelo Tosatti wrote: >> >> >> >>> On Wed, Aug 08, 2012 at 10:43:01AM +0800, Wen Congyang wrote: >> >> >> >>>> We can know the guest is panicked when the guest runs on xen. >> >> >> >>>> But we do not have such feature on kvm. >> >> >> >>>> >> >> >> >>>> Another purpose of this feature is: management app(for example: >> >> >> >>>> libvirt) can do auto dump when the guest is panicked. If >> >> >> >>>> management >> >> >> >>>> app does not do auto dump, the guest's user can do dump by hand if >> >> >> >>>> he sees the guest is panicked. >> >> >> >>>> >> >> >> >>>> We have three solutions to implement this feature: >> >> >> >>>> 1. use vmcall >> >> >> >>>> 2. use I/O port >> >> >> >>>> 3. use virtio-serial. >> >> >> >>>> >> >> >> >>>> We have decided to avoid touching hypervisor. The reason why I >> >> >> >>>> choose >> >> >> >>>> choose the I/O port is: >> >> >> >>>> 1. it is easier to implememt >> >> >> >>>> 2. it does not depend any virtual device >> >> >> >>>> 3. it can work when starting the kernel >> >> >> >>> >> >> >> >>> How about searching for the "Kernel panic - not syncing" string >> >> >> >>> in the guests serial output? Say libvirtd could take an action upon >> >> >> >>> that? >> >> >> >> >> >> >> >> No, this is not satisfactory. It depends on the guest OS being >> >> >> >> configured to use the serial port for console output which we >> >> >> >> cannot mandate, since it may well be required for other purposes. 
>> >> >> > >> >> >> Please don't forget Windows guests, there is no console and no "Kernel >> >> >> Panic" string ;) >> >> >> >> >> >> What I used for debugging purposes on Windows guest is to register a >> >> >> bugcheck callback in virtio-net driver and write 1 to VIRTIO_PCI_ISR >> >> >> register. >> >> >> >> >> >> Yan. >> >> > >> >> > Considering whether a "panic-device" should cover other OSes is also \ >> > >> >> > something to consider. Even for Linux, is "panic" the only case which >> >> > should be reported via the mechanism? What about oopses without panic? >> >> > >> >> > Is the mechanism general enough for supporting new events, etc. >> >> >> >> Hi, >> >> >> >> I think this discussion is gone of the deep end. >> >> >> >> Forget about !x86 platforms. They have their own way to do this sort of >> >> thing. >> > >> > The panic function in kernel/panic.c has the following options, which >> > appear to be arch independent, on panic: >> > >> > - reboot >> > - blink >> >> Not sure the semantics of blink but that might be a good place for a >> pvops hook. >> >> > >> > None are paravirtual interfaces however. >> > >> >> Think of this feature like a status LED on a motherboard. These >> >> are very common and usually controlled by IO ports. >> >> >> >> We're simply reserving a "status LED" for the guest to indicate that it >> >> has paniced. Let's not over engineer this. >> > >> > My concern is that you end up wi
Re: [Qemu-devel] [PATCH v8] kvm: notify host when the guest is panicked
Marcelo Tosatti writes: > On Tue, Aug 14, 2012 at 01:53:01PM -0500, Anthony Liguori wrote: >> Marcelo Tosatti writes: >> >> > On Tue, Aug 14, 2012 at 05:55:54PM +0300, Yan Vugenfirer wrote: >> >> >> >> On Aug 14, 2012, at 1:42 PM, Jan Kiszka wrote: >> >> >> >> > On 2012-08-14 10:56, Daniel P. Berrange wrote: >> >> >> On Mon, Aug 13, 2012 at 03:21:32PM -0300, Marcelo Tosatti wrote: >> >> >>> On Wed, Aug 08, 2012 at 10:43:01AM +0800, Wen Congyang wrote: >> >> >>>> We can know the guest is panicked when the guest runs on xen. >> >> >>>> But we do not have such feature on kvm. >> >> >>>> >> >> >>>> Another purpose of this feature is: management app(for example: >> >> >>>> libvirt) can do auto dump when the guest is panicked. If management >> >> >>>> app does not do auto dump, the guest's user can do dump by hand if >> >> >>>> he sees the guest is panicked. >> >> >>>> >> >> >>>> We have three solutions to implement this feature: >> >> >>>> 1. use vmcall >> >> >>>> 2. use I/O port >> >> >>>> 3. use virtio-serial. >> >> >>>> >> >> >>>> We have decided to avoid touching hypervisor. The reason why I choose >> >> >>>> choose the I/O port is: >> >> >>>> 1. it is easier to implememt >> >> >>>> 2. it does not depend any virtual device >> >> >>>> 3. it can work when starting the kernel >> >> >>> >> >> >>> How about searching for the "Kernel panic - not syncing" string >> >> >>> in the guests serial output? Say libvirtd could take an action upon >> >> >>> that? >> >> >> >> >> >> No, this is not satisfactory. It depends on the guest OS being >> >> >> configured to use the serial port for console output which we >> >> >> cannot mandate, since it may well be required for other purposes. >> >> > >> >> Please don't forget Windows guests, there is no console and no "Kernel >> >> Panic" string ;) >> >> >> >> What I used for debugging purposes on Windows guest is to register a >> >> bugcheck callback in virtio-net driver and write 1 to VIRTIO_PCI_ISR >> >> register. >> >> >> >> Yan. 
>> > >> > Considering whether a "panic-device" should cover other OSes is also \ > >> > something to consider. Even for Linux, is "panic" the only case which >> > should be reported via the mechanism? What about oopses without panic? >> > >> > Is the mechanism general enough for supporting new events, etc. >> >> Hi, >> >> I think this discussion is gone of the deep end. >> >> Forget about !x86 platforms. They have their own way to do this sort of >> thing. > > The panic function in kernel/panic.c has the following options, which > appear to be arch independent, on panic: > > - reboot > - blink Not sure the semantics of blink but that might be a good place for a pvops hook. > > None are paravirtual interfaces however. > >> Think of this feature like a status LED on a motherboard. These >> are very common and usually controlled by IO ports. >> >> We're simply reserving a "status LED" for the guest to indicate that it >> has paniced. Let's not over engineer this. > > My concern is that you end up with state that is dependant on x86. > > Subject: [PATCH v8 3/6] add a new runstate: RUN_STATE_GUEST_PANICKED > > Having the ability to stop/restart the guest (and even introducing a > new VM runstate) is more than a status LED analogy. I must admit, I don't know why a new runstate is necessary/useful. The kernel shouldn't have to care about the difference between a halted guest and a panicked guest. That level of information belongs in userspace IMHO. > Can this new infrastructure be used by other architectures? I guess I don't understand why the kernel side of this isn't anything more than a paravirt op hook that does a single outb() with the remaining logic handled 100% in QEMU. > Do you consider allowing support for Windows as overengineering? I don't think there is a way to hook BSOD on Windows so attempting to engineer something that works with Windows seems odd, no? 
Regards, Anthony Liguori > >> Regards, >> >> Anthony Liguori >> >> > >> >> >> >> > Well, we have more than a single serial port, even when leaving >> >> > virtio-serial aside... >> >> > >> >> > Jan >> >> > >> >> > -- >> >> > Siemens AG, Corporate Technology, CT RTC ITP SDP-DE >> >> > Corporate Competence Center Embedded Linux >> >> > -- >> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in >> >> > the body of a message to majord...@vger.kernel.org >> >> > More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH v8] kvm: notify host when the guest is panicked
Marcelo Tosatti writes: > On Tue, Aug 14, 2012 at 05:55:54PM +0300, Yan Vugenfirer wrote: >> >> On Aug 14, 2012, at 1:42 PM, Jan Kiszka wrote: >> >> > On 2012-08-14 10:56, Daniel P. Berrange wrote: >> >> On Mon, Aug 13, 2012 at 03:21:32PM -0300, Marcelo Tosatti wrote: >> >>> On Wed, Aug 08, 2012 at 10:43:01AM +0800, Wen Congyang wrote: >> >>>> We can know the guest is panicked when the guest runs on xen. >> >>>> But we do not have such feature on kvm. >> >>>> >> >>>> Another purpose of this feature is: management app(for example: >> >>>> libvirt) can do auto dump when the guest is panicked. If management >> >>>> app does not do auto dump, the guest's user can do dump by hand if >> >>>> he sees the guest is panicked. >> >>>> >> >>>> We have three solutions to implement this feature: >> >>>> 1. use vmcall >> >>>> 2. use I/O port >> >>>> 3. use virtio-serial. >> >>>> >> >>>> We have decided to avoid touching hypervisor. The reason why I choose >> >>>> choose the I/O port is: >> >>>> 1. it is easier to implememt >> >>>> 2. it does not depend any virtual device >> >>>> 3. it can work when starting the kernel >> >>> >> >>> How about searching for the "Kernel panic - not syncing" string >> >>> in the guests serial output? Say libvirtd could take an action upon >> >>> that? >> >> >> >> No, this is not satisfactory. It depends on the guest OS being >> >> configured to use the serial port for console output which we >> >> cannot mandate, since it may well be required for other purposes. >> > >> Please don't forget Windows guests, there is no console and no "Kernel >> Panic" string ;) >> >> What I used for debugging purposes on Windows guest is to register a >> bugcheck callback in virtio-net driver and write 1 to VIRTIO_PCI_ISR >> register. >> >> Yan. > > Considering whether a "panic-device" should cover other OSes is also > something to consider. Even for Linux, is "panic" the only case which > should be reported via the mechanism? What about oopses without panic? 
> > Is the mechanism general enough for supporting new events, etc. Hi, I think this discussion has gone off the deep end. Forget about !x86 platforms. They have their own way to do this sort of thing. Think of this feature like a status LED on a motherboard. These are very common and usually controlled by IO ports. We're simply reserving a "status LED" for the guest to indicate that it has panicked. Let's not over-engineer this. Regards, Anthony Liguori > >> >> > Well, we have more than a single serial port, even when leaving >> > virtio-serial aside... >> > >> > Jan >> > >> > -- >> > Siemens AG, Corporate Technology, CT RTC ITP SDP-DE >> > Corporate Competence Center Embedded Linux >> > -- >> > To unsubscribe from this list: send the line "unsubscribe kvm" in >> > the body of a message to majord...@vger.kernel.org >> > More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH v7.5] kvm: notify host when the guest is panicked
Sasha Levin writes: > On 07/22/2012 09:14 PM, Anthony Liguori wrote: >> Sasha Levin writes: >> >>> On 07/21/2012 10:44 AM, Wen Congyang wrote: >>>> We can know the guest is panicked when the guest runs on xen. >>>> But we do not have such feature on kvm. >>>> >>>> Another purpose of this feature is: management app(for example: >>>> libvirt) can do auto dump when the guest is panicked. If management >>>> app does not do auto dump, the guest's user can do dump by hand if >>>> he sees the guest is panicked. >>>> >>>> We have three solutions to implement this feature: >>>> 1. use vmcall >>>> 2. use I/O port >>>> 3. use virtio-serial. >>>> >>>> We have decided to avoid touching hypervisor. The reason why I choose >>>> choose the I/O port is: >>>> 1. it is easier to implememt >>>> 2. it does not depend any virtual device >>>> 3. it can work when starting the kernel >>> >>> Was the option of implementing a virtio-watchdog driver considered? >>> >>> You're basically re-implementing a watchdog, a guest-host interface and a >>> set of protocols for guest-host communications. >>> >>> Why can't we re-use everything we have now, push a virtio watchdog >>> driver into drivers/watchdog/, and gain a more complete solution to >>> detecting hangs inside the guest. >> >> The purpose of virtio is not to reinvent every possible type of device. >> There are plenty of hardware watchdogs that are very suitable to be used >> for this purpose. QEMU implements quite a few already. >> >> Watchdogs are not performance sensitive so there's no point in using >> virtio. > > The issue here is not performance, but the adding of a brand new > guest-host interface. We have: 1) Virtio--this is our preferred PV interface. It needs PCI to be fully initialized and probably will live as a module. 2) Hypercalls--this a secondary PV interface but is available very early. It's terminated in kvm.ko which means it can only operate on things that are logically part of the CPU and/or APIC complex. 
This patch introduces a third interface which is available early like hypercalls but not necessarily terminated in kvm.ko. That means it can have a broader scope in functionality than (2). We could just as well use a hypercall and have multiple commands issued to that hypercall as a convention and add a new exit type to KVM that sent that specific hypercall to userspace for processing. But a PIO operation already has this behavior and requires no changes to kvm.ko. > virtio-rng isn't performance sensitive either, yet it was implemented > using virtio so there wouldn't be yet another interface to communicate > between guest and host. There isn't really an obvious discrete RNG that is widely supported. > This patch goes ahead to add a "arch pv features" interface using > ioports, without any idea what it might be used for beyond this > watchdog. It's not a watchdog--it's the opposite of a watchdog. You know such a thing already exists in the kernel, right? S390 has had a hypercall like this for years. Regards, Anthony Liguori
Re: [Qemu-devel] [PATCH v7] kvm: notify host when the guest is panicked
Sasha Levin writes: > On 07/22/2012 10:19 PM, Sasha Levin wrote: >> On 07/22/2012 09:22 PM, Anthony Liguori wrote: >>> Sasha Levin writes: >>> >>>> On 07/21/2012 09:12 AM, Wen Congyang wrote: >>>>> +#define KVM_PV_PORT (0x505UL) >>>>> + >>>>> #ifdef __KERNEL__ >>>>> #include >>>>> >>>>> @@ -221,6 +223,11 @@ static inline void kvm_disable_steal_time(void) >>>>> } >>>>> #endif >>>>> >>>>> +static inline unsigned int kvm_arch_pv_features(void) >>>>> +{ >>>>> + return inl(KVM_PV_PORT); >>>>> +} >>>>> + >>>> >>>> Why is this safe? >>>> >>>> I'm not sure you can just pick any ioport you'd like and use it. >>> >>> There are three ways I/O ports get used on a PC: >>> >>> 1) Platform devices >>> - This is well defined since the vast majority of platform devices are >>>implemented within a single chip. If you're emulating an i440fx >>>chipset, the PIIX4 spec has an exhaustive list. >>> >>> 2) PCI devices >>> - Typically, PCI only allocates ports starting at 0x0d00 to avoid >>>conflicts with ISA devices. >>> >>> 3) ISA devices >>> - ISA uses subtractive decoding so any ISA device can access. In >>>theory, an ISA device could attempt to use port 0x0505 but it's >>>unlikely. In a modern guest, there aren't really any ISA devices being >>>added either. >>> >>> So yes, picking port 0x0505 is safe for something like this (as long as >>> you check to make sure that you really are under KVM). >> >> Is there anything that actually prevents me from using PCI ports lower than >> 0x0d00? As you said in (3), ISA isn't really used anymore (nor is >> implemented by lkvm for example), so placing PCI below 0x0d00 might even >> make sense in that case. >> >> Furthermore, I can place one of these brand new virtio-mmio devices which >> got introduced recently wherever I want right now - Having a device that >> uses 0x505 would cause a pretty non-obvious failure mode. 
>> >> Either way, If we are going to grab an ioport, then: >> >> - It should be documented well somewhere in Documentation/virt/kvm >> - It should go through request_region() to actually claim those ioports. >> - It should fail gracefully if that port is taken for some reason, instead >> of not even checking it. >> > Out of curiosity I tested that, and apparently lkvm has no problem allocating > virtio-pci devices in that range: > > sh-4.2# pwd > /sys/devices/pci:00/:00:01.0 > sh-4.2# cat resource | head -n1 > 0x0500 0x05ff 0x00040101 > > This was with the commit in question applied. With all due respect, lkvm has a half-baked implementation of PCI. This is why you have to pass kernel parameters to disable ACPI and disable PCI BIOS probing. So yeah, you can do funky things in lkvm but that doesn't mean a system that emulated actual hardware would ever do that. Regards, Anthony Liguori
Re: [Qemu-devel] [PATCH v7] kvm: notify host when the guest is panicked
Sasha Levin writes: > On 07/22/2012 09:22 PM, Anthony Liguori wrote: >> Sasha Levin writes: >> >>> On 07/21/2012 09:12 AM, Wen Congyang wrote: >>>> +#define KVM_PV_PORT (0x505UL) >>>> + >>>> #ifdef __KERNEL__ >>>> #include >>>> >>>> @@ -221,6 +223,11 @@ static inline void kvm_disable_steal_time(void) >>>> } >>>> #endif >>>> >>>> +static inline unsigned int kvm_arch_pv_features(void) >>>> +{ >>>> + return inl(KVM_PV_PORT); >>>> +} >>>> + >>> >>> Why is this safe? >>> >>> I'm not sure you can just pick any ioport you'd like and use it. >> >> There are three ways I/O ports get used on a PC: >> >> 1) Platform devices >> - This is well defined since the vast majority of platform devices are >>implemented within a single chip. If you're emulating an i440fx >>chipset, the PIIX4 spec has an exhaustive list. >> >> 2) PCI devices >> - Typically, PCI only allocates ports starting at 0x0d00 to avoid >>conflicts with ISA devices. >> >> 3) ISA devices >> - ISA uses subtractive decoding so any ISA device can access. In >>theory, an ISA device could attempt to use port 0x0505 but it's >>unlikely. In a modern guest, there aren't really any ISA devices being >>added either. >> >> So yes, picking port 0x0505 is safe for something like this (as long as >> you check to make sure that you really are under KVM). > > Is there anything that actually prevents me from using PCI ports lower > than 0x0d00? As you said in (3), ISA isn't really used anymore (nor is > implemented by lkvm for example), so placing PCI below 0x0d00 might > even make sense in that case. On modern systems, the OS goes by whatever is in the ACPI table describing the PCI bus. In QEMU, we have: WordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange, 0x, // Address Space Granularity 0x0D00, // Address Range Minimum 0x, // Address Range Maximum 0x, // Address Translation Offset 0xF300, // Address Length ,, , TypeStatic) So Linux will always use 0x0D00 -> 0x for the valid range. 
Practically speaking, you can't use anything below 0x0D00 because the PCI bus configuration registers live at 0xCF8-0xCFF. If you tried to create the region starting at 0x0500 you'd have to limit it to 0xCF8 to avoid conflicting with the PCI host controller. That's not a useful amount of space for I/O ports so that would be a pretty dumb thing to do. > Furthermore, I can place one of these brand new virtio-mmio devices > which got introduced recently wherever I want right now - Having a > device that uses 0x505 would cause a pretty non-obvious failure mode. I think you're confusing PIO with MMIO. They are separate address spaces. You could certainly argue that relying on PIO is way too architecture specific since that's only available on x86. That's a good argument but the counter is that other architectures have their own interfaces for this sort of thing. > Either way, If we are going to grab an ioport, then: > > - It should be documented well somewhere in Documentation/virt/kvm > - It should go through request_region() to actually claim those ioports. > - It should fail gracefully if that port is taken for some reason, > instead of not even checking it. I agree with the above. Regards, Anthony Liguori -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Qemu-devel] [PATCH v7] kvm: notify host when the guest is panicked
Sasha Levin writes: > On 07/21/2012 09:12 AM, Wen Congyang wrote: >> +#define KVM_PV_PORT (0x505UL) >> + >> #ifdef __KERNEL__ >> #include >> >> @@ -221,6 +223,11 @@ static inline void kvm_disable_steal_time(void) >> } >> #endif >> >> +static inline unsigned int kvm_arch_pv_features(void) >> +{ >> +return inl(KVM_PV_PORT); >> +} >> + > > Why is this safe? > > I'm not sure you can just pick any ioport you'd like and use it. There are three ways I/O ports get used on a PC: 1) Platform devices - This is well defined since the vast majority of platform devices are implemented within a single chip. If you're emulating an i440fx chipset, the PIIX4 spec has an exhaustive list. 2) PCI devices - Typically, PCI only allocates ports starting at 0x0d00 to avoid conflicts with ISA devices. 3) ISA devices - ISA uses subtractive decoding so any ISA device can access. In theory, an ISA device could attempt to use port 0x0505 but it's unlikely. In a modern guest, there aren't really any ISA devices being added either. So yes, picking port 0x0505 is safe for something like this (as long as you check to make sure that you really are under KVM). Regards, Anthony Liguori > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH v7.5] kvm: notify host when the guest is panicked
Sasha Levin writes: > On 07/21/2012 10:44 AM, Wen Congyang wrote: >> We can know the guest is panicked when the guest runs on xen. >> But we do not have such feature on kvm. >> >> Another purpose of this feature is: management app(for example: >> libvirt) can do auto dump when the guest is panicked. If management >> app does not do auto dump, the guest's user can do dump by hand if >> he sees the guest is panicked. >> >> We have three solutions to implement this feature: >> 1. use vmcall >> 2. use I/O port >> 3. use virtio-serial. >> >> We have decided to avoid touching hypervisor. The reason why I choose >> choose the I/O port is: >> 1. it is easier to implememt >> 2. it does not depend any virtual device >> 3. it can work when starting the kernel > > Was the option of implementing a virtio-watchdog driver considered? > > You're basically re-implementing a watchdog, a guest-host interface and a set > of protocols for guest-host communications. > > Why can't we re-use everything we have now, push a virtio watchdog > driver into drivers/watchdog/, and gain a more complete solution to > detecting hangs inside the guest. The purpose of virtio is not to reinvent every possible type of device. There are plenty of hardware watchdogs that are very suitable to be used for this purpose. QEMU implements quite a few already. Watchdogs are not performance sensitive so there's no point in using virtio. Regards, Anthony Liguori
Re: [PATCH RESEND 5/5] vhost-blk: Add vhost-blk support
"Michael S. Tsirkin" writes: > On Thu, Jul 19, 2012 at 08:05:42AM -0500, Anthony Liguori wrote: >> Of course, the million dollar question is why would using AIO in the >> kernel be faster than using AIO in userspace? > > Actually for me a more important question is how does it compare > with virtio-blk dataplane? I'm not even asking for a benchmark comparison. It's the same API being called from a kernel thread vs. a userspace thread. Why would there be a 60% performance difference between the two? That doesn't make any sense. There's got to be a better justification for putting this in the kernel than just that we can. I completely understand why Christoph's suggestion of submitting BIOs directly would be faster. There's no way to do that in userspace. Regards, Anthony Liguori > > -- > MST
Re: 0xB16B00B5? Really? (was Re: Move hyperv out of the drivers/staging/ directory)
On 07/19/2012 05:30 PM, KY Srinivasan wrote: -Original Message- From: Greg KH (gre...@linuxfoundation.org) [mailto:gre...@linuxfoundation.org] Sent: Thursday, July 19, 2012 6:02 PM To: KY Srinivasan Cc: Paolo Bonzini; de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; virtualizat...@lists.osdl.org Subject: Re: 0xB16B00B5? Really? (was Re: Move hyperv out of the drivers/staging/ directory) On Thu, Jul 19, 2012 at 09:22:53PM +, KY Srinivasan wrote: -Original Message- From: Greg KH (gre...@linuxfoundation.org) [mailto:gre...@linuxfoundation.org] Sent: Thursday, July 19, 2012 5:07 PM To: KY Srinivasan Cc: Paolo Bonzini; de...@linuxdriverproject.org; linux- ker...@vger.kernel.org; virtualizat...@lists.osdl.org Subject: Re: 0xB16B00B5? Really? (was Re: Move hyperv out of the drivers/staging/ directory) On Thu, Jul 19, 2012 at 02:11:47AM +, KY Srinivasan wrote: -Original Message- From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On Behalf Of Paolo Bonzini Sent: Friday, July 13, 2012 6:23 AM To: KY Srinivasan Cc: Greg KH; de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; virtualizat...@lists.osdl.org Subject: 0xB16B00B5? Really? (was Re: Move hyperv out of the drivers/staging/ directory) On 04/10/2011 21:34, Greg KH wrote: diff --git a/drivers/staging/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h similarity index 99% rename from drivers/staging/hv/hyperv_vmbus.h rename to drivers/hv/hyperv_vmbus.h index 3d2d836..8261cb6 100644 --- a/drivers/staging/hv/hyperv_vmbus.h +++ b/drivers/hv/hyperv_vmbus.h @@ -28,8 +28,7 @@ #include #include #include - -#include "hyperv.h" +#include /* * The below CPUID leaves are present if VersionAndFeatures.HypervisorPresent git's rename detection snips away this gem: +#define HV_LINUX_GUEST_ID_LO 0x +#define HV_LINUX_GUEST_ID_HI 0xB16B00B5 +#define HV_LINUX_GUEST_ID (((u64)HV_LINUX_GUEST_ID_HI << 32) | \ + HV_LINUX_GUEST_ID_LO) Someone was trying to be funny, I guess.
KY, I suppose you have access to Hyper-V code or can ask someone who does. Is this signature actually used in the Hyper-V host code? Paolo, As I noted earlier, this is just a guest ID that needs to be registered with the hypervisor. Thanks for reporting this issue and on behalf of Microsoft, I would like to apologize for this offensive string. I have submitted a patch to fix this issue. You only changed it to be in decimal, you did not change the id at all. Is there some reason why you can not change it? You said there was a reserved range of ids that could be used, perhaps just pick another one? What is the valid range that can be used here? Greg, As you know, this ID has been in use for a long time now. While the hypervisor does not interpret the guest ID that is registered, I am not sure what dependencies there might be on this value. Could you please go find out the answer to this? That is easier said than done. I have sent emails out asking this very question and I have not received a definitive answer yet. Not knowing if and when I can get a definitive answer here, I chose the least risky approach in my patch. If, as you originally stated, there is a range of values we can use, then we should probably use another one, right? On the Windows side this ID namespace is managed well. However on the Linux side, we have really had this current ID in use for almost five years now. I am not aware of any pool of IDs available for Linux usage except that Linux IDs be distinct from the guest IDs in use by MSFT operating systems. If I were to change the guest ID, I would probably want to comply with the MSFT guidance on constructing these IDs (although not all fields may be relevant for Linux). Presumably, Hyper-V can deal with unexpected values here, no? Otherwise, it wouldn't be future-proof against new types of guests. So worst case scenario, Hyper-V disables optimizations on Linux guests that report the new ID until they patch Hyper-V to know about the new ID.
That seems like a reasonable trade off to me. I'm sure there's sufficient incentive to patch Hyper-V for this at Microsoft... Regards, Anthony Liguori Regards, K. Y
Re: [PATCH RESEND 5/5] vhost-blk: Add vhost-blk support
*vq, > + struct virtio_blk_outhdr *hdr, > + u16 head, u16 out, u16 in, > + struct file *file) > +{ > + struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev); > + struct iovec *iov = &vq->iov[BLK_HDR + 1]; > + loff_t offset = hdr->sector << 9; > + struct vhost_blk_req *req; > + u64 nr_vecs; > + int ret = 0; > + u8 status; > + > + if (hdr->type == VIRTIO_BLK_T_IN || hdr->type == VIRTIO_BLK_T_GET_ID) > + nr_vecs = in - 1; > + else > + nr_vecs = out - 1; > + > + req = &blk->reqs[head]; > + req->head = head; > + req->status = blk->vq.iov[nr_vecs + 1].iov_base; > + > + switch (hdr->type) { > + case VIRTIO_BLK_T_OUT: > + ret = vhost_blk_io_submit(blk, file, req, iov, nr_vecs, offset, > + IOCB_CMD_PWRITEV); > + break; > + case VIRTIO_BLK_T_IN: > + ret = vhost_blk_io_submit(blk, file, req, iov, nr_vecs, offset, > + IOCB_CMD_PREADV); > + break; > + case VIRTIO_BLK_T_FLUSH: > + ret = vfs_fsync(file, 1); > + status = ret < 0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK; > + ret = vhost_blk_set_status(blk, req->status, status); > + if (!ret) > + vhost_add_used_and_signal(&blk->dev, vq, head, ret); > + break; > + case VIRTIO_BLK_T_GET_ID: > + /* TODO: need a real ID string */ > + ret = snprintf(vq->iov[BLK_HDR + 1].iov_base, > +VIRTIO_BLK_ID_BYTES, "VHOST-BLK-DISK"); > + status = ret < 0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK; > + ret = vhost_blk_set_status(blk, req->status, status); > + if (!ret) > + vhost_add_used_and_signal(&blk->dev, vq, head, > + VIRTIO_BLK_ID_BYTES); > + break; > + default: > + pr_warn("Unsupported request type %d\n", hdr->type); > + vhost_discard_vq_desc(vq, 1); > + ret = -EFAULT; > + break; > + } There doesn't appear to be any error handling in the event that vhost_blk_io_submit fails. It would appear that you leak the ring queue entry since you never push anything onto the used queue. I think you need to handle io_submit() failing too with EAGAIN. 
Relying on min nr_events == queue_size seems like a dangerous assumption to me, particularly since queue_size tends to be very large and max_nr_events is a fixed-size global pool. To properly handle EAGAIN, you effectively need to implement flow control and back off reading the virtqueue until you can submit requests again. Of course, the million dollar question is why would using AIO in the kernel be faster than using AIO in userspace? When using eventfd, there is no heavyweight exit on the notify path. It should just be the difference between scheduling a kernel thread vs scheduling a userspace thread. There's simply no way that that's a 60% difference in performance. Regards, Anthony Liguori
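[To make the back-off idea concrete, here is a rough userspace model of the flow control being described. `fake_io_submit` is a stub standing in for io_submit() against a fixed-size AIO pool; all names and the pool size are illustrative, not vhost-blk code.]

```c
#include <errno.h>

/* Stand-in for io_submit(): fails with EAGAIN once the fixed-size
 * AIO event pool is exhausted. */
#define FAKE_AIO_POOL 4
static int fake_in_flight;

static int fake_io_submit(void)
{
    if (fake_in_flight >= FAKE_AIO_POOL)
        return -EAGAIN;
    fake_in_flight++;
    return 0;
}

/* Drain as many virtqueue requests as the pool allows. On EAGAIN, stop
 * reading the ring and return, leaving the remaining descriptors queued
 * so they can be retried from the completion path, instead of dropping
 * them (which is how a ring entry gets leaked). */
static int submit_pending(int pending, int *submitted)
{
    *submitted = 0;
    while (pending > 0) {
        if (fake_io_submit() == -EAGAIN)
            return -EAGAIN; /* back off; descriptors stay on the ring */
        (*submitted)++;
        pending--;
    }
    return 0;
}
```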
Re: current mainline ide doesn't like qemu/kvm (or vice versa)
Alan Cox wrote: On Fri, 8 Feb 2008 16:25:08 +0100 Christoph Hellwig <[EMAIL PROTECTED]> wrote: When trying to put some stress on qemu by running the xfs testsuite I get the following: debian:~/xfs-cmds/xfstests# sh check [ 438.166822] SGI XFS with ACLs, security attributes, realtime, large block numbers, no debug enabled [ 438.185557] SGI XFS Quota Management subsystem [ 438.193150] hdb: task_no_data_intr: status=0x41 { DriveReady Error } [ 438.194018] hdb: task_no_data_intr: error=0x04 { DriveStatusError } [ 438.194195] ide: failed opcode was: 0x9d Drive aborted the command. That's all correct. and after that the kernel seems to hang. Qemu is emulating a piix3 device, and using the piix driver. This is on a pretty old kvm (version 28) because newer ones don't even compile. kvm-28 is ancient and there have been IDE emulation fixes since. I am not aware of issues building KVM either. The problems you reported a month ago on kvm-devel have been fixed. Old Qemu is not really a credible IDE emulation. The chances are that as with all the libata bugs I close for qemu the problem is qemu. Please file a bug with the qemu people. Agreed, this is very likely to be a QEMU issue (that may have already been fixed). Regards, Anthony Liguori
Re: [kvm-devel] KVM binary incompatibility
Stephen Hemminger wrote: I notice that recent KVM is incompatible with older versions. Using a KVM image created on 2.6.24 will crash on 2.6.25 (or vice versa). It appears that Ubuntu Hardy has incorporated the 2.6.25 update even though it claims to be 2.6.24. This isn't intentional. What is the guest and how does it crash? I've been using the same image for most of KVM's development life cycle without having issues. Regards, Anthony Liguori This is reproducible on Intel (64bit) kernel. Was this intentional? Is it documented?
Re: [kvm-devel] [PATCH 3/8] SVM: add module parameter to disable NestedPaging
Alexey Eremenko wrote: Generally I see no problem with it. But at least for NPT I don't see a reason why someone should want to disable it on a VM basis (as far as it works stable). Avi, what do you think? 1. This is more user-friendly and easier-to-find I'm inclined to disagree. Why add an additional thing for people to tune if they really shouldn't need to? Once NPT/EPT support is stable, is there any reason to want to disable it? Regards, Anthony Liguori 2. A lot of people would like to view (and Demo) it side-by-side. 3. Useful for BETA-testing of KVM.
Re: [kvm-devel] [PATCH 4/8] X86: export information about NPT to generic x86 code
Anthony Liguori wrote: Joerg Roedel wrote: The generic x86 code has to know if the specific implementation uses Nested Paging. In the generic code Nested Paging is called Hardware Assisted Paging (HAP) to avoid confusion with (future) HAP implementations of other vendors. This patch exports the availability of HAP to the generic x86 code. Signed-off-by: Joerg Roedel <[EMAIL PROTECTED]> --- arch/x86/kvm/svm.c |7 +++ arch/x86/kvm/vmx.c |7 +++ include/asm-x86/kvm_host.h |2 ++ 3 files changed, 16 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 2e718ff..d0bfdd8 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1678,6 +1678,11 @@ static bool svm_cpu_has_accelerated_tpr(void) return false; } +static bool svm_hap_enabled(void) +{ +return npt_enabled; +} + To help with bisecting, you should probably return false here until the patch that actually implements NPT support. Otherwise, the 7th patch in this series breaks KVM for SVM. Ignore this, you're already doing the right thing :-) Regards, Anthony Liguori Regards, Anthony Liguori static struct kvm_x86_ops svm_x86_ops = { .cpu_has_kvm_support = has_svm, .disabled_by_bios = is_disabled, @@ -1734,6 +1739,8 @@ static struct kvm_x86_ops svm_x86_ops = { .inject_pending_vectors = do_interrupt_requests, .set_tss_addr = svm_set_tss_addr, + +.hap_enabled = svm_hap_enabled, }; static int __init svm_init(void) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 00a00e4..8feb775 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2631,6 +2631,11 @@ static void __init vmx_check_processor_compat(void *rtn) } } +static bool vmx_hap_enabled(void) +{ +return false; +} + static struct kvm_x86_ops vmx_x86_ops = { .cpu_has_kvm_support = cpu_has_kvm_support, .disabled_by_bios = vmx_disabled_by_bios, @@ -2688,6 +2693,8 @@ static struct kvm_x86_ops vmx_x86_ops = { .inject_pending_vectors = do_interrupt_requests, .set_tss_addr = vmx_set_tss_addr, + +.hap_enabled = vmx_hap_enabled, 
}; static int __init vmx_init(void) diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h index 67ae307..45a9d05 100644 --- a/include/asm-x86/kvm_host.h +++ b/include/asm-x86/kvm_host.h @@ -392,6 +392,8 @@ struct kvm_x86_ops { struct kvm_run *run); int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); + +bool (*hap_enabled)(void); }; extern struct kvm_x86_ops *kvm_x86_ops;
Re: [kvm-devel] [PATCH 4/8] X86: export information about NPT to generic x86 code
Joerg Roedel wrote: The generic x86 code has to know if the specific implementation uses Nested Paging. In the generic code Nested Paging is called Hardware Assisted Paging (HAP) to avoid confusion with (future) HAP implementations of other vendors. This patch exports the availability of HAP to the generic x86 code. Signed-off-by: Joerg Roedel <[EMAIL PROTECTED]> --- arch/x86/kvm/svm.c |7 +++ arch/x86/kvm/vmx.c |7 +++ include/asm-x86/kvm_host.h |2 ++ 3 files changed, 16 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 2e718ff..d0bfdd8 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1678,6 +1678,11 @@ static bool svm_cpu_has_accelerated_tpr(void) return false; } +static bool svm_hap_enabled(void) +{ + return npt_enabled; +} + To help with bisecting, you should probably return false here until the patch that actually implements NPT support. Otherwise, the 7th patch in this series breaks KVM for SVM. Regards, Anthony Liguori static struct kvm_x86_ops svm_x86_ops = { .cpu_has_kvm_support = has_svm, .disabled_by_bios = is_disabled, @@ -1734,6 +1739,8 @@ static struct kvm_x86_ops svm_x86_ops = { .inject_pending_vectors = do_interrupt_requests, .set_tss_addr = svm_set_tss_addr, + + .hap_enabled = svm_hap_enabled, }; static int __init svm_init(void) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 00a00e4..8feb775 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2631,6 +2631,11 @@ static void __init vmx_check_processor_compat(void *rtn) } } +static bool vmx_hap_enabled(void) +{ + return false; +} + static struct kvm_x86_ops vmx_x86_ops = { .cpu_has_kvm_support = cpu_has_kvm_support, .disabled_by_bios = vmx_disabled_by_bios, @@ -2688,6 +2693,8 @@ static struct kvm_x86_ops vmx_x86_ops = { .inject_pending_vectors = do_interrupt_requests, .set_tss_addr = vmx_set_tss_addr, + + .hap_enabled = vmx_hap_enabled, }; static int __init vmx_init(void) diff --git a/include/asm-x86/kvm_host.h 
b/include/asm-x86/kvm_host.h index 67ae307..45a9d05 100644 --- a/include/asm-x86/kvm_host.h +++ b/include/asm-x86/kvm_host.h @@ -392,6 +392,8 @@ struct kvm_x86_ops { struct kvm_run *run); int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); + + bool (*hap_enabled)(void); }; extern struct kvm_x86_ops *kvm_x86_ops;
Re: [kvm-devel] [PATCH 3/8] SVM: add module parameter to disable Nested Paging
Joerg Roedel wrote: To disable the use of the Nested Paging feature even if it is available in hardware this patch adds a module parameter. Nested Paging can be disabled by passing npt=off to the kvm_amd module. Signed-off-by: Joerg Roedel <[EMAIL PROTECTED]> --- arch/x86/kvm/svm.c |8 1 files changed, 8 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 49bb57a..2e718ff 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -48,6 +48,9 @@ MODULE_LICENSE("GPL"); #define SVM_DEATURE_SVML (1 << 2) static bool npt_enabled = false; +static char *npt = "on"; + +module_param(npt, charp, S_IRUGO); This would probably be better as an integer. Then we don't have to do nasty things like implicitly cast a literal to a char *. Regards, Anthony Liguori static void kvm_reput_irq(struct vcpu_svm *svm); @@ -415,6 +418,11 @@ static __init int svm_hardware_setup(void) if (!svm_has(SVM_FEATURE_NPT)) npt_enabled = false; + if (npt_enabled && strncmp(npt, "off", 3) == 0) { + printk(KERN_INFO "kvm: Nested Paging disabled\n"); + npt_enabled = false; + } + if (npt_enabled) printk(KERN_INFO "kvm: Nested Paging enabled\n");
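[The integer-parameter form being suggested could look something like the sketch below. `resolve_npt` is a hypothetical stand-in for the decision in svm_hardware_setup(); the real change would simply be `static int npt = 1; module_param(npt, int, S_IRUGO);` with no strncmp needed.]

```c
/* npt=0 on the module command line disables Nested Paging even when the
 * hardware supports it; hardware without NPT keeps it off regardless. */
static int npt_param = 1; /* would be: module_param(npt, int, S_IRUGO); */

static int resolve_npt(int hw_has_npt)
{
    return hw_has_npt && npt_param; /* no string comparison required */
}
```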
Re: [kvm-devel] [PATCH][RFC] SVM: Add Support for Nested Paging in AMD Fam16 CPUs
Joerg Roedel wrote: Hi, here is the first release of patches for KVM to support the Nested Paging (NPT) feature of AMD QuadCore CPUs for comments and public testing. This feature improves the guest performance significantly. I measured an improvement of around 17% using kernbench in my first tests. This patch series is basically tested with Linux guests (32 bit legacy paging, 32 bit PAE paging and 64 bit Long Mode). Also tested with Windows Vista 32 bit and 64 bit. All these guests ran successfully with these patches. The patch series only enables NPT for 64 bit Linux hosts at the moment. Please give these patches a good and deep testing. I hope we have this patchset ready for merging soon. A quick sniff test and things look pretty good. I was able to start running the install CDs for 32-bit and 64-bit Ubuntu, 32-bit OpenSuSE, 64-bit Fedora, and 32-bit Win2k8. I'll do a more thorough run of kvm-test on Monday when I have a better connection to my machine. Nice work! Regards, Anthony Liguori Joerg Here is the diffstat: arch/x86/kvm/mmu.c | 81 +++--- arch/x86/kvm/mmu.h |6 +++ arch/x86/kvm/svm.c | 94 +-- arch/x86/kvm/vmx.c |7 +++ arch/x86/kvm/x86.c |1 + include/asm-x86/kvm_host.h |4 ++ 6 files changed, 182 insertions(+), 11 deletions(-)
Re: [PATCH 0/2] dm-band: The I/O bandwidth controller: Overview
Hi, I believe this work is very important, especially in the context of virtual machines. I think it would be more useful, though, implemented in the context of the IO scheduler. Since we already support a notion of IO priority, it seems reasonable to add a notion of an IO cap. Regards, Anthony Liguori Ryo Tsuruta wrote: Hi everyone, I'm happy to announce that I've implemented a Block I/O bandwidth controller. The controller is designed to be of use in a cgroup or virtual machine environment. The current approach is that the controller is implemented as a device-mapper driver. What's dm-band all about? Dm-band is an I/O bandwidth controller implemented as a device-mapper driver. Several jobs using the same physical device have to share the bandwidth of the device. Dm-band gives bandwidth to each job according to its weight, which each job can set its own value to. At this time, a job is a group of processes with the same pid or pgrp or uid. There is also a plan to make it support cgroup. A job can also be a virtual machine such as KVM or Xen.

[ASCII diagram, garbled in this copy: jobs (cgroup A, cgroup B, the others; pid X, pid Y, the others) map onto band groups (including a default group) inside the band devices band1 and band2, which in turn map onto the physical partitions sdb1 and sdb2.]

How dm-band works. Every band device has one band group, which by default is called the default group. Band devices can also have extra band groups in them. Each band group has a job to support and a weight. Proportional to the weight, dm-band gives tokens to the group. A group passes on I/O requests that its job issues to the underlying layer so long as it has tokens left, while requests are blocked if there aren't any tokens left in the group.
One token is consumed each time the group passes on a request. Dm-band will refill groups with tokens once all of the groups that have requests on a given physical device use up their tokens. With this approach, a job running on a band group with a large weight is guaranteed to be able to issue a large number of I/O requests.

Getting started
===============

The following is a brief description of how to control the I/O bandwidth of disks. In this description, we'll take one disk with two partitions as an example target. You can also check the manual at Document/device-mapper/band.txt of the linux kernel source tree for more information.

Create and map band devices
---------------------------

Create two band devices "band1" and "band2" and map them to "/dev/sda1" and "/dev/sda2" respectively.

# echo "0 `blockdev --getsize /dev/sda1` band /dev/sda1 1" | dmsetup create band1
# echo "0 `blockdev --getsize /dev/sda2` band /dev/sda2 1" | dmsetup create band2

If the commands are successful then the device files "/dev/mapper/band1" and "/dev/mapper/band2" will have been created.

Bandwidth control
-----------------

In this example weights of 40 and 10 will be assigned to "band1" and "band2" respectively. This is done using the following commands:

# dmsetup message band1 0 weight 40
# dmsetup message band2 0 weight 10

After these commands, "band1" can use 80% --- 40/(40+10)*100 --- of the bandwidth of the physical disk "/dev/sda" while "band2" can use 20%.

Additional bandwidth control
----------------------------

In this example two extra band groups are created on "band1". The first group consists of all the processes with user-id 1000 and the second group consists of all the processes with user-id 2000. Their weights are 30 and 20 respectively. Firstly the band group type of "band1" is set to "user". Then, the user-id 1000 and 2000 groups are attached to "band1". Finally, weights are assigned to the user-id 1000 and 2000 groups.
# dmsetup message band1 0 type user
# dmsetup message band1 0 attach 1000
# dmsetup message band1 0 attach 2000
# dmsetup message band1 0 weight 1000:30
# dmsetup message band1 0 weight 2000:20

Now the processes in the user-id 1000 group can use 30% --- 30/(30+20+40+10)*100 --- of the bandwidth of the physical disk.
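[The weighted token scheme described above can be modeled in a few lines of C. This is a toy model, not dm-band code: the refill total of 100 tokens and the struct layout are made up for illustration, chosen so the 40/10 weights from the example come out as the 80%/20% split quoted in the text.]

```c
#define REFILL_TOTAL 100

struct band_group {
    int weight;
    int tokens;
};

/* Once every group with pending I/O has run dry, each group is refilled
 * in proportion to its weight. */
static void refill(struct band_group *grps, int n)
{
    int total_weight = 0, i;

    for (i = 0; i < n; i++)
        total_weight += grps[i].weight;
    for (i = 0; i < n; i++)
        grps[i].tokens = REFILL_TOTAL * grps[i].weight / total_weight;
}

/* One token is consumed per request passed down; a group with no tokens
 * left must block until the next refill. Returns 1 if the request may
 * pass, 0 if it must wait. */
static int try_issue(struct band_group *g)
{
    if (g->tokens <= 0)
        return 0;
    g->tokens--;
    return 1;
}
```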
Re: [PATCH] xen: relax signature check
Jeremy Fitzhardinge wrote: Bodo Eggert wrote: Not BUG_ON(memcmp(xen_start_info->magic, "xen-3.", 6) != 0); ? I don't think Xen version 32 will be compatible ... It had better be; if it loads the kernel, it should present a xen-3 compatible ABI. If xen-32.0 should be compatible then wouldn't xen-24.0 be compatible too? I think the point was that you should either be checking for 'xen-3.x' or something more general that would accept anything >= xen-3.0. Regards, Anthony Liguori But this is just a sanity check to make sure things are basically OK; BUG_ON is hardly nice error reporting (not that there's much else we can do at that point). J
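[The "something more general that would accept anything >= xen-3.0" variant could be written roughly like this. Illustrative helper only, not the in-kernel check; `xen_magic_ok` is a hypothetical name.]

```c
#include <stdlib.h>
#include <string.h>

/* Accept any "xen-<major>.<...>" magic with major >= 3, rather than a
 * fixed "xen-3." prefix; a hypothetical "xen-32.0" then also passes,
 * on the theory that anything loading the kernel presents a xen-3
 * compatible ABI. */
static int xen_magic_ok(const char *magic)
{
    if (strncmp(magic, "xen-", 4) != 0)
        return 0;
    return atoi(magic + 4) >= 3;
}
```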
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure (v2)
Amit Shah wrote: * Anthony Liguori wrote: This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. This doesn't work right for SVM. It keeps looping indefinitely; on a kvm_stat run, I get about 230,000 light vm exits per second, with the hypercall never returning to the guest. ... What are you using to issue the hypercall? Regards, Anthony Liguori diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c index 729f1cd..d09a9f5 100644 --- a/drivers/kvm/svm.c +++ b/drivers/kvm/svm.c @@ -476,7 +476,8 @@ static void init_vmcb(struct vmcb *vmcb) INTERCEPT_DR5_MASK | INTERCEPT_DR7_MASK; - control->intercept_exceptions = 1 << PF_VECTOR; + control->intercept_exceptions = (1 << PF_VECTOR) | + (1 << UD_VECTOR); control->intercept = (1ULL << INTERCEPT_INTR) | @@ -970,6 +971,17 @@ static int pf_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) return 0; } +static int ud_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) +{ + int er; + + er = emulate_instruction(&svm->vcpu, kvm_run, 0, 0); + if (er != EMULATE_DONE) + inject_ud(&svm->vcpu); + + return 1; +} + static int nm_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { svm->vmcb->control.intercept_exceptions &= ~(1 << NM_VECTOR); @@ -1036,7 +1048,8 @@ static int vmmcall_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { svm->next_rip = svm->vmcb->save.rip + 3; skip_emulated_instruction(&svm->vcpu); - return kvm_hypercall(&svm->vcpu, kvm_run); + kvm_emulate_hypercall(&svm->vcpu); + return 1; } static int invalid_op_interception(struct vcpu_svm *svm, @@ -1232,6 +1245,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm, [SVM_EXIT_WRITE_DR3] = emulate_on_interception, [SVM_EXIT_WRITE_DR5]= emulate_on_interception, [SVM_EXIT_WRITE_DR7]= 
emulate_on_interception, + [SVM_EXIT_EXCP_BASE + UD_VECTOR]= ud_interception, [SVM_EXIT_EXCP_BASE + PF_VECTOR]= pf_interception, [SVM_EXIT_EXCP_BASE + NM_VECTOR]= nm_interception, [SVM_EXIT_INTR] = nop_on_interception, @@ -1664,7 +1678,6 @@ svm_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) hypercall[0] = 0x0f; hypercall[1] = 0x01; hypercall[2] = 0xd9; - hypercall[3] = 0xc3; } ... +/* This instruction is vmcall. On non-VT architectures, it will generate a + * trap that we will then rewrite to the appropriate instruction. */ -#define __NR_hypercalls0 +#define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1" .. which never happens.
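[For reference, the two encodings in play are Intel's vmcall (0f 01 c1), which KVM_HYPERCALL emits, and AMD's vmmcall (0f 01 d9), which svm_patch_hypercall writes. The lazy-rewrite idea amounts to the toy sketch below; `patch_hypercall` is illustrative, not the KVM emulator code.]

```c
#include <stdint.h>
#include <string.h>

static const uint8_t VMCALL[3]  = { 0x0f, 0x01, 0xc1 }; /* Intel */
static const uint8_t VMMCALL[3] = { 0x0f, 0x01, 0xd9 }; /* AMD */

/* On SVM, vmcall raises #UD; the #UD handler is expected to rewrite the
 * faulting instruction to vmmcall and retry. If the rewrite never
 * happens, the guest re-executes vmcall and loops on #UD forever. */
static int patch_hypercall(uint8_t *insn)
{
    if (memcmp(insn, VMCALL, 3) != 0)
        return -1;            /* not a hypercall instruction */
    memcpy(insn, VMMCALL, 3); /* lazily rewrite for SVM */
    return 0;
}
```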
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Avi Kivity wrote: rx and tx are closely related. You rarely have one without the other. In fact, a tuned implementation should have zero kicks or interrupts for bulk transfers. The rx interrupt on the host will process new tx descriptors and fill the guest's rx queue; the guest's transmit function can also check the receive queue. I don't know if that's achievable for Linux guests currently, but we should aim to make it possible. ATM, the net driver does a pretty good job of disabling kicks/interrupts unless they are needed. Checking for rx on tx and vice versa is a good idea and could further help there. I'll give it a try this week. Another point is that virtio still has a lot of leading zeros in its mileage counter. We need to keep things flexible and learn from others as much as possible, especially when talking about the ABI. Yes, after thinking about it over holiday, I agree that we should at least introduce a virtio-pci feature bitmask. I'm not inclined to attempt to define a hypercall ABI or anything like that right now but having the feature bitmask will at least make it possible to do such a thing in the future. I'm wary of introducing the notion of hypercalls to this device because it makes the device VMM specific. Maybe we could have the device provide an option ROM that was treated as the device "BIOS" that we could use for kicking and interrupt acking? Any idea of how that would map to Windows? Are there real PCI devices that use the option ROM space to provide what's essentially firmware? Unfortunately, I don't think an option ROM BIOS would map well to other architectures. The BIOS wouldn't work even on x86 because it isn't mapped to the guest address space (at least not consistently), and doesn't know the guest's programming model (16, 32, or 64-bits? segmented or flat?) Xen uses a hypercall page to abstract these details out. However, I'm not proposing that.
Simply indicate that we support hypercalls, and use some layer below to actually send them. It is the responsibility of this layer to detect if hypercalls are present and how to call them. Hey, I think the best place for it is in paravirt_ops. We can even patch the hypercall instruction inline, and the driver doesn't need to know about it. Yes, paravirt_ops is attractive for abstracting the hypercall calling mechanism but it's still necessary to figure out how hypercalls would be identified. I think it would be necessary to define a virtio-specific hypercall space and use the virtio device ID to claim subspaces. For instance, the hypercall number could be (virtio_devid << 16) | (call number). How that translates into a hypercall would then be part of the paravirt_ops abstraction. In KVM, we may have a single virtio hypercall where we pass the virtio hypercall number as one of the arguments or something like that. Not much of an argument, I know. wrt. number of queues, 8 queues will consume 32 bytes of pci space if all you store is the ring pfn. You also at least need a num argument which takes you to 48 or 64 depending on whether you care about strange formatting. 8 queues may not be enough either. Eric and I have discussed whether the 9p virtio device should support multiple mounts per-virtio device and if so, whether each one should have its own queue. Any device that supports this sort of multiplexing will very quickly start using a lot of queues. Make it appear as a pci function? (though my feeling is that multiple mounts should be different devices; we can then hotplug mountpoints). We may run out of PCI slots though :-/ Then we can start selling virtio extension chassis. :-) Do you know if there is a hard limit on the number of devices on a PCI bus? My concern was that it was limited by something stupid like an 8-bit identifier.
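[The (virtio_devid << 16) | (call number) carve-up mentioned above would look something like the macros below. These are hypothetical names sketching a possible numbering scheme, not an existing ABI.]

```c
#include <stdint.h>

/* The virtio device ID claims the upper 16 bits of the hypercall number;
 * the per-device call number occupies the lower 16. */
#define VIRTIO_HCALL(devid, call) \
    (((uint32_t)(devid) << 16) | ((uint32_t)(call) & 0xffff))
#define VIRTIO_HCALL_DEVID(nr) ((uint16_t)((nr) >> 16))
#define VIRTIO_HCALL_CALL(nr)  ((uint16_t)((nr) & 0xffff))
```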
Regards, Anthony Liguori - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Avi Kivity wrote: Anthony Liguori wrote: Well please propose the virtio API first and then I'll adjust the PCI ABI. I don't want to build things into the ABI that we never actually end up using in virtio :-) Move ->kick() to virtio_driver. Then on each kick, all queues have to be checked for processing? What devices do you expect this would help? I believe Xen networking uses the same event channel for both rx and tx, so in effect they're using this model. Long time since I looked though. I would have to look, but since rx/tx are rather independent actions, I'm not sure that you would really save that much. You still end up doing the same number of kicks unless I'm missing something. I was thinking more along the lines that a hypercall-based device would certainly be implemented in-kernel whereas the current device is naturally implemented in userspace. We can simply use a different device for in-kernel drivers than for userspace drivers. Where the device is implemented is an implementation detail that should be hidden from the guest; isn't that one of the strengths of virtualization? Two examples: a file-based block device implemented in qemu gives you fancy file formats with encryption and compression, while the same device implemented in the kernel gives you a low-overhead path directly to a zillion-disk SAN volume. Or a user-level network device capable of running with the slirp stack and no permissions vs. the kernel device running copyless most of the time and using a dma engine for the rest but requiring you to be good friends with the admin. The user should expect zero reconfigurations moving a VM from one model to the other. I'm wary of introducing the notion of hypercalls to this device because it makes the device VMM specific. Maybe we could have the device provide an option ROM that was treated as the device "BIOS" that we could use for kicking and interrupt acking? Any idea of how that would map to Windows?
Are there real PCI devices that use the option ROM space to provide what's essentially firmware? Unfortunately, I don't think an option ROM BIOS would map well to other architectures. None of the PCI devices currently work like that in QEMU. It would be very hard to make a device that worked this way because the order in which values are written matters a whole lot. For instance, if you wrote the status register before the queue information, the driver could get into a funky state. I assume you're talking about restore? Isn't that atomic? If you're doing restore by passing the PCI config blob to a registered routine, then sure, but that doesn't seem much better to me than just having the device generate that blob in the first place (which is what we have today). I was assuming that you would want to use the existing PIO/MMIO handlers to do restore by rewriting the config as if the guest were writing it. Not much of an argument, I know. wrt. number of queues, 8 queues will consume 32 bytes of pci space if all you store is the ring pfn. You also at least need a num argument which takes you to 48 or 64 depending on whether you care about strange formatting. 8 queues may not be enough either. Eric and I have discussed whether the 9p virtio device should support multiple mounts per-virtio device and if so, whether each one should have its own queue. Any device that supports this sort of multiplexing will very quickly start using a lot of queues. Make it appear as a pci function? (though my feeling is that multiple mounts should be different devices; we can then hotplug mountpoints). We may run out of PCI slots though :-/ I think most types of hardware have some notion of a selector or mode. Take a look at the LSI adapter or even VGA. True. They aren't fun to use, though.
I don't think they're really any worse :-) Regards, Anthony Liguori
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Avi Kivity wrote: Anthony Liguori wrote: Avi Kivity wrote: Anthony Liguori wrote: This is a PCI device that implements a transport for virtio. It allows virtio devices to be used by QEMU based VMMs like KVM or Xen. + +/* the notify function used when creating a virt queue */ +static void vp_notify(struct virtqueue *vq) +{ +struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev); +struct virtio_pci_vq_info *info = vq->priv; + +/* we write the queue's selector into the notification register to + * signal the other end */ +iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY); +} This means we can't kick multiple queues with one exit. There is no interface in virtio currently to batch multiple queue notifications so the only way one could do this AFAICT is to use a timer to delay the notifications. Were you thinking of something else? No. We can change virtio though, so let's have a flexible ABI. Well please propose the virtio API first and then I'll adjust the PCI ABI. I don't want to build things into the ABI that we never actually end up using in virtio :-) I'd also like to see a hypercall-capable version of this (but that can wait). That can be a different device. That means the user has to select which device to expose. With feature bits, the hypervisor advertises both pio and hypercalls, the guest picks whatever it wants. I was thinking more along the lines that a hypercall-based device would certainly be implemented in-kernel whereas the current device is naturally implemented in userspace. We can simply use a different device for in-kernel drivers than for userspace drivers. There's no point at all in doing a hypercall based userspace device IMHO. I don't think so. A vmexit is required to lower the IRQ line. It may be possible to do something clever like set a shared memory value that's checked on every vmexit. I think it's very unlikely that it's worth it though. Why so unlikely? Not all workloads will have good batching. 
It's pretty invasive. I think a more paravirt device that expected an edge triggered interrupt would be a better solution for those types of devices. +return ret; +} + +/* the config->find_vq() implementation */ +static struct virtqueue *vp_find_vq(struct virtio_device *vdev, unsigned index, +bool (*callback)(struct virtqueue *vq)) +{ +struct virtio_pci_device *vp_dev = to_vp_device(vdev); +struct virtio_pci_vq_info *info; +struct virtqueue *vq; +int err; +u16 num; + +/* Select the queue we're interested in */ +iowrite16(index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL); I would really like to see this implemented as pci config space, with no tricks like multiplexing several virtqueues on one register. Something like the PCI BARs where you have all the register numbers allocated statically to queues. My first implementation did that. I switched to using a selector because it reduces the amount of PCI config space used and does not limit the number of queues defined by the ABI as much. But... it's tricky, and it's nonstandard. With pci config, you can do live migration by shipping the pci config space to the other side. With the special iospace, you need to encode/decode it. None of the PCI devices currently work like that in QEMU. It would be very hard to make a device that worked this way because the order in which values are written matters a whole lot. For instance, if you wrote the status register before the queue information, the driver could get into a funky state. We'll still need save/restore routines for virtio devices. I don't really see this as a problem since we do this for every other device. Not much of an argument, I know. wrt. number of queues, 8 queues will consume 32 bytes of pci space if all you store is the ring pfn. You also at least need a num argument which takes you to 48 or 64 depending on whether you care about strange formatting. 8 queues may not be enough either.
Eric and I have discussed whether the 9p virtio device should support multiple mounts per-virtio device and if so, whether each one should have its own queue. Any device that supports this sort of multiplexing will very quickly start using a lot of queues. I think most types of hardware have some notion of a selector or mode. Take a look at the LSI adapter or even VGA. Regards, Anthony Liguori
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Avi Kivity wrote: Anthony Liguori wrote: This is a PCI device that implements a transport for virtio. It allows virtio devices to be used by QEMU based VMMs like KVM or Xen. + +/* the notify function used when creating a virt queue */ +static void vp_notify(struct virtqueue *vq) +{ +struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev); +struct virtio_pci_vq_info *info = vq->priv; + +/* we write the queue's selector into the notification register to + * signal the other end */ +iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY); +} This means we can't kick multiple queues with one exit. There is no interface in virtio currently to batch multiple queue notifications so the only way one could do this AFAICT is to use a timer to delay the notifications. Were you thinking of something else? I'd also like to see a hypercall-capable version of this (but that can wait). That can be a different device. + +/* A small wrapper to also acknowledge the interrupt when it's handled. + * I really need an EIO hook for the vring so I can ack the interrupt once we + * know that we'll be handling the IRQ but before we invoke the callback since + * the callback may notify the host which results in the host attempting to + * raise an interrupt that we would then mask once we acknowledged the + * interrupt. */ +static irqreturn_t vp_interrupt(int irq, void *opaque) +{ +struct virtio_pci_device *vp_dev = opaque; +struct virtio_pci_vq_info *info; +irqreturn_t ret = IRQ_NONE; +u8 isr; + +/* reading the ISR has the effect of also clearing it so it's very + * important to save off the value. */ +isr = ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR); Can this be implemented via shared memory? We're exiting now on every interrupt. I don't think so. A vmexit is required to lower the IRQ line. It may be possible to do something clever like set a shared memory value that's checked on every vmexit. I think it's very unlikely that it's worth it though. 
+return ret; +} + +/* the config->find_vq() implementation */ +static struct virtqueue *vp_find_vq(struct virtio_device *vdev, unsigned index, +bool (*callback)(struct virtqueue *vq)) +{ +struct virtio_pci_device *vp_dev = to_vp_device(vdev); +struct virtio_pci_vq_info *info; +struct virtqueue *vq; +int err; +u16 num; + +/* Select the queue we're interested in */ +iowrite16(index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL); I would really like to see this implemented as pci config space, with no tricks like multiplexing several virtqueues on one register. Something like the PCI BARs where you have all the register numbers allocated statically to queues. My first implementation did that. I switched to using a selector because it reduces the amount of PCI config space used and does not limit the number of queues defined by the ABI as much. + +/* Check if queue is either not available or already active. */ +num = ioread16(vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NUM); +if (!num || ioread32(vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN)) +return ERR_PTR(-ENOENT); + +/* allocate and fill out our structure that represents an active + * queue */ +info = kmalloc(sizeof(struct virtio_pci_vq_info), GFP_KERNEL); +if (!info) +return ERR_PTR(-ENOMEM); + +info->queue_index = index; +info->num = num; + +/* determine the memory needed for the queue and provide the memory + * location to the host */ +info->n_pages = DIV_ROUND_UP(vring_size(num), PAGE_SIZE); +info->pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, + get_order(info->n_pages)); +if (info->pages == NULL) { +err = -ENOMEM; +goto out_info; +} + +/* FIXME: is this sufficient for info->n_pages > 1?
*/ +info->queue = kmap(info->pages); +if (info->queue == NULL) { +err = -ENOMEM; +goto out_alloc_pages; +} + +/* activate the queue */ +iowrite32(page_to_pfn(info->pages), + vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); + +/* create the vring */ +vq = vring_new_virtqueue(info->num, vdev, info->queue, + vp_notify, callback); +if (!vq) { +err = -ENOMEM; +goto out_activate_queue; +} + +vq->priv = info; +info->vq = vq; + +spin_lock(&vp_dev->lock); +list_add(&info->node, &vp_dev->virtqueues); +spin_unlock(&vp_dev->lock); + Is this run only on init? If so the lock isn't needed. Yes, it's also not strictly needed on cleanup I think. I left it there though for clarity. I can remove. Regards, Anthony Liguori
Re: [PATCH] virtio config_ops refactoring
Rusty Russell wrote: On Saturday 10 November 2007 10:45:38 Anthony Liguori wrote: The problem is the ABI. We can either require that PCI configuration values are accessed with natural instructions, or it makes very little sense to use the PCI configuration space for virtio configuration information. To me it seems logical and simplest to allow any accesses, lay out your structure and emulate it in the obvious way. Okay, I've got some updates that I'm going to send out now to the PCI virtio driver but I'll also change it to switch over to a memory layout. It's not the best PCI ABI but I can certainly live with it. You can put the configuration information somewhere else, but the PCI config space seems the logical place. If we're treating the PCI config space as an opaque memory blob, instead of as distinct registers, I think it makes more sense to just put it in memory. In the backend, I have to treat it as a memory blob anyway and using the PCI config space just means that I have to write all the various PIO handlers for the different access sizes. It's much more elegant in my mind just to have the driver provide some memory that the host fills out. Thanks for all the review, Anthony Liguori Either virtio config looks like a shared memory area (as lguest currently implements it), or it looks like hardware registers (like virtio-pci implements it). After thinking about it for a while, I don't think the two can be reconciled. There are subtle differences between the two that can't be hidden in the virtio interface. For instance, in the PCI model, you get notified when values are read/written whereas in the lguest model, you don't and need explicit status bits. No. You need those status bits even if you have your register model, otherwise you can't tell when the configuration as a whole is stable. Consider the feature bits. Worse, consider extending the feature bits beyond 32 bits. 
(We will have to deal with dynamic configuration changes in future; I was planning on using some status bits. But it's pretty clear that it's going to require an explicit ack of some form.) Hope that clarifies, Rusty.
Re: [PATCH] virtio config_ops refactoring
Rusty Russell wrote: On Friday 09 November 2007 09:33:04 Anthony Liguori wrote: switch (addr) { case VIRTIO_BLK_CONFIG_MAX_SEG: return vdev->max_seg & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SEG + 1: return (vdev->max_seg >> 8) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SEG + 2: return (vdev->max_seg >> 16) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SEG + 3: return (vdev->max_seg >> 24) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE: return vdev->max_size & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE + 1: return (vdev->max_size >> 8) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE + 2: return (vdev->max_size >> 16) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE + 3: return (vdev->max_size >> 24) & 0xFF; ... struct virtio_blk_config { uint32_t max_seg, max_size; }; ... struct virtio_blk_config conf = { vdev->max_seg, vdev->max_size }; return ((unsigned char *)&conf)[addr]; (Which strongly implies our headers should expose that nominal struct, rather than numerical constants). The problem is the ABI. We can either require that PCI configuration values are accessed with natural instructions, or it makes very little sense to use the PCI configuration space for virtio configuration information. If we really can't find a way to do this (and I think my current implementation is the best compromise since it hides this from everything else), then I think I'll switch over to just writing a PFN into a PCI configuration slot and then have that page store the virtio configuration information (much like is done with lguest). Either virtio config looks like a shared memory area (as lguest currently implements it), or it looks like hardware registers (like virtio-pci implements it). After thinking about it for a while, I don't think the two can be reconciled. There are subtle differences between the two that can't be hidden in the virtio interface. For instance, in the PCI model, you get notified when values are read/written whereas in the lguest model, you don't and need explicit status bits. 
If you're very against the switch() magic, then I'll switch over to just using a shared memory area. Regards, Anthony Liguori Rusty.
Re: [Lguest] [PATCH] virtio config_ops refactoring
Dor Laor wrote: ron minnich wrote: Hi, I'm sorry, I've been stuck on other things (NFS RDMA anyone?) and missed part of this discussion. Is it really the case that operations on virtio devices will involve outl/inl etc.? What's the problem with them? Except for the kick event it's not performance critical and the difference between pio vmexit and hypercall exit is very small. If you're a nutty guy who's interested in creating an absolutely minimal VMM to run exotic paravirtual guests on massive clusters, then requiring PIO implies that you have to have an instruction decoder, which goes against the earlier goal ;-) Regards, Anthony Liguori I don't know about problems in other architectures, maybe mmio is better? Dor. Apologies in advance for my failure to pay attention. thanks ron ___ Lguest mailing list [EMAIL PROTECTED] https://ozlabs.org/mailman/listinfo/lguest
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Dor Laor wrote: Anthony Liguori wrote: This is a PCI device that implements a transport for virtio. It allows virtio devices to be used by QEMU based VMMs like KVM or Xen. While it's a little premature, we can start thinking of irq path improvements. The current patch acks a private isr and afterwards apic eoi will also be hit since it's a level triggered irq. This means 2 vmexits per irq. We can start with regular pci irqs and move afterwards to msi. Some other ugly hack options [we'd better use msi]: - Read the eoi directly from apic and save the first private isr ack I must admit, that I don't know a whole lot about interrupt delivery. If we can avoid the private ISR ack then that would certainly be a good thing to do! I think that would involve adding another bit to the virtqueues to indicate whether or not there is work to be handled. It's really just moving the ISR to shared memory so that there's no penalty for accessing it. Regards, Anthony Liguori - Convert the specific irq line to edge triggered and don't share it What do you guys think? +/* A small wrapper to also acknowledge the interrupt when it's handled. + * I really need an EIO hook for the vring so I can ack the interrupt once we + * know that we'll be handling the IRQ but before we invoke the callback since + * the callback may notify the host which results in the host attempting to + * raise an interrupt that we would then mask once we acknowledged the + * interrupt. */ +static irqreturn_t vp_interrupt(int irq, void *opaque) +{ +struct virtio_pci_device *vp_dev = opaque; +struct virtio_pci_vq_info *info; +irqreturn_t ret = IRQ_NONE; +u8 isr; + +/* reading the ISR has the effect of also clearing it so it's very + * important to save off the value.
*/ +isr = ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR); + +/* It's definitely not us if the ISR was not high */ +if (!isr) +return IRQ_NONE; + +spin_lock(&vp_dev->lock); +list_for_each_entry(info, &vp_dev->virtqueues, node) { +if (vring_interrupt(irq, info->vq) == IRQ_HANDLED) +ret = IRQ_HANDLED; +} +spin_unlock(&vp_dev->lock); + +return ret; +}
Re: [Lguest] [PATCH] virtio config_ops refactoring
ron minnich wrote: Hi, I'm sorry, I've been stuck on other things (NFS RDMA anyone?) and missed part of this discussion. Is it really the case that operations on virtio devices will involve outl/inl etc.? No, this is just for the PCI virtio transport. lguest's virtio transport uses hypercalls and shared memory. Regards, Anthony Liguori Apologies in advance for my failure to pay attention. thanks ron
Re: [PATCH] virtio config_ops refactoring
Rusty Russell wrote: On Thursday 08 November 2007 13:41:16 Anthony Liguori wrote: Rusty Russell wrote: On Thursday 08 November 2007 04:30:50 Anthony Liguori wrote: I would prefer that the virtio API not expose a little endian standard. I'm currently converting config->get() ops to ioreadXX depending on the size which already does the endianness conversion for me so this just messes things up. I think it's better to let the backend deal with endianness since it's trivial to handle for both the PCI backend and the lguest backend (lguest doesn't need to do any endianness conversion). -ETOOMUCHMAGIC. We should either expose all the XX interfaces (but this isn't a high-speed interface, so let's not) or not "sometimes" convert endianness. Getting surprises because a field happens to be packed into 4 bytes is counter-intuitive. Then I think it's necessary to expose the XX interfaces. Otherwise, the backend has to deal with doing all register operations at a per-byte granularity which adds a whole lot of complexity on a per-device basis (as opposed to a little complexity once in the transport layer). Huh? Take a look at the drivers, this simply isn't true. Do you have evidence that it will be true later? I'm a bit confused. So right now, the higher level virtio functions do endianness conversion. I really want to make sure that if a guest tries to read a 4-byte PCI config field, that it does so using an "outl" instruction so that in my QEMU backend, I don't have to deal with a guest reading/writing a single byte within a 4-byte configuration field. 
It's the difference between having in the PIO handler: switch (addr) { case VIRTIO_BLK_CONFIG_MAX_SEG: return vdev->max_seg; case VIRTIO_BLK_CONFIG_MAX_SIZE: return vdev->max_size; } and: switch (addr) { case VIRTIO_BLK_CONFIG_MAX_SEG: return vdev->max_seg & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SEG + 1: return (vdev->max_seg >> 8) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SEG + 2: return (vdev->max_seg >> 16) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SEG + 3: return (vdev->max_seg >> 24) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE: return vdev->max_size & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE + 1: return (vdev->max_size >> 8) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE + 2: return (vdev->max_size >> 16) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE + 3: return (vdev->max_size >> 24) & 0xFF; ... } It's the host-side code I'm concerned about, not the guest-side code. I'm happy to just ignore the whole endianness conversion thing and always pass values through in the CPU bitness but it's very important to me that the PCI config registers are accessed with their natural sized instructions (just as they would with a real PCI device). Regards, Anthony Liguori Plus your code will be smaller doing a single writeb/readb loop than trying to do a switch statement. You really want to be able to rely on multi-byte atomic operations too when setting values. Otherwise, you need another register just to signal when it's okay for the device to examine any given register. You already do; the status register fills this role. For example, you can't tell what features a guest understands until it updates the status register. Hope that clarifies, Rusty.
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Arnd Bergmann wrote: On Thursday 08 November 2007, Anthony Liguori wrote: +/* A PCI device has its own struct device and so does a virtio device so + * we create a place for the virtio devices to show up in sysfs. I think it + * would make more sense for virtio to not insist on having its own device. */ +static struct device virtio_pci_root = { + .parent = NULL, + .bus_id = "virtio-pci", +}; + +/* Unique numbering for devices under the kvm root */ +static unsigned int dev_index; + ... +/* the PCI probing function */ +static int __devinit virtio_pci_probe(struct pci_dev *pci_dev, + const struct pci_device_id *id) +{ + struct virtio_pci_device *vp_dev; + int err; + + /* allocate our structure and fill it out */ + vp_dev = kzalloc(sizeof(struct virtio_pci_device), GFP_KERNEL); + if (vp_dev == NULL) + return -ENOMEM; + + vp_dev->pci_dev = pci_dev; + vp_dev->vdev.dev.parent = &virtio_pci_root; If you use vp_dev->vdev.dev.parent = &pci_dev->dev; Then there is no need for the special kvm root device, and the actual virtio device shows up in a more logical place, under where it is really (virtually) attached. They already show up underneath the PCI bus. The issue is that there are two separate 'struct device's for each virtio device. There's the PCI device (that's part of the pci_dev structure) and then there's the virtio_device one. I thought that setting the dev.parent of the virtio_device struct device would result in having two separate entries under the PCI bus directory which would be pretty confusing :-) Regards, Anthony Liguori Arnd <><
Re: Use of virtio device IDs
Gerd Hoffmann wrote: Avi Kivity wrote: You are probably better off designing something that is PV specific instead of shoehorning it in to fit a different model (at least for the things I have in mind). Well, if we design our pv devices to look like hardware, they will fit quite well. Both to the guest OS and to user's expectations. Disclaimer: Haven't looked at the virtio code much. I think we should keep the door open for both models and don't nail the virtio infrastructure to one of them. For pure pv devices I don't see the point in trying to squeeze it into the PCI model. Also s390 has no PCI, so there effectively is no way around that; we must be able to have some pure virtual bus like xenbus. I don't really agree with this assessment. There is no performance advantage to using a pure virtual bus. If you have a pure pv device that looks and acts like a PCI device, besides the obvious advantage of easy portability to other guest OSes (since everything supports PCI, but porting XenBus, even to Linux 2.4.x, was a royal pain), it is also very easy to support the device on other VMMs. For instance, the PCI device that I just posted would allow virtio devices to be used trivially with HVM on Xen. In fact, once the backends are complete and merged into QEMU, the next time Xen rebases QEMU they'll get the virtio PV-on-HVM drivers for free. To me, that's a pretty significant advantage. Uhm, well, yea. Guess you are referring to the pv-on-hvm drivers. Been there, dealt with it. What exactly do you think is messy there? IMHO the most messy thing is the boot problem. hvm bios can't deal with pv disks, so you can't boot with pv disks only. "fixed" by having the (boot) disk twice in the system, once via emulated ide, once as pv disk. Ouch. I have actually addressed this problem with a PV option rom for QEMU. I expect to get time to submit the QEMU patches by the end of the year.
See http://hg.codemonkey.ws/extboot Regards, Anthony Liguori
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Avi Kivity wrote: If a pci device is capable of dma (or issuing interrupts), it will be useless with pv pci. Hrm, I think we may be talking about different things. Are you thinking that the driver I posted allows you to do PCI pass-through over virtio? That's not what it is. The driver I posted is a virtio implementation that uses a PCI device. This lets you use virtio-blk and virtio-net under KVM. The alternative to this virtio PCI device would be a virtio transport built with hypercalls like lguest has. I chose a PCI device because it ensured that each virtio device showed up like a normal PCI device. Am I misunderstanding what you're asking about? Regards, Anthony Liguori I think that with Amit's pvdma patches you can support dma-capable devices as well without too much fuss. What is the use case you're thinking of? A semi-paravirt driver that does dma directly to a device? No, an unmodified driver that, by using clever tricks with dma_ops, can do dma directly to guest memory. See Amit's patches. In fact, why do a virtio transport at all? It can be done either with trap'n'emulate, or by directly mapping the device mmio space into the guest. (what use case are you considering? devices without interrupts and dma? pci door stoppers?)
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Avi Kivity wrote: Anthony Liguori wrote: This is a PCI device that implements a transport for virtio. It allows virtio devices to be used by QEMU based VMMs like KVM or Xen. Didn't see support for dma. Not sure what you're expecting there. Using dma_ops in virtio_ring? I think that with Amit's pvdma patches you can support dma-capable devices as well without too much fuss. What is the use case you're thinking of? A semi-paravirt driver that does dma directly to a device? Regards, Anthony Liguori
[PATCH 3/3] virtio PCI device
This is a PCI device that implements a transport for virtio. It allows virtio devices to be used by QEMU based VMMs like KVM or Xen. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index 9e33fc4..c81e0f3 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -6,3 +6,20 @@ config VIRTIO config VIRTIO_RING bool depends on VIRTIO + +config VIRTIO_PCI + tristate "PCI driver for virtio devices (EXPERIMENTAL)" + depends on PCI && EXPERIMENTAL + select VIRTIO + select VIRTIO_RING + ---help--- + This drivers provides support for virtio based paravirtual device + drivers over PCI. This requires that your VMM has appropriate PCI + virtio backends. Most QEMU based VMMs should support these devices + (like KVM or Xen). + + Currently, the ABI is not considered stable so there is no guarantee + that this version of the driver will work with your VMM. + + If unsure, say M. + diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile index f70e409..cc84999 100644 --- a/drivers/virtio/Makefile +++ b/drivers/virtio/Makefile @@ -1,2 +1,3 @@ obj-$(CONFIG_VIRTIO) += virtio.o obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o +obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c new file mode 100644 index 000..85ae096 --- /dev/null +++ b/drivers/virtio/virtio_pci.c @@ -0,0 +1,469 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Anthony Liguori <[EMAIL PROTECTED]>"); +MODULE_DESCRIPTION("virtio-pci"); +MODULE_LICENSE("GPL"); +MODULE_VERSION("1"); + +/* Our device structure */ +struct virtio_pci_device +{ + /* the virtio device */ + struct virtio_device vdev; + /* the PCI device */ + struct pci_dev *pci_dev; + /* the IO mapping for the PCI config space */ + void *ioaddr; + + spinlock_t lock; + struct list_head virtqueues; +}; + +struct virtio_pci_vq_info +{ + /* the number of entries in 
the queue */ + int num; + /* the number of pages the device needs for the ring queue */ + int n_pages; + /* the index of the queue */ + int queue_index; + /* the struct page of the ring queue */ + struct page *pages; + /* the virtual address of the ring queue */ + void *queue; + /* a pointer to the virtqueue */ + struct virtqueue *vq; + /* the node pointer */ + struct list_head node; +}; + +/* We have to enumerate here all virtio PCI devices. */ +static struct pci_device_id virtio_pci_id_table[] = { + { 0x5002, 0x2258, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 }, /* Dummy entry */ + { 0 }, +}; + +MODULE_DEVICE_TABLE(pci, virtio_pci_id_table); + +/* A PCI device has it's own struct device and so does a virtio device so + * we create a place for the virtio devices to show up in sysfs. I think it + * would make more sense for virtio to not insist on having it's own device. */ +static struct device virtio_pci_root = { + .parent = NULL, + .bus_id = "virtio-pci", +}; + +/* Unique numbering for devices under the kvm root */ +static unsigned int dev_index; + +/* Convert a generic virtio device to our structure */ +static struct virtio_pci_device *to_vp_device(struct virtio_device *vdev) +{ + return container_of(vdev, struct virtio_pci_device, vdev); +} + +/* virtio config->feature() implementation */ +static bool vp_feature(struct virtio_device *vdev, unsigned bit) +{ + struct virtio_pci_device *vp_dev = to_vp_device(vdev); + u32 mask; + + /* Since this function is supposed to have the side effect of +* enabling a queried feature, we simulate that by doing a read +* from the host feature bitmask and then writing to the guest +* feature bitmask */ + mask = ioread32(vp_dev->ioaddr + VIRTIO_PCI_HOST_FEATURES); + if (mask & (1 << bit)) { + mask |= (1 << bit); + iowrite32(mask, vp_dev->ioaddr + VIRTIO_PCI_GUEST_FEATURES); + } + + return !!(mask & (1 << bit)); +} + +/* virtio config->get() implementation */ +static void vp_get(struct virtio_device *vdev, unsigned offset, + void *buf, 
unsigned len) +{ + struct virtio_pci_device *vp_dev = to_vp_device(vdev); + void *ioaddr = vp_dev->ioaddr + VIRTIO_PCI_CONFIG + offset; + + /* We translate appropriately sized get requests into more natural +* IO operations. These functions also take care of endianness +* conversion. */ + switch (len) { + case 1: { + u8 val; + val = ioread8(ioaddr); + memcpy(
[PATCH 2/3] Put the virtio under the virtualization menu
This patch moves virtio under the virtualization menu and changes virtio devices to not claim to only be for lguest. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/Kconfig b/drivers/Kconfig index f4076d9..d945ffc 100644 --- a/drivers/Kconfig +++ b/drivers/Kconfig @@ -93,6 +93,4 @@ source "drivers/auxdisplay/Kconfig" source "drivers/kvm/Kconfig" source "drivers/uio/Kconfig" - -source "drivers/virtio/Kconfig" endmenu diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index 4d0119e..be4b224 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -429,6 +429,7 @@ config VIRTIO_BLK tristate "Virtio block driver (EXPERIMENTAL)" depends on EXPERIMENTAL && VIRTIO ---help--- - This is the virtual block driver for lguest. Say Y or M. + This is the virtual block driver for virtio. It can be used with + lguest or QEMU based VMMs (like KVM or Xen). Say Y or M. endif # BLK_DEV diff --git a/drivers/kvm/Kconfig b/drivers/kvm/Kconfig index 6569206..ac4bcdf 100644 --- a/drivers/kvm/Kconfig +++ b/drivers/kvm/Kconfig @@ -50,5 +50,6 @@ config KVM_AMD # OK, it's a little counter-intuitive to do this, but it puts it neatly under # the virtualization menu. source drivers/lguest/Kconfig +source drivers/virtio/Kconfig endif # VIRTUALIZATION diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 86b8641..e66aec4 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -3107,6 +3107,7 @@ config VIRTIO_NET tristate "Virtio network driver (EXPERIMENTAL)" depends on EXPERIMENTAL && VIRTIO ---help--- - This is the virtual network driver for lguest. Say Y or M. + This is the virtual network driver for virtio. It can be used with + lguest or QEMU based VMMs (like KVM or Xen). Say Y or M. endif # NETDEVICES
[PATCH 1/3] Export vring functions for modules to use
This is needed for the virtio PCI device to be compiled as a module. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 0e1bf05..3f28b47 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -260,6 +260,8 @@ irqreturn_t vring_interrupt(int irq, void *_vq) return IRQ_HANDLED; } +EXPORT_SYMBOL_GPL(vring_interrupt); + static struct virtqueue_ops vring_vq_ops = { .add_buf = vring_add_buf, .get_buf = vring_get_buf, @@ -306,8 +308,12 @@ struct virtqueue *vring_new_virtqueue(unsigned int num, return &vq->vq; } +EXPORT_SYMBOL_GPL(vring_new_virtqueue); + void vring_del_virtqueue(struct virtqueue *vq) { kfree(to_vvq(vq)); } +EXPORT_SYMBOL_GPL(vring_del_virtqueue); +
[PATCH 0/3] virtio PCI driver
This patch series implements a PCI driver for virtio. This allows virtio devices (like block and network) to be used in QEMU/KVM. I'll post a very early KVM userspace backend in kvm-devel for those that are interested. This series depends on the two virtio fixes I've posted and Rusty's config_ops refactoring. I've tested with these patches on Rusty's experimental virtio tree.
Re: [PATCH] virtio config_ops refactoring
Rusty Russell wrote: On Thursday 08 November 2007 04:30:50 Anthony Liguori wrote: I would prefer that the virtio API not expose a little endian standard. I'm currently converting config->get() ops to ioreadXX depending on the size which already does the endianness conversion for me so this just messes things up. I think it's better to let the backend deal with endianness since it's trivial to handle for both the PCI backend and the lguest backend (lguest doesn't need to do any endianness conversion). -ETOOMUCHMAGIC. We should either expose all the XX interfaces (but this isn't a high-speed interface, so let's not) or not "sometimes" convert endianness. Getting surprises because a field happens to be packed into 4 bytes is counter-intuitive. Then I think it's necessary to expose the XX interfaces. Otherwise, the backend has to deal with doing all register operations at a per-byte granularity which adds a whole lot of complexity on a per-device basis (as opposed to a little complexity once in the transport layer). You really want to be able to rely on multi-byte atomic operations too when setting values. Otherwise, you need another register just to signal when it's okay for the device to examine any given register. Regards, Anthony Liguori Since your most trivial implementation is to do a byte at a time, I don't think you have a good argument on that basis either. Cheers, Rusty.
Re: [PATCH][VIRTIO] Fix vring_init() ring computations
Rusty Russell wrote: On Wednesday 07 November 2007 13:52:29 Anthony Liguori wrote: This patch fixes a typo in vring_init(). Thanks, applied. I've put it in the new, experimental virtio git tree on git.kernel.org. Hrm, perhaps you forgot to push? I don't see it in the tree although I see the config ops refactoring. Regards, Anthony Liguori Cheers, Rusty.
[PATCH] Fix used_idx wrap-around in virtio
The more_used() function compares the vq->vring.used->idx with last_used_idx. Since vq->vring.used->idx is a 16-bit integer, and last_used_idx is an unsigned int, this results in unpredictable behavior when vq->vring.used->idx wraps around. This patch corrects this by changing last_used_idx to the correct type. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 0e4baca..0e1bf05 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -53,7 +53,7 @@ struct vring_virtqueue unsigned int num_added; /* Last used index we've seen. */ - unsigned int last_used_idx; + u16 last_used_idx; /* How to notify other side. FIXME: commonalize hcalls! */ void (*notify)(struct virtqueue *vq);
Re: Use of virtio device IDs
Rusty Russell wrote: On Wednesday 07 November 2007 16:40:13 Avi Kivity wrote: Gregory Haskins wrote: but FWIW: This is a major motivation for the reason that the IOQ stuff I posted a while back used strings for device identification instead of a fixed length, centrally managed namespace like PCI vendor/dev-id. Then you can just name your device something reasonably unique (e.g. "qumranet::veth", or "ibm-pvirt-clock"). I dislike strings. They make it look as if you have a nice extensible interface, where in reality you have a poorly documented interface which leads to poor interoperability. Yes, you end up with exactly names like "qumranet::veth" and "ibm-pvirt-clock". I would recommend looking very hard at /proc, Open Firmware on a modern system, or the Xen store, to see what a lack of limitation can do to you :) FWIW, I've switched to using the PCI subsystem vendor/device IDs for virtio which Rusty suggested. I think this makes even more sense than using the main vendor/device ID since I do think that we only should use a single vendor/device ID for all virtio PCI devices and then differentiate based on the subsystem IDs. Regards, Anthony Liguori We will support non-pci for s390, but in order to support Windows and older Linux PCI is necessary. The aim is that PCI support is clean, but that we're not really tied to PCI. I think we're getting closer with the recent config changes. Cheers, Rusty.
Re: [PATCH] virtio config_ops refactoring
Rusty Russell wrote: After discussion with Anthony, the virtio config has been simplified. We lose some minor features (the virtio_net address must now be 6 bytes) but it turns out to be a wash in terms of complexity, while simplifying PCI. Hi Rusty, Thanks for posting this! It's really simplified things for me. - /** - * virtio_use_bit - helper to use a feature bit in a bitfield value. - * @dev: the virtio device - * @token: the token as returned from vdev->config->find(). - * @len: the length of the field. - * @bitnum: the bit to test. + * __virtio_config_val - get a single virtio config without feature check. + * @vdev: the virtio device + * @offset: the type to search for. + * @val: a pointer to the value to fill in. * - * If handed a NULL token, it returns false, otherwise returns bit status. - * If it's one, it sets the mirroring acknowledgement bit. */ -int virtio_use_bit(struct virtio_device *vdev, - void *token, unsigned int len, unsigned int bitnum); + * The value is endian-corrected and returned in v. */ +#define __virtio_config_val(vdev, offset, v) do { \ + BUILD_BUG_ON(sizeof(*(v)) != 1 && sizeof(*(v)) != 2 \ +&& sizeof(*(v)) != 4 && sizeof(*(v)) != 8);\ + (vdev)->config->get((vdev), (offset), (v), sizeof(*(v))); \ + switch (sizeof(*(v))) { \ + case 2: le16_to_cpus((__u16 *) v); break; \ + case 4: le32_to_cpus((__u32 *) v); break; \ + case 8: le64_to_cpus((__u64 *) v); break; \ + } \ +} while(0) I would prefer that the virtio API not expose a little endian standard. I'm currently converting config->get() ops to ioreadXX depending on the size which already does the endianness conversion for me so this just messes things up. I think it's better to let the backend deal with endianness since it's trivial to handle for both the PCI backend and the lguest backend (lguest doesn't need to do any endianness conversion). 
Regards, Anthony Liguori #endif /* __KERNEL__ */ #endif /* _LINUX_VIRTIO_CONFIG_H */ diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index ae469ae..8bf1602 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -5,16 +5,16 @@ /* The ID for virtio_net */ #define VIRTIO_ID_NET 1 -/* The bitmap of config for virtio net */ -#define VIRTIO_CONFIG_NET_F 0x40 +/* The feature bitmap for virtio net */ #define VIRTIO_NET_F_NO_CSUM 0 #define VIRTIO_NET_F_TSO4 1 #define VIRTIO_NET_F_UFO 2 #define VIRTIO_NET_F_TSO4_ECN 3 #define VIRTIO_NET_F_TSO6 4 +#define VIRTIO_NET_F_MAC 5 -/* The config defining mac address. */ -#define VIRTIO_CONFIG_NET_MAC_F 0x41 +/* The config defining mac address (6 bytes) */ +#define VIRTIO_CONFIG_NET_MAC_F 0 /* This is the first element of the scatter-gather list. If you don't * specify GSO or CSUM features, you can simply ignore the header. */
Re: Use of virtio device IDs
Rusty Russell wrote: On Wednesday 07 November 2007 16:40:13 Avi Kivity wrote: Gregory Haskins wrote: but FWIW: This is a major motivation for the reason that the IOQ stuff I posted a while back used strings for device identification instead of a fixed length, centrally managed namespace like PCI vendor/dev-id. Then you can just name your device something reasonably unique (e.g. "qumranet::veth", or "ibm-pvirt-clock"). I dislike strings. They make it look as if you have a nice extensible interface, where in reality you have a poorly documented interface which leads to poor interoperability. Yes, you end up with exactly names like "qumranet::veth" and "ibm-pvirt-clock". I would recommend looking very hard at /proc, Open Firmware on a modern system, or the Xen store, to see what a lack of limitation can do to you :) We will support non-pci for s390, but in order to support Windows and older Linux PCI is necessary. The aim is that PCI support is clean, but that we're not really tied to PCI. I think we're getting closer with the recent config changes. Yes, my main desire was to ensure that we had a clean PCI ABI that would be natural to implement on a platform like Windows. I think with the recent config_ops refactoring, we can now do that. Regards, Anthony Liguori Cheers, Rusty.
[PATCH][VIRTIO] Fix vring_init() ring computations
This patch fixes a typo in vring_init(). This happens to work today in lguest because the sizeof(struct vring_desc) is 16 and struct vring contains 3 pointers and an unsigned int so on 32-bit sizeof(struct vring_desc) == sizeof(struct vring). However, this is no longer true on 64-bit where the bug is exposed. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h index ac69e7b..5b88d21 100644 --- a/include/linux/virtio_ring.h +++ b/include/linux/virtio_ring.h @@ -92,8 +92,8 @@ static inline void vring_init(struct vring *vr, unsigned int num, void *p) { vr->num = num; vr->desc = p; - vr->avail = p + num*sizeof(struct vring); - vr->used = p + (num+1)*(sizeof(struct vring) + sizeof(__u16)); + vr->avail = p + num*sizeof(struct vring_desc); + vr->used = p + (num+1)*(sizeof(struct vring_desc) + sizeof(__u16)); } static inline unsigned vring_size(unsigned int num)
Re: virtio config_ops refactoring
Rusty Russell wrote: On Wednesday 07 November 2007 04:48:35 Anthony Liguori wrote: Semantically, find requires that a field have both a type and a length. With the exception of the VIRTQUEUE field used internally by lguest, type is always a unique identifier. Since virtqueue information is not a required part of the config space, it seems to me that type really should be treated as a unique identifier. Hi Anthony, Not sure I get this. It is a unique identifier. You need the length to handle unknown fields. It's not a unique identifier since it can be used for multiple items (like it is for virtqueue configs). find_vq also is curious in that it is stateful in its enumeration. Well, they're *all* stateful. This gives a simple method of knowing what fields the guest understands: it marks the fields as it finds them. Then it sets the status, which allows the host to know when it's completed configuration reads. But PCI device configuration is not stateful. If you care about letting the host know what features a guest understands, I think something more explicit and stateful should be used. For instance, a feature register that stores a bitmap. Otherwise, the host has to infer based on what fields the guest has read what features the guest actually supports. That seems error prone to me. I like enumerating the virtqueues: it's not necessary but it's clearer. This adds seemingly unnecessary complexity. I'd be happy for a simpler mechanism... What do you think of what I proposed? It seems simpler to me. Regards, Anthony Liguori Cheers, Rusty.
Re: Use of virtio device IDs
Anthony Liguori wrote: Hi Rusty, I've written a PCI virtio transport and noticed something strange. All current in-tree virtio devices register ID tables that match a specific device ID, but any vendor ID. This is incompatible with using PCI vendor/device IDs for virtio vendor/device IDs since vendors control what device IDs mean. A simple solution would be to assign a fixed vendor ID to all current virtio devices. This doesn't solve the problem completely though since you would create a conflict between the PCI vendor ID space and the virtio vendor ID space. The only solutions seem to be virtualizing the virtio vendor/device IDs (which is what I'm currently doing) or to mandate that the virtio vendor ID be within the PCI vendor ID space. It's probably not necessary to make the same requirement for device IDs though. There's another ugly bit in the current implementation. Right now, we would have to have every PCI vendor/device ID pair in the virtio PCI driver ID table for every virtio device. This means every time a virtio device is added to Linux, the virtio PCI driver has to be modified (assuming that each virtio device uses a unique PCI vendor/device ID) :-/ Regards, Anthony Liguori What are your thoughts? Regards, Anthony Liguori
virtio config_ops refactoring
Hi, The current virtio config_ops do not map very well to the PCI config space. Here they are: struct virtio_config_ops { void *(*find)(struct virtio_device *vdev, u8 type, unsigned *len); void (*get)(struct virtio_device *vdev, void *token, void *buf, unsigned len); void (*set)(struct virtio_device *vdev, void *token, const void *buf, unsigned len); u8 (*get_status)(struct virtio_device *vdev); void (*set_status)(struct virtio_device *vdev, u8 status); struct virtqueue *(*find_vq)(struct virtio_device *vdev, bool (*callback)(struct virtqueue *)); void (*del_vq)(struct virtqueue *vq); }; Semantically, find requires that a field have both a type and a length. With the exception of the VIRTQUEUE field used internally by lguest, type is always a unique identifier. Since virtqueue information is not a required part of the config space, it seems to me that type really should be treated as a unique identifier. find_vq also is curious in that it is stateful in it's enumeration. This adds seemingly unnecessary complexity. The result is that in my PCI virtio implementation, I have to use an opaque blob that's self describing and variable length in the PCI config space. This is not very natural at all for a PCI device! Here is how I propose changing the config_ops: struct virtio_config_ops { bool (*get)(struct virtio_device *vdev, unsigned id, void *buf, unsigned len); bool (*set)(struct virtio_device *vdev, unsigned id, const void *buf, unsigned len); u8 (*get_status)(struct virtio_device *vdev); void (*set_status)(struct virtio_device *vdev, u8 status); struct virtqueue *(*get_vq)(struct virtio_device *vdev, unsigned index, bool (*callback)(struct virtqueue *)); void (*del_vq)(struct virtqueue *vq); }; config_ops->get() returns false if the id is invalid as does ->set(). get_vq() returns NULL if the index is not a valid virtqueue (and we'll mandate that all virtqueues are in order starting at 0). 
The id space is defined by the driver itself but is guaranteed to be non-overlapping and starting from 0. For instance, the block device would have: /* The capacity (in 512-byte sectors). */ #define VIRTIO_CONFIG_BLK_F_CAPACITY 0x00 /* The maximum segment size. */ #define VIRTIO_CONFIG_BLK_F_SIZE_MAX 0x08 /* The maximum number of segments. */ #define VIRTIO_CONFIG_BLK_F_SEG_MAX 0x12 This maps very well to the PCI config space. For lguest, the actual config items would simply be packed in a single page. Since lguest uses the config space for virtqueues, lguest_device_desc will have to be changed to also contain an array of virtqueue infos. It doesn't prevent an implementation from using the id's as opaque keys though (as one might do for Xen since presumably the configuration data would be represented within xenbus). If there's agreement that this approach is better, I'll start submitting patches. Regards, Anthony Liguori
Use of virtio device IDs
Hi Rusty, I've written a PCI virtio transport and noticed something strange. All current in-tree virtio devices register ID tables that match a specific device ID, but any vendor ID. This is incompatible with using PCI vendor/device IDs for virtio vendor/device IDs since vendors control what device IDs mean. A simple solution would be to assign a fixed vendor ID to all current virtio devices. This doesn't solve the problem completely though since you would create a conflict between the PCI vendor ID space and the virtio vendor ID space. The only solutions seem to be virtualizing the virtio vendor/device IDs (which is what I'm currently doing) or to mandate that the virtio vendor ID be within the PCI vendor ID space. It's probably not necessary to make the same requirement for device IDs though. What are your thoughts? Regards, Anthony Liguori
Re: [kvm-devel] [PATCH 2/3] Refactor hypercall infrastructure (v3)
Avi Kivity wrote: Avi Kivity wrote: Avi Kivity wrote: Anthony Liguori wrote: This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. It also introduces the infrastructure to probe for hypercalls available via CPUID leaves at 0x40000000. CPUID leaf 0x40000001 should be filled out by userspace. A fall-out of this patch is that the unhandled hypercalls no longer trap to userspace. There is very little reason though to use a hypercall to communicate with userspace as PIO or MMIO can be used. There is no code in tree that uses userspace hypercalls. Surprisingly, this patch kills Windows XP (ACPI HAL). I'll try to find out why. Not trapping #UD brings things back to normal. So, Windows likes to execute undefined instructions, and we don't handle these well. Okay, vmx_inject_ud() was broken. Fixed now. Yeah, I was just about to send the patch for that. Sorry about that, I didn't even think to test Windows... Regards, Anthony Liguori
[PATCH 2/3] Refactor hypercall infrastructure (v3)
This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. It also introduces the infrastructure to probe for hypercalls available via CPUID leaves at 0x40000000. CPUID leaf 0x40000001 should be filled out by userspace. A fall-out of this patch is that the unhandled hypercalls no longer trap to userspace. There is very little reason though to use a hypercall to communicate with userspace as PIO or MMIO can be used. There is no code in tree that uses userspace hypercalls. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index ad08138..1cde572 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -46,6 +46,7 @@ #define KVM_MAX_CPUID_ENTRIES 40 #define DE_VECTOR 0 +#define UD_VECTOR 6 #define NM_VECTOR 7 #define DF_VECTOR 8 #define TS_VECTOR 10 @@ -317,9 +318,6 @@ struct kvm_vcpu { unsigned long cr0; unsigned long cr2; unsigned long cr3; - gpa_t para_state_gpa; - struct page *para_state_page; - gpa_t hypercall_gpa; unsigned long cr4; unsigned long cr8; u64 pdptrs[4]; /* pae */ @@ -622,7 +620,9 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu); int kvm_mmu_load(struct kvm_vcpu *vcpu); void kvm_mmu_unload(struct kvm_vcpu *vcpu); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu); static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code) diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index 99e4917..0629dd7 100644 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #include @@ -1383,51 +1384,61 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu) }
EXPORT_SYMBOL_GPL(kvm_emulate_halt); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run) +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) { - unsigned long nr, a0, a1, a2, a3, a4, a5, ret; + unsigned long nr, a0, a1, a2, a3, ret; kvm_x86_ops->cache_regs(vcpu); - ret = -KVM_EINVAL; -#ifdef CONFIG_X86_64 - if (is_long_mode(vcpu)) { - nr = vcpu->regs[VCPU_REGS_RAX]; - a0 = vcpu->regs[VCPU_REGS_RDI]; - a1 = vcpu->regs[VCPU_REGS_RSI]; - a2 = vcpu->regs[VCPU_REGS_RDX]; - a3 = vcpu->regs[VCPU_REGS_RCX]; - a4 = vcpu->regs[VCPU_REGS_R8]; - a5 = vcpu->regs[VCPU_REGS_R9]; - } else -#endif - { - nr = vcpu->regs[VCPU_REGS_RBX] & -1u; - a0 = vcpu->regs[VCPU_REGS_RAX] & -1u; - a1 = vcpu->regs[VCPU_REGS_RCX] & -1u; - a2 = vcpu->regs[VCPU_REGS_RDX] & -1u; - a3 = vcpu->regs[VCPU_REGS_RSI] & -1u; - a4 = vcpu->regs[VCPU_REGS_RDI] & -1u; - a5 = vcpu->regs[VCPU_REGS_RBP] & -1u; + + nr = vcpu->regs[VCPU_REGS_RAX]; + a0 = vcpu->regs[VCPU_REGS_RBX]; + a1 = vcpu->regs[VCPU_REGS_RCX]; + a2 = vcpu->regs[VCPU_REGS_RDX]; + a3 = vcpu->regs[VCPU_REGS_RSI]; + + if (!is_long_mode(vcpu)) { + nr &= 0xffffffff; + a0 &= 0xffffffff; + a1 &= 0xffffffff; + a2 &= 0xffffffff; + a3 &= 0xffffffff; } + switch (nr) { default: - run->hypercall.nr = nr; - run->hypercall.args[0] = a0; - run->hypercall.args[1] = a1; - run->hypercall.args[2] = a2; - run->hypercall.args[3] = a3; - run->hypercall.args[4] = a4; - run->hypercall.args[5] = a5; - run->hypercall.ret = ret; - run->hypercall.longmode = is_long_mode(vcpu); - kvm_x86_ops->decache_regs(vcpu); - return 0; + ret = -KVM_ENOSYS; + break; } vcpu->regs[VCPU_REGS_RAX] = ret; kvm_x86_ops->decache_regs(vcpu); - return 1; + return 0; +} +EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu) +{ + char instruction[3]; + int ret = 0; + + mutex_lock(&vcpu->kvm->lock); + + /* +* Blow out the MMU to ensure that no other VCPU has an active mapping +* to ensure that the updated hypercall appears atomically across all +* VCPUs. +*/ + kvm_mmu_zap_all(vcpu->kvm); +
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Jeremy Fitzhardinge wrote: I'm starting to lean toward just using 0x40000000. If for no other reason than the hypercall space is unsharable. Well, it could be, but it would take affirmative action on the guest's part. If there are feature bits for each supported hypercall interface, then you could have a magic MSR to select which interface you want to use now. That would allow a generic-interface-using guest to probe for the generic interface at cpuid leaf 0x40001000, use 0x40001001 to determine whether the hypercall interface is available, 0x4000100x to find the base of the magic MSRs, and write the appropriate MSR to set the desired hypercall style (and all this can be done without using vmcall, so it doesn't matter how the hypercall interface is initially established). The main thing keeping me from doing this ATM is what I perceive as lack of interest in a generic interface. I think it's also a little premature given that we don't have any features on the plate yet. However, I don't think that means that we cannot turn KVM's PV into a generic one. So here's what I propose. Let's start building the KVM PV interface on 0x40000000. That means that Xen cannot initially use it but that's okay. Once KVM-lite is merged and we have some solid features (and other guests start implementing them), we can also advertise this interface as a "generic interface" by also supporting the signature on leaf 0x40001000 and using the MSR trickery that you propose. As long as we all agree not to use 0x40001000 for now, it leaves open the possibility of having a generic interface in the future. Regards, Anthony Liguori J - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2005. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/kvm-devel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Nakajima, Jun wrote: I don't understand the purpose of returning the max leaf. Who is that information useful for? Well, this is the key info to the user of CPUID. It tells which leaves are valid to use. Otherwise, the user cannot tell whether the results of CPUID.0x400N are valid or not (i.e. junk). BTW, this is what we are doing on native hardware (for leaf 0 or 0x80000000, for example). The fact that Xen returns 0x40000002 means it only uses 3 leaves today. Then it's just a version ID. You pretty much have to treat it as a version ID because if it returns 0x40000003 and you only know what 0x40000002 is, then you can't actually use it. I much prefer the current use of CPUID in KVM. If 0x40001000 returns the KVM signature, then 0x40001001 *must* be valid and contain a set of feature bits. If we wish to use additional CPUID leaves in the future, then we can just use a feature bit. The real benefit to us is that we can use a discontiguous set of leaves whereas the Xen approach is forced to use a linear set (at least for the result to be meaningful). I like Jeremy's suggestion of starting with 0x40001000 for KVM. Xen has an established hypercall interface and that isn't going to change. However, in the future, if other Operating Systems (like the BSDs) choose to implement the KVM paravirtualization interface, then that leaves open the possibility for Xen to also support this interface to get good performance for those OSes. It's necessary to be able to support both at once if you wish to support these interfaces without user interaction. Using CPUID.0x400N (N > 2) does not prevent Xen from doing that, either. If you use 0x40001000, 1) you need to say the leaves from 0x40000000 through 0x40001000 are all valid, OR 2) you create/fork a new/odd leaf (with 0x1000 offset) repeating the detection redundantly. Why do the leaves up to 0x40001000 have to be valid? Xen is not going to change what they have today--they can't. 
However, if down the road, they decided that since so many guests use KVM's paravirtualization interface other than Linux, there's value in supporting it, by using 0x40001000, they can. There's no tangible benefit to us to use 0x40000000. Therefore I'm inclined to lean toward making things easier for others. Again, 0x40000000 is not Xen specific. If the leaf 0x40000000 is used for any guest to detect any hypervisor, that would be a compelling benefit. For future Xen-specific features, it's safe for Xen to use other bigger leaves (like 0x40001000) because the guest starts looking at them after detection of Xen. I'm starting to lean toward just using 0x40000000. If for no other reason than the hypercall space is unsharable. Regards, Anthony Liguori Likewise if the KVM paravirtualization interface (as a kind of "open source paravirtualization interface") is detected in the generic areas (not in vendor-specific ones), any guest can check the features available without knowing which hypervisor uses which CPUID for that. Regards, Anthony Liguori And like CPUID.1, CPUID.0x40000001 returns the version number in eax, and each VMM should be able to define a number of VMM-specific features available in ebx, ecx, and edx returned (which are reserved, i.e. not used in Xen today). Suppose we knew (i.e. tested) Xen and KVM supported Linux paravirtualization, the Linux code does: 1. detect Xen or KVM using CPUID.0x40000000 2. Check the version if necessary using CPUID.0x40000001 3. Check the Linux paravirtualization features available using CPUID.0x400Y. Jun --- Intel Open Source Technology Center
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Nakajima, Jun wrote: Jeremy Fitzhardinge wrote: Nakajima, Jun wrote: The hypervisor detection mechanism is generic, and the signature returned is implementation specific. Having a list of all hypervisor signatures sounds fine to me as we detect vendor-specific processors on native hardware. And I don't expect the list is large. I'm confused about what you're proposing. I was thinking that a kernel looking for the generic hypervisor interface would check for a specific signature at some cpuid leaf, and then go about using it from there. If not, how is it supposed to detect the generic hypervisor interface? J I'm suggesting that we use CPUID.0x400Y (Y: TBD, e.g. 6) for Linux paravirtualization. The ebx, ecx and edx return the Linux paravirtualization features available on that hypervisor. Those features are defined architecturally (not VMM specific). Like CPUID.0, CPUID.0x40000000 is used to detect the hypervisor with the vendor identification string returned in ebx, edx, and ecx (as we are doing in Xen). The eax returns the max leaf (which is 0x40000002 on Xen today). I don't understand the purpose of returning the max leaf. Who is that information useful for? I like Jeremy's suggestion of starting with 0x40001000 for KVM. Xen has an established hypercall interface and that isn't going to change. However, in the future, if other Operating Systems (like the BSDs) choose to implement the KVM paravirtualization interface, then that leaves open the possibility for Xen to also support this interface to get good performance for those OSes. It's necessary to be able to support both at once if you wish to support these interfaces without user interaction. There's no tangible benefit to us to use 0x40000000. Therefore I'm inclined to lean toward making things easier for others. 
Regards, Anthony Liguori And like CPUID.1, CPUID.0x40000001 returns the version number in eax, and each VMM should be able to define a number of VMM-specific features available in ebx, ecx, and edx returned (which are reserved, i.e. not used in Xen today). Suppose we knew (i.e. tested) Xen and KVM supported Linux paravirtualization, the Linux code does: 1. detect Xen or KVM using CPUID.0x40000000 2. Check the version if necessary using CPUID.0x40000001 3. Check the Linux paravirtualization features available using CPUID.0x400Y. Jun --- Intel Open Source Technology Center
[PATCH] Refactor hypercall infrastructure (v2)
This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. It also introduces the infrastructure to probe for hypercall available via CPUID leaves 0x40001000. CPUID leaf 0x40001001 should be filled out by userspace. A fall-out of this patch is that the unhandled hypercalls no longer trap to userspace. There is very little reason though to use a hypercall to communicate with userspace as PIO or MMIO can be used. There is no code in tree that uses userspace hypercalls. Since the last patchset, I've changed the CPUID leaves to better avoid Xen's CPUID range and fixed a bug spotted by Muli in masking off hypercall arguments. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index ad08138..1cde572 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -46,6 +46,7 @@ #define KVM_MAX_CPUID_ENTRIES 40 #define DE_VECTOR 0 +#define UD_VECTOR 6 #define NM_VECTOR 7 #define DF_VECTOR 8 #define TS_VECTOR 10 @@ -317,9 +318,6 @@ struct kvm_vcpu { unsigned long cr0; unsigned long cr2; unsigned long cr3; - gpa_t para_state_gpa; - struct page *para_state_page; - gpa_t hypercall_gpa; unsigned long cr4; unsigned long cr8; u64 pdptrs[4]; /* pae */ @@ -622,7 +620,9 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu); int kvm_mmu_load(struct kvm_vcpu *vcpu); void kvm_mmu_unload(struct kvm_vcpu *vcpu); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu); static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code) diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index 99e4917..d720290 100644 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c 
@@ -39,6 +39,7 @@ #include #include #include +#include #include #include @@ -1383,51 +1384,61 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_emulate_halt); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run) +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) { - unsigned long nr, a0, a1, a2, a3, a4, a5, ret; + unsigned long nr, a0, a1, a2, a3, ret; kvm_x86_ops->cache_regs(vcpu); - ret = -KVM_EINVAL; -#ifdef CONFIG_X86_64 - if (is_long_mode(vcpu)) { - nr = vcpu->regs[VCPU_REGS_RAX]; - a0 = vcpu->regs[VCPU_REGS_RDI]; - a1 = vcpu->regs[VCPU_REGS_RSI]; - a2 = vcpu->regs[VCPU_REGS_RDX]; - a3 = vcpu->regs[VCPU_REGS_RCX]; - a4 = vcpu->regs[VCPU_REGS_R8]; - a5 = vcpu->regs[VCPU_REGS_R9]; - } else -#endif - { - nr = vcpu->regs[VCPU_REGS_RBX] & -1u; - a0 = vcpu->regs[VCPU_REGS_RAX] & -1u; - a1 = vcpu->regs[VCPU_REGS_RCX] & -1u; - a2 = vcpu->regs[VCPU_REGS_RDX] & -1u; - a3 = vcpu->regs[VCPU_REGS_RSI] & -1u; - a4 = vcpu->regs[VCPU_REGS_RDI] & -1u; - a5 = vcpu->regs[VCPU_REGS_RBP] & -1u; - } + + nr = vcpu->regs[VCPU_REGS_RAX]; + a0 = vcpu->regs[VCPU_REGS_RBX]; + a1 = vcpu->regs[VCPU_REGS_RCX]; + a2 = vcpu->regs[VCPU_REGS_RDX]; + a3 = vcpu->regs[VCPU_REGS_RSI]; + + if (!is_long_mode(vcpu)) { + nr &= 0xffffffff; + a0 &= 0xffffffff; + a1 &= 0xffffffff; + a2 &= 0xffffffff; + a3 &= 0xffffffff; + } + switch (nr) { default: - run->hypercall.nr = nr; - run->hypercall.args[0] = a0; - run->hypercall.args[1] = a1; - run->hypercall.args[2] = a2; - run->hypercall.args[3] = a3; - run->hypercall.args[4] = a4; - run->hypercall.args[5] = a5; - run->hypercall.ret = ret; - run->hypercall.longmode = is_long_mode(vcpu); - kvm_x86_ops->decache_regs(vcpu); - return 0; + ret = -KVM_ENOSYS; + break; } vcpu->regs[VCPU_REGS_RAX] = ret; kvm_x86_ops->decache_regs(vcpu); - return 1; + return 0; +} +EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu) +{ + char instruction[3]; + int ret = 0; + + mutex_lock(&vcpu->kvm->lock); + + /* +* Blow out the MMU to ensure that no other
VCPU has an active map
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Zachary Amsden wrote: On Fri, 2007-09-14 at 16:44 -0500, Anthony Liguori wrote: So then each module creates a hypercall page using this magic MSR and the hypervisor has to keep track of it so that it can appropriately change the page on migration. The page can only contain a single instruction or else it cannot be easily changed (or you have to be able to prevent the guest from being migrated while in the hypercall page). We're really talking about identical models. Instead of an MSR, the #GP is what tells the hypervisor to update the instruction. The nice thing about this is that you don't have to keep track of all the current hypercall page locations in the hypervisor. I agree, multiple hypercall pages is insane. I was thinking more of a single hypercall page, fixed in place by the hypervisor, not the kernel. Then each module can read an MSR saying what VA the hypercall page is at, and the hypervisor can simply flip one page to switch architectures. That requires a memory hole though. In KVM, we don't have a memory hole. Regards, Anthony Liguori Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Jeremy Fitzhardinge wrote: Anthony Liguori wrote: Yeah, see, the initial goal was to make it possible to use the KVM paravirtualizations on other hypervisors. However, I don't think this is really going to be possible in general so maybe it's better to just use leaf 0. I'll let others chime in before sending a new patch. Hm. Obviously you can just define a signature for "kvm-compatible hypercall interface" and make it common that way, but it gets tricky if the hypervisor supports multiple hypercall interfaces, including the kvm one. Start the kvm leaves at 0x40001000 or something? Yeah, that works with me. Regards, Anthony Liguori J - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Jeremy Fitzhardinge wrote: Anthony Liguori wrote: The whole point of using the instruction is to allow hypercalls to be used in many locations. This has the nice side effect of not requiring a central hypercall initialization routine in the guest to fetch the hypercall page. A PV driver can be completely independent of any other code provided that it restricts itself to its hypercall namespace. I see. So you take the fault, disassemble the instruction, see that it's another CPU's vmcall instruction, and then replace it with the current CPU's vmcall? Yup. Xen is currently using 0/1/2. I had thought it was only using 0/1. The intention was not to squash Xen's current CPUID usage so that it would still be possible for Xen to make use of the guest code. Can we agree that Xen won't squash leaves 3/4 or is it not worth trying to be compatible at this point? No, the point is that you're supposed to work out which hypervisor it is from the signature in leaf 0, and then the hypervisor can put anything it wants in the other leaves. Yeah, see, the initial goal was to make it possible to use the KVM paravirtualizations on other hypervisors. However, I don't think this is really going to be possible in general so maybe it's better to just use leaf 0. I'll let others chime in before sending a new patch. Regards, Anthony Liguori J
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Zachary Amsden wrote: On Fri, 2007-09-14 at 16:02 -0500, Anthony Liguori wrote: Jeremy Fitzhardinge wrote: Anthony Liguori wrote: This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. I guess it would be pretty rude/unlikely for these opcodes to get reused in other implementations... But couldn't you make the page trap instead, rather than relying on an instruction fault? The whole point of using the instruction is to allow hypercalls to be used in many locations. This has the nice side effect of not requiring a central hypercall initialization routine in the guest to fetch the hypercall page. A PV driver can be completely independent of any other code provided that it restricts itself to it's hypercall namespace. But if the instruction is architecture dependent, and you run on the wrong architecture, now you have to patch many locations at fault time, introducing some nasty runtime code / data cache overlap performance problems. Granted, they go away eventually. We're addressing that by blowing away the shadow cache and holding the big kvm lock to ensure SMP safety. Not a great thing to do from a performance perspective but the whole point of patching is that the cost is amortized. I prefer the idea of a hypercall page, but not a central initialization. Rather, a decentralized approach where PV drivers can detect using CPUID which hypervisor is present, and a common MSR shared by all hypervisors that provides the location of the hypercall page. So then each module creates a hypercall page using this magic MSR and the hypervisor has to keep track of it so that it can appropriately change the page on migration. 
The page can only contain a single instruction or else it cannot be easily changed (or you have to be able to prevent the guest from being migrated while in the hypercall page). We're really talking about identical models. Instead of an MSR, the #GP is what tells the hypervisor to update the instruction. The nice thing about this is that you don't have to keep track of all the current hypercall page locations in the hypervisor. Regards, Anthony Liguori Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Jeremy Fitzhardinge wrote: Anthony Liguori wrote: This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. I guess it would be pretty rude/unlikely for these opcodes to get reused in other implementations... But couldn't you make the page trap instead, rather than relying on an instruction fault? The whole point of using the instruction is to allow hypercalls to be used in many locations. This has the nice side effect of not requiring a central hypercall initialization routine in the guest to fetch the hypercall page. A PV driver can be completely independent of any other code provided that it restricts itself to its hypercall namespace. It also introduces the infrastructure to probe for hypercalls available via CPUID leaves 0x40000002. CPUID leaf 0x40000003 should be filled out by userspace. Is this compatible with Xen's (and other's) use of cpuid? That is, 0x40000000 returns a hypervisor-specific signature in e[bcd]x, and eax has the max hypervisor leaf. Xen is currently using 0/1/2. I had thought it was only using 0/1. The intention was not to squash Xen's current CPUID usage so that it would still be possible for Xen to make use of the guest code. Can we agree that Xen won't squash leaves 3/4 or is it not worth trying to be compatible at this point? Regards, Anthony Liguori J
[PATCH] Refactor hypercall infrastructure
This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. It also introduces the infrastructure to probe for hypercalls available via CPUID leaves 0x40000002. CPUID leaf 0x40000003 should be filled out by userspace. A fall-out of this patch is that the unhandled hypercalls no longer trap to userspace. There is very little reason though to use a hypercall to communicate with userspace as PIO or MMIO can be used. There is no code in tree that uses userspace hypercalls. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index ad08138..1cde572 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -46,6 +46,7 @@ #define KVM_MAX_CPUID_ENTRIES 40 #define DE_VECTOR 0 +#define UD_VECTOR 6 #define NM_VECTOR 7 #define DF_VECTOR 8 #define TS_VECTOR 10 @@ -317,9 +318,6 @@ struct kvm_vcpu { unsigned long cr0; unsigned long cr2; unsigned long cr3; - gpa_t para_state_gpa; - struct page *para_state_page; - gpa_t hypercall_gpa; unsigned long cr4; unsigned long cr8; u64 pdptrs[4]; /* pae */ @@ -622,7 +620,9 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu); int kvm_mmu_load(struct kvm_vcpu *vcpu); void kvm_mmu_unload(struct kvm_vcpu *vcpu); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu); static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code) diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index 99e4917..5211d19 100644 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #include @@ -1383,51 +1384,61 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu) }
EXPORT_SYMBOL_GPL(kvm_emulate_halt); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run) +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) { - unsigned long nr, a0, a1, a2, a3, a4, a5, ret; + unsigned long nr, a0, a1, a2, a3, ret; kvm_x86_ops->cache_regs(vcpu); - ret = -KVM_EINVAL; -#ifdef CONFIG_X86_64 - if (is_long_mode(vcpu)) { - nr = vcpu->regs[VCPU_REGS_RAX]; - a0 = vcpu->regs[VCPU_REGS_RDI]; - a1 = vcpu->regs[VCPU_REGS_RSI]; - a2 = vcpu->regs[VCPU_REGS_RDX]; - a3 = vcpu->regs[VCPU_REGS_RCX]; - a4 = vcpu->regs[VCPU_REGS_R8]; - a5 = vcpu->regs[VCPU_REGS_R9]; - } else -#endif - { - nr = vcpu->regs[VCPU_REGS_RBX] & -1u; - a0 = vcpu->regs[VCPU_REGS_RAX] & -1u; - a1 = vcpu->regs[VCPU_REGS_RCX] & -1u; - a2 = vcpu->regs[VCPU_REGS_RDX] & -1u; - a3 = vcpu->regs[VCPU_REGS_RSI] & -1u; - a4 = vcpu->regs[VCPU_REGS_RDI] & -1u; - a5 = vcpu->regs[VCPU_REGS_RBP] & -1u; - } + + nr = vcpu->regs[VCPU_REGS_RAX]; + a0 = vcpu->regs[VCPU_REGS_RBX]; + a1 = vcpu->regs[VCPU_REGS_RCX]; + a2 = vcpu->regs[VCPU_REGS_RDX]; + a3 = vcpu->regs[VCPU_REGS_RSI]; + + if (!is_long_mode(vcpu)) { + nr &= ~1u; + a0 &= ~1u; + a1 &= ~1u; + a2 &= ~1u; + a3 &= ~1u; + } + switch (nr) { default: - run->hypercall.nr = nr; - run->hypercall.args[0] = a0; - run->hypercall.args[1] = a1; - run->hypercall.args[2] = a2; - run->hypercall.args[3] = a3; - run->hypercall.args[4] = a4; - run->hypercall.args[5] = a5; - run->hypercall.ret = ret; - run->hypercall.longmode = is_long_mode(vcpu); - kvm_x86_ops->decache_regs(vcpu); - return 0; + ret = -KVM_ENOSYS; + break; } vcpu->regs[VCPU_REGS_RAX] = ret; kvm_x86_ops->decache_regs(vcpu); - return 1; + return 0; +} +EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu) +{ + char instruction[3]; + int ret = 0; + + mutex_lock(&vcpu->kvm->lock); + + /* +* Blow out the MMU to ensure that no other VCPU has an active mapping +* to ensure that the updated hypercall appears atomically across all +* VCPUs. 
+*/ + kvm_mmu_zap_all(vcpu->kvm); + + kvm_x86_ops->cache_reg
Re: [kvm-devel] [RFC] 9p: add KVM/QEMU pci transport
I think that it would be nicer to implement the p9 transport on top of virtio instead of directly on top of PCI. I think your PCI transport would make a pretty nice start of a PCI virtio transport though. Regards, Anthony Liguori On Tue, 2007-08-28 at 13:52 -0500, Eric Van Hensbergen wrote: > From: Latchesar Ionkov <[EMAIL PROTECTED]> > > This adds a shared memory transport for a synthetic 9p device for > paravirtualized file system support under KVM/QEMU. > > Signed-off-by: Latchesar Ionkov <[EMAIL PROTECTED]> > Signed-off-by: Eric Van Hensbergen <[EMAIL PROTECTED]> > --- > Documentation/filesystems/9p.txt |2 + > net/9p/Kconfig | 10 ++- > net/9p/Makefile |4 + > net/9p/trans_pci.c | 295 > ++ > 4 files changed, 310 insertions(+), 1 deletions(-) > create mode 100644 net/9p/trans_pci.c > > diff --git a/Documentation/filesystems/9p.txt > b/Documentation/filesystems/9p.txt > index 1a5f50d..e1879bd 100644 > --- a/Documentation/filesystems/9p.txt > +++ b/Documentation/filesystems/9p.txt > @@ -46,6 +46,8 @@ OPTIONS > tcp - specifying a normal TCP/IP connection > fd - used passed file descriptors for connection > (see rfdno and wfdno) > + pci - use a PCI pseudo device for 9p communication > + over shared memory between a guest and host > >uname=name user name to attempt mount as on the remote server. The > server may override or ignore this value. Certain user > diff --git a/net/9p/Kconfig b/net/9p/Kconfig > index 09566ae..8517560 100644 > --- a/net/9p/Kconfig > +++ b/net/9p/Kconfig > @@ -16,13 +16,21 @@ menuconfig NET_9P > config NET_9P_FD > depends on NET_9P > default y if NET_9P > - tristate "9P File Descriptor Transports (Experimental)" > + tristate "9p File Descriptor Transports (Experimental)" > help > This builds support for file descriptor transports for 9p > which includes support for TCP/IP, named pipes, or passed > file descriptors. TCP/IP is the default transport for 9p, > so if you are going to use 9p, you'll likely want this. 
> > +config NET_9P_PCI > + depends on NET_9P > + tristate "9p PCI Shared Memory Transport (Experimental)" > + help > + This builds support for a PCI psuedo-device currently available > + under KVM/QEMU which allows for 9p transactions over shared > + memory between the guest and the host. > + > config NET_9P_DEBUG > bool "Debug information" > depends on NET_9P > diff --git a/net/9p/Makefile b/net/9p/Makefile > index 7b2a67a..26ce89d 100644 > --- a/net/9p/Makefile > +++ b/net/9p/Makefile > @@ -1,5 +1,6 @@ > obj-$(CONFIG_NET_9P) := 9pnet.o > obj-$(CONFIG_NET_9P_FD) += 9pnet_fd.o > +obj-$(CONFIG_NET_9P_PCI) += 9pnet_pci.o > > 9pnet-objs := \ > mod.o \ > @@ -14,3 +15,6 @@ obj-$(CONFIG_NET_9P_FD) += 9pnet_fd.o > > 9pnet_fd-objs := \ > trans_fd.o \ > + > +9pnet_pci-objs := \ > + trans_pci.o \ > diff --git a/net/9p/trans_pci.c b/net/9p/trans_pci.c > new file mode 100644 > index 000..36ddc5f > --- /dev/null > +++ b/net/9p/trans_pci.c > @@ -0,0 +1,295 @@ > +/* > + * net/9p/trans_pci.c > + * > + * 9P over PCI transport layer. For use with KVM/QEMU. > + * > + * Copyright (C) 2007 by Latchesar Ionkov <[EMAIL PROTECTED]> > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 > + * as published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. 
> + * > + * You should have received a copy of the GNU General Public License > + * along with this program; if not, write to: > + * Free Software Foundation > + * 51 Franklin Street, Fifth Floor > + * Boston, MA 02111-1301 USA > + * > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#define P9PCI_DRIVER_NAME "9P PCI Device" > +#define P9PCI_DRIVER_VERSION "1" > + > +#define PCI_VENDOR_ID_9P 0x5002 > +#defin
Re: [kvm-devel] [PATCH 3/3] KVM paravirt-ops implementation
On Wed, 2007-08-29 at 04:31 +1000, Rusty Russell wrote: > On Mon, 2007-08-27 at 10:16 -0500, Anthony Liguori wrote: > > @@ -569,6 +570,7 @@ asmlinkage void __init start_kernel(void) > > } > > sort_main_extable(); > > trap_init(); > > + kvm_guest_init(); > > rcu_init(); > > init_IRQ(); > > pidhash_init(); > > Hi Anthony, > > This placement seems arbitrary. Why not earlier from setup_arch, or as > a normal initcall? The placement is important if we wish to have a paravirt_ops hook for the interrupt controller. This is the latest possible spot we can do it. A comment is probably appropriate here. Regards, Anthony Liguori > Rusty. > > > > - > This SF.net email is sponsored by: Splunk Inc. > Still grepping through log files to find problems? Stop. > Now Search log events and configuration files using AJAX and a browser. > Download your FREE copy of Splunk now >> http://get.splunk.com/ > ___ > kvm-devel mailing list > [EMAIL PROTECTED] > https://lists.sourceforge.net/lists/listinfo/kvm-devel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH 2/3] Refactor hypercall infrastructure
On Wed, 2007-08-29 at 04:12 +1000, Rusty Russell wrote: > On Mon, 2007-08-27 at 10:16 -0500, Anthony Liguori wrote: > > This patch refactors the current hypercall infrastructure to better support > > live > > migration and SMP. It eliminates the hypercall page by trapping the UD > > exception that would occur if you used the wrong hypercall instruction for > > the > > underlying architecture and replacing it with the right one lazily. > > It also reduces the number of hypercall args, which you don't mention > here. Oh yes, sorry. > > + er = emulate_instruction(&svm->vcpu, kvm_run, 0, 0); > > + > > + /* we should only succeed here in the case of hypercalls which > > + cannot generate an MMIO event. MMIO means that the emulator > > + is mistakenly allowing an instruction that should generate > > + a UD fault so it's a bug. */ > > + BUG_ON(er == EMULATE_DO_MMIO); > > This seems... unwise. Firstly we know our emulator is incomplete. > Secondly an SMP guest can exploit this to crash the host. This code is gone in v2. > (Code is in two places). > > > +#define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1" Good point. > A nice big comment would be nice here, I think. Note that this is big > enough for both "int $0x1f" and "sysenter", so I'm happy. I need to add a comment somewhere mentioning that if you patch with something less than 3 bytes, then you should pad with nop but the hypervisor must treat the whole instruction (including the padding) as atomic (that is, regardless of hypercall size, eip += 3) or you run the risk of breakage during migration. Regards, Anthony Liguori > Cheers, > Rusty. > > > - > This SF.net email is sponsored by: Splunk Inc. > Still grepping through log files to find problems? Stop. > Now Search log events and configuration files using AJAX and a browser. 
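The padding scheme Anthony describes above (patch the site with the native instruction, pad to 3 bytes with nop, and always advance eip by 3) can be sketched in isolation. This is an illustrative user-space model, not the kernel code: `patch_hypercall_site` is a hypothetical helper, though the opcode bytes for vmcall and sysenter are the real encodings.

```c
#include <assert.h>
#include <string.h>

/* Real instruction encodings: vmcall = 0f 01 c1 (Intel),
 * vmmcall = 0f 01 d9 (AMD), sysenter = 0f 34. */
static const unsigned char VMCALL[3]   = { 0x0f, 0x01, 0xc1 };
static const unsigned char SYSENTER[2] = { 0x0f, 0x34 };

/* Hypothetical helper: write an instruction of len <= 3 bytes into a
 * 3-byte patch site and pad the tail with nop (0x90), so the guest can
 * unconditionally do eip += 3 after the hypercall regardless of which
 * instruction the hypervisor patched in. */
static void patch_hypercall_site(unsigned char site[3],
                                 const unsigned char *insn, int len)
{
    memcpy(site, insn, len);
    memset(site + len, 0x90, 3 - len);
}
```

With a 2-byte instruction such as sysenter, the third byte becomes a nop, which is why the hypervisor must treat all 3 bytes as one unit when advancing eip.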
[PATCH 1/3] Refactor hypercall infrastructure (v2)
This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. It also introduces the infrastructure to probe for hypercall available via CPUID leaves 0x4002. CPUID leaf 0x4003 should be filled out by userspace. A fall-out of this patch is that the unhandled hypercalls no longer trap to userspace. There is very little reason though to use a hypercall to communicate with userspace as PIO or MMIO can be used. There is no code in tree that uses userspace hypercalls. Since v1, we now use emulator_write_emulated() and zap the MMU while holding kvm->lock to ensure atomicity on SMP. We also switch the KVM errnos to be integer based and make kvm_para.h #include-able from userspace. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index a42a6f3..38c0eaf 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -46,6 +46,7 @@ #define KVM_MAX_CPUID_ENTRIES 40 #define DE_VECTOR 0 +#define UD_VECTOR 6 #define NM_VECTOR 7 #define DF_VECTOR 8 #define TS_VECTOR 10 @@ -316,9 +317,6 @@ struct kvm_vcpu { unsigned long cr0; unsigned long cr2; unsigned long cr3; - gpa_t para_state_gpa; - struct page *para_state_page; - gpa_t hypercall_gpa; unsigned long cr4; unsigned long cr8; u64 pdptrs[4]; /* pae */ @@ -587,7 +585,9 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu); int kvm_mmu_load(struct kvm_vcpu *vcpu); void kvm_mmu_unload(struct kvm_vcpu *vcpu); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu); static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code) diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index d154487..177cba1 100644 --- 
a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -1267,51 +1267,61 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_emulate_halt); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run) +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) { - unsigned long nr, a0, a1, a2, a3, a4, a5, ret; + unsigned long nr, a0, a1, a2, a3, ret; kvm_arch_ops->cache_regs(vcpu); - ret = -KVM_EINVAL; -#ifdef CONFIG_X86_64 - if (is_long_mode(vcpu)) { - nr = vcpu->regs[VCPU_REGS_RAX]; - a0 = vcpu->regs[VCPU_REGS_RDI]; - a1 = vcpu->regs[VCPU_REGS_RSI]; - a2 = vcpu->regs[VCPU_REGS_RDX]; - a3 = vcpu->regs[VCPU_REGS_RCX]; - a4 = vcpu->regs[VCPU_REGS_R8]; - a5 = vcpu->regs[VCPU_REGS_R9]; - } else -#endif - { - nr = vcpu->regs[VCPU_REGS_RBX] & -1u; - a0 = vcpu->regs[VCPU_REGS_RAX] & -1u; - a1 = vcpu->regs[VCPU_REGS_RCX] & -1u; - a2 = vcpu->regs[VCPU_REGS_RDX] & -1u; - a3 = vcpu->regs[VCPU_REGS_RSI] & -1u; - a4 = vcpu->regs[VCPU_REGS_RDI] & -1u; - a5 = vcpu->regs[VCPU_REGS_RBP] & -1u; + + nr = vcpu->regs[VCPU_REGS_RAX]; + a0 = vcpu->regs[VCPU_REGS_RBX]; + a1 = vcpu->regs[VCPU_REGS_RCX]; + a2 = vcpu->regs[VCPU_REGS_RDX]; + a3 = vcpu->regs[VCPU_REGS_RSI]; + + if (!is_long_mode(vcpu)) { + nr &= ~1u; + a0 &= ~1u; + a1 &= ~1u; + a2 &= ~1u; + a3 &= ~1u; } + switch (nr) { default: - run->hypercall.nr = nr; - run->hypercall.args[0] = a0; - run->hypercall.args[1] = a1; - run->hypercall.args[2] = a2; - run->hypercall.args[3] = a3; - run->hypercall.args[4] = a4; - run->hypercall.args[5] = a5; - run->hypercall.ret = ret; - run->hypercall.longmode = is_long_mode(vcpu); - kvm_arch_ops->decache_regs(vcpu); - return 0; + ret = -KVM_ENOSYS; + break; } vcpu->regs[VCPU_REGS_RAX] = ret; kvm_arch_ops->decache_regs(vcpu); - return 1; + return 0; +} +EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu) +{ + char instruction[3]; + int ret = 0; + + mutex_lock(&vcpu->kvm->lock); + + /* +* Blow out the MMU to ensure that no other VCPU has an active 
mapping +* to ensure that the updated hypercall appears atomically
[PATCH 2/3] KVM paravirt-ops implementation (v2)
A very simple paravirt_ops implementation for KVM. Future patches will add more sophisticated optimizations. The goal of presenting this now is to validate the new hypercall infrastructure and to make my patch queue smaller. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig index f952493..ceacc66 100644 --- a/arch/i386/Kconfig +++ b/arch/i386/Kconfig @@ -237,6 +237,13 @@ config VMI at the moment), by linking the kernel to a GPL-ed ROM module provided by the hypervisor. +config KVM_GUEST + bool "KVM paravirt-ops support" + depends on PARAVIRT + help + This option enables various optimizations for running under the KVM + hypervisor. + config ACPI_SRAT bool default y diff --git a/arch/i386/kernel/Makefile b/arch/i386/kernel/Makefile index 9d33b00..6308fcf 100644 --- a/arch/i386/kernel/Makefile +++ b/arch/i386/kernel/Makefile @@ -42,6 +42,7 @@ obj-$(CONFIG_K8_NB) += k8.o obj-$(CONFIG_MGEODE_LX)+= geode.o obj-$(CONFIG_VMI) += vmi.o vmiclock.o +obj-$(CONFIG_KVM_GUEST)+= kvm.o obj-$(CONFIG_PARAVIRT) += paravirt.o obj-y += pcspeaker.o diff --git a/arch/i386/kernel/kvm.c b/arch/i386/kernel/kvm.c new file mode 100644 index 000..26585a4 --- /dev/null +++ b/arch/i386/kernel/kvm.c @@ -0,0 +1,46 @@ +/* + * KVM paravirt_ops implementation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details.
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + * + * Copyright IBM Corporation, 2007 + * Authors: Anthony Liguori <[EMAIL PROTECTED]> + */ + +#include +#include + +/* + * No need for any "IO delay" on KVM + */ +static void kvm_io_delay(void) +{ +} + +static __init void paravirt_setup(void) +{ + paravirt_ops.name = "KVM"; + + if (kvm_para_has_feature(KVM_PARA_FEATURE_NOP_IO_DELAY)) + paravirt_ops.io_delay = kvm_io_delay; + + paravirt_ops.paravirt_enabled = 1; +} + +void __init kvm_guest_init(void) +{ + if (kvm_para_available()) + paravirt_setup(); +} diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h index 0f94138..10b2e7a 100644 --- a/include/linux/kvm_para.h +++ b/include/linux/kvm_para.h @@ -4,6 +4,14 @@ #ifdef __KERNEL__ #include +#ifdef CONFIG_KVM_GUEST +void __init kvm_guest_init(void); +#else +static inline void kvm_guest_init(void) +{ +} +#endif + #define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1" static inline long kvm_hypercall0(unsigned int nr) diff --git a/init/main.c b/init/main.c index d3bcb3b..b224905 100644 --- a/init/main.c +++ b/init/main.c @@ -55,6 +55,7 @@ #include #include #include +#include #include #include @@ -569,6 +570,7 @@ asmlinkage void __init start_kernel(void) } sort_main_extable(); trap_init(); + kvm_guest_init(); rcu_init(); init_IRQ(); pidhash_init(); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
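The guest-side probing in the patch above (`kvm_para_available()` plus `kvm_para_has_feature(KVM_PARA_FEATURE_NOP_IO_DELAY)`) boils down to testing a bit in a feature word that the hypervisor returns through a CPUID leaf. A minimal model, with the bit position assumed for illustration (the real value is whatever kvm_para.h assigns):

```c
#include <assert.h>

/* Assumed bit position for illustration only; the ABI value is set by
 * KVM_PARA_FEATURE_NOP_IO_DELAY in kvm_para.h. */
#define NOP_IO_DELAY_BIT 0

/* Model of kvm_para_has_feature(): the hypervisor advertises a feature
 * bitmap via a CPUID features leaf, and the guest just tests one bit
 * before installing the corresponding paravirt_ops hook. */
static int para_has_feature(unsigned int features, int bit)
{
    return (features >> bit) & 1;
}
```

In the patch, a set bit is what gates replacing `paravirt_ops.io_delay` with the empty `kvm_io_delay()`.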
[PATCH 0/3] KVM paravirtualization framework (v2)
This patchset refactors KVM's paravirtualization support to better support guest SMP and cross-platform migration. It also introduces a bare-bones KVM paravirt_ops implementation. I've tested this on VT and it works nicely. I'm having trouble getting a bzImage to boot on SVM, so I haven't been able to test on SVM yet. I've incorporated all of the feedback from posting the last series.
Re: [PATCH 1/3] Implement emulator_write_phys()
On Mon, 2007-08-27 at 18:45 +0300, Avi Kivity wrote: > Anthony Liguori wrote: > > Since a hypercall may span two pages and is a gva, we need a function to > > write > > to a gva that may span multiple pages. emulator_write_phys() seems like the > > logical choice for this. > > > > @@ -962,8 +962,35 @@ static int emulator_write_std(unsigned long addr, > > unsigned int bytes, > > struct kvm_vcpu *vcpu > > I think that emulator_write_emulated(), except for being awkwardly > named, should do the job. We have enough APIs. I'll drop this patch. I mistakenly thought emulator_write_emulated() took a GPA. Thanks! Regards, Anthony Liguori > But! We may not overwrite the hypercall instruction while a vcpu may be > executing, since there's no atomicity guarantee for code fetch. We have > to be out of guest mode while writing that insn.
Re: [PATCH 1/3] Implement emulator_write_phys()
On Mon, 2007-08-27 at 20:26 +0300, Avi Kivity wrote: > Anthony Liguori wrote: > > On Mon, 2007-08-27 at 18:45 +0300, Avi Kivity wrote: > > > >> Anthony Liguori wrote: > >> > >>> Since a hypercall may span two pages and is a gva, we need a function to > >>> write > >>> to a gva that may span multiple pages. emulator_write_phys() seems like > >>> the > >>> logical choice for this. > >>> > >>> @@ -962,8 +962,35 @@ static int emulator_write_std(unsigned long addr, > >>> unsigned int bytes, > >>> struct kvm_vcpu *vcpu > >>> > >> I think that emulator_write_emulated(), except for being awkwardly > >> named, should do the job. We have enough APIs. > >> > >> But! We may not overwrite the hypercall instruction while a vcpu may be > >> executing, since there's no atomicity guarantee for code fetch. We have > >> to to be out of guest mode while writing that insn. > >> > > > > > > Hrm, good catch. > > > > How can we get out of guest mode given SMP guest support? > > > > > > kvm_flush_remote_tlbs() is something that can be generalized. > Basically, you set a bit in each vcpu and send an IPI to take them out. > > But that's deadlock prone and complex. Maybe you can just take > kvm->lock, zap the mmu and the flush tlbs, and patch the instruction at > your leisure, as no vcpu will be able to map memory until the lock is > released. This works for shadow paging but not necessarily with NPT. Do code fetches really not respect atomic writes? We could switch to a 32-bit atomic operation and that should result in no worse than the code being patched twice. Regards, Anthony Liguori - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] Refactor hypercall infrastructure
On Mon, 2007-08-27 at 19:06 +0300, Avi Kivity wrote: > Anthony Liguori wrote: > > void kvm_emulate_cpuid(struct kvm_vcpu *vcpu) > > { > > int i; > > @@ -1632,6 +1575,12 @@ void kvm_emulate_cpuid(struct kvm_vcpu *vcpu) > > vcpu->regs[VCPU_REGS_RBX] = 0; > > vcpu->regs[VCPU_REGS_RCX] = 0; > > vcpu->regs[VCPU_REGS_RDX] = 0; > > + > > + if ((function & 0x) == 0x4000) { > > + emulate_paravirt_cpuid(vcpu, function); > > + goto out; > > + } > > + > > > > Hmm. Suppose we expose kvm capabilities to host userspace instead, and > let the host userspace decide which features to expose to the guest via > the regular cpuid emulation? That allows the qemu command line to > control the feature set. That's a little awkward with something like NPT. Some CPUID features should be masked by the availability of NPT support in hardware and whether kernelspace supports NPT. To set these feature bits in userspace, you would have to query kernelspace anyway. It would be nice to be consistent with the other cpuid flags though. There is somewhat of a similar issue with migration here. If we have an initial set of PV features and down the road, add more, it may be desirable to enable those features depending on what your network looks like. Yeah, I think I agree that userspace is the right place to set these bits. > > > > +static int ud_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) > > +{ > > + int er; > > + > > + er = emulate_instruction(&svm->vcpu, kvm_run, 0, 0); > > + > > + /* we should only succeed here in the case of hypercalls which > > + cannot generate an MMIO event. MMIO means that the emulator > > + is mistakenly allowing an instruction that should generate > > + a UD fault so it's a bug. */ > > + BUG_ON(er == EMULATE_DO_MMIO); > > > > It's a guest-triggerable bug; one vcpu can be issuing ud2-in-a-loop > while the other replaces the instruction with something that does mmio. Good catch! I'll update. 
> > + > > +#define KVM_ENOSYS ENOSYS > > > > A real number (well, an integer, not a real) here please. I know that > ENOSYS isn't going to change soon, but this file defines the kvm abi and > I'd like it to be as independent as possible. > > Let's start it at 1000 so that spurious "return 1"s or "return -1"s > don't get translated into valid error numbers. Okay. > I really like the simplification to the guest/host interface that this > patch brings. Thanks. Regards, Anthony Liguori - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] Implement emulator_write_phys()
On Mon, 2007-08-27 at 18:45 +0300, Avi Kivity wrote: > Anthony Liguori wrote: > > Since a hypercall may span two pages and is a gva, we need a function to > > write > > to a gva that may span multiple pages. emulator_write_phys() seems like the > > logical choice for this. > > > > @@ -962,8 +962,35 @@ static int emulator_write_std(unsigned long addr, > > unsigned int bytes, > > struct kvm_vcpu *vcpu > > I think that emulator_write_emulated(), except for being awkwardly > named, should do the job. We have enough APIs. > > But! We may not overwrite the hypercall instruction while a vcpu may be > executing, since there's no atomicity guarantee for code fetch. We have > to to be out of guest mode while writing that insn. Hrm, good catch. How can we get out of guest mode given SMP guest support? Regards, Anthony Liguori > > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/3] KVM paravirtualization framework
This patchset refactors KVM's paravirtualization support to better support guest SMP and cross platform migration. It also introduces a bare-bones KVM paravirt_ops implementation. I've tested this on VT and it works nicely. I'm having trouble getting a bzImage to boot on SVM so I haven't been able to test on SVM yet.
[PATCH 1/3] Implement emulator_write_phys()
Since a hypercall may span two pages and is a gva, we need a function to write to a gva that may span multiple pages. emulator_write_phys() seems like the logical choice for this. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index d154487..ebcc5c9 100644 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -962,8 +962,35 @@ static int emulator_write_std(unsigned long addr, unsigned int bytes, struct kvm_vcpu *vcpu) { - pr_unimpl(vcpu, "emulator_write_std: addr %lx n %d\n", addr, bytes); - return X86EMUL_UNHANDLEABLE; + const void *data = val; + + while (bytes) { + gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr); + unsigned offset = addr & (PAGE_SIZE-1); + unsigned tocopy = min(bytes, (unsigned)PAGE_SIZE - offset); + unsigned long pfn; + struct page *page; + void *page_virt; + + if (gpa == UNMAPPED_GVA) + return X86EMUL_PROPAGATE_FAULT; + pfn = gpa >> PAGE_SHIFT; + page = gfn_to_page(vcpu->kvm, pfn); + if (!page) + return X86EMUL_UNHANDLEABLE; + page_virt = kmap_atomic(page, KM_USER0); + + memcpy(page_virt + offset, data, tocopy); + + kunmap_atomic(page_virt, KM_USER0); + mark_page_dirty(vcpu->kvm, addr >> PAGE_SHIFT); + + bytes -= tocopy; + data += tocopy; + addr += tocopy; + } + + return X86EMUL_CONTINUE; } static struct kvm_io_device *vcpu_find_mmio_dev(struct kvm_vcpu *vcpu, - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
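The loop in emulator_write_std() above splits a guest-virtual write into per-page chunks before translating and copying each one. The chunking arithmetic is the part worth checking in isolation; `chunk_len` below is a hypothetical helper mirroring the `min(bytes, PAGE_SIZE - offset)` step:

```c
#include <assert.h>

#define PAGE_SIZE 4096u

/* How many bytes can be copied starting at addr without crossing a
 * page boundary -- the same computation emulator_write_std() uses to
 * decide each iteration's copy size. */
static unsigned int chunk_len(unsigned long addr, unsigned int bytes)
{
    unsigned int offset = addr & (PAGE_SIZE - 1);
    unsigned int room = PAGE_SIZE - offset;

    return bytes < room ? bytes : room;
}
```

For a 3-byte hypercall at offset 4094, this yields a 2-byte chunk followed by a 1-byte chunk on the next page, which is exactly the two-page case the patch description calls out.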
[PATCH 2/3] Refactor hypercall infrastructure
This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. It also introduces the infrastructure to probe for hypercall available via CPUID leaves 0x4002 and 0x4003. A fall-out of this patch is that the unhandled hypercalls no longer trap to userspace. There is very little reason though to use a hypercall to communicate with userspace as PIO or MMIO can be used. There is no code in tree that uses userspace hypercalls. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index a42a6f3..38c0eaf 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -46,6 +46,7 @@ #define KVM_MAX_CPUID_ENTRIES 40 #define DE_VECTOR 0 +#define UD_VECTOR 6 #define NM_VECTOR 7 #define DF_VECTOR 8 #define TS_VECTOR 10 @@ -316,9 +317,6 @@ struct kvm_vcpu { unsigned long cr0; unsigned long cr2; unsigned long cr3; - gpa_t para_state_gpa; - struct page *para_state_page; - gpa_t hypercall_gpa; unsigned long cr4; unsigned long cr8; u64 pdptrs[4]; /* pae */ @@ -587,7 +585,9 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu); int kvm_mmu_load(struct kvm_vcpu *vcpu); void kvm_mmu_unload(struct kvm_vcpu *vcpu); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu); static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code) diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index ebcc5c9..996d0ee 100644 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -1294,51 +1294,49 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_emulate_halt); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run) +int kvm_emulate_hypercall(struct 
kvm_vcpu *vcpu) { - unsigned long nr, a0, a1, a2, a3, a4, a5, ret; + unsigned long nr, a0, a1, a2, a3, ret; kvm_arch_ops->cache_regs(vcpu); - ret = -KVM_EINVAL; -#ifdef CONFIG_X86_64 - if (is_long_mode(vcpu)) { - nr = vcpu->regs[VCPU_REGS_RAX]; - a0 = vcpu->regs[VCPU_REGS_RDI]; - a1 = vcpu->regs[VCPU_REGS_RSI]; - a2 = vcpu->regs[VCPU_REGS_RDX]; - a3 = vcpu->regs[VCPU_REGS_RCX]; - a4 = vcpu->regs[VCPU_REGS_R8]; - a5 = vcpu->regs[VCPU_REGS_R9]; - } else -#endif - { - nr = vcpu->regs[VCPU_REGS_RBX] & -1u; - a0 = vcpu->regs[VCPU_REGS_RAX] & -1u; - a1 = vcpu->regs[VCPU_REGS_RCX] & -1u; - a2 = vcpu->regs[VCPU_REGS_RDX] & -1u; - a3 = vcpu->regs[VCPU_REGS_RSI] & -1u; - a4 = vcpu->regs[VCPU_REGS_RDI] & -1u; - a5 = vcpu->regs[VCPU_REGS_RBP] & -1u; + + nr = vcpu->regs[VCPU_REGS_RAX]; + a0 = vcpu->regs[VCPU_REGS_RBX]; + a1 = vcpu->regs[VCPU_REGS_RCX]; + a2 = vcpu->regs[VCPU_REGS_RDX]; + a3 = vcpu->regs[VCPU_REGS_RSI]; + + if (!is_long_mode(vcpu)) { + nr &= ~1u; + a0 &= ~1u; + a1 &= ~1u; + a2 &= ~1u; + a3 &= ~1u; } + switch (nr) { default: - run->hypercall.nr = nr; - run->hypercall.args[0] = a0; - run->hypercall.args[1] = a1; - run->hypercall.args[2] = a2; - run->hypercall.args[3] = a3; - run->hypercall.args[4] = a4; - run->hypercall.args[5] = a5; - run->hypercall.ret = ret; - run->hypercall.longmode = is_long_mode(vcpu); - kvm_arch_ops->decache_regs(vcpu); - return 0; + ret = -KVM_ENOSYS; + break; } vcpu->regs[VCPU_REGS_RAX] = ret; kvm_arch_ops->decache_regs(vcpu); - return 1; + return 0; +} +EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu) +{ + char instruction[3]; + + kvm_arch_ops->cache_regs(vcpu); + kvm_arch_ops->patch_hypercall(vcpu, instruction); + if (emulator_write_std(vcpu->rip, instruction, 3, vcpu) + != X86EMUL_CONTINUE) + return -EFAULT; + + return 0; } -EXPORT_SYMBOL_GPL(kvm_hypercall); static u64 mk_cr_64(u64 curr_cr, u32 new_val) { @@ -1406,75 +1404,6 @@ void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long 
val, } } -/* - * Register the para guest w
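The host-side dispatch introduced above reads the call number from RAX and the arguments from RBX/RCX/RDX/RSI, truncates everything to 32 bits when the vcpu is not in long mode (the kernel's `& -1u` idiom), and fails unknown numbers with -KVM_ENOSYS. A user-space model of that flow follows; the KVM_ENOSYS value of 1000 is an assumption based on the review suggestion to start the integer errnos at 1000, and hypercall number 0 is a made-up "nop" for illustration only.

```c
#include <assert.h>

/* Assumed value, per the review suggestion to start KVM errnos at 1000. */
#define KVM_ENOSYS 1000

struct vcpu_regs {
    unsigned long long rax, rbx, rcx, rdx, rsi;
};

/* User-space model of kvm_emulate_hypercall(): nr comes from RAX,
 * args from RBX/RCX/RDX/RSI; outside long mode the values are
 * truncated to their low 32 bits. Hypercall 0 is hypothetical. */
static long long emulate_hypercall(const struct vcpu_regs *r, int long_mode)
{
    unsigned long long nr = r->rax;

    if (!long_mode)
        nr &= 0xffffffffULL;    /* same effect as the kernel's "& -1u" */

    switch (nr) {
    case 0:                     /* hypothetical nop hypercall */
        return 0;
    default:
        return -KVM_ENOSYS;     /* unhandled numbers no longer exit to
                                 * userspace; the host fails them */
    }
}
```

The truncation is observable when RAX carries garbage in its upper half: a 32-bit guest sees hypercall 0, while the same register contents in long mode are an unknown number.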
[PATCH 3/3] KVM paravirt-ops implementation
diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig index f952493..ceacc66 100644 --- a/arch/i386/Kconfig +++ b/arch/i386/Kconfig @@ -237,6 +237,13 @@ config VMI at the moment), by linking the kernel to a GPL-ed ROM module provided by the hypervisor. +config KVM_GUEST + bool "KVM paravirt-ops support" + depends on PARAVIRT + help + This option enables various optimizations for running under the KVM + hypervisor. + config ACPI_SRAT bool default y diff --git a/arch/i386/kernel/Makefile b/arch/i386/kernel/Makefile index 9d33b00..6308fcf 100644 --- a/arch/i386/kernel/Makefile +++ b/arch/i386/kernel/Makefile @@ -42,6 +42,7 @@ obj-$(CONFIG_K8_NB) += k8.o obj-$(CONFIG_MGEODE_LX)+= geode.o obj-$(CONFIG_VMI) += vmi.o vmiclock.o +obj-$(CONFIG_KVM_GUEST)+= kvm.o obj-$(CONFIG_PARAVIRT) += paravirt.o obj-y += pcspeaker.o diff --git a/arch/i386/kernel/kvm.c b/arch/i386/kernel/kvm.c new file mode 100644 index 000..26585a4 --- /dev/null +++ b/arch/i386/kernel/kvm.c @@ -0,0 +1,46 @@ +/* + * KVM paravirt_ops implementation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. 
+ * + * Copyright IBM Corporation, 2007 + * Authors: Anthony Liguori <[EMAIL PROTECTED]> + */ + +#include +#include + +/* + * No need for any "IO delay" on KVM + */ +static void kvm_io_delay(void) +{ +} + +static __init void paravirt_setup(void) +{ + paravirt_ops.name = "KVM"; + + if (kvm_para_has_feature(KVM_PARA_FEATURE_NOP_IO_DELAY)) + paravirt_ops.io_delay = kvm_io_delay; + + paravirt_ops.paravirt_enabled = 1; +} + +void __init kvm_guest_init(void) +{ + if (kvm_para_available()) + paravirt_setup(); +} diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h index e92ea3e..16e51a1 100644 --- a/include/linux/kvm_para.h +++ b/include/linux/kvm_para.h @@ -3,6 +3,14 @@ #include +#ifdef CONFIG_KVM_GUEST +void __init kvm_guest_init(void); +#else +static inline void kvm_guest_init(void) +{ +} +#endif + #define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1" static inline long kvm_hypercall0(unsigned int nr) diff --git a/init/main.c b/init/main.c index d3bcb3b..b224905 100644 --- a/init/main.c +++ b/init/main.c @@ -55,6 +55,7 @@ #include #include #include +#include #include #include @@ -569,6 +570,7 @@ asmlinkage void __init start_kernel(void) } sort_main_extable(); trap_init(); + kvm_guest_init(); rcu_init(); init_IRQ(); pidhash_init(); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH 3/3] KVM - add hypercall nr to kvm_run
Jeff Dike wrote: On Wed, Jul 18, 2007 at 11:28:18AM -0500, Anthony Liguori wrote: Within the kernel right (VMCALL is only usable in ring 1). Yup. Is it terribly important to be able to pass through the syscall arguments in registers versus packing them in a data structure and passing a pointer to that structure? An arg block would be fine, too. Okay, then you should be pretty well insulated against the changing hypercall stuff since you only need one hypercall NR and can use a one-argument hypercall. Regards, Anthony Liguori Jeff
Re: [kvm-devel] [PATCH 3/3] KVM - add hypercall nr to kvm_run
Jeff Dike wrote: On Tue, Jul 17, 2007 at 09:15:51AM -0500, Anthony Liguori wrote: I'm planning on breaking this interface again since the new hypercall API only takes 4 arguments instead of 6. Is anything written anywhere about this hypercall interface? I've posted patches. I'll post more in the near future since we decided to make some rather drastic changes. With the new patchset, guests will use vmcall directly with a couple of bytes of padding. The hypervisor may or may not patch the calling site to use something different (like vmmcall). The thing which would make me happy (which the current code comes close to) is for the guest to be able to pass out Linux system calls with VMCALL. Within the kernel right (VMCALL is only usable in ring 1). Is it terribly important to be able to pass through the syscall arguments in registers versus packing them in a data structure and passing a pointer to that structure? Regards, Anthony Liguori Jeff
Re: [kvm-devel] [GIT PULL][RESEND #2] KVM Updates for 2.6.23-rc1
S.Çağlar Onur wrote: Hi; On Tue, 17 Jul 2007, Avi Kivity wrote: I fixed the issue with the previous patchset. Please provide further feedback, or pull from: git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git for-linus This contains kvm updates for the 2.6.23 merge window, including After that commit (+appArmor patchset) any guest image hangs the system completely as soon as started with kvm-29 userspace. Guest images work as expected with "qemu-kvm -hda X.img -m 512 -no-kvm" Can you reproduce without the appArmor patchset? Regards, Anthony Liguori I cannot provide any more information for now; it freezes the system completely, and netconsole does not work with ipw3945 (netconsole: eth1 doesn't support polling, aborting), but tomorrow I'll try with a cable. If needed you can find the .config and dmesg at [1]; if anything else is needed, please just say so... [1] http://cekirdek.pardus.org.tr/~caglar/kvm/ Cheers
Re: [kvm-devel] [PATCH 3/3] KVM - add hypercall nr to kvm_run
Avi Kivity wrote: Jeff Dike wrote: Add the hypercall number to kvm_run and initialize it. This might be considered API-changing, so I kept it separate. Signed-off-by: Jeff Dike <[EMAIL PROTECTED]> -- drivers/kvm/kvm_main.c |1 + include/linux/kvm.h|1 + 2 files changed, 2 insertions(+) Index: kvm/drivers/kvm/kvm_main.c === --- kvm.orig/drivers/kvm/kvm_main.c +++ kvm/drivers/kvm/kvm_main.c @@ -1361,6 +1361,7 @@ int kvm_hypercall(struct kvm_vcpu *vcpu, } switch (nr) { default: + run->hypercall.nr = nr; run->hypercall.args[0] = a0; run->hypercall.args[1] = a1; run->hypercall.args[2] = a2; Index: kvm/include/linux/kvm.h === --- kvm.orig/include/linux/kvm.h +++ kvm/include/linux/kvm.h @@ -106,6 +106,7 @@ struct kvm_run { } mmio; /* KVM_EXIT_HYPERCALL */ struct { + __u64 nr; __u64 args[6]; __u64 ret; __u32 longmode; It isn't API changing as the API could not be (and has not been) used. We do need to add a padding element to the union to make sure additional union members don't cause its size to change. Quite sad that we got this far with such brokenness. I'm planning on breaking this interface again since the new hypercall API only takes 4 arguments instead of 6. Regards, Anthony Liguori - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH 2/3] KVM - Fix hypercall arguments
Avi Kivity wrote: Jeff Dike wrote: It looks like kvm_hypercall is trying to match the system call convention and mixed up the call number and first argument in the 32-bit case. Signed-off-by: Jeff Dike <[EMAIL PROTECTED]> -- drivers/kvm/kvm_main.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: kvm/drivers/kvm/kvm_main.c === --- kvm.orig/drivers/kvm/kvm_main.c +++ kvm/drivers/kvm/kvm_main.c @@ -1351,8 +1351,8 @@ int kvm_hypercall(struct kvm_vcpu *vcpu, } else #endif { - nr = vcpu->regs[VCPU_REGS_RBX] & -1u; - a0 = vcpu->regs[VCPU_REGS_RAX] & -1u; + nr = vcpu->regs[VCPU_REGS_RAX] & -1u; + a0 = vcpu->regs[VCPU_REGS_RBX] & -1u; a1 = vcpu->regs[VCPU_REGS_RCX] & -1u; a2 = vcpu->regs[VCPU_REGS_RDX] & -1u; a3 = vcpu->regs[VCPU_REGS_RSI] & -1u; Anthony? I think you were hacking this area? I make a similar change in my series. Regards, Anthony Liguori - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Problem with global_flush_tlb() on i386 in 2.6.22-rc4-mm2
show_trace+0x12/0x14 [ 1114.688308] [] dump_stack+0x15/0x17 [ 1114.701557] [] __might_sleep+0xcf/0xe1 [ 1114.715584] [] down_read+0x18/0x4b [ 1114.728574] [] exit_mm+0x27/0xd1 [ 1114.741045] [] do_exit+0x10f/0x88f [ 1114.754035] [] do_trap+0x0/0x152 [ 1114.766503] [] do_page_fault+0x310/0x7ed [ 1114.781050] [] error_code+0x6a/0x70 [ 1114.794303] [] my_open+0xa6/0x124 [test_rodata] [ 1114.810665] [] proc_reg_open+0x42/0x68 [ 1114.824692] [] __dentry_open+0xe6/0x1e2 [ 1114.838979] [] nameidata_to_filp+0x35/0x3f [ 1114.854044] [] do_filp_open+0x3b/0x43 [ 1114.867811] [] do_sys_open+0x43/0x116 [ 1114.881579] [] sys_open+0x1c/0x1e [ 1114.894308] [] syscall_call+0x7/0xb [ 1114.907557] [] 0xe410 [ 1114.918212] === This is clearly the memory write I am trying to do in the page of which I just changed the attributes to RWX. If I remove the variable read before I change the flags, it starts working correctly (as far as I have tested...). If I use my own my_local_tlb_flush() function (not SMP aware) instead of global_flush_tlb(), it works correctly. What is your my_local_tlb_flush() and are you calling with preemption disabled? I also tried just calling clflush on the modified page just after the global_flush_tlb(), and the problem was still there. I therefore suspect that include/asm-i386/tlbflush.h: #define __native_flush_tlb_global() \ do {\ unsigned int tmpreg, cr4, cr4_orig; \ \ __asm__ __volatile__( \ "movl %%cr4, %2; # turn off PGE \n"\ "movl %2, %1;\n"\ "andl %3, %1;\n"\ "movl %1, %%cr4; \n"\ "movl %%cr3, %0; \n"\ "movl %0, %%cr3; # flush TLB\n"\ "movl %2, %%cr4; # turn PGE back on \n"\ : "=&r" (tmpreg), "=&r" (cr4), "=&r" (cr4_orig) \ : "i" (~X86_CR4_PGE)\ : "memory");\ } while (0) is not doing its job correctly. The problem does not seem to be caused by PARAVIRT, because it is still buggy even if I disable the PARAVIRT config option. This is actually very conservative seeing as how disabling CR4.PGE should be sufficient to flush global pages on modern processors. 
I suspect you're getting preempted while it's running.

Regards,

Anthony Liguori
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: QEMU's scsi controller no longer works on arm.
Rob Landley wrote:
> In 2.6.20, I can boot an arm kernel up to a shell prompt, including a virtual scsi hard drive. In 2.6.21-rc1, this stopped working.

It's broken with x86 too. I looked into it for a bit, and it appears that the Linux driver is using more LSI commands than it did before. The odd thing is that it's probing some space that is marked reserved in the LSI manual. Unfortunately, I couldn't understand the SCSI subsystem well enough to work out what changed in Linux.

Regards,

Anthony Liguori

> I tried "git bisect" and found out there's a range of about 5000 commits between the two where arm doesn't compile. At the start of this range, the controller worked. At the end, it didn't anymore.
>
> How can YOU reproduce this problem? I'm glad you asked: the miniconfig I'm using is attached. You'll need an arm compiler to build it, of course. (If you haven't got one, the cross compiler I made is at http://landley.net/code/firmware/downloads/cross-compiler/. Download the armv4l version for the appropriate host, extract the tarball, and add the "bin" directory under that to your path. The x86 version should work on Ubuntu 6.06 or newer; the x86-64 version was built on 7.04.)
>
> Configure with:
>
>   make ARCH=arm allnoconfig KCONFIG_ALLCONFIG=miniconfig-linux
>   make ARCH=arm CROSS_COMPILE=armv4l-
>
> And then run qemu on it:
>
>   qemu-system-arm -M versatilepb -nographic -no-reboot \
>     -hda /dev/null -kernel zImage-armv4l -append \
>     'rw panic=1 root=/dev/sda console=ttyAMA0'
>
> The failing system will loop resetting the scsi controller with lots of timeouts, and takes several minutes to get to the panic where it can't mount root. The working system will panic due to root being /dev/null fairly quickly, without pages of error messages and a very long wait first.
>
> Rob
Re: [Xen-devel] Re: More virtio users
Gerd Hoffmann wrote:
> Hi,
>
>> Framebuffer is an interesting one. Virtio doesn't assume shared memory, so naively for the fb you would just send outbufs describing changed memory. This would work, but describing rectangles is better. A helper might be the right approach here.
>
> Rectangles work just fine for a framebuffer console. They stop working once you plan to run any graphical stuff, such as an X server, on top of the framebuffer. The only way to get notified about changes is page faults, i.e. 4k granularity on the linear framebuffer memory.
>
> Related to the framebuffer are a virtual keyboard and virtual mouse (or better, touchscreen), which probably work perfectly fine with virtio. I'd guess you can even reuse the input layer event struct for the virtio events.

Virtio seems like overkill for either of those things. It's necessary for pure PV, but not at all necessary for something like KVM.

> Xen has a virtual framebuffer, kbd & mouse, although not (yet?) in the paravirt_ops patch queue, so there is something you can look at ;)

In retrospect, IMHO, a shared framebuffer was a bad idea for Xen. It's easy enough for dealing with an unaccelerated display, but once you start getting into more advanced features like blitting (which is really important for decent VNC performance), the synchronization becomes a big problem.

If we were to do a PV display driver again, I really think something closer to a VNC protocol is the right way to go. There simply isn't a significant performance overhead to copying the relatively small amount of memory. So virtio in its current form would actually be pretty useful for that.

Regards,

Anthony Liguori

> cheers,
>   Gerd
Re: A set of "standard" virtual devices?
Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
>> The implementation wouldn't need to use PCI at all. There wouldn't even need to be PCI-like registers internally. Just a PCI device with an ID somewhere in sysfs. PCI with unique IDs is just a convenient and well-established key into the driver module collection. Once you have the right driver, it can do what it wants.
>
> But I understood hpa's suggestion to mean that there would be a standard PCI interface for a hardware RNG, and a single Linux driver for that device, which all hypervisors would be expected to implement. But that's only reasonable if the virtualization environment has some notion of PCI to expose to the Linux guest.

The actual PCI bus could be paravirtualized. It's just a question of whether one reinvents a device discovery mechanism (like XenBus) or piggybacks on existing mechanisms.

Furthermore, in the future, I strongly suspect that HVM will become much more important for Xen than PV, and since that already has a PCI bus, it's not really that big of a deal.

Regards,

Anthony Liguori

> J
Re: [kvm-devel] [PATCH 0/15] KVM userspace interface updates
Heiko Carstens wrote:
> On Sun, Mar 11, 2007 at 03:53:12PM +0200, Avi Kivity wrote:
>> This patchset updates the kvm userspace interface to what I hope will be the long-term stable interface. Provisions are included for extending the interface later. The patches address performance and cleanliness concerns.
>
> Searching the mailing list, I figured that as soon as the interface seems to be stable, kvm should/would switch to a system call based interface. I assume the userspace interface might still change a lot, especially if kvm is ported to new architectures. But the general question is: do you still plan to switch to a syscall interface?

What benefit would a syscall interface have?

Regards,

Anthony Liguori