Re: [PATCH] virtio_pci: properly clean up MSI-X state when initialization fails
Hi Michael,

On Mon, Sep 15, 2014 at 1:32 AM, Michael S. Tsirkin wrote:
> On Sun, Sep 14, 2014 at 08:23:26PM -0700, Anthony Liguori wrote:
>> From: Anthony Liguori
>>
>> If MSI-X initialization fails after setting msix_enabled = 1, then
>> the device is left in an inconsistent state.  This would normally
>> only happen if there was a bug in the device emulation but it still
>> should be handled correctly.
>
> This might happen if host runs out of resources when trying
> to map VQs to vectors, so doesn't have to be a bug.
>
> But I don't see what the problem is:
> msix_used_vectors reflects the number of used vectors
> and msix_enabled is set, thus vp_free_vectors
> will free all IRQs and then disable MSIX.
>
> Where is the inconsistency you speak about?

I missed the fact that vp_free_vectors() conditionally sets
msix_enabled=0.  It seems a bit kludgy, especially since it is called
both before and after setting msix_enabled=1.

I ran into a number of weird problems due to read/write reordering
that was causing this code path to fail.  The impact was
non-deterministic.  I'll go back and try to better understand what
code path is causing it.
>> Cc: Matt Wilson
>> Cc: Rusty Russell
>> Cc: Michael Tsirkin
>> Signed-off-by: Anthony Liguori
>> ---
>>  drivers/virtio/virtio_pci.c | 8 ++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
>> index 9cbac33..3d2c2a5 100644
>> --- a/drivers/virtio/virtio_pci.c
>> +++ b/drivers/virtio/virtio_pci.c
>> @@ -357,7 +357,7 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
>>  		v = ioread16(vp_dev->ioaddr + VIRTIO_MSI_CONFIG_VECTOR);
>>  		if (v == VIRTIO_MSI_NO_VECTOR) {
>>  			err = -EBUSY;
>> -			goto error;
>> +			goto error_msix_used;
>>  		}
>>
>>  		if (!per_vq_vectors) {
>> @@ -369,11 +369,15 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
>>  				  vp_vring_interrupt, 0, vp_dev->msix_names[v],
>>  				  vp_dev);
>>  			if (err)
>> -				goto error;
>> +				goto error_msix_used;
>>  			++vp_dev->msix_used_vectors;
>>  	}
>>  	return 0;
>> +error_msix_used:
>> +	v = --vp_dev->msix_used_vectors;
>> +	free_irq(vp_dev->msix_entries[v].vector, vp_dev);
>>  error:
>> +	vp_dev->msix_enabled = 0;
>
> As far as I can see, if you do this, guest will not call
> pci_disable_msix thus leaving the device with MSIX enabled.

I don't understand this comment.  How is the work done in this path any
different from what's done in vp_free_vectors()?

Regards,

Anthony Liguori

> I'm not sure this won't break drivers if they then
> try to use the device without MSIX, and it
> definitely seems less elegant than reverting the
> device to the original state.
>
>>  	vp_free_vectors(vdev);
>>  	return err;
>>  }
>> --
>> 1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH] xen-gnttab: do not add m2p override entries for blkback mappings
From: Anthony Liguori

Commit 5dc03639 switched blkback to also add m2p override entries when
mapping grant pages but history seems to have forgotten why this is
useful if it ever was.

The blkback driver does not need m2p override entries to exist and
there is significant overhead due to the locking in the m2p override
table.  We see about a 10% improvement in IOP rate with this patch
applied running FIO in the guest.

See http://article.gmane.org/gmane.linux.kernel/1590932 for a full
analysis of current users.

Signed-off-by: Anthony Liguori
---
A backported version of this has been heavily tested but the testing
against the latest Linux tree is light so far.
---
 drivers/block/xen-blkback/blkback.c | 12 -
 drivers/xen/gntdev.c                |  4 +--
 drivers/xen/grant-table.c           | 49 ---
 include/xen/grant_table.h           | 14 --
 4 files changed, 59 insertions(+), 20 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index bf4b9d2..53ea90e 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -286,7 +286,7 @@ static void free_persistent_gnts(struct xen_blkif *blkif, struct rb_root *root,
 		if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST ||
 		    !rb_next(&persistent_gnt->node)) {
 			ret = gnttab_unmap_refs(unmap, NULL, pages,
-						segs_to_unmap);
+						segs_to_unmap, 0);
 			BUG_ON(ret);
 			put_free_pages(blkif, pages, segs_to_unmap);
 			segs_to_unmap = 0;
@@ -322,7 +322,7 @@ static void unmap_purged_grants(struct work_struct *work)
 		if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
 			ret = gnttab_unmap_refs(unmap, NULL, pages,
-						segs_to_unmap);
+						segs_to_unmap, 0);
 			BUG_ON(ret);
 			put_free_pages(blkif, pages, segs_to_unmap);
 			segs_to_unmap = 0;
@@ -330,7 +330,7 @@ static void unmap_purged_grants(struct work_struct *work)
 		kfree(persistent_gnt);
 	}
 	if (segs_to_unmap > 0) {
-		ret = gnttab_unmap_refs(unmap, NULL, pages, segs_to_unmap);
+		ret = gnttab_unmap_refs(unmap, NULL, pages, segs_to_unmap, 0);
 		BUG_ON(ret);
 		put_free_pages(blkif, pages, segs_to_unmap);
 	}
@@ -671,14 +671,14 @@ static void xen_blkbk_unmap(struct xen_blkif *blkif,
 		pages[i]->handle = BLKBACK_INVALID_HANDLE;
 		if (++invcount == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
 			ret = gnttab_unmap_refs(unmap, NULL, unmap_pages,
-						invcount);
+						invcount, 0);
 			BUG_ON(ret);
 			put_free_pages(blkif, unmap_pages, invcount);
 			invcount = 0;
 		}
 	}
 	if (invcount) {
-		ret = gnttab_unmap_refs(unmap, NULL, unmap_pages, invcount);
+		ret = gnttab_unmap_refs(unmap, NULL, unmap_pages, invcount, 0);
 		BUG_ON(ret);
 		put_free_pages(blkif, unmap_pages, invcount);
 	}
@@ -740,7 +740,7 @@ again:
 	}
 	if (segs_to_map) {
-		ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map);
+		ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map, 0);
 		BUG_ON(ret);
 	}

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index e41c79c..9ab1792 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -285,7 +285,7 @@ static int map_grant_pages(struct grant_map *map)
 	pr_debug("map %d+%d\n", map->index, map->count);
 	err = gnttab_map_refs(map->map_ops,
 			      use_ptemod ? map->kmap_ops : NULL,
-			      map->pages, map->count);
+			      map->pages, map->count, 1);
 	if (err)
 		return err;
@@ -317,7 +317,7 @@ static int __unmap_grant_pages(struct grant_map *map, int offset, int pages)
 	err = gnttab_unmap_refs(map->unmap_ops + offset,
 				use_ptemod ? map->kmap_ops + offset : NULL,
 				map->pages + offset,
-				pages);
+				pages, 1);
 	if (err)
 		return err;

diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index c4d2298..081be8d 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -881,7 +881,8 @@ EXPORT_SYMBOL_GPL(gnttab_batch_copy);
 int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
 		    struct gnttab_map_gran
Re: [Xen-devel] [PATCH] grant-table: don't set m2p override if kmap_ops is not set
Ian Campbell writes:

> On Tue, 2013-11-05 at 12:53 -0800, Anthony Liguori wrote:
>> >> Yes, you are completely right, then I have to figure out why blkback
>> >> works fine with this patch applied (or at least it seems to work fine).
>> >
>> > blkback also works for me when testing a similar patch.  I'm still
>> > confused.  One thing with your proposed patch: I'm not sure that you're
>> > putting back the correct mfn.
>
> As long as it is unique and owned by the local domain I guess it doesn't
> matter which mfn goes back?  It's going to get given back to the generic
> allocator so the content is irrelevant?  (all speculation without looking
> at blkback)

Hrm, so m2p_add_override already does
set_phys_to_machine(pfn, FOREIGN_FRAME(mfn)) so this patch doesn't add
anything that isn't already there.  It's just removing the addition to
the m2p override table.

So what happens when the MFN isn't in the m2p override table?  It's
treated as a foreign PFN and m2p lookups fail.  But why is this a
problem for blkback?

>> It's perfectly fine to store a foreign pfn in the m2p table.
>
> No, it's not.  The m2p is host global and maps mfns to the pfns in the
> owning domain.  A domain which grant maps a foreign mfn into its address
> space doesn't get to update the m2p for that mfn.  Perhaps you were
> thinking of the p2m?

Yes, I typoed that bit.

> Depending on how an OS mapping the foreign MFN does things it may well
> be accurate to say that having a foreign pfn in the m2p table is
> harmless, in as much as it will never try and use the m2p for foreign
> pages for anything real, but it depends on the implementation of the
> particular OS.

Note that this patch isn't doing anything to the m2p table.

>> The m2p
>> override table is used by the grant device to allow a reverse lookup of
>> the real mfn to a pfn even if it's foreign.
>
> Classic Xen used an array of struct page * overrides in the struct vma
> for this sort of use case (e.g. in blktap).  That obviously cannot fly
> for pvops...

I don't think the usage from blkback precludes using the gntdev either.
It just so happens that m2p_add_override does extra work but I don't
think it actually benefits blkback in any way.  Am I missing something?

Regards,

Anthony Liguori

>
> Ian.
Re: [Xen-devel] [PATCH] grant-table: don't set m2p override if kmap_ops is not set
On Tue, Nov 5, 2013 at 1:16 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Nov 05, 2013 at 12:53:17PM -0800, Anthony Liguori wrote:
>> Matt Wilson writes:
>>
>> > On Tue, Nov 05, 2013 at 05:03:58PM +0100, Roger Pau Monné wrote:
>> >> On 05/11/13 16:08, Ian Campbell wrote:
>> >> > On Tue, 2013-11-05 at 16:01 +0100, Roger Pau Monné wrote:
>> >> >> On 05/11/13 15:56, Konrad Rzeszutek Wilk wrote:
>> >> >>> On Tue, Nov 05, 2013 at 03:47:08PM +0100, Roger Pau Monné wrote:
>> >> >>>> On 05/11/13 13:36, David Vrabel wrote:
>> >> >>>>> On 05/11/13 11:24, Roger Pau Monne wrote:
>> >> >>>>>> IMHO there's no reason to set a m2p override if the mapping is
>> >> >>>>>> done in kernel space, so only set the m2p override when
>> >> >>>>>> kmap_ops is set.
>> >> >>>>>
>> >> >>>>> Can you provide a more detailed reasoning about why this is safe?
>> >> >>>>
>> >> >>>> To tell the truth, I don't understand why we need to use the m2p
>> >> >>>> override for kernel space only mappings, my understanding is that
>> >> >>>> this m2p override is needed for user space mappings only (where we
>> >> >>>> actually end up doing two mappings, one in kernel space and one in
>> >> >>>> user space).  For kernel space I don't see why we need to do
>> >> >>>> anything else than setting the right p2m translation.
>> >> >>>
>> >> >>> We needed the m2p when doing DMA operations.  As the driver would
>> >> >>> want the bus address (so p2m) and then when unmapping the DMA we
>> >> >>> only get the bus address - so we needed to do a m2p lookup.
>> >> >>
>> >> >> OK, we need a m2p (that we already have in machine_to_phys_mapping),
>> >> >> what I don't understand is why we need the m2p override.
>> >> >
>> >> > The m2p is a host global table.
>> >> >
>> >> > For a foreign page grant mapped into the current domain the m2p will
>> >> > give you the foreign (owner) domain's p from the m, not the local one.
>> >>
>> >> Yes, you are completely right, then I have to figure out why blkback
>> >> works fine with this patch applied (or at least it seems to work fine).
>> >
>> > blkback also works for me when testing a similar patch.  I'm still
>> > confused.  One thing with your proposed patch: I'm not sure that you're
>> > putting back the correct mfn.
>>
>> It's perfectly fine to store a foreign pfn in the m2p table.  The m2p
>> override table is used by the grant device to allow a reverse lookup of
>> the real mfn to a pfn even if it's foreign.
>>
>> blkback doesn't actually need this though.  This was introduced in:
>>
>> commit 5dc03639cc903f887931831d69895facb5260f4b
>> Author: Konrad Rzeszutek Wilk
>> Date:   Tue Mar 1 16:46:45 2011 -0500
>>
>>     xen/blkback: Utilize the M2P override mechanism for GNTMAP_host_map
>>
>> Purely as an optimization.  In practice though due to lock contention it
>> slows things down.
>>
>> I think an alternative would be to use a read/write lock instead of just
>> a spinlock since it's the read path that is the most hot.
>
> The m2p hash table can also be expanded to lower the contention.

That was the first thing I tried :-)  The issue isn't lookup cost as
much as the number of acquisitions and the cost of adding/removing
entries.

>> I haven't tested that yet though.
>
> Looking forward to your patches :-)

I'm really just being thorough in suggesting the rwlock.  I actually
think that not using the m2p override table is the right long term fix.

Regards,

Anthony Liguori

>>
>> Regards,
>>
>> Anthony Liguori
>>
>> >
>> > Adding Anthony to the thread.
>> >
>> > --msw
Re: [Xen-devel] [PATCH] grant-table: don't set m2p override if kmap_ops is not set
Matt Wilson writes:

> On Tue, Nov 05, 2013 at 05:03:58PM +0100, Roger Pau Monné wrote:
>> On 05/11/13 16:08, Ian Campbell wrote:
>> > On Tue, 2013-11-05 at 16:01 +0100, Roger Pau Monné wrote:
>> >> On 05/11/13 15:56, Konrad Rzeszutek Wilk wrote:
>> >>> On Tue, Nov 05, 2013 at 03:47:08PM +0100, Roger Pau Monné wrote:
>> >>>> On 05/11/13 13:36, David Vrabel wrote:
>> >>>>> On 05/11/13 11:24, Roger Pau Monne wrote:
>> >>>>>> IMHO there's no reason to set a m2p override if the mapping is done in
>> >>>>>> kernel space, so only set the m2p override when kmap_ops is set.
>> >>>>>
>> >>>>> Can you provide a more detailed reasoning about why this is safe?
>> >>>>
>> >>>> To tell the truth, I don't understand why we need to use the m2p
>> >>>> override for kernel space only mappings, my understanding is that this
>> >>>> m2p override is needed for user space mappings only (where we actually
>> >>>> end up doing two mappings, one in kernel space and one in user space).
>> >>>> For kernel space I don't see why we need to do anything else than
>> >>>> setting the right p2m translation.
>> >>>
>> >>> We needed the m2p when doing DMA operations.  As the driver would
>> >>> want the bus address (so p2m) and then when unmapping the DMA we
>> >>> only get the bus address - so we needed to do a m2p lookup.
>> >>
>> >> OK, we need a m2p (that we already have in machine_to_phys_mapping),
>> >> what I don't understand is why we need the m2p override.
>> >
>> > The m2p is a host global table.
>> >
>> > For a foreign page grant mapped into the current domain the m2p will
>> > give you the foreign (owner) domain's p from the m, not the local one.
>>
>> Yes, you are completely right, then I have to figure out why blkback
>> works fine with this patch applied (or at least it seems to work fine).
>
> blkback also works for me when testing a similar patch.  I'm still
> confused.  One thing with your proposed patch: I'm not sure that you're
> putting back the correct mfn.
It's perfectly fine to store a foreign pfn in the m2p table.  The m2p
override table is used by the grant device to allow a reverse lookup of
the real mfn to a pfn even if it's foreign.

blkback doesn't actually need this though.  This was introduced in:

commit 5dc03639cc903f887931831d69895facb5260f4b
Author: Konrad Rzeszutek Wilk
Date:   Tue Mar 1 16:46:45 2011 -0500

    xen/blkback: Utilize the M2P override mechanism for GNTMAP_host_map

Purely as an optimization.  In practice though due to lock contention it
slows things down.

I think an alternative would be to use a read/write lock instead of just
a spinlock since it's the read path that is the most hot.  I haven't
tested that yet though.

Regards,

Anthony Liguori

>
> Adding Anthony to the thread.
>
> --msw
Re: kvmtool tree (Was: Re: [patch] config: fix make kvmconfig)
On 02/13/2013 02:56 AM, Pekka Enberg wrote:
> On Wed, Feb 13, 2013 at 10:23 AM, Paolo Bonzini wrote:
>> On 12/02/2013 10:52, Ingo Molnar wrote:
>>> Check the list I gave (unmodified):
>>>
>>> "- Pekka listed new virtio drivers that were done via tools/kvm.
>>
>> vhost-scsi got in first in tools/kvm, but out-of-tree patches had
>> existed for QEMU for more than a year.  It was developed with QEMU.
>
> I think Ingo confused virtio and vhost.  IIRC, Asias developed
> vhost-blk using tools/kvm.

We've done extensive performance analysis of vhost-blk and it's not any
faster than a userspace solution.  This is why it's still not in
mainline.  This wasn't noticed with tools/kvm because its block layer is
too simplistic.

In order to get to the point where we're able to do this well in
userspace in QEMU took tons of bug fixes to the kernel and added
features (like pread64/pwrite64).  That all happened without QEMU being
in the kernel.

Regards,

Anthony Liguori
Re: [PATCH 0/5] Alter steal time reporting in KVM
Glauber Costa writes:

> Hi,
>
> On 11/27/2012 12:36 AM, Michael Wolf wrote:
>> In the case of where you have a system that is running in a
>> capped or overcommitted environment the user may see steal time
>> being reported in accounting tools such as top or vmstat.  This can
>> cause confusion for the end user.  To ease the confusion this patch set
>> adds the idea of consigned (expected steal) time.  The host will separate
>> the consigned time from the steal time.  The consignment limit passed to
>> the host will be the amount of steal time expected within a fixed period
>> of time.  Any other steal time accruing during that period will show as
>> the traditional steal time.
>
> If you submit this again, please include a version number in your series.
>
> It would also be helpful to include a small changelog about what changed
> between last version and this version, so we could focus on that.
>
> As for the rest, I answered your previous two submissions saying I don't
> agree with the concept.  If you hadn't changed anything, resending it
> won't change my mind.
>
> I could of course, be mistaken or misguided.  But I had also not seen any
> wave of support in favor of this previously, so basically I have no new
> data to make me believe I should see it any differently.
>
> Let's try this again:
>
> * Rik asked you in your last submission how does ppc handle this.  You
>   said, and I quote: "In the case of lpar on POWER systems they simply
>   report steal time and do not alter it in any way.  They do however
>   report how much processor is assigned to the partition and that
>   information is in /proc/ppc64/lparcfg."

This is only helpful for static entitlements.  But if we allow dynamic
entitlements--which is a very useful feature, think buying an online
"upgrade" in a cloud environment--then you need to account for
entitlement loss at the same place where you do the rest of the
accounting: in /proc/stat.

> Now, that is a *way* more sensible thing to do.  Much more.  "Confusing
> users" is something extremely subjective.  This is specially true about
> concepts that are known for quite some time, like steal time.  If you
> out of a sudden change the meaning of this, it is sure to confuse a lot
> more users than it would clarify.

I'll bring you a nice bottle of scotch at the next KVM Forum if you can
find me one user that can accurately describe what steal time is.

The semantics are so incredibly subtle that I have a hard time believing
anyone actually understands what it means today.

Regards,

Anthony Liguori

>> ---
>>
>> Michael Wolf (5):
>>       Alter the amount of steal time reported by the guest.
>>       Expand the steal time msr to also contain the consigned time.
>>       Add the code to send the consigned time from the host to the guest
>>       Add a timer to allow the separation of consigned from steal time.
>>       Add an ioctl to communicate the consign limit to the host.
>>
>>
>>  arch/x86/include/asm/kvm_host.h       | 11 +++
>>  arch/x86/include/asm/kvm_para.h       |  3 +-
>>  arch/x86/include/asm/paravirt.h       |  4 +--
>>  arch/x86/include/asm/paravirt_types.h |  2 +
>>  arch/x86/kernel/kvm.c                 |  8 ++---
>>  arch/x86/kernel/paravirt.c            |  4 +--
>>  arch/x86/kvm/x86.c                    | 50 -
>>  fs/proc/stat.c                        |  9 +-
>>  include/linux/kernel_stat.h           |  2 +
>>  include/linux/kvm_host.h              |  2 +
>>  include/uapi/linux/kvm.h              |  2 +
>>  kernel/sched/core.c                   | 10 ++-
>>  kernel/sched/cputime.c                | 21 +-
>>  kernel/sched/sched.h                  |  2 +
>>  virt/kvm/kvm_main.c                   |  7 +
>>  15 files changed, 120 insertions(+), 17 deletions(-)
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Pv-drivers] [PATCH 0/6] VSOCK for Linux upstreaming
On 11/07/2012 12:58 AM, Gerd Hoffmann wrote:
> On 11/05/12 19:19, Andy King wrote:
>> Hi David,
>>
>>> The big and only question is whether anyone can actually use any of
>>> this stuff without your proprietary bits?
>>
>> Do you mean the VMCI calls?  The VMCI driver is in the process of being
>> upstreamed into the drivers/misc tree.  Greg (cc'd on these patches) is
>> actively reviewing that code and we are addressing feedback.
>>
>> Also, there was some interest from RedHat into using vSockets as a
>> unified interface, routed over a hypervisor-specific transport (virtio
>> or otherwise, although for now VMCI is the only one implemented).
>
> Can you outline how this can be done?  From a quick look over the code
> it seems like vsock has a hard dependency on vmci, is that correct?
>
> When making vsock a generic, reusable kernel service it should be the
> other way around: vsock should provide the core implementation and an
> interface where hypervisor-specific transports (vmci, virtio, xenbus,
> ...) can register themselves.

This was already done in a hypervisor neutral way using virtio:

http://lists.openwall.net/netdev/2008/12/14/8

The concept was Nacked and that led to the abomination of virtio-serial.
If an address family for virtualization is on the table, we should
reconsider AF_VMCHANNEL.

I'd be thrilled to get rid of virtio-serial...

Regards,

Anthony Liguori

> cheers,
>   Gerd
Re: [Qemu-devel] Proposal for virtio standardization.
Rusty Russell writes:

> Hi all,
>
>         I've had several requests for a more formal approach to the
> virtio draft spec, and (after some soul-searching) I'd like to try that.
>
>         The proposal is to use OASIS as the standards body, as it's
> fairly light-weight as these things go.  For me this means paperwork and
> setting up a Working Group and getting the right people involved as
> Voting members starting with the current contributors; for most of you
> it just means a new mailing list, though I'll be cross-posting any
> drafts and major changes here anyway.
>
>         I believe that a documented standard (aka virtio 1.0) will
> increase visibility and adoption in areas outside our normal linux/kvm
> universe.  There's been some of that already, but this is the clearest
> path to accelerate it.  Not the easiest path, but I believe that a solid
> I/O standard is a Good Thing for everyone.
>
>         Yet I also want to decouple new and experimental development
> from the standards effort; running code comes first.  New feature bits
> and new device numbers should be reservable without requiring a full
> spec change.
>
> So the essence of my proposal is:
> 1) I start a Working Group within OASIS where we can aim for virtio spec
>    1.0.
>
> 2) The current spec is textually reordered so the core is clearly
>    bus-independent, with PCI, mmio, etc appendices.
>
> 3) Various clarifications, formalizations and cleanups to the spec text,
>    and possibly elimination of old deprecated features.
>
> 4) The only significant change to the spec is that we use PCI
>    capabilities, so we can have infinite feature bits.
>    (see http://lists.linuxfoundation.org/pipermail/virtualization/2011-December/019198.html)

We discussed this on IRC last night.  I don't think PCI capabilities are
a good mechanism to use...

PCI capabilities are there to organize how the PCI config space is
allocated to allow vendor extensions to co-exist with future PCI
extensions.

But we've never used the PCI config space within virtio-pci.
We do everything in BAR0.  I don't think there's any real advantage of
using the config space vs. a BAR for virtio-pci.

The nice thing about using a BAR is that the region is MMIO or PIO.
This maps really nicely to non-PCI transports too.

But extending the PCI config space (especially dealing with capability
allocation) is pretty gnarly and there isn't an obvious equivalent
outside of PCI.

There are very few devices that we emulate today that make use of
extended PCI device registers outside the platform devices (that have
no BARs).

Regards,

Anthony Liguori

> 5) Changes to the ring layout and other such things are deferred to a
>    future virtio version; whether this is done within OASIS or
>    externally depends on how well this works for the 1.0 release.
>
> Thoughts?
> Rusty.
Re: [PATCH 0/3] virtio-net: inline header support
Rusty Russell writes:

> Anthony Liguori writes:
>> Rusty Russell writes:
>>
>>> "Michael S. Tsirkin" writes:
>>>
>>>> Thinking about Sasha's patches, we can reduce ring usage
>>>> for virtio net small packets dramatically if we put
>>>> virtio net header inline with the data.
>>>> This can be done for free in case guest net stack allocated
>>>> extra head room for the packet, and I don't see
>>>> why would this have any downsides.
>>>
>>> I've been wanting to do this for the longest time... but...
>>>
>>>> Even though with my recent patches qemu
>>>> no longer requires header to be the first s/g element,
>>>> we need a new feature bit to detect this.
>>>> A trivial qemu patch will be sent separately.
>>>
>>> There's a reason I haven't done this.  I really, really dislike "my
>>> implementation isn't broken" feature bits.  We could have an infinite
>>> number of them, for each bug in each device.
>>
>> This is a bug in the specification.
>>
>> The QEMU implementation pre-dates the specification.  All of the actual
>> implementations of virtio relied on the semantics of s/g elements and
>> still do.
>
> lguest fix is pending in my queue.  lkvm and qemu are broken; lkvm isn't
> ever going to be merged, so I'm not sure what its status is?  But I'm
> determined to fix qemu, and hence my torture patch to make sure this
> doesn't creep in again.

There are even more implementations out there and I'd wager they all
rely on framing.

>> What's in the specification really doesn't matter when it doesn't agree
>> with all of the existing implementations.
>>
>> Users use implementations, not specifications.  The specification really
>> ought to be changed here.
>
> I'm sorely tempted, except that we're losing a real optimization because
> of this :(

What optimizations?  What Michael is proposing is still achievable with
a device feature.

Are there other optimizations that can be achieved by changing framing
that we can't achieve with feature bits?
As I mentioned in another note, bad framing decisions can cause
performance issues too...

> The specification has long contained the footnote:
>
>     The current qemu device implementations mistakenly insist that
>     the first descriptor cover the header in these cases exactly, so
>     a cautious driver should arrange it so.

I seem to recall this being a compromise between you and I.  I think I
objected strongly to this back when you first wrote the spec and you
added this to appease me ;-)

Regards,

Anthony Liguori

>
> I'd like to tie this caveat to the PCI capability change, so this note
> will move to the appendix with the old PCI layout.
>
> Cheers,
> Rusty.
Re: [PATCH 0/3] virtio-net: inline header support
Rusty Russell writes:

> "Michael S. Tsirkin" writes:
>
> There's a reason I haven't done this.  I really, really dislike "my
> implementation isn't broken" feature bits.  We could have an infinite
> number of them, for each bug in each device.
>
> So my plan was to tie this assumption to the new PCI layout.  And have a
> stress-testing patch like the one below in the kernel (see my virtio-wip
> branch for stuff like this).  Turn it on at boot with
> "virtio_ring.torture" on the kernel commandline.
>
> BTW, I've fixed lguest, but my kvm here (Ubuntu precise, kvm-qemu 1.0)
> is too old.  Building the latest git now...
>
> Cheers,
> Rusty.
>
> Subject: virtio: CONFIG_VIRTIO_DEVICE_TORTURE
>
> Virtio devices are not supposed to depend on the framing of the
> scatter-gather lists, but various implementations did.  Safeguard this
> in future by adding an option to deliberately create perverse
> descriptors.
>
> Signed-off-by: Rusty Russell

Ignoring framing is really a bad idea.  You want backends to enforce
reasonable framing because guests shouldn't do silly things with
framing.

For instance, with virtio-blk, if you want decent performance, you
absolutely want to avoid bouncing the data.  If you're using O_DIRECT
in the host to submit I/O requests, then it's critical that all of the
s/g elements are aligned to a sector boundary and sized to a sector
boundary.

Yes, QEMU can handle it if that's not the case, but it would be insanely
stupid for a guest not to do this.  This is the sort of thing that ought
to be enforced in the specification because a guest cannot perform well
if it doesn't follow these rules.  A spec isn't terribly useful if the
result is guest drivers that are slow.

There's very little to gain by not enforcing rules around framing and
there's a lot to lose if a guest frames incorrectly.

In the rare case where we want to make a framing change, we should use
feature bits like Michael is proposing.  In this case, we should simply
say that with the feature bit, the vnet header can be in the same
element as the data but not allow the header to be spread across
multiple elements.

Regards,

Anthony Liguori

>
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 8d5bddb..930a4ea 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -5,6 +5,15 @@ config VIRTIO
>  	  bus, such as CONFIG_VIRTIO_PCI, CONFIG_VIRTIO_MMIO, CONFIG_LGUEST,
>  	  CONFIG_RPMSG or CONFIG_S390_GUEST.
>
> +config VIRTIO_DEVICE_TORTURE
> +	bool "Virtio device torture tests"
> +	depends on VIRTIO && DEBUG_KERNEL
> +	help
> +	  This makes the virtio_ring implementation creatively change
> +	  the format of requests to make sure that devices are
> +	  properly implemented.  This will make your virtual machine
> +	  slow *and* unreliable!  Say N.
> +
>  menu "Virtio drivers"
>
>  config VIRTIO_PCI
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index e639584..8893753 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -124,6 +124,149 @@ struct vring_virtqueue
>
>  #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
>
> +#ifdef CONFIG_VIRTIO_DEVICE_TORTURE
> +static bool torture;
> +module_param(torture, bool, 0644);
> +
> +struct torture {
> +	unsigned int orig_out, orig_in;
> +	void *orig_data;
> +	struct scatterlist sg[4];
> +	struct scatterlist orig_sg[];
> +};
> +
> +static size_t tot_len(struct scatterlist sg[], unsigned num)
> +{
> +	size_t len, i;
> +
> +	for (len = 0, i = 0; i < num; i++)
> +		len += sg[i].length;
> +
> +	return len;
> +}
> +
> +static void copy_sg_data(const struct scatterlist *dst, unsigned dnum,
> +			 const struct scatterlist *src, unsigned snum)
> +{
> +	unsigned len;
> +	struct scatterlist s, d;
> +
> +	s = *src;
> +	d = *dst;
> +
> +	while (snum && dnum) {
> +		len = min(s.length, d.length);
> +		memcpy(sg_virt(&d), sg_virt(&s), len);
> +		d.offset += len;
> +		d.length -= len;
> +		s.offset += len;
> +		s.length -= len;
> +		if (!s.length) {
> +			BUG_ON(snum == 0);
> +			src++;
> +			snum--;
> +			s = *src;
> +		}
> +		if (!d.length) {
> +			BUG_ON(dnum == 0);
> +			dst++;
> +			dnum--;
> +
Re: [PATCH 0/3] virtio-net: inline header support
Rusty Russell writes: > "Michael S. Tsirkin" writes: > >> Thinking about Sasha's patches, we can reduce ring usage >> for virtio net small packets dramatically if we put >> virtio net header inline with the data. >> This can be done for free in case guest net stack allocated >> extra head room for the packet, and I don't see >> why would this have any downsides. > > I've been wanting to do this for the longest time... but... > >> Even though with my recent patches qemu >> no longer requires header to be the first s/g element, >> we need a new feature bit to detect this. >> A trivial qemu patch will be sent separately. > > There's a reason I haven't done this. I really, really dislike "my > implemention isn't broken" feature bits. We could have an infinite > number of them, for each bug in each device. This is a bug in the specification. The QEMU implementation pre-dates the specification. All of the actual implementations of virtio relied on the semantics of s/g elements and still do. What's in the specification really doesn't matter when it doesn't agree with all of the existing implementations. Users use implementations, not specifications. The specification really ought to be changed here. Regards, Anthony Liguori -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting
Avi Kivity writes: > On 08/27/2012 02:27 PM, Michael Wolf wrote: >> On Mon, 2012-08-27 at 13:31 -0700, Avi Kivity wrote: >> > On 08/27/2012 01:23 PM, Michael Wolf wrote: >> > > > >> > > > How would a guest know what its entitlement is? >> > > > >> > > > >> > > >> > > Currently the Admin/management tool setting up the guests will put it on >> > > the qemu commandline. From this it is passed via an ioctl to the host. >> > > The guest will get the value from the host via a hypercall. >> > > >> > > In the future the host could try and do some of it automatically in some >> > > cases. >> > >> > Seems to me it's a meaningless value for the guest. Suppose it is >> > migrated to a host that is more powerful, and as a result its relative >> > entitlement is reduced. The value needs to be adjusted. >> >> This is why I chose to manage the value from the sysctl interface rather >> than just have it stored as a value in /proc. Whatever tool was used to >> migrate the vm could hopefully adjust the sysctl value on the guest. > > We usually try to avoid this type of coupling. What if the guest is > rebooting while this is happening? What if it's not running Linux at > all? The guest shouldn't need to know it's entitlement. Or at least, it's up to a management tool to report that in a way that's meaningful for the guest. For instance, with a hosting provider, you may have 3 service levels (small, medium, large). How you present whether the guest is small, medium, or large to the guest is up to the hosting provider. > >> > >> > This is best taken care of from the host side. >> >> Not sure what you are getting at here. If you are running in a cloud >> environment, you purchase a VM with the understanding that you are >> getting certain resources. As this type of user I don't believe you >> have any access to the host to see this type of information. So the >> user still wouldnt have a way to confirm that they are receiving what >> they should be in the way of processor resources. 
>> >> Would you please elaborate a little more on this? > > I meant not reporting this time as steal time. But that cripples steal > time reporting. > > Looks like for each quanta we need to report how much real time has > passed, how much the guest was actually using, and how much the guest > was not using due to overcommit (with the reminder being unallocated > time). The guest could then present it any way it wanted to. What I had previously suggested what splitting entitlement loss out of steal time and reporting it as a separate metric (but not reporting a fixed notion of entitlement). You're missing the entitlement loss bit above. But you need to call out entitlement loss in order to report idle time correctly. I think changing steal time (as this patch does) is wrong. Regards, Anthony Liguori > > -- > I have a truly marvellous patch that fixes the bug which this > signature is too narrow to contain. > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Qemu-devel] x86, nops settings result in kernel crash
Alan Cox writes: > On Thu, 16 Aug 2012 14:45:15 -0400 (EDT) > Tomas Racek wrote: > >> - Original Message - >> > On Thu, Aug 16, 2012 at 09:35:12AM -0400, Tomas Racek wrote: >> > > Hi, >> > > >> > > I am writing a file system test which I execute in qemu with kernel >> > > compiled from latest git sources and running it causes this error: >> > > >> > > https://bugzilla.kernel.org/show_bug.cgi?id=45971 >> > > >> > > It works with v3.5, so I ran git bisect which pointed me to: >> > > >> > > d6250a3f12edb3a86db9598ffeca3de8b4a219e9 x86, nops: Missing break >> > > resulting in incorrect selection on Intel >> > > >> > > To be quite honest, I don't understand this stuff much but I tried >> > > to do some debugging and I figured out (I hope) that the crash is >> > > caused by setting ideal_nops to p6_nops (k8_nops was used before >> > > the break statement was added). >> > >> > Maybe I overlooked it or maybe it was implied but did you try >> > reverting >> > the patch and rerunning your test? Does it work ok then? >> > >> >> Yes, if I remove the break statement (introduced by this commit), it works >> fine. > > What version of qemu is this - do we have qemu bug here I wonder. From the cpuinfo, it's 0.15.1. That's old but not ancient. I took a brief look at the kernel code here. The default invocation of qemu presents an idealistic CPU with a very minimal feature-bit set exposed. No processor has ever existed with this feature set. We do this in order to maintain compatibility when migrating from Intel to AMD but also for legacy reasons. From the report, using '-cpu host' solves the problem. '-cpu host' exposes most of the host CPUID to the guest. That said, QEMU really doesn't do anything differently depending on what feature bits are exposed to the guest. So my guess is that the odd combination of CPUID bits that are exposed to the guest is confusing the kernel. Can you post dmesg from the host kernel? Perhaps there's instruction emulation failing in the host KVM?
That would manifest in strange behavior in the guest. Regards, Anthony Liguori > > Alan
Re: [Qemu-devel] [PATCH v8] kvm: notify host when the guest is panicked
Marcelo Tosatti writes: > On Tue, Aug 14, 2012 at 02:35:34PM -0500, Anthony Liguori wrote: >> Marcelo Tosatti writes: >> >> > On Tue, Aug 14, 2012 at 01:53:01PM -0500, Anthony Liguori wrote: >> >> Marcelo Tosatti writes: >> >> >> >> > On Tue, Aug 14, 2012 at 05:55:54PM +0300, Yan Vugenfirer wrote: >> >> >> >> >> >> On Aug 14, 2012, at 1:42 PM, Jan Kiszka wrote: >> >> >> >> >> >> > On 2012-08-14 10:56, Daniel P. Berrange wrote: >> >> >> >> On Mon, Aug 13, 2012 at 03:21:32PM -0300, Marcelo Tosatti wrote: >> >> >> >>> On Wed, Aug 08, 2012 at 10:43:01AM +0800, Wen Congyang wrote: >> >> >> >>>> We can know the guest is panicked when the guest runs on xen. >> >> >> >>>> But we do not have such feature on kvm. >> >> >> >>>> >> >> >> >>>> Another purpose of this feature is: management app(for example: >> >> >> >>>> libvirt) can do auto dump when the guest is panicked. If >> >> >> >>>> management >> >> >> >>>> app does not do auto dump, the guest's user can do dump by hand if >> >> >> >>>> he sees the guest is panicked. >> >> >> >>>> >> >> >> >>>> We have three solutions to implement this feature: >> >> >> >>>> 1. use vmcall >> >> >> >>>> 2. use I/O port >> >> >> >>>> 3. use virtio-serial. >> >> >> >>>> >> >> >> >>>> We have decided to avoid touching hypervisor. The reason why I >> >> >> >>>> choose >> >> >> >>>> choose the I/O port is: >> >> >> >>>> 1. it is easier to implememt >> >> >> >>>> 2. it does not depend any virtual device >> >> >> >>>> 3. it can work when starting the kernel >> >> >> >>> >> >> >> >>> How about searching for the "Kernel panic - not syncing" string >> >> >> >>> in the guests serial output? Say libvirtd could take an action upon >> >> >> >>> that? >> >> >> >> >> >> >> >> No, this is not satisfactory. It depends on the guest OS being >> >> >> >> configured to use the serial port for console output which we >> >> >> >> cannot mandate, since it may well be required for other purposes. 
>> >> >> > >> >> >> Please don't forget Windows guests, there is no console and no "Kernel >> >> >> Panic" string ;) >> >> >> >> >> >> What I used for debugging purposes on Windows guest is to register a >> >> >> bugcheck callback in virtio-net driver and write 1 to VIRTIO_PCI_ISR >> >> >> register. >> >> >> >> >> >> Yan. >> >> > >> >> > Considering whether a "panic-device" should cover other OSes is also \ >> > >> >> > something to consider. Even for Linux, is "panic" the only case which >> >> > should be reported via the mechanism? What about oopses without panic? >> >> > >> >> > Is the mechanism general enough for supporting new events, etc. >> >> >> >> Hi, >> >> >> >> I think this discussion is gone of the deep end. >> >> >> >> Forget about !x86 platforms. They have their own way to do this sort of >> >> thing. >> > >> > The panic function in kernel/panic.c has the following options, which >> > appear to be arch independent, on panic: >> > >> > - reboot >> > - blink >> >> Not sure the semantics of blink but that might be a good place for a >> pvops hook. >> >> > >> > None are paravirtual interfaces however. >> > >> >> Think of this feature like a status LED on a motherboard. These >> >> are very common and usually controlled by IO ports. >> >> >> >> We're simply reserving a "status LED" for the guest to indicate that it >> >> has paniced. Let's not over engineer this. >> > >> > My concern is that you end up wi
Re: [Qemu-devel] [PATCH v8] kvm: notify host when the guest is panicked
Marcelo Tosatti writes: > On Tue, Aug 14, 2012 at 01:53:01PM -0500, Anthony Liguori wrote: >> Marcelo Tosatti writes: >> >> > On Tue, Aug 14, 2012 at 05:55:54PM +0300, Yan Vugenfirer wrote: >> >> >> >> On Aug 14, 2012, at 1:42 PM, Jan Kiszka wrote: >> >> >> >> > On 2012-08-14 10:56, Daniel P. Berrange wrote: >> >> >> On Mon, Aug 13, 2012 at 03:21:32PM -0300, Marcelo Tosatti wrote: >> >> >>> On Wed, Aug 08, 2012 at 10:43:01AM +0800, Wen Congyang wrote: >> >> >>>> We can know the guest is panicked when the guest runs on xen. >> >> >>>> But we do not have such feature on kvm. >> >> >>>> >> >> >>>> Another purpose of this feature is: management app(for example: >> >> >>>> libvirt) can do auto dump when the guest is panicked. If management >> >> >>>> app does not do auto dump, the guest's user can do dump by hand if >> >> >>>> he sees the guest is panicked. >> >> >>>> >> >> >>>> We have three solutions to implement this feature: >> >> >>>> 1. use vmcall >> >> >>>> 2. use I/O port >> >> >>>> 3. use virtio-serial. >> >> >>>> >> >> >>>> We have decided to avoid touching hypervisor. The reason why I choose >> >> >>>> choose the I/O port is: >> >> >>>> 1. it is easier to implememt >> >> >>>> 2. it does not depend any virtual device >> >> >>>> 3. it can work when starting the kernel >> >> >>> >> >> >>> How about searching for the "Kernel panic - not syncing" string >> >> >>> in the guests serial output? Say libvirtd could take an action upon >> >> >>> that? >> >> >> >> >> >> No, this is not satisfactory. It depends on the guest OS being >> >> >> configured to use the serial port for console output which we >> >> >> cannot mandate, since it may well be required for other purposes. >> >> > >> >> Please don't forget Windows guests, there is no console and no "Kernel >> >> Panic" string ;) >> >> >> >> What I used for debugging purposes on Windows guest is to register a >> >> bugcheck callback in virtio-net driver and write 1 to VIRTIO_PCI_ISR >> >> register. >> >> >> >> Yan. 
>> > >> > Considering whether a "panic-device" should cover other OSes is also \ > >> > something to consider. Even for Linux, is "panic" the only case which >> > should be reported via the mechanism? What about oopses without panic? >> > >> > Is the mechanism general enough for supporting new events, etc. >> >> Hi, >> >> I think this discussion is gone of the deep end. >> >> Forget about !x86 platforms. They have their own way to do this sort of >> thing. > > The panic function in kernel/panic.c has the following options, which > appear to be arch independent, on panic: > > - reboot > - blink Not sure the semantics of blink but that might be a good place for a pvops hook. > > None are paravirtual interfaces however. > >> Think of this feature like a status LED on a motherboard. These >> are very common and usually controlled by IO ports. >> >> We're simply reserving a "status LED" for the guest to indicate that it >> has paniced. Let's not over engineer this. > > My concern is that you end up with state that is dependant on x86. > > Subject: [PATCH v8 3/6] add a new runstate: RUN_STATE_GUEST_PANICKED > > Having the ability to stop/restart the guest (and even introducing a > new VM runstate) is more than a status LED analogy. I must admit, I don't know why a new runstate is necessary/useful. The kernel shouldn't have to care about the difference between a halted guest and a panicked guest. That level of information belongs in userspace IMHO. > Can this new infrastructure be used by other architectures? I guess I don't understand why the kernel side of this isn't anything more than a paravirt op hook that does a single outb() with the remaining logic handled 100% in QEMU. > Do you consider allowing support for Windows as overengineering? I don't think there is a way to hook BSOD on Windows so attempting to engineer something that works with Windows seems odd, no? 
Regards, Anthony Liguori > >> Regards, >> >> Anthony Liguori >> >> > >> >> >> >> > Well, we have more than a single serial port, even when leaving >> >> > virtio-serial aside... >> >> > >> >> > Jan >> >> > >> >> > -- >> >> > Siemens AG, Corporate Technology, CT RTC ITP SDP-DE >> >> > Corporate Competence Center Embedded Linux >> >> > -- >> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in >> >> > the body of a message to majord...@vger.kernel.org >> >> > More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH v8] kvm: notify host when the guest is panicked
Marcelo Tosatti writes: > On Tue, Aug 14, 2012 at 05:55:54PM +0300, Yan Vugenfirer wrote: >> >> On Aug 14, 2012, at 1:42 PM, Jan Kiszka wrote: >> >> > On 2012-08-14 10:56, Daniel P. Berrange wrote: >> >> On Mon, Aug 13, 2012 at 03:21:32PM -0300, Marcelo Tosatti wrote: >> >>> On Wed, Aug 08, 2012 at 10:43:01AM +0800, Wen Congyang wrote: >> >>>> We can know the guest is panicked when the guest runs on xen. >> >>>> But we do not have such feature on kvm. >> >>>> >> >>>> Another purpose of this feature is: management app(for example: >> >>>> libvirt) can do auto dump when the guest is panicked. If management >> >>>> app does not do auto dump, the guest's user can do dump by hand if >> >>>> he sees the guest is panicked. >> >>>> >> >>>> We have three solutions to implement this feature: >> >>>> 1. use vmcall >> >>>> 2. use I/O port >> >>>> 3. use virtio-serial. >> >>>> >> >>>> We have decided to avoid touching hypervisor. The reason why I choose >> >>>> choose the I/O port is: >> >>>> 1. it is easier to implememt >> >>>> 2. it does not depend any virtual device >> >>>> 3. it can work when starting the kernel >> >>> >> >>> How about searching for the "Kernel panic - not syncing" string >> >>> in the guests serial output? Say libvirtd could take an action upon >> >>> that? >> >> >> >> No, this is not satisfactory. It depends on the guest OS being >> >> configured to use the serial port for console output which we >> >> cannot mandate, since it may well be required for other purposes. >> > >> Please don't forget Windows guests, there is no console and no "Kernel >> Panic" string ;) >> >> What I used for debugging purposes on Windows guest is to register a >> bugcheck callback in virtio-net driver and write 1 to VIRTIO_PCI_ISR >> register. >> >> Yan. > > Considering whether a "panic-device" should cover other OSes is also > something to consider. Even for Linux, is "panic" the only case which > should be reported via the mechanism? What about oopses without panic? 
> > Is the mechanism general enough for supporting new events, etc. Hi, I think this discussion has gone off the deep end. Forget about !x86 platforms. They have their own way to do this sort of thing. Think of this feature like a status LED on a motherboard. These are very common and usually controlled by IO ports. We're simply reserving a "status LED" for the guest to indicate that it has panicked. Let's not over-engineer this. Regards, Anthony Liguori > >> >> > Well, we have more than a single serial port, even when leaving >> > virtio-serial aside... >> > >> > Jan >> > >> > -- >> > Siemens AG, Corporate Technology, CT RTC ITP SDP-DE >> > Corporate Competence Center Embedded Linux >> > -- >> > To unsubscribe from this list: send the line "unsubscribe kvm" in >> > the body of a message to majord...@vger.kernel.org >> > More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH v7.5] kvm: notify host when the guest is panicked
Sasha Levin writes: > On 07/22/2012 09:14 PM, Anthony Liguori wrote: >> Sasha Levin writes: >> >>> On 07/21/2012 10:44 AM, Wen Congyang wrote: >>>> We can know the guest is panicked when the guest runs on xen. >>>> But we do not have such feature on kvm. >>>> >>>> Another purpose of this feature is: management app(for example: >>>> libvirt) can do auto dump when the guest is panicked. If management >>>> app does not do auto dump, the guest's user can do dump by hand if >>>> he sees the guest is panicked. >>>> >>>> We have three solutions to implement this feature: >>>> 1. use vmcall >>>> 2. use I/O port >>>> 3. use virtio-serial. >>>> >>>> We have decided to avoid touching hypervisor. The reason why I choose >>>> choose the I/O port is: >>>> 1. it is easier to implememt >>>> 2. it does not depend any virtual device >>>> 3. it can work when starting the kernel >>> >>> Was the option of implementing a virtio-watchdog driver considered? >>> >>> You're basically re-implementing a watchdog, a guest-host interface and a >>> set of protocols for guest-host communications. >>> >>> Why can't we re-use everything we have now, push a virtio watchdog >>> driver into drivers/watchdog/, and gain a more complete solution to >>> detecting hangs inside the guest. >> >> The purpose of virtio is not to reinvent every possible type of device. >> There are plenty of hardware watchdogs that are very suitable to be used >> for this purpose. QEMU implements quite a few already. >> >> Watchdogs are not performance sensitive so there's no point in using >> virtio. > > The issue here is not performance, but the adding of a brand new > guest-host interface. We have: 1) Virtio--this is our preferred PV interface. It needs PCI to be fully initialized and probably will live as a module. 2) Hypercalls--this a secondary PV interface but is available very early. It's terminated in kvm.ko which means it can only operate on things that are logically part of the CPU and/or APIC complex. 
This patch introduces a third interface which is available early like hypercalls but not necessarily terminated in kvm.ko. That means it can have a broader scope in functionality than (2). We could just as well use a hypercall and have multiple commands issued to that hypercall as a convention and add a new exit type to KVM that sent that specific hypercall to userspace for processing. But a PIO operation already has this behavior and requires no changes to kvm.ko. > virtio-rng isn't performance sensitive either, yet it was implemented > using virtio so there wouldn't be yet another interface to communicate > between guest and host. There isn't really an obvious discrete RNG that is widely supported. > This patch goes ahead to add a "arch pv features" interface using > ioports, without any idea what it might be used for beyond this > watchdog. It's not a watchdog--it's the opposite of a watchdog. You know such a thing already exists in the kernel, right? S390 has had a hypercall like this for years. Regards, Anthony Liguori
Re: [Qemu-devel] [PATCH v7] kvm: notify host when the guest is panicked
Sasha Levin writes: > On 07/22/2012 10:19 PM, Sasha Levin wrote: >> On 07/22/2012 09:22 PM, Anthony Liguori wrote: >>> Sasha Levin writes: >>> >>>> On 07/21/2012 09:12 AM, Wen Congyang wrote: >>>>> +#define KVM_PV_PORT (0x505UL) >>>>> + >>>>> #ifdef __KERNEL__ >>>>> #include >>>>> >>>>> @@ -221,6 +223,11 @@ static inline void kvm_disable_steal_time(void) >>>>> } >>>>> #endif >>>>> >>>>> +static inline unsigned int kvm_arch_pv_features(void) >>>>> +{ >>>>> + return inl(KVM_PV_PORT); >>>>> +} >>>>> + >>>> >>>> Why is this safe? >>>> >>>> I'm not sure you can just pick any ioport you'd like and use it. >>> >>> There are three ways I/O ports get used on a PC: >>> >>> 1) Platform devices >>> - This is well defined since the vast majority of platform devices are >>>implemented within a single chip. If you're emulating an i440fx >>>chipset, the PIIX4 spec has an exhaustive list. >>> >>> 2) PCI devices >>> - Typically, PCI only allocates ports starting at 0x0d00 to avoid >>>conflicts with ISA devices. >>> >>> 3) ISA devices >>> - ISA uses subtractive decoding so any ISA device can access. In >>>theory, an ISA device could attempt to use port 0x0505 but it's >>>unlikely. In a modern guest, there aren't really any ISA devices being >>>added either. >>> >>> So yes, picking port 0x0505 is safe for something like this (as long as >>> you check to make sure that you really are under KVM). >> >> Is there anything that actually prevents me from using PCI ports lower than >> 0x0d00? As you said in (3), ISA isn't really used anymore (nor is >> implemented by lkvm for example), so placing PCI below 0x0d00 might even >> make sense in that case. >> >> Furthermore, I can place one of these brand new virtio-mmio devices which >> got introduced recently wherever I want right now - Having a device that >> uses 0x505 would cause a pretty non-obvious failure mode. 
>> >> Either way, If we are going to grab an ioport, then: >> >> - It should be documented well somewhere in Documentation/virt/kvm >> - It should go through request_region() to actually claim those ioports. >> - It should fail gracefully if that port is taken for some reason, instead >> of not even checking it. >> > Out of curiosity I tested that, and apparently lkvm has no problem allocating > virtio-pci devices in that range: > > sh-4.2# pwd > /sys/devices/pci:00/:00:01.0 > sh-4.2# cat resource | head -n1 > 0x0500 0x05ff 0x00040101 > > This was with the commit in question applied. With all due respect, lkvm has a half-baked implementation of PCI. This is why you have to pass kernel parameters to disable ACPI and disable PCI BIOS probing. So yeah, you can do funky things in lkvm but that doesn't mean a system that emulated actual hardware would ever do that. Regards, Anthony Liguori
Re: [Qemu-devel] [PATCH v7] kvm: notify host when the guest is panicked
Sasha Levin writes: > On 07/22/2012 09:22 PM, Anthony Liguori wrote: >> Sasha Levin writes: >> >>> On 07/21/2012 09:12 AM, Wen Congyang wrote: >>>> +#define KVM_PV_PORT (0x505UL) >>>> + >>>> #ifdef __KERNEL__ >>>> #include >>>> >>>> @@ -221,6 +223,11 @@ static inline void kvm_disable_steal_time(void) >>>> } >>>> #endif >>>> >>>> +static inline unsigned int kvm_arch_pv_features(void) >>>> +{ >>>> + return inl(KVM_PV_PORT); >>>> +} >>>> + >>> >>> Why is this safe? >>> >>> I'm not sure you can just pick any ioport you'd like and use it. >> >> There are three ways I/O ports get used on a PC: >> >> 1) Platform devices >> - This is well defined since the vast majority of platform devices are >>implemented within a single chip. If you're emulating an i440fx >>chipset, the PIIX4 spec has an exhaustive list. >> >> 2) PCI devices >> - Typically, PCI only allocates ports starting at 0x0d00 to avoid >>conflicts with ISA devices. >> >> 3) ISA devices >> - ISA uses subtractive decoding so any ISA device can access. In >>theory, an ISA device could attempt to use port 0x0505 but it's >>unlikely. In a modern guest, there aren't really any ISA devices being >>added either. >> >> So yes, picking port 0x0505 is safe for something like this (as long as >> you check to make sure that you really are under KVM). > > Is there anything that actually prevents me from using PCI ports lower > than 0x0d00? As you said in (3), ISA isn't really used anymore (nor is > implemented by lkvm for example), so placing PCI below 0x0d00 might > even make sense in that case. On modern systems, the OS goes by whatever is in the ACPI table describing the PCI bus. In QEMU, we have: WordIO (ResourceProducer, MinFixed, MaxFixed, PosDecode, EntireRange, 0x, // Address Space Granularity 0x0D00, // Address Range Minimum 0x, // Address Range Maximum 0x, // Address Translation Offset 0xF300, // Address Length ,, , TypeStatic) So Linux will always use 0x0D00 -> 0x for the valid range. 
Practically speaking, you can't use anything below 0x0D00 because the PCI bus configuration registers live at 0xCF8-0xCFF. If you tried to create the region starting at 0x0500 you'd have to limit it to 0xCF8 to avoid conflicting with the PCI host controller. That's not a useful amount of space for I/O ports so that would be a pretty dumb thing to do. > Furthermore, I can place one of these brand new virtio-mmio devices > which got introduced recently wherever I want right now - Having a > device that uses 0x505 would cause a pretty non-obvious failure mode. I think you're confusing PIO with MMIO. They are separate address spaces. You could certainly argue that relying on PIO is way too architecture specific since that's only available on x86. That's a good argument but the counter is that other architectures have their own interfaces for this sort of thing. > Either way, If we are going to grab an ioport, then: > > - It should be documented well somewhere in Documentation/virt/kvm > - It should go through request_region() to actually claim those ioports. > - It should fail gracefully if that port is taken for some reason, > instead of not even checking it. I agree with the above. Regards, Anthony Liguori -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Qemu-devel] [PATCH v7] kvm: notify host when the guest is panicked
Sasha Levin writes: > On 07/21/2012 09:12 AM, Wen Congyang wrote: >> +#define KVM_PV_PORT (0x505UL) >> + >> #ifdef __KERNEL__ >> #include >> >> @@ -221,6 +223,11 @@ static inline void kvm_disable_steal_time(void) >> } >> #endif >> >> +static inline unsigned int kvm_arch_pv_features(void) >> +{ >> +return inl(KVM_PV_PORT); >> +} >> + > > Why is this safe? > > I'm not sure you can just pick any ioport you'd like and use it. There are three ways I/O ports get used on a PC: 1) Platform devices - This is well defined since the vast majority of platform devices are implemented within a single chip. If you're emulating an i440fx chipset, the PIIX4 spec has an exhaustive list. 2) PCI devices - Typically, PCI only allocates ports starting at 0x0d00 to avoid conflicts with ISA devices. 3) ISA devices - ISA uses subtractive decoding so any ISA device can access. In theory, an ISA device could attempt to use port 0x0505 but it's unlikely. In a modern guest, there aren't really any ISA devices being added either. So yes, picking port 0x0505 is safe for something like this (as long as you check to make sure that you really are under KVM). Regards, Anthony Liguori > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH v7.5] kvm: notify host when the guest is panicked
Sasha Levin writes: > On 07/21/2012 10:44 AM, Wen Congyang wrote: >> We can know the guest is panicked when the guest runs on xen. >> But we do not have such feature on kvm. >> >> Another purpose of this feature is: management app(for example: >> libvirt) can do auto dump when the guest is panicked. If management >> app does not do auto dump, the guest's user can do dump by hand if >> he sees the guest is panicked. >> >> We have three solutions to implement this feature: >> 1. use vmcall >> 2. use I/O port >> 3. use virtio-serial. >> >> We have decided to avoid touching hypervisor. The reason why I choose >> choose the I/O port is: >> 1. it is easier to implememt >> 2. it does not depend any virtual device >> 3. it can work when starting the kernel > > Was the option of implementing a virtio-watchdog driver considered? > > You're basically re-implementing a watchdog, a guest-host interface and a set > of protocols for guest-host communications. > > Why can't we re-use everything we have now, push a virtio watchdog > driver into drivers/watchdog/, and gain a more complete solution to > detecting hangs inside the guest. The purpose of virtio is not to reinvent every possible type of device. There are plenty of hardware watchdogs that are very suitable to be used for this purpose. QEMU implements quite a few already. Watchdogs are not performance sensitive so there's no point in using virtio. Regards, Anthony Liguori
Re: [PATCH RESEND 5/5] vhost-blk: Add vhost-blk support
"Michael S. Tsirkin" writes: > On Thu, Jul 19, 2012 at 08:05:42AM -0500, Anthony Liguori wrote: >> Of course, the million dollar question is why would using AIO in the >> kernel be faster than using AIO in userspace? > > Actually for me a more important question is how does it compare > with virtio-blk dataplane? I'm not even asking for a benchmark comparison. It's the same API being called from a kernel thread vs. a userspace thread. Why would there be a 60% performance difference between the two? That doesn't make any sense. There's got to be a better justification for putting this in the kernel than just that we can. I completely understand why Christoph's suggestion of submitting BIOs directly would be faster. There's no way to do that in userspace. Regards, Anthony Liguori > > -- > MST
Re: 0xB16B00B5? Really? (was Re: Move hyperv out of the drivers/staging/ directory)
On 07/19/2012 05:30 PM, KY Srinivasan wrote: -Original Message- From: Greg KH (gre...@linuxfoundation.org) [mailto:gre...@linuxfoundation.org] Sent: Thursday, July 19, 2012 6:02 PM To: KY Srinivasan Cc: Paolo Bonzini; de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; virtualizat...@lists.osdl.org Subject: Re: 0xB16B00B5? Really? (was Re: Move hyperv out of the drivers/staging/ directory) On Thu, Jul 19, 2012 at 09:22:53PM +, KY Srinivasan wrote: -Original Message- From: Greg KH (gre...@linuxfoundation.org) [mailto:gre...@linuxfoundation.org] Sent: Thursday, July 19, 2012 5:07 PM To: KY Srinivasan Cc: Paolo Bonzini; de...@linuxdriverproject.org; linux- ker...@vger.kernel.org; virtualizat...@lists.osdl.org Subject: Re: 0xB16B00B5? Really? (was Re: Move hyperv out of the drivers/staging/ directory) On Thu, Jul 19, 2012 at 02:11:47AM +, KY Srinivasan wrote: -Original Message- From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On Behalf Of Paolo Bonzini Sent: Friday, July 13, 2012 6:23 AM To: KY Srinivasan Cc: Greg KH; de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; virtualizat...@lists.osdl.org Subject: 0xB16B00B5? Really? (was Re: Move hyperv out of the drivers/staging/ directory) On 04/10/2011 21:34, Greg KH wrote: diff --git a/drivers/staging/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h similarity index 99% rename from drivers/staging/hv/hyperv_vmbus.h rename to drivers/hv/hyperv_vmbus.h index 3d2d836..8261cb6 100644 --- a/drivers/staging/hv/hyperv_vmbus.h +++ b/drivers/hv/hyperv_vmbus.h @@ -28,8 +28,7 @@ #include #include #include - -#include "hyperv.h" +#include /* * The below CPUID leaves are present if VersionAndFeatures.HypervisorPresent git's rename detection snips away this gem: +#define HV_LINUX_GUEST_ID_LO 0x +#define HV_LINUX_GUEST_ID_HI 0xB16B00B5 +#define HV_LINUX_GUEST_ID (((u64)HV_LINUX_GUEST_ID_HI << 32) | \ + HV_LINUX_GUEST_ID_LO) Someone was trying to be funny, I guess.
KY, I suppose you have access to Hyper-V code or can ask someone who does. Is this signature actually used in the Hyper-V host code? Paolo, As I noted earlier, this is just a guest ID that needs to be registered with the hypervisor. Thanks for reporting this issue and on behalf of Microsoft, I would like to apologize for this offensive string. I have submitted a patch to fix this issue. You only changed it to be in decimal, you did not change the id at all. Is there some reason why you can not change it? You said there was a reserved range of ids that could be used, perhaps just pick another one? What is the valid range that can be used here? Greg, As you know, this ID has been in use for a long time now. While the hypervisor does not interpret the guest ID that is registered, I am not sure what dependencies there might be on this value. Could you please go find out the answer to this? That is easier said than done. I have sent emails out asking this very question and I have not received a definitive answer yet. Not knowing if and when I can get a definitive answer here, I chose the least risky approach in my patch. If, as you originally stated, there is a range of values we can use, then we should probably use another one, right? On the Windows side this ID namespace is managed well. However on the Linux side, we have really had this current ID in use for almost five years now. I am not aware of any pool of IDs available for Linux usage except that Linux IDs be distinct from the guest IDs in use by MSFT operating systems. If I were to change the guest ID, I would probably want to comply with the MSFT guidance on constructing these IDs (although not all fields may be relevant for Linux). Presumably, Hyper-V can deal with unexpected values here, no? Otherwise, it wouldn't be future-proof against new types of guests. So worst case scenario, Hyper-V disables optimizations on Linux guests that report the new ID until they patch Hyper-V to know about the new ID.
That seems like a reasonable trade off to me. I'm sure there's sufficient incentive to patch Hyper-V for this at Microsoft... Regards, Anthony Liguori Regards, K. Y
Re: [PATCH RESEND 5/5] vhost-blk: Add vhost-blk support
*vq, > + struct virtio_blk_outhdr *hdr, > + u16 head, u16 out, u16 in, > + struct file *file) > +{ > + struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev); > + struct iovec *iov = &vq->iov[BLK_HDR + 1]; > + loff_t offset = hdr->sector << 9; > + struct vhost_blk_req *req; > + u64 nr_vecs; > + int ret = 0; > + u8 status; > + > + if (hdr->type == VIRTIO_BLK_T_IN || hdr->type == VIRTIO_BLK_T_GET_ID) > + nr_vecs = in - 1; > + else > + nr_vecs = out - 1; > + > + req = &blk->reqs[head]; > + req->head = head; > + req->status = blk->vq.iov[nr_vecs + 1].iov_base; > + > + switch (hdr->type) { > + case VIRTIO_BLK_T_OUT: > + ret = vhost_blk_io_submit(blk, file, req, iov, nr_vecs, offset, > + IOCB_CMD_PWRITEV); > + break; > + case VIRTIO_BLK_T_IN: > + ret = vhost_blk_io_submit(blk, file, req, iov, nr_vecs, offset, > + IOCB_CMD_PREADV); > + break; > + case VIRTIO_BLK_T_FLUSH: > + ret = vfs_fsync(file, 1); > + status = ret < 0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK; > + ret = vhost_blk_set_status(blk, req->status, status); > + if (!ret) > + vhost_add_used_and_signal(&blk->dev, vq, head, ret); > + break; > + case VIRTIO_BLK_T_GET_ID: > + /* TODO: need a real ID string */ > + ret = snprintf(vq->iov[BLK_HDR + 1].iov_base, > +VIRTIO_BLK_ID_BYTES, "VHOST-BLK-DISK"); > + status = ret < 0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK; > + ret = vhost_blk_set_status(blk, req->status, status); > + if (!ret) > + vhost_add_used_and_signal(&blk->dev, vq, head, > + VIRTIO_BLK_ID_BYTES); > + break; > + default: > + pr_warn("Unsupported request type %d\n", hdr->type); > + vhost_discard_vq_desc(vq, 1); > + ret = -EFAULT; > + break; > + } There doesn't appear to be any error handling in the event that vhost_blk_io_submit fails. It would appear that you leak the ring queue entry since you never push anything onto the used queue. I think you need to handle io_submit() failing too with EAGAIN. 
Relying on min nr_events == queue_size seems like a dangerous assumption to me, particularly since queue_size tends to be very large and max_nr_events is a fixed-size global pool. To properly handle EAGAIN, you effectively need to implement flow control and back off reading the virtqueue until you can submit requests again. Of course, the million dollar question is why would using AIO in the kernel be faster than using AIO in userspace? When using eventfd, there is no heavyweight exit on the notify path. It should just be the difference between scheduling a kernel thread vs scheduling a userspace thread. There's simply no way that that's a 60% difference in performance. Regards, Anthony Liguori
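[To make the back-off idea concrete, here is a rough userspace model of the flow control being described. `fake_io_submit` is a stub standing in for io_submit() against a fixed-size AIO pool; all names and the pool size are illustrative, not vhost-blk code.]

```c
#include <errno.h>

/* Stand-in for io_submit(): fails with EAGAIN once the fixed-size
 * AIO event pool is exhausted. */
#define FAKE_AIO_POOL 4
static int fake_in_flight;

static int fake_io_submit(void)
{
    if (fake_in_flight >= FAKE_AIO_POOL)
        return -EAGAIN;
    fake_in_flight++;
    return 0;
}

/* Drain as many virtqueue requests as the pool allows. On EAGAIN, stop
 * reading the ring and return, leaving the remaining descriptors queued
 * so they can be retried from the completion path, instead of dropping
 * them (which is how a ring entry gets leaked). */
static int submit_pending(int pending, int *submitted)
{
    *submitted = 0;
    while (pending > 0) {
        if (fake_io_submit() == -EAGAIN)
            return -EAGAIN; /* back off; descriptors stay on the ring */
        (*submitted)++;
        pending--;
    }
    return 0;
}
```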
Re: current mainline ide doesn't like qemu/kvm (or vice versa)
Alan Cox wrote: On Fri, 8 Feb 2008 16:25:08 +0100 Christoph Hellwig <[EMAIL PROTECTED]> wrote: When trying to put some stress on qemu by running the xfs testsuite I get the following: debian:~/xfs-cmds/xfstests# sh check [ 438.166822] SGI XFS with ACLs, security attributes, realtime, large block numbers, no debug enabled [ 438.185557] SGI XFS Quota Management subsystem [ 438.193150] hdb: task_no_data_intr: status=0x41 { DriveReady Error } [ 438.194018] hdb: task_no_data_intr: error=0x04 { DriveStatusError } [ 438.194195] ide: failed opcode was: 0x9d Drive aborted the command. That's all correct. and after that the kernel seems to hang. Qemu is emulating a piix3 device, and using the piix driver. This is on a pretty old kvm (version 28) because newer ones don't even compile. kvm-28 is ancient and there have been IDE emulation fixes since. I am not aware of issues building KVM either. The problems you reported a month ago on kvm-devel have been fixed. Old Qemu is not really a credible IDE emulation. The chances are that as with all the libata bugs I close for qemu the problem is qemu. Please file a bug with the qemu people. Agreed, this is very likely to be a QEMU issue (that may have already been fixed). Regards, Anthony Liguori
Re: [kvm-devel] KVM binary incompatibility
Stephen Hemminger wrote: I notice that recent KVM is incompatible with older versions. Using a KVM image created on 2.6.24 will crash on 2.6.25 (or vice versa). It appears that Ubuntu Hardy has incorporated the 2.6.25 update even though it claims to be 2.6.24. This isn't intentional. What is the guest and how does it crash? I've been using the same image for most of KVM's development life cycle without having issues. Regards, Anthony Liguori This is reproducible on Intel (64bit) kernel. Was this intentional? Is it documented?
Re: [kvm-devel] [PATCH 3/8] SVM: add module parameter to disable NestedPaging
Alexey Eremenko wrote: Generally I see no problem with it. But at least for NPT I don't see a reason why someone should want to disable it on a VM basis (as far as it works stable). Avi, what do you think? 1. This is more user-friendly and easier-to-find I'm inclined to disagree. Why add an additional thing for people to tune if they really shouldn't need to? Once NPT/EPT support is stable, is there any reason to want to disable it? Regards, Anthony Liguori 2. A lot of people would like to view (and Demo) it side-by-side. 3. Useful for BETA-testing of KVM.
Re: [kvm-devel] [PATCH 4/8] X86: export information about NPT to generic x86 code
Anthony Liguori wrote: Joerg Roedel wrote: The generic x86 code has to know if the specific implementation uses Nested Paging. In the generic code Nested Paging is called Hardware Assisted Paging (HAP) to avoid confusion with (future) HAP implementations of other vendors. This patch exports the availability of HAP to the generic x86 code. Signed-off-by: Joerg Roedel <[EMAIL PROTECTED]> --- arch/x86/kvm/svm.c |7 +++ arch/x86/kvm/vmx.c |7 +++ include/asm-x86/kvm_host.h |2 ++ 3 files changed, 16 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 2e718ff..d0bfdd8 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1678,6 +1678,11 @@ static bool svm_cpu_has_accelerated_tpr(void) return false; } +static bool svm_hap_enabled(void) +{ +return npt_enabled; +} + To help with bisecting, you should probably return false here until the patch that actually implements NPT support. Otherwise, the 7th patch in this series breaks KVM for SVM. Ignore this, you're already doing the right thing :-) Regards, Anthony Liguori Regards, Anthony Liguori static struct kvm_x86_ops svm_x86_ops = { .cpu_has_kvm_support = has_svm, .disabled_by_bios = is_disabled, @@ -1734,6 +1739,8 @@ static struct kvm_x86_ops svm_x86_ops = { .inject_pending_vectors = do_interrupt_requests, .set_tss_addr = svm_set_tss_addr, + +.hap_enabled = svm_hap_enabled, }; static int __init svm_init(void) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 00a00e4..8feb775 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2631,6 +2631,11 @@ static void __init vmx_check_processor_compat(void *rtn) } } +static bool vmx_hap_enabled(void) +{ +return false; +} + static struct kvm_x86_ops vmx_x86_ops = { .cpu_has_kvm_support = cpu_has_kvm_support, .disabled_by_bios = vmx_disabled_by_bios, @@ -2688,6 +2693,8 @@ static struct kvm_x86_ops vmx_x86_ops = { .inject_pending_vectors = do_interrupt_requests, .set_tss_addr = vmx_set_tss_addr, + +.hap_enabled = vmx_hap_enabled, 
}; static int __init vmx_init(void) diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h index 67ae307..45a9d05 100644 --- a/include/asm-x86/kvm_host.h +++ b/include/asm-x86/kvm_host.h @@ -392,6 +392,8 @@ struct kvm_x86_ops { struct kvm_run *run); int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); + +bool (*hap_enabled)(void); }; extern struct kvm_x86_ops *kvm_x86_ops;
Re: [kvm-devel] [PATCH 4/8] X86: export information about NPT to generic x86 code
Joerg Roedel wrote: The generic x86 code has to know if the specific implementation uses Nested Paging. In the generic code Nested Paging is called Hardware Assisted Paging (HAP) to avoid confusion with (future) HAP implementations of other vendors. This patch exports the availability of HAP to the generic x86 code. Signed-off-by: Joerg Roedel <[EMAIL PROTECTED]> --- arch/x86/kvm/svm.c |7 +++ arch/x86/kvm/vmx.c |7 +++ include/asm-x86/kvm_host.h |2 ++ 3 files changed, 16 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 2e718ff..d0bfdd8 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1678,6 +1678,11 @@ static bool svm_cpu_has_accelerated_tpr(void) return false; } +static bool svm_hap_enabled(void) +{ + return npt_enabled; +} + To help with bisecting, you should probably return false here until the patch that actually implements NPT support. Otherwise, the 7th patch in this series breaks KVM for SVM. Regards, Anthony Liguori static struct kvm_x86_ops svm_x86_ops = { .cpu_has_kvm_support = has_svm, .disabled_by_bios = is_disabled, @@ -1734,6 +1739,8 @@ static struct kvm_x86_ops svm_x86_ops = { .inject_pending_vectors = do_interrupt_requests, .set_tss_addr = svm_set_tss_addr, + + .hap_enabled = svm_hap_enabled, }; static int __init svm_init(void) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 00a00e4..8feb775 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2631,6 +2631,11 @@ static void __init vmx_check_processor_compat(void *rtn) } } +static bool vmx_hap_enabled(void) +{ + return false; +} + static struct kvm_x86_ops vmx_x86_ops = { .cpu_has_kvm_support = cpu_has_kvm_support, .disabled_by_bios = vmx_disabled_by_bios, @@ -2688,6 +2693,8 @@ static struct kvm_x86_ops vmx_x86_ops = { .inject_pending_vectors = do_interrupt_requests, .set_tss_addr = vmx_set_tss_addr, + + .hap_enabled = vmx_hap_enabled, }; static int __init vmx_init(void) diff --git a/include/asm-x86/kvm_host.h 
b/include/asm-x86/kvm_host.h index 67ae307..45a9d05 100644 --- a/include/asm-x86/kvm_host.h +++ b/include/asm-x86/kvm_host.h @@ -392,6 +392,8 @@ struct kvm_x86_ops { struct kvm_run *run); int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); + + bool (*hap_enabled)(void); }; extern struct kvm_x86_ops *kvm_x86_ops;
Re: [kvm-devel] [PATCH 3/8] SVM: add module parameter to disable Nested Paging
Joerg Roedel wrote: To disable the use of the Nested Paging feature even if it is available in hardware this patch adds a module parameter. Nested Paging can be disabled by passing npt=off to the kvm_amd module. Signed-off-by: Joerg Roedel <[EMAIL PROTECTED]> --- arch/x86/kvm/svm.c |8 1 files changed, 8 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 49bb57a..2e718ff 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -48,6 +48,9 @@ MODULE_LICENSE("GPL"); #define SVM_DEATURE_SVML (1 << 2) static bool npt_enabled = false; +static char *npt = "on"; + +module_param(npt, charp, S_IRUGO); This would probably be better as an integer. Then we don't have to do nasty things like implicitly cast a literal to a char *. Regards, Anthony Liguori static void kvm_reput_irq(struct vcpu_svm *svm); @@ -415,6 +418,11 @@ static __init int svm_hardware_setup(void) if (!svm_has(SVM_FEATURE_NPT)) npt_enabled = false; + if (npt_enabled && strncmp(npt, "off", 3) == 0) { + printk(KERN_INFO "kvm: Nested Paging disabled\n"); + npt_enabled = false; + } + if (npt_enabled) printk(KERN_INFO "kvm: Nested Paging enabled\n");
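[The integer-parameter form being suggested could look something like the sketch below. `resolve_npt` is a hypothetical stand-in for the decision in svm_hardware_setup(); the real change would simply be `static int npt = 1; module_param(npt, int, S_IRUGO);` with no strncmp needed.]

```c
/* npt=0 on the module command line disables Nested Paging even when the
 * hardware supports it; hardware without NPT keeps it off regardless. */
static int npt_param = 1; /* would be: module_param(npt, int, S_IRUGO); */

static int resolve_npt(int hw_has_npt)
{
    return hw_has_npt && npt_param; /* no string comparison required */
}
```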
Re: [kvm-devel] [PATCH][RFC] SVM: Add Support for Nested Paging in AMD Fam16 CPUs
Joerg Roedel wrote: Hi, here is the first release of patches for KVM to support the Nested Paging (NPT) feature of AMD QuadCore CPUs for comments and public testing. This feature improves the guest performance significantly. I measured an improvement of around 17% using kernbench in my first tests. This patch series is basically tested with Linux guests (32 bit legacy paging, 32 bit PAE paging and 64 bit Long Mode). Also tested with Windows Vista 32 bit and 64 bit. All these guests ran successfully with these patches. The patch series only enables NPT for 64 bit Linux hosts at the moment. Please give these patches a good and deep testing. I hope we have this patchset ready for merging soon. A quick sniff test and things look pretty good. I was able to start running the install CDs for 32-bit and 64-bit Ubuntu, 32-bit OpenSuSE, 64-bit Fedora, and 32-bit Win2k8. I'll do a more thorough run of kvm-test on Monday when I have a better connection to my machine. Nice work! Regards, Anthony Liguori Joerg Here is the diffstat: arch/x86/kvm/mmu.c | 81 +++--- arch/x86/kvm/mmu.h |6 +++ arch/x86/kvm/svm.c | 94 +-- arch/x86/kvm/vmx.c |7 +++ arch/x86/kvm/x86.c |1 + include/asm-x86/kvm_host.h |4 ++ 6 files changed, 182 insertions(+), 11 deletions(-)
Re: [PATCH 0/2] dm-band: The I/O bandwidth controller: Overview
Hi, I believe this work is very important, especially in the context of virtual machines. I think it would be more useful, though, implemented in the context of the IO scheduler. Since we already support a notion of IO priority, it seems reasonable to add a notion of an IO cap. Regards, Anthony Liguori Ryo Tsuruta wrote: Hi everyone, I'm happy to announce that I've implemented a Block I/O bandwidth controller. The controller is designed to be of use in a cgroup or virtual machine environment. The current approach is that the controller is implemented as a device-mapper driver. What's dm-band all about? Dm-band is an I/O bandwidth controller implemented as a device-mapper driver. Several jobs using the same physical device have to share the bandwidth of the device. Dm-band gives bandwidth to each job according to its weight, which each job can set its own value to. At this time, a job is a group of processes with the same pid or pgrp or uid. There is also a plan to make it support cgroup. A job can also be a virtual machine such as KVM or Xen.

[ASCII diagram, garbled in this copy: jobs (cgroup A, cgroup B, the others; pid X, pid Y, the others) map onto band groups (including a default group) inside the band devices band1 and band2, which in turn map onto the physical partitions sdb1 and sdb2.]

How dm-band works. Every band device has one band group, which by default is called the default group. Band devices can also have extra band groups in them. Each band group has a job to support and a weight. Proportional to the weight, dm-band gives tokens to the group. A group passes on I/O requests that its job issues to the underlying layer so long as it has tokens left, while requests are blocked if there aren't any tokens left in the group.
One token is consumed each time the group passes on a request. Dm-band will refill groups with tokens once all of the groups that have requests on a given physical device use up their tokens. With this approach, a job running on a band group with a large weight is guaranteed to be able to issue a large number of I/O requests.

Getting started
===============

The following is a brief description of how to control the I/O bandwidth of disks. In this description, we'll take one disk with two partitions as an example target. You can also check the manual at Document/device-mapper/band.txt of the linux kernel source tree for more information.

Create and map band devices
---------------------------

Create two band devices "band1" and "band2" and map them to "/dev/sda1" and "/dev/sda2" respectively.

# echo "0 `blockdev --getsize /dev/sda1` band /dev/sda1 1" | dmsetup create band1
# echo "0 `blockdev --getsize /dev/sda2` band /dev/sda2 1" | dmsetup create band2

If the commands are successful then the device files "/dev/mapper/band1" and "/dev/mapper/band2" will have been created.

Bandwidth control
-----------------

In this example weights of 40 and 10 will be assigned to "band1" and "band2" respectively. This is done using the following commands:

# dmsetup message band1 0 weight 40
# dmsetup message band2 0 weight 10

After these commands, "band1" can use 80% --- 40/(40+10)*100 --- of the bandwidth of the physical disk "/dev/sda" while "band2" can use 20%.

Additional bandwidth control
----------------------------

In this example two extra band groups are created on "band1". The first group consists of all the processes with user-id 1000 and the second group consists of all the processes with user-id 2000. Their weights are 30 and 20 respectively. Firstly the band group type of "band1" is set to "user". Then, the user-id 1000 and 2000 groups are attached to "band1". Finally, weights are assigned to the user-id 1000 and 2000 groups.
# dmsetup message band1 0 type user
# dmsetup message band1 0 attach 1000
# dmsetup message band1 0 attach 2000
# dmsetup message band1 0 weight 1000:30
# dmsetup message band1 0 weight 2000:20

Now the processes in the user-id 1000 group can use 30% --- 30/(30+20+40+10)*100 --- of the bandwidth of the physical disk.
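[The weighted token scheme described above can be modeled in a few lines of C. This is a toy model, not dm-band code: the refill total of 100 tokens and the struct layout are made up for illustration, chosen so the 40/10 weights from the example come out as the 80%/20% split quoted in the text.]

```c
#define REFILL_TOTAL 100

struct band_group {
    int weight;
    int tokens;
};

/* Once every group with pending I/O has run dry, each group is refilled
 * in proportion to its weight. */
static void refill(struct band_group *grps, int n)
{
    int total_weight = 0, i;

    for (i = 0; i < n; i++)
        total_weight += grps[i].weight;
    for (i = 0; i < n; i++)
        grps[i].tokens = REFILL_TOTAL * grps[i].weight / total_weight;
}

/* One token is consumed per request passed down; a group with no tokens
 * left must block until the next refill. Returns 1 if the request may
 * pass, 0 if it must wait. */
static int try_issue(struct band_group *g)
{
    if (g->tokens <= 0)
        return 0;
    g->tokens--;
    return 1;
}
```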
Re: [PATCH] xen: relax signature check
Jeremy Fitzhardinge wrote: Bodo Eggert wrote: Not BUG_ON(memcmp(xen_start_info->magic, "xen-3.", 6) != 0); ? I don't think Xen version 32 will be compatible ... It had better be; if it loads the kernel, it should present a xen-3 compatible ABI. If xen-32.0 should be compatible then wouldn't xen-24.0 be compatible too? I think the point was that you should either be checking for 'xen-3.x' or something more general that would accept anything >= xen-3.0. Regards, Anthony Liguori But this is just a sanity check to make sure things are basically OK; BUG_ON is hardly nice error reporting (not that there's much else we can do at that point). J
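[The "something more general that would accept anything >= xen-3.0" variant could be written roughly like this. Illustrative helper only, not the in-kernel check; `xen_magic_ok` is a hypothetical name.]

```c
#include <stdlib.h>
#include <string.h>

/* Accept any "xen-<major>.<...>" magic with major >= 3, rather than a
 * fixed "xen-3." prefix; a hypothetical "xen-32.0" then also passes,
 * on the theory that anything loading the kernel presents a xen-3
 * compatible ABI. */
static int xen_magic_ok(const char *magic)
{
    if (strncmp(magic, "xen-", 4) != 0)
        return 0;
    return atoi(magic + 4) >= 3;
}
```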
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure (v2)
Amit Shah wrote: * Anthony Liguori wrote: This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. This doesn't work right for SVM. It keeps looping indefinitely; on a kvm_stat run, I get about 230,000 light vm exits per second, with the hypercall never returning to the guest. ... What are you using to issue the hypercall? Regards, Anthony Liguori diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c index 729f1cd..d09a9f5 100644 --- a/drivers/kvm/svm.c +++ b/drivers/kvm/svm.c @@ -476,7 +476,8 @@ static void init_vmcb(struct vmcb *vmcb) INTERCEPT_DR5_MASK | INTERCEPT_DR7_MASK; - control->intercept_exceptions = 1 << PF_VECTOR; + control->intercept_exceptions = (1 << PF_VECTOR) | + (1 << UD_VECTOR); control->intercept = (1ULL << INTERCEPT_INTR) | @@ -970,6 +971,17 @@ static int pf_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) return 0; } +static int ud_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) +{ + int er; + + er = emulate_instruction(&svm->vcpu, kvm_run, 0, 0); + if (er != EMULATE_DONE) + inject_ud(&svm->vcpu); + + return 1; +} + static int nm_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { svm->vmcb->control.intercept_exceptions &= ~(1 << NM_VECTOR); @@ -1036,7 +1048,8 @@ static int vmmcall_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { svm->next_rip = svm->vmcb->save.rip + 3; skip_emulated_instruction(&svm->vcpu); - return kvm_hypercall(&svm->vcpu, kvm_run); + kvm_emulate_hypercall(&svm->vcpu); + return 1; } static int invalid_op_interception(struct vcpu_svm *svm, @@ -1232,6 +1245,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm, [SVM_EXIT_WRITE_DR3] = emulate_on_interception, [SVM_EXIT_WRITE_DR5]= emulate_on_interception, [SVM_EXIT_WRITE_DR7]= 
emulate_on_interception, + [SVM_EXIT_EXCP_BASE + UD_VECTOR]= ud_interception, [SVM_EXIT_EXCP_BASE + PF_VECTOR]= pf_interception, [SVM_EXIT_EXCP_BASE + NM_VECTOR]= nm_interception, [SVM_EXIT_INTR] = nop_on_interception, @@ -1664,7 +1678,6 @@ svm_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) hypercall[0] = 0x0f; hypercall[1] = 0x01; hypercall[2] = 0xd9; - hypercall[3] = 0xc3; } ... +/* This instruction is vmcall. On non-VT architectures, it will generate a + * trap that we will then rewrite to the appropriate instruction. */ -#define __NR_hypercalls0 +#define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1" .. which never happens.
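[For reference, the two encodings in play are Intel's vmcall (0f 01 c1), which KVM_HYPERCALL emits, and AMD's vmmcall (0f 01 d9), which svm_patch_hypercall writes. The lazy-rewrite idea amounts to the toy sketch below; `patch_hypercall` is illustrative, not the KVM emulator code.]

```c
#include <stdint.h>
#include <string.h>

static const uint8_t VMCALL[3]  = { 0x0f, 0x01, 0xc1 }; /* Intel */
static const uint8_t VMMCALL[3] = { 0x0f, 0x01, 0xd9 }; /* AMD */

/* On SVM, vmcall raises #UD; the #UD handler is expected to rewrite the
 * faulting instruction to vmmcall and retry. If the rewrite never
 * happens, the guest re-executes vmcall and loops on #UD forever. */
static int patch_hypercall(uint8_t *insn)
{
    if (memcmp(insn, VMCALL, 3) != 0)
        return -1;            /* not a hypercall instruction */
    memcpy(insn, VMMCALL, 3); /* lazily rewrite for SVM */
    return 0;
}
```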
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Avi Kivity wrote: rx and tx are closely related. You rarely have one without the other. In fact, a tuned implementation should have zero kicks or interrupts for bulk transfers. The rx interrupt on the host will process new tx descriptors and fill the guest's rx queue; the guest's transmit function can also check the receive queue. I don't know if that's achievable for Linux guests currently, but we should aim to make it possible. ATM, the net driver does a pretty good job of disabling kicks/interrupts unless they are needed. Checking for rx on tx and vice versa is a good idea and could further help there. I'll give it a try this week. Another point is that virtio still has a lot of leading zeros in its mileage counter. We need to keep things flexible and learn from others as much as possible, especially when talking about the ABI. Yes, after thinking about it over holiday, I agree that we should at least introduce a virtio-pci feature bitmask. I'm not inclined to attempt to define a hypercall ABI or anything like that right now but having the feature bitmask will at least make it possible to do such a thing in the future. I'm wary of introducing the notion of hypercalls to this device because it makes the device VMM specific. Maybe we could have the device provide an option ROM that was treated as the device "BIOS" that we could use for kicking and interrupt acking? Any idea of how that would map to Windows? Are there real PCI devices that use the option ROM space to provide what's essentially firmware? Unfortunately, I don't think an option ROM BIOS would map well to other architectures. The BIOS wouldn't work even on x86 because it isn't mapped to the guest address space (at least not consistently), and doesn't know the guest's programming model (16, 32, or 64-bits? segmented or flat?) Xen uses a hypercall page to abstract these details out. However, I'm not proposing that.
Simply indicate that we support hypercalls, and use some layer below to actually send them. It is the responsibility of this layer to detect if hypercalls are present and how to call them. Hey, I think the best place for it is in paravirt_ops. We can even patch the hypercall instruction inline, and the driver doesn't need to know about it. Yes, paravirt_ops is attractive for abstracting the hypercall calling mechanism but it's still necessary to figure out how hypercalls would be identified. I think it would be necessary to define a virtio-specific hypercall space and use the virtio device ID to claim subspaces. For instance, the hypercall number could be (virtio_devid << 16) | (call number). How that translates into a hypercall would then be part of the paravirt_ops abstraction. In KVM, we may have a single virtio hypercall where we pass the virtio hypercall number as one of the arguments or something like that. Not much of an argument, I know. wrt. number of queues, 8 queues will consume 32 bytes of pci space if all you store is the ring pfn. You also at least need a num argument which takes you to 48 or 64 depending on whether you care about strange formatting. 8 queues may not be enough either. Eric and I have discussed whether the 9p virtio device should support multiple mounts per-virtio device and if so, whether each one should have its own queue. Any device that supports this sort of multiplexing will very quickly start using a lot of queues. Make it appear as a pci function? (though my feeling is that multiple mounts should be different devices; we can then hotplug mountpoints). We may run out of PCI slots though :-/ Then we can start selling virtio extension chassis. :-) Do you know if there is a hard limit on the number of devices on a PCI bus? My concern was that it was limited by something stupid like an 8-bit identifier.
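[The (virtio_devid << 16) | (call number) carve-up mentioned above would look something like the macros below. These are hypothetical names sketching a possible numbering scheme, not an existing ABI.]

```c
#include <stdint.h>

/* The virtio device ID claims the upper 16 bits of the hypercall number;
 * the per-device call number occupies the lower 16. */
#define VIRTIO_HCALL(devid, call) \
    (((uint32_t)(devid) << 16) | ((uint32_t)(call) & 0xffff))
#define VIRTIO_HCALL_DEVID(nr) ((uint16_t)((nr) >> 16))
#define VIRTIO_HCALL_CALL(nr)  ((uint16_t)((nr) & 0xffff))
```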
Regards, Anthony Liguori - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Avi Kivity wrote: Anthony Liguori wrote: Well please propose the virtio API first and then I'll adjust the PCI ABI. I don't want to build things into the ABI that we never actually end up using in virtio :-) Move ->kick() to virtio_driver. Then on each kick, all queues have to be checked for processing? What devices do you expect this would help? I believe Xen networking uses the same event channel for both rx and tx, so in effect they're using this model. Long time since I looked though. I would have to look, but since rx/tx are rather independent actions, I'm not sure that you would really save that much. You still end up doing the same number of kicks unless I'm missing something. I was thinking more along the lines that a hypercall-based device would certainly be implemented in-kernel whereas the current device is naturally implemented in userspace. We can simply use a different device for in-kernel drivers than for userspace drivers. Where the device is implemented is an implementation detail that should be hidden from the guest; isn't that one of the strengths of virtualization? Two examples: a file-based block device implemented in qemu gives you fancy file formats with encryption and compression, while the same device implemented in the kernel gives you a low-overhead path directly to a zillion-disk SAN volume. Or a user-level network device capable of running with the slirp stack and no permissions vs. the kernel device running copyless most of the time and using a dma engine for the rest but requiring you to be good friends with the admin. The user should expect zero reconfigurations moving a VM from one model to the other. I'm wary of introducing the notion of hypercalls to this device because it makes the device VMM specific. Maybe we could have the device provide an option ROM that was treated as the device "BIOS" that we could use for kicking and interrupt acking? Any idea of how that would map to Windows?
Are there real PCI devices that use the option ROM space to provide what's essentially firmware? Unfortunately, I don't think an option ROM BIOS would map well to other architectures. None of the PCI devices currently work like that in QEMU. It would be very hard to make a device that worked this way because the order in which values are written matters a whole lot. For instance, if you wrote the status register before the queue information, the driver could get into a funky state. I assume you're talking about restore? Isn't that atomic? If you're doing restore by passing the PCI config blob to a registered routine, then sure, but that doesn't seem much better to me than just having the device generate that blob in the first place (which is what we have today). I was assuming that you would want to use the existing PIO/MMIO handlers to do restore by rewriting the config as if the guest were writing it. Not much of an argument, I know. wrt. number of queues, 8 queues will consume 32 bytes of pci space if all you store is the ring pfn. You also at least need a num argument which takes you to 48 or 64 depending on whether you care about strange formatting. 8 queues may not be enough either. Eric and I have discussed whether the 9p virtio device should support multiple mounts per-virtio device and if so, whether each one should have its own queue. Any device that supports this sort of multiplexing will very quickly start using a lot of queues. Make it appear as a pci function? (though my feeling is that multiple mounts should be different devices; we can then hotplug mountpoints). We may run out of PCI slots though :-/ I think most types of hardware have some notion of a selector or mode. Take a look at the LSI adapter or even VGA. True. They aren't fun to use, though.
I don't think they're really any worse :-) Regards, Anthony Liguori
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Avi Kivity wrote: Anthony Liguori wrote: Avi Kivity wrote: Anthony Liguori wrote: This is a PCI device that implements a transport for virtio. It allows virtio devices to be used by QEMU based VMMs like KVM or Xen. + +/* the notify function used when creating a virt queue */ +static void vp_notify(struct virtqueue *vq) +{ +struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev); +struct virtio_pci_vq_info *info = vq->priv; + +/* we write the queue's selector into the notification register to + * signal the other end */ +iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY); +} This means we can't kick multiple queues with one exit. There is no interface in virtio currently to batch multiple queue notifications so the only way one could do this AFAICT is to use a timer to delay the notifications. Were you thinking of something else? No. We can change virtio though, so let's have a flexible ABI. Well please propose the virtio API first and then I'll adjust the PCI ABI. I don't want to build things into the ABI that we never actually end up using in virtio :-) I'd also like to see a hypercall-capable version of this (but that can wait). That can be a different device. That means the user has to select which device to expose. With feature bits, the hypervisor advertises both pio and hypercalls, the guest picks whatever it wants. I was thinking more along the lines that a hypercall-based device would certainly be implemented in-kernel whereas the current device is naturally implemented in userspace. We can simply use a different device for in-kernel drivers than for userspace drivers. There's no point at all in doing a hypercall based userspace device IMHO. I don't think so. A vmexit is required to lower the IRQ line. It may be possible to do something clever like set a shared memory value that's checked on every vmexit. I think it's very unlikely that it's worth it though. Why so unlikely? Not all workloads will have good batching. 
It's pretty invasive. I think a more paravirt device that expected an edge triggered interrupt would be a better solution for those types of devices. +return ret; +} + +/* the config->find_vq() implementation */ +static struct virtqueue *vp_find_vq(struct virtio_device *vdev, unsigned index, +bool (*callback)(struct virtqueue *vq)) +{ +struct virtio_pci_device *vp_dev = to_vp_device(vdev); +struct virtio_pci_vq_info *info; +struct virtqueue *vq; +int err; +u16 num; + +/* Select the queue we're interested in */ +iowrite16(index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL); I would really like to see this implemented as pci config space, with no tricks like multiplexing several virtqueues on one register. Something like the PCI BARs where you have all the register numbers allocated statically to queues. My first implementation did that. I switched to using a selector because it reduces the amount of PCI config space used and does not limit the number of queues defined by the ABI as much. But... it's tricky, and it's nonstandard. With pci config, you can do live migration by shipping the pci config space to the other side. With the special iospace, you need to encode/decode it. None of the PCI devices currently work like that in QEMU. It would be very hard to make a device that worked this way because the order in which values are written matters a whole lot. For instance, if you wrote the status register before the queue information, the driver could get into a funky state. We'll still need save/restore routines for virtio devices. I don't really see this as a problem since we do this for every other device. Not much of an argument, I know. wrt. number of queues, 8 queues will consume 32 bytes of pci space if all you store is the ring pfn. You also at least need a num argument which takes you to 48 or 64 depending on whether you care about strange formatting. 8 queues may not be enough either.
Eric and I have discussed whether the 9p virtio device should support multiple mounts per-virtio device and if so, whether each one should have its own queue. Any device that supports this sort of multiplexing will very quickly start using a lot of queues. I think most types of hardware have some notion of a selector or mode. Take a look at the LSI adapter or even VGA. Regards, Anthony Liguori
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Avi Kivity wrote: Anthony Liguori wrote: This is a PCI device that implements a transport for virtio. It allows virtio devices to be used by QEMU based VMMs like KVM or Xen. + +/* the notify function used when creating a virt queue */ +static void vp_notify(struct virtqueue *vq) +{ +struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev); +struct virtio_pci_vq_info *info = vq->priv; + +/* we write the queue's selector into the notification register to + * signal the other end */ +iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY); +} This means we can't kick multiple queues with one exit. There is no interface in virtio currently to batch multiple queue notifications so the only way one could do this AFAICT is to use a timer to delay the notifications. Were you thinking of something else? I'd also like to see a hypercall-capable version of this (but that can wait). That can be a different device. + +/* A small wrapper to also acknowledge the interrupt when it's handled. + * I really need an EIO hook for the vring so I can ack the interrupt once we + * know that we'll be handling the IRQ but before we invoke the callback since + * the callback may notify the host which results in the host attempting to + * raise an interrupt that we would then mask once we acknowledged the + * interrupt. */ +static irqreturn_t vp_interrupt(int irq, void *opaque) +{ +struct virtio_pci_device *vp_dev = opaque; +struct virtio_pci_vq_info *info; +irqreturn_t ret = IRQ_NONE; +u8 isr; + +/* reading the ISR has the effect of also clearing it so it's very + * important to save off the value. */ +isr = ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR); Can this be implemented via shared memory? We're exiting now on every interrupt. I don't think so. A vmexit is required to lower the IRQ line. It may be possible to do something clever like set a shared memory value that's checked on every vmexit. I think it's very unlikely that it's worth it though. 
+return ret; +} + +/* the config->find_vq() implementation */ +static struct virtqueue *vp_find_vq(struct virtio_device *vdev, unsigned index, +bool (*callback)(struct virtqueue *vq)) +{ +struct virtio_pci_device *vp_dev = to_vp_device(vdev); +struct virtio_pci_vq_info *info; +struct virtqueue *vq; +int err; +u16 num; + +/* Select the queue we're interested in */ +iowrite16(index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL); I would really like to see this implemented as pci config space, with no tricks like multiplexing several virtqueues on one register. Something like the PCI BARs where you have all the register numbers allocated statically to queues. My first implementation did that. I switched to using a selector because it reduces the amount of PCI config space used and does not limit the number of queues defined by the ABI as much. + +/* Check if queue is either not available or already active. */ +num = ioread16(vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NUM); +if (!num || ioread32(vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN)) +return ERR_PTR(-ENOENT); + +/* allocate and fill out our structure that represents an active + * queue */ +info = kmalloc(sizeof(struct virtio_pci_vq_info), GFP_KERNEL); +if (!info) +return ERR_PTR(-ENOMEM); + +info->queue_index = index; +info->num = num; + +/* determine the memory needed for the queue and provide the memory + * location to the host */ +info->n_pages = DIV_ROUND_UP(vring_size(num), PAGE_SIZE); +info->pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, + get_order(info->n_pages)); +if (info->pages == NULL) { +err = -ENOMEM; +goto out_info; +} + +/* FIXME: is this sufficient for info->n_pages > 1?
*/ +info->queue = kmap(info->pages); +if (info->queue == NULL) { +err = -ENOMEM; +goto out_alloc_pages; +} + +/* activate the queue */ +iowrite32(page_to_pfn(info->pages), + vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); + +/* create the vring */ +vq = vring_new_virtqueue(info->num, vdev, info->queue, + vp_notify, callback); +if (!vq) { +err = -ENOMEM; +goto out_activate_queue; +} + +vq->priv = info; +info->vq = vq; + +spin_lock(&vp_dev->lock); +list_add(&info->node, &vp_dev->virtqueues); +spin_unlock(&vp_dev->lock); + Is this run only on init? If so the lock isn't needed. Yes, it's also not strictly needed on cleanup I think. I left it there though for clarity. I can remove. Regards, Anthony Liguori
Re: [PATCH] virtio config_ops refactoring
Rusty Russell wrote: On Saturday 10 November 2007 10:45:38 Anthony Liguori wrote: The problem is the ABI. We can either require that PCI configuration values are accessed with natural instructions, or it makes very little sense to use the PCI configuration space for virtio configuration information. To me it seems logical and simplest to allow any accesses, lay out your structure and emulate it in the obvious way. Okay, I've got some updates that I'm going to send out now to the PCI virtio driver but I'll also change it to switch over to a memory layout. It's not the best PCI ABI but I can certainly live with it. You can put the configuration information somewhere else, but the PCI config space seems the logical place. If we're treating the PCI config space as an opaque memory blob, instead of as distinct registers, I think it makes more sense to just put it in memory. In the backend, I have to treat it as a memory blob anyway and using the PCI config space just means that I have to write all the various PIO handlers for the different access sizes. It's much more elegant in my mind just to have the driver provide some memory that the host fills out. Thanks for all the review, Anthony Liguori Either virtio config looks like a shared memory area (as lguest currently implements it), or it looks like hardware registers (like virtio-pci implements it). After thinking about it for a while, I don't think the two can be reconciled. There are subtle differences between the two that can't be hidden in the virtio interface. For instance, in the PCI model, you get notified when values are read/written whereas in the lguest model, you don't and need explicit status bits. No. You need those status bits even if you have your register model, otherwise you can't tell when the configuration as a whole is stable. Consider the feature bits. Worse, consider extending the feature bits beyond 32 bits. 
(We will have to deal with dynamic configuration changes in future; I was planning on using some status bits. But it's pretty clear that it's going to require an explicit ack of some form.) Hope that clarifies, Rusty.
Re: [PATCH] virtio config_ops refactoring
Rusty Russell wrote: On Friday 09 November 2007 09:33:04 Anthony Liguori wrote: switch (addr) { case VIRTIO_BLK_CONFIG_MAX_SEG: return vdev->max_seg & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SEG + 1: return (vdev->max_seg >> 8) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SEG + 2: return (vdev->max_seg >> 16) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SEG + 3: return (vdev->max_seg >> 24) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE: return vdev->max_size & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE + 1: return (vdev->max_size >> 8) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE + 2: return (vdev->max_size >> 16) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE + 3: return (vdev->max_size >> 24) & 0xFF; ... struct virtio_blk_config { uint32_t max_seg, max_size; }; ... struct virtio_blk_config conf = { vdev->max_seg, vdev->max_size }; return ((unsigned char *)&conf)[addr]; (Which strongly implies our headers should expose that nominal struct, rather than numerical constants). The problem is the ABI. We can either require that PCI configuration values are accessed with natural instructions, or it makes very little sense to use the PCI configuration space for virtio configuration information. If we really can't find a way to do this (and I think my current implementation is the best compromise since it hides this from everything else), then I think I'll switch over to just writing a PFN into a PCI configuration slot and then have that page store the virtio configuration information (much like is done with lguest). Either virtio config looks like a shared memory area (as lguest currently implements it), or it looks like hardware registers (like virtio-pci implements it). After thinking about it for a while, I don't think the two can be reconciled. There are subtle differences between the two that can't be hidden in the virtio interface. For instance, in the PCI model, you get notified when values are read/written whereas in the lguest model, you don't and need explicit status bits. 
If you're very against the switch() magic, then I'll switch over to just using a shared memory area. Regards, Anthony Liguori Rusty.
Re: [Lguest] [PATCH] virtio config_ops refactoring
Dor Laor wrote: ron minnich wrote: Hi, I'm sorry, I've been stuck on other things (NFS RDMA anyone?) and missed part of this discussion. Is it really the case that operations on virtio devices will involve outl/inl etc.? What's the problem with them? Except for the kick event it's not performance critical and the difference between pio vmexit and hypercall exit is very small. If you're a nutty guy who's interested in creating an absolutely minimal VMM to run exotic paravirtual guests on massive clusters, then requiring PIO implies that you have to have an instruction decoder, which goes against the earlier goal ;-) Regards, Anthony Liguori I don't know about problems in other architectures, maybe mmio is better? Dor. Apologies in advance for my failure to pay attention. thanks ron ___ Lguest mailing list [EMAIL PROTECTED] https://ozlabs.org/mailman/listinfo/lguest
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Dor Laor wrote: Anthony Liguori wrote: This is a PCI device that implements a transport for virtio. It allows virtio devices to be used by QEMU based VMMs like KVM or Xen. While it's a little premature, we can start thinking of irq path improvements. The current patch acks a private isr and afterwards apic eoi will also be hit since it's a level triggered irq. This means 2 vmexits per irq. We can start with regular pci irqs and move afterwards to msi. Some other ugly hack options [we'd better use msi]: - Read the eoi directly from apic and save the first private isr ack I must admit, that I don't know a whole lot about interrupt delivery. If we can avoid the private ISR ack then that would certainly be a good thing to do! I think that would involve adding another bit to the virtqueues to indicate whether or not there is work to be handled. It's really just moving the ISR to shared memory so that there's no penalty for accessing it. Regards, Anthony Liguori - Convert the specific irq line to edge triggered and don't share it What do you guys think? +/* A small wrapper to also acknowledge the interrupt when it's handled. + * I really need an EIO hook for the vring so I can ack the interrupt once we + * know that we'll be handling the IRQ but before we invoke the callback since + * the callback may notify the host which results in the host attempting to + * raise an interrupt that we would then mask once we acknowledged the + * interrupt. */ +static irqreturn_t vp_interrupt(int irq, void *opaque) +{ +struct virtio_pci_device *vp_dev = opaque; +struct virtio_pci_vq_info *info; +irqreturn_t ret = IRQ_NONE; +u8 isr; + +/* reading the ISR has the effect of also clearing it so it's very + * important to save off the value.
*/ +isr = ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR); + +/* It's definitely not us if the ISR was not high */ +if (!isr) +return IRQ_NONE; + +spin_lock(&vp_dev->lock); +list_for_each_entry(info, &vp_dev->virtqueues, node) { +if (vring_interrupt(irq, info->vq) == IRQ_HANDLED) +ret = IRQ_HANDLED; +} +spin_unlock(&vp_dev->lock); + +return ret; +}
Re: [Lguest] [PATCH] virtio config_ops refactoring
ron minnich wrote: Hi, I'm sorry, I've been stuck on other things (NFS RDMA anyone?) and missed part of this discussion. Is it really the case that operations on virtio devices will involve outl/inl etc.? No, this is just for the PCI virtio transport. lguest's virtio transport uses hypercalls and shared memory. Regards, Anthony Liguori Apologies in advance for my failure to pay attention. thanks ron
Re: [PATCH] virtio config_ops refactoring
Rusty Russell wrote: On Thursday 08 November 2007 13:41:16 Anthony Liguori wrote: Rusty Russell wrote: On Thursday 08 November 2007 04:30:50 Anthony Liguori wrote: I would prefer that the virtio API not expose a little endian standard. I'm currently converting config->get() ops to ioreadXX depending on the size which already does the endianness conversion for me so this just messes things up. I think it's better to let the backend deal with endianness since it's trivial to handle for both the PCI backend and the lguest backend (lguest doesn't need to do any endianness conversion). -ETOOMUCHMAGIC. We should either expose all the XX interfaces (but this isn't a high-speed interface, so let's not) or not "sometimes" convert endianness. Getting surprises because a field happens to be packed into 4 bytes is counter-intuitive. Then I think it's necessary to expose the XX interfaces. Otherwise, the backend has to deal with doing all register operations at a per-byte granularity which adds a whole lot of complexity on a per-device basis (as opposed to a little complexity once in the transport layer). Huh? Take a look at the drivers, this simply isn't true. Do you have evidence that it will be true later? I'm a bit confused. So right now, the higher level virtio functions do endianness conversion. I really want to make sure that if a guest tries to read a 4-byte PCI config field, that it does so using an "outl" instruction so that in my QEMU backend, I don't have to deal with a guest reading/writing a single byte within a 4-byte configuration field. 
It's the difference between having in the PIO handler: switch (addr) { case VIRTIO_BLK_CONFIG_MAX_SEG: return vdev->max_seg; case VIRTIO_BLK_CONFIG_MAX_SIZE: return vdev->max_size; } and: switch (addr) { case VIRTIO_BLK_CONFIG_MAX_SEG: return vdev->max_seg & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SEG + 1: return (vdev->max_seg >> 8) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SEG + 2: return (vdev->max_seg >> 16) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SEG + 3: return (vdev->max_seg >> 24) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE: return vdev->max_size & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE + 1: return (vdev->max_size >> 8) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE + 2: return (vdev->max_size >> 16) & 0xFF; case VIRTIO_BLK_CONFIG_MAX_SIZE + 3: return (vdev->max_size >> 24) & 0xFF; ... } It's the host-side code I'm concerned about, not the guest-side code. I'm happy to just ignore the whole endianness conversion thing and always pass values through in the CPU bitness but it's very important to me that the PCI config registers are accessed with their natural sized instructions (just as they would with a real PCI device). Regards, Anthony Liguori Plus your code will be smaller doing a single writeb/readb loop than trying to do a switch statement. You really want to be able to rely on multi-byte atomic operations too when setting values. Otherwise, you need another register just to signal when it's okay for the device to examine any given register. You already do; the status register fills this role. For example, you can't tell what features a guest understands until it updates the status register. Hope that clarifies, Rusty.
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Arnd Bergmann wrote: On Thursday 08 November 2007, Anthony Liguori wrote: +/* A PCI device has its own struct device and so does a virtio device so + * we create a place for the virtio devices to show up in sysfs. I think it + * would make more sense for virtio to not insist on having its own device. */ +static struct device virtio_pci_root = { + .parent = NULL, + .bus_id = "virtio-pci", +}; + +/* Unique numbering for devices under the kvm root */ +static unsigned int dev_index; + ... +/* the PCI probing function */ +static int __devinit virtio_pci_probe(struct pci_dev *pci_dev, + const struct pci_device_id *id) +{ + struct virtio_pci_device *vp_dev; + int err; + + /* allocate our structure and fill it out */ + vp_dev = kzalloc(sizeof(struct virtio_pci_device), GFP_KERNEL); + if (vp_dev == NULL) + return -ENOMEM; + + vp_dev->pci_dev = pci_dev; + vp_dev->vdev.dev.parent = &virtio_pci_root; If you use vp_dev->vdev.dev.parent = &pci_dev->dev; Then there is no need for the special kvm root device, and the actual virtio device shows up in a more logical place, under where it is really (virtually) attached. They already show up underneath the PCI bus. The issue is that there are two separate 'struct device's for each virtio device. There's the PCI device (that's part of the pci_dev structure) and then there's the virtio_device one. I thought that setting the dev.parent of the virtio_device struct device would result in having two separate entries under the PCI bus directory which would be pretty confusing :-) Regards, Anthony Liguori Arnd <><
Re: Use of virtio device IDs
Gerd Hoffmann wrote: Avi Kivity wrote: You are probably better off designing something that is PV specific instead of shoehorning it in to fit a different model (at least for the things I have in mind). Well, if we design our pv devices to look like hardware, they will fit quite well. Both to the guest OS and to user's expectations. Disclaimer: Haven't looked at the virtio code much. I think we should keep the door open for both models and don't nail the virtio infrastructure to one of them. For pure pv devices I don't see the point in trying to squeeze it into the PCI model. Also s390 has no PCI, so there effectively is no way around that; we must be able to have some pure virtual bus like xenbus. I don't really agree with this assessment. There is no performance advantage to using a pure virtual bus. If you have a pure pv device that looks and acts like a PCI device, besides the obvious advantage of easy portability to other guest OSes (since everything supports PCI, but porting XenBus, even to Linux 2.4.x, was a royal pain), it is also very easy to support the device on other VMMs. For instance, the PCI device that I just posted would allow virtio devices to be used trivially with HVM on Xen. In fact, once the backends are complete and merged into QEMU, the next time Xen rebases QEMU they'll get the virtio PV-on-HVM drivers for free. To me, that's a pretty significant advantage. Uhm, well, yea. Guess you are referring to the pv-on-hvm drivers. Been there, dealt with it. What exactly do you think is messy there? IMHO the most messy thing is the boot problem. hvm bios can't deal with pv disks, so you can't boot with pv disks only. "fixed" by having the (boot) disk twice in the system, once via emulated ide, once as pv disk. Ouch. I have actually addressed this problem with a PV option rom for QEMU. I expect to get time to submit the QEMU patches by the end of the year.
See http://hg.codemonkey.ws/extboot Regards, Anthony Liguori
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Avi Kivity wrote: If a pci device is capable of dma (or issuing interrupts), it will be useless with pv pci. Hrm, I think we may be talking about different things. Are you thinking that the driver I posted allows you to do PCI pass-through over virtio? That's not what it is. The driver I posted is a virtio implementation that uses a PCI device. This lets you use virtio-blk and virtio-net under KVM. The alternative to this virtio PCI device would be a virtio transport built with hypercalls like lguest has. I chose a PCI device because it ensured that each virtio device showed up like a normal PCI device. Am I misunderstanding what you're asking about? Regards, Anthony Liguori I think that with Amit's pvdma patches you can support dma-capable devices as well without too much fuss. What is the use case you're thinking of? A semi-paravirt driver that does dma directly to a device? No, an unmodified driver that, by using clever tricks with dma_ops, can do dma directly to guest memory. See Amit's patches. In fact, why do a virtio transport at all? It can be done either with trap'n'emulate, or by directly mapping the device mmio space into the guest. (what use case are you considering? devices without interrupts and dma? pci door stoppers?)
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
Avi Kivity wrote: Anthony Liguori wrote: This is a PCI device that implements a transport for virtio. It allows virtio devices to be used by QEMU based VMMs like KVM or Xen. Didn't see support for dma. Not sure what you're expecting there. Using dma_ops in virtio_ring? I think that with Amit's pvdma patches you can support dma-capable devices as well without too much fuss. What is the use case you're thinking of? A semi-paravirt driver that does dma directly to a device? Regards, Anthony Liguori
[PATCH 3/3] virtio PCI device
This is a PCI device that implements a transport for virtio. It allows virtio devices to be used by QEMU based VMMs like KVM or Xen. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index 9e33fc4..c81e0f3 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -6,3 +6,20 @@ config VIRTIO config VIRTIO_RING bool depends on VIRTIO + +config VIRTIO_PCI + tristate "PCI driver for virtio devices (EXPERIMENTAL)" + depends on PCI && EXPERIMENTAL + select VIRTIO + select VIRTIO_RING + ---help--- + This drivers provides support for virtio based paravirtual device + drivers over PCI. This requires that your VMM has appropriate PCI + virtio backends. Most QEMU based VMMs should support these devices + (like KVM or Xen). + + Currently, the ABI is not considered stable so there is no guarantee + that this version of the driver will work with your VMM. + + If unsure, say M. + diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile index f70e409..cc84999 100644 --- a/drivers/virtio/Makefile +++ b/drivers/virtio/Makefile @@ -1,2 +1,3 @@ obj-$(CONFIG_VIRTIO) += virtio.o obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o +obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c new file mode 100644 index 000..85ae096 --- /dev/null +++ b/drivers/virtio/virtio_pci.c @@ -0,0 +1,469 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Anthony Liguori <[EMAIL PROTECTED]>"); +MODULE_DESCRIPTION("virtio-pci"); +MODULE_LICENSE("GPL"); +MODULE_VERSION("1"); + +/* Our device structure */ +struct virtio_pci_device +{ + /* the virtio device */ + struct virtio_device vdev; + /* the PCI device */ + struct pci_dev *pci_dev; + /* the IO mapping for the PCI config space */ + void *ioaddr; + + spinlock_t lock; + struct list_head virtqueues; +}; + +struct virtio_pci_vq_info +{ + /* the number of entries in 
the queue */ + int num; + /* the number of pages the device needs for the ring queue */ + int n_pages; + /* the index of the queue */ + int queue_index; + /* the struct page of the ring queue */ + struct page *pages; + /* the virtual address of the ring queue */ + void *queue; + /* a pointer to the virtqueue */ + struct virtqueue *vq; + /* the node pointer */ + struct list_head node; +}; + +/* We have to enumerate here all virtio PCI devices. */ +static struct pci_device_id virtio_pci_id_table[] = { + { 0x5002, 0x2258, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 }, /* Dummy entry */ + { 0 }, +}; + +MODULE_DEVICE_TABLE(pci, virtio_pci_id_table); + +/* A PCI device has it's own struct device and so does a virtio device so + * we create a place for the virtio devices to show up in sysfs. I think it + * would make more sense for virtio to not insist on having it's own device. */ +static struct device virtio_pci_root = { + .parent = NULL, + .bus_id = "virtio-pci", +}; + +/* Unique numbering for devices under the kvm root */ +static unsigned int dev_index; + +/* Convert a generic virtio device to our structure */ +static struct virtio_pci_device *to_vp_device(struct virtio_device *vdev) +{ + return container_of(vdev, struct virtio_pci_device, vdev); +} + +/* virtio config->feature() implementation */ +static bool vp_feature(struct virtio_device *vdev, unsigned bit) +{ + struct virtio_pci_device *vp_dev = to_vp_device(vdev); + u32 mask; + + /* Since this function is supposed to have the side effect of +* enabling a queried feature, we simulate that by doing a read +* from the host feature bitmask and then writing to the guest +* feature bitmask */ + mask = ioread32(vp_dev->ioaddr + VIRTIO_PCI_HOST_FEATURES); + if (mask & (1 << bit)) { + mask |= (1 << bit); + iowrite32(mask, vp_dev->ioaddr + VIRTIO_PCI_GUEST_FEATURES); + } + + return !!(mask & (1 << bit)); +} + +/* virtio config->get() implementation */ +static void vp_get(struct virtio_device *vdev, unsigned offset, + void *buf, 
unsigned len) +{ + struct virtio_pci_device *vp_dev = to_vp_device(vdev); + void *ioaddr = vp_dev->ioaddr + VIRTIO_PCI_CONFIG + offset; + + /* We translate appropriately sized get requests into more natural +* IO operations. These functions also take care of endianness +* conversion. */ + switch (len) { + case 1: { + u8 val; + val = ioread8(ioaddr); + memcpy(
[PATCH 2/3] Put the virtio under the virtualization menu
This patch moves virtio under the virtualization menu and changes virtio devices to not claim to only be for lguest. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/Kconfig b/drivers/Kconfig index f4076d9..d945ffc 100644 --- a/drivers/Kconfig +++ b/drivers/Kconfig @@ -93,6 +93,4 @@ source "drivers/auxdisplay/Kconfig" source "drivers/kvm/Kconfig" source "drivers/uio/Kconfig" - -source "drivers/virtio/Kconfig" endmenu diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index 4d0119e..be4b224 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -429,6 +429,7 @@ config VIRTIO_BLK tristate "Virtio block driver (EXPERIMENTAL)" depends on EXPERIMENTAL && VIRTIO ---help--- - This is the virtual block driver for lguest. Say Y or M. + This is the virtual block driver for virtio. It can be used with + lguest or QEMU based VMMs (like KVM or Xen). Say Y or M. endif # BLK_DEV diff --git a/drivers/kvm/Kconfig b/drivers/kvm/Kconfig index 6569206..ac4bcdf 100644 --- a/drivers/kvm/Kconfig +++ b/drivers/kvm/Kconfig @@ -50,5 +50,6 @@ config KVM_AMD # OK, it's a little counter-intuitive to do this, but it puts it neatly under # the virtualization menu. source drivers/lguest/Kconfig +source drivers/virtio/Kconfig endif # VIRTUALIZATION diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 86b8641..e66aec4 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -3107,6 +3107,7 @@ config VIRTIO_NET tristate "Virtio network driver (EXPERIMENTAL)" depends on EXPERIMENTAL && VIRTIO ---help--- - This is the virtual network driver for lguest. Say Y or M. + This is the virtual network driver for virtio. It can be used with + lguest or QEMU based VMMs (like KVM or Xen). Say Y or M. endif # NETDEVICES
[PATCH 1/3] Export vring functions for modules to use
This is needed for the virtio PCI device to be compiled as a module. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 0e1bf05..3f28b47 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -260,6 +260,8 @@ irqreturn_t vring_interrupt(int irq, void *_vq) return IRQ_HANDLED; } +EXPORT_SYMBOL_GPL(vring_interrupt); + static struct virtqueue_ops vring_vq_ops = { .add_buf = vring_add_buf, .get_buf = vring_get_buf, @@ -306,8 +308,12 @@ struct virtqueue *vring_new_virtqueue(unsigned int num, return &vq->vq; } +EXPORT_SYMBOL_GPL(vring_new_virtqueue); + void vring_del_virtqueue(struct virtqueue *vq) { kfree(to_vvq(vq)); } +EXPORT_SYMBOL_GPL(vring_del_virtqueue); +
[PATCH 0/3] virtio PCI driver
This patch series implements a PCI driver for virtio. This allows virtio devices (like block and network) to be used in QEMU/KVM. I'll post a very early KVM userspace backend in kvm-devel for those that are interested. This series depends on the two virtio fixes I've posted and Rusty's config_ops refactoring. I've tested with these patches on Rusty's experimental virtio tree.
Re: [PATCH] virtio config_ops refactoring
Rusty Russell wrote: On Thursday 08 November 2007 04:30:50 Anthony Liguori wrote: I would prefer that the virtio API not expose a little endian standard. I'm currently converting config->get() ops to ioreadXX depending on the size which already does the endianness conversion for me so this just messes things up. I think it's better to let the backend deal with endianness since it's trivial to handle for both the PCI backend and the lguest backend (lguest doesn't need to do any endianness conversion). -ETOOMUCHMAGIC. We should either expose all the XX interfaces (but this isn't a high-speed interface, so let's not) or not "sometimes" convert endianness. Getting surprises because a field happens to be packed into 4 bytes is counter-intuitive. Then I think it's necessary to expose the XX interfaces. Otherwise, the backend has to deal with doing all register operations at a per-byte granularity which adds a whole lot of complexity on a per-device basis (as opposed to a little complexity once in the transport layer). You really want to be able to rely on multi-byte atomic operations too when setting values. Otherwise, you need another register just to signal when it's okay for the device to examine any given register. Regards, Anthony Liguori Since your most trivial implementation is to do a byte at a time, I don't think you have a good argument on that basis either. Cheers, Rusty.
Re: [PATCH][VIRTIO] Fix vring_init() ring computations
Rusty Russell wrote: On Wednesday 07 November 2007 13:52:29 Anthony Liguori wrote: This patch fixes a typo in vring_init(). Thanks, applied. I've put it in the new, experimental virtio git tree on git.kernel.org. Hrm, perhaps you forgot to push? I don't see it in the tree although I see the config ops refactoring. Regards, Anthony Liguori Cheers, Rusty.
[PATCH] Fix used_idx wrap-around in virtio
The more_used() function compares the vq->vring.used->idx with last_used_idx. Since vq->vring.used->idx is a 16-bit integer, and last_used_idx is an unsigned int, this results in unpredictable behavior when vq->vring.used->idx wraps around. This patch corrects this by changing last_used_idx to the correct type. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 0e4baca..0e1bf05 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -53,7 +53,7 @@ struct vring_virtqueue unsigned int num_added; /* Last used index we've seen. */ - unsigned int last_used_idx; + u16 last_used_idx; /* How to notify other side. FIXME: commonalize hcalls! */ void (*notify)(struct virtqueue *vq);
Re: Use of virtio device IDs
Rusty Russell wrote: On Wednesday 07 November 2007 16:40:13 Avi Kivity wrote: Gregory Haskins wrote: but FWIW: This is a major motivation for the reason that the IOQ stuff I posted a while back used strings for device identification instead of a fixed length, centrally managed namespace like PCI vendor/dev-id. Then you can just name your device something reasonably unique (e.g. "qumranet::veth", or "ibm-pvirt-clock"). I dislike strings. They make it look as if you have a nice extensible interface, where in reality you have a poorly documented interface which leads to poor interoperability. Yes, you end up with exactly names like "qumranet::veth" and "ibm-pvirt-clock". I would recommend looking very hard at /proc, Open Firmware on a modern system, or the Xen store, to see what a lack of limitation can do to you :) FWIW, I've switched to using the PCI subsystem vendor/device IDs for virtio which Rusty suggested. I think this makes even more sense than using the main vendor/device ID since I do think that we only should use a single vendor/device ID for all virtio PCI devices and then differentiate based on the subsystem IDs. Regards, Anthony Liguori We will support non-pci for s390, but in order to support Windows and older Linux PCI is necessary. The aim is that PCI support is clean, but that we're not really tied to PCI. I think we're getting closer with the recent config changes. Cheers, Rusty.
Re: [PATCH] virtio config_ops refactoring
Rusty Russell wrote: After discussion with Anthony, the virtio config has been simplified. We lose some minor features (the virtio_net address must now be 6 bytes) but it turns out to be a wash in terms of complexity, while simplifying PCI. Hi Rusty, Thanks for posting this! It's really simplified things for me. - /** - * virtio_use_bit - helper to use a feature bit in a bitfield value. - * @dev: the virtio device - * @token: the token as returned from vdev->config->find(). - * @len: the length of the field. - * @bitnum: the bit to test. + * __virtio_config_val - get a single virtio config without feature check. + * @vdev: the virtio device + * @offset: the type to search for. + * @val: a pointer to the value to fill in. * - * If handed a NULL token, it returns false, otherwise returns bit status. - * If it's one, it sets the mirroring acknowledgement bit. */ -int virtio_use_bit(struct virtio_device *vdev, - void *token, unsigned int len, unsigned int bitnum); + * The value is endian-corrected and returned in v. */ +#define __virtio_config_val(vdev, offset, v) do { \ + BUILD_BUG_ON(sizeof(*(v)) != 1 && sizeof(*(v)) != 2 \ +&& sizeof(*(v)) != 4 && sizeof(*(v)) != 8);\ + (vdev)->config->get((vdev), (offset), (v), sizeof(*(v))); \ + switch (sizeof(*(v))) { \ + case 2: le16_to_cpus((__u16 *) v); break; \ + case 4: le32_to_cpus((__u32 *) v); break; \ + case 8: le64_to_cpus((__u64 *) v); break; \ + } \ +} while(0) I would prefer that the virtio API not expose a little endian standard. I'm currently converting config->get() ops to ioreadXX depending on the size which already does the endianness conversion for me so this just messes things up. I think it's better to let the backend deal with endianness since it's trivial to handle for both the PCI backend and the lguest backend (lguest doesn't need to do any endianness conversion). 
Regards, Anthony Liguori #endif /* __KERNEL__ */ #endif /* _LINUX_VIRTIO_CONFIG_H */ diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index ae469ae..8bf1602 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -5,16 +5,16 @@ /* The ID for virtio_net */ #define VIRTIO_ID_NET 1 -/* The bitmap of config for virtio net */ -#define VIRTIO_CONFIG_NET_F 0x40 +/* The feature bitmap for virtio net */ #define VIRTIO_NET_F_NO_CSUM 0 #define VIRTIO_NET_F_TSO4 1 #define VIRTIO_NET_F_UFO 2 #define VIRTIO_NET_F_TSO4_ECN 3 #define VIRTIO_NET_F_TSO6 4 +#define VIRTIO_NET_F_MAC 5 -/* The config defining mac address. */ -#define VIRTIO_CONFIG_NET_MAC_F 0x41 +/* The config defining mac address (6 bytes) */ +#define VIRTIO_CONFIG_NET_MAC_F 0 /* This is the first element of the scatter-gather list. If you don't * specify GSO or CSUM features, you can simply ignore the header. */
Re: Use of virtio device IDs
Rusty Russell wrote: On Wednesday 07 November 2007 16:40:13 Avi Kivity wrote: Gregory Haskins wrote: but FWIW: This is a major motivation for the reason that the IOQ stuff I posted a while back used strings for device identification instead of a fixed length, centrally managed namespace like PCI vendor/dev-id. Then you can just name your device something reasonably unique (e.g. "qumranet::veth", or "ibm-pvirt-clock"). I dislike strings. They make it look as if you have a nice extensible interface, where in reality you have a poorly documented interface which leads to poor interoperability. Yes, you end up with exactly names like "qumranet::veth" and "ibm-pvirt-clock". I would recommend looking very hard at /proc, Open Firmware on a modern system, or the Xen store, to see what a lack of limitation can do to you :) We will support non-pci for s390, but in order to support Windows and older Linux PCI is necessary. The aim is that PCI support is clean, but that we're not really tied to PCI. I think we're getting closer with the recent config changes. Yes, my main desire was to ensure that we had a clean PCI ABI that would be natural to implement on a platform like Windows. I think with the recent config_ops refactoring, we can now do that. Regards, Anthony Liguori Cheers, Rusty.
[PATCH][VIRTIO] Fix vring_init() ring computations
This patch fixes a typo in vring_init(). This happens to work today in lguest because the sizeof(struct vring_desc) is 16 and struct vring contains 3 pointers and an unsigned int so on 32-bit sizeof(struct vring_desc) == sizeof(struct vring). However, this is no longer true on 64-bit where the bug is exposed. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h index ac69e7b..5b88d21 100644 --- a/include/linux/virtio_ring.h +++ b/include/linux/virtio_ring.h @@ -92,8 +92,8 @@ static inline void vring_init(struct vring *vr, unsigned int num, void *p) { vr->num = num; vr->desc = p; - vr->avail = p + num*sizeof(struct vring); - vr->used = p + (num+1)*(sizeof(struct vring) + sizeof(__u16)); + vr->avail = p + num*sizeof(struct vring_desc); + vr->used = p + (num+1)*(sizeof(struct vring_desc) + sizeof(__u16)); } static inline unsigned vring_size(unsigned int num)
Re: virtio config_ops refactoring
Rusty Russell wrote: On Wednesday 07 November 2007 04:48:35 Anthony Liguori wrote: Semantically, find requires that a field have both a type and a length. With the exception of the VIRTQUEUE field used internally by lguest, type is always a unique identifier. Since virtqueue information is not a required part of the config space, it seems to me that type really should be treated as a unique identifier. Hi Anthony, Not sure I get this. It is a unique identifier. You need the length to handle unknown fields. It's not a unique identifier since it can be used for multiple items (like it is for virtqueue configs). find_vq also is curious in that it is stateful in its enumeration. Well, they're *all* stateful. This gives a simple method of knowing what fields the guest understands: it marks the fields as it finds them. Then it sets the status, which allows the host to know when it's completed configuration reads. But PCI device configuration is not stateful. If you care about letting the host know what features a guest understands, I think something more explicit and stateful should be used. For instance, a feature register that stores a bitmap. Otherwise, the host has to infer based on what fields the guest has read what features the guest actually supports. That seems error prone to me. I like enumerating the virtqueues: it's not necessary but it's clearer. This adds seemingly unnecessary complexity. I'd be happy for a simpler mechanism... What do you think of what I proposed? It seems simpler to me. Regards, Anthony Liguori Cheers, Rusty.
Re: Use of virtio device IDs
Anthony Liguori wrote: Hi Rusty, I've written a PCI virtio transport and noticed something strange. All current in-tree virtio devices register ID tables that match a specific device ID, but any vendor ID. This is incompatible with using PCI vendor/device IDs for virtio vendor/device IDs since vendors control what device IDs mean. A simple solution would be to assign a fixed vendor ID to all current virtio devices. This doesn't solve the problem completely though since you would create a conflict between the PCI vendor ID space and the virtio vendor ID space. The only solutions seem to be virtualizing the virtio vendor/device IDs (which is what I'm currently doing) or to mandate that the virtio vendor ID be within the PCI vendor ID space. It's probably not necessary to make the same requirement for device IDs though. There's another ugly bit in the current implementation. Right now, we would have to have every PCI vendor/device ID pair in the virtio PCI driver ID table for every virtio device. This means every time a virtio device is added to Linux, the virtio PCI driver has to be modified (assuming that each virtio device uses a unique PCI vendor/device ID) :-/ Regards, Anthony Liguori What are your thoughts? Regards, Anthony Liguori
virtio config_ops refactoring
Hi, The current virtio config_ops do not map very well to the PCI config space. Here they are: struct virtio_config_ops { void *(*find)(struct virtio_device *vdev, u8 type, unsigned *len); void (*get)(struct virtio_device *vdev, void *token, void *buf, unsigned len); void (*set)(struct virtio_device *vdev, void *token, const void *buf, unsigned len); u8 (*get_status)(struct virtio_device *vdev); void (*set_status)(struct virtio_device *vdev, u8 status); struct virtqueue *(*find_vq)(struct virtio_device *vdev, bool (*callback)(struct virtqueue *)); void (*del_vq)(struct virtqueue *vq); }; Semantically, find requires that a field have both a type and a length. With the exception of the VIRTQUEUE field used internally by lguest, type is always a unique identifier. Since virtqueue information is not a required part of the config space, it seems to me that type really should be treated as a unique identifier. find_vq also is curious in that it is stateful in it's enumeration. This adds seemingly unnecessary complexity. The result is that in my PCI virtio implementation, I have to use an opaque blob that's self describing and variable length in the PCI config space. This is not very natural at all for a PCI device! Here is how I propose changing the config_ops: struct virtio_config_ops { bool (*get)(struct virtio_device *vdev, unsigned id, void *buf, unsigned len); bool (*set)(struct virtio_device *vdev, unsigned id, const void *buf, unsigned len); u8 (*get_status)(struct virtio_device *vdev); void (*set_status)(struct virtio_device *vdev, u8 status); struct virtqueue *(*get_vq)(struct virtio_device *vdev, unsigned index, bool (*callback)(struct virtqueue *)); void (*del_vq)(struct virtqueue *vq); }; config_ops->get() returns false if the id is invalid as does ->set(). get_vq() returns NULL if the index is not a valid virtqueue (and we'll mandate that all virtqueues are in order starting at 0). 
The id space is defined by the driver itself but is guaranteed to be non-overlapping and starting from 0. For instance, the block device would have: /* The capacity (in 512-byte sectors). */ #define VIRTIO_CONFIG_BLK_F_CAPACITY 0x00 /* The maximum segment size. */ #define VIRTIO_CONFIG_BLK_F_SIZE_MAX 0x08 /* The maximum number of segments. */ #define VIRTIO_CONFIG_BLK_F_SEG_MAX 0x12 This maps very well to the PCI config space. For lguest, the actual config items would simply be packed in a single page. Since lguest uses the config space for virtqueues, lguest_device_desc will have to be changed to also contain an array of virtqueue infos. It doesn't prevent an implementation from using the id's as opaque keys though (as one might do for Xen since presumably the configuration data would be represented within xenbus). If there's agreement that this approach is better, I'll start submitting patches. Regards, Anthony Liguori
Use of virtio device IDs
Hi Rusty, I've written a PCI virtio transport and noticed something strange. All current in-tree virtio devices register ID tables that match a specific device ID, but any vendor ID. This is incompatible with using PCI vendor/device IDs for virtio vendor/device IDs since vendors control what device IDs mean. A simple solution would be to assign a fixed vendor ID to all current virtio devices. This doesn't solve the problem completely though since you would create a conflict between the PCI vendor ID space and the virtio vendor ID space. The only solutions seem to be virtualizing the virtio vendor/device IDs (which is what I'm currently doing) or to mandate that the virtio vendor ID be within the PCI vendor ID space. It's probably not necessary to make the same requirement for device IDs though. What are your thoughts? Regards, Anthony Liguori
Re: [kvm-devel] [PATCH 2/3] Refactor hypercall infrastructure (v3)
Avi Kivity wrote: Avi Kivity wrote: Avi Kivity wrote: Anthony Liguori wrote: This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. It also introduces the infrastructure to probe for hypercalls available via CPUID leaves at 0x40000000. CPUID leaf 0x40000001 should be filled out by userspace. A fall-out of this patch is that the unhandled hypercalls no longer trap to userspace. There is very little reason though to use a hypercall to communicate with userspace as PIO or MMIO can be used. There is no code in tree that uses userspace hypercalls. Surprisingly, this patch kills Windows XP (ACPI HAL). I'll try to find out why. Not trapping #UD brings things back to normal. So, Windows likes to execute undefined instructions, and we don't handle these well. Okay, vmx_inject_ud() was broken. Fixed now. Yeah, I was just about to send the patch for that. Sorry about that, I didn't even think to test Windows... Regards, Anthony Liguori
[PATCH 2/3] Refactor hypercall infrastructure (v3)
This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. It also introduces the infrastructure to probe for hypercalls available via CPUID leaves at 0x40000000. CPUID leaf 0x40000001 should be filled out by userspace. A fall-out of this patch is that the unhandled hypercalls no longer trap to userspace. There is very little reason though to use a hypercall to communicate with userspace as PIO or MMIO can be used. There is no code in tree that uses userspace hypercalls. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index ad08138..1cde572 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -46,6 +46,7 @@ #define KVM_MAX_CPUID_ENTRIES 40 #define DE_VECTOR 0 +#define UD_VECTOR 6 #define NM_VECTOR 7 #define DF_VECTOR 8 #define TS_VECTOR 10 @@ -317,9 +318,6 @@ struct kvm_vcpu { unsigned long cr0; unsigned long cr2; unsigned long cr3; - gpa_t para_state_gpa; - struct page *para_state_page; - gpa_t hypercall_gpa; unsigned long cr4; unsigned long cr8; u64 pdptrs[4]; /* pae */ @@ -622,7 +620,9 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu); int kvm_mmu_load(struct kvm_vcpu *vcpu); void kvm_mmu_unload(struct kvm_vcpu *vcpu); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu); static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code) diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index 99e4917..0629dd7 100644 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #include @@ -1383,51 +1384,61 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu) }
EXPORT_SYMBOL_GPL(kvm_emulate_halt); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run) +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) { - unsigned long nr, a0, a1, a2, a3, a4, a5, ret; + unsigned long nr, a0, a1, a2, a3, ret; kvm_x86_ops->cache_regs(vcpu); - ret = -KVM_EINVAL; -#ifdef CONFIG_X86_64 - if (is_long_mode(vcpu)) { - nr = vcpu->regs[VCPU_REGS_RAX]; - a0 = vcpu->regs[VCPU_REGS_RDI]; - a1 = vcpu->regs[VCPU_REGS_RSI]; - a2 = vcpu->regs[VCPU_REGS_RDX]; - a3 = vcpu->regs[VCPU_REGS_RCX]; - a4 = vcpu->regs[VCPU_REGS_R8]; - a5 = vcpu->regs[VCPU_REGS_R9]; - } else -#endif - { - nr = vcpu->regs[VCPU_REGS_RBX] & -1u; - a0 = vcpu->regs[VCPU_REGS_RAX] & -1u; - a1 = vcpu->regs[VCPU_REGS_RCX] & -1u; - a2 = vcpu->regs[VCPU_REGS_RDX] & -1u; - a3 = vcpu->regs[VCPU_REGS_RSI] & -1u; - a4 = vcpu->regs[VCPU_REGS_RDI] & -1u; - a5 = vcpu->regs[VCPU_REGS_RBP] & -1u; + + nr = vcpu->regs[VCPU_REGS_RAX]; + a0 = vcpu->regs[VCPU_REGS_RBX]; + a1 = vcpu->regs[VCPU_REGS_RCX]; + a2 = vcpu->regs[VCPU_REGS_RDX]; + a3 = vcpu->regs[VCPU_REGS_RSI]; + + if (!is_long_mode(vcpu)) { + nr &= 0xffffffff; + a0 &= 0xffffffff; + a1 &= 0xffffffff; + a2 &= 0xffffffff; + a3 &= 0xffffffff; } + switch (nr) { default: - run->hypercall.nr = nr; - run->hypercall.args[0] = a0; - run->hypercall.args[1] = a1; - run->hypercall.args[2] = a2; - run->hypercall.args[3] = a3; - run->hypercall.args[4] = a4; - run->hypercall.args[5] = a5; - run->hypercall.ret = ret; - run->hypercall.longmode = is_long_mode(vcpu); - kvm_x86_ops->decache_regs(vcpu); - return 0; + ret = -KVM_ENOSYS; + break; } vcpu->regs[VCPU_REGS_RAX] = ret; kvm_x86_ops->decache_regs(vcpu); - return 1; + return 0; +} +EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu) +{ + char instruction[3]; + int ret = 0; + + mutex_lock(&vcpu->kvm->lock); + + /* +* Blow out the MMU to ensure that no other VCPU has an active mapping +* to ensure that the updated hypercall appears atomically across all +* VCPUs. +*/ + kvm_mmu_zap_all(vcpu->kvm); +
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Jeremy Fitzhardinge wrote: I'm starting to lean toward just using 0x40000000. If for no other reason than the hypercall space is unsharable. Well, it could be, but it would take affirmative action on the guest's part. If there are feature bits for each supported hypercall interface, then you could have a magic MSR to select which interface you want to use now. That would allow a generic-interface-using guest to probe for the generic interface at cpuid leaf 0x40001000, use 0x40001001 to determine whether the hypercall interface is available, 0x4000100x to find the base of the magic MSRs, and write the appropriate MSR to set the desired hypercall style (and all this can be done without using vmcall, so it doesn't matter how the hypercall interface is initially established). The main thing keeping me from doing this ATM is what I perceive as lack of interest in a generic interface. I think it's also a little premature given that we don't have any features on the plate yet. However, I don't think that means that we cannot turn KVM's PV into a generic one. So here's what I propose. Let's start building the KVM PV interface on 0x40000000. That means that Xen cannot initially use it but that's okay. Once KVM-lite is merged and we have some solid features (and other guests start implementing them), we can also advertise this interface as a "generic interface" by also supporting the signature on leaf 0x40001000 and using the MSR trickery that you propose. As long as we all agree not to use 0x40001000 for now, it leaves open the possibility of having a generic interface in the future. Regards, Anthony Liguori J - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2005. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ kvm-devel mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/kvm-devel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Nakajima, Jun wrote: I don't understand the purpose of returning the max leaf. Who is that information useful for? Well, this is the key info to the user of CPUID. It tells which leaves are valid to use. Otherwise, the user cannot tell whether the results of CPUID.0x400N are valid or not (i.e. junk). BTW, this is what we are doing on native hardware (for leaf 0 or 0x80000000, for example). The fact that Xen returns 0x40000002 means it only uses 3 leaves today. Then it's just a version ID. You pretty much have to treat it as a version ID because if it returns 0x40000003 and you only know what 0x40000002 is, then you can't actually use it. I much prefer the current use of CPUID in KVM. If 0x40001000 returns the KVM signature, then 0x40001001 *must* be valid and contain a set of feature bits. If we wish to use additional CPUID leaves in the future, then we can just use a feature bit. The real benefit to us is that we can use a discontiguous set of leaves whereas the Xen approach is forced to use a linear set (at least for the result to be meaningful). I like Jeremy's suggestion of starting with 0x40001000 for KVM. Xen has an established hypercall interface and that isn't going to change. However, in the future, if other Operating Systems (like the BSDs) choose to implement the KVM paravirtualization interface, then that leaves open the possibility for Xen to also support this interface to get good performance for those OSes. It's necessary to be able to support both at once if you wish to support these interfaces without user interaction. Using CPUID.0x400N (N > 2) does not prevent Xen from doing that, either. If you use 0x40001000, 1) you need to say the leaves from 0x40000000 through 0x40001000 are all valid, OR 2) you create/fork a new/odd leaf (with 0x1000 offset) repeating the detection redundantly. Why do the leaves up to 0x40001000 have to be valid? Xen is not going to change what they have today--they can't. 
However, if down the road, they decided that since so many guests use KVM's paravirtualization interface other than Linux, there's value in supporting it, by using 0x40001000, they can. There's no tangible benefit to us to use 0x40000000. Therefore I'm inclined to lean toward making things easier for others. Again, 0x40000000 is not Xen specific. If the leaf 0x40000000 is used for any guest to detect any hypervisor, that would be a compelling benefit. For future Xen-specific features, it's safe for Xen to use other bigger leaves (like 0x40001000) because the guest starts looking at them after detection of Xen. I'm starting to lean toward just using 0x40000000. If for no other reason than the hypercall space is unsharable. Regards, Anthony Liguori Likewise if the KVM paravirtualization interface (as a kind of "open source paravirtualization interface") is detected in the generic areas (not in vendor-specific ones), any guest can check the features available without knowing which hypervisor uses which CPUID for that. Regards, Anthony Liguori And like CPUID.1, CPUID.0x40000001 returns the version number in eax, and each VMM should be able to define a number of VMM-specific features available in ebx, ecx, and edx returned (which are reserved, i.e. not used in Xen today). Suppose we knew (i.e. tested) Xen and KVM supported Linux paravirtualization, the Linux code does: 1. detect Xen or KVM using CPUID.0x40000000 2. Check the version if necessary using CPUID.0x40000001 3. Check the Linux paravirtualization features available using CPUID.0x400Y. Jun --- Intel Open Source Technology Center
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Nakajima, Jun wrote: Jeremy Fitzhardinge wrote: Nakajima, Jun wrote: The hypervisor detection mechanism is generic, and the signature returned is implementation specific. Having a list of all hypervisor signatures sounds fine to me as we detect vendor-specific processors on native hardware. And I don't expect the list is large. I'm confused about what you're proposing. I was thinking that a kernel looking for the generic hypervisor interface would check for a specific signature at some cpuid leaf, and then go about using it from there. If not, how is it supposed to detect the generic hypervisor interface? J I'm suggesting that we use CPUID.0x400Y (Y: TBD, e.g. 6) for Linux paravirtualization. The ebx, ecx and edx return the Linux paravirtualization features available on that hypervisor. Those features are defined architecturally (not VMM specific). Like CPUID.0, CPUID.0x40000000 is used to detect the hypervisor with the vendor identification string returned in ebx, edx, and ecx (as we are doing in Xen). The eax returns the max leaf (which is 0x40000002 on Xen today). I don't understand the purpose of returning the max leaf. Who is that information useful for? I like Jeremy's suggestion of starting with 0x40001000 for KVM. Xen has an established hypercall interface and that isn't going to change. However, in the future, if other Operating Systems (like the BSDs) choose to implement the KVM paravirtualization interface, then that leaves open the possibility for Xen to also support this interface to get good performance for those OSes. It's necessary to be able to support both at once if you wish to support these interfaces without user interaction. There's no tangible benefit to us to use 0x40000000. Therefore I'm inclined to lean toward making things easier for others. 
Regards, Anthony Liguori And like CPUID.1, CPUID.0x40000001 returns the version number in eax, and each VMM should be able to define a number of VMM-specific features available in ebx, ecx, and edx returned (which are reserved, i.e. not used in Xen today). Suppose we knew (i.e. tested) Xen and KVM supported Linux paravirtualization, the Linux code does: 1. detect Xen or KVM using CPUID.0x40000000 2. Check the version if necessary using CPUID.0x40000001 3. Check the Linux paravirtualization features available using CPUID.0x400Y. Jun --- Intel Open Source Technology Center
[PATCH] Refactor hypercall infrastructure (v2)
This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. It also introduces the infrastructure to probe for hypercall available via CPUID leaves 0x40001000. CPUID leaf 0x40001001 should be filled out by userspace. A fall-out of this patch is that the unhandled hypercalls no longer trap to userspace. There is very little reason though to use a hypercall to communicate with userspace as PIO or MMIO can be used. There is no code in tree that uses userspace hypercalls. Since the last patchset, I've changed the CPUID leaves to better avoid Xen's CPUID range and fixed a bug spotted by Muli in masking off hypercall arguments. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index ad08138..1cde572 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -46,6 +46,7 @@ #define KVM_MAX_CPUID_ENTRIES 40 #define DE_VECTOR 0 +#define UD_VECTOR 6 #define NM_VECTOR 7 #define DF_VECTOR 8 #define TS_VECTOR 10 @@ -317,9 +318,6 @@ struct kvm_vcpu { unsigned long cr0; unsigned long cr2; unsigned long cr3; - gpa_t para_state_gpa; - struct page *para_state_page; - gpa_t hypercall_gpa; unsigned long cr4; unsigned long cr8; u64 pdptrs[4]; /* pae */ @@ -622,7 +620,9 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu); int kvm_mmu_load(struct kvm_vcpu *vcpu); void kvm_mmu_unload(struct kvm_vcpu *vcpu); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu); static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code) diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index 99e4917..d720290 100644 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c 
@@ -39,6 +39,7 @@ #include #include #include +#include #include #include @@ -1383,51 +1384,61 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_emulate_halt); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run) +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) { - unsigned long nr, a0, a1, a2, a3, a4, a5, ret; + unsigned long nr, a0, a1, a2, a3, ret; kvm_x86_ops->cache_regs(vcpu); - ret = -KVM_EINVAL; -#ifdef CONFIG_X86_64 - if (is_long_mode(vcpu)) { - nr = vcpu->regs[VCPU_REGS_RAX]; - a0 = vcpu->regs[VCPU_REGS_RDI]; - a1 = vcpu->regs[VCPU_REGS_RSI]; - a2 = vcpu->regs[VCPU_REGS_RDX]; - a3 = vcpu->regs[VCPU_REGS_RCX]; - a4 = vcpu->regs[VCPU_REGS_R8]; - a5 = vcpu->regs[VCPU_REGS_R9]; - } else -#endif - { - nr = vcpu->regs[VCPU_REGS_RBX] & -1u; - a0 = vcpu->regs[VCPU_REGS_RAX] & -1u; - a1 = vcpu->regs[VCPU_REGS_RCX] & -1u; - a2 = vcpu->regs[VCPU_REGS_RDX] & -1u; - a3 = vcpu->regs[VCPU_REGS_RSI] & -1u; - a4 = vcpu->regs[VCPU_REGS_RDI] & -1u; - a5 = vcpu->regs[VCPU_REGS_RBP] & -1u; - } + + nr = vcpu->regs[VCPU_REGS_RAX]; + a0 = vcpu->regs[VCPU_REGS_RBX]; + a1 = vcpu->regs[VCPU_REGS_RCX]; + a2 = vcpu->regs[VCPU_REGS_RDX]; + a3 = vcpu->regs[VCPU_REGS_RSI]; + + if (!is_long_mode(vcpu)) { + nr &= 0xffffffff; + a0 &= 0xffffffff; + a1 &= 0xffffffff; + a2 &= 0xffffffff; + a3 &= 0xffffffff; + } + switch (nr) { default: - run->hypercall.nr = nr; - run->hypercall.args[0] = a0; - run->hypercall.args[1] = a1; - run->hypercall.args[2] = a2; - run->hypercall.args[3] = a3; - run->hypercall.args[4] = a4; - run->hypercall.args[5] = a5; - run->hypercall.ret = ret; - run->hypercall.longmode = is_long_mode(vcpu); - kvm_x86_ops->decache_regs(vcpu); - return 0; + ret = -KVM_ENOSYS; + break; } vcpu->regs[VCPU_REGS_RAX] = ret; kvm_x86_ops->decache_regs(vcpu); - return 1; + return 0; +} +EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu) +{ + char instruction[3]; + int ret = 0; + + mutex_lock(&vcpu->kvm->lock); + + /* +* Blow out the MMU to ensure that no other
VCPU has an active map
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Zachary Amsden wrote: On Fri, 2007-09-14 at 16:44 -0500, Anthony Liguori wrote: So then each module creates a hypercall page using this magic MSR and the hypervisor has to keep track of it so that it can appropriately change the page on migration. The page can only contain a single instruction or else it cannot be easily changed (or you have to be able to prevent the guest from being migrated while in the hypercall page). We're really talking about identical models. Instead of an MSR, the #GP is what tells the hypervisor to update the instruction. The nice thing about this is that you don't have to keep track of all the current hypercall page locations in the hypervisor. I agree, multiple hypercall pages is insane. I was thinking more of a single hypercall page, fixed in place by the hypervisor, not the kernel. Then each module can read an MSR saying what VA the hypercall page is at, and the hypervisor can simply flip one page to switch architectures. That requires a memory hole though. In KVM, we don't have a memory hole. Regards, Anthony Liguori Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Jeremy Fitzhardinge wrote: Anthony Liguori wrote: Yeah, see, the initial goal was to make it possible to use the KVM paravirtualizations on other hypervisors. However, I don't think this is really going to be possible in general so maybe it's better to just use leaf 0. I'll let others chime in before sending a new patch. Hm. Obviously you can just define a signature for "kvm-compatible hypercall interface" and make it common that way, but it gets tricky if the hypervisor supports multiple hypercall interfaces, including the kvm one. Start the kvm leaves at 0x40001000 or something? Yeah, that works with me. Regards, Anthony Liguori J - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Jeremy Fitzhardinge wrote: Anthony Liguori wrote: The whole point of using the instruction is to allow hypercalls to be used in many locations. This has the nice side effect of not requiring a central hypercall initialization routine in the guest to fetch the hypercall page. A PV driver can be completely independent of any other code provided that it restricts itself to its hypercall namespace. I see. So you take the fault, disassemble the instruction, see that it's another CPU's vmcall instruction, and then replace it with the current CPU's vmcall? Yup. Xen is currently using 0/1/2. I had thought it was only using 0/1. The intention was not to squash Xen's current CPUID usage so that it would still be possible for Xen to make use of the guest code. Can we agree that Xen won't squash leaves 3/4 or is it not worth trying to be compatible at this point? No, the point is that you're supposed to work out which hypervisor it is from the signature in leaf 0, and then the hypervisor can put anything it wants in the other leaves. Yeah, see, the initial goal was to make it possible to use the KVM paravirtualizations on other hypervisors. However, I don't think this is really going to be possible in general so maybe it's better to just use leaf 0. I'll let others chime in before sending a new patch. Regards, Anthony Liguori J
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Zachary Amsden wrote: On Fri, 2007-09-14 at 16:02 -0500, Anthony Liguori wrote: Jeremy Fitzhardinge wrote: Anthony Liguori wrote: This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. I guess it would be pretty rude/unlikely for these opcodes to get reused in other implementations... But couldn't you make the page trap instead, rather than relying on an instruction fault? The whole point of using the instruction is to allow hypercalls to be used in many locations. This has the nice side effect of not requiring a central hypercall initialization routine in the guest to fetch the hypercall page. A PV driver can be completely independent of any other code provided that it restricts itself to it's hypercall namespace. But if the instruction is architecture dependent, and you run on the wrong architecture, now you have to patch many locations at fault time, introducing some nasty runtime code / data cache overlap performance problems. Granted, they go away eventually. We're addressing that by blowing away the shadow cache and holding the big kvm lock to ensure SMP safety. Not a great thing to do from a performance perspective but the whole point of patching is that the cost is amortized. I prefer the idea of a hypercall page, but not a central initialization. Rather, a decentralized approach where PV drivers can detect using CPUID which hypervisor is present, and a common MSR shared by all hypervisors that provides the location of the hypercall page. So then each module creates a hypercall page using this magic MSR and the hypervisor has to keep track of it so that it can appropriately change the page on migration. 
The page can only contain a single instruction or else it cannot be easily changed (or you have to be able to prevent the guest from being migrated while in the hypercall page). We're really talking about identical models. Instead of an MSR, the #GP is what tells the hypervisor to update the instruction. The nice thing about this is that you don't have to keep track of all the current hypercall page locations in the hypervisor. Regards, Anthony Liguori Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
Jeremy Fitzhardinge wrote: Anthony Liguori wrote: This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. I guess it would be pretty rude/unlikely for these opcodes to get reused in other implementations... But couldn't you make the page trap instead, rather than relying on an instruction fault? The whole point of using the instruction is to allow hypercalls to be used in many locations. This has the nice side effect of not requiring a central hypercall initialization routine in the guest to fetch the hypercall page. A PV driver can be completely independent of any other code provided that it restricts itself to its hypercall namespace. It also introduces the infrastructure to probe for hypercalls available via CPUID leaves 0x40000002. CPUID leaf 0x40000003 should be filled out by userspace. Is this compatible with Xen's (and other's) use of cpuid? That is, 0x40000000 returns a hypervisor-specific signature in e[bcd]x, and eax has the max hypervisor leaf. Xen is currently using 0/1/2. I had thought it was only using 0/1. The intention was not to squash Xen's current CPUID usage so that it would still be possible for Xen to make use of the guest code. Can we agree that Xen won't squash leaves 3/4 or is it not worth trying to be compatible at this point? Regards, Anthony Liguori J
[PATCH] Refactor hypercall infrastructure
This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. It also introduces the infrastructure to probe for hypercalls available via CPUID leaves 0x40000002. CPUID leaf 0x40000003 should be filled out by userspace. A fall-out of this patch is that the unhandled hypercalls no longer trap to userspace. There is very little reason though to use a hypercall to communicate with userspace as PIO or MMIO can be used. There is no code in tree that uses userspace hypercalls. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index ad08138..1cde572 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -46,6 +46,7 @@ #define KVM_MAX_CPUID_ENTRIES 40 #define DE_VECTOR 0 +#define UD_VECTOR 6 #define NM_VECTOR 7 #define DF_VECTOR 8 #define TS_VECTOR 10 @@ -317,9 +318,6 @@ struct kvm_vcpu { unsigned long cr0; unsigned long cr2; unsigned long cr3; - gpa_t para_state_gpa; - struct page *para_state_page; - gpa_t hypercall_gpa; unsigned long cr4; unsigned long cr8; u64 pdptrs[4]; /* pae */ @@ -622,7 +620,9 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu); int kvm_mmu_load(struct kvm_vcpu *vcpu); void kvm_mmu_unload(struct kvm_vcpu *vcpu); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu); static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code) diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index 99e4917..5211d19 100644 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #include @@ -1383,51 +1384,61 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu) }
EXPORT_SYMBOL_GPL(kvm_emulate_halt); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run) +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) { - unsigned long nr, a0, a1, a2, a3, a4, a5, ret; + unsigned long nr, a0, a1, a2, a3, ret; kvm_x86_ops->cache_regs(vcpu); - ret = -KVM_EINVAL; -#ifdef CONFIG_X86_64 - if (is_long_mode(vcpu)) { - nr = vcpu->regs[VCPU_REGS_RAX]; - a0 = vcpu->regs[VCPU_REGS_RDI]; - a1 = vcpu->regs[VCPU_REGS_RSI]; - a2 = vcpu->regs[VCPU_REGS_RDX]; - a3 = vcpu->regs[VCPU_REGS_RCX]; - a4 = vcpu->regs[VCPU_REGS_R8]; - a5 = vcpu->regs[VCPU_REGS_R9]; - } else -#endif - { - nr = vcpu->regs[VCPU_REGS_RBX] & -1u; - a0 = vcpu->regs[VCPU_REGS_RAX] & -1u; - a1 = vcpu->regs[VCPU_REGS_RCX] & -1u; - a2 = vcpu->regs[VCPU_REGS_RDX] & -1u; - a3 = vcpu->regs[VCPU_REGS_RSI] & -1u; - a4 = vcpu->regs[VCPU_REGS_RDI] & -1u; - a5 = vcpu->regs[VCPU_REGS_RBP] & -1u; - } + + nr = vcpu->regs[VCPU_REGS_RAX]; + a0 = vcpu->regs[VCPU_REGS_RBX]; + a1 = vcpu->regs[VCPU_REGS_RCX]; + a2 = vcpu->regs[VCPU_REGS_RDX]; + a3 = vcpu->regs[VCPU_REGS_RSI]; + + if (!is_long_mode(vcpu)) { + nr &= ~1u; + a0 &= ~1u; + a1 &= ~1u; + a2 &= ~1u; + a3 &= ~1u; + } + switch (nr) { default: - run->hypercall.nr = nr; - run->hypercall.args[0] = a0; - run->hypercall.args[1] = a1; - run->hypercall.args[2] = a2; - run->hypercall.args[3] = a3; - run->hypercall.args[4] = a4; - run->hypercall.args[5] = a5; - run->hypercall.ret = ret; - run->hypercall.longmode = is_long_mode(vcpu); - kvm_x86_ops->decache_regs(vcpu); - return 0; + ret = -KVM_ENOSYS; + break; } vcpu->regs[VCPU_REGS_RAX] = ret; kvm_x86_ops->decache_regs(vcpu); - return 1; + return 0; +} +EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu) +{ + char instruction[3]; + int ret = 0; + + mutex_lock(&vcpu->kvm->lock); + + /* +* Blow out the MMU to ensure that no other VCPU has an active mapping +* to ensure that the updated hypercall appears atomically across all +* VCPUs. 
+*/ + kvm_mmu_zap_all(vcpu->kvm); + + kvm_x86_ops->cache_reg
Re: [kvm-devel] [RFC] 9p: add KVM/QEMU pci transport
I think that it would be nicer to implement the p9 transport on top of virtio instead of directly on top of PCI. I think your PCI transport would make a pretty nice start of a PCI virtio transport though. Regards, Anthony Liguori On Tue, 2007-08-28 at 13:52 -0500, Eric Van Hensbergen wrote: > From: Latchesar Ionkov <[EMAIL PROTECTED]> > > This adds a shared memory transport for a synthetic 9p device for > paravirtualized file system support under KVM/QEMU. > > Signed-off-by: Latchesar Ionkov <[EMAIL PROTECTED]> > Signed-off-by: Eric Van Hensbergen <[EMAIL PROTECTED]> > --- > Documentation/filesystems/9p.txt |2 + > net/9p/Kconfig | 10 ++- > net/9p/Makefile |4 + > net/9p/trans_pci.c | 295 > ++ > 4 files changed, 310 insertions(+), 1 deletions(-) > create mode 100644 net/9p/trans_pci.c > > diff --git a/Documentation/filesystems/9p.txt > b/Documentation/filesystems/9p.txt > index 1a5f50d..e1879bd 100644 > --- a/Documentation/filesystems/9p.txt > +++ b/Documentation/filesystems/9p.txt > @@ -46,6 +46,8 @@ OPTIONS > tcp - specifying a normal TCP/IP connection > fd - used passed file descriptors for connection > (see rfdno and wfdno) > + pci - use a PCI pseudo device for 9p communication > + over shared memory between a guest and host > >uname=name user name to attempt mount as on the remote server. The > server may override or ignore this value. Certain user > diff --git a/net/9p/Kconfig b/net/9p/Kconfig > index 09566ae..8517560 100644 > --- a/net/9p/Kconfig > +++ b/net/9p/Kconfig > @@ -16,13 +16,21 @@ menuconfig NET_9P > config NET_9P_FD > depends on NET_9P > default y if NET_9P > - tristate "9P File Descriptor Transports (Experimental)" > + tristate "9p File Descriptor Transports (Experimental)" > help > This builds support for file descriptor transports for 9p > which includes support for TCP/IP, named pipes, or passed > file descriptors. TCP/IP is the default transport for 9p, > so if you are going to use 9p, you'll likely want this. 
> > +config NET_9P_PCI > + depends on NET_9P > + tristate "9p PCI Shared Memory Transport (Experimental)" > + help > + This builds support for a PCI psuedo-device currently available > + under KVM/QEMU which allows for 9p transactions over shared > + memory between the guest and the host. > + > config NET_9P_DEBUG > bool "Debug information" > depends on NET_9P > diff --git a/net/9p/Makefile b/net/9p/Makefile > index 7b2a67a..26ce89d 100644 > --- a/net/9p/Makefile > +++ b/net/9p/Makefile > @@ -1,5 +1,6 @@ > obj-$(CONFIG_NET_9P) := 9pnet.o > obj-$(CONFIG_NET_9P_FD) += 9pnet_fd.o > +obj-$(CONFIG_NET_9P_PCI) += 9pnet_pci.o > > 9pnet-objs := \ > mod.o \ > @@ -14,3 +15,6 @@ obj-$(CONFIG_NET_9P_FD) += 9pnet_fd.o > > 9pnet_fd-objs := \ > trans_fd.o \ > + > +9pnet_pci-objs := \ > + trans_pci.o \ > diff --git a/net/9p/trans_pci.c b/net/9p/trans_pci.c > new file mode 100644 > index 000..36ddc5f > --- /dev/null > +++ b/net/9p/trans_pci.c > @@ -0,0 +1,295 @@ > +/* > + * net/9p/trans_pci.c > + * > + * 9P over PCI transport layer. For use with KVM/QEMU. > + * > + * Copyright (C) 2007 by Latchesar Ionkov <[EMAIL PROTECTED]> > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 > + * as published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. 
> + * > + * You should have received a copy of the GNU General Public License > + * along with this program; if not, write to: > + * Free Software Foundation > + * 51 Franklin Street, Fifth Floor > + * Boston, MA 02111-1301 USA > + * > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#define P9PCI_DRIVER_NAME "9P PCI Device" > +#define P9PCI_DRIVER_VERSION "1" > + > +#define PCI_VENDOR_ID_9P 0x5002 > +#defin
Re: [kvm-devel] [PATCH 3/3] KVM paravirt-ops implementation
On Wed, 2007-08-29 at 04:31 +1000, Rusty Russell wrote: > On Mon, 2007-08-27 at 10:16 -0500, Anthony Liguori wrote: > > @@ -569,6 +570,7 @@ asmlinkage void __init start_kernel(void) > > } > > sort_main_extable(); > > trap_init(); > > + kvm_guest_init(); > > rcu_init(); > > init_IRQ(); > > pidhash_init(); > > Hi Anthony, > > This placement seems arbitrary. Why not earlier from setup_arch, or as > a normal initcall? The placement is important if we wish to have a paravirt_ops hook for the interrupt controller. This is the latest possible spot we can do it. A comment is probably appropriate here. Regards, Anthony Liguori > Rusty. > > > > - > This SF.net email is sponsored by: Splunk Inc. > Still grepping through log files to find problems? Stop. > Now Search log events and configuration files using AJAX and a browser. > Download your FREE copy of Splunk now >> http://get.splunk.com/ > ___ > kvm-devel mailing list > [EMAIL PROTECTED] > https://lists.sourceforge.net/lists/listinfo/kvm-devel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH 2/3] Refactor hypercall infrastructure
On Wed, 2007-08-29 at 04:12 +1000, Rusty Russell wrote: > On Mon, 2007-08-27 at 10:16 -0500, Anthony Liguori wrote: > > This patch refactors the current hypercall infrastructure to better support > > live > > migration and SMP. It eliminates the hypercall page by trapping the UD > > exception that would occur if you used the wrong hypercall instruction for > > the > > underlying architecture and replacing it with the right one lazily. > > It also reduces the number of hypercall args, which you don't mention > here. Oh yes, sorry. > > + er = emulate_instruction(&svm->vcpu, kvm_run, 0, 0); > > + > > + /* we should only succeed here in the case of hypercalls which > > + cannot generate an MMIO event. MMIO means that the emulator > > + is mistakenly allowing an instruction that should generate > > + a UD fault so it's a bug. */ > > + BUG_ON(er == EMULATE_DO_MMIO); > > This seems... unwise. Firstly we know our emulator is incomplete. > Secondly an SMP guest can exploit this to crash the host. This code is gone in v2. > (Code is in two places). > > > +#define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1" Good point. > A nice big comment would be nice here, I think. Note that this is big > enough for both "int $0x1f" and "sysenter", so I'm happy. I need to add a comment somewhere mentioning that if you patch with something less than 3 bytes, then you should pad with nop but the hypervisor must treat the whole instruction (including the padding) as atomic (that is, regardless of hypercall size, eip += 3) or you run the risk of breakage during migration. Regards, Anthony Liguori > Cheers, > Rusty. > > > - > This SF.net email is sponsored by: Splunk Inc. > Still grepping through log files to find problems? Stop. > Now Search log events and configuration files using AJAX and a browser. 
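The padding scheme Anthony describes above (patch the site with the native instruction, pad to 3 bytes with nop, and always advance eip by 3) can be sketched in isolation. This is an illustrative user-space model, not the kernel code: `patch_hypercall_site` is a hypothetical helper, though the opcode bytes for vmcall and sysenter are the real encodings.

```c
#include <assert.h>
#include <string.h>

/* Real instruction encodings: vmcall = 0f 01 c1 (Intel),
 * vmmcall = 0f 01 d9 (AMD), sysenter = 0f 34. */
static const unsigned char VMCALL[3]   = { 0x0f, 0x01, 0xc1 };
static const unsigned char SYSENTER[2] = { 0x0f, 0x34 };

/* Hypothetical helper: write an instruction of len <= 3 bytes into a
 * 3-byte patch site and pad the tail with nop (0x90), so the guest can
 * unconditionally do eip += 3 after the hypercall regardless of which
 * instruction the hypervisor patched in. */
static void patch_hypercall_site(unsigned char site[3],
                                 const unsigned char *insn, int len)
{
    memcpy(site, insn, len);
    memset(site + len, 0x90, 3 - len);
}
```

With a 2-byte instruction such as sysenter, the third byte becomes a nop, which is why the hypervisor must treat all 3 bytes as one unit when advancing eip.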
[PATCH 1/3] Refactor hypercall infrastructure (v2)
This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. It also introduces the infrastructure to probe for hypercall available via CPUID leaves 0x4002. CPUID leaf 0x4003 should be filled out by userspace. A fall-out of this patch is that the unhandled hypercalls no longer trap to userspace. There is very little reason though to use a hypercall to communicate with userspace as PIO or MMIO can be used. There is no code in tree that uses userspace hypercalls. Since v1, we now use emulator_write_emulated() and zap the MMU while holding kvm->lock to ensure atomicity on SMP. We also switch the KVM errnos to be integer based and make kvm_para.h #include-able from userspace. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index a42a6f3..38c0eaf 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -46,6 +46,7 @@ #define KVM_MAX_CPUID_ENTRIES 40 #define DE_VECTOR 0 +#define UD_VECTOR 6 #define NM_VECTOR 7 #define DF_VECTOR 8 #define TS_VECTOR 10 @@ -316,9 +317,6 @@ struct kvm_vcpu { unsigned long cr0; unsigned long cr2; unsigned long cr3; - gpa_t para_state_gpa; - struct page *para_state_page; - gpa_t hypercall_gpa; unsigned long cr4; unsigned long cr8; u64 pdptrs[4]; /* pae */ @@ -587,7 +585,9 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu); int kvm_mmu_load(struct kvm_vcpu *vcpu); void kvm_mmu_unload(struct kvm_vcpu *vcpu); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu); static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code) diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index d154487..177cba1 100644 --- 
a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -1267,51 +1267,61 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_emulate_halt); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run) +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) { - unsigned long nr, a0, a1, a2, a3, a4, a5, ret; + unsigned long nr, a0, a1, a2, a3, ret; kvm_arch_ops->cache_regs(vcpu); - ret = -KVM_EINVAL; -#ifdef CONFIG_X86_64 - if (is_long_mode(vcpu)) { - nr = vcpu->regs[VCPU_REGS_RAX]; - a0 = vcpu->regs[VCPU_REGS_RDI]; - a1 = vcpu->regs[VCPU_REGS_RSI]; - a2 = vcpu->regs[VCPU_REGS_RDX]; - a3 = vcpu->regs[VCPU_REGS_RCX]; - a4 = vcpu->regs[VCPU_REGS_R8]; - a5 = vcpu->regs[VCPU_REGS_R9]; - } else -#endif - { - nr = vcpu->regs[VCPU_REGS_RBX] & -1u; - a0 = vcpu->regs[VCPU_REGS_RAX] & -1u; - a1 = vcpu->regs[VCPU_REGS_RCX] & -1u; - a2 = vcpu->regs[VCPU_REGS_RDX] & -1u; - a3 = vcpu->regs[VCPU_REGS_RSI] & -1u; - a4 = vcpu->regs[VCPU_REGS_RDI] & -1u; - a5 = vcpu->regs[VCPU_REGS_RBP] & -1u; + + nr = vcpu->regs[VCPU_REGS_RAX]; + a0 = vcpu->regs[VCPU_REGS_RBX]; + a1 = vcpu->regs[VCPU_REGS_RCX]; + a2 = vcpu->regs[VCPU_REGS_RDX]; + a3 = vcpu->regs[VCPU_REGS_RSI]; + + if (!is_long_mode(vcpu)) { + nr &= ~1u; + a0 &= ~1u; + a1 &= ~1u; + a2 &= ~1u; + a3 &= ~1u; } + switch (nr) { default: - run->hypercall.nr = nr; - run->hypercall.args[0] = a0; - run->hypercall.args[1] = a1; - run->hypercall.args[2] = a2; - run->hypercall.args[3] = a3; - run->hypercall.args[4] = a4; - run->hypercall.args[5] = a5; - run->hypercall.ret = ret; - run->hypercall.longmode = is_long_mode(vcpu); - kvm_arch_ops->decache_regs(vcpu); - return 0; + ret = -KVM_ENOSYS; + break; } vcpu->regs[VCPU_REGS_RAX] = ret; kvm_arch_ops->decache_regs(vcpu); - return 1; + return 0; +} +EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu) +{ + char instruction[3]; + int ret = 0; + + mutex_lock(&vcpu->kvm->lock); + + /* +* Blow out the MMU to ensure that no other VCPU has an active 
mapping +* to ensure that the updated hypercall appears atomically
[PATCH 2/3] KVM paravirt-ops implementation (v2)
A very simple paravirt_ops implementation for KVM. Future patches will add more sophisticated optimizations. The goal of presenting this now is to validate the new hypercall infrastructure and to make my patch queue smaller. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig index f952493..ceacc66 100644 --- a/arch/i386/Kconfig +++ b/arch/i386/Kconfig @@ -237,6 +237,13 @@ config VMI at the moment), by linking the kernel to a GPL-ed ROM module provided by the hypervisor. +config KVM_GUEST + bool "KVM paravirt-ops support" + depends on PARAVIRT + help + This option enables various optimizations for running under the KVM + hypervisor. + config ACPI_SRAT bool default y diff --git a/arch/i386/kernel/Makefile b/arch/i386/kernel/Makefile index 9d33b00..6308fcf 100644 --- a/arch/i386/kernel/Makefile +++ b/arch/i386/kernel/Makefile @@ -42,6 +42,7 @@ obj-$(CONFIG_K8_NB) += k8.o obj-$(CONFIG_MGEODE_LX)+= geode.o obj-$(CONFIG_VMI) += vmi.o vmiclock.o +obj-$(CONFIG_KVM_GUEST)+= kvm.o obj-$(CONFIG_PARAVIRT) += paravirt.o obj-y += pcspeaker.o diff --git a/arch/i386/kernel/kvm.c b/arch/i386/kernel/kvm.c new file mode 100644 index 000..26585a4 --- /dev/null +++ b/arch/i386/kernel/kvm.c @@ -0,0 +1,46 @@ +/* + * KVM paravirt_ops implementation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details.
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + * + * Copyright IBM Corporation, 2007 + * Authors: Anthony Liguori <[EMAIL PROTECTED]> + */ + +#include +#include + +/* + * No need for any "IO delay" on KVM + */ +static void kvm_io_delay(void) +{ +} + +static __init void paravirt_setup(void) +{ + paravirt_ops.name = "KVM"; + + if (kvm_para_has_feature(KVM_PARA_FEATURE_NOP_IO_DELAY)) + paravirt_ops.io_delay = kvm_io_delay; + + paravirt_ops.paravirt_enabled = 1; +} + +void __init kvm_guest_init(void) +{ + if (kvm_para_available()) + paravirt_setup(); +} diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h index 0f94138..10b2e7a 100644 --- a/include/linux/kvm_para.h +++ b/include/linux/kvm_para.h @@ -4,6 +4,14 @@ #ifdef __KERNEL__ #include +#ifdef CONFIG_KVM_GUEST +void __init kvm_guest_init(void); +#else +static inline void kvm_guest_init(void) +{ +} +#endif + #define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1" static inline long kvm_hypercall0(unsigned int nr) diff --git a/init/main.c b/init/main.c index d3bcb3b..b224905 100644 --- a/init/main.c +++ b/init/main.c @@ -55,6 +55,7 @@ #include #include #include +#include #include #include @@ -569,6 +570,7 @@ asmlinkage void __init start_kernel(void) } sort_main_extable(); trap_init(); + kvm_guest_init(); rcu_init(); init_IRQ(); pidhash_init(); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
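The guest-side probing in the patch above (`kvm_para_available()` plus `kvm_para_has_feature(KVM_PARA_FEATURE_NOP_IO_DELAY)`) boils down to testing a bit in a feature word that the hypervisor returns through a CPUID leaf. A minimal model, with the bit position assumed for illustration (the real value is whatever kvm_para.h assigns):

```c
#include <assert.h>

/* Assumed bit position for illustration only; the ABI value is set by
 * KVM_PARA_FEATURE_NOP_IO_DELAY in kvm_para.h. */
#define NOP_IO_DELAY_BIT 0

/* Model of kvm_para_has_feature(): the hypervisor advertises a feature
 * bitmap via a CPUID features leaf, and the guest just tests one bit
 * before installing the corresponding paravirt_ops hook. */
static int para_has_feature(unsigned int features, int bit)
{
    return (features >> bit) & 1;
}
```

In the patch, a set bit is what gates replacing `paravirt_ops.io_delay` with the empty `kvm_io_delay()`.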
[PATCH 0/3] KVM paravirtualization framework (v2)
This patchset refactors KVM's paravirtualization support to better support guest SMP and cross-platform migration. It also introduces a bare-bones KVM paravirt_ops implementation. I've tested this on VT and it works nicely. I'm having trouble getting a bzImage to boot on SVM, so I haven't been able to test on SVM yet. I've incorporated all of the feedback from posting the last series.
Re: [PATCH 1/3] Implement emulator_write_phys()
On Mon, 2007-08-27 at 18:45 +0300, Avi Kivity wrote: > Anthony Liguori wrote: > > Since a hypercall may span two pages and is a gva, we need a function to > > write > > to a gva that may span multiple pages. emulator_write_phys() seems like the > > logical choice for this. > > > > @@ -962,8 +962,35 @@ static int emulator_write_std(unsigned long addr, > > unsigned int bytes, > > struct kvm_vcpu *vcpu > > I think that emulator_write_emulated(), except for being awkwardly > named, should do the job. We have enough APIs. I'll drop this patch. I mistakenly thought emulator_write_emulated() took a GPA. Thanks! Regards, Anthony Liguori > But! We may not overwrite the hypercall instruction while a vcpu may be > executing, since there's no atomicity guarantee for code fetch. We have > to be out of guest mode while writing that insn.
Re: [PATCH 1/3] Implement emulator_write_phys()
On Mon, 2007-08-27 at 20:26 +0300, Avi Kivity wrote: > Anthony Liguori wrote: > > On Mon, 2007-08-27 at 18:45 +0300, Avi Kivity wrote: > > > >> Anthony Liguori wrote: > >> > >>> Since a hypercall may span two pages and is a gva, we need a function to > >>> write > >>> to a gva that may span multiple pages. emulator_write_phys() seems like > >>> the > >>> logical choice for this. > >>> > >>> @@ -962,8 +962,35 @@ static int emulator_write_std(unsigned long addr, > >>> unsigned int bytes, > >>> struct kvm_vcpu *vcpu > >>> > >> I think that emulator_write_emulated(), except for being awkwardly > >> named, should do the job. We have enough APIs. > >> > >> But! We may not overwrite the hypercall instruction while a vcpu may be > >> executing, since there's no atomicity guarantee for code fetch. We have > >> to to be out of guest mode while writing that insn. > >> > > > > > > Hrm, good catch. > > > > How can we get out of guest mode given SMP guest support? > > > > > > kvm_flush_remote_tlbs() is something that can be generalized. > Basically, you set a bit in each vcpu and send an IPI to take them out. > > But that's deadlock prone and complex. Maybe you can just take > kvm->lock, zap the mmu and the flush tlbs, and patch the instruction at > your leisure, as no vcpu will be able to map memory until the lock is > released. This works for shadow paging but not necessarily with NPT. Do code fetches really not respect atomic writes? We could switch to a 32-bit atomic operation and that should result in no worse than the code being patched twice. Regards, Anthony Liguori - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] Refactor hypercall infrastructure
On Mon, 2007-08-27 at 19:06 +0300, Avi Kivity wrote: > Anthony Liguori wrote: > > void kvm_emulate_cpuid(struct kvm_vcpu *vcpu) > > { > > int i; > > @@ -1632,6 +1575,12 @@ void kvm_emulate_cpuid(struct kvm_vcpu *vcpu) > > vcpu->regs[VCPU_REGS_RBX] = 0; > > vcpu->regs[VCPU_REGS_RCX] = 0; > > vcpu->regs[VCPU_REGS_RDX] = 0; > > + > > + if ((function & 0x) == 0x4000) { > > + emulate_paravirt_cpuid(vcpu, function); > > + goto out; > > + } > > + > > > > Hmm. Suppose we expose kvm capabilities to host userspace instead, and > let the host userspace decide which features to expose to the guest via > the regular cpuid emulation? That allows the qemu command line to > control the feature set. That's a little awkward with something like NPT. Some CPUID features should be masked by the availability of NPT support in hardware and whether kernelspace supports NPT. To set these feature bits in userspace, you would have to query kernelspace anyway. It would be nice to be consistent with the other cpuid flags though. There is somewhat of a similar issue with migration here. If we have an initial set of PV features and down the road, add more, it may be desirable to enable those features depending on what your network looks like. Yeah, I think I agree that userspace is the right place to set these bits. > > > > +static int ud_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) > > +{ > > + int er; > > + > > + er = emulate_instruction(&svm->vcpu, kvm_run, 0, 0); > > + > > + /* we should only succeed here in the case of hypercalls which > > + cannot generate an MMIO event. MMIO means that the emulator > > + is mistakenly allowing an instruction that should generate > > + a UD fault so it's a bug. */ > > + BUG_ON(er == EMULATE_DO_MMIO); > > > > It's a guest-triggerable bug; one vcpu can be issuing ud2-in-a-loop > while the other replaces the instruction with something that does mmio. Good catch! I'll update. 
> > + > > +#define KVM_ENOSYS ENOSYS > > > > A real number (well, an integer, not a real) here please. I know that > ENOSYS isn't going to change soon, but this file defines the kvm abi and > I'd like it to be as independent as possible. > > Let's start it at 1000 so that spurious "return 1"s or "return -1"s > don't get translated into valid error numbers. Okay. > I really like the simplification to the guest/host interface that this > patch brings. Thanks. Regards, Anthony Liguori - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] Implement emulator_write_phys()
On Mon, 2007-08-27 at 18:45 +0300, Avi Kivity wrote: > Anthony Liguori wrote: > > Since a hypercall may span two pages and is a gva, we need a function to > > write > > to a gva that may span multiple pages. emulator_write_phys() seems like the > > logical choice for this. > > > > @@ -962,8 +962,35 @@ static int emulator_write_std(unsigned long addr, > > unsigned int bytes, > > struct kvm_vcpu *vcpu > > I think that emulator_write_emulated(), except for being awkwardly > named, should do the job. We have enough APIs. > > But! We may not overwrite the hypercall instruction while a vcpu may be > executing, since there's no atomicity guarantee for code fetch. We have > to to be out of guest mode while writing that insn. Hrm, good catch. How can we get out of guest mode given SMP guest support? Regards, Anthony Liguori > > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/3] KVM paravirtualization framework
This patchset refactors KVM's paravirtualization support to better support guest SMP and cross platform migration. It also introduces a bare-bones KVM paravirt_ops implementation. I've tested this on VT and it works nicely. I'm having trouble getting a bzImage to boot on SVM so I haven't been able to test on SVM yet.
[PATCH 1/3] Implement emulator_write_phys()
Since a hypercall may span two pages and is a gva, we need a function to write to a gva that may span multiple pages. emulator_write_phys() seems like the logical choice for this. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index d154487..ebcc5c9 100644 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -962,8 +962,35 @@ static int emulator_write_std(unsigned long addr, unsigned int bytes, struct kvm_vcpu *vcpu) { - pr_unimpl(vcpu, "emulator_write_std: addr %lx n %d\n", addr, bytes); - return X86EMUL_UNHANDLEABLE; + const void *data = val; + + while (bytes) { + gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr); + unsigned offset = addr & (PAGE_SIZE-1); + unsigned tocopy = min(bytes, (unsigned)PAGE_SIZE - offset); + unsigned long pfn; + struct page *page; + void *page_virt; + + if (gpa == UNMAPPED_GVA) + return X86EMUL_PROPAGATE_FAULT; + pfn = gpa >> PAGE_SHIFT; + page = gfn_to_page(vcpu->kvm, pfn); + if (!page) + return X86EMUL_UNHANDLEABLE; + page_virt = kmap_atomic(page, KM_USER0); + + memcpy(page_virt + offset, data, tocopy); + + kunmap_atomic(page_virt, KM_USER0); + mark_page_dirty(vcpu->kvm, addr >> PAGE_SHIFT); + + bytes -= tocopy; + data += tocopy; + addr += tocopy; + } + + return X86EMUL_CONTINUE; } static struct kvm_io_device *vcpu_find_mmio_dev(struct kvm_vcpu *vcpu, - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
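The loop in emulator_write_std() above splits a guest-virtual write into per-page chunks before translating and copying each one. The chunking arithmetic is the part worth checking in isolation; `chunk_len` below is a hypothetical helper mirroring the `min(bytes, PAGE_SIZE - offset)` step:

```c
#include <assert.h>

#define PAGE_SIZE 4096u

/* How many bytes can be copied starting at addr without crossing a
 * page boundary -- the same computation emulator_write_std() uses to
 * decide each iteration's copy size. */
static unsigned int chunk_len(unsigned long addr, unsigned int bytes)
{
    unsigned int offset = addr & (PAGE_SIZE - 1);
    unsigned int room = PAGE_SIZE - offset;

    return bytes < room ? bytes : room;
}
```

For a 3-byte hypercall at offset 4094, this yields a 2-byte chunk followed by a 1-byte chunk on the next page, which is exactly the two-page case the patch description calls out.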
[PATCH 2/3] Refactor hypercall infrastructure
This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. It also introduces the infrastructure to probe for hypercall available via CPUID leaves 0x4002 and 0x4003. A fall-out of this patch is that the unhandled hypercalls no longer trap to userspace. There is very little reason though to use a hypercall to communicate with userspace as PIO or MMIO can be used. There is no code in tree that uses userspace hypercalls. Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index a42a6f3..38c0eaf 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -46,6 +46,7 @@ #define KVM_MAX_CPUID_ENTRIES 40 #define DE_VECTOR 0 +#define UD_VECTOR 6 #define NM_VECTOR 7 #define DF_VECTOR 8 #define TS_VECTOR 10 @@ -316,9 +317,6 @@ struct kvm_vcpu { unsigned long cr0; unsigned long cr2; unsigned long cr3; - gpa_t para_state_gpa; - struct page *para_state_page; - gpa_t hypercall_gpa; unsigned long cr4; unsigned long cr8; u64 pdptrs[4]; /* pae */ @@ -587,7 +585,9 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu); int kvm_mmu_load(struct kvm_vcpu *vcpu); void kvm_mmu_unload(struct kvm_vcpu *vcpu); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu); static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code) diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index ebcc5c9..996d0ee 100644 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -1294,51 +1294,49 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_emulate_halt); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run) +int kvm_emulate_hypercall(struct 
kvm_vcpu *vcpu) { - unsigned long nr, a0, a1, a2, a3, a4, a5, ret; + unsigned long nr, a0, a1, a2, a3, ret; kvm_arch_ops->cache_regs(vcpu); - ret = -KVM_EINVAL; -#ifdef CONFIG_X86_64 - if (is_long_mode(vcpu)) { - nr = vcpu->regs[VCPU_REGS_RAX]; - a0 = vcpu->regs[VCPU_REGS_RDI]; - a1 = vcpu->regs[VCPU_REGS_RSI]; - a2 = vcpu->regs[VCPU_REGS_RDX]; - a3 = vcpu->regs[VCPU_REGS_RCX]; - a4 = vcpu->regs[VCPU_REGS_R8]; - a5 = vcpu->regs[VCPU_REGS_R9]; - } else -#endif - { - nr = vcpu->regs[VCPU_REGS_RBX] & -1u; - a0 = vcpu->regs[VCPU_REGS_RAX] & -1u; - a1 = vcpu->regs[VCPU_REGS_RCX] & -1u; - a2 = vcpu->regs[VCPU_REGS_RDX] & -1u; - a3 = vcpu->regs[VCPU_REGS_RSI] & -1u; - a4 = vcpu->regs[VCPU_REGS_RDI] & -1u; - a5 = vcpu->regs[VCPU_REGS_RBP] & -1u; + + nr = vcpu->regs[VCPU_REGS_RAX]; + a0 = vcpu->regs[VCPU_REGS_RBX]; + a1 = vcpu->regs[VCPU_REGS_RCX]; + a2 = vcpu->regs[VCPU_REGS_RDX]; + a3 = vcpu->regs[VCPU_REGS_RSI]; + + if (!is_long_mode(vcpu)) { + nr &= ~1u; + a0 &= ~1u; + a1 &= ~1u; + a2 &= ~1u; + a3 &= ~1u; } + switch (nr) { default: - run->hypercall.nr = nr; - run->hypercall.args[0] = a0; - run->hypercall.args[1] = a1; - run->hypercall.args[2] = a2; - run->hypercall.args[3] = a3; - run->hypercall.args[4] = a4; - run->hypercall.args[5] = a5; - run->hypercall.ret = ret; - run->hypercall.longmode = is_long_mode(vcpu); - kvm_arch_ops->decache_regs(vcpu); - return 0; + ret = -KVM_ENOSYS; + break; } vcpu->regs[VCPU_REGS_RAX] = ret; kvm_arch_ops->decache_regs(vcpu); - return 1; + return 0; +} +EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu) +{ + char instruction[3]; + + kvm_arch_ops->cache_regs(vcpu); + kvm_arch_ops->patch_hypercall(vcpu, instruction); + if (emulator_write_std(vcpu->rip, instruction, 3, vcpu) + != X86EMUL_CONTINUE) + return -EFAULT; + + return 0; } -EXPORT_SYMBOL_GPL(kvm_hypercall); static u64 mk_cr_64(u64 curr_cr, u32 new_val) { @@ -1406,75 +1404,6 @@ void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long 
val, } } -/* - * Register the para guest w
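The host-side dispatch introduced above reads the call number from RAX and the arguments from RBX/RCX/RDX/RSI, truncates everything to 32 bits when the vcpu is not in long mode (the kernel's `& -1u` idiom), and fails unknown numbers with -KVM_ENOSYS. A user-space model of that flow follows; the KVM_ENOSYS value of 1000 is an assumption based on the review suggestion to start the integer errnos at 1000, and hypercall number 0 is a made-up "nop" for illustration only.

```c
#include <assert.h>

/* Assumed value, per the review suggestion to start KVM errnos at 1000. */
#define KVM_ENOSYS 1000

struct vcpu_regs {
    unsigned long long rax, rbx, rcx, rdx, rsi;
};

/* User-space model of kvm_emulate_hypercall(): nr comes from RAX,
 * args from RBX/RCX/RDX/RSI; outside long mode the values are
 * truncated to their low 32 bits. Hypercall 0 is hypothetical. */
static long long emulate_hypercall(const struct vcpu_regs *r, int long_mode)
{
    unsigned long long nr = r->rax;

    if (!long_mode)
        nr &= 0xffffffffULL;    /* same effect as the kernel's "& -1u" */

    switch (nr) {
    case 0:                     /* hypothetical nop hypercall */
        return 0;
    default:
        return -KVM_ENOSYS;     /* unhandled numbers no longer exit to
                                 * userspace; the host fails them */
    }
}
```

The truncation is observable when RAX carries garbage in its upper half: a 32-bit guest sees hypercall 0, while the same register contents in long mode are an unknown number.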
[PATCH 3/3] KVM paravirt-ops implementation
diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig index f952493..ceacc66 100644 --- a/arch/i386/Kconfig +++ b/arch/i386/Kconfig @@ -237,6 +237,13 @@ config VMI at the moment), by linking the kernel to a GPL-ed ROM module provided by the hypervisor. +config KVM_GUEST + bool "KVM paravirt-ops support" + depends on PARAVIRT + help + This option enables various optimizations for running under the KVM + hypervisor. + config ACPI_SRAT bool default y diff --git a/arch/i386/kernel/Makefile b/arch/i386/kernel/Makefile index 9d33b00..6308fcf 100644 --- a/arch/i386/kernel/Makefile +++ b/arch/i386/kernel/Makefile @@ -42,6 +42,7 @@ obj-$(CONFIG_K8_NB) += k8.o obj-$(CONFIG_MGEODE_LX)+= geode.o obj-$(CONFIG_VMI) += vmi.o vmiclock.o +obj-$(CONFIG_KVM_GUEST)+= kvm.o obj-$(CONFIG_PARAVIRT) += paravirt.o obj-y += pcspeaker.o diff --git a/arch/i386/kernel/kvm.c b/arch/i386/kernel/kvm.c new file mode 100644 index 000..26585a4 --- /dev/null +++ b/arch/i386/kernel/kvm.c @@ -0,0 +1,46 @@ +/* + * KVM paravirt_ops implementation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. 
+ * + * Copyright IBM Corporation, 2007 + * Authors: Anthony Liguori <[EMAIL PROTECTED]> + */ + +#include +#include + +/* + * No need for any "IO delay" on KVM + */ +static void kvm_io_delay(void) +{ +} + +static __init void paravirt_setup(void) +{ + paravirt_ops.name = "KVM"; + + if (kvm_para_has_feature(KVM_PARA_FEATURE_NOP_IO_DELAY)) + paravirt_ops.io_delay = kvm_io_delay; + + paravirt_ops.paravirt_enabled = 1; +} + +void __init kvm_guest_init(void) +{ + if (kvm_para_available()) + paravirt_setup(); +} diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h index e92ea3e..16e51a1 100644 --- a/include/linux/kvm_para.h +++ b/include/linux/kvm_para.h @@ -3,6 +3,14 @@ #include +#ifdef CONFIG_KVM_GUEST +void __init kvm_guest_init(void); +#else +static inline void kvm_guest_init(void) +{ +} +#endif + #define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1" static inline long kvm_hypercall0(unsigned int nr) diff --git a/init/main.c b/init/main.c index d3bcb3b..b224905 100644 --- a/init/main.c +++ b/init/main.c @@ -55,6 +55,7 @@ #include #include #include +#include #include #include @@ -569,6 +570,7 @@ asmlinkage void __init start_kernel(void) } sort_main_extable(); trap_init(); + kvm_guest_init(); rcu_init(); init_IRQ(); pidhash_init(); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH 3/3] KVM - add hypercall nr to kvm_run
Jeff Dike wrote: On Wed, Jul 18, 2007 at 11:28:18AM -0500, Anthony Liguori wrote: Within the kernel right (VMCALL is only usable in ring 1). Yup. Is it terribly important to be able to pass through the syscall arguments in registers versus packing them in a data structure and passing a pointer to that structure? An arg block would be fine, too. Okay, then you should be pretty well insulated against the changing hypercall stuff since you only need one hypercall NR and can use a one-argument hypercall. Regards, Anthony Liguori Jeff
Re: [kvm-devel] [PATCH 3/3] KVM - add hypercall nr to kvm_run
Jeff Dike wrote: On Tue, Jul 17, 2007 at 09:15:51AM -0500, Anthony Liguori wrote: I'm planning on breaking this interface again since the new hypercall API only takes 4 arguments instead of 6. Is anything written anywhere about this hypercall interface? I've posted patches. I'll post more in the near future since we decided to make some rather drastic changes. With the new patchset, guests will use vmcall directly with a couple of bytes of padding. The hypervisor may or may not patch the calling site to use something different (like vmmcall). The thing which would make me happy (which the current code comes close to) is for the guest to be able to pass out Linux system calls with VMCALL. Within the kernel right (VMCALL is only usable in ring 1). Is it terribly important to be able to pass through the syscall arguments in registers versus packing them in a data structure and passing a pointer to that structure? Regards, Anthony Liguori Jeff
Re: [kvm-devel] [GIT PULL][RESEND #2] KVM Updates for 2.6.23-rc1
S.Çağlar Onur wrote: Hi; On Tue, 17 Jul 2007, Avi Kivity wrote: I fixed the issue with the previous patchset. Please provide further feedback, or pull from: git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git for-linus This contains kvm updates for the 2.6.23 merge window, including After that commit (+appArmor patchset) any guest image hangs the system completely as soon as started with kvm-29 userspace. Guest images work as expected with "qemu-kvm -hda X.img -m 512 -no-kvm" Can you reproduce without the appArmor patchset? Regards, Anthony Liguori I cannot provide any more information for now; it freezes the system completely, and netconsole does not work with ipw3945 (netconsole: eth1 doesn't support polling, aborting), but tomorrow I'll try with a cable. If needed you can find the .config and dmesg at [1]; if anything else is needed, please just say so... [1] http://cekirdek.pardus.org.tr/~caglar/kvm/ Cheers
Re: [kvm-devel] [PATCH 3/3] KVM - add hypercall nr to kvm_run
Avi Kivity wrote: Jeff Dike wrote: Add the hypercall number to kvm_run and initialize it. This might be considered API-changing, so I kept it separate. Signed-off-by: Jeff Dike <[EMAIL PROTECTED]> -- drivers/kvm/kvm_main.c |1 + include/linux/kvm.h|1 + 2 files changed, 2 insertions(+) Index: kvm/drivers/kvm/kvm_main.c === --- kvm.orig/drivers/kvm/kvm_main.c +++ kvm/drivers/kvm/kvm_main.c @@ -1361,6 +1361,7 @@ int kvm_hypercall(struct kvm_vcpu *vcpu, } switch (nr) { default: + run->hypercall.nr = nr; run->hypercall.args[0] = a0; run->hypercall.args[1] = a1; run->hypercall.args[2] = a2; Index: kvm/include/linux/kvm.h === --- kvm.orig/include/linux/kvm.h +++ kvm/include/linux/kvm.h @@ -106,6 +106,7 @@ struct kvm_run { } mmio; /* KVM_EXIT_HYPERCALL */ struct { + __u64 nr; __u64 args[6]; __u64 ret; __u32 longmode; It isn't API changing as the API could not be (and has not been) used. We do need to add a padding element to the union to make sure additional union members don't cause its size to change. Quite sad that we got this far with such brokenness. I'm planning on breaking this interface again since the new hypercall API only takes 4 arguments instead of 6. Regards, Anthony Liguori - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH 2/3] KVM - Fix hypercall arguments
Avi Kivity wrote: Jeff Dike wrote: It looks like kvm_hypercall is trying to match the system call convention and mixed up the call number and first argument in the 32-bit case. Signed-off-by: Jeff Dike <[EMAIL PROTECTED]> -- drivers/kvm/kvm_main.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: kvm/drivers/kvm/kvm_main.c === --- kvm.orig/drivers/kvm/kvm_main.c +++ kvm/drivers/kvm/kvm_main.c @@ -1351,8 +1351,8 @@ int kvm_hypercall(struct kvm_vcpu *vcpu, } else #endif { - nr = vcpu->regs[VCPU_REGS_RBX] & -1u; - a0 = vcpu->regs[VCPU_REGS_RAX] & -1u; + nr = vcpu->regs[VCPU_REGS_RAX] & -1u; + a0 = vcpu->regs[VCPU_REGS_RBX] & -1u; a1 = vcpu->regs[VCPU_REGS_RCX] & -1u; a2 = vcpu->regs[VCPU_REGS_RDX] & -1u; a3 = vcpu->regs[VCPU_REGS_RSI] & -1u; Anthony? I think you were hacking this area? I make a similar change in my series. Regards, Anthony Liguori - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Problem with global_flush_tlb() on i386 in 2.6.22-rc4-mm2
show_trace+0x12/0x14 [ 1114.688308] [] dump_stack+0x15/0x17 [ 1114.701557] [] __might_sleep+0xcf/0xe1 [ 1114.715584] [] down_read+0x18/0x4b [ 1114.728574] [] exit_mm+0x27/0xd1 [ 1114.741045] [] do_exit+0x10f/0x88f [ 1114.754035] [] do_trap+0x0/0x152 [ 1114.766503] [] do_page_fault+0x310/0x7ed [ 1114.781050] [] error_code+0x6a/0x70 [ 1114.794303] [] my_open+0xa6/0x124 [test_rodata] [ 1114.810665] [] proc_reg_open+0x42/0x68 [ 1114.824692] [] __dentry_open+0xe6/0x1e2 [ 1114.838979] [] nameidata_to_filp+0x35/0x3f [ 1114.854044] [] do_filp_open+0x3b/0x43 [ 1114.867811] [] do_sys_open+0x43/0x116 [ 1114.881579] [] sys_open+0x1c/0x1e [ 1114.894308] [] syscall_call+0x7/0xb [ 1114.907557] [] 0xe410 [ 1114.918212] === This is clearly the memory write I am trying to do in the page of which I just changed the attributes to RWX. If I remove the variable read before I change the flags, it starts working correctly (as far as I have tested...). If I use my own my_local_tlb_flush() function (not SMP aware) instead of global_flush_tlb(), it works correctly. What is your my_local_tlb_flush() and are you calling with preemption disabled? I also tried just calling clflush on the modified page just after the global_flush_tlb(), and the problem was still there. I therefore suspect that include/asm-i386/tlbflush.h: #define __native_flush_tlb_global() \ do {\ unsigned int tmpreg, cr4, cr4_orig; \ \ __asm__ __volatile__( \ "movl %%cr4, %2; # turn off PGE \n"\ "movl %2, %1;\n"\ "andl %3, %1;\n"\ "movl %1, %%cr4; \n"\ "movl %%cr3, %0; \n"\ "movl %0, %%cr3; # flush TLB\n"\ "movl %2, %%cr4; # turn PGE back on \n"\ : "=&r" (tmpreg), "=&r" (cr4), "=&r" (cr4_orig) \ : "i" (~X86_CR4_PGE)\ : "memory");\ } while (0) is not doing its job correctly. The problem does not seem to be caused by PARAVIRT, because it is still buggy even if I disable the PARAVIRT config option. This is actually very conservative seeing as how disabling CR4.PGE should be sufficient to flush global pages on modern processors. 
I suspect you're getting preempted while it's running.

Regards,

Anthony Liguori
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: QEMU's scsi controller no longer works on arm.
Rob Landley wrote:
> In 2.6.20, I can boot an arm kernel up to a shell prompt, including a virtual scsi hard drive. In 2.6.21-rc1, this stopped working.

It's broken with x86 too. I looked into it for a bit, and it appears that the Linux driver is using more LSI commands than it did before. The odd thing is that it's probing some space that is marked reserved in the LSI manual. Unfortunately, I couldn't understand the SCSI subsystem well enough to work out what changed in Linux.

Regards,

Anthony Liguori

> I tried "git bisect" and found out there's a range of about 5000 commits between the two where arm doesn't compile. At the start of this range, the controller worked. At the end, it didn't anymore.
>
> How can YOU reproduce this problem? I'm glad you asked: the miniconfig I'm using is attached. You'll need an arm compiler to build it, of course. (If you haven't got one, the cross compiler I made is at http://landley.net/code/firmware/downloads/cross-compiler/. Download the armv4l version for the appropriate host, extract the tarball, and add the "bin" directory under that to your path. The x86 version should work on Ubuntu 6.06 or newer; the x86-64 version was built on 7.04.)
>
> Configure with:
>
>   make ARCH=arm allnoconfig KCONFIG_ALLCONFIG=miniconfig-linux
>   make ARCH=arm CROSS_COMPILE=armv4l-
>
> And then run qemu on it:
>
>   qemu-system-arm -M versatilepb -nographic -no-reboot \
>     -hda /dev/null -kernel zImage-armv4l -append \
>     'rw panic=1 root=/dev/sda console=ttyAMA0'
>
> The failing system will loop resetting the scsi controller with lots of timeouts, and takes several minutes to get to the panic where it can't mount root. The working system will panic due to root being /dev/null fairly quickly, without pages of error messages and a very long wait first.
>
> Rob
Re: [Xen-devel] Re: More virtio users
Gerd Hoffmann wrote:
> Hi,
>
>> Framebuffer is an interesting one. Virtio doesn't assume shared memory, so naively for the fb you would just send outbufs describing changed memory. This would work, but describing rectangles is better. A helper might be the right approach here.
>
> Rectangles work just fine for a framebuffer console. They stop working once you plan to run any graphical stuff, such as an X server, on top of the framebuffer. The only way to get notified about changes is page faults, i.e. 4k granularity on the linear framebuffer memory.
>
> Related to the framebuffer are a virtual keyboard and virtual mouse (or better, touchscreen), which probably work perfectly fine with virtio. I'd guess you can even reuse the input layer event struct for the virtio events.

Virtio seems like overkill for either of those things. It's necessary for pure PV, but not at all necessary for something like KVM.

> Xen has a virtual framebuffer, kbd & mouse, although not (yet?) in the paravirt_ops patch queue, so there is something you can look at ;)

In retrospect, IMHO, a shared framebuffer was a bad idea for Xen. It's easy enough for dealing with an unaccelerated display, but once you start getting into more advanced features like blitting (which is really important for decent VNC performance), the synchronization becomes a big problem.

If we were to do a PV display driver again, I really think something closer to a VNC protocol is the right way to go. There simply isn't a significant performance overhead to copying the relatively small amount of memory. So virtio in its current form would actually be pretty useful for that.

Regards,

Anthony Liguori

> cheers,
>   Gerd
Re: A set of "standard" virtual devices?
Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
>> The implementation wouldn't need to use PCI at all. There wouldn't even need to be PCI-like registers internally. Just a PCI device with an ID somewhere in sysfs. PCI with unique IDs is just a convenient and well-established key into the driver module collection. Once you have the right driver, it can do what it wants.
>
> But I understood hpa's suggestion to mean that there would be a standard PCI interface for a hardware RNG, and a single Linux driver for that device, which all hypervisors would be expected to implement. But that's only reasonable if the virtualization environment has some notion of PCI to expose to the Linux guest.

The actual PCI bus could be paravirtualized. It's just a question of whether one reinvents a device discovery mechanism (like XenBus) or piggybacks on existing mechanisms.

Furthermore, in the future, I strongly suspect that HVM will become much more important for Xen than PV, and since that already has a PCI bus, it's not really that big of a deal.

Regards,

Anthony Liguori

> J
Re: [kvm-devel] [PATCH 0/15] KVM userspace interface updates
Heiko Carstens wrote:
> On Sun, Mar 11, 2007 at 03:53:12PM +0200, Avi Kivity wrote:
>> This patchset updates the kvm userspace interface to what I hope will be the long-term stable interface. Provisions are included for extending the interface later. The patches address performance and cleanliness concerns.
>
> Searching the mailing list, I figured that as soon as the interface seems to be stable, kvm should/would switch to a system call based interface. I assume the userspace interface might still change a lot, especially if kvm is ported to new architectures. But the general question is: do you still plan to switch to a syscall interface?

What benefit would a syscall interface have?

Regards,

Anthony Liguori