[PATCH] KVM: VMX: Enforce EPT pagetable level checking

2010-06-02 Thread Sheng Yang
We only support 4 levels EPT pagetable now.

Signed-off-by: Sheng Yang sh...@linux.intel.com
---
 arch/x86/kvm/vmx.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 99ae513..d400fbb 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -339,6 +339,11 @@ static inline bool cpu_has_vmx_ept_1g_page(void)
return vmx_capability.ept  VMX_EPT_1GB_PAGE_BIT;
 }
 
+static inline bool cpu_has_vmx_ept_4levels(void)
+{
+   return vmx_capability.ept  VMX_EPT_PAGE_WALK_4_BIT;
+}
+
 static inline bool cpu_has_vmx_invept_individual_addr(void)
 {
return vmx_capability.ept  VMX_EPT_EXTENT_INDIVIDUAL_BIT;
@@ -1567,7 +1572,8 @@ static __init int hardware_setup(void)
if (!cpu_has_vmx_vpid())
enable_vpid = 0;
 
-   if (!cpu_has_vmx_ept()) {
+   if (!cpu_has_vmx_ept() ||
+   !cpu_has_vmx_ept_4levels()) {
enable_ept = 0;
enable_unrestricted_guest = 0;
}
-- 
1.7.0.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 32-bit qemu + 64-bit kvm be a problem?

2010-06-02 Thread Michael Tokarev

02.06.2010 09:44, Neo Jia wrote:

On Wed, Mar 10, 2010 at 2:12 PM, Michael Tokarevm...@tls.msk.ru  wrote:

[]

I use 32bit kvm on 64bit kernel since the day one.  Nothing of interest
since that, everything just works.



I just came back to this thread because I am seeing that I can't run
VISTA 64-bit inside 64/32 mode, which will crash with bugcheck 0x5D.
Is this a known issue?


I haven't seen it myself, but I never tried vista.  Win7 64bit installs
and works just fine.  STOP 0x005D, according to M$, means there is
a processor present that is not supported.  I'd say try to experiment
with different -cpu options.  Also see if it works in native 64bit mode.


Recently (this week) I come across a situation when something does not
work in 64/32 mode.  Namely it is linux aio (see the other thread in
kvm@ a few days back) - but this is not due to kvm but due to other
kernel subsystem (in this case aio) which lacks proper compat handlers
in place.


Could you tell me more about the AIO issue? Will this will slow down
the guest if it does a lot I/O? Will setting up coalescing help?


You can search it yourself.  In a nutshell it's an optimisation in kvm
since version 0.12  to let raw devices be used a bit more efficient.

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-02 Thread H. Peter Anvin

On 06/01/2010 10:39 AM, valdis.kletni...@vt.edu wrote:

On Tue, 01 Jun 2010 19:52:28 +0300, Avi Kivity said:

On 06/01/2010 07:38 PM, Andi Kleen wrote:

Your new code would starve again, right?

Try it on a NUMA system with unfair memory.



We are running everything on NUMA (since all modern machines are now
NUMA).  At what scale do the issues become observable?


My 6-month-old laptop is NUMA? Comes as a surprise to me, and to the
perfectly-running NUMA=n kernel I'm running.

Or did you mean a less broad phrase than all modern machines?



All modern multisocket machines, unless configured in interleaved memory 
mode.


-hpa

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] ceph/rbd block driver for qemu-kvm (v3)

2010-06-02 Thread Christian Brunner
Hi Kevin,

2010/6/1 Kevin Wolf kw...@redhat.com:
 Do you have some specific tests for the driver or should we extend
 qemu-iotests to work with protocols and use only that?

Right now I don't have any specific tests, but I'll take a look at
qemu-iotests soon.

Christian
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


unhandled wrmsr

2010-06-02 Thread Dave Young
Hi,

With today's git version (qemu-kvm), I got following message in kernel dmesg

[168344.215605] kvm: 27289: cpu0 unhandled wrmsr: 0x198 data 0

bash-3.1$ /home/dave/tmp/qemu-system-x86_64 --version
QEMU emulator version 0.12.50 (qemu-kvm-devel), Copyright (c)
2003-2008 Fabrice Bellard


bash-3.1$ uname -a
Linux darkstar 2.6.35-rc1 #35 SMP Mon May 31 16:50:15 CST 2010 x86_64
Intel(R) Core(TM)2 Duo CPU E7500  @ 2.93GHz GenuineIntel GNU/Linux

host distribution is slackware64-13.0

0.12.3 works well

BTW the guest is a tiny core linux image

-- 
Regards
dave
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-02 Thread Andi Kleen
On Wed, Jun 02, 2010 at 05:51:14AM +0300, Avi Kivity wrote:
 On 06/01/2010 08:27 PM, Andi Kleen wrote:
 On Tue, Jun 01, 2010 at 07:52:28PM +0300, Avi Kivity wrote:

 We are running everything on NUMA (since all modern machines are now NUMA).
   At what scale do the issues become observable?
  
 On Intel platforms it's visible starting with 4 sockets.


 Can you recommend a benchmark that shows bad behaviour?  I'll run it with 

Pretty much anything with high lock contention.

 ticket spinlocks and Gleb's patch.  I have a 4-way Nehalem-EX, presumably 
 the huge number of threads will magnify the problem even more there.

Yes more threads cause more lock contention too.

 Do  you have any idea how we can tackle both problems?

Apparently Xen has something, perhaps that can be leveraged
(but I haven't looked at their solution in detail)

Otherwise I would probably try to start with a adaptive
spinlock that at some point calls into the HV (or updates
shared memory?), like john cooper suggested. The tricky part here would
be to find the thresholds and fit that state into
paravirt ops and the standard spinlock_t.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm: rework remove-write-access for a slot

2010-06-02 Thread Lai Jiangshan
Current code uses slot_bitmap to find ptes who map a page
from the memory slot, it is not precise: some ptes in the shadow page
are not map any page from the memory slot.

This patch uses rmap to find the ptes precisely, and remove
the unused slot_bitmap.

Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
---
diff --git a/Documentation/kvm/mmu.txt b/Documentation/kvm/mmu.txt
index 1e7ecdd..d749399 100644
--- a/Documentation/kvm/mmu.txt
+++ b/Documentation/kvm/mmu.txt
@@ -183,13 +183,6 @@ Shadow pages contain the following information:
 perform a reverse map from a pte to a gfn. When role.direct is set, any
 element of this array can be calculated from the gfn field when used, in
 this case, the array of gfns is not allocated. See role.direct and gfn.
-  slot_bitmap:
-A bitmap containing one bit per memory slot.  If the page contains a pte
-mapping a page from memory slot n, then bit n of slot_bitmap will be set
-(if a page is aliased among several slots, then it is not guaranteed that
-all slots will be marked).
-Used during dirty logging to avoid scanning a shadow page if none if its
-pages need tracking.
   root_count:
 A counter keeping track of how many hardware registers (guest cr3 or
 pdptrs) are now pointing at the page.  While this counter is nonzero, the
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0cd0f29..bf4f198 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -197,11 +197,6 @@ struct kvm_mmu_page {
u64 *spt;
/* hold the gfn of each spte inside spt */
gfn_t *gfns;
-   /*
-* One bit set per slot which has memory
-* in this shadow page.
-*/
-   DECLARE_BITMAP(slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
bool multimapped; /* More than one parent_pte? */
bool unsync;
int root_count;  /* Currently serving as active root */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c16c4ca..e097e81 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -941,7 +941,6 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct 
kvm_vcpu *vcpu,
  PAGE_SIZE);
set_page_private(virt_to_page(sp-spt), (unsigned long)sp);
list_add(sp-link, vcpu-kvm-arch.active_mmu_pages);
-   bitmap_zero(sp-slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
sp-multimapped = 0;
sp-parent_pte = parent_pte;
--vcpu-kvm-arch.n_free_mmu_pages;
@@ -1660,14 +1659,6 @@ restart:
}
 }
 
-static void page_header_update_slot(struct kvm *kvm, void *pte, gfn_t gfn)
-{
-   int slot = memslot_id(kvm, gfn);
-   struct kvm_mmu_page *sp = page_header(__pa(pte));
-
-   __set_bit(slot, sp-slot_bitmap);
-}
-
 static void mmu_convert_notrap(struct kvm_mmu_page *sp)
 {
int i;
@@ -1979,7 +1970,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
if (!was_rmapped  is_large_pte(*sptep))
++vcpu-kvm-stat.lpages;
 
-   page_header_update_slot(vcpu-kvm, sptep, gfn);
if (!was_rmapped) {
rmap_count = rmap_add(vcpu, sptep, gfn);
kvm_release_pfn_clean(pfn);
@@ -2975,22 +2965,38 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
mmu_free_memory_caches(vcpu);
 }
 
+static void rmapp_remove_write_access(struct kvm *kvm, unsigned long *rmapp)
+{
+   u64 *spte = rmap_next(kvm, rmapp, NULL);
+
+   while (spte) {
+   /* avoid RMW */
+   if (is_writable_pte(*spte))
+   *spte = ~PT_WRITABLE_MASK;
+   spte = rmap_next(kvm, rmapp, spte);
+   }
+}
+
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 {
-   struct kvm_mmu_page *sp;
+   int i;
+   unsigned long gfn_offset;
+   struct kvm_memslots *slots = kvm_memslots(kvm);
+   struct kvm_memory_slot *memslot = slots-memslots[slot];
 
-   list_for_each_entry(sp, kvm-arch.active_mmu_pages, link) {
-   int i;
-   u64 *pt;
+   for (gfn_offset = 0; gfn_offset  memslot-npages; gfn_offset++) {
+   rmapp_remove_write_access(kvm, memslot-rmap[gfn_offset]);
 
-   if (!test_bit(slot, sp-slot_bitmap))
-   continue;
+   for (i = 0; i  KVM_NR_PAGE_SIZES - 1; i++) {
+   unsigned long gfn = memslot-base_gfn + gfn_offset;
+   unsigned long huge = KVM_PAGES_PER_HPAGE(i + 2);
+   int idx = gfn / huge - memslot-base_gfn / huge;
 
-   pt = sp-spt;
-   for (i = 0; i  PT64_ENT_PER_PAGE; ++i)
-   /* avoid RMW */
-   if (is_writable_pte(pt[i]))
-   pt[i] = ~PT_WRITABLE_MASK;
+   if (!(gfn_offset || (gfn % huge)))
+   break;
+

Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-02 Thread Avi Kivity

On 06/02/2010 11:50 AM, Andi Kleen wrote:

On Wed, Jun 02, 2010 at 05:51:14AM +0300, Avi Kivity wrote:
   

On 06/01/2010 08:27 PM, Andi Kleen wrote:
 

On Tue, Jun 01, 2010 at 07:52:28PM +0300, Avi Kivity wrote:

   

We are running everything on NUMA (since all modern machines are now NUMA).
   At what scale do the issues become observable?

 

On Intel platforms it's visible starting with 4 sockets.

   

Can you recommend a benchmark that shows bad behaviour?  I'll run it with
 

Pretty much anything with high lock contention.
   


Okay, we'll try to measure it here as soon as we can switch it into numa 
mode.



Do  you have any idea how we can tackle both problems?
 

Apparently Xen has something, perhaps that can be leveraged
(but I haven't looked at their solution in detail)

Otherwise I would probably try to start with a adaptive
spinlock that at some point calls into the HV (or updates
shared memory?), like john cooper suggested. The tricky part here would
be to find the thresholds and fit that state into
paravirt ops and the standard spinlock_t.

   


There are two separate problems: the more general problem is that the 
hypervisor can put a vcpu to sleep while holding a lock, causing other 
vcpus to spin until the end of their time slice.  This can only be 
addressed with hypervisor help.  The second problem is that the extreme 
fairness of ticket locks causes lots of context switches if the 
hypervisor helps, and aggravates the first problem horribly if it 
doesn't (since now a vcpu will spin waiting for its ticket even if the 
lock is unlocked).


So yes, we'll need hypervisor assistance, but even with that we'll need 
to reduce ticket lock fairness (retaining global fairness but 
sacrificing some local fairness).  I imagine that will be helpful for 
non-virt as well as local unfairness reduces bounciness.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm: cleanup, fix typing

2010-06-02 Thread Lai Jiangshan
The type of '*new.rmap' is not 'struct page *', fix it

Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
---
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9ab1a77..cab7839 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -596,7 +596,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
/* Allocate if a slot is being created */
 #ifndef CONFIG_S390
if (npages  !new.rmap) {
-   new.rmap = vmalloc(npages * sizeof(struct page *));
+   new.rmap = vmalloc(npages * sizeof(*new.rmap));
 
if (!new.rmap)
goto out_free;
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm: cleanup, use linux/uaccess.h instead of asm/uaccess.h

2010-06-02 Thread Lai Jiangshan
Should use linux/uaccess.h instead of asm/uaccess.h

Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
---
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6b2ce1d..ea03def 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -41,13 +41,13 @@
 #include linux/srcu.h
 #include linux/slab.h
 #include linux/perf_event.h
+#include linux/uaccess.h
 #include trace/events/kvm.h
 
 #define CREATE_TRACE_POINTS
 #include trace.h
 
 #include asm/debugreg.h
-#include asm/uaccess.h
 #include asm/msr.h
 #include asm/desc.h
 #include asm/mtrr.h

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Tue, Jun 01, 2010 at 12:55:32PM +0300, Michael S. Tsirkin wrote:

 There seems to be some misunderstanding.  The userspace interface
 proposed forces a separate domain per device and forces userspace to
 repeat iommu programming for each device.  We are better off sharing a
 domain between devices and programming the iommu once.
 
 The natural way to do this is to have an iommu driver for programming
 iommu.

IMO a seperate iommu-userspace driver is a nightmare for a userspace
interface. It is just too complicated to use. We can solve the problem
of multiple devices-per-domain with an ioctl which allows binding one
uio-device to the address-space on another. Thats much simpler.

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Tue, Jun 01, 2010 at 03:41:55PM +0300, Avi Kivity wrote:
 On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:

 Main difference is that vhost works fine with unlocked
 memory, paging it in on demand. iommu needs to unmap
 memory when it is swapped out or relocated.

 So you'd just take the memory map and not pin anything.  This way you  
 can reuse the memory map.

 But no, it doesn't handle the dirty bitmap, so no go.

IOMMU mapped memory can not be swapped out because we can't do demand
paging on io-page-faults with current devices. We have to pin _all_
userspace memory that is mapped into an IOMMU domain.

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Avi Kivity

On 06/02/2010 12:45 PM, Joerg Roedel wrote:

On Tue, Jun 01, 2010 at 03:41:55PM +0300, Avi Kivity wrote:
   

On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:
 
   

Main difference is that vhost works fine with unlocked
memory, paging it in on demand. iommu needs to unmap
memory when it is swapped out or relocated.

   

So you'd just take the memory map and not pin anything.  This way you
can reuse the memory map.

But no, it doesn't handle the dirty bitmap, so no go.
 

IOMMU mapped memory can not be swapped out because we can't do demand
paging on io-page-faults with current devices. We have to pin _all_
userspace memory that is mapped into an IOMMU domain.
   


vhost doesn't pin memory.

What I proposed is to describe the memory map using an object (fd), and 
pass it around to clients that use it: kvm, vhost, vfio.  That way you 
maintain the memory map in a central location and broadcast changes to 
clients.  Only a vfio client would result in memory being pinned.


It can still work, but the interface needs to be extended to include 
dirty bitmap logging.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Avi Kivity

On 06/02/2010 12:42 PM, Joerg Roedel wrote:

On Tue, Jun 01, 2010 at 12:55:32PM +0300, Michael S. Tsirkin wrote:

   

There seems to be some misunderstanding.  The userspace interface
proposed forces a separate domain per device and forces userspace to
repeat iommu programming for each device.  We are better off sharing a
domain between devices and programming the iommu once.

The natural way to do this is to have an iommu driver for programming
iommu.
 

IMO a seperate iommu-userspace driver is a nightmare for a userspace
interface. It is just too complicated to use. We can solve the problem
of multiple devices-per-domain with an ioctl which allows binding one
uio-device to the address-space on another. Thats much simpler.
   


This is non trivial with hotplug.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Tue, Jun 01, 2010 at 09:59:40PM -0700, Tom Lyon wrote:
 This is just what I was thinking.  But rather than a get/set, just use two 
 fds.
 
   ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2);
 
 This may fail if there are really 2 different IOMMUs, so user code must be
 prepared for failure,  In addition, this is strictlyupwards compatible with 
 what is there now, so maybe we can add it later.

How can this fail with multiple IOMMUs? This should be handled
transparently by the IOMMU driver.

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:
 On Tue, Jun 01, 2010 at 12:55:32PM +0300, Michael S. Tsirkin wrote:
 
  There seems to be some misunderstanding.  The userspace interface
  proposed forces a separate domain per device and forces userspace to
  repeat iommu programming for each device.  We are better off sharing a
  domain between devices and programming the iommu once.
  
  The natural way to do this is to have an iommu driver for programming
  iommu.
 
 IMO a seperate iommu-userspace driver is a nightmare for a userspace
 interface. It is just too complicated to use.

One advantage would be that we can reuse the uio framework
for the devices themselves. So an existing app can just program
an iommu for DMA and keep using uio for interrupts and access.

 We can solve the problem
 of multiple devices-per-domain with an ioctl which allows binding one
 uio-device to the address-space on another.

This would imply switching an iommu domain for a device while
it could potentially be doing DMA. No idea whether this can be done
in a safe manner.
Forcing iommu assignment to be done as a first step seems much saner.


 Thats much simpler.
 
   Joerg


So instead of
dev = open();
ioctl(dev, ASSIGN, iommu)
mmap

and if we for ioctl mmap will fail
we have

dev = open();
if (ndevices  0)
ioctl(devices[0], ASSIGN, dev)
mmap

And if we forget ioctl we get errors from device.
Seems more complicated to me.


There will also always exist the confusion: address space for
which device are we modifying? With a separate driver for iommu,
we can safely check that binding is done correctly.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: disable clocksource=kvm-clock

2010-06-02 Thread Peter Lieven

Avi Kivity wrote:

On 06/01/2010 04:57 PM, Peter Lieven wrote:

Avi Kivity wrote:

On 06/01/2010 04:12 PM, Peter Lieven wrote:

hi,

is it possible to avoid detection of clocksource=kvm_clock in a 
linux guest by

patching the qemu-kvm binary?
i would like to be able to avoid a guest detecting kvm-clock until 
bug #584516

is fixed without modifying all guest systems and reverting that later.



Newer qemu-kvm's will have this feature (using -cpu ...,-kvmclock).


with latest git, i can supply this switch, but

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

still shows kvm_clock in an opensuse 11.2 64-bit or an ubuntu 10.04 
64-bit guest.


Try again with tomorrow's git, I sent out a patchset earlier today 
that fixes this.



i don't see it in the git, do you mean:
[PATCH 0/5] Use more cpuid bits from upstream

thanks,
peter

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 12:49:28PM +0300, Avi Kivity wrote:
 On 06/02/2010 12:45 PM, Joerg Roedel wrote:
 IOMMU mapped memory can not be swapped out because we can't do demand
 paging on io-page-faults with current devices. We have to pin _all_
 userspace memory that is mapped into an IOMMU domain.

 vhost doesn't pin memory.

 What I proposed is to describe the memory map using an object (fd), and  
 pass it around to clients that use it: kvm, vhost, vfio.  That way you  
 maintain the memory map in a central location and broadcast changes to  
 clients.  Only a vfio client would result in memory being pinned.

Ah ok, so its only about the database which keeps the mapping
information.

 It can still work, but the interface needs to be extended to include  
 dirty bitmap logging.

Thats hard to do. I am not sure about VT-d but the AMD IOMMU has no
dirty-bits in the page-table. And without demand-paging we can't really
tell what pages a device has written to. The only choice is to mark all
IOMMU-mapped pages dirty as long as they are mapped.

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 12:04:04PM +0200, Joerg Roedel wrote:
 On Wed, Jun 02, 2010 at 12:49:28PM +0300, Avi Kivity wrote:
  On 06/02/2010 12:45 PM, Joerg Roedel wrote:
  IOMMU mapped memory can not be swapped out because we can't do demand
  paging on io-page-faults with current devices. We have to pin _all_
  userspace memory that is mapped into an IOMMU domain.
 
  vhost doesn't pin memory.
 
  What I proposed is to describe the memory map using an object (fd), and  
  pass it around to clients that use it: kvm, vhost, vfio.  That way you  
  maintain the memory map in a central location and broadcast changes to  
  clients.  Only a vfio client would result in memory being pinned.
 
 Ah ok, so its only about the database which keeps the mapping
 information.
 
  It can still work, but the interface needs to be extended to include  
  dirty bitmap logging.
 
 Thats hard to do. I am not sure about VT-d but the AMD IOMMU has no
 dirty-bits in the page-table. And without demand-paging we can't really
 tell what pages a device has written to. The only choice is to mark all
 IOMMU-mapped pages dirty as long as they are mapped.
 
   Joerg

Or mark them dirty when they are unmapped.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 12:53:12PM +0300, Michael S. Tsirkin wrote:
 On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:

  IMO a seperate iommu-userspace driver is a nightmare for a userspace
  interface. It is just too complicated to use.
 
 One advantage would be that we can reuse the uio framework
 for the devices themselves. So an existing app can just program
 an iommu for DMA and keep using uio for interrupts and access.

The driver is called UIO and not U-INTR-MMIO ;-) So I think handling
IOMMU mappings belongs there.

  We can solve the problem
  of multiple devices-per-domain with an ioctl which allows binding one
  uio-device to the address-space on another.
 
 This would imply switching an iommu domain for a device while
 it could potentially be doing DMA. No idea whether this can be done
 in a safe manner.

It can. The worst thing that can happen is an io-page-fault.

 Forcing iommu assignment to be done as a first step seems much saner.

If we force it, there is no reason why not doing it implicitly.

We can do something like this then:

dev1 = open();
ioctl(dev1, IOMMU_MAP, ...); /* creates IOMMU domain and assigns dev1 to
it*/

dev2 = open();
ioctl(dev2, IOMMU_MAP, ...);

/* Now dev1 and dev2 are in seperate domains */

ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
   assigns it to the same domain as
   dev1. Domain has a refcount of two
   now */

close(dev1); /* domain refcount goes down to one */
close(dev2); /* domain refcount is zero and domain gets destroyed */


Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 11:45:27AM +0200, Joerg Roedel wrote:
 On Tue, Jun 01, 2010 at 03:41:55PM +0300, Avi Kivity wrote:
  On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:
 
  Main difference is that vhost works fine with unlocked
  memory, paging it in on demand. iommu needs to unmap
  memory when it is swapped out or relocated.
 
  So you'd just take the memory map and not pin anything.  This way you  
  can reuse the memory map.
 
  But no, it doesn't handle the dirty bitmap, so no go.
 
 IOMMU mapped memory can not be swapped out because we can't do demand
 paging on io-page-faults with current devices. We have to pin _all_
 userspace memory that is mapped into an IOMMU domain.
 
   Joerg


One of the issues I see with the current patch is that
it uses the mlock rlimit to do this pinning. So this wastes the rlimit
for an app that did mlockall already, and also consumes
this resource transparently, so an app might call mlock
on a small buffer and be surprised that it fails.

Using mmu notifiers might help?


-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:
 On Wed, Jun 02, 2010 at 12:53:12PM +0300, Michael S. Tsirkin wrote:
  On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:
 
   IMO a seperate iommu-userspace driver is a nightmare for a userspace
   interface. It is just too complicated to use.
  
  One advantage would be that we can reuse the uio framework
  for the devices themselves. So an existing app can just program
  an iommu for DMA and keep using uio for interrupts and access.
 
 The driver is called UIO and not U-INTR-MMIO ;-) So I think handling
 IOMMU mappings belongs there.
 
   We can solve the problem
   of multiple devices-per-domain with an ioctl which allows binding one
   uio-device to the address-space on another.
  
  This would imply switching an iommu domain for a device while
  it could potentially be doing DMA. No idea whether this can be done
  in a safe manner.
 
 It can. The worst thing that can happen is an io-page-fault.

devices might not be able to recover from this.

  Forcing iommu assignment to be done as a first step seems much saner.
 
 If we force it, there is no reason why not doing it implicitly.

What you describe below does 3 ioctls for what can be done with 1.

 We can do something like this then:
 
 dev1 = open();
 ioctl(dev1, IOMMU_MAP, ...); /* creates IOMMU domain and assigns dev1 to
   it*/
 
 dev2 = open();
 ioctl(dev2, IOMMU_MAP, ...);
 
 /* Now dev1 and dev2 are in seperate domains */
 
 ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
  assigns it to the same domain as
  dev1. Domain has a refcount of two
  now */

Or maybe it destroys mapping for dev1?
How do you remember?

 close(dev1); /* domain refcount goes down to one */
 close(dev2); /* domain refcount is zero and domain gets destroyed */
 
 
   Joerg

Also, no way to unshare? That seems limiting.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 01:15:34PM +0300, Michael S. Tsirkin wrote:
 One of the issues I see with the current patch is that
 it uses the mlock rlimit to do this pinning. So this wastes the rlimit
 for an app that did mlockall already, and also consumes
 this resource transparently, so an app might call mlock
 on a small buffer and be surprised that it fails.
 
 Using mmu notifiers might help?

MMU notifiers are problematic because they are designed for situations
where we can do demand paging. The invalidate_range_start and
invalidate_range_end functions are not only called on munmap, they also
run when mprotect is called (in which case we don't want to tear down
iommu mappings). So what may happen with mmu notifiers is that we
accidentially tear down iommu mappings. With demand-paging this is no
problem because the io-ptes could be re-faulted. But that does not work
here.

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 01:21:44PM +0300, Michael S. Tsirkin wrote:
 On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:

  It can. The worst thing that can happen is an io-page-fault.
 
 devices might not be able to recover from this.

With the userspace interface a process can create io-page-faults
anyway if it wants. We can't protect us from this. And the process is
also responsible to not tear down iommu-mappings that are currently in
use.

 What you describe below does 3 ioctls for what can be done with 1.

The second IOMMU_MAP ioctl is just to show that existing mappings would
be destroyed if the device is assigned to another address space. Not
strictly necessary. So we have two ioctls but save one call to create
the iommu-domain.

  ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
 assigns it to the same domain as
 dev1. Domain has a refcount of two
 now */
 
 Or maybe it destroys mapping for dev1?
 How do you remember?

Because we express here that dev2 shares the iommu mappings of dev1.
Thats easy to remember.

 Also, no way to unshare? That seems limiting.

Just left out for simplicity reasons. An IOMMU_UNBIND (no IOMMU_UNSHARE
because that would require a second parameter) ioctl is certainly also
required.

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 12:35:16PM +0200, Joerg Roedel wrote:
 On Wed, Jun 02, 2010 at 01:21:44PM +0300, Michael S. Tsirkin wrote:
  On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:
 
   It can. The worst thing that can happen is an io-page-fault.
  
  devices might not be able to recover from this.
 
 With the userspace interface a process can create io-page-faults
 anyway if it wants. We can't protect us from this.

We could fail all operations until an iommu is bound.
This will help catch bugs with access before setup. We can not do this
if a domain is bound by default.

 And the process is
 also responsible to not tear down iommu-mappings that are currently in
 use.
 
  What you describe below does 3 ioctls for what can be done with 1.
 
 The second IOMMU_MAP ioctl is just to show that existing mappings would
 be destroyed if the device is assigned to another address space. Not
 strictly necessary. So we have two ioctls but save one call to create
 the iommu-domain.

With 10 devices you have 10 extra ioctls.

   ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
assigns it to the same domain as
dev1. Domain has a refcount of two
now */
  
  Or maybe it destroys mapping for dev1?
  How do you remember?
 
 Because we express here that dev2 shares the iommu mappings of dev1.
 Thats easy to remember.

they both share the mappings. which one gets the iommu
destroyed (breaking the device if it is now doing DMA)?

  Also, no way to unshare? That seems limiting.
 
 Just left out for simplicity reasons. An IOMMU_UNBIND (no IOMMU_UNSHARE
 because that would require a second parameter) ioctl is certainly also
 required.
 
   Joerg



-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:
 On Wed, Jun 02, 2010 at 12:53:12PM +0300, Michael S. Tsirkin wrote:
  On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:
 
   IMO a seperate iommu-userspace driver is a nightmare for a userspace
   interface. It is just too complicated to use.
  
  One advantage would be that we can reuse the uio framework
  for the devices themselves. So an existing app can just program
  an iommu for DMA and keep using uio for interrupts and access.
 
 The driver is called UIO and not U-INTR-MMIO ;-) So I think handling
 IOMMU mappings belongs there.

Maybe it could be put there but the patch posted did not use uio.
And one of the reasons is that uio framework provides for
device access and interrupts but not for programming memory mappings.

Solutions (besides giving up on uio completely)
could include extending the framework in some way
(which was tried, but the result was not pretty) or adding
a separate driver for iommu and binding to that.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v4 0/3] Sheepdog: distributed storage system for QEMU

2010-06-02 Thread Kevin Wolf
Am 28.05.2010 04:44, schrieb MORITA Kazutaka:
 Hi all,
 
 This patch adds a block driver for Sheepdog distributed storage
 system.  Please consider for inclusion.

Hint for next time: You should remove the RFC from the subject line if
you think the patch is ready for inclusion. Otherwise I might miss this
and think you only want comments on it.

 MORITA Kazutaka (3):
   close all the block drivers before the qemu process exits
   block: call the snapshot handlers of the protocol drivers
   block: add sheepdog driver for distributed storage support

Thanks, I have applied the first two patches to the block branch, they
look good to me. I'll send some comments for the third one (though it's
only coding style until now).

Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 01:38:28PM +0300, Michael S. Tsirkin wrote:
 On Wed, Jun 02, 2010 at 12:35:16PM +0200, Joerg Roedel wrote:

  With the userspace interface a process can create io-page-faults
  anyway if it wants. We can't protect us from this.
 
 We could fail all operations until an iommu is bound.  This will help
 catch bugs with access before setup. We can not do this if a domain is
 bound by default.

Even if it is bound to a domain the userspace driver could program the
device to do dma to unmapped regions causing io-page-faults. The kernel
can't do anything about it.

  The second IOMMU_MAP ioctl is just to show that existing mappings would
  be destroyed if the device is assigned to another address space. Not
  strictly necessary. So we have two ioctls but save one call to create
  the iommu-domain.
 
 With 10 devices you have 10 extra ioctls.

And this works implicitly with your proposal? Remember that we still
need to be able to provide seperate mappings for each device to support
IOMMU emulation for the guest. I think my proposal does not have any
extra costs.

  Because we express here that dev2 shares the iommu mappings of dev1.
  Thats easy to remember.
 
 they both share the mappings. which one gets the iommu
 destroyed (breaking the device if it is now doing DMA)?

As I wrote the domain has a reference count and is destroyed only when
it goes down to zero. This does not happen as long as a device is bound
to it.

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: rework remove-write-access for a slot

2010-06-02 Thread Avi Kivity

On 06/02/2010 11:53 AM, Lai Jiangshan wrote:

Current code uses slot_bitmap to find ptes who map a page
from the memory slot, it is not precise: some ptes in the shadow page
are not map any page from the memory slot.

This patch uses rmap to find the ptes precisely, and remove
the unused slot_bitmap.

   


Patch looks good; a couple of comments:

- We might see a slowdown with !tdp, since we no longer have locality.  
Each page will map to an spte in a different page.  However, it's still 
worth it in my opinion.
- I thought of a different approach to write protection: write protect 
the L4 sptes, on write fault add write permission to the L4 spte and 
write protect the L3 sptes that it points to, etc.  This method can use 
the slot bitmap to reduce the number of write faults.  However we can 
reintroduce the slot bitmap if/when we use the method, this shouldn't 
block the patch.




+static void rmapp_remove_write_access(struct kvm *kvm, unsigned long 
*rmapp)

+{
+u64 *spte = rmap_next(kvm, rmapp, NULL);
+
+while (spte) {
+/* avoid RMW */
+if (is_writable_pte(*spte))
+*spte = ~PT_WRITABLE_MASK;


Must use an atomic operation here to avoid losing dirty or accessed bit.


+spte = rmap_next(kvm, rmapp, spte);
+}
+}



--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Avi Kivity

On 06/02/2010 01:04 PM, Joerg Roedel wrote:

On Wed, Jun 02, 2010 at 12:49:28PM +0300, Avi Kivity wrote:
   

On 06/02/2010 12:45 PM, Joerg Roedel wrote:
 

IOMMU mapped memory can not be swapped out because we can't do demand
paging on io-page-faults with current devices. We have to pin _all_
userspace memory that is mapped into an IOMMU domain.
   

vhost doesn't pin memory.

What I proposed is to describe the memory map using an object (fd), and
pass it around to clients that use it: kvm, vhost, vfio.  That way you
maintain the memory map in a central location and broadcast changes to
clients.  Only a vfio client would result in memory being pinned.
 

Ah ok, so its only about the database which keeps the mapping
information.
   


Yes.

   

It can still work, but the interface needs to be extended to include
dirty bitmap logging.
 

Thats hard to do. I am not sure about VT-d but the AMD IOMMU has no
dirty-bits in the page-table. And without demand-paging we can't really
tell what pages a device has written to. The only choice is to mark all
IOMMU-mapped pages dirty as long as they are mapped.

   


The interface would only work for clients which support it: kvm, vhost, 
and iommu/devices with restartable dma.


Note dirty logging is not very interesting for vfio anyway, since you 
can't live migrate with assigned devices.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
 On Wed, Jun 02, 2010 at 01:38:28PM +0300, Michael S. Tsirkin wrote:
  On Wed, Jun 02, 2010 at 12:35:16PM +0200, Joerg Roedel wrote:
 
   With the userspace interface a process can create io-page-faults
   anyway if it wants. We can't protect us from this.
  
  We could fail all operations until an iommu is bound.  This will help
  catch bugs with access before setup. We can not do this if a domain is
  bound by default.
 
 Even if it is bound to a domain the userspace driver could program the
 device to do dma to unmapped regions causing io-page-faults. The kernel
 can't do anything about it.

It can always corrupt its own memory directly as well :)
But that is not a reason not to detect errors if we can,
and not to make APIs hard to misuse.

   The second IOMMU_MAP ioctl is just to show that existing mappings would
   be destroyed if the device is assigned to another address space. Not
   strictly necessary. So we have two ioctls but save one call to create
   the iommu-domain.
  
  With 10 devices you have 10 extra ioctls.
 
 And this works implicitly with your proposal?

Yes.  so you do:
iommu = open
ioctl(dev1, BIND, iommu)
ioctl(dev2, BIND, iommu)
ioctl(dev3, BIND, iommu)
ioctl(dev4, BIND, iommu)

No need to add a SHARE ioctl.


 Remember that we still
 need to be able to provide seperate mappings for each device to support
 IOMMU emulation for the guest.

Generally not true. E.g. guest can enable iommu passthrough
or have domain per a group of devices.

 I think my proposal does not have any
 extra costs.

with my proposal we have 1 ioctl per device + 1 per domain.
with yours we have 2 ioctls per device is iommu is shared
and 1 if it is not shared.

as current apps share iommu it seems to make sense
to optimize for that.

   Because we express here that dev2 shares the iommu mappings of dev1.
   Thats easy to remember.
  
  they both share the mappings. which one gets the iommu
  destroyed (breaking the device if it is now doing DMA)?
 
 As I wrote the domain has a reference count and is destroyed only when
 it goes down to zero. This does not happen as long as a device is bound
 to it.
 
   Joerg

We were talking about UNSHARE ioctl:
ioctl(dev1, UNSHARE, dev2)
Does it change the domain for dev1 or dev2?
If you make a mistake you get a hard to debug bug.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
 On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:

  Even if it is bound to a domain the userspace driver could program the
  device to do dma to unmapped regions causing io-page-faults. The kernel
  can't do anything about it.
 
 It can always corrupt its own memory directly as well :)
 But that is not a reason not to detect errors if we can,
 and not to make APIs hard to misuse.

Changing the domain of a device while dma can happen is the same type of
bug as unmapping potential dma target addresses. We can't catch this
kind of misuse.

   With 10 devices you have 10 extra ioctls.
  
  And this works implicitly with your proposal?
 
 Yes.  so you do:
 iommu = open
 ioctl(dev1, BIND, iommu)
 ioctl(dev2, BIND, iommu)
 ioctl(dev3, BIND, iommu)
 ioctl(dev4, BIND, iommu)
 
 No need to add a SHARE ioctl.

In my proposal this looks like:


dev1 = open();
ioctl(dev2, SHARE, dev1);
ioctl(dev3, SHARE, dev1);
ioctl(dev4, SHARE, dev1);

So we actually save an ioctl.

  Remember that we still need to be able to provide seperate mappings
  for each device to support IOMMU emulation for the guest.
 
 Generally not true. E.g. guest can enable iommu passthrough
 or have domain per a group of devices.

What I meant was that there may me multiple io-addresses spaces
necessary for one process. I didn't want to say that every device
_needs_ to have its own address space.

  As I wrote the domain has a reference count and is destroyed only when
  it goes down to zero. This does not happen as long as a device is bound
  to it.
  
  Joerg
 
 We were talking about UNSHARE ioctl:
 ioctl(dev1, UNSHARE, dev2)
 Does it change the domain for dev1 or dev2?
 If you make a mistake you get a hard to debug bug.

As I already wrote we would have an UNBIND ioctl which just removes a
device from its current domain. UNBIND is better than UNSHARE for
exactly the reason you pointed out above. I thought I stated that
already.

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Virtual disks with SCSI or virtio hang after some time

2010-06-02 Thread Guido Winkelmann
Am Dienstag, 1. Juni 2010 schrieben Sie:
 On 06/01/2010 06:59 PM, Guido Winkelmann wrote:
  The host OS is Fedora Core 12, with qemu-kvm 0.11.0
 
 Please try with at least qemu-kvm-0.11.1, preferably qemu-kvm-0.12.4.
 Also use Linux 2.6.32.latest in the guest.

Okay, I've upgraded the userspace tools to qemu-kvm-0.12.4, and so far, the 
problem is not resurfacing. Before, it could usually be triggered by doing 
hdparm -t /dev/vda (or sda, respectively) once or twice.

Unfortunately, now I cannot start new VMs with libvirt anymore, and I have no 
idea why... :(

Guido
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Avi Kivity

On 06/02/2010 03:19 PM, Joerg Roedel wrote:



Yes.  so you do:
iommu = open
ioctl(dev1, BIND, iommu)
ioctl(dev2, BIND, iommu)
ioctl(dev3, BIND, iommu)
ioctl(dev4, BIND, iommu)

No need to add a SHARE ioctl.
 

In my proposal this looks like:


dev1 = open();
ioctl(dev2, SHARE, dev1);
ioctl(dev3, SHARE, dev1);
ioctl(dev4, SHARE, dev1);

So we actually save an ioctl.
   


The problem with this is that it is assymetric, dev1 is treated 
differently from dev[234].  It's an unintuitive API.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 02:19:28PM +0200, Joerg Roedel wrote:
 On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
  On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
 
   Even if it is bound to a domain the userspace driver could program the
   device to do dma to unmapped regions causing io-page-faults. The kernel
   can't do anything about it.
  
  It can always corrupt its own memory directly as well :)
  But that is not a reason not to detect errors if we can,
  and not to make APIs hard to misuse.
 
 Changing the domain of a device while dma can happen is the same type of
 bug as unmapping potential dma target addresses. We can't catch this
 kind of misuse.

you normally need device mapped to start DMA.
SHARE makes this bug more likely as you allow
switching domains: mmap could be done before switching.

With 10 devices you have 10 extra ioctls.
   
   And this works implicitly with your proposal?
  
  Yes.  so you do:
  iommu = open
  ioctl(dev1, BIND, iommu)
  ioctl(dev2, BIND, iommu)
  ioctl(dev3, BIND, iommu)
  ioctl(dev4, BIND, iommu)
  
  No need to add a SHARE ioctl.
 
 In my proposal this looks like:
 
 
 dev1 = open();
 ioctl(dev2, SHARE, dev1);
 ioctl(dev3, SHARE, dev1);
 ioctl(dev4, SHARE, dev1);
 
 So we actually save an ioctl.

I thought we had a BIND ioctl?

   Remember that we still need to be able to provide seperate mappings
   for each device to support IOMMU emulation for the guest.
  
  Generally not true. E.g. guest can enable iommu passthrough
  or have domain per a group of devices.
 
 What I meant was that there may me multiple io-addresses spaces
 necessary for one process. I didn't want to say that every device
 _needs_ to have its own address space.
 
   As I wrote the domain has a reference count and is destroyed only when
   it goes down to zero. This does not happen as long as a device is bound
   to it.
   
 Joerg
  
  We were talking about UNSHARE ioctl:
  ioctl(dev1, UNSHARE, dev2)
  Does it change the domain for dev1 or dev2?
  If you make a mistake you get a hard to debug bug.
 
 As I already wrote we would have an UNBIND ioctl which just removes a
 device from its current domain. UNBIND is better than UNSHARE for
 exactly the reason you pointed out above. I thought I stated that
 already.
 
   Joerg

You undo SHARE with UNBIND?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unhandled wrmsr

2010-06-02 Thread Andre Przywara

Dave Young wrote:

Hi,

With today's git version (qemu-kvm), I got following message in kernel dmesg

[168344.215605] kvm: 27289: cpu0 unhandled wrmsr: 0x198 data 0
Are you sure about that? 0x198 is an Intel architectural perfmon MSR and 
it is read-only. The Linux kernel source I grep'ed obeys this and does 
only rdmsr.

You can work around this by changing the error to a warning with:
# modprobe kvm ignore_msrs=1

I'd like to see more details about the guest Linux kernel, at least the 
version you used to see why there is a wrmsr on this address. Best is 
you provide the kernel (just the vmlinuz file) somewhere so that we can 
reproduce this. Also the qemu-kvm command line would be interesting.


Regards,
Andre.



bash-3.1$ /home/dave/tmp/qemu-system-x86_64 --version
QEMU emulator version 0.12.50 (qemu-kvm-devel), Copyright (c)
2003-2008 Fabrice Bellard


bash-3.1$ uname -a
Linux darkstar 2.6.35-rc1 #35 SMP Mon May 31 16:50:15 CST 2010 x86_64
Intel(R) Core(TM)2 Duo CPU E7500  @ 2.93GHz GenuineIntel GNU/Linux

host distribution is slackware64-13.0

0.12.3 works well

BTW the guest is a tiny core linux image



--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 03:25:11PM +0300, Avi Kivity wrote:
 On 06/02/2010 03:19 PM, Joerg Roedel wrote:

 Yes.  so you do:
 iommu = open
 ioctl(dev1, BIND, iommu)
 ioctl(dev2, BIND, iommu)
 ioctl(dev3, BIND, iommu)
 ioctl(dev4, BIND, iommu)

 No need to add a SHARE ioctl.
  
 In my proposal this looks like:


 dev1 = open();
 ioctl(dev2, SHARE, dev1);
 ioctl(dev3, SHARE, dev1);
 ioctl(dev4, SHARE, dev1);

 So we actually save an ioctl.


 The problem with this is that it is assymetric, dev1 is treated  
 differently from dev[234].  It's an unintuitive API.

Its by far more unintuitive that a process needs to explicitly bind a
device to an iommu domain before it can do anything with it. If its
required anyway the binding can happen implicitly. We could allow to do
a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.

Note that this way of handling userspace iommu mappings is also a lot
simpler for most use-cases outside of KVM. If a developer wants to write
a userspace driver all it needs to do is:

dev = open();
ioctl(dev, MAP, ...);
/* use device with mappings */
close(dev);

Which is much easier than the need to create a domain explicitly.

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 03:34:17PM +0300, Michael S. Tsirkin wrote:
 On Wed, Jun 02, 2010 at 02:19:28PM +0200, Joerg Roedel wrote:

 you normally need device mapped to start DMA.
 SHARE makes this bug more likely as you allow
 switching domains: mmap could be done before switching.

We need to support domain switching anyway for iommu emulation in a
guest. So if you consider this to be a problem (I don't) it will not go
away with your proposal.

  dev1 = open();
  ioctl(dev2, SHARE, dev1);
  ioctl(dev3, SHARE, dev1);
  ioctl(dev4, SHARE, dev1);
  
  So we actually save an ioctl.
 
 I thought we had a BIND ioctl?

I can't remember a BIND ioctl in my proposal. I remember an UNBIND, but
thats bad naming as you pointed out below. See my statement on this
below too.

 You undo SHARE with UNBIND?

Thats bad naming, agreed. Lets keep UNSHARE. Point is, we only need one
parameter to do this which removes any ambiguity:

ioctl(dev1, UNSHARE);

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Avi Kivity

On 06/02/2010 03:50 PM, Joerg Roedel wrote:



The problem with this is that it is assymetric, dev1 is treated
differently from dev[234].  It's an unintuitive API.
 

Its by far more unintuitive that a process needs to explicitly bind a
device to an iommu domain before it can do anything with it.


I don't really care about the iommu domain.  It's a side effect.  The 
kernel takes care of it.  I'm only worried about the API.


We have a memory map that is (often) the same for a set of devices.  If 
you were coding a non-kernel interface, how would you code it?


  struct memory_map;
  void memory_map_init(struct memory_map *mm, ...);
  struct device;
  void device_set_memory_map(struct device *device, struct memory_map *mm);

or

  struct device;
  void device_init_memory_map(struct device *dev, ...);
  void device_clone_memory_map(struct device *dev, struct device *other);

I wouldn't even think of the second one personally.


If its
required anyway the binding can happen implicitly. We could allow to do
a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.
   


It's still special.  You define the memory map only for the first 
device.  You have to make sure dev1 doesn't go away while sharing it.



Note that this way of handling userspace iommu mappings is also a lot
simpler for most use-cases outside of KVM. If a developer wants to write
a userspace driver all it needs to do is:

dev = open();
ioctl(dev, MAP, ...);
/* use device with mappings */
close(dev);

Which is much easier than the need to create a domain explicitly.
   


mm = open()
ioctl(mm, MAP, ...)
dev = open();
ioctl(dev, BIND, mm);
...
close(mm);
close(dev);

so yes, more work, but once you have multiple devices which come and go 
dynamically things become simpler.  The map object has global lifetime 
(you can even construct it if you don't assign any devices), the devices 
attach to it, memory hotplug updates the memory map but doesn't touch 
devices.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[ kvm-Bugs-2963581 ] qemu-kvm upstream crashes when using -smp 1

2010-06-02 Thread SourceForge.net
Bugs item #2963581, was opened at 2010-03-04 18:10
Message generated for change (Comment added) made by 
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=2963581group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: qemu
Group: None
Status: Closed
Resolution: None
Priority: 5
Private: No
Submitted By: Lucas Meneghel Rodrigues ()
Assigned to: Nobody/Anonymous (nobody)
Summary: qemu-kvm upstream crashes when using -smp 1

Initial Comment:
qemu-kvm.git master is crashing when using -smp 1

Relevant versions:
Commit hash for git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git is 
7811d4e8ec057d25db68f900be1f09a142faca49 (tag kvm-88-3686-g7811d4e)
Kernel: 2.6.31.12-174.2.22.fc12.x86_64

Steps to reproduce
1 - Clone git repo git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git
2 - Build qemu-kvm from this repo
3 - Try to start it with -smp 1, reference command line:
03/04 12:56:12 DEBUG|kvm_vm:0461| Running qemu command:
/usr/local/autotest/tests/kvm/qemu -name 'vm1' -monitor 
unix:/tmp/monitor-20100304-125508-G6lf,server,nowait -drive 
file=/tmp/kvm_autotest_root/images/rhel5-64.qcow2,if=ide -net 
nic,vlan=0,model=rtl8139,macaddr=52:54:00:12:36:60 -net user,vlan=0 -m 1024 
-smp 1 -drive 
file=/tmp/kvm_autotest_root/isos/linux/RHEL-5.4-x86_64-DVD.iso,index=2,media=cdrom
 -fda /usr/local/autotest/tests/kvm/images/floppy.img -tftp 
/usr/local/autotest/tests/kvm/images/tftpboot  -boot d -bootp /pxelinux.0 -boot 
n -mem-path /mnt/kvm_hugepage -redir tcp:5000::22 -vnc :0
03/04 12:56:13 DEBUG|kvm_subpro:0686| (qemu) kvm_create_vcpu: Bad file 
descriptor
03/04 12:56:13 DEBUG|kvm_subpro:0686| (qemu) /bin/sh: line 1: 17273 
Segmentation fault  (core dumped) /usr/local/autotest/tests/kvm/qemu -name 
'vm1' -monitor unix:/tmp/monitor-20100304-125508-G6lf,server,nowait -drive 
file=/tmp/kvm_autotest_root/images/rhel5-64.qcow2,if=ide -net 
nic,vlan=0,model=rtl8139,macaddr=52:54:00:12:36:60 -net user,vlan=0 -m 1024 
-smp 1 -drive 
file=/tmp/kvm_autotest_root/isos/linux/RHEL-5.4-x86_64-DVD.iso,index=2,media=cdrom
 -fda /usr/local/autotest/tests/kvm/images/floppy.img -tftp 
/usr/local/autotest/tests/kvm/images/tftpboot -boot d -bootp /pxelinux.0 -boot 
n -mem-path /mnt/kvm_hugepage -redir tcp:5000::22 -vnc :0
03/04 12:56:13 DEBUG|kvm_subpro:0686| (qemu) (Process terminated with status 
139)

So we have a segmentation fault.

--

Comment By: Lucas Meneghel Rodrigues ()
Date: 2010-06-02 13:08

Message:
Bug resolved a few days after it was reported. Closing it...

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=2963581group_id=180599
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[CFP] Bug Day 2010

2010-06-02 Thread Anthony Liguori

Hi everyone,

Today is our first QEMU Bug Day.  The goal of Bug Day is to spread some 
bug love and make a dent in the various bug trackers.  We've setup a 
special channel, #qemu-bugday on OFTC that folks can use to coordinate 
bug hunting effort.


Here's some ways you can participate in the Bug Day:

- Fixing bugs
- Marking bugs that require more information as Incomplete
- Trying to reproduce existing bugs and marking them as Confirmed
- Reporting bugs that haven't yet been reported
- Migrating bugs from the SourceForge tracker to Launchpad

More information can be found at http://wiki.qemu.org/BugDay/June2010

Happy bug hunting!

Regards,

Anthony Liguori
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 02:50:50PM +0200, Joerg Roedel wrote:
 On Wed, Jun 02, 2010 at 03:25:11PM +0300, Avi Kivity wrote:
  On 06/02/2010 03:19 PM, Joerg Roedel wrote:
 
  Yes.  so you do:
  iommu = open
  ioctl(dev1, BIND, iommu)
  ioctl(dev2, BIND, iommu)
  ioctl(dev3, BIND, iommu)
  ioctl(dev4, BIND, iommu)
 
  No need to add a SHARE ioctl.
   
  In my proposal this looks like:
 
 
  dev1 = open();
  ioctl(dev2, SHARE, dev1);
  ioctl(dev3, SHARE, dev1);
  ioctl(dev4, SHARE, dev1);
 
  So we actually save an ioctl.
 
 
  The problem with this is that it is assymetric, dev1 is treated  
  differently from dev[234].  It's an unintuitive API.
 
 Its by far more unintuitive that a process needs to explicitly bind a
 device to an iommu domain before it can do anything with it.

The reason it is more intuitive is because it is harder to get it
wrong. If you swap iommu and device in the call, you get BADF
so you know you made a mistake. We can even make it work
both ways if we wanted to. With ioctl(dev1, BIND, dev2)
it breaks silently.


 If its
 required anyway the binding can happen implicitly. We could allow to do
 a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.

And then when we assign meaning to it we find that half the apps
are broken because they did not call this ioctl.

 Note that this way of handling userspace iommu mappings is also a lot
 simpler for most use-cases outside of KVM. If a developer wants to write
 a userspace driver all it needs to do is:
 
 dev = open();
 ioctl(dev, MAP, ...);
 /* use device with mappings */
 close(dev);
 
 Which is much easier than the need to create a domain explicitly.
 
   Joerg

This simple scenario ignores all the real-life corner cases.
For example, with an explicit iommu open and bind application
can naturally detect that:
- we have run out of iommu domains
- iommu is unsupported
- iommu is in use by another, incompatible device
- device is in bad state
because each is a separate operation, so it is easy to produce meaningful
errors.

Another interesting thing that a separate iommu device supports is when
application A controls the iommu and application B
controls the device. This might be good to e.g. improve security
(B is run by root, A is unpriveledged and passes commands to/from B
 over a pipe).

This is not possible when same fd is used for iommu and device.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unhandled wrmsr

2010-06-02 Thread Peter Lieven

Andre Przywara wrote:

Dave Young wrote:

Hi,

With today's git version (qemu-kvm), I got following message in 
kernel dmesg


[168344.215605] kvm: 27289: cpu0 unhandled wrmsr: 0x198 data 0
Are you sure about that? 0x198 is an Intel architectural perfmon MSR 
and it is read-only. The Linux kernel source I grep'ed obeys this and 
does only rdmsr.

You can work around this by changing the error to a warning with:
# modprobe kvm ignore_msrs=1

I'd like to see more details about the guest Linux kernel, at least 
the version you used to see why there is a wrmsr on this address. Best 
is you provide the kernel (just the vmlinuz file) somewhere so that we 
can reproduce this. Also the qemu-kvm command line would be interesting.


I see the same with Ubuntu 10.04 64-bit (2.6.32-21-server)

cmdline:
/usr/bin/qemu-kvm-devel -net 
tap,vlan=141,script=no,downscript=no,ifname=tap0 -net 
nic,vlan=141,model=e1000,macaddr=52:54:00:ff:00:73   -drive 
format=host_device,file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-c0c61b105-db8000e7e954bf65-migration-test-2,if=ide,boot=on,cache=none,aio=native  
-m 1024 -cpu qemu64,model_id='Intel(R) Xeon(R) CPU   E5520  @ 
2.27GHz',-kvmclock  -monitor tcp:0:4001,server,nowait -vnc :1 -name 
'migration-test-10-04'  -boot order=dc,menu=on  -k de  -pidfile 
/var/run/qemu/vm-156.pid  -mem-path /hugepages -mem-prealloc  -rtc 
base=utc,clock=host -usb -usbdevice tablet




Regards,
Andre.



bash-3.1$ /home/dave/tmp/qemu-system-x86_64 --version
QEMU emulator version 0.12.50 (qemu-kvm-devel), Copyright (c)
2003-2008 Fabrice Bellard


bash-3.1$ uname -a
Linux darkstar 2.6.35-rc1 #35 SMP Mon May 31 16:50:15 CST 2010 x86_64
Intel(R) Core(TM)2 Duo CPU E7500  @ 2.93GHz GenuineIntel GNU/Linux

host distribution is slackware64-13.0

0.12.3 works well

BTW the guest is a tiny core linux image






--
Mit freundlichen Grüßen/Kind Regards

Peter Lieven

..

  KAMP Netzwerkdienste GmbH
  Vestische Str. 89-91 | 46117 Oberhausen
  Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
  mailto:p...@kamp.de | http://www.kamp.de

  Geschäftsführer: Heiner Lante | Michael Lante
  Amtsgericht Duisburg | HRB Nr. 12154
  USt-Id-Nr.: DE 120607556

. 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 04:06:21PM +0300, Avi Kivity wrote:
 On 06/02/2010 03:50 PM, Joerg Roedel wrote:

 Its by far more unintuitive that a process needs to explicitly bind a
 device to an iommu domain before it can do anything with it.

 I don't really care about the iommu domain.  It's a side effect.  The  
 kernel takes care of it.  I'm only worried about the API.

The proposed memory-map object is nothing else than a userspace
abstraction of an iommu-domain.

 We have a memory map that is (often) the same for a set of devices.  If  
 you were coding a non-kernel interface, how would you code it?

   struct memory_map;
   void memory_map_init(struct memory_map *mm, ...);
   struct device;
   void device_set_memory_map(struct device *device, struct memory_map *mm);

 or

   struct device;
   void device_init_memory_map(struct device *dev, ...);
   void device_clone_memory_map(struct device *dev, struct device *other);

 I wouldn't even think of the second one personally.

Right, a kernel-interface would be designed the first way. The IOMMU-API
is actually designed in this manner. But I still think we should keep it
simpler for userspace.

 If its required anyway the binding can happen implicitly. We could
 allow to do a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.

 It's still special.  You define the memory map only for the first  
 device.  You have to make sure dev1 doesn't go away while sharing it.

Must be a misunderstanding. In my proposal the domain is not owned by
one device. It is owned by all devices that share it and will only
vanish if all devices that use it are unbound (which happens when the file
descriptor is closed, for example).

 so yes, more work, but once you have multiple devices which come and go  
 dynamically things become simpler.  The map object has global lifetime  
 (you can even construct it if you don't assign any devices), the devices  
 attach to it, memory hotplug updates the memory map but doesn't touch  
 devices.

I still think a userspace interface should be as simple as possible. But
since both ways will work I am not really opposed to Michael's proposal.
I just think its overkill for the common non-kvm usecase (a userspace
device driver).

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support

2010-06-02 Thread Kevin Wolf
Am 28.05.2010 04:44, schrieb MORITA Kazutaka:
 Sheepdog is a distributed storage system for QEMU. It provides highly
 available block level storage volumes to VMs like Amazon EBS.  This
 patch adds a qemu block driver for Sheepdog.
 
 Sheepdog features are:
 - No node in the cluster is special (no metadata node, no control
   node, etc)
 - Linear scalability in performance and capacity
 - No single point of failure
 - Autonomous management (zero configuration)
 - Useful volume management support such as snapshot and cloning
 - Thin provisioning
 - Autonomous load balancing
 
 The more details are available at the project site:
 http://www.osrg.net/sheepdog/
 
 Signed-off-by: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp
 ---
  Makefile.objs|2 +-
  block/sheepdog.c | 1835 
 ++
  2 files changed, 1836 insertions(+), 1 deletions(-)
  create mode 100644 block/sheepdog.c

One general thing: The code uses some mix of spaces and tabs for
indentation, with the greatest part using tabs. According to
CODING_STYLE it should consistently use four spaces instead.

 diff --git a/Makefile.objs b/Makefile.objs
 index 1a942e5..527a754 100644
 --- a/Makefile.objs
 +++ b/Makefile.objs
 @@ -14,7 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
  
  block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o 
 vpc.o vvfat.o
  block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
 -block-nested-y += parallels.o nbd.o blkdebug.o
 +block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o
  block-nested-$(CONFIG_WIN32) += raw-win32.o
  block-nested-$(CONFIG_POSIX) += raw-posix.o
  block-nested-$(CONFIG_CURL) += curl.o
 diff --git a/block/sheepdog.c b/block/sheepdog.c
 new file mode 100644
 index 000..68545e8
 --- /dev/null
 +++ b/block/sheepdog.c
 @@ -0,0 +1,1835 @@
 +/*
 + * Copyright (C) 2009-2010 Nippon Telegraph and Telephone Corporation.
 + *
 + * This program is free software; you can redistribute it and/or
 + * modify it under the terms of the GNU General Public License version
 + * 2 as published by the Free Software Foundation.
 + *
 + * You should have received a copy of the GNU General Public License
 + * along with this program. If not, see http://www.gnu.org/licenses/.
 + */
 +#include netdb.h
 +#include netinet/tcp.h
 +
 +#include qemu-common.h
 +#include qemu-error.h
 +#include block_int.h
 +
 +#define SD_PROTO_VER 0x01
 +
 +#define SD_DEFAULT_ADDR localhost:7000
 +
 +#define SD_OP_CREATE_AND_WRITE_OBJ  0x01
 +#define SD_OP_READ_OBJ   0x02
 +#define SD_OP_WRITE_OBJ  0x03
 +
 +#define SD_OP_NEW_VDI0x11
 +#define SD_OP_LOCK_VDI   0x12
 +#define SD_OP_RELEASE_VDI0x13
 +#define SD_OP_GET_VDI_INFO   0x14
 +#define SD_OP_READ_VDIS  0x15
 +
 +#define SD_FLAG_CMD_WRITE0x01
 +#define SD_FLAG_CMD_COW  0x02
 +
 +#define SD_RES_SUCCESS   0x00 /* Success */
 +#define SD_RES_UNKNOWN   0x01 /* Unknown error */
 +#define SD_RES_NO_OBJ0x02 /* No object found */
 +#define SD_RES_EIO   0x03 /* I/O error */
 +#define SD_RES_VDI_EXIST 0x04 /* Vdi exists already */
 +#define SD_RES_INVALID_PARMS 0x05 /* Invalid parameters */
 +#define SD_RES_SYSTEM_ERROR  0x06 /* System error */
 +#define SD_RES_VDI_LOCKED0x07 /* Vdi is locked */
 +#define SD_RES_NO_VDI0x08 /* No vdi found */
 +#define SD_RES_NO_BASE_VDI   0x09 /* No base vdi found */
 +#define SD_RES_VDI_READ  0x0A /* Cannot read requested vdi */
 +#define SD_RES_VDI_WRITE 0x0B /* Cannot write requested vdi */
 +#define SD_RES_BASE_VDI_READ 0x0C /* Cannot read base vdi */
 +#define SD_RES_BASE_VDI_WRITE   0x0D /* Cannot write base vdi */
 +#define SD_RES_NO_TAG0x0E /* Requested tag is not found */
 +#define SD_RES_STARTUP   0x0F /* Sheepdog is on starting up */
 +#define SD_RES_VDI_NOT_LOCKED   0x10 /* Vdi is not locked */
 +#define SD_RES_SHUTDOWN  0x11 /* Sheepdog is shutting down */
 +#define SD_RES_NO_MEM0x12 /* Cannot allocate memory */
 +#define SD_RES_FULL_VDI  0x13 /* we already have the maximum vdis */
 +#define SD_RES_VER_MISMATCH  0x14 /* Protocol version mismatch */
 +#define SD_RES_NO_SPACE  0x15 /* Server has no room for new objects */
 +#define SD_RES_WAIT_FOR_FORMAT  0x16 /* Sheepdog is waiting for a format 
 operation */
 +#define SD_RES_WAIT_FOR_JOIN0x17 /* Sheepdog is waiting for other nodes 
 joining */
 +#define SD_RES_JOIN_FAILED   0x18 /* Target node had failed to join sheepdog 
 */
 +
 +/*
 + * Object ID rules
 + *
 + *  0 - 19 (20 bits): data object space
 + * 20 - 31 (12 bits): reserved data object space
 + * 32 - 55 (24 bits): vdi object space
 + * 56 - 59 ( 4 bits): reserved vdi object space
 + * 60 - 63 ( 4 bits): object type indentifier space
 + */
 +
 +#define VDI_SPACE_SHIFT   32
 +#define VDI_BIT (UINT64_C(1)  63)
 +#define VMSTATE_BIT (UINT64_C(1)  62)
 +#define MAX_DATA_OBJS (1ULL  20)
 +#define MAX_CHILDREN 1024
 +#define 

KVM: read apic-irr with ioapic lock held

2010-06-02 Thread Marcelo Tosatti

Read ioapic-irr inside ioapic-lock protected section.

KVM-Stable-Tag
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 3bc4fdb..1149c60 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -193,12 +193,13 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int 
irq)
 
 int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
 {
-   u32 old_irr = ioapic-irr;
+   u32 old_irr;
u32 mask = 1  irq;
union kvm_ioapic_redirect_entry entry;
int ret = 1;
 
spin_lock(ioapic-lock);
+   old_irr = ioapic-irr;
if (irq = 0  irq  IOAPIC_NUM_PINS) {
entry = ioapic-redirtbl[irq];
level ^= entry.fields.polarity;
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


porting fixes regarding kvm-clock and lost irqs to stable qemu-kvm 0.12.4

2010-06-02 Thread Peter Lieven

Hi,

I would like to get latest stable qemu-kvm (0.12.4) to a usable state
regarding live-migration.

Problems are fixed in git, but there is so much new stuff that has not
extensively tested and therefore I would like to stay at 0.12.4 at the 
moment.


Therefore I would appreciate your help regarding 2 bugs:

a) -cpu xxx,-kvmclock doesn't work in 0.12.4. It started working first
after applying a patchset from Avi that is not even in git yet. What
needs to be done to get it working in 0.12.4 ?

b) There is a bug in 0.12.4 which leads to lost irqs. It was reported
appearing in virtio_blk (#584131) and e1000 (#585113). It is finally
fixed in GIT. Can someone give me a hint with commit fixed this?

Thanks,
Peter

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] make-release: make mtime, owner, group consistent

2010-06-02 Thread Michael S. Tsirkin
Files from git have modification time set to one
of commit, and owner/group to root.
Making it so for generated files as well makes
it easier to generate an identical tarball from git.

Setting owner/group to root is especially important because
otherwise you must have a user/group with same name
to generate an identical tarball.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 kvm/scripts/make-release |9 +++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/kvm/scripts/make-release b/kvm/scripts/make-release
index fdc402b..e8051f4 100755
--- a/kvm/scripts/make-release
+++ b/kvm/scripts/make-release
@@ -1,7 +1,7 @@
 #!/bin/bash -e
 
 usage() {
-echo usage: $0 [--upload] [--formal] commit [name] [tarball]
+echo usage: $0 [--upload] [--formal] commit [name] [tarball] [user]
 exit 1
 }
 
@@ -51,17 +51,22 @@ cd $(dirname $0)/../..
 mkdir -p $(dirname $tarball)
 git archive --prefix=$name/ --format=tar $commit  $tarball
 
+mtime=`git show --format=%ct $commit --`
+tarargs=--owner=root --group=root --mti...@$mtime
+
 mkdir -p $tmpdir
 git cat-file -p ${commit}:roms | awk ' { print $4, $3 } ' \
  $tmpdir/EXTERNAL_DEPENDENCIES
 tar -rf $tarball --transform s,^,$name/, -C $tmpdir \
+$tarargs \
 EXTERNAL_DEPENDENCIES
 rm -rf $tmpdir
 
 if [[ -n $formal ]]; then
 mkdir -p $tmpdir
 echo $name  $tmpdir/KVM_VERSION
-tar -rf $tarball --transform s,^,$name/, -C $tmpdir KVM_VERSION
+tar -rf $tarball --transform s,^,$name/, -C $tmpdir KVM_VERSION \
+$tarargs
 rm -rf $tmpdir
 fi
 
-- 
1.7.1.12.g42b7f
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Fix bugs in msix_set/unset_mask_notifier() routines

2010-06-02 Thread Sridhar Samudrala

I am hitting the following assertions in msix.c when doing a guest reboot or 
live migration
using vhost.
qemu-kvm/hw/msix.c:375: msix_mask_all: Assertion `r = 0' failed.
qemu-kvm/hw/msix.c:640: msix_unset_mask_notifier: Assertion 
`dev-msix_mask_notifier_opaque[vector]' failed.

The following patch fixes the bugs in handling msix_is_masked() condition
in msix_set/unset_mask_notifier() routines.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com


diff --git a/hw/msix.c b/hw/msix.c
index 1398680..a191df1 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -609,7 +609,7 @@ void msix_unuse_all_vectors(PCIDevice *dev)
 
 int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
 {
-int r;
+int r = 0;
 if (vector = dev-msix_entries_nr || !dev-msix_entry_used[vector])
 return 0;
 
@@ -619,13 +619,15 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned 
vector, void *opaque)
 
 /* Unmask the new notifier unless vector is masked. */
 if (msix_is_masked(dev, vector)) {
-return 0;
+goto out;
 }
 r = dev-msix_mask_notifier(dev, vector, opaque,
 msix_is_masked(dev, vector));
 if (r  0) {
 return r;
 }
+
+out:
 dev-msix_mask_notifier_opaque[vector] = opaque;
 return r;
 }
@@ -640,8 +642,8 @@ int msix_unset_mask_notifier(PCIDevice *dev, unsigned 
vector)
 assert(dev-msix_mask_notifier_opaque[vector]);
 
 /* Mask the old notifier unless it is already masked. */
-if (msix_is_masked(dev, vector)) {
-return 0;
+if (!msix_is_masked(dev, vector)) {
+goto out;
 }
 r = dev-msix_mask_notifier(dev, vector,
 dev-msix_mask_notifier_opaque[vector],
@@ -649,6 +651,8 @@ int msix_unset_mask_notifier(PCIDevice *dev, unsigned 
vector)
 if (r  0) {
 return r;
 }
+
+out:
 dev-msix_mask_notifier_opaque[vector] = NULL;
 return r;
 }



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5] Use more cpuid bits from upstream

2010-06-02 Thread Marcelo Tosatti
On Tue, Jun 01, 2010 at 03:33:42PM +0300, Avi Kivity wrote:
 This patchset converts kvm_arch_vcpu_init()'s cpuid handling bits to use
 upstream code.
 
 Avi Kivity (5):
   Make get_para_features() similar to upstream
   Use get_para_features() from upstream
   Rename kvm_arch_vcpu_init()s cenv argument to env
   Use skeleton of upstream's kvm_arch_init_vcpu()
   Use upstream's kvm_arch_init_vcpu()'s cpuid bits
 
  qemu-kvm-x86.c|  148 
 +++--
  target-i386/kvm.c |   15 -
  2 files changed, 20 insertions(+), 143 deletions(-)

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4] Add Documentation/kvm/msr.txt

2010-06-02 Thread Marcelo Tosatti
On Tue, Jun 01, 2010 at 08:22:48AM -0400, Glauber Costa wrote:
 This patch adds a file that documents the usage of KVM-specific
 MSRs.
 
 [ v2: added comments from Randy ]
 [ v3: added comments from Avi ]
 [ v4: added information about wallclock alignment ]
 
 Signed-off-by: Glauber Costa glom...@redhat.com
 Reviewed-by: Randy Dunlap randy.dun...@oracle.com

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RESEND] VMX: Properly return error to userspace on vmentry failure

2010-06-02 Thread Marcelo Tosatti
On Mon, May 31, 2010 at 10:40:54PM +0300, Mohammed Gamal wrote:
 The vmexit handler returns KVM_EXIT_UNKNOWN since there is no handler
 for vmentry failures. This intercepts vmentry failures and returns
 KVM_FAIL_ENTRY to userspace instead.
 
 Signed-off-by: Mohammed Gamal m.gamal...@gmail.com

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: elide struct thread_struct instances from stack

2010-06-02 Thread Marcelo Tosatti
On Mon, May 31, 2010 at 09:59:13PM +0200, Andreas Schwab wrote:
 Instead of instantiating a whole thread_struct on the stack use only the
 required parts of it.
 
 Signed-off-by: Andreas Schwab sch...@linux-m68k.org
 Tested-by: Alexander Graf ag...@suse.de
 ---
  arch/powerpc/include/asm/kvm_fpu.h   |   27 +
  arch/powerpc/kernel/ppc_ksyms.c  |4 -
  arch/powerpc/kvm/book3s.c|   49 +---
  arch/powerpc/kvm/book3s_paired_singles.c |   94 
 --
  arch/powerpc/kvm/fpu.S   |   18 ++
  5 files changed, 97 insertions(+), 95 deletions(-)

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Chris Wright
* Avi Kivity (a...@redhat.com) wrote:
 The interface would only work for clients which support it: kvm,
 vhost, and iommu/devices with restartable dma.

BTW, there is no such thing as restartable dma.  There is a provision in
new specs (read: no real hardware) that allows a device to request pages
before using them.  So it's akin to demand paging, but the demand is an
explicit request rather than a page fault.

thanks,
-chris
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] acpi_piix4: save gpe and pci hotplug slot status

2010-06-02 Thread Alex Williamson
PCI hotplug currently doesn't work after a migration because
we don't migrate the enable bits of the GPE state.  Pull hotplug
structs into vmstate.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/acpi_piix4.c |   29 -
 1 files changed, 28 insertions(+), 1 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index 0fce958..091cdcd 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -282,9 +282,33 @@ static int vmstate_acpi_post_load(void *opaque, int 
version_id)
 return 0;
 }
 
+static const VMStateDescription vmstate_gpe = {
+.name = gpe,
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT16(sts, struct gpe_regs),
+VMSTATE_UINT16(en, struct gpe_regs),
+VMSTATE_END_OF_LIST()
+}
+};
+
+static const VMStateDescription vmstate_pci_status = {
+.name = pci_status,
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT32(up, struct pci_status),
+VMSTATE_UINT32(down, struct pci_status),
+VMSTATE_END_OF_LIST()
+}
+};
+
 static const VMStateDescription vmstate_acpi = {
 .name = piix4_pm,
-.version_id = 1,
+.version_id = 2,
 .minimum_version_id = 1,
 .minimum_version_id_old = 1,
 .post_load = vmstate_acpi_post_load,
@@ -296,6 +320,9 @@ static const VMStateDescription vmstate_acpi = {
 VMSTATE_STRUCT(apm, PIIX4PMState, 0, vmstate_apm, APMState),
 VMSTATE_TIMER(tmr_timer, PIIX4PMState),
 VMSTATE_INT64(tmr_overflow_time, PIIX4PMState),
+VMSTATE_STRUCT(gpe, PIIX4PMState, 2, vmstate_gpe, struct gpe_regs),
+VMSTATE_STRUCT(pci0_status, PIIX4PMState, 2, vmstate_pci_status,
+   struct pci_status),
 VMSTATE_END_OF_LIST()
 }
 };

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix bugs in msix_set/unset_mask_notifier() routines

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 09:09:37AM -0700, Sridhar Samudrala wrote:
 
 I am hitting the following assertions in msix.c when doing a guest reboot or 
 live migration
 using vhost.
 qemu-kvm/hw/msix.c:375: msix_mask_all: Assertion `r = 0' failed.
 qemu-kvm/hw/msix.c:640: msix_unset_mask_notifier: Assertion 
 `dev-msix_mask_notifier_opaque[vector]' failed.
 
 The following patch fixes the bugs in handling msix_is_masked() condition
 in msix_set/unset_mask_notifier() routines.
 
 Signed-off-by: Sridhar Samudrala s...@us.ibm.com

Acked-by: Michael S. Tsirkin m...@redhat.com

Note: this is qemu-kvm only patch.

 
 diff --git a/hw/msix.c b/hw/msix.c
 index 1398680..a191df1 100644
 --- a/hw/msix.c
 +++ b/hw/msix.c
 @@ -609,7 +609,7 @@ void msix_unuse_all_vectors(PCIDevice *dev)
  
  int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
  {
 -int r;
 +int r = 0;
  if (vector = dev-msix_entries_nr || !dev-msix_entry_used[vector])
  return 0;
  
 @@ -619,13 +619,15 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned 
 vector, void *opaque)
  
  /* Unmask the new notifier unless vector is masked. */
  if (msix_is_masked(dev, vector)) {
 -return 0;
 +goto out;
  }
  r = dev-msix_mask_notifier(dev, vector, opaque,
  msix_is_masked(dev, vector));
  if (r  0) {
  return r;
  }
 +
 +out:
  dev-msix_mask_notifier_opaque[vector] = opaque;
  return r;
  }
 @@ -640,8 +642,8 @@ int msix_unset_mask_notifier(PCIDevice *dev, unsigned 
 vector)
  assert(dev-msix_mask_notifier_opaque[vector]);
  
  /* Mask the old notifier unless it is already masked. */
 -if (msix_is_masked(dev, vector)) {
 -return 0;
 +if (!msix_is_masked(dev, vector)) {
 +goto out;
  }
  r = dev-msix_mask_notifier(dev, vector,
  dev-msix_mask_notifier_opaque[vector],
 @@ -649,6 +651,8 @@ int msix_unset_mask_notifier(PCIDevice *dev, unsigned 
 vector)
  if (r  0) {
  return r;
  }
 +
 +out:
  dev-msix_mask_notifier_opaque[vector] = NULL;
  return r;
  }
 
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] qemu-kvm tests: make hypercall test runable on 32-bit host

2010-06-02 Thread Marcelo Tosatti
On Tue, Jun 01, 2010 at 10:02:34PM +0800, Asias He wrote:
 Signed-off-by: Asias He asias.he...@gmail.com
 ---
  kvm/test/config-x86-common.mak |5 +++--
  kvm/test/config-x86_64.mak |3 +--
  2 files changed, 4 insertions(+), 4 deletions(-)

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pc: push setting default cpu_model down a level

2010-06-02 Thread Marcelo Tosatti
On Tue, Jun 01, 2010 at 05:03:00PM -0600, Alex Williamson wrote:
 Not that CPU hotplug currently works, but if you make the mistake of
 trying it on a VM started without specifying a -cpu value, you hit
 a segfault from trying to strdup(NULL) in cpu_x86_find_by_name().
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: cleanup, fix typing

2010-06-02 Thread Marcelo Tosatti
On Wed, Jun 02, 2010 at 05:01:23PM +0800, Lai Jiangshan wrote:
 The type of '*new.rmap' is not 'struct page *', fix it
 
 Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com

Applied this and uaccess cleanup, thanks.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: VMX: Enforce EPT pagetable level checking

2010-06-02 Thread Marcelo Tosatti
On Wed, Jun 02, 2010 at 02:05:24PM +0800, Sheng Yang wrote:
 We only support 4 levels EPT pagetable now.
 
 Signed-off-by: Sheng Yang sh...@linux.intel.com
 ---
  arch/x86/kvm/vmx.c |8 +++-
  1 files changed, 7 insertions(+), 1 deletions(-)

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Chris Wright
* Joerg Roedel (j...@8bytes.org) wrote:
 On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
  On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
 
   Even if it is bound to a domain the userspace driver could program the
   device to do dma to unmapped regions causing io-page-faults. The kernel
   can't do anything about it.
  
  It can always corrupt its own memory directly as well :)
  But that is not a reason not to detect errors if we can,
  and not to make APIs hard to misuse.
 
 Changing the domain of a device while dma can happen is the same type of
 bug as unmapping potential dma target addresses. We can't catch this
 kind of misuse.
 
With 10 devices you have 10 extra ioctls.
   
   And this works implicitly with your proposal?
  
  Yes.  so you do:
  iommu = open
  ioctl(dev1, BIND, iommu)
  ioctl(dev2, BIND, iommu)
  ioctl(dev3, BIND, iommu)
  ioctl(dev4, BIND, iommu)
  
  No need to add a SHARE ioctl.
 
 In my proposal this looks like:
 
 
 dev1 = open();
 ioctl(dev2, SHARE, dev1);
 ioctl(dev3, SHARE, dev1);
 ioctl(dev4, SHARE, dev1);
 
 So we actually save an ioctl.

This is not any hot path, so saving an ioctl shouldn't be a consideration.
Only important consideration is a good API.  I may have lost context here,
but the SHARE API is limited to the vfio fd.  The BIND API expects a new
iommu object.  Are there other uses for this object?  Tom's current vfio
driver exposes a dma mapping interface, would the iommu object expose
one as well?  Current interface is device specific DMA interface for
host device drivers typically mapping in-flight dma buffers, and IOMMU
specific interface for assigned devices typically mapping entire virtual
address space.

thanks,
-chris
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix bugs in msix_set/unset_mask_notifier() routines

2010-06-02 Thread Michael S. Tsirkin
On Wed, Jun 02, 2010 at 09:09:37AM -0700, Sridhar Samudrala wrote:
 
 I am hitting the following assertions in msix.c when doing a guest reboot or 
 live migration
 using vhost.
 qemu-kvm/hw/msix.c:375: msix_mask_all: Assertion `r = 0' failed.
 qemu-kvm/hw/msix.c:640: msix_unset_mask_notifier: Assertion 
 `dev-msix_mask_notifier_opaque[vector]' failed.
 
 The following patch fixes the bugs in handling msix_is_masked() condition
 in msix_set/unset_mask_notifier() routines.
 
 Signed-off-by: Sridhar Samudrala s...@us.ibm.com
 

I take back the ACK.

 diff --git a/hw/msix.c b/hw/msix.c
 index 1398680..a191df1 100644
 --- a/hw/msix.c
 +++ b/hw/msix.c
 @@ -609,7 +609,7 @@ void msix_unuse_all_vectors(PCIDevice *dev)
  
  int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
  {
 -int r;
 +int r = 0;
  if (vector = dev-msix_entries_nr || !dev-msix_entry_used[vector])
  return 0;
  
 @@ -619,13 +619,15 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned 
 vector, void *opaque)
  
  /* Unmask the new notifier unless vector is masked. */
  if (msix_is_masked(dev, vector)) {
 -return 0;
 +goto out;
  }
  r = dev-msix_mask_notifier(dev, vector, opaque,
  msix_is_masked(dev, vector));
  if (r  0) {
  return r;
  }
 +
 +out:
  dev-msix_mask_notifier_opaque[vector] = opaque;
  return r;
  }
 @@ -640,8 +642,8 @@ int msix_unset_mask_notifier(PCIDevice *dev, unsigned 
 vector)
  assert(dev-msix_mask_notifier_opaque[vector]);
  
  /* Mask the old notifier unless it is already masked. */
 -if (msix_is_masked(dev, vector)) {
 -return 0;
 +if (!msix_is_masked(dev, vector)) {
 +goto out;
  }

Wait a second, the logic is reverted here.

  r = dev-msix_mask_notifier(dev, vector,
  dev-msix_mask_notifier_opaque[vector],
 @@ -649,6 +651,8 @@ int msix_unset_mask_notifier(PCIDevice *dev, unsigned 
 vector)
  if (r  0) {
  return r;
  }
 +
 +out:
  dev-msix_mask_notifier_opaque[vector] = NULL;
  return r;
  }
 
 


I find the goto's confusing. I'll post a patch without.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] msix: fix msix_set/unset_mask_notifier

2010-06-02 Thread Michael S. Tsirkin
Sridhar Samudrala reported hitting the following assertions
in msix.c when doing a guest reboot or live migration using vhost.
qemu-kvm/hw/msix.c:375: msix_mask_all: Assertion `r = 0' failed.
qemu-kvm/hw/msix.c:640: msix_unset_mask_notifier:
Assertion `dev-msix_mask_notifier_opaque[vector]' failed.

The issue is that we didn't clear/set the opaque pointer
when vector is masked. The following patch fixes this.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Sridhar, could you test the following please?

 hw/msix.c |   33 -
 1 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/hw/msix.c b/hw/msix.c
index b400769..24ff6ae 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -610,7 +610,7 @@ void msix_unuse_all_vectors(PCIDevice *dev)
 
 int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
 {
-int r;
+int r = 0;
 if (vector = dev-msix_entries_nr || !dev-msix_entry_used[vector])
 return 0;
 
@@ -619,13 +619,11 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned 
vector, void *opaque)
 assert(!dev-msix_mask_notifier_opaque[vector]);
 
 /* Unmask the new notifier unless vector is masked. */
-if (msix_is_masked(dev, vector)) {
-return 0;
-}
-r = dev-msix_mask_notifier(dev, vector, opaque,
-msix_is_masked(dev, vector));
-if (r  0) {
-return r;
+if (!msix_is_masked(dev, vector)) {
+r = dev-msix_mask_notifier(dev, vector, opaque, false);
+if (r  0) {
+return r;
+}
 }
 dev-msix_mask_notifier_opaque[vector] = opaque;
 return r;
@@ -634,21 +632,21 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned 
vector, void *opaque)
 int msix_unset_mask_notifier(PCIDevice *dev, unsigned vector)
 {
 int r = 0;
+void *opaque;
 if (vector = dev-msix_entries_nr || !dev-msix_entry_used[vector])
 return 0;
 
+opaque = dev-msix_mask_notifier_opaque[vector];
+
 assert(dev-msix_mask_notifier);
-assert(dev-msix_mask_notifier_opaque[vector]);
+assert(opaque);
 
 /* Mask the old notifier unless it is already masked. */
-if (msix_is_masked(dev, vector)) {
-return 0;
-}
-r = dev-msix_mask_notifier(dev, vector,
-dev-msix_mask_notifier_opaque[vector],
-!msix_is_masked(dev, vector));
-if (r  0) {
-return r;
+if (!msix_is_masked(dev, vector)) {
+r = dev-msix_mask_notifier(dev, vector, opaque, true);
+if (r  0) {
+return r;
+}
 }
 dev-msix_mask_notifier_opaque[vector] = NULL;
 return r;
-- 
1.7.1.12.g42b7f
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv2] virtio-net: stop vhost backend on vmstop

2010-06-02 Thread Michael S. Tsirkin
vhost net currently keeps running after vmstop,
which causes trouble as qemy does not check
for dirty pages anymore.
The fix is to simply keep vm and vhost running/stopped
status in sync.

Tested-by: David L Stevens dlstev...@us.ibm.com
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Changes from v1:
  Simplify code as suggested by Amit.

 hw/virtio-net.c |   11 +--
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 28f26e1..67eebcf 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -877,12 +877,11 @@ static void virtio_net_set_status(struct VirtIODevice 
*vdev, uint8_t status)
 static void virtio_net_vmstate_change(void *opaque, int running, int reason)
 {
 VirtIONet *n = opaque;
-if (!running) {
-return;
-}
-/* This is called when vm is started, it will start vhost backend if
- * appropriate e.g. after migration. */
-virtio_net_set_status(n-vdev, n-vdev.status);
+uint8_t status = running ? n-vdev.status : 0;
+/* This is called when vm is started/stopped,
+ * it will start/stop vhost backend if appropriate
+ * e.g. after migration. */
+virtio_net_set_status(n-vdev, status);
 }
 
 VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf)
-- 
1.7.1.12.g42b7f
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Tom Lyon
On Wednesday 02 June 2010 10:46:15 am Chris Wright wrote:
 * Joerg Roedel (j...@8bytes.org) wrote:
  On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
   On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
  
Even if it is bound to a domain the userspace driver could program the
device to do dma to unmapped regions causing io-page-faults. The kernel
can't do anything about it.
   
   It can always corrupt its own memory directly as well :)
   But that is not a reason not to detect errors if we can,
   and not to make APIs hard to misuse.
  
  Changing the domain of a device while dma can happen is the same type of
  bug as unmapping potential dma target addresses. We can't catch this
  kind of misuse.
  
 With 10 devices you have 10 extra ioctls.

And this works implicitly with your proposal?
   
   Yes.  so you do:
   iommu = open
   ioctl(dev1, BIND, iommu)
   ioctl(dev2, BIND, iommu)
   ioctl(dev3, BIND, iommu)
   ioctl(dev4, BIND, iommu)
   
   No need to add a SHARE ioctl.
  
  In my proposal this looks like:
  
  
  dev1 = open();
  ioctl(dev2, SHARE, dev1);
  ioctl(dev3, SHARE, dev1);
  ioctl(dev4, SHARE, dev1);
  
  So we actually save an ioctl.
 
 This is not any hot path, so saving an ioctl shouldn't be a consideration.
 Only important consideration is a good API.  I may have lost context here,
 but the SHARE API is limited to the vfio fd.  The BIND API expects a new
 iommu object.  Are there other uses for this object?  Tom's current vfio
 driver exposes a dma mapping interface, would the iommu object expose
 one as well?  Current interface is device specific DMA interface for
 host device drivers typically mapping in-flight dma buffers, and IOMMU
 specific interface for assigned devices typically mapping entire virtual
 address space.

Actually, it a domain object - which may be usable among iommus (Joerg?).
However, you can't really do the dma mapping with just the domain because
every device supports a different size address space as a master, i.e.,
the dma_mask.

And I don't know how kvm would deal with devices with varying dma mask support,
or why they'd be in the same domain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM test: Fixing size of NTFS partitions on windows unattended installs

2010-06-02 Thread Lucas Meneghel Rodrigues
While running tests on WinVista SP2 64 bits, I noticed that, even though
we reserved 20GB of disk space on images, the unattended files are
allocating 15GB of the effective space, causing SP2 64 bit install to
fail. Fix that, make sure all primary disks do use 20GB.

Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
---
 .../kvm/unattended/win2008-32-autounattend.xml |2 +-
 .../kvm/unattended/win2008-64-autounattend.xml |2 +-
 .../kvm/unattended/win2008-r2-autounattend.xml |2 +-
 .../tests/kvm/unattended/win7-32-autounattend.xml  |2 +-
 .../tests/kvm/unattended/win7-64-autounattend.xml  |2 +-
 .../kvm/unattended/winvista-32-autounattend.xml|2 +-
 .../kvm/unattended/winvista-64-autounattend.xml|2 +-
 7 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/client/tests/kvm/unattended/win2008-32-autounattend.xml 
b/client/tests/kvm/unattended/win2008-32-autounattend.xml
index c6fafac..44a9fc4 100644
--- a/client/tests/kvm/unattended/win2008-32-autounattend.xml
+++ b/client/tests/kvm/unattended/win2008-32-autounattend.xml
@@ -25,7 +25,7 @@
CreatePartitions
CreatePartition 
wcm:action=add
Order1/Order
-   Size15000/Size
+   Size2/Size
TypePrimary/Type
/CreatePartition
/CreatePartitions
diff --git a/client/tests/kvm/unattended/win2008-64-autounattend.xml 
b/client/tests/kvm/unattended/win2008-64-autounattend.xml
index 2520a7a..ea0a524 100644
--- a/client/tests/kvm/unattended/win2008-64-autounattend.xml
+++ b/client/tests/kvm/unattended/win2008-64-autounattend.xml
@@ -12,7 +12,7 @@
CreatePartitions
CreatePartition 
wcm:action=add
Order1/Order
-   Size15000/Size
+   Size2/Size
TypePrimary/Type
/CreatePartition
/CreatePartitions
diff --git a/client/tests/kvm/unattended/win2008-r2-autounattend.xml 
b/client/tests/kvm/unattended/win2008-r2-autounattend.xml
index 2520a7a..ea0a524 100644
--- a/client/tests/kvm/unattended/win2008-r2-autounattend.xml
+++ b/client/tests/kvm/unattended/win2008-r2-autounattend.xml
@@ -12,7 +12,7 @@
CreatePartitions
CreatePartition 
wcm:action=add
Order1/Order
-   Size15000/Size
+   Size2/Size
TypePrimary/Type
/CreatePartition
/CreatePartitions
diff --git a/client/tests/kvm/unattended/win7-32-autounattend.xml 
b/client/tests/kvm/unattended/win7-32-autounattend.xml
index c37afb7..a577a91 100644
--- a/client/tests/kvm/unattended/win7-32-autounattend.xml
+++ b/client/tests/kvm/unattended/win7-32-autounattend.xml
@@ -26,7 +26,7 @@
CreatePartitions
CreatePartition 
wcm:action=add
Order1/Order
-   Size15000/Size
+   Size2/Size
TypePrimary/Type
/CreatePartition
/CreatePartitions
diff --git a/client/tests/kvm/unattended/win7-64-autounattend.xml 
b/client/tests/kvm/unattended/win7-64-autounattend.xml
index ad047d0..fec8017 100644
--- a/client/tests/kvm/unattended/win7-64-autounattend.xml
+++ b/client/tests/kvm/unattended/win7-64-autounattend.xml
@@ -26,7 +26,7 @@
CreatePartitions
CreatePartition 
wcm:action=add
Order1/Order
-   Size15000/Size
+   Size2/Size
TypePrimary/Type
/CreatePartition

Re: [PATCH] msix: fix msix_set/unset_mask_notifier

2010-06-02 Thread Sridhar Samudrala
On Wed, 2010-06-02 at 20:49 +0300, Michael S. Tsirkin wrote:
 Sridhar Samudrala reported hitting the following assertions
 in msix.c when doing a guest reboot or live migration using vhost.
 qemu-kvm/hw/msix.c:375: msix_mask_all: Assertion `r = 0' failed.
 qemu-kvm/hw/msix.c:640: msix_unset_mask_notifier:
 Assertion `dev-msix_mask_notifier_opaque[vector]' failed.
 
 The issue is that we didn't clear/set the opaque pointer
 when vector is masked. The following patch fixes this.
 
 Signed-off-by: Sridhar Samudrala s...@us.ibm.com
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 Sridhar, could you test the following please?

Yes. looks good and works fine for me.
Tested-by: Sridhar Samudrala s...@us.ibm.com

 
  hw/msix.c |   33 -
  1 files changed, 16 insertions(+), 17 deletions(-)
 
 diff --git a/hw/msix.c b/hw/msix.c
 index b400769..24ff6ae 100644
 --- a/hw/msix.c
 +++ b/hw/msix.c
 @@ -610,7 +610,7 @@ void msix_unuse_all_vectors(PCIDevice *dev)
 
  int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
  {
 -int r;
 +int r = 0;
  if (vector = dev-msix_entries_nr || !dev-msix_entry_used[vector])
  return 0;
 
 @@ -619,13 +619,11 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned 
 vector, void *opaque)
  assert(!dev-msix_mask_notifier_opaque[vector]);
 
  /* Unmask the new notifier unless vector is masked. */
 -if (msix_is_masked(dev, vector)) {
 -return 0;
 -}
 -r = dev-msix_mask_notifier(dev, vector, opaque,
 -msix_is_masked(dev, vector));
 -if (r  0) {
 -return r;
 +if (!msix_is_masked(dev, vector)) {
 +r = dev-msix_mask_notifier(dev, vector, opaque, false);
 +if (r  0) {
 +return r;
 +}
  }
  dev-msix_mask_notifier_opaque[vector] = opaque;
  return r;
 @@ -634,21 +632,21 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned 
 vector, void *opaque)
  int msix_unset_mask_notifier(PCIDevice *dev, unsigned vector)
  {
  int r = 0;
 +void *opaque;
  if (vector = dev-msix_entries_nr || !dev-msix_entry_used[vector])
  return 0;
 
 +opaque = dev-msix_mask_notifier_opaque[vector];
 +
  assert(dev-msix_mask_notifier);
 -assert(dev-msix_mask_notifier_opaque[vector]);
 +assert(opaque);
 
  /* Mask the old notifier unless it is already masked. */
 -if (msix_is_masked(dev, vector)) {
 -return 0;
 -}
 -r = dev-msix_mask_notifier(dev, vector,
 -dev-msix_mask_notifier_opaque[vector],
 -!msix_is_masked(dev, vector));
 -if (r  0) {
 -return r;
 +if (!msix_is_masked(dev, vector)) {
 +r = dev-msix_mask_notifier(dev, vector, opaque, true);
 +if (r  0) {
 +return r;
 +}
  }
  dev-msix_mask_notifier_opaque[vector] = NULL;
  return r;

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH UPDATED 1/3] vhost: replace vhost_workqueue with per-vhost kthread

2010-06-02 Thread Tejun Heo
Replace vhost_workqueue with per-vhost kthread.  Other than callback
argument change from struct work_struct * to struct vhost_work *,
there's no visible change to vhost_poll_*() interface.

This conversion is to make each vhost use a dedicated kthread so that
resource control via cgroup can be applied.

Partially based on Sridhar Samudrala's patch.

* Updated to use sub structure vhost_work instead of directly using
  vhost_poll at Michael's suggestion.

* Added flusher wake_up() optimization at Michael's suggestion.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Michael S. Tsirkin m...@redhat.com
Cc: Sridhar Samudrala samudrala.srid...@gmail.com
---
Okay, just tested it.  dev-work_lock had to be updated to use irq
operations but other than that it worked just fine.  Copied a large
file using scp and it seems to perform pretty well although I don't
have any reference of comparison.  So, here's the updated version with
the sign off.

Thanks.

 drivers/vhost/net.c   |   56 ++---
 drivers/vhost/vhost.c |  111 ++
 drivers/vhost/vhost.h |   38 +++--
 3 files changed, 134 insertions(+), 71 deletions(-)

Index: work/drivers/vhost/net.c
===
--- work.orig/drivers/vhost/net.c
+++ work/drivers/vhost/net.c
@@ -294,54 +294,58 @@ static void handle_rx(struct vhost_net *
unuse_mm(net-dev.mm);
 }

-static void handle_tx_kick(struct work_struct *work)
+static void handle_tx_kick(struct vhost_work *work)
 {
-   struct vhost_virtqueue *vq;
-   struct vhost_net *net;
-   vq = container_of(work, struct vhost_virtqueue, poll.work);
-   net = container_of(vq-dev, struct vhost_net, dev);
+   struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+ poll.work);
+   struct vhost_net *net = container_of(vq-dev, struct vhost_net, dev);
+
handle_tx(net);
 }

-static void handle_rx_kick(struct work_struct *work)
+static void handle_rx_kick(struct vhost_work *work)
 {
-   struct vhost_virtqueue *vq;
-   struct vhost_net *net;
-   vq = container_of(work, struct vhost_virtqueue, poll.work);
-   net = container_of(vq-dev, struct vhost_net, dev);
+   struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+ poll.work);
+   struct vhost_net *net = container_of(vq-dev, struct vhost_net, dev);
+
handle_rx(net);
 }

-static void handle_tx_net(struct work_struct *work)
+static void handle_tx_net(struct vhost_work *work)
 {
-   struct vhost_net *net;
-   net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_TX].work);
+   struct vhost_net *net = container_of(work, struct vhost_net,
+poll[VHOST_NET_VQ_TX].work);
handle_tx(net);
 }

-static void handle_rx_net(struct work_struct *work)
+static void handle_rx_net(struct vhost_work *work)
 {
-   struct vhost_net *net;
-   net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_RX].work);
+   struct vhost_net *net = container_of(work, struct vhost_net,
+poll[VHOST_NET_VQ_RX].work);
handle_rx(net);
 }

 static int vhost_net_open(struct inode *inode, struct file *f)
 {
struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
+   struct vhost_dev *dev;
int r;
+
if (!n)
return -ENOMEM;
+
+   dev = n-dev;
n-vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
n-vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-   r = vhost_dev_init(n-dev, n-vqs, VHOST_NET_VQ_MAX);
+   r = vhost_dev_init(dev, n-vqs, VHOST_NET_VQ_MAX);
if (r  0) {
kfree(n);
return r;
}

-   vhost_poll_init(n-poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
-   vhost_poll_init(n-poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
+   vhost_poll_init(n-poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
+   vhost_poll_init(n-poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
n-tx_poll_state = VHOST_NET_POLL_DISABLED;

f-private_data = n;
@@ -644,25 +648,13 @@ static struct miscdevice vhost_net_misc

 static int vhost_net_init(void)
 {
-   int r = vhost_init();
-   if (r)
-   goto err_init;
-   r = misc_register(vhost_net_misc);
-   if (r)
-   goto err_reg;
-   return 0;
-err_reg:
-   vhost_cleanup();
-err_init:
-   return r;
-
+   return misc_register(vhost_net_misc);
 }
 module_init(vhost_net_init);

 static void vhost_net_exit(void)
 {
misc_deregister(vhost_net_misc);
-   vhost_cleanup();
 }
 module_exit(vhost_net_exit);

Index: work/drivers/vhost/vhost.c
===
--- 

Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Joerg Roedel
On Wed, Jun 02, 2010 at 11:09:17AM -0700, Tom Lyon wrote:
 On Wednesday 02 June 2010 10:46:15 am Chris Wright wrote:

  This is not any hot path, so saving an ioctl shouldn't be a consideration.
  Only important consideration is a good API.  I may have lost context here,
  but the SHARE API is limited to the vfio fd.  The BIND API expects a new
  iommu object.  Are there other uses for this object?  Tom's current vfio
  driver exposes a dma mapping interface, would the iommu object expose
  one as well?  Current interface is device specific DMA interface for
  host device drivers typically mapping in-flight dma buffers, and IOMMU
  specific interface for assigned devices typically mapping entire virtual
  address space.
 
 Actually, it a domain object - which may be usable among iommus (Joerg?).

Yes, this 'iommu' thing is would be a domain object. But sharing among
iommus is not necessary because the fact that there are multiple iommus
in the system is hidden by the iommu drivers. This fact is not even
exposed by the iommu-api. This makes protection domains system global.

 However, you can't really do the dma mapping with just the domain because
 every device supports a different size address space as a master, i.e.,
 the dma_mask.

The dma_mask has to be handled by the device driver. With the
iommu-mapping interface the driver can specify the target io-address and
has to consider the dma_mask for that too.

 And I don't know how kvm would deal with devices with varying dma mask 
 support,
 or why they'd be in the same domain.

KVM does not care about these masks. This is the business of the guest
device drivers.

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: [PATCHv2-RFC 0/2] virtio: put last seen used index into ring itself

2010-06-02 Thread Jes Sorensen
On 05/31/10 09:46, Rusty Russell wrote:
 On Thu, 27 May 2010 05:20:35 am Michael S. Tsirkin wrote:
 Here's a rewrite of the original patch with a new layout.
 I haven't tested it yet so no idea how this performs, but
 I think this addresses the cache bounce issue raised by Avi.
 Posting for early flames/comments.
 
 Sorry, not without some evidence that it'll actually reduce cacheline
 bouncing.  I *think* it will, but it's not obvious: the host may keep
 looking at avail_idx as we're updating last_seen.  Or does qemu always
 look at both together anyway?
 
 Can someone convince me this is a win?
 Rusty.

Hi Rusty,

I ran some tests using the vring index publish patch with virtio_blk.
The numbers are based on running IOZone on a ramdisk passed to the guest
via virtio. While I didn't see any throughput improvements, I saw a
20-30% reduction in the VMExit count for the full run. This was measured
grabbing the VMExit count prior and after the run and calculating the
difference.

I have the numbers in a PDF, so I will email it to you in private as I
don't like sending PDFs to the mailing list. However if anybody else
wants the numbers feel free to ping me off list and I'll forward them.

Cheers,
Jes

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Clock jumps

2010-06-02 Thread Orion Poplawski

On 05/26/2010 11:31 AM, Alexander Graf wrote:


On 26.05.2010, at 19:10, Orion Poplawski wrote:


On 05/25/2010 12:21 AM, Gleb Natapov wrote:

Adding kvm to CC.

On Mon, May 24, 2010 at 04:06:32PM +, Orion Poplawski wrote:

I have a KVM virtual machine running 2.6.33.4-95.fc13.x86_64 on a CentOS 5.5
host whose clock jumps about 8-12 hours a couple times a day.  I have no idea
what is causing it.  Fedora 12 and Centos 5.5 KVM machines run fine on the same
host.  Is there any debugging I can enable to see what is jumping the clock?

kvm-clock: cpu 0, msr 0:1ba4741, boot clock
kvm-clock: cpu 0, msr 0:1e15741, primary cpu clock
Switching to clocksource kvm-clock
rtc_cmos 00:01: setting system clock to 2010-05-20 16:59:48 UTC (1274374788)



Thanks, though I don't think it made it there.  I'm also not sure it's 
completely limited to KVM, though that is the only running system I am 
currently seeing the problem on.  I also see clock jumps during anaconda 
installs on physical hardware and apparently they have been present since at 
least F11.  Might be unrelated though.

I'm really at a loss of how to debug this though.


Do you have ntpd running inside the guest? I have a bug report lying around 
about 2.6.33 with kvm-clock jumping in time when ntpd is used: 
https://bugzilla.novell.com/show_bug.cgi?id=582260

Alex



Turning off ntpd and chronyd did not help for me.


--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA/CoRA DivisionFAX: 303-415-9702
3380 Mitchell Lane  or...@cora.nwra.com
Boulder, CO 80301  http://www.cora.nwra.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unhandled wrmsr

2010-06-02 Thread Dave Young
On Wed, Jun 2, 2010 at 8:45 PM, Andre Przywara andre.przyw...@amd.com wrote:
 Dave Young wrote:

 Hi,

 With today's git version (qemu-kvm), I got following message in kernel
 dmesg

 [168344.215605] kvm: 27289: cpu0 unhandled wrmsr: 0x198 data 0

 Are you sure about that?

Sure

0x198 is an Intel architectural perfmon MSR and it
 is read-only. The Linux kernel source I grep'ed obeys this and does only
 rdmsr.
 You can work around this by changing the error to a warning with:
 # modprobe kvm ignore_msrs=1

with this param, appear following warning:
kvm: 28520: cpu0 ignored wrmsr: 0x198 data 0


 I'd like to see more details about the guest Linux kernel, at least the
 version you used to see why there is a wrmsr on this address. Best is you
 provide the kernel (just the vmlinuz file) somewhere so that we can
 reproduce this. Also the qemu-kvm command line would be interesting.

Actually I tried different guest image, recreate this problem is easy.

one of them is slackware 13.0 kernel version 2.6.29.6, you can download from:
http://www.slackware.at/data/slackware64-13.0/kernels/huge.s/

kvm cmdline:
qemu-system-x86_64 -net nic -net tap,ifname=tap0,script=no slack64.img


 Regards,
 Andre.


 bash-3.1$ /home/dave/tmp/qemu-system-x86_64 --version
 QEMU emulator version 0.12.50 (qemu-kvm-devel), Copyright (c)
 2003-2008 Fabrice Bellard


 bash-3.1$ uname -a
 Linux darkstar 2.6.35-rc1 #35 SMP Mon May 31 16:50:15 CST 2010 x86_64
 Intel(R) Core(TM)2 Duo CPU     E7500  @ 2.93GHz GenuineIntel GNU/Linux

 host distribution is slackware64-13.0

 0.12.3 works well

 BTW the guest is a tiny core linux image


 --
 Andre Przywara
 AMD-Operating System Research Center (OSRC), Dresden, Germany
 Tel: +49 351 448-3567-12

 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html




-- 
Regards
dave
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unhandled wrmsr

2010-06-02 Thread Dave Young
On Thu, Jun 3, 2010 at 9:34 AM, Dave Young hidave.darks...@gmail.com wrote:
 On Wed, Jun 2, 2010 at 8:45 PM, Andre Przywara andre.przyw...@amd.com wrote:
 Dave Young wrote:

 Hi,

 With today's git version (qemu-kvm), I got following message in kernel
 dmesg

 [168344.215605] kvm: 27289: cpu0 unhandled wrmsr: 0x198 data 0

 Are you sure about that?

 Sure

 0x198 is an Intel architectural perfmon MSR and it
 is read-only. The Linux kernel source I grep'ed obeys this and does only
 rdmsr.
 You can work around this by changing the error to a warning with:
 # modprobe kvm ignore_msrs=1

 with this param, appear following warning:
 kvm: 28520: cpu0 ignored wrmsr: 0x198 data 0


 I'd like to see more details about the guest Linux kernel, at least the
 version you used to see why there is a wrmsr on this address. Best is you
 provide the kernel (just the vmlinuz file) somewhere so that we can
 reproduce this. Also the qemu-kvm command line would be interesting.

 Actually I tried different guest image, recreate this problem is easy.

 one of them is slackware 13.0 kernel version 2.6.29.6, you can download from:
 http://www.slackware.at/data/slackware64-13.0/kernels/huge.s/

 kvm cmdline:
 qemu-system-x86_64 -net nic -net tap,ifname=tap0,script=no slack64.img


 Regards,
 Andre.


 bash-3.1$ /home/dave/tmp/qemu-system-x86_64 --version
 QEMU emulator version 0.12.50 (qemu-kvm-devel), Copyright (c)
 2003-2008 Fabrice Bellard

Another problem with this build is I can not boot from virtio-blk, cmdline:
qemu-system-x86_64 -boot c -drive file=slack64.img,if-virtio,boot=on

qemu just complain about:
Could not read the boot disk

Anything missing by me?



 bash-3.1$ uname -a
 Linux darkstar 2.6.35-rc1 #35 SMP Mon May 31 16:50:15 CST 2010 x86_64
 Intel(R) Core(TM)2 Duo CPU     E7500  @ 2.93GHz GenuineIntel GNU/Linux

 host distribution is slackware64-13.0

 0.12.3 works well

 BTW the guest is a tiny core linux image


 --
 Andre Przywara
 AMD-Operating System Research Center (OSRC), Dresden, Germany
 Tel: +49 351 448-3567-12

 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html




 --
 Regards
 dave




-- 
Regards
dave
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unhandled wrmsr

2010-06-02 Thread Dave Young
On Thu, Jun 3, 2010 at 10:23 AM, Dave Young hidave.darks...@gmail.com wrote:
 On Thu, Jun 3, 2010 at 9:34 AM, Dave Young hidave.darks...@gmail.com wrote:
 On Wed, Jun 2, 2010 at 8:45 PM, Andre Przywara andre.przyw...@amd.com 
 wrote:
 Dave Young wrote:

 Hi,

 With today's git version (qemu-kvm), I got following message in kernel
 dmesg

 [168344.215605] kvm: 27289: cpu0 unhandled wrmsr: 0x198 data 0

 Are you sure about that?

 Sure

 0x198 is an Intel architectural perfmon MSR and it
 is read-only. The Linux kernel source I grep'ed obeys this and does only
 rdmsr.
 You can work around this by changing the error to a warning with:
 # modprobe kvm ignore_msrs=1

 with this param, appear following warning:
 kvm: 28520: cpu0 ignored wrmsr: 0x198 data 0


 I'd like to see more details about the guest Linux kernel, at least the
 version you used to see why there is a wrmsr on this address. Best is you
 provide the kernel (just the vmlinuz file) somewhere so that we can
 reproduce this. Also the qemu-kvm command line would be interesting.

 Actually I tried different guest image, recreate this problem is easy.

 one of them is slackware 13.0 kernel version 2.6.29.6, you can download from:
 http://www.slackware.at/data/slackware64-13.0/kernels/huge.s/

 kvm cmdline:
 qemu-system-x86_64 -net nic -net tap,ifname=tap0,script=no slack64.img


 Regards,
 Andre.


 bash-3.1$ /home/dave/tmp/qemu-system-x86_64 --version
 QEMU emulator version 0.12.50 (qemu-kvm-devel), Copyright (c)
 2003-2008 Fabrice Bellard

 Another problem with this build is I can not boot from virtio-blk, cmdline:
 qemu-system-x86_64 -boot c -drive file=slack64.img,if-virtio,boot=on
typo, if=virtio

 qemu just complain about:
 Could not read the boot disk

 Anything missing by me?



 bash-3.1$ uname -a
 Linux darkstar 2.6.35-rc1 #35 SMP Mon May 31 16:50:15 CST 2010 x86_64
 Intel(R) Core(TM)2 Duo CPU     E7500  @ 2.93GHz GenuineIntel GNU/Linux

 host distribution is slackware64-13.0

 0.12.3 works well

 BTW the guest is a tiny core linux image


 --
 Andre Przywara
 AMD-Operating System Research Center (OSRC), Dresden, Germany
 Tel: +49 351 448-3567-12

 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html




 --
 Regards
 dave




 --
 Regards
 dave




-- 
Regards
dave
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-02 Thread Srivatsa Vaddagiri
On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
 
 There are two separate problems: the more general problem is that
 the hypervisor can put a vcpu to sleep while holding a lock, causing
 other vcpus to spin until the end of their time slice.  This can
 only be addressed with hypervisor help.

Fyi - I have a early patch ready to address this issue. Basically I am using
host-kernel memory (mmap'ed into guest as io-memory via ivshmem driver) to hint 
host whenever guest is in spin-lock'ed section, which is read by host scheduler 
to defer preemption.

Guest side:

static inline void spin_lock(spinlock_t *lock)
{
raw_spin_lock(lock-rlock);
+   __get_cpu_var(gh_vcpu_ptr)-defer_preempt++;
}

static inline void spin_unlock(spinlock_t *lock)
{
+   __get_cpu_var(gh_vcpu_ptr)-defer_preempt--;
raw_spin_unlock(lock-rlock);
}

[similar changes to other spinlock variants]

Host side:


@@ -860,6 +866,17 @@ check_preempt_tick(struct cfs_rq *cfs_rq
ideal_runtime = sched_slice(cfs_rq, curr);
delta_exec = curr-sum_exec_runtime - curr-prev_sum_exec_runtime;
if (delta_exec  ideal_runtime) {
+   if ((sched_feat(DEFER_PREEMPT))  
(rq_of(cfs_rq)-curr-ghptr)) {
+   int defer_preempt =  
rq_of(cfs_rq)-curr-ghptr-defer_preempt;
+   if (((defer_preempt  0x) == 0xfeed)  
((defer_preempt  0x) != 0)) {
+   if  ((rq_of(cfs_rq)-curr-grace_defer++  
sysctl_sched_preempt_defer_count)) {
+   rq_of(cfs_rq)-defer_preempt++;
+   return;
+   } else
+   rq_of(cfs_rq)-force_preempt++;
+   }
+   }
resched_task(rq_of(cfs_rq)-curr);
/*
 * The current task ran long enough, ensure it doesn't get

[similar changes introduced at other preemption points in sched_fair.c]


Note that guest can only request preemption to be deferred (and not disabled via
this mechanism). I have seen good improvement (~15%) in kern compile benchmark 
with sysctl_sched_preempt_defer_count set to a low value of just 2 (i.e we can 
defer preemption by maximum two ticks). I intend to cleanup and post the patches
pretty soon for comments.

One pathological case where this may actually hurt is routines in guest like 
flush_tlb_others_ipi() which take a spinlock and then enter a while() loop
waiting for other cpus to ack something. In this case, deferring preemption just
because guest is in critical section actually hurts! Hopefully the upper bound
for deferring preemtion and the fact that such routines may not be frequently
hit should help alleviate such situations.

- vatsa

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-02 Thread Srivatsa Vaddagiri
On Thu, Jun 03, 2010 at 06:51:51AM +0200, Eric Dumazet wrote:
  Guest side:
  
  static inline void spin_lock(spinlock_t *lock)
  {
  raw_spin_lock(lock-rlock);
  +   __get_cpu_var(gh_vcpu_ptr)-defer_preempt++;
 
 1) __this_cpu_inc() should be faster

Ok ..thx for that tip.

 2) Isnt a bit late to do this increment _after_
 raw_spin_lock(lock-rlock);

I think so, my worry about doing it earlier is we may set the defer_preempt hint
for the wrong vcpu (if lets say the guest application thread is preempted by
guest kernel and later migrated to another vcpu after it sets the hint and 
before it acquires the lock). 

- vatsa
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm/powerpc: fix a build error in e500_tlb.c

2010-06-02 Thread Kevin Hao
We use the wrong number arguments when invoking trace_kvm_stlb_inval,
and cause the following build error.
arch/powerpc/kvm/e500_tlb.c: In function 'kvmppc_e500_stlbe_invalidate':
arch/powerpc/kvm/e500_tlb.c:230: error: too many arguments to function 
'trace_kvm_stlb_inval'

Signed-off-by: Kevin Hao haoke...@gmail.com
---
 arch/powerpc/kvm/e500_tlb.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index 21011e1..1261a21 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -226,8 +226,7 @@ static void kvmppc_e500_stlbe_invalidate(struct 
kvmppc_vcpu_e500 *vcpu_e500,
 
kvmppc_e500_shadow_release(vcpu_e500, tlbsel, esel);
stlbe-mas1 = 0;
-   trace_kvm_stlb_inval(index_of(tlbsel, esel), stlbe-mas1, stlbe-mas2,
-stlbe-mas3, stlbe-mas7);
+   trace_kvm_stlb_inval(index_of(tlbsel, esel));
 }
 
 static void kvmppc_e500_tlb1_invalidate(struct kvmppc_vcpu_e500 *vcpu_e500,
-- 
1.6.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: elide struct thread_struct instances from stack

2010-06-02 Thread Marcelo Tosatti
On Mon, May 31, 2010 at 09:59:13PM +0200, Andreas Schwab wrote:
 Instead of instantiating a whole thread_struct on the stack use only the
 required parts of it.
 
 Signed-off-by: Andreas Schwab sch...@linux-m68k.org
 Tested-by: Alexander Graf ag...@suse.de
 ---
  arch/powerpc/include/asm/kvm_fpu.h   |   27 +
  arch/powerpc/kernel/ppc_ksyms.c  |4 -
  arch/powerpc/kvm/book3s.c|   49 +---
  arch/powerpc/kvm/book3s_paired_singles.c |   94 
 --
  arch/powerpc/kvm/fpu.S   |   18 ++
  5 files changed, 97 insertions(+), 95 deletions(-)

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html