Re: [PATCH 2/5] KVM: MMU: split the operations of kvm_mmu_zap_page()

2010-06-01 Thread Avi Kivity

On 06/01/2010 05:29 AM, Xiao Guangrong wrote:



How about passing the list as a parameter to prepare() and commit()?  If
the lifetime of the list is just prepare/commit, it shouldn't be a global.

 

Does the example code below show your meaning correctly?

+   struct list_head free_list = LIST_HEAD_INIT(free_list);

    hlist_for_each_entry_safe(sp, node, nn, bucket, hash_link) {
        if (sp->gfn == gfn && !sp->role.direct &&
            !sp->role.invalid) {
            pgprintk("%s: zap %lx %x\n",
                     __func__, gfn, sp->role.word);
+           kvm_mmu_prepare_zap_page(kvm, sp, &free_list);
        }
    }
+   kvm_mmu_commit_zap_page(kvm, &free_list);

   


Yes.
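
For reference, a minimal sketch of how such a split could look; the function
names are the ones under discussion, but the bodies here are illustrative
assumptions, not the actual patch:

    /* prepare(): unlink the page and queue it on the caller's list;
     * commit(): one TLB flush for the whole batch, then free. */
    static int kvm_mmu_prepare_zap_page(struct kvm *kvm,
                                        struct kvm_mmu_page *sp,
                                        struct list_head *free_list)
    {
            /* ... unlink sp from the hash and parent chains ... */
            list_add(&sp->link, free_list); /* defer the actual free */
            return 1;                       /* one page queued */
    }

    static void kvm_mmu_commit_zap_page(struct kvm *kvm,
                                        struct list_head *free_list)
    {
            struct kvm_mmu_page *sp, *n;

            kvm_flush_remote_tlbs(kvm);     /* one flush for the batch */
            list_for_each_entry_safe(sp, n, free_list, link)
                    kvm_mmu_free_page(kvm, sp);
    }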

--
error compiling committee.c: too many arguments to function



Re: [PATCH 1/5] KVM: MMU: introduce some macros to cleanup hlist traverseing

2010-06-01 Thread Avi Kivity

On 06/01/2010 05:38 AM, Xiao Guangrong wrote:


Avi Kivity wrote:
   

On 05/31/2010 05:00 AM, Xiao Guangrong wrote:
 
   


 

+
+#define for_each_gfn_indirect_sp(kvm, sp, gfn, pos, n)\
+  hlist_for_each_entry_safe(sp, pos, n,\
+    &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link)\
+      if (sp->gfn == gfn && !sp->role.direct)
+
+#define for_each_gfn_indirect_valid_sp(kvm, sp, gfn, pos, n)\
+  hlist_for_each_entry_safe(sp, pos, n,\
+    &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link)\
+      if (sp->gfn == gfn && !sp->role.direct && \
+          !sp->role.invalid)


   

Shouldn't we always skip invalid gfns?

 

Actually, the kvm_mmu_unprotect_page() function needs to find
invalid shadow pages:

|hlist_for_each_entry_safe(sp, node, n, bucket, hash_link)
|    if (sp->gfn == gfn && !sp->role.direct) {
|        pgprintk("%s: gfn %lx role %x\n", __func__, gfn,
|                 sp->role.word);
|        r = 1;
|        if (kvm_mmu_zap_page(kvm, sp))
|            goto restart;
|    }

I'm not sure whether we can skip invalid sp here, since it can change
this function's return value. :-(

   

Hm.  Invalid pages don't need to be write protected.  So I think you can
patch unprotect_page() to ignore invalid pages, and then you can convert
it to the new macros which ignore invalid pages as well.

The invariant is: if an sp exists with !role.invalid and !unsync, then
the page must be write protected.
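
For example, the conversion could look like this (a sketch of the suggestion,
based on the code quoted above, not the actual follow-up patch):

    /* Sketch: skip invalid pages and use the new traversal macro. */
    static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
    {
            struct kvm_mmu_page *sp;
            struct hlist_node *pos, *n;
            int r = 0;

    restart:
            for_each_gfn_indirect_valid_sp(kvm, sp, gfn, pos, n) {
                    pgprintk("%s: gfn %lx role %x\n", __func__, gfn,
                             sp->role.word);
                    r = 1;
                    if (kvm_mmu_zap_page(kvm, sp))
                            goto restart;
            }
            return r;
    }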
 

OK, will fix it in the next version.

   


 

What about providing both gfn and role to the macro?


 

In the current code, nothing looks up an sp by role and gfn alone:
in kvm_mmu_get_page() we need to do extra work for sp's with
'sp->gfn == gfn && sp->role != role', and the other callers only
compare some members of role, not all of them.

   

How about just gfn?  I think everything compares against that!

 

In this patch, a macro that compares only 'gfn' is already introduced, that is:

+#define for_each_gfn_sp(kvm, sp, gfn, pos, n)  \
+  hlist_for_each_entry_safe(sp, pos, n,\
+    &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link)\
+      if (sp->gfn == gfn)

Sorry if I misunderstood your meaning.
   


No, I got confused.

--
error compiling committee.c: too many arguments to function



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Avi Kivity

On 05/31/2010 08:10 PM, Michael S. Tsirkin wrote:

On Mon, May 31, 2010 at 02:50:29PM +0300, Avi Kivity wrote:
   

On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote:
 

So what I suggested is failing any kind of access until iommu
is assigned.

   

So, the kernel driver must be aware of the iommu.  In which case it may
as well program it.
 

It's a kernel driver anyway. Point is that
the *device* driver is better off not programming iommu,
this way we do not need to reprogram it for each device.
   


The device driver is in userspace.  It can't program the iommu.  What 
the patch proposes is that userspace tells vfio about the needed 
mappings, and vfio programs the iommu.
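
Concretely, the division of labor looks something like this -- a hypothetical
sketch, where the structure layout and ioctl name are illustrative
assumptions rather than the proposed VFIO ABI:

    /* Hypothetical example: userspace describes a DMA mapping and the
     * vfio kernel driver pins the memory and programs the iommu. */
    struct example_dma_map {
            __u64 vaddr;    /* process virtual address backing the region */
            __u64 iova;     /* bus address the device will use for DMA */
            __u64 size;     /* length of the region in bytes */
            __u32 flags;    /* e.g. read/write permission bits */
    };

    struct example_dma_map map = {
            .vaddr = (__u64)buffer,
            .iova  = 0x100000,
            .size  = 4096,
    };
    ioctl(vfio_fd, EXAMPLE_VFIO_DMA_MAP, &map);  /* assumed ioctl number */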


--
error compiling committee.c: too many arguments to function



Re: [PATCH v7] KVM: VMX: Enable XSAVE/XRSTOR for guest

2010-06-01 Thread Avi Kivity

On 05/31/2010 02:54 PM, Sheng Yang wrote:

From: Dexuan Cui dexuan@intel.com

This patch enables the guest to use XSAVE/XRSTOR instructions.

We assume that host_xcr0 would use all possible bits that the OS supports.

And we loaded xcr0 in the same way we handled fpu - do it as late as we can.

   


Reviewed-by: Avi Kivity a...@redhat.com

--
error compiling committee.c: too many arguments to function



Re: [PATCH] KVM: PPC: elide struct thread_struct instances from stack

2010-06-01 Thread Andreas Schwab
Paul Mackerras pau...@samba.org writes:

 I re-read the relevant part of the PowerPC architecture spec
 yesterday, and it seems pretty clear that the FPSCR doesn't affect the
 behaviour of lfs and stfs, and is not affected by them.  So in fact 4
 out of the 7 instructions in each of those procedures are unnecessary
 (and similarly for the cvt_fd/df used in the alignment fixup code).

I'd prefer to have this deferred to a separate patch.

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
And now for something completely different.


Re: Perf trace event parse errors for KVM events

2010-06-01 Thread Avi Kivity

On 05/30/2010 06:34 PM, Steven Rostedt wrote:



Cool.  May make sense to use simpler formatting in the kernel, and use
trace-cmd plugins for the complicated cases.

It does raise issues with ABIs.  Can trace-cmd read plugins from
/lib/modules/*?  We can then distribute the plugins with the kernel.

 

We probably could do that. Perhaps if we can port the python code to
perf, then it would work for both. Then the plugins could be just python
scripts, (or binary .so modules) and have a tools/plugins dir?

The python part probably would be easier to port, since the .so modules
are a bit more specific to trace-cmd.
   


One concern is performance.  Traces tend to be long, and running python 
code on each line will be slow.


If trace-cmd integrates a pager and a search mechanism that looks at the 
binary data instead of the text, we could format only the lines that are 
displayed.  But that is going to be a lot of work and I don't think it's 
worth the effort.


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH] ceph/rbd block driver for qemu-kvm (v3)

2010-06-01 Thread Kevin Wolf
Hi Christian,

Am 31.05.2010 21:31, schrieb Christian Brunner:
 Hi Kevin,
 
 here is an updated patch for the ceph/rbd driver. I hope that everything 
 is fine now.

I'll try to give it a final review later this week. In the
meantime, I would be happy to see another review by someone else.

Do you have some specific tests for the driver or should we extend
qemu-iotests to work with protocols and use only that?

Kevin


Re: [PATCH][RESEND] Print a user-friendly message on failed vmentry

2010-06-01 Thread Avi Kivity

On 05/31/2010 10:40 PM, Mohammed Gamal wrote:

This patch addresses the bug report in https://bugs.launchpad.net/qemu/+bug/530077.

Failed vmentries were handled with handle_unhandled() which prints a rather
unfriendly message to the user. This patch separates handling vmentry failures
from unknown exit reasons and prints a friendly message to the user.



+#define VMX_INVALID_GUEST_STATE 0x80000021
+
+static int handle_failed_vmentry(uint64_t reason)
+{
+    fprintf(stderr, "kvm: vm entry failed with error 0x%" PRIx64 "\n\n",
+            reason);
+
+    /* Perhaps we will need to check if this machine is intel since exit
+       reason 0x21 has a different interpretation on SVM */
+    if (reason == VMX_INVALID_GUEST_STATE) {
+        fprintf(stderr, "If you're running a guest on an Intel machine without\n");
+        fprintf(stderr, "unrestricted mode support, the failure can be most likely\n");
+        fprintf(stderr, "due to the guest entering an invalid state for Intel VT.\n");
+        fprintf(stderr, "For example, the guest may be running in big real mode\n");
+        fprintf(stderr, "which is not supported on less recent Intel processors.\n\n");
+        fprintf(stderr, "You may want to try enabling KVM real mode emulation. To\n");
+        fprintf(stderr, "enable it, you can run the following commands as root:\n");
+        fprintf(stderr, "# rmmod kvm_intel\n");
+        fprintf(stderr, "# rmmod kvm\n");
+        fprintf(stderr, "# modprobe kvm_intel emulate_invalid_guest_state=1\n\n");
+        fprintf(stderr, "WARNING: Real mode emulation is still work-in-progress\n");
+        fprintf(stderr, "and thus it is not always guaranteed to work.\n\n");
+    }
+
+    return -EINVAL;
+}

   


It's almost guaranteed to fail, isn't it?  Is there any guest which 
fails with emulate_invalid_guest_state=0 but works with e_i_g_s=1?


--
error compiling committee.c: too many arguments to function



Re: [PATCH] test: Add XSAVE unit test

2010-06-01 Thread Sheng Yang
On Tuesday 01 June 2010 16:51:05 Avi Kivity wrote:
 On 05/31/2010 02:17 PM, Sheng Yang wrote:
  Only test legal action so far, we can extend it later.
 
 The legal actions are tested by guests, so it's more important for unit
 tests to check illegal (and potentially subversive) actions.

Yes. This is just the first step. I think we need to construct an IDT and use
something similar to a fixup exception table. That's not quick work, and we
need some base for it.
 
  +
  +void test_xsave()
  +{
  +   unsigned int cr4;
  +   uint64_t supported_xcr0;
  +   uint64_t test_bits;
  +
  +   supported_xcr0 = get_supported_xcr0();
  +   printf("Supported XCR0 bits: 0x%x\n", supported_xcr0);
  +
  +   printf("Check minimal XSAVE required bits: ");
  +   test_bits = XSTATE_FP | XSTATE_SSE | XSTATE_YMM;
  +   pass_if((supported_xcr0 & test_bits) == test_bits);
 
 This will fail on a cpu without YMM but with xsave.

Yes, would discard this...
 
  +
  +int main()
  +{
  +   int cpuid_has_xsave;
  +
  +   cpuid_has_xsave = check_xsave();
  +   if (cpuid_has_xsave) {
  +       printf("CPU has XSAVE feature\n");
  +       test_xsave();
  +   } else
  +       printf("CPU don't has XSAVE feature\n");
  +   return 0;
 
 Should return 1 if any failure for autotest integration.

OK
 
  +}
  +

--
regards
Yang, Sheng


Re: [PATCH] test: Add XSAVE unit test

2010-06-01 Thread Avi Kivity

On 06/01/2010 12:00 PM, Sheng Yang wrote:

On Tuesday 01 June 2010 16:51:05 Avi Kivity wrote:
   

On 05/31/2010 02:17 PM, Sheng Yang wrote:
 

Only test legal action so far, we can extend it later.
   

The legal actions are tested by guests, so it's more important for unit
tests to check illegal (and potentially subversive) actions.
 

Yes. This is just the first step. I think we need to construct an IDT and use
something similar to a fixup exception table. That's not quick work, and we
need some base for it.
   


Sure, it's fine to proceed in steps.  We do need a framework for 
trapping exceptions.  There's a bit in access.flat (also for running 
code from cpl 3), we could generalize it.


--
error compiling committee.c: too many arguments to function



Re: [PATCH] KVM: PPC: elide struct thread_struct instances from stack

2010-06-01 Thread Alexander Graf

On 01.06.2010, at 10:36, Andreas Schwab wrote:

 Paul Mackerras pau...@samba.org writes:
 
 I re-read the relevant part of the PowerPC architecture spec
 yesterday, and it seems pretty clear that the FPSCR doesn't affect the
 behaviour of lfs and stfs, and is not affected by them.  So in fact 4
 out of the 7 instructions in each of those procedures are unnecessary
 (and similarly for the cvt_fd/df used in the alignment fixup code).
 
 I'd prefer to have this deferred to a separate patch.

I agree. Andreas' patch takes the current logic and moves it to be KVM 
contained, so we don't clutter the stack. The fact that the old code was 
inefficient is a separate story.

Avi / Marcelo, please apply the patch nevertheless.


Alex



[PATCH 1/3] vhost: replace vhost_workqueue with per-vhost kthread

2010-06-01 Thread Tejun Heo
Replace vhost_workqueue with per-vhost kthread.  Other than callback
argument change from struct work_struct * to struct vhost_work *,
there's no visible change to vhost_poll_*() interface.

This conversion is to make each vhost use a dedicated kthread so that
resource control via cgroup can be applied.

Partially based on Sridhar Samudrala's patch.

* Updated to use sub structure vhost_work instead of directly using
  vhost_poll at Michael's suggestion.

* Added flusher wake_up() optimization at Michael's suggestion.

NOT_SIGNED_OFF_YET
Cc: Michael S. Tsirkin m...@redhat.com
Cc: Sridhar Samudrala samudrala.srid...@gmail.com
---
Okay, here's the updated version.  I'm still in the process of setting
up the testing environment.  I'll be traveling from this afternoon
till tomorrow so I don't think I'll be able to test it before that.
If you can give it a shot, it would be great.

Thanks.

 drivers/vhost/net.c   |   56 ++---
 drivers/vhost/vhost.c |  110 ++
 drivers/vhost/vhost.h |   38 +++--
 3 files changed, 133 insertions(+), 71 deletions(-)

Index: work/drivers/vhost/net.c
===
--- work.orig/drivers/vhost/net.c
+++ work/drivers/vhost/net.c
@@ -294,54 +294,58 @@ static void handle_rx(struct vhost_net *
unuse_mm(net->dev.mm);
 }

-static void handle_tx_kick(struct work_struct *work)
+static void handle_tx_kick(struct vhost_work *work)
 {
-   struct vhost_virtqueue *vq;
-   struct vhost_net *net;
-   vq = container_of(work, struct vhost_virtqueue, poll.work);
-   net = container_of(vq->dev, struct vhost_net, dev);
+   struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+ poll.work);
+   struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
+
handle_tx(net);
 }

-static void handle_rx_kick(struct work_struct *work)
+static void handle_rx_kick(struct vhost_work *work)
 {
-   struct vhost_virtqueue *vq;
-   struct vhost_net *net;
-   vq = container_of(work, struct vhost_virtqueue, poll.work);
-   net = container_of(vq->dev, struct vhost_net, dev);
+   struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+ poll.work);
+   struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
+
handle_rx(net);
 }

-static void handle_tx_net(struct work_struct *work)
+static void handle_tx_net(struct vhost_work *work)
 {
-   struct vhost_net *net;
-   net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_TX].work);
+   struct vhost_net *net = container_of(work, struct vhost_net,
+poll[VHOST_NET_VQ_TX].work);
handle_tx(net);
 }

-static void handle_rx_net(struct work_struct *work)
+static void handle_rx_net(struct vhost_work *work)
 {
-   struct vhost_net *net;
-   net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_RX].work);
+   struct vhost_net *net = container_of(work, struct vhost_net,
+poll[VHOST_NET_VQ_RX].work);
handle_rx(net);
 }

 static int vhost_net_open(struct inode *inode, struct file *f)
 {
struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
+   struct vhost_dev *dev;
int r;
+
if (!n)
return -ENOMEM;
+
+   dev = &n->dev;
    n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
    n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-   r = vhost_dev_init(&n->dev, n->vqs, VHOST_NET_VQ_MAX);
+   r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
    if (r < 0) {
        kfree(n);
        return r;
    }

-   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
-   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
+   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
+   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
n-tx_poll_state = VHOST_NET_POLL_DISABLED;

f->private_data = n;
@@ -644,25 +648,13 @@ static struct miscdevice vhost_net_misc

 static int vhost_net_init(void)
 {
-   int r = vhost_init();
-   if (r)
-   goto err_init;
-   r = misc_register(&vhost_net_misc);
-   if (r)
-   goto err_reg;
-   return 0;
-err_reg:
-   vhost_cleanup();
-err_init:
-   return r;
-
+   return misc_register(&vhost_net_misc);
 }
 module_init(vhost_net_init);

 static void vhost_net_exit(void)
 {
misc_deregister(&vhost_net_misc);
-   vhost_cleanup();
 }
 module_exit(vhost_net_exit);

Index: work/drivers/vhost/vhost.c
===
--- work.orig/drivers/vhost/vhost.c
+++ work/drivers/vhost/vhost.c
@@ -17,12 +17,12 @@
 

[PATCH 2/3] cgroups: Add an API to attach a task to current task's cgroup

2010-06-01 Thread Tejun Heo
From: Sridhar Samudrala samudrala.srid...@gmail.com

Add a new kernel API to attach a task to current task's cgroup
in all the active hierarchies.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com
Reviewed-by: Paul Menage men...@google.com
Acked-by: Li Zefan l...@cn.fujitsu.com
---
 include/linux/cgroup.h |1 +
 kernel/cgroup.c|   23 +++
 2 files changed, 24 insertions(+)

Index: work/include/linux/cgroup.h
===
--- work.orig/include/linux/cgroup.h
+++ work/include/linux/cgroup.h
@@ -570,6 +570,7 @@ struct task_struct *cgroup_iter_next(str
 void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
 int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
+int cgroup_attach_task_current_cg(struct task_struct *);

 /*
  * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
Index: work/kernel/cgroup.c
===
--- work.orig/kernel/cgroup.c
+++ work/kernel/cgroup.c
@@ -1788,6 +1788,29 @@ out:
return retval;
 }

+/**
+ * cgroup_attach_task_current_cg - attach task 'tsk' to current task's cgroup
+ * @tsk: the task to be attached
+ */
+int cgroup_attach_task_current_cg(struct task_struct *tsk)
+{
+   struct cgroupfs_root *root;
+   struct cgroup *cur_cg;
+   int retval = 0;
+
+   cgroup_lock();
+   for_each_active_root(root) {
+   cur_cg = task_cgroup_from_root(current, root);
+   retval = cgroup_attach_task(cur_cg, tsk);
+   if (retval)
+   break;
+   }
+   cgroup_unlock();
+
+   return retval;
+}
+EXPORT_SYMBOL_GPL(cgroup_attach_task_current_cg);
+
 /*
  * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
  * held. May take task_lock of task


[PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Gleb Natapov
Ticket spinlocks ensure fairness by introducing a FIFO of cpus waiting for
the spinlock to be released. This works great on real HW, but when running
on a hypervisor it introduces a very heavy performance hit if physical cpus
are overcommitted (up to 35% in my test). The reason for the performance
hit is that the vcpu at the head of the FIFO can be descheduled, and no
other vcpu waiting on the lock can proceed. This results in big time
gaps where no vcpu is in the critical section. Even worse, they are
constantly spinning, not allowing the vcpu at the head of the FIFO to be
scheduled to run.

The patch below patches the ticket spinlock code to behave similarly to the
old unfair spinlock when a hypervisor is detected. After patching, the
unlocked condition remains the same (next == owner), but if a vcpu detects
that the lock is already taken it spins till (next == owner == 0), and when
that condition is met it tries to reacquire the lock. Unlock simply zeroes
head and tail. Trylock stays the same since the unlocked condition stays
the same.
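
A rough C rendition of the patched behavior, for readers who prefer not to
decode the inline asm below -- this is an illustrative sketch, not the
patch itself:

    /* Sketch of the unfair behavior: take a ticket; if the lock is held,
     * wait for a full release (head == tail == 0) and start over.  The
     * real patch does this in inline asm under ALTERNATIVE() so bare
     * metal keeps the fair ticket path. */
    void unfair_spin_lock_sketch(volatile unsigned short *slock)
    {
            unsigned short ticket;

            for (;;) {
                    /* low byte: owner (head); high byte: next (tail) */
                    ticket = __sync_fetch_and_add(slock, 0x0100);
                    if ((ticket & 0xff) == (ticket >> 8))
                            return;         /* next == owner: acquired */
                    while (*slock != 0)
                            __builtin_ia32_pause();   /* rep ; nop */
                    /* lock fully released: race to reacquire */
            }
    }

    void unfair_spin_unlock_sketch(volatile unsigned short *slock)
    {
            *slock = 0;     /* zero head and tail instead of incb head */
    }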

I ran two guests with 16 vcpus each, one guest with this patch and another
without. Inside those guests I ran a kernel compilation ("time make -j 16 all").
Here are the results of two runs:

patched:                unpatched:
real 13m34.674s real 19m35.827s
user 96m2.638s  user 102m38.665s
sys 56m14.991s  sys 158m22.470s

real 13m3.768s  real 19m4.375s
user 95m34.509s user 111m9.903s
sys 53m40.550s  sys 141m59.370s

Note that for 6 minutes the unpatched guest ran without cpu oversubscription
since the patched guest was already idle.

Running kernbench on the host with and without the patch:

Unpatched:                                Patched:
Average Half load -j 8 Run (std deviation):
Elapsed Time 312.538 (0.779404)           Elapsed Time 311.974 (0.258031)
User Time 2154.89 (6.62261)               User Time 2151.94 (1.96702)
System Time 233.78 (1.96642)              System Time 236.472 (0.695572)
Percent CPU 763.8 (1.64317)               Percent CPU 765 (0.707107)
Context Switches 39348.2 (1542.87)        Context Switches 41522.4 (1193.13)
Sleeps 871688 (5596.52)                   Sleeps 877317 (1135.83)

Average Optimal load -j 16 Run (std deviation):
Elapsed Time 259.878 (0.444826)           Elapsed Time 258.842 (0.413122)
User Time 2549.22 (415.685)               User Time 2538.42 (407.383)
System Time 261.084 (28.8145)             System Time 263.847 (28.8638)
Percent CPU 1003.4 (252.566)              Percent CPU 1003.5 (251.407)
Context Switches 228418 (199308)          Context Switches 228894 (197533)
Sleeps 898721 (28757.2)                   Sleeps 902020 (26074.5)

Signed-off-by: Gleb Natapov g...@redhat.com
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 3089f70..b919b54 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -60,19 +60,27 @@
 
 static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
 {
-   short inc = 0x0100;
+   short inc;

    asm volatile (
+   "1:\t\n"
+   "mov $0x100, %0\n\t"
    LOCK_PREFIX "xaddw %w0, %1\n"
-   "1:\t"
+   "2:\t"
    "cmpb %h0, %b0\n\t"
-   "je 2f\n\t"
+   "je 4f\n\t"
+   "3:\t\n"
    "rep ; nop\n\t"
-   "movb %1, %b0\n\t"
    /* don't need lfence here, because loads are in-order */
-   "jmp 1b\n"
-   "2:"
-   : "+Q" (inc), "+m" (lock->slock)
+   ALTERNATIVE(
+   "movb %1, %b0\n\t"
+   "jmp 2b\n",
+   "nop", X86_FEATURE_HYPERVISOR)"\n\t"
+   "cmpw $0, %1\n\t"
+   "jne 3b\n\t"
+   "jmp 1b\n\t"
+   "4:"
+   : "=Q" (inc), "+m" (lock->slock)
    :
    : "memory", "cc");
 }
@@ -98,10 +106,13 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
 
 static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
 {
-   asm volatile(UNLOCK_LOCK_PREFIX "incb %0"
-                : "+m" (lock->slock)
-                :
-                : "memory", "cc");
+   asm volatile(
+       ALTERNATIVE(UNLOCK_LOCK_PREFIX "incb (%0);"ASM_NOP3,
+                   UNLOCK_LOCK_PREFIX "movw $0, (%0)",
+                   X86_FEATURE_HYPERVISOR)
+       :
+       : "Q" (&lock->slock)
+       : "memory", "cc");
 }
 #else
 #define TICKET_SHIFT 16
--
Gleb.


[PATCH 3/3] vhost: apply cpumask and cgroup to vhost workers

2010-06-01 Thread Tejun Heo
Apply the cpumask and cgroup of the initializing task to the created
vhost worker.

Based on Sridhar Samudrala's patch.  Li Zefan spotted a bug in error
path (twice), fixed (twice).

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Michael S. Tsirkin m...@redhat.com
Cc: Sridhar Samudrala samudrala.srid...@gmail.com
Cc: Li Zefan l...@cn.fujitsu.com
---
 drivers/vhost/vhost.c |   34 ++
 1 file changed, 30 insertions(+), 4 deletions(-)

Index: work/drivers/vhost/vhost.c
===
--- work.orig/drivers/vhost/vhost.c
+++ work/drivers/vhost/vhost.c
@@ -23,6 +23,7 @@
 #include linux/highmem.h
 #include linux/slab.h
 #include linux/kthread.h
+#include linux/cgroup.h

 #include linux/net.h
 #include linux/if_packet.h
@@ -187,11 +188,29 @@ long vhost_dev_init(struct vhost_dev *dev,
struct vhost_virtqueue *vqs, int nvqs)
 {
struct task_struct *worker;
-   int i;
+   cpumask_var_t mask;
+   int i, ret = -ENOMEM;
+
+   if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+       goto out_free_mask;

worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
-   if (IS_ERR(worker))
-   return PTR_ERR(worker);
+   if (IS_ERR(worker)) {
+   ret = PTR_ERR(worker);
+   goto out_free_mask;
+   }
+
+   ret = sched_getaffinity(current->pid, mask);
+   if (ret)
+   goto out_stop_worker;
+
+   ret = sched_setaffinity(worker->pid, mask);
+   if (ret)
+   goto out_stop_worker;
+
+   ret = cgroup_attach_task_current_cg(worker);
+   if (ret)
+   goto out_stop_worker;

dev->vqs = vqs;
dev->nvqs = nvqs;
@@ -214,7 +233,14 @@ long vhost_dev_init(struct vhost_dev *dev,
}

wake_up_process(worker);/* avoid contributing to loadavg */
-   return 0;
+   ret = 0;
+   goto out_free_mask;
+
+out_stop_worker:
+   kthread_stop(worker);
+out_free_mask:
+   free_cpumask_var(mask);
+   return ret;
 }

 /* Caller should have device mutex */


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Michael S. Tsirkin
On Tue, Jun 01, 2010 at 11:10:45AM +0300, Avi Kivity wrote:
 On 05/31/2010 08:10 PM, Michael S. Tsirkin wrote:
 On Mon, May 31, 2010 at 02:50:29PM +0300, Avi Kivity wrote:

 On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote:
  
 So what I suggested is failing any kind of access until iommu
 is assigned.


 So, the kernel driver must be aware of the iommu.  In which case it may
 as well program it.
  
 It's a kernel driver anyway. Point is that
 the *device* driver is better off not programming iommu,
 this way we do not need to reprogram it for each device.


 The device driver is in userspace.

I mean the kernel driver that grants userspace the access.

  It can't program the iommu.
 What  
 the patch proposes is that userspace tells vfio about the needed  
 mappings, and vfio programs the iommu.

There seems to be some misunderstanding.  The userspace interface
proposed forces a separate domain per device and forces userspace to
repeat iommu programming for each device.  We are better off sharing a
domain between devices and programming the iommu once.

The natural way to do this is to have an iommu driver for programming
iommu.

This likely means we will have to pass the domain to 'vfio' or uio or
whatever the driver that gives userspace the access to device is called,
but this is only for security, there's no need to support programming
iommu there.

And using this design means the uio framework changes
required would be minor, so we won't have to duplicate code.

 -- 
 error compiling committee.c: too many arguments to function


Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost workers

2010-06-01 Thread Michael S. Tsirkin
On Tue, Jun 01, 2010 at 11:35:15AM +0200, Tejun Heo wrote:
 Apply the cpumask and cgroup of the initializing task to the created
 vhost worker.
 
 Based on Sridhar Samudrala's patch.  Li Zefan spotted a bug in error
 path (twice), fixed (twice).
 
 Signed-off-by: Tejun Heo t...@kernel.org
 Cc: Michael S. Tsirkin m...@redhat.com
 Cc: Sridhar Samudrala samudrala.srid...@gmail.com
 Cc: Li Zefan l...@cn.fujitsu.com

Something that I wanted to figure out - what happens if the
CPU mask limits us to a certain CPU that subsequently goes offline?
Will e.g. flush block forever or until that CPU comes back?
Also, does singlethreaded workqueue behave in the same way?

 ---
  drivers/vhost/vhost.c |   34 ++
  1 file changed, 30 insertions(+), 4 deletions(-)
 
 Index: work/drivers/vhost/vhost.c
 ===
 --- work.orig/drivers/vhost/vhost.c
 +++ work/drivers/vhost/vhost.c
 @@ -23,6 +23,7 @@
  #include linux/highmem.h
  #include linux/slab.h
  #include linux/kthread.h
 +#include linux/cgroup.h
 
  #include linux/net.h
  #include linux/if_packet.h
 @@ -187,11 +188,29 @@ long vhost_dev_init(struct vhost_dev *dev,
   struct vhost_virtqueue *vqs, int nvqs)
  {
   struct task_struct *worker;
 - int i;
 + cpumask_var_t mask;
 + int i, ret = -ENOMEM;
 +
 + if (!alloc_cpumask_var(&mask, GFP_KERNEL))
 +     goto out_free_mask;
 
   worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
 - if (IS_ERR(worker))
 - return PTR_ERR(worker);
 + if (IS_ERR(worker)) {
 + ret = PTR_ERR(worker);
 + goto out_free_mask;
 + }
 +
 + ret = sched_getaffinity(current->pid, mask);
 + if (ret)
 + goto out_stop_worker;
 +
 + ret = sched_setaffinity(worker->pid, mask);
 + if (ret)
 + goto out_stop_worker;
 +
 + ret = cgroup_attach_task_current_cg(worker);
 + if (ret)
 + goto out_stop_worker;
 
   dev->vqs = vqs;
   dev->nvqs = nvqs;
 @@ -214,7 +233,14 @@ long vhost_dev_init(struct vhost_dev *dev,
   }
 
   wake_up_process(worker);/* avoid contributing to loadavg */
 - return 0;
 + ret = 0;
 + goto out_free_mask;
 +
 +out_stop_worker:
 + kthread_stop(worker);
 +out_free_mask:
 + free_cpumask_var(mask);
 + return ret;
  }
 
  /* Caller should have device mutex */


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Avi Kivity

On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:



  It can't program the iommu.
What
the patch proposes is that userspace tells vfio about the needed
mappings, and vfio programs the iommu.
 

There seems to be some misunderstanding.  The userspace interface
proposed forces a separate domain per device and forces userspace to
repeat iommu programming for each device.  We are better off sharing a
domain between devices and programming the iommu once.
   


  iommufd = open("/dev/iommu");
  ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
  ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)

?

If so, I agree.
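
Spelling the idea out a little further -- a hypothetical usage sketch,
extending the pseudo-code above; the fd names and ioctl names are
illustrative assumptions, not an existing ABI:

    /* One iommu object, programmed once, shared by several devices. */
    int iommufd = open("/dev/iommu", O_RDWR);

    /* Map the process memory once for the whole domain. */
    ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, &range);      /* assumed ioctl */

    /* Attach any number of device fds to the same domain; no per-device
     * mapping pass is needed. */
    ioctl(vfio_fd_a, VFIO_SET_IOMMU, iommufd);         /* assumed ioctl */
    ioctl(vfio_fd_b, VFIO_SET_IOMMU, iommufd);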


The natural way to do this is to have an iommu driver for programming
iommu.

This likely means we will have to pass the domain to 'vfio' or uio or
whatever the driver that gives userspace the access to device is called,
but this is only for security, there's no need to support programming
iommu there.

And using this design means the uio framework changes
required would be minor, so we won't have to duplicate code.
   


Since vfio would be the only driver, there would be no duplication.  But 
a separate object for the iommu mapping is a good thing.  Perhaps we can 
even share it with vhost (without actually using the mmu, since vhost is 
software only).


--
error compiling committee.c: too many arguments to function



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Michael S. Tsirkin
On Tue, Jun 01, 2010 at 01:28:48PM +0300, Avi Kivity wrote:
 On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:

   It can't program the iommu.
 What
 the patch proposes is that userspace tells vfio about the needed
 mappings, and vfio programs the iommu.
  
 There seems to be some misunderstanding.  The userspace interface
 proposed forces a separate domain per device and forces userspace to
 repeat iommu programming for each device.  We are better off sharing a
 domain between devices and programming the iommu once.


   iommufd = open("/dev/iommu");
   ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
   ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)

 ?

Yes.

 If so, I agree.

Good.

 The natural way to do this is to have an iommu driver for programming
 iommu.

 This likely means we will have to pass the domain to 'vfio' or uio or
 whatever the driver that gives userspace the access to device is called,
 but this is only for security, there's no need to support programming
 iommu there.

 And using this design means the uio framework changes
 required would be minor, so we won't have to duplicate code.


 Since vfio would be the only driver, there would be no duplication.  But  
 a separate object for the iommu mapping is a good thing.  Perhaps we can  
 even share it with vhost (without actually using the mmu, since vhost is  
 software only).

The main difference is that vhost works fine with unlocked memory,
paging it in on demand. The iommu needs to unmap memory when it is
swapped out or relocated.
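
To make the contrast concrete, a rough sketch of a pin-and-map path;
the iommu_map() call is shown with the later kernel API shape as an
assumption, not the exact interface of the day:

    /* Rough sketch: device DMA requires pinning pages and programming
     * the iommu up front, unlike vhost's on-demand user memory access.
     * iommu_map()'s signature here is an assumption. */
    static int example_pin_and_map(struct iommu_domain *domain,
                                   unsigned long uaddr, unsigned long iova)
    {
            struct page *page;
            int ret;

            ret = get_user_pages_fast(uaddr, 1, 1, &page); /* pin: no swap-out */
            if (ret != 1)
                    return -EFAULT;

            ret = iommu_map(domain, iova, page_to_phys(page), PAGE_SIZE,
                            IOMMU_READ | IOMMU_WRITE);
            if (ret)
                    put_page(page);
            return ret;
    }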

 -- 
 error compiling committee.c: too many arguments to function


Re: [PATCH][RESEND] Print a user-friendly message on failed vmentry

2010-06-01 Thread Mohammed Gamal
On Tue, Jun 1, 2010 at 11:59 AM, Avi Kivity a...@redhat.com wrote:
 On 05/31/2010 10:40 PM, Mohammed Gamal wrote:

 This patch address bug report in
 https://bugs.launchpad.net/qemu/+bug/530077.

 Failed vmentries were handled with handle_unhandled() which prints a
 rather
 unfriendly message to the user. This patch separates handling vmentry
 failures
 from unknown exit reasons and prints a friendly message to the user.



 +#define VMX_INVALID_GUEST_STATE 0x80000021
 +
 +static int handle_failed_vmentry(uint64_t reason)
 +{
 +    fprintf(stderr, "kvm: vm entry failed with error 0x%" PRIx64 "\n\n",
 +            reason);
 +
 +    /* Perhaps we will need to check if this machine is intel since exit
 +       reason 0x21 has a different interpretation on SVM */
 +    if (reason == VMX_INVALID_GUEST_STATE) {
 +        fprintf(stderr, "If you're running a guest on an Intel machine without\n");
 +        fprintf(stderr, "unrestricted mode support, the failure can be most likely\n");
 +        fprintf(stderr, "due to the guest entering an invalid state for Intel VT.\n");
 +        fprintf(stderr, "For example, the guest may be running in big real mode\n");
 +        fprintf(stderr, "which is not supported on less recent Intel processors.\n\n");
 +        fprintf(stderr, "You may want to try enabling KVM real mode emulation. To\n");
 +        fprintf(stderr, "enable it, you can run the following commands as root:\n");
 +        fprintf(stderr, "# rmmod kvm_intel\n");
 +        fprintf(stderr, "# rmmod kvm\n");
 +        fprintf(stderr, "# modprobe kvm_intel emulate_invalid_guest_state=1\n\n");
 +        fprintf(stderr, "WARNING: Real mode emulation is still work-in-progress\n");
 +        fprintf(stderr, "and thus it is not always guaranteed to work.\n\n");
 +    }
 +
 +    return -EINVAL;
 +}



 It's almost guaranteed to fail, isn't it?  Is there any guest which fails
 with emulate_invalid_guest_state=0 but works with e_i_g_s=1?

You're right! Perhaps I should remove the e_i_g_s bit from the
message. What do you think?

 --
 error compiling committee.c: too many arguments to function




Re: Any comments? Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-06-01 Thread Marcelo Tosatti
On Mon, May 24, 2010 at 04:05:29PM +0900, Takuya Yoshikawa wrote:
 (2010/05/17 18:06), Takuya Yoshikawa wrote:
 
 User allocated bitmaps have the advantage of reducing pinned memory.
 However we have plenty more pinned memory allocated in memory slots, so
 by itself, user allocated bitmaps don't justify this change.
 
 Sorry for pinging several times.
 
 
 In that sense, what do you think about the question I sent last week?
 
 === REPOST 1 ===
  
   mark_page_dirty is called with the mmu_lock spinlock held in set_spte.
   Must find a way to move it outside of the spinlock section.
 
 I am now trying to do something to solve this spinlock problem. But the
 spinlock section looks too wide to solve with simple workaround.
 
 Sorry but I have to say that mmu_lock spin_lock problem was completely
 out of
 my mind. Although I looked through the code, it seems not easy to move the
 set_bit_user to outside of spinlock section without breaking the
 semantics of
 its protection.
 
 So this may take some time to solve.
 
 But personally, I want to do something for x86's vmalloc()-every-time
 problem even though moving dirty bitmaps to user space cannot be achieved
 soon.
 
 In that sense, do you mind if we do double buffering without moving
 dirty bitmaps to
 user space?
 
 So I would be happy if you give me any comments about this kind of other
 options.

What if you pin the bitmaps? 

The alternative to that is to move mark_page_dirty(gfn) before acquisition
of mmu_lock, in the page fault paths. The downside of that is a
potentially (large?) number of false positives in the dirty bitmap.



Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost workers

2010-06-01 Thread Tejun Heo
Hello,

On 06/01/2010 12:17 PM, Michael S. Tsirkin wrote:
 Something that I wanted to figure out - what happens if the
 CPU mask limits us to a certain CPU that subsequently goes offline?

The thread gets unbound during the last steps of cpu offlining.

 Will e.g. flush block forever or until that CPU comes back?
 Also, does singlethreaded workqueue behave in the same way?

So, things will proceed as usual although the thread will lose its
affinity.  Singlethread wqs don't bind their workers (and they
shouldn't! :-).  MT ones explicitly manage workers according to cpu
up/down events.

Thanks.

-- 
tejun


Latest GIT doesn't work with long device names

2010-06-01 Thread Peter Lieven

Hi,

I just compiled latest git to work on Bug #585113.

Unfortunately, I can't start the VMs with the device mappings generated
by our multipath setup.

cmdline:
/usr/bin/qemu-kvm-devel  -net none  -drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-88961b105-19f000e7e7d4beaa-test1,if=scsi,boot=on,cache=none,aio=native  
-drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-89661b105-458000e7e804beaa-test2,if=scsi,boot=off,cache=none,aio=native  
-drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-8a361b105-928000e7e834beaa-test3,if=scsi,boot=off,cache=none,aio=native  
-drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-8af61b105-d1e7e864beaa-test4,if=scsi,boot=off,cache=none,aio=native  
-m 1024 -cpu qemu64,model_id='Intel(R) Xeon(R) CPU   E5520  @ 
2.27GHz'  -monitor tcp:0:4005,server,nowait -vnc :5 -name 'ide-test'  
-boot order=dc,menu=on  -cdrom 
/home/kvm/cdrom//root/KNOPPIX_V6.2.1CD-2010-01-31-DE.iso -k de  -pidfile 
/var/run/qemu/vm-153.pid  -mem-path /hugepages -mem-prealloc  -rtc 
base=utc,clock=host -usb -usbdevice tablet


error:
qemu: could not open disk image 
/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-88961b105-19f000e7e7d4beaa-test1: 
Argument list too long


Can somebody help here please? I just started digging into the code
and found that filename is a character array of 1024 bytes. So the
problem must be somewhere else...

BR,
Peter



Re: [Qemu-devel] Latest GIT doesn't work with long device names

2010-06-01 Thread Kevin Wolf
Am 01.06.2010 12:59, schrieb Peter Lieven:
 Hi,
 
 I just compiled latest git to work on Bug #585113 .
 
 Unfortunately, I can't start the the VMs with the device mappings 
 generated by our multipath
 setup.
 
 cmdline:
 /usr/bin/qemu-kvm-devel  -net none  -drive 
 file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-88961b105-19f000e7e7d4beaa-test1,if=scsi,boot=on,cache=none,aio=native
   
 -drive 
 file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-89661b105-458000e7e804beaa-test2,if=scsi,boot=off,cache=none,aio=native
   
 -drive 
 file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-8a361b105-928000e7e834beaa-test3,if=scsi,boot=off,cache=none,aio=native
   
 -drive 
 file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-8af61b105-d1e7e864beaa-test4,if=scsi,boot=off,cache=none,aio=native
   
 -m 1024 -cpu qemu64,model_id='Intel(R) Xeon(R) CPU   E5520  @ 
 2.27GHz'  -monitor tcp:0:4005,server,nowait -vnc :5 -name 'ide-test'  
 -boot order=dc,menu=on  -cdrom 
 /home/kvm/cdrom//root/KNOPPIX_V6.2.1CD-2010-01-31-DE.iso -k de  -pidfile 
 /var/run/qemu/vm-153.pid  -mem-path /hugepages -mem-prealloc  -rtc 
 base=utc,clock=host -usb -usbdevice tablet
 
 error:
 qemu: could not open disk image 
 /dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-88961b105-19f000e7e7d4beaa-test1:
  
 Argument list too long

Might be a problem with the colon in your file name (the error message
is nonsense - I'll send a patch). It takes everything before the first
colon as a protocol name. To work around this you can specify the
protocol explicitly, as in -drive file=host_device:/dev/mapper/...
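
For reference, the misparse happens roughly like this -- an illustrative
sketch of the protocol-guessing behavior, not the exact qemu code:

    #include <stdio.h>
    #include <string.h>

    /* Illustrative sketch: everything before the first ':' is treated as
     * a protocol name, which is why a device-mapper path containing a
     * colon needs an explicit "host_device:" prefix. */
    static void split_protocol(const char *filename)
    {
            const char *colon = strchr(filename, ':');

            if (colon)
                    printf("protocol guess: %.*s\n",
                           (int)(colon - filename), filename);
            else
                    printf("plain file: %s\n", filename);
    }

    int main(void)
    {
            split_protocol("/dev/mapper/iqn.2001-05.com.equallogic:0-...-test1");
            split_protocol("host_device:/dev/mapper/...");
            return 0;
    }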

Kevin


Re: [Qemu-devel] Latest GIT doesn't work with long device names

2010-06-01 Thread Peter Lieven

Kevin Wolf wrote:

Am 01.06.2010 12:59, schrieb Peter Lieven:
  

Hi,

I just compiled latest git to work on Bug #585113 .

Unfortunately, I can't start the the VMs with the device mappings 
generated by our multipath

setup.

cmdline:
/usr/bin/qemu-kvm-devel  -net none  -drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-88961b105-19f000e7e7d4beaa-test1,if=scsi,boot=on,cache=none,aio=native  
-drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-89661b105-458000e7e804beaa-test2,if=scsi,boot=off,cache=none,aio=native  
-drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-8a361b105-928000e7e834beaa-test3,if=scsi,boot=off,cache=none,aio=native  
-drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-8af61b105-d1e7e864beaa-test4,if=scsi,boot=off,cache=none,aio=native  
-m 1024 -cpu qemu64,model_id='Intel(R) Xeon(R) CPU   E5520  @ 
2.27GHz'  -monitor tcp:0:4005,server,nowait -vnc :5 -name 'ide-test'  
-boot order=dc,menu=on  -cdrom 
/home/kvm/cdrom//root/KNOPPIX_V6.2.1CD-2010-01-31-DE.iso -k de  -pidfile 
/var/run/qemu/vm-153.pid  -mem-path /hugepages -mem-prealloc  -rtc 
base=utc,clock=host -usb -usbdevice tablet


error:
qemu: could not open disk image 
/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-88961b105-19f000e7e7d4beaa-test1: 
Argument list too long



Might be a problem with the colon in your file name (the error message
is nonsense - I'll send a patch). It takes everything before the first
colon as a protocol name. To work around this you can specify the
protocol explicitly, as in -drive file=host_device:/dev/mapper/...
  


mmh, doesn't work.

qemu: could not open disk image 
host_device:/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-88961b105-19f000e7e7d4beaa-test1: 
No such file or directory




Kevin


  



--
Mit freundlichen Grüßen/Kind Regards

Peter Lieven

..

  KAMP Netzwerkdienste GmbH
  Vestische Str. 89-91 | 46117 Oberhausen
  Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
  mailto:p...@kamp.de | http://www.kamp.de

  Geschäftsführer: Heiner Lante | Michael Lante
  Amtsgericht Duisburg | HRB Nr. 12154
  USt-Id-Nr.: DE 120607556

.




Re: [Qemu-devel] Latest GIT doesn't work with long device names

2010-06-01 Thread Peter Lieven

Kevin Wolf wrote:

Am 01.06.2010 12:59, schrieb Peter Lieven:
  

Hi,

I just compiled latest git to work on Bug #585113 .

Unfortunately, I can't start the the VMs with the device mappings 
generated by our multipath

setup.

cmdline:
/usr/bin/qemu-kvm-devel  -net none  -drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-88961b105-19f000e7e7d4beaa-test1,if=scsi,boot=on,cache=none,aio=native  
-drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-89661b105-458000e7e804beaa-test2,if=scsi,boot=off,cache=none,aio=native  
-drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-8a361b105-928000e7e834beaa-test3,if=scsi,boot=off,cache=none,aio=native  
-drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-8af61b105-d1e7e864beaa-test4,if=scsi,boot=off,cache=none,aio=native  
-m 1024 -cpu qemu64,model_id='Intel(R) Xeon(R) CPU   E5520  @ 
2.27GHz'  -monitor tcp:0:4005,server,nowait -vnc :5 -name 'ide-test'  
-boot order=dc,menu=on  -cdrom 
/home/kvm/cdrom//root/KNOPPIX_V6.2.1CD-2010-01-31-DE.iso -k de  -pidfile 
/var/run/qemu/vm-153.pid  -mem-path /hugepages -mem-prealloc  -rtc 
base=utc,clock=host -usb -usbdevice tablet


error:
qemu: could not open disk image 
/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-88961b105-19f000e7e7d4beaa-test1: 
Argument list too long



Might be a problem with the colon in your file name (the error message
is nonsense - I'll send a patch). It takes everything before the first
colon as a protocol name. To work around this you can specify the
protocol explicitly, as in -drive file=host_device:/dev/mapper/...
  

Applied your patch; the actual error is (e.g.):
qemu: could not open disk image 
/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-e1861b105-4cb000e7e8c4bf3b-opensuse-11-2-i586: 
No such file or directory


while the image is there:

r...@test:/home/kvm/src# ls -la 
/dev/mapper/iqn.2001-05.com.equallogic\:0-8a0906-e1861b105-4cb000e7e8c4bf3b-opensuse-11-2-i586 

brw-rw 1 root disk 254, 0 2010-06-01 13:48 
/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-e1861b105-4cb000e7e8c4bf3b-opensuse-11-2-i586


peter

Kevin


  




Re: Perf trace event parse errors for KVM events

2010-06-01 Thread Steven Rostedt
On Tue, 2010-06-01 at 11:39 +0300, Avi Kivity wrote:
 On 05/30/2010 06:34 PM, Steven Rostedt wrote:
 
  Cool.  May make sense to use simpler formatting in the kernel, and use
  trace-cmd plugins for the complicated cases.
 
  It does raise issues with ABIs.  Can trace-cmd read plugins from
  /lib/modules/*?  We can then distribute the plugins with the kernel.
 
   
  We probably could do that. Perhaps if we can port the python code to
  perf, then it would work for both. Then the plugins could be just python
  scripts, (or binary .so modules) and have a tools/plugins dir?
 
  The python part probably would be easier to port, since the .so modules
  are a bit more specific to trace-cmd.
 
 
 One concern is performance.  Traces tend to be long, and running python 
 code on each line will be slow.
 
 If trace-cmd integrates a pager and a search mechanism that looks at the 
 binary data instead of the text, we could format only the lines that are 
 displayed.  But that is going to be a lot of work and I don't think it's 
 worth the effort.
 

Every event gets its own ID. The plugin registers a callback to that ID.
When the ID is hit, the plugin is executed on that event to display its
binary format.

This is done after the data has been saved in binary format to a file.
It may slow down reading a data file, but it does not affect the
running of the trace one bit.
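
For reference, a skeleton of such a plugin; the function names follow the
trace-cmd plugin API of this era, and the exact signatures should be
treated as assumptions:

    /* Skeleton trace-cmd binary plugin: register a callback keyed on an
     * event, and format its binary record only when it is displayed. */
    #include "parse-events.h"

    static int kvm_exit_handler(struct trace_seq *s, struct record *record,
                                struct event_format *event, void *context)
    {
            unsigned long long reason;

            if (pevent_get_field_val(s, event, "exit_reason",
                                     record, &reason, 1))
                    return trace_seq_putc(s, '!');

            trace_seq_printf(s, "reason %llu", reason);
            return 0;
    }

    int PEVENT_PLUGIN_LOADER(struct pevent *pevent)
    {
            pevent_register_event_handler(pevent, -1, "kvm", "kvm_exit",
                                          kvm_exit_handler, NULL);
            return 0;
    }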

-- Steve




Re: Any comments? Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-06-01 Thread Takuya Yoshikawa

(2010/06/01 19:55), Marcelo Tosatti wrote:


Sorry but I have to say that mmu_lock spin_lock problem was completely
out of
my mind. Although I looked through the code, it seems not easy to move the
set_bit_user to outside of spinlock section without breaking the
semantics of
its protection.

So this may take some time to solve.

But personally, I want to do something for x86's vmalloc()-every-time
problem even though moving dirty bitmaps to user space cannot be achieved
soon.

In that sense, do you mind if we do double buffering without moving
dirty bitmaps to
user space?


So I would be happy if you give me any comments about this kind of other
options.


What if you pin the bitmaps?


Yes, pinning bitmaps works. The small problem is that we would need to hold
a dirty_bitmap_pages[] array for every slot, whose size depends on the slot
length, and of course do the pinning itself.

From the performance point of view, having a double-sized vmalloc'ed
area may be better.



The alternative to that is to move mark_page_dirty(gfn) before acquisition
of mmu_lock, in the page fault paths. The downside of that is a
potentially (large?) number of false positives in the dirty bitmap.



Interesting, but probably dangerous.


From my experience, though this includes my personal view, removing the
vmalloc currently done by x86 every time is the simplest and most effective
change.

So if you don't mind, I want to double the size of the vmalloc'ed area for
x86 without changing other parts (see the sketch below).

 == if this one more bitmap is problematic, dirty logging itself would be
 in danger of failure: we need to have the same size at the moment of the
 switch.

Make sense?
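
A sketch of the double-buffering idea, with assumed names, illustrative
only -- the real change would live in the GET_DIRTY_LOG path:

    /* Allocate one vmalloc'ed area twice the bitmap size and flip halves
     * on each GET_DIRTY_LOG instead of vmalloc/vfree every call. */
    struct example_slot_bitmaps {
            unsigned long *dirty_bitmap;        /* half being written now */
            unsigned long *dirty_bitmap_head;   /* start of the 2x area */
            unsigned long  nbits;
    };

    static unsigned long *example_switch_bitmap(struct example_slot_bitmaps *s)
    {
            unsigned long *old = s->dirty_bitmap;
            size_t half = BITS_TO_LONGS(s->nbits) * sizeof(long);

            /* flip to the other half and clear it for new dirty bits */
            if (old == s->dirty_bitmap_head)
                    s->dirty_bitmap = (void *)s->dirty_bitmap_head + half;
            else
                    s->dirty_bitmap = s->dirty_bitmap_head;
            memset(s->dirty_bitmap, 0, half);

            return old;     /* report this half to userspace */
    }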


We can consider moving dirty bitmaps to user space later.


Re: [Qemu-devel] Latest GIT doesn't work with long device names

2010-06-01 Thread Peter Lieven

Kevin Wolf wrote:

Am 01.06.2010 12:59, schrieb Peter Lieven:
  

Hi,

I just compiled latest git to work on Bug #585113 .

Unfortunately, I can't start the the VMs with the device mappings 
generated by our multipath

setup.

cmdline:
/usr/bin/qemu-kvm-devel  -net none  -drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-88961b105-19f000e7e7d4beaa-test1,if=scsi,boot=on,cache=none,aio=native  
-drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-89661b105-458000e7e804beaa-test2,if=scsi,boot=off,cache=none,aio=native  
-drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-8a361b105-928000e7e834beaa-test3,if=scsi,boot=off,cache=none,aio=native  
-drive 
file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-8af61b105-d1e7e864beaa-test4,if=scsi,boot=off,cache=none,aio=native  
-m 1024 -cpu qemu64,model_id='Intel(R) Xeon(R) CPU   E5520  @ 
2.27GHz'  -monitor tcp:0:4005,server,nowait -vnc :5 -name 'ide-test'  
-boot order=dc,menu=on  -cdrom 
/home/kvm/cdrom//root/KNOPPIX_V6.2.1CD-2010-01-31-DE.iso -k de  -pidfile 
/var/run/qemu/vm-153.pid  -mem-path /hugepages -mem-prealloc  -rtc 
base=utc,clock=host -usb -usbdevice tablet


error:
qemu: could not open disk image 
/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-88961b105-19f000e7e7d4beaa-test1: 
Argument list too long



Might be a problem with the colon in your file name (the error message
is nonsense - I'll send a patch). It takes everything before the first
colon as a protocol name. To work around this you can specify the
protocol explicitly, as in -drive file=host_device:/dev/mapper/...
  

-drive format=host_device,file=/dev/mapper/...

solves the problem

Kevin


  



--
Mit freundlichen Grüßen/Kind Regards

Peter Lieven

..

  KAMP Netzwerkdienste GmbH
  Vestische Str. 89-91 | 46117 Oberhausen
  Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
  mailto:p...@kamp.de | http://www.kamp.de

  Geschäftsführer: Heiner Lante | Michael Lante
  Amtsgericht Duisburg | HRB Nr. 12154
  USt-Id-Nr.: DE 120607556

. 




[PATCH v4] Add Documentation/kvm/msr.txt

2010-06-01 Thread Glauber Costa
This patch adds a file that documents the usage of KVM-specific
MSRs.

[ v2: added comments from Randy ]
[ v3: added comments from Avi ]
[ v4: added information about wallclock alignment ]

Signed-off-by: Glauber Costa glom...@redhat.com
Reviewed-by: Randy Dunlap randy.dun...@oracle.com
---
 Documentation/kvm/msr.txt |  153 +
 1 files changed, 153 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/kvm/msr.txt

diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt
new file mode 100644
index 000..a8502e3
--- /dev/null
+++ b/Documentation/kvm/msr.txt
@@ -0,0 +1,153 @@
+KVM-specific MSRs.
+Glauber Costa glom...@redhat.com, Red Hat Inc, 2010
+=====================================================
+
+KVM makes use of some custom MSRs to service some requests.
+At present, this facility is only used by kvmclock.
+
+Custom MSRs have a range reserved for them, that goes from
+0x4b564d00 to 0x4b564dff. There are MSRs outside this area,
+but they are deprecated and their use is discouraged.
+
+Custom MSR list
+
+
+The current supported Custom MSR list is:
+
+MSR_KVM_WALL_CLOCK_NEW:   0x4b564d00
+
+   data: 4-byte aligned physical address of a memory area which must be
+   in guest RAM. This memory is expected to hold a copy of the following
+   structure:
+
+   struct pvclock_wall_clock {
+   u32   version;
+   u32   sec;
+   u32   nsec;
+   } __attribute__((__packed__));
+
+   whose data will be filled in by the hypervisor. The hypervisor is only 
+   guaranteed to update this data at the moment of MSR write.
+   Users that want to reliably query this information more than once have
+   to write more than once to this MSR. Fields have the following meanings:
+
+   version: guest has to check version before and after grabbing
+   time information and check that they are both equal and even.
+   An odd version indicates an in-progress update.
+
+   sec: number of seconds for wallclock.
+
+   nsec: number of nanoseconds for wallclock.
+
+   Note that although MSRs are per-CPU entities, the effect of this
+   particular MSR is global.
+
+   Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
+   leaf prior to usage.
+
+MSR_KVM_SYSTEM_TIME_NEW:  0x4b564d01
+
+   data: 4-byte aligned physical address of a memory area which must be in
+   guest RAM, plus an enable bit in bit 0. This memory is expected to hold
+   a copy of the following structure:
+
+   struct pvclock_vcpu_time_info {
+   u32   version;
+   u32   pad0;
+   u64   tsc_timestamp;
+   u64   system_time;
+   u32   tsc_to_system_mul;
+   s8tsc_shift;
+   u8flags;
+   u8pad[2];
+   } __attribute__((__packed__)); /* 32 bytes */
+
+   whose data will be filled in by the hypervisor periodically. Only one
+   write, or registration, is needed for each VCPU. The interval between
+   updates of this structure is arbitrary and implementation-dependent.
+   The hypervisor may update this structure at any time it sees fit until
+   anything with bit0 == 0 is written to it.
+
+   Fields have the following meanings:
+
+   version: guest has to check version before and after grabbing
+   time information and check that they are both equal and even.
+   An odd version indicates an in-progress update.
+
+   tsc_timestamp: the tsc value at the current VCPU at the time
+   of the update of this structure. Guests can subtract this value
+   from current tsc to derive a notion of elapsed time since the
+   structure update.
+
+   system_time: a host notion of monotonic time, including sleep
+   time at the time this structure was last updated. Unit is
+   nanoseconds.
+
+   tsc_to_system_mul: a function of the tsc frequency. One has
+   to multiply any tsc-related quantity by this value to get
+   a value in nanoseconds, besides dividing by 2^tsc_shift
+
+   tsc_shift: cycle to nanosecond divider, as a power of two, to
+   allow for shift rights. One has to shift right any tsc-related
+   quantity by this value to get a value in nanoseconds, besides
+   multiplying by tsc_to_system_mul.
+
+   With this information, guests can derive per-CPU time by
+   doing:
+
+   time = (current_tsc - tsc_timestamp)
+   time = (time * tsc_to_system_mul) >> tsc_shift
+   time = time + system_time
+
+   flags: bits in this field indicate extended capabilities
+   coordinated between the guest and the hypervisor.
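
Putting the fields above together, a hedged sketch of the guest-side read
loop (rdtsc() is an assumed TSC-read helper, the arithmetic follows the
formula above, and production code would also need read barriers around the
field accesses):

	#include <stdint.h>

	struct pvclock_vcpu_time_info {        /* mirrors the layout above */
		uint32_t version;
		uint32_t pad0;
		uint64_t tsc_timestamp;
		uint64_t system_time;
		uint32_t tsc_to_system_mul;
		int8_t   tsc_shift;
		uint8_t  flags;
		uint8_t  pad[2];
	} __attribute__((__packed__));

	extern uint64_t rdtsc(void);           /* assumed helper */

	static uint64_t read_system_time_ns(volatile struct pvclock_vcpu_time_info *ti)
	{
		uint32_t version;
		uint64_t t;

		do {
			version = ti->version;  /* odd: update in progress */
			t = rdtsc() - ti->tsc_timestamp;
			/* a negative tsc_shift would mean a left shift */
			t = (t * ti->tsc_to_system_mul) >> ti->tsc_shift;
			t += ti->system_time;
		} while ((version & 1) || version != ti->version);

		return t;
	}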

[PATCH 4/5] Use skeleton of upstream's kvm_arch_init_vcpu()

2010-06-01 Thread Avi Kivity
Currently all content from qemu-kvm's kvm_arch_init_vcpu().

Signed-off-by: Avi Kivity a...@redhat.com
---
 qemu-kvm-x86.c|2 +-
 target-i386/kvm.c |   17 +++--
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index f5c76bc..853d50e 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -1129,7 +1129,7 @@ static void kvm_trim_features(uint32_t *features, 
uint32_t supported)
 }
 }
 
-int kvm_arch_init_vcpu(CPUState *env)
+static int _kvm_arch_init_vcpu(CPUState *env)
 {
 struct kvm_cpuid_entry2 cpuid_ent[100];
 #ifdef KVM_CPUID_SIGNATURE
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index ac9e17b..8db6d54 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -163,10 +163,13 @@ static int get_para_features(CPUState *env)
 }
 #endif
 
-#ifdef KVM_UPSTREAM
+static int _kvm_arch_init_vcpu(CPUState *env);
 
 int kvm_arch_init_vcpu(CPUState *env)
 {
+int r;
+#ifdef KVM_UPSTREAM
+
 struct {
 struct kvm_cpuid2 cpuid;
 struct kvm_cpuid_entry2 entries[100];
@@ -178,6 +181,15 @@ int kvm_arch_init_vcpu(CPUState *env)
 uint32_t signature[3];
 #endif
 
+#endif
+
+    r = _kvm_arch_init_vcpu(env);
+    if (r < 0) {
+        return r;
+    }
+
+#ifdef KVM_UPSTREAM
+
 env->mp_state = KVM_MP_STATE_RUNNABLE;
 
 env->cpuid_features = kvm_arch_get_supported_cpuid(env, 1, R_EDX);
@@ -273,9 +285,10 @@ int kvm_arch_init_vcpu(CPUState *env)
 cpuid_data.cpuid.nent = cpuid_i;
 
 return kvm_vcpu_ioctl(env, KVM_SET_CPUID2, &cpuid_data);
+#endif
+return 0;
 }
 
-#endif
 void kvm_arch_reset_vcpu(CPUState *env)
 {
 env-exception_injected = -1;
-- 
1.7.1
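
For readers following the series: after this patch the upstream entry point
is just a thin wrapper, roughly (a sketch of the diff above, with the
upstream body still compiled out until patch 5/5 enables it):

	int kvm_arch_init_vcpu(CPUState *env)
	{
	    int r;

	    r = _kvm_arch_init_vcpu(env);    /* all current qemu-kvm content */
	    if (r < 0) {
	        return r;
	    }
	#ifdef KVM_UPSTREAM
	    /* upstream cpuid setup goes here once enabled */
	#endif
	    return 0;
	}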

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/5] Make get_para_features() similar to upstream

2010-06-01 Thread Avi Kivity
Accept a CPUState parameter instead of a kvm_context_t.

Signed-off-by: Avi Kivity a...@redhat.com
---
 qemu-kvm-x86.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 95b7aa5..0eb4060 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -1132,7 +1132,7 @@ struct kvm_para_features {
{ -1, -1 }
 };
 
-static int get_para_features(kvm_context_t kvm_context)
+static int get_para_features(CPUState *env)
 {
int i, features = 0;
 
@@ -1184,7 +1184,7 @@ int kvm_arch_init_vcpu(CPUState *cenv)
 pv_ent = &cpuid_ent[cpuid_nent++];
 memset(pv_ent, 0, sizeof(*pv_ent));
 pv_ent->function = KVM_CPUID_FEATURES;
-pv_ent->eax = get_para_features(kvm_context);
+pv_ent->eax = get_para_features(cenv);
 #endif
 
 kvm_trim_features(cenv-cpuid_features,
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/5] Use get_para_features() from upstream

2010-06-01 Thread Avi Kivity
Signed-off-by: Avi Kivity a...@redhat.com
---
 qemu-kvm-x86.c|   28 
 target-i386/kvm.c |4 ++--
 2 files changed, 2 insertions(+), 30 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 0eb4060..3b9be6d 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -1116,34 +1116,6 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *e, 
uint32_t function,
 e->edx = env->regs[R_EDX];
 }
 
-struct kvm_para_features {
-   int cap;
-   int feature;
-} para_features[] = {
-#ifdef KVM_CAP_CLOCKSOURCE
-   { KVM_CAP_CLOCKSOURCE, KVM_FEATURE_CLOCKSOURCE },
-#endif
-#ifdef KVM_CAP_NOP_IO_DELAY
-   { KVM_CAP_NOP_IO_DELAY, KVM_FEATURE_NOP_IO_DELAY },
-#endif
-#ifdef KVM_CAP_PV_MMU
-   { KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP },
-#endif
-   { -1, -1 }
-};
-
-static int get_para_features(CPUState *env)
-{
-   int i, features = 0;
-
-   for (i = 0; i < ARRAY_SIZE(para_features)-1; i++) {
-       if (kvm_check_extension(kvm_state, para_features[i].cap))
-           features |= (1 << para_features[i].feature);
-   }
-
-   return features;
-}
-
 static void kvm_trim_features(uint32_t *features, uint32_t supported)
 {
 int i;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 87c1133..ac9e17b 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -133,8 +133,6 @@ uint32_t kvm_arch_get_supported_cpuid(CPUState *env, 
uint32_t function, int reg)
 
 #endif
 
-#ifdef KVM_UPSTREAM
-
 #ifdef CONFIG_KVM_PARA
 struct kvm_para_features {
 int cap;
@@ -165,6 +163,8 @@ static int get_para_features(CPUState *env)
 }
 #endif
 
+#ifdef KVM_UPSTREAM
+
 int kvm_arch_init_vcpu(CPUState *env)
 {
 struct {
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/5] Rename kvm_arch_vcpu_init()s cenv argument to env

2010-06-01 Thread Avi Kivity
Be more similar to upstream.

Signed-off-by: Avi Kivity a...@redhat.com
---
 qemu-kvm-x86.c |   42 +-
 1 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 3b9be6d..f5c76bc 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -1129,7 +1129,7 @@ static void kvm_trim_features(uint32_t *features, 
uint32_t supported)
 }
 }
 
-int kvm_arch_init_vcpu(CPUState *cenv)
+int kvm_arch_init_vcpu(CPUState *env)
 {
 struct kvm_cpuid_entry2 cpuid_ent[100];
 #ifdef KVM_CPUID_SIGNATURE
@@ -1140,7 +1140,7 @@ int kvm_arch_init_vcpu(CPUState *cenv)
 CPUState copy;
 uint32_t i, j, limit;
 
-kvm_arch_reset_vcpu(cenv);
+kvm_arch_reset_vcpu(env);
 
 #ifdef KVM_CPUID_SIGNATURE
 /* Paravirtualization CPUIDs */
@@ -1156,24 +1156,24 @@ int kvm_arch_init_vcpu(CPUState *cenv)
     pv_ent = &cpuid_ent[cpuid_nent++];
     memset(pv_ent, 0, sizeof(*pv_ent));
     pv_ent->function = KVM_CPUID_FEATURES;
-    pv_ent->eax = get_para_features(cenv);
+    pv_ent->eax = get_para_features(env);
 #endif
 
-    kvm_trim_features(&cenv->cpuid_features,
-                      kvm_arch_get_supported_cpuid(cenv, 1, R_EDX));
+    kvm_trim_features(&env->cpuid_features,
+                      kvm_arch_get_supported_cpuid(env, 1, R_EDX));
 
     /* prevent the hypervisor bit from being cleared by the kernel */
-    i = cenv->cpuid_ext_features & CPUID_EXT_HYPERVISOR;
-    kvm_trim_features(&cenv->cpuid_ext_features,
-                      kvm_arch_get_supported_cpuid(cenv, 1, R_ECX));
-    cenv->cpuid_ext_features |= i;
+    i = env->cpuid_ext_features & CPUID_EXT_HYPERVISOR;
+    kvm_trim_features(&env->cpuid_ext_features,
+                      kvm_arch_get_supported_cpuid(env, 1, R_ECX));
+    env->cpuid_ext_features |= i;
 
-    kvm_trim_features(&cenv->cpuid_ext2_features,
-                      kvm_arch_get_supported_cpuid(cenv, 0x80000001, R_EDX));
-    kvm_trim_features(&cenv->cpuid_ext3_features,
-                      kvm_arch_get_supported_cpuid(cenv, 0x80000001, R_ECX));
+    kvm_trim_features(&env->cpuid_ext2_features,
+                      kvm_arch_get_supported_cpuid(env, 0x80000001, R_EDX));
+    kvm_trim_features(&env->cpuid_ext3_features,
+                      kvm_arch_get_supported_cpuid(env, 0x80000001, R_ECX));
 
-    copy = *cenv;
+    copy = *env;
 
 copy.regs[R_EAX] = 0;
 qemu_kvm_cpuid_on_env(copy);
@@ -1207,11 +1207,11 @@ int kvm_arch_init_vcpu(CPUState *cenv)
     for (i = 0x80000000; i <= limit; ++i)
        do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, 0, &copy);
 
-    kvm_setup_cpuid2(cenv, cpuid_nent, cpuid_ent);
+    kvm_setup_cpuid2(env, cpuid_nent, cpuid_ent);
 
 #ifdef KVM_CAP_MCE
-    if (((cenv->cpuid_version >> 8) & 0xF) >= 6
-        && (cenv->cpuid_features & (CPUID_MCE|CPUID_MCA)) == (CPUID_MCE|CPUID_MCA)
+    if (((env->cpuid_version >> 8) & 0xF) >= 6
+        && (env->cpuid_features & (CPUID_MCE|CPUID_MCA)) == (CPUID_MCE|CPUID_MCA)
         && kvm_check_extension(kvm_state, KVM_CAP_MCE) > 0) {
 uint64_t mcg_cap;
 int banks;
@@ -1223,18 +1223,18 @@ int kvm_arch_init_vcpu(CPUState *cenv)
             banks = MCE_BANKS_DEF;
         mcg_cap &= MCE_CAP_DEF;
         mcg_cap |= banks;
-        if (kvm_setup_mce(cenv, &mcg_cap))
+        if (kvm_setup_mce(env, &mcg_cap))
             perror("kvm_setup_mce FAILED");
         else
-            cenv->mcg_cap = mcg_cap;
+            env->mcg_cap = mcg_cap;
     }
 }
 #endif
 
 #ifdef KVM_EXIT_TPR_ACCESS
-    kvm_enable_tpr_access_reporting(cenv);
+    kvm_enable_tpr_access_reporting(env);
 #endif
-    kvm_reset_mpstate(cenv);
+    kvm_reset_mpstate(env);
 return 0;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/5] Use more cpuid bits from upstream

2010-06-01 Thread Avi Kivity
This patchset converts kvm_arch_vcpu_init()'s cpuid handling bits to use
upstream code.

Avi Kivity (5):
  Make get_para_features() similar to upstream
  Use get_para_features() from upstream
  Rename kvm_arch_vcpu_init()s cenv argument to env
  Use skeleton of upstream's kvm_arch_init_vcpu()
  Use upstream's kvm_arch_init_vcpu()'s cpuid bits

 qemu-kvm-x86.c|  148 +++--
 target-i386/kvm.c |   15 -
 2 files changed, 20 insertions(+), 143 deletions(-)

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/5] Use upstream's kvm_arch_init_vcpu()'s cpuid bits

2010-06-01 Thread Avi Kivity
Signed-off-by: Avi Kivity a...@redhat.com
---
 qemu-kvm-x86.c|  104 -
 target-i386/kvm.c |8 +---
 2 files changed, 2 insertions(+), 110 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 853d50e..3c33e64 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -1101,114 +1101,10 @@ void kvm_arch_save_regs(CPUState *env)
 kvm_get_debugregs(env);
 }
 
-static void do_cpuid_ent(struct kvm_cpuid_entry2 *e, uint32_t function,
-                         uint32_t count, CPUState *env)
-{
-    env->regs[R_EAX] = function;
-    env->regs[R_ECX] = count;
-    qemu_kvm_cpuid_on_env(env);
-    e->function = function;
-    e->flags = 0;
-    e->index = 0;
-    e->eax = env->regs[R_EAX];
-    e->ebx = env->regs[R_EBX];
-    e->ecx = env->regs[R_ECX];
-    e->edx = env->regs[R_EDX];
-}
-
-static void kvm_trim_features(uint32_t *features, uint32_t supported)
-{
-    int i;
-    uint32_t mask;
-
-    for (i = 0; i < 32; ++i) {
-        mask = 1U << i;
-        if ((*features & mask) && !(supported & mask)) {
-            *features &= ~mask;
-        }
-    }
-}
-
 static int _kvm_arch_init_vcpu(CPUState *env)
 {
-struct kvm_cpuid_entry2 cpuid_ent[100];
-#ifdef KVM_CPUID_SIGNATURE
-struct kvm_cpuid_entry2 *pv_ent;
-uint32_t signature[3];
-#endif
-int cpuid_nent = 0;
-CPUState copy;
-uint32_t i, j, limit;
-
 kvm_arch_reset_vcpu(env);
 
-#ifdef KVM_CPUID_SIGNATURE
-    /* Paravirtualization CPUIDs */
-    memcpy(signature, "KVMKVMKVM\0\0\0", 12);
-    pv_ent = &cpuid_ent[cpuid_nent++];
-    memset(pv_ent, 0, sizeof(*pv_ent));
-    pv_ent->function = KVM_CPUID_SIGNATURE;
-    pv_ent->eax = 0;
-    pv_ent->ebx = signature[0];
-    pv_ent->ecx = signature[1];
-    pv_ent->edx = signature[2];
-
-    pv_ent = &cpuid_ent[cpuid_nent++];
-    memset(pv_ent, 0, sizeof(*pv_ent));
-    pv_ent->function = KVM_CPUID_FEATURES;
-    pv_ent->eax = get_para_features(env);
-#endif
-
-    kvm_trim_features(&env->cpuid_features,
-                      kvm_arch_get_supported_cpuid(env, 1, R_EDX));
-
-    /* prevent the hypervisor bit from being cleared by the kernel */
-    i = env->cpuid_ext_features & CPUID_EXT_HYPERVISOR;
-    kvm_trim_features(&env->cpuid_ext_features,
-                      kvm_arch_get_supported_cpuid(env, 1, R_ECX));
-    env->cpuid_ext_features |= i;
-
-    kvm_trim_features(&env->cpuid_ext2_features,
-                      kvm_arch_get_supported_cpuid(env, 0x80000001, R_EDX));
-    kvm_trim_features(&env->cpuid_ext3_features,
-                      kvm_arch_get_supported_cpuid(env, 0x80000001, R_ECX));
-
-    copy = *env;
-
-    copy.regs[R_EAX] = 0;
-    qemu_kvm_cpuid_on_env(&copy);
-    limit = copy.regs[R_EAX];
-
-    for (i = 0; i <= limit; ++i) {
-        if (i == 4 || i == 0xb || i == 0xd) {
-            for (j = 0; ; ++j) {
-                do_cpuid_ent(&cpuid_ent[cpuid_nent], i, j, &copy);
-
-                cpuid_ent[cpuid_nent].flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
-                cpuid_ent[cpuid_nent].index = j;
-
-                cpuid_nent++;
-
-                if (i == 4 && copy.regs[R_EAX] == 0)
-                    break;
-                if (i == 0xb && !(copy.regs[R_ECX] & 0xff00))
-                    break;
-                if (i == 0xd && copy.regs[R_EAX] == 0)
-                    break;
-            }
-        } else
-            do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, 0, &copy);
-    }
-
-    copy.regs[R_EAX] = 0x80000000;
-    qemu_kvm_cpuid_on_env(&copy);
-    limit = copy.regs[R_EAX];
-
-    for (i = 0x80000000; i <= limit; ++i)
-        do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, 0, &copy);
-
-    kvm_setup_cpuid2(env, cpuid_nent, cpuid_ent);
-
 #ifdef KVM_CAP_MCE
     if (((env->cpuid_version >> 8) & 0xF) >= 6
         && (env->cpuid_features & (CPUID_MCE|CPUID_MCA)) == (CPUID_MCE|CPUID_MCA)
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 8db6d54..9cb9cf4 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -168,8 +168,6 @@ static int _kvm_arch_init_vcpu(CPUState *env);
 int kvm_arch_init_vcpu(CPUState *env)
 {
 int r;
-#ifdef KVM_UPSTREAM
-
 struct {
 struct kvm_cpuid2 cpuid;
 struct kvm_cpuid_entry2 entries[100];
@@ -181,8 +179,6 @@ int kvm_arch_init_vcpu(CPUState *env)
 uint32_t signature[3];
 #endif
 
-#endif
-
 r = _kvm_arch_init_vcpu(env);
 if (r < 0) {
 return r;
@@ -192,6 +188,8 @@ int kvm_arch_init_vcpu(CPUState *env)
 
 env->mp_state = KVM_MP_STATE_RUNNABLE;
 
+#endif
+
 env->cpuid_features = kvm_arch_get_supported_cpuid(env, 1, R_EDX);
 
 i = env->cpuid_ext_features & CPUID_EXT_HYPERVISOR;
@@ -285,8 +283,6 @@ int kvm_arch_init_vcpu(CPUState *env)
 cpuid_data.cpuid.nent = cpuid_i;
 
 return kvm_vcpu_ioctl(env, KVM_SET_CPUID2, &cpuid_data);
-#endif
-return 0;
 }
 
 void kvm_arch_reset_vcpu(CPUState *env)
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: Perf trace event parse errors for KVM events

2010-06-01 Thread Avi Kivity

On 06/01/2010 02:59 PM, Steven Rostedt wrote:



One concern is performance.  Traces tend to be long, and running python
code on each line will be slow.

If trace-cmd integrates a pager and a search mechanism that looks at the
binary data instead of the text, we could format only the lines that are
displayed.  But that is going to be a lot of work and I don't think it's
worth the effort.

 

Every event gets its own ID. The plugin registers a callback to that ID.
When the ID is hit, the plugin is executed on that event to display its
binary format.

This is done after the data has been saved in binary format to a file.
It may slow down reading back a data file, but it does not
affect the running of the trace one bit.
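
In other words, a per-ID dispatch along these lines (a hedged sketch with
illustrative names; the real trace-cmd plugin API differs in its details):

	#define MAX_HANDLERS 128

	typedef int (*event_handler_fn)(void *record, void *out);

	static struct {
		int id;               /* numeric event ID from the format files */
		event_handler_fn fn;  /* formats one binary record */
	} handlers[MAX_HANDLERS];
	static int nr_handlers;

	extern void default_format(void *record, void *out); /* assumed fallback */

	int register_event_handler(int id, event_handler_fn fn)
	{
		if (nr_handlers >= MAX_HANDLERS)
			return -1;
		handlers[nr_handlers].id = id;
		handlers[nr_handlers].fn = fn;
		nr_handlers++;
		return 0;
	}

	/* Runs when the saved file is read back, so tracing is unaffected. */
	void format_record(int id, void *record, void *out)
	{
		int i;

		for (i = 0; i < nr_handlers; i++) {
			if (handlers[i].id == id) {
				handlers[i].fn(record, out);
				return;
			}
		}
		default_format(record, out);
	}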
   


I meant that viewing would be slowed down.  It's an important part of 
using ftrace!


How long does the Python formatter take to process 100k or 1M events?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RESEND] Print a user-friendly message on failed vmentry

2010-06-01 Thread Avi Kivity

On 06/01/2010 01:55 PM, Mohammed Gamal wrote:

On Tue, Jun 1, 2010 at 11:59 AM, Avi Kivitya...@redhat.com  wrote:
   

On 05/31/2010 10:40 PM, Mohammed Gamal wrote:
 

This patch addresses the bug report in
https://bugs.launchpad.net/qemu/+bug/530077.

Failed vmentries were handled with handle_unhandled() which prints a
rather
unfriendly message to the user. This patch separates handling vmentry
failures
from unknown exit reasons and prints a friendly message to the user.



+#define VMX_INVALID_GUEST_STATE 0x80000021
+
+static int handle_failed_vmentry(uint64_t reason)
+{
+    fprintf(stderr, "kvm: vm entry failed with error 0x%" PRIx64 "\n\n",
+            reason);
+
+    /* Perhaps we will need to check if this machine is intel since exit
+       reason 0x21 has a different interpretation on SVM */
+    if (reason == VMX_INVALID_GUEST_STATE) {
+        fprintf(stderr, "If you're running a guest on an Intel machine without\n");
+        fprintf(stderr, "unrestricted mode support, the failure can be most likely\n");
+        fprintf(stderr, "due to the guest entering an invalid state for Intel VT.\n");
+        fprintf(stderr, "For example, the guest may be running in big real mode\n");
+        fprintf(stderr, "which is not supported on less recent Intel processors.\n\n");
+        fprintf(stderr, "You may want to try enabling KVM real mode emulation. To\n");
+        fprintf(stderr, "enable it, you can run the following commands as root:\n");
+        fprintf(stderr, "# rmmod kvm_intel\n");
+        fprintf(stderr, "# rmmod kvm\n");
+        fprintf(stderr, "# modprobe kvm_intel emulate_invalid_guest_state=1\n\n");
+        fprintf(stderr, "WARNING: Real mode emulation is still work-in-progress\n");
+        fprintf(stderr, "and thus it is not always guaranteed to work.\n\n");
+    }
+
+    return -EINVAL;
+}


   

It's almost guaranteed to fail, isn't it?  Is there any guest which fails
with emulate_invalid_guest_state=0 but works with e_i_g_s=1?

 

You're right! Perhaps I should remove the e_i_g_s bit from the
message. What do you think?
   



Sure.  Better than the current message.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Avi Kivity

On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:



Since vfio would be the only driver, there would be no duplication.  But
a separate object for the iommu mapping is a good thing.  Perhaps we can
even share it with vhost (without actually using the mmu, since vhost is
software only).
 

Main difference is that vhost works fine with unlocked
memory, paging it in on demand. iommu needs to unmap
memory when it is swapped out or relocated.

   


So you'd just take the memory map and not pin anything.  This way you 
can reuse the memory map.


But no, it doesn't handle the dirty bitmap, so no go.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: disable clocksource=kvm-clock

2010-06-01 Thread Avi Kivity

On 06/01/2010 04:12 PM, Peter Lieven wrote:

hi,

is it possible to avoid detection of clocksource=kvm_clock in a linux 
guest by

patching the qemu-kvm binary?
i would like to be able to avoid a guest detecting kvm-clock until bug 
#584516

is fixed without modifying all guest systems and reverting that later.



Newer qemu-kvm's will have this feature (using -cpu ...,-kvmclock).

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Guest booting after kernel update (2.6.32.12 -- 2.6.34)

2010-06-01 Thread Kangkook Jee
Hi, all 

After updating the kernel to the latest (2.6.34), my guests wouldn't boot up.
When I connect to one with VNC, it shows the following message and won't
proceed.

Boot from (hd0,0) ext3 f98 (uuid)
starting up ...

these guests were installed with the 'vmbuilder' script and used to boot up
without any problems.

can anyone help me with this problem?

Thanks, Kangkook

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: disable clocksource=kvm-clock

2010-06-01 Thread Peter Lieven

Avi Kivity wrote:

On 06/01/2010 04:12 PM, Peter Lieven wrote:

hi,

is it possible to avoid detection of clocksource=kvm_clock in a linux 
guest by

patching the qemu-kvm binary?
i would like to be able to avoid a guest detecting kvm-clock until 
bug #584516

is fixed without modifying all guest systems and reverting that later.



Newer qemu-kvm's will have this feature (using -cpu ...,-kvmclock).


with latest git, i can supply this switch, but

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

still shows kvm_clock in an opensuse 11.2 64-bit or an ubuntu 10.04 
64-bit guest.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: disable clocksource=kvm-clock

2010-06-01 Thread Avi Kivity

On 06/01/2010 04:57 PM, Peter Lieven wrote:

Avi Kivity wrote:

On 06/01/2010 04:12 PM, Peter Lieven wrote:

hi,

is it possible to avoid detection of clocksource=kvm_clock in a 
linux guest by

patching the qemu-kvm binary?
i would like to be able to avoid a guest detecting kvm-clock until 
bug #584516

is fixed without modifying all guest systems and reverting that later.



Newer qemu-kvm's will have this feature (using -cpu ...,-kvmclock).


with latest git, i can supply this switch, but

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

still shows kvm_clock in an opensuse 11.2 64-bit or an ubuntu 10.04 
64-bit guest.


Try again with tomorrow's git, I sent out a patchset earlier today that 
fixes this.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] qemu-kvm tests: make hypercall test runable on 32-bit host

2010-06-01 Thread Asias He
Signed-off-by: Asias He asias.he...@gmail.com
---
 kvm/test/config-x86-common.mak |5 +++--
 kvm/test/config-x86_64.mak |3 +--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kvm/test/config-x86-common.mak b/kvm/test/config-x86-common.mak
index 9084e2d..38dbf5a 100644
--- a/kvm/test/config-x86-common.mak
+++ b/kvm/test/config-x86-common.mak
@@ -24,7 +24,8 @@ FLATLIBS = lib/libcflat.a $(libgcc)
 
 tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat \
$(TEST_DIR)/smptest.flat  $(TEST_DIR)/port80.flat \
-   $(TEST_DIR)/realmode.flat $(TEST_DIR)/msr.flat
+   $(TEST_DIR)/realmode.flat $(TEST_DIR)/msr.flat \
+   $(TEST_DIR)/hypercall.flat
 
 test_cases: $(tests-common) $(tests)
 
@@ -32,7 +33,7 @@ $(TEST_DIR)/%.o: CFLAGS += -std=gnu99 -ffreestanding -I lib 
-I lib/x86
  
 $(TEST_DIR)/access.flat: $(cstart.o) $(TEST_DIR)/access.o $(TEST_DIR)/print.o
  
-$(TEST_DIR)/hypercall.flat: $(cstart.o) $(TEST_DIR)/hypercall.o 
$(TEST_DIR)/print.o
+$(TEST_DIR)/hypercall.flat: $(cstart.o) $(TEST_DIR)/hypercall.o
  
 $(TEST_DIR)/sieve.flat: $(cstart.o) $(TEST_DIR)/sieve.o \
$(TEST_DIR)/print.o $(TEST_DIR)/vm.o
diff --git a/kvm/test/config-x86_64.mak b/kvm/test/config-x86_64.mak
index b898ece..cc7d7d7 100644
--- a/kvm/test/config-x86_64.mak
+++ b/kvm/test/config-x86_64.mak
@@ -5,7 +5,6 @@ ldarch = elf64-x86-64
 CFLAGS += -D__x86_64__
 
 tests = $(TEST_DIR)/access.flat $(TEST_DIR)/sieve.flat \
- $(TEST_DIR)/emulator.flat $(TEST_DIR)/hypercall.flat \
- $(TEST_DIR)/apic.flat
+ $(TEST_DIR)/emulator.flat $(TEST_DIR)/apic.flat
 
 include config-x86-common.mak
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: disable clocksource=kvm-clock

2010-06-01 Thread Peter Lieven

Avi Kivity wrote:

On 06/01/2010 04:57 PM, Peter Lieven wrote:

Avi Kivity wrote:

On 06/01/2010 04:12 PM, Peter Lieven wrote:

hi,

is it possible to avoid detection of clocksource=kvm_clock in a 
linux guest by

patching the qemu-kvm binary?
i would like to be able to avoid a guest detecting kvm-clock until 
bug #584516

is fixed without modifying all guest systems and reverting that later.



Newer qemu-kvm's will have this feature (using -cpu ...,-kvmclock).


with latest git, i can supply this switch, but

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

still shows kvm_clock in an opensuse 11.2 64-bit or an ubuntu 10.04 
64-bit guest.


Try again with tomorrow's git, I sent out a patchset earlier today 
that fixes this.


avi, i do not know whats going on. but if i supply -cpu xxx,-kvmclock 
the guest

still uses kvm-clock, but it seems bug #584516 is not triggered...
thats weird...

br,
peter


--
Mit freundlichen Grüßen/Kind Regards

Peter Lieven

..

  KAMP Netzwerkdienste GmbH
  Vestische Str. 89-91 | 46117 Oberhausen
  Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
  mailto:p...@kamp.de | http://www.kamp.de

  Geschäftsführer: Heiner Lante | Michael Lante
  Amtsgericht Duisburg | HRB Nr. 12154
  USt-Id-Nr.: DE 120607556

. 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: disable clocksource=kvm-clock

2010-06-01 Thread Avi Kivity

On 06/01/2010 05:06 PM, Peter Lieven wrote:


avi, i do not know whats going on. but if i supply -cpu xxx,-kvmclock 
the guest

still uses kvm-clock, but it seems bug #584516 is not triggered...
thats weird...


I guess that bug was resolved in qemu-kvm.git.  Likely 1a03675db1, but 
it appears to be part of 0.12 too, so not sure.


What qemu-kvm were you using previously?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Re: disable clocksource=kvm-clock

2010-06-01 Thread Peter Lieven

Avi Kivity wrote:

On 06/01/2010 05:06 PM, Peter Lieven wrote:


avi, i do not know whats going on. but if i supply -cpu xxx,-kvmclock 
the guest

still uses kvm-clock, but it seems bug #584516 is not triggered...
thats weird...


I guess that bug was resolved in qemu-kvm.git.  Likely 1a03675db1, but 
it appears to be part of 0.12 too, so not sure.


What qemu-kvm were you using previously?


i'm using qemu-kvm git from today (aa22e82).

bug #585113 seems to be fixed (this was the "irq nobody cared" error with
virtio_blk or e1000)
bug #584516 is still there, but if I set -cpu xxx,-kvmclock it is not
triggered. the guest shows current_clocksource = kvm_clock, but migration
seems to succeed (at least the ~20 times i tested it since you told me
about the flag)

do you have a clue what fixed bug #585113? i would like to backport it
to 0.12.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] Make get_para_features() similar to upstream

2010-06-01 Thread Glauber Costa
On Tue, Jun 01, 2010 at 03:33:43PM +0300, Avi Kivity wrote:
 Accept a CPUState parameter instead of a kvm_context_t.
 
 Signed-off-by: Avi Kivity a...@redhat.com
Acked-by: Glauber Costa glom...@redhat.com


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/5] Use get_para_features() from upstream

2010-06-01 Thread Glauber Costa
On Tue, Jun 01, 2010 at 03:33:44PM +0300, Avi Kivity wrote:
 Signed-off-by: Avi Kivity a...@redhat.com
Acked-by: Glauber Costa glom...@redhat.com
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] Rename kvm_arch_vcpu_init()s cenv argument to env

2010-06-01 Thread Glauber Costa
On Tue, Jun 01, 2010 at 03:33:45PM +0300, Avi Kivity wrote:
 Be more similar to upstream.
 
 Signed-off-by: Avi Kivity a...@redhat.com
Acked-by: Glauber Costa glom...@redhat.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] Use skeleton of upstream's kvm_arch_init_vcpu()

2010-06-01 Thread Glauber Costa
On Tue, Jun 01, 2010 at 03:33:46PM +0300, Avi Kivity wrote:
 Currently all content from qemu-kvm's kvm_arch_init_vcpu().
 
 Signed-off-by: Avi Kivity a...@redhat.com
Acked-by: Glauber Costa glom...@redhat.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 0/2] virtio: put last seen used index into ring itself

2010-06-01 Thread Michael S. Tsirkin
Changes from v2: added padding between avail idx and flags,
and changed virtio to only publish used index when callbacks
are enabled.

Here's a rewrite of the original patch with a new layout.
I haven't tested it yet so no idea how this performs, but
I think this addresses the cache bounce issue raised by Avi.
Posting for early flames/comments.

Generally, the Host end of the virtio ring doesn't need to see where
Guest is up to in consuming the ring.  However, to completely understand
what's going on from the outside, this information must be exposed.
For example, host can reduce the number of interrupts by detecting
that the guest is currently handling previous buffers.
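
A hedged sketch of that host-side test (illustrative names only):

	/* Suppress the notification when the guest has not yet caught up
	 * with previously published used entries: it is still inside its
	 * service loop and will pick up the new entries as well. */
	static bool need_interrupt(uint16_t guest_last_seen_used_idx,
				   uint16_t prev_used_idx)
	{
		return guest_last_seen_used_idx == prev_used_idx;
	}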

We add a feature bit so the guest can tell the host that it's writing
out the current value there, if it wants to use that.

This differs from original approach in that the used index
is put after avail index (they are typically written out together).
To avoid cache bounces on descriptor access,
and make future extensions easier, we put the ring itself at start of
page, and move the control after it.
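
As a rough declaration of the resulting page (illustrative only, with QSZ
standing in for the queue size; the exact types are in patch 1/2):

	struct avail_area {
		__u16 ring[QSZ];      /* the available ring itself comes first */
		__u16 idx;            /* free-running available index */
		__u8  pad[254];       /* keeps flags off the idx cache line */
		__u16 flags;          /* VRING_AVAIL_F_NO_INTERRUPT */
		__u16 last_used_idx;  /* guest-published, added by patch 2/2 */
	};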


Signed-off-by: Michael S. Tsirkin m...@redhat.com


Michael S. Tsirkin (2):
  virtio: support layout with avail ring before idx
  virtio: publish used idx

 drivers/net/virtio_net.c |2 +
 drivers/virtio/virtio_ring.c |   25 +++--
 include/linux/virtio_ring.h  |   50 -
 3 files changed, 64 insertions(+), 13 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 1/2] virtio: support layout with avail ring before idx

2010-06-01 Thread Michael S. Tsirkin
This adds an (unused) option to put available ring before control (avail
index, flags), and adds padding between index and flags. This avoids
cache line sharing between control and ring, and also makes it possible
to extend avail control without incurring extra cache misses.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/virtio/virtio_ring.c |   12 ++--
 include/linux/virtio_ring.h  |   40 ++--
 2 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1ca8890..0f684b4 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -223,8 +223,8 @@ add_head:
 
 	/* Put entry in available array (but don't update avail->idx until they
 	 * do sync).  FIXME: avoid modulus here? */
-	avail = (vq->vring.avail->idx + vq->num_added++) % vq->vring.num;
-	vq->vring.avail->ring[avail] = head;
+	avail = (*vq->vring.avail_idx + vq->num_added++) % vq->vring.num;
+	vq->vring.avail_ring[avail] = head;
 
 	pr_debug("Added buffer head %i to %p\n", head, vq);
 	END_USE(vq);
@@ -244,7 +244,7 @@ void virtqueue_kick(struct virtqueue *_vq)
 	 * new available array entries. */
 	virtio_wmb();
 
-	vq->vring.avail->idx += vq->num_added;
+	*vq->vring.avail_idx += vq->num_added;
 	vq->num_added = 0;
 
 	/* Need to update avail index before checking if we should notify */
@@ -335,7 +335,7 @@ void virtqueue_disable_cb(struct virtqueue *_vq)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 
-	vq->vring.avail->flags |= VRING_AVAIL_F_NO_INTERRUPT;
+	*vq->vring.avail_flags |= VRING_AVAIL_F_NO_INTERRUPT;
 }
 EXPORT_SYMBOL_GPL(virtqueue_disable_cb);
 
@@ -347,7 +347,7 @@ bool virtqueue_enable_cb(struct virtqueue *_vq)
 
 	/* We optimistically turn back on interrupts, then check if there was
 	 * more to do. */
-	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
+	*vq->vring.avail_flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
 	virtio_mb();
 	if (unlikely(more_used(vq))) {
 		END_USE(vq);
@@ -425,7 +425,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
 	if (!vq)
 		return NULL;
 
-	vring_init(&vq->vring, num, pages, vring_align);
+	vring_init(&vq->vring, num, pages, vring_align, false);
 	vq->vq.callback = callback;
 	vq->vq.vdev = vdev;
 	vq->vq.name = name;
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index e4d144b..458ce41 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -47,6 +47,12 @@ struct vring_avail {
__u16 ring[];
 };
 
+struct vring_avail_ctrl {
+   __u16 idx;
+   __u8 pad[254];
+   __u16 flags;
+};
+
 /* u32 is used here for ids for padding reasons. */
 struct vring_used_elem {
/* Index of start of used descriptor chain. */
@@ -66,7 +72,9 @@ struct vring {
 
struct vring_desc *desc;
 
-   struct vring_avail *avail;
+   __u16 *avail_idx;
+   __u16 *avail_flags;
+   __u16 *avail_ring;
 
struct vring_used *used;
 };
@@ -79,11 +87,19 @@ struct vring {
  * // The actual descriptors (16 bytes each)
  * struct vring_desc desc[num];
  *
- * // A ring of available descriptor heads with free-running index.
+ * // A ring of available descriptor heads with a control structure
+ *  // including a free-running index.
+ *  // The ring can come either after (legacy) or before the control.
  * __u16 avail_flags;
  * __u16 avail_idx;
  * __u16 available[num];
  *
+ * or
+ *
+ * __u16 available[num];
+ * __u16 avail_idx;
+ * __u8 pad[254]; // Padding to align flags at cache line boundary.
+ * __u16 avail_flags;
  * // Padding to the next align boundary.
  * char pad[];
  *
@@ -94,13 +110,25 @@ struct vring {
  * };
  */
 static inline void vring_init(struct vring *vr, unsigned int num, void *p,
-			      unsigned long align)
+			      unsigned long align, bool avail_ring_first)
 {
+	struct vring_avail *avail = p + num * sizeof(struct vring_desc);
 	vr->num = num;
 	vr->desc = p;
-	vr->avail = p + num*sizeof(struct vring_desc);
-	vr->used = (void *)(((unsigned long)&vr->avail->ring[num] + align-1)
-			    & ~(align - 1));
+	if (avail_ring_first) {
+		struct vring_avail_ctrl *ctrl;
+
+		vr->avail_ring = (void *)avail;
+		ctrl = (void *)&vr->avail_ring[num];
+		vr->avail_idx = &ctrl->idx;
+		vr->avail_flags = &ctrl->flags;
+	} else {
+		vr->avail_ring = avail->ring;
+		vr->avail_idx = &avail->idx;
+		vr->avail_flags = &avail->flags;
+	}
+	vr->used = (void *)ALIGN((unsigned long)&avail->ring[num], align);
+	/* Verify that avail fits before used. */
+	BUG_ON((unsigned long)(vr->avail_flags + 1) > (unsigned long)vr->used);
+	BUG_ON((unsigned long)(vr->avail_idx + 1) > (unsigned long)vr->used);

Re: [PATCH 5/5] Use upstream's kvm_arch_init_vcpu()'s cpuid bits

2010-06-01 Thread Glauber Costa
On Tue, Jun 01, 2010 at 03:33:47PM +0300, Avi Kivity wrote:
 Signed-off-by: Avi Kivity a...@redhat.com
Acked-by: Glauber Costa glom...@redhat.com

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv3 2/2] virtio: publish used idx

2010-06-01 Thread Michael S. Tsirkin
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/net/virtio_net.c |2 ++
 drivers/virtio/virtio_ring.c |   15 +--
 include/linux/virtio_ring.h  |   10 ++
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 78eb319..30e7483 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -23,6 +23,7 @@
 #include linux/module.h
 #include linux/virtio.h
 #include linux/virtio_net.h
+#include linux/virtio_ring.h
 #include linux/scatterlist.h
 #include linux/if_vlan.h
 #include linux/slab.h
@@ -1056,6 +1057,7 @@ static unsigned int features[] = {
VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
+   VIRTIO_RING_F_PUBLISH_USED,
 };
 
 static struct virtio_driver virtio_net_driver = {
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 0f684b4..5a8c711 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -92,6 +92,9 @@ struct vring_virtqueue
/* Last used index we've seen. */
u16 last_used_idx;
 
+   /* Publish last used index we've seen at this location. */
+   u16 *publish_last_used_idx;
+
/* How to notify other side. FIXME: commonalize hcalls! */
void (*notify)(struct virtqueue *vq);
 
@@ -325,7 +328,7 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 	/* detach_buf clears data, so grab it now. */
 	ret = vq->data[i];
 	detach_buf(vq, i);
-	vq->last_used_idx++;
+	*vq->publish_last_used_idx = ++vq->last_used_idx;
 	END_USE(vq);
 	return ret;
 }
@@ -425,13 +428,19 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
 	if (!vq)
 		return NULL;
 
-	vring_init(&vq->vring, num, pages, vring_align, false);
+	vring_init(&vq->vring, num, pages, vring_align,
+		   virtio_has_feature(vdev, VIRTIO_RING_F_PUBLISH_USED));
+	if (virtio_has_feature(vdev, VIRTIO_RING_F_PUBLISH_USED))
+		vq->publish_last_used_idx = vq->vring.last_used_idx;
+	else
+		vq->publish_last_used_idx = &vq->last_used_idx;
 	vq->vq.callback = callback;
 	vq->vq.vdev = vdev;
 	vq->vq.name = name;
 	vq->notify = notify;
 	vq->broken = false;
 	vq->last_used_idx = 0;
+	*vq->publish_last_used_idx = 0;
 	vq->num_added = 0;
 	list_add_tail(&vq->vq.list, &vdev->vqs);
 #ifdef DEBUG
@@ -473,6 +482,8 @@ void vring_transport_features(struct virtio_device *vdev)
switch (i) {
case VIRTIO_RING_F_INDIRECT_DESC:
break;
+   case VIRTIO_RING_F_PUBLISH_USED:
+   break;
default:
/* We don't understand this bit. */
clear_bit(i, vdev-features);
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index 458ce41..87f8fd3 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -29,6 +29,9 @@
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC28
 
+/* The Guest publishes last-seen used index at the end of the avail ring. */
+#define VIRTIO_RING_F_PUBLISH_USED 29
+
 /* Virtio ring descriptors: 16 bytes.  These can chain together via next. */
 struct vring_desc {
/* Address (guest-physical). */
@@ -51,6 +54,7 @@ struct vring_avail_ctrl {
__u16 idx;
__u8 pad[254];
__u16 flags;
+   __u16 last_used_idx;
 };
 
 /* u32 is used here for ids for padding reasons. */
@@ -75,6 +79,7 @@ struct vring {
__u16 *avail_idx;
__u16 *avail_flags;
__u16 *avail_ring;
+   __u16 *last_used_idx;
 
struct vring_used *used;
 };
@@ -100,6 +105,7 @@ struct vring {
  * __u16 avail_idx;
  * __u8 pad[254]; // Padding to align flags at cache line boundary.
  * __u16 avail_flags;
+ * __u16 last_used_idx;
  * // Padding to the next align boundary.
  * char pad[];
  *
@@ -120,14 +126,18 @@ static inline void vring_init(struct vring *vr, unsigned int num, void *p,
 		vr->avail_ring = (void *)avail;
 		vr->avail_idx = &ctrl->idx;
 		vr->avail_flags = &ctrl->flags;
+		vr->last_used_idx = &ctrl->last_used_idx;
 	} else {
 		vr->avail_ring = avail->ring;
 		vr->avail_idx = &avail->idx;
 		vr->avail_flags = &avail->flags;
+		vr->last_used_idx = NULL;
 	}
 	vr->used = (void *)ALIGN((unsigned long)&avail->ring[num], align);
 	/* Verify that avail fits before used. */
 	BUG_ON((unsigned long)(vr->avail_flags + 1) > (unsigned long)vr->used);
 	BUG_ON((unsigned long)(vr->avail_idx + 1) > (unsigned long)vr->used);
+	BUG_ON(vr->last_used_idx && (unsigned long)(vr->last_used_idx + 1) >
+	       (unsigned long)vr->used);

Re: [PATCHv3 0/2] virtio: put last seen used index into ring itself

2010-06-01 Thread Michael S. Tsirkin
On Tue, Jun 01, 2010 at 05:47:08PM +0300, Michael S. Tsirkin wrote:
 Changes from v2: added padding between avail idx and flags,
 and changed virtio to only publish used index when callbacks
 are enabled.

Here's the updated spec patch.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

--

diff --git a/virtio-spec.lyx b/virtio-spec.lyx
index ed35893..a2dcd76 100644
--- a/virtio-spec.lyx
+++ b/virtio-spec.lyx
@@ -1803,6 +1803,36 @@ next
 \emph default
  descriptor entry (modulo the ring size).
  This starts at 0, and increases.
+\change_inserted 0 1274966643
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1274968378
+When PUBLISH_USED feature flag has 
+\emph on
+not
+\emph default
+ been negotiated, the ring follows the 
+\begin_inset Quotes eld
+\end_inset
+
+flags
+\begin_inset Quotes erd
+\end_inset
+
+ and the 
+\begin_inset Quotes eld
+\end_inset
+
+idx
+\begin_inset Quotes erd
+\end_inset
+
+ fields:
+\change_unchanged
+
 \end_layout
 
 \begin_layout Standard
@@ -1845,7 +1875,143 @@ struct vring_avail {
 
 \end_layout
 
+\begin_layout Standard
+
+\change_inserted 0 1274968432
+\begin_inset CommandInset label
+LatexCommand label
+name PUBLISH_USED-feature
+
+\end_inset
+
+When PUBLISH_USED feature flag has been negotiated, the control structure
+ including the 
+\begin_inset Quotes eld
+\end_inset
+
+flags and the 
+\begin_inset Quotes eld
+\end_inset
+
+idx
+\begin_inset Quotes erd
+\end_inset
+
+ fields follows the ring.
+ This leaves room for the 
+\begin_inset Quotes eld
+\end_inset
+
+last_seen_used_idx
+\begin_inset Quotes erd
+\end_inset
+
+ field, which indicates the most recent 
+\begin_inset Quotes eld
+\end_inset
+
+idx
+\begin_inset Quotes erd
+\end_inset
+
+ value observed by guest in the used ring (see 
+\begin_inset CommandInset ref
+LatexCommand ref
+reference sub:Used-Ring
+
+\end_inset
+
+ below):
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1274967396
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1274967404
+
+struct vring_avail {
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1275404889
+
+   u16 ring[qsz]; /* qsz is the Queue Size field read from device */
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1275404891
+
+   u16 idx;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1275404921
+
+   u8 pad[254]; /* Padding to avoid sharing cache line between flags and
+ idx fields.
+ */
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1274967533
+
+#define VRING_AVAIL_F_NO_INTERRUPT  1
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1274967533
+
+   u16 flags;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1274968345
+
+   u16 last_seen_used_idx;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1274967396
+
+}; 
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 0 1274967715
+If the ring is large enough, the second layout maintains the control and
+ ring structures on separate cache lines.
+\end_layout
+
 \begin_layout Subsection
+
+\change_inserted 0 1274968415
+\begin_inset CommandInset label
+LatexCommand label
+name sub:Used-Ring
+
+\end_inset
+
+
+\change_unchanged
 Used Ring
 \end_layout
 
@@ -2391,12 +2557,20 @@ status open
 
 \begin_layout Plain Layout
 
-while (vq->last_seen_used != vring->used.idx) {
+while (vq->last_seen_used
+\change_inserted 0 1274968316
+_idx
+\change_unchanged
+ != vring->used.idx) {
 \end_layout
 
 \begin_layout Plain Layout
 
-struct vring_used_elem *e = &vring.used->ring[vq->last_seen_used%vsz];
+struct vring_used_elem *e = &vring.used->ring[vq->last_seen_used
+\change_inserted 0 1274968326
+_idx
+\change_unchanged
+%vsz];
 \end_layout
 
 \begin_layout Plain Layout
@@ -2406,7 +2580,11 @@ while (vq->last_seen_used != vring->used.idx) {
 
 \begin_layout Plain Layout
 
-vq-last_seen_used++;
+vq-last_seen_used
+\change_inserted 0 1274968321
+_idx
+\change_unchanged
+++;
 \end_layout
 
 \begin_layout Plain Layout
@@ -2419,6 +2597,15 @@ while (vq-last_seen_used != vring-used.idx) {
 
 \end_layout
 
+\begin_layout Standard
+
+\change_inserted 0 1275405042
+If the PUBLISH_USED feature is negotiated, the last_seen_used value should
+ be published to the device in the avail ring.
+ This value is used by the host for interrupt mitigation, so it only needs
+ to be updated when interrupts are enabled.
+\end_layout
+
 \begin_layout Subsection
 Dealing With Configuration Changes
 \end_layout
@@ -2986,6 +3173,47 @@ struct vring_avail {
 \begin_layout Plain Layout
 
 };
+\change_inserted 0 1274966477
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1274966484
+
+struct vring_avail_ctrl {
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1274966489
+
+__u16 flags;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 0 1274966494
+

Re: Any comments? Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-06-01 Thread Marcelo Tosatti
On Tue, Jun 01, 2010 at 09:05:38PM +0900, Takuya Yoshikawa wrote:
 (2010/06/01 19:55), Marcelo Tosatti wrote:
 
 Sorry, but I have to say that the mmu_lock spin_lock problem was completely
 out of my mind. Although I looked through the code, it seems not easy to
 move set_bit_user to outside of the spinlock section without breaking the
 semantics of its protection.
 
 So this may take some time to solve.
 
 But personally, I want to do something about x86's vmalloc()-every-time
 problem, even though moving dirty bitmaps to user space cannot be achieved
 soon.
 
 In that sense, do you mind if we do double buffering without moving
 dirty bitmaps to
 user space?
 
 So I would be happy if you give me any comments about this kind of other
 options.
 
 What if you pin the bitmaps?
 
 Yes, pinning bitmaps works. The small problem is that we need to hold
 the dirty_bitmap_pages[] array for every slot, the size of this array
 depends on the slot length, and of course pinning itself.
 
 In the performance point of view, having double sized vmalloc'ed
 area may be better.
 
 
 The alternative to that is to move mark_page_dirty(gfn) before acquisition
 of mmu_lock, in the page fault paths. The downside of that is a potentially
 large number of false positives in the dirty bitmap.
 
 
 Interesting, but probably dangerous.
 
 
 From my experience, though this includes my personal view, removing vmalloc
 currently used by x86 is the most simple and effective change.
 
 So if you don't mind, I want to double the size of vmalloc'ed area for x86
 without changing other parts.
 
  == if this one more bitmap is problematic, dirty logging itself would be
  in danger of failure: we need to have the same size in the timing of
  switch.
 
 Make sense?

That seems the most sensible approach.

 
 We can consider moving dirty bitmaps to user space later.
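
A hedged sketch of that double buffering (illustrative names, not current
KVM code):

	/* One vmalloc'ed area of twice the bitmap size; GET_DIRTY_LOG hands
	 * out the active half and flips to the other, instead of doing
	 * vmalloc + copy + vfree on every call. */
	struct dirty_log {
		unsigned long *head;   /* 2 * bytes, vmalloc'ed once */
		int which;             /* active half, 0 or 1 */
	};

	static unsigned long *switch_dirty_bitmap(struct dirty_log *d,
						  size_t bytes)
	{
		unsigned long *old = d->head +
			d->which * (bytes / sizeof(unsigned long));

		d->which ^= 1;
		memset(d->head + d->which * (bytes / sizeof(unsigned long)),
		       0, bytes);
		return old;    /* copied out to user space by the caller */
	}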
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Andi Kleen
Gleb Natapov g...@redhat.com writes:

 The patch below allows to patch ticket spinlock code to behave similar to
 old unfair spinlock when hypervisor is detected. After patching unlocked

The question is what happens when you have a system with unfair
memory and you run the hypervisor on that. There it could be much worse.

Your new code would starve again, right?

There's a reason the ticket spinlocks were added in the first place.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Virtual disks with SCSI or virtio hang after some time

2010-06-01 Thread Guido Winkelmann
Hi,

When using KVM machines with virtual disks that are hooked up to the guest via 
either virtio or SCSI, the virtual disk will often hang completely after a 
short time of operation. When this happens, the network connectivity of the 
machine will usually go down, too. (I.e. it stops responding to pings.)

When this happens when using SCSI, the following messages will appear in 
syslog (in the guest):

sd 0:0:0:0: [sda] ABORT operation started
sd 0:0:0:0: ABORT operation timed out
sd 0:0:0:0: [sda] DEVICE RESET operation started

When using virtio, no log messages appear at the time the hang occurs, but 
some time before, these messages appear:

JBD: barrier-based sync failed on vda1-8
JBD: barrier-based sync failed on vda3-8
JBD: barrier-based sync failed on vda5-8

The hang seems to happen faster with virtio, usually within one minute after 
bootup, while with SCSI, it can take a few more minutes

In the host, no interesting messages appear in syslog.

The host OS is Fedora Core 12, with qemu-kvm 0.11.0, Kernel 
2.6.32.11-99.fc12.x86_64 and libvirt 0.8.1.

The guest kernel is 2.6.32-gentoo-r7.

Other, non-linux, guest operating systems seem to have problems with the 
virtualised scsi disks, too. (I've tried OpenBSD.)

Does anyone have an idea what is happening here and what can be done about it?

Regards,

Guido

-- 
Too much multitasking isn't good for you.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Gleb Natapov
On Tue, Jun 01, 2010 at 05:53:09PM +0200, Andi Kleen wrote:
 Gleb Natapov g...@redhat.com writes:
 
  The patch below allows to patch ticket spinlock code to behave similar to
  old unfair spinlock when hypervisor is detected. After patching unlocked
 
 The question is what happens when you have a system with unfair
 memory and you run the hypervisor on that. There it could be much worse.
 
How much worse performance hit could be?

 Your new code would starve again, right?
 
Yes, of course it may starve with an unfair spinlock. Since vcpus are not
always running, there is a much smaller chance that a vcpu on a remote memory
node will starve forever. Old kernels with unfair spinlocks are running
fine in VMs on NUMA machines with various loads.

 There's a reason the ticket spinlocks were added in the first place.
 
I understand that reason and do not propose to get back to old spinlock
on physical HW! But with virtualization performance hit is unbearable.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Andi Kleen
On Tue, Jun 01, 2010 at 07:24:14PM +0300, Gleb Natapov wrote:
 On Tue, Jun 01, 2010 at 05:53:09PM +0200, Andi Kleen wrote:
  Gleb Natapov g...@redhat.com writes:
  
   The patch below allows to patch ticket spinlock code to behave similar to
   old unfair spinlock when hypervisor is detected. After patching unlocked
  
  The question is what happens when you have a system with unfair
  memory and you run the hypervisor on that. There it could be much worse.
  
 How much worse performance hit could be?

It depends on the workload. Overall it means that a contended
lock can have much higher latencies.

If you want to study some examples see the locking problems the
RT people have with their heavy weight mutex-spinlocks.

But the main problem is that in the worst case you
can see extremely long stalls (up to a second has been observed),
which then turns into a correctness issue.
 
  Your new code would starve again, right?
  
 Yes, of course it may starve with an unfair spinlock. Since vcpus are not
 always running, there is a much smaller chance that a vcpu on a remote memory
 node will starve forever. Old kernels with unfair spinlocks are running
 fine in VMs on NUMA machines with various loads.

Try it on a NUMA system with unfair memory.

  There's a reason the ticket spinlocks were added in the first place.
  
 I understand that reason and do not propose to get back to old spinlock
 on physical HW! But with virtualization performance hit is unbearable.

Extreme unfairness can be unbearable too.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Virtual disks with SCSI or virtio hang after some time

2010-06-01 Thread Avi Kivity

On 06/01/2010 06:59 PM, Guido Winkelmann wrote:

The host OS is Fedora Core 12, with qemu-kvm 0.11.0
   


Please try with at least qemu-kvm-0.11.1, preferably qemu-kvm-0.12.4.  
Also use Linux 2.6.32.latest in the guest.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Avi Kivity

On 06/01/2010 07:38 PM, Andi Kleen wrote:

Your new code would starve again, right?

   

Yes, of course it may starve with an unfair spinlock. Since vcpus are not
always running, there is a much smaller chance that a vcpu on a remote memory
node will starve forever. Old kernels with unfair spinlocks are running
fine in VMs on NUMA machines with various loads.
 

Try it on a NUMA system with unfair memory.
   


We are running everything on NUMA (since all modern machines are now 
NUMA).  At what scale do the issues become observable?



I understand that reason and do not propose to get back to old spinlock
on physical HW! But with virtualization performance hit is unbearable.
 

Extreme unfairness can be unbearable too.
   


Well, the question is what happens first.  In our experience, vcpu 
overcommit is a lot more painful.  People will never see the NUMA 
unfairness issue if they can't use kvm due to the vcpu overcommit problem.


What I'd like to see eventually is a short-term-unfair, long-term-fair 
spinlock.  Might make sense for bare metal as well.  But it won't be 
easy to write.
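
For reference, the two flavors under discussion, as a hedged user-space
sketch in C11 atomics (not the kernel implementation):

	#include <stdatomic.h>

	struct ticket_lock { atomic_uint next, owner; };

	static void ticket_lock(struct ticket_lock *l)
	{
		unsigned t = atomic_fetch_add(&l->next, 1); /* take a ticket */
		while (atomic_load(&l->owner) != t)
			;                                   /* strict FIFO wait */
	}

	static void ticket_unlock(struct ticket_lock *l)
	{
		atomic_fetch_add(&l->owner, 1);
	}

	struct tas_lock { atomic_flag f; };

	static void tas_lock(struct tas_lock *l)
	{
		/* no queue: whichever CPU wins the cache line gets the lock,
		 * which is unfair but cheap when the holder or the next
		 * ticket holder is a scheduled-out vcpu */
		while (atomic_flag_test_and_set_explicit(&l->f,
							 memory_order_acquire))
			;
	}

	static void tas_unlock(struct tas_lock *l)
	{
		atomic_flag_clear_explicit(&l->f, memory_order_release);
	}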


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7] KVM: VMX: Enable XSAVE/XRSTOR for guest

2010-06-01 Thread Marcelo Tosatti
On Mon, May 31, 2010 at 07:54:11PM +0800, Sheng Yang wrote:
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 99ae513..8649627 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -36,6 +36,8 @@
  #include asm/vmx.h
  #include asm/virtext.h
  #include asm/mce.h
 +#include asm/i387.h
 +#include asm/xcr.h
  
  #include trace.h
  
 @@ -3354,6 +3356,16 @@ static int handle_wbinvd(struct kvm_vcpu *vcpu)
   return 1;
  }
  
 +static int handle_xsetbv(struct kvm_vcpu *vcpu)
 +{
 + u64 new_bv = kvm_read_edx_eax(vcpu);
 + u32 index = kvm_register_read(vcpu, VCPU_REGS_RCX);
 +
 + kvm_set_xcr(vcpu, index, new_bv);
 + skip_emulated_instruction(vcpu);

Should only skip_emulated_instruction if no exception was injected.
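
I.e., something like this (assuming kvm_set_xcr() returns non-zero when it
injected a fault):

	static int handle_xsetbv(struct kvm_vcpu *vcpu)
	{
		u64 new_bv = kvm_read_edx_eax(vcpu);
		u32 index = kvm_register_read(vcpu, VCPU_REGS_RCX);

		if (kvm_set_xcr(vcpu, index, new_bv) == 0)
			skip_emulated_instruction(vcpu); /* only on success */
		return 1;
	}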

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 7be1d36..0fcd8de 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c

 +int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 +{
 +	u64 xcr0;
 +
 +	/* Only support XCR_XFEATURE_ENABLED_MASK(xcr0) now  */
 +	if (index != XCR_XFEATURE_ENABLED_MASK)
 +		return 1;
 +	xcr0 = xcr;
 +	if (kvm_x86_ops->get_cpl(vcpu) != 0)
 +		return 1;
 +	if (!(xcr0 & XSTATE_FP))
 +		return 1;
 +	if ((xcr0 & XSTATE_YMM) && !(xcr0 & XSTATE_SSE))
 +		return 1;
 +	if (xcr0 & ~host_xcr0)
 +		return 1;
 +	vcpu->arch.xcr0 = xcr0;
 +	xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);

Won't this happen on guest entry, since vcpu->guest_xcr0_loaded == 0?

 @@ -4522,6 +4602,24 @@ static void inject_pending_event(struct kvm_vcpu *vcpu)
   }
  }
  
 +static void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu)
 +{
 +	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) &&
 +	    !vcpu->guest_xcr0_loaded) {
 +		xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);

Only necessary if the guest's and host's XCR0 differ?

 +		vcpu->guest_xcr0_loaded = 1;
 +	}
 +}
 +
 +static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
 +{
 +	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) &&
 +	    vcpu->guest_xcr0_loaded) {
 +		xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
 +		vcpu->guest_xcr0_loaded = 0;
 +	}
 +}

What if you load guest's XCR0, then guest clears CR4.OSXSAVE? (restore 
of host_xcr0 should be conditional on guest_xcr0_loaded only).
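
A hedged sketch that applies both comments above (key the restore on
guest_xcr0_loaded alone, and skip the write when guest and host values
already match):

static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
{
	if (vcpu->guest_xcr0_loaded) {
		/* the guest may have cleared CR4.OSXSAVE since the
		 * load, so don't key the restore on it */
		if (vcpu->arch.xcr0 != host_xcr0)
			xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
		vcpu->guest_xcr0_loaded = 0;
	}
}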

 +
  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
  {
   int r;
 @@ -4567,6 +4665,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->prepare_guest_switch(vcpu);
 	if (vcpu->fpu_active)
 		kvm_load_guest_fpu(vcpu);
 +	kvm_load_guest_xcr0(vcpu);
 
 	atomic_set(&vcpu->guest_mode, 1);
 	smp_wmb();
 @@ -5118,6 +5217,11 @@ void fx_init(struct kvm_vcpu *vcpu)
 	fpu_alloc(&vcpu->arch.guest_fpu);
 	fpu_finit(&vcpu->arch.guest_fpu);
 
 +	/*
 +	 * Ensure guest xcr0 is valid for loading
 +	 */
 +	vcpu->arch.xcr0 = XSTATE_FP;
 +
 	vcpu->arch.cr0 |= X86_CR0_ET;
  }
  EXPORT_SYMBOL_GPL(fx_init);
 @@ -5132,6 +5236,12 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 	if (vcpu->guest_fpu_loaded)
 		return;
 
 +	/*
 +	 * Restore all possible states in the guest,
 +	 * and assume host would use all available bits.
 +	 * Guest xcr0 would be loaded later.
 +	 */
 +	kvm_put_guest_xcr0(vcpu);
 	vcpu->guest_fpu_loaded = 1;
 	unlazy_fpu(current);
 	fpu_restore_checking(&vcpu->arch.guest_fpu);
 @@ -5140,6 +5250,8 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
  
  void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 {
 + kvm_put_guest_xcr0(vcpu);
 +

Should be in kvm_arch_vcpu_put?

 	if (!vcpu->guest_fpu_loaded)
   return;
  
 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index 4e8fdbf..3784d58 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -88,7 +88,7 @@ struct kvm_vcpu {
   int srcu_idx;
  
   int fpu_active;
 - int guest_fpu_loaded;
 + int guest_fpu_loaded, guest_xcr0_loaded;
   wait_queue_head_t wq;
   int sigset_active;
   sigset_t sigset;
 -- 
 1.7.0.1
 


Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost workers

2010-06-01 Thread Sridhar Samudrala
On Tue, 2010-06-01 at 11:35 +0200, Tejun Heo wrote:
 Apply the cpumask and cgroup of the initializing task to the created
 vhost worker.
 
 Based on Sridhar Samudrala's patch.  Li Zefan spotted a bug in error
 path (twice), fixed (twice).
 
 Signed-off-by: Tejun Heo t...@kernel.org
 Cc: Michael S. Tsirkin m...@redhat.com
 Cc: Sridhar Samudrala samudrala.srid...@gmail.com
 Cc: Li Zefan l...@cn.fujitsu.com
 ---
  drivers/vhost/vhost.c |   34 ++
  1 file changed, 30 insertions(+), 4 deletions(-)
 
 Index: work/drivers/vhost/vhost.c
 ===
 --- work.orig/drivers/vhost/vhost.c
 +++ work/drivers/vhost/vhost.c
 @@ -23,6 +23,7 @@
  #include <linux/highmem.h>
  #include <linux/slab.h>
  #include <linux/kthread.h>
 +#include <linux/cgroup.h>
 
  #include <linux/net.h>
  #include <linux/if_packet.h>
 @@ -187,11 +188,29 @@ long vhost_dev_init(struct vhost_dev *de
   struct vhost_virtqueue *vqs, int nvqs)
  {
   struct task_struct *worker;
 -	int i;
 +	cpumask_var_t mask;
 +	int i, ret = -ENOMEM;
 +
 +	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
 +		goto out_free_mask;

I think this is another bug in the error path. You should simply
do a return instead of a goto here when alloc_cpumask_var() fails.

Thanks
Sridhar
 
 	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
 -	if (IS_ERR(worker))
 -		return PTR_ERR(worker);
 +	if (IS_ERR(worker)) {
 +		ret = PTR_ERR(worker);
 +		goto out_free_mask;
 +	}
 +
 +	ret = sched_getaffinity(current->pid, mask);
 +	if (ret)
 +		goto out_stop_worker;
 +
 +	ret = sched_setaffinity(worker->pid, mask);
 +	if (ret)
 +		goto out_stop_worker;
 +
 +	ret = cgroup_attach_task_current_cg(worker);
 +	if (ret)
 +		goto out_stop_worker;
 
 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
 @@ -214,7 +233,14 @@ long vhost_dev_init(struct vhost_dev *de
   }
 
   wake_up_process(worker);/* avoid contributing to loadavg */
 - return 0;
 + ret = 0;
 + goto out_free_mask;
 +
 +out_stop_worker:
 + kthread_stop(worker);
 +out_free_mask:
 + free_cpumask_var(mask);
 + return ret;
  }
 
  /* Caller should have device mutex */


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Andi Kleen
On Tue, Jun 01, 2010 at 07:52:28PM +0300, Avi Kivity wrote:
 We are running everything on NUMA (since all modern machines are now NUMA). 
  At what scale do the issues become observable?

On Intel platforms it's visible starting with 4 sockets.


 I understand that reason and do not propose to get back to the old spinlock
 on physical HW! But with virtualization the performance hit is unbearable.
  
 Extreme unfairness can be unbearable too.


 Well, the question is what happens first.  In our experience, vcpu 
 overcommit is a lot more painful.  People will never see the NUMA 
 unfairness issue if they can't use kvm due to the vcpu overcommit problem.

You really have to address both; if you don't fix them both,
users will eventually run into one of them and be unhappy.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Valdis . Kletnieks
On Tue, 01 Jun 2010 19:52:28 +0300, Avi Kivity said:
 On 06/01/2010 07:38 PM, Andi Kleen wrote:
  Your new code would starve again, right?
  Try it on a NUMA system with unfair memory.

 We are running everything on NUMA (since all modern machines are now 
 NUMA).  At what scale do the issues become observable?

My 6-month-old laptop is NUMA? Comes as a surprise to me, and to the
perfectly-running NUMA=n kernel I'm running.

Or did you mean a less broad phrase than "all modern machines"?





Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread john cooper
Avi Kivity wrote:
 On 06/01/2010 07:38 PM, Andi Kleen wrote:
 Your new code would starve again, right?


 Yes, of course it may starve with unfair spinlock. Since vcpus are not
 always running there is much smaller chance then vcpu on remote memory
 node will starve forever. Old kernels with unfair spinlocks are running
 fine in VMs on NUMA machines with various loads.
  
 Try it on a NUMA system with unfair memory.

 
 We are running everything on NUMA (since all modern machines are now
 NUMA).  At what scale do the issues become observable?
 
 I understand that reason and do not propose to get back to the old spinlock
 on physical HW! But with virtualization the performance hit is unbearable.
  
 Extreme unfairness can be unbearable too.

 
 Well, the question is what happens first.  In our experience, vcpu
 overcommit is a lot more painful.  People will never see the NUMA
 unfairness issue if they can't use kvm due to the vcpu overcommit problem.

Gleb's observed performance hit seems to be a rather mild
throughput depression compared with creating a worst case by
enforcing vcpu overcommit.  Running a single guest with 2:1
overcommit on a 4 core machine I saw over an order of magnitude
slowdown vs. 1:1 commit with the same kernel build test.
Others have reported similar results.

How close you'll get to that scenario depends on host
scheduling dynamics, and statistically the number of opened
and stalled lock held paths waiting to be contended.  So
I'd expect to see quite variable numbers for guest-guest
aggravation of this problem.

 What I'd like to see eventually is a short-term-unfair, long-term-fair
 spinlock.  Might make sense for bare metal as well.  But it won't be
 easy to write.

Collecting the contention/usage statistics on a per spinlock
basis seems complex.  I believe a practical approximation
to this are adaptive mutexes where upon hitting a spin
time threshold, punt and let the scheduler reconcile fairness.
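
For illustration, a minimal sketch of that spin-then-punt idea (the
names are hypothetical and real adaptive mutexes are more involved;
short-term the lock spins unfairly, long-term the scheduler arbitrates):

#define SPIN_THRESHOLD	1024

struct adaptive_lock {
	atomic_t locked;	/* 0 = free, 1 = held */
};

static void adaptive_lock_acquire(struct adaptive_lock *l)
{
	unsigned int spins = 0;

	while (atomic_cmpxchg(&l->locked, 0, 1) != 0) {
		if (++spins < SPIN_THRESHOLD) {
			cpu_relax();	/* short-term: plain unfair spin */
			continue;
		}
		/* hit the threshold: punt and let the scheduler
		 * reconcile fairness */
		spins = 0;
		schedule();
	}
}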

-john

-- 
john.coo...@third-harmonic.com


Re: [PATCH v7] KVM: VMX: Enable XSAVE/XRSTOR for guest

2010-06-01 Thread Avi Kivity

On 06/01/2010 08:12 PM, Marcelo Tosatti wrote:

On Mon, May 31, 2010 at 07:54:11PM +0800, Sheng Yang wrote:
   

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 99ae513..8649627 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -36,6 +36,8 @@
  #include <asm/vmx.h>
  #include <asm/virtext.h>
  #include <asm/mce.h>
+#include <asm/i387.h>
+#include <asm/xcr.h>

  #include "trace.h"

@@ -3354,6 +3356,16 @@ static int handle_wbinvd(struct kvm_vcpu *vcpu)
return 1;
  }

+static int handle_xsetbv(struct kvm_vcpu *vcpu)
+{
+   u64 new_bv = kvm_read_edx_eax(vcpu);
+   u32 index = kvm_register_read(vcpu, VCPU_REGS_RCX);
+
+   kvm_set_xcr(vcpu, index, new_bv);
+   skip_emulated_instruction(vcpu);
 

Should only skip_emulated_instruction if no exception was injected.
   


Good catch.  So, unit testing failure cases _is_ important.

   

+int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
+{
+	u64 xcr0;
+
+	/* Only support XCR_XFEATURE_ENABLED_MASK(xcr0) now  */
+	if (index != XCR_XFEATURE_ENABLED_MASK)
+		return 1;
+	xcr0 = xcr;
+	if (kvm_x86_ops->get_cpl(vcpu) != 0)
+		return 1;
+	if (!(xcr0 & XSTATE_FP))
+		return 1;
+	if ((xcr0 & XSTATE_YMM) && !(xcr0 & XSTATE_SSE))
+		return 1;
+	if (xcr0 & ~host_xcr0)
+		return 1;
+	vcpu->arch.xcr0 = xcr0;
+	xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
 

Won't this happen on guest entry, since vcpu->guest_xcr0_loaded == 0?
   


In fact we don't know what vcpu->guest_xcr0_loaded is.  Best to unload 
it explicitly (and drop the xsetbv() call).
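
One way to read that suggestion, as a sketch against the patch above:

	/* in __kvm_set_xcr(), after validating the new value: */
	kvm_put_guest_xcr0(vcpu);	/* drop any stale loaded value */
	vcpu->arch.xcr0 = xcr0;		/* guest entry reloads it via
					 * kvm_load_guest_xcr0() */
	return 0;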



+		vcpu->guest_xcr0_loaded = 1;
+	}
+}
+
+static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
+{
+	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) &&
+	    vcpu->guest_xcr0_loaded) {
+		xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
+		vcpu->guest_xcr0_loaded = 0;
+	}
+}
 

What if you load guest's XCR0, then guest clears CR4.OSXSAVE? (restore
of host_xcr0 should be conditional on guest_xcr0_loaded only).
   


Good catch again.


@@ -5140,6 +5250,8 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)

  void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
{
+   kvm_put_guest_xcr0(vcpu);
+
 

Should be in kvm_arch_vcpu_put?
   


That's called unconditionally from kvm_arch_vcpu_put(), so equivalent.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Andi Kleen
 Collecting the contention/usage statistics on a per spinlock
 basis seems complex.  I believe a practical approximation
 to this are adaptive mutexes where upon hitting a spin
 time threshold, punt and let the scheduler reconcile fairness.

That would probably work, except: how do you get the
adaptive spinlock into a paravirt op without slowing
down a standard kernel?
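
The usual shape of the answer is an ops table that defaults to the
native ticket lock and is switched once at boot when a hypervisor is
detected, with the pv-ops machinery patching the call sites so bare
metal doesn't pay for the indirection.  As a sketch only (the x86 tree
already has a real pv_lock_ops; details and the two helpers below are
hypothetical):

struct pv_lock_ops {
	void (*spin_lock)(arch_spinlock_t *lock);
};

struct pv_lock_ops pv_lock_ops = {
	.spin_lock = native_spin_lock,	/* ticket lock on bare metal */
};

void __init init_guest_spinlocks(void)
{
	if (running_on_hypervisor())	/* hypothetical detection helper */
		pv_lock_ops.spin_lock = adaptive_spin_lock;
}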

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH] KVM Test: Make login timeout configurable.

2010-06-01 Thread Lucas Meneghel Rodrigues
On Fri, 2010-05-28 at 13:36 +0800, Feng Yang wrote:
 Add a login_timeout parameter to make the login timeout configurable.
 Currently the default timeout value is 240s. That is often not enough;
 in our testing, many cases fail because the guest cannot boot up within 240s.

I like the idea of unifying all login timeouts. I remember to have
implemented this partially for some tests, all with different parameter,
which ended up being a mess. Let me search for all the occurrences, fix
them and merge with your patch. I'll return with an updated patch.

 The following scripts have been updated:
 client/tests/kvm/tests/autoit.py
 client/tests/kvm/tests/autotest.py
 client/tests/kvm/tests/balloon_check.py
 client/tests/kvm/tests/guest_s4.py
 client/tests/kvm/tests/iozone_windows.py
 client/tests/kvm/tests/linux_s3.py
 client/tests/kvm/tests/migration.py
 client/tests/kvm/tests/pci_hotplug.py
 client/tests/kvm/tests/physical_resources_check.py
 client/tests/kvm/tests/shutdown.py
 client/tests/kvm/tests/timedrift.py
 client/tests/kvm/tests/timedrift_with_migration.py
 client/tests/kvm/tests/timedrift_with_reboot.py
 client/tests/kvm/tests/vlan_tag.py
 client/tests/kvm/tests/yum_update.py
 
 Signed-off-by: Feng Yang fy...@redhat.com
 ---
  client/tests/kvm/tests/autoit.py   |3 ++-
  client/tests/kvm/tests/autotest.py |3 ++-
  client/tests/kvm/tests/balloon_check.py|3 ++-
  client/tests/kvm/tests/guest_s4.py |5 +++--
  client/tests/kvm/tests/iozone_windows.py   |3 ++-
  client/tests/kvm/tests/linux_s3.py |3 ++-
  client/tests/kvm/tests/migration.py|5 +++--
  client/tests/kvm/tests/pci_hotplug.py  |3 ++-
  client/tests/kvm/tests/physical_resources_check.py |3 ++-
  client/tests/kvm/tests/shutdown.py |3 ++-
  client/tests/kvm/tests/timedrift.py|3 ++-
  client/tests/kvm/tests/timedrift_with_migration.py |3 ++-
  client/tests/kvm/tests/timedrift_with_reboot.py|3 ++-
  client/tests/kvm/tests/vlan_tag.py |5 +++--
  client/tests/kvm/tests/yum_update.py   |3 ++-
  client/tests/kvm/tests_base.cfg.sample |2 +-
  16 files changed, 34 insertions(+), 19 deletions(-)
 
 diff --git a/client/tests/kvm/tests/autoit.py 
 b/client/tests/kvm/tests/autoit.py
 index ed1d491..ba3cdf3 100644
 --- a/client/tests/kvm/tests/autoit.py
 +++ b/client/tests/kvm/tests/autoit.py
 @@ -17,7 +17,8 @@ def run_autoit(test, params, env):
  @param env: Dictionary with the test environment.
  
   vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
  -session = kvm_test_utils.wait_for_login(vm)
  +timeout = int(params.get("login_timeout", 360))
  +session = kvm_test_utils.wait_for_login(vm, timeout=timeout)
  
  try:
   logging.info("Starting script...")
 diff --git a/client/tests/kvm/tests/autotest.py 
 b/client/tests/kvm/tests/autotest.py
 index 31e36cf..2916ebd 100644
 --- a/client/tests/kvm/tests/autotest.py
 +++ b/client/tests/kvm/tests/autotest.py
 @@ -13,7 +13,8 @@ def run_autotest(test, params, env):
  @param env: Dictionary with the test environment.
  
   vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
  -session = kvm_test_utils.wait_for_login(vm)
  +timeout = int(params.get("login_timeout", 360))
  +session = kvm_test_utils.wait_for_login(vm, timeout=timeout)
  
  # Collect test parameters
   timeout = int(params.get("test_timeout", 300))
 diff --git a/client/tests/kvm/tests/balloon_check.py 
 b/client/tests/kvm/tests/balloon_check.py
 index 2d483c6..f33bda8 100644
 --- a/client/tests/kvm/tests/balloon_check.py
 +++ b/client/tests/kvm/tests/balloon_check.py
 @@ -64,7 +64,8 @@ def run_balloon_check(test, params, env):
  
  fail = 0
   vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
  -session = kvm_test_utils.wait_for_login(vm)
  +timeout = int(params.get("login_timeout", 360))
  +session = kvm_test_utils.wait_for_login(vm, timeout=timeout)
  
  # Upper limit that we can raise the memory
   vm_assigned_mem = int(params.get("mem"))
 diff --git a/client/tests/kvm/tests/guest_s4.py 
 b/client/tests/kvm/tests/guest_s4.py
 index 88b2b7c..608d542 100644
 --- a/client/tests/kvm/tests/guest_s4.py
 +++ b/client/tests/kvm/tests/guest_s4.py
 @@ -12,7 +12,8 @@ def run_guest_s4(test, params, env):
  @param env: Dictionary with the test environment.
  
   vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
  -session = kvm_test_utils.wait_for_login(vm)
  +timeout = int(params.get("login_timeout", 360))
  +session = kvm_test_utils.wait_for_login(vm, timeout=timeout)
  
   logging.info("Checking whether guest OS supports suspend to disk "
                "(S4)...")
   s, o = session.get_command_status_output(params.get("check_s4_support_cmd"))
 @@ -31,7 +32,7 @@ def run_guest_s4(test, params, env):
  time.sleep(5)
  
  # Get the 

Re: [PATCH] kvm mmu: Don't calculate quadrant if tdp_enabled.

2010-06-01 Thread Marcelo Tosatti
On Mon, May 31, 2010 at 05:11:39PM +0800, Gui Jianfeng wrote:
 There's no need to calculate quadrant if tdp is enabled.
 
 Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com
 ---
  arch/x86/kvm/mmu.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

Applied, thanks.



Re: [PATCH] KVM: MMU: Document large pages

2010-06-01 Thread Marcelo Tosatti
On Thu, May 27, 2010 at 04:44:12PM +0300, Avi Kivity wrote:
 Signed-off-by: Avi Kivity a...@redhat.com
 ---
  Documentation/kvm/mmu.txt |   23 +++
  1 files changed, 23 insertions(+), 0 deletions(-)
 
 diff --git a/Documentation/kvm/mmu.txt b/Documentation/kvm/mmu.txt
 index 1e7ecdd..6a0cab2 100644
 --- a/Documentation/kvm/mmu.txt
 +++ b/Documentation/kvm/mmu.txt
 @@ -317,6 +317,29 @@ on fault type:
  
  (user write faults generate a #PF)
  
 +Large pages
 +===
 +
 +The mmu supports all combinations of large and small guest and host pages.
 +Supported page sizes include 4k, 2M, 4M, and 1G.  4M pages are treated as
 +two separate 2M pages, on both guest and host, since the mmu always uses PAE
 +paging.
 +
 +To instantiate a large spte, four constraints must be satisfied:
 +
 +- the spte must point to a large host page
 +- the guest pte must be a large pte of at least equivalent size (if tdp is
 +  enabled, there is no guest pte and this condition is satisfied)
 +- if the spte will be writeable, the large page frame may not overlap any
 +  write-protected pages
 +- the guest guest pte must be wholly contained by a single memory slot

Fixed this guest guest pte typo and applied.



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Tom Lyon
On Tuesday 01 June 2010 03:46:51 am Michael S. Tsirkin wrote:
 On Tue, Jun 01, 2010 at 01:28:48PM +0300, Avi Kivity wrote:
  On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
 
It can't program the iommu.
  What
  the patch proposes is that userspace tells vfio about the needed
  mappings, and vfio programs the iommu.
   
  There seems to be some misunderstanding.  The userspace interface
  proposed forces a separate domain per device and forces userspace to
  repeat iommu programming for each device.  We are better off sharing a
  domain between devices and programming the iommu once.
 
 
iommufd = open(/dev/iommu);
ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)
 
  ?
 
 Yes.
 
  If so, I agree.
 
 Good.

I'm not really opposed to multiple devices per domain, but let me point out
how I ended up here.  First, the driver has two ways of mapping pages, one
based on the iommu api and one based on the dma_map_sg api.  With the latter,
the system already allocates a domain per device and there's no way to
control it. This was presumably done to help isolation between drivers.  If
there are multiple drivers in the user level, do we not want the same
isolation to apply to them?
Also, domains are not a very scarce resource - my little core i5 has 256,
and the intel architecture goes to 64K.
And then there's the fact that it is possible to have multiple disjoint
iommus on a system, so it may not even be possible to bring 2 devices under
one domain.

Given all that, I am inclined to leave it alone until someone has a real
problem.  Note that not sharing iommu domains doesn't mean you can't share
device memory, just that you have to do multiple mappings.


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Tom Lyon
On Monday 31 May 2010 10:17:35 am Alan Cox wrote:
 
 Does look like it needs a locking audit, some memory and error checks
 reviewing and some further review of the ioctl security and
 overflows/trusted values.
Yes. Thanks for the detailed look.
 
 Rather a nice way of attacking the user space PCI problem.
And thanks for that!



Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Eric Dumazet
Le mardi 01 juin 2010 à 19:52 +0300, Avi Kivity a écrit :

 What I'd like to see eventually is a short-term-unfair, long-term-fair 
 spinlock.  Might make sense for bare metal as well.  But it won't be 
 easy to write.
 

This thread rings a bell here :)

Yes, ticket spinlocks are sometime slower, especially in workloads where
a spinlock needs to be taken several times to handle one unit of work,
and many cpus competing.

We currently have kind of a similar problem in network stack, and we
have a patch to speedup xmit path by an order of magnitude, letting one
cpu (the consumer cpu) to get unfair access to the (ticket) spinlock.
(It can compete with no more than one other cpu)

Boost from ~50.000 to ~600.000 pps on a dual quad core machine (E5450
@3.00GHz) on a particular workload (many cpus want to xmit their
packets)

( patch : http://patchwork.ozlabs.org/patch/53163/ )


It could be possible to write such a generic beast, with a cascade of
regular ticket spinlocks?

One ticket spinlock at first stage (only if some conditions are met, aka
slow path), then an 'primary' spinlock at second stage.


// generic implementation
// (x86 could use 16bit fields for users_in and users_out)
struct cascade_lock {
	atomic_t	users_in;
	int		users_out;
	spinlock_t	primlock;
	spinlock_t	slowpathlock; // could be outside of this structure, shared by many 'cascade_locks'
};

/*
 * In kvm case, you might call hypervisor when slowpathlock is about to be taken ?
 * When a cascade lock is unlocked, and relocked right after, this cpu has unfair
 * priority and could get the lock before cpus blocked in slowpathlock (especially if
 * an hypervisor call was done)
 *
 * In network xmit path, the dequeue thread would use highprio_user=true mode
 * In network xmit path, the 'contended' enqueueing thread would set a negative threshold,
 * to force a 'lowprio_user' mode.
 */
void cascade_lock(struct cascade_lock *l, bool highprio_user, int threshold)
{
	bool slowpath = false;

	atomic_inc(&l->users_in); // no real need for atomic_inc_return()
	if (atomic_read(&l->users_in) - l->users_out > threshold && !highprio_user) {
		spin_lock(&l->slowpathlock);
		slowpath = true;
	}
	spin_lock(&l->primlock);
	if (slowpath)
		spin_unlock(&l->slowpathlock);
}

void cascade_unlock(struct cascade_lock *l)
{
	l->users_out++;
	spin_unlock(&l->primlock);
}
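
As an illustrative use, following the comments above (not part of the
actual patch):

/* dequeue/consumer thread: unfair fast path */
cascade_lock(&txlock, true, 0);
/* ... dequeue and transmit packets ... */
cascade_unlock(&txlock);

/* contended enqueueing thread: negative threshold forces the slow path */
cascade_lock(&txlock, false, -1);
/* ... enqueue one packet ... */
cascade_unlock(&txlock);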





[PATCH] pc: push setting default cpu_model down a level

2010-06-01 Thread Alex Williamson
Not that CPU hotplug currently works, but if you make the mistake of
trying it on a VM started without specifying a -cpu value, you hit
a segfault from trying to strdup(NULL) in cpu_x86_find_by_name().

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/pc.c |   16 
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 9b85c42..a79586c 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -771,6 +771,14 @@ static CPUState *pc_new_cpu(const char *cpu_model)
 {
 CPUState *env;
 
+    if (cpu_model == NULL) {
+#ifdef TARGET_X86_64
+        cpu_model = "qemu64";
+#else
+        cpu_model = "qemu32";
+#endif
+    }
+
     env = cpu_init(cpu_model);
     if (!env) {
         fprintf(stderr, "Unable to find x86 CPU definition\n");
@@ -791,14 +799,6 @@ void pc_cpus_init(const char *cpu_model)
 int i;
 
 /* init CPUs */
-    if (cpu_model == NULL) {
-#ifdef TARGET_X86_64
-        cpu_model = "qemu64";
-#else
-        cpu_model = "qemu32";
-#endif
-    }
-
     for(i = 0; i < smp_cpus; i++) {
 pc_new_cpu(cpu_model);
 }



Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost workers

2010-06-01 Thread Tejun Heo
On 06/01/2010 07:19 PM, Sridhar Samudrala wrote:
 -int i;
 +cpumask_var_t mask;
 +int i, ret = -ENOMEM;
 +
 +if (!alloc_cpumask_var(mask, GFP_KERNEL))
 +goto out_free_mask;
 
 I think this is another bug in the error path. You should simply
 do a return instead of a goto here when aloc_cpu_mask fails.

Oh... it's always safe to call free_cpumask_var() after failed
alloc_cpumask_var(), so that part isn't broken.
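
A minimal illustration of the pattern (with CONFIG_CPUMASK_OFFSTACK a
failed alloc_cpumask_var() leaves the pointer NULL and freeing NULL is
a no-op; without it both calls are trivial):

	cpumask_var_t mask;
	int ret = -ENOMEM;

	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
		goto out_free_mask;	/* free after failed alloc is safe */
	/* ... work that may fail and jump to out_free_mask ... */
	ret = 0;
out_free_mask:
	free_cpumask_var(mask);
	return ret;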

Thanks.

-- 
tejun


[ kvm-Bugs-2989366 ] huge memory leak (qemu-kvm 0.12.3)

2010-06-01 Thread SourceForge.net
Bugs item #2989366, was opened at 2010-04-19 13:47
Message generated for change (Comment added) made by sf-robot
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2989366&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Closed
Resolution: Fixed
Priority: 5
Private: No
Submitted By: tgr (tgr1)
Assigned to: Nobody/Anonymous (nobody)
Summary: huge memory leak (qemu-kvm 0.12.3)

Initial Comment:
CPU: Xeon E5450
KVM: qemu-kvm 0.12.3
Host: Debian Squeeze amd64, kernel 2.6.32
Guest: Debian Lenny amd64, kernel 2.6.26
qemu command line below (no helpers like libvirt used)
whether the problem goes away if using the -no-kvm-irqchip or -no-kvm-pit 
switch: seems unrelated
whether the problem also appears with the -no-kvm switch: load profile of the 
guest makes running without KVM infeasible

The problem: after 4 days, the kvm process of a guest run with -mem 3072 takes 
12 GB VSZ / 9 GB RSS - grows until OOM (kvm invoked oom-killer in dmesg).
At that point the guest shows more than 2 GB of free memory.

qemu command line:

kvm -smp 4 -m 3072 -net nic,vlan=0,model=virtio,name=eth0,macaddr=x:x:x:x:x:x 
-net tap,vlan=0,ifname=tap6 -net 
nic,vlan=1,model=virtio,name=eth1,macaddr=x:x:x:x:x:x -net 
tap,vlan=1,ifname=tap7 -drive if=virtio,file=/dev/x/x-rootfs,index=0 -drive 
if=virtio,file=/dev/x/x-swap,index=1 -name x -nographic -kernel 
/boot/vmlinuz-2.6.26-2-amd64 -initrd /boot/initrd.img-2.6.26-2-amd64 -append 
"root=/dev/vda console=ttyS0" -echr 2 -serial 
mon:telnet:localhost:12344,server,nowait -monitor 
unix:/var/run/kvm-x,server,nowait -daemonize -watchdog i6300esb 
-watchdog-action reset

(the monitor on both telnet and unix socket is intentional, the one on the 
socket is only used for sending system_reset and system_powerdown qemu 
commands).


--

Comment By: SourceForge Robot (sf-robot)
Date: 2010-06-02 02:20

Message:
This Tracker item was closed automatically by the system. It was
previously set to a Pending status, and the original submitter
did not respond within 14 days (the time period specified by
the administrator of this Tracker).

--

Comment By: Avi Kivity (avik)
Date: 2010-05-18 17:44

Message:
Should be fixed in qemu-kvm-0.12.4.

--

Comment By: trevoro (trevoro)
Date: 2010-05-18 17:25

Message:
I am experiencing an identical issue on an AMD Sempron with qemu-kvm
0.12.3. 

We have a few virtual machines running on this host. After a few days this
is what the memory utilization looks like for the KVM processes.


be95ae3e-5f79-11df-b0c0-540001bf0689  1024M: vsz=1241M  rss=706M
3bd6f690-5c65-11df-9fb1-5400013f4a54 512M: vsz=675M   rss=466M
c103e4e4-5e00-11df-94fe-540001fccc62256M: vsz=1470M  rss=301M
0c068dba-5df8-11df-938c-5400015669e3   256M: vsz=7646M  rss=2560M




--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2989366&group_id=180599


Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Avi Kivity

On 06/01/2010 08:39 PM, valdis.kletni...@vt.edu wrote:

We are running everything on NUMA (since all modern machines are now
NUMA).  At what scale do the issues become observable?
 

My 6-month-old laptop is NUMA? Comes as a surprise to me, and to the
perfectly-running NUMA=n kernel I'm running.

Or did you mean a less broad phrase than "all modern machines"?

   


All modern two socket and above boards, sorry.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Avi Kivity

On 06/01/2010 08:27 PM, Andi Kleen wrote:

On Tue, Jun 01, 2010 at 07:52:28PM +0300, Avi Kivity wrote:
   

We are running everything on NUMA (since all modern machines are now NUMA).
  At what scale do the issues become observable?
 

On Intel platforms it's visible starting with 4 sockets.
   


Can you recommend a benchmark that shows bad behaviour?  I'll run it 
with ticket spinlocks and Gleb's patch.  I have a 4-way Nehalem-EX, 
presumably the huge number of threads will magnify the problem even more 
there.



I understand that reason and do not propose to get back to the old spinlock
on physical HW! But with virtualization the performance hit is unbearable.

 

Extreme unfairness can be unbearable too.

   

Well, the question is what happens first.  In our experience, vcpu
overcommit is a lot more painful.  People will never see the NUMA
unfairness issue if they can't use kvm due to the vcpu overcommit problem.
 

You really have to address both; if you don't fix them both,
users will eventually run into one of them and be unhappy.
   


That's definitely the long term plan.  I consider Gleb's patch the first 
step.


Do you have any idea how we can tackle both problems?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Avi Kivity

On 06/02/2010 12:26 AM, Tom Lyon wrote:


I'm not really opposed to multiple devices per domain, but let me point out
how I ended up here.  First, the driver has two ways of mapping pages, one
based on the iommu api and one based on the dma_map_sg api.  With the latter,
the system already allocates a domain per device and there's no way to
control it. This was presumably done to help isolation between drivers.  If
there are multiple drivers in the user level, do we not want the same
isolation to apply to them?
   


In the case of kvm, we don't want isolation between devices, because 
that doesn't happen on real hardware.  So if the guest programs devices 
to dma to each other, we want that to succeed.



Also, domains are not a very scarce resource - my little core i5 has 256,
and the intel architecture goes to 64K.
   


But there is a page-table cost of roughly 0.2% of mapped memory per domain 
(one 8-byte leaf pte per 4KB page).  For the kvm use case, that could be 
significant since a guest may have large amounts of memory and large 
numbers of assigned devices.



And then there's the fact that it is possible to have multiple disjoint iommus 
on a system,
so it may not even be possible to bring 2 devices under one domain.
   


That's indeed a deficiency.


Given all that, I am inclined to leave it alone until someone has a real 
problem.
Note that not sharing iommu domains doesn't mean you can't share device memory,
just that you have to do multiple mappings
   


I think we do have a real problem (though a mild one).

The only issue I see with deferring the solution is that the API becomes 
gnarly; both the kernel and userspace will have to support both APIs 
forever.  Perhaps we can implement the new API but defer the actual 
sharing until later, don't know how much work this saves.  Or Alex/Chris 
can pitch in and help.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Alex Williamson
On Tue, 2010-06-01 at 13:28 +0300, Avi Kivity wrote:
 On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
 
It can't program the iommu.
  What
  the patch proposes is that userspace tells vfio about the needed
  mappings, and vfio programs the iommu.
   
  There seems to be some misunderstanding.  The userspace interface
  proposed forces a separate domain per device and forces userspace to
  repeat iommu programming for each device.  We are better off sharing a
  domain between devices and programming the iommu once.
 
 
iommufd = open(/dev/iommu);
ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)

It seems part of the annoyance of the current KVM device assignment is
that we have multiple files open, we mmap here, read there, write over
there, maybe, if it's not emulated.  I quite like Tom's approach that we
have one stop shopping with /dev/vfio<n>, including config space
emulation so each driver doesn't have to try to write their own.  So
continuing with that, shouldn't we be able to add a GET_IOMMU/SET_IOMMU
ioctl to vfio so that after we setup one device we can bind the next to
the same domain?

Alex



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Tom Lyon
On Tuesday 01 June 2010 09:29:47 pm Alex Williamson wrote:
 On Tue, 2010-06-01 at 13:28 +0300, Avi Kivity wrote:
  On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
  
 It can't program the iommu.
   What
   the patch proposes is that userspace tells vfio about the needed
   mappings, and vfio programs the iommu.

   There seems to be some misunderstanding.  The userspace interface
   proposed forces a separate domain per device and forces userspace to
   repeat iommu programming for each device.  We are better off sharing a
   domain between devices and programming the iommu once.
  
  
 iommufd = open(/dev/iommu);
 ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
 ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)
 
 It seems part of the annoyance of the current KVM device assignment is
 that we have multiple files open, we mmap here, read there, write over
 there, maybe, if it's not emulated.  I quite like Tom's approach that we
 have one stop shopping with /dev/vfion, including config space
 emulation so each driver doesn't have to try to write their own.  So
 continuing with that, shouldn't we be able to add a GET_IOMMU/SET_IOMMU
 ioctl to vfio so that after we setup one device we can bind the next to
 the same domain?

This is just what I was thinking.  But rather than a get/set, just use two fds.

ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2);

This may fail if there are really 2 different IOMMUs, so user code must be
prepared for failure.  In addition, this is strictly upwards compatible with
what is there now, so maybe we can add it later.
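
For illustration, the userspace flow might look like this
(VFIO_SET_DOMAIN is the proposed ioctl, not an existing one, and the
device paths are made up):

	int fd1 = open("/dev/vfio0", O_RDWR);
	int fd2 = open("/dev/vfio1", O_RDWR);

	/* ask the driver to put fd2's device into fd1's iommu domain */
	if (ioctl(fd1, VFIO_SET_DOMAIN, fd2) < 0) {
		/* really 2 different IOMMUs: fall back to programming
		 * the same mappings into each device's own domain */
	}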




Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Avi Kivity

On 06/02/2010 07:59 AM, Tom Lyon wrote:


This is just what I was thinking.  But rather than a get/set, just use two fds.

ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2);

This may fail if there are really 2 different IOMMUs, so user code must be
prepared for failure,  In addition, this is strictlyupwards compatible with
what is there now, so maybe we can add it later.

   


What happens if one of the fds is later closed?

I don't like this conceptually.  There is a 1:n relationship between the 
memory map and the devices.  Ignoring it will cause the API to have 
warts.  It's more straightforward to have an object to represent the 
memory mapping (and talk to the iommus), and have devices bind to this 
object.



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Srivatsa Vaddagiri
On Wed, Jun 02, 2010 at 05:51:14AM +0300, Avi Kivity wrote:
 That's definitely the long term plan.  I consider Gleb's patch the
 first step.
 
 Do  you have any idea how we can tackle both problems?

I recall Xen posting some solution for a similar problem:

http://lkml.org/lkml/2010/1/29/45

Wouldn't a similar approach help KVM as well?

- vatsa


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Chris Wright
* Avi Kivity (a...@redhat.com) wrote:
 On 06/02/2010 12:26 AM, Tom Lyon wrote:
 
 I'm not really opposed to multiple devices per domain, but let me point out
 how I ended up here.  First, the driver has two ways of mapping pages, one
 based on the iommu api and one based on the dma_map_sg api.  With the
 latter, the system already allocates a domain per device and there's no way
 to control it. This was presumably done to help isolation between drivers.
 If there are multiple drivers in the user level, do we not want the same
 isolation to apply to them?
 
 In the case of kvm, we don't want isolation between devices, because
 that doesn't happen on real hardware.

Sure it does.  That's exactly what happens when there's an iommu
involved with bare metal.

 So if the guest programs
 devices to dma to each other, we want that to succeed.

And it will as long as ATS is enabled (this is a basic requirement
for PCIe peer-to-peer traffic to succeed with an iommu involved on
bare metal).

That's how things currently are, i.e. we put all devices belonging to a
single guest in the same domain.  However, it can be useful to put each
device belonging to a guest in a unique domain.  Especially as qemu
grows support for iommu emulation, and guest OSes begin to understand
how to use a hw iommu.

 Also, domains are not a very scarce resource - my little core i5 has 256,
 and the intel architecture goes to 64K.
 
 But there is a 0.2% of mapped memory per domain cost for the page
 tables.  For the kvm use case, that could be significant since a
 guest may have large amounts of memory and large numbers of assigned
 devices.
 
 And then there's the fact that it is possible to have multiple disjoint 
 iommus on a system,
 so it may not even be possible to bring 2 devices under one domain.
 
 That's indeed a deficiency.

Not sure it's a deficiency.  Typically to share page table mappings
across multiple iommu's you just have to do update/invalidate to each
hw iommu that is sharing the mapping.  Alternatively, you can use more
memory and build/maintain identical mappings (as Tom alludes to below).

 Given all that, I am inclined to leave it alone until someone has a real 
 problem.
 Note that not sharing iommu domains doesn't mean you can't share device 
 memory,
 just that you have to do multiple mappings
 
 I think we do have a real problem (though a mild one).
 
 The only issue I see with deferring the solution is that the API
 becomes gnarly; both the kernel and userspace will have to support
 both APIs forever.  Perhaps we can implement the new API but defer
 the actual sharing until later, don't know how much work this saves.
 Or Alex/Chris can pitch in and help.

It really shouldn't be that complicated to create the API to allow for
flexible device - domain mappings, so I agree, makes sense to do it
right up front.

thanks,
-chris


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Avi Kivity

On 06/02/2010 08:29 AM, Chris Wright wrote:

* Avi Kivity (a...@redhat.com) wrote:
   

On 06/02/2010 12:26 AM, Tom Lyon wrote:
 

I'm not really opposed to multiple devices per domain, but let me point out
how I ended up here.  First, the driver has two ways of mapping pages, one
based on the iommu api and one based on the dma_map_sg api.  With the latter,
the system already allocates a domain per device and there's no way to
control it. This was presumably done to help isolation between drivers.  If
there are multiple drivers in the user level, do we not want the same
isolation to apply to them?
   

In the case of kvm, we don't want isolation between devices, because
that doesn't happen on real hardware.
 

Sure it does.  That's exactly what happens when there's an iommu
involved with bare metal.
   


But we are emulating a machine without an iommu.

When we emulate a machine with an iommu, then yes, we'll want to use as 
many domains as the guest does.



So if the guest programs
devices to dma to each other, we want that to succeed.
 

And it will as long as ATS is enabled (this is a basic requirement
for PCIe peer-to-peer traffic to succeed with an iommu involved on
bare metal).

That's how things currently are, i.e. we put all devices belonging to a
single guest in the same domain.  However, it can be useful to put each
device belonging to a guest in a unique domain.  Especially as qemu
grows support for iommu emulation, and guest OSes begin to understand
how to use a hw iommu.
   


Right, we need to keep flexibility.


And then there's the fact that it is possible to have multiple disjoint iommus 
on a system,
so it may not even be possible to bring 2 devices under one domain.
   

That's indeed a deficiency.
 

Not sure it's a deficiency.  Typically to share page table mappings
across multiple iommu's you just have to do update/invalidate to each
hw iommu that is sharing the mapping.  Alternatively, you can use more
memory and build/maintain identical mappings (as Tom alludes to below).
   


Sharing the page tables is just an optimization; I was worried about 
devices in separate domains not talking to each other.  If ATS fixes 
that, great.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: 32-bit qemu + 64-bit kvm be a problem?

2010-06-01 Thread Neo Jia
On Wed, Mar 10, 2010 at 2:12 PM, Michael Tokarev m...@tls.msk.ru wrote:
 Neo Jia wrote:
 hi,

 I have to keep a 32-bit qemu user space to work with some legacy
 library I have, but still want to use a 64-bit host Linux to explore
 the 64-bit advantage.

 So I am wondering if I can use a 32-bit qemu + 64-bit kvm-kmod
 configuration. Will there be any limitation or drawback for this
 configuration? I already found one: we can't assign more than 2047 MB
 of guest physical memory.

 I have used 32bit kvm on a 64bit kernel since day one.  Nothing of
 interest since then, everything just works.

Michael,

I just came back to this thread because I am seeing that I can't run
64-bit Vista in 64/32 mode; it crashes with bugcheck 0x5D.
Is this a known issue?


 Recently (this week) I came across a situation where something does not
 work in 64/32 mode.  Namely it is linux aio (see the other thread in
 kvm@ a few days back) - but this is not due to kvm but due to another
 kernel subsystem (in this case aio) which lacks proper compat handlers
 in place.

Could you tell me more about the AIO issue? Will this slow down
the guest if it does a lot of I/O? Will setting up coalescing help?

Thanks,
Neo



 Generally I have reported quite a few issues in this config - here and
 there something did not work.  Now the places where we see issues are
 decreasing (hopefully anyway); at least I haven't seen issues recently,
 except for this aio stuff.

 But strictly speaking, I don't see any good reason to run 32bit kvm on a
 64bit kernel either.  Most distributions nowadays provide a set of
 64bit libraries for their 32bit versions so that limited support for
 64bit binaries is available.  This is mostly enough for kvm - without
 X and SDL support it works just fine (using a vnc display).  Historically
 I have a 32bit userspace, but most guests now are running with 64bit kvm -
 either because the guests switched to a 64bit kernel, or because of the
 aio thing, or just because it looks like it is more efficient (less
 syscall/ioctl 32<->64 translation and the like).  kvm itself uses only
 very little memory, so here it makes almost no difference between 32 and
 64 bits (in 64bit pointers are larger and hence usually more memory is
 used).  Yes, it is difficult to provide everything needed for sdl, but
 for our tasks SDL windows aren't really necessary, and for testing 32bit
 mode works just fine too...

 /mjt




-- 
I would remember that if researchers were not ambitious
probably today we haven't the technology we are using!


Re: [PATCH] KVM: PPC: elide struct thread_struct instances from stack

2010-06-01 Thread Alexander Graf

On 01.06.2010, at 10:36, Andreas Schwab wrote:

 Paul Mackerras pau...@samba.org writes:
 
 I re-read the relevant part of the PowerPC architecture spec
 yesterday, and it seems pretty clear that the FPSCR doesn't affect the
 behaviour of lfs and stfs, and is not affected by them.  So in fact 4
 out of the 7 instructions in each of those procedures are unnecessary
 (and similarly for the cvt_fd/df used in the alignment fixup code).
 
 I'd prefer to have this deferred to a separate patch.

I agree. Andreas' patch takes the current logic and moves it to be KVM 
contained, so we don't clutter the stack. The fact that the old code was 
inefficient is a separate story.

Avi / Marcelo, please apply the patch nevertheless.


Alex



Re: Any comments? Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-06-01 Thread Marcelo Tosatti
On Mon, May 24, 2010 at 04:05:29PM +0900, Takuya Yoshikawa wrote:
 (2010/05/17 18:06), Takuya Yoshikawa wrote:
 
 User allocated bitmaps have the advantage of reducing pinned memory.
 However we have plenty more pinned memory allocated in memory slots, so
 by itself, user allocated bitmaps don't justify this change.
 
 Sorry for pinging several times.
 
 
 In that sense, what do you think about the question I sent last week?
 
 === REPOST 1 ===
  
   mark_page_dirty is called with the mmu_lock spinlock held in set_spte.
   Must find a way to move it outside of the spinlock section.
 
 I am now trying to do something to solve this spinlock problem. But the
 spinlock section looks too wide to solve with a simple workaround.
 
 Sorry but I have to say that the mmu_lock spin_lock problem was
 completely out of my mind. Although I looked through the code, it seems
 not easy to move the set_bit_user to outside of the spinlock section
 without breaking the semantics of its protection.
 
 So this may take some time to solve.
 
 But personally, I want to do something for x86's vmalloc()-every-time
 problem even though moving dirty bitmaps to user space cannot be
 achieved soon.
 
 In that sense, do you mind if we do double buffering without moving
 dirty bitmaps to user space?
 
 So I would be happy if you give me any comments about this kind of other
 options.

What if you pin the bitmaps? 

The alternative to that is to move mark_page_dirty(gfn) before acquisition 
of mmu_lock, in the page fault paths. The downside of that is a
potentially (large?) number of false positives in the dirty bitmap.
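
Schematically (illustrative only), a page fault path would become:

	mark_page_dirty(kvm, gfn);	/* may be a false positive: the
					 * fault might not install a
					 * writable spte after all */
	spin_lock(&kvm->mmu_lock);
	/* ... instantiate sptes; no dirty bitmap access under the lock ... */
	spin_unlock(&kvm->mmu_lock);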



Re: Any comments? Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-06-01 Thread Takuya Yoshikawa

(2010/06/01 19:55), Marcelo Tosatti wrote:


Sorry but I have to say that the mmu_lock spin_lock problem was completely
out of my mind. Although I looked through the code, it seems not easy to
move the set_bit_user to outside of the spinlock section without breaking
the semantics of its protection.

So this may take some time to solve.

But personally, I want to do something for x86's vmalloc()-every-time
problem even though moving dirty bitmaps to user space cannot be achieved
soon.

In that sense, do you mind if we do double buffering without moving dirty
bitmaps to user space?


So I would be happy if you could give me any comments about these kinds of
other options.


What if you pin the bitmaps?


Yes, pinning bitmaps works. The small problem is that we need to hold
the dirty_bitmap_pages[] array for every slot; the size of this array
depends on the slot length, and of course there is the pinning itself.

From the performance point of view, having a double sized vmalloc'ed
area may be better.



The alternative to that is to move mark_page_dirty(gfn) before acquisition
of mmu_lock, in the page fault paths. The downside of that is a
potentially (large?) number of false positives in the dirty bitmap.



Interesting, but probably dangerous.


From my experience, though this includes my personal view, removing the
vmalloc that x86 currently does every time is the simplest and most
effective change.

So if you don't mind, I want to double the size of the vmalloc'ed area for
x86 without changing other parts.

 ==> if this one more bitmap is problematic, dirty logging itself would be
 in danger of failure: we need to have the same size at the time of the
 switch.

Make sense?


We can consider moving dirty bitmaps to user space later.
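
A rough sketch of the flip on GET_DIRTY_LOG (the *_other field is
hypothetical; the point is that both halves are vmalloc'ed once at slot
setup instead of vmalloc'ing a fresh bitmap on every call):

	unsigned long *cur = slot->dirty_bitmap;
	unsigned long *clean = slot->dirty_bitmap_other;

	slot->dirty_bitmap = clean;	/* new dirty bits go to the clean half */
	slot->dirty_bitmap_other = cur;
	if (copy_to_user(user_bitmap, cur, bytes))
		return -EFAULT;
	memset(cur, 0, bytes);		/* ready for the next flip */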


Re: Any comments? Re: [RFC][PATCH 0/12] KVM, x86, ppc, asm-generic: moving dirty bitmaps to user space

2010-06-01 Thread Marcelo Tosatti
On Tue, Jun 01, 2010 at 09:05:38PM +0900, Takuya Yoshikawa wrote:
 (2010/06/01 19:55), Marcelo Tosatti wrote:
 
 Sorry but I have to say that the mmu_lock spin_lock problem was
 completely out of my mind. Although I looked through the code, it seems
 not easy to move the set_bit_user to outside of the spinlock section
 without breaking the semantics of its protection.
 
 So this may take some time to solve.
 
 But personally, I want to do something for x86's vmalloc()-every-time
 problem even though moving dirty bitmaps to user space cannot be
 achieved soon.
 
 In that sense, do you mind if we do double buffering without moving
 dirty bitmaps to user space?
 
 So I would be happy if you could give me any comments about these kinds
 of other options.
 
 What if you pin the bitmaps?
 
 Yes, pinning bitmaps works. The small problem is that we need to hold
 the dirty_bitmap_pages[] array for every slot, the size of this array
 depends on the slot length, and of course pinning itself.
 
 In the performance point of view, having double sized vmalloc'ed
 area may be better.
 
 
 The alternative to that is to move mark_page_dirty(gfn) before acquisition
 of mmu_lock, in the page fault paths. The downside of that is a
 potentially (large?) number of false positives in the dirty bitmap.
 
 
 Interesting, but probably dangerous.
 
 
 From my experience, though this includes my personal view, removing the
 vmalloc that x86 currently does every time is the simplest and most
 effective change.
 
 So if you don't mind, I want to double the size of the vmalloc'ed area
 for x86 without changing other parts.
 
  ==> if this one more bitmap is problematic, dirty logging itself would be
  in danger of failure: we need to have the same size at the time of the
  switch.
 
 Make sense?

That seems the most sensible approach.

 
 We can consider moving dirty bitmaps to user space later.