Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-22 Thread Avi Kivity

On 04/22/2010 08:37 AM, Gerhard Wiesinger wrote:

On Wed, 21 Apr 2010, Avi Kivity wrote:


On 04/21/2010 09:14 PM, Gerhard Wiesinger wrote:


Can you explain which code files/functions of KVM are involved in 
handling the VGA memory window and page switching through the port write 
to the VGA window register (or is that part handled through QEMU)? A 
little architecture explanation would be nice.


qemu hw/vga.c and hw/cirrus_vga.c.  Boring functions like 
vbe_ioport_write_data() and vga_ioport_write().




Yes, I was already in that code, and those are very simple functions, 
as already explained, so within QEMU alone they are very fast. But I 
meant: what is the calling path from the KVM guest OS to hw/vga.c for 
memory and I/O accesses, and which parts are done directly in hardware 
(to understand the speed gap and maybe find a solution)?


The speed gap is mostly due to hardware constraints (it takes ~2000 
cycles for an exit from guest mode, plus we need to switch a few MSRs to 
get to userspace).


See vmx_vcpu_run(), the vmresume instruction is where an exit starts.
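(Back of the envelope: at ~2000 cycles per exit, a guest that traps on 
every byte-sized framebuffer write, i.e. on the order of 1M writes/s, 
spends ~2e9 cycles/s on the transitions alone, before any emulation 
work is done.)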





BTW: in which KVM code parts is it decided whether "direct code" or 
"emulated device code" is used?




Same place.  Look for calls to cpu_register_physical_memory().  If 
the last argument was obtained by a call to cpu_register_io_memory(), 
then all writes trap.  Otherwise, it was obtained by qemu_ram_alloc() 
and writes will not trap (except the first write to a page in a 30ms 
window, used to note that the page is dirty and needs redrawing).
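
For context, a minimal sketch of that 0.12-era pattern (signatures 
abbreviated and the callback tables omitted, so illustrative rather 
than the exact API):

/* MMIO case: every guest access traps into the device callbacks. */
int io_index = cpu_register_io_memory(vga_mem_read, vga_mem_write, s);
cpu_register_physical_memory(0x000a0000, 0x20000, io_index);

/* RAM case: the offset comes from qemu_ram_alloc(), so guest writes go
 * straight to memory with no trap (except the first write per page in
 * the dirty-tracking window mentioned above). */
ram_addr_t offset = qemu_ram_alloc(vga_ram_size);
cpu_register_physical_memory(fb_base, vga_ram_size, offset | IO_MEM_RAM);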


Ok, that finally ends in:
cpu_register_physical_memory_offset()
...
// 0.12.3
if (kvm_enabled())
kvm_set_phys_mem(start_addr, size, phys_offset);
// KVM
cpu_notify_set_memory(start_addr, size, phys_offset);
...

I/O is always done through:
cpu_register_io_memory => cpu_register_io_memory_fixed
cpu_register_io_memory_fixed()
...
No call to KVM?


kvm_set_phys_mem() is a call to kvm.


...

Where is the trap from KVM to QEMU?


See kvm_cpu_exec().
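
Roughly, kvm_cpu_exec() is qemu's vcpu run loop around the KVM_RUN 
ioctl; a hedged skeleton (the handler names are made up):

for (;;) {
    ioctl(vcpu_fd, KVM_RUN, 0);       /* blocks until the guest exits */
    switch (run->exit_reason) {
    case KVM_EXIT_IO:                 /* port I/O, e.g. the bank register */
        handle_io(run);               /* ends up in vga_ioport_write() */
        break;
    case KVM_EXIT_MMIO:               /* trapped memory, e.g. the A000 window */
        handle_mmio(run);             /* ends up in the vga memory callbacks */
        break;
    /* ... other exit reasons ... */
    }
}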

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] Re: QEMU-KVM and video performance

2010-04-22 Thread Avi Kivity

On 04/22/2010 09:04 AM, Gerhard Wiesinger wrote:

On Wed, 21 Apr 2010, Avi Kivity wrote:


On 04/21/2010 09:50 PM, Gerhard Wiesinger wrote:

I don't think changing the VGA window is a problem, because 500k-1M
changes/s are possible.


1 MB/s, 500k-1M changes/s. Coincidence?  Is it taking a page fault
or trap on every write?



To clarify:
Memory performance writing to segment A000 is about 1 MB/s.


That indicates a fault on every write (assuming 8-16 bit writes: at 1-2 
bytes per trapped write, 1 MB/s works out to roughly 0.5-1M faults/s).  
If you're using 256-color VGA and not switching banks, this indicates a 
bug.




Yes, 256-color VGA and no bank switches involved.

Calling the INT 10h set/get window function with different windows (e.g. 
toggling between window pages 0 and 1) runs at about 500k to 1M 
function calls per second.


That's surprisingly fast. I'd expect 100-200k/sec.



Sorry, I mixed up the numbers:
1.) QEMU-KVM: ~111k
2.) QEMU only: 500k-1M


Please run kvm_stat and report output for both tests to confirm.



See below. The 2nd column is the per-second statistic while running the test.


 efer_reload                  0        0
 exits                 18470836   554582
 fpu_reload             2147833     3469
 halt_exits                2083        0
 halt_wakeup               2047        0
 host_state_reload      2148186     3470
 hypercalls                   0        0
 insn_emulation         7688203   554244

This indicates that kvm is emulating instead of direct mapping: 
insn_emulation tracks exits almost one for one (~554k/s).  That's 
probably a bug.  If you fix it, performance will increase dramatically.






To get really good VGA performance, both parameters should be:
about >50 MB/s for writes to segment A000, and
~500,000 bank switches per second.


First should be doable easily, second is borderline.


I think this is very easy to distinguish:
1.) VGA segment A000 is legacy and should be handled through QEMU 
and not through KVM (because it is much faster there). 16-color modes 
should also be fast enough.
2.) All other flat PCI memory accesses should be handled through KVM 
(a specialized driver is loaded for that PCI device in the non-legacy 
OS).


Is that easily possible?


No.  Code can run in either qemu or kvm, not both.  You can switch 
between them based on access statistics (early versions of qemu-kvm 
did that, without the statistics part), but this isn't trivial.


Hmmm. OK, two different opinions about the memory write performance:
easily possible, or not possible at all?


Switching between TCG and KVM is hard, but not needed.  For 256-color 
modes, a direct map is possible and should yield good performance.  Bank 
switching can be improved perhaps 3x, but will never be fast.
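
For reference, "direct map" means backing the guest range with an 
ordinary memory slot so writes do not exit; a minimal sketch against the 
kernel ioctl API (the slot number and addresses are chosen for 
illustration):

struct kvm_userspace_memory_region region = {
    .slot            = 8,                       /* any unused slot */
    .flags           = KVM_MEM_LOG_DIRTY_PAGES, /* keep dirty info for redraw */
    .guest_phys_addr = 0x000a0000,              /* the 64k VGA window */
    .memory_size     = 0x10000,
    .userspace_addr  = (__u64)(unsigned long)framebuffer,
};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);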


--
error compiling committee.c: too many arguments to function



Re: [PATCH] KVM: Document mmu

2010-04-22 Thread Karl Vogel

Avi Kivity wrote:

+Translation
+===========
+
+The primary job of the mmu is to program the processor's mmu to translate
+addresses for the guest.  Different translations are required at different
+times:
+
+- when guest paging is disabled, we translate guest physical addresses to
+  host physical addresses (gpa->hpa)
+- when guest paging is enabled, we translate guest virtual addresses, to
+  guest virtual addresses, to host physical addresses (gva->gpa->hpa)


I think you meant 'to guest physical addresses' here.
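
That is, the intended chain is gva -> gpa via the guest page tables, 
then gpa -> hpa via kvm's memslots; a sketch using kvm's real 
gfn_to_pfn() and an invented name for the guest walk:

gpa_t gpa = walk_guest_page_tables(vcpu, gva);   /* illustrative name */
hpa_t hpa = (gfn_to_pfn(kvm, gpa >> PAGE_SHIFT) << PAGE_SHIFT)
            + (gpa & ~PAGE_MASK);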



Re: [PATCH] KVM: Document mmu

2010-04-22 Thread Avi Kivity

On 04/22/2010 10:13 AM, Karl Vogel wrote:

Avi Kivity wrote:

+Translation
+===========
+
+The primary job of the mmu is to program the processor's mmu to 
translate
+addresses for the guest.  Different translations are required at 
different

+times:
+
+- when guest paging is disabled, we translate guest physical 
addresses to

+  host physical addresses (gpa->hpa)
+- when guest paging is enabled, we translate guest virtual 
addresses, to

+  guest virtual addresses, to host physical addresses (gva->gpa->hpa)


I think you meant 'to guest physical addresses' here.



Fixed.

--
error compiling committee.c: too many arguments to function



Re: Timedrift in KVM guests after livemigration.

2010-04-22 Thread Thomas Treutner
On Sunday 18 April 2010 11:33:44 Espen Berg wrote:
> All guests are Debian lenny with the latest upstream kernel, hvm/kvm.
>
> We are using kvm-clock as guest source clock.
>
> cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> kvm-clock

I had to deactivate C1E (AMD CPUs) and use the acpi clocksource (for both 
servers and VMs, IIRC). If you can, you should give it a try. After that, 
live migration worked somewhat stably.
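
For reference, the clocksource can be checked and switched at runtime 
via sysfs (C1E itself is usually a BIOS option):

cat /sys/devices/system/clocksource/clocksource0/available_clocksource
echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource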


regards, 
thomas


Re:[RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-22 Thread xiaohui . xin
From: Xin Xiaohui 

Add a device to utilize the vhost-net backend driver for
copy-less data transfer between guest FE and host NIC.
It pins the guest user-space buffers into host memory and
provides proto_ops (sendmsg/recvmsg) to vhost-net.

Signed-off-by: Xin Xiaohui 
Signed-off-by: Zhao Yu 
Reviewed-by: Jeff Dike 
---

Michael,
Thanks. I have updated the patch with your suggestion.
It looks much cleaner now. Please have a review.

Thanks
Xiaohui

 drivers/vhost/Kconfig |   10 +
 drivers/vhost/Makefile|2 +
 drivers/vhost/mpassthru.c | 1239 +
 include/linux/mpassthru.h |   29 +
 4 files changed, 1280 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c
 create mode 100644 include/linux/mpassthru.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9f409f4..91806b1 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
  To compile this driver as a module, choose M here: the module will
  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+   tristate "mediate passthru network driver (EXPERIMENTAL)"
+   depends on VHOST_NET
+   ---help---
+ zerocopy network I/O support; we call it mediate passthru to
+ distinguish it from hardware passthru.
+
+ To compile this driver as a module, choose M here: the module will
+ be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..cc99b14
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1239 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+   u16 offset;
+   u16 size;
+};
+
+struct page_ctor {
+   struct list_headreadq;
+   int w_len;
+   int r_len;
+   spinlock_t  read_lock;
+   struct kmem_cache   *cache;
+   /* record the locked pages */
+   int lock_pages;
+   struct rlimit   o_rlim;
+   struct net_device   *dev;
+   struct mpassthru_port   port;
+};
+
+struct page_info {
+   struct list_headlist;
+   int header;
+   /* indicate the actual length of bytes
+* send/recv in the user space buffers
+*/
+   int total;
+   int offset;
+   struct page *pages[MAX_SKB_FRAGS+1];
+   struct skb_frag_struct  frag[MAX_SKB_FRAGS+1];
+   struct sk_buff  *skb;
+   struct page_ctor*ctor;
+
+   /* The pointer relayed to skb, to indicate
+* it's a user space allocated skb or kernel
+*/
+   struct skb_user_pageuser;
+   struct skb_shared_info  ushinfo;
+
+#define INFO_READ  0
+#define INFO_WRITE 1
+   unsignedflags;
+   unsignedpnum;
+
+   /* It's meaningful for receive, means
+* the max length allowed
+*/
+   size_t  len;
+
+   /* The fields after that is for backend
+* driver, now for vhost-net.
+*/
+
+   struct kiocb *iocb;
+  

RE: Re:[RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-22 Thread Xin, Xiaohui
Michael,
Sorry, it's based on the suggestion to hook an iocb completion callback
to handle the iocb list in vhost-net.

Thanks
Xiaohui

-Original Message-
From: Xin, Xiaohui
Sent: Thursday, April 22, 2010 4:24 PM
To: m...@redhat.com
Cc: a...@arndb.de; net...@vger.kernel.org; kvm@vger.kernel.org; 
linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; 
jd...@linux.intel.com; Xin, Xiaohui
Subject: Re:[RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.


Re:[RFC][PATCH v3 2/3] Provides multiple submits and asynchronous notifications.

2010-04-22 Thread xiaohui . xin
From: Xin Xiaohui 

The vhost-net backend now only supports synchronous send/recv
operations. The patch provides multiple submits and asynchronous
notifications. This is needed for the zero-copy case.

Signed-off-by: Xin Xiaohui 
---

Michael,

>Can't vhost supply a kiocb completion callback that will handle the list?

Yes, thanks. And with it I also finally removed vq->receiver.

Thanks
Xiaohui

 drivers/vhost/net.c   |  227 +++--
 drivers/vhost/vhost.c |  115 ++---
 drivers/vhost/vhost.h |   14 +++
 3 files changed, 301 insertions(+), 55 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..4a70f66 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -47,6 +49,7 @@ struct vhost_net {
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
struct vhost_poll poll[VHOST_NET_VQ_MAX];
+   struct kmem_cache   *cache;
/* Tells us whether we are polling a socket for TX.
 * We only do this when socket buffer fills up.
 * Protected by tx vq lock. */
@@ -91,11 +94,132 @@ static void tx_poll_start(struct vhost_net *net, struct 
socket *sock)
net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   unsigned long flags;
+
+   spin_lock_irqsave(&vq->notify_lock, flags);
+   if (!list_empty(&vq->notifier)) {
+   iocb = list_first_entry(&vq->notifier,
+   struct kiocb, ki_list);
+   list_del(&iocb->ki_list);
+   }
+   spin_unlock_irqrestore(&vq->notify_lock, flags);
+   return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+   struct vhost_virtqueue *vq = iocb->private;
+   unsigned long flags;
+
+spin_lock_irqsave(&vq->notify_lock, flags);
+list_add_tail(&iocb->ki_list, &vq->notifier);
+spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+struct vhost_virtqueue *vq,
+struct socket *sock)
+{
+   struct kiocb *iocb = NULL;
+   struct vhost_log *vq_log = NULL;
+   int rx_total_len = 0;
+   unsigned int head, log, in, out;
+   int size;
+
+   if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+   return;
+
+   if (sock->sk->sk_data_ready)
+   sock->sk->sk_data_ready(sock->sk, 0);
+
+   vq_log = unlikely(vhost_has_feature(
+   &net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+   while ((iocb = notify_dequeue(vq)) != NULL) {
+   vhost_add_used_and_signal(&net->dev, vq,
+   iocb->ki_pos, iocb->ki_nbytes);
+   log = (int)(iocb->ki_user_data >> 32);
+   size = iocb->ki_nbytes;
+   head = iocb->ki_pos;
+   rx_total_len += iocb->ki_nbytes;
+
+   if (iocb->ki_dtor)
+   iocb->ki_dtor(iocb);
+   kmem_cache_free(net->cache, iocb);
+
+   /* when log is enabled, recomputing the log info is needed,
+* since these buffers are in async queue, and may not get
+* the log info before.
+*/
+   if (unlikely(vq_log)) {
+   if (!log)
+   __vhost_get_vq_desc(&net->dev, vq, vq->iov,
+   ARRAY_SIZE(vq->iov),
+   &out, &in, vq_log,
+   &log, head);
+   vhost_log_write(vq, vq_log, log, size);
+   }
+   if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+   vhost_poll_queue(&vq->poll);
+   break;
+   }
+   }
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+   struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   int tx_total_len = 0;
+
+   if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+   return;
+
+   while ((iocb = notify_dequeue(vq)) != NULL) {
+   vhost_add_used_and_signal(&net->dev, vq,
+   iocb->ki_pos, 0);
+   tx_total_len += iocb->ki_nbytes;
+
+   if (iocb->ki_dtor)
+   iocb->ki_dtor(iocb);
+
+   kmem_cache_free(net->cache, iocb);
+   if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+   vhost_poll_queue(&vq->poll);
+   break;
+   }
+   }
+}
+
+static struct kiocb *

Re: [Qemu-devel] [PATCH][STABLE] block: Free iovec arrays allocated by multiwrite_merge()

2010-04-22 Thread Kevin Wolf
On 21.04.2010 21:35, Stefan Hajnoczi wrote:
> A new iovec array is allocated when creating a merged write request.
> This patch ensures that the iovec array is deleted in addition to its
> qiov owner.
> 
> Reported-by: Leszek Urbanski 
> Signed-off-by: Stefan Hajnoczi 

Acked-by: Kevin Wolf 

Picked it up for my block branch, too, but I think this should be
committed immediately by Aurelien.

Kevin


Re: [PATCH RFC v2 3/6] KVM: introduce a wrapper function to copy dirty bitmaps to user space

2010-04-22 Thread Takuya Yoshikawa

(2010/04/21 20:12), Avi Kivity wrote:

On 04/20/2010 01:59 PM, Takuya Yoshikawa wrote:

We will replace copy_to_user() with copy_in_user() when we move
the dirty bitmaps to user space.

But sadly, we have copy_in_user() only for 64-bit architectures.
So this function works as a wrapper to hide the ifdefs from the outside.
Once we get copy_in_user() for 32-bit architectures, we can remove
this wrapper and use copy_in_user() directly.


I prefer a generic copy_in_user() instead of having multiple paths in kvm.



I do too :).

I checked ppc64's copy_in_user().
It looks like it just uses __copy_tofrom_user(), which is also implemented
for ppc32, IIUC.

So yes, we may not need to implement it ourselves!

Just ask them to make it usable for 32-bit too.



And about the x86_32 copy_in_user():

They use copy_user_generic(), which is implemented only for 64-bit.

So I'll show them the current copy_from_user()/copy_to_user() version and
see their response.

If it is rejected, though I hope not, please give me another option, OK?
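
For reference, a minimal sketch of what such a wrapper could look like 
(the kvm_copy_dirty_bitmap() name is from this patch set; the 32-bit 
bounce-buffer fallback shown here is only illustrative):

static int kvm_copy_dirty_bitmap(void __user *to, const void __user *from,
                                 unsigned long bytes)
{
#ifdef CONFIG_64BIT
        return copy_in_user(to, from, bytes) ? -EFAULT : 0;
#else
        /* no copy_in_user() on 32-bit yet: bounce through the kernel */
        int r = -EFAULT;
        void *buf = kmalloc(bytes, GFP_KERNEL);
        if (!buf)
                return -ENOMEM;
        if (!copy_from_user(buf, from, bytes) &&
            !copy_to_user(to, buf, bytes))
                r = 0;
        kfree(buf);
        return r;
#endif
}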



Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Dor Laor

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Hi all,

We have been implementing the prototype of Kemari for KVM, and we're sending
this message to share what we have now and TODO lists.  Hopefully, we would like
to get early feedback to keep us in the right direction.  Although advanced
approaches in the TODO lists are fascinating, we would like to run this project
step by step while absorbing comments from the community.  The current code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control logic is
implemented in QEMU.  However, we needed a hack in KVM to prevent rip from
proceeding before synchronizing VMs.  It may also need some plumbing in the
kernel side to guarantee replayability of certain events and instructions,
integrate the RAS capabilities of newer x86 hardware with the HA stack, as well
as for optimization purposes, for example.


[snip]



The rest of this message describes TODO lists grouped by each topic.

=== event tapping ===

Event tapping is the core component of Kemari, and it decides on which event the
primary should synchronize with the secondary.  The basic assumption here is
that outgoing I/O operations are idempotent, which is usually true for disk I/O
and reliable network protocols such as TCP.


IMO any type of network event should be stalled too. What if the VM runs 
a non-TCP protocol, and a packet the master node sent reaches some 
remote client before the sync to the slave, and then the master fails?


[snip]



=== clock ===

Since synchronizing the virtual machines every time the TSC is accessed would be
prohibitive, the transmission of the TSC will be done lazily, which means
delaying it until a non-TSC synchronization point arrives.


Why do you specifically care about TSC sync? When you sync the whole 
I/O model on a snapshot, it also synchronizes the TSC.


In general, can you please explain the 'algorithm' for continuous 
snapshots (is that what you'd like to do?):

A trivial one would be to (see the sketch after this list):
 - do X online snapshots/sec
 - Stall all IO (disk/block) from the guest to the outside world
   until the previous snapshot reaches the slave.
 - Snapshots are made of
   - diff of dirty pages from the last snapshot
   - Qemu device model (+kvm's) diff from the last.
You can do 'light' snapshots in between to send dirty pages and reduce 
snapshot time.
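
A hedged sketch of that trivial loop (all names are invented for 
illustration):

for (;;) {
    stall_outgoing_io();               /* buffer disk/net output */
    snapshot_t s = {
        .dirty_pages  = dirty_page_diff_since_last(),
        .device_state = device_model_diff_since_last(), /* qemu + kvm */
    };
    send_to_slave(&s);
    wait_for_slave_ack();              /* snapshot reached the slave */
    release_outgoing_io();             /* output may now leave the node */
    wait_for_next_tick();              /* X snapshots/sec */
}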


I wrote the above to serve as a reference for your comments so it will 
map into my mind. Thanks, dor




TODO:
  - Synchronization of clock sources (need to intercept TSC reads, etc).

=== usability ===

These are items that defines how users interact with Kemari.

TODO:
  - Kemarid daemon that takes care of the cluster management/monitoring
side of things.
  - Some device emulators might need minor modifications to work well
with Kemari.  Use white(black)-listing to take the burden of
choosing the right device model off the users.

=== optimizations ===

Although the big picture can be realized by completing the TODO list above, we
need some optimizations/enhancements to make Kemari useful in real world, and
these are items what needs to be done for that.

TODO:
  - SMP (for the sake of performance might need to implement a
synchronization protocol that can maintain two or more
synchronization points active at any given moment)
  - VGA (leverage VNC's sub-tiling mechanism to identify fb pages that
are really dirty).


Any comments/suggestions would be greatly appreciated.

Thanks,

Yoshi

--

Kemari starts synchronizing VMs when QEMU handles I/O requests.
Without this patch the VCPU state has already advanced before
synchronization, and after failover to the VM on the receiver, it
hangs because of this.

Signed-off-by: Yoshiaki Tamura
---
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/svm.c  |   11 ---
  arch/x86/kvm/vmx.c  |   11 ---
  arch/x86/kvm/x86.c  |4 
  4 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26c629a..7b8f514 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -227,6 +227,7 @@ struct kvm_pio_request {
int in;
int port;
int size;
+   bool lazy_skip;
  };

  /*
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d04c7ad..e373245 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
  {
struct kvm_vcpu *vcpu =&svm->vcpu;
u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
-   int size, in, string;
+   int size, in, string, ret;
unsigned port;

++svm->vcpu.stat.io_exits;
@@ -1507,9 +1507,14 @@ stat

RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.

2010-04-22 Thread Xin, Xiaohui
Michael,

>Yes, I think this packet split mode probably maps well to mergeable buffer
>support. Note that
>1. Not all devices support large packets in this way, others might map
>   to indirect buffers better

Are the indirect buffers meant to deal with skb->frag_list?

>   So we have to figure out how migration is going to work
Yes, different guest virtio-net drivers may contain different features.
Does qemu migration work with different features supported by the
virtio-net driver now?

>2. It's up to guest driver whether to enable features such as
>   mergeable buffers and indirect buffers
>   So we have to figure out how to notify guest which mode
>   is optimal for a given device
Yes. When a device is bound, the mp device may query the capabilities
from the driver.
Actually, there is a structure in the mp device that can do this; we can
add some fields to support more.

>3. We don't want to depend on jumbo frames for decent performance
>   So we probably should support GSO/GRO
GSO is for the tx side, right? I think the driver can handle it by itself.
For GRO, I'm not sure whether it's easy or not. Basically, the mp device we
now support is doing what a raw socket does. The packets do not go to the
host stack.
-- 
MST


Re: [PATCH RFC v2 5/6] KVM: moving dirty bitmaps to user space

2010-04-22 Thread Takuya Yoshikawa

(2010/04/21 20:26), Avi Kivity wrote:


r = 0;
@@ -1858,7 +1866,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
if (memslot->is_dirty) {
kvm_flush_remote_tlbs(kvm);
n = kvm_dirty_bitmap_bytes(memslot);
- memset(memslot->dirty_bitmap, 0, n);
+ clear_user(memslot->dirty_bitmap, n);
memslot->is_dirty = false;


Does this need an error check?


Yes, sorry I forgot.
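
Presumably just propagating the failure, along these lines (the exact 
error path depends on the caller):

n = kvm_dirty_bitmap_bytes(memslot);
if (clear_user(memslot->dirty_bitmap, n))
        return -EFAULT;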





@@ -468,8 +480,12 @@ void kvm_free_physmem(struct kvm *kvm)
int i;
struct kvm_memslots *slots = kvm->memslots;

- for (i = 0; i< slots->nmemslots; ++i)
+ for (i = 0; i< slots->nmemslots; ++i) {
+ /* We don't munmap dirty bitmaps by ourselves. */


Why not? If we allocated them, we have to free them.


In the case of VM destruction, when the qemu process exits, it conflicts
with the usual (process) destructor's job.

I struggled with this problem before sending the first version.

So I have to differentiate slot release from qemu process exit.






+ slots->memslots[i].dirty_bitmap = NULL;
+ slots->memslots[i].dirty_bitmap_old = NULL;
kvm_free_physmem_slot(&slots->memslots[i], NULL);
+ }


+/*
+ * Please use generic *_user bitops once they become available.
+ * Be careful setting the bit won't be done atomically.
+ */


Please introduce the user bitops as part of this patchset.



OK, I will do that in the next version. In this RFC, I would be happy to
know whether the overall design is right or not.



Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.

2010-04-22 Thread Michael S. Tsirkin
On Thu, Apr 22, 2010 at 04:57:56PM +0800, Xin, Xiaohui wrote:
> Michael,
> 
> >Yes, I think this packet split mode probably maps well to mergeable buffer
> >support. Note that
> >1. Not all devices support large packets in this way, others might map
> >   to indirect buffers better
> 
> Are the indirect buffers meant to deal with skb->frag_list?

We currently use skb->frags.

> >   So we have to figure out how migration is going to work
> Yes, different guest virtio-net drivers may contain different features.
> Does qemu migration work with different features supported by the
> virtio-net driver now?

For now, you must have identical feature-sets for migration to work.
And long as we manage the buffers in software, we can always make
features match.

> >2. It's up to guest driver whether to enable features such as
> >   mergeable buffers and indirect buffers
> >   So we have to figure out how to notify guest which mode
> >   is optimal for a given device
> Yes. When a device is bound, the mp device may query the capabilities
> from the driver.
> Actually, there is a structure in the mp device that can do this; we can
> add some fields to support more.
> 
> >3. We don't want to depend on jumbo frames for decent performance
> >   So we probably should support GSO/GRO
> GSO is for the tx side, right? I think the driver can handle it by itself.
> For GRO, I'm not sure whether it's easy or not. Basically, the mp device we
> now support is doing what a raw socket does. The packets do not go to the
> host stack.

See commit bfd5f4a3d605e0f6054df0b59fe0907ff7e696d3
(it doesn't currently work with vhost net, but that's
 a separate story).

> -- 
> MST


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-22 Thread Takuya Yoshikawa

Thanks, now I know the basic rules about kvm/api.

(2010/04/21 20:46), Avi Kivity wrote:




+Type: vm ioctl
+Parameters: struct kvm_dirty_log (in/out)
+Returns: 0 on success, -1 on error
+
+/* for KVM_SWITCH_DIRTY_LOG */
+struct kvm_dirty_log {
+ __u32 slot;
+ __u32 padding;


Please put a flags field here (and verify it is zero in the ioctl
implementation). This allows us to extend it later.


+ union {
+ void __user *dirty_bitmap; /* one bit per page */
+ __u64 addr;
+ };


Just make dirty_bitmap a __u64. With the union there is the risk that
someone forgets to zero the structure so we get some random bits in the
pointer, and also issues with big endian and s390 compat.

Reserve some extra space here for future expansion.

Hm. I see that this is the existing kvm_dirty_log structure. So we can't
play with it, ignore my comments about it.


So, introducing a new structure to export (and get) two bitmap addresses
with a flag is the thing?

1. Qemu calls an ioctl to get the two addresses.
2. Qemu calls an ioctl to make KVM switch the dirty bitmaps.
   --> in this case, qemu has to swap the addresses it got in step 1.
   ...
3. Qemu calls an ioctl (we can reuse 1. with a different command flag) to 
free the bitmaps.


In my plan, we don't let qemu malloc() the dirty bitmaps: at least, I want 
to make that another patch set, because this patch set already has some 
dependencies on other issues.

But, yes, I can make the struct with future expansion in mind if needed.
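
A sketch of what such a structure could look like, following the 
comments above (the name and the reserved sizing are only illustrative):

/* for a new ioctl; not the existing struct kvm_dirty_log */
struct kvm_dirty_log_ex {
        __u32 slot;
        __u32 flags;            /* must be zero; checked in the ioctl */
        __u64 dirty_bitmap;     /* user pointer as __u64, no union */
        __u64 dirty_bitmap_old;
        __u64 reserved[6];      /* room for future expansion */
};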




- r = kvm_copy_dirty_bitmap(log->dirty_bitmap, dirty_bitmap_old, n);
+ if (need_copy) {
+ r = kvm_copy_dirty_bitmap(log->dirty_bitmap,
+ dirty_bitmap_old, n);
+ } else {
+ log->addr = (unsigned long)dirty_bitmap_old;
+ r = 0;
+ }


Instead of passing need_copy everywhere, you might check a flag in
log->. You'll need a flag to know whether to munmap() or not, no?


To judge munmap() or not, putting it in the memory slot's flags may be
more useful.

But as you suggest, we won't use kvm_dirty_log, so the parameters will 
change.



[PATCH 1/3] KVM MMU: make kvm_mmu_zap_page() return the number of zapped sp in total.

2010-04-22 Thread Gui Jianfeng
Currently, in kvm_mmu_change_mmu_pages(kvm, page), "used_pages--" is
performed after calling kvm_mmu_zap_page() regardless of whether "page"
was actually reclaimed, but a root sp won't be reclaimed by
kvm_mmu_zap_page(). So making kvm_mmu_zap_page() return the total number
of reclaimed sps makes more sense. A new out-parameter is added to
kvm_mmu_zap_page() to indicate whether the top page was reclaimed.

Signed-off-by: Gui Jianfeng 
---
 arch/x86/kvm/mmu.c |   62 +--
 1 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7a17db1..3fb18dd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1195,12 +1195,15 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, 
struct kvm_mmu_page *sp)
--kvm->stat.mmu_unsync;
 }
 
-static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+   int *self_deleted);
 
 static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
+   int self_deleted;
+
if (sp->role.cr4_pae != !!is_pae(vcpu)) {
-   kvm_mmu_zap_page(vcpu->kvm, sp);
+   kvm_mmu_zap_page(vcpu->kvm, sp, &self_deleted);
return 1;
}
 
@@ -1209,7 +1212,7 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp)
kvm_flush_remote_tlbs(vcpu->kvm);
kvm_unlink_unsync_page(vcpu->kvm, sp);
if (vcpu->arch.mmu.sync_page(vcpu, sp)) {
-   kvm_mmu_zap_page(vcpu->kvm, sp);
+   kvm_mmu_zap_page(vcpu->kvm, sp, &self_deleted);
return 1;
}
 
@@ -1471,6 +1474,7 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
int i, zapped = 0;
struct mmu_page_path parents;
struct kvm_mmu_pages pages;
+   int self_deleted;
 
if (parent->role.level == PT_PAGE_TABLE_LEVEL)
return 0;
@@ -1480,7 +1484,7 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
struct kvm_mmu_page *sp;
 
for_each_sp(pages, sp, parents, i) {
-   kvm_mmu_zap_page(kvm, sp);
+   kvm_mmu_zap_page(kvm, sp, &self_deleted);
mmu_pages_clear_parents(&parents);
zapped++;
}
@@ -1490,10 +1494,12 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
return zapped;
 }
 
-static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+   int *self_deleted)
 {
int ret;
 
+   *self_deleted = 0;
trace_kvm_mmu_zap_page(sp);
++kvm->stat.mmu_shadow_zapped;
ret = mmu_zap_unsync_children(kvm, sp);
@@ -1507,11 +1513,15 @@ static int kvm_mmu_zap_page(struct kvm *kvm, struct 
kvm_mmu_page *sp)
if (!sp->root_count) {
hlist_del(&sp->hash_link);
kvm_mmu_free_page(kvm, sp);
+   /* Count self */
+   ret++;
+   *self_deleted = 1;
} else {
sp->role.invalid = 1;
list_move(&sp->link, &kvm->arch.active_mmu_pages);
kvm_reload_remote_mmus(kvm);
}
+
kvm_mmu_reset_last_pte_updated(kvm);
return ret;
 }
@@ -1523,6 +1533,7 @@ static int kvm_mmu_zap_page(struct kvm *kvm, struct 
kvm_mmu_page *sp)
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages)
 {
int used_pages;
+   int self_deleted;
 
used_pages = kvm->arch.n_alloc_mmu_pages - kvm->arch.n_free_mmu_pages;
used_pages = max(0, used_pages);
@@ -1540,8 +1551,8 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned 
int kvm_nr_mmu_pages)
 
page = container_of(kvm->arch.active_mmu_pages.prev,
struct kvm_mmu_page, link);
-   used_pages -= kvm_mmu_zap_page(kvm, page);
-   used_pages--;
+   used_pages -= kvm_mmu_zap_page(kvm, page,
+  &self_deleted);
}
kvm_nr_mmu_pages = used_pages;
kvm->arch.n_free_mmu_pages = 0;
@@ -1560,6 +1571,8 @@ static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t 
gfn)
struct kvm_mmu_page *sp;
struct hlist_node *node, *n;
int r;
+   int self_deleted;
+   int ret;
 
pgprintk("%s: looking for gfn %lx\n", __func__, gfn);
r = 0;
@@ -1571,7 +1584,8 @@ restart:
pgprintk("%s: gfn %lx role %x\n", __func__, gfn,
 sp->role.word);
r = 1;
-   if (kvm_mmu_zap_page(kvm, sp))
+   ret = kvm_mmu_zap_page(kvm, sp, &self_deleted);
+ 

[PATCH 2/3] KVM MMU: fix sp->unsync type error in trace event definition.

2010-04-22 Thread Gui Jianfeng
sp->unsync is bool now, so update trace event declaration.

Signed-off-by: Gui Jianfeng 
---
 arch/x86/kvm/mmutrace.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 3851f1f..9966e80 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -11,7 +11,7 @@
__field(__u64, gfn) \
__field(__u32, role) \
__field(__u32, root_count) \
-   __field(__u32, unsync)
+   __field(bool, unsync)
 
 #define KVM_MMU_PAGE_ASSIGN(sp) \
__entry->gfn = sp->gfn;  \
-- 
1.6.5.2




[PATCH 3/3] KVM MMU: Take sp level into account when calculating quadran

2010-04-22 Thread Gui Jianfeng
Take the sp level into account when calculating the quadrant, because the
quadrant is only needed when level == PT_PAGE_TABLE_LEVEL.

Signed-off-by: Gui Jianfeng 
---
 arch/x86/kvm/mmu.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 640b82d..2a35a65 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1324,7 +1324,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu,
if (role.direct)
role.cr4_pae = 0;
role.access = access;
-   if (vcpu->arch.mmu.root_level <= PT32_ROOT_LEVEL) {
+   if (vcpu->arch.mmu.root_level <= PT32_ROOT_LEVEL &&
+   level == PT_PAGE_TABLE_LEVEL) {
quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
role.quadrant = quadrant;
-- 
1.6.5.2




Re: [RFC][PATCH v3 2/3] Provides multiple submits and asynchronous notifications.

2010-04-22 Thread Michael S. Tsirkin
On Thu, Apr 22, 2010 at 04:37:16PM +0800, xiaohui@intel.com wrote:
> From: Xin Xiaohui 
> 
> The vhost-net backend now only supports synchronous send/recv
> operations. The patch provides multiple submits and asynchronous
> notifications. This is needed for the zero-copy case.
> 
> Signed-off-by: Xin Xiaohui 
> ---
> 
> Michael,
> 
> >Can't vhost supply a kiocb completion callback that will handle the list?
> 
> Yes, thanks. And with it I also finally removed vq->receiver.
> 
> Thanks
> Xiaohui

Nice progress. I commented on some minor issues below.
Thanks!

>  drivers/vhost/net.c   |  227 
> +++--
>  drivers/vhost/vhost.c |  115 ++---
>  drivers/vhost/vhost.h |   14 +++
>  3 files changed, 301 insertions(+), 55 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 22d5fef..4a70f66 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -17,11 +17,13 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
> @@ -47,6 +49,7 @@ struct vhost_net {
>   struct vhost_dev dev;
>   struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
>   struct vhost_poll poll[VHOST_NET_VQ_MAX];
> + struct kmem_cache   *cache;
>   /* Tells us whether we are polling a socket for TX.
>* We only do this when socket buffer fills up.
>* Protected by tx vq lock. */
> @@ -91,11 +94,132 @@ static void tx_poll_start(struct vhost_net *net, struct 
> socket *sock)
>   net->tx_poll_state = VHOST_NET_POLL_STARTED;
>  }
>  
> +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> +{
> + struct kiocb *iocb = NULL;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&vq->notify_lock, flags);
> + if (!list_empty(&vq->notifier)) {
> + iocb = list_first_entry(&vq->notifier,
> + struct kiocb, ki_list);
> + list_del(&iocb->ki_list);
> + }
> + spin_unlock_irqrestore(&vq->notify_lock, flags);
> + return iocb;
> +}
> +
> +static void handle_iocb(struct kiocb *iocb)
> +{
> + struct vhost_virtqueue *vq = iocb->private;
> + unsigned long flags;
> +
> +spin_lock_irqsave(&vq->notify_lock, flags);
> +list_add_tail(&iocb->ki_list, &vq->notifier);
> +spin_unlock_irqrestore(&vq->notify_lock, flags);
> +}
> +

checkpatch.pl does not complain about the above?

> +static void handle_async_rx_events_notify(struct vhost_net *net,
> +  struct vhost_virtqueue *vq,
> +  struct socket *sock)

continuation lines should start to the right of (.

> +{
> + struct kiocb *iocb = NULL;
> + struct vhost_log *vq_log = NULL;
> + int rx_total_len = 0;
> + unsigned int head, log, in, out;
> + int size;
> +
> + if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> + return;
> +
> + if (sock->sk->sk_data_ready)
> + sock->sk->sk_data_ready(sock->sk, 0);
> +
> + vq_log = unlikely(vhost_has_feature(
> + &net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;

split the above line at ?, continuation being to the left of ( looks
ugly.

> + while ((iocb = notify_dequeue(vq)) != NULL) {
> + vhost_add_used_and_signal(&net->dev, vq,
> + iocb->ki_pos, iocb->ki_nbytes);
> + log = (int)(iocb->ki_user_data >> 32);

how about we always do the recompute step, and not encode
the log bit in ki_user_data?

> + size = iocb->ki_nbytes;
> + head = iocb->ki_pos;
> + rx_total_len += iocb->ki_nbytes;
> +
> + if (iocb->ki_dtor)
> + iocb->ki_dtor(iocb);
> + kmem_cache_free(net->cache, iocb);
> +
> + /* when log is enabled, recomputing the log info is needed,
> +  * since these buffers are in async queue, and may not get
> +  * the log info before.
> +  */
> + if (unlikely(vq_log)) {
> + if (!log)
> + __vhost_get_vq_desc(&net->dev, vq, vq->iov,
> + ARRAY_SIZE(vq->iov),
> + &out, &in, vq_log,
> + &log, head);
> + vhost_log_write(vq, vq_log, log, size);
> + }
> + if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> + vhost_poll_queue(&vq->poll);
> + break;
> + }
> + }
> +}
> +
> +static void handle_async_tx_events_notify(struct vhost_net *net,
> + struct vhost_virtqueue *vq)
> +{
> + struct kiocb *iocb = NULL;
> + int tx_total_len = 0;
> +
> + if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> + return;
> +
> + 

32-bit color graphic on KVM virtual machines

2010-04-22 Thread shacky
Hi.
Is it possible to have 32-bit color graphics on KVM virtual machines?
I installed a Windows virtual machine, but it only allows me to 
configure a 24-bit color display, and it does not have any display 
driver installed.

Is there a way to solve this problem?

Thank you very much!
Bye.


RE: [PATCH V5 1/3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-22 Thread Liu Yu-B13201
 

> -Original Message-
> From: kvm-ow...@vger.kernel.org 
> [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Zhang, Yanmin
> Sent: Monday, April 19, 2010 1:33 PM
> To: Avi Kivity
> Cc: Ingo Molnar; Peter Zijlstra; Avi Kivity; Sheng Yang; 
> linux-ker...@vger.kernel.org; kvm@vger.kernel.org; Marcelo 
> Tosatti; oerg Roedel; Jes Sorensen; Gleb Natapov; Zachary 
> Amsden; zhiteng.hu...@intel.com; tim.c.c...@intel.com; 
> Arnaldo Carvalho de Melo
> Subject: [PATCH V5 1/3] perf & kvm: Enhance perf to collect 
> KVM guest os statistics from host side
> 
> Below patch introduces perf_guest_info_callbacks and related 
> register/unregister
> functions. Add more PERF_RECORD_MISC_XXX bits meaning guest 
> kernel and guest user
> space.
> 
> Signed-off-by: Zhang Yanmin 
> 
> ---

> diff -Nraup --exclude-from=exclude.diff 
> linux-2.6_tip0417/include/linux/perf_event.h 
> linux-2.6_tip0417_perfkvm/include/linux/perf_event.h
> --- linux-2.6_tip0417/include/linux/perf_event.h  
> 2010-04-19 09:51:59.544791000 +0800
> +++ linux-2.6_tip0417_perfkvm/include/linux/perf_event.h  
> 2010-04-19 09:53:59.691378953 +0800

> @@ -932,6 +940,12 @@ static inline void perf_event_mmap(struc
>   __perf_event_mmap(vma);
>  }
>  
> +extern struct perf_guest_info_callbacks *perf_guest_cbs;
> +extern int perf_register_guest_info_callbacks(
> + struct perf_guest_info_callbacks *);
> +extern int perf_unregister_guest_info_callbacks(
> + struct perf_guest_info_callbacks *);
> +
>  extern void perf_event_comm(struct task_struct *tsk);
>  extern void perf_event_fork(struct task_struct *tsk);
>  
> @@ -1001,6 +1015,11 @@ perf_sw_event(u32 event_id, u64 nr, int 
>  static inline void
>  perf_bp_event(struct perf_event *event, void *data)  
>   { }
>  
> +static inline int perf_register_guest_info_callbacks
> +(struct perf_guest_info_callbacks *) {return 0; }
> +static inline int perf_unregister_guest_info_callbacks
> +(struct perf_guest_info_callbacks *) {return 0; }
> +
>  static inline void perf_event_mmap(struct vm_area_struct 
> *vma) { }
>  static inline void perf_event_comm(struct task_struct *tsk)  
>   { }
>  static inline void perf_event_fork(struct task_struct *tsk)  
>   { }


Hi,

I met this error when building the kernel. Anything wrong?

  CC  init/main.o
In file included from include/linux/ftrace_event.h:8,
 from include/trace/syscall.h:6,
 from include/linux/syscalls.h:75,
 from init/main.c:16:
include/linux/perf_event.h: In function 'perf_register_guest_info_callbacks':
include/linux/perf_event.h:1019: error: parameter name omitted
include/linux/perf_event.h: In function 'perf_unregister_guest_info_callbacks':
include/linux/perf_event.h:1021: error: parameter name omitted
make[1]: *** [init/main.o] Error 1
make: *** [init] Error 2
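
(Unlike C++, C requires parameter names in function definitions, so the 
inline stubs need names; presumably something like:)

static inline int perf_register_guest_info_callbacks
(struct perf_guest_info_callbacks *callbacks)                { return 0; }
static inline int perf_unregister_guest_info_callbacks
(struct perf_guest_info_callbacks *callbacks)                { return 0; }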



[PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-22 Thread Joerg Roedel
The patch introducing nested NMI handling had a bug. The
check does not belong in enable_nmi_window() but in
nmi_allowed(). This patch fixes that.

Signed-off-by: Joerg Roedel 
---
 arch/x86/kvm/svm.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ab78eb8..ec20584 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2771,8 +2771,12 @@ static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
struct vmcb *vmcb = svm->vmcb;
-   return !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
-   !(svm->vcpu.arch.hflags & HF_NMI_MASK);
+   int ret;
+   ret = !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
+ !(svm->vcpu.arch.hflags & HF_NMI_MASK);
+   ret = ret && gif_set(svm) && nested_svm_nmi(svm);
+
+   return ret;
 }
 
 static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu)
@@ -2841,11 +2845,9 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu)
 * Something prevents NMI from been injected. Single step over possible
 * problem (IRET or exception injection or interrupt shadow)
 */
-   if (gif_set(svm) && nested_svm_nmi(svm)) {
-   svm->nmi_singlestep = true;
-   svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
-   update_db_intercept(vcpu);
-   }
+   svm->nmi_singlestep = true;
+   svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
+   update_db_intercept(vcpu);
 }
 
 static int svm_set_tss_addr(struct kvm *kvm, unsigned int addr)
-- 
1.7.0.4




[PATCH 4/8] KVM: SVM: Propagate nested entry failure into guest hypervisor

2010-04-22 Thread Joerg Roedel
This patch implements propagation of a failed guest vmrun
back into the guest instead of killing the whole guest.

Signed-off-by: Joerg Roedel 
---
 arch/x86/kvm/svm.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 5ad9d80..b10d163 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1715,6 +1715,10 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
vmexit = NESTED_EXIT_DONE;
break;
}
+   case SVM_EXIT_ERR: {
+   vmexit = NESTED_EXIT_DONE;
+   break;
+   }
default: {
u64 exit_bits = 1ULL << (exit_code - SVM_EXIT_INTR);
if (svm->nested.intercept & exit_bits)
-- 
1.7.0.4




[PATCH 3/8] KVM: SVM: Sync cr0 and cr3 to kvm state before nested handling

2010-04-22 Thread Joerg Roedel
This patch syncs cr0 and cr3 from the vmcb to the kvm state
before nested intercept handling is done. This allows to
simplify the vmexit path.

Signed-off-by: Joerg Roedel 
---
 arch/x86/kvm/svm.c |   15 ++-
 1 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index c480d7f..5ad9d80 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1799,10 +1799,7 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
nested_vmcb->save.gdtr   = vmcb->save.gdtr;
nested_vmcb->save.idtr   = vmcb->save.idtr;
nested_vmcb->save.cr0= kvm_read_cr0(&svm->vcpu);
-   if (npt_enabled)
-   nested_vmcb->save.cr3= vmcb->save.cr3;
-   else
-   nested_vmcb->save.cr3= svm->vcpu.arch.cr3;
+   nested_vmcb->save.cr3= svm->vcpu.arch.cr3;
nested_vmcb->save.cr2= vmcb->save.cr2;
nested_vmcb->save.cr4= svm->vcpu.arch.cr4;
nested_vmcb->save.rflags = vmcb->save.rflags;
@@ -2641,6 +2638,11 @@ static int handle_exit(struct kvm_vcpu *vcpu)
 
trace_kvm_exit(exit_code, vcpu);
 
+   if (!(svm->vmcb->control.intercept_cr_write & INTERCEPT_CR0_MASK))
+   vcpu->arch.cr0 = svm->vmcb->save.cr0;
+   if (npt_enabled)
+   vcpu->arch.cr3 = svm->vmcb->save.cr3;
+
if (unlikely(svm->nested.exit_required)) {
nested_svm_vmexit(svm);
svm->nested.exit_required = false;
@@ -2668,11 +2670,6 @@ static int handle_exit(struct kvm_vcpu *vcpu)
 
svm_complete_interrupts(svm);
 
-   if (!(svm->vmcb->control.intercept_cr_write & INTERCEPT_CR0_MASK))
-   vcpu->arch.cr0 = svm->vmcb->save.cr0;
-   if (npt_enabled)
-   vcpu->arch.cr3 = svm->vmcb->save.cr3;
-
if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) {
kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
kvm_run->fail_entry.hardware_entry_failure_reason
-- 
1.7.0.4




[PATCH 7/8] KVM: x86: Allow marking an exception as reinjected

2010-04-22 Thread Joerg Roedel
This patch adds logic to kvm/x86 which allows marking an
injected exception as reinjected. This allows us to remove an
ugly hack from svm_complete_interrupts() that prevented
exceptions from being reinjected at all in the nested case.
The hack was necessary because a reinjected exception into
the nested guest could cause a nested vmexit emulation, but
reinjected exceptions must not intercept. The downside of
the hack is that an injected exception could get lost.
This patch fixes the problem and puts the code for it into
generic x86 files, because Nested-VMX will likely have the
same problem and could reuse the code.

Signed-off-by: Joerg Roedel 
---
 arch/x86/include/asm/kvm_host.h |6 +-
 arch/x86/kvm/svm.c  |   12 ++--
 arch/x86/kvm/vmx.c  |3 ++-
 arch/x86/kvm/x86.c  |   23 +++
 4 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 357573a..3f0007b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -312,6 +312,7 @@ struct kvm_vcpu_arch {
struct kvm_queued_exception {
bool pending;
bool has_error_code;
+   bool reinject;
u8 nr;
u32 error_code;
} exception;
@@ -514,7 +515,8 @@ struct kvm_x86_ops {
void (*set_irq)(struct kvm_vcpu *vcpu);
void (*set_nmi)(struct kvm_vcpu *vcpu);
void (*queue_exception)(struct kvm_vcpu *vcpu, unsigned nr,
-   bool has_error_code, u32 error_code);
+   bool has_error_code, u32 error_code,
+   bool reinject);
int (*interrupt_allowed)(struct kvm_vcpu *vcpu);
int (*nmi_allowed)(struct kvm_vcpu *vcpu);
bool (*get_nmi_mask)(struct kvm_vcpu *vcpu);
@@ -617,6 +619,8 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long 
rflags);
 
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
+void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
+void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 
error_code);
 void kvm_inject_page_fault(struct kvm_vcpu *vcpu, unsigned long cr2,
   u32 error_code);
 bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 65fc114..30e49fe 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -338,7 +338,8 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
 }
 
 static void svm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
-   bool has_error_code, u32 error_code)
+   bool has_error_code, u32 error_code,
+   bool reinject)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -346,7 +347,8 @@ static void svm_queue_exception(struct kvm_vcpu *vcpu, 
unsigned nr,
 * If we are within a nested VM we'd better #VMEXIT and let the guest
 * handle the exception
 */
-   if (nested_svm_check_exception(svm, nr, has_error_code, error_code))
+   if (!reinject &&
+   nested_svm_check_exception(svm, nr, has_error_code, error_code))
return;
 
if (nr == BP_VECTOR && !svm_has(SVM_FEATURE_NRIP)) {
@@ -2918,8 +2920,6 @@ static void svm_complete_interrupts(struct vcpu_svm *svm)
svm->vcpu.arch.nmi_injected = true;
break;
case SVM_EXITINTINFO_TYPE_EXEPT:
-   if (is_nested(svm))
-   break;
/*
 * In case of software exceptions, do not reinject the vector,
 * but re-execute the instruction instead. Rewind RIP first
@@ -2935,10 +2935,10 @@ static void svm_complete_interrupts(struct vcpu_svm *svm)
}
if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) {
u32 err = svm->vmcb->control.exit_int_info_err;
-   kvm_queue_exception_e(&svm->vcpu, vector, err);
+   kvm_requeue_exception_e(&svm->vcpu, vector, err);
 
} else
-   kvm_queue_exception(&svm->vcpu, vector);
+   kvm_requeue_exception(&svm->vcpu, vector);
break;
case SVM_EXITINTINFO_TYPE_INTR:
kvm_queue_interrupt(&svm->vcpu, vector, false);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index dc023bc..285fe1a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -919,7 +919,8 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
 }
 
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
-   bool has_error_code, u32 error_code)
+   bool has_error_code, u32 error_code,
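
[The rest of this message is cut off in the archive. Based on the
declarations in the kvm_host.h hunk above, the x86.c side that the
diffstat promises presumably looks roughly like the sketch below --
a reconstruction, not the verbatim patch:]

	static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned nr,
					   bool has_error, u32 error_code,
					   bool reinject)
	{
		vcpu->arch.exception.pending = true;
		vcpu->arch.exception.has_error_code = has_error;
		vcpu->arch.exception.nr = nr;
		vcpu->arch.exception.error_code = error_code;
		vcpu->arch.exception.reinject = reinject;
	}

	void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
	{
		kvm_multiple_exception(vcpu, nr, false, 0, false);
	}

	void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr)
	{
		/* same as queueing, but flagged so svm/vmx skip the
		 * nested-intercept check on reinjection */
		kvm_multiple_exception(vcpu, nr, false, 0, true);
	}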

[PATCH 8/8] KVM: SVM: Handle MCE intercepts always on host level

2010-04-22 Thread Joerg Roedel
This patch prevents MCE intercepts from being propagated
into the L1 guest if they happened in an L2 guest.

Signed-off-by: Joerg Roedel 
---
 arch/x86/kvm/svm.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 30e49fe..889f660 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1651,6 +1651,7 @@ static int nested_svm_exit_special(struct vcpu_svm *svm)
switch (exit_code) {
case SVM_EXIT_INTR:
case SVM_EXIT_NMI:
+   case SVM_EXIT_EXCP_BASE + MC_VECTOR:
return NESTED_EXIT_HOST;
case SVM_EXIT_NPF:
/* For now we are always handling NPFs when using them */
-- 
1.7.0.4




[PATCH 6/8] KVM: SVM: Report emulated SVM features to userspace

2010-04-22 Thread Joerg Roedel
This patch implements reporting of the emulated SVM
features to userspace instead of the real hardware
capabilities. Every real hardware capability needs emulation
in nested svm, so the old behavior was broken.

Cc: sta...@kernel.org
Signed-off-by: Joerg Roedel 
---
 arch/x86/kvm/svm.c |   10 ++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 0fa2035..65fc114 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3154,6 +3154,16 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu)
 
 static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 {
+   switch (func) {
+   case 0x8000000A:
+   entry->eax = 1; /* SVM revision 1 */
+   entry->ebx = 8; /* Let's support 8 ASIDs in case we add proper
+  ASID emulation to nested SVM */
+   entry->ecx = 0; /* Reserved */
+   entry->edx = 0; /* Do not support any additional features */
+
+   break;
+   }
 }
 
 static const struct trace_print_flags svm_exit_reasons_str[] = {
-- 
1.7.0.4




[PATCH 5/8] KVM: X86: Add callback to let modules decide over some supported cpuid bits

2010-04-22 Thread Joerg Roedel
This patch adds the set_supported_cpuid callback to
kvm_x86_ops. It will be used in do_cpuid_ent to delegate the
decision about some supported cpuid bits to the
architecture modules.

Cc: sta...@kernel.org
Signed-off-by: Joerg Roedel 
---
 arch/x86/include/asm/kvm_host.h |2 ++
 arch/x86/kvm/svm.c  |6 ++
 arch/x86/kvm/vmx.c  |6 ++
 arch/x86/kvm/x86.c  |3 +++
 4 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d47d087..357573a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -528,6 +528,8 @@ struct kvm_x86_ops {
int (*get_lpage_level)(void);
bool (*rdtscp_supported)(void);
 
+   void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);
+
const struct trace_print_flags *exit_reasons_str;
 };
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index b10d163..0fa2035 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3152,6 +3152,10 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu)
 {
 }
 
+static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
+{
+}
+
 static const struct trace_print_flags svm_exit_reasons_str[] = {
{ SVM_EXIT_READ_CR0,"read_cr0" },
{ SVM_EXIT_READ_CR3,"read_cr3" },
@@ -3297,6 +3301,8 @@ static struct kvm_x86_ops svm_x86_ops = {
.cpuid_update = svm_cpuid_update,
 
.rdtscp_supported = svm_rdtscp_supported,
+
+   .set_supported_cpuid = svm_set_supported_cpuid,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d0a10b5..dc023bc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4111,6 +4111,10 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
}
 }
 
+static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
+{
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
.cpu_has_kvm_support = cpu_has_kvm_support,
.disabled_by_bios = vmx_disabled_by_bios,
@@ -4183,6 +4187,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
.cpuid_update = vmx_cpuid_update,
 
.rdtscp_supported = vmx_rdtscp_supported,
+
+   .set_supported_cpuid = vmx_set_supported_cpuid,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cf37ac6..c2c3c29 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1960,6 +1960,9 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
entry->ecx &= kvm_supported_word6_x86_features;
break;
}
+
+   kvm_x86_ops->set_supported_cpuid(function, entry);
+
put_cpu();
 }
 
-- 
1.7.0.4




[PATCH 0/8] More fixes for nested svm

2010-04-22 Thread Joerg Roedel
Hi Avi, Marcelo,

here is another set of nested svm fixes. They fix the NMI code to make the
UP Hyper-V root domain boot. The patches also add better handling for
nested entry failures and mce intercepts.
Also in this patchset are the fixes for the supported cpuid reporting for svm
features. These patches were taken from the nested-npt patchset and slightly
modified. They are also marked for -stable backporting.
Probably the most important fix is the one for exception reinjection. This
didn't work reliably before and is fixed with the patch in this series. The
fix also touches common x86 code, but that should be ok because it could be
reused by nested-vmx later.
Please review and give comments (or apply ;-).

Thanks,

Joerg

Diffstat:

 arch/x86/include/asm/kvm_host.h |8 -
 arch/x86/kvm/svm.c  |   72 +--
 arch/x86/kvm/vmx.c  |9 -
 arch/x86/kvm/x86.c  |   26 --
 4 files changed, 83 insertions(+), 32 deletions(-)

Shortlog:

Joerg Roedel (8):
  KVM: SVM: Fix nested nmi handling
  KVM: SVM: Make sure rip is synced to vmcb before nested vmexit
  KVM: SVM: Sync cr0 and cr3 to kvm state before nested handling
  KVM: SVM: Propagate nested entry failure into guest hypervisor
  KVM: X86: Add callback to let modules decide over some supported cpuid 
bits
  KVM: SVM: Report emulated SVM features to userspace
  KVM: x86: Allow marking an exception as reinjected
  KVM: SVM: Handle MCE intercepts always on host level




[PATCH 2/8] KVM: SVM: Make sure rip is synced to vmcb before nested vmexit

2010-04-22 Thread Joerg Roedel
This patch fixes a bug where a nested guest always re-executed
the same instruction because the rip was not advanced on a
nested vmexit.

Signed-off-by: Joerg Roedel 
---
 arch/x86/kvm/svm.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ec20584..c480d7f 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2960,6 +2960,10 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
u16 gs_selector;
u16 ldt_selector;
 
+   svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX];
+   svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
+   svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
+
/*
 * A vmexit emulation is required before the vcpu can be executed
 * again.
@@ -2967,10 +2971,6 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
if (unlikely(svm->nested.exit_required))
return;
 
-   svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX];
-   svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
-   svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
-
pre_svm_run(svm);
 
sync_lapic_to_cr8(vcpu);
-- 
1.7.0.4




Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Yoshiaki Tamura

Dor Laor wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Hi all,

We have been implementing the prototype of Kemari for KVM, and we're
sending
this message to share what we have now and TODO lists. Hopefully, we
would like
to get early feedback to keep us in the right direction. Although
advanced
approaches in the TODO lists are fascinating, we would like to run
this project
step by step while absorbing comments from the community. The current
code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control logic is
implemented in QEMU. However, we needed a hack in KVM to prevent rip from
proceeding before synchronizing VMs. It may also need some plumbing in
the
kernel side to guarantee replayability of certain events and
instructions,
integrate the RAS capabilities of newer x86 hardware with the HA
stack, as well
as for optimization purposes, for example.


[ snap]



The rest of this message describes TODO lists grouped by each topic.

=== event tapping ===

Event tapping is the core component of Kemari, and it decides on which
event the
primary should synchronize with the secondary. The basic assumption
here is
that outgoing I/O operations are idempotent, which is usually true for
disk I/O
and reliable network protocols such as TCP.


IMO any type of network event should be stalled too. What if the VM runs
a non-tcp protocol and the packet that the master node sent reached some
remote client, and before the sync to the slave the master failed?


In the current implementation, it is actually stalling any type of network
traffic that goes through virtio-net.


However, if the application was using unreliable protocols, it should have
its own recovery mechanism, or it should be completely stateless.



[snap]



=== clock ===

Since synchronizing the virtual machines every time the TSC is
accessed would be
prohibitive, the transmission of the TSC will be done lazily, which means
delaying it until a non-TSC synchronization point arrives.


Why do you specifically care about the tsc sync? When you sync all the
IO model on snapshot it also synchronizes the tsc.

In general, can you please explain the 'algorithm' for continuous
snapshots (is that what you like to do?):


Yes, of course.
Sorry for being less informative.


A trivial one would be to:
- do X online snapshots/sec


I currently don't have good numbers that I can share right now.
Snapshots/sec depends on what kind of workload is running, and if the guest was 
almost idle, there will be no snapshots in 5sec.  On the other hand, if the 
guest was running I/O intensive workloads (netperf, iozone for example), there 
will be about 50 snapshots/sec.



- Stall all IO (disk/block) from the guest to the outside world
until the previous snapshot reaches the slave.


Yes, it does.


- Snapshots are made of


Full device model + diff of dirty pages from the last snapshot.


- diff of dirty pages from last snapshot


This also depends on the workload.
In case of I/O intensive workloads, dirty pages are usually less than 100.


- Qemu device model (+kvm's) diff from last.


We're currently sending a full copy because we're completely reusing this
part of the existing live migration framework.


Last time we measured, it was about 13KB.
But it varies by which QEMU version is used.


You can do 'light' snapshots in between to send dirty pages to reduce
snapshot time.


I agree.  That's one of the advanced topics we would like to try too.


I wrote the above to serve as a reference for your comments so it will map
into my mind. Thanks, dor


Thank you for the guidance.
I hope this answers your question.

At the same time, I would also be happy if we could discuss how to implement
it too.  In fact, we needed a hack in KVM to prevent the rip from proceeding,
which turned out not to be the best workaround.


Thanks,

Yoshi





TODO:
- Synchronization of clock sources (need to intercept TSC reads, etc).

=== usability ===

These are items that defines how users interact with Kemari.

TODO:
- Kemarid daemon that takes care of the cluster management/monitoring
side of things.
- Some device emulators might need minor modifications to work well
with Kemari. Use white(black)-listing to take the burden of
choosing the right device model off the users.

=== optimizations ===

Although the big picture can be realized by completing the TODO list
above, we
need some optimizations/enhancements to make Kemari useful in real
world, and
these are items what needs to be done for that.

TODO:
- SMP (for the sake of performance might need to implement a
synchronization protocol that can maintain two or more
synchronization points active at any given moment)
- VGA (leverage VNC's sub-tiling mechanism to identify fb pages that
are really dirty)

[PATCH] qemu-kvm: Ask kernel about supported svm features

2010-04-22 Thread Joerg Roedel
This patch adds code to ask the kernel about the svm
features it supports for its guests and propagates them to
the guest. The new capability is necessary because the old
behavior of the kernel was to just return the host svm
features but every svm-feature needs emulation in the nested
svm kernel code. The new capability indicates that the
kernel is aware of that when returning svm cpuid
information.

Signed-off-by: Joerg Roedel 
---
 qemu-kvm-x86.c |   14 --
 1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 748ff69..6eccd69 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -1327,8 +1327,18 @@ int kvm_arch_init_vcpu(CPUState *cenv)
 qemu_kvm_cpuid_on_env(&copy);
 limit = copy.regs[R_EAX];
 
-for (i = 0x80000000; i <= limit; ++i)
-   do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, 0, &copy);
+for (i = 0x80000000; i <= limit; ++i) {
+do_cpuid_ent(&cpuid_ent[cpuid_nent], i, 0, &copy);
+switch (i) {
+case 0x8000000a:
+cpuid_ent[cpuid_nent].eax = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EAX);
+cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EBX);
+cpuid_ent[cpuid_nent].ecx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_ECX);
+cpuid_ent[cpuid_nent].edx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EDX);
+break;
+}
+cpuid_nent += 1;
+}
 
 kvm_setup_cpuid2(cenv, cpuid_nent, cpuid_ent);
 
-- 
1.7.0.4




Re: [PATCH 0/8] More fixes for nested svm

2010-04-22 Thread Avi Kivity

On 04/22/2010 01:33 PM, Joerg Roedel wrote:

Hi Avi, Marcelo,

here is another set of nested svm fixes. They fix the NMI code to make the
UP Hyper-V root domain boot. The patches also add better handling for
nested entry failures and mce intercepts.
Also in this patchset are the fixes for the supported cpuid reporting for svm
features. These patches were taken from the nested-npt patchset and slightly
modified. They are also marked for -stable backporting.
Probably the most important fix is the one for exception reinjection. This
didn't work reliably before and is fixed with the patch in this series. The
fix also touches common x86 code, but that should be ok because it could be
reused by nested-vmx later.
Please review and give comments (or apply ;-).
   


Looks good.  I'll wait a day or two for more reviews (esp. Alex).

--
error compiling committee.c: too many arguments to function



Re: [PATCH] qemu-kvm: Ask kernel about supported svm features

2010-04-22 Thread Avi Kivity

On 04/22/2010 01:57 PM, Joerg Roedel wrote:

This patch adds code to ask the kernel about the svm
features it supports for its guests and propagates them to
the guest. The new capability is necessary because the old
behavior of the kernel was to just return the host svm
features but every svm-feature needs emulation in the nested
svm kernel code. The new capability indicates that the
kernel is aware of that when returning svm cpuid
information.
   


No new capability here.


Signed-off-by: Joerg Roedel
---
  qemu-kvm-x86.c |   14 --
  1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 748ff69..6eccd69 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -1327,8 +1327,18 @@ int kvm_arch_init_vcpu(CPUState *cenv)
  qemu_kvm_cpuid_on_env(&copy);
  limit = copy.regs[R_EAX];

-for (i = 0x80000000; i <= limit; ++i)
-   do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, 0, &copy);
+for (i = 0x80000000; i <= limit; ++i) {
+do_cpuid_ent(&cpuid_ent[cpuid_nent], i, 0, &copy);
+switch (i) {
+case 0x8000000a:
+cpuid_ent[cpuid_nent].eax = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EAX);
+cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EBX);
+cpuid_ent[cpuid_nent].ecx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_ECX);
+cpuid_ent[cpuid_nent].edx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EDX);
+break;
+}
+cpuid_nent += 1;
+}
   


I don't understand why this is different compared to all other cpuid bits.

--
error compiling committee.c: too many arguments to function



Re: [PATCH] kvm vmx: free vpid when fail to create vcpu

2010-04-22 Thread Avi Kivity

On 04/17/2010 11:41 AM, Lai Jiangshan wrote:

Fix bug of the exception path, free allocated vpid when fail
to create vcpu.
   


Applied, thanks.

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Takuya Yoshikawa

(2010/04/22 19:35), Yoshiaki Tamura wrote:




A trivial one would be to:
- do X online snapshots/sec


I currently don't have good numbers that I can share right now.
Snapshots/sec depends on what kind of workload is running, and if the
guest was almost idle, there will be no snapshots in 5sec. On the other
hand, if the guest was running I/O intensive workloads (netperf, iozone
for example), there will be about 50 snapshots/sec.



50 is too small: this depends on the synchronization speed and does not
show how many snapshots we need, right?


Re: [PATCH] qemu-kvm: Ask kernel about supported svm features

2010-04-22 Thread Joerg Roedel
On Thu, Apr 22, 2010 at 02:07:14PM +0300, Avi Kivity wrote:
> On 04/22/2010 01:57 PM, Joerg Roedel wrote:
> >This patch adds code to ask the kernel about the svm
> >features it supports for its guests and propagates them to
> >the guest. The new capability is necessary because the old
> >behavior of the kernel was to just return the host svm
> >features but every svm-feature needs emulation in the nested
> >svm kernel code. The new capability indicates that the
> >kernel is aware of that when returning svm cpuid
> >information.
> 
> No new capability here.

copy&paste error, sorry.

> 
> >Signed-off-by: Joerg Roedel
> >---
> >  qemu-kvm-x86.c |   14 --
> >  1 files changed, 12 insertions(+), 2 deletions(-)
> >
> >diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
> >index 748ff69..6eccd69 100644
> >--- a/qemu-kvm-x86.c
> >+++ b/qemu-kvm-x86.c
> >@@ -1327,8 +1327,18 @@ int kvm_arch_init_vcpu(CPUState *cenv)
> >  qemu_kvm_cpuid_on_env(&copy);
> >  limit = copy.regs[R_EAX];
> >
> >-for (i = 0x80000000; i <= limit; ++i)
> >-do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, 0, &copy);
> >+for (i = 0x80000000; i <= limit; ++i) {
> >+do_cpuid_ent(&cpuid_ent[cpuid_nent], i, 0, &copy);
> >+switch (i) {
> >+case 0x8000000a:
> >+cpuid_ent[cpuid_nent].eax = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EAX);
> >+cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EBX);
> >+cpuid_ent[cpuid_nent].ecx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_ECX);
> >+cpuid_ent[cpuid_nent].edx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EDX);
> >+break;
> >+}
> >+cpuid_nent += 1;
> >+}
> 
> I don't understand why this is different compared to all other cpuid bits.

Because for the SVM features we report to the guest we need to ask the
kernel which of them are supported. We can't just take the host-cpuid
because most of the additional svm features need special emulation in
the kernel. Or do you think this would be better handled in
target-i386/cpuid.c?

Joerg




Re: [PATCH v4] Add mergeable RX bufs support to vhost

2010-04-22 Thread Michael S. Tsirkin
On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:
> This patch adds the mergeable RX buffers feature to vhost.
> 
> Signed-off-by: David L Stevens 

Looks pretty clean to me.
Could you send a checkpatch-clean version please?
We should also check performance implications.
Do you have any data?

Thanks!

> diff -ruNp net-next-p0/drivers/vhost/net.c net-next-v4/drivers/vhost/net.c
> --- net-next-p0/drivers/vhost/net.c   2010-03-22 12:04:38.0 -0700
> +++ net-next-v4/drivers/vhost/net.c   2010-04-19 14:23:38.0 -0700
> @@ -108,7 +108,7 @@ static void handle_tx(struct vhost_net *
>   };
>   size_t len, total_len = 0;
>   int err, wmem;
> - size_t hdr_size;
> + size_t vhost_hlen;
>   struct socket *sock = rcu_dereference(vq->private_data);
>   if (!sock)
>   return;
> @@ -127,13 +127,13 @@ static void handle_tx(struct vhost_net *
>  
>   if (wmem < sock->sk->sk_sndbuf / 2)
>   tx_poll_stop(net);
> - hdr_size = vq->hdr_size;
> + vhost_hlen = vq->vhost_hlen;
>  
>   for (;;) {
> - head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> -  ARRAY_SIZE(vq->iov),
> -  &out, &in,
> -  NULL, NULL);
> + head = vhost_get_desc(&net->dev, vq, vq->iov,
> +   ARRAY_SIZE(vq->iov),
> +   &out, &in,
> +   NULL, NULL);
>   /* Nothing new?  Wait for eventfd to tell us they refilled. */
>   if (head == vq->num) {
>   wmem = atomic_read(&sock->sk->sk_wmem_alloc);
> @@ -154,20 +154,20 @@ static void handle_tx(struct vhost_net *
>   break;
>   }
>   /* Skip header. TODO: support TSO. */
> - s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
> + s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, out);
>   msg.msg_iovlen = out;
>   len = iov_length(vq->iov, out);
>   /* Sanity check */
>   if (!len) {
>   vq_err(vq, "Unexpected header len for TX: "
>  "%zd expected %zd\n",
> -iov_length(vq->hdr, s), hdr_size);
> +iov_length(vq->hdr, s), vhost_hlen);
>   break;
>   }
>   /* TODO: Check specific error and bomb out unless ENOBUFS? */
>   err = sock->ops->sendmsg(NULL, sock, &msg, len);
>   if (unlikely(err < 0)) {
> - vhost_discard_vq_desc(vq);
> + vhost_discard_desc(vq, 1);
>   tx_poll_start(net, sock);
>   break;
>   }
> @@ -186,12 +186,25 @@ static void handle_tx(struct vhost_net *
>   unuse_mm(net->dev.mm);
>  }
>  
> +static int vhost_head_len(struct vhost_virtqueue *vq, struct sock *sk)
> +{
> + struct sk_buff *head;
> + int len = 0;
> +
> + lock_sock(sk);
> + head = skb_peek(&sk->sk_receive_queue);
> + if (head)
> + len = head->len + vq->sock_hlen;
> + release_sock(sk);
> + return len;
> +}
> +
>  /* Expects to be always run from workqueue - which acts as
>   * read-size critical section for our kind of RCU. */
>  static void handle_rx(struct vhost_net *net)
>  {
>   struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> - unsigned head, out, in, log, s;
> + unsigned in, log, s;
>   struct vhost_log *vq_log;
>   struct msghdr msg = {
>   .msg_name = NULL,
> @@ -202,14 +215,14 @@ static void handle_rx(struct vhost_net *
>   .msg_flags = MSG_DONTWAIT,
>   };
>  
> - struct virtio_net_hdr hdr = {
> - .flags = 0,
> - .gso_type = VIRTIO_NET_HDR_GSO_NONE
> + struct virtio_net_hdr_mrg_rxbuf hdr = {
> + .hdr.flags = 0,
> + .hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
>   };
>  
>   size_t len, total_len = 0;
> - int err;
> - size_t hdr_size;
> + int err, headcount, datalen;
> + size_t vhost_hlen;
>   struct socket *sock = rcu_dereference(vq->private_data);
>   if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
>   return;
> @@ -217,18 +230,18 @@ static void handle_rx(struct vhost_net *
>   use_mm(net->dev.mm);
>   mutex_lock(&vq->mutex);
>   vhost_disable_notify(vq);
> - hdr_size = vq->hdr_size;
> + vhost_hlen = vq->vhost_hlen;
>  
>   vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
>   vq->log : NULL;
>  
> - for (;;) {
> - head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> -  ARRAY_SIZE(vq->iov),
> -  &out, &in,
> -  vq_log, &log);
> 

Re: [PATCH] qemu-kvm: Ask kernel about supported svm features

2010-04-22 Thread Avi Kivity

On 04/22/2010 03:02 PM, Joerg Roedel wrote:



Signed-off-by: Joerg Roedel
---
  qemu-kvm-x86.c |   14 --
  1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 748ff69..6eccd69 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -1327,8 +1327,18 @@ int kvm_arch_init_vcpu(CPUState *cenv)
  qemu_kvm_cpuid_on_env(&copy);
  limit = copy.regs[R_EAX];

-for (i = 0x80000000; i <= limit; ++i)
-   do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, 0, &copy);
+for (i = 0x80000000; i <= limit; ++i) {
+do_cpuid_ent(&cpuid_ent[cpuid_nent], i, 0, &copy);
+switch (i) {
+case 0x8000000a:
+cpuid_ent[cpuid_nent].eax = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EAX);
+cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EBX);
+cpuid_ent[cpuid_nent].ecx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_ECX);
+cpuid_ent[cpuid_nent].edx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EDX);
+break;
+}
+cpuid_nent += 1;
+}
   

I don't understand why this is different compared to all other cpuid bits.
 

Because for the SVM features we report to the guest we need to ask the
kernel which of them are supported.


That's true for all cpuid features.


We can't just take the host-cpuid
because most of the additional svm features need special emulation in
the kernel. Or do you think this should better be handled in
target-i386/cpuid.c?
   


Yes.  -cpu host should take KVM_GET_SUPPORTED_CPUID output and loop it 
back to the vcpu configuration, others just take the qemu configuration, 
mask it with supported bits, and pass it back (see 
check_features_against_host()).


(need feature names for the bits, too, so you can enable or disable them 
from the command line)
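
[Schematically -- a minimal illustration of the masking Avi describes,
not the actual check_features_against_host() code; the names here are
made up:]

	#include <stdint.h>
	#include <stdio.h>

	struct cpuid_leaf {
		uint32_t eax, ebx, ecx, edx;
	};

	/* AND the configured feature words with what the kernel reports
	 * via KVM_GET_SUPPORTED_CPUID, warning about bits that get
	 * silently dropped. */
	static void mask_against_host(struct cpuid_leaf *req,
				      const struct cpuid_leaf *host)
	{
		uint32_t lost = req->edx & ~host->edx;

		if (lost)
			fprintf(stderr, "warning: host lacks edx bits %#x\n",
				lost);
		req->eax &= host->eax;
		req->ebx &= host->ebx;
		req->ecx &= host->ecx;
		req->edx &= host->edx;
	}

	int main(void)
	{
		/* config asks for two svm feature bits, kernel emulates one */
		struct cpuid_leaf req  = { 1, 8, 0, 0x3 };
		struct cpuid_leaf host = { 1, 8, 0, 0x1 };

		mask_against_host(&req, &host);
		printf("edx after masking: %#x\n", req.edx); /* prints 0x1 */
		return 0;
	}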


--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Dor Laor

On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:

Dor Laor wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Hi all,

We have been implementing the prototype of Kemari for KVM, and we're
sending
this message to share what we have now and TODO lists. Hopefully, we
would like
to get early feedback to keep us in the right direction. Although
advanced
approaches in the TODO lists are fascinating, we would like to run
this project
step by step while absorbing comments from the community. The current
code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control logic is
implemented in QEMU. However, we needed a hack in KVM to prevent rip
from
proceeding before synchronizing VMs. It may also need some plumbing in
the
kernel side to guarantee replayability of certain events and
instructions,
integrate the RAS capabilities of newer x86 hardware with the HA
stack, as well
as for optimization purposes, for example.


[ snap]



The rest of this message describes TODO lists grouped by each topic.

=== event tapping ===

Event tapping is the core component of Kemari, and it decides on which
event the
primary should synchronize with the secondary. The basic assumption
here is
that outgoing I/O operations are idempotent, which is usually true for
disk I/O
and reliable network protocols such as TCP.


IMO any type of network event should be stalled too. What if the VM runs
a non-tcp protocol and the packet that the master node sent reached some
remote client, and before the sync to the slave the master failed?


In the current implementation, it is actually stalling any type of network
traffic that goes through virtio-net.

However, if the application was using unreliable protocols, it should
have its own recovery mechanism, or it should be completely stateless.


Why do you treat tcp differently? You can damage the entire VM this way
- think of a dhcp request that was dropped at the moment you switched
between the master and the slave.






[snap]



=== clock ===

Since synchronizing the virtual machines every time the TSC is
accessed would be
prohibitive, the transmission of the TSC will be done lazily, which
means
delaying it until a non-TSC synchronization point arrives.


Why do you specifically care about the tsc sync? When you sync all the
IO model on snapshot it also synchronizes the tsc.


So, do you agree that an extra clock synchronization is not needed since 
it is done anyway as part of the live migration state sync?




In general, can you please explain the 'algorithm' for continuous
snapshots (is that what you like to do?):


Yes, of course.
Sorry for being less informative.


A trivial one would be to:
- do X online snapshots/sec


I currently don't have good numbers that I can share right now.
Snapshots/sec depends on what kind of workload is running, and if the
guest was almost idle, there will be no snapshots in 5sec. On the other
hand, if the guest was running I/O intensive workloads (netperf, iozone
for example), there will be about 50 snapshots/sec.


- Stall all IO (disk/block) from the guest to the outside world
until the previous snapshot reaches the slave.


Yes, it does.


- Snapshots are made of


Full device model + diff of dirty pages from the last snapshot.


- diff of dirty pages from last snapshot


This also depends on the workload.
In case of I/O intensive workloads, dirty pages are usually less than 100.


The hardest would be memory-intensive loads.
So 100 snap/sec means a latency of 10 msec, right?
(not that it's not ok; with faster hw and IB you'll be able to get much
more)





- Qemu device model (+kvm's) diff from last.


We're currently sending full copy because we're completely reusing this
part of existing live migration framework.

Last time we measured, it was about 13KB.
But it varies by which QEMU version is used.


You can do 'light' snapshots in between to send dirty pages to reduce
snapshot time.


I agree. That's one of the advanced topics we would like to try too.


I wrote the above to serve as a reference for your comments so it will map
into my mind. Thanks, dor


Thank you for the guidance.
I hope this answers your question.

At the same time, I would also be happy if we could discuss how to
implement it too. In fact, we needed a hack in KVM to prevent the rip from
proceeding, which turned out not to be the best workaround.


There are brute force solutions like
- stop the guest until you send all of the snapshot to the remote (like
  standard live migration)
- Stop + fork + cont the father

Or mark the recent dirty pages that were not sent to the remote as write 
protected and copy them if touched.
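
[A user-space sketch of that last idea -- write-protect the not-yet-sent
dirty pages and copy them on first touch.  Illustrative only (kvm would do
the equivalent in the kernel via dirty logging); error handling omitted:]

	#include <signal.h>
	#include <stdint.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	static long pgsz;
	static uint8_t *guest_ram;  /* stands in for unsent dirty pages */
	static uint8_t *backup;     /* copy the migration thread can send */

	static void wp_handler(int sig, siginfo_t *si, void *uc)
	{
		uintptr_t a = (uintptr_t)si->si_addr & ~(pgsz - 1);

		(void)sig; (void)uc;
		/* preserve the pre-touch contents for the in-flight
		 * snapshot, then let the faulting write proceed */
		memcpy(backup + (a - (uintptr_t)guest_ram), (void *)a, pgsz);
		mprotect((void *)a, pgsz, PROT_READ | PROT_WRITE);
	}

	int main(void)
	{
		struct sigaction sa;

		pgsz = sysconf(_SC_PAGESIZE);
		guest_ram = mmap(NULL, pgsz, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		backup = mmap(NULL, pgsz, PROT_READ | PROT_WRITE,
			      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		memset(&sa, 0, sizeof(sa));
		sa.sa_sigaction = wp_handler;
		sa.sa_flags = SA_SIGINFO;
		sigaction(SIGSEGV, &sa, NULL);

		guest_ram[0] = 1;                     /* page is dirty */
		mprotect(guest_ram, pgsz, PROT_READ); /* snapshot in flight */
		guest_ram[0] = 2;                     /* faults, copies, resumes */
		return backup[0] == 1 ? 0 : 1;
	}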





Thanks,

Yoshi





TODO:
- Synchronization of clock sources (need to intercept TSC reads, etc).

=== usability ===

Thes

Re: [PATCH] qemu-kvm: Ask kernel about supported svm features

2010-04-22 Thread Joerg Roedel
On Thu, Apr 22, 2010 at 03:13:14PM +0300, Avi Kivity wrote:
> On 04/22/2010 03:02 PM, Joerg Roedel wrote:

>> We can't just take the host-cpuid
>> because most of the additional svm features need special emulation in
>> the kernel. Or do you think this should better be handled in
>> target-i386/cpuid.c?
>>
>
> Yes.  -cpu host should take KVM_GET_SUPPORTED_CPUID output and loop it  
> back to the vcpu configuration, others just take the qemu configuration,  
> mask it with supported bits, and pass it back (see  
> check_features_against_host()).

Hmm, the plan was to enable with -enable-nesting all kernel-supported
svm features for the guest (and add switches later to remove them
individually).
If we activate nested svm with -cpu host in the future that's fine too
(closed-source hypervisors need that anyway). But we should also define
a cpu model in which we can migrate nested hypervisors between machines
where the cpu is not completely identical.

> (need feature names for the bits, too, so you can enable or disable them  
> from the command line)

Yeah, I know. I omitted that for the first bring-up. It was planned for
a later patch.

Joerg



[qemu-kvm tests 0/3] qemu-kvm tests cleanup

2010-04-22 Thread Naphtali Sprei
Cleanup, mostly x86 oriented.
Patches against 'next' branch.

Naphtali Sprei (3):
  qemu-kvm tests cleanup
  qemu-kvm tests cleanup: adapt stringio test to kernel-mode run
  qemu-kvm tests cleanup: Added printing for passing tests Also typo
fix

 kvm/user/README|   23 ++
 kvm/user/balloon_ctl.c |   92 --
 kvm/user/bootstrap.lds |   15 -
 kvm/user/config-x86-common.mak |   26 +--
 kvm/user/config-x86_64.mak |5 +-
 kvm/user/main.c|  611 
 kvm/user/test/x86/README   |   15 +
 kvm/user/test/x86/bootstrap.S  |  137 -
 kvm/user/test/x86/exit.c   |7 -
 kvm/user/test/x86/memtest1.S   |   44 ---
 kvm/user/test/x86/realmode.c   |  106 +++-
 kvm/user/test/x86/runtime.h|6 -
 kvm/user/test/x86/simple.S |   13 -
 kvm/user/test/x86/stringio.S   |   13 +-
 kvm/user/test/x86/test32.S |8 -
 kvm/user/testdev.txt   |   14 +
 16 files changed, 170 insertions(+), 965 deletions(-)
 create mode 100644 kvm/user/README
 delete mode 100755 kvm/user/balloon_ctl.c
 delete mode 100644 kvm/user/bootstrap.lds
 delete mode 100644 kvm/user/main.c
 create mode 100644 kvm/user/test/x86/README
 delete mode 100644 kvm/user/test/x86/bootstrap.S
 delete mode 100644 kvm/user/test/x86/exit.c
 delete mode 100644 kvm/user/test/x86/memtest1.S
 delete mode 100644 kvm/user/test/x86/runtime.h
 delete mode 100644 kvm/user/test/x86/simple.S
 delete mode 100644 kvm/user/test/x86/test32.S
 create mode 100644 kvm/user/testdev.txt



[qemu-kvm tests 1/3] qemu-kvm tests cleanup

2010-04-22 Thread Naphtali Sprei
Mainly removed unused/unnecessary files and references to them

Signed-off-by: Naphtali Sprei 
---
 kvm/user/README|   23 ++
 kvm/user/balloon_ctl.c |   92 --
 kvm/user/bootstrap.lds |   15 -
 kvm/user/config-x86-common.mak |   24 +--
 kvm/user/config-x86_64.mak |5 +-
 kvm/user/main.c|  611 
 kvm/user/test/x86/README   |   15 +
 kvm/user/test/x86/bootstrap.S  |  137 -
 kvm/user/test/x86/exit.c   |7 -
 kvm/user/test/x86/memtest1.S   |   44 ---
 kvm/user/test/x86/runtime.h|6 -
 kvm/user/test/x86/simple.S |   13 -
 kvm/user/test/x86/test32.S |8 -
 kvm/user/testdev.txt   |   14 +
 14 files changed, 60 insertions(+), 954 deletions(-)
 create mode 100644 kvm/user/README
 delete mode 100755 kvm/user/balloon_ctl.c
 delete mode 100644 kvm/user/bootstrap.lds
 delete mode 100644 kvm/user/main.c
 create mode 100644 kvm/user/test/x86/README
 delete mode 100644 kvm/user/test/x86/bootstrap.S
 delete mode 100644 kvm/user/test/x86/exit.c
 delete mode 100644 kvm/user/test/x86/memtest1.S
 delete mode 100644 kvm/user/test/x86/runtime.h
 delete mode 100644 kvm/user/test/x86/simple.S
 delete mode 100644 kvm/user/test/x86/test32.S
 create mode 100644 kvm/user/testdev.txt

diff --git a/kvm/user/README b/kvm/user/README
new file mode 100644
index 000..6a83831
--- /dev/null
+++ b/kvm/user/README
@@ -0,0 +1,23 @@
+This directory contains sources for a kvm test suite.
+
+Tests for the x86 architecture are run as kernel images for qemu that supports multiboot format.
+Tests use an infrastructure called from the bios code. The infrastructure initializes the system/cpu's,
+switches to long mode and calls the 'main' function of the individual test.
+Tests use qemu's virtual test device, named testdev, for services like printing, exiting, querying memory size etc.
+See file testdev.txt for more details.
+
+To create the tests' images just type 'make' in this directory.
+Tests' images are created in ./test/<arch>/*.flat
+
+An example of a test invocation:
+qemu-system-x86_64 -device testdev,chardev=testlog -chardev file,id=testlog,path=msr.out -kernel ./test/x86/msr.flat
+This invocation runs the msr test case. The test output is in file msr.out.
+
+
+
+Directory structure:
+.:  Makefile and config files for the tests
+./test/lib: general services for the tests
+./test/lib/<arch>: architecture-dependent services for the tests
+./test/<arch>: the sources of the tests and the created objects/images
+
diff --git a/kvm/user/balloon_ctl.c b/kvm/user/balloon_ctl.c
deleted file mode 100755
index e65b08d..000
--- a/kvm/user/balloon_ctl.c
+++ /dev/null
@@ -1,92 +0,0 @@
-/*
- * This binary provides access to the guest's balloon driver
- * module.
- *
- * Copyright (C) 2007 Qumranet
- *
- * Author:
- *
- *  Dor Laor 
- *
- * This work is licensed under the GNU LGPL license, version 2.
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#define __user
-#include 
-
-#define PAGE_SIZE 4096ul
-
-
-static int balloon_op(int *fd, int bytes)
-{
-   struct kvm_balloon_op bop;
-int r;
-
-   bop.npages = bytes/PAGE_SIZE;
-   r = ioctl(*fd, KVM_BALLOON_OP, &bop);
-   if (r == -1)
-   return -errno;
-   printf("Ballon handled %d pages successfully\n", bop.npages);
-
-   return 0;
-}
-
-static int balloon_init(int *fd)
-{
-   *fd = open("/dev/kvm_balloon", O_RDWR);
-   if (*fd == -1) {
-   perror("open /dev/kvm_balloon");
-   return -1;
-   }
-
-   return 0;
-}
-
-int main(int argc, char *argv[])
-{
-   int fd;
-   int r;
-   int bytes;
-
-   if (argc != 3) {
-   perror("Please provide op=[i|d], bytes\n");
-   return 1;
-   }
-   bytes = atoi(argv[2]);
-
-   switch (*argv[1]) {
-   case 'i':
-   break;
-   case 'd':
-   bytes = -bytes;
-   break;
-   default:
-   perror("Wrong op param\n");
-   return 1;
-   }
-
-   if (balloon_init(&fd)) {
-   perror("balloon_init failed\n");
-   return 1;
-   }
-
-   if ((r = balloon_op(&fd, bytes))) {
-   perror("balloon_op failed\n");
-   goto out;
-   }
-
-out:
-   close(fd);
-
-   return r;
-}
-
diff --git a/kvm/user/bootstrap.lds b/kvm/user/bootstrap.lds
deleted file mode 100644
index fd0a4f8..000
--- a/kvm/user/bootstrap.lds
+++ /dev/null
@@ -1,15 +0,0 @@
-OUTPUT_FORMAT(binary)
-
-SECTIONS
-{
-. = 0;
-stext = .;
-.text : { *(.init) *(.text) }
-. = ALIGN(4K);
-.data : { *(.data) }
-. = ALIGN(16);
-.bss : { *(.bss) }
-. = ALIGN(4K);
-edata = .;
-}
-
diff --git a/kvm/user/config-x86-common.mak b/kvm/user/config-x86-common.mak
index 63cca42..960741e 100644
--- a/kvm/user/config-x86-common.mak
+++ b/kvm/user/config-x86-common.

[qemu-kvm tests 2/3] qemu-kvm tests cleanup: adapt stringio test to kernel-mode run

2010-04-22 Thread Naphtali Sprei
Also use testdev for output, call exit to quit.
Currently, test reboots endlessly because of a triple-fault.
Need to run test with -no-reboot till issue fixed (in kvm ??)

Signed-off-by: Naphtali Sprei 
---
 kvm/user/config-x86-common.mak |2 +-
 kvm/user/test/x86/stringio.S   |   13 +
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/kvm/user/config-x86-common.mak b/kvm/user/config-x86-common.mak
index 960741e..d312681 100644
--- a/kvm/user/config-x86-common.mak
+++ b/kvm/user/config-x86-common.mak
@@ -54,7 +54,7 @@ $(TEST_DIR)/realmode.flat: $(TEST_DIR)/realmode.o
 
 $(TEST_DIR)/realmode.o: bits = 32
 
-$(TEST_DIR)/stringio.flat: $(TEST_DIR)/stringio.o
+$(TEST_DIR)/stringio.flat: $(cstart.o) $(TEST_DIR)/stringio.o
 
 $(TEST_DIR)/msr.flat: $(cstart.o) $(TEST_DIR)/msr.o
 
diff --git a/kvm/user/test/x86/stringio.S b/kvm/user/test/x86/stringio.S
index 31ddc47..461621c 100644
--- a/kvm/user/test/x86/stringio.S
+++ b/kvm/user/test/x86/stringio.S
@@ -8,24 +8,29 @@
 1: 
 .endm  
 
+TESTDEV_PORT = 0xf1
+
str "forward", "forward"
str "backward", "backward"

 .text
 
-
+.global main
+main:
cld
movl forward, %ecx
lea  4+forward, %rsi
-   movw $1, %dx
+   movw $TESTDEV_PORT, %dx
rep outsb
 
std
movl backward, %ecx
lea 4+backward-1(%rcx), %rsi
-   movw $2, %dx
+   movw $TESTDEV_PORT, %dx
rep outsb

-   hlt
+   mov $0, %rsi
+   call exit
+
 
 
-- 
1.6.3.3



[qemu-kvm tests 3/3] qemu-kvm tests cleanup: Added printing for passing tests Also typo fix

2010-04-22 Thread Naphtali Sprei

Signed-off-by: Naphtali Sprei 
---
 kvm/user/test/x86/realmode.c |  106 +++--
 1 files changed, 100 insertions(+), 6 deletions(-)

diff --git a/kvm/user/test/x86/realmode.c b/kvm/user/test/x86/realmode.c
index bfc2942..bc4ed97 100644
--- a/kvm/user/test/x86/realmode.c
+++ b/kvm/user/test/x86/realmode.c
@@ -160,6 +160,8 @@ void test_xchg(void)
 
if (!regs_equal(&inregs, &outregs, 0))
print_serial("xchg test 1: FAIL\n");
+   else
+   print_serial("xchg test 1: PASS\n");
 
exec_in_big_real_mode(&inregs, &outregs,
  insn_xchg_test2,
@@ -169,6 +171,8 @@ void test_xchg(void)
 outregs.eax != inregs.ebx ||
 outregs.ebx != inregs.eax)
print_serial("xchg test 2: FAIL\n");
+   else
+   print_serial("xchg test 2: PASS\n");
 
exec_in_big_real_mode(&inregs, &outregs,
  insn_xchg_test3,
@@ -178,6 +182,8 @@ void test_xchg(void)
 outregs.eax != inregs.ecx ||
 outregs.ecx != inregs.eax)
print_serial("xchg test 3: FAIL\n");
+   else
+   print_serial("xchg test 3: PASS\n");
 
exec_in_big_real_mode(&inregs, &outregs,
  insn_xchg_test4,
@@ -187,6 +193,8 @@ void test_xchg(void)
 outregs.eax != inregs.edx ||
 outregs.edx != inregs.eax)
print_serial("xchg test 4: FAIL\n");
+   else
+   print_serial("xchg test 4: PASS\n");
 
exec_in_big_real_mode(&inregs, &outregs,
  insn_xchg_test5,
@@ -196,6 +204,8 @@ void test_xchg(void)
 outregs.eax != inregs.esi ||
 outregs.esi != inregs.eax)
print_serial("xchg test 5: FAIL\n");
+   else
+   print_serial("xchg test 5: PASS\n");
 
exec_in_big_real_mode(&inregs, &outregs,
  insn_xchg_test6,
@@ -205,6 +215,8 @@ void test_xchg(void)
 outregs.eax != inregs.edi ||
 outregs.edi != inregs.eax)
print_serial("xchg test 6: FAIL\n");
+   else
+   print_serial("xchg test 6: PASS\n");
 
exec_in_big_real_mode(&inregs, &outregs,
  insn_xchg_test7,
@@ -214,6 +226,8 @@ void test_xchg(void)
 outregs.eax != inregs.ebp ||
 outregs.ebp != inregs.eax)
print_serial("xchg test 7: FAIL\n");
+   else
+   print_serial("xchg test 7: PASS\n");
 
exec_in_big_real_mode(&inregs, &outregs,
  insn_xchg_test8,
@@ -223,6 +237,8 @@ void test_xchg(void)
 outregs.eax != inregs.esp ||
 outregs.esp != inregs.eax)
print_serial("xchg test 8: FAIL\n");
+   else
+   print_serial("xchg test 8: PASS\n");
 }
 
 void test_shld(void)
@@ -234,9 +250,9 @@ void test_shld(void)
  insn_shld_test,
  insn_shld_test_end - insn_shld_test);
if (outregs.eax != 0xbeef)
-   print_serial("shld: failure\n");
+   print_serial("shld: FAIL\n");
else
-   print_serial("shld: success\n");
+   print_serial("shld: PASS\n");
 }
 
 void test_mov_imm(void)
@@ -253,6 +269,8 @@ void test_mov_imm(void)
  insn_mov_r16_imm_1_end - insn_mov_r16_imm_1);
if (!regs_equal(&inregs, &outregs, R_AX) || outregs.eax != 1234)
print_serial("mov test 1: FAIL\n");
+   else
+   print_serial("mov test 1: PASS\n");
 
/* test mov $imm, %eax */
exec_in_big_real_mode(&inregs, &outregs,
@@ -260,6 +278,8 @@ void test_mov_imm(void)
  insn_mov_r32_imm_1_end - insn_mov_r32_imm_1);
if (!regs_equal(&inregs, &outregs, R_AX) || outregs.eax != 1234567890)
print_serial("mov test 2: FAIL\n");
+   else
+   print_serial("mov test 2: PASS\n");
 
/* test mov $imm, %al/%ah */
exec_in_big_real_mode(&inregs, &outregs,
@@ -267,16 +287,24 @@ void test_mov_imm(void)
  insn_mov_r8_imm_1_end - insn_mov_r8_imm_1);
if (!regs_equal(&inregs, &outregs, R_AX) || outregs.eax != 0x1200)
print_serial("mov test 3: FAIL\n");
+   else
+   print_serial("mov test 3: PASS\n");
+
exec_in_big_real_mode(&inregs, &outregs,
  insn_mov_r8_imm_2,
  insn_mov_r8_imm_2_end - insn_mov_r8_imm_2);
if (!regs_equal(&inregs, &outregs, R_AX) || outregs.eax != 0x34)
print_serial("mov test 4: FAIL\n");
+   else
+   print_serial("mov test 4: PASS\n");
+
exec_in_big_real_mode(&inregs, &outregs,
  insn_mov_r8_imm_3,
  insn_mov_r8_im

Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Yoshiaki Tamura
2010/4/22 Takuya Yoshikawa :
> (2010/04/22 19:35), Yoshiaki Tamura wrote:
>
>>
>>> A trivial one would be to:
>>> - do X online snapshots/sec
>>
>> I currently don't have good numbers that I can share right now.
>> Snapshots/sec depends on what kind of workload is running, and if the
>> guest was almost idle, there will be no snapshots in 5sec. On the other
>> hand, if the guest was running I/O intensive workloads (netperf, iozone
>> for example), there will be about 50 snapshots/sec.
>>
>
> 50 is too small: this depends on the synchronization speed and does not
> show how many snapshots we need, right?

No, it doesn't.
It's example data which I measured before.


Re: [PATCH 1/5] Add a global synchronization point for pvclock

2010-04-22 Thread Glauber Costa
On Tue, Apr 20, 2010 at 12:42:17PM -0700, Jeremy Fitzhardinge wrote:
> On 04/20/2010 11:54 AM, Avi Kivity wrote:
> > On 04/20/2010 09:23 PM, Jeremy Fitzhardinge wrote:
> >> On 04/20/2010 02:31 AM, Avi Kivity wrote:
> >>   
> >>> btw, do you want this code in pvclock.c, or shall we keep it kvmclock
> >>> specific?
> >>>  
> >> I think its a pvclock-level fix.  I'd been hoping to avoid having
> >> something like this, but I think its ultimately necessary.
> >>
> >
> > Did you observe drift on Xen, or is this "ultimately" pointing at the
> > future?
> 
> People are reporting weirdnesses that "clocksource=jiffies" apparently
> resolves.  Xen and KVM are faced with the same hardware constraints, and
> it wouldn't surprise me if there were small measurable
> non-monotonicities in the PV clock under Xen.  May as well be safe.
> 
> Of course, it kills any possibility of being able to usefully export
> this interface down to usermode.
> 
> My main concern about this kind of simple fix is that if there's a long
> term systematic drift between different CPU's tscs, then this will
> somewhat mask the problem while giving really awful time measurement on
> the "slow" CPU(s).  In that case it really needs to adjust the scaling
> factor to correct for the drift (*not* update the offset).  But if we're
> definitely only talking about fixed, relatively small time offsets then
> it is fine.
Can you by any chance run Ingo's time warp test on those machines?

You need to define TOD to 1, and leave out the TSC test.

For me, warps exist on every machine out there but the Nehalems, so far.


Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Yoshiaki Tamura
2010/4/22 Dor Laor :
> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>
>> Dor Laor wrote:
>>>
>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

 Hi all,

 We have been implementing the prototype of Kemari for KVM, and we're
 sending
 this message to share what we have now and TODO lists. Hopefully, we
 would like
 to get early feedback to keep us in the right direction. Although
 advanced
 approaches in the TODO lists are fascinating, we would like to run
 this project
 step by step while absorbing comments from the community. The current
 code is
 based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

 For those who are new to Kemari for KVM, please take a look at the
 following RFC which we posted last year.

 http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

 The transmission/transaction protocol, and most of the control logic is
 implemented in QEMU. However, we needed a hack in KVM to prevent rip
 from
 proceeding before synchronizing VMs. It may also need some plumbing in
 the
 kernel side to guarantee replayability of certain events and
 instructions,
 integrate the RAS capabilities of newer x86 hardware with the HA
 stack, as well
 as for optimization purposes, for example.
>>>
>>> [ snap]
>>>

 The rest of this message describes TODO lists grouped by each topic.

 === event tapping ===

 Event tapping is the core component of Kemari, and it decides on which
 event the
 primary should synchronize with the secondary. The basic assumption
 here is
 that outgoing I/O operations are idempotent, which is usually true for
 disk I/O
 and reliable network protocols such as TCP.
>>>
>>> IMO any type of network event should be stalled too. What if the VM runs
>>> a non-tcp protocol and the packet that the master node sent reached some
>>> remote client, and before the sync to the slave the master failed?
>>
>> In the current implementation, it is actually stalling any type of network
>> traffic that goes through virtio-net.
>>
>> However, if the application was using unreliable protocols, it should
>> have its own recovery mechanism, or it should be completely stateless.
>
> Why do you treat tcp differently? You can damage the entire VM this way -
> think of dhcp request that was dropped on the moment you switched between
> the master and the slave?

I'm not trying to say that we should treat tcp differently, but just
that it's severe.
In case of a dhcp request, the client would have a chance to retry after
failover, correct?
BTW, in the current implementation, it's synchronizing before the dhcp ack
is sent.
But in case of tcp, once you send an ack to the client before sync, there
is no way to recover.

>>> [snap]
>>>
>>>
 === clock ===

 Since synchronizing the virtual machines every time the TSC is
 accessed would be
 prohibitive, the transmission of the TSC will be done lazily, which
 means
 delaying it until a non-TSC synchronization point arrives.
>>>
>>> Why do you specifically care about the tsc sync? When you sync all the
>>> IO model on snapshot it also synchronizes the tsc.
>
> So, do you agree that an extra clock synchronization is not needed since it
> is done anyway as part of the live migration state sync?

I agree that it's sent as part of the live migration.
What I wanted to say here is that this is not something for real-time
applications.
I usually get questions like whether this can guarantee fault tolerance
for real-time applications.

>>> In general, can you please explain the 'algorithm' for continuous
>>> snapshots (is that what you like to do?):
>>
>> Yes, of course.
>> Sorry for being less informative.
>>
>>> A trivial one would be to:
>>> - do X online snapshots/sec
>>
>> I currently don't have good numbers that I can share right now.
>> Snapshots/sec depends on what kind of workload is running, and if the
>> guest was almost idle, there will be no snapshots in 5sec. On the other
>> hand, if the guest was running I/O intensive workloads (netperf, iozone
>> for example), there will be about 50 snapshots/sec.
>>
>>> - Stall all IO (disk/block) from the guest to the outside world
>>> until the previous snapshot reaches the slave.
>>
>> Yes, it does.
>>
>>> - Snapshots are made of
>>
>> Full device model + diff of dirty pages from the last snapshot.
>>
>>> - diff of dirty pages from last snapshot
>>
>> This also depends on the workload.
>> In case of I/O intensive workloads, dirty pages are usually less than 100.
>
> The hardest would be memory intensive loads.
> So 100 snap/sec means latency of 10msec right?
> (not that it's not ok, with faster hw and IB you'll be able to get much
> more)

Doesn't 100 snap/sec mean the interval between snaps is 10msec?
IIUC, to get the latency, you need Time to transfer the VM + Time
to get a response from the receiver.
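
[To make the units concrete -- illustrative numbers, not from the thread:
100 snapshots/sec gives 1/100 = 10 msec between snapshots, and the earlier
figure of ~50 snapshots/sec gives 1/50 = 20 msec.  The latency one stalled
I/O event sees is roughly (diff transfer time) + (secondary's ack time),
which is largely independent of that interval.]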

It's hard to say which load is the hardest.
Mem

Re: [PATCH] qemu-kvm: Ask kernel about supported svm features

2010-04-22 Thread Avi Kivity

On 04/22/2010 03:28 PM, Joerg Roedel wrote:



Yes.  -cpu host should take KVM_GET_SUPPORTED_CPUID output and loop it
back to the vcpu configuration, others just take the qemu configuration,
mask it with supported bits, and pass it back (see
check_features_against_host()).
 

Hmm, the plan was to enable with -enable-nesting all kernel-supported
svm features for the guest (and add switches later to remove them
individually).
If we activate nested svm with -cpu host in the future that's fine too
(closed-source hypervisors need that anyway). But we should also define
a cpu model in which we can migrate nested hypervisors between machines
where the cpu is not completely identical.
   


You can use -cpu host, or -cpu kvm64,+svm,+npt,+other_feature... just as 
with all other features.



--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Jamie Lokier
Yoshiaki Tamura wrote:
> Dor Laor wrote:
> >On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
> >>Event tapping is the core component of Kemari, and it decides on which
> >>event the
> >>primary should synchronize with the secondary. The basic assumption
> >>here is
> >>that outgoing I/O operations are idempotent, which is usually true for
> >>disk I/O
> >>and reliable network protocols such as TCP.
> >
> >IMO any type of network event should be stalled too. What if the VM runs
> >a non-tcp protocol and the packet that the master node sent reached some
> >remote client, and before the sync to the slave the master failed?
> 
> In the current implementation, it is actually stalling any type of network
> traffic that goes through virtio-net.
> 
> However, if the application was using unreliable protocols, it should have
> its own recovery mechanism, or it should be completely stateless.

Even with unreliable protocols, if slave takeover causes the receiver
to have received a packet that the sender _does not think it has ever
sent_, expect some protocols to break.

If the slave replaying master's behaviour since the last sync means it
will definitely get into the same state of having sent the packet,
that works out.

But you still have to be careful that the other end's responses to
that packet are not seen by the slave too early during that replay.
Otherwise, for example, the slave may observe a TCP ACK to a packet
that it hasn't yet sent, which is an error.

About IP idempotency:

In general, IP packets are allowed to be lost or duplicated in the
network.  All IP protocols should be prepared for that; it is a basic
property.

However there is one respect in which they're not idempotent:

The TTL field should be decreased if packets are delayed.  Packets
should not appear to live in the network for longer than TTL seconds.
If they do, some protocols (like TCP) can react to the delayed ones
differently, such as sending a RST packet and breaking a connection.

It is acceptable to reduce TTL faster than the minimum.  After all, it
is reduced by 1 on every forwarding hop, in addition to time delays.

> I currently don't have good numbers that I can share right now.
> Snapshots/sec depends on what kind of workload is running, and if the 
> guest was almost idle, there will be no snapshots in 5sec.  On the other 
> hand, if the guest was running I/O intensive workloads (netperf, iozone 
> for example), there will be about 50 snapshots/sec.

That is a really satisfying number, thank you :-)

Without this work I wouldn't have imagined that synchronised machines
could work with such a low transaction rate.

-- Jamie


Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section

2010-04-22 Thread Marcelo Tosatti
On Wed, Apr 21, 2010 at 09:38:39PM +0300, Gleb Natapov wrote:
> On Wed, Apr 21, 2010 at 03:29:11PM -0300, Marcelo Tosatti wrote:
> > On Wed, Apr 21, 2010 at 08:58:48PM +0300, Gleb Natapov wrote:
> > > On Wed, Apr 21, 2010 at 02:37:34PM -0300, Marcelo Tosatti wrote:
> > > > On Wed, Apr 21, 2010 at 08:12:27PM +0300, Gleb Natapov wrote:
> > > > > On Wed, Apr 21, 2010 at 12:58:41PM -0300, Marcelo Tosatti wrote:
> > > > > > > Or could we make kvm_set_irq() atomic? Though the code path is a 
> > > > > > > little long 
> > > > > > > for spinlock.
> > > > > > 
> > > > > > Yes, given the sleep-inside-RCU-protected section bug from
> > > > > > kvm_notify_acked_irq, either that or convert IRQ locking to SRCU.
> > > > > > 
> > > > > > But as you said, the code paths are long and potentially slow, so
> > > > > > probably SRCU is a better alternative.
> > > > > > 
> > > > > > Gleb?
> > > > > kvm_set_irq() was converted to rcu from mutex to make msix interrupt
> > > > > injection scalable.
> > > > 
> > > > We meant ioapic lock. See the last report from Ralf on this thread. 
> > > Can we solve the problem by calling ack notifier outside rcu read
> > > section in kvm_notify_acked_irq()?
> > 
> > The unregister path does
> > 
> > - remove_from_list(entry)
> > - synchronize_rcu
> > - kfree(entry)
> > 
> > So if kvm_notify_acked_irq sleeps, synchronize_rcu can succeed, and the
> > notifier entry can be freed.
> What I mean is kvm_notify_acked_irq() will iterate over all ack entries
> in rcu read protected section, but instead of calling callback it will
> collect them into the array and call them outside rcu read section. At
> this point it doesn't matter if entry is unregistered since the copy is
> used to actually call the notifier.

Here it is, but no, this trick can't be done safely for ack notifiers
because they are unregistered at runtime by device assignment.

How does the freeing path know it can proceed to free its entry if
there is no reliable way to know whether there is still a reference to it?
(think "priv" gets freed between rcu_read_unlock and ->irq_acked with
the patch below).

I can convert to SRCU if you have no objections.

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 0150aff..900ac05 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -241,8 +241,8 @@ int pit_has_pending_timer(struct kvm_vcpu *vcpu)
 
 static void kvm_pit_ack_irq(struct kvm_irq_ack_notifier *kian)
 {
-   struct kvm_kpit_state *ps = container_of(kian, struct kvm_kpit_state,
-irq_ack_notifier);
+   struct kvm_kpit_state *ps = kian->priv;
+
raw_spin_lock(&ps->inject_lock);
if (atomic_dec_return(&ps->pit_timer.pending) < 0)
atomic_inc(&ps->pit_timer.pending);
@@ -636,6 +636,7 @@ struct kvm_pit *kvm_create_pit(struct kvm *kvm, u32 flags)
 CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
pit_state->irq_ack_notifier.gsi = 0;
pit_state->irq_ack_notifier.irq_acked = kvm_pit_ack_irq;
+   pit_state->irq_ack_notifier.priv = &pit->pit_state;
kvm_register_irq_ack_notifier(kvm, &pit_state->irq_ack_notifier);
pit_state->pit_timer.reinject = true;
mutex_unlock(&pit->pit_state.lock);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ce027d5..6c3fb06 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -404,6 +404,7 @@ struct kvm_irq_ack_notifier {
struct hlist_node link;
unsigned gsi;
void (*irq_acked)(struct kvm_irq_ack_notifier *kian);
+   void *priv;
 };
 
 #define KVM_ASSIGNED_MSIX_PENDING  0x1
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 4d10b1e..779b749 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -121,8 +121,7 @@ static void kvm_assigned_dev_ack_irq(struct 
kvm_irq_ack_notifier *kian)
if (kian->gsi == -1)
return;
 
-   dev = container_of(kian, struct kvm_assigned_dev_kernel,
-  ack_notifier);
+   dev = kian->priv;
 
kvm_set_irq(dev->kvm, dev->irq_source_id, dev->guest_irq, 0);
 
@@ -563,6 +562,7 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
match->irq_source_id = -1;
match->kvm = kvm;
match->ack_notifier.irq_acked = kvm_assigned_dev_ack_irq;
+   match->ack_notifier.priv = match;
INIT_WORK(&match->interrupt_work,
  kvm_assigned_dev_interrupt_work_handler);
 
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index a0e8880..ebccea8 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -175,11 +175,14 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 
irq, int level)
return ret;
 }
 
+#define MAX_ACK_NOTIFIER_PER_GSI 4
+
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
struct kvm_irq_ack_notifier *kian;
+   struct kvm_irq_ack_notifier acks[MAX_ACK_NOTIFIER_PER_GSI];
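(The message is truncated in the archive.  Based on the approach described
above -- copy the matching entries under rcu_read_lock(), then invoke the
callbacks outside the read section -- the body presumably continued along
these lines; a sketch for illustration, not Marcelo's original text, with
the gsi lookup from (irqchip, pin) elided:

	struct hlist_node *n;
	int i = 0, count;

	rcu_read_lock();
	hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list, link)
		if (kian->gsi == gsi && i < MAX_ACK_NOTIFIER_PER_GSI)
			acks[i++] = *kian;	/* copy the entry, not a pointer */
	rcu_read_unlock();

	count = i;
	for (i = 0; i < count; i++)
		acks[i].irq_acked(&acks[i]);	/* callbacks run outside RCU */

As noted above, this is still unsafe: whatever "priv" points at can be
freed between rcu_read_unlock() and ->irq_acked().)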
 

KVM_SET_MP_STATE is undocumented

2010-04-22 Thread Pekka Enberg

Hi!

I noticed that QEMU uses KVM_SET_MP_STATE but the ioctl() is completely 
undocumented.  I assume it has something to do with multiprocessor but I 
am unable to work out the details unless I take a peek at arch/x86/kvm.
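For reference, the basic call shape, reconstructed from the kvm headers
(a sketch; error handling omitted, and the vcpu fd comes from
KVM_CREATE_VCPU):

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* force a vcpu into the runnable state */
    static int set_runnable(int vcpu_fd)
    {
        struct kvm_mp_state mp = { .mp_state = KVM_MP_STATE_RUNNABLE };
        return ioctl(vcpu_fd, KVM_SET_MP_STATE, &mp);
    }

The other states (KVM_MP_STATE_UNINITIALIZED, _INIT_RECEIVED, _HALTED,
_SIPI_RECEIVED) track the x86 SMP bring-up sequence, so multiprocessor
is indeed the right guess.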


Pekka


[PATCH] Enable pvclock flags in vcpu_time_info structure

2010-04-22 Thread Glauber Costa
This patch removes one padding byte and transforms it into a flags
field. New versions of guests using pvclock will query these flags
upon each read.

Flags, however, will only be interpreted when the guest decides to.
It uses the pvclock_valid_flags function to signal that a specific
set of flags should be taken into consideration. Which flags are valid
is usually established via HV negotiation.

Signed-off-by: Glauber Costa 
---
 arch/x86/include/asm/pvclock-abi.h |3 ++-
 arch/x86/include/asm/pvclock.h |1 +
 arch/x86/kernel/pvclock.c  |9 +
 3 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/pvclock-abi.h 
b/arch/x86/include/asm/pvclock-abi.h
index 6d93508..ec5c41a 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -29,7 +29,8 @@ struct pvclock_vcpu_time_info {
u64   system_time;
u32   tsc_to_system_mul;
s8tsc_shift;
-   u8pad[3];
+   u8flags;
+   u8pad[2];
 } __attribute__((__packed__)); /* 32 bytes */
 
 struct pvclock_wall_clock {
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 53235fd..c50823f 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -6,6 +6,7 @@
 
 /* some helper functions for xen and kvm pv clock sources */
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+void pvclock_valid_flags(u8 flags);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
 void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
struct pvclock_vcpu_time_info *vcpu,
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 03801f2..8f4af7b 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -31,8 +31,16 @@ struct pvclock_shadow_time {
u32 tsc_to_nsec_mul;
int tsc_shift;
u32 version;
+   u8  flags;
 };
 
+static u8 valid_flags = 0;
+
+void pvclock_valid_flags(u8 flags)
+{
+   valid_flags = flags;
+}
+
 /*
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
  * yielding a 64-bit result.
@@ -91,6 +99,7 @@ static unsigned pvclock_get_time_values(struct 
pvclock_shadow_time *dst,
dst->system_timestamp  = src->system_time;
dst->tsc_to_nsec_mul   = src->tsc_to_system_mul;
dst->tsc_shift = src->tsc_shift;
+   dst->flags = src->flags;
rmb();  /* test version after fetching data */
} while ((src->version & 1) || (dst->version != src->version));
 
-- 
1.6.2.2
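A minimal sketch of the guest-side usage this enables (the flag bit is
made up for illustration; real bit assignments come out of the HV
negotiation mentioned in the changelog):

    /* hypothetical flag, for illustration only */
    #define PVCLOCK_EXAMPLE_BIT	(1 << 0)

    static void example_pvclock_setup(void)
    {
        /* the guest has negotiated that it understands bit 0, so
         * pvclock readers may interpret it from now on */
        pvclock_valid_flags(PVCLOCK_EXAMPLE_BIT);
    }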



Re: [PATCH v4] Add mergeable RX bufs support to vhost

2010-04-22 Thread David Stevens
"Michael S. Tsirkin"  wrote on 04/22/2010 05:02:25 AM:

> On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:
> > This patch adds the mergeable RX buffers feature to vhost.
> > 
> > Signed-off-by: David L Stevens 
> 
> Looks pretty clean to me.
> Could you send a checkpatch-clean version please?

The original passes checkpatch already, but I guess I must
still be getting whitespace mangling if it didn't for you. (sigh)
Here it is as an attachment:



> We should also check performance implications.
> Do you have any data?

I'm getting on the order of 10-20% improvement in
throughput over stock vhost guest to host, but I do see
a lot of variability in the results, even with no KVM
and just over loopback. I don't know where that's coming
from, but I'll do some runs and post.

Thanks for all the reviews!

+-DLS



MRXBv4.patch
Description: Binary data


Re: [PATCH v4] Add mergeable RX bufs support to vhost

2010-04-22 Thread Michael S. Tsirkin
On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:
> This patch adds the mergeable RX buffers feature to vhost.
> 
> Signed-off-by: David L Stevens 

BTW, which userspace should I use for testing this?


Re: [PATCH v4] Add mergeable RX bufs support to vhost

2010-04-22 Thread Michael S. Tsirkin
On Thu, Apr 22, 2010 at 11:47:15AM -0600, David Stevens wrote:
> "Michael S. Tsirkin"  wrote on 04/22/2010 05:02:25 AM:
> 
> > On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:
> > > This patch adds the mergeable RX buffers feature to vhost.
> > > 
> > > Signed-off-by: David L Stevens 
> > 
> > Looks pretty clean to me.
> > Could you send a checkpatch-clean version please?
> 
> The original passes checkpatch already, but I guess I must
> still be getting whitespace mangling if it didn't for you. (sigh)
> Here it is as an attachment:

No good either. I thought you managed to run sendmail somewhere?
Failing that, put it on dropbox and let me know.
See http://kbase.redhat.com/faq/docs/DOC-2113

-- 
MST


Re: [PATCH v4] Add mergeable RX bufs support to vhost

2010-04-22 Thread David Stevens
"Michael S. Tsirkin"  wrote on 04/22/2010 10:43:49 AM:

> On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:
> > This patch adds the mergeable RX buffers feature to vhost.
> > 
> > Signed-off-by: David L Stevens 
> 
> BTW, which userspace should I use for testing this?

I use the qemu-kvm patch to your git tree that I posted
a while back (also attached). And it needs your TUNSETVNETHDRSIZE
ioctl in the kernel as well. Also, I added the TUN ioctl
defines to /usr/include for this to pick up (it compiles, but
won't do the ioctl's without that); will get back
to that patch per your comments next.

[also a correction: I said 10-20% guest to host, I meant host-to-guest
(ie, the receiver side). g2h appears improved too, but not as
much.]
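For anyone following along, the header-size knob mentioned above works
roughly like this (a sketch against the proposed ioctl, which takes a
plain int; 12 bytes is sizeof(struct virtio_net_hdr_mrg_rxbuf), i.e.
the standard header plus the __u16 num_buffers field used by mergeable
buffers):

    #include <sys/ioctl.h>
    #include <linux/if_tun.h>

    static int enable_mrg_rxbuf_hdr(int tap_fd)
    {
        int len = 12;	/* virtio_net_hdr (10) + num_buffers (2) */
        return ioctl(tap_fd, TUNSETVNETHDRSIZE, &len);
    }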

+-DLS


qemu-MRXB-2.patch
Description: Binary data


Re: [PATCH v4] Add mergeable RX bufs support to vhost

2010-04-22 Thread Michael S. Tsirkin
On Thu, Apr 22, 2010 at 11:59:55AM -0600, David Stevens wrote:
> "Michael S. Tsirkin"  wrote on 04/22/2010 10:43:49 AM:
> 
> > On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:
> > > This patch adds the mergeable RX buffers feature to vhost.
> > > 
> > > Signed-off-by: David L Stevens 
> > 
> > BTW, which userspace should I use for testing this?
> 
> I use the qemu-kvm patch to your git tree that I posted
> a while back (also attached).

Doesn't apply, probably also corrupted.
Could you put it up on dropbox please?

> And it needs your TUNSETVNETHDRSIZE
> ioctl in the kernel as well. Also, I added the TUN ioctl
> defines to /usr/include for this to pick up (it compiles, but
> won't do the ioctl's without that); will get back
> to that patch per your comments next.
> 
> [also a correction: I said 10-20% guest to host, I meant host-to-guest
> (ie, the receiver side). g2h appears improved too, but not as
> much.]
> 
> +-DLS




Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section

2010-04-22 Thread Gleb Natapov
On Thu, Apr 22, 2010 at 01:40:38PM -0300, Marcelo Tosatti wrote:
> On Wed, Apr 21, 2010 at 09:38:39PM +0300, Gleb Natapov wrote:
> > On Wed, Apr 21, 2010 at 03:29:11PM -0300, Marcelo Tosatti wrote:
> > > On Wed, Apr 21, 2010 at 08:58:48PM +0300, Gleb Natapov wrote:
> > > > On Wed, Apr 21, 2010 at 02:37:34PM -0300, Marcelo Tosatti wrote:
> > > > > On Wed, Apr 21, 2010 at 08:12:27PM +0300, Gleb Natapov wrote:
> > > > > > On Wed, Apr 21, 2010 at 12:58:41PM -0300, Marcelo Tosatti wrote:
> > > > > > > > Or could we make kvm_set_irq() atomic? Though the code path is 
> > > > > > > > a little long 
> > > > > > > > for spinlock.
> > > > > > > 
> > > > > > > Yes, given the sleep-inside-RCU-protected section bug from
> > > > > > > kvm_notify_acked_irq, either that or convert IRQ locking to SRCU.
> > > > > > > 
> > > > > > > But as you said, the code paths are long and potentially slow, so
> > > > > > > probably SRCU is a better alternative.
> > > > > > > 
> > > > > > > Gleb?
> > > > > > kvm_set_irq() was converted to rcu from mutex to make msix interrupt
> > > > > > injection scalable.
> > > > > 
> > > > > We meant ioapic lock. See the last report from Ralf on this thread. 
> > > > Can we solve the problem by calling ack notifier outside rcu read
> > > > section in kvm_notify_acked_irq()?
> > > 
> > > The unregister path does
> > > 
> > > - remove_from_list(entry)
> > > - synchronize_rcu
> > > - kfree(entry)
> > > 
> > > So if kvm_notify_acked_irq sleeps, synchronize_rcu can succeed, and the
> > > notifier entry can be freed.
> > What I mean is kvm_notify_acked_irq() will iterate over all ack entries
> > in rcu read protected section, but instead of calling callback it will
> > collect them into the array and call them outside rcu read section. At
> > this point it doesn't matter if entry is unregistered since the copy is
> > used to actually call the notifier.
> 
> Here it is, but no, this trick can't be done safely for ack notifiers
> because they are unregistered at runtime by device assignment.
> 
> How does the freeing path know it can proceed to free its entry if
> there is no reliable way to know whether there is still a reference to it?
> (think "priv" gets freed between rcu_read_unlock and ->irq_acked with
> the patch below).
> 
Yeah, I see :(

> I can convert to SRCU if you have no objections.
> 
AFAIR there was a disadvantage compared to RCU, but I don't remember what
it was exactly. What about converting PIC/IOAPIC mutexes into spinlocks?

--
Gleb.


Re: [RFC PATCH 01/20] Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of bit-based phys_ram_dirty.

2010-04-22 Thread Anthony Liguori

Hi,

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Replaces the byte-based phys_ram_dirty bitmap with four (MASTER, VGA,
CODE, MIGRATION) bit-based phys_ram_dirty bitmaps.  On allocation, it
sets all bits in the bitmap.  It uses ffs() to convert DIRTY_FLAG to
DIRTY_IDX.

Modifies wrapper functions for the byte-based phys_ram_dirty bitmap to
the bit-based phys_ram_dirty bitmap.  MASTER works as a buffer, and upon
get_dirty() or get_dirty_flags(), it calls
cpu_physical_memory_sync_master() to update VGA and MIGRATION.
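The buffering idea in miniature (an illustrative sketch of the scheme,
using the helpers from the patch below):

    /* write fault path: one store into MASTER instead of one per client */
    phys_ram_dirty[MASTER_DIRTY_IDX][index] |= mask;

    /* read path (get_dirty/get_dirty_flags): fold MASTER into the VGA
     * and MIGRATION bitmaps, then clear it */
    cpu_physical_memory_sync_master(index);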
   


Why use an additional bitmap for MASTER instead of just updating the 
VGA, CODE, and MIGRATION bitmaps together?


Regards,

Anthony Liguori


Replaces direct phys_ram_dirty access with wrapper functions to
prevent direct access to the phys_ram_dirty bitmap.

Signed-off-by: Yoshiaki Tamura
Signed-off-by: OHMURA Kei
---
  cpu-all.h |  130 +
  exec.c|   60 ++--
  2 files changed, 152 insertions(+), 38 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index 51effc0..3f8762d 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -37,6 +37,9 @@

  #include "softfloat.h"

+/* to use ffs in flag_to_idx() */
+#include
+
  #if defined(HOST_WORDS_BIGENDIAN) != defined(TARGET_WORDS_BIGENDIAN)
  #define BSWAP_NEEDED
  #endif
@@ -846,7 +849,6 @@ int cpu_str_to_log_mask(const char *str);
  /* memory API */

  extern int phys_ram_fd;
-extern uint8_t *phys_ram_dirty;
  extern ram_addr_t ram_size;
  extern ram_addr_t last_ram_offset;
  extern uint8_t *bios_mem;
@@ -869,28 +871,140 @@ extern uint8_t *bios_mem;
  /* Set if TLB entry is an IO callback.  */
  #define TLB_MMIO(1<<  5)

+/* Use DIRTY_IDX as indexes of bit-based phys_ram_dirty. */
+#define MASTER_DIRTY_IDX0
+#define VGA_DIRTY_IDX   1
+#define CODE_DIRTY_IDX  2
+#define MIGRATION_DIRTY_IDX 3
+#define NUM_DIRTY_IDX   4
+
+#define MASTER_DIRTY_FLAG(1<<  MASTER_DIRTY_IDX)
+#define VGA_DIRTY_FLAG   (1<<  VGA_DIRTY_IDX)
+#define CODE_DIRTY_FLAG  (1<<  CODE_DIRTY_IDX)
+#define MIGRATION_DIRTY_FLAG (1<<  MIGRATION_DIRTY_IDX)
+
+extern unsigned long *phys_ram_dirty[NUM_DIRTY_IDX];
+
+static inline int dirty_flag_to_idx(int flag)
+{
+return ffs(flag) - 1;
+}
+
+static inline int dirty_idx_to_flag(int idx)
+{
+return 1<<  idx;
+}
+
  int cpu_memory_rw_debug(CPUState *env, target_ulong addr,
  uint8_t *buf, int len, int is_write);

-#define VGA_DIRTY_FLAG   0x01
-#define CODE_DIRTY_FLAG  0x02
-#define MIGRATION_DIRTY_FLAG 0x08
-
  /* read dirty bit (return 0 or 1) */
  static inline int cpu_physical_memory_is_dirty(ram_addr_t addr)
  {
-return phys_ram_dirty[addr>>  TARGET_PAGE_BITS] == 0xff;
+unsigned long mask;
+ram_addr_t index = (addr>>  TARGET_PAGE_BITS) / HOST_LONG_BITS;
+int offset = (addr>>  TARGET_PAGE_BITS)&  (HOST_LONG_BITS - 1);
+
+mask = 1UL<<  offset;
+return (phys_ram_dirty[MASTER_DIRTY_IDX][index]&  mask) == mask;
+}
+
+static inline void cpu_physical_memory_sync_master(ram_addr_t index)
+{
+if (phys_ram_dirty[MASTER_DIRTY_IDX][index]) {
+phys_ram_dirty[VGA_DIRTY_IDX][index]
+|=  phys_ram_dirty[MASTER_DIRTY_IDX][index];
+phys_ram_dirty[MIGRATION_DIRTY_IDX][index]
+|=  phys_ram_dirty[MASTER_DIRTY_IDX][index];
+phys_ram_dirty[MASTER_DIRTY_IDX][index] = 0UL;
+}
+}
+
+static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
+{
+unsigned long mask;
+ram_addr_t index = (addr>>  TARGET_PAGE_BITS) / HOST_LONG_BITS;
+int offset = (addr>>  TARGET_PAGE_BITS)&  (HOST_LONG_BITS - 1);
+int ret = 0, i;
+
+mask = 1UL<<  offset;
+cpu_physical_memory_sync_master(index);
+
+for (i = VGA_DIRTY_IDX; i<= MIGRATION_DIRTY_IDX; i++) {
+if (phys_ram_dirty[i][index]&  mask) {
+ret |= dirty_idx_to_flag(i);
+}
+}
+
+return ret;
+}
+
+static inline int cpu_physical_memory_get_dirty_idx(ram_addr_t addr,
+int dirty_idx)
+{
+unsigned long mask;
+ram_addr_t index = (addr>>  TARGET_PAGE_BITS) / HOST_LONG_BITS;
+int offset = (addr>>  TARGET_PAGE_BITS)&  (HOST_LONG_BITS - 1);
+
+mask = 1UL<<  offset;
+cpu_physical_memory_sync_master(index);
+return (phys_ram_dirty[dirty_idx][index]&  mask) == mask;
  }

  static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
  int dirty_flags)
  {
-return phys_ram_dirty[addr>>  TARGET_PAGE_BITS]&  dirty_flags;
+return cpu_physical_memory_get_dirty_idx(addr,
+ dirty_flag_to_idx(dirty_flags));
  }

  static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
  {
-phys_ram_dirty[addr>>  TARGET_PAGE_BITS] = 0xff;
+unsigned long mask;
+ram_addr_t index = (addr>>  TARGET_PAGE_BITS) / HOST_LONG_BITS;
+int offset = (addr>>  TARGET_PAGE_BITS)&  (HOST_LONG_
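(Truncated in the archive.  Following the pattern of the helpers above,
the function presumably continues along these lines -- a reconstruction
for illustration, not the original patch text:

    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);

    mask = 1UL << offset;
    phys_ram_dirty[MASTER_DIRTY_IDX][index] |= mask;	/* buffer in MASTER */
})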

Re: [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile.

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

To keep this foolproof, qemu_put_vector_prepare should be called
before qemu_put_vector.  If any qemu_put_* function other than
qemu_put_vector is called after qemu_put_vector_prepare, the program
will abort().
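In other words, the only legal sequence is (illustrative):

    qemu_put_vector_prepare(f);   /* flushes the linear buffer, arms the flag */
    qemu_put_vector(f, qiov);     /* must be the very next write on f */

and any intervening qemu_put_byte()/qemu_put_buffer() trips the abort()
guard.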

Signed-off-by: Yoshiaki Tamura
   


I don't get it.  What's this protecting against?

Regards,

Anthony Liguori


---
  hw/hw.h  |2 ++
  savevm.c |   53 +
  2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index 921cf90..10e6dda 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -77,6 +77,8 @@ void qemu_fflush(QEMUFile *f);
  int qemu_fclose(QEMUFile *f);
  void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
  void qemu_put_byte(QEMUFile *f, int v);
+void qemu_put_vector(QEMUFile *f, QEMUIOVector *qiov);
+void qemu_put_vector_prepare(QEMUFile *f);
  void *qemu_realloc_buffer(QEMUFile *f, int size);
  void qemu_clear_buffer(QEMUFile *f);

diff --git a/savevm.c b/savevm.c
index 944e788..22d928c 100644
--- a/savevm.c
+++ b/savevm.c
@@ -180,6 +180,7 @@ struct QEMUFile {
  uint8_t *buf;

  int has_error;
+int prepares_vector;
  };

  typedef struct QEMUFileStdio
@@ -557,6 +558,58 @@ void qemu_put_byte(QEMUFile *f, int v)
  qemu_fflush(f);
  }

+void qemu_put_vector(QEMUFile *f, QEMUIOVector *v)
+{
+struct iovec *iov;
+int cnt;
+size_t bufsize;
+uint8_t *buf;
+
+if (qemu_file_get_rate_limit(f) != 0) {
+fprintf(stderr,
+"Attempted to write vector while bandwidth limit is not zero.\n");
+abort();
+}
+
+/* check the prepares_vector flag.
+ * To keep this foolproof, qemu_put_vector_prepare should be called
+ * before qemu_put_vector.  If any qemu_put_* function other than
+ * qemu_put_vector is called after qemu_put_vector_prepare, the
+ * program will abort().
+ */
+if (!f->prepares_vector) {
+fprintf(stderr,
+"You should prepare with qemu_put_vector_prepare.\n");
+abort();
+} else if (f->prepares_vector && f->buf_index != 0) {
+fprintf(stderr, "Wrote data after qemu_put_vector_prepare.\n");
+abort();
+}
+f->prepares_vector = 0;
+
+if (f->put_vector) {
+qemu_iovec_to_vector(v,&iov,&cnt);
+f->put_vector(f->opaque, iov, 0, cnt);
+} else {
+qemu_iovec_to_size(v,&bufsize);
+buf = qemu_malloc(bufsize + 1 /* for '\0' */);
+qemu_iovec_to_buffer(v, buf);
+qemu_put_buffer(f, buf, bufsize);
+qemu_free(buf);
+}
+
+}
+
+void qemu_put_vector_prepare(QEMUFile *f)
+{
+if (f->prepares_vector) {
+/* prepare vector */
+fprintf(stderr, "Attempted to prepare vector twice\n");
+abort();
+}
+f->prepares_vector = 1;
+qemu_fflush(f);
+}
+
  int qemu_get_buffer(QEMUFile *f, uint8_t *buf, int size1)
  {
  int size, l;
   




Re: [RFC PATCH 08/20] Introduce RAMSaveIO and use cpu_physical_memory_get_dirty_range() to check multiple dirty pages.

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Introduce RAMSaveIO to use writev for saving ram blocks, and modify
ram_save_block() and ram_save_remaining() to use
cpu_physical_memory_get_dirty_range() to check multiple dirty and
non-dirty pages at once.
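The win is scanning the dirty bitmap a machine word at a time rather
than page by page; the inner loop of such a scan looks roughly like this
(an illustrative sketch, not the patch's exact helper; bitmap, base and
handle_dirty_page are stand-ins):

    unsigned long word = bitmap[index];

    while (word) {
        int bit = ffsl(word) - 1;	/* lowest set bit = next dirty page */
        handle_dirty_page(base + ((ram_addr_t)bit << TARGET_PAGE_BITS));
        word &= word - 1;		/* clear that bit and continue */
    }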

Signed-off-by: Yoshiaki Tamura
Signed-off-by: OHMURA Kei
   


Perf data?

Regards,

Anthony Liguori


---
  vl.c |  221 ++---
  1 files changed, 197 insertions(+), 24 deletions(-)

diff --git a/vl.c b/vl.c
index 729c955..9c3dc4c 100644
--- a/vl.c
+++ b/vl.c
@@ -2774,12 +2774,167 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
  return 1;
  }

-static int ram_save_block(QEMUFile *f)
+typedef struct RAMSaveIO RAMSaveIO;
+
+struct RAMSaveIO {
+QEMUFile *f;
+QEMUIOVector *qiov;
+
+uint8_t *ram_store;
+size_t nalloc, nused;
+uint8_t io_mode;
+
+void (*put_buffer)(RAMSaveIO *s, uint8_t *buf, size_t len);
+void (*put_byte)(RAMSaveIO *s, int v);
+void (*put_be64)(RAMSaveIO *s, uint64_t v);
+
+};
+
+static inline void ram_saveio_flush(RAMSaveIO *s, int prepare)
+{
+qemu_put_vector(s->f, s->qiov);
+if (prepare)
+qemu_put_vector_prepare(s->f);
+
+/* reset stored data */
+qemu_iovec_reset(s->qiov);
+s->nused = 0;
+}
+
+static inline void ram_saveio_put_buffer(RAMSaveIO *s, uint8_t *buf, size_t 
len)
+{
+s->put_buffer(s, buf, len);
+}
+
+static inline void ram_saveio_put_byte(RAMSaveIO *s, int v)
+{
+s->put_byte(s, v);
+}
+
+static inline void ram_saveio_put_be64(RAMSaveIO *s, uint64_t v)
+{
+s->put_be64(s, v);
+}
+
+static inline void ram_saveio_set_error(RAMSaveIO *s)
+{
+qemu_file_set_error(s->f);
+}
+
+static void ram_saveio_put_buffer_vector(RAMSaveIO *s, uint8_t *buf, size_t 
len)
+{
+qemu_iovec_add(s->qiov, buf, len);
+}
+
+static void ram_saveio_put_buffer_direct(RAMSaveIO *s, uint8_t *buf, size_t 
len)
+{
+qemu_put_buffer(s->f, buf, len);
+}
+
+static void ram_saveio_put_byte_vector(RAMSaveIO *s, int v)
+{
+uint8_t *to_save;
+
+if (s->nalloc - s->nused<  sizeof(int))
+ram_saveio_flush(s, 1);
+
+to_save =&s->ram_store[s->nused];
+to_save[0] = v&  0xff;
+s->nused++;
+
+qemu_iovec_add(s->qiov, to_save, 1);
+}
+
+static void ram_saveio_put_byte_direct(RAMSaveIO *s, int v)
+{
+qemu_put_byte(s->f, v);
+}
+
+static void ram_saveio_put_be64_vector(RAMSaveIO *s, uint64_t v)
+{
+uint8_t *to_save;
+
+if (s->nalloc - s->nused<  sizeof(uint64_t))
+ram_saveio_flush(s, 1);
+
+to_save =&s->ram_store[s->nused];
+to_save[0] = (v>>  56)&  0xff;
+to_save[1] = (v>>  48)&  0xff;
+to_save[2] = (v>>  40)&  0xff;
+to_save[3] = (v>>  32)&  0xff;
+to_save[4] = (v>>  24)&  0xff;
+to_save[5] = (v>>  16)&  0xff;
+to_save[6] = (v>>   8)&  0xff;
+to_save[7] = (v>>   0)&  0xff;
+s->nused += sizeof(uint64_t);
+
+qemu_iovec_add(s->qiov, to_save, sizeof(uint64_t));
+}
+
+static void ram_saveio_put_be64_direct(RAMSaveIO *s, uint64_t v)
+{
+
+qemu_put_be64(s->f, v);
+}
+
+static RAMSaveIO *ram_saveio_new(QEMUFile *f, size_t max_store)
+{
+RAMSaveIO *s;
+
+s = qemu_mallocz(sizeof(*s));
+
+if (qemu_file_get_rate_limit(f) == 0) {/* non buffer mode */
+/* When a QEMUFile doesn't have a rate limit handler,
+ * qemu_file_get_rate_limit will return 0.
+ * However, we believe that all kinds of QEMUFile
+ * except non-block mode have a rate limit function.
+ */
+s->io_mode = 1;
+s->ram_store = qemu_mallocz(max_store);
+s->nalloc = max_store;
+s->nused = 0;
+
+s->qiov = qemu_mallocz(sizeof(*s->qiov));
+qemu_iovec_init(s->qiov, max_store);
+
+s->put_buffer = ram_saveio_put_buffer_vector;
+s->put_byte = ram_saveio_put_byte_vector;
+s->put_be64 = ram_saveio_put_be64_vector;
+
+qemu_put_vector_prepare(f);
+} else {
+s->io_mode = 0;
+s->put_buffer = ram_saveio_put_buffer_direct;
+s->put_byte = ram_saveio_put_byte_direct;
+s->put_be64 = ram_saveio_put_be64_direct;
+}
+
+s->f = f;
+
+return s;
+}
+
+static void ram_saveio_destroy(RAMSaveIO *s)
+{
+if (s->qiov != NULL) { /* means using put_vector */
+ram_saveio_flush(s, 0);
+qemu_iovec_destroy(s->qiov);
+qemu_free(s->qiov);
+qemu_free(s->ram_store);
+}
+qemu_free(s);
+}
+
+/*
+ * RAMSaveIO will manage I/O.
+ */
+static int ram_save_block(RAMSaveIO *s)
  {
  static ram_addr_t current_addr = 0;
  ram_addr_t saved_addr = current_addr;
  ram_addr_t addr = 0;
-int found = 0;
+ram_addr_t dirty_rams[HOST_LONG_BITS];
+int i, found = 0;

  while (addr<  last_ram_offset) {
  if (kvm_enabled()&&  current_addr == 0) {
@@ -2787,32 +2942,38 @@ static int ram_save_block(QEMUFile *f)
  r = kvm_update_dirty_pages_log();
  if (r) {
  fprintf(stder

Re: [PATCH 6/10] KVM MMU: don't write-protect if have new mapping to unsync page

2010-04-22 Thread Marcelo Tosatti
On Thu, Apr 22, 2010 at 02:13:04PM +0800, Xiao Guangrong wrote:
> If we have a new mapping to the unsync page (i.e., adding a new parent), just
> update the page from sp->gfn but do not write-protect gfn, and if we need to
> create a new shadow page from sp->gfn, we should sync it
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c |   27 +++
>  1 files changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index fd027a6..8607a64 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1196,16 +1196,20 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, 
> struct kvm_mmu_page *sp)
>  
>  static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  
> -static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> +static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> +  bool clear_unsync)
>  {
>   if (sp->role.cr4_pae != !!is_pae(vcpu)) {
>   kvm_mmu_zap_page(vcpu->kvm, sp);
>   return 1;
>   }
>  
> - if (rmap_write_protect(vcpu->kvm, sp->gfn))
> - kvm_flush_remote_tlbs(vcpu->kvm);
> - kvm_unlink_unsync_page(vcpu->kvm, sp);
> + if (clear_unsync) {
> + if (rmap_write_protect(vcpu->kvm, sp->gfn))
> + kvm_flush_remote_tlbs(vcpu->kvm);
> + kvm_unlink_unsync_page(vcpu->kvm, sp);
> + }
> +
>   if (vcpu->arch.mmu.sync_page(vcpu, sp)) {
>   kvm_mmu_zap_page(vcpu->kvm, sp);
>   return 1;
> @@ -1293,7 +1297,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
>   kvm_flush_remote_tlbs(vcpu->kvm);
>  
>   for_each_sp(pages, sp, parents, i) {
> - kvm_sync_page(vcpu, sp);
> + kvm_sync_page(vcpu, sp, true);
>   mmu_pages_clear_parents(&parents);
>   }
>   cond_resched_lock(&vcpu->kvm->mmu_lock);
> @@ -1313,7 +1317,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
> kvm_vcpu *vcpu,
>   unsigned index;
>   unsigned quadrant;
>   struct hlist_head *bucket;
> - struct kvm_mmu_page *sp;
> + struct kvm_mmu_page *sp, *unsync_sp = NULL;
>   struct hlist_node *node, *tmp;
>  
>   role = vcpu->arch.mmu.base_role;
> @@ -1332,12 +1336,16 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
> kvm_vcpu *vcpu,
>   hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
>   if (sp->gfn == gfn) {
>   if (sp->unsync)
> - if (kvm_sync_page(vcpu, sp))
> - continue;
> + unsync_sp = sp;

Xiao,

I don't see a reason why you can't create a new mapping to an unsync
page. The code already creates shadow pte entries using unsync
pagetables.

So all you need would be to kvm_sync_pages before write protecting.

Also make sure kvm_sync_pages is in place here before enabling multiple
unsync shadows, in the patch series.

>  
>   if (sp->role.word != role.word)
>   continue;
>  
> + if (unsync_sp && kvm_sync_page(vcpu, unsync_sp, false)) 
> {
> + unsync_sp = NULL;
> + continue;
> + }
> +
>   mmu_page_add_parent_pte(vcpu, sp, parent_pte);
>   if (sp->unsync_children) {
>   set_bit(KVM_REQ_MMU_SYNC, &vcpu->requests);
> @@ -1346,6 +1354,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
> kvm_vcpu *vcpu,
>   trace_kvm_mmu_get_page(sp, false);
>   return sp;
>   }
> + if (unsync_sp)
> + kvm_sync_page(vcpu, unsync_sp, true);
> +
>   ++vcpu->kvm->stat.mmu_cache_miss;
>   sp = kvm_mmu_alloc_page(vcpu, parent_pte);
>   if (!sp)
> -- 
> 1.6.1.2
> 


Re: [RFC PATCH 10/20] Introduce skip_header parameter to qemu_loadvm_state() so that it can be called iteratively without reading the header.

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Signed-off-by: Yoshiaki Tamura
   


I think the more appropriate thing to do is have 
qemu_savevm_state_complete() not write QEMU_VM_EOF when doing a 
continuous live migration.


You would then want qemu_loadvm_state() to detect real EOF and treat 
that the same as QEMU_VM_EOF (provided it occurred at a section boundary).


Of course, this should be a flag to qemu_loadvm_state() as it's not 
always the right behavior.


Regards,

Anthony Liguori


---
  migration-exec.c |2 +-
  migration-fd.c   |2 +-
  migration-tcp.c  |2 +-
  migration-unix.c |2 +-
  savevm.c |   25 ++---
  sysemu.h |2 +-
  6 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/migration-exec.c b/migration-exec.c
index 3edc026..5839a6d 100644
--- a/migration-exec.c
+++ b/migration-exec.c
@@ -113,7 +113,7 @@ static void exec_accept_incoming_migration(void *opaque)
  QEMUFile *f = opaque;
  int ret;

-ret = qemu_loadvm_state(f);
+ret = qemu_loadvm_state(f, 0);
  if (ret<  0) {
  fprintf(stderr, "load of migration failed\n");
  goto err;
diff --git a/migration-fd.c b/migration-fd.c
index 0cc74ad..0e97ed0 100644
--- a/migration-fd.c
+++ b/migration-fd.c
@@ -106,7 +106,7 @@ static void fd_accept_incoming_migration(void *opaque)
  QEMUFile *f = opaque;
  int ret;

-ret = qemu_loadvm_state(f);
+ret = qemu_loadvm_state(f, 0);
  if (ret<  0) {
  fprintf(stderr, "load of migration failed\n");
  goto err;
diff --git a/migration-tcp.c b/migration-tcp.c
index 56e1a3b..94a1a03 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -182,7 +182,7 @@ static void tcp_accept_incoming_migration(void *opaque)
  goto out;
  }

-ret = qemu_loadvm_state(f);
+ret = qemu_loadvm_state(f, 0);
  if (ret<  0) {
  fprintf(stderr, "load of migration failed\n");
  goto out_fopen;
diff --git a/migration-unix.c b/migration-unix.c
index b7aab38..dd99a73 100644
--- a/migration-unix.c
+++ b/migration-unix.c
@@ -168,7 +168,7 @@ static void unix_accept_incoming_migration(void *opaque)
  goto out;
  }

-ret = qemu_loadvm_state(f);
+ret = qemu_loadvm_state(f, 0);
  if (ret<  0) {
  fprintf(stderr, "load of migration failed\n");
  goto out_fopen;
diff --git a/savevm.c b/savevm.c
index 22d928c..a401b27 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1554,7 +1554,7 @@ typedef struct LoadStateEntry {
  int version_id;
  } LoadStateEntry;

-int qemu_loadvm_state(QEMUFile *f)
+int qemu_loadvm_state(QEMUFile *f, int skip_header)
  {
  QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
  QLIST_HEAD_INITIALIZER(loadvm_handlers);
@@ -1563,17 +1563,20 @@ int qemu_loadvm_state(QEMUFile *f)
  unsigned int v;
  int ret;

-v = qemu_get_be32(f);
-if (v != QEMU_VM_FILE_MAGIC)
-return -EINVAL;
+if (!skip_header) {
+v = qemu_get_be32(f);
+if (v != QEMU_VM_FILE_MAGIC)
+return -EINVAL;
+
+v = qemu_get_be32(f);
+if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+fprintf(stderr, "SaveVM v3 format is obsolete and don't work anymore\n");
+return -ENOTSUP;
+}
+if (v != QEMU_VM_FILE_VERSION)
+return -ENOTSUP;

-v = qemu_get_be32(f);
-if (v == QEMU_VM_FILE_VERSION_COMPAT) {
-fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
-return -ENOTSUP;
  }
-if (v != QEMU_VM_FILE_VERSION)
-return -ENOTSUP;

  while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
  uint32_t instance_id, version_id, section_id;
@@ -1898,7 +1901,7 @@ int load_vmstate(Monitor *mon, const char *name)
  monitor_printf(mon, "Could not open VM state file\n");
  return -EINVAL;
  }
-ret = qemu_loadvm_state(f);
+ret = qemu_loadvm_state(f, 0);
  qemu_fclose(f);
  if (ret<  0) {
  monitor_printf(mon, "Error %d while loading VM state\n", ret);
diff --git a/sysemu.h b/sysemu.h
index 647a468..6c1441f 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -68,7 +68,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int 
blk_enable,
  int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
  int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
  void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
-int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state(QEMUFile *f, int skip_header);

  void qemu_errors_to_file(FILE *fp);
  void qemu_errors_to_mon(Monitor *mon);
   




Re: [RFC PATCH 14/20] Upgrade QEMU_FILE_VERSION from 3 to 4, and introduce qemu_savevm_state_all().

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Make a 32bit entry after QEMU_VM_FILE_VERSION to recognize whether the
transferred data is QEMU_VM_FT_MODE or QEMU_VM_LIVE_MIGRATION_MODE.
   


I'd rather you encapsulate the current protocol as opposed to extending 
it with a new version.


You could also do this by just having a new -incoming option, right?  
All you really need is to indicate that this is for high frequency 
checkpointing, right?


Regards,

Anthony Liguori


Signed-off-by: Yoshiaki Tamura
---
  savevm.c |   76 -
  sysemu.h |1 +
  2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/savevm.c b/savevm.c
index 292ae32..19b3efb 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1402,8 +1402,10 @@ static void vmstate_save(QEMUFile *f, SaveStateEntry *se)
  }

  #define QEMU_VM_FILE_MAGIC   0x5145564d
-#define QEMU_VM_FILE_VERSION_COMPAT  0x0002
-#define QEMU_VM_FILE_VERSION 0x0003
+#define QEMU_VM_FILE_VERSION_COMPAT  0x0003
+#define QEMU_VM_FILE_VERSION 0x0004
+#define QEMU_VM_LIVE_MIGRATION_MODE  0x0005
+#define QEMU_VM_FT_MODE  0x0006

  #define QEMU_VM_EOF  0x00
  #define QEMU_VM_SECTION_START0x01
@@ -1425,6 +1427,12 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, 
int blk_enable,

  qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
  qemu_put_be32(f, QEMU_VM_FILE_VERSION);
+
+if (ft_mode) {
+qemu_put_be32(f, QEMU_VM_FT_MODE);
+} else {
+qemu_put_be32(f, QEMU_VM_LIVE_MIGRATION_MODE);
+}

  QTAILQ_FOREACH(se,&savevm_handlers, entry) {
  int len;
@@ -1533,6 +1541,66 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
  return 0;
  }

+int qemu_savevm_state_all(Monitor *mon, QEMUFile *f)
+{
+SaveStateEntry *se;
+
+QTAILQ_FOREACH(se,&savevm_handlers, entry) {
+int len;
+
+if (se->save_live_state == NULL)
+continue;
+
+/* Section type */
+qemu_put_byte(f, QEMU_VM_SECTION_START);
+qemu_put_be32(f, se->section_id);
+
+/* ID string */
+len = strlen(se->idstr);
+qemu_put_byte(f, len);
+qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+qemu_put_be32(f, se->instance_id);
+qemu_put_be32(f, se->version_id);
+if (ft_mode == FT_INIT) {
+/* This is workaround. */
+se->save_live_state(mon, f, QEMU_VM_SECTION_START, se->opaque);
+} else {
+se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
+}
+}
+
+ft_mode = FT_TRANSACTION;
+QTAILQ_FOREACH(se,&savevm_handlers, entry) {
+int len;
+
+   if (se->save_state == NULL&&  se->vmsd == NULL)
+   continue;
+
+/* Section type */
+qemu_put_byte(f, QEMU_VM_SECTION_FULL);
+qemu_put_be32(f, se->section_id);
+
+/* ID string */
+len = strlen(se->idstr);
+qemu_put_byte(f, len);
+qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+qemu_put_be32(f, se->instance_id);
+qemu_put_be32(f, se->version_id);
+
+vmstate_save(f, se);
+}
+
+qemu_put_byte(f, QEMU_VM_EOF);
+
+if (qemu_file_has_error(f))
+return -EIO;
+
+return 0;
+}
+
+
  void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f)
  {
  SaveStateEntry *se;
@@ -1617,6 +1685,10 @@ int qemu_loadvm_state(QEMUFile *f, int skip_header)
  if (v != QEMU_VM_FILE_VERSION)
  return -ENOTSUP;

+v = qemu_get_be32(f);
+if (v == QEMU_VM_FT_MODE) {
+ft_mode = FT_INIT;
+}
  }

  while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
diff --git a/sysemu.h b/sysemu.h
index 6c1441f..df314bb 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -67,6 +67,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int 
blk_enable,
  int shared);
  int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
  int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
+int qemu_savevm_state_all(Monitor *mon, QEMUFile *f);
  void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
  int qemu_loadvm_state(QEMUFile *f, int skip_header);

   




Re: [RFC PATCH 15/20] Introduce FT mode support to configure.

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Signed-off-by: Yoshiaki Tamura
   


No need for this.

Regards,

Anthony Liguori


---
  configure |8 
  1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/configure b/configure
index 046c591..f0682d4 100755
--- a/configure
+++ b/configure
@@ -298,6 +298,7 @@ bsd_user="no"
  guest_base=""
  uname_release=""
  io_thread="no"
+ft_mode="no"
  mixemu="no"
  kvm_trace="no"
  kvm_cap_pit=""
@@ -671,6 +672,8 @@ for opt do
;;
--enable-io-thread) io_thread="yes"
;;
+  --enable-ft-mode) ft_mode="yes"
+  ;;
--disable-blobs) blobs="no"
;;
--kerneldir=*) kerneldir="$optarg"
@@ -840,6 +843,7 @@ echo "  --enable-vde enable support for vde 
network"
  echo "  --disable-linux-aio  disable Linux AIO support"
  echo "  --enable-linux-aio   enable Linux AIO support"
  echo "  --enable-io-thread   enable IO thread"
+echo "  --enable-ft-mode enable FT mode support"
  echo "  --disable-blobs  disable installing provided firmware blobs"
  echo "  --kerneldir=PATH look for kernel includes in PATH"
  echo "  --with-kvm-trace enable building the KVM module with the kvm trace 
option"
@@ -2117,6 +2121,7 @@ echo "GUEST_BASE$guest_base"
  echo "PIE user targets  $user_pie"
  echo "vde support   $vde"
  echo "IO thread $io_thread"
+echo "FT mode support   $ft_mode"
  echo "Linux AIO support $linux_aio"
  echo "Install blobs $blobs"
  echo "KVM support   $kvm"
@@ -2318,6 +2323,9 @@ fi
  if test "$io_thread" = "yes" ; then
echo "CONFIG_IOTHREAD=y">>  $config_host_mak
  fi
+if test "$ft_mode" = "yes" ; then
+  echo "CONFIG_FT_MODE=y">>  $config_host_mak
+fi
  if test "$linux_aio" = "yes" ; then
echo "CONFIG_LINUX_AIO=y">>  $config_host_mak
  fi
   




Re: [RFC PATCH 19/20] Insert do_event_tap() to virtio-{blk, net}, comment out assert() on cpu_single_env temporally.

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

do_event_tap() is inserted into functions which actually fire outputs.
By synchronizing VMs before outputs are fired, we can fail over to the
receiver upon failure.  To save the VM continuously, comment out assert()
on cpu_single_env temporarily.

Signed-off-by: Yoshiaki Tamura
---
  hw/virtio-blk.c |2 ++
  hw/virtio-net.c |2 ++
  qemu-kvm.c  |7 ++-
  3 files changed, 10 insertions(+), 1 deletions(-)
   


This would be better done in the generic layers (the block and net 
layers respectively).  Then it would work with virtio and emulated devices.


Regards,

Anthony Liguori


diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index b80402d..1dd1c31 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -327,6 +327,8 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, 
VirtQueue *vq)
  .old_bs = NULL,
  };

+do_event_tap();
+
  while ((req = virtio_blk_get_request(s))) {
  virtio_blk_handle_request(req,&mrb);
  }
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 5c0093e..1a32bf3 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -667,6 +667,8 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, 
VirtQueue *vq)
  {
  VirtIONet *n = to_virtio_net(vdev);

+do_event_tap();
+
  if (n->tx_timer_active) {
  virtio_queue_set_notification(vq, 1);
  qemu_del_timer(n->tx_timer);
diff --git a/qemu-kvm.c b/qemu-kvm.c
index 1414f49..769bc95 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -935,8 +935,12 @@ int kvm_run(CPUState *env)

  post_kvm_run(kvm, env);

+/* TODO: we need to prevent tapping events that derive from the
+ * same VMEXIT. This needs more info from the kernel. */
  #if defined(KVM_CAP_COALESCED_MMIO)
  if (kvm_state->coalesced_mmio) {
+/* prevent from tapping events while handling coalesced_mmio */
+event_tap_suspend();
  struct kvm_coalesced_mmio_ring *ring =
  (void *) run + kvm_state->coalesced_mmio * PAGE_SIZE;
  while (ring->first != ring->last) {
@@ -946,6 +950,7 @@ int kvm_run(CPUState *env)
  smp_wmb();
  ring->first = (ring->first + 1) % KVM_COALESCED_MMIO_MAX;
  }
+event_tap_resume();
  }
  #endif

@@ -1770,7 +1775,7 @@ static void resume_all_threads(void)
  {
  CPUState *penv = first_cpu;

-assert(!cpu_single_env);
+/* assert(!cpu_single_env); */

  while (penv) {
  penv->stop = 0;
   




Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section

2010-04-22 Thread Marcelo Tosatti
On Thu, Apr 22, 2010 at 09:11:30PM +0300, Gleb Natapov wrote:
> On Thu, Apr 22, 2010 at 01:40:38PM -0300, Marcelo Tosatti wrote:
> > On Wed, Apr 21, 2010 at 09:38:39PM +0300, Gleb Natapov wrote:
> > > On Wed, Apr 21, 2010 at 03:29:11PM -0300, Marcelo Tosatti wrote:
> > > > On Wed, Apr 21, 2010 at 08:58:48PM +0300, Gleb Natapov wrote:
> > > > > On Wed, Apr 21, 2010 at 02:37:34PM -0300, Marcelo Tosatti wrote:
> > > > > > On Wed, Apr 21, 2010 at 08:12:27PM +0300, Gleb Natapov wrote:
> > > > > > > On Wed, Apr 21, 2010 at 12:58:41PM -0300, Marcelo Tosatti wrote:
> > > > > > > > > Or could we make kvm_set_irq() atomic? Though the code path 
> > > > > > > > > is a little long 
> > > > > > > > > for spinlock.
> > > > > > > > 
> > > > > > > > Yes, given the sleep-inside-RCU-protected section bug from
> > > > > > > > kvm_notify_acked_irq, either that or convert IRQ locking to 
> > > > > > > > SRCU.
> > > > > > > > 
> > > > > > > > But as you said, the code paths are long and potentially slow, 
> > > > > > > > so
> > > > > > > > probably SRCU is a better alternative.
> > > > > > > > 
> > > > > > > > Gleb?
> > > > > > > kvm_set_irq() was converted to rcu from mutex to make msix 
> > > > > > > interrupt
> > > > > > > injection scalable.
> > > > > > 
> > > > > > We meant ioapic lock. See the last report from Ralf on this thread. 
> > > > > Can we solve the problem by calling ack notifier outside rcu read
> > > > > section in kvm_notify_acked_irq()?
> > > > 
> > > > The unregister path does
> > > > 
> > > > - remove_from_list(entry)
> > > > - synchronize_rcu
> > > > - kfree(entry)
> > > > 
> > > > So if kvm_notify_acked_irq sleeps, synchronize_rcu can succeed, and the
> > > > notifier entry can be freed.
> > > What I mean is kvm_notify_acked_irq() will iterate over all ack entries
> > > in rcu read protected section, but instead of calling callback it will
> > > collect them into the array and call them outside rcu read section. At
> > > this point it doesn't matter if entry is unregistered since the copy is
> > > used to actually call the notifier.
> > 
> > Here it is, but no, this trick can't be done safely for ack notifiers
> > because they are unregistered at runtime by device assignment.
> > 
> > How does the freeing path know it can proceed to free its entry if
> > there is no reliable way to know whether there is still a reference to it?
> > (think "priv" gets freed between rcu_read_unlock and ->irq_acked with
> > the patch below).
> > 
> Yeah, I see :(
> 
> > I can convert to SRCU if you have no objections.
> > 
> AFAIR there was a disadvantage compared to RCU, but I don't remember what
> it was exactly. 

http://www.mentby.com/paul-e-mckenney/kvm-patch-v3-13-kvm-fix-race-in-irqrouting-logic.html

But for the KVM IRQ paths it should not matter much since usage of grace
periods is rare (registration/unregistration is very rare compared to the
read side), and the IRQ paths are already large and slow, so the added
overhead should not be noticeable.
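For reference, the SRCU shape under discussion would be roughly (a
minimal sketch; the srcu_struct placement and names are illustrative,
not from an actual patch):

	struct srcu_struct irq_srcu;	/* e.g. embedded in struct kvm */

	/* read side -- ack notifiers would then be allowed to sleep */
	idx = srcu_read_lock(&irq_srcu);
	kvm_notify_acked_irq(kvm, irqchip, pin);
	srcu_read_unlock(&irq_srcu, idx);

	/* unregister side */
	remove_from_list(entry);
	synchronize_srcu(&irq_srcu);	/* waits even for sleeping readers */
	kfree(entry);

i.e. the grace-period cost lands on the rare unregister path, not on
injection.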

> What about converting PIC/IOAPIC mutexes into spinlocks?

Works for me, but on large guests the spinning will be noticeable,
I believe.

Avi?



Re: [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Hi all,

We have been implementing the prototype of Kemari for KVM, and we're sending
this message to share what we have now and TODO lists.  Hopefully, we would like
to get early feedback to keep us in the right direction.  Although advanced
approaches in the TODO lists are fascinating, we would like to run this project
step by step while absorbing comments from the community.  The current code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control logic is
implemented in QEMU.  However, we needed a hack in KVM to prevent rip from
proceeding before synchronizing VMs.  It may also need some plumbing in the
kernel side to guarantee replayability of certain events and instructions,
integrate the RAS capabilities of newer x86 hardware with the HA stack, as well
as for optimization purposes, for example.

Before going into details, we would like to show how Kemari looks.  We prepared
a demonstration video at the following location.  For those who are not
interested in the code, please take a look.
The demonstration scenario is,

1. Play with a guest VM that has virtio-blk and virtio-net.
# The guest image should be a NFS/SAN.
2. Start Kemari to synchronize the VM by running the following command in QEMU.
Just add "-k" option to usual migrate command.
migrate -d -k tcp:192.168.0.20:
3. Check the status by calling info migrate.
4. Go back to the VM to play chess animation.
5. Kill the VM. (VNC client also disappears)
6. Press "c" to continue the VM on the other host.
7. Bring up the VNC client (Sorry, it pops outside of video capture.)
8. Confirm that the chess animation ends, browser works fine, then shutdown.

http://www.osrg.net/kemari/download/kemari-kvm-fc11.mov

The repository contains all patches we're sending with this message.  For those
who want to try, pull the following repository.  When running configure, please
pass --enable-ft-mode.  Also you need to apply a patch attached at the end of
this message to your KVM.

git://kemari.git.sourceforge.net/gitroot/kemari/kemari

In addition to usual migrate environment and command, add "-k" to run.

The patch set consists of following components.

- bit-based dirty bitmap. (I have posted v4 for upstream QEMU on April 20)
- writev() support to QEMUFile and FdMigrationState.
- FT transaction sender/receiver
- event tap that triggers FT transaction.
- virtio-blk, virtio-net support for event tap.
   


This series looks quite nice!

I think it would make sense to separate out the things that are actually 
optimizations (like the dirty bitmap changes and the writev/readv 
changes) and to attempt to justify them with actual performance data.


I'd prefer not to modify the live migration protocol ABI and it doesn't 
seem to be necessary if we're willing to add options to the -incoming 
flag.  We also want to be a bit more generic with respect to IO.  
Otherwise, the series looks very close to being mergeable.


Regards,

Anthony Liguori


Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().

2010-04-22 Thread Anthony Liguori

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

QEMUFile currently doesn't support writev().  For sending multiple
pieces of data, such as pages, using writev() should be more efficient.

Signed-off-by: Yoshiaki Tamura
   


Is there performance data that backs this up?  Since QEMUFile uses a 
linear buffer for most operations that's limited to 16k, I suspect you 
wouldn't be able to observe a difference in practice.
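For context, the mechanical difference being proposed (an illustrative
fragment only; whether it is measurable is exactly the open question):

    /* today: each page is copied into the QEMUFile linear buffer and
     * flushed in up-to-16k chunks */
    qemu_put_buffer(f, page0, TARGET_PAGE_SIZE);
    qemu_put_buffer(f, page1, TARGET_PAGE_SIZE);

    /* with put_vector: hand the kernel all the page pointers at once,
     * no intermediate memcpy (struct iovec from <sys/uio.h>) */
    struct iovec iov[2] = {
        { .iov_base = page0, .iov_len = TARGET_PAGE_SIZE },
        { .iov_base = page1, .iov_len = TARGET_PAGE_SIZE },
    };
    writev(fd, iov, 2);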


Regards,

Anthony Liguori


---
  buffered_file.c |2 +-
  hw/hw.h |   16 
  savevm.c|   43 +--
  3 files changed, 42 insertions(+), 19 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 54dc6c2..187d1d4 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -256,7 +256,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,
  s->wait_for_unfreeze = wait_for_unfreeze;
  s->close = close;

-s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
+s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL, NULL, NULL,
   buffered_close, buffered_rate_limit,
   buffered_set_rate_limit,
 buffered_get_rate_limit);
diff --git a/hw/hw.h b/hw/hw.h
index fc9ed29..921cf90 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -23,6 +23,13 @@
  typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
  int64_t pos, int size);

+/* This function writes a chunk of vector to a file at the given position.
+ * The pos argument can be ignored if the file is only being used for
+ * streaming.
+ */
+typedef int (QEMUFilePutVectorFunc)(void *opaque, struct iovec *iov,
+int64_t pos, int iovcnt);
+
  /* Read a chunk of data from a file at the given position.  The pos argument
   * can be ignored if the file is only be used for streaming.  The number of
   * bytes actually read should be returned.
@@ -30,6 +37,13 @@ typedef int (QEMUFilePutBufferFunc)(void *opaque, const 
uint8_t *buf,
  typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
  int64_t pos, int size);

+/* Read a chunk of vector from a file at the given position.  The pos argument
+ * can be ignored if the file is only be used for streaming.  The number of
+ * bytes actually read should be returned.
+ */
+typedef int (QEMUFileGetVectorFunc)(void *opaque, struct iovec *iov,
+int64_t pos, int iovcnt);
+
  /* Close a file and return an error code */
  typedef int (QEMUFileCloseFunc)(void *opaque);

@@ -46,7 +60,9 @@ typedef size_t (QEMUFileSetRateLimit)(void *opaque, size_t 
new_rate);
  typedef size_t (QEMUFileGetRateLimit)(void *opaque);

  QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
+ QEMUFilePutVectorFunc *put_vector,
   QEMUFileGetBufferFunc *get_buffer,
+ QEMUFileGetVectorFunc *get_vector,
   QEMUFileCloseFunc *close,
   QEMUFileRateLimit *rate_limit,
   QEMUFileSetRateLimit *set_rate_limit,
diff --git a/savevm.c b/savevm.c
index 490ab70..944e788 100644
--- a/savevm.c
+++ b/savevm.c
@@ -162,7 +162,9 @@ void qemu_announce_self(void)

  struct QEMUFile {
  QEMUFilePutBufferFunc *put_buffer;
+QEMUFilePutVectorFunc *put_vector;
  QEMUFileGetBufferFunc *get_buffer;
+QEMUFileGetVectorFunc *get_vector;
  QEMUFileCloseFunc *close;
  QEMUFileRateLimit *rate_limit;
  QEMUFileSetRateLimit *set_rate_limit;
@@ -263,11 +265,11 @@ QEMUFile *qemu_popen(FILE *stdio_file, const char *mode)
  s->stdio_file = stdio_file;

  if(mode[0] == 'r') {
-s->file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_pclose,
-NULL, NULL, NULL);
+s->file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer,
+ NULL, stdio_pclose, NULL, NULL, NULL);
  } else {
-s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_pclose,
-NULL, NULL, NULL);
+s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL,
+ stdio_pclose, NULL, NULL, NULL);
  }
  return s->file;
  }
@@ -312,11 +314,11 @@ QEMUFile *qemu_fdopen(int fd, const char *mode)
  goto fail;

  if(mode[0] == 'r') {
-s->file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_fclose,
-NULL, NULL, NULL);
+s->file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer, NULL,
+ stdio_fclose, NULL, NULL, NULL);
  } else {
-s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_fclose,
-NULL, NULL, NULL);
+s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL,
+ stdio_fclose, NULL, NULL, NULL);
  }
  return s->file;

@@ -3

Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section

2010-04-22 Thread Gleb Natapov
On Thu, Apr 22, 2010 at 04:40:30PM -0300, Marcelo Tosatti wrote:
> On Thu, Apr 22, 2010 at 09:11:30PM +0300, Gleb Natapov wrote:
> > On Thu, Apr 22, 2010 at 01:40:38PM -0300, Marcelo Tosatti wrote:
> > > On Wed, Apr 21, 2010 at 09:38:39PM +0300, Gleb Natapov wrote:
> > > > On Wed, Apr 21, 2010 at 03:29:11PM -0300, Marcelo Tosatti wrote:
> > > > > On Wed, Apr 21, 2010 at 08:58:48PM +0300, Gleb Natapov wrote:
> > > > > > On Wed, Apr 21, 2010 at 02:37:34PM -0300, Marcelo Tosatti wrote:
> > > > > > > On Wed, Apr 21, 2010 at 08:12:27PM +0300, Gleb Natapov wrote:
> > > > > > > > On Wed, Apr 21, 2010 at 12:58:41PM -0300, Marcelo Tosatti wrote:
> > > > > > > > > > Or could we make kvm_set_irq() atomic? Though the code path 
> > > > > > > > > > is a little long 
> > > > > > > > > > for spinlock.
> > > > > > > > > 
> > > > > > > > > Yes, given the sleep-inside-RCU-protected section bug from
> > > > > > > > > kvm_notify_acked_irq, either that or convert IRQ locking to 
> > > > > > > > > SRCU.
> > > > > > > > > 
> > > > > > > > > But as you said, the code paths are long and potentially 
> > > > > > > > > slow, so
> > > > > > > > > probably SRCU is a better alternative.
> > > > > > > > > 
> > > > > > > > > Gleb?
> > > > > > > > kvm_set_irq() was converted to rcu from mutex to make msix 
> > > > > > > > interrupt
> > > > > > > > injection scalable.
> > > > > > > 
> > > > > > > We meant ioapic lock. See the last report from Ralf on this 
> > > > > > > thread. 
> > > > > > Can we solve the problem by calling ack notifier outside rcu read
> > > > > > section in kvm_notify_acked_irq()?
> > > > > 
> > > > > The unregister path does
> > > > > 
> > > > > - remove_from_list(entry)
> > > > > - synchronize_rcu
> > > > > - kfree(entry)
> > > > > 
> > > > > So if kvm_notify_acked_irq sleeps, synchronize_rcu can succeed, and 
> > > > > the
> > > > > notifier entry can be freed.
> > > > What I mean is that kvm_notify_acked_irq() will iterate over all ack entries
> > > > in the rcu read protected section, but instead of calling the callbacks it
> > > > will collect them into an array and call them outside the rcu read section.
> > > > At this point it doesn't matter if an entry is unregistered, since the copy
> > > > is used to actually call the notifier.
> > > 
> > > Here it is, but no, this trick can't be done safely for ack notifiers
> > > because they are unregistered at runtime by device assignment.
> > > 
> > > How does the freeing path know it can proceed to free its entry if
> > > there is no reliable way to know if there is a reference to itself?
> > > (think "priv" gets freed between rcu_read_unlock and ->irq_acked with
> > > the patch below).
> > > 
> > Yeah, I see :(
> > 
> > > I can convert to SRCU if you have no objections.
> > > 
> > AFAIR there was a disadvantage compared to RCU, but I don't remember what
> > it was exactly.
> 
> http://www.mentby.com/paul-e-mckenney/kvm-patch-v3-13-kvm-fix-race-in-irqrouting-logic.html
> 
> But for KVM IRQ paths it should not matter much since usage of grace
> periods is rare (registration/unregistration is very rare compared to 
> read side), and the IRQ paths are already large and slow, so the added
> overhead should not be noticeable.
> 
msix path shouldn't be slow and this is the case we are trying to
optimize.

 SRCU's read-side primitives are also significantly slower than
 those of RCU.

Doesn't sound too promising.

> > What about converting PIC/IOAPIC mutexes into spinlocks?
> 
> Works for me, but on large guests the spinning will be noticeable,
> I believe.
For interrupts going through the IOAPIC, but we know this is not scalable
anyway.

> 
> Avi?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Anthony Liguori

On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:

2010/4/22 Dor Laor:
   

On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
 

Dor Laor wrote:
   

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
 

Hi all,

We have been implementing the prototype of Kemari for KVM, and we're
sending
this message to share what we have now and TODO lists. Hopefully, we
would like
to get early feedback to keep us in the right direction. Although
advanced
approaches in the TODO lists are fascinating, we would like to run
this project
step by step while absorbing comments from the community. The current
code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control logic is
implemented in QEMU. However, we needed a hack in KVM to prevent rip
from
proceeding before synchronizing VMs. It may also need some plumbing in
the
kernel side to guarantee replayability of certain events and
instructions,
integrate the RAS capabilities of newer x86 hardware with the HA
stack, as well
as for optimization purposes, for example.
   

[ snap]

 

The rest of this message describes TODO lists grouped by each topic.

=== event tapping ===

Event tapping is the core component of Kemari, and it decides on which
event the
primary should synchronize with the secondary. The basic assumption
here is
that outgoing I/O operations are idempotent, which is usually true for
disk I/O
and reliable network protocols such as TCP.
   

IMO any type of network event should be stalled too. What if the VM runs
a non-TCP protocol and a packet the master node sent reached some remote
client, and the master failed before the sync to the slave?
 

In the current implementation, it is actually stalling any type of network
traffic that goes through virtio-net.

However, if the application was using unreliable protocols, it should
have its own recovery mechanism, or it should be completely stateless.
   

Why do you treat TCP differently? You can damage the entire VM this way -
think of a DHCP request that was dropped at the moment you switched between
the master and the slave.
 

I'm not trying to say that we should treat TCP differently; it's just
that the damage is more severe there.
In the case of a DHCP request, the client would have a chance to retry after
failover, correct?
BTW, in current implementation,
   


I'm slightly confused about the current implementation vs. my 
recollection of the original paper with Xen.  I had thought that all 
disk and network I/O was buffered in such a way that at each checkpoint, 
the I/O operations would be released in a burst.  Otherwise, you would 
have to synchronize after every I/O operation which is what it seems the 
current implementation does.  I'm not sure how that is accomplished 
atomically though since you could have a completed I/O operation 
duplicated on the slave node provided it didn't notify completion prior 
to failure.
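A minimal sketch of the buffer-until-checkpoint scheme being recalled from
the Xen/Remus design (hypothetical structures, not Kemari code): outgoing
operations are queued on the primary and only released in a burst once the
secondary has acknowledged the checkpoint they precede.

#include <stddef.h>

struct pending_io {
    struct pending_io *next;
    void (*release)(void *opaque);      /* actually emit the I/O */
    void *opaque;
};

static struct pending_io *pending_head;
static struct pending_io **pending_tail = &pending_head;

static void queue_io(struct pending_io *op)
{
    op->next = NULL;
    *pending_tail = op;
    pending_tail = &op->next;
}

/* called once the secondary acks the checkpoint state */
static void checkpoint_acked(void)
{
    struct pending_io *op = pending_head;

    while (op) {
        struct pending_io *next = op->next;
        op->release(op->opaque);        /* burst release */
        op = next;
    }
    pending_head = NULL;
    pending_tail = &pending_head;
}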


Is there another kemari component that somehow handles buffering I/O 
that is not obvious from these patches?


Regards,

Anthony Liguori


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Dor Laor

On 04/22/2010 04:16 PM, Yoshiaki Tamura wrote:

2010/4/22 Dor Laor:

On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:


Dor Laor wrote:


On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:


Hi all,

We have been implementing the prototype of Kemari for KVM, and we're
sending
this message to share what we have now and TODO lists. Hopefully, we
would like
to get early feedback to keep us in the right direction. Although
advanced
approaches in the TODO lists are fascinating, we would like to run
this project
step by step while absorbing comments from the community. The current
code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control logic is
implemented in QEMU. However, we needed a hack in KVM to prevent rip
from
proceeding before synchronizing VMs. It may also need some plumbing in
the
kernel side to guarantee replayability of certain events and
instructions,
integrate the RAS capabilities of newer x86 hardware with the HA
stack, as well
as for optimization purposes, for example.


[ snap]



The rest of this message describes TODO lists grouped by each topic.

=== event tapping ===

Event tapping is the core component of Kemari, and it decides on which
event the
primary should synchronize with the secondary. The basic assumption
here is
that outgoing I/O operations are idempotent, which is usually true for
disk I/O
and reliable network protocols such as TCP.


IMO any type of network event should be stalled too. What if the VM runs
a non-TCP protocol and a packet the master node sent reached some remote
client, and the master failed before the sync to the slave?


In the current implementation, it is actually stalling any type of network
traffic that goes through virtio-net.

However, if the application was using unreliable protocols, it should
have its own recovery mechanism, or it should be completely stateless.


Why do you treat TCP differently? You can damage the entire VM this way -
think of a DHCP request that was dropped at the moment you switched between
the master and the slave.


I'm not trying to say that we should treat TCP differently; it's just
that the damage is more severe there.
In the case of a DHCP request, the client would have a chance to retry after
failover, correct?


But until it times out it won't have networking.


BTW, in the current implementation, it's synchronizing before the DHCP ack is
sent. But in the case of TCP, once you send an ack to the client before the
sync, there is no way to recover.


What if the guest is running a DHCP server? If we provide an IP to a
client and then fail over to the secondary, it will run without knowing
that the master allocated this IP.





[snap]



=== clock ===

Since synchronizing the virtual machines every time the TSC is accessed
would be prohibitive, the transmission of the TSC will be done lazily,
which means delaying it until a non-TSC synchronization point arrives.


Why do you specifically care about the tsc sync? When you sync all the
IO model on snapshot it also synchronizes the tsc.


So, do you agree that an extra clock synchronization is not needed since it
is done anyway as part of the live migration state sync?


I agree that its sent as part of the live migration.
What I wanted to say here is that this is not something for real time
applications.
I usually get questions like can this guarantee fault tolerance for
real time applications.


First, the huge cost of snapshots won't suit any real-time app.
Second, even if that weren't the case, the TSC delta and kvmclock are
synchronized as part of the VM state, so there is no use in trapping it
in the middle.





In general, can you please explain the 'algorithm' for continuous
snapshots (is that what you like to do?):


Yes, of course.
Sorry for being less informative.


A trivial one would be to:
- do X online snapshots/sec


I currently don't have good numbers that I can share right now.
Snapshots/sec depends on what kind of workload is running, and if the
guest was almost idle, there will be no snapshots in 5sec. On the other
hand, if the guest was running I/O intensive workloads (netperf, iozone
for example), there will be about 50 snapshots/sec.


- Stall all IO (disk/block) from the guest to the outside world
until the previous snapshot reaches the slave.


Yes, it does.


- Snapshots are made of


Full device model + diff of dirty pages from the last snapshot.


- diff of dirty pages from last snapshot


This also depends on the workload.
In case of I/O intensive workloads, dirty pages are usually less than 100.


The hardest would be memory intensive loads.
So 100 snap/sec means latency of 10msec right?
(not that it's not ok, with faster hw and IB you'll be able to get much
more)


Doesn't 100 snap/sec mean the snapshot interval is 10msec?
IIUC, to get the latency, you need to get, T

using ftrace with kvm

2010-04-22 Thread David S. Ahern
I have a VM that is spinning (both vcpus at 100%). As I recall kvm_stat
has been deprecated in favor of ftrace. Is there a wiki page or document
that gives suggestions on this?

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC] KVM MMU: fix hashing for TDP and non-paging modes

2010-04-22 Thread Eric Northup
I've been reading the x86's mmu.c recently and had been wondering
about something.  Avi's recent mmu documentation (thanks!) seems to
have confirmed my understanding of how the shadow paging is supposed
to be working.  In TDP mode, when mmu_alloc_roots() calls
kvm_mmu_get_page(), why does it pass (vcpu->arch.cr3 >> PAGE_SHIFT) or
(vcpu->arch.mmu.pae_root[i]) as gfn?

It seems to me that in TDP mode, gfn should be either zero for the
root page table, or 0/1GB/2GB/3GB (for PAE page tables).

The existing behavior can lead to multiple, semantically-identical TDP
roots being created by mmu_alloc_roots, depending on the VCPU's CR3 at
the time that mmu_alloc_roots was called.  But the nested page tables
should be* independent of the VCPU state. That wastes some memory and
causes extra page faults while populating the extra copies of the page
tables.

*assuming that we aren't modeling per-VCPU state that might change the
physical address map as seen by that VCPU, such as setting the APIC
base to an address overlapping RAM.

All feedback would be welcome, since I'm new to this system!  A
strawman patch follows.

thanks,
-Eric

--

For TDP mode, avoid creating multiple page table roots for the single
guest-to-host physical address map by fixing the inputs used for the
shadow page table hash in mmu_alloc_roots().

Signed-off-by: Eric Northup 
---
 arch/x86/kvm/mmu.c |   12 
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ddfa865..9696d65 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2059,10 +2059,12 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
hpa_t root = vcpu->arch.mmu.root_hpa;

ASSERT(!VALID_PAGE(root));
-   if (tdp_enabled)
-   direct = 1;
if (mmu_check_root(vcpu, root_gfn))
return 1;
+   if (tdp_enabled) {
+   direct = 1;
+   root_gfn = 0;
+   }
sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
  PT64_ROOT_LEVEL, direct,
  ACC_ALL, NULL);
@@ -2072,8 +2074,6 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
return 0;
}
direct = !is_paging(vcpu);
-   if (tdp_enabled)
-   direct = 1;
for (i = 0; i < 4; ++i) {
hpa_t root = vcpu->arch.mmu.pae_root[i];

@@ -2089,6 +2089,10 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
root_gfn = 0;
if (mmu_check_root(vcpu, root_gfn))
return 1;
+   if (tdp_enabled) {
+   direct = 1;
+   root_gfn = i << 30;
+   }
sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
  PT32_ROOT_LEVEL, direct,
  ACC_ALL, NULL);
--
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: using ftrace with kvm

2010-04-22 Thread Manish Regmi
On Thu, Apr 22, 2010 at 3:53 PM, David S. Ahern  wrote:
I have a VM that is spinning (both vcpus at 100%). As I recall kvm_stat
has been deprecated in favor of ftrace. Is there a wiki page or document
that gives suggestions on this?

David

Documentation/trace/* is the place to look.

For me, the function and function_graph tracers give too much data even
if I limit them to one function, so I use tracepoints: I simply enable
the tracepoints I am interested in and read trace_pipe.

cd /sys/kernel/debug/tracing    # assuming debugfs is mounted here
cat available_events
echo kvm:* > set_event
cat trace_pipe > ~/mydump.txt   # (or whatever you want to do)

---
regards
Manish Regmi

http://manish-cs.blogspot.com
http://ext2read.sf.net
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-22 Thread Alexander Graf

On 22.04.2010, at 08:09, Fernando Luis Vázquez Cao wrote:

> On 04/22/2010 11:45 AM, Fernando Luis Vázquez Cao wrote:
>> On 04/21/2010 06:41 PM, Alexander Graf wrote:
>>> On 21.04.2010, at 10:29, Fernando Luis Vázquez Cao wrote:
>>> 
 On 04/20/2010 08:03 PM, Takuya Yoshikawa wrote:
> @@ -318,7 +318,7 @@ struct kvm_dirty_log {
>   __u32 padding1;
>   union {
>   void __user *dirty_bitmap; /* one bit per page */
> - __u64 padding2;
> + __u64 addr;
 
 This can break on x86_32 and x86_64-compat. addr is a long not a __u64.
>>> 
>>> So the high 32 bits are zero. Where's the problem?
>> 
>> If we are careful enough to cast the addr appropriately we should be fine,
>> even if we keep the padding field in the union. I am not saying that it
>> breaks 32 architectures but that it can potentially be problematic.
>> 
> + case KVM_SWITCH_DIRTY_LOG: {
> + struct kvm_dirty_log log;
> +
> + r = -EFAULT;
> + if (copy_from_user(&log, argp, sizeof log))
> + goto out;
> + r = kvm_vm_ioctl_switch_dirty_log(kvm, &log);
> + if (r)
> + goto out;
> + r = -EFAULT;
> + if (copy_to_user(argp, &log, sizeof log))
> + goto out;
> + r = 0;
> + break;
> + }
 
 In x86_64-compat mode we are handling 32bit user-space addresses
 so we need the compat counterpart of KVM_SWITCH_DIRTY_LOG too.
>>> 
>>> The compat code just forwards everything to the generic ioctls.
>> 
>> The compat code uses struct compat_kvm_dirty_log instead of
>> struct kvm_dirty_log to communicate with user space so
>> the necessary conversions needs to be done before invoking
>> the generic ioctl (see KVM_GET_DIRTY_LOG in kvm_vm_compat_ioctl).
>> 
>> By the way we probably should move the definition of struct
>> compat_kvm_dirty_log to a header file.
> 
> It seems that it was you and Arnd who added the kvm_vm compat ioctl :-).
> Are you considering a different approach to tackle the issues that we
> have with a big-endian userspace?

IIRC the issue was a pointer inside of a nested structure, no?
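For reference, the nested structure in question looks roughly like this
(sketched from memory of kvm_main.c, not copied verbatim): a 32-bit user
pointer sits inside a union, so the compat path has to convert it with
compat_ptr() before calling the generic ioctl handler.

struct compat_kvm_dirty_log {
    __u32 slot;
    __u32 padding1;
    union {
        compat_uptr_t dirty_bitmap;     /* 32-bit user pointer */
        __u64 padding2;
    };
};

/* roughly what kvm_vm_compat_ioctl() does for KVM_GET_DIRTY_LOG, and
 * what a new KVM_SWITCH_DIRTY_LOG would have to mirror */
struct compat_kvm_dirty_log compat_log;
struct kvm_dirty_log log;

if (copy_from_user(&compat_log, (void __user *)arg, sizeof(compat_log)))
    return -EFAULT;
log.slot         = compat_log.slot;
log.padding1     = compat_log.padding1;
log.padding2     = compat_log.padding2;
log.dirty_bitmap = compat_ptr(compat_log.dirty_bitmap);
r = kvm_vm_ioctl_get_dirty_log(kvm, &log);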


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Yoshiaki Tamura

Jamie Lokier wrote:

Yoshiaki Tamura wrote:

Dor Laor wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Event tapping is the core component of Kemari, and it decides on which
event the
primary should synchronize with the secondary. The basic assumption
here is
that outgoing I/O operations are idempotent, which is usually true for
disk I/O
and reliable network protocols such as TCP.


IMO any type of network event should be stalled too. What if the VM runs
a non-TCP protocol and a packet the master node sent reached some remote
client, and the master failed before the sync to the slave?


In the current implementation, it is actually stalling any type of network
traffic that goes through virtio-net.

However, if the application was using unreliable protocols, it should have
its own recovery mechanism, or it should be completely stateless.


Even with unreliable protocols, if slave takeover causes the receiver
to have received a packet that the sender _does not think it has ever
sent_, expect some protocols to break.

If the slave replaying master's behaviour since the last sync means it
will definitely get into the same state of having sent the packet,
that works out.


That's something we're expecting now.


But you still have to be careful that the other end's responses to
that packet are not seen by the slave too early during that replay.
Otherwise, for example, the slave may observe a TCP ACK to a packet
that it hasn't yet sent, which is an error.


Even though the current implementation syncs just before network output, what
you pointed out could happen.  In this case, would the connection be lost, or
would the client/server recover from it?  If the latter, it would be fine;
otherwise I wonder how people doing similar things handle this situation.



About IP idempotency:

In general, IP packets are allowed to be lost or duplicated in the
network.  All IP protocols should be prepared for that; it is a basic
property.

However there is one respect in which they're not idempotent:

The TTL field should be decreased if packets are delayed.  Packets
should not appear to live in the network for longer than TTL seconds.
If they do, some protocols (like TCP) can react to the delayed ones
differently, such as sending a RST packet and breaking a connection.

It is acceptable to reduce TTL faster than the minimum.  After all, it
is reduced by 1 on every forwarding hop, in addition to time delays.


So the problem is that when the slave takes over, it sends a packet with the
same TTL, which the client may already have received.



I currently don't have good numbers that I can share right now.
Snapshots/sec depends on what kind of workload is running, and if the
guest was almost idle, there will be no snapshots in 5sec.  On the other
hand, if the guest was running I/O intensive workloads (netperf, iozone
for example), there will be about 50 snapshots/sec.


That is a really satisfying number, thank you :-)

Without this work I wouldn't have imagined that synchronised machines
could work with such a low transaction rate.


Thank you for your comments.

Although I haven't prepared good data yet, I personally prefer to have the
discussion based on an actual implementation and experimental data.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Hi all,

We have been implementing the prototype of Kemari for KVM, and we're
sending
this message to share what we have now and TODO lists. Hopefully, we
would like
to get early feedback to keep us in the right direction. Although
advanced
approaches in the TODO lists are fascinating, we would like to run
this project
step by step while absorbing comments from the community. The current
code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control logic is
implemented in QEMU. However, we needed a hack in KVM to prevent rip from
proceeding before synchronizing VMs. It may also need some plumbing in
the
kernel side to guarantee replayability of certain events and
instructions,
integrate the RAS capabilities of newer x86 hardware with the HA
stack, as well
as for optimization purposes, for example.

Before going into details, we would like to show how Kemari looks. We
prepared
a demonstration video at the following location. For those who are not
interested in the code, please take a look.
The demonstration scenario is,

1. Play with a guest VM that has virtio-blk and virtio-net.
# The guest image should be a NFS/SAN.
2. Start Kemari to synchronize the VM by running the following command
in QEMU.
Just add the "-k" option to the usual migrate command.
migrate -d -k tcp:192.168.0.20:
3. Check the status by calling info migrate.
4. Go back to the VM to play chess animation.
5. Kill the VM. (The VNC client also disappears.)
6. Press "c" to continue the VM on the other host.
7. Bring up the VNC client (Sorry, it pops outside of video capture.)
8. Confirm that the chess animation ends, browser works fine, then
shutdown.

http://www.osrg.net/kemari/download/kemari-kvm-fc11.mov

The repository contains all patches we're sending with this message.
For those who want to try, pull the following repository. When running
configure, please pass --enable-ft-mode. Also, you need to apply a patch
attached at the end of this message to your KVM.

git://kemari.git.sourceforge.net/gitroot/kemari/kemari

In addition to the usual migrate environment and command, add "-k" to run.

The patch set consists of following components.

- bit-based dirty bitmap. (I have posted v4 for upstream QEMU on April
20)
- writev() support to QEMUFile and FdMigrationState.
- FT transaction sender/receiver
- event tap that triggers FT transaction.
- virtio-blk, virtio-net support for event tap.


This series looks quite nice!


Thanks for your kind words!


I think it would make sense to separate out the things that are actually
optimizations (like the dirty bitmap changes and the writev/readv
changes) and to attempt to justify them with actual performance data.


I agree with the separation plan.

For the dirty bitmap change, Avi and I discussed the patchset for upstream QEMU
while you were offline (sorry if I was wrong).  Could you also take a look?


http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg01396.html

Regarding writev, I agree that it should be backed by actual data; otherwise
it should be removed.  We attempted to do everything that might reduce the
overhead of the transaction.



I'd prefer not to modify the live migration protocol ABI and it doesn't
seem to be necessary if we're willing to add options to the -incoming
flag. We also want to be a bit more generic with respect to IO.


I totally agree with your approach of not changing the protocol ABI.  Can we
add an option to -incoming?  Like -incoming ft_mode, for example.

Regarding the IO, let me reply to the next message.


Otherwise, the series looks very close to being mergable.


Thank you for your comment on each patch.

To be honest, I wasn't that confident, because I'm a newbie to KVM/QEMU and
struggled with how to implement this in an acceptable way.


Thanks,

Yoshi



Regards,

Anthony Liguori





--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


PXE Boot Timeout Issue...

2010-04-22 Thread Stuart Sheldon

Hi all,

Just upgraded to 12.3 user space tools from 11.0, and now when I attempt
to netboot a guest, it appears that the pxe rom is timing out on dhcp
before the bridge has enough time to come up.

Is there a command line switch to set the dhcp timeout, or a build
option that can be changed to set the timeout to a longer value, or
disable it entirely?

Thanks in advance...

Stu

--
Spider Pig, Spider Pig, Does whatever a Spider Pig does.
Can he swing, from a web, no he can't, he's a pig.
Look Oot! He is a Spider Pig
-- Matt Groening - "Spider Pig Lyrics"
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] Add a global synchronization point for pvclock

2010-04-22 Thread Zachary Amsden

On 04/22/2010 03:11 AM, Glauber Costa wrote:

On Tue, Apr 20, 2010 at 12:42:17PM -0700, Jeremy Fitzhardinge wrote:
   

On 04/20/2010 11:54 AM, Avi Kivity wrote:
 

On 04/20/2010 09:23 PM, Jeremy Fitzhardinge wrote:
   

On 04/20/2010 02:31 AM, Avi Kivity wrote:

 

btw, do you want this code in pvclock.c, or shall we keep it kvmclock
specific?

   

I think its a pvclock-level fix.  I'd been hoping to avoid having
something like this, but I think its ultimately necessary.

 

Did you observe drift on Xen, or is this "ultimately" pointing at the
future?
   

People are reporting weirdnesses that "clocksource=jiffies" apparently
resolves.  Xen and KVM are faced with the same hardware constraints, and
it wouldn't surprise me if there were small measurable
non-monotonicities in the PV clock under Xen.  May as well be safe.

Of course, it kills any possibility of being able to usefully export
this interface down to usermode.

My main concern about this kind of simple fix is that if there's a long
term systematic drift between different CPU's tscs, then this will
somewhat mask the problem while giving really awful time measurement on
the "slow" CPU(s).  In that case it really needs to adjust the scaling
factor to correct for the drift (*not* update the offset).  But if we're
definitely only talking about fixed, relatively small time offsets then
it is fine.
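For context, this is roughly how pvclock turns a TSC reading into nanoseconds
(paraphrased from the kernel's pvclock code; field names from struct
pvclock_vcpu_time_info, and the real code does the multiply at 96-bit
precision).  Jeremy's point is that a systematic drift should be corrected in
tsc_to_system_mul/tsc_shift, the scaling, rather than papered over in the
resulting offset:

delta = rdtsc() - src->tsc_timestamp;
if (src->tsc_shift >= 0)
    delta <<= src->tsc_shift;
else
    delta >>= -src->tsc_shift;
ns  = (delta * src->tsc_to_system_mul) >> 32;  /* 64x32->96-bit in reality */
now = src->system_time + ns;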
 

Can you by any chance run Ingo's time warp test on those machines?

You need to define TOD to 1, and leave out the TSC test.

For me, warps exist on every machine out there but the Nehalems, so far.
   

Or apply this patch. (cpuid clobbers ebx and ecx; since the inline asm only
captures eax and edx as outputs, those clobbers must be declared.)
diff -rup a/time-warp-test.c b/time-warp-test.c
--- a/time-warp-test.c  2010-04-15 16:30:13.955981607 -1000
+++ b/time-warp-test.c  2010-04-15 16:35:37.777982377 -1000
@@ -91,7 +91,7 @@ static inline unsigned long long __rdtsc
 {
DECLARE_ARGS(val, low, high);
 
-   asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high));
+   asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high) :: "ebx", "ecx");
 
return EAX_EDX_VAL(val, low, high);
 }


[RFC PATCH] asm-generic: bitops: introduce le bit offset macro

2010-04-22 Thread Takuya Yoshikawa
Although we can use the *_le_bit() helpers to operate on little-endian
arranged bitmaps, having the le bit offset calculation as a separate macro
gives us more freedom.

For example, KVM has le arranged dirty bitmaps for VGA, live migration,
etc., and we use them in user space too. To avoid bitmap copies between
kernel and user space, we want to update the bitmaps in user space directly.

To achieve this, the le bit offset combined with *_user() functions helps
us a lot.

So let us factor out the le bit offset calculation by defining it as a new
macro: generic_le_bit_offset().
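A hedged example of the intended use, say in KVM's dirty-log path;
set_bit_user() is a hypothetical helper, not an existing kernel function
(the sketch below only illustrates where the macro would fit):

static int mark_page_dirty_user(unsigned long __user *bitmap,
                                unsigned long page_number)
{
    /* translate the page number into the bit position of a
     * little-endian arranged bitmap, then set it directly in
     * user memory, avoiding a kernel-side bitmap copy */
    unsigned long nr = generic_le_bit_offset(page_number);

    return set_bit_user(nr, bitmap);    /* hypothetical */
}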

Signed-off-by: Takuya Yoshikawa 
---
 include/asm-generic/bitops/le.h |   18 +++---
 1 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/asm-generic/bitops/le.h b/include/asm-generic/bitops/le.h
index 80e3bf1..6af2e54 100644
--- a/include/asm-generic/bitops/le.h
+++ b/include/asm-generic/bitops/le.h
@@ -9,6 +9,8 @@
 
 #if defined(__LITTLE_ENDIAN)
 
+#define generic_le_bit_offset(nr)  (nr)
+
 #define generic_test_le_bit(nr, addr) test_bit(nr, addr)
 #define generic___set_le_bit(nr, addr) __set_bit(nr, addr)
 #define generic___clear_le_bit(nr, addr) __clear_bit(nr, addr)
@@ -25,22 +27,24 @@
 
 #elif defined(__BIG_ENDIAN)
 
+#define generic_le_bit_offset(nr)  ((nr) ^ BITOP_LE_SWIZZLE)
+
 #define generic_test_le_bit(nr, addr) \
-   test_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+   test_bit(generic_le_bit_offset(nr), (addr))
 #define generic___set_le_bit(nr, addr) \
-   __set_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+   __set_bit(generic_le_bit_offset(nr), (addr))
 #define generic___clear_le_bit(nr, addr) \
-   __clear_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+   __clear_bit(generic_le_bit_offset(nr), (addr))
 
 #define generic_test_and_set_le_bit(nr, addr) \
-   test_and_set_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+   test_and_set_bit(generic_le_bit_offset(nr), (addr))
 #define generic_test_and_clear_le_bit(nr, addr) \
-   test_and_clear_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+   test_and_clear_bit(generic_le_bit_offset(nr), (addr))
 
 #define generic___test_and_set_le_bit(nr, addr) \
-   __test_and_set_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+   __test_and_set_bit(generic_le_bit_offset(nr), (addr))
 #define generic___test_and_clear_le_bit(nr, addr) \
-   __test_and_clear_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+   __test_and_clear_bit(generic_le_bit_offset(nr), (addr))
 
 extern unsigned long generic_find_next_zero_le_bit(const unsigned long *addr,
unsigned long size, unsigned long offset);
-- 
1.6.3.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:

2010/4/22 Dor Laor:

On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:

Dor Laor wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Hi all,

We have been implementing the prototype of Kemari for KVM, and we're
sending
this message to share what we have now and TODO lists. Hopefully, we
would like
to get early feedback to keep us in the right direction. Although
advanced
approaches in the TODO lists are fascinating, we would like to run
this project
step by step while absorbing comments from the community. The current
code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control
logic is
implemented in QEMU. However, we needed a hack in KVM to prevent rip
from
proceeding before synchronizing VMs. It may also need some
plumbing in
the
kernel side to guarantee replayability of certain events and
instructions,
integrate the RAS capabilities of newer x86 hardware with the HA
stack, as well
as for optimization purposes, for example.

[ snap]


The rest of this message describes TODO lists grouped by each topic.

=== event tapping ===

Event tapping is the core component of Kemari, and it decides on
which
event the
primary should synchronize with the secondary. The basic assumption
here is
that outgoing I/O operations are idempotent, which is usually true
for
disk I/O
and reliable network protocols such as TCP.

IMO any type of network event should be stalled too. What if the VM runs
a non-TCP protocol and a packet the master node sent reached some remote
client, and the master failed before the sync to the slave?

In the current implementation, it is actually stalling any type of network
traffic that goes through virtio-net.

However, if the application was using unreliable protocols, it should
have its own recovery mechanism, or it should be completely stateless.

Why do you treat TCP differently? You can damage the entire VM this way -
think of a DHCP request that was dropped at the moment you switched between
the master and the slave.

I'm not trying to say that we should treat TCP differently; it's just
that the damage is more severe there.
In the case of a DHCP request, the client would have a chance to retry after
failover, correct?
BTW, in current implementation,


I'm slightly confused about the current implementation vs. my
recollection of the original paper with Xen. I had thought that all disk
and network I/O was buffered in such a way that at each checkpoint, the
I/O operations would be released in a burst. Otherwise, you would have
to synchronize after every I/O operation which is what it seems the
current implementation does.


Yes, you're almost right.
It's synchronizing before QEMU starts emulating I/O at each device model.
It was originally designed that way to avoid the complexity of introducing a
buffering mechanism and the additional I/O latency buffering would add.



I'm not sure how that is accomplished
atomically though since you could have a completed I/O operation
duplicated on the slave node provided it didn't notify completion prior
to failure.


That's exactly the point I wanted to discuss.
Currently, we're calling vm_stop(0), qemu_aio_flush() and bdrv_flush_all()
before qemu_savevm_state_all() in ft_tranx_ready(), to ensure outstanding I/O
is complete.  I mimicked what the existing live migration code does.

It's not enough?
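In other words, the ordering described above is roughly (ft_tranx_ready() is
from the Kemari patches; arguments abbreviated, so treat this as a sketch
rather than the actual function body):

static int ft_tranx_ready(Monitor *mon, QEMUFile *f)
{
    vm_stop(0);             /* pause vcpus so no new I/O is issued */
    qemu_aio_flush();       /* wait for outstanding AIO to complete */
    bdrv_flush_all();       /* flush block-driver write caches */
    return qemu_savevm_state_all(mon, f);   /* then send the checkpoint */
}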


Is there another kemari component that somehow handles buffering I/O
that is not obvious from these patches?


No, I'm not hiding anything, and I would share any information regarding Kemari 
to develop it in this community :-)


Thanks,

Yoshi



Regards,

Anthony Liguori




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 01/20] Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of bit-based phys_ram_dirty.

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

Hi,

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Replaces byte-based phys_ram_dirty bitmap with four (MASTER, VGA,
CODE, MIGRATION) bit-based phys_ram_dirty bitmap. On allocation, it
sets all bits in the bitmap. It uses ffs() to convert DIRTY_FLAG to
DIRTY_IDX.

Modifies wrapper functions for byte-based phys_ram_dirty bitmap to
bit-based phys_ram_dirty bitmap. MASTER works as a buffer, and upon
get_dirty() or get_dirty_flags(), it calls
cpu_physical_memory_sync_master() to update VGA and MIGRATION.


Why use an additional bitmap for MASTER instead of just updating the
VGA, CODE, and MIGRATION bitmaps together?


This way we don't have to think about whether we should update VGA or
MIGRATION.  IIRC, we had this discussion upstream before with Avi?


http://www.mail-archive.com/kvm@vger.kernel.org/msg30728.html

BTW, I also have the following TODO list regarding dirty bitmap.

1. Allocate vga and migration dirty bitmap dynamically.
2. Separate CODE and the other dirty bitmaps functions.



Regards,

Anthony Liguori


Replaces direct phys_ram_dirty access with wrapper functions to
prevent direct access to the phys_ram_dirty bitmap.

Signed-off-by: Yoshiaki Tamura
Signed-off-by: OHMURA Kei
---
cpu-all.h | 130 +
exec.c    |  60 ++--
2 files changed, 152 insertions(+), 38 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index 51effc0..3f8762d 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -37,6 +37,9 @@

#include "softfloat.h"

+/* to use ffs in flag_to_idx() */
+#include <strings.h>
+
#if defined(HOST_WORDS_BIGENDIAN) != defined(TARGET_WORDS_BIGENDIAN)
#define BSWAP_NEEDED
#endif
@@ -846,7 +849,6 @@ int cpu_str_to_log_mask(const char *str);
/* memory API */

extern int phys_ram_fd;
-extern uint8_t *phys_ram_dirty;
extern ram_addr_t ram_size;
extern ram_addr_t last_ram_offset;
extern uint8_t *bios_mem;
@@ -869,28 +871,140 @@ extern uint8_t *bios_mem;
/* Set if TLB entry is an IO callback. */
#define TLB_MMIO (1 << 5)

+/* Use DIRTY_IDX as indexes of bit-based phys_ram_dirty. */
+#define MASTER_DIRTY_IDX 0
+#define VGA_DIRTY_IDX 1
+#define CODE_DIRTY_IDX 2
+#define MIGRATION_DIRTY_IDX 3
+#define NUM_DIRTY_IDX 4
+
+#define MASTER_DIRTY_FLAG (1 << MASTER_DIRTY_IDX)
+#define VGA_DIRTY_FLAG (1 << VGA_DIRTY_IDX)
+#define CODE_DIRTY_FLAG (1 << CODE_DIRTY_IDX)
+#define MIGRATION_DIRTY_FLAG (1 << MIGRATION_DIRTY_IDX)
+
+extern unsigned long *phys_ram_dirty[NUM_DIRTY_IDX];
+
+static inline int dirty_flag_to_idx(int flag)
+{
+ return ffs(flag) - 1;
+}
+
+static inline int dirty_idx_to_flag(int idx)
+{
+ return 1 << idx;
+}
+
int cpu_memory_rw_debug(CPUState *env, target_ulong addr,
uint8_t *buf, int len, int is_write);

-#define VGA_DIRTY_FLAG 0x01
-#define CODE_DIRTY_FLAG 0x02
-#define MIGRATION_DIRTY_FLAG 0x08
-
/* read dirty bit (return 0 or 1) */
static inline int cpu_physical_memory_is_dirty(ram_addr_t addr)
{
- return phys_ram_dirty[addr >> TARGET_PAGE_BITS] == 0xff;
+ unsigned long mask;
+ ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+ int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+ mask = 1UL << offset;
+ return (phys_ram_dirty[MASTER_DIRTY_IDX][index] & mask) == mask;
+}
+
+static inline void cpu_physical_memory_sync_master(ram_addr_t index)
+{
+ if (phys_ram_dirty[MASTER_DIRTY_IDX][index]) {
+ phys_ram_dirty[VGA_DIRTY_IDX][index]
+ |= phys_ram_dirty[MASTER_DIRTY_IDX][index];
+ phys_ram_dirty[MIGRATION_DIRTY_IDX][index]
+ |= phys_ram_dirty[MASTER_DIRTY_IDX][index];
+ phys_ram_dirty[MASTER_DIRTY_IDX][index] = 0UL;
+ }
+}
+
+static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
+{
+ unsigned long mask;
+ ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+ int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+ int ret = 0, i;
+
+ mask = 1UL << offset;
+ cpu_physical_memory_sync_master(index);
+
+ for (i = VGA_DIRTY_IDX; i <= MIGRATION_DIRTY_IDX; i++) {
+ if (phys_ram_dirty[i][index] & mask) {
+ ret |= dirty_idx_to_flag(i);
+ }
+ }
+
+ return ret;
+}
+
+static inline int cpu_physical_memory_get_dirty_idx(ram_addr_t addr,
+ int dirty_idx)
+{
+ unsigned long mask;
+ ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+ int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+ mask = 1UL << offset;
+ cpu_physical_memory_sync_master(index);
+ return (phys_ram_dirty[dirty_idx][index] & mask) == mask;
}

static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
int dirty_flags)
{
- return phys_ram_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags;
+ return cpu_physical_memory_get_dirty_idx(addr,
+ dirty_flag_to_idx(dirty_flags));
}

static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
{
- phys_ram_dirty[addr >> TARGET_PAGE_BITS] = 0xff;
+ unsigned long mask;
+ ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+ int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+ mask = 1UL << offset;
+ phys_ram_dirty[MASTER_DIRTY_IDX][in

Re: [RFC PATCH 15/20] Introduce FT mode support to configure.

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Signed-off-by: Yoshiaki Tamura


No need for this.


OK.



Regards,

Anthony Liguori


---
configure | 8 
1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/configure b/configure
index 046c591..f0682d4 100755
--- a/configure
+++ b/configure
@@ -298,6 +298,7 @@ bsd_user="no"
guest_base=""
uname_release=""
io_thread="no"
+ft_mode="no"
mixemu="no"
kvm_trace="no"
kvm_cap_pit=""
@@ -671,6 +672,8 @@ for opt do
;;
--enable-io-thread) io_thread="yes"
;;
+ --enable-ft-mode) ft_mode="yes"
+ ;;
--disable-blobs) blobs="no"
;;
--kerneldir=*) kerneldir="$optarg"
@@ -840,6 +843,7 @@ echo " --enable-vde enable support for vde network"
echo " --disable-linux-aio disable Linux AIO support"
echo " --enable-linux-aio enable Linux AIO support"
echo " --enable-io-thread enable IO thread"
+echo " --enable-ft-mode enable FT mode support"
echo " --disable-blobs disable installing provided firmware blobs"
echo " --kerneldir=PATH look for kernel includes in PATH"
echo " --with-kvm-trace enable building the KVM module with the kvm
trace option"
@@ -2117,6 +2121,7 @@ echo "GUEST_BASE $guest_base"
echo "PIE user targets $user_pie"
echo "vde support $vde"
echo "IO thread $io_thread"
+echo "FT mode support $ft_mode"
echo "Linux AIO support $linux_aio"
echo "Install blobs $blobs"
echo "KVM support $kvm"
@@ -2318,6 +2323,9 @@ fi
if test "$io_thread" = "yes" ; then
echo "CONFIG_IOTHREAD=y">> $config_host_mak
fi
+if test "$ft_mode" = "yes" ; then
+ echo "CONFIG_FT_MODE=y">> $config_host_mak
+fi
if test "$linux_aio" = "yes" ; then
echo "CONFIG_LINUX_AIO=y">> $config_host_mak
fi







--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 14/20] Upgrade QEMU_FILE_VERSION from 3 to 4, and introduce qemu_savevm_state_all().

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Make a 32bit entry after QEMU_VM_FILE_VERSION to recognize whether the
transferred data is QEMU_VM_FT_MODE or QEMU_VM_LIVE_MIGRATION_MODE.


I'd rather you encapsulate the current protocol as opposed to extending
it with a new version.

You could also do this by just having a new -incoming option, right? All
you really need is to indicate that this is for high frequency
checkpointing, right?


Exactly.
I would use the -incoming option.



Regards,

Anthony Liguori


Signed-off-by: Yoshiaki Tamura
---
savevm.c | 76
-
sysemu.h | 1 +
2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/savevm.c b/savevm.c
index 292ae32..19b3efb 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1402,8 +1402,10 @@ static void vmstate_save(QEMUFile *f, SaveStateEntry *se)
}

#define QEMU_VM_FILE_MAGIC 0x5145564d
-#define QEMU_VM_FILE_VERSION_COMPAT 0x0002
-#define QEMU_VM_FILE_VERSION 0x0003
+#define QEMU_VM_FILE_VERSION_COMPAT 0x0003
+#define QEMU_VM_FILE_VERSION 0x0004
+#define QEMU_VM_LIVE_MIGRATION_MODE 0x0005
+#define QEMU_VM_FT_MODE 0x0006

#define QEMU_VM_EOF 0x00
#define QEMU_VM_SECTION_START 0x01
@@ -1425,6 +1427,12 @@ int qemu_savevm_state_begin(Monitor *mon,
QEMUFile *f, int blk_enable,

qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
qemu_put_be32(f, QEMU_VM_FILE_VERSION);
+
+ if (ft_mode) {
+ qemu_put_be32(f, QEMU_VM_FT_MODE);
+ } else {
+ qemu_put_be32(f, QEMU_VM_LIVE_MIGRATION_MODE);
+ }

QTAILQ_FOREACH(se, &savevm_handlers, entry) {
int len;
@@ -1533,6 +1541,66 @@ int qemu_savevm_state_complete(Monitor *mon,
QEMUFile *f)
return 0;
}

+int qemu_savevm_state_all(Monitor *mon, QEMUFile *f)
+{
+ SaveStateEntry *se;
+
+ QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+ int len;
+
+ if (se->save_live_state == NULL)
+ continue;
+
+ /* Section type */
+ qemu_put_byte(f, QEMU_VM_SECTION_START);
+ qemu_put_be32(f, se->section_id);
+
+ /* ID string */
+ len = strlen(se->idstr);
+ qemu_put_byte(f, len);
+ qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+ qemu_put_be32(f, se->instance_id);
+ qemu_put_be32(f, se->version_id);
+ if (ft_mode == FT_INIT) {
+ /* This is workaround. */
+ se->save_live_state(mon, f, QEMU_VM_SECTION_START, se->opaque);
+ } else {
+ se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
+ }
+ }
+
+ ft_mode = FT_TRANSACTION;
+ QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+ int len;
+
+ if (se->save_state == NULL && se->vmsd == NULL)
+ continue;
+
+ /* Section type */
+ qemu_put_byte(f, QEMU_VM_SECTION_FULL);
+ qemu_put_be32(f, se->section_id);
+
+ /* ID string */
+ len = strlen(se->idstr);
+ qemu_put_byte(f, len);
+ qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+ qemu_put_be32(f, se->instance_id);
+ qemu_put_be32(f, se->version_id);
+
+ vmstate_save(f, se);
+ }
+
+ qemu_put_byte(f, QEMU_VM_EOF);
+
+ if (qemu_file_has_error(f))
+ return -EIO;
+
+ return 0;
+}
+
+
void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f)
{
SaveStateEntry *se;
@@ -1617,6 +1685,10 @@ int qemu_loadvm_state(QEMUFile *f, int
skip_header)
if (v != QEMU_VM_FILE_VERSION)
return -ENOTSUP;

+ v = qemu_get_be32(f);
+ if (v == QEMU_VM_FT_MODE) {
+ ft_mode = FT_INIT;
+ }
}

while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
diff --git a/sysemu.h b/sysemu.h
index 6c1441f..df314bb 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -67,6 +67,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile
*f, int blk_enable,
int shared);
int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
+int qemu_savevm_state_all(Monitor *mon, QEMUFile *f);
void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
int qemu_loadvm_state(QEMUFile *f, int skip_header);







--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

QEMUFile currently doesn't support writev(). For sending multiple
data, such as pages, using writev() should be more efficient.

Signed-off-by: Yoshiaki Tamura


Is there performance data that backs this up? Since QEMUFile uses a
linear buffer for most operations that's limited to 16k, I suspect you
wouldn't be able to observe a difference in practice.


I currently don't have data, but I'll prepare it.
There were two things I wanted to avoid.

1. Pages to be copied to QEMUFile buf through qemu_put_buffer.
2. Calling write() every time even when we want to send multiple pages at once.

I think 2 may be negligible.
But 1 seems to be problematic if we want to make the latency as small as
possible, no?
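In the abstract, the vector path replaces N buffer copies plus N write()
calls with a single gather write.  A hedged sketch (not the patch code; a
real implementation must also handle short writes and larger page counts):

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

static ssize_t send_pages(int fd, uint8_t *pages[], size_t npages,
                          size_t page_size)
{
    struct iovec iov[64];       /* assume npages <= 64 for the sketch */
    size_t i;

    for (i = 0; i < npages; i++) {
        iov[i].iov_base = pages[i];     /* no copy into a staging buffer */
        iov[i].iov_len  = page_size;
    }
    return writev(fd, iov, npages);     /* one syscall for all pages */
}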




Regards,

Anthony Liguori


---
buffered_file.c | 2 +-
hw/hw.h | 16 
savevm.c | 43 +--
3 files changed, 42 insertions(+), 19 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 54dc6c2..187d1d4 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -256,7 +256,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,
s->wait_for_unfreeze = wait_for_unfreeze;
s->close = close;

- s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
+ s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL, NULL, NULL,
buffered_close, buffered_rate_limit,
buffered_set_rate_limit,
buffered_get_rate_limit);
diff --git a/hw/hw.h b/hw/hw.h
index fc9ed29..921cf90 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -23,6 +23,13 @@
typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
int64_t pos, int size);

+/* This function writes a chunk of vector to a file at the given
position.
+ * The pos argument can be ignored if the file is only being used for
+ * streaming.
+ */
+typedef int (QEMUFilePutVectorFunc)(void *opaque, struct iovec *iov,
+ int64_t pos, int iovcnt);
+
/* Read a chunk of data from a file at the given position. The pos
argument
* can be ignored if the file is only be used for streaming. The number of
* bytes actually read should be returned.
@@ -30,6 +37,13 @@ typedef int (QEMUFilePutBufferFunc)(void *opaque,
const uint8_t *buf,
typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
int64_t pos, int size);

+/* Read a chunk of vector from a file at the given position. The pos
argument
+ * can be ignored if the file is only be used for streaming. The
number of
+ * bytes actually read should be returned.
+ */
+typedef int (QEMUFileGetVectorFunc)(void *opaque, struct iovec *iov,
+ int64_t pos, int iovcnt);
+
/* Close a file and return an error code */
typedef int (QEMUFileCloseFunc)(void *opaque);

@@ -46,7 +60,9 @@ typedef size_t (QEMUFileSetRateLimit)(void *opaque,
size_t new_rate);
typedef size_t (QEMUFileGetRateLimit)(void *opaque);

QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
+ QEMUFilePutVectorFunc *put_vector,
QEMUFileGetBufferFunc *get_buffer,
+ QEMUFileGetVectorFunc *get_vector,
QEMUFileCloseFunc *close,
QEMUFileRateLimit *rate_limit,
QEMUFileSetRateLimit *set_rate_limit,
diff --git a/savevm.c b/savevm.c
index 490ab70..944e788 100644
--- a/savevm.c
+++ b/savevm.c
@@ -162,7 +162,9 @@ void qemu_announce_self(void)

struct QEMUFile {
QEMUFilePutBufferFunc *put_buffer;
+ QEMUFilePutVectorFunc *put_vector;
QEMUFileGetBufferFunc *get_buffer;
+ QEMUFileGetVectorFunc *get_vector;
QEMUFileCloseFunc *close;
QEMUFileRateLimit *rate_limit;
QEMUFileSetRateLimit *set_rate_limit;
@@ -263,11 +265,11 @@ QEMUFile *qemu_popen(FILE *stdio_file, const
char *mode)
s->stdio_file = stdio_file;

if(mode[0] == 'r') {
- s->file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_pclose,
- NULL, NULL, NULL);
+ s->file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer,
+ NULL, stdio_pclose, NULL, NULL, NULL);
} else {
- s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_pclose,
- NULL, NULL, NULL);
+ s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL,
+ stdio_pclose, NULL, NULL, NULL);
}
return s->file;
}
@@ -312,11 +314,11 @@ QEMUFile *qemu_fdopen(int fd, const char *mode)
goto fail;

if(mode[0] == 'r') {
- s->file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_fclose,
- NULL, NULL, NULL);
+ s->file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer, NULL,
+ stdio_fclose, NULL, NULL, NULL);
} else {
- s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_fclose,
- NULL, NULL, NULL);
+ s->file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL,
+ stdio_fclose, NULL, NULL, NULL);
}
return s->file;

@@ -330,8 +332,8 @@ QEMUFile *qemu_fopen_socket(int fd)
QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket));

s->fd = fd;
- s->file = qemu_fopen_ops(s, NULL, socket_get_buffer, socket_close,
- NULL, NULL, NULL);
+ s->file = qemu_fopen_ops(s, NULL, NULL, socket_get_buffer, NULL,
+ socket_close, NULL, NULL, NULL);
return s->file;
}

@@ -368,11 +370,11 @@ QEMUFile *qemu_fopen(const char *filename, const
char *mode)
goto fail;

if(mode[0] == 

Re: [PATCH 6/10] KVM MMU: don't write-protect if have new mapping to unsync page

2010-04-22 Thread Xiao Guangrong


Marcelo Tosatti wrote:

>>  role = vcpu->arch.mmu.base_role;
>> @@ -1332,12 +1336,16 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct 
>> kvm_vcpu *vcpu,
>>  hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
>>  if (sp->gfn == gfn) {
>>  if (sp->unsync)
>> -if (kvm_sync_page(vcpu, sp))
>> -continue;
>> +unsync_sp = sp;
> 

Hi Marcelo,

Thanks for your comments; maybe the changelog is not clear, so please allow
me to explain here.

Two cases can happen in the kvm_mmu_get_page() function:

- one case is that the goal sp is already in the cache; if the sp is
  unsync, we only need to update it to make sure the mapping is valid,
  but we neither mark it sync nor write-protect sp->gfn, since this does
  not break the unsync rule (one shadow page per gfn)

- the other case is that the goal sp does not exist and we need to create
  a new sp for the gfn; the gfn may have another shadow page, so to keep
  the unsync rule we should sync (mark sync and write-protect) the gfn's
  unsync shadow page.  After enabling multiple unsync shadows, we sync
  those shadow pages only when the new sp is not allowed to become unsync
  (again per the unsync rule; the new rule is: allow all pte pages to
  become unsync)

> 
> I don't see a reason why you can't create a new mapping to an unsync
> page. The code already creates shadow pte entries using unsync
> pagetables.

Do you mean case 2? In the original code, it synced the gfn's unsync
page first, regardless of whether it broke the unsync rule:

|   hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
|   if (sp->gfn == gfn) {
|   if (sp->unsync)
|   if (kvm_sync_page(vcpu, sp))

And my English is poor; sorry if I misunderstood your comment :-(

Xiao
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] KVM MMU: make kvm_mmu_zap_page() return the number of zapped sp in total.

2010-04-22 Thread Xiao Guangrong


Gui Jianfeng wrote:
> Currently, in kvm_mmu_change_mmu_pages(kvm, page), "used_pages--" is
> performed after calling kvm_mmu_zap_page() regardless of whether "page"
> was actually reclaimed, because a root sp won't be reclaimed by
> kvm_mmu_zap_page(). So making kvm_mmu_zap_page() return the total number
> of reclaimed sps makes more sense. A new flag is put into
> kvm_mmu_zap_page() to indicate whether the top page is reclaimed.
> 

This bug only hurts the kvm_mmu_change_mmu_pages() function; we'd better
allow 'self_deleted' to be NULL, so we can pass NULL everywhere else.

> @@ -1571,7 +1584,8 @@ restart:
>   pgprintk("%s: gfn %lx role %x\n", __func__, gfn,
>sp->role.word);
>   r = 1;
> - if (kvm_mmu_zap_page(kvm, sp))
> + ret = kvm_mmu_zap_page(kvm, sp, &self_deleted);
> + if (ret > 1 || (ret == 1 && self_deleted == 0))
>   goto restart;

Maybe we can keep kvm_mmu_zap_page() returning the number of zapped children,
with 'self_deleted' indicating whether the page itself was zapped; then we
don't need to modify those functions, and just fix kvm_mmu_change_mmu_pages()
so that the top page is only accounted for when 'self_deleted == 1'.
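
The accounting in kvm_mmu_change_mmu_pages() would then be something like
this (a sketch, assuming the return value counts only the zapped children):

    int self_deleted = 0;

    used_pages -= kvm_mmu_zap_page(kvm, page, &self_deleted);
    if (self_deleted)
        used_pages--;   /* charge the top page only if it was really freed */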

Xiao

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile.

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

For foolproofing purposes, qemu_put_vector_prepare() should be called
before qemu_put_vector(). Then, if any qemu_put_* function other than this
one is called after qemu_put_vector_prepare(), the program will abort().

Signed-off-by: Yoshiaki Tamura


I don't get it. What's this protecting against?


This was introduced to prevent mixing the order of normal writes and vector
writes, and to flush the QEMUFile buffer before handling vectors.
While qemu_put_buffer() copies data into the QEMUFile buffer, qemu_put_vector()
bypasses that buffer.


It's just foolproofing for what we encountered at the beginning, and if the
users of qemu_put_vector() are careful enough, we can remove
qemu_put_vector_prepare().  While writing this message, I started to think that
just calling qemu_fflush() in qemu_put_vector() would be enough...
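
For clarity, the call order the pair enforces would look like this (a
sketch using the names from the patch below; 'qiov' is a caller-supplied
QEMUIOVector):

    qemu_put_vector_prepare(f);   /* flushes f's internal buffer first */
    /* any other qemu_put_* call here would trip the abort() check */
    qemu_put_vector(f, qiov);     /* writes the iovec, bypassing the buffer */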




Regards,

Anthony Liguori


---
hw/hw.h | 2 ++
savevm.c | 53 +
2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index 921cf90..10e6dda 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -77,6 +77,8 @@ void qemu_fflush(QEMUFile *f);
int qemu_fclose(QEMUFile *f);
void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
void qemu_put_byte(QEMUFile *f, int v);
+void qemu_put_vector(QEMUFile *f, QEMUIOVector *qiov);
+void qemu_put_vector_prepare(QEMUFile *f);
void *qemu_realloc_buffer(QEMUFile *f, int size);
void qemu_clear_buffer(QEMUFile *f);

diff --git a/savevm.c b/savevm.c
index 944e788..22d928c 100644
--- a/savevm.c
+++ b/savevm.c
@@ -180,6 +180,7 @@ struct QEMUFile {
uint8_t *buf;

int has_error;
+ int prepares_vector;
};

typedef struct QEMUFileStdio
@@ -557,6 +558,58 @@ void qemu_put_byte(QEMUFile *f, int v)
qemu_fflush(f);
}

+void qemu_put_vector(QEMUFile *f, QEMUIOVector *v)
+{
+ struct iovec *iov;
+ int cnt;
+ size_t bufsize;
+ uint8_t *buf;
+
+ if (qemu_file_get_rate_limit(f) != 0) {
+ fprintf(stderr,
+ "Attempted to write vector while bandwidth limit is not zero.\n");
+ abort();
+ }
+
+ /* Check that the vector was prepared.
+ * For foolproofing purposes, qemu_put_vector_prepare() should be called
+ * before qemu_put_vector(). Then, if any other qemu_put_* function is
+ * called after qemu_put_vector_prepare(), the program will abort().
+ */
+ if (!f->prepares_vector) {
+ fprintf(stderr,
+ "You should prepare with qemu_put_vector_prepare.\n");
+ abort();
+ } else if (f->prepares_vector && f->buf_index != 0) {
+ fprintf(stderr, "Wrote data after qemu_put_vector_prepare.\n");
+ abort();
+ }
+ f->prepares_vector = 0;
+
+ if (f->put_vector) {
+ qemu_iovec_to_vector(v, &iov, &cnt);
+ f->put_vector(f->opaque, iov, 0, cnt);
+ } else {
+ qemu_iovec_to_size(v, &bufsize);
+ buf = qemu_malloc(bufsize + 1 /* for '\0' */);
+ qemu_iovec_to_buffer(v, buf);
+ qemu_put_buffer(f, buf, bufsize);
+ qemu_free(buf);
+ }
+
+}
+
+void qemu_put_vector_prepare(QEMUFile *f)
+{
+ if (f->prepares_vector) {
+ /* prepare vector */
+ fprintf(stderr, "Attempted to prepare vector twice\n");
+ abort();
+ }
+ f->prepares_vector = 1;
+ qemu_fflush(f);
+}
+
int qemu_get_buffer(QEMUFile *f, uint8_t *buf, int size1)
{
int size, l;









--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 10/20] Introduce skip_header parameter to qemu_loadvm_state() so that it can be called iteratively without reading the header.

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Signed-off-by: Yoshiaki Tamura


I think the more appropriate thing to do is have
qemu_savevm_state_complete() not write QEMU_VM_EOF when doing a
continuous live migration.

You would then want qemu_loadvm_state() to detect real EOF and treat
that the same as QEMU_VM_EOF (provided it occurred at a section boundary).

Of course, this should be a flag to qemu_loadvm_state() as it's not
always the right behavior.


Sorry.  I couldn't get your intention.
I would appreciate it if you could explain its good points.

On the receiver side, we need to switch to ft_transaction mode. If
qemu_savevm_state_complete() didn't send QEMU_VM_EOF, qemu_loadvm_state() wouldn't
get out of the loop, and therefore it couldn't switch to ft_transaction mode.
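
For reference, the receiver side we have in mind looks roughly like this
(a sketch under Kemari's assumptions; 'ft_transaction_mode' is an
illustrative flag, not code from the patch):

    ret = qemu_loadvm_state(f, 0);        /* initial migration: read header */
    while (ft_transaction_mode && ret >= 0) {
        ret = qemu_loadvm_state(f, 1);    /* per transaction: skip header */
    }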


Please let me know if I'm misunderstanding.



Regards,

Anthony Liguori


---
migration-exec.c | 2 +-
migration-fd.c | 2 +-
migration-tcp.c | 2 +-
migration-unix.c | 2 +-
savevm.c | 25 ++---
sysemu.h | 2 +-
6 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/migration-exec.c b/migration-exec.c
index 3edc026..5839a6d 100644
--- a/migration-exec.c
+++ b/migration-exec.c
@@ -113,7 +113,7 @@ static void exec_accept_incoming_migration(void *opaque)
QEMUFile *f = opaque;
int ret;

- ret = qemu_loadvm_state(f);
+ ret = qemu_loadvm_state(f, 0);
if (ret < 0) {
fprintf(stderr, "load of migration failed\n");
goto err;
diff --git a/migration-fd.c b/migration-fd.c
index 0cc74ad..0e97ed0 100644
--- a/migration-fd.c
+++ b/migration-fd.c
@@ -106,7 +106,7 @@ static void fd_accept_incoming_migration(void *opaque)
QEMUFile *f = opaque;
int ret;

- ret = qemu_loadvm_state(f);
+ ret = qemu_loadvm_state(f, 0);
if (ret < 0) {
fprintf(stderr, "load of migration failed\n");
goto err;
diff --git a/migration-tcp.c b/migration-tcp.c
index 56e1a3b..94a1a03 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -182,7 +182,7 @@ static void tcp_accept_incoming_migration(void *opaque)
goto out;
}

- ret = qemu_loadvm_state(f);
+ ret = qemu_loadvm_state(f, 0);
if (ret < 0) {
fprintf(stderr, "load of migration failed\n");
goto out_fopen;
diff --git a/migration-unix.c b/migration-unix.c
index b7aab38..dd99a73 100644
--- a/migration-unix.c
+++ b/migration-unix.c
@@ -168,7 +168,7 @@ static void unix_accept_incoming_migration(void *opaque)
goto out;
}

- ret = qemu_loadvm_state(f);
+ ret = qemu_loadvm_state(f, 0);
if (ret < 0) {
fprintf(stderr, "load of migration failed\n");
goto out_fopen;
diff --git a/savevm.c b/savevm.c
index 22d928c..a401b27 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1554,7 +1554,7 @@ typedef struct LoadStateEntry {
int version_id;
} LoadStateEntry;

-int qemu_loadvm_state(QEMUFile *f)
+int qemu_loadvm_state(QEMUFile *f, int skip_header)
{
QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
QLIST_HEAD_INITIALIZER(loadvm_handlers);
@@ -1563,17 +1563,20 @@ int qemu_loadvm_state(QEMUFile *f)
unsigned int v;
int ret;

- v = qemu_get_be32(f);
- if (v != QEMU_VM_FILE_MAGIC)
- return -EINVAL;
+ if (!skip_header) {
+ v = qemu_get_be32(f);
+ if (v != QEMU_VM_FILE_MAGIC)
+ return -EINVAL;
+
+ v = qemu_get_be32(f);
+ if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+ fprintf(stderr, "SaveVM v3 format is obsolete and don't work anymore\n");
+ return -ENOTSUP;
+ }
+ if (v != QEMU_VM_FILE_VERSION)
+ return -ENOTSUP;

- v = qemu_get_be32(f);
- if (v == QEMU_VM_FILE_VERSION_COMPAT) {
- fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
- return -ENOTSUP;
}
- if (v != QEMU_VM_FILE_VERSION)
- return -ENOTSUP;

while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
uint32_t instance_id, version_id, section_id;
@@ -1898,7 +1901,7 @@ int load_vmstate(Monitor *mon, const char *name)
monitor_printf(mon, "Could not open VM state file\n");
return -EINVAL;
}
- ret = qemu_loadvm_state(f);
+ ret = qemu_loadvm_state(f, 0);
qemu_fclose(f);
if (ret < 0) {
monitor_printf(mon, "Error %d while loading VM state\n", ret);
diff --git a/sysemu.h b/sysemu.h
index 647a468..6c1441f 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -68,7 +68,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
-int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state(QEMUFile *f, int skip_header);

void qemu_errors_to_file(FILE *fp);
void qemu_errors_to_mon(Monitor *mon);








--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 19/20] Insert do_event_tap() to virtio-{blk, net}, comment out assert() on cpu_single_env temporally.

2010-04-22 Thread Yoshiaki Tamura

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

do_event_tap() is inserted into functions which actually fire outputs.
By synchronizing the VMs before outputs are fired, we can fail over to the
receiver upon failure. To keep saving the VM continuously, the assert()
on cpu_single_env is commented out temporarily.

Signed-off-by: Yoshiaki Tamura
---
hw/virtio-blk.c | 2 ++
hw/virtio-net.c | 2 ++
qemu-kvm.c | 7 ++-
3 files changed, 10 insertions(+), 1 deletions(-)


This would be better done in the generic layers (the block and net
layers respectively). Then it would work with virtio and emulated devices.


I agree with your opinion that it's better if we can handle any emulated device
at once.  However, I have a question here: if we put do_event_tap() into the
generic layers, the emulated device state would already have advanced, so
wouldn't it be impossible to reproduce that I/O after failover?  If I'm wrong, I
would be happy to move it, but if I'm right, there would be two
approaches to overcome this:


1. Sync the I/O requests to the receiver, and upon failover, release those
requests before running the guest VM.
2. Copy the emulated device state before starting to emulate, and once it comes
to the generic layer, start synchronizing using the copied state.


We should also consider the guest's VCPU state.  I previously had a similar
discussion with Avi.  I would like to reconfirm his idea too.




Regards,

Anthony Liguori


diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index b80402d..1dd1c31 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -327,6 +327,8 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
.old_bs = NULL,
};

+ do_event_tap();
+
while ((req = virtio_blk_get_request(s))) {
virtio_blk_handle_request(req, &mrb);
}
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 5c0093e..1a32bf3 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -667,6 +667,8 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
{
VirtIONet *n = to_virtio_net(vdev);

+ do_event_tap();
+
if (n->tx_timer_active) {
virtio_queue_set_notification(vq, 1);
qemu_del_timer(n->tx_timer);
diff --git a/qemu-kvm.c b/qemu-kvm.c
index 1414f49..769bc95 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -935,8 +935,12 @@ int kvm_run(CPUState *env)

post_kvm_run(kvm, env);

+ /* TODO: we need to prevent tapping events that derive from the
+ * same VMEXIT. This needs more info from the kernel. */
#if defined(KVM_CAP_COALESCED_MMIO)
if (kvm_state->coalesced_mmio) {
+ /* prevent from tapping events while handling coalesced_mmio */
+ event_tap_suspend();
struct kvm_coalesced_mmio_ring *ring =
(void *) run + kvm_state->coalesced_mmio * PAGE_SIZE;
while (ring->first != ring->last) {
@@ -946,6 +950,7 @@ int kvm_run(CPUState *env)
smp_wmb();
ring->first = (ring->first + 1) % KVM_COALESCED_MMIO_MAX;
}
+ event_tap_resume();
}
#endif

@@ -1770,7 +1775,7 @@ static void resume_all_threads(void)
{
CPUState *penv = first_cpu;

- assert(!cpu_single_env);
+ /* assert(!cpu_single_env); */

while (penv) {
penv->stop = 0;






--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: Document mmu

2010-04-22 Thread Marcelo Tosatti
On Wed, Apr 21, 2010 at 04:09:21PM +0300, Avi Kivity wrote:
> Signed-off-by: Avi Kivity 
> ---
>  Documentation/kvm/mmu.txt |  296 
> +
>  1 files changed, 296 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/kvm/mmu.txt
> 
> diff --git a/Documentation/kvm/mmu.txt b/Documentation/kvm/mmu.txt
> new file mode 100644
> index 000..176f834
> --- /dev/null
> +++ b/Documentation/kvm/mmu.txt
> @@ -0,0 +1,296 @@
> +The x86 kvm shadow mmu
> +==
> +
> +The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible
> +for presenting a standard x86 mmu to the guest, while translating guest
> +physical addresses to host physical addresses.
> +
> +The mmu code attempts to satisfy the following requirements:
> +
> +- correctness: the guest should not be able to determine that it is running
> +   on an emulated mmu except for timing (we attempt to comply
> +   with the specification, not emulate the characteristics of
> +   a particular implementation such as tlb size)
> +- security:the guest must not be able to touch host memory not assigned
> +   to it
> +- performance: minimize the performance penalty imposed by the mmu
> +- scaling: need to scale to large memory and large vcpu guests
> +- hardware:support the full range of x86 virtualization hardware
> +- integration: Linux memory management code must be in control of guest memory
> +   so that swapping, page migration, page merging, transparent
> +   hugepages, and similar features work without change
> +- dirty tracking: report writes to guest memory to enable live migration
> +   and framebuffer-based displays
> +- footprint:   keep the amount of pinned kernel memory low (most memory
> +   should be shrinkable)
> +- reliability:  avoid multipage or GFP_ATOMIC allocations
> +
> +Acronyms
> +
> +
> +pfn   host page frame number
> +hpa   host physical address
> +hva   host virtual address
> +gfn   guest frame number
> +gpa   guest page address

guest physical address

> +gva   guest virtual address
> +ngpa  nested guest physical address
> +ngva  nested guest virtual address
> +pte   page table entry (used also to refer generically to paging structure
> +  entries)
> +gpte  guest pte (referring to gfns)
> +spte  shadow pte (referring to pfns)
> +tdp   two dimensional paging (vendor neutral term for NPT and EPT)
> +
> +Virtual and real hardware supported
> +===
> +
> +The mmu supports first-generation mmu hardware, which allows an atomic switch
> +of the current paging mode and cr3 during guest entry, as well as
> +two-dimensional paging (AMD's NPT and Intel's EPT).  The emulated hardware
> +it exposes is the traditional 2/3/4 level x86 mmu, with support for global
> +pages, pae, pse, pse36, cr0.wp, and 1GB pages.  Work is in progress to 
> support
> +exposing NPT capable hardware on NPT capable hosts.
> +
> +Translation
> +===
> +
> +The primary job of the mmu is to program the processor's mmu to translate
> +addresses for the guest.  Different translations are required at different
> +times:
> +
> +- when guest paging is disabled, we translate guest physical addresses to
> +  host physical addresses (gpa->hpa)
> +- when guest paging is enabled, we translate guest virtual addresses, to
> +  guest physical addresses, to host physical addresses (gva->gpa->hpa)
> +- when the guest launches a guest of its own, we translate nested guest
> +  virtual addresses, to nested guest physical addresses, to guest physical
> +  addresses, to host physical addresses (ngva->ngpa->gpa->hpa)
> +
> +The primary challenge is to encode between 1 and 3 translations into hardware
> +that support only 1 (traditional) and 2 (tdp) translations.  When the
> +number of required translations matches the hardware, the mmu operates in
> +direct mode; otherwise it operates in shadow mode (see below).
> +
> +Memory
> +==
> +
> +Guest memory (gpa) is part of user address space of the process that is using
> +kvm.  Userspace defines the translation between guest addresses and user
> +addresses (gpa->gva); 

gpa->hva

> note that two gpas may alias to the same gva, but not
> +vice versa.
> +
> +These gvas may be backed using any method available to the host: anonymous
> +memory, file backed memory, and device memory.  Memory might be paged by the
> +host at any time.
> +
> +Events
> +==
> +
> +The mmu is driven by events, some from the guest, some from the host.
> +
> +Guest generated events:
> +- writes to control registers (especially cr3)
> +- invlpg/invlpga instruction execution
> +- access to missing or protected translations
> +
> +Host generated events:
> +- changes in the gpa->hpa translation (either through gpa->hva changes or
> +  through hva->hpa changes)
> +- memory pressure (the shrinker)
> +
> +Shadow pages
> +
> +
> +T

Re: [PATCH 1/3] KVM MMU: make kvm_mmu_zap_page() return the number of zapped sp in total.

2010-04-22 Thread Gui Jianfeng
Xiao Guangrong wrote:
> 
> Gui Jianfeng wrote:
>> Currently, in kvm_mmu_change_mmu_pages(kvm, page), "used_pages--" is  
>> performed after calling
>> kvm_mmu_zap_page() in spite of that whether "page" is actually reclaimed. 
>> Because root sp won't be 
>> reclaimed by kvm_mmu_zap_page(). So making kvm_mmu_zap_page() return total 
>> number of reclaimed sp
>> makes more sense. A new flag is put into kvm_mmu_zap_page() to indicate 
>> whether the top page is reclaimed.
>>
> 
> This bug only hurts kvm_mmu_change_mmu_pages() function, we'd better allow 
> 'self_deleted' is
> NULL, then we can pass NULL at other place.

Ok, will change. Will send an updated version.

> 
>> @@ -1571,7 +1584,8 @@ restart:
>>  pgprintk("%s: gfn %lx role %x\n", __func__, gfn,
>>   sp->role.word);
>>  r = 1;
>> -if (kvm_mmu_zap_page(kvm, sp))
>> +ret = kvm_mmu_zap_page(kvm, sp, &self_deleted);
>> +if (ret > 1 || (ret == 1 && self_deleted == 0))
>>  goto restart;
> 
> Maybe we can keep kvm_mmu_zap_page() returns the number of zapped children,
> and 'self_deleted' indicates whether self is zapped, then we no need modify
> those function, just fix kvm_mmu_change_mmu_pages() that is if 'self_deleted 
> == 1',
> inc 'used_pages'

I think kvm_mmu_zap_page() returning the total zapped number is more intuitive,
so I'd prefer to retain the original code. Thanks.

Gui,
Thanks

> 
> Xiao
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-22 Thread Yoshiaki Tamura

Dor Laor wrote:

On 04/22/2010 04:16 PM, Yoshiaki Tamura wrote:

2010/4/22 Dor Laor:

On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:


Dor Laor wrote:


On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:


Hi all,

We have been implementing the prototype of Kemari for KVM, and we're
sending
this message to share what we have now and TODO lists. Hopefully, we
would like
to get early feedback to keep us in the right direction. Although
advanced
approaches in the TODO lists are fascinating, we would like to run
this project
step by step while absorbing comments from the community. The current
code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control
logic is
implemented in QEMU. However, we needed a hack in KVM to prevent rip
from
proceeding before synchronizing VMs. It may also need some
plumbing in
the
kernel side to guarantee replayability of certain events and
instructions,
integrate the RAS capabilities of newer x86 hardware with the HA
stack, as well
as for optimization purposes, for example.


[ snap]



The rest of this message describes TODO lists grouped by each topic.

=== event tapping ===

Event tapping is the core component of Kemari, and it decides on
which
event the
primary should synchronize with the secondary. The basic assumption
here is
that outgoing I/O operations are idempotent, which is usually true
for
disk I/O
and reliable network protocols such as TCP.


IMO any type of network event should be stalled too. What if the VM runs
a non-tcp protocol, and a packet that the master node sent reached some
remote client, and before the sync to the slave the master failed?


In the current implementation, it actually stalls any type of network
traffic that goes through virtio-net.

However, if the application is using unreliable protocols, it should
have its own recovery mechanism, or it should be completely
stateless.


Why do you treat tcp differently? You can damage the entire VM this way -
think of a dhcp request that was dropped at the moment you switched
between the master and the slave.


I'm not trying to say that we should treat tcp differently, just that
it's more severe.
In the case of a dhcp request, the client would have a chance to retry
after failover, correct?


But until it times out it won't have networking.


BTW, in the current implementation, it synchronizes before the dhcp ack
is sent.
But in the case of tcp, once you send an ack to the client before the
sync, there is no way to recover.


What if the guest is running a dhcp server? If we provide an IP to a
client and then fail over to the secondary, it will run without knowing
the master allocated this IP.


That's problematic.  So it needs to sync when the dhcp ack is sent.

I should apologize for my misunderstanding and explanation.  I agree that we 
should stall every type of network output.
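
In code terms, tapping in the generic net layer (as suggested for the
virtio patch) would cover every NIC model at once. A hypothetical sketch,
with an illustrative wrapper name; only qemu_send_packet() is an existing
function:

    /* hypothetical wrapper in the generic net layer */
    static void net_send_packet_tapped(VLANClientState *vc,
                                       const uint8_t *buf, int size)
    {
        do_event_tap();                /* sync the VMs before anything leaves */
        qemu_send_packet(vc, buf, size);
    }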







[snap]



=== clock ===

Since synchronizing the virtual machines every time the TSC is
accessed would be
prohibitive, the transmission of the TSC will be done lazily, which
means
delaying it until a non-TSC synchronization point arrives.


Why do you specifically care about the tsc sync? When you sync the whole
IO model on a snapshot it also synchronizes the tsc.


So, do you agree that an extra clock synchronization is not needed
since it
is done anyway as part of the live migration state sync?


I agree that it's sent as part of the live migration.
What I wanted to say here is that this is not something for real-time
applications.
I usually get questions like: can this guarantee fault tolerance for
real-time applications?


First, the huge cost of snapshots won't suit any real-time app.


I see.


Second, even if that weren't the case, the tsc delta and kvmclock are
synchronized as part of the VM state, so there is no use in trapping it
in the middle.


I should study the clock in KVM, but won't the tsc get updated by the HW after
migration?

I was wondering about the following case, for example (see the snippet after the list):

1. The application on the guest calls rdtsc on host A.
2. The application uses rdtsc value for something.
3. Failover to host B.
4. The application on the guest replays the rdtsc call on host B.
5. If the rdtsc value is different between A and B, the application may get into 
trouble because of it.
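
For reference, the guest-side pattern in question is plain rdtsc, e.g.
(user-space C with gcc-style inline assembly, for illustration only):

    static inline unsigned long long rdtsc(void)
    {
        unsigned int lo, hi;

        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long)hi << 32) | lo;
    }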


If I were wrong, my apologies.






In general, can you please explain the 'algorithm' for continuous
snapshots (is that what you'd like to do?):


Yes, of course.
Sorry for being less informative.


A trivial one would be to:
- do X online snapshots/sec


I don't have good numbers that I can share right now.
Snapshots/sec depends on what kind of workload is running; if the
guest is almost idle, there will be no snapshots in 5 sec. On the other
hand, if the guest is running I/O intensive workloads (netperf, iozone
for example), the

Re: using ftrace with kvm

2010-04-22 Thread Gleb Natapov
On Thu, Apr 22, 2010 at 02:53:45PM -0600, David S. Ahern wrote:
> I have a VM that is spinning (both vcpus at 100%). As I recall kvm_stat
> has been deprecated in favor of ftrace. Is there a wiki page or document
> that gives suggestions on this?
> 
kvmtrace was deprecated in favor of ftrace. kvm_stat still works.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3 v2] KVM MMU: make kvm_mmu_zap_page() return the number of zapped sp in total.

2010-04-22 Thread Gui Jianfeng
Currently, in kvm_mmu_change_mmu_pages(kvm, page), "used_pages--" is performed
after calling kvm_mmu_zap_page() regardless of whether "page" is actually
reclaimed, even though a root sp won't be reclaimed by kvm_mmu_zap_page(). So
making kvm_mmu_zap_page() return the total number of reclaimed sps makes more
sense. A new flag is passed to kvm_mmu_zap_page() to indicate whether the top
page was reclaimed.

Signed-off-by: Gui Jianfeng 
---
 arch/x86/kvm/mmu.c |   53 +++
 1 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7a17db1..d0960f1 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1195,12 +1195,13 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
--kvm->stat.mmu_unsync;
 }
 
-static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+   int *self_deleted);
 
 static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
if (sp->role.cr4_pae != !!is_pae(vcpu)) {
-   kvm_mmu_zap_page(vcpu->kvm, sp);
+   kvm_mmu_zap_page(vcpu->kvm, sp, NULL);
return 1;
}
 
@@ -1209,7 +1210,7 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
kvm_flush_remote_tlbs(vcpu->kvm);
kvm_unlink_unsync_page(vcpu->kvm, sp);
if (vcpu->arch.mmu.sync_page(vcpu, sp)) {
-   kvm_mmu_zap_page(vcpu->kvm, sp);
+   kvm_mmu_zap_page(vcpu->kvm, sp, NULL);
return 1;
}
 
@@ -1480,7 +1481,7 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
struct kvm_mmu_page *sp;
 
for_each_sp(pages, sp, parents, i) {
-   kvm_mmu_zap_page(kvm, sp);
+   kvm_mmu_zap_page(kvm, sp, NULL);
mmu_pages_clear_parents(&parents);
zapped++;
}
@@ -1490,7 +1491,8 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
return zapped;
 }
 
-static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+   int *self_deleted)
 {
int ret;
 
@@ -1507,11 +1509,16 @@ static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
if (!sp->root_count) {
hlist_del(&sp->hash_link);
kvm_mmu_free_page(kvm, sp);
+   /* Count self */
+   ret++;
+   if (self_deleted)
+   *self_deleted = 1;
} else {
sp->role.invalid = 1;
list_move(&sp->link, &kvm->arch.active_mmu_pages);
kvm_reload_remote_mmus(kvm);
}
+
kvm_mmu_reset_last_pte_updated(kvm);
return ret;
 }
@@ -1540,8 +1547,7 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages)
 
page = container_of(kvm->arch.active_mmu_pages.prev,
struct kvm_mmu_page, link);
-   used_pages -= kvm_mmu_zap_page(kvm, page);
-   used_pages--;
+   used_pages -= kvm_mmu_zap_page(kvm, page, NULL);
}
kvm_nr_mmu_pages = used_pages;
kvm->arch.n_free_mmu_pages = 0;
@@ -1560,6 +1566,8 @@ static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
struct kvm_mmu_page *sp;
struct hlist_node *node, *n;
int r;
+   int self_deleted = 0;
+   int ret;
 
pgprintk("%s: looking for gfn %lx\n", __func__, gfn);
r = 0;
@@ -1571,7 +1579,8 @@ restart:
pgprintk("%s: gfn %lx role %x\n", __func__, gfn,
 sp->role.word);
r = 1;
-   if (kvm_mmu_zap_page(kvm, sp))
+   ret = kvm_mmu_zap_page(kvm, sp, &self_deleted);
+   if (ret > 1 || (ret == 1 && self_deleted == 0))
goto restart;
}
return r;
@@ -1583,6 +1592,8 @@ static void mmu_unshadow(struct kvm *kvm, gfn_t gfn)
struct hlist_head *bucket;
struct kvm_mmu_page *sp;
struct hlist_node *node, *nn;
+   int ret;
+   int self_deleted = 0;
 
index = kvm_page_table_hashfn(gfn);
bucket = &kvm->arch.mmu_page_hash[index];
@@ -1592,7 +1603,8 @@ restart:
&& !sp->role.invalid) {
pgprintk("%s: zap %lx %x\n",
 __func__, gfn, sp->role.word);
-   if (kvm_mmu_zap_page(kvm, sp))
+   ret = kvm_mmu_zap_page(kvm, sp, &self_deleted);
+   if (ret > 1 || (ret == 1 && self_deleted ==