Re: [PATCH 0/2] vhost-net: Use kvm_memslots instead of vhost_memory to translate GPA to HVA

2011-12-16 Thread Sasha Levin
On Fri, 2011-12-16 at 15:40 +0800, Zang Hongyong wrote:
 On 2011/12/16, Friday 15:05, Sasha Levin wrote:
  On Fri, 2011-12-16 at 13:32 +0800, zanghongy...@huawei.com wrote:
  From: Hongyong Zang zanghongy...@huawei.com
 
  Vhost-net uses its own vhost_memory, which is built from user-space (qemu)
  information, to translate GPA to HVA. Since the kernel's kvm structure already
  maintains the address relationship in its member *kvm_memslots*, these patches
  use the kernel's kvm_memslots directly, without the need to initialize and
  maintain vhost_memory.
  Conceptually, vhost isn't aware of KVM - it's just a driver which moves
  data from vq to a tap device and back. You can't simply add KVM specific
  code into vhost.
 
  What's the performance benefit?
 
 But vhost-net is only used in virtualization situations. vhost_memory is
 maintained by user-space qemu.
 In this way, the memory relationship can be acquired from the kernel
 without the need to maintain vhost_memory from qemu.

You can't assume that vhost-* is used only along with qemu/kvm. Just as
virtio has more uses than just virtualization (here's one:
https://lkml.org/lkml/2011/10/25/139 ), there are more uses for vhost as
well.

There has been a great deal of effort to keep vhost and kvm untangled.
One example is the memory translation it has to do, another one is the
eventfd/irqfd thing it does just so it could signal an IRQ in the guest
instead of accessing the guest directly.

If you do see a great performance increase when tying vhost and KVM
together, it may be worth it to create some sort of an in-kernel
vhost-kvm bridging thing, but if the performance isn't noticeable we're
better off just leaving it as is and keeping the vhost code general.
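
[Editor's note: for readers outside the thread, the GPA-to-HVA layout reaches vhost from userspace over the vhost chardev, which is exactly why vhost can stay KVM-agnostic. Below is a minimal sketch of that setup using the uapi in <linux/vhost.h>; it is illustrative only, assumes a vhost fd that has already had VHOST_SET_OWNER called on it, and trims error handling.]

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Describe one guest RAM region to vhost: GPA range -> HVA range. */
static int vhost_set_one_region(int vhost_fd, uint64_t gpa, uint64_t size, void *hva)
{
	struct vhost_memory *mem;
	int ret;

	/* struct vhost_memory ends with a flexible array of regions */
	mem = calloc(1, sizeof(*mem) + sizeof(struct vhost_memory_region));
	if (!mem)
		return -1;

	mem->nregions = 1;
	mem->regions[0] = (struct vhost_memory_region) {
		.guest_phys_addr = gpa,
		.memory_size     = size,
		.userspace_addr  = (uintptr_t)hva,
	};

	ret = ioctl(vhost_fd, VHOST_SET_MEM_TABLE, mem);
	free(mem);
	return ret;
}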

-- 

Sasha.



Re: [PATCH 0/2] vhost-net: Use kvm_memslots instead of vhost_memory to translate GPA to HVA

2011-12-16 Thread Sasha Levin
On Fri, 2011-12-16 at 09:59 +0200, Sasha Levin wrote:
 There has been a great deal of effort to keep vhost and kvm untangled.
 One example is the memory translation it has to do, another one is the
 eventfd/irqfd thing it does just so it could signal an IRQ in the guest
 instead of accessing the guest directly.

Actually, CONFIG_VHOST_NET doesn't even depend on CONFIG_KVM, so your
patch will break the build when (CONFIG_VHOST_NET=y && CONFIG_KVM=n).

-- 

Sasha.



Re: [PATCH 0/2] vhost-net: Use kvm_memslots instead of vhost_memory to translate GPA to HVA

2011-12-16 Thread Zang Hongyong
On 2011/12/16, Friday 15:59, Sasha Levin wrote:
 On Fri, 2011-12-16 at 15:40 +0800, Zang Hongyong wrote:
 On 2011/12/16, Friday 15:05, Sasha Levin wrote:
 On Fri, 2011-12-16 at 13:32 +0800, zanghongy...@huawei.com wrote:
 From: Hongyong Zang zanghongy...@huawei.com

 Vhost-net uses its own vhost_memory, which is built from user-space (qemu)
 information, to translate GPA to HVA. Since the kernel's kvm structure already
 maintains the address relationship in its member *kvm_memslots*, these patches
 use the kernel's kvm_memslots directly, without the need to initialize and
 maintain vhost_memory.
 Conceptually, vhost isn't aware of KVM - it's just a driver which moves
 data from vq to a tap device and back. You can't simply add KVM specific
 code into vhost.

 What's the performance benefit?

 But vhost-net is only used in virtualization situations. vhost_memory is
 maintained by user-space qemu.
 In this way, the memory relationship can be acquired from the kernel
 without the need to maintain vhost_memory from qemu.
 You can't assume that vhost-* is used only along with qemu/kvm. Just as
 virtio has more uses than just virtualization (here's one:
 https://lkml.org/lkml/2011/10/25/139 ), there are more uses for vhost as
 well.

 There has been a great deal of effort to keep vhost and kvm untangled.
 One example is the memory translation it has to do, another one is the
 eventfd/irqfd thing it does just so it could signal an IRQ in the guest
 instead of accessing the guest directly.

 If you do see a great performance increase when tying vhost and KVM
 together, it may be worth it to create some sort of an in-kernel
 vhost-kvm bridging thing, but if the performance isn't noticeable we're
 better off just leaving it as is and keeping the vhost code general.

Thanks for your explanation.
Since the memory layout seldom changes after the guest boots, this situation
mainly occurs during initialization. There's no need for a vhost-kvm
bridge now.



Re: [PATCH] kvm tools: Make the whole guest memory mergeable

2011-12-16 Thread Zang Hongyong
On 2011/12/16, Friday 15:23, Sasha Levin wrote:
 On Fri, 2011-12-16 at 15:02 +0800, Zang Hongyong wrote:
 On 2011/12/16, Friday 13:50, Sasha Levin wrote:
 On Fri, 2011-12-16 at 09:01 +0800, zanghongy...@huawei.com wrote:
 If a guest's ram_size exceeds KVM_32BIT_GAP_START, the corresponding kvm
 tool's virtual address size should be (ram_size + KVM_32BIT_GAP_SIZE),
 rather than ram_size.
 You're right.

 There are more places than just the madvise() code which make the same
 error you've spotted (for example, the memslot allocation code), so
 instead of trying to fix all of them I'd suggest to just update ram_size
 in kvm__arch_init() before allocating everything - that should fix all
 of them at once.

 Yes. There are other scenarios with the same error.
 However, ram_size sometimes means the real guest ram size, and sometimes
 means the virtual address size of the kvm tool's user space. Shall we
 define a new variable?
 Let's keep it simple. If the user requests more RAM than
 KVM_32BIT_GAP_START, just increase it by KVM_32BIT_GAP_SIZE; this way
 mapped size == guest size always (we can madvise(MADV_DONTNEED) the gap
 in the mmapped ram).

 Since a user who requests more than KVM_32BIT_GAP_START will have to
 be on a 64-bit host anyway, there shouldn't be any issue with that.

Do you mean increase *kvm->ram_size* by KVM_32BIT_GAP_SIZE?
But sometimes kvm->ram_size stands for the guest physical ram size (for
example in the kvm__init_ram() code).



[PATCH] kvm tools: Trivial cleanup

2011-12-16 Thread Sasha Levin
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/disk/qcow.c  |   26 --
 tools/kvm/ioport.c |7 ++-
 tools/kvm/pci.c|6 --
 tools/kvm/symbol.c |   18 +-
 tools/kvm/term.c   |1 -
 tools/kvm/util/util.c  |3 +--
 tools/kvm/virtio/balloon.c |6 +++---
 tools/kvm/virtio/core.c|1 -
 tools/kvm/virtio/net.c |   14 +++---
 tools/kvm/virtio/pci.c |9 -
 tools/kvm/virtio/rng.c |   27 +--
 tools/kvm/x86/bios.c   |6 +++---
 tools/kvm/x86/cpuid.c  |7 ---
 tools/kvm/x86/interrupt.c  |3 ++-
 tools/kvm/x86/kvm-cpu.c|   39 +++
 tools/kvm/x86/kvm.c|   23 ++-
 16 files changed, 89 insertions(+), 107 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index e139fa5..b12fe53 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -329,18 +329,16 @@ static ssize_t qcow1_read_cluster(struct qcow *q, u64 offset,
 		csize &= (q->cluster_size - 1);
 
 		if (pread_in_full(q->fd, q->cluster_data, csize,
-				  coffset) < 0) {
+				  coffset) < 0)
 			goto out_error;
-		}
 
 		if (qcow_decompress_buffer(q->cluster_cache, q->cluster_size,
-					   q->cluster_data, csize) < 0) {
+					   q->cluster_data, csize) < 0)
 			goto out_error;
-		}
 
 		memcpy(dst, q->cluster_cache + clust_offset, length);
 		mutex_unlock(&q->mutex);
-	} else{
+	} else {
 		if (!clust_start)
 			goto zero_cluster;
 
@@ -435,7 +433,7 @@ static ssize_t qcow2_read_cluster(struct qcow *q, u64 offset,
 
 		memcpy(dst, q->cluster_cache + clust_offset, length);
 		mutex_unlock(&q->mutex);
-	} else{
+	} else {
 		clust_start &= QCOW2_OFFSET_MASK;
 		if (!clust_start)
 			goto zero_cluster;
@@ -470,11 +468,11 @@ static ssize_t qcow_read_sector_single(struct disk_image *disk, u64 sector,
 	char *buf;
 	u32 nr;
 
-	buf = dst;
-	nr_read = 0;
+	buf = dst;
+	nr_read = 0;
 
 	while (nr_read < dst_len) {
-		offset  = sector << SECTOR_SHIFT;
+		offset = sector << SECTOR_SHIFT;
 		if (offset >= header->size)
 			return -1;
 
@@ -488,9 +486,9 @@ static ssize_t qcow_read_sector_single(struct disk_image *disk, u64 sector,
 		if (nr <= 0)
 			return -1;
 
-		nr_read += nr;
-		buf += nr;
-		sector  += (nr >> SECTOR_SHIFT);
+		nr_read += nr;
+		buf += nr;
+		sector  += (nr >> SECTOR_SHIFT);
}
 
return dst_len;
@@ -508,9 +506,9 @@ static ssize_t qcow_read_sector(struct disk_image *disk, u64 sector,
 			return -1;
 		}
 
-		sector  += iov->iov_len >> SECTOR_SHIFT;
+		sector += iov->iov_len >> SECTOR_SHIFT;
+		total += nr;
 		iov++;
-		total   += nr;
}
 
return total;
diff --git a/tools/kvm/ioport.c b/tools/kvm/ioport.c
index 965cfc2..b417942 100644
--- a/tools/kvm/ioport.c
+++ b/tools/kvm/ioport.c
@@ -111,13 +111,10 @@ bool kvm__emulate_io(struct kvm *kvm, u16 port, void *data, int direction, int s
 	ops = entry->ops;
 
 	while (count--) {
-		if (direction == KVM_EXIT_IO_IN) {
-			if (ops->io_in)
+		if (direction == KVM_EXIT_IO_IN && ops->io_in)
 			ret = ops->io_in(entry, kvm, port, ptr, size);
-		} else {
-			if (ops->io_out)
+		else if (ops->io_out)
 			ret = ops->io_out(entry, kvm, port, ptr, size);
-		}
 
ptr += size;
}
diff --git a/tools/kvm/pci.c b/tools/kvm/pci.c
index 59b2618..06eea0f 100644
--- a/tools/kvm/pci.c
+++ b/tools/kvm/pci.c
@@ -160,10 +160,12 @@ void pci__config_rd(struct kvm *kvm, union pci_config_address addr, void *data,
 			void *p = &pci_devices[dev_num];
 
 			memcpy(data, p + offset, size);
-		} else
+		} else {
 			memset(data, 0x00, size);
-	} else
+		}
+	} else {
 		memset(data, 0xff, size);
+	}
 }
 
 void pci__register(struct pci_device_header *dev, u8 dev_num)
diff --git a/tools/kvm/symbol.c b/tools/kvm/symbol.c
index 56dd346..a2b1e67 100644
--- a/tools/kvm/symbol.c
+++ b/tools/kvm/symbol.c
@@ -7,7 

Re: [PATCH] kvm tools: Make the whole guest memory mergeable

2011-12-16 Thread Sasha Levin
On Fri, 2011-12-16 at 16:36 +0800, Zang Hongyong wrote:
 On 2011/12/16, Friday 15:23, Sasha Levin wrote:
  On Fri, 2011-12-16 at 15:02 +0800, Zang Hongyong wrote:
  On 2011/12/16, Friday 13:50, Sasha Levin wrote:
  On Fri, 2011-12-16 at 09:01 +0800, zanghongy...@huawei.com wrote:
  If a guest's ram_size exceeds KVM_32BIT_GAP_START, the corresponding kvm
  tool's virtual address size should be (ram_size + KVM_32BIT_GAP_SIZE),
  rather than ram_size.
  You're right.
 
  There are more places than just the madvise() code which make the same
  error you've spotted (for example, the memslot allocation code), so
  instead of trying to fix all of them I'd suggest to just update ram_size
  in kvm__arch_init() before allocating everything - that should fix all
  of them at once.
 
  Yes. There are other scenarios with the same error.
  However ram_size sometimes means real guest ram size, and sometimes 
  means virtual address
  size of kvm tool's user space. Shall we define a new variable?
  Let's keep it simple. If the user requests more RAM than
  KVM_32BIT_GAP_START, just increase it by KVM_32BIT_GAP_SIZE; this way
  mapped size == guest size always (we can madvise(MADV_DONTNEED) the gap
  in the mmapped ram).
 
  Since a user who requests more than KVM_32BIT_GAP_START will have to
  be on a 64-bit host anyway, there shouldn't be any issue with that.
 
 Do you mean increase *kvm->ram_size* by KVM_32BIT_GAP_SIZE?
 But sometimes kvm->ram_size stands for the guest physical ram size (for
 example in the kvm__init_ram() code).

Yup, kvm->ram_size.

If the user requested more than KVM_32BIT_GAP_START, we pretty much have
to create the gap, so instead of playing around with different
interpretations of ram_size, let's add the gap size - this will let us
have just one ram_size.

mmap()ing extra space for the gap is free, and that was the plan in the
first place (we just got the math wrong :) ).

Do you see an issue with increasing kvm->ram_size?
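
[Editor's note: to make the proposal concrete, here is a rough sketch of the idea being discussed. It is illustrative only, not the actual kvm tool code; it assumes the kvm tool's struct kvm, its u64 typedef and the KVM_32BIT_GAP_START/KVM_32BIT_GAP_SIZE constants, and skips error handling.]

#include <sys/mman.h>

/* Sketch: if the requested RAM crosses the 32-bit MMIO gap, grow
 * kvm->ram_size by the gap so that mapped size == kvm->ram_size always,
 * then hand the gap pages back to the host. */
static void kvm__arch_init_ram_sketch(struct kvm *kvm, u64 ram_size)
{
	if (ram_size > KVM_32BIT_GAP_START)
		ram_size += KVM_32BIT_GAP_SIZE;

	kvm->ram_size  = ram_size;
	kvm->ram_start = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
			      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
			      -1, 0);

	if (kvm->ram_size > KVM_32BIT_GAP_START)
		/* the gap never backs guest RAM, so drop its pages */
		madvise((char *)kvm->ram_start + KVM_32BIT_GAP_START,
			KVM_32BIT_GAP_SIZE, MADV_DONTNEED);
}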

-- 

Sasha.



Re: [PATCH] kvm tools: Trivial cleanup

2011-12-16 Thread Cyrill Gorcunov
On Fri, Dec 16, 2011 at 10:40:06AM +0200, Sasha Levin wrote:
 Signed-off-by: Sasha Levin levinsasha...@gmail.com
 ---

Thanks, Sasha!

Cyrill


[PATCH] kvm tools: Fix 'make kvmconfig'

2011-12-16 Thread Sasha Levin
Set features which depend on the config we select to make 'make kvmconfig'
non-interactive.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 arch/x86/Kconfig |1 +
 scripts/kconfig/Makefile |1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8f74dc0..d553840 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -522,6 +522,7 @@ config KVMTOOL_TEST_ENABLE
select DEBUG_KERNEL
select KGDB
select KGDB_SERIAL_CONSOLE
+   select VIRTUALIZATION
select VIRTIO
select VIRTIO_RING
select VIRTIO_PCI
diff --git a/scripts/kconfig/Makefile b/scripts/kconfig/Makefile
index 65580c2..58a4259 100644
--- a/scripts/kconfig/Makefile
+++ b/scripts/kconfig/Makefile
@@ -35,6 +35,7 @@ silentoldconfig: $(obj)/conf
 
 kvmconfig:
$(Q)$(CONFIG_SHELL) $(srctree)/scripts/config -e KVMTOOL_TEST_ENABLE
	$(Q)yes "" | make oldconfig > /dev/null
@echo 'Kernel configuration modified to run as KVM guest.'
 
 # if no path is given, then use src directory to find file
-- 
1.7.8



Re: [PATCH] kvm tools: Fix 'make kvmconfig'

2011-12-16 Thread Ingo Molnar

* Sasha Levin levinsasha...@gmail.com wrote:

 Set features which depend on the config we select to make 'make kvmconfig'
 non-interactive.
 
 Signed-off-by: Sasha Levin levinsasha...@gmail.com

Acked-by: Ingo Molnar mi...@elte.hu

Thanks,

Ingo


Re: [PATCH] kvm tools: Make the whole guest memory mergeable

2011-12-16 Thread Zang Hongyong
On 2011/12/16, Friday 16:45, Sasha Levin wrote:
 On Fri, 2011-12-16 at 16:36 +0800, Zang Hongyong wrote:
 On 2011/12/16, Friday 15:23, Sasha Levin wrote:
 On Fri, 2011-12-16 at 15:02 +0800, Zang Hongyong wrote:
 On 2011/12/16, Friday 13:50, Sasha Levin wrote:
 On Fri, 2011-12-16 at 09:01 +0800, zanghongy...@huawei.com wrote:
 If a guest's ram_size exceeds KVM_32BIT_GAP_START, the corresponding kvm
 tool's virtual address size should be (ram_size + KVM_32BIT_GAP_SIZE),
 rather than ram_size.
 You're right.

 There are more places than just the madvise() code which make the same
 error you've spotted (for example, the memslot allocation code), so
 instead of trying to fix all of them I'd suggest to just update ram_size
 in kvm__arch_init() before allocating everything - that should fix all
 of them at once.

 Yes. There are other scenarios with the same error.
 However ram_size sometimes means real guest ram size, and sometimes 
 means virtual address
 size of kvm tool's user space. Shall we define a new variable?
 Let's keep it simple. If the user requests more RAM than
 KVM_32BIT_GAP_START just increase it by KVM_32BIT_GAP_SIZE, this way
 mapped size == guest size always (we can madvise(MADV_DONTNEED) the gap
 in the mmapped ram).

 Since a user who requests more than KVM_32BIT_GAP_START will have to
 be on a 64-bit host anyway, there shouldn't be any issue with that.

 Do you mean increase *kvm->ram_size* by KVM_32BIT_GAP_SIZE?
 But sometimes kvm->ram_size stands for the guest physical ram size (for
 example in the kvm__init_ram() code).
 Yup, kvm->ram_size.

 If the user requested more than KVM_32BIT_GAP_START, we pretty much have
 to create the gap, so instead of playing around with different
 interpretations of ram_size, let's add the gap size - this will let us
 have just one ram_size.

 mmap()ing extra space for the gap is free, and that was the plan in the
 first place (we just got the math wrong :) ).

 Do you see an issue with increasing kvm-ram_size?

Yes, simply increasing kvm->ram_size will cause some problems.
For example:
In the kvm__init_ram() code we use kvm->ram_size to calculate the size of
the second RAM range from 4GB to the end of RAM (phys_size = kvm->ram_size -
phys_size;), so after increasing kvm->ram_size it will go wrong.
This problem also happens in the e820_setup() code and load_bzimage() code.



Re: [Qemu-devel] [PATCH] virtio-serial: Allow one MSI-X vector per virtqueue

2011-12-16 Thread Amit Shah
On (Fri) 16 Dec 2011 [09:14:26], zanghongy...@huawei.com wrote:
 From: Hongyong Zang zanghongy...@huawei.com
 
 In pci_enable_msix(), the guest's virtio-serial driver tries to set msi-x
 with one vector per queue. But it fails and eventually all virtio-serial
 ports share one MSI-X vector. Because every virtio-serial port has *two*
 virtqueues, virtio-serial needs (port+1)*2 vectors rather than (port+1).

Ouch, good catch.

One comment below:

 This patch allows every virtqueue to have its own MSI-X vector.
 (When the MSI-X vectors needed are more than MSIX_MAX_ENTRIES defined in
 qemu: msix.c, all the queues still share one MSI-X vector as before.)
 
 Signed-off-by: Hongyong Zang zanghongy...@huawei.com
 ---
  hw/virtio-pci.c |5 -
  1 files changed, 4 insertions(+), 1 deletions(-)
 
 diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
 index 77b75bc..2c9c6fb 100644
 --- a/hw/virtio-pci.c
 +++ b/hw/virtio-pci.c
 @@ -718,8 +718,11 @@ static int virtio_serial_init_pci(PCIDevice *pci_dev)
  return -1;
  }
     vdev->nvectors = proxy->nvectors == DEV_NVECTORS_UNSPECIFIED
-                      ? proxy->serial.max_virtserial_ports + 1
+                      ? (proxy->serial.max_virtserial_ports + 1) * 2
                       : proxy->nvectors;
+    /* msix.c: #define MSIX_MAX_ENTRIES 32 */
+    if (vdev->nvectors > 32)
+        vdev->nvectors = 32;

This change isn't needed: if the proxy->nvectors value exceeds the max
allowed, virtio_init_pci() will end up using a shared vector instead
of separate ones.

Thanks,

Amit


Re: [PATCH] kvm tools: Make the whole guest memory mergeable

2011-12-16 Thread Sasha Levin
On Fri, 2011-12-16 at 17:33 +0800, Zang Hongyong wrote:
  Do you see an issue with increasing kvm-ram_size?
 
 Yes, simply increasing kvm->ram_size will cause some problems.
 For example:
 In the kvm__init_ram() code we use kvm->ram_size to calculate the size of
 the second RAM range from 4GB to the end of RAM (phys_size = kvm->ram_size -
 phys_size;), so after increasing kvm->ram_size it will go wrong.
 This problem also happens in the e820_setup() code and load_bzimage() code.

Yup, but fixing it is much easier than having two different sizes of the same 
thing.

For example, the fix for the problem in kvm__init_ram() (and e820_setup()) 
would be:

@@ -112,7 +112,7 @@ void kvm__init_ram(struct kvm *kvm)
/* Second RAM range from 4GB to the end of RAM: */
 
phys_start = 0x1ULL;
-	phys_size  = kvm->ram_size - phys_size;
+	phys_size  = kvm->ram_size - phys_start;
 	host_mem   = kvm->ram_start + phys_start;
 
kvm__register_mem(kvm, phys_start, phys_size, host_mem);

I basically want one memory map with one size which includes *everything*; even
if that memory map includes a gap in the middle, I still want the total size to
include that gap.

btw, what problem do you see in load_bzimage()?

-- 

Sasha.



[PATCH 0/8] KVM: MMU: reduce the size of shadow page structure and some cleanup

2011-12-16 Thread Xiao Guangrong
In this patchset, we noticeably reduce the size of struct kvm_mmu_page and
make the code simpler.

This patchset has already been tested by unit tests and
RHEL 6.1 setup/boot/reboot/shutdown.



[PATCH 1/8] KVM: MMU: combine unsync and unsync_children

2011-12-16 Thread Xiao Guangrong
unsync is only used for pages of level == 1 and unsync_children is only
used in upper pages, so combine them to reduce the size of the shadow page
structure.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 Documentation/virtual/kvm/mmu.txt |5 ++-
 arch/x86/include/asm/kvm_host.h   |   14 +++-
 arch/x86/kvm/mmu.c|   39 +---
 arch/x86/kvm/mmu_audit.c  |6 ++--
 arch/x86/kvm/mmutrace.h   |3 +-
 arch/x86/kvm/paging_tmpl.h|2 +-
 6 files changed, 48 insertions(+), 21 deletions(-)

diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index 5dc972c..4a5fedd 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -205,14 +205,15 @@ Shadow pages contain the following information:
 this page's spt.  Otherwise, parent_ptes points at a data structure
 with a list of parent_ptes.
   unsync:
+It is used when role.level == 1, only level 1 shadow pages can be unsync.
 If true, then the translations in this page may not match the guest's
 translation.  This is equivalent to the state of the tlb when a pte is
 changed but before the tlb entry is flushed.  Accordingly, unsync ptes
 are synchronized when the guest executes invlpg or flushes its tlb by
 other means.  Valid for leaf pages.
   unsync_children:
-How many sptes in the page point at pages that are unsync (or have
-unsynchronized children).
+It is used when role.level > 1 and indicates how many sptes in the page
+point at unsync pages or unsynchronized children.
   unsync_child_bitmap:
 A bitmap indicating which sptes in spt point (directly or indirectly) at
 pages that may be unsynchronized.  Used to quickly locate all unsychronized
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 52d6640..c0a89cd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -233,9 +233,19 @@ struct kvm_mmu_page {
 * in this shadow page.
 */
DECLARE_BITMAP(slot_bitmap, KVM_MEM_SLOTS_NUM);
-   bool unsync;
int root_count;  /* Currently serving as active root */
-   unsigned int unsync_children;
+
+   /*
+* If role.level == 1, unsync indicates whether the sp is
+* unsync, otherwise unsync_children indicates the number
+	 * of sptes which point at unsync sp or unsynchronized children.
+* See sp_is_unsync() / sp_unsync_children_num.
+*/
+   union {
+   bool unsync;
+   unsigned int unsync_children;
+   };
+
unsigned long parent_ptes;  /* Reverse mapping for parent_pte */
DECLARE_BITMAP(unsync_child_bitmap, 512);

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2a2a9b4..6913a16 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -151,6 +151,21 @@ module_param(dbg, bool, 0644);

 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)

+static bool sp_is_unsync(struct kvm_mmu_page *sp)
+{
+	return sp->role.level == PT_PAGE_TABLE_LEVEL && sp->unsync;
+}
+
+static unsigned int sp_unsync_children_num(struct kvm_mmu_page *sp)
+{
+	unsigned int num = 0;
+
+	if (sp->role.level != PT_PAGE_TABLE_LEVEL)
+		num = sp->unsync_children;
+
+	return num;
+}
+
 struct pte_list_desc {
u64 *sptes[PTE_LIST_EXT];
struct pte_list_desc *more;
@@ -1401,7 +1416,7 @@ static int mmu_pages_add(struct kvm_mmu_pages *pvec, struct kvm_mmu_page *sp,
 {
 	int i;
 
-	if (sp->unsync)
+	if (sp_is_unsync(sp))
 		for (i=0; i < pvec->nr; i++)
 			if (pvec->page[i].sp == sp)
 				return 0;
@@ -1426,7 +1441,7 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,

 		child = page_header(ent & PT64_BASE_ADDR_MASK);

-		if (child->unsync_children) {
+   if (sp_unsync_children_num(child)) {
if (mmu_pages_add(pvec, child, i))
return -ENOSPC;

@@ -1437,7 +1452,7 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
nr_unsync_leaf += ret;
else
return ret;
-		} else if (child->unsync) {
+   } else if (sp_is_unsync(child)) {
nr_unsync_leaf++;
if (mmu_pages_add(pvec, child, i))
return -ENOSPC;
@@ -1468,7 +1483,7 @@ static int mmu_unsync_walk(struct kvm_mmu_page *sp,

 static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	WARN_ON(!sp->unsync);
+	WARN_ON(!sp_is_unsync(sp));
 	trace_kvm_mmu_sync_page(sp);
 	sp->unsync = 0;
 	--kvm->stat.mmu_unsync;
@@ -1546,7 +1561,7 @@ static void kvm_sync_pages(struct kvm_vcpu *vcpu,  gfn_t 

[PATCH 2/8] KVM: MMU: set the dirty bit for the upper shadow page

2011-12-16 Thread Xiao Guangrong
Upper pages do not need the status bit tracked; it is safe to set the
dirty bit and let the cpu happily prefetch it.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   13 +
 1 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6913a16..a2d28aa 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -150,6 +150,9 @@ module_param(dbg, bool, 0644);
 #define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)

 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
+#define SHADOW_PAGE_TABLE  \
+   (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask |\
+shadow_x_mask | shadow_accessed_mask | shadow_dirty_mask)

 static bool sp_is_unsync(struct kvm_mmu_page *sp)
 {
@@ -1808,9 +1811,7 @@ static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
 {
 	u64 spte;
 
-	spte = __pa(sp->spt)
-		| PT_PRESENT_MASK | PT_ACCESSED_MASK
-		| PT_WRITABLE_MASK | PT_USER_MASK;
+	spte = __pa(sp->spt) | SHADOW_PAGE_TABLE;
mmu_spte_set(sptep, spte);
 }

@@ -2497,11 +2498,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
return -ENOMEM;
}

-   mmu_spte_set(iterator.sptep,
-			     __pa(sp->spt)
-| PT_PRESENT_MASK | PT_WRITABLE_MASK
-| shadow_user_mask | shadow_x_mask
-| shadow_accessed_mask);
+   link_shadow_page(iterator.sptep, sp);
}
}
return emulate;
-- 
1.7.7.4



[PATCH 3/8] KVM: MMU: do not add a nonpresent spte to rmaps of its child

2011-12-16 Thread Xiao Guangrong
Set the spte before adding it to the rmap of its child so that all parent
sptes are valid when propagating the unsync bit from an unsync page / children page.

And this feature is needed by a later patch.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |   74 +++
 arch/x86/kvm/mmutrace.h|2 +-
 arch/x86/kvm/paging_tmpl.h |   14 +++-
 3 files changed, 32 insertions(+), 58 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a2d28aa..89202f4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1321,12 +1321,14 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
 	return gfn & ((1 << KVM_MMU_HASH_SHIFT) - 1);
 }

-static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu,
-   struct kvm_mmu_page *sp, u64 *parent_pte)
+static void mmu_page_add_set_parent_pte(struct kvm_vcpu *vcpu,
+   struct kvm_mmu_page *sp,
+   u64 *parent_pte)
 {
if (!parent_pte)
return;

+	mmu_spte_set(parent_pte, __pa(sp->spt) | SHADOW_PAGE_TABLE);
 	pte_list_add(vcpu, parent_pte, &sp->parent_ptes);
 }

@@ -1357,7 +1359,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
 	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
 	bitmap_zero(sp->slot_bitmap, KVM_MEM_SLOTS_NUM);
 	sp->parent_ptes = 0;
-	mmu_page_add_parent_pte(vcpu, sp, parent_pte);
+	mmu_page_add_set_parent_pte(vcpu, sp, parent_pte);
 	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
return sp;
 }
@@ -1690,13 +1692,10 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sp);
 }

-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
-gfn_t gfn,
-gva_t gaddr,
-unsigned level,
-int direct,
-unsigned access,
-u64 *parent_pte)
+static struct kvm_mmu_page *
+kvm_mmu_get_set_page(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gaddr,
+unsigned level, int direct, unsigned access,
+u64 *parent_pte)
 {
union kvm_mmu_page_role role;
unsigned quadrant;
@@ -1726,7 +1725,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		if (sp_is_unsync(sp) && kvm_sync_page_transient(vcpu, sp))
break;

-   mmu_page_add_parent_pte(vcpu, sp, parent_pte);
+   mmu_page_add_set_parent_pte(vcpu, sp, parent_pte);
if (sp_unsync_children_num(sp)) {
kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
kvm_mmu_mark_parents_unsync(sp);
@@ -1734,7 +1733,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
kvm_mmu_mark_parents_unsync(sp);

__clear_sp_write_flooding_count(sp);
-   trace_kvm_mmu_get_page(sp, false);
+   trace_kvm_mmu_get_set_page(sp, false);
return sp;
}
 	++vcpu->kvm->stat.mmu_cache_miss;
@@ -1754,7 +1753,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		account_shadowed(vcpu->kvm, gfn);
}
init_shadow_page_table(sp);
-   trace_kvm_mmu_get_page(sp, true);
+   trace_kvm_mmu_get_set_page(sp, true);
return sp;
 }

@@ -1807,14 +1806,6 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	return __shadow_walk_next(iterator, *iterator->sptep);
 }

-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
-{
-   u64 spte;
-
-	spte = __pa(sp->spt) | SHADOW_PAGE_TABLE;
-   mmu_spte_set(sptep, spte);
-}
-
 static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
 {
if (is_large_pte(*sptep)) {
@@ -1879,11 +1870,6 @@ static void kvm_mmu_page_unlink_children(struct kvm *kvm,
 	mmu_page_zap_pte(kvm, sp, sp->spt + i);
 }

-static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
-{
-   mmu_page_remove_parent_pte(sp, parent_pte);
-}
-
 static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
u64 *parent_pte;
@@ -2468,7 +2454,6 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
bool prefault)
 {
struct kvm_shadow_walk_iterator iterator;
-   struct kvm_mmu_page *sp;
int emulate = 0;
gfn_t pseudo_gfn;

@@ -2489,16 +2474,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,

 		base_addr &= PT64_LVL_ADDR_MASK(iterator.level);
 		pseudo_gfn = base_addr >> PAGE_SHIFT;
-   

[PATCH 4/8] KVM: MMU: drop unsync_child_bitmap

2011-12-16 Thread Xiao Guangrong
unsync_child_bitmap is used to record which sptes have an unsync page or unsync
children; we can set a free bit in the spte instead of it.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 Documentation/virtual/kvm/mmu.txt |4 --
 arch/x86/include/asm/kvm_host.h   |1 -
 arch/x86/kvm/mmu.c|   77 +
 3 files changed, 52 insertions(+), 30 deletions(-)

diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index 4a5fedd..6d70a6e 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -214,10 +214,6 @@ Shadow pages contain the following information:
   unsync_children:
 It is used when role.level > 1 and indicates how many sptes in the page
 point at unsync pages or unsynchronized children.
-  unsync_child_bitmap:
-A bitmap indicating which sptes in spt point (directly or indirectly) at
-pages that may be unsynchronized.  Used to quickly locate all unsychronized
-pages reachable from a given page.

 Reverse map
 ===
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c0a89cd..601c7f6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -247,7 +247,6 @@ struct kvm_mmu_page {
};

unsigned long parent_ptes;  /* Reverse mapping for parent_pte */
-   DECLARE_BITMAP(unsync_child_bitmap, 512);

 #ifdef CONFIG_X86_32
int clear_spte_count;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 89202f4..9bd2084 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -147,7 +147,8 @@ module_param(dbg, bool, 0644);
 #define CREATE_TRACE_POINTS
 #include "mmutrace.h"
 
-#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+#define SPTE_HOST_WRITEABLE	(1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+#define SPTE_UNSYNC_CHILD	(1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))

 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 #define SHADOW_PAGE_TABLE  \
@@ -561,6 +562,40 @@ static void mmu_spte_clear_no_track(u64 *sptep)
__update_clear_spte_fast(sptep, 0ull);
 }

+static bool mmu_spte_mark_unsync_child(struct kvm_mmu_page *sp, u64 *sptep)
+{
+   u64 new_spte = *sptep;
+	bool unsync_child = new_spte & SPTE_UNSYNC_CHILD;
+
+   if (unsync_child)
+   return true;
+
+   new_spte |= SPTE_UNSYNC_CHILD;
+   __set_spte(sptep, new_spte);
+	return sp->unsync_children++;
+}
+
+static bool mmu_spte_is_unsync_child(u64 *sptep)
+{
+	return *sptep & SPTE_UNSYNC_CHILD;
+}
+
+static void __mmu_spte_clear_unsync_child(u64 *sptep)
+{
+	page_header(__pa(sptep))->unsync_children--;
+	WARN_ON((int)page_header(__pa(sptep))->unsync_children < 0);
+}
+
+static void mmu_spte_clear_unsync_child(u64 *sptep)
+{
+   u64 new_spte = *sptep;
+
+	if (new_spte & SPTE_UNSYNC_CHILD) {
+		__set_spte(sptep, new_spte & ~SPTE_UNSYNC_CHILD);
+   __mmu_spte_clear_unsync_child(sptep);
+   }
+}
+
 static u64 mmu_spte_get_lockless(u64 *sptep)
 {
return __get_spte_lockless(sptep);
@@ -1342,6 +1377,10 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
u64 *parent_pte)
 {
mmu_page_remove_parent_pte(sp, parent_pte);
+
+	if (*parent_pte & SPTE_UNSYNC_CHILD)
+   __mmu_spte_clear_unsync_child(parent_pte);
+
mmu_spte_clear_no_track(parent_pte);
 }

@@ -1372,16 +1411,10 @@ static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)

 static void mark_unsync(u64 *spte)
 {
-   struct kvm_mmu_page *sp;
-   unsigned int index;
+   struct kvm_mmu_page *sp = page_header(__pa(spte));

-   sp = page_header(__pa(spte));
-	index = spte - sp->spt;
-	if (__test_and_set_bit(index, sp->unsync_child_bitmap))
-		return;
-	if (sp->unsync_children++)
-   return;
-   kvm_mmu_mark_parents_unsync(sp);
+   if (!mmu_spte_mark_unsync_child(sp, spte))
+   kvm_mmu_mark_parents_unsync(sp);
 }

 static int nonpaging_sync_page(struct kvm_vcpu *vcpu,
@@ -1411,10 +1444,9 @@ struct kvm_mmu_pages {
unsigned int nr;
 };

-#define for_each_unsync_children(bitmap, idx)  \
-   for (idx = find_first_bit(bitmap, 512); \
-	     idx < 512;					\
-idx = find_next_bit(bitmap, 512, idx+1))
+#define for_each_unsync_children(sp, sptep, idx)   \
+	for (idx = 0; idx < 512 && ((sptep) = (sp)->spt + idx); idx++)	\
+   if (!mmu_spte_is_unsync_child(sptep)) {} else

 static int mmu_pages_add(struct kvm_mmu_pages *pvec, struct kvm_mmu_page *sp,
 int idx)
@@ -1435,14 +1467,14 @@ static int mmu_pages_add(struct kvm_mmu_pages *pvec, 
struct kvm_mmu_page *sp,
 static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
   struct 

[PATCH 5/8] KVM: MMU: optimize walking unsync shadow page

2011-12-16 Thread Xiao Guangrong
Exactly, unsync_children is the number of unsync sptes; we can use it to
avoid unnecessary spte fetching.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9bd2084..16e0642 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1468,12 +1468,15 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
   struct kvm_mmu_pages *pvec)
 {
u64 *spte;
-   int i, ret, nr_unsync_leaf = 0;
+	int i, ret, nr_unsync_leaf = 0, unysnc_children = sp->unsync_children;

for_each_unsync_children(sp, spte, i) {
struct kvm_mmu_page *child;
u64 ent = *spte;

+   if (!unysnc_children--)
+   break;
+
WARN_ON(!is_shadow_present_pte(ent) || is_large_pte(ent));

 		child = page_header(ent & PT64_BASE_ADDR_MASK);
-- 
1.7.7.4



[PATCH 6/8] KVM: MMU: optimize handing invlpg

2011-12-16 Thread Xiao Guangrong
Use the unsync bit to see if the spte points to an unsync child.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/paging_tmpl.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index c79c503..5333256 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -706,7 +706,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 			FNAME(update_pte)(vcpu, sp, sptep, &gpte);
}

-		if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)
+   if (!mmu_spte_is_unsync_child(sptep))
break;
}
 	spin_unlock(&vcpu->kvm->mmu_lock);
-- 
1.7.7.4



[PATCH 7/8] KVM: MMU: remove the redundant get_written_sptes

2011-12-16 Thread Xiao Guangrong
get_written_sptes is called twice in kvm_mmu_pte_write, one of them can be
removed

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |6 ++
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 16e0642..5d0f0e3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3575,7 +3575,7 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
  * If we're seeing too many writes to a page, it may no longer be a page table,
  * or we may be forking, in which case it is better to unmap the page.
  */
-static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte)
+static bool detect_write_flooding(struct kvm_mmu_page *sp)
 {
/*
 * Skip write-flooding detected for the sp whose level is 1, because
@@ -3684,10 +3684,8 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,

mask.cr0_wp = mask.cr4_pae = mask.nxe = 1;
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn, node) {
-   spte = get_written_sptes(sp, gpa, npte);
-
if (detect_write_misaligned(sp, gpa, bytes) ||
- detect_write_flooding(sp, spte)) {
+ detect_write_flooding(sp)) {
 			zap_page |= !!kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 invalid_list);
 			++vcpu->kvm->stat.mmu_flooded;
-- 
1.7.7.4



[PATCH 8/8] KVM: MMU: remove PT64_SECOND_AVAIL_BITS_SHIFT

2011-12-16 Thread Xiao Guangrong
It is not used, remove it

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5d0f0e3..234a32e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -91,7 +91,6 @@ module_param(dbg, bool, 0644);
 #define PTE_PREFETCH_NUM   8

 #define PT_FIRST_AVAIL_BITS_SHIFT 9
-#define PT64_SECOND_AVAIL_BITS_SHIFT 52

 #define PT64_LEVEL_BITS 9

-- 
1.7.7.4



Re: [PATCH v4] kvm: make vcpu life cycle separated from kvm instance

2011-12-16 Thread Marcelo Tosatti
On Thu, Dec 15, 2011 at 03:48:38PM +0900, Takuya Yoshikawa wrote:
 (2011/12/15 13:28), Liu Ping Fan wrote:
  From: Liu Ping Fan pingf...@linux.vnet.ibm.com
  
  Currently, a vcpu can be destroyed only when the kvm instance is destroyed.
  Change this to allow a vcpu's destruction before the kvm instance, so a vcpu
  MUST and CAN be destroyed before the kvm instance is destroyed.
 
 Could you explain why this change is needed here?
 Would be helpful for those, including me, who will read the commit later.

I fail to see the motivation for this change also.



Re: [PATCH 0/5 V5] Avoid soft lockup message when KVM is stopped by host

2011-12-16 Thread Marcelo Tosatti
On Thu, Dec 15, 2011 at 12:21:16PM +0200, Avi Kivity wrote:
 On 12/14/2011 08:21 PM, Marcelo Tosatti wrote:
  On Wed, Dec 14, 2011 at 04:39:56PM +0200, Avi Kivity wrote:
   On 12/14/2011 02:16 PM, Marcelo Tosatti wrote:
 Having this controlled from userspace means it doesn't work for 
 SIGSTOP
 or for long scheduling delays.  What about doing this automatically
 based on preempt notifiers?
   
Long scheduling delays should be considered hangups from the guest
perspective.
   
   Why?  To the guest it looks like slow hardware, but it will interpret it
   as a softlockup.
 
  Slow enough that progress of the watchdog thread is unable to keep up
  with timer interrupt processing. This is considered a hang and
  should be reported.
 
 It's not a guest hang though!

No, but your host system is in such a load state that for the sake of
system usability you better print out a warning message.

I don't see the advantage of preempt notifiers over the simple, paravirt
solution proposed. Note that kvmclock is already paravirt.

What do you want to be done in preempt notifiers? Measure what to
consider setting this flag?



Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages

2011-12-16 Thread Marcelo Tosatti
On Mon, Dec 12, 2011 at 07:26:47AM +0900, Takuya Yoshikawa wrote:
 From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
 
 Currently, mmu_shrink() tries to free a shadow page from one kvm and
 does not use nr_to_scan correctly.
 
 This patch fixes this by making it try to free some shadow pages from
 each kvm.  The number of shadow pages each kvm frees becomes
 proportional to the number of shadow pages it is using.
 
 Note: an easy way to see how this code works is to do
    echo 3 > /proc/sys/vm/drop_caches
  while some virtual machines are running.  Shadow pages will be zapped
 as expected by this.

I'm not sure this is a meaningful test to verify this change is
worthwhile, because while the shrinker tries to free a shadow page from
one vm, the vm's position in the kvm list is changed, so over time
the shrinker will cycle over all VMs.

Can you measure whether there is a significant difference in a synthetic
workload, and what that change is? Perhaps apply {moderate, high} memory
pressure load with {2, 4, 8, 16} VMs or something like that.



Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages

2011-12-16 Thread Takuya Yoshikawa
On Fri, 16 Dec 2011 09:06:11 -0200
Marcelo Tosatti mtosa...@redhat.com wrote:

 On Mon, Dec 12, 2011 at 07:26:47AM +0900, Takuya Yoshikawa wrote:
  From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
  
  Currently, mmu_shrink() tries to free a shadow page from one kvm and
  does not use nr_to_scan correctly.
  
  This patch fixes this by making it try to free some shadow pages from
  each kvm.  The number of shadow pages each kvm frees becomes
  proportional to the number of shadow pages it is using.
  
  Note: an easy way to see how this code works is to do
     echo 3 > /proc/sys/vm/drop_caches
   while some virtual machines are running.  Shadow pages will be zapped
  as expected by this.
 
 I'm not sure this is a meaningful test to verify this change is
 worthwhile, because while the shrinker tries to free a shadow page from
  one vm, the vm's position in the kvm list is changed, so over time
 the shrinker will cycle over all VMs.

The test was for checking if mmu_shrink() would work as intended.  Maybe
not good as a changelog, sorry.


I admit that I could not find any strong reason except for protecting the
host from intentionally induced shadowing.

But for that, don't you think that freeing the same amount of shadow pages
from good and bad VMs equally is a bad thing?

My method will try to free many shadow pages from VMs with many shadow
pages;  e.g. if there is a pathological increase in shadow pages for one
VM, that one will be intensively treated.

If you can agree on this reasoning, I will update the description and resend.

 
 Can you measure whether there is a significant difference in a synthetic
 workload, and what that change is? Perhaps apply {moderate, high} memory
 pressure load with {2, 4, 8, 16} VMs or something like that.
 

I was running 4 VMs on my machine with high enough memory pressure.  The
problem was that mmu_shrink() was not tuned to be called under usual memory
pressure: what I did was change the seeks and batch parameters and make ept=0.

At least, I have checked that if I make one VM do many meaningless copies,
letting the others keep silent, the shrinker frees shadow pages intensively
from that one.


Anyway, I don't think making the shrinker call mmu_shrink() with the default 
batch
size, nr_to_scan=128, and just trying to free one shadow page is a good 
behaviour.


Takuya


Re: [PATCH 2/2] kvm tools: Submit multiple virtio-blk requests in parallel

2011-12-16 Thread Asias He
On 12/15/2011 08:15 PM, Sasha Levin wrote:
 When using AIO, submit all requests which exists in the vring in a single
 io_submit instead of one io_submit for each descriptor.
 
 Benchmarks:
 
 Short version: 15%+ increase in IOPS, small increase in BW.
 
 Read IOPS:
 Before:
   vda: ios=291792/0, merge=0/0, ticks=35229/0, in_queue=31025, util=61.30%

I guess you are reading the wrong IOPS number, the 'ios' is the number
of ios performed by all groups, not the IOPS result. Find the 'iops' ;-)

So, here are the numbers without/with this patch.

(seq-read, seq-write, rand-read, rand-write)

Before:
  read : io=98304KB, bw=63015KB/s, iops=15753, runt=  1560msec
  write: io=98304KB, bw=56823KB/s, iops=14205, runt=  1730msec
  read : io=98304KB, bw=62139KB/s, iops=15534, runt=  1582msec
  write: io=98304KB, bw=53836KB/s, iops=13458, runt=  1826msec

After:
  read : io=98304KB, bw=63096KB/s, iops=15774, runt=  1558msec
  write: io=98304KB, bw=55823KB/s, iops=13955, runt=  1761msec
  read : io=98304KB, bw=59148KB/s, iops=14787, runt=  1662msec
  write: io=98304KB, bw=55072KB/s, iops=13768, runt=  1785msec

Submitting more io requests at a time is not supposed to increase the iops
or bw so dramatically.

I even tried to submit all read/write ops in one io_submit, which still
ends up with very little iops or bw improvement.
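
[Editor's note: for readers who want to see what the batching looks like at the libaio level, here is a minimal, self-contained sketch. It is illustrative only, not the kvm tool code; fd/buffer setup, completion handling and error checks are omitted.]

#include <libaio.h>
#include <stddef.h>

#define NR_REQS 8

/* Prepare several reads and hand them to the kernel with a single
 * io_submit() call instead of one call per request. */
static int submit_batch(io_context_t ctx, int fd, void *bufs[NR_REQS], size_t len)
{
	struct iocb cbs[NR_REQS];
	struct iocb *ptrs[NR_REQS];
	int i;

	for (i = 0; i < NR_REQS; i++) {
		/* queue a read of 'len' bytes at offset i * len */
		io_prep_pread(&cbs[i], fd, bufs[i], len, (long long)i * len);
		ptrs[i] = &cbs[i];
	}

	/* one syscall for the whole batch; returns how many were queued */
	return io_submit(ctx, NR_REQS, ptrs);
}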

 
 After:
   vda: ios=333114/0, merge=0/0, ticks=47983/0, in_queue=46630, util=62.22%
 
 Write IOPS:
 Before:
   vda: ios=0/271716, merge=0/0, ticks=0/33367, in_queue=26531, util=59.96%
 
 After:
   vda: ios=0/327485, merge=0/0, ticks=0/23789, in_queue=22475, util=55.74%
 
 Read BW:
 Before:
READ: io=7526.0MB, aggrb=1246.3MB/s, minb=1275.1MB/s, maxb=1275.1MB/s, 
 mint=6040msec, maxt=6040msec
 After:
READ: io=7526.0MB, aggrb=1315.5MB/s, minb=1346.7MB/s, maxb=1346.7MB/s, 
 mint=5723msec, maxt=5723msec
 
 Write BW:
 Before:
   WRITE: io=7526.0MB, aggrb=1110.2MB/s, minb=1136.9MB/s, maxb=1136.9MB/s, 
 mint=6779msec, maxt=6779msec
 
 After:
   WRITE: io=7526.0MB, aggrb=1113.5MB/s, minb=1140.3MB/s, maxb=1140.3MB/s, 
 mint=6759msec, maxt=6759msec
 
 Signed-off-by: Sasha Levin levinsasha...@gmail.com
 ---
  tools/kvm/disk/core.c  |2 -
  tools/kvm/disk/raw.c   |   38 +--
  tools/kvm/include/kvm/disk-image.h |8 +-
  tools/kvm/util/read-write.c|8 +-
  tools/kvm/virtio/blk.c |3 ++
  5 files changed, 42 insertions(+), 17 deletions(-)
 
 diff --git a/tools/kvm/disk/core.c b/tools/kvm/disk/core.c
 index 4915efd..b18014e 100644
 --- a/tools/kvm/disk/core.c
 +++ b/tools/kvm/disk/core.c
 @@ -5,8 +5,6 @@
  #include sys/eventfd.h
  #include sys/poll.h
  
 -#define AIO_MAX 32
 -
  int debug_iodelay;
  
  #ifdef CONFIG_HAS_AIO
 diff --git a/tools/kvm/disk/raw.c b/tools/kvm/disk/raw.c
 index caa023c..1162fb7 100644
 --- a/tools/kvm/disk/raw.c
 +++ b/tools/kvm/disk/raw.c
@@ -10,9 +10,9 @@ ssize_t raw_image__read_sector(struct disk_image *disk, u64 sector, const struct
 	u64 offset = sector << SECTOR_SHIFT;
 
 #ifdef CONFIG_HAS_AIO
-	struct iocb iocb;
+	struct iocb *iocb = &disk->iocb[disk->count++];
 
-	return aio_preadv(disk->ctx, &iocb, disk->fd, iov, iovcount, offset,
+	return aio_preadv(disk->ctx, iocb, disk->fd, iov, iovcount, offset,
 			  disk->evt, param);
 #else
 	return preadv_in_full(disk->fd, iov, iovcount, offset);
@@ -25,9 +25,9 @@ ssize_t raw_image__write_sector(struct disk_image *disk, u64 sector, const struc
 	u64 offset = sector << SECTOR_SHIFT;
 
 #ifdef CONFIG_HAS_AIO
-	struct iocb iocb;
+	struct iocb *iocb = &disk->iocb[disk->count++];
 
-	return aio_pwritev(disk->ctx, &iocb, disk->fd, iov, iovcount, offset,
+	return aio_pwritev(disk->ctx, iocb, disk->fd, iov, iovcount, offset,
 			   disk->evt, param);
 #else
 	return pwritev_in_full(disk->fd, iov, iovcount, offset);
 @@ -79,19 +79,33 @@ int raw_image__close(struct disk_image *disk)
  
 	close(disk->evt);
  
 -#ifdef CONFIG_HAS_VIRTIO
 +#ifdef CONFIG_HAS_AIO
 	io_destroy(disk->ctx);
  #endif
  
   return ret;
  }
  
 +static int raw_image__aio_submit(struct disk_image *disk)
 +{
 + int ret;
 +
+	ret = io_submit(disk->ctx, disk->count, disk->iocb_ptrs);
 + if (ret  0)
+		disk->count = 0;
 +
 + return ret;
 +}
 +
  /*
   * multiple buffer based disk image operations
   */
  static struct disk_image_operations raw_image_regular_ops = {
   .read_sector= raw_image__read_sector,
   .write_sector   = raw_image__write_sector,
 +#ifdef CONFIG_HAS_AIO
 + .submit = raw_image__aio_submit,
 +#endif
  };
  
  struct disk_image_operations ro_ops = {
 @@ -120,8 +134,13 @@ struct disk_image *raw_image__probe(int fd, struct stat 
 *st, bool readonly)
  
   disk = disk_image__new(fd, st-st_size, 
 ro_ops_nowrite, DISK_IMAGE_REGULAR);
  #ifdef CONFIG_HAS_AIO
 - if 

Re: [PATCH 0/4 V7] Avoid soft lockup message when KVM is stopped by host

2011-12-16 Thread Arnd Bergmann
On Thursday 15 December 2011, Eric B Munson wrote:
  Documentation/virtual/kvm/api.txt   |   12 
  arch/ia64/include/asm/kvm_para.h|5 +
  arch/powerpc/include/asm/kvm_para.h |5 +
  arch/s390/include/asm/kvm_para.h|5 +
  arch/x86/include/asm/kvm_host.h |2 ++
  arch/x86/include/asm/kvm_para.h |8 
  arch/x86/include/asm/pvclock-abi.h  |1 +
  arch/x86/kernel/kvmclock.c  |   21 +
  arch/x86/kvm/x86.c  |   20 
  include/asm-generic/kvm_para.h  |   14 ++
  include/linux/kvm.h |2 ++
  kernel/watchdog.c   |   12 
  12 files changed, 107 insertions(+), 0 deletions(-)
  create mode 100644 include/asm-generic/kvm_para.h

asm-generic changes Acked-by: Arnd Bergmann a...@arndb.de

but please remove me from the cc-list for the other patches.

Arnd


Re: [PATCH 3.2] KVM: Make KVM_INTEL depend on CPU_SUP_INTEL

2011-12-16 Thread Randy Dunlap
On 12/14/2011 02:27 AM, Avi Kivity wrote:
 PMU virtualization needs to talk to Intel-specific bits of perf; these are
 only available when CPU_SUP_INTEL=y.
 
 Fixes
 
   arch/x86/built-in.o: In function `atomic_switch_perf_msrs':
   vmx.c:(.text+0x6b1d4): undefined reference to `perf_guest_get_msrs'
 
 Reported-by: Ingo Molnar mi...@elte.hu
 Reported-by: Randy Dunlap rdun...@xenotime.net
 Signed-off-by: Avi Kivity a...@redhat.com

Acked-by: Randy Dunlap rdun...@xenotime.net

Thanks.

 ---
  arch/x86/kvm/Kconfig |2 ++
  1 files changed, 2 insertions(+), 0 deletions(-)
 
 diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
 index ff5790d..ca4d49e 100644
 --- a/arch/x86/kvm/Kconfig
 +++ b/arch/x86/kvm/Kconfig
 @@ -52,6 +52,8 @@ config KVM
  config KVM_INTEL
 	tristate "KVM for Intel processors support"
   depends on KVM
 + # for perf_guest_get_msrs():
 + depends on CPU_SUP_INTEL
   ---help---
 Provides support for KVM on Intel processors equipped with the VT
 extensions.


-- 
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-16 Thread Nate Custer

On Dec 15, 2011, at 1:47 PM, Vivek Goyal wrote:
 Ok, I continued to develop on the patch which tries to allocate per cpu
 stats from worker thread context. Here is the patch.
 
 Can the reporter please try out the patch and see if it helps. I am not
 sure if deadlock was because of mutex issue or not, but it should help
 get rid of lockdep warning.
 
 This patch is on top of for-3.3/core branch of jens's linux-block tree.
 If it does not apply on your kernel version, do let me know the version 
 you are testing with and I will generate another version of patch.
 
 If testing results are good, I will break down the patch in small pieces
 and post as a series separately.
 
 Thanks
 Vivek

Running on a fedora-16 box with the patch applied to the linux-block tree I
still had deadlocks. In fact it seemed to happen much faster and with lighter
workloads.

I was able to get netconsole setup and a full stacktrace is posted here:

http://pastebin.com/9Rq68exU

Nate



Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)

2011-12-16 Thread Vivek Goyal
On Fri, Dec 16, 2011 at 12:43:52PM -0600, Nate Custer wrote:
 
 On Dec 15, 2011, at 1:47 PM, Vivek Goyal wrote:
  Ok, I continued to develop on the patch which tries to allocate per cpu
  stats from worker thread context. Here is the patch.
  
  Can the reporter please try out the patch and see if it helps. I am not
  sure if deadlock was because of mutex issue or not, but it should help
  get rid of lockdep warning.
  
  This patch is on top of for-3.3/core branch of jens's linux-block tree.
  If it does not apply on your kernel version, do let me know the version 
  you are testing with and I will generate another version of patch.
  
  If testing results are good, I will break down the patch in small pieces
  and post as a series separately.
  
  Thanks
  Vivek
 
 Running on a fedora-16 box with the patch applied to the linux-block tree I 
 still had deadlocks. In fact it seemed to happen much faster and with lighter
 workloads.
 
 I was able to get netconsole setup and a full stacktrace is posted here:
 
 http://pastebin.com/9Rq68exU

Thanks for testing it Nate. I did some debugging and found out that the patch
is doing a double free on the per cpu pointer, hence the crash you are running
into. I could reproduce this problem on my box. It is just a matter of
doing rmdir on the blkio cgroup.

I understood the cmpxchg() semantics wrong. I have fixed it now and
there are no crashes on directory removal. Can you please give this version a
try?

Thanks
Vivek

Alloc per cpu data from a worker thread context to avoid the possibility of a
deadlock under memory pressure.

The only side effect of delayed allocation is that we will lose the blkio cgroup
stats for that group for a small duration.

This patch is generated on top of for-3.3/core branch of linux-block tree.

-v2:
 Fixed the issue of double free on percpu pointer.
---
 block/blk-cgroup.c   |   31 -
 block/blk-cgroup.h   |9 +++
 block/blk-throttle.c |  116 -
 block/cfq-iosched.c  |  119 ++-
 4 files changed, 180 insertions(+), 95 deletions(-)

Index: block/blk-throttle.c
===
--- block/blk-throttle.c.orig   2011-12-17 01:49:34.0 -0500
+++ block/blk-throttle.c2011-12-17 02:21:50.0 -0500
@@ -81,6 +81,7 @@ struct throtl_grp {
int limits_changed;
 
struct rcu_head rcu_head;
+   struct work_struct stat_alloc_work;
 };
 
 struct throtl_data
@@ -159,6 +160,13 @@ static void throtl_free_tg(struct rcu_he
struct throtl_grp *tg;
 
tg = container_of(head, struct throtl_grp, rcu_head);
+
+   /*
+* Will delayed allocation be visible here for sure? I am relying
+* on the fact that after blkg.stats_cpu assignment, we drop
+* reference to group using atomic_dec() which should imply
+* barrier
+*/
free_percpu(tg->blkg.stats_cpu);
kfree(tg);
 }
@@ -181,6 +189,48 @@ static void throtl_put_tg(struct throtl_
call_rcu(&tg->rcu_head, throtl_free_tg);
 }
 
+static inline void throtl_check_schedule_pcpu_stat_alloc(struct throtl_grp *tg) {
+   if (tg->blkg.stats_cpu != NULL)
+   return;
+   /*
+* Schedule the group per cpu stat allocation through worker
+* thread
+*/
+   throtl_ref_get_tg(tg);
+   if (!schedule_work(&tg->stat_alloc_work)) {
+   /* Work is already scheduled by somebody */
+   throtl_put_tg(tg);
+   }
+}
+
+/* No locks taken. A reference to throtl_grp taken before invocation */
+static void tg_stat_alloc_fn(struct work_struct *work)
+{
+   struct throtl_grp *tg = container_of(work, struct throtl_grp,
+   stat_alloc_work);
+   void *stat_ptr = NULL;
+   unsigned long ret;
+
+   stat_ptr = blkio_alloc_blkg_stats_pcpu();
+
+   if (stat_ptr == NULL) {
+   /* If memory allocation failed, try again */
+   throtl_check_schedule_pcpu_stat_alloc(tg);
+   throtl_put_tg(tg);
+   return;
+   }
+
+   ret = blkio_cmpxchg_blkg_stats(&tg->blkg, 0,
+   (unsigned long)stat_ptr);
+
+   if (ret != 0) {
+   /* Somebody else got to it first */
+   free_percpu(stat_ptr);
+   }
+
+   throtl_put_tg(tg);
+}
+
 static void throtl_init_group(struct throtl_grp *tg)
 {
INIT_HLIST_NODE(&tg->tg_node);
@@ -188,6 +238,7 @@ static void throtl_init_group(struct thr
bio_list_init(&tg->bio_lists[0]);
bio_list_init(&tg->bio_lists[1]);
tg->limits_changed = false;
+   INIT_WORK(&tg->stat_alloc_work, tg_stat_alloc_fn);
 
/* Practically unlimited BW */
tg->bps[0] = tg->bps[1] = -1;
@@ -263,8 +314,25 @@ static void throtl_init_add_tg_lists(str
throtl_add_group_to_td_list(td, tg);
 }
 
-/* Should be called without queue lock and 

[PATCH v5] kvm: make vcpu life cycle separated from kvm instance

2011-12-16 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

Currently, a vcpu can be destructed only when the kvm instance is destroyed.
Change this so that a vcpu's destruction can happen before the kvm instance
goes away, i.e. a vcpu MUST and CAN be destroyed before kvm's destroy.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/x86/kvm/i8254.c |   10 +++--
 arch/x86/kvm/i8259.c |   12 --
 arch/x86/kvm/x86.c   |   53 +++
 include/linux/kvm_host.h |   20 -
 virt/kvm/irq_comm.c  |6 ++-
 virt/kvm/kvm_main.c  |  106 ++---
 6 files changed, 132 insertions(+), 75 deletions(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 76e3f1c..a3a5506 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -289,9 +289,8 @@ static void pit_do_work(struct work_struct *work)
struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
struct kvm *kvm = pit->kvm;
struct kvm_vcpu *vcpu;
-   int i;
struct kvm_kpit_state *ps = &pit->pit_state;
-   int inject = 0;
+   int idx, inject = 0;
 
/* Try to inject pending interrupts when
 * last one has been acked.
@@ -315,9 +314,12 @@ static void pit_do_work(struct work_struct *work)
 * LVT0 to NMI delivery. Other PIC interrupts are just sent to
 * VCPU0, and only if its LVT0 is in EXTINT mode.
 */
-   if (kvm->arch.vapics_in_nmi_mode > 0)
-   kvm_for_each_vcpu(i, vcpu, kvm)
+   if (kvm->arch.vapics_in_nmi_mode > 0) {
+   idx = srcu_read_lock(&kvm->srcu_vcpus);
+   kvm_for_each_vcpu(vcpu, kvm)
kvm_apic_nmi_wd_deliver(vcpu);
+   srcu_read_unlock(&kvm->srcu_vcpus, idx);
+   }
}
 }
 
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index cac4746..5ef5c05 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -50,25 +50,29 @@ static void pic_unlock(struct kvm_pic *s)
 {
bool wakeup = s->wakeup_needed;
struct kvm_vcpu *vcpu, *found = NULL;
-   int i;
+   struct kvm *kvm = s->kvm;
+   int idx;
 
s->wakeup_needed = false;
 
spin_unlock(&s->lock);
 
if (wakeup) {
-   kvm_for_each_vcpu(i, vcpu, s->kvm) {
+   idx = srcu_read_lock(&kvm->srcu_vcpus);
+   kvm_for_each_vcpu(vcpu, kvm)
if (kvm_apic_accept_pic_intr(vcpu)) {
found = vcpu;
break;
}
-   }
 
-   if (!found)
+   if (!found) {
+   srcu_read_unlock(&kvm->srcu_vcpus, idx);
return;
+   }
 
kvm_make_request(KVM_REQ_EVENT, found);
kvm_vcpu_kick(found);
+   srcu_read_unlock(&kvm->srcu_vcpus, idx);
}
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 23c93fe..b79739d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1774,14 +1774,20 @@ static int get_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 {
u64 data = 0;
+   int idx;
 
switch (msr) {
case HV_X64_MSR_VP_INDEX: {
-   int r;
+   int r = 0;
struct kvm_vcpu *v;
-   kvm_for_each_vcpu(r, v, vcpu->kvm)
+   struct kvm *kvm = vcpu->kvm;
+   idx = srcu_read_lock(&kvm->srcu_vcpus);
+   kvm_for_each_vcpu(v, vcpu->kvm) {
if (v == vcpu)
data = r;
+   r++;
+   }
+   srcu_read_unlock(&kvm->srcu_vcpus, idx);
break;
}
case HV_X64_MSR_EOI:
@@ -4529,7 +4535,7 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
struct cpufreq_freqs *freq = data;
struct kvm *kvm;
struct kvm_vcpu *vcpu;
-   int i, send_ipi = 0;
+   int idx, send_ipi = 0;
 
/*
 * We allow guests to temporarily run on slowing clocks,
@@ -4579,13 +4585,16 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 
raw_spin_lock(&kvm_lock);
list_for_each_entry(kvm, &vm_list, vm_list) {
-   kvm_for_each_vcpu(i, vcpu, kvm) {
+   idx = srcu_read_lock(&kvm->srcu_vcpus);
+   kvm_for_each_vcpu(vcpu, kvm) {
if (vcpu->cpu != freq->cpu)
continue;
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
if (vcpu->cpu != smp_processor_id())
send_ipi = 1;
}
+   srcu_read_unlock(&kvm->srcu_vcpus, idx);
+
}

Re: [PATCH v4] kvm: make vcpu life cycle separated from kvm instance

2011-12-16 Thread Liu ping fan
2011/12/15 Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp:
 (2011/12/15 13:28), Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 Currently, vcpu can be destructed only when kvm instance destroyed.
 Change this to vcpu's destruction before kvm instance, so vcpu MUST
 and CAN be destroyed before kvm's destroy.

 Could you explain why this change is needed here?
 Would be helpful for those, including me, who will read the commit later.

Suppose the following scenario: we create 10 vcpus so the guest can take
advantage of multiple cores. Later we want to reclaim some of those vcpus to
limit the guest's CPU usage. What happens to the vcpus that are no longer
used? Currently they just sit idle in the kernel; with this patch we can
remove them.
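
To make the intended lifetime rule concrete, here is a small userspace sketch,
not the kernel code: the names are made up, plain atomics stand in for
kvm_vcpu_get()/kvm_vcpu_put(), and free() stands in for the zap_work/call_rcu
path. The point is only that removing a vcpu drops the creation reference,
while the memory goes away when the last in-flight user drops its reference.

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct vcpu {
	int id;
	atomic_int refcount;  /* plays the role of kvm_vcpu_get()/kvm_vcpu_put() */
};

static struct vcpu *vcpu_get(struct vcpu *v)
{
	atomic_fetch_add(&v->refcount, 1);
	return v;
}

static void vcpu_put(struct vcpu *v)
{
	/* The last put frees the vcpu; in the patch this is where the
	 * zap_work / call_rcu path would run. */
	if (atomic_fetch_sub(&v->refcount, 1) == 1) {
		printf("freeing vcpu %d\n", v->id);
		free(v);
	}
}

int main(void)
{
	struct vcpu *v = calloc(1, sizeof(*v));

	v->id = 3;
	atomic_init(&v->refcount, 1);  /* creation reference */

	vcpu_get(v);  /* e.g. an in-flight injection path still using it */
	vcpu_put(v);  /* the "remove this vcpu" request drops the creation ref */
	vcpu_put(v);  /* the in-flight user finishes; only now is it freed */
	return 0;
}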


 Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
 ---

 ...

 diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
 index cac4746..f275b8c 100644
 --- a/arch/x86/kvm/i8259.c
 +++ b/arch/x86/kvm/i8259.c
 @@ -50,25 +50,28 @@ static void pic_unlock(struct kvm_pic *s)
   {
       bool wakeup = s->wakeup_needed;
       struct kvm_vcpu *vcpu, *found = NULL;
 -     int i;
 +     struct kvm *kvm = s->kvm;

       s->wakeup_needed = false;

       spin_unlock(&s->lock);

       if (wakeup) {
 -             kvm_for_each_vcpu(i, vcpu, s->kvm) {
 +             rcu_read_lock();
 +             kvm_for_each_vcpu(vcpu, kvm)
                       if (kvm_apic_accept_pic_intr(vcpu)) {
                               found = vcpu;
                               break;
                       }
 -             }

 -             if (!found)
 +             if (!found) {
 +                     rcu_read_unlock();
                       return;
 +             }

               kvm_make_request(KVM_REQ_EVENT, found);
               kvm_vcpu_kick(found);
 +             rcu_read_unlock();
       }
   }

 How about this? (just about stylistic issues)

:-), I just wanted to base the change on the old code, but your style is OK too.

        if (!wakeup)
                return;

        rcu_read_lock();
        kvm_for_each_vcpu(vcpu, kvm)
                if (kvm_apic_accept_pic_intr(vcpu)) {
                        found = vcpu;
                        break;
                }

        if (!found)
                goto out;

        kvm_make_request(KVM_REQ_EVENT, found);
        kvm_vcpu_kick(found);
 out:
        rcu_read_unlock();

 ...

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c

 ...

 +void kvm_arch_vcpu_zap(struct work_struct *work)
 +{
 +     struct kvm_vcpu *vcpu = container_of(work, struct kvm_vcpu,
 +                     zap_work);
 +     struct kvm *kvm = vcpu->kvm;

 -     atomic_set(&kvm->online_vcpus, 0);
 -     mutex_unlock(&kvm->lock);
 +     kvm_clear_async_pf_completion_queue(vcpu);
 +     kvm_unload_vcpu_mmu(vcpu);
 +     kvm_arch_vcpu_free(vcpu);
 +     kvm_put_kvm(kvm);
   }

 zap is really a good name for this?

zap = destroy, so I think it is OK.
 ...

 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index d526231..733de1c 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -19,6 +19,7 @@
   #include <linux/slab.h>
   #include <linux/rcupdate.h>
   #include <linux/ratelimit.h>
 +#include <linux/atomic.h>
   #include <asm/signal.h>

   #include <linux/kvm.h>
 @@ -113,6 +114,10 @@ enum {

   struct kvm_vcpu {
       struct kvm *kvm;
 +     atomic_t refcount;
 +     struct list_head list;
 +     struct rcu_head head;
 +     struct work_struct zap_work;

 How about adding some comments?
 zap_work is not at all self explanatory, IMO.


   #ifdef CONFIG_PREEMPT_NOTIFIERS
       struct preempt_notifier preempt_notifier;
   #endif
 @@ -241,9 +246,9 @@ struct kvm {
       u32 bsp_vcpu_id;
       struct kvm_vcpu *bsp_vcpu;
   #endif
 -     struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 +     struct list_head vcpus;
       atomic_t online_vcpus;
 -     int last_boosted_vcpu;
 +     struct kvm_vcpu *last_boosted_vcpu;
       struct list_head vm_list;
       struct mutex lock;
       struct kvm_io_bus *buses[KVM_NR_BUSES];
 @@ -290,17 +295,15 @@ struct kvm {
   #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
   #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)

 -static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
 -{
 -     smp_rmb();
 -     return kvm->vcpus[i];
 -}
 +struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu);
 +void kvm_vcpu_put(struct kvm_vcpu *vcpu);
 +void kvm_arch_vcpu_zap(struct work_struct *work);
 +
 +#define kvm_for_each_vcpu(vcpu, kvm) \
 +     list_for_each_entry_rcu(vcpu, &kvm->vcpus, list)

 Is this macro really worth it?
 _rcu shows readers important information, I think.

I guess kvm_for_each_vcpu is designed to hide the details of the internal
implementation; currently that implementation is an array, and my patch
changes it to a linked list, so IMO we can still hide the details.
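
As a tiny standalone illustration of that point (simplified, made-up names,
no RCU; in the real patch the iterator is list_for_each_entry_rcu and callers
hold the SRCU read lock), the caller's loop looks the same whether the macro
walks an array or a list:

#include <stdio.h>

struct vcpu { int id; struct vcpu *next; };
struct kvm  { struct vcpu *vcpus; };  /* list head instead of a fixed array */

/* Caller-facing iterator: the loop body does not care how vcpus are stored. */
#define kvm_for_each_vcpu(v, kvm) \
	for ((v) = (kvm)->vcpus; (v) != NULL; (v) = (v)->next)

int main(void)
{
	struct vcpu c = { .id = 2, .next = NULL };
	struct vcpu b = { .id = 1, .next = &c };
	struct vcpu a = { .id = 0, .next = &b };
	struct kvm kvm = { .vcpus = &a };
	struct vcpu *v;

	kvm_for_each_vcpu(v, &kvm)
		printf("vcpu %d\n", v->id);
	return 0;
}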

Regards,
ping fan


 -#define kvm_for_each_vcpu(idx, vcpup, kvm) \
 -     for (idx = 0; \
 -          idx