Re: [PATCH 0/2] vhost-net: Use kvm_memslots instead of vhost_memory to translate GPA to HVA
On Fri, 2011-12-16 at 15:40 +0800, Zang Hongyong wrote:
> On Fri, 2011-12-16 15:05, Sasha Levin wrote:
>> On Fri, 2011-12-16 at 13:32 +0800, zanghongy...@huawei.com wrote:
>>> From: Hongyong Zang <zanghongy...@huawei.com>
>>>
>>> Vhost-net uses its own vhost_memory, which is built from user space (qemu) info, to translate GPA to HVA. Since the kernel's kvm structure already maintains the address relationship in its member *kvm_memslots*, these patches use the kernel's kvm_memslots directly, without the need to initialize and maintain vhost_memory.
>>
>> Conceptually, vhost isn't aware of KVM - it's just a driver which moves data from a vq to a tap device and back. You can't simply add KVM-specific code into vhost. What's the performance benefit?
>
> But vhost-net is only used in virtualization situations. vhost_memory is maintained by user space qemu. In this way, the memory relationship can be acquired from the kernel without the need to maintain vhost_memory from qemu.

You can't assume that vhost-* is used only along with qemu/kvm. Just as virtio has more uses than just virtualization (here's one: https://lkml.org/lkml/2011/10/25/139 ), there are more uses for vhost as well.

There has been a great deal of effort to keep vhost and kvm untangled. One example is the memory translation it has to do; another one is the eventfd/irqfd thing it does just so it can signal an IRQ in the guest instead of accessing the guest directly.

If you do see a great performance increase when tying vhost and KVM together, it may be worth it to create some sort of an in-kernel vhost-kvm bridging thing, but if the performance isn't noticeable we're better off just leaving it as is and keeping the vhost code general.

-- Sasha.

-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] vhost-net: Use kvm_memslots instead of vhost_memory to translate GPA to HVA
On Fri, 2011-12-16 at 09:59 +0200, Sasha Levin wrote:
> There has been a great deal of effort to keep vhost and kvm untangled. One example is the memory translation it has to do; another one is the eventfd/irqfd thing it does just so it can signal an IRQ in the guest instead of accessing the guest directly.

Actually, CONFIG_VHOST_NET doesn't even depend on CONFIG_KVM, so your patch will break the build when (CONFIG_VHOST_NET=y, CONFIG_KVM=n).

-- Sasha.
Re: [PATCH 0/2] vhost-net: Use kvm_memslots instead of vhost_memory to translate GPA to HVA
On Fri, 2011-12-16 15:59, Sasha Levin wrote:
> You can't assume that vhost-* is used only along with qemu/kvm. Just as virtio has more uses than just virtualization (here's one: https://lkml.org/lkml/2011/10/25/139 ), there are more uses for vhost as well.
>
> There has been a great deal of effort to keep vhost and kvm untangled. One example is the memory translation it has to do; another one is the eventfd/irqfd thing it does just so it can signal an IRQ in the guest instead of accessing the guest directly.
>
> If you do see a great performance increase when tying vhost and KVM together, it may be worth it to create some sort of an in-kernel vhost-kvm bridging thing, but if the performance isn't noticeable we're better off just leaving it as is and keeping the vhost code general.

Thanks for your explanation. Since the memory layout is seldom changed after the guest boots, this situation mainly occurs during initialization. There's no need for a vhost-kvm bridge now.
Re: [PATCH] kvm tools: Make the whole guest memory mergeable
On Fri, 2011-12-16 15:23, Sasha Levin wrote:
> On Fri, 2011-12-16 at 15:02 +0800, Zang Hongyong wrote:
>> On Fri, 2011-12-16 13:50, Sasha Levin wrote:
>>> On Fri, 2011-12-16 at 09:01 +0800, zanghongy...@huawei.com wrote:
>>>> If a guest's ram_size exceeds KVM_32BIT_GAP_START, the corresponding kvm tool's virtual address size should be (ram_size + KVM_32BIT_GAP_SIZE), rather than ram_size.
>>>
>>> You're right. There are more places than just the madvise() code which make the same error you've spotted (for example, the memslot allocation code), so instead of trying to fix all of them I'd suggest just updating ram_size in kvm__arch_init() before allocating everything - that should fix all of them at once.
>>
>> Yes. There are other scenarios with the same error. However, ram_size sometimes means the real guest ram size, and sometimes means the virtual address size of the kvm tool's user space. Shall we define a new variable?
>
> Let's keep it simple. If the user requests more RAM than KVM_32BIT_GAP_START, just increase it by KVM_32BIT_GAP_SIZE; this way mapped size == guest size always (we can madvise(MADV_DONTNEED) the gap in the mmapped ram). Since a user who requests more than KVM_32BIT_GAP_START will have to be on a 64-bit host anyway, there shouldn't be any issue with that.

Do you mean increasing *kvm->ram_size* by KVM_32BIT_GAP_SIZE? But sometimes kvm->ram_size stands for the guest physical ram size (for example in the kvm__init_ram() code).
[PATCH] kvm tools: Trivial cleanup
Signed-off-by: Sasha Levin <levinsasha...@gmail.com>
---
 tools/kvm/disk/qcow.c      |   26 --
 tools/kvm/ioport.c         |    7 ++-
 tools/kvm/pci.c            |    6 --
 tools/kvm/symbol.c         |   18 +-
 tools/kvm/term.c           |    1 -
 tools/kvm/util/util.c      |    3 +--
 tools/kvm/virtio/balloon.c |    6 +++---
 tools/kvm/virtio/core.c    |    1 -
 tools/kvm/virtio/net.c     |   14 +++---
 tools/kvm/virtio/pci.c     |    9 -
 tools/kvm/virtio/rng.c     |   27 +--
 tools/kvm/x86/bios.c       |    6 +++---
 tools/kvm/x86/cpuid.c      |    7 ---
 tools/kvm/x86/interrupt.c  |    3 ++-
 tools/kvm/x86/kvm-cpu.c    |   39 +++
 tools/kvm/x86/kvm.c        |   23 ++-
 16 files changed, 89 insertions(+), 107 deletions(-)

diff --git a/tools/kvm/disk/qcow.c b/tools/kvm/disk/qcow.c
index e139fa5..b12fe53 100644
--- a/tools/kvm/disk/qcow.c
+++ b/tools/kvm/disk/qcow.c
@@ -329,18 +329,16 @@ static ssize_t qcow1_read_cluster(struct qcow *q, u64 offset,
 			csize &= (q->cluster_size - 1);

 			if (pread_in_full(q->fd, q->cluster_data, csize,
-					  coffset) < 0) {
+					  coffset) < 0)
 				goto out_error;
-			}

 			if (qcow_decompress_buffer(q->cluster_cache, q->cluster_size,
-					q->cluster_data, csize) < 0) {
+					q->cluster_data, csize) < 0)
 				goto out_error;
-			}

 			memcpy(dst, q->cluster_cache + clust_offset, length);
 			mutex_unlock(&q->mutex);
-	} else{
+	} else {
 		if (!clust_start)
 			goto zero_cluster;

@@ -435,7 +433,7 @@ static ssize_t qcow2_read_cluster(struct qcow *q, u64 offset,
 		memcpy(dst, q->cluster_cache + clust_offset, length);
 		mutex_unlock(&q->mutex);
-	} else{
+	} else {
 		clust_start &= QCOW2_OFFSET_MASK;
 		if (!clust_start)
 			goto zero_cluster;

@@ -470,11 +468,11 @@ static ssize_t qcow_read_sector_single(struct disk_image *disk, u64 sector,
 	char *buf;
 	u32 nr;

-	buf = dst;
-	nr_read = 0;
+	buf	= dst;
+	nr_read	= 0;

 	while (nr_read < dst_len) {
-		offset = sector << SECTOR_SHIFT;
+		offset	= sector << SECTOR_SHIFT;
 		if (offset >= header->size)
 			return -1;

@@ -488,9 +486,9 @@ static ssize_t qcow_read_sector_single(struct disk_image *disk, u64 sector,
 		if (nr <= 0)
 			return -1;

-		nr_read += nr;
-		buf += nr;
-		sector += (nr >> SECTOR_SHIFT);
+		nr_read	+= nr;
+		buf	+= nr;
+		sector	+= (nr >> SECTOR_SHIFT);
 	}

 	return dst_len;
@@ -508,9 +506,9 @@ static ssize_t qcow_read_sector(struct disk_image *disk, u64 sector,
 			return -1;
 		}

-		sector += iov->iov_len >> SECTOR_SHIFT;
+		sector	+= iov->iov_len >> SECTOR_SHIFT;
+		total	+= nr;
 		iov++;
-		total += nr;
 	}

 	return total;
diff --git a/tools/kvm/ioport.c b/tools/kvm/ioport.c
index 965cfc2..b417942 100644
--- a/tools/kvm/ioport.c
+++ b/tools/kvm/ioport.c
@@ -111,13 +111,10 @@ bool kvm__emulate_io(struct kvm *kvm, u16 port, void *data, int direction, int s
 	ops = entry->ops;

 	while (count--) {
-		if (direction == KVM_EXIT_IO_IN) {
-			if (ops->io_in)
+		if (direction == KVM_EXIT_IO_IN && ops->io_in)
 			ret = ops->io_in(entry, kvm, port, ptr, size);
-		} else {
-			if (ops->io_out)
+		else if (ops->io_out)
 			ret = ops->io_out(entry, kvm, port, ptr, size);
-		}

 		ptr += size;
 	}
diff --git a/tools/kvm/pci.c b/tools/kvm/pci.c
index 59b2618..06eea0f 100644
--- a/tools/kvm/pci.c
+++ b/tools/kvm/pci.c
@@ -160,10 +160,12 @@ void pci__config_rd(struct kvm *kvm, union pci_config_address addr, void *data,
 			void *p = pci_devices[dev_num];

 			memcpy(data, p + offset, size);
-		} else
+		} else {
 			memset(data, 0x00, size);
-	} else
+		}
+	} else {
 		memset(data, 0xff, size);
+	}
 }

 void pci__register(struct pci_device_header *dev, u8 dev_num)
diff --git a/tools/kvm/symbol.c b/tools/kvm/symbol.c
index 56dd346..a2b1e67 100644
--- a/tools/kvm/symbol.c
+++ b/tools/kvm/symbol.c
@@ -7,7
Re: [PATCH] kvm tools: Make the whole guest memory mergeable
On Fri, 2011-12-16 at 16:36 +0800, Zang Hongyong wrote:
> Do you mean increasing *kvm->ram_size* by KVM_32BIT_GAP_SIZE? But sometimes kvm->ram_size stands for the guest physical ram size (for example in the kvm__init_ram() code).

Yup, kvm->ram_size. If the user requested more than KVM_32BIT_GAP_START, we pretty much have to create the gap, so instead of playing around with different interpretations of ram_size, let's add the gap size - this will let us have just one ram_size. mmap()ing extra space for the gap is free, and that was the plan in the first place (we just got the math wrong :) ).

Do you see an issue with increasing kvm->ram_size?

-- Sasha.
Re: [PATCH] kvm tools: Trivial cleanup
On Fri, Dec 16, 2011 at 10:40:06AM +0200, Sasha Levin wrote:
> Signed-off-by: Sasha Levin <levinsasha...@gmail.com>
> ---

Thanks, Sasha!

Cyrill
[PATCH] kvm tools: Fix 'make kvmconfig'
Set features which depend on the config we select to make 'make kvmconfig' non-interactive.

Signed-off-by: Sasha Levin <levinsasha...@gmail.com>
---
 arch/x86/Kconfig         |    1 +
 scripts/kconfig/Makefile |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8f74dc0..d553840 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -522,6 +522,7 @@ config KVMTOOL_TEST_ENABLE
 	select DEBUG_KERNEL
 	select KGDB
 	select KGDB_SERIAL_CONSOLE
+	select VIRTUALIZATION
 	select VIRTIO
 	select VIRTIO_RING
 	select VIRTIO_PCI
diff --git a/scripts/kconfig/Makefile b/scripts/kconfig/Makefile
index 65580c2..58a4259 100644
--- a/scripts/kconfig/Makefile
+++ b/scripts/kconfig/Makefile
@@ -35,6 +35,7 @@ silentoldconfig: $(obj)/conf
 kvmconfig:
 	$(Q)$(CONFIG_SHELL) $(srctree)/scripts/config -e KVMTOOL_TEST_ENABLE
+	$(Q)yes | make oldconfig > /dev/null
 	@echo 'Kernel configuration modified to run as KVM guest.'

 # if no path is given, then use src directory to find file
--
1.7.8
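The `yes | make oldconfig` line works because `yes` feeds an endless stream of answers into oldconfig's prompts, so the newly selected symbols never stop to ask a question. A toy illustration (`fake_oldconfig` is a hypothetical stand-in for the real oldconfig prompt loop, not part of the patch):

```shell
#!/bin/sh
# Hypothetical stand-in for 'make oldconfig': one prompt per new
# config symbol, with the answer read from stdin.
fake_oldconfig() {
	for opt in VIRTUALIZATION VIRTIO VIRTIO_RING VIRTIO_PCI; do
		read -r answer || return 1
		printf 'CONFIG_%s=%s\n' "$opt" "$answer"
	done
}

# 'yes' supplies "y" lines forever, so the prompts never block
# waiting for a user.
yes | fake_oldconfig
```

Running it prints CONFIG_VIRTUALIZATION=y through CONFIG_VIRTIO_PCI=y with no interaction, which is exactly the effect the Makefile hunk wants after enabling KVMTOOL_TEST_ENABLE.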
Re: [PATCH] kvm tools: Fix 'make kvmconfig'
* Sasha Levin <levinsasha...@gmail.com> wrote:
> Set features which depend on the config we select to make 'make kvmconfig' non-interactive.
>
> Signed-off-by: Sasha Levin <levinsasha...@gmail.com>

Acked-by: Ingo Molnar <mi...@elte.hu>

Thanks,

	Ingo
Re: [PATCH] kvm tools: Make the whole guest memory mergeable
On Fri, 2011-12-16 16:45, Sasha Levin wrote:
> Yup, kvm->ram_size. If the user requested more than KVM_32BIT_GAP_START, we pretty much have to create the gap, so instead of playing around with different interpretations of ram_size, let's add the gap size - this will let us have just one ram_size. mmap()ing extra space for the gap is free, and that was the plan in the first place (we just got the math wrong :) ).
>
> Do you see an issue with increasing kvm->ram_size?

Yes, simply increasing kvm->ram_size will cause some problems.

For example: in the kvm__init_ram() code we use kvm->ram_size to calculate the size of the second RAM range from 4GB to the end of RAM (phys_size = kvm->ram_size - phys_size;), so after increasing kvm->ram_size it will go wrong. This problem also occurs in the e820_setup() and load_bzimage() code.
Re: [Qemu-devel] [PATCH] virtio-serial: Allow one MSI-X vector per virtqueue
On (Fri) 16 Dec 2011 [09:14:26], zanghongy...@huawei.com wrote:
> From: Hongyong Zang <zanghongy...@huawei.com>
>
> In pci_enable_msix(), the guest's virtio-serial driver tries to set up msi-x with one vector per queue. But it fails, and eventually all virtio-serial ports share one MSI-X vector, because every virtio-serial port has *two* virtqueues: virtio-serial needs (port+1)*2 vectors rather than (port+1).

Ouch, good catch. One comment below:

> This patch allows every virtqueue to have its own MSI-X vector. (When the MSI-X vectors needed are more than MSIX_MAX_ENTRIES defined in qemu's msix.c, all the queues still share one MSI-X vector as before.)
>
> Signed-off-by: Hongyong Zang <zanghongy...@huawei.com>
> ---
>  hw/virtio-pci.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> index 77b75bc..2c9c6fb 100644
> --- a/hw/virtio-pci.c
> +++ b/hw/virtio-pci.c
> @@ -718,8 +718,11 @@ static int virtio_serial_init_pci(PCIDevice *pci_dev)
>          return -1;
>      }
>      vdev->nvectors = proxy->nvectors == DEV_NVECTORS_UNSPECIFIED
> -        ? proxy->serial.max_virtserial_ports + 1
> +        ? (proxy->serial.max_virtserial_ports + 1) * 2
>          : proxy->nvectors;
> +    /* msix.c: #define MSIX_MAX_ENTRIES 32 */
> +    if (vdev->nvectors > 32)
> +        vdev->nvectors = 32;

This change isn't needed: if the proxy->nvectors value exceeds the max allowed, virtio_init_pci() will end up using a shared vector instead of separate ones.

Thanks, Amit
Re: [PATCH] kvm tools: Make the whole guest memory mergeable
On Fri, 2011-12-16 at 17:33 +0800, Zang Hongyong wrote:
>> Do you see an issue with increasing kvm->ram_size?
>
> Yes, simply increasing kvm->ram_size will cause some problems. For example: in the kvm__init_ram() code we use kvm->ram_size to calculate the size of the second RAM range from 4GB to the end of RAM (phys_size = kvm->ram_size - phys_size;), so after increasing kvm->ram_size it will go wrong. This problem also occurs in the e820_setup() and load_bzimage() code.

Yup, but fixing it is much easier than having two different sizes of the same thing. For example, the fix for the problem in kvm__init_ram() (and e820_setup()) would be:

@@ -112,7 +112,7 @@ void kvm__init_ram(struct kvm *kvm)
 	/* Second RAM range from 4GB to the end of RAM: */

 	phys_start = 0x100000000ULL;
-	phys_size  = kvm->ram_size - phys_size;
+	phys_size  = kvm->ram_size - phys_start;

 	host_mem = kvm->ram_start + phys_start;

 	kvm__register_mem(kvm, phys_start, phys_size, host_mem);

I basically want one memory map with one size which includes *everything*; even if that memory map includes a gap in the middle, I still want the total size to include that gap.

btw, what problem do you see in load_bzimage()?

-- Sasha.
[PATCH 0/8] KVM: MMU: reduce the size of shadow page structure and some cleanup
In this patchset, we noticeably reduce the size of struct kvm_mmu_page and make the code simpler. This patchset has already been tested with unit tests and RHEL 6.1 setup/boot/reboot/shutdown.
[PATCH 1/8] KVM: MMU: combine unsync and unsync_children
unsync is only used for level == 1 pages, and unsync_children is only used in upper-level pages, so combine them to reduce the size of the shadow page structure.

Signed-off-by: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
---
 Documentation/virtual/kvm/mmu.txt |    5 ++-
 arch/x86/include/asm/kvm_host.h   |   14 +++-
 arch/x86/kvm/mmu.c                |   39 +---
 arch/x86/kvm/mmu_audit.c          |    6 ++--
 arch/x86/kvm/mmutrace.h           |    3 +-
 arch/x86/kvm/paging_tmpl.h        |    2 +-
 6 files changed, 48 insertions(+), 21 deletions(-)

diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index 5dc972c..4a5fedd 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -205,14 +205,15 @@ Shadow pages contain the following information:
     this page's spt.  Otherwise, parent_ptes points at a data structure
     with a list of parent_ptes.
   unsync:
+    It is used when role.level == 1, only level 1 shadow pages can be unsync.
     If true, then the translations in this page may not match the guest's
     translation.  This is equivalent to the state of the tlb when a pte is
     changed but before the tlb entry is flushed.  Accordingly, unsync ptes
     are synchronized when the guest executes invlpg or flushes its tlb by
     other means.  Valid for leaf pages.
   unsync_children:
-    How many sptes in the page point at pages that are unsync (or have
-    unsynchronized children).
+    It is used when role.level > 1 and indicates how many sptes in the page
+    point at unsync pages or unsynchronized children.
   unsync_child_bitmap:
     A bitmap indicating which sptes in spt point (directly or indirectly) at
     pages that may be unsynchronized.  Used to quickly locate all unsychronized
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 52d6640..c0a89cd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -233,9 +233,19 @@ struct kvm_mmu_page {
 	 * in this shadow page.
 	 */
 	DECLARE_BITMAP(slot_bitmap, KVM_MEM_SLOTS_NUM);
-	bool unsync;
 	int root_count;          /* Currently serving as active root */
-	unsigned int unsync_children;
+
+	/*
+	 * If role.level == 1, unsync indicates whether the sp is
+	 * unsync, otherwise unsync_children indicates the number
+	 * of sptes which point at unsync sp or unsychronized children.
+	 * See sp_is_unsync() / sp_unsync_children_num().
+	 */
+	union {
+		bool unsync;
+		unsigned int unsync_children;
+	};
+
 	unsigned long parent_ptes;   /* Reverse mapping for parent_pte */
 	DECLARE_BITMAP(unsync_child_bitmap, 512);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2a2a9b4..6913a16 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -151,6 +151,21 @@ module_param(dbg, bool, 0644);

 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)

+static bool sp_is_unsync(struct kvm_mmu_page *sp)
+{
+	return sp->role.level == PT_PAGE_TABLE_LEVEL && sp->unsync;
+}
+
+static unsigned int sp_unsync_children_num(struct kvm_mmu_page *sp)
+{
+	unsigned int num = 0;
+
+	if (sp->role.level != PT_PAGE_TABLE_LEVEL)
+		num = sp->unsync_children;
+
+	return num;
+}
+
 struct pte_list_desc {
 	u64 *sptes[PTE_LIST_EXT];
 	struct pte_list_desc *more;
@@ -1401,7 +1416,7 @@ static int mmu_pages_add(struct kvm_mmu_pages *pvec, struct kvm_mmu_page *sp,
 {
 	int i;

-	if (sp->unsync)
+	if (sp_is_unsync(sp))
 		for (i=0; i < pvec->nr; i++)
 			if (pvec->page[i].sp == sp)
 				return 0;
@@ -1426,7 +1441,7 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,

 		child = page_header(ent & PT64_BASE_ADDR_MASK);

-		if (child->unsync_children) {
+		if (sp_unsync_children_num(child)) {
 			if (mmu_pages_add(pvec, child, i))
 				return -ENOSPC;
@@ -1437,7 +1452,7 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
 				nr_unsync_leaf += ret;
 			else
 				return ret;
-		} else if (child->unsync) {
+		} else if (sp_is_unsync(child)) {
 			nr_unsync_leaf++;
 			if (mmu_pages_add(pvec, child, i))
 				return -ENOSPC;
@@ -1468,7 +1483,7 @@ static int mmu_unsync_walk(struct kvm_mmu_page *sp,
 static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	WARN_ON(!sp->unsync);
+	WARN_ON(!sp_is_unsync(sp));
 	trace_kvm_mmu_sync_page(sp);
 	sp->unsync = 0;
 	--kvm->stat.mmu_unsync;
@@ -1546,7 +1561,7 @@ static void kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t
[PATCH 2/8] KVM: MMU: set the dirty bit for the upper shadow page
Upper-level pages do not need their status bits tracked; it is safe to set the dirty bit and let the cpu happily prefetch it.

Signed-off-by: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c |   13 +
 1 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6913a16..a2d28aa 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -150,6 +150,9 @@ module_param(dbg, bool, 0644);
 #define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)

 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
+#define SHADOW_PAGE_TABLE						\
+	(PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask |	\
+	 shadow_x_mask | shadow_accessed_mask | shadow_dirty_mask)

 static bool sp_is_unsync(struct kvm_mmu_page *sp)
 {
@@ -1808,9 +1811,7 @@ static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
 {
 	u64 spte;

-	spte = __pa(sp->spt)
-		| PT_PRESENT_MASK | PT_ACCESSED_MASK
-		| PT_WRITABLE_MASK | PT_USER_MASK;
+	spte = __pa(sp->spt) | SHADOW_PAGE_TABLE;
 	mmu_spte_set(sptep, spte);
 }
@@ -2497,11 +2498,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 				return -ENOMEM;
 			}

-			mmu_spte_set(iterator.sptep,
-				     __pa(sp->spt)
-				     | PT_PRESENT_MASK | PT_WRITABLE_MASK
-				     | shadow_user_mask | shadow_x_mask
-				     | shadow_accessed_mask);
+			link_shadow_page(iterator.sptep, sp);
 		}
 	}
 	return emulate;
--
1.7.7.4
[PATCH 3/8] KVM: MMU: do not add a nonpresent spte to rmaps of its child
Set the spte before adding it to the rmap of its child, so that all parent sptes are valid when propagating the unsync bit from an unsync page / children page. This is needed by a later patch.

Signed-off-by: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
---
 arch/x86/kvm/mmu.c         |   74 +++
 arch/x86/kvm/mmutrace.h    |    2 +-
 arch/x86/kvm/paging_tmpl.h |   14 +++-
 3 files changed, 32 insertions(+), 58 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a2d28aa..89202f4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1321,12 +1321,14 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
 	return gfn & ((1 << KVM_MMU_HASH_SHIFT) - 1);
 }

-static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu,
-				    struct kvm_mmu_page *sp, u64 *parent_pte)
+static void mmu_page_add_set_parent_pte(struct kvm_vcpu *vcpu,
+					struct kvm_mmu_page *sp,
+					u64 *parent_pte)
 {
 	if (!parent_pte)
 		return;

+	mmu_spte_set(parent_pte, __pa(sp->spt) | SHADOW_PAGE_TABLE);
 	pte_list_add(vcpu, parent_pte, &sp->parent_ptes);
 }
@@ -1357,7 +1359,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
 	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
 	bitmap_zero(sp->slot_bitmap, KVM_MEM_SLOTS_NUM);
 	sp->parent_ptes = 0;
-	mmu_page_add_parent_pte(vcpu, sp, parent_pte);
+	mmu_page_add_set_parent_pte(vcpu, sp, parent_pte);
 	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
 	return sp;
 }
@@ -1690,13 +1692,10 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sp);
 }

-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
-					     gfn_t gfn,
-					     gva_t gaddr,
-					     unsigned level,
-					     int direct,
-					     unsigned access,
-					     u64 *parent_pte)
+static struct kvm_mmu_page *
+kvm_mmu_get_set_page(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gaddr,
+		     unsigned level, int direct, unsigned access,
+		     u64 *parent_pte)
 {
 	union kvm_mmu_page_role role;
 	unsigned quadrant;
@@ -1726,7 +1725,7 @@
 		if (sp_is_unsync(sp) && kvm_sync_page_transient(vcpu, sp))
 			break;

-		mmu_page_add_parent_pte(vcpu, sp, parent_pte);
+		mmu_page_add_set_parent_pte(vcpu, sp, parent_pte);
 		if (sp_unsync_children_num(sp)) {
 			kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
 			kvm_mmu_mark_parents_unsync(sp);
@@ -1734,7 +1733,7 @@
 			kvm_mmu_mark_parents_unsync(sp);

 		__clear_sp_write_flooding_count(sp);
-		trace_kvm_mmu_get_page(sp, false);
+		trace_kvm_mmu_get_set_page(sp, false);
 		return sp;
 	}
 	++vcpu->kvm->stat.mmu_cache_miss;
@@ -1754,7 +1753,7 @@
 		account_shadowed(vcpu->kvm, gfn);
 	}
 	init_shadow_page_table(sp);
-	trace_kvm_mmu_get_page(sp, true);
+	trace_kvm_mmu_get_set_page(sp, true);
 	return sp;
 }
@@ -1807,14 +1806,6 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	return __shadow_walk_next(iterator, *iterator->sptep);
 }

-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
-{
-	u64 spte;
-
-	spte = __pa(sp->spt) | SHADOW_PAGE_TABLE;
-	mmu_spte_set(sptep, spte);
-}
-
 static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
 {
 	if (is_large_pte(*sptep)) {
@@ -1879,11 +1870,6 @@ static void kvm_mmu_page_unlink_children(struct kvm *kvm,
 		mmu_page_zap_pte(kvm, sp, sp->spt + i);
 }

-static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
-{
-	mmu_page_remove_parent_pte(sp, parent_pte);
-}
-
 static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	u64 *parent_pte;
@@ -2468,7 +2454,6 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 			bool prefault)
 {
 	struct kvm_shadow_walk_iterator iterator;
-	struct kvm_mmu_page *sp;
 	int emulate = 0;
 	gfn_t pseudo_gfn;
@@ -2489,16 +2474,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 			base_addr &= PT64_LVL_ADDR_MASK(iterator.level);
 			pseudo_gfn = base_addr >> PAGE_SHIFT;
-
[PATCH 4/8] KVM: MMU: drop unsync_child_bitmap
unsync_child_bitmap is used to record which sptes point at an unsync page or unsync children; we can set a free bit in the spte instead of it Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- Documentation/virtual/kvm/mmu.txt |4 -- arch/x86/include/asm/kvm_host.h |1 - arch/x86/kvm/mmu.c| 77 + 3 files changed, 52 insertions(+), 30 deletions(-) diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt index 4a5fedd..6d70a6e 100644 --- a/Documentation/virtual/kvm/mmu.txt +++ b/Documentation/virtual/kvm/mmu.txt @@ -214,10 +214,6 @@ Shadow pages contain the following information: unsync_children: It is used when role.level > 1 and indicates how many sptes in the page point at unsync pages or unsynchronized children. - unsync_child_bitmap: -A bitmap indicating which sptes in spt point (directly or indirectly) at -pages that may be unsynchronized. Used to quickly locate all unsychronized -pages reachable from a given page. Reverse map === diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c0a89cd..601c7f6 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -247,7 +247,6 @@ struct kvm_mmu_page { }; unsigned long parent_ptes; /* Reverse mapping for parent_pte */ - DECLARE_BITMAP(unsync_child_bitmap, 512); #ifdef CONFIG_X86_32 int clear_spte_count; diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 89202f4..9bd2084 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -147,7 +147,8 @@ module_param(dbg, bool, 0644); #define CREATE_TRACE_POINTS #include "mmutrace.h" -#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT) +#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT) +#define SPTE_UNSYNC_CHILD (1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1)) #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level) #define SHADOW_PAGE_TABLE \ @@ -561,6 +562,40 @@ static void mmu_spte_clear_no_track(u64 *sptep) __update_clear_spte_fast(sptep, 0ull); } +static bool
mmu_spte_mark_unsync_child(struct kvm_mmu_page *sp, u64 *sptep) +{ + u64 new_spte = *sptep; + bool unsync_child = new_spte & SPTE_UNSYNC_CHILD; + + if (unsync_child) + return true; + + new_spte |= SPTE_UNSYNC_CHILD; + __set_spte(sptep, new_spte); + return sp->unsync_children++; +} + +static bool mmu_spte_is_unsync_child(u64 *sptep) +{ + return *sptep & SPTE_UNSYNC_CHILD; +} + +static void __mmu_spte_clear_unsync_child(u64 *sptep) +{ + page_header(__pa(sptep))->unsync_children--; + WARN_ON((int)page_header(__pa(sptep))->unsync_children < 0); +} + +static void mmu_spte_clear_unsync_child(u64 *sptep) +{ + u64 new_spte = *sptep; + + if (new_spte & SPTE_UNSYNC_CHILD) { + __set_spte(sptep, new_spte & ~SPTE_UNSYNC_CHILD); + __mmu_spte_clear_unsync_child(sptep); + } +} + static u64 mmu_spte_get_lockless(u64 *sptep) { return __get_spte_lockless(sptep); @@ -1342,6 +1377,10 @@ static void drop_parent_pte(struct kvm_mmu_page *sp, u64 *parent_pte) { mmu_page_remove_parent_pte(sp, parent_pte); + + if (*parent_pte & SPTE_UNSYNC_CHILD) + __mmu_spte_clear_unsync_child(parent_pte); + mmu_spte_clear_no_track(parent_pte); } @@ -1372,16 +1411,10 @@ static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp) static void mark_unsync(u64 *spte) { - struct kvm_mmu_page *sp; - unsigned int index; + struct kvm_mmu_page *sp = page_header(__pa(spte)); - sp = page_header(__pa(spte)); - index = spte - sp->spt; - if (__test_and_set_bit(index, sp->unsync_child_bitmap)) - return; - if (sp->unsync_children++) - return; - kvm_mmu_mark_parents_unsync(sp); + if (!mmu_spte_mark_unsync_child(sp, spte)) + kvm_mmu_mark_parents_unsync(sp); } static int nonpaging_sync_page(struct kvm_vcpu *vcpu, @@ -1411,10 +1444,9 @@ struct kvm_mmu_pages { unsigned int nr; }; -#define for_each_unsync_children(bitmap, idx) \ - for (idx = find_first_bit(bitmap, 512); \ - idx < 512; \ - idx = find_next_bit(bitmap, 512, idx+1)) +#define for_each_unsync_children(sp, sptep, idx) \ + for (idx = 0; idx < 512 && ((sptep) = (sp)->spt + idx); idx++) \ + if
(!mmu_spte_is_unsync_child(sptep)) {} else static int mmu_pages_add(struct kvm_mmu_pages *pvec, struct kvm_mmu_page *sp, int idx) @@ -1435,14 +1467,14 @@ static int mmu_pages_add(struct kvm_mmu_pages *pvec, struct kvm_mmu_page *sp, static int __mmu_unsync_walk(struct kvm_mmu_page *sp, struct
[PATCH 5/8] KVM: MMU: optimize walking unsync shadow page
unsync_children is exactly the number of unsync sptes, so we can use it to avoid unnecessary spte fetching Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/mmu.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 9bd2084..16e0642 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1468,12 +1468,15 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp, struct kvm_mmu_pages *pvec) { u64 *spte; - int i, ret, nr_unsync_leaf = 0; + int i, ret, nr_unsync_leaf = 0, unysnc_children = sp->unsync_children; for_each_unsync_children(sp, spte, i) { struct kvm_mmu_page *child; u64 ent = *spte; + if (!unysnc_children--) + break; + WARN_ON(!is_shadow_present_pte(ent) || is_large_pte(ent)); child = page_header(ent & PT64_BASE_ADDR_MASK); -- 1.7.7.4
[PATCH 6/8] KVM: MMU: optimize handling invlpg
Use the unsync bit to see if the spte points to an unsync child Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/paging_tmpl.h |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index c79c503..5333256 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -706,7 +706,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva) FNAME(update_pte)(vcpu, sp, sptep, gpte); } - if (!is_shadow_present_pte(*sptep) || !sp->unsync_children) + if (!mmu_spte_is_unsync_child(sptep)) break; } spin_unlock(&vcpu->kvm->mmu_lock); -- 1.7.7.4
[PATCH 7/8] KVM: MMU: remove the redundant get_written_sptes
get_written_sptes() is called twice in kvm_mmu_pte_write(); one of the calls can be removed Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/mmu.c |6 ++ 1 files changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 16e0642..5d0f0e3 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3575,7 +3575,7 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa, * If we're seeing too many writes to a page, it may no longer be a page table, * or we may be forking, in which case it is better to unmap the page. */ -static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte) +static bool detect_write_flooding(struct kvm_mmu_page *sp) { /* * Skip write-flooding detected for the sp whose level is 1, because @@ -3684,10 +3684,8 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, mask.cr0_wp = mask.cr4_pae = mask.nxe = 1; for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn, node) { - spte = get_written_sptes(sp, gpa, &npte); - if (detect_write_misaligned(sp, gpa, bytes) || - detect_write_flooding(sp, spte)) { + detect_write_flooding(sp)) { zap_page |= !!kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list); ++vcpu->kvm->stat.mmu_flooded; -- 1.7.7.4
[PATCH 8/8] KVM: MMU: remove PT64_SECOND_AVAIL_BITS_SHIFT
It is not used, remove it Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/mmu.c |1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 5d0f0e3..234a32e 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -91,7 +91,6 @@ module_param(dbg, bool, 0644); #define PTE_PREFETCH_NUM 8 #define PT_FIRST_AVAIL_BITS_SHIFT 9 -#define PT64_SECOND_AVAIL_BITS_SHIFT 52 #define PT64_LEVEL_BITS 9 -- 1.7.7.4
Re: [PATCH v4] kvm: make vcpu life cycle separated from kvm instance
On Thu, Dec 15, 2011 at 03:48:38PM +0900, Takuya Yoshikawa wrote: (2011/12/15 13:28), Liu Ping Fan wrote: From: Liu Ping Fanpingf...@linux.vnet.ibm.com Currently, vcpu can be destructed only when kvm instance destroyed. Change this to vcpu's destruction before kvm instance, so vcpu MUST and CAN be destroyed before kvm's destroy. Could you explain why this change is needed here? Would be helpful for those, including me, who will read the commit later. I fail to see the motivation for this change also.
Re: [PATCH 0/5 V5] Avoid soft lockup message when KVM is stopped by host
On Thu, Dec 15, 2011 at 12:21:16PM +0200, Avi Kivity wrote: On 12/14/2011 08:21 PM, Marcelo Tosatti wrote: On Wed, Dec 14, 2011 at 04:39:56PM +0200, Avi Kivity wrote: On 12/14/2011 02:16 PM, Marcelo Tosatti wrote: Having this controlled from userspace means it doesn't work for SIGSTOP or for long scheduling delays. What about doing this automatically based on preempt notifiers? Long scheduling delays should be considered hangups from the guest perspective. Why? To the guest it looks like slow hardware, but it will interpret it as a softlockup. Slow enough that progress of the watchdog thread is unable to keep up with timer interrupt processing. This is considered a hang and should be reported. It's not a guest hang though! No, but your host system is in such a load state that for the sake of system usability you better print out a warning message. I don't see the advantage of preempt notifiers over the simple, paravirt solution proposed? Note kvmclock is already paravirt. What do you want to be done in preempt notifiers? Measure what to consider setting this flag?
Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages
On Mon, Dec 12, 2011 at 07:26:47AM +0900, Takuya Yoshikawa wrote: From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp Currently, mmu_shrink() tries to free a shadow page from one kvm and does not use nr_to_scan correctly. This patch fixes this by making it try to free some shadow pages from each kvm. The number of shadow pages each kvm frees becomes proportional to the number of shadow pages it is using. Note: an easy way to see how this code works is to do echo 3 > /proc/sys/vm/drop_caches while some virtual machines are running. Shadow pages will be zapped as expected by this. I'm not sure this is a meaningful test to verify this change is worthwhile, because while the shrinker tries to free a shadow page from one vm, the vm's position in the kvm list is changed, so over time the shrinker will cycle over all VMs. Can you measure whether there is a significant difference in a synthetic workload, and what that change is? Perhaps apply {moderate, high} memory pressure load with {2, 4, 8, 16} VMs or something like that.
Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages
On Fri, 16 Dec 2011 09:06:11 -0200 Marcelo Tosatti mtosa...@redhat.com wrote: On Mon, Dec 12, 2011 at 07:26:47AM +0900, Takuya Yoshikawa wrote: From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp Currently, mmu_shrink() tries to free a shadow page from one kvm and does not use nr_to_scan correctly. This patch fixes this by making it try to free some shadow pages from each kvm. The number of shadow pages each kvm frees becomes proportional to the number of shadow pages it is using. Note: an easy way to see how this code works is to do echo 3 > /proc/sys/vm/drop_caches while some virtual machines are running. Shadow pages will be zapped as expected by this. I'm not sure this is a meaningful test to verify this change is worthwhile, because while the shrinker tries to free a shadow page from one vm, the vm's position in the kvm list is changed, so over time the shrinker will cycle over all VMs. The test was for checking if mmu_shrink() would work as intended. Maybe not good as a changelog, sorry. I admit that I could not find any strong reason except for protecting the host from intentionally induced shadowing. But for that, don't you think that freeing the same amount of shadow pages from good and bad VMs equally is a bad thing? My method will try to free many shadow pages from VMs with many shadow pages; e.g. if there is a pathological increase in shadow pages for one VM, that one will be intensively treated. If you can agree on this reasoning, I will update the description and resend. Can you measure whether there is a significant difference in a synthetic workload, and what that change is? Perhaps apply {moderate, high} memory pressure load with {2, 4, 8, 16} VMs or something like that. I was running 4 VMs on my machine with high enough memory pressure. The problem was that mmu_shrink() was not tuned to be called under usual memory pressure: what I did was changing the seeks and batch parameters and making ept=0.
At least, I have checked that if I make one VM do many meaningless copies, while letting the others keep silent, the shrinker frees shadow pages intensively from that one. Anyway, I don't think making the shrinker call mmu_shrink() with the default batch size, nr_to_scan=128, and just trying to free one shadow page is good behaviour. Takuya
Re: [PATCH 2/2] kvm tools: Submit multiple virtio-blk requests in parallel
On 12/15/2011 08:15 PM, Sasha Levin wrote: When using AIO, submit all requests which exist in the vring in a single io_submit instead of one io_submit for each descriptor. Benchmarks: Short version: 15%+ increase in IOPS, small increase in BW. Read IOPS: Before: vda: ios=291792/0, merge=0/0, ticks=35229/0, in_queue=31025, util=61.30% I guess you are reading the wrong IOPS number; the 'ios' is the number of ios performed by all groups, not the IOPS result. Find the 'iops' ;-) So, here are the numbers without/with this patch. (seq-read, seq-write, rand-read, rand-write) Before: read : io=98304KB, bw=63015KB/s, iops=15753, runt= 1560msec write: io=98304KB, bw=56823KB/s, iops=14205, runt= 1730msec read : io=98304KB, bw=62139KB/s, iops=15534, runt= 1582msec write: io=98304KB, bw=53836KB/s, iops=13458, runt= 1826msec After: read : io=98304KB, bw=63096KB/s, iops=15774, runt= 1558msec write: io=98304KB, bw=55823KB/s, iops=13955, runt= 1761msec read : io=98304KB, bw=59148KB/s, iops=14787, runt= 1662msec write: io=98304KB, bw=55072KB/s, iops=13768, runt= 1785msec Submitting more io requests at one time is not supposed to increase the iops or bw so dramatically. I even tried to submit all read/write ops in one io_submit, which still ends up with very little iops or bw improvement.
After: vda: ios=333114/0, merge=0/0, ticks=47983/0, in_queue=46630, util=62.22% Write IOPS: Before: vda: ios=0/271716, merge=0/0, ticks=0/33367, in_queue=26531, util=59.96% After: vda: ios=0/327485, merge=0/0, ticks=0/23789, in_queue=22475, util=55.74% Read BW: Before: READ: io=7526.0MB, aggrb=1246.3MB/s, minb=1275.1MB/s, maxb=1275.1MB/s, mint=6040msec, maxt=6040msec After: READ: io=7526.0MB, aggrb=1315.5MB/s, minb=1346.7MB/s, maxb=1346.7MB/s, mint=5723msec, maxt=5723msec Write BW: Before: WRITE: io=7526.0MB, aggrb=1110.2MB/s, minb=1136.9MB/s, maxb=1136.9MB/s, mint=6779msec, maxt=6779msec After: WRITE: io=7526.0MB, aggrb=1113.5MB/s, minb=1140.3MB/s, maxb=1140.3MB/s, mint=6759msec, maxt=6759msec Signed-off-by: Sasha Levin levinsasha...@gmail.com --- tools/kvm/disk/core.c |2 - tools/kvm/disk/raw.c | 38 +-- tools/kvm/include/kvm/disk-image.h |8 +- tools/kvm/util/read-write.c|8 +- tools/kvm/virtio/blk.c |3 ++ 5 files changed, 42 insertions(+), 17 deletions(-) diff --git a/tools/kvm/disk/core.c b/tools/kvm/disk/core.c index 4915efd..b18014e 100644 --- a/tools/kvm/disk/core.c +++ b/tools/kvm/disk/core.c @@ -5,8 +5,6 @@ #include sys/eventfd.h #include sys/poll.h -#define AIO_MAX 32 - int debug_iodelay; #ifdef CONFIG_HAS_AIO diff --git a/tools/kvm/disk/raw.c b/tools/kvm/disk/raw.c index caa023c..1162fb7 100644 --- a/tools/kvm/disk/raw.c +++ b/tools/kvm/disk/raw.c @@ -10,9 +10,9 @@ ssize_t raw_image__read_sector(struct disk_image *disk, u64 sector, const struct u64 offset = sector SECTOR_SHIFT; #ifdef CONFIG_HAS_AIO - struct iocb iocb; + struct iocb *iocb = disk-iocb[disk-count++]; - return aio_preadv(disk-ctx, iocb, disk-fd, iov, iovcount, offset, + return aio_preadv(disk-ctx, iocb, disk-fd, iov, iovcount, offset, disk-evt, param); #else return preadv_in_full(disk-fd, iov, iovcount, offset); @@ -25,9 +25,9 @@ ssize_t raw_image__write_sector(struct disk_image *disk, u64 sector, const struc u64 offset = sector SECTOR_SHIFT; #ifdef CONFIG_HAS_AIO - struct iocb iocb; + struct 
iocb *iocb = disk-iocb[disk-count++]; - return aio_pwritev(disk-ctx, iocb, disk-fd, iov, iovcount, offset, + return aio_pwritev(disk-ctx, iocb, disk-fd, iov, iovcount, offset, disk-evt, param); #else return pwritev_in_full(disk-fd, iov, iovcount, offset); @@ -79,19 +79,33 @@ int raw_image__close(struct disk_image *disk) close(disk-evt); -#ifdef CONFIG_HAS_VIRTIO +#ifdef CONFIG_HAS_AIO io_destroy(disk-ctx); #endif return ret; } +static int raw_image__aio_submit(struct disk_image *disk) +{ + int ret; + + ret = io_submit(disk-ctx, disk-count, disk-iocb_ptrs); + if (ret 0) + disk-count = 0; + + return ret; +} + /* * multiple buffer based disk image operations */ static struct disk_image_operations raw_image_regular_ops = { .read_sector= raw_image__read_sector, .write_sector = raw_image__write_sector, +#ifdef CONFIG_HAS_AIO + .submit = raw_image__aio_submit, +#endif }; struct disk_image_operations ro_ops = { @@ -120,8 +134,13 @@ struct disk_image *raw_image__probe(int fd, struct stat *st, bool readonly) disk = disk_image__new(fd, st-st_size, ro_ops_nowrite, DISK_IMAGE_REGULAR); #ifdef CONFIG_HAS_AIO - if
Re: [PATCH 0/4 V7] Avoid soft lockup message when KVM is stopped by host
On Thursday 15 December 2011, Eric B Munson wrote: Documentation/virtual/kvm/api.txt | 12 arch/ia64/include/asm/kvm_para.h|5 + arch/powerpc/include/asm/kvm_para.h |5 + arch/s390/include/asm/kvm_para.h|5 + arch/x86/include/asm/kvm_host.h |2 ++ arch/x86/include/asm/kvm_para.h |8 arch/x86/include/asm/pvclock-abi.h |1 + arch/x86/kernel/kvmclock.c | 21 + arch/x86/kvm/x86.c | 20 include/asm-generic/kvm_para.h | 14 ++ include/linux/kvm.h |2 ++ kernel/watchdog.c | 12 12 files changed, 107 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/kvm_para.h asm-generic changes Acked-by: Arnd Bergmann a...@arndb.de but please remove me from the cc-list for the other patches. Arnd
Re: [PATCH 3.2] KVM: Make KVM_INTEL depend on CPU_SUP_INTEL
On 12/14/2011 02:27 AM, Avi Kivity wrote: PMU virtualization needs to talk to Intel-specific bits of perf; these are only available when CPU_SUP_INTEL=y. Fixes arch/x86/built-in.o: In function `atomic_switch_perf_msrs': vmx.c:(.text+0x6b1d4): undefined reference to `perf_guest_get_msrs' Reported-by: Ingo Molnar mi...@elte.hu Reported-by: Randy Dunlap rdun...@xenotime.net Signed-off-by: Avi Kivity a...@redhat.com Acked-by: Randy Dunlap rdun...@xenotime.net Thanks. --- arch/x86/kvm/Kconfig |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index ff5790d..ca4d49e 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -52,6 +52,8 @@ config KVM config KVM_INTEL tristate KVM for Intel processors support depends on KVM + # for perf_guest_get_msrs(): + depends on CPU_SUP_INTEL ---help--- Provides support for KVM on Intel processors equipped with the VT extensions. -- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code ***
Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
On Dec 15, 2011, at 1:47 PM, Vivek Goyal wrote: Ok, I continued to develop on the patch which tries to allocate per cpu stats from worker thread context. Here is the patch. Can the reporter please try out the patch and see if it helps. I am not sure if deadlock was because of mutex issue or not, but it should help get rid of lockdep warning. This patch is on top of for-3.3/core branch of jens's linux-block tree. If it does not apply on your kernel version, do let me know the version you are testing with and I will generate another version of patch. If testing results are good, I will break down the patch in small pieces and post as a series separately. Thanks Vivek Running on a fedora-16 box with the patch applied to the linux-block tree I still had deadlocks. In fact it seemed to happen much faster and with lighter workloads. I was able to get netconsole setup and a full stacktrace is posted here: http://pastebin.com/9Rq68exU Nate
Re: [RFT PATCH] blkio: alloc per cpu data from worker thread context( Re: kvm deadlock)
On Fri, Dec 16, 2011 at 12:43:52PM -0600, Nate Custer wrote: On Dec 15, 2011, at 1:47 PM, Vivek Goyal wrote: Ok, I continued to develop on the patch which tries to allocate per cpu stats from worker thread context. Here is the patch. Can the reporter please try out the patch and see if it helps. I am not sure if deadlock was because of mutex issue or not, but it should help get rid of lockdep warning. This patch is on top of for-3.3/core branch of jens's linux-block tree. If it does not apply on your kernel version, do let me know the version you are testing with and I will generate another version of patch. If testing results are good, I will break down the patch in small pieces and post as a series separately. Thanks Vivek Running on a fedora-16 box with the patch applied to the linux-block tree I still had deadlocks. In fact it seemed to happen much faster and with lighter workloads. I was able to get netconsole setup and a full stacktrace is posted here: http://pastebin.com/9Rq68exU Thanks for testing it Nate. I did some debugging and found out that the patch is doing a double free on the per cpu pointer, hence the crash you are running into. I could reproduce this problem on my box. It is just a matter of doing rmdir on the blkio cgroup. I understood the cmpxchg() semantics wrong. I have fixed it now and there are no crashes on directory removal. Can you please give this version a try? Thanks Vivek Alloc per cpu data from a worker thread context to avoid the possibility of a deadlock under memory pressure. The only side effect of delayed allocation is that we will lose the blkio cgroup stats for that group for a small duration. This patch is generated on top of the for-3.3/core branch of the linux-block tree. -v2: Fixed the issue of double free on percpu pointer.
--- block/blk-cgroup.c | 31 - block/blk-cgroup.h |9 +++ block/blk-throttle.c | 116 - block/cfq-iosched.c | 119 ++- 4 files changed, 180 insertions(+), 95 deletions(-) Index: block/blk-throttle.c === --- block/blk-throttle.c.orig 2011-12-17 01:49:34.0 -0500 +++ block/blk-throttle.c2011-12-17 02:21:50.0 -0500 @@ -81,6 +81,7 @@ struct throtl_grp { int limits_changed; struct rcu_head rcu_head; + struct work_struct stat_alloc_work; }; struct throtl_data @@ -159,6 +160,13 @@ static void throtl_free_tg(struct rcu_he struct throtl_grp *tg; tg = container_of(head, struct throtl_grp, rcu_head); + + /* +* Will delayed allocation be visible here for sure? I am relying +* on the fact that after blkg.stats_cpu assignment, we drop +* reference to group using atomic_dec() which should imply +* barrier +*/ free_percpu(tg-blkg.stats_cpu); kfree(tg); } @@ -181,6 +189,48 @@ static void throtl_put_tg(struct throtl_ call_rcu(tg-rcu_head, throtl_free_tg); } +static inline void throtl_check_schedule_pcpu_stat_alloc(struct throtl_grp *tg) { + if (tg-blkg.stats_cpu != NULL) + return; + /* +* Schedule the group per cpu stat allocation through worker +* thread +*/ + throtl_ref_get_tg(tg); + if (!schedule_work(tg-stat_alloc_work)) { + /* Work is already scheduled by somebody */ + throtl_put_tg(tg); + } +} + +/* No locks taken. 
A reference to throtl_grp taken before invocation */ +static void tg_stat_alloc_fn(struct work_struct *work) +{ + struct throtl_grp *tg = container_of(work, struct throtl_grp, + stat_alloc_work); + void *stat_ptr = NULL; + unsigned long ret; + + stat_ptr = blkio_alloc_blkg_stats_pcpu(); + + if (stat_ptr == NULL) { + /* If memory allocation failed, try again */ + throtl_check_schedule_pcpu_stat_alloc(tg); + throtl_put_tg(tg); + return; + } + + ret = blkio_cmpxchg_blkg_stats(tg-blkg, 0, + (unsigned long)stat_ptr); + + if (ret != 0) { + /* Somebody else got to it first */ + free_percpu(stat_ptr); + } + + throtl_put_tg(tg); +} + static void throtl_init_group(struct throtl_grp *tg) { INIT_HLIST_NODE(tg-tg_node); @@ -188,6 +238,7 @@ static void throtl_init_group(struct thr bio_list_init(tg-bio_lists[0]); bio_list_init(tg-bio_lists[1]); tg-limits_changed = false; + INIT_WORK(tg-stat_alloc_work, tg_stat_alloc_fn); /* Practically unlimited BW */ tg-bps[0] = tg-bps[1] = -1; @@ -263,8 +314,25 @@ static void throtl_init_add_tg_lists(str throtl_add_group_to_td_list(td, tg); } -/* Should be called without queue lock and
[PATCH v5] kvm: make vcpu life cycle separated from kvm instance
From: Liu Ping Fan pingf...@linux.vnet.ibm.com Currently, vcpu can be destructed only when kvm instance destroyed. Change this to vcpu's destruction before kvm instance, so vcpu MUST and CAN be destroyed before kvm's destroy. Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com --- arch/x86/kvm/i8254.c | 10 +++-- arch/x86/kvm/i8259.c | 12 -- arch/x86/kvm/x86.c | 53 +++ include/linux/kvm_host.h | 20 - virt/kvm/irq_comm.c |6 ++- virt/kvm/kvm_main.c | 106 ++--- 6 files changed, 132 insertions(+), 75 deletions(-) diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c index 76e3f1c..a3a5506 100644 --- a/arch/x86/kvm/i8254.c +++ b/arch/x86/kvm/i8254.c @@ -289,9 +289,8 @@ static void pit_do_work(struct work_struct *work) struct kvm_pit *pit = container_of(work, struct kvm_pit, expired); struct kvm *kvm = pit-kvm; struct kvm_vcpu *vcpu; - int i; struct kvm_kpit_state *ps = pit-pit_state; - int inject = 0; + int idx, inject = 0; /* Try to inject pending interrupts when * last one has been acked. @@ -315,9 +314,12 @@ static void pit_do_work(struct work_struct *work) * LVT0 to NMI delivery. Other PIC interrupts are just sent to * VCPU0, and only if its LVT0 is in EXTINT mode. 
*/ - if (kvm-arch.vapics_in_nmi_mode 0) - kvm_for_each_vcpu(i, vcpu, kvm) + if (kvm-arch.vapics_in_nmi_mode 0) { + idx = srcu_read_lock(kvm-srcu_vcpus); + kvm_for_each_vcpu(vcpu, kvm) kvm_apic_nmi_wd_deliver(vcpu); + srcu_read_unlock(kvm-srcu_vcpus, idx); + } } } diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index cac4746..5ef5c05 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -50,25 +50,29 @@ static void pic_unlock(struct kvm_pic *s) { bool wakeup = s-wakeup_needed; struct kvm_vcpu *vcpu, *found = NULL; - int i; + struct kvm *kvm = s-kvm; + int idx; s-wakeup_needed = false; spin_unlock(s-lock); if (wakeup) { - kvm_for_each_vcpu(i, vcpu, s-kvm) { + idx = srcu_read_lock(kvm-srcu_vcpus); + kvm_for_each_vcpu(vcpu, kvm) if (kvm_apic_accept_pic_intr(vcpu)) { found = vcpu; break; } - } - if (!found) + if (!found) { + srcu_read_unlock(kvm-srcu_vcpus, idx); return; + } kvm_make_request(KVM_REQ_EVENT, found); kvm_vcpu_kick(found); + srcu_read_unlock(kvm-srcu_vcpus, idx); } } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 23c93fe..b79739d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1774,14 +1774,20 @@ static int get_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) { u64 data = 0; + int idx; switch (msr) { case HV_X64_MSR_VP_INDEX: { - int r; + int r = 0; struct kvm_vcpu *v; - kvm_for_each_vcpu(r, v, vcpu-kvm) + struct kvm *kvm = vcpu-kvm; + idx = srcu_read_lock(kvm-srcu_vcpus); + kvm_for_each_vcpu(v, vcpu-kvm) { if (v == vcpu) data = r; + r++; + } + srcu_read_unlock(kvm-srcu_vcpus, idx); break; } case HV_X64_MSR_EOI: @@ -4529,7 +4535,7 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va struct cpufreq_freqs *freq = data; struct kvm *kvm; struct kvm_vcpu *vcpu; - int i, send_ipi = 0; + int idx, send_ipi = 0; /* * We allow guests to temporarily run on slowing clocks, @@ -4579,13 +4585,16 @@ static int 
kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va raw_spin_lock(kvm_lock); list_for_each_entry(kvm, vm_list, vm_list) { - kvm_for_each_vcpu(i, vcpu, kvm) { + idx = srcu_read_lock(kvm-srcu_vcpus); + kvm_for_each_vcpu(vcpu, kvm) { if (vcpu-cpu != freq-cpu) continue; kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); if (vcpu-cpu != smp_processor_id()) send_ipi = 1; } + srcu_read_unlock(kvm-srcu_vcpus, idx); + }
Re: [PATCH v4] kvm: make vcpu life cycle separated from kvm instance
2011/12/15 Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp: (2011/12/15 13:28), Liu Ping Fan wrote: From: Liu Ping Fanpingf...@linux.vnet.ibm.com Currently, vcpu can be destructed only when kvm instance destroyed. Change this to vcpu's destruction before kvm instance, so vcpu MUST and CAN be destroyed before kvm's destroy. Could you explain why this change is needed here? Would be helpful for those, including me, who will read the commit later. Suppose the following scene, Firstly, creating 10 kvm_vcpu for guest to take the advantage of multi-core. Now, reclaiming some of the kvm_vcpu, so we can limit the guest's usage of cpu. Then what about the kvm_vcpu unused? Currently they are just idle in kernel, but with this patch, we can remove them. Signed-off-by: Liu Ping Fanpingf...@linux.vnet.ibm.com --- ... diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index cac4746..f275b8c 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -50,25 +50,28 @@ static void pic_unlock(struct kvm_pic *s) { bool wakeup = s-wakeup_needed; struct kvm_vcpu *vcpu, *found = NULL; - int i; + struct kvm *kvm = s-kvm; s-wakeup_needed = false; spin_unlock(s-lock); if (wakeup) { - kvm_for_each_vcpu(i, vcpu, s-kvm) { + rcu_read_lock(); + kvm_for_each_vcpu(vcpu, kvm) if (kvm_apic_accept_pic_intr(vcpu)) { found = vcpu; break; } - } - if (!found) + if (!found) { + rcu_read_unlock(); return; + } kvm_make_request(KVM_REQ_EVENT, found); kvm_vcpu_kick(found); + rcu_read_unlock(); } } How about this? (just about stylistic issues) :-), I just want to change based on old code. But your style is OK too. if (!wakeup) return; rcu_read_lock(); kvm_for_each_vcpu(vcpu, kvm) if (kvm_apic_accept_pic_intr(vcpu)) { found = vcpu; break; } if (!found) goto out; kvm_make_request(KVM_REQ_EVENT, found); kvm_vcpu_kick(found); out: rcu_read_unlock(); ... diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c ... 
+void kvm_arch_vcpu_zap(struct work_struct *work) +{ + struct kvm_vcpu *vcpu = container_of(work, struct kvm_vcpu, + zap_work); + struct kvm *kvm = vcpu-kvm; - atomic_set(kvm-online_vcpus, 0); - mutex_unlock(kvm-lock); + kvm_clear_async_pf_completion_queue(vcpu); + kvm_unload_vcpu_mmu(vcpu); + kvm_arch_vcpu_free(vcpu); + kvm_put_kvm(kvm); } zap is really a good name for this? zap = destroy, so I think it is OK. ... diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index d526231..733de1c 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -19,6 +19,7 @@ #includelinux/slab.h #includelinux/rcupdate.h #includelinux/ratelimit.h +#includelinux/atomic.h #includeasm/signal.h #includelinux/kvm.h @@ -113,6 +114,10 @@ enum { struct kvm_vcpu { struct kvm *kvm; + atomic_t refcount; + struct list_head list; + struct rcu_head head; + struct work_struct zap_work; How about adding some comments? zap_work is not at all self explanatory, IMO. #ifdef CONFIG_PREEMPT_NOTIFIERS struct preempt_notifier preempt_notifier; #endif @@ -241,9 +246,9 @@ struct kvm { u32 bsp_vcpu_id; struct kvm_vcpu *bsp_vcpu; #endif - struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; + struct list_head vcpus; atomic_t online_vcpus; - int last_boosted_vcpu; + struct kvm_vcpu *last_boosted_vcpu; struct list_head vm_list; struct mutex lock; struct kvm_io_bus *buses[KVM_NR_BUSES]; @@ -290,17 +295,15 @@ struct kvm { #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt) #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu-kvm, fmt) -static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i) -{ - smp_rmb(); - return kvm-vcpus[i]; -} +struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu); +void kvm_vcpu_put(struct kvm_vcpu *vcpu); +void kvm_arch_vcpu_zap(struct work_struct *work); + +#define kvm_for_each_vcpu(vcpu, kvm) \ + list_for_each_entry_rcu(vcpu,kvm-vcpus, list) Is this macro really worth it? _rcu shows readers important information, I think. 
I guess kvm_for_each_vcpu is designed to hide the details of the internal implementation; currently it is implemented by an array, and my patch will change it to a linked list, so IMO we can still hide the details. Regards, ping fan -#define kvm_for_each_vcpu(idx, vcpup, kvm) \ - for (idx = 0; \ - idx