Re: [PATCH] vhost-blk: Add vhost-blk support v5
On Tue, Nov 20, 2012 at 02:39:40PM +0800, Asias He wrote: On 11/20/2012 04:26 AM, Michael S. Tsirkin wrote: On Mon, Nov 19, 2012 at 04:53:42PM +0800, Asias He wrote: vhost-blk is an in-kernel virito-blk device accelerator. Due to lack of proper in-kernel AIO interface, this version converts guest's I/O request to bio and use submit_bio() to submit I/O directly. So this version any supports raw block device as guest's disk image, e.g. /dev/sda, /dev/ram0. We can add file based image support to vhost-blk once we have in-kernel AIO interface. There are some work in progress for in-kernel AIO interface from Dave Kleikamp and Zach Brown: http://marc.info/?l=linux-fsdevelm=133312234313122 Performance evaluation: - 1) LKVM Fio with libaio ioengine on Fusion IO device using kvm tool IOPS(k)Before After Improvement seq-read 107 121 +13.0% seq-write 130 179 +37.6% rnd-read 102 122 +19.6% rnd-write 125 159 +27.0% 2) QEMU Fio with libaio ioengine on Fusion IO device using QEMU IOPS(k)Before After Improvement seq-read 76 123 +61.8% seq-write 139 173 +24.4% rnd-read 73 120 +64.3% rnd-write 75 156 +108.0% Could you compare with dataplane qemu as well please? Well, I will try to collect it. Userspace bits: - 1) LKVM The latest vhost-blk userspace bits for kvm tool can be found here: g...@github.com:asias/linux-kvm.git blk.vhost-blk 2) QEMU The latest vhost-blk userspace prototype for QEMU can be found here: g...@github.com:asias/qemu.git blk.vhost-blk Changes in v5: - Do not assume the buffer layout - Fix wakeup race Changes in v4: - Mark req-status as userspace pointer - Use __copy_to_user() instead of copy_to_user() in vhost_blk_set_status() - Add if (need_resched()) schedule() in blk thread - Kill vhost_blk_stop_vq() and move it into vhost_blk_stop() - Use vq_err() instead of pr_warn() - Fail un Unsupported request - Add flush in vhost_blk_set_features() Changes in v3: - Sending REQ_FLUSH bio instead of vfs_fsync, thanks Christoph! - Check file passed by user is a raw block device file Signed-off-by: Asias He as...@redhat.com Since there are files shared by this and vhost net it's easiest for me to merge this all through the vhost tree. Jens, could you ack this and the bio usage in this driver please? --- drivers/vhost/Kconfig | 1 + drivers/vhost/Kconfig.blk | 10 + drivers/vhost/Makefile| 2 + drivers/vhost/blk.c | 697 ++ drivers/vhost/blk.h | 8 + 5 files changed, 718 insertions(+) create mode 100644 drivers/vhost/Kconfig.blk create mode 100644 drivers/vhost/blk.c create mode 100644 drivers/vhost/blk.h diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig index 202bba6..acd8038 100644 --- a/drivers/vhost/Kconfig +++ b/drivers/vhost/Kconfig @@ -11,4 +11,5 @@ config VHOST_NET if STAGING source drivers/vhost/Kconfig.tcm +source drivers/vhost/Kconfig.blk endif diff --git a/drivers/vhost/Kconfig.blk b/drivers/vhost/Kconfig.blk new file mode 100644 index 000..ff8ab76 --- /dev/null +++ b/drivers/vhost/Kconfig.blk @@ -0,0 +1,10 @@ +config VHOST_BLK + tristate Host kernel accelerator for virtio blk (EXPERIMENTAL) + depends on BLOCK EXPERIMENTAL m + ---help--- +This kernel module can be loaded in host kernel to accelerate +guest block with virtio_blk. Not to be confused with virtio_blk +module itself which needs to be loaded in guest kernel. + +To compile this driver as a module, choose M here: the module will +be called vhost_blk. diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile index a27b053..1a8a4a5 100644 --- a/drivers/vhost/Makefile +++ b/drivers/vhost/Makefile @@ -2,3 +2,5 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o vhost_net-y := vhost.o net.o obj-$(CONFIG_TCM_VHOST) += tcm_vhost.o +obj-$(CONFIG_VHOST_BLK) += vhost_blk.o +vhost_blk-y := blk.o diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c new file mode 100644 index 000..f0f118a --- /dev/null +++ b/drivers/vhost/blk.c @@ -0,0 +1,697 @@ +/* + * Copyright (C) 2011 Taobao, Inc. + * Author: Liu Yuan tailai...@taobao.com + * + * Copyright (C) 2012 Red Hat, Inc. + * Author: Asias He as...@redhat.com + * + * This work is licensed under the terms of the GNU GPL, version 2. + * + * virtio-blk server in host kernel. + */ + +#include linux/miscdevice.h +#include linux/module.h +#include linux/vhost.h +#include linux/virtio_blk.h +#include linux/mutex.h +#include linux/file.h +#include linux/kthread.h +#include linux/blkdev.h +#include
Re: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages
On Sun, Nov 18, 2012 at 09:59:47AM -0500, Sasha Levin wrote: On Sat, Nov 17, 2012 at 4:54 PM, Rafael Aquini aqu...@redhat.com wrote: On Sat, Nov 17, 2012 at 01:01:30PM -0500, Sasha Levin wrote: I'm getting the following while fuzzing using trinity inside a KVM tools guest, on latest -next: [ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at 0194 [ 1642.785083] IP: [8122b354] isolate_migratepages_range+0x344/0x7b0 My guess is that we see those because of a race during the check in isolate_migratepages_range(). Thanks, Sasha Sasha, could you share your .config and steps you did used with trinity? So I can attempt to reproduce this issue you reported. Basically try running trinity (with ./trinity -m --quiet --dangerous -l off) inside a disposable guest as root. I manage to hit that every couple of hours. Config attached. Howdy Sasha, After several hours since last Sunday running trinity tests on a traditional KVM-QEMU guest as well as running it on a lkvm guest (both running next-20121115) I couldn't hit a single time the crash you've reported, (un)fortunately. Also, the .config you gave me, applied on top of next-20121115, haven't produced the same bin you've running and hitting the mentioned bug, apparently. Here's the RIP for your crash: [ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at 0194 [ 1642.785083] IP: [8122b354] isolate_migratepages_range+0x344/0x7b0 And here's the symbol address for the next-20121115 with your .config I've been running tests on: [raquini@x61 linux]$ nm -n vmlinux | grep isolate_migratepages_range 8122d890 T isolate_migratepages_range Also, it seems quite clear I'm missing something from your tree, as applying the RIP displacement (0x344) to my local isolate_migratepages_range sym addr leads me to the _middle_ of a instruction opcode that does not dereference any pointers at all. So, if you're consistently reproducing the same crash, consider to share with us a disassembled dump from the isolate_migratepages_range() you're running along with the crash stack-dump, please. Cheers! -- Rafael ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE
On 20.11.12 at 16:04, Daniel Kiper daniel.ki...@oracle.com wrote: Some implementations (e.g. Xen PVOPS) could not use part of identity page table to construct transition page table. It means that they require separate PUDs, PMDs and PTEs for virtual and physical (identity) mapping. To satisfy that requirement add extra pointer to PGD, PUD, PMD and PTE and align existing code. As said for v1 already - this is not really a requirement of the interface, or else none of our Xen kernels since 2.6.30 would have worked. I don't think it is desirable to introduce overhead for everyone if it's not even needed for Xen. Jan Signed-off-by: Daniel Kiper daniel.ki...@oracle.com --- arch/x86/include/asm/kexec.h | 10 +++--- arch/x86/kernel/machine_kexec_64.c | 12 ++-- 2 files changed, 13 insertions(+), 9 deletions(-) diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 317ff17..3cf5600 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -157,9 +157,13 @@ struct kimage_arch { }; #else struct kimage_arch { - pud_t *pud; - pmd_t *pmd; - pte_t *pte; + pgd_t *pgd; + pud_t *pud0; + pud_t *pud1; + pmd_t *pmd0; + pmd_t *pmd1; + pte_t *pte0; + pte_t *pte1; }; #endif diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c index b3ea9db..976e54b 100644 --- a/arch/x86/kernel/machine_kexec_64.c +++ b/arch/x86/kernel/machine_kexec_64.c @@ -137,9 +137,9 @@ out: static void free_transition_pgtable(struct kimage *image) { - free_page((unsigned long)image-arch.pud); - free_page((unsigned long)image-arch.pmd); - free_page((unsigned long)image-arch.pte); + free_page((unsigned long)image-arch.pud0); + free_page((unsigned long)image-arch.pmd0); + free_page((unsigned long)image-arch.pte0); } static int init_transition_pgtable(struct kimage *image, pgd_t *pgd) @@ -157,7 +157,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd) pud = (pud_t *)get_zeroed_page(GFP_KERNEL); if (!pud) goto err; - image-arch.pud = pud; + image-arch.pud0 = pud; set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE)); } pud = pud_offset(pgd, vaddr); @@ -165,7 +165,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd) pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL); if (!pmd) goto err; - image-arch.pmd = pmd; + image-arch.pmd0 = pmd; set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE)); } pmd = pmd_offset(pud, vaddr); @@ -173,7 +173,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd) pte = (pte_t *)get_zeroed_page(GFP_KERNEL); if (!pte) goto err; - image-arch.pte = pte; + image-arch.pte0 = pte; set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE)); } pte = pte_offset_kernel(pmd, vaddr); -- 1.5.6.5 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
Daniel Kiper daniel.ki...@oracle.com writes: Some kexec/kdump implementations (e.g. Xen PVOPS) could not use default functions or require some changes in behavior of kexec/kdump generic code. To cope with that problem kexec_ops struct was introduced. It allows a developer to replace all or some functions and control some functionality of kexec/kdump generic code. Default behavior of kexec/kdump generic code is not changed. Ick. v2 - suggestions/fixes: - add comment for kexec_ops.crash_alloc_temp_store member (suggested by Konrad Rzeszutek Wilk), - simplify kexec_ops usage (suggested by Konrad Rzeszutek Wilk). Signed-off-by: Daniel Kiper daniel.ki...@oracle.com --- include/linux/kexec.h | 26 ++ kernel/kexec.c| 131 + 2 files changed, 125 insertions(+), 32 deletions(-) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index d0b8458..c8d0b35 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -116,7 +116,33 @@ struct kimage { #endif }; +struct kexec_ops { + /* + * Some kdump implementations (e.g. Xen PVOPS dom0) could not access + * directly crash kernel memory area. In this situation they must + * allocate memory outside of it and later move contents from temporary + * storage to final resting places (usualy done by relocate_kernel()). + * Such behavior could be enforced by setting + * crash_alloc_temp_store member to true. + */ Why in the world would Xen not be able to access crash kernel memory? As currently defined it is normal memory that the kernel chooses not to use. If relocate kernel can access that memory you definitely can access the memory so the comment does not make any sense. + bool crash_alloc_temp_store; + struct page *(*kimage_alloc_pages)(gfp_t gfp_mask, + unsigned int order, + unsigned long limit); + void (*kimage_free_pages)(struct page *page); + unsigned long (*page_to_pfn)(struct page *page); + struct page *(*pfn_to_page)(unsigned long pfn); + unsigned long (*virt_to_phys)(volatile void *address); + void *(*phys_to_virt)(unsigned long address); + int (*machine_kexec_prepare)(struct kimage *image); + int (*machine_kexec_load)(struct kimage *image); + void (*machine_kexec_cleanup)(struct kimage *image); + void (*machine_kexec_unload)(struct kimage *image); + void (*machine_kexec_shutdown)(void); + void (*machine_kexec)(struct kimage *image); +}; Ugh. This is a nasty abstraction. You are mixing and matching a bunch of things together here. If you need to override machine_kexec_xxx please do that on a per architecture basis. Special case overrides of page_to_pfn, pfn_to_page, virt_to_phys, phys_to_virt, and friends seem completely inappropriate. There may be a point to all of these but you are mixing and matching things badly. Eric ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v11 7/7] mm: add vm event counters for balloon pages compaction
On Fri, 9 Nov 2012 12:58:29 -0200 Rafael Aquini aqu...@redhat.com wrote: On Fri, Nov 09, 2012 at 12:20:33PM +, Mel Gorman wrote: On Wed, Nov 07, 2012 at 01:05:54AM -0200, Rafael Aquini wrote: This patch introduces a new set of vm event counters to keep track of ballooned pages compaction activity. Signed-off-by: Rafael Aquini aqu...@redhat.com Other than confirming the thing actually works can any meaningful conclusions be drawn from this counters? I know I have been inconsistent on this myself in the past but recently I've been taking the attitude that the counters can be used to fit into some other metric. I'm looking to change the compaction counters to be able to build a basic cost model for example. The same idea could be used for balloons of course but it's a less critical path than compaction for THP for example. Assuming it builds and all the defines are correct when the feature is not configured (I didn't check) then there is nothing wrong with the patch. However, if it was dropped would it make life very hard or would you notice? Originally, I proposed this patch as droppable (and it's still droppable) because its major purpose was solely to show the thing working consistently OTOH, it might make the life easier to spot breakages if it remains with the merged bits, and per a reviewer request I removed its 'DROP BEFORE MERGE' disclaimer. https://lkml.org/lkml/2012/8/8/616 There's a lot to be said for not merging things. I think I'll maintain this as a mm-only patch. That way it's available in linux-next and we can merge it later if a need arises. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v11 4/7] mm: introduce compaction and migration for ballooned pages
On Fri, 9 Nov 2012 12:16:02 + Mel Gorman m...@csn.ul.ie wrote: On Wed, Nov 07, 2012 at 01:05:51AM -0200, Rafael Aquini wrote: Memory fragmentation introduced by ballooning might reduce significantly the number of 2MB contiguous memory blocks that can be used within a guest, thus imposing performance penalties associated with the reduced number of transparent huge pages that could be used by the guest workload. This patch introduces the helper functions as well as the necessary changes to teach compaction and migration bits how to cope with pages which are part of a guest memory balloon, in order to make them movable by memory compaction procedures. ... --- a/mm/compaction.c +++ b/mm/compaction.c @@ -14,6 +14,7 @@ #include linux/backing-dev.h #include linux/sysctl.h #include linux/sysfs.h +#include linux/balloon_compaction.h #include internal.h #if defined CONFIG_COMPACTION || defined CONFIG_CMA @@ -565,9 +566,24 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc, goto next_pageblock; } - /* Check may be lockless but that's ok as we recheck later */ - if (!PageLRU(page)) + /* +* Check may be lockless but that's ok as we recheck later. +* It's possible to migrate LRU pages and balloon pages +* Skip any other type of page +*/ + if (!PageLRU(page)) { + if (unlikely(balloon_page_movable(page))) { Because it's lockless, it really seems that the barrier stuck down there is unnecessary. At worst you get a temporarily incorrect answer that you recheck later under page lock in balloon_page_isolate. What happened with this? Also: what barrier? ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages
On 11/20/2012 09:14 AM, Rafael Aquini wrote: On Sun, Nov 18, 2012 at 09:59:47AM -0500, Sasha Levin wrote: On Sat, Nov 17, 2012 at 4:54 PM, Rafael Aquini aqu...@redhat.com wrote: On Sat, Nov 17, 2012 at 01:01:30PM -0500, Sasha Levin wrote: I'm getting the following while fuzzing using trinity inside a KVM tools guest, on latest -next: [ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at 0194 [ 1642.785083] IP: [8122b354] isolate_migratepages_range+0x344/0x7b0 My guess is that we see those because of a race during the check in isolate_migratepages_range(). Thanks, Sasha Sasha, could you share your .config and steps you did used with trinity? So I can attempt to reproduce this issue you reported. Basically try running trinity (with ./trinity -m --quiet --dangerous -l off) inside a disposable guest as root. I manage to hit that every couple of hours. Config attached. Howdy Sasha, After several hours since last Sunday running trinity tests on a traditional KVM-QEMU guest as well as running it on a lkvm guest (both running next-20121115) I couldn't hit a single time the crash you've reported, (un)fortunately. Odd... I can see it happening here every couple of hours. Also, the .config you gave me, applied on top of next-20121115, haven't produced the same bin you've running and hitting the mentioned bug, apparently. Here's the RIP for your crash: [ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at 0194 [ 1642.785083] IP: [8122b354] isolate_migratepages_range+0x344/0x7b0 And here's the symbol address for the next-20121115 with your .config I've been running tests on: [raquini@x61 linux]$ nm -n vmlinux | grep isolate_migratepages_range 8122d890 T isolate_migratepages_range Also, it seems quite clear I'm missing something from your tree, as applying the RIP displacement (0x344) to my local isolate_migratepages_range sym addr leads me to the _middle_ of a instruction opcode that does not dereference any pointers at all. Yup, I carry another small fix to mpol (which is unrelated to this one). So, if you're consistently reproducing the same crash, consider to share with us a disassembled dump from the isolate_migratepages_range() you're running along with the crash stack-dump, please. Sure! The call chain is: isolate_migratepages_range balloon_page_movable __is_movable_balloon_page mapping_balloon mapping_balloon() fails because it checks for mapping to be non-null (and it is - it's usually a small value like 0x50), and then it dereferences that. The relevant assembly is: static inline int mapping_balloon(struct address_space *mapping) { return mapping test_bit(AS_BALLOON_MAP, mapping-flags); 17ab: 48 85 c0test %rax,%rax 17ae: 0f 84 4c 02 00 00 je 1a00 isolate_migratepages_range+0x590 17b4: 48 8b 80 40 01 00 00mov0x140(%rax),%rax 17bb: a9 00 00 00 20 test $0x2000,%eax 17c0: 0f 84 3a 02 00 00 je 1a00 isolate_migratepages_range+0x590 It dies on 17b4. Let me know if you need anything else from me, I can also add debug code into the kernel if it would help you... Thanks, Sasha ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization