Re: [PATCH] vhost-blk: Add vhost-blk support v5

2012-11-20 Thread Michael S. Tsirkin
On Tue, Nov 20, 2012 at 02:39:40PM +0800, Asias He wrote:
 On 11/20/2012 04:26 AM, Michael S. Tsirkin wrote:
  On Mon, Nov 19, 2012 at 04:53:42PM +0800, Asias He wrote:
  vhost-blk is an in-kernel virito-blk device accelerator.
 
  Due to lack of proper in-kernel AIO interface, this version converts
  guest's I/O request to bio and use submit_bio() to submit I/O directly.
  So this version any supports raw block device as guest's disk image,
  e.g. /dev/sda, /dev/ram0. We can add file based image support to
  vhost-blk once we have in-kernel AIO interface. There are some work in
  progress for in-kernel AIO interface from Dave Kleikamp and Zach Brown:
 
 http://marc.info/?l=linux-fsdevelm=133312234313122
 
  Performance evaluation:
  -
  1) LKVM
  Fio with libaio ioengine on Fusion IO device using kvm tool
  IOPS(k)Before   After   Improvement
  seq-read   107  121 +13.0%
  seq-write  130  179 +37.6%
  rnd-read   102  122 +19.6%
  rnd-write  125  159 +27.0%
 
  2) QEMU
  Fio with libaio ioengine on Fusion IO device using QEMU
  IOPS(k)Before   After   Improvement
  seq-read   76   123 +61.8%
  seq-write  139  173 +24.4%
  rnd-read   73   120 +64.3%
  rnd-write  75   156 +108.0%
  
  Could you compare with dataplane qemu as well please?
 
 
 Well, I will try to collect it.
 
  
 
  Userspace bits:
  -
  1) LKVM
  The latest vhost-blk userspace bits for kvm tool can be found here:
  g...@github.com:asias/linux-kvm.git blk.vhost-blk
 
  2) QEMU
  The latest vhost-blk userspace prototype for QEMU can be found here:
  g...@github.com:asias/qemu.git blk.vhost-blk
 
  Changes in v5:
  - Do not assume the buffer layout
  - Fix wakeup race
 
  Changes in v4:
  - Mark req-status as userspace pointer
  - Use __copy_to_user() instead of copy_to_user() in vhost_blk_set_status()
  - Add if (need_resched()) schedule() in blk thread
  - Kill vhost_blk_stop_vq() and move it into vhost_blk_stop()
  - Use vq_err() instead of pr_warn()
  - Fail un Unsupported request
  - Add flush in vhost_blk_set_features()
 
  Changes in v3:
  - Sending REQ_FLUSH bio instead of vfs_fsync, thanks Christoph!
  - Check file passed by user is a raw block device file
 
  Signed-off-by: Asias He as...@redhat.com
  
  Since there are files shared by this and vhost net
  it's easiest for me to merge this all through the
  vhost tree.
  
  Jens, could you ack this and the bio usage in this driver
  please?
  
  ---
   drivers/vhost/Kconfig |   1 +
   drivers/vhost/Kconfig.blk |  10 +
   drivers/vhost/Makefile|   2 +
   drivers/vhost/blk.c   | 697 
  ++
   drivers/vhost/blk.h   |   8 +
   5 files changed, 718 insertions(+)
   create mode 100644 drivers/vhost/Kconfig.blk
   create mode 100644 drivers/vhost/blk.c
   create mode 100644 drivers/vhost/blk.h
 
  diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
  index 202bba6..acd8038 100644
  --- a/drivers/vhost/Kconfig
  +++ b/drivers/vhost/Kconfig
  @@ -11,4 +11,5 @@ config VHOST_NET
   
   if STAGING
   source drivers/vhost/Kconfig.tcm
  +source drivers/vhost/Kconfig.blk
   endif
  diff --git a/drivers/vhost/Kconfig.blk b/drivers/vhost/Kconfig.blk
  new file mode 100644
  index 000..ff8ab76
  --- /dev/null
  +++ b/drivers/vhost/Kconfig.blk
  @@ -0,0 +1,10 @@
  +config VHOST_BLK
  +  tristate Host kernel accelerator for virtio blk (EXPERIMENTAL)
  +  depends on BLOCK   EXPERIMENTAL  m
  +  ---help---
  +This kernel module can be loaded in host kernel to accelerate
  +guest block with virtio_blk. Not to be confused with virtio_blk
  +module itself which needs to be loaded in guest kernel.
  +
  +To compile this driver as a module, choose M here: the module will
  +be called vhost_blk.
  diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
  index a27b053..1a8a4a5 100644
  --- a/drivers/vhost/Makefile
  +++ b/drivers/vhost/Makefile
  @@ -2,3 +2,5 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o
   vhost_net-y := vhost.o net.o
   
   obj-$(CONFIG_TCM_VHOST) += tcm_vhost.o
  +obj-$(CONFIG_VHOST_BLK) += vhost_blk.o
  +vhost_blk-y := blk.o
  diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
  new file mode 100644
  index 000..f0f118a
  --- /dev/null
  +++ b/drivers/vhost/blk.c
  @@ -0,0 +1,697 @@
  +/*
  + * Copyright (C) 2011 Taobao, Inc.
  + * Author: Liu Yuan tailai...@taobao.com
  + *
  + * Copyright (C) 2012 Red Hat, Inc.
  + * Author: Asias He as...@redhat.com
  + *
  + * This work is licensed under the terms of the GNU GPL, version 2.
  + *
  + * virtio-blk server in host kernel.
  + */
  +
  +#include linux/miscdevice.h
  +#include linux/module.h
  +#include linux/vhost.h
  +#include linux/virtio_blk.h
  +#include linux/mutex.h
  +#include linux/file.h
  +#include linux/kthread.h
  +#include linux/blkdev.h
  +#include 

Re: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages

2012-11-20 Thread Rafael Aquini
On Sun, Nov 18, 2012 at 09:59:47AM -0500, Sasha Levin wrote:
 On Sat, Nov 17, 2012 at 4:54 PM, Rafael Aquini aqu...@redhat.com wrote:
  On Sat, Nov 17, 2012 at 01:01:30PM -0500, Sasha Levin wrote:
 
  I'm getting the following while fuzzing using trinity inside a KVM tools 
  guest,
  on latest -next:
 
  [ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at 
  0194
  [ 1642.785083] IP: [8122b354] 
  isolate_migratepages_range+0x344/0x7b0
 
  My guess is that we see those because of a race during the check in
  isolate_migratepages_range().
 
 
  Thanks,
  Sasha
 
  Sasha, could you share your .config and steps you did used with trinity? So 
  I
  can attempt to reproduce this issue you reported.
 
 Basically try running trinity (with ./trinity -m --quiet --dangerous
 -l off) inside a disposable guest as root.
 
 I manage to hit that every couple of hours.
 
 Config attached.
 

Howdy Sasha,

After several hours since last Sunday running trinity tests on a traditional
KVM-QEMU guest as well as running it on a lkvm guest (both running
next-20121115) I couldn't hit a single time the crash you've reported,
(un)fortunately.

Also, the .config you gave me, applied on top of next-20121115, haven't produced
the same bin you've running and hitting the mentioned bug, apparently.

Here's the RIP for your crash:
[ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at
0194
[ 1642.785083] IP: [8122b354] isolate_migratepages_range+0x344/0x7b0


And here's the symbol address for the next-20121115 with your .config I've been
running tests on:
[raquini@x61 linux]$ nm -n vmlinux | grep isolate_migratepages_range 
8122d890 T isolate_migratepages_range

Also, it seems quite clear I'm missing something from your tree, as applying the
RIP displacement (0x344) to my local isolate_migratepages_range sym addr leads
me to the _middle_ of a instruction opcode that does not dereference any
pointers at all.

So, if you're consistently reproducing the same crash, consider to share with us
a disassembled dump from the isolate_migratepages_range() you're running along
with the crash stack-dump, please.

Cheers!
-- Rafael

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 02/11] x86/kexec: Add extra pointers to transition page table PGD, PUD, PMD and PTE

2012-11-20 Thread Jan Beulich
 On 20.11.12 at 16:04, Daniel Kiper daniel.ki...@oracle.com wrote:
 Some implementations (e.g. Xen PVOPS) could not use part of identity page 
 table
 to construct transition page table. It means that they require separate 
 PUDs,
 PMDs and PTEs for virtual and physical (identity) mapping. To satisfy that
 requirement add extra pointer to PGD, PUD, PMD and PTE and align existing 
 code.

As said for v1 already - this is not really a requirement of the
interface, or else none of our Xen kernels since 2.6.30 would
have worked. I don't think it is desirable to introduce overhead
for everyone if it's not even needed for Xen.

Jan

 Signed-off-by: Daniel Kiper daniel.ki...@oracle.com
 ---
  arch/x86/include/asm/kexec.h   |   10 +++---
  arch/x86/kernel/machine_kexec_64.c |   12 ++--
  2 files changed, 13 insertions(+), 9 deletions(-)
 
 diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
 index 317ff17..3cf5600 100644
 --- a/arch/x86/include/asm/kexec.h
 +++ b/arch/x86/include/asm/kexec.h
 @@ -157,9 +157,13 @@ struct kimage_arch {
  };
  #else
  struct kimage_arch {
 - pud_t *pud;
 - pmd_t *pmd;
 - pte_t *pte;
 + pgd_t *pgd;
 + pud_t *pud0;
 + pud_t *pud1;
 + pmd_t *pmd0;
 + pmd_t *pmd1;
 + pte_t *pte0;
 + pte_t *pte1;
  };
  #endif
  
 diff --git a/arch/x86/kernel/machine_kexec_64.c 
 b/arch/x86/kernel/machine_kexec_64.c
 index b3ea9db..976e54b 100644
 --- a/arch/x86/kernel/machine_kexec_64.c
 +++ b/arch/x86/kernel/machine_kexec_64.c
 @@ -137,9 +137,9 @@ out:
  
  static void free_transition_pgtable(struct kimage *image)
  {
 - free_page((unsigned long)image-arch.pud);
 - free_page((unsigned long)image-arch.pmd);
 - free_page((unsigned long)image-arch.pte);
 + free_page((unsigned long)image-arch.pud0);
 + free_page((unsigned long)image-arch.pmd0);
 + free_page((unsigned long)image-arch.pte0);
  }
  
  static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
 @@ -157,7 +157,7 @@ static int init_transition_pgtable(struct kimage *image, 
 pgd_t *pgd)
   pud = (pud_t *)get_zeroed_page(GFP_KERNEL);
   if (!pud)
   goto err;
 - image-arch.pud = pud;
 + image-arch.pud0 = pud;
   set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
   }
   pud = pud_offset(pgd, vaddr);
 @@ -165,7 +165,7 @@ static int init_transition_pgtable(struct kimage *image, 
 pgd_t *pgd)
   pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL);
   if (!pmd)
   goto err;
 - image-arch.pmd = pmd;
 + image-arch.pmd0 = pmd;
   set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
   }
   pmd = pmd_offset(pud, vaddr);
 @@ -173,7 +173,7 @@ static int init_transition_pgtable(struct kimage *image, 
 pgd_t *pgd)
   pte = (pte_t *)get_zeroed_page(GFP_KERNEL);
   if (!pte)
   goto err;
 - image-arch.pte = pte;
 + image-arch.pte0 = pte;
   set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
   }
   pte = pte_offset_kernel(pmd, vaddr);
 -- 
 1.5.6.5



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct

2012-11-20 Thread Eric W. Biederman
Daniel Kiper daniel.ki...@oracle.com writes:

 Some kexec/kdump implementations (e.g. Xen PVOPS) could not use default
 functions or require some changes in behavior of kexec/kdump generic code.
 To cope with that problem kexec_ops struct was introduced. It allows
 a developer to replace all or some functions and control some
 functionality of kexec/kdump generic code.

 Default behavior of kexec/kdump generic code is not changed.

Ick.

 v2 - suggestions/fixes:
- add comment for kexec_ops.crash_alloc_temp_store member
  (suggested by Konrad Rzeszutek Wilk),
- simplify kexec_ops usage
  (suggested by Konrad Rzeszutek Wilk).

 Signed-off-by: Daniel Kiper daniel.ki...@oracle.com
 ---
  include/linux/kexec.h |   26 ++
  kernel/kexec.c|  131 
 +
  2 files changed, 125 insertions(+), 32 deletions(-)

 diff --git a/include/linux/kexec.h b/include/linux/kexec.h
 index d0b8458..c8d0b35 100644
 --- a/include/linux/kexec.h
 +++ b/include/linux/kexec.h
 @@ -116,7 +116,33 @@ struct kimage {
  #endif
  };
  
 +struct kexec_ops {
 + /*
 +  * Some kdump implementations (e.g. Xen PVOPS dom0) could not access
 +  * directly crash kernel memory area. In this situation they must
 +  * allocate memory outside of it and later move contents from temporary
 +  * storage to final resting places (usualy done by relocate_kernel()).
 +  * Such behavior could be enforced by setting
 +  * crash_alloc_temp_store member to true.
 +  */

Why in the world would Xen not be able to access crash kernel memory?
As currently defined it is normal memory that the kernel chooses not to
use.

If relocate kernel can access that memory you definitely can access the
memory so the comment does not make any sense.

 + bool crash_alloc_temp_store;
 + struct page *(*kimage_alloc_pages)(gfp_t gfp_mask,
 + unsigned int order,
 + unsigned long limit);
 + void (*kimage_free_pages)(struct page *page);
 + unsigned long (*page_to_pfn)(struct page *page);
 + struct page *(*pfn_to_page)(unsigned long pfn);
 + unsigned long (*virt_to_phys)(volatile void *address);
 + void *(*phys_to_virt)(unsigned long address);
 + int (*machine_kexec_prepare)(struct kimage *image);
 + int (*machine_kexec_load)(struct kimage *image);
 + void (*machine_kexec_cleanup)(struct kimage *image);
 + void (*machine_kexec_unload)(struct kimage *image);
 + void (*machine_kexec_shutdown)(void);
 + void (*machine_kexec)(struct kimage *image);
 +};

Ugh.  This is a nasty abstraction.

You are mixing and matching a bunch of things together here.

If you need to override machine_kexec_xxx please do that on a per
architecture basis.

Special case overrides of page_to_pfn, pfn_to_page, virt_to_phys,
phys_to_virt, and friends seem completely inappropriate.

There may be a point to all of these but you are mixing and matching
things badly.


Eric
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v11 7/7] mm: add vm event counters for balloon pages compaction

2012-11-20 Thread Andrew Morton
On Fri, 9 Nov 2012 12:58:29 -0200
Rafael Aquini aqu...@redhat.com wrote:

 On Fri, Nov 09, 2012 at 12:20:33PM +, Mel Gorman wrote:
  On Wed, Nov 07, 2012 at 01:05:54AM -0200, Rafael Aquini wrote:
   This patch introduces a new set of vm event counters to keep track of
   ballooned pages compaction activity.
   
   Signed-off-by: Rafael Aquini aqu...@redhat.com
  
  Other than confirming the thing actually works can any meaningful
  conclusions be drawn from this counters?
  
  I know I have been inconsistent on this myself in the past but recently
  I've been taking the attitude that the counters can be used to fit into
  some other metric. I'm looking to change the compaction counters to be
  able to build a basic cost model for example. The same idea could be
  used for balloons of course but it's a less critical path than
  compaction for THP for example.
  
  Assuming it builds and all the defines are correct when the feature is
  not configured (I didn't check) then there is nothing wrong with the
  patch. However, if it was dropped would it make life very hard or would
  you notice?
  
 
 Originally, I proposed this patch as droppable (and it's still droppable)
 because its major purpose was solely to show the thing working consistently
 
 OTOH, it might make the life easier to spot breakages if it remains with the
 merged bits, and per a reviewer request I removed its 'DROP BEFORE MERGE'
 disclaimer.
 
https://lkml.org/lkml/2012/8/8/616

There's a lot to be said for not merging things.

I think I'll maintain this as a mm-only patch.  That way it's
available in linux-next and we can merge it later if a need arises.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v11 4/7] mm: introduce compaction and migration for ballooned pages

2012-11-20 Thread Andrew Morton
On Fri, 9 Nov 2012 12:16:02 +
Mel Gorman m...@csn.ul.ie wrote:

 On Wed, Nov 07, 2012 at 01:05:51AM -0200, Rafael Aquini wrote:
  Memory fragmentation introduced by ballooning might reduce significantly
  the number of 2MB contiguous memory blocks that can be used within a guest,
  thus imposing performance penalties associated with the reduced number of
  transparent huge pages that could be used by the guest workload.
  
  This patch introduces the helper functions as well as the necessary changes
  to teach compaction and migration bits how to cope with pages which are
  part of a guest memory balloon, in order to make them movable by memory
  compaction procedures.
  

 ...

  --- a/mm/compaction.c
  +++ b/mm/compaction.c
  @@ -14,6 +14,7 @@
   #include linux/backing-dev.h
   #include linux/sysctl.h
   #include linux/sysfs.h
  +#include linux/balloon_compaction.h
   #include internal.h
   
   #if defined CONFIG_COMPACTION || defined CONFIG_CMA
  @@ -565,9 +566,24 @@ isolate_migratepages_range(struct zone *zone, struct 
  compact_control *cc,
  goto next_pageblock;
  }
   
  -   /* Check may be lockless but that's ok as we recheck later */
  -   if (!PageLRU(page))
  +   /*
  +* Check may be lockless but that's ok as we recheck later.
  +* It's possible to migrate LRU pages and balloon pages
  +* Skip any other type of page
  +*/
  +   if (!PageLRU(page)) {
  +   if (unlikely(balloon_page_movable(page))) {
 
 Because it's lockless, it really seems that the barrier stuck down there
 is unnecessary. At worst you get a temporarily incorrect answer that you
 recheck later under page lock in balloon_page_isolate.

What happened with this?

Also: what barrier?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v12 4/7] mm: introduce compaction and migration for ballooned pages

2012-11-20 Thread Sasha Levin
On 11/20/2012 09:14 AM, Rafael Aquini wrote:
 On Sun, Nov 18, 2012 at 09:59:47AM -0500, Sasha Levin wrote:
 On Sat, Nov 17, 2012 at 4:54 PM, Rafael Aquini aqu...@redhat.com wrote:
 On Sat, Nov 17, 2012 at 01:01:30PM -0500, Sasha Levin wrote:

 I'm getting the following while fuzzing using trinity inside a KVM tools 
 guest,
 on latest -next:

 [ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at 
 0194
 [ 1642.785083] IP: [8122b354] 
 isolate_migratepages_range+0x344/0x7b0

 My guess is that we see those because of a race during the check in
 isolate_migratepages_range().


 Thanks,
 Sasha

 Sasha, could you share your .config and steps you did used with trinity? So 
 I
 can attempt to reproduce this issue you reported.

 Basically try running trinity (with ./trinity -m --quiet --dangerous
 -l off) inside a disposable guest as root.

 I manage to hit that every couple of hours.

 Config attached.

 
 Howdy Sasha,
 
 After several hours since last Sunday running trinity tests on a traditional
 KVM-QEMU guest as well as running it on a lkvm guest (both running
 next-20121115) I couldn't hit a single time the crash you've reported,
 (un)fortunately.

Odd... I can see it happening here every couple of hours.

 Also, the .config you gave me, applied on top of next-20121115, haven't 
 produced
 the same bin you've running and hitting the mentioned bug, apparently.
 
 Here's the RIP for your crash:
 [ 1642.783728] BUG: unable to handle kernel NULL pointer dereference at
 0194
 [ 1642.785083] IP: [8122b354] isolate_migratepages_range+0x344/0x7b0
 
 
 And here's the symbol address for the next-20121115 with your .config I've 
 been
 running tests on:
 [raquini@x61 linux]$ nm -n vmlinux | grep isolate_migratepages_range 
 8122d890 T isolate_migratepages_range
 
 Also, it seems quite clear I'm missing something from your tree, as applying 
 the
 RIP displacement (0x344) to my local isolate_migratepages_range sym addr leads
 me to the _middle_ of a instruction opcode that does not dereference any
 pointers at all.

Yup, I carry another small fix to mpol (which is unrelated to this one).

 So, if you're consistently reproducing the same crash, consider to share with 
 us
 a disassembled dump from the isolate_migratepages_range() you're running along
 with the crash stack-dump, please.

Sure!

The call chain is:

isolate_migratepages_range
balloon_page_movable
__is_movable_balloon_page
mapping_balloon

mapping_balloon() fails because it checks for mapping to be non-null (and it is 
-
it's usually a small value like 0x50), and then it dereferences that.

The relevant assembly is:

static inline int mapping_balloon(struct address_space *mapping)
{
return mapping  test_bit(AS_BALLOON_MAP, mapping-flags);
17ab:   48 85 c0test   %rax,%rax
17ae:   0f 84 4c 02 00 00   je 1a00 
isolate_migratepages_range+0x590
17b4:   48 8b 80 40 01 00 00mov0x140(%rax),%rax
17bb:   a9 00 00 00 20  test   $0x2000,%eax
17c0:   0f 84 3a 02 00 00   je 1a00 
isolate_migratepages_range+0x590

It dies on 17b4.

Let me know if you need anything else from me, I can also add debug code into 
the
kernel if it would help you...


Thanks,
Sasha


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization