Re: [PATCH] virtio-blk: allow toggling host cache between writeback and writethrough

2012-07-05 Thread Badari Pulavarty
On Tue, 2012-07-03 at 15:19 +0200, Paolo Bonzini wrote:
 This patch adds support for the new VIRTIO_BLK_F_CONFIG_WCE feature,
 which exposes the cache mode in the configuration space and lets the
 driver modify it.  The cache mode is exposed via sysfs.
 
 Even if the host does not support the new feature, the cache mode is
 visible (thanks to the existing VIRTIO_BLK_F_WCE), but not modifiable.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---

Hi Paolo,

Curious: what is the host-side change needed to support this? Would QEMU
close and re-open the device/file with the corresponding flags
(O_SYNC)?

Also, is there a way to expose cache=none (O_DIRECT) to the guest?
Our cluster filesystem folks need a way to verify/guarantee that a
virtio-blk device has cache=none selected on the host. Otherwise, they
cannot support a cluster filesystem running inside a VM (on
virtio-blk).

Thoughts?

Thanks,
Badari



Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-08-14 Thread Badari Pulavarty

On 8/14/2011 8:20 PM, Liu Yuan wrote:

On 08/13/2011 12:12 AM, Badari Pulavarty wrote:

On 8/12/2011 4:40 AM, Liu Yuan wrote:

On 08/12/2011 04:27 PM, Liu Yuan wrote:

On 08/12/2011 12:50 PM, Badari Pulavarty wrote:

On 8/10/2011 8:19 PM, Liu Yuan wrote:

On 08/11/2011 11:01 AM, Liu Yuan wrote:


It looks like the patch wouldn't work for testing multiple 
devices.


vhost_blk_open() does
+   used_info_cachep = KMEM_CACHE(used_info, SLAB_HWCACHE_ALIGN |
                                  SLAB_PANIC);



This is weird. How do you open multiple devices? I just opened the
devices with the following command:


-drive file=/dev/sda6,if=virtio,cache=none,aio=native -drive 
file=~/data0.img,if=virtio,cache=none,aio=native -drive 
file=~/data1.img,if=virtio,cache=none,aio=native


And I didn't meet any problem.

This would tell QEMU to open three devices and pass three FDs
to three instances of the vhost_blk module.

So KMEM_CACHE() is okay in vhost_blk_open().



Oh, you are right. KMEM_CACHE() is in the wrong place; it is the
three instances of vhost worker threads that get created. Hmmm, but I didn't
hit any problem when opening and running it. So strange. I'll
go figure it out.



When opening the second device, we get a panic since used_info_cachep is
already created. Just to make progress, I moved this call to
vhost_blk_init().
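A minimal sketch of that workaround, i.e. creating the cache once at module
load instead of once per open (names follow the quoted patch; the struct body
and surrounding module-init details are assumptions, not the actual code):

/* struct used_info is defined in the vhost-blk patch itself; omitted here. */

static struct kmem_cache *used_info_cachep;

/* Create the cache once, at module load time, instead of in
 * vhost_blk_open(), which runs once per opened device and would try to
 * create the same cache a second time (hence the panic on the second open). */
static int __init vhost_blk_init(void)
{
	used_info_cachep = KMEM_CACHE(used_info, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
	return 0;	/* the real init also registers the misc device, etc. */
}
module_init(vhost_blk_init);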

I don't see any host panics now. With a single block device (dd),
it seems to work fine. But when I start testing multiple block
devices I quickly run into hangs in the guest. I see the following
messages in the guest from virtio_ring.c:

virtio_blk virtio2: requests: id 0 is not a head !
virtio_blk virtio1: requests: id 0 is not a head !
virtio_blk virtio4: requests: id 1 is not a head !
virtio_blk virtio3: requests: id 39 is not a head !

Thanks,
Badari




vq->data[] is initialized by the guest virtio-blk driver and
vhost_blk is unaware of it. It looks like the used ID passed
over by vhost_blk to the guest virtio_blk is wrong, but that should
not happen. :|


And I can't reproduce this on my laptop. :(


Finally, found the issue  :)

Culprit is:

+static struct io_event events[MAX_EVENTS];

With multiple devices, multiple threads could be executing
handle_completion() (one for each fd) at the same time. The events
array is global :( Need to make it one per device/fd.

For a test, I changed MAX_EVENTS to 32 and moved the events array to be
local (on the stack) in handle_completion(). Tests are running fine.

Your laptop must have a single processor, hence you have only one
thread executing handle_completion() at any time..

Thanks,
Badari


Good catch, this is rather cool! Yup, I develop it mostly in a
nested KVM environment, and the L2 host only runs a single processor :(


Thanks,
Yuan
By the way, MAX_EVENTS should be 128, as much as the guest virtio_blk
driver can batch-submit; otherwise the array can overflow.
I turned on the debug output and saw over 100
requests batched from the guest OS.




Hmm.. I am not sure why you see over 100 outstanding events per fd.
Max events could be as high as the number of outstanding IOs.

Anyway, instead of putting it on the stack, I kmalloced it now.
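A minimal sketch of what a per-device, kmalloc'ed event buffer could look
like (field and function names here are illustrative assumptions, not the
actual patch):

#define MAX_EVENTS 128

struct vhost_blk {
	struct vhost_dev dev;
	struct io_event *events;	/* one completion buffer per device/fd,
					 * instead of one global array shared by
					 * every handle_completion() thread */
	/* ... other fields as in the patch ... */
};

static int vhost_blk_alloc_events(struct vhost_blk *blk)
{
	blk->events = kmalloc(MAX_EVENTS * sizeof(struct io_event), GFP_KERNEL);
	if (!blk->events)
		return -ENOMEM;
	return 0;
}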

Dongsu Park, Here is the complete patch.

Thanks
Badari


On a physical machine, the block device driver posts a queue depth to
limit the number of pending requests, normally 31. But the virtio driver
doesn't post one in the guest OS, so nothing prevents the guest OS from
batch-submitting more than 31 requests.

I have noticed over 100 pending requests during guest OS initialization,
and it is reproducible.


BTW, how are the perf numbers for vhost-blk in your environment?


Right now I am doing dd tests to check functionality and stability.

I plan to collect FFSB benchmark results across 6 virtio-blk/vhost-blk
disks with all profiles (sequential read, sequential write, random read,
random write) with block sizes varying from 4k to 1MB.

I will start the test tomorrow. It will take a few days to run through all
the scenarios.

I don't have an easy way to collect host CPU consumption, but for now let's
focus on throughput and latency. I will share the results in a few days.

Thanks
Badari




Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-08-12 Thread Badari Pulavarty

On 8/12/2011 4:40 AM, Liu Yuan wrote:

On 08/12/2011 04:27 PM, Liu Yuan wrote:

On 08/12/2011 12:50 PM, Badari Pulavarty wrote:

On 8/10/2011 8:19 PM, Liu Yuan wrote:

On 08/11/2011 11:01 AM, Liu Yuan wrote:



It looks like the patch wouldn't work for testing multiple devices.

vhost_blk_open() does
+   used_info_cachep = KMEM_CACHE(used_info, SLAB_HWCACHE_ALIGN |
                                  SLAB_PANIC);



This is weird. How do you open multiple devices? I just opened the
devices with the following command:


-drive file=/dev/sda6,if=virtio,cache=none,aio=native -drive 
file=~/data0.img,if=virtio,cache=none,aio=native -drive 
file=~/data1.img,if=virtio,cache=none,aio=native


And I didn't meet any problem.

This would tell QEMU to open three devices and pass three FDs to
three instances of the vhost_blk module.

So KMEM_CACHE() is okay in vhost_blk_open().



Oh, you are right. KMEM_CACHE() is in the wrong place; it is the three
instances of vhost worker threads that get created. Hmmm, but I didn't hit
any problem when opening and running it. So strange. I'll go
figure it out.



When opening the second device, we get a panic since used_info_cachep is
already created. Just to make progress, I moved this call to
vhost_blk_init().

I don't see any host panics now. With single block device (dd),
it seems to work fine. But when I start testing multiple block
devices I quickly run into hangs in the guest. I see following
messages in the guest from virtio_ring.c:

virtio_blk virtio2: requests: id 0 is not a head !
virtio_blk virtio1: requests: id 0 is not a head !
virtio_blk virtio4: requests: id 1 is not a head !
virtio_blk virtio3: requests: id 39 is not a head !

Thanks,
Badari




vq->data[] is initialized by the guest virtio-blk driver and vhost_blk
is unaware of it. It looks like the used ID passed
over by vhost_blk to the guest virtio_blk is wrong, but that should not
happen. :|


And I can't reproduce this on my laptop. :(


Finally, found the issue  :)

Culprit is:

+static struct io_event events[MAX_EVENTS];

With multiple devices, multiple threads could be executing
handle_completion() (one for each fd) at the same time. The events array
is global :( Need to make it one per device/fd.

For a test, I changed MAX_EVENTS to 32 and moved the events array to be
local (on the stack) in handle_completion(). Tests are running fine.

Your laptop must have a single processor, hence you have only one
thread executing handle_completion() at any time..

Thanks,
Badari


Good catch, this is rather cool! Yup, I develop it mostly in a
nested KVM environment, and the L2 host only runs a single processor :(


Thanks,
Yuan
By the way, MAX_EVENTS should be 128, as much as the guest virtio_blk
driver can batch-submit; otherwise the array can overflow.
I turned on the debug output and saw over 100
requests batched from the guest OS.




Hmm.. I am not sure why you see over 100 outstanding events per fd. Max
events could be as high as the number of outstanding IOs.

Anyway, instead of putting it on the stack, I kmalloced it now.

Dongsu Park, Here is the complete patch.

Thanks
Badari


---
 drivers/vhost/Makefile |3 
 drivers/vhost/blk.c|  536 +
 drivers/vhost/vhost.h  |   11 +
 fs/aio.c   |   44 +---
 fs/eventfd.c   |1 
 include/linux/aio.h|   31 ++
 6 files changed, 599 insertions(+), 27 deletions(-)

Index: linux-3.0/drivers/vhost/Makefile
===
--- linux-3.0.orig/drivers/vhost/Makefile   2011-08-10 12:54:26.833639379 
-0400
+++ linux-3.0/drivers/vhost/Makefile2011-08-10 12:56:29.686641851 -0400
@@ -1,2 +1,5 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
+obj-m += vhost_blk.o
+
 vhost_net-y := vhost.o net.o
+vhost_blk-y := vhost.o blk.o
Index: linux-3.0/drivers/vhost/blk.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-3.0/drivers/vhost/blk.c   2011-08-12 11:56:41.041471082 -0400
@@ -0,0 +1,536 @@
+/* Copyright (C) 2011 Taobao, Inc.
+ * Author: Liu Yuan tailai...@taobao.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * Vhost-blk driver is an in-kernel accelerator, intercepting the
+ * IO requests from KVM virtio-capable guests. It is based on the
+ * vhost infrastructure.
+ */
+
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/virtio_net.h>
+#include <linux/vhost.h>
+#include <linux/eventfd.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/virtio_blk.h>
+#include <linux/file.h>
+#include <linux/mmu_context.h>
+#include <linux/kthread.h>
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+#include <linux/blkdev.h>
+
+#include "vhost.h"
+
+#define DEBUG 0
+
+#if DEBUG > 0
+#define dprintk printk
+#else
+#define dprintk(x...)   do { ; } while (0)
+#endif
+
+enum {
+   virtqueue_max = 1,
+};
+
+#define MAX_EVENTS 128
+
+struct vhost_blk {
+   struct

Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-08-11 Thread Badari Pulavarty

On 8/10/2011 8:19 PM, Liu Yuan wrote:

On 08/11/2011 11:01 AM, Liu Yuan wrote:



It looks like the patch wouldn't work for testing multiple devices.

vhost_blk_open() does
+   used_info_cachep = KMEM_CACHE(used_info, SLAB_HWCACHE_ALIGN |
SLAB_PANIC);



This is weird. How do you open multiple devices? I just opened the
devices with the following command:


-drive file=/dev/sda6,if=virtio,cache=none,aio=native -drive 
file=~/data0.img,if=virtio,cache=none,aio=native -drive 
file=~/data1.img,if=virtio,cache=none,aio=native


And I didn't meet any problem.

this would tell qemu to open three devices, and pass three FDs to 
three instances of vhost_blk module.

So KMEM_CACHE() is okay in vhost_blk_open().



Oh, you are right. KMEM_CACHE() is in the wrong place; it is the three
instances of vhost worker threads that get created. Hmmm, but I didn't hit
any problem when opening and running it. So strange. I'll go figure
it out.



When opening the second device, we get a panic since used_info_cachep is
already created. Just to make progress, I moved this call to
vhost_blk_init().

I don't see any host panics now. With single block device (dd),
it seems to work fine. But when I start testing multiple block
devices I quickly run into hangs in the guest. I see following
messages in the guest from virtio_ring.c:

virtio_blk virtio2: requests: id 0 is not a head !
virtio_blk virtio1: requests: id 0 is not a head !
virtio_blk virtio4: requests: id 1 is not a head !
virtio_blk virtio3: requests: id 39 is not a head !

Thanks,
Badari




vq->data[] is initialized by the guest virtio-blk driver and vhost_blk is
unaware of it. It looks like the used ID passed
over by vhost_blk to the guest virtio_blk is wrong, but that should not
happen. :|


And I can't reproduce this on my laptop. :(


I spent a lot of time looking at the code to see how we could pass the
wrong ID and corrupt vq->data[]. I can't seem to spot the bug :(

I hacked vhost_blk to return success immediately, without doing any IO, to
see if it's a generic problem. With the hack (of not doing any IO), I can't
reproduce the problem. So it's something in the IO completion handling code
causing this. I will keep looking..

Thanks,
Badari






Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-08-11 Thread Badari Pulavarty

On 8/10/2011 8:19 PM, Liu Yuan wrote:

On 08/11/2011 11:01 AM, Liu Yuan wrote:



It looks like the patch wouldn't work for testing multiple devices.

vhost_blk_open() does
+   used_info_cachep = KMEM_CACHE(used_info, SLAB_HWCACHE_ALIGN |
SLAB_PANIC);



This is weird. How do you open multiple devices? I just opened the
devices with the following command:


-drive file=/dev/sda6,if=virtio,cache=none,aio=native -drive 
file=~/data0.img,if=virtio,cache=none,aio=native -drive 
file=~/data1.img,if=virtio,cache=none,aio=native


And I didn't meet any problem.

this would tell qemu to open three devices, and pass three FDs to 
three instances of vhost_blk module.

So KMEM_CACHE() is okay in vhost_blk_open().



Oh, you are right. KMEM_CACHE() is in the wrong place; it is the three
instances of vhost worker threads that get created. Hmmm, but I didn't hit
any problem when opening and running it. So strange. I'll go figure
it out.



When opening the second device, we get a panic since used_info_cachep is
already created. Just to make progress, I moved this call to
vhost_blk_init().

I don't see any host panics now. With single block device (dd),
it seems to work fine. But when I start testing multiple block
devices I quickly run into hangs in the guest. I see following
messages in the guest from virtio_ring.c:

virtio_blk virtio2: requests: id 0 is not a head !
virtio_blk virtio1: requests: id 0 is not a head !
virtio_blk virtio4: requests: id 1 is not a head !
virtio_blk virtio3: requests: id 39 is not a head !

Thanks,
Badari




vq->data[] is initialized by the guest virtio-blk driver and vhost_blk is
unaware of it. It looks like the used ID passed
over by vhost_blk to the guest virtio_blk is wrong, but that should not
happen. :|


And I can't reproduce this on my laptop. :(


Finally, found the issue  :)

Culprit is:

+static struct io_event events[MAX_EVENTS];

With multiple devices, multiple threads could be executing handle_completion()
(one for each fd) at the same time. The events array is global :( Need to make
it one per device/fd.

For a test, I changed MAX_EVENTS to 32 and moved the events array to be local
(on the stack) in handle_completion(). Tests are running fine.

Your laptop must have a single processor, hence you have only one thread
executing handle_completion() at any time..

Thanks,
Badari




Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-08-10 Thread Badari Pulavarty
On Wed, 2011-08-10 at 10:19 +0800, Liu Yuan wrote:
 On 08/09/2011 01:16 AM, Badari Pulavarty wrote:
  On 8/8/2011 12:31 AM, Liu Yuan wrote:
  On 08/08/2011 01:04 PM, Badari Pulavarty wrote:
  On 8/7/2011 6:35 PM, Liu Yuan wrote:
  On 08/06/2011 02:02 AM, Badari Pulavarty wrote:
  On 8/5/2011 4:04 AM, Liu Yuan wrote:
  On 08/05/2011 05:58 AM, Badari Pulavarty wrote:
  Hi Liu Yuan,
 
  I started testing your patches. I applied your kernel patch to 3.0
  and applied QEMU to latest git.
 
  I passed 6 blockdevices from the host to guest (4 vcpu, 4GB RAM).
  I ran simple dd read tests from the guest on all block devices
  (with various blocksizes, iflag=direct).
 
  Unfortunately, system doesn't stay up. I immediately get into
  panic on the host. I didn't get time to debug the problem. 
  Wondering
  if you have seen this issue before and/or you have new patchset
  to try ?
 
  Let me know.
 
  Thanks,
  Badari
 
 
  Okay, it is actually a bug pointed out by MST on the other 
  thread, that it needs a mutex for completion thread.
 
  Now would you please this attachment?This patch only applies to 
  kernel part, on top of v1 kernel patch.
 
  This patch mainly moves completion thread into vhost thread as a 
  function. As a result, both requests submitting and completion 
  signalling is in the same thread.
 
  Yuan
 
  Unfortunately, dd tests (4 out of 6) in the guest hung. I see 
  following messages
 
  virtio_blk virtio2: requests: id 0 is not a head !
  virtio_blk virtio3: requests: id 1 is not a head !
  virtio_blk virtio5: requests: id 1 is not a head !
  virtio_blk virtio1: requests: id 1 is not a head !
 
  I still see host panics. I will collect the host panic and see if 
  its still same or not.
 
  Thanks,
  Badari
 
 
  Would you please show me how to reproduce it step by step? I tried 
  dd with two block device attached, but didn't get hung nor panic.
 
  Yuan
 
  I did 6 dds on 6 block devices..
 
  dd if=/dev/vdb of=/dev/null bs=1M iflag=direct 
  dd if=/dev/vdc of=/dev/null bs=1M iflag=direct 
  dd if=/dev/vdd of=/dev/null bs=1M iflag=direct 
  dd if=/dev/vde of=/dev/null bs=1M iflag=direct 
  dd if=/dev/vdf of=/dev/null bs=1M iflag=direct 
  dd if=/dev/vdg of=/dev/null bs=1M iflag=direct 
 
  I can reproduce the problem with in 3 minutes :(
 
  Thanks,
  Badari
 
 
  Ah...I made an embarrassing mistake that I tried to 'free()' an 
  kmem_cache object.
 
  Would you please revert the vblk-for-kernel-2 patch and apply the new 
  one attached in this letter?
 
  Hmm.. My version of the code seems to have kzalloc() for used_info. I 
  don't have a version
  that is using kmem_cache_alloc(). Would it be possible for you to send 
  out complete patch
  (with all the fixes applied) for me to try ? This will avoid all the 
  confusion ..
 
  Thanks,
  Badari
 

 Okay, please apply the attached patch to the vanilla kernel. :)


It looks like the patch wouldn't work for testing multiple devices.

vhost_blk_open() does
+   used_info_cachep = KMEM_CACHE(used_info, SLAB_HWCACHE_ALIGN |
SLAB_PANIC);

When opening the second device, we get a panic since used_info_cachep is
already created. Just to make progress, I moved this call to
vhost_blk_init().

I don't see any host panics now. With a single block device (dd),
it seems to work fine. But when I start testing multiple block
devices I quickly run into hangs in the guest. I see the following
messages in the guest from virtio_ring.c:

virtio_blk virtio2: requests: id 0 is not a head !
virtio_blk virtio1: requests: id 0 is not a head !
virtio_blk virtio4: requests: id 1 is not a head !
virtio_blk virtio3: requests: id 39 is not a head !

Thanks,
Badari





Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-08-08 Thread Badari Pulavarty

On 8/8/2011 12:31 AM, Liu Yuan wrote:

On 08/08/2011 01:04 PM, Badari Pulavarty wrote:

On 8/7/2011 6:35 PM, Liu Yuan wrote:

On 08/06/2011 02:02 AM, Badari Pulavarty wrote:

On 8/5/2011 4:04 AM, Liu Yuan wrote:

On 08/05/2011 05:58 AM, Badari Pulavarty wrote:

Hi Liu Yuan,

I started testing your patches. I applied your kernel patch to 3.0
and applied QEMU to latest git.

I passed 6 blockdevices from the host to guest (4 vcpu, 4GB RAM).
I ran simple dd read tests from the guest on all block devices
(with various blocksizes, iflag=direct).

Unfortunately, system doesn't stay up. I immediately get into
panic on the host. I didn't get time to debug the problem. Wondering
if you have seen this issue before and/or you have new patchset
to try ?

Let me know.

Thanks,
Badari



Okay, it is actually a bug pointed out by MST on the other thread,
that it needs a mutex for the completion thread.


Now would you please try this attachment? This patch only applies to the
kernel part, on top of the v1 kernel patch.


This patch mainly moves the completion thread into the vhost thread as a
function. As a result, both request submission and completion
signalling are in the same thread.


Yuan


Unfortunately, dd tests (4 out of 6) in the guest hung. I see the
following messages:


virtio_blk virtio2: requests: id 0 is not a head !
virtio_blk virtio3: requests: id 1 is not a head !
virtio_blk virtio5: requests: id 1 is not a head !
virtio_blk virtio1: requests: id 1 is not a head !

I still see host panics. I will collect the host panic and see if
it's still the same or not.


Thanks,
Badari


Would you please show me how to reproduce it step by step? I tried
dd with two block devices attached, but didn't get a hang or a panic.


Yuan


I did 6 dds on 6 block devices..

dd if=/dev/vdb of=/dev/null bs=1M iflag=direct &
dd if=/dev/vdc of=/dev/null bs=1M iflag=direct &
dd if=/dev/vdd of=/dev/null bs=1M iflag=direct &
dd if=/dev/vde of=/dev/null bs=1M iflag=direct &
dd if=/dev/vdf of=/dev/null bs=1M iflag=direct &
dd if=/dev/vdg of=/dev/null bs=1M iflag=direct &

I can reproduce the problem within 3 minutes :(

Thanks,
Badari


Ah... I made an embarrassing mistake: I tried to 'free()' a
kmem_cache object.


Would you please revert the vblk-for-kernel-2 patch and apply the new 
one attached in this letter?


Hmm.. My version of the code seems to have kzalloc() for used_info. I
don't have a version that is using kmem_cache_alloc(). Would it be possible
for you to send out a complete patch (with all the fixes applied) for me to
try? This will avoid all the confusion..


Thanks,
Badari




Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-08-07 Thread Badari Pulavarty

On 8/7/2011 6:35 PM, Liu Yuan wrote:

On 08/06/2011 02:02 AM, Badari Pulavarty wrote:

On 8/5/2011 4:04 AM, Liu Yuan wrote:

On 08/05/2011 05:58 AM, Badari Pulavarty wrote:

Hi Liu Yuan,

I started testing your patches. I applied your kernel patch to 3.0
and applied QEMU to latest git.

I passed 6 blockdevices from the host to guest (4 vcpu, 4GB RAM).
I ran simple dd read tests from the guest on all block devices
(with various blocksizes, iflag=direct).

Unfortunately, system doesn't stay up. I immediately get into
panic on the host. I didn't get time to debug the problem. Wondering
if you have seen this issue before and/or you have new patchset
to try ?

Let me know.

Thanks,
Badari



Okay, it is actually a bug pointed out by MST on the other thread,
that it needs a mutex for the completion thread.


Now would you please try this attachment? This patch only applies to the
kernel part, on top of the v1 kernel patch.


This patch mainly moves the completion thread into the vhost thread as a
function. As a result, both request submission and completion
signalling are in the same thread.


Yuan


Unfortunately, dd tests (4 out of 6) in the guest hung. I see the
following messages:


virtio_blk virtio2: requests: id 0 is not a head !
virtio_blk virtio3: requests: id 1 is not a head !
virtio_blk virtio5: requests: id 1 is not a head !
virtio_blk virtio1: requests: id 1 is not a head !

I still see host panics. I will collect the host panic and see if it's
still the same or not.


Thanks,
Badari


Would you please show me how to reproduce it step by step? I tried dd
with two block devices attached, but didn't get a hang or a panic.


Yuan


I did 6 dds on 6 block devices..

dd if=/dev/vdb of=/dev/null bs=1M iflag=direct &
dd if=/dev/vdc of=/dev/null bs=1M iflag=direct &
dd if=/dev/vdd of=/dev/null bs=1M iflag=direct &
dd if=/dev/vde of=/dev/null bs=1M iflag=direct &
dd if=/dev/vdf of=/dev/null bs=1M iflag=direct &
dd if=/dev/vdg of=/dev/null bs=1M iflag=direct &

I can reproduce the problem within 3 minutes :(

Thanks,
Badari




Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-08-05 Thread Badari Pulavarty

On 8/5/2011 4:04 AM, Liu Yuan wrote:

On 08/05/2011 05:58 AM, Badari Pulavarty wrote:

Hi Liu Yuan,

I started testing your patches. I applied your kernel patch to 3.0
and applied QEMU to latest git.

I passed 6 blockdevices from the host to guest (4 vcpu, 4GB RAM).
I ran simple dd read tests from the guest on all block devices
(with various blocksizes, iflag=direct).

Unfortunately, system doesn't stay up. I immediately get into
panic on the host. I didn't get time to debug the problem. Wondering
if you have seen this issue before and/or you have new patchset
to try ?

Let me know.

Thanks,
Badari



Okay, it is actually a bug pointed out by MST on the other thread,
that it needs a mutex for the completion thread.


Now would you please try this attachment? This patch only applies to the
kernel part, on top of the v1 kernel patch.


This patch mainly moves the completion thread into the vhost thread as a
function. As a result, both request submission and completion
signalling are in the same thread.


Yuan


Unfortunately, dd tests (4 out of 6) in the guest hung. I see the
following messages:


virtio_blk virtio2: requests: id 0 is not a head !
virtio_blk virtio3: requests: id 1 is not a head !
virtio_blk virtio5: requests: id 1 is not a head !
virtio_blk virtio1: requests: id 1 is not a head !

I still see host panics. I will collect the host panic and see if it's
still the same or not.


Thanks,
Badari




Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-08-04 Thread Badari Pulavarty
Hi Liu Yuan,

I started testing your patches. I applied your kernel patch to 3.0
and applied QEMU to latest git.

I passed 6 block devices from the host to the guest (4 vcpu, 4GB RAM).
I ran simple dd read tests from the guest on all block devices
(with various block sizes, iflag=direct).

Unfortunately, the system doesn't stay up. I immediately get a
panic on the host. I didn't get time to debug the problem. Wondering
if you have seen this issue before and/or you have a new patchset
to try?

Let me know.

Thanks,
Badari

[ cut here ]
kernel BUG at mm/slab.c:3059!
invalid opcode:  [#1] SMP 
CPU 7 
Modules linked in: vhost_blk ebtable_nat ebtables xt_CHECKSUM bridge stp llc 
autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf cachefiles 
fscache ipt_REJECT ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state 
nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log 
dm_round_robin scsi_dh_rdac dm_multipath vhost_net macvtap macvlan tun 
kvm_intel kvm cdc_ether usbnet mii microcode serio_raw pcspkr i2c_i801 i2c_core 
iTCO_wdt iTCO_vendor_support shpchp ioatdma dca i7core_edac edac_core bnx2 sg 
ext4 mbcache jbd2 sd_mod crc_t10dif qla2xxx scsi_transport_fc scsi_tgt mptsas 
mptscsih mptbase scsi_transport_sas dm_mod [last unloaded: nf_defrag_ipv4]

Pid: 2744, comm: vhost-2698 Not tainted 3.0.0 #2 IBM  -[7870AC1]-/46M0761 
RIP: 0010:[8114932c]  [8114932c] 
cache_alloc_refill+0x22c/0x250
RSP: 0018:880258c87d00  EFLAGS: 00010046
RAX: 0002 RBX: 88027f800040 RCX: dead00200200
RDX: 880271128000 RSI: 0070 RDI: 88026eb6c000
RBP: 880258c87d50 R08: 880271128000 R09: 0003
R10: 00021fff R11: 88026b5790c0 R12: 880272cd8c00
R13: 88027f822440 R14: 0002 R15: 88026eb6c000
FS:  () GS:88027fce() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 00ecb100 CR3: 000270bfe000 CR4: 26e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process vhost-2698 (pid: 2744, threadinfo 880258c86000, task 
8802703154c0)
Stack:
 88020002 000492d0 88027f822460 88027f822450
 880258c87d60 88027f800040  80d0
 80d0 0246 880258c87da0 81149c82
Call Trace:
 [81149c82] kmem_cache_alloc_trace+0x182/0x190
 [a0252f52] handle_guest_kick+0x162/0x799 [vhost_blk]
 [a02514ab] vhost_worker+0xcb/0x150 [vhost_blk]
 [a02513e0] ? vhost_dev_set_owner+0x190/0x190 [vhost_blk]
 [a02513e0] ? vhost_dev_set_owner+0x190/0x190 [vhost_blk]
 [81084c66] kthread+0x96/0xa0
 [814d2f84] kernel_thread_helper+0x4/0x10
 [81084bd0] ? kthread_worker_fn+0x1a0/0x1a0
 [814d2f80] ? gs_change+0x13/0x13
Code: 48 89 df e8 07 fb ff ff 65 8b 14 25 58 dc 00 00 85 c0 48 63 d2 4c 8b 24 
d3 74 16 41 83 3c 24 00 0f 84 fc fd ff ff e9 75 ff ff ff 0f 0b 66 90 eb fc 31 
c0 41 83 3c 24 00 0f 85 62 ff ff ff 90 e9 
RIP  [8114932c] cache_alloc_refill+0x22c/0x250
 RSP 880258c87d00
---[ end trace e286566e512cba7b ]---









Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device

2011-07-29 Thread Badari Pulavarty

Hi Liu Yuan,

I am glad to see that you started looking at vhost-blk. I made an
attempt a year ago to improve block performance using a vhost-blk
approach:

http://lwn.net/Articles/379864/
http://lwn.net/Articles/382543/

I will take a closer look at your patchset to find differences and 
similarities.


- I focused on using VFS interfaces in the kernel, so that I can use it
for file-backed devices. Our use-case scenario is mostly file-backed
images.

- In a few cases, virtio-blk did outperform vhost-blk, which was
counterintuitive, but I couldn't exactly nail down why.

- I had to implement my own threads for parallelism. I see that you are
using the AIO infrastructure to get around that.

- In our high-scale performance testing, what we found is that
block-backed device performance is pretty close to bare metal (91% of
bare metal). vhost-blk didn't add any major benefits to it. I am curious
about your performance analysis and data on where you see the gains and
why.

Hence I prioritized my work low :(

Now that you are interested in driving this, I am happy to work with you
and see what vhost-blk brings to the table (even if it only helps us
improve virtio-blk).

Thanks,
Badari




[RFC] vhost-blk implementation (v2)

2010-04-06 Thread Badari Pulavarty
Hi All,

Here is the latest version of the vhost-blk implementation.
The major difference from my previous implementation is that I
now merge all contiguous requests (both read and write) before
submitting them. This significantly improved IO performance.
I am still collecting performance numbers; I will be posting them
in the next few days.

Comments ?

Todo:
- Address hch's comments on annotations
- Implement per device read/write queues
- Finish up error handling

Thanks,
Badari

---
 drivers/vhost/blk.c |  445 
 1 file changed, 445 insertions(+)

Index: net-next/drivers/vhost/blk.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ net-next/drivers/vhost/blk.c2010-04-06 16:38:03.563847905 -0400
@@ -0,0 +1,445 @@
+ /*
+  * virtio-block server in host kernel.
+  * Inspired by vhost-net and shamlessly ripped code from it :)
+  */
+
+#include <linux/compat.h>
+#include <linux/eventfd.h>
+#include <linux/vhost.h>
+#include <linux/virtio_net.h>
+#include <linux/virtio_blk.h>
+#include <linux/mmu_context.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/rcupdate.h>
+#include <linux/file.h>
+
+#include "vhost.h"
+
+#define VHOST_BLK_VQ_MAX 1
+#define SECTOR_SHIFT 9
+
+struct vhost_blk {
+   struct vhost_dev dev;
+   struct vhost_virtqueue vqs[VHOST_BLK_VQ_MAX];
+   struct vhost_poll poll[VHOST_BLK_VQ_MAX];
+};
+
+struct vhost_blk_io {
+   struct list_head list;
+   struct work_struct work;
+   struct vhost_blk *blk;
+   struct file *file;
+   int head;
+   uint32_t type;
+   uint32_t nvecs;
+   uint64_t sector;
+   uint64_t len;
+   struct iovec iov[0];
+};
+
+static struct workqueue_struct *vblk_workqueue;
+static LIST_HEAD(write_queue);
+static LIST_HEAD(read_queue);
+
+static void handle_io_work(struct work_struct *work)
+{
+   struct vhost_blk_io *vbio, *entry;
+   struct vhost_virtqueue *vq;
+   struct vhost_blk *blk;
+   struct list_head single, *head, *node, *tmp;
+
+   int i, need_free, ret = 0;
+   loff_t pos;
+   uint8_t status = 0;
+
+   vbio = container_of(work, struct vhost_blk_io, work);
+   blk = vbio->blk;
+   vq = &blk->dev.vqs[0];
+   pos = vbio->sector << 8;
+
+   use_mm(blk->dev.mm);
+   if (vbio->type & VIRTIO_BLK_T_FLUSH)  {
+   ret = vfs_fsync(vbio->file, vbio->file->f_path.dentry, 1);
+   } else if (vbio->type & VIRTIO_BLK_T_OUT) {
+   ret = vfs_writev(vbio->file, vbio->iov, vbio->nvecs, &pos);
+   } else {
+   ret = vfs_readv(vbio->file, vbio->iov, vbio->nvecs, &pos);
+   }
+   status = (ret < 0) ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
+   if (vbio->head != -1) {
+   INIT_LIST_HEAD(&single);
+   list_add(&vbio->list, &single);
+   head = &single;
+   need_free = 0;
+   } else {
+   head = &vbio->list;
+   need_free = 1;
+   }
+   list_for_each_entry(entry, head, list) {
+   copy_to_user(entry->iov[entry->nvecs].iov_base, &status, sizeof status);
+   }
+   mutex_lock(&vq->mutex);
+   list_for_each_safe(node, tmp, head) {
+   entry = list_entry(node, struct vhost_blk_io, list);
+   vhost_add_used_and_signal(&blk->dev, vq, entry->head, ret);
+   list_del(node);
+   kfree(entry);
+   }
+   mutex_unlock(&vq->mutex);
+   unuse_mm(blk->dev.mm);
+   if (need_free)
+   kfree(vbio);
+}
+
+static struct vhost_blk_io *allocate_vbio(int nvecs)
+{
+   struct vhost_blk_io *vbio;
+   int size = sizeof(struct vhost_blk_io) + nvecs * sizeof(struct iovec);
+   vbio = kmalloc(size, GFP_KERNEL);
+   if (vbio) {
+   INIT_WORK(&vbio->work, handle_io_work);
+   INIT_LIST_HEAD(&vbio->list);
+   }
+   return vbio;
+}
+
+static void merge_and_handoff_work(struct list_head *queue)
+{
+   struct vhost_blk_io *vbio, *entry;
+   int nvecs = 0;
+   int entries = 0;
+
+   list_for_each_entry(entry, queue, list) {
+   nvecs += entry->nvecs;
+   entries++;
+   }
+
+   if (entries == 1) {
+   vbio = list_first_entry(queue, struct vhost_blk_io, list);
+   list_del(&vbio->list);
+   queue_work(vblk_workqueue, &vbio->work);
+   return;
+   }
+
+   vbio = allocate_vbio(nvecs);
+   if (!vbio) {
+   /* Unable to allocate memory - submit IOs individually */
+   list_for_each_entry(vbio, queue, list) {
+   queue_work(vblk_workqueue, &vbio->work);
+   }
+   INIT_LIST_HEAD(queue);
+   return;
+   }
+
+   entry = list_first_entry(queue, struct vhost_blk_io, list);
+   vbio->nvecs = nvecs;
+   vbio->blk = entry->blk;
+   

Re: [RFC] vhost-blk implementation

2010-04-05 Thread Badari Pulavarty
On Mon, 2010-04-05 at 15:23 -0400, Christoph Hellwig wrote:
 On Wed, Mar 24, 2010 at 01:22:37PM -0700, Badari Pulavarty wrote:
  iovecs and buffers are user-space pointers (from the host kernel point  
  of view). They are
  guest address. So, I don't need to do any set_fs tricks.
 
 From verifying the code and using the sparse annotations it appears
 that the actual buffers are userspace pointers, but the iovecs in
 the virtqueue are kernel level pointers, so you would need some
 annotations.

Yes. That's correct. I will add the appropriate annotations.

 
 While we're at it here is a patch fixing the remaining sparse 
 warnings in vhost-blk:

Applied.

Thanks,
Badari

 
 Signed-off-by: Christoph Hellwig h...@lst.de
 
 Index: linux-2.6/drivers/vhost/blk.c
 ===
 --- linux-2.6.orig/drivers/vhost/blk.c2010-04-05 21:15:11.638004250 
 +0200
 +++ linux-2.6/drivers/vhost/blk.c 2010-04-05 21:16:13.238004599 +0200
 @@ -86,7 +86,7 @@ static void handle_blk(struct vhost_blk
   nvecs++;
    BUG_ON(vq->iov[nvecs].iov_len != 1);
 
 - if (copy_to_user(vq->iov[nvecs].iov_base, &status, sizeof status) < 0) {
 + if (copy_to_user(vq->iov[nvecs].iov_base, &status, sizeof status)) {
    printk("copy to user failed\n");
    vhost_discard_vq_desc(vq);
    break;
 @@ -199,7 +199,7 @@ static struct miscdevice vhost_blk_misc
    &vhost_blk_fops,
  };
 
 -int vhost_blk_init(void)
 +static int vhost_blk_init(void)
  {
    int r = vhost_init();
    if (r)
 @@ -216,7 +216,7 @@ err_init:
  }
  module_init(vhost_blk_init);
 
 -void vhost_blk_exit(void)
 +static void vhost_blk_exit(void)
  {
    misc_deregister(&vhost_blk_misc);
    vhost_cleanup();



Re: [RFC] vhost-blk implementation

2010-04-05 Thread Badari Pulavarty
On Mon, 2010-04-05 at 15:22 +0100, Stefan Hajnoczi wrote:
 On Mon, Mar 29, 2010 at 4:41 PM, Badari Pulavarty pbad...@us.ibm.com wrote:
  +static void handle_io_work(struct work_struct *work)
  +{
  +   struct vhost_blk_io *vbio;
  +   struct vhost_virtqueue *vq;
  +   struct vhost_blk *blk;
  +   int i, ret = 0;
  +   loff_t pos;
  +   uint8_t status = 0;
  +
  +   vbio = container_of(work, struct vhost_blk_io, work);
  +   blk = vbio->blk;
  +   vq = &blk->dev.vqs[0];
  +   pos = vbio->sector << 8;
  +
  +   use_mm(blk->dev.mm);
  +
  +   if (vbio->type & VIRTIO_BLK_T_FLUSH)  {
  +   ret = vfs_fsync(vbio->file, vbio->file->f_path.dentry, 1);
  +   } else if (vbio->type & VIRTIO_BLK_T_OUT) {
  +   ret = vfs_writev(vbio->file, vbio->iov, vbio->nvecs, &pos);
  +   } else {
  +   ret = vfs_readv(vbio->file, vbio->iov, vbio->nvecs, &pos);
  +   }
  +
  +   status = (ret < 0) ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
  +   if (copy_to_user(vbio->iov[vbio->nvecs].iov_base, &status, sizeof status) < 0) {
  +   printk("copy to user failed\n");
  +   vhost_discard_vq_desc(vq);
  +   unuse_mm(blk->dev.mm);
  +   return;
 
 Do you need to kfree(vbio) here?

Yes. I do. As mentioned earlier, I haven't fixed error handling yet :(
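For reference, a minimal sketch of that error path with the leak fixed
(based only on the quoted code above, not an actual follow-up patch):

	status = (ret < 0) ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
	if (copy_to_user(vbio->iov[vbio->nvecs].iov_base, &status, sizeof status) < 0) {
		printk("copy to user failed\n");
		vhost_discard_vq_desc(vq);
		unuse_mm(blk->dev.mm);
		kfree(vbio);		/* free the work item before bailing out */
		return;
	}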
 
  +static long vhost_blk_set_backend(struct vhost_blk *n, unsigned index, int fd)
  +{
  +   struct file *file;
  +   struct vhost_virtqueue *vq;
  +
  +   file = fget(fd);
  +   if (!file)
  +   return -EBADF;
  +
  +   vq = n->vqs + index;
  +   mutex_lock(&vq->mutex);
  +   rcu_assign_pointer(vq->private_data, file);
  +   mutex_unlock(&vq->mutex);
  +   return 0;
  +}
  +
  +
  +static long vhost_blk_ioctl(struct file *f, unsigned int ioctl,
  +unsigned long arg)
  +{
  +   struct vhost_blk *n = f->private_data;
  +   void __user *argp = (void __user *)arg;
  +   struct vhost_vring_file backend;
  +   int r;
  +
  +   switch (ioctl) {
  +case VHOST_NET_SET_BACKEND:
  +   r = copy_from_user(&backend, argp, sizeof backend);
  +   if (r < 0)
  +   return r;
  +   return vhost_blk_set_backend(n, backend.index, backend.fd);
 
 I don't see backend.index being checked against VHOST_BLK_VQ_MAX.

Yep. You are right. I will add these checks for my next revision.
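A minimal sketch of such a check (illustration only, not the actual next
revision; the exact error code is a choice):

	static long vhost_blk_set_backend(struct vhost_blk *n, unsigned index, int fd)
	{
		...
		if (index >= VHOST_BLK_VQ_MAX)
			return -ENOBUFS;	/* reject out-of-range virtqueue index */
		...
	}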

Thanks,
Badari



Re: [RFC] vhost-blk implementation

2010-03-29 Thread Badari Pulavarty
Hi Christoph,

I am wondering if you can provide your thoughts here..

I modified my vhost-blk implementation to offload work to
workqueues instead of doing it synchronously. In fact, I tried
to spread the work across all the CPUs. But to my surprise,
this did not improve the performance compared to virtio-blk.

I see vhost-blk taking more interrupts and context switches
compared to virtio-blk. What is virtio-blk doing that I
am not able to do from vhost-blk ???

Thanks,
Badari


vhost-blk

procs ---memory-- ---swap-- -io --system-- -cpu-
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa st
 3  1   8920  56076  20760 56035560  104   196 79826 17164 13912  0  5 65 30  0
 2  4   9488  57216  20744 56056160  114   195 81120 17397 13824  0  5 65 30  0
 2  2  10028  68476  20728 55947640  108   206 80318 17162 13845  0  5 65 30  0
 0  4  10560  70856  20708 55930880  106   205 82363 17402 13904  0  5 65 30  0
 1  3  10948  80380  20672 55844520   78   178 79714 17113 13875  0  5 66 29  0

qemu virtio-blk:

procs ---memory-- ---swap-- -io --system-- -cpu-
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa st
 0  1  14124  57456   5144 492406000   139 142546 11287 9312  1  4 80 15  0
 0  2  14124  56736   5148 492739600   146 142968 11283 9248  1  4 80 15  0
 0  1  14124  56712   5384 49270200074 150738 11182 9327  1  4 80 16  0
 1  1  14124  55496   5392 492790400 2 159902 11172 9401  1  3 79 17  0
 0  1  14124  55968   5408 492723200 0 159202 11212 9325  1  3 80 16  0

---
 drivers/vhost/blk.c |  310 
 1 file changed, 310 insertions(+)

Index: net-next/drivers/vhost/blk.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ net-next/drivers/vhost/blk.c2010-03-25 20:06:57.484054770 -0400
@@ -0,0 +1,310 @@
+ /*
+  * virtio-block server in host kernel.
+  * Inspired by vhost-net and shamlessly ripped code from it :)
+  */
+
+#include <linux/compat.h>
+#include <linux/eventfd.h>
+#include <linux/vhost.h>
+#include <linux/virtio_net.h>
+#include <linux/virtio_blk.h>
+#include <linux/mmu_context.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/rcupdate.h>
+#include <linux/file.h>
+
+#include "vhost.h"
+
+#define VHOST_BLK_VQ_MAX 1
+
+#if 0
+#define myprintk(fmt, ...) printk(pr_fmt(fmt), ##__VA_ARGS__)
+#else
+#define myprintk(fmt, ...)
+#endif
+
+struct vhost_blk {
+   struct vhost_dev dev;
+   struct vhost_virtqueue vqs[VHOST_BLK_VQ_MAX];
+   struct vhost_poll poll[VHOST_BLK_VQ_MAX];
+};
+
+struct vhost_blk_io {
+   struct work_struct work;
+   struct vhost_blk *blk;
+   struct file *file;
+   int head;
+   uint32_t type;
+   uint64_t sector;
+   struct iovec *iov;
+   int nvecs;
+};
+
+static struct workqueue_struct *vblk_workqueue;
+
+static void handle_io_work(struct work_struct *work)
+{
+   struct vhost_blk_io *vbio;
+   struct vhost_virtqueue *vq;
+   struct vhost_blk *blk;
+   int i, ret = 0;
+   loff_t pos;
+   uint8_t status = 0;
+
+   vbio = container_of(work, struct vhost_blk_io, work);
+   blk = vbio->blk;
+   vq = &blk->dev.vqs[0];
+   pos = vbio->sector << 8;
+
+   use_mm(blk->dev.mm);
+
+   if (vbio->type & VIRTIO_BLK_T_FLUSH)  {
+   ret = vfs_fsync(vbio->file, vbio->file->f_path.dentry, 1);
+   } else if (vbio->type & VIRTIO_BLK_T_OUT) {
+   ret = vfs_writev(vbio->file, vbio->iov, vbio->nvecs, &pos);
+   } else {
+   ret = vfs_readv(vbio->file, vbio->iov, vbio->nvecs, &pos);
+   }
+
+   status = (ret < 0) ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
+   if (copy_to_user(vbio->iov[vbio->nvecs].iov_base, &status, sizeof status) < 0) {
+   printk("copy to user failed\n");
+   vhost_discard_vq_desc(vq);
+   unuse_mm(blk->dev.mm);
+   return;
+   }
+   mutex_lock(&vq->mutex);
+   vhost_add_used_and_signal(&blk->dev, vq, vbio->head, ret);
+   mutex_unlock(&vq->mutex);
+   unuse_mm(blk->dev.mm);
+   kfree(vbio);
+}
+
+static int cpu = 0;
+static int handoff_io(struct vhost_blk *blk, int head,
+   uint32_t type, uint64_t sector,
+   struct iovec *iov, int nvecs)
+{
+   struct vhost_virtqueue *vq = &blk->dev.vqs[0];
+   struct vhost_blk_io *vbio;
+
+   vbio = kmalloc(sizeof(struct vhost_blk_io), GFP_KERNEL);
+   if (!vbio)
+   return -ENOMEM;
+
+   INIT_WORK(&vbio->work, handle_io_work);
+   vbio->blk = blk;
+   vbio->file = vq->private_data;
+   vbio->head = head;
+   vbio->type = type;
+   vbio->sector = sector;
+   vbio->iov = iov;
+   vbio->nvecs = nvecs;
+
+   

Re: [RFC] vhost-blk implementation

2010-03-29 Thread Badari Pulavarty
On Mon, 2010-03-29 at 23:37 +0300, Avi Kivity wrote:
 On 03/29/2010 09:20 PM, Chris Wright wrote:
  * Badari Pulavarty (pbad...@us.ibm.com) wrote:
 
  I modified my vhost-blk implementation to offload work to
  work_queues instead of doing synchronously. Infact, I tried
  to spread the work across all the CPUs. But to my surprise,
  this did not improve the performance compared to virtio-blk.
 
  I see vhost-blk taking more interrupts and context switches
  compared to virtio-blk. What is virtio-blk doing which I
  am not able to from vhost-blk ???
   
  Your io wait time is twice as long and your throughput is about half.
  I think the qmeu block submission does an extra attempt at merging
  requests.  Does blktrace tell you anything interesting?
 

Yes. I see that in my test case (2M writes): QEMU is picking up 512K
requests from the virtio ring and merging them back into 2M requests
before submitting them.

Unfortunately, I can't do that quite easily in vhost-blk. QEMU
re-creates the iovecs for the merged IO. I have to come up with
a scheme to do this :(
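A rough sketch of one possible scheme, in the spirit of what the later v2
patch (merge_and_handoff_work(), shown earlier in this archive) does: build
one request whose iovec array is the concatenation of the contiguous
requests' iovecs. The helper below is illustrative only and leans on the v2
patch's allocate_vbio() and flexible iov[] array; it is not the actual
implementation.

/* Merge a list of contiguous vhost_blk_io requests into a single request
 * by concatenating their iovec arrays, so one vfs_writev()/vfs_readv()
 * covers the whole range. */
static struct vhost_blk_io *merge_vbios(struct list_head *queue, int nvecs)
{
	struct vhost_blk_io *merged, *entry;
	int i = 0;

	merged = allocate_vbio(nvecs);	/* allocates room for nvecs iovecs */
	if (!merged)
		return NULL;

	list_for_each_entry(entry, queue, list) {
		memcpy(&merged->iov[i], entry->iov,
		       entry->nvecs * sizeof(struct iovec));
		i += entry->nvecs;
	}
	merged->nvecs = nvecs;
	return merged;
}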

 It does.  I suggest using fio O_DIRECT random access patterns to avoid 
 such issues.

Well, I am not trying to come up with a test case where vhost-blk
performs better than virtio-blk. I am trying to understand where
and why vhost-blk performs worse than virtio-blk.


Thanks,
Badari




Re: [RFC] vhost-blk implementation

2010-03-25 Thread Badari Pulavarty

Avi Kivity wrote:

On 03/24/2010 10:22 PM, Badari Pulavarty wrote:

Which caching mode is this?  I assume data=writeback, because otherwise
you'd be doing synchronous I/O directly from the handler.



Yes. This is with the default (writeback) cache model. As mentioned
earlier, readahead is helping here, and in most cases the data would be
ready in the page cache.



btw, relying on readahead is problematic.  The guest already issues 
readahead requests, and the host will issue readahead based on that 
readahead, reading far into the future.


Sure. In my tests, I did O_DIRECT in the guest, so there is no
readahead in the guest.


Thanks,
Badari



Re: [RFC] vhost-blk implementation

2010-03-24 Thread Badari Pulavarty

Christoph Hellwig wrote:
Inspired by vhost-net implementation, I did initial prototype 
of vhost-blk to see if it provides any benefits over QEMU virtio-blk.

I haven't handled all the error cases, fixed naming conventions etc.,
but the implementation is stable to play with. I tried not to deviate
from vhost-net implementation where possible.



Can you also send the qemu side of it?

  

with vhost-blk:


# time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
64+0 records in
64+0 records out
8388608 bytes (84 GB) copied, 126.135 seconds, 665 MB/s

real2m6.137s
user0m0.281s
sys 0m14.725s

without vhost-blk: (virtio)
---

# time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
64+0 records in
64+0 records out
8388608 bytes (84 GB) copied, 275.466 seconds, 305 MB/s

real4m35.468s
user0m0.373s
sys 0m48.074s



Which caching mode is this?  I assume data=writeback, because otherwise
you'd be doing synchronous I/O directly from the handler.
  


Yes. This is with the default (writeback) cache model. As mentioned earlier,
readahead is helping here, and in most cases the data would be ready in the
page cache.
and most cases, data would be ready in the pagecache.
  

+static int do_handle_io(struct file *file, uint32_t type, uint64_t sector,
+   struct iovec *iov, int in)
+{
+   loff_t pos = sector << 8;
+   int ret = 0;
+
+   if (type & VIRTIO_BLK_T_FLUSH)  {
+   ret = vfs_fsync(file, file->f_path.dentry, 1);
+   } else if (type & VIRTIO_BLK_T_OUT) {
+   ret = vfs_writev(file, iov, in, &pos);
+   } else {
+   ret = vfs_readv(file, iov, in, &pos);
+   }
+   return ret;



I have to admit I don't understand the vhost architecture at all, but
where do the actual data pointers used by the iovecs reside?
vfs_readv/writev expect both the iovec itself and the buffers
pointed to by it to reside in userspace, so just using kernel buffers
here will break badly on architectures with different user/kernel
mappings.  A lot of this is fixable using simple set_fs & co tricks,
but for direct I/O which uses get_user_pages even that will fail badly.
  
The iovecs and buffers are user-space pointers (from the host kernel point
of view). They are guest addresses. So, I don't need to do any set_fs
tricks.

Also it seems like you're doing all the I/O synchronous here?  For
data=writeback operations that could explain the read speedup
as you're avoiding context switches, but for actual write I/O
which has to get data to disk (either directly from vfs_writev or
later through vfs_fsync) this seems like a really bad idea stealing
a lot of guest time that should happen in the background.
  
Yes. QEMU virtio-blk is batching up all the writes and handing off the
work to another thread. When the writes() are complete, it sends a status
completion. Since I am doing everything synchronously (even though it is a
write to the page cache), one request at a time, that explains the slowdown.
We need to find a way to

1) batch IO writes together
2) hand off to another thread to do the IO, so that the vhost thread can
   handle the next set of requests
3) update the status on the completion

What should I do here? I can create a bunch of kernel threads to do
the IO for me, or somehow fit in and reuse the AIO io_submit() mechanism.
What's the best way here?

I hate to duplicate all the code VFS is doing.
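One possible shape for the handoff, mirroring the workqueue approach that
the 2010-03-29 revision shown earlier in this archive ends up using
(simplified and illustrative, not the actual patch):

/* Package the request and push it to a workqueue, so the vhost thread can
 * go back to pulling the next descriptors while the (possibly blocking)
 * vfs_writev()/vfs_readv() runs in worker context. */
static int handoff_io(struct vhost_blk *blk, int head, uint32_t type,
		      uint64_t sector, struct iovec *iov, int nvecs)
{
	struct vhost_blk_io *vbio;

	vbio = kmalloc(sizeof(*vbio), GFP_KERNEL);
	if (!vbio)
		return -ENOMEM;

	INIT_WORK(&vbio->work, handle_io_work);	/* handle_io_work() does the
						 * VFS IO and signals completion */
	vbio->blk = blk;
	vbio->head = head;
	vbio->type = type;
	vbio->sector = sector;
	vbio->iov = iov;
	vbio->nvecs = nvecs;

	queue_work(vblk_workqueue, &vbio->work);
	return 0;
}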


Other than that the code seems quite nice and simple, but one huge
problem is that it'll only support raw images, and thus misses out
on all the nice image formats used in qemu deployments, especially
qcow2.  It's also missing the ioctl magic we're having in various
places, both for controlling host devices like cdroms and SG
passthrough.
  
True... unfortunately, I don't understand all of those (qcow2) details
yet!! I need to read up on them to even make a comment :(

Thanks,
Badari




Re: [RFC] vhost-blk implementation

2010-03-24 Thread Badari Pulavarty

Christoph Hellwig wrote:
Inspired by vhost-net implementation, I did initial prototype 
of vhost-blk to see if it provides any benefits over QEMU virtio-blk.

I haven't handled all the error cases, fixed naming conventions etc.,
but the implementation is stable to play with. I tried not to deviate
from vhost-net implementation where possible.



Can you also send the qemu side of it?
  
It's pretty hacky and based on an old (vhost-net) patch from MST, for
simplicity. I haven't focused on cleaning it up, and I will re-base it on
MST's latest code once it gets into QEMU.

Thanks,
Badari

---
hw/virtio-blk.c |  199 
1 file changed, 199 insertions(+)

Index: vhost/hw/virtio-blk.c
===
--- vhost.orig/hw/virtio-blk.c  2010-02-25 16:47:04.0 -0500
+++ vhost/hw/virtio-blk.c   2010-03-17 14:07:26.477430740 -0400
@@ -18,6 +18,7 @@
#ifdef __linux__
# include <scsi/sg.h>
#endif
+#include "kvm.h"

typedef struct VirtIOBlock
{
@@ -28,8 +29,13 @@
char serial_str[BLOCK_SERIAL_STRLEN + 1];
QEMUBH *bh;
size_t config_size;
+uint8_t vhost_started;
} VirtIOBlock;

+typedef struct BDRVRawState {
+int fd;
+} BDRVRawState;
+
static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
{
return (VirtIOBlock *)vdev;
@@ -501,6 +507,198 @@
return 0;
}

+#if 1
+#include <linux/vhost.h>
+#include <sys/ioctl.h>
+#include <sys/eventfd.h>
+#include "vhost.h"
+
+int vhost_blk_fd;
+
+struct slot_info {
+unsigned long phys_addr;
+unsigned long len;
+unsigned long userspace_addr;
+unsigned flags;
+int logging_count;
+};
+
+extern struct slot_info slots[KVM_MAX_NUM_MEM_REGIONS];
+
+static int vhost_blk_start(struct VirtIODevice *vdev)
+{
+   target_phys_addr_t s, l, a;
+   int r, num, idx = 0;
+   struct vhost_vring_state state;
+   struct vhost_vring_file file;
+   struct vhost_vring_addr addr;
+   unsigned long long used_phys;
+   void *desc, *avail, *used;
+   int i, n =0;
+   struct VirtQueue *q = virtio_queue(vdev, idx);
+   VirtIOBlock *vb = to_virtio_blk(vdev);
+   struct vhost_memory *mem;
+   BDRVRawState *st = vb->bs->opaque;
+
+   vhost_blk_fd = open("/dev/vhost-blk", O_RDWR);
+   if (vhost_blk_fd < 0) {
+   fprintf(stderr, "unable to open vhost-blk\n");
+   return -errno;
+   }
+
+   r = ioctl(vhost_blk_fd, VHOST_SET_OWNER, NULL);
+    if (r < 0) {
+   fprintf(stderr, "ioctl VHOST_SET_OWNER failed\n");
+        return -errno;
+   }
+
+    for (i = 0; i < KVM_MAX_NUM_MEM_REGIONS; ++i) {
+        if (!slots[i].len ||
+       (slots[i].flags & KVM_MEM_LOG_DIRTY_PAGES)) {
+          continue;
+        }
+        ++n;
+    }
+
+    mem = qemu_mallocz(offsetof(struct vhost_memory, regions) +
+                       n * sizeof(struct vhost_memory_region));
+    if (!mem)
+        return -ENOMEM;
+
+    mem->nregions = n;
+    n = 0;
+    for (i = 0; i < KVM_MAX_NUM_MEM_REGIONS; ++i) {
+        if (!slots[i].len || (slots[i].flags &
+   KVM_MEM_LOG_DIRTY_PAGES)) {
+            continue;
+        }
+        mem->regions[n].guest_phys_addr = slots[i].phys_addr;
+        mem->regions[n].memory_size = slots[i].len;
+        mem->regions[n].userspace_addr = slots[i].userspace_addr;
+        ++n;
+    }
+
+    r = ioctl(vhost_blk_fd, VHOST_SET_MEM_TABLE, mem);
+    if (r < 0)
+        return -errno;
+
+   state.index = idx;
+   num = state.num = virtio_queue_get_num(vdev, idx);
+   r = ioctl(vhost_blk_fd, VHOST_SET_VRING_NUM, &state);
+    if (r) {
+   fprintf(stderr, "ioctl VHOST_SET_VRING_NUM failed\n");
+        return -errno;
+    }
+
+   state.num = virtio_queue_last_avail_idx(vdev, idx);
+   r = ioctl(vhost_blk_fd, VHOST_SET_VRING_BASE, &state);
+   if (r) {
+   fprintf(stderr, "ioctl VHOST_SET_VRING_BASE failed\n");
+        return -errno;
+   }
+
+   s = l = sizeof(struct vring_desc) * num;
+   a = virtio_queue_get_desc(vdev, idx);
+   desc = cpu_physical_memory_map(a, &l, 0);
+   if (!desc || l != s) {
+        r = -ENOMEM;
+        goto fail_alloc;
+   }
+   s = l = offsetof(struct vring_avail, ring) +
+        sizeof(u_int64_t) * num;
+    a = virtio_queue_get_avail(vdev, idx);
+    avail = cpu_physical_memory_map(a, &l, 0);
+    if (!avail || l != s) {
+        r = -ENOMEM;
+        goto fail_alloc;
+    }
+    s = l = offsetof(struct vring_used, ring) +
+        sizeof(struct vring_used_elem) * num;
+    used_phys = a = virtio_queue_get_used(vdev, idx);
+used = 

Re: [RFC] vhost-blk implementation

2010-03-23 Thread Badari Pulavarty

Avi Kivity wrote:

On 03/23/2010 04:50 AM, Badari Pulavarty wrote:

Anthony Liguori wrote:

On 03/22/2010 08:45 PM, Badari Pulavarty wrote:

Anthony Liguori wrote:

On 03/22/2010 08:00 PM, Badari Pulavarty wrote:

Forgot to CC: KVM list earlier



These virtio results are still with a 2.6.18 kernel with no aio, 
right?

Results are on 2.6.33-rc8-net-next kernel. But not using AIO.


Can you try with aio and cache=none?


aio-native,cache=none

# time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
64+0 records in
64+0 records out
8388608 bytes (84 GB) copied, 470.588 seconds, 170 MB/s



You're effectively killing readahead with this.  


Yes. cache=none, aio=native will force it to hit the disk all the time. We
lose all the benefits of readahead.

Please use fio with a sequential pattern, queue depth > 1.


Will do.

Thanks,
Badari



Re: [RFC] vhost-blk implementation

2010-03-23 Thread Badari Pulavarty

Avi Kivity wrote:

On 03/23/2010 03:00 AM, Badari Pulavarty wrote:

Forgot to CC: KVM list earlier
  
[RFC] vhost-blk implementation.eml


Subject:
[RFC] vhost-blk implementation
From:
Badari Pulavarty pbad...@us.ibm.com
Date:
Mon, 22 Mar 2010 17:34:06 -0700

To:
virtualizat...@lists.linux-foundation.org, qemu-de...@nongnu.org


Hi,

Inspired by vhost-net implementation, I did initial prototype
of vhost-blk to see if it provides any benefits over QEMU virtio-blk.
I haven't handled all the error cases, fixed naming conventions etc.,
but the implementation is stable to play with. I tried not to deviate
from vhost-net implementation where possible.

NOTE:  Only change I had to make to vhost core code is to
increase VHOST_NET_MAX_SG to 130 (128+2) in vhost.h

Performance:
=

I have done simple tests to see how it performs. I got very
encouraging results on sequential read tests. But on sequential
write tests, I see a degradation relative to virtio-blk. I can't figure out
or explain why. Can someone shed light on what's happening here?

Read Results:
=
The test reads an 84GB file from the host (through virtio). I unmount
and mount the filesystem on the host to make sure there is nothing
in the page cache.

+#define VHOST_BLK_VQ_MAX 1
+
+struct vhost_blk {
+	struct vhost_dev dev;
+	struct vhost_virtqueue vqs[VHOST_BLK_VQ_MAX];
+	struct vhost_poll poll[VHOST_BLK_VQ_MAX];
+};
+
+static int do_handle_io(struct file *file, uint32_t type, uint64_t sector,
+			struct iovec *iov, int in)
+{
+	loff_t pos = sector << 8;
+	int ret = 0;
+
+	if (type & VIRTIO_BLK_T_FLUSH)  {
+		ret = vfs_fsync(file, file->f_path.dentry, 1);
+	} else if (type & VIRTIO_BLK_T_OUT) {
+		ret = vfs_writev(file, iov, in, &pos);
+	} else {
+		ret = vfs_readv(file, iov, in, &pos);
+	}
+	return ret;
+}


This should be done asynchronously.  That is likely the cause of the write
performance degradation.  For reads, readahead means that you're
async anyway, but writes/syncs are still synchronous.
I am not sure what you mean by async here. Even if I use
f_op->aio_write(), it's still synchronous (except for DIO). Since we are
writing to the page cache and not waiting for the write() to complete,
this is the best we can do here.

Do you mean offloading write() handling to another thread?
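
To make the offload idea concrete, here is a minimal sketch of pushing writes onto a workqueue so the vring-handling path never blocks in vfs_writev(). This is not part of the posted patch; struct blk_io_work, vhost_blk_wq and queue_write() are invented for the example:

/* Sketch only: per-request work item so writes complete off the vring thread. */
struct blk_io_work {
	struct work_struct work;
	struct vhost_blk *blk;
	struct file *file;
	loff_t pos;
	struct iovec *iov;
	int nvecs;
	unsigned head;		/* descriptor head to return via the used ring */
};

static struct workqueue_struct *vhost_blk_wq;	/* hypothetical, created at module init */

static void vhost_blk_write_fn(struct work_struct *w)
{
	struct blk_io_work *io = container_of(w, struct blk_io_work, work);

	use_mm(io->blk->dev.mm);
	vfs_writev(io->file, io->iov, io->nvecs, &io->pos);
	unuse_mm(io->blk->dev.mm);

	/* Completion would update the status byte and the used ring here,
	 * taking vq->mutex and using io->head, then vhost_signal() the guest. */
	kfree(io);
}

/* In handle_blk(), instead of calling vfs_writev() inline: */
static void queue_write(struct vhost_blk *blk, struct file *file, loff_t pos,
			struct iovec *iov, int nvecs, unsigned head)
{
	struct blk_io_work *io = kmalloc(sizeof(*io), GFP_KERNEL);

	if (!io)
		return;		/* real code would fail the request instead */
	INIT_WORK(&io->work, vhost_blk_write_fn);
	io->blk = blk;
	io->file = file;
	io->pos = pos;
	io->iov = iov;		/* real code must copy the iovecs out of vq->iov */
	io->nvecs = nvecs;
	io->head = head;
	queue_work(vhost_blk_wq, &io->work);
}

Whether the iovec copies and extra wakeups pay off compared to the synchronous path is exactly what the write numbers in this thread would need to answer.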


I also think it should be done at the bio layer.

I am not sure what you meant here. Do you want to do submit_bio()
directly? It's not going to be that simple. Since the sector# is an offset
within the file, one has to do getblocks() on it to find the real disk
block numbers, and we have to do get_user_pages() on these iovecs before
submitting them as a bio. All of this work is done by
vfs_write()/vfs_read() anyway. I am not sure what you are suggesting here.

File I/O is going to be slower; if we do vhost-blk, we should
concentrate on maximum performance.  The block layer also exposes more
functionality we can use (asynchronous barriers, for example).
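
For the raw-block-device case (where no getblocks() mapping is needed because the backing store is the device itself), here is a rough sketch of a bio-based submission using the block API of that era (~2.6.33); none of this is from the posted patch, and submit_guest_buffer() is an invented name:

/* Sketch only: submit one guest buffer as a bio against a raw block device.
 * Assumes <linux/bio.h>, <linux/blkdev.h>, <linux/mm.h>, <linux/fs.h>. */
static void blk_io_end(struct bio *bio, int error)
{
	/* Runs in interrupt context: record the status, release the pinned
	 * pages, then kick whatever fills in the used ring. */
	bio_put(bio);
}

static int submit_guest_buffer(struct block_device *bdev, int rw,
			       sector_t sector, unsigned long uaddr,
			       unsigned long len)
{
	unsigned offset = uaddr & ~PAGE_MASK;
	int nr_pages = (offset + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
	struct page **pages;
	struct bio *bio;
	int i, got;

	pages = kmalloc(nr_pages * sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* Pin the guest (user) pages backing the iovec. */
	got = get_user_pages_fast(uaddr & PAGE_MASK, nr_pages, rw == READ, pages);
	if (got != nr_pages)
		goto out;

	bio = bio_alloc(GFP_KERNEL, nr_pages);
	if (!bio)
		goto out;
	bio->bi_sector = sector;
	bio->bi_bdev = bdev;
	bio->bi_end_io = blk_io_end;

	for (i = 0; i < nr_pages; i++) {
		unsigned bytes = min_t(unsigned long, len, PAGE_SIZE - offset);

		if (!bio_add_page(bio, pages[i], bytes, offset))
			break;		/* real code would chain another bio */
		len -= bytes;
		offset = 0;
	}

	submit_bio(rw, bio);
	return 0;
out:
	/* Real code would also release any pages already pinned. */
	kfree(pages);
	return -EFAULT;
}

For a file-backed image, a getblocks()-style mapping (or staying on the VFS path) would still be needed, which is exactly the complication raised above.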




btw, for fairness, cpu measurements should be done from the host side 
and include the vhost thread.



Will do.

Thanks,
Badari



[RFC] vhost-blk implementation

2010-03-22 Thread Badari Pulavarty
Forgot to CC: KVM list earlier
---BeginMessage---
Hi,

Inspired by the vhost-net implementation, I did an initial prototype
of vhost-blk to see if it provides any benefits over QEMU virtio-blk.
I haven't handled all the error cases, fixed naming conventions, etc.,
but the implementation is stable enough to play with. I tried not to
deviate from the vhost-net implementation where possible.

NOTE: The only change I had to make to the vhost core code is to
increase VHOST_NET_MAX_SG to 130 (128+2) in vhost.h.
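
For reference, the change described amounts to bumping a single constant; a sketch of what it would look like in drivers/vhost/vhost.h (the constant's original form is not shown in this post, so treat this as illustrative):

/* Sketch: enough scatter-gather entries for one virtio-blk request:
 * 128 data segments plus the request header and the status byte. */
#define VHOST_NET_MAX_SG	130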

Performance:
=

I have done simple tests to see how it performs. I got very
encouraging results on sequential read tests. But on sequential
write tests, I see a degradation compared to virtio-blk. I can't figure
out or explain why. Can someone shed light on what's happening here?

Read Results:
=
The test reads an 84GB file from the host (through virtio). I unmount
and mount the filesystem on the host to make sure there is nothing
in the page cache.


with vhost-blk:


# time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
64+0 records in
64+0 records out
8388608 bytes (84 GB) copied, 126.135 seconds, 665 MB/s

real2m6.137s
user0m0.281s
sys 0m14.725s

without vhost-blk: (virtio)
---

# time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
64+0 records in
64+0 records out
8388608 bytes (84 GB) copied, 275.466 seconds, 305 MB/s

real4m35.468s
user0m0.373s
sys 0m48.074s



Write Results:
==

I see degraded IO performance when doing sequential IO write
tests with vhost-blk compared to virtio-blk.

# time dd of=/dev/vda if=/dev/zero bs=2M oflag=direct

I get ~110MB/sec with virtio-blk, but only ~60MB/sec with
vhost-blk. Wondering why?

Comments/flames ? 

Thanks,
Badari


vhost-blk is an in-kernel accelerator for virtio-blk.
At this time, this is a prototype based on vhost-net.
Lots of error handling and cleanup still needs to be done.
Read performance is a clear improvement over QEMU virtio-blk, but
write performance is nowhere close to QEMU virtio-blk.
Why?

Signed-off-by: Badari Pulavarty pbad...@us.ibm.com
---
 drivers/vhost/blk.c |  242 
 1 file changed, 242 insertions(+)

Index: net-next/drivers/vhost/blk.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ net-next/drivers/vhost/blk.c2010-03-22 18:07:18.156584400 -0400
@@ -0,0 +1,242 @@
+ /*
+  * virtio-block server in host kernel.
+  * Inspired by vhost-net and shamelessly ripped code from it :)
+  */
+
+#include <linux/compat.h>
+#include <linux/eventfd.h>
+#include <linux/vhost.h>
+#include <linux/virtio_net.h>
+#include <linux/virtio_blk.h>
+#include <linux/mmu_context.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/rcupdate.h>
+#include <linux/file.h>
+
+#include "vhost.h"
+
+#define VHOST_BLK_VQ_MAX 1
+
+struct vhost_blk {
+	struct vhost_dev dev;
+	struct vhost_virtqueue vqs[VHOST_BLK_VQ_MAX];
+	struct vhost_poll poll[VHOST_BLK_VQ_MAX];
+};
+
+/* Service one guest request synchronously through the VFS. */
+static int do_handle_io(struct file *file, uint32_t type, uint64_t sector,
+			struct iovec *iov, int in)
+{
+	loff_t pos = sector << 8;
+	int ret = 0;
+
+	if (type & VIRTIO_BLK_T_FLUSH)  {
+		ret = vfs_fsync(file, file->f_path.dentry, 1);
+	} else if (type & VIRTIO_BLK_T_OUT) {
+		ret = vfs_writev(file, iov, in, &pos);
+	} else {
+		ret = vfs_readv(file, iov, in, &pos);
+	}
+	return ret;
+}
+
+static void handle_blk(struct vhost_blk *blk)
+{
+	struct vhost_virtqueue *vq = &blk->dev.vqs[0];
+	unsigned head, out, in;
+	struct virtio_blk_outhdr hdr;
+	int r, nvecs;
+	uint8_t status = 0;
+
+	use_mm(blk->dev.mm);
+	mutex_lock(&vq->mutex);
+
+	vhost_disable_notify(vq);
+
+	for (;;) {
+		/* Pull the next available descriptor chain off the vring. */
+		head = vhost_get_vq_desc(&blk->dev, vq, vq->iov,
+					 ARRAY_SIZE(vq->iov),
+					 &out, &in, NULL, NULL);
+		if (head == vq->num) {
+			if (unlikely(vhost_enable_notify(vq))) {
+				vhost_disable_notify(vq);
+				continue;
+			}
+			break;
+		}
+
+		BUG_ON(vq->iov[0].iov_len != 16);
+
+		/* The first iovec carries the virtio_blk request header. */
+		r = copy_from_user(&hdr, vq->iov[0].iov_base, sizeof hdr);
+		if (r < 0) {
+			printk("copy from user failed\n");
+			vhost_discard_vq_desc(vq);
+			break;
+		}
+
+		nvecs = out - 1;
+		if (hdr.type == VIRTIO_BLK_T_IN)
+			nvecs = in - 1;
+
+		r = do_handle_io(vq->private_data, hdr.type, hdr.sector, &vq->iov[1], nvecs
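
As a rough guide to how such a request loop typically completes (using the vhost core helpers of that era, vhost_add_used() and vhost_signal()), a sketch of the remaining steps; this is illustrative, not the remainder of the posted patch:

		/* Sketch only: write the status byte back into the last iovec,
		 * return the descriptor chain to the guest, and finish up. */
		status = (r < 0) ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
		if (copy_to_user(vq->iov[out + in - 1].iov_base, &status,
				 sizeof status)) {
			vhost_discard_vq_desc(vq);
			break;
		}
		vhost_add_used(vq, head, 0);
	}

	mutex_unlock(&vq->mutex);
	unuse_mm(blk->dev.mm);
	vhost_signal(&blk->dev, vq);
}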

Re: [RFC] vhost-blk implementation

2010-03-22 Thread Badari Pulavarty

Anthony Liguori wrote:

On 03/22/2010 08:00 PM, Badari Pulavarty wrote:

Forgot to CC: KVM list earlier
   



These virtio results are still with a 2.6.18 kernel with no aio, right?

Results are on 2.6.33-rc8-net-next kernel. But not using AIO.

Thanks,
Badari




Re: [RFC] vhost-blk implementation

2010-03-22 Thread Badari Pulavarty

Anthony Liguori wrote:

On 03/22/2010 08:45 PM, Badari Pulavarty wrote:

Anthony Liguori wrote:

On 03/22/2010 08:00 PM, Badari Pulavarty wrote:

Forgot to CC: KVM list earlier



These virtio results are still with a 2.6.18 kernel with no aio, right?

Results are on 2.6.33-rc8-net-next kernel. But not using AIO.


Can you try with aio and cache=none?


aio=native, cache=none

# time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
64+0 records in
64+0 records out
8388608 bytes (84 GB) copied, 470.588 seconds, 170 MB/s

Thanks,
Badari



