Re: [Qemu-devel] [PATCH 6/9] vhost-scsi: new device supporting the tcm_vhost Linux kernel module
On 05/29/2013 10:36 PM, Nicholas A. Bellinger wrote: On Wed, 2013-05-29 at 21:29 -0700, Nicholas A. Bellinger wrote: On Thu, 2013-05-30 at 06:17 +0800, Asias He wrote: On Wed, May 29, 2013 at 08:10:44AM -0700, Badari Pulavarty wrote: On 05/29/2013 02:05 AM, Wenchao Xia wrote: On 2013-5-28 17:00, Wenchao Xia wrote:

SNIP

I have done a basic test of vhost-scsi; following is the result I'd like to post. Generally it seems fine:

Result:
fdisk/mkfs: fdisk can find it, mke2fs works fine.
mount: can mount it.
file I/O: dd of 90M of zeroes to a file on that disk succeeded. I tried without nested kvm.

Issues:
1) in fdisk -l, it sometimes times out, with dmesg showing "end_request: I/O error, dev fd0, sector 0". I guess it is caused by nested KVM that failed to kick the host kernel?

I don't see this issue. Are you sure fd0 is actually the scsi device? What is fd0?

2) in fdisk -l, it shows 512 bytes larger than the fd_dev_size parameter I specified in configfs on the host (shows 104858112 bytes; see the invocation script below).

I see the same. For some reason fdisk -l in the VM shows 512 bytes more than the actual size of the file (on the host).

Hmm, interesting. Will look into it. Nick, any ideas here?

Mmm, fd_get_blocks() is not returning the expected minus-one logical block with a !S_ISBLK() setup. This is happening for every other backend ->get_blocks() call already, and should be happening for the fd_dev_size case as well. Applying the following to target-pending.git now.

Actually sorry, that last patch is not correct.. Here's a better one to properly set fd_dev->fd_block_size at configure time, and use dev_attrib.block_size in fd_get_blocks() to allow for user-defined block_sizes.
Thanks, --nab

commit 9e309f9307fe644dee8718980bfcb77de91ce38e
Author: Nicholas Bellinger n...@linux-iscsi.org
Date: Wed May 29 21:35:23 2013 -0700

    target/file: Fix off-by-one READ_CAPACITY bug for !S_ISBLK export

    This patch fixes a bug where FILEIO was incorrectly reporting the number
    of logical blocks (+ 1) when using non struct block_device export mode.

    It changes fd_get_blocks() to follow all other backend ->get_blocks()
    cases, and reduces the calculated dev_size by one dev->dev_attrib.block_size
    number of bytes, and fixes the initial block_size assignment within
    fd_configure_device()

    Reported-by: Wenchao Xia xiaw...@linux.vnet.ibm.com
    Reported-by: Badari Pulavarty pbad...@us.ibm.com
    Signed-off-by: Nicholas Bellinger n...@linux-iscsi.org

diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index 1b1d544..b11890d 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -153,6 +153,7 @@ static int fd_configure_device(struct se_device *dev)
 		struct request_queue *q = bdev_get_queue(inode->i_bdev);
 		unsigned long long dev_size;
 
+		fd_dev->fd_block_size = bdev_logical_block_size(inode->i_bdev);
 		/*
 		 * Determine the number of bytes from i_size_read() minus
 		 * one (1) logical sector from underlying struct block_device
@@ -199,6 +200,7 @@ static int fd_configure_device(struct se_device *dev)
 			goto fail;
 		}
 
+		fd_dev->fd_block_size = FD_BLOCKSIZE;
 		/*
 		 * Limit UNMAP emulation to 8k Number of LBAs (NoLB)
 		 */
@@ -217,9 +219,7 @@ static int fd_configure_device(struct se_device *dev)
 		dev->dev_attrib.max_write_same_len = 0x1000;
 	}
 
-	fd_dev->fd_block_size = dev->dev_attrib.hw_block_size;
-
-	dev->dev_attrib.hw_block_size = FD_BLOCKSIZE;
+	dev->dev_attrib.hw_block_size = fd_dev->fd_block_size;
 	dev->dev_attrib.hw_max_sectors = FD_MAX_SECTORS;
 	dev->dev_attrib.hw_queue_depth = FD_MAX_DEVICE_QUEUE_DEPTH;
 
@@ -694,11 +694,12 @@ static sector_t fd_get_blocks(struct se_device *dev)
 	 * to handle underlying block_device resize operations.
 	 */
 	if (S_ISBLK(i->i_mode))
-		dev_size = (i_size_read(i) - fd_dev->fd_block_size);
+		dev_size = i_size_read(i);
 	else
 		dev_size = fd_dev->fd_dev_size;
 
-	return div_u64(dev_size, dev->dev_attrib.block_size);
+	return div_u64(dev_size - dev->dev_attrib.block_size,
+		       dev->dev_attrib.block_size);
 }
 
 static struct sbc_ops fd_sbc_ops = {

Verified and it shows the correct value now.

Thanks, Badari
Re: [Qemu-devel] [PATCH 6/9] vhost-scsi: new device supporting the tcm_vhost Linux kernel module
On 05/29/2013 02:05 AM, Wenchao Xia wrote: On 2013-5-28 17:00, Wenchao Xia wrote: On 2013-5-28 16:33, Asias He wrote: On Tue, May 28, 2013 at 10:01:14AM +0200, Paolo Bonzini wrote: On 28/05/2013 09:13, Wenchao Xia wrote:

From: Nicholas Bellinger n...@linux-iscsi.org

The WWPN specified in configfs is passed to -device vhost-scsi-pci. The tgpt field of the SET_ENDPOINT ioctl is obsolete now, so it is not available from the QEMU command-line. Instead, I hardcode it to zero.

Hi, Paolo. Is there any documentation about how to configure it correctly in configfs, before invoking qemu with the WWPN number?

Unfortunately no, but vhost-scsi doesn't have many knobs (unlike iSCSI for example) so it's quite simple. Here is an example:

cd /sys/kernel/config/target
mkdir -p core/fileio_0/fileio
echo 'fd_dev_name=/home/pbonzini/test.img,fd_dev_size=5905580032' > core/fileio_0/fileio/control
echo 1 > core/fileio_0/fileio/enable
mkdir -p vhost/naa.600140554cf3a18e/tpgt_0/lun/lun_0
cd vhost/naa.600140554cf3a18e/tpgt_0
ln -sf ../../../../../core/fileio_0/fileio/ lun/lun_0/virtual_scsi_port
echo naa.60014053226f0388 > nexus

The nexus value is the initiator WWN. naa.600140554cf3a18e is the target WWN that you have to pass to -device vhost-scsi-pci.

Paolo

For me, I always use the targetcli utils instead of the sysfs interface. targetcli in F18 has vhost support now.

Thanks very much for the above information, I'll try it for a test.

I have done a basic test of vhost-scsi; following is the result I'd like to post. Generally it seems fine:

Result:
fdisk/mkfs: fdisk can find it, mke2fs works fine.
mount: can mount it.
file I/O: dd of 90M of zeroes to a file on that disk succeeded. I tried without nested kvm.

Issues:
1) in fdisk -l, it sometimes times out, with dmesg showing "end_request: I/O error, dev fd0, sector 0". I guess it is caused by nested KVM that failed to kick the host kernel?

I don't see this issue. Are you sure fd0 is actually the scsi device? What is fd0?
2) in fdisk -l, it shows 512 bytes larger than the fd_dev_size parameter I specified in configfs on the host (shows 104858112 bytes; see the invocation script below).

I see the same. For some reason fdisk -l in the VM shows 512 bytes more than the actual size of the file (on the host).

Thanks, Badari
Re: [Qemu-devel] qemu seabios issue with vhost-scsi
On 05/23/2013 06:32 AM, Stefan Hajnoczi wrote: On Thu, May 23, 2013 at 11:48 AM, Gleb Natapov g...@redhat.com wrote: On Thu, May 23, 2013 at 08:53:55AM +0800, Asias He wrote: On Wed, May 22, 2013 at 05:36:08PM -0700, Badari wrote:

Hi,

While testing vhost-scsi in the current qemu git, I ran into an earlier issue with seabios. I had to disable scsi support in seabios to get it working. I was hoping this issue got resolved when vhost-scsi support got merged into qemu. Is this still being worked on?

Hmm, can you try seabios.git? Not sure if the seabios shipped with qemu picked up the fixes for vhost-scsi.

Nothing in seabios should crash qemu.

Agreed. I'm pretty sure it's the scenario that I posted in my other reply to this thread. The common virtio-scsi code in QEMU should guard against this. In virtio-blk data plane I hit a similar case and ended up starting the data plane thread (equivalent to vhost here) *before* the status register is set to DRIVER_OK.

That's exactly what my debug in vhost_scsi_set_status() shows:

set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 3

Program received signal SIGSEGV, Segmentation fault.

We never got a chance to call vhost_scsi_start() as we are waiting for DRIVER_OK.

Thanks, Badari
Re: [Qemu-devel] qemu seabios issue with vhost-scsi
On 05/23/2013 07:58 AM, Paolo Bonzini wrote: On 23/05/2013 16:48, Badari Pulavarty wrote:

The common virtio-scsi code in QEMU should guard against this. In virtio-blk data plane I hit a similar case and ended up starting the data plane thread (equivalent to vhost here) *before* the status register is set to DRIVER_OK.

That's exactly what my debug in vhost_scsi_set_status() shows:

set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 3

Program received signal SIGSEGV, Segmentation fault.

We never got a chance to call vhost_scsi_start() as we are waiting for DRIVER_OK.

This is the fix in SeaBIOS:

commit 5a7730db57ab0715223421e65b54fb50d6fefe5c
Author: Asias He as...@redhat.com
Date: Fri Mar 15 09:45:15 2013 +0800

    virtio-scsi: Set _DRIVER_OK flag before scsi target scanning

    Before we start scsi target scanning, we need to set the
    VIRTIO_CONFIG_S_DRIVER_OK flag so the device can do setup properly.

    This fixes a bug when booting tcm_vhost with seabios.

    Signed-off-by: Asias He as...@redhat.com
    Acked-by: Paolo Bonzini pbonz...@redhat.com

Still, Gleb is right that SeaBIOS should not be able to crash QEMU; exit(1) is fine, SIGSEGV is not.

Paolo

This fixed the issue and makes the guest boot.

Thanks, Badari
Re: [Qemu-devel] qemu seabios issue with vhost-scsi
On 05/23/2013 08:30 AM, Paolo Bonzini wrote: On 23/05/2013 17:27, Asias He wrote: On Thu, May 23, 2013 at 04:58:05PM +0200, Paolo Bonzini wrote: On 23/05/2013 16:48, Badari Pulavarty wrote:

The common virtio-scsi code in QEMU should guard against this. In virtio-blk data plane I hit a similar case and ended up starting the data plane thread (equivalent to vhost here) *before* the status register is set to DRIVER_OK.

That's exactly what my debug in vhost_scsi_set_status() shows:

set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 3

Program received signal SIGSEGV, Segmentation fault.

We never got a chance to call vhost_scsi_start() as we are waiting for DRIVER_OK.

Reproduced the SIGSEGV and verified that replacing the bios.bin with the one from seabios.git makes the guest boot.

This should fix it:

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 08dd3f3..3139355 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -266,7 +266,7 @@ fail:
 static void virtio_scsi_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
 {
-    VirtIOSCSI *s = (VirtIOSCSI *)vdev;
+    VirtIOSCSI *s = VIRTIO_SCSI(vdev);
     VirtIOSCSIReq *req;
 
     while ((req = virtio_scsi_pop_req(s, vq))) {
@@ -347,9 +347,8 @@ static void virtio_scsi_fail_cmd_req(VirtIOSCSIReq *req)
 static void virtio_scsi_handle_cmd(VirtIODevice *vdev, VirtQueue *vq)
 {
-    /* use non-QOM casts in the data path */
-    VirtIOSCSI *s = (VirtIOSCSI *)vdev;
-    VirtIOSCSICommon *vs = &s->parent_obj;
+    VirtIOSCSI *s = VIRTIO_SCSI(vdev);
+    VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(vdev);
     VirtIOSCSIReq *req;
     int n;

Paolo

Hmm.. Not quite..
(gdb) run -cpu qemu64 --enable-kvm -m 4096 -drive file=/var/lib/libvirt/images/lnx.img,if=ide,cache=writethrough -device vhost-scsi-pci,wwpn=naa.6001405bd4e8476d,event_idx=off -vnc :10 -boot d
Starting program: /root/qemu/x86_64-softmmu/qemu-system-x86_64 -cpu qemu64 --enable-kvm -m 4096 -drive file=/var/lib/libvirt/images/lnx.img,if=ide,cache=writethrough -device vhost-scsi-pci,wwpn=naa.6001405bd4e8476d,event_idx=off -vnc :10 -boot d
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x77ffa000
[Thread debugging using libthread_db enabled]
[New Thread 0x71c1c700 (LWP 2458)]
[New Thread 0x71239700 (LWP 2459)]
[New Thread 0x7fffeb7ff700 (LWP 2462)]
set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 3
/root/qemu/hw/scsi/virtio-scsi.c:356:virtio_scsi_handle_cmd: Object 0x565aca88 is not an instance of type virtio-scsi-device

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x71239700 (LWP 2459)]
0x75cf18a5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-gssapi-2.1.23-13.el6_3.1.x86_64 cyrus-sasl-lib-2.1.23-13.el6_3.1.x86_64 cyrus-sasl-md5-2.1.23-13.el6_3.1.x86_64 cyrus-sasl-plain-2.1.23-13.el6_3.1.x86_64 db4-4.7.25-17.el6.x86_64 glib2-2.22.5-7.el6.x86_64 glibc-2.12-1.107.el6.x86_64 gnutls-2.8.5-10.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6.x86_64 libcom_err-1.41.12-14.el6.x86_64 libcurl-7.19.7-35.el6.x86_64 libgcrypt-1.4.5-9.el6_2.2.x86_64 libgpg-error-1.7-4.el6.x86_64 libidn-1.18-2.el6.x86_64 libpng-1.2.49-1.el6_2.x86_64 libselinux-2.0.94-5.3.el6.x86_64 libssh2-1.4.2-1.el6.x86_64 libtasn1-2.3-3.el6_2.1.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64 nspr-4.9.2-1.el6.x86_64 nss-3.14.0.0-12.el6.x86_64 nss-softokn-freebl-3.12.9-11.el6.x86_64 nss-util-3.14.0.0-2.el6.x86_64 openldap-2.4.23-31.el6.x86_64 openssl-1.0.0-27.el6.x86_64 pixman-0.26.2-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x75cf18a5 in raise () from /lib64/libc.so.6
#1 0x75cf3085 in abort () from /lib64/libc.so.6
#2 0x557230d0 in object_dynamic_cast_assert (obj=0x565aca88, typename=0x558a56e5 "virtio-scsi-device", file=0x558bda30 "/root/qemu/hw/scsi/virtio-scsi.c", line=356, func=<value optimized out>) at qom/object.c:456
#3 0x557a5ef1 in virtio_scsi_handle_cmd (vdev=0x565aca88, vq=0x565d2160) at /root/qemu/hw/scsi/virtio-scsi.c:356
#4 0x557b3a60 in access_with_adjusted_size (addr=16, value=0x71238b78, size=2, access_size_min=<value optimized out>, access_size_max=<value optimized out>, access=0x557b51d0 <memory_region_write_accessor>, opaque=0x565ac940) at /root/qemu/memory.c:364
#5 0x557b408b in memory_region_iorange_write (iorange=<value optimized out>, offset=<value optimized out>, width=<value optimized out>, data=2) at /root/qemu/memory.c:439
#6 0x557b2ff6 in kvm_handle_io (env=0x56521af0) at /root/qemu/kvm-all.c:1485
#7 kvm_cpu_exec (env=0x56521af0) at /root/qemu/kvm-all.c:1634
#8
0x5576148e
Re: [Qemu-devel] qemu seabios issue with vhost-scsi
On 05/23/2013 09:19 AM, Paolo Bonzini wrote: On 23/05/2013 18:11, Badari Pulavarty wrote: On 05/23/2013 08:30 AM, Paolo Bonzini wrote: On 23/05/2013 17:27, Asias He wrote: On Thu, May 23, 2013 at 04:58:05PM +0200, Paolo Bonzini wrote: On 23/05/2013 16:48, Badari Pulavarty wrote:

The common virtio-scsi code in QEMU should guard against this. In virtio-blk data plane I hit a similar case and ended up starting the data plane thread (equivalent to vhost here) *before* the status register is set to DRIVER_OK.

That's exactly what my debug in vhost_scsi_set_status() shows:

set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 3

Program received signal SIGSEGV, Segmentation fault.

We never got a chance to call vhost_scsi_start() as we are waiting for DRIVER_OK.

Reproduced the SIGSEGV and verified that replacing the bios.bin with the one from seabios.git makes the guest boot.

This should fix it:

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 08dd3f3..3139355 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -266,7 +266,7 @@ fail:
 static void virtio_scsi_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
 {
-    VirtIOSCSI *s = (VirtIOSCSI *)vdev;
+    VirtIOSCSI *s = VIRTIO_SCSI(vdev);
     VirtIOSCSIReq *req;
 
     while ((req = virtio_scsi_pop_req(s, vq))) {
@@ -347,9 +347,8 @@ static void virtio_scsi_fail_cmd_req(VirtIOSCSIReq *req)
 static void virtio_scsi_handle_cmd(VirtIODevice *vdev, VirtQueue *vq)
 {
-    /* use non-QOM casts in the data path */
-    VirtIOSCSI *s = (VirtIOSCSI *)vdev;
-    VirtIOSCSICommon *vs = &s->parent_obj;
+    VirtIOSCSI *s = VIRTIO_SCSI(vdev);
+    VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(vdev);
     VirtIOSCSIReq *req;
     int n;

Paolo

Hmm.. Not quite..

If that is with the old SeaBIOS, then SIGABRT is intended. :) The guest is buggy; the problem in QEMU only lies in _how_ it fails.

Paolo

I am confused now.
Without the above changes, the seabios fix makes the guest boot. But with the above changes, I run into this (even with the seabios fix):

(gdb) run -L /root/bios2 -cpu qemu64 --enable-kvm -m 4096 -drive file=/var/lib/libvirt/images/lnx.img,if=ide,cache=writethrough -device vhost-scsi-pci,wwpn=naa.6001405bd4e8476d,event_idx=off -vnc :10 -boot d
Starting program: /root/qemu/x86_64-softmmu/qemu-system-x86_64 -L /root/bios2 -cpu qemu64 --enable-kvm -m 4096 -drive file=/var/lib/libvirt/images/lnx.img,if=ide,cache=writethrough -device vhost-scsi-pci,wwpn=naa.6001405bd4e8476d,event_idx=off -vnc :10 -boot d
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x77ffa000
[Thread debugging using libthread_db enabled]
[New Thread 0x71c1c700 (LWP 3299)]
[New Thread 0x71239700 (LWP 3300)]
[New Thread 0x7fffeb7ff700 (LWP 3303)]
set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 0
set status started 0 val 3
set status started 0 val 7
set status started 1 val 0
/root/qemu/hw/scsi/virtio-scsi.c:356:virtio_scsi_handle_cmd: Object 0x5659d918 is not an instance of type virtio-scsi-device

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x71239700 (LWP 3300)]
0x75cf18a5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-gssapi-2.1.23-13.el6_3.1.x86_64 cyrus-sasl-lib-2.1.23-13.el6_3.1.x86_64 cyrus-sasl-md5-2.1.23-13.el6_3.1.x86_64 cyrus-sasl-plain-2.1.23-13.el6_3.1.x86_64 db4-4.7.25-17.el6.x86_64 glib2-2.22.5-7.el6.x86_64 glibc-2.12-1.107.el6.x86_64 gnutls-2.8.5-10.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6.x86_64 libcom_err-1.41.12-14.el6.x86_64 libcurl-7.19.7-35.el6.x86_64 libgcrypt-1.4.5-9.el6_2.2.x86_64 libgpg-error-1.7-4.el6.x86_64 libidn-1.18-2.el6.x86_64 libpng-1.2.49-1.el6_2.x86_64 libselinux-2.0.94-5.3.el6.x86_64 libssh2-1.4.2-1.el6.x86_64 libtasn1-2.3-3.el6_2.1.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64 nspr-4.9.2-1.el6.x86_64 nss-3.14.0.0-12.el6.x86_64 nss-softokn-freebl-3.12.9-11.el6.x86_64 nss-util-3.14.0.0-2.el6.x86_64 openldap-2.4.23-31.el6.x86_64 openssl-1.0.0-27.el6.x86_64 pixman-0.26.2-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x75cf18a5 in raise () from /lib64/libc.so.6
#1 0x75cf3085 in abort () from /lib64/libc.so.6
#2 0x557230d0 in object_dynamic_cast_assert (obj=0x5659d918, typename=0x558a56e5 "virtio-scsi-device", file=0x558bda30 "/root/qemu/hw/scsi/virtio-scsi.c", line=356, func=<value optimized out>) at qom/object.c:456
#3 0x557a5ef1 in virtio_scsi_handle_cmd (vdev=0x5659d918, vq=0x565929d0) at /root/qemu/hw/scsi/virtio-scsi.c:356
#4 0x556e467d in virtio_pci_set_host_notifier_internal (proxy=0x5659d120, n=2, assign=false, set_handler=false) at hw/virtio/virtio-pci.c
Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS
On Fri, 2011-04-15 at 17:34 +0200, Christoph Hellwig wrote: On Fri, Apr 15, 2011 at 04:26:41PM +0100, Stefan Hajnoczi wrote: On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig h...@lst.de wrote:

NAK. Just wait for the bloody NFS client fix to get in instead of adding crap like that.

That's totally fine if the NFS client will be fixed in the near future, but this doesn't seem likely: http://www.spinics.net/lists/linux-nfs/msg20462.html

The code to use preadv/pwritev has been in qemu for over 2 years, and it took people until now to notice the NFS slowdown, so don't expect it to be fixed three days later.

True. That brings up a different question - whether we are doing enough testing on mainline QEMU :(

I can't even see you in the relevant threads arguing for getting it fixed, so don't complain.

I asked Stefan to work on a QEMU patch - since my QEMU skills are limited and Stefan is a more appropriate person to deal with QEMU :)

As mentioned earlier, none of the temporary solutions suggested so far (RPC slot table + ASYNC) comes anywhere close to the performance of the readv/writev support patch or a linearized QEMU.

Thanks, Badari
Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS
On Fri, 2011-04-15 at 13:09 -0500, Anthony Liguori wrote: On 04/15/2011 11:23 AM, Badari Pulavarty wrote: On Fri, 2011-04-15 at 17:34 +0200, Christoph Hellwig wrote: On Fri, Apr 15, 2011 at 04:26:41PM +0100, Stefan Hajnoczi wrote: On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig h...@lst.de wrote:

NAK. Just wait for the bloody NFS client fix to get in instead of adding crap like that.

That's totally fine if the NFS client will be fixed in the near future, but this doesn't seem likely: http://www.spinics.net/lists/linux-nfs/msg20462.html

The code to use preadv/pwritev has been in qemu for over 2 years, and it took people until now to notice the NFS slowdown, so don't expect it to be fixed three days later.

True. That brings up a different question - whether we are doing enough testing on mainline QEMU :(

The issue here is NFS, not QEMU.

Sure. But we should have caught the regression on NFS when the preadv/pwritev change went into QEMU, or before it went in -- shouldn't we? Since it was 2 years ago (like hch indicated), we could have fixed NFS long ago :)

Moreover, the real problem is that we're using O_DIRECT. O_DIRECT seems to result in nothing but problems and it never seems to be tested well on any file system.

O_DIRECT was added with a specific use-case in mind - and it's supposed to handle only that case well (pre-allocated files with database usage, where the db has its own caching layer). That case is well tested by various DB vendors on most (important) local filesystems. You know very well why we are on the O_DIRECT path :)

I think the fundamental problem we keep running into really boils down to O_DIRECT being a second class interface within Linux.

It's by design. It's a special case for a specific use.

Thanks, Badari
Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS
On 4/15/2011 10:29 AM, Christoph Hellwig wrote: On Fri, Apr 15, 2011 at 09:23:54AM -0700, Badari Pulavarty wrote:

True. That brings up a different question - whether we are doing enough testing on mainline QEMU :(

It seems you're clearly not doing enough testing on any qemu. Even the RHEL6 qemu has had preadv/pwritev since the first beta.

Christoph, when you say "you're" - you really mean RH, right? RH should have caught this in their regression testing a year ago, as part of their first beta. Correct? Unfortunately, you are picking on the person who spent time analyzing the regression, narrowing the problem area and suggesting approaches to address the issue :( BTW, I am not sure an RH discussion is appropriate in this forum.

Thanks, Badari
Re: [Qemu-devel] [PATCH] raw-posix: Linearize direct I/O on Linux NFS
On 4/15/2011 4:00 PM, Anthony Liguori wrote: On 04/15/2011 05:21 PM, pbad...@linux.vnet.ibm.com wrote: On 4/15/2011 10:29 AM, Christoph Hellwig wrote: On Fri, Apr 15, 2011 at 09:23:54AM -0700, Badari Pulavarty wrote:

True. That brings up a different question - whether we are doing enough testing on mainline QEMU :(

It seems you're clearly not doing enough testing on any qemu. Even the RHEL6 qemu has had preadv/pwritev since the first beta.

Christoph, when you say "you're" - you really mean RH, right? RH should have caught this in their regression testing a year ago, as part of their first beta. Correct? Unfortunately, you are picking on the person who spent time analyzing the regression, narrowing the problem area and suggesting approaches to address the issue :(

This is a pretty silly discussion to be having. The facts are:

1) NFS sucks with preadv/pwritev and O_DIRECT -- is anyone really surprised?
2) We could work around this in QEMU by doing something ugly
3) We have no way to detect when we no longer need a work around, which makes (2) really unappealing.
4) That leaves us with:
 a) waiting for NFS to get fixed properly and just living with worse performance on older kernels
 b) having a user-tunable switch to enable bouncing

I really dislike the idea of (b) because we're stuck with it forever and it's yet another switch for people to mistakenly depend on.

I'm still waiting to see performance data without O_DIRECT. I suspect that using cache=writethrough will make most of this problem go away, in which case we can just recommend that as a work around until NFS is properly fixed.

We need to run through all the cases and analyze the performance of cache=writethrough. Our initial (smaller setup) analysis indicates that it's better than unpatched O_DIRECT - but ~5% slower for sequential writes, and 30%+ slower for random read/writes and mixed IO workloads.
(In the past, NFS O_SYNC performance was extremely poor compared to O_DIRECT, with no scaling, on older kernels due to congestion control issues.) Khoa will collect the data over the next few days.

To be honest with you, we should kill cache=none and just optimize only one case and live with it (like the other commercial hypervisor) :(

Thanks, Badari
[Qemu-devel] [RFC] vhost-blk implementation (v2)
Hi All,

Here is the latest version of the vhost-blk implementation. The major difference from my previous implementation is that I now merge all contiguous requests (both read and write) before submitting them. This significantly improved IO performance. I am still collecting performance numbers; I will be posting them in the next few days.

Comments?

Todo:
- Address hch's comments on annotations
- Implement per-device read/write queues
- Finish up error handling

Thanks, Badari

---
 drivers/vhost/blk.c | 445
 1 file changed, 445 insertions(+)

Index: net-next/drivers/vhost/blk.c
===
--- /dev/null	1970-01-01 00:00:00.0 +
+++ net-next/drivers/vhost/blk.c	2010-04-06 16:38:03.563847905 -0400
@@ -0,0 +1,445 @@
+ /*
+ * virtio-block server in host kernel.
+ * Inspired by vhost-net and shamelessly ripped code from it :)
+ */
+
+#include <linux/compat.h>
+#include <linux/eventfd.h>
+#include <linux/vhost.h>
+#include <linux/virtio_net.h>
+#include <linux/virtio_blk.h>
+#include <linux/mmu_context.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/rcupdate.h>
+#include <linux/file.h>
+
+#include "vhost.h"
+
+#define VHOST_BLK_VQ_MAX 1
+#define SECTOR_SHIFT 9
+
+struct vhost_blk {
+	struct vhost_dev dev;
+	struct vhost_virtqueue vqs[VHOST_BLK_VQ_MAX];
+	struct vhost_poll poll[VHOST_BLK_VQ_MAX];
+};
+
+struct vhost_blk_io {
+	struct list_head list;
+	struct work_struct work;
+	struct vhost_blk *blk;
+	struct file *file;
+	int head;
+	uint32_t type;
+	uint32_t nvecs;
+	uint64_t sector;
+	uint64_t len;
+	struct iovec iov[0];
+};
+
+static struct workqueue_struct *vblk_workqueue;
+static LIST_HEAD(write_queue);
+static LIST_HEAD(read_queue);
+
+static void handle_io_work(struct work_struct *work)
+{
+	struct vhost_blk_io *vbio, *entry;
+	struct vhost_virtqueue *vq;
+	struct vhost_blk *blk;
+	struct list_head single, *head, *node, *tmp;
+
+	int i, need_free, ret = 0;
+	loff_t pos;
+	uint8_t status = 0;
+
+	vbio = container_of(work, struct vhost_blk_io, work);
+	blk = vbio->blk;
+	vq = &blk->dev.vqs[0];
+	pos = vbio->sector << 8;
+
+	use_mm(blk->dev.mm);
+	if (vbio->type & VIRTIO_BLK_T_FLUSH) {
+		ret = vfs_fsync(vbio->file, vbio->file->f_path.dentry, 1);
+	} else if (vbio->type & VIRTIO_BLK_T_OUT) {
+		ret = vfs_writev(vbio->file, vbio->iov, vbio->nvecs, &pos);
+	} else {
+		ret = vfs_readv(vbio->file, vbio->iov, vbio->nvecs, &pos);
+	}
+	status = (ret < 0) ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
+	if (vbio->head != -1) {
+		INIT_LIST_HEAD(&single);
+		list_add(&vbio->list, &single);
+		head = &single;
+		need_free = 0;
+	} else {
+		head = &vbio->list;
+		need_free = 1;
+	}
+	list_for_each_entry(entry, head, list) {
+		copy_to_user(entry->iov[entry->nvecs].iov_base, &status, sizeof status);
+	}
+	mutex_lock(&vq->mutex);
+	list_for_each_safe(node, tmp, head) {
+		entry = list_entry(node, struct vhost_blk_io, list);
+		vhost_add_used_and_signal(&blk->dev, vq, entry->head, ret);
+		list_del(node);
+		kfree(entry);
+	}
+	mutex_unlock(&vq->mutex);
+	unuse_mm(blk->dev.mm);
+	if (need_free)
+		kfree(vbio);
+}
+
+static struct vhost_blk_io *allocate_vbio(int nvecs)
+{
+	struct vhost_blk_io *vbio;
+	int size = sizeof(struct vhost_blk_io) + nvecs * sizeof(struct iovec);
+	vbio = kmalloc(size, GFP_KERNEL);
+	if (vbio) {
+		INIT_WORK(&vbio->work, handle_io_work);
+		INIT_LIST_HEAD(&vbio->list);
+	}
+	return vbio;
+}
+
+static void merge_and_handoff_work(struct list_head *queue)
+{
+	struct vhost_blk_io *vbio, *entry;
+	int nvecs = 0;
+	int entries = 0;
+
+	list_for_each_entry(entry, queue, list) {
+		nvecs += entry->nvecs;
+		entries++;
+	}
+
+	if (entries == 1) {
+		vbio = list_first_entry(queue, struct vhost_blk_io, list);
+		list_del(&vbio->list);
+		queue_work(vblk_workqueue, &vbio->work);
+		return;
+	}
+
+	vbio = allocate_vbio(nvecs);
+	if (!vbio) {
+		/* Unable to allocate memory - submit IOs individually */
+		list_for_each_entry(vbio, queue, list) {
+			queue_work(vblk_workqueue, &vbio->work);
+		}
+		INIT_LIST_HEAD(queue);
+		return;
+	}
+
+	entry = list_first_entry(queue, struct vhost_blk_io, list);
+	vbio->nvecs = nvecs;
+	vbio->blk = entry->blk;
+
[Qemu-devel] Re: [RFC] vhost-blk implementation
Michael S. Tsirkin wrote: On Tue, Mar 23, 2010 at 12:55:07PM -0700, Badari Pulavarty wrote: Michael S. Tsirkin wrote: On Tue, Mar 23, 2010 at 10:57:33AM -0700, Badari Pulavarty wrote: Michael S. Tsirkin wrote: On Mon, Mar 22, 2010 at 05:34:04PM -0700, Badari Pulavarty wrote:

Write Results:
==
I see degraded IO performance when doing sequential IO write tests with vhost-blk compared to virtio-blk.

# time dd of=/dev/vda if=/dev/zero bs=2M oflag=direct

I get ~110MB/sec with virtio-blk, but I get only ~60MB/sec with vhost-blk. Wondering why?

Try to look at the number of interrupts and/or the number of exits.

I checked interrupts and IO exits - there is no major noticeable difference between the vhost-blk and virtio-blk scenarios.

It could also be that you are overrunning some queue. I don't see any exit mitigation strategy in your patch: when there are already lots of requests in a queue, it's usually a good idea to disable notifications and poll the queue as requests complete. That could help performance.

Do you mean poll the eventfd for new requests instead of waiting for new notifications? Where do you do that in the vhost-net code?

vhost_disable_notify does this.

Unlike a network socket, since we are dealing with a file, there is no ->poll support for it. So I can't poll for the data. And also, the issue I am having is on the write() side.

Not sure I understand. I looked at it some more - I see 512K write requests on the virtio queue in both the vhost-blk and virtio-blk cases. Both qemu and vhost are doing synchronous writes to the page cache (there is no write batching in qemu that is affecting this case). I am still puzzled on why virtio-blk outperforms vhost-blk.

Thanks, Badari

If you say the number of requests is the same, we are left with:
- requests are smaller for some reason?
- something is causing retries?

No. IO request sizes are exactly the same (512K) in both cases. There are no retries or errors in either case.

One thing I am not clear on: for some reason, could the guest kernel push more data into the virtio ring in the virtio-blk case than in the vhost-blk case? Is this possible? Does the guest get to run much sooner in the virtio-blk case than with vhost-blk? Sorry if it's a dumb question - I don't understand all the vhost details :(

Thanks, Badari

BTW, did you put the backend in non-blocking mode? As I said, vhost-net passes the non-blocking flag to the socket backend, but the vfs_write/read that you use does not have an option to do this. So we'll need to extend the backend to fix that, but just for demo purposes, you could set non-blocking mode on the file from userspace.

Michael,

At least I understand the performance difference now.. My debug code changed the behaviour of virtio-blk, which confused me.

1) virtio-blk is able to batch up writes from various requests.
2) virtio-blk offloads the writes to a different thread

Whereas in vhost-blk, I do each request synchronously. Hence the difference. You are right - I have to make the backend async. I will work on handing off work to in-kernel threads. I am not sure about IO completion handling and calling vhost_add_used_and_signal() out of context. But let me take a stab at it and ask for your help if I run into issues.

Thanks, Badari
[Qemu-devel] [RFC] vhost-blk implementation
Hi,

Inspired by the vhost-net implementation, I did an initial prototype of vhost-blk to see if it provides any benefits over QEMU virtio-blk. I haven't handled all the error cases, fixed naming conventions etc., but the implementation is stable enough to play with. I tried not to deviate from the vhost-net implementation where possible.

NOTE: The only change I had to make to the vhost core code is to increase VHOST_NET_MAX_SG to 130 (128+2) in vhost.h.

Performance:
============
I have done simple tests to see how it performs. I got very encouraging results on sequential read tests. But on sequential write tests, I see a degrade over virtio-blk. I can't figure out and explain why. Can someone shed light on what's happening here?

Read Results:
=============
The test reads an 84GB file from the host (through virtio). I unmount and mount the filesystem on the host to make sure there is nothing in the page cache.

with vhost-blk:
# time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
64+0 records in
64+0 records out
8388608 bytes (84 GB) copied, 126.135 seconds, 665 MB/s

real    2m6.137s
user    0m0.281s
sys     0m14.725s

without vhost-blk (virtio):
# time dd if=/dev/vda of=/dev/null bs=128k iflag=direct
64+0 records in
64+0 records out
8388608 bytes (84 GB) copied, 275.466 seconds, 305 MB/s

real    4m35.468s
user    0m0.373s
sys     0m48.074s

Write Results:
==============
I see degraded IO performance when doing sequential IO write tests with vhost-blk compared to virtio-blk.

# time dd of=/dev/vda if=/dev/zero bs=2M oflag=direct

I get ~110MB/sec with virtio-blk, but only ~60MB/sec with vhost-blk. Wondering why?

Comments/flames?

Thanks,
Badari

vhost-blk is an in-kernel accelerator for virtio-blk. At this time, this is a prototype based on virtio-net. Lots of error handling and clean up needs to be done. Read performance is pretty good over QEMU virtio-blk, but write performance is not anywhere close to QEMU virtio-blk. Why?
Signed-off-by: Badari Pulavarty <pbad...@us.ibm.com>
---
 drivers/vhost/blk.c | 242 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 242 insertions(+)

Index: net-next/drivers/vhost/blk.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ net-next/drivers/vhost/blk.c	2010-03-22 18:07:18.156584400 -0400
@@ -0,0 +1,242 @@
+/*
+ * virtio-block server in host kernel.
+ * Inspired by vhost-net and shamelessly ripped code from it :)
+ */
+
+#include <linux/compat.h>
+#include <linux/eventfd.h>
+#include <linux/vhost.h>
+#include <linux/virtio_net.h>
+#include <linux/virtio_blk.h>
+#include <linux/mmu_context.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/rcupdate.h>
+#include <linux/file.h>
+
+#include "vhost.h"
+
+#define VHOST_BLK_VQ_MAX 1
+
+struct vhost_blk {
+	struct vhost_dev dev;
+	struct vhost_virtqueue vqs[VHOST_BLK_VQ_MAX];
+	struct vhost_poll poll[VHOST_BLK_VQ_MAX];
+};
+
+static int do_handle_io(struct file *file, uint32_t type, uint64_t sector,
+			struct iovec *iov, int in)
+{
+	loff_t pos = sector << 8;
+	int ret = 0;
+
+	if (type & VIRTIO_BLK_T_FLUSH) {
+		ret = vfs_fsync(file, file->f_path.dentry, 1);
+	} else if (type & VIRTIO_BLK_T_OUT) {
+		ret = vfs_writev(file, iov, in, &pos);
+	} else {
+		ret = vfs_readv(file, iov, in, &pos);
+	}
+	return ret;
+}
+
+static void handle_blk(struct vhost_blk *blk)
+{
+	struct vhost_virtqueue *vq = &blk->dev.vqs[0];
+	unsigned head, out, in;
+	struct virtio_blk_outhdr hdr;
+	int r, nvecs;
+	uint8_t status = 0;
+
+	use_mm(blk->dev.mm);
+	mutex_lock(&vq->mutex);
+
+	vhost_disable_notify(vq);
+
+	for (;;) {
+		head = vhost_get_vq_desc(&blk->dev, vq, vq->iov,
+					 ARRAY_SIZE(vq->iov),
+					 &out, &in, NULL, NULL);
+		if (head == vq->num) {
+			if (unlikely(vhost_enable_notify(vq))) {
+				vhost_disable_notify(vq);
+				continue;
+			}
+			break;
+		}
+
+		BUG_ON(vq->iov[0].iov_len != 16);
+
+		r = copy_from_user(&hdr, vq->iov[0].iov_base, sizeof hdr);
+		if (r < 0) {
+			printk("copy from user failed\n");
+			vhost_discard_vq_desc(vq);
+			break;
+		}
+
+		nvecs = out - 1;
+		if (hdr.type == VIRTIO_BLK_T_IN)
+			nvecs = in - 1;
+
+		r = do_handle_io(vq->private_data, hdr.type, hdr.sector,
+				 &vq->iov[1], nvecs);
+		status = (r < 0) ? VIRTIO_BLK_S_IOERR
[Qemu-devel] Re: [RFC] vhost-blk implementation
Michael S. Tsirkin wrote:
On Mon, Mar 22, 2010 at 05:34:04PM -0700, Badari Pulavarty wrote:

Write Results:
==============
I see degraded IO performance when doing sequential IO write tests with vhost-blk compared to virtio-blk.

# time dd of=/dev/vda if=/dev/zero bs=2M oflag=direct

I get ~110MB/sec with virtio-blk, but only ~60MB/sec with vhost-blk. Wondering why?

Try to look at the number of interrupts and/or the number of exits. It could also be that you are overrunning some queue.

Yeah. I don't see any exit mitigation strategy in your patch: when there are already lots of requests in a queue, it's usually a good idea to disable notifications and poll the queue as requests complete. That could help performance.

Thanks for the suggestions. I will take a closer look.

Thanks,
Badari
[Qemu-devel] Re: [RFC] vhost-blk implementation
Michael S. Tsirkin wrote:
On Mon, Mar 22, 2010 at 05:34:04PM -0700, Badari Pulavarty wrote:

Write Results:
==============
I see degraded IO performance when doing sequential IO write tests with vhost-blk compared to virtio-blk.

# time dd of=/dev/vda if=/dev/zero bs=2M oflag=direct

I get ~110MB/sec with virtio-blk, but only ~60MB/sec with vhost-blk. Wondering why?

Try to look at the number of interrupts and/or the number of exits.

I checked interrupts and IO exits - there is no major noticeable difference between the vhost-blk and virtio-blk scenarios.

It could also be that you are overrunning some queue. I don't see any exit mitigation strategy in your patch: when there are already lots of requests in a queue, it's usually a good idea to disable notifications and poll the queue as requests complete. That could help performance.

Do you mean poll the eventfd for new requests instead of waiting for new notifications? Where do you do that in the vhost-net code?

Unlike a network socket, since we are dealing with a file, there is no ->poll support for it. So I can't poll for the data. And also, the issue I am having is on the write() side.

I looked at it some more - I see 512K write requests on the virtio queue in both the vhost-blk and virtio-blk cases. Both qemu and vhost are doing synchronous writes to the page cache (there is no write batching in qemu that is affecting this case). I am still puzzled why virtio-blk outperforms vhost-blk.

Thanks,
Badari
[Qemu-devel] Re: [RFC] vhost-blk implementation
Michael S. Tsirkin wrote:
On Tue, Mar 23, 2010 at 10:57:33AM -0700, Badari Pulavarty wrote:
Michael S. Tsirkin wrote:
On Mon, Mar 22, 2010 at 05:34:04PM -0700, Badari Pulavarty wrote:

Write Results:
==============
I see degraded IO performance when doing sequential IO write tests with vhost-blk compared to virtio-blk.

# time dd of=/dev/vda if=/dev/zero bs=2M oflag=direct

I get ~110MB/sec with virtio-blk, but only ~60MB/sec with vhost-blk. Wondering why?

Try to look at the number of interrupts and/or the number of exits.

I checked interrupts and IO exits - there is no major noticeable difference between the vhost-blk and virtio-blk scenarios.

It could also be that you are overrunning some queue. I don't see any exit mitigation strategy in your patch: when there are already lots of requests in a queue, it's usually a good idea to disable notifications and poll the queue as requests complete. That could help performance.

Do you mean poll the eventfd for new requests instead of waiting for new notifications? Where do you do that in the vhost-net code?

vhost_disable_notify does this.

Unlike a network socket, since we are dealing with a file, there is no ->poll support for it. So I can't poll for the data. And also, the issue I am having is on the write() side.

Not sure I understand.

I looked at it some more - I see 512K write requests on the virtio queue in both the vhost-blk and virtio-blk cases. Both qemu and vhost are doing synchronous writes to the page cache (there is no write batching in qemu that is affecting this case). I am still puzzled why virtio-blk outperforms vhost-blk.

Thanks,
Badari

If you say the number of requests is the same, we are left with:
- requests are smaller for some reason?
- something is causing retries?

No. The IO request sizes are exactly the same (512K) in both cases. There are no retries or errors in either case. One thing I am not clear on: for some reason the guest kernel could push more data into the virtio ring in the virtio-blk case than in the vhost-blk case. Is this possible?
Does the guest get to run much sooner in the virtio-blk case than in the vhost-blk case? Sorry if it's a dumb question - I don't understand all the vhost details :(

Thanks,
Badari