Re: [Qemu-block] [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device

2016-12-20 Thread Fam Zheng
On Tue, 12/20 15:04, no-re...@patchew.org wrote:
> ERROR: that open brace { should be on the previous line
> #287: FILE: util/vfio-helpers.c:214:
> +struct vfio_group_status group_status =
> +{ .argsz = sizeof(group_status) };

Hmm, it may indeed look better.

> ERROR: Use of volatile is usually wrong: see 
> Documentation/volatile-considered-harmful.txt
> #145: FILE: block/nvme.c:92:
> +volatile uint32_t *doorbell;
> 
> ERROR: Use of volatile is usually wrong: see 
> Documentation/volatile-considered-harmful.txt
> #169: FILE: block/nvme.c:116:
> +typedef volatile struct {

These are shared with hardware, even volatile-considered-harmful.txt admits it
can be valid:

> - Pointers to data structures in coherent memory which might be modified
>   by I/O devices can, sometimes, legitimately be volatile.  A ring buffer
>   used by a network adapter, where that adapter changes pointers to
>   indicate which descriptors have been processed, is an example of this
>   type of situation.

Fam



Re: [Qemu-block] [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device

2016-12-20 Thread no-reply
Hi,

Your series failed automatic build test. Please find the testing commands and
their output below. If you have docker installed, you can probably reproduce it
locally.

Type: series
Message-id: 20161220163139.12016-1-f...@redhat.com
Subject: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device

=== TEST SCRIPT BEGIN ===
#!/bin/bash
set -e
git submodule update --init dtc
# Let docker tests dump environment info
export SHOW_ENV=1
export J=16
make docker-test-quick@centos6
make docker-test-mingw@fedora
make docker-test-build@min-glib
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
ddc026b block: Add VFIO based NVMe driver
f5dd619 util: Add VFIO helper library
5608a5c util: Add a notifier list for qemu_vfree()
822462d ramblock-notifier: new

=== OUTPUT BEGIN ===
Submodule 'dtc' (git://git.qemu-project.org/dtc.git) registered for path 'dtc'
Cloning into 'dtc'...
Submodule path 'dtc': checked out '65cc4d2748a2c2e6f27f1cf39e07a5dbabd80ebf'
  BUILD   centos6
make[1]: Entering directory `/var/tmp/patchew-tester-tmp-m_z3bizr/src'
  ARCHIVE qemu.tgz
  ARCHIVE dtc.tgz
  COPYRUNNER
RUN test-quick in qemu:centos6 
Packages installed:
SDL-devel-1.2.14-7.el6_7.1.x86_64
ccache-3.1.6-2.el6.x86_64
epel-release-6-8.noarch
gcc-4.4.7-17.el6.x86_64
git-1.7.1-4.el6_7.1.x86_64
glib2-devel-2.28.8-5.el6.x86_64
libfdt-devel-1.4.0-1.el6.x86_64
make-3.81-23.el6.x86_64
package g++ is not installed
pixman-devel-0.32.8-1.el6.x86_64
tar-1.23-15.el6_8.x86_64
zlib-devel-1.2.3-29.el6.x86_64

Environment variables:
PACKAGES=libfdt-devel ccache tar git make gcc g++ zlib-devel 
glib2-devel SDL-devel pixman-devel epel-release
HOSTNAME=71a2d1232481
TERM=xterm
MAKEFLAGS= -j16
HISTSIZE=1000
J=16
USER=root
CCACHE_DIR=/var/tmp/ccache
EXTRA_CONFIGURE_OPTS=
V=
SHOW_ENV=1
MAIL=/var/spool/mail/root
PATH=/usr/lib/ccache:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/
LANG=en_US.UTF-8
TARGET_LIST=
HISTCONTROL=ignoredups
SHLVL=1
HOME=/root
TEST_DIR=/tmp/qemu-test
LOGNAME=root
LESSOPEN=||/usr/bin/lesspipe.sh %s
FEATURES= dtc
DEBUG=
G_BROKEN_FILENAMES=1
CCACHE_HASHDIR=
_=/usr/bin/env

Configure options:
--enable-werror --target-list=x86_64-softmmu,aarch64-softmmu 
--prefix=/var/tmp/qemu-build/install
No C++ compiler available; disabling C++ specific optional code
Install prefix/var/tmp/qemu-build/install
BIOS directory/var/tmp/qemu-build/install/share/qemu
binary directory  /var/tmp/qemu-build/install/bin
library directory /var/tmp/qemu-build/install/lib
module directory  /var/tmp/qemu-build/install/lib/qemu
libexec directory /var/tmp/qemu-build/install/libexec
include directory /var/tmp/qemu-build/install/include
config directory  /var/tmp/qemu-build/install/etc
local state directory   /var/tmp/qemu-build/install/var
Manual directory  /var/tmp/qemu-build/install/share/man
ELF interp prefix /usr/gnemul/qemu-%M
Source path   /tmp/qemu-test/src
C compilercc
Host C compiler   cc
C++ compiler  
Objective-C compiler cc
ARFLAGS   rv
CFLAGS-O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -g 
QEMU_CFLAGS   -I/usr/include/pixman-1-pthread -I/usr/include/glib-2.0 
-I/usr/lib64/glib-2.0/include   -fPIE -DPIE -m64 -mcx16 -D_GNU_SOURCE 
-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes 
-Wredundant-decls -Wall -Wundef -Wwrite-strings -Wmissing-prototypes 
-fno-strict-aliasing -fno-common -fwrapv  -Wendif-labels -Wmissing-include-dirs 
-Wempty-body -Wnested-externs -Wformat-security -Wformat-y2k -Winit-self 
-Wignored-qualifiers -Wold-style-declaration -Wold-style-definition 
-Wtype-limits -fstack-protector-all
LDFLAGS   -Wl,--warn-common -Wl,-z,relro -Wl,-z,now -pie -m64 -g 
make  make
install   install
pythonpython -B
smbd  /usr/sbin/smbd
module supportno
host CPU  x86_64
host big endian   no
target list   x86_64-softmmu aarch64-softmmu
tcg debug enabled no
gprof enabled no
sparse enabledno
strip binariesyes
profiler  no
static build  no
pixmansystem
SDL support   yes (1.2.14)
GTK support   no 
GTK GL supportno
VTE support   no 
TLS priority  NORMAL
GNUTLS supportno
GNUTLS rndno
libgcrypt no
libgcrypt kdf no
nettleno 
nettle kdfno
libtasn1  no
curses supportno
virgl support no
curl support  no
mingw32 support   no
Audio drivers oss
Block whitelist (rw) 
Block whitelist (ro) 
VirtFS supportno
VNC support   yes
VNC SASL support  no
VNC JPEG support  no
VNC PNG support   no
xen support   no
brlapi supportno
bluez  supportno
Documentation no
PIE   yes
vde support   no
netmap supportno
Linux AIO support no
ATTR/XATTR support yes
Install blobs yes
KVM support   yes
COLO support  yes
RDMA support  no
TCG interpreter   no
fdt support  

Re: [Qemu-block] [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device

2016-12-20 Thread no-reply
Hi,

Your series seems to have some coding style problems. See output below for
more information:

Subject: [Qemu-devel] [PATCH 0/4] RFC: A VFIO based block driver for NVMe device
Message-id: 20161220163139.12016-1-f...@redhat.com
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
failed=1
echo
fi
n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
ddc026b block: Add VFIO based NVMe driver
f5dd619 util: Add VFIO helper library
5608a5c util: Add a notifier list for qemu_vfree()
822462d ramblock-notifier: new

=== OUTPUT BEGIN ===
Checking PATCH 1/4: ramblock-notifier: new...
Checking PATCH 2/4: util: Add a notifier list for qemu_vfree()...
Checking PATCH 3/4: util: Add VFIO helper library...
WARNING: line over 80 characters
#184: FILE: util/vfio-helpers.c:111:
+if (ioctl(s->device, VFIO_DEVICE_GET_REGION_INFO, 
>bar_region_info[index])) {

WARNING: line over 80 characters
#262: FILE: util/vfio-helpers.c:189:
+static int qemu_vfio_pci_read_config(QEMUVFIOState *s, void *buf, int size, 
int ofs)

WARNING: line over 80 characters
#264: FILE: util/vfio-helpers.c:191:
+if (pread(s->device, buf, size, s->config_region_info.offset + ofs) == 
size) {

WARNING: line over 80 characters
#270: FILE: util/vfio-helpers.c:197:
+static int qemu_vfio_pci_write_config(QEMUVFIOState *s, void *buf, int size, 
int ofs)

ERROR: that open brace { should be on the previous line
#287: FILE: util/vfio-helpers.c:214:
+struct vfio_group_status group_status =
+{ .argsz = sizeof(group_status) };

WARNING: line over 80 characters
#407: FILE: util/vfio-helpers.c:334:
+static void qemu_vfio_ram_block_added(RAMBlockNotifier *n, void *host, size_t 
size)

WARNING: line over 80 characters
#415: FILE: util/vfio-helpers.c:342:
+static void qemu_vfio_ram_block_removed(RAMBlockNotifier *n, void *host, 
size_t size)

total: 1 errors, 6 warnings, 746 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 4/4: block: Add VFIO based NVMe driver...
ERROR: Use of volatile is usually wrong: see 
Documentation/volatile-considered-harmful.txt
#145: FILE: block/nvme.c:92:
+volatile uint32_t *doorbell;

ERROR: Use of volatile is usually wrong: see 
Documentation/volatile-considered-harmful.txt
#169: FILE: block/nvme.c:116:
+typedef volatile struct {

WARNING: line over 80 characters
#339: FILE: block/nvme.c:286:
+DPRINTF("NVMe error cmd specific %x sq head %x sqid %x cid %x status 
%x\n",

WARNING: line over 80 characters
#626: FILE: block/nvme.c:573:
+error_setg(errp, "Timeout while waiting for device to reset (%ld 
ms)",

WARNING: line over 80 characters
#655: FILE: block/nvme.c:602:
+error_setg(errp, "Timeout while waiting for device to start (%ld 
ms)",

total: 2 errors, 3 warnings, 1033 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-de...@freelists.org

Re: [Qemu-block] [Qemu-devel] [PATCH] mirror: prevent 'top' mode mirroring when no backing file specified on the destination

2016-12-20 Thread no-reply
Hi,

Your series seems to have some coding style problems. See output below for
more information:

Subject: [Qemu-devel] [PATCH] mirror: prevent 'top' mode mirroring when no 
backing file specified on the destination
Message-id: 1482187106-85065-1-git-send-email-sochin.ji...@huawei.com
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
failed=1
echo
fi
n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
05e3f52 mirror: prevent 'top' mode mirroring when no backing file specified on 
the destination

=== OUTPUT BEGIN ===
Checking PATCH 1/1: mirror: prevent 'top' mode mirroring when no backing file 
specified on the destination...
ERROR: that open brace { should be on the previous line
#27: FILE: block/mirror.c:1041:
+if (mode == MIRROR_SYNC_MODE_TOP && !backing_bs(target))
+{

total: 1 errors, 0 warnings, 12 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-de...@freelists.org

Re: [Qemu-block] [Qemu-devel] [PATCH v4 0/7] add blkdebug tests

2016-12-20 Thread Eric Blake
[oops, I forgot cc's on the cover letter, even though the rest of the
series was properly broadcast]

On 12/20/2016 01:15 PM, Eric Blake wrote:
> Based on Kevin's block-next branch:
> http://repo.or.cz/qemu/kevin.git/shortlog/refs/heads/block-next
> 
> Available as a tag at:
> git fetch git://repo.or.cz/qemu/ericb.git nbd-blkdebug-v4
> 
> v3 was:
> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00278.html
> 
> Since then:
> - Address comments from Max and Kevin
>   - typo cleanups
>   - no longer force 512-byte alignment
>   - better naming
> - Merge two series into one
> 
> 001/7:[down] 'qcow2: Assert that cluster operations are aligned'
> 002/7:[down] 'qcow2: Discard/zero clusters by byte count'
> 003/7:[] [--] 'blkdebug: Sanity check block layer guarantees'
> 004/7:[0015] [FC] 'blkdebug: Add pass-through write_zero and discard support'
> 005/7:[0002] [FC] 'blkdebug: Simplify override logic'
> 006/7:[0046] [FC] 'blkdebug: Add ability to override unmap geometries'
> 007/7:[0015] [FC] 'tests: Add coverage for recent block geometry fixes'
> 
> Eric Blake (7):
>   qcow2: Assert that cluster operations are aligned
>   qcow2: Discard/zero clusters by byte count
>   blkdebug: Sanity check block layer guarantees
>   blkdebug: Add pass-through write_zero and discard support
>   blkdebug: Simplify override logic
>   blkdebug: Add ability to override unmap geometries
>   tests: Add coverage for recent block geometry fixes
> 
>  qapi/block-core.json   |  24 +-
>  block/qcow2.h  |   9 +-
>  block/blkdebug.c   | 204 
> +++--
>  block/qcow2-cluster.c  |  29 ---
>  block/qcow2-snapshot.c |   7 +-
>  block/qcow2.c  |  21 ++---
>  tests/qemu-iotests/173 | 114 +
>  tests/qemu-iotests/173.out |  49 +++
>  tests/qemu-iotests/group   |   1 +
>  9 files changed, 413 insertions(+), 45 deletions(-)
>  create mode 100755 tests/qemu-iotests/173
>  create mode 100644 tests/qemu-iotests/173.out
> 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


[Qemu-block] [PATCH v4 6/7] blkdebug: Add ability to override unmap geometries

2016-12-20 Thread Eric Blake
Make it easier to simulate various unusual hardware setups (for
example, recent commits 3482b9b and b8d0a98 affect the Dell
Equallogic iSCSI with its 15M preferred and maximum unmap and
write zero sizing, or b2f95fe deals with the Linux loopback
block device having a max_transfer of 64k), by allowing blkdebug
to wrap any other device with further restrictions on various
alignments.

Signed-off-by: Eric Blake 

---
v4: relax 512 byte minimum now that blkdebug is byte-based, fix doc typo
v3: improve legibility of bounds checking, improve docs
v2: new patch
---
 qapi/block-core.json | 24 +-
 block/blkdebug.c | 91 +++-
 2 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 6b42216..6561409 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2072,6 +2072,26 @@
 # @align:   #optional required alignment for requests in bytes,
 #   must be power of 2, or 0 for default
 #
+# @max-transfer:#optional maximum size for I/O transfers in bytes,
+#   must be multiple of @align (but need not be a power of
+#   2), or 0 for default (since 2.9)
+#
+# @opt-write-zero:  #optional preferred alignment for write zero requests
+#   in bytes, must be multiple of @align (but need not be
+#   a power of 2), or 0 for default (since 2.9)
+#
+# @max-write-zero:  #optional maximum size for write zero requests in bytes,
+#   must be multiple of @align (but need not be a power of
+#   2), or 0 for default (since 2.9)
+#
+# @opt-discard: #optional preferred alignment for discard requests
+#   in bytes, must be multiple of @align (but need not be
+#   a power of 2), or 0 for default (since 2.9)
+#
+# @max-discard: #optional maximum size for discard requests in bytes,
+#   must be multiple of @align (but need not be a power of
+#   2), or 0 for default (since 2.9)
+#
 # @inject-error:#optional array of error injection descriptions
 #
 # @set-state:   #optional array of state-change descriptions
@@ -2081,7 +2101,9 @@
 { 'struct': 'BlockdevOptionsBlkdebug',
   'data': { 'image': 'BlockdevRef',
 '*config': 'str',
-'*align': 'int',
+'*align': 'int', '*max-transfer': 'int32',
+'*opt-write-zero': 'int32', '*max-write-zero': 'int32',
+'*opt-discard': 'int32', '*max-discard': 'int32',
 '*inject-error': ['BlkdebugInjectErrorOptions'],
 '*set-state': ['BlkdebugSetStateOptions'] } }

diff --git a/block/blkdebug.c b/block/blkdebug.c
index df68d0f..6f9db15 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -39,6 +39,11 @@ typedef struct BDRVBlkdebugState {
 int state;
 int new_state;
 uint64_t align;
+uint64_t max_transfer;
+uint64_t opt_write_zero;
+uint64_t max_write_zero;
+uint64_t opt_discard;
+uint64_t max_discard;

 /* For blkdebug_refresh_filename() */
 char *config_file;
@@ -343,6 +348,31 @@ static QemuOptsList runtime_opts = {
 .type = QEMU_OPT_SIZE,
 .help = "Required alignment in bytes",
 },
+{
+.name = "max-transfer",
+.type = QEMU_OPT_SIZE,
+.help = "Maximum transfer size in bytes",
+},
+{
+.name = "opt-write-zero",
+.type = QEMU_OPT_SIZE,
+.help = "Optimum write zero alignment in bytes",
+},
+{
+.name = "max-write-zero",
+.type = QEMU_OPT_SIZE,
+.help = "Maximum write zero size in bytes",
+},
+{
+.name = "opt-discard",
+.type = QEMU_OPT_SIZE,
+.help = "Optimum discard alignment in bytes",
+},
+{
+.name = "max-discard",
+.type = QEMU_OPT_SIZE,
+.help = "Maximum discard size in bytes",
+},
 { /* end of list */ }
 },
 };
@@ -354,6 +384,7 @@ static int blkdebug_open(BlockDriverState *bs, QDict 
*options, int flags,
 QemuOpts *opts;
 Error *local_err = NULL;
 int ret;
+uint64_t align;

 opts = qemu_opts_create(_opts, NULL, 0, _abort);
 qemu_opts_absorb_qdict(opts, options, _err);
@@ -387,12 +418,55 @@ static int blkdebug_open(BlockDriverState *bs, QDict 
*options, int flags,
 bs->supported_zero_flags = (BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
 bs->file->bs->supported_zero_flags;

-/* Set request alignment */
+/* Set alignment overrides */
 s->align = qemu_opt_get_size(opts, "align", 0);
 if (s->align && (s->align >= INT_MAX || !is_power_of_2(s->align))) {
 error_setg(errp, "Invalid alignment, align must be integer power of 
2");
 goto fail_unref;
 }
+align = MAX(s->align, 

[Qemu-block] [PATCH v4 4/7] blkdebug: Add pass-through write_zero and discard support

2016-12-20 Thread Eric Blake
In order to test the effects of artificial geometry constraints
on operations like write zero or discard, we first need blkdebug
to manage these actions.  It also allows us to inject errors on
those operations, just like we can for read/write/flush.

We can also test the contract promised by the block layer; namely,
if a device has specified limits on alignment or maximum size,
then those limits must be obeyed (for now, the blkdebug driver
merely inherits limits from whatever it is wrapping, but the next
patch will further enhance it to allow specific limit overrides).

This patch intentionally refuses to service requests smaller than
the requested alignments; this is because an upcoming patch adds
a qemu-iotest to prove that the block layer is correctly handling
fragmentation, but the test only works if there is a way to tell
the difference at artificial alignment boundaries when blkdebug is
using a larger-than-default alignment.  If we let the blkdebug
layer always defer to the underlying layer, which potentially has
a smaller granularity, the iotest will be thwarted.

Tested by setting up an NBD server with export 'foo', then invoking:
$ ./qemu-io
qemu-io> open -o driver=blkdebug blkdebug::nbd://localhost:10809/foo
qemu-io> d 0 15M
qemu-io> w -z 0 15M

Pre-patch, the server never sees the discard (it was silently
eaten by the block layer); post-patch it is passed across the
wire.  Likewise, pre-patch the write is always passed with
NBD_WRITE (with 15M of zeroes on the wire), while post-patch
it can utilize NBD_WRITE_ZEROES (for less traffic).

Signed-off-by: Eric Blake 

---
v4: correct error injection to respect byte range, tweak formatting
v3: rebase to byte-based read/write, improve docs on why no
partial write zero passthrough
v2: new patch
---
 block/blkdebug.c | 86 
 1 file changed, 86 insertions(+)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index 37094a2..b85a291 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -1,6 +1,7 @@
 /*
  * Block protocol for I/O error injection
  *
+ * Copyright (C) 2016 Red Hat, Inc.
  * Copyright (c) 2010 Kevin Wolf 
  *
  * Permission is hereby granted, free of charge, to any person obtaining a copy
@@ -382,6 +383,11 @@ static int blkdebug_open(BlockDriverState *bs, QDict 
*options, int flags,
 goto out;
 }

+bs->supported_write_flags = BDRV_REQ_FUA &
+bs->file->bs->supported_write_flags;
+bs->supported_zero_flags = (BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+bs->file->bs->supported_zero_flags;
+
 /* Set request alignment */
 align = qemu_opt_get_size(opts, "align", 0);
 if (align < INT_MAX && is_power_of_2(align)) {
@@ -511,6 +517,84 @@ static int blkdebug_co_flush(BlockDriverState *bs)
 return bdrv_co_flush(bs->file->bs);
 }

+static int coroutine_fn blkdebug_co_pwrite_zeroes(BlockDriverState *bs,
+  int64_t offset, int count,
+  BdrvRequestFlags flags)
+{
+BDRVBlkdebugState *s = bs->opaque;
+BlkdebugRule *rule = NULL;
+uint32_t align = MAX(bs->bl.request_alignment,
+ bs->bl.pwrite_zeroes_alignment);
+
+/* Only pass through requests that are larger than requested
+ * preferred alignment (so that we test the fallback to writes on
+ * unaligned portions), and check that the block layer never hands
+ * us anything crossing an alignment boundary.  */
+if (count < align) {
+return -ENOTSUP;
+}
+assert(QEMU_IS_ALIGNED(offset, align));
+assert(QEMU_IS_ALIGNED(count, align));
+if (bs->bl.max_pwrite_zeroes) {
+assert(count <= bs->bl.max_pwrite_zeroes);
+}
+
+QSIMPLEQ_FOREACH(rule, >active_rules, active_next) {
+uint64_t inject_offset = rule->options.inject.offset;
+
+if (inject_offset == -1 ||
+(inject_offset >= offset && inject_offset < offset + count))
+{
+break;
+}
+}
+
+if (rule && rule->options.inject.error) {
+return inject_error(bs, rule);
+}
+
+return bdrv_co_pwrite_zeroes(bs->file, offset, count, flags);
+}
+
+static int coroutine_fn blkdebug_co_pdiscard(BlockDriverState *bs,
+ int64_t offset, int count)
+{
+BDRVBlkdebugState *s = bs->opaque;
+BlkdebugRule *rule = NULL;
+uint32_t align = bs->bl.pdiscard_alignment;
+
+/* Only pass through requests that are larger than requested
+ * minimum alignment, and ensure that unaligned requests do not
+ * cross optimum discard boundaries. */
+if (count < bs->bl.request_alignment) {
+return -ENOTSUP;
+}
+assert(QEMU_IS_ALIGNED(offset, bs->bl.request_alignment));
+assert(QEMU_IS_ALIGNED(count, bs->bl.request_alignment));
+if (align && count >= align) {
+assert(QEMU_IS_ALIGNED(offset, align));
+

[Qemu-block] [PATCH v4 7/7] tests: Add coverage for recent block geometry fixes

2016-12-20 Thread Eric Blake
Use blkdebug's new geometry constraints to emulate setups that
have caused recent regression fixes: write zeroes asserting
when running through a loopback block device with max-transfer
smaller than cluster size, and discard rounding away portions
of requests not aligned to preferred boundaries.  Also, add
coverage that the block layer is honoring max transfer limits.

For now, a single iotest performs all actions, with the idea
that we can add future blkdebug constraint test cases in the
same file; but it can be split into multiple iotests if we find
reason to run one portion of the test in more setups than what
are possible in the other.

For reference, the final portion of the test (checking whether
discard passes as much as possible to the lowest layers of the
stack) works as follows:

qemu-io: discard 30M at 8001, passed to blkdebug
  blkdebug: discard 511 bytes at 8001, -ENOTSUP (smaller than
blkdebug's 512 align)
  blkdebug: discard 14371328 bytes at 8512, passed to qcow2
qcow2: discard 739840 bytes at 8512, -ENOTSUP (smaller than
qcow2's 1M align)
qcow2: discard 13M bytes at 77M, succeeds
  blkdebug: discard 15M bytes at 90M, passed to qcow2
qcow2: discard 15M bytes at 90M, succeeds
  blkdebug: discard 1356800 bytes at 105M, passed to qcow2
qcow2: discard 1M at 105M, succeeds
qcow2: discard 308224 bytes at 106M, -ENOTSUP (smaller than qcow2's
1M align)
  blkdebug: discard 1 byte at 111457280, -ENOTSUP (smaller than
blkdebug's 512 align)

Signed-off-by: Eric Blake 

---
v4: clean up some comments, nicer backing file creation, more commit message
v3: make comments tied more to test at hand, rather than the
particular hardware that led to the earlier patches being tested
v2: new patch
---
 tests/qemu-iotests/173 | 114 +
 tests/qemu-iotests/173.out |  49 +++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 164 insertions(+)
 create mode 100755 tests/qemu-iotests/173
 create mode 100644 tests/qemu-iotests/173.out

diff --git a/tests/qemu-iotests/173 b/tests/qemu-iotests/173
new file mode 100755
index 000..867c9ca
--- /dev/null
+++ b/tests/qemu-iotests/173
@@ -0,0 +1,114 @@
+#!/bin/bash
+#
+# Test corner cases with unusual block geometries
+#
+# Copyright (C) 2016 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+# creator
+owner=ebl...@redhat.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+status=1   # failure is the default!
+
+_cleanup()
+{
+   _cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+_supported_fmt qcow2
+_supported_proto file
+
+CLUSTER_SIZE=1M
+size=128M
+options=driver=blkdebug,image.driver=qcow2
+
+echo
+echo "== setting up files =="
+
+TEST_IMG="$TEST_IMG.base" _make_test_img $size
+$QEMU_IO -c "write -P 11 0 $size" "$TEST_IMG.base" | _filter_qemu_io
+_make_test_img -b "$TEST_IMG.base"
+$QEMU_IO -c "write -P 22 0 $size" "$TEST_IMG" | _filter_qemu_io
+
+# Limited to 64k max-transfer
+echo
+echo "== constrained alignment and max-transfer =="
+limits=align=4k,max-transfer=64k
+$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
+ -c "write -P 33 1000 128k" -c "read -P 33 1000 128k" | _filter_qemu_io
+
+echo
+echo "== write zero with constrained max-transfer =="
+limits=align=512,max-transfer=64k,opt-write-zero=$CLUSTER_SIZE
+$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
+ -c "write -z 8003584 2093056" | _filter_qemu_io
+
+# non-power-of-2 write-zero/discard alignments
+echo
+echo "== non-power-of-2 write zeroes limits =="
+
+limits=align=512,opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
+$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
+ -c "write -z 32M 32M" | _filter_qemu_io
+
+echo
+echo "== non-power-of-2 discard limits =="
+
+limits=align=512,opt-write-zero=15M,max-write-zero=15M,opt-discard=15M,max-discard=15M
+$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
+ -c "discard 8001 30M" | _filter_qemu_io
+
+echo
+echo "== verify image content =="
+
+function verify_io()
+{
+if ($QEMU_IMG info -f "$IMGFMT" "$TEST_IMG" |
+   grep "compat: 0.10" > /dev/null); then
+# For v2 images, discarded 

[Qemu-block] [PATCH v4 5/7] blkdebug: Simplify override logic

2016-12-20 Thread Eric Blake
Rather than store into a local variable, then copy to the struct
if the value is valid, then reporting errors otherwise, it is
simpler to just store into the struct and report errors if the
value is invalid.  This however requires that the struct store
a 64-bit number, rather than a narrower type.  Move the errno
assignment into a label that will be reused from more places in
the next patch.

Signed-off-by: Eric Blake 

---
v4: fix typo in commit message, move errno assignment
v3: new patch
---
 block/blkdebug.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index b85a291..df68d0f 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -38,7 +38,7 @@
 typedef struct BDRVBlkdebugState {
 int state;
 int new_state;
-int align;
+uint64_t align;

 /* For blkdebug_refresh_filename() */
 char *config_file;
@@ -353,7 +353,6 @@ static int blkdebug_open(BlockDriverState *bs, QDict 
*options, int flags,
 BDRVBlkdebugState *s = bs->opaque;
 QemuOpts *opts;
 Error *local_err = NULL;
-uint64_t align;
 int ret;

 opts = qemu_opts_create(_opts, NULL, 0, _abort);
@@ -389,12 +388,9 @@ static int blkdebug_open(BlockDriverState *bs, QDict 
*options, int flags,
 bs->file->bs->supported_zero_flags;

 /* Set request alignment */
-align = qemu_opt_get_size(opts, "align", 0);
-if (align < INT_MAX && is_power_of_2(align)) {
-s->align = align;
-} else if (align) {
-error_setg(errp, "Invalid alignment");
-ret = -EINVAL;
+s->align = qemu_opt_get_size(opts, "align", 0);
+if (s->align && (s->align >= INT_MAX || !is_power_of_2(s->align))) {
+error_setg(errp, "Invalid alignment, align must be integer power of 
2");
 goto fail_unref;
 }

@@ -402,6 +398,7 @@ static int blkdebug_open(BlockDriverState *bs, QDict 
*options, int flags,
 goto out;

 fail_unref:
+ret = -EINVAL;
 bdrv_unref_child(bs, bs->file);
 out:
 if (ret < 0) {
-- 
2.9.3




[Qemu-block] [PATCH v4 2/7] qcow2: Discard/zero clusters by byte count

2016-12-20 Thread Eric Blake
Passing a byte offset, but sector count, when we ultimately
want to operate on cluster granularity, is madness.  Clean up
the interfaces to take both offset and count as bytes, while
still keeping the assertion added previously that the caller
must align the values to a cluster.  Then rename things to
make sure backports don't get confused by changed units:
instead of qcow2_discard_clusters() and qcow2_zero_clusters(),
we now have qcow2_cluster_discard and qcow2_cluster_zeroize().

Signed-off-by: Eric Blake 

---
v4: improve function names, split assertion additions into earlier patch
[no v3 or v2]
v1: https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00339.html
---
 block/qcow2.h  |  9 +
 block/qcow2-cluster.c  | 17 -
 block/qcow2-snapshot.c |  7 +++
 block/qcow2.c  | 21 +
 4 files changed, 25 insertions(+), 29 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 1823414..a131b72 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -543,10 +543,11 @@ uint64_t 
qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
  int compressed_size);

 int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m);
-int qcow2_discard_clusters(BlockDriverState *bs, uint64_t offset,
-int nb_sectors, enum qcow2_discard_type type, bool full_discard);
-int qcow2_zero_clusters(BlockDriverState *bs, uint64_t offset, int nb_sectors,
-int flags);
+int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
+  uint64_t count, enum qcow2_discard_type type,
+  bool full_discard);
+int qcow2_cluster_zeroize(BlockDriverState *bs, uint64_t offset,
+  uint64_t count, int flags);

 int qcow2_expand_zero_clusters(BlockDriverState *bs,
BlockDriverAmendStatusCB *status_cb,
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 3304a15..aad5183 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1511,16 +1511,15 @@ static int discard_single_l2(BlockDriverState *bs, 
uint64_t offset,
 return nb_clusters;
 }

-int qcow2_discard_clusters(BlockDriverState *bs, uint64_t offset,
-int nb_sectors, enum qcow2_discard_type type, bool full_discard)
+int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
+  uint64_t count, enum qcow2_discard_type type,
+  bool full_discard)
 {
 BDRVQcow2State *s = bs->opaque;
-uint64_t end_offset;
+uint64_t end_offset = offset + count;
 uint64_t nb_clusters;
 int ret;

-end_offset = offset + (nb_sectors << BDRV_SECTOR_BITS);
-
 /* Caller must pass aligned values */
 assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
 assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size));
@@ -1591,8 +1590,8 @@ static int zero_single_l2(BlockDriverState *bs, uint64_t 
offset,
 return nb_clusters;
 }

-int qcow2_zero_clusters(BlockDriverState *bs, uint64_t offset, int nb_sectors,
-int flags)
+int qcow2_cluster_zeroize(BlockDriverState *bs, uint64_t offset,
+  uint64_t count, int flags)
 {
 BDRVQcow2State *s = bs->opaque;
 uint64_t nb_clusters;
@@ -1600,7 +1599,7 @@ int qcow2_zero_clusters(BlockDriverState *bs, uint64_t 
offset, int nb_sectors,

 /* Caller must pass aligned values */
 assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
-assert(QEMU_IS_ALIGNED(nb_sectors, s->cluster_size >> BDRV_SECTOR_BITS));
+assert(QEMU_IS_ALIGNED(count, s->cluster_size));

 /* The zero flag is only supported by version 3 and newer */
 if (s->qcow_version < 3) {
@@ -1608,7 +1607,7 @@ int qcow2_zero_clusters(BlockDriverState *bs, uint64_t 
offset, int nb_sectors,
 }

 /* Each L2 table is handled by its own loop iteration */
-nb_clusters = size_to_clusters(s, nb_sectors << BDRV_SECTOR_BITS);
+nb_clusters = size_to_clusters(s, count);

 s->cache_discards = true;

diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
index 0324243..44243e0 100644
--- a/block/qcow2-snapshot.c
+++ b/block/qcow2-snapshot.c
@@ -440,10 +440,9 @@ int qcow2_snapshot_create(BlockDriverState *bs, 
QEMUSnapshotInfo *sn_info)

 /* The VM state isn't needed any more in the active L1 table; in fact, it
  * hurts by causing expensive COW for the next snapshot. */
-qcow2_discard_clusters(bs, qcow2_vm_state_offset(s),
-   align_offset(sn->vm_state_size, s->cluster_size)
->> BDRV_SECTOR_BITS,
-   QCOW2_DISCARD_NEVER, false);
+qcow2_cluster_discard(bs, qcow2_vm_state_offset(s),
+  align_offset(sn->vm_state_size, s->cluster_size),
+  QCOW2_DISCARD_NEVER, false);

 #ifdef DEBUG_ALLOC
 {
diff --git a/block/qcow2.c b/block/qcow2.c
index 

[Qemu-block] [PATCH v4 3/7] blkdebug: Sanity check block layer guarantees

2016-12-20 Thread Eric Blake
Commits 04ed95f4 and 1a62d0ac updated the block layer to auto-fragment
any I/O to fit within device boundaries. Additionally, when using a
minimum alignment of 4k, we want to ensure the block layer does proper
read-modify-write rather than requesting I/O on a slice of a sector.
Let's enforce that the contract is obeyed when using blkdebug.  For
now, blkdebug only allows alignment overrides, and just inherits other
limits from whatever device it is wrapping, but a future patch will
further enhance things.

Signed-off-by: Eric Blake 
Reviewed-by: Kevin Wolf 
Reviewed-by: Max Reitz 

---
v3: rebase to byte-based interfaces
v2: new patch
---
 block/blkdebug.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index acccf85..37094a2 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -438,6 +438,13 @@ blkdebug_co_preadv(BlockDriverState *bs, uint64_t offset, 
uint64_t bytes,
 BDRVBlkdebugState *s = bs->opaque;
 BlkdebugRule *rule = NULL;

+/* Sanity check block layer guarantees */
+assert(QEMU_IS_ALIGNED(offset, bs->bl.request_alignment));
+assert(QEMU_IS_ALIGNED(bytes, bs->bl.request_alignment));
+if (bs->bl.max_transfer) {
+assert(bytes <= bs->bl.max_transfer);
+}
+
 QSIMPLEQ_FOREACH(rule, >active_rules, active_next) {
 uint64_t inject_offset = rule->options.inject.offset;

@@ -462,6 +469,13 @@ blkdebug_co_pwritev(BlockDriverState *bs, uint64_t offset, 
uint64_t bytes,
 BDRVBlkdebugState *s = bs->opaque;
 BlkdebugRule *rule = NULL;

+/* Sanity check block layer guarantees */
+assert(QEMU_IS_ALIGNED(offset, bs->bl.request_alignment));
+assert(QEMU_IS_ALIGNED(bytes, bs->bl.request_alignment));
+if (bs->bl.max_transfer) {
+assert(bytes <= bs->bl.max_transfer);
+}
+
 QSIMPLEQ_FOREACH(rule, >active_rules, active_next) {
 uint64_t inject_offset = rule->options.inject.offset;

-- 
2.9.3




[Qemu-block] [PATCH v4 1/7] qcow2: Assert that cluster operations are aligned

2016-12-20 Thread Eric Blake
qcow2_discard_clusters() is set up to silently ignore sub-cluster
head or tail on unaligned requests.  However, it is easy to audit
the various callers: qcow2_snapshot_create() has always passed
aligned data since the call was introduced in 1ebf561;
qcow2_co_pdiscard() has passed aligned clusters since commit
ecdbead taught the block layer the preferred discard alignment (the
block layer can still pass sub-cluster values, but those are
handled directly in qcow2_co_pdiscard()); and qcow2_make_empty()
was fixed to pass aligned clusters in commit a3e1505.  Replace
rounding with assertions to hold us to the tighter contract,
eliminating the now-impossible case of an early exit for a
sub-cluster request.

qcow2_zero_clusters() has always been called with cluster-aligned
arguments from its lone caller qcow2_co_pwrite_zeroes() (like
qcow2_co_pdiscard(), the caller takes care of sub-cluster requests
from the block layer; and qcow2_zero_clusters() would have
misbehaved on unaligned requests), but it deserves the same
assertion for symmetry.

Signed-off-by: Eric Blake 

---
v4: new patch
---
 block/qcow2-cluster.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 928c1e2..3304a15 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1521,13 +1521,9 @@ int qcow2_discard_clusters(BlockDriverState *bs, 
uint64_t offset,

 end_offset = offset + (nb_sectors << BDRV_SECTOR_BITS);

-/* Round start up and end down */
-offset = align_offset(offset, s->cluster_size);
-end_offset = start_of_cluster(s, end_offset);
-
-if (offset > end_offset) {
-return 0;
-}
+/* Caller must pass aligned values */
+assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
+assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size));

 nb_clusters = size_to_clusters(s, end_offset - offset);

@@ -1602,6 +1598,10 @@ int qcow2_zero_clusters(BlockDriverState *bs, uint64_t 
offset, int nb_sectors,
 uint64_t nb_clusters;
 int ret;

+/* Caller must pass aligned values */
+assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
+assert(QEMU_IS_ALIGNED(nb_sectors, s->cluster_size >> BDRV_SECTOR_BITS));
+
 /* The zero flag is only supported by version 3 and newer */
 if (s->qcow_version < 3) {
 return -ENOTSUP;
-- 
2.9.3




Re: [Qemu-block] [PATCH 4/4] block: Add VFIO based NVMe driver

2016-12-20 Thread Paolo Bonzini


- Original Message -
> From: "Fam Zheng" 
> To: qemu-de...@nongnu.org
> Cc: "Paolo Bonzini" , qemu-block@nongnu.org, "Karl 
> Rister" , "Kevin Wolf"
> , "Max Reitz" , borntrae...@de.ibm.com, 
> "Stefan Hajnoczi" 
> Sent: Tuesday, December 20, 2016 5:31:39 PM
> Subject: [PATCH 4/4] block: Add VFIO based NVMe driver
> 
> This is a new protocol driver that exclusively opens a host NVMe
> controller through VFIO. It achieves better latency than linux-aio.
> 
>  nvme://linux-aio
> --
>   fio bs=4k  iodepth=1 (IOPS) 30014  24009
>   fio bs=4k  iodepth=1 (IOPS)  +polling   45148  34689
>   fio bs=64k iodepth=1 (IOPS) 17801  14954
>   fio bs=64k iodepth=1 (IOPS)  +polling   17970  14564
> 
>   fio bs=4k  iodepth=32 (IOPS)   110637 121480
>   fio bs=4k  iodepth=32 (IOPS) +polling  145533 166878
>   fio bs=64k iodepth=32 (IOPS)50525  50596
>   fio bs=64k iodepth=32 (IOPS) +polling   50482  50534
> 
>   (host) qemu-img bench -c 800 (sec)  15.00  43.13
> 
> ("+polling" means poll-max-ns=18000 which is a rule of thumb value for
> the used NVMe device, otherwise it defaults to 0).
> 
> For the rows where linux-aio is faster, a lot of evidence shows that the
> io queue is more likely to stay full if the request completions happen
> faster, as in the nvme:// case, hence less batch submission and request
> merging than linux-aio.
> 
> This patch is co-authored by Paolo and me.

... though to be fair it's at least 95% Fam's work. :)

Paolo

> 
> Signed-off-by: Fam Zheng 
> ---
>  block/Makefile.objs |1 +
>  block/nvme.c| 1026
>  +++
>  2 files changed, 1027 insertions(+)
>  create mode 100644 block/nvme.c
> 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 67a036a..fe975a6 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -11,6 +11,7 @@ block-obj-$(CONFIG_POSIX) += raw-posix.o
>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>  block-obj-y += null.o mirror.o commit.o io.o
>  block-obj-y += throttle-groups.o
> +block-obj-y += nvme.o
>  
>  block-obj-y += nbd.o nbd-client.o sheepdog.o
>  block-obj-$(CONFIG_LIBISCSI) += iscsi.o
> diff --git a/block/nvme.c b/block/nvme.c
> new file mode 100644
> index 000..5c9ffde
> --- /dev/null
> +++ b/block/nvme.c
> @@ -0,0 +1,1026 @@
> +/*
> + * NVMe block driver based on vfio
> + *
> + * Copyright 2016 Red Hat, Inc.
> + *
> + * Authors:
> + *   Fam Zheng 
> + *   Paolo Bonzini 
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include 
> +#include "qapi/error.h"
> +#include "qapi/qmp/qdict.h"
> +#include "qapi/qmp/qstring.h"
> +#include "block/block_int.h"
> +#include "qemu/error-report.h"
> +#include "qemu/vfio-helpers.h"
> +
> +#define NVME_DEBUG 0
> +
> +#define DPRINTF(...) \
> +if (NVME_DEBUG) { \
> +printf(__VA_ARGS__); \
> +}
> +
> +#define NVME_SQ_ENTRY_BYTES 64
> +#define NVME_CQ_ENTRY_BYTES 16
> +#define NVME_QUEUE_SIZE 128
> +
> +/* TODO: share definitions with hw/block/nvme.c? */
> +enum NvmeAdminCommands {
> +NVME_ADM_CMD_DELETE_SQ  = 0x00,
> +NVME_ADM_CMD_CREATE_SQ  = 0x01,
> +NVME_ADM_CMD_GET_LOG_PAGE   = 0x02,
> +NVME_ADM_CMD_DELETE_CQ  = 0x04,
> +NVME_ADM_CMD_CREATE_CQ  = 0x05,
> +NVME_ADM_CMD_IDENTIFY   = 0x06,
> +NVME_ADM_CMD_ABORT  = 0x08,
> +NVME_ADM_CMD_SET_FEATURES   = 0x09,
> +NVME_ADM_CMD_GET_FEATURES   = 0x0a,
> +NVME_ADM_CMD_ASYNC_EV_REQ   = 0x0c,
> +NVME_ADM_CMD_ACTIVATE_FW= 0x10,
> +NVME_ADM_CMD_DOWNLOAD_FW= 0x11,
> +NVME_ADM_CMD_FORMAT_NVM = 0x80,
> +NVME_ADM_CMD_SECURITY_SEND  = 0x81,
> +NVME_ADM_CMD_SECURITY_RECV  = 0x82,
> +};
> +
> +enum NvmeIoCommands {
> +NVME_CMD_FLUSH  = 0x00,
> +NVME_CMD_WRITE  = 0x01,
> +NVME_CMD_READ   = 0x02,
> +NVME_CMD_WRITE_UNCOR= 0x04,
> +NVME_CMD_COMPARE= 0x05,
> +NVME_CMD_DSM= 0x09,
> +};
> +
> +typedef struct {
> +uint8_t  opcode;
> +uint8_t  flags;
> +uint16_t cid;
> +uint32_t nsid;
> +uint64_t reserved;
> +uint64_t mptr;
> +uint64_t prp1;
> +uint64_t prp2;
> +uint32_t cdw10;
> +uint32_t cdw11;
> +uint32_t cdw12;
> +uint32_t cdw13;
> +uint32_t cdw14;
> +uint32_t cdw15;
> +} QEMU_PACKED NVMeCommand;
> +
> +typedef struct {
> +uint32_t cmd_specific;
> +uint32_t reserved;
> +uint16_t sq_head;
> +uint16_t sqid;
> +uint16_t cid;
> +uint16_t status;
> 

[Qemu-block] [PATCH 2/4] util: Add a notifier list for qemu_vfree()

2016-12-20 Thread Fam Zheng
To allow other code to intercept buffer freeing and do customized
unmapping operations.

Signed-off-by: Fam Zheng 
---
 include/qemu/notify.h | 1 +
 util/oslib-posix.c| 9 +
 2 files changed, 10 insertions(+)

diff --git a/include/qemu/notify.h b/include/qemu/notify.h
index a3d73e4..24c4f5f 100644
--- a/include/qemu/notify.h
+++ b/include/qemu/notify.h
@@ -69,4 +69,5 @@ void notifier_with_return_remove(NotifierWithReturn 
*notifier);
 int notifier_with_return_list_notify(NotifierWithReturnList *list,
  void *data);
 
+void qemu_vfree_add_notifier(Notifier *n);
 #endif
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index f631464..b0dd207 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -125,9 +125,18 @@ void *qemu_anon_ram_alloc(size_t size, uint64_t *alignment)
 return ptr;
 }
 
+static NotifierList qemu_vfree_notifiers =
+NOTIFIER_LIST_INITIALIZER(qemu_vfree_notifiers);
+
+void qemu_vfree_add_notifier(Notifier *n)
+{
+notifier_list_add(_vfree_notifiers, n);
+}
+
 void qemu_vfree(void *ptr)
 {
 trace_qemu_vfree(ptr);
+notifier_list_notify(_vfree_notifiers, ptr);
 free(ptr);
 }
 
-- 
2.9.3




[Qemu-block] [PATCH 3/4] util: Add VFIO helper library

2016-12-20 Thread Fam Zheng
This is a simple helper library to be used by VFIO device drivers in
QEMU, makeing it very easy to interface with /dev/vfio and do DMA
mappings.

Especially, once initialized, this module proactively maps all guest ram
regions to IO address space (in the host IOMMU context) so that in
further I/O paths, it's very cheap to look up the list of IOVAs of
SG buffers that we get from guest driver, without doing any syscall.

For bounce buffers that are allocated by QEMU, we do lazy mapping
transparently when the caller is invoking the same DMA mapping function.
This is relatively slower but should still be efficient thanks to
HBitmap.  However, caller is responsible for unmapping bounce buffers
explicitly.

Signed-off-by: Fam Zheng 
---
 include/qemu/vfio-helpers.h |  29 ++
 util/Makefile.objs  |   1 +
 util/vfio-helpers.c | 713 
 3 files changed, 743 insertions(+)
 create mode 100644 include/qemu/vfio-helpers.h
 create mode 100644 util/vfio-helpers.c

diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
new file mode 100644
index 000..4096dd8
--- /dev/null
+++ b/include/qemu/vfio-helpers.h
@@ -0,0 +1,29 @@
+/*
+ * VFIO helper functions
+ *
+ * Copyright 2016 Red Hat, Inc.
+ *
+ * Authors:
+ *   Fam Zheng 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_VFIO_H
+#define QEMU_VFIO_H
+#include "qemu/queue.h"
+
+typedef struct QEMUVFIOState QEMUVFIOState;
+
+QEMUVFIOState *qemu_vfio_open_pci(const char *device, Error **errp);
+void qemu_vfio_close(QEMUVFIOState *s);
+int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size,
+  bool contiguous, uint64_t *iova_list);
+void qemu_vfio_dma_unmap(QEMUVFIOState *s, void *host);
+void *qemu_vfio_pci_map_bar(QEMUVFIOState *s, int index, Error **errp);
+void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar);
+int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
+   int irq_type, Error **errp);
+
+#endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index ad0f9c7..16c714a 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -36,3 +36,4 @@ util-obj-y += log.o
 util-obj-y += qdist.o
 util-obj-y += qht.o
 util-obj-y += range.o
+util-obj-y += vfio-helpers.o
diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
new file mode 100644
index 000..55d35e7
--- /dev/null
+++ b/util/vfio-helpers.c
@@ -0,0 +1,713 @@
+/*
+ * VFIO helper functions
+ *
+ * Copyright 2016 Red Hat, Inc.
+ *
+ * Authors:
+ *   Fam Zheng 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include 
+#include 
+#include "qapi/error.h"
+#include "exec/ramlist.h"
+#include "trace.h"
+#include "qemu/queue.h"
+#include "qemu/vfio-helpers.h"
+#include "qemu/error-report.h"
+#include "standard-headers/linux/pci_regs.h"
+#include "qemu/event_notifier.h"
+#include "qemu/hbitmap.h"
+
+#define VFIO_DEBUG 0
+
+#define DPRINTF(...) \
+if (VFIO_DEBUG) { \
+printf(__VA_ARGS__); \
+}
+
+/* XXX: Once VFIO exposes the iova bit width in the IOMMU capability interface,
+ * we can use a runtime limit. It's also possible to do platform specific
+ * detection by reading sysfs entries. Until then, 39 is a safe bet. */
+#define QEMU_VFIO_IOVA_MAX (1ULL << 39)
+
+/* DMA address space is managed in chunks. */
+#define QEMU_VFIO_CHUNK_SIZE (2ULL << 20)
+
+#define QEMU_VFIO_ALLOC_BITMAP_SIZE (QEMU_VFIO_IOVA_MAX / QEMU_VFIO_CHUNK_SIZE)
+
+typedef struct IOVARange {
+uint64_t iova;
+uint64_t nr_pages;
+QSIMPLEQ_ENTRY(IOVARange) next;
+} IOVARange;
+
+typedef struct {
+/* Page aligned. */
+void *host;
+size_t size;
+/* The IOVA range list to which the [host, host + size) area is mapped. */
+QSIMPLEQ_HEAD(, IOVARange) iova_list;
+} IOVAMapping;
+
+struct QEMUVFIOState {
+int container;
+int group;
+int device;
+RAMBlockNotifier ram_notifier;
+struct vfio_region_info config_region_info, bar_region_info[6];
+
+/* Allocation bitmap of IOVA address space, each bit represents
+ * QEMU_VFIO_CHUNK_SIZE bytes. Set bits mean free, unset for used/reserved.
+ **/
+HBitmap *free_chunks;
+IOVAMapping *mappings;
+int nr_mappings;
+};
+
+static int sysfs_find_group_file(const char *device, char **path, Error **errp)
+{
+int ret;
+char *sysfs_link = NULL;
+char *sysfs_group = NULL;
+char *p;
+
+sysfs_link = g_strdup_printf("/sys/bus/pci/devices/%s/iommu_group",
+ device);
+sysfs_group = g_malloc(PATH_MAX);
+ret = readlink(sysfs_link, sysfs_group, PATH_MAX - 1);
+if (ret == -1) {
+error_setg_errno(errp, errno, "Failed to find iommu group sysfs path");
+  

[Qemu-block] [PATCH 1/4] ramblock-notifier: new

2016-12-20 Thread Fam Zheng
From: Paolo Bonzini 

This adds a notify interface of ram block additions and removals.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Fam Zheng 
---
 exec.c  |  5 
 include/exec/memory.h   |  6 +
 include/exec/ram_addr.h | 46 ++-
 include/exec/ramlist.h  | 72 +
 numa.c  | 29 
 stubs/Makefile.objs |  1 +
 stubs/ram.c | 10 +++
 xen-mapcache.c  |  3 +++
 8 files changed, 123 insertions(+), 49 deletions(-)
 create mode 100644 include/exec/ramlist.h
 create mode 100644 stubs/ram.c

diff --git a/exec.c b/exec.c
index 08c558e..027c301 100644
--- a/exec.c
+++ b/exec.c
@@ -1654,6 +1654,7 @@ static void ram_block_add(RAMBlock *new_block, Error 
**errp)
 qemu_madvise(new_block->host, new_block->max_length, 
QEMU_MADV_HUGEPAGE);
 /* MADV_DONTFORK is also needed by KVM in absence of synchronous MMU */
 qemu_madvise(new_block->host, new_block->max_length, 
QEMU_MADV_DONTFORK);
+ram_block_notify_add(new_block->host, new_block->max_length);
 }
 }
 
@@ -1784,6 +1785,10 @@ void qemu_ram_free(RAMBlock *block)
 return;
 }
 
+if (block->host) {
+ram_block_notify_remove(block->host, block->max_length);
+}
+
 qemu_mutex_lock_ramlist();
 QLIST_REMOVE_RCU(block, next);
 ram_list.mru_block = NULL;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 9728a2f..fbc0a8f 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -16,16 +16,12 @@
 
 #ifndef CONFIG_USER_ONLY
 
-#define DIRTY_MEMORY_VGA   0
-#define DIRTY_MEMORY_CODE  1
-#define DIRTY_MEMORY_MIGRATION 2
-#define DIRTY_MEMORY_NUM   3/* num of dirty bits */
-
 #include "exec/cpu-common.h"
 #ifndef CONFIG_USER_ONLY
 #include "exec/hwaddr.h"
 #endif
 #include "exec/memattrs.h"
+#include "exec/ramlist.h"
 #include "qemu/queue.h"
 #include "qemu/int128.h"
 #include "qemu/notify.h"
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 54d7108..3e79466 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -21,6 +21,7 @@
 
 #ifndef CONFIG_USER_ONLY
 #include "hw/xen/xen.h"
+#include "exec/ramlist.h"
 
 struct RAMBlock {
 struct rcu_head rcu;
@@ -35,6 +36,7 @@ struct RAMBlock {
 char idstr[256];
 /* RCU-enabled, writes protected by the ramlist lock */
 QLIST_ENTRY(RAMBlock) next;
+QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
 int fd;
 size_t page_size;
 };
@@ -50,51 +52,7 @@ static inline void *ramblock_ptr(RAMBlock *block, ram_addr_t 
offset)
 return (char *)block->host + offset;
 }
 
-/* The dirty memory bitmap is split into fixed-size blocks to allow growth
- * under RCU.  The bitmap for a block can be accessed as follows:
- *
- *   rcu_read_lock();
- *
- *   DirtyMemoryBlocks *blocks =
- *   atomic_rcu_read(_list.dirty_memory[DIRTY_MEMORY_MIGRATION]);
- *
- *   ram_addr_t idx = (addr >> TARGET_PAGE_BITS) / DIRTY_MEMORY_BLOCK_SIZE;
- *   unsigned long *block = blocks.blocks[idx];
- *   ...access block bitmap...
- *
- *   rcu_read_unlock();
- *
- * Remember to check for the end of the block when accessing a range of
- * addresses.  Move on to the next block if you reach the end.
- *
- * Organization into blocks allows dirty memory to grow (but not shrink) under
- * RCU.  When adding new RAMBlocks requires the dirty memory to grow, a new
- * DirtyMemoryBlocks array is allocated with pointers to existing blocks kept
- * the same.  Other threads can safely access existing blocks while dirty
- * memory is being grown.  When no threads are using the old DirtyMemoryBlocks
- * anymore it is freed by RCU (but the underlying blocks stay because they are
- * pointed to from the new DirtyMemoryBlocks).
- */
-#define DIRTY_MEMORY_BLOCK_SIZE ((ram_addr_t)256 * 1024 * 8)
-typedef struct {
-struct rcu_head rcu;
-unsigned long *blocks[];
-} DirtyMemoryBlocks;
-
-typedef struct RAMList {
-QemuMutex mutex;
-RAMBlock *mru_block;
-/* RCU-enabled, writes protected by the ramlist lock. */
-QLIST_HEAD(, RAMBlock) blocks;
-DirtyMemoryBlocks *dirty_memory[DIRTY_MEMORY_NUM];
-uint32_t version;
-} RAMList;
-extern RAMList ram_list;
-
 ram_addr_t last_ram_offset(void);
-void qemu_mutex_lock_ramlist(void);
-void qemu_mutex_unlock_ramlist(void);
-
 RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
bool share, const char *mem_path,
Error **errp);
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
new file mode 100644
index 000..c59880d
--- /dev/null
+++ b/include/exec/ramlist.h
@@ -0,0 +1,72 @@
+#ifndef RAMLIST_H
+#define RAMLIST_H
+
+#include "qemu/queue.h"
+#include "qemu/thread.h"
+#include "qemu/rcu.h"
+
+typedef struct RAMBlockNotifier RAMBlockNotifier;
+
+#define 

Re: [Qemu-block] [Qemu-devel] [PATCH v3 5/5] tests: Add coverage for recent block geometry fixes

2016-12-20 Thread Eric Blake
On 12/07/2016 10:34 AM, Eric Blake wrote:
> On 12/07/2016 10:16 AM, Kevin Wolf wrote:
>> Am 02.12.2016 um 20:22 hat Eric Blake geschrieben:
>>> Use blkdebug's new geometry constraints to emulate setups that
>>> have caused recent regression fixes: write zeroes asserting
>>> when running through a loopback block device with max-transfer
>>> smaller than cluster size, and discard rounding away portions
>>> of requests not aligned to preferred boundaries.  Also, add
>>> coverage that the block layer is honoring max transfer limits.
>>>

>>> +
>>> +_supported_fmt qcow2
>>> +_supported_proto file
>>> +
>>> +CLUSTER_SIZE=1M
>>> +size=128M
>>> +options=driver=blkdebug,image.driver=qcow2
>>> +
>>> +echo
>>> +echo "== setting up files =="
>>> +
>>> +_make_test_img $size
>>> +$QEMU_IO -c "write -P 11 0 $size" "$TEST_IMG" | _filter_qemu_io
>>> +mv "$TEST_IMG" "$TEST_IMG.base"
>>
>> I know that you declared "_supported_proto file", but if you don't use
>> mv after creating the image, it might actually work with other
>> protocols.
>>
>> Most other tests use something like this:
>>
>> TEST_IMG="$TEST_IMG.base" _make_test_img $size
>>
>> And for the qemu-io invocation you can just use the right filename.
> 
> Thanks.  I obviously copied-and-pasted from an earlier test, rather than
> a later one, so I'll make the tweaks in v4.

Even with the tweaks, I still wasn't able to get things like:

./check -qcow2 -nfs 173

to work.  I'm probably missing something trivial, but if someone wants
to improve the test as a follow-up, they can be my guest.  I'll go ahead
and post v4 with the cleaner creation of the backing file, but leave the
_supported_proto file line for now.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


[Qemu-block] [PATCH 4/4] block: Add VFIO based NVMe driver

2016-12-20 Thread Fam Zheng
This is a new protocol driver that exclusively opens a host NVMe
controller through VFIO. It achieves better latency than linux-aio.

 nvme://linux-aio
--
  fio bs=4k  iodepth=1 (IOPS) 30014  24009
  fio bs=4k  iodepth=1 (IOPS)  +polling   45148  34689
  fio bs=64k iodepth=1 (IOPS) 17801  14954
  fio bs=64k iodepth=1 (IOPS)  +polling   17970  14564

  fio bs=4k  iodepth=32 (IOPS)   110637 121480
  fio bs=4k  iodepth=32 (IOPS) +polling  145533 166878
  fio bs=64k iodepth=32 (IOPS)50525  50596
  fio bs=64k iodepth=32 (IOPS) +polling   50482  50534

  (host) qemu-img bench -c 800 (sec)  15.00  43.13

("+polling" means poll-max-ns=18000 which is a rule of thumb value for
the used NVMe device, otherwise it defaults to 0).

For the rows where linux-aio is faster, a lot of evidence shows that the
io queue is more likely to stay full if the request completions happen
faster, as in the nvme:// case, hence less batch submission and request
merging than linux-aio.

This patch is co-authored by Paolo and me.

Signed-off-by: Fam Zheng 
---
 block/Makefile.objs |1 +
 block/nvme.c| 1026 +++
 2 files changed, 1027 insertions(+)
 create mode 100644 block/nvme.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 67a036a..fe975a6 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -11,6 +11,7 @@ block-obj-$(CONFIG_POSIX) += raw-posix.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 block-obj-y += null.o mirror.o commit.o io.o
 block-obj-y += throttle-groups.o
+block-obj-y += nvme.o
 
 block-obj-y += nbd.o nbd-client.o sheepdog.o
 block-obj-$(CONFIG_LIBISCSI) += iscsi.o
diff --git a/block/nvme.c b/block/nvme.c
new file mode 100644
index 000..5c9ffde
--- /dev/null
+++ b/block/nvme.c
@@ -0,0 +1,1026 @@
+/*
+ * NVMe block driver based on vfio
+ *
+ * Copyright 2016 Red Hat, Inc.
+ *
+ * Authors:
+ *   Fam Zheng 
+ *   Paolo Bonzini 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include 
+#include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qmp/qstring.h"
+#include "block/block_int.h"
+#include "qemu/error-report.h"
+#include "qemu/vfio-helpers.h"
+
+#define NVME_DEBUG 0
+
+#define DPRINTF(...) \
+if (NVME_DEBUG) { \
+printf(__VA_ARGS__); \
+}
+
+#define NVME_SQ_ENTRY_BYTES 64
+#define NVME_CQ_ENTRY_BYTES 16
+#define NVME_QUEUE_SIZE 128
+
+/* TODO: share definitions with hw/block/nvme.c? */
+enum NvmeAdminCommands {
+NVME_ADM_CMD_DELETE_SQ  = 0x00,
+NVME_ADM_CMD_CREATE_SQ  = 0x01,
+NVME_ADM_CMD_GET_LOG_PAGE   = 0x02,
+NVME_ADM_CMD_DELETE_CQ  = 0x04,
+NVME_ADM_CMD_CREATE_CQ  = 0x05,
+NVME_ADM_CMD_IDENTIFY   = 0x06,
+NVME_ADM_CMD_ABORT  = 0x08,
+NVME_ADM_CMD_SET_FEATURES   = 0x09,
+NVME_ADM_CMD_GET_FEATURES   = 0x0a,
+NVME_ADM_CMD_ASYNC_EV_REQ   = 0x0c,
+NVME_ADM_CMD_ACTIVATE_FW= 0x10,
+NVME_ADM_CMD_DOWNLOAD_FW= 0x11,
+NVME_ADM_CMD_FORMAT_NVM = 0x80,
+NVME_ADM_CMD_SECURITY_SEND  = 0x81,
+NVME_ADM_CMD_SECURITY_RECV  = 0x82,
+};
+
+enum NvmeIoCommands {
+NVME_CMD_FLUSH  = 0x00,
+NVME_CMD_WRITE  = 0x01,
+NVME_CMD_READ   = 0x02,
+NVME_CMD_WRITE_UNCOR= 0x04,
+NVME_CMD_COMPARE= 0x05,
+NVME_CMD_DSM= 0x09,
+};
+
+typedef struct {
+uint8_t  opcode;
+uint8_t  flags;
+uint16_t cid;
+uint32_t nsid;
+uint64_t reserved;
+uint64_t mptr;
+uint64_t prp1;
+uint64_t prp2;
+uint32_t cdw10;
+uint32_t cdw11;
+uint32_t cdw12;
+uint32_t cdw13;
+uint32_t cdw14;
+uint32_t cdw15;
+} QEMU_PACKED NVMeCommand;
+
+typedef struct {
+uint32_t cmd_specific;
+uint32_t reserved;
+uint16_t sq_head;
+uint16_t sqid;
+uint16_t cid;
+uint16_t status;
+} QEMU_PACKED NVMeCompletion;
+
+typedef struct {
+int32_t  head, tail;
+uint8_t  *queue;
+uint64_t iova;
+volatile uint32_t *doorbell;
+} NVMeQueue;
+
+typedef struct {
+BlockCompletionFunc *cb;
+void *opaque;
+int cid;
+void *prp_list_page;
+uint64_t prp_list_iova;
+} NVMeRequest;
+
+typedef struct {
+int index;
+NVMeQueue   sq, cq;
+int cq_phase;
+uint8_t *prp_list_pages;
+uint64_tprp_list_base_iova;
+NVMeRequest reqs[NVME_QUEUE_SIZE];
+CoQueue wait_queue;
+boolbusy;
+int need_kick;
+int inflight;
+} NVMeQueuePair;
+
+typedef volatile struct {
+uint64_t cap;
+uint32_t vs;
+uint32_t intms;
+uint32_t intmc;
+uint32_t cc;
+

Re: [Qemu-block] [PATCH RFC v2 4/6] replication: fix code logic with the new shared_disk option

2016-12-20 Thread Changlong Xie

On 12/05/2016 04:35 PM, zhanghailiang wrote:

Some code logic only be needed in non-shared disk, here
we adjust these codes to prepare for shared disk scenario.

Signed-off-by: zhanghailiang 
---
  block/replication.c | 47 ---
  1 file changed, 28 insertions(+), 19 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 35e9ab3..6574cc2 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -531,21 +531,28 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
  aio_context_release(aio_context);
  return;
  }
-bdrv_op_block_all(top_bs, s->blocker);
-bdrv_op_unblock(top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);

-job = backup_job_create(NULL, s->secondary_disk->bs, 
s->hidden_disk->bs,
-0, MIRROR_SYNC_MODE_NONE, NULL, false,
+/*
+ * Only in the case of non-shared disk,
+ * the backup job is in the secondary side
+ */
+if (!s->is_shared_disk) {
+bdrv_op_block_all(top_bs, s->blocker);
+bdrv_op_unblock(top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
+job = backup_job_create(NULL, s->secondary_disk->bs,
+s->hidden_disk->bs, 0,
+MIRROR_SYNC_MODE_NONE, NULL, false,
  BLOCKDEV_ON_ERROR_REPORT,
  BLOCKDEV_ON_ERROR_REPORT, BLOCK_JOB_INTERNAL,
  backup_job_completed, bs, NULL, _err);


Coding style here.


-if (local_err) {
-error_propagate(errp, local_err);
-backup_job_cleanup(bs);
-aio_context_release(aio_context);
-return;
+if (local_err) {
+error_propagate(errp, local_err);
+backup_job_cleanup(bs);
+aio_context_release(aio_context);
+return;
+}
+block_job_start(job);
  }
-block_job_start(job);

  secondary_do_checkpoint(s, errp);
  break;
@@ -575,14 +582,16 @@ static void replication_do_checkpoint(ReplicationState 
*rs, Error **errp)
  case REPLICATION_MODE_PRIMARY:
  break;
  case REPLICATION_MODE_SECONDARY:
-if (!s->secondary_disk->bs->job) {
-error_setg(errp, "Backup job was cancelled unexpectedly");
-break;
-}
-backup_do_checkpoint(s->secondary_disk->bs->job, _err);
-if (local_err) {
-error_propagate(errp, local_err);
-break;
+if (!s->is_shared_disk) {
+if (!s->secondary_disk->bs->job) {
+error_setg(errp, "Backup job was cancelled unexpectedly");
+break;
+}
+backup_do_checkpoint(s->secondary_disk->bs->job, _err);
+if (local_err) {
+error_propagate(errp, local_err);
+break;
+}
  }
  secondary_do_checkpoint(s, errp);
  break;
@@ -663,7 +672,7 @@ static void replication_stop(ReplicationState *rs, bool 
failover, Error **errp)
   * before the BDS is closed, because we will access hidden
   * disk, secondary disk in backup_job_completed().
   */
-if (s->secondary_disk->bs->job) {
+if (!s->is_shared_disk && s->secondary_disk->bs->job) {
  block_job_cancel_sync(s->secondary_disk->bs->job);
  }








Re: [Qemu-block] [PATCH RFC v2 3/6] replication: Split out backup_do_checkpoint() from secondary_do_checkpoint()

2016-12-20 Thread Changlong Xie

On 12/05/2016 04:35 PM, zhanghailiang wrote:

The helper backup_do_checkpoint() will be used for primary related
codes. Here we split it out from secondary_do_checkpoint().

Besides, it is unnecessary to call backup_do_checkpoint() in
replication starting and normally stop replication path.
We only need call it while do real checkpointing.

Signed-off-by: zhanghailiang 
---
  block/replication.c | 36 +++-
  1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index e87ae87..35e9ab3 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -332,20 +332,8 @@ static bool 
replication_recurse_is_first_non_filter(BlockDriverState *bs,

  static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
  {
-Error *local_err = NULL;
  int ret;

-if (!s->secondary_disk->bs->job) {
-error_setg(errp, "Backup job was cancelled unexpectedly");
-return;
-}
-
-backup_do_checkpoint(s->secondary_disk->bs->job, _err);
-if (local_err) {
-error_propagate(errp, local_err);
-return;
-}
-
  ret = s->active_disk->bs->drv->bdrv_make_empty(s->active_disk->bs);
  if (ret < 0) {
  error_setg(errp, "Cannot make active disk empty");
@@ -558,6 +546,8 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
  return;
  }
  block_job_start(job);
+
+secondary_do_checkpoint(s, errp);
  break;
  default:
  aio_context_release(aio_context);
@@ -566,10 +556,6 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,

  s->replication_state = BLOCK_REPLICATION_RUNNING;

-if (s->mode == REPLICATION_MODE_SECONDARY) {
-secondary_do_checkpoint(s, errp);
-}
-
  s->error = 0;
  aio_context_release(aio_context);
  }
@@ -579,13 +565,29 @@ static void replication_do_checkpoint(ReplicationState 
*rs, Error **errp)
  BlockDriverState *bs = rs->opaque;
  BDRVReplicationState *s;
  AioContext *aio_context;
+Error *local_err = NULL;

  aio_context = bdrv_get_aio_context(bs);
  aio_context_acquire(aio_context);
  s = bs->opaque;

-if (s->mode == REPLICATION_MODE_SECONDARY) {
+switch (s->mode) {
+case REPLICATION_MODE_PRIMARY:
+break;
+case REPLICATION_MODE_SECONDARY:
+if (!s->secondary_disk->bs->job) {
+error_setg(errp, "Backup job was cancelled unexpectedly");
+break;
+}
+backup_do_checkpoint(s->secondary_disk->bs->job, _err);
+if (local_err) {
+error_propagate(errp, local_err);
+break;
+}
  secondary_do_checkpoint(s, errp);
+break;
+default:
+abort();
  }
  aio_context_release(aio_context);
  }



Looks good to me

Reviewed-by: Changlong Xie 





Re: [Qemu-block] [PATCH RFC v2 2/6] replication: add shared-disk and shared-disk-id options

2016-12-20 Thread Changlong Xie

On 12/05/2016 04:35 PM, zhanghailiang wrote:

diff --git a/qapi/block-core.json b/qapi/block-core.json
index c29bef7..52d7e0d 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2232,12 +2232,19 @@
  #  node who owns the replication node chain. Must not be given in
  #  primary mode.
  #
+# @shared-disk-id: #optional The id of shared disk while in replication mode.
+#
+# @shared-disk: #optional To indicate whether or not a disk is shared by
+#   primary VM and secondary VM.
+#


I think we need more detailed description here.

For @shared-disk, we can only both enable or disable it on both side.
For @shared-disk-id, it must/only be given when @shared-disk enable on 
Primary side.


More, you also need to perfect the replication_open() logic.


  # Since: 2.8
  ##
  { 'struct': 'BlockdevOptionsReplication',
'base': 'BlockdevOptionsGenericFormat',
'data': { 'mode': 'ReplicationMode',
-'*top-id': 'str' } }
+'*top-id': 'str',
+'*shared-disk-id': 'str',
+'*shared-disk': 'bool' } }

  ##
  # @NFSTransport
--






Re: [Qemu-block] [PATCH RFC v2 1/6] docs/block-replication: Add description for shared-disk case

2016-12-20 Thread Changlong Xie

On 12/05/2016 04:34 PM, zhanghailiang wrote:

Introuduce the scenario of shared-disk block replication
and how to use it.

Signed-off-by: zhanghailiang 
Signed-off-by: Wen Congyang 
Signed-off-by: Zhang Chen 
---
v2:
- fix some problems found by Changlong
---
  docs/block-replication.txt | 139 +++--
  1 file changed, 135 insertions(+), 4 deletions(-)

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
index 6bde673..fbfe005 100644
--- a/docs/block-replication.txt
+++ b/docs/block-replication.txt
@@ -24,7 +24,7 @@ only dropped at next checkpoint time. To reduce the network 
transportation
  effort during a vmstate checkpoint, the disk modification operations of
  the Primary disk are asynchronously forwarded to the Secondary node.

-== Workflow ==
+== Non-shared disk workflow ==
  The following is the image of block replication workflow:

  +--+++
@@ -57,7 +57,7 @@ The following is the image of block replication workflow:
  4) Secondary write requests will be buffered in the Disk buffer and it
 will overwrite the existing sector content in the buffer.

-== Architecture ==
+== Non-shared disk architecture ==
  We are going to implement block replication from many basic
  blocks that are already in QEMU.

@@ -106,6 +106,74 @@ any state that would otherwise be lost by the speculative 
write-through
  of the NBD server into the secondary disk. So before block replication,
  the primary disk and secondary disk should contain the same data.

+== Shared Disk Mode Workflow ==
+The following is the image of block replication workflow:
+
++--+++
+|Primary Write Requests||Secondary Write Requests|
++--+++
+  |   |
+  |  (4)
+  |   V
+  |  /-\
+  | (2)Forward and write through | |
+  | +--> | Disk Buffer |
+  | || |
+  | |\-/
+  | |(1)read   |
+  | |  |
+   (3)write   | |  | backing file
+  V |  |
+ +-+   |
+ | Shared Disk | <-+
+ +-+
+
+1) Primary writes will read original data and forward it to Secondary
+   QEMU.
+2) Before Primary write requests are written to Shared disk, the
+   original sector content will be read from Shared disk and
+   forwarded and buffered in the Disk buffer on the secondary site,
+   but it will not overwrite the existing sector content (it could be
+   from either "Secondary Write Requests" or previous COW of "Primary
+   Write Requests") in the Disk buffer.
+3) Primary write requests will be written to Shared disk.
+4) Secondary write requests will be buffered in the Disk buffer and it
+   will overwrite the existing sector content in the buffer.
+
+== Shared Disk Mode Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+ virtio-blk ||   
.--
+ /  ||   | 
Secondary
+/   ||   
'--
+   /|| 
virtio-blk
+  / || 
 |
+  | ||   
replication(5)
+  |NBD  >   NBD   (2)  
 |
+  |  client ||server ---> hidden disk <-- 
active disk(4)
+  | ^   ||  |
+  |  replication(1) ||  |
+  | |   ||  |
+  |   +-'   ||  |
+ (3)  |drive-backup sync=none   ||  |
+. |   +-+   ||  |
+Primary | | |   ||   backing|
+' | |   ||  |
+  V