[Qemu-block] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension

2016-12-05 Thread Eric Blake
While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
team discovered that it is useful if a server can advertise
whether an export is in a known-all-zeroes state at the time
the client connects.

Signed-off-by: Eric Blake 
---
 doc/proto.md | 5 +
 1 file changed, 5 insertions(+)

This replaces the following qemu patch attempt:
https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00357.html
which tried to add NBD_CMD_HAS_ZERO_INIT with poor semantics. The
semantics in this proposal should be much better.

Patch is to the merge of the master branch and the
extension-write-zeroes branch.  By the way, qemu 2.8 is due
to be released "real soon now", and implements NBD_CMD_WRITE_ZEROES,
so maybe it is time to consider promoting the extension-write-zeroes
branch into master.

diff --git a/doc/proto.md b/doc/proto.md
index afe71fc..7e4ec7f 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -697,6 +697,11 @@ The field has the following format:
   the export.
 - bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`: defined by the experimental
   `BLOCK_STATUS` 
[extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-blockstatus/doc/proto.md).
+- bit 10, `NBD_FLAG_INIT_ZEROES`: Indicates that the server guarantees
+  that at the time transmission phase begins, all offsets within the
+  export read as zero bytes.  Clients MAY use this information to
+  avoid writing to sections of the export that should still read as
+  zero after the client is done writing.

 Clients SHOULD ignore unknown flags.

-- 
2.9.3




[Qemu-block] [PULL for-2.8 3/3] qemu-doc: update gluster protocol usage guide

2016-12-05 Thread Jeff Cody
From: Prasanna Kumar Kalever 

Document:
1. The new debug and logfile options with their usages
2. New json format and its usage and
3. update "GlusterFS, Device URL Syntax" section in "Invocation"

Signed-off-by: Prasanna Kumar Kalever 
Reviewed-by: Eric Blake 
Signed-off-by: Jeff Cody 
---
 qemu-doc.texi   | 59 +++--
 qemu-options.hx | 25 ++--
 2 files changed, 68 insertions(+), 16 deletions(-)

diff --git a/qemu-doc.texi b/qemu-doc.texi
index 023c140..02cb39d 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -1041,35 +1041,55 @@ GlusterFS is an user space distributed file system.
 
 You can boot from the GlusterFS disk image with the command:
 @example
-qemu-system-x86_64 -drive 
file=gluster[+@var{transport}]://[@var{server}[:@var{port}]]/@var{volname}/@var{image}[?socket=...]
+URI:
+qemu-system-x86_64 -drive 
file=gluster[+@var{type}]://[@var{host}[:@var{port}]]/@var{volume}/@var{path}
+   [?socket=...][,file.debug=9][,file.logfile=...]
+
+JSON:
+qemu-system-x86_64 'json:@{"driver":"qcow2",
+   "file":@{"driver":"gluster",
+
"volume":"testvol","path":"a.img","debug":9,"logfile":"...",
+
"server":[@{"type":"tcp","host":"...","port":"..."@},
+  
@{"type":"unix","socket":"..."@}]@}@}'
 @end example
 
 @var{gluster} is the protocol.
 
-@var{transport} specifies the transport type used to connect to gluster
+@var{type} specifies the transport type used to connect to gluster
 management daemon (glusterd). Valid transport types are
-tcp, unix and rdma. If a transport type isn't specified, then tcp
-type is assumed.
+tcp and unix. In the URI form, if a transport type isn't specified,
+then tcp type is assumed.
 
-@var{server} specifies the server where the volume file specification for
-the given volume resides. This can be either hostname, ipv4 address
-or ipv6 address. ipv6 address needs to be within square brackets [ ].
-If transport type is unix, then @var{server} field should not be specified.
+@var{host} specifies the server where the volume file specification for
+the given volume resides. This can be either a hostname or an ipv4 address.
+If transport type is unix, then @var{host} field should not be specified.
 Instead @var{socket} field needs to be populated with the path to unix domain
 socket.
 
 @var{port} is the port number on which glusterd is listening. This is optional
-and if not specified, QEMU will send 0 which will make gluster to use the
-default port. If the transport type is unix, then @var{port} should not be
-specified.
+and if not specified, it defaults to port 24007. If the transport type is unix,
+then @var{port} should not be specified.
+
+@var{volume} is the name of the gluster volume which contains the disk image.
+
+@var{path} is the path to the actual disk image that resides on gluster volume.
+
+@var{debug} is the logging level of the gluster protocol driver. Debug levels
+are 0-9, with 9 being the most verbose, and 0 representing no debugging output.
+The default level is 4. The current logging levels defined in the gluster 
source
+are 0 - None, 1 - Emergency, 2 - Alert, 3 - Critical, 4 - Error, 5 - Warning,
+6 - Notice, 7 - Info, 8 - Debug, 9 - Trace
+
+@var{logfile} is a commandline option to mention log file path which helps in
+logging to the specified file and also help in persisting the gfapi logs. The
+default is stderr.
+
 
-@var{volname} is the name of the gluster volume which contains the disk image.
 
-@var{image} is the path to the actual disk image that resides on gluster 
volume.
 
 You can create a GlusterFS disk image with the command:
 @example
-qemu-img create gluster://@var{server}/@var{volname}/@var{image} @var{size}
+qemu-img create gluster://@var{host}/@var{volume}/@var{path} @var{size}
 @end example
 
 Examples
@@ -1082,6 +1102,17 @@ qemu-system-x86_64 -drive 
file=gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir
 qemu-system-x86_64 -drive 
file=gluster+tcp://server.domain.com:24007/testvol/dir/a.img
 qemu-system-x86_64 -drive 
file=gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
 qemu-system-x86_64 -drive file=gluster+rdma://1.2.3.4:24007/testvol/a.img
+qemu-system-x86_64 -drive 
file=gluster://1.2.3.4/testvol/a.img,file.debug=9,file.logfile=/var/log/qemu-gluster.log
+qemu-system-x86_64 'json:@{"driver":"qcow2",
+   "file":@{"driver":"gluster",
+"volume":"testvol","path":"a.img",
+
"debug":9,"logfile":"/var/log/qemu-gluster.log",
+
"server":[@{"type":"tcp","host":"1.2.3.4","port":24007@},
+  
@{"type":"unix","socket":"/var/run/glusterd.socket"@}]@}@}'
+qemu-system-x86_64 -drive 
driver=qcow2,file.driver=gluster,file.volume=tes

[Qemu-block] [PULL for-2.8 2/3] block/nfs: fix QMP to match debug option

2016-12-05 Thread Jeff Cody
From: Prasanna Kumar Kalever 

The QMP definition of BlockdevOptionsNfs:
{ 'struct': 'BlockdevOptionsNfs',
  'data': { 'server': 'NFSServer',
'path': 'str',
'*user': 'int',
'*group': 'int',
'*tcp-syn-count': 'int',
'*readahead-size': 'int',
'*page-cache-size': 'int',
'*debug-level': 'int' } }

To make this consistent with other block protocols like gluster, lets
change s/debug-level/debug/

Suggested-by: Eric Blake 
Signed-off-by: Prasanna Kumar Kalever 
Reviewed-by: Eric Blake 
Signed-off-by: Jeff Cody 
---
 block/nfs.c  | 4 ++--
 qapi/block-core.json | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/nfs.c b/block/nfs.c
index d082783..a490660 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -134,7 +134,7 @@ static int nfs_parse_uri(const char *filename, QDict 
*options, Error **errp)
 qdict_put(options, "page-cache-size",
   qstring_from_str(qp->p[i].value));
 } else if (!strcmp(qp->p[i].name, "debug")) {
-qdict_put(options, "debug-level",
+qdict_put(options, "debug",
   qstring_from_str(qp->p[i].value));
 } else {
 error_setg(errp, "Unknown NFS parameter name: %s",
@@ -165,7 +165,7 @@ static bool nfs_has_filename_options_conflict(QDict 
*options, Error **errp)
 !strcmp(qe->key, "tcp-syn-count") ||
 !strcmp(qe->key, "readahead-size") ||
 !strcmp(qe->key, "page-cache-size") ||
-!strcmp(qe->key, "debug-level") ||
+!strcmp(qe->key, "debug") ||
 strstart(qe->key, "server.", NULL))
 {
 error_setg(errp, "Option %s cannot be used with a filename",
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 03b19f1..f22ed2a 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2292,7 +2292,7 @@
 # @page-cache-size: #optional set the pagecache size in bytes (defaults
 #   to libnfs default)
 #
-# @debug-level: #optional set the NFS debug level (max 2) (defaults
+# @debug:   #optional set the NFS debug level (max 2) (defaults
 #   to libnfs default)
 #
 # Since 2.8
@@ -2305,7 +2305,7 @@
 '*tcp-syn-count': 'int',
 '*readahead-size': 'int',
 '*page-cache-size': 'int',
-'*debug-level': 'int' } }
+'*debug': 'int' } }
 
 ##
 # @BlockdevOptionsCurl
-- 
2.7.4




[Qemu-block] [PULL for-2.8 1/3] block/gluster: fix QMP to match debug option

2016-12-05 Thread Jeff Cody
From: Prasanna Kumar Kalever 

The QMP definition of BlockdevOptionsGluster:
{ 'struct': 'BlockdevOptionsGluster',
  'data': { 'volume': 'str',
'path': 'str',
'server': ['GlusterServer'],
'*debug-level': 'int',
'*logfile': 'str' } }

But instead of 'debug-level we have exported 'debug' as the option for choosing
debug level of gluster protocol driver.

This patch fix QMP definition BlockdevOptionsGluster
s/debug-level/debug/

Suggested-by: Eric Blake 
Signed-off-by: Prasanna Kumar Kalever 
Reviewed-by: Eric Blake 
Signed-off-by: Jeff Cody 
---
 block/gluster.c  | 38 +++---
 qapi/block-core.json |  4 ++--
 2 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index 891c13b..a0a74e4 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -48,7 +48,7 @@ typedef struct BDRVGlusterState {
 struct glfs_fd *fd;
 char *logfile;
 bool supports_seek_data;
-int debug_level;
+int debug;
 } BDRVGlusterState;
 
 typedef struct BDRVGlusterReopenState {
@@ -434,7 +434,7 @@ static struct glfs 
*qemu_gluster_glfs_init(BlockdevOptionsGluster *gconf,
 }
 }
 
-ret = glfs_set_logging(glfs, gconf->logfile, gconf->debug_level);
+ret = glfs_set_logging(glfs, gconf->logfile, gconf->debug);
 if (ret < 0) {
 goto out;
 }
@@ -788,17 +788,17 @@ static int qemu_gluster_open(BlockDriverState *bs,  QDict 
*options,
 
 filename = qemu_opt_get(opts, GLUSTER_OPT_FILENAME);
 
-s->debug_level = qemu_opt_get_number(opts, GLUSTER_OPT_DEBUG,
- GLUSTER_DEBUG_DEFAULT);
-if (s->debug_level < 0) {
-s->debug_level = 0;
-} else if (s->debug_level > GLUSTER_DEBUG_MAX) {
-s->debug_level = GLUSTER_DEBUG_MAX;
+s->debug = qemu_opt_get_number(opts, GLUSTER_OPT_DEBUG,
+   GLUSTER_DEBUG_DEFAULT);
+if (s->debug < 0) {
+s->debug = 0;
+} else if (s->debug > GLUSTER_DEBUG_MAX) {
+s->debug = GLUSTER_DEBUG_MAX;
 }
 
 gconf = g_new0(BlockdevOptionsGluster, 1);
-gconf->debug_level = s->debug_level;
-gconf->has_debug_level = true;
+gconf->debug = s->debug;
+gconf->has_debug = true;
 
 logfile = qemu_opt_get(opts, GLUSTER_OPT_LOGFILE);
 s->logfile = g_strdup(logfile ? logfile : GLUSTER_LOGFILE_DEFAULT);
@@ -874,8 +874,8 @@ static int qemu_gluster_reopen_prepare(BDRVReopenState 
*state,
 qemu_gluster_parse_flags(state->flags, &open_flags);
 
 gconf = g_new0(BlockdevOptionsGluster, 1);
-gconf->debug_level = s->debug_level;
-gconf->has_debug_level = true;
+gconf->debug = s->debug;
+gconf->has_debug = true;
 gconf->logfile = g_strdup(s->logfile);
 gconf->has_logfile = true;
 reop_s->glfs = qemu_gluster_init(gconf, state->bs->filename, NULL, errp);
@@ -1011,14 +1011,14 @@ static int qemu_gluster_create(const char *filename,
 char *tmp = NULL;
 
 gconf = g_new0(BlockdevOptionsGluster, 1);
-gconf->debug_level = qemu_opt_get_number_del(opts, GLUSTER_OPT_DEBUG,
- GLUSTER_DEBUG_DEFAULT);
-if (gconf->debug_level < 0) {
-gconf->debug_level = 0;
-} else if (gconf->debug_level > GLUSTER_DEBUG_MAX) {
-gconf->debug_level = GLUSTER_DEBUG_MAX;
+gconf->debug = qemu_opt_get_number_del(opts, GLUSTER_OPT_DEBUG,
+   GLUSTER_DEBUG_DEFAULT);
+if (gconf->debug < 0) {
+gconf->debug = 0;
+} else if (gconf->debug > GLUSTER_DEBUG_MAX) {
+gconf->debug = GLUSTER_DEBUG_MAX;
 }
-gconf->has_debug_level = true;
+gconf->has_debug = true;
 
 gconf->logfile = qemu_opt_get_del(opts, GLUSTER_OPT_LOGFILE);
 if (!gconf->logfile) {
diff --git a/qapi/block-core.json b/qapi/block-core.json
index c29bef7..03b19f1 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2195,7 +2195,7 @@
 #
 # @server:  gluster servers description
 #
-# @debug-level: #optional libgfapi log level (default '4' which is Error)
+# @debug:   #optional libgfapi log level (default '4' which is Error)
 #
 # @logfile: #optional libgfapi log file (default /dev/stderr) (Since 2.8)
 #
@@ -2205,7 +2205,7 @@
   'data': { 'volume': 'str',
 'path': 'str',
 'server': ['GlusterServer'],
-'*debug-level': 'int',
+'*debug': 'int',
 '*logfile': 'str' } }
 
 ##
-- 
2.7.4




[Qemu-block] [PULL for-2.8 0/3] Block patches for -rc3

2016-12-05 Thread Jeff Cody
The following changes since commit bc66cedb4141fb7588f2462c74310d8fb5dd4cf1:

  Merge remote-tracking branch 'yongbok/tags/mips-20161204' into staging 
(2016-12-05 10:56:45 +)

are available in the git repository at:

  https://github.com/codyprime/qemu-kvm-jtc.git tags/block-pull-request

for you to fetch changes up to 76b5550f709b975a7b04fb4c887f300b7bb731c2:

  qemu-doc: update gluster protocol usage guide (2016-12-05 16:30:29 -0500)


Gluster block patches for -rc3


Prasanna Kumar Kalever (3):
  block/gluster: fix QMP to match debug option
  block/nfs: fix QMP to match debug option
  qemu-doc: update gluster protocol usage guide

 block/gluster.c  | 38 -
 block/nfs.c  |  4 ++--
 qapi/block-core.json |  8 +++
 qemu-doc.texi| 59 +++-
 qemu-options.hx  | 25 --
 5 files changed, 93 insertions(+), 41 deletions(-)

-- 
2.7.4




Re: [Qemu-block] [Qemu-devel] [PATCH for-2.8 v2] qcow2: Don't strand clusters near 2G intervals during commit

2016-12-05 Thread John Snow


On 12/05/2016 10:49 AM, Eric Blake wrote:
> The qcow2_make_empty() function is reached during 'qemu-img commit',
> in order to clear out ALL clusters of an image.  However, if the
> image cannot use the fast code path (true if the image is format
> 0.10, or if the image contains a snapshot), the cluster size is
> larger than 512, and the image is larger than 2G in size, then our
> choice of sector_step causes problems.  Since it is not cluster
> aligned, but qcow2_discard_clusters() silently ignores an unaligned
> head or tail, we are leaving clusters allocated.
> 
> Enhance the testsuite to expose the flaw, and patch the problem by
> ensuring our step size is aligned.
> 
> Signed-off-by: Eric Blake 
> 
> ---
> v2: perform rounding correctly
> ---
>  block/qcow2.c  |   3 +-
>  tests/qemu-iotests/097 |  41 +---
>  tests/qemu-iotests/097.out | 249 
> +
>  3 files changed, 210 insertions(+), 83 deletions(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index ed9e0f3..96fb8a8 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2808,7 +2808,8 @@ static int qcow2_make_empty(BlockDriverState *bs)
>  {
>  BDRVQcow2State *s = bs->opaque;
>  uint64_t start_sector;
> -int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
> +int sector_step = (QEMU_ALIGN_DOWN(INT_MAX, s->cluster_size) /
> +   BDRV_SECTOR_SIZE);
>  int l1_clusters, ret = 0;
> 
>  l1_clusters = DIV_ROUND_UP(s->l1_size, s->cluster_size / 
> sizeof(uint64_t));
> diff --git a/tests/qemu-iotests/097 b/tests/qemu-iotests/097
> index 01d8dd0..4c33e80 100755
> --- a/tests/qemu-iotests/097
> +++ b/tests/qemu-iotests/097
> @@ -46,7 +46,7 @@ _supported_proto file
>  _supported_os Linux
> 
> 
> -# Four passes:
> +# Four main passes:
>  #  0: Two-layer backing chain, commit to upper backing file (implicitly)
>  # (in this case, the top image will be emptied)
>  #  1: Two-layer backing chain, commit to upper backing file (explicitly)
> @@ -56,22 +56,30 @@ _supported_os Linux
>  #  3: Two-layer backing chain, commit to lower backing file
>  # (in this case, the top image will implicitly stay unchanged)
>  #
> +# Each pass is run twice, since qcow2 has different code paths for cleaning
> +# an image depending on whether it has a snapshot.
> +#
>  # 020 already tests committing, so this only tests whether image chains are
>  # working properly and that all images above the base are emptied; therefore,
> -# no complicated patterns are necessary
> +# no complicated patterns are necessary.  Check near the 2G mark, as qcow2
> +# has been buggy at that boundary in the past.
>  for i in 0 1 2 3; do
> +for j in 0 1; do
> 
>  echo
> -echo "=== Test pass $i ==="
> +echo "=== Test pass $i.$j ==="
>  echo
> 
> -TEST_IMG="$TEST_IMG.base" _make_test_img 64M
> -TEST_IMG="$TEST_IMG.itmd" _make_test_img -b "$TEST_IMG.base" 64M
> -_make_test_img -b "$TEST_IMG.itmd" 64M
> +TEST_IMG="$TEST_IMG.base" _make_test_img 2100M
> +TEST_IMG="$TEST_IMG.itmd" _make_test_img -b "$TEST_IMG.base" 2100M
> +_make_test_img -b "$TEST_IMG.itmd" 2100M
> +if [ $j -eq 0 ]; then
> +$QEMU_IMG snapshot -c snap "$TEST_IMG"
> +fi
> 
> -$QEMU_IO -c 'write -P 1 0 192k' "$TEST_IMG.base" | _filter_qemu_io
> -$QEMU_IO -c 'write -P 2 64k 128k' "$TEST_IMG.itmd" | _filter_qemu_io
> -$QEMU_IO -c 'write -P 3 128k 64k' "$TEST_IMG" | _filter_qemu_io
> +$QEMU_IO -c 'write -P 1 0x7ffd 192k' "$TEST_IMG.base" | _filter_qemu_io
> +$QEMU_IO -c 'write -P 2 0x7ffe 128k' "$TEST_IMG.itmd" | _filter_qemu_io
> +$QEMU_IO -c 'write -P 3 0x7fff 64k' "$TEST_IMG" | _filter_qemu_io
> 
>  if [ $i -lt 3 ]; then
>  if [ $i == 0 ]; then
> @@ -88,12 +96,12 @@ if [ $i -lt 3 ]; then
>  fi
> 
>  # Bottom should be unchanged
> -$QEMU_IO -c 'read -P 1 0 192k' "$TEST_IMG.base" | _filter_qemu_io
> +$QEMU_IO -c 'read -P 1 0x7ffd 192k' "$TEST_IMG.base" | 
> _filter_qemu_io
> 
>  # Intermediate should contain changes from top
> -$QEMU_IO -c 'read -P 1 0 64k' "$TEST_IMG.itmd" | _filter_qemu_io
> -$QEMU_IO -c 'read -P 2 64k 64k' "$TEST_IMG.itmd" | _filter_qemu_io
> -$QEMU_IO -c 'read -P 3 128k 64k' "$TEST_IMG.itmd" | _filter_qemu_io
> +$QEMU_IO -c 'read -P 1 0x7ffd 64k' "$TEST_IMG.itmd" | _filter_qemu_io
> +$QEMU_IO -c 'read -P 2 0x7ffe 64k' "$TEST_IMG.itmd" | _filter_qemu_io
> +$QEMU_IO -c 'read -P 3 0x7fff 64k' "$TEST_IMG.itmd" | _filter_qemu_io
> 
>  # And in pass 0, the top image should be empty, whereas in both other 
> passes
>  # it should be unchanged (which is both checked by qemu-img map)
> @@ -101,9 +109,9 @@ else
>  $QEMU_IMG commit -b "$TEST_IMG.base" "$TEST_IMG"
> 
>  # Bottom should contain all changes
> -$QEMU_IO -c 'read -P 1 0 64k' "$TEST_IMG.base" | _filter_qemu_io
> -$QEMU_IO -c 'read -P 2 64k 64k' "$TEST_IMG.base" | _filter_qemu_io
> -$QEMU_IO -c 'read -P 3 128k 64k' "$TEST_IMG.base" | _filter_qemu_io
> +$QEMU_IO

Re: [Qemu-block] [Qemu-devel] [PATCH RFC v2 2/6] replication: add shared-disk and shared-disk-id options

2016-12-05 Thread Eric Blake
On 12/05/2016 02:35 AM, zhanghailiang wrote:
> We use these two options to identify which disk is
> shared
> 
> Signed-off-by: zhanghailiang 
> Signed-off-by: Wen Congyang 
> Signed-off-by: Zhang Chen 
> ---
> v2:
> - add these two options for BlockdevOptionsReplication to support
>   qmp blockdev-add command.
> - fix a memory leak found by Changlong
> ---

> +++ b/qapi/block-core.json
> @@ -2232,12 +2232,19 @@
>  #  node who owns the replication node chain. Must not be given in
>  #  primary mode.
>  #
> +# @shared-disk-id: #optional The id of shared disk while in replication mode.
> +#
> +# @shared-disk: #optional To indicate whether or not a disk is shared by
> +#   primary VM and secondary VM.

Both of these need a '(since 2.9)' designation since they've missed 2.8,
and could probably also use documentation on the default value assumed
when the parameter is omitted.

> +#
>  # Since: 2.8
>  ##
>  { 'struct': 'BlockdevOptionsReplication',
>'base': 'BlockdevOptionsGenericFormat',
>'data': { 'mode': 'ReplicationMode',
> -'*top-id': 'str' } }
> +'*top-id': 'str',
> +'*shared-disk-id': 'str',
> +'*shared-disk': 'bool' } }
>  
>  ##
>  # @NFSTransport
> 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


[Qemu-block] [PATCH for-2.8 v2] qcow2: Don't strand clusters near 2G intervals during commit

2016-12-05 Thread Eric Blake
The qcow2_make_empty() function is reached during 'qemu-img commit',
in order to clear out ALL clusters of an image.  However, if the
image cannot use the fast code path (true if the image is format
0.10, or if the image contains a snapshot), the cluster size is
larger than 512, and the image is larger than 2G in size, then our
choice of sector_step causes problems.  Since it is not cluster
aligned, but qcow2_discard_clusters() silently ignores an unaligned
head or tail, we are leaving clusters allocated.

Enhance the testsuite to expose the flaw, and patch the problem by
ensuring our step size is aligned.

Signed-off-by: Eric Blake 

---
v2: perform rounding correctly
---
 block/qcow2.c  |   3 +-
 tests/qemu-iotests/097 |  41 +---
 tests/qemu-iotests/097.out | 249 +
 3 files changed, 210 insertions(+), 83 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index ed9e0f3..96fb8a8 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2808,7 +2808,8 @@ static int qcow2_make_empty(BlockDriverState *bs)
 {
 BDRVQcow2State *s = bs->opaque;
 uint64_t start_sector;
-int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
+int sector_step = (QEMU_ALIGN_DOWN(INT_MAX, s->cluster_size) /
+   BDRV_SECTOR_SIZE);
 int l1_clusters, ret = 0;

 l1_clusters = DIV_ROUND_UP(s->l1_size, s->cluster_size / sizeof(uint64_t));
diff --git a/tests/qemu-iotests/097 b/tests/qemu-iotests/097
index 01d8dd0..4c33e80 100755
--- a/tests/qemu-iotests/097
+++ b/tests/qemu-iotests/097
@@ -46,7 +46,7 @@ _supported_proto file
 _supported_os Linux


-# Four passes:
+# Four main passes:
 #  0: Two-layer backing chain, commit to upper backing file (implicitly)
 # (in this case, the top image will be emptied)
 #  1: Two-layer backing chain, commit to upper backing file (explicitly)
@@ -56,22 +56,30 @@ _supported_os Linux
 #  3: Two-layer backing chain, commit to lower backing file
 # (in this case, the top image will implicitly stay unchanged)
 #
+# Each pass is run twice, since qcow2 has different code paths for cleaning
+# an image depending on whether it has a snapshot.
+#
 # 020 already tests committing, so this only tests whether image chains are
 # working properly and that all images above the base are emptied; therefore,
-# no complicated patterns are necessary
+# no complicated patterns are necessary.  Check near the 2G mark, as qcow2
+# has been buggy at that boundary in the past.
 for i in 0 1 2 3; do
+for j in 0 1; do

 echo
-echo "=== Test pass $i ==="
+echo "=== Test pass $i.$j ==="
 echo

-TEST_IMG="$TEST_IMG.base" _make_test_img 64M
-TEST_IMG="$TEST_IMG.itmd" _make_test_img -b "$TEST_IMG.base" 64M
-_make_test_img -b "$TEST_IMG.itmd" 64M
+TEST_IMG="$TEST_IMG.base" _make_test_img 2100M
+TEST_IMG="$TEST_IMG.itmd" _make_test_img -b "$TEST_IMG.base" 2100M
+_make_test_img -b "$TEST_IMG.itmd" 2100M
+if [ $j -eq 0 ]; then
+$QEMU_IMG snapshot -c snap "$TEST_IMG"
+fi

-$QEMU_IO -c 'write -P 1 0 192k' "$TEST_IMG.base" | _filter_qemu_io
-$QEMU_IO -c 'write -P 2 64k 128k' "$TEST_IMG.itmd" | _filter_qemu_io
-$QEMU_IO -c 'write -P 3 128k 64k' "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c 'write -P 1 0x7ffd 192k' "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c 'write -P 2 0x7ffe 128k' "$TEST_IMG.itmd" | _filter_qemu_io
+$QEMU_IO -c 'write -P 3 0x7fff 64k' "$TEST_IMG" | _filter_qemu_io

 if [ $i -lt 3 ]; then
 if [ $i == 0 ]; then
@@ -88,12 +96,12 @@ if [ $i -lt 3 ]; then
 fi

 # Bottom should be unchanged
-$QEMU_IO -c 'read -P 1 0 192k' "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c 'read -P 1 0x7ffd 192k' "$TEST_IMG.base" | _filter_qemu_io

 # Intermediate should contain changes from top
-$QEMU_IO -c 'read -P 1 0 64k' "$TEST_IMG.itmd" | _filter_qemu_io
-$QEMU_IO -c 'read -P 2 64k 64k' "$TEST_IMG.itmd" | _filter_qemu_io
-$QEMU_IO -c 'read -P 3 128k 64k' "$TEST_IMG.itmd" | _filter_qemu_io
+$QEMU_IO -c 'read -P 1 0x7ffd 64k' "$TEST_IMG.itmd" | _filter_qemu_io
+$QEMU_IO -c 'read -P 2 0x7ffe 64k' "$TEST_IMG.itmd" | _filter_qemu_io
+$QEMU_IO -c 'read -P 3 0x7fff 64k' "$TEST_IMG.itmd" | _filter_qemu_io

 # And in pass 0, the top image should be empty, whereas in both other 
passes
 # it should be unchanged (which is both checked by qemu-img map)
@@ -101,9 +109,9 @@ else
 $QEMU_IMG commit -b "$TEST_IMG.base" "$TEST_IMG"

 # Bottom should contain all changes
-$QEMU_IO -c 'read -P 1 0 64k' "$TEST_IMG.base" | _filter_qemu_io
-$QEMU_IO -c 'read -P 2 64k 64k' "$TEST_IMG.base" | _filter_qemu_io
-$QEMU_IO -c 'read -P 3 128k 64k' "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c 'read -P 1 0x7ffd 64k' "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c 'read -P 2 0x7ffe 64k' "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c 'read -P 3 0x7fff 64k' "$TEST_IMG.base" | _filter_qemu_io

 # Both top and intermediate should be unchanged
 fi

Re: [Qemu-block] [Qemu-devel] [PATCH for-2.8] qcow2: Don't strand clusters near 2G intervals during commit

2016-12-05 Thread Eric Blake
On 12/03/2016 11:34 AM, Eric Blake wrote:
> The qcow2_make_empty() function is reached during 'qemu-img commit',
> in order to clear out ALL clusters of an image.  However, if the
> image cannot use the fast code path (true if the image is format
> 0.10, or if the image contains a snapshot), the cluster size is
> larger than 512, and the image is larger than 2G in size, then our
> choice of sector_step causes problems.  Since it is not cluster
> aligned, but qcow2_discard_clusters() silently ignores an unaligned
> head or tail, we are leaving clusters allocated.
> 
> Enhance the testsuite to expose the flaw, and patch the problem by
> ensuring our step size is aligned.
> 
> [qcow2_discard_clusters() is a GROSS interface: it takes a mix of
> byte offset and sector count to perform cluster operations. But
> fixing it to use a saner byte/byte rather than byte/sector interface,
> and/or asserting that the counts are now aligned thanks to both
> this patch and commit 3482b9b, is material for another day.]
> 
> Signed-off-by: Eric Blake 
> ---
>  block/qcow2.c  |   3 +-
>  tests/qemu-iotests/097 |  41 +---
>  tests/qemu-iotests/097.out | 249 
> +
>  3 files changed, 210 insertions(+), 83 deletions(-)

> +++ b/block/qcow2.c
> @@ -2808,7 +2808,8 @@ static int qcow2_make_empty(BlockDriverState *bs)
>  {
>  BDRVQcow2State *s = bs->opaque;
>  uint64_t start_sector;
> -int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
> +int sector_step = QEMU_ALIGN_DOWN(INT_MAX / BDRV_SECTOR_SIZE,
> +  s->cluster_size);

Oh shoot. I got the units wrong, and made the slow path do more loop
iterations than necessary (rounding sectors to cluster size in bytes is
inappropriate - either the rounding has to occur before division, or the
rounding needs to be by secotrs instead of bytes).  I'll send a v2 that
gets the math right.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v5 7/9] block: don't make snapshots for filters

2016-12-05 Thread Kevin Wolf
Am 05.12.2016 um 12:49 hat Pavel Dovgalyuk geschrieben:
> > From: Kevin Wolf [mailto:kw...@redhat.com]
> > Am 05.12.2016 um 08:43 hat Pavel Dovgalyuk geschrieben:
> > > Paolo,
> > >
> > > > From: Kevin Wolf [mailto:kw...@redhat.com]
> > > > Am 21.11.2016 um 13:22 hat Pavel Dovgalyuk geschrieben:
> > > > > > From: Kevin Wolf [mailto:kw...@redhat.com]
> > > > > > Am 16.11.2016 um 10:49 hat Pavel Dovgalyuk geschrieben:
> > > > > >
> > > > > > How does that bypass blkreplay? blk->root is supposed to be the 
> > > > > > blkreply
> > > > > > node, do you see something different? If it were the qcow2 node, 
> > > > > > then I
> > > > > > would expect that no requests at all go through the blkreplay layer.
> > > > >
> > > > > It seems, that the problem is in -snapshot option.
> > > > > We have one of the following block layers depending on command line:
> > > > >  tmp_overlay1 -> blkreplay -> tmp_overlay2 -> disk_image
> > > > >  tmp_overlay1 -> blkreplay -> disk_image
> > > > >
> > > > > But the correct scheme is intended to be the following:
> > > > >  blkreplay -> tmp_overlay1 -> disk_image
> > > > >
> > > > > How can we fix it?
> > > > > Maybe we should add blkreplay automatically to all block devices and 
> > > > > not
> > > > > to specify it in the command line?
> > > >
> > > > I think you found a pretty fundamental design problem with "global"
> > > > drive options that add a filter node such as -snapshot and replay mode
> > > > (replay mode isn't one of them today, but your suggestion to make it
> > > > automatic would turn it into one).
> > > >
> > > > At the core of the problem I think we have two questions:
> > > >
> > > > 1. Which block nodes should be affected and get a filter node added, and
> > > >which nodes shouldn't get one? In your case, disl_image is defined
> > > >with a -drive option, but shouldn't get the snapshot.
> > > >
> > > > 2. In which order should filter nodes be added?
> > > >
> > > > Both of these questions feel hard. As long as we haven't thought through
> > > > the concept as such (rather than discussing one-off hacks) and we're not
> > > > completely certain what the right answer to the questions is, we
> > > > shouldn't add more automatic filter nodes, because chances are that we
> > > > get it wrong and would regret it.
> > > >
> > > > The obvious answer for a workaround would be: Make everything manual,
> > > > i.e. don't use -snapshot, but create a qcow2 overlay manually.
> > >
> > > What about to switching to manual overlay creation by default?
> > > We can make rrsnapshot option mandatory.
> > > Therefore user will have to create snapshot in image or overlay and
> > > the disk image will not be corrupted.
> > >
> > > It is not very convenient, but we could disable rrsnapshot again when
> > > the solution for -snapshot will be found.
> > 
> > Hm, what is this rrsnapshot option? git grep can't find it.
> 
> It was a patch that was not included yet.
> This option creates/loads vm snapshot in record/replay mode
> leaving original disk image unchanged.

You mean internal qcow2 snapshots? Ok.

> Record/replay without this option uses '-snapshot' to preserve
> the state of the disk images.
> 
> > Anyway, it seems that doing things manually is the safe way as long as
> > we don't know the final solution, so I think I agree.
> > 
> > For a slightly more convenient way, one of the problems to solve seems
> > to be that snapshot=on always affects the top level node and you can't
> > create a temporary snapshot in the middle of the chain. Perhaps we
> > should introduce a 'temporary-overlay' driver or something like that, so
> > that you could specify things like this:
> > 
> > -drive if=none,driver=file,filename=test.img,id=orig
> > -drive if=none,driver=temporary-overlay,file=orig,id=snap
> > -drive if=none,driver=blkreplay,image=snap
> 
> This seems reasonable for manual way.

Maybe another, easier to implement option could be something like this:

-drive 
if=none,driver=file,filename=test.img,snapshot=on,overlay.node-name=snap
-drive if=none,driver=blkreplay,image=snap

It would require that we implement support for overlay.* options like we
already support backing.* options. Allowing to specify options for the
overlay node is probably nice to have anyway.

However, there could be a few tricky parts there. For example, what
happens if someone uses overlay.backing=something-else? Perhaps
completely disallowing backing and backing.* for overlays would already
solve this.

> > Which makes me wonder... Is blkreplay usable without the temporary
> > snapshot or is this pretty much a requirement? 
> 
> It's not a requirement. But to make replay deterministic we have to
> start with the same image every time. As I know, this may be achieved by:
> 1. Restoring original disk image manually
> 2. Using vm snapshot to start execution from
> 3. Using -snapshot option
> 4. Not using disks at all
> 
> > Because if it has to be
> > there, the next step could be that blk

[Qemu-block] Meeting notes on -blockdev, dynamic backend reconfiguration

2016-12-05 Thread Markus Armbruster
I recently met Kevin, and we discussed two block layer topics in some
depth.

= -blockdev =

We want a command line option to mirror QMP blockdev-add for 2.9.
QemuOpts has to grow from "list of (key, simple value) plus conventions
to support lists of simple values in limited ways" to the expressive
power of JSON.

== Basic idea ==

QMP pipeline: JSON string - JSON parser - QObject - QObject input
visitor - QAPI object.  For commands with with 'gen': false, we stop at
QObject.  These are rare.

Command line now: option argument string - QemuOpts parser - QemuOpts.

We occasionally continue - options or string input visitor - QAPI
object.  Both visitors can't do arbitrary QAPI objects.  Both visitors
extend QemuOpts syntax.

Daniel Berrange posted patches to instead do - crumple - QObject -
qobject input visitor - QAPI object.  Arbitrary QObjects (thus QAPI
objects) are possible with dotted key convention, which is already used
by block layer.

As before, a visitor sits on top of QemuOpts, providing syntax
extensions.  Stacking parsers like that is not a good idea.  We want
*one* option argument parser, and we need it to yield a QObject.

== Backward compatibility issues ==

* Traditional key=value,... syntax

* The "repeated key is list" hack

* Options and string input visitor syntax extensions

* Dotted key convention

Hopefully, most of the solutions can be adapted from Daniel's patches.

== Type ambguity ==

In JSON, the type of a value is syntactically obvious.  The JSON parser
yields QObject with these types.  The QObject input visitor rejects
values with types that don't match the QAPI schema.

In the traditional key=value command line syntax, the type of a value
isn't obvious.  Options and string input visitor convert the string
value to the type expected by the QAPI schema.

Unlike a QObject from JSON, a QObject from QemuOpts has only string
values, and the QObject input visitor needs to be able to convert
instead of reject.  Daniel's patches do that.

== Action item ==

Markus to explore the proposed solution as soon as possible.


= Dynamic block backend reconfiguration =

== Mirror job ==

State before the job:

frontend
|
   BB
|
   BDS

Frontend writes flow down.

Passive mirror job, as it currently works:

frontend   mirror-job
|   |  |
BB  BB'BB2
|  /   |
| /|
   BDSBDS2

The mirror job copies the contents of BDS to BDS2.  To handle frontend
writes, BDS maintains a dirty bitmap, which mirror-job uses to copy
updates from BB' to BB2.

Pivot to mirror on job completion: replace BB's child BDS by BDS2,
delete mirror-job and its BB', BB2.

frontend
|
BB
 \_
   \
   BDSBDS2


Future mirror job using a mirror-filter:

frontend   mirror-job
| |
BB   BB'
|/
mirror-filter
|\
   BDS  BDS2

Passive mirror-filter: maintains dirty bitmap, copies from BDS to BDS2.

Active mirror-filter: no dirty bitmap, mirrors writes to BDS2 directly.

Can easily switch from passive to active at any time.

Pivot: replace parent of mirror-filter's child mirror-filter by BDS2,
delete mirror job and its BB'.  "Parent of" in case other filters have
been inserted: we drop the ones below mirror-filter, and keep the ones
above.

== Backup job ==

Current backup job:

frontend   backup-job
|   |  |
BB  BB'BB2
|  /   |
| /|
   BDSBDS2

The backup job copies the contents of BDS to BDS2.  To handle frontend
writes, BDS provices a before-write-notifier, backup-job uses it to copy
old data from BB' to BB2 right before it's overwritten.

Pivot: delete backup-job and its BB', BB2.

frontend
|
BB
|
|
   BDSBDS2

Future backup job using a backup-filter:

frontend   backup-job
| |
BB   BB'
|/
backup-filter
|   \
   BDS  BDS2

backup-filter copies old data from BDS to BDS2 before it forwards write
to BDS.

Pivot: replace parent of backup-filter's child backup-filter by BDS2,
delete backup-job and its BB'.

== Commit job ==

State before the job:

   frontend
   |
   BB
   |
 QCOW2
file /   \ backing
/ \
   /   \
 BDS1 QCOW2_top
 file /   \ backing
 / .
  BDS_top   .
 \
  BDS_base

"file" and "backing" are the QCOW2 child names for the delta image and
the backing image, respectively.

Frontend writes flow to BDS1.

Current commit job to commit from QCOW2_top down to BDS_base:

   frontend
   |
   BB   commit-job
   | / \
 QCOW2   BB_top  BB_base
file /   \ b

Re: [Qemu-block] [PATCH v5 7/9] block: don't make snapshots for filters

2016-12-05 Thread Pavel Dovgalyuk
> From: Kevin Wolf [mailto:kw...@redhat.com]
> Am 05.12.2016 um 08:43 hat Pavel Dovgalyuk geschrieben:
> > Paolo,
> >
> > > From: Kevin Wolf [mailto:kw...@redhat.com]
> > > Am 21.11.2016 um 13:22 hat Pavel Dovgalyuk geschrieben:
> > > > > From: Kevin Wolf [mailto:kw...@redhat.com]
> > > > > Am 16.11.2016 um 10:49 hat Pavel Dovgalyuk geschrieben:
> > > > >
> > > > > How does that bypass blkreplay? blk->root is supposed to be the 
> > > > > blkreply
> > > > > node, do you see something different? If it were the qcow2 node, then 
> > > > > I
> > > > > would expect that no requests at all go through the blkreplay layer.
> > > >
> > > > It seems, that the problem is in -snapshot option.
> > > > We have one of the following block layers depending on command line:
> > > >  tmp_overlay1 -> blkreplay -> tmp_overlay2 -> disk_image
> > > >  tmp_overlay1 -> blkreplay -> disk_image
> > > >
> > > > But the correct scheme is intended to be the following:
> > > >  blkreplay -> tmp_overlay1 -> disk_image
> > > >
> > > > How can we fix it?
> > > > Maybe we should add blkreplay automatically to all block devices and not
> > > > to specify it in the command line?
> > >
> > > I think you found a pretty fundamental design problem with "global"
> > > drive options that add a filter node such as -snapshot and replay mode
> > > (replay mode isn't one of them today, but your suggestion to make it
> > > automatic would turn it into one).
> > >
> > > At the core of the problem I think we have two questions:
> > >
> > > 1. Which block nodes should be affected and get a filter node added, and
> > >which nodes shouldn't get one? In your case, disl_image is defined
> > >with a -drive option, but shouldn't get the snapshot.
> > >
> > > 2. In which order should filter nodes be added?
> > >
> > > Both of these questions feel hard. As long as we haven't thought through
> > > the concept as such (rather than discussing one-off hacks) and we're not
> > > completely certain what the right answer to the questions is, we
> > > shouldn't add more automatic filter nodes, because chances are that we
> > > get it wrong and would regret it.
> > >
> > > The obvious answer for a workaround would be: Make everything manual,
> > > i.e. don't use -snapshot, but create a qcow2 overlay manually.
> >
> > What about to switching to manual overlay creation by default?
> > We can make rrsnapshot option mandatory.
> > Therefore user will have to create snapshot in image or overlay and
> > the disk image will not be corrupted.
> >
> > It is not very convenient, but we could disable rrsnapshot again when
> > the solution for -snapshot will be found.
> 
> Hm, what is this rrsnapshot option? git grep can't find it.

It was a patch that was not included yet.
This option creates/loads vm snapshot in record/replay mode
leaving original disk image unchanged.
Record/replay without this option uses '-snapshot' to preserve
the state of the disk images.

> Anyway, it seems that doing things manually is the safe way as long as
> we don't know the final solution, so I think I agree.
> 
> For a slightly more convenient way, one of the problems to solve seems
> to be that snapshot=on always affects the top level node and you can't
> create a temporary snapshot in the middle of the chain. Perhaps we
> should introduce a 'temporary-overlay' driver or something like that, so
> that you could specify things like this:
> 
> -drive if=none,driver=file,filename=test.img,id=orig
> -drive if=none,driver=temporary-overlay,file=orig,id=snap
> -drive if=none,driver=blkreplay,image=snap

This seems reasonable for manual way.

> Which makes me wonder... Is blkreplay usable without the temporary
> snapshot or is this pretty much a requirement? 

It's not a requirement. But to make replay deterministic we have to
start with the same image every time. As I know, this may be achieved by:
1. Restoring original disk image manually
2. Using vm snapshot to start execution from
3. Using -snapshot option
4. Not using disks at all

> Because if it has to be
> there, the next step could be that blkreplay creates temporary-overlay
> internally in its .bdrv_open().

Here is your answer about such an approach :)
https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg04687.html

Pavel Dovgalyuk




Re: [Qemu-block] [PATCH for-2.8] qcow2: Don't strand clusters near 2G intervals during commit

2016-12-05 Thread Stefan Hajnoczi
On Sat, Dec 03, 2016 at 11:34:02AM -0600, Eric Blake wrote:
> The qcow2_make_empty() function is reached during 'qemu-img commit',
> in order to clear out ALL clusters of an image.  However, if the
> image cannot use the fast code path (true if the image is format
> 0.10, or if the image contains a snapshot), the cluster size is
> larger than 512, and the image is larger than 2G in size, then our
> choice of sector_step causes problems.  Since it is not cluster
> aligned, but qcow2_discard_clusters() silently ignores an unaligned
> head or tail, we are leaving clusters allocated.
> 
> Enhance the testsuite to expose the flaw, and patch the problem by
> ensuring our step size is aligned.
> 
> [qcow2_discard_clusters() is a GROSS interface: it takes a mix of
> byte offset and sector count to perform cluster operations. But
> fixing it to use a saner byte/byte rather than byte/sector interface,
> and/or asserting that the counts are now aligned thanks to both
> this patch and commit 3482b9b, is material for another day.]
> 
> Signed-off-by: Eric Blake 
> ---
>  block/qcow2.c  |   3 +-
>  tests/qemu-iotests/097 |  41 +---
>  tests/qemu-iotests/097.out | 249 
> +
>  3 files changed, 210 insertions(+), 83 deletions(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [Qemu-block] [PATCH v3 0/2] less confusing block file names

2016-12-05 Thread Stefan Hajnoczi
On Fri, Dec 02, 2016 at 01:48:52PM -0600, Eric Blake wrote:
> no real change from v2 other than trivial rebasing and R-by:
> https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg07220.html
> 
> but it's been more than a month since the last activity, and the
> start of the release cycle is as good a time as any to avoid any
> potential churn on conflicts due to renaming files
> 
> applies on Kevin's block-next branch for 2.9
> 
> Eric Blake (2):
>   block: Rename raw_bsd to raw-format.c
>   block: Rename raw-{posix,win32} to file-*.c
> 
>  include/block/block_int.h   | 2 +-
>  block/{raw-posix.c => file-posix.c} | 0
>  block/{raw-win32.c => file-win32.c} | 0
>  block/gluster.c | 4 ++--
>  block/{raw_bsd.c => raw-format.c}   | 2 +-
>  MAINTAINERS | 6 +++---
>  block/Makefile.objs | 6 +++---
>  block/trace-events  | 4 ++--
>  configure   | 2 +-
>  9 files changed, 13 insertions(+), 13 deletions(-)
>  rename block/{raw-posix.c => file-posix.c} (100%)
>  rename block/{raw-win32.c => file-win32.c} (100%)
>  rename block/{raw_bsd.c => raw-format.c} (99%)
> 
> -- 
> 2.9.3
> 
> 

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [Qemu-block] [PATCH v5 7/9] block: don't make snapshots for filters

2016-12-05 Thread Kevin Wolf
Am 05.12.2016 um 08:43 hat Pavel Dovgalyuk geschrieben:
> Paolo,
> 
> > From: Kevin Wolf [mailto:kw...@redhat.com]
> > Am 21.11.2016 um 13:22 hat Pavel Dovgalyuk geschrieben:
> > > > From: Kevin Wolf [mailto:kw...@redhat.com]
> > > > Am 16.11.2016 um 10:49 hat Pavel Dovgalyuk geschrieben:
> > > > > > From: Pavel Dovgalyuk [mailto:dovga...@ispras.ru]
> > > > >
> > > > > I've investigated this issue.
> > > > > This command line works ok:
> > > > >  -drive 
> > > > > driver=blkreplay,if=none,image.driver=file,image.filename=testdisk.qcow,id=img-
> > > > blkreplay
> > > > >  -device ide-hd,drive=img-blkreplay
> > > > >
> > > > > And this does not:
> > > > >  -drive
> > > > >
> > > >
> > driver=blkreplay,if=none,image.driver=qcow2,image.file.driver=file,image.file.filename=testdis
> > > > k.qcow
> > > > > ,id=img-blkreplay
> > > > >  -device ide-hd,drive=img-blkreplay
> > > > >
> > > > > QEMU hangs at some moment of replay.
> > > > >
> > > > > I found that some dma requests do not pass through the blkreplay 
> > > > > driver
> > > > > due to the following line in block-backend.c:
> > > > > return bdrv_co_preadv(blk->root, offset, bytes, qiov, flags);
> > > > >
> > > > > This line passes read request directly to qcow driver and blkreplay 
> > > > > cannot
> > > > > process it to make deterministic.
> > > >
> > > > How does that bypass blkreplay? blk->root is supposed to be the blkreply
> > > > node, do you see something different? If it were the qcow2 node, then I
> > > > would expect that no requests at all go through the blkreplay layer.
> > >
> > > It seems, that the problem is in -snapshot option.
> > > We have one of the following block layers depending on command line:
> > >  tmp_overlay1 -> blkreplay -> tmp_overlay2 -> disk_image
> > >  tmp_overlay1 -> blkreplay -> disk_image
> > >
> > > But the correct scheme is intended to be the following:
> > >  blkreplay -> tmp_overlay1 -> disk_image
> > >
> > > How can we fix it?
> > > Maybe we should add blkreplay automatically to all block devices and not
> > > to specify it in the command line?
> > 
> > I think you found a pretty fundamental design problem with "global"
> > drive options that add a filter node such as -snapshot and replay mode
> > (replay mode isn't one of them today, but your suggestion to make it
> > automatic would turn it into one).
> > 
> > At the core of the problem I think we have two questions:
> > 
> > 1. Which block nodes should be affected and get a filter node added, and
> >which nodes shouldn't get one? In your case, disl_image is defined
> >with a -drive option, but shouldn't get the snapshot.
> > 
> > 2. In which order should filter nodes be added?
> > 
> > Both of these questions feel hard. As long as we haven't thought through
> > the concept as such (rather than discussing one-off hacks) and we're not
> > completely certain what the right answer to the questions is, we
> > shouldn't add more automatic filter nodes, because chances are that we
> > get it wrong and would regret it.
> > 
> > The obvious answer for a workaround would be: Make everything manual,
> > i.e. don't use -snapshot, but create a qcow2 overlay manually.
> 
> What about to switching to manual overlay creation by default?
> We can make rrsnapshot option mandatory.
> Therefore user will have to create snapshot in image or overlay and
> the disk image will not be corrupted.
> 
> It is not very convenient, but we could disable rrsnapshot again when
> the solution for -snapshot will be found.

Hm, what is this rrsnapshot option? git grep can't find it.

Anyway, it seems that doing things manually is the safe way as long as
we don't know the final solution, so I think I agree.

For a slightly more convenient way, one of the problems to solve seems
to be that snapshot=on always affects the top level node and you can't
create a temporary snapshot in the middle of the chain. Perhaps we
should introduce a 'temporary-overlay' driver or something like that, so
that you could specify things like this:

-drive if=none,driver=file,filename=test.img,id=orig
-drive if=none,driver=temporary-overlay,file=orig,id=snap
-drive if=none,driver=blkreplay,image=snap

Which makes me wonder... Is blkreplay usable without the temporary
snapshot or is this pretty much a requirement? Because if it has to be
there, the next step could be that blkreplay creates temporary-overlay
internally in its .bdrv_open().

Kevin



[Qemu-block] [PATCH RFC v2 4/6] replication: fix code logic with the new shared_disk option

2016-12-05 Thread zhanghailiang
Some code logic only be needed in non-shared disk, here
we adjust these codes to prepare for shared disk scenario.

Signed-off-by: zhanghailiang 
---
 block/replication.c | 47 ---
 1 file changed, 28 insertions(+), 19 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 35e9ab3..6574cc2 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -531,21 +531,28 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 aio_context_release(aio_context);
 return;
 }
-bdrv_op_block_all(top_bs, s->blocker);
-bdrv_op_unblock(top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
 
-job = backup_job_create(NULL, s->secondary_disk->bs, 
s->hidden_disk->bs,
-0, MIRROR_SYNC_MODE_NONE, NULL, false,
+/*
+ * Only in the case of non-shared disk,
+ * the backup job is in the secondary side
+ */
+if (!s->is_shared_disk) {
+bdrv_op_block_all(top_bs, s->blocker);
+bdrv_op_unblock(top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
+job = backup_job_create(NULL, s->secondary_disk->bs,
+s->hidden_disk->bs, 0,
+MIRROR_SYNC_MODE_NONE, NULL, false,
 BLOCKDEV_ON_ERROR_REPORT,
 BLOCKDEV_ON_ERROR_REPORT, BLOCK_JOB_INTERNAL,
 backup_job_completed, bs, NULL, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
-backup_job_cleanup(bs);
-aio_context_release(aio_context);
-return;
+if (local_err) {
+error_propagate(errp, local_err);
+backup_job_cleanup(bs);
+aio_context_release(aio_context);
+return;
+}
+block_job_start(job);
 }
-block_job_start(job);
 
 secondary_do_checkpoint(s, errp);
 break;
@@ -575,14 +582,16 @@ static void replication_do_checkpoint(ReplicationState 
*rs, Error **errp)
 case REPLICATION_MODE_PRIMARY:
 break;
 case REPLICATION_MODE_SECONDARY:
-if (!s->secondary_disk->bs->job) {
-error_setg(errp, "Backup job was cancelled unexpectedly");
-break;
-}
-backup_do_checkpoint(s->secondary_disk->bs->job, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
-break;
+if (!s->is_shared_disk) {
+if (!s->secondary_disk->bs->job) {
+error_setg(errp, "Backup job was cancelled unexpectedly");
+break;
+}
+backup_do_checkpoint(s->secondary_disk->bs->job, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+break;
+}
 }
 secondary_do_checkpoint(s, errp);
 break;
@@ -663,7 +672,7 @@ static void replication_stop(ReplicationState *rs, bool 
failover, Error **errp)
  * before the BDS is closed, because we will access hidden
  * disk, secondary disk in backup_job_completed().
  */
-if (s->secondary_disk->bs->job) {
+if (!s->is_shared_disk && s->secondary_disk->bs->job) {
 block_job_cancel_sync(s->secondary_disk->bs->job);
 }
 
-- 
1.8.3.1





[Qemu-block] [PATCH RFC v2 0/6] COLO block replication supports shared disk case

2016-12-05 Thread zhanghailiang
COLO block replication doesn't support the shared disk case,
Here we try to implement it.

For the detail of shared-disk scenario, please refer to patch 1.

COLO codes with shared-disk block replication can be found from the link:
https://github.com/coloft/qemu/tree/colo-developing-with-shared-disk-2016-12-5

Test procedures:
1. Secondary:
# x86_64-softmmu/qemu-system-x86_64 -boot c -m 2048 -smp 2 -qmp stdio -vnc :9 
-name secondary -enable-kvm -cpu qemu64,+kvmclock -device piix3-usb-uhci -drive 
if=none,driver=qcow2,file.filename=/mnt/ramfs/hidden_disk.img,id=hidden_disk0,backing.driver=raw,backing.file.filename=/work/kvm/suse11_sp3_64
  -drive 
if=ide,id=active-disk0,driver=replication,mode=secondary,file.driver=qcow2,top-id=active-disk0,file.file.filename=/mnt/ramfs/active_disk.img,file.backing=hidden_disk0,shared-disk=on
 -incoming tcp:0:

Issue qmp commands:
{'execute':'qmp_capabilities'}
{'execute': 'nbd-server-start', 'arguments': {'addr': {'type': 'inet', 'data': 
{'host': '0', 'port': '9998'} } } }
{'execute': 'nbd-server-add', 'arguments': {'device': 'hidden_disk0', 
'writable': true } }

2.Primary:
# x86_64-softmmu/qemu-system-x86_64 -enable-kvm -m 2048 -smp 2 -qmp stdio -vnc 
:9 -name primary -cpu qemu64,+kvmclock -device piix3-usb-uhci -drive 
if=virtio,id=primary_disk0,file.filename=/work/kvm/suse11_sp3_64,driver=raw -S

Issue qmp commands:
{'execute':'qmp_capabilities'}
{'execute': 'human-monitor-command', 'arguments': {'command-line': 'drive_add 
-n buddy 
driver=replication,mode=primary,file.driver=nbd,file.host=9.42.3.17,file.port=9998,file.export=hidden_disk0,shared-disk-id=primary_disk0,shared-disk=on,node-name=rep'}}
{'execute': 'migrate-set-capabilities', 'arguments': {'capabilities': [ 
{'capability': 'x-colo', 'state': true } ] } }
{'execute': 'migrate', 'arguments': {'uri': 'tcp:9.42.3.17:' } }

3. Failover
Secondary side:
Issue qmp commands:
{ 'execute': 'nbd-server-stop' }
{ "execute": "x-colo-lost-heartbeat" }

Please review and any commits are welcomed.

Cc: Juan Quintela 
Cc: Amit Shah  
Cc: Dr. David Alan Gilbert (git) 
Cc: eddie.d...@intel.com

v2:
- Drop the patch which add a blk_root() helper
- Fix some comments from Changlong

zhanghailiang (6):
  docs/block-replication: Add description for shared-disk case
  replication: add shared-disk and shared-disk-id options
  replication: Split out backup_do_checkpoint() from
secondary_do_checkpoint()
  replication: fix code logic with the new shared_disk option
  replication: Implement block replication for shared disk case
  nbd/replication: implement .bdrv_get_info() for nbd and replication
driver

 block/nbd.c|  12 
 block/replication.c| 156 +++--
 docs/block-replication.txt | 139 ++--
 qapi/block-core.json   |   9 ++-
 4 files changed, 278 insertions(+), 38 deletions(-)

-- 
1.8.3.1





[Qemu-block] [PATCH RFC v2 1/6] docs/block-replication: Add description for shared-disk case

2016-12-05 Thread zhanghailiang
Introuduce the scenario of shared-disk block replication
and how to use it.

Signed-off-by: zhanghailiang 
Signed-off-by: Wen Congyang 
Signed-off-by: Zhang Chen 
---
v2:
- fix some problems found by Changlong
---
 docs/block-replication.txt | 139 +++--
 1 file changed, 135 insertions(+), 4 deletions(-)

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
index 6bde673..fbfe005 100644
--- a/docs/block-replication.txt
+++ b/docs/block-replication.txt
@@ -24,7 +24,7 @@ only dropped at next checkpoint time. To reduce the network 
transportation
 effort during a vmstate checkpoint, the disk modification operations of
 the Primary disk are asynchronously forwarded to the Secondary node.
 
-== Workflow ==
+== Non-shared disk workflow ==
 The following is the image of block replication workflow:
 
 +--+++
@@ -57,7 +57,7 @@ The following is the image of block replication workflow:
 4) Secondary write requests will be buffered in the Disk buffer and it
will overwrite the existing sector content in the buffer.
 
-== Architecture ==
+== Non-shared disk architecture ==
 We are going to implement block replication from many basic
 blocks that are already in QEMU.
 
@@ -106,6 +106,74 @@ any state that would otherwise be lost by the speculative 
write-through
 of the NBD server into the secondary disk. So before block replication,
 the primary disk and secondary disk should contain the same data.
 
+== Shared Disk Mode Workflow ==
+The following is the image of block replication workflow:
+
++--+++
+|Primary Write Requests||Secondary Write Requests|
++--+++
+  |   |
+  |  (4)
+  |   V
+  |  /-\
+  | (2)Forward and write through | |
+  | +--> | Disk Buffer |
+  | || |
+  | |\-/
+  | |(1)read   |
+  | |  |
+   (3)write   | |  | backing file
+  V |  |
+ +-+   |
+ | Shared Disk | <-+
+ +-+
+
+1) Primary writes will read original data and forward it to Secondary
+   QEMU.
+2) Before Primary write requests are written to Shared disk, the
+   original sector content will be read from Shared disk and
+   forwarded and buffered in the Disk buffer on the secondary site,
+   but it will not overwrite the existing sector content (it could be
+   from either "Secondary Write Requests" or previous COW of "Primary
+   Write Requests") in the Disk buffer.
+3) Primary write requests will be written to Shared disk.
+4) Secondary write requests will be buffered in the Disk buffer and it
+   will overwrite the existing sector content in the buffer.
+
+== Shared Disk Mode Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+ virtio-blk ||   
.--
+ /  ||   | 
Secondary
+/   ||   
'--
+   /|| 
virtio-blk
+  / || 
 |
+  | ||   
replication(5)
+  |NBD  >   NBD   (2)  
 |
+  |  client ||server ---> hidden disk <-- 
active disk(4)
+  | ^   ||  |
+  |  replication(1) ||  |
+  | |   ||  |
+  |   +-'   ||  |
+ (3)  |drive-backup sync=none   ||  |
+. |   +-+   ||  |
+Primary | | |   ||   backing|
+' | |   ||  |
+  V |   |
+   +---+|
+   | 

[Qemu-block] [PATCH RFC v2 2/6] replication: add shared-disk and shared-disk-id options

2016-12-05 Thread zhanghailiang
We use these two options to identify which disk is
shared

Signed-off-by: zhanghailiang 
Signed-off-by: Wen Congyang 
Signed-off-by: Zhang Chen 
---
v2:
- add these two options for BlockdevOptionsReplication to support
  qmp blockdev-add command.
- fix a memory leak found by Changlong
---
 block/replication.c  | 37 +
 qapi/block-core.json |  9 -
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/block/replication.c b/block/replication.c
index 729dd12..e87ae87 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -25,9 +25,12 @@
 typedef struct BDRVReplicationState {
 ReplicationMode mode;
 int replication_state;
+bool is_shared_disk;
+char *shared_disk_id;
 BdrvChild *active_disk;
 BdrvChild *hidden_disk;
 BdrvChild *secondary_disk;
+BdrvChild *primary_disk;
 char *top_id;
 ReplicationState *rs;
 Error *blocker;
@@ -53,6 +56,9 @@ static void replication_stop(ReplicationState *rs, bool 
failover,
 
 #define REPLICATION_MODE"mode"
 #define REPLICATION_TOP_ID  "top-id"
+#define REPLICATION_SHARED_DISK "shared-disk"
+#define REPLICATION_SHARED_DISK_ID "shared-disk-id"
+
 static QemuOptsList replication_runtime_opts = {
 .name = "replication",
 .head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
@@ -65,6 +71,14 @@ static QemuOptsList replication_runtime_opts = {
 .name = REPLICATION_TOP_ID,
 .type = QEMU_OPT_STRING,
 },
+{
+.name = REPLICATION_SHARED_DISK_ID,
+.type = QEMU_OPT_STRING,
+},
+{
+.name = REPLICATION_SHARED_DISK,
+.type = QEMU_OPT_BOOL,
+},
 { /* end of list */ }
 },
 };
@@ -85,6 +99,9 @@ static int replication_open(BlockDriverState *bs, QDict 
*options,
 QemuOpts *opts = NULL;
 const char *mode;
 const char *top_id;
+const char *shared_disk_id;
+BlockBackend *blk;
+BlockDriverState *tmp_bs;
 
 ret = -EINVAL;
 opts = qemu_opts_create(&replication_runtime_opts, NULL, 0, &error_abort);
@@ -119,6 +136,25 @@ static int replication_open(BlockDriverState *bs, QDict 
*options,
"The option mode's value should be primary or secondary");
 goto fail;
 }
+s->is_shared_disk = qemu_opt_get_bool(opts, REPLICATION_SHARED_DISK,
+  false);
+if (s->is_shared_disk && (s->mode == REPLICATION_MODE_PRIMARY)) {
+shared_disk_id = qemu_opt_get(opts, REPLICATION_SHARED_DISK_ID);
+if (!shared_disk_id) {
+error_setg(&local_err, "Missing shared disk blk");
+goto fail;
+}
+s->shared_disk_id = g_strdup(shared_disk_id);
+blk = blk_by_name(s->shared_disk_id);
+if (!blk) {
+g_free(s->shared_disk_id);
+error_setg(&local_err, "There is no %s block", s->shared_disk_id);
+goto fail;
+}
+/* We can't access root member of BlockBackend directly */
+tmp_bs = blk_bs(blk);
+s->primary_disk = QLIST_FIRST(&tmp_bs->parents);
+}
 
 s->rs = replication_new(bs, &replication_ops);
 
@@ -135,6 +171,7 @@ static void replication_close(BlockDriverState *bs)
 {
 BDRVReplicationState *s = bs->opaque;
 
+g_free(s->shared_disk_id);
 if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
 replication_stop(s->rs, false, NULL);
 }
diff --git a/qapi/block-core.json b/qapi/block-core.json
index c29bef7..52d7e0d 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2232,12 +2232,19 @@
 #  node who owns the replication node chain. Must not be given in
 #  primary mode.
 #
+# @shared-disk-id: #optional The id of shared disk while in replication mode.
+#
+# @shared-disk: #optional To indicate whether or not a disk is shared by
+#   primary VM and secondary VM.
+#
 # Since: 2.8
 ##
 { 'struct': 'BlockdevOptionsReplication',
   'base': 'BlockdevOptionsGenericFormat',
   'data': { 'mode': 'ReplicationMode',
-'*top-id': 'str' } }
+'*top-id': 'str',
+'*shared-disk-id': 'str',
+'*shared-disk': 'bool' } }
 
 ##
 # @NFSTransport
-- 
1.8.3.1





[Qemu-block] [PATCH RFC v2 5/6] replication: Implement block replication for shared disk case

2016-12-05 Thread zhanghailiang
Just as the scenario of non-shared disk block replication,
we are going to implement block replication from many basic
blocks that are already in QEMU.
The architecture is:

 virtio-blk ||   
.--
 /  ||   | 
Secondary
/   ||   
'--
   /|| 
virtio-blk
  / ||  
|
  | ||   
replication(5)
  |NBD  >   NBD   (2)   
|
  |  client ||server ---> hidden disk <-- 
active disk(4)
  | ^   ||  |
  |  replication(1) ||  |
  | |   ||  |
  |   +-'   ||  |
 (3)  |drive-backup sync=none   ||  |
. |   +-+   ||  |
Primary | | |   ||   backing|
' | |   ||  |
  V |   |
   +---+|
   |   shared disk | <--+
   +---+

1) Primary writes will read original data and forward it to Secondary
   QEMU.
2) The hidden-disk is created automatically. It buffers the original content
   that is modified by the primary VM. It should also be an empty disk, and
   the driver supports bdrv_make_empty() and backing file.
3) Primary write requests will be written to Shared disk.
4) Secondary write requests will be buffered in the active disk and it
   will overwrite the existing sector content in the buffer.

Signed-off-by: zhanghailiang 
Signed-off-by: Wen Congyang 
Signed-off-by: Zhang Chen 
---
 block/replication.c | 48 ++--
 1 file changed, 42 insertions(+), 6 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 6574cc2..f416ca5 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -233,7 +233,7 @@ static coroutine_fn int 
replication_co_readv(BlockDriverState *bs,
  QEMUIOVector *qiov)
 {
 BDRVReplicationState *s = bs->opaque;
-BdrvChild *child = s->secondary_disk;
+BdrvChild *child = s->is_shared_disk ? s->primary_disk : s->secondary_disk;
 BlockJob *job = NULL;
 CowRequest req;
 int ret;
@@ -415,7 +415,12 @@ static void backup_job_completed(void *opaque, int ret)
 s->error = -EIO;
 }
 
-backup_job_cleanup(bs);
+if (s->mode == REPLICATION_MODE_PRIMARY) {
+s->replication_state = BLOCK_REPLICATION_DONE;
+s->error = 0;
+} else {
+backup_job_cleanup(bs);
+}
 }
 
 static bool check_top_bs(BlockDriverState *top_bs, BlockDriverState *bs)
@@ -467,6 +472,19 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 
 switch (s->mode) {
 case REPLICATION_MODE_PRIMARY:
+if (s->is_shared_disk) {
+job = backup_job_create(NULL, s->primary_disk->bs, bs, 0,
+MIRROR_SYNC_MODE_NONE, NULL, false, BLOCKDEV_ON_ERROR_REPORT,
+BLOCKDEV_ON_ERROR_REPORT, BLOCK_JOB_INTERNAL,
+backup_job_completed, bs, NULL, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+backup_job_cleanup(bs);
+aio_context_release(aio_context);
+return;
+}
+block_job_start(job);
+}
 break;
 case REPLICATION_MODE_SECONDARY:
 s->active_disk = bs->file;
@@ -485,7 +503,8 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 }
 
 s->secondary_disk = s->hidden_disk->bs->backing;
-if (!s->secondary_disk->bs || !bdrv_has_blk(s->secondary_disk->bs)) {
+if (!s->secondary_disk->bs ||
+(!s->is_shared_disk && !bdrv_has_blk(s->secondary_disk->bs))) {
 error_setg(errp, "The secondary disk doesn't have block backend");
 aio_context_release(aio_context);
 return;
@@ -580,11 +599,24 @@ static void replication_do_checkpoint(ReplicationState 
*rs, Error **errp)
 
 switch (s->mode) {
 case REPLICATION_MODE_PRIMARY:
+if (s->is_shared_disk) {
+if (!s->primary_disk->bs->job) {
+error_setg(errp, "Primary backup job was cancelled"
+   " unexpectedly");
+break

[Qemu-block] [PATCH RFC v2 3/6] replication: Split out backup_do_checkpoint() from secondary_do_checkpoint()

2016-12-05 Thread zhanghailiang
The helper backup_do_checkpoint() will be used for primary related
codes. Here we split it out from secondary_do_checkpoint().

Besides, it is unnecessary to call backup_do_checkpoint() in
replication starting and normally stop replication path.
We only need call it while do real checkpointing.

Signed-off-by: zhanghailiang 
---
 block/replication.c | 36 +++-
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index e87ae87..35e9ab3 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -332,20 +332,8 @@ static bool 
replication_recurse_is_first_non_filter(BlockDriverState *bs,
 
 static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
 {
-Error *local_err = NULL;
 int ret;
 
-if (!s->secondary_disk->bs->job) {
-error_setg(errp, "Backup job was cancelled unexpectedly");
-return;
-}
-
-backup_do_checkpoint(s->secondary_disk->bs->job, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
-return;
-}
-
 ret = s->active_disk->bs->drv->bdrv_make_empty(s->active_disk->bs);
 if (ret < 0) {
 error_setg(errp, "Cannot make active disk empty");
@@ -558,6 +546,8 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 return;
 }
 block_job_start(job);
+
+secondary_do_checkpoint(s, errp);
 break;
 default:
 aio_context_release(aio_context);
@@ -566,10 +556,6 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 
 s->replication_state = BLOCK_REPLICATION_RUNNING;
 
-if (s->mode == REPLICATION_MODE_SECONDARY) {
-secondary_do_checkpoint(s, errp);
-}
-
 s->error = 0;
 aio_context_release(aio_context);
 }
@@ -579,13 +565,29 @@ static void replication_do_checkpoint(ReplicationState 
*rs, Error **errp)
 BlockDriverState *bs = rs->opaque;
 BDRVReplicationState *s;
 AioContext *aio_context;
+Error *local_err = NULL;
 
 aio_context = bdrv_get_aio_context(bs);
 aio_context_acquire(aio_context);
 s = bs->opaque;
 
-if (s->mode == REPLICATION_MODE_SECONDARY) {
+switch (s->mode) {
+case REPLICATION_MODE_PRIMARY:
+break;
+case REPLICATION_MODE_SECONDARY:
+if (!s->secondary_disk->bs->job) {
+error_setg(errp, "Backup job was cancelled unexpectedly");
+break;
+}
+backup_do_checkpoint(s->secondary_disk->bs->job, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+break;
+}
 secondary_do_checkpoint(s, errp);
+break;
+default:
+abort();
 }
 aio_context_release(aio_context);
 }
-- 
1.8.3.1





[Qemu-block] [PATCH RFC v2 6/6] nbd/replication: implement .bdrv_get_info() for nbd and replication driver

2016-12-05 Thread zhanghailiang
Without this callback, there will be an error reports in the primary side:
"qemu-system-x86_64: Couldn't determine the cluster size of the target image,
which has no backing file: Operation not supported
Aborting, since this may create an unusable destination image"

For nbd driver, it doesn't have cluster size, so here we return
a fake value for it.

This patch should be dropped if Eric's nbd patch be merged.
https://lists.gnu.org/archive/html/qemu-devel/2016-04/msg03567.html

Cc: Eric Blake 
Signed-off-by: zhanghailiang 
Signed-off-by: Wen Congyang 
---
 block/nbd.c | 12 
 block/replication.c |  6 ++
 2 files changed, 18 insertions(+)

diff --git a/block/nbd.c b/block/nbd.c
index 35f24be..b71a13d 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -43,6 +43,8 @@
 
 #define EN_OPTSTR ":exportname="
 
+#define NBD_FAKE_CLUSTER_SIZE 512
+
 typedef struct BDRVNBDState {
 NBDClientSession client;
 
@@ -552,6 +554,13 @@ static void nbd_refresh_filename(BlockDriverState *bs, 
QDict *options)
 bs->full_open_options = opts;
 }
 
+static int nbd_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+bdi->cluster_size  = NBD_FAKE_CLUSTER_SIZE;
+
+return 0;
+}
+
 static BlockDriver bdrv_nbd = {
 .format_name= "nbd",
 .protocol_name  = "nbd",
@@ -569,6 +578,7 @@ static BlockDriver bdrv_nbd = {
 .bdrv_detach_aio_context= nbd_detach_aio_context,
 .bdrv_attach_aio_context= nbd_attach_aio_context,
 .bdrv_refresh_filename  = nbd_refresh_filename,
+.bdrv_get_info  = nbd_get_info,
 };
 
 static BlockDriver bdrv_nbd_tcp = {
@@ -588,6 +598,7 @@ static BlockDriver bdrv_nbd_tcp = {
 .bdrv_detach_aio_context= nbd_detach_aio_context,
 .bdrv_attach_aio_context= nbd_attach_aio_context,
 .bdrv_refresh_filename  = nbd_refresh_filename,
+.bdrv_get_info  = nbd_get_info,
 };
 
 static BlockDriver bdrv_nbd_unix = {
@@ -607,6 +618,7 @@ static BlockDriver bdrv_nbd_unix = {
 .bdrv_detach_aio_context= nbd_detach_aio_context,
 .bdrv_attach_aio_context= nbd_attach_aio_context,
 .bdrv_refresh_filename  = nbd_refresh_filename,
+.bdrv_get_info  = nbd_get_info,
 };
 
 static void bdrv_nbd_init(void)
diff --git a/block/replication.c b/block/replication.c
index f416ca5..5f14360 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -731,6 +731,11 @@ static void replication_stop(ReplicationState *rs, bool 
failover, Error **errp)
 aio_context_release(aio_context);
 }
 
+static int replication_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+return bdrv_get_info(bs->file->bs, bdi);
+}
+
 BlockDriver bdrv_replication = {
 .format_name= "replication",
 .protocol_name  = "replication",
@@ -743,6 +748,7 @@ BlockDriver bdrv_replication = {
 .bdrv_co_readv  = replication_co_readv,
 .bdrv_co_writev = replication_co_writev,
 
+.bdrv_get_info  = replication_get_info,
 .is_filter  = true,
 .bdrv_recurse_is_first_non_filter = 
replication_recurse_is_first_non_filter,
 
-- 
1.8.3.1





Re: [Qemu-block] [Qemu-devel] [RFC PATCH] glusterfs: allow partial reads

2016-12-05 Thread Wolfgang Bumiller
On Fri, Dec 02, 2016 at 01:13:28PM -0600, Eric Blake wrote:
> On 12/01/2016 04:59 AM, Wolfgang Bumiller wrote:
> > Fixes #1644754.
> > 
> > Signed-off-by: Wolfgang Bumiller 
> > ---
> > I'm not sure what the original rationale was to treat both partial
> > reads as well as well as writes as I/O error. (Seems to have happened
> > from original glusterfs v1 to v2 series with a note but no reasoning
> > for the read side as far as I could see.)
> > The general direction lately seems to be to move away from sector
> > based block APIs. Also eg. the NFS code allows partial reads. (It
> > does, however, have an old patch (c2eb918e3) dedicated to aligning
> > sizes to 512 byte boundaries for file creation for compatibility to
> > other parts of qemu like qcow2. This already happens in glusterfs,
> > though, but if you move a file from a different storage over to
> > glusterfs you may end up with a qcow2 file with eg. the L1 table in
> > the last 80 bytes of the file aligned to _begin_ at a 512 boundary,
> > but not _end_ at one.)
> > 
> >  block/gluster.c | 10 +-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/gluster.c b/block/gluster.c
> > index 891c13b..3db0bf8 100644
> > --- a/block/gluster.c
> > +++ b/block/gluster.c
> > @@ -41,6 +41,7 @@ typedef struct GlusterAIOCB {
> >  int ret;
> >  Coroutine *coroutine;
> >  AioContext *aio_context;
> > +bool is_write;
> >  } GlusterAIOCB;
> >  
> >  typedef struct BDRVGlusterState {
> > @@ -716,8 +717,10 @@ static void gluster_finish_aiocb(struct glfs_fd *fd, 
> > ssize_t ret, void *arg)
> >  acb->ret = 0; /* Success */
> >  } else if (ret < 0) {
> >  acb->ret = -errno; /* Read/Write failed */
> > +} else if (acb->is_write) {
> > +acb->ret = -EIO; /* Partial write - fail it */
> >  } else {
> > -acb->ret = -EIO; /* Partial read/write - fail it */
> > +acb->ret = 0; /* Success */
> 
> Does this properly guarantee that the portion beyond EOF reads as zero?

I'd argue this wasn't necessarily the case before either, considering
the first check starts with `!ret`:

if (!ret || ret == acb->size) {
acb->ret = 0; /* Success */

A read right at EOF would return 0 and be treated as success there, no?
Iow. it wouldn't zero out the destination buffer as far as I can see.
Come to think of it, I'm not too fond of this part of the check for the
write case either.

> Would it be better to switch to byte-based interfaces rather than
> continue to force gluster interaction in 512-byte sector chunks, since
> gluster can obviously store files that are not 512-aligned?