[Qemu-block] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension
While not directly related to NBD_CMD_WRITE_ZEROES, the qemu team
discovered that it is useful if a server can advertise whether an
export is in a known-all-zeroes state at the time the client connects.

Signed-off-by: Eric Blake
---
 doc/proto.md | 5 +
 1 file changed, 5 insertions(+)

This replaces the following qemu patch attempt:
https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00357.html
which tried to add NBD_CMD_HAS_ZERO_INIT with poor semantics. The
semantics in this proposal should be much better.

The patch is against the merge of the master branch and the
extension-write-zeroes branch.

By the way, qemu 2.8 is due to be released "real soon now", and
implements NBD_CMD_WRITE_ZEROES, so maybe it is time to consider
promoting the extension-write-zeroes branch into master.

diff --git a/doc/proto.md b/doc/proto.md
index afe71fc..7e4ec7f 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -697,6 +697,11 @@ The field has the following format:
   the export.
 - bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`: defined by the experimental
   `BLOCK_STATUS` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-blockstatus/doc/proto.md).
+- bit 10, `NBD_FLAG_INIT_ZEROES`: Indicates that the server guarantees
+  that at the time the transmission phase begins, all offsets within
+  the export read as zero bytes. Clients MAY use this information to
+  avoid writing to sections of the export that should still read as
+  zero after the client is done writing.

 Clients SHOULD ignore unknown flags.

--
2.9.3
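For illustration, client-side use of the proposed flag could look like
the following sketch (only the bit position comes from the proposal
above; the nbd_conn struct and the helper are hypothetical, invented
for this example):

#include <stdbool.h>
#include <stdint.h>

#define NBD_FLAG_INIT_ZEROES (1 << 10)   /* proposed transmission flag */

struct nbd_conn {
    uint16_t transmission_flags;         /* flags from negotiation */
};

/* True if the export already reads as all zeroes at the start of the
 * transmission phase, so the client may skip zeroing regions that it
 * never writes to. */
static bool nbd_export_starts_zeroed(const struct nbd_conn *conn)
{
    return (conn->transmission_flags & NBD_FLAG_INIT_ZEROES) != 0;
}

A client such as qemu-img convert could consult such a predicate to
skip an up-front zeroing pass over regions it never writes.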
[Qemu-block] [PULL for-2.8 3/3] qemu-doc: update gluster protocol usage guide
From: Prasanna Kumar Kalever

Document:
1. The new debug and logfile options with their usages
2. New json format and its usage and
3. update "GlusterFS, Device URL Syntax" section in "Invocation"

Signed-off-by: Prasanna Kumar Kalever
Reviewed-by: Eric Blake
Signed-off-by: Jeff Cody
---
 qemu-doc.texi   | 59 +++--
 qemu-options.hx | 25 ++--
 2 files changed, 68 insertions(+), 16 deletions(-)

diff --git a/qemu-doc.texi b/qemu-doc.texi
index 023c140..02cb39d 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -1041,35 +1041,55 @@ GlusterFS is an user space distributed file system.
 You can boot from the GlusterFS disk image with the command:
 @example
-qemu-system-x86_64 -drive file=gluster[+@var{transport}]://[@var{server}[:@var{port}]]/@var{volname}/@var{image}[?socket=...]
+URI:
+qemu-system-x86_64 -drive file=gluster[+@var{type}]://[@var{host}[:@var{port}]]/@var{volume}/@var{path}
+                          [?socket=...][,file.debug=9][,file.logfile=...]
+
+JSON:
+qemu-system-x86_64 'json:@{"driver":"qcow2",
+                          "file":@{"driver":"gluster",
+                                   "volume":"testvol","path":"a.img","debug":9,"logfile":"...",
+                                   "server":[@{"type":"tcp","host":"...","port":"..."@},
+                                             @{"type":"unix","socket":"..."@}]@}@}'
 @end example

 @var{gluster} is the protocol.

-@var{transport} specifies the transport type used to connect to gluster
+@var{type} specifies the transport type used to connect to gluster
 management daemon (glusterd). Valid transport types are
-tcp, unix and rdma. If a transport type isn't specified, then tcp
-type is assumed.
+tcp and unix. In the URI form, if a transport type isn't specified,
+then tcp type is assumed.

-@var{server} specifies the server where the volume file specification for
-the given volume resides. This can be either hostname, ipv4 address
-or ipv6 address. ipv6 address needs to be within square brackets [ ].
-If transport type is unix, then @var{server} field should not be specified.
+@var{host} specifies the server where the volume file specification for
+the given volume resides. This can be either a hostname or an ipv4 address.
+If transport type is unix, then @var{host} field should not be specified.
 Instead @var{socket} field needs to be populated with the path to unix domain
 socket.

 @var{port} is the port number on which glusterd is listening. This is optional
-and if not specified, QEMU will send 0 which will make gluster to use the
-default port. If the transport type is unix, then @var{port} should not be
-specified.
+and if not specified, it defaults to port 24007. If the transport type is unix,
+then @var{port} should not be specified.
+
+@var{volume} is the name of the gluster volume which contains the disk image.
+
+@var{path} is the path to the actual disk image that resides on gluster volume.
+
+@var{debug} is the logging level of the gluster protocol driver. Debug levels
+are 0-9, with 9 being the most verbose, and 0 representing no debugging output.
+The default level is 4. The current logging levels defined in the gluster source
+are 0 - None, 1 - Emergency, 2 - Alert, 3 - Critical, 4 - Error, 5 - Warning,
+6 - Notice, 7 - Info, 8 - Debug, 9 - Trace
+
+@var{logfile} is a commandline option to mention log file path which helps in
+logging to the specified file and also helps in persisting the gfapi logs. The
+default is stderr.
+
-@var{volname} is the name of the gluster volume which contains the disk image.
-@var{image} is the path to the actual disk image that resides on gluster volume.

 You can create a GlusterFS disk image with the command:
 @example
-qemu-img create gluster://@var{server}/@var{volname}/@var{image} @var{size}
+qemu-img create gluster://@var{host}/@var{volume}/@var{path} @var{size}
 @end example

 Examples
@@ -1082,6 +1102,17 @@ qemu-system-x86_64 -drive file=gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir
 qemu-system-x86_64 -drive file=gluster+tcp://server.domain.com:24007/testvol/dir/a.img
 qemu-system-x86_64 -drive file=gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
 qemu-system-x86_64 -drive file=gluster+rdma://1.2.3.4:24007/testvol/a.img
+qemu-system-x86_64 -drive file=gluster://1.2.3.4/testvol/a.img,file.debug=9,file.logfile=/var/log/qemu-gluster.log
+qemu-system-x86_64 'json:@{"driver":"qcow2",
+                          "file":@{"driver":"gluster",
+                                   "volume":"testvol","path":"a.img",
+                                   "debug":9,"logfile":"/var/log/qemu-gluster.log",
+                                   "server":[@{"type":"tcp","host":"1.2.3.4","port":24007@},
+                                             @{"type":"unix","socket":"/var/run/glusterd.socket"@}]@}@}'
+qemu-system-x86_64 -drive driver=qcow2,file.driver=gluster,file.volume=tes
[Qemu-block] [PULL for-2.8 2/3] block/nfs: fix QMP to match debug option
From: Prasanna Kumar Kalever

The QMP definition of BlockdevOptionsNfs:

{ 'struct': 'BlockdevOptionsNfs',
  'data': { 'server': 'NFSServer',
            'path': 'str',
            '*user': 'int',
            '*group': 'int',
            '*tcp-syn-count': 'int',
            '*readahead-size': 'int',
            '*page-cache-size': 'int',
            '*debug-level': 'int' } }

To make this consistent with other block protocols like gluster, let's
change s/debug-level/debug/

Suggested-by: Eric Blake
Signed-off-by: Prasanna Kumar Kalever
Reviewed-by: Eric Blake
Signed-off-by: Jeff Cody
---
 block/nfs.c          | 4 ++--
 qapi/block-core.json | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/nfs.c b/block/nfs.c
index d082783..a490660 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -134,7 +134,7 @@ static int nfs_parse_uri(const char *filename, QDict *options, Error **errp)
             qdict_put(options, "page-cache-size",
                       qstring_from_str(qp->p[i].value));
         } else if (!strcmp(qp->p[i].name, "debug")) {
-            qdict_put(options, "debug-level",
+            qdict_put(options, "debug",
                       qstring_from_str(qp->p[i].value));
         } else {
             error_setg(errp, "Unknown NFS parameter name: %s",
@@ -165,7 +165,7 @@ static bool nfs_has_filename_options_conflict(QDict *options, Error **errp)
             !strcmp(qe->key, "tcp-syn-count") ||
             !strcmp(qe->key, "readahead-size") ||
             !strcmp(qe->key, "page-cache-size") ||
-            !strcmp(qe->key, "debug-level") ||
+            !strcmp(qe->key, "debug") ||
             strstart(qe->key, "server.", NULL))
         {
             error_setg(errp, "Option %s cannot be used with a filename",
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 03b19f1..f22ed2a 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2292,7 +2292,7 @@
 # @page-cache-size: #optional set the pagecache size in bytes (defaults
 #                   to libnfs default)
 #
-# @debug-level: #optional set the NFS debug level (max 2) (defaults
+# @debug: #optional set the NFS debug level (max 2) (defaults
 #         to libnfs default)
 #
 # Since 2.8
@@ -2305,7 +2305,7 @@
             '*tcp-syn-count': 'int',
             '*readahead-size': 'int',
             '*page-cache-size': 'int',
-            '*debug-level': 'int' } }
+            '*debug': 'int' } }

 ##
 # @BlockdevOptionsCurl

--
2.7.4
[Qemu-block] [PULL for-2.8 1/3] block/gluster: fix QMP to match debug option
From: Prasanna Kumar Kalever

The QMP definition of BlockdevOptionsGluster:

{ 'struct': 'BlockdevOptionsGluster',
  'data': { 'volume': 'str',
            'path': 'str',
            'server': ['GlusterServer'],
            '*debug-level': 'int',
            '*logfile': 'str' } }

But instead of 'debug-level' we have exported 'debug' as the option for
choosing the debug level of the gluster protocol driver. This patch
fixes the QMP definition of BlockdevOptionsGluster:
s/debug-level/debug/

Suggested-by: Eric Blake
Signed-off-by: Prasanna Kumar Kalever
Reviewed-by: Eric Blake
Signed-off-by: Jeff Cody
---
 block/gluster.c      | 38 +++---
 qapi/block-core.json |  4 ++--
 2 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index 891c13b..a0a74e4 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -48,7 +48,7 @@ typedef struct BDRVGlusterState {
     struct glfs_fd *fd;
     char *logfile;
     bool supports_seek_data;
-    int debug_level;
+    int debug;
 } BDRVGlusterState;

 typedef struct BDRVGlusterReopenState {
@@ -434,7 +434,7 @@ static struct glfs *qemu_gluster_glfs_init(BlockdevOptionsGluster *gconf,
         }
     }

-    ret = glfs_set_logging(glfs, gconf->logfile, gconf->debug_level);
+    ret = glfs_set_logging(glfs, gconf->logfile, gconf->debug);
     if (ret < 0) {
         goto out;
     }
@@ -788,17 +788,17 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options,

     filename = qemu_opt_get(opts, GLUSTER_OPT_FILENAME);

-    s->debug_level = qemu_opt_get_number(opts, GLUSTER_OPT_DEBUG,
-                                         GLUSTER_DEBUG_DEFAULT);
-    if (s->debug_level < 0) {
-        s->debug_level = 0;
-    } else if (s->debug_level > GLUSTER_DEBUG_MAX) {
-        s->debug_level = GLUSTER_DEBUG_MAX;
+    s->debug = qemu_opt_get_number(opts, GLUSTER_OPT_DEBUG,
+                                   GLUSTER_DEBUG_DEFAULT);
+    if (s->debug < 0) {
+        s->debug = 0;
+    } else if (s->debug > GLUSTER_DEBUG_MAX) {
+        s->debug = GLUSTER_DEBUG_MAX;
     }

     gconf = g_new0(BlockdevOptionsGluster, 1);
-    gconf->debug_level = s->debug_level;
-    gconf->has_debug_level = true;
+    gconf->debug = s->debug;
+    gconf->has_debug = true;

     logfile = qemu_opt_get(opts, GLUSTER_OPT_LOGFILE);
     s->logfile = g_strdup(logfile ? logfile : GLUSTER_LOGFILE_DEFAULT);
@@ -874,8 +874,8 @@ static int qemu_gluster_reopen_prepare(BDRVReopenState *state,
     qemu_gluster_parse_flags(state->flags, &open_flags);

     gconf = g_new0(BlockdevOptionsGluster, 1);
-    gconf->debug_level = s->debug_level;
-    gconf->has_debug_level = true;
+    gconf->debug = s->debug;
+    gconf->has_debug = true;
     gconf->logfile = g_strdup(s->logfile);
     gconf->has_logfile = true;
     reop_s->glfs = qemu_gluster_init(gconf, state->bs->filename, NULL, errp);
@@ -1011,14 +1011,14 @@ static int qemu_gluster_create(const char *filename,
     char *tmp = NULL;

     gconf = g_new0(BlockdevOptionsGluster, 1);
-    gconf->debug_level = qemu_opt_get_number_del(opts, GLUSTER_OPT_DEBUG,
-                                                 GLUSTER_DEBUG_DEFAULT);
-    if (gconf->debug_level < 0) {
-        gconf->debug_level = 0;
-    } else if (gconf->debug_level > GLUSTER_DEBUG_MAX) {
-        gconf->debug_level = GLUSTER_DEBUG_MAX;
+    gconf->debug = qemu_opt_get_number_del(opts, GLUSTER_OPT_DEBUG,
+                                           GLUSTER_DEBUG_DEFAULT);
+    if (gconf->debug < 0) {
+        gconf->debug = 0;
+    } else if (gconf->debug > GLUSTER_DEBUG_MAX) {
+        gconf->debug = GLUSTER_DEBUG_MAX;
     }
-    gconf->has_debug_level = true;
+    gconf->has_debug = true;

     gconf->logfile = qemu_opt_get_del(opts, GLUSTER_OPT_LOGFILE);
     if (!gconf->logfile) {
diff --git a/qapi/block-core.json b/qapi/block-core.json
index c29bef7..03b19f1 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2195,7 +2195,7 @@
 #
 # @server: gluster servers description
 #
-# @debug-level: #optional libgfapi log level (default '4' which is Error)
+# @debug: #optional libgfapi log level (default '4' which is Error)
 #
 # @logfile: #optional libgfapi log file (default /dev/stderr) (Since 2.8)
 #
@@ -2205,7 +2205,7 @@
   'data': { 'volume': 'str',
             'path': 'str',
             'server': ['GlusterServer'],
-            '*debug-level': 'int',
+            '*debug': 'int',
             '*logfile': 'str' } }

 ##

--
2.7.4
[Qemu-block] [PULL for-2.8 0/3] Block patches for -rc3
The following changes since commit bc66cedb4141fb7588f2462c74310d8fb5dd4cf1:

  Merge remote-tracking branch 'yongbok/tags/mips-20161204' into staging (2016-12-05 10:56:45 +0000)

are available in the git repository at:

  https://github.com/codyprime/qemu-kvm-jtc.git tags/block-pull-request

for you to fetch changes up to 76b5550f709b975a7b04fb4c887f300b7bb731c2:

  qemu-doc: update gluster protocol usage guide (2016-12-05 16:30:29 -0500)

Gluster block patches for -rc3

Prasanna Kumar Kalever (3):
  block/gluster: fix QMP to match debug option
  block/nfs: fix QMP to match debug option
  qemu-doc: update gluster protocol usage guide

 block/gluster.c      | 38 -
 block/nfs.c          |  4 ++--
 qapi/block-core.json |  8 +++
 qemu-doc.texi        | 59 +++-
 qemu-options.hx      | 25 --
 5 files changed, 93 insertions(+), 41 deletions(-)

--
2.7.4
Re: [Qemu-block] [Qemu-devel] [PATCH for-2.8 v2] qcow2: Don't strand clusters near 2G intervals during commit
On 12/05/2016 10:49 AM, Eric Blake wrote: > The qcow2_make_empty() function is reached during 'qemu-img commit', > in order to clear out ALL clusters of an image. However, if the > image cannot use the fast code path (true if the image is format > 0.10, or if the image contains a snapshot), the cluster size is > larger than 512, and the image is larger than 2G in size, then our > choice of sector_step causes problems. Since it is not cluster > aligned, but qcow2_discard_clusters() silently ignores an unaligned > head or tail, we are leaving clusters allocated. > > Enhance the testsuite to expose the flaw, and patch the problem by > ensuring our step size is aligned. > > Signed-off-by: Eric Blake > > --- > v2: perform rounding correctly > --- > block/qcow2.c | 3 +- > tests/qemu-iotests/097 | 41 +--- > tests/qemu-iotests/097.out | 249 > + > 3 files changed, 210 insertions(+), 83 deletions(-) > > diff --git a/block/qcow2.c b/block/qcow2.c > index ed9e0f3..96fb8a8 100644 > --- a/block/qcow2.c > +++ b/block/qcow2.c > @@ -2808,7 +2808,8 @@ static int qcow2_make_empty(BlockDriverState *bs) > { > BDRVQcow2State *s = bs->opaque; > uint64_t start_sector; > -int sector_step = INT_MAX / BDRV_SECTOR_SIZE; > +int sector_step = (QEMU_ALIGN_DOWN(INT_MAX, s->cluster_size) / > + BDRV_SECTOR_SIZE); > int l1_clusters, ret = 0; > > l1_clusters = DIV_ROUND_UP(s->l1_size, s->cluster_size / > sizeof(uint64_t)); > diff --git a/tests/qemu-iotests/097 b/tests/qemu-iotests/097 > index 01d8dd0..4c33e80 100755 > --- a/tests/qemu-iotests/097 > +++ b/tests/qemu-iotests/097 > @@ -46,7 +46,7 @@ _supported_proto file > _supported_os Linux > > > -# Four passes: > +# Four main passes: > # 0: Two-layer backing chain, commit to upper backing file (implicitly) > # (in this case, the top image will be emptied) > # 1: Two-layer backing chain, commit to upper backing file (explicitly) > @@ -56,22 +56,30 @@ _supported_os Linux > # 3: Two-layer backing chain, commit to lower backing file > # (in this case, the top image will implicitly stay unchanged) > # > +# Each pass is run twice, since qcow2 has different code paths for cleaning > +# an image depending on whether it has a snapshot. > +# > # 020 already tests committing, so this only tests whether image chains are > # working properly and that all images above the base are emptied; therefore, > -# no complicated patterns are necessary > +# no complicated patterns are necessary. Check near the 2G mark, as qcow2 > +# has been buggy at that boundary in the past. 
> for i in 0 1 2 3; do > +for j in 0 1; do > > echo > -echo "=== Test pass $i ===" > +echo "=== Test pass $i.$j ===" > echo > > -TEST_IMG="$TEST_IMG.base" _make_test_img 64M > -TEST_IMG="$TEST_IMG.itmd" _make_test_img -b "$TEST_IMG.base" 64M > -_make_test_img -b "$TEST_IMG.itmd" 64M > +TEST_IMG="$TEST_IMG.base" _make_test_img 2100M > +TEST_IMG="$TEST_IMG.itmd" _make_test_img -b "$TEST_IMG.base" 2100M > +_make_test_img -b "$TEST_IMG.itmd" 2100M > +if [ $j -eq 0 ]; then > +$QEMU_IMG snapshot -c snap "$TEST_IMG" > +fi > > -$QEMU_IO -c 'write -P 1 0 192k' "$TEST_IMG.base" | _filter_qemu_io > -$QEMU_IO -c 'write -P 2 64k 128k' "$TEST_IMG.itmd" | _filter_qemu_io > -$QEMU_IO -c 'write -P 3 128k 64k' "$TEST_IMG" | _filter_qemu_io > +$QEMU_IO -c 'write -P 1 0x7ffd 192k' "$TEST_IMG.base" | _filter_qemu_io > +$QEMU_IO -c 'write -P 2 0x7ffe 128k' "$TEST_IMG.itmd" | _filter_qemu_io > +$QEMU_IO -c 'write -P 3 0x7fff 64k' "$TEST_IMG" | _filter_qemu_io > > if [ $i -lt 3 ]; then > if [ $i == 0 ]; then > @@ -88,12 +96,12 @@ if [ $i -lt 3 ]; then > fi > > # Bottom should be unchanged > -$QEMU_IO -c 'read -P 1 0 192k' "$TEST_IMG.base" | _filter_qemu_io > +$QEMU_IO -c 'read -P 1 0x7ffd 192k' "$TEST_IMG.base" | > _filter_qemu_io > > # Intermediate should contain changes from top > -$QEMU_IO -c 'read -P 1 0 64k' "$TEST_IMG.itmd" | _filter_qemu_io > -$QEMU_IO -c 'read -P 2 64k 64k' "$TEST_IMG.itmd" | _filter_qemu_io > -$QEMU_IO -c 'read -P 3 128k 64k' "$TEST_IMG.itmd" | _filter_qemu_io > +$QEMU_IO -c 'read -P 1 0x7ffd 64k' "$TEST_IMG.itmd" | _filter_qemu_io > +$QEMU_IO -c 'read -P 2 0x7ffe 64k' "$TEST_IMG.itmd" | _filter_qemu_io > +$QEMU_IO -c 'read -P 3 0x7fff 64k' "$TEST_IMG.itmd" | _filter_qemu_io > > # And in pass 0, the top image should be empty, whereas in both other > passes > # it should be unchanged (which is both checked by qemu-img map) > @@ -101,9 +109,9 @@ else > $QEMU_IMG commit -b "$TEST_IMG.base" "$TEST_IMG" > > # Bottom should contain all changes > -$QEMU_IO -c 'read -P 1 0 64k' "$TEST_IMG.base" | _filter_qemu_io > -$QEMU_IO -c 'read -P 2 64k 64k' "$TEST_IMG.base" | _filter_qemu_io > -$QEMU_IO -c 'read -P 3 128k 64k' "$TEST_IMG.base" | _filter_qemu_io > +$QEMU_IO
Re: [Qemu-block] [Qemu-devel] [PATCH RFC v2 2/6] replication: add shared-disk and shared-disk-id options
On 12/05/2016 02:35 AM, zhanghailiang wrote:
> We use these two options to identify which disk is shared
>
> Signed-off-by: zhanghailiang
> Signed-off-by: Wen Congyang
> Signed-off-by: Zhang Chen
> ---
> v2:
> - add these two options for BlockdevOptionsReplication to support
>   qmp blockdev-add command.
> - fix a memory leak found by Changlong
> ---

> +++ b/qapi/block-core.json
> @@ -2232,12 +2232,19 @@
> #              node who owns the replication node chain. Must not be given in
> #              primary mode.
> #
> +# @shared-disk-id: #optional The id of shared disk while in replication mode.
> +#
> +# @shared-disk: #optional To indicate whether or not a disk is shared by
> +#               primary VM and secondary VM.

Both of these need a '(since 2.9)' designation since they've missed 2.8,
and could probably also use documentation on the default value assumed
when the parameter is omitted.

> +#
> # Since: 2.8
> ##
> { 'struct': 'BlockdevOptionsReplication',
>   'base': 'BlockdevOptionsGenericFormat',
>   'data': { 'mode': 'ReplicationMode',
> -            '*top-id': 'str' } }
> +            '*top-id': 'str',
> +            '*shared-disk-id': 'str',
> +            '*shared-disk': 'bool' } }
>
> ##
> # @NFSTransport

--
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
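Concretely, the annotations being asked for might end up looking
something like this in the schema documentation (illustrative wording
only; the default shown for @shared-disk is an assumption for the patch
author to confirm):

# @shared-disk-id: #optional The id of the shared disk while in
#                  replication mode. (Since 2.9)
#
# @shared-disk: #optional To indicate whether or not a disk is shared
#               by primary VM and secondary VM (default: false).
#               (Since 2.9)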
[Qemu-block] [PATCH for-2.8 v2] qcow2: Don't strand clusters near 2G intervals during commit
The qcow2_make_empty() function is reached during 'qemu-img commit', in order to clear out ALL clusters of an image. However, if the image cannot use the fast code path (true if the image is format 0.10, or if the image contains a snapshot), the cluster size is larger than 512, and the image is larger than 2G in size, then our choice of sector_step causes problems. Since it is not cluster aligned, but qcow2_discard_clusters() silently ignores an unaligned head or tail, we are leaving clusters allocated. Enhance the testsuite to expose the flaw, and patch the problem by ensuring our step size is aligned. Signed-off-by: Eric Blake --- v2: perform rounding correctly --- block/qcow2.c | 3 +- tests/qemu-iotests/097 | 41 +--- tests/qemu-iotests/097.out | 249 + 3 files changed, 210 insertions(+), 83 deletions(-) diff --git a/block/qcow2.c b/block/qcow2.c index ed9e0f3..96fb8a8 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -2808,7 +2808,8 @@ static int qcow2_make_empty(BlockDriverState *bs) { BDRVQcow2State *s = bs->opaque; uint64_t start_sector; -int sector_step = INT_MAX / BDRV_SECTOR_SIZE; +int sector_step = (QEMU_ALIGN_DOWN(INT_MAX, s->cluster_size) / + BDRV_SECTOR_SIZE); int l1_clusters, ret = 0; l1_clusters = DIV_ROUND_UP(s->l1_size, s->cluster_size / sizeof(uint64_t)); diff --git a/tests/qemu-iotests/097 b/tests/qemu-iotests/097 index 01d8dd0..4c33e80 100755 --- a/tests/qemu-iotests/097 +++ b/tests/qemu-iotests/097 @@ -46,7 +46,7 @@ _supported_proto file _supported_os Linux -# Four passes: +# Four main passes: # 0: Two-layer backing chain, commit to upper backing file (implicitly) # (in this case, the top image will be emptied) # 1: Two-layer backing chain, commit to upper backing file (explicitly) @@ -56,22 +56,30 @@ _supported_os Linux # 3: Two-layer backing chain, commit to lower backing file # (in this case, the top image will implicitly stay unchanged) # +# Each pass is run twice, since qcow2 has different code paths for cleaning +# an image depending on whether it has a snapshot. +# # 020 already tests committing, so this only tests whether image chains are # working properly and that all images above the base are emptied; therefore, -# no complicated patterns are necessary +# no complicated patterns are necessary. Check near the 2G mark, as qcow2 +# has been buggy at that boundary in the past. 
for i in 0 1 2 3; do +for j in 0 1; do echo -echo "=== Test pass $i ===" +echo "=== Test pass $i.$j ===" echo -TEST_IMG="$TEST_IMG.base" _make_test_img 64M -TEST_IMG="$TEST_IMG.itmd" _make_test_img -b "$TEST_IMG.base" 64M -_make_test_img -b "$TEST_IMG.itmd" 64M +TEST_IMG="$TEST_IMG.base" _make_test_img 2100M +TEST_IMG="$TEST_IMG.itmd" _make_test_img -b "$TEST_IMG.base" 2100M +_make_test_img -b "$TEST_IMG.itmd" 2100M +if [ $j -eq 0 ]; then +$QEMU_IMG snapshot -c snap "$TEST_IMG" +fi -$QEMU_IO -c 'write -P 1 0 192k' "$TEST_IMG.base" | _filter_qemu_io -$QEMU_IO -c 'write -P 2 64k 128k' "$TEST_IMG.itmd" | _filter_qemu_io -$QEMU_IO -c 'write -P 3 128k 64k' "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c 'write -P 1 0x7ffd 192k' "$TEST_IMG.base" | _filter_qemu_io +$QEMU_IO -c 'write -P 2 0x7ffe 128k' "$TEST_IMG.itmd" | _filter_qemu_io +$QEMU_IO -c 'write -P 3 0x7fff 64k' "$TEST_IMG" | _filter_qemu_io if [ $i -lt 3 ]; then if [ $i == 0 ]; then @@ -88,12 +96,12 @@ if [ $i -lt 3 ]; then fi # Bottom should be unchanged -$QEMU_IO -c 'read -P 1 0 192k' "$TEST_IMG.base" | _filter_qemu_io +$QEMU_IO -c 'read -P 1 0x7ffd 192k' "$TEST_IMG.base" | _filter_qemu_io # Intermediate should contain changes from top -$QEMU_IO -c 'read -P 1 0 64k' "$TEST_IMG.itmd" | _filter_qemu_io -$QEMU_IO -c 'read -P 2 64k 64k' "$TEST_IMG.itmd" | _filter_qemu_io -$QEMU_IO -c 'read -P 3 128k 64k' "$TEST_IMG.itmd" | _filter_qemu_io +$QEMU_IO -c 'read -P 1 0x7ffd 64k' "$TEST_IMG.itmd" | _filter_qemu_io +$QEMU_IO -c 'read -P 2 0x7ffe 64k' "$TEST_IMG.itmd" | _filter_qemu_io +$QEMU_IO -c 'read -P 3 0x7fff 64k' "$TEST_IMG.itmd" | _filter_qemu_io # And in pass 0, the top image should be empty, whereas in both other passes # it should be unchanged (which is both checked by qemu-img map) @@ -101,9 +109,9 @@ else $QEMU_IMG commit -b "$TEST_IMG.base" "$TEST_IMG" # Bottom should contain all changes -$QEMU_IO -c 'read -P 1 0 64k' "$TEST_IMG.base" | _filter_qemu_io -$QEMU_IO -c 'read -P 2 64k 64k' "$TEST_IMG.base" | _filter_qemu_io -$QEMU_IO -c 'read -P 3 128k 64k' "$TEST_IMG.base" | _filter_qemu_io +$QEMU_IO -c 'read -P 1 0x7ffd 64k' "$TEST_IMG.base" | _filter_qemu_io +$QEMU_IO -c 'read -P 2 0x7ffe 64k' "$TEST_IMG.base" | _filter_qemu_io +$QEMU_IO -c 'read -P 3 0x7fff 64k' "$TEST_IMG.base" | _filter_qemu_io # Both top and intermediate should be unchanged fi
Re: [Qemu-block] [Qemu-devel] [PATCH for-2.8] qcow2: Don't strand clusters near 2G intervals during commit
On 12/03/2016 11:34 AM, Eric Blake wrote:
> The qcow2_make_empty() function is reached during 'qemu-img commit',
> in order to clear out ALL clusters of an image. However, if the
> image cannot use the fast code path (true if the image is format
> 0.10, or if the image contains a snapshot), the cluster size is
> larger than 512, and the image is larger than 2G in size, then our
> choice of sector_step causes problems. Since it is not cluster
> aligned, but qcow2_discard_clusters() silently ignores an unaligned
> head or tail, we are leaving clusters allocated.
>
> Enhance the testsuite to expose the flaw, and patch the problem by
> ensuring our step size is aligned.
>
> [qcow2_discard_clusters() is a GROSS interface: it takes a mix of
> byte offset and sector count to perform cluster operations. But
> fixing it to use a saner byte/byte rather than byte/sector interface,
> and/or asserting that the counts are now aligned thanks to both
> this patch and commit 3482b9b, is material for another day.]
>
> Signed-off-by: Eric Blake
> ---
>  block/qcow2.c              |   3 +-
>  tests/qemu-iotests/097     |  41 +---
>  tests/qemu-iotests/097.out | 249 +
>  3 files changed, 210 insertions(+), 83 deletions(-)

> +++ b/block/qcow2.c
> @@ -2808,7 +2808,8 @@ static int qcow2_make_empty(BlockDriverState *bs)
>  {
>      BDRVQcow2State *s = bs->opaque;
>      uint64_t start_sector;
> -    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
> +    int sector_step = QEMU_ALIGN_DOWN(INT_MAX / BDRV_SECTOR_SIZE,
> +                                      s->cluster_size);

Oh shoot. I got the units wrong, and made the slow path do more loop
iterations than necessary (rounding sectors to cluster size in bytes is
inappropriate - either the rounding has to occur before division, or
the rounding needs to be by sectors instead of bytes).

I'll send a v2 that gets the math right.

--
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
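The units mix-up is easy to see with concrete numbers. A standalone
sketch (QEMU's QEMU_ALIGN_DOWN is re-implemented locally here; a 64k
cluster size is assumed):

#include <limits.h>
#include <stdio.h>

#define BDRV_SECTOR_SIZE 512
#define ALIGN_DOWN(n, m) ((n) / (m) * (m))  /* like QEMU_ALIGN_DOWN */

int main(void)
{
    int cluster_size = 65536;  /* 64k clusters */

    /* v1 (wrong units): aligns a *sector* count down to a *byte*
     * boundary, giving a needlessly small step and therefore extra
     * iterations of the discard loop. */
    int v1 = ALIGN_DOWN(INT_MAX / BDRV_SECTOR_SIZE, cluster_size);

    /* v2 (correct): align in bytes first, then convert to sectors. */
    int v2 = ALIGN_DOWN(INT_MAX, cluster_size) / BDRV_SECTOR_SIZE;

    printf("v1 = %d sectors, v2 = %d sectors\n", v1, v2);
    /* prints: v1 = 4128768 sectors, v2 = 4194176 sectors */
    return 0;
}

Both values happen to be cluster aligned for this cluster size, which
is why the v1 formula was merely slower rather than incorrect.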
Re: [Qemu-block] [PATCH v5 7/9] block: don't make snapshots for filters
Am 05.12.2016 um 12:49 hat Pavel Dovgalyuk geschrieben: > > From: Kevin Wolf [mailto:kw...@redhat.com] > > Am 05.12.2016 um 08:43 hat Pavel Dovgalyuk geschrieben: > > > Paolo, > > > > > > > From: Kevin Wolf [mailto:kw...@redhat.com] > > > > Am 21.11.2016 um 13:22 hat Pavel Dovgalyuk geschrieben: > > > > > > From: Kevin Wolf [mailto:kw...@redhat.com] > > > > > > Am 16.11.2016 um 10:49 hat Pavel Dovgalyuk geschrieben: > > > > > > > > > > > > How does that bypass blkreplay? blk->root is supposed to be the > > > > > > blkreply > > > > > > node, do you see something different? If it were the qcow2 node, > > > > > > then I > > > > > > would expect that no requests at all go through the blkreplay layer. > > > > > > > > > > It seems, that the problem is in -snapshot option. > > > > > We have one of the following block layers depending on command line: > > > > > tmp_overlay1 -> blkreplay -> tmp_overlay2 -> disk_image > > > > > tmp_overlay1 -> blkreplay -> disk_image > > > > > > > > > > But the correct scheme is intended to be the following: > > > > > blkreplay -> tmp_overlay1 -> disk_image > > > > > > > > > > How can we fix it? > > > > > Maybe we should add blkreplay automatically to all block devices and > > > > > not > > > > > to specify it in the command line? > > > > > > > > I think you found a pretty fundamental design problem with "global" > > > > drive options that add a filter node such as -snapshot and replay mode > > > > (replay mode isn't one of them today, but your suggestion to make it > > > > automatic would turn it into one). > > > > > > > > At the core of the problem I think we have two questions: > > > > > > > > 1. Which block nodes should be affected and get a filter node added, and > > > >which nodes shouldn't get one? In your case, disl_image is defined > > > >with a -drive option, but shouldn't get the snapshot. > > > > > > > > 2. In which order should filter nodes be added? > > > > > > > > Both of these questions feel hard. As long as we haven't thought through > > > > the concept as such (rather than discussing one-off hacks) and we're not > > > > completely certain what the right answer to the questions is, we > > > > shouldn't add more automatic filter nodes, because chances are that we > > > > get it wrong and would regret it. > > > > > > > > The obvious answer for a workaround would be: Make everything manual, > > > > i.e. don't use -snapshot, but create a qcow2 overlay manually. > > > > > > What about to switching to manual overlay creation by default? > > > We can make rrsnapshot option mandatory. > > > Therefore user will have to create snapshot in image or overlay and > > > the disk image will not be corrupted. > > > > > > It is not very convenient, but we could disable rrsnapshot again when > > > the solution for -snapshot will be found. > > > > Hm, what is this rrsnapshot option? git grep can't find it. > > It was a patch that was not included yet. > This option creates/loads vm snapshot in record/replay mode > leaving original disk image unchanged. You mean internal qcow2 snapshots? Ok. > Record/replay without this option uses '-snapshot' to preserve > the state of the disk images. > > > Anyway, it seems that doing things manually is the safe way as long as > > we don't know the final solution, so I think I agree. > > > > For a slightly more convenient way, one of the problems to solve seems > > to be that snapshot=on always affects the top level node and you can't > > create a temporary snapshot in the middle of the chain. 
Perhaps we > > should introduce a 'temporary-overlay' driver or something like that, so > > that you could specify things like this: > > > > -drive if=none,driver=file,filename=test.img,id=orig > > -drive if=none,driver=temporary-overlay,file=orig,id=snap > > -drive if=none,driver=blkreplay,image=snap > > This seems reasonable for manual way. Maybe another, easier to implement option could be something like this: -drive if=none,driver=file,filename=test.img,snapshot=on,overlay.node-name=snap -drive if=none,driver=blkreplay,image=snap It would require that we implement support for overlay.* options like we already support backing.* options. Allowing to specify options for the overlay node is probably nice to have anyway. However, there could be a few tricky parts there. For example, what happens if someone uses overlay.backing=something-else? Perhaps completely disallowing backing and backing.* for overlays would already solve this. > > Which makes me wonder... Is blkreplay usable without the temporary > > snapshot or is this pretty much a requirement? > > It's not a requirement. But to make replay deterministic we have to > start with the same image every time. As I know, this may be achieved by: > 1. Restoring original disk image manually > 2. Using vm snapshot to start execution from > 3. Using -snapshot option > 4. Not using disks at all > > > Because if it has to be > > there, the next step could be that blk
[Qemu-block] Meeting notes on -blockdev, dynamic backend reconfiguration
I recently met Kevin, and we discussed two block layer topics in some
depth.

= -blockdev =

We want a command line option to mirror QMP blockdev-add for 2.9.

QemuOpts has to grow from "list of (key, simple value) plus conventions
to support lists of simple values in limited ways" to the expressive
power of JSON.

== Basic idea ==

QMP pipeline: JSON string - JSON parser - QObject - QObject input
visitor - QAPI object.

For commands with 'gen': false, we stop at QObject. These are rare.

Command line now: option argument string - QemuOpts parser - QemuOpts.
We occasionally continue - options or string input visitor - QAPI
object. Both visitors can't do arbitrary QAPI objects. Both visitors
extend QemuOpts syntax.

Daniel Berrange posted patches to instead do - crumple - QObject -
qobject input visitor - QAPI object. Arbitrary QObjects (thus QAPI
objects) are possible with the dotted key convention, which is already
used by the block layer. As before, a visitor sits on top of QemuOpts,
providing syntax extensions.

Stacking parsers like that is not a good idea. We want *one* option
argument parser, and we need it to yield a QObject.

== Backward compatibility issues ==

* Traditional key=value,... syntax
* The "repeated key is list" hack
* Options and string input visitor syntax extensions
* Dotted key convention

Hopefully, most of the solutions can be adapted from Daniel's patches.

== Type ambiguity ==

In JSON, the type of a value is syntactically obvious. The JSON parser
yields a QObject with these types. The QObject input visitor rejects
values with types that don't match the QAPI schema.

In the traditional key=value command line syntax, the type of a value
isn't obvious. The options and string input visitors convert the string
value to the type expected by the QAPI schema. Unlike a QObject from
JSON, a QObject from QemuOpts has only string values, and the QObject
input visitor needs to be able to convert instead of reject. Daniel's
patches do that.

== Action item ==

Markus to explore the proposed solution as soon as possible.

= Dynamic block backend reconfiguration =

== Mirror job ==

State before the job:

    frontend
       |
       BB
       |
      BDS

Frontend writes flow down.

Passive mirror job, as it currently works:

    frontend    mirror-job
       |         |     |
       BB        BB'   BB2
       |        /       |
       |       /        |
      BDS             BDS2

The mirror job copies the contents of BDS to BDS2. To handle frontend
writes, BDS maintains a dirty bitmap, which mirror-job uses to copy
updates from BB' to BB2.

Pivot to mirror on job completion: replace BB's child BDS by BDS2,
delete mirror-job and its BB', BB2.

    frontend
       |
       BB
        \_
          \
      BDS  BDS2

Future mirror job using a mirror-filter:

    frontend    mirror-job
       |           |
       BB          BB'
       |          /
    mirror-filter
       |   \
      BDS   BDS2

Passive mirror-filter: maintains dirty bitmap, copies from BDS to BDS2.
Active mirror-filter: no dirty bitmap, mirrors writes to BDS2 directly.
Can easily switch from passive to active at any time.

Pivot: replace parent of mirror-filter's child mirror-filter by BDS2,
delete mirror job and its BB'. "Parent of" in case other filters have
been inserted: we drop the ones below mirror-filter, and keep the ones
above.

== Backup job ==

Current backup job:

    frontend    backup-job
       |         |     |
       BB        BB'   BB2
       |        /       |
       |       /        |
      BDS             BDS2

The backup job copies the contents of BDS to BDS2. To handle frontend
writes, BDS provides a before-write-notifier, backup-job uses it to
copy old data from BB' to BB2 right before it's overwritten.

Pivot: delete backup-job and its BB', BB2.

    frontend
       |
       BB
       |
      BDS   BDS2

Future backup job using a backup-filter:

    frontend    backup-job
       |           |
       BB          BB'
       |          /
    backup-filter
       |   \
      BDS   BDS2

backup-filter copies old data from BDS to BDS2 before it forwards
writes to BDS.

Pivot: replace parent of backup-filter's child backup-filter by BDS2,
delete backup-job and its BB'.

== Commit job ==

State before the job:

    frontend
       |
       BB
       |
     QCOW2
  file /  \ backing
      /    \
     /      \
   BDS1    QCOW2_top
        file /  \ backing
            /    .
       BDS_top   .
                  \
                 BDS_base

"file" and "backing" are the QCOW2 child names for the delta image and
the backing image, respectively. Frontend writes flow to BDS1.

Current commit job to commit from QCOW2_top down to BDS_base:

    frontend
       |
       BB     commit-job
       |       /     \
     QCOW2   BB_top  BB_base
     file / \ b
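For the dotted-key step, a sketch of the intended pipeline in terms of
QEMU's internal APIs (qdict_crumple() is the primitive from Daniel's
series; treat the exact calls as illustrative of the idea, not as
final code):

#include "qemu/osdep.h"
#include "qapi/qmp/qdict.h"
#include "qapi/qmp/qstring.h"
#include "qapi/error.h"

static QObject *crumple_example(void)
{
    QDict *flat = qdict_new();

    /* What a single option argument parser would produce for
     * "-blockdev driver=qcow2,file.driver=file,file.filename=t.img":
     * a flat dict whose values are all strings. */
    qdict_put(flat, "driver", qstring_from_str("qcow2"));
    qdict_put(flat, "file.driver", qstring_from_str("file"));
    qdict_put(flat, "file.filename", qstring_from_str("t.img"));

    /* Un-flatten into {"driver": "qcow2",
     *                  "file": {"driver": "file",
     *                           "filename": "t.img"}}.
     * The QObject input visitor must then convert the string values
     * to the types the QAPI schema expects - the "type ambiguity"
     * issue discussed above. */
    QObject *nested = qdict_crumple(flat, &error_abort);

    QDECREF(flat);
    return nested;
}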
Re: [Qemu-block] [PATCH v5 7/9] block: don't make snapshots for filters
> From: Kevin Wolf [mailto:kw...@redhat.com] > Am 05.12.2016 um 08:43 hat Pavel Dovgalyuk geschrieben: > > Paolo, > > > > > From: Kevin Wolf [mailto:kw...@redhat.com] > > > Am 21.11.2016 um 13:22 hat Pavel Dovgalyuk geschrieben: > > > > > From: Kevin Wolf [mailto:kw...@redhat.com] > > > > > Am 16.11.2016 um 10:49 hat Pavel Dovgalyuk geschrieben: > > > > > > > > > > How does that bypass blkreplay? blk->root is supposed to be the > > > > > blkreply > > > > > node, do you see something different? If it were the qcow2 node, then > > > > > I > > > > > would expect that no requests at all go through the blkreplay layer. > > > > > > > > It seems, that the problem is in -snapshot option. > > > > We have one of the following block layers depending on command line: > > > > tmp_overlay1 -> blkreplay -> tmp_overlay2 -> disk_image > > > > tmp_overlay1 -> blkreplay -> disk_image > > > > > > > > But the correct scheme is intended to be the following: > > > > blkreplay -> tmp_overlay1 -> disk_image > > > > > > > > How can we fix it? > > > > Maybe we should add blkreplay automatically to all block devices and not > > > > to specify it in the command line? > > > > > > I think you found a pretty fundamental design problem with "global" > > > drive options that add a filter node such as -snapshot and replay mode > > > (replay mode isn't one of them today, but your suggestion to make it > > > automatic would turn it into one). > > > > > > At the core of the problem I think we have two questions: > > > > > > 1. Which block nodes should be affected and get a filter node added, and > > >which nodes shouldn't get one? In your case, disl_image is defined > > >with a -drive option, but shouldn't get the snapshot. > > > > > > 2. In which order should filter nodes be added? > > > > > > Both of these questions feel hard. As long as we haven't thought through > > > the concept as such (rather than discussing one-off hacks) and we're not > > > completely certain what the right answer to the questions is, we > > > shouldn't add more automatic filter nodes, because chances are that we > > > get it wrong and would regret it. > > > > > > The obvious answer for a workaround would be: Make everything manual, > > > i.e. don't use -snapshot, but create a qcow2 overlay manually. > > > > What about to switching to manual overlay creation by default? > > We can make rrsnapshot option mandatory. > > Therefore user will have to create snapshot in image or overlay and > > the disk image will not be corrupted. > > > > It is not very convenient, but we could disable rrsnapshot again when > > the solution for -snapshot will be found. > > Hm, what is this rrsnapshot option? git grep can't find it. It was a patch that was not included yet. This option creates/loads vm snapshot in record/replay mode leaving original disk image unchanged. Record/replay without this option uses '-snapshot' to preserve the state of the disk images. > Anyway, it seems that doing things manually is the safe way as long as > we don't know the final solution, so I think I agree. > > For a slightly more convenient way, one of the problems to solve seems > to be that snapshot=on always affects the top level node and you can't > create a temporary snapshot in the middle of the chain. 
Perhaps we > should introduce a 'temporary-overlay' driver or something like that, so > that you could specify things like this: > > -drive if=none,driver=file,filename=test.img,id=orig > -drive if=none,driver=temporary-overlay,file=orig,id=snap > -drive if=none,driver=blkreplay,image=snap This seems reasonable for manual way. > Which makes me wonder... Is blkreplay usable without the temporary > snapshot or is this pretty much a requirement? It's not a requirement. But to make replay deterministic we have to start with the same image every time. As I know, this may be achieved by: 1. Restoring original disk image manually 2. Using vm snapshot to start execution from 3. Using -snapshot option 4. Not using disks at all > Because if it has to be > there, the next step could be that blkreplay creates temporary-overlay > internally in its .bdrv_open(). Here is your answer about such an approach :) https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg04687.html Pavel Dovgalyuk
Re: [Qemu-block] [PATCH for-2.8] qcow2: Don't strand clusters near 2G intervals during commit
On Sat, Dec 03, 2016 at 11:34:02AM -0600, Eric Blake wrote:
> The qcow2_make_empty() function is reached during 'qemu-img commit',
> in order to clear out ALL clusters of an image. However, if the
> image cannot use the fast code path (true if the image is format
> 0.10, or if the image contains a snapshot), the cluster size is
> larger than 512, and the image is larger than 2G in size, then our
> choice of sector_step causes problems. Since it is not cluster
> aligned, but qcow2_discard_clusters() silently ignores an unaligned
> head or tail, we are leaving clusters allocated.
>
> Enhance the testsuite to expose the flaw, and patch the problem by
> ensuring our step size is aligned.
>
> [qcow2_discard_clusters() is a GROSS interface: it takes a mix of
> byte offset and sector count to perform cluster operations. But
> fixing it to use a saner byte/byte rather than byte/sector interface,
> and/or asserting that the counts are now aligned thanks to both
> this patch and commit 3482b9b, is material for another day.]
>
> Signed-off-by: Eric Blake
> ---
>  block/qcow2.c              |   3 +-
>  tests/qemu-iotests/097     |  41 +---
>  tests/qemu-iotests/097.out | 249 +
>  3 files changed, 210 insertions(+), 83 deletions(-)

Reviewed-by: Stefan Hajnoczi
Re: [Qemu-block] [PATCH v3 0/2] less confusing block file names
On Fri, Dec 02, 2016 at 01:48:52PM -0600, Eric Blake wrote:
> no real change from v2 other than trivial rebasing and R-by:
> https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg07220.html
>
> but it's been more than a month since the last activity, and the
> start of the release cycle is as good a time as any to avoid any
> potential churn on conflicts due to renaming files
>
> applies on Kevin's block-next branch for 2.9
>
> Eric Blake (2):
>   block: Rename raw_bsd to raw-format.c
>   block: Rename raw-{posix,win32} to file-*.c
>
>  include/block/block_int.h           | 2 +-
>  block/{raw-posix.c => file-posix.c} | 0
>  block/{raw-win32.c => file-win32.c} | 0
>  block/gluster.c                     | 4 ++--
>  block/{raw_bsd.c => raw-format.c}   | 2 +-
>  MAINTAINERS                         | 6 +++---
>  block/Makefile.objs                 | 6 +++---
>  block/trace-events                  | 4 ++--
>  configure                           | 2 +-
>  9 files changed, 13 insertions(+), 13 deletions(-)
>  rename block/{raw-posix.c => file-posix.c} (100%)
>  rename block/{raw-win32.c => file-win32.c} (100%)
>  rename block/{raw_bsd.c => raw-format.c} (99%)
>
> --
> 2.9.3
>

Reviewed-by: Stefan Hajnoczi
Re: [Qemu-block] [PATCH v5 7/9] block: don't make snapshots for filters
Am 05.12.2016 um 08:43 hat Pavel Dovgalyuk geschrieben:
> Paolo,
>
> > From: Kevin Wolf [mailto:kw...@redhat.com]
> > Am 21.11.2016 um 13:22 hat Pavel Dovgalyuk geschrieben:
> > > > From: Kevin Wolf [mailto:kw...@redhat.com]
> > > > Am 16.11.2016 um 10:49 hat Pavel Dovgalyuk geschrieben:
> > > > > > From: Pavel Dovgalyuk [mailto:dovga...@ispras.ru]
> > > > >
> > > > > I've investigated this issue.
> > > > > This command line works ok:
> > > > > -drive driver=blkreplay,if=none,image.driver=file,image.filename=testdisk.qcow,id=img-blkreplay
> > > > > -device ide-hd,drive=img-blkreplay
> > > > >
> > > > > And this does not:
> > > > > -drive driver=blkreplay,if=none,image.driver=qcow2,image.file.driver=file,image.file.filename=testdisk.qcow,id=img-blkreplay
> > > > > -device ide-hd,drive=img-blkreplay
> > > > >
> > > > > QEMU hangs at some moment of replay.
> > > > >
> > > > > I found that some dma requests do not pass through the blkreplay driver
> > > > > due to the following line in block-backend.c:
> > > > > return bdrv_co_preadv(blk->root, offset, bytes, qiov, flags);
> > > > >
> > > > > This line passes read request directly to qcow driver and blkreplay cannot
> > > > > process it to make deterministic.
> > > >
> > > > How does that bypass blkreplay? blk->root is supposed to be the blkreply
> > > > node, do you see something different? If it were the qcow2 node, then I
> > > > would expect that no requests at all go through the blkreplay layer.
> > >
> > > It seems, that the problem is in -snapshot option.
> > > We have one of the following block layers depending on command line:
> > > tmp_overlay1 -> blkreplay -> tmp_overlay2 -> disk_image
> > > tmp_overlay1 -> blkreplay -> disk_image
> > >
> > > But the correct scheme is intended to be the following:
> > > blkreplay -> tmp_overlay1 -> disk_image
> > >
> > > How can we fix it?
> > > Maybe we should add blkreplay automatically to all block devices and not
> > > to specify it in the command line?
> >
> > I think you found a pretty fundamental design problem with "global"
> > drive options that add a filter node such as -snapshot and replay mode
> > (replay mode isn't one of them today, but your suggestion to make it
> > automatic would turn it into one).
> >
> > At the core of the problem I think we have two questions:
> >
> > 1. Which block nodes should be affected and get a filter node added, and
> >    which nodes shouldn't get one? In your case, disl_image is defined
> >    with a -drive option, but shouldn't get the snapshot.
> >
> > 2. In which order should filter nodes be added?
> >
> > Both of these questions feel hard. As long as we haven't thought through
> > the concept as such (rather than discussing one-off hacks) and we're not
> > completely certain what the right answer to the questions is, we
> > shouldn't add more automatic filter nodes, because chances are that we
> > get it wrong and would regret it.
> >
> > The obvious answer for a workaround would be: Make everything manual,
> > i.e. don't use -snapshot, but create a qcow2 overlay manually.
>
> What about to switching to manual overlay creation by default?
> We can make rrsnapshot option mandatory.
> Therefore user will have to create snapshot in image or overlay and
> the disk image will not be corrupted.
>
> It is not very convenient, but we could disable rrsnapshot again when
> the solution for -snapshot will be found.

Hm, what is this rrsnapshot option? git grep can't find it.

Anyway, it seems that doing things manually is the safe way as long as
we don't know the final solution, so I think I agree.

For a slightly more convenient way, one of the problems to solve seems
to be that snapshot=on always affects the top level node and you can't
create a temporary snapshot in the middle of the chain. Perhaps we
should introduce a 'temporary-overlay' driver or something like that, so
that you could specify things like this:

    -drive if=none,driver=file,filename=test.img,id=orig
    -drive if=none,driver=temporary-overlay,file=orig,id=snap
    -drive if=none,driver=blkreplay,image=snap

Which makes me wonder... Is blkreplay usable without the temporary
snapshot or is this pretty much a requirement? Because if it has to be
there, the next step could be that blkreplay creates temporary-overlay
internally in its .bdrv_open().

Kevin
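A rough sketch of what such a filter driver could look like, modeled on
blkreplay's own structure (the driver name, the "file" option key and
the overlay-creation step are all hypothetical; error handling and the
write/flush callbacks are elided):

#include "qemu/osdep.h"
#include "block/block_int.h"

static int temporary_overlay_open(BlockDriverState *bs, QDict *options,
                                  int flags, Error **errp)
{
    /* Open the node given as file=... as our child, the same way
     * blkreplay opens its image=... option. */
    bs->file = bdrv_open_child(NULL, options, "file", bs,
                               &child_file, false, errp);
    if (!bs->file) {
        return -EINVAL;
    }

    /* A real implementation would create the temporary qcow2 overlay
     * here and forward all I/O to it, similar to what snapshot=on
     * does for the top level today. */
    return 0;
}

static coroutine_fn int
temporary_overlay_co_preadv(BlockDriverState *bs, uint64_t offset,
                            uint64_t bytes, QEMUIOVector *qiov, int flags)
{
    /* Pure filter for now: forward reads to the (eventual overlay)
     * child node. */
    return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
}

static BlockDriver bdrv_temporary_overlay = {
    .format_name    = "temporary-overlay",
    .bdrv_open      = temporary_overlay_open,
    .bdrv_co_preadv = temporary_overlay_co_preadv,
    .is_filter      = true,
};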
[Qemu-block] [PATCH RFC v2 4/6] replication: fix code logic with the new shared_disk option
Some code logic is only needed in the non-shared disk case; adjust that
code here to prepare for the shared-disk scenario.

Signed-off-by: zhanghailiang
---
 block/replication.c | 47 ---
 1 file changed, 28 insertions(+), 19 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 35e9ab3..6574cc2 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -531,21 +531,28 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
             aio_context_release(aio_context);
             return;
         }
-        bdrv_op_block_all(top_bs, s->blocker);
-        bdrv_op_unblock(top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
-        job = backup_job_create(NULL, s->secondary_disk->bs, s->hidden_disk->bs,
-                                0, MIRROR_SYNC_MODE_NONE, NULL, false,
+        /*
+         * Only in the case of non-shared disk,
+         * the backup job is in the secondary side
+         */
+        if (!s->is_shared_disk) {
+            bdrv_op_block_all(top_bs, s->blocker);
+            bdrv_op_unblock(top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
+            job = backup_job_create(NULL, s->secondary_disk->bs,
+                                    s->hidden_disk->bs, 0,
+                                    MIRROR_SYNC_MODE_NONE, NULL, false,
                                 BLOCKDEV_ON_ERROR_REPORT, BLOCKDEV_ON_ERROR_REPORT,
                                 BLOCK_JOB_INTERNAL, backup_job_completed,
                                 bs, NULL, &local_err);
-        if (local_err) {
-            error_propagate(errp, local_err);
-            backup_job_cleanup(bs);
-            aio_context_release(aio_context);
-            return;
+            if (local_err) {
+                error_propagate(errp, local_err);
+                backup_job_cleanup(bs);
+                aio_context_release(aio_context);
+                return;
+            }
+            block_job_start(job);
         }
-        block_job_start(job);

         secondary_do_checkpoint(s, errp);
         break;
@@ -575,14 +582,16 @@ static void replication_do_checkpoint(ReplicationState *rs, Error **errp)
     case REPLICATION_MODE_PRIMARY:
         break;
     case REPLICATION_MODE_SECONDARY:
-        if (!s->secondary_disk->bs->job) {
-            error_setg(errp, "Backup job was cancelled unexpectedly");
-            break;
-        }
-        backup_do_checkpoint(s->secondary_disk->bs->job, &local_err);
-        if (local_err) {
-            error_propagate(errp, local_err);
-            break;
+        if (!s->is_shared_disk) {
+            if (!s->secondary_disk->bs->job) {
+                error_setg(errp, "Backup job was cancelled unexpectedly");
+                break;
+            }
+            backup_do_checkpoint(s->secondary_disk->bs->job, &local_err);
+            if (local_err) {
+                error_propagate(errp, local_err);
+                break;
+            }
         }
         secondary_do_checkpoint(s, errp);
         break;
@@ -663,7 +672,7 @@ static void replication_stop(ReplicationState *rs, bool failover, Error **errp)
          * before the BDS is closed, because we will access hidden
          * disk, secondary disk in backup_job_completed().
          */
-        if (s->secondary_disk->bs->job) {
+        if (!s->is_shared_disk && s->secondary_disk->bs->job) {
             block_job_cancel_sync(s->secondary_disk->bs->job);
         }

--
1.8.3.1
[Qemu-block] [PATCH RFC v2 0/6] COLO block replication supports shared disk case
COLO block replication doesn't support the shared disk case; here we
try to implement it. For the details of the shared-disk scenario,
please refer to patch 1.

COLO code with shared-disk block replication can be found at:
https://github.com/coloft/qemu/tree/colo-developing-with-shared-disk-2016-12-5

Test procedures:

1. Secondary:
# x86_64-softmmu/qemu-system-x86_64 -boot c -m 2048 -smp 2 -qmp stdio -vnc :9 -name secondary -enable-kvm -cpu qemu64,+kvmclock -device piix3-usb-uhci -drive if=none,driver=qcow2,file.filename=/mnt/ramfs/hidden_disk.img,id=hidden_disk0,backing.driver=raw,backing.file.filename=/work/kvm/suse11_sp3_64 -drive if=ide,id=active-disk0,driver=replication,mode=secondary,file.driver=qcow2,top-id=active-disk0,file.file.filename=/mnt/ramfs/active_disk.img,file.backing=hidden_disk0,shared-disk=on -incoming tcp:0:

Issue qmp commands:
{'execute':'qmp_capabilities'}
{'execute': 'nbd-server-start', 'arguments': {'addr': {'type': 'inet', 'data': {'host': '0', 'port': '9998'} } } }
{'execute': 'nbd-server-add', 'arguments': {'device': 'hidden_disk0', 'writable': true } }

2. Primary:
# x86_64-softmmu/qemu-system-x86_64 -enable-kvm -m 2048 -smp 2 -qmp stdio -vnc :9 -name primary -cpu qemu64,+kvmclock -device piix3-usb-uhci -drive if=virtio,id=primary_disk0,file.filename=/work/kvm/suse11_sp3_64,driver=raw -S

Issue qmp commands:
{'execute':'qmp_capabilities'}
{'execute': 'human-monitor-command', 'arguments': {'command-line': 'drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=9.42.3.17,file.port=9998,file.export=hidden_disk0,shared-disk-id=primary_disk0,shared-disk=on,node-name=rep'}}
{'execute': 'migrate-set-capabilities', 'arguments': {'capabilities': [ {'capability': 'x-colo', 'state': true } ] } }
{'execute': 'migrate', 'arguments': {'uri': 'tcp:9.42.3.17:' } }

3. Failover:
Secondary side:
Issue qmp commands:
{ 'execute': 'nbd-server-stop' }
{ "execute": "x-colo-lost-heartbeat" }

Please review; any comments are welcome.

Cc: Juan Quintela
Cc: Amit Shah
Cc: Dr. David Alan Gilbert (git)
Cc: eddie.d...@intel.com

v2:
- Drop the patch which added a blk_root() helper
- Fix some comments from Changlong

zhanghailiang (6):
  docs/block-replication: Add description for shared-disk case
  replication: add shared-disk and shared-disk-id options
  replication: Split out backup_do_checkpoint() from
    secondary_do_checkpoint()
  replication: fix code logic with the new shared_disk option
  replication: Implement block replication for shared disk case
  nbd/replication: implement .bdrv_get_info() for nbd and replication
    driver

 block/nbd.c                |  12
 block/replication.c        | 156 +++--
 docs/block-replication.txt | 139 ++--
 qapi/block-core.json       |   9 ++-
 4 files changed, 278 insertions(+), 38 deletions(-)

--
1.8.3.1
[Qemu-block] [PATCH RFC v2 1/6] docs/block-replication: Add description for shared-disk case
Introduce the scenario of shared-disk block replication and how to use
it.

Signed-off-by: zhanghailiang
Signed-off-by: Wen Congyang
Signed-off-by: Zhang Chen
---
v2:
- fix some problems found by Changlong
---
 docs/block-replication.txt | 139 +++-
 1 file changed, 135 insertions(+), 4 deletions(-)

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
index 6bde673..fbfe005 100644
--- a/docs/block-replication.txt
+++ b/docs/block-replication.txt
@@ -24,7 +24,7 @@ only dropped at next checkpoint time. To reduce the network transportation
 effort during a vmstate checkpoint, the disk modification operations of
 the Primary disk are asynchronously forwarded to the Secondary node.

-== Workflow ==
+== Non-shared disk workflow ==
 The following is the image of block replication workflow:

@@ -57,7 +57,7 @@ The following is the image of block replication workflow:
 4) Secondary write requests will be buffered in the Disk buffer and it
    will overwrite the existing sector content in the buffer.

-== Architecture ==
+== Non-shared disk architecture ==
 We are going to implement block replication from many basic
 blocks that are already in QEMU.

@@ -106,6 +106,74 @@ any state that would otherwise be lost by the speculative write-through
 of the NBD server into the secondary disk. So before block replication,
 the primary disk and secondary disk should contain the same data.

+== Shared Disk Mode Workflow ==
+The following is the image of block replication workflow:
+
+         +----------------------+            +------------------------+
+         |Primary Write Requests|            |Secondary Write Requests|
+         +----------------------+            +------------------------+
+                   |                                      |
+                   |                                     (4)
+                   |                                      V
+                   |                              /-------------\
+                   |  (2)Forward and write through|             |
+                   |  +-------------------------> | Disk Buffer |
+                   |  |                           |             |
+                   |  |                           \-------------/
+                   |  |(1)read                           |
+                   |  |                                  |
+          (3)write |  |                                  | backing file
+                   V  |                                  |
+              +-------------+                            |
+              | Shared Disk | <--------------------------+
+              +-------------+
+
+1) Primary writes will read original data and forward it to Secondary
+   QEMU.
+2) Before Primary write requests are written to Shared disk, the
+   original sector content will be read from Shared disk and
+   forwarded and buffered in the Disk buffer on the secondary site,
+   but it will not overwrite the existing sector content (it could be
+   from either "Secondary Write Requests" or previous COW of "Primary
+   Write Requests") in the Disk buffer.
+3) Primary write requests will be written to Shared disk.
+4) Secondary write requests will be buffered in the Disk buffer and it
+   will overwrite the existing sector content in the buffer.
+
+== Shared Disk Mode Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+
+         virtio-blk                 ||
+             /                      ||            .----------------
+            /                       ||            |   Secondary
+           /                        ||            '----------------
+          /                         ||                virtio-blk
+         |                          ||                     |
+         |                          ||              replication(5)
+         |    NBD  -------------->  NBD  (2)               |
+         |  client                  ||server ---> hidden disk <-- active disk(4)
+         |     ^                    ||
+         |  replication(1)          ||
+         |     |                    ||
+         |     +------'             ||
+    (3)  |drive-backup sync=none    ||
+  .------+--------'                 ||
+  Primary |                         ||  backing
+  '-------+                         ||
+          V                         ||
+     +--------------+               ||
+     |
[Qemu-block] [PATCH RFC v2 2/6] replication: add shared-disk and shared-disk-id options
We use these two options to identify which disk is shared Signed-off-by: zhanghailiang Signed-off-by: Wen Congyang Signed-off-by: Zhang Chen --- v2: - add these two options for BlockdevOptionsReplication to support qmp blockdev-add command. - fix a memory leak found by Changlong --- block/replication.c | 37 + qapi/block-core.json | 9 - 2 files changed, 45 insertions(+), 1 deletion(-) diff --git a/block/replication.c b/block/replication.c index 729dd12..e87ae87 100644 --- a/block/replication.c +++ b/block/replication.c @@ -25,9 +25,12 @@ typedef struct BDRVReplicationState { ReplicationMode mode; int replication_state; +bool is_shared_disk; +char *shared_disk_id; BdrvChild *active_disk; BdrvChild *hidden_disk; BdrvChild *secondary_disk; +BdrvChild *primary_disk; char *top_id; ReplicationState *rs; Error *blocker; @@ -53,6 +56,9 @@ static void replication_stop(ReplicationState *rs, bool failover, #define REPLICATION_MODE"mode" #define REPLICATION_TOP_ID "top-id" +#define REPLICATION_SHARED_DISK "shared-disk" +#define REPLICATION_SHARED_DISK_ID "shared-disk-id" + static QemuOptsList replication_runtime_opts = { .name = "replication", .head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head), @@ -65,6 +71,14 @@ static QemuOptsList replication_runtime_opts = { .name = REPLICATION_TOP_ID, .type = QEMU_OPT_STRING, }, +{ +.name = REPLICATION_SHARED_DISK_ID, +.type = QEMU_OPT_STRING, +}, +{ +.name = REPLICATION_SHARED_DISK, +.type = QEMU_OPT_BOOL, +}, { /* end of list */ } }, }; @@ -85,6 +99,9 @@ static int replication_open(BlockDriverState *bs, QDict *options, QemuOpts *opts = NULL; const char *mode; const char *top_id; +const char *shared_disk_id; +BlockBackend *blk; +BlockDriverState *tmp_bs; ret = -EINVAL; opts = qemu_opts_create(&replication_runtime_opts, NULL, 0, &error_abort); @@ -119,6 +136,25 @@ static int replication_open(BlockDriverState *bs, QDict *options, "The option mode's value should be primary or secondary"); goto fail; } +s->is_shared_disk = qemu_opt_get_bool(opts, REPLICATION_SHARED_DISK, + false); +if (s->is_shared_disk && (s->mode == REPLICATION_MODE_PRIMARY)) { +shared_disk_id = qemu_opt_get(opts, REPLICATION_SHARED_DISK_ID); +if (!shared_disk_id) { +error_setg(&local_err, "Missing shared disk blk"); +goto fail; +} +s->shared_disk_id = g_strdup(shared_disk_id); +blk = blk_by_name(s->shared_disk_id); +if (!blk) { +g_free(s->shared_disk_id); +error_setg(&local_err, "There is no %s block", s->shared_disk_id); +goto fail; +} +/* We can't access root member of BlockBackend directly */ +tmp_bs = blk_bs(blk); +s->primary_disk = QLIST_FIRST(&tmp_bs->parents); +} s->rs = replication_new(bs, &replication_ops); @@ -135,6 +171,7 @@ static void replication_close(BlockDriverState *bs) { BDRVReplicationState *s = bs->opaque; +g_free(s->shared_disk_id); if (s->replication_state == BLOCK_REPLICATION_RUNNING) { replication_stop(s->rs, false, NULL); } diff --git a/qapi/block-core.json b/qapi/block-core.json index c29bef7..52d7e0d 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -2232,12 +2232,19 @@ # node who owns the replication node chain. Must not be given in # primary mode. # +# @shared-disk-id: #optional The id of shared disk while in replication mode. +# +# @shared-disk: #optional To indicate whether or not a disk is shared by +# primary VM and secondary VM. 
+#
 # Since: 2.8
 ##
 { 'struct': 'BlockdevOptionsReplication',
   'base': 'BlockdevOptionsGenericFormat',
   'data': { 'mode': 'ReplicationMode',
-            '*top-id': 'str' } }
+            '*top-id': 'str',
+            '*shared-disk-id': 'str',
+            '*shared-disk': 'bool' } }

 ##
 # @NFSTransport
--
1.8.3.1
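[Editorial note] As a rough illustration of the validation rule this patch adds to replication_open() (shared-disk-id is required, and must name an existing block backend, only when shared-disk=on in primary mode), here is a self-contained C sketch. The ReplicationOpts type and validate_opts() helper are invented stand-ins, not QEMU code.

    #include <stdio.h>
    #include <stdbool.h>
    #include <string.h>

    /* Toy stand-in for the replication driver's runtime options. */
    typedef struct ReplicationOpts {
        const char *mode;            /* "primary" or "secondary" */
        bool        shared_disk;     /* the new "shared-disk" option */
        const char *shared_disk_id;  /* the new "shared-disk-id" option */
    } ReplicationOpts;

    /* In primary mode with shared-disk=on, shared-disk-id is mandatory.
     * The real code additionally looks the id up with blk_by_name() and
     * fails if no such BlockBackend exists. */
    static int validate_opts(const ReplicationOpts *o, const char **err)
    {
        if (o->shared_disk && strcmp(o->mode, "primary") == 0 &&
            !o->shared_disk_id) {
            *err = "Missing shared disk blk";
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        ReplicationOpts bad  = { "primary", true, NULL };
        ReplicationOpts good = { "primary", true, "shared-disk0" };
        const char *err = NULL;

        if (validate_opts(&bad, &err) < 0) {
            printf("rejected: %s\n", err);
        }
        if (validate_opts(&good, &err) == 0) {
            printf("accepted: shared-disk-id=%s\n", good.shared_disk_id);
        }
        return 0;
    }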
[Qemu-block] [PATCH RFC v2 5/6] replication: Implement block replication for shared disk case
Just as in the non-shared disk block replication scenario, we are going
to implement block replication from many basic blocks that are already
in QEMU. The architecture is:

[ASCII diagram: the Primary's replication driver (1) runs a drive-backup
 sync=none job (3) whose COW data goes through the NBD client to the
 Secondary's NBD server (2); the server writes into a hidden disk that
 backs the active disk (4) exposed to the Secondary's virtio-blk via the
 replication driver (5), with the shared disk as the common backing file
 of the chain]

1) Primary writes will read original data and forward it to Secondary
   QEMU.
2) The hidden-disk is created automatically. It buffers the original
   content that is modified by the primary VM. It should be an empty
   disk, and its driver must support bdrv_make_empty() and backing
   files.
3) Primary write requests will be written to the Shared disk.
4) Secondary write requests will be buffered in the active disk and
   will overwrite the existing sector content in the buffer.

Signed-off-by: zhanghailiang
Signed-off-by: Wen Congyang
Signed-off-by: Zhang Chen
---
 block/replication.c | 48 ++--
 1 file changed, 42 insertions(+), 6 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 6574cc2..f416ca5 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -233,7 +233,7 @@ static coroutine_fn int replication_co_readv(BlockDriverState *bs,
                                              QEMUIOVector *qiov)
 {
     BDRVReplicationState *s = bs->opaque;
-    BdrvChild *child = s->secondary_disk;
+    BdrvChild *child = s->is_shared_disk ? s->primary_disk : s->secondary_disk;
     BlockJob *job = NULL;
     CowRequest req;
     int ret;
@@ -415,7 +415,12 @@ static void backup_job_completed(void *opaque, int ret)
         s->error = -EIO;
     }

-    backup_job_cleanup(bs);
+    if (s->mode == REPLICATION_MODE_PRIMARY) {
+        s->replication_state = BLOCK_REPLICATION_DONE;
+        s->error = 0;
+    } else {
+        backup_job_cleanup(bs);
+    }
 }

 static bool check_top_bs(BlockDriverState *top_bs, BlockDriverState *bs)
@@ -467,6 +472,19 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,

     switch (s->mode) {
     case REPLICATION_MODE_PRIMARY:
+        if (s->is_shared_disk) {
+            job = backup_job_create(NULL, s->primary_disk->bs, bs, 0,
+                MIRROR_SYNC_MODE_NONE, NULL, false, BLOCKDEV_ON_ERROR_REPORT,
+                BLOCKDEV_ON_ERROR_REPORT, BLOCK_JOB_INTERNAL,
+                backup_job_completed, bs, NULL, &local_err);
+            if (local_err) {
+                error_propagate(errp, local_err);
+                backup_job_cleanup(bs);
+                aio_context_release(aio_context);
+                return;
+            }
+            block_job_start(job);
+        }
         break;
     case REPLICATION_MODE_SECONDARY:
         s->active_disk = bs->file;
@@ -485,7 +503,8 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
         }

         s->secondary_disk = s->hidden_disk->bs->backing;
-        if (!s->secondary_disk->bs || !bdrv_has_blk(s->secondary_disk->bs)) {
+        if (!s->secondary_disk->bs ||
+            (!s->is_shared_disk && !bdrv_has_blk(s->secondary_disk->bs))) {
             error_setg(errp, "The secondary disk doesn't have block backend");
             aio_context_release(aio_context);
             return;
@@ -580,11 +599,24 @@ static void replication_do_checkpoint(ReplicationState *rs, Error **errp)

     switch (s->mode) {
     case REPLICATION_MODE_PRIMARY:
+        if (s->is_shared_disk) {
+            if (!s->primary_disk->bs->job) {
+                error_setg(errp, "Primary backup job was cancelled"
+                           " unexpectedly");
+                break
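[Editorial note] The read path this patch changes can be pictured as a walk down a backing chain: a read is served from the topmost layer that holds the sector, falling through to the shared disk. Below is a self-contained toy model; the Layer type and data layout are invented for illustration and are not QEMU code.

    #include <stdio.h>

    /* Toy model of a backing chain: each layer may or may not hold a
     * sector; unallocated sectors fall through to the backing layer. */
    typedef struct Layer {
        const char   *name;
        const char  **data;      /* data[sector], or NULL if unallocated */
        struct Layer *backing;
    } Layer;

    /* A read walks the chain top-down and returns the first allocated
     * copy, like active disk -> hidden disk -> shared (backing) disk on
     * the secondary side. */
    static const char *chain_read(Layer *top, int sector)
    {
        for (Layer *l = top; l; l = l->backing) {
            if (l->data[sector]) {
                printf("sector %d served by %s\n", sector, l->name);
                return l->data[sector];
            }
        }
        return NULL;
    }

    int main(void)
    {
        const char *shared_data[4] = { "s0", "s1", "s2", "s3" };
        const char *hidden_data[4] = { NULL, "h1", NULL, NULL };
        const char *active_data[4] = { NULL, NULL, "a2", NULL };

        Layer shared = { "shared disk", shared_data, NULL };
        Layer hidden = { "hidden disk", hidden_data, &shared };
        Layer active = { "active disk", active_data, &hidden };

        chain_read(&active, 0);   /* falls through to the shared disk */
        chain_read(&active, 1);   /* COW'd primary data in the hidden disk */
        chain_read(&active, 2);   /* secondary's own write, active disk */
        return 0;
    }

On the primary, by contrast, the patch makes replication_co_readv() read straight from the shared (primary) disk, since there is no local chain to walk.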
[Qemu-block] [PATCH RFC v2 3/6] replication: Split out backup_do_checkpoint() from secondary_do_checkpoint()
The helper backup_do_checkpoint() will be used by the primary-side code.
Here we split it out from secondary_do_checkpoint().

Besides, it is unnecessary to call backup_do_checkpoint() when starting
replication or on the normal stop-replication path. We only need to call
it when doing a real checkpoint.

Signed-off-by: zhanghailiang
---
 block/replication.c | 36 +++-
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index e87ae87..35e9ab3 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -332,20 +332,8 @@ static bool replication_recurse_is_first_non_filter(BlockDriverState *bs,

 static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
 {
-    Error *local_err = NULL;
     int ret;

-    if (!s->secondary_disk->bs->job) {
-        error_setg(errp, "Backup job was cancelled unexpectedly");
-        return;
-    }
-
-    backup_do_checkpoint(s->secondary_disk->bs->job, &local_err);
-    if (local_err) {
-        error_propagate(errp, local_err);
-        return;
-    }
-
     ret = s->active_disk->bs->drv->bdrv_make_empty(s->active_disk->bs);
     if (ret < 0) {
         error_setg(errp, "Cannot make active disk empty");
@@ -558,6 +546,8 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
             return;
         }
         block_job_start(job);
+
+        secondary_do_checkpoint(s, errp);
         break;
     default:
         aio_context_release(aio_context);
@@ -566,10 +556,6 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,

     s->replication_state = BLOCK_REPLICATION_RUNNING;

-    if (s->mode == REPLICATION_MODE_SECONDARY) {
-        secondary_do_checkpoint(s, errp);
-    }
-
     s->error = 0;
     aio_context_release(aio_context);
 }
@@ -579,13 +565,29 @@ static void replication_do_checkpoint(ReplicationState *rs, Error **errp)
     BlockDriverState *bs = rs->opaque;
     BDRVReplicationState *s;
     AioContext *aio_context;
+    Error *local_err = NULL;

     aio_context = bdrv_get_aio_context(bs);
     aio_context_acquire(aio_context);
     s = bs->opaque;

-    if (s->mode == REPLICATION_MODE_SECONDARY) {
+    switch (s->mode) {
+    case REPLICATION_MODE_PRIMARY:
+        break;
+    case REPLICATION_MODE_SECONDARY:
+        if (!s->secondary_disk->bs->job) {
+            error_setg(errp, "Backup job was cancelled unexpectedly");
+            break;
+        }
+        backup_do_checkpoint(s->secondary_disk->bs->job, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            break;
+        }
         secondary_do_checkpoint(s, errp);
+        break;
+    default:
+        abort();
     }
     aio_context_release(aio_context);
 }
--
1.8.3.1
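[Editorial note] The resulting checkpoint sequence on the secondary (reset the backup job's COW bookkeeping, then empty the active disk) can be sketched as below. This is a toy model with invented state, not the real BDRVReplicationState.

    #include <stdio.h>
    #include <string.h>

    #define SECTORS 8

    /* Invented stand-in state for the two pieces a checkpoint resets. */
    typedef struct ToyState {
        int cow_done[SECTORS];     /* backup job: sectors already COW'd */
        int active_dirty[SECTORS]; /* active disk: secondary's own writes */
    } ToyState;

    /* Analogue of backup_do_checkpoint(): forget which sectors were
     * copied, so the next epoch COWs afresh. */
    static void toy_backup_do_checkpoint(ToyState *s)
    {
        memset(s->cow_done, 0, sizeof(s->cow_done));
    }

    /* Analogue of secondary_do_checkpoint(): drop the secondary's
     * speculative writes (what bdrv_make_empty() does to the active disk). */
    static void toy_secondary_do_checkpoint(ToyState *s)
    {
        memset(s->active_dirty, 0, sizeof(s->active_dirty));
    }

    int main(void)
    {
        ToyState s = { .cow_done = {0, 1, 1}, .active_dirty = {1} };

        /* Only a real checkpoint runs both steps; start/stop no longer do. */
        toy_backup_do_checkpoint(&s);
        toy_secondary_do_checkpoint(&s);

        printf("after checkpoint: cow_done[1]=%d active_dirty[0]=%d\n",
               s.cow_done[1], s.active_dirty[0]);
        return 0;
    }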
[Qemu-block] [PATCH RFC v2 6/6] nbd/replication: implement .bdrv_get_info() for nbd and replication driver
Without this callback, there will be an error report on the primary side:

"qemu-system-x86_64: Couldn't determine the cluster size of the target
image, which has no backing file: Operation not supported
Aborting, since this may create an unusable destination image"

The nbd driver doesn't have a cluster size, so here we return a fake
value for it. This patch should be dropped if Eric's nbd patch is
merged.
https://lists.gnu.org/archive/html/qemu-devel/2016-04/msg03567.html

Cc: Eric Blake
Signed-off-by: zhanghailiang
Signed-off-by: Wen Congyang
---
 block/nbd.c         | 12
 block/replication.c |  6 ++
 2 files changed, 18 insertions(+)

diff --git a/block/nbd.c b/block/nbd.c
index 35f24be..b71a13d 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -43,6 +43,8 @@

 #define EN_OPTSTR ":exportname="

+#define NBD_FAKE_CLUSTER_SIZE 512
+
 typedef struct BDRVNBDState {
     NBDClientSession client;
@@ -552,6 +554,13 @@ static void nbd_refresh_filename(BlockDriverState *bs, QDict *options)
     bs->full_open_options = opts;
 }

+static int nbd_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+    bdi->cluster_size = NBD_FAKE_CLUSTER_SIZE;
+
+    return 0;
+}
+
 static BlockDriver bdrv_nbd = {
     .format_name                = "nbd",
     .protocol_name              = "nbd",
@@ -569,6 +578,7 @@ static BlockDriver bdrv_nbd = {
     .bdrv_detach_aio_context    = nbd_detach_aio_context,
     .bdrv_attach_aio_context    = nbd_attach_aio_context,
     .bdrv_refresh_filename      = nbd_refresh_filename,
+    .bdrv_get_info              = nbd_get_info,
 };

 static BlockDriver bdrv_nbd_tcp = {
@@ -588,6 +598,7 @@ static BlockDriver bdrv_nbd_tcp = {
     .bdrv_detach_aio_context    = nbd_detach_aio_context,
     .bdrv_attach_aio_context    = nbd_attach_aio_context,
     .bdrv_refresh_filename      = nbd_refresh_filename,
+    .bdrv_get_info              = nbd_get_info,
 };

 static BlockDriver bdrv_nbd_unix = {
@@ -607,6 +618,7 @@ static BlockDriver bdrv_nbd_unix = {
     .bdrv_detach_aio_context    = nbd_detach_aio_context,
     .bdrv_attach_aio_context    = nbd_attach_aio_context,
     .bdrv_refresh_filename      = nbd_refresh_filename,
+    .bdrv_get_info              = nbd_get_info,
 };

 static void bdrv_nbd_init(void)
diff --git a/block/replication.c b/block/replication.c
index f416ca5..5f14360 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -731,6 +731,11 @@ static void replication_stop(ReplicationState *rs, bool failover, Error **errp)
     aio_context_release(aio_context);
 }

+static int replication_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+    return bdrv_get_info(bs->file->bs, bdi);
+}
+
 BlockDriver bdrv_replication = {
     .format_name                = "replication",
     .protocol_name              = "replication",
@@ -743,6 +748,7 @@ BlockDriver bdrv_replication = {

     .bdrv_co_readv              = replication_co_readv,
     .bdrv_co_writev             = replication_co_writev,
+    .bdrv_get_info              = replication_get_info,

     .is_filter                  = true,
     .bdrv_recurse_is_first_non_filter = replication_recurse_is_first_non_filter,
--
1.8.3.1
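[Editorial note] For reference, the quoted error comes from the cluster-size decision the backup job makes about its target. Roughly, as a simplified sketch of that logic (not the actual backup.c code):

    #include <stdio.h>

    #define BACKUP_CLUSTER_SIZE_DEFAULT (1 << 16)

    /* Simplified version of the backup job's target check: if
     * bdrv_get_info() fails and the target has no backing file, any
     * cluster the job skips would read as garbage afterwards, so it
     * aborts with the error quoted above.  Otherwise the reported
     * cluster size is rounded up to the 64k default. */
    static int backup_calculate_cluster_size(int get_info_ret,
                                             int target_cluster_size,
                                             int target_has_backing)
    {
        if (get_info_ret < 0 && !target_has_backing) {
            fprintf(stderr, "Couldn't determine the cluster size of the "
                            "target image\n");
            return -1;
        }
        if (get_info_ret < 0 || target_cluster_size == 0) {
            return BACKUP_CLUSTER_SIZE_DEFAULT;
        }
        return target_cluster_size > BACKUP_CLUSTER_SIZE_DEFAULT
               ? target_cluster_size : BACKUP_CLUSTER_SIZE_DEFAULT;
    }

    int main(void)
    {
        /* NBD target with no .bdrv_get_info and no backing file: abort. */
        printf("%d\n", backup_calculate_cluster_size(-1, 0, 0));
        /* With the fake 512-byte cluster size the job proceeds, rounded
         * up to the 64k default. */
        printf("%d\n", backup_calculate_cluster_size(0, 512, 0));
        return 0;
    }

This shows why any successful .bdrv_get_info(), even with a fake value, is enough to unblock the primary's backup job.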
Re: [Qemu-block] [Qemu-devel] [RFC PATCH] glusterfs: allow partial reads
On Fri, Dec 02, 2016 at 01:13:28PM -0600, Eric Blake wrote:
> On 12/01/2016 04:59 AM, Wolfgang Bumiller wrote:
> > Fixes #1644754.
> >
> > Signed-off-by: Wolfgang Bumiller
> > ---
> > I'm not sure what the original rationale was to treat both partial
> > reads as well as writes as I/O errors. (Seems to have happened from
> > the original glusterfs v1 to v2 series with a note but no reasoning
> > for the read side as far as I could see.)
> > The general direction lately seems to be to move away from sector
> > based block APIs. Also e.g. the NFS code allows partial reads. (It
> > does, however, have an old patch (c2eb918e3) dedicated to aligning
> > sizes to 512 byte boundaries for file creation for compatibility with
> > other parts of qemu like qcow2. This already happens in glusterfs,
> > though, but if you move a file from a different storage over to
> > glusterfs you may end up with a qcow2 file with e.g. the L1 table in
> > the last 80 bytes of the file aligned to _begin_ at a 512 boundary,
> > but not _end_ at one.)
> >
> >  block/gluster.c | 10 +-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/gluster.c b/block/gluster.c
> > index 891c13b..3db0bf8 100644
> > --- a/block/gluster.c
> > +++ b/block/gluster.c
> > @@ -41,6 +41,7 @@ typedef struct GlusterAIOCB {
> >      int ret;
> >      Coroutine *coroutine;
> >      AioContext *aio_context;
> > +    bool is_write;
> >  } GlusterAIOCB;
> >
> >  typedef struct BDRVGlusterState {
> > @@ -716,8 +717,10 @@ static void gluster_finish_aiocb(struct glfs_fd *fd,
> > ssize_t ret, void *arg)
> >          acb->ret = 0; /* Success */
> >      } else if (ret < 0) {
> >          acb->ret = -errno; /* Read/Write failed */
> > +    } else if (acb->is_write) {
> > +        acb->ret = -EIO; /* Partial write - fail it */
> >      } else {
> > -        acb->ret = -EIO; /* Partial read/write - fail it */
> > +        acb->ret = 0; /* Success */
>
> Does this properly guarantee that the portion beyond EOF reads as zero?

I'd argue this wasn't necessarily the case before either, considering
the first check starts with `!ret`:

    if (!ret || ret == acb->size) {
        acb->ret = 0; /* Success */

A read right at EOF would return 0 and be treated as success there, no?
IOW, it wouldn't zero out the destination buffer as far as I can see.
Come to think of it, I'm not too fond of this part of the check for the
write case either.

> Would it be better to switch to byte-based interfaces rather than
> continue to force gluster interaction in 512-byte sector chunks, since
> gluster can obviously store files that are not 512-aligned?
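[Editorial note] One possible way to address the zeroing concern, sketched here against plain POSIX pread() rather than gfapi (so this is only an illustration of the idea, not a patch): treat a short read as success, but zero the tail of the buffer that the server did not fill.

    #define _XOPEN_SOURCE 500
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>

    /* Accept a short read as success, but explicitly zero the part of
     * the buffer beyond what was returned, so the caller never sees
     * stale data past EOF. */
    static ssize_t read_full_or_zero(int fd, void *buf, size_t size,
                                     off_t off)
    {
        ssize_t ret = pread(fd, buf, size, off);

        if (ret < 0) {
            return ret;                                /* real I/O error */
        }
        if ((size_t)ret < size) {
            memset((char *)buf + ret, 0, size - ret);  /* zero the tail */
        }
        return size;                                   /* report success */
    }

    int main(int argc, char **argv)
    {
        char buf[4096];
        int fd = open(argc > 1 ? argv[1] : "/etc/hostname", O_RDONLY);

        if (fd < 0) {
            return 1;
        }
        if (read_full_or_zero(fd, buf, sizeof(buf), 0) == sizeof(buf)) {
            printf("read ok; tail zeroed if the file was short\n");
        }
        close(fd);
        return 0;
    }

Whether the zeroing belongs in the driver or in a byte-based caller is exactly the interface question Eric raises at the end.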