date:20170913

Re: [Qemu-block] [Qemu-devel] [PATCH v7 31/38] libqtest: Merge qtest_clock_() with clock_()

2017-09-13 Thread Thomas Huth

On 12.09.2017 15:35, Eric Blake wrote:
> On 09/12/2017 05:45 AM, Thomas Huth wrote:
>> On 11.09.2017 19:20, Eric Blake wrote:
>>> Maintaining two layers of libqtest APIs, one that takes an explicit
>>> QTestState object, and the other that uses the implicit global_qtest,
>>> is annoying.  In the interest of getting rid of global implicit
>>> state and having less code to maintain, merge:
>>>  qtest_clock_set()
>>>  qtest_clock_step()
>>>  qtest_clock_step_next()
>>> with their short counterparts.  All callers that previously
>>> used the short form now make it explicit that they are relying on
>>> global_qtest, and later patches can then clean things up to remove
>>> the global variable.
>>>
> 
>>> @@ -446,7 +446,7 @@ int64_t qtest_clock_step(QTestState *s, int64_t step);
>>>   *
>>>   * Returns: The current value of the QEMU_CLOCK_VIRTUAL in nanoseconds.
>>>   */
>>> -int64_t qtest_clock_set(QTestState *s, int64_t val);
>>> +int64_t clock_set(QTestState *s, int64_t val);
>>  Could we please keep the "qtest" prefix here and rather get rid of the
>> other ones? Even if it's more to type, I prefer to have a proper prefix
>> here so that it is clear at the first sight that the functions belong to
>> the qtest framework.
> 
> I suppose we can, although it makes more lines that are likely to bump
> up against 80 columns, and thus slightly more churn to reformat things
> to keep checkpatch happy.  I like the shorter name, because less typing
> is easier to remember.  I'd prefer a second opinion on naming before
> doing anything about it though - Markus or Paolo, do you have any
> preference?

IMHO you should at least keep the qtest prefix in patch 33/38 to avoid
confusion with the system definitions that have the same names (see "man
2 outb" for example).

 Thomas



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [Qemu-devel] [PATCH] rbd: Detect rbd image resizes and propagate them

2017-09-13 Thread John Snow



On 09/13/2017 05:36 PM, Adam Wolfe Gordon via Qemu-devel wrote:
> On Wed, Sep 13, 2017 at 2:53 PM, John Snow  wrote:
>> On 09/13/2017 12:44 PM, Adam Wolfe Gordon wrote:
>>> Register a watcher with rbd so that we get notified when an image is
>>> resized. Propagate resizes to parent block devices so that guest devices
>>> get resized without user intervention.
>>>
>>> Signed-off-by: Adam Wolfe Gordon 
>>> ---
>>> Hello!
>>>
>>> We are using this patch in production at DigitalOcean and have tested it 
>>> fairly
>>> extensively. However, we use the same block device configuration 
>>> everywhere, so
>>> I'd be interested to get an answer to the question I've left in the code:>
 /* NOTE: This assumes there's only one layer between us and the
block-backend. Is this always true? */
>>>
>>
>> Even if it were always true presently, the fact that the block layer
>> storage graph can be configured in arbitrary ways suggests that it might
>> not always be true in the future.
>>
>> Probably in practice MOST drives are configured like
>> [BB]<--[Format]<--[Protocol], but the block layer is adding a lot of fun
>> stuff like filter nodes for throttling or debugging, so it's probably
>> too much to assert that the length of the chain will ALWAYS be exactly two.
> 
> OK, thanks - that's exactly what I was wondering. I can add some code
> to traverse the parents until we get to the BB.
> 
>> Can I play dumb and ask what you are trying to accomplish? The concept
>> of a drive deciding to resize itself is not something I have a model for
>> understanding, presently...
> 
> Good question; I probably should have explained the use-case in the
> commit message.
> 
> We have a storage orchestration service that manages our ceph block
> storage clusters and doesn't interact directly with qemu. Volumes get
> resized through the orchestration service, which (after doing some

resized bigger, one hopes ...

> checks, updating external metadata, etc.) issues the resize via
> librbd. The drive doesn't exactly resize itself, but the trigger is
> out-of-band from any VM the drive may be attached to.

>From QEMU's POV, the drive resized itself! Facts on the ground may be
different, of course.

> 
> Previously, we would notify the VM of the resize by issuing a
> blockresize via qmp after doing the resize itself externally. That
> meant we were actually resizing the rbd image twice (though the second
> was, hopefully, a no-op). We occasionally had trouble with the resize
> issued by qemu getting stuck and blocking the qemu main thread.
> Detecting the out-of-band resize lets us avoid the extra rbd_resize
> call and means that we never modify an rbd image's metadata from qemu.
> 

Hm, I see... It sounds like you want an operation here that lets us
detect medium changes without actually attempting to orchestrate one.

It smells like you want the second half of bdrv_truncate without
actually issuing the call. Perhaps you could split this function into
its two halves, and in the event of an external resize being detected,
you could call the latter-half portion of bdrv_truncate.

...if the drive is configured to automatically detect those events, that
is. Conceivably not all resize events that QEMU *could* detect *should*
automatically result in guest-visible changes as soon as they occur.

--js

Re: [Qemu-block] [Qemu-devel] [PATCH v7 12/20] dirty-bitmap: Change bdrv_get_dirty_count() to report bytes

2017-09-13 Thread John Snow



On 09/12/2017 04:31 PM, Eric Blake wrote:
> Thanks to recent cleanups, all callers were scaling a return value
> of sectors into bytes; do the scaling internally instead.
> 
> Signed-off-by: Eric Blake 
> 

Reviewed-by: John Snow

Re: [Qemu-block] [Qemu-devel] [PATCH v7 11/20] dirty-bitmap: Change bdrv_dirty_iter_next() to report byte offset

2017-09-13 Thread John Snow



On 09/12/2017 04:31 PM, Eric Blake wrote:
> Thanks to recent cleanups, most callers were scaling a return value
> of sectors into bytes (the exception, in qcow2-bitmap, will be
> converted to byte-based iteration later).  Update the interface to
> do the scaling internally instead.
> 
> In qcow2-bitmap, the code was specifically checking for an error
> to be -1; it is more robust to treat all negative values as an
> error, but at the same time it is also easy enough to ensure we
> return -1 (and not -512) on error.
> 
> Signed-off-by: Eric Blake 
> 

This patch now smells like a bugfix and a separate incremental feature
enhancement.

Do we need to backport the error-checking to a possible 2.10.1?

If no:

Reviewed-by: John Snow

Re: [Qemu-block] [Qemu-devel] [PATCH v7 07/20] dirty-bitmap: Track bitmap size by bytes

2017-09-13 Thread John Snow



On 09/12/2017 04:31 PM, Eric Blake wrote:
> We are still using an internal hbitmap that tracks a size in sectors,
> with the granularity scaled down accordingly, because it lets us
> use a shortcut for our iterators which are currently sector-based.
> But there's no reason we can't track the dirty bitmap size in bytes,
> since it is (mostly) an internal-only variable (remember, the size
> is how many bytes are covered by the bitmap, not how many bytes the
> bitmap occupies).  A later cleanup will convert dirty bitmap
> internals to be entirely byte-based, eliminating the intermediate
> sector rounding added here; and technically, since bdrv_getlength()
> already rounds up to sectors, our use of DIV_ROUND_UP is more for
> theoretical completeness than for any actual rounding.
> 
> Use is_power_of_2() while at it, instead of open-coding that.
> 
> Signed-off-by: Eric Blake 
> 

Reviewed-by: John Snow

Re: [Qemu-block] [Qemu-devel] [PATCH v7 06/20] dirty-bitmap: Change bdrv_dirty_bitmap_size() to report bytes

2017-09-13 Thread John Snow



On 09/12/2017 04:31 PM, Eric Blake wrote:
> We're already reporting bytes for bdrv_dirty_bitmap_granularity();
> mixing bytes and sectors in our return values is a recipe for
> confusion.  A later cleanup will convert dirty bitmap internals
> to be entirely byte-based, but in the meantime, we should report
> the bitmap size in bytes.
> 
> The only external caller in qcow2-bitmap.c is temporarily more verbose
> (because it is still using sector-based math), but will later be
> switched to track progress by bytes instead of sectors.
> 
> Signed-off-by: Eric Blake 
> 

Reviewed-by: John Snow

Re: [Qemu-block] [Qemu-devel] [PATCH v2 00/10] cleanup qemu-iotests

2017-09-13 Thread Paolo Bonzini

On 13/09/2017 09:59, Fam Zheng wrote:
> On Wed, 09/13 08:47, Thomas Huth wrote:
>> Meta comment: Could we maybe also rename "tests/qemu-iotests" to
>> "tests/iotests" ? The "qemu" prefix sounds always very superfluous to me
>> here...
> 
> Sounds good, and saves typing for when this path is manually entered. But 
> maybe
> keep tests/qemu-iotests as a symlink to keep old scripts happy?

The simplest way to keep old scripts happy would be too keep the prefix
:) even though it does sound unnecessary.

Paolo

Re: [Qemu-block] [PATCH] rbd: Detect rbd image resizes and propagate them

2017-09-13 Thread Adam Wolfe Gordon

On Wed, Sep 13, 2017 at 3:21 PM, Jason Dillaman  wrote:
> On Wed, 2017-09-13 at 10:44 -0600, Adam Wolfe Gordon wrote:
>> +parent = bs->inherits_from;
>> +if (parent == NULL) {
>> +error_report("bs %s does not have parent", 
>> bdrv_get_device_or_node_name(bs));
>> +return;
>> +}
>> +
>> +/* Force parents to re-read our size. */
>
> Can you just invoke blk_truncate instead?

blk_truncate will end up calling qemu_rbd_truncate, which does an
rbd_resize. The goal of this patch is to avoid calling rbd_resize from
qemu, since we do resizes from an external orchestration service.

That said, this patch also introduces a cache of the size (in
BDRVRBDState.size_bytes), so it would be possible for
qemu_rbd_truncate to avoid calling rbd_resize when the image is
already the right size. If I did that, we could use blk_truncate when
we detect a resize as well. Would that be a desirable change?

Re: [Qemu-block] [PATCH] rbd: Detect rbd image resizes and propagate them

2017-09-13 Thread Adam Wolfe Gordon

On Wed, Sep 13, 2017 at 2:53 PM, John Snow  wrote:
> On 09/13/2017 12:44 PM, Adam Wolfe Gordon wrote:
>> Register a watcher with rbd so that we get notified when an image is
>> resized. Propagate resizes to parent block devices so that guest devices
>> get resized without user intervention.
>>
>> Signed-off-by: Adam Wolfe Gordon 
>> ---
>> Hello!
>>
>> We are using this patch in production at DigitalOcean and have tested it 
>> fairly
>> extensively. However, we use the same block device configuration everywhere, 
>> so
>> I'd be interested to get an answer to the question I've left in the code:>
>>> /* NOTE: This assumes there's only one layer between us and the
>>>block-backend. Is this always true? */
>>
>
> Even if it were always true presently, the fact that the block layer
> storage graph can be configured in arbitrary ways suggests that it might
> not always be true in the future.
>
> Probably in practice MOST drives are configured like
> [BB]<--[Format]<--[Protocol], but the block layer is adding a lot of fun
> stuff like filter nodes for throttling or debugging, so it's probably
> too much to assert that the length of the chain will ALWAYS be exactly two.

OK, thanks - that's exactly what I was wondering. I can add some code
to traverse the parents until we get to the BB.

> Can I play dumb and ask what you are trying to accomplish? The concept
> of a drive deciding to resize itself is not something I have a model for
> understanding, presently...

Good question; I probably should have explained the use-case in the
commit message.

We have a storage orchestration service that manages our ceph block
storage clusters and doesn't interact directly with qemu. Volumes get
resized through the orchestration service, which (after doing some
checks, updating external metadata, etc.) issues the resize via
librbd. The drive doesn't exactly resize itself, but the trigger is
out-of-band from any VM the drive may be attached to.

Previously, we would notify the VM of the resize by issuing a
blockresize via qmp after doing the resize itself externally. That
meant we were actually resizing the rbd image twice (though the second
was, hopefully, a no-op). We occasionally had trouble with the resize
issued by qemu getting stuck and blocking the qemu main thread.
Detecting the out-of-band resize lets us avoid the extra rbd_resize
call and means that we never modify an rbd image's metadata from qemu.

Re: [Qemu-block] [PATCH] osdep: Fix ROUND_UP(64-bit, 32-bit)

2017-09-13 Thread Paolo Bonzini

On 13/09/2017 23:03, Eric Blake wrote:
> When using bit-wise operations that exploit the power-of-two
> nature of the second argument of ROUND_UP(), we still need to
> ensure that the mask is as wide as the first argument (done
> by using addition of 0 to force proper arithmetic promotion).
> Unpatched, ROUND_UP(2ULL*1024*1024*1024*1024, 512) produces 0,
> instead of the intended 2TiB.
> 
> Broken since its introduction in commit 292c8e50 (v1.5.0).
> 
> CC: qemu-triv...@nongnu.org
> Signed-off-by: Eric Blake 
> 
> ---
> I did not audit to see how many potential users of ROUND_UP
> are actually passing in different sized types where the first
> argument can be larger than UINT32_MAX; I stumbled across the
> problem when iotests 190 started failing on a patch where I
> added a new use.  We can either be conservative and put this
> on qemu-stable no matter what, or go through the effort of an
> audit to see what might be broken (many callers in the block
> layer, but not just there).
> ---
>  include/qemu/osdep.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 6855b94bbf..7a3000efc5 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -189,13 +189,13 @@ extern int daemon(int, int);
> 
>  /* Round number up to multiple. Requires that d be a power of 2 (see
>   * QEMU_ALIGN_UP for a safer but slower version on arbitrary
> - * numbers) */
> + * numbers); works even if d is a smaller type than n.  */
>  #ifndef ROUND_UP
> -#define ROUND_UP(n,d) (((n) + (d) - 1) & -(d))
> +#define ROUND_UP(n, d) (((n) + (d) - 1) & -((n) - (n) + (d)))
>  #endif
> 
>  #ifndef DIV_ROUND_UP
> -#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
> +#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
>  #endif
> 
>  /*
> 


Cc: qemu-sta...@nongnu.org

Paolo

Re: [Qemu-block] [PATCH] rbd: Detect rbd image resizes and propagate them

2017-09-13 Thread Jason Dillaman

On Wed, 2017-09-13 at 10:44 -0600, Adam Wolfe Gordon wrote:
> Register a watcher with rbd so that we get notified when an image is
> resized. Propagate resizes to parent block devices so that guest devices
> get resized without user intervention.
> 
> Signed-off-by: Adam Wolfe Gordon 
> ---
> Hello!
> 
> We are using this patch in production at DigitalOcean and have tested it 
> fairly
> extensively. However, we use the same block device configuration everywhere, 
> so
> I'd be interested to get an answer to the question I've left in the code:
> 
> > /* NOTE: This assumes there's only one layer between us and the
> >block-backend. Is this always true? */
> 
> If that isn't true, this patch will need a bit of work to traverse up the 
> block
> device tree and find the block-backend.
> 
> Of course, any other feedback would be welcome too - this is my first foray 
> into
> the qemu code.
> 
> Regards,
> Adam
> 
>  block/rbd.c | 80 
> +
>  1 file changed, 80 insertions(+)
> 
> diff --git a/block/rbd.c b/block/rbd.c
> index 144f350e1f..1c9fcbec1f 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -96,6 +96,8 @@ typedef struct BDRVRBDState {
>  rbd_image_t image;
>  char *image_name;
>  char *snap;
> +uint64_t watcher_handle;
> +uint64_t size_bytes;
>  } BDRVRBDState;
>  
>  static char *qemu_rbd_next_tok(char *src, char delim, char **p)
> @@ -540,6 +542,67 @@ out:
>  return rados_str;
>  }
>  
> +/* BH for when rbd notifies us of an update. */
> +static void qemu_rbd_update_bh(void *arg)
> +{
> +BlockDriverState *parent, *bs = arg;
> +BDRVRBDState *s = bs->opaque;
> +bool was_variable_length;
> +uint64_t new_size_bytes;
> +int64_t new_parent_len;
> +BdrvChild *c;
> +int r;
> +
> +r = rbd_get_size(s->image, _size_bytes);
> +if (r < 0) {
> +error_report("error reading size for %s: %s", s->name, strerror(-r));
> +return;
> +}
> +
> +/* Avoid no-op resizes on non-resize notifications. */
> +if (new_size_bytes == s->size_bytes) {
> +error_printf("skipping non-resize rbd cb\n");

Image update callbacks can be invoked for any number of reasons, not
just resize events. Therefore, you probably don't want to have an error
message printed out here.

> +return;
> +}
> +
> +/* NOTE: This assumes there's only one layer between us and the
> +   block-backend. Is this always true? */

I don't think that can be assumed to be true.

> +parent = bs->inherits_from;
> +if (parent == NULL) {
> +error_report("bs %s does not have parent", 
> bdrv_get_device_or_node_name(bs));
> +return;
> +}
> +
> +/* Force parents to re-read our size. */

Can you just invoke blk_truncate instead? 

> +was_variable_length = bs->drv->has_variable_length;
> +bs->drv->has_variable_length = true;
> +new_parent_len = bdrv_getlength(parent);
> +if (new_parent_len < 0) {
> +error_report("getlength failed on parent %s", 
> bdrv_get_device_or_node_name(parent));
> +bs->drv->has_variable_length = was_variable_length;
> +return;
> +}
> +bs->drv->has_variable_length = was_variable_length;
> +
> +/* Notify block backends that that we have resized.
> +   Copied from bdrv_parent_cb_resize. */
> +QLIST_FOREACH(c, >parents, next_parent) {
> +if (c->role->resize) {
> +c->role->resize(c);
> +}
> +}
> +
> +s->size_bytes = new_size_bytes;
> +}
> +
> +/* Called from non-qemu thread - careful! */
> +static void qemu_rbd_update_cb(void *arg)
> +{
> +BlockDriverState *bs = arg;
> +
> +aio_bh_schedule_oneshot(bdrv_get_aio_context(bs), qemu_rbd_update_bh, 
> bs);
> +}
> +
>  static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
>   Error **errp)
>  {
> @@ -672,9 +735,25 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  }
>  }
>  
> +r = rbd_update_watch(s->image, >watcher_handle, qemu_rbd_update_cb, 
> bs);

This API method was only added in the Ceph Jewel release just over a
year ago. This should probably gracefully degrade if compiled against an
older version of the librbd API.

> +if (r < 0) {
> +error_setg_errno(errp, -r, "error registering watcher on %s", 
> s->name);
> +goto failed_watch;
> +}
> +
> +r = rbd_get_size(s->image, >size_bytes);
> +if (r < 0) {
> +error_setg_errno(errp, -r, "error reading size for %s", s->name);
> +goto failed_sz;
> +}
> +
>  qemu_opts_del(opts);
>  return 0;
>  
> +failed_sz:
> +rbd_update_unwatch(s->image, s->watcher_handle);
> +failed_watch:
> +rbd_close(s->image);
>  failed_open:
>  rados_ioctx_destroy(s->io_ctx);
>  failed_shutdown:
> @@ -712,6 +791,7 @@ static void qemu_rbd_close(BlockDriverState *bs)
>  {
>  BDRVRBDState *s =

Re: [Qemu-block] [Qemu-devel] [PATCH v4 00/23] make bdrv_get_block_status byte-based

2017-09-13 Thread Eric Blake

On 09/13/2017 11:03 AM, Eric Blake wrote:
> There are patches floating around to add NBD_CMD_BLOCK_STATUS,
> but NBD wants to report status on byte granularity (even if the
> reporting will probably be naturally aligned to sectors or even
> much higher levels).  I've therefore started the task of
> converting our block status code to report at a byte granularity
> rather than sectors.
> 
> Now that 2.11 is open, I'm rebasing/reposting the remaining patches.
> 
> The overall conversion currently looks like:
> part 1: bdrv_is_allocated (merged, commit 51b0a488)
> part 2: dirty-bitmap (v7 is posted [1], mostly reviewed)
> part 3: bdrv_get_block_status (this series, v3 at [2])
> part 4: .bdrv_co_block_status (v2 is posted [3], but needs a rebase)
> 
> Available as a tag at:
> git fetch git://repo.or.cz/qemu/ericb.git nbd-byte-status-v4
> 
> Based-on: <20170912203119.24166-1-ebl...@redhat.com>
> ([PATCH v7 00/20] make dirty-bitmap byte-based)

Also,

Based-on: <20170913210343.19078-1-ebl...@redhat.com>
(osdep: Fix ROUND_UP(64-bit, 32-bit))

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature

[Qemu-block] [PATCH] osdep: Fix ROUND_UP(64-bit, 32-bit)

2017-09-13 Thread Eric Blake

When using bit-wise operations that exploit the power-of-two
nature of the second argument of ROUND_UP(), we still need to
ensure that the mask is as wide as the first argument (done
by using addition of 0 to force proper arithmetic promotion).
Unpatched, ROUND_UP(2ULL*1024*1024*1024*1024, 512) produces 0,
instead of the intended 2TiB.

Broken since its introduction in commit 292c8e50 (v1.5.0).

CC: qemu-triv...@nongnu.org
Signed-off-by: Eric Blake 

---
I did not audit to see how many potential users of ROUND_UP
are actually passing in different sized types where the first
argument can be larger than UINT32_MAX; I stumbled across the
problem when iotests 190 started failing on a patch where I
added a new use.  We can either be conservative and put this
on qemu-stable no matter what, or go through the effort of an
audit to see what might be broken (many callers in the block
layer, but not just there).
---
 include/qemu/osdep.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 6855b94bbf..7a3000efc5 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -189,13 +189,13 @@ extern int daemon(int, int);

 /* Round number up to multiple. Requires that d be a power of 2 (see
  * QEMU_ALIGN_UP for a safer but slower version on arbitrary
- * numbers) */
+ * numbers); works even if d is a smaller type than n.  */
 #ifndef ROUND_UP
-#define ROUND_UP(n,d) (((n) + (d) - 1) & -(d))
+#define ROUND_UP(n, d) (((n) + (d) - 1) & -((n) - (n) + (d)))
 #endif

 #ifndef DIV_ROUND_UP
-#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
+#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
 #endif

 /*
-- 
2.13.5

Re: [Qemu-block] [PATCH] rbd: Detect rbd image resizes and propagate them

2017-09-13 Thread John Snow



On 09/13/2017 12:44 PM, Adam Wolfe Gordon wrote:
> Register a watcher with rbd so that we get notified when an image is
> resized. Propagate resizes to parent block devices so that guest devices
> get resized without user intervention.
> 
> Signed-off-by: Adam Wolfe Gordon 
> ---
> Hello!
> 
> We are using this patch in production at DigitalOcean and have tested it 
> fairly
> extensively. However, we use the same block device configuration everywhere, 
> so
> I'd be interested to get an answer to the question I've left in the code:>
>> /* NOTE: This assumes there's only one layer between us and the
>>block-backend. Is this always true? */
> 

Even if it were always true presently, the fact that the block layer
storage graph can be configured in arbitrary ways suggests that it might
not always be true in the future.

Probably in practice MOST drives are configured like
[BB]<--[Format]<--[Protocol], but the block layer is adding a lot of fun
stuff like filter nodes for throttling or debugging, so it's probably
too much to assert that the length of the chain will ALWAYS be exactly two.

Can I play dumb and ask what you are trying to accomplish? The concept
of a drive deciding to resize itself is not something I have a model for
understanding, presently...

> If that isn't true, this patch will need a bit of work to traverse up the 
> block
> device tree and find the block-backend.
> 
> Of course, any other feedback would be welcome too - this is my first foray 
> into
> the qemu code.
> 
> Regards,
> Adam
> 
>  block/rbd.c | 80 
> +
>  1 file changed, 80 insertions(+)
> 
> diff --git a/block/rbd.c b/block/rbd.c
> index 144f350e1f..1c9fcbec1f 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -96,6 +96,8 @@ typedef struct BDRVRBDState {
>  rbd_image_t image;
>  char *image_name;
>  char *snap;
> +uint64_t watcher_handle;
> +uint64_t size_bytes;
>  } BDRVRBDState;
>  
>  static char *qemu_rbd_next_tok(char *src, char delim, char **p)
> @@ -540,6 +542,67 @@ out:
>  return rados_str;
>  }
>  
> +/* BH for when rbd notifies us of an update. */
> +static void qemu_rbd_update_bh(void *arg)
> +{
> +BlockDriverState *parent, *bs = arg;
> +BDRVRBDState *s = bs->opaque;
> +bool was_variable_length;
> +uint64_t new_size_bytes;
> +int64_t new_parent_len;
> +BdrvChild *c;
> +int r;
> +
> +r = rbd_get_size(s->image, _size_bytes);
> +if (r < 0) {
> +error_report("error reading size for %s: %s", s->name, strerror(-r));
> +return;
> +}
> +
> +/* Avoid no-op resizes on non-resize notifications. */
> +if (new_size_bytes == s->size_bytes) {
> +error_printf("skipping non-resize rbd cb\n");
> +return;
> +}
> +
> +/* NOTE: This assumes there's only one layer between us and the
> +   block-backend. Is this always true? */
> +parent = bs->inherits_from;
> +if (parent == NULL) {
> +error_report("bs %s does not have parent", 
> bdrv_get_device_or_node_name(bs));
> +return;
> +}
> +
> +/* Force parents to re-read our size. */
> +was_variable_length = bs->drv->has_variable_length;
> +bs->drv->has_variable_length = true;
> +new_parent_len = bdrv_getlength(parent);
> +if (new_parent_len < 0) {
> +error_report("getlength failed on parent %s", 
> bdrv_get_device_or_node_name(parent));
> +bs->drv->has_variable_length = was_variable_length;
> +return;
> +}
> +bs->drv->has_variable_length = was_variable_length;
> +
> +/* Notify block backends that that we have resized.
> +   Copied from bdrv_parent_cb_resize. */
> +QLIST_FOREACH(c, >parents, next_parent) {
> +if (c->role->resize) {
> +c->role->resize(c);
> +}
> +}
> +
> +s->size_bytes = new_size_bytes;
> +}
> +
> +/* Called from non-qemu thread - careful! */
> +static void qemu_rbd_update_cb(void *arg)
> +{
> +BlockDriverState *bs = arg;
> +
> +aio_bh_schedule_oneshot(bdrv_get_aio_context(bs), qemu_rbd_update_bh, 
> bs);
> +}
> +
>  static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
>   Error **errp)
>  {
> @@ -672,9 +735,25 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  }
>  }
>  
> +r = rbd_update_watch(s->image, >watcher_handle, qemu_rbd_update_cb, 
> bs);
> +if (r < 0) {
> +error_setg_errno(errp, -r, "error registering watcher on %s", 
> s->name);
> +goto failed_watch;
> +}
> +
> +r = rbd_get_size(s->image, >size_bytes);
> +if (r < 0) {
> +error_setg_errno(errp, -r, "error reading size for %s", s->name);
> +goto failed_sz;
> +}
> +
>  qemu_opts_del(opts);
>  return 0;
>  
> +failed_sz:
> +rbd_update_unwatch(s->image, s->watcher_handle);
> +failed_watch:
> +rbd_close(s->image);
>

Re: [Qemu-block] [Qemu-devel] [PATCH v4 21/23] block: Align block status requests

2017-09-13 Thread Eric Blake

On 09/13/2017 02:26 PM, Eric Blake wrote:
> On 09/13/2017 11:03 AM, Eric Blake wrote:
>> Any device that has request_alignment greater than 512 should be
>> unable to report status at a finer granularity; it may also be
>> simpler for such devices to be guaranteed that the block layer
>> has rounded things out to the granularity boundary (the way the
>> block layer already rounds all other I/O out).  Besides, getting
>> the code correct for super-sector alignment also benefits us
>> for the fact that our public interface now has byte granularity,
>> even though none of our drivers have byte-level callbacks.
>>
>> Add an assertion in blkdebug that proves that the block layer
>> never requests status of unaligned sections, similar to what it
>> does on other requests (while still keeping the generic helper
>> in place for when future patches add a throttle driver).  Note
>> that iotest 177 already covers this (it would fail if you use
>> just the blkdebug.c hunk without the io.c changes).  Meanwhile,
>> we can drop assertions in callers that no longer have to pass
>> in sector-aligned addresses.
> 
> Bummer - 'git bisect' says this patch causes iotests 190 to hang.  I'm
> investigating root cause, but I'll have to post a fixup once I figure it
> out.

Found it:

> +/* Round out to request_alignment boundaries */
> +align = MAX(bs->bl.request_alignment, BDRV_SECTOR_SIZE);
> +aligned_offset = QEMU_ALIGN_DOWN(offset, align);
> +aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;

ROUND_UP(64-bit, 32-bit) has a subtle bug: it truncates the operation at
32 bits, instead of producing a 64-bit result.  Using QEMU_ROUND_UP
instead does NOT have the bug.

That's a ticking time bomb, so I'll patch ROUND_UP() directly as a
pre-requisite, then reply to the cover letter with a Based-on tag.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [Qemu-devel] [PATCH] rbd: Detect rbd image resizes and propagate them

2017-09-13 Thread no-reply

Hi,

This series failed build test on s390x host. Please find the details below.

Subject: [Qemu-devel] [PATCH] rbd: Detect rbd image resizes and propagate them
Message-id: 20170913164424.32129-1-...@digitalocean.com
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash
# Testing script will be invoked under the git checkout with
# HEAD pointing to a commit that has the patches applied on top of "base"
# branch
set -e
echo "=== ENV ==="
env
echo "=== PACKAGES ==="
rpm -qa
echo "=== TEST BEGIN ==="
CC=$HOME/bin/cc
INSTALL=$PWD/install
BUILD=$PWD/build
echo -n "Using CC: "
realpath $CC
mkdir -p $BUILD $INSTALL
SRC=$PWD
cd $BUILD
$SRC/configure --cc=$CC --prefix=$INSTALL
make -j4
# XXX: we need reliable clean up
# make check -j4 V=1
make install
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 - [tag update]  patchew/1505298088-10878-1-git-send-email-th...@redhat.com 
-> patchew/1505298088-10878-1-git-send-email-th...@redhat.com
 * [new tag] patchew/20170913164424.32129-1-...@digitalocean.com -> 
patchew/20170913164424.32129-1-...@digitalocean.com
Switched to a new branch 'test'
c0e97b7 rbd: Detect rbd image resizes and propagate them

=== OUTPUT BEGIN ===
=== ENV ===
XDG_SESSION_ID=862
SHELL=/bin/sh
USER=fam
PATCHEW=/home/fam/patchew/patchew-cli -s http://patchew.org --nodebug
PATH=/usr/bin:/bin
PWD=/var/tmp/patchew-tester-tmp-ihkih90_/src
LANG=en_US.UTF-8
HOME=/home/fam
SHLVL=2
LOGNAME=fam
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1012/bus
XDG_RUNTIME_DIR=/run/user/1012
_=/usr/bin/env
=== PACKAGES ===
gpg-pubkey-873529b8-54e386ff
xz-libs-5.2.2-2.fc24.s390x
libxshmfence-1.2-3.fc24.s390x
giflib-4.1.6-15.fc24.s390x
trousers-lib-0.3.13-6.fc24.s390x
ncurses-base-6.0-6.20160709.fc25.noarch
gmp-6.1.1-1.fc25.s390x
libidn-1.33-1.fc25.s390x
slang-2.3.0-7.fc25.s390x
pkgconfig-0.29.1-1.fc25.s390x
alsa-lib-1.1.1-2.fc25.s390x
yum-metadata-parser-1.1.4-17.fc25.s390x
python3-slip-dbus-0.6.4-4.fc25.noarch
python2-cssselect-0.9.2-1.fc25.noarch
createrepo_c-libs-0.10.0-6.fc25.s390x
initscripts-9.69-1.fc25.s390x
parted-3.2-21.fc25.s390x
flex-2.6.0-3.fc25.s390x
colord-libs-1.3.4-1.fc25.s390x
python-osbs-client-0.33-3.fc25.noarch
perl-Pod-Simple-3.35-1.fc25.noarch
python2-simplejson-3.10.0-1.fc25.s390x
brltty-5.4-2.fc25.s390x
librados2-10.2.4-2.fc25.s390x
tcp_wrappers-7.6-83.fc25.s390x
libcephfs_jni1-10.2.4-2.fc25.s390x
nettle-devel-3.3-1.fc25.s390x
bzip2-devel-1.0.6-21.fc25.s390x
libuuid-2.28.2-2.fc25.s390x
python3-dnf-1.1.10-6.fc25.noarch
texlive-kpathsea-doc-svn41139-33.fc25.1.noarch
openssh-7.4p1-4.fc25.s390x
texlive-kpathsea-bin-svn40473-33.20160520.fc25.1.s390x
texlive-graphics-svn41015-33.fc25.1.noarch
texlive-dvipdfmx-def-svn40328-33.fc25.1.noarch
texlive-mfware-svn40768-33.fc25.1.noarch
texlive-texlive-scripts-svn41433-33.fc25.1.noarch
texlive-euro-svn22191.1.1-33.fc25.1.noarch
texlive-etex-svn37057.0-33.fc25.1.noarch
texlive-iftex-svn29654.0.2-33.fc25.1.noarch
texlive-palatino-svn31835.0-33.fc25.1.noarch
texlive-texlive-docindex-svn41430-33.fc25.1.noarch
texlive-xunicode-svn30466.0.981-33.fc25.1.noarch
texlive-koma-script-svn41508-33.fc25.1.noarch
texlive-pst-grad-svn15878.1.06-33.fc25.1.noarch
texlive-pst-blur-svn15878.2.0-33.fc25.1.noarch
texlive-jknapltx-svn19440.0-33.fc25.1.noarch
texinfo-6.1-4.fc25.s390x
openssl-devel-1.0.2k-1.fc25.s390x
jansson-2.10-2.fc25.s390x
fedora-repos-25-4.noarch
python3-libs-3.5.3-6.fc25.s390x
perl-Errno-1.25-387.fc25.s390x
acl-2.2.52-13.fc25.s390x
systemd-pam-231-17.fc25.s390x
NetworkManager-libnm-1.4.4-5.fc25.s390x
poppler-0.45.0-5.fc25.s390x
ccache-3.3.4-1.fc25.s390x
valgrind-3.12.0-9.fc25.s390x
perl-open-1.10-387.fc25.noarch
libgcc-6.4.1-1.fc25.s390x
libsoup-2.56.1-1.fc25.s390x
libstdc++-devel-6.4.1-1.fc25.s390x
nss-softokn-devel-3.31.0-1.0.fc25.s390x
libobjc-6.4.1-1.fc25.s390x
python2-rpm-4.13.0.1-2.fc25.s390x
python2-gluster-3.10.5-1.fc25.s390x
rpm-build-4.13.0.1-2.fc25.s390x
glibc-static-2.24-10.fc25.s390x
lz4-1.8.0-1.fc25.s390x
xapian-core-libs-1.2.24-1.fc25.s390x
elfutils-libelf-devel-0.169-1.fc25.s390x
libaio-0.3.110-6.fc24.s390x
libfontenc-1.1.3-3.fc24.s390x
lzo-2.08-8.fc24.s390x
isl-0.14-5.fc24.s390x
libXau-1.0.8-6.fc24.s390x
linux-atm-libs-2.5.1-14.fc24.s390x
libXext-1.3.3-4.fc24.s390x
libXxf86vm-1.1.4-3.fc24.s390x
bison-3.0.4-4.fc24.s390x
perl-srpm-macros-1-20.fc25.noarch
gawk-4.1.3-8.fc25.s390x
libwayland-client-1.12.0-1.fc25.s390x
perl-Exporter-5.72-366.fc25.noarch
perl-version-0.99.17-1.fc25.s390x
fftw-libs-double-3.3.5-3.fc25.s390x
libssh2-1.8.0-1.fc25.s390x
ModemManager-glib-1.6.4-1.fc25.s390x
newt-python3-0.52.19-2.fc25.s390x
python-munch-2.0.4-3.fc25.noarch
python-bugzilla-1.2.2-4.fc25.noarch
libedit-3.1-16.20160618cvs.fc25.s390x
createrepo_c-0.10.0-6.fc25.s390x
device-mapper-multipath-libs-0.4.9-83.fc25.s390x
yum-3.4.3-510.fc25.noarch
mozjs17-17.0.0-16.fc25.s390x
libselinux-2.5-13.fc25.s390x
python2-pyparsing-2.1.10-1.fc25.noarch
cairo-gobject-1.14.8-1.fc25.s390x

Re: [Qemu-block] [Qemu-devel] [PATCH] rbd: Detect rbd image resizes and propagate them

2017-09-13 Thread no-reply

Hi,

This series seems to have some coding style problems. See output below for
more information:

Subject: [Qemu-devel] [PATCH] rbd: Detect rbd image resizes and propagate them
Message-id: 20170913164424.32129-1-...@digitalocean.com
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
failed=1
echo
fi
n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 t [tag update]
patchew/1505298088-10878-1-git-send-email-th...@redhat.com -> 
patchew/1505298088-10878-1-git-send-email-th...@redhat.com
 * [new tag]   patchew/20170913164424.32129-1-...@digitalocean.com 
-> patchew/20170913164424.32129-1-...@digitalocean.com
Switched to a new branch 'test'
c0e97b7e5b rbd: Detect rbd image resizes and propagate them

=== OUTPUT BEGIN ===
Checking PATCH 1/1: rbd: Detect rbd image resizes and propagate them...
WARNING: line over 80 characters
#57: FILE: block/rbd.c:572:
+error_report("bs %s does not have parent", 
bdrv_get_device_or_node_name(bs));

ERROR: line over 90 characters
#66: FILE: block/rbd.c:581:
+error_report("getlength failed on parent %s", 
bdrv_get_device_or_node_name(parent));

total: 1 errors, 1 warnings, 107 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-de...@freelists.org

[Qemu-block] [PATCH] rbd: Detect rbd image resizes and propagate them

2017-09-13 Thread Adam Wolfe Gordon

Register a watcher with rbd so that we get notified when an image is
resized. Propagate resizes to parent block devices so that guest devices
get resized without user intervention.

Signed-off-by: Adam Wolfe Gordon 
---
Hello!

We are using this patch in production at DigitalOcean and have tested it fairly
extensively. However, we use the same block device configuration everywhere, so
I'd be interested to get an answer to the question I've left in the code:

> /* NOTE: This assumes there's only one layer between us and the
>block-backend. Is this always true? */

If that isn't true, this patch will need a bit of work to traverse up the block
device tree and find the block-backend.

Of course, any other feedback would be welcome too - this is my first foray into
the qemu code.

Regards,
Adam

 block/rbd.c | 80 +
 1 file changed, 80 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index 144f350e1f..1c9fcbec1f 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -96,6 +96,8 @@ typedef struct BDRVRBDState {
 rbd_image_t image;
 char *image_name;
 char *snap;
+uint64_t watcher_handle;
+uint64_t size_bytes;
 } BDRVRBDState;
 
 static char *qemu_rbd_next_tok(char *src, char delim, char **p)
@@ -540,6 +542,67 @@ out:
 return rados_str;
 }
 
+/* BH for when rbd notifies us of an update. */
+static void qemu_rbd_update_bh(void *arg)
+{
+BlockDriverState *parent, *bs = arg;
+BDRVRBDState *s = bs->opaque;
+bool was_variable_length;
+uint64_t new_size_bytes;
+int64_t new_parent_len;
+BdrvChild *c;
+int r;
+
+r = rbd_get_size(s->image, _size_bytes);
+if (r < 0) {
+error_report("error reading size for %s: %s", s->name, strerror(-r));
+return;
+}
+
+/* Avoid no-op resizes on non-resize notifications. */
+if (new_size_bytes == s->size_bytes) {
+error_printf("skipping non-resize rbd cb\n");
+return;
+}
+
+/* NOTE: This assumes there's only one layer between us and the
+   block-backend. Is this always true? */
+parent = bs->inherits_from;
+if (parent == NULL) {
+error_report("bs %s does not have parent", 
bdrv_get_device_or_node_name(bs));
+return;
+}
+
+/* Force parents to re-read our size. */
+was_variable_length = bs->drv->has_variable_length;
+bs->drv->has_variable_length = true;
+new_parent_len = bdrv_getlength(parent);
+if (new_parent_len < 0) {
+error_report("getlength failed on parent %s", 
bdrv_get_device_or_node_name(parent));
+bs->drv->has_variable_length = was_variable_length;
+return;
+}
+bs->drv->has_variable_length = was_variable_length;
+
+/* Notify block backends that that we have resized.
+   Copied from bdrv_parent_cb_resize. */
+QLIST_FOREACH(c, >parents, next_parent) {
+if (c->role->resize) {
+c->role->resize(c);
+}
+}
+
+s->size_bytes = new_size_bytes;
+}
+
+/* Called from non-qemu thread - careful! */
+static void qemu_rbd_update_cb(void *arg)
+{
+BlockDriverState *bs = arg;
+
+aio_bh_schedule_oneshot(bdrv_get_aio_context(bs), qemu_rbd_update_bh, bs);
+}
+
 static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
  Error **errp)
 {
@@ -672,9 +735,25 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 }
 }
 
+r = rbd_update_watch(s->image, >watcher_handle, qemu_rbd_update_cb, bs);
+if (r < 0) {
+error_setg_errno(errp, -r, "error registering watcher on %s", s->name);
+goto failed_watch;
+}
+
+r = rbd_get_size(s->image, >size_bytes);
+if (r < 0) {
+error_setg_errno(errp, -r, "error reading size for %s", s->name);
+goto failed_sz;
+}
+
 qemu_opts_del(opts);
 return 0;
 
+failed_sz:
+rbd_update_unwatch(s->image, s->watcher_handle);
+failed_watch:
+rbd_close(s->image);
 failed_open:
 rados_ioctx_destroy(s->io_ctx);
 failed_shutdown:
@@ -712,6 +791,7 @@ static void qemu_rbd_close(BlockDriverState *bs)
 {
 BDRVRBDState *s = bs->opaque;
 
+rbd_update_unwatch(s->image, s->watcher_handle);
 rbd_close(s->image);
 rados_ioctx_destroy(s->io_ctx);
 g_free(s->snap);
-- 
2.14.1

Re: [Qemu-block] [Qemu-devel] [PATCH v4 21/23] block: Align block status requests

2017-09-13 Thread Eric Blake

On 09/13/2017 11:03 AM, Eric Blake wrote:
> Any device that has request_alignment greater than 512 should be
> unable to report status at a finer granularity; it may also be
> simpler for such devices to be guaranteed that the block layer
> has rounded things out to the granularity boundary (the way the
> block layer already rounds all other I/O out).  Besides, getting
> the code correct for super-sector alignment also benefits us
> for the fact that our public interface now has byte granularity,
> even though none of our drivers have byte-level callbacks.
> 
> Add an assertion in blkdebug that proves that the block layer
> never requests status of unaligned sections, similar to what it
> does on other requests (while still keeping the generic helper
> in place for when future patches add a throttle driver).  Note
> that iotest 177 already covers this (it would fail if you use
> just the blkdebug.c hunk without the io.c changes).  Meanwhile,
> we can drop assertions in callers that no longer have to pass
> in sector-aligned addresses.

Bummer - 'git bisect' says this patch causes iotests 190 to hang.  I'm
investigating root cause, but I'll have to post a fixup once I figure it
out.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [PATCH] block: Clean up some bad code in the vvfat driver

2017-09-13 Thread John Snow



On 09/13/2017 06:21 AM, Thomas Huth wrote:
> Remove the unnecessary home-grown redefinition of the assert() macro here,
> and remove the unusable debug code at the end of the checkpoint() function.
> The code there uses assert() with side-effects (assignment to the "mapping"
> variable), which should be avoided. Looking more closely, it seems as it is
> apparently also only usable for one certain directory layout (with a file
> named USB.H in it) and thus is of no use for the rest of the world.
> 
> Signed-off-by: Thomas Huth 

Farewell, bitrot code.

Reviewed-by: John Snow 

Out of curiosity, I wonder ...

jhuston@probe (foobar) ~/s/qemu> git grep '#if 0' | wc -l
320

oh no

[Qemu-block] [PATCH 17/18] qemu-io: Add background write

2017-09-13 Thread Max Reitz

Add a new parameter -B to qemu-io's write command.  When used, qemu-io
will not wait for the result of the operation and instead execute it in
the background.

Signed-off-by: Max Reitz 
---
 qemu-io-cmds.c | 83 +-
 1 file changed, 77 insertions(+), 6 deletions(-)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 2811a89099..c635a248f5 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -481,6 +481,62 @@ static int do_pwrite(BlockBackend *blk, char *buf, int64_t 
offset,
 typedef struct {
 BlockBackend *blk;
 int64_t offset;
+int bytes;
+char *buf;
+int flags;
+} CoBackgroundWrite;
+
+static void coroutine_fn co_background_pwrite_entry(void *opaque)
+{
+CoBackgroundWrite *data = opaque;
+QEMUIOVector qiov;
+int ret;
+
+qemu_iovec_init(, 1);
+qemu_iovec_add(, data->buf, data->bytes);
+
+ret = blk_co_pwritev(data->blk, data->offset, data->bytes, ,
+ data->flags);
+
+qemu_iovec_destroy();
+g_free(data->buf);
+
+if (ret < 0) {
+Error *err;
+error_setg_errno(, -ret, "Background write failed");
+error_report_err(err);
+}
+}
+
+/* Takes ownership of @buf */
+static int do_background_pwrite(BlockBackend *blk, char *buf, int64_t offset,
+int64_t bytes, int flags)
+{
+Coroutine *co;
+CoBackgroundWrite *data;
+
+if (bytes > INT_MAX) {
+return -ERANGE;
+}
+
+data = g_new(CoBackgroundWrite, 1);
+*data = (CoBackgroundWrite){
+.blk= blk,
+.offset = offset,
+.bytes  = bytes,
+.buf= buf,
+.flags  = flags,
+};
+
+co = qemu_coroutine_create(co_background_pwrite_entry, data);
+bdrv_coroutine_enter(blk_bs(blk), co);
+
+return bytes;
+}
+
+typedef struct {
+BlockBackend *blk;
+int64_t offset;
 int64_t bytes;
 int64_t *total;
 int flags;
@@ -931,6 +987,7 @@ static void write_help(void)
 " Writes into a segment of the currently open file, using a buffer\n"
 " filled with a set pattern (0xcdcdcdcd).\n"
 " -b, -- write to the VM state rather than the virtual disk\n"
+" -B, -- just start a background write, do not wait for the result\n"
 " -c, -- write compressed data with blk_write_compressed\n"
 " -f, -- use Force Unit Access semantics\n"
 " -p, -- ignored for backwards compatibility\n"
@@ -951,7 +1008,7 @@ static const cmdinfo_t write_cmd = {
 .perm   = BLK_PERM_WRITE,
 .argmin = 2,
 .argmax = -1,
-.args   = "[-bcCfquz] [-P pattern] off len",
+.args   = "[-bBcCfquz] [-P pattern] off len",
 .oneline= "writes a number of bytes at a specified offset",
 .help   = write_help,
 };
@@ -961,6 +1018,7 @@ static int write_f(BlockBackend *blk, int argc, char 
**argv)
 struct timeval t1, t2;
 bool Cflag = false, qflag = false, bflag = false;
 bool Pflag = false, zflag = false, cflag = false;
+bool background = false;
 int flags = 0;
 int c, cnt;
 char *buf = NULL;
@@ -970,11 +1028,14 @@ static int write_f(BlockBackend *blk, int argc, char 
**argv)
 int64_t total = 0;
 int pattern = 0xcd;
 
-while ((c = getopt(argc, argv, "bcCfpP:quz")) != -1) {
+while ((c = getopt(argc, argv, "bBcCfpP:quz")) != -1) {
 switch (c) {
 case 'b':
 bflag = true;
 break;
+case 'B':
+background = true;
+break;
 case 'c':
 cflag = true;
 break;
@@ -1032,6 +1093,11 @@ static int write_f(BlockBackend *blk, int argc, char 
**argv)
 return 0;
 }
 
+if (background && (bflag || cflag || zflag)) {
+printf("-B cannot be specified together with -b, -c, or -z\n");
+return 0;
+}
+
 offset = cvtnum(argv[optind]);
 if (offset < 0) {
 print_cvtnum_err(offset, argv[optind]);
@@ -1074,6 +1140,8 @@ static int write_f(BlockBackend *blk, int argc, char 
**argv)
 cnt = do_co_pwrite_zeroes(blk, offset, count, flags, );
 } else if (cflag) {
 cnt = do_write_compressed(blk, buf, offset, count, );
+} else if (background) {
+cnt = do_background_pwrite(blk, buf, offset, count, flags);
 } else {
 cnt = do_pwrite(blk, buf, offset, count, flags, );
 }
@@ -1088,12 +1156,15 @@ static int write_f(BlockBackend *blk, int argc, char 
**argv)
 goto out;
 }
 
-/* Finally, report back -- -C gives a parsable format */
-t2 = tsub(t2, t1);
-print_report("wrote", , offset, count, total, cnt, Cflag);
+if (!background) {
+/* Finally, report back -- -C gives a parsable format */
+t2 = tsub(t2, t1);
+print_report("wrote", , offset, count, total, cnt, Cflag);
+}
 
 out:
-if (!zflag) {
+/* do_background_pwrite() takes ownership of the buffer */
+if (!zflag && !background) {
 qemu_io_free(buf);
 }
 
-- 
2.13.5

[Qemu-block] [PATCH 15/18] block/mirror: Add active mirroring

2017-09-13 Thread Max Reitz

This patch implements active synchronous mirroring.  In active mode, the
passive mechanism will still be in place and is used to copy all
initially dirty clusters off the source disk; but every write request
will write data both to the source and the target disk, so the source
cannot be dirtied faster than data is mirrored to the target.  Also,
once the block job has converged (BLOCK_JOB_READY sent), source and
target are guaranteed to stay in sync (unless an error occurs).

Optionally, dirty data can be copied to the target disk on read
operations, too.

Active mode is completely optional and currently disabled at runtime.  A
later patch will add a way for users to enable it.

Signed-off-by: Max Reitz 
---
 qapi/block-core.json |  23 +++
 block/mirror.c   | 187 +--
 2 files changed, 205 insertions(+), 5 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index bb11815608..e072cfa67c 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -938,6 +938,29 @@
   'data': ['top', 'full', 'none', 'incremental'] }
 
 ##
+# @MirrorCopyMode:
+#
+# An enumeration whose values tell the mirror block job when to
+# trigger writes to the target.
+#
+# @passive: copy data in background only.
+#
+# @active-write: when data is written to the source, write it
+#(synchronously) to the target as well.  In addition,
+#data is copied in background just like in @passive
+#mode.
+#
+# @active-read-write: write data to the target (synchronously) both
+# when it is read from and written to the source.
+# In addition, data is copied in background just
+# like in @passive mode.
+#
+# Since: 2.11
+##
+{ 'enum': 'MirrorCopyMode',
+  'data': ['passive', 'active-write', 'active-read-write'] }
+
+##
 # @BlockJobType:
 #
 # Type of a block job.
diff --git a/block/mirror.c b/block/mirror.c
index 8fea619a68..c429aa77bb 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -54,8 +54,12 @@ typedef struct MirrorBlockJob {
 Error *replace_blocker;
 bool is_none_mode;
 BlockMirrorBackingMode backing_mode;
+MirrorCopyMode copy_mode;
 BlockdevOnError on_source_error, on_target_error;
 bool synced;
+/* Set when the target is synced (dirty bitmap is clean, nothing
+ * in flight) and the job is running in active mode */
+bool actively_synced;
 bool should_complete;
 int64_t granularity;
 size_t buf_size;
@@ -77,6 +81,7 @@ typedef struct MirrorBlockJob {
 int target_cluster_size;
 int max_iov;
 bool initial_zeroing_ongoing;
+int in_active_write_counter;
 
 /* Signals that we are no longer accessing source and target and the mirror
  * BDS should thus relinquish all permissions */
@@ -112,6 +117,7 @@ static BlockErrorAction mirror_error_action(MirrorBlockJob 
*s, bool read,
 int error)
 {
 s->synced = false;
+s->actively_synced = false;
 if (read) {
 return block_job_error_action(>common, s->on_source_error,
   true, error);
@@ -283,13 +289,12 @@ static int mirror_cow_align(MirrorBlockJob *s, int64_t 
*offset,
 return ret;
 }
 
-static inline void mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
+static inline void mirror_wait_for_any_operation(MirrorBlockJob *s, bool 
active)
 {
 MirrorOp *op;
 
 QTAILQ_FOREACH(op, >ops_in_flight, next) {
-if (!op->is_active_write) {
-/* Only non-active operations use up in-flight slots */
+if (op->is_active_write == active) {
 qemu_co_queue_wait(>waiting_requests, NULL);
 return;
 }
@@ -297,6 +302,12 @@ static inline void 
mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
 abort();
 }
 
+static inline void mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
+{
+/* Only non-active operations use up in-flight slots */
+mirror_wait_for_any_operation(s, false);
+}
+
 /* Submit async read while handling COW.
  * Returns: The number of bytes copied after and including offset,
  *  excluding any bytes copied prior to offset due to alignment.
@@ -861,6 +872,7 @@ static void coroutine_fn mirror_run(void *opaque)
 /* Report BLOCK_JOB_READY and wait for complete. */
 block_job_event_ready(>common);
 s->synced = true;
+s->actively_synced = true;
 while (!block_job_is_cancelled(>common) && !s->should_complete) {
 block_job_yield(>common);
 }
@@ -912,6 +924,12 @@ static void coroutine_fn mirror_run(void *opaque)
 int64_t cnt, delta;
 bool should_complete;
 
+/* Do not start passive operations while there are active
+ * writes in progress */
+while (s->in_active_write_counter) {
+mirror_wait_for_any_operation(s, true);
+}
+
 if

[Qemu-block] [PATCH 18/18] iotests: Add test for active mirroring

2017-09-13 Thread Max Reitz

Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/151 | 111 +
 tests/qemu-iotests/151.out |   5 ++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 117 insertions(+)
 create mode 100755 tests/qemu-iotests/151
 create mode 100644 tests/qemu-iotests/151.out

diff --git a/tests/qemu-iotests/151 b/tests/qemu-iotests/151
new file mode 100755
index 00..49a60773f9
--- /dev/null
+++ b/tests/qemu-iotests/151
@@ -0,0 +1,111 @@
+#!/usr/bin/env python
+#
+# Tests for active mirroring
+#
+# Copyright (C) 2017 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+import os
+import iotests
+from iotests import qemu_img
+
+source_img = os.path.join(iotests.test_dir, 'source.' + iotests.imgfmt)
+target_img = os.path.join(iotests.test_dir, 'target.' + iotests.imgfmt)
+
+class TestActiveMirror(iotests.QMPTestCase):
+image_len = 128 * 1024 * 1024 # MB
+potential_writes_in_flight = True
+
+def setUp(self):
+qemu_img('create', '-f', iotests.imgfmt, source_img, '128M')
+qemu_img('create', '-f', iotests.imgfmt, target_img, '128M')
+
+blk_source = {'node-name': 'source',
+  'driver': iotests.imgfmt,
+  'file': {'driver': 'file',
+   'filename': source_img}}
+
+blk_target = {'node-name': 'target',
+  'driver': iotests.imgfmt,
+  'file': {'driver': 'file',
+   'filename': target_img}}
+
+self.vm = iotests.VM()
+self.vm.add_blockdev(self.qmp_to_opts(blk_source))
+self.vm.add_blockdev(self.qmp_to_opts(blk_target))
+self.vm.launch()
+
+def tearDown(self):
+self.vm.shutdown()
+
+if not self.potential_writes_in_flight:
+self.assertTrue(iotests.compare_images(source_img, target_img),
+'mirror target does not match source')
+
+os.remove(source_img)
+os.remove(target_img)
+
+def doActiveIO(self, sync_source_and_target):
+# Fill the source image
+self.vm.hmp_qemu_io('source',
+'write -P 1 0 %i' % self.image_len);
+
+# Start some background requests
+for offset in range(0, self.image_len, 1024 * 1024):
+self.vm.hmp_qemu_io('source', 'write -B -P 2 %i 1M' % offset)
+
+# Start the block job
+result = self.vm.qmp('blockdev-mirror',
+ job_id='mirror',
+ filter_node_name='mirror-node',
+ device='source',
+ target='target',
+ sync='full',
+ copy_mode='active-write')
+self.assert_qmp(result, 'return', {})
+
+# Start some more requests
+for offset in range(0, self.image_len, 1024 * 1024):
+self.vm.hmp_qemu_io('mirror-node', 'write -B -P 3 %i 1M' % offset)
+
+# Wait for the READY event
+self.wait_ready(drive='mirror')
+
+# Now start some final requests; all of these (which land on
+# the source) should be settled using the active mechanism.
+# The mirror code itself asserts that the source BDS's dirty
+# bitmap will stay clean between READY and COMPLETED.
+for offset in range(0, self.image_len, 1024 * 1024):
+self.vm.hmp_qemu_io('mirror-node', 'write -B -P 4 %i 1M' % offset)
+
+if sync_source_and_target:
+# If source and target should be in sync after the mirror,
+# we have to flush before completion
+self.vm.hmp_qemu_io('mirror-node', 'flush')
+self.potential_writes_in_flight = False
+
+self.complete_and_wait(drive='mirror', wait_ready=False)
+
+def testActiveIO(self):
+self.doActiveIO(False)
+
+def testActiveIOFlushed(self):
+self.doActiveIO(True)
+
+
+
+if __name__ == '__main__':
+iotests.main(supported_fmts=['qcow2', 'raw'])
diff --git a/tests/qemu-iotests/151.out b/tests/qemu-iotests/151.out
new file mode 100644
index 00..fbc63e62f8
--- /dev/null
+++ b/tests/qemu-iotests/151.out
@@ -0,0 +1,5 @@
+..
+--
+Ran 2 tests
+
+OK
diff --git a/tests/qemu-iotests/group

[Qemu-block] [PATCH 16/18] block/mirror: Add copy mode QAPI interface

2017-09-13 Thread Max Reitz

This patch allows the user to specify whether to use active or only
passive mode for mirror block jobs.  Currently, this setting will remain
constant for the duration of the entire block job.

Signed-off-by: Max Reitz 
---
 qapi/block-core.json  | 11 +--
 include/block/block_int.h |  4 +++-
 block/mirror.c| 11 ++-
 blockdev.c|  9 -
 4 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e072cfa67c..40204d367a 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1578,6 +1578,9 @@
 # written. Both will result in identical contents.
 # Default is true. (Since 2.4)
 #
+# @copy-mode: when to copy data to the destination; defaults to 'passive'
+# (Since: 2.11)
+#
 # Since: 1.3
 ##
 { 'struct': 'DriveMirror',
@@ -1587,7 +1590,7 @@
 '*speed': 'int', '*granularity': 'uint32',
 '*buf-size': 'int', '*on-source-error': 'BlockdevOnError',
 '*on-target-error': 'BlockdevOnError',
-'*unmap': 'bool' } }
+'*unmap': 'bool', '*copy-mode': 'MirrorCopyMode' } }
 
 ##
 # @BlockDirtyBitmap:
@@ -1766,6 +1769,9 @@
 #above @device. If this option is not given, a node name is
 #autogenerated. (Since: 2.9)
 #
+# @copy-mode: when to copy data to the destination; defaults to 'passive'
+# (Since: 2.11)
+#
 # Returns: nothing on success.
 #
 # Since: 2.6
@@ -1786,7 +1792,8 @@
 '*speed': 'int', '*granularity': 'uint32',
 '*buf-size': 'int', '*on-source-error': 'BlockdevOnError',
 '*on-target-error': 'BlockdevOnError',
-'*filter-node-name': 'str' } }
+'*filter-node-name': 'str',
+'*copy-mode': 'MirrorCopyMode' } }
 
 ##
 # @block_set_io_throttle:
diff --git a/include/block/block_int.h b/include/block/block_int.h
index fa8bbf1f8b..517b2680ce 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -934,6 +934,7 @@ void commit_active_start(const char *job_id, 
BlockDriverState *bs,
  * @filter_node_name: The node name that should be assigned to the filter
  * driver that the mirror job inserts into the graph above @bs. NULL means that
  * a node name should be autogenerated.
+ * @copy_mode: When to trigger writes to the target.
  * @errp: Error object.
  *
  * Start a mirroring operation on @bs.  Clusters that are allocated
@@ -947,7 +948,8 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
   MirrorSyncMode mode, BlockMirrorBackingMode backing_mode,
   BlockdevOnError on_source_error,
   BlockdevOnError on_target_error,
-  bool unmap, const char *filter_node_name, Error **errp);
+  bool unmap, const char *filter_node_name,
+  MirrorCopyMode copy_mode, Error **errp);
 
 /*
  * backup_job_create:
diff --git a/block/mirror.c b/block/mirror.c
index c429aa77bb..8a67935cc4 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1464,7 +1464,7 @@ static void mirror_start_job(const char *job_id, 
BlockDriverState *bs,
  const BlockJobDriver *driver,
  bool is_none_mode, BlockDriverState *base,
  bool auto_complete, const char *filter_node_name,
- bool is_mirror,
+ bool is_mirror, MirrorCopyMode copy_mode,
  Error **errp)
 {
 MirrorBlockJob *s;
@@ -1574,7 +1574,7 @@ static void mirror_start_job(const char *job_id, 
BlockDriverState *bs,
 s->on_target_error = on_target_error;
 s->is_none_mode = is_none_mode;
 s->backing_mode = backing_mode;
-s->copy_mode = MIRROR_COPY_MODE_PASSIVE;
+s->copy_mode = copy_mode;
 s->base = base;
 s->granularity = granularity;
 s->buf_size = ROUND_UP(buf_size, granularity);
@@ -1643,7 +1643,8 @@ void mirror_start(const char *job_id, BlockDriverState 
*bs,
   MirrorSyncMode mode, BlockMirrorBackingMode backing_mode,
   BlockdevOnError on_source_error,
   BlockdevOnError on_target_error,
-  bool unmap, const char *filter_node_name, Error **errp)
+  bool unmap, const char *filter_node_name,
+  MirrorCopyMode copy_mode, Error **errp)
 {
 bool is_none_mode;
 BlockDriverState *base;
@@ -1658,7 +1659,7 @@ void mirror_start(const char *job_id, BlockDriverState 
*bs,
  speed, granularity, buf_size, backing_mode,
  on_source_error, on_target_error, unmap, NULL, NULL,
  _job_driver, is_none_mode, base, false,
- filter_node_name, true, errp);
+ filter_node_name, true, copy_mode, errp);
 }
 
 void commit_active_start(const char *job_id, BlockDriverState *bs,
@@

[Qemu-block] [PATCH 14/18] block/mirror: Distinguish active from passive ops

2017-09-13 Thread Max Reitz

Currently, the mirror block job only knows passive operations.  But once
we introduce active writes, we need to distinguish between the two; for
example, mirror_wait_for_free_in_flight_slot() should wait for a passive
operation because active writes will not use the same in-flight slots.

Signed-off-by: Max Reitz 
---
 block/mirror.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 612fab660e..8fea619a68 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -96,6 +96,7 @@ struct MirrorOp {
 /* Set by mirror_co_read() before yielding for the first time */
 uint64_t bytes_copied;
 
+bool is_active_write;
 CoQueue waiting_requests;
 
 QTAILQ_ENTRY(MirrorOp) next;
@@ -286,9 +287,14 @@ static inline void 
mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
 {
 MirrorOp *op;
 
-op = QTAILQ_FIRST(>ops_in_flight);
-assert(op);
-qemu_co_queue_wait(>waiting_requests, NULL);
+QTAILQ_FOREACH(op, >ops_in_flight, next) {
+if (!op->is_active_write) {
+/* Only non-active operations use up in-flight slots */
+qemu_co_queue_wait(>waiting_requests, NULL);
+return;
+}
+}
+abort();
 }
 
 /* Submit async read while handling COW.
-- 
2.13.5

[Qemu-block] [PATCH 13/18] block/mirror: Keep write perm for pending writes

2017-09-13 Thread Max Reitz

The owner of the mirror BDS might retire its write permission; but there
may still be pending mirror operations so the mirror BDS cannot
necessarily retire its write permission for its child then.

Signed-off-by: Max Reitz 
---
 block/mirror.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 05410c94ca..612fab660e 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1236,6 +1236,7 @@ static void bdrv_mirror_top_child_perm(BlockDriverState 
*bs, BdrvChild *c,
uint64_t *nperm, uint64_t *nshared)
 {
 MirrorBDSOpaque *s = bs->opaque;
+bool ops_in_flight = s->job && !QTAILQ_EMPTY(>job->ops_in_flight);
 
 if (s->job && s->job->exiting) {
 *nperm = 0;
@@ -1243,9 +1244,10 @@ static void bdrv_mirror_top_child_perm(BlockDriverState 
*bs, BdrvChild *c,
 return;
 }
 
-/* Must be able to forward guest writes to the real image */
+/* Must be able to forward both new and pending guest writes to
+ * the real image */
 *nperm = 0;
-if (perm & BLK_PERM_WRITE) {
+if ((perm & BLK_PERM_WRITE) || ops_in_flight) {
 *nperm |= BLK_PERM_WRITE;
 }
 
-- 
2.13.5

[Qemu-block] [PATCH 12/18] block/dirty-bitmap: Add bdrv_dirty_iter_next_area

2017-09-13 Thread Max Reitz

This new function allows to look for a consecutively dirty area in a
dirty bitmap.

Signed-off-by: Max Reitz 
---
 include/block/dirty-bitmap.h |  2 ++
 block/dirty-bitmap.c | 52 
 2 files changed, 54 insertions(+)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index a79a58d2c3..7654748700 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -90,6 +90,8 @@ void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
 void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
 int64_t cur_sector, int64_t nr_sectors);
 int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
+bool bdrv_dirty_iter_next_area(BdrvDirtyBitmapIter *iter, uint64_t max_offset,
+   uint64_t *offset, int *bytes);
 void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num);
 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
 int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap);
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index aee57cf8c8..81b2f78016 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -550,6 +550,58 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)
 return hbitmap_iter_next(>hbi, true);
 }
 
+/**
+ * Return the next consecutively dirty area in the dirty bitmap
+ * belonging to the given iterator @iter.
+ *
+ * @max_offset: Maximum value that may be returned for
+ *  *offset + *bytes
+ * @offset: Will contain the start offset of the next dirty area
+ * @bytes:  Will contain the length of the next dirty area
+ *
+ * Returns: True if a dirty area could be found before max_offset
+ *  (which means that *offset and *bytes then contain valid
+ *  values), false otherwise.
+ */
+bool bdrv_dirty_iter_next_area(BdrvDirtyBitmapIter *iter, uint64_t max_offset,
+   uint64_t *offset, int *bytes)
+{
+uint32_t granularity = bdrv_dirty_bitmap_granularity(iter->bitmap);
+uint64_t gran_max_offset;
+int sector_gran = granularity >> BDRV_SECTOR_BITS;
+int64_t ret;
+int size;
+
+if (DIV_ROUND_UP(max_offset, BDRV_SECTOR_SIZE) == iter->bitmap->size) {
+/* If max_offset points to the image end, round it up by the
+ * bitmap granularity */
+gran_max_offset = ROUND_UP(max_offset, granularity);
+} else {
+gran_max_offset = max_offset;
+}
+
+ret = hbitmap_iter_next(>hbi, false);
+if (ret < 0 || (ret << BDRV_SECTOR_BITS) + granularity > gran_max_offset) {
+return false;
+}
+
+*offset = ret << BDRV_SECTOR_BITS;
+size = 0;
+
+assert(granularity <= INT_MAX);
+
+do {
+/* Advance iterator */
+ret = hbitmap_iter_next(>hbi, true);
+size += granularity;
+} while ((ret << BDRV_SECTOR_BITS) + granularity <= gran_max_offset &&
+ hbitmap_iter_next(>hbi, false) == ret + sector_gran &&
+ size <= INT_MAX - granularity);
+
+*bytes = MIN(size, max_offset - *offset);
+return true;
+}
+
 /* Called within bdrv_dirty_bitmap_lock..unlock */
 void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
   int64_t cur_sector, int64_t nr_sectors)
-- 
2.13.5

[Qemu-block] [PATCH 11/18] hbitmap: Add @advance param to hbitmap_iter_next()

2017-09-13 Thread Max Reitz

This new parameter allows the caller to just query the next dirty
position without moving the iterator.

Signed-off-by: Max Reitz 
---
 include/qemu/hbitmap.h |  4 +++-
 block/dirty-bitmap.c   |  2 +-
 tests/test-hbitmap.c   | 26 +-
 util/hbitmap.c | 10 +++---
 4 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index d3a74a21fc..6a52575ad5 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -316,11 +316,13 @@ void hbitmap_free_meta(HBitmap *hb);
 /**
  * hbitmap_iter_next:
  * @hbi: HBitmapIter to operate on.
+ * @advance: If true, advance the iterator.  Otherwise, the next call
+ *   of this function will return the same result.
  *
  * Return the next bit that is set in @hbi's associated HBitmap,
  * or -1 if all remaining bits are zero.
  */
-int64_t hbitmap_iter_next(HBitmapIter *hbi);
+int64_t hbitmap_iter_next(HBitmapIter *hbi, bool advance);
 
 /**
  * hbitmap_iter_next_word:
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 30462d4f9a..aee57cf8c8 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -547,7 +547,7 @@ void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter)
 
 int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)
 {
-return hbitmap_iter_next(>hbi);
+return hbitmap_iter_next(>hbi, true);
 }
 
 /* Called within bdrv_dirty_bitmap_lock..unlock */
diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
index 1acb353889..e6d4d563cb 100644
--- a/tests/test-hbitmap.c
+++ b/tests/test-hbitmap.c
@@ -46,7 +46,7 @@ static void hbitmap_test_check(TestHBitmapData *data,
 
 i = first;
 for (;;) {
-next = hbitmap_iter_next();
+next = hbitmap_iter_next(, true);
 if (next < 0) {
 next = data->size;
 }
@@ -435,25 +435,25 @@ static void test_hbitmap_iter_granularity(TestHBitmapData 
*data,
 /* Note that hbitmap_test_check has to be invoked manually in this test.  
*/
 hbitmap_test_init(data, 131072 << 7, 7);
 hbitmap_iter_init(, data->hb, 0);
-g_assert_cmpint(hbitmap_iter_next(), <, 0);
+g_assert_cmpint(hbitmap_iter_next(, true), <, 0);
 
 hbitmap_test_set(data, ((L2 + L1 + 1) << 7) + 8, 8);
 hbitmap_iter_init(, data->hb, 0);
-g_assert_cmpint(hbitmap_iter_next(), ==, (L2 + L1 + 1) << 7);
-g_assert_cmpint(hbitmap_iter_next(), <, 0);
+g_assert_cmpint(hbitmap_iter_next(, true), ==, (L2 + L1 + 1) << 7);
+g_assert_cmpint(hbitmap_iter_next(, true), <, 0);
 
 hbitmap_iter_init(, data->hb, (L2 + L1 + 2) << 7);
-g_assert_cmpint(hbitmap_iter_next(), <, 0);
+g_assert_cmpint(hbitmap_iter_next(, true), <, 0);
 
 hbitmap_test_set(data, (131072 << 7) - 8, 8);
 hbitmap_iter_init(, data->hb, 0);
-g_assert_cmpint(hbitmap_iter_next(), ==, (L2 + L1 + 1) << 7);
-g_assert_cmpint(hbitmap_iter_next(), ==, 131071 << 7);
-g_assert_cmpint(hbitmap_iter_next(), <, 0);
+g_assert_cmpint(hbitmap_iter_next(, true), ==, (L2 + L1 + 1) << 7);
+g_assert_cmpint(hbitmap_iter_next(, true), ==, 131071 << 7);
+g_assert_cmpint(hbitmap_iter_next(, true), <, 0);
 
 hbitmap_iter_init(, data->hb, (L2 + L1 + 2) << 7);
-g_assert_cmpint(hbitmap_iter_next(), ==, 131071 << 7);
-g_assert_cmpint(hbitmap_iter_next(), <, 0);
+g_assert_cmpint(hbitmap_iter_next(, true), ==, 131071 << 7);
+g_assert_cmpint(hbitmap_iter_next(, true), <, 0);
 }
 
 static void hbitmap_test_set_boundary_bits(TestHBitmapData *data, ssize_t diff)
@@ -893,7 +893,7 @@ static void test_hbitmap_serialize_zeroes(TestHBitmapData 
*data,
 for (i = 0; i < num_positions; i++) {
 hbitmap_deserialize_zeroes(data->hb, positions[i], min_l1, true);
 hbitmap_iter_init(, data->hb, 0);
-next = hbitmap_iter_next();
+next = hbitmap_iter_next(, true);
 if (i == num_positions - 1) {
 g_assert_cmpint(next, ==, -1);
 } else {
@@ -919,10 +919,10 @@ static void test_hbitmap_iter_and_reset(TestHBitmapData 
*data,
 
 hbitmap_iter_init(, data->hb, BITS_PER_LONG - 1);
 
-hbitmap_iter_next();
+hbitmap_iter_next(, true);
 
 hbitmap_reset_all(data->hb);
-hbitmap_iter_next();
+hbitmap_iter_next(, true);
 }
 
 int main(int argc, char **argv)
diff --git a/util/hbitmap.c b/util/hbitmap.c
index 21535cc90b..96525983ce 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -141,7 +141,7 @@ unsigned long hbitmap_iter_skip_words(HBitmapIter *hbi)
 return cur;
 }
 
-int64_t hbitmap_iter_next(HBitmapIter *hbi)
+int64_t hbitmap_iter_next(HBitmapIter *hbi, bool advance)
 {
 unsigned long cur = hbi->cur[HBITMAP_LEVELS - 1] &
 hbi->hb->levels[HBITMAP_LEVELS - 1][hbi->pos];
@@ -154,8 +154,12 @@ int64_t hbitmap_iter_next(HBitmapIter *hbi)
 }
 }
 
-/* The next call will resume work from the next bit.  */
-hbi->cur[HBITMAP_LEVELS - 1] = cur & (cur - 1);
+if (advance) {
+/*

[Qemu-block] [PATCH 09/18] block: Generalize should_update_child() rule

2017-09-13 Thread Max Reitz

Currently, bdrv_replace_node() refuses to create loops from one BDS to
itself if the BDS to be replaced is the backing node of the BDS to
replace it: Say there is a node A and a node B.  Replacing B by A means
making all references to B point to A.  If B is a child of A (i.e. A has
a reference to B), that would mean we would have to make this reference
point to A itself -- so we'd create a loop.

bdrv_replace_node() (through should_update_child()) refuses to do so if
B is the backing node of A.  There is no reason why we should create
loops if B is not the backing node of A, though.  The BDS graph should
never contain loops, so we should always refuse to create them.

If B is a child of A and B is to be replaced by A, we should simply
leave B in place there because it is the most sensible choice.

A more specific argument would be: Putting filter drivers into the BDS
graph is basically the same as appending an overlay to a backing chain.
But the main child BDS of a filter driver is not "backing" but "file",
so restricting the no-loop rule to backing nodes would fail here.

Signed-off-by: Max Reitz 
---
 include/block/block_int.h |  2 ++
 block.c   | 44 ++--
 2 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index eaeaad9428..fa8bbf1f8b 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -573,6 +573,8 @@ struct BdrvChild {
 QLIST_ENTRY(BdrvChild) next_parent;
 };
 
+typedef QLIST_HEAD(BdrvChildList, BdrvChild) BdrvChildList;
+
 /*
  * Note: the function bdrv_append() copies and swaps contents of
  * BlockDriverStates, so if you add new fields to this struct, please
diff --git a/block.c b/block.c
index 0b55c5a41c..1898b958c9 100644
--- a/block.c
+++ b/block.c
@@ -3134,16 +3134,39 @@ static bool should_update_child(BdrvChild *c, 
BlockDriverState *to)
 return false;
 }
 
-if (c->role == _backing) {
-/* If @from is a backing file of @to, ignore the child to avoid
- * creating a loop. We only want to change the pointer of other
- * parents. */
-QLIST_FOREACH(to_c, >children, next) {
-if (to_c == c) {
-break;
-}
-}
-if (to_c) {
+/* If the child @c belongs to the BDS @to, replacing the current
+ * c->bs by @to would mean to create a loop.
+ *
+ * Such a case occurs when appending a BDS to a backing chain.
+ * For instance, imagine the following chain:
+ *
+ *   guest device -> node A -> further backing chain...
+ *
+ * Now we create a new BDS B which we want to put on top of this
+ * chain, so we first attach A as its backing node:
+ *
+ *   node B
+ * |
+ * v
+ *   guest device -> node A -> further backing chain...
+ *
+ * Finally we want to replace A by B.  When doing that, we want to
+ * replace all pointers to A by pointers to B -- except for the
+ * pointer from B because (1) that would create a loop, and (2)
+ * that pointer should simply stay intact:
+ *
+ *   guest device -> node B
+ * |
+ * v
+ *   node A -> further backing chain...
+ *
+ * In general, when replacing a node A (c->bs) by a node B (@to),
+ * if A is a child of B, that means we cannot replace A by B there
+ * because that would create a loop.  Silently detaching A from B
+ * is also not really an option.  So overall just leaving A in
+ * place there is the most sensible choice. */
+QLIST_FOREACH(to_c, >children, next) {
+if (to_c == c) {
 return false;
 }
 }
@@ -3169,6 +3192,7 @@ void bdrv_replace_node(BlockDriverState *from, 
BlockDriverState *to,
 
 /* Put all parents into @list and calculate their cumulative permissions */
 QLIST_FOREACH_SAFE(c, >parents, next_parent, next) {
+assert(c->bs == from);
 if (!should_update_child(c, to)) {
 continue;
 }
-- 
2.13.5

[Qemu-block] [PATCH 06/18] block/mirror: Use CoQueue to wait on in-flight ops

2017-09-13 Thread Max Reitz

Attach a CoQueue to each in-flight operation so if we need to wait for
any we can use it to wait instead of just blindly yielding and hoping
for some operation to wake us.

A later patch will use this infrastructure to allow requests accessing
the same area of the virtual disk to specifically wait for each other.

Signed-off-by: Max Reitz 
---
 block/mirror.c | 34 +++---
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 2b3297aa61..81253fbad1 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -13,6 +13,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
+#include "qemu/coroutine.h"
 #include "trace.h"
 #include "block/blockjob_int.h"
 #include "block/block_int.h"
@@ -34,6 +35,8 @@ typedef struct MirrorBuffer {
 QSIMPLEQ_ENTRY(MirrorBuffer) next;
 } MirrorBuffer;
 
+typedef struct MirrorOp MirrorOp;
+
 typedef struct MirrorBlockJob {
 BlockJob common;
 RateLimit limit;
@@ -67,15 +70,15 @@ typedef struct MirrorBlockJob {
 unsigned long *in_flight_bitmap;
 int in_flight;
 int64_t bytes_in_flight;
+QTAILQ_HEAD(MirrorOpList, MirrorOp) ops_in_flight;
 int ret;
 bool unmap;
-bool waiting_for_io;
 int target_cluster_size;
 int max_iov;
 bool initial_zeroing_ongoing;
 } MirrorBlockJob;
 
-typedef struct MirrorOp {
+struct MirrorOp {
 MirrorBlockJob *s;
 QEMUIOVector qiov;
 int64_t offset;
@@ -83,7 +86,11 @@ typedef struct MirrorOp {
 
 /* Set by mirror_co_read() before yielding for the first time */
 uint64_t bytes_copied;
-} MirrorOp;
+
+CoQueue waiting_requests;
+
+QTAILQ_ENTRY(MirrorOp) next;
+};
 
 typedef enum MirrorMethod {
 MIRROR_METHOD_COPY,
@@ -124,7 +131,9 @@ static void coroutine_fn mirror_iteration_done(MirrorOp 
*op, int ret)
 
 chunk_num = op->offset / s->granularity;
 nb_chunks = DIV_ROUND_UP(op->bytes, s->granularity);
+
 bitmap_clear(s->in_flight_bitmap, chunk_num, nb_chunks);
+QTAILQ_REMOVE(>ops_in_flight, op, next);
 if (ret >= 0) {
 if (s->cow_bitmap) {
 bitmap_set(s->cow_bitmap, chunk_num, nb_chunks);
@@ -134,11 +143,9 @@ static void coroutine_fn mirror_iteration_done(MirrorOp 
*op, int ret)
 }
 }
 qemu_iovec_destroy(>qiov);
-g_free(op);
 
-if (s->waiting_for_io) {
-qemu_coroutine_enter(s->common.co);
-}
+qemu_co_queue_restart_all(>waiting_requests);
+g_free(op);
 }
 
 static void coroutine_fn mirror_write_complete(MirrorOp *op, int ret)
@@ -233,10 +240,11 @@ static int mirror_cow_align(MirrorBlockJob *s, int64_t 
*offset,
 
 static inline void mirror_wait_for_io(MirrorBlockJob *s)
 {
-assert(!s->waiting_for_io);
-s->waiting_for_io = true;
-qemu_coroutine_yield();
-s->waiting_for_io = false;
+MirrorOp *op;
+
+op = QTAILQ_FIRST(>ops_in_flight);
+assert(op);
+qemu_co_queue_wait(>waiting_requests, NULL);
 }
 
 /* Submit async read while handling COW.
@@ -342,6 +350,7 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t 
offset,
 .offset = offset,
 .bytes  = bytes,
 };
+qemu_co_queue_init(>waiting_requests);
 
 switch (mirror_method) {
 case MIRROR_METHOD_COPY:
@@ -357,6 +366,7 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t 
offset,
 abort();
 }
 
+QTAILQ_INSERT_TAIL(>ops_in_flight, op, next);
 qemu_coroutine_enter(co);
 
 if (mirror_method == MIRROR_METHOD_COPY) {
@@ -1292,6 +1302,8 @@ static void mirror_start_job(const char *job_id, 
BlockDriverState *bs,
 }
 }
 
+QTAILQ_INIT(>ops_in_flight);
+
 trace_mirror_start(bs, s, opaque);
 block_job_start(>common);
 return;
-- 
2.13.5

[Qemu-block] [PATCH 10/18] block/mirror: Make source the file child

2017-09-13 Thread Max Reitz

Regarding the source BDS, the mirror BDS is arguably a filter node.
Therefore, the source BDS should be its "file" child.

Signed-off-by: Max Reitz 
---
 block/mirror.c | 127 ++---
 block/qapi.c   |  25 ++---
 tests/qemu-iotests/141.out |   4 +-
 3 files changed, 119 insertions(+), 37 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 9df4157511..05410c94ca 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -77,8 +77,16 @@ typedef struct MirrorBlockJob {
 int target_cluster_size;
 int max_iov;
 bool initial_zeroing_ongoing;
+
+/* Signals that we are no longer accessing source and target and the mirror
+ * BDS should thus relinquish all permissions */
+bool exiting;
 } MirrorBlockJob;
 
+typedef struct MirrorBDSOpaque {
+MirrorBlockJob *job;
+} MirrorBDSOpaque;
+
 struct MirrorOp {
 MirrorBlockJob *s;
 QEMUIOVector qiov;
@@ -595,12 +603,15 @@ static void mirror_exit(BlockJob *job, void *opaque)
 {
 MirrorBlockJob *s = container_of(job, MirrorBlockJob, common);
 MirrorExitData *data = opaque;
+MirrorBDSOpaque *bs_opaque = s->mirror_top_bs->opaque;
 AioContext *replace_aio_context = NULL;
 BlockDriverState *src = s->source->bs;
 BlockDriverState *target_bs = blk_bs(s->target);
 BlockDriverState *mirror_top_bs = s->mirror_top_bs;
 Error *local_err = NULL;
 
+s->exiting = true;
+
 bdrv_release_dirty_bitmap(src, s->dirty_bitmap);
 
 /* Make sure that the source BDS doesn't go away before we called
@@ -622,7 +633,7 @@ static void mirror_exit(BlockJob *job, void *opaque)
 
 /* We don't access the source any more. Dropping any WRITE/RESIZE is
  * required before it could become a backing file of target_bs. */
-bdrv_child_try_set_perm(mirror_top_bs->backing, 0, BLK_PERM_ALL,
+bdrv_child_try_set_perm(mirror_top_bs->file, 0, BLK_PERM_ALL,
 _abort);
 if (s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
 BlockDriverState *backing = s->is_none_mode ? src : s->base;
@@ -673,12 +684,11 @@ static void mirror_exit(BlockJob *job, void *opaque)
 
 /* Remove the mirror filter driver from the graph. Before this, get rid of
  * the blockers on the intermediate nodes so that the resulting state is
- * valid. Also give up permissions on mirror_top_bs->backing, which might
+ * valid. Also give up permissions on mirror_top_bs->file, which might
  * block the removal. */
 block_job_remove_all_bdrv(job);
-bdrv_child_try_set_perm(mirror_top_bs->backing, 0, BLK_PERM_ALL,
-_abort);
-bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), _abort);
+bdrv_child_try_set_perm(mirror_top_bs->file, 0, BLK_PERM_ALL, 
_abort);
+bdrv_replace_node(mirror_top_bs, mirror_top_bs->file->bs, _abort);
 
 /* We just changed the BDS the job BB refers to (with either or both of the
  * bdrv_replace_node() calls), so switch the BB back so the cleanup does
@@ -687,6 +697,7 @@ static void mirror_exit(BlockJob *job, void *opaque)
 blk_set_perm(job->blk, 0, BLK_PERM_ALL, _abort);
 blk_insert_bs(job->blk, mirror_top_bs, _abort);
 
+bs_opaque->job = NULL;
 block_job_completed(>common, data->ret);
 
 g_free(data);
@@ -1102,7 +1113,7 @@ static void mirror_drain(BlockJob *job)
 {
 MirrorBlockJob *s = container_of(job, MirrorBlockJob, common);
 
-/* Need to keep a reference in case blk_drain triggers execution
+/* Need to keep a reference in case bdrv_drain triggers execution
  * of mirror_complete...
  */
 if (s->target) {
@@ -1135,44 +1146,88 @@ static const BlockJobDriver commit_active_job_driver = {
 .drain  = mirror_drain,
 };
 
+static void source_child_inherit_fmt_options(int *child_flags,
+ QDict *child_options,
+ int parent_flags,
+ QDict *parent_options)
+{
+child_backing.inherit_options(child_flags, child_options,
+  parent_flags, parent_options);
+}
+
+static char *source_child_get_parent_desc(BdrvChild *c)
+{
+return child_backing.get_parent_desc(c);
+}
+
+static void source_child_cb_drained_begin(BdrvChild *c)
+{
+BlockDriverState *bs = c->opaque;
+MirrorBDSOpaque *s = bs->opaque;
+
+if (s && s->job) {
+block_job_drained_begin(>job->common);
+}
+bdrv_drained_begin(bs);
+}
+
+static void source_child_cb_drained_end(BdrvChild *c)
+{
+BlockDriverState *bs = c->opaque;
+MirrorBDSOpaque *s = bs->opaque;
+
+if (s && s->job) {
+block_job_drained_end(>job->common);
+}
+bdrv_drained_end(bs);
+}
+
+static BdrvChildRole source_child_role = {
+.inherit_options= source_child_inherit_fmt_options,
+.get_parent_desc= source_child_get_parent_desc,
+

[Qemu-block] [PATCH 05/18] block/mirror: Convert to coroutines

2017-09-13 Thread Max Reitz

In order to talk to the source BDS (and maybe in the future to the
target BDS as well) directly, we need to convert our existing AIO
requests into coroutine I/O requests.

Signed-off-by: Max Reitz 
---
 block/mirror.c | 134 +
 1 file changed, 78 insertions(+), 56 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 4664b0516f..2b3297aa61 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -80,6 +80,9 @@ typedef struct MirrorOp {
 QEMUIOVector qiov;
 int64_t offset;
 uint64_t bytes;
+
+/* Set by mirror_co_read() before yielding for the first time */
+uint64_t bytes_copied;
 } MirrorOp;
 
 typedef enum MirrorMethod {
@@ -101,7 +104,7 @@ static BlockErrorAction mirror_error_action(MirrorBlockJob 
*s, bool read,
 }
 }
 
-static void mirror_iteration_done(MirrorOp *op, int ret)
+static void coroutine_fn mirror_iteration_done(MirrorOp *op, int ret)
 {
 MirrorBlockJob *s = op->s;
 struct iovec *iov;
@@ -138,9 +141,8 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
 }
 }
 
-static void mirror_write_complete(void *opaque, int ret)
+static void coroutine_fn mirror_write_complete(MirrorOp *op, int ret)
 {
-MirrorOp *op = opaque;
 MirrorBlockJob *s = op->s;
 
 aio_context_acquire(blk_get_aio_context(s->common.blk));
@@ -158,9 +160,8 @@ static void mirror_write_complete(void *opaque, int ret)
 aio_context_release(blk_get_aio_context(s->common.blk));
 }
 
-static void mirror_read_complete(void *opaque, int ret)
+static void coroutine_fn mirror_read_complete(MirrorOp *op, int ret)
 {
-MirrorOp *op = opaque;
 MirrorBlockJob *s = op->s;
 
 aio_context_acquire(blk_get_aio_context(s->common.blk));
@@ -176,8 +177,11 @@ static void mirror_read_complete(void *opaque, int ret)
 
 mirror_iteration_done(op, ret);
 } else {
-blk_aio_pwritev(s->target, op->offset, >qiov,
-0, mirror_write_complete, op);
+int ret;
+
+ret = blk_co_pwritev(s->target, op->offset,
+ op->qiov.size, >qiov, 0);
+mirror_write_complete(op, ret);
 }
 aio_context_release(blk_get_aio_context(s->common.blk));
 }
@@ -242,53 +246,49 @@ static inline void mirror_wait_for_io(MirrorBlockJob *s)
  *  (new_end - offset) if tail is rounded up or down due to
  *  alignment or buffer limit.
  */
-static uint64_t mirror_do_read(MirrorBlockJob *s, int64_t offset,
-   uint64_t bytes)
+static void coroutine_fn mirror_co_read(void *opaque)
 {
+MirrorOp *op = opaque;
+MirrorBlockJob *s = op->s;
 BlockBackend *source = s->common.blk;
 int nb_chunks;
 uint64_t ret;
-MirrorOp *op;
 uint64_t max_bytes;
 
 max_bytes = s->granularity * s->max_iov;
 
 /* We can only handle as much as buf_size at a time. */
-bytes = MIN(s->buf_size, MIN(max_bytes, bytes));
-assert(bytes);
-assert(bytes < BDRV_REQUEST_MAX_BYTES);
-ret = bytes;
+op->bytes = MIN(s->buf_size, MIN(max_bytes, op->bytes));
+assert(op->bytes);
+assert(op->bytes < BDRV_REQUEST_MAX_BYTES);
+op->bytes_copied = op->bytes;
 
 if (s->cow_bitmap) {
-ret += mirror_cow_align(s, , );
+op->bytes_copied += mirror_cow_align(s, >offset, >bytes);
 }
-assert(bytes <= s->buf_size);
+/* Cannot exceed BDRV_REQUEST_MAX_BYTES + INT_MAX */
+assert(op->bytes_copied <= UINT_MAX);
+assert(op->bytes <= s->buf_size);
 /* The offset is granularity-aligned because:
  * 1) Caller passes in aligned values;
  * 2) mirror_cow_align is used only when target cluster is larger. */
-assert(QEMU_IS_ALIGNED(offset, s->granularity));
+assert(QEMU_IS_ALIGNED(op->offset, s->granularity));
 /* The range is sector-aligned, since bdrv_getlength() rounds up. */
-assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE));
-nb_chunks = DIV_ROUND_UP(bytes, s->granularity);
+assert(QEMU_IS_ALIGNED(op->bytes, BDRV_SECTOR_SIZE));
+nb_chunks = DIV_ROUND_UP(op->bytes, s->granularity);
 
 while (s->buf_free_count < nb_chunks) {
-trace_mirror_yield_in_flight(s, offset, s->in_flight);
+trace_mirror_yield_in_flight(s, op->offset, s->in_flight);
 mirror_wait_for_io(s);
 }
 
-/* Allocate a MirrorOp that is used as an AIO callback.  */
-op = g_new(MirrorOp, 1);
-op->s = s;
-op->offset = offset;
-op->bytes = bytes;
-
 /* Now make a QEMUIOVector taking enough granularity-sized chunks
  * from s->buf_free.
  */
 qemu_iovec_init(>qiov, nb_chunks);
 while (nb_chunks-- > 0) {
 MirrorBuffer *buf = QSIMPLEQ_FIRST(>buf_free);
-size_t remaining = bytes - op->qiov.size;
+size_t remaining = op->bytes - op->qiov.size;
 
 QSIMPLEQ_REMOVE_HEAD(>buf_free, next);
 s->buf_free_count--;
@@ -297,53 +297,75 @@ static uint64_t mirror_do_read(MirrorBlockJob *s,

[Qemu-block] [PATCH 03/18] blockjob: Make drained_{begin, end} public

2017-09-13 Thread Max Reitz

When a block job decides to be represented as a BDS and track its
associated child nodes itself instead of having the BlockJob object
track them, it needs to implement the drained_begin/drained_end child
operations.  In order to do that, it has to be able to control drainage
of the block job (i.e. to pause and resume it).  Therefore, we need to
make these operations public.

Signed-off-by: Max Reitz 
---
 include/block/blockjob.h | 15 +++
 blockjob.c   | 20 ++--
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 67c0968fa5..a59f316788 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -339,6 +339,21 @@ void block_job_ref(BlockJob *job);
 void block_job_unref(BlockJob *job);
 
 /**
+ * block_job_drained_begin:
+ *
+ * Inhibit I/O requests initiated by the block job.
+ */
+void block_job_drained_begin(BlockJob *job);
+
+/**
+ * block_job_drained_end:
+ *
+ * Resume I/O after it has been paused through
+ * block_job_drained_begin().
+ */
+void block_job_drained_end(BlockJob *job);
+
+/**
  * block_job_txn_unref:
  *
  * Release a reference that was previously acquired with block_job_txn_add_job
diff --git a/blockjob.c b/blockjob.c
index 3a0c49137e..4312a121fa 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -217,21 +217,29 @@ static const BdrvChildRole child_job = {
 .stay_at_node   = true,
 };
 
-static void block_job_drained_begin(void *opaque)
+void block_job_drained_begin(BlockJob *job)
 {
-BlockJob *job = opaque;
 block_job_pause(job);
 }
 
-static void block_job_drained_end(void *opaque)
+static void block_job_drained_begin_op(void *opaque)
+{
+block_job_drained_begin(opaque);
+}
+
+void block_job_drained_end(BlockJob *job)
 {
-BlockJob *job = opaque;
 block_job_resume(job);
 }
 
+static void block_job_drained_end_op(void *opaque)
+{
+block_job_drained_end(opaque);
+}
+
 static const BlockDevOps block_job_dev_ops = {
-.drained_begin = block_job_drained_begin,
-.drained_end = block_job_drained_end,
+.drained_begin = block_job_drained_begin_op,
+.drained_end = block_job_drained_end_op,
 };
 
 void block_job_remove_all_bdrv(BlockJob *job)
-- 
2.13.5

[Qemu-block] [PATCH 07/18] block/mirror: Wait for in-flight op conflicts

2017-09-13 Thread Max Reitz

This patch makes the mirror code differentiate between simply waiting
for any operation to complete (mirror_wait_for_free_in_flight_slot())
and specifically waiting for all operations touching a certain range of
the virtual disk to complete (mirror_wait_on_conflicts()).

Signed-off-by: Max Reitz 
---
 block/mirror.c | 85 +++---
 1 file changed, 70 insertions(+), 15 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 81253fbad1..2ece38094d 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -14,6 +14,7 @@
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
 #include "qemu/coroutine.h"
+#include "qemu/range.h"
 #include "trace.h"
 #include "block/blockjob_int.h"
 #include "block/block_int.h"
@@ -111,6 +112,41 @@ static BlockErrorAction mirror_error_action(MirrorBlockJob 
*s, bool read,
 }
 }
 
+static void coroutine_fn mirror_wait_on_conflicts(MirrorOp *self,
+  MirrorBlockJob *s,
+  uint64_t offset,
+  uint64_t bytes)
+{
+uint64_t self_start_chunk = offset / s->granularity;
+uint64_t self_end_chunk = DIV_ROUND_UP(offset + bytes, s->granularity);
+uint64_t self_nb_chunks = self_end_chunk - self_start_chunk;
+
+while (find_next_bit(s->in_flight_bitmap, self_end_chunk,
+ self_start_chunk) < self_end_chunk &&
+   s->ret >= 0)
+{
+MirrorOp *op;
+
+QTAILQ_FOREACH(op, >ops_in_flight, next) {
+uint64_t op_start_chunk = op->offset / s->granularity;
+uint64_t op_nb_chunks = DIV_ROUND_UP(op->offset + op->bytes,
+ s->granularity) -
+op_start_chunk;
+
+if (op == self) {
+continue;
+}
+
+if (ranges_overlap(self_start_chunk, self_nb_chunks,
+   op_start_chunk, op_nb_chunks))
+{
+qemu_co_queue_wait(>waiting_requests, NULL);
+break;
+}
+}
+}
+}
+
 static void coroutine_fn mirror_iteration_done(MirrorOp *op, int ret)
 {
 MirrorBlockJob *s = op->s;
@@ -238,7 +274,7 @@ static int mirror_cow_align(MirrorBlockJob *s, int64_t 
*offset,
 return ret;
 }
 
-static inline void mirror_wait_for_io(MirrorBlockJob *s)
+static inline void mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
 {
 MirrorOp *op;
 
@@ -287,7 +323,7 @@ static void coroutine_fn mirror_co_read(void *opaque)
 
 while (s->buf_free_count < nb_chunks) {
 trace_mirror_yield_in_flight(s, op->offset, s->in_flight);
-mirror_wait_for_io(s);
+mirror_wait_for_free_in_flight_slot(s);
 }
 
 /* Now make a QEMUIOVector taking enough granularity-sized chunks
@@ -381,8 +417,9 @@ static unsigned mirror_perform(MirrorBlockJob *s, int64_t 
offset,
 static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 {
 BlockDriverState *source = s->source;
-int64_t offset, first_chunk;
-uint64_t delay_ns = 0;
+MirrorOp *pseudo_op;
+int64_t offset;
+uint64_t delay_ns = 0, ret = 0;
 /* At least the first dirty chunk is mirrored in one iteration. */
 int nb_chunks = 1;
 int sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
@@ -400,11 +437,7 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
 }
 bdrv_dirty_bitmap_unlock(s->dirty_bitmap);
 
-first_chunk = offset / s->granularity;
-while (test_bit(first_chunk, s->in_flight_bitmap)) {
-trace_mirror_yield_in_flight(s, offset, s->in_flight);
-mirror_wait_for_io(s);
-}
+mirror_wait_on_conflicts(NULL, s, offset, 1);
 
 block_job_pause_point(>common);
 
@@ -442,6 +475,20 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
nb_chunks * sectors_per_chunk);
 bdrv_dirty_bitmap_unlock(s->dirty_bitmap);
 
+/* Before claiming an area in the in-flight bitmap, we have to
+ * create a MirrorOp for it so that conflicting requests can wait
+ * for it.  mirror_perform() will create the real MirrorOps later,
+ * for now we just create a pseudo operation that will wake up all
+ * conflicting requests once all real operations have been
+ * launched. */
+pseudo_op = g_new(MirrorOp, 1);
+*pseudo_op = (MirrorOp){
+.offset = offset,
+.bytes  = nb_chunks * s->granularity,
+};
+qemu_co_queue_init(_op->waiting_requests);
+QTAILQ_INSERT_TAIL(>ops_in_flight, pseudo_op, next);
+
 bitmap_set(s->in_flight_bitmap, offset / s->granularity, nb_chunks);
 while (nb_chunks > 0 && offset < s->bdev_length) {
 int64_t ret;
@@ -481,11 +528,12 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
 
 while (s->in_flight >= MAX_IN_FLIGHT)

[Qemu-block] [PATCH 04/18] block/mirror: Pull out mirror_perform()

2017-09-13 Thread Max Reitz

When converting mirror's I/O to coroutines, we are going to need a point
where these coroutines are created.  mirror_perform() is going to be
that point.

Signed-off-by: Max Reitz 
---
 block/mirror.c | 53 ++---
 1 file changed, 30 insertions(+), 23 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 6531652d73..4664b0516f 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -82,6 +82,12 @@ typedef struct MirrorOp {
 uint64_t bytes;
 } MirrorOp;
 
+typedef enum MirrorMethod {
+MIRROR_METHOD_COPY,
+MIRROR_METHOD_ZERO,
+MIRROR_METHOD_DISCARD,
+} MirrorMethod;
+
 static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
 int error)
 {
@@ -324,6 +330,22 @@ static void mirror_do_zero_or_discard(MirrorBlockJob *s,
 }
 }
 
+static unsigned mirror_perform(MirrorBlockJob *s, int64_t offset,
+   unsigned bytes, MirrorMethod mirror_method)
+{
+switch (mirror_method) {
+case MIRROR_METHOD_COPY:
+return mirror_do_read(s, offset, bytes);
+case MIRROR_METHOD_ZERO:
+case MIRROR_METHOD_DISCARD:
+mirror_do_zero_or_discard(s, offset, bytes,
+  mirror_method == MIRROR_METHOD_DISCARD);
+return bytes;
+default:
+abort();
+}
+}
+
 static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 {
 BlockDriverState *source = s->source;
@@ -395,11 +417,7 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
 unsigned int io_bytes;
 int64_t io_bytes_acct;
 BlockDriverState *file;
-enum MirrorMethod {
-MIRROR_METHOD_COPY,
-MIRROR_METHOD_ZERO,
-MIRROR_METHOD_DISCARD
-} mirror_method = MIRROR_METHOD_COPY;
+MirrorMethod mirror_method = MIRROR_METHOD_COPY;
 
 assert(!(offset % s->granularity));
 ret = bdrv_get_block_status_above(source, NULL,
@@ -439,22 +457,11 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
 }
 
 io_bytes = mirror_clip_bytes(s, offset, io_bytes);
-switch (mirror_method) {
-case MIRROR_METHOD_COPY:
-io_bytes = io_bytes_acct = mirror_do_read(s, offset, io_bytes);
-break;
-case MIRROR_METHOD_ZERO:
-case MIRROR_METHOD_DISCARD:
-mirror_do_zero_or_discard(s, offset, io_bytes,
-  mirror_method == MIRROR_METHOD_DISCARD);
-if (write_zeroes_ok) {
-io_bytes_acct = 0;
-} else {
-io_bytes_acct = io_bytes;
-}
-break;
-default:
-abort();
+io_bytes = mirror_perform(s, offset, io_bytes, mirror_method);
+if (mirror_method != MIRROR_METHOD_COPY && write_zeroes_ok) {
+io_bytes_acct = 0;
+} else {
+io_bytes_acct = io_bytes;
 }
 assert(io_bytes);
 offset += io_bytes;
@@ -650,8 +657,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 continue;
 }
 
-mirror_do_zero_or_discard(s, sector_num * BDRV_SECTOR_SIZE,
-  nb_sectors * BDRV_SECTOR_SIZE, false);
+mirror_perform(s, sector_num * BDRV_SECTOR_SIZE,
+   nb_sectors * BDRV_SECTOR_SIZE, MIRROR_METHOD_ZERO);
 sector_num += nb_sectors;
 }
 
-- 
2.13.5

[Qemu-block] [PATCH 02/18] block: BDS deletion during bdrv_drain_recurse

2017-09-13 Thread Max Reitz

Drainined a BDS child may lead to both the original BDS and/or its other
children being deleted (e.g. if the original BDS represents a block
job).  We should prepare for this in both bdrv_drain_recurse() and
bdrv_drained_begin() by monitoring whether the BDS we are about to drain
still exists at all.

Signed-off-by: Max Reitz 
---
 block/io.c | 72 +-
 1 file changed, 52 insertions(+), 20 deletions(-)

diff --git a/block/io.c b/block/io.c
index 4378ae4c7d..8ec1a564ad 100644
--- a/block/io.c
+++ b/block/io.c
@@ -182,33 +182,57 @@ static void bdrv_drain_invoke(BlockDriverState *bs)
 
 static bool bdrv_drain_recurse(BlockDriverState *bs)
 {
-BdrvChild *child, *tmp;
+BdrvChild *child;
 bool waited;
+struct BDSToDrain {
+BlockDriverState *bs;
+BdrvDeletedStatus del_stat;
+QLIST_ENTRY(BDSToDrain) next;
+};
+QLIST_HEAD(, BDSToDrain) bs_list = QLIST_HEAD_INITIALIZER(bs_list);
+bool in_main_loop =
+qemu_get_current_aio_context() == qemu_get_aio_context();
 
 waited = BDRV_POLL_WHILE(bs, atomic_read(>in_flight) > 0);
 
 /* Ensure any pending metadata writes are submitted to bs->file.  */
 bdrv_drain_invoke(bs);
 
-QLIST_FOREACH_SAFE(child, >children, next, tmp) {
-BlockDriverState *bs = child->bs;
-bool in_main_loop =
-qemu_get_current_aio_context() == qemu_get_aio_context();
-assert(bs->refcnt > 0);
-if (in_main_loop) {
-/* In case the recursive bdrv_drain_recurse processes a
- * block_job_defer_to_main_loop BH and modifies the graph,
- * let's hold a reference to bs until we are done.
- *
- * IOThread doesn't have such a BH, and it is not safe to call
- * bdrv_unref without BQL, so skip doing it there.
- */
-bdrv_ref(bs);
-}
-waited |= bdrv_drain_recurse(bs);
-if (in_main_loop) {
-bdrv_unref(bs);
+/* Draining children may result in other children being removed and maybe
+ * even deleted, so copy the children list first */
+QLIST_FOREACH(child, >children, next) {
+struct BDSToDrain *bs2d = g_new0(struct BDSToDrain, 1);
+
+bs2d->bs = child->bs;
+QLIST_INSERT_HEAD(>deleted_status, >del_stat, next);
+
+QLIST_INSERT_HEAD(_list, bs2d, next);
+}
+
+while (!QLIST_EMPTY(_list)) {
+struct BDSToDrain *bs2d = QLIST_FIRST(_list);
+QLIST_REMOVE(bs2d, next);
+
+if (!bs2d->del_stat.deleted) {
+QLIST_REMOVE(>del_stat, next);
+
+if (in_main_loop) {
+/* In case the recursive bdrv_drain_recurse processes a
+ * block_job_defer_to_main_loop BH and modifies the graph,
+ * let's hold a reference to the BDS until we are done.
+ *
+ * IOThread doesn't have such a BH, and it is not safe to call
+ * bdrv_unref without BQL, so skip doing it there.
+ */
+bdrv_ref(bs2d->bs);
+}
+waited |= bdrv_drain_recurse(bs2d->bs);
+if (in_main_loop) {
+bdrv_unref(bs2d->bs);
+}
 }
+
+g_free(bs2d);
 }
 
 return waited;
@@ -252,17 +276,25 @@ static void coroutine_fn 
bdrv_co_yield_to_drain(BlockDriverState *bs)
 
 void bdrv_drained_begin(BlockDriverState *bs)
 {
+BdrvDeletedStatus del_stat = { .deleted = false };
+
 if (qemu_in_coroutine()) {
 bdrv_co_yield_to_drain(bs);
 return;
 }
 
+QLIST_INSERT_HEAD(>deleted_status, _stat, next);
+
 if (atomic_fetch_inc(>quiesce_counter) == 0) {
 aio_disable_external(bdrv_get_aio_context(bs));
 bdrv_parent_drained_begin(bs);
 }
 
-bdrv_drain_recurse(bs);
+if (!del_stat.deleted) {
+QLIST_REMOVE(_stat, next);
+
+bdrv_drain_recurse(bs);
+}
 }
 
 void bdrv_drained_end(BlockDriverState *bs)
-- 
2.13.5

[Qemu-block] [PATCH 01/18] block: Add BdrvDeletedStatus

2017-09-13 Thread Max Reitz

Sometimes an operation may delete a BDS.  It may then not be trivial to
determine this because the BDS object itself cannot be accessed
afterwards.  With this patch, one can attach a BdrvDeletedStatus object
to a BDS which can be used to safely query whether the BDS still exists
even after it has been deleted.

Signed-off-by: Max Reitz 
---
 include/block/block_int.h | 12 
 block.c   |  6 ++
 2 files changed, 18 insertions(+)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index ba4c383393..eaeaad9428 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -498,6 +498,13 @@ typedef struct BdrvAioNotifier {
 QLIST_ENTRY(BdrvAioNotifier) list;
 } BdrvAioNotifier;
 
+typedef struct BdrvDeletedStatus {
+/* Set to true by bdrv_delete() */
+bool deleted;
+
+QLIST_ENTRY(BdrvDeletedStatus) next;
+} BdrvDeletedStatus;
+
 struct BdrvChildRole {
 /* If true, bdrv_replace_node() doesn't change the node this BdrvChild
  * points to. */
@@ -706,6 +713,11 @@ struct BlockDriverState {
 
 /* Only read/written by whoever has set active_flush_req to true.  */
 unsigned int flushed_gen; /* Flushed write generation */
+
+/* When bdrv_delete() is invoked, it will walk through this list
+ * and set every entry's @deleted field to true.  The entries will
+ * not be freed automatically. */
+QLIST_HEAD(, BdrvDeletedStatus) deleted_status;
 };
 
 struct BlockBackendRootState {
diff --git a/block.c b/block.c
index 6dd47e414e..0b55c5a41c 100644
--- a/block.c
+++ b/block.c
@@ -3246,10 +3246,16 @@ out:
 
 static void bdrv_delete(BlockDriverState *bs)
 {
+BdrvDeletedStatus *del_stat;
+
 assert(!bs->job);
 assert(bdrv_op_blocker_is_empty(bs));
 assert(!bs->refcnt);
 
+QLIST_FOREACH(del_stat, >deleted_status, next) {
+del_stat->deleted = true;
+}
+
 bdrv_close(bs);
 
 /* remove from list, if necessary */
-- 
2.13.5

[Qemu-block] [PATCH 00/18] block/mirror: Add active-sync mirroring

2017-09-13 Thread Max Reitz

This series implements an active and synchronous mirroring mode.

Currently, the mirror block job is passive an asynchronous: Depending on
your start conditions, some part of the source disk starts as "dirty".
Then, the block job will (as a background operation) continuously copy
dirty parts to the target disk until all of the source disk is clean.
In the meantime, any write to the source disk dirties the affected area.

One effect of this operational mode is that the job may never converge:
If the writes to the source happen faster than the block job copies data
to the target, the job can never finish.

When the active mode implemented in this series is enabled, every write
request to the source will automatically trigger a synchronous write to
the target right afterwards.  Therefore, the source can never get dirty
faster than data is copied to the target.  Most importantly, once source
and target are in sync (BLOCK_JOB_READY is emitted), they will not
diverge (unless e.g. an I/O error occurs).

Active mirroring also improves on a second issue of the passive mode: We
do not have to read data from the source in order to write it to the
target.  When new data is written to the source in active mode, it is
automatically mirrored to the target, which saves us the superfluous
read from the source.
(Optionally, one can choose to also mirror data read from the source.
This does not necessarily help with convergence, but it saves an extra
read operation (in exchange for slower read access to the source because
this mirroring is implemented synchronously).)


There may be a couple of things to do on top of this series:
- Allow switching between active and passive mode at runtime: This
  should not be too difficult to implement, the main question is how to
  expose it to the user.
  (I seem to recall we wanted some form of block-job-set-option
  command...?)

- Implement an asynchronous active mode: May be detrimental when it
  comes to convergence, but it might be nice to have anyway.  May or may
  not be complicated to implement.

- Make the target a BdrvChild of the mirror BDS: This series does some
  work to make the mirror BDS a more integral part of the block job (see
  below for more).  One of the things I wanted to do is to make both the
  source and the target plain children of that BDS, and I did have
  patches to do this.  However, at some point continuing to do this for
  the target seemed rather difficult, and also a bit pointless, so I
  decided to keep it for later.
  (To be specific, that "some point" was exactly when I tried to rebase
  onto 045a2f8254c.)


=== Structure of this series ===

The first half (up until patch 10) restructures parts of the mirror
block job:

- Patches 4/5:
  The job is converted to use coroutines instead of AIO.
  (because this is probably where we want to go, and also because active
   mirroring will need to wait on conflicting in-flight operations, and
   I really don't want to wait on in-flight AIO requests)

  This is done primarily by patch 5, with patch 4 being necessary
  beforehand.

- Patches 6/7:
  Every in-flight operation gets a CoQueue so it can be waited on
  (because this allows active mirroring operations to wait for
  conflicting writes)

  This is started by patch 6, and with patch 7, every bit in the
  in-flight bitmap has at least one  corresponding operation in the
  MirrorBlockJob.ops_in_flight list that can be waited on.

- Patches 1/2/3/8/9/10:
  The source node is now no longer used through a BlockBackend (patch 8)
  and also it is now attached to the mirror BDS as the "file" child
  instead of the "backing" child (patch 10).
  This is mostly because I'd personally like the mirror BDS to be a real
  filter BDS instead of some technicality that needs to be there to
  solve op blocker issues.

  Patches 3 and 9 are necessary for patch 10.

  Patches 1 and 2 were necessary for this when I decided to include
  another patch to make the target node an immediate child of the mirror
  BDS, too.  However, as I wrote above, I later decided to put this idea
  off until later, and as long as the mirror BDS only has a single
  child, those patches are not strictly necessary.
  However, I think that those patches are good to have anyway, so I
  decided to keep them.


The second half (patches 11 to 18) implement active mirroring:
- Patch 11 is required by patch 12.  This in turn is required by the
  active-sync mode when mirroring data read from the source to the
  target, because that functionality needs to be able to find all the
  parts of the data read which are actually dirty so we don't copy clean
  data.

- Patches 13 and 14 prepare for the job for active operations.

- Patch 15 implements active mirroring.

- Patch 16 allows it to be used (by adding a parameter to
  blockdev-mirror and drive-mirror).

- Patch 18 adds an iotest which relies on functionality introduced by
  patch 17.



Max Reitz (18):
  block: Add BdrvDeletedStatus
  block: BDS

Re: [Qemu-block] [Qemu-devel] [PATCH] throttle: Assert that bkt->max is valid in throttle_compute_wait()

2017-09-13 Thread Philippe Mathieu-Daudé


Hi Alberto,

On 09/13/2017 05:28 AM, Alberto Garcia wrote:

If bkt->max == 0 and bkt->burst_length > 1 then we could have a
division by 0 in throttle_do_compute_wait(). That configuration is
however not permitted and is already detected by throttle_is_valid(),
but let's assert it in throttle_compute_wait() to make it explicit.


This is correct but I'm not sure this is enough, as 
throttle_compute_wait() is exported/public, however it seems testing is 
the only reason to export it.


Also I spent 10min looking at it thinking about how bkt->max is used, 
before to realize there should be a simpler way to write this (KISS).


Anyway I'm queuing your patch in my /static-analysis branch I plan to 
this week where I use a self-explanatory macro instead of assert() and 
will see if I can simplify it (note that I'm not a maintainer!).


Regards,

Phil.



Found by Coverity (CID: 1381016).

Signed-off-by: Alberto Garcia 
---
  util/throttle.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/util/throttle.c b/util/throttle.c
index 06bf916adc..b38e742da5 100644
--- a/util/throttle.c
+++ b/util/throttle.c
@@ -124,6 +124,7 @@ int64_t throttle_compute_wait(LeakyBucket *bkt)
  /* If the main bucket is not full yet we still have to check the
   * burst bucket in order to enforce the burst limit */
  if (bkt->burst_length > 1) {
+assert(bkt->max > 0); /* see throttle_is_valid() */
  extra = bkt->burst_level - burst_bucket_size;
  if (extra > 0) {
  return throttle_do_compute_wait(bkt->max, extra);

[Qemu-block] [PATCH v4 20/23] qemu-img: Change img_compare() to be byte-based

2017-09-13 Thread Eric Blake

In the continuing quest to make more things byte-based, change
the internal iteration of img_compare().  We can finally drop the
TODO assertion added earlier, now that the entire algorithm is
byte-based and no longer has to shift from bytes to sectors.

Most of the change is mechanical ('total_sectors' becomes
'total_size', 'sector_num' becomes 'offset', 'nb_sectors' becomes
'chunk', 'progress_base' goes from sectors to bytes); some of it
is also a cleanup (sectors_to_bytes() is now unused, loss of
variable 'count' added earlier in commit 51b0a488).

Signed-off-by: Eric Blake 

---
v3: new patch
---
 qemu-img.c | 119 -
 1 file changed, 46 insertions(+), 73 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 028c34a2cc..ef7062649d 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1185,11 +1185,6 @@ static int compare_buffers(const uint8_t *buf1, const 
uint8_t *buf2,

 #define IO_BUF_SIZE (2 * 1024 * 1024)

-static int64_t sectors_to_bytes(int64_t sectors)
-{
-return sectors << BDRV_SECTOR_BITS;
-}
-
 /*
  * Check if passed sectors are empty (not allocated or contain only 0 bytes)
  *
@@ -1240,7 +1235,7 @@ static int img_compare(int argc, char **argv)
 const char *fmt1 = NULL, *fmt2 = NULL, *cache, *filename1, *filename2;
 BlockBackend *blk1, *blk2;
 BlockDriverState *bs1, *bs2;
-int64_t total_sectors1, total_sectors2;
+int64_t total_size1, total_size2;
 uint8_t *buf1 = NULL, *buf2 = NULL;
 int64_t pnum1, pnum2;
 int allocated1, allocated2;
@@ -1248,9 +1243,9 @@ static int img_compare(int argc, char **argv)
 bool progress = false, quiet = false, strict = false;
 int flags;
 bool writethrough;
-int64_t total_sectors;
-int64_t sector_num = 0;
-int64_t nb_sectors;
+int64_t total_size;
+int64_t offset = 0;
+int64_t chunk;
 int c;
 uint64_t progress_base;
 bool image_opts = false;
@@ -1364,39 +1359,36 @@ static int img_compare(int argc, char **argv)

 buf1 = blk_blockalign(blk1, IO_BUF_SIZE);
 buf2 = blk_blockalign(blk2, IO_BUF_SIZE);
-total_sectors1 = blk_nb_sectors(blk1);
-if (total_sectors1 < 0) {
+total_size1 = blk_getlength(blk1);
+if (total_size1 < 0) {
 error_report("Can't get size of %s: %s",
- filename1, strerror(-total_sectors1));
+ filename1, strerror(-total_size1));
 ret = 4;
 goto out;
 }
-total_sectors2 = blk_nb_sectors(blk2);
-if (total_sectors2 < 0) {
+total_size2 = blk_getlength(blk2);
+if (total_size2 < 0) {
 error_report("Can't get size of %s: %s",
- filename2, strerror(-total_sectors2));
+ filename2, strerror(-total_size2));
 ret = 4;
 goto out;
 }
-total_sectors = MIN(total_sectors1, total_sectors2);
-progress_base = MAX(total_sectors1, total_sectors2);
+total_size = MIN(total_size1, total_size2);
+progress_base = MAX(total_size1, total_size2);

 qemu_progress_print(0, 100);

-if (strict && total_sectors1 != total_sectors2) {
+if (strict && total_size1 != total_size2) {
 ret = 1;
 qprintf(quiet, "Strict mode: Image size mismatch!\n");
 goto out;
 }

-while (sector_num < total_sectors) {
+while (offset < total_size) {
 int64_t status1, status2;

-status1 = bdrv_block_status_above(bs1, NULL,
-  sector_num * BDRV_SECTOR_SIZE,
-  (total_sectors1 - sector_num) *
-  BDRV_SECTOR_SIZE,
-  , NULL);
+status1 = bdrv_block_status_above(bs1, NULL, offset,
+  total_size1 - offset, , NULL);
 if (status1 < 0) {
 ret = 3;
 error_report("Sector allocation test failed for %s", filename1);
@@ -1404,11 +1396,8 @@ static int img_compare(int argc, char **argv)
 }
 allocated1 = status1 & BDRV_BLOCK_ALLOCATED;

-status2 = bdrv_block_status_above(bs2, NULL,
-  sector_num * BDRV_SECTOR_SIZE,
-  (total_sectors2 - sector_num) *
-  BDRV_SECTOR_SIZE,
-  , NULL);
+status2 = bdrv_block_status_above(bs2, NULL, offset,
+  total_size2 - offset, , NULL);
 if (status2 < 0) {
 ret = 3;
 error_report("Sector allocation test failed for %s", filename2);
@@ -1417,15 +1406,14 @@ static int img_compare(int argc, char **argv)
 allocated2 = status2 & BDRV_BLOCK_ALLOCATED;

 assert(pnum1 && pnum2);
-nb_sectors = DIV_ROUND_UP(MIN(pnum1, pnum2), BDRV_SECTOR_SIZE);
+chunk = MIN(pnum1, pnum2);

[Qemu-block] [PATCH v4 21/23] block: Align block status requests

2017-09-13 Thread Eric Blake

Any device that has request_alignment greater than 512 should be
unable to report status at a finer granularity; it may also be
simpler for such devices to be guaranteed that the block layer
has rounded things out to the granularity boundary (the way the
block layer already rounds all other I/O out).  Besides, getting
the code correct for super-sector alignment also benefits us
for the fact that our public interface now has byte granularity,
even though none of our drivers have byte-level callbacks.

Add an assertion in blkdebug that proves that the block layer
never requests status of unaligned sections, similar to what it
does on other requests (while still keeping the generic helper
in place for when future patches add a throttle driver).  Note
that iotest 177 already covers this (it would fail if you use
just the blkdebug.c hunk without the io.c changes).  Meanwhile,
we can drop assertions in callers that no longer have to pass
in sector-aligned addresses.

There is a mid-function scope added for 'int count', for a
couple of reasons: first, an upcoming patch will add an 'if'
statement that checks whether a driver has an old- or new-style
callback, and can conveniently use the same scope for less
indentation churn at that time.  Second, since we are trying
to get rid of sector-based computations, wrapping things in
a scope makes it easier to group and see what will be deleted
in a final cleanup patch once all drivers have been converted
to the new-style callback.

Signed-off-by: Eric Blake 

---
v3: tweak commit message [Fam], rebase to context conflicts, ensure
we don't exceed 32-bit limit, drop R-b
v2: new patch
---
 include/block/block_int.h |  3 ++-
 block/io.c| 55 +++
 block/blkdebug.c  | 13 ++-
 3 files changed, 51 insertions(+), 20 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 7f71c585a0..b1ceffba78 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -207,7 +207,8 @@ struct BlockDriver {
  * according to the current layer, and should not set
  * BDRV_BLOCK_ALLOCATED, but may set BDRV_BLOCK_RAW.  See block.h
  * for the meaning of _DATA, _ZERO, and _OFFSET_VALID.  The block
- * layer guarantees non-NULL pnum and file.
+ * layer guarantees input aligned to request_alignment, as well as
+ * non-NULL pnum and file.
  */
 int64_t coroutine_fn (*bdrv_co_get_block_status)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, int *pnum,
diff --git a/block/io.c b/block/io.c
index ea63d19480..c78201b8eb 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1773,7 +1773,8 @@ static int64_t coroutine_fn 
bdrv_co_block_status(BlockDriverState *bs,
 int64_t n; /* bytes */
 int64_t ret, ret2;
 BlockDriverState *local_file = NULL;
-int count; /* sectors */
+int64_t aligned_offset, aligned_bytes;
+uint32_t align;

 assert(pnum);
 total_size = bdrv_getlength(bs);
@@ -1815,28 +1816,45 @@ static int64_t coroutine_fn 
bdrv_co_block_status(BlockDriverState *bs,
 }

 bdrv_inc_in_flight(bs);
-/*
- * TODO: Rather than require aligned offsets, we could instead
- * round to the driver's request_alignment here, then touch up
- * count afterwards back to the caller's expectations.
- */
-assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
-bytes = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
-ret = bs->drv->bdrv_co_get_block_status(bs, offset >> BDRV_SECTOR_BITS,
-bytes >> BDRV_SECTOR_BITS, ,
-_file);
-if (ret < 0) {
-*pnum = 0;
-goto out;
+
+/* Round out to request_alignment boundaries */
+align = MAX(bs->bl.request_alignment, BDRV_SECTOR_SIZE);
+aligned_offset = QEMU_ALIGN_DOWN(offset, align);
+aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
+
+{
+int count; /* sectors */
+
+assert(QEMU_IS_ALIGNED(aligned_offset | aligned_bytes,
+   BDRV_SECTOR_SIZE));
+ret = bs->drv->bdrv_co_get_block_status(
+bs, aligned_offset >> BDRV_SECTOR_BITS,
+MIN(INT_MAX, aligned_bytes) >> BDRV_SECTOR_BITS, ,
+_file);
+if (ret < 0) {
+*pnum = 0;
+goto out;
+}
+*pnum = count * BDRV_SECTOR_SIZE;
+}
+
+/* Clamp pnum and ret to original request */
+assert(QEMU_IS_ALIGNED(*pnum, align));
+*pnum -= offset - aligned_offset;
+if (aligned_offset >> BDRV_SECTOR_BITS != offset >> BDRV_SECTOR_BITS &&
+ret & BDRV_BLOCK_OFFSET_VALID) {
+ret += QEMU_ALIGN_DOWN(offset - aligned_offset, BDRV_SECTOR_SIZE);
+}
+if (*pnum > bytes) {
+*pnum = bytes;
 }
-*pnum = count * BDRV_SECTOR_SIZE;

 if (ret & BDRV_BLOCK_RAW) {
 assert(ret & BDRV_BLOCK_OFFSET_VALID &&

[Qemu-block] [PATCH v4 22/23] block: Relax bdrv_aligned_preadv() assertion

2017-09-13 Thread Eric Blake

Now that bdrv_is_allocated accepts non-aligned inputs, we can
remove the TODO added in commit d6a644bb.

Signed-off-by: Eric Blake 

---
v3: new patch [Kevin]
---
 block/io.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/io.c b/block/io.c
index c78201b8eb..e0f9bca7e2 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1055,18 +1055,14 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild 
*child,
 }

 if (flags & BDRV_REQ_COPY_ON_READ) {
-/* TODO: Simplify further once bdrv_is_allocated no longer
- * requires sector alignment */
-int64_t start = QEMU_ALIGN_DOWN(offset, BDRV_SECTOR_SIZE);
-int64_t end = QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE);
 int64_t pnum;

-ret = bdrv_is_allocated(bs, start, end - start, );
+ret = bdrv_is_allocated(bs, offset, bytes, );
 if (ret < 0) {
 goto out;
 }

-if (!ret || pnum != end - start) {
+if (!ret || pnum != bytes) {
 ret = bdrv_co_do_copy_on_readv(child, offset, bytes, qiov);
 goto out;
 }
-- 
2.13.5

[Qemu-block] [PATCH v4 17/23] qemu-img: Change check_empty_sectors() to byte-based

2017-09-13 Thread Eric Blake

Continue on the quest to make more things byte-based instead of
sector-based.

Signed-off-by: Eric Blake 

---
v3: new patch
---
 qemu-img.c | 27 +++
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 3e1e373e8f..2e05f92e85 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1202,30 +1202,29 @@ static int64_t sectors_to_bytes(int64_t sectors)
  * an error message.
  *
  * @param blk:  BlockBackend for the image
- * @param sect_num: Number of first sector to check
- * @param sect_count: Number of sectors to check
+ * @param offset: Starting offset to check
+ * @param bytes: Number of bytes to check
  * @param filename: Name of disk file we are checking (logging purpose)
  * @param buffer: Allocated buffer for storing read data
  * @param quiet: Flag for quiet mode
  */
-static int check_empty_sectors(BlockBackend *blk, int64_t sect_num,
-   int sect_count, const char *filename,
+static int check_empty_sectors(BlockBackend *blk, int64_t offset,
+   int64_t bytes, const char *filename,
uint8_t *buffer, bool quiet)
 {
 int ret = 0;
 int64_t idx;

-ret = blk_pread(blk, sect_num << BDRV_SECTOR_BITS, buffer,
-sect_count << BDRV_SECTOR_BITS);
+ret = blk_pread(blk, offset, buffer, bytes);
 if (ret < 0) {
 error_report("Error while reading offset %" PRId64 " of %s: %s",
- sectors_to_bytes(sect_num), filename, strerror(-ret));
+ offset, filename, strerror(-ret));
 return 4;
 }
-idx = find_nonzero(buffer, sect_count * BDRV_SECTOR_SIZE);
+idx = find_nonzero(buffer, bytes);
 if (idx >= 0) {
 qprintf(quiet, "Content mismatch at offset %" PRId64 "!\n",
-sectors_to_bytes(sect_num) + idx);
+offset + idx);
 return 1;
 }

@@ -1468,10 +1467,12 @@ static int img_compare(int argc, char **argv)
 } else {
 nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
 if (allocated1) {
-ret = check_empty_sectors(blk1, sector_num, nb_sectors,
+ret = check_empty_sectors(blk1, sector_num * BDRV_SECTOR_SIZE,
+  nb_sectors * BDRV_SECTOR_SIZE,
   filename1, buf1, quiet);
 } else {
-ret = check_empty_sectors(blk2, sector_num, nb_sectors,
+ret = check_empty_sectors(blk2, sector_num * BDRV_SECTOR_SIZE,
+  nb_sectors * BDRV_SECTOR_SIZE,
   filename2, buf1, quiet);
 }
 if (ret) {
@@ -1516,7 +1517,9 @@ static int img_compare(int argc, char **argv)
 nb_sectors = count >> BDRV_SECTOR_BITS;
 if (ret & BDRV_BLOCK_ALLOCATED && !(ret & BDRV_BLOCK_ZERO)) {
 nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
-ret = check_empty_sectors(blk_over, sector_num, nb_sectors,
+ret = check_empty_sectors(blk_over,
+  sector_num * BDRV_SECTOR_SIZE,
+  nb_sectors * BDRV_SECTOR_SIZE,
   filename_over, buf1, quiet);
 if (ret) {
 goto out;
-- 
2.13.5

[Qemu-block] [PATCH v4 19/23] qemu-img: Change img_rebase() to be byte-based

2017-09-13 Thread Eric Blake

In the continuing quest to make more things byte-based, change
the internal iteration of img_rebase().  We can finally drop the
TODO assertion added earlier, now that the entire algorithm is
byte-based and no longer has to shift from bytes to sectors.

Most of the change is mechanical ('num_sectors' becomes 'size',
'sector' becomes 'offset', 'n' goes from sectors to bytes); some
of it is also a cleanup (use of MIN() instead of open-coding,
loss of variable 'count' added earlier in commit d6a644bb).

Signed-off-by: Eric Blake 

---
v3: new patch
---
 qemu-img.c | 84 +-
 1 file changed, 34 insertions(+), 50 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 034122eba5..028c34a2cc 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3244,70 +3244,58 @@ static int img_rebase(int argc, char **argv)
  * the image is the same as the original one at any time.
  */
 if (!unsafe) {
-int64_t num_sectors;
-int64_t old_backing_num_sectors;
-int64_t new_backing_num_sectors = 0;
-uint64_t sector;
-int n;
-int64_t count;
+int64_t size;
+int64_t old_backing_size;
+int64_t new_backing_size = 0;
+uint64_t offset;
+int64_t n;
 float local_progress = 0;

 buf_old = blk_blockalign(blk, IO_BUF_SIZE);
 buf_new = blk_blockalign(blk, IO_BUF_SIZE);

-num_sectors = blk_nb_sectors(blk);
-if (num_sectors < 0) {
+size = blk_getlength(blk);
+if (size < 0) {
 error_report("Could not get size of '%s': %s",
- filename, strerror(-num_sectors));
+ filename, strerror(-size));
 ret = -1;
 goto out;
 }
-old_backing_num_sectors = blk_nb_sectors(blk_old_backing);
-if (old_backing_num_sectors < 0) {
+old_backing_size = blk_getlength(blk_old_backing);
+if (old_backing_size < 0) {
 char backing_name[PATH_MAX];

 bdrv_get_backing_filename(bs, backing_name, sizeof(backing_name));
 error_report("Could not get size of '%s': %s",
- backing_name, strerror(-old_backing_num_sectors));
+ backing_name, strerror(-old_backing_size));
 ret = -1;
 goto out;
 }
 if (blk_new_backing) {
-new_backing_num_sectors = blk_nb_sectors(blk_new_backing);
-if (new_backing_num_sectors < 0) {
+new_backing_size = blk_getlength(blk_new_backing);
+if (new_backing_size < 0) {
 error_report("Could not get size of '%s': %s",
- out_baseimg, strerror(-new_backing_num_sectors));
+ out_baseimg, strerror(-new_backing_size));
 ret = -1;
 goto out;
 }
 }

-if (num_sectors != 0) {
-local_progress = (float)100 /
-(num_sectors / MIN(num_sectors, IO_BUF_SIZE / 512));
+if (size != 0) {
+local_progress = (float)100 / (size / MIN(size, IO_BUF_SIZE));
 }

-for (sector = 0; sector < num_sectors; sector += n) {
-
-/* How many sectors can we handle with the next read? */
-if (sector + (IO_BUF_SIZE / 512) <= num_sectors) {
-n = (IO_BUF_SIZE / 512);
-} else {
-n = num_sectors - sector;
-}
+for (offset = 0; offset < size; offset += n) {
+/* How many bytes can we handle with the next read? */
+n = MIN(IO_BUF_SIZE, size - offset);

 /* If the cluster is allocated, we don't need to take action */
-ret = bdrv_is_allocated(bs, sector << BDRV_SECTOR_BITS,
-n << BDRV_SECTOR_BITS, );
+ret = bdrv_is_allocated(bs, offset, n, );
 if (ret < 0) {
 error_report("error while reading image metadata: %s",
  strerror(-ret));
 goto out;
 }
-/* TODO relax this once bdrv_is_allocated does not enforce
- * sector alignment */
-assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
-n = count >> BDRV_SECTOR_BITS;
 if (ret) {
 continue;
 }
@@ -3316,30 +3304,28 @@ static int img_rebase(int argc, char **argv)
  * Read old and new backing file and take into consideration that
  * backing files may be smaller than the COW image.
  */
-if (sector >= old_backing_num_sectors) {
-memset(buf_old, 0, n * BDRV_SECTOR_SIZE);
+if (offset >= old_backing_size) {
+memset(buf_old, 0, n);
 } else {
-if (sector + n > old_backing_num_sectors) {
-n =

[Qemu-block] [PATCH v4 23/23] qemu-io: Relax 'alloc' now that block-status doesn't assert

2017-09-13 Thread Eric Blake

Previously, the alloc command required that input parameters be
sector-aligned and clamped to 32 bits, because the underlying
bdrv_is_allocated used a 32-bit parameter and asserted aligned
inputs.  But now that we have fixed block status to report a
64-bit bytes value, and to properly round requests on behalf of
guests, we can pass any values, and can use qemu-io to add
coverage that our rounding is correct regardless of the guest
alignment constraints.

Update iotest 177 to intentionally probe block status at
unaligned boundaries as well as with a bytes value that does not
map to 32-bit sectors, which also required tweaking the image
prep to leave an unallocated portion to the image under test.

Signed-off-by: Eric Blake 

---
v3: also test huge bytes value, R-b dropped
v2: new patch
---
 qemu-io-cmds.c | 13 -
 tests/qemu-iotests/177 | 12 ++--
 tests/qemu-iotests/177.out | 19 ++-
 3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 2811a89099..d9a32f3bed 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1769,10 +1769,6 @@ static int alloc_f(BlockBackend *blk, int argc, char 
**argv)
 if (offset < 0) {
 print_cvtnum_err(offset, argv[1]);
 return 0;
-} else if (!QEMU_IS_ALIGNED(offset, BDRV_SECTOR_SIZE)) {
-printf("%" PRId64 " is not a sector-aligned value for 'offset'\n",
-   offset);
-return 0;
 }

 if (argc == 3) {
@@ -1780,19 +1776,10 @@ static int alloc_f(BlockBackend *blk, int argc, char 
**argv)
 if (count < 0) {
 print_cvtnum_err(count, argv[2]);
 return 0;
-} else if (count > INT_MAX * BDRV_SECTOR_SIZE) {
-printf("length argument cannot exceed %llu, given %s\n",
-   INT_MAX * BDRV_SECTOR_SIZE, argv[2]);
-return 0;
 }
 } else {
 count = BDRV_SECTOR_SIZE;
 }
-if (!QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE)) {
-printf("%" PRId64 " is not a sector-aligned value for 'count'\n",
-   count);
-return 0;
-}

 remaining = count;
 sum_alloc = 0;
diff --git a/tests/qemu-iotests/177 b/tests/qemu-iotests/177
index f8ed8fb86b..28990977f1 100755
--- a/tests/qemu-iotests/177
+++ b/tests/qemu-iotests/177
@@ -51,7 +51,7 @@ echo "== setting up files =="
 TEST_IMG="$TEST_IMG.base" _make_test_img $size
 $QEMU_IO -c "write -P 11 0 $size" "$TEST_IMG.base" | _filter_qemu_io
 _make_test_img -b "$TEST_IMG.base"
-$QEMU_IO -c "write -P 22 0 $size" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write -P 22 0 110M" "$TEST_IMG" | _filter_qemu_io

 # Limited to 64k max-transfer
 echo
@@ -82,6 +82,13 @@ $QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
  -c "discard 8001 30M" | _filter_qemu_io

 echo
+echo "== block status smaller than alignment =="
+limits=align=4k
+$QEMU_IO -c "open -o $options,$limits blkdebug::$TEST_IMG" \
+-c "alloc 1 1" -c "alloc 0x6d0 1000" -c "alloc 127m 5P" \
+-c map | _filter_qemu_io
+
+echo
 echo "== verify image content =="

 function verify_io()
@@ -103,7 +110,8 @@ function verify_io()
 echo read -P 0 32M 32M
 echo read -P 22 64M 13M
 echo read -P $discarded 77M 29M
-echo read -P 22 106M 22M
+echo read -P 22 106M 4M
+echo read -P 11 110M 18M
 }

 verify_io | $QEMU_IO -r "$TEST_IMG" | _filter_qemu_io
diff --git a/tests/qemu-iotests/177.out b/tests/qemu-iotests/177.out
index 43a777836c..f788b55e20 100644
--- a/tests/qemu-iotests/177.out
+++ b/tests/qemu-iotests/177.out
@@ -5,8 +5,8 @@ Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=134217728
 wrote 134217728/134217728 bytes at offset 0
 128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 
backing_file=TEST_DIR/t.IMGFMT.base
-wrote 134217728/134217728 bytes at offset 0
-128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 115343360/115343360 bytes at offset 0
+110 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)

 == constrained alignment and max-transfer ==
 wrote 131072/131072 bytes at offset 1000
@@ -26,6 +26,13 @@ wrote 33554432/33554432 bytes at offset 33554432
 discard 31457280/31457280 bytes at offset 8001
 30 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)

+== block status smaller than alignment ==
+1/1 bytes allocated at offset 1 bytes
+16/1000 bytes allocated at offset 110 MiB
+0/1048576 bytes allocated at offset 127 MiB
+110 MiB (0x6e0) bytes allocated at offset 0 bytes (0x0)
+18 MiB (0x120) bytes not allocated at offset 110 MiB (0x6e0)
+
 == verify image content ==
 read 1000/1000 bytes at offset 0
 1000 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
@@ -43,12 +50,14 @@ read 13631488/13631488 bytes at offset 67108864
 13 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 read 30408704/30408704 bytes at offset 80740352
 29 MiB, X ops;

[Qemu-block] [PATCH v4 18/23] qemu-img: Change compare_sectors() to be byte-based

2017-09-13 Thread Eric Blake

In the continuing quest to make more things byte-based, change
compare_sectors(), renaming it to compare_buffers() in the
process.  Note that one caller (qemu-img compare) only cares
about the first difference, while the other (qemu-img rebase)
cares about how many consecutive sectors have the same
equal/different status; however, this patch does not bother to
micro-optimize the compare case to avoid the comparisons of
sectors beyond the first mismatch.  Both callers are always
passing valid buffers in, so the initial check for buffer size
can be turned into an assertion.

Signed-off-by: Eric Blake 

---
v3: new patch
---
 qemu-img.c | 55 +++
 1 file changed, 27 insertions(+), 28 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 2e05f92e85..034122eba5 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1155,31 +1155,28 @@ static int is_allocated_sectors_min(const uint8_t *buf, 
int n, int *pnum,
 }

 /*
- * Compares two buffers sector by sector. Returns 0 if the first sector of both
- * buffers matches, non-zero otherwise.
+ * Compares two buffers sector by sector. Returns 0 if the first
+ * sector of each buffer matches, non-zero otherwise.
  *
- * pnum is set to the number of sectors (including and immediately following
- * the first one) that are known to have the same comparison result
+ * pnum is set to the sector-aligned size of the buffer prefix that
+ * has the same matching status as the first sector.
  */
-static int compare_sectors(const uint8_t *buf1, const uint8_t *buf2, int n,
-int *pnum)
+static int compare_buffers(const uint8_t *buf1, const uint8_t *buf2,
+   int64_t bytes, int64_t *pnum)
 {
 bool res;
-int i;
+int64_t i = MIN(bytes, BDRV_SECTOR_SIZE);

-if (n <= 0) {
-*pnum = 0;
-return 0;
-}
+assert(bytes > 0);

-res = !!memcmp(buf1, buf2, 512);
-for(i = 1; i < n; i++) {
-buf1 += 512;
-buf2 += 512;
+res = !!memcmp(buf1, buf2, i);
+while (i < bytes) {
+int64_t len = MIN(bytes - i, BDRV_SECTOR_SIZE);

-if (!!memcmp(buf1, buf2, 512) != res) {
+if (!!memcmp(buf1 + i, buf2 + i, len) != res) {
 break;
 }
+i += len;
 }

 *pnum = i;
@@ -1254,7 +1251,7 @@ static int img_compare(int argc, char **argv)
 int64_t total_sectors;
 int64_t sector_num = 0;
 int64_t nb_sectors;
-int c, pnum;
+int c;
 uint64_t progress_base;
 bool image_opts = false;
 bool force_share = false;
@@ -1436,6 +1433,8 @@ static int img_compare(int argc, char **argv)
 /* nothing to do */
 } else if (allocated1 == allocated2) {
 if (allocated1) {
+int64_t pnum;
+
 nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
 ret = blk_pread(blk1, sector_num << BDRV_SECTOR_BITS, buf1,
 nb_sectors << BDRV_SECTOR_BITS);
@@ -1455,11 +1454,11 @@ static int img_compare(int argc, char **argv)
 ret = 4;
 goto out;
 }
-ret = compare_sectors(buf1, buf2, nb_sectors, );
-if (ret || pnum != nb_sectors) {
+ret = compare_buffers(buf1, buf2,
+  nb_sectors * BDRV_SECTOR_SIZE, );
+if (ret || pnum != nb_sectors * BDRV_SECTOR_SIZE) {
 qprintf(quiet, "Content mismatch at offset %" PRId64 "!\n",
-sectors_to_bytes(
-ret ? sector_num : sector_num + pnum));
+sectors_to_bytes(sector_num) + (ret ? 0 : pnum));
 ret = 1;
 goto out;
 }
@@ -3350,16 +3349,16 @@ static int img_rebase(int argc, char **argv)
 /* If they differ, we need to write to the COW file */
 uint64_t written = 0;

-while (written < n) {
-int pnum;
+while (written < n * BDRV_SECTOR_SIZE) {
+int64_t pnum;

-if (compare_sectors(buf_old + written * 512,
-buf_new + written * 512, n - written, ))
+if (compare_buffers(buf_old + written,
+buf_new + written,
+n * BDRV_SECTOR_SIZE - written, ))
 {
 ret = blk_pwrite(blk,
- (sector + written) << BDRV_SECTOR_BITS,
- buf_old + written * 512,
- pnum << BDRV_SECTOR_BITS, 0);
+ (sector << BDRV_SECTOR_BITS) + written,
+ buf_old + written, pnum, 0);
 if (ret < 0) {
 error_report("Error while writing to COW image: %s",

[Qemu-block] [PATCH v4 15/23] qemu-img: Add find_nonzero()

2017-09-13 Thread Eric Blake

During 'qemu-img compare', when we are checking that an allocated
portion of one file is all zeros, we don't need to waste time
computing how many additional sectors after the first non-zero
byte are also non-zero.  Create a new helper find_nonzero() to do
the check for a first non-zero sector, and rebase
check_empty_sectors() to use it.

The new interface intentionally uses bytes in its interface, even
though it still crawls the buffer a sector at a time; it is robust
to a partial sector at the end of the buffer.

Signed-off-by: Eric Blake 

---
v3: new patch
---
 qemu-img.c | 32 
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index f5ab29d176..dfccebe6bc 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1064,6 +1064,28 @@ done:
 }

 /*
+ * Returns -1 if 'buf' contains only zeroes, otherwise the byte index
+ * of the first sector boundary within buf where the sector contains a
+ * non-zero byte.  This function is robust to a buffer that is not
+ * sector-aligned.
+ */
+static int64_t find_nonzero(const uint8_t *buf, int64_t n)
+{
+int64_t i;
+int64_t end = QEMU_ALIGN_DOWN(n, BDRV_SECTOR_SIZE);
+
+for (i = 0; i < end; i += BDRV_SECTOR_SIZE) {
+if (!buffer_is_zero(buf + i, BDRV_SECTOR_SIZE)) {
+return i;
+}
+}
+if (i < n && !buffer_is_zero(buf + i, n - end)) {
+return i;
+}
+return -1;
+}
+
+/*
  * Returns true iff the first sector pointed to by 'buf' contains at least
  * a non-NUL byte.
  *
@@ -1188,7 +1210,9 @@ static int check_empty_sectors(BlockBackend *blk, int64_t 
sect_num,
int sect_count, const char *filename,
uint8_t *buffer, bool quiet)
 {
-int pnum, ret = 0;
+int ret = 0;
+int64_t idx;
+
 ret = blk_pread(blk, sect_num << BDRV_SECTOR_BITS, buffer,
 sect_count << BDRV_SECTOR_BITS);
 if (ret < 0) {
@@ -1196,10 +1220,10 @@ static int check_empty_sectors(BlockBackend *blk, 
int64_t sect_num,
  sectors_to_bytes(sect_num), filename, strerror(-ret));
 return ret;
 }
-ret = is_allocated_sectors(buffer, sect_count, );
-if (ret || pnum != sect_count) {
+idx = find_nonzero(buffer, sect_count * BDRV_SECTOR_SIZE);
+if (idx >= 0) {
 qprintf(quiet, "Content mismatch at offset %" PRId64 "!\n",
-sectors_to_bytes(ret ? sect_num : sect_num + pnum));
+sectors_to_bytes(sect_num) + idx);
 return 1;
 }

-- 
2.13.5

[Qemu-block] [PATCH v4 14/23] qemu-img: Speed up compare on pre-allocated larger file

2017-09-13 Thread Eric Blake

Compare the following images with all-zero contents:
$ truncate --size 1M A
$ qemu-img create -f qcow2 -o preallocation=off B 1G
$ qemu-img create -f qcow2 -o preallocation=metadata C 1G

On my machine, the difference is noticeable for pre-patch speeds,
with more than an order of magnitude in difference caused by the
choice of preallocation in the qcow2 file:

$ time ./qemu-img compare -f raw -F qcow2 A B
Warning: Image size mismatch!
Images are identical.

real0m0.014s
user0m0.007s
sys 0m0.007s

$ time ./qemu-img compare -f raw -F qcow2 A C
Warning: Image size mismatch!
Images are identical.

real0m0.341s
user0m0.144s
sys 0m0.188s

Why? Because bdrv_is_allocated() returns false for image B but
true for image C, throwing away the fact that both images know
via lseek(SEEK_HOLE) that the entire image still reads as zero.
>From there, qemu-img ends up calling bdrv_pread() for every byte
of the tail, instead of quickly looking for the next allocation.
The solution: use block_status instead of is_allocated, giving:

$ time ./qemu-img compare -f raw -F qcow2 A C
Warning: Image size mismatch!
Images are identical.

real0m0.014s
user0m0.011s
sys 0m0.003s

which is on par with the speeds for no pre-allocation.

Signed-off-by: Eric Blake 

---
v3: new patch
---
 qemu-img.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index f8423e9b3f..f5ab29d176 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1477,11 +1477,11 @@ static int img_compare(int argc, char **argv)
 while (sector_num < progress_base) {
 int64_t count;

-ret = bdrv_is_allocated_above(blk_bs(blk_over), NULL,
+ret = bdrv_block_status_above(blk_bs(blk_over), NULL,
   sector_num * BDRV_SECTOR_SIZE,
   (progress_base - sector_num) *
   BDRV_SECTOR_SIZE,
-  );
+  , NULL);
 if (ret < 0) {
 ret = 3;
 error_report("Sector allocation test failed for %s",
@@ -1489,11 +1489,11 @@ static int img_compare(int argc, char **argv)
 goto out;

 }
-/* TODO relax this once bdrv_is_allocated_above does not enforce
+/* TODO relax this once bdrv_block_status_above does not enforce
  * sector alignment */
 assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
 nb_sectors = count >> BDRV_SECTOR_BITS;
-if (ret) {
+if (ret & BDRV_BLOCK_ALLOCATED && !(ret & BDRV_BLOCK_ZERO)) {
 nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
 ret = check_empty_sectors(blk_over, sector_num, nb_sectors,
   filename_over, buf1, quiet);
-- 
2.13.5

[Qemu-block] [PATCH v4 13/23] qemu-img: Simplify logic in img_compare()

2017-09-13 Thread Eric Blake

As long as we are querying the status for a chunk smaller than
the known image size, we are guaranteed that a successful return
will have set pnum to a non-zero size (pnum is zero only for
queries beyond the end of the file).  Use that to slightly
simplify the calculation of the current chunk size being compared.
Likewise, we don't have to shrink the amount of data operated on
until we know we have to read the file, and therefore have to fit
in the bounds of our buffer.  Also, note that 'total_sectors_over'
is equivalent to 'progress_base'.

With these changes in place, sectors_to_process() is now dead code,
and can be removed.

Signed-off-by: Eric Blake 

---
v3: new patch
---
 qemu-img.c | 40 +++-
 1 file changed, 11 insertions(+), 29 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index b91133b922..f8423e9b3f 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1171,11 +1171,6 @@ static int64_t sectors_to_bytes(int64_t sectors)
 return sectors << BDRV_SECTOR_BITS;
 }

-static int64_t sectors_to_process(int64_t total, int64_t from)
-{
-return MIN(total - from, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
-}
-
 /*
  * Check if passed sectors are empty (not allocated or contain only 0 bytes)
  *
@@ -1372,13 +1367,9 @@ static int img_compare(int argc, char **argv)
 goto out;
 }

-for (;;) {
+while (sector_num < total_sectors) {
 int64_t status1, status2;

-nb_sectors = sectors_to_process(total_sectors, sector_num);
-if (nb_sectors <= 0) {
-break;
-}
 status1 = bdrv_block_status_above(bs1, NULL,
   sector_num * BDRV_SECTOR_SIZE,
   (total_sectors1 - sector_num) *
@@ -1402,14 +1393,9 @@ static int img_compare(int argc, char **argv)
 goto out;
 }
 allocated2 = status2 & BDRV_BLOCK_ALLOCATED;
-if (pnum1) {
-nb_sectors = MIN(nb_sectors,
- DIV_ROUND_UP(pnum1, BDRV_SECTOR_SIZE));
-}
-if (pnum2) {
-nb_sectors = MIN(nb_sectors,
- DIV_ROUND_UP(pnum2, BDRV_SECTOR_SIZE));
-}
+
+assert(pnum1 && pnum2);
+nb_sectors = DIV_ROUND_UP(MIN(pnum1, pnum2), BDRV_SECTOR_SIZE);

 if (strict) {
 if ((status1 & ~BDRV_BLOCK_OFFSET_MASK) !=
@@ -1422,9 +1408,10 @@ static int img_compare(int argc, char **argv)
 }
 }
 if ((status1 & BDRV_BLOCK_ZERO) && (status2 & BDRV_BLOCK_ZERO)) {
-nb_sectors = DIV_ROUND_UP(MIN(pnum1, pnum2), BDRV_SECTOR_SIZE);
+/* nothing to do */
 } else if (allocated1 == allocated2) {
 if (allocated1) {
+nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
 ret = blk_pread(blk1, sector_num << BDRV_SECTOR_BITS, buf1,
 nb_sectors << BDRV_SECTOR_BITS);
 if (ret < 0) {
@@ -1453,7 +1440,7 @@ static int img_compare(int argc, char **argv)
 }
 }
 } else {
-
+nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);
 if (allocated1) {
 ret = check_empty_sectors(blk1, sector_num, nb_sectors,
   filename1, buf1, quiet);
@@ -1476,30 +1463,24 @@ static int img_compare(int argc, char **argv)

 if (total_sectors1 != total_sectors2) {
 BlockBackend *blk_over;
-int64_t total_sectors_over;
 const char *filename_over;

 qprintf(quiet, "Warning: Image size mismatch!\n");
 if (total_sectors1 > total_sectors2) {
-total_sectors_over = total_sectors1;
 blk_over = blk1;
 filename_over = filename1;
 } else {
-total_sectors_over = total_sectors2;
 blk_over = blk2;
 filename_over = filename2;
 }

-for (;;) {
+while (sector_num < progress_base) {
 int64_t count;

-nb_sectors = sectors_to_process(total_sectors_over, sector_num);
-if (nb_sectors <= 0) {
-break;
-}
 ret = bdrv_is_allocated_above(blk_bs(blk_over), NULL,
   sector_num * BDRV_SECTOR_SIZE,
-  nb_sectors * BDRV_SECTOR_SIZE,
+  (progress_base - sector_num) *
+  BDRV_SECTOR_SIZE,
   );
 if (ret < 0) {
 ret = 3;
@@ -1513,6 +1494,7 @@ static int img_compare(int argc, char **argv)
 assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
 nb_sectors = count >> BDRV_SECTOR_BITS;
 if (ret) {
+nb_sectors = MIN(nb_sectors, IO_BUF_SIZE >> BDRV_SECTOR_BITS);

[Qemu-block] [PATCH v4 04/23] qcow2: Switch is_zero_sectors() to byte-based

2017-09-13 Thread Eric Blake

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert another internal
function (no semantic change), and rename it to is_zero() in the
process.

Signed-off-by: Eric Blake 
Reviewed-by: Fam Zheng 
---
v3: no change
v2: rename function, rebase to upstream changes
---
 block/qcow2.c | 32 ++--
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 9a7b5cd41f..eb498b56d4 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2971,21 +2971,28 @@ finish:
 }


-static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
-uint32_t count)
+static bool is_zero(BlockDriverState *bs, int64_t offset, int64_t bytes)
 {
 int nr;
 int64_t res;
+int64_t start;

-if (start + count > bs->total_sectors) {
-count = bs->total_sectors - start;
+/* Widen to sector boundaries, then clamp to image length, before
+ * checking status of underlying sectors */
+start = QEMU_ALIGN_DOWN(offset, BDRV_SECTOR_SIZE);
+bytes = QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE) - start;
+
+if (start + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
+bytes = bs->total_sectors * BDRV_SECTOR_SIZE - start;
 }

-if (!count) {
+if (!bytes) {
 return true;
 }
-res = bdrv_get_block_status_above(bs, NULL, start, count, , NULL);
-return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == count;
+res = bdrv_get_block_status_above(bs, NULL, start >> BDRV_SECTOR_BITS,
+  bytes >> BDRV_SECTOR_BITS, , NULL);
+return res >= 0 && (res & BDRV_BLOCK_ZERO) &&
+nr * BDRV_SECTOR_SIZE == bytes;
 }

 static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
@@ -3003,24 +3010,21 @@ static coroutine_fn int 
qcow2_co_pwrite_zeroes(BlockDriverState *bs,
 }

 if (head || tail) {
-int64_t cl_start = (offset - head) >> BDRV_SECTOR_BITS;
 uint64_t off;
 unsigned int nr;

 assert(head + bytes <= s->cluster_size);

 /* check whether remainder of cluster already reads as zero */
-if (!(is_zero_sectors(bs, cl_start,
-  DIV_ROUND_UP(head, BDRV_SECTOR_SIZE)) &&
-  is_zero_sectors(bs, (offset + bytes) >> BDRV_SECTOR_BITS,
-  DIV_ROUND_UP(-tail & (s->cluster_size - 1),
-   BDRV_SECTOR_SIZE {
+if (!(is_zero(bs, offset - head, head) &&
+  is_zero(bs, offset + bytes,
+  tail ? s->cluster_size - tail : 0))) {
 return -ENOTSUP;
 }

 qemu_co_mutex_lock(>lock);
 /* We can have new write after previous check */
-offset = cl_start << BDRV_SECTOR_BITS;
+offset = QEMU_ALIGN_DOWN(offset, s->cluster_size);
 bytes = s->cluster_size;
 nr = s->cluster_size;
 ret = qcow2_get_cluster_offset(bs, offset, , );
-- 
2.13.5

[Qemu-block] [PATCH v4 09/23] block: Switch BdrvCoGetBlockStatusData to byte-based

2017-09-13 Thread Eric Blake

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert another internal
type (no semantic change), and rename it to match the corresponding
public function rename.

Signed-off-by: Eric Blake 
Reviewed-by: Fam Zheng 

---
v4: no change
v3: rebase to context conflicts, simple enough to keep R-b
v2: rebase to earlier changes
---
 block/io.c | 31 ++-
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/block/io.c b/block/io.c
index da85c903dd..0adfbb8e70 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1700,17 +1700,17 @@ int bdrv_flush_all(void)
 }


-typedef struct BdrvCoGetBlockStatusData {
+typedef struct BdrvCoBlockStatusData {
 BlockDriverState *bs;
 BlockDriverState *base;
 BlockDriverState **file;
-int64_t sector_num;
-int nb_sectors;
-int *pnum;
+int64_t offset;
+int64_t bytes;
+int64_t *pnum;
 int64_t ret;
 bool mapping;
 bool done;
-} BdrvCoGetBlockStatusData;
+} BdrvCoBlockStatusData;

 int64_t coroutine_fn bdrv_co_get_block_status_from_file(BlockDriverState *bs,
 int64_t sector_num,
@@ -1941,14 +1941,16 @@ static int64_t coroutine_fn 
bdrv_co_get_block_status_above(BlockDriverState *bs,
 /* Coroutine wrapper for bdrv_get_block_status_above() */
 static void coroutine_fn bdrv_get_block_status_above_co_entry(void *opaque)
 {
-BdrvCoGetBlockStatusData *data = opaque;
+BdrvCoBlockStatusData *data = opaque;
+int n;

 data->ret = bdrv_co_get_block_status_above(data->bs, data->base,
data->mapping,
-   data->sector_num,
-   data->nb_sectors,
-   data->pnum,
+   data->offset >> 
BDRV_SECTOR_BITS,
+   data->bytes >> BDRV_SECTOR_BITS,
+   ,
data->file);
+*data->pnum = n * BDRV_SECTOR_SIZE;
 data->done = true;
 }

@@ -1965,13 +1967,14 @@ static int64_t 
bdrv_common_block_status_above(BlockDriverState *bs,
   BlockDriverState **file)
 {
 Coroutine *co;
-BdrvCoGetBlockStatusData data = {
+int64_t n;
+BdrvCoBlockStatusData data = {
 .bs = bs,
 .base = base,
 .file = file,
-.sector_num = sector_num,
-.nb_sectors = nb_sectors,
-.pnum = pnum,
+.offset = sector_num * BDRV_SECTOR_SIZE,
+.bytes = nb_sectors * BDRV_SECTOR_SIZE,
+.pnum = ,
 .mapping = mapping,
 .done = false,
 };
@@ -1985,6 +1988,8 @@ static int64_t 
bdrv_common_block_status_above(BlockDriverState *bs,
 bdrv_coroutine_enter(bs, co);
 BDRV_POLL_WHILE(bs, !data.done);
 }
+assert(data.ret < 0 || QEMU_IS_ALIGNED(n, BDRV_SECTOR_SIZE));
+*pnum = n >> BDRV_SECTOR_BITS;
 return data.ret;
 }

-- 
2.13.5

[Qemu-block] [PATCH v4 12/23] block: Convert bdrv_get_block_status_above() to bytes

2017-09-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  In the common case, allocation is unlikely to ever use
values that are not naturally sector-aligned, but it is possible
that byte-based values will let us be more precise about allocation
at the end of an unaligned file that can do byte-based access.

Changing the name of the function from bdrv_get_block_status_above()
to bdrv_block_status_above() ensures that the compiler enforces that
all callers are updated.  For now, the io.c layer still assert()s
that all callers are sector-aligned, but that can be relaxed when a
later patch implements byte-based block status in the drivers.

For the most part this patch is just the addition of scaling at the
callers followed by inverse scaling at bdrv_block_status().  But some
code, particularly bdrv_block_status(), gets a lot simpler because
it no longer has to mess with sectors.  Likewise, mirror code no
longer computes s->granularity >> BDRV_SECTOR_BITS, and can therefore
drop an assertion (fix a neighboring assertion to use is_power_of_2
while there).

For ease of review, bdrv_get_block_status() was tackled separately.

Signed-off-by: Eric Blake 

---
v4: rebase to earlier changes
v3: rebase to allocation/mapping sense change and qcow2-measure, tweak
mirror assertions, drop R-b
v2: rebase to earlier changes
---
 include/block/block.h | 10 +-
 block/io.c| 39 ---
 block/mirror.c| 16 +---
 block/qcow2.c | 25 +
 qemu-img.c| 39 +++
 5 files changed, 50 insertions(+), 79 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 7a9a8db588..e87348dcfa 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -426,11 +426,11 @@ bool bdrv_can_write_zeroes_with_unmap(BlockDriverState 
*bs);
 int64_t bdrv_block_status(BlockDriverState *bs, int64_t offset,
   int64_t bytes, int64_t *pnum,
   BlockDriverState **file);
-int64_t bdrv_get_block_status_above(BlockDriverState *bs,
-BlockDriverState *base,
-int64_t sector_num,
-int nb_sectors, int *pnum,
-BlockDriverState **file);
+int64_t bdrv_block_status_above(BlockDriverState *bs,
+BlockDriverState *base,
+int64_t offset,
+int64_t bytes, int64_t *pnum,
+BlockDriverState **file);
 int bdrv_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes,
   int64_t *pnum);
 int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
diff --git a/block/io.c b/block/io.c
index 409cfe0938..ea63d19480 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1931,7 +1931,7 @@ static int64_t coroutine_fn 
bdrv_co_block_status_above(BlockDriverState *bs,
 return ret;
 }

-/* Coroutine wrapper for bdrv_get_block_status_above() */
+/* Coroutine wrapper for bdrv_block_status_above() */
 static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
 {
 BdrvCoBlockStatusData *data = opaque;
@@ -1978,43 +1978,20 @@ static int64_t 
bdrv_common_block_status_above(BlockDriverState *bs,
 return data.ret;
 }

-int64_t bdrv_get_block_status_above(BlockDriverState *bs,
-BlockDriverState *base,
-int64_t sector_num,
-int nb_sectors, int *pnum,
-BlockDriverState **file)
+int64_t bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
+int64_t offset, int64_t bytes, int64_t *pnum,
+BlockDriverState **file)
 {
-int64_t ret;
-int64_t n;
-
-ret = bdrv_common_block_status_above(bs, base, true,
- sector_num * BDRV_SECTOR_SIZE,
- nb_sectors * BDRV_SECTOR_SIZE,
- , file);
-if (ret < 0) {
-return ret;
-}
-assert(QEMU_IS_ALIGNED(n, BDRV_SECTOR_SIZE));
-*pnum = n >> BDRV_SECTOR_BITS;
-return ret;
+return bdrv_common_block_status_above(bs, base, true, offset, bytes,
+  pnum, file);
 }

 int64_t bdrv_block_status(BlockDriverState *bs,
   int64_t offset, int64_t bytes, int64_t *pnum,
   BlockDriverState **file)
 {
-int64_t ret;
-int n;
-
-assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
-bytes = MIN(bytes, BDRV_REQUEST_MAX_BYTES);
-ret = bdrv_get_block_status_above(bs, backing_bs(bs),
-  offset >>

[Qemu-block] [PATCH v4 10/23] block: Switch bdrv_common_block_status_above() to byte-based

2017-09-13 Thread Eric Blake

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert another internal
function (no semantic change).

Signed-off-by: Eric Blake 
Reviewed-by: Fam Zheng 

---
v3: rebase to allocation/mapping sense change, simple enough to keep R-b
v2: new patch
---
 block/io.c | 41 +
 1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/block/io.c b/block/io.c
index 0adfbb8e70..bc0e3fd0e2 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1962,19 +1962,18 @@ static void coroutine_fn 
bdrv_get_block_status_above_co_entry(void *opaque)
 static int64_t bdrv_common_block_status_above(BlockDriverState *bs,
   BlockDriverState *base,
   bool mapping,
-  int64_t sector_num,
-  int nb_sectors, int *pnum,
+  int64_t offset,
+  int64_t bytes, int64_t *pnum,
   BlockDriverState **file)
 {
 Coroutine *co;
-int64_t n;
 BdrvCoBlockStatusData data = {
 .bs = bs,
 .base = base,
 .file = file,
-.offset = sector_num * BDRV_SECTOR_SIZE,
-.bytes = nb_sectors * BDRV_SECTOR_SIZE,
-.pnum = ,
+.offset = offset,
+.bytes = bytes,
+.pnum = pnum,
 .mapping = mapping,
 .done = false,
 };
@@ -1988,8 +1987,6 @@ static int64_t 
bdrv_common_block_status_above(BlockDriverState *bs,
 bdrv_coroutine_enter(bs, co);
 BDRV_POLL_WHILE(bs, !data.done);
 }
-assert(data.ret < 0 || QEMU_IS_ALIGNED(n, BDRV_SECTOR_SIZE));
-*pnum = n >> BDRV_SECTOR_BITS;
 return data.ret;
 }

@@ -1999,8 +1996,19 @@ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
 int nb_sectors, int *pnum,
 BlockDriverState **file)
 {
-return bdrv_common_block_status_above(bs, base, true, sector_num,
-  nb_sectors, pnum, file);
+int64_t ret;
+int64_t n;
+
+ret = bdrv_common_block_status_above(bs, base, true,
+ sector_num * BDRV_SECTOR_SIZE,
+ nb_sectors * BDRV_SECTOR_SIZE,
+ , file);
+if (ret < 0) {
+return ret;
+}
+assert(QEMU_IS_ALIGNED(n, BDRV_SECTOR_SIZE));
+*pnum = n >> BDRV_SECTOR_BITS;
+return ret;
 }

 int64_t bdrv_block_status(BlockDriverState *bs,
@@ -2025,20 +2033,13 @@ int coroutine_fn bdrv_is_allocated(BlockDriverState 
*bs, int64_t offset,
int64_t bytes, int64_t *pnum)
 {
 int64_t ret;
-int psectors;
+int64_t dummy;

-assert(QEMU_IS_ALIGNED(offset, BDRV_SECTOR_SIZE));
-assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE) && bytes < INT_MAX);
-ret = bdrv_common_block_status_above(bs, backing_bs(bs), false,
- offset >> BDRV_SECTOR_BITS,
- bytes >> BDRV_SECTOR_BITS, ,
- NULL);
+ret = bdrv_common_block_status_above(bs, backing_bs(bs), false, offset,
+ bytes, pnum ? pnum : , NULL);
 if (ret < 0) {
 return ret;
 }
-if (pnum) {
-*pnum = psectors * BDRV_SECTOR_SIZE;
-}
 return !!(ret & BDRV_BLOCK_ALLOCATED);
 }

-- 
2.13.5

[Qemu-block] [PATCH v4 16/23] qemu-img: Drop redundant error message in compare

2017-09-13 Thread Eric Blake

If a read error is encountered during 'qemu-img compare', we
were printing the "Error while reading offset ..." message twice.
Update the testsuite for the improved output.

Further simplify the code by hoisting the error code conversion
into the helper function, rather than repeating it at the callers.

Signed-off-by: Eric Blake 

---
v3: new patch
---
 qemu-img.c | 19 +--
 tests/qemu-iotests/074.out |  2 --
 2 files changed, 5 insertions(+), 16 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index dfccebe6bc..3e1e373e8f 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1196,8 +1196,10 @@ static int64_t sectors_to_bytes(int64_t sectors)
 /*
  * Check if passed sectors are empty (not allocated or contain only 0 bytes)
  *
- * Returns 0 in case sectors are filled with 0, 1 if sectors contain non-zero
- * data and negative value on error.
+ * Intended for use by 'qemu-img compare': Returns 0 in case sectors are
+ * filled with 0, 1 if sectors contain non-zero data (this is a comparison
+ * failure), and 4 on error (the exit status for read errors), after emitting
+ * an error message.
  *
  * @param blk:  BlockBackend for the image
  * @param sect_num: Number of first sector to check
@@ -1218,7 +1220,7 @@ static int check_empty_sectors(BlockBackend *blk, int64_t 
sect_num,
 if (ret < 0) {
 error_report("Error while reading offset %" PRId64 " of %s: %s",
  sectors_to_bytes(sect_num), filename, strerror(-ret));
-return ret;
+return 4;
 }
 idx = find_nonzero(buffer, sect_count * BDRV_SECTOR_SIZE);
 if (idx >= 0) {
@@ -1473,11 +1475,6 @@ static int img_compare(int argc, char **argv)
   filename2, buf1, quiet);
 }
 if (ret) {
-if (ret < 0) {
-error_report("Error while reading offset %" PRId64 ": %s",
- sectors_to_bytes(sector_num), strerror(-ret));
-ret = 4;
-}
 goto out;
 }
 }
@@ -1522,12 +1519,6 @@ static int img_compare(int argc, char **argv)
 ret = check_empty_sectors(blk_over, sector_num, nb_sectors,
   filename_over, buf1, quiet);
 if (ret) {
-if (ret < 0) {
-error_report("Error while reading offset %" PRId64
- " of %s: %s", 
sectors_to_bytes(sector_num),
- filename_over, strerror(-ret));
-ret = 4;
-}
 goto out;
 }
 }
diff --git a/tests/qemu-iotests/074.out b/tests/qemu-iotests/074.out
index 8fba5aea9c..ede66c3f81 100644
--- a/tests/qemu-iotests/074.out
+++ b/tests/qemu-iotests/074.out
@@ -4,7 +4,6 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 wrote 512/512 bytes at offset 512
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 qemu-img: Error while reading offset 0 of 
blkdebug:TEST_DIR/blkdebug.conf:TEST_DIR/t.IMGFMT: Input/output error
-qemu-img: Error while reading offset 0: Input/output error
 4
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Formatting 'TEST_DIR/t.IMGFMT.2', fmt=IMGFMT size=0
@@ -12,7 +11,6 @@ Formatting 'TEST_DIR/t.IMGFMT.2', fmt=IMGFMT size=0
 wrote 512/512 bytes at offset 512
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 qemu-img: Error while reading offset 0 of 
blkdebug:TEST_DIR/blkdebug.conf:TEST_DIR/t.IMGFMT: Input/output error
-qemu-img: Error while reading offset 0 of 
blkdebug:TEST_DIR/blkdebug.conf:TEST_DIR/t.IMGFMT: Input/output error
 Warning: Image size mismatch!
 4
 Cleanup
-- 
2.13.5

[Qemu-block] [PATCH v4 08/23] block: Switch bdrv_co_get_block_status() to byte-based

2017-09-13 Thread Eric Blake

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert another internal
function (no semantic change); and as with its public counterpart,
rename to bdrv_co_block_status() to make the compiler enforce that
we catch all uses.  For now, we assert that callers still pass
aligned data, but ultimately, this will be the function where we
hand off to a byte-based driver callback, and will eventually need
to add logic to ensure we round calls according to the driver's
request_alignment then touch up the result handed back to the
caller, to start permitting a caller to pass unaligned offsets.

Note that we are now prepared to accepts 'bytes' larger than INT_MAX;
this is okay as long as we clamp things internally before violating
any 32-bit limits, and makes no difference to how a client will
use the information (clients looping over the entire file must
already be prepared for consecutive calls to return the same status,
as drivers are already free to return shorter-than-maximal status
due to any other convenient split points, such as when the L2 table
crosses cluster boundaries in qcow2).

Signed-off-by: Eric Blake 

---
v4: no change
v3: rebase to allocation/mapping sense change, clamp bytes to 32-bits
when needed, drop R-b
v2: rebase to earlier changes
---
 block/io.c | 91 +-
 1 file changed, 54 insertions(+), 37 deletions(-)

diff --git a/block/io.c b/block/io.c
index 1ed46bcece..da85c903dd 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1748,42 +1748,43 @@ int64_t coroutine_fn 
bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
  * possible; otherwise, the result may omit that bit particularly if
  * it allows for a larger value in 'pnum'.
  *
- * If 'sector_num' is beyond the end of the disk image the return value is
+ * If 'offset' is beyond the end of the disk image the return value is
  * BDRV_BLOCK_EOF and 'pnum' is set to 0.
  *
- * 'pnum' is set to the number of sectors (including and immediately following
- * the specified sector) that are known to be in the same
- * allocated/unallocated state.
+ * 'pnum' is set to the number of bytes (including and immediately following
+ * the specified offset) that are known to be in the same
+ * allocated/unallocated state.  It may be NULL.
  *
- * 'nb_sectors' is the max value 'pnum' should be set to.  If nb_sectors goes
+ * 'bytes' is the max value 'pnum' should be set to.  If bytes goes
  * beyond the end of the disk image it will be clamped; if 'pnum' is set to
  * the end of the image, then the returned value will include BDRV_BLOCK_EOF.
  *
  * If returned value is positive, BDRV_BLOCK_OFFSET_VALID bit is set, and
- * 'file' is non-NULL, then '*file' points to the BDS which the sector range
- * is allocated in.
+ * 'file' is non-NULL, then '*file' points to the BDS which owns the
+ * allocated sector that contains offset.
  */
-static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
- bool mapping,
- int64_t sector_num,
- int nb_sectors, int *pnum,
- BlockDriverState **file)
+static int64_t coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
+ bool mapping,
+ int64_t offset, int64_t bytes,
+ int64_t *pnum,
+ BlockDriverState **file)
 {
-int64_t total_sectors;
-int64_t n;
+int64_t total_size;
+int64_t n; /* bytes */
 int64_t ret, ret2;
 BlockDriverState *local_file = NULL;
+int count; /* sectors */

 assert(pnum);
-total_sectors = bdrv_nb_sectors(bs);
-if (total_sectors < 0) {
+total_size = bdrv_getlength(bs);
+if (total_size < 0) {
 if (file) {
 *file = NULL;
 }
-return total_sectors;
+return total_size;
 }

-if (sector_num >= total_sectors) {
+if (offset >= total_size) {
 *pnum = 0;
 if (file) {
 *file = NULL;
@@ -1791,19 +1792,19 @@ static int64_t coroutine_fn 
bdrv_co_get_block_status(BlockDriverState *bs,
 return BDRV_BLOCK_EOF;
 }

-n = total_sectors - sector_num;
-if (n < nb_sectors) {
-nb_sectors = n;
+n = total_size - offset;
+if (n < bytes) {
+bytes = n;
 }

 if (!bs->drv->bdrv_co_get_block_status) {
-*pnum = nb_sectors;
+*pnum = bytes;
 ret = BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED;
-if (sector_num + nb_sectors == total_sectors) {
+if (offset + bytes == total_size) {
 ret |= BDRV_BLOCK_EOF;
 }
 if (bs->drv->protocol_name) {
-ret

[Qemu-block] [PATCH v4 05/23] block: Switch bdrv_make_zero() to byte-based

2017-09-13 Thread Eric Blake

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Change the internal
loop iteration of zeroing a device to track by bytes instead of
sectors (although we are still guaranteed that we iterate by steps
that are sector-aligned).

Signed-off-by: Eric Blake 
Reviewed-by: Fam Zheng 

---
v3: no change
v2: rebase to earlier changes
---
 block/io.c | 32 
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/block/io.c b/block/io.c
index b362b46e3d..638b3890b7 100644
--- a/block/io.c
+++ b/block/io.c
@@ -693,38 +693,38 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
  */
 int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
 {
-int64_t target_sectors, ret, nb_sectors, sector_num = 0;
+int64_t target_size, ret, bytes, offset = 0;
 BlockDriverState *bs = child->bs;
-int n;
+int n; /* sectors */

-target_sectors = bdrv_nb_sectors(bs);
-if (target_sectors < 0) {
-return target_sectors;
+target_size = bdrv_getlength(bs);
+if (target_size < 0) {
+return target_size;
 }

 for (;;) {
-nb_sectors = MIN(target_sectors - sector_num, 
BDRV_REQUEST_MAX_SECTORS);
-if (nb_sectors <= 0) {
+bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
+if (bytes <= 0) {
 return 0;
 }
-ret = bdrv_get_block_status(bs, sector_num, nb_sectors, , NULL);
+ret = bdrv_get_block_status(bs, offset >> BDRV_SECTOR_BITS,
+bytes >> BDRV_SECTOR_BITS, , NULL);
 if (ret < 0) {
-error_report("error getting block status at sector %" PRId64 ": 
%s",
- sector_num, strerror(-ret));
+error_report("error getting block status at offset %" PRId64 ": 
%s",
+ offset, strerror(-ret));
 return ret;
 }
 if (ret & BDRV_BLOCK_ZERO) {
-sector_num += n;
+offset += n * BDRV_SECTOR_BITS;
 continue;
 }
-ret = bdrv_pwrite_zeroes(child, sector_num << BDRV_SECTOR_BITS,
- n << BDRV_SECTOR_BITS, flags);
+ret = bdrv_pwrite_zeroes(child, offset, n * BDRV_SECTOR_SIZE, flags);
 if (ret < 0) {
-error_report("error writing zeroes at sector %" PRId64 ": %s",
- sector_num, strerror(-ret));
+error_report("error writing zeroes at offset %" PRId64 ": %s",
+ offset, strerror(-ret));
 return ret;
 }
-sector_num += n;
+offset += n * BDRV_SECTOR_SIZE;
 }
 }

-- 
2.13.5

[Qemu-block] [PATCH v4 02/23] block: Add flag to avoid wasted work in bdrv_is_allocated()

2017-09-13 Thread Eric Blake

Not all callers care about which BDS owns the mapping for a given
range of the file.  In particular, bdrv_is_allocated() cares more
about finding the largest run of allocated data from the guest
perspective, whether or not that data is consecutive from the
host perspective.  Therefore, doing subsequent refinements such
as checking how much of the format-layer allocation also satisfies
BDRV_BLOCK_ZERO at the protocol layer is wasted work - in the best
case, it just costs extra CPU cycles during a single
bdrv_is_allocated(), but in the worst case, it results in a smaller
*pnum, and forces callers to iterate through more status probes when
visiting the entire file for even more extra CPU cycles.

This patch only optimizes the block layer.  But subsequent patches
will tweak the driver callback to be byte-based, and in the process,
can also pass this hint through to the driver.

Signed-off-by: Eric Blake 

---
v4: only context changes
v3: s/allocation/mapping/ and flip sense of bool
v2: new patch
---
 block/io.c | 52 ++--
 1 file changed, 38 insertions(+), 14 deletions(-)

diff --git a/block/io.c b/block/io.c
index f250029395..6509c804d4 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1709,6 +1709,7 @@ typedef struct BdrvCoGetBlockStatusData {
 int nb_sectors;
 int *pnum;
 int64_t ret;
+bool mapping;
 bool done;
 } BdrvCoGetBlockStatusData;

@@ -1743,6 +1744,11 @@ int64_t coroutine_fn 
bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
  * Drivers not implementing the functionality are assumed to not support
  * backing files, hence all their sectors are reported as allocated.
  *
+ * If 'mapping' is true, the caller is querying for mapping purposes,
+ * and the result should include BDRV_BLOCK_OFFSET_VALID where
+ * possible; otherwise, the result may omit that bit particularly if
+ * it allows for a larger value in 'pnum'.
+ *
  * If 'sector_num' is beyond the end of the disk image the return value is
  * BDRV_BLOCK_EOF and 'pnum' is set to 0.
  *
@@ -1759,6 +1765,7 @@ int64_t coroutine_fn 
bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
  * is allocated in.
  */
 static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
+ bool mapping,
  int64_t sector_num,
  int nb_sectors, int *pnum,
  BlockDriverState **file)
@@ -1817,14 +1824,15 @@ static int64_t coroutine_fn 
bdrv_co_get_block_status(BlockDriverState *bs,

 if (ret & BDRV_BLOCK_RAW) {
 assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
-ret = bdrv_co_get_block_status(local_file, ret >> BDRV_SECTOR_BITS,
+ret = bdrv_co_get_block_status(local_file, mapping,
+   ret >> BDRV_SECTOR_BITS,
*pnum, pnum, _file);
 goto out;
 }

 if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
 ret |= BDRV_BLOCK_ALLOCATED;
-} else {
+} else if (mapping) {
 if (bdrv_unallocated_blocks_are_zero(bs)) {
 ret |= BDRV_BLOCK_ZERO;
 } else if (bs->backing) {
@@ -1836,12 +1844,13 @@ static int64_t coroutine_fn 
bdrv_co_get_block_status(BlockDriverState *bs,
 }
 }

-if (local_file && local_file != bs &&
+if (mapping && local_file && local_file != bs &&
 (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) &&
 (ret & BDRV_BLOCK_OFFSET_VALID)) {
 int file_pnum;

-ret2 = bdrv_co_get_block_status(local_file, ret >> BDRV_SECTOR_BITS,
+ret2 = bdrv_co_get_block_status(local_file, mapping,
+ret >> BDRV_SECTOR_BITS,
 *pnum, _pnum, NULL);
 if (ret2 >= 0) {
 /* Ignore errors.  This is just providing extra information, it
@@ -1876,6 +1885,7 @@ out:

 static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState 
*bs,
 BlockDriverState *base,
+bool mapping,
 int64_t sector_num,
 int nb_sectors,
 int *pnum,
@@ -1887,7 +1897,8 @@ static int64_t coroutine_fn 
bdrv_co_get_block_status_above(BlockDriverState *bs,

 assert(bs != base);
 for (p = bs; p != base; p = backing_bs(p)) {
-ret = bdrv_co_get_block_status(p, sector_num, nb_sectors, pnum, file);
+ret = bdrv_co_get_block_status(p, mapping, sector_num, nb_sectors,
+   pnum, file);
 if (ret < 0) {
 break;
 }
@@ -1917,6 +1928,7 @@ static void coroutine_fn 
bdrv_get_block_status_above_co_entry(void *opaque)
 BdrvCoGetBlockStatusData *data = opaque;

 data->ret = bdrv_co_get_block_status_above(data->bs, data->base,
+

[Qemu-block] [PATCH v4 01/23] block: Allow NULL file for bdrv_get_block_status()

2017-09-13 Thread Eric Blake

Not all callers care about which BDS owns the mapping for a given
range of the file.  This patch merely simplifies the callers by
consolidating the logic in the common call point, while guaranteeing
a non-NULL file to all the driver callbacks, for no semantic change.
The only caller that does not care about pnum is bdrv_is_allocated,
as invoked by vvfat; we can likewise add assertions that the rest
of the stack does not have to worry about a NULL pnum.

Furthermore, this will also set the stage for a future cleanup: when
a caller does not care about which BDS owns an offset, it would be
nice to allow the driver to optimize things to not have to return
BDRV_BLOCK_OFFSET_VALID in the first place.  In the case of fragmented
allocation (for example, it's fairly easy to create a qcow2 image
where consecutive guest addresses are not at consecutive host
addresses), the current contract requires bdrv_get_block_status()
to clamp *pnum to the limit where host addresses are no longer
consecutive, but allowing a NULL file means that *pnum could be
set to the full length of known-allocated data.

Signed-off-by: Eric Blake 

---
v4: only context changes
v3: rebase to recent changes (qcow2_measure), dropped R-b
v2: use local variable and final transfer, rather than assignment
of parameter to local
[previously in different series]:
v2: new patch, 
https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg05645.html
---
 include/block/block_int.h | 10 ++
 block/io.c| 44 
 block/mirror.c|  3 +--
 block/qcow2.c |  8 ++--
 qemu-img.c| 10 --
 5 files changed, 41 insertions(+), 34 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 55c5d573d4..7f71c585a0 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -202,10 +202,12 @@ struct BlockDriver {
 int64_t offset, int bytes);

 /*
- * Building block for bdrv_block_status[_above]. The driver should
- * answer only according to the current layer, and should not
- * set BDRV_BLOCK_ALLOCATED, but may set BDRV_BLOCK_RAW.  See block.h
- * for the meaning of _DATA, _ZERO, and _OFFSET_VALID.
+ * Building block for bdrv_block_status[_above] and
+ * bdrv_is_allocated[_above].  The driver should answer only
+ * according to the current layer, and should not set
+ * BDRV_BLOCK_ALLOCATED, but may set BDRV_BLOCK_RAW.  See block.h
+ * for the meaning of _DATA, _ZERO, and _OFFSET_VALID.  The block
+ * layer guarantees non-NULL pnum and file.
  */
 int64_t coroutine_fn (*bdrv_co_get_block_status)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, int *pnum,
diff --git a/block/io.c b/block/io.c
index 8a0cd8835a..f250029395 100644
--- a/block/io.c
+++ b/block/io.c
@@ -695,7 +695,6 @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
 {
 int64_t target_sectors, ret, nb_sectors, sector_num = 0;
 BlockDriverState *bs = child->bs;
-BlockDriverState *file;
 int n;

 target_sectors = bdrv_nb_sectors(bs);
@@ -708,7 +707,7 @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
 if (nb_sectors <= 0) {
 return 0;
 }
-ret = bdrv_get_block_status(bs, sector_num, nb_sectors, , );
+ret = bdrv_get_block_status(bs, sector_num, nb_sectors, , NULL);
 if (ret < 0) {
 error_report("error getting block status at sector %" PRId64 ": 
%s",
  sector_num, strerror(-ret));
@@ -1755,8 +1754,9 @@ int64_t coroutine_fn 
bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
  * beyond the end of the disk image it will be clamped; if 'pnum' is set to
  * the end of the image, then the returned value will include BDRV_BLOCK_EOF.
  *
- * If returned value is positive and BDRV_BLOCK_OFFSET_VALID bit is set, 'file'
- * points to the BDS which the sector range is allocated in.
+ * If returned value is positive, BDRV_BLOCK_OFFSET_VALID bit is set, and
+ * 'file' is non-NULL, then '*file' points to the BDS which the sector range
+ * is allocated in.
  */
 static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
  int64_t sector_num,
@@ -1766,15 +1766,22 @@ static int64_t coroutine_fn 
bdrv_co_get_block_status(BlockDriverState *bs,
 int64_t total_sectors;
 int64_t n;
 int64_t ret, ret2;
+BlockDriverState *local_file = NULL;

-*file = NULL;
+assert(pnum);
 total_sectors = bdrv_nb_sectors(bs);
 if (total_sectors < 0) {
+if (file) {
+*file = NULL;
+}
 return total_sectors;
 }

 if (sector_num >= total_sectors) {
 *pnum = 0;
+if (file) {
+*file = NULL;
+}
 return BDRV_BLOCK_EOF;
 }

@@ -1791,23 +1798,27 @@ static int64_t coroutine_fn

[Qemu-block] [PATCH v4 11/23] block: Switch bdrv_co_get_block_status_above() to byte-based

2017-09-13 Thread Eric Blake

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert another internal
type (no semantic change), and rename it to match the corresponding
public function rename.

Signed-off-by: Eric Blake 
Reviewed-by: Fam Zheng 

---
v3: rebase to allocation/mapping sense change, simple enough to keep R-b
v2: rebase to earlier changes
---
 block/io.c | 48 ++--
 1 file changed, 18 insertions(+), 30 deletions(-)

diff --git a/block/io.c b/block/io.c
index bc0e3fd0e2..409cfe0938 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1893,12 +1893,12 @@ out:
 return ret;
 }

-static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState 
*bs,
+static int64_t coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
 BlockDriverState *base,
 bool mapping,
-int64_t sector_num,
-int nb_sectors,
-int *pnum,
+int64_t offset,
+int64_t bytes,
+int64_t *pnum,
 BlockDriverState **file)
 {
 BlockDriverState *p;
@@ -1907,17 +1907,10 @@ static int64_t coroutine_fn 
bdrv_co_get_block_status_above(BlockDriverState *bs,

 assert(bs != base);
 for (p = bs; p != base; p = backing_bs(p)) {
-int64_t count;
-
-ret = bdrv_co_block_status(p, mapping,
-   sector_num * BDRV_SECTOR_SIZE,
-   nb_sectors * BDRV_SECTOR_SIZE, ,
-   file);
+ret = bdrv_co_block_status(p, mapping, offset, bytes, pnum, file);
 if (ret < 0) {
 break;
 }
-assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
-*pnum = count >> BDRV_SECTOR_BITS;
 if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
 /*
  * Reading beyond the end of the file continues to read
@@ -1925,39 +1918,35 @@ static int64_t coroutine_fn 
bdrv_co_get_block_status_above(BlockDriverState *bs,
  * unallocated length we learned from an earlier
  * iteration.
  */
-*pnum = nb_sectors;
+*pnum = bytes;
 }
 if (ret & (BDRV_BLOCK_ZERO | BDRV_BLOCK_DATA)) {
 break;
 }
-/* [sector_num, pnum] unallocated on this layer, which could be only
- * the first part of [sector_num, nb_sectors].  */
-nb_sectors = MIN(nb_sectors, *pnum);
+/* [offset, pnum] unallocated on this layer, which could be only
+ * the first part of [offset, bytes].  */
+bytes = MIN(bytes, *pnum);
 first = false;
 }
 return ret;
 }

 /* Coroutine wrapper for bdrv_get_block_status_above() */
-static void coroutine_fn bdrv_get_block_status_above_co_entry(void *opaque)
+static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
 {
 BdrvCoBlockStatusData *data = opaque;
-int n;

-data->ret = bdrv_co_get_block_status_above(data->bs, data->base,
-   data->mapping,
-   data->offset >> 
BDRV_SECTOR_BITS,
-   data->bytes >> BDRV_SECTOR_BITS,
-   ,
-   data->file);
-*data->pnum = n * BDRV_SECTOR_SIZE;
+data->ret = bdrv_co_block_status_above(data->bs, data->base,
+   data->mapping,
+   data->offset, data->bytes,
+   data->pnum, data->file);
 data->done = true;
 }

 /*
- * Synchronous wrapper around bdrv_co_get_block_status_above().
+ * Synchronous wrapper around bdrv_co_block_status_above().
  *
- * See bdrv_co_get_block_status_above() for details.
+ * See bdrv_co_block_status_above() for details.
  */
 static int64_t bdrv_common_block_status_above(BlockDriverState *bs,
   BlockDriverState *base,
@@ -1980,10 +1969,9 @@ static int64_t 
bdrv_common_block_status_above(BlockDriverState *bs,

 if (qemu_in_coroutine()) {
 /* Fast-path if already in coroutine context */
-bdrv_get_block_status_above_co_entry();
+bdrv_block_status_above_co_entry();
 } else {
-co = qemu_coroutine_create(bdrv_get_block_status_above_co_entry,
-   );
+co = qemu_coroutine_create(bdrv_block_status_above_co_entry, );
 bdrv_coroutine_enter(bs, co);
 BDRV_POLL_WHILE(bs, !data.done);
 }
-- 
2.13.5

[Qemu-block] [PATCH v4 06/23] qemu-img: Switch get_block_status() to byte-based

2017-09-13 Thread Eric Blake

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Continue by converting
an internal function (no semantic change), and simplifying its
caller accordingly.

Signed-off-by: Eric Blake 
Reviewed-by: Fam Zheng 
---
v2-v3: no change
---
 qemu-img.c | 24 +++-
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 0c12e1c240..54f7682069 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2670,14 +2670,16 @@ static void dump_map_entry(OutputFormat output_format, 
MapEntry *e,
 }
 }

-static int get_block_status(BlockDriverState *bs, int64_t sector_num,
-int nb_sectors, MapEntry *e)
+static int get_block_status(BlockDriverState *bs, int64_t offset,
+int64_t bytes, MapEntry *e)
 {
 int64_t ret;
 int depth;
 BlockDriverState *file;
 bool has_offset;
+int nb_sectors = bytes >> BDRV_SECTOR_BITS;

+assert(bytes < INT_MAX);
 /* As an optimization, we could cache the current range of unallocated
  * clusters in each file of the chain, and avoid querying the same
  * range repeatedly.
@@ -2685,8 +2687,8 @@ static int get_block_status(BlockDriverState *bs, int64_t 
sector_num,

 depth = 0;
 for (;;) {
-ret = bdrv_get_block_status(bs, sector_num, nb_sectors, _sectors,
-);
+ret = bdrv_get_block_status(bs, offset >> BDRV_SECTOR_BITS, nb_sectors,
+_sectors, );
 if (ret < 0) {
 return ret;
 }
@@ -2706,7 +2708,7 @@ static int get_block_status(BlockDriverState *bs, int64_t 
sector_num,
 has_offset = !!(ret & BDRV_BLOCK_OFFSET_VALID);

 *e = (MapEntry) {
-.start = sector_num * BDRV_SECTOR_SIZE,
+.start = offset,
 .length = nb_sectors * BDRV_SECTOR_SIZE,
 .data = !!(ret & BDRV_BLOCK_DATA),
 .zero = !!(ret & BDRV_BLOCK_ZERO),
@@ -2836,16 +2838,12 @@ static int img_map(int argc, char **argv)

 length = blk_getlength(blk);
 while (curr.start + curr.length < length) {
-int64_t nsectors_left;
-int64_t sector_num;
-int n;
-
-sector_num = (curr.start + curr.length) >> BDRV_SECTOR_BITS;
+int64_t offset = curr.start + curr.length;
+int64_t n;

 /* Probe up to 1 GiB at a time.  */
-nsectors_left = DIV_ROUND_UP(length, BDRV_SECTOR_SIZE) - sector_num;
-n = MIN(1 << (30 - BDRV_SECTOR_BITS), nsectors_left);
-ret = get_block_status(bs, sector_num, n, );
+n = QEMU_ALIGN_DOWN(MIN(1 << 30, length - offset), BDRV_SECTOR_SIZE);
+ret = get_block_status(bs, offset, n, );

 if (ret < 0) {
 error_report("Could not read file metadata: %s", strerror(-ret));
-- 
2.13.5

[Qemu-block] [PATCH v4 07/23] block: Convert bdrv_get_block_status() to bytes

2017-09-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  In the common case, allocation is unlikely to ever use
values that are not naturally sector-aligned, but it is possible
that byte-based values will let us be more precise about allocation
at the end of an unaligned file that can do byte-based access.

Changing the name of the function from bdrv_get_block_status() to
bdrv_block_status() ensures that the compiler enforces that all
callers are updated.  For now, the io.c layer still assert()s that
all callers are sector-aligned, but that can be relaxed when a later
patch implements byte-based block status in the drivers.

Note that we have an inherent limitation in the BDRV_BLOCK_* return
values: BDRV_BLOCK_OFFSET_VALID can only return the start of a
sector, even if we later relax the interface to query for the status
starting at an intermediate byte; document the obvious interpretation
that valid offsets are always sector-relative.

Therefore, for the most part this patch is just the addition of scaling
at the callers followed by inverse scaling at bdrv_block_status().  But
some code, particularly bdrv_is_allocated(), gets a lot simpler because
it no longer has to mess with sectors.

For ease of review, bdrv_get_block_status_above() will be tackled
separately.

Signed-off-by: Eric Blake 

---
v3: clamp bytes to 32-bits, rather than asserting
v2: rebase to earlier changes
---
 include/block/block.h | 12 +++-
 block/io.c| 31 +++
 block/qcow2-cluster.c |  2 +-
 qemu-img.c| 20 +++-
 4 files changed, 38 insertions(+), 27 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index bb3b95d491..7a9a8db588 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -138,8 +138,10 @@ typedef struct HDGeometry {
  *
  * If BDRV_BLOCK_OFFSET_VALID is set, bits 9-62 (BDRV_BLOCK_OFFSET_MASK)
  * represent the offset in the returned BDS that is allocated for the
- * corresponding raw data; however, whether that offset actually contains
- * data also depends on BDRV_BLOCK_DATA and BDRV_BLOCK_ZERO, as follows:
+ * corresponding raw data.  Individual bytes are at the same sector-relative
+ * locations (and thus, this bit cannot be set for mappings which are
+ * not equivalent modulo 512).  However, whether that offset actually
+ * contains data also depends on BDRV_BLOCK_DATA, as follows:
  *
  * DATA ZERO OFFSET_VALID
  *  ttt   sectors read as zero, returned file is zero at offset
@@ -421,9 +423,9 @@ int bdrv_has_zero_init_1(BlockDriverState *bs);
 int bdrv_has_zero_init(BlockDriverState *bs);
 bool bdrv_unallocated_blocks_are_zero(BlockDriverState *bs);
 bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
-int64_t bdrv_get_block_status(BlockDriverState *bs, int64_t sector_num,
-  int nb_sectors, int *pnum,
-  BlockDriverState **file);
+int64_t bdrv_block_status(BlockDriverState *bs, int64_t offset,
+  int64_t bytes, int64_t *pnum,
+  BlockDriverState **file);
 int64_t bdrv_get_block_status_above(BlockDriverState *bs,
 BlockDriverState *base,
 int64_t sector_num,
diff --git a/block/io.c b/block/io.c
index 638b3890b7..1ed46bcece 100644
--- a/block/io.c
+++ b/block/io.c
@@ -695,7 +695,6 @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
 {
 int64_t target_size, ret, bytes, offset = 0;
 BlockDriverState *bs = child->bs;
-int n; /* sectors */

 target_size = bdrv_getlength(bs);
 if (target_size < 0) {
@@ -707,24 +706,23 @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags 
flags)
 if (bytes <= 0) {
 return 0;
 }
-ret = bdrv_get_block_status(bs, offset >> BDRV_SECTOR_BITS,
-bytes >> BDRV_SECTOR_BITS, , NULL);
+ret = bdrv_block_status(bs, offset, bytes, , NULL);
 if (ret < 0) {
 error_report("error getting block status at offset %" PRId64 ": 
%s",
  offset, strerror(-ret));
 return ret;
 }
 if (ret & BDRV_BLOCK_ZERO) {
-offset += n * BDRV_SECTOR_BITS;
+offset += bytes;
 continue;
 }
-ret = bdrv_pwrite_zeroes(child, offset, n * BDRV_SECTOR_SIZE, flags);
+ret = bdrv_pwrite_zeroes(child, offset, bytes, flags);
 if (ret < 0) {
 error_report("error writing zeroes at offset %" PRId64 ": %s",
  offset, strerror(-ret));
 return ret;
 }
-offset += n * BDRV_SECTOR_SIZE;
+offset += bytes;
 }
 }

@@ -1983,13 +1981,22 @@ int64_t bdrv_get_block_status_above(BlockDriverState 
*bs,
   nb_sectors, pnum, file);
 }

-int64_t

[Qemu-block] [PATCH v4 03/23] block: Make bdrv_round_to_clusters() signature more useful

2017-09-13 Thread Eric Blake

In the process of converting sector-based interfaces to bytes,
I'm finding it easier to represent a byte count as a 64-bit
integer at the block layer (even if we are internally capped
by SIZE_MAX or even INT_MAX for individual transactions, it's
still nicer to not have to worry about truncation/overflow
issues on as many variables).  Update the signature of
bdrv_round_to_clusters() to uniformly use int64_t, matching
the signature already chosen for bdrv_is_allocated and the
fact that off_t is also a signed type, then adjust clients
according to the required fallout.

Signed-off-by: Eric Blake 
Reviewed-by: Fam Zheng 

---
v4: only context changes
v3: no change
v2: fix commit message [John], rebase to earlier changes, including
mirror_clip_bytes() signature update
---
 include/block/block.h | 4 ++--
 block/io.c| 7 ---
 block/mirror.c| 7 +++
 block/trace-events| 2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 2ad18775af..bb3b95d491 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -475,9 +475,9 @@ int bdrv_get_flags(BlockDriverState *bs);
 int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi);
 ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs);
 void bdrv_round_to_clusters(BlockDriverState *bs,
-int64_t offset, unsigned int bytes,
+int64_t offset, int64_t bytes,
 int64_t *cluster_offset,
-unsigned int *cluster_bytes);
+int64_t *cluster_bytes);

 const char *bdrv_get_encrypted_filename(BlockDriverState *bs);
 void bdrv_get_backing_filename(BlockDriverState *bs,
diff --git a/block/io.c b/block/io.c
index 6509c804d4..b362b46e3d 100644
--- a/block/io.c
+++ b/block/io.c
@@ -446,9 +446,9 @@ static void mark_request_serialising(BdrvTrackedRequest 
*req, uint64_t align)
  * Round a region to cluster boundaries
  */
 void bdrv_round_to_clusters(BlockDriverState *bs,
-int64_t offset, unsigned int bytes,
+int64_t offset, int64_t bytes,
 int64_t *cluster_offset,
-unsigned int *cluster_bytes)
+int64_t *cluster_bytes)
 {
 BlockDriverInfo bdi;

@@ -946,7 +946,7 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild 
*child,
 struct iovec iov;
 QEMUIOVector bounce_qiov;
 int64_t cluster_offset;
-unsigned int cluster_bytes;
+int64_t cluster_bytes;
 size_t skip_bytes;
 int ret;

@@ -967,6 +967,7 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild 
*child,
 trace_bdrv_co_do_copy_on_readv(bs, offset, bytes,
cluster_offset, cluster_bytes);

+assert(cluster_bytes < SIZE_MAX);
 iov.iov_len = cluster_bytes;
 iov.iov_base = bounce_buffer = qemu_try_blockalign(bs, iov.iov_len);
 if (bounce_buffer == NULL) {
diff --git a/block/mirror.c b/block/mirror.c
index 032cfe91fa..67f45cec4e 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -190,10 +190,9 @@ static int mirror_cow_align(MirrorBlockJob *s, int64_t 
*offset,
 bool need_cow;
 int ret = 0;
 int64_t align_offset = *offset;
-unsigned int align_bytes = *bytes;
+int64_t align_bytes = *bytes;
 int max_bytes = s->granularity * s->max_iov;

-assert(*bytes < INT_MAX);
 need_cow = !test_bit(*offset / s->granularity, s->cow_bitmap);
 need_cow |= !test_bit((*offset + *bytes - 1) / s->granularity,
   s->cow_bitmap);
@@ -388,7 +387,7 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
 while (nb_chunks > 0 && offset < s->bdev_length) {
 int64_t ret;
 int io_sectors;
-unsigned int io_bytes;
+int64_t io_bytes;
 int64_t io_bytes_acct;
 enum MirrorMethod {
 MIRROR_METHOD_COPY,
@@ -413,7 +412,7 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
 io_bytes = s->granularity;
 } else if (ret >= 0 && !(ret & BDRV_BLOCK_DATA)) {
 int64_t target_offset;
-unsigned int target_bytes;
+int64_t target_bytes;
 bdrv_round_to_clusters(blk_bs(s->target), offset, io_bytes,
_offset, _bytes);
 if (target_offset == offset &&
diff --git a/block/trace-events b/block/trace-events
index 25dd5a3026..4c6586f156 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -12,7 +12,7 @@ blk_co_pwritev(void *blk, void *bs, int64_t offset, unsigned 
int bytes, int flag
 bdrv_co_preadv(void *bs, int64_t offset, int64_t nbytes, unsigned int flags) 
"bs %p offset %"PRId64" nbytes %"PRId64" flags 0x%x"
 bdrv_co_pwritev(void *bs, int64_t offset, int64_t nbytes, unsigned int flags) 
"bs %p offset %"PRId64"

[Qemu-block] [PATCH v4 00/23] make bdrv_get_block_status byte-based

2017-09-13 Thread Eric Blake

There are patches floating around to add NBD_CMD_BLOCK_STATUS,
but NBD wants to report status on byte granularity (even if the
reporting will probably be naturally aligned to sectors or even
much higher levels).  I've therefore started the task of
converting our block status code to report at a byte granularity
rather than sectors.

Now that 2.11 is open, I'm rebasing/reposting the remaining patches.

The overall conversion currently looks like:
part 1: bdrv_is_allocated (merged, commit 51b0a488)
part 2: dirty-bitmap (v7 is posted [1], mostly reviewed)
part 3: bdrv_get_block_status (this series, v3 at [2])
part 4: .bdrv_co_block_status (v2 is posted [3], but needs a rebase)

Available as a tag at:
git fetch git://repo.or.cz/qemu/ericb.git nbd-byte-status-v4

Based-on: <20170912203119.24166-1-ebl...@redhat.com>
([PATCH v7 00/20] make dirty-bitmap byte-based)

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg03160.html
[2] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg03853.html
[3] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg04370.html

Since v3:
- Minor rebasing

001/23:[] [-C] 'block: Allow NULL file for bdrv_get_block_status()'
002/23:[] [--] 'block: Add flag to avoid wasted work in bdrv_is_allocated()'
003/23:[] [-C] 'block: Make bdrv_round_to_clusters() signature more useful'
004/23:[] [--] 'qcow2: Switch is_zero_sectors() to byte-based'
005/23:[] [--] 'block: Switch bdrv_make_zero() to byte-based'
006/23:[] [--] 'qemu-img: Switch get_block_status() to byte-based'
007/23:[] [--] 'block: Convert bdrv_get_block_status() to bytes'
008/23:[] [--] 'block: Switch bdrv_co_get_block_status() to byte-based'
009/23:[] [--] 'block: Switch BdrvCoGetBlockStatusData to byte-based'
010/23:[] [--] 'block: Switch bdrv_common_block_status_above() to 
byte-based'
011/23:[] [--] 'block: Switch bdrv_co_get_block_status_above() to 
byte-based'
012/23:[0002] [FC] 'block: Convert bdrv_get_block_status_above() to bytes'
013/23:[] [--] 'qemu-img: Simplify logic in img_compare()'
014/23:[] [--] 'qemu-img: Speed up compare on pre-allocated larger file'
015/23:[] [--] 'qemu-img: Add find_nonzero()'
016/23:[] [--] 'qemu-img: Drop redundant error message in compare'
017/23:[] [--] 'qemu-img: Change check_empty_sectors() to byte-based'
018/23:[] [--] 'qemu-img: Change compare_sectors() to be byte-based'
019/23:[] [--] 'qemu-img: Change img_rebase() to be byte-based'
020/23:[] [--] 'qemu-img: Change img_compare() to be byte-based'
021/23:[] [--] 'block: Align block status requests'
022/23:[] [--] 'block: Relax bdrv_aligned_preadv() assertion'
023/23:[] [--] 'qemu-io: Relax 'alloc' now that block-status doesn't assert'
Eric Blake (23):
  block: Allow NULL file for bdrv_get_block_status()
  block: Add flag to avoid wasted work in bdrv_is_allocated()
  block: Make bdrv_round_to_clusters() signature more useful
  qcow2: Switch is_zero_sectors() to byte-based
  block: Switch bdrv_make_zero() to byte-based
  qemu-img: Switch get_block_status() to byte-based
  block: Convert bdrv_get_block_status() to bytes
  block: Switch bdrv_co_get_block_status() to byte-based
  block: Switch BdrvCoGetBlockStatusData to byte-based
  block: Switch bdrv_common_block_status_above() to byte-based
  block: Switch bdrv_co_get_block_status_above() to byte-based
  block: Convert bdrv_get_block_status_above() to bytes
  qemu-img: Simplify logic in img_compare()
  qemu-img: Speed up compare on pre-allocated larger file
  qemu-img: Add find_nonzero()
  qemu-img: Drop redundant error message in compare
  qemu-img: Change check_empty_sectors() to byte-based
  qemu-img: Change compare_sectors() to be byte-based
  qemu-img: Change img_rebase() to be byte-based
  qemu-img: Change img_compare() to be byte-based
  block: Align block status requests
  block: Relax bdrv_aligned_preadv() assertion
  qemu-io: Relax 'alloc' now that block-status doesn't assert

 include/block/block.h  |  26 ++--
 include/block/block_int.h  |  11 +-
 block/io.c | 287 ---
 block/blkdebug.c   |  13 +-
 block/mirror.c |  24 +--
 block/qcow2-cluster.c  |   2 +-
 block/qcow2.c  |  53 +++
 qemu-img.c | 365 -
 qemu-io-cmds.c |  13 --
 block/trace-events |   2 +-
 tests/qemu-iotests/074.out |   2 -
 tests/qemu-iotests/177 |  12 +-
 tests/qemu-iotests/177.out |  19 ++-
 13 files changed, 420 insertions(+), 409 deletions(-)

-- 
2.13.5

Re: [Qemu-block] [PATCH v2 1/3] iotests: use -ccw on s390x for 040, 139, and 182

2017-09-13 Thread David Hildenbrand

On 13.09.2017 11:10, Cornelia Huck wrote:
> The default cpu model on s390x does not provide zPCI, which is
> not yet wired up on tcg. Moreover, virtio-ccw is the standard
> on s390x, so use the -ccw instead of the -pci versions of virtio
> devices on s390x.
> 
> Reviewed-by: Kevin Wolf 
> Signed-off-by: Cornelia Huck 
> ---
>  tests/qemu-iotests/040 |  6 +-
>  tests/qemu-iotests/139 | 12 ++--
>  tests/qemu-iotests/182 | 13 +++--
>  3 files changed, 26 insertions(+), 5 deletions(-)
> 
> diff --git a/tests/qemu-iotests/040 b/tests/qemu-iotests/040
> index 95b7510571..c284d08796 100755
> --- a/tests/qemu-iotests/040
> +++ b/tests/qemu-iotests/040
> @@ -82,7 +82,11 @@ class TestSingleDrive(ImageCommitTestCase):
>  qemu_io('-f', 'raw', '-c', 'write -P 0xab 0 524288', backing_img)
>  qemu_io('-f', iotests.imgfmt, '-c', 'write -P 0xef 524288 524288', 
> mid_img)
>  self.vm = iotests.VM().add_drive(test_img, 
> "node-name=top,backing.node-name=mid,backing.backing.node-name=base", 
> interface="none")
> -self.vm.add_device("virtio-scsi-pci")
> +if iotests.qemu_default_machine == 's390-ccw-virtio':
> +self.vm.add_device("virtio-scsi-ccw")
> +else:
> +self.vm.add_device("virtio-scsi-pci")
> +
>  self.vm.add_device("scsi-hd,id=scsi0,drive=drive0")
>  self.vm.launch()
>  
> diff --git a/tests/qemu-iotests/139 b/tests/qemu-iotests/139
> index 50cf40fbd5..f8f02808a9 100644
> --- a/tests/qemu-iotests/139
> +++ b/tests/qemu-iotests/139
> @@ -25,13 +25,21 @@ import time
>  
>  base_img = os.path.join(iotests.test_dir, 'base.img')
>  new_img = os.path.join(iotests.test_dir, 'new.img')
> +if iotests.qemu_default_machine == 's390-ccw-virtio':
> +default_virtio_blk = 'virtio-blk-ccw'
> +else:
> +default_virtio_blk = 'virtio-blk-pci'

simply "virtio_blk" ?

>  
>  class TestBlockdevDel(iotests.QMPTestCase):
>  
>  def setUp(self):
>  iotests.qemu_img('create', '-f', iotests.imgfmt, base_img, '1M')
>  self.vm = iotests.VM()
> -self.vm.add_device("virtio-scsi-pci,id=virtio-scsi")
> +if iotests.qemu_default_machine == 's390-ccw-virtio':
> +self.vm.add_device("virtio-scsi-ccw,id=virtio-scsi")
> +else:
> +self.vm.add_device("virtio-scsi-pci,id=virtio-scsi")
> +
>  self.vm.launch()
>  
>  def tearDown(self):
> @@ -87,7 +95,7 @@ class TestBlockdevDel(iotests.QMPTestCase):
>  self.checkBlockDriverState(node, expect_error)
>  
>  # Add a device model
> -def addDeviceModel(self, device, backend, driver = 'virtio-blk-pci'):
> +def addDeviceModel(self, device, backend, driver = default_virtio_blk):
>  result = self.vm.qmp('device_add', id = device,
>   driver = driver, drive = backend)
>  self.assert_qmp(result, 'return', {})
> diff --git a/tests/qemu-iotests/182 b/tests/qemu-iotests/182
> index 7ecbb22604..2e078ceed8 100755
> --- a/tests/qemu-iotests/182
> +++ b/tests/qemu-iotests/182
> @@ -45,17 +45,26 @@ _supported_os Linux
>  
>  size=32M
>  
> +case "$QEMU_DEFAULT_MACHINE" in
> +  s390-ccw-virtio)
> +  virtioblk=virtio-blk-ccw
> +  ;;
> +  *)
> +  virtioblk=virtio-blk-pci

s/virtioblk/virtio_blk/ ?

(due to "virtio_scsi" in next patch)

> +  ;;
> +esac
> +
>  _make_test_img $size
>  
>  echo "Starting QEMU"
>  _launch_qemu -drive file=$TEST_IMG,if=none,id=drive0,file.locking=on \
> --device virtio-blk-pci,drive=drive0
> +-device $virtioblk,drive=drive0
>  
>  echo
>  echo "Starting a second QEMU using the same image should fail"
>  echo 'quit' | $QEMU -monitor stdio \
>  -drive file=$TEST_IMG,if=none,id=drive0,file.locking=on \
> --device virtio-blk-pci,drive=drive0 2>&1 | _filter_testdir 2>&1 |
> +-device $virtioblk,drive=drive0 2>&1 | _filter_testdir 2>&1 |
>  _filter_qemu |
>  sed -e '/falling back to POSIX file/d' \
>  -e '/locks can be lost unexpectedly/d'
> 


-- 

Thanks,

David

Re: [Qemu-block] [PATCH v2 3/3] iotests: use virtio aliases for 067

2017-09-13 Thread David Hildenbrand

On 13.09.2017 11:10, Cornelia Huck wrote:
> The default cpu model on s390x does not provide zPCI, which is
> not yet wired up on tcg. Moreover, virtio-ccw is the standard
> on s390x.
> 
> Using virtio-scsi will implicitly pick the right device, so just
> switch to that for simplicity.
> 
> Signed-off-by: Cornelia Huck 
> ---
>  tests/qemu-iotests/067 | 3 ++-
>  tests/qemu-iotests/067.out | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/qemu-iotests/067 b/tests/qemu-iotests/067
> index 5d4ca4bc61..cbb3da286a 100755
> --- a/tests/qemu-iotests/067
> +++ b/tests/qemu-iotests/067
> @@ -141,7 +141,7 @@ echo
>  echo === Empty drive with -device and device_del ===
>  echo
>  
> -run_qemu -device virtio-scsi-pci -device scsi-cd,id=cd0 < +run_qemu -device virtio-scsi -device scsi-cd,id=cd0 <  { "execute": "qmp_capabilities" }
>  { "execute": "query-block" }
>  { "execute": "device_del", "arguments": { "id": "cd0" } }
> @@ -150,6 +150,7 @@ run_qemu -device virtio-scsi-pci -device scsi-cd,id=cd0 
> <  { "execute": "quit" }
>  EOF
>  
> +
>  # success, all done
>  echo "*** done"
>  rm -f $seq.full
> diff --git a/tests/qemu-iotests/067.out b/tests/qemu-iotests/067.out
> index bd70557ddc..58e83c4505 100644
> --- a/tests/qemu-iotests/067.out
> +++ b/tests/qemu-iotests/067.out
> @@ -419,7 +419,7 @@ Testing:
>  
>  === Empty drive with -device and device_del ===
>  
> -Testing: -device virtio-scsi-pci -device scsi-cd,id=cd0
> +Testing: -device virtio-scsi -device scsi-cd,id=cd0
>  {
>  QMP_VERSION
>  }
> 

Certainly not wrong to use the old alias in some tests, as long as we
test both variants.

Reviewed-by: David Hildenbrand 

-- 

Thanks,

David

Re: [Qemu-block] [PATCH v2 2/3] iotests: use -ccw on s390x for 051

2017-09-13 Thread David Hildenbrand

On 13.09.2017 11:10, Cornelia Huck wrote:
> The default cpu model on s390x does not provide zPCI, which is
> not yet wired up on tcg. Moreover, virtio-ccw is the standard
> on s390x, so use the -ccw instead of the -pci versions of virtio
> devices on s390x.
> 
> Signed-off-by: Cornelia Huck 
> ---
>  tests/qemu-iotests/051| 12 +++-
>  tests/qemu-iotests/051.out|  2 +-
>  tests/qemu-iotests/051.pc.out |  2 +-
>  3 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/tests/qemu-iotests/051 b/tests/qemu-iotests/051
> index c8cfc764bc..dba8816c9f 100755
> --- a/tests/qemu-iotests/051
> +++ b/tests/qemu-iotests/051
> @@ -103,7 +103,17 @@ echo
>  echo === Device without drive ===
>  echo
>  
> -run_qemu -device virtio-scsi-pci -device scsi-hd
> +case "$QEMU_DEFAULT_MACHINE" in
> +  s390-ccw-virtio)
> +  virtio_scsi=virtio-scsi-ccw
> +  ;;
> +  *)
> +  virtio_scsi=virtio-scsi-pci
> +  ;;
> +esac
> +
> +run_qemu -device $virtio_scsi -device scsi-hd |
> +sed -e "s/$virtio_scsi/VIRTIO_SCSI/"
>  
>  echo
>  echo === Overriding backing file ===
> diff --git a/tests/qemu-iotests/051.out b/tests/qemu-iotests/051.out
> index 4d3b1ff316..e3c6eaba57 100644
> --- a/tests/qemu-iotests/051.out
> +++ b/tests/qemu-iotests/051.out
> @@ -49,7 +49,7 @@ QEMU_PROG: -drive 
> file=TEST_DIR/t.qcow2,driver=qcow2,format=qcow2: Cannot specif
>  
>  === Device without drive ===
>  
> -Testing: -device virtio-scsi-pci -device scsi-hd
> +Testing: -device VIRTIO_SCSI -device scsi-hd
>  QEMU X.Y.Z monitor - type 'help' for more information
>  (qemu) QEMU_PROG: -device scsi-hd: drive property not set
>  
> diff --git a/tests/qemu-iotests/051.pc.out b/tests/qemu-iotests/051.pc.out
> index 76d7205460..ae7801b44b 100644
> --- a/tests/qemu-iotests/051.pc.out
> +++ b/tests/qemu-iotests/051.pc.out
> @@ -49,7 +49,7 @@ QEMU_PROG: -drive 
> file=TEST_DIR/t.qcow2,driver=qcow2,format=qcow2: Cannot specif
>  
>  === Device without drive ===
>  
> -Testing: -device virtio-scsi-pci -device scsi-hd
> +Testing: -device VIRTIO_SCSI -device scsi-hd
>  QEMU X.Y.Z monitor - type 'help' for more information
>  (qemu) QEMU_PROG: -device scsi-hd: drive property not set
>  
> 

Reviewed-by: David Hildenbrand 

-- 

Thanks,

David

Re: [Qemu-block] [Qemu-devel] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM

2017-09-13 Thread Kevin Wolf

Am 13.09.2017 um 15:32 hat Darren Kenny geschrieben:
> Hi Kevin,
> 
> Thanks for getting back to me so quickly.
> 
> Kevin Wolf wrote:
> > Am 13.09.2017 um 14:00 hat Darren Kenny geschrieben:
> > > [Cross-posted from qemu-devel, meant to send here first]
> > 
> > Just keep both lists in the CC for the same email.
> 
> Will do.
> > There is an issue here, which is that you are accessing the image at the
> > same time from two separate processes. qemu is using writeback caches in
> > order to improve performance, so only after the guest has issued a flush
> > command to its disk or after you shut down or at least pause qemu, the
> > changes are fully written to the image file. In qemu 2.10, you would
> > probably see this instead: $ qemu-img check ./test.qcow2 qemu-img: Could
> > not open './test.qcow2': Failed to get shared "write" lock Is another
> > process using the image? This lock can be overridden, but at least it
> > shows clearly that you are doing something that you probably shouldn't
> > be doing.
> 
> Hmm, I've just updated to the HEAD of the Git repo, and I didn't see this
> locking behaviour, it still did the same thing as before.
> 
> Does the disk need to be formatted/mounted before it's seen as locked?
> Or even a configure option?
> 
> The version that have is:
> 
> $ qemu-img --version
> qemu-img version 2.10.50 (v2.10.0-476-g04ef330)
> Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
> 
> $ qemu-system-x86_64 --version
> QEMU emulator version 2.10.50 (v2.10.0-476-g04ef330)
> Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
> 
> The last commit I have is (as in the version string):
> 
> 04ef330 tcg/tci: do not use ldst label (never implemented)

This should have the locking code. It only works with relatively new
Linux kernels, though (it needs F_OFD_SETLK support). If you don't have
that, no locking is used even in qemu 2.10.

You could try enforcing some locking by adding file.locking=on to your
-drive option. If you're running an old kernel, this should print a
warning message (and use some less safe locking variant).

> > Doing a flush here wouldn't be wrong, but it's also unnecessary and
> > would slow down the operation a bit.
> Sure, but how often does a resize/truncate get done? Would seem like a
> small impact to do it - but I agree w.r.t. the single-process access
> as a better solution.

The thing is, truncate isn't the only operation that will lead to
qemu-img check reporting failure. Any cluster allocation in the image
can cause the same symptom, and there it is actually very important for
performance that we use the cache and do a batched write only later.

So changing truncate so that this specific operation looks as if
accessing the image from a second process were okay wouldn't actually
make a big difference for the overall state. Maybe it's in fact better
to have such attempts fail consistently.

Kevin

Re: [Qemu-block] [Qemu-devel] [PATCH] throttle: Assert that bkt->max is valid in throttle_compute_wait()

2017-09-13 Thread Eric Blake

On 09/13/2017 03:28 AM, Alberto Garcia wrote:
> If bkt->max == 0 and bkt->burst_length > 1 then we could have a
> division by 0 in throttle_do_compute_wait(). That configuration is
> however not permitted and is already detected by throttle_is_valid(),
> but let's assert it in throttle_compute_wait() to make it explicit.
> 
> Found by Coverity (CID: 1381016).
> 
> Signed-off-by: Alberto Garcia 
> ---
>  util/throttle.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Eric Blake 

> 
> diff --git a/util/throttle.c b/util/throttle.c
> index 06bf916adc..b38e742da5 100644
> --- a/util/throttle.c
> +++ b/util/throttle.c
> @@ -124,6 +124,7 @@ int64_t throttle_compute_wait(LeakyBucket *bkt)
>  /* If the main bucket is not full yet we still have to check the
>   * burst bucket in order to enforce the burst limit */
>  if (bkt->burst_length > 1) {
> +assert(bkt->max > 0); /* see throttle_is_valid() */
>  extra = bkt->burst_level - burst_bucket_size;
>  if (extra > 0) {
>  return throttle_do_compute_wait(bkt->max, extra);
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [Qemu-devel] [PATCH] block: Clean up some bad code in the vvfat driver

2017-09-13 Thread Eric Blake

On 09/13/2017 05:21 AM, Thomas Huth wrote:
> Remove the unnecessary home-grown redefinition of the assert() macro here,
> and remove the unusable debug code at the end of the checkpoint() function.
> The code there uses assert() with side-effects (assignment to the "mapping"
> variable), which should be avoided. Looking more closely, it seems as it is
> apparently also only usable for one certain directory layout (with a file
> named USB.H in it) and thus is of no use for the rest of the world.
> 
> Signed-off-by: Thomas Huth 
> ---
>  block/vvfat.c | 26 ++
>  1 file changed, 2 insertions(+), 24 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [Qemu-devel] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM

2017-09-13 Thread Darren Kenny


Hi Kevin,

Thanks for getting back to me so quickly.

Kevin Wolf wrote:

Am 13.09.2017 um 14:00 hat Darren Kenny geschrieben:

[Cross-posted from qemu-devel, meant to send here first]


Just keep both lists in the CC for the same email.


Will do.
There is an issue here, which is that you are accessing the image at 
the same time from two separate processes. qemu is using writeback 
caches in order to improve performance, so only after the guest has 
issued a flush command to its disk or after you shut down or at least 
pause qemu, the changes are fully written to the image file. In qemu 
2.10, you would probably see this instead: $ qemu-img check 
./test.qcow2 qemu-img: Could not open './test.qcow2': Failed to get 
shared "write" lock Is another process using the image? This lock can 
be overridden, but at least it shows clearly that you are doing 
something that you probably shouldn't be doing.


Hmm, I've just updated to the HEAD of the Git repo, and I didn't see this
locking behaviour, it still did the same thing as before.

Does the disk need to be formatted/mounted before it's seen as locked?
Or even a configure option?

The version that have is:

$ qemu-img --version
qemu-img version 2.10.50 (v2.10.0-476-g04ef330)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

$ qemu-system-x86_64 --version
QEMU emulator version 2.10.50 (v2.10.0-476-g04ef330)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

The last commit I have is (as in the version string):

04ef330 tcg/tci: do not use ldst label (never implemented)


What I observed, then was that if I powered down the VM, or even just
quit the VM, that the subsequent check of the disk would say that
everything was just fine, and there no longer were leaked clusters.


Yes, quitting the VM writes out all of the cached data, so the on-disk
state and the in-memory state become fully consistent again.


OK


In looking at the code in qcow2_truncate() it would appear that in the case
where prealloc has the value PREALLOC_MODE_OFF, that we don't flush the
metadata to disk - which seems to be the case here.

If I ignore the if test, and always execute the block in block/qcow2.c,
lines 3250 to 3258:

   if (prealloc != PREALLOC_MODE_OFF) {
   /* Flush metadata before actually changing the image size */
   ret = bdrv_flush(bs);
   if (ret<  0) {
   error_setg_errno(errp, -ret,
"Failed to flush the preallocated area to disk");
   return ret;
   }
   }

causing the flush to always be done, then the check will succeed when
the VM is still running.

While I know that this resolves the issue, I can only imagine that
there was some reason that this check for !PREALLOC_MODE_OFF was being
done in the first place.

So, I'm hoping that someone here might be able to explain to me why
that check is needed, but also why it might be wrong to do the flush
regardless of the value of prealloc here.


Doing a flush here wouldn't be wrong, but it's also unnecessary and
would slow down the operation a bit.

Sure, but how often does a resize/truncate get done? Would seem like a
small impact to do it - but I agree w.r.t. the single-process access
as a better solution.

For the preallocation case, the goal that is achieved with the flush is
that the header update with the new image size is only written if the
preallocation succeeded, so that in error cases you don't end up with an
image that seems to be resized, but isn't actually preallocated as
requested.

Understood, thanks.

If it is wrong to do that flush here, then would anyone have
suggestions as to an alternative solution to this issue?


Don't call 'qemu-img check' (or basically any other operation on the
image) while a VM is using the image.

Heh, sure - if the locking was working as you describe then I think
I would be fine with that :)

Thanks,

Darren.

Re: [Qemu-block] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM

2017-09-13 Thread Kevin Wolf

Am 13.09.2017 um 14:00 hat Darren Kenny geschrieben:
> [Cross-posted from qemu-devel, meant to send here first]

Just keep both lists in the CC for the same email.

> Hi,
> 
> It was observed during some testing of Qemu 2.9 that it appeared that if you
> resized a qcow2 block device while the VM is running, that an qemu-img check
> would report that there were leaked clusters.
> 
> The steps to reproduce are:
> 
> - First create the test image:
>   [...]
> 
> - Now run a VM based here on Oracle Linux 7, but the disto really isn't
>   important here, since the test disk is not even mounted in the VM at this
>   point in time:
>   [...]
> 
> - Resize the img size to 15g from qemu monitor (on stdio after above
> command)
> 
> QEMU 2.5.0 monitor - type 'help' for more information
> (qemu) block_resize drive_test 15360
> 
> - Now, in a separate terminal, while leaving the VM running, check the img
>   again from host side:
> 
> # qemu-img check ./test.qcow2
> Leaked cluster 3 refcount=1 reference=0
> 
> 1 leaked clusters were found on the image.
> This means waste of disk space, but no harm to data.
> Image end offset: 327680
> 
> As it suggests above, this is not really corruption, but it is a bit
> misleading, and could make people think there is an issue here
> (hence the reason I've been asked to find a fix).

There is an issue here, which is that you are accessing the image at the
same time from two separate processes. qemu is using writeback caches in
order to improve performance, so only after the guest has issued a flush
command to its disk or after you shut down or at least pause qemu, the
changes are fully written to the image file.

In qemu 2.10, you would probably see this instead:

$ qemu-img check ./test.qcow2
qemu-img: Could not open './test.qcow2': Failed to get shared "write" lock
Is another process using the image?

This lock can be overridden, but at least it shows clearly that you are
doing something that you probably shouldn't be doing.

> What I observed, then was that if I powered down the VM, or even just
> quit the VM, that the subsequent check of the disk would say that
> everything was just fine, and there no longer were leaked clusters.

Yes, quitting the VM writes out all of the cached data, so the on-disk
state and the in-memory state become fully consistent again.

> In looking at the code in qcow2_truncate() it would appear that in the case
> where prealloc has the value PREALLOC_MODE_OFF, that we don't flush the
> metadata to disk - which seems to be the case here.
> 
> If I ignore the if test, and always execute the block in block/qcow2.c,
> lines 3250 to 3258:
> 
>   if (prealloc != PREALLOC_MODE_OFF) {
>   /* Flush metadata before actually changing the image size */
>   ret = bdrv_flush(bs);
>   if (ret < 0) {
>   error_setg_errno(errp, -ret,
>"Failed to flush the preallocated area to disk");
>   return ret;
>   }
>   }
> 
> causing the flush to always be done, then the check will succeed when
> the VM is still running.
> 
> While I know that this resolves the issue, I can only imagine that
> there was some reason that this check for !PREALLOC_MODE_OFF was being
> done in the first place.
> 
> So, I'm hoping that someone here might be able to explain to me why
> that check is needed, but also why it might be wrong to do the flush
> regardless of the value of prealloc here.

Doing a flush here wouldn't be wrong, but it's also unnecessary and
would slow down the operation a bit.

For the preallocation case, the goal that is achieved with the flush is
that the header update with the new image size is only written if the
preallocation succeeded, so that in error cases you don't end up with an
image that seems to be resized, but isn't actually preallocated as
requested.

> If it is wrong to do that flush here, then would anyone have
> suggestions as to an alternative solution to this issue?

Don't call 'qemu-img check' (or basically any other operation on the
image) while a VM is using the image.

Kevin

[Qemu-block] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM

2017-09-13 Thread Darren Kenny


[Cross-posted from qemu-devel, meant to send here first]

Hi,

It was observed during some testing of Qemu 2.9 that it appeared that if you 

resized a qcow2 block device while the VM is running, that an qemu-img check 


would report that there were leaked clusters.

The steps to reproduce are:

- First create the test image:

# /usr/bin/qemu-img create -f qcow2 test.qcow2 10G
Formatting 'test.qcow2', fmt=qcow2 size=10737418240 encryption=off
cluster_size=65536 lazy_refcounts=off refcount_bits=16

# qemu-img check test.qcow2
No errors were found on the image.

- Now run a VM based here on Oracle Linux 7, but the disto really isn't
  important here, since the test disk is not even mounted in the VM at this
  point in time:

# /usr/bin/qemu-kvm \
-name 'test-vm' \
-monitor stdio  \
-drive 
id=drive_image1,if=none,snapshot=on,format=qcow2,file=./ol73-64.qcow2 \
-device 
virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=0x4 \ 


-drive id=drive_test,if=none,format=qcow2,file=./stg.qcow2 \
-device 
virtio-blk-pci,id=stg,drive=drive_test,bootindex=1,serial=TARGET_DISK0,bus=pci.0,addr=0x5 
\
-net bridge,br=br1 -net 
nic,model=virtio,macaddr=52:54:00:90:91:92 \

-m 4096  \
-smp 2,maxcpus=2,cores=1,threads=1,sockets=2  \
-vnc :0

- Resize the img size to 15g from qemu monitor (on stdio after above 
command)


QEMU 2.5.0 monitor - type 'help' for more information
(qemu) block_resize drive_test 15360

- Now, in a separate terminal, while leaving the VM running, check the img
  again from host side:

# qemu-img check ./test.qcow2
Leaked cluster 3 refcount=1 reference=0

1 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.
Image end offset: 327680

As it suggests above, this is not really corruption, but it is a bit
misleading, and could make people think there is an issue here
(hence the reason I've been asked to find a fix).

What I observed, then was that if I powered down the VM, or even just 
quit the
VM, that the subsequent check of the disk would say that everything was just 


fine, and there no longer were leaked clusters.

In looking at the code in qcow2_truncate() it would appear that in the case
where prealloc has the value PREALLOC_MODE_OFF, that we don't flush the
metadata to disk - which seems to be the case here.

If I ignore the if test, and always execute the block in block/qcow2.c,
lines 3250 to 3258:

  if (prealloc != PREALLOC_MODE_OFF) {
  /* Flush metadata before actually changing the image size */
  ret = bdrv_flush(bs);
  if (ret < 0) {
  error_setg_errno(errp, -ret,
   "Failed to flush the preallocated area to 
disk");

  return ret;
  }
  }

causing the flush to always be done, then the check will succeed when the VM 


is still running.

While I know that this resolves the issue, I can only imagine that there was 


some reason that this check for !PREALLOC_MODE_OFF was being done in the
first place.

So, I'm hoping that someone here might be able to explain to me why that 
check

is needed, but also why it might be wrong to do the flush regardless of the
value of prealloc here.

If it is wrong to do that flush here, then would anyone have suggestions 
as to

an alternative solution to this issue?

Thanks,

Darren.

[Qemu-block] [PATCH] block: Clean up some bad code in the vvfat driver

2017-09-13 Thread Thomas Huth

Remove the unnecessary home-grown redefinition of the assert() macro here,
and remove the unusable debug code at the end of the checkpoint() function.
The code there uses assert() with side-effects (assignment to the "mapping"
variable), which should be avoided. Looking more closely, it seems as it is
apparently also only usable for one certain directory layout (with a file
named USB.H in it) and thus is of no use for the rest of the world.

Signed-off-by: Thomas Huth 
---
 block/vvfat.c | 26 ++
 1 file changed, 2 insertions(+), 24 deletions(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index c54fa94..777a8cd 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -56,15 +56,6 @@
 
 static void checkpoint(void);
 
-#ifdef __MINGW32__
-void nonono(const char* file, int line, const char* msg) {
-fprintf(stderr, "Nonono! %s:%d %s\n", file, line, msg);
-exit(-5);
-}
-#undef assert
-#define assert(a) do {if (!(a)) nonono(__FILE__, __LINE__, #a);}while(0)
-#endif
-
 #else
 
 #define DLOG(a)
@@ -3269,24 +3260,11 @@ static void bdrv_vvfat_init(void)
 block_init(bdrv_vvfat_init);
 
 #ifdef DEBUG
-static void checkpoint(void) {
+static void checkpoint(void)
+{
 assert(((mapping_t*)array_get(&(vvv->mapping), 0))->end == 2);
 check1(vvv);
 check2(vvv);
 assert(!vvv->current_mapping || vvv->current_fd || 
(vvv->current_mapping->mode & MODE_DIRECTORY));
-#if 0
-if (((direntry_t*)vvv->directory.pointer)[1].attributes != 0xf)
-fprintf(stderr, "Nonono!\n");
-mapping_t* mapping;
-direntry_t* direntry;
-assert(vvv->mapping.size >= vvv->mapping.item_size * vvv->mapping.next);
-assert(vvv->directory.size >= vvv->directory.item_size * 
vvv->directory.next);
-if (vvv->mapping.next<47)
-return;
-assert((mapping = array_get(&(vvv->mapping), 47)));
-assert(mapping->dir_index < vvv->directory.next);
-direntry = array_get(&(vvv->directory), mapping->dir_index);
-assert(!memcmp(direntry->name, "USB H  ", 11) || direntry->name[0]==0);
-#endif
 }
 #endif
-- 
1.8.3.1

[Qemu-block] [PATCH v2 3/3] iotests: use virtio aliases for 067

2017-09-13 Thread Cornelia Huck

The default cpu model on s390x does not provide zPCI, which is
not yet wired up on tcg. Moreover, virtio-ccw is the standard
on s390x.

Using virtio-scsi will implicitly pick the right device, so just
switch to that for simplicity.

Signed-off-by: Cornelia Huck 
---
 tests/qemu-iotests/067 | 3 ++-
 tests/qemu-iotests/067.out | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/067 b/tests/qemu-iotests/067
index 5d4ca4bc61..cbb3da286a 100755
--- a/tests/qemu-iotests/067
+++ b/tests/qemu-iotests/067
@@ -141,7 +141,7 @@ echo
 echo === Empty drive with -device and device_del ===
 echo
 
-run_qemu -device virtio-scsi-pci -device scsi-cd,id=cd0 <

Re: [Qemu-block] [PATCH v2 2/3] iotests: use -ccw on s390x for 051

2017-09-13 Thread Thomas Huth

On 13.09.2017 11:10, Cornelia Huck wrote:
> The default cpu model on s390x does not provide zPCI, which is
> not yet wired up on tcg. Moreover, virtio-ccw is the standard
> on s390x, so use the -ccw instead of the -pci versions of virtio
> devices on s390x.
> 
> Signed-off-by: Cornelia Huck 
> ---
>  tests/qemu-iotests/051| 12 +++-
>  tests/qemu-iotests/051.out|  2 +-
>  tests/qemu-iotests/051.pc.out |  2 +-
>  3 files changed, 13 insertions(+), 3 deletions(-)

Reviewed-by: Thomas Huth

Re: [Qemu-block] [PATCH v2 3/3] iotests: use virtio aliases for 067

2017-09-13 Thread Thomas Huth

On 13.09.2017 11:10, Cornelia Huck wrote:
> The default cpu model on s390x does not provide zPCI, which is
> not yet wired up on tcg. Moreover, virtio-ccw is the standard
> on s390x.
> 
> Using virtio-scsi will implicitly pick the right device, so just
> switch to that for simplicity.
> 
> Signed-off-by: Cornelia Huck 
> ---
>  tests/qemu-iotests/067 | 3 ++-
>  tests/qemu-iotests/067.out | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/qemu-iotests/067 b/tests/qemu-iotests/067
> index 5d4ca4bc61..cbb3da286a 100755
> --- a/tests/qemu-iotests/067
> +++ b/tests/qemu-iotests/067
> @@ -141,7 +141,7 @@ echo
>  echo === Empty drive with -device and device_del ===
>  echo
>  
> -run_qemu -device virtio-scsi-pci -device scsi-cd,id=cd0 < +run_qemu -device virtio-scsi -device scsi-cd,id=cd0 <  { "execute": "qmp_capabilities" }
>  { "execute": "query-block" }
>  { "execute": "device_del", "arguments": { "id": "cd0" } }
> @@ -150,6 +150,7 @@ run_qemu -device virtio-scsi-pci -device scsi-cd,id=cd0 
> <  { "execute": "quit" }
>  EOF
>  
> +

Superfluous white space change?

With that nit removed:

Reviewed-by: Thomas Huth

[Qemu-block] [PATCH v2 2/3] iotests: use -ccw on s390x for 051

2017-09-13 Thread Cornelia Huck

The default cpu model on s390x does not provide zPCI, which is
not yet wired up on tcg. Moreover, virtio-ccw is the standard
on s390x, so use the -ccw instead of the -pci versions of virtio
devices on s390x.

Signed-off-by: Cornelia Huck 
---
 tests/qemu-iotests/051| 12 +++-
 tests/qemu-iotests/051.out|  2 +-
 tests/qemu-iotests/051.pc.out |  2 +-
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/tests/qemu-iotests/051 b/tests/qemu-iotests/051
index c8cfc764bc..dba8816c9f 100755
--- a/tests/qemu-iotests/051
+++ b/tests/qemu-iotests/051
@@ -103,7 +103,17 @@ echo
 echo === Device without drive ===
 echo
 
-run_qemu -device virtio-scsi-pci -device scsi-hd
+case "$QEMU_DEFAULT_MACHINE" in
+  s390-ccw-virtio)
+  virtio_scsi=virtio-scsi-ccw
+  ;;
+  *)
+  virtio_scsi=virtio-scsi-pci
+  ;;
+esac
+
+run_qemu -device $virtio_scsi -device scsi-hd |
+sed -e "s/$virtio_scsi/VIRTIO_SCSI/"
 
 echo
 echo === Overriding backing file ===
diff --git a/tests/qemu-iotests/051.out b/tests/qemu-iotests/051.out
index 4d3b1ff316..e3c6eaba57 100644
--- a/tests/qemu-iotests/051.out
+++ b/tests/qemu-iotests/051.out
@@ -49,7 +49,7 @@ QEMU_PROG: -drive 
file=TEST_DIR/t.qcow2,driver=qcow2,format=qcow2: Cannot specif
 
 === Device without drive ===
 
-Testing: -device virtio-scsi-pci -device scsi-hd
+Testing: -device VIRTIO_SCSI -device scsi-hd
 QEMU X.Y.Z monitor - type 'help' for more information
 (qemu) QEMU_PROG: -device scsi-hd: drive property not set
 
diff --git a/tests/qemu-iotests/051.pc.out b/tests/qemu-iotests/051.pc.out
index 76d7205460..ae7801b44b 100644
--- a/tests/qemu-iotests/051.pc.out
+++ b/tests/qemu-iotests/051.pc.out
@@ -49,7 +49,7 @@ QEMU_PROG: -drive 
file=TEST_DIR/t.qcow2,driver=qcow2,format=qcow2: Cannot specif
 
 === Device without drive ===
 
-Testing: -device virtio-scsi-pci -device scsi-hd
+Testing: -device VIRTIO_SCSI -device scsi-hd
 QEMU X.Y.Z monitor - type 'help' for more information
 (qemu) QEMU_PROG: -device scsi-hd: drive property not set
 
-- 
2.13.5

[Qemu-block] [PATCH v2 1/3] iotests: use -ccw on s390x for 040, 139, and 182

2017-09-13 Thread Cornelia Huck

The default cpu model on s390x does not provide zPCI, which is
not yet wired up on tcg. Moreover, virtio-ccw is the standard
on s390x, so use the -ccw instead of the -pci versions of virtio
devices on s390x.

Reviewed-by: Kevin Wolf 
Signed-off-by: Cornelia Huck 
---
 tests/qemu-iotests/040 |  6 +-
 tests/qemu-iotests/139 | 12 ++--
 tests/qemu-iotests/182 | 13 +++--
 3 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/tests/qemu-iotests/040 b/tests/qemu-iotests/040
index 95b7510571..c284d08796 100755
--- a/tests/qemu-iotests/040
+++ b/tests/qemu-iotests/040
@@ -82,7 +82,11 @@ class TestSingleDrive(ImageCommitTestCase):
 qemu_io('-f', 'raw', '-c', 'write -P 0xab 0 524288', backing_img)
 qemu_io('-f', iotests.imgfmt, '-c', 'write -P 0xef 524288 524288', 
mid_img)
 self.vm = iotests.VM().add_drive(test_img, 
"node-name=top,backing.node-name=mid,backing.backing.node-name=base", 
interface="none")
-self.vm.add_device("virtio-scsi-pci")
+if iotests.qemu_default_machine == 's390-ccw-virtio':
+self.vm.add_device("virtio-scsi-ccw")
+else:
+self.vm.add_device("virtio-scsi-pci")
+
 self.vm.add_device("scsi-hd,id=scsi0,drive=drive0")
 self.vm.launch()
 
diff --git a/tests/qemu-iotests/139 b/tests/qemu-iotests/139
index 50cf40fbd5..f8f02808a9 100644
--- a/tests/qemu-iotests/139
+++ b/tests/qemu-iotests/139
@@ -25,13 +25,21 @@ import time
 
 base_img = os.path.join(iotests.test_dir, 'base.img')
 new_img = os.path.join(iotests.test_dir, 'new.img')
+if iotests.qemu_default_machine == 's390-ccw-virtio':
+default_virtio_blk = 'virtio-blk-ccw'
+else:
+default_virtio_blk = 'virtio-blk-pci'
 
 class TestBlockdevDel(iotests.QMPTestCase):
 
 def setUp(self):
 iotests.qemu_img('create', '-f', iotests.imgfmt, base_img, '1M')
 self.vm = iotests.VM()
-self.vm.add_device("virtio-scsi-pci,id=virtio-scsi")
+if iotests.qemu_default_machine == 's390-ccw-virtio':
+self.vm.add_device("virtio-scsi-ccw,id=virtio-scsi")
+else:
+self.vm.add_device("virtio-scsi-pci,id=virtio-scsi")
+
 self.vm.launch()
 
 def tearDown(self):
@@ -87,7 +95,7 @@ class TestBlockdevDel(iotests.QMPTestCase):
 self.checkBlockDriverState(node, expect_error)
 
 # Add a device model
-def addDeviceModel(self, device, backend, driver = 'virtio-blk-pci'):
+def addDeviceModel(self, device, backend, driver = default_virtio_blk):
 result = self.vm.qmp('device_add', id = device,
  driver = driver, drive = backend)
 self.assert_qmp(result, 'return', {})
diff --git a/tests/qemu-iotests/182 b/tests/qemu-iotests/182
index 7ecbb22604..2e078ceed8 100755
--- a/tests/qemu-iotests/182
+++ b/tests/qemu-iotests/182
@@ -45,17 +45,26 @@ _supported_os Linux
 
 size=32M
 
+case "$QEMU_DEFAULT_MACHINE" in
+  s390-ccw-virtio)
+  virtioblk=virtio-blk-ccw
+  ;;
+  *)
+  virtioblk=virtio-blk-pci
+  ;;
+esac
+
 _make_test_img $size
 
 echo "Starting QEMU"
 _launch_qemu -drive file=$TEST_IMG,if=none,id=drive0,file.locking=on \
--device virtio-blk-pci,drive=drive0
+-device $virtioblk,drive=drive0
 
 echo
 echo "Starting a second QEMU using the same image should fail"
 echo 'quit' | $QEMU -monitor stdio \
 -drive file=$TEST_IMG,if=none,id=drive0,file.locking=on \
--device virtio-blk-pci,drive=drive0 2>&1 | _filter_testdir 2>&1 |
+-device $virtioblk,drive=drive0 2>&1 | _filter_testdir 2>&1 |
 _filter_qemu |
 sed -e '/falling back to POSIX file/d' \
 -e '/locks can be lost unexpectedly/d'
-- 
2.13.5

[Qemu-block] [PATCH v2 0/3] iotests: cure s390x failures by switching to ccw/aliases

2017-09-13 Thread Cornelia Huck

Recent changes in s390x made pci support dependant on the zpci cpu
feature, which is not provided on all models (and not on by default).
This means we cannot instatiate pci devices on a standard qemu
invocation for s390x. Moreover, the zpci instructions are not even
wired up for tcg yet, so actually doing anything with those pci devices
is bound to fail on tcg.

For 040, 051, 139, and 182, this can be fixed by switching to virtio-ccw
from virtio-pci on s390x. 051 also needs a bit of post-processing on
the output.

For 067, it is easier to switch to virtio aliases, which will pick
virtio-ccw on s390x and virtio-pci elsewhere. It also exercises the
aliasing path.

v1->v2:
- avoid adding new reference output by adding post-processing to 051
  and switching to aliases for 067

Cornelia Huck (3):
  iotests: use -ccw on s390x for 040, 139, and 182
  iotests: use -ccw on s390x for 051
  iotests: use virtio aliases for 067

 tests/qemu-iotests/040|  6 +-
 tests/qemu-iotests/051| 12 +++-
 tests/qemu-iotests/051.out|  2 +-
 tests/qemu-iotests/051.pc.out |  2 +-
 tests/qemu-iotests/067|  3 ++-
 tests/qemu-iotests/067.out|  2 +-
 tests/qemu-iotests/139| 12 ++--
 tests/qemu-iotests/182| 13 +++--
 8 files changed, 42 insertions(+), 10 deletions(-)

-- 
2.13.5

[Qemu-block] [PATCH] throttle: Assert that bkt->max is valid in throttle_compute_wait()

2017-09-13 Thread Alberto Garcia

If bkt->max == 0 and bkt->burst_length > 1 then we could have a
division by 0 in throttle_do_compute_wait(). That configuration is
however not permitted and is already detected by throttle_is_valid(),
but let's assert it in throttle_compute_wait() to make it explicit.

Found by Coverity (CID: 1381016).

Signed-off-by: Alberto Garcia 
---
 util/throttle.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/util/throttle.c b/util/throttle.c
index 06bf916adc..b38e742da5 100644
--- a/util/throttle.c
+++ b/util/throttle.c
@@ -124,6 +124,7 @@ int64_t throttle_compute_wait(LeakyBucket *bkt)
 /* If the main bucket is not full yet we still have to check the
  * burst bucket in order to enforce the burst limit */
 if (bkt->burst_length > 1) {
+assert(bkt->max > 0); /* see throttle_is_valid() */
 extra = bkt->burst_level - burst_bucket_size;
 if (extra > 0) {
 return throttle_do_compute_wait(bkt->max, extra);
-- 
2.11.0

Re: [Qemu-block] [Qemu-devel] [PATCH v2 00/10] cleanup qemu-iotests

2017-09-13 Thread Fam Zheng

On Wed, 09/13 08:47, Thomas Huth wrote:
> Meta comment: Could we maybe also rename "tests/qemu-iotests" to
> "tests/iotests" ? The "qemu" prefix sounds always very superfluous to me
> here...

Sounds good, and saves typing for when this path is manually entered. But maybe
keep tests/qemu-iotests as a symlink to keep old scripts happy?

Fam

Re: [Qemu-block] [Qemu-devel] [PATCH v7 28/38] libqtest: Add qtest_[v]startf()

2017-09-13 Thread Thomas Huth

On 12.09.2017 15:32, Eric Blake wrote:
> On 09/12/2017 05:14 AM, Thomas Huth wrote:
>> On 11.09.2017 19:20, Eric Blake wrote:
>>> We have several callers that were formatting the argument strings
>>> themselves; consolidate this effort by adding new convenience
>>> functions directly in libqtest, and update all call-sites that
>>> can benefit from it.
[...]
>>>  static void test_mon_partial(const void *data)
>>>  {
>>>  char *s;
>>> -char *cli;
>>> +const char *args = data;
>>>
>>> -cli = make_cli(data, "-smp 8 "
>>> -   "-numa node,nodeid=0,cpus=0-1 "
>>> -   "-numa node,nodeid=1,cpus=4-5 ");
>>> -qtest_start(cli);
>>> +global_qtest = qtest_startf("%s -smp 8 "
>>> +"-numa node,nodeid=0,cpus=0-1 "
>>> +"-numa node,nodeid=1,cpus=4-5 ", args);
>>
>> Does GCC emit a warning if you'd used data here directly? Otherwise I
>> think you could simply do this without the local args variable...
> 
> Passing void* through varargs, with the intent of the receiver parsing
> it as char*, is technically undefined in C.  I don't know if gcc warns,
> but I'm also worried that clang might warn.  I prefer to err on the side
> of defined behavior in this case, even though it annoyingly requires a
> temporary variable.

OK, sounds reasonable, so let's keep it!

 Thomas



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [Qemu-devel] [PATCH v7 10/38] libqos: Move/rename qpci_unplug_acpi_device_test() to pci.c

2017-09-13 Thread Thomas Huth

On 12.09.2017 15:28, Eric Blake wrote:
> On 09/12/2017 02:29 AM, Thomas Huth wrote:
>> On 11.09.2017 19:19, Eric Blake wrote:
>>> Commit 2f8b2767 originally added qpci_plug_device_test() and
>>> qpci_unplug_acpi_device_test() as a pair, both in pci-pc.c.
>>> Later, commit cf716b31 moved one half of the pair to pci.c
>>> when adding PPC64 support.  Keep the implementations of the
>>> two functions together, and shorten the name to
>>> qpci_unplug_device_test(), since all callers use the two
>>> functions in tandem.
>>>
> 
>>
>> No, that's a bad idea. ACPI and that outb() is clearly something
>> specific to x86, so this should not reside in pci.c but in pci-pc.c
>>
>> We might be able to unify this - I've had a similar patch here:
>>
>>  https://patchwork.kernel.org/patch/9905031/
>>
>> ... but I think this needs some more careful thinking and discussion, so
>> I'd suggest that you remove this from your already huge patch series for
>> now and we fix it later instead.
> 
> Okay, I'm fine dropping this patch, and can base my respin on top of
> your cleanup instead.

Note that my patches are currently on halt since I'm waiting for your
qemu_startf() reworks to get included first :-) ... so feel free to pick
up ideas or patches from my series into your series if that helps.

 Thomas



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [Qemu-devel] [PATCH v7 09/38] libqos: Track QTestState with QVirtioBus

2017-09-13 Thread Thomas Huth

On 12.09.2017 15:28, Eric Blake wrote:
> On 09/12/2017 02:21 AM, Thomas Huth wrote:
[...]
>> I fail to see why we need a separate bus object here for each device.
>> The bus is only available one time, not multiple times, isn't it? So
>> there should also only be one bus object floating around, not multiple
>> ones... or do I miss something?
> 
> You are right that there is only one bus for the machine - the problem
> is that the object representing the bus is dynamically created on the
> fly at the point where the device is first grabbed, rather than up front
> as part of creating the machine (in other words, the device search is
> not performed on a pre-existing bus object).
> 
> Contrast qpci_device_find() (lookup based on a pre-existing bus) with
> qvirtio_mmio_init_device() (lookup that did not use a pre-existing bus);
> also note that qvirtio_pci_device_find() takes a pre-existing QPCIBus
> (since on that architecture, the QVirtioBus is itself a device on the
> QPCIBus).
> 
> I suppose that we could indeed refactor the code to require callers to
> create the QVirtioBus object up front, and then make the device lookup
> (both qvirtio_mmio_init_device() and qvirtio_pci_device_find()) take a
> QVirtioBus parameter instead of QTestState or QPCIBus.  That's slightly
> more work, but I'm happy to attempt it if you think it will be better.

At least to me, that sounds like the right way to go. I guess we might
run into other problems later if we have multiple instances of the bus
object floating around ... so sorry if this is extra work, but I'd say
let's better do it properly now instead of having to rework this again
later.

 Thomas



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-block] [Qemu-devel] [PATCH v2 00/10] cleanup qemu-iotests

2017-09-13 Thread Thomas Huth

On 12.09.2017 16:44, Paolo Bonzini wrote:
> The purpose of this series is to separate the "check" sources from
> the tests.  After these patches, common.config is reduced to simple
> shell initialization, and common.rc is only included by the tests.
> 
> Along the way, a lot of dead code is removed too.
> 
> In v2, the following patches:
> 
>   qemu-iotests: do not do useless search for QEMU_*_PROG
>   qemu-iotests: do not search for binaries in the current directory
>   qemu-iotests: include common.env and common.config early
> 
> have been replaced by "qemu-iotests: cleanup and fix search for programs",
> which also preserves the behavior of searching for programs as symlinks
> in the current directory.
> 
> Paolo
> 
> Paolo Bonzini (10):
>   qemu-iotests: remove dead code
>   qemu-iotests: get rid of AWK_PROG
>   qemu-iotests: move "check" code out of common.rc
>   qemu-iotests: cleanup and fix search for programs
>   qemu-iotests: limit non-_PROG-suffixed variables to common.rc
>   qemu-iotests: do not include common.rc in "check"
>   qemu-iotests: disintegrate more parts of common.config
>   qemu-iotests: fix uninitialized variable
>   qemu-iotests: get rid of $iam
>   qemu-iotests: merge "check" and "common"
> 
>  tests/qemu-iotests/039.out   |  10 +-
>  tests/qemu-iotests/061.out   |   4 +-
>  tests/qemu-iotests/137.out   |   2 +-
>  tests/qemu-iotests/check | 575 
> ++-
>  tests/qemu-iotests/common| 459 ---
>  tests/qemu-iotests/common.config | 206 +-
>  tests/qemu-iotests/common.qemu   |   1 +
>  tests/qemu-iotests/common.rc | 205 +++---

Meta comment: Could we maybe also rename "tests/qemu-iotests" to
"tests/iotests" ? The "qemu" prefix sounds always very superfluous to me
here...

 Thomas

85 matches

Mail list logo