Re: [Qemu-block] [PATCH v8 18/20] qcow2: Switch store_bitmap_data() to byte-based iteration

2017-09-19 Thread Kevin Wolf
Am 19.09.2017 um 21:42 hat Eric Blake geschrieben:
> However...
> 
> >> -sbc = limit >> BDRV_SECTOR_BITS;
> >>  assert(DIV_ROUND_UP(bm_size, limit) == tb_size);
> >>
> >> -while ((sector = bdrv_dirty_iter_next(dbi) >> BDRV_SECTOR_BITS) >= 0) {
> >> -uint64_t cluster = sector / sbc;
> >> +while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) {
> >> +uint64_t cluster = offset / limit;
> 
> bdrv_dirty_iter_next() returns the next dirty bit (which is not
> necessarily the first bit in the cluster).  For the purposes of
> serialization, we want to serialize the entire cluster in one go, even
> though we will be serializing 0 bits up until the first dirty bit.  So
> offset at this point may be unaligned,

Ok, this is the part that I was missing. It makes a lot more sense now.

Also, I think 'cluster' meaning bitmap clusters and not qcow2 clusters
here confused me a bit.

> > The part that I'm missing yet is why we need to do it. The bitmap
> > granularity is also the granularity of bdrv_dirty_iter_next(), so isn't
> > offset already aligned and we could even assert that instead of aligning
> > down? (As long we enforce our restriction, which we seem to do in
> > bitmap_list_load().)
> 
> Sadly, a quick:
> [...]
> does NOT fail iotests 165, which appears to be the only test that
> actually hammers on qcow2 bitmaps (changing it to an 'assert(false)'
> only shows an effect on 165) - which means our test is NOT exercising
> all possible alignments.  And it's python-based, with lame output, which
> makes debugging it painful.  But never fear, for v9 I will improve the
> test to actually affect the bitmap at a point that would fail with my
> temporary assertion in place, and thus proving that we DO need to align
> down.  Note that test 165 is testing only a 1G image, but I just showed
> that 64k clusters with 64k granularity covers up to 32G of image space
> in one cluster of the bitmap, so the test is only covering one cluster
> of serialization in the first place, and as written, the test is
> dirtying byte 0, which explains why it happens to get an offset aligned
> to limit, even though that is not a valid assertion.

More tests are always welcome and a good argument for getting a series
merged. :-)

Kevin
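To illustrate the point settled above — the dirty iterator can return an offset anywhere inside a serialization cluster, so store_bitmap_data() must align down — here is a small sketch (illustrative Python only; the names are made up, this is not QEMU code):

```python
def serialization_cluster(offset, limit):
    # Map a dirty byte offset to the bitmap cluster that must be
    # serialized; equivalent to aligning offset down to `limit`.
    return offset // limit

# 64k qcow2 clusters with 64k bitmap granularity: one cluster of bitmap
# data covers 64k * 8 * 64k = 32G of image space.
limit = 64 * 1024 * 8 * 64 * 1024
assert serialization_cluster(0, limit) == 0            # what test 165 hits
assert serialization_cluster(limit - 512, limit) == 0  # unaligned offset
assert serialization_cluster(limit, limit) == 1
```

Dirtying byte 0 (as test 165 does) only ever exercises the first, trivially aligned case, which is why the temporary assertion did not fire.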




Re: [Qemu-block] [Qemu-devel] [PATCH v9 05/20] dirty-bitmap: Avoid size query failure during truncate

2017-09-19 Thread Fam Zheng
On Tue, 09/19 19:00, John Snow wrote:
> 
> 
> On 09/19/2017 04:18 PM, Eric Blake wrote:
> > We've previously fixed several places where we failed to account
> > for possible errors from bdrv_nb_sectors().  Fix another one by
> > making bdrv_dirty_bitmap_truncate() take the new size from the
> > caller instead of querying itself; then adjust the sole caller
> > bdrv_truncate() to pass the size just determined by a successful
> > resize, or to skip the bitmap resize on failure, thus avoiding
> > sizing the bitmaps to -1.
> > 
> > Signed-off-by: Eric Blake 
> > 
> > ---
> > v9: skip only bdrv_dirty_bitmap_truncate on error [Fam]
> > v8: retitle and rework to avoid possibility of secondary failure [John]
> > v7: new patch [Kevin]
> > ---
> >  include/block/dirty-bitmap.h |  2 +-
> >  block.c  | 15 ++-
> >  block/dirty-bitmap.c |  6 +++---
> >  3 files changed, 14 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
> > index 8fd842eac9..7a27590047 100644
> > --- a/include/block/dirty-bitmap.h
> > +++ b/include/block/dirty-bitmap.h
> > @@ -83,7 +83,7 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
> >  void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num);
> >  int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
> >  int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap);
> > -void bdrv_dirty_bitmap_truncate(BlockDriverState *bs);
> > +void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes);
> >  bool bdrv_dirty_bitmap_readonly(const BdrvDirtyBitmap *bitmap);
> >  bool bdrv_has_readonly_bitmaps(BlockDriverState *bs);
> >  bool bdrv_dirty_bitmap_get_autoload(const BdrvDirtyBitmap *bitmap);
> > diff --git a/block.c b/block.c
> > index ee6a48976e..89261a7a53 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -3450,12 +3450,17 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, PreallocMode prealloc,
> >  assert(!(bs->open_flags & BDRV_O_INACTIVE));
> > 
> >  ret = drv->bdrv_truncate(bs, offset, prealloc, errp);
> > -if (ret == 0) {
> > -ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
> > -bdrv_dirty_bitmap_truncate(bs);
> > -bdrv_parent_cb_resize(bs);
> > -atomic_inc(&bs->write_gen);
> > +if (ret < 0) {
> > +return ret;
> >  }
> > +ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
> > +if (ret < 0) {
> > +error_setg_errno(errp, -ret, "Could not refresh total sector count");
> > +} else {
> 
> Sorry for dragging my feet on this point; if anyone else wants to R-B
> this I will cede without much of a fuss, but perhaps you can help me
> understand.
> 
> (Copying some questions I asked Eric on IRC, airing to list for wider
> discussion, and also because I had to drive home before the stores
> closed for the evening)
> 
> Do you suspect that almost certainly if bdrv_truncate() fails overall
> that the image format driver will either unmount the image or become
> read-only?
> 
> There are calls from parallels, qcow, qcow2-refcount, qcow2, raw-format,
> vhdx-log, vhdx plus whichever calls from blk_truncate (jobs, all of the
> same formats, blockdev, qemu-img)
> 
> I'm just trying to picture exactly what's going to happen if we manage
> to resize the drive but then don't resize the bitmap -- say we expand
> the drive larger but fail to refresh (and fail to resize the bitmap.)
> 
> If the block format module does not immediately and dutifully stop all
> further writes -- is there anything that stops us from writing to new
> sectors that were beyond EOF previously?
> 
> I suppose if *not* that's a bug for callers of bdrv_truncate to allow
> that kind of monkey business, but if it CAN happen, hbitmap only guards
> against such things with an assert (which, IIRC, is not guaranteed to be
> on for all builds)

It's guaranteed since a few hours ago:

commit 262a69f4282e44426c7a132138581d400053e0a1
Author: Eric Blake 
Date:   Mon Sep 11 16:13:20 2017 -0500

osdep.h: Prohibit disabling assert() in supported builds

> 
> 
> So the question is: "bdrv_truncate failure is NOT considered recoverable
> in ANY case, is it?"
> 
> It may possibly be safer to, if the initial truncate request succeeds,
> apply a best-effort resize to the bitmap before returning the error.

You mean falling back to "offset" (or that value aligned up to the bs
cluster size) if refresh_total_sectors() returns an error? I think that is okay.

Fam



Re: [Qemu-block] [Qemu-devel] [PATCH v9 05/20] dirty-bitmap: Avoid size query failure during truncate

2017-09-19 Thread John Snow


On 09/19/2017 04:18 PM, Eric Blake wrote:
> We've previously fixed several places where we failed to account
> for possible errors from bdrv_nb_sectors().  Fix another one by
> making bdrv_dirty_bitmap_truncate() take the new size from the
> caller instead of querying itself; then adjust the sole caller
> bdrv_truncate() to pass the size just determined by a successful
> resize, or to skip the bitmap resize on failure, thus avoiding
> sizing the bitmaps to -1.
> 
> Signed-off-by: Eric Blake 
> 
> ---
> v9: skip only bdrv_dirty_bitmap_truncate on error [Fam]
> v8: retitle and rework to avoid possibility of secondary failure [John]
> v7: new patch [Kevin]
> ---
>  include/block/dirty-bitmap.h |  2 +-
>  block.c  | 15 ++-
>  block/dirty-bitmap.c |  6 +++---
>  3 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
> index 8fd842eac9..7a27590047 100644
> --- a/include/block/dirty-bitmap.h
> +++ b/include/block/dirty-bitmap.h
> @@ -83,7 +83,7 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
>  void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num);
>  int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
>  int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap);
> -void bdrv_dirty_bitmap_truncate(BlockDriverState *bs);
> +void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes);
>  bool bdrv_dirty_bitmap_readonly(const BdrvDirtyBitmap *bitmap);
>  bool bdrv_has_readonly_bitmaps(BlockDriverState *bs);
>  bool bdrv_dirty_bitmap_get_autoload(const BdrvDirtyBitmap *bitmap);
> diff --git a/block.c b/block.c
> index ee6a48976e..89261a7a53 100644
> --- a/block.c
> +++ b/block.c
> @@ -3450,12 +3450,17 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, PreallocMode prealloc,
>  assert(!(bs->open_flags & BDRV_O_INACTIVE));
> 
>  ret = drv->bdrv_truncate(bs, offset, prealloc, errp);
> -if (ret == 0) {
> -ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
> -bdrv_dirty_bitmap_truncate(bs);
> -bdrv_parent_cb_resize(bs);
> -atomic_inc(&bs->write_gen);
> +if (ret < 0) {
> +return ret;
>  }
> +ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
> +if (ret < 0) {
> +error_setg_errno(errp, -ret, "Could not refresh total sector count");
> +} else {

Sorry for dragging my feet on this point; if anyone else wants to R-B
this I will cede without much of a fuss, but perhaps you can help me
understand.

(Copying some questions I asked Eric on IRC, airing to list for wider
discussion, and also because I had to drive home before the stores
closed for the evening)

Do you suspect that almost certainly if bdrv_truncate() fails overall
that the image format driver will either unmount the image or become
read-only?

There are calls from parallels, qcow, qcow2-refcount, qcow2, raw-format,
vhdx-log, vhdx plus whichever calls from blk_truncate (jobs, all of the
same formats, blockdev, qemu-img)

I'm just trying to picture exactly what's going to happen if we manage
to resize the drive but then don't resize the bitmap -- say we expand
the drive larger but fail to refresh (and fail to resize the bitmap.)

If the block format module does not immediately and dutifully stop all
further writes -- is there anything that stops us from writing to new
sectors that were beyond EOF previously?

I suppose if *not* that's a bug for callers of bdrv_truncate to allow
that kind of monkey business, but if it CAN happen, hbitmap only guards
against such things with an assert (which, IIRC, is not guaranteed to be
on for all builds)


So the question is: "bdrv_truncate failure is NOT considered recoverable
in ANY case, is it?"

It may possibly be safer to, if the initial truncate request succeeds,
apply a best-effort resize to the bitmap before returning the error.

> +bdrv_dirty_bitmap_truncate(bs, bs->total_sectors * BDRV_SECTOR_SIZE);
> +}
> +bdrv_parent_cb_resize(bs);
> +atomic_inc(&bs->write_gen);
>  return ret;
>  }
> 
> diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
> index 42a55e4a4b..ee164fb518 100644
> --- a/block/dirty-bitmap.c
> +++ b/block/dirty-bitmap.c
> @@ -1,7 +1,7 @@
>  /*
>   * Block Dirty Bitmap
>   *
> - * Copyright (c) 2016 Red Hat. Inc
> + * Copyright (c) 2016-2017 Red Hat. Inc
>   *
>   * Permission is hereby granted, free of charge, to any person obtaining a 
> copy
>   * of this software and associated documentation files (the "Software"), to 
> deal
> @@ -302,10 +302,10 @@ BdrvDirtyBitmap 
> *bdrv_reclaim_dirty_bitmap(BlockDriverState *bs,
>   * Truncates _all_ bitmaps attached to a BDS.
>   * Called with BQL taken.
>   */
> -void bdrv_dirty_bitmap_truncate(BlockDriverState *bs)
> +void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes)
>  {
>  BdrvDirtyBitmap *bitmap;
> -uint64_t size = bdrv_nb_sectors(bs);
> +int64_t size = 

[Qemu-block] [PATCH v9 20/20] dirty-bitmap: Convert internal hbitmap size/granularity

2017-09-19 Thread Eric Blake
Now that all callers are using byte-based interfaces, there's no
reason for our internal hbitmap to remain with sector-based
granularity.  It also simplifies our internal scaling, since we
already know that hbitmap widens requests out to granularity
boundaries.
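The scaling this removes can be seen in one line: with a byte-based hbitmap, the granularity shift is used directly instead of being rebased to sectors. A rough sketch (illustrative Python; ctz32 is reimplemented here for the example):

```python
BDRV_SECTOR_BITS = 9

def ctz32(x):
    # Count trailing zeros, as QEMU's ctz32() does for power-of-two input.
    return (x & -x).bit_length() - 1

granularity = 64 * 1024
old_shift = ctz32(granularity) - BDRV_SECTOR_BITS  # sector-based hbitmap
new_shift = ctz32(granularity)                     # byte-based hbitmap
assert 1 << new_shift == granularity
assert new_shift - old_shift == BDRV_SECTOR_BITS
```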

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v9: no change
v8: rebase to earlier truncate changes (R-b kept)
v7: rebase to dirty_iter_next cleanup (no semantic change, R-b kept)
v6: no change
v5: fix bdrv_dirty_bitmap_truncate [John]
v4: rebase to earlier changes, include serialization, R-b dropped
v3: no change
v2: no change
---
 block/dirty-bitmap.c | 62 +++-
 1 file changed, 18 insertions(+), 44 deletions(-)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 58a3f330a9..bd04e991b1 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -38,7 +38,7 @@
  */
 struct BdrvDirtyBitmap {
 QemuMutex *mutex;
-HBitmap *bitmap;/* Dirty sector bitmap implementation */
+HBitmap *bitmap;/* Dirty bitmap implementation */
 HBitmap *meta;  /* Meta dirty bitmap */
 BdrvDirtyBitmap *successor; /* Anonymous child; implies frozen status */
 char *name; /* Optional non-empty unique ID */
@@ -130,12 +130,7 @@ BdrvDirtyBitmap *bdrv_create_dirty_bitmap(BlockDriverState *bs,
 }
 bitmap = g_new0(BdrvDirtyBitmap, 1);
 bitmap->mutex = &bs->dirty_bitmap_mutex;
-/*
- * TODO - let hbitmap track full granularity. For now, it is tracking
- * only sector granularity, as a shortcut for our iterators.
- */
-bitmap->bitmap = hbitmap_alloc(DIV_ROUND_UP(bitmap_size, BDRV_SECTOR_SIZE),
-   ctz32(granularity) - BDRV_SECTOR_BITS);
+bitmap->bitmap = hbitmap_alloc(bitmap_size, ctz32(granularity));
 bitmap->size = bitmap_size;
 bitmap->name = g_strdup(name);
 bitmap->disabled = false;
@@ -312,7 +307,7 @@ void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes)
 QLIST_FOREACH(bitmap, &bs->dirty_bitmaps, list) {
 assert(!bdrv_dirty_bitmap_frozen(bitmap));
 assert(!bitmap->active_iterators);
-hbitmap_truncate(bitmap->bitmap, DIV_ROUND_UP(bytes, BDRV_SECTOR_SIZE));
+hbitmap_truncate(bitmap->bitmap, bytes);
 bitmap->size = bytes;
 }
 bdrv_dirty_bitmaps_unlock(bs);
@@ -442,7 +437,7 @@ bool bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
int64_t offset)
 {
 if (bitmap) {
-return hbitmap_get(bitmap->bitmap, offset >> BDRV_SECTOR_BITS);
+return hbitmap_get(bitmap->bitmap, offset);
 } else {
 return false;
 }
@@ -470,7 +465,7 @@ uint32_t bdrv_get_default_bitmap_granularity(BlockDriverState *bs)

 uint32_t bdrv_dirty_bitmap_granularity(const BdrvDirtyBitmap *bitmap)
 {
-return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->bitmap);
+return 1U << hbitmap_granularity(bitmap->bitmap);
 }

 BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap)
@@ -503,20 +498,16 @@ void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter)

 int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)
 {
-int64_t ret = hbitmap_iter_next(&iter->hbi);
-return ret < 0 ? -1 : ret * BDRV_SECTOR_SIZE;
+return hbitmap_iter_next(&iter->hbi);
 }

 /* Called within bdrv_dirty_bitmap_lock..unlock */
 void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
   int64_t offset, int64_t bytes)
 {
-int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
-
 assert(bdrv_dirty_bitmap_enabled(bitmap));
 assert(!bdrv_dirty_bitmap_readonly(bitmap));
-hbitmap_set(bitmap->bitmap, offset >> BDRV_SECTOR_BITS,
-end_sector - (offset >> BDRV_SECTOR_BITS));
+hbitmap_set(bitmap->bitmap, offset, bytes);
 }

 void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap,
@@ -531,12 +522,9 @@ void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap,
 void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
 int64_t offset, int64_t bytes)
 {
-int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
-
 assert(bdrv_dirty_bitmap_enabled(bitmap));
 assert(!bdrv_dirty_bitmap_readonly(bitmap));
-hbitmap_reset(bitmap->bitmap, offset >> BDRV_SECTOR_BITS,
-  end_sector - (offset >> BDRV_SECTOR_BITS));
+hbitmap_reset(bitmap->bitmap, offset, bytes);
 }

 void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap,
@@ -556,8 +544,7 @@ void bdrv_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap **out)
 hbitmap_reset_all(bitmap->bitmap);
 } else {
 HBitmap *backup = bitmap->bitmap;
-bitmap->bitmap = hbitmap_alloc(DIV_ROUND_UP(bitmap->size, BDRV_SECTOR_SIZE),
+bitmap->bitmap = hbitmap

[Qemu-block] [PATCH v9 19/20] dirty-bitmap: Switch bdrv_set_dirty() to bytes

2017-09-19 Thread Eric Blake
Both callers already had bytes available, but were scaling to
sectors.  Move the scaling to internal code.  In the case of
bdrv_aligned_pwritev(), we are now passing the exact offset
rather than a rounded sector-aligned value, but that's okay
as long as dirty bitmap widens start/bytes to granularity
boundaries.
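The "widening" relied on here can be sketched as follows (illustrative Python, not the hbitmap implementation):

```python
def widen(offset, nbytes, granularity):
    # Round the start down and the end up to granularity boundaries,
    # as hbitmap effectively does when marking bits dirty.
    start = (offset // granularity) * granularity
    end = -(-(offset + nbytes) // granularity) * granularity  # ceil
    return start, end - start

# Passing an exact, unrounded write offset is safe: the marked range
# still covers every byte of the request.
assert widen(100, 412, 512) == (0, 512)
assert widen(510, 4, 512) == (0, 1024)   # straddles a boundary
```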

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v4: only context changes
v3: rebase to lock context changes, R-b kept
v2: no change
---
 include/block/block_int.h | 2 +-
 block/io.c| 6 ++
 block/dirty-bitmap.c  | 7 ---
 3 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index ba4c383393..55c5d573d4 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1021,7 +1021,7 @@ void blk_dev_eject_request(BlockBackend *blk, bool force);
 bool blk_dev_is_tray_open(BlockBackend *blk);
 bool blk_dev_is_medium_locked(BlockBackend *blk);

-void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector, int64_t nr_sect);
+void bdrv_set_dirty(BlockDriverState *bs, int64_t offset, int64_t bytes);
 bool bdrv_requests_pending(BlockDriverState *bs);

 void bdrv_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap **out);
diff --git a/block/io.c b/block/io.c
index 4378ae4c7d..8a0cd8835a 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1334,7 +1334,6 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
 bool waited;
 int ret;

-int64_t start_sector = offset >> BDRV_SECTOR_BITS;
 int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
 uint64_t bytes_remaining = bytes;
 int max_transfer;
@@ -1409,7 +1408,7 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
 bdrv_debug_event(bs, BLKDBG_PWRITEV_DONE);

 atomic_inc(&bs->write_gen);
-bdrv_set_dirty(bs, start_sector, end_sector - start_sector);
+bdrv_set_dirty(bs, offset, bytes);

 stat64_max(&bs->wr_highest_offset, offset + bytes);

@@ -2438,8 +2437,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
 ret = 0;
 out:
 atomic_inc(&bs->write_gen);
-bdrv_set_dirty(bs, req.offset >> BDRV_SECTOR_BITS,
-   req.bytes >> BDRV_SECTOR_BITS);
+bdrv_set_dirty(bs, req.offset, req.bytes);
 tracked_request_end(&req);
 bdrv_dec_in_flight(bs);
 return ret;
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 117837b3cc..58a3f330a9 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -628,10 +628,10 @@ void bdrv_dirty_bitmap_deserialize_finish(BdrvDirtyBitmap *bitmap)
 hbitmap_deserialize_finish(bitmap->bitmap);
 }

-void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector,
-int64_t nr_sectors)
+void bdrv_set_dirty(BlockDriverState *bs, int64_t offset, int64_t bytes)
 {
 BdrvDirtyBitmap *bitmap;
+int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);

 if (QLIST_EMPTY(&bs->dirty_bitmaps)) {
 return;
@@ -643,7 +643,8 @@ void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector,
 continue;
 }
 assert(!bdrv_dirty_bitmap_readonly(bitmap));
-hbitmap_set(bitmap->bitmap, cur_sector, nr_sectors);
+hbitmap_set(bitmap->bitmap, offset >> BDRV_SECTOR_BITS,
+end_sector - (offset >> BDRV_SECTOR_BITS));
 }
 bdrv_dirty_bitmaps_unlock(bs);
 }
-- 
2.13.5




[Qemu-block] [PATCH v9 15/20] mirror: Switch mirror_dirty_init() to byte-based iteration

2017-09-19 Thread Eric Blake
Now that we have adjusted the majority of the calls this function
makes to be byte-based, it is easier to read the code if it makes
passes over the image using bytes rather than sectors.
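After this conversion the loop bound clamps each pass to INT_MAX aligned down to the job granularity; a quick sketch of that arithmetic (illustrative Python, not the mirror code itself):

```python
INT_MAX = 2**31 - 1

def next_chunk_bytes(offset, bdev_length, granularity):
    # Clamp a pass to what fits in an int, aligned down to granularity,
    # as the converted mirror_dirty_init() loop does.
    aligned_max = (INT_MAX // granularity) * granularity
    return min(bdev_length - offset, aligned_max)

length = 3 * 2**30        # a 3 GiB image, purely for illustration
granularity = 64 * 1024
first = next_chunk_bytes(0, length, granularity)
assert first <= INT_MAX and first % granularity == 0
assert next_chunk_bytes(length - 4096, length, granularity) == 4096
```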

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v6: no change
v5: rebase to earlier changes
v2-v4: no change
---
 block/mirror.c | 38 ++
 1 file changed, 14 insertions(+), 24 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index c2f73c91c5..5cdaaed7be 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -612,15 +612,13 @@ static void mirror_throttle(MirrorBlockJob *s)

 static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 {
-int64_t sector_num, end;
+int64_t offset;
 BlockDriverState *base = s->base;
 BlockDriverState *bs = s->source;
 BlockDriverState *target_bs = blk_bs(s->target);
-int ret, n;
+int ret;
 int64_t count;

-end = s->bdev_length / BDRV_SECTOR_SIZE;
-
 if (base == NULL && !bdrv_has_zero_init(target_bs)) {
 if (!bdrv_can_write_zeroes_with_unmap(target_bs)) {
 bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, s->bdev_length);
@@ -628,9 +626,9 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 }

 s->initial_zeroing_ongoing = true;
-for (sector_num = 0; sector_num < end; ) {
-int nb_sectors = MIN(end - sector_num,
-QEMU_ALIGN_DOWN(INT_MAX, s->granularity) >> BDRV_SECTOR_BITS);
+for (offset = 0; offset < s->bdev_length; ) {
+int bytes = MIN(s->bdev_length - offset,
+QEMU_ALIGN_DOWN(INT_MAX, s->granularity));

 mirror_throttle(s);

@@ -646,9 +644,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 continue;
 }

-mirror_do_zero_or_discard(s, sector_num * BDRV_SECTOR_SIZE,
-  nb_sectors * BDRV_SECTOR_SIZE, false);
-sector_num += nb_sectors;
+mirror_do_zero_or_discard(s, offset, bytes, false);
+offset += bytes;
 }

 mirror_wait_for_all_io(s);
@@ -656,10 +653,10 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 }

 /* First part, loop on the sectors and initialize the dirty bitmap.  */
-for (sector_num = 0; sector_num < end; ) {
+for (offset = 0; offset < s->bdev_length; ) {
 /* Just to make sure we are not exceeding int limit. */
-int nb_sectors = MIN(INT_MAX >> BDRV_SECTOR_BITS,
- end - sector_num);
+int bytes = MIN(s->bdev_length - offset,
+QEMU_ALIGN_DOWN(INT_MAX, s->granularity));

 mirror_throttle(s);

@@ -667,23 +664,16 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 return 0;
 }

-ret = bdrv_is_allocated_above(bs, base, sector_num * BDRV_SECTOR_SIZE,
-  nb_sectors * BDRV_SECTOR_SIZE, &count);
+ret = bdrv_is_allocated_above(bs, base, offset, bytes, &count);
 if (ret < 0) {
 return ret;
 }

-/* TODO: Relax this once bdrv_is_allocated_above and dirty
- * bitmaps no longer require sector alignment. */
-assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
-n = count >> BDRV_SECTOR_BITS;
-assert(n > 0);
+assert(count);
 if (ret == 1) {
-bdrv_set_dirty_bitmap(s->dirty_bitmap,
-  sector_num * BDRV_SECTOR_SIZE,
-  n * BDRV_SECTOR_SIZE);
+bdrv_set_dirty_bitmap(s->dirty_bitmap, offset, count);
 }
-sector_num += n;
+offset += count;
 }
 return 0;
 }
-- 
2.13.5




[Qemu-block] [PATCH v9 14/20] dirty-bitmap: Change bdrv_[re]set_dirty_bitmap() to use bytes

2017-09-19 Thread Eric Blake
Some of the callers were already scaling bytes to sectors; others
can be easily converted to pass byte offsets, all in our shift
towards a consistent byte interface everywhere.  Making the change
will also make it easier to write the hold-out callers to use byte
rather than sectors for their iterations; it also makes it easier
for a future dirty-bitmap patch to offload scaling over to the
internal hbitmap.  Although all callers happen to pass
sector-aligned values, make the internal scaling robust to any
sub-sector requests.
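The robustness claim — byte ranges scale to sectors by rounding the start down and the end up — can be sketched like this (illustrative Python mirroring the DIV_ROUND_UP/shift pattern in the diff):

```python
BDRV_SECTOR_SIZE = 512
BDRV_SECTOR_BITS = 9

def byte_range_to_sectors(offset, nbytes):
    # Start rounds down, end rounds up, so even a sub-sector request
    # covers every sector it touches.
    start_sector = offset >> BDRV_SECTOR_BITS
    end_sector = -(-(offset + nbytes) // BDRV_SECTOR_SIZE)  # ceil
    return start_sector, end_sector - start_sector

assert byte_range_to_sectors(0, 512) == (0, 1)
assert byte_range_to_sectors(100, 8) == (0, 1)   # sub-sector request
assert byte_range_to_sectors(510, 4) == (0, 2)   # straddles a boundary
```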

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v5: only context change
v4: only context change, due to rebasing to persistent bitmaps
v3: rebase to addition of _locked interfaces; complex enough that I
dropped R-b
v2: no change
---
 include/block/dirty-bitmap.h |  8 
 block/dirty-bitmap.c | 22 ++
 block/mirror.c   | 16 
 migration/block.c|  7 +--
 4 files changed, 31 insertions(+), 22 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 9e39537e4b..3579a7597c 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -40,9 +40,9 @@ const char *bdrv_dirty_bitmap_name(const BdrvDirtyBitmap *bitmap);
 int64_t bdrv_dirty_bitmap_size(const BdrvDirtyBitmap *bitmap);
 DirtyBitmapStatus bdrv_dirty_bitmap_status(BdrvDirtyBitmap *bitmap);
 void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap,
-   int64_t cur_sector, int64_t nr_sectors);
+   int64_t offset, int64_t bytes);
 void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap,
- int64_t cur_sector, int64_t nr_sectors);
+ int64_t offset, int64_t bytes);
 BdrvDirtyBitmapIter *bdrv_dirty_meta_iter_new(BdrvDirtyBitmap *bitmap);
 BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap);
 void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter);
@@ -75,9 +75,9 @@ void bdrv_dirty_bitmap_unlock(BdrvDirtyBitmap *bitmap);
 bool bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
int64_t offset);
 void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
-  int64_t cur_sector, int64_t nr_sectors);
+  int64_t offset, int64_t bytes);
 void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
-int64_t cur_sector, int64_t nr_sectors);
+int64_t offset, int64_t bytes);
 int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
 void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t offset);
 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index ad559c62b1..117837b3cc 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -509,35 +509,41 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)

 /* Called within bdrv_dirty_bitmap_lock..unlock */
 void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
-  int64_t cur_sector, int64_t nr_sectors)
+  int64_t offset, int64_t bytes)
 {
+int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
+
 assert(bdrv_dirty_bitmap_enabled(bitmap));
 assert(!bdrv_dirty_bitmap_readonly(bitmap));
-hbitmap_set(bitmap->bitmap, cur_sector, nr_sectors);
+hbitmap_set(bitmap->bitmap, offset >> BDRV_SECTOR_BITS,
+end_sector - (offset >> BDRV_SECTOR_BITS));
 }

 void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap,
-   int64_t cur_sector, int64_t nr_sectors)
+   int64_t offset, int64_t bytes)
 {
 bdrv_dirty_bitmap_lock(bitmap);
-bdrv_set_dirty_bitmap_locked(bitmap, cur_sector, nr_sectors);
+bdrv_set_dirty_bitmap_locked(bitmap, offset, bytes);
 bdrv_dirty_bitmap_unlock(bitmap);
 }

 /* Called within bdrv_dirty_bitmap_lock..unlock */
 void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
-int64_t cur_sector, int64_t nr_sectors)
+int64_t offset, int64_t bytes)
 {
+int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
+
 assert(bdrv_dirty_bitmap_enabled(bitmap));
 assert(!bdrv_dirty_bitmap_readonly(bitmap));
-hbitmap_reset(bitmap->bitmap, cur_sector, nr_sectors);
+hbitmap_reset(bitmap->bitmap, offset >> BDRV_SECTOR_BITS,
+  end_sector - (offset >> BDRV_SECTOR_BITS));
 }

 void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap,
- int64_t cur_sector, int64_t nr_sectors)
+ int64_t offset, int64_t bytes)
 {
 bdrv_dirty_bitmap_lock(bitmap);
-bdrv_reset_dirty_bitmap_locked(bitmap, cur_sector, nr_sectors);
+bdrv_reset_dirty_bi

[Qemu-block] [PATCH v9 09/20] qcow2: Switch sectors_covered_by_bitmap_cluster() to byte-based

2017-09-19 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Change the qcow2 bitmap
helper function sectors_covered_by_bitmap_cluster(), renaming it
to bytes_covered_by_bitmap_cluster() in the process.
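For reference, the arithmetic the renamed helper performs — a cluster of bitmap data holds cluster_size * 8 bits, each covering `granularity` bytes — sketched in Python (illustration only):

```python
def bytes_covered_by_bitmap_cluster(cluster_size, granularity):
    # One qcow2 cluster stores cluster_size * 8 bitmap bits; each bit
    # covers `granularity` bytes of the image.
    return granularity * (cluster_size * 8)

# 64k clusters with 64k granularity cover 32G of image per bitmap cluster.
assert bytes_covered_by_bitmap_cluster(64 * 1024, 64 * 1024) == 32 * 1024**3
```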

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v5: no change
v4: new patch
---
 block/qcow2-bitmap.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index 92098bfa49..4475273d8c 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -269,18 +269,16 @@ static int free_bitmap_clusters(BlockDriverState *bs, Qcow2BitmapTable *tb)
 return 0;
 }

-/* This function returns the number of disk sectors covered by a single qcow2
- * cluster of bitmap data. */
-static uint64_t sectors_covered_by_bitmap_cluster(const BDRVQcow2State *s,
-  const BdrvDirtyBitmap *bitmap)
+/* Return the disk size covered by a single qcow2 cluster of bitmap data. */
+static uint64_t bytes_covered_by_bitmap_cluster(const BDRVQcow2State *s,
+const BdrvDirtyBitmap *bitmap)
 {
-uint64_t sector_granularity =
-bdrv_dirty_bitmap_granularity(bitmap) >> BDRV_SECTOR_BITS;
-uint64_t sbc = sector_granularity * (s->cluster_size << 3);
+uint64_t granularity = bdrv_dirty_bitmap_granularity(bitmap);
+uint64_t limit = granularity * (s->cluster_size << 3);

-assert(QEMU_IS_ALIGNED(sbc << BDRV_SECTOR_BITS,
+assert(QEMU_IS_ALIGNED(limit,
bdrv_dirty_bitmap_serialization_align(bitmap)));
-return sbc;
+return limit;
 }

 /* load_bitmap_data
@@ -293,7 +291,7 @@ static int load_bitmap_data(BlockDriverState *bs,
 {
 int ret = 0;
 BDRVQcow2State *s = bs->opaque;
-uint64_t sector, sbc;
+uint64_t sector, limit, sbc;
 uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
 uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE);
 uint8_t *buf = NULL;
@@ -306,7 +304,8 @@ static int load_bitmap_data(BlockDriverState *bs,
 }

 buf = g_malloc(s->cluster_size);
-sbc = sectors_covered_by_bitmap_cluster(s, bitmap);
+limit = bytes_covered_by_bitmap_cluster(s, bitmap);
+sbc = limit >> BDRV_SECTOR_BITS;
 for (i = 0, sector = 0; i < tab_size; ++i, sector += sbc) {
 uint64_t count = MIN(bm_sectors - sector, sbc);
 uint64_t entry = bitmap_table[i];
@@ -1080,7 +1079,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 int ret;
 BDRVQcow2State *s = bs->opaque;
 int64_t sector;
-uint64_t sbc;
+uint64_t limit, sbc;
 uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
 uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE);
 const char *bm_name = bdrv_dirty_bitmap_name(bitmap);
@@ -1106,8 +1105,9 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,

 dbi = bdrv_dirty_iter_new(bitmap, 0);
 buf = g_malloc(s->cluster_size);
-sbc = sectors_covered_by_bitmap_cluster(s, bitmap);
-assert(DIV_ROUND_UP(bm_sectors, sbc) == tb_size);
+limit = bytes_covered_by_bitmap_cluster(s, bitmap);
+sbc = limit >> BDRV_SECTOR_BITS;
+assert(DIV_ROUND_UP(bm_size, limit) == tb_size);

 while ((sector = bdrv_dirty_iter_next(dbi)) != -1) {
 uint64_t cluster = sector / sbc;
-- 
2.13.5




[Qemu-block] [PATCH v9 13/20] dirty-bitmap: Change bdrv_get_dirty_locked() to take bytes

2017-09-19 Thread Eric Blake
Half the callers were already scaling bytes to sectors; the other
half can eventually be simplified to use byte iteration.  Both
callers were already using the result as a bool, so make that
explicit.  Making the change also makes it easier for a future
dirty-bitmap patch to offload scaling over to the internal hbitmap.

Remember, asking whether a byte is dirty is effectively asking
whether the entire granularity containing the byte is dirty, since
we only track dirtiness by granularity.
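In other words (illustrative sketch, not the hbitmap code):

```python
GRANULARITY = 4096  # example granularity; any power of two works

def byte_is_dirty(dirty_chunks, offset):
    # A byte is dirty iff the granularity chunk containing it is dirty.
    return (offset // GRANULARITY) in dirty_chunks

dirty = {1}  # chunk covering bytes [4096, 8192)
assert byte_is_dirty(dirty, 4096)
assert byte_is_dirty(dirty, 8191)      # any byte of the chunk reads dirty
assert not byte_is_dirty(dirty, 8192)
```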

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Juan Quintela 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v4: only context change
v3: rebase to _locked rename was straightforward enough that R-b kept
v2: tweak commit message, no code change
---
 include/block/dirty-bitmap.h | 4 ++--
 block/dirty-bitmap.c | 8 
 block/mirror.c   | 3 +--
 migration/block.c| 3 ++-
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 757fc4c5b8..9e39537e4b 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -72,8 +72,8 @@ void bdrv_dirty_bitmap_set_persistance(BdrvDirtyBitmap *bitmap,
 /* Functions that require manual locking.  */
 void bdrv_dirty_bitmap_lock(BdrvDirtyBitmap *bitmap);
 void bdrv_dirty_bitmap_unlock(BdrvDirtyBitmap *bitmap);
-int bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
-  int64_t sector);
+bool bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
+   int64_t offset);
 void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
   int64_t cur_sector, int64_t nr_sectors);
 void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 8322e23f0d..ad559c62b1 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -438,13 +438,13 @@ BlockDirtyInfoList *bdrv_query_dirty_bitmaps(BlockDriverState *bs)
 }

 /* Called within bdrv_dirty_bitmap_lock..unlock */
-int bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
-  int64_t sector)
+bool bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
+   int64_t offset)
 {
 if (bitmap) {
-return hbitmap_get(bitmap->bitmap, sector);
+return hbitmap_get(bitmap->bitmap, offset >> BDRV_SECTOR_BITS);
 } else {
-return 0;
+return false;
 }
 }

diff --git a/block/mirror.c b/block/mirror.c
index 7113d47db4..e36fc81df3 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -361,8 +361,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 int64_t next_offset = offset + nb_chunks * s->granularity;
 int64_t next_chunk = next_offset / s->granularity;
 if (next_offset >= s->bdev_length ||
-!bdrv_get_dirty_locked(source, s->dirty_bitmap,
-   next_offset >> BDRV_SECTOR_BITS)) {
+!bdrv_get_dirty_locked(source, s->dirty_bitmap, next_offset)) {
 break;
 }
 if (test_bit(next_chunk, s->in_flight_bitmap)) {
diff --git a/migration/block.c b/migration/block.c
index a3512945da..b618869661 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -530,7 +530,8 @@ static int mig_save_device_dirty(QEMUFile *f, BlkMigDevState *bmds,
 blk_mig_unlock();
 }
 bdrv_dirty_bitmap_lock(bmds->dirty_bitmap);
-if (bdrv_get_dirty_locked(bs, bmds->dirty_bitmap, sector)) {
+if (bdrv_get_dirty_locked(bs, bmds->dirty_bitmap,
+  sector * BDRV_SECTOR_SIZE)) {
 if (total_sectors - sector < BDRV_SECTORS_PER_DIRTY_CHUNK) {
 nr_sectors = total_sectors - sector;
 } else {
-- 
2.13.5




[Qemu-block] [PATCH v9 11/20] dirty-bitmap: Change bdrv_dirty_iter_next() to report byte offset

2017-09-19 Thread Eric Blake
Thanks to recent cleanups, most callers were scaling a return value
of sectors into bytes (the exception, in qcow2-bitmap, will be
converted to byte-based iteration later).  Update the interface to
do the scaling internally instead.

In qcow2-bitmap, the code was specifically checking for an error
return of -1.  To avoid a regression, we either have to make sure
we continue to return -1 (rather than a scaled -512) on error, or
we have to fix the caller to treat all negative values as error
rather than just one magic value.  It's easy enough to make both
changes at the same time, even though either one in isolation
would work.
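
A minimal model of the scaling hazard described above (the helper names are made up; `next_sector()` stands in for hbitmap_iter_next()):

```c
#include <assert.h>
#include <stdint.h>

#define BDRV_SECTOR_SIZE 512

static int64_t next_sector(int64_t fake)        /* stand-in for the iterator */
{
    return fake;
}

/* Naive scaling: the -1 error return is scaled along with valid sector
 * numbers, so a caller comparing against -1 never sees the error. */
static int64_t iter_next_bytes_buggy(int64_t fake)
{
    return next_sector(fake) * BDRV_SECTOR_SIZE;    /* -1 becomes -512 */
}

/* The fix: keep the error return at -1, scale only valid offsets. */
static int64_t iter_next_bytes_fixed(int64_t fake)
{
    int64_t ret = next_sector(fake);
    return ret < 0 ? -1 : ret * BDRV_SECTOR_SIZE;
}
```

With the buggy version, a `!= -1` loop condition would treat the scaled error -512 as a valid (negative) offset; hence the patch both preserves -1 and makes the caller check for any negative value.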

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v8: tweak commit message, add R-b
v7: return -1, not -512; and fix qcow2-bitmap to check all negatives [Kevin]
v5-v6: no change
v4: rebase to persistent bitmap
v3: no change
v2: no change
---
 block/backup.c   | 2 +-
 block/dirty-bitmap.c | 3 ++-
 block/mirror.c   | 8 
 block/qcow2-bitmap.c | 2 +-
 4 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index ac9c018717..06ddbfd03d 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -375,7 +375,7 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
 dbi = bdrv_dirty_iter_new(job->sync_bitmap);

 /* Find the next dirty sector(s) */
-while ((offset = bdrv_dirty_iter_next(dbi) * BDRV_SECTOR_SIZE) >= 0) {
+while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) {
 cluster = offset / job->cluster_size;

 /* Fake progress updates for any clusters we skipped */
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 84509476ba..e451916187 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -503,7 +503,8 @@ void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter)

 int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)
 {
-return hbitmap_iter_next(&iter->hbi);
+int64_t ret = hbitmap_iter_next(&iter->hbi);
+return ret < 0 ? -1 : ret * BDRV_SECTOR_SIZE;
 }

 /* Called within bdrv_dirty_bitmap_lock..unlock */
diff --git a/block/mirror.c b/block/mirror.c
index 0b063b3c20..77bf5aa3a4 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -336,10 +336,10 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 int max_io_bytes = MAX(s->buf_size / MAX_IN_FLIGHT, MAX_IO_BYTES);

 bdrv_dirty_bitmap_lock(s->dirty_bitmap);
-offset = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
+offset = bdrv_dirty_iter_next(s->dbi);
 if (offset < 0) {
 bdrv_set_dirty_iter(s->dbi, 0);
-offset = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
+offset = bdrv_dirty_iter_next(s->dbi);
 trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap) *
   BDRV_SECTOR_SIZE);
 assert(offset >= 0);
@@ -370,11 +370,11 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 break;
 }

-next_dirty = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
+next_dirty = bdrv_dirty_iter_next(s->dbi);
 if (next_dirty > next_offset || next_dirty < 0) {
 /* The bitmap iterator's cache is stale, refresh it */
 bdrv_set_dirty_iter(s->dbi, next_offset);
-next_dirty = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
+next_dirty = bdrv_dirty_iter_next(s->dbi);
 }
 assert(next_dirty == next_offset);
 nb_chunks++;
diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index 44329fc74f..b09010b1d3 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -1109,7 +1109,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 sbc = limit >> BDRV_SECTOR_BITS;
 assert(DIV_ROUND_UP(bm_size, limit) == tb_size);

-while ((sector = bdrv_dirty_iter_next(dbi)) != -1) {
+while ((sector = bdrv_dirty_iter_next(dbi) >> BDRV_SECTOR_BITS) >= 0) {
 uint64_t cluster = sector / sbc;
 uint64_t end, write_size;
 int64_t off;
-- 
2.13.5




[Qemu-block] [PATCH v9 08/20] dirty-bitmap: Change bdrv_dirty_bitmap_*serialize*() to take bytes

2017-09-19 Thread Eric Blake
Right now, the dirty-bitmap code exposes the fact that we use
a scale of sector granularity in the underlying hbitmap to anything
that wants to serialize a dirty bitmap.  It's nicer to uniformly
expose bytes as our dirty-bitmap interface, matching the previous
change to bitmap size.  The only caller to serialization is currently
qcow2-bitmap.c, which becomes a bit more verbose because it is still
tracking sectors for other reasons, but a later patch will fix that
to more uniformly use byte offsets everywhere.  Likewise, within
dirty-bitmap, we have to add more assertions that we are not
truncating incorrectly, which can go away once the internal hbitmap
is byte-based rather than sector-based.
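
The alignment assertions added below use a common trick worth spelling out: OR-ing the offset and byte count lets a single test cover both values, since the OR is aligned only if each operand is. A small sketch under that assumption (the function name is invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BDRV_SECTOR_SIZE 512

/* Equivalent of QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE):
 * true only when both offset and bytes are sector-aligned. */
static bool both_sector_aligned(uint64_t offset, uint64_t bytes)
{
    return ((offset | bytes) & (BDRV_SECTOR_SIZE - 1)) == 0;
}
```

Any stray low bit in either operand survives the OR and trips the check, which is why one assertion suffices per call site.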

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v5: no change
v4: new patch
---
 include/block/dirty-bitmap.h | 14 +++---
 block/dirty-bitmap.c | 37 -
 block/qcow2-bitmap.c | 22 ++
 3 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 7a27590047..5f34a1a3c7 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -49,19 +49,19 @@ BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap,
 void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter);

 uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap,
-  uint64_t start, uint64_t count);
+  uint64_t offset, uint64_t bytes);
 uint64_t bdrv_dirty_bitmap_serialization_align(const BdrvDirtyBitmap *bitmap);
 void bdrv_dirty_bitmap_serialize_part(const BdrvDirtyBitmap *bitmap,
-  uint8_t *buf, uint64_t start,
-  uint64_t count);
+  uint8_t *buf, uint64_t offset,
+  uint64_t bytes);
 void bdrv_dirty_bitmap_deserialize_part(BdrvDirtyBitmap *bitmap,
-uint8_t *buf, uint64_t start,
-uint64_t count, bool finish);
+uint8_t *buf, uint64_t offset,
+uint64_t bytes, bool finish);
 void bdrv_dirty_bitmap_deserialize_zeroes(BdrvDirtyBitmap *bitmap,
-  uint64_t start, uint64_t count,
+  uint64_t offset, uint64_t bytes,
   bool finish);
 void bdrv_dirty_bitmap_deserialize_ones(BdrvDirtyBitmap *bitmap,
-uint64_t start, uint64_t count,
+uint64_t offset, uint64_t bytes,
 bool finish);
 void bdrv_dirty_bitmap_deserialize_finish(BdrvDirtyBitmap *bitmap);

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 6d32554b4e..555c736024 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -568,42 +568,53 @@ void bdrv_undo_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap *in)
 }

 uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap,
-  uint64_t start, uint64_t count)
+  uint64_t offset, uint64_t bytes)
 {
-return hbitmap_serialization_size(bitmap->bitmap, start, count);
+assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
+return hbitmap_serialization_size(bitmap->bitmap,
+  offset >> BDRV_SECTOR_BITS,
+  bytes >> BDRV_SECTOR_BITS);
 }

 uint64_t bdrv_dirty_bitmap_serialization_align(const BdrvDirtyBitmap *bitmap)
 {
-return hbitmap_serialization_align(bitmap->bitmap);
+return hbitmap_serialization_align(bitmap->bitmap) * BDRV_SECTOR_SIZE;
 }

 void bdrv_dirty_bitmap_serialize_part(const BdrvDirtyBitmap *bitmap,
-  uint8_t *buf, uint64_t start,
-  uint64_t count)
+  uint8_t *buf, uint64_t offset,
+  uint64_t bytes)
 {
-hbitmap_serialize_part(bitmap->bitmap, buf, start, count);
+assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
+hbitmap_serialize_part(bitmap->bitmap, buf, offset >> BDRV_SECTOR_BITS,
+   bytes >> BDRV_SECTOR_BITS);
 }

 void bdrv_dirty_bitmap_deserialize_part(BdrvDirtyBitmap *bitmap,
-uint8_t *buf, uint64_t start,
-uint64_t count, bool finish)
+uint8_t *buf, uint64_t offset,
+uint64_t bytes, bool finish)
 {
-hbitmap_deserialize_p

[Qemu-block] [PATCH v9 06/20] dirty-bitmap: Change bdrv_dirty_bitmap_size() to report bytes

2017-09-19 Thread Eric Blake
We're already reporting bytes for bdrv_dirty_bitmap_granularity();
mixing bytes and sectors in our return values is a recipe for
confusion.  A later cleanup will convert dirty bitmap internals
to be entirely byte-based, but in the meantime, we should report
the bitmap size in bytes.

The only external caller in qcow2-bitmap.c is temporarily more verbose
(because it is still using sector-based math), but will later be
switched to track progress by bytes instead of sectors.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v8: no change
v7: split external from internal change [Kevin], drop R-b
v6: no change
v5: fix bdrv_dirty_bitmap_truncate [John], drop R-b
v4: retitle from "Track size in bytes", rebase to persistent bitmaps,
round up when converting bytes to sectors
v3: no change
v2: tweak commit message, no code change
---
 block/dirty-bitmap.c |  2 +-
 block/qcow2-bitmap.c | 14 --
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index ee164fb518..75af32 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -175,7 +175,7 @@ void bdrv_release_meta_dirty_bitmap(BdrvDirtyBitmap *bitmap)

 int64_t bdrv_dirty_bitmap_size(const BdrvDirtyBitmap *bitmap)
 {
-return bitmap->size;
+return bitmap->size * BDRV_SECTOR_SIZE;
 }

 const char *bdrv_dirty_bitmap_name(const BdrvDirtyBitmap *bitmap)
diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index b3ee4c794a..65122e9ae1 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -295,10 +295,11 @@ static int load_bitmap_data(BlockDriverState *bs,
 BDRVQcow2State *s = bs->opaque;
 uint64_t sector, sbc;
 uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
+uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE);
 uint8_t *buf = NULL;
 uint64_t i, tab_size =
 size_to_clusters(s,
-bdrv_dirty_bitmap_serialization_size(bitmap, 0, bm_size));
+bdrv_dirty_bitmap_serialization_size(bitmap, 0, bm_sectors));

 if (tab_size != bitmap_table_size || tab_size > BME_MAX_TABLE_SIZE) {
 return -EINVAL;
@@ -307,7 +308,7 @@ static int load_bitmap_data(BlockDriverState *bs,
 buf = g_malloc(s->cluster_size);
 sbc = sectors_covered_by_bitmap_cluster(s, bitmap);
 for (i = 0, sector = 0; i < tab_size; ++i, sector += sbc) {
-uint64_t count = MIN(bm_size - sector, sbc);
+uint64_t count = MIN(bm_sectors - sector, sbc);
 uint64_t entry = bitmap_table[i];
 uint64_t offset = entry & BME_TABLE_ENTRY_OFFSET_MASK;

@@ -1077,13 +1078,14 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 int64_t sector;
 uint64_t sbc;
 uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
+uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE);
 const char *bm_name = bdrv_dirty_bitmap_name(bitmap);
 uint8_t *buf = NULL;
 BdrvDirtyBitmapIter *dbi;
 uint64_t *tb;
 uint64_t tb_size =
 size_to_clusters(s,
-bdrv_dirty_bitmap_serialization_size(bitmap, 0, bm_size));
+bdrv_dirty_bitmap_serialization_size(bitmap, 0, bm_sectors));

 if (tb_size > BME_MAX_TABLE_SIZE ||
 tb_size * s->cluster_size > BME_MAX_PHYS_SIZE)
@@ -1101,7 +1103,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 dbi = bdrv_dirty_iter_new(bitmap, 0);
 buf = g_malloc(s->cluster_size);
 sbc = sectors_covered_by_bitmap_cluster(s, bitmap);
-assert(DIV_ROUND_UP(bm_size, sbc) == tb_size);
+assert(DIV_ROUND_UP(bm_sectors, sbc) == tb_size);

 while ((sector = bdrv_dirty_iter_next(dbi)) != -1) {
 uint64_t cluster = sector / sbc;
@@ -1109,7 +,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 int64_t off;

 sector = cluster * sbc;
-end = MIN(bm_size, sector + sbc);
+end = MIN(bm_sectors, sector + sbc);
 write_size =
 bdrv_dirty_bitmap_serialization_size(bitmap, sector, end - sector);
 assert(write_size <= s->cluster_size);
@@ -1141,7 +1143,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 goto fail;
 }

-if (end >= bm_size) {
+if (end >= bm_sectors) {
 break;
 }

-- 
2.13.5




[Qemu-block] [PATCH v9 18/20] qcow2: Switch store_bitmap_data() to byte-based iteration

2017-09-19 Thread Eric Blake
Now that we have adjusted the majority of the calls this function
makes to be byte-based, it is easier to read the code if it makes
passes over the image using bytes rather than sectors.

iotests 165 was rather weak - on a default 64k-cluster image, where
bitmap granularity also defaults to 64k bytes, a single cluster of
the bitmap table thus covers (64*1024*8) bits which each cover 64k
bytes, or 32G of image space.  But the test only uses a 1G image,
so it cannot trigger any more than one loop of the code in
store_bitmap_data(); and it was writing to the first cluster.  In
order to test that we are properly aligning which portions of the
bitmap are being written to the file, we really want to test a case
where the first dirty bit returned by bdrv_dirty_iter_next() is not
aligned to the start of a cluster, which we can do by modifying the
test to write data that doesn't happen to fall in the first cluster
of the image.
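
The coverage arithmetic in the paragraph above can be written out directly (an illustrative helper, not QEMU code): one cluster of the serialized bitmap holds cluster_size * 8 bits, and each bit covers one granularity chunk of the image.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of image space covered by one cluster of the on-disk bitmap:
 * (bits per bitmap cluster) * (bytes covered per bit). */
static uint64_t bytes_covered(uint64_t cluster_size, uint64_t granularity)
{
    return cluster_size * 8 * granularity;
}
```

With the defaults (64k clusters, 64k granularity) this gives 64k * 8 * 64k = 32G, so a 1G test image can never advance past the first bitmap cluster.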

Signed-off-by: Eric Blake 

---
v9: update iotests to show why aligning down is needed [Kevin], R-b dropped
v8: no change
v7: rebase to earlier change, make rounding of offset obvious (no semantic
change, so R-b kept) [Kevin]
v5-v6: no change
v4: new patch
---
 block/qcow2-bitmap.c   | 31 ---
 tests/qemu-iotests/165 |  2 +-
 2 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index 692ce0de88..df957c66d5 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -1072,10 +1072,9 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 {
 int ret;
 BDRVQcow2State *s = bs->opaque;
-int64_t sector;
-uint64_t limit, sbc;
+int64_t offset;
+uint64_t limit;
 uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
-uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE);
 const char *bm_name = bdrv_dirty_bitmap_name(bitmap);
 uint8_t *buf = NULL;
 BdrvDirtyBitmapIter *dbi;
@@ -1100,18 +1099,22 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 dbi = bdrv_dirty_iter_new(bitmap);
 buf = g_malloc(s->cluster_size);
 limit = bytes_covered_by_bitmap_cluster(s, bitmap);
-sbc = limit >> BDRV_SECTOR_BITS;
 assert(DIV_ROUND_UP(bm_size, limit) == tb_size);

-while ((sector = bdrv_dirty_iter_next(dbi) >> BDRV_SECTOR_BITS) >= 0) {
-uint64_t cluster = sector / sbc;
+while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) {
+uint64_t cluster = offset / limit;
 uint64_t end, write_size;
 int64_t off;

-sector = cluster * sbc;
-end = MIN(bm_sectors, sector + sbc);
-write_size = bdrv_dirty_bitmap_serialization_size(bitmap,
-sector * BDRV_SECTOR_SIZE, (end - sector) * BDRV_SECTOR_SIZE);
+/*
+ * We found the first dirty offset, but want to write out the
+ * entire cluster of the bitmap that includes that offset,
+ * including any leading zero bits.
+ */
+offset = QEMU_ALIGN_DOWN(offset, limit);
+end = MIN(bm_size, offset + limit);
+write_size = bdrv_dirty_bitmap_serialization_size(bitmap, offset,
+  end - offset);
 assert(write_size <= s->cluster_size);

 off = qcow2_alloc_clusters(bs, s->cluster_size);
@@ -1123,9 +1126,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 }
 tb[cluster] = off;

-bdrv_dirty_bitmap_serialize_part(bitmap, buf,
- sector * BDRV_SECTOR_SIZE,
- (end - sector) * BDRV_SECTOR_SIZE);
+bdrv_dirty_bitmap_serialize_part(bitmap, buf, offset, end - offset);
 if (write_size < s->cluster_size) {
 memset(buf + write_size, 0, s->cluster_size - write_size);
 }
@@ -1143,11 +1144,11 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 goto fail;
 }

-if (end >= bm_sectors) {
+if (end >= bm_size) {
 break;
 }

-bdrv_set_dirty_iter(dbi, end * BDRV_SECTOR_SIZE);
+bdrv_set_dirty_iter(dbi, end);
 }

 *bitmap_table_size = tb_size;
diff --git a/tests/qemu-iotests/165 b/tests/qemu-iotests/165
index 74d7b79a0b..a3932db3de 100755
--- a/tests/qemu-iotests/165
+++ b/tests/qemu-iotests/165
@@ -27,7 +27,7 @@ disk = os.path.join(iotests.test_dir, 'disk')
 disk_size = 0x4000 # 1G

 # regions for qemu_io: (start, count) in bytes
-regions1 = ((0,0x10),
+regions1 = ((0x0fff00, 0x1),
 (0x20, 0x10))

 regions2 = ((0x1000, 0x2),
-- 
2.13.5




[Qemu-block] [PATCH v9 17/20] qcow2: Switch load_bitmap_data() to byte-based iteration

2017-09-19 Thread Eric Blake
Now that we have adjusted the majority of the calls this function
makes to be byte-based, it is easier to read the code if it makes
passes over the image using bytes rather than sectors.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v5: no change
v4: new patch
---
 block/qcow2-bitmap.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index b09010b1d3..692ce0de88 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -291,9 +291,8 @@ static int load_bitmap_data(BlockDriverState *bs,
 {
 int ret = 0;
 BDRVQcow2State *s = bs->opaque;
-uint64_t sector, limit, sbc;
+uint64_t offset, limit;
 uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
-uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE);
 uint8_t *buf = NULL;
 uint64_t i, tab_size =
 size_to_clusters(s,
@@ -305,32 +304,27 @@ static int load_bitmap_data(BlockDriverState *bs,

 buf = g_malloc(s->cluster_size);
 limit = bytes_covered_by_bitmap_cluster(s, bitmap);
-sbc = limit >> BDRV_SECTOR_BITS;
-for (i = 0, sector = 0; i < tab_size; ++i, sector += sbc) {
-uint64_t count = MIN(bm_sectors - sector, sbc);
+for (i = 0, offset = 0; i < tab_size; ++i, offset += limit) {
+uint64_t count = MIN(bm_size - offset, limit);
 uint64_t entry = bitmap_table[i];
-uint64_t offset = entry & BME_TABLE_ENTRY_OFFSET_MASK;
+uint64_t data_offset = entry & BME_TABLE_ENTRY_OFFSET_MASK;

 assert(check_table_entry(entry, s->cluster_size) == 0);

-if (offset == 0) {
+if (data_offset == 0) {
 if (entry & BME_TABLE_ENTRY_FLAG_ALL_ONES) {
-bdrv_dirty_bitmap_deserialize_ones(bitmap,
-   sector * BDRV_SECTOR_SIZE,
-   count * BDRV_SECTOR_SIZE,
+bdrv_dirty_bitmap_deserialize_ones(bitmap, offset, count,
false);
 } else {
 /* No need to deserialize zeros because the dirty bitmap is
  * already cleared */
 }
 } else {
-ret = bdrv_pread(bs->file, offset, buf, s->cluster_size);
+ret = bdrv_pread(bs->file, data_offset, buf, s->cluster_size);
 if (ret < 0) {
 goto finish;
 }
-bdrv_dirty_bitmap_deserialize_part(bitmap, buf,
-   sector * BDRV_SECTOR_SIZE,
-   count * BDRV_SECTOR_SIZE,
+bdrv_dirty_bitmap_deserialize_part(bitmap, buf, offset, count,
false);
 }
 }
-- 
2.13.5




[Qemu-block] [PATCH v9 12/20] dirty-bitmap: Change bdrv_get_dirty_count() to report bytes

2017-09-19 Thread Eric Blake
Thanks to recent cleanups, all callers were scaling a return value
of sectors into bytes; do the scaling internally instead.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v8: no change, add R-b
v7: fix one more trace caller [Kevin]
v4-v6: no change
v3: no change, add R-b
v2: no change
---
 block/dirty-bitmap.c |  4 ++--
 block/mirror.c   | 16 ++--
 migration/block.c|  2 +-
 3 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index e451916187..8322e23f0d 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -423,7 +423,7 @@ BlockDirtyInfoList *bdrv_query_dirty_bitmaps(BlockDriverState *bs)
 QLIST_FOREACH(bm, &bs->dirty_bitmaps, list) {
 BlockDirtyInfo *info = g_new0(BlockDirtyInfo, 1);
 BlockDirtyInfoList *entry = g_new0(BlockDirtyInfoList, 1);
-info->count = bdrv_get_dirty_count(bm) << BDRV_SECTOR_BITS;
+info->count = bdrv_get_dirty_count(bm);
 info->granularity = bdrv_dirty_bitmap_granularity(bm);
 info->has_name = !!bm->name;
 info->name = g_strdup(bm->name);
@@ -652,7 +652,7 @@ void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *iter, int64_t offset)

 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap)
 {
-return hbitmap_count(bitmap->bitmap);
+return hbitmap_count(bitmap->bitmap) << BDRV_SECTOR_BITS;
 }

 int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap)
diff --git a/block/mirror.c b/block/mirror.c
index 77bf5aa3a4..7113d47db4 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -340,8 +340,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 if (offset < 0) {
 bdrv_set_dirty_iter(s->dbi, 0);
 offset = bdrv_dirty_iter_next(s->dbi);
-trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap) *
-  BDRV_SECTOR_SIZE);
+trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap));
 assert(offset >= 0);
 }
 bdrv_dirty_bitmap_unlock(s->dirty_bitmap);
@@ -811,11 +810,10 @@ static void coroutine_fn mirror_run(void *opaque)

 cnt = bdrv_get_dirty_count(s->dirty_bitmap);
 /* s->common.offset contains the number of bytes already processed so
- * far, cnt is the number of dirty sectors remaining and
+ * far, cnt is the number of dirty bytes remaining and
  * s->bytes_in_flight is the number of bytes currently being
  * processed; together those are the current total operation length */
-s->common.len = s->common.offset + s->bytes_in_flight +
-cnt * BDRV_SECTOR_SIZE;
+s->common.len = s->common.offset + s->bytes_in_flight + cnt;

 /* Note that even when no rate limit is applied we need to yield
  * periodically with no pending I/O so that bdrv_drain_all() returns.
@@ -827,8 +825,7 @@ static void coroutine_fn mirror_run(void *opaque)
 s->common.iostatus == BLOCK_DEVICE_IO_STATUS_OK) {
 if (s->in_flight >= MAX_IN_FLIGHT || s->buf_free_count == 0 ||
 (cnt == 0 && s->in_flight > 0)) {
-trace_mirror_yield(s, cnt * BDRV_SECTOR_SIZE,
-   s->buf_free_count, s->in_flight);
+trace_mirror_yield(s, cnt, s->buf_free_count, s->in_flight);
 mirror_wait_for_io(s);
 continue;
 } else if (cnt != 0) {
@@ -869,7 +866,7 @@ static void coroutine_fn mirror_run(void *opaque)
  * whether to switch to target check one last time if I/O has
  * come in the meanwhile, and if not flush the data to disk.
  */
-trace_mirror_before_drain(s, cnt * BDRV_SECTOR_SIZE);
+trace_mirror_before_drain(s, cnt);

 bdrv_drained_begin(bs);
 cnt = bdrv_get_dirty_count(s->dirty_bitmap);
@@ -888,8 +885,7 @@ static void coroutine_fn mirror_run(void *opaque)
 }

 ret = 0;
-trace_mirror_before_sleep(s, cnt * BDRV_SECTOR_SIZE,
-  s->synced, delay_ns);
+trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
 if (!s->synced) {
 block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, delay_ns);
 if (block_job_is_cancelled(&s->common)) {
diff --git a/migration/block.c b/migration/block.c
index 9171f60028..a3512945da 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -667,7 +667,7 @@ static int64_t get_remaining_dirty(void)
 aio_context_release(blk_get_aio_context(bmds->blk));
 }

-return dirty << BDRV_SECTOR_BITS;
+return dirty;
 }


-- 
2.13.5




[Qemu-block] [PATCH v9 16/20] qcow2: Switch qcow2_measure() to byte-based iteration

2017-09-19 Thread Eric Blake
This is new code, but it is easier to read if it makes passes over
the image using bytes rather than sectors (and will get easier in
the future when bdrv_get_block_status is converted to byte-based).

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v7: tweak constant given to MIN (no semantic change, R-b kept) [Kevin]
v6: separate bug fix to earlier patch
v5: new patch
---
 block/qcow2.c | 22 ++
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index bae5893327..64dcd98a91 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3647,20 +3647,19 @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, BlockDriverState *in_bs,
  */
 required = virtual_size;
 } else {
-int cluster_sectors = cluster_size / BDRV_SECTOR_SIZE;
-int64_t sector_num;
+int64_t offset;
 int pnum = 0;

-for (sector_num = 0;
- sector_num < ssize / BDRV_SECTOR_SIZE;
- sector_num += pnum) {
-int nb_sectors = MIN(ssize / BDRV_SECTOR_SIZE - sector_num,
- BDRV_REQUEST_MAX_SECTORS);
+for (offset = 0; offset < ssize;
+ offset += pnum * BDRV_SECTOR_SIZE) {
+int nb_sectors = MIN(ssize - offset,
+ BDRV_REQUEST_MAX_BYTES) / BDRV_SECTOR_SIZE;
 BlockDriverState *file;
 int64_t ret;

 ret = bdrv_get_block_status_above(in_bs, NULL,
-  sector_num, nb_sectors,
+  offset >> BDRV_SECTOR_BITS,
+  nb_sectors,
   &pnum, &file);
 if (ret < 0) {
 error_setg_errno(&local_err, -ret,
@@ -3673,12 +3672,11 @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, BlockDriverState *in_bs,
 } else if ((ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED)) ==
(BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED)) {
 /* Extend pnum to end of cluster for next iteration */
-pnum = ROUND_UP(sector_num + pnum, cluster_sectors) -
-   sector_num;
+pnum = (ROUND_UP(offset + pnum * BDRV_SECTOR_SIZE,
+ cluster_size) - offset) >> BDRV_SECTOR_BITS;

 /* Count clusters we've seen */
-required += (sector_num % cluster_sectors + pnum) *
-BDRV_SECTOR_SIZE;
+required += offset % cluster_size + pnum * BDRV_SECTOR_SIZE;
 }
 }
 }
-- 
2.13.5




[Qemu-block] [PATCH v9 10/20] dirty-bitmap: Set iterator start by offset, not sector

2017-09-19 Thread Eric Blake
All callers to bdrv_dirty_iter_new() passed 0 for their initial
starting point, drop that parameter.

Most callers to bdrv_set_dirty_iter() were scaling a byte offset to
a sector number; the exception qcow2-bitmap will be converted later
to use byte rather than sector iteration.  Move the scaling to occur
internally to dirty bitmap code instead, so that callers now pass
in bytes.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v5: no change
v4: rebase to persistent bitmaps
v3: no change
v2: no change
---
 include/block/dirty-bitmap.h | 5 ++---
 block/backup.c   | 5 ++---
 block/dirty-bitmap.c | 9 -
 block/mirror.c   | 4 ++--
 block/qcow2-bitmap.c | 4 ++--
 5 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 5f34a1a3c7..757fc4c5b8 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -44,8 +44,7 @@ void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap,
 void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap,
  int64_t cur_sector, int64_t nr_sectors);
 BdrvDirtyBitmapIter *bdrv_dirty_meta_iter_new(BdrvDirtyBitmap *bitmap);
-BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap,
- uint64_t first_sector);
+BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap);
 void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter);

 uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap,
@@ -80,7 +79,7 @@ void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
 void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
 int64_t cur_sector, int64_t nr_sectors);
 int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
-void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num);
+void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t offset);
 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
 int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap);
 void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes);
diff --git a/block/backup.c b/block/backup.c
index 517c300a49..ac9c018717 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -372,7 +372,7 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)

 granularity = bdrv_dirty_bitmap_granularity(job->sync_bitmap);
 clusters_per_iter = MAX((granularity / job->cluster_size), 1);
-dbi = bdrv_dirty_iter_new(job->sync_bitmap, 0);
+dbi = bdrv_dirty_iter_new(job->sync_bitmap);

 /* Find the next dirty sector(s) */
 while ((offset = bdrv_dirty_iter_next(dbi) * BDRV_SECTOR_SIZE) >= 0) {
@@ -403,8 +403,7 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
 /* If the bitmap granularity is smaller than the backup granularity,
  * we need to advance the iterator pointer to the next cluster. */
 if (granularity < job->cluster_size) {
-bdrv_set_dirty_iter(dbi,
-cluster * job->cluster_size / BDRV_SECTOR_SIZE);
+bdrv_set_dirty_iter(dbi, cluster * job->cluster_size);
 }

 last_cluster = cluster - 1;
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 555c736024..84509476ba 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -473,11 +473,10 @@ uint32_t bdrv_dirty_bitmap_granularity(const BdrvDirtyBitmap *bitmap)
 return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->bitmap);
 }

-BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap,
- uint64_t first_sector)
+BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap)
 {
 BdrvDirtyBitmapIter *iter = g_new(BdrvDirtyBitmapIter, 1);
-hbitmap_iter_init(&iter->hbi, bitmap->bitmap, first_sector);
+hbitmap_iter_init(&iter->hbi, bitmap->bitmap, 0);
 iter->bitmap = bitmap;
 bitmap->active_iterators++;
 return iter;
@@ -645,9 +644,9 @@ void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector,
 /**
  * Advance a BdrvDirtyBitmapIter to an arbitrary offset.
  */
-void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *iter, int64_t sector_num)
+void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *iter, int64_t offset)
 {
-hbitmap_iter_init(&iter->hbi, iter->hbi.hb, sector_num);
+hbitmap_iter_init(&iter->hbi, iter->hbi.hb, offset >> BDRV_SECTOR_BITS);
 }

 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap)
diff --git a/block/mirror.c b/block/mirror.c
index 6531652d73..0b063b3c20 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -373,7 +373,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 next_dirty = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
 if (next_dirty > next_offset || next_dirty < 0) {
 /* The bitmap iterator's cache is stale, refresh it */
-   

[Qemu-block] [PATCH v9 07/20] dirty-bitmap: Track bitmap size by bytes

2017-09-19 Thread Eric Blake
We are still using an internal hbitmap that tracks a size in sectors,
with the granularity scaled down accordingly, because it lets us
use a shortcut for our iterators which are currently sector-based.
But there's no reason we can't track the dirty bitmap size in bytes,
since it is (mostly) an internal-only variable (remember, the size
is how many bytes are covered by the bitmap, not how many bytes the
bitmap occupies).  A later cleanup will convert dirty bitmap
internals to be entirely byte-based, eliminating the intermediate
sector rounding added here; and technically, since bdrv_getlength()
already rounds up to sectors, our use of DIV_ROUND_UP is more for
theoretical completeness than for any actual rounding.

Use is_power_of_2() while at it, instead of open-coding that.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v8: rebase to earlier truncate changes, R-b kept
v7: split external from internal [Kevin], drop R-b
v6: no change
v5: fix bdrv_dirty_bitmap_truncate [John], drop R-b
v4: retitle from "Track size in bytes", rebase to persistent bitmaps,
round up when converting bytes to sectors
v3: no change
v2: tweak commit message, no code change
---
 block/dirty-bitmap.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 75af32..6d32554b4e 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -42,7 +42,7 @@ struct BdrvDirtyBitmap {
 HBitmap *meta;  /* Meta dirty bitmap */
 BdrvDirtyBitmap *successor; /* Anonymous child; implies frozen status */
 char *name; /* Optional non-empty unique ID */
-int64_t size;   /* Size of the bitmap (Number of sectors) */
+int64_t size;   /* Size of the bitmap, in bytes */
 bool disabled;  /* Bitmap is disabled. It ignores all writes to
the device */
 int active_iterators;   /* How many iterators are active */
@@ -115,17 +115,14 @@ BdrvDirtyBitmap *bdrv_create_dirty_bitmap(BlockDriverState *bs,
 {
 int64_t bitmap_size;
 BdrvDirtyBitmap *bitmap;
-uint32_t sector_granularity;

-assert((granularity & (granularity - 1)) == 0);
+assert(is_power_of_2(granularity) && granularity >= BDRV_SECTOR_SIZE);

 if (name && bdrv_find_dirty_bitmap(bs, name)) {
 error_setg(errp, "Bitmap already exists: %s", name);
 return NULL;
 }
-sector_granularity = granularity >> BDRV_SECTOR_BITS;
-assert(sector_granularity);
-bitmap_size = bdrv_nb_sectors(bs);
+bitmap_size = bdrv_getlength(bs);
 if (bitmap_size < 0) {
 error_setg_errno(errp, -bitmap_size, "could not get length of device");
 errno = -bitmap_size;
@@ -133,7 +130,12 @@ BdrvDirtyBitmap *bdrv_create_dirty_bitmap(BlockDriverState *bs,
 }
 bitmap = g_new0(BdrvDirtyBitmap, 1);
 bitmap->mutex = &bs->dirty_bitmap_mutex;
-bitmap->bitmap = hbitmap_alloc(bitmap_size, ctz32(sector_granularity));
+/*
+ * TODO - let hbitmap track full granularity. For now, it is tracking
+ * only sector granularity, as a shortcut for our iterators.
+ */
+bitmap->bitmap = hbitmap_alloc(DIV_ROUND_UP(bitmap_size, BDRV_SECTOR_SIZE),
+   ctz32(granularity) - BDRV_SECTOR_BITS);
 bitmap->size = bitmap_size;
 bitmap->name = g_strdup(name);
 bitmap->disabled = false;
@@ -175,7 +177,7 @@ void bdrv_release_meta_dirty_bitmap(BdrvDirtyBitmap *bitmap)

 int64_t bdrv_dirty_bitmap_size(const BdrvDirtyBitmap *bitmap)
 {
-return bitmap->size * BDRV_SECTOR_SIZE;
+return bitmap->size;
 }

 const char *bdrv_dirty_bitmap_name(const BdrvDirtyBitmap *bitmap)
@@ -305,14 +307,13 @@ BdrvDirtyBitmap *bdrv_reclaim_dirty_bitmap(BlockDriverState *bs,
 void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes)
 {
 BdrvDirtyBitmap *bitmap;
-int64_t size = DIV_ROUND_UP(bytes, BDRV_SECTOR_SIZE);

 bdrv_dirty_bitmaps_lock(bs);
 QLIST_FOREACH(bitmap, &bs->dirty_bitmaps, list) {
 assert(!bdrv_dirty_bitmap_frozen(bitmap));
 assert(!bitmap->active_iterators);
-hbitmap_truncate(bitmap->bitmap, size);
-bitmap->size = size;
+hbitmap_truncate(bitmap->bitmap, DIV_ROUND_UP(bytes, BDRV_SECTOR_SIZE));
+bitmap->size = bytes;
 }
 bdrv_dirty_bitmaps_unlock(bs);
 }
@@ -549,7 +550,8 @@ void bdrv_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap **out)
 hbitmap_reset_all(bitmap->bitmap);
 } else {
 HBitmap *backup = bitmap->bitmap;
-bitmap->bitmap = hbitmap_alloc(bitmap->size,
+bitmap->bitmap = hbitmap_alloc(DIV_ROUND_UP(bitmap->size,
+BDRV_SECTOR_SIZE),
hbitmap_granularity(backup));
 *out = backup;
 }
-- 
2.13.5




[Qemu-block] [PATCH v9 01/20] block: Make bdrv_img_create() size selection easier to read

2017-09-19 Thread Eric Blake
All callers of bdrv_img_create() pass in a size, or -1 to read the
size from the backing file.  We then set that size as the QemuOpt
default, which means we will reuse that default rather than the
final parameter to qemu_opt_get_size() several lines later.  But
it is rather confusing to read subsequent checks of 'size == -1'
when it looks (without seeing the full context) like size defaults
to 0; it also doesn't help that a size of 0 is valid (for some
formats).

Rework the logic to make things more legible.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v6: Combine into a series rather than being a standalone patch (more for
ease of tracking than for being on topic)
---
 block.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 6dd47e414e..ee6a48976e 100644
--- a/block.c
+++ b/block.c
@@ -4393,7 +4393,7 @@ void bdrv_img_create(const char *filename, const char *fmt,

 /* The size for the image must always be specified, unless we have a backing
  * file and we have not been forbidden from opening it. */
-size = qemu_opt_get_size(opts, BLOCK_OPT_SIZE, 0);
+size = qemu_opt_get_size(opts, BLOCK_OPT_SIZE, img_size);
 if (backing_file && !(flags & BDRV_O_NO_BACKING)) {
 BlockDriverState *bs;
 char *full_backing = g_new0(char, PATH_MAX);
-- 
2.13.5




[Qemu-block] [PATCH v9 00/20] make dirty-bitmap byte-based

2017-09-19 Thread Eric Blake
There are patches floating around to add NBD_CMD_BLOCK_STATUS,
but NBD wants to report status on byte granularity (even if the
reporting will probably be naturally aligned to sectors or even
much higher levels).  I've therefore started the task of
converting our block status code to report at a byte granularity
rather than sectors.

Now that 2.11 is open, I'm rebasing/reposting the remaining patches.

The overall conversion currently looks like:
part 1: bdrv_is_allocated (merged in 2.10, commit 51b0a488)
part 2: dirty-bitmap (this series, v8 was here [1])
part 3: bdrv_get_block_status (v4 is posted [2] and is mostly reviewed)
part 4: .bdrv_co_block_status (v3 is posted [3], but needs review)

Available as a tag at:
git fetch git://repo.or.cz/qemu/ericb.git nbd-byte-dirty-v8

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg04733.html
[2] https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg03543.html
[3] https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg03812.html

Since v8:
- add R-b where appropriate
- patch 5 simplified again [Fam]
- patch 18: add a comment, and enhance iotests 165 to test that aspect [Kevin]

001/20:[] [--] 'block: Make bdrv_img_create() size selection easier to read'
002/20:[] [--] 'hbitmap: Rename serialization_granularity to serialization_align'
003/20:[] [--] 'qcow2: Ensure bitmap serialization is aligned'
004/20:[] [--] 'dirty-bitmap: Drop unused functions'
005/20:[0004] [FC] 'dirty-bitmap: Avoid size query failure during truncate'
006/20:[] [--] 'dirty-bitmap: Change bdrv_dirty_bitmap_size() to report bytes'
007/20:[] [--] 'dirty-bitmap: Track bitmap size by bytes'
008/20:[] [--] 'dirty-bitmap: Change bdrv_dirty_bitmap_*serialize*() to take bytes'
009/20:[] [--] 'qcow2: Switch sectors_covered_by_bitmap_cluster() to byte-based'
010/20:[] [--] 'dirty-bitmap: Set iterator start by offset, not sector'
011/20:[] [--] 'dirty-bitmap: Change bdrv_dirty_iter_next() to report byte offset'
012/20:[] [--] 'dirty-bitmap: Change bdrv_get_dirty_count() to report bytes'
013/20:[] [--] 'dirty-bitmap: Change bdrv_get_dirty_locked() to take bytes'
014/20:[] [--] 'dirty-bitmap: Change bdrv_[re]set_dirty_bitmap() to use bytes'
015/20:[] [--] 'mirror: Switch mirror_dirty_init() to byte-based iteration'
016/20:[] [--] 'qcow2: Switch qcow2_measure() to byte-based iteration'
017/20:[] [--] 'qcow2: Switch load_bitmap_data() to byte-based iteration'
018/20:[0007] [FC] 'qcow2: Switch store_bitmap_data() to byte-based iteration'
019/20:[] [--] 'dirty-bitmap: Switch bdrv_set_dirty() to bytes'
020/20:[] [--] 'dirty-bitmap: Convert internal hbitmap size/granularity'

Eric Blake (20):
  block: Make bdrv_img_create() size selection easier to read
  hbitmap: Rename serialization_granularity to serialization_align
  qcow2: Ensure bitmap serialization is aligned
  dirty-bitmap: Drop unused functions
  dirty-bitmap: Avoid size query failure during truncate
  dirty-bitmap: Change bdrv_dirty_bitmap_size() to report bytes
  dirty-bitmap: Track bitmap size by bytes
  dirty-bitmap: Change bdrv_dirty_bitmap_*serialize*() to take bytes
  qcow2: Switch sectors_covered_by_bitmap_cluster() to byte-based
  dirty-bitmap: Set iterator start by offset, not sector
  dirty-bitmap: Change bdrv_dirty_iter_next() to report byte offset
  dirty-bitmap: Change bdrv_get_dirty_count() to report bytes
  dirty-bitmap: Change bdrv_get_dirty_locked() to take bytes
  dirty-bitmap: Change bdrv_[re]set_dirty_bitmap() to use bytes
  mirror: Switch mirror_dirty_init() to byte-based iteration
  qcow2: Switch qcow2_measure() to byte-based iteration
  qcow2: Switch load_bitmap_data() to byte-based iteration
  qcow2: Switch store_bitmap_data() to byte-based iteration
  dirty-bitmap: Switch bdrv_set_dirty() to bytes
  dirty-bitmap: Convert internal hbitmap size/granularity

 include/block/block_int.h|   2 +-
 include/block/dirty-bitmap.h |  43 ++
 include/qemu/hbitmap.h   |   8 +--
 block/io.c   |   6 +-
 block.c  |  17 --
 block/backup.c   |   7 +--
 block/dirty-bitmap.c | 134 ++-
 block/mirror.c   |  79 +++--
 block/qcow2-bitmap.c |  62 +++-
 block/qcow2.c|  22 ---
 migration/block.c|  12 ++--
 tests/test-hbitmap.c |  10 ++--
 util/hbitmap.c   |   8 +--
 tests/qemu-iotests/165   |   2 +-
 14 files changed, 173 insertions(+), 239 deletions(-)

-- 
2.13.5




[Qemu-block] [PATCH v9 05/20] dirty-bitmap: Avoid size query failure during truncate

2017-09-19 Thread Eric Blake
We've previously fixed several places where we failed to account
for possible errors from bdrv_nb_sectors().  Fix another one by
making bdrv_dirty_bitmap_truncate() take the new size from the
caller instead of querying itself; then adjust the sole caller
bdrv_truncate() to pass the size just determined by a successful
resize, or to skip the bitmap resize on failure, thus avoiding
sizing the bitmaps to -1.

Signed-off-by: Eric Blake 

---
v9: skip only bdrv_dirty_bitmap_truncate on error [Fam]
v8: retitle and rework to avoid possibility of secondary failure [John]
v7: new patch [Kevin]
---
 include/block/dirty-bitmap.h |  2 +-
 block.c  | 15 ++-
 block/dirty-bitmap.c |  6 +++---
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 8fd842eac9..7a27590047 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -83,7 +83,7 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
 void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num);
 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
 int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap);
-void bdrv_dirty_bitmap_truncate(BlockDriverState *bs);
+void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes);
 bool bdrv_dirty_bitmap_readonly(const BdrvDirtyBitmap *bitmap);
 bool bdrv_has_readonly_bitmaps(BlockDriverState *bs);
 bool bdrv_dirty_bitmap_get_autoload(const BdrvDirtyBitmap *bitmap);
diff --git a/block.c b/block.c
index ee6a48976e..89261a7a53 100644
--- a/block.c
+++ b/block.c
@@ -3450,12 +3450,17 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, PreallocMode prealloc,
 assert(!(bs->open_flags & BDRV_O_INACTIVE));

 ret = drv->bdrv_truncate(bs, offset, prealloc, errp);
-if (ret == 0) {
-ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
-bdrv_dirty_bitmap_truncate(bs);
-bdrv_parent_cb_resize(bs);
-atomic_inc(&bs->write_gen);
+if (ret < 0) {
+return ret;
 }
+ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "Could not refresh total sector count");
+} else {
+bdrv_dirty_bitmap_truncate(bs, bs->total_sectors * BDRV_SECTOR_SIZE);
+}
+bdrv_parent_cb_resize(bs);
+atomic_inc(&bs->write_gen);
 return ret;
 }

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 42a55e4a4b..ee164fb518 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -1,7 +1,7 @@
 /*
  * Block Dirty Bitmap
  *
- * Copyright (c) 2016 Red Hat. Inc
+ * Copyright (c) 2016-2017 Red Hat. Inc
  *
  * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
@@ -302,10 +302,10 @@ BdrvDirtyBitmap *bdrv_reclaim_dirty_bitmap(BlockDriverState *bs,
  * Truncates _all_ bitmaps attached to a BDS.
  * Called with BQL taken.
  */
-void bdrv_dirty_bitmap_truncate(BlockDriverState *bs)
+void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes)
 {
 BdrvDirtyBitmap *bitmap;
-uint64_t size = bdrv_nb_sectors(bs);
+int64_t size = DIV_ROUND_UP(bytes, BDRV_SECTOR_SIZE);

 bdrv_dirty_bitmaps_lock(bs);
 QLIST_FOREACH(bitmap, &bs->dirty_bitmaps, list) {
-- 
2.13.5




[Qemu-block] [PATCH v9 03/20] qcow2: Ensure bitmap serialization is aligned

2017-09-19 Thread Eric Blake
When subdividing a bitmap serialization, the code in hbitmap.c
enforces that start/count parameters are aligned (except that
count can end early at end-of-bitmap).  We exposed this required
alignment through bdrv_dirty_bitmap_serialization_align(), but
forgot to actually check that we comply with it.

Fortunately, qcow2 is never dividing bitmap serialization smaller
than one cluster (which is a minimum of 512 bytes); so we are
always compliant with the serialization alignment (which insists
that we partition at least 64 bits per chunk) because we are doing
at least 4k bits per chunk.

Still, it's safer to add an assertion (for the unlikely case that
we'd ever support a cluster smaller than 512 bytes, or if the
hbitmap implementation changes what it considers to be aligned),
rather than leaving bdrv_dirty_bitmap_serialization_align()
without a caller.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v5: no change
v4: new patch
---
 block/qcow2-bitmap.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index e8d3bdbd6e..b3ee4c794a 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -274,10 +274,13 @@ static int free_bitmap_clusters(BlockDriverState *bs, Qcow2BitmapTable *tb)
 static uint64_t sectors_covered_by_bitmap_cluster(const BDRVQcow2State *s,
  const BdrvDirtyBitmap *bitmap)
 {
-uint32_t sector_granularity =
+uint64_t sector_granularity =
 bdrv_dirty_bitmap_granularity(bitmap) >> BDRV_SECTOR_BITS;
+uint64_t sbc = sector_granularity * (s->cluster_size << 3);

-return (uint64_t)sector_granularity * (s->cluster_size << 3);
+assert(QEMU_IS_ALIGNED(sbc,
+   bdrv_dirty_bitmap_serialization_align(bitmap)));
+return sbc;
 }

 /* load_bitmap_data
-- 
2.13.5




[Qemu-block] [PATCH v9 04/20] dirty-bitmap: Drop unused functions

2017-09-19 Thread Eric Blake
We had several functions that no one is currently using, and which
use sector-based interfaces.  I'm trying to convert towards byte-based
interfaces, so it's easier to just drop the unused functions:

bdrv_dirty_bitmap_get_meta
bdrv_dirty_bitmap_get_meta_locked
bdrv_dirty_bitmap_reset_meta
bdrv_dirty_bitmap_meta_granularity

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v5: no change
v4: rebase to Vladimir's persistent bitmaps (bdrv_dirty_bitmap_size now
in use), dropped R-b
v3: rebase to upstream changes (bdrv_dirty_bitmap_get_meta_locked was
added in b64bd51e with no clients), kept R-b
v2: tweak commit message based on review, no code change
---
 include/block/dirty-bitmap.h | 10 --
 block/dirty-bitmap.c | 44 
 2 files changed, 54 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index a79a58d2c3..8fd842eac9 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -34,7 +34,6 @@ void bdrv_enable_dirty_bitmap(BdrvDirtyBitmap *bitmap);
 BlockDirtyInfoList *bdrv_query_dirty_bitmaps(BlockDriverState *bs);
 uint32_t bdrv_get_default_bitmap_granularity(BlockDriverState *bs);
 uint32_t bdrv_dirty_bitmap_granularity(const BdrvDirtyBitmap *bitmap);
-uint32_t bdrv_dirty_bitmap_meta_granularity(BdrvDirtyBitmap *bitmap);
 bool bdrv_dirty_bitmap_enabled(BdrvDirtyBitmap *bitmap);
 bool bdrv_dirty_bitmap_frozen(BdrvDirtyBitmap *bitmap);
 const char *bdrv_dirty_bitmap_name(const BdrvDirtyBitmap *bitmap);
@@ -44,15 +43,6 @@ void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap,
int64_t cur_sector, int64_t nr_sectors);
 void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap,
  int64_t cur_sector, int64_t nr_sectors);
-int bdrv_dirty_bitmap_get_meta(BlockDriverState *bs,
-   BdrvDirtyBitmap *bitmap, int64_t sector,
-   int nb_sectors);
-int bdrv_dirty_bitmap_get_meta_locked(BlockDriverState *bs,
-  BdrvDirtyBitmap *bitmap, int64_t sector,
-  int nb_sectors);
-void bdrv_dirty_bitmap_reset_meta(BlockDriverState *bs,
-  BdrvDirtyBitmap *bitmap, int64_t sector,
-  int nb_sectors);
 BdrvDirtyBitmapIter *bdrv_dirty_meta_iter_new(BdrvDirtyBitmap *bitmap);
 BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap,
  uint64_t first_sector);
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 0490ca3aff..42a55e4a4b 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -173,45 +173,6 @@ void bdrv_release_meta_dirty_bitmap(BdrvDirtyBitmap *bitmap)
 qemu_mutex_unlock(bitmap->mutex);
 }

-int bdrv_dirty_bitmap_get_meta_locked(BlockDriverState *bs,
-  BdrvDirtyBitmap *bitmap, int64_t sector,
-  int nb_sectors)
-{
-uint64_t i;
-int sectors_per_bit = 1 << hbitmap_granularity(bitmap->meta);
-
-/* To optimize: we can make hbitmap to internally check the range in a
- * coarse level, or at least do it word by word. */
-for (i = sector; i < sector + nb_sectors; i += sectors_per_bit) {
-if (hbitmap_get(bitmap->meta, i)) {
-return true;
-}
-}
-return false;
-}
-
-int bdrv_dirty_bitmap_get_meta(BlockDriverState *bs,
-   BdrvDirtyBitmap *bitmap, int64_t sector,
-   int nb_sectors)
-{
-bool dirty;
-
-qemu_mutex_lock(bitmap->mutex);
-dirty = bdrv_dirty_bitmap_get_meta_locked(bs, bitmap, sector, nb_sectors);
-qemu_mutex_unlock(bitmap->mutex);
-
-return dirty;
-}
-
-void bdrv_dirty_bitmap_reset_meta(BlockDriverState *bs,
-  BdrvDirtyBitmap *bitmap, int64_t sector,
-  int nb_sectors)
-{
-qemu_mutex_lock(bitmap->mutex);
-hbitmap_reset(bitmap->meta, sector, nb_sectors);
-qemu_mutex_unlock(bitmap->mutex);
-}
-
 int64_t bdrv_dirty_bitmap_size(const BdrvDirtyBitmap *bitmap)
 {
 return bitmap->size;
@@ -511,11 +472,6 @@ uint32_t bdrv_dirty_bitmap_granularity(const BdrvDirtyBitmap *bitmap)
 return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->bitmap);
 }

-uint32_t bdrv_dirty_bitmap_meta_granularity(BdrvDirtyBitmap *bitmap)
-{
-return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->meta);
-}
-
 BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap,
  uint64_t first_sector)
 {
-- 
2.13.5




[Qemu-block] [PATCH v9 02/20] hbitmap: Rename serialization_granularity to serialization_align

2017-09-19 Thread Eric Blake
The only client of hbitmap_serialization_granularity() is dirty-bitmap's
bdrv_dirty_bitmap_serialization_align().  Keeping the two names consistent
is worthwhile, and the shorter name is more representative of what the
function returns (the required alignment to be used for start/count of
other serialization functions, where violating the alignment causes
assertion failures).

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Kevin Wolf 
Reviewed-by: Fam Zheng 

---
v5: no change
v4: new patch
---
 include/qemu/hbitmap.h |  8 
 block/dirty-bitmap.c   |  2 +-
 tests/test-hbitmap.c   | 10 +-
 util/hbitmap.c |  8 
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index d3a74a21fc..81e78043d1 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -159,16 +159,16 @@ bool hbitmap_get(const HBitmap *hb, uint64_t item);
 bool hbitmap_is_serializable(const HBitmap *hb);

 /**
- * hbitmap_serialization_granularity:
+ * hbitmap_serialization_align:
  * @hb: HBitmap to operate on.
  *
- * Granularity of serialization chunks, used by other serialization functions.
- * For every chunk:
+ * Required alignment of serialization chunks, used by other serialization
+ * functions. For every chunk:
  * 1. Chunk start should be aligned to this granularity.
  * 2. Chunk size should be aligned too, except for last chunk (for which
  *  start + count == hb->size)
  */
-uint64_t hbitmap_serialization_granularity(const HBitmap *hb);
+uint64_t hbitmap_serialization_align(const HBitmap *hb);

 /**
  * hbitmap_serialization_size:
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 30462d4f9a..0490ca3aff 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -617,7 +617,7 @@ uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap,

 uint64_t bdrv_dirty_bitmap_serialization_align(const BdrvDirtyBitmap *bitmap)
 {
-return hbitmap_serialization_granularity(bitmap->bitmap);
+return hbitmap_serialization_align(bitmap->bitmap);
 }

 void bdrv_dirty_bitmap_serialize_part(const BdrvDirtyBitmap *bitmap,
diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
index 1acb353889..af41642346 100644
--- a/tests/test-hbitmap.c
+++ b/tests/test-hbitmap.c
@@ -738,15 +738,15 @@ static void test_hbitmap_meta_one(TestHBitmapData *data, const void *unused)
 }
 }

-static void test_hbitmap_serialize_granularity(TestHBitmapData *data,
-   const void *unused)
+static void test_hbitmap_serialize_align(TestHBitmapData *data,
+ const void *unused)
 {
 int r;

 hbitmap_test_init(data, L3 * 2, 3);
 g_assert(hbitmap_is_serializable(data->hb));

-r = hbitmap_serialization_granularity(data->hb);
+r = hbitmap_serialization_align(data->hb);
 g_assert_cmpint(r, ==, 64 << 3);
 }

@@ -974,8 +974,8 @@ int main(int argc, char **argv)
 hbitmap_test_add("/hbitmap/meta/word", test_hbitmap_meta_word);
 hbitmap_test_add("/hbitmap/meta/sector", test_hbitmap_meta_sector);

-hbitmap_test_add("/hbitmap/serialize/granularity",
- test_hbitmap_serialize_granularity);
+hbitmap_test_add("/hbitmap/serialize/align",
+ test_hbitmap_serialize_align);
 hbitmap_test_add("/hbitmap/serialize/basic",
  test_hbitmap_serialize_basic);
 hbitmap_test_add("/hbitmap/serialize/part",
diff --git a/util/hbitmap.c b/util/hbitmap.c
index 21535cc90b..2f9d0fdbd0 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -413,14 +413,14 @@ bool hbitmap_is_serializable(const HBitmap *hb)
 {
 /* Every serialized chunk must be aligned to 64 bits so that endianness
  * requirements can be fulfilled on both 64 bit and 32 bit hosts.
- * We have hbitmap_serialization_granularity() which converts this
+ * We have hbitmap_serialization_align() which converts this
  * alignment requirement from bitmap bits to items covered (e.g. sectors).
  * That value is:
  *64 << hb->granularity
  * Since this value must not exceed UINT64_MAX, hb->granularity must be
  * less than 58 (== 64 - 6, where 6 is ld(64), i.e. 1 << 6 == 64).
  *
- * In order for hbitmap_serialization_granularity() to always return a
+ * In order for hbitmap_serialization_align() to always return a
  * meaningful value, bitmaps that are to be serialized must have a
  * granularity of less than 58. */

@@ -437,7 +437,7 @@ bool hbitmap_get(const HBitmap *hb, uint64_t item)
 return (hb->levels[HBITMAP_LEVELS - 1][pos >> BITS_PER_LEVEL] & bit) != 0;
 }

-uint64_t hbitmap_serialization_granularity(const HBitmap *hb)
+uint64_t hbitmap_serialization_align(const HBitmap *hb)
 {
 assert(hbitmap_is_serializable(hb));

@@ -454,7 +454,7 @@ static void serialization_chunk(const HBitmap *hb,
 unsigned long **first_el

Re: [Qemu-block] [PATCH v8 18/20] qcow2: Switch store_bitmap_data() to byte-based iteration

2017-09-19 Thread Eric Blake
On 09/19/2017 07:44 AM, Kevin Wolf wrote:
> Am 18.09.2017 um 20:58 hat Eric Blake geschrieben:
>> Now that we have adjusted the majority of the calls this function
>> makes to be byte-based, it is easier to read the code if it makes
>> passes over the image using bytes rather than sectors.
>>
>> Signed-off-by: Eric Blake 
>> Reviewed-by: John Snow 
>> Reviewed-by: Vladimir Sementsov-Ogievskiy 

>> @@ -1100,18 +1099,17 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
>>  dbi = bdrv_dirty_iter_new(bitmap);
>>  buf = g_malloc(s->cluster_size);
>>  limit = bytes_covered_by_bitmap_cluster(s, bitmap);

Limit is huge.  For ease of math, let's consider an image with 512-byte
clusters, and an explicit bitmap granularity of 512 bytes.  One cluster
(512 bytes, or 4k bits) of the bitmap covers 4k multiples of the
granularity, or 2M of image; and the minimum serialization alignment of
64 bits covers 32k bytes of image.  bytes_covered_by_bitmap_cluster()
computes granularity (512 bytes per bit), limit (2M bytes per cluster of
bits), and checks that 2M is indeed aligned to
bdrv_dirty_bitmap_serialization_align() (which is 32k).

Next, an image with default 64k clusters and a default bitmap
granularity of 64k bytes; one cluster (64k bytes, or 512k bits) now
covers 512k multiples of the granularity, or 32G of image size.

However...

>> -sbc = limit >> BDRV_SECTOR_BITS;
>>  assert(DIV_ROUND_UP(bm_size, limit) == tb_size);
>>
>> -while ((sector = bdrv_dirty_iter_next(dbi) >> BDRV_SECTOR_BITS) >= 0) {
>> -uint64_t cluster = sector / sbc;
>> +while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) {
>> +uint64_t cluster = offset / limit;

bdrv_dirty_iter_next() returns the next dirty bit (which is not
necessarily the first bit in the cluster).  For the purposes of
serialization, we want to serialize the entire cluster in one go, even
though we will be serializing 0 bits up until the first dirty bit.  So
offset at this point may be unaligned,

>>  uint64_t end, write_size;
>>  int64_t off;
>>
>> -sector = cluster * sbc;
>> -end = MIN(bm_sectors, sector + sbc);
>> -write_size = bdrv_dirty_bitmap_serialization_size(bitmap,
>> -sector * BDRV_SECTOR_SIZE, (end - sector) * BDRV_SECTOR_SIZE);
>> +offset = QEMU_ALIGN_DOWN(offset, limit);

and we round it down to the start of the cluster that we will actually
be serializing.

> 
> With the question that I asked in v6, apart from changing the spelling
> to be more explicit, I actually hoped that you would explain why
> aligning down is the right thing to do.
> 
> It looks plausible to me that we can do this in correct code because we
> don't support granularities < 512 bytes (a qemu restriction that is
> written down as a note in the qcow2 spec).

It's not granularities < 512 that we have to worry about, but dirty bits
with an offset < 4M (because we are serializing an entire cluster of
bitmap data, where the first dirty offset is not necessarily aligned to
a boundary that large).

> 
> The part that I'm missing yet is why we need to do it. The bitmap
> granularity is also the granularity of bdrv_dirty_iter_next(), so isn't
> offset already aligned and we could even assert that instead of aligning
> down? (As long we enforce our restriction, which we seem to do in
> bitmap_list_load().)

Sadly, a quick:

diff --git i/block/qcow2-bitmap.c w/block/qcow2-bitmap.c
index 302fffd6e1..160f3b99f9 100644
--- i/block/qcow2-bitmap.c
+++ w/block/qcow2-bitmap.c
@@ -1106,6 +1106,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 uint64_t end, write_size;
 int64_t off;

+assert(QEMU_IS_ALIGNED(offset, limit));
 offset = QEMU_ALIGN_DOWN(offset, limit);
 end = MIN(bm_size, offset + limit);
 write_size = bdrv_dirty_bitmap_serialization_size(bitmap, offset,

does NOT fail iotests 165, which appears to be the only test that
actually hammers on qcow2 bitmaps (changing it to an 'assert(false)'
only shows an effect on 165) - which means our test is NOT exercising
all possible alignments.  And it's python-based, with lame output, which
makes debugging it painful.  But never fear, for v9 I will improve the
test to actually affect the bitmap at a point that would fail with my
temporary assertion in place, and thus proving that we DO need to align
down.  Note that test 165 is testing only a 1G image, but I just showed
that 64k clusters with 64k granularity covers up to 32G of image space
in one cluster of the bitmap, so the test is only covering one cluster
of serialization in the first place, and as written, the test is
dirtying byte 0, which explains why it happens to get an offset aligned
to limit, even though that is not a valid assertion.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [Qemu-devel] [PATCH] block: Clean up some bad code in the vvfat driver

2017-09-19 Thread Thomas Huth
On 19.09.2017 21:01, John Snow wrote:
> 
> 
> On 09/19/2017 04:06 AM, Paolo Bonzini wrote:
>> On 13/09/2017 21:08, John Snow wrote:
>>>
>>>
>>> On 09/13/2017 06:21 AM, Thomas Huth wrote:
 Remove the unnecessary home-grown redefinition of the assert() macro here,
 and remove the unusable debug code at the end of the checkpoint() function.
 The code there uses assert() with side-effects (assignment to the "mapping"
 variable), which should be avoided. Looking more closely, it seems that it is
 apparently also only usable for one certain directory layout (with a file
 named USB.H in it) and thus is of no use for the rest of the world.

 Signed-off-by: Thomas Huth 
>>>
>>> Farewell, bitrot code.
>>>
>>> Reviewed-by: John Snow 
>>>
>>> Out of curiosity, I wonder ...
>>>
>>> jhuston@probe (foobar) ~/s/qemu> git grep '#if 0' | wc -l
>>> 320
>>
>>
>> $ git grep -c '#if 0' | sort -k2 --field-separator=: -n
>> ...
>> hw/net/eepro100.c:21
>> target/ppc/cpu-models.h:76
>>
>> whoa :)
>>
> 
> Wonder if '#if 0' should be against the style guide / in checkpatch.

checkpatch already complains if you have a "#if 0" in your patch, so I
think we should be pretty fine here already - but of course you can
still add a paragraph to the CODING_STYLE if you like.

 Thomas



Re: [Qemu-block] ping Re: [Qemu-devel] [PATCH v7 03/16] migration: split common postcopy out of ram postcopy

2017-09-19 Thread Dr. David Alan Gilbert
* Vladimir Sementsov-Ogievskiy (vsement...@virtuozzo.com) wrote:
> ping for 1-3
> Can we merge them?

I see all of them have R-b's; so lets try and put them in the next
migration merge.

Quintela: Sound good?

Dave

> 22.08.2017 02:34, John Snow wrote:
> > 
> > On 07/11/2017 09:38 AM, Vladimir Sementsov-Ogievskiy wrote:
> > > 11.07.2017 16:06, Dr. David Alan Gilbert wrote:
> > > > * Vladimir Sementsov-Ogievskiy (vsement...@virtuozzo.com) wrote:
> > > > > Split common postcopy stuff from ram postcopy stuff.
> > > > > 
> > > > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > > I'm OK with this; I'm not sure I'd have bothered making the ping's
> > > > dependent on it being RAM.
> > > > 
> > > > (These first few are pretty much a separable series).
> > > It would be great if you (or Juan?) could take them separately.
> > > 
> > Yes please. I don't think this ever happened, did it? Can we split off
> > 1-3 and re-roll?
> 
> 
> 
> -- 
> Best regards,
> Vladimir
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-block] [Qemu-devel] Dead code in cpu-models.h

2017-09-19 Thread Thomas Huth
On 19.09.2017 20:54, John Snow wrote:
> 
> 
> On 09/19/2017 04:14 AM, Thomas Huth wrote:
>> On 19.09.2017 10:06, Paolo Bonzini wrote:
>>> On 13/09/2017 21:08, John Snow wrote:
>> [...]
 Farewell, bitrot code.

 Reviewed-by: John Snow 

 Out of curiosity, I wonder ...

 jhuston@probe (foobar) ~/s/qemu> git grep '#if 0' | wc -l
 320
>>>
>>> $ git grep -c '#if 0' | sort -k2 --field-separator=: -n
>>> ...
>>> hw/net/eepro100.c:21
>>> target/ppc/cpu-models.h:76
>>>
>>> whoa :)
>>
>> Igor recently already removed the dead definitions from cpu-models.c :
>>
>> https://git.qemu.org/?p=qemu.git;a=commitdiff;h=aef779605779579afbafff
>>
>> I guess we could now remove the corresponding dead definitions from the
>> header, too...
>>
>> Any volunteers?
>>
>>  Thomas
> 
> Well, I can just take a chainsaw to it blindly if you want to critique it.
> 

I think it should at least be aligned with the changes that have been
done to cpu-models.c (e.g. the definition for CPU_POWERPC_401B3 has been
removed from the .c file, so CPU_POWERPC_401B3 could be removed from
the .h file, too).

 Thomas



Re: [Qemu-block] [Qemu-devel] [PATCH] block: Clean up some bad code in the vvfat driver

2017-09-19 Thread John Snow


On 09/19/2017 04:06 AM, Paolo Bonzini wrote:
> On 13/09/2017 21:08, John Snow wrote:
>>
>>
>> On 09/13/2017 06:21 AM, Thomas Huth wrote:
>>> Remove the unnecessary home-grown redefinition of the assert() macro here,
>>> and remove the unusable debug code at the end of the checkpoint() function.
>>> The code there uses assert() with side-effects (assignment to the "mapping"
>>> variable), which should be avoided. Looking more closely, it seems that it is
>>> apparently only usable for one particular directory layout (with a file
>>> named USB.H in it) and thus is of no use to the rest of the world.
>>>
>>> Signed-off-by: Thomas Huth 
>>
>> Farewell, bitrot code.
>>
>> Reviewed-by: John Snow 
>>
>> Out of curiosity, I wonder ...
>>
>> jhuston@probe (foobar) ~/s/qemu> git grep '#if 0' | wc -l
>> 320
> 
> 
> $ git grep -c '#if 0' | sort -k2 --field-separator=: -n
> ...
> hw/net/eepro100.c:21
> target/ppc/cpu-models.h:76
> 
> whoa :)
> 

Wonder if '#if 0' should be against the style guide / in checkpatch.
Conditional compilations should at least be contingent on some named
variable/condition.

(Probably super ideally all conditional compilations correlate directly
to a configure variable...)

--js



Re: [Qemu-block] [Qemu-devel] Dead code in cpu-models.h

2017-09-19 Thread John Snow


On 09/19/2017 04:14 AM, Thomas Huth wrote:
> On 19.09.2017 10:06, Paolo Bonzini wrote:
>> On 13/09/2017 21:08, John Snow wrote:
> [...]
>>> Farewell, bitrot code.
>>>
>>> Reviewed-by: John Snow 
>>>
>>> Out of curiosity, I wonder ...
>>>
>>> jhuston@probe (foobar) ~/s/qemu> git grep '#if 0' | wc -l
>>> 320
>>
>> $ git grep -c '#if 0' | sort -k2 --field-separator=: -n
>> ...
>> hw/net/eepro100.c:21
>> target/ppc/cpu-models.h:76
>>
>> whoa :)
> 
> Igor recently already removed the dead definitions from cpu-models.c :
> 
> https://git.qemu.org/?p=qemu.git;a=commitdiff;h=aef779605779579afbafff
> 
> I guess we could now remove the corresponding dead definitions from the
> header, too...
> 
> Any volunteers?
> 
>  Thomas
> 

Well, I can just take a chainsaw to it blindly if you want to critique it.



Re: [Qemu-block] [Qemu-devel] [PATCH] throttle-groups: update tg->any_timer_armed[] on detach

2017-09-19 Thread Stefan Hajnoczi
On Tue, Sep 19, 2017 at 11:07:53AM -0500, Eric Blake wrote:
> On 09/19/2017 10:50 AM, Stefan Hajnoczi wrote:
> > Clear tg->any_timer_armed[] when throttling timers are destroy during
> 
> s/destroy/destroyed/

All your base are belong to us!



[Qemu-block] qemu-img: Check failed: No space left on device

2017-09-19 Thread Nicolas Ecarnot

Hello,

First post here, so maybe I should introduce myself :
- I'm a sysadmin for decades and currently managing 4 oVirt clusters, 
made out of tens of hypervisors, all CentOS 7.2+ based.
- I'm very happy with this solution we chose, especially because it is 
based on qemu-kvm (open source, reliable, documented).


On one VM, we experienced the following :
- oVirt/vdsm is detecting an issue on the image
- following these hints https://access.redhat.com/solutions/1173623, I 
managed to detect one error and fix it

- the VM is now running perfectly

On two other VMs, we experienced a similar situation, except the check 
stage is showing something like 14000+ errors, and the relevant logs are :


Repairing refcount block 14 is outside image
ERROR could not resize image: Invalid argument
ERROR cluster 425984 refcount=0 reference=1
ERROR cluster 425985 refcount=0 reference=1
[... repeating the previous line 7000+ times...]
ERROR cluster 457166 refcount=0 reference=1
Rebuilding refcount structure
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device

You surely know that oVirt/RHEV is storing its qcow2 images in dedicated 
logical volumes.


pvs/vgs/lvs are all showing there is plenty of space available, so I 
understand that I don't understand what "No space left on device" means.


The relevant versions are :
- qemu-img-ev-2.3.0-31.el7_2.10.1.x86_64.
- libvirt-daemon-kvm-1.2.17-13.el7_2.4.x86_64
- vdsm-4.17.32-1.el7.noarch
- # qemu-img info /the/relevant/logical/volume/path
file format: qcow2
virtual size: 30G (32212254720 bytes)
disk size: 0
cluster_size: 65536
Format specific information:
compat: 0.10
refcount bits: 16

Is there a hope I'll be able to repair this image?

--
Nicolas ECARNOT





Re: [Qemu-block] [Qemu-devel] [PATCH] throttle-groups: update tg->any_timer_armed[] on detach

2017-09-19 Thread Eric Blake
On 09/19/2017 10:50 AM, Stefan Hajnoczi wrote:
> Clear tg->any_timer_armed[] when throttling timers are destroy during

s/destroy/destroyed/

> AioContext attach/detach.  Failure to do so causes throttling to hang
> because we believe the timer is already scheduled!
> 
> The following was broken at least since QEMU 2.10.0 with -drive
> iops=100:
> 
>   $ dd if=/dev/zero of=/dev/vdb oflag=direct count=1000
>   (qemu) stop
>   (qemu) cont
>   ...I/O is stuck...
> 
> Reported-by: Yongxue Hong 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  block/throttle-groups.c | 13 +
>  1 file changed, 13 insertions(+)
> 
Reviewed-by: Eric Blake 

CC: qemu-sta...@nongnu.org

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] image locking

2017-09-19 Thread Eric Blake
On 09/19/2017 10:46 AM, Vladimir Sementsov-Ogievskiy wrote:
> Hi Fam!
> 
> I have a question about your image locking series:
> 
> Can you please explain, why OFD locking is enabled by default and posix
> locking is not? What is wrong with posix locking, what are the problems?

POSIX locking (lockf(3), which wraps fcntl(2)) has a severe flaw,
mandated by POSIX due to historical behavior, regarding the duration of
the lock - within a single process, ALL lock access to the same inode,
regardless of which fd triggered the access, is tied together:

int fd1 = open("/dev/null", O_RDONLY);
int fd2 = open("/dev/null", O_RDONLY);
fcntl(fd1, F_SETLK, ...); // Obtain lock
close(fd2); // oops, lock on fd1 is lost!

BSD locks (flock(2)) do not have this issue, but have a different flaw -
they can only lock the entire file at once.  But we need fine-grained
access (we lock different byte ranges in order to convey multiple pieces
of information across processes).

Hence, Linux invented OFD locking as the best of both worlds (the locks
are tied to the inode rather than the process, but allow fine-grained
access), and it is likely that POSIX will eventually standardize it
(http://austingroupbugs.net/view.php?id=768).

More from the Linux 'man fcntl' page:

>The  record  locks  described  above  are  associated  with the process
>(unlike the open file description locks  described  below).   This  has
>some unfortunate consequences:
> 
>*  If  a  process  closes any file descriptor referring to a file, then
>   all of the process's locks on that file are released, regardless  of
>   the  file  descriptor(s)  on which the locks were obtained.  This is
>   bad: it means that a process can lose its locks on a  file  such  as
>   /etc/passwd  or  /etc/mtab  when  for some reason a library function
>   decides to open, read, and close the same file.
> 
>*  The threads in a process share locks.   In  other  words,  a  multi‐
>   threaded  program  can't  use  record locking to ensure that threads
>   don't simultaneously access the same region of a file.
> 
>Open file description locks solve both of these problems.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-devel] image locking

2017-09-19 Thread Daniel P. Berrange
On Tue, Sep 19, 2017 at 06:46:19PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Hi Fam!
> 
> I have a question about your image locking series:
> 
> Can you please explain, why OFD locking is enabled by default and posix
> locking is not? What is wrong with posix locking, what are the problems?

POSIX locking suffers from a horribly broken design making it practically
impossible to use it reliably, particularly in a threaded program. If
there are two file descriptors open to the same file, closing one FD
will release locks held by the other FD. See more details here:

  http://0pointer.de/blog/projects/locking.html

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



[Qemu-block] [PATCH] throttle-groups: update tg->any_timer_armed[] on detach

2017-09-19 Thread Stefan Hajnoczi
Clear tg->any_timer_armed[] when throttling timers are destroy during
AioContext attach/detach.  Failure to do so causes throttling to hang
because we believe the timer is already scheduled!

The following was broken at least since QEMU 2.10.0 with -drive
iops=100:

  $ dd if=/dev/zero of=/dev/vdb oflag=direct count=1000
  (qemu) stop
  (qemu) cont
  ...I/O is stuck...

Reported-by: Yongxue Hong 
Signed-off-by: Stefan Hajnoczi 
---
 block/throttle-groups.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/block/throttle-groups.c b/block/throttle-groups.c
index 6ba992c8d7..2bfd03faa0 100644
--- a/block/throttle-groups.c
+++ b/block/throttle-groups.c
@@ -592,7 +592,20 @@ void throttle_group_attach_aio_context(ThrottleGroupMember *tgm,
 void throttle_group_detach_aio_context(ThrottleGroupMember *tgm)
 {
     ThrottleTimers *tt = &tgm->throttle_timers;
+    ThrottleGroup *tg = container_of(tgm->throttle_state, ThrottleGroup, ts);
+
     throttle_timers_detach_aio_context(tt);
+
+    /* Forget about these timers, they have been destroyed */
+    qemu_mutex_lock(&tg->lock);
+    if (tg->tokens[0] == tgm) {
+        tg->any_timer_armed[0] = false;
+    }
+    if (tg->tokens[1] == tgm) {
+        tg->any_timer_armed[1] = false;
+    }
+    qemu_mutex_unlock(&tg->lock);
+
     tgm->aio_context = NULL;
 }
 
-- 
2.13.5




[Qemu-block] image locking

2017-09-19 Thread Vladimir Sementsov-Ogievskiy

Hi Fam!

I have a question about your image locking series:

Can you please explain, why OFD locking is enabled by default and posix 
locking is not? What is wrong with posix locking, what are the problems?


--
Best regards,
Vladimir




[Qemu-block] RFC: Reducing the size of entries in the qcow2 L2 cache

2017-09-19 Thread Alberto Garcia
Hi everyone,

over the past few weeks I have been testing the effects of reducing
the size of the entries in the qcow2 L2 cache. This was briefly
mentioned by Denis in the same thread where we discussed subcluster
allocation back in April, but I'll describe here the problem and the
proposal in detail.

=== Problem ===

In the qcow2 file format guest addresses are mapped to host addresses
using the so-called L1 and L2 tables. The size of an L2 table is the
same as the cluster size, therefore a larger cluster means more L2
entries in a table, and because of that an L2 table can map a larger
portion of the address space (not only because it contains more
entries, but also because the data cluster that each one of those
entries points at is larger).

There are two consequences of this:

   1) If you double the cluster size of a qcow2 image then the maximum
  space needed for all L2 tables is divided by two (i.e. you need
  half the metadata).

   2) If you double the cluster size of a qcow2 image then each one of
  the L2 tables will map four times as much disk space.

With the default cluster size of 64KB, each L2 table maps 512MB of
contiguous disk space. This table shows what happens when you change
the cluster size:

 |--+--|
 | Cluster size | An L2 table maps |
 |--+--|
 |   512  B |32 KB |
 | 1 KB |   128 KB |
 | 2 KB |   512 KB |
 | 4 KB | 2 MB |
 | 8 KB | 8 MB |
 |16 KB |32 MB |
 |32 KB |   128 MB |
 |64 KB |   512 MB |
 |   128 KB | 2 GB |
 |   256 KB | 8 GB |
 |   512 KB |32 GB |
 | 1 MB |   128 GB |
 | 2 MB |   512 GB |
 |--+--|

When QEMU wants to convert a guest address into a host address, it
needs to read the entry from the corresponding L2 table. The qcow2
driver doesn't read those entries directly, it does it by loading the
tables in the L2 cache so they can be kept in memory in case they are
needed later.

The problem here is that the L2 cache (and the qcow2 driver in
general) always works with complete L2 tables: if QEMU needs a
particular L2 entry then the whole cluster containing the L2 table is
read from disk, and if the cache is full then a cluster worth of
cached data has to be discarded.

The consequences of this are worse the larger the cluster size is, not
only because we're reading (and discarding) larger amounts of data,
but also because we're using that memory in a very inefficient way.

Example: with 1MB clusters each L2 table maps 128GB of contiguous
virtual disk, so that's the granularity of our cache. If we're
performing I/O in a 4GB area that overlaps two of those 128GB chunks,
we need to have in the cache two complete L2 tables (2MB) even when in
practice we're only using 32KB of those 2MB (32KB contain enough L2
entries to map the 4GB that we're using).

=== The proposal ===

One way to solve the problems described above is to decouple the L2
table size (which is equal to the cluster size) from the cache entry
size.

The qcow2 cache doesn't actually know anything about the data that
it's loading, it just receives a disk offset and checks that it is
properly aligned. It's perfectly possible to make it load data blocks
smaller than a cluster.

I already have a working prototype, and I was doing tests using a 4KB
cache entry size. 4KB is small enough, it allows us to make a more
flexible use of the cache, it's also a common file system block size
and it can hold enough L2 entries to cover substantial amounts of disk
space (especially with large clusters).

 |--+---|
 | Cluster size | 4KB of L2 entries map |
 |--+---|
 | 64 KB| 32 MB |
 | 128 KB   | 64 MB |
 | 256 KB   | 128 MB|
 | 512 KB   | 256 MB|
 | 1 MB | 512 MB|
 | 2 MB | 1 GB  |
 |--+---|

Some results from my tests (using an SSD drive and random 4K reads):

|---+--+-+---+--|
| Disk size | Cluster size | L2 cache| Standard QEMU | Patched QEMU |
|---+--+-+---+--|
| 16 GB | 64 KB| 1 MB [8 GB] | 5000 IOPS | 12700 IOPS   |
|  2 TB |  2 MB| 4 MB [1 TB] |  576 IOPS | 11000 IOPS   |
|---+--+-+---+--|

The improvements are clearly visible, but it's important to point out
a couple of things:

   - L2 cache size is always < total L2 metadata on disk (otherwise
 this wouldn't make sense).

Re: [Qemu-block] RFC: Reducing the size of entries in the qcow2 L2 cache

2017-09-19 Thread Denis V. Lunev
On 09/19/2017 06:07 PM, Alberto Garcia wrote:
> Hi everyone,
>
> over the past few weeks I have been testing the effects of reducing
> the size of the entries in the qcow2 L2 cache. This was briefly
> mentioned by Denis in the same thread where we discussed subcluster
> allocation back in April, but I'll describe here the problem and the
> proposal in detail.
>
> === Problem ===
>
> In the qcow2 file format guest addresses are mapped to host addresses
> using the so-called L1 and L2 tables. The size of an L2 table is the
> same as the cluster size, therefore a larger cluster means more L2
> entries in a table, and because of that an L2 table can map a larger
> portion of the address space (not only because it contains more
> entries, but also because the data cluster that each one of those
> entries points at is larger).
>
> There are two consequences of this:
>
>1) If you double the cluster size of a qcow2 image then the maximum
>   space needed for all L2 tables is divided by two (i.e. you need
>   half the metadata).
>
>2) If you double the cluster size of a qcow2 image then each one of
>   the L2 tables will map four times as much disk space.
>
> With the default cluster size of 64KB, each L2 table maps 512MB of
> contiguous disk space. This table shows what happens when you change
> the cluster size:
>
>  |--+--|
>  | Cluster size | An L2 table maps |
>  |--+--|
>  |   512  B |32 KB |
>  | 1 KB |   128 KB |
>  | 2 KB |   512 KB |
>  | 4 KB | 2 MB |
>  | 8 KB | 8 MB |
>  |16 KB |32 MB |
>  |32 KB |   128 MB |
>  |64 KB |   512 MB |
>  |   128 KB | 2 GB |
>  |   256 KB | 8 GB |
>  |   512 KB |32 GB |
>  | 1 MB |   128 GB |
>  | 2 MB |   512 GB |
>  |--+--|
>
> When QEMU wants to convert a guest address into a host address, it
> needs to read the entry from the corresponding L2 table. The qcow2
> driver doesn't read those entries directly, it does it by loading the
> tables in the L2 cache so they can be kept in memory in case they are
> needed later.
>
> The problem here is that the L2 cache (and the qcow2 driver in
> general) always works with complete L2 tables: if QEMU needs a
> particular L2 entry then the whole cluster containing the L2 table is
> read from disk, and if the cache is full then a cluster worth of
> cached data has to be discarded.
>
> The consequences of this are worse the larger the cluster size is, not
> only because we're reading (and discarding) larger amounts of data,
> but also because we're using that memory in a very inefficient way.
>
> Example: with 1MB clusters each L2 table maps 128GB of contiguous
> virtual disk, so that's the granularity of our cache. If we're
> performing I/O in a 4GB area that overlaps two of those 128GB chunks,
> we need to have in the cache two complete L2 tables (2MB) even when in
> practice we're only using 32KB of those 2MB (32KB contain enough L2
> entries to map the 4GB that we're using).
>
> === The proposal ===
>
> One way to solve the problems described above is to decouple the L2
> table size (which is equal to the cluster size) from the cache entry
> size.
>
> The qcow2 cache doesn't actually know anything about the data that
> it's loading, it just receives a disk offset and checks that it is
> properly aligned. It's perfectly possible to make it load data blocks
> smaller than a cluster.
>
> I already have a working prototype, and I was doing tests using a 4KB
> cache entry size. 4KB is small enough, it allows us to make a more
> flexible use of the cache, it's also a common file system block size
> and it can hold enough L2 entries to cover substantial amounts of disk
> space (especially with large clusters).
>
>  |--+---|
>  | Cluster size | 4KB of L2 entries map |
>  |--+---|
>  | 64 KB| 32 MB |
>  | 128 KB   | 64 MB |
>  | 256 KB   | 128 MB|
>  | 512 KB   | 256 MB|
>  | 1 MB | 512 MB|
>  | 2 MB | 1 GB  |
>  |--+---|
>
> Some results from my tests (using an SSD drive and random 4K reads):
>
> |---+--+-+---+--|
> | Disk size | Cluster size | L2 cache| Standard QEMU | Patched QEMU |
> |---+--+-+---+--|
> | 16 GB | 64 KB| 1 MB [8 GB] | 5000 IOPS | 12700 IOPS   |
> |  2 TB |  2 MB| 4 MB [1 TB] |  576 IOPS | 11000 IOPS   |
> |---+--+-+---+--|

Re: [Qemu-block] [Qemu-devel] [PATCH v8 00/20] make dirty-bitmap byte-based

2017-09-19 Thread Fam Zheng
On Mon, 09/18 13:57, Eric Blake wrote:
> There are patches floating around to add NBD_CMD_BLOCK_STATUS,
> but NBD wants to report status on byte granularity (even if the
> reporting will probably be naturally aligned to sectors or even
> much higher levels).  I've therefore started the task of
> converting our block status code to report at a byte granularity
> rather than sectors.
> 
> Now that 2.11 is open, I'm rebasing/reposting the remaining patches.
> 
> The overall conversion currently looks like:
> part 1: bdrv_is_allocated (merged in 2.10, commit 51b0a488)
> part 2: dirty-bitmap (this series, v7 was here [1])
> part 3: bdrv_get_block_status (v4 is posted [2] and is mostly reviewed)
> part 4: .bdrv_co_block_status (v3 is posted [3], but needs review)

Patches 1-4, 6-20:

Reviewed-by: Fam Zheng 



Re: [Qemu-block] Block Migration and CPU throttling

2017-09-19 Thread Paolo Bonzini
On 19/09/2017 15:36, Peter Lieven wrote:
> Hi,
> 
> I just noticed that CPU throttling and Block Migration don't work
> together very well.
> During block migration the throttling heuristic detects that we
> obviously make no progress
> in ram transfer. But the reason is the running block migration and not a
> too high dirty pages rate.
> 
> The result is that any VM is throttled by 99% during block migration.
> 
> I wonder what the best way would be fix this. I came up with the
> following ideas so far:
> 
> - disable throttling while block migration is in bulk stage
> - check if absolute number of num_dirty_pages_period crosses a threshold
> and not if its just
>   greater than 50% of transferred bytes
> - check if migration_dirty_pages > 0. This slows down throttling, but
> does not avoid it completely.

If you can use nbd-server and drive-mirror for block migration (libvirt
would do it), then you will use multiple sockets and be able to migrate
block and RAM at the same time.

Otherwise, disabling throttling during the bulk stage is the one that
seems nicest and most promising.

Paolo



Re: [Qemu-block] Block Migration and CPU throttling

2017-09-19 Thread Dr. David Alan Gilbert
* Peter Lieven (p...@kamp.de) wrote:
> Am 19.09.2017 um 16:38 schrieb Dr. David Alan Gilbert:
> > * Peter Lieven (p...@kamp.de) wrote:
> > > Hi,
> > > 
> > > I just noticed that CPU throttling and Block Migration don't work 
> > > together very well.
> > > During block migration the throttling heuristic detects that we obviously 
> > > make no progress
> > > in ram transfer. But the reason is the running block migration and not a 
> > > too high dirty pages rate.
> > > 
> > > The result is that any VM is throttled by 99% during block migration.
> > Hmm that's unfortunate; do you have a bandwidth set lower than your
> > actual network connection? I'm just wondering if it's actually going
> > between the block and RAM iterative sections or getting stuck in ne.
> 
> It happens also if source and dest are on the same machine and speed is set 
> to 100G.

But does it happen if they're not and the speed is set low?

Dave

> Peter
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-block] Block Migration and CPU throttling

2017-09-19 Thread Peter Lieven

Am 19.09.2017 um 16:38 schrieb Dr. David Alan Gilbert:

* Peter Lieven (p...@kamp.de) wrote:

Hi,

I just noticed that CPU throttling and Block Migration don't work together very 
well.
During block migration the throttling heuristic detects that we obviously make 
no progress
in ram transfer. But the reason is the running block migration and not a too 
high dirty pages rate.

The result is that any VM is throttled by 99% during block migration.

Hmm that's unfortunate; do you have a bandwidth set lower than your
actual network connection? I'm just wondering if it's actually going
between the block and RAM iterative sections or getting stuck in ne.


It happens also if source and dest are on the same machine and speed is set to 
100G.

Peter




Re: [Qemu-block] Block Migration and CPU throttling

2017-09-19 Thread Dr. David Alan Gilbert
* Peter Lieven (p...@kamp.de) wrote:
> Hi,
> 
> I just noticed that CPU throttling and Block Migration don't work together 
> very well.
> During block migration the throttling heuristic detects that we obviously 
> make no progress
> in ram transfer. But the reason is the running block migration and not a too 
> high dirty pages rate.
> 
> The result is that any VM is throttled by 99% during block migration.

Hmm that's unfortunate; do you have a bandwidth set lower than your
actual network connection? I'm just wondering if it's actually going
between the block and RAM iterative sections or getting stuck in ne.

> I wonder what the best way would be fix this. I came up with the following 
> ideas so far:
> 
> - disable throttling while block migration is in bulk stage

mig_throttle_guest_down is currently in migration/ram.c
so however we do it, we've got to add an 'inhibit' that
block.c could set (I suggest a counter so that it could
be set by a few things).

> - check if absolute number of num_dirty_pages_period crosses a threshold and 
> not if its just
>   greater than 50% of transferred bytes
> - check if migration_dirty_pages > 0. This slows down throttling, but does 
> not avoid it completely.

An interesting question is whether you want to inhibit in the non-bulk
stage if IO writes are happening too quickly to keep up.

Dave

> 
> Peter
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper

2017-09-19 Thread Paolo Bonzini
On 19/09/2017 15:26, Daniel P. Berrange wrote:
> On Tue, Sep 19, 2017 at 03:23:09PM +0200, Paolo Bonzini wrote:
>> On 19/09/2017 15:12, Daniel P. Berrange wrote:
>>> On Tue, Sep 19, 2017 at 02:57:00PM +0200, Paolo Bonzini wrote:
 On 19/09/2017 14:53, Daniel P. Berrange wrote:
>> +    /* Try to reconnect while sending the CDB.  */
>> +    for (attempts = 0; attempts < PR_MAX_RECONNECT_ATTEMPTS; attempts++) {
>
> I'm curious why you need to loop here. The helper daemon should be running
> already, as you're not spawning it on demand IIUC. So it shoudl either
> succeed first time, or fail every time.

 You're focusing on the usecase where the helper daemon is spawned per-VM
 by the system libvirtd, which I agree is the most important one.
 However, the other usecase is the one with a global daemon, access to
 which is controlled via Unix permissions.  This is not SELinux-friendly,
 but it is nicer for testing and it is also the only possibility for user
 libvirtd.

 In that case, upgrading QEMU on the host could result in a "systemctl
 restart qemu-pr-helper.service" (or, hopefully unlikely, a crash could
 result in systemd respawning the daemon).  Reconnect is useful in that 
 case.
>>>
>>> If using systemd socket activation and you restart the daemon, the listening
>>> socket should be preserved, so you shouldn't need to reconnect - the client
>>> should get queued until it has started again (likewise on crash).
>>
>> Oh, that's cool.  I didn't know that.  However, systemd socket
>> activation is optional, and it's only a handful of lines so I think it's
>> a bit nicer behavior (chardevs for example have options to reconnect).
> 
> The downside is that if someone forget to start the daemon, or enable
> the socket,  QEMU will spin for 5 seconds trying to reconnect, instead
> of reporting an error immediately. 

Not exactly:

- the daemon must be present at startup.  Reconnect is only tried when
sending a command.

- it's not a busy wait, pr_manager_helper_run runs in an
util/thread-pool.c worker thread (see pr_manager_execute in patch 1).

Paolo



Re: [Qemu-block] [Qemu-devel] [PATCH 2/4] scsi: build qemu-pr-helper

2017-09-19 Thread Paolo Bonzini
On 19/09/2017 15:33, Daniel P. Berrange wrote:
> On Tue, Sep 19, 2017 at 12:24:32PM +0200, Paolo Bonzini wrote:
>> Introduce a privileged helper to run persistent reservation commands.
>> This lets virtual machines send persistent reservations without using
>> CAP_SYS_RAWIO or out-of-tree patches.  The helper uses Unix permissions
>> and SCM_RIGHTS to restrict access to processes that can access its socket
>> and prove that they have an open file descriptor for a raw SCSI device.
>>
>> The next patch will also correct the usage of persistent reservations
>> with multipath devices.
>>
>> It would also be possible to support for Linux's IOC_PR_* ioctls in
>> the future, to support NVMe devices.  For now, however, only SCSI is
>> supported.
>>
>> Signed-off-by: Paolo Bonzini 
>> ---
> 
> 
>> +
>> +#define PR_HELPER_CDB_SIZE 16
>> +#define PR_HELPER_SENSE_SIZE   96
>> +#define PR_HELPER_DATA_SIZE    8192
>> +
>> +typedef struct PRHelperResponse {
>> +    int32_t result;
>> +    int32_t sz;
>> +    uint8_t sense[PR_HELPER_SENSE_SIZE];
>> +} PRHelperResponse;
> 
> Should we annotate this with 'packed' to ensure its immune to compiler
> padding ?

Yes...

> 
>> +typedef struct PRHelperRequest {
>> +    int fd;
>> +    size_t sz;
>> +    uint8_t cdb[PR_HELPER_CDB_SIZE];
>> +} PRHelperRequest;
> 
> Same q here ?

... and no since this one is not the wire format (which is just a
16-byte cdb, plus fd in ancillary data).

> 
>> +static int coroutine_fn prh_write_response(PRHelperClient *client,
>> +                                           PRHelperRequest *req,
>> +                                           PRHelperResponse *resp, Error **errp)
>> +{
>> +    ssize_t r;
> 
> Can just be int
> 
>> +    size_t sz;
>> +
>> +    if (req->cdb[0] == PERSISTENT_RESERVE_IN && resp->result == GOOD) {
>> +        assert(resp->sz <= req->sz && resp->sz <= sizeof(client->data));
>> +    } else {
>> +        assert(resp->sz == 0);
>> +    }
>> +
>> +    sz = resp->sz;
>> +
>> +    resp->result = cpu_to_be32(resp->result);
>> +    resp->sz = cpu_to_be32(resp->sz);
>> +    r = qio_channel_write_all(QIO_CHANNEL(client->ioc),
>> +                              (char *) resp, sizeof(*resp), errp);
>> +    if (r < 0) {
>> +        return r;
>> +    }
>> +
>> +    r = qio_channel_write_all(QIO_CHANNEL(client->ioc),
>> +                              (char *) client->data,
>> +                              sz, errp);
>> +    return r < 0 ? r : 0;
> 
> Just return 'r' directly, its only ever -1 or 0

Ok, thanks, will adjust all of these.

Paolo



Re: [Qemu-block] [Qemu-devel] [PATCH 2/4] scsi: build qemu-pr-helper

2017-09-19 Thread Daniel P. Berrange
On Tue, Sep 19, 2017 at 12:24:32PM +0200, Paolo Bonzini wrote:
> Introduce a privileged helper to run persistent reservation commands.
> This lets virtual machines send persistent reservations without using
> CAP_SYS_RAWIO or out-of-tree patches.  The helper uses Unix permissions
> and SCM_RIGHTS to restrict access to processes that can access its socket
> and prove that they have an open file descriptor for a raw SCSI device.
> 
> The next patch will also correct the usage of persistent reservations
> with multipath devices.
> 
> It would also be possible to support Linux's IOC_PR_* ioctls in
> the future, to support NVMe devices.  For now, however, only SCSI is
> supported.
> 
> Signed-off-by: Paolo Bonzini 
> ---


> +
> +#define PR_HELPER_CDB_SIZE 16
> +#define PR_HELPER_SENSE_SIZE   96
> +#define PR_HELPER_DATA_SIZE8192
> +
> +typedef struct PRHelperResponse {
> +int32_t result;
> +int32_t sz;
> +uint8_t sense[PR_HELPER_SENSE_SIZE];
> +} PRHelperResponse;

Should we annotate this with 'packed' to ensure it's immune to compiler
padding?


> +typedef struct PRHelperRequest {
> +int fd;
> +size_t sz;
> +uint8_t cdb[PR_HELPER_CDB_SIZE];
> +} PRHelperRequest;

Same q here ?


> +static int coroutine_fn prh_write_response(PRHelperClient *client,
> +   PRHelperRequest *req,
> +   PRHelperResponse *resp, Error 
> **errp)
> +{
> +ssize_t r;

Can just be int

> +size_t sz;
> +
> +if (req->cdb[0] == PERSISTENT_RESERVE_IN && resp->result == GOOD) {
> +assert(resp->sz <= req->sz && resp->sz <= sizeof(client->data));
> +} else {
> +assert(resp->sz == 0);
> +}
> +
> +sz = resp->sz;
> +
> +resp->result = cpu_to_be32(resp->result);
> +resp->sz = cpu_to_be32(resp->sz);
> +r = qio_channel_write_all(QIO_CHANNEL(client->ioc),
> +  (char *) resp, sizeof(*resp), errp);
> +if (r < 0) {
> +return r;
> +}
> +
> +r = qio_channel_write_all(QIO_CHANNEL(client->ioc),
> +  (char *) client->data,
> +  sz, errp);
> +return r < 0 ? r : 0;

Just return 'r' directly, its only ever -1 or 0



Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|



Re: [Qemu-block] [PATCH v2 7/7] block/nbd-client: do not yield from nbd_read_reply_entry

2017-09-19 Thread Eric Blake
On 09/19/2017 05:00 AM, Vladimir Sementsov-Ogievskiy wrote:
> 19.09.2017 01:36, Eric Blake wrote:
>> On 09/18/2017 08:59 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> Refactor nbd client to not yield from nbd_read_reply_entry. It's
>>> possible now as all reading is done in nbd_read_reply_entry and
>>> all request-related data is stored in per-request s->requests[i].
>>>
>>> We need here some additional work with s->requests[i].ret and
>>> s->quit to not fail requests on normal disconnect and to not report
>>> reading errors on normal disconnect.
>>>

>>> @@ -364,6 +353,8 @@ void nbd_client_close(BlockDriverState *bs)
>>>     nbd_send_request(client->ioc, &request);
>>>   +    client->quit = true;
>> Previously, client->quit was only set when detecting a broken server,
>> now it is also set for a clean exit.  Do we need to change any
>> documentation of the field?
> 
> It has documentation?

Touche.  But it probably should, especially if we are changing its
semantics, to make it easier to understand from looking at the struct
what semantics to expect.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





[Qemu-block] Block Migration and CPU throttling

2017-09-19 Thread Peter Lieven

Hi,

I just noticed that CPU throttling and Block Migration don't work together very 
well.
During block migration the throttling heuristic detects that we apparently make
no progress in RAM transfer. But the reason is the running block migration, not
an excessively high dirty-page rate.

The result is that any VM is throttled by 99% during block migration.

I wonder what the best way to fix this would be. I came up with the following
ideas so far:

- disable throttling while block migration is in bulk stage
- check if the absolute number of num_dirty_pages_period crosses a threshold,
  not just whether it is greater than 50% of the transferred bytes
- check if migration_dirty_pages > 0. This slows down throttling, but does not 
avoid it completely.

Peter



Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper

2017-09-19 Thread Paolo Bonzini
On 19/09/2017 15:12, Daniel P. Berrange wrote:
> On Tue, Sep 19, 2017 at 02:57:00PM +0200, Paolo Bonzini wrote:
>> On 19/09/2017 14:53, Daniel P. Berrange wrote:
 +/* Try to reconnect while sending the CDB.  */
 +for (attempts = 0; attempts < PR_MAX_RECONNECT_ATTEMPTS; attempts++) {
>>>
>>> I'm curious why you need to loop here. The helper daemon should be running
> already, as you're not spawning it on demand IIUC. So it should either
>>> succeed first time, or fail every time.
>>
>> You're focusing on the usecase where the helper daemon is spawned per-VM
>> by the system libvirtd, which I agree is the most important one.
>> However, the other usecase is the one with a global daemon, access to
>> which is controlled via Unix permissions.  This is not SELinux-friendly,
>> but it is nicer for testing and it is also the only possibility for user
>> libvirtd.
>>
>> In that case, upgrading QEMU on the host could result in a "systemctl
>> restart qemu-pr-helper.service" (or, hopefully unlikely, a crash could
>> result in systemd respawning the daemon).  Reconnect is useful in that case.
> 
> If using systemd socket activation and you restart the daemon, the listening
> socket should be preserved, so you shouldn't need to reconnect - the client
> should get queued until it has started again (likewise on crash).

Oh, that's cool.  I didn't know that.  However, systemd socket
activation is optional, and it's only a handful of lines so I think it's
a bit nicer behavior (chardevs for example have options to reconnect).

Paolo



Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper

2017-09-19 Thread Paolo Bonzini
On 19/09/2017 14:53, Daniel P. Berrange wrote:
>> +/* Try to reconnect while sending the CDB.  */
>> +for (attempts = 0; attempts < PR_MAX_RECONNECT_ATTEMPTS; attempts++) {
>
> I'm curious why you need to loop here. The helper daemon should be running
> already, as you're not spawning it on demand IIUC. So it should either
> succeed first time, or fail every time.

You're focusing on the usecase where the helper daemon is spawned per-VM
by the system libvirtd, which I agree is the most important one.
However, the other usecase is the one with a global daemon, access to
which is controlled via Unix permissions.  This is not SELinux-friendly,
but it is nicer for testing and it is also the only possibility for user
libvirtd.

In that case, upgrading QEMU on the host could result in a "systemctl
restart qemu-pr-helper.service" (or, hopefully unlikely, a crash could
result in systemd respawning the daemon).  Reconnect is useful in that case.

Paolo



Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper

2017-09-19 Thread Daniel P. Berrange
On Tue, Sep 19, 2017 at 12:24:34PM +0200, Paolo Bonzini wrote:
> This adds a concrete subclass of pr-manager that talks to qemu-pr-helper.
> 
> Signed-off-by: Paolo Bonzini 
> ---
> v1->v2: fixed string property double-free
> fixed/cleaned up error handling
> handle buffer underrun
> 
>  scsi/Makefile.objs   |   2 +-
>  scsi/pr-manager-helper.c | 302 
> +++
>  2 files changed, 303 insertions(+), 1 deletion(-)
>  create mode 100644 scsi/pr-manager-helper.c
> 
> diff --git a/scsi/Makefile.objs b/scsi/Makefile.objs
> index 5496d2ae6a..4d25e476cf 100644
> --- a/scsi/Makefile.objs
> +++ b/scsi/Makefile.objs
> @@ -1,3 +1,3 @@
>  block-obj-y += utils.o
>  
> -block-obj-$(CONFIG_LINUX) += pr-manager.o
> +block-obj-$(CONFIG_LINUX) += pr-manager.o pr-manager-helper.o
> diff --git a/scsi/pr-manager-helper.c b/scsi/pr-manager-helper.c
> new file mode 100644
> index 00..063fd35dee
> --- /dev/null
> +++ b/scsi/pr-manager-helper.c

> +/* Called with lock held.  */
> +static int pr_manager_helper_read(PRManagerHelper *pr_mgr,
> +  void *buf, int sz, Error **errp)
> +{
> +ssize_t r = qio_channel_read_all(pr_mgr->ioc, buf, sz, errp);

The return value isn't ssize_t - it just returns 0 on success, -1 on
error, never a byte count, since it always reads all requested bytes
and treats EOF as an error condition.

> +
> +if (r < 0) {
> +object_unref(OBJECT(pr_mgr->ioc));
> +pr_mgr->ioc = NULL;
> +return -EINVAL;
> +}
> +
> +return 0;
> +}
> +

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|



Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper

2017-09-19 Thread Daniel P. Berrange
On Tue, Sep 19, 2017 at 02:57:00PM +0200, Paolo Bonzini wrote:
> On 19/09/2017 14:53, Daniel P. Berrange wrote:
> >> +/* Try to reconnect while sending the CDB.  */
> >> +for (attempts = 0; attempts < PR_MAX_RECONNECT_ATTEMPTS; attempts++) {
> >
> > I'm curious why you need to loop here. The helper daemon should be running
> > already, as you're not spawning it on demand IIUC. So it should either
> > succeed first time, or fail every time.
> 
> You're focusing on the usecase where the helper daemon is spawned per-VM
> by the system libvirtd, which I agree is the most important one.
> However, the other usecase is the one with a global daemon, access to
> which is controlled via Unix permissions.  This is not SELinux-friendly,
> but it is nicer for testing and it is also the only possibility for user
> libvirtd.
> 
> In that case, upgrading QEMU on the host could result in a "systemctl
> restart qemu-pr-helper.service" (or, hopefully unlikely, a crash could
> result in systemd respawning the daemon).  Reconnect is useful in that case.

If using systemd socket activation and you restart the daemon, the listening
socket should be preserved, so you shouldn't need to reconnect - the client
should get queued until it has started again (likewise on crash).
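For reference, the socket-activation setup described above would look roughly like the following pair of units (unit names, paths and options are illustrative assumptions, not shipped files):

```ini
# qemu-pr-helper.socket (hypothetical)
[Socket]
ListenStream=/var/run/qemu-pr-helper.sock
SocketMode=0600

[Install]
WantedBy=sockets.target

# qemu-pr-helper.service (hypothetical)
[Service]
ExecStart=/usr/bin/qemu-pr-helper
```

Because systemd owns the listening socket, a daemon restart does not tear it down; clients connecting in the meantime are queued until the service is back.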

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|



Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper

2017-09-19 Thread Daniel P. Berrange
On Tue, Sep 19, 2017 at 03:23:09PM +0200, Paolo Bonzini wrote:
> On 19/09/2017 15:12, Daniel P. Berrange wrote:
> > On Tue, Sep 19, 2017 at 02:57:00PM +0200, Paolo Bonzini wrote:
> >> On 19/09/2017 14:53, Daniel P. Berrange wrote:
>  +/* Try to reconnect while sending the CDB.  */
>  +for (attempts = 0; attempts < PR_MAX_RECONNECT_ATTEMPTS; 
>  attempts++) {
> >>>
> >>> I'm curious why you need to loop here. The helper daemon should be running
> >>> already, as you're not spawning it on demand IIUC. So it should either
> >>> succeed first time, or fail every time.
> >>
> >> You're focusing on the usecase where the helper daemon is spawned per-VM
> >> by the system libvirtd, which I agree is the most important one.
> >> However, the other usecase is the one with a global daemon, access to
> >> which is controlled via Unix permissions.  This is not SELinux-friendly,
> >> but it is nicer for testing and it is also the only possibility for user
> >> libvirtd.
> >>
> >> In that case, upgrading QEMU on the host could result in a "systemctl
> >> restart qemu-pr-helper.service" (or, hopefully unlikely, a crash could
> >> result in systemd respawning the daemon).  Reconnect is useful in that 
> >> case.
> > 
> > If using systemd socket activation and you restart the daemon, the listening
> > socket should be preserved, so you shouldn't need to reconnect - the client
> > should get queued until it has started again (likewise on crash).
> 
> Oh, that's cool.  I didn't know that.  However, systemd socket
> activation is optional, and it's only a handful of lines so I think it's
> a bit nicer behavior (chardevs for example have options to reconnect).

The downside is that if someone forgets to start the daemon, or to enable
the socket, QEMU will spin for 5 seconds trying to reconnect, instead
of reporting an error immediately.

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|



Re: [Qemu-block] [PATCH v2 3/7] block/nbd-client: refactor reading reply

2017-09-19 Thread Paolo Bonzini
On 19/09/2017 13:03, Vladimir Sementsov-Ogievskiy wrote:
>>>
>> I disagree that it is easier to extend it in the future.  If some
>> commands in the future need a different "how do we read it" (e.g. for
>> structured reply), nbd_read_reply_entry may not have all the information
>> it needs anymore.
> 
> how is it possible? all information is in requests[i].

If you are okay with violating information hiding, then it is.

>>
>> In fact, you're pushing knowledge of the commands from nbd_co_request to
>> nbd_read_reply_entry:
>>
>> +if (s->reply.error == 0 &&
>> +s->requests[i].request->type == NBD_CMD_READ)
>>
>> and knowledge of NBD_CMD_READ simply doesn't belong there.  So there is
>> an example of that right now, it is not some arbitrary bad thing that
>> could happen in the future.
> 
> I can't agree that my point of view is wrong, it's just different.
> You want commands, reading from socket, each knows what it should read.
> I want the receiver - one sequential reader, that can read all kinds of
> replies and just wake requesting coroutines when all reading is done.
> For me it looks simpler than switching.

It may look simpler, but I think it goes against the principle of
coroutines which is to have related code in the same function.  Here you
have the command function that takes care of sending the request payload
but not receiving the reply payload (except that for commands that
aren't as simple as read or write, it will have to _process_ the reply
payload).  What to do with the reply payload is in another function that
has to peek at the command in order to find out what to do.  It doesn't
seem like a better design, honestly.

Paolo



Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper

2017-09-19 Thread Daniel P. Berrange
On Tue, Sep 19, 2017 at 12:24:34PM +0200, Paolo Bonzini wrote:
> This adds a concrete subclass of pr-manager that talks to qemu-pr-helper.
> 
> Signed-off-by: Paolo Bonzini 
> ---
> v1->v2: fixed string property double-free
> fixed/cleaned up error handling
> handle buffer underrun
> 
>  scsi/Makefile.objs   |   2 +-
>  scsi/pr-manager-helper.c | 302 
> +++
>  2 files changed, 303 insertions(+), 1 deletion(-)
>  create mode 100644 scsi/pr-manager-helper.c


> +static int pr_manager_helper_run(PRManager *p,
> + int fd, struct sg_io_hdr *io_hdr)
> +{
> +PRManagerHelper *pr_mgr = PR_MANAGER_HELPER(p);
> +
> +uint32_t len;
> +PRHelperResponse resp;
> +int ret;
> +int expected_dir;
> +int attempts;
> +uint8_t cdb[PR_HELPER_CDB_SIZE] = { 0 };
> +
> +if (!io_hdr->cmd_len || io_hdr->cmd_len > PR_HELPER_CDB_SIZE) {
> +return -EINVAL;
> +}
> +
> +memcpy(cdb, io_hdr->cmdp, io_hdr->cmd_len);
> +assert(cdb[0] == PERSISTENT_RESERVE_OUT || cdb[0] == 
> PERSISTENT_RESERVE_IN);
> +expected_dir =
> +(cdb[0] == PERSISTENT_RESERVE_OUT ? SG_DXFER_TO_DEV : 
> SG_DXFER_FROM_DEV);
> +if (io_hdr->dxfer_direction != expected_dir) {
> +return -EINVAL;
> +}
> +
> +len = scsi_cdb_xfer(cdb);
> +if (io_hdr->dxfer_len < len || len > PR_HELPER_DATA_SIZE) {
> +return -EINVAL;
> +}
> +
> +qemu_mutex_lock(&pr_mgr->lock);
> +
> +/* Try to reconnect while sending the CDB.  */
> +for (attempts = 0; attempts < PR_MAX_RECONNECT_ATTEMPTS; attempts++) {

I'm curious why you need to loop here. The helper daemon should be running
already, as you're not spawning it on demand IIUC. So it should either
succeed first time, or fail every time.

> +if (!pr_mgr->ioc) {
> +ret = pr_manager_helper_initialize(pr_mgr, NULL);
> +if (ret < 0) {
> +qemu_mutex_unlock(&pr_mgr->lock);
> +g_usleep(G_USEC_PER_SEC);
> +qemu_mutex_lock(&pr_mgr->lock);
> +continue;
> +}
> +}
> +
> +ret = pr_manager_helper_write(pr_mgr, fd, cdb, ARRAY_SIZE(cdb), 
> NULL);
> +if (ret >= 0) {
> +break;
> +}
> +}
> +if (ret < 0) {
> +goto out;
> +}

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|



Re: [Qemu-block] [PATCH v8 00/20] make dirty-bitmap byte-based

2017-09-19 Thread Kevin Wolf
Am 18.09.2017 um 20:57 hat Eric Blake geschrieben:
> There are patches floating around to add NBD_CMD_BLOCK_STATUS,
> but NBD wants to report status on byte granularity (even if the
> reporting will probably be naturally aligned to sectors or even
> much higher levels).  I've therefore started the task of
> converting our block status code to report at a byte granularity
> rather than sectors.
> 
> Now that 2.11 is open, I'm rebasing/reposting the remaining patches.
> 
> The overall conversion currently looks like:
> part 1: bdrv_is_allocated (merged in 2.10, commit 51b0a488)
> part 2: dirty-bitmap (this series, v7 was here [1])
> part 3: bdrv_get_block_status (v4 is posted [2] and is mostly reviewed)
> part 4: .bdrv_co_block_status (v3 is posted [3], but needs review)

I agree with Fam's comment on patch 5 and had a question on patch 18.

Everything else is
Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [PATCH v8 18/20] qcow2: Switch store_bitmap_data() to byte-based iteration

2017-09-19 Thread Kevin Wolf
Am 18.09.2017 um 20:58 hat Eric Blake geschrieben:
> Now that we have adjusted the majority of the calls this function
> makes to be byte-based, it is easier to read the code if it makes
> passes over the image using bytes rather than sectors.
> 
> Signed-off-by: Eric Blake 
> Reviewed-by: John Snow 
> Reviewed-by: Vladimir Sementsov-Ogievskiy 
> 
> ---
> v7: rebase to earlier change, make rounding of offset obvious (no semantic
> change, so R-b kept) [Kevin]
> v5-v6: no change
> v4: new patch
> ---
>  block/qcow2-bitmap.c | 26 +++---
>  1 file changed, 11 insertions(+), 15 deletions(-)
> 
> diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
> index 692ce0de88..302fffd6e1 100644
> --- a/block/qcow2-bitmap.c
> +++ b/block/qcow2-bitmap.c
> @@ -1072,10 +1072,9 @@ static uint64_t *store_bitmap_data(BlockDriverState 
> *bs,
>  {
>  int ret;
>  BDRVQcow2State *s = bs->opaque;
> -int64_t sector;
> -uint64_t limit, sbc;
> +int64_t offset;
> +uint64_t limit;
>  uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
> -uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE);
>  const char *bm_name = bdrv_dirty_bitmap_name(bitmap);
>  uint8_t *buf = NULL;
>  BdrvDirtyBitmapIter *dbi;
> @@ -1100,18 +1099,17 @@ static uint64_t *store_bitmap_data(BlockDriverState 
> *bs,
>  dbi = bdrv_dirty_iter_new(bitmap);
>  buf = g_malloc(s->cluster_size);
>  limit = bytes_covered_by_bitmap_cluster(s, bitmap);
> -sbc = limit >> BDRV_SECTOR_BITS;
>  assert(DIV_ROUND_UP(bm_size, limit) == tb_size);
> 
> -while ((sector = bdrv_dirty_iter_next(dbi) >> BDRV_SECTOR_BITS) >= 0) {
> -uint64_t cluster = sector / sbc;
> +while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) {
> +uint64_t cluster = offset / limit;
>  uint64_t end, write_size;
>  int64_t off;
> 
> -sector = cluster * sbc;
> -end = MIN(bm_sectors, sector + sbc);
> -write_size = bdrv_dirty_bitmap_serialization_size(bitmap,
> -sector * BDRV_SECTOR_SIZE, (end - sector) * BDRV_SECTOR_SIZE);
> +offset = QEMU_ALIGN_DOWN(offset, limit);

With the question that I asked in v6, apart from changing the spelling
to be more explicit, I actually hoped that you would explain why
aligning down is the right thing to do.

It looks plausible to me that we can do this in correct code because we
don't support granularities < 512 bytes (a qemu restriction that is
written down as a note in the qcow2 spec).

The part that I'm missing yet is why we need to do it. The bitmap
granularity is also the granularity of bdrv_dirty_iter_next(), so isn't
offset already aligned and we could even assert that instead of aligning
down? (As long we enforce our restriction, which we seem to do in
bitmap_list_load().)

> +end = MIN(bm_size, offset + limit);
> +write_size = bdrv_dirty_bitmap_serialization_size(bitmap, offset,
> +  end - offset);
>  assert(write_size <= s->cluster_size);
> 
>  off = qcow2_alloc_clusters(bs, s->cluster_size);

Kevin



Re: [Qemu-block] [PATCH v2 6/7] block/nbd-client: early fail nbd_read_reply_entry if s->quit is set

2017-09-19 Thread Vladimir Sementsov-Ogievskiy

19.09.2017 13:03, Paolo Bonzini wrote:

On 19/09/2017 11:43, Vladimir Sementsov-Ogievskiy wrote:

I'm trying to look forward to structured reads, where we will have to
deal with more than one server message in reply to a single client
request.  For read, we just piece together portions of the qiov until
the server has sent us all the pieces.  But for block query, I really do
think we'll want to defer to specialized handlers for doing further
reads (the amount of data to be read from the server is not known by the
client a priori, so it is a two-part read, one to get the length, and
another to read that remaining length); if we need some sort of callback
function rather than cramming all the logic here in the main read loop,

by patch 3 I do not mean in any way that I want to have all reading
stuff in one function; of course it should be refactored
for the block-status addition. I just want to have all reading stuff in one
coroutine. Reading from ioc is sequential, so it's
natural to do it sequentially in one coroutine, without switches.

But why is that a goal?  See it as a state machine that goes between the
"waiting for header" and "waiting for payload" states.  Reading header
corresponds to read_reply_co waiting on the socket (and reading when
it's ready); reading payload corresponds to the request coroutine
waiting on the socket and reading when it's ready.

Paolo



I just dislike public access to ioc for commands and extra coroutine 
switching and synchronization.




--
Best regards,
Vladimir




Re: [Qemu-block] [PATCH v2 3/7] block/nbd-client: refactor reading reply

2017-09-19 Thread Vladimir Sementsov-Ogievskiy

19.09.2017 13:01, Paolo Bonzini wrote:

On 19/09/2017 11:25, Vladimir Sementsov-Ogievskiy wrote:

18.09.2017 18:43, Paolo Bonzini wrote:

On 18/09/2017 15:59, Vladimir Sementsov-Ogievskiy wrote:

Read the whole reply in one place - in nbd_read_reply_entry.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
   block/nbd-client.h |  1 +
   block/nbd-client.c | 42 --
   2 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/block/nbd-client.h b/block/nbd-client.h
index b1900e6a6d..3f97d31013 100644
--- a/block/nbd-client.h
+++ b/block/nbd-client.h
@@ -21,6 +21,7 @@ typedef struct {
   Coroutine *coroutine;
   bool receiving; /* waiting for read_reply_co? */
   NBDRequest *request;
+QEMUIOVector *qiov;
   } NBDClientRequest;
 typedef struct NBDClientSession {
diff --git a/block/nbd-client.c b/block/nbd-client.c
index 5f241ecc22..f7b642f079 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -98,6 +98,21 @@ static coroutine_fn void nbd_read_reply_entry(void
*opaque)
   break;
   }
   +if (s->reply.error == 0 &&
+s->requests[i].request->type == NBD_CMD_READ)
+{
+QEMUIOVector *qiov = s->requests[i].qiov;
+assert(qiov != NULL);
+assert(s->requests[i].request->len ==
+   iov_size(qiov->iov, qiov->niov));
+if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
+  NULL) < 0)
+{
+s->reply.error = EIO;
+break;
+}
+}

I am not sure this is an improvement.  In principle you could have
commands that read replies a bit at a time without using a QEMUIOVector.

Paolo

Hm, what do you mean? In this patch I don't change "how do we read it",
I only move the reading code to one coroutine, to make it simple to
extend it in the future (block status, etc).

I disagree that it is easier to extend it in the future.  If some
commands in the future need a different "how do we read it" (e.g. for
structured reply), nbd_read_reply_entry may not have all the information
it needs anymore.


how is it possible? all information is in requests[i].



In fact, you're pushing knowledge of the commands from nbd_co_request to
nbd_read_reply_entry:

+if (s->reply.error == 0 &&
+s->requests[i].request->type == NBD_CMD_READ)

and knowledge of NBD_CMD_READ simply doesn't belong there.  So there is
an example of that right now, it is not some arbitrary bad thing that
could happen in the future.

Paolo


I can't agree that my point of view is wrong, it's just different.
You want commands, reading from socket, each knows what it should read.
I want the receiver - one sequential reader that can read all kinds of
replies and just wake requesting coroutines when all reading is done.
For me it looks simpler than switching.

We can add reading callbacks for commands which have some payload, like
s->requests[i].read_payload(ioc, request) or maybe
s->requests[i].request->read_payload(...),
and call them from reply_entry instead of if (s->... == NBD_CMD_READ).

How do you see the refactoring for introducing structured reply?






nbd_read_reply_entry has all the
information it needs (s->requests[i].request) to handle the whole reply.
It's an improvement, because it leads to separating the reply_entry
coroutine - it doesn't need any synchronization (yield/wake) with
request coroutines anymore.


   /* We're woken up again by the request itself.  Note that
there
* is no race between yielding and reentering
read_reply_co.  This
* is because:
@@ -118,6 +133,7 @@ static coroutine_fn void
nbd_read_reply_entry(void *opaque)
   s->read_reply_co = NULL;
   }
   +/* qiov should be provided for READ and WRITE */
   static int nbd_co_send_request(BlockDriverState *bs,
  NBDRequest *request,
  QEMUIOVector *qiov)
@@ -145,6 +161,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
 request->handle = INDEX_TO_HANDLE(s, i);
   s->requests[i].request = request;
+s->requests[i].qiov = qiov;
 if (s->quit) {
   rc = -EIO;
@@ -155,7 +172,8 @@ static int nbd_co_send_request(BlockDriverState *bs,
   goto err;
   }
   -if (qiov) {
+if (s->requests[i].request->type == NBD_CMD_WRITE) {
+assert(qiov);
   qio_channel_set_cork(s->ioc, true);
   rc = nbd_send_request(s->ioc, request);
   if (rc >= 0 && !s->quit) {
@@ -181,9 +199,7 @@ err:
   return rc;
   }
   -static int nbd_co_receive_reply(NBDClientSession *s,
-NBDRequest *request,
-QEMUIOVector *qiov)
+static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest
*request)
   {
   int ret;
   int i = HANDLE_TO_INDEX(s, request->handle);
@@ -197,14 +213,6 @@ static int nbd_

Re: [Qemu-block] [Qemu-devel] [PATCH] nvme: fix cmbuf leak on exit

2017-09-19 Thread Philippe Mathieu-Daudé

Hi Stefan,

On 09/19/2017 06:48 AM, Stefan Hajnoczi wrote:

Cc: Keith Busch 
Signed-off-by: Stefan Hajnoczi 
---
  hw/block/nvme.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 9aa32692a3..513ec7065d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1057,6 +1057,7 @@ static void nvme_exit(PCIDevice *pci_dev)
  g_free(n->namespaces);
  g_free(n->cq);
  g_free(n->sq);
+g_free(n->cmbuf);


I'd move it 1 line below,


  if (n->cmbsz) {


here:
g_free(n->cmbuf);


  memory_region_unref(&n->ctrl_mem);
  }



Anyway g_free() handles NULL, so:
Reviewed-by: Philippe Mathieu-Daudé 



[Qemu-block] [PATCH 2/4] scsi: build qemu-pr-helper

2017-09-19 Thread Paolo Bonzini
Introduce a privileged helper to run persistent reservation commands.
This lets virtual machines send persistent reservations without using
CAP_SYS_RAWIO or out-of-tree patches.  The helper uses Unix permissions
and SCM_RIGHTS to restrict access to processes that can access its socket
and prove that they have an open file descriptor for a raw SCSI device.

The next patch will also correct the usage of persistent reservations
with multipath devices.

It would also be possible to support Linux's IOC_PR_* ioctls in
the future, to support NVMe devices.  For now, however, only SCSI is
supported.

Signed-off-by: Paolo Bonzini 
---
v1->v2: do not install man page on !Linux systems
fix typos in documentation
clarify initialization protocol
clarify concurrency
added BSD license to protocol header
added payload size to reply, to handle residual (buffer underrun)
fixed closing of file descriptors on error
block PERSISTENT RESERVE OUT for read-only file descriptors
do not use g_assert
use EXIT_SUCCESS/EXIT_FAILURE consistently
removed all CONFIG_MPATH usage from this patch

 Makefile   |   2 +
 configure  |   2 +-
 docs/interop/pr-helper.rst |  83 ++
 docs/pr-manager.rst|  33 +++
 scsi/pr-helper.h   |  41 +++
 scsi/qemu-pr-helper.c  | 698 +
 6 files changed, 866 insertions(+), 1 deletions(-)
 create mode 100644 docs/interop/pr-helper.rst
 create mode 100644 scsi/pr-helper.h
 create mode 100644 scsi/qemu-pr-helper.c

diff --git a/Makefile b/Makefile
index b53fc69a60..51503ee8ad 100644
--- a/Makefile
+++ b/Makefile
@@ -373,6 +376,8 @@ qemu-bridge-helper$(EXESUF): qemu-bridge-helper.o 
$(COMMON_LDADDS)
 fsdev/virtfs-proxy-helper$(EXESUF): fsdev/virtfs-proxy-helper.o 
fsdev/9p-marshal.o fsdev/9p-iov-marshal.o $(COMMON_LDADDS)
 fsdev/virtfs-proxy-helper$(EXESUF): LIBS += -lcap
 
+scsi/qemu-pr-helper$(EXESUF): scsi/qemu-pr-helper.o scsi/utils.o 
$(crypto-obj-y) $(io-obj-y) $(qom-obj-y) $(COMMON_LDADDS)
+
 qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx $(SRC_PATH)/scripts/hxtool
$(call quiet-command,sh $(SRC_PATH)/scripts/hxtool -h < $< > 
$@,"GEN","$@")
 
diff --git a/configure b/configure
index 94db2d103e..fba27cf0fc 100755
--- a/configure
+++ b/configure
@@ -6506,7 +6506,7 @@ fi
 
 # build tree in object directory in case the source is not in the current 
directory
 DIRS="tests tests/tcg tests/tcg/cris tests/tcg/lm32 tests/libqos 
tests/qapi-schema tests/tcg/xtensa tests/qemu-iotests"
-DIRS="$DIRS docs docs/interop fsdev"
+DIRS="$DIRS docs docs/interop fsdev scsi"
 DIRS="$DIRS pc-bios/optionrom pc-bios/spapr-rtas pc-bios/s390-ccw"
 DIRS="$DIRS roms/seabios roms/vgabios"
 DIRS="$DIRS qapi-generated"
diff --git a/docs/interop/pr-helper.rst b/docs/interop/pr-helper.rst
new file mode 100644
index 00..9f76d5bcf9
--- /dev/null
+++ b/docs/interop/pr-helper.rst
@@ -0,0 +1,83 @@
+..
+
+==
+Persistent reservation helper protocol
+==
+
+QEMU's SCSI passthrough devices, ``scsi-block`` and ``scsi-generic``,
+can delegate implementation of persistent reservations to an external
+(and typically privileged) program.  Persistent Reservations allow
+restricting access to block devices to specific initiators in a shared
+storage setup.
+
> +For a more detailed reference please refer to the SCSI Primary
+Commands standard, specifically the section on Reservations and the
+"PERSISTENT RESERVE IN" and "PERSISTENT RESERVE OUT" commands.
+
+This document describes the socket protocol used between QEMU's
+``pr-manager-helper`` object and the external program.
+
+.. contents::
+
+Connection and initialization
+-
+
+All data transmitted on the socket is big-endian.
+
+After connecting to the helper program's socket, the helper starts a simple
+feature negotiation process by writing four bytes corresponding to
+the features it exposes (``supported_features``).  QEMU reads it,
+then writes four bytes corresponding to the desired features of the
+helper program (``requested_features``).
+
+If a bit is 1 in ``requested_features`` and 0 in ``supported_features``,
+the corresponding feature is not supported by the helper and the connection
+is closed.  On the other hand, it is acceptable for a bit to be 0 in
+``requested_features`` and 1 in ``supported_features``; in this case,
+the helper will not enable the feature.
+
+Right now no feature is defined, so the two parties always write four
+zero bytes.
+
+Command format
+--------------
+
+It is invalid to send multiple commands concurrently on the same
+socket.  It is however possible to connect multiple sockets to the
+helper and send multiple commands to the helper for one or more
+file descriptors.
+
+A command consists of a request and a response.  A request consists
+of a 16-byte SCSI CDB.  A file descripto
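[Editorial aside: the negotiation handshake described in the document above is small enough to sketch. The snippet below models the helper side with injected ``recv``/``send`` callables (hypothetical names; the real helper works on a QIOChannel), so it can be exercised without a socket.]

```python
import struct

SUPPORTED_FEATURES = 0  # no optional features are defined yet


def helper_negotiate(recv, send):
    """Helper-side sketch of the handshake: advertise our features,
    then read the peer's requested_features.  recv(n)/send(buf) are
    injected byte-moving callables (illustrative, not the real API)."""
    send(struct.pack(">I", SUPPORTED_FEATURES))   # 4 bytes, big-endian
    (requested,) = struct.unpack(">I", recv(4))
    if requested & ~SUPPORTED_FEATURES:
        # peer asked for a feature we do not support: drop the connection
        raise ConnectionError("unsupported features: %#x" % requested)
    return requested
```

With no features defined yet, both sides simply exchange four zero bytes.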

[Qemu-block] [PATCH 1/4] scsi, file-posix: add support for persistent reservation management

2017-09-19 Thread Paolo Bonzini
It is a common requirement for virtual machine to send persistent
reservations, but this currently requires either running QEMU with
CAP_SYS_RAWIO, or using out-of-tree patches that let an unprivileged
QEMU bypass Linux's filter on SG_IO commands.

As an alternative mechanism, the next patches will introduce a
privileged helper to run persistent reservation commands without
expanding QEMU's attack surface unnecessarily.

The helper is invoked through a "pr-manager" QOM object, to which
file-posix.c passes SG_IO requests for PERSISTENT RESERVE OUT and
PERSISTENT RESERVE IN commands.  For example:

  $ qemu-system-x86_64 \
  -device virtio-scsi \
  -object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock \
  -drive if=none,id=hd,driver=raw,file.filename=/dev/sdb,file.pr-manager=helper0 \
  -device scsi-block,drive=hd

or:

  $ qemu-system-x86_64 \
  -device virtio-scsi \
  -object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock \
  -blockdev node-name=hd,driver=raw,file.driver=host_device,file.filename=/dev/sdb,file.pr-manager=helper0 \
  -device scsi-block,drive=hd

Multiple pr-manager implementations are conceivable and possible, though
only one is implemented right now.  For example, a pr-manager could:

- talk directly to the multipath daemon from a privileged QEMU
  (i.e. QEMU links to libmpathpersist); this makes reservation work
  properly with multipath, but still requires CAP_SYS_RAWIO

- use the Linux IOC_PR_* ioctls (they require CAP_SYS_ADMIN though)

- more interestingly, implement reservations directly in QEMU
  through file system locks or a shared database (e.g. sqlite)
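[Editorial aside: for the second option, a rough idea of what the IOC_PR_* path would look like. A sketch only: the ioctl encoding and struct layout below are transcribed from <linux/pr.h> and should be double-checked against your kernel headers, and actually issuing the ioctl needs CAP_SYS_ADMIN and a real block device.]

```python
import struct

# Linux ioctl encoding: nr in bits 0-7, type 8-15, size 16-29, dir 30-31
_IOC_WRITE = 1


def _IOW(ioc_type, nr, size):
    return (_IOC_WRITE << 30) | (size << 16) | (ord(ioc_type) << 8) | nr


# struct pr_registration { __u64 old_key; __u64 new_key; __u32 flags;
#                          __u32 __pad; }  -> 24 bytes
PR_REGISTRATION = struct.Struct("=QQII")
IOC_PR_REGISTER = _IOW('p', 200, PR_REGISTRATION.size)


def pr_register(fd, old_key, new_key, flags=0):
    """Register a reservation key on the block device open as fd
    (requires CAP_SYS_ADMIN; not exercised here)."""
    import fcntl
    fcntl.ioctl(fd, IOC_PR_REGISTER,
                PR_REGISTRATION.pack(old_key, new_key, flags, 0))
```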

Signed-off-by: Paolo Bonzini 
---
 Makefile.objs |   1 +
 block/file-posix.c|  30 +
 docs/pr-manager.rst   |  51 ++
 include/scsi/pr-manager.h |  56 
 qapi/block-core.json  |   4 ++
 scsi/Makefile.objs|   2 +
 scsi/pr-manager.c | 109 ++
 vl.c  |   3 +-
 8 files changed, 255 insertions(+), 1 deletion(-)
 create mode 100644 docs/pr-manager.rst
 create mode 100644 include/scsi/pr-manager.h
 create mode 100644 scsi/pr-manager.c

diff --git a/Makefile.objs b/Makefile.objs
index 0caa8a5cf8..12abaa6191 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -170,6 +170,7 @@ trace-events-subdirs += qapi
 trace-events-subdirs += accel/tcg
 trace-events-subdirs += accel/kvm
 trace-events-subdirs += nbd
+trace-events-subdirs += scsi
 
 trace-events-files = $(SRC_PATH)/trace-events 
$(trace-events-subdirs:%=$(SRC_PATH)/%/trace-events)
 
diff --git a/block/file-posix.c b/block/file-posix.c
index 6acbd56238..ab12a2b591 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -33,6 +33,9 @@
 #include "block/raw-aio.h"
 #include "qapi/qmp/qstring.h"
 
+#include "scsi/pr-manager.h"
+#include "scsi/constants.h"
+
 #if defined(__APPLE__) && (__MACH__)
 #include 
 #include 
@@ -155,6 +158,8 @@ typedef struct BDRVRawState {
 bool page_cache_inconsistent:1;
 bool has_fallocate;
 bool needs_alignment;
+
+PRManager *pr_mgr;
 } BDRVRawState;
 
 typedef struct BDRVRawReopenState {
@@ -402,6 +407,11 @@ static QemuOptsList raw_runtime_opts = {
 .type = QEMU_OPT_STRING,
 .help = "file locking mode (on/off/auto, default: auto)",
 },
+{
+.name = "pr-manager",
+.type = QEMU_OPT_STRING,
+.help = "id of persistent reservation manager object (default: 
none)",
+},
 { /* end of list */ }
 },
 };
@@ -413,6 +423,7 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 QemuOpts *opts;
 Error *local_err = NULL;
 const char *filename = NULL;
+const char *str;
 BlockdevAioOptions aio, aio_default;
 int fd, ret;
 struct stat st;
@@ -476,6 +487,16 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 abort();
 }
 
+str = qemu_opt_get(opts, "pr-manager");
+if (str) {
+s->pr_mgr = pr_manager_lookup(str, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+ret = -EINVAL;
+goto fail;
+}
+}
+
 s->open_flags = open_flags;
 raw_parse_flags(bdrv_flags, &s->open_flags);
 
@@ -2597,6 +2618,15 @@ static BlockAIOCB *hdev_aio_ioctl(BlockDriverState *bs,
 if (fd_open(bs) < 0)
 return NULL;
 
+if (req == SG_IO && s->pr_mgr) {
+struct sg_io_hdr *io_hdr = buf;
+if (io_hdr->cmdp[0] == PERSISTENT_RESERVE_OUT ||
+io_hdr->cmdp[0] == PERSISTENT_RESERVE_IN) {
+return pr_manager_execute(s->pr_mgr, bdrv_get_aio_context(bs),
+  s->fd, io_hdr, cb, opaque);
+}
+}
+
 acb = g_new(RawPosixAIOData, 1);
 acb->bs = bs;
 acb->aio_type = QEMU_AIO_IOCTL;
diff --git a/docs/pr-manager.rst b/docs/pr-manager.rst
new

[Qemu-block] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper

2017-09-19 Thread Paolo Bonzini
This adds a concrete subclass of pr-manager that talks to qemu-pr-helper.

Signed-off-by: Paolo Bonzini 
---
v1->v2: fixed string property double-free
fixed/cleaned up error handling
handle buffer underrun

 scsi/Makefile.objs   |   2 +-
 scsi/pr-manager-helper.c | 302 +++
 2 files changed, 303 insertions(+), 1 deletion(-)
 create mode 100644 scsi/pr-manager-helper.c

diff --git a/scsi/Makefile.objs b/scsi/Makefile.objs
index 5496d2ae6a..4d25e476cf 100644
--- a/scsi/Makefile.objs
+++ b/scsi/Makefile.objs
@@ -1,3 +1,3 @@
 block-obj-y += utils.o
 
-block-obj-$(CONFIG_LINUX) += pr-manager.o
+block-obj-$(CONFIG_LINUX) += pr-manager.o pr-manager-helper.o
diff --git a/scsi/pr-manager-helper.c b/scsi/pr-manager-helper.c
new file mode 100644
index 00..063fd35dee
--- /dev/null
+++ b/scsi/pr-manager-helper.c
@@ -0,0 +1,302 @@
+/*
+ * Persistent reservation manager that talks to qemu-pr-helper
+ *
+ * Copyright (c) 2017 Red Hat, Inc.
+ *
+ * Author: Paolo Bonzini 
+ *
+ * This code is licensed under the LGPL v2.1 or later.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "scsi/constants.h"
+#include "scsi/pr-manager.h"
+#include "scsi/utils.h"
+#include "io/channel.h"
+#include "io/channel-socket.h"
+#include "pr-helper.h"
+
+#include 
+
+#define PR_MAX_RECONNECT_ATTEMPTS 5
+
+#define TYPE_PR_MANAGER_HELPER "pr-manager-helper"
+
+#define PR_MANAGER_HELPER(obj) \
+ OBJECT_CHECK(PRManagerHelper, (obj), \
+  TYPE_PR_MANAGER_HELPER)
+
+typedef struct PRManagerHelper {
+/*  */
+PRManager parent;
+
+char *path;
+
+QemuMutex lock;
+QIOChannel *ioc;
+} PRManagerHelper;
+
+/* Called with lock held.  */
+static int pr_manager_helper_read(PRManagerHelper *pr_mgr,
+  void *buf, int sz, Error **errp)
+{
+ssize_t r = qio_channel_read_all(pr_mgr->ioc, buf, sz, errp);
+
+if (r < 0) {
+object_unref(OBJECT(pr_mgr->ioc));
+pr_mgr->ioc = NULL;
+return -EINVAL;
+}
+
+return 0;
+}
+
+/* Called with lock held.  */
+static int pr_manager_helper_write(PRManagerHelper *pr_mgr,
+   int fd,
+   const void *buf, int sz, Error **errp)
+{
+size_t nfds = (fd != -1);
+while (sz > 0) {
+struct iovec iov;
+ssize_t n_written;
+
+iov.iov_base = (void *)buf;
+iov.iov_len = sz;
+n_written = qio_channel_writev_full(QIO_CHANNEL(pr_mgr->ioc), &iov, 1,
+nfds ? &fd : NULL, nfds, errp);
+
+if (n_written <= 0) {
+assert(n_written != QIO_CHANNEL_ERR_BLOCK);
+object_unref(OBJECT(pr_mgr->ioc));
+return n_written < 0 ? -EINVAL : 0;
+}
+
+nfds = 0;
+buf += n_written;
+sz -= n_written;
+}
+
+return 0;
+}
+
+/* Called with lock held.  */
+static int pr_manager_helper_initialize(PRManagerHelper *pr_mgr,
+Error **errp)
+{
+char *path = g_strdup(pr_mgr->path);
+SocketAddress saddr = {
+.type = SOCKET_ADDRESS_TYPE_UNIX,
+.u.q_unix.path = path
+};
+QIOChannelSocket *sioc = qio_channel_socket_new();
+Error *local_err = NULL;
+
+uint32_t flags;
+int r;
+
+assert(!pr_mgr->ioc);
+qio_channel_set_name(QIO_CHANNEL(sioc), "pr-manager-helper");
+qio_channel_socket_connect_sync(sioc,
+&saddr,
+&local_err);
+g_free(path);
+if (local_err) {
+object_unref(OBJECT(sioc));
+error_propagate(errp, local_err);
+return -ENOTCONN;
+}
+
+qio_channel_set_delay(QIO_CHANNEL(sioc), false);
+pr_mgr->ioc = QIO_CHANNEL(sioc);
+
+/* A simple feature negotiation protocol, even though there is
+ * no optional feature right now.
+ */
+r = pr_manager_helper_read(pr_mgr, &flags, sizeof(flags), errp);
+if (r < 0) {
+goto out_close;
+}
+
+flags = 0;
+r = pr_manager_helper_write(pr_mgr, -1, &flags, sizeof(flags), errp);
+if (r < 0) {
+goto out_close;
+}
+
+return 0;
+
+out_close:
+object_unref(OBJECT(pr_mgr->ioc));
+pr_mgr->ioc = NULL;
+return r;
+}
+
+static int pr_manager_helper_run(PRManager *p,
+ int fd, struct sg_io_hdr *io_hdr)
+{
+PRManagerHelper *pr_mgr = PR_MANAGER_HELPER(p);
+
+uint32_t len;
+PRHelperResponse resp;
+int ret;
+int expected_dir;
+int attempts;
+uint8_t cdb[PR_HELPER_CDB_SIZE] = { 0 };
+
+if (!io_hdr->cmd_len || io_hdr->cmd_len > PR_HELPER_CDB_SIZE) {
+return -EINVAL;
+}
+
+memcpy(cdb, io_hdr->cmdp, io_hdr->cmd_len);
+assert(cdb[0] == PERSISTENT_RESERVE_OUT || cdb[0] == 
PERSISTENT_RESERVE_IN);
+expected_dir =
+(cdb[0] == PERSISTENT_RESERVE_OUT ? SG_DXFER_TO_DEV : 
SG_DXFER_FRO

[Qemu-block] [PATCH 3/4] scsi: add multipath support to qemu-pr-helper

2017-09-19 Thread Paolo Bonzini
Proper support of persistent reservation for multipath devices requires
communication with the multipath daemon, so that the reservation is
registered and applied when a path comes up.  The device mapper
utilities provide a library to do so; this patch makes qemu-pr-helper.c
detect multipath devices and, when one is found, delegate the operation
to libmpathpersist.

Signed-off-by: Paolo Bonzini 
---
v1->v2: moved dm_init/multipath_pr_init to this patch
drop CAP_SYS_ADMIN if multipath not compiled in
simplify buffer size handling in multipath PERSISTENT RESERVE IN
block REGISTER AND MOVE operation for multipath PERSISTENT RESERVE OUT
fixed transport id handling in multipath PERSISTENT RESERVE OUT

 Makefile  |   3 +
 configure |  57 -
 docs/pr-manager.rst   |  27 
 include/scsi/utils.h  |   4 +
 scsi/qemu-pr-helper.c | 346 +-
 scsi/utils.c  |  10 ++
 6 files changed, 441 insertions(+), 6 deletions(-)

diff --git a/Makefile b/Makefile
index 51503ee8ad..39ec5112b6 100644
--- a/Makefile
+++ b/Makefile
@@ -377,6 +377,9 @@ fsdev/virtfs-proxy-helper$(EXESUF): 
fsdev/virtfs-proxy-helper.o fsdev/9p-marshal
 fsdev/virtfs-proxy-helper$(EXESUF): LIBS += -lcap
 
 scsi/qemu-pr-helper$(EXESUF): scsi/qemu-pr-helper.o scsi/utils.o 
$(crypto-obj-y) $(io-obj-y) $(qom-obj-y) $(COMMON_LDADDS)
+ifdef CONFIG_MPATH
+scsi/qemu-pr-helper$(EXESUF): LIBS += -ludev -lmultipath -lmpathpersist
+endif
 
 qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx $(SRC_PATH)/scripts/hxtool
$(call quiet-command,sh $(SRC_PATH)/scripts/hxtool -h < $< > 
$@,"GEN","$@")
diff --git a/configure b/configure
index fba27cf0fc..bea343c701 100755
--- a/configure
+++ b/configure
@@ -290,6 +290,7 @@ netmap="no"
 sdl=""
 sdlabi=""
 virtfs=""
+mpath=""
 vnc="yes"
 sparse="no"
 vde=""
@@ -936,6 +937,10 @@ for opt do
   ;;
   --enable-virtfs) virtfs="yes"
   ;;
+  --disable-mpath) mpath="no"
+  ;;
+  --enable-mpath) mpath="yes"
+  ;;
   --disable-vnc) vnc="no"
   ;;
   --enable-vnc) vnc="yes"
@@ -1479,6 +1484,7 @@ disabled with --disable-FEATURE, default is enabled if 
available:
   vnc-png PNG compression for VNC server
   cocoa   Cocoa UI (Mac OS X only)
   virtfs  VirtFS
+  mpath   Multipath persistent reservation passthrough
   xen xen backend driver support
   xen-pci-passthrough
  brlapi  BrlAPI (Braille)
@@ -3300,6 +3306,29 @@ else
 fi
 
 ##
+# libmpathpersist probe
+
+if test "$mpath" != "no" ; then
+  cat > $TMPC <<EOF
+#include <libudev.h>
+#include <mpath_persist.h>
+unsigned mpath_mx_alloc_len = 1024;
+int logsink;
+int main(void) {
+struct udev *udev = udev_new();
+mpath_lib_init(udev);
+}
+EOF
+  if compile_prog "" "-ludev -lmultipath -lmpathpersist" ; then
+mpathpersist=yes
+  else
+mpathpersist=no
+  fi
+else
+  mpathpersist=no
+fi
+
+##
 # libcap probe
 
 if test "$cap" != "no" ; then
@@ -5034,16 +5063,34 @@ if test "$want_tools" = "yes" ; then
   fi
 fi
 if test "$softmmu" = yes ; then
-  if test "$virtfs" != no ; then
-if test "$cap" = yes && test "$linux" = yes && test "$attr" = yes ; then
+  if test "$linux" = yes; then
+if test "$virtfs" != no && test "$cap" = yes && test "$attr" = yes ; then
   virtfs=yes
   tools="$tools fsdev/virtfs-proxy-helper\$(EXESUF)"
 else
   if test "$virtfs" = yes; then
-error_exit "VirtFS is supported only on Linux and requires libcap 
devel and libattr devel"
+error_exit "VirtFS requires libcap devel and libattr devel"
   fi
   virtfs=no
 fi
+if test "$mpath" != no && test "$mpathpersist" = yes ; then
+  mpath=yes
+  tools="$tools mpath/qemu-mpath-helper\$(EXESUF)"
+else
+  if test "$mpath" = yes; then
+error_exit "Multipath requires libmpathpersist devel"
+  fi
+  mpath=no
+fi
+  else
+if test "$virtfs" = yes; then
+  error_exit "VirtFS is supported only on Linux"
+fi
+virtfs=no
+if test "$mpath" = yes; then
+  error_exit "Multipath is supported only on Linux"
+fi
+mpath=no
   fi
 fi
 
@@ -5289,6 +5336,7 @@ echo "Audio drivers $audio_drv_list"
 echo "Block whitelist (rw) $block_drv_rw_whitelist"
 echo "Block whitelist (ro) $block_drv_ro_whitelist"
 echo "VirtFS support$virtfs"
+echo "Multipath support $mpath"
 echo "VNC support   $vnc"
 if test "$vnc" = "yes" ; then
 echo "VNC SASL support  $vnc_sasl"
@@ -5732,6 +5780,9 @@ fi
 if test "$virtfs" = "yes" ; then
   echo "CONFIG_VIRTFS=y" >> $config_host_mak
 fi
+if test "$mpath" = "yes" ; then
+  echo "CONFIG_MPATH=y" >> $config_host_mak
+fi
 if test "$vhost_scsi" = "yes" ; then
   echo "CONFIG_VHOST_SCSI=y" >> $config_host_mak
 fi
diff --git a/docs/pr-manager.rst b/docs/pr-manager.rst
index 7107e59fb8..9b1de198b1 100644
--- a/docs/pr-manager.rst
+++ b/docs/pr-manager.rst
@@ -60,6 +60,7 @@ system ser

[Qemu-block] [PATCH v2 0/4] scsi, block: introduce persistent reservation managers

2017-09-19 Thread Paolo Bonzini
SCSI persistent Reservations allow restricting access to block devices
to specific initiators in a shared storage setup.  When implementing
clustering of virtual machines, it is a common requirement for virtual
machines to send persistent reservation SCSI commands.  However,
the operating system restricts sending these commands to unprivileged
programs because incorrect usage can disrupt regular operation of the
storage fabric.

With these patches, the scsi-block and scsi-generic SCSI passthrough
devices learn to delegate the implementation of persistent reservations to
a separate object, the "persistent reservation manager".  The persistent
reservation manager talks to a separate privileged program, with a very
simple protocol based on SCM_RIGHTS.  In addition to invoking PERSISTENT
RESERVATION OUT and PERSISTENT RESERVATION IN commands, the privileged
component can also use libmpathpersist so that persistent reservations
are applied to all paths in a multipath setting.
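[Editorial aside: the SCM_RIGHTS mechanism mentioned above — passing an open file descriptor over a Unix socket alongside the command — can be demonstrated in a few lines. This is an illustrative sketch of fd passing only, not the helper's actual framing.]

```python
import array
import os
import socket
import tempfile


def send_fd(sock, fd, payload=b"\0"):
    # One byte of payload plus an SCM_RIGHTS ancillary message carrying fd.
    sock.sendmsg([payload],
                 [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                   array.array("i", [fd]))])


def recv_fd(sock):
    msg, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_SPACE(4))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds = array.array("i")
            fds.frombytes(data[:4])
            return msg, fds[0]
    raise RuntimeError("no file descriptor received")


if __name__ == "__main__":
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.TemporaryFile() as f:
        f.write(b"device payload")
        f.flush()
        send_fd(a, f.fileno())
        _, fd = recv_fd(b)          # the "privileged helper" side
        os.lseek(fd, 0, os.SEEK_SET)
        print(os.read(fd, 14).decode())  # prints: device payload
        os.close(fd)
```

The receiver ends up with its own duplicate of the descriptor, so the unprivileged sender can close its copy immediately after sending.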

Patch 1 defines the abstract QOM class and plugs it into block/file-posix.c.

Patch 2 and 3 introduce the privileged helper program, while patch 4
defines the concrete QOM class that talks to it.

Paolo

v1->v2: removed scsi/ patches which were all reviewed
removed man page (requires .texi while I used .rst for docs)
qio_channel_read/write_all also went in independently
fix installation of qemu-pr-helper man page
fixes to documentation
added BSD license to protocol header
added handling of residual (buffer underrun)
fixed closing of file descriptors on error
block PERSISTENT RESERVE OUT for read-only file descriptors
do not use g_assert
use EXIT_SUCCESS/EXIT_FAILURE consistently
moved all CONFIG_MPATH usage to the right patch
drop CAP_SYS_ADMIN if multipath not compiled in
simplify buffer size handling in multipath PERSISTENT RESERVE IN
block REGISTER AND MOVE operation for multipath PERSISTENT RESERVE OUT
fixed transport id handling in multipath PERSISTENT RESERVE OUT
fixed string property double-free in pr-manager-helper
fixed/cleaned up error handling in pr-manager-helper

Paolo Bonzini (4):
  scsi, file-posix: add support for persistent reservation management
  scsi: build qemu-pr-helper
  scsi: add multipath support to qemu-pr-helper
  scsi: add persistent reservation manager using qemu-pr-helper

 Makefile   |   14 +-
 configure  |   59 ++-
 docs/interop/pr-helper.rst |   83 
 docs/pr-manager.rst|   60 +++
 include/scsi/utils.h   |4 +
 scsi/Makefile.objs |2 +-
 scsi/pr-helper.h   |   41 ++
 scsi/pr-manager-helper.c   |  302 +
 scsi/qemu-pr-helper.c  | 1038 
 scsi/utils.c   |   10 +
 10 files changed, 1607 insertions(+), 6 deletions(-)

-- 
2.13.5




[Qemu-block] qemu-img: Check failed: No space left on device

2017-09-19 Thread Nicolas Ecarnot

Hello,

First post here, so maybe I should introduce myself :
- I'm a sysadmin for decades and currently managing 4 oVirt clusters, 
made out of tens of hypervisors, all are CentOS 7.2+ based.
- I'm very happy with this solution we choose especially because it is 
based on qemu-kvm (open source, reliable, documented).


On one VM, we experienced the following :
- oVirt/vdsm is detecting an issue on the image
- following the hints at https://access.redhat.com/solutions/1173623, I 
managed to detect one error and fix it

- the VM is now running perfectly

On two other VMs, we experienced a similar situation, except the check 
stage is showing something like 14000+ errors, and the relevant logs are :


Repairing refcount block 14 is outside image
ERROR could not resize image: Invalid argument
ERROR cluster 425984 refcount=0 reference=1
ERROR cluster 425985 refcount=0 reference=1
[... repeating the previous line 7000+ times...]
ERROR cluster 457166 refcount=0 reference=1
Rebuilding refcount structure
ERROR writing refblock: No space left on device
qemu-img: Check failed: No space left on device

You surely know that oVirt/RHEV is storing its qcow2 images in dedicated 
logical volumes.


pvs/vgs/lvs are all showing there is plenty of space available, so I 
understand that I don't understand what "No space left on device" means.


The relevant versions are :
- qemu-img-ev-2.3.0-31.el7_2.10.1.x86_64.
- libvirt-daemon-kvm-1.2.17-13.el7_2.4.x86_64
- vdsm-4.17.32-1.el7.noarch
- # qemu-img info /the/relevant/logical/volume/path
file format: qcow2
virtual size: 30G (32212254720 bytes)
disk size: 0
cluster_size: 65536
Format specific information:
compat: 0.10
refcount bits: 16

Is there any hope I'll be able to repair this image?

--
Nicolas ECARNOT



Re: [Qemu-block] [PATCH v2 6/7] block/nbd-client: early fail nbd_read_reply_entry if s->quit is set

2017-09-19 Thread Paolo Bonzini
On 19/09/2017 11:43, Vladimir Sementsov-Ogievskiy wrote:
>>
>> I'm trying to look forward to structured reads, where we will have to
>> deal with more than one server message in reply to a single client
>> request.  For read, we just piece together portions of the qiov until
>> the server has sent us all the pieces.  But for block query, I really do
>> think we'll want to defer to specialized handlers for doing further
>> reads (the amount of data to be read from the server is not known by the
>> client a priori, so it is a two-part read, one to get the length, and
>> another to read that remaining length); if we need some sort of callback
>> function rather than cramming all the logic here in the main read loop,
> 
> By patch 3 I do not mean in any way that I want to have all the reading
> stuff in one function; of course it should be refactored
> for the block-status addition. I just want to have all the reading stuff
> in one coroutine. Reading from ioc is sequential, so it's
> natural to do it sequentially in one coroutine, without switches.

But why is that a goal?  See it as a state machine that goes between the
"waiting for header" and "waiting for payload" states.  Reading header
corresponds to read_reply_co waiting on the socket (and reading when
it's ready); reading payload corresponds to the request coroutine
waiting on the socket and reading when it's ready.
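[Editorial aside: the two-state reader Paolo describes can be made concrete with a toy model. The header layout and names below are illustrative only, not the NBD wire format; in QEMU the "payload" state is the request coroutine itself reading from the socket.]

```python
import io
import struct

HEADER = struct.Struct(">IQ")   # toy header: error (u32), handle (u64)


def reply_reader(stream, pending):
    """Alternate between 'waiting for header' and 'waiting for payload'.

    pending maps handle -> expected payload length for in-flight requests.
    """
    results = {}
    while pending:
        raw = stream.read(HEADER.size)
        if len(raw) < HEADER.size:           # disconnect
            break
        error, handle = HEADER.unpack(raw)   # state 1: header
        length = pending.pop(handle)
        payload = stream.read(length) if not error else b""  # state 2
        results[handle] = (error, payload)
    return results
```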

Paolo



Re: [Qemu-block] [PATCH v2 3/7] block/nbd-client: refactor reading reply

2017-09-19 Thread Paolo Bonzini
On 19/09/2017 11:25, Vladimir Sementsov-Ogievskiy wrote:
> 18.09.2017 18:43, Paolo Bonzini wrote:
>> On 18/09/2017 15:59, Vladimir Sementsov-Ogievskiy wrote:
>>> Read the whole reply in one place - in nbd_read_reply_entry.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy 
>>> ---
>>>   block/nbd-client.h |  1 +
>>>   block/nbd-client.c | 42 --
>>>   2 files changed, 25 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/block/nbd-client.h b/block/nbd-client.h
>>> index b1900e6a6d..3f97d31013 100644
>>> --- a/block/nbd-client.h
>>> +++ b/block/nbd-client.h
>>> @@ -21,6 +21,7 @@ typedef struct {
>>>   Coroutine *coroutine;
>>>   bool receiving; /* waiting for read_reply_co? */
>>>   NBDRequest *request;
>>> +QEMUIOVector *qiov;
>>>   } NBDClientRequest;
>>> typedef struct NBDClientSession {
>>> diff --git a/block/nbd-client.c b/block/nbd-client.c
>>> index 5f241ecc22..f7b642f079 100644
>>> --- a/block/nbd-client.c
>>> +++ b/block/nbd-client.c
>>> @@ -98,6 +98,21 @@ static coroutine_fn void nbd_read_reply_entry(void
>>> *opaque)
>>>   break;
>>>   }
>>>   +if (s->reply.error == 0 &&
>>> +s->requests[i].request->type == NBD_CMD_READ)
>>> +{
>>> +QEMUIOVector *qiov = s->requests[i].qiov;
>>> +assert(qiov != NULL);
>>> +assert(s->requests[i].request->len ==
>>> +   iov_size(qiov->iov, qiov->niov));
>>> +if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
>>> +  NULL) < 0)
>>> +{
>>> +s->reply.error = EIO;
>>> +break;
>>> +}
>>> +}
>> I am not sure this is an improvement.  In principle you could have
>> commands that read replies a bit at a time without using a QEMUIOVector.
>>
>> Paolo
> 
> Hm, what do you mean? In this patch I don't change "how do we read it",
> I only move the reading code to one coroutine, to make it simpler to
> extend in the future (block status, etc).

I disagree that it is easier to extend it in the future.  If some
commands in the future need a different "how do we read it" (e.g. for
structured reply), nbd_read_reply_entry may not have all the information
it needs anymore.

In fact, you're pushing knowledge of the commands from nbd_co_request to
nbd_read_reply_entry:

+if (s->reply.error == 0 &&
+s->requests[i].request->type == NBD_CMD_READ)

and knowledge of NBD_CMD_READ simply doesn't belong there.  So there is
an example of that right now, it is not some arbitrary bad thing that
could happen in the future.

Paolo


> nbd_read_reply_entry has all the
> information it needs (s->requests[i].request) to handle the whole reply.
> It's an improvement, because it leads to separating the reply_entry
> coroutine - it no longer needs any synchronization (yield/wake) with the
> request coroutines.
> 
>>
>>>   /* We're woken up again by the request itself.  Note that
>>> there
>>>* is no race between yielding and reentering
>>> read_reply_co.  This
>>>* is because:
>>> @@ -118,6 +133,7 @@ static coroutine_fn void
>>> nbd_read_reply_entry(void *opaque)
>>>   s->read_reply_co = NULL;
>>>   }
>>>   +/* qiov should be provided for READ and WRITE */
>>>   static int nbd_co_send_request(BlockDriverState *bs,
>>>  NBDRequest *request,
>>>  QEMUIOVector *qiov)
>>> @@ -145,6 +161,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
>>> request->handle = INDEX_TO_HANDLE(s, i);
>>>   s->requests[i].request = request;
>>> +s->requests[i].qiov = qiov;
>>> if (s->quit) {
>>>   rc = -EIO;
>>> @@ -155,7 +172,8 @@ static int nbd_co_send_request(BlockDriverState *bs,
>>>   goto err;
>>>   }
>>>   -if (qiov) {
>>> +if (s->requests[i].request->type == NBD_CMD_WRITE) {
>>> +assert(qiov);
>>>   qio_channel_set_cork(s->ioc, true);
>>>   rc = nbd_send_request(s->ioc, request);
>>>   if (rc >= 0 && !s->quit) {
>>> @@ -181,9 +199,7 @@ err:
>>>   return rc;
>>>   }
>>>   -static int nbd_co_receive_reply(NBDClientSession *s,
>>> -NBDRequest *request,
>>> -QEMUIOVector *qiov)
>>> +static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest
>>> *request)
>>>   {
>>>   int ret;
>>>   int i = HANDLE_TO_INDEX(s, request->handle);
>>> @@ -197,14 +213,6 @@ static int nbd_co_receive_reply(NBDClientSession
>>> *s,
>>>   } else {
>>>   assert(s->reply.handle == request->handle);
>>>   ret = -s->reply.error;
>>> -if (qiov && s->reply.error == 0) {
>>> -assert(request->len == iov_size(qiov->iov, qiov->niov));
>>> -if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
>>> -  NULL) < 0) {
>>> -   

Re: [Qemu-block] [PATCH v2 7/7] block/nbd-client: do not yield from nbd_read_reply_entry

2017-09-19 Thread Vladimir Sementsov-Ogievskiy

19.09.2017 01:36, Eric Blake wrote:

On 09/18/2017 08:59 AM, Vladimir Sementsov-Ogievskiy wrote:

Refactor nbd client to not yield from nbd_read_reply_entry. It's
possible now as all reading is done in nbd_read_reply_entry and
all request-related data is stored in per-request s->requests[i].

We need some additional work here with s->requests[i].ret and
s->quit to not fail requests on normal disconnect and to not
report reading errors on normal disconnect.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/nbd-client.c | 29 ++---
  1 file changed, 10 insertions(+), 19 deletions(-)

The diffstat may have merit, but I'm wondering:


diff --git a/block/nbd-client.c b/block/nbd-client.c
index f80a4c5564..bf20d5d5e6 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -79,7 +79,11 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
  while (!s->quit) {
  ret = nbd_receive_reply(s->ioc, &reply, &local_err);
  if (ret < 0) {
-error_report_err(local_err);
+if (s->quit) {
+error_free(local_err);

This discards errors on all remaining coroutines after we detect an
early exit. Are we sure the coroutine that set s->quit will have
reported an appropriate error, and thus explaining why we can discard
all secondary errors?  A comment in the code would be helpful.


I'll think about it.




+} else {
+error_report_err(local_err);
+}
  }
@@ -210,7 +204,7 @@ static int nbd_co_receive_reply(NBDClientSession *s, 
NBDRequest *request)
  s->requests[i].receiving = true;
  qemu_coroutine_yield();
  s->requests[i].receiving = false;
-if (!s->ioc || s->quit) {
+if (!s->ioc) {
  ret = -EIO;

Why don't we need to check s->quit any more?  That was just added
earlier in the series; why the churn?


it is "to not fail requests on normal disconnect", because reply_entry 
may already be finished.

Initializing "+    s->requests[i].ret = -EIO;" is enough.




  } else {
  ret = s->requests[i].ret;
@@ -218,11 +212,6 @@ static int nbd_co_receive_reply(NBDClientSession *s, 
NBDRequest *request)
  
  s->requests[i].coroutine = NULL;
  
-/* Kick the read_reply_co to get the next reply.  */

-if (s->read_reply_co) {
-aio_co_wake(s->read_reply_co);
-}
-
  qemu_co_mutex_lock(&s->send_mutex);
  s->in_flight--;
  qemu_co_queue_next(&s->free_sema);
@@ -364,6 +353,8 @@ void nbd_client_close(BlockDriverState *bs)
  
  nbd_send_request(client->ioc, &request);
  
+client->quit = true;

Previously, client->quit was only set when detecting a broken server,
now it is also set for a clean exit.  Do we need to change any
documentation of the field?


It has documentation?





--
Best regards,
Vladimir




Re: [Qemu-block] [Qemu-devel] [PATCH 15/18] block/mirror: Add active mirroring

2017-09-19 Thread Daniel P. Berrange
On Tue, Sep 19, 2017 at 10:44:16AM +0100, Stefan Hajnoczi wrote:
> On Mon, Sep 18, 2017 at 06:26:51PM +0200, Max Reitz wrote:
> > On 2017-09-18 12:06, Stefan Hajnoczi wrote:
> > > On Sat, Sep 16, 2017 at 03:58:01PM +0200, Max Reitz wrote:
> > >> On 2017-09-14 17:57, Stefan Hajnoczi wrote:
> > >>> On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote:
> >  This patch implements active synchronous mirroring.  In active mode, 
> >  the
> >  passive mechanism will still be in place and is used to copy all
> >  initially dirty clusters off the source disk; but every write request
> >  will write data both to the source and the target disk, so the source
> >  cannot be dirtied faster than data is mirrored to the target.  Also,
> >  once the block job has converged (BLOCK_JOB_READY sent), source and
> >  target are guaranteed to stay in sync (unless an error occurs).
> > 
> >  Optionally, dirty data can be copied to the target disk on read
> >  operations, too.
> > 
> >  Active mode is completely optional and currently disabled at runtime.  
> >  A
> >  later patch will add a way for users to enable it.
> > 
> >  Signed-off-by: Max Reitz 
> >  ---
> >   qapi/block-core.json |  23 +++
> >   block/mirror.c   | 187 
> >  +--
> >   2 files changed, 205 insertions(+), 5 deletions(-)
> > 
> >  diff --git a/qapi/block-core.json b/qapi/block-core.json
> >  index bb11815608..e072cfa67c 100644
> >  --- a/qapi/block-core.json
> >  +++ b/qapi/block-core.json
> >  @@ -938,6 +938,29 @@
> > 'data': ['top', 'full', 'none', 'incremental'] }
> >   
> >   ##
> >  +# @MirrorCopyMode:
> >  +#
> >  +# An enumeration whose values tell the mirror block job when to
> >  +# trigger writes to the target.
> >  +#
> >  +# @passive: copy data in background only.
> >  +#
> >  +# @active-write: when data is written to the source, write it
> >  +#(synchronously) to the target as well.  In addition,
> >  +#data is copied in background just like in @passive
> >  +#mode.
> >  +#
> >  +# @active-read-write: write data to the target (synchronously) both
> >  +# when it is read from and written to the source.
> >  +# In addition, data is copied in background just
> >  +# like in @passive mode.
> > >>>
> > >>> I'm not sure the terms "active"/"passive" are helpful.  "Active commit"
> > >>> means committing the top-most BDS while the guest is accessing it.  The
> > >>> "passive" mirror block still works on the top-most BDS while the guest
> > >>> is accessing it.
> > >>>
> > >>> Calling it "asynchronous" and "synchronous" is clearer to me.  It's also
> > >>> the terminology used in disk replication (e.g. DRBD).
> > >>
> > >> I'd be OK with that, too, but I think I remember that in the past at
> > >> least Kevin made a clear distinction between active/passive and
> > >> sync/async when it comes to mirroring.
> > >>
> > >>> Ideally the user wouldn't have to worry about async vs sync because QEMU
> > >>> would switch modes as appropriate in order to converge.  That way
> > >>> libvirt also doesn't have to worry about this.
> > >>
> > >> So here you mean async/sync in the way I meant it, i.e., whether the
> > >> mirror operations themselves are async/sync?
> > > 
> > > The meaning I had in mind is:
> > > 
> > > Sync mirroring means a guest write waits until the target write
> > > completes.
> > 
> > I.e. active-sync, ...
> > 
> > > Async mirroring means guest writes completes independently of target
> > > writes.
> > 
> > ... i.e. passive or active-async in the future.
> > 
> > So you really want qemu to decide whether to use active or passive mode
> > depending on what's enough to let the block job converge and not
> > introduce any switch for the user?
> > 
> > I'm not sure whether I like this too much, mostly because "libvirt
> > doesn't have to worry" doesn't feel quite right to me.  If we don't make
> > libvirt worry about this, then qemu has to worry.  I'm not sure whether
> > that's any better.
> > 
> > I think this really does get into policy territory.  Just switching to
> > active mode the instant target writes are slower than source writes may
> > not be what the user wants: Maybe it's OK for a short duration because
> > they don't care about hard convergence too much.  Maybe they want to
> > switch to active mode already when "only" twice as much is written to
> > the target as to the source.
> > 
> > I think this is a decision the management layer (or the user) has to make.
> 
> Eric: Does libvirt want to be involved with converging the mirror job
> (i.e. if the guest is writing to disk faster than QEMU can copy data to
> the target)?

Libvirt doesn't really want to set the policy - it will j

[Qemu-block] [PATCH] nvme: fix cmbuf leak on exit

2017-09-19 Thread Stefan Hajnoczi
Cc: Keith Busch 
Signed-off-by: Stefan Hajnoczi 
---
 hw/block/nvme.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 9aa32692a3..513ec7065d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1057,6 +1057,7 @@ static void nvme_exit(PCIDevice *pci_dev)
 g_free(n->namespaces);
 g_free(n->cq);
 g_free(n->sq);
+g_free(n->cmbuf);
 if (n->cmbsz) {
 memory_region_unref(&n->ctrl_mem);
 }
-- 
2.13.5




Re: [Qemu-block] [PATCH v2 6/7] block/nbd-client: early fail nbd_read_reply_entry if s->quit is set

2017-09-19 Thread Vladimir Sementsov-Ogievskiy

19.09.2017 01:27, Eric Blake wrote:

On 09/18/2017 08:59 AM, Vladimir Sementsov-Ogievskiy wrote:

Do not continue any operation if s->quit is set in parallel.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/nbd-client.c | 7 +++
  1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index 280147e6a7..f80a4c5564 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -81,7 +81,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
  if (ret < 0) {
  error_report_err(local_err);
  }
-if (ret <= 0) {
+if (ret <= 0 || s->quit) {
  break;
  }

I'm still not convinced this helps: either nbd_receive_reply() already
returned an error, or we got a successful reply header, at which point
either the command is done (no need to abort the command early - it
succeeded) or it is a read command (and we should detect at the point
where we try to populate the qiov that we don't want to read any more).
It depends on your 3/7 patch for the fact that we even try to read the
qiov directly in this while loop rather than in the coroutine handler,
where Paolo questioned whether we need that change.


@@ -105,9 +105,8 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
  assert(qiov != NULL);
  assert(s->requests[i].request->len ==
 iov_size(qiov->iov, qiov->niov));
-if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
-  NULL) < 0)
-{
+ret = qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov, NULL);
+if (ret < 0 || s->quit) {
  s->requests[i].ret = -EIO;
  break;
  }

The placement here looks odd. Either we should not attempt the read
because s->quit was already set (okay, your first addition makes sense
in that light), or we succeeded at the read (at which point, why do we
need to claim it failed?).

I'm trying to look forward to structured reads, where we will have to
deal with more than one server message in reply to a single client
request.  For read, we just piece together portions of the qiov until
the server has sent us all the pieces.  But for block query, I really do
think we'll want to defer to specialized handlers for doing further
reads (the amount of data to be read from the server is not known by the
client a priori, so it is a two-part read, one to get the length, and
another to read that remaining length); if we need some sort of callback
function rather than cramming all the logic here in the main read loop,


By patch 3 I do not mean in any way that I want to have all the reading 
code in one function; of course it should be refactored
for the block-status addition. I just want to have all the reading code 
in one coroutine. Reading from ioc is sequential, so it is

natural to do it sequentially in one coroutine, without switches.


then the qio_channel_readv_all for read commands would belong in the
callback, and it is more a question of the main read loop checking
whether there is an early quit condition before calling into the callback.



If I understand correctly, the discussion is:

me: if something fails, fail all other parallel operations - it's 
simple and clear, but we need to check s->quit after every possible yield


you: do _not_ fail all other parallel operations - maybe that is better, 
but in that case we need to carefully define which parallel operations

 we fail on s->quit and which we do not, and understand all effects of 
this division.

--
Best regards,
Vladimir




Re: [Qemu-block] [PATCH 15/18] block/mirror: Add active mirroring

2017-09-19 Thread Stefan Hajnoczi
On Mon, Sep 18, 2017 at 06:26:51PM +0200, Max Reitz wrote:
> On 2017-09-18 12:06, Stefan Hajnoczi wrote:
> > On Sat, Sep 16, 2017 at 03:58:01PM +0200, Max Reitz wrote:
> >> On 2017-09-14 17:57, Stefan Hajnoczi wrote:
> >>> On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote:
>  This patch implements active synchronous mirroring.  In active mode, the
>  passive mechanism will still be in place and is used to copy all
>  initially dirty clusters off the source disk; but every write request
>  will write data both to the source and the target disk, so the source
>  cannot be dirtied faster than data is mirrored to the target.  Also,
>  once the block job has converged (BLOCK_JOB_READY sent), source and
>  target are guaranteed to stay in sync (unless an error occurs).
> 
>  Optionally, dirty data can be copied to the target disk on read
>  operations, too.
> 
>  Active mode is completely optional and currently disabled at runtime.  A
>  later patch will add a way for users to enable it.
> 
>  Signed-off-by: Max Reitz 
>  ---
>   qapi/block-core.json |  23 +++
>   block/mirror.c   | 187 
>  +--
>   2 files changed, 205 insertions(+), 5 deletions(-)
> 
>  diff --git a/qapi/block-core.json b/qapi/block-core.json
>  index bb11815608..e072cfa67c 100644
>  --- a/qapi/block-core.json
>  +++ b/qapi/block-core.json
>  @@ -938,6 +938,29 @@
> 'data': ['top', 'full', 'none', 'incremental'] }
>   
>   ##
>  +# @MirrorCopyMode:
>  +#
>  +# An enumeration whose values tell the mirror block job when to
>  +# trigger writes to the target.
>  +#
>  +# @passive: copy data in background only.
>  +#
>  +# @active-write: when data is written to the source, write it
>  +#(synchronously) to the target as well.  In addition,
>  +#data is copied in background just like in @passive
>  +#mode.
>  +#
>  +# @active-read-write: write data to the target (synchronously) both
>  +# when it is read from and written to the source.
>  +# In addition, data is copied in background just
>  +# like in @passive mode.
> >>>
> >>> I'm not sure the terms "active"/"passive" are helpful.  "Active commit"
> >>> means committing the top-most BDS while the guest is accessing it.  The
> >>> "passive" mirror block still works on the top-most BDS while the guest
> >>> is accessing it.
> >>>
> >>> Calling it "asynchronous" and "synchronous" is clearer to me.  It's also
> >>> the terminology used in disk replication (e.g. DRBD).
> >>
> >> I'd be OK with that, too, but I think I remember that in the past at
> >> least Kevin made a clear distinction between active/passive and
> >> sync/async when it comes to mirroring.
> >>
> >>> Ideally the user wouldn't have to worry about async vs sync because QEMU
> >>> would switch modes as appropriate in order to converge.  That way
> >>> libvirt also doesn't have to worry about this.
> >>
> >> So here you mean async/sync in the way I meant it, i.e., whether the
> >> mirror operations themselves are async/sync?
> > 
> > The meaning I had in mind is:
> > 
> > Sync mirroring means a guest write waits until the target write
> > completes.
> 
> I.e. active-sync, ...
> 
> > Async mirroring means guest writes completes independently of target
> > writes.
> 
> ... i.e. passive or active-async in the future.
> 
> So you really want qemu to decide whether to use active or passive mode
> depending on what's enough to let the block job converge and not
> introduce any switch for the user?
> 
> I'm not sure whether I like this too much, mostly because "libvirt
> doesn't have to worry" doesn't feel quite right to me.  If we don't make
> libvirt worry about this, then qemu has to worry.  I'm not sure whether
> that's any better.
> 
> I think this really does get into policy territory.  Just switching to
> active mode the instant target writes are slower than source writes may
> not be what the user wants: Maybe it's OK for a short duration because
> they don't care about hard convergence too much.  Maybe they want to
> switch to active mode already when "only" twice as much is written to
> the target as to the source.
> 
> I think this is a decision the management layer (or the user) has to make.

Eric: Does libvirt want to be involved with converging the mirror job
(i.e. if the guest is writing to disk faster than QEMU can copy data to
the target)?

Stefan



Re: [Qemu-block] [PATCH v2 3/7] block/nbd-client: refactor reading reply

2017-09-19 Thread Vladimir Sementsov-Ogievskiy

18.09.2017 18:43, Paolo Bonzini wrote:

On 18/09/2017 15:59, Vladimir Sementsov-Ogievskiy wrote:

Read the whole reply in one place - in nbd_read_reply_entry.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/nbd-client.h |  1 +
  block/nbd-client.c | 42 --
  2 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/block/nbd-client.h b/block/nbd-client.h
index b1900e6a6d..3f97d31013 100644
--- a/block/nbd-client.h
+++ b/block/nbd-client.h
@@ -21,6 +21,7 @@ typedef struct {
  Coroutine *coroutine;
  bool receiving; /* waiting for read_reply_co? */
  NBDRequest *request;
+QEMUIOVector *qiov;
  } NBDClientRequest;
  
  typedef struct NBDClientSession {

diff --git a/block/nbd-client.c b/block/nbd-client.c
index 5f241ecc22..f7b642f079 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -98,6 +98,21 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
  break;
  }
  
+if (s->reply.error == 0 &&

+s->requests[i].request->type == NBD_CMD_READ)
+{
+QEMUIOVector *qiov = s->requests[i].qiov;
+assert(qiov != NULL);
+assert(s->requests[i].request->len ==
+   iov_size(qiov->iov, qiov->niov));
+if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
+  NULL) < 0)
+{
+s->reply.error = EIO;
+break;
+}
+}

I am not sure this is an improvement.  In principle you could have
commands that read replies a bit at a time without using a QEMUIOVector.

Paolo


Hm, what do you mean? In this patch I don't change "how we read it"; 
I only move the reading code into one coroutine, to make it simple to 
extend in the future (block status, etc). nbd_read_reply_entry has all 
the information it needs (s->requests[i].request) to handle the whole reply. 
It's an improvement, because it leads to a separate reply_entry 
coroutine that no longer needs any synchronization (yield/wake) with 
the request coroutines.





  /* We're woken up again by the request itself.  Note that there
   * is no race between yielding and reentering read_reply_co.  This
   * is because:
@@ -118,6 +133,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
  s->read_reply_co = NULL;
  }
  
+/* qiov should be provided for READ and WRITE */

  static int nbd_co_send_request(BlockDriverState *bs,
 NBDRequest *request,
 QEMUIOVector *qiov)
@@ -145,6 +161,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
  
  request->handle = INDEX_TO_HANDLE(s, i);

  s->requests[i].request = request;
+s->requests[i].qiov = qiov;
  
  if (s->quit) {

  rc = -EIO;
@@ -155,7 +172,8 @@ static int nbd_co_send_request(BlockDriverState *bs,
  goto err;
  }
  
-if (qiov) {

+if (s->requests[i].request->type == NBD_CMD_WRITE) {
+assert(qiov);
  qio_channel_set_cork(s->ioc, true);
  rc = nbd_send_request(s->ioc, request);
  if (rc >= 0 && !s->quit) {
@@ -181,9 +199,7 @@ err:
  return rc;
  }
  
-static int nbd_co_receive_reply(NBDClientSession *s,

-NBDRequest *request,
-QEMUIOVector *qiov)
+static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest *request)
  {
  int ret;
  int i = HANDLE_TO_INDEX(s, request->handle);
@@ -197,14 +213,6 @@ static int nbd_co_receive_reply(NBDClientSession *s,
  } else {
  assert(s->reply.handle == request->handle);
  ret = -s->reply.error;
-if (qiov && s->reply.error == 0) {
-assert(request->len == iov_size(qiov->iov, qiov->niov));
-if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
-  NULL) < 0) {
-ret = -EIO;
-s->quit = true;
-}
-}
  
  /* Tell the read handler to read another header.  */

  s->reply.handle = 0;
@@ -232,16 +240,14 @@ static int nbd_co_request(BlockDriverState *bs,
  NBDClientSession *client = nbd_get_client_session(bs);
  int ret;
  
-assert(!qiov || request->type == NBD_CMD_WRITE ||

-   request->type == NBD_CMD_READ);
-ret = nbd_co_send_request(bs, request,
-  request->type == NBD_CMD_WRITE ? qiov : NULL);
+assert((qiov == NULL) !=
+   (request->type == NBD_CMD_WRITE || request->type == NBD_CMD_READ));
+ret = nbd_co_send_request(bs, request, qiov);
  if (ret < 0) {
  return ret;
  }
  
-return nbd_co_receive_reply(client, request,

-request->type == NBD_CMD_READ ? qiov : NULL);
+return nbd_co_receive_reply(client, request);
  }
  
  int nbd_client_co_preadv(BlockDriverState *bs, uint64_t offset,




Re: [Qemu-block] [PATCH v8 05/20] dirty-bitmap: Avoid size query failure during truncate

2017-09-19 Thread Fam Zheng
On Mon, 09/18 13:58, Eric Blake wrote:
> We've previously fixed several places where we failed to account
> for possible errors from bdrv_nb_sectors().  Fix another one by
> making bdrv_dirty_bitmap_truncate() take the new size from the
> caller instead of querying itself; then adjust the sole caller
> bdrv_truncate() to pass the size just determined by a successful
> resize, or to skip the bitmap resize on failure, thus avoiding
> sizing the bitmaps to -1.
> 
> Signed-off-by: Eric Blake 
> 
> ---
> v8: retitle and rework to avoid possibility of secondary failure [John]
> v7: new patch [Kevin]
> ---
>  include/block/dirty-bitmap.h |  2 +-
>  block.c  | 15 ++-
>  block/dirty-bitmap.c |  6 +++---
>  3 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
> index 8fd842eac9..7a27590047 100644
> --- a/include/block/dirty-bitmap.h
> +++ b/include/block/dirty-bitmap.h
> @@ -83,7 +83,7 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
>  void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num);
>  int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
>  int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap);
> -void bdrv_dirty_bitmap_truncate(BlockDriverState *bs);
> +void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes);
>  bool bdrv_dirty_bitmap_readonly(const BdrvDirtyBitmap *bitmap);
>  bool bdrv_has_readonly_bitmaps(BlockDriverState *bs);
>  bool bdrv_dirty_bitmap_get_autoload(const BdrvDirtyBitmap *bitmap);
> diff --git a/block.c b/block.c
> index ee6a48976e..61ee9d4b83 100644
> --- a/block.c
> +++ b/block.c
> @@ -3450,12 +3450,17 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, 
> PreallocMode prealloc,
>  assert(!(bs->open_flags & BDRV_O_INACTIVE));
> 
>  ret = drv->bdrv_truncate(bs, offset, prealloc, errp);
> -if (ret == 0) {
> -ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
> -bdrv_dirty_bitmap_truncate(bs);
> -bdrv_parent_cb_resize(bs);
> -atomic_inc(&bs->write_gen);
> +if (ret < 0) {
> +return ret;
>  }
> +ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
> +if (ret < 0) {

Ugh, if we get here the situation is a bit embarrassing, because...

> +error_setg_errno(errp, -ret, "Could not refresh total sector count");
> +return ret;
> +}
> +bdrv_dirty_bitmap_truncate(bs, bs->total_sectors * BDRV_SECTOR_SIZE);
> +bdrv_parent_cb_resize(bs);
> +atomic_inc(&bs->write_gen);

I think we still want to inc write_gen even if refresh_total_sectors failed, if
drv->bdrv_truncate has succeeded? That way the next bdrv_co_flush will actually
flush the metadata change to disk.

Maybe similarly call bdrv_parent_cb_resize() as long as drv->bdrv_truncate()
succeeded? The effect is that the virtual devices notify the guest about this
"resized" event, which I think is correct.

Fam



Re: [Qemu-block] [PATCH] nbd-client: Use correct macro parenthesization

2017-09-19 Thread Stefan Hajnoczi
On Mon, Sep 18, 2017 at 04:46:49PM -0500, Eric Blake wrote:
> If 'bs' is a complex expression, we were only casting the front half
> rather than the full expression.  Luckily, none of the callers were
> passing bad arguments, but it's better to be robust up front.
> 
> Signed-off-by: Eric Blake 
> ---
> 
> I plan to take this through my NBD queue, unless it is picked up
> by qemu-trivial first...
> 
>  block/nbd-client.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi 



[Qemu-block] Dead code in cpu-models.h (was: block: Clean up some bad code in the vvfat driver)

2017-09-19 Thread Thomas Huth
On 19.09.2017 10:06, Paolo Bonzini wrote:
> On 13/09/2017 21:08, John Snow wrote:
[...]
>> Farewell, bitrot code.
>>
>> Reviewed-by: John Snow 
>>
>> Out of curiosity, I wonder ...
>>
>> jhuston@probe (foobar) ~/s/qemu> git grep '#if 0' | wc -l
>> 320
> 
> $ git grep -c '#if 0' | sort -k2 --field-separator=: -n
> ...
> hw/net/eepro100.c:21
> target/ppc/cpu-models.h:76
> 
> whoa :)

Igor already removed the dead definitions from cpu-models.c recently:

https://git.qemu.org/?p=qemu.git;a=commitdiff;h=aef779605779579afbafff

I guess we could now remove the corresponding dead definitions from the
header, too...

Any volunteers?

 Thomas




Re: [Qemu-block] [PATCH 18/18] iotests: Add test for active mirroring

2017-09-19 Thread Fam Zheng
On Mon, 09/18 18:53, Max Reitz wrote:
> >> +
> >> +if sync_source_and_target:
> >> +# If source and target should be in sync after the mirror,
> >> +# we have to flush before completion
> > 
> > Not sure I understand this requirement; does it apply to libvirt and users
> > too?
> > I.e. is it part of the interface?  Why can't mirror_complete do it
> > automatically?
> 
> Well, it seems to pass without this flush, but the original intention
> was this: When mirror is completed, the source node is replaced by the
> target.  All further writes are then only executed on the (former)
> target node.  So what might happen (or at least I think it could) is
> that qemu-io submits some writes, but before they are actually
> performed, the mirror block job is completed and the source is replaced
> by the target.  Then, the write operations are performed on the target
> but no longer on the source, so source and target are then out of sync.
> For the mirror block job, that is fine -- at the point of completion,
> source and target were in sync.  The job doesn't care that they get out
> of sync after completion.  But here, we have to care or we can't compare
> source and target.
> 
> The reason for why it always seems to pass without a flush is that every
> submitted write is actually sent to the mirror node before it yields for
> the first time.  But I wouldn't bet on that, so I think it's better to
> keep the flush before completing the block job.

OK, that makes sense. Thanks.

Fam



Re: [Qemu-block] [PATCH] block: Clean up some bad code in the vvfat driver

2017-09-19 Thread Paolo Bonzini
On 13/09/2017 21:08, John Snow wrote:
> 
> 
> On 09/13/2017 06:21 AM, Thomas Huth wrote:
>> Remove the unnecessary home-grown redefinition of the assert() macro here,
>> and remove the unusable debug code at the end of the checkpoint() function.
>> The code there uses assert() with side-effects (assignment to the "mapping"
>> variable), which should be avoided. Looking more closely, it seems as it is
>> apparently also only usable for one certain directory layout (with a file
>> named USB.H in it) and thus is of no use for the rest of the world.
>>
>> Signed-off-by: Thomas Huth 
> 
> Farewell, bitrot code.
> 
> Reviewed-by: John Snow 
> 
> Out of curiosity, I wonder ...
> 
> jhuston@probe (foobar) ~/s/qemu> git grep '#if 0' | wc -l
> 320


$ git grep -c '#if 0' | sort -k2 --field-separator=: -n
...
hw/net/eepro100.c:21
target/ppc/cpu-models.h:76

whoa :)



Re: [Qemu-block] [PATCH 17/18] qemu-io: Add background write

2017-09-19 Thread Fam Zheng
On Mon, 09/18 19:53, Max Reitz wrote:
> On 2017-09-18 08:46, Fam Zheng wrote:
> > On Wed, 09/13 20:19, Max Reitz wrote:
> >> Add a new parameter -B to qemu-io's write command.  When used, qemu-io
> >> will not wait for the result of the operation and instead execute it in
> >> the background.
> > 
> > Cannot aio_write be used for this purpose?
> 
> Depends.  I have been trained to dislike *_aio_*, so that's probably the
> initial reason why I didn't use it.
> 
> Second, I'd have to fix aio_write before it can be used.  Currently,
> this aborts:
> 
> echo 'qemu-io drv0 "aio_write -P 0x11 0 64M"' \
> | x86_64-softmmu/qemu-system-x86_64 -monitor stdio \
>   -blockdev node-name=drv0,driver=null-co
> 
> because aio_write_done thinks it's a good idea to use qemu-io's
> BlockBackend -- but when qemu-io is executed through the HMP, the
> BlockBackend is only created for the duration of the qemu-io command
> (unless there already is a BB).  So what I'd have to do is add a
> blk_ref()/blk_unref() there, but for some reason I really don't like that.

What is the reason? If it crashes it should be fixed anyway, I assume?

Fam



Re: [Qemu-block] [PATCH] block: Clean up some bad code in the vvfat driver

2017-09-19 Thread Kevin Wolf
Am 13.09.2017 um 12:21 hat Thomas Huth geschrieben:
> Remove the unnecessary home-grown redefinition of the assert() macro here,
> and remove the unusable debug code at the end of the checkpoint() function.
> The code there uses assert() with side-effects (assignment to the "mapping"
> variable), which should be avoided. Looking more closely, it seems as it is
> apparently also only usable for one certain directory layout (with a file
> named USB.H in it) and thus is of no use for the rest of the world.
> 
> Signed-off-by: Thomas Huth 

Thanks, applied to the block branch.

Kevin



Re: [Qemu-block] [PATCH] block/throttle-groups.c: allocate RestartData on the heap

2017-09-19 Thread Kevin Wolf
Am 18.09.2017 um 22:25 hat Manos Pitsidianakis geschrieben:
> RestartData is the opaque data of the throttle_group_restart_queue_entry
> coroutine. Being stack-allocated, it is no longer available if
> aio_co_enter schedules the coroutine with a bottom half that runs after
> throttle_group_restart_queue returns.

Cc: qemu-sta...@nongnu.org

(It should be mentioned explicitly in the commit message so you can
filter for it in the commit history. I routinely add this while applying
patches if I'm aware of the qemu-stable CC, but it makes my life easier
if it's already there.)

> Signed-off-by: Manos Pitsidianakis 

Thanks, applied to the block branch.

Kevin