Re: [Qemu-block] [PATCH v8 18/20] qcow2: Switch store_bitmap_data() to byte-based iteration
On 19.09.2017 at 21:42, Eric Blake wrote:
> However...
>
> >> -sbc = limit >> BDRV_SECTOR_BITS;
> >> assert(DIV_ROUND_UP(bm_size, limit) == tb_size);
> >>
> >> -while ((sector = bdrv_dirty_iter_next(dbi) >> BDRV_SECTOR_BITS) >= 0)
> >> {
> >> -uint64_t cluster = sector / sbc;
> >> +while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) {
> >> +uint64_t cluster = offset / limit;
>
> bdrv_dirty_iter_next() returns the next dirty bit (which is not
> necessarily the first bit in the cluster). For the purposes of
> serialization, we want to serialize the entire cluster in one go, even
> though we will be serializing 0 bits up until the first dirty bit. So
> offset at this point may be unaligned,

Ok, this is the part that I was missing. It makes a lot more sense now.
Also, I think 'cluster' meaning bitmap clusters and not qcow2 clusters
here confused me a bit.

> > The part that I'm missing yet is why we need to do it. The bitmap
> > granularity is also the granularity of bdrv_dirty_iter_next(), so isn't
> > offset already aligned and we could even assert that instead of aligning
> > down? (As long as we enforce our restriction, which we seem to do in
> > bitmap_list_load().)
>
> Sadly, a quick:
> [...]
> does NOT fail iotests 165, which appears to be the only test that
> actually hammers on qcow2 bitmaps (changing it to an 'assert(false)'
> only shows an effect on 165) - which means our test is NOT exercising
> all possible alignments. And it's python-based, with lame output, which
> makes debugging it painful. But never fear, for v9 I will improve the
> test to actually affect the bitmap at a point that would fail with my
> temporary assertion in place, and thus prove that we DO need to align
> down. Note that test 165 is testing only a 1G image, but I just showed
> that 64k clusters with 64k granularity covers up to 32G of image space
> in one cluster of the bitmap, so the test is only covering one cluster
> of serialization in the first place, and as written, the test is
> dirtying byte 0, which explains why it happens to get an offset aligned
> to limit, even though that is not a valid assertion.

More tests are always welcome and a good argument for getting a series
merged. :-)

Kevin
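To put concrete numbers on Eric's explanation, here is a minimal standalone sketch of the align-down step (QEMU_ALIGN_DOWN is re-created locally and the offsets are invented; this is not QEMU code):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Local stand-in for QEMU's QEMU_ALIGN_DOWN(); illustration only. */
#define ALIGN_DOWN(n, m) ((n) / (m) * (m))

int main(void)
{
    /* One bitmap cluster covers 'limit' bytes of the image: 32 GiB for
     * 64k clusters with 64k granularity, per the discussion above. */
    uint64_t limit = UINT64_C(1) << 35;

    /* bdrv_dirty_iter_next() may return an offset anywhere inside the
     * cluster; pick a mid-cluster value for the demo. */
    int64_t offset = (INT64_C(1) << 35) + 0x10000;

    uint64_t cluster = offset / limit;          /* bitmap table index: 1 */
    int64_t start = ALIGN_DOWN(offset, limit);  /* serialize from here */

    printf("dirty offset %" PRId64 " -> cluster %" PRIu64
           ", serialize from %" PRId64 "\n", offset, cluster, start);
    return 0;
}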
Re: [Qemu-block] [Qemu-devel] [PATCH v9 05/20] dirty-bitmap: Avoid size query failure during truncate
On Tue, 09/19 19:00, John Snow wrote: > > > On 09/19/2017 04:18 PM, Eric Blake wrote: > > We've previously fixed several places where we failed to account > > for possible errors from bdrv_nb_sectors(). Fix another one by > > making bdrv_dirty_bitmap_truncate() take the new size from the > > caller instead of querying itself; then adjust the sole caller > > bdrv_truncate() to pass the size just determined by a successful > > resize, or to skip the bitmap resize on failure, thus avoiding > > sizing the bitmaps to -1. > > > > Signed-off-by: Eric Blake > > > > --- > > v9: skip only bdrv_dirty_bitmap_truncate on error [Fam] > > v8: retitle and rework to avoid possibility of secondary failure [John] > > v7: new patch [Kevin] > > --- > > include/block/dirty-bitmap.h | 2 +- > > block.c | 15 ++- > > block/dirty-bitmap.c | 6 +++--- > > 3 files changed, 14 insertions(+), 9 deletions(-) > > > > diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h > > index 8fd842eac9..7a27590047 100644 > > --- a/include/block/dirty-bitmap.h > > +++ b/include/block/dirty-bitmap.h > > @@ -83,7 +83,7 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter); > > void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num); > > int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap); > > int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap); > > -void bdrv_dirty_bitmap_truncate(BlockDriverState *bs); > > +void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes); > > bool bdrv_dirty_bitmap_readonly(const BdrvDirtyBitmap *bitmap); > > bool bdrv_has_readonly_bitmaps(BlockDriverState *bs); > > bool bdrv_dirty_bitmap_get_autoload(const BdrvDirtyBitmap *bitmap); > > diff --git a/block.c b/block.c > > index ee6a48976e..89261a7a53 100644 > > --- a/block.c > > +++ b/block.c > > @@ -3450,12 +3450,17 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, > > PreallocMode prealloc, > > assert(!(bs->open_flags & BDRV_O_INACTIVE)); > > > > ret = drv->bdrv_truncate(bs, offset, prealloc, errp); > > -if (ret == 0) { > > -ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS); > > -bdrv_dirty_bitmap_truncate(bs); > > -bdrv_parent_cb_resize(bs); > > -atomic_inc(&bs->write_gen); > > +if (ret < 0) { > > +return ret; > > } > > +ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS); > > +if (ret < 0) { > > +error_setg_errno(errp, -ret, "Could not refresh total sector > > count"); > > +} else { > > Sorry for dragging my feet on this point; if anyone else wants to R-B > this I will cede without much of a fuss, but perhaps you can help me > understand. > > (Copying some questions I asked Eric on IRC, airing to list for wider > discussion, and also because I had to drive home before the stores > closed for the evening) > > Do you suspect that almost certainly if bdrv_truncate() fails overall > that the image format driver will either unmount the image or become > read-only? > > There are calls from parallels, qcow, qcow2-refcount, qcow2, raw-format, > vhdx-log, vhdx plus whichever calls from blk_truncate (jobs, all of the > same formats, blockdev, qemu-img) > > I'm just trying to picture exactly what's going to happen if we manage > to resize the drive but then don't resize the bitmap -- say we expand > the drive larger but fail to refresh (and fail to resize the bitmap.) > > if the block format module does not immediately and dutifully stop all > further writes -- is there anything that stops us from writing to new > sectors that were beyond EOF previously? 
> I suppose if *not* that's a bug for callers of bdrv_truncate to allow
> that kind of monkey business, but if it CAN happen, hbitmap only guards
> against such things with an assert (which, IIRC, is not guaranteed to be
> on for all builds)

It's guaranteed since a few hours ago:

commit 262a69f4282e44426c7a132138581d400053e0a1
Author: Eric Blake
Date:   Mon Sep 11 16:13:20 2017 -0500

    osdep.h: Prohibit disabling assert() in supported builds

> So the question is: "bdrv_truncate failure is NOT considered recoverable
> in ANY case, is it?"
>
> It may possibly be safer to, if the initial truncate request succeeds,
> apply a best-effort to the bitmap before returning the error.

You mean falling back to "offset" (or that value aligned up to the
cluster size of bs) if refresh_total_sectors() returns an error? I think
that is okay.

Fam
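For readers following the thread, a toy model of the best-effort fallback Fam describes: every name here is hypothetical and this is not the code that was merged, just an illustration of the control flow under discussion.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model: the driver-level truncate already succeeded, so even if
 * the size refresh fails, resize the bitmap to the requested size
 * instead of leaving it stale. Not QEMU code. */

static int64_t bitmap_size;     /* stands in for the BdrvDirtyBitmap */

static int refresh_total(int64_t *total, int simulate_failure)
{
    if (simulate_failure) {
        return -5;              /* pretend the refresh hit -EIO */
    }
    *total = 2 * 1024 * 1024;   /* pretend this is the queried size */
    return 0;
}

static int toy_truncate(int64_t offset, int simulate_failure)
{
    int64_t total;
    int ret = refresh_total(&total, simulate_failure);

    if (ret < 0) {
        bitmap_size = offset;   /* best effort: trust the requested size */
    } else {
        bitmap_size = total;    /* normal path: use the queried size */
    }
    return ret;
}

int main(void)
{
    toy_truncate(2 * 1024 * 1024, 1);
    printf("bitmap resized to %" PRId64 " despite refresh failure\n",
           bitmap_size);
    return 0;
}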
Re: [Qemu-block] [Qemu-devel] [PATCH v9 05/20] dirty-bitmap: Avoid size query failure during truncate
On 09/19/2017 04:18 PM, Eric Blake wrote: > We've previously fixed several places where we failed to account > for possible errors from bdrv_nb_sectors(). Fix another one by > making bdrv_dirty_bitmap_truncate() take the new size from the > caller instead of querying itself; then adjust the sole caller > bdrv_truncate() to pass the size just determined by a successful > resize, or to skip the bitmap resize on failure, thus avoiding > sizing the bitmaps to -1. > > Signed-off-by: Eric Blake > > --- > v9: skip only bdrv_dirty_bitmap_truncate on error [Fam] > v8: retitle and rework to avoid possibility of secondary failure [John] > v7: new patch [Kevin] > --- > include/block/dirty-bitmap.h | 2 +- > block.c | 15 ++- > block/dirty-bitmap.c | 6 +++--- > 3 files changed, 14 insertions(+), 9 deletions(-) > > diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h > index 8fd842eac9..7a27590047 100644 > --- a/include/block/dirty-bitmap.h > +++ b/include/block/dirty-bitmap.h > @@ -83,7 +83,7 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter); > void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num); > int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap); > int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap); > -void bdrv_dirty_bitmap_truncate(BlockDriverState *bs); > +void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes); > bool bdrv_dirty_bitmap_readonly(const BdrvDirtyBitmap *bitmap); > bool bdrv_has_readonly_bitmaps(BlockDriverState *bs); > bool bdrv_dirty_bitmap_get_autoload(const BdrvDirtyBitmap *bitmap); > diff --git a/block.c b/block.c > index ee6a48976e..89261a7a53 100644 > --- a/block.c > +++ b/block.c > @@ -3450,12 +3450,17 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, > PreallocMode prealloc, > assert(!(bs->open_flags & BDRV_O_INACTIVE)); > > ret = drv->bdrv_truncate(bs, offset, prealloc, errp); > -if (ret == 0) { > -ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS); > -bdrv_dirty_bitmap_truncate(bs); > -bdrv_parent_cb_resize(bs); > -atomic_inc(&bs->write_gen); > +if (ret < 0) { > +return ret; > } > +ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS); > +if (ret < 0) { > +error_setg_errno(errp, -ret, "Could not refresh total sector count"); > +} else { Sorry for dragging my feet on this point; if anyone else wants to R-B this I will cede without much of a fuss, but perhaps you can help me understand. (Copying some questions I asked Eric on IRC, airing to list for wider discussion, and also because I had to drive home before the stores closed for the evening) Do you suspect that almost certainly if bdrv_truncate() fails overall that the image format driver will either unmount the image or become read-only? There are calls from parallels, qcow, qcow2-refcount, qcow2, raw-format, vhdx-log, vhdx plus whichever calls from blk_truncate (jobs, all of the same formats, blockdev, qemu-img) I'm just trying to picture exactly what's going to happen if we manage to resize the drive but then don't resize the bitmap -- say we expand the drive larger but fail to refresh (and fail to resize the bitmap.) if the block format module does not immediately and dutifully stop all further writes -- is there anything that stops us from writing to new sectors that were beyond EOF previously? 
I suppose if *not* that's a bug for callers of bdrv_truncate to allow that kind of monkey business, but if it CAN happen, hbitmap only guards against such things with an assert (which, IIRC, is not guaranteed to be on for all builds) So the question is: "bdrv_truncate failure is NOT considered recoverable in ANY case, is it?" It may possibly be safer to, if the initial truncate request succeeds, apply a best-effort to the bitmap before returning the error. > +bdrv_dirty_bitmap_truncate(bs, bs->total_sectors * BDRV_SECTOR_SIZE); > +} > +bdrv_parent_cb_resize(bs); > +atomic_inc(&bs->write_gen); > return ret; > } > > diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c > index 42a55e4a4b..ee164fb518 100644 > --- a/block/dirty-bitmap.c > +++ b/block/dirty-bitmap.c > @@ -1,7 +1,7 @@ > /* > * Block Dirty Bitmap > * > - * Copyright (c) 2016 Red Hat. Inc > + * Copyright (c) 2016-2017 Red Hat. Inc > * > * Permission is hereby granted, free of charge, to any person obtaining a > copy > * of this software and associated documentation files (the "Software"), to > deal > @@ -302,10 +302,10 @@ BdrvDirtyBitmap > *bdrv_reclaim_dirty_bitmap(BlockDriverState *bs, > * Truncates _all_ bitmaps attached to a BDS. > * Called with BQL taken. > */ > -void bdrv_dirty_bitmap_truncate(BlockDriverState *bs) > +void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes) > { > BdrvDirtyBitmap *bitmap; > -uint64_t size = bdrv_nb_sectors(bs); > +int64_t size =
[Qemu-block] [PATCH v9 20/20] dirty-bitmap: Convert internal hbitmap size/granularity
Now that all callers are using byte-based interfaces, there's no reason for our internal hbitmap to remain with sector-based granularity. It also simplifies our internal scaling, since we already know that hbitmap widens requests out to granularity boundaries. Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v9: no change v8: rebase to earlier truncate changes (R-b kept) v7: rebase to dirty_iter_next cleanup (no semantic change, R-b kept) v6: no change v5: fix bdrv_dirty_bitmap_truncate [John] v4: rebase to earlier changes, include serialization, R-b dropped v3: no change v2: no change --- block/dirty-bitmap.c | 62 +++- 1 file changed, 18 insertions(+), 44 deletions(-) diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c index 58a3f330a9..bd04e991b1 100644 --- a/block/dirty-bitmap.c +++ b/block/dirty-bitmap.c @@ -38,7 +38,7 @@ */ struct BdrvDirtyBitmap { QemuMutex *mutex; -HBitmap *bitmap;/* Dirty sector bitmap implementation */ +HBitmap *bitmap;/* Dirty bitmap implementation */ HBitmap *meta; /* Meta dirty bitmap */ BdrvDirtyBitmap *successor; /* Anonymous child; implies frozen status */ char *name; /* Optional non-empty unique ID */ @@ -130,12 +130,7 @@ BdrvDirtyBitmap *bdrv_create_dirty_bitmap(BlockDriverState *bs, } bitmap = g_new0(BdrvDirtyBitmap, 1); bitmap->mutex = &bs->dirty_bitmap_mutex; -/* - * TODO - let hbitmap track full granularity. For now, it is tracking - * only sector granularity, as a shortcut for our iterators. - */ -bitmap->bitmap = hbitmap_alloc(DIV_ROUND_UP(bitmap_size, BDRV_SECTOR_SIZE), - ctz32(granularity) - BDRV_SECTOR_BITS); +bitmap->bitmap = hbitmap_alloc(bitmap_size, ctz32(granularity)); bitmap->size = bitmap_size; bitmap->name = g_strdup(name); bitmap->disabled = false; @@ -312,7 +307,7 @@ void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes) QLIST_FOREACH(bitmap, &bs->dirty_bitmaps, list) { assert(!bdrv_dirty_bitmap_frozen(bitmap)); assert(!bitmap->active_iterators); -hbitmap_truncate(bitmap->bitmap, DIV_ROUND_UP(bytes, BDRV_SECTOR_SIZE)); +hbitmap_truncate(bitmap->bitmap, bytes); bitmap->size = bytes; } bdrv_dirty_bitmaps_unlock(bs); @@ -442,7 +437,7 @@ bool bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap, int64_t offset) { if (bitmap) { -return hbitmap_get(bitmap->bitmap, offset >> BDRV_SECTOR_BITS); +return hbitmap_get(bitmap->bitmap, offset); } else { return false; } @@ -470,7 +465,7 @@ uint32_t bdrv_get_default_bitmap_granularity(BlockDriverState *bs) uint32_t bdrv_dirty_bitmap_granularity(const BdrvDirtyBitmap *bitmap) { -return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->bitmap); +return 1U << hbitmap_granularity(bitmap->bitmap); } BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap) @@ -503,20 +498,16 @@ void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter) int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter) { -int64_t ret = hbitmap_iter_next(&iter->hbi); -return ret < 0 ? 
-1 : ret * BDRV_SECTOR_SIZE; +return hbitmap_iter_next(&iter->hbi); } /* Called within bdrv_dirty_bitmap_lock..unlock */ void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap, int64_t offset, int64_t bytes) { -int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE); - assert(bdrv_dirty_bitmap_enabled(bitmap)); assert(!bdrv_dirty_bitmap_readonly(bitmap)); -hbitmap_set(bitmap->bitmap, offset >> BDRV_SECTOR_BITS, -end_sector - (offset >> BDRV_SECTOR_BITS)); +hbitmap_set(bitmap->bitmap, offset, bytes); } void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap, @@ -531,12 +522,9 @@ void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap, void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap, int64_t offset, int64_t bytes) { -int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE); - assert(bdrv_dirty_bitmap_enabled(bitmap)); assert(!bdrv_dirty_bitmap_readonly(bitmap)); -hbitmap_reset(bitmap->bitmap, offset >> BDRV_SECTOR_BITS, - end_sector - (offset >> BDRV_SECTOR_BITS)); +hbitmap_reset(bitmap->bitmap, offset, bytes); } void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap, @@ -556,8 +544,7 @@ void bdrv_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap **out) hbitmap_reset_all(bitmap->bitmap); } else { HBitmap *backup = bitmap->bitmap; -bitmap->bitmap = hbitmap_alloc(DIV_ROUND_UP(bitmap->size, -BDRV_SECTOR_SIZE), +bitmap->bitmap = hbitmap
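A standalone sketch of the granularity math this patch enables (ctz32 is re-created here via a GCC/Clang builtin; not QEMU code):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static int ctz32(uint32_t val)
{
    return __builtin_ctz(val);  /* assumes GCC/Clang */
}

int main(void)
{
    uint32_t granularity = 65536;       /* bytes per dirty bit */
    int shift = ctz32(granularity);     /* 16: hbitmap_alloc(size, 16) */

    /* After this patch, bit i covers bytes [i << shift, (i+1) << shift),
     * so byte offsets index the bitmap directly: */
    int64_t offset = 0x123456;
    printf("byte %#" PRIx64 " -> bit %" PRId64 " (shift %d)\n",
           (uint64_t)offset, offset >> shift, shift);

    /* Before the patch the same allocation needed a sector detour:
     * DIV_ROUND_UP(size, 512) entries at shift ctz32(g) - 9. */
    return 0;
}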
[Qemu-block] [PATCH v9 19/20] dirty-bitmap: Switch bdrv_set_dirty() to bytes
Both callers already had bytes available, but were scaling to sectors. Move the scaling to internal code. In the case of bdrv_aligned_pwritev(), we are now passing the exact offset rather than a rounded sector-aligned value, but that's okay as long as dirty bitmap widens start/bytes to granularity boundaries. Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v4: only context changes v3: rebase to lock context changes, R-b kept v2: no change --- include/block/block_int.h | 2 +- block/io.c| 6 ++ block/dirty-bitmap.c | 7 --- 3 files changed, 7 insertions(+), 8 deletions(-) diff --git a/include/block/block_int.h b/include/block/block_int.h index ba4c383393..55c5d573d4 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -1021,7 +1021,7 @@ void blk_dev_eject_request(BlockBackend *blk, bool force); bool blk_dev_is_tray_open(BlockBackend *blk); bool blk_dev_is_medium_locked(BlockBackend *blk); -void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector, int64_t nr_sect); +void bdrv_set_dirty(BlockDriverState *bs, int64_t offset, int64_t bytes); bool bdrv_requests_pending(BlockDriverState *bs); void bdrv_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap **out); diff --git a/block/io.c b/block/io.c index 4378ae4c7d..8a0cd8835a 100644 --- a/block/io.c +++ b/block/io.c @@ -1334,7 +1334,6 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child, bool waited; int ret; -int64_t start_sector = offset >> BDRV_SECTOR_BITS; int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE); uint64_t bytes_remaining = bytes; int max_transfer; @@ -1409,7 +1408,7 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child, bdrv_debug_event(bs, BLKDBG_PWRITEV_DONE); atomic_inc(&bs->write_gen); -bdrv_set_dirty(bs, start_sector, end_sector - start_sector); +bdrv_set_dirty(bs, offset, bytes); stat64_max(&bs->wr_highest_offset, offset + bytes); @@ -2438,8 +2437,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset, ret = 0; out: atomic_inc(&bs->write_gen); -bdrv_set_dirty(bs, req.offset >> BDRV_SECTOR_BITS, - req.bytes >> BDRV_SECTOR_BITS); +bdrv_set_dirty(bs, req.offset, req.bytes); tracked_request_end(&req); bdrv_dec_in_flight(bs); return ret; diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c index 117837b3cc..58a3f330a9 100644 --- a/block/dirty-bitmap.c +++ b/block/dirty-bitmap.c @@ -628,10 +628,10 @@ void bdrv_dirty_bitmap_deserialize_finish(BdrvDirtyBitmap *bitmap) hbitmap_deserialize_finish(bitmap->bitmap); } -void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector, -int64_t nr_sectors) +void bdrv_set_dirty(BlockDriverState *bs, int64_t offset, int64_t bytes) { BdrvDirtyBitmap *bitmap; +int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE); if (QLIST_EMPTY(&bs->dirty_bitmaps)) { return; @@ -643,7 +643,8 @@ void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector, continue; } assert(!bdrv_dirty_bitmap_readonly(bitmap)); -hbitmap_set(bitmap->bitmap, cur_sector, nr_sectors); +hbitmap_set(bitmap->bitmap, offset >> BDRV_SECTOR_BITS, +end_sector - (offset >> BDRV_SECTOR_BITS)); } bdrv_dirty_bitmaps_unlock(bs); } -- 2.13.5
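The claim that widening makes the exact offset safe can be checked with a standalone model of granule rounding (invented values, not QEMU code):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model of the widening the commit message relies on: an unaligned
 * byte range dirties every granule it touches, so passing the exact
 * offset/bytes gives the same result as the old sector-rounded values. */
int main(void)
{
    uint64_t g = 65536;              /* granularity in bytes */
    uint64_t offset = 0x11000;       /* deliberately unaligned */
    uint64_t bytes = 0x20000;

    uint64_t first = offset / g;                /* floor: granule 1 */
    uint64_t last = (offset + bytes - 1) / g;   /* inclusive: granule 3 */

    printf("granules %" PRIu64 "..%" PRIu64 " marked dirty\n", first, last);
    return 0;
}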
[Qemu-block] [PATCH v9 15/20] mirror: Switch mirror_dirty_init() to byte-based iteration
Now that we have adjusted the majority of the calls this function makes to be byte-based, it is easier to read the code if it makes passes over the image using bytes rather than sectors. Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v6: no change v5: rebase to earlier changes v2-v4: no change --- block/mirror.c | 38 ++ 1 file changed, 14 insertions(+), 24 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index c2f73c91c5..5cdaaed7be 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -612,15 +612,13 @@ static void mirror_throttle(MirrorBlockJob *s) static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s) { -int64_t sector_num, end; +int64_t offset; BlockDriverState *base = s->base; BlockDriverState *bs = s->source; BlockDriverState *target_bs = blk_bs(s->target); -int ret, n; +int ret; int64_t count; -end = s->bdev_length / BDRV_SECTOR_SIZE; - if (base == NULL && !bdrv_has_zero_init(target_bs)) { if (!bdrv_can_write_zeroes_with_unmap(target_bs)) { bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, s->bdev_length); @@ -628,9 +626,9 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s) } s->initial_zeroing_ongoing = true; -for (sector_num = 0; sector_num < end; ) { -int nb_sectors = MIN(end - sector_num, -QEMU_ALIGN_DOWN(INT_MAX, s->granularity) >> BDRV_SECTOR_BITS); +for (offset = 0; offset < s->bdev_length; ) { +int bytes = MIN(s->bdev_length - offset, +QEMU_ALIGN_DOWN(INT_MAX, s->granularity)); mirror_throttle(s); @@ -646,9 +644,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s) continue; } -mirror_do_zero_or_discard(s, sector_num * BDRV_SECTOR_SIZE, - nb_sectors * BDRV_SECTOR_SIZE, false); -sector_num += nb_sectors; +mirror_do_zero_or_discard(s, offset, bytes, false); +offset += bytes; } mirror_wait_for_all_io(s); @@ -656,10 +653,10 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s) } /* First part, loop on the sectors and initialize the dirty bitmap. */ -for (sector_num = 0; sector_num < end; ) { +for (offset = 0; offset < s->bdev_length; ) { /* Just to make sure we are not exceeding int limit. */ -int nb_sectors = MIN(INT_MAX >> BDRV_SECTOR_BITS, - end - sector_num); +int bytes = MIN(s->bdev_length - offset, +QEMU_ALIGN_DOWN(INT_MAX, s->granularity)); mirror_throttle(s); @@ -667,23 +664,16 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s) return 0; } -ret = bdrv_is_allocated_above(bs, base, sector_num * BDRV_SECTOR_SIZE, - nb_sectors * BDRV_SECTOR_SIZE, &count); +ret = bdrv_is_allocated_above(bs, base, offset, bytes, &count); if (ret < 0) { return ret; } -/* TODO: Relax this once bdrv_is_allocated_above and dirty - * bitmaps no longer require sector alignment. */ -assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE)); -n = count >> BDRV_SECTOR_BITS; -assert(n > 0); +assert(count); if (ret == 1) { -bdrv_set_dirty_bitmap(s->dirty_bitmap, - sector_num * BDRV_SECTOR_SIZE, - n * BDRV_SECTOR_SIZE); +bdrv_set_dirty_bitmap(s->dirty_bitmap, offset, count); } -sector_num += n; +offset += count; } return 0; } -- 2.13.5
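A standalone sketch of why the loops above clamp each chunk to QEMU_ALIGN_DOWN(INT_MAX, s->granularity) rather than plain INT_MAX (macros re-created locally; sizes invented):

#include <limits.h>
#include <stdint.h>
#include <stdio.h>

#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
    int64_t bdev_length = INT64_C(40) << 30;    /* 40 GiB example */
    int64_t granularity = 65536;
    int64_t step = ALIGN_DOWN((int64_t)INT_MAX, granularity);
    int64_t offset = 0, chunks = 0;

    while (offset < bdev_length) {
        /* Each chunk fits in an int *and* stays a whole number of
         * granules; only the final chunk at end of device may be short. */
        int bytes = (int)MIN(bdev_length - offset, step);
        offset += bytes;
        chunks++;
    }
    printf("covered 40 GiB in %lld chunks of at most %lld bytes\n",
           (long long)chunks, (long long)step);
    return 0;
}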
[Qemu-block] [PATCH v9 14/20] dirty-bitmap: Change bdrv_[re]set_dirty_bitmap() to use bytes
Some of the callers were already scaling bytes to sectors; others can be easily converted to pass byte offsets, all in our shift towards a consistent byte interface everywhere. Making the change will also make it easier to write the hold-out callers to use byte rather than sectors for their iterations; it also makes it easier for a future dirty-bitmap patch to offload scaling over to the internal hbitmap. Although all callers happen to pass sector-aligned values, make the internal scaling robust to any sub-sector requests. Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v5: only context change v4: only context change, due to rebasing to persistent bitmaps v3: rebase to addition of _locked interfaces; complex enough that I dropped R-b v2: no change --- include/block/dirty-bitmap.h | 8 block/dirty-bitmap.c | 22 ++ block/mirror.c | 16 migration/block.c| 7 +-- 4 files changed, 31 insertions(+), 22 deletions(-) diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h index 9e39537e4b..3579a7597c 100644 --- a/include/block/dirty-bitmap.h +++ b/include/block/dirty-bitmap.h @@ -40,9 +40,9 @@ const char *bdrv_dirty_bitmap_name(const BdrvDirtyBitmap *bitmap); int64_t bdrv_dirty_bitmap_size(const BdrvDirtyBitmap *bitmap); DirtyBitmapStatus bdrv_dirty_bitmap_status(BdrvDirtyBitmap *bitmap); void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap, - int64_t cur_sector, int64_t nr_sectors); + int64_t offset, int64_t bytes); void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap, - int64_t cur_sector, int64_t nr_sectors); + int64_t offset, int64_t bytes); BdrvDirtyBitmapIter *bdrv_dirty_meta_iter_new(BdrvDirtyBitmap *bitmap); BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap); void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter); @@ -75,9 +75,9 @@ void bdrv_dirty_bitmap_unlock(BdrvDirtyBitmap *bitmap); bool bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap, int64_t offset); void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap, - int64_t cur_sector, int64_t nr_sectors); + int64_t offset, int64_t bytes); void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap, -int64_t cur_sector, int64_t nr_sectors); +int64_t offset, int64_t bytes); int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter); void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t offset); int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap); diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c index ad559c62b1..117837b3cc 100644 --- a/block/dirty-bitmap.c +++ b/block/dirty-bitmap.c @@ -509,35 +509,41 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter) /* Called within bdrv_dirty_bitmap_lock..unlock */ void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap, - int64_t cur_sector, int64_t nr_sectors) + int64_t offset, int64_t bytes) { +int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE); + assert(bdrv_dirty_bitmap_enabled(bitmap)); assert(!bdrv_dirty_bitmap_readonly(bitmap)); -hbitmap_set(bitmap->bitmap, cur_sector, nr_sectors); +hbitmap_set(bitmap->bitmap, offset >> BDRV_SECTOR_BITS, +end_sector - (offset >> BDRV_SECTOR_BITS)); } void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap, - int64_t cur_sector, int64_t nr_sectors) + int64_t offset, int64_t bytes) { bdrv_dirty_bitmap_lock(bitmap); -bdrv_set_dirty_bitmap_locked(bitmap, cur_sector, nr_sectors); +bdrv_set_dirty_bitmap_locked(bitmap, offset, bytes); bdrv_dirty_bitmap_unlock(bitmap); } /* Called within bdrv_dirty_bitmap_lock..unlock */ void 
bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap, -int64_t cur_sector, int64_t nr_sectors) +int64_t offset, int64_t bytes) { +int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE); + assert(bdrv_dirty_bitmap_enabled(bitmap)); assert(!bdrv_dirty_bitmap_readonly(bitmap)); -hbitmap_reset(bitmap->bitmap, cur_sector, nr_sectors); +hbitmap_reset(bitmap->bitmap, offset >> BDRV_SECTOR_BITS, + end_sector - (offset >> BDRV_SECTOR_BITS)); } void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap, - int64_t cur_sector, int64_t nr_sectors) + int64_t offset, int64_t bytes) { bdrv_dirty_bitmap_lock(bitmap); -bdrv_reset_dirty_bitmap_locked(bitmap, cur_sector, nr_sectors); +bdrv_reset_dirty_bi
[Qemu-block] [PATCH v9 09/20] qcow2: Switch sectors_covered_by_bitmap_cluster() to byte-based
We are gradually converting to byte-based interfaces, as they are easier to reason about than sector-based. Change the qcow2 bitmap helper function sectors_covered_by_bitmap_cluster(), renaming it to bytes_covered_by_bitmap_cluster() in the process. Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v5: no change v4: new patch --- block/qcow2-bitmap.c | 28 ++-- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c index 92098bfa49..4475273d8c 100644 --- a/block/qcow2-bitmap.c +++ b/block/qcow2-bitmap.c @@ -269,18 +269,16 @@ static int free_bitmap_clusters(BlockDriverState *bs, Qcow2BitmapTable *tb) return 0; } -/* This function returns the number of disk sectors covered by a single qcow2 - * cluster of bitmap data. */ -static uint64_t sectors_covered_by_bitmap_cluster(const BDRVQcow2State *s, - const BdrvDirtyBitmap *bitmap) +/* Return the disk size covered by a single qcow2 cluster of bitmap data. */ +static uint64_t bytes_covered_by_bitmap_cluster(const BDRVQcow2State *s, +const BdrvDirtyBitmap *bitmap) { -uint64_t sector_granularity = -bdrv_dirty_bitmap_granularity(bitmap) >> BDRV_SECTOR_BITS; -uint64_t sbc = sector_granularity * (s->cluster_size << 3); +uint64_t granularity = bdrv_dirty_bitmap_granularity(bitmap); +uint64_t limit = granularity * (s->cluster_size << 3); -assert(QEMU_IS_ALIGNED(sbc << BDRV_SECTOR_BITS, +assert(QEMU_IS_ALIGNED(limit, bdrv_dirty_bitmap_serialization_align(bitmap))); -return sbc; +return limit; } /* load_bitmap_data @@ -293,7 +291,7 @@ static int load_bitmap_data(BlockDriverState *bs, { int ret = 0; BDRVQcow2State *s = bs->opaque; -uint64_t sector, sbc; +uint64_t sector, limit, sbc; uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap); uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE); uint8_t *buf = NULL; @@ -306,7 +304,8 @@ static int load_bitmap_data(BlockDriverState *bs, } buf = g_malloc(s->cluster_size); -sbc = sectors_covered_by_bitmap_cluster(s, bitmap); +limit = bytes_covered_by_bitmap_cluster(s, bitmap); +sbc = limit >> BDRV_SECTOR_BITS; for (i = 0, sector = 0; i < tab_size; ++i, sector += sbc) { uint64_t count = MIN(bm_sectors - sector, sbc); uint64_t entry = bitmap_table[i]; @@ -1080,7 +1079,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs, int ret; BDRVQcow2State *s = bs->opaque; int64_t sector; -uint64_t sbc; +uint64_t limit, sbc; uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap); uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE); const char *bm_name = bdrv_dirty_bitmap_name(bitmap); @@ -1106,8 +1105,9 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs, dbi = bdrv_dirty_iter_new(bitmap, 0); buf = g_malloc(s->cluster_size); -sbc = sectors_covered_by_bitmap_cluster(s, bitmap); -assert(DIV_ROUND_UP(bm_sectors, sbc) == tb_size); +limit = bytes_covered_by_bitmap_cluster(s, bitmap); +sbc = limit >> BDRV_SECTOR_BITS; +assert(DIV_ROUND_UP(bm_size, limit) == tb_size); while ((sector = bdrv_dirty_iter_next(dbi)) != -1) { uint64_t cluster = sector / sbc; -- 2.13.5
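As a worked example of the limit computed above, using the 64k defaults discussed elsewhere in this thread (standalone C, not QEMU code):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t cluster_size = 65536;      /* qcow2 cluster, 64 KiB */
    uint64_t granularity = 65536;       /* bitmap granularity in bytes */

    /* One cluster holds cluster_size * 8 bits, each covering one
     * granule of the image. */
    uint64_t limit = granularity * (cluster_size << 3);

    printf("limit = %" PRIu64 " bytes (= %" PRIu64 " GiB)\n",
           limit, limit >> 30);         /* prints 32 GiB */
    return 0;
}

This matches the 32G-per-bitmap-cluster figure cited in the store_bitmap_data() discussion.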
[Qemu-block] [PATCH v9 13/20] dirty-bitmap: Change bdrv_get_dirty_locked() to take bytes
Half the callers were already scaling bytes to sectors; the other half can eventually be simplified to use byte iteration. Both callers were already using the result as a bool, so make that explicit. Making the change also makes it easier for a future dirty-bitmap patch to offload scaling over to the internal hbitmap. Remember, asking whether a byte is dirty is effectively asking whether the entire granularity containing the byte is dirty, since we only track dirtiness by granularity. Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Juan Quintela Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v4: only context change v3: rebase to _locked rename was straightforward enough that R-b kept v2: tweak commit message, no code change --- include/block/dirty-bitmap.h | 4 ++-- block/dirty-bitmap.c | 8 block/mirror.c | 3 +-- migration/block.c| 3 ++- 4 files changed, 9 insertions(+), 9 deletions(-) diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h index 757fc4c5b8..9e39537e4b 100644 --- a/include/block/dirty-bitmap.h +++ b/include/block/dirty-bitmap.h @@ -72,8 +72,8 @@ void bdrv_dirty_bitmap_set_persistance(BdrvDirtyBitmap *bitmap, /* Functions that require manual locking. */ void bdrv_dirty_bitmap_lock(BdrvDirtyBitmap *bitmap); void bdrv_dirty_bitmap_unlock(BdrvDirtyBitmap *bitmap); -int bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap, - int64_t sector); +bool bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap, + int64_t offset); void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap, int64_t cur_sector, int64_t nr_sectors); void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap, diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c index 8322e23f0d..ad559c62b1 100644 --- a/block/dirty-bitmap.c +++ b/block/dirty-bitmap.c @@ -438,13 +438,13 @@ BlockDirtyInfoList *bdrv_query_dirty_bitmaps(BlockDriverState *bs) } /* Called within bdrv_dirty_bitmap_lock..unlock */ -int bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap, - int64_t sector) +bool bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap, + int64_t offset) { if (bitmap) { -return hbitmap_get(bitmap->bitmap, sector); +return hbitmap_get(bitmap->bitmap, offset >> BDRV_SECTOR_BITS); } else { -return 0; +return false; } } diff --git a/block/mirror.c b/block/mirror.c index 7113d47db4..e36fc81df3 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -361,8 +361,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) int64_t next_offset = offset + nb_chunks * s->granularity; int64_t next_chunk = next_offset / s->granularity; if (next_offset >= s->bdev_length || -!bdrv_get_dirty_locked(source, s->dirty_bitmap, - next_offset >> BDRV_SECTOR_BITS)) { +!bdrv_get_dirty_locked(source, s->dirty_bitmap, next_offset)) { break; } if (test_bit(next_chunk, s->in_flight_bitmap)) { diff --git a/migration/block.c b/migration/block.c index a3512945da..b618869661 100644 --- a/migration/block.c +++ b/migration/block.c @@ -530,7 +530,8 @@ static int mig_save_device_dirty(QEMUFile *f, BlkMigDevState *bmds, blk_mig_unlock(); } bdrv_dirty_bitmap_lock(bmds->dirty_bitmap); -if (bdrv_get_dirty_locked(bs, bmds->dirty_bitmap, sector)) { +if (bdrv_get_dirty_locked(bs, bmds->dirty_bitmap, + sector * BDRV_SECTOR_SIZE)) { if (total_sectors - sector < BDRV_SECTORS_PER_DIRTY_CHUNK) { nr_sectors = total_sectors - sector; } else { -- 2.13.5
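A toy model of the per-granularity semantics the commit message notes: any byte inside a dirty granule reports dirty, because dirtiness is only tracked per granule (invented values, not QEMU code).

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t g = 65536;              /* granularity in bytes */
    uint64_t dirty_granule = 3;      /* pretend granule 3 was written */

    /* Every quarter-granule probe inside granule 3 reads back dirty. */
    for (uint64_t offset = 3 * g; offset < 4 * g; offset += g / 4) {
        int dirty = (offset / g) == dirty_granule;
        printf("offset %#llx dirty=%d\n", (unsigned long long)offset, dirty);
    }
    return 0;
}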
[Qemu-block] [PATCH v9 11/20] dirty-bitmap: Change bdrv_dirty_iter_next() to report byte offset
Thanks to recent cleanups, most callers were scaling a return value of sectors into bytes (the exception, in qcow2-bitmap, will be converted to byte-based iteration later). Update the interface to do the scaling internally instead. In qcow2-bitmap, the code was specifically checking for an error return of -1. To avoid a regression, we either have to make sure we continue to return -1 (rather than a scaled -512) on error, or we have to fix the caller to treat all negative values as error rather than just one magic value. It's easy enough to make both changes at the same time, even though either one in isolation would work. Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v8: tweak commit message, add R-b v7: return -1, not -512; and fix qcow2-bitmap to check all negatives [Kevin] v5-v6: no change v4: rebase to persistent bitmap v3: no change v2: no change --- block/backup.c | 2 +- block/dirty-bitmap.c | 3 ++- block/mirror.c | 8 block/qcow2-bitmap.c | 2 +- 4 files changed, 8 insertions(+), 7 deletions(-) diff --git a/block/backup.c b/block/backup.c index ac9c018717..06ddbfd03d 100644 --- a/block/backup.c +++ b/block/backup.c @@ -375,7 +375,7 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job) dbi = bdrv_dirty_iter_new(job->sync_bitmap); /* Find the next dirty sector(s) */ -while ((offset = bdrv_dirty_iter_next(dbi) * BDRV_SECTOR_SIZE) >= 0) { +while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) { cluster = offset / job->cluster_size; /* Fake progress updates for any clusters we skipped */ diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c index 84509476ba..e451916187 100644 --- a/block/dirty-bitmap.c +++ b/block/dirty-bitmap.c @@ -503,7 +503,8 @@ void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter) int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter) { -return hbitmap_iter_next(&iter->hbi); +int64_t ret = hbitmap_iter_next(&iter->hbi); +return ret < 0 ? 
-1 : ret * BDRV_SECTOR_SIZE; } /* Called within bdrv_dirty_bitmap_lock..unlock */ diff --git a/block/mirror.c b/block/mirror.c index 0b063b3c20..77bf5aa3a4 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -336,10 +336,10 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) int max_io_bytes = MAX(s->buf_size / MAX_IN_FLIGHT, MAX_IO_BYTES); bdrv_dirty_bitmap_lock(s->dirty_bitmap); -offset = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE; +offset = bdrv_dirty_iter_next(s->dbi); if (offset < 0) { bdrv_set_dirty_iter(s->dbi, 0); -offset = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE; +offset = bdrv_dirty_iter_next(s->dbi); trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap) * BDRV_SECTOR_SIZE); assert(offset >= 0); @@ -370,11 +370,11 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) break; } -next_dirty = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE; +next_dirty = bdrv_dirty_iter_next(s->dbi); if (next_dirty > next_offset || next_dirty < 0) { /* The bitmap iterator's cache is stale, refresh it */ bdrv_set_dirty_iter(s->dbi, next_offset); -next_dirty = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE; +next_dirty = bdrv_dirty_iter_next(s->dbi); } assert(next_dirty == next_offset); nb_chunks++; diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c index 44329fc74f..b09010b1d3 100644 --- a/block/qcow2-bitmap.c +++ b/block/qcow2-bitmap.c @@ -1109,7 +1109,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs, sbc = limit >> BDRV_SECTOR_BITS; assert(DIV_ROUND_UP(bm_size, limit) == tb_size); -while ((sector = bdrv_dirty_iter_next(dbi)) != -1) { +while ((sector = bdrv_dirty_iter_next(dbi) >> BDRV_SECTOR_BITS) >= 0) { uint64_t cluster = sector / sbc; uint64_t end, write_size; int64_t off; -- 2.13.5
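The -1-versus--512 hazard described in the commit message, demonstrated standalone:

#include <stdint.h>
#include <stdio.h>

#define BDRV_SECTOR_SIZE 512

int main(void)
{
    int64_t ret = -1;                           /* iterator exhausted */
    int64_t scaled = ret * BDRV_SECTOR_SIZE;    /* naive scaling: -512 */

    printf("scaled = %lld\n", (long long)scaled);
    /* An "!= -1" test passes, so -512 would be treated as a valid
     * offset; a ">= 0" test catches any negative error value. */
    printf("scaled != -1 -> %d (old-style check keeps looping)\n",
           scaled != -1);
    printf("scaled >= 0  -> %d (fixed check stops)\n", scaled >= 0);
    return 0;
}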
[Qemu-block] [PATCH v9 08/20] dirty-bitmap: Change bdrv_dirty_bitmap_*serialize*() to take bytes
Right now, the dirty-bitmap code exposes the fact that we use a scale of sector granularity in the underlying hbitmap to anything that wants to serialize a dirty bitmap. It's nicer to uniformly expose bytes as our dirty-bitmap interface, matching the previous change to bitmap size. The only caller to serialization is currently qcow2-cluster.c, which becomes a bit more verbose because it is still tracking sectors for other reasons, but a later patch will fix that to more uniformly use byte offsets everywhere. Likewise, within dirty-bitmap, we have to add more assertions that we are not truncating incorrectly, which can go away once the internal hbitmap is byte-based rather than sector-based. Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v5: no change v4: new patch --- include/block/dirty-bitmap.h | 14 +++--- block/dirty-bitmap.c | 37 - block/qcow2-bitmap.c | 22 ++ 3 files changed, 45 insertions(+), 28 deletions(-) diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h index 7a27590047..5f34a1a3c7 100644 --- a/include/block/dirty-bitmap.h +++ b/include/block/dirty-bitmap.h @@ -49,19 +49,19 @@ BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap, void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter); uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap, - uint64_t start, uint64_t count); + uint64_t offset, uint64_t bytes); uint64_t bdrv_dirty_bitmap_serialization_align(const BdrvDirtyBitmap *bitmap); void bdrv_dirty_bitmap_serialize_part(const BdrvDirtyBitmap *bitmap, - uint8_t *buf, uint64_t start, - uint64_t count); + uint8_t *buf, uint64_t offset, + uint64_t bytes); void bdrv_dirty_bitmap_deserialize_part(BdrvDirtyBitmap *bitmap, -uint8_t *buf, uint64_t start, -uint64_t count, bool finish); +uint8_t *buf, uint64_t offset, +uint64_t bytes, bool finish); void bdrv_dirty_bitmap_deserialize_zeroes(BdrvDirtyBitmap *bitmap, - uint64_t start, uint64_t count, + uint64_t offset, uint64_t bytes, bool finish); void bdrv_dirty_bitmap_deserialize_ones(BdrvDirtyBitmap *bitmap, -uint64_t start, uint64_t count, +uint64_t offset, uint64_t bytes, bool finish); void bdrv_dirty_bitmap_deserialize_finish(BdrvDirtyBitmap *bitmap); diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c index 6d32554b4e..555c736024 100644 --- a/block/dirty-bitmap.c +++ b/block/dirty-bitmap.c @@ -568,42 +568,53 @@ void bdrv_undo_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap *in) } uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap, - uint64_t start, uint64_t count) + uint64_t offset, uint64_t bytes) { -return hbitmap_serialization_size(bitmap->bitmap, start, count); +assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE)); +return hbitmap_serialization_size(bitmap->bitmap, + offset >> BDRV_SECTOR_BITS, + bytes >> BDRV_SECTOR_BITS); } uint64_t bdrv_dirty_bitmap_serialization_align(const BdrvDirtyBitmap *bitmap) { -return hbitmap_serialization_align(bitmap->bitmap); +return hbitmap_serialization_align(bitmap->bitmap) * BDRV_SECTOR_SIZE; } void bdrv_dirty_bitmap_serialize_part(const BdrvDirtyBitmap *bitmap, - uint8_t *buf, uint64_t start, - uint64_t count) + uint8_t *buf, uint64_t offset, + uint64_t bytes) { -hbitmap_serialize_part(bitmap->bitmap, buf, start, count); +assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE)); +hbitmap_serialize_part(bitmap->bitmap, buf, offset >> BDRV_SECTOR_BITS, + bytes >> BDRV_SECTOR_BITS); } void bdrv_dirty_bitmap_deserialize_part(BdrvDirtyBitmap *bitmap, 
-uint8_t *buf, uint64_t start, -uint64_t count, bool finish) +uint8_t *buf, uint64_t offset, +uint64_t bytes, bool finish) { -hbitmap_deserialize_p
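A standalone note on the assertion idiom added above: QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE) checks both values with a single test, because OR-ing can only set low-order bits, never clear them (macro re-created locally for the demo):

#include <stdint.h>
#include <stdio.h>

#define IS_ALIGNED(n, m) (((n) & ((m) - 1)) == 0)   /* m: power of two */

int main(void)
{
    uint64_t offset = 4096, bytes = 1024;
    printf("both aligned    -> %d\n", IS_ALIGNED(offset | bytes, 512));

    bytes = 1000;                     /* one unaligned value is enough */
    printf("bytes unaligned -> %d\n", IS_ALIGNED(offset | bytes, 512));
    return 0;
}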
[Qemu-block] [PATCH v9 06/20] dirty-bitmap: Change bdrv_dirty_bitmap_size() to report bytes
We're already reporting bytes for bdrv_dirty_bitmap_granularity(); mixing bytes and sectors in our return values is a recipe for confusion. A later cleanup will convert dirty bitmap internals to be entirely byte-based, but in the meantime, we should report the bitmap size in bytes. The only external caller in qcow2-bitmap.c is temporarily more verbose (because it is still using sector-based math), but will later be switched to track progress by bytes instead of sectors. Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v8: no change v7: split external from internal change [Kevin], drop R-b v6: no change v5: fix bdrv_dirty_bitmap_truncate [John], drop R-b v4: retitle from "Track size in bytes", rebase to persistent bitmaps, round up when converting bytes to sectors v3: no change v2: tweak commit message, no code change --- block/dirty-bitmap.c | 2 +- block/qcow2-bitmap.c | 14 -- 2 files changed, 9 insertions(+), 7 deletions(-) diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c index ee164fb518..75af32 100644 --- a/block/dirty-bitmap.c +++ b/block/dirty-bitmap.c @@ -175,7 +175,7 @@ void bdrv_release_meta_dirty_bitmap(BdrvDirtyBitmap *bitmap) int64_t bdrv_dirty_bitmap_size(const BdrvDirtyBitmap *bitmap) { -return bitmap->size; +return bitmap->size * BDRV_SECTOR_SIZE; } const char *bdrv_dirty_bitmap_name(const BdrvDirtyBitmap *bitmap) diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c index b3ee4c794a..65122e9ae1 100644 --- a/block/qcow2-bitmap.c +++ b/block/qcow2-bitmap.c @@ -295,10 +295,11 @@ static int load_bitmap_data(BlockDriverState *bs, BDRVQcow2State *s = bs->opaque; uint64_t sector, sbc; uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap); +uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE); uint8_t *buf = NULL; uint64_t i, tab_size = size_to_clusters(s, -bdrv_dirty_bitmap_serialization_size(bitmap, 0, bm_size)); +bdrv_dirty_bitmap_serialization_size(bitmap, 0, bm_sectors)); if (tab_size != bitmap_table_size || tab_size > BME_MAX_TABLE_SIZE) { return -EINVAL; @@ -307,7 +308,7 @@ static int load_bitmap_data(BlockDriverState *bs, buf = g_malloc(s->cluster_size); sbc = sectors_covered_by_bitmap_cluster(s, bitmap); for (i = 0, sector = 0; i < tab_size; ++i, sector += sbc) { -uint64_t count = MIN(bm_size - sector, sbc); +uint64_t count = MIN(bm_sectors - sector, sbc); uint64_t entry = bitmap_table[i]; uint64_t offset = entry & BME_TABLE_ENTRY_OFFSET_MASK; @@ -1077,13 +1078,14 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs, int64_t sector; uint64_t sbc; uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap); +uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE); const char *bm_name = bdrv_dirty_bitmap_name(bitmap); uint8_t *buf = NULL; BdrvDirtyBitmapIter *dbi; uint64_t *tb; uint64_t tb_size = size_to_clusters(s, -bdrv_dirty_bitmap_serialization_size(bitmap, 0, bm_size)); +bdrv_dirty_bitmap_serialization_size(bitmap, 0, bm_sectors)); if (tb_size > BME_MAX_TABLE_SIZE || tb_size * s->cluster_size > BME_MAX_PHYS_SIZE) @@ -1101,7 +1103,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs, dbi = bdrv_dirty_iter_new(bitmap, 0); buf = g_malloc(s->cluster_size); sbc = sectors_covered_by_bitmap_cluster(s, bitmap); -assert(DIV_ROUND_UP(bm_size, sbc) == tb_size); +assert(DIV_ROUND_UP(bm_sectors, sbc) == tb_size); while ((sector = bdrv_dirty_iter_next(dbi)) != -1) { uint64_t cluster = sector / sbc; @@ -1109,7 +,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs, int64_t off; sector = cluster * sbc; 
-end = MIN(bm_size, sector + sbc); +end = MIN(bm_sectors, sector + sbc); write_size = bdrv_dirty_bitmap_serialization_size(bitmap, sector, end - sector); assert(write_size <= s->cluster_size); @@ -1141,7 +1143,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs, goto fail; } -if (end >= bm_size) { +if (end >= bm_sectors) { break; } -- 2.13.5
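The DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE) conversion this patch introduces, shown standalone (macro re-created locally): a size that is not a sector multiple still gets its final, partially-used sector counted.

#include <stdint.h>
#include <stdio.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
    uint64_t bm_size = 1000;    /* bytes; deliberately not 512-aligned */
    printf("%llu sectors\n",
           (unsigned long long)DIV_ROUND_UP(bm_size, 512));   /* 2 */
    return 0;
}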
[Qemu-block] [PATCH v9 18/20] qcow2: Switch store_bitmap_data() to byte-based iteration
Now that we have adjusted the majority of the calls this function makes to be byte-based, it is easier to read the code if it makes passes over the image using bytes rather than sectors. iotests 165 was rather weak - on a default 64k-cluster image, where bitmap granularity also defaults to 64k bytes, a single cluster of the bitmap table thus covers (64*1024*8) bits which each cover 64k bytes, or 32G of image space. But the test only uses a 1G image, so it cannot trigger any more than one loop of the code in store_bitmap_data(); and it was writing to the first cluster. In order to test that we are properly aligning which portions of the bitmap are being written to the file, we really want to test a case where the first dirty bit returned by bdrv_dirty_iter_next() is not aligned to the start of a cluster, which we can do by modifying the test to write data that doesn't happen to fall in the first cluster of the image. Signed-off-by: Eric Blake --- v9: update iotests to show why aligning down is needed [Kevin], R-b dropped v8: no change v7: rebase to earlier change, make rounding of offset obvious (no semantic change, so R-b kept) [Kevin] v5-v6: no change v4: new patch --- block/qcow2-bitmap.c | 31 --- tests/qemu-iotests/165 | 2 +- 2 files changed, 17 insertions(+), 16 deletions(-) diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c index 692ce0de88..df957c66d5 100644 --- a/block/qcow2-bitmap.c +++ b/block/qcow2-bitmap.c @@ -1072,10 +1072,9 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs, { int ret; BDRVQcow2State *s = bs->opaque; -int64_t sector; -uint64_t limit, sbc; +int64_t offset; +uint64_t limit; uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap); -uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE); const char *bm_name = bdrv_dirty_bitmap_name(bitmap); uint8_t *buf = NULL; BdrvDirtyBitmapIter *dbi; @@ -1100,18 +1099,22 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs, dbi = bdrv_dirty_iter_new(bitmap); buf = g_malloc(s->cluster_size); limit = bytes_covered_by_bitmap_cluster(s, bitmap); -sbc = limit >> BDRV_SECTOR_BITS; assert(DIV_ROUND_UP(bm_size, limit) == tb_size); -while ((sector = bdrv_dirty_iter_next(dbi) >> BDRV_SECTOR_BITS) >= 0) { -uint64_t cluster = sector / sbc; +while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) { +uint64_t cluster = offset / limit; uint64_t end, write_size; int64_t off; -sector = cluster * sbc; -end = MIN(bm_sectors, sector + sbc); -write_size = bdrv_dirty_bitmap_serialization_size(bitmap, -sector * BDRV_SECTOR_SIZE, (end - sector) * BDRV_SECTOR_SIZE); +/* + * We found the first dirty offset, but want to write out the + * entire cluster of the bitmap that includes that offset, + * including any leading zero bits. 
+ */ +offset = QEMU_ALIGN_DOWN(offset, limit); +end = MIN(bm_size, offset + limit); +write_size = bdrv_dirty_bitmap_serialization_size(bitmap, offset, + end - offset); assert(write_size <= s->cluster_size); off = qcow2_alloc_clusters(bs, s->cluster_size); @@ -1123,9 +1126,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs, } tb[cluster] = off; -bdrv_dirty_bitmap_serialize_part(bitmap, buf, - sector * BDRV_SECTOR_SIZE, - (end - sector) * BDRV_SECTOR_SIZE); +bdrv_dirty_bitmap_serialize_part(bitmap, buf, offset, end - offset); if (write_size < s->cluster_size) { memset(buf + write_size, 0, s->cluster_size - write_size); } @@ -1143,11 +1144,11 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs, goto fail; } -if (end >= bm_sectors) { +if (end >= bm_size) { break; } -bdrv_set_dirty_iter(dbi, end * BDRV_SECTOR_SIZE); +bdrv_set_dirty_iter(dbi, end); } *bitmap_table_size = tb_size; diff --git a/tests/qemu-iotests/165 b/tests/qemu-iotests/165 index 74d7b79a0b..a3932db3de 100755 --- a/tests/qemu-iotests/165 +++ b/tests/qemu-iotests/165 @@ -27,7 +27,7 @@ disk = os.path.join(iotests.test_dir, 'disk') disk_size = 0x4000 # 1G # regions for qemu_io: (start, count) in bytes -regions1 = ((0,0x10), +regions1 = ((0x0fff00, 0x1), (0x20, 0x10)) regions2 = ((0x1000, 0x2), -- 2.13.5
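The arithmetic behind the test change, worked through standalone using the offsets as they appear in the diff above (not QEMU code):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* With 64 KiB granularity, the first dirty offset the iterator reports
 * for a write at 0x0fff00 is granule-aligned (0xf0000) but not
 * bitmap-cluster-aligned, so the temporary assertion mentioned in the
 * commit message would have fired -- proving the align-down is needed. */
int main(void)
{
    uint64_t granularity = UINT64_C(1) << 16;           /* 64 KiB */
    uint64_t limit = granularity * (granularity << 3);  /* 32 GiB */
    uint64_t write_at = 0x0fff00;

    uint64_t first_dirty = write_at / granularity * granularity;
    printf("first dirty offset: %#" PRIx64 "\n", first_dirty);
    printf("multiple of limit?  %d\n", (int)(first_dirty % limit == 0));
    return 0;
}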
[Qemu-block] [PATCH v9 17/20] qcow2: Switch load_bitmap_data() to byte-based iteration
Now that we have adjusted the majority of the calls this function makes to be byte-based, it is easier to read the code if it makes passes over the image using bytes rather than sectors. Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v5: no change v4: new patch --- block/qcow2-bitmap.c | 22 -- 1 file changed, 8 insertions(+), 14 deletions(-) diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c index b09010b1d3..692ce0de88 100644 --- a/block/qcow2-bitmap.c +++ b/block/qcow2-bitmap.c @@ -291,9 +291,8 @@ static int load_bitmap_data(BlockDriverState *bs, { int ret = 0; BDRVQcow2State *s = bs->opaque; -uint64_t sector, limit, sbc; +uint64_t offset, limit; uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap); -uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE); uint8_t *buf = NULL; uint64_t i, tab_size = size_to_clusters(s, @@ -305,32 +304,27 @@ static int load_bitmap_data(BlockDriverState *bs, buf = g_malloc(s->cluster_size); limit = bytes_covered_by_bitmap_cluster(s, bitmap); -sbc = limit >> BDRV_SECTOR_BITS; -for (i = 0, sector = 0; i < tab_size; ++i, sector += sbc) { -uint64_t count = MIN(bm_sectors - sector, sbc); +for (i = 0, offset = 0; i < tab_size; ++i, offset += limit) { +uint64_t count = MIN(bm_size - offset, limit); uint64_t entry = bitmap_table[i]; -uint64_t offset = entry & BME_TABLE_ENTRY_OFFSET_MASK; +uint64_t data_offset = entry & BME_TABLE_ENTRY_OFFSET_MASK; assert(check_table_entry(entry, s->cluster_size) == 0); -if (offset == 0) { +if (data_offset == 0) { if (entry & BME_TABLE_ENTRY_FLAG_ALL_ONES) { -bdrv_dirty_bitmap_deserialize_ones(bitmap, - sector * BDRV_SECTOR_SIZE, - count * BDRV_SECTOR_SIZE, +bdrv_dirty_bitmap_deserialize_ones(bitmap, offset, count, false); } else { /* No need to deserialize zeros because the dirty bitmap is * already cleared */ } } else { -ret = bdrv_pread(bs->file, offset, buf, s->cluster_size); +ret = bdrv_pread(bs->file, data_offset, buf, s->cluster_size); if (ret < 0) { goto finish; } -bdrv_dirty_bitmap_deserialize_part(bitmap, buf, - sector * BDRV_SECTOR_SIZE, - count * BDRV_SECTOR_SIZE, +bdrv_dirty_bitmap_deserialize_part(bitmap, buf, offset, count, false); } } -- 2.13.5
[Qemu-block] [PATCH v9 12/20] dirty-bitmap: Change bdrv_get_dirty_count() to report bytes
Thanks to recent cleanups, all callers were scaling a return value of sectors into bytes; do the scaling internally instead. Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v8: no change, add R-b v7: fix one more trace caller [Kevin] v4-v6: no change v3: no change, add R-b v2: no change --- block/dirty-bitmap.c | 4 ++-- block/mirror.c | 16 ++-- migration/block.c| 2 +- 3 files changed, 9 insertions(+), 13 deletions(-) diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c index e451916187..8322e23f0d 100644 --- a/block/dirty-bitmap.c +++ b/block/dirty-bitmap.c @@ -423,7 +423,7 @@ BlockDirtyInfoList *bdrv_query_dirty_bitmaps(BlockDriverState *bs) QLIST_FOREACH(bm, &bs->dirty_bitmaps, list) { BlockDirtyInfo *info = g_new0(BlockDirtyInfo, 1); BlockDirtyInfoList *entry = g_new0(BlockDirtyInfoList, 1); -info->count = bdrv_get_dirty_count(bm) << BDRV_SECTOR_BITS; +info->count = bdrv_get_dirty_count(bm); info->granularity = bdrv_dirty_bitmap_granularity(bm); info->has_name = !!bm->name; info->name = g_strdup(bm->name); @@ -652,7 +652,7 @@ void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *iter, int64_t offset) int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap) { -return hbitmap_count(bitmap->bitmap); +return hbitmap_count(bitmap->bitmap) << BDRV_SECTOR_BITS; } int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap) diff --git a/block/mirror.c b/block/mirror.c index 77bf5aa3a4..7113d47db4 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -340,8 +340,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) if (offset < 0) { bdrv_set_dirty_iter(s->dbi, 0); offset = bdrv_dirty_iter_next(s->dbi); -trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap) * - BDRV_SECTOR_SIZE); +trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap)); assert(offset >= 0); } bdrv_dirty_bitmap_unlock(s->dirty_bitmap); @@ -811,11 +810,10 @@ static void coroutine_fn mirror_run(void *opaque) cnt = bdrv_get_dirty_count(s->dirty_bitmap); /* s->common.offset contains the number of bytes already processed so - * far, cnt is the number of dirty sectors remaining and + * far, cnt is the number of dirty bytes remaining and * s->bytes_in_flight is the number of bytes currently being * processed; together those are the current total operation length */ -s->common.len = s->common.offset + s->bytes_in_flight + -cnt * BDRV_SECTOR_SIZE; +s->common.len = s->common.offset + s->bytes_in_flight + cnt; /* Note that even when no rate limit is applied we need to yield * periodically with no pending I/O so that bdrv_drain_all() returns. @@ -827,8 +825,7 @@ static void coroutine_fn mirror_run(void *opaque) s->common.iostatus == BLOCK_DEVICE_IO_STATUS_OK) { if (s->in_flight >= MAX_IN_FLIGHT || s->buf_free_count == 0 || (cnt == 0 && s->in_flight > 0)) { -trace_mirror_yield(s, cnt * BDRV_SECTOR_SIZE, - s->buf_free_count, s->in_flight); +trace_mirror_yield(s, cnt, s->buf_free_count, s->in_flight); mirror_wait_for_io(s); continue; } else if (cnt != 0) { @@ -869,7 +866,7 @@ static void coroutine_fn mirror_run(void *opaque) * whether to switch to target check one last time if I/O has * come in the meanwhile, and if not flush the data to disk. 
*/ -trace_mirror_before_drain(s, cnt * BDRV_SECTOR_SIZE); +trace_mirror_before_drain(s, cnt); bdrv_drained_begin(bs); cnt = bdrv_get_dirty_count(s->dirty_bitmap); @@ -888,8 +885,7 @@ static void coroutine_fn mirror_run(void *opaque) } ret = 0; -trace_mirror_before_sleep(s, cnt * BDRV_SECTOR_SIZE, - s->synced, delay_ns); +trace_mirror_before_sleep(s, cnt, s->synced, delay_ns); if (!s->synced) { block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, delay_ns); if (block_job_is_cancelled(&s->common)) { diff --git a/migration/block.c b/migration/block.c index 9171f60028..a3512945da 100644 --- a/migration/block.c +++ b/migration/block.c @@ -667,7 +667,7 @@ static int64_t get_remaining_dirty(void) aio_context_release(blk_get_aio_context(bmds->blk)); } -return dirty << BDRV_SECTOR_BITS; +return dirty; } -- 2.13.5
[Qemu-block] [PATCH v9 16/20] qcow2: Switch qcow2_measure() to byte-based iteration
This is new code, but it is easier to read if it makes passes over the image using bytes rather than sectors (and will get easier in the future when bdrv_get_block_status is converted to byte-based). Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v7: tweak constant given to MIN (no semantic change, R-b kept) [Kevin] v6: separate bug fix to earlier patch v5: new patch --- block/qcow2.c | 22 ++ 1 file changed, 10 insertions(+), 12 deletions(-) diff --git a/block/qcow2.c b/block/qcow2.c index bae5893327..64dcd98a91 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -3647,20 +3647,19 @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, BlockDriverState *in_bs, */ required = virtual_size; } else { -int cluster_sectors = cluster_size / BDRV_SECTOR_SIZE; -int64_t sector_num; +int64_t offset; int pnum = 0; -for (sector_num = 0; - sector_num < ssize / BDRV_SECTOR_SIZE; - sector_num += pnum) { -int nb_sectors = MIN(ssize / BDRV_SECTOR_SIZE - sector_num, - BDRV_REQUEST_MAX_SECTORS); +for (offset = 0; offset < ssize; + offset += pnum * BDRV_SECTOR_SIZE) { +int nb_sectors = MIN(ssize - offset, + BDRV_REQUEST_MAX_BYTES) / BDRV_SECTOR_SIZE; BlockDriverState *file; int64_t ret; ret = bdrv_get_block_status_above(in_bs, NULL, - sector_num, nb_sectors, + offset >> BDRV_SECTOR_BITS, + nb_sectors, &pnum, &file); if (ret < 0) { error_setg_errno(&local_err, -ret, @@ -3673,12 +3672,11 @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, BlockDriverState *in_bs, } else if ((ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED)) == (BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED)) { /* Extend pnum to end of cluster for next iteration */ -pnum = ROUND_UP(sector_num + pnum, cluster_sectors) - - sector_num; +pnum = (ROUND_UP(offset + pnum * BDRV_SECTOR_SIZE, + cluster_size) - offset) >> BDRV_SECTOR_BITS; /* Count clusters we've seen */ -required += (sector_num % cluster_sectors + pnum) * -BDRV_SECTOR_SIZE; +required += offset % cluster_size + pnum * BDRV_SECTOR_SIZE; } } } -- 2.13.5
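A toy model of the mixed-unit loop above, where the cursor advances in bytes while the status query still reports a sector count in pnum (the stub below is hypothetical and stands in for bdrv_get_block_status_above()):

#include <stdint.h>
#include <stdio.h>

#define BDRV_SECTOR_SIZE 512

/* Fake status query: pretend the whole queried range is one extent. */
static int fake_status(int64_t offset, int nb_sectors, int *pnum)
{
    (void)offset;
    *pnum = nb_sectors;
    return 0;
}

int main(void)
{
    int64_t ssize = 10 * 1024 * 1024;   /* 10 MiB example image */
    int pnum = 0;

    for (int64_t offset = 0; offset < ssize;
         offset += pnum * BDRV_SECTOR_SIZE) {      /* sectors -> bytes */
        int nb_sectors = (int)((ssize - offset) / BDRV_SECTOR_SIZE);
        fake_status(offset, nb_sectors, &pnum);
    }
    printf("walked the image byte-wise with a sector-based pnum\n");
    return 0;
}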
[Qemu-block] [PATCH v9 10/20] dirty-bitmap: Set iterator start by offset, not sector
All callers to bdrv_dirty_iter_new() passed 0 for their initial starting point, drop that parameter. Most callers to bdrv_set_dirty_iter() were scaling a byte offset to a sector number; the exception qcow2-bitmap will be converted later to use byte rather than sector iteration. Move the scaling to occur internally to dirty bitmap code instead, so that callers now pass in bytes. Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v5: no change v4: rebase to persistent bitmaps v3: no change v2: no change --- include/block/dirty-bitmap.h | 5 ++--- block/backup.c | 5 ++--- block/dirty-bitmap.c | 9 - block/mirror.c | 4 ++-- block/qcow2-bitmap.c | 4 ++-- 5 files changed, 12 insertions(+), 15 deletions(-) diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h index 5f34a1a3c7..757fc4c5b8 100644 --- a/include/block/dirty-bitmap.h +++ b/include/block/dirty-bitmap.h @@ -44,8 +44,7 @@ void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap, void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap, int64_t cur_sector, int64_t nr_sectors); BdrvDirtyBitmapIter *bdrv_dirty_meta_iter_new(BdrvDirtyBitmap *bitmap); -BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap, - uint64_t first_sector); +BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap); void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter); uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap, @@ -80,7 +79,7 @@ void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap, void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap, int64_t cur_sector, int64_t nr_sectors); int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter); -void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num); +void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t offset); int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap); int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap); void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes); diff --git a/block/backup.c b/block/backup.c index 517c300a49..ac9c018717 100644 --- a/block/backup.c +++ b/block/backup.c @@ -372,7 +372,7 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job) granularity = bdrv_dirty_bitmap_granularity(job->sync_bitmap); clusters_per_iter = MAX((granularity / job->cluster_size), 1); -dbi = bdrv_dirty_iter_new(job->sync_bitmap, 0); +dbi = bdrv_dirty_iter_new(job->sync_bitmap); /* Find the next dirty sector(s) */ while ((offset = bdrv_dirty_iter_next(dbi) * BDRV_SECTOR_SIZE) >= 0) { @@ -403,8 +403,7 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job) /* If the bitmap granularity is smaller than the backup granularity, * we need to advance the iterator pointer to the next cluster. 
*/ if (granularity < job->cluster_size) { -bdrv_set_dirty_iter(dbi, -cluster * job->cluster_size / BDRV_SECTOR_SIZE); +bdrv_set_dirty_iter(dbi, cluster * job->cluster_size); } last_cluster = cluster - 1; diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c index 555c736024..84509476ba 100644 --- a/block/dirty-bitmap.c +++ b/block/dirty-bitmap.c @@ -473,11 +473,10 @@ uint32_t bdrv_dirty_bitmap_granularity(const BdrvDirtyBitmap *bitmap) return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->bitmap); } -BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap, - uint64_t first_sector) +BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap) { BdrvDirtyBitmapIter *iter = g_new(BdrvDirtyBitmapIter, 1); -hbitmap_iter_init(&iter->hbi, bitmap->bitmap, first_sector); +hbitmap_iter_init(&iter->hbi, bitmap->bitmap, 0); iter->bitmap = bitmap; bitmap->active_iterators++; return iter; @@ -645,9 +644,9 @@ void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector, /** * Advance a BdrvDirtyBitmapIter to an arbitrary offset. */ -void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *iter, int64_t sector_num) +void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *iter, int64_t offset) { -hbitmap_iter_init(&iter->hbi, iter->hbi.hb, sector_num); +hbitmap_iter_init(&iter->hbi, iter->hbi.hb, offset >> BDRV_SECTOR_BITS); } int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap) diff --git a/block/mirror.c b/block/mirror.c index 6531652d73..0b063b3c20 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -373,7 +373,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) next_dirty = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE; if (next_dirty > next_offset || next_dirty < 0) { /* The bitmap iterator's cache is stale, refresh it */ -
[Qemu-block] [PATCH v9 07/20] dirty-bitmap: Track bitmap size by bytes
We are still using an internal hbitmap that tracks a size in sectors, with the granularity scaled down accordingly, because it lets us use a shortcut for our iterators which are currently sector-based. But there's no reason we can't track the dirty bitmap size in bytes, since it is (mostly) an internal-only variable (remember, the size is how many bytes are covered by the bitmap, not how many bytes the bitmap occupies). A later cleanup will convert dirty bitmap internals to be entirely byte-based, eliminating the intermediate sector rounding added here; and technically, since bdrv_getlength() already rounds up to sectors, our use of DIV_ROUND_UP is more for theoretical completeness than for any actual rounding. Use is_power_of_2() while at it, instead of open-coding that. Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v8: rebase to earlier truncate changes, R-b kept v7: split external from internal [Kevin], drop R-b v6: no change v5: fix bdrv_dirty_bitmap_truncate [John], drop R-b v4: retitle from "Track size in bytes", rebase to persistent bitmaps, round up when converting bytes to sectors v3: no change v2: tweak commit message, no code change --- block/dirty-bitmap.c | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c index 75af32..6d32554b4e 100644 --- a/block/dirty-bitmap.c +++ b/block/dirty-bitmap.c @@ -42,7 +42,7 @@ struct BdrvDirtyBitmap { HBitmap *meta; /* Meta dirty bitmap */ BdrvDirtyBitmap *successor; /* Anonymous child; implies frozen status */ char *name; /* Optional non-empty unique ID */ -int64_t size; /* Size of the bitmap (Number of sectors) */ +int64_t size; /* Size of the bitmap, in bytes */ bool disabled; /* Bitmap is disabled. It ignores all writes to the device */ int active_iterators; /* How many iterators are active */ @@ -115,17 +115,14 @@ BdrvDirtyBitmap *bdrv_create_dirty_bitmap(BlockDriverState *bs, { int64_t bitmap_size; BdrvDirtyBitmap *bitmap; -uint32_t sector_granularity; -assert((granularity & (granularity - 1)) == 0); +assert(is_power_of_2(granularity) && granularity >= BDRV_SECTOR_SIZE); if (name && bdrv_find_dirty_bitmap(bs, name)) { error_setg(errp, "Bitmap already exists: %s", name); return NULL; } -sector_granularity = granularity >> BDRV_SECTOR_BITS; -assert(sector_granularity); -bitmap_size = bdrv_nb_sectors(bs); +bitmap_size = bdrv_getlength(bs); if (bitmap_size < 0) { error_setg_errno(errp, -bitmap_size, "could not get length of device"); errno = -bitmap_size; @@ -133,7 +130,12 @@ BdrvDirtyBitmap *bdrv_create_dirty_bitmap(BlockDriverState *bs, } bitmap = g_new0(BdrvDirtyBitmap, 1); bitmap->mutex = &bs->dirty_bitmap_mutex; -bitmap->bitmap = hbitmap_alloc(bitmap_size, ctz32(sector_granularity)); +/* + * TODO - let hbitmap track full granularity. For now, it is tracking + * only sector granularity, as a shortcut for our iterators. 
+ */ +bitmap->bitmap = hbitmap_alloc(DIV_ROUND_UP(bitmap_size, BDRV_SECTOR_SIZE), + ctz32(granularity) - BDRV_SECTOR_BITS); bitmap->size = bitmap_size; bitmap->name = g_strdup(name); bitmap->disabled = false; @@ -175,7 +177,7 @@ void bdrv_release_meta_dirty_bitmap(BdrvDirtyBitmap *bitmap) int64_t bdrv_dirty_bitmap_size(const BdrvDirtyBitmap *bitmap) { -return bitmap->size * BDRV_SECTOR_SIZE; +return bitmap->size; } const char *bdrv_dirty_bitmap_name(const BdrvDirtyBitmap *bitmap) @@ -305,14 +307,13 @@ BdrvDirtyBitmap *bdrv_reclaim_dirty_bitmap(BlockDriverState *bs, void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes) { BdrvDirtyBitmap *bitmap; -int64_t size = DIV_ROUND_UP(bytes, BDRV_SECTOR_SIZE); bdrv_dirty_bitmaps_lock(bs); QLIST_FOREACH(bitmap, &bs->dirty_bitmaps, list) { assert(!bdrv_dirty_bitmap_frozen(bitmap)); assert(!bitmap->active_iterators); -hbitmap_truncate(bitmap->bitmap, size); -bitmap->size = size; +hbitmap_truncate(bitmap->bitmap, DIV_ROUND_UP(bytes, BDRV_SECTOR_SIZE)); +bitmap->size = bytes; } bdrv_dirty_bitmaps_unlock(bs); } @@ -549,7 +550,8 @@ void bdrv_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap **out) hbitmap_reset_all(bitmap->bitmap); } else { HBitmap *backup = bitmap->bitmap; -bitmap->bitmap = hbitmap_alloc(bitmap->size, +bitmap->bitmap = hbitmap_alloc(DIV_ROUND_UP(bitmap->size, +BDRV_SECTOR_SIZE), hbitmap_granularity(backup)); *out = backup; } -- 2.13.5
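A note for readers puzzling over the ctz32(granularity) - BDRV_SECTOR_BITS scaling in this patch: here is a minimal standalone sketch of the arithmetic with hypothetical numbers (DIV_ROUND_UP and the sector constants are redefined locally, and __builtin_ctz stands in for QEMU's ctz32(), so the snippet compiles outside the tree):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define BDRV_SECTOR_BITS 9
#define BDRV_SECTOR_SIZE (1 << BDRV_SECTOR_BITS)

int main(void)
{
    int64_t bitmap_size = 1LL << 30;   /* bytes covered: a 1 GiB image */
    uint32_t granularity = 1u << 16;   /* 64 KiB of image per dirty bit */

    /* the hbitmap still counts in 512-byte sectors... */
    int64_t items = DIV_ROUND_UP(bitmap_size, BDRV_SECTOR_SIZE);
    /* ...so the granularity shift is scaled down by BDRV_SECTOR_BITS: */
    int shift = __builtin_ctz(granularity) - BDRV_SECTOR_BITS; /* ctz32() in QEMU */

    /* prints: 2097152 sector items, shift 7 -> 16384 dirty bits */
    printf("%" PRId64 " sector items, shift %d -> %" PRId64 " dirty bits\n",
           items, shift, items >> shift);
    return 0;
}

With these inputs the hbitmap is allocated with 2^21 sector-sized items grouped 2^7 at a time, i.e. one dirty bit per 64 KiB of image, which is exactly the requested granularity.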
[Qemu-block] [PATCH v9 01/20] block: Make bdrv_img_create() size selection easier to read
All callers of bdrv_img_create() pass in a size, or -1 to read the size from the backing file. We then set that size as the QemuOpt default, which means we will reuse that default rather than the final parameter to qemu_opt_get_size() several lines later. But it is rather confusing to read subsequent checks of 'size == -1' when it looks (without seeing the full context) like size defaults to 0; it also doesn't help that a size of 0 is valid (for some formats). Rework the logic to make things more legible. Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v6: Combine into a series rather than being a standalone patch (more for ease of tracking than for being on topic) --- block.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block.c b/block.c index 6dd47e414e..ee6a48976e 100644 --- a/block.c +++ b/block.c @@ -4393,7 +4393,7 @@ void bdrv_img_create(const char *filename, const char *fmt, /* The size for the image must always be specified, unless we have a backing * file and we have not been forbidden from opening it. */ -size = qemu_opt_get_size(opts, BLOCK_OPT_SIZE, 0); +size = qemu_opt_get_size(opts, BLOCK_OPT_SIZE, img_size); if (backing_file && !(flags & BDRV_O_NO_BACKING)) { BlockDriverState *bs; char *full_backing = g_new0(char, PATH_MAX); -- 2.13.5
[Qemu-block] [PATCH v9 00/20] make dirty-bitmap byte-based
There are patches floating around to add NBD_CMD_BLOCK_STATUS, but NBD wants to report status on byte granularity (even if the reporting will probably be naturally aligned to sectors or even much higher levels). I've therefore started the task of converting our block status code to report at a byte granularity rather than sectors. Now that 2.11 is open, I'm rebasing/reposting the remaining patches. The overall conversion currently looks like: part 1: bdrv_is_allocated (merged in 2.10, commit 51b0a488) part 2: dirty-bitmap (this series, v8 was here [1]) part 3: bdrv_get_block_status (v4 is posted [2] and is mostly reviewed) part 4: .bdrv_co_block_status (v3 is posted [3], but needs review) Available as a tag at: git fetch git://repo.or.cz/qemu/ericb.git nbd-byte-dirty-v8 [1] https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg04733.html [2] https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg03543.html [3] https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg03812.html Since v8: - add R-b where appropriate - patch 5 simplified again [Fam] - patch 18: add a comment, and enhance iotests 165 to test that aspect [Kevin] 001/20:[] [--] 'block: Make bdrv_img_create() size selection easier to read' 002/20:[] [--] 'hbitmap: Rename serialization_granularity to serialization_align' 003/20:[] [--] 'qcow2: Ensure bitmap serialization is aligned' 004/20:[] [--] 'dirty-bitmap: Drop unused functions' 005/20:[0004] [FC] 'dirty-bitmap: Avoid size query failure during truncate' 006/20:[] [--] 'dirty-bitmap: Change bdrv_dirty_bitmap_size() to report bytes' 007/20:[] [--] 'dirty-bitmap: Track bitmap size by bytes' 008/20:[] [--] 'dirty-bitmap: Change bdrv_dirty_bitmap_*serialize*() to take bytes' 009/20:[] [--] 'qcow2: Switch sectors_covered_by_bitmap_cluster() to byte-based' 010/20:[] [--] 'dirty-bitmap: Set iterator start by offset, not sector' 011/20:[] [--] 'dirty-bitmap: Change bdrv_dirty_iter_next() to report byte offset' 012/20:[] [--] 'dirty-bitmap: Change bdrv_get_dirty_count() to report bytes' 013/20:[] [--] 'dirty-bitmap: Change bdrv_get_dirty_locked() to take bytes' 014/20:[] [--] 'dirty-bitmap: Change bdrv_[re]set_dirty_bitmap() to use bytes' 015/20:[] [--] 'mirror: Switch mirror_dirty_init() to byte-based iteration' 016/20:[] [--] 'qcow2: Switch qcow2_measure() to byte-based iteration' 017/20:[] [--] 'qcow2: Switch load_bitmap_data() to byte-based iteration' 018/20:[0007] [FC] 'qcow2: Switch store_bitmap_data() to byte-based iteration' 019/20:[] [--] 'dirty-bitmap: Switch bdrv_set_dirty() to bytes' 020/20:[] [--] 'dirty-bitmap: Convert internal hbitmap size/granularity' Eric Blake (20): block: Make bdrv_img_create() size selection easier to read hbitmap: Rename serialization_granularity to serialization_align qcow2: Ensure bitmap serialization is aligned dirty-bitmap: Drop unused functions dirty-bitmap: Avoid size query failure during truncate dirty-bitmap: Change bdrv_dirty_bitmap_size() to report bytes dirty-bitmap: Track bitmap size by bytes dirty-bitmap: Change bdrv_dirty_bitmap_*serialize*() to take bytes qcow2: Switch sectors_covered_by_bitmap_cluster() to byte-based dirty-bitmap: Set iterator start by offset, not sector dirty-bitmap: Change bdrv_dirty_iter_next() to report byte offset dirty-bitmap: Change bdrv_get_dirty_count() to report bytes dirty-bitmap: Change bdrv_get_dirty_locked() to take bytes dirty-bitmap: Change bdrv_[re]set_dirty_bitmap() to use bytes mirror: Switch mirror_dirty_init() to byte-based iteration qcow2: Switch qcow2_measure() to byte-based iteration 
qcow2: Switch load_bitmap_data() to byte-based iteration qcow2: Switch store_bitmap_data() to byte-based iteration dirty-bitmap: Switch bdrv_set_dirty() to bytes dirty-bitmap: Convert internal hbitmap size/granularity include/block/block_int.h| 2 +- include/block/dirty-bitmap.h | 43 ++ include/qemu/hbitmap.h | 8 +-- block/io.c | 6 +- block.c | 17 -- block/backup.c | 7 +-- block/dirty-bitmap.c | 134 ++- block/mirror.c | 79 +++-- block/qcow2-bitmap.c | 62 +++- block/qcow2.c| 22 --- migration/block.c| 12 ++-- tests/test-hbitmap.c | 10 ++-- util/hbitmap.c | 8 +-- tests/qemu-iotests/165 | 2 +- 14 files changed, 173 insertions(+), 239 deletions(-) -- 2.13.5
[Qemu-block] [PATCH v9 05/20] dirty-bitmap: Avoid size query failure during truncate
We've previously fixed several places where we failed to account for possible errors from bdrv_nb_sectors(). Fix another one by making bdrv_dirty_bitmap_truncate() take the new size from the caller instead of querying itself; then adjust the sole caller bdrv_truncate() to pass the size just determined by a successful resize, or to skip the bitmap resize on failure, thus avoiding sizing the bitmaps to -1. Signed-off-by: Eric Blake --- v9: skip only bdrv_dirty_bitmap_truncate on error [Fam] v8: retitle and rework to avoid possibility of secondary failure [John] v7: new patch [Kevin] --- include/block/dirty-bitmap.h | 2 +- block.c | 15 ++- block/dirty-bitmap.c | 6 +++--- 3 files changed, 14 insertions(+), 9 deletions(-) diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h index 8fd842eac9..7a27590047 100644 --- a/include/block/dirty-bitmap.h +++ b/include/block/dirty-bitmap.h @@ -83,7 +83,7 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter); void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num); int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap); int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap); -void bdrv_dirty_bitmap_truncate(BlockDriverState *bs); +void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes); bool bdrv_dirty_bitmap_readonly(const BdrvDirtyBitmap *bitmap); bool bdrv_has_readonly_bitmaps(BlockDriverState *bs); bool bdrv_dirty_bitmap_get_autoload(const BdrvDirtyBitmap *bitmap); diff --git a/block.c b/block.c index ee6a48976e..89261a7a53 100644 --- a/block.c +++ b/block.c @@ -3450,12 +3450,17 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, PreallocMode prealloc, assert(!(bs->open_flags & BDRV_O_INACTIVE)); ret = drv->bdrv_truncate(bs, offset, prealloc, errp); -if (ret == 0) { -ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS); -bdrv_dirty_bitmap_truncate(bs); -bdrv_parent_cb_resize(bs); -atomic_inc(&bs->write_gen); +if (ret < 0) { +return ret; } +ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS); +if (ret < 0) { +error_setg_errno(errp, -ret, "Could not refresh total sector count"); +} else { +bdrv_dirty_bitmap_truncate(bs, bs->total_sectors * BDRV_SECTOR_SIZE); +} +bdrv_parent_cb_resize(bs); +atomic_inc(&bs->write_gen); return ret; } diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c index 42a55e4a4b..ee164fb518 100644 --- a/block/dirty-bitmap.c +++ b/block/dirty-bitmap.c @@ -1,7 +1,7 @@ /* * Block Dirty Bitmap * - * Copyright (c) 2016 Red Hat. Inc + * Copyright (c) 2016-2017 Red Hat. Inc * * Permission is hereby granted, free of charge, to any person obtaining a copy * of this software and associated documentation files (the "Software"), to deal @@ -302,10 +302,10 @@ BdrvDirtyBitmap *bdrv_reclaim_dirty_bitmap(BlockDriverState *bs, * Truncates _all_ bitmaps attached to a BDS. * Called with BQL taken. */ -void bdrv_dirty_bitmap_truncate(BlockDriverState *bs) +void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes) { BdrvDirtyBitmap *bitmap; -uint64_t size = bdrv_nb_sectors(bs); +int64_t size = DIV_ROUND_UP(bytes, BDRV_SECTOR_SIZE); bdrv_dirty_bitmaps_lock(bs); QLIST_FOREACH(bitmap, &bs->dirty_bitmaps, list) { -- 2.13.5
[Qemu-block] [PATCH v9 03/20] qcow2: Ensure bitmap serialization is aligned
When subdividing a bitmap serialization, the code in hbitmap.c enforces that start/count parameters are aligned (except that count can end early at end-of-bitmap). We exposed this required alignment through bdrv_dirty_bitmap_serialization_align(), but forgot to actually check that we comply with it. Fortunately, qcow2 is never dividing bitmap serialization smaller than one cluster (which is a minimum of 512 bytes); so we are always compliant with the serialization alignment (which insists that we partition at least 64 bits per chunk) because we are doing at least 4k bits per chunk. Still, it's safer to add an assertion (for the unlikely case that we'd ever support a cluster smaller than 512 bytes, or if the hbitmap implementation changes what it considers to be aligned), rather than leaving bdrv_dirty_bitmap_serialization_align() without a caller. Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v5: no change v4: new patch --- block/qcow2-bitmap.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c index e8d3bdbd6e..b3ee4c794a 100644 --- a/block/qcow2-bitmap.c +++ b/block/qcow2-bitmap.c @@ -274,10 +274,13 @@ static int free_bitmap_clusters(BlockDriverState *bs, Qcow2BitmapTable *tb) static uint64_t sectors_covered_by_bitmap_cluster(const BDRVQcow2State *s, const BdrvDirtyBitmap *bitmap) { -uint32_t sector_granularity = +uint64_t sector_granularity = bdrv_dirty_bitmap_granularity(bitmap) >> BDRV_SECTOR_BITS; +uint64_t sbc = sector_granularity * (s->cluster_size << 3); -return (uint64_t)sector_granularity * (s->cluster_size << 3); +assert(QEMU_IS_ALIGNED(sbc, + bdrv_dirty_bitmap_serialization_align(bitmap))); +return sbc; } /* load_bitmap_data -- 2.13.5
[Qemu-block] [PATCH v9 04/20] dirty-bitmap: Drop unused functions
We had several functions that no one is currently using, and which use sector-based interfaces. I'm trying to convert towards byte-based interfaces, so it's easier to just drop the unused functions: bdrv_dirty_bitmap_get_meta bdrv_dirty_bitmap_get_meta_locked bdrv_dirty_bitmap_reset_meta bdrv_dirty_bitmap_meta_granularity Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v5: no change v4: rebase to Vladimir's persistent bitmaps (bdrv_dirty_bitmap_size now in use), dropped R-b v3: rebase to upstream changes (bdrv_dirty_bitmap_get_meta_locked was added in b64bd51e with no clients), kept R-b v2: tweak commit message based on review, no code change --- include/block/dirty-bitmap.h | 10 -- block/dirty-bitmap.c | 44 2 files changed, 54 deletions(-) diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h index a79a58d2c3..8fd842eac9 100644 --- a/include/block/dirty-bitmap.h +++ b/include/block/dirty-bitmap.h @@ -34,7 +34,6 @@ void bdrv_enable_dirty_bitmap(BdrvDirtyBitmap *bitmap); BlockDirtyInfoList *bdrv_query_dirty_bitmaps(BlockDriverState *bs); uint32_t bdrv_get_default_bitmap_granularity(BlockDriverState *bs); uint32_t bdrv_dirty_bitmap_granularity(const BdrvDirtyBitmap *bitmap); -uint32_t bdrv_dirty_bitmap_meta_granularity(BdrvDirtyBitmap *bitmap); bool bdrv_dirty_bitmap_enabled(BdrvDirtyBitmap *bitmap); bool bdrv_dirty_bitmap_frozen(BdrvDirtyBitmap *bitmap); const char *bdrv_dirty_bitmap_name(const BdrvDirtyBitmap *bitmap); @@ -44,15 +43,6 @@ void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap, int64_t cur_sector, int64_t nr_sectors); void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap, int64_t cur_sector, int64_t nr_sectors); -int bdrv_dirty_bitmap_get_meta(BlockDriverState *bs, - BdrvDirtyBitmap *bitmap, int64_t sector, - int nb_sectors); -int bdrv_dirty_bitmap_get_meta_locked(BlockDriverState *bs, - BdrvDirtyBitmap *bitmap, int64_t sector, - int nb_sectors); -void bdrv_dirty_bitmap_reset_meta(BlockDriverState *bs, - BdrvDirtyBitmap *bitmap, int64_t sector, - int nb_sectors); BdrvDirtyBitmapIter *bdrv_dirty_meta_iter_new(BdrvDirtyBitmap *bitmap); BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap, uint64_t first_sector); diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c index 0490ca3aff..42a55e4a4b 100644 --- a/block/dirty-bitmap.c +++ b/block/dirty-bitmap.c @@ -173,45 +173,6 @@ void bdrv_release_meta_dirty_bitmap(BdrvDirtyBitmap *bitmap) qemu_mutex_unlock(bitmap->mutex); } -int bdrv_dirty_bitmap_get_meta_locked(BlockDriverState *bs, - BdrvDirtyBitmap *bitmap, int64_t sector, - int nb_sectors) -{ -uint64_t i; -int sectors_per_bit = 1 << hbitmap_granularity(bitmap->meta); - -/* To optimize: we can make hbitmap to internally check the range in a - * coarse level, or at least do it word by word. 
*/ -for (i = sector; i < sector + nb_sectors; i += sectors_per_bit) { -if (hbitmap_get(bitmap->meta, i)) { -return true; -} -} -return false; -} - -int bdrv_dirty_bitmap_get_meta(BlockDriverState *bs, - BdrvDirtyBitmap *bitmap, int64_t sector, - int nb_sectors) -{ -bool dirty; - -qemu_mutex_lock(bitmap->mutex); -dirty = bdrv_dirty_bitmap_get_meta_locked(bs, bitmap, sector, nb_sectors); -qemu_mutex_unlock(bitmap->mutex); - -return dirty; -} - -void bdrv_dirty_bitmap_reset_meta(BlockDriverState *bs, - BdrvDirtyBitmap *bitmap, int64_t sector, - int nb_sectors) -{ -qemu_mutex_lock(bitmap->mutex); -hbitmap_reset(bitmap->meta, sector, nb_sectors); -qemu_mutex_unlock(bitmap->mutex); -} - int64_t bdrv_dirty_bitmap_size(const BdrvDirtyBitmap *bitmap) { return bitmap->size; @@ -511,11 +472,6 @@ uint32_t bdrv_dirty_bitmap_granularity(const BdrvDirtyBitmap *bitmap) return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->bitmap); } -uint32_t bdrv_dirty_bitmap_meta_granularity(BdrvDirtyBitmap *bitmap) -{ -return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->meta); -} - BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap, uint64_t first_sector) { -- 2.13.5
[Qemu-block] [PATCH v9 02/20] hbitmap: Rename serialization_granularity to serialization_align
The only client of hbitmap_serialization_granularity() is dirty-bitmap's bdrv_dirty_bitmap_serialization_align(). Keeping the two names consistent is worthwhile, and the shorter name is more representative of what the function returns (the required alignment to be used for start/count of other serialization functions, where violating the alignment causes assertion failures). Signed-off-by: Eric Blake Reviewed-by: John Snow Reviewed-by: Kevin Wolf Reviewed-by: Fam Zheng --- v5: no change v4: new patch --- include/qemu/hbitmap.h | 8 block/dirty-bitmap.c | 2 +- tests/test-hbitmap.c | 10 +- util/hbitmap.c | 8 4 files changed, 14 insertions(+), 14 deletions(-) diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h index d3a74a21fc..81e78043d1 100644 --- a/include/qemu/hbitmap.h +++ b/include/qemu/hbitmap.h @@ -159,16 +159,16 @@ bool hbitmap_get(const HBitmap *hb, uint64_t item); bool hbitmap_is_serializable(const HBitmap *hb); /** - * hbitmap_serialization_granularity: + * hbitmap_serialization_align: * @hb: HBitmap to operate on. * - * Granularity of serialization chunks, used by other serialization functions. - * For every chunk: + * Required alignment of serialization chunks, used by other serialization + * functions. For every chunk: * 1. Chunk start should be aligned to this granularity. * 2. Chunk size should be aligned too, except for last chunk (for which * start + count == hb->size) */ -uint64_t hbitmap_serialization_granularity(const HBitmap *hb); +uint64_t hbitmap_serialization_align(const HBitmap *hb); /** * hbitmap_serialization_size: diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c index 30462d4f9a..0490ca3aff 100644 --- a/block/dirty-bitmap.c +++ b/block/dirty-bitmap.c @@ -617,7 +617,7 @@ uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap, uint64_t bdrv_dirty_bitmap_serialization_align(const BdrvDirtyBitmap *bitmap) { -return hbitmap_serialization_granularity(bitmap->bitmap); +return hbitmap_serialization_align(bitmap->bitmap); } void bdrv_dirty_bitmap_serialize_part(const BdrvDirtyBitmap *bitmap, diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c index 1acb353889..af41642346 100644 --- a/tests/test-hbitmap.c +++ b/tests/test-hbitmap.c @@ -738,15 +738,15 @@ static void test_hbitmap_meta_one(TestHBitmapData *data, const void *unused) } } -static void test_hbitmap_serialize_granularity(TestHBitmapData *data, - const void *unused) +static void test_hbitmap_serialize_align(TestHBitmapData *data, + const void *unused) { int r; hbitmap_test_init(data, L3 * 2, 3); g_assert(hbitmap_is_serializable(data->hb)); -r = hbitmap_serialization_granularity(data->hb); +r = hbitmap_serialization_align(data->hb); g_assert_cmpint(r, ==, 64 << 3); } @@ -974,8 +974,8 @@ int main(int argc, char **argv) hbitmap_test_add("/hbitmap/meta/word", test_hbitmap_meta_word); hbitmap_test_add("/hbitmap/meta/sector", test_hbitmap_meta_sector); -hbitmap_test_add("/hbitmap/serialize/granularity", - test_hbitmap_serialize_granularity); +hbitmap_test_add("/hbitmap/serialize/align", + test_hbitmap_serialize_align); hbitmap_test_add("/hbitmap/serialize/basic", test_hbitmap_serialize_basic); hbitmap_test_add("/hbitmap/serialize/part", diff --git a/util/hbitmap.c b/util/hbitmap.c index 21535cc90b..2f9d0fdbd0 100644 --- a/util/hbitmap.c +++ b/util/hbitmap.c @@ -413,14 +413,14 @@ bool hbitmap_is_serializable(const HBitmap *hb) { /* Every serialized chunk must be aligned to 64 bits so that endianness * requirements can be fulfilled on both 64 bit and 32 bit hosts. 
- * We have hbitmap_serialization_granularity() which converts this + * We have hbitmap_serialization_align() which converts this * alignment requirement from bitmap bits to items covered (e.g. sectors). * That value is: *64 << hb->granularity * Since this value must not exceed UINT64_MAX, hb->granularity must be * less than 58 (== 64 - 6, where 6 is ld(64), i.e. 1 << 6 == 64). * - * In order for hbitmap_serialization_granularity() to always return a + * In order for hbitmap_serialization_align() to always return a * meaningful value, bitmaps that are to be serialized must have a * granularity of less than 58. */ @@ -437,7 +437,7 @@ bool hbitmap_get(const HBitmap *hb, uint64_t item) return (hb->levels[HBITMAP_LEVELS - 1][pos >> BITS_PER_LEVEL] & bit) != 0; } -uint64_t hbitmap_serialization_granularity(const HBitmap *hb) +uint64_t hbitmap_serialization_align(const HBitmap *hb) { assert(hbitmap_is_serializable(hb)); @@ -454,7 +454,7 @@ static void serialization_chunk(const HBitmap *hb, unsigned long **first_el
Re: [Qemu-block] [PATCH v8 18/20] qcow2: Switch store_bitmap_data() to byte-based iteration
On 09/19/2017 07:44 AM, Kevin Wolf wrote: > Am 18.09.2017 um 20:58 hat Eric Blake geschrieben: >> Now that we have adjusted the majority of the calls this function >> makes to be byte-based, it is easier to read the code if it makes >> passes over the image using bytes rather than sectors. >> >> Signed-off-by: Eric Blake >> Reviewed-by: John Snow >> Reviewed-by: Vladimir Sementsov-Ogievskiy >> @@ -1100,18 +1099,17 @@ static uint64_t *store_bitmap_data(BlockDriverState >> *bs, >> dbi = bdrv_dirty_iter_new(bitmap); >> buf = g_malloc(s->cluster_size); >> limit = bytes_covered_by_bitmap_cluster(s, bitmap); Limit is huge. For ease of math, let's consider an image with 512-byte clusters, and an explicit bitmap granularity of 512 bytes. One cluster (512 bytes, or 4k bits) of the bitmap covers 4k multiples of the granularity, or 2M of image; and the minimum serialization alignment of 64 bits covers 32k bytes of image. bytes_covered_by_bitmap_cluster() computes granularity (512 bytes per bit), limit (2M bytes per cluster of bits), and checks that 2M is indeed aligned to bdrv_dirty_bitmap_serialization_align() (which is 32k). Next, an image with default 64k clusters and a default bitmap granularity of 64k bytes; one cluster (64k bytes, or 512k bits) now covers 512k multiples of the granularity, or 32G of image size. However... >> -sbc = limit >> BDRV_SECTOR_BITS; >> assert(DIV_ROUND_UP(bm_size, limit) == tb_size); >> >> -while ((sector = bdrv_dirty_iter_next(dbi) >> BDRV_SECTOR_BITS) >= 0) { >> -uint64_t cluster = sector / sbc; >> +while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) { >> +uint64_t cluster = offset / limit; bdrv_dirty_iter_next() returns the next dirty bit (which is not necessarily the first bit in the cluster). For the purposes of serialization, we want to serialize the entire cluster in one go, even though we will be serializing 0 bits up until the first dirty bit. So offset at this point may be unaligned, >> uint64_t end, write_size; >> int64_t off; >> >> -sector = cluster * sbc; >> -end = MIN(bm_sectors, sector + sbc); >> -write_size = bdrv_dirty_bitmap_serialization_size(bitmap, >> -sector * BDRV_SECTOR_SIZE, (end - sector) * BDRV_SECTOR_SIZE); >> +offset = QEMU_ALIGN_DOWN(offset, limit); and we round it down to the start of the cluster that we will actually be serializing. > > With the question that I asked in v6, apart from changing the spelling > to be more explicit, I actually hoped that you would explain why > aligning down is the right thing to do. > > It looks plausible to me that we can do this in correct code because we > don't support granularities < 512 bytes (a qemu restriction that is > written down as a note in the qcow2 spec). It's not granularities < 512 that we have to worry about, but dirty bits with an offset < 4M (because we are serializing an entire cluster of bitmap data, where the first dirty offset is not necessarily aligned to that large). > > The part that I'm missing yet is why we need to do it. The bitmap > granularity is also the granularity of bdrv_dirty_iter_next(), so isn't > offset already aligned and we could even assert that instead of aligning > down? (As long we enforce our restriction, which we seem to do in > bitmap_list_load().) 
Sadly, a quick: diff --git i/block/qcow2-bitmap.c w/block/qcow2-bitmap.c index 302fffd6e1..160f3b99f9 100644 --- i/block/qcow2-bitmap.c +++ w/block/qcow2-bitmap.c @@ -1106,6 +1106,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs, uint64_t end, write_size; int64_t off; +assert(QEMU_IS_ALIGNED(offset, limit)); offset = QEMU_ALIGN_DOWN(offset, limit); end = MIN(bm_size, offset + limit); write_size = bdrv_dirty_bitmap_serialization_size(bitmap, offset, does NOT fail iotests 165, which appears to be the only test that actually hammers on qcow2 bitmaps (changing it to an 'assert(false)' only shows an effect on 165) - which means our test is NOT exercising all possible alignments. And it's python-based, with lame output, which makes debugging it painful. But never fear, for v9 I will improve the test to actually affect the bitmap at a point that would fail with my temporary assertion in place, and thus proving that we DO need to align down. Note that test 165 is testing only a 1G image, but I just showed that 64k clusters with 64k granularity covers up to 32G of image space in one cluster of the bitmap, so the test is only covering one cluster of serialization in the first place, and as written, the test is dirtying byte 0, which explains why it happens to get an offset aligned to limit, even though that is not a valid assertion. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
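As a sanity check of the numbers in this thread, here is a minimal standalone sketch of the coverage and align-down math; it mirrors what bytes_covered_by_bitmap_cluster() and QEMU_ALIGN_DOWN() compute, but uses plain C arithmetic and hypothetical inputs rather than the in-tree helpers:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    uint64_t cluster_size = 64 * 1024; /* qcow2 cluster size, in bytes */
    uint64_t granularity  = 64 * 1024; /* image bytes per bitmap bit */

    /* one cluster of serialized bitmap data holds cluster_size * 8 bits,
     * each of which covers 'granularity' bytes of the image: */
    uint64_t limit = granularity * cluster_size * 8;

    /* the dirty iterator may return any offset inside that range; the
     * serializer writes whole bitmap clusters, so it must round down: */
    uint64_t offset  = 5 * granularity + 123;   /* arbitrary, unaligned */
    uint64_t aligned = offset - offset % limit; /* QEMU_ALIGN_DOWN(offset, limit) */

    /* prints: limit 34359738368 (32 GiB); offset 327803 -> cluster start 0 */
    printf("limit %" PRIu64 " (%" PRIu64 " GiB); offset %" PRIu64
           " -> cluster start %" PRIu64 "\n",
           limit, limit >> 30, offset, aligned);
    return 0;
}

Note how any offset below 32 GiB rounds down to 0 here, which is also why a test that only dirties byte 0 of a 1G image can never catch a missing align-down.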
Re: [Qemu-block] [Qemu-devel] [PATCH] block: Clean up some bad code in the vvfat driver
On 19.09.2017 21:01, John Snow wrote: > > > On 09/19/2017 04:06 AM, Paolo Bonzini wrote: >> On 13/09/2017 21:08, John Snow wrote: >>> >>> >>> On 09/13/2017 06:21 AM, Thomas Huth wrote: Remove the unnecessary home-grown redefinition of the assert() macro here, and remove the unusable debug code at the end of the checkpoint() function. The code there uses assert() with side-effects (assignment to the "mapping" variable), which should be avoided. Looking more closely, it seems as it is apparently also only usable for one certain directory layout (with a file named USB.H in it) and thus is of no use for the rest of the world. Signed-off-by: Thomas Huth >>> >>> Farewell, bitrot code. >>> >>> Reviewed-by: John Snow >>> >>> Out of curiosity, I wonder ... >>> >>> jhuston@probe (foobar) ~/s/qemu> git grep '#if 0' | wc -l >>> 320 >> >> >> $ git grep -c '#if 0' | sort -k2 --field-separator=: -n >> ... >> hw/net/eepro100.c:21 >> target/ppc/cpu-models.h:76 >> >> whoa :) >> > > Wonder if '#if 0' should be against the style guide / in checkpatch. checkpatch already complains if you have a "#if 0" in your patch, so I think we should be pretty fine here already - but of course you can still add a paragraph to the CODING_STYLE if you like. Thomas
Re: [Qemu-block] ping Re: [Qemu-devel] [PATCH v7 03/16] migration: split common postcopy out of ram postcopy
* Vladimir Sementsov-Ogievskiy (vsement...@virtuozzo.com) wrote: > ping for 1-3 > Can we merge them? I see all of them have R-b's; so let's try and put them in the next migration merge. Quintela: Sound good? Dave > 22.08.2017 02:34, John Snow wrote: > > > > On 07/11/2017 09:38 AM, Vladimir Sementsov-Ogievskiy wrote: > > > 11.07.2017 16:06, Dr. David Alan Gilbert wrote: > > > > * Vladimir Sementsov-Ogievskiy (vsement...@virtuozzo.com) wrote: > > > > > Split common postcopy stuff from ram postcopy stuff. > > > > > > > > > > Signed-off-by: Vladimir Sementsov-Ogievskiy > > > > I'm OK with this; I'm not sure I'd have bothered making the pings > > > > dependent on it being RAM. > > > > > > > > (These first few are pretty much a separable series). > > > It would be great if you (or Juan?) can take them separately. > > > > > Yes please. I don't think this ever happened, did it? Can we split off > > 1-3 and re-roll? > > > > -- > Best regards, > Vladimir > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-block] [Qemu-devel] Dead code in cpu-models.h
On 19.09.2017 20:54, John Snow wrote: > > > On 09/19/2017 04:14 AM, Thomas Huth wrote: >> On 19.09.2017 10:06, Paolo Bonzini wrote: >>> On 13/09/2017 21:08, John Snow wrote: >> [...] Farewell, bitrot code. Reviewed-by: John Snow Out of curiosity, I wonder ... jhuston@probe (foobar) ~/s/qemu> git grep '#if 0' | wc -l 320 >>> >>> $ git grep -c '#if 0' | sort -k2 --field-separator=: -n >>> ... >>> hw/net/eepro100.c:21 >>> target/ppc/cpu-models.h:76 >>> >>> whoa :) >> >> Igor recently already removed the dead definitions from cpu-models.c : >> >> https://git.qemu.org/?p=qemu.git;a=commitdiff;h=aef779605779579afbafff >> >> I guess we could now remove the corresponding dead definitions from the >> header, too... >> >> Any volunteers? >> >> Thomas > > Well, I can just take a chainsaw to it blindly if you want to critique it. > I think it should at least be aligned with the changes that have been done to cpu-models.c (e.g. the definition for CPU_POWERPC_401B3 has been removed from the .c file, so CPU_POWERPC_401B3 could be removed from the .h file, too). Thomas
Re: [Qemu-block] [Qemu-devel] [PATCH] block: Clean up some bad code in the vvfat driver
On 09/19/2017 04:06 AM, Paolo Bonzini wrote: > On 13/09/2017 21:08, John Snow wrote: >> >> >> On 09/13/2017 06:21 AM, Thomas Huth wrote: >>> Remove the unnecessary home-grown redefinition of the assert() macro here, >>> and remove the unusable debug code at the end of the checkpoint() function. >>> The code there uses assert() with side-effects (assignment to the "mapping" >>> variable), which should be avoided. Looking more closely, it seems as it is >>> apparently also only usable for one certain directory layout (with a file >>> named USB.H in it) and thus is of no use for the rest of the world. >>> >>> Signed-off-by: Thomas Huth >> >> Farewell, bitrot code. >> >> Reviewed-by: John Snow >> >> Out of curiosity, I wonder ... >> >> jhuston@probe (foobar) ~/s/qemu> git grep '#if 0' | wc -l >> 320 > > > $ git grep -c '#if 0' | sort -k2 --field-separator=: -n > ... > hw/net/eepro100.c:21 > target/ppc/cpu-models.h:76 > > whoa :) > Wonder if '#if 0' should be against the style guide / in checkpatch. Conditional compilations should at least be contingent on some named variable/condition. (Probably super ideally all conditional compilations correlate directly to a configure variable...) --js
Re: [Qemu-block] [Qemu-devel] Dead code in cpu-models.h
On 09/19/2017 04:14 AM, Thomas Huth wrote: > On 19.09.2017 10:06, Paolo Bonzini wrote: >> On 13/09/2017 21:08, John Snow wrote: > [...] >>> Farewell, bitrot code. >>> >>> Reviewed-by: John Snow >>> >>> Out of curiosity, I wonder ... >>> >>> jhuston@probe (foobar) ~/s/qemu> git grep '#if 0' | wc -l >>> 320 >> >> $ git grep -c '#if 0' | sort -k2 --field-separator=: -n >> ... >> hw/net/eepro100.c:21 >> target/ppc/cpu-models.h:76 >> >> whoa :) > > Igor recently already removed the dead definitions from cpu-models.c : > > https://git.qemu.org/?p=qemu.git;a=commitdiff;h=aef779605779579afbafff > > I guess we could now remove the corresponding dead definitions from the > header, too... > > Any volunteers? > > Thomas > Well, I can just take a chainsaw to it blindly if you want to critique it.
Re: [Qemu-block] [Qemu-devel] [PATCH] throttle-groups: update tg->any_timer_armed[] on detach
On Tue, Sep 19, 2017 at 11:07:53AM -0500, Eric Blake wrote: > On 09/19/2017 10:50 AM, Stefan Hajnoczi wrote: > > Clear tg->any_timer_armed[] when throttling timers are destroy during > > s/destroy/destroyed/ All your base are belong to us!
[Qemu-block] qemu-img: Check failed: No space left on device
Hello, First post here, so maybe I should introduce myself : - I'm a sysadmin for decades and currently managing 4 oVirt clusters, made out of tens of hypervisors, all CentOS 7.2+ based. - I'm very happy with this solution we chose, especially because it is based on qemu-kvm (open source, reliable, documented). On one VM, we experienced the following : - oVirt/vdsm detected an issue on the image - following the hints at https://access.redhat.com/solutions/1173623, I managed to detect one error and fix it - the VM is now running perfectly On two other VMs, we experienced a similar situation, except the check stage is showing something like 14000+ errors, and the relevant logs are : Repairing refcount block 14 is outside image ERROR could not resize image: Invalid argument ERROR cluster 425984 refcount=0 reference=1 ERROR cluster 425985 refcount=0 reference=1 [... repeating the previous line 7000+ times...] ERROR cluster 457166 refcount=0 reference=1 Rebuilding refcount structure ERROR writing refblock: No space left on device qemu-img: Check failed: No space left on device You surely know that oVirt/RHEV stores its qcow2 images in dedicated logical volumes. pvs/vgs/lvs are all showing there is plenty of space available, so I understand that I don't understand what "No space left on device" means here. The relevant versions are : - qemu-img-ev-2.3.0-31.el7_2.10.1.x86_64. - libvirt-daemon-kvm-1.2.17-13.el7_2.4.x86_64 - vdsm-4.17.32-1.el7.noarch - # qemu-img info /the/relevant/logical/volume/path file format: qcow2 virtual size: 30G (32212254720 bytes) disk size: 0 cluster_size: 65536 Format specific information: compat: 0.10 refcount bits: 16 Is there any hope I'll be able to repair this image? -- Nicolas ECARNOT
Re: [Qemu-block] [Qemu-devel] [PATCH] throttle-groups: update tg->any_timer_armed[] on detach
On 09/19/2017 10:50 AM, Stefan Hajnoczi wrote: > Clear tg->any_timer_armed[] when throttling timers are destroy during s/destroy/destroyed/ > AioContext attach/detach. Failure to do so causes throttling to hang > because we believe the timer is already scheduled! > > The following was broken at least since QEMU 2.10.0 with -drive > iops=100: > > $ dd if=/dev/zero of=/dev/vdb oflag=direct count=1000 > (qemu) stop > (qemu) cont > ...I/O is stuck... > > Reported-by: Yongxue Hong > Signed-off-by: Stefan Hajnoczi > --- > block/throttle-groups.c | 13 + > 1 file changed, 13 insertions(+) > Reviewed-by: Eric Blake CC: qemu-sta...@nongnu.org -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] image locking
On 09/19/2017 10:46 AM, Vladimir Sementsov-Ogievskiy wrote: > Hi Fam! > > I have a question about your image locking series: > > Can you please explain, why OFD locking is enabled by default and posix > locking is not? What is wrong with posix locking, what are the problems? POSIX locking (lockf(3), which wraps fcntl(2)) has a severe flaw, mandated by POSIX due to historical behavior, regarding the duration of the lock - within a single process, ALL lock access to the same inode, regardless of which fd triggered the access, is tied together:

int fd1 = open("/dev/null", O_RDONLY);
int fd2 = open("/dev/null", O_RDONLY);
fcntl(fd1, F_SETLK, ...); // Obtain lock
close(fd2); // oops, lock on fd1 is lost!

BSD locks (flock(2)) do not have this issue, but have a different flaw - they can only lock the entire file at once. But we need fine-grained access (we lock different byte ranges in order to convey multiple pieces of information across processes). Hence, Linux invented OFD locking as the best of both worlds (the locks are tied to the open file description rather than the process, but allow fine-grained access), and it is likely that POSIX will eventually standardize it (http://austingroupbugs.net/view.php?id=768). More from the Linux 'man fcntl' page: >The record locks described above are associated with the process >(unlike the open file description locks described below). This has >some unfortunate consequences: > >* If a process closes any file descriptor referring to a file, then > all of the process's locks on that file are released, regardless of > the file descriptor(s) on which the locks were obtained. This is > bad: it means that a process can lose its locks on a file such as > /etc/passwd or /etc/mtab when for some reason a library function > decides to open, read, and close the same file. > >* The threads in a process share locks. In other words, a multithreaded > program can't use record locking to ensure that threads > don't simultaneously access the same region of a file. > >Open file description locks solve both of these problems. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
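For contrast with the F_SETLK example above, a minimal sketch of the same pattern using OFD locks; this is Linux-specific (F_OFD_SETLK needs _GNU_SOURCE and Linux >= 3.15), error handling is trimmed, and the file name is just a placeholder:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd1 = open("/tmp/ofd-demo", O_RDWR | O_CREAT, 0600);
    int fd2 = open("/tmp/ofd-demo", O_RDWR);
    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 1,   /* byte-range locking still works */
        .l_pid    = 0,   /* must be 0 for OFD locks */
    };

    if (fd1 < 0 || fd2 < 0 || fcntl(fd1, F_OFD_SETLK, &fl) == -1) {
        perror("ofd-demo");
        return 1;
    }

    /* The lock belongs to fd1's open file description, so closing another
     * fd that refers to the same file does NOT release it: */
    close(fd2);
    /* ...fd1 still holds the byte-range lock here... */
    close(fd1);
    return 0;
}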
Re: [Qemu-block] [Qemu-devel] image locking
On Tue, Sep 19, 2017 at 06:46:19PM +0300, Vladimir Sementsov-Ogievskiy wrote: > Hi Fam! > > I have a question about your image locking series: > > Can you please explain, why OFD locking is enabled by default and posix > locking is not? What is wrong with posix locking, what are the problems? POSIX locking suffers from a horribly broken design making it practically impossible to use it reliably, particularly in a threaded program. If there are two file descriptors open to the same file, closing one FD will release locks held by the other FD. See more details here: http://0pointer.de/blog/projects/locking.html Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
[Qemu-block] [PATCH] throttle-groups: update tg->any_timer_armed[] on detach
Clear tg->any_timer_armed[] when throttling timers are destroy during AioContext attach/detach. Failure to do so causes throttling to hang because we believe the timer is already scheduled! The following was broken at least since QEMU 2.10.0 with -drive iops=100: $ dd if=/dev/zero of=/dev/vdb oflag=direct count=1000 (qemu) stop (qemu) cont ...I/O is stuck... Reported-by: Yongxue Hong Signed-off-by: Stefan Hajnoczi --- block/throttle-groups.c | 13 + 1 file changed, 13 insertions(+) diff --git a/block/throttle-groups.c b/block/throttle-groups.c index 6ba992c8d7..2bfd03faa0 100644 --- a/block/throttle-groups.c +++ b/block/throttle-groups.c @@ -592,7 +592,20 @@ void throttle_group_attach_aio_context(ThrottleGroupMember *tgm, void throttle_group_detach_aio_context(ThrottleGroupMember *tgm) { ThrottleTimers *tt = &tgm->throttle_timers; +ThrottleGroup *tg = container_of(tgm->throttle_state, ThrottleGroup, ts); + throttle_timers_detach_aio_context(tt); + +/* Forget about these timers, they have been destroyed */ +qemu_mutex_lock(&tg->lock); +if (tg->tokens[0] == tgm) { +tg->any_timer_armed[0] = false; +} +if (tg->tokens[1] == tgm) { +tg->any_timer_armed[1] = false; +} +qemu_mutex_unlock(&tg->lock); + tgm->aio_context = NULL; } -- 2.13.5
[Qemu-block] image locking
Hi Fam! I have a question about your image locking series: Can you please explain, why OFD locking is enabled by default and posix locking is not? What is wrong with posix locking, what are the problems? -- Best regards, Vladimir
[Qemu-block] RFC: Reducing the size of entries in the qcow2 L2 cache
Hi everyone,

over the past few weeks I have been testing the effects of reducing the size of the entries in the qcow2 L2 cache. This was briefly mentioned by Denis in the same thread where we discussed subcluster allocation back in April, but I'll describe here the problem and the proposal in detail.

=== Problem ===

In the qcow2 file format guest addresses are mapped to host addresses using the so-called L1 and L2 tables. The size of an L2 table is the same as the cluster size, therefore a larger cluster means more L2 entries in a table, and because of that an L2 table can map a larger portion of the address space (not only because it contains more entries, but also because the data cluster that each one of those entries points at is larger).

There are two consequences of this:

   1) If you double the cluster size of a qcow2 image then the maximum
      space needed for all L2 tables is divided by two (i.e. you need
      half the metadata).

   2) If you double the cluster size of a qcow2 image then each one of
      the L2 tables will map four times as much disk space.

With the default cluster size of 64KB, each L2 table maps 512MB of contiguous disk space. This table shows what happens when you change the cluster size:

|--------------+------------------|
| Cluster size | An L2 table maps |
|--------------+------------------|
| 512 B        | 32 KB            |
| 1 KB         | 128 KB           |
| 2 KB         | 512 KB           |
| 4 KB         | 2 MB             |
| 8 KB         | 8 MB             |
| 16 KB        | 32 MB            |
| 32 KB        | 128 MB           |
| 64 KB        | 512 MB           |
| 128 KB       | 2 GB             |
| 256 KB       | 8 GB             |
| 512 KB       | 32 GB            |
| 1 MB         | 128 GB           |
| 2 MB         | 512 GB           |
|--------------+------------------|

When QEMU wants to convert a guest address into a host address, it needs to read the entry from the corresponding L2 table. The qcow2 driver doesn't read those entries directly, it does it by loading the tables in the L2 cache so they can be kept in memory in case they are needed later.

The problem here is that the L2 cache (and the qcow2 driver in general) always works with complete L2 tables: if QEMU needs a particular L2 entry then the whole cluster containing the L2 table is read from disk, and if the cache is full then a cluster worth of cached data has to be discarded.

The consequences of this are worse the larger the cluster size is, not only because we're reading (and discarding) larger amounts of data, but also because we're using that memory in a very inefficient way.

Example: with 1MB clusters each L2 table maps 128GB of contiguous virtual disk, so that's the granularity of our cache. If we're performing I/O in a 4GB area that overlaps two of those 128GB chunks, we need to have in the cache two complete L2 tables (2MB) even when in practice we're only using 32KB of those 2MB (32KB contain enough L2 entries to map the 4GB that we're using).

=== The proposal ===

One way to solve the problems described above is to decouple the L2 table size (which is equal to the cluster size) from the cache entry size.

The qcow2 cache doesn't actually know anything about the data that it's loading, it just receives a disk offset and checks that it is properly aligned. It's perfectly possible to make it load data blocks smaller than a cluster.

I already have a working prototype, and I was doing tests using a 4KB cache entry size. 4KB is small enough, it allows us to make a more flexible use of the cache, it's also a common file system block size and it can hold enough L2 entries to cover substantial amounts of disk space (especially with large clusters).

|--------------+-----------------------|
| Cluster size | 4KB of L2 entries map |
|--------------+-----------------------|
| 64 KB        | 32 MB                 |
| 128 KB       | 64 MB                 |
| 256 KB       | 128 MB                |
| 512 KB       | 256 MB                |
| 1 MB         | 512 MB                |
| 2 MB         | 1 GB                  |
|--------------+-----------------------|

Some results from my tests (using an SSD drive and random 4K reads):

|-----------+--------------+--------------+---------------+--------------|
| Disk size | Cluster size | L2 cache     | Standard QEMU | Patched QEMU |
|-----------+--------------+--------------+---------------+--------------|
| 16 GB     | 64 KB        | 1 MB [8 GB]  | 5000 IOPS     | 12700 IOPS   |
| 2 TB      | 2 MB         | 4 MB [1 TB]  | 576 IOPS      | 11000 IOPS   |
|-----------+--------------+--------------+---------------+--------------|

The improvements are clearly visible, but it's important to point out a couple of things:

- L2 cache size is always < total L2 metadata on disk (otherwise this wouldn't make sense).
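As a back-of-the-envelope check of where the coverage table above comes from: each L2 entry is 8 bytes and maps exactly one data cluster, so a cache entry of E bytes maps (E / 8) * cluster_size bytes of virtual disk. A small standalone sketch with the proposed 4KB entry size (the numbers, not the code, come from the RFC):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    uint64_t entry_size = 4096; /* proposed cache entry size, in bytes */
    uint64_t clusters[] = { 64 << 10, 128 << 10, 256 << 10,
                            512 << 10, 1 << 20, 2 << 20 };

    for (int i = 0; i < 6; i++) {
        /* each 8-byte L2 entry maps one data cluster */
        uint64_t mapped = entry_size / 8 * clusters[i];
        /* last row prints 1024 MB, i.e. the 1 GB from the table */
        printf("%4" PRIu64 " KB clusters -> 4KB of L2 entries map %4" PRIu64 " MB\n",
               clusters[i] >> 10, mapped >> 20);
    }
    return 0;
}

The same rule explains the first table too: a full L2 table is cluster_size bytes, so it maps (cluster_size / 8) * cluster_size bytes, e.g. 8192 entries * 64 KB = 512 MB for the default cluster size.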
Re: [Qemu-block] RFC: Reducing the size of entries in the qcow2 L2 cache
On 09/19/2017 06:07 PM, Alberto Garcia wrote: > Hi everyone, > > over the past few weeks I have been testing the effects of reducing > the size of the entries in the qcow2 L2 cache. This was briefly > mentioned by Denis in the same thread where we discussed subcluster > allocation back in April, but I'll describe here the problem and the > proposal in detail. > > === Problem === > > In the qcow2 file format guest addresses are mapped to host addresses > using the so-called L1 and L2 tables. The size of an L2 table is the > same as the cluster size, therefore a larger cluster means more L2 > entries in a table, and because of that an L2 table can map a larger > portion of the address space (not only because it contains more > entries, but also because the data cluster that each one of those > entries points at is larger). > > There are two consequences of this: > >1) If you double the cluster size of a qcow2 image then the maximum > space needed for all L2 tables is divided by two (i.e. you need > half the metadata). > >2) If you double the cluster size of a qcow2 image then each one of > the L2 tables will map four times as much disk space. > > With the default cluster size of 64KB, each L2 table maps 512MB of > contiguous disk space. This table shows what happens when you change > the cluster size: > > |--+--| > | Cluster size | An L2 table maps | > |--+--| > | 512 B |32 KB | > | 1 KB | 128 KB | > | 2 KB | 512 KB | > | 4 KB | 2 MB | > | 8 KB | 8 MB | > |16 KB |32 MB | > |32 KB | 128 MB | > |64 KB | 512 MB | > | 128 KB | 2 GB | > | 256 KB | 8 GB | > | 512 KB |32 GB | > | 1 MB | 128 GB | > | 2 MB | 512 GB | > |--+--| > > When QEMU wants to convert a guest address into a host address, it > needs to read the entry from the corresponding L2 table. The qcow2 > driver doesn't read those entries directly, it does it by loading the > tables in the L2 cache so they can be kept in memory in case they are > needed later. > > The problem here is that the L2 cache (and the qcow2 driver in > general) always works with complete L2 tables: if QEMU needs a > particular L2 entry then the whole cluster containing the L2 table is > read from disk, and if the cache is full then a cluster worth of > cached data has to be discarded. > > The consequences of this are worse the larger the cluster size is, not > only because we're reading (and discarding) larger amounts of data, > but also because we're using that memory in a very inefficient way. > > Example: with 1MB clusters each L2 table maps 128GB of contiguous > virtual disk, so that's the granularity of our cache. If we're > performing I/O in a 4GB area that overlaps two of those 128GB chunks, > we need to have in the cache two complete L2 tables (2MB) even when in > practice we're only using 32KB of those 2MB (32KB contain enough L2 > entries to map the 4GB that we're using). > > === The proposal === > > One way to solve the problems described above is to decouple the L2 > table size (which is equal to the cluster size) from the cache entry > size. > > The qcow2 cache doesn't actually know anything about the data that > it's loading, it just receives a disk offset and checks that it is > properly aligned. It's perfectly possible to make it load data blocks > smaller than a cluster. > > I already have a working prototype, and I was doing tests using a 4KB > cache entry size. 
4KB is small enough, it allows us to make a more > flexible use of the cache, it's also a common file system block size > and it can hold enough L2 entries to cover substantial amounts of disk > space (especially with large clusters). > > |--+---| > | Cluster size | 4KB of L2 entries map | > |--+---| > | 64 KB| 32 MB | > | 128 KB | 64 MB | > | 256 KB | 128 MB| > | 512 KB | 256 MB| > | 1 MB | 512 MB| > | 2 MB | 1 GB | > |--+---| > > Some results from my tests (using an SSD drive and random 4K reads): > > |---+--+-+---+--| > | Disk size | Cluster size | L2 cache| Standard QEMU | Patched QEMU | > |---+--+-+---+--| > | 16 GB | 64 KB| 1 MB [8 GB] | 5000 IOPS | 12700 IOPS | > | 2 TB | 2 MB| 4 MB [1 TB] | 576 IOPS | 11000 IOPS | > |---+-
Re: [Qemu-block] [Qemu-devel] [PATCH v8 00/20] make dirty-bitmap byte-based
On Mon, 09/18 13:57, Eric Blake wrote: > There are patches floating around to add NBD_CMD_BLOCK_STATUS, > but NBD wants to report status on byte granularity (even if the > reporting will probably be naturally aligned to sectors or even > much higher levels). I've therefore started the task of > converting our block status code to report at a byte granularity > rather than sectors. > > Now that 2.11 is open, I'm rebasing/reposting the remaining patches. > > The overall conversion currently looks like: > part 1: bdrv_is_allocated (merged in 2.10, commit 51b0a488) > part 2: dirty-bitmap (this series, v7 was here [1]) > part 3: bdrv_get_block_status (v4 is posted [2] and is mostly reviewed) > part 4: .bdrv_co_block_status (v3 is posted [3], but needs review) Patches 1-4, 6-20: Reviewed-by: Fam Zheng
Re: [Qemu-block] Block Migration and CPU throttling
On 19/09/2017 15:36, Peter Lieven wrote:
> Hi,
>
> I just noticed that CPU throttling and Block Migration don't work
> together very well. During block migration the throttling heuristic
> detects that we obviously make no progress in ram transfer. But the
> reason is the running block migration and not a too high dirty pages
> rate.
>
> The result is that any VM is throttled by 99% during block migration.
>
> I wonder what the best way would be to fix this. I came up with the
> following ideas so far:
>
> - disable throttling while block migration is in bulk stage
> - check if the absolute number of num_dirty_pages_period crosses a
>   threshold and not if it's just greater than 50% of transferred bytes
> - check if migration_dirty_pages > 0. This slows down throttling, but
>   does not avoid it completely.

If you can use nbd-server and drive-mirror for block migration (libvirt
would do it), then you will use multiple sockets and be able to migrate
block and RAM at the same time.

Otherwise, disabling throttling during the bulk stage is the one that
seems nicest and most promising.

Paolo
Re: [Qemu-block] Block Migration and CPU throttling
* Peter Lieven (p...@kamp.de) wrote:
> Am 19.09.2017 um 16:38 schrieb Dr. David Alan Gilbert:
> > * Peter Lieven (p...@kamp.de) wrote:
> > > Hi,
> > >
> > > I just noticed that CPU throttling and Block Migration don't work
> > > together very well.
> > > During block migration the throttling heuristic detects that we obviously
> > > make no progress in ram transfer. But the reason is the running block
> > > migration and not a too high dirty pages rate.
> > >
> > > The result is that any VM is throttled by 99% during block migration.
> > Hmm that's unfortunate; do you have a bandwidth set lower than your
> > actual network connection? I'm just wondering if it's actually going
> > between the block and RAM iterative sections or getting stuck in one.
> It happens also if source and dest are on the same machine and speed is set
> to 100G.

But does it happen if they're not and the speed is set low?

Dave

> Peter

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-block] Block Migration and CPU throttling
Am 19.09.2017 um 16:38 schrieb Dr. David Alan Gilbert:
> * Peter Lieven (p...@kamp.de) wrote:
> > Hi,
> >
> > I just noticed that CPU throttling and Block Migration don't work
> > together very well.
> > During block migration the throttling heuristic detects that we
> > obviously make no progress in ram transfer. But the reason is the
> > running block migration and not a too high dirty pages rate.
> >
> > The result is that any VM is throttled by 99% during block migration.
> Hmm that's unfortunate; do you have a bandwidth set lower than your
> actual network connection? I'm just wondering if it's actually going
> between the block and RAM iterative sections or getting stuck in one.

It happens also if source and dest are on the same machine and speed is
set to 100G.

Peter
Re: [Qemu-block] Block Migration and CPU throttling
* Peter Lieven (p...@kamp.de) wrote:
> Hi,
>
> I just noticed that CPU throttling and Block Migration don't work together
> very well.
> During block migration the throttling heuristic detects that we obviously
> make no progress in ram transfer. But the reason is the running block
> migration and not a too high dirty pages rate.
>
> The result is that any VM is throttled by 99% during block migration.

Hmm that's unfortunate; do you have a bandwidth set lower than your
actual network connection? I'm just wondering if it's actually going
between the block and RAM iterative sections or getting stuck in one.

> I wonder what the best way would be to fix this. I came up with the
> following ideas so far:
>
> - disable throttling while block migration is in bulk stage

mig_throttle_guest_down is currently in migration/ram.c so however we
do it, we've got to add an 'inhibit' that block.c could set (I suggest
a counter so that it could be set by a few things).

> - check if the absolute number of num_dirty_pages_period crosses a
>   threshold and not if it's just greater than 50% of transferred bytes
> - check if migration_dirty_pages > 0. This slows down throttling, but does
>   not avoid it completely.

An interesting question is whether you want to inhibit in the non-bulk
stage if IO writes are happening too quickly to keep up.

Dave

> Peter

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
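A minimal sketch of the counter Dave suggests, with invented names (nothing
below exists in the tree; it only illustrates the shape of the idea):

    #include <assert.h>
    #include <stdbool.h>

    /* Illustration of the suggested inhibit counter; all names are made up. */
    static int throttle_inhibit_count;

    /* Block migration would call this with true when entering the bulk
     * stage and with false when leaving it; other users can nest. */
    void migration_throttle_inhibit(bool inhibit)
    {
        throttle_inhibit_count += inhibit ? 1 : -1;
        assert(throttle_inhibit_count >= 0);
    }

    static void mig_throttle_guest_down(void)
    {
        if (throttle_inhibit_count > 0) {
            return;  /* e.g. block migration is responsible for the stall */
        }
        /* ... existing CPU throttling logic in migration/ram.c ... */
    }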
Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper
On 19/09/2017 15:26, Daniel P. Berrange wrote:
> On Tue, Sep 19, 2017 at 03:23:09PM +0200, Paolo Bonzini wrote:
>> On 19/09/2017 15:12, Daniel P. Berrange wrote:
>>> On Tue, Sep 19, 2017 at 02:57:00PM +0200, Paolo Bonzini wrote:
>>>> On 19/09/2017 14:53, Daniel P. Berrange wrote:
>>>>>> +    /* Try to reconnect while sending the CDB.  */
>>>>>> +    for (attempts = 0; attempts < PR_MAX_RECONNECT_ATTEMPTS;
>>>>>> attempts++) {
>>>>>
>>>>> I'm curious why you need to loop here. The helper daemon should be running
>>>>> already, as you're not spawning it on demand IIUC. So it should either
>>>>> succeed first time, or fail every time.
>>>>
>>>> You're focusing on the use case where the helper daemon is spawned per-VM
>>>> by the system libvirtd, which I agree is the most important one.
>>>> However, the other use case is the one with a global daemon, access to
>>>> which is controlled via Unix permissions. This is not SELinux-friendly,
>>>> but it is nicer for testing and it is also the only possibility for user
>>>> libvirtd.
>>>>
>>>> In that case, upgrading QEMU on the host could result in a "systemctl
>>>> restart qemu-pr-helper.service" (or, hopefully unlikely, a crash could
>>>> result in systemd respawning the daemon). Reconnect is useful in that case.
>>>
>>> If using systemd socket activation and you restart the daemon, the listening
>>> socket should be preserved, so you shouldn't need to reconnect - the client
>>> should get queued until it has started again (likewise on crash).
>>
>> Oh, that's cool. I didn't know that. However, systemd socket
>> activation is optional, and it's only a handful of lines so I think it's
>> a bit nicer behavior (chardevs for example have options to reconnect).
>
> The downside is that if someone forgets to start the daemon, or enable
> the socket, QEMU will spin for 5 seconds trying to reconnect, instead
> of reporting an error immediately.

Not exactly:

- the daemon must be present at startup. Reconnect is only tried when
  sending a command.

- it's not a busy wait, pr_manager_helper_run runs in a util/thread-pool.c
  worker thread (see pr_manager_execute in patch 1).

Paolo
Re: [Qemu-block] [Qemu-devel] [PATCH 2/4] scsi: build qemu-pr-helper
On 19/09/2017 15:33, Daniel P. Berrange wrote:
> On Tue, Sep 19, 2017 at 12:24:32PM +0200, Paolo Bonzini wrote:
>> Introduce a privileged helper to run persistent reservation commands.
>> This lets virtual machines send persistent reservations without using
>> CAP_SYS_RAWIO or out-of-tree patches. The helper uses Unix permissions
>> and SCM_RIGHTS to restrict access to processes that can access its socket
>> and prove that they have an open file descriptor for a raw SCSI device.
>>
>> The next patch will also correct the usage of persistent reservations
>> with multipath devices.
>>
>> It would also be possible to add support for Linux's IOC_PR_* ioctls in
>> the future, to support NVMe devices. For now, however, only SCSI is
>> supported.
>>
>> Signed-off-by: Paolo Bonzini
>> ---
>>
>> +#define PR_HELPER_CDB_SIZE 16
>> +#define PR_HELPER_SENSE_SIZE 96
>> +#define PR_HELPER_DATA_SIZE 8192
>> +
>> +typedef struct PRHelperResponse {
>> +    int32_t result;
>> +    int32_t sz;
>> +    uint8_t sense[PR_HELPER_SENSE_SIZE];
>> +} PRHelperResponse;
>
> Should we annotate this with 'packed' to ensure it's immune to compiler
> padding ?

Yes...

>> +typedef struct PRHelperRequest {
>> +    int fd;
>> +    size_t sz;
>> +    uint8_t cdb[PR_HELPER_CDB_SIZE];
>> +} PRHelperRequest;
>
> Same q here ?

... and no, since this one is not the wire format (which is just a
16-byte cdb, plus fd in ancillary data).

>> +static int coroutine_fn prh_write_response(PRHelperClient *client,
>> +                                           PRHelperRequest *req,
>> +                                           PRHelperResponse *resp,
>> +                                           Error **errp)
>> +{
>> +    ssize_t r;
>
> Can just be int
>
>> +    size_t sz;
>> +
>> +    if (req->cdb[0] == PERSISTENT_RESERVE_IN && resp->result == GOOD) {
>> +        assert(resp->sz <= req->sz && resp->sz <= sizeof(client->data));
>> +    } else {
>> +        assert(resp->sz == 0);
>> +    }
>> +
>> +    sz = resp->sz;
>> +
>> +    resp->result = cpu_to_be32(resp->result);
>> +    resp->sz = cpu_to_be32(resp->sz);
>> +    r = qio_channel_write_all(QIO_CHANNEL(client->ioc),
>> +                              (char *) resp, sizeof(*resp), errp);
>> +    if (r < 0) {
>> +        return r;
>> +    }
>> +
>> +    r = qio_channel_write_all(QIO_CHANNEL(client->ioc),
>> +                              (char *) client->data,
>> +                              sz, errp);
>> +    return r < 0 ? r : 0;
>
> Just return 'r' directly, it's only ever -1 or 0

Ok, thanks, will adjust all of these.

Paolo
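For the record, the agreed change would presumably look something like this
(a sketch using QEMU's QEMU_PACKED annotation; the exact placement is an
assumption on my part, not the posted code):

    /* Wire format of the helper's response; packed so the layout cannot
     * change with compiler padding rules, and byte-swapped to big-endian
     * before transmission (see prh_write_response).  PRHelperRequest
     * intentionally stays unpacked: it is not a wire structure. */
    typedef struct QEMU_PACKED PRHelperResponse {
        int32_t result;
        int32_t sz;
        uint8_t sense[PR_HELPER_SENSE_SIZE];
    } PRHelperResponse;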
Re: [Qemu-block] [Qemu-devel] [PATCH 2/4] scsi: build qemu-pr-helper
On Tue, Sep 19, 2017 at 12:24:32PM +0200, Paolo Bonzini wrote:
> Introduce a privileged helper to run persistent reservation commands.
> This lets virtual machines send persistent reservations without using
> CAP_SYS_RAWIO or out-of-tree patches. The helper uses Unix permissions
> and SCM_RIGHTS to restrict access to processes that can access its socket
> and prove that they have an open file descriptor for a raw SCSI device.
>
> The next patch will also correct the usage of persistent reservations
> with multipath devices.
>
> It would also be possible to add support for Linux's IOC_PR_* ioctls in
> the future, to support NVMe devices. For now, however, only SCSI is
> supported.
>
> Signed-off-by: Paolo Bonzini
> ---
>
> +#define PR_HELPER_CDB_SIZE 16
> +#define PR_HELPER_SENSE_SIZE 96
> +#define PR_HELPER_DATA_SIZE 8192
> +
> +typedef struct PRHelperResponse {
> +    int32_t result;
> +    int32_t sz;
> +    uint8_t sense[PR_HELPER_SENSE_SIZE];
> +} PRHelperResponse;

Should we annotate this with 'packed' to ensure it's immune to compiler
padding ?

> +typedef struct PRHelperRequest {
> +    int fd;
> +    size_t sz;
> +    uint8_t cdb[PR_HELPER_CDB_SIZE];
> +} PRHelperRequest;

Same q here ?

> +static int coroutine_fn prh_write_response(PRHelperClient *client,
> +                                           PRHelperRequest *req,
> +                                           PRHelperResponse *resp,
> +                                           Error **errp)
> +{
> +    ssize_t r;

Can just be int

> +    size_t sz;
> +
> +    if (req->cdb[0] == PERSISTENT_RESERVE_IN && resp->result == GOOD) {
> +        assert(resp->sz <= req->sz && resp->sz <= sizeof(client->data));
> +    } else {
> +        assert(resp->sz == 0);
> +    }
> +
> +    sz = resp->sz;
> +
> +    resp->result = cpu_to_be32(resp->result);
> +    resp->sz = cpu_to_be32(resp->sz);
> +    r = qio_channel_write_all(QIO_CHANNEL(client->ioc),
> +                              (char *) resp, sizeof(*resp), errp);
> +    if (r < 0) {
> +        return r;
> +    }
> +
> +    r = qio_channel_write_all(QIO_CHANNEL(client->ioc),
> +                              (char *) client->data,
> +                              sz, errp);
> +    return r < 0 ? r : 0;

Just return 'r' directly, it's only ever -1 or 0

Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org       -o-            https://fstop138.berrange.com   :|
|: https://entangle-photo.org  -o-    https://www.instagram.com/dberrange   :|
Re: [Qemu-block] [PATCH v2 7/7] block/nbd-client: do not yield from nbd_read_reply_entry
On 09/19/2017 05:00 AM, Vladimir Sementsov-Ogievskiy wrote:
> 19.09.2017 01:36, Eric Blake wrote:
>> On 09/18/2017 08:59 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> Refactor nbd client to not yield from nbd_read_reply_entry. It's
>>> possible now as all reading is done in nbd_read_reply_entry and
>>> all request-related data is stored in per-request s->requests[i].
>>>
>>> We need here some additional work with s->requests[i].ret and
>>> s->quit to not fail requests on normal disconnect and to not report
>>> reading errors on normal disconnect.
>>>
>>> @@ -364,6 +353,8 @@ void nbd_client_close(BlockDriverState *bs)
>>>          nbd_send_request(client->ioc, &request);
>>> +    client->quit = true;
>> Previously, client->quit was only set when detecting a broken server,
>> now it is also set for a clean exit. Do we need to change any
>> documentation of the field?
>
> It has documentation?

Touché. But it probably should, especially if we are changing its
semantics, to make it easier to understand from looking at the struct
what semantics to expect.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org

signature.asc
Description: OpenPGP digital signature
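Something as small as a comment on the field would address this; a possible
sketch (the wording is mine, not from the series):

    typedef struct NBDClientSession {
        /* ... */

        /* Set when the server is detected to violate the protocol, and
         * now also on a clean client-initiated close: once true, no new
         * requests are sent and in-flight ones are failed or drained. */
        bool quit;
    } NBDClientSession;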
[Qemu-block] Block Migration and CPU throttling
Hi,

I just noticed that CPU throttling and Block Migration don't work
together very well. During block migration the throttling heuristic
detects that we obviously make no progress in ram transfer. But the
reason is the running block migration and not a too high dirty pages
rate.

The result is that any VM is throttled by 99% during block migration.

I wonder what the best way would be to fix this. I came up with the
following ideas so far:

- disable throttling while block migration is in bulk stage
- check if the absolute number of num_dirty_pages_period crosses a
  threshold and not if it's just greater than 50% of transferred bytes
- check if migration_dirty_pages > 0. This slows down throttling, but
  does not avoid it completely.

Peter
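The second idea could look roughly like this, as shown in the sketch below
(made-up threshold value; TARGET_PAGE_SIZE is QEMU's, the rest is invented,
and the real heuristic lives in migration/ram.c):

    /* Illustration only: require an absolute dirty-page threshold on top
     * of the existing relative check, so a slow transfer phase (e.g.
     * during block migration) does not by itself look like a high dirty
     * rate. */
    static bool should_throttle(uint64_t num_dirty_pages_period,
                                uint64_t bytes_xfer_period)
    {
        const uint64_t abs_threshold = 64 * 1024;  /* pages; made-up value */

        return num_dirty_pages_period * TARGET_PAGE_SIZE >
                   bytes_xfer_period / 2 &&        /* current 50% check */
               num_dirty_pages_period > abs_threshold;
    }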
Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper
On 19/09/2017 15:12, Daniel P. Berrange wrote:
> On Tue, Sep 19, 2017 at 02:57:00PM +0200, Paolo Bonzini wrote:
>> On 19/09/2017 14:53, Daniel P. Berrange wrote:
>>>> +    /* Try to reconnect while sending the CDB.  */
>>>> +    for (attempts = 0; attempts < PR_MAX_RECONNECT_ATTEMPTS;
>>>> attempts++) {
>>>
>>> I'm curious why you need to loop here. The helper daemon should be running
>>> already, as you're not spawning it on demand IIUC. So it should either
>>> succeed first time, or fail every time.
>>
>> You're focusing on the use case where the helper daemon is spawned per-VM
>> by the system libvirtd, which I agree is the most important one.
>> However, the other use case is the one with a global daemon, access to
>> which is controlled via Unix permissions. This is not SELinux-friendly,
>> but it is nicer for testing and it is also the only possibility for user
>> libvirtd.
>>
>> In that case, upgrading QEMU on the host could result in a "systemctl
>> restart qemu-pr-helper.service" (or, hopefully unlikely, a crash could
>> result in systemd respawning the daemon). Reconnect is useful in that case.
>
> If using systemd socket activation and you restart the daemon, the listening
> socket should be preserved, so you shouldn't need to reconnect - the client
> should get queued until it has started again (likewise on crash).

Oh, that's cool. I didn't know that. However, systemd socket
activation is optional, and it's only a handful of lines so I think it's
a bit nicer behavior (chardevs for example have options to reconnect).

Paolo
Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper
On 19/09/2017 14:53, Daniel P. Berrange wrote:
>> +    /* Try to reconnect while sending the CDB.  */
>> +    for (attempts = 0; attempts < PR_MAX_RECONNECT_ATTEMPTS; attempts++) {
>
> I'm curious why you need to loop here. The helper daemon should be running
> already, as you're not spawning it on demand IIUC. So it should either
> succeed first time, or fail every time.

You're focusing on the use case where the helper daemon is spawned per-VM
by the system libvirtd, which I agree is the most important one.
However, the other use case is the one with a global daemon, access to
which is controlled via Unix permissions. This is not SELinux-friendly,
but it is nicer for testing and it is also the only possibility for user
libvirtd.

In that case, upgrading QEMU on the host could result in a "systemctl
restart qemu-pr-helper.service" (or, hopefully unlikely, a crash could
result in systemd respawning the daemon). Reconnect is useful in that case.

Paolo
Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper
On Tue, Sep 19, 2017 at 12:24:34PM +0200, Paolo Bonzini wrote: > This adds a concrete subclass of pr-manager that talks to qemu-pr-helper. > > Signed-off-by: Paolo Bonzini > --- > v1->v2: fixed string property double-free > fixed/cleaned up error handling > handle buffer underrun > > scsi/Makefile.objs | 2 +- > scsi/pr-manager-helper.c | 302 > +++ > 2 files changed, 303 insertions(+), 1 deletion(-) > create mode 100644 scsi/pr-manager-helper.c > > diff --git a/scsi/Makefile.objs b/scsi/Makefile.objs > index 5496d2ae6a..4d25e476cf 100644 > --- a/scsi/Makefile.objs > +++ b/scsi/Makefile.objs > @@ -1,3 +1,3 @@ > block-obj-y += utils.o > > -block-obj-$(CONFIG_LINUX) += pr-manager.o > +block-obj-$(CONFIG_LINUX) += pr-manager.o pr-manager-helper.o > diff --git a/scsi/pr-manager-helper.c b/scsi/pr-manager-helper.c > new file mode 100644 > index 00..063fd35dee > --- /dev/null > +++ b/scsi/pr-manager-helper.c > +/* Called with lock held. */ > +static int pr_manager_helper_read(PRManagerHelper *pr_mgr, > + void *buf, int sz, Error **errp) > +{ > +ssize_t r = qio_channel_read_all(pr_mgr->ioc, buf, sz, errp); The return value isn't ssize_t - it just returns 0 on success, -1 on error, never a bytes count since it always reads all requested bytes and treats EOF as an error condition. > + > +if (r < 0) { > +object_unref(OBJECT(pr_mgr->ioc)); > +pr_mgr->ioc = NULL; > +return -EINVAL; > +} > + > +return 0; > +} > + Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
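In other words the fix is mechanical; a sketch of the corrected helper with
just the type changed, assuming the rest of the function stays as posted:

    /* Called with lock held. */
    static int pr_manager_helper_read(PRManagerHelper *pr_mgr,
                                      void *buf, int sz, Error **errp)
    {
        /* qio_channel_read_all() returns 0 on success and -1 on error;
         * it never returns a byte count, and short reads/EOF are treated
         * as errors, so an int success flag is the right type here. */
        int r = qio_channel_read_all(pr_mgr->ioc, buf, sz, errp);

        if (r < 0) {
            object_unref(OBJECT(pr_mgr->ioc));
            pr_mgr->ioc = NULL;
            return -EINVAL;
        }

        return 0;
    }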
Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper
On Tue, Sep 19, 2017 at 02:57:00PM +0200, Paolo Bonzini wrote:
> On 19/09/2017 14:53, Daniel P. Berrange wrote:
>>> +    /* Try to reconnect while sending the CDB.  */
>>> +    for (attempts = 0; attempts < PR_MAX_RECONNECT_ATTEMPTS; attempts++) {
>>
>> I'm curious why you need to loop here. The helper daemon should be running
>> already, as you're not spawning it on demand IIUC. So it should either
>> succeed first time, or fail every time.
>
> You're focusing on the use case where the helper daemon is spawned per-VM
> by the system libvirtd, which I agree is the most important one.
> However, the other use case is the one with a global daemon, access to
> which is controlled via Unix permissions. This is not SELinux-friendly,
> but it is nicer for testing and it is also the only possibility for user
> libvirtd.
>
> In that case, upgrading QEMU on the host could result in a "systemctl
> restart qemu-pr-helper.service" (or, hopefully unlikely, a crash could
> result in systemd respawning the daemon). Reconnect is useful in that
> case.

If using systemd socket activation and you restart the daemon, the listening
socket should be preserved, so you shouldn't need to reconnect - the client
should get queued until it has started again (likewise on crash).

Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org       -o-            https://fstop138.berrange.com   :|
|: https://entangle-photo.org  -o-    https://www.instagram.com/dberrange   :|
Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper
On Tue, Sep 19, 2017 at 03:23:09PM +0200, Paolo Bonzini wrote:
> On 19/09/2017 15:12, Daniel P. Berrange wrote:
>> On Tue, Sep 19, 2017 at 02:57:00PM +0200, Paolo Bonzini wrote:
>>> On 19/09/2017 14:53, Daniel P. Berrange wrote:
>>>>> +    /* Try to reconnect while sending the CDB.  */
>>>>> +    for (attempts = 0; attempts < PR_MAX_RECONNECT_ATTEMPTS;
>>>>> attempts++) {
>>>>
>>>> I'm curious why you need to loop here. The helper daemon should be running
>>>> already, as you're not spawning it on demand IIUC. So it should either
>>>> succeed first time, or fail every time.
>>>
>>> You're focusing on the use case where the helper daemon is spawned per-VM
>>> by the system libvirtd, which I agree is the most important one.
>>> However, the other use case is the one with a global daemon, access to
>>> which is controlled via Unix permissions. This is not SELinux-friendly,
>>> but it is nicer for testing and it is also the only possibility for user
>>> libvirtd.
>>>
>>> In that case, upgrading QEMU on the host could result in a "systemctl
>>> restart qemu-pr-helper.service" (or, hopefully unlikely, a crash could
>>> result in systemd respawning the daemon). Reconnect is useful in that
>>> case.
>>
>> If using systemd socket activation and you restart the daemon, the listening
>> socket should be preserved, so you shouldn't need to reconnect - the client
>> should get queued until it has started again (likewise on crash).
>
> Oh, that's cool. I didn't know that. However, systemd socket
> activation is optional, and it's only a handful of lines so I think it's
> a bit nicer behavior (chardevs for example have options to reconnect).

The downside is that if someone forgets to start the daemon, or enable
the socket, QEMU will spin for 5 seconds trying to reconnect, instead
of reporting an error immediately.

Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org       -o-            https://fstop138.berrange.com   :|
|: https://entangle-photo.org  -o-    https://www.instagram.com/dberrange   :|
Re: [Qemu-block] [PATCH v2 3/7] block/nbd-client: refactor reading reply
On 19/09/2017 13:03, Vladimir Sementsov-Ogievskiy wrote:
>>>
>> I disagree that it is easier to extend it in the future. If some
>> commands in the future need a different "how do we read it" (e.g. for
>> structured reply), nbd_read_reply_entry may not have all the information
>> it needs anymore.
>
> how is it possible? all information is in requests[i].

If you are okay with violating information hiding, then it is.

>> In fact, you're pushing knowledge of the commands from nbd_co_request to
>> nbd_read_reply_entry:
>>
>> +        if (s->reply.error == 0 &&
>> +            s->requests[i].request->type == NBD_CMD_READ)
>>
>> and knowledge of NBD_CMD_READ simply doesn't belong there. So there is
>> an example of that right now, it is not some arbitrary bad thing that
>> could happen in the future.
>
> I can't agree that my point of view is wrong, it's just different.
> You want commands, reading from socket, each knows what it should read.
> I want the receiver - one sequential reader, that can read all kinds of
> replies and just wake requesting coroutines when all reading is done.
> For me it looks simpler than switching.

It may look simpler, but I think it goes against the principle of
coroutines, which is to have related code in the same function. Here
you have the command function that takes care of sending the request
payload but not receiving the reply payload (except that for commands
that aren't as simple as read or write, it will have to _process_ the
reply payload). What to do with the reply payload is in another
function that has to peek at the command in order to find out what to
do. It doesn't seem like a better design, honestly.

Paolo
Re: [Qemu-block] [Qemu-devel] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper
On Tue, Sep 19, 2017 at 12:24:34PM +0200, Paolo Bonzini wrote:
> This adds a concrete subclass of pr-manager that talks to qemu-pr-helper.
>
> Signed-off-by: Paolo Bonzini
> ---
> v1->v2: fixed string property double-free
>         fixed/cleaned up error handling
>         handle buffer underrun
>
>  scsi/Makefile.objs       |   2 +-
>  scsi/pr-manager-helper.c | 302 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 303 insertions(+), 1 deletion(-)
>  create mode 100644 scsi/pr-manager-helper.c
>
> +static int pr_manager_helper_run(PRManager *p,
> +                                 int fd, struct sg_io_hdr *io_hdr)
> +{
> +    PRManagerHelper *pr_mgr = PR_MANAGER_HELPER(p);
> +
> +    uint32_t len;
> +    PRHelperResponse resp;
> +    int ret;
> +    int expected_dir;
> +    int attempts;
> +    uint8_t cdb[PR_HELPER_CDB_SIZE] = { 0 };
> +
> +    if (!io_hdr->cmd_len || io_hdr->cmd_len > PR_HELPER_CDB_SIZE) {
> +        return -EINVAL;
> +    }
> +
> +    memcpy(cdb, io_hdr->cmdp, io_hdr->cmd_len);
> +    assert(cdb[0] == PERSISTENT_RESERVE_OUT || cdb[0] == PERSISTENT_RESERVE_IN);
> +    expected_dir =
> +        (cdb[0] == PERSISTENT_RESERVE_OUT ? SG_DXFER_TO_DEV : SG_DXFER_FROM_DEV);
> +    if (io_hdr->dxfer_direction != expected_dir) {
> +        return -EINVAL;
> +    }
> +
> +    len = scsi_cdb_xfer(cdb);
> +    if (io_hdr->dxfer_len < len || len > PR_HELPER_DATA_SIZE) {
> +        return -EINVAL;
> +    }
> +
> +    qemu_mutex_lock(&pr_mgr->lock);
> +
> +    /* Try to reconnect while sending the CDB.  */
> +    for (attempts = 0; attempts < PR_MAX_RECONNECT_ATTEMPTS; attempts++) {

I'm curious why you need to loop here. The helper daemon should be running
already, as you're not spawning it on demand IIUC. So it should either
succeed first time, or fail every time.

> +        if (!pr_mgr->ioc) {
> +            ret = pr_manager_helper_initialize(pr_mgr, NULL);
> +            if (ret < 0) {
> +                qemu_mutex_unlock(&pr_mgr->lock);
> +                g_usleep(G_USEC_PER_SEC);
> +                qemu_mutex_lock(&pr_mgr->lock);
> +                continue;
> +            }
> +        }
> +
> +        ret = pr_manager_helper_write(pr_mgr, fd, cdb, ARRAY_SIZE(cdb), NULL);
> +        if (ret >= 0) {
> +            break;
> +        }
> +    }
> +    if (ret < 0) {
> +        goto out;
> +    }

Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org       -o-            https://fstop138.berrange.com   :|
|: https://entangle-photo.org  -o-    https://www.instagram.com/dberrange   :|
Re: [Qemu-block] [PATCH v8 00/20] make dirty-bitmap byte-based
Am 18.09.2017 um 20:57 hat Eric Blake geschrieben: > There are patches floating around to add NBD_CMD_BLOCK_STATUS, > but NBD wants to report status on byte granularity (even if the > reporting will probably be naturally aligned to sectors or even > much higher levels). I've therefore started the task of > converting our block status code to report at a byte granularity > rather than sectors. > > Now that 2.11 is open, I'm rebasing/reposting the remaining patches. > > The overall conversion currently looks like: > part 1: bdrv_is_allocated (merged in 2.10, commit 51b0a488) > part 2: dirty-bitmap (this series, v7 was here [1]) > part 3: bdrv_get_block_status (v4 is posted [2] and is mostly reviewed) > part 4: .bdrv_co_block_status (v3 is posted [3], but needs review) I agree with Fam's comment on patch 5 and had a question on patch 18. Everything else is Reviewed-by: Kevin Wolf
Re: [Qemu-block] [PATCH v8 18/20] qcow2: Switch store_bitmap_data() to byte-based iteration
Am 18.09.2017 um 20:58 hat Eric Blake geschrieben: > Now that we have adjusted the majority of the calls this function > makes to be byte-based, it is easier to read the code if it makes > passes over the image using bytes rather than sectors. > > Signed-off-by: Eric Blake > Reviewed-by: John Snow > Reviewed-by: Vladimir Sementsov-Ogievskiy > > --- > v7: rebase to earlier change, make rounding of offset obvious (no semantic > change, so R-b kept) [Kevin] > v5-v6: no change > v4: new patch > --- > block/qcow2-bitmap.c | 26 +++--- > 1 file changed, 11 insertions(+), 15 deletions(-) > > diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c > index 692ce0de88..302fffd6e1 100644 > --- a/block/qcow2-bitmap.c > +++ b/block/qcow2-bitmap.c > @@ -1072,10 +1072,9 @@ static uint64_t *store_bitmap_data(BlockDriverState > *bs, > { > int ret; > BDRVQcow2State *s = bs->opaque; > -int64_t sector; > -uint64_t limit, sbc; > +int64_t offset; > +uint64_t limit; > uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap); > -uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE); > const char *bm_name = bdrv_dirty_bitmap_name(bitmap); > uint8_t *buf = NULL; > BdrvDirtyBitmapIter *dbi; > @@ -1100,18 +1099,17 @@ static uint64_t *store_bitmap_data(BlockDriverState > *bs, > dbi = bdrv_dirty_iter_new(bitmap); > buf = g_malloc(s->cluster_size); > limit = bytes_covered_by_bitmap_cluster(s, bitmap); > -sbc = limit >> BDRV_SECTOR_BITS; > assert(DIV_ROUND_UP(bm_size, limit) == tb_size); > > -while ((sector = bdrv_dirty_iter_next(dbi) >> BDRV_SECTOR_BITS) >= 0) { > -uint64_t cluster = sector / sbc; > +while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) { > +uint64_t cluster = offset / limit; > uint64_t end, write_size; > int64_t off; > > -sector = cluster * sbc; > -end = MIN(bm_sectors, sector + sbc); > -write_size = bdrv_dirty_bitmap_serialization_size(bitmap, > -sector * BDRV_SECTOR_SIZE, (end - sector) * BDRV_SECTOR_SIZE); > +offset = QEMU_ALIGN_DOWN(offset, limit); With the question that I asked in v6, apart from changing the spelling to be more explicit, I actually hoped that you would explain why aligning down is the right thing to do. It looks plausible to me that we can do this in correct code because we don't support granularities < 512 bytes (a qemu restriction that is written down as a note in the qcow2 spec). The part that I'm missing yet is why we need to do it. The bitmap granularity is also the granularity of bdrv_dirty_iter_next(), so isn't offset already aligned and we could even assert that instead of aligning down? (As long we enforce our restriction, which we seem to do in bitmap_list_load().) > +end = MIN(bm_size, offset + limit); > +write_size = bdrv_dirty_bitmap_serialization_size(bitmap, offset, > + end - offset); > assert(write_size <= s->cluster_size); > > off = qcow2_alloc_clusters(bs, s->cluster_size); Kevin
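For what it's worth, the arithmetic that makes the align-down question
interesting can be shown stand-alone: one serialized bitmap cluster covers
cluster_size * 8 * granularity bytes of image, which with 64KB clusters and
64KB granularity is 32GB, far coarser than the granularity-aligned offsets
the iterator returns. A small illustration (the numbers and the local
ALIGN_DOWN macro are mine; QEMU's real macro is QEMU_ALIGN_DOWN):

    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in for QEMU's QEMU_ALIGN_DOWN(), for illustration. */
    #define ALIGN_DOWN(n, m) ((n) / (m) * (m))

    int main(void)
    {
        /* One bitmap cluster: 64KB of bitmap = 512Ki bits; at 64KB
         * granularity each bit covers 64KB, so limit = 32GB. */
        uint64_t limit  = 65536ULL * 8 * 65536;
        uint64_t offset = (1ULL << 30) + 65536;  /* first dirty bit:
                                                  * granularity-aligned,
                                                  * not limit-aligned */

        printf("bitmap cluster %llu, serialize from offset %llu\n",
               (unsigned long long)(offset / limit),
               (unsigned long long)ALIGN_DOWN(offset, limit));
        return 0;
    }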
Re: [Qemu-block] [PATCH v2 6/7] block/nbd-client: early fail nbd_read_reply_entry if s->quit is set
19.09.2017 13:03, Paolo Bonzini wrote:
> On 19/09/2017 11:43, Vladimir Sementsov-Ogievskiy wrote:
>>> I'm trying to look forward to structured reads, where we will have to
>>> deal with more than one server message in reply to a single client
>>> request. For read, we just piece together portions of the qiov until
>>> the server has sent us all the pieces. But for block query, I really do
>>> think we'll want to defer to specialized handlers for doing further
>>> reads (the amount of data to be read from the server is not known by the
>>> client a priori, so it is a two-part read, one to get the length, and
>>> another to read that remaining length); if we need some sort of callback
>>> function rather than cramming all the logic here in the main read loop,
>>
>> by patch 3 I do not mean in any way that I want to have all reading
>> stuff in one function, of course it should be refactored for
>> block-status addition. I just want to have all reading stuff in one
>> coroutine. Reading from ioc is sequential, so it's natural to do it
>> sequentially in one coroutine, without switches.
>
> But why is that a goal? See it as a state machine that goes between the
> "waiting for header" and "waiting for payload" states. Reading header
> corresponds to read_reply_co waiting on the socket (and reading when
> it's ready); reading payload corresponds to the request coroutine
> waiting on the socket and reading when it's ready.
>
> Paolo

I just dislike public access to ioc for commands and extra coroutine
switching and synchronization.

--
Best regards,
Vladimir
Re: [Qemu-block] [PATCH v2 3/7] block/nbd-client: refactor reading reply
19.09.2017 13:01, Paolo Bonzini wrote:
> On 19/09/2017 11:25, Vladimir Sementsov-Ogievskiy wrote:
>> 18.09.2017 18:43, Paolo Bonzini wrote:
>>> On 18/09/2017 15:59, Vladimir Sementsov-Ogievskiy wrote:
>>>> Read the whole reply in one place - in nbd_read_reply_entry.
>>>>
>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy
>>>> ---
>>>>  block/nbd-client.h |  1 +
>>>>  block/nbd-client.c | 42 --
>>>>  2 files changed, 25 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/block/nbd-client.h b/block/nbd-client.h
>>>> index b1900e6a6d..3f97d31013 100644
>>>> --- a/block/nbd-client.h
>>>> +++ b/block/nbd-client.h
>>>> @@ -21,6 +21,7 @@ typedef struct {
>>>>      Coroutine *coroutine;
>>>>      bool receiving;         /* waiting for read_reply_co? */
>>>>      NBDRequest *request;
>>>> +    QEMUIOVector *qiov;
>>>>  } NBDClientRequest;
>>>>
>>>>  typedef struct NBDClientSession {
>>>> diff --git a/block/nbd-client.c b/block/nbd-client.c
>>>> index 5f241ecc22..f7b642f079 100644
>>>> --- a/block/nbd-client.c
>>>> +++ b/block/nbd-client.c
>>>> @@ -98,6 +98,21 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>>>>              break;
>>>>          }
>>>>
>>>> +        if (s->reply.error == 0 &&
>>>> +            s->requests[i].request->type == NBD_CMD_READ)
>>>> +        {
>>>> +            QEMUIOVector *qiov = s->requests[i].qiov;
>>>> +            assert(qiov != NULL);
>>>> +            assert(s->requests[i].request->len ==
>>>> +                   iov_size(qiov->iov, qiov->niov));
>>>> +            if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
>>>> +                                      NULL) < 0)
>>>> +            {
>>>> +                s->reply.error = EIO;
>>>> +                break;
>>>> +            }
>>>> +        }
>>> I am not sure this is an improvement. In principle you could have
>>> commands that read replies a bit at a time without using a QEMUIOVector.
>>>
>>> Paolo
>> Hm, what do you mean? In this patch I don't change "how do we read it",
>> I only move the reading code to one coroutine, to make it simple to
>> extend it in future (block status, etc).
> I disagree that it is easier to extend it in the future. If some
> commands in the future need a different "how do we read it" (e.g. for
> structured reply), nbd_read_reply_entry may not have all the information
> it needs anymore.

how is it possible? all information is in requests[i].

> In fact, you're pushing knowledge of the commands from nbd_co_request to
> nbd_read_reply_entry:
>
> +        if (s->reply.error == 0 &&
> +            s->requests[i].request->type == NBD_CMD_READ)
>
> and knowledge of NBD_CMD_READ simply doesn't belong there. So there is
> an example of that right now, it is not some arbitrary bad thing that
> could happen in the future.
>
> Paolo

I can't agree that my point of view is wrong, it's just different.
You want commands, reading from socket, each knows what it should read.
I want the receiver - one sequential reader, that can read all kinds of
replies and just wake requesting coroutines when all reading is done.
For me it looks simpler than switching.

We can add reading callbacks for commands which have some payload, like
s->requests[i].read_payload(ioc, request) or maybe
s->requests[i].request->read_payload(...), and call them from
reply_entry instead of if (s->... == NBD_CMD_READ).

How do you see the refactoring for introducing structured reply?
nbd_read_reply_entry has all the information it needs
(s->requests[i].request) to handle the whole reply. It's an
improvement, because it leads to separating the reply_entry coroutine -
it no longer needs any synchronization (yield/wake) with request
coroutines.

>>>>          /* We're woken up again by the request itself.  Note that there
>>>>           * is no race between yielding and reentering read_reply_co.  This
>>>>           * is because:
>>>> @@ -118,6 +133,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>>>>      s->read_reply_co = NULL;
>>>>  }
>>>>
>>>> +/* qiov should be provided for READ and WRITE */
>>>>  static int nbd_co_send_request(BlockDriverState *bs,
>>>>                                 NBDRequest *request,
>>>>                                 QEMUIOVector *qiov)
>>>> @@ -145,6 +161,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
>>>>      request->handle = INDEX_TO_HANDLE(s, i);
>>>>      s->requests[i].request = request;
>>>> +    s->requests[i].qiov = qiov;
>>>>
>>>>      if (s->quit) {
>>>>          rc = -EIO;
>>>> @@ -155,7 +172,8 @@ static int nbd_co_send_request(BlockDriverState *bs,
>>>>          goto err;
>>>>      }
>>>>
>>>> -    if (qiov) {
>>>> +    if (s->requests[i].request->type == NBD_CMD_WRITE) {
>>>> +        assert(qiov);
>>>>          qio_channel_set_cork(s->ioc, true);
>>>>          rc = nbd_send_request(s->ioc, request);
>>>>          if (rc >= 0 && !s->quit) {
>>>> @@ -181,9 +199,7 @@ err:
>>>>      return rc;
>>>>  }
>>>>
>>>> -static int nbd_co_receive_reply(NBDClientSession *s,
>>>> -                                NBDRequest *request,
>>>> -                                QEMUIOVector *qiov)
>>>> +static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest *request)
>>>>  {
>>>>      int ret;
>>>>      int i = HANDLE_TO_INDEX(s, request->handle);
>>>> @@ -197,14 +213,6 @@ static int nbd_
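For reference, the callback shape Vladimir sketches in prose could look like
this (entirely hypothetical names and layout, not code from the series):

    /* Hypothetical: each in-flight request carries a payload reader, so
     * nbd_read_reply_entry() stays command-agnostic. */
    typedef struct NBDClientRequest NBDClientRequest;
    typedef int (*NBDReadPayloadFn)(QIOChannel *ioc, NBDClientRequest *req);

    struct NBDClientRequest {
        Coroutine *coroutine;
        bool receiving;                 /* waiting for read_reply_co? */
        NBDRequest *request;
        QEMUIOVector *qiov;
        NBDReadPayloadFn read_payload;  /* NULL if the reply has no payload */
    };

    /* NBD_CMD_READ would install something like: */
    static int nbd_read_payload_qiov(QIOChannel *ioc, NBDClientRequest *req)
    {
        return qio_channel_readv_all(ioc, req->qiov->iov, req->qiov->niov,
                                     NULL);
    }

The receive loop would then call s->requests[i].read_payload(s->ioc,
&s->requests[i]) when it is non-NULL, instead of testing for NBD_CMD_READ.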
Re: [Qemu-block] [Qemu-devel] [PATCH] nvme: fix cmbuf leak on exit
Hi Stefan,

On 09/19/2017 06:48 AM, Stefan Hajnoczi wrote:
> Cc: Keith Busch
> Signed-off-by: Stefan Hajnoczi
> ---
>  hw/block/nvme.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 9aa32692a3..513ec7065d 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1057,6 +1057,7 @@ static void nvme_exit(PCIDevice *pci_dev)
>      g_free(n->namespaces);
>      g_free(n->cq);
>      g_free(n->sq);
> +    g_free(n->cmbuf);

I'd move it 1 line below, inside the existing conditional:

    if (n->cmbsz) {
        g_free(n->cmbuf);
        memory_region_unref(&n->ctrl_mem);
    }

Anyway g_free() handles NULL, so:

Reviewed-by: Philippe Mathieu-Daudé
[Qemu-block] [PATCH 2/4] scsi: build qemu-pr-helper
Introduce a privileged helper to run persistent reservation commands. This lets virtual machines send persistent reservations without using CAP_SYS_RAWIO or out-of-tree patches. The helper uses Unix permissions and SCM_RIGHTS to restrict access to processes that can access its socket and prove that they have an open file descriptor for a raw SCSI device. The next patch will also correct the usage of persistent reservations with multipath devices. It would also be possible to support for Linux's IOC_PR_* ioctls in the future, to support NVMe devices. For now, however, only SCSI is supported. Signed-off-by: Paolo Bonzini --- v1->v2: do not install man page on !Linux systems fix typos in documentation clarify initialization protocol clarify concurrency added BSD license to protocol header added payload size to reply, to handle residual (buffer underrun) fixed closing of file descriptors on error block PERSISTENT RESERVE OUT for read-only file descriptors do not use g_assert use EXIT_SUCCESS/EXIT_FAILURE consistently removed all CONFIG_MPATH usage from this patch Makefile | 2 + configure | 2 +- docs/interop/pr-helper.rst | 83 ++ docs/pr-manager.rst| 33 +++ scsi/pr-helper.h | 41 +++ scsi/qemu-pr-helper.c | 698 + 6 files changed, 866 insertions(+), 1 deletions(-) create mode 100644 docs/interop/pr-helper.rst create mode 100644 scsi/pr-helper.h create mode 100644 scsi/qemu-pr-helper.c diff --git a/Makefile b/Makefile index b53fc69a60..51503ee8ad 100644 --- a/Makefile +++ b/Makefile @@ -373,6 +376,8 @@ qemu-bridge-helper$(EXESUF): qemu-bridge-helper.o $(COMMON_LDADDS) fsdev/virtfs-proxy-helper$(EXESUF): fsdev/virtfs-proxy-helper.o fsdev/9p-marshal.o fsdev/9p-iov-marshal.o $(COMMON_LDADDS) fsdev/virtfs-proxy-helper$(EXESUF): LIBS += -lcap +scsi/qemu-pr-helper$(EXESUF): scsi/qemu-pr-helper.o scsi/utils.o $(crypto-obj-y) $(io-obj-y) $(qom-obj-y) $(COMMON_LDADDS) + qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx $(SRC_PATH)/scripts/hxtool $(call quiet-command,sh $(SRC_PATH)/scripts/hxtool -h < $< > $@,"GEN","$@") diff --git a/configure b/configure index 94db2d103e..fba27cf0fc 100755 --- a/configure +++ b/configure @@ -6506,7 +6506,7 @@ fi # build tree in object directory in case the source is not in the current directory DIRS="tests tests/tcg tests/tcg/cris tests/tcg/lm32 tests/libqos tests/qapi-schema tests/tcg/xtensa tests/qemu-iotests" -DIRS="$DIRS docs docs/interop fsdev" +DIRS="$DIRS docs docs/interop fsdev scsi" DIRS="$DIRS pc-bios/optionrom pc-bios/spapr-rtas pc-bios/s390-ccw" DIRS="$DIRS roms/seabios roms/vgabios" DIRS="$DIRS qapi-generated" diff --git a/docs/interop/pr-helper.rst b/docs/interop/pr-helper.rst new file mode 100644 index 00..9f76d5bcf9 --- /dev/null +++ b/docs/interop/pr-helper.rst @@ -0,0 +1,83 @@ +.. + +== +Persistent reservation helper protocol +== + +QEMU's SCSI passthrough devices, ``scsi-block`` and ``scsi-generic``, +can delegate implementation of persistent reservations to an external +(and typically privileged) program. Persistent Reservations allow +restricting access to block devices to specific initiators in a shared +storage setup. + +For a more detailed reference please refer the the SCSI Primary +Commands standard, specifically the section on Reservations and the +"PERSISTENT RESERVE IN" and "PERSISTENT RESERVE OUT" commands. + +This document describes the socket protocol used between QEMU's +``pr-manager-helper`` object and the external program. + +.. contents:: + +Connection and initialization +- + +All data transmitted on the socket is big-endian. 
+ +After connecting to the helper program's socket, the helper starts a simple +feature negotiation process by writing four bytes corresponding to +the features it exposes (``supported_features``). QEMU reads it, +then writes four bytes corresponding to the desired features of the +helper program (``requested_features``). + +If a bit is 1 in ``requested_features`` and 0 in ``supported_features``, +the corresponding feature is not supported by the helper and the connection +is closed. On the other hand, it is acceptable for a bit to be 0 in +``requested_features`` and 1 in ``supported_features``; in this case, +the helper will not enable the feature. + +Right now no feature is defined, so the two parties always write four +zero bytes. + +Command format +-- + +It is invalid to send multiple commands concurrently on the same +socket. It is however possible to connect multiple sockets to the +helper and send multiple commands to the helper for one or more +file descriptors. + +A command consists of a request and a response. A request consists +of a 16-byte SCSI CDB. A file descripto
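As an aside, the negotiation described in the quoted document is small
enough to sketch from the client side; the following illustration uses
plain POSIX sockets and an invented helper name (QEMU itself goes through
QIOChannel, and short reads are glossed over for brevity):

    #include <arpa/inet.h>   /* htonl/ntohl: all wire data is big-endian */
    #include <stdint.h>
    #include <unistd.h>

    /* Hypothetical client-side negotiation: read the helper's
     * supported_features, then write our requested_features.  Returns 0
     * on success, -1 on I/O error or if a requested feature is missing
     * (in which case the helper closes the connection anyway). */
    static int pr_helper_negotiate(int fd, uint32_t requested)
    {
        uint32_t supported;

        if (read(fd, &supported, sizeof(supported)) != sizeof(supported)) {
            return -1;
        }
        supported = ntohl(supported);
        if (requested & ~supported) {
            return -1;
        }
        requested = htonl(requested);
        if (write(fd, &requested, sizeof(requested)) != sizeof(requested)) {
            return -1;
        }
        return 0;  /* today both sides simply exchange four zero bytes */
    }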
[Qemu-block] [PATCH 1/4] scsi, file-posix: add support for persistent reservation management
It is a common requirement for virtual machine to send persistent reservations, but this currently requires either running QEMU with CAP_SYS_RAWIO, or using out-of-tree patches that let an unprivileged QEMU bypass Linux's filter on SG_IO commands. As an alternative mechanism, the next patches will introduce a privileged helper to run persistent reservation commands without expanding QEMU's attack surface unnecessarily. The helper is invoked through a "pr-manager" QOM object, to which file-posix.c passes SG_IO requests for PERSISTENT RESERVE OUT and PERSISTENT RESERVE IN commands. For example: $ qemu-system-x86_64 -device virtio-scsi \ -object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock -drive if=none,id=hd,driver=raw,file.filename=/dev/sdb,file.pr-manager=helper0 -device scsi-block,drive=hd or: $ qemu-system-x86_64 -device virtio-scsi \ -object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock -blockdev node-name=hd,driver=raw,file.driver=host_device,file.filename=/dev/sdb,file.pr-manager=helper0 -device scsi-block,drive=hd Multiple pr-manager implementations are conceivable and possible, though only one is implemented right now. For example, a pr-manager could: - talk directly to the multipath daemon from a privileged QEMU (i.e. QEMU links to libmpathpersist); this makes reservation work properly with multipath, but still requires CAP_SYS_RAWIO - use the Linux IOC_PR_* ioctls (they require CAP_SYS_ADMIN though) - more interestingly, implement reservations directly in QEMU through file system locks or a shared database (e.g. sqlite) Signed-off-by: Paolo Bonzini --- Makefile.objs | 1 + block/file-posix.c| 30 + docs/pr-manager.rst | 51 ++ include/scsi/pr-manager.h | 56 qapi/block-core.json | 4 ++ scsi/Makefile.objs| 2 + scsi/pr-manager.c | 109 ++ vl.c | 3 +- 8 files changed, 255 insertions(+), 1 deletion(-) create mode 100644 docs/pr-manager.rst create mode 100644 include/scsi/pr-manager.h create mode 100644 scsi/pr-manager.c diff --git a/Makefile.objs b/Makefile.objs index 0caa8a5cf8..12abaa6191 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -170,6 +170,7 @@ trace-events-subdirs += qapi trace-events-subdirs += accel/tcg trace-events-subdirs += accel/kvm trace-events-subdirs += nbd +trace-events-subdirs += scsi trace-events-files = $(SRC_PATH)/trace-events $(trace-events-subdirs:%=$(SRC_PATH)/%/trace-events) diff --git a/block/file-posix.c b/block/file-posix.c index 6acbd56238..ab12a2b591 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -33,6 +33,9 @@ #include "block/raw-aio.h" #include "qapi/qmp/qstring.h" +#include "scsi/pr-manager.h" +#include "scsi/constants.h" + #if defined(__APPLE__) && (__MACH__) #include #include @@ -155,6 +158,8 @@ typedef struct BDRVRawState { bool page_cache_inconsistent:1; bool has_fallocate; bool needs_alignment; + +PRManager *pr_mgr; } BDRVRawState; typedef struct BDRVRawReopenState { @@ -402,6 +407,11 @@ static QemuOptsList raw_runtime_opts = { .type = QEMU_OPT_STRING, .help = "file locking mode (on/off/auto, default: auto)", }, +{ +.name = "pr-manager", +.type = QEMU_OPT_STRING, +.help = "id of persistent reservation manager object (default: none)", +}, { /* end of list */ } }, }; @@ -413,6 +423,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options, QemuOpts *opts; Error *local_err = NULL; const char *filename = NULL; +const char *str; BlockdevAioOptions aio, aio_default; int fd, ret; struct stat st; @@ -476,6 +487,16 @@ static int raw_open_common(BlockDriverState *bs, QDict *options, abort(); } 
+str = qemu_opt_get(opts, "pr-manager"); +if (str) { +s->pr_mgr = pr_manager_lookup(str, &local_err); +if (local_err) { +error_propagate(errp, local_err); +ret = -EINVAL; +goto fail; +} +} + s->open_flags = open_flags; raw_parse_flags(bdrv_flags, &s->open_flags); @@ -2597,6 +2618,15 @@ static BlockAIOCB *hdev_aio_ioctl(BlockDriverState *bs, if (fd_open(bs) < 0) return NULL; +if (req == SG_IO && s->pr_mgr) { +struct sg_io_hdr *io_hdr = buf; +if (io_hdr->cmdp[0] == PERSISTENT_RESERVE_OUT || +io_hdr->cmdp[0] == PERSISTENT_RESERVE_IN) { +return pr_manager_execute(s->pr_mgr, bdrv_get_aio_context(bs), + s->fd, io_hdr, cb, opaque); +} +} + acb = g_new(RawPosixAIOData, 1); acb->bs = bs; acb->aio_type = QEMU_AIO_IOCTL; diff --git a/docs/pr-manager.rst b/docs/pr-manager.rst new
[Qemu-block] [PATCH 4/4] scsi: add persistent reservation manager using qemu-pr-helper
This adds a concrete subclass of pr-manager that talks to qemu-pr-helper. Signed-off-by: Paolo Bonzini --- v1->v2: fixed string property double-free fixed/cleaned up error handling handle buffer underrun scsi/Makefile.objs | 2 +- scsi/pr-manager-helper.c | 302 +++ 2 files changed, 303 insertions(+), 1 deletion(-) create mode 100644 scsi/pr-manager-helper.c diff --git a/scsi/Makefile.objs b/scsi/Makefile.objs index 5496d2ae6a..4d25e476cf 100644 --- a/scsi/Makefile.objs +++ b/scsi/Makefile.objs @@ -1,3 +1,3 @@ block-obj-y += utils.o -block-obj-$(CONFIG_LINUX) += pr-manager.o +block-obj-$(CONFIG_LINUX) += pr-manager.o pr-manager-helper.o diff --git a/scsi/pr-manager-helper.c b/scsi/pr-manager-helper.c new file mode 100644 index 00..063fd35dee --- /dev/null +++ b/scsi/pr-manager-helper.c @@ -0,0 +1,302 @@ +/* + * Persistent reservation manager that talks to qemu-pr-helper + * + * Copyright (c) 2017 Red Hat, Inc. + * + * Author: Paolo Bonzini + * + * This code is licensed under the LGPL v2.1 or later. + * + */ + +#include "qemu/osdep.h" +#include "qapi/error.h" +#include "scsi/constants.h" +#include "scsi/pr-manager.h" +#include "scsi/utils.h" +#include "io/channel.h" +#include "io/channel-socket.h" +#include "pr-helper.h" + +#include + +#define PR_MAX_RECONNECT_ATTEMPTS 5 + +#define TYPE_PR_MANAGER_HELPER "pr-manager-helper" + +#define PR_MANAGER_HELPER(obj) \ + OBJECT_CHECK(PRManagerHelper, (obj), \ + TYPE_PR_MANAGER_HELPER) + +typedef struct PRManagerHelper { +/* */ +PRManager parent; + +char *path; + +QemuMutex lock; +QIOChannel *ioc; +} PRManagerHelper; + +/* Called with lock held. */ +static int pr_manager_helper_read(PRManagerHelper *pr_mgr, + void *buf, int sz, Error **errp) +{ +ssize_t r = qio_channel_read_all(pr_mgr->ioc, buf, sz, errp); + +if (r < 0) { +object_unref(OBJECT(pr_mgr->ioc)); +pr_mgr->ioc = NULL; +return -EINVAL; +} + +return 0; +} + +/* Called with lock held. */ +static int pr_manager_helper_write(PRManagerHelper *pr_mgr, + int fd, + const void *buf, int sz, Error **errp) +{ +size_t nfds = (fd != -1); +while (sz > 0) { +struct iovec iov; +ssize_t n_written; + +iov.iov_base = (void *)buf; +iov.iov_len = sz; +n_written = qio_channel_writev_full(QIO_CHANNEL(pr_mgr->ioc), &iov, 1, +nfds ? &fd : NULL, nfds, errp); + +if (n_written <= 0) { +assert(n_written != QIO_CHANNEL_ERR_BLOCK); +object_unref(OBJECT(pr_mgr->ioc)); +return n_written < 0 ? -EINVAL : 0; +} + +nfds = 0; +buf += n_written; +sz -= n_written; +} + +return 0; +} + +/* Called with lock held. */ +static int pr_manager_helper_initialize(PRManagerHelper *pr_mgr, +Error **errp) +{ +char *path = g_strdup(pr_mgr->path); +SocketAddress saddr = { +.type = SOCKET_ADDRESS_TYPE_UNIX, +.u.q_unix.path = path +}; +QIOChannelSocket *sioc = qio_channel_socket_new(); +Error *local_err = NULL; + +uint32_t flags; +int r; + +assert(!pr_mgr->ioc); +qio_channel_set_name(QIO_CHANNEL(sioc), "pr-manager-helper"); +qio_channel_socket_connect_sync(sioc, +&saddr, +&local_err); +g_free(path); +if (local_err) { +object_unref(OBJECT(sioc)); +error_propagate(errp, local_err); +return -ENOTCONN; +} + +qio_channel_set_delay(QIO_CHANNEL(sioc), false); +pr_mgr->ioc = QIO_CHANNEL(sioc); + +/* A simple feature negotation protocol, even though there is + * no optional feature right now. 
+ */ +r = pr_manager_helper_read(pr_mgr, &flags, sizeof(flags), errp); +if (r < 0) { +goto out_close; +} + +flags = 0; +r = pr_manager_helper_write(pr_mgr, -1, &flags, sizeof(flags), errp); +if (r < 0) { +goto out_close; +} + +return 0; + +out_close: +object_unref(OBJECT(pr_mgr->ioc)); +pr_mgr->ioc = NULL; +return r; +} + +static int pr_manager_helper_run(PRManager *p, + int fd, struct sg_io_hdr *io_hdr) +{ +PRManagerHelper *pr_mgr = PR_MANAGER_HELPER(p); + +uint32_t len; +PRHelperResponse resp; +int ret; +int expected_dir; +int attempts; +uint8_t cdb[PR_HELPER_CDB_SIZE] = { 0 }; + +if (!io_hdr->cmd_len || io_hdr->cmd_len > PR_HELPER_CDB_SIZE) { +return -EINVAL; +} + +memcpy(cdb, io_hdr->cmdp, io_hdr->cmd_len); +assert(cdb[0] == PERSISTENT_RESERVE_OUT || cdb[0] == PERSISTENT_RESERVE_IN); +expected_dir = +(cdb[0] == PERSISTENT_RESERVE_OUT ? SG_DXFER_TO_DEV : SG_DXFER_FRO
[Qemu-block] [PATCH 3/4] scsi: add multipath support to qemu-pr-helper
Proper support of persistent reservation for multipath devices requires communication with the multipath daemon, so that the reservation is registered and applied when a path comes up. The device mapper utilities provide a library to do so; this patch makes qemu-pr-helper.c detect multipath devices and, when one is found, delegate the operation to libmpathpersist. Signed-off-by: Paolo Bonzini --- v1->v2: moved dm_init/multipath_pr_init to this patch drop CAP_SYS_ADMIN if multipath not compiled in simplify buffer size handling in multipath PERSISTENT RESERVE IN block REGISTER AND MOVE operation for multipath PERSISTENT RESERVE OUT fixed transport id handling in multipath PERSISTENT RESERVE OUT Makefile | 3 + configure | 57 - docs/pr-manager.rst | 27 include/scsi/utils.h | 4 + scsi/qemu-pr-helper.c | 346 +- scsi/utils.c | 10 ++ 6 files changed, 441 insertions(+), 6 deletions(-) diff --git a/Makefile b/Makefile index 51503ee8ad..39ec5112b6 100644 --- a/Makefile +++ b/Makefile @@ -377,6 +377,9 @@ fsdev/virtfs-proxy-helper$(EXESUF): fsdev/virtfs-proxy-helper.o fsdev/9p-marshal fsdev/virtfs-proxy-helper$(EXESUF): LIBS += -lcap scsi/qemu-pr-helper$(EXESUF): scsi/qemu-pr-helper.o scsi/utils.o $(crypto-obj-y) $(io-obj-y) $(qom-obj-y) $(COMMON_LDADDS) +ifdef CONFIG_MPATH +scsi/qemu-pr-helper$(EXESUF): LIBS += -ludev -lmultipath -lmpathpersist +endif qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx $(SRC_PATH)/scripts/hxtool $(call quiet-command,sh $(SRC_PATH)/scripts/hxtool -h < $< > $@,"GEN","$@") diff --git a/configure b/configure index fba27cf0fc..bea343c701 100755 --- a/configure +++ b/configure @@ -290,6 +290,7 @@ netmap="no" sdl="" sdlabi="" virtfs="" +mpath="" vnc="yes" sparse="no" vde="" @@ -936,6 +937,10 @@ for opt do ;; --enable-virtfs) virtfs="yes" ;; + --disable-mpath) mpath="no" + ;; + --enable-mpath) mpath="yes" + ;; --disable-vnc) vnc="no" ;; --enable-vnc) vnc="yes" @@ -1479,6 +1484,7 @@ disabled with --disable-FEATURE, default is enabled if available: vnc-png PNG compression for VNC server cocoa Cocoa UI (Mac OS X only) virtfs VirtFS + mpath Multipath persistent reservation passthrough xen xen backend driver support xen-pci-passthrough brlapi BrlAPI (Braile) @@ -3300,6 +3306,29 @@ else fi ## +# libmpathpersist probe + +if test "$mpath" != "no" ; then + cat > $TMPC < +#include +unsigned mpath_mx_alloc_len = 1024; +int logsink; +int main(void) { +struct udev *udev = udev_new(); +mpath_lib_init(udev); +} +EOF + if compile_prog "" "-ludev -lmultipath -lmpathpersist" ; then +mpathpersist=yes + else +mpathpersist=no + fi +else + mpathpersist=no +fi + +## # libcap probe if test "$cap" != "no" ; then @@ -5034,16 +5063,34 @@ if test "$want_tools" = "yes" ; then fi fi if test "$softmmu" = yes ; then - if test "$virtfs" != no ; then -if test "$cap" = yes && test "$linux" = yes && test "$attr" = yes ; then + if test "$linux" = yes; then +if test "$virtfs" != no && test "$cap" = yes && test "$attr" = yes ; then virtfs=yes tools="$tools fsdev/virtfs-proxy-helper\$(EXESUF)" else if test "$virtfs" = yes; then -error_exit "VirtFS is supported only on Linux and requires libcap devel and libattr devel" +error_exit "VirtFS requires libcap devel and libattr devel" fi virtfs=no fi +if test "$mpath" != no && test "$mpathpersist" = yes ; then + mpath=yes + tools="$tools mpath/qemu-mpath-helper\$(EXESUF)" +else + if test "$mpath" = yes; then +error_exit "Multipath requires libmpathpersist devel" + fi + mpath=no +fi + else +if test "$virtfs" = yes; then + error_exit "VirtFS is supported only on Linux" +fi 
+virtfs=no +if test "$mpath" = yes; then + error_exit "Multipath is supported only on Linux" +fi +mpath=no fi fi @@ -5289,6 +5336,7 @@ echo "Audio drivers $audio_drv_list" echo "Block whitelist (rw) $block_drv_rw_whitelist" echo "Block whitelist (ro) $block_drv_ro_whitelist" echo "VirtFS support$virtfs" +echo "Multipath support $mpath" echo "VNC support $vnc" if test "$vnc" = "yes" ; then echo "VNC SASL support $vnc_sasl" @@ -5732,6 +5780,9 @@ fi if test "$virtfs" = "yes" ; then echo "CONFIG_VIRTFS=y" >> $config_host_mak fi +if test "$mpath" = "yes" ; then + echo "CONFIG_MPATH=y" >> $config_host_mak +fi if test "$vhost_scsi" = "yes" ; then echo "CONFIG_VHOST_SCSI=y" >> $config_host_mak fi diff --git a/docs/pr-manager.rst b/docs/pr-manager.rst index 7107e59fb8..9b1de198b1 100644 --- a/docs/pr-manager.rst +++ b/docs/pr-manager.rst @@ -60,6 +60,7 @@ system ser
[Qemu-block] [PATCH v2 0/4] scsi, block: introduce persistent reservation managers
SCSI persistent Reservations allow restricting access to block devices to specific initiators in a shared storage setup. When implementing clustering of virtual machines, it is a common requirement for virtual machines to send persistent reservation SCSI commands. However, the operating system restricts sending these commands to unprivileged programs because incorrect usage can disrupt regular operation of the storage fabric. With these patches, the scsi-block and scsi-generic SCSI passthrough devices learn to delegate the implementation of persistent reservations to a separate object, the "persistent reservation manager". The persistent reservation manager talks to a separate privileged program, with a very simple protocol based on SCM_RIGHTS. In addition to invoking PERSISTENT RESERVATION OUT and PERSISTENT RESERVATION IN commands, the privileged component can also use libmpathpersist so that persistent reservations are applied to all paths in a multipath setting. Patch 1 defines the abstract QOM class and plugs it into block/file-posix.c. Patch 2 and 3 introduce the privileged helper program, while patch 4 defines the concrete QOM class that talks to it. Paolo v1->v2: removed scsi/ patches which were all reviewed removed man page (requires .texi while I used .rst for docs) qio_channel_read/write_all also went in independently fix installation of qemu-pr-helper man page fixes to documentation added BSD license to protocol header added handling of residual (buffer underrun) fixed closing of file descriptors on error block PERSISTENT RESERVE OUT for read-only file descriptors do not use g_assert use EXIT_SUCCESS/EXIT_FAILURE consistently moved all CONFIG_MPATH usage to the right patch drop CAP_SYS_ADMIN if multipath not compiled in simplify buffer size handling in multipath PERSISTENT RESERVE IN block REGISTER AND MOVE operation for multipath PERSISTENT RESERVE OUT fixed transport id handling in multipath PERSISTENT RESERVE OUT fixed string property double-free in pr-manager-helper fixed/cleaned up error handling in pr-manager-helper Paolo Bonzini (9): scsi, file-posix: add support for persistent reservation management scsi: build qemu-pr-helper scsi: add multipath support to qemu-pr-helper scsi: add persistent reservation manager using qemu-pr-helper Makefile | 14 +- configure | 59 ++- docs/interop/pr-helper.rst | 83 docs/pr-manager.rst| 60 +++ include/scsi/utils.h |4 + scsi/Makefile.objs |2 +- scsi/pr-helper.h | 41 ++ scsi/pr-manager-helper.c | 302 + scsi/qemu-pr-helper.c | 1038 scsi/utils.c | 10 + 10 files changed, 1607 insertions(+), 6 deletions(-) -- 2.13.5
[Qemu-block] qemu-img: Check failed: No space left on device
Hello,

First post here, so maybe I should introduce myself:

- I'm a sysadmin for decades and currently managing 4 oVirt clusters,
  made out of tens of hypervisors, all are CentOS 7.2+ based.
- I'm very happy with this solution we chose, especially because it is
  based on qemu-kvm (open source, reliable, documented).

On one VM, we experienced the following:

- oVirt/vdsm is detecting an issue on the image
- following the hints in https://access.redhat.com/solutions/1173623,
  I managed to detect one error and fix it
- the VM is now running perfectly

On two other VMs, we experienced a similar situation, except the check
stage is showing something like 14000+ errors, and the relevant logs are:

  Repairing refcount block 14 is outside image
  ERROR could not resize image: Invalid argument
  ERROR cluster 425984 refcount=0 reference=1
  ERROR cluster 425985 refcount=0 reference=1
  [... repeating the previous line 7000+ times ...]
  ERROR cluster 457166 refcount=0 reference=1
  Rebuilding refcount structure
  ERROR writing refblock: No space left on device
  qemu-img: Check failed: No space left on device

You surely know that oVirt/RHEV is storing its qcow2 images in dedicated
logical volumes. pvs/vgs/lvs are all showing there is plenty of space
available, so I understand that I don't understand what "No space left
on device" means.

The relevant versions are:

- qemu-img-ev-2.3.0-31.el7_2.10.1.x86_64
- libvirt-daemon-kvm-1.2.17-13.el7_2.4.x86_64
- vdsm-4.17.32-1.el7.noarch
- # qemu-img info /the/relevant/logical/volume/path
    file format: qcow2
    virtual size: 30G (32212254720 bytes)
    disk size: 0
    cluster_size: 65536
    Format specific information:
        compat: 0.10
        refcount bits: 16

Is there a hope I'll be able to repair this image?

--
Nicolas ECARNOT
Re: [Qemu-block] [PATCH v2 6/7] block/nbd-client: early fail nbd_read_reply_entry if s->quit is set
On 19/09/2017 11:43, Vladimir Sementsov-Ogievskiy wrote:
>> I'm trying to look forward to structured reads, where we will have to
>> deal with more than one server message in reply to a single client
>> request. For read, we just piece together portions of the qiov until
>> the server has sent us all the pieces. But for block query, I really do
>> think we'll want to defer to specialized handlers for doing further
>> reads (the amount of data to be read from the server is not known by the
>> client a priori, so it is a two-part read, one to get the length, and
>> another to read that remaining length); if we need some sort of callback
>> function rather than cramming all the logic here in the main read loop,
>
> By patch 3 I do not mean in any way that I want to have all reading
> stuff in one function; of course it should be refactored for the
> block-status addition. I just want to have all reading stuff in one
> coroutine. Reading from ioc is sequential, so it's natural to do it
> sequentially in one coroutine, without switches.

But why is that a goal? See it as a state machine that goes between the
"waiting for header" and "waiting for payload" states. Reading the header
corresponds to read_reply_co waiting on the socket (and reading when it's
ready); reading the payload corresponds to the request coroutine waiting
on the socket and reading when it's ready.

Paolo
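Schematically, the state machine Paolo describes might look like this in illustrative pseudo-C — not actual block/nbd-client.c code; wake_request_coroutine() stands in for the existing handle lookup plus aio_co_wake():

    /* Two states: read_reply_co owns WAIT_HEADER, the request
     * coroutine owns WAIT_PAYLOAD. */
    typedef enum {
        WAIT_HEADER,   /* read_reply_co blocks on the socket for a header */
        WAIT_PAYLOAD,  /* the request coroutine reads its own payload */
    } ReplyState;

    static coroutine_fn void reply_loop(NBDClientSession *s)
    {
        ReplyState state = WAIT_HEADER;

        while (!s->quit) {
            if (state == WAIT_HEADER) {
                /* The header identifies which request the payload is for. */
                if (nbd_receive_reply(s->ioc, &s->reply, NULL) < 0) {
                    return;
                }
                state = WAIT_PAYLOAD;
            } else {
                /* Hand the socket to the request coroutine; it knows how
                 * much payload to expect and reads it itself, then points
                 * us back at the next header. */
                wake_request_coroutine(s, s->reply.handle);
                qemu_coroutine_yield();
                state = WAIT_HEADER;
            }
        }
    }

The point of the model is that neither state needs to know the per-command payload layout: whoever owns the "waiting for payload" state carries that knowledge.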
Re: [Qemu-block] [PATCH v2 3/7] block/nbd-client: refactor reading reply
On 19/09/2017 11:25, Vladimir Sementsov-Ogievskiy wrote:
> 18.09.2017 18:43, Paolo Bonzini wrote:
>> On 18/09/2017 15:59, Vladimir Sementsov-Ogievskiy wrote:
>>> Read the whole reply in one place - in nbd_read_reply_entry.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy
>>> ---
>>>  block/nbd-client.h |  1 +
>>>  block/nbd-client.c | 42 --
>>>  2 files changed, 25 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/block/nbd-client.h b/block/nbd-client.h
>>> index b1900e6a6d..3f97d31013 100644
>>> --- a/block/nbd-client.h
>>> +++ b/block/nbd-client.h
>>> @@ -21,6 +21,7 @@ typedef struct {
>>>      Coroutine *coroutine;
>>>      bool receiving; /* waiting for read_reply_co? */
>>>      NBDRequest *request;
>>> +    QEMUIOVector *qiov;
>>>  } NBDClientRequest;
>>>
>>>  typedef struct NBDClientSession {
>>> diff --git a/block/nbd-client.c b/block/nbd-client.c
>>> index 5f241ecc22..f7b642f079 100644
>>> --- a/block/nbd-client.c
>>> +++ b/block/nbd-client.c
>>> @@ -98,6 +98,21 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>>>              break;
>>>          }
>>>
>>> +        if (s->reply.error == 0 &&
>>> +            s->requests[i].request->type == NBD_CMD_READ)
>>> +        {
>>> +            QEMUIOVector *qiov = s->requests[i].qiov;
>>> +            assert(qiov != NULL);
>>> +            assert(s->requests[i].request->len ==
>>> +                   iov_size(qiov->iov, qiov->niov));
>>> +            if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
>>> +                                      NULL) < 0)
>>> +            {
>>> +                s->reply.error = EIO;
>>> +                break;
>>> +            }
>>> +        }
>>
>> I am not sure this is an improvement. In principle you could have
>> commands that read replies a bit at a time without using a QEMUIOVector.
>>
>> Paolo
>
> Hm, what do you mean? In this patch I don't change "how do we read it",
> I only move the reading code to one coroutine, to make it simple to
> extend it in the future (block status, etc).

I disagree that it is easier to extend it in the future. If some commands
in the future need a different "how do we read it" (e.g. for structured
reply), nbd_read_reply_entry may not have all the information it needs
anymore. In fact, you're pushing knowledge of the commands from
nbd_co_request to nbd_read_reply_entry:

+        if (s->reply.error == 0 &&
+            s->requests[i].request->type == NBD_CMD_READ)

and knowledge of NBD_CMD_READ simply doesn't belong there. So there is an
example of that right now; it is not some arbitrary bad thing that could
happen in the future.

Paolo

> nbd_read_reply_entry has all the information it needs
> (s->requests[i].request) to handle the whole reply. It's an improvement
> because it separates the reply_entry coroutine - it no longer needs any
> synchronization (yield/wake) with request coroutines.
>
>>>          /* We're woken up again by the request itself.  Note that there
>>>           * is no race between yielding and reentering read_reply_co.  This
>>>           * is because:
>>> @@ -118,6 +133,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>>>      s->read_reply_co = NULL;
>>>  }
>>>
>>> +/* qiov should be provided for READ and WRITE */
>>>  static int nbd_co_send_request(BlockDriverState *bs,
>>>                                 NBDRequest *request,
>>>                                 QEMUIOVector *qiov)
>>> @@ -145,6 +161,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
>>>      request->handle = INDEX_TO_HANDLE(s, i);
>>>      s->requests[i].request = request;
>>> +    s->requests[i].qiov = qiov;
>>>
>>>      if (s->quit) {
>>>          rc = -EIO;
>>> @@ -155,7 +172,8 @@ static int nbd_co_send_request(BlockDriverState *bs,
>>>          goto err;
>>>      }
>>>
>>> -    if (qiov) {
>>> +    if (s->requests[i].request->type == NBD_CMD_WRITE) {
>>> +        assert(qiov);
>>>          qio_channel_set_cork(s->ioc, true);
>>>          rc = nbd_send_request(s->ioc, request);
>>>          if (rc >= 0 && !s->quit) {
>>> @@ -181,9 +199,7 @@ err:
>>>      return rc;
>>>  }
>>>
>>> -static int nbd_co_receive_reply(NBDClientSession *s,
>>> -                                NBDRequest *request,
>>> -                                QEMUIOVector *qiov)
>>> +static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest *request)
>>>  {
>>>      int ret;
>>>      int i = HANDLE_TO_INDEX(s, request->handle);
>>> @@ -197,14 +213,6 @@ static int nbd_co_receive_reply(NBDClientSession *s,
>>>      } else {
>>>          assert(s->reply.handle == request->handle);
>>>          ret = -s->reply.error;
>>> -        if (qiov && s->reply.error == 0) {
>>> -            assert(request->len == iov_size(qiov->iov, qiov->niov));
>>> -            if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
>>> -                                      NULL) < 0) {
>>> -
Re: [Qemu-block] [PATCH v2 7/7] block/nbd-client: do not yield from nbd_read_reply_entry
19.09.2017 01:36, Eric Blake:
> On 09/18/2017 08:59 AM, Vladimir Sementsov-Ogievskiy wrote:
>> Refactor nbd client to not yield from nbd_read_reply_entry. It's
>> possible now as all reading is done in nbd_read_reply_entry and all
>> request-related data is stored in per-request s->requests[i]. We need
>> some additional work here with s->requests[i].ret and s->quit to not
>> fail requests on normal disconnect and to not report reading errors on
>> normal disconnect.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy
>> ---
>>  block/nbd-client.c | 29 ++---
>>  1 file changed, 10 insertions(+), 19 deletions(-)
>
> The diffstat may have merit, but I'm wondering:
>
>> diff --git a/block/nbd-client.c b/block/nbd-client.c
>> index f80a4c5564..bf20d5d5e6 100644
>> --- a/block/nbd-client.c
>> +++ b/block/nbd-client.c
>> @@ -79,7 +79,11 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>>      while (!s->quit) {
>>          ret = nbd_receive_reply(s->ioc, &reply, &local_err);
>>          if (ret < 0) {
>> -            error_report_err(local_err);
>> +            if (s->quit) {
>> +                error_free(local_err);
>
> This discards errors on all remaining coroutines after we detect an
> early exit. Are we sure the coroutine that set s->quit will have
> reported an appropriate error, thus explaining why we can discard all
> secondary errors? A comment in the code would be helpful.

I'll think about it.

>> +            } else {
>> +                error_report_err(local_err);
>> +            }
>>          }
>> @@ -210,7 +204,7 @@ static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest *request)
>>      s->requests[i].receiving = true;
>>      qemu_coroutine_yield();
>>      s->requests[i].receiving = false;
>> -    if (!s->ioc || s->quit) {
>> +    if (!s->ioc) {
>>          ret = -EIO;
>
> Why don't we need to check s->quit any more? That was just added
> earlier in the series; why the churn?

It is "to not fail requests on normal disconnect". Because reply_entry
may already be finished. Initializing "+ s->requests[i].ret = -EIO;" is
enough.

>>      } else {
>>          ret = s->requests[i].ret;
>> @@ -218,11 +212,6 @@ static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest *request)
>>      s->requests[i].coroutine = NULL;
>>
>> -    /* Kick the read_reply_co to get the next reply. */
>> -    if (s->read_reply_co) {
>> -        aio_co_wake(s->read_reply_co);
>> -    }
>> -
>>      qemu_co_mutex_lock(&s->send_mutex);
>>      s->in_flight--;
>>      qemu_co_queue_next(&s->free_sema);
>> @@ -364,6 +353,8 @@ void nbd_client_close(BlockDriverState *bs)
>>      nbd_send_request(client->ioc, &request);
>>
>> +    client->quit = true;
>
> Previously, client->quit was only set when detecting a broken server;
> now it is also set for a clean exit. Do we need to change any
> documentation of the field?

It has documentation?

-- 
Best regards,
Vladimir
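For reference, the comment Eric asks for could read roughly like this — a sketch against the hunk quoted above, not merged code:

    ret = nbd_receive_reply(s->ioc, &reply, &local_err);
    if (ret < 0) {
        if (s->quit) {
            /* A request coroutine already failed and set s->quit; it
             * has reported the primary error.  Any further read failure
             * here is a secondary effect of tearing down the
             * connection, so drop it instead of spamming the log. */
            error_free(local_err);
        } else {
            error_report_err(local_err);
        }
    }

The comment makes the invariant explicit: whoever sets s->quit is responsible for reporting the first, meaningful error.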
Re: [Qemu-block] [Qemu-devel] [PATCH 15/18] block/mirror: Add active mirroring
On Tue, Sep 19, 2017 at 10:44:16AM +0100, Stefan Hajnoczi wrote: > On Mon, Sep 18, 2017 at 06:26:51PM +0200, Max Reitz wrote: > > On 2017-09-18 12:06, Stefan Hajnoczi wrote: > > > On Sat, Sep 16, 2017 at 03:58:01PM +0200, Max Reitz wrote: > > >> On 2017-09-14 17:57, Stefan Hajnoczi wrote: > > >>> On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote: > > This patch implements active synchronous mirroring. In active mode, > > the > > passive mechanism will still be in place and is used to copy all > > initially dirty clusters off the source disk; but every write request > > will write data both to the source and the target disk, so the source > > cannot be dirtied faster than data is mirrored to the target. Also, > > once the block job has converged (BLOCK_JOB_READY sent), source and > > target are guaranteed to stay in sync (unless an error occurs). > > > > Optionally, dirty data can be copied to the target disk on read > > operations, too. > > > > Active mode is completely optional and currently disabled at runtime. > > A > > later patch will add a way for users to enable it. > > > > Signed-off-by: Max Reitz > > --- > > qapi/block-core.json | 23 +++ > > block/mirror.c | 187 > > +-- > > 2 files changed, 205 insertions(+), 5 deletions(-) > > > > diff --git a/qapi/block-core.json b/qapi/block-core.json > > index bb11815608..e072cfa67c 100644 > > --- a/qapi/block-core.json > > +++ b/qapi/block-core.json > > @@ -938,6 +938,29 @@ > > 'data': ['top', 'full', 'none', 'incremental'] } > > > > ## > > +# @MirrorCopyMode: > > +# > > +# An enumeration whose values tell the mirror block job when to > > +# trigger writes to the target. > > +# > > +# @passive: copy data in background only. > > +# > > +# @active-write: when data is written to the source, write it > > +#(synchronously) to the target as well. In addition, > > +#data is copied in background just like in @passive > > +#mode. > > +# > > +# @active-read-write: write data to the target (synchronously) both > > +# when it is read from and written to the source. > > +# In addition, data is copied in background just > > +# like in @passive mode. > > >>> > > >>> I'm not sure the terms "active"/"passive" are helpful. "Active commit" > > >>> means committing the top-most BDS while the guest is accessing it. The > > >>> "passive" mirror block still works on the top-most BDS while the guest > > >>> is accessing it. > > >>> > > >>> Calling it "asynchronous" and "synchronous" is clearer to me. It's also > > >>> the terminology used in disk replication (e.g. DRBD). > > >> > > >> I'd be OK with that, too, but I think I remember that in the past at > > >> least Kevin made a clear distinction between active/passive and > > >> sync/async when it comes to mirroring. > > >> > > >>> Ideally the user wouldn't have to worry about async vs sync because QEMU > > >>> would switch modes as appropriate in order to converge. That way > > >>> libvirt also doesn't have to worry about this. > > >> > > >> So here you mean async/sync in the way I meant it, i.e., whether the > > >> mirror operations themselves are async/sync? > > > > > > The meaning I had in mind is: > > > > > > Sync mirroring means a guest write waits until the target write > > > completes. > > > > I.e. active-sync, ... > > > > > Async mirroring means guest writes completes independently of target > > > writes. > > > > ... i.e. passive or active-async in the future. 
> > > > So you really want qemu to decide whether to use active or passive mode > > depending on what's enough to let the block job converge and not > > introduce any switch for the user? > > > > I'm not sure whether I like this too much, mostly because "libvirt > > doesn't have to worry" doesn't feel quite right to me. If we don't make > > libvirt worry about this, then qemu has to worry. I'm not sure whether > > that's any better. > > > > I think this really does get into policy territory. Just switching to > > active mode the instant target writes are slower than source writes may > > not be what the user wants: Maybe it's OK for a short duration because > > they don't care about hard convergence too much. Maybe they want to > > switch to active mode already when "only" twice as much is written to > > the target as to the source. > > > > I think this is a decision the management layer (or the user) has to make. > > Eric: Does libvirt want to be involved with converging the mirror job > (i.e. if the guest is writing to disk faster than QEMU can copy data to > the target)? Libvirt doesn't really want to set the policy - it will j
[Qemu-block] [PATCH] nvme: fix cmbuf leak on exit
Cc: Keith Busch
Signed-off-by: Stefan Hajnoczi
---
 hw/block/nvme.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 9aa32692a3..513ec7065d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1057,6 +1057,7 @@ static void nvme_exit(PCIDevice *pci_dev)
     g_free(n->namespaces);
     g_free(n->cq);
     g_free(n->sq);
+    g_free(n->cmbuf);
     if (n->cmbsz) {
         memory_region_unref(&n->ctrl_mem);
     }
-- 
2.13.5
Re: [Qemu-block] [PATCH v2 6/7] block/nbd-client: early fail nbd_read_reply_entry if s->quit is set
19.09.2017 01:27, Eric Blake:
> On 09/18/2017 08:59 AM, Vladimir Sementsov-Ogievskiy wrote:
>> Do not continue any operation if s->quit is set in parallel.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy
>> ---
>>  block/nbd-client.c | 7 +++
>>  1 file changed, 3 insertions(+), 4 deletions(-)
>>
>> diff --git a/block/nbd-client.c b/block/nbd-client.c
>> index 280147e6a7..f80a4c5564 100644
>> --- a/block/nbd-client.c
>> +++ b/block/nbd-client.c
>> @@ -81,7 +81,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>>          if (ret < 0) {
>>              error_report_err(local_err);
>>          }
>> -        if (ret <= 0) {
>> +        if (ret <= 0 || s->quit) {
>>              break;
>>          }
>
> I'm still not convinced this helps: either nbd_receive_reply() already
> returned an error, or we got a successful reply header, at which point
> either the command is done (no need to abort the command early - it
> succeeded) or it is a read command (and we should detect at the point
> where we try to populate the qiov that we don't want to read any more).
>
> It depends on your 3/7 patch for the fact that we even try to read the
> qiov directly in this while loop rather than in the coroutine handler,
> where Paolo questioned whether we need that change.
>
>> @@ -105,9 +105,8 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>>              assert(qiov != NULL);
>>              assert(s->requests[i].request->len ==
>>                     iov_size(qiov->iov, qiov->niov));
>> -            if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
>> -                                      NULL) < 0)
>> -            {
>> +            ret = qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov, NULL);
>> +            if (ret < 0 || s->quit) {
>>                  s->requests[i].ret = -EIO;
>>                  break;
>>              }
>
> The placement here looks odd. Either we should not attempt the read
> because s->quit was already set (okay, your first addition makes sense
> in that light), or we succeeded at the read (at which point, why do we
> need to claim it failed?).
>
> I'm trying to look forward to structured reads, where we will have to
> deal with more than one server message in reply to a single client
> request. For read, we just piece together portions of the qiov until
> the server has sent us all the pieces. But for block query, I really do
> think we'll want to defer to specialized handlers for doing further
> reads (the amount of data to be read from the server is not known by the
> client a priori, so it is a two-part read, one to get the length, and
> another to read that remaining length); if we need some sort of callback
> function rather than cramming all the logic here in the main read loop,

By patch 3 I do not mean in any way that I want to have all reading
stuff in one function; of course it should be refactored for the
block-status addition. I just want to have all reading stuff in one
coroutine. Reading from ioc is sequential, so it's natural to do it
sequentially in one coroutine, without switches.

> then the qio_channel_readv_all for read commands would belong in the
> callback, and it is more a question of the main read loop checking
> whether there is an early quit condition before calling into the
> callback.

If I understand correctly, the discussion is:

me: if something fails, fail all other parallel operations - it's simple
and clear, but we need to check s->quit after every possible yield.

you: _not_ all other parallel operations - maybe it is better, but in
this case we need to carefully define which parallel operations we fail
on s->quit and which not, and understand all the effects of this
division.

-- 
Best regards,
Vladimir
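To make the callback idea concrete, a handler-per-command design along the lines Eric sketches might look like this. All names below are hypothetical; read_sync() and parse_status_payload() are stand-ins, not existing QEMU functions, and no such handler table exists in the tree today:

    /* The generic loop reads the header, then dispatches on command type. */
    typedef int coroutine_fn NBDReplyHandler(NBDClientSession *s,
                                             NBDClientRequest *req);

    /* NBD_CMD_READ: payload length is already known from the request. */
    static int coroutine_fn handle_read_reply(NBDClientSession *s,
                                              NBDClientRequest *req)
    {
        return qio_channel_readv_all(s->ioc, req->qiov->iov,
                                     req->qiov->niov, NULL);
    }

    /* A future block-status command: length is not known a priori, so
     * do Eric's two-part read - first the length, then the payload. */
    static int coroutine_fn handle_status_reply(NBDClientSession *s,
                                                NBDClientRequest *req)
    {
        uint32_t len;

        if (read_sync(s->ioc, &len, sizeof(len)) < 0) {
            return -EIO;
        }
        return parse_status_payload(s->ioc, req, be32_to_cpu(len));
    }

With this shape, the main read loop only checks for an early quit condition before calling into the handler, and per-command payload knowledge stays out of nbd_read_reply_entry - which is the property Paolo is arguing for.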
Re: [Qemu-block] [PATCH 15/18] block/mirror: Add active mirroring
On Mon, Sep 18, 2017 at 06:26:51PM +0200, Max Reitz wrote: > On 2017-09-18 12:06, Stefan Hajnoczi wrote: > > On Sat, Sep 16, 2017 at 03:58:01PM +0200, Max Reitz wrote: > >> On 2017-09-14 17:57, Stefan Hajnoczi wrote: > >>> On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote: > This patch implements active synchronous mirroring. In active mode, the > passive mechanism will still be in place and is used to copy all > initially dirty clusters off the source disk; but every write request > will write data both to the source and the target disk, so the source > cannot be dirtied faster than data is mirrored to the target. Also, > once the block job has converged (BLOCK_JOB_READY sent), source and > target are guaranteed to stay in sync (unless an error occurs). > > Optionally, dirty data can be copied to the target disk on read > operations, too. > > Active mode is completely optional and currently disabled at runtime. A > later patch will add a way for users to enable it. > > Signed-off-by: Max Reitz > --- > qapi/block-core.json | 23 +++ > block/mirror.c | 187 > +-- > 2 files changed, 205 insertions(+), 5 deletions(-) > > diff --git a/qapi/block-core.json b/qapi/block-core.json > index bb11815608..e072cfa67c 100644 > --- a/qapi/block-core.json > +++ b/qapi/block-core.json > @@ -938,6 +938,29 @@ > 'data': ['top', 'full', 'none', 'incremental'] } > > ## > +# @MirrorCopyMode: > +# > +# An enumeration whose values tell the mirror block job when to > +# trigger writes to the target. > +# > +# @passive: copy data in background only. > +# > +# @active-write: when data is written to the source, write it > +#(synchronously) to the target as well. In addition, > +#data is copied in background just like in @passive > +#mode. > +# > +# @active-read-write: write data to the target (synchronously) both > +# when it is read from and written to the source. > +# In addition, data is copied in background just > +# like in @passive mode. > >>> > >>> I'm not sure the terms "active"/"passive" are helpful. "Active commit" > >>> means committing the top-most BDS while the guest is accessing it. The > >>> "passive" mirror block still works on the top-most BDS while the guest > >>> is accessing it. > >>> > >>> Calling it "asynchronous" and "synchronous" is clearer to me. It's also > >>> the terminology used in disk replication (e.g. DRBD). > >> > >> I'd be OK with that, too, but I think I remember that in the past at > >> least Kevin made a clear distinction between active/passive and > >> sync/async when it comes to mirroring. > >> > >>> Ideally the user wouldn't have to worry about async vs sync because QEMU > >>> would switch modes as appropriate in order to converge. That way > >>> libvirt also doesn't have to worry about this. > >> > >> So here you mean async/sync in the way I meant it, i.e., whether the > >> mirror operations themselves are async/sync? > > > > The meaning I had in mind is: > > > > Sync mirroring means a guest write waits until the target write > > completes. > > I.e. active-sync, ... > > > Async mirroring means guest writes completes independently of target > > writes. > > ... i.e. passive or active-async in the future. > > So you really want qemu to decide whether to use active or passive mode > depending on what's enough to let the block job converge and not > introduce any switch for the user? > > I'm not sure whether I like this too much, mostly because "libvirt > doesn't have to worry" doesn't feel quite right to me. If we don't make > libvirt worry about this, then qemu has to worry. 
I'm not sure whether > that's any better. > > I think this really does get into policy territory. Just switching to > active mode the instant target writes are slower than source writes may > not be what the user wants: Maybe it's OK for a short duration because > they don't care about hard convergence too much. Maybe they want to > switch to active mode already when "only" twice as much is written to > the target as to the source. > > I think this is a decision the management layer (or the user) has to make. Eric: Does libvirt want to be involved with converging the mirror job (i.e. if the guest is writing to disk faster than QEMU can copy data to the target)? Stefan
Re: [Qemu-block] [PATCH v2 3/7] block/nbd-client: refactor reading reply
18.09.2017 18:43, Paolo Bonzini wrote:
> On 18/09/2017 15:59, Vladimir Sementsov-Ogievskiy wrote:
>> Read the whole reply in one place - in nbd_read_reply_entry.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy
>> ---
>>  block/nbd-client.h |  1 +
>>  block/nbd-client.c | 42 --
>>  2 files changed, 25 insertions(+), 18 deletions(-)
>>
>> diff --git a/block/nbd-client.h b/block/nbd-client.h
>> index b1900e6a6d..3f97d31013 100644
>> --- a/block/nbd-client.h
>> +++ b/block/nbd-client.h
>> @@ -21,6 +21,7 @@ typedef struct {
>>      Coroutine *coroutine;
>>      bool receiving; /* waiting for read_reply_co? */
>>      NBDRequest *request;
>> +    QEMUIOVector *qiov;
>>  } NBDClientRequest;
>>
>>  typedef struct NBDClientSession {
>> diff --git a/block/nbd-client.c b/block/nbd-client.c
>> index 5f241ecc22..f7b642f079 100644
>> --- a/block/nbd-client.c
>> +++ b/block/nbd-client.c
>> @@ -98,6 +98,21 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>>              break;
>>          }
>>
>> +        if (s->reply.error == 0 &&
>> +            s->requests[i].request->type == NBD_CMD_READ)
>> +        {
>> +            QEMUIOVector *qiov = s->requests[i].qiov;
>> +            assert(qiov != NULL);
>> +            assert(s->requests[i].request->len ==
>> +                   iov_size(qiov->iov, qiov->niov));
>> +            if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
>> +                                      NULL) < 0)
>> +            {
>> +                s->reply.error = EIO;
>> +                break;
>> +            }
>> +        }
>
> I am not sure this is an improvement. In principle you could have
> commands that read replies a bit at a time without using a QEMUIOVector.
>
> Paolo

Hm, what do you mean? In this patch I don't change "how do we read it",
I only move the reading code to one coroutine, to make it simple to
extend it in the future (block status, etc). nbd_read_reply_entry has
all the information it needs (s->requests[i].request) to handle the
whole reply. It's an improvement because it separates the reply_entry
coroutine - it no longer needs any synchronization (yield/wake) with
request coroutines.

>>          /* We're woken up again by the request itself.  Note that there
>>           * is no race between yielding and reentering read_reply_co.  This
>>           * is because:
>> @@ -118,6 +133,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>>      s->read_reply_co = NULL;
>>  }
>>
>> +/* qiov should be provided for READ and WRITE */
>>  static int nbd_co_send_request(BlockDriverState *bs,
>>                                 NBDRequest *request,
>>                                 QEMUIOVector *qiov)
>> @@ -145,6 +161,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
>>      request->handle = INDEX_TO_HANDLE(s, i);
>>      s->requests[i].request = request;
>> +    s->requests[i].qiov = qiov;
>>
>>      if (s->quit) {
>>          rc = -EIO;
>> @@ -155,7 +172,8 @@ static int nbd_co_send_request(BlockDriverState *bs,
>>          goto err;
>>      }
>>
>> -    if (qiov) {
>> +    if (s->requests[i].request->type == NBD_CMD_WRITE) {
>> +        assert(qiov);
>>          qio_channel_set_cork(s->ioc, true);
>>          rc = nbd_send_request(s->ioc, request);
>>          if (rc >= 0 && !s->quit) {
>> @@ -181,9 +199,7 @@ err:
>>      return rc;
>>  }
>>
>> -static int nbd_co_receive_reply(NBDClientSession *s,
>> -                                NBDRequest *request,
>> -                                QEMUIOVector *qiov)
>> +static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest *request)
>>  {
>>      int ret;
>>      int i = HANDLE_TO_INDEX(s, request->handle);
>> @@ -197,14 +213,6 @@ static int nbd_co_receive_reply(NBDClientSession *s,
>>      } else {
>>          assert(s->reply.handle == request->handle);
>>          ret = -s->reply.error;
>> -        if (qiov && s->reply.error == 0) {
>> -            assert(request->len == iov_size(qiov->iov, qiov->niov));
>> -            if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
>> -                                      NULL) < 0) {
>> -                ret = -EIO;
>> -                s->quit = true;
>> -            }
>> -        }
>>
>>          /* Tell the read handler to read another header. */
>>          s->reply.handle = 0;
>> @@ -232,16 +240,14 @@ static int nbd_co_request(BlockDriverState *bs,
>>      NBDClientSession *client = nbd_get_client_session(bs);
>>      int ret;
>>
>> -    assert(!qiov || request->type == NBD_CMD_WRITE ||
>> -           request->type == NBD_CMD_READ);
>> -    ret = nbd_co_send_request(bs, request,
>> -                              request->type == NBD_CMD_WRITE ? qiov : NULL);
>> +    assert((qiov == NULL) !=
>> +           (request->type == NBD_CMD_WRITE || request->type == NBD_CMD_READ));
>> +    ret = nbd_co_send_request(bs, request, qiov);
>>      if (ret < 0) {
>>          return ret;
>>      }
>>
>> -    return nbd_co_receive_reply(client, request,
>> -                                request->type == NBD_CMD_READ ? qiov : NULL);
>> +    return nbd_co_receive_reply(client, request);
>>  }
>>
>>  int nbd_client_co_preadv(BlockDriverState *bs, uint64_t offset,
Re: [Qemu-block] [PATCH v8 05/20] dirty-bitmap: Avoid size query failure during truncate
On Mon, 09/18 13:58, Eric Blake wrote: > We've previously fixed several places where we failed to account > for possible errors from bdrv_nb_sectors(). Fix another one by > making bdrv_dirty_bitmap_truncate() take the new size from the > caller instead of querying itself; then adjust the sole caller > bdrv_truncate() to pass the size just determined by a successful > resize, or to skip the bitmap resize on failure, thus avoiding > sizing the bitmaps to -1. > > Signed-off-by: Eric Blake > > --- > v8: retitle and rework to avoid possibility of secondary failure [John] > v7: new patch [Kevin] > --- > include/block/dirty-bitmap.h | 2 +- > block.c | 15 ++- > block/dirty-bitmap.c | 6 +++--- > 3 files changed, 14 insertions(+), 9 deletions(-) > > diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h > index 8fd842eac9..7a27590047 100644 > --- a/include/block/dirty-bitmap.h > +++ b/include/block/dirty-bitmap.h > @@ -83,7 +83,7 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter); > void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num); > int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap); > int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap); > -void bdrv_dirty_bitmap_truncate(BlockDriverState *bs); > +void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes); > bool bdrv_dirty_bitmap_readonly(const BdrvDirtyBitmap *bitmap); > bool bdrv_has_readonly_bitmaps(BlockDriverState *bs); > bool bdrv_dirty_bitmap_get_autoload(const BdrvDirtyBitmap *bitmap); > diff --git a/block.c b/block.c > index ee6a48976e..61ee9d4b83 100644 > --- a/block.c > +++ b/block.c > @@ -3450,12 +3450,17 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, > PreallocMode prealloc, > assert(!(bs->open_flags & BDRV_O_INACTIVE)); > > ret = drv->bdrv_truncate(bs, offset, prealloc, errp); > -if (ret == 0) { > -ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS); > -bdrv_dirty_bitmap_truncate(bs); > -bdrv_parent_cb_resize(bs); > -atomic_inc(&bs->write_gen); > +if (ret < 0) { > +return ret; > } > +ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS); > +if (ret < 0) { Ugh, if we get here the situation is a bit embarrassing, because... > +error_setg_errno(errp, -ret, "Could not refresh total sector count"); > +return ret; > +} > +bdrv_dirty_bitmap_truncate(bs, bs->total_sectors * BDRV_SECTOR_SIZE); > +bdrv_parent_cb_resize(bs); > +atomic_inc(&bs->write_gen); I think we still want to inc write_gen even if refresh_total_sectors failed, if drv->bdrv_truncate has succeeded? That way the next bdrv_co_flush will actually flush the metadata change to disk. Maybe similarly call bdrv_parent_cb_resize() as long as drv->bdrv_truncate() succeeded? The effect is the virtual devices notify guest about this "resized" event, which I think is correct. Fam
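The reordering Fam suggests would look roughly like this inside bdrv_truncate() — an untested sketch of the control flow only, rearranging the code quoted above:

    ret = drv->bdrv_truncate(bs, offset, prealloc, errp);
    if (ret < 0) {
        return ret;
    }
    /* The image really was resized, so make sure the next
     * bdrv_co_flush() writes the metadata change out ... */
    atomic_inc(&bs->write_gen);
    /* ... and let guest devices learn the new size right away. */
    bdrv_parent_cb_resize(bs);

    ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
    if (ret < 0) {
        error_setg_errno(errp, -ret, "Could not refresh total sector count");
        return ret;
    }
    bdrv_dirty_bitmap_truncate(bs, bs->total_sectors * BDRV_SECTOR_SIZE);
    return 0;

The design question is exactly the one raised here: the driver-level truncate has already succeeded, so the bookkeeping that records "the image changed" arguably should not depend on the success of a follow-up query.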
Re: [Qemu-block] [PATCH] nbd-client: Use correct macro parenthesization
On Mon, Sep 18, 2017 at 04:46:49PM -0500, Eric Blake wrote: > If 'bs' is a complex expression, we were only casting the front half > rather than the full expression. Luckily, none of the callers were > passing bad arguments, but it's better to be robust up front. > > Signed-off-by: Eric Blake > --- > > I plan to take this through my NBD queue, unless it is picked up > by qemu-trivial first... > > block/nbd-client.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) Reviewed-by: Stefan Hajnoczi
[Qemu-block] Dead code in cpu-models.h (was: block: Clean up some bad code in the vvfat driver)
On 19.09.2017 10:06, Paolo Bonzini wrote:
> On 13/09/2017 21:08, John Snow wrote:
[...]
>> Farewell, bitrot code.
>>
>> Reviewed-by: John Snow
>>
>> Out of curiosity, I wonder ...
>>
>> jhuston@probe (foobar) ~/s/qemu> git grep '#if 0' | wc -l
>> 320
>
> $ git grep -c '#if 0' | sort -k2 --field-separator=: -n
> ...
> hw/net/eepro100.c:21
> target/ppc/cpu-models.h:76
>
> whoa :)

Igor recently already removed the dead definitions from cpu-models.c:

 https://git.qemu.org/?p=qemu.git;a=commitdiff;h=aef779605779579afbafff

I guess we could now remove the corresponding dead definitions from the
header, too... Any volunteers?

 Thomas
Re: [Qemu-block] [PATCH 18/18] iotests: Add test for active mirroring
On Mon, 09/18 18:53, Max Reitz wrote:
> >> +
> >> +        if sync_source_and_target:
> >> +            # If source and target should be in sync after the mirror,
> >> +            # we have to flush before completion
> >
> > Not sure I understand this requirement; does it apply to libvirt and
> > the user too? I.e., is it part of the interface? Why can't
> > mirror_complete do it automatically?
>
> Well, it seems to pass without this flush, but the original intention
> was this: When mirror is completed, the source node is replaced by the
> target. All further writes are then only executed on the (former)
> target node. So what might happen (or at least I think it could) is
> that qemu-io submits some writes, but before they are actually
> performed, the mirror block job is completed and the source is replaced
> by the target. Then, the write operations are performed on the target
> but no longer on the source, so source and target are then out of sync.
> For the mirror block job, that is fine -- at the point of completion,
> source and target were in sync. The job doesn't care that they get out
> of sync after completion. But here, we have to care, or we can't compare
> source and target.
>
> The reason why it always seems to pass without a flush is that every
> submitted write is actually sent to the mirror node before it yields for
> the first time. But I wouldn't bet on that, so I think it's better to
> keep the flush before completing the block job.

OK, that makes sense. Thanks.

Fam
Re: [Qemu-block] [PATCH] block: Clean up some bad code in the vvfat driver
On 13/09/2017 21:08, John Snow wrote:
> On 09/13/2017 06:21 AM, Thomas Huth wrote:
>> Remove the unnecessary home-grown redefinition of the assert() macro here,
>> and remove the unusable debug code at the end of the checkpoint() function.
>> The code there uses assert() with side-effects (assignment to the "mapping"
>> variable), which should be avoided. Looking more closely, it seems that it
>> is apparently also only usable for one certain directory layout (with a
>> file named USB.H in it) and thus is of no use for the rest of the world.
>>
>> Signed-off-by: Thomas Huth
>
> Farewell, bitrot code.
>
> Reviewed-by: John Snow
>
> Out of curiosity, I wonder ...
>
> jhuston@probe (foobar) ~/s/qemu> git grep '#if 0' | wc -l
> 320

$ git grep -c '#if 0' | sort -k2 --field-separator=: -n
...
hw/net/eepro100.c:21
target/ppc/cpu-models.h:76

whoa :)
Re: [Qemu-block] [PATCH 17/18] qemu-io: Add background write
On Mon, 09/18 19:53, Max Reitz wrote: > On 2017-09-18 08:46, Fam Zheng wrote: > > On Wed, 09/13 20:19, Max Reitz wrote: > >> Add a new parameter -B to qemu-io's write command. When used, qemu-io > >> will not wait for the result of the operation and instead execute it in > >> the background. > > > > Cannot aio_write be used for this purpose? > > Depends. I have been trained to dislike *_aio_*, so that's probably the > initial reason why I didn't use it. > > Second, I'd have to fix aio_write before it can be used. Currently, > this aborts: > > echo 'qemu-io drv0 "aio_write -P 0x11 0 64M"' \ > | x86_64-softmmu/qemu-system-x86_64 -monitor stdio \ > -blockdev node-name=drv0,driver=null-co > > because aio_write_done thinks it's a good idea to use qemu-io's > BlockBackend -- but when qemu-io is executed through the HMP, the > BlockBackend is only created for the duration of the qemu-io command > (unless there already is a BB). So what I'd have to do is add a > blk_ref()/blk_unref() there, but for some reason I really don't like that. What is the reason? If it crashes it should be fixed anyway, I assume? Fam
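For illustration, the blk_ref()/blk_unref() pairing under discussion might look like this — an untested sketch, not the actual qemu-io code; submit_background_write() is a made-up name:

    /* Pin the BlockBackend before submitting the AIO and drop the
     * reference in the completion callback, so an HMP-created BB
     * survives until the background write finishes. */
    static void aio_write_done(void *opaque, int ret)
    {
        BlockBackend *blk = opaque;

        if (ret < 0) {
            printf("aio_write failed: %s\n", strerror(-ret));
        }
        blk_unref(blk);   /* may destroy the BB now that the AIO is done */
    }

    static void submit_background_write(BlockBackend *blk, int64_t offset,
                                        QEMUIOVector *qiov)
    {
        blk_ref(blk);     /* keep the BB alive across the AIO */
        blk_aio_pwritev(blk, offset, qiov, 0, aio_write_done, blk);
    }

Whether taking that reference is the right fix, or whether the HMP-side BB lifetime should change instead, is exactly the question Fam raises.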
Re: [Qemu-block] [PATCH] block: Clean up some bad code in the vvfat driver
Am 13.09.2017 um 12:21 hat Thomas Huth geschrieben: > Remove the unnecessary home-grown redefinition of the assert() macro here, > and remove the unusable debug code at the end of the checkpoint() function. > The code there uses assert() with side-effects (assignment to the "mapping" > variable), which should be avoided. Looking more closely, it seems as it is > apparently also only usable for one certain directory layout (with a file > named USB.H in it) and thus is of no use for the rest of the world. > > Signed-off-by: Thomas Huth Thanks, applied to the block branch. Kevin
Re: [Qemu-block] [PATCH] block/throttle-groups.c: allocate RestartData on the heap
Am 18.09.2017 um 22:25 hat Manos Pitsidianakis geschrieben:
> RestartData is the opaque data of the throttle_group_restart_queue_entry
> coroutine. By being stack-allocated, it isn't available anymore if
> aio_co_enter schedules the coroutine with a bottom half that runs after
> throttle_group_restart_queue returns.

Cc: qemu-sta...@nongnu.org

(It should be mentioned explicitly in the commit message so you can
filter for it in the commit history. I routinely add this while applying
patches if I'm aware of the qemu-stable CC, but it makes my life easier
if it's already there.)

> Signed-off-by: Manos Pitsidianakis

Thanks, applied to the block branch.

Kevin
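For readers following along, the pattern fixed by this patch looks roughly like this — simplified from block/throttle-groups.c, so details may differ from the actual commit:

    /* The coroutine may be entered via a bottom half after this
     * function has returned, so its opaque data must live on the heap. */
    typedef struct {
        BlockBackend *blk;
        bool is_write;
    } RestartData;

    static void throttle_group_restart_queue(BlockBackend *blk, bool is_write)
    {
        Coroutine *co;
        RestartData *rd = g_new(RestartData, 1);   /* heap, not stack */

        rd->blk = blk;
        rd->is_write = is_write;

        co = qemu_coroutine_create(throttle_group_restart_queue_entry, rd);
        /* The coroutine may run later from a BH, long after our stack
         * frame is gone; it frees rd itself when it finishes. */
        aio_co_enter(blk_get_aio_context(blk), co);
    }

The general rule: any opaque pointer handed to deferred execution (BH, coroutine, AIO callback) must outlive the submitting function, which in practice means heap allocation with ownership transferred to the callee.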