Re: [Qemu-block] [Qemu-devel] [PATCH COLO v3 10/14] util/hbitmap: Add an API to reset all set bits in hbitmap
On 05/06/2015 10:20 PM, Wen Congyang wrote: On 05/02/2015 12:47 AM, John Snow wrote: On 04/03/2015 07:05 AM, Paolo Bonzini wrote: On 03/04/2015 12:01, Wen Congyang wrote: Signed-off-by: Wen Congyang we...@cn.fujitsu.com Signed-off-by: zhanghailiang zhang.zhanghaili...@huawei.com Signed-off-by: Gonglei arei.gong...@huawei.com --- include/qemu/hbitmap.h | 8 tests/test-hbitmap.c | 39 +++ util/hbitmap.c | 16 3 files changed, 63 insertions(+) diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h index 550d7ce..95a55e4 100644 --- a/include/qemu/hbitmap.h +++ b/include/qemu/hbitmap.h @@ -109,6 +109,14 @@ void hbitmap_set(HBitmap *hb, uint64_t start, uint64_t count); void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count); /** + * hbitmap_reset_all: + * @hb: HBitmap to operate on. + * + * Reset all bits in an HBitmap. + */ +void hbitmap_reset_all(HBitmap *hb); + +/** * hbitmap_get: * @hb: HBitmap to operate on. * @item: Bit to query (0-based). diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c index 8c902f2..1f0078a 100644 --- a/tests/test-hbitmap.c +++ b/tests/test-hbitmap.c @@ -11,6 +11,7 @@ #include glib.h #include stdarg.h +#include string.h #include qemu/hbitmap.h #define LOG_BITS_PER_LONG (BITS_PER_LONG == 32 ? 5 : 6) @@ -143,6 +144,23 @@ static void hbitmap_test_reset(TestHBitmapData *data, } } +static void hbitmap_test_reset_all(TestHBitmapData *data) +{ +size_t n; + +hbitmap_reset_all(data-hb); + +n = (data-size + BITS_PER_LONG - 1) / BITS_PER_LONG; +if (n == 0) { +n = 1; +} +memset(data-bits, 0, n * sizeof(unsigned long)); + +if (data-granularity == 0) { +hbitmap_test_check(data, 0); +} +} + static void hbitmap_test_check_get(TestHBitmapData *data) { uint64_t count = 0; @@ -323,6 +341,26 @@ static void test_hbitmap_reset(TestHBitmapData *data, hbitmap_test_set(data, L3 / 2, L3); } +static void test_hbitmap_reset_all(TestHBitmapData *data, + const void *unused) +{ +hbitmap_test_init(data, L3 * 2, 0); +hbitmap_test_set(data, L1 - 1, L1 + 2); +hbitmap_test_reset_all(data); +hbitmap_test_set(data, 0, L1 * 3); +hbitmap_test_reset_all(data); +hbitmap_test_set(data, L2, L1); +hbitmap_test_reset_all(data); +hbitmap_test_set(data, L2, L3 - L2 + 1); +hbitmap_test_reset_all(data); +hbitmap_test_set(data, L3 - 1, 3); +hbitmap_test_reset_all(data); +hbitmap_test_set(data, 0, L3 * 2); +hbitmap_test_reset_all(data); +hbitmap_test_set(data, L3 / 2, L3); +hbitmap_test_reset_all(data); +} + static void test_hbitmap_granularity(TestHBitmapData *data, const void *unused) { @@ -394,6 +432,7 @@ int main(int argc, char **argv) hbitmap_test_add(/hbitmap/set/overlap, test_hbitmap_set_overlap); hbitmap_test_add(/hbitmap/reset/empty, test_hbitmap_reset_empty); hbitmap_test_add(/hbitmap/reset/general, test_hbitmap_reset); +hbitmap_test_add(/hbitmap/reset/all, test_hbitmap_reset_all); hbitmap_test_add(/hbitmap/granularity, test_hbitmap_granularity); g_test_run(); diff --git a/util/hbitmap.c b/util/hbitmap.c index ab13971..acce93c 100644 --- a/util/hbitmap.c +++ b/util/hbitmap.c @@ -353,6 +353,22 @@ void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count) hb_reset_between(hb, HBITMAP_LEVELS - 1, start, last); } +void hbitmap_reset_all(HBitmap *hb) +{ +uint64_t size = hb-size; +unsigned int i; + +/* Same as hbitmap_alloc() except memset() */ +for (i = HBITMAP_LEVELS; --i = 1; ) { +size = MAX((size + BITS_PER_LONG - 1) BITS_PER_LEVEL, 1); +memset(hb-levels[i], 0, size * sizeof(unsigned long)); +} + For what it's worth, I recently added in a hb-sizes[i] cache to store the size of each array so you don't have to recompute this all the time. Yes, will fix it in the next version. Thanks Wen Congyang Since the reset stuff is useful all by itself, you can send that patch by itself, CC me, and I'll review it. You can update the existing call in block.c: bdrv_clear_dirty_bitmap() { hbitmap_reset(bitmap-bitmap, 0, bitmap-size); } to using your faster hbitmap_reset_all call. Thanks, --js +assert(size == 1); +hb-levels[0][0] = 1UL (BITS_PER_LONG - 1); +hb-count = 0; +} + bool hbitmap_get(const HBitmap *hb, uint64_t item) { /* Compute position and bit in the last layer. */ Acked-by: Paolo Bonzini pbonz...@redhat.com .
Re: [Qemu-block] [PATCH 3/7] qcow2: use an LRU algorithm to replace entries from the L2 cache
On Wed, May 06, 2015 at 04:39:27PM +0300, Alberto Garcia wrote: The current algorithm to evict entries from the cache gives always preference to those in the lowest positions. As the size of the cache increases, the chances of the later elements of being removed decrease exponentially. In a scenario with random I/O and lots of cache misses, entries in positions 8 and higher are rarely (if ever) evicted. This can be seen even with the default cache size, but with larger caches the problem becomes more obvious. Using an LRU algorithm makes the chances of being removed from the cache independent from the position. Signed-off-by: Alberto Garcia be...@igalia.com --- block/qcow2-cache.c | 31 +++ 1 file changed, 15 insertions(+), 16 deletions(-) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgpaKPHOt8X3o.pgp Description: PGP signature
Re: [Qemu-block] [Qemu-devel] [PATCH 6/7] qcow2: make qcow2_cache_put() a void function
On Wed, May 06, 2015 at 04:39:30PM +0300, Alberto Garcia wrote: This function never receives an invalid table pointer, so we can make it void and remove all the error checking code. Signed-off-by: Alberto Garcia be...@igalia.com --- block/qcow2-cache.c| 7 +-- block/qcow2-cluster.c | 50 ++ block/qcow2-refcount.c | 29 + block/qcow2.h | 2 +- 4 files changed, 17 insertions(+), 71 deletions(-) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgp5d0sh16YYy.pgp Description: PGP signature
Re: [Qemu-block] [Qemu-devel] [PATCH 7/7] qcow2: style fixes in qcow2-cache.c
On Wed, May 06, 2015 at 04:39:31PM +0300, Alberto Garcia wrote: Fix pointer declaration to make it consistent with the rest of the code. Signed-off-by: Alberto Garcia be...@igalia.com --- block/qcow2-cache.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgpdlnUS45yrg.pgp Description: PGP signature
Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/3] block: Warn about usage of growing formats over non-growable protocols
On 06/05/2015 19:23, Max Reitz wrote: The guest sees whatever has been written into reply-error, and that field hasn't been written by this function in that case. It has been written by nbd_receive_reply() in nbd.c, and that value comes directly from the server. In case of qemu-nbd being the server, a write beyond the EOF should be caught by blk_check_byte_request() in block/block-backend.c, which returns -EIO. So that's where the EIO comes from. Fair enough. This makes sense, but then we have to create ENOSPC elsewhere. I don't know whether this EIO is subsequently converted to ENOSPC because of werror=enospc, but considering that https://bugzilla.redhat.com/show_bug.cgi?id=1090713 did not override werror, it doesn't look like it. No, it shouldn't indeed. Could alloc_clusters_noref do bdrv_truncate and return ENOSPC if it fails? That's how for example qcow and vhdx work. vdi has the same problem. Paolo
Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/6] block: Mirror discarded sectors
On Thu, 05/07 14:20, Stefan Hajnoczi wrote: On Wed, May 06, 2015 at 12:52:02PM +0800, Fam Zheng wrote: v2: Fix typo and add Eric's rev-by in patch 3. Add patch 1 to discard target in mirror job. (Paolo) Add patch 6 to improve iotests.wait_ready. (John) This fixes the mirror assert failure reported by wangxiaolong: https://lists.gnu.org/archive/html/qemu-devel/2015-04/msg04458.html The direct cause is that hbitmap code couldn't handle unset of bits *after* iterator's current position. We could fix that, but the bdrv_reset_dirty() call is more questionable: Before, if guest discarded some sectors during migration, it could see different data after moving to dest side, depending on block backends of the src and the dest. This is IMO worse than mirroring the actual reading as done in this series, because we don't know what the guest is doing. For example if a guest first issues WRITE SAME to wipe out the area then issues UNMAP to discard it, just to get rid of some sensitive data completely, we may miss both operations and leave stale data on dest image. Fam Zheng (6): mirror: Discard target sectors if not allocated at source side block: Fix dirty bitmap in bdrv_co_discard block: Remove bdrv_reset_dirty qemu-iotests: Make block job methods common qemu-iotests: Add test case for mirror with unmap iotests: Use event_wait in wait_ready block.c | 12 block/io.c| 4 +-- block/mirror.c| 12 ++-- include/block/block_int.h | 2 -- tests/qemu-iotests/041| 66 ++- tests/qemu-iotests/131| 59 ++ tests/qemu-iotests/131.out| 5 tests/qemu-iotests/group | 1 + tests/qemu-iotests/iotests.py | 23 +++ 9 files changed, 115 insertions(+), 69 deletions(-) create mode 100644 tests/qemu-iotests/131 create mode 100644 tests/qemu-iotests/131.out CCing Jeff Cody for block jobs Thanks, Note that in the discussion of v1 Paolo pointed out that discard on target is not safe in some cases, I need to fix that in v3. Fam
Re: [Qemu-block] [PATCH v2 0/6] block: Mirror discarded sectors
On Wed, May 06, 2015 at 12:52:02PM +0800, Fam Zheng wrote: v2: Fix typo and add Eric's rev-by in patch 3. Add patch 1 to discard target in mirror job. (Paolo) Add patch 6 to improve iotests.wait_ready. (John) This fixes the mirror assert failure reported by wangxiaolong: https://lists.gnu.org/archive/html/qemu-devel/2015-04/msg04458.html The direct cause is that hbitmap code couldn't handle unset of bits *after* iterator's current position. We could fix that, but the bdrv_reset_dirty() call is more questionable: Before, if guest discarded some sectors during migration, it could see different data after moving to dest side, depending on block backends of the src and the dest. This is IMO worse than mirroring the actual reading as done in this series, because we don't know what the guest is doing. For example if a guest first issues WRITE SAME to wipe out the area then issues UNMAP to discard it, just to get rid of some sensitive data completely, we may miss both operations and leave stale data on dest image. Fam Zheng (6): mirror: Discard target sectors if not allocated at source side block: Fix dirty bitmap in bdrv_co_discard block: Remove bdrv_reset_dirty qemu-iotests: Make block job methods common qemu-iotests: Add test case for mirror with unmap iotests: Use event_wait in wait_ready block.c | 12 block/io.c| 4 +-- block/mirror.c| 12 ++-- include/block/block_int.h | 2 -- tests/qemu-iotests/041| 66 ++- tests/qemu-iotests/131| 59 ++ tests/qemu-iotests/131.out| 5 tests/qemu-iotests/group | 1 + tests/qemu-iotests/iotests.py | 23 +++ 9 files changed, 115 insertions(+), 69 deletions(-) create mode 100644 tests/qemu-iotests/131 create mode 100644 tests/qemu-iotests/131.out CCing Jeff Cody for block jobs pgpod9AO_m2Gc.pgp Description: PGP signature
Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/3] block: Warn about usage of growing formats over non-growable protocols
Am 07.05.2015 um 14:47 hat Paolo Bonzini geschrieben: On 07/05/2015 14:29, Kevin Wolf wrote: No, it shouldn't indeed. Could alloc_clusters_noref do bdrv_truncate and return ENOSPC if it fails? That's how for example qcow and vhdx work. vdi has the same problem. If you want NBD to return -ENOSPC for writes after EOF, change the nbd block driver to do just that. There's no reason to add additional overhead to qcow2 (even over raw-posix) for that. Does ENOSPC over LVM (dm-linear) work at all, and who generates the ENOSPC there? The LVM use case is what oVirt uses, so I'm pretty sure that it works. I'm now sure who generates the ENOSPC, but it's not qemu anyway. If I had to guess, I'd say that the kernel block layer might just forbid writing after EOF for any block device. Kevin
Re: [Qemu-block] [Qemu-devel] [PATCH] nvme: support NVME_VOLATILE_WRITE_CACHE feature
[ Cc: qemu-block ] Am 30.04.2015 um 16:03 hat Keith Busch geschrieben: On Thu, 30 Apr 2015, Christoph Hellwig wrote: The SCSI emulation in the Linux NVMe driver really wants to know if a device has a volatile write cache. Given that qemu has moved away from a model where we report the backing store WCE bit to one where the WCE bit is supposed to be part of the migratable guest-visible state we always return 1 here. Thanks, this fix was long overdue and already incorporated in my tree. I really need to get my act together for a legit pull request with all the other 1.0, 1.1 and 1.2 features. Acked-by: Keith Busch keith.bu...@intel.com Thanks, applied to the block branch. Kevin
Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/3] block: Warn about usage of growing formats over non-growable protocols
On 07/05/2015 16:34, Kevin Wolf wrote: Am 07.05.2015 um 16:16 hat Paolo Bonzini geschrieben: On 07/05/2015 16:07, Kevin Wolf wrote: This is not right for two reasons: The first is that this is BlockBackend code I think it would take effect for the qemu-nbd case though. Oh, you want to change the server code rather than the client? Yes. Wait... Are you saying that NBD sends a (platform specific) errno value over the network? :-/ Yes. :/ That said, at least the error codes that Linux places in /usr/include/asm/errno-base.h seem to be pretty much standard---at least Windows and most Unices share them---with the exception of EAGAIN. I'll send a patch to NBD to standardize the set of error codes that it sends. In theory, what error code the NBD server needs to send should be specified by the NBD protocol. Am I right to assume that it doesn't do that? Nope. In any case, I'm not sure whether qemu's internal error code should change just for NBD. Producing the right error code for the protocol is the job of nbd_co_receive_request(). Ok, so it shouldn't reach blk_check_request at all. But then, we should aim at making blk_check_request's checks assertions. and it wouldn't even take effect for the qcow2 case where we're writing past EOF only on the protocol layer. The second is that -ENOSPC is only for writes and not for reads. This is right. Reads in the kernel return 0, but in QEMU we do not want that. The code currently returns -EIO, but perhaps -EINVAL is a better match. It also happens to be what Linux returns for discards. Perhaps it is, yes. It shouldn't make a difference for guests anyway. (Unlike -ENOSPC for writes, which would trigger werror=enospc! That's most likely not what we want.) Yes, we want the check duplicated in all BlockBackend users. Most of them already do it, see the work that Markus did last year I think. Paolo
Re: [Qemu-block] [PATCH 3/7] qcow2: use an LRU algorithm to replace entries from the L2 cache
On Thu 07 May 2015 04:55:21 PM CEST, Eric Blake wrote: /* Give the table some hits for the start so that it won't be replaced * immediately. The number 32 is completely arbitrary. */ -c-entries[i].cache_hits = 32; c-entries[i].offset = offset; The comment is now dead. Oh! Feel free to remove it when committing the patch, otherwise I can send a new series. Berto
Re: [Qemu-block] [PATCH 3/7] qcow2: use an LRU algorithm to replace entries from the L2 cache
On 05/06/2015 07:39 AM, Alberto Garcia wrote: The current algorithm to evict entries from the cache gives always preference to those in the lowest positions. As the size of the cache increases, the chances of the later elements of being removed decrease exponentially. In a scenario with random I/O and lots of cache misses, entries in positions 8 and higher are rarely (if ever) evicted. This can be seen even with the default cache size, but with larger caches the problem becomes more obvious. Using an LRU algorithm makes the chances of being removed from the cache independent from the position. Signed-off-by: Alberto Garcia be...@igalia.com --- block/qcow2-cache.c | 31 +++ 1 file changed, 15 insertions(+), 16 deletions(-) @@ -318,12 +315,10 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c, /* Give the table some hits for the start so that it won't be replaced * immediately. The number 32 is completely arbitrary. */ -c-entries[i].cache_hits = 32; c-entries[i].offset = offset; The comment is now dead. -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/3] block: Warn about usage of growing formats over non-growable protocols
Am 07.05.2015 um 16:16 hat Paolo Bonzini geschrieben: On 07/05/2015 16:07, Kevin Wolf wrote: This is not right for two reasons: The first is that this is BlockBackend code I think it would take effect for the qemu-nbd case though. Oh, you want to change the server code rather than the client? Wait... Are you saying that NBD sends a (platform specific) errno value over the network? :-/ In theory, what error code the NBD server needs to send should be specified by the NBD protocol. Am I right to assume that it doesn't do that? In any case, I'm not sure whether qemu's internal error code should change just for NBD. Producing the right error code for the protocol is the job of nbd_co_receive_request(). and it wouldn't even take effect for the qcow2 case where we're writing past EOF only on the protocol layer. The second is that -ENOSPC is only for writes and not for reads. This is right. Reads in the kernel return 0, but in QEMU we do not want that. The code currently returns -EIO, but perhaps -EINVAL is a better match. It also happens to be what Linux returns for discards. Perhaps it is, yes. It shouldn't make a difference for guests anyway. (Unlike -ENOSPC for writes, which would trigger werror=enospc! That's most likely not what we want.) Kevin
[Qemu-block] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
In reference to b0ad5a45...078a458e, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes. Only when write is successful that bdrv_flush is called. Signed-off-by: Zhe Qiu phoea...@gmail.com --- block/vdi.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/block/vdi.c b/block/vdi.c index 7642ef3..dfe8ade 100644 --- a/block/vdi.c +++ b/block/vdi.c @@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs, logout(will write %u block map sectors starting from entry %u\n, n_sectors, bmap_first); ret = bdrv_write(bs-file, offset, base, n_sectors); +if (ret = 0) { +ret = bdrv_flush(bs-file); +} } return ret; -- 2.4.0
Re: [Qemu-block] [PATCH 2/7] qcow2: simplify qcow2_cache_put() and qcow2_cache_entry_mark_dirty()
On Wed, May 06, 2015 at 04:39:26PM +0300, Alberto Garcia wrote: Since all tables are now stored together, it is possible to obtain the position of a particular table directly from its address, so the operation becomes O(1). Signed-off-by: Alberto Garcia be...@igalia.com --- block/qcow2-cache.c | 32 +++- 1 file changed, 15 insertions(+), 17 deletions(-) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgpWU7F3fWhBP.pgp Description: PGP signature
Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/3] block: Warn about usage of growing formats over non-growable protocols
Am 07.05.2015 um 14:20 hat Paolo Bonzini geschrieben: On 06/05/2015 19:23, Max Reitz wrote: The guest sees whatever has been written into reply-error, and that field hasn't been written by this function in that case. It has been written by nbd_receive_reply() in nbd.c, and that value comes directly from the server. In case of qemu-nbd being the server, a write beyond the EOF should be caught by blk_check_byte_request() in block/block-backend.c, which returns -EIO. So that's where the EIO comes from. Fair enough. This makes sense, but then we have to create ENOSPC elsewhere. I don't know whether this EIO is subsequently converted to ENOSPC because of werror=enospc, but considering that https://bugzilla.redhat.com/show_bug.cgi?id=1090713 did not override werror, it doesn't look like it. No, it shouldn't indeed. Could alloc_clusters_noref do bdrv_truncate and return ENOSPC if it fails? That's how for example qcow and vhdx work. vdi has the same problem. If you want NBD to return -ENOSPC for writes after EOF, change the nbd block driver to do just that. There's no reason to add additional overhead to qcow2 (even over raw-posix) for that. Kevin
Re: [Qemu-block] [Qemu-devel] [PATCH v3 01/10] qapi: Add transaction support to block-dirty-bitmap operations
On 05/07/2015 10:54 AM, Stefan Hajnoczi wrote: On Wed, Apr 22, 2015 at 08:04:44PM -0400, John Snow wrote: +static void block_dirty_bitmap_clear_prepare(BlkTransactionState *common, + Error **errp) +{ +BlockDirtyBitmapState *state = DO_UPCAST(BlockDirtyBitmapState, + common, common); +BlockDirtyBitmap *action; + +action = common-action-block_dirty_bitmap_clear; +state-bitmap = block_dirty_bitmap_lookup(action-node, + action-name, + state-bs, + state-aio_context, + errp); +if (!state-bitmap) { +return; +} + + if (bdrv_dirty_bitmap_frozen(state-bitmap)) { + error_setg(errp, Cannot modify a frozen bitmap); + return; +} else if (!bdrv_dirty_bitmap_enabled(state-bitmap)) { + error_setg(errp, Cannot clear a disabled bitmap); + return; +} + +/* AioContext is released in .clean() */ +} + +static void block_dirty_bitmap_clear_commit(BlkTransactionState *common) +{ + BlockDirtyBitmapState *state = DO_UPCAST(BlockDirtyBitmapState, + common, common); +bdrv_clear_dirty_bitmap(state-bitmap); +} These semantics don't work in this example: [block-dirty-bitmap-clear, drive-backup] Since drive-backup starts the blockjob in .prepare() but block-dirty-bitmap-clear only clears the bitmap in .commit() the order is wrong. .prepare() has to do something non-destructive, like stashing away the HBitmap and replacing it with an empty one. Then .commit() can discard the old bitmap while .abort() can move the old bitmap back to undo the operation. Stefan Hmm, that's sort of gross. That means that any transactional command *ever* destined to be used with drive-backup in any conceivable way needs to move a lot more of its action forward to .prepare(). That sort of defeats the premise of .prepare() and .commit(), no? And all because drive-backup jumped the gun. That's going to get hard to maintain as we add more transactions. --js