Re: [Qemu-devel] [libvirt] RFC: exposing qemu's block-set-write-threshold
- Original Message - From: Eric Blake ebl...@redhat.com To: Francesco Romani from...@redhat.com Cc: libvir-l...@redhat.com, Nir Soffer nsof...@redhat.com, Peter Krempa pkre...@redhat.com, qemu-devel@nongnu.org Sent: Friday, May 22, 2015 6:33:01 AM Subject: Re: [libvirt] RFC: exposing qemu's block-set-write-threshold [adding qemu] I read the thread and I'm pretty sure this will be a silly question, but I want to make sure I am on the same page and I'm not somehow confused by the terminology. Let's consider the simplest of the situation we face in oVirt: (thin provisioned qcow2 disk on LV) vda=[format=qcow2] - lv=[path=/dev/mapper/$UUID] Isn't the LV here the 'backing file' (actually, backing block device) of the disk? Restating what you wrote into libvirt terminology, I think this means that you have a disk where: driver is qcow2 source is a local file name device names vda backingStore index='1' describes the backing LV: driver is also qcow2 (as polling allocation growth in order to resize on demand only makes sense for qcow2 format) source is /dev/mapper/$UUID Yes, exactly my point. I just want to be 100% sure that the three (slightly) different parlances of the three groups (oVirt/libvirt/QEMU) are aligned on the same meaning, and that we're not getting anything lost in translation that you have a disk where: driver is qcow2 source is a local file name device names vda backingStore index='1' describes the backing LV: driver is also qcow2 (as polling allocation growth in order to resize on demand only makes sense for qcow2 format) source is /dev/mapper/$UUID For the final confirmation, here's the actual XML we produce: disk device=disk snapshot=no type=block address bus=0x00 domain=0x function=0x0 slot=0x05 type=pci/ source dev=/rhev/data-center/0002-0002-0002-0002-014b/12f68692-2a5a-4e48-af5e-4679bca7fd44/images/ee1295ee-7ddc-4030-be5e-4557538bc4d2/05a88a94-5bd6-4698-be69-39e78c84e1a5/ target bus=virtio dev=vda/ serialee1295ee-7ddc-4030-be5e-4557538bc4d2/serial boot order=1/ driver cache=none error_policy=stop io=native name=qemu type=qcow2/ /disk For the sake of completeness: $ ls -lh /rhev/data-center/0002-0002-0002-0002-014b/12f68692-2a5a-4e48-af5e-4679bca7fd44/images/ee1295ee-7ddc-4030-be5e-4557538bc4d2/05a88a94-5bd6-4698-be69-39e78c84e1a5 lrwxrwxrwx. 1 vdsm kvm 78 May 22 08:49 /rhev/data-center/0002-0002-0002-0002-014b/12f68692-2a5a-4e48-af5e-4679bca7fd44/images/ee1295ee-7ddc-4030-be5e-4557538bc4d2/05a88a94-5bd6-4698-be69-39e78c84e1a5 - /dev/12f68692-2a5a-4e48-af5e-4679bca7fd44/05a88a94-5bd6-4698-be69-39e78c84e1a5 $ ls -lh /dev/12f68692-2a5a-4e48-af5e-4679bca7fd44/ total 0 lrwxrwxrwx. 1 root root 8 May 22 08:49 05a88a94-5bd6-4698-be69-39e78c84e1a5 - ../dm-11 lrwxrwxrwx. 1 root root 8 May 22 08:49 54673e6d-207d-4a66-8f0d-3f5b3cda78e5 - ../dm-12 lrwxrwxrwx. 1 root root 9 May 22 08:49 ids - ../dm-606 lrwxrwxrwx. 1 root root 9 May 22 08:49 inbox - ../dm-607 lrwxrwxrwx. 1 root root 9 May 22 08:49 leases - ../dm-605 lrwxrwxrwx. 1 root root 9 May 22 08:49 master - ../dm-608 lrwxrwxrwx. 1 root root 9 May 22 08:49 metadata - ../dm-603 lrwxrwxrwx. 1 root root 9 May 22 08:49 outbox - ../dm-604 lvs | grep 05a88a94 05a88a94-5bd6-4698-be69-39e78c84e1a5 12f68692-2a5a-4e48-af5e-4679bca7fd44 -wi-ao 14.12g then indeed, vda is the local qcow2 file, and vda[1] is the backing file on the LV storage. Normally, you only care about the write threshold at the active layer (the local file, with name vda), because that is the only image that will normally be allocating sectors. But in the case of active commit, where you are taking the thin-provisioned local file and writing its clusters back into the backing LV, the action of commit can allocate sectors in the backing file. Right Thus, libvirt wants to let you set a write-threshold on both parts of the backing chain (the active wrapper, and the LV backing file), where the event could fire on either node first. The existing libvirt virConnectDomainGetAllStats() can already be used to poll allocation growth (the block.N.allocation statistic in libvirt, or 'virtual-size' in QMP's 'ImageInfo'), but the event would let you drop polling. Yes, exactly the intent However, while starting to code the libvirt side of things, I've hit a couple of snags with interacting with the qemu design. First, the 'block-set-write-threshold' command is allowed to set a threshold by 'node-name' (any BDS, whether active or backing), Yes, this emerged during the review of my patch. I first took the simplest approach (probably simplistic, in retrospect), but -IIRC- was pointed out that setting by node-name grants the most flexible approach, hence was required. See: http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02503.html http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02580.html http
[Qemu-devel] [Bug 1338957] Re: RFE: add an event to report block devices watermark
patch posted on qemu-devel, reviewd, acked and merged into maintainer's branch: https://github.com/stefanha/qemu/commit/f050ea639522e9dd7e501ef285a2a12709b8726a -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1338957 Title: RFE: add an event to report block devices watermark Status in QEMU: New Bug description: Add an event to report if a block device usage exceeds a threshold. The threshold should be configurable with a monitor command. The event should report the affected block device. Additional useful information could be the offset of the highest sector , like in the query- blockstats output. Rationale for the RFE Managing applications, like oVirt (http://www.ovirt.org), make extensive use of thin-provisioned disk images. In order to let the guest run flawlessly and be not unnecessarily paused, oVirt sets a watermark and automatically resized the image once the watermark is reached or exceeded. In order to detect the mark crossing, the managing application has no choice than aggressively polling the QEMU monitor using the query-blockstats command. This lead to unnecessary system load, and is made even worse under scale: scenarios with hunderds of VM are becoming not unusual. To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1338957/+subscriptions
Re: [Qemu-devel] [PATCH v5] block: add event when disk usage exceeds threshold
- Original Message - From: Stefan Hajnoczi stefa...@redhat.com To: Francesco Romani from...@redhat.com Cc: qemu-devel@nongnu.org, kw...@redhat.com, lcapitul...@redhat.com, mdr...@linux.vnet.ibm.com, ebl...@redhat.com Sent: Tuesday, January 13, 2015 2:54:41 PM Subject: Re: [PATCH v5] block: add event when disk usage exceeds threshold Thanks, applied to my block branch: https://github.com/stefanha/qemu/tree/block Thank you for the review and the merge! Bests, -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani
Re: [Qemu-devel] [PATCH v4] block: add event when disk usage exceeds threshold
Hi, thanks for the review! - Original Message - From: Eric Blake ebl...@redhat.com To: Francesco Romani from...@redhat.com, qemu-devel@nongnu.org Cc: kw...@redhat.com, mdr...@linux.vnet.ibm.com, stefa...@redhat.com, lcapitul...@redhat.com Sent: Friday, January 9, 2015 5:54:40 PM Subject: Re: [Qemu-devel] [PATCH v4] block: add event when disk usage exceeds threshold On 12/04/2014 01:59 AM, Francesco Romani wrote: Managing applications, like oVirt (http://www.ovirt.org), make extensive use of thin-provisioned disk images. To let the guest run smoothly and be not unnecessarily paused, oVirt sets a disk usage threshold (so called 'high water mark') based on the occupation of the device, and automatically extends the image once the threshold is reached or exceeded. In order to detect the crossing of the threshold, oVirt has no choice but aggressively polling the QEMU monitor using the query-blockstats command. This lead to unnecessary system load, and is made even worse under scale: deployments with hundreds of VMs are no longer rare. To fix this, this patch adds: * A new monitor command to set a write threshold for a given block device. * A new event to report if a block device usage exceeds the threshold. Please also mention the names of those two things in the commit message, to make it easier to find them when doing 'git log' archaeology. Sure, will do. This will allow the managing application to use smarter and more efficient monitoring, greatly reducing the need of polling. A followup patch is planned to allow to add the write threshold at device creation. Signed-off-by: Francesco Romani from...@redhat.com --- --- /dev/null +++ b/block/write-threshold.c @@ -0,0 +1,125 @@ +/* + * QEMU System Emulator block write threshold notification + * + * Copyright Red Hat, Inc. 2014 I've been so slow on the review that you may want to add 2015. IANAL, but since most (~99%) of code was written in 2014, I'll just leave as that. +bool bdrv_write_threshold_is_set(const BlockDriverState *bs) +{ +return !!(bs-write_threshold_offset 0); The !! is spurious; use of already guarantees a bool result. Will remove. +++ b/qapi/block-core.json @@ -239,6 +239,9 @@ # # @iops_size: #optional an I/O size in bytes (Since 1.7) # +# @write_threshold: configured write threshold for the device. +# 0 if disabled. (Since 2.3) +# # Since: 0.14.0 # ## @@ -253,7 +256,7 @@ '*bps_max': 'int', '*bps_rd_max': 'int', '*bps_wr_max': 'int', '*iops_max': 'int', '*iops_rd_max': 'int', '*iops_wr_max': 'int', -'*iops_size': 'int' } } +'*iops_size': 'int', 'write_threshold': 'uint64' } } 'int' works as well as 'uint64'; since this is an output parameter, we aren't gaining any stricter input parsing by using a more-specific type. I found one case on which the usage 'int' vs 'uint64' made the code generator emit different code - and the one using uint64 was more correct. Can't recall if that is the case; I'll retry, and add a comment here to document the behaviour if I stumble on this again. My findings are minor, so I'm okay if you post a v5 that addresses them and includes: Reviewed-by: Eric Blake ebl...@redhat.com Yep, will post ASAP. Thanks and best regards. -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani
[Qemu-devel] [PATCH v5] block: add event when disk usage exceeds threshold
Managing applications, like oVirt (http://www.ovirt.org), make extensive use of thin-provisioned disk images. To let the guest run smoothly and be not unnecessarily paused, oVirt sets a disk usage threshold (so called 'high water mark') based on the occupation of the device, and automatically extends the image once the threshold is reached or exceeded. In order to detect the crossing of the threshold, oVirt has no choice but aggressively polling the QEMU monitor using the query-blockstats command. This lead to unnecessary system load, and is made even worse under scale: deployments with hundreds of VMs are no longer rare. To fix this, this patch adds: * A new monitor command `block-set-write-threshold', to set a mark for a given block device. * A new event `BLOCK_WRITE_THRESHOLD', to report if a block device usage exceeds the threshold. * A new `write_threshold' field into the `BlockDeviceInfo' structure, to report the configured threshold. This will allow the managing application to use smarter and more efficient monitoring, greatly reducing the need of polling. Signed-off-by: Francesco Romani from...@redhat.com Reviewed-by: Eric Blake ebl...@redhat.com --- block/Makefile.objs | 1 + block/qapi.c| 3 + block/write-threshold.c | 125 include/block/block_int.h | 4 ++ include/block/write-threshold.h | 64 qapi/block-core.json| 51 +++- qmp-commands.hx | 32 ++ tests/Makefile | 3 + tests/test-write-threshold.c| 119 ++ 9 files changed, 401 insertions(+), 1 deletion(-) create mode 100644 block/write-threshold.c create mode 100644 include/block/write-threshold.h create mode 100644 tests/test-write-threshold.c diff --git a/block/Makefile.objs b/block/Makefile.objs index 04b0e43..010afad 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -20,6 +20,7 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o block-obj-$(CONFIG_LIBSSH2) += ssh.o block-obj-y += accounting.o +block-obj-y += write-threshold.o common-obj-y += stream.o common-obj-y += commit.o diff --git a/block/qapi.c b/block/qapi.c index fa68ba7..709b328 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -24,6 +24,7 @@ #include block/qapi.h #include block/block_int.h +#include block/write-threshold.h #include qmp-commands.h #include qapi-visit.h #include qapi/qmp-output-visitor.h @@ -89,6 +90,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs) info-iops_size = cfg.op_size; } +info-write_threshold = bdrv_write_threshold_get(bs); + return info; } diff --git a/block/write-threshold.c b/block/write-threshold.c new file mode 100644 index 000..c2cd517 --- /dev/null +++ b/block/write-threshold.c @@ -0,0 +1,125 @@ +/* + * QEMU System Emulator block write threshold notification + * + * Copyright Red Hat, Inc. 2014 + * + * Authors: + * Francesco Romani from...@redhat.com + * + * This work is licensed under the terms of the GNU LGPL, version 2 or later. + * See the COPYING.LIB file in the top-level directory. + */ + +#include block/block_int.h +#include block/coroutine.h +#include block/write-threshold.h +#include qemu/notify.h +#include qapi-event.h +#include qmp-commands.h + + +uint64_t bdrv_write_threshold_get(const BlockDriverState *bs) +{ +return bs-write_threshold_offset; +} + +bool bdrv_write_threshold_is_set(const BlockDriverState *bs) +{ +return bs-write_threshold_offset 0; +} + +static void write_threshold_disable(BlockDriverState *bs) +{ +if (bdrv_write_threshold_is_set(bs)) { +notifier_with_return_remove(bs-write_threshold_notifier); +bs-write_threshold_offset = 0; +} +} + +uint64_t bdrv_write_threshold_exceeded(const BlockDriverState *bs, + const BdrvTrackedRequest *req) +{ +if (bdrv_write_threshold_is_set(bs)) { +if (req-offset bs-write_threshold_offset) { +return (req-offset - bs-write_threshold_offset) + req-bytes; +} +if ((req-offset + req-bytes) bs-write_threshold_offset) { +return (req-offset + req-bytes) - bs-write_threshold_offset; +} +} +return 0; +} + +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier, +void *opaque) +{ +BdrvTrackedRequest *req = opaque; +BlockDriverState *bs = req-bs; +uint64_t amount = 0; + +amount = bdrv_write_threshold_exceeded(bs, req); +if (amount 0) { +qapi_event_send_block_write_threshold( +bs-node_name, +amount, +bs-write_threshold_offset, +error_abort); + +/* autodisable to avoid flooding the monitor */ +write_threshold_disable(bs); +} + +return 0; /* should always
Re: [Qemu-devel] [PATCH v4] block: add event when disk usage exceeds threshold
ping - Original Message - From: Francesco Romani from...@redhat.com To: qemu-devel@nongnu.org Cc: kw...@redhat.com, mdr...@linux.vnet.ibm.com, stefa...@redhat.com, lcapitul...@redhat.com Sent: Monday, December 15, 2014 12:19:44 PM Subject: Re: [Qemu-devel] [PATCH v4] block: add event when disk usage exceeds threshold ping? - Original Message - From: Francesco Romani from...@redhat.com To: qemu-devel@nongnu.org Cc: kw...@redhat.com, mdr...@linux.vnet.ibm.com, stefa...@redhat.com, lcapitul...@redhat.com Sent: Thursday, December 4, 2014 10:00:18 AM Subject: Re: [Qemu-devel] [PATCH v4] block: add event when disk usage exceeds threshold - Original Message - From: Francesco Romani from...@redhat.com To: qemu-devel@nongnu.org Cc: kw...@redhat.com, stefa...@redhat.com, lcapitul...@redhat.com, mdr...@linux.vnet.ibm.com, ebl...@redhat.com, Francesco Romani from...@redhat.com Sent: Thursday, December 4, 2014 9:59:08 AM Subject: [PATCH v4] block: add event when disk usage exceeds threshold [...] Changes since v3: - addressed reviewers comments: - dropped redundant cover letter - improved commit message to reflect real intended use (polling is not going away completely) - spelling fixes - API consistency fixes - kept 'uint64' in QAPI because it is mapped to C type uint64_t, where 'int' is mapped to 'int64_t' - renamed for consistency everywhere: 'usage_threshold' = 'write_threshold' Thanks and best regards -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani
Re: [Qemu-devel] [PATCH v4] block: add event when disk usage exceeds threshold
ping? - Original Message - From: Francesco Romani from...@redhat.com To: qemu-devel@nongnu.org Cc: kw...@redhat.com, mdr...@linux.vnet.ibm.com, stefa...@redhat.com, lcapitul...@redhat.com Sent: Thursday, December 4, 2014 10:00:18 AM Subject: Re: [Qemu-devel] [PATCH v4] block: add event when disk usage exceeds threshold - Original Message - From: Francesco Romani from...@redhat.com To: qemu-devel@nongnu.org Cc: kw...@redhat.com, stefa...@redhat.com, lcapitul...@redhat.com, mdr...@linux.vnet.ibm.com, ebl...@redhat.com, Francesco Romani from...@redhat.com Sent: Thursday, December 4, 2014 9:59:08 AM Subject: [PATCH v4] block: add event when disk usage exceeds threshold [...] Changes since v3: - addressed reviewers comments: - dropped redundant cover letter - improved commit message to reflect real intended use (polling is not going away completely) - spelling fixes - API consistency fixes - kept 'uint64' in QAPI because it is mapped to C type uint64_t, where 'int' is mapped to 'int64_t' - renamed for consistency everywhere: 'usage_threshold' = 'write_threshold' Thanks and best regards -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani
[Qemu-devel] [PATCH v4] block: add event when disk usage exceeds threshold
Managing applications, like oVirt (http://www.ovirt.org), make extensive use of thin-provisioned disk images. To let the guest run smoothly and be not unnecessarily paused, oVirt sets a disk usage threshold (so called 'high water mark') based on the occupation of the device, and automatically extends the image once the threshold is reached or exceeded. In order to detect the crossing of the threshold, oVirt has no choice but aggressively polling the QEMU monitor using the query-blockstats command. This lead to unnecessary system load, and is made even worse under scale: deployments with hundreds of VMs are no longer rare. To fix this, this patch adds: * A new monitor command to set a write threshold for a given block device. * A new event to report if a block device usage exceeds the threshold. This will allow the managing application to use smarter and more efficient monitoring, greatly reducing the need of polling. A followup patch is planned to allow to add the write threshold at device creation. Signed-off-by: Francesco Romani from...@redhat.com --- block/Makefile.objs | 1 + block/qapi.c| 3 + block/write-threshold.c | 125 include/block/block_int.h | 4 ++ include/block/write-threshold.h | 64 qapi/block-core.json| 50 +++- qmp-commands.hx | 32 ++ tests/Makefile | 3 + tests/test-write-threshold.c| 119 ++ 9 files changed, 400 insertions(+), 1 deletion(-) create mode 100644 block/write-threshold.c create mode 100644 include/block/write-threshold.h create mode 100644 tests/test-write-threshold.c diff --git a/block/Makefile.objs b/block/Makefile.objs index 04b0e43..010afad 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -20,6 +20,7 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o block-obj-$(CONFIG_LIBSSH2) += ssh.o block-obj-y += accounting.o +block-obj-y += write-threshold.o common-obj-y += stream.o common-obj-y += commit.o diff --git a/block/qapi.c b/block/qapi.c index a87a34a..d38f7a6 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -24,6 +24,7 @@ #include block/qapi.h #include block/block_int.h +#include block/write-threshold.h #include qmp-commands.h #include qapi-visit.h #include qapi/qmp-output-visitor.h @@ -82,6 +83,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs) info-iops_size = cfg.op_size; } +info-write_threshold = bdrv_write_threshold_get(bs); + return info; } diff --git a/block/write-threshold.c b/block/write-threshold.c new file mode 100644 index 000..b7aa20e --- /dev/null +++ b/block/write-threshold.c @@ -0,0 +1,125 @@ +/* + * QEMU System Emulator block write threshold notification + * + * Copyright Red Hat, Inc. 2014 + * + * Authors: + * Francesco Romani from...@redhat.com + * + * This work is licensed under the terms of the GNU LGPL, version 2 or later. + * See the COPYING.LIB file in the top-level directory. + */ + +#include block/block_int.h +#include block/coroutine.h +#include block/write-threshold.h +#include qemu/notify.h +#include qapi-event.h +#include qmp-commands.h + + +uint64_t bdrv_write_threshold_get(const BlockDriverState *bs) +{ +return bs-write_threshold_offset; +} + +bool bdrv_write_threshold_is_set(const BlockDriverState *bs) +{ +return !!(bs-write_threshold_offset 0); +} + +static void write_threshold_disable(BlockDriverState *bs) +{ +if (bdrv_write_threshold_is_set(bs)) { +notifier_with_return_remove(bs-write_threshold_notifier); +bs-write_threshold_offset = 0; +} +} + +uint64_t bdrv_write_threshold_exceeded(const BlockDriverState *bs, + const BdrvTrackedRequest *req) +{ +if (bdrv_write_threshold_is_set(bs)) { +if (req-offset bs-write_threshold_offset) { +return (req-offset - bs-write_threshold_offset) + req-bytes; +} +if ((req-offset + req-bytes) bs-write_threshold_offset) { +return (req-offset + req-bytes) - bs-write_threshold_offset; +} +} +return 0; +} + +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier, +void *opaque) +{ +BdrvTrackedRequest *req = opaque; +BlockDriverState *bs = req-bs; +uint64_t amount = 0; + +amount = bdrv_write_threshold_exceeded(bs, req); +if (amount 0) { +qapi_event_send_block_write_threshold( +bs-node_name, +amount, +bs-write_threshold_offset, +error_abort); + +/* autodisable to avoid flooding the monitor */ +write_threshold_disable(bs); +} + +return 0; /* should always let other notifiers run */ +} + +static void write_threshold_register_notifier(BlockDriverState *bs
Re: [Qemu-devel] [PATCH v4] block: add event when disk usage exceeds threshold
- Original Message - From: Francesco Romani from...@redhat.com To: qemu-devel@nongnu.org Cc: kw...@redhat.com, stefa...@redhat.com, lcapitul...@redhat.com, mdr...@linux.vnet.ibm.com, ebl...@redhat.com, Francesco Romani from...@redhat.com Sent: Thursday, December 4, 2014 9:59:08 AM Subject: [PATCH v4] block: add event when disk usage exceeds threshold [...] Changes since v3: - addressed reviewers comments: - dropped redundant cover letter - improved commit message to reflect real intended use (polling is not going away completely) - spelling fixes - API consistency fixes - kept 'uint64' in QAPI because it is mapped to C type uint64_t, where 'int' is mapped to 'int64_t' - renamed for consistency everywhere: 'usage_threshold' = 'write_threshold' Thanks and best regards -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani
Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold
- Original Message - From: Francesco Romani from...@redhat.com To: Eric Blake ebl...@redhat.com Cc: kw...@redhat.com, lcapitul...@redhat.com, qemu-devel@nongnu.org, stefa...@redhat.com, mdr...@linux.vnet.ibm.com Sent: Tuesday, December 2, 2014 8:47:45 AM Subject: Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold Thanks for the quick review! Missing the change to the 'query-block' and 'query-named-block-nodes' examples to show the new always-present output field. Sorry, I missed the *examples* keyword. Will fix. -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani
Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold
- Original Message - From: Stefan Hajnoczi stefa...@redhat.com To: Francesco Romani from...@redhat.com Cc: kw...@redhat.com, mdr...@linux.vnet.ibm.com, qemu-devel@nongnu.org, lcapitul...@redhat.com Sent: Tuesday, December 2, 2014 4:01:17 PM Subject: Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold On Fri, Nov 28, 2014 at 01:31:07PM +0100, Francesco Romani wrote: @@ -82,6 +83,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs) info-iops_size = cfg.op_size; } +info-write_threshold = bdrv_usage_threshold_get(bs); Overall looks good but I notice that write_threshold and usage_threshold are both used. Please use just one consistently (I think write_threshold is clearer). Agreed. Will use write_threshold everywhere, file names included. Bests, -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani
Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold
Thanks for the quick review! - Original Message - From: Eric Blake ebl...@redhat.com To: Francesco Romani from...@redhat.com, qemu-devel@nongnu.org Cc: kw...@redhat.com, mdr...@linux.vnet.ibm.com, stefa...@redhat.com, lcapitul...@redhat.com Sent: Monday, December 1, 2014 10:07:38 PM Subject: Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold On 11/28/2014 05:31 AM, Francesco Romani wrote: Managing applications, like oVirt (http://www.ovirt.org), make extensive use of thin-provisioned disk images. To let the guest run smoothly and be not unnecessarily paused, oVirt sets a disk usage threshold (so called 'high water mark') based on the occupation of the device, and automatically extends the image once the threshold is reached or exceeded. In order to detect the crossing of the threshold, oVirt has no choice but aggressively polling the QEMU monitor using the query-blockstats command. This lead to unnecessary system load, and is made even worse under scale: deployments with hundreds of VMs are no longer rare. To fix this, this patch adds: * A new monitor command to set a mark for a given block device. * A new event to report if a block device usage exceeds the threshold. This will allow the managing application to drop the polling altogether and just wait for a watermark crossing event. I like the idea! Question - what happens if management misses the event (for example, if libvirtd is restarted)? Does the existing 'query-blockstats' and/or 'query-named-block-nodes' still work to query the current threshold and whether it has been exceeded, as a poll-once command executed when reconnecting to the monitor? Indeed oVirt wants to keep the existing polling and to use it as fallback, to make sure no events are missed. oVirt will not rely on the new notification *alone*. The plan is to just poll *much less* frequently. Today's default poll rate is every 2 (two) seconds, so there is a lot of room for improvement. Signed-off-by: Francesco Romani from...@redhat.com --- No need for a 0/1 cover letter on a 1-patch series; you have the option of just putting the side-band information here and sending it as a single mail. But the cover letter approach doesn't hurt either, and I can see how it can be easier for some workflows to always send a cover letter than to special-case a 1-patch series. Also, I found that a separate context/introduction/room for comments would helped, especially on first two revisions of the patch which were tagged as RFC. I have zero problems in dropping the cover letter now that consensus is forming and patch is taking shape, just let me know. [... snip spelling: thanks! will fix] +++ b/qapi/block-core.json @@ -239,6 +239,9 @@ # # @iops_size: #optional an I/O size in bytes (Since 1.7) # +# @write-threshold: configured write threshold for the device. +# 0 if disabled. (Since 2.3) +# # Since: 0.14.0 # ## @@ -253,7 +256,7 @@ '*bps_max': 'int', '*bps_rd_max': 'int', '*bps_wr_max': 'int', '*iops_max': 'int', '*iops_rd_max': 'int', '*iops_wr_max': 'int', -'*iops_size': 'int' } } +'*iops_size': 'int', 'write-threshold': 'uint64' } } In QMP specs, 'uint64' and 'int' are practically synonyms. I can live with either spelling, although 'int' is more common. Bikeshed on naming: Although we prefer '-' over '_' in new interfaces, we also favor consistency, and BlockDeviceInfo is one of those dinosaur commands that uses _ everywhere until your addition. So naming this field 'write_threshold' would be more consistent. Agreed. Will fix. +# +# @node-name: graph node name on which the threshold was exceeded. +# +# @amount-exceeded: amount of data which exceeded the threshold, in bytes. +# +# @offset-threshold: last configured threshold, in bytes. +# Might want to mention that this event is one-shot; after it triggers, a user must re-register a threshold to get the event again. Good point. Will fix. +# Since: 2.3 +## +{ 'event': 'BLOCK_USAGE_THRESHOLD', + 'data': { 'node-name': 'str', + 'amount-exceeded': 'uint64', TAB damage. Please use spaces. ./scripts/checkpatch.pl will catch some offenders (although I didn't test if it will catch this one). However, here you are correct in using '-' for naming :) Oops. Rebase error (was clean before!) will fix. + 'threshold': 'uint64' } } + +## +# @block-set-threshold +# +# Change usage threshold for a block drive. An event will be delivered +# if a write to this block drive crosses the configured threshold. +# This is useful to transparently resize thin-provisioned drives without +# the guest OS noticing. +# +# @node-name: graph node name on which the threshold must be set. +# +# @write-threshold: configured threshold for the block device
[Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold
Managing applications, like oVirt (http://www.ovirt.org), make extensive use of thin-provisioned disk images. To let the guest run smoothly and be not unnecessarily paused, oVirt sets a disk usage threshold (so called 'high water mark') based on the occupation of the device, and automatically extends the image once the threshold is reached or exceeded. In order to detect the crossing of the threshold, oVirt has no choice but aggressively polling the QEMU monitor using the query-blockstats command. This lead to unnecessary system load, and is made even worse under scale: deployments with hundreds of VMs are no longer rare. To fix this, this patch adds: * A new monitor command to set a mark for a given block device. * A new event to report if a block device usage exceeds the threshold. This will allow the managing application to drop the polling altogether and just wait for a watermark crossing event. Signed-off-by: Francesco Romani from...@redhat.com --- block/Makefile.objs | 1 + block/qapi.c| 3 + block/usage-threshold.c | 124 include/block/block_int.h | 4 ++ include/block/usage-threshold.h | 62 qapi/block-core.json| 48 +++- qmp-commands.hx | 28 + tests/Makefile | 3 + tests/test-usage-threshold.c| 101 9 files changed, 373 insertions(+), 1 deletion(-) create mode 100644 block/usage-threshold.c create mode 100644 include/block/usage-threshold.h create mode 100644 tests/test-usage-threshold.c diff --git a/block/Makefile.objs b/block/Makefile.objs index 04b0e43..43e381d 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -20,6 +20,7 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o block-obj-$(CONFIG_LIBSSH2) += ssh.o block-obj-y += accounting.o +block-obj-y += usage-threshold.o common-obj-y += stream.o common-obj-y += commit.o diff --git a/block/qapi.c b/block/qapi.c index a87a34a..c85abb0 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -24,6 +24,7 @@ #include block/qapi.h #include block/block_int.h +#include block/usage-threshold.h #include qmp-commands.h #include qapi-visit.h #include qapi/qmp-output-visitor.h @@ -82,6 +83,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs) info-iops_size = cfg.op_size; } +info-write_threshold = bdrv_usage_threshold_get(bs); + return info; } diff --git a/block/usage-threshold.c b/block/usage-threshold.c new file mode 100644 index 000..9faf5e6 --- /dev/null +++ b/block/usage-threshold.c @@ -0,0 +1,124 @@ +/* + * QEMU System Emulator block usage threshold notification + * + * Copyright Red Hat, Inc. 2014 + * + * Authors: + * Francesco Romani from...@redhat.com + * + * This work is licensed under the terms of the GNU LGPL, version 2 or later. + * See the COPYING.LIB file in the top-level directory. + */ + +#include block/block_int.h +#include block/coroutine.h +#include block/usage-threshold.h +#include qemu/notify.h +#include qapi-event.h +#include qmp-commands.h + + +uint64_t bdrv_usage_threshold_get(const BlockDriverState *bs) +{ +return bs-write_threshold_offset; +} + +bool bdrv_usage_threshold_is_set(const BlockDriverState *bs) +{ +return !!(bs-write_threshold_offset 0); +} + +static void usage_threshold_disable(BlockDriverState *bs) +{ +if (bdrv_usage_threshold_is_set(bs)) { +notifier_with_return_remove(bs-write_threshold_notifier); +bs-write_threshold_offset = 0; +} +} + +uint64_t bdrv_usage_threshold_exceeded(const BlockDriverState *bs, + const BdrvTrackedRequest *req) +{ +if (bdrv_usage_threshold_is_set(bs)) { +if (req-offset bs-write_threshold_offset) { +return (req-offset - bs-write_threshold_offset) + req-bytes; +} +if ((req-offset + req-bytes) bs-write_threshold_offset) { +return (req-offset + req-bytes) - bs-write_threshold_offset; +} +} +return 0; +} + +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier, +void *opaque) +{ +BdrvTrackedRequest *req = opaque; +BlockDriverState *bs = req-bs; +uint64_t amount = 0; + +amount = bdrv_usage_threshold_exceeded(bs, req); +if (amount 0) { +qapi_event_send_block_usage_threshold( +bs-node_name, +amount, +bs-write_threshold_offset, +error_abort); + +/* autodisable to avoid to flood the monitor */ +usage_threshold_disable(bs); +} + +return 0; /* should always let other notifiers run */ +} + +static void usage_threshold_register_notifier(BlockDriverState *bs) +{ +bs-write_threshold_notifier.notify = before_write_notify; +notifier_with_return_list_add(bs
[Qemu-devel] [PATCH v3] add write threshold reporting for block devices
v2 was: with RFC. Since last review round I dropped the tag because I think now all the open points are addressed. v1 was: add watermark reporting for block devices, but watermark is incorrectly unused. Hence the change in subject. Changes since v2: - addressed reviewers comments: - use node name to find the block driver state to be checked - report node name in the event notification - fixed signed vs unsigned integer: use uint64 everywhere, to deal with integer overflows (more) gracefully. - fixed pending style issues - renamed and made public a few functions to make them testable - add very basic initial unit tests Changes since v1: - addressed reviewers comments. Highligths: - fixed terminology: watermark - usage threshold - threshold is expressed in bytes - make the event triggers only once when threshold crossed - configured threshold visible in 'query-block' output - fixed bugs Open issues: Not all node names show up in the 'query-block' output, but I'll start a different thread to discuss this. Followup: - Patches I'll have on my queue and I'd like to post as followup - more some unit testing. - add support to set the threshold at device creation Rationale and context from v1 - I'm one of the oVirt developers (http://www.ovirt.org); oVirt is a virtualization management application built around qemu/kvm, so it is nice to get in touch :) We have begun a big scalability improvement effort, aiming to support without problems hundreds of VMs per host, with plans to support thousands in a not so distant future. In doing so, we are reviewing our usage flows. One of them is thin-provisioned storage, which is used quite extensively, with block devices (ISCSI for example) and COW images. When using thin provisioning, oVirt tries hard to hide this fact from the guest OS, and to do so watches closely the usage of the device, and resize it when its usage exceeds a configured threshold (the high water mark), in order to avoid the guest OS to get paused for space exhausted. To do the watching, we poll he devices using libvirt [1], which in turn uses query-blockstats. This is suboptimal with just one VM, but with hundereds of them, let alone thousands, it doesn't scale and it is quite a resource hog. Would be great to have this watermark concept supported into qemu, with a new event to be raised when the limit is crossed. To track this RFE I filed https://bugs.launchpad.net/qemu/+bug/1338957 Moreover, I had the chance to take a look at the QEMU sources and come up with this tentative patch which I'd also like to submit. Notes - [0]: https://lists.gnu.org/archive/html/qemu-devel/2014-07/msg01348.html [1]: http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=ebb0c19c48690f0598de954f8e0e9d4d29d48b85 Francesco Romani (1): block: add event when disk usage exceeds threshold block/Makefile.objs | 1 + block/qapi.c| 3 + block/usage-threshold.c | 124 include/block/block_int.h | 4 ++ include/block/usage-threshold.h | 62 qapi/block-core.json| 48 +++- qmp-commands.hx | 28 + tests/Makefile | 3 + tests/test-usage-threshold.c| 101 9 files changed, 373 insertions(+), 1 deletion(-) create mode 100644 block/usage-threshold.c create mode 100644 include/block/usage-threshold.h create mode 100644 tests/test-usage-threshold.c -- 1.9.3
Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
- Original Message - From: Kevin Wolf kw...@redhat.com To: Stefan Hajnoczi stefa...@redhat.com Cc: mdr...@linux.vnet.ibm.com, Stefan Hajnoczi stefa...@gmail.com, lcapitul...@redhat.com, qemu-devel@nongnu.org, Francesco Romani from...@redhat.com Sent: Thursday, November 20, 2014 12:34:28 PM Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices One way to solve this is to require that the management tool tells QEMU which exact BlockDriverState node the threshold applies to. Then QEMU doesn't need any hardcoded policy. But I'm not sure how realistic that it at the moment (whether management tools are uses node names for each node yet), so it may be best to hardcode the bs-file traversal that I've suggested. Kevin: Do you agree? I have a feeling that we would regret this in the long run because it would allow only one special case of a general problem (watching a BDS). This means that we'll get inconsistent APIs. We're only talking about an optimisation here, even though a very useful one, so I wouldn't easily make compromises here. We should probably insist on using the node-name. Management tools need new code anyway to make use of the new functionality, so they can implement node-name support as well while they're at it. Using node-name is the best thing to do. My concern is just whether libvirt and other management tools are actually using node-name yet. I don't think so. They also don't use blockdev-add yet. But that's not a reason for us to add hacks that allow libvirt and other management tools to avoid the proper APIs even in the future. They just need to add support for node-names if they want to use new qemu features. New features require support for new infrastructure, I think that's fair. If they feel that representing complete BDS graphs in their code is too much work for now, they can still keep temporary hacks with hardcoded assumptions in their management code (like setting file.node-name and ignoring other setups). At least it would be temporary hacks there; if we did them in qemu, they would be a permanent API. I'm fine to use node_name in my patch, it looks even much simpler and cleaner I'd love to take this chance and learn more about the topic, becuse I'm very near to the border of my knowledge in that area. I joined the discussion quite later, so my sources are actually pretty sparse. Mostly: - staring at the sources and git history - googling for specific bits - presentations like http://www.linux-kvm.org/wiki/images/3/34/Kvm-forum-2013-block-dev-configuration.pdf There are some sources I'm missing? Hopefully a nice wiki page I somehow lost :) A couple of specific questions more, mostly to make sure I can do meaningful tests for my next submission: 1. I'm running a simple test using the attached script - which is a qemu command line adapted from libvirt ouput driven by oVirt. There is a way to attach a name at this stage, using a QMP command? 2. (related to the former) it seems from a not-so-deep look that the blessed (only?) way to set a proper node_name is using blockdev-add. If so, I'm not sure I follow how the qemu boot flow would look like. It will not be anymore as simple as crafting a command line and run the qemu, right? IIUC some interaction with QMP will be needed (sorry for asking silly question, trying to fill gaps in my knowledge). Thanks for the great feedback! -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani qemu.sh Description: application/shellscript
Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
- Original Message - From: Kevin Wolf kw...@redhat.com To: Francesco Romani from...@redhat.com Cc: qemu-devel@nongnu.org, Stefan Hajnoczi stefa...@gmail.com, mdr...@linux.vnet.ibm.com, Luiz Capitulino lcapitul...@redhat.com, Stefan Hajnoczi stefa...@redhat.com Sent: Friday, November 21, 2014 11:11:26 AM Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices [...] 1. I'm running a simple test using the attached script - which is a qemu command line adapted from libvirt ouput driven by oVirt. There is a way to attach a name at this stage, using a QMP command? No, node-name is assigned at the BlockDriverState (BDS) creation and can't be changed later on. Makes sense to me. 2. (related to the former) it seems from a not-so-deep look that the blessed (only?) way to set a proper node_name is using blockdev-add. If so, I'm not sure I follow how the qemu boot flow would look like. It will not be anymore as simple as crafting a command line and run the qemu, right? IIUC some interaction with QMP will be needed (sorry for asking silly question, trying to fill gaps in my knowledge). -drive on the command line can do everything that blockdev-add can do. So let's assume you have a qcow2 image on a filesystem. Then you end up with two BDSes, one for the format driver and one for accessing the filesystem: BlockBackend (virtual device) - qcow2 BDS - file BDS (raw-posix.c) For assigning a node name to the qcow2 BDS, you simply specify it in the obvious way: -drive file=test.qcow2,node-name=foo Now if you want to assign a node name to the file BDS as well, you would get nested dicts in the blockdev-add call. In -drive a dot syntax is used to represent this: -drive file=test.qcow2,node-name=foo,file.node-name=bar Are things a bit clearer with this? Yes, thanks a lot. I was a bit misleaded by the lack of the reference (after a very quick look) in the man page. Maybe the manpage is out of date, but this is a different story -and maybe a different patch :) New revision will come in a few days. Bests, -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani
Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
- Original Message - From: Stefan Hajnoczi stefa...@redhat.com To: Francesco Romani from...@redhat.com Cc: kw...@redhat.com, Stefan Hajnoczi stefa...@gmail.com, mdr...@linux.vnet.ibm.com, qemu-devel@nongnu.org, lcapitul...@redhat.com Sent: Wednesday, November 19, 2014 4:52:51 PM Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices On Tue, Nov 18, 2014 at 03:12:12AM -0500, Francesco Romani wrote: +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier, +void *opaque) +{ +BdrvTrackedRequest *req = opaque; +BlockDriverState *bs = req-bs; +int64_t amount = 0; + +assert((req-offset (BDRV_SECTOR_SIZE - 1)) == 0); +assert((req-bytes (BDRV_SECTOR_SIZE - 1)) == 0); Does the code still make these assumptions or are the asserts left over from previous versions of the patch? It's a leftover. I understood they don't hurt and add a bit of safety, but if they are confusing I'll remove them. Yes, it made me wonder why. Probably best to remove them. Will do [...] At risk of being redundant, in oVirt the usecase is QCOW2 over lvm block device, and we'd like to be notified about the allocation of the lvm block device, which (IIUC) is the last bs-file. This is a simple topology (unless I'm missing something), and that's the reason why I go down just one level. Of course I want a general solution for this change, so... There is a block driver for error injection called blkdebug (see docs/blkdebug.txt). Here is an example of the following topology: raw_bsd (drive0) - blkdebug - raw-posix (test.img) qemu-system-x86_64 -drive if=virtio,format=raw,file.driver=blkdebug,file.image.filename=test.img The blkdebug driver is interposing between the raw_bsd (drive0) root and the raw-posix leaf node. Thanks, I'll have a look [...] The management tool should not need to inspect the graph because the graph can change at runtime (e.g. imagine I/O throttling is implemented as a BlockDriverState node then it could appear/disapper when the feature is activated/deactivated). Instead the management tool should name the nodes it knows about and then use those node names. Agreed - and indeed simpler for us (oVirt), which it doesn't hurt :) If we descend the bs-file chain, AFAIU the easiest mapping, being the device name, is easily lost because only the outermost BlockDriverState has a device name attached, so when the notification trigger bdrv_get_device_name() will return NULL In the worst case a name string can be passed in along with the threshold values. OK, I guess to keep a copy of the string with g_strdup() could be good enough start, at least for further discussion. Thanks for your review and for the informations, I'll submit a new revision of the patch in a couple of days, to give to other reviewers some time to jump in. Bests, -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani
Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
- Original Message - From: Stefan Hajnoczi stefa...@gmail.com To: Francesco Romani from...@redhat.com Cc: kw...@redhat.com, lcapitul...@redhat.com, qemu-devel@nongnu.org, stefa...@redhat.com, mdr...@linux.vnet.ibm.com Sent: Monday, November 17, 2014 5:49:36 PM Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote: Sorry for the long review delay. Looks pretty good, just one real issue to think about at the bottom. Hi Stefan, thanks for the review and no problem for the delay :) +static void usage_threshold_disable(BlockDriverState *bs) +{ It would be safest to make this idempotent: if (!usage_threshold_is_set(bs)) { return; } That way it can be called multiple times. Will change. +notifier_with_return_remove(bs-wr_usage_threshold_notifier); +bs-wr_offset_threshold = 0; +} + +static int usage_threshold_is_set(const BlockDriverState *bs) +{ +return !!(bs-wr_offset_threshold 0); +} Please use the bool type instead of an int return value. Sure, will fix. +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier, +void *opaque) +{ +BdrvTrackedRequest *req = opaque; +BlockDriverState *bs = req-bs; +int64_t amount = 0; + +assert((req-offset (BDRV_SECTOR_SIZE - 1)) == 0); +assert((req-bytes (BDRV_SECTOR_SIZE - 1)) == 0); Does the code still make these assumptions or are the asserts left over from previous versions of the patch? It's a leftover. I understood they don't hurt and add a bit of safety, but if they are confusing I'll remove them. +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes) +{ +BlockDriverState *target_bs = bs; +if (bs-file) { +target_bs = bs-file; +} Hmm...I think now I understand why you are trying to use bs-file. This is an attempt to make image formats work with the threshold. Unfortunately the BlockDriverState topology can be more complicated than just 1 level. I thought so but I can't reproduce yet more complex topologies. Disclosure: I'm testing against the topology I know to be used on oVirt, lacking immediate availability of others: suggestions welcome. At risk of being redundant, in oVirt the usecase is QCOW2 over lvm block device, and we'd like to be notified about the allocation of the lvm block device, which (IIUC) is the last bs-file. This is a simple topology (unless I'm missing something), and that's the reason why I go down just one level. Of course I want a general solution for this change, so... If we hardcode a strategy to traverse bs-file then it will work in most cases: while (bs-file) { bs = bs-file; } But there are cases like VMDK extent files where a BlockDriverState actually has multiple children. One way to solve this is to require that the management tool tells QEMU which exact BlockDriverState node the threshold applies to. Then QEMU doesn't need any hardcoded policy. But I'm not sure how realistic that it at the moment (whether management tools are uses node names for each node yet), so it may be best to hardcode the bs-file traversal that I've suggested. oVirt relies on libvirt[1], and uses device node (e.g. 'vda'). BTW, I haven't found a way to inspect from the outside (e.g. monitor command) the BlockDriverState topology, there is a way I'm missing? Another issue I don't yet have a proper solution is related to this one. The management app will have deal with VM with more than one block device disk, so we need a way to map the notification with the corresponding device. If we descend the bs-file chain, AFAIU the easiest mapping, being the device name, is easily lost because only the outermost BlockDriverState has a device name attached, so when the notification trigger bdrv_get_device_name() will return NULL I believe we don't necessarily need a device name in the notification, for example something like an opaque cookie set together with the threshold and returned back with the notification will suffice. Of course the device name is nicer :) +++ [1] if that can help further to understand the usecase, these are the libvirt APIs being used by oVirt: http://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockStatsFlags http://libvirt.org/html/libvirt-libvirt-domain.html#virDomainGetBlockInfo both relies on the output[2] of 'query-blockstats' monitor command. [2] AFAIU -but this is just my guesswork!- it also assumes a quite simple topology like I did Thanks and bests, -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani
[Qemu-devel] [RFC][PATCH v2] add write threshold reporting for block devices
v1 was: add watermark reporting for block devices, but watermark is incorrectly unused. Hence the change in subject. Sorry for long pause from v1 [0]; Only recently I was able to sort out all the missing details. Context for this RFC/patch is presented below as remider. Why RFC? See Known issues below. Changes since v1: - addressed reviewers comments. Highligths: - fixed terminology: watermark - usage threshold - threshold is expressed in bytes - make the event triggers only once when threshold crossed - configured threshold visible in 'query-block' output - fixed bugs Known Issues The threshold we (oVirt) care about is actually the one in the parent device, not in the device itself. See Rationale and context from v1 below for details. The use case is qcow2 on a thin-provisioned block device. So the notification handler must be set in the inner BlockDriverState, like this (pseudocode, no error checking, see patch for details) /* ... */ BlockDriverState *bs = bdrv_find(drive-virtio-disk0); /* ... */ bdrv_set_usage_threshold(bs, threshold_bytes); /* ... */ void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes) { BlockDriverState *target_bs = bs; if (bs-file) { target_bs = bs-file; /* we care about highest allocation of this! */ } bs-wr_usage_threshold_notifier.notify = usage_threshold_before_write_notify; notifier_with_return_list_add(bs-before_write_notifiers, bs-wr_usage_threshold_notifier); } Problem: when the notifier triggers, I don't know how to get the device name, because a simple bdrv_get_device_name(bs); yields just Despite my efforts and some digging in the code, I don't know which is the clean and correct way to get the (parent) device name. Rationale and context from v1 - I'm one of the oVirt developers (http://www.ovirt.org); oVirt is a virtualization management application built around qemu/kvm, so it is nice to get in touch :) We have begun a big scalability improvement effort, aiming to support without problems hundreds of VMs per host, with plans to support thousands in a not so distant future. In doing so, we are reviewing our usage flows. One of them is thin-provisioned storage, which is used quite extensively, with block devices (ISCSI for example) and COW images. When using thin provisioning, oVirt tries hard to hide this fact from the guest OS, and to do so watches closely the usage of the device, and resize it when its usage exceeds a configured threshold (the high water mark), in order to avoid the guest OS to get paused for space exhausted. To do the watching, we poll he devices using libvirt [1], which in turn uses query-blockstats. This is suboptimal with just one VM, but with hundereds of them, let alone thousands, it doesn't scale and it is quite a resource hog. Would be great to have this watermark concept supported into qemu, with a new event to be raised when the limit is crossed. To track this RFE I filed https://bugs.launchpad.net/qemu/+bug/1338957 Moreover, I had the chance to take a look at the QEMU sources and come up with this tentative patch which I'd also like to submit. Notes - [0]: https://lists.gnu.org/archive/html/qemu-devel/2014-07/msg01348.html [1]: http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=ebb0c19c48690f0598de954f8e0e9d4d29d48b85 Francesco Romani (1): block: add event when disk usage exceeds threshold block/Makefile.objs | 1 + block/qapi.c| 3 + block/usage-threshold.c | 124 include/block/block_int.h | 4 ++ include/block/usage-threshold.h | 39 + qapi/block-core.json| 46 ++- qmp-commands.hx | 26 + 7 files changed, 242 insertions(+), 1 deletion(-) create mode 100644 block/usage-threshold.c create mode 100644 include/block/usage-threshold.h -- 1.9.3
[Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
Managing applications, like oVirt (http://www.ovirt.org), make extensive use of thin-provisioned disk images. To let the guest run smoothly and be not unnecessarily paused, oVirt sets a disk usage threshold (so called 'high water mark') based on the occupation of the device, and automatically extends the image once the threshold is reached or exceeded. In order to detect the crossing of the threshold, oVirt has no choice but aggressively polling the QEMU monitor using the query-blockstats command. This lead to unnecessary system load, and is made even worse under scale: deployments with hundreds of VMs are no longer rare. To fix this, this patch adds: * A new monitor command to set a mark for a given block device. * A new event to report if a block device usage exceeds the threshold. This will allow the managing application to drop the polling altogether and just wait for a watermark crossing event. Signed-off-by: Francesco Romani from...@redhat.com --- block/Makefile.objs | 1 + block/qapi.c| 3 + block/usage-threshold.c | 124 include/block/block_int.h | 4 ++ include/block/usage-threshold.h | 39 + qapi/block-core.json| 46 ++- qmp-commands.hx | 26 + 7 files changed, 242 insertions(+), 1 deletion(-) create mode 100644 block/usage-threshold.c create mode 100644 include/block/usage-threshold.h diff --git a/block/Makefile.objs b/block/Makefile.objs index 04b0e43..43e381d 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -20,6 +20,7 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o block-obj-$(CONFIG_LIBSSH2) += ssh.o block-obj-y += accounting.o +block-obj-y += usage-threshold.o common-obj-y += stream.o common-obj-y += commit.o diff --git a/block/qapi.c b/block/qapi.c index 1301144..3bb0bc7 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -24,6 +24,7 @@ #include block/qapi.h #include block/block_int.h +#include block/usage-threshold.h #include qmp-commands.h #include qapi-visit.h #include qapi/qmp-output-visitor.h @@ -315,6 +316,8 @@ static void bdrv_query_info(BlockBackend *blk, BlockInfo **p_info, } } +info-write_threshold = bdrv_get_usage_threshold(bs); + *p_info = info; return; diff --git a/block/usage-threshold.c b/block/usage-threshold.c new file mode 100644 index 000..31a587d --- /dev/null +++ b/block/usage-threshold.c @@ -0,0 +1,124 @@ +/* + * QEMU System Emulator block usage threshold notification + * + * Copyright Red Hat, Inc. 2014 + * + * Authors: + * Francesco Romani from...@redhat.com + * + * This work is licensed under the terms of the GNU LGPL, version 2 or later. + * See the COPYING.LIB file in the top-level directory. + */ + +#include block/block_int.h +#include block/coroutine.h +#include block/usage-threshold.h +#include qemu/notify.h +#include qapi-event.h +#include qmp-commands.h + + +int64_t bdrv_get_usage_threshold(const BlockDriverState *bs) +{ +if (bs == NULL) { +return 0; +} +if (bs-file) { +return bs-file-wr_offset_threshold; +} +return bs-wr_offset_threshold; +} + +static void usage_threshold_disable(BlockDriverState *bs) +{ +notifier_with_return_remove(bs-wr_usage_threshold_notifier); +bs-wr_offset_threshold = 0; +} + +static int usage_threshold_is_set(const BlockDriverState *bs) +{ +return !!(bs-wr_offset_threshold 0); +} + +static int64_t usage_threshold_exceeded(const BlockDriverState *bs, +const BdrvTrackedRequest *req) +{ +if (usage_threshold_is_set(bs)) { +int64_t amount = req-offset + req-bytes - bs-wr_offset_threshold; +if (amount 0) { +return amount; +} +} +return 0; +} + +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier, +void *opaque) +{ +BdrvTrackedRequest *req = opaque; +BlockDriverState *bs = req-bs; +int64_t amount = 0; + +assert((req-offset (BDRV_SECTOR_SIZE - 1)) == 0); +assert((req-bytes (BDRV_SECTOR_SIZE - 1)) == 0); + +amount = usage_threshold_exceeded(bs, req); +if (amount 0) { +qapi_event_send_block_usage_threshold( +bdrv_get_device_name(bs), /* FIXME: this does not work */ +amount, +bs-wr_offset_threshold, +error_abort); + +/* autodisable to avoid to flood the monitor */ +usage_threshold_disable(bs); +} + +return 0; /* should always let other notifiers run */ +} + +static void usage_threshold_register_notifier(BlockDriverState *bs) +{ +bs-wr_usage_threshold_notifier.notify = before_write_notify; +notifier_with_return_list_add(bs-before_write_notifiers, + bs-wr_usage_threshold_notifier); +} + +void bdrv_set_usage_threshold
Re: [Qemu-devel] [PATCH] block: add watermark event
- Original Message - From: Stefan Hajnoczi stefa...@redhat.com To: Kevin Wolf kw...@redhat.com Cc: mdr...@linux.vnet.ibm.com, Francesco Romani from...@redhat.com, qemu-devel@nongnu.org, lcapitul...@redhat.com Sent: Tuesday, August 5, 2014 3:08:46 PM Subject: Re: [Qemu-devel] [PATCH] block: add watermark event On Tue, Aug 05, 2014 at 10:47:57AM +0200, Kevin Wolf wrote: Am 01.08.2014 um 13:39 hat Stefan Hajnoczi geschrieben: On Tue, Jul 08, 2014 at 04:49:24PM +0200, Francesco Romani wrote: @@ -5813,3 +5815,57 @@ void bdrv_flush_io_queue(BlockDriverState *bs) bdrv_flush_io_queue(bs-file); } } + +static bool watermark_exceeded(BlockDriverState *bs, + int64_t sector_num, + int nb_sectors) +{ + +if (bs-wr_watermark_perc 0) { +int64_t watermark = (bs-total_sectors) / 100 * bs-wr_watermark_perc; bs-total_sectors should not be used directly. Have you considered making the watermark parameter take sector units instead of a percentage? I'm not sure whether a precentage makes sense because 25% of a 10GB image is 2.5 GB so a 75% watermark might be reasonable. 25% of a 1 TB image is 250 GB and that's probably not a reasonable watermark. So let the block-set-watermark caller pass an absolute sector number instead. It keeps things simple for both QEMU and thin provisioning manager. No sector numbers in external interfaces, please. These units of 512 bytes are completely arbitrary and don't make any sense. I hope to get rid of BDRV_SECTOR_* eventually even internally. So for external APIs, please use bytes instead. I agree and forgot about that. Please use bytes instead of sectors or a percentage. Thanks everyone for the great feedback received! I'll post asap a new patch addressing all the comments. I'll also change the name because 'watermark' may be misleading/wrong jargon :) (http://en.wikipedia.org/wiki/Watermark) Bests, -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani
[Qemu-devel] [Bug 1338957] [NEW] RFE: add an event to report block devices watermark
Public bug reported: Add an event to report if a block device usage exceeds a threshold. The threshold should be configurable with a monitor command. The event should report the affected block device. Additional useful information could be the offset of the highest sector , like in the query-blockstats output. Rationale for the RFE Managing applications, like oVirt (http://www.ovirt.org), make extensive use of thin-provisioned disk images. In order to let the guest run flawlessly and be not unnecessarily paused, oVirt sets a watermark and automatically resized the image once the watermark is reached or exceeded. In order to detect the mark crossing, the managing application has no choice than aggressively polling the QEMU monitor using the query-blockstats command. This lead to unnecessary system load, and is made even worse under scale: scenarios with hunderds of VM are becoming not unusual. ** Affects: qemu Importance: Undecided Status: New -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1338957 Title: RFE: add an event to report block devices watermark Status in QEMU: New Bug description: Add an event to report if a block device usage exceeds a threshold. The threshold should be configurable with a monitor command. The event should report the affected block device. Additional useful information could be the offset of the highest sector , like in the query- blockstats output. Rationale for the RFE Managing applications, like oVirt (http://www.ovirt.org), make extensive use of thin-provisioned disk images. In order to let the guest run flawlessly and be not unnecessarily paused, oVirt sets a watermark and automatically resized the image once the watermark is reached or exceeded. In order to detect the mark crossing, the managing application has no choice than aggressively polling the QEMU monitor using the query-blockstats command. This lead to unnecessary system load, and is made even worse under scale: scenarios with hunderds of VM are becoming not unusual. To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1338957/+subscriptions
[Qemu-devel] [PATCH] add watermark reporting for block devices
Hello everyone I'm one of the oVirt developers (http://www.ovirt.org); oVirt is a virtualization management application built around qemu/kvm, so it is nice to get in touch :) We have begun a big scalability improvement effort, aiming to support without problems hundreds of VMs per host, with plans to support thousands in a not so distant future. In doing so, we are reviewing our usage flows. One of them is thin-provisioned storage, which is used quite extensively, with block devices (ISCSI for example) and COW images. When using thin provisioning, oVirt tries hard to hide this fact from the guest OS, and to do so watches closely the usage of the device, and resize it when its usage exceeds a configured threshold (the high water mark), in order to avoid the guest OS to get paused for space exhausted. To do the watching, we poll he devices using libvirt (virDomainGetBlockInfo), which in turn uses query-blockstats. This is suboptimal with just one VM, but with hundereds of them, let alone thousands, it doesn't scale and it is quite a resource hog. Would be great to have this watermark concept supported into qemu, with a new event to be raised when the limit is crossed. To track this RFE I filed https://bugs.launchpad.net/qemu/+bug/1338957 Moreover, I had the chance to take a look at the QEMU sources and come up with this tentative patch which I'd also like to submit. Comments and thoughts very welcome! Thanks and best regards, Francesco Romani (1): block: add watermark event block.c | 56 +++ blockdev.c| 21 ++ include/block/block.h | 2 ++ include/block/block_int.h | 3 +++ qapi/block-core.json | 33 qmp-commands.hx | 24 6 files changed, 139 insertions(+) -- 1.9.3
[Qemu-devel] [PATCH] block: add watermark event
Managing applications, like oVirt (http://www.ovirt.org), make extensive use of thin-provisioned disk images. In order to let the guest run flawlessly and be not unnecessarily paused, oVirt sets a watermark based on the percentage occupation of the device against the advertised size, and automatically resizes the image once the watermark is reached or exceeded. In order to detect the mark crossing, the managing application has no choice than aggressively polling the QEMU monitor using the query-blockstats command. This lead to unnecessary system load, and is made even worse under scale: scenarios with hundreds of VM are becoming not unusual. To fix this, this patch adds: * A new monitor command to set a mark for a given block device. * A new event to report if a block device usage exceeds the threshold. This will allow the managing application to drop the polling alltogether and just wait for a watermark crossing event. Signed-off-by: Francesco Romani from...@redhat.com --- block.c | 56 +++ blockdev.c| 21 ++ include/block/block.h | 2 ++ include/block/block_int.h | 3 +++ qapi/block-core.json | 33 qmp-commands.hx | 24 6 files changed, 139 insertions(+) diff --git a/block.c b/block.c index 8800a6b..cf34b7f 100644 --- a/block.c +++ b/block.c @@ -1983,6 +1983,8 @@ static void bdrv_move_feature_fields(BlockDriverState *bs_dest, bs_dest-device_list = bs_src-device_list; memcpy(bs_dest-op_blockers, bs_src-op_blockers, sizeof(bs_dest-op_blockers)); + +bs_dest-wr_watermark_perc = bs_src-wr_watermark_perc; } /* @@ -5813,3 +5815,57 @@ void bdrv_flush_io_queue(BlockDriverState *bs) bdrv_flush_io_queue(bs-file); } } + +static bool watermark_exceeded(BlockDriverState *bs, + int64_t sector_num, + int nb_sectors) +{ + +if (bs-wr_watermark_perc 0) { +int64_t watermark = (bs-total_sectors) / 100 * bs-wr_watermark_perc; +if (sector_num = watermark) { +return true; +} +} +return false; +} + +static int coroutine_fn watermark_before_write_notify(NotifierWithReturn *notifier, + void *opaque) +{ +BdrvTrackedRequest *req = opaque; +int64_t sector_num = req-offset BDRV_SECTOR_BITS; +int nb_sectors = req-bytes BDRV_SECTOR_BITS; + +/* FIXME: needed? */ +assert((req-offset (BDRV_SECTOR_SIZE - 1)) == 0); +assert((req-bytes (BDRV_SECTOR_SIZE - 1)) == 0); + +if (watermark_exceeded(req-bs, sector_num, nb_sectors)) { +BlockDriverState *bs = req-bs; +qapi_event_send_block_watermark( +bdrv_get_device_name(bs), +sector_num, +bs-wr_highest_sector, +error_abort); +} + +return 0; /* should always let other notifiers run */ +} + +void bdrv_set_watermark_perc(BlockDriverState *bs, + int watermark_perc) +{ +NotifierWithReturn before_write = { +.notify = watermark_before_write_notify, +}; + +if (watermark_perc = 0) { +return; +} + +if (bs-wr_watermark_perc == 0) { +bdrv_add_before_write_notifier(bs, before_write); +} +bs-wr_watermark_perc = watermark_perc; +} diff --git a/blockdev.c b/blockdev.c index 48bd9a3..ede21d9 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2546,6 +2546,27 @@ BlockJobInfoList *qmp_query_block_jobs(Error **errp) return dummy.next; } +void qmp_block_set_watermark(const char *device, int64_t watermark, + Error **errp) +{ +BlockDriverState *bs; +AioContext *aio_context; + +bs = bdrv_find(device); +if (!bs) { +error_set(errp, QERR_DEVICE_NOT_FOUND, device); +return; +} + +aio_context = bdrv_get_aio_context(bs); +aio_context_acquire(aio_context); + +bdrv_set_watermark_perc(bs, watermark); + +aio_context_release(aio_context); +} + + QemuOptsList qemu_common_drive_opts = { .name = drive, .head = QTAILQ_HEAD_INITIALIZER(qemu_common_drive_opts.head), diff --git a/include/block/block.h b/include/block/block.h index 32d3676..ff92ef9 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -588,4 +588,6 @@ void bdrv_io_plug(BlockDriverState *bs); void bdrv_io_unplug(BlockDriverState *bs); void bdrv_flush_io_queue(BlockDriverState *bs); +void bdrv_set_watermark_perc(BlockDriverState *bs, int watermark_perc); + #endif diff --git a/include/block/block_int.h b/include/block/block_int.h index f6c3bef..666ea1d 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -393,6 +393,9 @@ struct BlockDriverState { /* The error object in use for blocking operations on backing_hd */ Error *backing_blocker; + +/* watermark limit for writes, percentage
[Qemu-devel] [RFC] add watermark reporting for block devices
Sorry, this is actually an RFC; patch was posted separately. - Original Message - From: Francesco Romani from...@redhat.com To: qemu-devel@nongnu.org Cc: kw...@redhat.com, stefa...@redhat.com, lcapitul...@redhat.com, mdr...@linux.vnet.ibm.com, Francesco Romani from...@redhat.com Sent: Tuesday, July 8, 2014 4:49:23 PM Subject: [PATCH] add watermark reporting for block devices Hello everyone I'm one of the oVirt developers (http://www.ovirt.org); oVirt is a virtualization management application built around qemu/kvm, so it is nice to get in touch :) We have begun a big scalability improvement effort, aiming to support without problems hundreds of VMs per host, with plans to support thousands in a not so distant future. In doing so, we are reviewing our usage flows. One of them is thin-provisioned storage, which is used quite extensively, with block devices (ISCSI for example) and COW images. When using thin provisioning, oVirt tries hard to hide this fact from the guest OS, and to do so watches closely the usage of the device, and resize it when its usage exceeds a configured threshold (the high water mark), in order to avoid the guest OS to get paused for space exhausted. To do the watching, we poll he devices using libvirt (virDomainGetBlockInfo), which in turn uses query-blockstats. This is suboptimal with just one VM, but with hundereds of them, let alone thousands, it doesn't scale and it is quite a resource hog. Would be great to have this watermark concept supported into qemu, with a new event to be raised when the limit is crossed. To track this RFE I filed https://bugs.launchpad.net/qemu/+bug/1338957 Moreover, I had the chance to take a look at the QEMU sources and come up with this tentative patch which I'd also like to submit. Comments and thoughts very welcome! Thanks and best regards, Francesco Romani (1): block: add watermark event block.c | 56 +++ blockdev.c| 21 ++ include/block/block.h | 2 ++ include/block/block_int.h | 3 +++ qapi/block-core.json | 33 qmp-commands.hx | 24 6 files changed, 139 insertions(+) -- 1.9.3 -- Francesco Romani RedHat Engineering Virtualization R D Phone: 8261328 IRC: fromani
[Qemu-devel] [Bug 1284090] [NEW] RFE: QMP: report error reason in BLOCK_IO_ERROR message
Public bug reported: when a disk drive is configured with the error policy enospc for write errors, the monitoring client needs a way to distinguish betwwen generic I/O error and the I/O error for space exausted. The JSON QMP protocol lacks this information: the BLOCK_IO_ERROR message does not provide a reason or code for the error verified, so the monitoring client cannot distinguish the source of the errors. verified against git 105a060188dc6fdd4551571a966514d1a5f6815a ** Affects: qemu Importance: Undecided Status: New -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1284090 Title: RFE: QMP: report error reason in BLOCK_IO_ERROR message Status in QEMU: New Bug description: when a disk drive is configured with the error policy enospc for write errors, the monitoring client needs a way to distinguish betwwen generic I/O error and the I/O error for space exausted. The JSON QMP protocol lacks this information: the BLOCK_IO_ERROR message does not provide a reason or code for the error verified, so the monitoring client cannot distinguish the source of the errors. verified against git 105a060188dc6fdd4551571a966514d1a5f6815a To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1284090/+subscriptions