Re: [Qemu-devel] [libvirt] RFC: exposing qemu's block-set-write-threshold

2015-05-22 Thread Francesco Romani
- Original Message -
 From: Eric Blake ebl...@redhat.com
 To: Francesco Romani from...@redhat.com
 Cc: libvir-l...@redhat.com, Nir Soffer nsof...@redhat.com, Peter Krempa 
 pkre...@redhat.com,
 qemu-devel@nongnu.org
 Sent: Friday, May 22, 2015 6:33:01 AM
 Subject: Re: [libvirt] RFC: exposing qemu's block-set-write-threshold
 
 [adding qemu]
 

  I read the thread and I'm pretty sure this will be a silly question, but I
  want
  to make sure I am on the same page and I'm not somehow confused by the
  terminology.
  
  Let's consider the simplest of the situation we face in oVirt:
  
  (thin provisioned qcow2 disk on LV)
  
  vda=[format=qcow2] - lv=[path=/dev/mapper/$UUID]
  
  Isn't the LV here the 'backing file' (actually, backing block device) of
  the disk?
 
 Restating what you wrote into libvirt terminology, I think this means

 that you have a disk where:
 driver is qcow2
 source is a local file name
 device names vda
 backingStore index='1' describes the backing LV:
   driver is also qcow2 (as polling allocation growth in order to
 resize on demand only makes sense for qcow2 format)
   source is /dev/mapper/$UUID


Yes, exactly my point. I just want to be 100% sure that the three (slightly) 
different
parlances of the three groups (oVirt/libvirt/QEMU) are aligned on the same 
meaning,
and that we're not getting anything lost in translation

 that you have a disk where:
 driver is qcow2
 source is a local file name
 device names vda
 backingStore index='1' describes the backing LV:
   driver is also qcow2 (as polling allocation growth in order to
 resize on demand only makes sense for qcow2 format)
   source is /dev/mapper/$UUID

For the final confirmation, here's the actual XML we produce:

disk device=disk snapshot=no type=block
  address bus=0x00 domain=0x function=0x0 slot=0x05 type=pci/
  source 
dev=/rhev/data-center/0002-0002-0002-0002-014b/12f68692-2a5a-4e48-af5e-4679bca7fd44/images/ee1295ee-7ddc-4030-be5e-4557538bc4d2/05a88a94-5bd6-4698-be69-39e78c84e1a5/
  target bus=virtio dev=vda/
  serialee1295ee-7ddc-4030-be5e-4557538bc4d2/serial
  boot order=1/
  driver cache=none error_policy=stop io=native name=qemu 
type=qcow2/
/disk

For the sake of completeness:

$ ls -lh 
/rhev/data-center/0002-0002-0002-0002-014b/12f68692-2a5a-4e48-af5e-4679bca7fd44/images/ee1295ee-7ddc-4030-be5e-4557538bc4d2/05a88a94-5bd6-4698-be69-39e78c84e1a5
 
lrwxrwxrwx. 1 vdsm kvm 78 May 22 08:49 
/rhev/data-center/0002-0002-0002-0002-014b/12f68692-2a5a-4e48-af5e-4679bca7fd44/images/ee1295ee-7ddc-4030-be5e-4557538bc4d2/05a88a94-5bd6-4698-be69-39e78c84e1a5
 - 
/dev/12f68692-2a5a-4e48-af5e-4679bca7fd44/05a88a94-5bd6-4698-be69-39e78c84e1a5

$ ls -lh /dev/12f68692-2a5a-4e48-af5e-4679bca7fd44/
total 0
lrwxrwxrwx. 1 root root 8 May 22 08:49 05a88a94-5bd6-4698-be69-39e78c84e1a5 - 
../dm-11
lrwxrwxrwx. 1 root root 8 May 22 08:49 54673e6d-207d-4a66-8f0d-3f5b3cda78e5 - 
../dm-12
lrwxrwxrwx. 1 root root 9 May 22 08:49 ids - ../dm-606
lrwxrwxrwx. 1 root root 9 May 22 08:49 inbox - ../dm-607
lrwxrwxrwx. 1 root root 9 May 22 08:49 leases - ../dm-605
lrwxrwxrwx. 1 root root 9 May 22 08:49 master - ../dm-608
lrwxrwxrwx. 1 root root 9 May 22 08:49 metadata - ../dm-603
lrwxrwxrwx. 1 root root 9 May 22 08:49 outbox - ../dm-604

lvs | grep 05a88a94
  05a88a94-5bd6-4698-be69-39e78c84e1a5 12f68692-2a5a-4e48-af5e-4679bca7fd44 
-wi-ao  14.12g

 
 then indeed, vda is the local qcow2 file, and vda[1] is the backing
 file on the LV storage.
 
 Normally, you only care about the write threshold at the active layer
 (the local file, with name vda), because that is the only image that
 will normally be allocating sectors.  But in the case of active commit,
 where you are taking the thin-provisioned local file and writing its
 clusters back into the backing LV, the action of commit can allocate
 sectors in the backing file. 

Right

 Thus, libvirt wants to let you set a
 write-threshold on both parts of the backing chain (the active wrapper,
 and the LV backing file), where the event could fire on either node
 first.  The existing libvirt virConnectDomainGetAllStats() can already
 be used to poll allocation growth (the block.N.allocation statistic in
 libvirt, or 'virtual-size' in QMP's 'ImageInfo'), but the event would
 let you drop polling.

Yes, exactly the intent

 However, while starting to code the libvirt side of things, I've hit a
 couple of snags with interacting with the qemu design.  First, the
 'block-set-write-threshold' command is allowed to set a threshold by
 'node-name' (any BDS, whether active or backing),

Yes, this emerged during the review of my patch. 
I first took the simplest approach (probably simplistic, in retrospect),
but -IIRC- was pointed out that setting by node-name grants the most
flexible approach, hence was required.

See:
http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02503.html
http://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg02580.html
http

[Qemu-devel] [Bug 1338957] Re: RFE: add an event to report block devices watermark

2015-01-13 Thread Francesco Romani
patch posted on qemu-devel, reviewd, acked and merged into maintainer's branch:
https://github.com/stefanha/qemu/commit/f050ea639522e9dd7e501ef285a2a12709b8726a

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1338957

Title:
  RFE: add an event to report block devices watermark

Status in QEMU:
  New

Bug description:
  Add an event to report if a block device usage exceeds a threshold.
  The threshold should be configurable with a monitor command. The event
  should report the affected block device. Additional useful information
  could be the offset of the highest sector , like in the query-
  blockstats output.

  Rationale for the RFE
  Managing applications, like oVirt (http://www.ovirt.org), make extensive use 
of thin-provisioned disk images.
  In order to let the guest run flawlessly and be not unnecessarily paused, 
oVirt sets a watermark and automatically resized the image once the watermark 
is reached or exceeded.

  In order to detect the mark crossing, the managing application has no choice 
than aggressively polling the QEMU monitor
  using the query-blockstats command. This lead to unnecessary system load, and 
is made even worse under scale: scenarios
  with hunderds of VM are becoming not unusual.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1338957/+subscriptions



Re: [Qemu-devel] [PATCH v5] block: add event when disk usage exceeds threshold

2015-01-13 Thread Francesco Romani
- Original Message -
 From: Stefan Hajnoczi stefa...@redhat.com
 To: Francesco Romani from...@redhat.com
 Cc: qemu-devel@nongnu.org, kw...@redhat.com, lcapitul...@redhat.com, 
 mdr...@linux.vnet.ibm.com, ebl...@redhat.com
 Sent: Tuesday, January 13, 2015 2:54:41 PM
 Subject: Re: [PATCH v5] block: add event when disk usage exceeds threshold

 Thanks, applied to my block branch:
 https://github.com/stefanha/qemu/tree/block

Thank you for the review and the merge!

Bests,

-- 
Francesco Romani
RedHat Engineering Virtualization R  D
Phone: 8261328
IRC: fromani



Re: [Qemu-devel] [PATCH v4] block: add event when disk usage exceeds threshold

2015-01-12 Thread Francesco Romani
Hi,

thanks for the review!

- Original Message -
 From: Eric Blake ebl...@redhat.com
 To: Francesco Romani from...@redhat.com, qemu-devel@nongnu.org
 Cc: kw...@redhat.com, mdr...@linux.vnet.ibm.com, stefa...@redhat.com, 
 lcapitul...@redhat.com
 Sent: Friday, January 9, 2015 5:54:40 PM
 Subject: Re: [Qemu-devel] [PATCH v4] block: add event when disk usage exceeds 
 threshold
 
 On 12/04/2014 01:59 AM, Francesco Romani wrote:
  Managing applications, like oVirt (http://www.ovirt.org), make extensive
  use of thin-provisioned disk images.
  To let the guest run smoothly and be not unnecessarily paused, oVirt sets
  a disk usage threshold (so called 'high water mark') based on the
  occupation
  of the device,  and automatically extends the image once the threshold
  is reached or exceeded.
  
  In order to detect the crossing of the threshold, oVirt has no choice but
  aggressively polling the QEMU monitor using the query-blockstats command.
  This lead to unnecessary system load, and is made even worse under scale:
  deployments with hundreds of VMs are no longer rare.
  
  To fix this, this patch adds:
  * A new monitor command to set a write threshold for a given block device.
  * A new event to report if a block device usage exceeds the threshold.
 
 Please also mention the names of those two things in the commit message,
 to make it easier to find them when doing 'git log' archaeology.

Sure, will do.

  This will allow the managing application to use smarter and more
  efficient monitoring, greatly reducing the need of polling.
  
  A followup patch is planned to allow to add the write threshold at
  device creation.
  
  Signed-off-by: Francesco Romani from...@redhat.com
  ---
 
  --- /dev/null
  +++ b/block/write-threshold.c
  @@ -0,0 +1,125 @@
  +/*
  + * QEMU System Emulator block write threshold notification
  + *
  + * Copyright Red Hat, Inc. 2014
 
 I've been so slow on the review that you may want to add 2015.

IANAL, but since most (~99%) of code was written in 2014, I'll just leave as 
that.


  +bool bdrv_write_threshold_is_set(const BlockDriverState *bs)
  +{
  +return !!(bs-write_threshold_offset  0);
 
 The !! is spurious; use of  already guarantees a bool result.

Will remove.
 
  +++ b/qapi/block-core.json
  @@ -239,6 +239,9 @@
   #
   # @iops_size: #optional an I/O size in bytes (Since 1.7)
   #
  +# @write_threshold: configured write threshold for the device.
  +#   0 if disabled. (Since 2.3)
  +#
   # Since: 0.14.0
   #
   ##
  @@ -253,7 +256,7 @@
   '*bps_max': 'int', '*bps_rd_max': 'int',
   '*bps_wr_max': 'int', '*iops_max': 'int',
   '*iops_rd_max': 'int', '*iops_wr_max': 'int',
  -'*iops_size': 'int' } }
  +'*iops_size': 'int', 'write_threshold': 'uint64' } }
 
 'int' works as well as 'uint64'; since this is an output parameter, we
 aren't gaining any stricter input parsing by using a more-specific type.

I found one case on which the usage 'int' vs 'uint64' made the code generator
emit different code - and the one using uint64 was more correct.
Can't recall if that is the case; I'll retry, and add a comment here
to document the behaviour if I stumble on this again.


 My findings are minor, so I'm okay if you post a v5 that addresses them
 and includes:
 Reviewed-by: Eric Blake ebl...@redhat.com

Yep, will post ASAP.

Thanks and best regards.

-- 
Francesco Romani
RedHat Engineering Virtualization R  D
Phone: 8261328
IRC: fromani



[Qemu-devel] [PATCH v5] block: add event when disk usage exceeds threshold

2015-01-12 Thread Francesco Romani
Managing applications, like oVirt (http://www.ovirt.org), make extensive
use of thin-provisioned disk images.
To let the guest run smoothly and be not unnecessarily paused, oVirt sets
a disk usage threshold (so called 'high water mark') based on the occupation
of the device,  and automatically extends the image once the threshold
is reached or exceeded.

In order to detect the crossing of the threshold, oVirt has no choice but
aggressively polling the QEMU monitor using the query-blockstats command.
This lead to unnecessary system load, and is made even worse under scale:
deployments with hundreds of VMs are no longer rare.

To fix this, this patch adds:
* A new monitor command `block-set-write-threshold', to set a mark for
  a given block device.
* A new event `BLOCK_WRITE_THRESHOLD', to report if a block device
  usage exceeds the threshold.
* A new `write_threshold' field into the `BlockDeviceInfo' structure,
  to report the configured threshold.

This will allow the managing application to use smarter and more
efficient monitoring, greatly reducing the need of polling.

Signed-off-by: Francesco Romani from...@redhat.com
Reviewed-by: Eric Blake ebl...@redhat.com
---
 block/Makefile.objs |   1 +
 block/qapi.c|   3 +
 block/write-threshold.c | 125 
 include/block/block_int.h   |   4 ++
 include/block/write-threshold.h |  64 
 qapi/block-core.json|  51 +++-
 qmp-commands.hx |  32 ++
 tests/Makefile  |   3 +
 tests/test-write-threshold.c| 119 ++
 9 files changed, 401 insertions(+), 1 deletion(-)
 create mode 100644 block/write-threshold.c
 create mode 100644 include/block/write-threshold.h
 create mode 100644 tests/test-write-threshold.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 04b0e43..010afad 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -20,6 +20,7 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o
 block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
+block-obj-y += write-threshold.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/qapi.c b/block/qapi.c
index fa68ba7..709b328 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -24,6 +24,7 @@
 
 #include block/qapi.h
 #include block/block_int.h
+#include block/write-threshold.h
 #include qmp-commands.h
 #include qapi-visit.h
 #include qapi/qmp-output-visitor.h
@@ -89,6 +90,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs)
 info-iops_size = cfg.op_size;
 }
 
+info-write_threshold = bdrv_write_threshold_get(bs);
+
 return info;
 }
 
diff --git a/block/write-threshold.c b/block/write-threshold.c
new file mode 100644
index 000..c2cd517
--- /dev/null
+++ b/block/write-threshold.c
@@ -0,0 +1,125 @@
+/*
+ * QEMU System Emulator block write threshold notification
+ *
+ * Copyright Red Hat, Inc. 2014
+ *
+ * Authors:
+ *  Francesco Romani from...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include block/block_int.h
+#include block/coroutine.h
+#include block/write-threshold.h
+#include qemu/notify.h
+#include qapi-event.h
+#include qmp-commands.h
+
+
+uint64_t bdrv_write_threshold_get(const BlockDriverState *bs)
+{
+return bs-write_threshold_offset;
+}
+
+bool bdrv_write_threshold_is_set(const BlockDriverState *bs)
+{
+return bs-write_threshold_offset  0;
+}
+
+static void write_threshold_disable(BlockDriverState *bs)
+{
+if (bdrv_write_threshold_is_set(bs)) {
+notifier_with_return_remove(bs-write_threshold_notifier);
+bs-write_threshold_offset = 0;
+}
+}
+
+uint64_t bdrv_write_threshold_exceeded(const BlockDriverState *bs,
+   const BdrvTrackedRequest *req)
+{
+if (bdrv_write_threshold_is_set(bs)) {
+if (req-offset  bs-write_threshold_offset) {
+return (req-offset - bs-write_threshold_offset) + req-bytes;
+}
+if ((req-offset + req-bytes)  bs-write_threshold_offset) {
+return (req-offset + req-bytes) - bs-write_threshold_offset;
+}
+}
+return 0;
+}
+
+static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
+void *opaque)
+{
+BdrvTrackedRequest *req = opaque;
+BlockDriverState *bs = req-bs;
+uint64_t amount = 0;
+
+amount = bdrv_write_threshold_exceeded(bs, req);
+if (amount  0) {
+qapi_event_send_block_write_threshold(
+bs-node_name,
+amount,
+bs-write_threshold_offset,
+error_abort);
+
+/* autodisable to avoid flooding the monitor */
+write_threshold_disable(bs);
+}
+
+return 0; /* should always

Re: [Qemu-devel] [PATCH v4] block: add event when disk usage exceeds threshold

2015-01-09 Thread Francesco Romani
ping

- Original Message -
 From: Francesco Romani from...@redhat.com
 To: qemu-devel@nongnu.org
 Cc: kw...@redhat.com, mdr...@linux.vnet.ibm.com, stefa...@redhat.com, 
 lcapitul...@redhat.com
 Sent: Monday, December 15, 2014 12:19:44 PM
 Subject: Re: [Qemu-devel] [PATCH v4] block: add event when disk   usage   
 exceeds threshold
 
 ping?
 
 - Original Message -
  From: Francesco Romani from...@redhat.com
  To: qemu-devel@nongnu.org
  Cc: kw...@redhat.com, mdr...@linux.vnet.ibm.com, stefa...@redhat.com,
  lcapitul...@redhat.com
  Sent: Thursday, December 4, 2014 10:00:18 AM
  Subject: Re: [Qemu-devel] [PATCH v4] block: add event when disk usage
  exceeds threshold
  
  - Original Message -
   From: Francesco Romani from...@redhat.com
   To: qemu-devel@nongnu.org
   Cc: kw...@redhat.com, stefa...@redhat.com, lcapitul...@redhat.com,
   mdr...@linux.vnet.ibm.com, ebl...@redhat.com,
   Francesco Romani from...@redhat.com
   Sent: Thursday, December 4, 2014 9:59:08 AM
   Subject: [PATCH v4] block: add event when disk usage exceeds threshold
  [...]
  
  Changes since v3:
  -
  
  addressed reviewers comments:
  - dropped redundant cover letter
  - improved commit message to reflect real intended use
(polling is not going away completely)
  - spelling fixes
  - API consistency fixes
  - kept 'uint64' in QAPI because it is mapped to C type  uint64_t,
where 'int' is mapped to 'int64_t'
  - renamed for consistency everywhere:
'usage_threshold' = 'write_threshold'
  
  Thanks and best regards
  
  --
  Francesco Romani
  RedHat Engineering Virtualization R  D
  Phone: 8261328
  IRC: fromani
  
  
 
 --
 Francesco Romani
 RedHat Engineering Virtualization R  D
 Phone: 8261328
 IRC: fromani
 
 

-- 
Francesco Romani
RedHat Engineering Virtualization R  D
Phone: 8261328
IRC: fromani



Re: [Qemu-devel] [PATCH v4] block: add event when disk usage exceeds threshold

2014-12-15 Thread Francesco Romani
ping?

- Original Message -
 From: Francesco Romani from...@redhat.com
 To: qemu-devel@nongnu.org
 Cc: kw...@redhat.com, mdr...@linux.vnet.ibm.com, stefa...@redhat.com, 
 lcapitul...@redhat.com
 Sent: Thursday, December 4, 2014 10:00:18 AM
 Subject: Re: [Qemu-devel] [PATCH v4] block: add event when disk usage exceeds 
 threshold
 
 - Original Message -
  From: Francesco Romani from...@redhat.com
  To: qemu-devel@nongnu.org
  Cc: kw...@redhat.com, stefa...@redhat.com, lcapitul...@redhat.com,
  mdr...@linux.vnet.ibm.com, ebl...@redhat.com,
  Francesco Romani from...@redhat.com
  Sent: Thursday, December 4, 2014 9:59:08 AM
  Subject: [PATCH v4] block: add event when disk usage exceeds threshold
 [...]
 
 Changes since v3:
 -
 
 addressed reviewers comments:
 - dropped redundant cover letter
 - improved commit message to reflect real intended use
   (polling is not going away completely)
 - spelling fixes
 - API consistency fixes
 - kept 'uint64' in QAPI because it is mapped to C type  uint64_t,
   where 'int' is mapped to 'int64_t'
 - renamed for consistency everywhere:
   'usage_threshold' = 'write_threshold'
 
 Thanks and best regards
 
 --
 Francesco Romani
 RedHat Engineering Virtualization R  D
 Phone: 8261328
 IRC: fromani
 
 

-- 
Francesco Romani
RedHat Engineering Virtualization R  D
Phone: 8261328
IRC: fromani



[Qemu-devel] [PATCH v4] block: add event when disk usage exceeds threshold

2014-12-04 Thread Francesco Romani
Managing applications, like oVirt (http://www.ovirt.org), make extensive
use of thin-provisioned disk images.
To let the guest run smoothly and be not unnecessarily paused, oVirt sets
a disk usage threshold (so called 'high water mark') based on the occupation
of the device,  and automatically extends the image once the threshold
is reached or exceeded.

In order to detect the crossing of the threshold, oVirt has no choice but
aggressively polling the QEMU monitor using the query-blockstats command.
This lead to unnecessary system load, and is made even worse under scale:
deployments with hundreds of VMs are no longer rare.

To fix this, this patch adds:
* A new monitor command to set a write threshold for a given block device.
* A new event to report if a block device usage exceeds the threshold.

This will allow the managing application to use smarter and more
efficient monitoring, greatly reducing the need of polling.

A followup patch is planned to allow to add the write threshold at
device creation.

Signed-off-by: Francesco Romani from...@redhat.com
---
 block/Makefile.objs |   1 +
 block/qapi.c|   3 +
 block/write-threshold.c | 125 
 include/block/block_int.h   |   4 ++
 include/block/write-threshold.h |  64 
 qapi/block-core.json|  50 +++-
 qmp-commands.hx |  32 ++
 tests/Makefile  |   3 +
 tests/test-write-threshold.c| 119 ++
 9 files changed, 400 insertions(+), 1 deletion(-)
 create mode 100644 block/write-threshold.c
 create mode 100644 include/block/write-threshold.h
 create mode 100644 tests/test-write-threshold.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 04b0e43..010afad 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -20,6 +20,7 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o
 block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
+block-obj-y += write-threshold.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/qapi.c b/block/qapi.c
index a87a34a..d38f7a6 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -24,6 +24,7 @@
 
 #include block/qapi.h
 #include block/block_int.h
+#include block/write-threshold.h
 #include qmp-commands.h
 #include qapi-visit.h
 #include qapi/qmp-output-visitor.h
@@ -82,6 +83,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs)
 info-iops_size = cfg.op_size;
 }
 
+info-write_threshold = bdrv_write_threshold_get(bs);
+
 return info;
 }
 
diff --git a/block/write-threshold.c b/block/write-threshold.c
new file mode 100644
index 000..b7aa20e
--- /dev/null
+++ b/block/write-threshold.c
@@ -0,0 +1,125 @@
+/*
+ * QEMU System Emulator block write threshold notification
+ *
+ * Copyright Red Hat, Inc. 2014
+ *
+ * Authors:
+ *  Francesco Romani from...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include block/block_int.h
+#include block/coroutine.h
+#include block/write-threshold.h
+#include qemu/notify.h
+#include qapi-event.h
+#include qmp-commands.h
+
+
+uint64_t bdrv_write_threshold_get(const BlockDriverState *bs)
+{
+return bs-write_threshold_offset;
+}
+
+bool bdrv_write_threshold_is_set(const BlockDriverState *bs)
+{
+return !!(bs-write_threshold_offset  0);
+}
+
+static void write_threshold_disable(BlockDriverState *bs)
+{
+if (bdrv_write_threshold_is_set(bs)) {
+notifier_with_return_remove(bs-write_threshold_notifier);
+bs-write_threshold_offset = 0;
+}
+}
+
+uint64_t bdrv_write_threshold_exceeded(const BlockDriverState *bs,
+   const BdrvTrackedRequest *req)
+{
+if (bdrv_write_threshold_is_set(bs)) {
+if (req-offset  bs-write_threshold_offset) {
+return (req-offset - bs-write_threshold_offset) + req-bytes;
+}
+if ((req-offset + req-bytes)  bs-write_threshold_offset) {
+return (req-offset + req-bytes) - bs-write_threshold_offset;
+}
+}
+return 0;
+}
+
+static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
+void *opaque)
+{
+BdrvTrackedRequest *req = opaque;
+BlockDriverState *bs = req-bs;
+uint64_t amount = 0;
+
+amount = bdrv_write_threshold_exceeded(bs, req);
+if (amount  0) {
+qapi_event_send_block_write_threshold(
+bs-node_name,
+amount,
+bs-write_threshold_offset,
+error_abort);
+
+/* autodisable to avoid flooding the monitor */
+write_threshold_disable(bs);
+}
+
+return 0; /* should always let other notifiers run */
+}
+
+static void write_threshold_register_notifier(BlockDriverState *bs

Re: [Qemu-devel] [PATCH v4] block: add event when disk usage exceeds threshold

2014-12-04 Thread Francesco Romani
- Original Message -
 From: Francesco Romani from...@redhat.com
 To: qemu-devel@nongnu.org
 Cc: kw...@redhat.com, stefa...@redhat.com, lcapitul...@redhat.com, 
 mdr...@linux.vnet.ibm.com, ebl...@redhat.com,
 Francesco Romani from...@redhat.com
 Sent: Thursday, December 4, 2014 9:59:08 AM
 Subject: [PATCH v4] block: add event when disk usage exceeds threshold
[...]

Changes since v3:
-

addressed reviewers comments:
- dropped redundant cover letter
- improved commit message to reflect real intended use
  (polling is not going away completely)
- spelling fixes
- API consistency fixes
- kept 'uint64' in QAPI because it is mapped to C type  uint64_t,
  where 'int' is mapped to 'int64_t'
- renamed for consistency everywhere:
  'usage_threshold' = 'write_threshold'

Thanks and best regards

-- 
Francesco Romani
RedHat Engineering Virtualization R  D
Phone: 8261328
IRC: fromani



Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold

2014-12-02 Thread Francesco Romani
- Original Message -
 From: Francesco Romani from...@redhat.com
 To: Eric Blake ebl...@redhat.com
 Cc: kw...@redhat.com, lcapitul...@redhat.com, qemu-devel@nongnu.org, 
 stefa...@redhat.com, mdr...@linux.vnet.ibm.com
 Sent: Tuesday, December 2, 2014 8:47:45 AM
 Subject: Re: [Qemu-devel] [PATCHv3] block: add event when disk usage  exceeds 
 threshold
 
 Thanks for the quick review!

  Missing the change to the 'query-block' and 'query-named-block-nodes'
  examples to show the new always-present output field.
 

Sorry, I missed the *examples* keyword. Will fix.

-- 
Francesco Romani
RedHat Engineering Virtualization R  D
Phone: 8261328
IRC: fromani



Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold

2014-12-02 Thread Francesco Romani


- Original Message -
 From: Stefan Hajnoczi stefa...@redhat.com
 To: Francesco Romani from...@redhat.com
 Cc: kw...@redhat.com, mdr...@linux.vnet.ibm.com, qemu-devel@nongnu.org, 
 lcapitul...@redhat.com
 Sent: Tuesday, December 2, 2014 4:01:17 PM
 Subject: Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds  
 threshold
 
 On Fri, Nov 28, 2014 at 01:31:07PM +0100, Francesco Romani wrote:
  @@ -82,6 +83,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState
  *bs)
   info-iops_size = cfg.op_size;
   }
   
  +info-write_threshold = bdrv_usage_threshold_get(bs);
 
 Overall looks good but I notice that write_threshold and
 usage_threshold are both used.  Please use just one consistently (I
 think write_threshold is clearer).

Agreed. Will use write_threshold everywhere, file names included.

Bests,

-- 
Francesco Romani
RedHat Engineering Virtualization R  D
Phone: 8261328
IRC: fromani



Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold

2014-12-01 Thread Francesco Romani
Thanks for the quick review!

- Original Message -
 From: Eric Blake ebl...@redhat.com
 To: Francesco Romani from...@redhat.com, qemu-devel@nongnu.org
 Cc: kw...@redhat.com, mdr...@linux.vnet.ibm.com, stefa...@redhat.com, 
 lcapitul...@redhat.com
 Sent: Monday, December 1, 2014 10:07:38 PM
 Subject: Re: [Qemu-devel] [PATCHv3] block: add event when disk usage exceeds  
 threshold
 
 On 11/28/2014 05:31 AM, Francesco Romani wrote:
  Managing applications, like oVirt (http://www.ovirt.org), make extensive
  use of thin-provisioned disk images.
  To let the guest run smoothly and be not unnecessarily paused, oVirt sets
  a disk usage threshold (so called 'high water mark') based on the
  occupation
  of the device,  and automatically extends the image once the threshold
  is reached or exceeded.
  
  In order to detect the crossing of the threshold, oVirt has no choice but
  aggressively polling the QEMU monitor using the query-blockstats command.
  This lead to unnecessary system load, and is made even worse under scale:
  deployments with hundreds of VMs are no longer rare.
  
  To fix this, this patch adds:
  * A new monitor command to set a mark for a given block device.
  * A new event to report if a block device usage exceeds the threshold.
  
  This will allow the managing application to drop the polling
  altogether and just wait for a watermark crossing event.
 
 I like the idea!
 
 Question - what happens if management misses the event (for example, if
 libvirtd is restarted)?  Does the existing 'query-blockstats' and/or
 'query-named-block-nodes' still work to query the current threshold and
 whether it has been exceeded, as a poll-once command executed when
 reconnecting to the monitor?

Indeed oVirt wants to keep the existing polling and to use it as fallback,
to make sure no events are missed. oVirt will not rely on the new notification
*alone*.
The plan is to just poll *much less* frequently. Today's default poll
rate is every 2 (two) seconds, so there is a lot of room for improvement.

 
  
  Signed-off-by: Francesco Romani from...@redhat.com
  ---
 
 No need for a 0/1 cover letter on a 1-patch series; you have the option
 of just putting the side-band information here and sending it as a
 single mail.  But the cover letter approach doesn't hurt either, and I
 can see how it can be easier for some workflows to always send a cover
 letter than to special-case a 1-patch series.

Also, I found that a separate context/introduction/room for comments
would helped, especially on first two revisions of the patch which were
tagged as RFC. I have zero problems in dropping the cover letter now that
consensus is forming and patch is taking shape, just let me know.

[... snip spelling: thanks! will fix]
  +++ b/qapi/block-core.json
  @@ -239,6 +239,9 @@
   #
   # @iops_size: #optional an I/O size in bytes (Since 1.7)
   #
  +# @write-threshold: configured write threshold for the device.
  +#   0 if disabled. (Since 2.3)
  +#
   # Since: 0.14.0
   #
   ##
  @@ -253,7 +256,7 @@
   '*bps_max': 'int', '*bps_rd_max': 'int',
   '*bps_wr_max': 'int', '*iops_max': 'int',
   '*iops_rd_max': 'int', '*iops_wr_max': 'int',
  -'*iops_size': 'int' } }
  +'*iops_size': 'int', 'write-threshold': 'uint64' } }
 
 In QMP specs, 'uint64' and 'int' are practically synonyms.  I can live
 with either spelling, although 'int' is more common.
 
 Bikeshed on naming: Although we prefer '-' over '_' in new interfaces,
 we also favor consistency, and BlockDeviceInfo is one of those dinosaur
 commands that uses _ everywhere until your addition.  So naming this
 field 'write_threshold' would be more consistent.

Agreed. Will fix.

  +#
  +# @node-name: graph node name on which the threshold was exceeded.
  +#
  +# @amount-exceeded: amount of data which exceeded the threshold, in bytes.
  +#
  +# @offset-threshold: last configured threshold, in bytes.
  +#
 
 Might want to mention that this event is one-shot; after it triggers, a
 user must re-register a threshold to get the event again.

Good point. Will fix.

 
  +# Since: 2.3
  +##
  +{ 'event': 'BLOCK_USAGE_THRESHOLD',
  +  'data': { 'node-name': 'str',
  +   'amount-exceeded': 'uint64',
 
 TAB damage.  Please use spaces.  ./scripts/checkpatch.pl will catch some
 offenders (although I didn't test if it will catch this one).
 
 However, here you are correct in using '-' for naming :)

Oops. Rebase error (was clean before!) will fix.

 
  +   'threshold': 'uint64' } }
  +
  +##
  +# @block-set-threshold
  +#
  +# Change usage threshold for a block drive. An event will be delivered
  +# if a write to this block drive crosses the configured threshold.
  +# This is useful to transparently resize thin-provisioned drives without
  +# the guest OS noticing.
  +#
  +# @node-name: graph node name on which the threshold must be set.
  +#
  +# @write-threshold: configured threshold for the block device

[Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold

2014-11-28 Thread Francesco Romani
Managing applications, like oVirt (http://www.ovirt.org), make extensive
use of thin-provisioned disk images.
To let the guest run smoothly and be not unnecessarily paused, oVirt sets
a disk usage threshold (so called 'high water mark') based on the occupation
of the device,  and automatically extends the image once the threshold
is reached or exceeded.

In order to detect the crossing of the threshold, oVirt has no choice but
aggressively polling the QEMU monitor using the query-blockstats command.
This lead to unnecessary system load, and is made even worse under scale:
deployments with hundreds of VMs are no longer rare.

To fix this, this patch adds:
* A new monitor command to set a mark for a given block device.
* A new event to report if a block device usage exceeds the threshold.

This will allow the managing application to drop the polling
altogether and just wait for a watermark crossing event.

Signed-off-by: Francesco Romani from...@redhat.com
---
 block/Makefile.objs |   1 +
 block/qapi.c|   3 +
 block/usage-threshold.c | 124 
 include/block/block_int.h   |   4 ++
 include/block/usage-threshold.h |  62 
 qapi/block-core.json|  48 +++-
 qmp-commands.hx |  28 +
 tests/Makefile  |   3 +
 tests/test-usage-threshold.c| 101 
 9 files changed, 373 insertions(+), 1 deletion(-)
 create mode 100644 block/usage-threshold.c
 create mode 100644 include/block/usage-threshold.h
 create mode 100644 tests/test-usage-threshold.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 04b0e43..43e381d 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -20,6 +20,7 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o
 block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
+block-obj-y += usage-threshold.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/qapi.c b/block/qapi.c
index a87a34a..c85abb0 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -24,6 +24,7 @@
 
 #include block/qapi.h
 #include block/block_int.h
+#include block/usage-threshold.h
 #include qmp-commands.h
 #include qapi-visit.h
 #include qapi/qmp-output-visitor.h
@@ -82,6 +83,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs)
 info-iops_size = cfg.op_size;
 }
 
+info-write_threshold = bdrv_usage_threshold_get(bs);
+
 return info;
 }
 
diff --git a/block/usage-threshold.c b/block/usage-threshold.c
new file mode 100644
index 000..9faf5e6
--- /dev/null
+++ b/block/usage-threshold.c
@@ -0,0 +1,124 @@
+/*
+ * QEMU System Emulator block usage threshold notification
+ *
+ * Copyright Red Hat, Inc. 2014
+ *
+ * Authors:
+ *  Francesco Romani from...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include block/block_int.h
+#include block/coroutine.h
+#include block/usage-threshold.h
+#include qemu/notify.h
+#include qapi-event.h
+#include qmp-commands.h
+
+
+uint64_t bdrv_usage_threshold_get(const BlockDriverState *bs)
+{
+return bs-write_threshold_offset;
+}
+
+bool bdrv_usage_threshold_is_set(const BlockDriverState *bs)
+{
+return !!(bs-write_threshold_offset  0);
+}
+
+static void usage_threshold_disable(BlockDriverState *bs)
+{
+if (bdrv_usage_threshold_is_set(bs)) {
+notifier_with_return_remove(bs-write_threshold_notifier);
+bs-write_threshold_offset = 0;
+}
+}
+
+uint64_t bdrv_usage_threshold_exceeded(const BlockDriverState *bs,
+   const BdrvTrackedRequest *req)
+{
+if (bdrv_usage_threshold_is_set(bs)) {
+if (req-offset  bs-write_threshold_offset) {
+return (req-offset - bs-write_threshold_offset) + req-bytes;
+}
+if ((req-offset + req-bytes)  bs-write_threshold_offset) {
+return (req-offset + req-bytes) - bs-write_threshold_offset;
+}
+}
+return 0;
+}
+
+static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
+void *opaque)
+{
+BdrvTrackedRequest *req = opaque;
+BlockDriverState *bs = req-bs;
+uint64_t amount = 0;
+
+amount = bdrv_usage_threshold_exceeded(bs, req);
+if (amount  0) {
+qapi_event_send_block_usage_threshold(
+bs-node_name,
+amount,
+bs-write_threshold_offset,
+error_abort);
+
+/* autodisable to avoid to flood the monitor */
+usage_threshold_disable(bs);
+}
+
+return 0; /* should always let other notifiers run */
+}
+
+static void usage_threshold_register_notifier(BlockDriverState *bs)
+{
+bs-write_threshold_notifier.notify = before_write_notify;
+notifier_with_return_list_add(bs

[Qemu-devel] [PATCH v3] add write threshold reporting for block devices

2014-11-28 Thread Francesco Romani
v2 was: with RFC. Since last review round I dropped the tag
because I think now all the open points are addressed.
v1 was: add watermark reporting for block devices, but
watermark is incorrectly unused. Hence the change in subject.

Changes since v2:
-

addressed reviewers comments:
- use node name to find the block driver state to be checked
- report node name in the event notification
- fixed signed vs unsigned integer: use uint64 everywhere, to deal
  with integer overflows (more) gracefully.
- fixed pending style issues
- renamed and made public a few functions to make them testable
- add very basic initial unit tests

Changes since v1:
-

addressed reviewers comments. Highligths:
- fixed terminology: watermark - usage threshold
- threshold is expressed in bytes
- make the event triggers only once when threshold crossed
- configured threshold visible in 'query-block' output
- fixed bugs

Open issues:


Not all node names show up in the 'query-block' output, but I'll
start a different thread to discuss this.

Followup:
-

Patches I'll have on my queue and I'd like to post as followup
- more some unit testing.
- add support to set the threshold at device creation


Rationale and context from v1
-

I'm one of the oVirt developers (http://www.ovirt.org);
oVirt is a virtualization management application built
around qemu/kvm, so it is nice to get in touch :)

We have begun a big scalability improvement effort, aiming to
support without problems hundreds of VMs per host, with plans
to support thousands in a not so distant future.
In doing so, we are reviewing our usage flows.

One of them is thin-provisioned storage, which is used
quite extensively, with block devices (ISCSI for example)
and COW images.
When using thin provisioning, oVirt tries hard to hide this
fact from the guest OS, and to do so watches closely
the usage of the device, and resize it when its usage exceeds
a configured threshold (the high water mark), in order
to avoid the guest OS to get paused for space exhausted.

To do the watching, we poll he devices using libvirt [1],
which in turn uses query-blockstats.
This is suboptimal with just one VM, but with hundereds of them,
let alone thousands, it doesn't scale and it is quite a resource
hog.

Would be great to have this watermark concept supported into qemu,
with a new event to be raised when the limit is crossed.

To track this RFE I filed https://bugs.launchpad.net/qemu/+bug/1338957

Moreover, I had the chance to take a look at the QEMU sources
and come up with this tentative patch which I'd also like
to submit.

Notes
-

[0]: https://lists.gnu.org/archive/html/qemu-devel/2014-07/msg01348.html
[1]: 
http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=ebb0c19c48690f0598de954f8e0e9d4d29d48b85


Francesco Romani (1):
  block: add event when disk usage exceeds threshold

 block/Makefile.objs |   1 +
 block/qapi.c|   3 +
 block/usage-threshold.c | 124 
 include/block/block_int.h   |   4 ++
 include/block/usage-threshold.h |  62 
 qapi/block-core.json|  48 +++-
 qmp-commands.hx |  28 +
 tests/Makefile  |   3 +
 tests/test-usage-threshold.c| 101 
 9 files changed, 373 insertions(+), 1 deletion(-)
 create mode 100644 block/usage-threshold.c
 create mode 100644 include/block/usage-threshold.h
 create mode 100644 tests/test-usage-threshold.c

-- 
1.9.3




Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-21 Thread Francesco Romani
- Original Message -
 From: Kevin Wolf kw...@redhat.com
 To: Stefan Hajnoczi stefa...@redhat.com
 Cc: mdr...@linux.vnet.ibm.com, Stefan Hajnoczi stefa...@gmail.com, 
 lcapitul...@redhat.com, qemu-devel@nongnu.org,
 Francesco Romani from...@redhat.com
 Sent: Thursday, November 20, 2014 12:34:28 PM
 Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold 
 reporting for block devices

One way to solve this is to require that the management tool tells QEMU
which exact BlockDriverState node the threshold applies to.  Then QEMU
doesn't need any hardcoded policy.  But I'm not sure how realistic that
it at the moment (whether management tools are uses node names for each
node yet), so it may be best to hardcode the bs-file traversal that
I've suggested.

Kevin: Do you agree?
   
   I have a feeling that we would regret this in the long run because it
   would allow only one special case of a general problem (watching a BDS).
   This means that we'll get inconsistent APIs.
   
   We're only talking about an optimisation here, even though a very
   useful one, so I wouldn't easily make compromises here. We should
   probably insist on using the node-name. Management tools need new code
   anyway to make use of the new functionality, so they can implement
   node-name support as well while they're at it.
  
  Using node-name is the best thing to do.
  
  My concern is just whether libvirt and other management tools are
  actually using node-name yet.
 
 I don't think so. They also don't use blockdev-add yet.
 
 But that's not a reason for us to add hacks that allow libvirt and other
 management tools to avoid the proper APIs even in the future. They just
 need to add support for node-names if they want to use new qemu features.
 New features require support for new infrastructure, I think that's fair.
 
 If they feel that representing complete BDS graphs in their code is too
 much work for now, they can still keep temporary hacks with hardcoded
 assumptions in their management code (like setting file.node-name and
 ignoring other setups). At least it would be temporary hacks there; if
 we did them in qemu, they would be a permanent API.

I'm fine to use node_name in my patch, it looks even much simpler and cleaner

I'd love to take this chance and learn more about the topic, becuse
I'm very near to the border of my knowledge in that area.
I joined the discussion quite later, so my sources are actually
pretty sparse. Mostly:
- staring at the sources and git history
- googling for specific bits
- presentations like 
http://www.linux-kvm.org/wiki/images/3/34/Kvm-forum-2013-block-dev-configuration.pdf

There are some sources I'm missing? Hopefully a nice wiki page I somehow lost :)

A couple of specific questions more, mostly to make sure I can do meaningful
tests for my next submission:

1. I'm running a simple test using the attached script -
which is a qemu command line adapted from libvirt ouput driven
by oVirt. There is a way to attach a name at this stage, using a QMP command?

2. (related to the former) it seems from a not-so-deep look that the blessed 
(only?)
way to set a proper node_name is using blockdev-add.
If so, I'm not sure I follow how the qemu boot flow would look like.
It will not be anymore as simple as crafting a command line and run the qemu, 
right?
IIUC some interaction with QMP will be needed (sorry for asking silly question,
trying to fill gaps in my knowledge).

Thanks for the great feedback!


-- 
Francesco Romani
RedHat Engineering Virtualization R  D
Phone: 8261328
IRC: fromani


qemu.sh
Description: application/shellscript


Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-21 Thread Francesco Romani
- Original Message -
 From: Kevin Wolf kw...@redhat.com
 To: Francesco Romani from...@redhat.com
 Cc: qemu-devel@nongnu.org, Stefan Hajnoczi stefa...@gmail.com, 
 mdr...@linux.vnet.ibm.com, Luiz Capitulino
 lcapitul...@redhat.com, Stefan Hajnoczi stefa...@redhat.com
 Sent: Friday, November 21, 2014 11:11:26 AM
 Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold 
 reporting for block devices
[...]
  1. I'm running a simple test using the attached script -
  which is a qemu command line adapted from libvirt ouput driven
  by oVirt. There is a way to attach a name at this stage, using a QMP
  command?
 
 No, node-name is assigned at the BlockDriverState (BDS) creation and
 can't be changed later on.

Makes sense to me.

  2. (related to the former) it seems from a not-so-deep look that the
  blessed (only?)
  way to set a proper node_name is using blockdev-add.
  If so, I'm not sure I follow how the qemu boot flow would look like.
  It will not be anymore as simple as crafting a command line and run the
  qemu, right?
  IIUC some interaction with QMP will be needed (sorry for asking silly
  question,
  trying to fill gaps in my knowledge).
 
 -drive on the command line can do everything that blockdev-add can do.
 So let's assume you have a qcow2 image on a filesystem. Then you end up
 with two BDSes, one for the format driver and one for accessing the
 filesystem:
 
 BlockBackend (virtual device) - qcow2 BDS - file BDS (raw-posix.c)
 
 For assigning a node name to the qcow2 BDS, you simply specify it in the
 obvious way:
 
 -drive file=test.qcow2,node-name=foo
 
 Now if you want to assign a node name to the file BDS as well, you would
 get nested dicts in the blockdev-add call. In -drive a dot syntax is
 used to represent this:
 
 -drive file=test.qcow2,node-name=foo,file.node-name=bar
 
 Are things a bit clearer with this?

Yes, thanks a lot. I was a bit misleaded by the lack of the reference (after a 
very quick
look) in the man page.
Maybe the manpage is out of date, but this is a different story -and maybe a 
different patch :)

New revision will come in a few days.

Bests,

-- 
Francesco Romani
RedHat Engineering Virtualization R  D
Phone: 8261328
IRC: fromani



Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-20 Thread Francesco Romani
- Original Message -
 From: Stefan Hajnoczi stefa...@redhat.com
 To: Francesco Romani from...@redhat.com
 Cc: kw...@redhat.com, Stefan Hajnoczi stefa...@gmail.com, 
 mdr...@linux.vnet.ibm.com, qemu-devel@nongnu.org,
 lcapitul...@redhat.com
 Sent: Wednesday, November 19, 2014 4:52:51 PM
 Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold 
 reporting for block devices
 
 On Tue, Nov 18, 2014 at 03:12:12AM -0500, Francesco Romani wrote:
+static int coroutine_fn before_write_notify(NotifierWithReturn
*notifier,
+void *opaque)
+{
+BdrvTrackedRequest *req = opaque;
+BlockDriverState *bs = req-bs;
+int64_t amount = 0;
+
+assert((req-offset  (BDRV_SECTOR_SIZE - 1)) == 0);
+assert((req-bytes  (BDRV_SECTOR_SIZE - 1)) == 0);
   
   Does the code still make these assumptions or are the asserts left over
   from previous versions of the patch?
  
  It's a leftover.
  I understood they don't hurt and add a bit of safety, but if they are
  confusing
  I'll remove them.
 
 Yes, it made me wonder why.  Probably best to remove them.

Will do

[...]
  At risk of being redundant, in oVirt the usecase is QCOW2 over lvm block
  device,
  and we'd like to be notified about the allocation of the lvm block device,
  which (IIUC)
  is the last bs-file.
  
  This is a simple topology (unless I'm missing something), and that's
  the reason why I go down just one level.
  
  Of course I want a general solution for this change, so...
 
 There is a block driver for error injection called blkdebug (see
 docs/blkdebug.txt).  Here is an example of the following topology:
 
   raw_bsd (drive0) - blkdebug - raw-posix (test.img)
 
 qemu-system-x86_64 -drive
 if=virtio,format=raw,file.driver=blkdebug,file.image.filename=test.img
 
 The blkdebug driver is interposing between the raw_bsd (drive0) root and
 the raw-posix leaf node.

Thanks, I'll have a look

[...]
 The management tool should not need to inspect the graph because the
 graph can change at runtime (e.g. imagine I/O throttling is implemented
 as a BlockDriverState node then it could appear/disapper when the
 feature is activated/deactivated).  Instead the management tool should
 name the nodes it knows about and then use those node names.

Agreed - and indeed simpler for us (oVirt), which it doesn't hurt :)

  If we descend the bs-file chain, AFAIU the easiest mapping, being the
  device name,
  is easily lost because only the outermost BlockDriverState has a device
  name attached, so when the
  notification trigger
  bdrv_get_device_name() will return NULL
 
 In the worst case a name string can be passed in along with the
 threshold values.

OK, I guess to keep a copy of the string with g_strdup() could be good enough 
start,
at least for further discussion.

Thanks for your review and for the informations,
I'll submit a new revision of the patch in a couple of days,
to give to other reviewers some time to jump in.

Bests,

-- 
Francesco Romani
RedHat Engineering Virtualization R  D
Phone: 8261328
IRC: fromani



Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-18 Thread Francesco Romani
- Original Message -
 From: Stefan Hajnoczi stefa...@gmail.com
 To: Francesco Romani from...@redhat.com
 Cc: kw...@redhat.com, lcapitul...@redhat.com, qemu-devel@nongnu.org, 
 stefa...@redhat.com, mdr...@linux.vnet.ibm.com
 Sent: Monday, November 17, 2014 5:49:36 PM
 Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold 
 reporting for block devices
 
 On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote:
 
 Sorry for the long review delay.  Looks pretty good, just one real issue
 to think about at the bottom.

Hi Stefan, thanks for the review and no problem for the delay :)

 
  +static void usage_threshold_disable(BlockDriverState *bs)
  +{
 
 It would be safest to make this idempotent:
 
 if (!usage_threshold_is_set(bs)) {
 return;
 }
 
 That way it can be called multiple times.

Will change.

 
  +notifier_with_return_remove(bs-wr_usage_threshold_notifier);
  +bs-wr_offset_threshold = 0;
  +}
  +
  +static int usage_threshold_is_set(const BlockDriverState *bs)
  +{
  +return !!(bs-wr_offset_threshold  0);
  +}
 
 Please use the bool type instead of an int return value.

Sure, will fix.

  +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
  +void *opaque)
  +{
  +BdrvTrackedRequest *req = opaque;
  +BlockDriverState *bs = req-bs;
  +int64_t amount = 0;
  +
  +assert((req-offset  (BDRV_SECTOR_SIZE - 1)) == 0);
  +assert((req-bytes  (BDRV_SECTOR_SIZE - 1)) == 0);
 
 Does the code still make these assumptions or are the asserts left over
 from previous versions of the patch?

It's a leftover.
I understood they don't hurt and add a bit of safety, but if they are confusing
I'll remove them.

  +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t
  threshold_bytes)
  +{
  +BlockDriverState *target_bs = bs;
  +if (bs-file) {
  +target_bs = bs-file;
  +}
 
 Hmm...I think now I understand why you are trying to use bs-file.  This
 is an attempt to make image formats work with the threshold.
 
 Unfortunately the BlockDriverState topology can be more complicated than
 just 1 level.

I thought so but I can't reproduce yet more complex topologies.
Disclosure: I'm testing against the topology I know to be used on oVirt, lacking
immediate availability of others: suggestions welcome.

At risk of being redundant, in oVirt the usecase is QCOW2 over lvm block device,
and we'd like to be notified about the allocation of the lvm block device, 
which (IIUC)
is the last bs-file.

This is a simple topology (unless I'm missing something), and that's
the reason why I go down just one level.

Of course I want a general solution for this change, so...

 If we hardcode a strategy to traverse bs-file then it will work in most
 cases:
 
   while (bs-file) {
   bs = bs-file;
   }
 
 But there are cases like VMDK extent files where a BlockDriverState
 actually has multiple children.
 
 One way to solve this is to require that the management tool tells QEMU
 which exact BlockDriverState node the threshold applies to.  Then QEMU
 doesn't need any hardcoded policy.  But I'm not sure how realistic that
 it at the moment (whether management tools are uses node names for each
 node yet), so it may be best to hardcode the bs-file traversal that
 I've suggested.

oVirt relies on libvirt[1], and uses device node (e.g. 'vda').

BTW, I haven't found a way to inspect from the outside (e.g. monitor command) 
the BlockDriverState
topology, there is a way I'm missing?

Another issue I don't yet have a proper solution is related to this one.

The management app will have deal with VM with more than one block device disk, 
so we need
a way to map the notification with the corresponding device.

If we descend the bs-file chain, AFAIU the easiest mapping, being the device 
name,
is easily lost because only the outermost BlockDriverState has a device name 
attached, so when the
notification trigger
bdrv_get_device_name() will return NULL

I believe we don't necessarily need a device name in the notification, for 
example something
like an opaque cookie set together with the threshold and returned back with 
the notification
will suffice. Of course the device name is nicer :)

+++

[1] if that can help further to understand the usecase, these are the libvirt 
APIs being used
by oVirt:
http://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockStatsFlags
http://libvirt.org/html/libvirt-libvirt-domain.html#virDomainGetBlockInfo
both relies on the output[2] of 'query-blockstats' monitor command.

[2] AFAIU -but this is just my guesswork!- it also assumes a quite simple 
topology like I did

Thanks and bests,

-- 
Francesco Romani
RedHat Engineering Virtualization R  D
Phone: 8261328
IRC: fromani



[Qemu-devel] [RFC][PATCH v2] add write threshold reporting for block devices

2014-11-07 Thread Francesco Romani
v1 was: add watermark reporting for block devices, but
watermark is incorrectly unused. Hence the change in subject.

Sorry for long pause from v1 [0];
Only recently I was able to sort out all the missing details.
Context for this RFC/patch is presented below as remider.

Why RFC?


See Known issues below.

Changes since v1:
-

addressed reviewers comments. Highligths:
- fixed terminology: watermark - usage threshold
- threshold is expressed in bytes
- make the event triggers only once when threshold crossed
- configured threshold visible in 'query-block' output
- fixed bugs

Known Issues


The threshold we (oVirt) care about is actually the one in the
parent device, not in the device itself.
See Rationale and context from v1 below for details.
The use case is qcow2 on a thin-provisioned block device.

So the notification handler must be set in the inner BlockDriverState,
like this (pseudocode, no error checking, see patch for details)

/* ... */

BlockDriverState *bs = bdrv_find(drive-virtio-disk0);

/* ... */

bdrv_set_usage_threshold(bs, threshold_bytes);

/* ... */

void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes)
{
BlockDriverState *target_bs = bs;
if (bs-file) {
target_bs = bs-file; /* we care about highest allocation of this! */
}

bs-wr_usage_threshold_notifier.notify = 
usage_threshold_before_write_notify;
notifier_with_return_list_add(bs-before_write_notifiers,
  bs-wr_usage_threshold_notifier);
}

Problem: when the notifier triggers, I don't know how to get the device name,
because a simple

bdrv_get_device_name(bs);

yields just 

Despite my efforts and some digging in the code, I don't know which is the clean
and correct way to get the (parent) device name.


Rationale and context from v1
-

I'm one of the oVirt developers (http://www.ovirt.org);
oVirt is a virtualization management application built
around qemu/kvm, so it is nice to get in touch :)

We have begun a big scalability improvement effort, aiming to
support without problems hundreds of VMs per host, with plans
to support thousands in a not so distant future.
In doing so, we are reviewing our usage flows.

One of them is thin-provisioned storage, which is used
quite extensively, with block devices (ISCSI for example)
and COW images.
When using thin provisioning, oVirt tries hard to hide this
fact from the guest OS, and to do so watches closely
the usage of the device, and resize it when its usage exceeds
a configured threshold (the high water mark), in order
to avoid the guest OS to get paused for space exhausted.

To do the watching, we poll he devices using libvirt [1],
which in turn uses query-blockstats.
This is suboptimal with just one VM, but with hundereds of them,
let alone thousands, it doesn't scale and it is quite a resource
hog.

Would be great to have this watermark concept supported into qemu,
with a new event to be raised when the limit is crossed.

To track this RFE I filed https://bugs.launchpad.net/qemu/+bug/1338957

Moreover, I had the chance to take a look at the QEMU sources
and come up with this tentative patch which I'd also like
to submit.

Notes
-

[0]: https://lists.gnu.org/archive/html/qemu-devel/2014-07/msg01348.html
[1]: 
http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=ebb0c19c48690f0598de954f8e0e9d4d29d48b85

Francesco Romani (1):
  block: add event when disk usage exceeds threshold

 block/Makefile.objs |   1 +
 block/qapi.c|   3 +
 block/usage-threshold.c | 124 
 include/block/block_int.h   |   4 ++
 include/block/usage-threshold.h |  39 +
 qapi/block-core.json|  46 ++-
 qmp-commands.hx |  26 +
 7 files changed, 242 insertions(+), 1 deletion(-)
 create mode 100644 block/usage-threshold.c
 create mode 100644 include/block/usage-threshold.h

-- 
1.9.3




[Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

2014-11-07 Thread Francesco Romani
Managing applications, like oVirt (http://www.ovirt.org), make extensive
use of thin-provisioned disk images.
To let the guest run smoothly and be not unnecessarily paused, oVirt sets
a disk usage threshold (so called 'high water mark') based on the occupation
of the device,  and automatically extends the image once the threshold
is reached or exceeded.

In order to detect the crossing of the threshold, oVirt has no choice but
aggressively polling the QEMU monitor using the query-blockstats command.
This lead to unnecessary system load, and is made even worse under scale:
deployments with hundreds of VMs are no longer rare.

To fix this, this patch adds:
* A new monitor command to set a mark for a given block device.
* A new event to report if a block device usage exceeds the threshold.

This will allow the managing application to drop the polling
altogether and just wait for a watermark crossing event.

Signed-off-by: Francesco Romani from...@redhat.com
---
 block/Makefile.objs |   1 +
 block/qapi.c|   3 +
 block/usage-threshold.c | 124 
 include/block/block_int.h   |   4 ++
 include/block/usage-threshold.h |  39 +
 qapi/block-core.json|  46 ++-
 qmp-commands.hx |  26 +
 7 files changed, 242 insertions(+), 1 deletion(-)
 create mode 100644 block/usage-threshold.c
 create mode 100644 include/block/usage-threshold.h

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 04b0e43..43e381d 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -20,6 +20,7 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o
 block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
+block-obj-y += usage-threshold.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/qapi.c b/block/qapi.c
index 1301144..3bb0bc7 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -24,6 +24,7 @@
 
 #include block/qapi.h
 #include block/block_int.h
+#include block/usage-threshold.h
 #include qmp-commands.h
 #include qapi-visit.h
 #include qapi/qmp-output-visitor.h
@@ -315,6 +316,8 @@ static void bdrv_query_info(BlockBackend *blk, BlockInfo 
**p_info,
 }
 }
 
+info-write_threshold = bdrv_get_usage_threshold(bs);
+
 *p_info = info;
 return;
 
diff --git a/block/usage-threshold.c b/block/usage-threshold.c
new file mode 100644
index 000..31a587d
--- /dev/null
+++ b/block/usage-threshold.c
@@ -0,0 +1,124 @@
+/*
+ * QEMU System Emulator block usage threshold notification
+ *
+ * Copyright Red Hat, Inc. 2014
+ *
+ * Authors:
+ *  Francesco Romani from...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include block/block_int.h
+#include block/coroutine.h
+#include block/usage-threshold.h
+#include qemu/notify.h
+#include qapi-event.h
+#include qmp-commands.h
+
+
+int64_t bdrv_get_usage_threshold(const BlockDriverState *bs)
+{
+if (bs == NULL) {
+return 0;
+}
+if (bs-file) {
+return bs-file-wr_offset_threshold;
+}
+return bs-wr_offset_threshold;
+}
+
+static void usage_threshold_disable(BlockDriverState *bs)
+{
+notifier_with_return_remove(bs-wr_usage_threshold_notifier);
+bs-wr_offset_threshold = 0;
+}
+
+static int usage_threshold_is_set(const BlockDriverState *bs)
+{
+return !!(bs-wr_offset_threshold  0);
+}
+
+static int64_t usage_threshold_exceeded(const BlockDriverState *bs,
+const BdrvTrackedRequest *req)
+{
+if (usage_threshold_is_set(bs)) {
+int64_t amount = req-offset + req-bytes - bs-wr_offset_threshold;
+if (amount  0) {
+return amount;
+}
+}
+return 0;
+}
+
+static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
+void *opaque)
+{
+BdrvTrackedRequest *req = opaque;
+BlockDriverState *bs = req-bs;
+int64_t amount = 0;
+
+assert((req-offset  (BDRV_SECTOR_SIZE - 1)) == 0);
+assert((req-bytes  (BDRV_SECTOR_SIZE - 1)) == 0);
+
+amount = usage_threshold_exceeded(bs, req);
+if (amount  0) {
+qapi_event_send_block_usage_threshold(
+bdrv_get_device_name(bs), /* FIXME: this does not work */
+amount,
+bs-wr_offset_threshold,
+error_abort);
+
+/* autodisable to avoid to flood the monitor */
+usage_threshold_disable(bs);
+}
+
+return 0; /* should always let other notifiers run */
+}
+
+static void usage_threshold_register_notifier(BlockDriverState *bs)
+{
+bs-wr_usage_threshold_notifier.notify = before_write_notify;
+notifier_with_return_list_add(bs-before_write_notifiers,
+  bs-wr_usage_threshold_notifier);
+}
+
+void bdrv_set_usage_threshold

Re: [Qemu-devel] [PATCH] block: add watermark event

2014-08-08 Thread Francesco Romani
- Original Message -
 From: Stefan Hajnoczi stefa...@redhat.com
 To: Kevin Wolf kw...@redhat.com
 Cc: mdr...@linux.vnet.ibm.com, Francesco Romani from...@redhat.com, 
 qemu-devel@nongnu.org, lcapitul...@redhat.com
 Sent: Tuesday, August 5, 2014 3:08:46 PM
 Subject: Re: [Qemu-devel] [PATCH] block: add watermark event
 
 On Tue, Aug 05, 2014 at 10:47:57AM +0200, Kevin Wolf wrote:
  Am 01.08.2014 um 13:39 hat Stefan Hajnoczi geschrieben:
   On Tue, Jul 08, 2014 at 04:49:24PM +0200, Francesco Romani wrote:
@@ -5813,3 +5815,57 @@ void bdrv_flush_io_queue(BlockDriverState *bs)
 bdrv_flush_io_queue(bs-file);
 }
 }
+
+static bool watermark_exceeded(BlockDriverState *bs,
+   int64_t sector_num,
+   int nb_sectors)
+{
+
+if (bs-wr_watermark_perc  0) {
+int64_t watermark = (bs-total_sectors) / 100 *
bs-wr_watermark_perc;
   
   bs-total_sectors should not be used directly.
   
   Have you considered making the watermark parameter take sector units
   instead of a percentage?
   
   I'm not sure whether a precentage makes sense because 25% of a 10GB
   image is 2.5 GB so a 75% watermark might be reasonable.  25% of a 1 TB
   image is 250 GB and that's probably not a reasonable watermark.
   
   So let the block-set-watermark caller pass an absolute sector number
   instead.  It keeps things simple for both QEMU and thin provisioning
   manager.
  
  No sector numbers in external interfaces, please. These units of 512
  bytes are completely arbitrary and don't make any sense. I hope to get
  rid of BDRV_SECTOR_* eventually even internally.
  
  So for external APIs, please use bytes instead.
 
 I agree and forgot about that.  Please use bytes instead of sectors or a
 percentage.
 

Thanks everyone for the great feedback received!

I'll post asap a new patch addressing all the comments.

I'll also change the name because 'watermark' may be misleading/wrong jargon :)
(http://en.wikipedia.org/wiki/Watermark)

Bests,

-- 
Francesco Romani
RedHat Engineering Virtualization R  D
Phone: 8261328
IRC: fromani



[Qemu-devel] [Bug 1338957] [NEW] RFE: add an event to report block devices watermark

2014-07-08 Thread Francesco Romani
Public bug reported:

Add an event to report if a block device usage exceeds a threshold. The
threshold should be configurable with a monitor command. The event
should report the affected block device. Additional useful information
could be the offset of the highest sector , like in the query-blockstats
output.

Rationale for the RFE
Managing applications, like oVirt (http://www.ovirt.org), make extensive use of 
thin-provisioned disk images.
In order to let the guest run flawlessly and be not unnecessarily paused, oVirt 
sets a watermark and automatically resized the image once the watermark is 
reached or exceeded.

In order to detect the mark crossing, the managing application has no choice 
than aggressively polling the QEMU monitor
using the query-blockstats command. This lead to unnecessary system load, and 
is made even worse under scale: scenarios
with hunderds of VM are becoming not unusual.

** Affects: qemu
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1338957

Title:
  RFE: add an event to report block devices watermark

Status in QEMU:
  New

Bug description:
  Add an event to report if a block device usage exceeds a threshold.
  The threshold should be configurable with a monitor command. The event
  should report the affected block device. Additional useful information
  could be the offset of the highest sector , like in the query-
  blockstats output.

  Rationale for the RFE
  Managing applications, like oVirt (http://www.ovirt.org), make extensive use 
of thin-provisioned disk images.
  In order to let the guest run flawlessly and be not unnecessarily paused, 
oVirt sets a watermark and automatically resized the image once the watermark 
is reached or exceeded.

  In order to detect the mark crossing, the managing application has no choice 
than aggressively polling the QEMU monitor
  using the query-blockstats command. This lead to unnecessary system load, and 
is made even worse under scale: scenarios
  with hunderds of VM are becoming not unusual.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1338957/+subscriptions



[Qemu-devel] [PATCH] add watermark reporting for block devices

2014-07-08 Thread Francesco Romani
Hello everyone

I'm one of the oVirt developers (http://www.ovirt.org);
oVirt is a virtualization management application built
around qemu/kvm, so it is nice to get in touch :)

We have begun a big scalability improvement effort, aiming to
support without problems hundreds of VMs per host, with plans
to support thousands in a not so distant future.
In doing so, we are reviewing our usage flows.

One of them is thin-provisioned storage, which is used
quite extensively, with block devices (ISCSI for example)
and COW images.
When using thin provisioning, oVirt tries hard to hide this
fact from the guest OS, and to do so watches closely
the usage of the device, and resize it when its usage exceeds
a configured threshold (the high water mark), in order
to avoid the guest OS to get paused for space exhausted.

To do the watching, we poll he devices using libvirt
(virDomainGetBlockInfo), which in turn uses query-blockstats.
This is suboptimal with just one VM, but with hundereds of them,
let alone thousands, it doesn't scale and it is quite a resource
hog.

Would be great to have this watermark concept supported into qemu,
with a new event to be raised when the limit is crossed.

To track this RFE I filed https://bugs.launchpad.net/qemu/+bug/1338957

Moreover, I had the chance to take a look at the QEMU sources
and come up with this tentative patch which I'd also like
to submit.

Comments and thoughts very welcome!

Thanks and best regards,

Francesco Romani (1):
  block: add watermark event

 block.c   | 56 +++
 blockdev.c| 21 ++
 include/block/block.h |  2 ++
 include/block/block_int.h |  3 +++
 qapi/block-core.json  | 33 
 qmp-commands.hx   | 24 
 6 files changed, 139 insertions(+)

-- 
1.9.3




[Qemu-devel] [PATCH] block: add watermark event

2014-07-08 Thread Francesco Romani
Managing applications, like oVirt (http://www.ovirt.org), make extensive
use of thin-provisioned disk images.
In order to let the guest run flawlessly and be not unnecessarily
paused, oVirt sets a watermark based on the percentage occupation of the
device against the advertised size, and automatically resizes the image
once the watermark is reached or exceeded.

In order to detect the mark crossing, the managing application has no
choice than aggressively polling the QEMU monitor using the
query-blockstats command. This lead to unnecessary system
load, and is made even worse under scale: scenarios
with hundreds of VM are becoming not unusual.

To fix this, this patch adds:
* A new monitor command to set a mark for a given block device.
* A new event to report if a block device usage exceeds the threshold.

This will allow the managing application to drop the polling
alltogether and just wait for a watermark crossing event.

Signed-off-by: Francesco Romani from...@redhat.com
---
 block.c   | 56 +++
 blockdev.c| 21 ++
 include/block/block.h |  2 ++
 include/block/block_int.h |  3 +++
 qapi/block-core.json  | 33 
 qmp-commands.hx   | 24 
 6 files changed, 139 insertions(+)

diff --git a/block.c b/block.c
index 8800a6b..cf34b7f 100644
--- a/block.c
+++ b/block.c
@@ -1983,6 +1983,8 @@ static void bdrv_move_feature_fields(BlockDriverState 
*bs_dest,
 bs_dest-device_list = bs_src-device_list;
 memcpy(bs_dest-op_blockers, bs_src-op_blockers,
sizeof(bs_dest-op_blockers));
+
+bs_dest-wr_watermark_perc  = bs_src-wr_watermark_perc;
 }
 
 /*
@@ -5813,3 +5815,57 @@ void bdrv_flush_io_queue(BlockDriverState *bs)
 bdrv_flush_io_queue(bs-file);
 }
 }
+
+static bool watermark_exceeded(BlockDriverState *bs,
+   int64_t sector_num,
+   int nb_sectors)
+{
+
+if (bs-wr_watermark_perc  0) {
+int64_t watermark = (bs-total_sectors) / 100 * bs-wr_watermark_perc;
+if (sector_num = watermark) {
+return true;
+}
+}
+return false;
+}
+
+static int coroutine_fn watermark_before_write_notify(NotifierWithReturn 
*notifier,
+  void *opaque)
+{
+BdrvTrackedRequest *req = opaque;
+int64_t sector_num = req-offset  BDRV_SECTOR_BITS;
+int nb_sectors = req-bytes  BDRV_SECTOR_BITS;
+
+/*  FIXME: needed? */
+assert((req-offset  (BDRV_SECTOR_SIZE - 1)) == 0);
+assert((req-bytes  (BDRV_SECTOR_SIZE - 1)) == 0);
+
+if (watermark_exceeded(req-bs, sector_num, nb_sectors)) {
+BlockDriverState *bs = req-bs;
+qapi_event_send_block_watermark(
+bdrv_get_device_name(bs),
+sector_num,
+bs-wr_highest_sector,
+error_abort);
+}
+
+return 0; /* should always let other notifiers run */
+}
+
+void bdrv_set_watermark_perc(BlockDriverState *bs,
+ int watermark_perc)
+{
+NotifierWithReturn before_write = {
+.notify = watermark_before_write_notify,
+};
+
+if (watermark_perc = 0) {
+return;
+}
+
+if (bs-wr_watermark_perc == 0) {
+bdrv_add_before_write_notifier(bs, before_write);
+}
+bs-wr_watermark_perc = watermark_perc;
+}
diff --git a/blockdev.c b/blockdev.c
index 48bd9a3..ede21d9 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2546,6 +2546,27 @@ BlockJobInfoList *qmp_query_block_jobs(Error **errp)
 return dummy.next;
 }
 
+void qmp_block_set_watermark(const char *device, int64_t watermark,
+ Error **errp)
+{
+BlockDriverState *bs;
+AioContext *aio_context;
+
+bs = bdrv_find(device);
+if (!bs) {
+error_set(errp, QERR_DEVICE_NOT_FOUND, device);
+return;
+}
+
+aio_context = bdrv_get_aio_context(bs);
+aio_context_acquire(aio_context);
+
+bdrv_set_watermark_perc(bs, watermark);
+
+aio_context_release(aio_context);
+}
+
+
 QemuOptsList qemu_common_drive_opts = {
 .name = drive,
 .head = QTAILQ_HEAD_INITIALIZER(qemu_common_drive_opts.head),
diff --git a/include/block/block.h b/include/block/block.h
index 32d3676..ff92ef9 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -588,4 +588,6 @@ void bdrv_io_plug(BlockDriverState *bs);
 void bdrv_io_unplug(BlockDriverState *bs);
 void bdrv_flush_io_queue(BlockDriverState *bs);
 
+void bdrv_set_watermark_perc(BlockDriverState *bs, int watermark_perc);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index f6c3bef..666ea1d 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -393,6 +393,9 @@ struct BlockDriverState {
 
 /* The error object in use for blocking operations on backing_hd */
 Error *backing_blocker;
+
+/* watermark limit for writes, percentage

[Qemu-devel] [RFC] add watermark reporting for block devices

2014-07-08 Thread Francesco Romani
Sorry, this is actually an RFC; patch was posted separately.

- Original Message -
 From: Francesco Romani from...@redhat.com
 To: qemu-devel@nongnu.org
 Cc: kw...@redhat.com, stefa...@redhat.com, lcapitul...@redhat.com, 
 mdr...@linux.vnet.ibm.com, Francesco Romani
 from...@redhat.com
 Sent: Tuesday, July 8, 2014 4:49:23 PM
 Subject: [PATCH] add watermark reporting for block devices
 
 Hello everyone
 
 I'm one of the oVirt developers (http://www.ovirt.org);
 oVirt is a virtualization management application built
 around qemu/kvm, so it is nice to get in touch :)
 
 We have begun a big scalability improvement effort, aiming to
 support without problems hundreds of VMs per host, with plans
 to support thousands in a not so distant future.
 In doing so, we are reviewing our usage flows.
 
 One of them is thin-provisioned storage, which is used
 quite extensively, with block devices (ISCSI for example)
 and COW images.
 When using thin provisioning, oVirt tries hard to hide this
 fact from the guest OS, and to do so watches closely
 the usage of the device, and resize it when its usage exceeds
 a configured threshold (the high water mark), in order
 to avoid the guest OS to get paused for space exhausted.
 
 To do the watching, we poll he devices using libvirt
 (virDomainGetBlockInfo), which in turn uses query-blockstats.
 This is suboptimal with just one VM, but with hundereds of them,
 let alone thousands, it doesn't scale and it is quite a resource
 hog.
 
 Would be great to have this watermark concept supported into qemu,
 with a new event to be raised when the limit is crossed.
 
 To track this RFE I filed https://bugs.launchpad.net/qemu/+bug/1338957
 
 Moreover, I had the chance to take a look at the QEMU sources
 and come up with this tentative patch which I'd also like
 to submit.
 
 Comments and thoughts very welcome!
 
 Thanks and best regards,
 
 Francesco Romani (1):
   block: add watermark event
 
  block.c   | 56
  +++
  blockdev.c| 21 ++
  include/block/block.h |  2 ++
  include/block/block_int.h |  3 +++
  qapi/block-core.json  | 33 
  qmp-commands.hx   | 24 
  6 files changed, 139 insertions(+)
 
 --
 1.9.3
 
 

-- 
Francesco Romani
RedHat Engineering Virtualization R  D
Phone: 8261328
IRC: fromani



[Qemu-devel] [Bug 1284090] [NEW] RFE: QMP: report error reason in BLOCK_IO_ERROR message

2014-02-24 Thread Francesco Romani
Public bug reported:

when a disk drive is configured with the error policy enospc for write errors, 
the monitoring client needs a way to distinguish
betwwen generic I/O error and the I/O error for space exausted.

The JSON QMP protocol lacks this information: the BLOCK_IO_ERROR message
does not provide a reason or code for the error verified, so the
monitoring client cannot distinguish the source of the errors.

verified against git 105a060188dc6fdd4551571a966514d1a5f6815a

** Affects: qemu
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1284090

Title:
  RFE: QMP: report error reason in BLOCK_IO_ERROR message

Status in QEMU:
  New

Bug description:
  when a disk drive is configured with the error policy enospc for write 
errors, the monitoring client needs a way to distinguish
  betwwen generic I/O error and the I/O error for space exausted.

  The JSON QMP protocol lacks this information: the BLOCK_IO_ERROR
  message does not provide a reason or code for the error verified, so
  the monitoring client cannot distinguish the source of the errors.

  verified against git 105a060188dc6fdd4551571a966514d1a5f6815a

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1284090/+subscriptions