On 11/28/2014 05:31 AM, Francesco Romani wrote: > Managing applications, like oVirt (http://www.ovirt.org), make extensive > use of thin-provisioned disk images. > To let the guest run smoothly and be not unnecessarily paused, oVirt sets > a disk usage threshold (so called 'high water mark') based on the occupation > of the device, and automatically extends the image once the threshold > is reached or exceeded. > > In order to detect the crossing of the threshold, oVirt has no choice but > aggressively polling the QEMU monitor using the query-blockstats command. > This lead to unnecessary system load, and is made even worse under scale: > deployments with hundreds of VMs are no longer rare. > > To fix this, this patch adds: > * A new monitor command to set a mark for a given block device. > * A new event to report if a block device usage exceeds the threshold. > > This will allow the managing application to drop the polling > altogether and just wait for a watermark crossing event.
I like the idea! Question - what happens if management misses the event (for example, if libvirtd is restarted)? Does the existing 'query-blockstats' and/or 'query-named-block-nodes' still work to query the current threshold and whether it has been exceeded, as a poll-once command executed when reconnecting to the monitor? > > Signed-off-by: Francesco Romani <from...@redhat.com> > --- No need for a 0/1 cover letter on a 1-patch series; you have the option of just putting the side-band information here and sending it as a single mail. But the cover letter approach doesn't hurt either, and I can see how it can be easier for some workflows to always send a cover letter than to special-case a 1-patch series. > +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier, > + void *opaque) > +{ > + BdrvTrackedRequest *req = opaque; > + BlockDriverState *bs = req->bs; > + uint64_t amount = 0; > + > + amount = bdrv_usage_threshold_exceeded(bs, req); > + if (amount > 0) { > + qapi_event_send_block_usage_threshold( > + bs->node_name, > + amount, > + bs->write_threshold_offset, > + &error_abort); > + > + /* autodisable to avoid to flood the monitor */ s/to flood/flooding/ > +/* > + * bdrv_usage_threshold_is_set > + * > + * Tell if an usage threshold is set for a given BDS. s/an usage/a usage/ (in English, the difference between "a" and "an" is whether the leading sound of the next word is pronounced or not; in this case, "usage" is pronounced with a hard "yoo-sage". It may help to remember "an umbrella for a unicorn") > +++ b/qapi/block-core.json > @@ -239,6 +239,9 @@ > # > # @iops_size: #optional an I/O size in bytes (Since 1.7) > # > +# @write-threshold: configured write threshold for the device. > +# 0 if disabled. (Since 2.3) > +# > # Since: 0.14.0 > # > ## > @@ -253,7 +256,7 @@ > '*bps_max': 'int', '*bps_rd_max': 'int', > '*bps_wr_max': 'int', '*iops_max': 'int', > '*iops_rd_max': 'int', '*iops_wr_max': 'int', > - '*iops_size': 'int' } } > + '*iops_size': 'int', 'write-threshold': 'uint64' } } In QMP specs, 'uint64' and 'int' are practically synonyms. I can live with either spelling, although 'int' is more common. Bikeshed on naming: Although we prefer '-' over '_' in new interfaces, we also favor consistency, and BlockDeviceInfo is one of those dinosaur commands that uses _ everywhere until your addition. So naming this field 'write_threshold' would be more consistent. > +## > +# @BLOCK_USAGE_THRESHOLD > +# > +# Emitted when writes on block device reaches or exceeds the > +# configured threshold. For thin-provisioned devices, this > +# means the device should be extended to avoid pausing for > +# disk exaustion. s/exaustion/exhaustion/ > +# > +# @node-name: graph node name on which the threshold was exceeded. > +# > +# @amount-exceeded: amount of data which exceeded the threshold, in bytes. > +# > +# @offset-threshold: last configured threshold, in bytes. > +# Might want to mention that this event is one-shot; after it triggers, a user must re-register a threshold to get the event again. > +# Since: 2.3 > +## > +{ 'event': 'BLOCK_USAGE_THRESHOLD', > + 'data': { 'node-name': 'str', > + 'amount-exceeded': 'uint64', TAB damage. Please use spaces. ./scripts/checkpatch.pl will catch some offenders (although I didn't test if it will catch this one). However, here you are correct in using '-' for naming :) > + 'threshold': 'uint64' } } > + > +## > +# @block-set-threshold > +# > +# Change usage threshold for a block drive. An event will be delivered > +# if a write to this block drive crosses the configured threshold. > +# This is useful to transparently resize thin-provisioned drives without > +# the guest OS noticing. > +# > +# @node-name: graph node name on which the threshold must be set. > +# > +# @write-threshold: configured threshold for the block device, bytes. > +# Use 0 to disable the threshold. > +# > +# Returns: Nothing on success > +# If @node name is not found on the block device graph, > +# DeviceNotFound > +# > +# Since: 2.3 > +## > +{ 'command': 'block-set-threshold', > + 'data': { 'node-name': 'str', 'threshold': 'uint64' } } Again, 'int' instead of 'uint64' is more typical, but shouldn't hurt with either spelling. > +SQMP > +block-set-threshold > +------------ > + > +Change the write threshold for a block drive. The threshold is an offset, > +thus must be non-negative. Default is not write threshold. s/not/no/ > +To set the threshold to zero disables it. s/To set/Setting/ Missing the change to the 'query-block' and 'query-named-block-nodes' examples to show the new always-present output field. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
signature.asc
Description: OpenPGP digital signature