Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
On 11/21/2014 01:43 AM, Francesco Romani wrote: > A couple of specific questions more, mostly to make sure I can do meaningful > tests for my next submission: > > 1. I'm running a simple test using the attached script - > which is a qemu command line adapted from libvirt ouput driven > by oVirt. There is a way to attach a name at this stage, using a QMP command? Libvirt isn't yet attaching node names, and right now, there is no way to retroactively attach a node name (only at creation). Jeff Cody proposed a patch prior to 2.1 that would give ALL nodes a generated name if one was not supplied, but we still haven't taken that patch in, and by now it probably needs rebasing... -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
- Original Message - > From: "Kevin Wolf" > To: "Francesco Romani" > Cc: qemu-devel@nongnu.org, "Stefan Hajnoczi" , > mdr...@linux.vnet.ibm.com, "Luiz Capitulino" > , "Stefan Hajnoczi" > Sent: Friday, November 21, 2014 11:11:26 AM > Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold > reporting for block devices [...] > > 1. I'm running a simple test using the attached script - > > which is a qemu command line adapted from libvirt ouput driven > > by oVirt. There is a way to attach a name at this stage, using a QMP > > command? > > No, node-name is assigned at the BlockDriverState (BDS) creation and > can't be changed later on. Makes sense to me. > > 2. (related to the former) it seems from a not-so-deep look that the > > blessed (only?) > > way to set a proper node_name is using blockdev-add. > > If so, I'm not sure I follow how the qemu boot flow would look like. > > It will not be anymore as simple as crafting a command line and run the > > qemu, right? > > IIUC some interaction with QMP will be needed (sorry for asking silly > > question, > > trying to fill gaps in my knowledge). > > -drive on the command line can do everything that blockdev-add can do. > So let's assume you have a qcow2 image on a filesystem. Then you end up > with two BDSes, one for the format driver and one for accessing the > filesystem: > > BlockBackend (virtual device) -> qcow2 BDS -> file BDS (raw-posix.c) > > For assigning a node name to the qcow2 BDS, you simply specify it in the > obvious way: > > -drive file=test.qcow2,node-name=foo > > Now if you want to assign a node name to the file BDS as well, you would > get nested dicts in the blockdev-add call. In -drive a dot syntax is > used to represent this: > > -drive file=test.qcow2,node-name=foo,file.node-name=bar > > Are things a bit clearer with this? Yes, thanks a lot. I was a bit misleaded by the lack of the reference (after a very quick look) in the man page. Maybe the manpage is out of date, but this is a different story -and maybe a different patch :) New revision will come in a few days. Bests, -- Francesco Romani RedHat Engineering Virtualization R & D Phone: 8261328 IRC: fromani
Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
Am 21.11.2014 um 09:43 hat Francesco Romani geschrieben: > - Original Message - > > From: "Kevin Wolf" > > To: "Stefan Hajnoczi" > > Cc: mdr...@linux.vnet.ibm.com, "Stefan Hajnoczi" , > > lcapitul...@redhat.com, qemu-devel@nongnu.org, > > "Francesco Romani" > > Sent: Thursday, November 20, 2014 12:34:28 PM > > Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold > > reporting for block devices > > > > > > One way to solve this is to require that the management tool tells > > > > > QEMU > > > > > which exact BlockDriverState node the threshold applies to. Then QEMU > > > > > doesn't need any hardcoded policy. But I'm not sure how realistic > > > > > that > > > > > it at the moment (whether management tools are uses node names for > > > > > each > > > > > node yet), so it may be best to hardcode the bs->file traversal that > > > > > I've suggested. > > > > > > > > > > Kevin: Do you agree? > > > > > > > > I have a feeling that we would regret this in the long run because it > > > > would allow only one special case of a general problem (watching a BDS). > > > > This means that we'll get inconsistent APIs. > > > > > > > > We're "only" talking about an optimisation here, even though a very > > > > useful one, so I wouldn't easily make compromises here. We should > > > > probably insist on using the node-name. Management tools need new code > > > > anyway to make use of the new functionality, so they can implement > > > > node-name support as well while they're at it. > > > > > > Using node-name is the best thing to do. > > > > > > My concern is just whether libvirt and other management tools are > > > actually using node-name yet. > > > > I don't think so. They also don't use blockdev-add yet. > > > > But that's not a reason for us to add hacks that allow libvirt and other > > management tools to avoid the proper APIs even in the future. They just > > need to add support for node-names if they want to use new qemu features. > > New features require support for new infrastructure, I think that's fair. > > > > If they feel that representing complete BDS graphs in their code is too > > much work for now, they can still keep temporary hacks with hardcoded > > assumptions in their management code (like setting file.node-name and > > ignoring other setups). At least it would be temporary hacks there; if > > we did them in qemu, they would be a permanent API. > > I'm fine to use node_name in my patch, it looks even much simpler and cleaner > > I'd love to take this chance and learn more about the topic, becuse > I'm very near to the border of my knowledge in that area. > I joined the discussion quite later, so my sources are actually > pretty sparse. Mostly: > - staring at the sources and git history > - googling for specific bits > - presentations like > http://www.linux-kvm.org/wiki/images/3/34/Kvm-forum-2013-block-dev-configuration.pdf The video recording of that presentation wasn't really good quality, unfortunately. But you can watch the one of this year's presentation by Max and myself (you're probably mostly interested in Max's part), the title was "More Block Device Configuration". > There are some sources I'm missing? Hopefully a nice wiki page I somehow lost > :) I'm afraid that this single wiki page that has a well structured presentation of all information doesn't exist. > A couple of specific questions more, mostly to make sure I can do meaningful > tests for my next submission: > > 1. I'm running a simple test using the attached script - > which is a qemu command line adapted from libvirt ouput driven > by oVirt. There is a way to attach a name at this stage, using a QMP command? No, node-name is assigned at the BlockDriverState (BDS) creation and can't be changed later on. > 2. (related to the former) it seems from a not-so-deep look that the blessed > (only?) > way to set a proper node_name is using blockdev-add. > If so, I'm not sure I follow how the qemu boot flow would look like. > It will not be anymore as simple as crafting a command line and run the qemu, > right? > IIUC some interaction with QMP will be needed (sorry for asking silly > question, > trying to fill gaps in my knowledge). -drive on the command line can do everything that blockdev-add can do. So let's assume you have a qcow2 image on a filesystem. Then you end up with two BDSes, one for the format driver and one for accessing the filesystem: BlockBackend (virtual device) -> qcow2 BDS -> file BDS (raw-posix.c) For assigning a node name to the qcow2 BDS, you simply specify it in the obvious way: -drive file=test.qcow2,node-name=foo Now if you want to assign a node name to the file BDS as well, you would get nested dicts in the blockdev-add call. In -drive a dot syntax is used to represent this: -drive file=test.qcow2,node-name=foo,file.node-name=bar Are things a bit clearer with this? Kevin
Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
- Original Message - > From: "Kevin Wolf" > To: "Stefan Hajnoczi" > Cc: mdr...@linux.vnet.ibm.com, "Stefan Hajnoczi" , > lcapitul...@redhat.com, qemu-devel@nongnu.org, > "Francesco Romani" > Sent: Thursday, November 20, 2014 12:34:28 PM > Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold > reporting for block devices > > > > One way to solve this is to require that the management tool tells QEMU > > > > which exact BlockDriverState node the threshold applies to. Then QEMU > > > > doesn't need any hardcoded policy. But I'm not sure how realistic that > > > > it at the moment (whether management tools are uses node names for each > > > > node yet), so it may be best to hardcode the bs->file traversal that > > > > I've suggested. > > > > > > > > Kevin: Do you agree? > > > > > > I have a feeling that we would regret this in the long run because it > > > would allow only one special case of a general problem (watching a BDS). > > > This means that we'll get inconsistent APIs. > > > > > > We're "only" talking about an optimisation here, even though a very > > > useful one, so I wouldn't easily make compromises here. We should > > > probably insist on using the node-name. Management tools need new code > > > anyway to make use of the new functionality, so they can implement > > > node-name support as well while they're at it. > > > > Using node-name is the best thing to do. > > > > My concern is just whether libvirt and other management tools are > > actually using node-name yet. > > I don't think so. They also don't use blockdev-add yet. > > But that's not a reason for us to add hacks that allow libvirt and other > management tools to avoid the proper APIs even in the future. They just > need to add support for node-names if they want to use new qemu features. > New features require support for new infrastructure, I think that's fair. > > If they feel that representing complete BDS graphs in their code is too > much work for now, they can still keep temporary hacks with hardcoded > assumptions in their management code (like setting file.node-name and > ignoring other setups). At least it would be temporary hacks there; if > we did them in qemu, they would be a permanent API. I'm fine to use node_name in my patch, it looks even much simpler and cleaner I'd love to take this chance and learn more about the topic, becuse I'm very near to the border of my knowledge in that area. I joined the discussion quite later, so my sources are actually pretty sparse. Mostly: - staring at the sources and git history - googling for specific bits - presentations like http://www.linux-kvm.org/wiki/images/3/34/Kvm-forum-2013-block-dev-configuration.pdf There are some sources I'm missing? Hopefully a nice wiki page I somehow lost :) A couple of specific questions more, mostly to make sure I can do meaningful tests for my next submission: 1. I'm running a simple test using the attached script - which is a qemu command line adapted from libvirt ouput driven by oVirt. There is a way to attach a name at this stage, using a QMP command? 2. (related to the former) it seems from a not-so-deep look that the blessed (only?) way to set a proper node_name is using blockdev-add. If so, I'm not sure I follow how the qemu boot flow would look like. It will not be anymore as simple as crafting a command line and run the qemu, right? IIUC some interaction with QMP will be needed (sorry for asking silly question, trying to fill gaps in my knowledge). Thanks for the great feedback! -- Francesco Romani RedHat Engineering Virtualization R & D Phone: 8261328 IRC: fromani qemu.sh Description: application/shellscript
Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
On 11/20/2014 04:04 AM, Stefan Hajnoczi wrote: >> >> We're "only" talking about an optimisation here, even though a very >> useful one, so I wouldn't easily make compromises here. We should >> probably insist on using the node-name. Management tools need new code >> anyway to make use of the new functionality, so they can implement >> node-name support as well while they're at it. > > Using node-name is the best thing to do. > > My concern is just whether libvirt and other management tools are > actually using node-name yet. Libvirt is not yet using it, but the more compelling we make it, the more libvirt will accelerate the efforts needed to start using node-name. I'm okay with requiring node-name here. -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
Am 20.11.2014 um 12:04 hat Stefan Hajnoczi geschrieben: > On Thu, Nov 20, 2014 at 11:30:53AM +0100, Kevin Wolf wrote: > > Am 17.11.2014 um 17:49 hat Stefan Hajnoczi geschrieben: > > > On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote: > > > > +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t > > > > threshold_bytes) > > > > +{ > > > > +BlockDriverState *target_bs = bs; > > > > +if (bs->file) { > > > > +target_bs = bs->file; > > > > +} > > > > > > Hmm...I think now I understand why you are trying to use bs->file. This > > > is an attempt to make image formats work with the threshold. > > > > > > Unfortunately the BlockDriverState topology can be more complicated than > > > just 1 level. > > > > > > If we hardcode a strategy to traverse bs->file then it will work in most > > > cases: > > > > > > while (bs->file) { > > > bs = bs->file; > > > } > > > > > > But there are cases like VMDK extent files where a BlockDriverState > > > actually has multiple children. > > > > > > One way to solve this is to require that the management tool tells QEMU > > > which exact BlockDriverState node the threshold applies to. Then QEMU > > > doesn't need any hardcoded policy. But I'm not sure how realistic that > > > it at the moment (whether management tools are uses node names for each > > > node yet), so it may be best to hardcode the bs->file traversal that > > > I've suggested. > > > > > > Kevin: Do you agree? > > > > I have a feeling that we would regret this in the long run because it > > would allow only one special case of a general problem (watching a BDS). > > This means that we'll get inconsistent APIs. > > > > We're "only" talking about an optimisation here, even though a very > > useful one, so I wouldn't easily make compromises here. We should > > probably insist on using the node-name. Management tools need new code > > anyway to make use of the new functionality, so they can implement > > node-name support as well while they're at it. > > Using node-name is the best thing to do. > > My concern is just whether libvirt and other management tools are > actually using node-name yet. I don't think so. They also don't use blockdev-add yet. But that's not a reason for us to add hacks that allow libvirt and other management tools to avoid the proper APIs even in the future. They just need to add support for node-names if they want to use new qemu features. New features require support for new infrastructure, I think that's fair. If they feel that representing complete BDS graphs in their code is too much work for now, they can still keep temporary hacks with hardcoded assumptions in their management code (like setting file.node-name and ignoring other setups). At least it would be temporary hacks there; if we did them in qemu, they would be a permanent API. Kevin pgpmPx5IZ5nrz.pgp Description: PGP signature
Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
On Thu, Nov 20, 2014 at 11:30:53AM +0100, Kevin Wolf wrote: > Am 17.11.2014 um 17:49 hat Stefan Hajnoczi geschrieben: > > On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote: > > > +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t > > > threshold_bytes) > > > +{ > > > +BlockDriverState *target_bs = bs; > > > +if (bs->file) { > > > +target_bs = bs->file; > > > +} > > > > Hmm...I think now I understand why you are trying to use bs->file. This > > is an attempt to make image formats work with the threshold. > > > > Unfortunately the BlockDriverState topology can be more complicated than > > just 1 level. > > > > If we hardcode a strategy to traverse bs->file then it will work in most > > cases: > > > > while (bs->file) { > > bs = bs->file; > > } > > > > But there are cases like VMDK extent files where a BlockDriverState > > actually has multiple children. > > > > One way to solve this is to require that the management tool tells QEMU > > which exact BlockDriverState node the threshold applies to. Then QEMU > > doesn't need any hardcoded policy. But I'm not sure how realistic that > > it at the moment (whether management tools are uses node names for each > > node yet), so it may be best to hardcode the bs->file traversal that > > I've suggested. > > > > Kevin: Do you agree? > > I have a feeling that we would regret this in the long run because it > would allow only one special case of a general problem (watching a BDS). > This means that we'll get inconsistent APIs. > > We're "only" talking about an optimisation here, even though a very > useful one, so I wouldn't easily make compromises here. We should > probably insist on using the node-name. Management tools need new code > anyway to make use of the new functionality, so they can implement > node-name support as well while they're at it. Using node-name is the best thing to do. My concern is just whether libvirt and other management tools are actually using node-name yet. Stefan pgpyKyWSIYpyf.pgp Description: PGP signature
Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
Am 17.11.2014 um 17:49 hat Stefan Hajnoczi geschrieben: > On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote: > > +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t > > threshold_bytes) > > +{ > > +BlockDriverState *target_bs = bs; > > +if (bs->file) { > > +target_bs = bs->file; > > +} > > Hmm...I think now I understand why you are trying to use bs->file. This > is an attempt to make image formats work with the threshold. > > Unfortunately the BlockDriverState topology can be more complicated than > just 1 level. > > If we hardcode a strategy to traverse bs->file then it will work in most > cases: > > while (bs->file) { > bs = bs->file; > } > > But there are cases like VMDK extent files where a BlockDriverState > actually has multiple children. > > One way to solve this is to require that the management tool tells QEMU > which exact BlockDriverState node the threshold applies to. Then QEMU > doesn't need any hardcoded policy. But I'm not sure how realistic that > it at the moment (whether management tools are uses node names for each > node yet), so it may be best to hardcode the bs->file traversal that > I've suggested. > > Kevin: Do you agree? I have a feeling that we would regret this in the long run because it would allow only one special case of a general problem (watching a BDS). This means that we'll get inconsistent APIs. We're "only" talking about an optimisation here, even though a very useful one, so I wouldn't easily make compromises here. We should probably insist on using the node-name. Management tools need new code anyway to make use of the new functionality, so they can implement node-name support as well while they're at it. Kevin pgptAWMrPmrn0.pgp Description: PGP signature
Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
- Original Message - > From: "Stefan Hajnoczi" > To: "Francesco Romani" > Cc: kw...@redhat.com, "Stefan Hajnoczi" , > mdr...@linux.vnet.ibm.com, qemu-devel@nongnu.org, > lcapitul...@redhat.com > Sent: Wednesday, November 19, 2014 4:52:51 PM > Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold > reporting for block devices > > On Tue, Nov 18, 2014 at 03:12:12AM -0500, Francesco Romani wrote: > > > > +static int coroutine_fn before_write_notify(NotifierWithReturn > > > > *notifier, > > > > +void *opaque) > > > > +{ > > > > +BdrvTrackedRequest *req = opaque; > > > > +BlockDriverState *bs = req->bs; > > > > +int64_t amount = 0; > > > > + > > > > +assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0); > > > > +assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0); > > > > > > Does the code still make these assumptions or are the asserts left over > > > from previous versions of the patch? > > > > It's a leftover. > > I understood they don't hurt and add a bit of safety, but if they are > > confusing > > I'll remove them. > > Yes, it made me wonder why. Probably best to remove them. Will do [...] > > At risk of being redundant, in oVirt the usecase is QCOW2 over lvm block > > device, > > and we'd like to be notified about the allocation of the lvm block device, > > which (IIUC) > > is the last bs->file. > > > > This is a simple topology (unless I'm missing something), and that's > > the reason why I go down just one level. > > > > Of course I want a general solution for this change, so... > > There is a block driver for error injection called "blkdebug" (see > docs/blkdebug.txt). Here is an example of the following topology: > > raw_bsd (drive0) -> blkdebug -> raw-posix (test.img) > > qemu-system-x86_64 -drive > if=virtio,format=raw,file.driver=blkdebug,file.image.filename=test.img > > The blkdebug driver is interposing between the raw_bsd (drive0) root and > the raw-posix leaf node. Thanks, I'll have a look [...] > The management tool should not need to inspect the graph because the > graph can change at runtime (e.g. imagine I/O throttling is implemented > as a BlockDriverState node then it could appear/disapper when the > feature is activated/deactivated). Instead the management tool should > name the nodes it knows about and then use those node names. Agreed - and indeed simpler for us (oVirt), which it doesn't hurt :) > > If we descend the bs->file chain, AFAIU the easiest mapping, being the > > device name, > > is easily lost because only the outermost BlockDriverState has a device > > name attached, so when the > > notification trigger > > bdrv_get_device_name() will return NULL > > In the worst case a name string can be passed in along with the > threshold values. OK, I guess to keep a copy of the string with g_strdup() could be good enough start, at least for further discussion. Thanks for your review and for the informations, I'll submit a new revision of the patch in a couple of days, to give to other reviewers some time to jump in. Bests, -- Francesco Romani RedHat Engineering Virtualization R & D Phone: 8261328 IRC: fromani
Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
On Tue, Nov 18, 2014 at 03:12:12AM -0500, Francesco Romani wrote: > > > +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier, > > > +void *opaque) > > > +{ > > > +BdrvTrackedRequest *req = opaque; > > > +BlockDriverState *bs = req->bs; > > > +int64_t amount = 0; > > > + > > > +assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0); > > > +assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0); > > > > Does the code still make these assumptions or are the asserts left over > > from previous versions of the patch? > > It's a leftover. > I understood they don't hurt and add a bit of safety, but if they are > confusing > I'll remove them. Yes, it made me wonder why. Probably best to remove them. > > > +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t > > > threshold_bytes) > > > +{ > > > +BlockDriverState *target_bs = bs; > > > +if (bs->file) { > > > +target_bs = bs->file; > > > +} > > > > Hmm...I think now I understand why you are trying to use bs->file. This > > is an attempt to make image formats work with the threshold. > > > > Unfortunately the BlockDriverState topology can be more complicated than > > just 1 level. > > I thought so but I can't reproduce yet more complex topologies. > Disclosure: I'm testing against the topology I know to be used on oVirt, > lacking > immediate availability of others: suggestions welcome. > > At risk of being redundant, in oVirt the usecase is QCOW2 over lvm block > device, > and we'd like to be notified about the allocation of the lvm block device, > which (IIUC) > is the last bs->file. > > This is a simple topology (unless I'm missing something), and that's > the reason why I go down just one level. > > Of course I want a general solution for this change, so... There is a block driver for error injection called "blkdebug" (see docs/blkdebug.txt). Here is an example of the following topology: raw_bsd (drive0) -> blkdebug -> raw-posix (test.img) qemu-system-x86_64 -drive if=virtio,format=raw,file.driver=blkdebug,file.image.filename=test.img The blkdebug driver is interposing between the raw_bsd (drive0) root and the raw-posix leaf node. > > If we hardcode a strategy to traverse bs->file then it will work in most > > cases: > > > > while (bs->file) { > > bs = bs->file; > > } > > > > But there are cases like VMDK extent files where a BlockDriverState > > actually has multiple children. > > > > One way to solve this is to require that the management tool tells QEMU > > which exact BlockDriverState node the threshold applies to. Then QEMU > > doesn't need any hardcoded policy. But I'm not sure how realistic that > > it at the moment (whether management tools are uses node names for each > > node yet), so it may be best to hardcode the bs->file traversal that > > I've suggested. > > oVirt relies on libvirt[1], and uses device node (e.g. 'vda'). > > BTW, I haven't found a way to inspect from the outside (e.g. monitor command) > the BlockDriverState > topology, there is a way I'm missing? You can get the BlockDriverState and ->backing_hd chain using the query-block QMP command. I'm not aware of a command that returns the full graph of BlockDriverState nodes. The management tool should not need to inspect the graph because the graph can change at runtime (e.g. imagine I/O throttling is implemented as a BlockDriverState node then it could appear/disapper when the feature is activated/deactivated). Instead the management tool should name the nodes it knows about and then use those node names. > Another issue I don't yet have a proper solution is related to this one. > > The management app will have deal with VM with more than one block device > disk, so we need > a way to map the notification with the corresponding device. > > If we descend the bs->file chain, AFAIU the easiest mapping, being the device > name, > is easily lost because only the outermost BlockDriverState has a device name > attached, so when the > notification trigger > bdrv_get_device_name() will return NULL In the worst case a name string can be passed in along with the threshold values. > I believe we don't necessarily need a device name in the notification, for > example something > like an opaque cookie set together with the threshold and returned back with > the notification > will suffice. Of course the device name is nicer :) Agreed. pgphSm0Y8l855.pgp Description: PGP signature
Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
- Original Message - > From: "Stefan Hajnoczi" > To: "Francesco Romani" > Cc: kw...@redhat.com, lcapitul...@redhat.com, qemu-devel@nongnu.org, > stefa...@redhat.com, mdr...@linux.vnet.ibm.com > Sent: Monday, November 17, 2014 5:49:36 PM > Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold > reporting for block devices > > On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote: > > Sorry for the long review delay. Looks pretty good, just one real issue > to think about at the bottom. Hi Stefan, thanks for the review and no problem for the delay :) > > > +static void usage_threshold_disable(BlockDriverState *bs) > > +{ > > It would be safest to make this idempotent: > > if (!usage_threshold_is_set(bs)) { > return; > } > > That way it can be called multiple times. Will change. > > > +notifier_with_return_remove(&bs->wr_usage_threshold_notifier); > > +bs->wr_offset_threshold = 0; > > +} > > + > > +static int usage_threshold_is_set(const BlockDriverState *bs) > > +{ > > +return !!(bs->wr_offset_threshold > 0); > > +} > > Please use the bool type instead of an int return value. Sure, will fix. > > +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier, > > +void *opaque) > > +{ > > +BdrvTrackedRequest *req = opaque; > > +BlockDriverState *bs = req->bs; > > +int64_t amount = 0; > > + > > +assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0); > > +assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0); > > Does the code still make these assumptions or are the asserts left over > from previous versions of the patch? It's a leftover. I understood they don't hurt and add a bit of safety, but if they are confusing I'll remove them. > > +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t > > threshold_bytes) > > +{ > > +BlockDriverState *target_bs = bs; > > +if (bs->file) { > > +target_bs = bs->file; > > +} > > Hmm...I think now I understand why you are trying to use bs->file. This > is an attempt to make image formats work with the threshold. > > Unfortunately the BlockDriverState topology can be more complicated than > just 1 level. I thought so but I can't reproduce yet more complex topologies. Disclosure: I'm testing against the topology I know to be used on oVirt, lacking immediate availability of others: suggestions welcome. At risk of being redundant, in oVirt the usecase is QCOW2 over lvm block device, and we'd like to be notified about the allocation of the lvm block device, which (IIUC) is the last bs->file. This is a simple topology (unless I'm missing something), and that's the reason why I go down just one level. Of course I want a general solution for this change, so... > If we hardcode a strategy to traverse bs->file then it will work in most > cases: > > while (bs->file) { > bs = bs->file; > } > > But there are cases like VMDK extent files where a BlockDriverState > actually has multiple children. > > One way to solve this is to require that the management tool tells QEMU > which exact BlockDriverState node the threshold applies to. Then QEMU > doesn't need any hardcoded policy. But I'm not sure how realistic that > it at the moment (whether management tools are uses node names for each > node yet), so it may be best to hardcode the bs->file traversal that > I've suggested. oVirt relies on libvirt[1], and uses device node (e.g. 'vda'). BTW, I haven't found a way to inspect from the outside (e.g. monitor command) the BlockDriverState topology, there is a way I'm missing? Another issue I don't yet have a proper solution is related to this one. The management app will have deal with VM with more than one block device disk, so we need a way to map the notification with the corresponding device. If we descend the bs->file chain, AFAIU the easiest mapping, being the device name, is easily lost because only the outermost BlockDriverState has a device name attached, so when the notification trigger bdrv_get_device_name() will return NULL I believe we don't necessarily need a device name in the notification, for example something like an opaque cookie set together with the threshold and returned back with the notification will suffice. Of course the device name is nicer :) +++ [1] if that can help further to understand the usecase, these are the libvirt APIs being used by oVirt: http://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockStatsFlags http://libvirt.org/html/libvirt-libvirt-domain.html#virDomainGetBlockInfo both relies on the output[2] of 'query-blockstats' monitor command. [2] AFAIU -but this is just my guesswork!- it also assumes a quite simple topology like I did Thanks and bests, -- Francesco Romani RedHat Engineering Virtualization R & D Phone: 8261328 IRC: fromani
Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote: Sorry for the long review delay. Looks pretty good, just one real issue to think about at the bottom. > +static void usage_threshold_disable(BlockDriverState *bs) > +{ It would be safest to make this idempotent: if (!usage_threshold_is_set(bs)) { return; } That way it can be called multiple times. > +notifier_with_return_remove(&bs->wr_usage_threshold_notifier); > +bs->wr_offset_threshold = 0; > +} > + > +static int usage_threshold_is_set(const BlockDriverState *bs) > +{ > +return !!(bs->wr_offset_threshold > 0); > +} Please use the bool type instead of an int return value. > + > +static int64_t usage_threshold_exceeded(const BlockDriverState *bs, > +const BdrvTrackedRequest *req) > +{ > +if (usage_threshold_is_set(bs)) { > +int64_t amount = req->offset + req->bytes - bs->wr_offset_threshold; > +if (amount > 0) { > +return amount; > +} > +} > +return 0; > +} > + > +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier, > +void *opaque) > +{ > +BdrvTrackedRequest *req = opaque; > +BlockDriverState *bs = req->bs; > +int64_t amount = 0; > + > +assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0); > +assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0); Does the code still make these assumptions or are the asserts left over from previous versions of the patch? > + > +amount = usage_threshold_exceeded(bs, req); > +if (amount > 0) { > +qapi_event_send_block_usage_threshold( > +bdrv_get_device_name(bs), /* FIXME: this does not work */ > +amount, > +bs->wr_offset_threshold, > +&error_abort); > + > +/* autodisable to avoid to flood the monitor */ > +usage_threshold_disable(bs); > +} > + > +return 0; /* should always let other notifiers run */ > +} > + > +static void usage_threshold_register_notifier(BlockDriverState *bs) > +{ > +bs->wr_usage_threshold_notifier.notify = before_write_notify; > +notifier_with_return_list_add(&bs->before_write_notifiers, > + &bs->wr_usage_threshold_notifier); > +} > + > +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes) > +{ > +BlockDriverState *target_bs = bs; > +if (bs->file) { > +target_bs = bs->file; > +} Hmm...I think now I understand why you are trying to use bs->file. This is an attempt to make image formats work with the threshold. Unfortunately the BlockDriverState topology can be more complicated than just 1 level. If we hardcode a strategy to traverse bs->file then it will work in most cases: while (bs->file) { bs = bs->file; } But there are cases like VMDK extent files where a BlockDriverState actually has multiple children. One way to solve this is to require that the management tool tells QEMU which exact BlockDriverState node the threshold applies to. Then QEMU doesn't need any hardcoded policy. But I'm not sure how realistic that it at the moment (whether management tools are uses node names for each node yet), so it may be best to hardcode the bs->file traversal that I've suggested. Kevin: Do you agree? pgpr9in8XhYSJ.pgp Description: PGP signature
[Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
Managing applications, like oVirt (http://www.ovirt.org), make extensive use of thin-provisioned disk images. To let the guest run smoothly and be not unnecessarily paused, oVirt sets a disk usage threshold (so called 'high water mark') based on the occupation of the device, and automatically extends the image once the threshold is reached or exceeded. In order to detect the crossing of the threshold, oVirt has no choice but aggressively polling the QEMU monitor using the query-blockstats command. This lead to unnecessary system load, and is made even worse under scale: deployments with hundreds of VMs are no longer rare. To fix this, this patch adds: * A new monitor command to set a mark for a given block device. * A new event to report if a block device usage exceeds the threshold. This will allow the managing application to drop the polling altogether and just wait for a watermark crossing event. Signed-off-by: Francesco Romani --- block/Makefile.objs | 1 + block/qapi.c| 3 + block/usage-threshold.c | 124 include/block/block_int.h | 4 ++ include/block/usage-threshold.h | 39 + qapi/block-core.json| 46 ++- qmp-commands.hx | 26 + 7 files changed, 242 insertions(+), 1 deletion(-) create mode 100644 block/usage-threshold.c create mode 100644 include/block/usage-threshold.h diff --git a/block/Makefile.objs b/block/Makefile.objs index 04b0e43..43e381d 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -20,6 +20,7 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o block-obj-$(CONFIG_LIBSSH2) += ssh.o block-obj-y += accounting.o +block-obj-y += usage-threshold.o common-obj-y += stream.o common-obj-y += commit.o diff --git a/block/qapi.c b/block/qapi.c index 1301144..3bb0bc7 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -24,6 +24,7 @@ #include "block/qapi.h" #include "block/block_int.h" +#include "block/usage-threshold.h" #include "qmp-commands.h" #include "qapi-visit.h" #include "qapi/qmp-output-visitor.h" @@ -315,6 +316,8 @@ static void bdrv_query_info(BlockBackend *blk, BlockInfo **p_info, } } +info->write_threshold = bdrv_get_usage_threshold(bs); + *p_info = info; return; diff --git a/block/usage-threshold.c b/block/usage-threshold.c new file mode 100644 index 000..31a587d --- /dev/null +++ b/block/usage-threshold.c @@ -0,0 +1,124 @@ +/* + * QEMU System Emulator block usage threshold notification + * + * Copyright Red Hat, Inc. 2014 + * + * Authors: + * Francesco Romani + * + * This work is licensed under the terms of the GNU LGPL, version 2 or later. + * See the COPYING.LIB file in the top-level directory. + */ + +#include "block/block_int.h" +#include "block/coroutine.h" +#include "block/usage-threshold.h" +#include "qemu/notify.h" +#include "qapi-event.h" +#include "qmp-commands.h" + + +int64_t bdrv_get_usage_threshold(const BlockDriverState *bs) +{ +if (bs == NULL) { +return 0; +} +if (bs->file) { +return bs->file->wr_offset_threshold; +} +return bs->wr_offset_threshold; +} + +static void usage_threshold_disable(BlockDriverState *bs) +{ +notifier_with_return_remove(&bs->wr_usage_threshold_notifier); +bs->wr_offset_threshold = 0; +} + +static int usage_threshold_is_set(const BlockDriverState *bs) +{ +return !!(bs->wr_offset_threshold > 0); +} + +static int64_t usage_threshold_exceeded(const BlockDriverState *bs, +const BdrvTrackedRequest *req) +{ +if (usage_threshold_is_set(bs)) { +int64_t amount = req->offset + req->bytes - bs->wr_offset_threshold; +if (amount > 0) { +return amount; +} +} +return 0; +} + +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier, +void *opaque) +{ +BdrvTrackedRequest *req = opaque; +BlockDriverState *bs = req->bs; +int64_t amount = 0; + +assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0); +assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0); + +amount = usage_threshold_exceeded(bs, req); +if (amount > 0) { +qapi_event_send_block_usage_threshold( +bdrv_get_device_name(bs), /* FIXME: this does not work */ +amount, +bs->wr_offset_threshold, +&error_abort); + +/* autodisable to avoid to flood the monitor */ +usage_threshold_disable(bs); +} + +return 0; /* should always let other notifiers run */ +} + +static void usage_threshold_register_notifier(BlockDriverState *bs) +{ +bs->wr_usage_threshold_notifier.notify = before_write_notify; +notifier_with_return_list_add(&bs->before_write_notifiers, + &bs->wr_usage_threshold_notifier); +} + +void bdrv_set_usage_threshol