Re: [Qemu-devel] [Qemu-block] [PATCH] rbd: Detect rbd image resizes and propagate them

2017-09-14 Thread Adam Wolfe Gordon via Qemu-devel
On Wed, Sep 13, 2017 at 6:47 PM, John Snow  wrote:
> On 09/13/2017 05:36 PM, Adam Wolfe Gordon via Qemu-devel wrote:
>> On Wed, Sep 13, 2017 at 2:53 PM, John Snow  wrote:
>> We have a storage orchestration service that manages our ceph block
>> storage clusters and doesn't interact directly with qemu. Volumes get
>> resized through the orchestration service, which (after doing some
>
> resized bigger, one hopes ...

Indeed! That's actually one of the reasons for this change: it's much
easier to guarantee that we're always resizing bigger if all resizes
are issued in the same place.

>> Previously, we would notify the VM of the resize by issuing a
>> blockresize via qmp after doing the resize itself externally. That
>> meant we were actually resizing the rbd image twice (though the second
>> was, hopefully, a no-op). We occasionally had trouble with the resize
>> issued by qemu getting stuck and blocking the qemu main thread.
>> Detecting the out-of-band resize lets us avoid the extra rbd_resize
>> call and means that we never modify an rbd image's metadata from qemu.
>>
>
> Hm, I see... It sounds like you want an operation here that lets us
> detect medium changes without actually attempting to orchestrate one.
>
> It smells like you want the second half of bdrv_truncate without
> actually issuing the call. Perhaps you could split this function into
> its two halves, and in the event of an external resize being detected,
> you could call the latter-half portion of bdrv_truncate.

Yes, I think that agrees with what Jason Dillaman suggested yesterday,
and makes sense to me. Will do that for a v2 of this patch.

> ...if the drive is configured to automatically detect those events, that
> is. Conceivably not all resize events that QEMU *could* detect *should*
> automatically result in guest-visible changes as soon as they occur.

Good point. I can add an option for this to the rbd driver. My gut
feeling is that the default should be off (i.e., the existing
behavior) to avoid any surprises for users who upgrade.

Thanks for the review. I'll get a v2 out as time allows.

-- awg



Re: [Qemu-devel] [Qemu-block] [PATCH] rbd: Detect rbd image resizes and propagate them

2017-09-13 Thread John Snow


On 09/13/2017 05:36 PM, Adam Wolfe Gordon via Qemu-devel wrote:
> On Wed, Sep 13, 2017 at 2:53 PM, John Snow  wrote:
>> On 09/13/2017 12:44 PM, Adam Wolfe Gordon wrote:
>>> Register a watcher with rbd so that we get notified when an image is
>>> resized. Propagate resizes to parent block devices so that guest devices
>>> get resized without user intervention.
>>>
>>> Signed-off-by: Adam Wolfe Gordon 
>>> ---
>>> Hello!
>>>
>>> We are using this patch in production at DigitalOcean and have tested it 
>>> fairly
>>> extensively. However, we use the same block device configuration 
>>> everywhere, so
>>> I'd be interested to get an answer to the question I've left in the code:>
 /* NOTE: This assumes there's only one layer between us and the
block-backend. Is this always true? */
>>>
>>
>> Even if it were always true presently, the fact that the block layer
>> storage graph can be configured in arbitrary ways suggests that it might
>> not always be true in the future.
>>
>> Probably in practice MOST drives are configured like
>> [BB]<--[Format]<--[Protocol], but the block layer is adding a lot of fun
>> stuff like filter nodes for throttling or debugging, so it's probably
>> too much to assert that the length of the chain will ALWAYS be exactly two.
> 
> OK, thanks - that's exactly what I was wondering. I can add some code
> to traverse the parents until we get to the BB.
> 
>> Can I play dumb and ask what you are trying to accomplish? The concept
>> of a drive deciding to resize itself is not something I have a model for
>> understanding, presently...
> 
> Good question; I probably should have explained the use-case in the
> commit message.
> 
> We have a storage orchestration service that manages our ceph block
> storage clusters and doesn't interact directly with qemu. Volumes get
> resized through the orchestration service, which (after doing some

resized bigger, one hopes ...

> checks, updating external metadata, etc.) issues the resize via
> librbd. The drive doesn't exactly resize itself, but the trigger is
> out-of-band from any VM the drive may be attached to.

>From QEMU's POV, the drive resized itself! Facts on the ground may be
different, of course.

> 
> Previously, we would notify the VM of the resize by issuing a
> blockresize via qmp after doing the resize itself externally. That
> meant we were actually resizing the rbd image twice (though the second
> was, hopefully, a no-op). We occasionally had trouble with the resize
> issued by qemu getting stuck and blocking the qemu main thread.
> Detecting the out-of-band resize lets us avoid the extra rbd_resize
> call and means that we never modify an rbd image's metadata from qemu.
> 

Hm, I see... It sounds like you want an operation here that lets us
detect medium changes without actually attempting to orchestrate one.

It smells like you want the second half of bdrv_truncate without
actually issuing the call. Perhaps you could split this function into
its two halves, and in the event of an external resize being detected,
you could call the latter-half portion of bdrv_truncate.

...if the drive is configured to automatically detect those events, that
is. Conceivably not all resize events that QEMU *could* detect *should*
automatically result in guest-visible changes as soon as they occur.

--js



Re: [Qemu-devel] [Qemu-block] [PATCH] rbd: Detect rbd image resizes and propagate them

2017-09-13 Thread Adam Wolfe Gordon via Qemu-devel
On Wed, Sep 13, 2017 at 3:21 PM, Jason Dillaman  wrote:
> On Wed, 2017-09-13 at 10:44 -0600, Adam Wolfe Gordon wrote:
>> +parent = bs->inherits_from;
>> +if (parent == NULL) {
>> +error_report("bs %s does not have parent", 
>> bdrv_get_device_or_node_name(bs));
>> +return;
>> +}
>> +
>> +/* Force parents to re-read our size. */
>
> Can you just invoke blk_truncate instead?

blk_truncate will end up calling qemu_rbd_truncate, which does an
rbd_resize. The goal of this patch is to avoid calling rbd_resize from
qemu, since we do resizes from an external orchestration service.

That said, this patch also introduces a cache of the size (in
BDRVRBDState.size_bytes), so it would be possible for
qemu_rbd_truncate to avoid calling rbd_resize when the image is
already the right size. If I did that, we could use blk_truncate when
we detect a resize as well. Would that be a desirable change?



Re: [Qemu-devel] [Qemu-block] [PATCH] rbd: Detect rbd image resizes and propagate them

2017-09-13 Thread Adam Wolfe Gordon via Qemu-devel
On Wed, Sep 13, 2017 at 2:53 PM, John Snow  wrote:
> On 09/13/2017 12:44 PM, Adam Wolfe Gordon wrote:
>> Register a watcher with rbd so that we get notified when an image is
>> resized. Propagate resizes to parent block devices so that guest devices
>> get resized without user intervention.
>>
>> Signed-off-by: Adam Wolfe Gordon 
>> ---
>> Hello!
>>
>> We are using this patch in production at DigitalOcean and have tested it 
>> fairly
>> extensively. However, we use the same block device configuration everywhere, 
>> so
>> I'd be interested to get an answer to the question I've left in the code:>
>>> /* NOTE: This assumes there's only one layer between us and the
>>>block-backend. Is this always true? */
>>
>
> Even if it were always true presently, the fact that the block layer
> storage graph can be configured in arbitrary ways suggests that it might
> not always be true in the future.
>
> Probably in practice MOST drives are configured like
> [BB]<--[Format]<--[Protocol], but the block layer is adding a lot of fun
> stuff like filter nodes for throttling or debugging, so it's probably
> too much to assert that the length of the chain will ALWAYS be exactly two.

OK, thanks - that's exactly what I was wondering. I can add some code
to traverse the parents until we get to the BB.

> Can I play dumb and ask what you are trying to accomplish? The concept
> of a drive deciding to resize itself is not something I have a model for
> understanding, presently...

Good question; I probably should have explained the use-case in the
commit message.

We have a storage orchestration service that manages our ceph block
storage clusters and doesn't interact directly with qemu. Volumes get
resized through the orchestration service, which (after doing some
checks, updating external metadata, etc.) issues the resize via
librbd. The drive doesn't exactly resize itself, but the trigger is
out-of-band from any VM the drive may be attached to.

Previously, we would notify the VM of the resize by issuing a
blockresize via qmp after doing the resize itself externally. That
meant we were actually resizing the rbd image twice (though the second
was, hopefully, a no-op). We occasionally had trouble with the resize
issued by qemu getting stuck and blocking the qemu main thread.
Detecting the out-of-band resize lets us avoid the extra rbd_resize
call and means that we never modify an rbd image's metadata from qemu.



Re: [Qemu-devel] [Qemu-block] [PATCH] rbd: Detect rbd image resizes and propagate them

2017-09-13 Thread Jason Dillaman
On Wed, 2017-09-13 at 10:44 -0600, Adam Wolfe Gordon wrote:
> Register a watcher with rbd so that we get notified when an image is
> resized. Propagate resizes to parent block devices so that guest devices
> get resized without user intervention.
> 
> Signed-off-by: Adam Wolfe Gordon 
> ---
> Hello!
> 
> We are using this patch in production at DigitalOcean and have tested it 
> fairly
> extensively. However, we use the same block device configuration everywhere, 
> so
> I'd be interested to get an answer to the question I've left in the code:
> 
> > /* NOTE: This assumes there's only one layer between us and the
> >block-backend. Is this always true? */
> 
> If that isn't true, this patch will need a bit of work to traverse up the 
> block
> device tree and find the block-backend.
> 
> Of course, any other feedback would be welcome too - this is my first foray 
> into
> the qemu code.
> 
> Regards,
> Adam
> 
>  block/rbd.c | 80 
> +
>  1 file changed, 80 insertions(+)
> 
> diff --git a/block/rbd.c b/block/rbd.c
> index 144f350e1f..1c9fcbec1f 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -96,6 +96,8 @@ typedef struct BDRVRBDState {
>  rbd_image_t image;
>  char *image_name;
>  char *snap;
> +uint64_t watcher_handle;
> +uint64_t size_bytes;
>  } BDRVRBDState;
>  
>  static char *qemu_rbd_next_tok(char *src, char delim, char **p)
> @@ -540,6 +542,67 @@ out:
>  return rados_str;
>  }
>  
> +/* BH for when rbd notifies us of an update. */
> +static void qemu_rbd_update_bh(void *arg)
> +{
> +BlockDriverState *parent, *bs = arg;
> +BDRVRBDState *s = bs->opaque;
> +bool was_variable_length;
> +uint64_t new_size_bytes;
> +int64_t new_parent_len;
> +BdrvChild *c;
> +int r;
> +
> +r = rbd_get_size(s->image, _size_bytes);
> +if (r < 0) {
> +error_report("error reading size for %s: %s", s->name, strerror(-r));
> +return;
> +}
> +
> +/* Avoid no-op resizes on non-resize notifications. */
> +if (new_size_bytes == s->size_bytes) {
> +error_printf("skipping non-resize rbd cb\n");

Image update callbacks can be invoked for any number of reasons, not
just resize events. Therefore, you probably don't want to have an error
message printed out here.

> +return;
> +}
> +
> +/* NOTE: This assumes there's only one layer between us and the
> +   block-backend. Is this always true? */

I don't think that can be assumed to be true.

> +parent = bs->inherits_from;
> +if (parent == NULL) {
> +error_report("bs %s does not have parent", 
> bdrv_get_device_or_node_name(bs));
> +return;
> +}
> +
> +/* Force parents to re-read our size. */

Can you just invoke blk_truncate instead? 

> +was_variable_length = bs->drv->has_variable_length;
> +bs->drv->has_variable_length = true;
> +new_parent_len = bdrv_getlength(parent);
> +if (new_parent_len < 0) {
> +error_report("getlength failed on parent %s", 
> bdrv_get_device_or_node_name(parent));
> +bs->drv->has_variable_length = was_variable_length;
> +return;
> +}
> +bs->drv->has_variable_length = was_variable_length;
> +
> +/* Notify block backends that that we have resized.
> +   Copied from bdrv_parent_cb_resize. */
> +QLIST_FOREACH(c, >parents, next_parent) {
> +if (c->role->resize) {
> +c->role->resize(c);
> +}
> +}
> +
> +s->size_bytes = new_size_bytes;
> +}
> +
> +/* Called from non-qemu thread - careful! */
> +static void qemu_rbd_update_cb(void *arg)
> +{
> +BlockDriverState *bs = arg;
> +
> +aio_bh_schedule_oneshot(bdrv_get_aio_context(bs), qemu_rbd_update_bh, 
> bs);
> +}
> +
>  static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
>   Error **errp)
>  {
> @@ -672,9 +735,25 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  }
>  }
>  
> +r = rbd_update_watch(s->image, >watcher_handle, qemu_rbd_update_cb, 
> bs);

This API method was only added in the Ceph Jewel release just over a
year ago. This should probably gracefully degrade if compiled against an
older version of the librbd API.

> +if (r < 0) {
> +error_setg_errno(errp, -r, "error registering watcher on %s", 
> s->name);
> +goto failed_watch;
> +}
> +
> +r = rbd_get_size(s->image, >size_bytes);
> +if (r < 0) {
> +error_setg_errno(errp, -r, "error reading size for %s", s->name);
> +goto failed_sz;
> +}
> +
>  qemu_opts_del(opts);
>  return 0;
>  
> +failed_sz:
> +rbd_update_unwatch(s->image, s->watcher_handle);
> +failed_watch:
> +rbd_close(s->image);
>  failed_open:
>  rados_ioctx_destroy(s->io_ctx);
>  failed_shutdown:
> @@ -712,6 +791,7 @@ static void qemu_rbd_close(BlockDriverState *bs)
>  {
>  BDRVRBDState *s = 

Re: [Qemu-devel] [Qemu-block] [PATCH] rbd: Detect rbd image resizes and propagate them

2017-09-13 Thread John Snow


On 09/13/2017 12:44 PM, Adam Wolfe Gordon wrote:
> Register a watcher with rbd so that we get notified when an image is
> resized. Propagate resizes to parent block devices so that guest devices
> get resized without user intervention.
> 
> Signed-off-by: Adam Wolfe Gordon 
> ---
> Hello!
> 
> We are using this patch in production at DigitalOcean and have tested it 
> fairly
> extensively. However, we use the same block device configuration everywhere, 
> so
> I'd be interested to get an answer to the question I've left in the code:>
>> /* NOTE: This assumes there's only one layer between us and the
>>block-backend. Is this always true? */
> 

Even if it were always true presently, the fact that the block layer
storage graph can be configured in arbitrary ways suggests that it might
not always be true in the future.

Probably in practice MOST drives are configured like
[BB]<--[Format]<--[Protocol], but the block layer is adding a lot of fun
stuff like filter nodes for throttling or debugging, so it's probably
too much to assert that the length of the chain will ALWAYS be exactly two.

Can I play dumb and ask what you are trying to accomplish? The concept
of a drive deciding to resize itself is not something I have a model for
understanding, presently...

> If that isn't true, this patch will need a bit of work to traverse up the 
> block
> device tree and find the block-backend.
> 
> Of course, any other feedback would be welcome too - this is my first foray 
> into
> the qemu code.
> 
> Regards,
> Adam
> 
>  block/rbd.c | 80 
> +
>  1 file changed, 80 insertions(+)
> 
> diff --git a/block/rbd.c b/block/rbd.c
> index 144f350e1f..1c9fcbec1f 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -96,6 +96,8 @@ typedef struct BDRVRBDState {
>  rbd_image_t image;
>  char *image_name;
>  char *snap;
> +uint64_t watcher_handle;
> +uint64_t size_bytes;
>  } BDRVRBDState;
>  
>  static char *qemu_rbd_next_tok(char *src, char delim, char **p)
> @@ -540,6 +542,67 @@ out:
>  return rados_str;
>  }
>  
> +/* BH for when rbd notifies us of an update. */
> +static void qemu_rbd_update_bh(void *arg)
> +{
> +BlockDriverState *parent, *bs = arg;
> +BDRVRBDState *s = bs->opaque;
> +bool was_variable_length;
> +uint64_t new_size_bytes;
> +int64_t new_parent_len;
> +BdrvChild *c;
> +int r;
> +
> +r = rbd_get_size(s->image, _size_bytes);
> +if (r < 0) {
> +error_report("error reading size for %s: %s", s->name, strerror(-r));
> +return;
> +}
> +
> +/* Avoid no-op resizes on non-resize notifications. */
> +if (new_size_bytes == s->size_bytes) {
> +error_printf("skipping non-resize rbd cb\n");
> +return;
> +}
> +
> +/* NOTE: This assumes there's only one layer between us and the
> +   block-backend. Is this always true? */
> +parent = bs->inherits_from;
> +if (parent == NULL) {
> +error_report("bs %s does not have parent", 
> bdrv_get_device_or_node_name(bs));
> +return;
> +}
> +
> +/* Force parents to re-read our size. */
> +was_variable_length = bs->drv->has_variable_length;
> +bs->drv->has_variable_length = true;
> +new_parent_len = bdrv_getlength(parent);
> +if (new_parent_len < 0) {
> +error_report("getlength failed on parent %s", 
> bdrv_get_device_or_node_name(parent));
> +bs->drv->has_variable_length = was_variable_length;
> +return;
> +}
> +bs->drv->has_variable_length = was_variable_length;
> +
> +/* Notify block backends that that we have resized.
> +   Copied from bdrv_parent_cb_resize. */
> +QLIST_FOREACH(c, >parents, next_parent) {
> +if (c->role->resize) {
> +c->role->resize(c);
> +}
> +}
> +
> +s->size_bytes = new_size_bytes;
> +}
> +
> +/* Called from non-qemu thread - careful! */
> +static void qemu_rbd_update_cb(void *arg)
> +{
> +BlockDriverState *bs = arg;
> +
> +aio_bh_schedule_oneshot(bdrv_get_aio_context(bs), qemu_rbd_update_bh, 
> bs);
> +}
> +
>  static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
>   Error **errp)
>  {
> @@ -672,9 +735,25 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  }
>  }
>  
> +r = rbd_update_watch(s->image, >watcher_handle, qemu_rbd_update_cb, 
> bs);
> +if (r < 0) {
> +error_setg_errno(errp, -r, "error registering watcher on %s", 
> s->name);
> +goto failed_watch;
> +}
> +
> +r = rbd_get_size(s->image, >size_bytes);
> +if (r < 0) {
> +error_setg_errno(errp, -r, "error reading size for %s", s->name);
> +goto failed_sz;
> +}
> +
>  qemu_opts_del(opts);
>  return 0;
>  
> +failed_sz:
> +rbd_update_unwatch(s->image, s->watcher_handle);
> +failed_watch:
> +rbd_close(s->image);
>