On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
> Backup-top filter does copy-before-write operation. It should be
> inserted above active disk and has a target node for CBW, like the
> following:
> 
>     +-------+
>     | Guest |
>     +-------+
>         |r,w
>         v
>     +------------+  target   +---------------+
>     | backup_top |---------->| target(qcow2) |
>     +------------+   CBW     +---------------+
>         |
> backing |r,w
>         v
>     +-------------+
>     | Active disk |
>     +-------------+
> 
> The driver will be used in backup instead of write-notifiers.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
> ---
>  block/backup-top.h  |  64 +++++++++
>  block/backup-top.c  | 322 ++++++++++++++++++++++++++++++++++++++++++++
>  block/Makefile.objs |   2 +
>  3 files changed, 388 insertions(+)
>  create mode 100644 block/backup-top.h
>  create mode 100644 block/backup-top.c
> 
> diff --git a/block/backup-top.h b/block/backup-top.h
> new file mode 100644
> index 0000000000..788e18c358
> --- /dev/null
> +++ b/block/backup-top.h
> @@ -0,0 +1,64 @@
> +/*
> + * backup-top filter driver
> + *
> + * The driver performs Copy-Before-Write (CBW) operation: it is injected 
> above
> + * some node, and before each write it copies _old_ data to the target node.
> + *
> + * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved.
> + *
> + * Author:
> + *  Sementsov-Ogievskiy Vladimir <vsement...@virtuozzo.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef BACKUP_TOP_H
> +#define BACKUP_TOP_H
> +
> +#include "qemu/osdep.h"
> +
> +#include "block/block_int.h"
> +
> +typedef void (*BackupTopProgressCallback)(uint64_t done, void *opaque);
> +typedef struct BDRVBackupTopState {
> +    HBitmap *copy_bitmap; /* what should be copied to @target on guest 
> write. */
> +    BdrvChild *target;
> +
> +    BackupTopProgressCallback progress_cb;
> +    void *progress_opaque;
> +} BDRVBackupTopState;
> +
> +/*
> + * bdrv_backup_top_append
> + *
> + * Append backup_top filter node above @source node. @target node will 
> receive
> + * the data backed up during CBE operations. New filter together with @target
> + * node are attached to @source aio context.
> + *
> + * The resulting filter node is implicit.

Why?  It’s just as easy for the caller to just make it implicit if it
should be.  (And usually the caller should decide that.)

> + *
> + * @copy_bitmap selects regions which needs CBW. Furthermore, backup_top will
> + * use exactly this bitmap, so it may be used to control backup_top behavior
> + * dynamically. Caller should not release @copy_bitmap during life-time of
> + * backup_top. Progress is tracked by calling @progress_cb function.
> + */
> +BlockDriverState *bdrv_backup_top_append(
> +        BlockDriverState *source, BlockDriverState *target,
> +        HBitmap *copy_bitmap, Error **errp);
> +void bdrv_backup_top_set_progress_callback(
> +        BlockDriverState *bs, BackupTopProgressCallback progress_cb,
> +        void *progress_opaque);
> +void bdrv_backup_top_drop(BlockDriverState *bs);
> +
> +#endif /* BACKUP_TOP_H */
> diff --git a/block/backup-top.c b/block/backup-top.c
> new file mode 100644
> index 0000000000..1daa02f539
> --- /dev/null
> +++ b/block/backup-top.c
> @@ -0,0 +1,322 @@
> +/*
> + * backup-top filter driver
> + *
> + * The driver performs Copy-Before-Write (CBW) operation: it is injected 
> above
> + * some node, and before each write it copies _old_ data to the target node.
> + *
> + * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved.
> + *
> + * Author:
> + *  Sementsov-Ogievskiy Vladimir <vsement...@virtuozzo.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "qemu/cutils.h"
> +#include "qapi/error.h"
> +#include "block/block_int.h"
> +#include "block/qdict.h"
> +
> +#include "block/backup-top.h"
> +
> +static coroutine_fn int backup_top_co_preadv(
> +        BlockDriverState *bs, uint64_t offset, uint64_t bytes,
> +        QEMUIOVector *qiov, int flags)
> +{
> +    /*
> +     * Features to be implemented:
> +     * F1. COR. save read data to fleecing target for fast access
> +     *     (to reduce reads). This possibly may be done with use of 
> copy-on-read
> +     *     filter, but we need an ability to make COR requests optional: for
> +     *     example, if target is a ram-cache, and if it is full now, we 
> should
> +     *     skip doing COR request, as it is actually not necessary.
> +     *
> +     * F2. Feature for guest: read from fleecing target if data is in 
> ram-cache
> +     *     and is unchanged
> +     */
> +
> +    return bdrv_co_preadv(bs->backing, offset, bytes, qiov, flags);
> +}
> +
> +static coroutine_fn int backup_top_cbw(BlockDriverState *bs, uint64_t offset,
> +                                       uint64_t bytes)
> +{
> +    int ret = 0;
> +    BDRVBackupTopState *s = bs->opaque;
> +    uint64_t gran = 1UL << hbitmap_granularity(s->copy_bitmap);
> +    uint64_t end = QEMU_ALIGN_UP(offset + bytes, gran);
> +    uint64_t off = QEMU_ALIGN_DOWN(offset, gran), len;

The “, len” is a bit weirdly placed there, why not define it on a
separate line as "uint64_t len = end - off"?

> +    void *buf = qemu_blockalign(bs, end - off);
> +
> +    /*
> +     * Features to be implemented:
> +     * F3. parallelize copying loop
> +     * F4. detect zeroes ? or, otherwise, drop detect zeroes from backup code
> +     *     and just enable zeroes detecting on target
> +     * F5. use block_status ?
> +     * F6. don't copy clusters which are already cached by COR [see F1]
> +     * F7. if target is ram-cache and it is full, there should be a 
> possibility
> +     *     to drop not necessary data (cached by COR [see F1]) to handle CBW
> +     *     fast.

Also “drop BDRV_REQ_SERIALISING from writes to s->target unless necessary”?

> +     */
> +
> +    len = end - off;
> +    while (hbitmap_next_dirty_area(s->copy_bitmap, &off, &len)) {
> +        hbitmap_reset(s->copy_bitmap, off, len);
> +
> +        ret = bdrv_co_pread(bs->backing, off, len, buf,
> +                            BDRV_REQ_NO_SERIALISING);
> +        if (ret < 0) {
> +            hbitmap_set(s->copy_bitmap, off, len);
> +            goto out;
> +        }
> +
> +        ret = bdrv_co_pwrite(s->target, off, len, buf, BDRV_REQ_SERIALISING);
> +        if (ret < 0) {
> +            hbitmap_set(s->copy_bitmap, off, len);
> +            goto out;
> +        }
> +
> +        if (s->progress_cb) {
> +            s->progress_cb(len, s->progress_opaque);
> +        }
> +        off += len;
> +        if (off >= end) {
> +            break;
> +        }
> +        len = end - off;
> +    }
> +
> +out:
> +    qemu_vfree(buf);
> +
> +    /*
> +     * F8. we fail guest request in case of error. We can alter it by
> +     * possibility to fail copying process instead, or retry several times, 
> or
> +     * may be guest pause, etc.
> +     */
> +    return ret;
> +}
> +
> +static int coroutine_fn backup_top_co_pdiscard(BlockDriverState *bs,
> +                                               int64_t offset, int bytes)
> +{
> +    int ret = backup_top_cbw(bs, offset, bytes);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    /*
> +     * Features to be implemented:
> +     * F9. possibility of lazy discard: just defer the discard after fleecing
> +     *     completion. If write (or new discard) occurs to the same area, 
> just
> +     *     drop deferred discard.
> +     */
> +
> +    return bdrv_co_pdiscard(bs->backing, offset, bytes);
> +}
> +
> +static int coroutine_fn backup_top_co_pwrite_zeroes(BlockDriverState *bs,
> +        int64_t offset, int bytes, BdrvRequestFlags flags)
> +{
> +    int ret = backup_top_cbw(bs, offset, bytes);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    return bdrv_co_pwrite_zeroes(bs->backing, offset, bytes, flags);
> +}
> +
> +static coroutine_fn int backup_top_co_pwritev(BlockDriverState *bs,
> +                                              uint64_t offset,
> +                                              uint64_t bytes,
> +                                              QEMUIOVector *qiov, int flags)
> +{
> +    if (!(flags & BDRV_REQ_WRITE_UNCHANGED)) {
> +        int ret = backup_top_cbw(bs, offset, bytes);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +    }
> +
> +    return bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags);
> +}
> +
> +static int coroutine_fn backup_top_co_flush(BlockDriverState *bs)
> +{
> +    if (!bs->backing) {
> +        return 0;
> +    }
> +
> +    return bdrv_co_flush(bs->backing->bs);

Should we flush the target, too?

> +}
> +
> +static void backup_top_refresh_filename(BlockDriverState *bs)
> +{
> +    if (bs->backing == NULL) {
> +        /*
> +         * we can be here after failed bdrv_attach_child in
> +         * bdrv_set_backing_hd
> +         */
> +        return;
> +    }
> +    bdrv_refresh_filename(bs->backing->bs);

bdrv_refresh_filename() should have done this already.

> +    pstrcpy(bs->exact_filename, sizeof(bs->exact_filename),
> +            bs->backing->bs->filename);
> +}
> +
> +static void backup_top_child_perm(BlockDriverState *bs, BdrvChild *c,
> +                                  const BdrvChildRole *role,
> +                                  BlockReopenQueue *reopen_queue,
> +                                  uint64_t perm, uint64_t shared,
> +                                  uint64_t *nperm, uint64_t *nshared)
> +{
> +    /*
> +     * We have HBitmap in the state, its size is fixed, so we never allow
> +     * resize.
> +     */
> +    uint64_t rw = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
> +                  BLK_PERM_WRITE;
> +
> +    bdrv_filter_default_perms(bs, c, role, reopen_queue, perm, shared,
> +                              nperm, nshared);
> +
> +    *nperm = *nperm & rw;
> +    *nshared = *nshared & rw;

Somehow, that bdrv_filter_default_perm() call doesn’t make this function
easier for me.

It seems like this is basically just “*nperm = perm & rw” and
“*nshared = shared & rw”.

> +
> +    if (role == &child_file) {
> +        /*
> +         * Target child
> +         *
> +         * Share write to target (child_file), to not interfere
> +         * with guest writes to its disk which may be in target backing 
> chain.
> +         */
> +        if (perm & BLK_PERM_WRITE) {
> +            *nshared = *nshared | BLK_PERM_WRITE;

Why not always share WRITE on the target?

> +        }
> +    } else {
> +        /* Source child */
> +        if (perm & BLK_PERM_WRITE) {

Or WRITE_UNCHANGED, no?

> +            *nperm = *nperm | BLK_PERM_CONSISTENT_READ;
> +        }
> +        *nshared =
> +            *nshared & (BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED);

And here it isn’t “*nshared = shared & rw” but “rw & ~WRITE”.

I feel like this function would be simpler if you just set *nperm and
*nshared separately in the two branches of this condition, without
setting a “default” first.

> +    }
> +}
> +
> +BlockDriver bdrv_backup_top_filter = {
> +    .format_name = "backup-top",
> +    .instance_size = sizeof(BDRVBackupTopState),
> +
> +    .bdrv_co_preadv             = backup_top_co_preadv,
> +    .bdrv_co_pwritev            = backup_top_co_pwritev,
> +    .bdrv_co_pwrite_zeroes      = backup_top_co_pwrite_zeroes,
> +    .bdrv_co_pdiscard           = backup_top_co_pdiscard,
> +    .bdrv_co_flush              = backup_top_co_flush,
> +
> +    .bdrv_co_block_status       = bdrv_co_block_status_from_backing,
> +
> +    .bdrv_refresh_filename      = backup_top_refresh_filename,
> +
> +    .bdrv_child_perm            = backup_top_child_perm,
> +
> +    .is_filter = true,
> +};
> +
> +BlockDriverState *bdrv_backup_top_append(BlockDriverState *source,
> +                                         BlockDriverState *target,
> +                                         HBitmap *copy_bitmap,
> +                                         Error **errp)
> +{
> +    Error *local_err = NULL;
> +    BDRVBackupTopState *state;
> +    BlockDriverState *top = bdrv_new_open_driver(&bdrv_backup_top_filter,
> +                                                 NULL, BDRV_O_RDWR, errp);
> +
> +    if (!top) {
> +        return NULL;
> +    }
> +
> +    top->implicit = true;

As I said above, I think the caller should set this.

> +    top->total_sectors = source->total_sectors;
> +    top->bl.opt_mem_alignment = MAX(bdrv_opt_mem_align(source),
> +                                    bdrv_opt_mem_align(target));
> +    top->opaque = state = g_new0(BDRVBackupTopState, 1);
> +    state->copy_bitmap = copy_bitmap;
> +
> +    bdrv_ref(target);
> +    state->target = bdrv_attach_child(top, target, "target", &child_file, 
> errp);
> +    if (!state->target) {
> +        bdrv_unref(target);
> +        bdrv_unref(top);
> +        return NULL;
> +    }
> +
> +    bdrv_set_aio_context(top, bdrv_get_aio_context(source));
> +    bdrv_set_aio_context(target, bdrv_get_aio_context(source));

I suppose these calls aren’t necessary anymore (compare d0ee0204f4009).
 (I’m not fully sure, though.  In any case, they would need to be
bdrv_try_set_aio_context() if they were still necessary.)

> +
> +    bdrv_drained_begin(source);
> +
> +    bdrv_ref(top);
> +    bdrv_append(top, source, &local_err);
> +    if (local_err) {
> +        error_prepend(&local_err, "Cannot append backup-top filter: ");
> +    }
> +
> +    bdrv_drained_end(source);
> +
> +    if (local_err) {
> +        bdrv_unref_child(top, state->target);
> +        bdrv_unref(top);
> +        error_propagate(errp, local_err);
> +        return NULL;
> +    }
> +
> +    return top;
> +}

I guess it would be nice if it users could blockdev-add a backup-top
node to basically get a backup with sync=none.

(The bdrv_open implementation should then create a new bitmap and
initialize it to be fully set.)

But maybe it wouldn’t be so nice and I just have feverish dreams.

> +void bdrv_backup_top_set_progress_callback(
> +        BlockDriverState *bs, BackupTopProgressCallback progress_cb,
> +        void *progress_opaque)
> +{
> +    BDRVBackupTopState *s = bs->opaque;
> +
> +    s->progress_cb = progress_cb;
> +    s->progress_opaque = progress_opaque;
> +}
> +
> +void bdrv_backup_top_drop(BlockDriverState *bs)
> +{
> +    BDRVBackupTopState *s = bs->opaque;
> +    AioContext *aio_context = bdrv_get_aio_context(bs);
> +
> +    aio_context_acquire(aio_context);
> +
> +    bdrv_drained_begin(bs);
> +
> +    bdrv_child_try_set_perm(bs->backing, 0, BLK_PERM_ALL, &error_abort);

Pre-existing in other jobs, but I think calling this function is
dangerous.  (Which is why I remove it in my “block: Ignore loosening
perm restrictions failures” series.)

> +    bdrv_replace_node(bs, backing_bs(bs), &error_abort);
> +    bdrv_set_backing_hd(bs, NULL, &error_abort);

I think some of this function should be in a .bdrv_close()
implementation, for example this bdrv_set_backing_hd() call.

> +    bdrv_drained_end(bs);
> +
> +    if (s->target) {
> +        bdrv_unref_child(bs, s->target);

And this.  Well, neither needs to be in a .bdrv_close() implementation,
actually, because bdrv_close() will just do it by itself.

I suppose you’re trying to remove the children so the node is no longer
usable after this point; but that isn’t quite right, then.  The filter
functions still refer to s->target and bs->backing, so you need to stop
them from doing anything then.

At this point, you might as well add a variable to BDRVBackupTopState
that says whether the filter is still supposed to be usable (like the
“stop” variable I added to mirror in “block/mirror: Fix child
permissions”).  If so, the permission function should signal 0/ALL for
the permissions on backing (and probably target), and all filter
functions always immediately return an error.

So I don’t think backing and target need to be removed here.  We can
wait for that until bdrv_close(), but we should ensure that the filter
really is unusable after this function has been called.

Max

> +    }
> +    bdrv_unref(bs);
> +
> +    aio_context_release(aio_context);
> +}
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index ae11605c9f..dfbdfe6ab4 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -40,6 +40,8 @@ block-obj-y += throttle.o copy-on-read.o
>  
>  block-obj-y += crypto.o
>  
> +block-obj-y += backup-top.o
> +
>  common-obj-y += stream.o
>  
>  nfs.o-libs         := $(LIBNFS_LIBS)
> 


Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to