13.06.2019 18:57, Max Reitz wrote: > On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote: >> Backup-top filter does copy-before-write operation. It should be >> inserted above active disk and has a target node for CBW, like the >> following: >> >> +-------+ >> | Guest | >> +-------+ >> |r,w >> v >> +------------+ target +---------------+ >> | backup_top |---------->| target(qcow2) | >> +------------+ CBW +---------------+ >> | >> backing |r,w >> v >> +-------------+ >> | Active disk | >> +-------------+ >> >> The driver will be used in backup instead of write-notifiers. >> >> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com> >> --- >> block/backup-top.h | 64 +++++++++ >> block/backup-top.c | 322 ++++++++++++++++++++++++++++++++++++++++++++ >> block/Makefile.objs | 2 + >> 3 files changed, 388 insertions(+) >> create mode 100644 block/backup-top.h >> create mode 100644 block/backup-top.c >> >> diff --git a/block/backup-top.h b/block/backup-top.h >> new file mode 100644 >> index 0000000000..788e18c358 >> --- /dev/null >> +++ b/block/backup-top.h >> @@ -0,0 +1,64 @@ >> +/* >> + * backup-top filter driver >> + * >> + * The driver performs Copy-Before-Write (CBW) operation: it is injected >> above >> + * some node, and before each write it copies _old_ data to the target node. >> + * >> + * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved. >> + * >> + * Author: >> + * Sementsov-Ogievskiy Vladimir <vsement...@virtuozzo.com> >> + * >> + * This program is free software; you can redistribute it and/or modify >> + * it under the terms of the GNU General Public License as published by >> + * the Free Software Foundation; either version 2 of the License, or >> + * (at your option) any later version. >> + * >> + * This program is distributed in the hope that it will be useful, >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the >> + * GNU General Public License for more details. >> + * >> + * You should have received a copy of the GNU General Public License >> + * along with this program. If not, see <http://www.gnu.org/licenses/>. >> + */ >> + >> +#ifndef BACKUP_TOP_H >> +#define BACKUP_TOP_H >> + >> +#include "qemu/osdep.h" >> + >> +#include "block/block_int.h" >> + >> +typedef void (*BackupTopProgressCallback)(uint64_t done, void *opaque); >> +typedef struct BDRVBackupTopState { >> + HBitmap *copy_bitmap; /* what should be copied to @target on guest >> write. */ >> + BdrvChild *target; >> + >> + BackupTopProgressCallback progress_cb; >> + void *progress_opaque; >> +} BDRVBackupTopState; >> + >> +/* >> + * bdrv_backup_top_append >> + * >> + * Append backup_top filter node above @source node. @target node will >> receive >> + * the data backed up during CBE operations. New filter together with >> @target >> + * node are attached to @source aio context. >> + * >> + * The resulting filter node is implicit. > > Why? It’s just as easy for the caller to just make it implicit if it > should be. (And usually the caller should decide that.)
Personally, I don't know what are the reasons for filters to bi implicit or not, I just made it like other job-filters.. I can move making-implicit to the caller or drop it at all (if it will work). > >> + * >> + * @copy_bitmap selects regions which needs CBW. Furthermore, backup_top >> will >> + * use exactly this bitmap, so it may be used to control backup_top behavior >> + * dynamically. Caller should not release @copy_bitmap during life-time of >> + * backup_top. Progress is tracked by calling @progress_cb function. >> + */ >> +BlockDriverState *bdrv_backup_top_append( >> + BlockDriverState *source, BlockDriverState *target, >> + HBitmap *copy_bitmap, Error **errp); >> +void bdrv_backup_top_set_progress_callback( >> + BlockDriverState *bs, BackupTopProgressCallback progress_cb, >> + void *progress_opaque); >> +void bdrv_backup_top_drop(BlockDriverState *bs); >> + >> +#endif /* BACKUP_TOP_H */ >> diff --git a/block/backup-top.c b/block/backup-top.c >> new file mode 100644 >> index 0000000000..1daa02f539 >> --- /dev/null >> +++ b/block/backup-top.c >> @@ -0,0 +1,322 @@ >> +/* >> + * backup-top filter driver >> + * >> + * The driver performs Copy-Before-Write (CBW) operation: it is injected >> above >> + * some node, and before each write it copies _old_ data to the target node. >> + * >> + * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved. >> + * >> + * Author: >> + * Sementsov-Ogievskiy Vladimir <vsement...@virtuozzo.com> >> + * >> + * This program is free software; you can redistribute it and/or modify >> + * it under the terms of the GNU General Public License as published by >> + * the Free Software Foundation; either version 2 of the License, or >> + * (at your option) any later version. >> + * >> + * This program is distributed in the hope that it will be useful, >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the >> + * GNU General Public License for more details. >> + * >> + * You should have received a copy of the GNU General Public License >> + * along with this program. If not, see <http://www.gnu.org/licenses/>. >> + */ >> + >> +#include "qemu/osdep.h" >> + >> +#include "qemu/cutils.h" >> +#include "qapi/error.h" >> +#include "block/block_int.h" >> +#include "block/qdict.h" >> + >> +#include "block/backup-top.h" >> + >> +static coroutine_fn int backup_top_co_preadv( >> + BlockDriverState *bs, uint64_t offset, uint64_t bytes, >> + QEMUIOVector *qiov, int flags) >> +{ >> + /* >> + * Features to be implemented: >> + * F1. COR. save read data to fleecing target for fast access >> + * (to reduce reads). This possibly may be done with use of >> copy-on-read >> + * filter, but we need an ability to make COR requests optional: for >> + * example, if target is a ram-cache, and if it is full now, we >> should >> + * skip doing COR request, as it is actually not necessary. >> + * >> + * F2. Feature for guest: read from fleecing target if data is in >> ram-cache >> + * and is unchanged >> + */ >> + >> + return bdrv_co_preadv(bs->backing, offset, bytes, qiov, flags); >> +} >> + >> +static coroutine_fn int backup_top_cbw(BlockDriverState *bs, uint64_t >> offset, >> + uint64_t bytes) >> +{ >> + int ret = 0; >> + BDRVBackupTopState *s = bs->opaque; >> + uint64_t gran = 1UL << hbitmap_granularity(s->copy_bitmap); >> + uint64_t end = QEMU_ALIGN_UP(offset + bytes, gran); >> + uint64_t off = QEMU_ALIGN_DOWN(offset, gran), len; > > The “, len” is a bit weirdly placed there, why not define it on a > separate line as "uint64_t len = end - off"? > >> + void *buf = qemu_blockalign(bs, end - off); >> + >> + /* >> + * Features to be implemented: >> + * F3. parallelize copying loop >> + * F4. detect zeroes ? or, otherwise, drop detect zeroes from backup >> code >> + * and just enable zeroes detecting on target >> + * F5. use block_status ? >> + * F6. don't copy clusters which are already cached by COR [see F1] >> + * F7. if target is ram-cache and it is full, there should be a >> possibility >> + * to drop not necessary data (cached by COR [see F1]) to handle CBW >> + * fast. > > Also “drop BDRV_REQ_SERIALISING from writes to s->target unless necessary”? hmm, yes. It may be a filter parameter, serialize writes or not. > >> + */ >> + >> + len = end - off; >> + while (hbitmap_next_dirty_area(s->copy_bitmap, &off, &len)) { >> + hbitmap_reset(s->copy_bitmap, off, len); >> + >> + ret = bdrv_co_pread(bs->backing, off, len, buf, >> + BDRV_REQ_NO_SERIALISING); >> + if (ret < 0) { >> + hbitmap_set(s->copy_bitmap, off, len); >> + goto out; >> + } >> + >> + ret = bdrv_co_pwrite(s->target, off, len, buf, >> BDRV_REQ_SERIALISING); >> + if (ret < 0) { >> + hbitmap_set(s->copy_bitmap, off, len); >> + goto out; >> + } >> + >> + if (s->progress_cb) { >> + s->progress_cb(len, s->progress_opaque); >> + } >> + off += len; >> + if (off >= end) { >> + break; >> + } >> + len = end - off; >> + } >> + >> +out: >> + qemu_vfree(buf); >> + >> + /* >> + * F8. we fail guest request in case of error. We can alter it by >> + * possibility to fail copying process instead, or retry several times, >> or >> + * may be guest pause, etc. >> + */ >> + return ret; >> +} >> + >> +static int coroutine_fn backup_top_co_pdiscard(BlockDriverState *bs, >> + int64_t offset, int bytes) >> +{ >> + int ret = backup_top_cbw(bs, offset, bytes); >> + if (ret < 0) { >> + return ret; >> + } >> + >> + /* >> + * Features to be implemented: >> + * F9. possibility of lazy discard: just defer the discard after >> fleecing >> + * completion. If write (or new discard) occurs to the same area, >> just >> + * drop deferred discard. >> + */ >> + >> + return bdrv_co_pdiscard(bs->backing, offset, bytes); >> +} >> + >> +static int coroutine_fn backup_top_co_pwrite_zeroes(BlockDriverState *bs, >> + int64_t offset, int bytes, BdrvRequestFlags flags) >> +{ >> + int ret = backup_top_cbw(bs, offset, bytes); >> + if (ret < 0) { >> + return ret; >> + } >> + >> + return bdrv_co_pwrite_zeroes(bs->backing, offset, bytes, flags); >> +} >> + >> +static coroutine_fn int backup_top_co_pwritev(BlockDriverState *bs, >> + uint64_t offset, >> + uint64_t bytes, >> + QEMUIOVector *qiov, int flags) >> +{ >> + if (!(flags & BDRV_REQ_WRITE_UNCHANGED)) { >> + int ret = backup_top_cbw(bs, offset, bytes); >> + if (ret < 0) { >> + return ret; >> + } >> + } >> + >> + return bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags); >> +} >> + >> +static int coroutine_fn backup_top_co_flush(BlockDriverState *bs) >> +{ >> + if (!bs->backing) { >> + return 0; >> + } >> + >> + return bdrv_co_flush(bs->backing->bs); > > Should we flush the target, too? Hm, you've asked it already, on previous version :) Backup don't do it, so I just keep old behavior. And what is the reason to flush backup target on any guest flush? > >> +} >> + >> +static void backup_top_refresh_filename(BlockDriverState *bs) >> +{ >> + if (bs->backing == NULL) { >> + /* >> + * we can be here after failed bdrv_attach_child in >> + * bdrv_set_backing_hd >> + */ >> + return; >> + } >> + bdrv_refresh_filename(bs->backing->bs); > > bdrv_refresh_filename() should have done this already. > >> + pstrcpy(bs->exact_filename, sizeof(bs->exact_filename), >> + bs->backing->bs->filename); >> +} >> + >> +static void backup_top_child_perm(BlockDriverState *bs, BdrvChild *c, >> + const BdrvChildRole *role, >> + BlockReopenQueue *reopen_queue, >> + uint64_t perm, uint64_t shared, >> + uint64_t *nperm, uint64_t *nshared) >> +{ >> + /* >> + * We have HBitmap in the state, its size is fixed, so we never allow >> + * resize. >> + */ >> + uint64_t rw = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED | >> + BLK_PERM_WRITE; >> + >> + bdrv_filter_default_perms(bs, c, role, reopen_queue, perm, shared, >> + nperm, nshared); >> + >> + *nperm = *nperm & rw; >> + *nshared = *nshared & rw; > > Somehow, that bdrv_filter_default_perm() call doesn’t make this function > easier for me. > > It seems like this is basically just “*nperm = perm & rw” and > “*nshared = shared & rw”. > >> + >> + if (role == &child_file) { >> + /* >> + * Target child >> + * >> + * Share write to target (child_file), to not interfere >> + * with guest writes to its disk which may be in target backing >> chain. >> + */ >> + if (perm & BLK_PERM_WRITE) { >> + *nshared = *nshared | BLK_PERM_WRITE; > > Why not always share WRITE on the target? Hmm, it's a bad thing to share writes on target, so I'm trying to reduce number of cases when we have share it. > >> + } >> + } else { >> + /* Source child */ >> + if (perm & BLK_PERM_WRITE) { > > Or WRITE_UNCHANGED, no? Why? We don't need doing CBW for unchanged write. > >> + *nperm = *nperm | BLK_PERM_CONSISTENT_READ; >> + } >> + *nshared = >> + *nshared & (BLK_PERM_CONSISTENT_READ | >> BLK_PERM_WRITE_UNCHANGED); > > And here it isn’t “*nshared = shared & rw” but “rw & ~WRITE”. > > I feel like this function would be simpler if you just set *nperm and > *nshared separately in the two branches of this condition, without > setting a “default” first. Ok, I'll try > >> + } >> +} >> + >> +BlockDriver bdrv_backup_top_filter = { >> + .format_name = "backup-top", >> + .instance_size = sizeof(BDRVBackupTopState), >> + >> + .bdrv_co_preadv = backup_top_co_preadv, >> + .bdrv_co_pwritev = backup_top_co_pwritev, >> + .bdrv_co_pwrite_zeroes = backup_top_co_pwrite_zeroes, >> + .bdrv_co_pdiscard = backup_top_co_pdiscard, >> + .bdrv_co_flush = backup_top_co_flush, >> + >> + .bdrv_co_block_status = bdrv_co_block_status_from_backing, >> + >> + .bdrv_refresh_filename = backup_top_refresh_filename, >> + >> + .bdrv_child_perm = backup_top_child_perm, >> + >> + .is_filter = true, >> +}; >> + >> +BlockDriverState *bdrv_backup_top_append(BlockDriverState *source, >> + BlockDriverState *target, >> + HBitmap *copy_bitmap, >> + Error **errp) >> +{ >> + Error *local_err = NULL; >> + BDRVBackupTopState *state; >> + BlockDriverState *top = bdrv_new_open_driver(&bdrv_backup_top_filter, >> + NULL, BDRV_O_RDWR, errp); >> + >> + if (!top) { >> + return NULL; >> + } >> + >> + top->implicit = true; > > As I said above, I think the caller should set this. > >> + top->total_sectors = source->total_sectors; >> + top->bl.opt_mem_alignment = MAX(bdrv_opt_mem_align(source), >> + bdrv_opt_mem_align(target)); >> + top->opaque = state = g_new0(BDRVBackupTopState, 1); >> + state->copy_bitmap = copy_bitmap; >> + >> + bdrv_ref(target); >> + state->target = bdrv_attach_child(top, target, "target", &child_file, >> errp); >> + if (!state->target) { >> + bdrv_unref(target); >> + bdrv_unref(top); >> + return NULL; >> + } >> + >> + bdrv_set_aio_context(top, bdrv_get_aio_context(source)); >> + bdrv_set_aio_context(target, bdrv_get_aio_context(source)); > > I suppose these calls aren’t necessary anymore (compare d0ee0204f4009). Hmm, see it, agree. > (I’m not fully sure, though. In any case, they would need to be > bdrv_try_set_aio_context() if they were still necessary.) > >> + >> + bdrv_drained_begin(source); >> + >> + bdrv_ref(top); >> + bdrv_append(top, source, &local_err); >> + if (local_err) { >> + error_prepend(&local_err, "Cannot append backup-top filter: "); >> + } >> + >> + bdrv_drained_end(source); >> + >> + if (local_err) { >> + bdrv_unref_child(top, state->target); >> + bdrv_unref(top); >> + error_propagate(errp, local_err); >> + return NULL; >> + } >> + >> + return top; >> +} > > I guess it would be nice if it users could blockdev-add a backup-top > node to basically get a backup with sync=none. > > (The bdrv_open implementation should then create a new bitmap and > initialize it to be fully set.) > > But maybe it wouldn’t be so nice and I just have feverish dreams. When series begun, I was trying to make exactly this - user-available filter, which may be used in separate, but you was against) Maybe, not totally against, but I decided not to argue long, and instead make filter implicit and drop public api (like mirror and commit filters), but place it in a separate file (no one will argue against putting large enough and complete new feature, represented by object into a separate file :). And this actually makes it easy to publish this filter at some moment. And now I think it was good decision anyway, as we postponed arguing around API and around shared dirty bitmaps. So publishing the filter is future step. > >> +void bdrv_backup_top_set_progress_callback( >> + BlockDriverState *bs, BackupTopProgressCallback progress_cb, >> + void *progress_opaque) >> +{ >> + BDRVBackupTopState *s = bs->opaque; >> + >> + s->progress_cb = progress_cb; >> + s->progress_opaque = progress_opaque; >> +} >> + >> +void bdrv_backup_top_drop(BlockDriverState *bs) >> +{ >> + BDRVBackupTopState *s = bs->opaque; >> + AioContext *aio_context = bdrv_get_aio_context(bs); >> + >> + aio_context_acquire(aio_context); >> + >> + bdrv_drained_begin(bs); >> + >> + bdrv_child_try_set_perm(bs->backing, 0, BLK_PERM_ALL, &error_abort); > > Pre-existing in other jobs, but I think calling this function is > dangerous. (Which is why I remove it in my “block: Ignore loosening > perm restrictions failures” series.) Hmm, good thing.. Should I rebase on it? > >> + bdrv_replace_node(bs, backing_bs(bs), &error_abort); >> + bdrv_set_backing_hd(bs, NULL, &error_abort); > > I think some of this function should be in a .bdrv_close() > implementation, for example this bdrv_set_backing_hd() call. Why? We don't have .bdrv_open, so why to have .bdrv_close? I think, when we publish this filter most of _add() and _drop() will be refactored to open() and close(). Is there a real reason to implement .close() now? > >> + bdrv_drained_end(bs); >> + >> + if (s->target) { >> + bdrv_unref_child(bs, s->target); > > And this. Well, neither needs to be in a .bdrv_close() implementation, > actually, because bdrv_close() will just do it by itself. > > I suppose you’re trying to remove the children so the node is no longer > usable after this point; but that isn’t quite right, then. The filter > functions still refer to s->target and bs->backing, so you need to stop > them from doing anything then. > > At this point, you might as well add a variable to BDRVBackupTopState > that says whether the filter is still supposed to be usable (like the > “stop” variable I added to mirror in “block/mirror: Fix child > permissions”). If so, the permission function should signal 0/ALL for > the permissions on backing (and probably target), and all filter > functions always immediately return an error. > > So I don’t think backing and target need to be removed here. We can > wait for that until bdrv_close(), but we should ensure that the filter > really is unusable after this function has been called. Ok, thank you for reviewing! > > Max > >> + } >> + bdrv_unref(bs); >> + >> + aio_context_release(aio_context); >> +} >> diff --git a/block/Makefile.objs b/block/Makefile.objs >> index ae11605c9f..dfbdfe6ab4 100644 >> --- a/block/Makefile.objs >> +++ b/block/Makefile.objs >> @@ -40,6 +40,8 @@ block-obj-y += throttle.o copy-on-read.o >> >> block-obj-y += crypto.o >> >> +block-obj-y += backup-top.o >> + >> common-obj-y += stream.o >> >> nfs.o-libs := $(LIBNFS_LIBS) >> > > -- Best regards, Vladimir