* John Snow (js...@redhat.com) wrote: > > > On 10/30/2017 12:33 PM, Vladimir Sementsov-Ogievskiy wrote: > > Postcopy migration of dirty bitmaps. Only named dirty bitmaps, > > associated with root nodes and non-root named nodes are migrated. > > > > If destination qemu is already containing a dirty bitmap with the same name > > as a migrated bitmap (for the same node), then, if their granularities are > > the same the migration will be done, otherwise the error will be generated. > > > > If destination qemu doesn't contain such bitmap it will be created. > > > > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com> > > --- > > include/migration/misc.h | 3 + > > migration/migration.h | 3 + > > migration/block-dirty-bitmap.c | 734 > > +++++++++++++++++++++++++++++++++++++++++ > > Ouch :\ > > > migration/migration.c | 3 + > > migration/savevm.c | 2 + > > vl.c | 1 + > > migration/Makefile.objs | 1 + > > migration/trace-events | 14 + > > 8 files changed, 761 insertions(+) > > create mode 100644 migration/block-dirty-bitmap.c > > > > Organizationally, you introduce three new 'public' prototypes: > > dirty_bitmap_mig_init > dirty_bitmap_mig_before_vm_start > init_dirty_bitmap_incoming_migration > > mig_init is advertised in migration/misc.h, the other two are in > migration/migration.h. > The definitions for all three are in migration/block-dirty-bitmap.c > > In pure naivety, I find it weird to have something that you use in > migration.c and advertised in migration.h actually exist separately in > block-dirty-bitmap.c; but maybe this is the sanest thing to do.
Actually I think that's OK; it makes sense for all of the code for this feature to sit in one place, and there doesn't seem any point creating a header just for this one function. > > diff --git a/include/migration/misc.h b/include/migration/misc.h > > index c079b7771b..9cc539e232 100644 > > --- a/include/migration/misc.h > > +++ b/include/migration/misc.h > > @@ -55,4 +55,7 @@ bool migration_has_failed(MigrationState *); > > bool migration_in_postcopy_after_devices(MigrationState *); > > void migration_global_dump(Monitor *mon); > > > > +/* migration/block-dirty-bitmap.c */ > > +void dirty_bitmap_mig_init(void); > > + > > #endif > > diff --git a/migration/migration.h b/migration/migration.h > > index 50d1f01346..4e3ad04664 100644 > > --- a/migration/migration.h > > +++ b/migration/migration.h > > @@ -211,4 +211,7 @@ void migrate_send_rp_pong(MigrationIncomingState *mis, > > void migrate_send_rp_req_pages(MigrationIncomingState *mis, const char* > > rbname, > > ram_addr_t start, size_t len); > > > > +void dirty_bitmap_mig_before_vm_start(void); > > +void init_dirty_bitmap_incoming_migration(void); > > + > > #endif > > diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c > > new file mode 100644 > > index 0000000000..53cb20045d > > --- /dev/null > > +++ b/migration/block-dirty-bitmap.c > > @@ -0,0 +1,734 @@ > > +/* > > + * Block dirty bitmap postcopy migration > > + * > > + * Copyright IBM, Corp. 2009 > > + * Copyright (c) 2016-2017 Parallels International GmbH > > + * > > + * Authors: > > + * Liran Schour <lir...@il.ibm.com> > > + * Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com> > > + * > > + * This work is licensed under the terms of the GNU GPL, version 2. See > > + * the COPYING file in the top-level directory. > > + * This file is derived from migration/block.c, so it's author and IBM > > copyright > > + * are here, although content is quite different. 
> > + * > > + * Contributions after 2012-01-13 are licensed under the terms of the > > + * GNU GPL, version 2 or (at your option) any later version. > > + * > > + * *** > > + * > > + * Here postcopy migration of dirty bitmaps is realized. Only named dirty > > + * bitmaps, associated with root nodes and non-root named nodes are > > migrated. > > Put another way, only QMP-addressable bitmaps. Nodes without a name that > are not the root have no way to be addressed. > > > + * > > + * If destination qemu is already containing a dirty bitmap with the same > > name > > "If the destination QEMU already contains a dirty bitmap with the same name" > > > + * as a migrated bitmap (for the same node), then, if their granularities > > are > > + * the same the migration will be done, otherwise the error will be > > generated. > > "an error" > > > + * > > + * If destination qemu doesn't contain such bitmap it will be created. > > "If the destination QEMU doesn't contain such a bitmap, it will be created." > > > + * > > + * format of migration: > > + * > > + * # Header (shared for different chunk types) > > + * 1, 2 or 4 bytes: flags (see qemu_{put,put}_flags) > > + * [ 1 byte: node name size ] \ flags & DEVICE_NAME > > + * [ n bytes: node name ] / > > + * [ 1 byte: bitmap name size ] \ flags & BITMAP_NAME > > + * [ n bytes: bitmap name ] / > > + * > > + * # Start of bitmap migration (flags & START) > > + * header > > + * be64: granularity > > + * 1 byte: bitmap flags (corresponds to BdrvDirtyBitmap) > > + * bit 0 - bitmap is enabled > > + * bit 1 - bitmap is persistent > > + * bit 2 - bitmap is autoloading > > + * bits 3-7 - reserved, must be zero > > + * > > + * # Complete of bitmap migration (flags & COMPLETE) > > + * header > > + * > > + * # Data chunk of bitmap migration > > + * header > > + * be64: start sector > > + * be32: number of sectors > > + * [ be64: buffer size ] \ ! 
(flags & ZEROES) > > + * [ n bytes: buffer ] / > > + * > > + * The last chunk in stream should contain flags & EOS. The chunk may skip > > + * device and/or bitmap names, assuming them to be the same with the > > previous > > + * chunk. > > + */ > > + > > Been a while since I reviewed the format, but it seems sane. > > > +#include "qemu/osdep.h" > > +#include "block/block.h" > > +#include "block/block_int.h" > > +#include "sysemu/block-backend.h" > > +#include "qemu/main-loop.h" > > +#include "qemu/error-report.h" > > +#include "migration/misc.h" > > +#include "migration/migration.h" > > +#include "migration/qemu-file.h" > > +#include "migration/vmstate.h" > > +#include "migration/register.h" > > +#include "qemu/hbitmap.h" > > +#include "sysemu/sysemu.h" > > +#include "qemu/cutils.h" > > +#include "qapi/error.h" > > +#include "trace.h" > > + > > +#define CHUNK_SIZE (1 << 10) > > + > > +/* Flags occupy one, two or four bytes (Big Endian). The size is > > determined as > > + * follows: > > + * in first (most significant) byte bit 8 is clear --> one byte > > + * in first byte bit 8 is set --> two or four bytes, depending on > > second > > + * byte: > > + * | in second byte bit 8 is clear --> two bytes > > + * | in second byte bit 8 is set --> four bytes > > + */ > > +#define DIRTY_BITMAP_MIG_FLAG_EOS 0x01 > > +#define DIRTY_BITMAP_MIG_FLAG_ZEROES 0x02 > > +#define DIRTY_BITMAP_MIG_FLAG_BITMAP_NAME 0x04 > > +#define DIRTY_BITMAP_MIG_FLAG_DEVICE_NAME 0x08 > > +#define DIRTY_BITMAP_MIG_FLAG_START 0x10 > > +#define DIRTY_BITMAP_MIG_FLAG_COMPLETE 0x20 > > +#define DIRTY_BITMAP_MIG_FLAG_BITS 0x40 > > + > > +#define DIRTY_BITMAP_MIG_EXTRA_FLAGS 0x80 > > + > > +#define DIRTY_BITMAP_MIG_START_FLAG_ENABLED 0x01 > > +#define DIRTY_BITMAP_MIG_START_FLAG_PERSISTENT 0x02 > > +#define DIRTY_BITMAP_MIG_START_FLAG_AUTOLOAD 0x04 > > +#define DIRTY_BITMAP_MIG_START_FLAG_RESERVED_MASK 0xf8 > > + > > +typedef struct DirtyBitmapMigBitmapState { > > + /* Written during setup phase. 
*/ > > + BlockDriverState *bs; > > + const char *node_name; > > + BdrvDirtyBitmap *bitmap; > > + uint64_t total_sectors; > > + uint64_t sectors_per_chunk; > > + QSIMPLEQ_ENTRY(DirtyBitmapMigBitmapState) entry; > > + uint8_t flags; > > + > > + /* For bulk phase. */ > > + bool bulk_completed; > > + uint64_t cur_sector; > > +} DirtyBitmapMigBitmapState; > > + > > +typedef struct DirtyBitmapMigState { > > + QSIMPLEQ_HEAD(dbms_list, DirtyBitmapMigBitmapState) dbms_list; > > + > > + bool bulk_completed; > > + > > + /* for send_bitmap_bits() */ > > + BlockDriverState *prev_bs; > > + BdrvDirtyBitmap *prev_bitmap; > > +} DirtyBitmapMigState; > > + > > +typedef struct DirtyBitmapLoadState { > > + uint32_t flags; > > + char node_name[256]; > > + char bitmap_name[256]; > > + BlockDriverState *bs; > > + BdrvDirtyBitmap *bitmap; > > +} DirtyBitmapLoadState; > > + > > +static DirtyBitmapMigState dirty_bitmap_mig_state; > > + > > +typedef struct DirtyBitmapLoadBitmapState { > > + BlockDriverState *bs; > > + BdrvDirtyBitmap *bitmap; > > + bool migrated; > > +} DirtyBitmapLoadBitmapState; > > +static GSList *enabled_bitmaps; > > +QemuMutex finish_lock; > > + > > +void init_dirty_bitmap_incoming_migration(void) > > +{ > > + qemu_mutex_init(&finish_lock); > > +} > > + > > This is a little odd as public interface. David, is there a nicer way to > integrate in-migrate hooks? I guess it hasn't come up yet. Anyway, it > might be nice to leave a comment here for now that says that the only > caller is migration.c, and it will only ever call it once. I don't think having an init_ function like that is a problem; we've got an init_blk_migration, so I guess it's similar. I generally prefer things to be part of the incoming state structure rather than a file global; but that's not a huge one. 
> > +static uint32_t qemu_get_bitmap_flags(QEMUFile *f) > > +{ > > + uint8_t flags = qemu_get_byte(f); > > + if (flags & DIRTY_BITMAP_MIG_EXTRA_FLAGS) { > > + flags = flags << 8 | qemu_get_byte(f); > > + if (flags & DIRTY_BITMAP_MIG_EXTRA_FLAGS) { > > + flags = flags << 16 | qemu_get_be16(f); > > + } > > + } > > + > > + return flags; > > +} > > + > > ok > > (Sorry for the per-function ACKs, it's just helpful for me to know which > functions I followed execution of on paper to make sure I got everything > in this big patch.) > > > +static void qemu_put_bitmap_flags(QEMUFile *f, uint32_t flags) > > +{ > > + /* The code currently do not send flags more than one byte */ > > + assert(!(flags & (0xffffff00 | DIRTY_BITMAP_MIG_EXTRA_FLAGS))); > > + > > + qemu_put_byte(f, flags); > > +} > > + > > ok > > > +static void send_bitmap_header(QEMUFile *f, DirtyBitmapMigBitmapState > > *dbms, > > + uint32_t additional_flags) > > +{ > > + BlockDriverState *bs = dbms->bs; > > + BdrvDirtyBitmap *bitmap = dbms->bitmap; > > + uint32_t flags = additional_flags; > > + trace_send_bitmap_header_enter(); > > + > > + if (bs != dirty_bitmap_mig_state.prev_bs) { > > + dirty_bitmap_mig_state.prev_bs = bs; > > + flags |= DIRTY_BITMAP_MIG_FLAG_DEVICE_NAME; > > + } > > + > > + if (bitmap != dirty_bitmap_mig_state.prev_bitmap) { > > + dirty_bitmap_mig_state.prev_bitmap = bitmap; > > + flags |= DIRTY_BITMAP_MIG_FLAG_BITMAP_NAME; > > + } > > + > > I guess the idea here is that we might be able to skip the node name > broadcast, leaving the possibilities as: > > - new node and bitmap: send both > - new bitmap, but not node: send bitmap name only > - same for both: send neither > > and that otherwise it's not possible to have a new node but "same > bitmap" by nature of how the structures are organized. 
> > > + qemu_put_bitmap_flags(f, flags); > > + > > + if (flags & DIRTY_BITMAP_MIG_FLAG_DEVICE_NAME) { > > + qemu_put_counted_string(f, dbms->node_name); > > + } > > + > > + if (flags & DIRTY_BITMAP_MIG_FLAG_BITMAP_NAME) { > > + qemu_put_counted_string(f, bdrv_dirty_bitmap_name(bitmap)); > > + } > > +} > > + > > ok > > > +static void send_bitmap_start(QEMUFile *f, DirtyBitmapMigBitmapState *dbms) > > +{ > > + send_bitmap_header(f, dbms, DIRTY_BITMAP_MIG_FLAG_START); > > + qemu_put_be32(f, bdrv_dirty_bitmap_granularity(dbms->bitmap)); > > + qemu_put_byte(f, dbms->flags); > > +} > > + > > ok > > > +static void send_bitmap_complete(QEMUFile *f, DirtyBitmapMigBitmapState > > *dbms) > > +{ > > + send_bitmap_header(f, dbms, DIRTY_BITMAP_MIG_FLAG_COMPLETE); > > +} > > + > > ok > > > +static void send_bitmap_bits(QEMUFile *f, DirtyBitmapMigBitmapState *dbms, > > + uint64_t start_sector, uint32_t nr_sectors) > > +{ > > + /* align for buffer_is_zero() */ > > + uint64_t align = 4 * sizeof(long); > > + uint64_t unaligned_size = > > + bdrv_dirty_bitmap_serialization_size( > > + dbms->bitmap, start_sector << BDRV_SECTOR_BITS, > > + (uint64_t)nr_sectors << BDRV_SECTOR_BITS); > > + uint64_t buf_size = (unaligned_size + align - 1) & ~(align - 1); > > + uint8_t *buf = g_malloc0(buf_size); > > + uint32_t flags = DIRTY_BITMAP_MIG_FLAG_BITS; > > + > > + bdrv_dirty_bitmap_serialize_part( > > + dbms->bitmap, buf, start_sector << BDRV_SECTOR_BITS, > > + (uint64_t)nr_sectors << BDRV_SECTOR_BITS); > > + > > + if (buffer_is_zero(buf, buf_size)) { > > + g_free(buf); > > + buf = NULL; > > + flags |= DIRTY_BITMAP_MIG_FLAG_ZEROES; > > + } > > + > > + trace_send_bitmap_bits(flags, start_sector, nr_sectors, buf_size); > > + > > + send_bitmap_header(f, dbms, flags); > > + > > + qemu_put_be64(f, start_sector); > > + qemu_put_be32(f, nr_sectors); > > + > > + /* if a block is zero we need to flush here since the network > > + * bandwidth is now a lot higher than the storage device bandwidth. 
> > + * thus if we queue zero blocks we slow down the migration. */ > > Can you elaborate on this for me? > > > + if (flags & DIRTY_BITMAP_MIG_FLAG_ZEROES) { > > + qemu_fflush(f); > > + } else { > > + qemu_put_be64(f, buf_size); > > + qemu_put_buffer(f, buf, buf_size); > > + } > > + > > + g_free(buf); > > +} > > + > > + > > +/* Called with iothread lock taken. */ > > + > > +static int init_dirty_bitmap_migration(void) > > +{ > > + BlockDriverState *bs; > > + BdrvDirtyBitmap *bitmap; > > + DirtyBitmapMigBitmapState *dbms; > > + BdrvNextIterator it; > > + > > + dirty_bitmap_mig_state.bulk_completed = false; > > + dirty_bitmap_mig_state.prev_bs = NULL; > > + dirty_bitmap_mig_state.prev_bitmap = NULL; > > + > > + for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) { > > + if (!bdrv_get_device_or_node_name(bs)) { > > + /* not named non-root node */ > > I can't imagine the situation it would arise in, but is it possible to > have a named bitmap attached to a now-anonymous node? > > Let's say we attach a bitmap to a root node, but then later we insert a > filter or something above it and it's no longer at the root. > > We should probably prohibit such things, or at the very least toss out > an error here instead of silently continuing. > > I think the only things valid to just *skip* are nameless bitmaps. > Anything named we really ought to either migrate or error out over, I > think, even if the circumstances leading to such a configuration are > very unlikely. 
> > > + continue; > > + } > > + > > + for (bitmap = bdrv_dirty_bitmap_next(bs, NULL); bitmap; > > + bitmap = bdrv_dirty_bitmap_next(bs, bitmap)) { > > + if (!bdrv_dirty_bitmap_name(bitmap)) { > > + continue; > > + } > > + > > + if (bdrv_dirty_bitmap_frozen(bitmap)) { > > + error_report("Can't migrate frozen dirty bitmap: '%s", > > + bdrv_dirty_bitmap_name(bitmap)); > > + return -1; > > + } > > + } > > + } > > + > > + for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) { > > + if (!bdrv_get_device_or_node_name(bs)) { > > + /* not named non-root node */ > > + continue; > > + } > > Why two-pass? I guess because we don't have to tear anything down if we > check for errors in advance? > > I worry that if we need to amend the logic here that it's error-prone to > update it in two places, so maybe we ought to just have one loop. > > > + > > + for (bitmap = bdrv_dirty_bitmap_next(bs, NULL); bitmap; > > + bitmap = bdrv_dirty_bitmap_next(bs, bitmap)) { > > + if (!bdrv_dirty_bitmap_name(bitmap)) { > > + continue; > > + } > > + > > + bdrv_ref(bs); > > + bdrv_dirty_bitmap_set_frozen(bitmap, true); > > + > > We could say that for any bitmap in the list of pending bitmaps to > migrate, we know that we have to un-freeze it, since we never add any > bitmaps that are frozen to our list. > > > + dbms = g_new0(DirtyBitmapMigBitmapState, 1); > > + dbms->bs = bs; > > + dbms->node_name = bdrv_get_node_name(bs); > > + if (!dbms->node_name || dbms->node_name[0] == '\0') { > > + dbms->node_name = bdrv_get_device_name(bs); > > + } > > + dbms->bitmap = bitmap; > > + dbms->total_sectors = bdrv_nb_sectors(bs); > > + dbms->sectors_per_chunk = CHUNK_SIZE * 8 * > > + bdrv_dirty_bitmap_granularity(bitmap) >> BDRV_SECTOR_BITS; > > Eric may want to avoid checking in new code that thinks in sectors, but > for the sake of review I don't mind right now. > > (Sorry, Eric!) 
> > > + if (bdrv_dirty_bitmap_enabled(bitmap)) { > > + dbms->flags |= DIRTY_BITMAP_MIG_START_FLAG_ENABLED; > > + } > > + if (bdrv_dirty_bitmap_get_persistance(bitmap)) { > > + dbms->flags |= DIRTY_BITMAP_MIG_START_FLAG_PERSISTENT; > > + } > > + if (bdrv_dirty_bitmap_get_autoload(bitmap)) { > > + dbms->flags |= DIRTY_BITMAP_MIG_START_FLAG_AUTOLOAD; > > + } > > + > > + bdrv_dirty_bitmap_set_persistance(bitmap, false); > > Oh, this might be stranger to undo. Perhaps what we CAN do is limit the > second pass to just this action and allow ourselves to unroll everything > else. > > > + > > + QSIMPLEQ_INSERT_TAIL(&dirty_bitmap_mig_state.dbms_list, > > + dbms, entry); > > + } > > + } > > + > > + return 0; > > +} > > + > > +/* Called with no lock taken. */ > > +static void bulk_phase_send_chunk(QEMUFile *f, DirtyBitmapMigBitmapState > > *dbms) > > +{ > > + uint32_t nr_sectors = MIN(dbms->total_sectors - dbms->cur_sector, > > + dbms->sectors_per_chunk); > > + > > + send_bitmap_bits(f, dbms, dbms->cur_sector, nr_sectors); > > + > > + dbms->cur_sector += nr_sectors; > > + if (dbms->cur_sector >= dbms->total_sectors) { > > + dbms->bulk_completed = true; > > + } > > +} > > + > > +/* Called with no lock taken. */ > > +static void bulk_phase(QEMUFile *f, bool limit) > > +{ > > + DirtyBitmapMigBitmapState *dbms; > > + > > + QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, entry) { > > + while (!dbms->bulk_completed) { > > + bulk_phase_send_chunk(f, dbms); > > + if (limit && qemu_file_rate_limit(f)) { > > + return; > > + } > > + } > > + } > > + > > + dirty_bitmap_mig_state.bulk_completed = true; > > +} > > + > > +/* Called with iothread lock taken. 
*/ > > +static void dirty_bitmap_mig_cleanup(void) > > +{ > > + DirtyBitmapMigBitmapState *dbms; > > + > > + while ((dbms = QSIMPLEQ_FIRST(&dirty_bitmap_mig_state.dbms_list)) != > > NULL) { > > + QSIMPLEQ_REMOVE_HEAD(&dirty_bitmap_mig_state.dbms_list, entry); > > + bdrv_dirty_bitmap_set_frozen(dbms->bitmap, false); > > + bdrv_unref(dbms->bs); > > + g_free(dbms); > > + } > > +} > > + > > ok > > > +/* for SaveVMHandlers */ > > +static void dirty_bitmap_save_cleanup(void *opaque) > > +{ > > + dirty_bitmap_mig_cleanup(); > > +} > > + > > ok > > > +static int dirty_bitmap_save_iterate(QEMUFile *f, void *opaque) > > +{ > > + trace_dirty_bitmap_save_iterate(migration_in_postcopy()); > > + > > + if (migration_in_postcopy() && !dirty_bitmap_mig_state.bulk_completed) > > { > > + bulk_phase(f, true); > > + } > > + > > + qemu_put_bitmap_flags(f, DIRTY_BITMAP_MIG_FLAG_EOS); > > + > > + return dirty_bitmap_mig_state.bulk_completed; > > +} > > + > > What's the purpose behind doing bulk save both here and in > dirty_bitmap_save_complete? Is there a path that isn't guaranteed to > call one of the completion functions? > > (The way it's coded seems like it'll work fine, but I'm curious about > what looks like a redundancy at a glance.) I think I can see the reasoning; I think the idea is that in each iteration you send a chunk of data (note the 'true' means that it gets modulated by bandwidth limiting - although we currently don't have that in postcopy - I need to fix that) - so it might not actually send all of it. The complete guarantees it's all sent. So note this IS iterating potentially (OK but probably not at the moment) > > +/* Called with iothread lock taken. 
*/ > > + > > +static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque) > > +{ > > + DirtyBitmapMigBitmapState *dbms; > > + trace_dirty_bitmap_save_complete_enter(); > > + > > + if (!dirty_bitmap_mig_state.bulk_completed) { > > + bulk_phase(f, false); > > + } > > + > > funny now that we don't actually really iterate over the data, and the > bulk phase is now really the _only_ phase :) > > > + QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, entry) { > > + send_bitmap_complete(f, dbms); > > + } > > + > > + qemu_put_bitmap_flags(f, DIRTY_BITMAP_MIG_FLAG_EOS); > > + > > + trace_dirty_bitmap_save_complete_finish(); > > + > > + dirty_bitmap_mig_cleanup(); > > + return 0; > > +} > > + > > ok > > > +static void dirty_bitmap_save_pending(QEMUFile *f, void *opaque, > > + uint64_t max_size, > > + uint64_t *res_precopy_only, > > + uint64_t *res_compatible, > > + uint64_t *res_postcopy_only) > > +{ > > + DirtyBitmapMigBitmapState *dbms; > > + uint64_t pending = 0; > > + > > + qemu_mutex_lock_iothread(); > > + > > + QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, entry) { > > + uint64_t gran = bdrv_dirty_bitmap_granularity(dbms->bitmap); > > + uint64_t sectors = dbms->bulk_completed ? 0 : > > + dbms->total_sectors - dbms->cur_sector; > > + > > + pending += (sectors * BDRV_SECTOR_SIZE + gran - 1) / gran; > > + } > > + > > + qemu_mutex_unlock_iothread(); > > + > > + trace_dirty_bitmap_save_pending(pending, max_size); > > + > > + *res_postcopy_only += pending; > > +} > > + > > ok > > > +/* First occurrence of this bitmap. 
It should be created if doesn't exist > > */ > > +static int dirty_bitmap_load_start(QEMUFile *f, DirtyBitmapLoadState *s) > > +{ > > + Error *local_err = NULL; > > + uint32_t granularity = qemu_get_be32(f); > > + uint8_t flags = qemu_get_byte(f); > > + > > + if (!s->bitmap) { > > + s->bitmap = bdrv_create_dirty_bitmap(s->bs, granularity, > > + s->bitmap_name, &local_err); > > + if (!s->bitmap) { > > + error_report_err(local_err); > > + return -EINVAL; > > + } > > + } else { > > + uint32_t dest_granularity = > > + bdrv_dirty_bitmap_granularity(s->bitmap); > > + if (dest_granularity != granularity) { > > + error_report("Error: " > > + "Migrated bitmap granularity (%" PRIu32 ") " > > + "doesn't match the destination bitmap '%s' " > > + "granularity (%" PRIu32 ")", > > + granularity, > > + bdrv_dirty_bitmap_name(s->bitmap), > > + dest_granularity); > > + return -EINVAL; > > + } > > + } > > + > > I'm a fan of auto-creating the bitmaps. Do you have a use-case for why > creating them ahead of time is better, or are you just attempting to be > flexible? > > > + if (flags & DIRTY_BITMAP_MIG_START_FLAG_RESERVED_MASK) { > > + error_report("Unknown flags in migrated dirty bitmap header: %x", > > + flags); > > + return -EINVAL; > > + } > > + > > + if (flags & DIRTY_BITMAP_MIG_START_FLAG_PERSISTENT) { > > + bdrv_dirty_bitmap_set_persistance(s->bitmap, true); > > + } > > + if (flags & DIRTY_BITMAP_MIG_START_FLAG_AUTOLOAD) { > > + bdrv_dirty_bitmap_set_autoload(s->bitmap, true); > > + } > > + > > + bdrv_disable_dirty_bitmap(s->bitmap); > > OK, so we start them off as disabled, and > > > + if (flags & DIRTY_BITMAP_MIG_START_FLAG_ENABLED) { > > + DirtyBitmapLoadBitmapState *b; > > + > > + bdrv_dirty_bitmap_create_successor(s->bs, s->bitmap, &local_err); > > + if (local_err) { > > + error_report_err(local_err); > > + return -EINVAL; > > + } > > + > > If they weren't disabled on the host, we create a successor to record > writes while we wait for the rest of the data to arrive. 
> > > + b = g_new(DirtyBitmapLoadBitmapState, 1); > > + b->bs = s->bs; > > + b->bitmap = s->bitmap; > > + b->migrated = false; > > + enabled_bitmaps = g_slist_prepend(enabled_bitmaps, b); > > And we make a note of which ones we were supposed to re-enable. > > > + } > > + > > + return 0; > > +} > > OK > > > + > > +void dirty_bitmap_mig_before_vm_start(void) > > Similarly, I guess I find it weird that this is a callable interface. > > David, no nice hook for just-prior-to-vm-start calls? a > .postcopy_pivot() hook or something might be nice.. Hmm, the other way a lot of devices do it is to hook the runstate change. > > +{ > > + GSList *item; > > + > > + qemu_mutex_lock(&finish_lock); > > + > > + for (item = enabled_bitmaps; item; item = g_slist_next(item)) { > > + DirtyBitmapLoadBitmapState *b = item->data; > > + > > Anyway, if I am reading this right; we call this in the postcopy phase > prior to receiving the entirety of the bitmaps (so before receiving > COMPLETE) ... right? > > > + if (b->migrated) { > > + bdrv_enable_dirty_bitmap(b->bitmap); > > ...or, am I confused, and we might receive a COMPLETE event prior to the > postcopy pivot? > > Anyway, I suppose this code says "If we received the entire bitmap, go > ahead and enable it." > > > + } else { > > + bdrv_dirty_bitmap_enable_successor(b->bitmap); > > And this says "If we haven't received it yet, enable the successor to > record writes until we get the rest of the data." 
> > > + } > > + > > + g_free(b); > > + } > > + > > + g_slist_free(enabled_bitmaps); > > + enabled_bitmaps = NULL; > > + > > + qemu_mutex_unlock(&finish_lock); > > +} > > + > > +static void dirty_bitmap_load_complete(QEMUFile *f, DirtyBitmapLoadState > > *s) > > +{ > > + GSList *item; > > + trace_dirty_bitmap_load_complete(); > > + bdrv_dirty_bitmap_deserialize_finish(s->bitmap); > > + > > + qemu_mutex_lock(&finish_lock); > > + > > + for (item = enabled_bitmaps; item; item = g_slist_next(item)) { > > + DirtyBitmapLoadBitmapState *b = item->data; > > + > > + if (b->bitmap == s->bitmap) { > > + b->migrated = true; > > we can probably break; here now, right? > > (This whole stanza is a little strange, can't we cache the active > DirtyBitmapLoadBitmapState in DirtyBitmapLoadState? I guess not, because > we're looking up the BdrvDirtyBitmap itself and caching that instead, so > either way we have some kind of lookup on every context switch.) > > > + } > > + } > > + > > + if (bdrv_dirty_bitmap_frozen(s->bitmap)) { > > + if (enabled_bitmaps == NULL) { > > + /* in postcopy */ > > + AioContext *aio_context = bdrv_get_aio_context(s->bs); > > + aio_context_acquire(aio_context); > > + > > + bdrv_reclaim_dirty_bitmap(s->bs, s->bitmap, &error_abort); > > + bdrv_enable_dirty_bitmap(s->bitmap); > > + > > OK, so if enabled_bitmaps is gone, that means we already pivoted and > we're live on the receiving end here. We merge the successor into the > fully deserialized bitmap and enable it. > > > + aio_context_release(aio_context); > > + } else { > > + /* target not started, successor is empty */ > > + bdrv_dirty_bitmap_release_successor(s->bs, s->bitmap); > > Otherwise we just trash the successor. Did we really need a new call for > this? I suppose it's faster than merging a 0-bit bitmap, but the > additional API complexity for the speed win here seems like premature > optimization. > > ...well, you already wrote it, so I won't argue. 
> > > + } > > + } > > + > > + qemu_mutex_unlock(&finish_lock); > > +} > > + > > +static int dirty_bitmap_load_bits(QEMUFile *f, DirtyBitmapLoadState *s) > > +{ > > + uint64_t first_byte = qemu_get_be64(f) << BDRV_SECTOR_BITS; > > + uint64_t nr_bytes = (uint64_t)qemu_get_be32(f) << BDRV_SECTOR_BITS; > > + trace_dirty_bitmap_load_bits_enter(first_byte >> BDRV_SECTOR_BITS, > > + nr_bytes >> BDRV_SECTOR_BITS); > > + > > + if (s->flags & DIRTY_BITMAP_MIG_FLAG_ZEROES) { > > + trace_dirty_bitmap_load_bits_zeroes(); > > + bdrv_dirty_bitmap_deserialize_zeroes(s->bitmap, first_byte, > > nr_bytes, > > + false); > > + } else { > > + uint8_t *buf; > > + uint64_t buf_size = qemu_get_be64(f); > > + uint64_t needed_size = > > + bdrv_dirty_bitmap_serialization_size(s->bitmap, > > + first_byte, nr_bytes); > > + > > + if (needed_size > buf_size) { > > + error_report("Error: Migrated bitmap granularity doesn't " > > + "match the destination bitmap '%s' granularity", > > + bdrv_dirty_bitmap_name(s->bitmap)); > > + return -EINVAL; > > + } > > + > > + buf = g_malloc(buf_size); > > + qemu_get_buffer(f, buf, buf_size); > > + bdrv_dirty_bitmap_deserialize_part(s->bitmap, buf, first_byte, > > nr_bytes, > > + false); > > + g_free(buf); > > + } > > + > > + return 0; > > +} > > + > > ok > > > +static int dirty_bitmap_load_header(QEMUFile *f, DirtyBitmapLoadState *s) > > +{ > > + Error *local_err = NULL; > > + s->flags = qemu_get_bitmap_flags(f); > > + trace_dirty_bitmap_load_header(s->flags); > > + > > + if (s->flags & DIRTY_BITMAP_MIG_FLAG_DEVICE_NAME) { > > + if (!qemu_get_counted_string(f, s->node_name)) { > > + error_report("Unable to read node name string"); > > + return -EINVAL; > > + } > > + s->bs = bdrv_lookup_bs(s->node_name, s->node_name, &local_err); > > + if (!s->bs) { > > + error_report_err(local_err); > > + return -EINVAL; > > + } > > + } else if (!s->bs) { > > + error_report("Error: block device name is not set"); > > + return -EINVAL; > > + } > > + > > + if (s->flags & 
DIRTY_BITMAP_MIG_FLAG_BITMAP_NAME) { > > + if (!qemu_get_counted_string(f, s->bitmap_name)) { > > + error_report("Unable to read node name string"); > > + return -EINVAL; > > + } > > + s->bitmap = bdrv_find_dirty_bitmap(s->bs, s->bitmap_name); > > + > > + /* bitmap may be NULL here, it wouldn't be an error if it is the > > + * first occurrence of the bitmap */ > > + if (!s->bitmap && !(s->flags & DIRTY_BITMAP_MIG_FLAG_START)) { > > + error_report("Error: unknown dirty bitmap " > > + "'%s' for block device '%s'", > > + s->bitmap_name, s->node_name); > > + return -EINVAL; > > + } > > + } else if (!s->bitmap) { > > + error_report("Error: block device name is not set"); > > + return -EINVAL; > > + } > > + > > + return 0; > > +} > > + > > ok > > > +static int dirty_bitmap_load(QEMUFile *f, void *opaque, int version_id) > > +{ > > + static DirtyBitmapLoadState s; > > + int ret = 0; > > + > > + trace_dirty_bitmap_load_enter(); > > + > > + if (version_id != 1) { > > + return -EINVAL; > > + } > > + > > + do { > > + dirty_bitmap_load_header(f, &s); > > + > > + if (s.flags & DIRTY_BITMAP_MIG_FLAG_START) { > > + ret = dirty_bitmap_load_start(f, &s); > > + } else if (s.flags & DIRTY_BITMAP_MIG_FLAG_COMPLETE) { > > + dirty_bitmap_load_complete(f, &s); > > + } else if (s.flags & DIRTY_BITMAP_MIG_FLAG_BITS) { > > + ret = dirty_bitmap_load_bits(f, &s); > > + } > > + > > + if (!ret) { > > + ret = qemu_file_get_error(f); > > + } > > + > > + if (ret) { > > + return ret; > > + } > > + } while (!(s.flags & DIRTY_BITMAP_MIG_FLAG_EOS)); > > + > > + trace_dirty_bitmap_load_success(); > > + return 0; > > +} > > OK > > > + > > +static int dirty_bitmap_save_setup(QEMUFile *f, void *opaque) > > +{ > > + DirtyBitmapMigBitmapState *dbms = NULL; > > + if (init_dirty_bitmap_migration() < 0) { > > + return -1; > > + } > > + > > + QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, entry) { > > + send_bitmap_start(f, dbms); > > + } > > + qemu_put_bitmap_flags(f, DIRTY_BITMAP_MIG_FLAG_EOS); > > 
+ > > + return 0; > > +} > > + > > ok > > > +static bool dirty_bitmap_is_active(void *opaque) > > +{ > > + return migrate_dirty_bitmaps(); > > +} > > + > > ok > > > +static bool dirty_bitmap_is_active_iterate(void *opaque) > > +{ > > + return dirty_bitmap_is_active(opaque) && !runstate_is_running(); > > +} > > + > > On second thought, in patch 9, can you add a little tiny bit of > documentation text explaining the exact nature of this callback? > > Is the second portion of the conditional here so that once we reach the > postcopy state that .is_active_iterate starts returning true? > > "ok" if I'm reading this right, but it might be helped along by a little > comment. > > > +static bool dirty_bitmap_has_postcopy(void *opaque) > > +{ > > + return true; > > +} > > + > > ok! :) > > > +static SaveVMHandlers savevm_dirty_bitmap_handlers = { > > + .save_setup = dirty_bitmap_save_setup, > > + .save_live_complete_postcopy = dirty_bitmap_save_complete, > > + .save_live_complete_precopy = dirty_bitmap_save_complete, > > + .has_postcopy = dirty_bitmap_has_postcopy, > > + .save_live_pending = dirty_bitmap_save_pending, > > + .save_live_iterate = dirty_bitmap_save_iterate, > > + .is_active_iterate = dirty_bitmap_is_active_iterate, > > + .load_state = dirty_bitmap_load, > > + .save_cleanup = dirty_bitmap_save_cleanup, > > + .is_active = dirty_bitmap_is_active, > > +}; > > + > > +void dirty_bitmap_mig_init(void) > > +{ > > + QSIMPLEQ_INIT(&dirty_bitmap_mig_state.dbms_list); > > + > > + register_savevm_live(NULL, "dirty-bitmap", 0, 1, > > + &savevm_dirty_bitmap_handlers, > > + &dirty_bitmap_mig_state); > > +} > > ok > > dgilbert, would it be worth registering a block-driver-like registration > trick that automatically invokes these functions instead of having to > hook them up in vl.c? the more we add, the more hacky it looks to not > have some subsystem-wide registration hook. 
Maybe it's just simpler to put a mig_init in migration/migration.c and
make one call in vl.c; I'd rather keep a simple function than a whole
registration mechanism.

> > diff --git a/migration/migration.c b/migration/migration.c
> > index e973837bfd..66e9cf03cd 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -150,6 +150,9 @@ MigrationIncomingState *migration_incoming_get_current(void)
> >          memset(&mis_current, 0, sizeof(MigrationIncomingState));
> >          qemu_mutex_init(&mis_current.rp_mutex);
> >          qemu_event_init(&mis_current.main_thread_load_event, false);
> > +
> > +        init_dirty_bitmap_incoming_migration();
> > +
> >          once = true;
> >      }
> >      return &mis_current;
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index 9bbfb3fa1b..b0c37ef9f1 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -1673,6 +1673,8 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
> >
> >      trace_loadvm_postcopy_handle_run_vmstart();
> >
> > +    dirty_bitmap_mig_before_vm_start();
> > +
> >      if (autostart) {
> >          /* Hold onto your hats, starting the CPU */
> >          vm_start();
> > diff --git a/vl.c b/vl.c
> > index ec299099ff..3d393aaf2c 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -4642,6 +4642,7 @@ int main(int argc, char **argv, char **envp)
> >
> >      blk_mig_init();
> >      ram_mig_init();
> > +    dirty_bitmap_mig_init();
> >
> >      /* If the currently selected machine wishes to override the units-per-bus
> >       * property of its default HBA interface type, do so now. */
> > diff --git a/migration/Makefile.objs b/migration/Makefile.objs
> > index 99e038024d..c83ec47ba8 100644
> > --- a/migration/Makefile.objs
> > +++ b/migration/Makefile.objs
> > @@ -6,6 +6,7 @@ common-obj-y += qemu-file.o global_state.o
> >  common-obj-y += qemu-file-channel.o
> >  common-obj-y += xbzrle.o postcopy-ram.o
> >  common-obj-y += qjson.o
> > +common-obj-y += block-dirty-bitmap.o
> >
> >  common-obj-$(CONFIG_RDMA) += rdma.o
> >
> > diff --git a/migration/trace-events b/migration/trace-events
> > index a04fffb877..e9eb8078d4 100644
> > --- a/migration/trace-events
> > +++ b/migration/trace-events
> > @@ -227,3 +227,17 @@ colo_vm_state_change(const char *old, const char *new) "Change '%s' => '%s'"
> >  colo_send_message(const char *msg) "Send '%s' message"
> >  colo_receive_message(const char *msg) "Receive '%s' message"
> >  colo_failover_set_state(const char *new_state) "new state %s"
> > +
> > +# migration/block-dirty-bitmap.c
> > +send_bitmap_header_enter(void) ""
> > +send_bitmap_bits(uint32_t flags, uint64_t start_sector, uint32_t nr_sectors, uint64_t data_size) "\n flags: 0x%x\n start_sector: %" PRIu64 "\n nr_sectors: %" PRIu32 "\n data_size: %" PRIu64 "\n"
> > +dirty_bitmap_save_iterate(int in_postcopy) "in postcopy: %d"
> > +dirty_bitmap_save_complete_enter(void) ""
> > +dirty_bitmap_save_complete_finish(void) ""
> > +dirty_bitmap_save_pending(uint64_t pending, uint64_t max_size) "pending %" PRIu64 " max: %" PRIu64
> > +dirty_bitmap_load_complete(void) ""
> > +dirty_bitmap_load_bits_enter(uint64_t first_sector, uint32_t nr_sectors) "chunk: %" PRIu64 " %" PRIu32
> > +dirty_bitmap_load_bits_zeroes(void) ""
> > +dirty_bitmap_load_header(uint32_t flags) "flags 0x%x"
> > +dirty_bitmap_load_enter(void) ""
> > +dirty_bitmap_load_success(void) ""
> >
> 
> OK, I think everything here is probably in order, and it's only my
> understanding that is a barrier at this point. Help me understand the
> rest of this patch and I'll re-pester migration to get this staged for
> qemu-next.
> 
> --js

Dave

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
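
As an aside, Dave's preferred alternative above (a plain mig_init in migration/migration.c, one call from vl.c) can be shown in miniature. The aggregate function name migration_modules_init() and the flag variables are invented for this sketch; only the three *_mig_init() names come from the quoted patch:

```c
/* Sketch (not QEMU code): keep explicit per-module init functions,
 * but list them in a single aggregate function owned by
 * migration/migration.c, so vl.c makes exactly one call. The flags
 * stand in for each module's real setup work. */
#include <assert.h>

static int blk_inited, ram_inited, bitmap_inited;

static void blk_mig_init(void)          { blk_inited = 1; }
static void ram_mig_init(void)          { ram_inited = 1; }
static void dirty_bitmap_mig_init(void) { bitmap_inited = 1; }

/* migration/migration.c: the only place that names every sub-module */
static void migration_modules_init(void)
{
    blk_mig_init();
    ram_mig_init();
    dirty_bitmap_mig_init();
}
```

Compared with the constructor trick, this keeps initialization order explicit and greppable, at the cost of touching one central function whenever a migration sub-module is added.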