* John Snow (js...@redhat.com) wrote: > > > On 10/30/2017 12:33 PM, Vladimir Sementsov-Ogievskiy wrote: > > Postcopy migration of dirty bitmaps. Only named dirty bitmaps, > > associated with root nodes and non-root named nodes are migrated. > > > > If destination qemu is already containing a dirty bitmap with the same name > > as a migrated bitmap (for the same node), then, if their granularities are > > the same the migration will be done, otherwise the error will be generated. > > > > If destination qemu doesn't contain such bitmap it will be created. > > > > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com> > > --- > > include/migration/misc.h | 3 + > > migration/migration.h | 3 + > > migration/block-dirty-bitmap.c | 734 > > +++++++++++++++++++++++++++++++++++++++++ > > Ouch :\ > > > migration/migration.c | 3 + > > migration/savevm.c | 2 + > > vl.c | 1 + > > migration/Makefile.objs | 1 + > > migration/trace-events | 14 + > > 8 files changed, 761 insertions(+) > > create mode 100644 migration/block-dirty-bitmap.c > > > > Organizationally, you introduce three new 'public' prototypes: > > dirty_bitmap_mig_init > dirty_bitmap_mig_before_vm_start > init_dirty_bitmap_incoming_migration > > mig_init is advertised in migration/misc.h, the other two are in > migration/migration.h. > The definitions for all three are in migration/block-dirty-bitmap.c > > In pure naivety, I find it weird to have something that you use in > migration.c and advertised in migration.h actually exist separately in > block-dirty-bitmap.c; but maybe this is the sanest thing to do.
Actually I think that's OK; it makes sense for all of the code for this feature to sit in one place, and there doesn't seem any point creating a header just for this one function. > > diff --git a/include/migration/misc.h b/include/migration/misc.h > > index c079b7771b..9cc539e232 100644 > > --- a/include/migration/misc.h > > +++ b/include/migration/misc.h > > @@ -55,4 +55,7 @@ bool migration_has_failed(MigrationState *); > > bool migration_in_postcopy_after_devices(MigrationState *); > > void migration_global_dump(Monitor *mon); > > > > +/* migration/block-dirty-bitmap.c */ > > +void dirty_bitmap_mig_init(void); > > + > > #endif > > diff --git a/migration/migration.h b/migration/migration.h > > index 50d1f01346..4e3ad04664 100644 > > --- a/migration/migration.h > > +++ b/migration/migration.h > > @@ -211,4 +211,7 @@ void migrate_send_rp_pong(MigrationIncomingState *mis, > > void migrate_send_rp_req_pages(MigrationIncomingState *mis, const char* > > rbname, > > ram_addr_t start, size_t len); > > > > +void dirty_bitmap_mig_before_vm_start(void); > > +void init_dirty_bitmap_incoming_migration(void); > > + > > #endif > > diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c > > new file mode 100644 > > index 0000000000..53cb20045d > > --- /dev/null > > +++ b/migration/block-dirty-bitmap.c > > @@ -0,0 +1,734 @@ > > +/* > > + * Block dirty bitmap postcopy migration > > + * > > + * Copyright IBM, Corp. 2009 > > + * Copyright (c) 2016-2017 Parallels International GmbH > > + * > > + * Authors: > > + * Liran Schour <lir...@il.ibm.com> > > + * Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com> > > + * > > + * This work is licensed under the terms of the GNU GPL, version 2. See > > + * the COPYING file in the top-level directory. > > + * This file is derived from migration/block.c, so it's author and IBM > > copyright > > + * are here, although content is quite different. 
> > + * > > + * Contributions after 2012-01-13 are licensed under the terms of the > > + * GNU GPL, version 2 or (at your option) any later version. > > + * > > + * *** > > + * > > + * Here postcopy migration of dirty bitmaps is realized. Only named dirty > > + * bitmaps, associated with root nodes and non-root named nodes are > > migrated. > > Put another way, only QMP-addressable bitmaps. Nodes without a name that > are not the root have no way to be addressed. > > > + * > > + * If destination qemu is already containing a dirty bitmap with the same > > name > > "If the destination QEMU already contains a dirty bitmap with the same name" > > > + * as a migrated bitmap (for the same node), then, if their granularities > > are > > + * the same the migration will be done, otherwise the error will be > > generated. > > "an error" > > > + * > > + * If destination qemu doesn't contain such bitmap it will be created. > > "If the destination QEMU doesn't contain such a bitmap, it will be created." > > > + * > > + * format of migration: > > + * > > + * # Header (shared for different chunk types) > > + * 1, 2 or 4 bytes: flags (see qemu_{put,put}_flags) > > + * [ 1 byte: node name size ] \ flags & DEVICE_NAME > > + * [ n bytes: node name ] / > > + * [ 1 byte: bitmap name size ] \ flags & BITMAP_NAME > > + * [ n bytes: bitmap name ] / > > + * > > + * # Start of bitmap migration (flags & START) > > + * header > > + * be64: granularity > > + * 1 byte: bitmap flags (corresponds to BdrvDirtyBitmap) > > + * bit 0 - bitmap is enabled > > + * bit 1 - bitmap is persistent > > + * bit 2 - bitmap is autoloading > > + * bits 3-7 - reserved, must be zero > > + * > > + * # Complete of bitmap migration (flags & COMPLETE) > > + * header > > + * > > + * # Data chunk of bitmap migration > > + * header > > + * be64: start sector > > + * be32: number of sectors > > + * [ be64: buffer size ] \ ! 
(flags & ZEROES) > > + * [ n bytes: buffer ] / > > + * > > + * The last chunk in stream should contain flags & EOS. The chunk may skip > > + * device and/or bitmap names, assuming them to be the same with the > > previous > > + * chunk. > > + */ > > + > > Been a while since I reviewed the format, but it seems sane. > > > +#include "qemu/osdep.h" > > +#include "block/block.h" > > +#include "block/block_int.h" > > +#include "sysemu/block-backend.h" > > +#include "qemu/main-loop.h" > > +#include "qemu/error-report.h" > > +#include "migration/misc.h" > > +#include "migration/migration.h" > > +#include "migration/qemu-file.h" > > +#include "migration/vmstate.h" > > +#include "migration/register.h" > > +#include "qemu/hbitmap.h" > > +#include "sysemu/sysemu.h" > > +#include "qemu/cutils.h" > > +#include "qapi/error.h" > > +#include "trace.h" > > + > > +#define CHUNK_SIZE (1 << 10) > > + > > +/* Flags occupy one, two or four bytes (Big Endian). The size is > > determined as > > + * follows: > > + * in first (most significant) byte bit 8 is clear --> one byte > > + * in first byte bit 8 is set --> two or four bytes, depending on > > second > > + * byte: > > + * | in second byte bit 8 is clear --> two bytes > > + * | in second byte bit 8 is set --> four bytes > > + */ > > +#define DIRTY_BITMAP_MIG_FLAG_EOS 0x01 > > +#define DIRTY_BITMAP_MIG_FLAG_ZEROES 0x02 > > +#define DIRTY_BITMAP_MIG_FLAG_BITMAP_NAME 0x04 > > +#define DIRTY_BITMAP_MIG_FLAG_DEVICE_NAME 0x08 > > +#define DIRTY_BITMAP_MIG_FLAG_START 0x10 > > +#define DIRTY_BITMAP_MIG_FLAG_COMPLETE 0x20 > > +#define DIRTY_BITMAP_MIG_FLAG_BITS 0x40 > > + > > +#define DIRTY_BITMAP_MIG_EXTRA_FLAGS 0x80 > > + > > +#define DIRTY_BITMAP_MIG_START_FLAG_ENABLED 0x01 > > +#define DIRTY_BITMAP_MIG_START_FLAG_PERSISTENT 0x02 > > +#define DIRTY_BITMAP_MIG_START_FLAG_AUTOLOAD 0x04 > > +#define DIRTY_BITMAP_MIG_START_FLAG_RESERVED_MASK 0xf8 > > + > > +typedef struct DirtyBitmapMigBitmapState { > > + /* Written during setup phase. 
*/ > > + BlockDriverState *bs; > > + const char *node_name; > > + BdrvDirtyBitmap *bitmap; > > + uint64_t total_sectors; > > + uint64_t sectors_per_chunk; > > + QSIMPLEQ_ENTRY(DirtyBitmapMigBitmapState) entry; > > + uint8_t flags; > > + > > + /* For bulk phase. */ > > + bool bulk_completed; > > + uint64_t cur_sector; > > +} DirtyBitmapMigBitmapState; > > + > > +typedef struct DirtyBitmapMigState { > > + QSIMPLEQ_HEAD(dbms_list, DirtyBitmapMigBitmapState) dbms_list; > > + > > + bool bulk_completed; > > + > > + /* for send_bitmap_bits() */ > > + BlockDriverState *prev_bs; > > + BdrvDirtyBitmap *prev_bitmap; > > +} DirtyBitmapMigState; > > + > > +typedef struct DirtyBitmapLoadState { > > + uint32_t flags; > > + char node_name[256]; > > + char bitmap_name[256]; > > + BlockDriverState *bs; > > + BdrvDirtyBitmap *bitmap; > > +} DirtyBitmapLoadState; > > + > > +static DirtyBitmapMigState dirty_bitmap_mig_state; > > + > > +typedef struct DirtyBitmapLoadBitmapState { > > + BlockDriverState *bs; > > + BdrvDirtyBitmap *bitmap; > > + bool migrated; > > +} DirtyBitmapLoadBitmapState; > > +static GSList *enabled_bitmaps; > > +QemuMutex finish_lock; > > + > > +void init_dirty_bitmap_incoming_migration(void) > > +{ > > + qemu_mutex_init(&finish_lock); > > +} > > + > > This is a little odd as public interface. David, is there a nicer way to > integrate in-migrate hooks? I guess it hasn't come up yet. Anyway, it > might be nice to leave a comment here for now that says that the only > caller is migration.c, and it will only ever call it once. I don't think having an init_ function like that is a problem; we've got an init_blk_migration, so I guess it's similar. I generally prefer things to be part of the incoming state structure rather than a file global; but that's not a huge one. 
> > +static uint32_t qemu_get_bitmap_flags(QEMUFile *f) > > +{ > > + uint8_t flags = qemu_get_byte(f); > > + if (flags & DIRTY_BITMAP_MIG_EXTRA_FLAGS) { > > + flags = flags << 8 | qemu_get_byte(f); > > + if (flags & DIRTY_BITMAP_MIG_EXTRA_FLAGS) { > > + flags = flags << 16 | qemu_get_be16(f); > > + } > > + } > > + > > + return flags; > > +} > > + > > ok > > (Sorry for the per-function ACKs, it's just helpful for me to know which > functions I followed execution of on paper to make sure I got everything > in this big patch.) > > > +static void qemu_put_bitmap_flags(QEMUFile *f, uint32_t flags) > > +{ > > + /* The code currently do not send flags more than one byte */ > > + assert(!(flags & (0xffffff00 | DIRTY_BITMAP_MIG_EXTRA_FLAGS))); > > + > > + qemu_put_byte(f, flags); > > +} > > + > > ok > > > +static void send_bitmap_header(QEMUFile *f, DirtyBitmapMigBitmapState > > *dbms, > > + uint32_t additional_flags) > > +{ > > + BlockDriverState *bs = dbms->bs; > > + BdrvDirtyBitmap *bitmap = dbms->bitmap; > > + uint32_t flags = additional_flags; > > + trace_send_bitmap_header_enter(); > > + > > + if (bs != dirty_bitmap_mig_state.prev_bs) { > > + dirty_bitmap_mig_state.prev_bs = bs; > > + flags |= DIRTY_BITMAP_MIG_FLAG_DEVICE_NAME; > > + } > > + > > + if (bitmap != dirty_bitmap_mig_state.prev_bitmap) { > > + dirty_bitmap_mig_state.prev_bitmap = bitmap; > > + flags |= DIRTY_BITMAP_MIG_FLAG_BITMAP_NAME; > > + } > > + > > I guess the idea here is that we might be able to skip the node name > broadcast, leaving the possibilities as: > > - new node and bitmap: send both > - new bitmap, but not node: send bitmap name only > - same for both: send neither > > and that otherwise it's not possible to have a new node but "same > bitmap" by nature of how the structures are organized. 
> > > + qemu_put_bitmap_flags(f, flags); > > + > > + if (flags & DIRTY_BITMAP_MIG_FLAG_DEVICE_NAME) { > > + qemu_put_counted_string(f, dbms->node_name); > > + } > > + > > + if (flags & DIRTY_BITMAP_MIG_FLAG_BITMAP_NAME) { > > + qemu_put_counted_string(f, bdrv_dirty_bitmap_name(bitmap)); > > + } > > +} > > + > > ok > > > +static void send_bitmap_start(QEMUFile *f, DirtyBitmapMigBitmapState *dbms) > > +{ > > + send_bitmap_header(f, dbms, DIRTY_BITMAP_MIG_FLAG_START); > > + qemu_put_be32(f, bdrv_dirty_bitmap_granularity(dbms->bitmap)); > > + qemu_put_byte(f, dbms->flags); > > +} > > + > > ok > > > +static void send_bitmap_complete(QEMUFile *f, DirtyBitmapMigBitmapState > > *dbms) > > +{ > > + send_bitmap_header(f, dbms, DIRTY_BITMAP_MIG_FLAG_COMPLETE); > > +} > > + > > ok > > > +static void send_bitmap_bits(QEMUFile *f, DirtyBitmapMigBitmapState *dbms, > > + uint64_t start_sector, uint32_t nr_sectors) > > +{ > > + /* align for buffer_is_zero() */ > > + uint64_t align = 4 * sizeof(long); > > + uint64_t unaligned_size = > > + bdrv_dirty_bitmap_serialization_size( > > + dbms->bitmap, start_sector << BDRV_SECTOR_BITS, > > + (uint64_t)nr_sectors << BDRV_SECTOR_BITS); > > + uint64_t buf_size = (unaligned_size + align - 1) & ~(align - 1); > > + uint8_t *buf = g_malloc0(buf_size); > > + uint32_t flags = DIRTY_BITMAP_MIG_FLAG_BITS; > > + > > + bdrv_dirty_bitmap_serialize_part( > > + dbms->bitmap, buf, start_sector << BDRV_SECTOR_BITS, > > + (uint64_t)nr_sectors << BDRV_SECTOR_BITS); > > + > > + if (buffer_is_zero(buf, buf_size)) { > > + g_free(buf); > > + buf = NULL; > > + flags |= DIRTY_BITMAP_MIG_FLAG_ZEROES; > > + } > > + > > + trace_send_bitmap_bits(flags, start_sector, nr_sectors, buf_size); > > + > > + send_bitmap_header(f, dbms, flags); > > + > > + qemu_put_be64(f, start_sector); > > + qemu_put_be32(f, nr_sectors); > > + > > + /* if a block is zero we need to flush here since the network > > + * bandwidth is now a lot higher than the storage device bandwidth. 
> > + * thus if we queue zero blocks we slow down the migration. */ > > Can you elaborate on this for me? > > > + if (flags & DIRTY_BITMAP_MIG_FLAG_ZEROES) { > > + qemu_fflush(f); > > + } else { > > + qemu_put_be64(f, buf_size); > > + qemu_put_buffer(f, buf, buf_size); > > + } > > + > > + g_free(buf); > > +} > > + > > + > > +/* Called with iothread lock taken. */ > > + > > +static int init_dirty_bitmap_migration(void) > > +{ > > + BlockDriverState *bs; > > + BdrvDirtyBitmap *bitmap; > > + DirtyBitmapMigBitmapState *dbms; > > + BdrvNextIterator it; > > + > > + dirty_bitmap_mig_state.bulk_completed = false; > > + dirty_bitmap_mig_state.prev_bs = NULL; > > + dirty_bitmap_mig_state.prev_bitmap = NULL; > > + > > + for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) { > > + if (!bdrv_get_device_or_node_name(bs)) { > > + /* not named non-root node */ > > I can't imagine the situation it would arise in, but is it possible to > have a named bitmap attached to a now-anonymous node? > > Let's say we attach a bitmap to a root node, but then later we insert a > filter or something above it and it's no longer at the root. > > We should probably prohibit such things, or at the very least toss out > an error here instead of silently continuing. > > I think the only things valid to just *skip* are nameless bitmaps. > Anything named we really ought to either migrate or error out over, I > think, even if the circumstances leading to such a configuration are > very unlikely. 
> > > + continue; > > + } > > + > > + for (bitmap = bdrv_dirty_bitmap_next(bs, NULL); bitmap; > > + bitmap = bdrv_dirty_bitmap_next(bs, bitmap)) { > > + if (!bdrv_dirty_bitmap_name(bitmap)) { > > + continue; > > + } > > + > > + if (bdrv_dirty_bitmap_frozen(bitmap)) { > > + error_report("Can't migrate frozen dirty bitmap: '%s", > > + bdrv_dirty_bitmap_name(bitmap)); > > + return -1; > > + } > > + } > > + } > > + > > + for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) { > > + if (!bdrv_get_device_or_node_name(bs)) { > > + /* not named non-root node */ > > + continue; > > + } > > Why two-pass? I guess because we don't have to tear anything down if we > check for errors in advance? > > I worry that if we need to amend the logic here that it's error-prone to > update it in two places, so maybe we ought to just have one loop. > > > + > > + for (bitmap = bdrv_dirty_bitmap_next(bs, NULL); bitmap; > > + bitmap = bdrv_dirty_bitmap_next(bs, bitmap)) { > > + if (!bdrv_dirty_bitmap_name(bitmap)) { > > + continue; > > + } > > + > > + bdrv_ref(bs); > > + bdrv_dirty_bitmap_set_frozen(bitmap, true); > > + > > We could say that for any bitmap in the list of pending bitmaps to > migrate, we know that we have to un-freeze it, since we never add any > bitmaps that are frozen to our list. > > > + dbms = g_new0(DirtyBitmapMigBitmapState, 1); > > + dbms->bs = bs; > > + dbms->node_name = bdrv_get_node_name(bs); > > + if (!dbms->node_name || dbms->node_name[0] == '\0') { > > + dbms->node_name = bdrv_get_device_name(bs); > > + } > > + dbms->bitmap = bitmap; > > + dbms->total_sectors = bdrv_nb_sectors(bs); > > + dbms->sectors_per_chunk = CHUNK_SIZE * 8 * > > + bdrv_dirty_bitmap_granularity(bitmap) >> BDRV_SECTOR_BITS; > > Eric may want to avoid checking in new code that thinks in sectors, but > for the sake of review I don't mind right now. > > (Sorry, Eric!) 
> > > + if (bdrv_dirty_bitmap_enabled(bitmap)) { > > + dbms->flags |= DIRTY_BITMAP_MIG_START_FLAG_ENABLED; > > + } > > + if (bdrv_dirty_bitmap_get_persistance(bitmap)) { > > + dbms->flags |= DIRTY_BITMAP_MIG_START_FLAG_PERSISTENT; > > + } > > + if (bdrv_dirty_bitmap_get_autoload(bitmap)) { > > + dbms->flags |= DIRTY_BITMAP_MIG_START_FLAG_AUTOLOAD; > > + } > > + > > + bdrv_dirty_bitmap_set_persistance(bitmap, false); > > Oh, this might be stranger to undo. Perhaps what we CAN do is limit the > second pass to just this action and allow ourselves to unroll everything > else. > > > + > > + QSIMPLEQ_INSERT_TAIL(&dirty_bitmap_mig_state.dbms_list, > > + dbms, entry); > > + } > > + } > > + > > + return 0; > > +} > > + > > +/* Called with no lock taken. */ > > +static void bulk_phase_send_chunk(QEMUFile *f, DirtyBitmapMigBitmapState > > *dbms) > > +{ > > + uint32_t nr_sectors = MIN(dbms->total_sectors - dbms->cur_sector, > > + dbms->sectors_per_chunk); > > + > > + send_bitmap_bits(f, dbms, dbms->cur_sector, nr_sectors); > > + > > + dbms->cur_sector += nr_sectors; > > + if (dbms->cur_sector >= dbms->total_sectors) { > > + dbms->bulk_completed = true; > > + } > > +} > > + > > +/* Called with no lock taken. */ > > +static void bulk_phase(QEMUFile *f, bool limit) > > +{ > > + DirtyBitmapMigBitmapState *dbms; > > + > > + QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, entry) { > > + while (!dbms->bulk_completed) { > > + bulk_phase_send_chunk(f, dbms); > > + if (limit && qemu_file_rate_limit(f)) { > > + return; > > + } > > + } > > + } > > + > > + dirty_bitmap_mig_state.bulk_completed = true; > > +} > > + > > +/* Called with iothread lock taken. 
*/ > > +static void dirty_bitmap_mig_cleanup(void) > > +{ > > + DirtyBitmapMigBitmapState *dbms; > > + > > + while ((dbms = QSIMPLEQ_FIRST(&dirty_bitmap_mig_state.dbms_list)) != > > NULL) { > > + QSIMPLEQ_REMOVE_HEAD(&dirty_bitmap_mig_state.dbms_list, entry); > > + bdrv_dirty_bitmap_set_frozen(dbms->bitmap, false); > > + bdrv_unref(dbms->bs); > > + g_free(dbms); > > + } > > +} > > + > > ok > > > +/* for SaveVMHandlers */ > > +static void dirty_bitmap_save_cleanup(void *opaque) > > +{ > > + dirty_bitmap_mig_cleanup(); > > +} > > + > > ok > > > +static int dirty_bitmap_save_iterate(QEMUFile *f, void *opaque) > > +{ > > + trace_dirty_bitmap_save_iterate(migration_in_postcopy()); > > + > > + if (migration_in_postcopy() && !dirty_bitmap_mig_state.bulk_completed) > > { > > + bulk_phase(f, true); > > + } > > + > > + qemu_put_bitmap_flags(f, DIRTY_BITMAP_MIG_FLAG_EOS); > > + > > + return dirty_bitmap_mig_state.bulk_completed; > > +} > > + > > What's the purpose behind doing bulk save both here and in > dirty_bitmap_save_complete? Is there a path that isn't guaranteed to > call one of the completion functions? > > (The way it's coded seems like it'll work fine, but I'm curious about > what looks like a redundancy at a glance.) I think I can see the reasoning; I think the idea is that in each iteration you send a chunk of data (note the 'true' means that it gets modulated by bandwidth limiting - although we currently don't have that in postcopy - I need to fix that) - so it might not actually send all of it. The complete guarantees it's all sent. So note this IS iterating potentially (OK but probably not at the moment) > > +/* Called with iothread lock taken. 
*/ > > + > > +static int dirty_bitmap_save_complete(QEMUFile *f, void *opaque) > > +{ > > + DirtyBitmapMigBitmapState *dbms; > > + trace_dirty_bitmap_save_complete_enter(); > > + > > + if (!dirty_bitmap_mig_state.bulk_completed) { > > + bulk_phase(f, false); > > + } > > + > > funny now that we don't actually really iterate over the data, and the > bulk phase is now really the _only_ phase :) > > > + QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, entry) { > > + send_bitmap_complete(f, dbms); > > + } > > + > > + qemu_put_bitmap_flags(f, DIRTY_BITMAP_MIG_FLAG_EOS); > > + > > + trace_dirty_bitmap_save_complete_finish(); > > + > > + dirty_bitmap_mig_cleanup(); > > + return 0; > > +} > > + > > ok > > > +static void dirty_bitmap_save_pending(QEMUFile *f, void *opaque, > > + uint64_t max_size, > > + uint64_t *res_precopy_only, > > + uint64_t *res_compatible, > > + uint64_t *res_postcopy_only) > > +{ > > + DirtyBitmapMigBitmapState *dbms; > > + uint64_t pending = 0; > > + > > + qemu_mutex_lock_iothread(); > > + > > + QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, entry) { > > + uint64_t gran = bdrv_dirty_bitmap_granularity(dbms->bitmap); > > + uint64_t sectors = dbms->bulk_completed ? 0 : > > + dbms->total_sectors - dbms->cur_sector; > > + > > + pending += (sectors * BDRV_SECTOR_SIZE + gran - 1) / gran; > > + } > > + > > + qemu_mutex_unlock_iothread(); > > + > > + trace_dirty_bitmap_save_pending(pending, max_size); > > + > > + *res_postcopy_only += pending; > > +} > > + > > ok > > > +/* First occurrence of this bitmap. 
It should be created if doesn't exist > > */ > > +static int dirty_bitmap_load_start(QEMUFile *f, DirtyBitmapLoadState *s) > > +{ > > + Error *local_err = NULL; > > + uint32_t granularity = qemu_get_be32(f); > > + uint8_t flags = qemu_get_byte(f); > > + > > + if (!s->bitmap) { > > + s->bitmap = bdrv_create_dirty_bitmap(s->bs, granularity, > > + s->bitmap_name, &local_err); > > + if (!s->bitmap) { > > + error_report_err(local_err); > > + return -EINVAL; > > + } > > + } else { > > + uint32_t dest_granularity = > > + bdrv_dirty_bitmap_granularity(s->bitmap); > > + if (dest_granularity != granularity) { > > + error_report("Error: " > > + "Migrated bitmap granularity (%" PRIu32 ") " > > + "doesn't match the destination bitmap '%s' " > > + "granularity (%" PRIu32 ")", > > + granularity, > > + bdrv_dirty_bitmap_name(s->bitmap), > > + dest_granularity); > > + return -EINVAL; > > + } > > + } > > + > > I'm a fan of auto-creating the bitmaps. Do you have a use-case for why > creating them ahead of time is better, or are you just attempting to be > flexible? > > > + if (flags & DIRTY_BITMAP_MIG_START_FLAG_RESERVED_MASK) { > > + error_report("Unknown flags in migrated dirty bitmap header: %x", > > + flags); > > + return -EINVAL; > > + } > > + > > + if (flags & DIRTY_BITMAP_MIG_START_FLAG_PERSISTENT) { > > + bdrv_dirty_bitmap_set_persistance(s->bitmap, true); > > + } > > + if (flags & DIRTY_BITMAP_MIG_START_FLAG_AUTOLOAD) { > > + bdrv_dirty_bitmap_set_autoload(s->bitmap, true); > > + } > > + > > + bdrv_disable_dirty_bitmap(s->bitmap); > > OK, so we start them off as disabled, and > > > + if (flags & DIRTY_BITMAP_MIG_START_FLAG_ENABLED) { > > + DirtyBitmapLoadBitmapState *b; > > + > > + bdrv_dirty_bitmap_create_successor(s->bs, s->bitmap, &local_err); > > + if (local_err) { > > + error_report_err(local_err); > > + return -EINVAL; > > + } > > + > > If they weren't disabled on the host, we create a successor to record > writes while we wait for the rest of the data to arrive. 
> > > + b = g_new(DirtyBitmapLoadBitmapState, 1); > > + b->bs = s->bs; > > + b->bitmap = s->bitmap; > > + b->migrated = false; > > + enabled_bitmaps = g_slist_prepend(enabled_bitmaps, b); > > And we make a note of which ones we were supposed to re-enable. > > > + } > > + > > + return 0; > > +} > > OK > > > + > > +void dirty_bitmap_mig_before_vm_start(void) > > Similarly, I guess I find it weird that this is a callable interface. > > David, no nice hook for just-prior-to-vm-start calls? a > .postcopy_pivot() hook or something might be nice.. Hmm, the other way a lot of devices do it is to hook the runstate change. > > +{ > > + GSList *item; > > + > > + qemu_mutex_lock(&finish_lock); > > + > > + for (item = enabled_bitmaps; item; item = g_slist_next(item)) { > > + DirtyBitmapLoadBitmapState *b = item->data; > > + > > Anyway, if I am reading this right; we call this in the postcopy phase > prior to receiving the entirety of the bitmaps (so before receiving > COMPLETE) ... right? > > > + if (b->migrated) { > > + bdrv_enable_dirty_bitmap(b->bitmap); > > ...or, am I confused, and we might receive a COMPLETE event prior to the > postcopy pivot? > > Anyway, I suppose this code says "If we received the entire bitmap, go > ahead and enable it." > > > + } else { > > + bdrv_dirty_bitmap_enable_successor(b->bitmap); > > And this says "If we haven't received it yet, enable the successor to > record writes until we get the rest of the data." 
> > > + } > > + > > + g_free(b); > > + } > > + > > + g_slist_free(enabled_bitmaps); > > + enabled_bitmaps = NULL; > > + > > + qemu_mutex_unlock(&finish_lock); > > +} > > + > > +static void dirty_bitmap_load_complete(QEMUFile *f, DirtyBitmapLoadState > > *s) > > +{ > > + GSList *item; > > + trace_dirty_bitmap_load_complete(); > > + bdrv_dirty_bitmap_deserialize_finish(s->bitmap); > > + > > + qemu_mutex_lock(&finish_lock); > > + > > + for (item = enabled_bitmaps; item; item = g_slist_next(item)) { > > + DirtyBitmapLoadBitmapState *b = item->data; > > + > > + if (b->bitmap == s->bitmap) { > > + b->migrated = true; > > we can probably break; here now, right? > > (This whole stanza is a little strange, can't we cache the active > DirtyBitmapLoadBitmapState in DirtyBitmapLoadState? I guess not, because > we're looking up the BdrvDirtyBitmap itself and caching that instead, so > either way we have some kind of lookup on every context switch.) > > > + } > > + } > > + > > + if (bdrv_dirty_bitmap_frozen(s->bitmap)) { > > + if (enabled_bitmaps == NULL) { > > + /* in postcopy */ > > + AioContext *aio_context = bdrv_get_aio_context(s->bs); > > + aio_context_acquire(aio_context); > > + > > + bdrv_reclaim_dirty_bitmap(s->bs, s->bitmap, &error_abort); > > + bdrv_enable_dirty_bitmap(s->bitmap); > > + > > OK, so if enabled_bitmaps is gone, that means we already pivoted and > we're live on the receiving end here. We merge the successor into the > fully deserialized bitmap and enable it. > > > + aio_context_release(aio_context); > > + } else { > > + /* target not started, successor is empty */ > > + bdrv_dirty_bitmap_release_successor(s->bs, s->bitmap); > > Otherwise we just trash the successor. Did we really need a new call for > this? I suppose it's faster than merging a 0-bit bitmap, but the > additional API complexity for the speed win here seems like premature > optimization. > > ...well, you already wrote it, so I won't argue. 
> > > + } > > + } > > + > > + qemu_mutex_unlock(&finish_lock); > > +} > > + > > +static int dirty_bitmap_load_bits(QEMUFile *f, DirtyBitmapLoadState *s) > > +{ > > + uint64_t first_byte = qemu_get_be64(f) << BDRV_SECTOR_BITS; > > + uint64_t nr_bytes = (uint64_t)qemu_get_be32(f) << BDRV_SECTOR_BITS; > > + trace_dirty_bitmap_load_bits_enter(first_byte >> BDRV_SECTOR_BITS, > > + nr_bytes >> BDRV_SECTOR_BITS); > > + > > + if (s->flags & DIRTY_BITMAP_MIG_FLAG_ZEROES) { > > + trace_dirty_bitmap_load_bits_zeroes(); > > + bdrv_dirty_bitmap_deserialize_zeroes(s->bitmap, first_byte, > > nr_bytes, > > + false); > > + } else { > > + uint8_t *buf; > > + uint64_t buf_size = qemu_get_be64(f); > > + uint64_t needed_size = > > + bdrv_dirty_bitmap_serialization_size(s->bitmap, > > + first_byte, nr_bytes); > > + > > + if (needed_size > buf_size) { > > + error_report("Error: Migrated bitmap granularity doesn't " > > + "match the destination bitmap '%s' granularity", > > + bdrv_dirty_bitmap_name(s->bitmap)); > > + return -EINVAL; > > + } > > + > > + buf = g_malloc(buf_size); > > + qemu_get_buffer(f, buf, buf_size); > > + bdrv_dirty_bitmap_deserialize_part(s->bitmap, buf, first_byte, > > nr_bytes, > > + false); > > + g_free(buf); > > + } > > + > > + return 0; > > +} > > + > > ok > > > +static int dirty_bitmap_load_header(QEMUFile *f, DirtyBitmapLoadState *s) > > +{ > > + Error *local_err = NULL; > > + s->flags = qemu_get_bitmap_flags(f); > > + trace_dirty_bitmap_load_header(s->flags); > > + > > + if (s->flags & DIRTY_BITMAP_MIG_FLAG_DEVICE_NAME) { > > + if (!qemu_get_counted_string(f, s->node_name)) { > > + error_report("Unable to read node name string"); > > + return -EINVAL; > > + } > > + s->bs = bdrv_lookup_bs(s->node_name, s->node_name, &local_err); > > + if (!s->bs) { > > + error_report_err(local_err); > > + return -EINVAL; > > + } > > + } else if (!s->bs) { > > + error_report("Error: block device name is not set"); > > + return -EINVAL; > > + } > > + > > + if (s->flags & 
DIRTY_BITMAP_MIG_FLAG_BITMAP_NAME) { > > + if (!qemu_get_counted_string(f, s->bitmap_name)) { > > + error_report("Unable to read node name string"); > > + return -EINVAL; > > + } > > + s->bitmap = bdrv_find_dirty_bitmap(s->bs, s->bitmap_name); > > + > > + /* bitmap may be NULL here, it wouldn't be an error if it is the > > + * first occurrence of the bitmap */ > > + if (!s->bitmap && !(s->flags & DIRTY_BITMAP_MIG_FLAG_START)) { > > + error_report("Error: unknown dirty bitmap " > > + "'%s' for block device '%s'", > > + s->bitmap_name, s->node_name); > > + return -EINVAL; > > + } > > + } else if (!s->bitmap) { > > + error_report("Error: block device name is not set"); > > + return -EINVAL; > > + } > > + > > + return 0; > > +} > > + > > ok > > > +static int dirty_bitmap_load(QEMUFile *f, void *opaque, int version_id) > > +{ > > + static DirtyBitmapLoadState s; > > + int ret = 0; > > + > > + trace_dirty_bitmap_load_enter(); > > + > > + if (version_id != 1) { > > + return -EINVAL; > > + } > > + > > + do { > > + dirty_bitmap_load_header(f, &s); > > + > > + if (s.flags & DIRTY_BITMAP_MIG_FLAG_START) { > > + ret = dirty_bitmap_load_start(f, &s); > > + } else if (s.flags & DIRTY_BITMAP_MIG_FLAG_COMPLETE) { > > + dirty_bitmap_load_complete(f, &s); > > + } else if (s.flags & DIRTY_BITMAP_MIG_FLAG_BITS) { > > + ret = dirty_bitmap_load_bits(f, &s); > > + } > > + > > + if (!ret) { > > + ret = qemu_file_get_error(f); > > + } > > + > > + if (ret) { > > + return ret; > > + } > > + } while (!(s.flags & DIRTY_BITMAP_MIG_FLAG_EOS)); > > + > > + trace_dirty_bitmap_load_success(); > > + return 0; > > +} > > OK > > > + > > +static int dirty_bitmap_save_setup(QEMUFile *f, void *opaque) > > +{ > > + DirtyBitmapMigBitmapState *dbms = NULL; > > + if (init_dirty_bitmap_migration() < 0) { > > + return -1; > > + } > > + > > + QSIMPLEQ_FOREACH(dbms, &dirty_bitmap_mig_state.dbms_list, entry) { > > + send_bitmap_start(f, dbms); > > + } > > + qemu_put_bitmap_flags(f, DIRTY_BITMAP_MIG_FLAG_EOS); > > 
+ > > + return 0; > > +} > > + > > ok > > > +static bool dirty_bitmap_is_active(void *opaque) > > +{ > > + return migrate_dirty_bitmaps(); > > +} > > + > > ok > > > +static bool dirty_bitmap_is_active_iterate(void *opaque) > > +{ > > + return dirty_bitmap_is_active(opaque) && !runstate_is_running(); > > +} > > + > > On second thought, in patch 9, can you add a little tiny bit of > documentation text explaining the exact nature of this callback? > > Is the second portion of the conditional here so that once we reach the > postcopy state that .is_active_iterate starts returning true? > > "ok" if I'm reading this right, but it might be helped along by a little > comment. > > > +static bool dirty_bitmap_has_postcopy(void *opaque) > > +{ > > + return true; > > +} > > + > > ok! :) > > > +static SaveVMHandlers savevm_dirty_bitmap_handlers = { > > + .save_setup = dirty_bitmap_save_setup, > > + .save_live_complete_postcopy = dirty_bitmap_save_complete, > > + .save_live_complete_precopy = dirty_bitmap_save_complete, > > + .has_postcopy = dirty_bitmap_has_postcopy, > > + .save_live_pending = dirty_bitmap_save_pending, > > + .save_live_iterate = dirty_bitmap_save_iterate, > > + .is_active_iterate = dirty_bitmap_is_active_iterate, > > + .load_state = dirty_bitmap_load, > > + .save_cleanup = dirty_bitmap_save_cleanup, > > + .is_active = dirty_bitmap_is_active, > > +}; > > + > > +void dirty_bitmap_mig_init(void) > > +{ > > + QSIMPLEQ_INIT(&dirty_bitmap_mig_state.dbms_list); > > + > > + register_savevm_live(NULL, "dirty-bitmap", 0, 1, > > + &savevm_dirty_bitmap_handlers, > > + &dirty_bitmap_mig_state); > > +} > > ok > > dgilbert, would it be worth registering a block-driver-like registration > trick that automatically invokes these functions instead of having to > hook them up in vl.c? the more we add, the more hacky it looks to not > have some subsystem-wide registration hook. 
Maybe it's just simpler to put a mig_init in migration/migration.c and
make one call in vl.c; I'd rather keep a simple function than a whole
registration mechanism.

> > diff --git a/migration/migration.c b/migration/migration.c
> > index e973837bfd..66e9cf03cd 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -150,6 +150,9 @@ MigrationIncomingState *migration_incoming_get_current(void)
> >          memset(&mis_current, 0, sizeof(MigrationIncomingState));
> >          qemu_mutex_init(&mis_current.rp_mutex);
> >          qemu_event_init(&mis_current.main_thread_load_event, false);
> > +
> > +        init_dirty_bitmap_incoming_migration();
> > +
> >          once = true;
> >      }
> >      return &mis_current;
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index 9bbfb3fa1b..b0c37ef9f1 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -1673,6 +1673,8 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
> >
> >      trace_loadvm_postcopy_handle_run_vmstart();
> >
> > +    dirty_bitmap_mig_before_vm_start();
> > +
> >      if (autostart) {
> >          /* Hold onto your hats, starting the CPU */
> >          vm_start();
> > diff --git a/vl.c b/vl.c
> > index ec299099ff..3d393aaf2c 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -4642,6 +4642,7 @@ int main(int argc, char **argv, char **envp)
> >
> >      blk_mig_init();
> >      ram_mig_init();
> > +    dirty_bitmap_mig_init();
> >
> >      /* If the currently selected machine wishes to override the units-per-bus
> >       * property of its default HBA interface type, do so now. */
> > diff --git a/migration/Makefile.objs b/migration/Makefile.objs
> > index 99e038024d..c83ec47ba8 100644
> > --- a/migration/Makefile.objs
> > +++ b/migration/Makefile.objs
> > @@ -6,6 +6,7 @@ common-obj-y += qemu-file.o global_state.o
> >  common-obj-y += qemu-file-channel.o
> >  common-obj-y += xbzrle.o postcopy-ram.o
> >  common-obj-y += qjson.o
> > +common-obj-y += block-dirty-bitmap.o
> >
> >  common-obj-$(CONFIG_RDMA) += rdma.o
> >
> > diff --git a/migration/trace-events b/migration/trace-events
> > index a04fffb877..e9eb8078d4 100644
> > --- a/migration/trace-events
> > +++ b/migration/trace-events
> > @@ -227,3 +227,17 @@ colo_vm_state_change(const char *old, const char *new) "Change '%s' => '%s'"
> >  colo_send_message(const char *msg) "Send '%s' message"
> >  colo_receive_message(const char *msg) "Receive '%s' message"
> >  colo_failover_set_state(const char *new_state) "new state %s"
> > +
> > +# migration/block-dirty-bitmap.c
> > +send_bitmap_header_enter(void) ""
> > +send_bitmap_bits(uint32_t flags, uint64_t start_sector, uint32_t nr_sectors, uint64_t data_size) "\n flags: 0x%x\n start_sector: %" PRIu64 "\n nr_sectors: %" PRIu32 "\n data_size: %" PRIu64 "\n"
> > +dirty_bitmap_save_iterate(int in_postcopy) "in postcopy: %d"
> > +dirty_bitmap_save_complete_enter(void) ""
> > +dirty_bitmap_save_complete_finish(void) ""
> > +dirty_bitmap_save_pending(uint64_t pending, uint64_t max_size) "pending %" PRIu64 " max: %" PRIu64
> > +dirty_bitmap_load_complete(void) ""
> > +dirty_bitmap_load_bits_enter(uint64_t first_sector, uint32_t nr_sectors) "chunk: %" PRIu64 " %" PRIu32
> > +dirty_bitmap_load_bits_zeroes(void) ""
> > +dirty_bitmap_load_header(uint32_t flags) "flags 0x%x"
> > +dirty_bitmap_load_enter(void) ""
> > +dirty_bitmap_load_success(void) ""
> >
> 
> OK, I think everything here is probably in order, and it's only my
> understanding that is a barrier at this point. Help me understand the
> rest of this patch and I'll re-pester migration to get this staged for
> qemu-next.
> 
> --js

Dave

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
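
As an aside, Dave's preferred alternative above (a plain mig_init in migration/migration.c, one call from vl.c) can be shown in miniature. The aggregate function name migration_modules_init() and the flag variables are invented for this sketch; only the three *_mig_init() names come from the quoted patch:

```c
/* Sketch (not QEMU code): keep explicit per-module init functions,
 * but list them in a single aggregate function owned by
 * migration/migration.c, so vl.c makes exactly one call. The flags
 * stand in for each module's real setup work. */
#include <assert.h>

static int blk_inited, ram_inited, bitmap_inited;

static void blk_mig_init(void)          { blk_inited = 1; }
static void ram_mig_init(void)          { ram_inited = 1; }
static void dirty_bitmap_mig_init(void) { bitmap_inited = 1; }

/* migration/migration.c: the only place that names every sub-module */
static void migration_modules_init(void)
{
    blk_mig_init();
    ram_mig_init();
    dirty_bitmap_mig_init();
}
```

Compared with the constructor trick, this keeps initialization order explicit and greppable, at the cost of touching one central function whenever a migration sub-module is added.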