Re: [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic
On 02/19/2014 10:53 AM, Li Guang wrote:
> Michael R. Hines wrote:
>> On 02/19/2014 09:07 AM, Li Guang wrote:
>>> Hi,
>>>
>>> mrhi...@linux.vnet.ibm.com wrote:
>>>> From: "Michael R. Hines"
>>>>
>>>> This implements the core logic,
>>>> all described in the first patch (docs/mc.txt).
>>>>
>>>> Signed-off-by: Michael R. Hines
>>>> ---
>>>>  migration-checkpoint.c | 1565
>>>>  1 file changed, 1565 insertions(+)
>>>>  create mode 100644 migration-checkpoint.c
>>>>
>>> [big snip] ...
>>>> +
>>>> +/*
>>>> + * Stop the VM, generate the micro checkpoint,
>>>> + * but save the dirty memory into staging memory until
>>>> + * we can re-activate the VM as soon as possible.
>>>> + */
>>>> +static int capture_checkpoint(MCParams *mc, MigrationState *s)
>>>> +{
>>>> +    MCCopyset *copyset;
>>>> +    int idx, ret = 0;
>>>> +    uint64_t start, stop, copies = 0;
>>>> +    int64_t start_time;
>>>> +
>>>> +    mc->total_copies = 0;
>>>> +    qemu_mutex_lock_iothread();
>>>> +    vm_stop_force_state(RUN_STATE_CHECKPOINT_VM);
>>>> +    start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>>> +
>>>> +    /*
>>>> +     * If buffering is enabled, insert a Qdisc plug here
>>>> +     * to hold packets for the *next* MC, (not this one,
>>>> +     * the packets for this one have already been plugged
>>>> +     * and will be released after the MC has been transmitted.
>>>> +     */
>>>> +    mc_start_buffer();
>>>
>>> actually, I have a special request,
>>> if QEMU started without netdev,
>>> then don't bother me by Qdisc for network buffering. :-)
>>>
>>> Thanks!
>>
>> That ability is already available in the patchset.
>> It is called "mc-net-disable" capability. (See the wiki or docs/mc.txt).
>>
>> Did you try it?
>
> I don't mean disable it manually,
> I say even don't start buffering for network when no netdev.
>
> Thanks!

Oh, I see. Got it. I will update the patch =).

- Michael
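For readers following the thread, here is a rough sketch of what the requested behaviour could look like: skip the Qdisc plug automatically when the guest has no netdev, in addition to honouring the existing "mc-net-disable" capability. This is not taken from the patch. The struct, its fields, and both helper names are hypothetical stand-ins, and how QEMU would actually count configured netdevs is left as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state: none of these names come from the patch. */
struct mc_net_state {
    int  nb_netdevs;        /* assumed: number of netdev backends configured */
    bool net_disabled;      /* user enabled the "mc-net-disable" capability */
    bool buffering_active;  /* true while a Qdisc plug would be installed */
};

/* Buffering is only useful if it was not disabled and there is a netdev. */
static bool mc_should_buffer(const struct mc_net_state *st)
{
    assert(st != NULL);
    return !st->net_disabled && st->nb_netdevs > 0;
}

/*
 * Guarded stand-in for an unconditional mc_start_buffer() call:
 * records whether buffering would be started for this checkpoint.
 */
static void mc_start_buffer_if_needed(struct mc_net_state *st)
{
    st->buffering_active = mc_should_buffer(st);
}
```

With a guard like this, a guest started without any netdev would never touch the Qdisc path, and the manual capability would remain available for guests that do have one.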
Re: [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic
Michael R. Hines wrote:
> On 02/19/2014 09:07 AM, Li Guang wrote:
>> Hi,
>>
>> mrhi...@linux.vnet.ibm.com wrote:
>>> From: "Michael R. Hines"
>>>
>>> This implements the core logic,
>>> all described in the first patch (docs/mc.txt).
>>>
>>> Signed-off-by: Michael R. Hines
>>> ---
>>>  migration-checkpoint.c | 1565
>>>  1 file changed, 1565 insertions(+)
>>>  create mode 100644 migration-checkpoint.c
>>>
>> [big snip] ...
>>> +
>>> +/*
>>> + * Stop the VM, generate the micro checkpoint,
>>> + * but save the dirty memory into staging memory until
>>> + * we can re-activate the VM as soon as possible.
>>> + */
>>> +static int capture_checkpoint(MCParams *mc, MigrationState *s)
>>> +{
>>> +    MCCopyset *copyset;
>>> +    int idx, ret = 0;
>>> +    uint64_t start, stop, copies = 0;
>>> +    int64_t start_time;
>>> +
>>> +    mc->total_copies = 0;
>>> +    qemu_mutex_lock_iothread();
>>> +    vm_stop_force_state(RUN_STATE_CHECKPOINT_VM);
>>> +    start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>> +
>>> +    /*
>>> +     * If buffering is enabled, insert a Qdisc plug here
>>> +     * to hold packets for the *next* MC, (not this one,
>>> +     * the packets for this one have already been plugged
>>> +     * and will be released after the MC has been transmitted.
>>> +     */
>>> +    mc_start_buffer();
>>
>> actually, I have a special request,
>> if QEMU started without netdev,
>> then don't bother me by Qdisc for network buffering. :-)
>>
>> Thanks!
>
> That ability is already available in the patchset.
> It is called "mc-net-disable" capability. (See the wiki or docs/mc.txt).
>
> Did you try it?

I don't mean disable it manually,
I say even don't start buffering for network when no netdev.

Thanks!
Re: [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic
On 02/19/2014 09:07 AM, Li Guang wrote:
> Hi,
>
> mrhi...@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines"
>>
>> This implements the core logic,
>> all described in the first patch (docs/mc.txt).
>>
>> Signed-off-by: Michael R. Hines
>> ---
>>  migration-checkpoint.c | 1565
>>  1 file changed, 1565 insertions(+)
>>  create mode 100644 migration-checkpoint.c
>>
> [big snip] ...
>> +
>> +/*
>> + * Stop the VM, generate the micro checkpoint,
>> + * but save the dirty memory into staging memory until
>> + * we can re-activate the VM as soon as possible.
>> + */
>> +static int capture_checkpoint(MCParams *mc, MigrationState *s)
>> +{
>> +    MCCopyset *copyset;
>> +    int idx, ret = 0;
>> +    uint64_t start, stop, copies = 0;
>> +    int64_t start_time;
>> +
>> +    mc->total_copies = 0;
>> +    qemu_mutex_lock_iothread();
>> +    vm_stop_force_state(RUN_STATE_CHECKPOINT_VM);
>> +    start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>> +
>> +    /*
>> +     * If buffering is enabled, insert a Qdisc plug here
>> +     * to hold packets for the *next* MC, (not this one,
>> +     * the packets for this one have already been plugged
>> +     * and will be released after the MC has been transmitted.
>> +     */
>> +    mc_start_buffer();
>
> actually, I have a special request,
> if QEMU started without netdev,
> then don't bother me by Qdisc for network buffering. :-)
>
> Thanks!

That ability is already available in the patchset.
It is called "mc-net-disable" capability. (See the wiki or docs/mc.txt).

Did you try it?

- Michael
Re: [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic
Hi,

mrhi...@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines"
>
> This implements the core logic,
> all described in the first patch (docs/mc.txt).
>
> Signed-off-by: Michael R. Hines
> ---
>  migration-checkpoint.c | 1565
>  1 file changed, 1565 insertions(+)
>  create mode 100644 migration-checkpoint.c
>
[big snip] ...
> +
> +/*
> + * Stop the VM, generate the micro checkpoint,
> + * but save the dirty memory into staging memory until
> + * we can re-activate the VM as soon as possible.
> + */
> +static int capture_checkpoint(MCParams *mc, MigrationState *s)
> +{
> +    MCCopyset *copyset;
> +    int idx, ret = 0;
> +    uint64_t start, stop, copies = 0;
> +    int64_t start_time;
> +
> +    mc->total_copies = 0;
> +    qemu_mutex_lock_iothread();
> +    vm_stop_force_state(RUN_STATE_CHECKPOINT_VM);
> +    start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +
> +    /*
> +     * If buffering is enabled, insert a Qdisc plug here
> +     * to hold packets for the *next* MC, (not this one,
> +     * the packets for this one have already been plugged
> +     * and will be released after the MC has been transmitted.
> +     */
> +    mc_start_buffer();

actually, I have a special request,
if QEMU started without netdev,
then don't bother me by Qdisc for network buffering. :-)

Thanks!

> +
> +    qemu_savevm_state_begin(mc->staging, &s->params);
> +    ret = qemu_file_get_error(s->file);
> +
> +    if (ret < 0) {
> +        migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_ERROR);
> +    }
> +
> +    qemu_savevm_state_complete(mc->staging);
> +
> +    ret = qemu_file_get_error(s->file);
> +    if (ret < 0) {
> +        migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_ERROR);
> +        goto out;
> +    }
> +
> +    /*
> +     * The copied memory gets appended to the end of the snapshot, so let's
> +     * remember where its going to go first and start a new slab.
> +     */
> +    mc_slab_next(mc, mc->curr_slab);
> +    mc->start_copyset = mc->curr_slab->idx;
> +
> +    start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +
> +    /*
> +     * Now perform the actual copy of memory into the tail end of the slab list.
> +     */
> +    QTAILQ_FOREACH(copyset, &mc->copy_head, node) {
> +        if (!copyset->nb_copies) {
> +            break;
> +        }
> +
> +        copies += copyset->nb_copies;
> +
> +        DDDPRINTF("copyset %d copies: %" PRIu64 " total: %" PRIu64 "\n",
> +                  copyset->idx, copyset->nb_copies, copies);
> +
> +        for (idx = 0; idx < copyset->nb_copies; idx++) {
> +            uint8_t *addr;
> +            long size;
> +            mc->copy = &copyset->copies[idx];
> +            addr = (uint8_t *) (mc->copy->host_addr + mc->copy->offset);
> +            size = mc_put_buffer(mc, addr, mc->copy->offset, mc->copy->size);
> +            if (size != mc->copy->size) {
> +                fprintf(stderr, "Failure to initiate copyset %d index %d\n",
> +                        copyset->idx, idx);
> +                migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_ERROR);
> +                vm_start();
> +                goto out;
> +            }
> +
> +            DDDPRINTF("Success copyset %d index %d\n", copyset->idx, idx);
> +        }
> +
> +        copyset->nb_copies = 0;
> +    }
> +
> +    s->ram_copy_time = (qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - start_time);
> +
> +    mc->copy = NULL;
> +    ram_control_before_iterate(mc->file, RAM_CONTROL_FLUSH);
> +    assert(mc->total_copies == copies);
> +
> +    stop = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +
> +    /*
> +     * MC is safe in staging area. Let the VM go.
> +     */
> +    vm_start();
> +    qemu_fflush(mc->staging);
> +
> +    s->downtime = stop - start;
> +out:
> +    qemu_mutex_unlock_iothread();
> +    return ret;
> +}
> +
> +/*
> + * Synchronously send a micro-checkpointing command
> + */
> +static int mc_send(QEMUFile *f, uint64_t request)
> +{
> +    int ret = 0;
> +
> +    qemu_put_be64(f, request);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        fprintf(stderr, "transaction: send error while sending %" PRIu64 ", "
> +                "bailing: %s\n", request, strerror(-ret));
> +    } else {
> +        DDPRINTF("transaction: sent: %s (%" PRIu64 ")\n",
> +                 mc_desc[request], request);
> +    }
> +
> +    qemu_fflush(f);
> +
> +    return ret;
> +}
> +
> +/*
> + * Synchronously receive a micro-checkpointing command
> + */
> +static int mc_recv(QEMUFile *f, uint64_t request, uint64_t *action)
> +{
> +    int ret = 0;
> +    uint64_t got;
> +
> +    got = qemu_get_be64(f);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        fprintf(stderr, "transaction: recv error while expecting %s (%"
> +                PRIu64 "), bailing: %s\n", mc_desc[request],
> +                request, strerror(-ret));
> +    } else {
> +        if ((request != MC_TRANSACTION_ANY) && request != got) {
> +            fprintf(stderr, "transaction: was expecting %s (%" PRIu64
> +                    ") but got %" PRIu64 " instead\n",
> +                    mc_desc[request], request, got);
> +            ret = -EINVAL;
> +        } else {
> +            DDPRINTF("transaction: recv: %s (%" PRIu64 ")\n",
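The quoted message is cut off inside mc_recv(). As a reading aid, the expected-versus-received check that the visible part of mc_recv() performs can be sketched in isolation as below. This is only a sketch under stated assumptions: the numeric value of MC_TRANSACTION_ANY and the helper name mc_check_reply() are invented here, and storing the received command in *action on success is a guess about the truncated remainder of the function, not something the thread shows.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value: the patch's real definition is not shown in this thread. */
#define MC_TRANSACTION_ANY 0

/*
 * Hypothetical helper: compare the command we expected with the one that
 * arrived. MC_TRANSACTION_ANY acts as a wildcard that accepts any command.
 * Returns 0 on a match and -EINVAL on a mismatch.
 */
static int mc_check_reply(uint64_t expected, uint64_t got, uint64_t *action)
{
    assert(action != NULL);

    if (expected != MC_TRANSACTION_ANY && expected != got) {
        return -EINVAL;     /* caller asked for one command, got another */
    }

    *action = got;          /* assumption: report what was received */
    return 0;
}
```

Keeping the comparison separate from the QEMUFile I/O makes the wildcard behaviour easy to exercise without a migration stream.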
[Qemu-devel] [RFC PATCH v2 08/12] mc: core logic
From: "Michael R. Hines"

This implements the core logic,
all described in the first patch (docs/mc.txt).

Signed-off-by: Michael R. Hines
---
 migration-checkpoint.c | 1565
 1 file changed, 1565 insertions(+)
 create mode 100644 migration-checkpoint.c

diff --git a/migration-checkpoint.c b/migration-checkpoint.c
new file mode 100644
index 000..a69edb2
--- /dev/null
+++ b/migration-checkpoint.c
@@ -0,0 +1,1565 @@
+/*
+ *  Micro-Checkpointing (MC) support
+ *  (a.k.a. Fault Tolerance or Continuous Replication)
+ *
+ *  Copyright IBM, Corp. 2014
+ *
+ *  Authors:
+ *   Michael R. Hines
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2 or
+ *  later.  See the COPYING file in the top-level directory.
+ *
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include "qemu-common.h"
+#include "hw/virtio/virtio.h"
+#include "hw/virtio/virtio-net.h"
+#include "qemu/sockets.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "qmp-commands.h"
+#include "net/tap-linux.h"
+#include
+
+#define DEBUG_MC
+//#define DEBUG_MC_VERBOSE
+//#define DEBUG_MC_REALLY_VERBOSE
+
+#ifdef DEBUG_MC
+#define DPRINTF(fmt, ...) \
+    do { printf("mc: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+#ifdef DEBUG_MC_VERBOSE
+#define DDPRINTF(fmt, ...) \
+    do { printf("mc: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DDPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+#ifdef DEBUG_MC_REALLY_VERBOSE
+#define DDDPRINTF(fmt, ...) \
+    do { printf("mc: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DDDPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+/*
+ * Micro checkpoints (MC)s are typically only a few MB when idle.
+ * However, they can easily be very large during heavy workloads.
+ * In the *extreme* worst-case, QEMU might need double the amount of main memory
+ * than that of what was originally allocated to the virtual machine.
+ *
+ * To support this variability during transient periods, a MC
+ * consists of a linked list of slabs, each of identical size. A better name
+ * would be welcome, as the name was only chosen because it resembles linux
+ * memory allocation. Because MCs occur several times per second
+ * (a frequency of 10s of milliseconds), slabs allow MCs to grow and shrink
+ * without constantly re-allocating all memory in place during each checkpoint.
+ *
+ * During steady-state, the 'head' slab is permanently allocated and never goes
+ * away, so when the VM is idle, there is no memory allocation at all.
+ * This design supports the use of RDMA. Since RDMA requires memory pinning, we
+ * must be able to hold on to a slab for a reasonable amount of time to get any
+ * real use out of it.
+ *
+ * Regardless, the current strategy taken is:
+ *
+ * 1. If the checkpoint size increases,
+ *    then grow the number of slabs to support it,
+ *    (if and only if RDMA is activated, these slabs will be pinned.)
+ * 2. If the next checkpoint size is smaller than the last one,
+      then that's a "strike".
+ * 3. After N strikes, cut the size of the slab cache in half
+ *    (to a minimum of 1 slab as described before).
+ *
+ * As of this writing, a typical average size of
+ * an Idle-VM checkpoint is under 5MB.
+ */
+
+#define MC_SLAB_BUFFER_SIZE (5UL * 1024UL * 1024UL) /* empirical */
+#define MC_DEV_NAME_MAX_SIZE 256
+
+#define MC_DEFAULT_CHECKPOINT_FREQ_MS 100 /* too slow, but best for now */
+#define CALC_MAX_STRIKES() \
+    do { max_strikes = (max_strikes_delay_secs * 1000) / freq_ms; } \
+    while (0)
+
+/*
+ * How many "seconds-worth" of checkpoints to wait before re-evaluating the size
+ * of the slab list?
+ *
+ * #strikes_until_shrink_cache = Function(#checkpoints/sec)
+ *
+ * Increasing the number of seconds also increases the number of strikes needed
+ * to be reached until it is time to cut the cache in half.
+ *
+ * Below value is open for debate - we just want it to be small enough to ensure
+ * that a large, idle slab list doesn't stay too large for too long.
+ */
+#define MC_DEFAULT_SLAB_MAX_CHECK_DELAY_SECS 10
+
+/*
+ * MC serializes the actual RAM page contents in such a way that the actual
+ * pages are separated from the meta-data (all the QEMUFile stuff).
+ *
+ * This is done strictly for the purposes of being able to use RDMA
+ * and to replace memcpy() on the local machine for hardware with very
+ * fast RAM memory speeds.
+ *
+ * This serialization requires recording the page descriptions and then
+ * pushing them into slabs after the checkpoint has been captured
+ * (minus the page data).
+ *
+ * The memory holding the page descriptions are allocated in unison with the
+ * slabs themselves, and thus we need to know in advance the maximum number of
+ * page descriptions that can fit into a slab before allocating the slab.
+ * It should be safe to