Re: [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic

2014-02-18 Thread Michael R. Hines

On 02/19/2014 10:53 AM, Li Guang wrote:

Michael R. Hines wrote:

On 02/19/2014 09:07 AM, Li Guang wrote:

Hi,
mrhi...@linux.vnet.ibm.com wrote:

From: "Michael R. Hines"

This implements the core logic,
all described in the first patch (docs/mc.txt).

Signed-off-by: Michael R. Hines
---
  migration-checkpoint.c | 1565
  1 file changed, 1565 insertions(+)
  create mode 100644 migration-checkpoint.c



[big snip] ...


+
+/*
+ * Stop the VM and generate the micro checkpoint,
+ * but save the dirty memory into staging memory so that
+ * we can re-activate the VM as soon as possible.
+ */
+static int capture_checkpoint(MCParams *mc, MigrationState *s)
+{
+MCCopyset *copyset;
+int idx, ret = 0;
+uint64_t start, stop, copies = 0;
+int64_t start_time;
+
+mc->total_copies = 0;
+qemu_mutex_lock_iothread();
+vm_stop_force_state(RUN_STATE_CHECKPOINT_VM);
+start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+/*
+ * If buffering is enabled, insert a Qdisc plug here
+ * to hold packets for the *next* MC (not this one:
+ * the packets for this one have already been plugged
+ * and will be released after the MC has been transmitted).
+ */
+mc_start_buffer();


Actually, I have a special request:
if QEMU is started without a netdev,
then don't bother me with Qdisc for network buffering. :-)

Thanks!



That ability is already available in the patchset.
It is called the "mc-net-disable" capability (see the wiki or docs/mc.txt).

Did you try it?



I don't mean disabling it manually. I mean: don't even start buffering
for the network when there is no netdev.

Thanks!




Oh, I see. Got it. I will update the patch =).

- Michael




Re: [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic

2014-02-18 Thread Li Guang

Michael R. Hines wrote:

On 02/19/2014 09:07 AM, Li Guang wrote:

Hi,
mrhi...@linux.vnet.ibm.com wrote:

From: "Michael R. Hines"

This implements the core logic,
all described in the first patch (docs/mc.txt).

Signed-off-by: Michael R. Hines
---
  migration-checkpoint.c | 1565
  1 file changed, 1565 insertions(+)
  create mode 100644 migration-checkpoint.c



[big snip] ...


+
+/*
+ * Stop the VM and generate the micro checkpoint,
+ * but save the dirty memory into staging memory so that
+ * we can re-activate the VM as soon as possible.
+ */
+static int capture_checkpoint(MCParams *mc, MigrationState *s)
+{
+MCCopyset *copyset;
+int idx, ret = 0;
+uint64_t start, stop, copies = 0;
+int64_t start_time;
+
+mc->total_copies = 0;
+qemu_mutex_lock_iothread();
+vm_stop_force_state(RUN_STATE_CHECKPOINT_VM);
+start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+/*
+ * If buffering is enabled, insert a Qdisc plug here
+ * to hold packets for the *next* MC (not this one:
+ * the packets for this one have already been plugged
+ * and will be released after the MC has been transmitted).
+ */
+mc_start_buffer();


Actually, I have a special request:
if QEMU is started without a netdev,
then don't bother me with Qdisc for network buffering. :-)

Thanks!



That ability is already available in the patchset.
It is called the "mc-net-disable" capability (see the wiki or docs/mc.txt).

Did you try it?



I don't mean disabling it manually. I mean: don't even start buffering
for the network when there is no netdev.

Thanks!




Re: [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic

2014-02-18 Thread Michael R. Hines

On 02/19/2014 09:07 AM, Li Guang wrote:

Hi,
mrhi...@linux.vnet.ibm.com wrote:

From: "Michael R. Hines"

This implements the core logic,
all described in the first patch (docs/mc.txt).

Signed-off-by: Michael R. Hines
---
  migration-checkpoint.c | 1565
  1 file changed, 1565 insertions(+)
  create mode 100644 migration-checkpoint.c



[big snip] ...


+
+/*
+ * Stop the VM and generate the micro checkpoint,
+ * but save the dirty memory into staging memory so that
+ * we can re-activate the VM as soon as possible.
+ */
+static int capture_checkpoint(MCParams *mc, MigrationState *s)
+{
+MCCopyset *copyset;
+int idx, ret = 0;
+uint64_t start, stop, copies = 0;
+int64_t start_time;
+
+mc->total_copies = 0;
+qemu_mutex_lock_iothread();
+vm_stop_force_state(RUN_STATE_CHECKPOINT_VM);
+start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+/*
+ * If buffering is enabled, insert a Qdisc plug here
+ * to hold packets for the *next* MC (not this one:
+ * the packets for this one have already been plugged
+ * and will be released after the MC has been transmitted).
+ */
+mc_start_buffer();


Actually, I have a special request:
if QEMU is started without a netdev,
then don't bother me with Qdisc for network buffering. :-)

Thanks!



That ability is already available in the patchset.
It is called the "mc-net-disable" capability (see the wiki or docs/mc.txt).

Did you try it?

- Michael




Re: [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic

2014-02-18 Thread Li Guang

Hi,
mrhi...@linux.vnet.ibm.com wrote:

From: "Michael R. Hines"

This implements the core logic,
all described in the first patch (docs/mc.txt).

Signed-off-by: Michael R. Hines
---
  migration-checkpoint.c | 1565 
  1 file changed, 1565 insertions(+)
  create mode 100644 migration-checkpoint.c


   

[big snip] ...


+
+/*
+ * Stop the VM and generate the micro checkpoint,
+ * but save the dirty memory into staging memory so that
+ * we can re-activate the VM as soon as possible.
+ */
+static int capture_checkpoint(MCParams *mc, MigrationState *s)
+{
+MCCopyset *copyset;
+int idx, ret = 0;
+uint64_t start, stop, copies = 0;
+int64_t start_time;
+
+mc->total_copies = 0;
+qemu_mutex_lock_iothread();
+vm_stop_force_state(RUN_STATE_CHECKPOINT_VM);
+start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+/*
+ * If buffering is enabled, insert a Qdisc plug here
+ * to hold packets for the *next* MC (not this one:
+ * the packets for this one have already been plugged
+ * and will be released after the MC has been transmitted).
+ */
+mc_start_buffer();
   


Actually, I have a special request:
if QEMU is started without a netdev,
then don't bother me with Qdisc for network buffering. :-)

Thanks!


+
+qemu_savevm_state_begin(mc->staging, &s->params);
+ret = qemu_file_get_error(s->file);
+
+if (ret < 0) {
+migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_ERROR);
+}
+
+qemu_savevm_state_complete(mc->staging);
+
+ret = qemu_file_get_error(s->file);
+if (ret < 0) {
+migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_ERROR);
+goto out;
+}
+
+/*
+ * The copied memory gets appended to the end of the snapshot, so let's
+ * remember where it's going to go first and start a new slab.
+ */
+mc_slab_next(mc, mc->curr_slab);
+mc->start_copyset = mc->curr_slab->idx;
+
+start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+/*
+ * Now perform the actual copy of memory into the tail end of the slab list.
+ */
+QTAILQ_FOREACH(copyset, &mc->copy_head, node) {
+if (!copyset->nb_copies) {
+break;
+}
+
+copies += copyset->nb_copies;
+
+DDDPRINTF("copyset %d copies: %" PRIu64 " total: %" PRIu64 "\n",
+copyset->idx, copyset->nb_copies, copies);
+
+for (idx = 0; idx < copyset->nb_copies; idx++) {
+uint8_t *addr;
+long size;
+mc->copy = &copyset->copies[idx];
+addr = (uint8_t *) (mc->copy->host_addr + mc->copy->offset);
+size = mc_put_buffer(mc, addr, mc->copy->offset, mc->copy->size);
+if (size != mc->copy->size) {
+fprintf(stderr, "Failure to initiate copyset %d index %d\n",
+copyset->idx, idx);
+migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_ERROR);
+vm_start();
+goto out;
+}
+
+DDDPRINTF("Success copyset %d index %d\n", copyset->idx, idx);
+}
+
+copyset->nb_copies = 0;
+}
+
+s->ram_copy_time = (qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - start_time);
+
+mc->copy = NULL;
+ram_control_before_iterate(mc->file, RAM_CONTROL_FLUSH);
+assert(mc->total_copies == copies);
+
+stop = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+/*
+ * MC is safe in staging area. Let the VM go.
+ */
+vm_start();
+qemu_fflush(mc->staging);
+
+s->downtime = stop - start;
+out:
+qemu_mutex_unlock_iothread();
+return ret;
+}
+
+/*
+ * Synchronously send a micro-checkpointing command
+ */
+static int mc_send(QEMUFile *f, uint64_t request)
+{
+int ret = 0;
+
+qemu_put_be64(f, request);
+
+ret = qemu_file_get_error(f);
+if (ret) {
+fprintf(stderr, "transaction: send error while sending %" PRIu64 ", "
+"bailing: %s\n", request, strerror(-ret));
+} else {
+DDPRINTF("transaction: sent: %s (%" PRIu64 ")\n",
+mc_desc[request], request);
+}
+
+qemu_fflush(f);
+
+return ret;
+}
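
As a side note on the command exchange: mc_send() writes each command with
qemu_put_be64(), i.e. as eight bytes in big-endian order, and mc_recv() then
accepts the value only if it matches the expected command or if the caller
expected MC_TRANSACTION_ANY. The standalone sketch below mirrors that idea
without any QEMU dependencies. All ex_* names and the value chosen for
EX_TRANSACTION_ANY are assumptions made for illustration; the real constant's
value is not shown in the quoted part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for MC_TRANSACTION_ANY (real value not shown). */
#define EX_TRANSACTION_ANY UINT64_MAX

/* Store a 64-bit command most-significant byte first (big-endian). */
static void ex_put_be64(uint8_t out[8], uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        out[i] = (uint8_t)(v >> (56 - 8 * i));
    }
}

/* Rebuild the 64-bit command from eight big-endian bytes. */
static uint64_t ex_get_be64(const uint8_t in[8])
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        v = (v << 8) | in[i];
    }
    return v;
}

/*
 * Same shape as the check in mc_recv(): anything is accepted when the
 * caller expects "any", otherwise the command must match exactly.
 * Returns 0 on success and -EINVAL on a mismatch.
 */
static int ex_check_command(uint64_t expected, uint64_t got)
{
    if (expected != EX_TRANSACTION_ANY && expected != got) {
        return -EINVAL;
    }
    return 0;
}
```

Because the byte order is fixed on the wire, the two sides of the connection
can run on hosts of different endianness and still agree on the command.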
+
+/*
+ * Synchronously receive a micro-checkpointing command
+ */
+static int mc_recv(QEMUFile *f, uint64_t request, uint64_t *action)
+{
+int ret = 0;
+uint64_t got;
+
+got = qemu_get_be64(f);
+
+ret = qemu_file_get_error(f);
+if (ret) {
+fprintf(stderr, "transaction: recv error while expecting %s (%"
+PRIu64 "), bailing: %s\n", mc_desc[request],
+request, strerror(-ret));
+} else {
+if ((request != MC_TRANSACTION_ANY) && request != got) {
+fprintf(stderr, "transaction: was expecting %s (%" PRIu64
+") but got %" PRIu64 " instead\n",
+mc_desc[request], request, got);
+ret = -EINVAL;
+} else {
+DDPRINTF("transaction: recv: %s (%" PRIu64 ")\n",
+ 

[Qemu-devel] [RFC PATCH v2 08/12] mc: core logic

2014-02-18 Thread mrhines
From: "Michael R. Hines" 

This implements the core logic,
all described in the first patch (docs/mc.txt).

Signed-off-by: Michael R. Hines 
---
 migration-checkpoint.c | 1565 
 1 file changed, 1565 insertions(+)
 create mode 100644 migration-checkpoint.c

diff --git a/migration-checkpoint.c b/migration-checkpoint.c
new file mode 100644
index 000..a69edb2
--- /dev/null
+++ b/migration-checkpoint.c
@@ -0,0 +1,1565 @@
+/*
+ *  Micro-Checkpointing (MC) support 
+ *  (a.k.a. Fault Tolerance or Continuous Replication)
+ *
+ *  Copyright IBM, Corp. 2014
+ *
+ *  Authors:
+ *   Michael R. Hines 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ *
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "qemu-common.h"
+#include "hw/virtio/virtio.h"
+#include "hw/virtio/virtio-net.h"
+#include "qemu/sockets.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "qmp-commands.h"
+#include "net/tap-linux.h"
+#include 
+
+#define DEBUG_MC
+//#define DEBUG_MC_VERBOSE
+//#define DEBUG_MC_REALLY_VERBOSE
+
+#ifdef DEBUG_MC
+#define DPRINTF(fmt, ...) \
+do { printf("mc: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+do { } while (0)
+#endif
+
+#ifdef DEBUG_MC_VERBOSE
+#define DDPRINTF(fmt, ...) \
+do { printf("mc: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DDPRINTF(fmt, ...) \
+do { } while (0)
+#endif
+
+#ifdef DEBUG_MC_REALLY_VERBOSE
+#define DDDPRINTF(fmt, ...) \
+do { printf("mc: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DDDPRINTF(fmt, ...) \
+do { } while (0)
+#endif
+
+/*
+ * Micro checkpoints (MCs) are typically only a few MB when idle.
+ * However, they can easily be very large during heavy workloads.
+ * In the *extreme* worst case, QEMU might need double the amount of main
+ * memory that was originally allocated to the virtual machine.
+ *
+ * To support this variability during transient periods, an MC
+ * consists of a linked list of slabs, each of identical size. A better name
+ * would be welcome, as the name was only chosen because it resembles Linux
+ * memory allocation. Because MCs occur several times per second
+ * (a period of tens of milliseconds), slabs allow MCs to grow and shrink
+ * without constantly re-allocating all memory in place during each checkpoint.
+ *
+ * During steady-state, the 'head' slab is permanently allocated and never goes
+ * away, so when the VM is idle, there is no memory allocation at all.
+ * This design supports the use of RDMA. Since RDMA requires memory pinning, we
+ * must be able to hold on to a slab for a reasonable amount of time to get any
+ * real use out of it.
+ *
+ * Regardless, the current strategy taken is:
+ * 
+ * 1. If the checkpoint size increases,
+ *    then grow the number of slabs to support it
+ *    (if and only if RDMA is activated, these slabs will be pinned).
+ * 2. If the next checkpoint size is smaller than the last one,
+ *    then that's a "strike".
+ * 3. After N strikes, cut the size of the slab cache in half
+ *    (to a minimum of 1 slab as described before).
+ *
+ * As of this writing, the typical size of
+ * an idle-VM checkpoint is under 5MB.
+ */
+
+#define MC_SLAB_BUFFER_SIZE (5UL * 1024UL * 1024UL) /* empirical */
+#define MC_DEV_NAME_MAX_SIZE 256
+
+#define MC_DEFAULT_CHECKPOINT_FREQ_MS 100 /* too slow, but best for now */
+#define CALC_MAX_STRIKES()   \
+do {  max_strikes = (max_strikes_delay_secs * 1000) / freq_ms; } \
+while (0)
+
+/*
+ * How many "seconds-worth" of checkpoints to wait before re-evaluating the
+ * size of the slab list?
+ *
+ * #strikes_until_shrink_cache = Function(#checkpoints/sec)
+ *
+ * Increasing the number of seconds also increases the number of strikes needed
+ * to be reached until it is time to cut the cache in half.
+ *
+ * The value below is open for debate - we just want it to be small enough to
+ * ensure that a large, idle slab list doesn't stay too large for too long.
+ */
+#define MC_DEFAULT_SLAB_MAX_CHECK_DELAY_SECS 10
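
As a worked example of the arithmetic above: with the defaults in this patch
(a checkpoint every 100 ms and a delay of 10 seconds), CALC_MAX_STRIKES()
gives (10 * 1000) / 100 = 100 strikes before the slab cache is cut in half.
The sketch below, which is not part of the patch, restates that calculation
and the halving rule as plain functions. The ex_* names are hypothetical, and
resetting the strike counter after a shrink is an assumption, since that part
of the code is not shown here.

```c
#include <assert.h>

/* Same formula as CALC_MAX_STRIKES(), written as a function. */
static unsigned ex_max_strikes(unsigned delay_secs, unsigned freq_ms)
{
    assert(freq_ms > 0);
    return (delay_secs * 1000u) / freq_ms;
}

/* Cut the slab cache in half, but never go below one slab. */
static unsigned ex_halve_slabs(unsigned nb_slabs)
{
    unsigned half = nb_slabs / 2;

    return half < 1 ? 1 : half;
}

/*
 * Record one checkpoint and return the new slab count.
 * A checkpoint smaller than the previous one counts as a strike; once
 * max_strikes is reached the cache is halved and (by assumption) the
 * strike counter starts again from zero.
 */
static unsigned ex_record_checkpoint(unsigned nb_slabs, unsigned *strikes,
                                     unsigned long prev_size,
                                     unsigned long size,
                                     unsigned max_strikes)
{
    if (size < prev_size) {
        (*strikes)++;
    }
    if (*strikes >= max_strikes) {
        *strikes = 0;
        return ex_halve_slabs(nb_slabs);
    }
    return nb_slabs;
}
```

Raising the delay from 10 to 20 seconds would double the threshold to 200
strikes at the same checkpoint frequency, so an idle slab list would be kept
around for longer before shrinking.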
+
+/* 
+ * MC serializes the actual RAM page contents in such a way that the actual
+ * pages are separated from the meta-data (all the QEMUFile stuff).
+ *
+ * This is done strictly for the purposes of being able to use RDMA
+ * and to replace memcpy() on the local machine for hardware with very
+ * fast RAM memory speeds.
+ * 
+ * This serialization requires recording the page descriptions and then
+ * pushing them into slabs after the checkpoint has been captured
+ * (minus the page data).
+ *
+ * The memory holding the page descriptions is allocated in unison with the
+ * slabs themselves, and thus we need to know in advance the maximum number of
+ * page descriptions that can fit into a slab before allocating the slab.
+ * It should be safe to