Chegu Vinod <chegu_vi...@hp.com> writes:

> On 5/10/2013 6:07 AM, Anthony Liguori wrote:
>> Chegu Vinod <chegu_vi...@hp.com> writes:
>>
>>> If a user chooses to turn on the auto-converge migration capability
>>> these changes detect the lack of convergence and throttle down the
>>> guest. i.e. force the VCPUs out of the guest for some duration
>>> and let the migration thread catchup and help converge.
>>>
>>> Verified the convergence using the following :
>>> - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
>>> - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
>>>
>>> Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
>>> migrate downtime set to 4seconds).
>> Would it make sense to separate out the "slow the VCPU down" part of
>> this?
>>
>> That would give a management tool more flexibility to create policies
>> around slowing the VCPU down to encourage migration.
>
> I believe one can always enhance libvirt tools to monitor the migration
> statistics and control the shares/entitlements of the vcpus via
> cgroups..thereby slowing the guest down to allow for convergence (I had
> that listed in my earlier versions of the patches as an option and also
> noted that it requires external (i.e. tool driven) monitoring and
> triggers...and that this alternative was kind of automatic after the
> initial setting of the capability).
>
> Is that what you meant by your comment above (or) are you talking about
> something outside the scope of cgroups and from an implementation point
> of view also outside the migration code path...i.e. a new knob that an
> external tool can use to just throttle down the vcpus of a guest ?

I'm saying, a knob to throttle the guest vcpus within QEMU that could be
used by management tools to encourage convergence.

For instance, consider an imaginary "vcpu_throttle" command that took a
number between 0 and 1 that throttled VCPU performance accordingly.

Then migration would look like:

0) throttle = 1.0
1) call migrate command to start migration
2) query progress until you decide you aren't converging
3) throttle *= 0.75; call vcpu_throttle $throttle
4) goto (2)

Now I'm not opposed to a series like this that adds this sort of policy
to QEMU itself too but I want to make sure the pieces are exposed for a
management tool to implement its own policies too.

Regards,

Anthony Liguori

>
> Thanks
> Vinod
>
>
>
>>
>> In fact, I wonder if we need anything in the migration path if we just
>> expose the "slow the VCPU down" bit as a feature.
>>
>> Slow the VCPU down is not quite the same as setting priority of the VCPU
>> thread largely because of the QBL so I recognize the need to have
>> something for this in QEMU.
>>
>> Regards,
>>
>> Anthony Liguori
>>
>>> (qemu) info migrate
>>> capabilities: xbzrle: off auto-converge: off  <----
>>> Migration status: active
>>> total time: 1487503 milliseconds
>>> expected downtime: 519 milliseconds
>>> transferred ram: 383749347 kbytes
>>> remaining ram: 2753372 kbytes
>>> total ram: 268444224 kbytes
>>> duplicate: 65461532 pages
>>> skipped: 64901568 pages
>>> normal: 95750218 pages
>>> normal bytes: 383000872 kbytes
>>> dirty pages rate: 67551 pages
>>>
>>> ---
>>>
>>> (qemu) info migrate
>>> capabilities: xbzrle: off auto-converge: on   <----
>>> Migration status: completed
>>> total time: 241161 milliseconds
>>> downtime: 6373 milliseconds
>>> transferred ram: 28235307 kbytes
>>> remaining ram: 0 kbytes
>>> total ram: 268444224 kbytes
>>> duplicate: 64946416 pages
>>> skipped: 64903523 pages
>>> normal: 7044971 pages
>>> normal bytes: 28179884 kbytes
>>>
>>> Signed-off-by: Chegu Vinod <chegu_vi...@hp.com>
>>> ---
>>>  arch_init.c                   |   68 +++++++++++++++++++++++++++++++++++++++++
>>>  include/migration/migration.h |    4 ++
>>>  migration.c                   |    1 +
>>>  3 files changed, 73 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/arch_init.c b/arch_init.c
>>> index 49c5dc2..29788d6 100644
>>> --- a/arch_init.c
>>> +++ b/arch_init.c
>>> @@ -49,6 +49,7 @@
>>>  #include "trace.h"
>>>  #include "exec/cpu-all.h"
>>>  #include "hw/acpi/acpi.h"
>>> +#include "sysemu/cpus.h"
>>>
>>>  #ifdef DEBUG_ARCH_INIT
>>>  #define DPRINTF(fmt, ...) \
>>> @@ -104,6 +105,8 @@ int graphic_depth = 15;
>>>  #endif
>>>
>>>  const uint32_t arch_type = QEMU_ARCH;
>>> +static bool mig_throttle_on;
>>> +
>>>
>>>  /***********************************************************/
>>>  /* ram save/restore */
>>> @@ -378,8 +381,15 @@ static void migration_bitmap_sync(void)
>>>      uint64_t num_dirty_pages_init = migration_dirty_pages;
>>>      MigrationState *s = migrate_get_current();
>>>      static int64_t start_time;
>>> +    static int64_t bytes_xfer_prev;
>>>      static int64_t num_dirty_pages_period;
>>>      int64_t end_time;
>>> +    int64_t bytes_xfer_now;
>>> +    static int dirty_rate_high_cnt;
>>> +
>>> +    if (!bytes_xfer_prev) {
>>> +        bytes_xfer_prev = ram_bytes_transferred();
>>> +    }
>>>
>>>      if (!start_time) {
>>>          start_time = qemu_get_clock_ms(rt_clock);
>>> @@ -404,6 +414,23 @@ static void migration_bitmap_sync(void)
>>>
>>>      /* more than 1 second = 1000 millisecons */
>>>      if (end_time > start_time + 1000) {
>>> +        if (migrate_auto_converge()) {
>>> +            /* The following detection logic can be refined later. For now:
>>> +               Check to see if the dirtied bytes is 50% more than the approx.
>>> +               amount of bytes that just got transferred since the last time we
>>> +               were in this routine. If that happens N times (for now N==5)
>>> +               we turn on the throttle down logic */
>>> +            bytes_xfer_now = ram_bytes_transferred();
>>> +            if (s->dirty_pages_rate &&
>>> +                ((num_dirty_pages_period*TARGET_PAGE_SIZE) >
>>> +                ((bytes_xfer_now - bytes_xfer_prev)/2))) {
>>> +                if (dirty_rate_high_cnt++ > 5) {
>>> +                    DPRINTF("Unable to converge. Throtting down guest\n");
>>> +                    mig_throttle_on = true;
>>> +                }
>>> +            }
>>> +            bytes_xfer_prev = bytes_xfer_now;
>>> +        }
>>>          s->dirty_pages_rate = num_dirty_pages_period * 1000
>>>              / (end_time - start_time);
>>>          s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
>>> @@ -496,6 +523,15 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>>>      return bytes_sent;
>>>  }
>>>
>>> +bool throttling_needed(void)
>>> +{
>>> +    if (!migrate_auto_converge()) {
>>> +        return false;
>>> +    }
>>> +
>>> +    return mig_throttle_on;
>>> +}
>>> +
>>>  static uint64_t bytes_transferred;
>>>
>>>  static ram_addr_t ram_save_remaining(void)
>>> @@ -1098,3 +1134,35 @@ TargetInfo *qmp_query_target(Error **errp)
>>>
>>>      return info;
>>>  }
>>> +
>>> +static void mig_delay_vcpu(void)
>>> +{
>>> +    qemu_mutex_unlock_iothread();
>>> +    g_usleep(50*1000);
>>> +    qemu_mutex_lock_iothread();
>>> +}
>>> +
>>> +/* Stub used for getting the vcpu out of VM and into qemu via
>>> +   run_on_cpu()*/
>>> +static void mig_kick_cpu(void *opq)
>>> +{
>>> +    mig_delay_vcpu();
>>> +    return;
>>> +}
>>> +
>>> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
>>> +   much time in the VM. The migration thread will try to catchup.
>>> +   Workload will experience a performance drop.
>>> +*/
>>> +void migration_throttle_down(void)
>>> +{
>>> +    if (throttling_needed()) {
>>> +        CPUArchState *penv = first_cpu;
>>> +        while (penv) {
>>> +            qemu_mutex_lock_iothread();
>>> +            async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
>>> +            qemu_mutex_unlock_iothread();
>>> +            penv = penv->next_cpu;
>>> +        }
>>> +    }
>>> +}
>>> diff --git a/include/migration/migration.h b/include/migration/migration.h
>>> index ace91b0..68b65c6 100644
>>> --- a/include/migration/migration.h
>>> +++ b/include/migration/migration.h
>>> @@ -129,4 +129,8 @@ int64_t migrate_xbzrle_cache_size(void);
>>>  int64_t xbzrle_cache_resize(int64_t new_size);
>>>
>>>  bool migrate_auto_converge(void);
>>> +bool throttling_needed(void);
>>> +void stop_throttling(void);
>>> +void migration_throttle_down(void);
>>> +
>>>  #endif
>>> diff --git a/migration.c b/migration.c
>>> index 570cee5..d3673a6 100644
>>> --- a/migration.c
>>> +++ b/migration.c
>>> @@ -526,6 +526,7 @@ static void *migration_thread(void *opaque)
>>>              DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
>>>              if (pending_size && pending_size >= max_size) {
>>>                  qemu_savevm_state_iterate(s->file);
>>> +                migration_throttle_down();
>>>              } else {
>>>                  DPRINTF("done iterating\n");
>>>                  qemu_mutex_lock_iothread();
>>> --
>>> 1.7.1
>> .
>>
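
For illustration only, the management-tool policy outlined in the reply (steps 0 to 4) might be sketched in C roughly as below. Everything here is an assumption made for the sketch: "vcpu_throttle" is the imaginary command from the reply, and struct throttle_policy, policy_step(), not_converging() and the min_throttle floor are hypothetical names, not existing QEMU or libvirt interfaces. The convergence test simply reuses the comparison from the quoted patch (bytes dirtied in a period exceeding half of the bytes transferred in that period); a real tool could use any other criterion.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-migration state kept by a management tool. */
struct throttle_policy {
    double throttle;     /* 1.0 = unthrottled, smaller = slower guest */
    double factor;       /* multiplier applied per step, e.g. 0.75 */
    double min_throttle; /* assumed floor so the guest is never fully stopped */
};

/* Assumed convergence test, modeled on the comparison in the quoted patch:
 * not converging when the bytes dirtied in a sampling period exceed half
 * of the bytes transferred in that same period. */
static bool not_converging(uint64_t dirtied_bytes, uint64_t transferred_bytes)
{
    return dirtied_bytes > transferred_bytes / 2;
}

/* One pass through steps 2 and 3. Returns true when the throttle value
 * changed, i.e. when the caller should issue the (imaginary)
 * "vcpu_throttle" command with p->throttle. Step 4 is the caller's loop. */
static bool policy_step(struct throttle_policy *p,
                        uint64_t dirtied_bytes, uint64_t transferred_bytes)
{
    if (!not_converging(dirtied_bytes, transferred_bytes)) {
        return false;
    }

    double next = p->throttle * p->factor;
    if (next < p->min_throttle) {
        next = p->min_throttle;
    }
    if (next == p->throttle) {
        return false; /* already at the floor, nothing new to send */
    }

    p->throttle = next;
    return true;
}
```

A tool would initialize the state with throttle = 1.0 (step 0), start the migration (step 1), and then call policy_step() with each progress sample it queries, sending the new throttle value whenever the function returns true.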