Chegu Vinod <chegu_vi...@hp.com> wrote:
> Busy enterprise workloads hosted on large sized VM's tend to dirty
> memory faster than the transfer rate achieved via live guest migration.
> Despite some good recent improvements (& using dedicated 10Gig NICs
> between hosts) the live migration does NOT converge.
>
> A few options that were discussed/being-pursued to help with
> the convergence issue include:
>
> 1) Slow down the guest considerably via cgroup's CPU controls - requires
>    libvirt client support to detect & trigger the action, but conceptually
>    similar to this RFC change.
>
> 2) Speed up the transfer rate:
>    - RDMA based pre-copy - lower overhead and fast (unfortunately it
>      has a few restrictions and some customers still choose not
>      to deploy RDMA :-( ).
>    - Add parallelism to improve the transfer rate and use multiple 10Gig
>      connections (bonded) - could add some overhead on the host.
>
> 3) Post-copy (preferably with RDMA) or a pre+post copy hybrid - sounds
>    promising but we need to consider & handle the new failure scenarios.
>
> If an enterprise user chooses to force convergence of their migration
> via the new capability "auto-converge", then with this change we
> auto-detect the lack-of-convergence scenario and trigger a slowdown of
> the workload by explicitly disallowing the VCPUs from spending much time
> in the VM context.
>
> The migration thread tries to catch up and this eventually leads
> to convergence in some "deterministic" amount of time. Yes, it does
> impact the performance of all the VCPUs, but in my observation that
> lasts only for a short duration of time, i.e. we end up entering
> stage 3 (the downtime phase) soon after that.
>
> No external trigger is required (unlike option 1) and it can co-exist
> with the enhancements being pursued as part of option 2 (e.g. RDMA).
>
> Thanks to Juan and Paolo for their useful suggestions.
>
> Verified the convergence using the following:
> - SpecJbb2005 workload running on a 20VCPU/256G guest (~80% busy)
> - OLTP-like workload running on an 80VCPU/512G guest (~80% busy)
>
> Sample results with the SpecJbb2005 workload (migrate speed set to 20Gb
> and migrate downtime set to 4 seconds):
>
> (qemu) info migrate
> capabilities: xbzrle: off auto-converge: off  <----
> Migration status: active
> total time: 1487503 milliseconds
~1487 seconds, and still not finished.

> expected downtime: 519 milliseconds
> transferred ram: 383749347 kbytes
> remaining ram: 2753372 kbytes
> total ram: 268444224 kbytes
> duplicate: 65461532 pages
> skipped: 64901568 pages
> normal: 95750218 pages
> normal bytes: 383000872 kbytes
> dirty pages rate: 67551 pages
>
> ---
>
> (qemu) info migrate
> capabilities: xbzrle: off auto-converge: on  <----
> Migration status: completed
> total time: 241161 milliseconds
> downtime: 6373 milliseconds

6.3 seconds and finished, not bad at all O:-)

How much does the guest throughput drop while we are in auto-converge mode?

> transferred ram: 28235307 kbytes
> remaining ram: 0 kbytes
> total ram: 268444224 kbytes
> duplicate: 64946416 pages
> skipped: 64903523 pages
> normal: 7044971 pages
> normal bytes: 28179884 kbytes
>
> Changes from v1:
> - rebased to latest qemu.git
> - added auto-converge capability (default off) - suggested by Anthony Liguori
>   & Eric Blake.
>
> Signed-off-by: Chegu Vinod <chegu_vi...@hp.com>

> @@ -379,12 +380,20 @@ static void migration_bitmap_sync(void)
>      MigrationState *s = migrate_get_current();
>      static int64_t start_time;
>      static int64_t num_dirty_pages_period;
> +    static int64_t bytes_xfer_prev;
>      int64_t end_time;
> +    int64_t bytes_xfer_now;
> +    static int dirty_rate_high_cnt;
> +
> +    if (migrate_auto_converge() && !bytes_xfer_prev) {

Just do the !bytes_xfer_prev test here?  migrate_auto_converge() is more
expensive to call than just doing the assignment.  (Rough sketch at the end
of this mail.)

> +
> +    if (value) {
> +        return true;
> +    }
> +    return false;

this code is just:

    return value;

> diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
> index 6f0200a..9a3886d 100644
> --- a/include/qemu/main-loop.h
> +++ b/include/qemu/main-loop.h
> @@ -299,6 +299,9 @@ void qemu_mutex_lock_iothread(void);
>   */
>  void qemu_mutex_unlock_iothread(void);
>
> +void qemu_mutex_lock_mig_throttle(void);
> +void qemu_mutex_unlock_mig_throttle(void);
> +
>  /* internal interfaces */
>
>  void qemu_fd_register(int fd);
> diff --git a/kvm-all.c b/kvm-all.c
> index 2d92721..a92cb77 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -33,6 +33,8 @@
>  #include "exec/memory.h"
>  #include "exec/address-spaces.h"
>  #include "qemu/event_notifier.h"
> +#include "sysemu/cpus.h"
> +#include "migration/migration.h"
>
>  /* This check must be after config-host.h is included */
>  #ifdef CONFIG_EVENTFD
> @@ -116,6 +118,8 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
>      KVM_CAP_LAST_INFO
>  };
>
> +static void mig_delay_vcpu(void);
> +

Move the function definition here instead of the forward declaration?

> +
> +static bool throttling;
> +bool throttling_now(void)
> +{
> +    if (throttling) {
> +        return true;
> +    }
> +    return false;

Again, this is just:

    return throttling;

> +/* Stub used for getting the vcpu out of VM and into qemu via
> +   run_on_cpu() */
> +static void mig_kick_cpu(void *opq)
> +{
> +    return;
> +}
> +
> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
> +   much time in the VM.  The migration thread will try to catchup.
> +   Workload will experience a greater performance drop but for a shorter
> +   duration.
> +*/
> +void *migration_throttle_down(void *opaque)
> +{
> +    throttling = true;
> +    while (throttling_needed()) {
> +        CPUArchState *penv = first_cpu;

I am not sure that we can follow the list without the iothread lock here
(see the sketch at the end of this mail).

> +        while (penv) {
> +            qemu_mutex_lock_iothread();
> +            run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
> +            qemu_mutex_unlock_iothread();
> +            penv = penv->next_cpu;
> +        }
> +        g_usleep(25*1000);
> +    }
> +    throttling = false;
> +    return NULL;
> +}
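For the migration_bitmap_sync() comment above, this is roughly what I had in
mind - an untested sketch only, and note that I am assuming the elided body
simply seeds bytes_xfer_prev from ram_bytes_transferred(), which is not
visible in the quoted hunk:

    /* Sketch: bytes_xfer_prev is 0 only on the first pass, so test the
     * cheap static variable and do the assignment unconditionally rather
     * than paying for a migrate_auto_converge() lookup on every sync. */
    if (!bytes_xfer_prev) {
        bytes_xfer_prev = ram_bytes_transferred();
    }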
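On the locking question, the rearrangement below is what I would expect -
again just an untested sketch that keeps the patch's names
(throttling_needed(), mig_kick_cpu(), etc.) and only moves the iothread
lock outside the walk of the CPU list:

    /* Hold the iothread lock across the whole walk so that the CPU list
     * cannot change underneath us; run_on_cpu() still forces each VCPU
     * out of guest mode.  Keep the 25ms pause between rounds so the
     * migration thread gets a chance to catch up. */
    void *migration_throttle_down(void *opaque)
    {
        throttling = true;
        while (throttling_needed()) {
            CPUArchState *penv;

            qemu_mutex_lock_iothread();
            for (penv = first_cpu; penv != NULL; penv = penv->next_cpu) {
                run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
            }
            qemu_mutex_unlock_iothread();

            g_usleep(25 * 1000);
        }
        throttling = false;
        return NULL;
    }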