Re: [Qemu-devel] [PATCH 05/10] migration: Fix the migrate auto converge process
On 3/11/2014 1:48 PM, Juan Quintela wrote: arei.gong...@huawei.com wrote: From: ChenLiang chenlian...@huawei.com

It is inaccurate and complex to use the transfer speed of the migration thread to determine whether migration converges: dirty pages may be compressed by XBZRLE or zero-page detection, and the counter of dirty bitmap updates will keep increasing if the migration cannot converge.

@@ -530,21 +523,11 @@ static void migration_bitmap_sync(void)
     /* more than 1 second = 1000 millisecons */
     if (end_time > start_time + 1000) {
         if (migrate_auto_converge()) {
-            /* The following detection logic can be refined later. For now:
-               Check to see if the dirtied bytes is 50% more than the approx.
-               amount of bytes that just got transferred since the last time we
-               were in this routine. If that happens N times (for now N==4)
-               we turn on the throttle down logic */
-            bytes_xfer_now = ram_bytes_transferred();
-            if (s->dirty_pages_rate &&
-               (num_dirty_pages_period * TARGET_PAGE_SIZE >
-                   (bytes_xfer_now - bytes_xfer_prev)/2) &&
-               (dirty_rate_high_cnt++ > 4)) {
-                trace_migration_throttle();
-                mig_throttle_on = true;
-                dirty_rate_high_cnt = 0;
-            }
-            bytes_xfer_prev = bytes_xfer_now;
+            if (get_bitmap_sync_cnt() > 15) {
+                /* It indicates that migration can't converge when the counter
+                   is larger than fifteen. Enable the feature of auto converge */

The comment is not needed, it says exactly what the code does. But why 15? It is not that I think that the older code is better or worse than yours, just that we move from one magic number to another (that is even bigger). Wouldn't it be easier to just change mig_sleep_cpu() to do something like:

static void mig_sleep_cpu(void *opq)
{
    qemu_mutex_unlock_iothread();
    g_usleep(2 * get_bitmap_sync_cnt() * 1000);
    qemu_mutex_lock_iothread();
}

This would get the 30 ms on the 15th iteration.
I am open to changing that formula to anything different; what I want is for less convergence to mean more throttling.

I already got some feedback earlier on this and had this task in the list of things to work on... :) Having the throttling start at some pre-defined degree and then have that degree gradually increase, either (a) automatically, as shown in Juan's example above, or (b) via some TBD user-level interface, is one way to help ensure convergence for all cases. The issue with continuing to increase the degree of throttling is the obvious impact on the workload (which is still trying to run in the VM). Would it be better to force the live migration to switch from the iterative pre-copy phase to the downtime phase if it fails to converge even after throttling it for a couple of iterations? Doing so could result in a longer actual downtime. I hope to try this and see... but if anyone has inputs (other than doing post-copy etc.), please do share.

BTW, are you testing this with any workload to see that it improves?

Yes.

Please do share some data.

+                mig_throttle_on = true;
+            }

Vinod, what do you think?

As is noted in the current code, the logic to detect the lack of convergence needs to be refined. If there is a better way to detect it (one that covers these other cases like XBZRLE etc.), then I am all for it. I do agree with Juan about the choice of magic numbers (i.e. one may not be better than the other). BTW, on a related note: I haven't used XBZRLE in the recent past (after having tried it in the early days). Does it now perform well with larger-sized VMs running real-world workloads? I assume that is where you found there was still a need for forcing convergence? Please do consider sharing some results about the type of workload and also the size of the VMs etc. that you have tried with XBZRLE.

Do you have a workload to test this? Hmm...
One can test this with memory-intensive Java warehouse-type workloads (besides using synthetic workloads).

Vinod

Thanks, Juan.
Re: [Qemu-devel] [PATCH v2 00/39] bitmap handling optimization
On 11/6/2013 5:04 AM, Juan Quintela wrote:

Hi

[v2]

In this version:
- fixed all the comments from last versions (thanks Eric)
- kvm migration bitmap is synchronized using bitmap operations
- qemu bitmap -> migration bitmap is synchronized using bitmap operations

If bitmaps are not properly aligned, we fall back to the old code. Code survives virt-tests, so it should be in quite good shape.

ToDo list:
- vga ram by default is not aligned at a page number multiple of 64; it could be optimized. Kraxel? It syncs the kvm bitmap at least once a second or so, and the bitmap is only 2048 pages (16MB by default). We need to change the ram_addr only.
- vga: still more, after we finish migration, vga code continues synchronizing the kvm bitmap on the source machine. Notice that there is no graphics client connected to the VGA. Worth investigating?
- I haven't yet measured speed differences on big hosts. Vinod?
- Depending on performance, more optimizations to do.
- debugging printf's are still in the code, just to see if we are taking (or not) the optimized paths.

And that is all. Please test and comment.

Thanks, Juan.

[v1]

This series splits the dirty bitmap (8 bits per page, only three used) into 3 individual bitmaps. Once the conversion is done, operations are handled by bitmap operations, not bit by bit.

- *_DIRTY_FLAG flags are gone; now we use memory.h DIRTY_MEMORY_* everywhere.
- We set/reset each flag individually; things like set_dirty_flags(0xff & ~CODE_DIRTY_FLAG) are gone.
- Renamed several functions to clarify/make things consistent.
- I know it doesn't pass checkpatch for long lines; a proper submission should pass it. We have to have long lines, short variable names, or ugly line splitting :p
- DIRTY_MEMORY_NUM: how can one include exec/memory.h into cpu-all.h? #include'ing it doesn't work; as a workaround I have copied its value, but any better idea? I can always create exec/migration-flags.h, though.
- The meat of the code is patch 19. The rest of the patches are quite easy (even that one is not too complex).
The only optimizations done so far are set_dirty_range()/clear_dirty_range(), which now operate with bitmap_set/clear.

Note for Xen: cpu_physical_memory_set_dirty_range() was wrong for xen, see the comment on the patch.

It passes virt-test migration tests, so it should be perfect. I post it to ask for comments.

ToDo list:
- create a lock for the bitmaps and fold the migration bitmap into this one. This would avoid a copy and make things easier?
- As this code uses/abuses bitmaps, we need to change the type of the index from int to long. With an int index, we can only access a maximum of an 8TB guest (yes, this is not urgent, we have a couple of years to do it).
- merging the KVM -> QEMU bitmap as a bitmap and not bit-by-bit.
- splitting the KVM bitmap synchronization into chunks, i.e. not synchronizing all memory, just enough to continue with migration.

Any further ideas/needs?

Thanks, Juan.

PD. Why did it take so long? Because I was trying to integrate the bitmap into the MemoryRegion abstraction. It would have made the code cleaner, but I hit dead-end after dead-end.
In practical terms, TCG doesn't know about MemoryRegions; it has been ported to run on top of them, but doesn't use them effectively.

The following changes since commit c2d30667760e3d7b81290d801e567d4f758825ca:

  rtc: remove dead SQW IRQ code (2013-11-05 20:04:03 -0800)

are available in the git repository at:

  git://github.com/juanquintela/qemu.git bitmap-v2.next

for you to fetch changes up to d91eff97e6f36612eb22d57c2b6c2623f73d3997:

  migration: synchronize memory bitmap 64bits at a time (2013-11-06 13:54:56 +0100)

Juan Quintela (39):
  Move prototypes to memory.h
  memory: cpu_physical_memory_set_dirty_flags() result is never used
  memory: cpu_physical_memory_set_dirty_range() return void
  exec: use accessor function to know if memory is dirty
  memory: create function to set a single dirty bit
  exec: create function to get a single dirty bit
  memory: make cpu_physical_memory_is_dirty return bool
  exec: simplify notdirty_mem_write()
  memory: all users of cpu_physical_memory_get_dirty used only one flag
  memory: set single dirty flags when possible
  memory: cpu_physical_memory_set_dirty_range() allways dirty all flags
  memory: cpu_physical_memory_mask_dirty_range() always clear a single flag
  memory: use DIRTY_MEMORY_* instead of *_DIRTY_FLAG
  memory: use bit 2 for migration
  memory: make sure that client is always inside range
  memory: only resize dirty bitmap when memory size increases
  memory: cpu_physical_memory_clear_dirty_flag() result is never used
  bitmap: Add bitmap_zero_extend operation
  memory: split dirty bitmap into three
  memory: unfold cpu_physical_memory_clear_dirty_flag() in its only user
Re: [Qemu-devel] [PATCH v7 3/3] Force auto-convergence of live migration
On 6/24/2013 6:01 AM, Paolo Bonzini wrote: One nit and one question:

Il 23/06/2013 22:11, Chegu Vinod ha scritto:

@@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
     /* more than 1 second = 1000 millisecons */
     if (end_time > start_time + 1000) {
+        if (migrate_auto_converge()) {
+            /* The following detection logic can be refined later. For now:
+               Check to see if the dirtied bytes is 50% more than the approx.
+               amount of bytes that just got transferred since the last time we
+               were in this routine. If that happens N times (for now N==4)
+               we turn on the throttle down logic */
+            bytes_xfer_now = ram_bytes_transferred();
+            if (s->dirty_pages_rate &&
+               (num_dirty_pages_period * TARGET_PAGE_SIZE >
+                   (bytes_xfer_now - bytes_xfer_prev)/2) &&
+               (dirty_rate_high_cnt++ > 4)) {
+                trace_migration_throttle();
+                mig_throttle_on = true;
+                dirty_rate_high_cnt = 0;
+            }
+            bytes_xfer_prev = bytes_xfer_now;
+        }

Missing:

        } else {
            mig_throttle_on = false;
        }

Ok.

+/* Stub function that gets run on the vcpu when it's brought out of the
+   VM to run inside qemu via async_run_on_cpu() */
+static void mig_sleep_cpu(void *opq)
+{
+    qemu_mutex_unlock_iothread();
+    g_usleep(30*1000);
+    qemu_mutex_lock_iothread();
+}
+
+/* If it has been more than 40 ms since the last time the guest
+ * was throttled then do it again.
+ */
+if (40 < (t1 - t0) / 1000000) {

You're stealing 75% of the CPU time, isn't that a lot?

Depends on the dirty rate vs. transfer rate... I had tried 50% too and it took much longer for the migration to converge.

Vinod

+    mig_throttle_guest_down();
+    t0 = t1;
+}
+}

Paolo
[Qemu-devel] [PATCH v8 3/3] Force auto-convergence of live migration
If a user chooses to turn on the auto-converge migration capability, these changes detect the lack of convergence and throttle down the guest, i.e. force the VCPUs out of the guest for some duration and let the migration thread catch up and help converge.

Verified the convergence using the following:
- Java Warehouse workload running on a 20VCPU/256G guest (~80% busy)
- OLTP-like workload running on an 80VCPU/512G guest (~80% busy)

Sample results with the Java warehouse workload (migrate speed set to 20Gb and migrate downtime set to 4 seconds):

(qemu) info migrate
capabilities: xbzrle: off auto-converge: off
Migration status: active
total time: 1487503 milliseconds
expected downtime: 519 milliseconds
transferred ram: 383749347 kbytes
remaining ram: 2753372 kbytes
total ram: 268444224 kbytes
duplicate: 65461532 pages
skipped: 64901568 pages
normal: 95750218 pages
normal bytes: 383000872 kbytes
dirty pages rate: 67551 pages
---
(qemu) info migrate
capabilities: xbzrle: off auto-converge: on
Migration status: completed
total time: 241161 milliseconds
downtime: 6373 milliseconds
transferred ram: 28235307 kbytes
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64946416 pages
skipped: 64903523 pages
normal: 7044971 pages
normal bytes: 28179884 kbytes

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 arch_init.c | 79 +++
 1 files changed, 79 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index a8b91ee..e7ca3b1 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,9 @@ int graphic_depth = 15;
 #endif
 const uint32_t arch_type = QEMU_ARCH;
+static bool mig_throttle_on;
+static int dirty_rate_high_cnt;
+static void check_guest_throttling(void);
 /***/
 /* ram save/restore */
@@ -378,8 +381,14 @@ static void migration_bitmap_sync(void)
     uint64_t num_dirty_pages_init = migration_dirty_pages;
     MigrationState *s = migrate_get_current();
     static int64_t start_time;
+    static int64_t bytes_xfer_prev;
     static int64_t num_dirty_pages_period;
     int64_t end_time;
+    int64_t bytes_xfer_now;
+
+    if (!bytes_xfer_prev) {
+        bytes_xfer_prev = ram_bytes_transferred();
+    }
     if (!start_time) {
         start_time = qemu_get_clock_ms(rt_clock);
@@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
     /* more than 1 second = 1000 millisecons */
     if (end_time > start_time + 1000) {
+        if (migrate_auto_converge()) {
+            /* The following detection logic can be refined later. For now:
+               Check to see if the dirtied bytes is 50% more than the approx.
+               amount of bytes that just got transferred since the last time we
+               were in this routine. If that happens N times (for now N==4)
+               we turn on the throttle down logic */
+            bytes_xfer_now = ram_bytes_transferred();
+            if (s->dirty_pages_rate &&
+               (num_dirty_pages_period * TARGET_PAGE_SIZE >
+                   (bytes_xfer_now - bytes_xfer_prev)/2) &&
+               (dirty_rate_high_cnt++ > 4)) {
+                trace_migration_throttle();
+                mig_throttle_on = true;
+                dirty_rate_high_cnt = 0;
+            }
+            bytes_xfer_prev = bytes_xfer_now;
+        } else {
+            mig_throttle_on = false;
+        }
         s->dirty_pages_rate = num_dirty_pages_period * 1000
             / (end_time - start_time);
         s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -566,6 +592,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
     migration_bitmap = bitmap_new(ram_pages);
     bitmap_set(migration_bitmap, 0, ram_pages);
     migration_dirty_pages = ram_pages;
+    mig_throttle_on = false;
+    dirty_rate_high_cnt = 0;
     if (migrate_use_xbzrle()) {
         XBZRLE.cache = cache_init(migrate_xbzrle_cache_size() /
@@ -628,6 +656,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
     }
     total_sent += bytes_sent;
     acct_info.iterations++;
+    check_guest_throttling();
     /* we want to check in the 1st loop, just in case it was the 1st time
        and we had to sync the dirty bitmap.
        qemu_get_clock_ns() is a bit expensive, so we only check each some
@@ -1097,3 +1126,53 @@ TargetInfo *qmp_query_target(Error **errp)
     return info;
 }
+
+/* Stub function that gets run on the vcpu when it's brought out of the
+   VM to run inside qemu via async_run_on_cpu() */
+static void mig_sleep_cpu(void *opq)
+{
+    qemu_mutex_unlock_iothread();
+    g_usleep(30*1000);
+    qemu_mutex_lock_iothread();
+}
+
+/* To reduce the dirty rate explicitly disallow the VCPUs from spending
+   much time in the VM. The migration thread will try to catch up.
+   Workload will experience a performance drop.
+*/
+static void
[Qemu-devel] [PATCH v8 0/3] Throttle-down guest to help with live migration convergence
Busy enterprise workloads hosted on large-sized VMs tend to dirty memory faster than the transfer rate achieved via live guest migration. Despite some good recent improvements (e.g. using dedicated 10Gig NICs between hosts), the live migration does NOT converge.

If a user chooses to force convergence of their migration via the new migration capability "auto-converge", then this change will auto-detect the lack-of-convergence scenario and trigger a slowdown of the workload by explicitly disallowing the VCPUs from spending much time in the VM context. The migration thread tries to catch up, and this eventually leads to convergence in some deterministic amount of time. Yes, it does impact the performance of all the VCPUs, but in my observation that lasts only for a short duration of time, i.e. we end up entering stage 3 (the downtime phase) soon after that. No external trigger is required.

Thanks to Juan and Paolo for their useful suggestions.

---
Changes from v7:
- added a missing else to patch 3/3.

Changes from v6:
- incorporated feedback from Paolo.
- rebased to latest qemu.git and removed RFC.

Changes from v5:
- incorporated feedback from Paolo and Igor.
- rebased to latest qemu.git.

Changes from v4:
- incorporated feedback from Paolo.
- split into 3 patches.

Changes from v3:
- incorporated feedback from Paolo and Eric.
- rebased to latest qemu.git.

Changes from v2:
- incorporated feedback from Orit, Juan and Eric.
- stop the throttling thread at the start of stage 3.
- rebased to latest qemu.git.

Changes from v1:
- rebased to latest qemu.git.
- added auto-converge capability (default off) - suggested by Anthony Liguori and Eric Blake.
Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
Chegu Vinod (3):
  Introduce async_run_on_cpu()
  Add 'auto-converge' migration capability
  Force auto-convergence of live migration

 arch_init.c | 85 +
 cpus.c | 29 ++
 include/migration/migration.h | 2 +
 include/qemu-common.h | 1 +
 include/qom/cpu.h | 10 +
 migration.c | 9
 qapi-schema.json | 5 ++-
 7 files changed, 140 insertions(+), 1 deletions(-)
[Qemu-devel] [PATCH v8 2/3] Add 'auto-converge' migration capability
The auto-converge migration capability allows the user to specify whether they want the live migration sequence to automatically detect and force convergence.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
Reviewed-by: Paolo Bonzini pbonz...@redhat.com
Reviewed-by: Eric Blake ebl...@redhat.com
---
 include/migration/migration.h | 2 ++
 migration.c | 9 +
 qapi-schema.json | 5 -
 3 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index e2acec6..ace91b0 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -127,4 +127,6 @@ int migrate_use_xbzrle(void);
 int64_t migrate_xbzrle_cache_size(void);
 int64_t xbzrle_cache_resize(int64_t new_size);
+
+bool migrate_auto_converge(void);
 #endif

diff --git a/migration.c b/migration.c
index 058f9e6..d0759c1 100644
--- a/migration.c
+++ b/migration.c
@@ -473,6 +473,15 @@ void qmp_migrate_set_downtime(double value, Error **errp)
     max_downtime = (uint64_t)value;
 }
+bool migrate_auto_converge(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
+}
+
 int migrate_use_xbzrle(void)
 {
     MigrationState *s;

diff --git a/qapi-schema.json b/qapi-schema.json
index a80ee40..c019fec 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -605,10 +605,13 @@
 # This feature allows us to minimize migration traffic for certain work
 # loads, by sending compressed difference of the pages
 #
+# @auto-converge: If enabled, QEMU will automatically throttle down the guest
+#                 to speed up convergence of RAM migration. (since 1.6)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle'] }
+  'data': ['xbzrle', 'auto-converge'] }

 ##
 # @MigrationCapabilityStatus
--
1.7.1
[Qemu-devel] [PATCH v8 1/3] Introduce async_run_on_cpu()
Introduce an asynchronous version of run_on_cpu(), i.e. the caller doesn't have to block till the callback routine finishes execution on the target vcpu.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
Reviewed-by: Paolo Bonzini pbonz...@redhat.com
---
 cpus.c | 29 +
 include/qemu-common.h | 1 +
 include/qom/cpu.h | 10 ++
 3 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/cpus.c b/cpus.c
index c8bc8ad..c7c90d0 100644
--- a/cpus.c
+++ b/cpus.c
@@ -653,6 +653,7 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
     wi.func = func;
     wi.data = data;
+    wi.free = false;
     if (cpu->queued_work_first == NULL) {
         cpu->queued_work_first = &wi;
     } else {
@@ -671,6 +672,31 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
     }
 }
+void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
+{
+    struct qemu_work_item *wi;
+
+    if (qemu_cpu_is_self(cpu)) {
+        func(data);
+        return;
+    }
+
+    wi = g_malloc0(sizeof(struct qemu_work_item));
+    wi->func = func;
+    wi->data = data;
+    wi->free = true;
+    if (cpu->queued_work_first == NULL) {
+        cpu->queued_work_first = wi;
+    } else {
+        cpu->queued_work_last->next = wi;
+    }
+    cpu->queued_work_last = wi;
+    wi->next = NULL;
+    wi->done = false;
+
+    qemu_cpu_kick(cpu);
+}
+
 static void flush_queued_work(CPUState *cpu)
 {
     struct qemu_work_item *wi;
@@ -683,6 +709,9 @@ static void flush_queued_work(CPUState *cpu)
         cpu->queued_work_first = wi->next;
         wi->func(wi->data);
         wi->done = true;
+        if (wi->free) {
+            g_free(wi);
+        }
     }
     cpu->queued_work_last = NULL;
     qemu_cond_broadcast(&qemu_work_cond);

diff --git a/include/qemu-common.h b/include/qemu-common.h
index 3c91375..9834dcb 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -291,6 +291,7 @@ struct qemu_work_item {
     void (*func)(void *data);
     void *data;
     int done;
+    bool free;
 };

 #ifdef CONFIG_USER_ONLY

diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index a5bb515..b555c22 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -288,6 +288,16 @@ bool cpu_is_stopped(CPUState *cpu);
 void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);

 /**
+ * async_run_on_cpu:
+ * @cpu: The vCPU to run on.
+ * @func: The function to be executed.
+ * @data: Data to pass to the function.
+ *
+ * Schedules the function @func for execution on the vCPU @cpu asynchronously.
+ */
+void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
+
+/**
  * qemu_for_each_cpu:
  * @func: The function to be executed.
  * @data: Data to pass to the function.
--
1.7.1
Re: [Qemu-devel] [PATCH v8 3/3] Force auto-convergence of live migration
On 6/24/2013 8:59 AM, Paolo Bonzini wrote: Il 24/06/2013 11:47, Chegu Vinod ha scritto:

If a user chooses to turn on the auto-converge migration capability, these changes detect the lack of convergence and throttle down the guest, i.e. force the VCPUs out of the guest for some duration and let the migration thread catch up and help converge.

Verified the convergence using the following:
- Java Warehouse workload running on a 20VCPU/256G guest (~80% busy)
- OLTP-like workload running on an 80VCPU/512G guest (~80% busy)

Sample results with the Java warehouse workload (migrate speed set to 20Gb and migrate downtime set to 4 seconds):

(qemu) info migrate
capabilities: xbzrle: off auto-converge: off
Migration status: active
total time: 1487503 milliseconds
expected downtime: 519 milliseconds
transferred ram: 383749347 kbytes
remaining ram: 2753372 kbytes
total ram: 268444224 kbytes
duplicate: 65461532 pages
skipped: 64901568 pages
normal: 95750218 pages
normal bytes: 383000872 kbytes
dirty pages rate: 67551 pages
---
(qemu) info migrate
capabilities: xbzrle: off auto-converge: on
Migration status: completed
total time: 241161 milliseconds
downtime: 6373 milliseconds
transferred ram: 28235307 kbytes
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64946416 pages
skipped: 64903523 pages
normal: 7044971 pages
normal bytes: 28179884 kbytes

Signed-off-by: Chegu Vinod chegu_vi...@hp.com

As far as the algorithm is concerned, Reviewed-by: Paolo Bonzini pbonz...@redhat.com

Thanks!

but are you sure that this passes checkpatch.pl?

Yes it does (had checked it before I posted).

# ./scripts/checkpatch.pl 0003-Force-auto-convegence-of-live-migration.patch
total: 0 errors, 0 warnings, 114 lines checked
0003-Force-auto-convegence-of-live-migration.patch has no obvious style problems and is ready for submission.

Vinod

+            /* The following detection logic can be refined later. For now:
+               Check to see if the dirtied bytes is 50% more than the approx.
+               amount of bytes that just got transferred since the last time we
+               were in this routine. If that happens N times (for now N==4)
+               we turn on the throttle down logic */
+            bytes_xfer_now = ram_bytes_transferred();
+            if (s->dirty_pages_rate &&
+               (num_dirty_pages_period * TARGET_PAGE_SIZE >
+                   (bytes_xfer_now - bytes_xfer_prev)/2) &&
+               (dirty_rate_high_cnt++ > 4)) {

The spacing of the operators here looks like something checkpatch.pl would complain about. If you have to respin for that, keep my R-b and please also remove all other superfluous parentheses.

Paolo

+                trace_migration_throttle();
+                mig_throttle_on = true;
+                dirty_rate_high_cnt = 0;
+            }
+            bytes_xfer_prev = bytes_xfer_now;
+        } else {
+            mig_throttle_on = false;
+        }
         s->dirty_pages_rate = num_dirty_pages_period * 1000
             / (end_time - start_time);
         s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -566,6 +592,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
     migration_bitmap = bitmap_new(ram_pages);
     bitmap_set(migration_bitmap, 0, ram_pages);
     migration_dirty_pages = ram_pages;
+    mig_throttle_on = false;
+    dirty_rate_high_cnt = 0;
     if (migrate_use_xbzrle()) {
         XBZRLE.cache = cache_init(migrate_xbzrle_cache_size() /
@@ -628,6 +656,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
     }
     total_sent += bytes_sent;
     acct_info.iterations++;
+    check_guest_throttling();
     /* we want to check in the 1st loop, just in case it was the 1st time
        and we had to sync the dirty bitmap.
        qemu_get_clock_ns() is a bit expensive, so we only check each some
@@ -1097,3 +1126,53 @@ TargetInfo *qmp_query_target(Error **errp)
     return info;
 }
+
+/* Stub function that gets run on the vcpu when it's brought out of the
+   VM to run inside qemu via async_run_on_cpu() */
+static void mig_sleep_cpu(void *opq)
+{
+    qemu_mutex_unlock_iothread();
+    g_usleep(30*1000);
+    qemu_mutex_lock_iothread();
+}
+
+/* To reduce the dirty rate explicitly disallow the VCPUs from spending
+   much time in the VM. The migration thread will try to catch up.
+   Workload will experience a performance drop.
+*/
+static void mig_throttle_cpu_down(CPUState *cpu, void *data)
+{
+    async_run_on_cpu(cpu, mig_sleep_cpu, NULL);
+}
+
+static void mig_throttle_guest_down(void)
+{
+    qemu_mutex_lock_iothread();
+    qemu_for_each_cpu(mig_throttle_cpu_down, NULL);
+    qemu_mutex_unlock_iothread();
+}
+
+static void check_guest_throttling(void)
+{
+    static int64_t t0;
+    int64_t        t1;
+
+    if (!mig_throttle_on) {
+        return;
+    }
+
+    if (!t0
Re: [Qemu-devel] [RFC PATCH v6 3/3] Force auto-convergence of live migration
On 6/20/2013 5:54 AM, Paolo Bonzini wrote: Il 14/06/2013 15:58, Chegu Vinod ha scritto:

If a user chooses to turn on the auto-converge migration capability, these changes detect the lack of convergence and throttle down the guest, i.e. force the VCPUs out of the guest for some duration and let the migration thread catch up and help converge.

Hi Vinod, pretty much the same comments I sent you yesterday on the obsolete version of the patch still apply.

Verified the convergence using the following:
- Java Warehouse workload running on a 20VCPU/256G guest (~80% busy)
- OLTP-like workload running on an 80VCPU/512G guest (~80% busy)

Sample results with the Java warehouse workload (migrate speed set to 20Gb and migrate downtime set to 4 seconds):

(qemu) info migrate
capabilities: xbzrle: off auto-converge: off
Migration status: active
total time: 1487503 milliseconds
expected downtime: 519 milliseconds
transferred ram: 383749347 kbytes
remaining ram: 2753372 kbytes
total ram: 268444224 kbytes
duplicate: 65461532 pages
skipped: 64901568 pages
normal: 95750218 pages
normal bytes: 383000872 kbytes
dirty pages rate: 67551 pages
---
(qemu) info migrate
capabilities: xbzrle: off auto-converge: on
Migration status: completed
total time: 241161 milliseconds
downtime: 6373 milliseconds
transferred ram: 28235307 kbytes
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64946416 pages
skipped: 64903523 pages
normal: 7044971 pages
normal bytes: 28179884 kbytes

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 arch_init.c | 85 +++
 1 files changed, 85 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 5d32ecf..69c6c8c 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,8 @@ int graphic_depth = 15;
 #endif
 const uint32_t arch_type = QEMU_ARCH;
+static bool mig_throttle_on;
+static void throttle_down_guest_to_converge(void);
 /***/
 /* ram save/restore */
@@ -378,8 +380,15 @@ static void migration_bitmap_sync(void)
     uint64_t num_dirty_pages_init =
migration_dirty_pages;
     MigrationState *s = migrate_get_current();
     static int64_t start_time;
+    static int64_t bytes_xfer_prev;
     static int64_t num_dirty_pages_period;
     int64_t end_time;
+    int64_t bytes_xfer_now;
+    static int dirty_rate_high_cnt;
+
+    if (!bytes_xfer_prev) {
+        bytes_xfer_prev = ram_bytes_transferred();
+    }
     if (!start_time) {
         start_time = qemu_get_clock_ms(rt_clock);
@@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
     /* more than 1 second = 1000 millisecons */
     if (end_time > start_time + 1000) {
+        if (migrate_auto_converge()) {
+            /* The following detection logic can be refined later. For now:
+               Check to see if the dirtied bytes is 50% more than the approx.
+               amount of bytes that just got transferred since the last time we
+               were in this routine. If that happens N times (for now N==4)
+               we turn on the throttle down logic */
+            bytes_xfer_now = ram_bytes_transferred();
+            if (s->dirty_pages_rate &&
+                ((num_dirty_pages_period * TARGET_PAGE_SIZE) >
+                 ((bytes_xfer_now - bytes_xfer_prev) / 2))) {
+                if (dirty_rate_high_cnt++ > 4) {

Too many parentheses, and please remove the nested if.

+                    DPRINTF("Unable to converge. Throttling down guest\n");

Please use a tracepoint instead.

+                    mig_throttle_on = true;

Need to reset dirty_rate_high_cnt here, and both dirty_rate_high_cnt/mig_throttle_on if you see !migrate_auto_converge(). This ensures that throttling does not kick in automatically if you disable and re-enable the feature. It also lets you remove a bunch of migrate_auto_converge() checks. You also need to reset dirty_rate_high_cnt/mig_throttle_on in the setup phase of migration.

+                }
+            }
+            bytes_xfer_prev = bytes_xfer_now;
+        }
         s->dirty_pages_rate = num_dirty_pages_period * 1000
             / (end_time - start_time);
         s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -628,6 +654,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
     }
     total_sent += bytes_sent;
     acct_info.iterations++;
+    throttle_down_guest_to_converge();

You can use a shorter name, like check_cpu_throttling().
 /* we want to check in the 1st loop, just in case it was the 1st time
    and we had to sync the dirty bitmap.
    qemu_get_clock_ns() is a bit expensive, so we only check each some
@@ -1098,3 +1125,61 @@ TargetInfo *qmp_query_target(Error **errp)
     return info;
 }
+
+static bool throttling_needed(void)
+{
+    if (!migrate_auto_converge
[Qemu-devel] [PATCH v7 1/3] Introduce async_run_on_cpu()
Introduce an asynchronous version of run_on_cpu(), i.e. the caller doesn't have to block till the callback routine finishes execution on the target vcpu.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
Reviewed-by: Paolo Bonzini pbonz...@redhat.com
---
 cpus.c | 29 +
 include/qemu-common.h | 1 +
 include/qom/cpu.h | 10 ++
 3 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/cpus.c b/cpus.c
index c8bc8ad..c7c90d0 100644
--- a/cpus.c
+++ b/cpus.c
@@ -653,6 +653,7 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
     wi.func = func;
     wi.data = data;
+    wi.free = false;
     if (cpu->queued_work_first == NULL) {
         cpu->queued_work_first = &wi;
     } else {
@@ -671,6 +672,31 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
     }
 }
+void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
+{
+    struct qemu_work_item *wi;
+
+    if (qemu_cpu_is_self(cpu)) {
+        func(data);
+        return;
+    }
+
+    wi = g_malloc0(sizeof(struct qemu_work_item));
+    wi->func = func;
+    wi->data = data;
+    wi->free = true;
+    if (cpu->queued_work_first == NULL) {
+        cpu->queued_work_first = wi;
+    } else {
+        cpu->queued_work_last->next = wi;
+    }
+    cpu->queued_work_last = wi;
+    wi->next = NULL;
+    wi->done = false;
+
+    qemu_cpu_kick(cpu);
+}
+
 static void flush_queued_work(CPUState *cpu)
 {
     struct qemu_work_item *wi;
@@ -683,6 +709,9 @@ static void flush_queued_work(CPUState *cpu)
         cpu->queued_work_first = wi->next;
         wi->func(wi->data);
         wi->done = true;
+        if (wi->free) {
+            g_free(wi);
+        }
     }
     cpu->queued_work_last = NULL;
     qemu_cond_broadcast(&qemu_work_cond);

diff --git a/include/qemu-common.h b/include/qemu-common.h
index 3c91375..9834dcb 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -291,6 +291,7 @@ struct qemu_work_item {
     void (*func)(void *data);
     void *data;
     int done;
+    bool free;
 };

 #ifdef CONFIG_USER_ONLY

diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index a5bb515..b555c22 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -288,6 +288,16 @@ bool cpu_is_stopped(CPUState *cpu);
void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data); /** + * async_run_on_cpu: + * @cpu: The vCPU to run on. + * @func: The function to be executed. + * @data: Data to pass to the function. + * + * Schedules the function @func for execution on the vCPU @cpu asynchronously. + */ +void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data); + +/** * qemu_for_each_cpu: * @func: The function to be executed. * @data: Data to pass to the function. -- 1.7.1
[Qemu-devel] [PATCH v7 0/3] Throttle-down guest to help with live migration convergence
Busy enterprise workloads hosted on large sized VM's tend to dirty memory faster than the transfer rate achieved via live guest migration. Despite some good recent improvements ( using dedicated 10Gig NICs between hosts) the live migration does NOT converge. If a user chooses to force convergence of their migration via a new migration capability auto-converge then this change will auto-detect lack of convergence scenario and trigger a slow down of the workload by explicitly disallowing the VCPUs from spending much time in the VM context. The migration thread tries to catchup and this eventually leads to convergence in some deterministic amount of time. Yes it does impact the performance of all the VCPUs but in my observation that lasts only for a short duration of time. i.e. end up entering stage 3 (downtime phase) soon after that. No external trigger is required. Thanks to Juan and Paolo for their useful suggestions. --- Changes from v6: - incorporated feedback from Paolo. - rebased to latest qemu.git and removing RFC Changes from v5: - incorporated feedback from Paolo Igor. - rebased to latest qemu.git Changes from v4: - incorporated feedback from Paolo. - split into 3 patches. Changes from v3: - incorporated feedback from Paolo and Eric - rebased to latest qemu.git Changes from v2: - incorporated feedback from Orit, Juan and Eric - stop the throttling thread at the start of stage 3 - rebased to latest qemu.git Changes from v1: - rebased to latest qemu.git - added auto-converge capability(default off) - suggested by Anthony Liguori Eric Blake. Signed-off-by: Chegu Vinod chegu_vi...@hp.com --- Chegu Vinod (3): Introduce async_run_on_cpu() Add 'auto-converge' migration capability Force auto-convegence of live migration arch_init.c | 85 + cpus.c| 29 ++ include/migration/migration.h |2 + include/qemu-common.h |1 + include/qom/cpu.h | 10 + migration.c |9 qapi-schema.json |5 ++- 7 files changed, 140 insertions(+), 1 deletions(-)
[Qemu-devel] [PATCH v7 2/3] Add 'auto-converge' migration capability
If a user chooses to turn on the auto-converge migration capability these changes detect the lack of convergence and throttle down the guest. i.e. force the VCPUs out of the guest for some duration and let the migration thread catchup and help converge. Verified the convergence using the following : - Java Warehouse workload running on a 20VCPU/256G guest(~80% busy) - OLTP like workload running on a 80VCPU/512G guest (~80% busy) Sample results with Java warehouse workload : (migrate speed set to 20Gb and migrate downtime set to 4seconds). (qemu) info migrate capabilities: xbzrle: off auto-converge: off Migration status: active total time: 1487503 milliseconds expected downtime: 519 milliseconds transferred ram: 383749347 kbytes remaining ram: 2753372 kbytes total ram: 268444224 kbytes duplicate: 65461532 pages skipped: 64901568 pages normal: 95750218 pages normal bytes: 383000872 kbytes dirty pages rate: 67551 pages --- (qemu) info migrate capabilities: xbzrle: off auto-converge: on Migration status: completed total time: 241161 milliseconds downtime: 6373 milliseconds transferred ram: 28235307 kbytes remaining ram: 0 kbytes total ram: 268444224 kbytes duplicate: 64946416 pages skipped: 64903523 pages normal: 7044971 pages normal bytes: 28179884 kbytes Signed-off-by: Chegu Vinod chegu_vi...@hp.com --- arch_init.c | 79 +++ 1 files changed, 79 insertions(+), 0 deletions(-) diff --git a/arch_init.c b/arch_init.c index a8b91ee..e7ca3b1 100644 --- a/arch_init.c +++ b/arch_init.c @@ -104,6 +104,9 @@ int graphic_depth = 15; #endif const uint32_t arch_type = QEMU_ARCH; +static bool mig_throttle_on; +static int dirty_rate_high_cnt; +static void check_guest_throttling(void); /***/ /* ram save/restore */ @@ -378,8 +381,14 @@ static void migration_bitmap_sync(void) uint64_t num_dirty_pages_init = migration_dirty_pages; MigrationState *s = migrate_get_current(); static int64_t start_time; +static int64_t bytes_xfer_prev; static int64_t num_dirty_pages_period; int64_t end_time; 
+int64_t bytes_xfer_now; + +if (!bytes_xfer_prev) { +bytes_xfer_prev = ram_bytes_transferred(); +} if (!start_time) { start_time = qemu_get_clock_ms(rt_clock); @@ -404,6 +413,23 @@ static void migration_bitmap_sync(void) /* more than 1 second = 1000 milliseconds */ if (end_time > start_time + 1000) { +if (migrate_auto_converge()) { +/* The following detection logic can be refined later. For now: + Check to see if the dirtied bytes is 50% more than the approx. + amount of bytes that just got transferred since the last time we + were in this routine. If that happens N times (for now N==4) + we turn on the throttle down logic */ +bytes_xfer_now = ram_bytes_transferred(); +if (s->dirty_pages_rate && + (num_dirty_pages_period * TARGET_PAGE_SIZE > + (bytes_xfer_now - bytes_xfer_prev)/2) && + (dirty_rate_high_cnt++ > 4)) { +trace_migration_throttle(); +mig_throttle_on = true; +dirty_rate_high_cnt = 0; + } + bytes_xfer_prev = bytes_xfer_now; +} s->dirty_pages_rate = num_dirty_pages_period * 1000 / (end_time - start_time); s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE; @@ -566,6 +592,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque) migration_bitmap = bitmap_new(ram_pages); bitmap_set(migration_bitmap, 0, ram_pages); migration_dirty_pages = ram_pages; +mig_throttle_on = false; +dirty_rate_high_cnt = 0; if (migrate_use_xbzrle()) { XBZRLE.cache = cache_init(migrate_xbzrle_cache_size() / @@ -628,6 +656,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque) } total_sent += bytes_sent; acct_info.iterations++; +check_guest_throttling(); /* we want to check in the 1st loop, just in case it was the 1st time and we had to sync the dirty bitmap. 
qemu_get_clock_ns() is a bit expensive, so we only check each some @@ -1097,3 +1126,53 @@ TargetInfo *qmp_query_target(Error **errp) return info; } + +/* Stub function that gets run on the vcpu when it's brought out of the + VM to run inside qemu via async_run_on_cpu() */ +static void mig_sleep_cpu(void *opq) +{ +qemu_mutex_unlock_iothread(); +g_usleep(30*1000); +qemu_mutex_lock_iothread(); +} + +/* To reduce the dirty rate explicitly disallow the VCPUs from spending + much time in the VM. The migration thread will try to catchup. + Workload will experience a performance drop. +*/ +static void mig_throttle_cpu_down(CPUState *cpu, void *data) +{ +async_run_on_cpu(cpu, mig_sleep_cpu, NULL); +}
[Qemu-devel] [PATCH v7 2/3] Add 'auto-converge' migration capability
The auto-converge migration capability allows the user to specify whether the live migration sequence should automatically detect a lack of convergence and force it. Signed-off-by: Chegu Vinod chegu_vi...@hp.com Reviewed-by: Paolo Bonzini pbonz...@redhat.com Reviewed-by: Eric Blake ebl...@redhat.com --- include/migration/migration.h |2 ++ migration.c |9 + qapi-schema.json |5 - 3 files changed, 15 insertions(+), 1 deletions(-) diff --git a/include/migration/migration.h b/include/migration/migration.h index e2acec6..ace91b0 100644 --- a/include/migration/migration.h +++ b/include/migration/migration.h @@ -127,4 +127,6 @@ int migrate_use_xbzrle(void); int64_t migrate_xbzrle_cache_size(void); int64_t xbzrle_cache_resize(int64_t new_size); + +bool migrate_auto_converge(void); #endif diff --git a/migration.c b/migration.c index 058f9e6..d0759c1 100644 --- a/migration.c +++ b/migration.c @@ -473,6 +473,15 @@ void qmp_migrate_set_downtime(double value, Error **errp) max_downtime = (uint64_t)value; } +bool migrate_auto_converge(void) +{ +MigrationState *s; + +s = migrate_get_current(); + +return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE]; +} + int migrate_use_xbzrle(void) { MigrationState *s; diff --git a/qapi-schema.json b/qapi-schema.json index a80ee40..c019fec 100644 --- a/qapi-schema.json +++ b/qapi-schema.json @@ -605,10 +605,13 @@ # This feature allows us to minimize migration traffic for certain work # loads, by sending compressed difference of the pages # +# @auto-converge: If enabled, QEMU will automatically throttle down the guest +# to speed up convergence of RAM migration. (since 1.6) +# # Since: 1.2 ## { 'enum': 'MigrationCapability', - 'data': ['xbzrle'] } + 'data': ['xbzrle', 'auto-converge'] } ## # @MigrationCapabilityStatus -- 1.7.1
[Qemu-devel] [PATCH v7 3/3] Force auto-convergence of live migration
If a user chooses to turn on the auto-converge migration capability these changes detect the lack of convergence and throttle down the guest. i.e. force the VCPUs out of the guest for some duration and let the migration thread catchup and help converge. Verified the convergence using the following : - Java Warehouse workload running on a 20VCPU/256G guest(~80% busy) - OLTP like workload running on a 80VCPU/512G guest (~80% busy) Sample results with Java warehouse workload : (migrate speed set to 20Gb and migrate downtime set to 4seconds). (qemu) info migrate capabilities: xbzrle: off auto-converge: off Migration status: active total time: 1487503 milliseconds expected downtime: 519 milliseconds transferred ram: 383749347 kbytes remaining ram: 2753372 kbytes total ram: 268444224 kbytes duplicate: 65461532 pages skipped: 64901568 pages normal: 95750218 pages normal bytes: 383000872 kbytes dirty pages rate: 67551 pages --- (qemu) info migrate capabilities: xbzrle: off auto-converge: on Migration status: completed total time: 241161 milliseconds downtime: 6373 milliseconds transferred ram: 28235307 kbytes remaining ram: 0 kbytes total ram: 268444224 kbytes duplicate: 64946416 pages skipped: 64903523 pages normal: 7044971 pages normal bytes: 28179884 kbytes Signed-off-by: Chegu Vinod chegu_vi...@hp.com --- arch_init.c | 79 +++ 1 files changed, 79 insertions(+), 0 deletions(-) diff --git a/arch_init.c b/arch_init.c index a8b91ee..e7ca3b1 100644 --- a/arch_init.c +++ b/arch_init.c @@ -104,6 +104,9 @@ int graphic_depth = 15; #endif const uint32_t arch_type = QEMU_ARCH; +static bool mig_throttle_on; +static int dirty_rate_high_cnt; +static void check_guest_throttling(void); /***/ /* ram save/restore */ @@ -378,8 +381,14 @@ static void migration_bitmap_sync(void) uint64_t num_dirty_pages_init = migration_dirty_pages; MigrationState *s = migrate_get_current(); static int64_t start_time; +static int64_t bytes_xfer_prev; static int64_t num_dirty_pages_period; int64_t end_time; 
+int64_t bytes_xfer_now; + +if (!bytes_xfer_prev) { +bytes_xfer_prev = ram_bytes_transferred(); +} if (!start_time) { start_time = qemu_get_clock_ms(rt_clock); @@ -404,6 +413,23 @@ static void migration_bitmap_sync(void) /* more than 1 second = 1000 milliseconds */ if (end_time > start_time + 1000) { +if (migrate_auto_converge()) { +/* The following detection logic can be refined later. For now: + Check to see if the dirtied bytes is 50% more than the approx. + amount of bytes that just got transferred since the last time we + were in this routine. If that happens N times (for now N==4) + we turn on the throttle down logic */ +bytes_xfer_now = ram_bytes_transferred(); +if (s->dirty_pages_rate && + (num_dirty_pages_period * TARGET_PAGE_SIZE > + (bytes_xfer_now - bytes_xfer_prev)/2) && + (dirty_rate_high_cnt++ > 4)) { +trace_migration_throttle(); +mig_throttle_on = true; +dirty_rate_high_cnt = 0; + } + bytes_xfer_prev = bytes_xfer_now; +} s->dirty_pages_rate = num_dirty_pages_period * 1000 / (end_time - start_time); s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE; @@ -566,6 +592,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque) migration_bitmap = bitmap_new(ram_pages); bitmap_set(migration_bitmap, 0, ram_pages); migration_dirty_pages = ram_pages; +mig_throttle_on = false; +dirty_rate_high_cnt = 0; if (migrate_use_xbzrle()) { XBZRLE.cache = cache_init(migrate_xbzrle_cache_size() / @@ -628,6 +656,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque) } total_sent += bytes_sent; acct_info.iterations++; +check_guest_throttling(); /* we want to check in the 1st loop, just in case it was the 1st time and we had to sync the dirty bitmap. 
qemu_get_clock_ns() is a bit expensive, so we only check each some @@ -1097,3 +1126,53 @@ TargetInfo *qmp_query_target(Error **errp) return info; } + +/* Stub function that gets run on the vcpu when it's brought out of the + VM to run inside qemu via async_run_on_cpu() */ +static void mig_sleep_cpu(void *opq) +{ +qemu_mutex_unlock_iothread(); +g_usleep(30*1000); +qemu_mutex_lock_iothread(); +} + +/* To reduce the dirty rate explicitly disallow the VCPUs from spending + much time in the VM. The migration thread will try to catchup. + Workload will experience a performance drop. +*/ +static void mig_throttle_cpu_down(CPUState *cpu, void *data) +{ +async_run_on_cpu(cpu, mig_sleep_cpu, NULL); +}
Re: [Qemu-devel] [PATCH v7 2/3] Add 'auto-converge' migration capability
Oops! A minor glitch on my side (pl. ignore the subject line of this...as this is actually patch 3/3 and not patch 2/3). I just resent this as patch 3/3 with the correct subject line. Thx Vinod On 6/23/2013 1:05 PM, Chegu Vinod wrote: If a user chooses to turn on the auto-converge migration capability these changes detect the lack of convergence and throttle down the guest. i.e. force the VCPUs out of the guest for some duration and let the migration thread catchup and help converge. Verified the convergence using the following : - Java Warehouse workload running on a 20VCPU/256G guest(~80% busy) - OLTP like workload running on a 80VCPU/512G guest (~80% busy) Sample results with Java warehouse workload : (migrate speed set to 20Gb and migrate downtime set to 4seconds). (qemu) info migrate capabilities: xbzrle: off auto-converge: off Migration status: active total time: 1487503 milliseconds expected downtime: 519 milliseconds transferred ram: 383749347 kbytes remaining ram: 2753372 kbytes total ram: 268444224 kbytes duplicate: 65461532 pages skipped: 64901568 pages normal: 95750218 pages normal bytes: 383000872 kbytes dirty pages rate: 67551 pages --- (qemu) info migrate capabilities: xbzrle: off auto-converge: on Migration status: completed total time: 241161 milliseconds downtime: 6373 milliseconds transferred ram: 28235307 kbytes remaining ram: 0 kbytes total ram: 268444224 kbytes duplicate: 64946416 pages skipped: 64903523 pages normal: 7044971 pages normal bytes: 28179884 kbytes Signed-off-by: Chegu Vinod chegu_vi...@hp.com --- arch_init.c | 79 +++ 1 files changed, 79 insertions(+), 0 deletions(-) diff --git a/arch_init.c b/arch_init.c index a8b91ee..e7ca3b1 100644 --- a/arch_init.c +++ b/arch_init.c @@ -104,6 +104,9 @@ int graphic_depth = 15; #endif const uint32_t arch_type = QEMU_ARCH; +static bool mig_throttle_on; +static int dirty_rate_high_cnt; +static void check_guest_throttling(void); /***/ /* ram save/restore */ @@ -378,8 +381,14 @@ static void 
migration_bitmap_sync(void) uint64_t num_dirty_pages_init = migration_dirty_pages; MigrationState *s = migrate_get_current(); static int64_t start_time; +static int64_t bytes_xfer_prev; static int64_t num_dirty_pages_period; int64_t end_time; +int64_t bytes_xfer_now; + +if (!bytes_xfer_prev) { +bytes_xfer_prev = ram_bytes_transferred(); +} if (!start_time) { start_time = qemu_get_clock_ms(rt_clock); @@ -404,6 +413,23 @@ static void migration_bitmap_sync(void) /* more than 1 second = 1000 milliseconds */ if (end_time > start_time + 1000) { +if (migrate_auto_converge()) { +/* The following detection logic can be refined later. For now: + Check to see if the dirtied bytes is 50% more than the approx. + amount of bytes that just got transferred since the last time we + were in this routine. If that happens N times (for now N==4) + we turn on the throttle down logic */ +bytes_xfer_now = ram_bytes_transferred(); +if (s->dirty_pages_rate && + (num_dirty_pages_period * TARGET_PAGE_SIZE > + (bytes_xfer_now - bytes_xfer_prev)/2) && + (dirty_rate_high_cnt++ > 4)) { +trace_migration_throttle(); +mig_throttle_on = true; +dirty_rate_high_cnt = 0; + } + bytes_xfer_prev = bytes_xfer_now; +} s->dirty_pages_rate = num_dirty_pages_period * 1000 / (end_time - start_time); s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE; @@ -566,6 +592,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque) migration_bitmap = bitmap_new(ram_pages); bitmap_set(migration_bitmap, 0, ram_pages); migration_dirty_pages = ram_pages; +mig_throttle_on = false; +dirty_rate_high_cnt = 0; if (migrate_use_xbzrle()) { XBZRLE.cache = cache_init(migrate_xbzrle_cache_size() / @@ -628,6 +656,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque) } total_sent += bytes_sent; acct_info.iterations++; +check_guest_throttling(); /* we want to check in the 1st loop, just in case it was the 1st time and we had to sync the dirty bitmap. 
qemu_get_clock_ns() is a bit expensive, so we only check each some @@ -1097,3 +1126,53 @@ TargetInfo *qmp_query_target(Error **errp) return info; } + +/* Stub function that gets run on the vcpu when it's brought out of the + VM to run inside qemu via async_run_on_cpu() */ +static void mig_sleep_cpu(void *opq) +{ +qemu_mutex_unlock_iothread(); +g_usleep(30*1000); +qemu_mutex_lock_iothread(); +}
Re: [Qemu-devel] [PATCH v9 14/14] rdma: add pin-all accounting timestamp to QMP statistics
On 6/14/2013 1:35 PM, mrhi...@linux.vnet.ibm.com wrote: From: Michael R. Hines mrhi...@us.ibm.com For very large virtual machines, pinning can take a long time. While this does not affect the migration's *actual* time itself, it is still important for the user to know what's going on and to know what component of the total time is actually taken up by pinning. For example, using a 14GB virtual machine, pinning can take as long as 5 seconds, for which the user would not otherwise know what was happening. Reviewed-by: Paolo Bonzini pbonz...@redhat.com Signed-off-by: Michael R. Hines mrhi...@us.ibm.com --- hmp.c |4 +++ include/migration/migration.h |1 + migration-rdma.c | 55 +++-- migration.c | 13 +- qapi-schema.json |3 ++- 5 files changed, 56 insertions(+), 20 deletions(-) diff --git a/hmp.c b/hmp.c index 148a3fb..90c55f2 100644 --- a/hmp.c +++ b/hmp.c @@ -164,6 +164,10 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict) monitor_printf(mon, "downtime: %" PRIu64 " milliseconds\n", info->downtime); } +if (info->has_pin_all_time) { +monitor_printf(mon, "pin-all: %" PRIu64 " milliseconds\n", + info->pin_all_time); +} } if (info->has_ram) { diff --git a/include/migration/migration.h b/include/migration/migration.h index b49e68b..d2ca75b 100644 --- a/include/migration/migration.h +++ b/include/migration/migration.h @@ -49,6 +49,7 @@ struct MigrationState bool enabled_capabilities[MIGRATION_CAPABILITY_MAX]; int64_t xbzrle_cache_size; double mbps; +int64_t pin_all_time; }; void process_incoming_migration(QEMUFile *f); diff --git a/migration-rdma.c b/migration-rdma.c index 853de18..e407dce 100644 --- a/migration-rdma.c +++ b/migration-rdma.c @@ -699,11 +699,11 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma) return 0; } -static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma, -RDMALocalBlocks *rdma_local_ram_blocks) +static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma) { int i; -uint64_t start = qemu_get_clock_ms(rt_clock); +int64_t start = 
qemu_get_clock_ms(host_clock); +RDMALocalBlocks *rdma_local_ram_blocks = &rdma->local_ram_blocks; (void)start; for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) { @@ -721,7 +721,8 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma, rdma->total_registrations++; } -DPRINTF("lock time: %" PRIu64 "\n", qemu_get_clock_ms(rt_clock) - start); +DPRINTF("local lock time: %" PRId64 "\n", +qemu_get_clock_ms(host_clock) - start); if (i >= rdma_local_ram_blocks->num_blocks) { return 0; @@ -1262,7 +1263,8 @@ static void qemu_rdma_move_header(RDMAContext *rdma, int idx, */ static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head, uint8_t *data, RDMAControlHeader *resp, - int *resp_idx) + int *resp_idx, + int (*callback)(RDMAContext *rdma)) { int ret = 0; int idx = 0; @@ -1315,6 +1317,14 @@ static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head, * If we're expecting a response, block and wait for it. */ if (resp) { +if (callback) { +DPRINTF("Issuing callback before receiving response...\n"); +ret = callback(rdma); +if (ret < 0) { +return ret; +} +} + DDPRINTF("Waiting for response %s\n", control_desc[resp->type]); ret = qemu_rdma_exchange_get_response(rdma, resp, resp->type, idx + 1); @@ -1464,7 +1474,7 @@ static int qemu_rdma_write_one(QEMUFile *f, RDMAContext *rdma, chunk, sge.length, current_index, offset); ret = qemu_rdma_exchange_send(rdma, &head, -(uint8_t *) &comp, NULL, NULL); +(uint8_t *) &comp, NULL, NULL, NULL); if (ret < 0) { return -EIO; @@ -1487,7 +1497,7 @@ static int qemu_rdma_write_one(QEMUFile *f, RDMAContext *rdma, chunk, sge.length, current_index, offset); ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg, -&resp, &reg_result_idx); +&resp, &reg_result_idx, NULL); if (ret < 0) { return ret; } @@ -2126,7 +2136,7 @@ static int qemu_rdma_put_buffer(void *opaque, const uint8_t *buf, head.len = r->len; head.type = RDMA_CONTROL_QEMU_FILE; -ret = qemu_rdma_exchange_send(rdma, &head, data, NULL,
Re: [Qemu-devel] RDMA: please pull and re-test freezing fixes
On 6/14/2013 1:38 PM, Michael R. Hines wrote: Chegu, I sent a V9 to the mailing list: The version goes even further, by explicitly timing the pinning latency and pushing the value out to QMP so the user clearly knows which component of total migration time is consumed by pinning. If you're satisfied, I'd appreciate it if I could add your Reviewed-By: =) Pl. see below...and yes you can add me. Thanks, Vinod The migration speed was set to 40G and the downtime to 2sec for all experiments below. Note: Idle guests are not interesting due to tons of zero pages etc...but including them here to highlight the overhead of pinning. 1) 20vcpu/64GB guest: (kind of a larger sized Cloud-type guest) a) Idle guest with No pinning (default): capabilities: xbzrle: off x-rdma-pin-all: off Migration status: completed total time: 51062 milliseconds downtime: 1948 milliseconds pin-all: 0 milliseconds transferred ram: 1816547 kbytes throughput: 6872.23 mbps remaining ram: 0 kbytes total ram: 67117632 kbytes duplicate: 16331552 pages skipped: 0 pages normal: 450038 pages normal bytes: 1800152 kbytes b) Idle guest with Pinning: capabilities: xbzrle: off x-rdma-pin-all: on Migration status: completed total time: 47451 milliseconds downtime: 2639 milliseconds pin-all: 22780 milliseconds transferred ram: 67136643 kbytes throughput: 25222.91 mbps remaining ram: 0 kbytes total ram: 67117632 kbytes duplicate: 0 pages skipped: 0 pages normal: 16780064 pages normal bytes: 67120256 kbytes There were no freezes observed in the guest at the start of the migration, but the qemu monitor prompt was not responsive for the duration of the memory pinning. Total migration time was affected by the cost of pinning at the start of the migration as shown above. (This issue can be pursued and optimized later.) 
c) Pinning + guest running a Java warehouse workload (I cranked the workload up to keep the guest 95+% busy) capabilities: xbzrle: off x-rdma-pin-all: on Migration status: active total time: 412706 milliseconds expected downtime: 499 milliseconds pin-all: 22758 milliseconds transferred ram: 657243669 kbytes throughput: 25241.89 mbps remaining ram: 7281848 kbytes total ram: 67117632 kbytes duplicate: 0 pages skipped: 0 pages normal: 164270810 pages normal bytes: 657083240 kbytes dirty pages rate: 369925 pages No Convergence! (For workloads where the memory dirty rate is very high there are other alternatives that have been discussed in the past...) --- Enterprise-type guests tend to get fatter (more memory per cpu) than the larger Cloud guests...so here are a couple of them. a) 20VCPU/256G Idle guest: Default: capabilities: xbzrle: off x-rdma-pin-all: off Migration status: completed total time: 259259 milliseconds downtime: 3924 milliseconds pin-all: 0 milliseconds transferred ram: 5522078 kbytes throughput: 6586.06 mbps remaining ram: 0 kbytes total ram: 268444224 kbytes duplicate: 65755168 pages skipped: 0 pages normal: 1364124 pages normal bytes: 5456496 kbytes Pinned: capabilities: xbzrle: off x-rdma-pin-all: on Migration status: completed total time: 219053 milliseconds downtime: 4277 milliseconds pin-all: 118153 milliseconds transferred ram: 268512809 kbytes throughput: 22209.32 mbps remaining ram: 0 kbytes total ram: 268444224 kbytes duplicate: 0 pages skipped: 0 pages normal: 67111817 pages normal bytes: 268447268 kbytes b) 40VCPU/512GB Idle guest: Default: capabilities: xbzrle: off x-rdma-pin-all: off Migration status: completed total time: 670577 milliseconds downtime: 6139 milliseconds pin-all: 0 milliseconds transferred ram: 10279256 kbytes throughput: 6150.93 mbps remaining ram: 0 kbytes total ram: 536879680 kbytes duplicate: 131704099 pages skipped: 0 pages normal: 2537017 pages normal bytes: 10148068 kbytes Pinned: capabilities: xbzrle: off 
x-rdma-pin-all: on Migration status: completed total time: 527576 milliseconds downtime: 6314 milliseconds pin-all: 312984 milliseconds transferred ram: 537129685 kbytes throughput: 20177.27 mbps remaining ram: 0 kbytes total ram: 536879680 kbytes duplicate: 0 pages skipped: 0 pages normal: 134249644 pages normal bytes: 536998576 kbytes No freezes in the guest due to memory pinning. (Freezes were only due to the dirty bitmap synchup stuff which is being done while BQL is held. Juan is working on addresing already for qemu 1.6)
Re: [Qemu-devel] [PATCH v9 00/14] rdma: migration support
On 6/14/2013 1:35 PM, mrhi...@linux.vnet.ibm.com wrote: From: Michael R. Hines mrhi...@us.ibm.com Changes since v8: For very large virtual machines, pinning can take a long time. While this does not affect the migration's *actual* time itself, it is still important for the user to know what's going on and to know what component of the total time is actually taken up by pinning. For example, using a 14GB virtual machine, pinning can take as long as 5 seconds, for which the user would not otherwise know what was happening. Reviewed-by: Eric Blake ebl...@redhat.com Reviewed-by: Paolo Bonzini pbonz...@redhat.com Reviewed-by: Chegu Vinod chegu_vi...@hp.com Tested-by: Chegu Vinod chegu_vi...@hp.com Thx Vinod Wiki: http://wiki.qemu.org/Features/RDMALiveMigration Github: g...@github.com:hinesmr/qemu.git Here is a brief summary of total migration time and downtime using RDMA: Using a 40gbps infiniband link performing a worst-case stress test, using an 8GB RAM virtual machine: Using the following command: $ apt-get install stress $ stress --vm-bytes 7500M --vm 1 --vm-keep RESULTS: 1. Migration throughput: 26 gigabits/second. 2. Downtime (stop time) varies between 15 and 100 milliseconds. EFFECTS of memory registration on bulk phase round: For example, in the same 8GB RAM example with all 8GB of memory in active use and the VM itself is completely idle using the same 40 gbps infiniband link: 1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps 2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps These numbers would of course scale up to whatever size virtual machine you have to migrate using RDMA. Enabling this feature does *not* have any measurable effect on migration *downtime*. This is because, without this feature, all of the memory will already have been registered in advance during the bulk round and does not need to be re-registered during the successive iteration rounds. 
The following changes since commit f3aa844bbb2922a5b8393d17620eca7d7e921ab3: build: include config-{, all-}devices.mak after defining CONFIG_SOFTMMU and CONFIG_USER_ONLY (2013-04-24 12:18:41 -0500) are available in the git repository at: g...@github.com:hinesmr/qemu.git rdma_patch_v9 for you to fetch changes up to 75e6fac1f642885b93cefe6e1874d648e9850f8f: rdma: send pc.ram (2013-04-24 14:55:01 -0400) Michael R. Hines (14): rdma: add documentation rdma: introduce qemu_update_position() rdma: export yield_until_fd_readable() rdma: export throughput w/ MigrationStats QMP rdma: introduce qemu_file_mode_is_not_valid() rdma: export qemu_fflush() rdma: introduce ram_handle_compressed() rdma: introduce qemu_ram_foreach_block() rdma: new QEMUFileOps hooks rdma: introduce capability x-rdma-pin-all rdma: core logic rdma: send pc.ram rdma: fix mlock() freezes and accounting rdma: add pin-all accounting timestamp to QMP statistics Makefile.objs |1 + arch_init.c | 69 +- configure | 29 + docs/rdma.txt | 415 ++ exec.c|9 + hmp.c |6 + include/block/coroutine.h |6 + include/exec/cpu-common.h |5 + include/migration/migration.h | 32 + include/migration/qemu-file.h | 32 + migration-rdma.c | 2831 + migration.c | 36 +- qapi-schema.json | 15 +- qemu-coroutine-io.c | 23 + savevm.c | 114 +- 15 files changed, 3574 insertions(+), 49 deletions(-) create mode 100644 docs/rdma.txt create mode 100644 migration-rdma.c
[Qemu-devel] [RFC PATCH v6 1/3] Introduce async_run_on_cpu()
Introduce an asynchronous version of run_on_cpu(), i.e. the caller doesn't have to block till the callback routine finishes execution on the target vcpu. Signed-off-by: Chegu Vinod chegu_vi...@hp.com Reviewed-by: Paolo Bonzini pbonz...@redhat.com --- cpus.c| 29 + include/qemu-common.h |1 + include/qom/cpu.h | 10 ++ 3 files changed, 40 insertions(+), 0 deletions(-) diff --git a/cpus.c b/cpus.c index c232265..8cd4eab 100644 --- a/cpus.c +++ b/cpus.c @@ -653,6 +653,7 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data) wi.func = func; wi.data = data; +wi.free = false; if (cpu->queued_work_first == NULL) { cpu->queued_work_first = &wi; } else { @@ -671,6 +672,31 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data) } } +void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data) +{ +struct qemu_work_item *wi; + +if (qemu_cpu_is_self(cpu)) { +func(data); +return; +} + +wi = g_malloc0(sizeof(struct qemu_work_item)); +wi->func = func; +wi->data = data; +wi->free = true; +if (cpu->queued_work_first == NULL) { +cpu->queued_work_first = wi; +} else { +cpu->queued_work_last->next = wi; +} +cpu->queued_work_last = wi; +wi->next = NULL; +wi->done = false; + +qemu_cpu_kick(cpu); +} + static void flush_queued_work(CPUState *cpu) { struct qemu_work_item *wi; @@ -683,6 +709,9 @@ static void flush_queued_work(CPUState *cpu) cpu->queued_work_first = wi->next; wi->func(wi->data); wi->done = true; +if (wi->free) { +g_free(wi); +} } cpu->queued_work_last = NULL; qemu_cond_broadcast(&qemu_work_cond); diff --git a/include/qemu-common.h b/include/qemu-common.h index ed8b6e2..ac0ed38 100644 --- a/include/qemu-common.h +++ b/include/qemu-common.h @@ -302,6 +302,7 @@ struct qemu_work_item { void (*func)(void *data); void *data; int done; +bool free; }; #ifdef CONFIG_USER_ONLY diff --git a/include/qom/cpu.h b/include/qom/cpu.h index 7cd9442..46465e9 100644 --- a/include/qom/cpu.h +++ b/include/qom/cpu.h @@ -265,6 +265,16 @@ bool cpu_is_stopped(CPUState *cpu); 
void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data); /** + * async_run_on_cpu: + * @cpu: The vCPU to run on. + * @func: The function to be executed. + * @data: Data to pass to the function. + * + * Schedules the function @func for execution on the vCPU @cpu asynchronously. + */ +void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data); + +/** * qemu_for_each_cpu: * @func: The function to be executed. * @data: Data to pass to the function. -- 1.7.1
[Qemu-devel] [RFC PATCH v6 3/3] Force auto-convergence of live migration
If a user chooses to turn on the auto-converge migration capability these changes detect the lack of convergence and throttle down the guest. i.e. force the VCPUs out of the guest for some duration and let the migration thread catchup and help converge. Verified the convergence using the following : - Java Warehouse workload running on a 20VCPU/256G guest(~80% busy) - OLTP like workload running on a 80VCPU/512G guest (~80% busy) Sample results with Java warehouse workload : (migrate speed set to 20Gb and migrate downtime set to 4seconds). (qemu) info migrate capabilities: xbzrle: off auto-converge: off Migration status: active total time: 1487503 milliseconds expected downtime: 519 milliseconds transferred ram: 383749347 kbytes remaining ram: 2753372 kbytes total ram: 268444224 kbytes duplicate: 65461532 pages skipped: 64901568 pages normal: 95750218 pages normal bytes: 383000872 kbytes dirty pages rate: 67551 pages --- (qemu) info migrate capabilities: xbzrle: off auto-converge: on Migration status: completed total time: 241161 milliseconds downtime: 6373 milliseconds transferred ram: 28235307 kbytes remaining ram: 0 kbytes total ram: 268444224 kbytes duplicate: 64946416 pages skipped: 64903523 pages normal: 7044971 pages normal bytes: 28179884 kbytes Signed-off-by: Chegu Vinod chegu_vi...@hp.com --- arch_init.c | 85 +++ 1 files changed, 85 insertions(+), 0 deletions(-) diff --git a/arch_init.c b/arch_init.c index 5d32ecf..69c6c8c 100644 --- a/arch_init.c +++ b/arch_init.c @@ -104,6 +104,8 @@ int graphic_depth = 15; #endif const uint32_t arch_type = QEMU_ARCH; +static bool mig_throttle_on; +static void throttle_down_guest_to_converge(void); /***/ /* ram save/restore */ @@ -378,8 +380,15 @@ static void migration_bitmap_sync(void) uint64_t num_dirty_pages_init = migration_dirty_pages; MigrationState *s = migrate_get_current(); static int64_t start_time; +static int64_t bytes_xfer_prev; static int64_t num_dirty_pages_period; int64_t end_time; +int64_t bytes_xfer_now; 
+static int dirty_rate_high_cnt; + +if (!bytes_xfer_prev) { +bytes_xfer_prev = ram_bytes_transferred(); +} if (!start_time) { start_time = qemu_get_clock_ms(rt_clock); @@ -404,6 +413,23 @@ static void migration_bitmap_sync(void) /* more than 1 second = 1000 milliseconds */ if (end_time > start_time + 1000) { +if (migrate_auto_converge()) { +/* The following detection logic can be refined later. For now: + Check to see if the dirtied bytes are 50% more than the approx. + amount of bytes that just got transferred since the last time we + were in this routine. If that happens N times (for now N==4) + we turn on the throttle down logic */ +bytes_xfer_now = ram_bytes_transferred(); +if (s->dirty_pages_rate && +((num_dirty_pages_period * TARGET_PAGE_SIZE) > +((bytes_xfer_now - bytes_xfer_prev)/2))) { +if (dirty_rate_high_cnt++ > 4) { +DPRINTF("Unable to converge. Throttling down guest\n"); +mig_throttle_on = true; +} + } + bytes_xfer_prev = bytes_xfer_now; +} s->dirty_pages_rate = num_dirty_pages_period * 1000 / (end_time - start_time); s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE; @@ -628,6 +654,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque) } total_sent += bytes_sent; acct_info.iterations++; +throttle_down_guest_to_converge(); /* we want to check in the 1st loop, just in case it was the 1st time and we had to sync the dirty bitmap. qemu_get_clock_ns() is a bit expensive, so we only check each some @@ -1098,3 +1125,61 @@ TargetInfo *qmp_query_target(Error **errp) return info; } + +static bool throttling_needed(void) +{ +if (!migrate_auto_converge()) { +return false; +} +return mig_throttle_on; +} + +/* Stub function that gets run on the vcpu when it's brought out of the + VM to run inside qemu via async_run_on_cpu() */ +static void mig_sleep_cpu(void *opq) +{ +qemu_mutex_unlock_iothread(); +g_usleep(30*1000); +qemu_mutex_lock_iothread(); +} + +/* To reduce the dirty rate explicitly disallow the VCPUs from spending + much time in the VM. 
The migration thread will try to catch up. + Workload will experience a performance drop. +*/ +static void mig_throttle_cpu_down(CPUState *cpu, void *data) +{ +async_run_on_cpu(cpu, mig_sleep_cpu, NULL); +} + +static void mig_throttle_guest_down(void) +{ +if (throttling_needed()) { +qemu_mutex_lock_iothread(); +qemu_for_each_cpu(mig_throttle_cpu_down, NULL
[Qemu-devel] [RFC PATCH v6 0/3] Throttle-down guest to help with live migration convergence
Busy enterprise workloads hosted on large-sized VMs tend to dirty memory faster than the transfer rate achieved via live guest migration. Despite some good recent improvements (using dedicated 10Gig NICs between hosts) the live migration does NOT converge. If a user chooses to force convergence of their migration via a new migration capability auto-converge then this change will auto-detect the lack-of-convergence scenario and trigger a slowdown of the workload by explicitly disallowing the VCPUs from spending much time in the VM context. The migration thread tries to catch up and this eventually leads to convergence in some deterministic amount of time. Yes, it does impact the performance of all the VCPUs, but in our observation that lasts only for a short duration of time, i.e. the migration ends up entering stage 3 (the downtime phase) soon after that. No external monitoring/triggers are required. Thanks to Juan and Paolo for their useful suggestions. --- Changes from v5: - incorporated feedback from Paolo & Igor. - rebased to latest qemu.git Changes from v4: - incorporated feedback from Paolo. - split into 3 patches. Changes from v3: - incorporated feedback from Paolo and Eric - rebased to latest qemu.git Changes from v2: - incorporated feedback from Orit, Juan and Eric - stop the throttling thread at the start of stage 3 - rebased to latest qemu.git Changes from v1: - rebased to latest qemu.git - added auto-converge capability(default off) - suggested by Anthony Liguori & Eric Blake. Signed-off-by: Chegu Vinod chegu_vi...@hp.com --- Chegu Vinod (3): Introduce async_run_on_cpu() Add 'auto-converge' migration capability Force auto-convergence of live migration arch_init.c | 85 + cpus.c| 29 ++ include/migration/migration.h |2 + include/qemu-common.h |1 + include/qom/cpu.h | 10 + migration.c |9 qapi-schema.json |5 ++- 7 files changed, 140 insertions(+), 1 deletions(-)
[Qemu-devel] [RFC PATCH v6 2/3] Add 'auto-converge' migration capability
The auto-converge migration capability allows the user to specify whether they want the live migration sequence to automatically detect and force convergence. Signed-off-by: Chegu Vinod chegu_vi...@hp.com Reviewed-by: Paolo Bonzini pbonz...@redhat.com Reviewed-by: Eric Blake ebl...@redhat.com --- include/migration/migration.h |2 ++ migration.c |9 + qapi-schema.json |5 - 3 files changed, 15 insertions(+), 1 deletions(-) diff --git a/include/migration/migration.h b/include/migration/migration.h index e2acec6..ace91b0 100644 --- a/include/migration/migration.h +++ b/include/migration/migration.h @@ -127,4 +127,6 @@ int migrate_use_xbzrle(void); int64_t migrate_xbzrle_cache_size(void); int64_t xbzrle_cache_resize(int64_t new_size); + +bool migrate_auto_converge(void); #endif diff --git a/migration.c b/migration.c index 058f9e6..d0759c1 100644 --- a/migration.c +++ b/migration.c @@ -473,6 +473,15 @@ void qmp_migrate_set_downtime(double value, Error **errp) max_downtime = (uint64_t)value; } +bool migrate_auto_converge(void) +{ +MigrationState *s; + +s = migrate_get_current(); + +return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE]; +} + int migrate_use_xbzrle(void) { MigrationState *s; diff --git a/qapi-schema.json b/qapi-schema.json index 5ad6894..882a7fd 100644 --- a/qapi-schema.json +++ b/qapi-schema.json @@ -605,10 +605,13 @@ # This feature allows us to minimize migration traffic for certain work # loads, by sending compressed difference of the pages # +# @auto-converge: If enabled, QEMU will automatically throttle down the guest +# to speed up convergence of RAM migration. (since 1.6) +# # Since: 1.2 ## { 'enum': 'MigrationCapability', - 'data': ['xbzrle'] } + 'data': ['xbzrle', 'auto-converge'] } ## # @MigrationCapabilityStatus -- 1.7.1
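Since the capability defaults to off, a user has to switch it on before starting the migration. A minimal usage sketch, via the HMP monitor or the equivalent QMP command (exact output formatting may differ by QEMU version):

```
(qemu) migrate_set_capability auto-converge on

{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities": [
      { "capability": "auto-converge", "state": true } ] } }
```

After that, `info migrate` on the monitor reports `auto-converge: on`, as shown in the sample results quoted elsewhere in this thread.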
Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
On 6/1/2013 9:09 PM, Michael R. Hines wrote: All, I have successfully performed over 1000+ back-to-back RDMA migrations automatically looped *in a row* using a heavy-weight memory-stress benchmark here at IBM. Migration success is done by capturing the actual serial console output of the virtual machine while the benchmark is running and redirecting each migration output to a file to verify that the output matches the expected output of a successful migration. For half of the 1000 migrations, I used a 14GB virtual machine size (largest VM I can create) and the remaining 500 migrations I used a 2GB virtual machine (to make sure I was testing both 32-bit and 64-bit address boundaries). The benchmark is configured to have 75% stores and 25% loads and is configured to use 80% of the allocatable free memory of the VM (i.e. no swapping allowed). I have defined a successful migration per the output file as follows: 1. The memory benchmark is still running and active (CPU near 100% and memory usage is high) 2. There are no kernel panics in the console output (regex keywords panic, BUG, oom, etc...) 3. The VM is still responding to network activity (pings) 4. The console is still responsive by printing periodic messages throughout the life of the VM to the console from inside the VM using the 'write' command in infinite loop. With this method in a loop, I believe I've ironed out all the regression-testing bugs that I can find. You all may find the following bugs interesting. The original version of this patch was written in 2010 (Before my time @ IBM). Bug #1: In the original 2010 patch, each write operation uses the same identifier. (A Work Request ID in infiniband terminology). This is not typical (but allowed by the hardware) - and instead each operation should have its own unique identifier so that the write operation can be tracked properly as it completes. 
Bug #2: Also in the original 2010 patch, write operations were grouped into separate signaled and unsignaled work requests, which is also not typical (but allowed by the hardware). Signalling is infiniband terminology which means to activate/deactivate notifying the sender whether or not the RDMA operation has already completed. (Note: the receiver is never notified - which is what a DMA is supposed to be). In normal operation per infiniband specifications, unsignaled operations (which indicate to the hardware *not* to notify the sender of completion) are *supposed* to be paired simultaneously with a signaled operation using the *same* work request identifier. Instead, the original patch was using *different* work requests for signaled/unsignaled writes, which means that most of the writes would be transmitted without ever being tracked for completion whatsoever. (Per infiniband specifications, signaled and unsignaled writes must be grouped together because the hardware ensures that completion notification is not given until *all* of the writes of the same request have actually completed). Bug #3: Finally, in the original 2010 patch, ordering was not being handled. Per infiniband specifications, writes can happen completely out of order. Not only that, but PCI-express itself can change the order of the writes as well. It was only after the first 2 bugs were fixed that I could actually manifest this bug *in code*: What was happening was that a very large group of requests would burst from the QEMU migration thread. At which point, not all of the requests would finish. Then a short time later, the next iteration would start and the virtual machine's writable working set was still hovering somewhere in the same vicinity of the address space as the previous burst of writes that had not yet completed. When this happened, the new writes were much smaller (not a part of a larger chunk per our algorithms). 
Since the new writes were smaller they would complete faster than the larger, older writes in the same address range. Since they complete out of order, the newer writes would then get clobbered by the older writes - resulting in an inconsistent virtual machine. So, to solve this: during each new write, we now do a search to see if the address of the next requested write matches or overlaps with the address range of any of the previous outstanding writes that were still in transit, and I found several hits. This was easily solved by blocking until the conflicting write has completed before proceeding to issue a new write to the hardware. - Michael Hi Michael, Got some limited time on the systems so gave your latest bits a quick try today (with the default no pinning) and it seems to be better than before. Ran a Java warehouse workload where the guest was 85-90% busy... For both cases (qemu) migrate_set_speed 40G (qemu) migrate_set_downtime 2 (qemu) migrate -d x-rdma:ip:port ... 20VCPU/256G guest (qemu) info migrate capabilities: xbzrle: off x-rdma-pin-all: off Migration
[Qemu-devel] OS crash while attempting to boot a 1TB guest.
Hello, For guest sizes >= 1TB RAM the guest OS is unable to boot up (please see attached GIF file for the Oops message). Wonder if this is a bug/regression in qemu/seabios or does one have to enable/disable something else in the qemu command line (pl. see below) ? Thanks Vinod Host and Guest OS : 3.10-rc2 (from kvm.git) and qemu 1.5.5 (from qemu.git as of May 28th) The qemu command line : /usr/local/bin/qemu-system-x86_64 \ -enable-kvm \ -cpu host \ -name vm1 \ -m 1048576 -smp 80,sockets=80,cores=1,threads=1 \ -mem-path /dev/hugepages \ -no-hpet \ -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \ -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown \ -drive file=/var/lib/libvirt/images/vm1/vm1.img,if=none,id=drive-virtio-disk0,format=raw,cache=none \ -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \ -monitor stdio \ -net nic,model=virtio,macaddr=...,netdev=nic-0 \ -netdev tap,id=nic-0,ifname=tap0,script=no,downscript=no,vhost=on \ -vnc :4 attachment: guest_panic2.GIF
Re: [Qemu-devel] [PATCH 3/2] vfio: Provide module option to disable vfio_iommu_type1 hugepage support
On 5/28/2013 9:27 AM, Alex Williamson wrote: Add a module option to vfio_iommu_type1 to disable IOMMU hugepage support. This causes iommu_map to only be called with single page mappings, disabling the IOMMU driver's ability to use hugepages. This option can be enabled by loading vfio_iommu_type1 with disable_hugepages=1 or dynamically through sysfs. If enabled dynamically, only new mappings are restricted. Signed-off-by: Alex Williamson alex.william...@redhat.com --- As suggested by Konrad. This is cleaner to add as a follow-on drivers/vfio/vfio_iommu_type1.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 6654a7e..8a2be4e 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -48,6 +48,12 @@ module_param_named(allow_unsafe_interrupts, MODULE_PARM_DESC(allow_unsafe_interrupts, "Enable VFIO IOMMU support for on platforms without interrupt remapping support."); +static bool disable_hugepages; +module_param_named(disable_hugepages, + disable_hugepages, bool, S_IRUGO | S_IWUSR); +MODULE_PARM_DESC(disable_hugepages, +"Disable VFIO IOMMU support for IOMMU hugepages."); + struct vfio_iommu { struct iommu_domain *domain; struct mutex lock; @@ -270,6 +276,11 @@ static long vfio_pin_pages(unsigned long vaddr, long npage, return -ENOMEM; } + if (unlikely(disable_hugepages)) { + vfio_lock_acct(1); + return 1; + } + /* Lock all the consecutive pages from pfn_base */ for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) { unsigned long pfn = 0; . Tested-by: Chegu Vinod chegu_vi...@hp.com I was able to verify your changes on a 2 Sandybridge-EP socket platform and observed about ~7-8% improvement in the netperf's TCP_RR performance. The guest size was small (16vcpu/32GB). Hopefully these changes also have an indirect benefit of avoiding soft lockups on the host side when larger guests ( > 256GB ) are rebooted. 
Someone who has ready access to a larger Sandybridge-EP/EX platform can verify this. FYI Vinod
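Following the patch description above, the new option can be set either at load time or toggled at runtime through sysfs (the sysfs path below assumes the standard module-parameters layout; verify it on your kernel):

```
modprobe vfio_iommu_type1 disable_hugepages=1

# or, with the module already loaded, toggle dynamically;
# per the patch, only new mappings are affected:
echo 1 > /sys/module/vfio_iommu_type1/parameters/disable_hugepages
```

The S_IWUSR permission in the module_param_named() call is what makes the runtime toggle possible.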
Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration
On 5/10/2013 6:07 AM, Anthony Liguori wrote: Chegu Vinod chegu_vi...@hp.com writes: If a user chooses to turn on the auto-converge migration capability these changes detect the lack of convergence and throttle down the guest. i.e. force the VCPUs out of the guest for some duration and let the migration thread catch up and help converge. Verified the convergence using the following : - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy) - OLTP like workload running on a 80VCPU/512G guest (~80% busy) Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and migrate downtime set to 4seconds). Would it make sense to separate out the "slow the VCPU down" part of this? That would give a management tool more flexibility to create policies around slowing the VCPU down to encourage migration. I believe one can always enhance libvirt tools to monitor the migration statistics and control the shares/entitlements of the vcpus via cgroups, thereby slowing the guest down to allow for convergence (I had that listed in my earlier versions of the patches as an option and also noted that it requires external (i.e. tool driven) monitoring and triggers...and that this alternative was kind of automatic after the initial setting of the capability). Is that what you meant by your comment above (or) are you talking about something outside the scope of cgroups and from an implementation point of view also outside the migration code path...i.e. a new knob that an external tool can use to just throttle down the vcpus of a guest ? Thanks Vinod In fact, I wonder if we need anything in the migration path if we just expose the "slow the VCPU down" bit as a feature. Slowing the VCPU down is not quite the same as setting priority of the VCPU thread largely because of the QBL so I recognize the need to have something for this in QEMU. 
Regards, Anthony Liguori (qemu) info migrate capabilities: xbzrle: off auto-converge: off Migration status: active total time: 1487503 milliseconds expected downtime: 519 milliseconds transferred ram: 383749347 kbytes remaining ram: 2753372 kbytes total ram: 268444224 kbytes duplicate: 65461532 pages skipped: 64901568 pages normal: 95750218 pages normal bytes: 383000872 kbytes dirty pages rate: 67551 pages --- (qemu) info migrate capabilities: xbzrle: off auto-converge: on Migration status: completed total time: 241161 milliseconds downtime: 6373 milliseconds transferred ram: 28235307 kbytes remaining ram: 0 kbytes total ram: 268444224 kbytes duplicate: 64946416 pages skipped: 64903523 pages normal: 7044971 pages normal bytes: 28179884 kbytes Signed-off-by: Chegu Vinod chegu_vi...@hp.com --- arch_init.c | 68 + include/migration/migration.h |4 ++ migration.c |1 + 3 files changed, 73 insertions(+), 0 deletions(-) diff --git a/arch_init.c b/arch_init.c index 49c5dc2..29788d6 100644 --- a/arch_init.c +++ b/arch_init.c @@ -49,6 +49,7 @@ #include trace.h #include exec/cpu-all.h #include hw/acpi/acpi.h +#include sysemu/cpus.h #ifdef DEBUG_ARCH_INIT #define DPRINTF(fmt, ...) 
\ @@ -104,6 +105,8 @@ int graphic_depth = 15; #endif const uint32_t arch_type = QEMU_ARCH; +static bool mig_throttle_on; + /***/ /* ram save/restore */ @@ -378,8 +381,15 @@ static void migration_bitmap_sync(void) uint64_t num_dirty_pages_init = migration_dirty_pages; MigrationState *s = migrate_get_current(); static int64_t start_time; +static int64_t bytes_xfer_prev; static int64_t num_dirty_pages_period; int64_t end_time; +int64_t bytes_xfer_now; +static int dirty_rate_high_cnt; + +if (!bytes_xfer_prev) { +bytes_xfer_prev = ram_bytes_transferred(); +} if (!start_time) { start_time = qemu_get_clock_ms(rt_clock); @@ -404,6 +414,23 @@ static void migration_bitmap_sync(void) /* more than 1 second = 1000 milliseconds */ if (end_time > start_time + 1000) { +if (migrate_auto_converge()) { +/* The following detection logic can be refined later. For now: + Check to see if the dirtied bytes are 50% more than the approx. + amount of bytes that just got transferred since the last time we + were in this routine. If that happens N times (for now N==5) + we turn on the throttle down logic */ +bytes_xfer_now = ram_bytes_transferred(); +if (s->dirty_pages_rate && +((num_dirty_pages_period * TARGET_PAGE_SIZE) > +((bytes_xfer_now - bytes_xfer_prev)/2))) { +if (dirty_rate_high_cnt++ > 5) { +DPRINTF("Unable to converge. Throttling down guest\n"); +mig_throttle_on = true
[Qemu-devel] [RFC PATCH v5 2/3] Add 'auto-converge' migration capability
The auto-converge migration capability allows the user to specify whether they want the live migration sequence to automatically detect and force convergence. Signed-off-by: Chegu Vinod chegu_vi...@hp.com --- include/migration/migration.h |2 ++ migration.c |9 + qapi-schema.json |5 - 3 files changed, 15 insertions(+), 1 deletions(-) diff --git a/include/migration/migration.h b/include/migration/migration.h index e2acec6..ace91b0 100644 --- a/include/migration/migration.h +++ b/include/migration/migration.h @@ -127,4 +127,6 @@ int migrate_use_xbzrle(void); int64_t migrate_xbzrle_cache_size(void); int64_t xbzrle_cache_resize(int64_t new_size); + +bool migrate_auto_converge(void); #endif diff --git a/migration.c b/migration.c index 3eb0fad..570cee5 100644 --- a/migration.c +++ b/migration.c @@ -474,6 +474,15 @@ void qmp_migrate_set_downtime(double value, Error **errp) max_downtime = (uint64_t)value; } +bool migrate_auto_converge(void) +{ +MigrationState *s; + +s = migrate_get_current(); + +return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE]; +} + int migrate_use_xbzrle(void) { MigrationState *s; diff --git a/qapi-schema.json b/qapi-schema.json index 199744a..b33839c 100644 --- a/qapi-schema.json +++ b/qapi-schema.json @@ -602,10 +602,13 @@ # This feature allows us to minimize migration traffic for certain work # loads, by sending compressed difference of the pages # +# @auto-converge: Migration supports automatic throttling down of guest +# to force convergence. (since 1.6) +# # Since: 1.2 ## { 'enum': 'MigrationCapability', - 'data': ['xbzrle'] } + 'data': ['xbzrle', 'auto-converge'] } ## # @MigrationCapabilityStatus -- 1.7.1
[Qemu-devel] [RFC PATCH v5 0/3] Throttle-down guest to help with live migration convergence.
Busy enterprise workloads hosted on large-sized VMs tend to dirty memory faster than the transfer rate achieved via live guest migration. Despite some good recent improvements (using dedicated 10Gig NICs between hosts) the live migration does NOT converge. If a user chooses to force convergence of their migration via a new migration capability auto-converge then this change will auto-detect the lack-of-convergence scenario and trigger a slowdown of the workload by explicitly disallowing the VCPUs from spending much time in the VM context. The migration thread tries to catch up and this eventually leads to convergence in some deterministic amount of time. Yes, it does impact the performance of all the VCPUs, but in my observation that lasts only for a short duration of time, i.e. the migration ends up entering stage 3 (the downtime phase) soon after that. No external trigger is required. Thanks to Juan and Paolo for their useful suggestions. --- Changes from v4: - incorporated feedback from Paolo. - split into 3 patches. Changes from v3: - incorporated feedback from Paolo and Eric - rebased to latest qemu.git Changes from v2: - incorporated feedback from Orit, Juan and Eric - stop the throttling thread at the start of stage 3 - rebased to latest qemu.git Changes from v1: - rebased to latest qemu.git - added auto-converge capability(default off) - suggested by Anthony Liguori & Eric Blake. Signed-off-by: Chegu Vinod chegu_vi...@hp.com Chegu Vinod (3): Introduce async_run_on_cpu() Add 'auto-converge' migration capability Force auto-convergence of live migration arch_init.c | 68 + cpus.c| 29 + include/migration/migration.h |6 +++ include/qemu-common.h |1 + include/qom/cpu.h | 10 ++ migration.c | 10 ++ qapi-schema.json |5 ++- 7 files changed, 128 insertions(+), 1 deletions(-)
[Qemu-devel] [RFC PATCH v5 1/3] Introduce async_run_on_cpu()
Introduce an asynchronous version of run_on_cpu() i.e. the caller doesn't have to block till the callback routine finishes execution on the target vcpu. Signed-off-by: Chegu Vinod chegu_vi...@hp.com --- cpus.c| 29 + include/qemu-common.h |1 + include/qom/cpu.h | 10 ++ 3 files changed, 40 insertions(+), 0 deletions(-) diff --git a/cpus.c b/cpus.c index c232265..8cd4eab 100644 --- a/cpus.c +++ b/cpus.c @@ -653,6 +653,7 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data) wi.func = func; wi.data = data; +wi.free = false; if (cpu->queued_work_first == NULL) { cpu->queued_work_first = &wi; } else { @@ -671,6 +672,31 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data) } } +void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data) +{ +struct qemu_work_item *wi; + +if (qemu_cpu_is_self(cpu)) { +func(data); +return; +} + +wi = g_malloc0(sizeof(struct qemu_work_item)); +wi->func = func; +wi->data = data; +wi->free = true; +if (cpu->queued_work_first == NULL) { +cpu->queued_work_first = wi; +} else { +cpu->queued_work_last->next = wi; +} +cpu->queued_work_last = wi; +wi->next = NULL; +wi->done = false; + +qemu_cpu_kick(cpu); +} + static void flush_queued_work(CPUState *cpu) { struct qemu_work_item *wi; @@ -683,6 +709,9 @@ static void flush_queued_work(CPUState *cpu) cpu->queued_work_first = wi->next; wi->func(wi->data); wi->done = true; +if (wi->free) { +g_free(wi); +} } cpu->queued_work_last = NULL; qemu_cond_broadcast(qemu_work_cond); diff --git a/include/qemu-common.h b/include/qemu-common.h index b399d85..bad6e1f 100644 --- a/include/qemu-common.h +++ b/include/qemu-common.h @@ -286,6 +286,7 @@ struct qemu_work_item { void (*func)(void *data); void *data; int done; +bool free; }; #ifdef CONFIG_USER_ONLY diff --git a/include/qom/cpu.h b/include/qom/cpu.h index 7cd9442..46465e9 100644 --- a/include/qom/cpu.h +++ b/include/qom/cpu.h @@ -265,6 +265,16 @@ bool cpu_is_stopped(CPUState *cpu); void run_on_cpu(CPUState *cpu, void (*func)(void 
*data), void *data); /** + * async_run_on_cpu: + * @cpu: The vCPU to run on. + * @func: The function to be executed. + * @data: Data to pass to the function. + * + * Schedules the function @func for execution on the vCPU @cpu asynchronously. + */ +void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data); + +/** * qemu_for_each_cpu: * @func: The function to be executed. * @data: Data to pass to the function. -- 1.7.1
[Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration
If a user chooses to turn on the auto-converge migration capability these changes detect the lack of convergence and throttle down the guest. i.e. force the VCPUs out of the guest for some duration and let the migration thread catchup and help converge. Verified the convergence using the following : - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy) - OLTP like workload running on a 80VCPU/512G guest (~80% busy) Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and migrate downtime set to 4seconds). (qemu) info migrate capabilities: xbzrle: off auto-converge: off Migration status: active total time: 1487503 milliseconds expected downtime: 519 milliseconds transferred ram: 383749347 kbytes remaining ram: 2753372 kbytes total ram: 268444224 kbytes duplicate: 65461532 pages skipped: 64901568 pages normal: 95750218 pages normal bytes: 383000872 kbytes dirty pages rate: 67551 pages --- (qemu) info migrate capabilities: xbzrle: off auto-converge: on Migration status: completed total time: 241161 milliseconds downtime: 6373 milliseconds transferred ram: 28235307 kbytes remaining ram: 0 kbytes total ram: 268444224 kbytes duplicate: 64946416 pages skipped: 64903523 pages normal: 7044971 pages normal bytes: 28179884 kbytes Signed-off-by: Chegu Vinod chegu_vi...@hp.com --- arch_init.c | 68 + include/migration/migration.h |4 ++ migration.c |1 + 3 files changed, 73 insertions(+), 0 deletions(-) diff --git a/arch_init.c b/arch_init.c index 49c5dc2..29788d6 100644 --- a/arch_init.c +++ b/arch_init.c @@ -49,6 +49,7 @@ #include trace.h #include exec/cpu-all.h #include hw/acpi/acpi.h +#include sysemu/cpus.h #ifdef DEBUG_ARCH_INIT #define DPRINTF(fmt, ...) 
\ @@ -104,6 +105,8 @@ int graphic_depth = 15; #endif const uint32_t arch_type = QEMU_ARCH; +static bool mig_throttle_on; + /***/ /* ram save/restore */ @@ -378,8 +381,15 @@ static void migration_bitmap_sync(void) uint64_t num_dirty_pages_init = migration_dirty_pages; MigrationState *s = migrate_get_current(); static int64_t start_time; +static int64_t bytes_xfer_prev; static int64_t num_dirty_pages_period; int64_t end_time; +int64_t bytes_xfer_now; +static int dirty_rate_high_cnt; + +if (!bytes_xfer_prev) { +bytes_xfer_prev = ram_bytes_transferred(); +} if (!start_time) { start_time = qemu_get_clock_ms(rt_clock); @@ -404,6 +414,23 @@ static void migration_bitmap_sync(void) /* more than 1 second = 1000 milliseconds */ if (end_time > start_time + 1000) { +if (migrate_auto_converge()) { +/* The following detection logic can be refined later. For now: + Check to see if the dirtied bytes are 50% more than the approx. + amount of bytes that just got transferred since the last time we + were in this routine. If that happens N times (for now N==5) + we turn on the throttle down logic */ +bytes_xfer_now = ram_bytes_transferred(); +if (s->dirty_pages_rate && +((num_dirty_pages_period * TARGET_PAGE_SIZE) > +((bytes_xfer_now - bytes_xfer_prev)/2))) { +if (dirty_rate_high_cnt++ > 5) { +DPRINTF("Unable to converge. Throttling down guest\n"); +mig_throttle_on = true; +} + } + bytes_xfer_prev = bytes_xfer_now; +} s->dirty_pages_rate = num_dirty_pages_period * 1000 / (end_time - start_time); s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE; @@ -496,6 +523,15 @@ static int ram_save_block(QEMUFile *f, bool last_stage) return bytes_sent; } +bool throttling_needed(void) +{ +if (!migrate_auto_converge()) { +return false; +} + +return mig_throttle_on; +} + static uint64_t bytes_transferred; static ram_addr_t ram_save_remaining(void) @@ -1098,3 +1134,35 @@ TargetInfo *qmp_query_target(Error **errp) return info; } + +static void mig_delay_vcpu(void) +{ +qemu_mutex_unlock_iothread(); +g_usleep(50*1000); +qemu_mutex_lock_iothread(); +} + +/* Stub used for getting the vcpu out of VM and into qemu via + run_on_cpu() */ +static void mig_kick_cpu(void *opq) +{ +mig_delay_vcpu(); +return; +} + +/* To reduce the dirty rate explicitly disallow the VCPUs from spending + much time in the VM. The migration thread will try to catch up. + Workload will experience a performance drop. +*/ +void migration_throttle_down(void) +{ +if (throttling_needed()) { +CPUArchState *penv = first_cpu; +while (penv) { +qemu_mutex_lock_iothread(); +async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL
Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
On 5/9/2013 10:20 AM, Michael R. Hines wrote: Comments inline. FYI: please CC mrhi...@us.ibm.com, because it helps me know when to scroll through the bazillion qemu-devel emails. I have things separated out into folders and rules, but a direct CC is better =) Sure will do. On 05/03/2013 07:28 PM, Chegu Vinod wrote: Hi Michael, I picked up the qemu bits from your github branch and gave it a try. (BTW the setup I was given temporary access to has a pair of MLX's IB QDR cards connected back to back via QSFP cables) Observed a couple of things and wanted to share..perhaps you may be aware of them already or perhaps these are unrelated to your specific changes ? (Note: Still haven't finished the review of your changes ). a) x-rdma-pin-all off case Seem to only work sometimes but fails at other times. Here is an example... (qemu) rdma: Accepting rdma connection... rdma: Memory pin all: disabled rdma: verbs context after listen: 0x56757d50 rdma: dest_connect Source GID: fe80::2:c903:9:53a5, Dest GID: fe80::2:c903:9:5855 rdma: Accepted migration qemu-system-x86_64: VQ 1 size 0x100 Guest index 0x4d2 inconsistent with Host index 0x4ec: delta 0xffe6 qemu: warning: error while loading state for instance 0x0 of device 'virtio-net' load of migration failed Can you give me more details about the configuration of your VM? The guest is a 10-VCPU/128GB ...and nothing really that fancy with respect to storage or networking. Hosted on a large Westmere-EX box (target is a similarly configured Westmere-X system). There is a shared SAN disk between the two hosts. Both hosts have 3.9-rc7 kernel that I got at that time from kvm.git tree. The guest was also running the same kernel. Since I was just trying it out I was not running any workload either. 
On the source host the qemu command line : /usr/local/bin/qemu-system-x86_64 \ -enable-kvm \ -cpu host \ -name vm1 \ -m 131072 -smp 10,sockets=1,cores=10,threads=1 \ -mem-path /dev/hugepages \ -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \ -drive file=/dev/libvirt_lvm3/vm1,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native \ -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \ -monitor stdio \ -net nic,model=virtio,macaddr=52:54:00:71:01:01,netdev=nic-0 \ -netdev tap,id=nic-0,ifname=tap0,script=no,downscript=no,vhost=on \ -vnc :4 On the destination host the command line was same as the above with the following additional arg... -incoming x-rdma:static private ipaddr of the IB:port # b) x-rdma-pin-all on case : The guest is not resuming on the target host. i.e. the source host's qemu states that migration is complete but the guest is not responsive anymore... (doesn't seem to have crashed but its stuck somewhere).Have you seen this behavior before ? Any tips on how I could extract additional info ? Is the QEMU monitor still responsive? They were responsive. Can you capture a screenshot of the guest's console to see if there is a panic? No panic on the guest's console :( What kind of storage is attached to the VM? Simple virtio disk hosted on a SAN disk (see the qemu command line). Besides the list of noted restrictions/issues around having to pin all of guest memory... if the pinning is done as part of starting of the migration it ends up taking noticeably long time for larger guests. Wonder whether that should be counted as part of the total migration time ?. That's a good question: The pin-all option should not be slowing down your VM too much as the VM should still be running before the migration_thread() actually kicks in and starts the migration. Well I had hoped that it would not have any serious impacts but it ended up freezing the guest... 
I need more information on the configuration of your VM, guest operating system, architecture and so forth... Pl. see above. And similarly as before whether or not QEMU is not responsive or whether or not it's the guest that's panicked... The guest just freezes... it doesn't panic while this pinning is in progress (i.e. after I set the capability and start the migration). After the pinning completes the guest continues to run and the migration continues... till it completes (as per the source host's qemu)... but I never see it resume on the target host. Also, the act of pinning all the memory seems to freeze the guest. e.g.: for larger enterprise-sized guests (say 128GB and higher) the guest is frozen for anywhere from nearly a minute (~50 seconds) to multiple minutes as the guest size increases... which imo kind of defeats the purpose of live guest migration. That's bad =) There must be a bug somewhere; the largest VM I can create on my hardware is ~16GB, so let me give that a try and try to track down the problem. Ok. Perhaps running a simple test inside the guest can help observe any scheduling
Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration
On 5/9/2013 1:05 PM, Igor Mammedov wrote: On Thu, 9 May 2013 12:43:20 -0700 Chegu Vinod chegu_vi...@hp.com wrote: If a user chooses to turn on the auto-converge migration capability, these changes detect the lack of convergence and throttle down the guest, i.e. force the VCPUs out of the guest for some duration and let the migration thread catch up and help converge. [...]

    +void migration_throttle_down(void)
    +{
    +    if (throttling_needed()) {
    +        CPUArchState *penv = first_cpu;
    +        while (penv) {
    +            qemu_mutex_lock_iothread();
    +            async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
    +            qemu_mutex_unlock_iothread();
    +            penv = penv->next_cpu;

Could you replace the open-coded loop with qemu_for_each_cpu()? Yes, will try to replace it in the next version. Vinod

    +        }
    +    }
    +}
Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration
On 5/9/2013 1:24 PM, Igor Mammedov wrote: On Thu, 9 May 2013 12:43:20 -0700 Chegu Vinod chegu_vi...@hp.com wrote: If a user chooses to turn on the auto-converge migration capability, these changes detect the lack of convergence and throttle down the guest, i.e. force the VCPUs out of the guest for some duration and let the migration thread catch up and help converge. [...]

    +
    +static void mig_delay_vcpu(void)
    +{
    +    qemu_mutex_unlock_iothread();
    +    g_usleep(50*1000);
    +    qemu_mutex_lock_iothread();
    +}
    +
    +/* Stub used for getting the vcpu out of VM and into qemu via
    +   run_on_cpu() */
    +static void mig_kick_cpu(void *opq)
    +{
    +    mig_delay_vcpu();
    +    return;
    +}
    +
    +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
    +   much time in the VM. The migration thread will try to catch up.
    +   Workload will experience a performance drop.
    +*/
    +void migration_throttle_down(void)
    +{
    +    if (throttling_needed()) {
    +        CPUArchState *penv = first_cpu;
    +        while (penv) {
    +            qemu_mutex_lock_iothread();

Locking it here and then unlocking it inside of the queued work doesn't look nice. Yes... but see below. What exactly are you protecting with this lock? It was my understanding that the BQL is supposed to be held when the vcpu threads start entering and executing in the qemu context (as qemu is not MP safe). Still true? In this specific use case I was concerned about the fraction of the time when a given vcpu thread is in the qemu context but not executing the callback routine... and was hence holding the BQL. Holding the BQL and g_usleep'ng is not only bad but would slow down the migration thread... hence the "doesn't look nice" stuff :( For this specific use case, if it's not really required to even bother with the BQL then please do let me know. Also please refer to version 3 of my patch: I was doing a g_usleep() in kvm_cpu_exec() and was not messing much with the BQL, but that was deemed not a good thing either.
Thanks Vinod

    +            async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
    +            qemu_mutex_unlock_iothread();
    +            penv = penv->next_cpu;
    +        }
    +    }
    +}
[Qemu-devel] [RFC PATCH v4] Throttle-down guest when live migration does not converge.
Busy enterprise workloads hosted on large sized VMs tend to dirty memory faster than the transfer rate achieved via live guest migration. Despite some good recent improvements (using dedicated 10Gig NICs between hosts) the live migration does NOT converge. If a user chooses to force convergence of their migration via a new migration capability, auto-converge, then this change will auto-detect the lack-of-convergence scenario and trigger a slow down of the workload by explicitly disallowing the VCPUs from spending much time in the VM context. The migration thread tries to catch up and this eventually leads to convergence in some deterministic amount of time. Yes, it does impact the performance of all the VCPUs, but in my observation that lasts only for a short duration of time, i.e. we end up entering stage 3 (the downtime phase) soon after that. No external trigger is required. Thanks to Juan and Paolo for their useful suggestions. Verified the convergence using the following: - SpecJbb2005 workload running on a 20VCPU/256G guest (~80% busy) - OLTP-like workload running on an 80VCPU/512G guest (~80% busy) Sample results with the SpecJbb2005 workload (migrate speed set to 20Gb and migrate downtime set to 4 seconds).
    (qemu) info migrate
    capabilities: xbzrle: off auto-converge: off
    Migration status: active
    total time: 1487503 milliseconds
    expected downtime: 519 milliseconds
    transferred ram: 383749347 kbytes
    remaining ram: 2753372 kbytes
    total ram: 268444224 kbytes
    duplicate: 65461532 pages
    skipped: 64901568 pages
    normal: 95750218 pages
    normal bytes: 383000872 kbytes
    dirty pages rate: 67551 pages
    ---
    (qemu) info migrate
    capabilities: xbzrle: off auto-converge: on
    Migration status: completed
    total time: 241161 milliseconds
    downtime: 6373 milliseconds
    transferred ram: 28235307 kbytes
    remaining ram: 0 kbytes
    total ram: 268444224 kbytes
    duplicate: 64946416 pages
    skipped: 64903523 pages
    normal: 7044971 pages
    normal bytes: 28179884 kbytes
    ---

Changes from v3:
- incorporated feedback from Paolo and Eric
- rebased to latest qemu.git

Changes from v2:
- incorporated feedback from Orit, Juan and Eric
- stop the throttling thread at the start of stage 3
- rebased to latest qemu.git

Changes from v1:
- rebased to latest qemu.git
- added auto-converge capability (default off) - suggested by Anthony Liguori and Eric Blake.
Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 arch_init.c                   | 61
 cpus.c                        | 41
 include/migration/migration.h |  7
 include/qemu-common.h         |  1
 include/qemu/main-loop.h      |  3
 include/qom/cpu.h             | 10
 kvm-all.c                     | 46
 migration.c                   | 18
 qapi-schema.json              |  5
 9 files changed, 190 insertions(+), 2 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 49c5dc2..2f703cf 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,7 @@ int graphic_depth = 15;
 #endif
 const uint32_t arch_type = QEMU_ARCH;
+static bool mig_throttle_on;
 /***/
 /* ram save/restore */
@@ -379,7 +380,14 @@ static void migration_bitmap_sync(void)
     MigrationState *s = migrate_get_current();
     static int64_t start_time;
     static int64_t num_dirty_pages_period;
+    static int64_t bytes_xfer_prev;
     int64_t end_time;
+    int64_t bytes_xfer_now;
+    static int dirty_rate_high_cnt;
+
+    if (!bytes_xfer_prev) {
+        bytes_xfer_prev = ram_bytes_transferred();
+    }
     if (!start_time) {
         start_time = qemu_get_clock_ms(rt_clock);
@@ -404,6 +412,27 @@ static void migration_bitmap_sync(void)
     /* more than 1 second = 1000 milliseconds */
     if (end_time > start_time + 1000) {
+        if (migrate_auto_converge()) {
+            /* The following detection logic can be refined later. For now:
+               Check to see if the dirtied bytes is 50% more than the approx.
+               amount of bytes that just got transferred since the last time we
+               were in this routine. If that happens N times (for now N==5)
+               we turn on the throttle down logic */
+            bytes_xfer_now = ram_bytes_transferred();
+            if (s->dirty_pages_rate &&
+                ((num_dirty_pages_period * TARGET_PAGE_SIZE) >
+                 ((bytes_xfer_now - bytes_xfer_prev) / 2))) {
+                if (dirty_rate_high_cnt++ > 5) {
+                    DPRINTF("Unable to converge. Throttling down guest\n");
+                    qemu_mutex_lock_mig_throttle();
+                    if (!mig_throttle_on) {
+                        mig_throttle_on = true;
+                    }
+                    qemu_mutex_unlock_mig_throttle();
+                }
+            }
+            bytes_xfer_prev = bytes_xfer_now
Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
Hi Michael, I picked up the qemu bits from your github branch and gave it a try. (BTW the setup I was given temporary access to has a pair of MLX's IB QDR cards connected back to back via QSFP cables.) Observed a couple of things and wanted to share... perhaps you may be aware of them already or perhaps these are unrelated to your specific changes? (Note: still haven't finished the review of your changes.)

a) x-rdma-pin-all off case: seems to only work sometimes but fails at other times. Here is an example...

    (qemu) rdma: Accepting rdma connection...
    rdma: Memory pin all: disabled
    rdma: verbs context after listen: 0x56757d50
    rdma: dest_connect Source GID: fe80::2:c903:9:53a5, Dest GID: fe80::2:c903:9:5855
    rdma: Accepted migration
    qemu-system-x86_64: VQ 1 size 0x100 Guest index 0x4d2 inconsistent with Host index 0x4ec: delta 0xffe6
    qemu: warning: error while loading state for instance 0x0 of device 'virtio-net'
    load of migration failed

b) x-rdma-pin-all on case: the guest is not resuming on the target host, i.e. the source host's qemu states that migration is complete but the guest is not responsive anymore... (it doesn't seem to have crashed but it's stuck somewhere). Have you seen this behavior before? Any tips on how I could extract additional info? Besides the list of noted restrictions/issues around having to pin all of guest memory: if the pinning is done as part of starting the migration it ends up taking a noticeably long time for larger guests. Wonder whether that should be counted as part of the total migration time? Also, the act of pinning all the memory seems to freeze the guest. e.g.: for larger enterprise-sized guests (say 128GB and higher) the guest is frozen for anywhere from nearly a minute (~50 seconds) to multiple minutes as the guest size increases... which imo kind of defeats the purpose of live guest migration. Would like to hear if you have already thought about any other alternatives to address this issue? For e.g.
would it be better to pin all of the guest's memory as part of starting the guest itself? Yes, there are restrictions when we do pinning... but it can help with performance. --- BTW, a different (yet sort of related) topic... recently a patch went into upstream that provided an option to qemu to mlock all of guest memory: https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg03947.html . But when attempting to do the mlock for larger guests a lot of time is spent bringing each page into cache and clearing/zeroing it, etc.: https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg04161.html Note: the basic tcp-based live guest migration in the same qemu version still works fine on the same hosts over a pair of non-RDMA 10Gb NICs connected back-to-back. Thanks Vinod
[Qemu-devel] [PATCH v3] Throttle-down guest when live migration does not converge.
Busy enterprise workloads hosted on large sized VMs tend to dirty memory faster than the transfer rate achieved via live guest migration. Despite some good recent improvements (using dedicated 10Gig NICs between hosts) the live migration does NOT converge. If a user chooses to force convergence of their migration via a new migration capability, auto-converge, then this change will auto-detect the lack-of-convergence scenario and trigger a slow down of the workload by explicitly disallowing the VCPUs from spending much time in the VM context. The migration thread tries to catch up and this eventually leads to convergence in some deterministic amount of time. Yes, it does impact the performance of all the VCPUs, but in my observation that lasts only for a short duration of time, i.e. we end up entering stage 3 (the downtime phase) soon after that. No external trigger is required. Thanks to Juan and Paolo for their useful suggestions. Verified the convergence using the following: - SpecJbb2005 workload running on a 20VCPU/256G guest (~80% busy) - OLTP-like workload running on an 80VCPU/512G guest (~80% busy) Sample results with the SpecJbb2005 workload (migrate speed set to 20Gb and migrate downtime set to 4 seconds).
    (qemu) info migrate
    capabilities: xbzrle: off auto-converge: off
    Migration status: active
    total time: 1487503 milliseconds
    expected downtime: 519 milliseconds
    transferred ram: 383749347 kbytes
    remaining ram: 2753372 kbytes
    total ram: 268444224 kbytes
    duplicate: 65461532 pages
    skipped: 64901568 pages
    normal: 95750218 pages
    normal bytes: 383000872 kbytes
    dirty pages rate: 67551 pages
    ---
    (qemu) info migrate
    capabilities: xbzrle: off auto-converge: on
    Migration status: completed
    total time: 241161 milliseconds
    downtime: 6373 milliseconds
    transferred ram: 28235307 kbytes
    remaining ram: 0 kbytes
    total ram: 268444224 kbytes
    duplicate: 64946416 pages
    skipped: 64903523 pages
    normal: 7044971 pages
    normal bytes: 28179884 kbytes
    ---

Changes from v2:
- incorporated feedback from Orit, Juan and Eric
- stop the throttling thread at the start of stage 3
- rebased to latest qemu.git

Changes from v1:
- rebased to latest qemu.git
- added auto-converge capability (default off) - suggested by Anthony Liguori and Eric Blake.
Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 arch_init.c                   | 61
 cpus.c                        | 12
 include/migration/migration.h |  7
 include/qemu/main-loop.h      |  3
 kvm-all.c                     | 46
 migration.c                   | 18
 qapi-schema.json              |  7
 7 files changed, 153 insertions(+), 1 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 49c5dc2..7e03b2c 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,7 @@ int graphic_depth = 15;
 #endif
 const uint32_t arch_type = QEMU_ARCH;
+static bool mig_throttle_on;
 /***/
 /* ram save/restore */
@@ -379,7 +380,14 @@ static void migration_bitmap_sync(void)
     MigrationState *s = migrate_get_current();
     static int64_t start_time;
     static int64_t num_dirty_pages_period;
+    static int64_t bytes_xfer_prev;
     int64_t end_time;
+    int64_t bytes_xfer_now;
+    static int dirty_rate_high_cnt;
+
+    if (!bytes_xfer_prev) {
+        bytes_xfer_prev = ram_bytes_transferred();
+    }
     if (!start_time) {
         start_time = qemu_get_clock_ms(rt_clock);
@@ -404,6 +412,27 @@ static void migration_bitmap_sync(void)
     /* more than 1 second = 1000 milliseconds */
     if (end_time > start_time + 1000) {
+        if (migrate_auto_converge()) {
+            /* The following detection logic can be refined later. For now:
+               Check to see if the dirtied bytes is 50% more than the approx.
+               amount of bytes that just got transferred since the last time we
+               were in this routine. If that happens N times (for now N==5)
+               we turn on the throttle down logic */
+            bytes_xfer_now = ram_bytes_transferred();
+            if (s->dirty_pages_rate &&
+                ((num_dirty_pages_period * TARGET_PAGE_SIZE) >
+                 ((bytes_xfer_now - bytes_xfer_prev) / 2))) {
+                if (dirty_rate_high_cnt++ > 5) {
+                    DPRINTF("Unable to converge. Throttling down guest\n");
+                    qemu_mutex_lock_mig_throttle();
+                    if (!mig_throttle_on) {
+                        mig_throttle_on = true;
+                    }
+                    qemu_mutex_unlock_mig_throttle();
+                }
+            }
+            bytes_xfer_prev = bytes_xfer_now;
+        }
         s->dirty_pages_rate = num_dirty_pages_period * 1000
             / (end_time - start_time);
         s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -496,6 +525,33 @@ static
Re: [Qemu-devel] [PATCH v3] Throttle-down guest when live migration does not converge.
On 5/1/2013 5:38 AM, Eric Blake wrote: On 05/01/2013 06:22 AM, Chegu Vinod wrote: Busy enterprise workloads hosted on large sized VMs tend to dirty memory faster than the transfer rate achieved via live guest migration. Despite some good recent improvements (using dedicated 10Gig NICs between hosts) the live migration does NOT converge. --- Changes from v2: - incorporated feedback from Orit, Juan and Eric - stop the throttling thread at the start of stage 3 - rebased to latest qemu.git

    +++ b/qapi-schema.json
    @@ -600,9 +600,14 @@
     # loads, by sending compressed difference of the pages
     #
     # Since: 1.2
    +#
    +# @auto-converge: Migration supports automatic throttling down of guest
    +#                 to force convergence. Disabled by default.
    +#
    +# Since: 1.6
     ##

I've already argued that ALL new migration capabilities should be disabled by default (see the thread on 'x-rdma-pin-all', which will be a merge conflict if it gets applied before your patch). So I don't think that last sentence adds anything, and can be dropped. I think this works, although it's the first instance of having two top-level Since: tags on a single JSON entity. I was envisioning:

    @xbzrle: yadda... pages
    @auto-converge: Migration supports... convergence (since 1.6)
    Since: 1.2

to match the conventions elsewhere: the overall JSON entity (the enum MigrationCapability) exists since 1.2, but the addition of auto-converge happened in 1.6. However, as nothing parses the .json file to turn it into formal docs (yet), I'm not going to insist on a respin if this is the only problem with your patch. I'm not comfortable enough with my skills in reviewing the rest of the patch, or I'd offer a reviewed-by. I shall make the suggested changes. Appreciate your review feedback on this part of the change. Thanks Vinod
Re: [Qemu-devel] [PATCH v3] Throttle-down guest when live migration does not converge.
On 5/1/2013 8:40 AM, Paolo Bonzini wrote: I shall make the suggested changes. Appreciate your review feedback on this part of the change. Hi Paolo, thanks for taking a look. (BTW, I accidentally left out the RFC in the patch subject line... my bad!) Hi Vinod, I think unfortunately it is not acceptable to make this patch work only for KVM. (It cannot work for Xen, but that's not a problem since Xen uses a different migration mechanism; but it should work for TCG.) Ok. I hadn't yet looked at TCG aspects etc. Will follow up offline... Unfortunately, as you noted, the run_on_cpu callbacks currently run under the big QEMU lock. We need to fix that first. We have time for that during 1.6. Ok. I was under the impression that any time a vcpu thread enters to do anything in qemu the BQL had to be held, so I chose to go with run_on_cpu(). Will follow up offline on alternatives. Holding the vcpus in the host context (i.e. the kvm module) itself is perhaps another way. It would need some handshakes (i.e. new ioctls) with the kernel. Would that be an acceptable way to proceed? Thanks Vinod Paolo.
Re: [Qemu-devel] [RFC PATCH v2] Throttle-down guest when live migration does not converge.
On 4/30/2013 8:20 AM, Juan Quintela wrote: Chegu Vinod chegu_vi...@hp.com wrote: Busy enterprise workloads hosted on large sized VMs tend to dirty memory faster than the transfer rate achieved via live guest migration. Despite some good recent improvements (using dedicated 10Gig NICs between hosts) the live migration does NOT converge. A few options that were discussed/being pursued to help with the convergence issue include: 1) Slow down the guest considerably via cgroup's CPU controls - requires libvirt client support to detect and trigger the action, but conceptually similar to this RFC change. 2) Speed up the transfer rate: - RDMA based pre-copy - lower overhead and fast (unfortunately it has a few restrictions and some customers still choose not to deploy RDMA :-( ). - Add parallelism to improve the transfer rate and use multiple 10Gig connections (bonded) - could add some overhead on the host. 3) Post-copy (preferably with RDMA) or a pre+post copy hybrid - sounds promising but needs to consider handling newer failure scenarios. If an enterprise user chooses to force convergence of their migration via the new capability auto-converge, then with this change we auto-detect the lack-of-convergence scenario and trigger a slow down of the workload by explicitly disallowing the VCPUs from spending much time in the VM context. The migration thread tries to catch up and this eventually leads to convergence in some deterministic amount of time. Yes, it does impact the performance of all the VCPUs, but in my observation that lasts only for a short duration of time, i.e. we end up entering stage 3 (the downtime phase) soon after that. No external trigger is required (unlike option 1) and it can co-exist with enhancements being pursued as part of option 2 (e.g. RDMA). Thanks to Juan and Paolo for their useful suggestions.
Verified the convergence using the following: - SpecJbb2005 workload running on a 20VCPU/256G guest (~80% busy) - OLTP-like workload running on an 80VCPU/512G guest (~80% busy) Sample results with the SpecJbb2005 workload (migrate speed set to 20Gb and migrate downtime set to 4 seconds):

    (qemu) info migrate
    capabilities: xbzrle: off auto-converge: off
    Migration status: active
    total time: 1487503 milliseconds

1487 seconds and still the migration is not completed.

    expected downtime: 519 milliseconds
    transferred ram: 383749347 kbytes
    remaining ram: 2753372 kbytes
    total ram: 268444224 kbytes
    duplicate: 65461532 pages
    skipped: 64901568 pages
    normal: 95750218 pages
    normal bytes: 383000872 kbytes
    dirty pages rate: 67551 pages
    ---
    (qemu) info migrate
    capabilities: xbzrle: off auto-converge: on
    Migration status: completed
    total time: 241161 milliseconds
    downtime: 6373 milliseconds

6.3 seconds and finished, not bad at all O:-) That's the *downtime*... the total time for the migration to complete is 241 secs. (SpecJBB is one of those workloads that dirties memory quite a bit.) How much does the guest throughput drop while we are in auto-converge mode? Workload performance drops for a short duration... but it soon switches to stage 3.

    transferred ram: 28235307 kbytes
    remaining ram: 0 kbytes
    total ram: 268444224 kbytes
    duplicate: 64946416 pages
    skipped: 64903523 pages
    normal: 7044971 pages
    normal bytes: 28179884 kbytes

Changes from v1:
- rebased to latest qemu.git
- added auto-converge capability (default off) - suggested by Anthony Liguori and Eric Blake.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com

    @@ -379,12 +380,20 @@ static void migration_bitmap_sync(void)
         MigrationState *s = migrate_get_current();
         static int64_t start_time;
         static int64_t num_dirty_pages_period;
    +    static int64_t bytes_xfer_prev;
         int64_t end_time;
    +    int64_t bytes_xfer_now;
    +    static int dirty_rate_high_cnt;
    +
    +    if (migrate_auto_converge() && !bytes_xfer_prev) {

Just do the !bytes_xfer_prev test here?
migrate_auto_converge() is more expensive to call than just doing the assignment? Sure.

    +
    +    if (value) {
    +        return true;
    +    }
    +    return false;

This code is just: return value; Ok.

    diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
    index 6f0200a..9a3886d 100644
    --- a/include/qemu/main-loop.h
    +++ b/include/qemu/main-loop.h
    @@ -299,6 +299,9 @@ void qemu_mutex_lock_iothread(void);
      */
     void qemu_mutex_unlock_iothread(void);
    +void qemu_mutex_lock_mig_throttle(void);
    +void qemu_mutex_unlock_mig_throttle(void);
    +
     /* internal interfaces */
     void qemu_fd_register(int fd);
    diff --git a/kvm-all.c b/kvm-all.c
    index 2d92721..a92cb77 100644
    --- a/kvm-all.c
    +++ b/kvm-all.c
    @@ -33,6 +33,8 @@
     #include "exec/memory.h"
     #include "exec/address-spaces.h"
     #include "qemu/event_notifier.h"
    +#include "sysemu/cpus.h"
    +#include "migration/migration.h"
     /* This check must be after config-host.h is included */
     #ifdef CONFIG_EVENTFD
    @@ -116,6 +118,8 @@ static const
Re: [Qemu-devel] [RFC PATCH v2] Throttle-down guest when live migration does not converge.
On 4/30/2013 9:01 AM, Juan Quintela wrote: Chegu Vinod chegu_vi...@hp.com wrote: On 4/30/2013 8:20 AM, Juan Quintela wrote:

    (qemu) info migrate
    capabilities: xbzrle: off auto-converge: off
    Migration status: active
    total time: 1487503 milliseconds

1487 seconds and still the migration is not completed.

    expected downtime: 519 milliseconds
    transferred ram: 383749347 kbytes
    remaining ram: 2753372 kbytes
    total ram: 268444224 kbytes
    duplicate: 65461532 pages
    skipped: 64901568 pages
    normal: 95750218 pages
    normal bytes: 383000872 kbytes
    dirty pages rate: 67551 pages
    ---
    (qemu) info migrate
    capabilities: xbzrle: off auto-converge: on
    Migration status: completed
    total time: 241161 milliseconds
    downtime: 6373 milliseconds

6.3 seconds and finished, not bad at all O:-) That's the *downtime*... the total time for the migration to complete is 241 secs. (SpecJBB is one of those workloads that dirties memory quite a bit.) Sorry, you are right. Impressive anyway for such a small change.

    +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
    +   much time in the VM. The migration thread will try to catch up.
    +   Workload will experience a greater performance drop but for a shorter
    +   duration.
    +*/
    +void *migration_throttle_down(void *opaque)
    +{
    +    throttling = true;
    +    while (throttling_needed()) {
    +        CPUArchState *penv = first_cpu;

I am not sure that we can follow the list without the iothread lock here. Hmm... is this due to vcpu hot plug that might happen at the time of live migration (or) due to something else? I was trying to avoid holding the iothread lock for a longer duration and slowing down the migration thread... Well, thinking back about it, what we should do is disable cpu hotplug/unplug during migration. I tend to agree. For now I am not going to hold the iothread lock while following the list... (it is not working well anyway as of today). Yes... and I see that Igor, Eduardo et al. are trying to fix this. Vinod Thanks, Juan.
Re: [Qemu-devel] [RFC PATCH v2] Throttle-down guest when live migration does not converge.
On 4/30/2013 8:04 AM, Orit Wasserman wrote: On 04/27/2013 11:50 PM, Chegu Vinod wrote: Busy enterprise workloads hosted on large sized VMs tend to dirty memory faster than the transfer rate achieved via live guest migration. Despite some good recent improvements (using dedicated 10Gig NICs between hosts) the live migration does NOT converge. A few options that were discussed/being pursued to help with the convergence issue include: 1) Slow down the guest considerably via cgroup's CPU controls - requires libvirt client support to detect and trigger the action, but conceptually similar to this RFC change. 2) Speed up the transfer rate: - RDMA based pre-copy - lower overhead and fast (unfortunately it has a few restrictions and some customers still choose not to deploy RDMA :-( ). - Add parallelism to improve the transfer rate and use multiple 10Gig connections (bonded) - could add some overhead on the host. 3) Post-copy (preferably with RDMA) or a pre+post copy hybrid - sounds promising but needs to consider handling newer failure scenarios. If an enterprise user chooses to force convergence of their migration via the new capability auto-converge, then with this change we auto-detect the lack-of-convergence scenario and trigger a slow down of the workload by explicitly disallowing the VCPUs from spending much time in the VM context. The migration thread tries to catch up and this eventually leads to convergence in some deterministic amount of time. Yes, it does impact the performance of all the VCPUs, but in my observation that lasts only for a short duration of time, i.e. we end up entering stage 3 (the downtime phase) soon after that. No external trigger is required (unlike option 1) and it can co-exist with enhancements being pursued as part of option 2 (e.g. RDMA). Thanks to Juan and Paolo for their useful suggestions.
Verified the convergence using the following: - SpecJbb2005 workload running on a 20VCPU/256G guest (~80% busy) - OLTP-like workload running on an 80VCPU/512G guest (~80% busy) Sample results with the SpecJbb2005 workload (migrate speed set to 20Gb and migrate downtime set to 4 seconds):

    (qemu) info migrate
    capabilities: xbzrle: off auto-converge: off
    Migration status: active
    total time: 1487503 milliseconds
    expected downtime: 519 milliseconds
    transferred ram: 383749347 kbytes
    remaining ram: 2753372 kbytes
    total ram: 268444224 kbytes
    duplicate: 65461532 pages
    skipped: 64901568 pages
    normal: 95750218 pages
    normal bytes: 383000872 kbytes
    dirty pages rate: 67551 pages
    ---
    (qemu) info migrate
    capabilities: xbzrle: off auto-converge: on
    Migration status: completed
    total time: 241161 milliseconds
    downtime: 6373 milliseconds
    transferred ram: 28235307 kbytes
    remaining ram: 0 kbytes
    total ram: 268444224 kbytes
    duplicate: 64946416 pages
    skipped: 64903523 pages
    normal: 7044971 pages
    normal bytes: 28179884 kbytes

Changes from v1:
- rebased to latest qemu.git
- added auto-converge capability (default off) - suggested by Anthony Liguori and Eric Blake.
Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 arch_init.c                   | 44
 cpus.c                        | 12
 include/migration/migration.h | 12
 include/qemu/main-loop.h      |  3
 kvm-all.c                     | 51
 migration.c                   | 15
 qapi-schema.json              |  6
 7 files changed, 142 insertions(+), 1 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 92de1bd..6dcc742 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,7 @@ int graphic_depth = 15;
 #endif
 const uint32_t arch_type = QEMU_ARCH;
+static uint64_t mig_throttle_on;
 /***/
 /* ram save/restore */
@@ -379,12 +380,20 @@ static void migration_bitmap_sync(void)
     MigrationState *s = migrate_get_current();
     static int64_t start_time;
     static int64_t num_dirty_pages_period;
+    static int64_t bytes_xfer_prev;
     int64_t end_time;
+    int64_t bytes_xfer_now;
+    static int dirty_rate_high_cnt;
+
+    if (migrate_auto_converge() && !bytes_xfer_prev) {
+        bytes_xfer_prev = ram_bytes_transferred();
+    }
     if (!start_time) {
         start_time = qemu_get_clock_ms(rt_clock);
     }
+
     trace_migration_bitmap_sync_start();
     memory_global_sync_dirty_bitmap(get_system_memory());
@@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
     /* more than 1 second = 1000 milliseconds */
     if (end_time > start_time + 1000) {
+        if (migrate_auto_converge()) {
+            /* The following detection logic can be refined later. For now:
+               Check to see if the dirtied bytes is 50% more than the approx.
+               amount of bytes that just got transferred since the last time we
+               were
Re: [Qemu-devel] [RFC PATCH v2] Throttle-down guest when live migration does not converge.
On 4/29/2013 7:53 AM, Eric Blake wrote: On 04/27/2013 02:50 PM, Chegu Vinod wrote: Busy enterprise workloads hosted on large sized VM's tend to dirty memory faster than the transfer rate achieved via live guest migration. Despite some good recent improvements ( using dedicated 10Gig NICs between hosts) the live migration does NOT converge. No exernal trigger is required (unlike option 1) and it can co-exist s/exernal/external/ with enhancements being pursued as part of Option 2 (e.g. RDMA). Thanks to Juan and Paolo for their useful suggestions. --- (qemu) info migrate capabilities: xbzrle: off auto-converge: on This part looks nice. I'm not reviewing the entire patch (I'm not an expert on the internals of migration), but just the interface: Thanks for taking a look at this. I shall incorporate your suggested changes in the next version. Hoping to hear from Juan/Orit and others on the live migration part. Thanks, Vinod

    +++ b/qapi-schema.json
    @@ -599,10 +599,14 @@
     # This feature allows us to minimize migration traffic for certain work
     # loads, by sending compressed difference of the pages
     #
    +# @auto-converge: Controls whether or not the we want the migration to
    +#                 automaticially detect and force convergence by slowing
    +#                 down the guest. Disabled by default.

s/automaticially/automatically/

Missing a (since 1.6) designation. Also, use of first-person (us, we) in docs seems a bit unprofessional, although you were copying pre-existing usage. How about:

    @xbzrle: Migration supports xbzrle (Xor Based Zero Run Length Encoding),
             which minimizes migration traffic for certain workloads by
             sending compressed differences of active pages
    @auto-converge: Migration supports automatic throttling of guest activity
                    to force convergence (since 1.6)
[Qemu-devel] [RFC PATCH v2] Throttle-down guest when live migration does not converge.
Busy enterprise workloads hosted on large sized VMs tend to dirty memory faster than the transfer rate achieved via live guest migration. Despite some good recent improvements (using dedicated 10Gig NICs between hosts) the live migration does NOT converge. A few options that were discussed/being pursued to help with the convergence issue include: 1) Slow down the guest considerably via cgroup's CPU controls - requires libvirt client support to detect and trigger the action, but conceptually similar to this RFC change. 2) Speed up the transfer rate: - RDMA based pre-copy - lower overhead and fast (unfortunately it has a few restrictions and some customers still choose not to deploy RDMA :-( ). - Add parallelism to improve the transfer rate and use multiple 10Gig connections (bonded) - could add some overhead on the host. 3) Post-copy (preferably with RDMA) or a pre+post copy hybrid - sounds promising but needs to consider handling newer failure scenarios. If an enterprise user chooses to force convergence of their migration via the new capability auto-converge, then with this change we auto-detect the lack-of-convergence scenario and trigger a slow down of the workload by explicitly disallowing the VCPUs from spending much time in the VM context. The migration thread tries to catch up and this eventually leads to convergence in some deterministic amount of time. Yes, it does impact the performance of all the VCPUs, but in my observation that lasts only for a short duration of time, i.e. we end up entering stage 3 (the downtime phase) soon after that. No external trigger is required (unlike option 1) and it can co-exist with enhancements being pursued as part of option 2 (e.g. RDMA). Thanks to Juan and Paolo for their useful suggestions.
Verified the convergence using the following: - SpecJbb2005 workload running on a 20VCPU/256G guest (~80% busy) - OLTP like workload running on a 80VCPU/512G guest (~80% busy) Sample results with SpecJbb2005 workload: (migrate speed set to 20Gb and migrate downtime set to 4 seconds). (qemu) info migrate capabilities: xbzrle: off auto-converge: off Migration status: active total time: 1487503 milliseconds expected downtime: 519 milliseconds transferred ram: 383749347 kbytes remaining ram: 2753372 kbytes total ram: 268444224 kbytes duplicate: 65461532 pages skipped: 64901568 pages normal: 95750218 pages normal bytes: 383000872 kbytes dirty pages rate: 67551 pages --- (qemu) info migrate capabilities: xbzrle: off auto-converge: on Migration status: completed total time: 241161 milliseconds downtime: 6373 milliseconds transferred ram: 28235307 kbytes remaining ram: 0 kbytes total ram: 268444224 kbytes duplicate: 64946416 pages skipped: 64903523 pages normal: 7044971 pages normal bytes: 28179884 kbytes Changes from v1: - rebased to latest qemu.git - added auto-converge capability (default off) - suggested by Anthony Liguori & Eric Blake.
Signed-off-by: Chegu Vinod chegu_vi...@hp.com --- arch_init.c | 44 +++ cpus.c | 12 + include/migration/migration.h | 12 + include/qemu/main-loop.h | 3 ++ kvm-all.c | 51 + migration.c | 15 qapi-schema.json | 6 - 7 files changed, 142 insertions(+), 1 deletions(-) diff --git a/arch_init.c b/arch_init.c index 92de1bd..6dcc742 100644 --- a/arch_init.c +++ b/arch_init.c @@ -104,6 +104,7 @@ int graphic_depth = 15; #endif const uint32_t arch_type = QEMU_ARCH; +static uint64_t mig_throttle_on; /***/ /* ram save/restore */ @@ -379,12 +380,20 @@ static void migration_bitmap_sync(void) MigrationState *s = migrate_get_current(); static int64_t start_time; static int64_t num_dirty_pages_period; +static int64_t bytes_xfer_prev; int64_t end_time; +int64_t bytes_xfer_now; +static int dirty_rate_high_cnt; + +if (migrate_auto_converge() && !bytes_xfer_prev) { +bytes_xfer_prev = ram_bytes_transferred(); +} if (!start_time) { start_time = qemu_get_clock_ms(rt_clock); } + trace_migration_bitmap_sync_start(); memory_global_sync_dirty_bitmap(get_system_memory()); @@ -404,6 +413,23 @@ static void migration_bitmap_sync(void) /* more than 1 second = 1000 milliseconds */ if (end_time > start_time + 1000) { +if (migrate_auto_converge()) { +/* The following detection logic can be refined later. For now: + Check to see if the dirtied bytes is 50% more than the approx. + amount of bytes that just got transferred since the last time we + were in this routine. If that happens N times (for now N==5) + we turn on the throttle down logic
[Qemu-devel] [RFC PATCH] Throttle-down guest when live migration does not converge.
Busy enterprise workloads hosted on large sized VM's tend to dirty memory faster than the transfer rate achieved via live guest migration. Despite some good recent improvements (using dedicated 10Gig NICs between hosts) the live migration does NOT converge. A few options that were discussed/being-pursued to help with the convergence issue include: 1) Slow down guest considerably via cgroup's CPU controls - requires libvirt client support to detect & trigger the action, but conceptually similar to this RFC change. 2) Speed up transfer rate: - RDMA based Pre-copy - lower overhead and fast (Unfortunately has a few restrictions and some customers still choose not to deploy RDMA :-( ). - Add parallelism to improve transfer rate and use multiple 10Gig connections (bonded). - could add some overhead on the host. 3) Post-copy (preferably with RDMA) or a Pre+Post copy hybrid - Sounds promising but needs to consider handling newer failure scenarios. The following [RFC] change attempts to auto-detect a lack-of-convergence situation and trigger a slowdown of the workload by explicitly disallowing the VCPUs from spending much time in the VM context. No external trigger is required (unlike option 1) and it can co-exist with enhancements being pursued as part of Option 2 (e.g. RDMA). The migration thread tries to catch up and this eventually leads to convergence in some deterministic amount of time. Yes, it does impact the performance of all the VCPUs, but in my observation that lasts only for a short duration of time, i.e. we end up entering stage 3 (downtime phase) soon after that. Verified the convergence using the following: - SpecJbb2005 workload running on a 20VCPU/128G guest (~80% busy) - OLTP like workload running on a 80VCPU/512G guest (~80% busy) Thanks to Juan and Paolo for some useful suggestions. More refinement is needed (e.g. a smarter way to detect variable throttling based on need, etc.). For now I was hoping to get some feedback or hear about other more refined ideas.
Signed-off-by: Chegu Vinod chegu_vi...@hp.com --- arch_init.c | 37 +++ cpus.c | 12 ++ include/migration/migration.h | 9 +++ include/qemu/main-loop.h | 3 ++ kvm-all.c | 49 + migration.c | 6 + 6 files changed, 116 insertions(+), 0 deletions(-) diff --git a/arch_init.c b/arch_init.c index 92de1bd..a06ff81 100644 --- a/arch_init.c +++ b/arch_init.c @@ -104,6 +104,7 @@ int graphic_depth = 15; #endif const uint32_t arch_type = QEMU_ARCH; +static uint64_t mig_throttle_on; /***/ /* ram save/restore */ @@ -379,12 +380,19 @@ static void migration_bitmap_sync(void) MigrationState *s = migrate_get_current(); static int64_t start_time; static int64_t num_dirty_pages_period; +static int64_t bytes_xfer_prev; int64_t end_time; +int64_t bytes_xfer_now; +static int dirty_rate_high_cnt; if (!start_time) { start_time = qemu_get_clock_ms(rt_clock); } +if (!bytes_xfer_prev) { +bytes_xfer_prev = ram_bytes_transferred(); +} + trace_migration_bitmap_sync_start(); memory_global_sync_dirty_bitmap(get_system_memory()); @@ -404,6 +412,23 @@ static void migration_bitmap_sync(void) /* more than 1 second = 1000 milliseconds */ if (end_time > start_time + 1000) { + /* The following detection logic can be refined later. For now: + Check to see if the dirtied bytes is 50% more than the approx. + amount of bytes that just got transferred since the last time we + were in this routine. If that happens N times (for now N==5) + we turn on the throttle down logic */ + bytes_xfer_now = ram_bytes_transferred(); + if (s->dirty_pages_rate && + ((num_dirty_pages_period * TARGET_PAGE_SIZE) > + ((bytes_xfer_now - bytes_xfer_prev) / 2))) { + if (dirty_rate_high_cnt++ > 5) { + DPRINTF("Unable to converge.
Throttling down guest\n"); + mig_throttle_on = 1; + } +} +bytes_xfer_prev = bytes_xfer_now; + s->dirty_pages_rate = num_dirty_pages_period * 1000 / (end_time - start_time); s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE; @@ -496,6 +521,18 @@ static int ram_save_block(QEMUFile *f, bool last_stage) return bytes_sent; } +bool throttling_needed(void) +{ +bool value; + +qemu_mutex_lock_mig_throttle(); +value = mig_throttle_on; +qemu_mutex_unlock_mig_throttle(); + +return value; +} + static uint64_t bytes_transferred; static ram_addr_t ram_save_remaining(void) diff --git a/cpus.c b/cpus.c index 5a98a37..eea6601 100644 --- a/cpus.c +++ b/cpus.c @@ -616,6 +616,7
Re: [Qemu-devel] [RFC PATCH] Throttle-down guest when live migration does not converge.
On 4/24/2013 6:59 PM, Anthony Liguori wrote: On Wed, Apr 24, 2013 at 6:42 PM, Chegu Vinod chegu_vi...@hp.com mailto:chegu_vi...@hp.com wrote: Busy enterprise workloads hosted on large sized VM's tend to dirty memory faster than the transfer rate achieved via live guest migration. Despite some good recent improvements (using dedicated 10Gig NICs between hosts) the live migration does NOT converge. A few options that were discussed/being-pursued to help with the convergence issue include: 1) Slow down guest considerably via cgroup's CPU controls - requires libvirt client support to detect & trigger the action, but conceptually similar to this RFC change. 2) Speed up transfer rate: - RDMA based Pre-copy - lower overhead and fast (Unfortunately has a few restrictions and some customers still choose not to deploy RDMA :-( ). - Add parallelism to improve transfer rate and use multiple 10Gig connections (bonded). - could add some overhead on the host. 3) Post-copy (preferably with RDMA) or a Pre+Post copy hybrid - Sounds promising but needs to consider handling newer failure scenarios. The following [RFC] change attempts to auto-detect a lack-of-convergence situation and trigger a slowdown of the workload by explicitly disallowing the VCPUs from spending much time in the VM context. No external trigger is required (unlike option 1) and it can co-exist with enhancements being pursued as part of Option 2 (e.g. RDMA). The migration thread tries to catch up and this eventually leads to convergence in some deterministic amount of time. Yes, it does impact the performance of all the VCPUs, but in my observation that lasts only for a short duration of time, i.e. we end up entering stage 3 (downtime phase) soon after that. This is a reasonable idea and approach but it cannot be unconditional. Sacrificing VCPU performance to encourage convergence is a management decision. In some cases, VCPU performance is far more important than migration convergence. I understand the concern and agree.
Would it be ok to pass in an additional argument to qemu as part of triggering the live migration, i.e. to indicate if it's ok to force convergence when it fails to converge on its own after N # of tries following the bulk transfer? Thanks! Vinod Regards, Anthony Liguori Verified the convergence using the following: - SpecJbb2005 workload running on a 20VCPU/128G guest (~80% busy) - OLTP like workload running on a 80VCPU/512G guest (~80% busy) Thanks to Juan and Paolo for some useful suggestions. More refinement is needed (e.g. a smarter way to detect variable throttling based on need, etc.). For now I was hoping to get some feedback or hear about other more refined ideas. Signed-off-by: Chegu Vinod chegu_vi...@hp.com mailto:chegu_vi...@hp.com --- arch_init.c | 37 +++ cpus.c | 12 ++ include/migration/migration.h | 9 +++ include/qemu/main-loop.h | 3 ++ kvm-all.c | 49 + migration.c | 6 + 6 files changed, 116 insertions(+), 0 deletions(-) diff --git a/arch_init.c b/arch_init.c index 92de1bd..a06ff81 100644 --- a/arch_init.c +++ b/arch_init.c @@ -104,6 +104,7 @@ int graphic_depth = 15; #endif const uint32_t arch_type = QEMU_ARCH; +static uint64_t mig_throttle_on; /***/ /* ram save/restore */ @@ -379,12 +380,19 @@ static void migration_bitmap_sync(void) MigrationState *s = migrate_get_current(); static int64_t start_time; static int64_t num_dirty_pages_period; +static int64_t bytes_xfer_prev; int64_t end_time; +int64_t bytes_xfer_now; +static int dirty_rate_high_cnt; if (!start_time) { start_time = qemu_get_clock_ms(rt_clock); } +if (!bytes_xfer_prev) { +bytes_xfer_prev = ram_bytes_transferred(); +} + trace_migration_bitmap_sync_start(); memory_global_sync_dirty_bitmap(get_system_memory()); @@ -404,6 +412,23 @@ static void migration_bitmap_sync(void) /* more than 1 second = 1000 milliseconds */ if (end_time > start_time + 1000) { + /* The following detection logic can be refined later. For now: + Check to see if the dirtied bytes is 50% more than the approx.
+ amount of bytes that just got transferred since the last time we + were in this routine. If that happens N times (for now N==5) + we turn on the throttle down logic */ + bytes_xfer_now = ram_bytes_transferred
Re: [Qemu-devel] [PATCH v4] Add option to mlock qemu and guest memory
Hi Satoru, FYI... I had tried to use this change earlier and it did show some improvements in perf. (due to reduced exits). But as expected, mlockall() on large sized guests adds a considerable delay in boot time. For example, on an 8-socket Westmere box, a 256G guest took an additional ~2+ mins to boot and a 512G guest took an additional ~5+ mins to boot. This is mainly due to the long time spent trying to clear all the pages. 77.96% 35728 qemu-system-x86 [kernel.kallsyms] [k] clear_page_c | --- clear_page_c hugetlb_no_page hugetlb_fault follow_hugetlb_page __get_user_pages __mlock_vma_pages_range __mm_populate vm_mmap_pgoff sys_mmap_pgoff sys_mmap system_call __GI___mmap64 qemu_ram_alloc_from_ptr qemu_ram_alloc memory_region_init_ram pc_memory_init pc_init1 pc_init_pci main __libc_start_main Need to have a faster way to clear pages. Vinod
[Qemu-devel] Large guest boot hangs the host.
Hello, I have been noticing host hangs when trying to boot large guests (>= 40 VCPUs) with the current upstream qemu. The host is running a 3.8.2 kernel. qemu is the latest one from qemu.git. An example qemu command line is listed below... this used to work with a slightly older qemu (about 1.5 weeks ago and on the same host with the 3.8.2 kernel). 'am trying to determine the cause of the host hang... but wanted to check to see if anyone else has seen it... Thanks Vinod /usr/local/bin/qemu-system-x86_64 \ -enable-kvm \ -cpu host \ -smp sockets=8,cores=10,threads=1 \ -numa node,nodeid=0,cpus=0-9,mem=64g \ -numa node,nodeid=1,cpus=10-19,mem=64g \ -numa node,nodeid=2,cpus=20-29,mem=64g \ -numa node,nodeid=3,cpus=30-39,mem=64g \ -numa node,nodeid=4,cpus=40-49,mem=64g \ -numa node,nodeid=5,cpus=50-59,mem=64g \ -numa node,nodeid=6,cpus=60-69,mem=64g \ -numa node,nodeid=7,cpus=70-79,mem=64g \ -m 524288 \ -mem-path /dev/hugepages \ -name vm1 \ -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \ -drive file=/dev/libvirt_lvm/vm1,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native \ -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \ -monitor stdio \ -net nic,model=virtio,macaddr=52:54:00:71:01:01,netdev=nic-0 \ -netdev tap,id=nic-0,ifname=tap0,script=no,downscript=no,vhost=on \ -vnc :4
Re: [Qemu-devel] [PATCH 00/41] Migration cleanups and latency improvements
On 2/15/2013 9:46 AM, Paolo Bonzini wrote: This series does many of the improvements that the migration thread promised. It removes buffering, lets a large amount of code run outside the big QEMU lock, and removes some duplication between incoming and outgoing migration. Patches 1 to 7 are simple cleanups. Patches 8 to 14 simplify the lifecycle of the migration thread and the migration QEMUFile. Patches 15 to 18 add fine-grained locking to the block migration data structures, so that patches 19 to 21 can move RAM/block live migration out of the big QEMU lock. At this point blocking writes will not starve other threads seeking to grab the big QEMU mutex: patches 22 to 24 remove the buffering and clean up the code. Patches 25 to 28 are more cleanups. Patches 29 to 33 improve QEMUFile so that patches 34 and 35 can use QEMUFile to write out data, instead of MigrationState. Patches 36 to 41 then can remove the useless QEMUFile wrapper that remains. Please review and test! You can find these patches at git://github.com/bonzini/qemu.git, branch migration-thread-20130115.
Juan Quintela (1): Rename buffered_ to migration_ Paolo Bonzini (40): migration: simplify while loop migration: always use vm_stop_force_state migration: move more error handling to migrate_fd_cleanup migration: push qemu_savevm_state_cancel out of qemu_savevm_state_* block-migration: remove useless calls to blk_mig_cleanup qemu-file: pass errno from qemu_fflush via f->last_error migration: use qemu_file_set_error to pass error codes back to qemu_savevm_state qemu-file: temporarily expose qemu_file_set_error and qemu_fflush migration: flush all data to fd when buffered_flush is called migration: use qemu_file_set_error migration: simplify error handling migration: do not nest flushing of device data migration: prepare to access s->state outside critical sections migration: cleanup migration (including thread) in the iothread block-migration: remove variables that are never read block-migration: small preparatory changes for locking block-migration: document usage of state across threads block-migration: add lock migration: reorder SaveVMHandlers members migration: run pending/iterate callbacks out of big lock migration: run setup callbacks out of big lock migration: yay, buffering is gone qemu-file: make qemu_fflush and qemu_file_set_error private again migration: eliminate last_round migration: detect error before sleeping migration: remove useless qemu_file_get_error check migration: use qemu_file_rate_limit consistently migration: merge qemu_popen_cmd with qemu_popen qemu-file: fsync a writable stdio QEMUFile qemu-file: check exit status when closing a pipe QEMUFile qemu-file: add writable socket QEMUFile qemu-file: simplify and export qemu_ftell migration: use QEMUFile for migration channel lifetime migration: use QEMUFile for writing outgoing migration data migration: use qemu_ftell to compute bandwidth migration: small changes around rate-limiting migration: move rate limiting to QEMUFile migration: move contents of migration_close to migrate_fd_cleanup
migration: eliminate s->migration_file migration: inline migrate_fd_close arch_init.c | 14 ++- block-migration.c | 167 +++-- docs/migration.txt | 20 +--- include/migration/migration.h | 12 +-- include/migration/qemu-file.h | 21 +-- include/migration/vmstate.h | 21 ++- include/qemu/atomic.h | 1 + include/sysemu/sysemu.h | 6 +- migration-exec.c | 39 +- migration-fd.c | 47 +-- migration-tcp.c | 33 + migration-unix.c | 33 + migration.c | 345 - savevm.c | 214 +++--- util/osdep.c | 6 +- 15 files changed, 367 insertions(+), 612 deletions(-) 'am still in the midst of reviewing the changes but gave them a try. The following are my preliminary observations: - The multi-second freezes at the start of migration of larger guests (i.e. 128GB and higher) aren't observable with the above changes. (The simple timer script that does a gettimeofday every 100ms didn't complain about delays etc.). - Noticed improvements in bandwidth utilization during the iterative pre-copy phase and during the downtime phase. - The total migration time reduced... more for larger guests (Note: The undesirably large actual downtime for larger guests is a different topic that still needs to be addressed independent of these changes). Some details follow below... Thanks Vinod Details: -- Host and Guest kernels are running: 3.8-rc5. Comparing upstream (Qemu 1.4.50) vs. Paolo's branch (Qemu 1.3.92 based) i.e. git clone
Re: [Qemu-devel] [PATCH 0/4] migration stats fixes
On 02/01/2013 02:32 PM, Juan Quintela wrote: Hi migration expected_downtime calculation was removed on commit e4ed1541ac9413eac494a03532e34beaf8a7d1c5. We add the calculation back. Before doing the calculation we do: - expected_downtime initial value is max_downtime. Much, much better initial value than 0. - we move when we measure the time. We used to measure how much it took before we really sent the data. - we introduce the sleep_time concept. While we are sleeping because we have sent all the allowed data for this second, we shouldn't be accounting that time as sending. - the last patch just introduces the re-calculation of expected_downtime. It just changes the stats value. Well, patches 2 & 3 change the bandwidth calculation for migration, but I think that we were undercalculating it enough that it was a bug. Without patches 2 & 3, the expected_downtime for an idle guest was calculated as 80ms (with a 30 ms default target value), and we ended up having a downtime of around 15ms. With these patches applied, we calculate an expected downtime of around 15ms or so, and then we spend around 18ms on downtime. Notice that we only calculate how much it takes to send the rest of the RAM; it just happens that there is some more data to send than what we are calculating. Review, please. Later, Juan.
The following changes since commit 8a55ebf01507ab73cc458cfcd5b9cb856aba0b9e: Merge remote-tracking branch 'afaerber/qom-cpu' into staging (2013-01-31 19:37:33 -0600) are available in the git repository at: git://repo.or.cz/qemu/quintela.git stats.next for you to fetch changes up to 791128495e3546ccc88dd037ea4dfd31eca14a56: migration: calculate expected_downtime (2013-02-01 13:22:37 +0100) Juan Quintela (4): migration: change initial value of expected_downtime migration: calculate end time after we have sent the data migration: don't account sleep time for calculating bandwidth migration: calculate expected_downtime arch_init.c | 1 + include/migration/migration.h | 1 + migration.c | 15 +-- 3 files changed, 15 insertions(+), 2 deletions(-) Reviewed-by: Chegu Vinod chegu_vi...@hp.com
[Qemu-devel] vhost-net thread getting stuck ?
Hello, 'am running into an issue with the latest bits. [ Pl. see below. The vhost thread seems to be getting stuck while trying to memcopy... perhaps a bad address? ] Wondering if this is a known issue or some recent regression? 'am using the latest qemu (from qemu.git) and the latest kvm.git kernel on the host. Started the guest using the following command line /usr/local/bin/qemu-system-x86_64 \ -enable-kvm \ -cpu host \ -smp sockets=8,cores=10,threads=1 \ -numa node,nodeid=0,cpus=0-9,mem=64g \ -numa node,nodeid=1,cpus=10-19,mem=64g \ -numa node,nodeid=2,cpus=20-29,mem=64g \ -numa node,nodeid=3,cpus=30-39,mem=64g \ -numa node,nodeid=4,cpus=40-49,mem=64g \ -numa node,nodeid=5,cpus=50-59,mem=64g \ -numa node,nodeid=6,cpus=60-69,mem=64g \ -numa node,nodeid=7,cpus=70-79,mem=64g \ -m 524288 \ -mem-path /dev/hugepages \ -name vm2 \ -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm2.monitor,server,nowait \ -drive file=/dev/libvirt_lvm2/vm2,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native \ -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \ -monitor stdio \ -net nic,model=virtio,macaddr=52:54:00:71:01:02,netdev=nic-0 \ -netdev tap,id=nic-0,ifname=tap0,script=no,downscript=no,vhost=on \ -vnc :4 Was just doing a basic kernel build in the guest when it hung with the following in the dmesg of the host. Thanks Vinod BUG: soft lockup - CPU#46 stuck for 23s!
[vhost-135220:135231] Modules linked in: kvm_intel kvm fuse ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc autofs4 sunrpc pcc_cpufreq ipv6 vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp crc32c_intel ghash_clmulni_intel microcode pcspkr mlx4_core be2net lpc_ich mfd_core hpilo hpwdt i7core_edac edac_core sg netxen_nic ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul pata_acpi ata_generic ata_piix hpsa lpfc scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: kvm] CPU 46 Pid: 135231, comm: vhost-135220 Not tainted 3.7.0+ #1 HP ProLiant DL980 G7 RIP: 0010:[8147bab0] [8147bab0] skb_flow_dissect+0x1b0/0x440 RSP: 0018:881ffd131bc8 EFLAGS: 0246 RAX: 8a1f7dc70c00 RBX: RCX: 7fa0 RDX: RSI: 881ffd131c68 RDI: 8a1ff1bd6c80 RBP: 881ffd131c58 R08: 881ffd131bf8 R09: 8a1ff1bd6c80 R10: 0010 R11: 0004 R12: 8a1ff1bd6c80 R13: 000b R14: 8147330b R15: 881ffd131b58 FS: () GS:8a1fff98() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 003d5c810dc0 CR3: 009f77c04000 CR4: 27e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process vhost-135220 (pid: 135231, threadinfo 881ffd13, task 881ffcb754c0) Stack: 881ffd131c18 81477b90 00e2 2b289bcc58ce 881ffd131ce4 00a2 00a2 00a2 00a2 881ffd131c88 937e754e Call Trace: [81477b90] ? memcpy_fromiovecend+0x90/0xd0 [8147f3ca] __skb_get_rxhash+0x1a/0xe0 [a03c90f8] tun_get_user+0x468/0x660 [tun] [81090010] ? __sdt_alloc+0x80/0x1a0 [a03c934d] tun_sendmsg+0x5d/0x80 [tun] [a0468e8a] handle_tx+0x34a/0x680 [vhost_net] [a04691f5] handle_tx_kick+0x15/0x20 [vhost_net] [a0466dfc] vhost_worker+0x10c/0x1c0 [vhost_net] [a0466cf0] ? vhost_attach_cgroups_work+0x30/0x30 [vhost_net] [a0466cf0] ? vhost_attach_cgroups_work+0x30/0x30 [vhost_net] [8107ecfe] kthread+0xce/0xe0 [8107ec30] ? 
kthread_freezable_should_stop+0x70/0x70 [815537ac] ret_from_fork+0x7c/0xb0 [8107ec30] ? kthread_freezable_should_stop+0x70/0x70 Code: b6 50 06 48 89 ce 48 c1 ee 20 31 f1 41 89 0e 48 8b 48 20 48 33 48 18 48 89 c8 48 c1 e8 20 31 c1 41 89 4e 04 e9 35 ff ff ff 66 90 0f b6 50 09 e9 1a ff ff ff 0f 1f 80 00 00 00 00 41 8b 44 24 68 [root@hydra11 kvm_rik]# Message from syslogd@hydra11 at Jan 9 13:06:58 ... kernel:BUG: soft lockup - CPU#46 stuck for 22s! [vhost-135220:135231]
Re: [Qemu-devel] vhost-net thread getting stuck ?
On 1/9/2013 8:35 PM, Jason Wang wrote: On 01/10/2013 04:25 AM, Chegu Vinod wrote: Hello, 'am running into an issue with the latest bits. [ Pl. see below. The vhost thread seems to be getting stuck while trying to memcopy... perhaps a bad address? ] Wondering if this is a known issue or some recent regression? Hi: Looks like the issue has been fixed in the following commits, does your tree contain these? 499744209b2cbca66c42119226e5470da3bb7040 and 76fe45812a3b134c39170ca32dfd4b7217d33145. They have been merged into Linus' 3.8-rc tree. I was using the kvm.git kernel (as of today morning); looks like the fixes aren't there yet. Will try Linus's 3.8-rc tree. Thanks! Vinod Thanks 'am using the latest qemu (from qemu.git) and the latest kvm.git kernel on the host. Started the guest using the following command line /usr/local/bin/qemu-system-x86_64 \ -enable-kvm \ -cpu host \ -smp sockets=8,cores=10,threads=1 \ -numa node,nodeid=0,cpus=0-9,mem=64g \ -numa node,nodeid=1,cpus=10-19,mem=64g \ -numa node,nodeid=2,cpus=20-29,mem=64g \ -numa node,nodeid=3,cpus=30-39,mem=64g \ -numa node,nodeid=4,cpus=40-49,mem=64g \ -numa node,nodeid=5,cpus=50-59,mem=64g \ -numa node,nodeid=6,cpus=60-69,mem=64g \ -numa node,nodeid=7,cpus=70-79,mem=64g \ -m 524288 \ -mem-path /dev/hugepages \ -name vm2 \ -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm2.monitor,server,nowait \ -drive file=/dev/libvirt_lvm2/vm2,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native \ -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \ -monitor stdio \ -net nic,model=virtio,macaddr=52:54:00:71:01:02,netdev=nic-0 \ -netdev tap,id=nic-0,ifname=tap0,script=no,downscript=no,vhost=on \ -vnc :4 Was just doing a basic kernel build in the guest when it hung with the following in the dmesg of the host. Thanks Vinod BUG: soft lockup - CPU#46 stuck for 23s!
[vhost-135220:135231] Modules linked in: kvm_intel kvm fuse ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc autofs4 sunrpc pcc_cpufreq ipv6 vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp crc32c_intel ghash_clmulni_intel microcode pcspkr mlx4_core be2net lpc_ich mfd_core hpilo hpwdt i7core_edac edac_core sg netxen_nic ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul pata_acpi ata_generic ata_piix hpsa lpfc scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: kvm] CPU 46 Pid: 135231, comm: vhost-135220 Not tainted 3.7.0+ #1 HP ProLiant DL980 G7 RIP: 0010:[8147bab0] [8147bab0] skb_flow_dissect+0x1b0/0x440 RSP: 0018:881ffd131bc8 EFLAGS: 0246 RAX: 8a1f7dc70c00 RBX: RCX: 7fa0 RDX: RSI: 881ffd131c68 RDI: 8a1ff1bd6c80 RBP: 881ffd131c58 R08: 881ffd131bf8 R09: 8a1ff1bd6c80 R10: 0010 R11: 0004 R12: 8a1ff1bd6c80 R13: 000b R14: 8147330b R15: 881ffd131b58 FS: () GS:8a1fff98() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 003d5c810dc0 CR3: 009f77c04000 CR4: 27e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process vhost-135220 (pid: 135231, threadinfo 881ffd13, task 881ffcb754c0) Stack: 881ffd131c18 81477b90 00e2 2b289bcc58ce 881ffd131ce4 00a2 00a2 00a2 00a2 881ffd131c88 937e754e Call Trace: [81477b90] ? memcpy_fromiovecend+0x90/0xd0 [8147f3ca] __skb_get_rxhash+0x1a/0xe0 [a03c90f8] tun_get_user+0x468/0x660 [tun] [81090010] ? __sdt_alloc+0x80/0x1a0 [a03c934d] tun_sendmsg+0x5d/0x80 [tun] [a0468e8a] handle_tx+0x34a/0x680 [vhost_net] [a04691f5] handle_tx_kick+0x15/0x20 [vhost_net] [a0466dfc] vhost_worker+0x10c/0x1c0 [vhost_net] [a0466cf0] ? vhost_attach_cgroups_work+0x30/0x30 [vhost_net] [a0466cf0] ? vhost_attach_cgroups_work+0x30/0x30 [vhost_net] [8107ecfe] kthread+0xce/0xe0 [8107ec30] ? 
kthread_freezable_should_stop+0x70/0x70 [815537ac] ret_from_fork+0x7c/0xb0 [8107ec30] ? kthread_freezable_should_stop+0x70/0x70 Code: b6 50 06 48 89 ce 48 c1 ee 20 31 f1 41 89 0e 48 8b 48 20 48 33 48 18 48 89 c8 48 c1 e8 20 31 c1 41 89 4e 04 e9 35 ff ff ff 66 90 0f b6 50 09 e9 1a ff ff ff 0f 1f 80 00 00 00 00 41 8b 44 24 68 [root@hydra11 kvm_rik]# Message from syslogd@hydra11 at Jan 9 13:06:58 ... kernel:BUG: soft lockup - CPU#46 stuck for 22s! [vhost-135220:135231] .
Re: [Qemu-devel] Migration ToDo list
On 11/13/2012 8:18 AM, Juan Quintela wrote: Hi If you have anything else to put, please add. Migration Thread * Plan is to integrate it as one of the first things in December (me) * Remove copies with buffered file (me) Bitmap Optimization * Finish moving to individual bitmaps for migration/vga/code * Make sure we don't copy things around * Shared memory bitmap with kvm? * Move to 2MB pages bitmap and then fine grain? If it's not already implied in the above... the long freezes observed at the start of the migration need to be addressed (most likely related to the BQL?). QIDL * Review the patches (me) PostCopy * Review patches? * See what we can already integrate? I remember from last year that we could integrate the 1st third or so RDMA * Send RDMA/tcp/ library they already have (Benoit) * This is required for postcopy * This can be used for precopy Not sure if what Benoit has can be directly used for pre-copy also. As Paolo said... we need to look at RDS APIs for pre-copy. ('have just started looking at the same). Would like to know if SDP can be used... General * Change protocol to: a) being always 16byte aligned (paolo said that is faster) b) do scatter/gather of the pages? Control of where the migration thread(s) run... -- BTW, has anyone tried doing multiple guest migrations from a host? Are there limitations (enforced via higher level management tools) as to how many guests can be migrated at once (in an attempt to quickly evacuate a flaky host)? Vinod Fault Tolerance * That is built on top of migration code, but I have nothing to add. Any more ideas? Later, Juan.
Re: [Qemu-devel] [PATCH 00/18] Migration thread lite (20121029)
On 10/29/2012 9:21 AM, Vinod, Chegu wrote: Date: Mon, 29 Oct 2012 15:11:25 +0100 From: Juan Quintela quint...@redhat.com To: qemu-devel@nongnu.org Cc: owass...@redhat.com, mtosa...@redhat.com, a...@redhat.com, pbonz...@redhat.com Subject: [Qemu-devel] [PATCH 00/18] Migration thread lite (20121029) Hi After discussing with Anthony and Paolo, this is the minimal migration thread support that we can do for 1.3. - fixed all the comments (thanks eric, paolo and orit). - buffered_file.c remains until 1.4. - testing for Vinod showed very nice results, that is why the bitmap handling optimizations remain. Hi Juan, Is there any value in calling the migration_bitmap_sync() routine in ram_save_setup()? All the pages were marked as dirty to begin with... so can't we just assume that all pages need to be sent and proceed? migration_bitmap_sync() still remains an expensive call, and the very first call seems to be ~3x more expensive than the subsequent calls. For large guests (128G guests) this is multiple seconds... and it freezes the OS instance. Thanks Vinod Note: Writes have become blocking, and I had to remove the feature in qemu-sockets.c. Checked that migration was the only user of that feature. If new users appear, they just need to add the socket_set_nonblock() by hand. Please, review. Thanks, Juan.
The following changes since commit 50cd72148211c5e5f22ea2519d19ce024226e61f: hw/xtensa_sim: get rid of intermediate xtensa_sim_init (2012-10-27 15:04:00 +) are available in the git repository at: http://repo.or.cz/r/qemu/quintela.git migration-thread-20121029 for you to fetch changes up to 2c74654f19efc7db35117a87c0d9db4776931e1b: ram: optimize migration bitmap walking (2012-10-29 14:14:28 +0100) Juan Quintela (15): buffered_file: Move from using a timer to use a thread migration: make qemu_fopen_ops_buffered() return void migration: stop all cpus correctly migration: make writes blocking migration: remove unfreeze logic migration: take finer locking buffered_file: Unfold the trick to restart generating migration data buffered_file: don't flush on put buffer buffered_file: unfold buffered_append in buffered_put_buffer savevm: New save live migration method: pending migration: include qemu-file.h migration-fd: remove duplicate include memory: introduce memory_region_test_and_clear_dirty ram: Use memory_region_test_and_clear_dirty ram: optimize migration bitmap walking Paolo Bonzini (1): split MRU ram list Umesh Deshpande (2): add a version number to ram_list protect the ramlist with a separate mutex arch_init.c | 115 --- block-migration.c | 49 +--- buffered_file.c | 130 +- buffered_file.h | 2 +- cpu-all.h | 15 ++- exec.c| 44 +++--- memory.c | 16 +++ memory.h | 18 migration-exec.c | 3 +- migration-fd.c| 4 +- migration-tcp.c | 2 +- migration-unix.c | 2 +- migration.c | 100 - migration.h | 4 +- qemu-file.h | 5 --- qemu-sockets.c| 4 -- savevm.c | 24 +++--- sysemu.h | 1 + vmstate.h | 1 + 19 files changed, 280 insertions(+), 259 deletions(-) -- 1.7.11.7 .
Re: [Qemu-devel] Unable to enable +x2apic for the guest cpus...
On 10/13/2012 12:32 AM, Gleb Natapov wrote:
On Fri, Oct 12, 2012 at 07:38:42PM -0700, Chegu Vinod wrote:
Hello,
I am using a very recent upstream version of qemu.git along with kvm.git kernels (in the host and guest). [Guest kernel had been compiled with CONFIG_X86_X2APIC and CONFIG_IRQ_REMAP both set.]
When I attempt to start a guest with the +x2apic flag (pl. see the qemu cmd line below) I end up with a hang of qemu and a kernel BUG at arch/x86/kvm/lapic.c:159! Pl. see the attached screen shot of the console for additional info. I am able to boot the same guest without the +x2apic flag in the qemu cmd line.
Not sure if this is an issue (or) if I have something incorrectly specified in the qemu cmd line? If it's the latter... pl. advise the correct usage for enabling x2apic for the guest cpus with the upstream bits.

This is a bug in how the ldr in x2apic mode is calculated. Try the following patch:

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c6e6b72..43e9fad 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1311,7 +1311,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 	vcpu->arch.apic_base = value;
 	if (apic_x2apic_mode(apic)) {
 		u32 id = kvm_apic_id(apic);
-		u32 ldr = ((id & ~0xf) << 16) | (1 << (id & 0xf));
+		u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));
 		kvm_apic_set_ldr(apic, ldr);
 	}
 	apic->base_address = apic->vcpu->arch.apic_base
--
Gleb.

Retried with the above patch and the guest is booting fine. (The x2apic flag shows up in the guest's /proc/cpuinfo.) Was this a recent regression? Thanks! Vinod
[Qemu-devel] Fwd: Re: [RFC 0/7] Migration stats
Forwarding to the alias. Thanks, Vinod

Original Message
Subject: Re: [RFC 0/7] Migration stats
Date: Mon, 13 Aug 2012 15:20:10 +0200
From: Juan Quintela quint...@redhat.com
Reply-To: quint...@redhat.com
To: Chegu Vinod chegu_vi...@hp.com
CC: [ snip ]

- Prints the real downtime that we have had. Really, it prints the total downtime of the complete phase, but the downtime also includes the last ram_iterate phase. Working on fixing that one.

Good one.

[...]

What do I want to know:
- is there any stat that you want? Once here, adding a new one should be easy.

A few suggestions:
a) Total amount of time spent syncing up dirty bitmap logs for the total duration of migration.

I can add that one, it is not difficult. Notice that in the future I expect to do the syncs in smaller chunks (but that is pie in the sky).

b) Actual [average?] bandwidth that was used as compared to the allocated bandwidth... (I am wanting to know how folks are observing near line rate on a 10Gig... when I am not...).

Printing the average bandwidth is easy. The hardware one is difficult to get from inside one application.

'think it would be useful to know the approximate amount of [host] cpu time that got used up by the migration related thread(s) and any related host side services (like servicing the I/O interrupts while driving traffic through the network). I assume there are alternate methods to derive all these (and we don't need to overload the migration stats?)

This one is not easy to do from inside qemu. Much easier to get from the outside. As far as I know, it is not easy to monitor cpu usage from inside the cpu that we want to measure.

Thanks for the comments, Juan.
Re: [Qemu-devel] FW: Fwd: [RFC 00/27] Migration thread (WIP)
On 7/27/2012 7:11 AM, Vinod, Chegu wrote: -Original Message- From: Juan Quintela [mailto:quint...@redhat.com] Sent: Friday, July 27, 2012 4:06 AM To: Vinod, Chegu Cc: qemu-devel@nongnu.org; Orit Wasserman Subject: Re: Fwd: [RFC 00/27] Migration thread (WIP) Chegu Vinod chegu_vi...@hp.com wrote: On 7/26/2012 11:41 AM, Chegu Vinod wrote: Original Message Subject: [Qemu-devel] [RFC 00/27] Migration thread (WIP) Date: Tue, 24 Jul 2012 20:36:25 +0200 From: Juan Quintela quint...@redhat.com To: qemu-devel@nongnu.org Hi This series are on top of the migration-next-v5 series just posted. First of all, this is an RFC/Work in progress. Just a lot of people asked for it, and I would like review of the design. Hello, Thanks for sharing this early/WIP version for evaluation. Still in the middle of code review..but wanted to share a couple of quick observations. 'tried to use it to migrate a 128G/10VCPU guest (speed set to 10G and downtime 2s). Once with no workload (i.e. idle guest) and the second was with a SpecJBB running in the guest. The idle guest case seemed to migrate fine... capabilities: xbzrle: off Migration status: completed transferred ram: 3811345 kbytes remaining ram: 0 kbytes total ram: 134226368 kbytes total time: 199743 milliseconds In the case of the SpecJBB I ran into issues during stage 3...the source host's qemu and the guest hung. I need to debug this more... (if already have some hints pl. let me know.). capabilities: xbzrle: off Migration status: active transferred ram: 127618578 kbytes remaining ram: 2386832 kbytes total ram: 134226368 kbytes total time: 526139 milliseconds (qemu) qemu_savevm_state_complete called qemu_savevm_state_complete calling ram_save_complete --- hung somewhere after this ('need to get more info). Appears to be some race condition...as there are cases when it hangs and in some cases it succeeds. Weird guess, try to use less vcpus, same ram. Ok..will try that. The way that we stop cpus is _hacky_ to say it somewhere. 
Will try to think about that part.

Ok.

Thanks for the testing. All my testing has been done with 8GB guests and 2 vcpus. Will try with more vcpus to see if it makes a difference.

(qemu) info migrate
capabilities: xbzrle: off
Migration status: completed
transferred ram: 129937687 kbytes
remaining ram: 0 kbytes
total ram: 134226368 kbytes
total time: 543228 milliseconds

Humm, _that_ is more strange. This means that it finished.

There are cases where the migration is finishing just fine... even with larger guest configurations (256G/20VCPUs).

Could you run qemu under gdb and send me the stack traces? I don't know your gdb thread kung-fu, so here are the instructions just in case:

gdb --args <exact qemu command line you used>
C-c to break when it hangs
(gdb) info threads    - you see all the threads running
(gdb) thread 1        - or whatever other number
(gdb) bt              - the backtrace of that thread

The hang is intermittent... I ran it 4-5 times (under gdb) just now and I didn't see the issue :-(

I am especially interested in the backtrace of the migration thread and of the iothread.

Will keep re-trying with different configs and see if I get lucky in reproducing it (under gdb). Vinod

Thanks, Juan.

Need to review/debug... Vinod
---
As with the non-migration-thread version, the SpecJBB workload completed before the migration attempted to move to stage 3 (i.e. it didn't converge while the workload was still active). BTW, with this version of the bits (i.e. while running SpecJBB, which is supposed to dirty quite a bit of memory) I noticed that there wasn't much change in the b/w usage of the dedicated 10Gb private network link (it was still ~1.5-3.0Gb/sec). Expected this to be a little better since we have a separate thread... not sure what else is in play here? (numa locality of where the migration thread runs or some other basic tuning in the implementation?) 'have a hi-level design question... 
(perhaps folks have already thought about it..and categorized it as potential future optimization..?) Would it be possible to off load the iothread completely [from all migration related activity] and have one thread
Re: [Qemu-devel] Fwd: [RFC 00/27] Migration thread (WIP)
Original Message Subject:[Qemu-devel] [RFC 00/27] Migration thread (WIP) Date: Tue, 24 Jul 2012 20:36:25 +0200 From: Juan Quintela quint...@redhat.com To: qemu-devel@nongnu.org Hi This series are on top of the migration-next-v5 series just posted. First of all, this is an RFC/Work in progress. Just a lot of people asked for it, and I would like review of the design. Hello, Thanks for sharing this early/WIP version for evaluation. Still in the middle of code review..but wanted to share a couple of quick observations. 'tried to use it to migrate a 128G/10VCPU guest (speed set to 10G and downtime 2s). Once with no workload (i.e. idle guest) and the second was with a SpecJBB running in the guest. The idle guest case seemed to migrate fine... capabilities: xbzrle: off Migration status: completed transferred ram: 3811345 kbytes remaining ram: 0 kbytes total ram: 134226368 kbytes total time: 199743 milliseconds In the case of the SpecJBB I ran into issues during stage 3...the source host's qemu and the guest hung. I need to debug this more... (if already have some hints pl. let me know.). capabilities: xbzrle: off Migration status: active transferred ram: 127618578 kbytes remaining ram: 2386832 kbytes total ram: 134226368 kbytes total time: 526139 milliseconds (qemu) qemu_savevm_state_complete called qemu_savevm_state_complete calling ram_save_complete --- hung somewhere after this ('need to get more info). --- As with the non-migration-thread version the Specjbb workload completed before the migration attempted to move to stage 3 (i.e. didn't converge while the workload was still active). BTW, with this version of the bits (i.e. while running SpecJBB which is supposed to dirty quite a bit of memory) I noticed that there wasn't much change in the b/w usage of the dedicated 10Gb private network link (It was still ~1.5-3.0Gb/sec). Expected this to be a little better since we have a separate thread... not sure what else is in play here ? 
(numa locality of where the migration thread runs or some other basic tuning in the implementation?) 'have a hi-level design question... (perhaps folks have already thought about it.. and categorized it as a potential future optimization..?) Would it be possible to off-load the iothread completely [from all migration related activity] and have one thread (with the appropriate protection) get involved with getting the list of the dirty pages? Have one or more threads dedicated to trying to push multiple streams of data to saturate the allocated network bandwidth? This may help in large + busy guests. Comments? There are perhaps other implications of doing all of this (like burning more host cpu cycles), but perhaps this can be configurable based on the user's needs... e.g. fewer but larger guests on a host with no over-subscription. Thanks Vinod

It does:
- get a new bitmap for migration, and that bitmap uses 1 bit per page
- it unfolds migration_buffered_file. Only one user existed.
- it simplifies buffered_file a lot.
- About the migration thread, special attention was given to trying to get the series reviewable (reviewers will tell me if I got it).

Basic design:
- we create a new thread instead of a timer function
- we move all the migration work to that thread (but run everything except the waits with the iothread lock held)
- we move all the writing outside the iothread lock. i.e. we walk the state with the iothread lock held and copy everything to one buffer; then we write that buffer to the sockets outside the iothread lock.
- once here, we move to writing synchronously to the sockets.
- this allows us to simplify quite a lot.

And basically, that is it. Notice that we still do the iterate page walking with the iothread lock held. Light testing shows that we get similar speed and latencies as without the thread (notice that almost no optimizations have been done here yet).

Apart from the review:
- Are there any locking issues that I have missed (I guess so)?
- stop all cpus correctly. 
vm_stop should be called from the iothread; I use the trick of using a bottom half to get that working correctly, but this _implementation_ is ugly as hell. Is there an easy way of doing it?
- Do I really have to export last_ram_offset()? Is there no other way of knowing the amount of RAM?

Known issues:
- for some reason, when it has to start a 2nd round of bitmap handling, it decides to dirty all pages. Haven't found yet why this happens.

If you can test it and tell me where it breaks, that would also help. Work is based on Umesh's thread work, and work that Paolo Bonzini did on top of that. All the migration thread code was done from scratch because I was unable to debug why it was failing, but it owes a lot to the previous design. Thanks in advance, Juan. The following changes since commit a21143486b9c6d7a50b7b62877c02b3c686943cb: Merge remote-tracking branch
Re: [Qemu-devel] Fwd: [RFC 00/27] Migration thread (WIP)
On 7/26/2012 11:41 AM, Chegu Vinod wrote: Original Message Subject:[Qemu-devel] [RFC 00/27] Migration thread (WIP) Date: Tue, 24 Jul 2012 20:36:25 +0200 From: Juan Quintela quint...@redhat.com To: qemu-devel@nongnu.org Hi This series are on top of the migration-next-v5 series just posted. First of all, this is an RFC/Work in progress. Just a lot of people asked for it, and I would like review of the design. Hello, Thanks for sharing this early/WIP version for evaluation. Still in the middle of code review..but wanted to share a couple of quick observations. 'tried to use it to migrate a 128G/10VCPU guest (speed set to 10G and downtime 2s). Once with no workload (i.e. idle guest) and the second was with a SpecJBB running in the guest. The idle guest case seemed to migrate fine... capabilities: xbzrle: off Migration status: completed transferred ram: 3811345 kbytes remaining ram: 0 kbytes total ram: 134226368 kbytes total time: 199743 milliseconds In the case of the SpecJBB I ran into issues during stage 3...the source host's qemu and the guest hung. I need to debug this more... (if already have some hints pl. let me know.). capabilities: xbzrle: off Migration status: active transferred ram: 127618578 kbytes remaining ram: 2386832 kbytes total ram: 134226368 kbytes total time: 526139 milliseconds (qemu) qemu_savevm_state_complete called qemu_savevm_state_complete calling ram_save_complete --- hung somewhere after this ('need to get more info). Appears to be some race condition...as there are cases when it hangs and in some cases it succeeds. (qemu) info migrate capabilities: xbzrle: off Migration status: completed transferred ram: 129937687 kbytes remaining ram: 0 kbytes total ram: 134226368 kbytes total time: 543228 milliseconds Need to review/debug... Vinod --- As with the non-migration-thread version the Specjbb workload completed before the migration attempted to move to stage 3 (i.e. didn't converge while the workload was still active). 
BTW, with this version of the bits (i.e. while running SpecJBB, which is supposed to dirty quite a bit of memory) I noticed that there wasn't much change in the b/w usage of the dedicated 10Gb private network link (it was still ~1.5-3.0Gb/sec). Expected this to be a little better since we have a separate thread... not sure what else is in play here? (numa locality of where the migration thread runs or some other basic tuning in the implementation?) 'have a hi-level design question... (perhaps folks have already thought about it.. and categorized it as a potential future optimization..?) Would it be possible to off-load the iothread completely [from all migration related activity] and have one thread (with the appropriate protection) get involved with getting the list of the dirty pages? Have one or more threads dedicated to trying to push multiple streams of data to saturate the allocated network bandwidth? This may help in large + busy guests. Comments? There are perhaps other implications of doing all of this (like burning more host cpu cycles), but perhaps this can be configurable based on the user's needs... e.g. fewer but larger guests on a host with no over-subscription. Thanks Vinod

It does:
- get a new bitmap for migration, and that bitmap uses 1 bit per page
- it unfolds migration_buffered_file. Only one user existed.
- it simplifies buffered_file a lot.
- About the migration thread, special attention was given to trying to get the series reviewable (reviewers will tell me if I got it).

Basic design:
- we create a new thread instead of a timer function
- we move all the migration work to that thread (but run everything except the waits with the iothread lock held)
- we move all the writing outside the iothread lock. i.e. we walk the state with the iothread lock held and copy everything to one buffer; then we write that buffer to the sockets outside the iothread lock.
- once here, we move to writing synchronously to the sockets. 
- this allows us to simplify quite a lot.

And basically, that is it. Notice that we still do the iterate page walking with the iothread lock held. Light testing shows that we get similar speed and latencies as without the thread (notice that almost no optimizations have been done here yet).

Apart from the review:
- Are there any locking issues that I have missed (I guess so)?
- stop all cpus correctly. vm_stop should be called from the iothread; I use the trick of using a bottom half to get that working correctly, but this _implementation_ is ugly as hell. Is there an easy way of doing it?
- Do I really have to export last_ram_offset()? Is there no other way of knowing the amount of RAM?

Known issues:
- for some reason, when it has to start a 2nd round of bitmap handling, it decides to dirty all pages. Haven't found yet why this happens.

If you can test it and tell me where it breaks, that would also help
[Qemu-devel] [PATCH v4] Fixes related to processing of qemu's -numa option
Changes since v3:
- using bitmap_set() instead of set_bit() in the numa_add() routine.
- removed the call to bitmap_zero() since bitmap_new() also zeroes the bitmap.
- Rebased to the latest qemu.

Changes since v2:
- Using unsigned long * for node_cpumask[].
- Use bitmap_new() instead of g_malloc0() for allocation.
- Don't rely on max_cpus since it may not be initialized before the numa related qemu options are parsed and processed.
Note: Continuing to use a new constant for allocation of the mask. (This constant is currently set to 255 since with an 8-bit APIC ID, VCPUs can range from 0-254 in a guest. APIC ID 255 (0xFF) is reserved for broadcast.)

Changes since v1:
- Use bitmap functions that are already in qemu (instead of the cpu_set_t macros from sched.h)
- Added a check for endvalue >= max_cpus.
- Fix to address the round-robin assignment when cpus are not explicitly specified.
---
v1:
--
The -numa option to qemu is used to create [fake] numa nodes and expose them to the guest OS instance. There are a couple of issues with the -numa option:
a) The max VCPUs that can be specified for a guest while using qemu's -numa option is 64. Due to a typecasting issue, when the number of VCPUs is > 32 the VCPUs don't show up under the specified [fake] numa nodes.
b) KVM currently has support for 160 VCPUs per guest. The qemu -numa option only has support for up to 64 VCPUs per guest.
This patch addresses these two issues. Below are examples of (a) and (b).
a) 32 VCPUs are specified with the -numa option:
/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
71:01:01 \
-net tap,ifname=tap0,script=no,downscript=no \
-vnc :4 ... 
Upstream qemu : -- QEMU 1.1.50 monitor - type 'help' for more information (qemu) info numa 6 nodes node 0 cpus: 0 1 2 3 4 5 6 7 8 9 32 33 34 35 36 37 38 39 40 41 node 0 size: 131072 MB node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51 node 1 size: 131072 MB node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59 node 2 size: 131072 MB node 3 cpus: 30 node 3 size: 131072 MB node 4 cpus: node 4 size: 131072 MB node 5 cpus: 31 node 5 size: 131072 MB With the patch applied : --- QEMU 1.1.50 monitor - type 'help' for more information (qemu) info numa 6 nodes node 0 cpus: 0 1 2 3 4 5 6 7 8 9 node 0 size: 131072 MB node 1 cpus: 10 11 12 13 14 15 16 17 18 19 node 1 size: 131072 MB node 2 cpus: 20 21 22 23 24 25 26 27 28 29 node 2 size: 131072 MB node 3 cpus: 30 31 32 33 34 35 36 37 38 39 node 3 size: 131072 MB node 4 cpus: 40 41 42 43 44 45 46 47 48 49 node 4 size: 131072 MB node 5 cpus: 50 51 52 53 54 55 56 57 58 59 node 5 size: 131072 MB b) 64 VCPUs specified with -numa option: /usr/local/bin/qemu-system-x86_64 \ -enable-kvm \ -cpu Westmere,+rdtscp,+pdpe1gb,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+d-vnc :4 ... Upstream qemu : -- only 63 CPUs in NUMA mode supported. only 64 CPUs in NUMA mode supported. 
QEMU 1.1.50 monitor - type 'help' for more information (qemu) info numa 8 nodes node 0 cpus: 6 7 8 9 38 39 40 41 70 71 72 73 node 0 size: 65536 MB node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51 74 75 76 77 78 79 node 1 size: 65536 MB node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59 60 61 node 2 size: 65536 MB node 3 cpus: 30 62 node 3 size: 65536 MB node 4 cpus: node 4 size: 65536 MB node 5 cpus: node 5 size: 65536 MB node 6 cpus: 31 63 node 6 size: 65536 MB node 7 cpus: 0 1 2 3 4 5 32 33 34 35 36 37 64 65 66 67 68 69 node 7 size: 65536 MB With the patch applied : --- QEMU 1.1.50 monitor - type 'help' for more information (qemu) info numa 8 nodes node 0 cpus: 0 1 2 3 4 5 6 7 8 9 node 0 size: 65536 MB node 1 cpus: 10 11 12 13 14 15 16 17 18 19 node 1 size: 65536 MB node 2 cpus: 20 21 22 23 24 25 26 27 28 29 node 2 size: 65536 MB node 3 cpus: 30 31 32 33 34 35 36 37 38 39 node 3 size: 65536 MB node 4 cpus: 40 41 42 43 44 45 46 47 48 49 node 4 size: 65536 MB node 5 cpus: 50 51 52 53 54 55 56 57 58 59 node 5 size: 65536 MB node 6 cpus: 60 61 62 63 64 65 66 67 68 69 node 6 size: 65536 MB node 7 cpus: 70 71 72 73 74 75 76 77 78 79 Signed-off-by: Chegu Vinod chegu_vi...@hp.com, Jim Hull jim.h...@hp.com, Craig Hada craig.h...@hp.com --- cpus.c |3 ++- hw/pc.c |3 ++- sysemu.h |3 ++- vl.c | 43 +-- 4 files changed, 27 insertions(+), 25 deletions(-) diff --git a/cpus.c b/cpus.c index b182b3d..acccd08 100644 --- a/cpus.c +++ b/cpus.c @@ -36,6 +36,7 @@ #include cpus.h #include qtest.h #include main-loop.h +#include bitmap.h #ifndef _WIN32 #include compatfd.h @@ -1145,7 +1146,7 @@ void set_numa_modes(void) for (env = first_cpu; env != NULL; env = env-next_cpu) { for (i = 0; i nb_numa_nodes; i
[Qemu-devel] [PATCH v3] Fixes related to processing of qemu's -numa option
Signed-off-by: Chegu Vinod chegu_vi...@hp.com, Jim Hull jim.h...@hp.com, Craig Hada craig.h...@hp.com
Tested-by: Eduardo Habkost ehabkost at redhat.com
---
 cpus.c   |  3 ++-
 hw/pc.c  |  3 ++-
 sysemu.h |  3 ++-
 vl.c     | 48 ++--
 4 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/cpus.c b/cpus.c
index b182b3d..acccd08 100644
--- a/cpus.c
+++ b/cpus.c
@@ -36,6 +36,7 @@
 #include "cpus.h"
 #include "qtest.h"
 #include "main-loop.h"
+#include "bitmap.h"

 #ifndef _WIN32
 #include "compatfd.h"
@@ -1145,7 +1146,7 @@ void set_numa_modes(void)
     for (env = first_cpu; env != NULL; env = env->next_cpu) {
         for (i = 0; i < nb_numa_nodes; i++) {
-            if (node_cpumask[i] & (1 << env->cpu_index)) {
+            if (test_bit(env->cpu_index, node_cpumask[i])) {
                 env->numa_node = i;
             }
         }
diff --git a/hw/pc.c b/hw/pc.c
index c7e9ab3..2edcc07 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -48,6 +48,7 @@
 #include "memory.h"
 #include "exec-memory.h"
 #include "arch_init.h"
+#include "bitmap.h"

 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
@@ -639,7 +640,7 @@ static void *bochs_bios_init(void)
     numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes);
     for (i = 0; i < max_cpus; i++) {
         for (j = 0; j < nb_numa_nodes; j++) {
-            if (node_cpumask[j] & (1 << i)) {
+            if (test_bit(i, node_cpumask[j])) {
                 numa_fw_cfg[i + 1] = cpu_to_le64(j);
                 break;
             }
diff --git a/sysemu.h b/sysemu.h
index bc2c788..2ce63fc 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -133,9 +133,10 @@ extern uint8_t qemu_extra_params_fw[2];
 extern QEMUClock *rtc_clock;

 #define MAX_NODES 64
+#define MAX_CPUMASK_BITS 255
 extern int nb_numa_nodes;
 extern uint64_t node_mem[MAX_NODES];
-extern uint64_t node_cpumask[MAX_NODES];
+extern unsigned long *node_cpumask[MAX_NODES];

 #define MAX_OPTION_ROMS 16
 typedef struct QEMUOptionRom {
diff --git a/vl.c b/vl.c
index 1329c30..fdd7b74 100644
--- a/vl.c
+++ b/vl.c
@@ -28,6 +28,7 @@
 #include <errno.h>
 #include <sys/time.h>
 #include <zlib.h>
+#include "bitmap.h"

 /* Needed early for CONFIG_BSD etc. */
 #include "config-host.h"
@@ -240,7 +241,7 @@ QTAILQ_HEAD(, FWBootEntry) fw_boot_order = QTAILQ_HEAD_INITIALIZER(fw_boot_order
 int nb_numa_nodes;
 uint64_t node_mem[MAX_NODES];
-uint64_t node_cpumask[MAX_NODES];
+unsigned long *node_cpumask[MAX_NODES];

 uint8_t qemu_uuid[16];
@@ -950,6 +951,9 @@ static void numa_add(const char *optarg)
     char *endptr;
     unsigned long long value, endvalue;
     int nodenr;
+    int i;
+
+    value = endvalue = 0ULL;

     optarg = get_opt_name(option, 128, optarg, ',') + 1;
     if (!strcmp(option, "node")) {
@@ -970,27 +974,25 @@ static void numa_add(const char *optarg)
         }
         node_mem[nodenr] = sval;
     }
-    if (get_param_value(option, 128, "cpus", optarg) == 0) {
-        node_cpumask[nodenr] = 0;
-    } else {
+    if (get_param_value(option, 128, "cpus", optarg) != 0) {
         value = strtoull(option, &endptr, 10);
-        if (value >= 64) {
-            value = 63;
-            fprintf(stderr, "only 64 CPUs in NUMA mode supported.\n");
+        if (*endptr == '-') {
+            endvalue = strtoull(endptr+1, &endptr, 10);
         } else {
-            if (*endptr == '-') {
-                endvalue = strtoull(endptr+1, &endptr, 10);
-                if (endvalue >= 63) {
-                    endvalue = 62;
-                    fprintf(stderr,
-                            "only 63 CPUs in NUMA mode supported.\n");
-                }
-                value = (2ULL << endvalue) - (1ULL << value);
-            } else {
-                value = 1ULL << value;
-            }
+            endvalue = value;
+        }
+
+        if (!(endvalue < MAX_CPUMASK_BITS)) {
+            endvalue = MAX_CPUMASK_BITS - 1;
+            fprintf(stderr,
+                    "A max of %d CPUs are supported in a guest\n",
+                    MAX_CPUMASK_BITS);
+        }
+
+        for (i = value; i <= endvalue; ++i) {
+            set_bit(i, node_cpumask[nodenr]);
         }
-        node_cpumask[nodenr] = value;
     }
     nb_numa_nodes++;
 }
@@ -2331,7 +2333,8 @@ int main(int argc, char **argv, char **envp)
     for (i = 0; i < MAX_NODES; i++) {
         node_mem[i] = 0;
-        node_cpumask[i] = 0;
+        node_cpumask[i] = bitmap_new(MAX_CPUMASK_BITS);
+        bitmap_zero(node_cpumask[i], MAX_CPUMASK_BITS);
     }

     nb_numa_nodes = 0;
@@ -3469,8 +3472,9 @@ int main(int argc, char **argv, char **envp)
     }

     for (i = 0; i < nb_numa_nodes; i++) {
-        if (node_cpumask[i] != 0)
+        if (!bitmap_empty(node_cpumask[i], MAX_CPUMASK_BITS)) {
             break;
+        }
     }
     /* assigning the VCPUs round-robin is easier to implement
[Qemu-devel] [PATCH v2] Fixes related to processing of qemu's -numa option
From: root r...@hydra11.kio Changes since v1: - Use bitmap functions that are already in qemu (instead of cpu_set_t macro's) - Added a check for endvalue = max_cpus. - Fix to address the round-robbing assignment (for the case when cpu's are not explicitly specified) Note: Continuing to use a new constant for allocation of the cpumask (max_cpus was not getting set early enough). --- v1: -- The -numa option to qemu is used to create [fake] numa nodes and expose them to the guest OS instance. There are a couple of issues with the -numa option: a) Max VCPU's that can be specified for a guest while using the qemu's -numa option is 64. Due to a typecasting issue when the number of VCPUs is 32 the VCPUs don't show up under the specified [fake] numa nodes. b) KVM currently has support for 160VCPUs per guest. The qemu's -numa option has only support for upto 64VCPUs per guest. This patch addresses these two issues. Below are examples of (a) and (b) a) 32 VCPUs are specified with the -numa option: /usr/local/bin/qemu-system-x86_64 \ -enable-kvm \ 71:01:01 \ -net tap,ifname=tap0,script=no,downscript=no \ -vnc :4 ... 
Upstream qemu : -- QEMU 1.1.50 monitor - type 'help' for more information (qemu) info numa 6 nodes node 0 cpus: 0 1 2 3 4 5 6 7 8 9 32 33 34 35 36 37 38 39 40 41 node 0 size: 131072 MB node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51 node 1 size: 131072 MB node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59 node 2 size: 131072 MB node 3 cpus: 30 node 3 size: 131072 MB node 4 cpus: node 4 size: 131072 MB node 5 cpus: 31 node 5 size: 131072 MB With the patch applied : --- QEMU 1.1.50 monitor - type 'help' for more information (qemu) info numa 6 nodes node 0 cpus: 0 1 2 3 4 5 6 7 8 9 node 0 size: 131072 MB node 1 cpus: 10 11 12 13 14 15 16 17 18 19 node 1 size: 131072 MB node 2 cpus: 20 21 22 23 24 25 26 27 28 29 node 2 size: 131072 MB node 3 cpus: 30 31 32 33 34 35 36 37 38 39 node 3 size: 131072 MB node 4 cpus: 40 41 42 43 44 45 46 47 48 49 node 4 size: 131072 MB node 5 cpus: 50 51 52 53 54 55 56 57 58 59 node 5 size: 131072 MB b) 64 VCPUs specified with -numa option: /usr/local/bin/qemu-system-x86_64 \ -enable-kvm \ -cpu Westmere,+rdtscp,+pdpe1gb,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme \ -smp sockets=8,cores=10,threads=1 \ -numa node,nodeid=0,cpus=0-9,mem=64g \ -numa node,nodeid=1,cpus=10-19,mem=64g \ -numa node,nodeid=2,cpus=20-29,mem=64g \ -numa node,nodeid=3,cpus=30-39,mem=64g \ -numa node,nodeid=4,cpus=40-49,mem=64g \ -numa node,nodeid=5,cpus=50-59,mem=64g \ -numa node,nodeid=6,cpus=60-69,mem=64g \ -numa node,nodeid=7,cpus=70-79,mem=64g \ -m 524288 \ -name vm1 \ -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \ -drive file=/dev/libvirt_lvm/vm.img,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native \ -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \ -monitor stdio \ -net nic,macaddr=52:54:00:71:01:01 \ -net tap,ifname=tap0,script=no,downscript=no \ -vnc :4 ... 
Upstream qemu : -- only 63 CPUs in NUMA mode supported. only 64 CPUs in NUMA mode supported. QEMU 1.1.50 monitor - type 'help' for more information (qemu) info numa 8 nodes node 0 cpus: 6 7 8 9 38 39 40 41 70 71 72 73 node 0 size: 65536 MB node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51 74 75 76 77 78 79 node 1 size: 65536 MB node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59 60 61 node 2 size: 65536 MB node 3 cpus: 30 62 node 3 size: 65536 MB node 4 cpus: node 4 size: 65536 MB node 5 cpus: node 5 size: 65536 MB node 6 cpus: 31 63 node 6 size: 65536 MB node 7 cpus: 0 1 2 3 4 5 32 33 34 35 36 37 64 65 66 67 68 69 node 7 size: 65536 MB With the patch applied : --- QEMU 1.1.50 monitor - type 'help' for more information (qemu) info numa 8 nodes node 0 cpus: 0 1 2 3 4 5 6 7 8 9 node 0 size: 65536 MB node 1 cpus: 10 11 12 13 14 15 16 17 18 19 node 1 size: 65536 MB node 2 cpus: 20 21 22 23 24 25 26 27 28 29 node 2 size: 65536 MB node 3 cpus: 30 31 32 33 34 35 36 37 38 39 node 3 size: 65536 MB node 4 cpus: 40 41 42 43 44 45 46 47 48 49 node 4 size: 65536 MB node 5 cpus: 50 51 52 53 54 55 56 57 58 59 node 5 size: 65536 MB node 6 cpus: 60 61 62 63 64 65 66 67 68 69 node 6 size: 65536 MB node 7 cpus: 70 71 72 73 74 75 76 77 78 79 Signed-off-by: Chegu Vinod chegu_vi...@hp.com, Jim Hull jim.h...@hp.com, Craig Hada craig.h...@hp.com --- cpus.c |3 ++- hw/pc.c |4 +++- sysemu.h |3 ++- vl.c | 48 ++-- 4 files changed, 33 insertions(+), 25 deletions(-) diff --git a/cpus.c b/cpus.c index b182b3d..89ce04d 100644 --- a/cpus.c +++ b/cpus.c @@ -36,6 +36,7
Re: [Qemu-devel] [PATCH] Fixes related to processing of qemu's -numa option
On 6/18/2012 1:29 PM, Eduardo Habkost wrote: On Sun, Jun 17, 2012 at 01:12:31PM -0700, Chegu Vinod wrote:

The -numa option to qemu is used to create [fake] NUMA nodes and expose them to the guest OS instance. There are a couple of issues with the -numa option: a) The maximum number of VCPUs that can be specified for a guest while using qemu's -numa option is 64. Due to a typecasting issue, when the number of VCPUs is > 32 the VCPUs don't show up under the specified [fake] NUMA nodes. b) KVM currently has support for 160 VCPUs per guest, while qemu's -numa option only supports up to 64 VCPUs per guest. This patch addresses these two issues. [Note: This patch has been verified by Eduardo Habkost.]

I was going to add a Tested-by line, but this patch breaks the automatic round-robin assignment when no CPU is added to any node on the command-line. Please see below. Additional questions below: [...]

diff --git a/cpus.c b/cpus.c
index b182b3d..f9cee60 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1145,7 +1145,7 @@ void set_numa_modes(void)
     for (env = first_cpu; env != NULL; env = env->next_cpu) {
         for (i = 0; i < nb_numa_nodes; i++) {
-            if (node_cpumask[i] & (1 << env->cpu_index)) {
+            if (CPU_ISSET_S(env->cpu_index, cpumask_size, node_cpumask[i])) {
                 env->numa_node = i;
             }
         }

diff --git a/hw/pc.c b/hw/pc.c
index 8368701..f0c3665 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -639,7 +639,7 @@ static void *bochs_bios_init(void)
     numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes);
     for (i = 0; i < max_cpus; i++) {
         for (j = 0; j < nb_numa_nodes; j++) {
-            if (node_cpumask[j] & (1 << i)) {
+            if (CPU_ISSET_S(i, cpumask_size, node_cpumask[j])) {
                 numa_fw_cfg[i + 1] = cpu_to_le64(j);
                 break;
             }

diff --git a/sysemu.h b/sysemu.h
index bc2c788..6e4d342 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -9,6 +9,7 @@
 #include "qapi-types.h"
 #include "notify.h"
 #include "main-loop.h"
+#include <sched.h>

 /* vl.c */
@@ -133,9 +134,11 @@
 extern uint8_t qemu_extra_params_fw[2];
 extern QEMUClock *rtc_clock;
 #define MAX_NODES 64
+#define KVM_MAX_VCPUS 254

Do we really need this constant? Why not just use max_cpus when allocating the CPU sets, instead?

Hmm, I had thought about this earlier too; max_cpus was not getting set at the point where the allocations were being done. I have now moved that code a bit later and switched to using max_cpus.

 extern int nb_numa_nodes;
 extern uint64_t node_mem[MAX_NODES];
-extern uint64_t node_cpumask[MAX_NODES];
+extern cpu_set_t *node_cpumask[MAX_NODES];
+extern size_t cpumask_size;

 #define MAX_OPTION_ROMS 16
 typedef struct QEMUOptionRom {

diff --git a/vl.c b/vl.c
index 204d85b..1906412 100644
--- a/vl.c
+++ b/vl.c
@@ -28,6 +28,7 @@
 #include <errno.h>
 #include <sys/time.h>
 #include <zlib.h>
+#include <sched.h>

 /* Needed early for CONFIG_BSD etc. */
 #include "config-host.h"
@@ -240,7 +241,8 @@
 int nb_numa_nodes;
 uint64_t node_mem[MAX_NODES];
-uint64_t node_cpumask[MAX_NODES];
+cpu_set_t *node_cpumask[MAX_NODES];
+size_t cpumask_size;

 uint8_t qemu_uuid[16];
@@ -950,6 +952,9 @@ static void numa_add(const char *optarg)
     char *endptr;
     unsigned long long value, endvalue;
     int nodenr;
+    int i;
+
+    value = endvalue = 0;

     optarg = get_opt_name(option, 128, optarg, ',') + 1;
     if (!strcmp(option, "node")) {
@@ -970,27 +975,17 @@
         }
         node_mem[nodenr] = sval;
     }
-    if (get_param_value(option, 128, "cpus", optarg) == 0) {
-        node_cpumask[nodenr] = 0;
-    } else {
+    if (get_param_value(option, 128, "cpus", optarg) != 0) {
         value = strtoull(option, &endptr, 10);
-        if (value >= 64) {
-            value = 63;
-            fprintf(stderr, "only 64 CPUs in NUMA mode supported.\n");
+        if (*endptr == '-') {
+            endvalue = strtoull(endptr+1, &endptr, 10);
         } else {
-            if (*endptr == '-') {
-                endvalue = strtoull(endptr+1, &endptr, 10);
-                if (endvalue >= 63) {
-                    endvalue = 62;
-                    fprintf(stderr,
-                            "only 63 CPUs in NUMA mode supported.\n");
-                }
-                value = (2ULL << endvalue) - (1ULL << value);
-            } else {
-                value = 1ULL << value;
-            }
+            endvalue = value;
+        }
+
+        for (i = value; i <= endvalue; ++i) {
+            CPU_SET_S(i, cpumask_size, node_cpumask[nodenr]);

What if endvalue is larger than the cpu mask size we allocated?

I can add a check (endvalue >= max_cpus) and print an error. Should we force-set endvalue to max_cpus-1 in that case?
Re: [Qemu-devel] [PATCH] Fixes related to processing of qemu's -numa option
On 6/18/2012 3:11 PM, Eric Blake wrote: On 06/18/2012 04:05 PM, Andreas Färber wrote: On 17.06.2012 22:12, Chegu Vinod wrote:

diff --git a/vl.c b/vl.c
index 204d85b..1906412 100644
--- a/vl.c
+++ b/vl.c
@@ -28,6 +28,7 @@
 #include <errno.h>
 #include <sys/time.h>
 #include <zlib.h>
+#include <sched.h>

Did you check whether this and the macros you're using are available on POSIX and mingw32? vl.c is a pretty central file.

POSIX, yes. mingw32, no. Use of preprocessor conditionals is probably in order.

Thanks. I will look into this. Vinod
[Qemu-devel] [PATCH] Fixes related to processing of qemu's -numa option
Signed-off-by: Chegu Vinod <chegu_vi...@hp.com>, Jim Hull <jim.h...@hp.com>, Craig Hada <craig.h...@hp.com>
---
 cpus.c   |  2 +-
 hw/pc.c  |  2 +-
 sysemu.h |  5 +-
 vl.c     | 42 +-
 4 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/cpus.c b/cpus.c
index b182b3d..f9cee60 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1145,7 +1145,7 @@ void set_numa_modes(void)
     for (env = first_cpu; env != NULL; env = env->next_cpu) {
         for (i = 0; i < nb_numa_nodes; i++) {
-            if (node_cpumask[i] & (1 << env->cpu_index)) {
+            if (CPU_ISSET_S(env->cpu_index, cpumask_size, node_cpumask[i])) {
                 env->numa_node = i;
             }
         }

diff --git a/hw/pc.c b/hw/pc.c
index 8368701..f0c3665 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -639,7 +639,7 @@ static void *bochs_bios_init(void)
     numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes);
     for (i = 0; i < max_cpus; i++) {
         for (j = 0; j < nb_numa_nodes; j++) {
-            if (node_cpumask[j] & (1 << i)) {
+            if (CPU_ISSET_S(i, cpumask_size, node_cpumask[j])) {
                 numa_fw_cfg[i + 1] = cpu_to_le64(j);
                 break;
             }

diff --git a/sysemu.h b/sysemu.h
index bc2c788..6e4d342 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -9,6 +9,7 @@
 #include "qapi-types.h"
 #include "notify.h"
 #include "main-loop.h"
+#include <sched.h>

 /* vl.c */
@@ -133,9 +134,11 @@
 extern uint8_t qemu_extra_params_fw[2];
 extern QEMUClock *rtc_clock;
 #define MAX_NODES 64
+#define KVM_MAX_VCPUS 254
 extern int nb_numa_nodes;
 extern uint64_t node_mem[MAX_NODES];
-extern uint64_t node_cpumask[MAX_NODES];
+extern cpu_set_t *node_cpumask[MAX_NODES];
+extern size_t cpumask_size;

 #define MAX_OPTION_ROMS 16
 typedef struct QEMUOptionRom {

diff --git a/vl.c b/vl.c
index 204d85b..1906412 100644
--- a/vl.c
+++ b/vl.c
@@ -28,6 +28,7 @@
 #include <errno.h>
 #include <sys/time.h>
 #include <zlib.h>
+#include <sched.h>

 /* Needed early for CONFIG_BSD etc. */
 #include "config-host.h"
@@ -240,7 +241,8 @@
 int nb_numa_nodes;
 uint64_t node_mem[MAX_NODES];
-uint64_t node_cpumask[MAX_NODES];
+cpu_set_t *node_cpumask[MAX_NODES];
+size_t cpumask_size;

 uint8_t qemu_uuid[16];
@@ -950,6 +952,9 @@ static void numa_add(const char *optarg)
     char *endptr;
     unsigned long long value, endvalue;
     int nodenr;
+    int i;
+
+    value = endvalue = 0;

     optarg = get_opt_name(option, 128, optarg, ',') + 1;
     if (!strcmp(option, "node")) {
@@ -970,27 +975,17 @@
         }
         node_mem[nodenr] = sval;
     }
-    if (get_param_value(option, 128, "cpus", optarg) == 0) {
-        node_cpumask[nodenr] = 0;
-    } else {
+    if (get_param_value(option, 128, "cpus", optarg) != 0) {
         value = strtoull(option, &endptr, 10);
-        if (value >= 64) {
-            value = 63;
-            fprintf(stderr, "only 64 CPUs in NUMA mode supported.\n");
+        if (*endptr == '-') {
+            endvalue = strtoull(endptr+1, &endptr, 10);
         } else {
-            if (*endptr == '-') {
-                endvalue = strtoull(endptr+1, &endptr, 10);
-                if (endvalue >= 63) {
-                    endvalue = 62;
-                    fprintf(stderr,
-                            "only 63 CPUs in NUMA mode supported.\n");
-                }
-                value = (2ULL << endvalue) - (1ULL << value);
-            } else {
-                value = 1ULL << value;
-            }
+            endvalue = value;
+        }
+
+        for (i = value; i <= endvalue; ++i) {
+            CPU_SET_S(i, cpumask_size, node_cpumask[nodenr]);
         }
-        node_cpumask[nodenr] = value;
     }
     nb_numa_nodes++;
 }
@@ -2331,7 +2326,9 @@ int main(int argc, char **argv, char **envp)
     for (i = 0; i < MAX_NODES; i++) {
         node_mem[i] = 0;
-        node_cpumask[i] = 0;
+        node_cpumask[i] = CPU_ALLOC(KVM_MAX_VCPUS);
+        cpumask_size = CPU_ALLOC_SIZE(KVM_MAX_VCPUS);
+        CPU_ZERO_S(cpumask_size, node_cpumask[i]);
     }
     nb_numa_nodes = 0;
@@ -3465,8 +3462,9 @@ int main(int argc, char **argv, char **envp)
     }
     for (i = 0; i < nb_numa_nodes; i++) {
-        if (node_cpumask[i] != 0)
+        if (node_cpumask[i] != NULL) {
             break;
+        }
     }
     /* assigning the VCPUs round-robin is easier to implement, guest OSes
      * must cope with this anyway, because there are BIOSes out there in
@@ -3474,7 +3472,7 @@
      */
     if (i == nb_numa_nodes) {
         for (i = 0; i < max_cpus; i++) {
-            node_cpumask[i % nb_numa_nodes] |= 1 << i;
+            CPU_SET_S(i, cpumask_size, node_cpumask[i % nb_numa_nodes]);
Re: [Qemu-devel] Large sized guest taking for ever to boot...
On 6/8/2012 11:37 AM, Jan Kiszka wrote: On 2012-06-08 20:20, Chegu Vinod wrote: On 6/8/2012 11:08 AM, Jan Kiszka wrote: [CC'ing qemu as this discusses its code base] On 2012-06-08 19:57, Chegu Vinod wrote: On 6/8/2012 10:42 AM, Alex Williamson wrote: On Fri, 2012-06-08 at 10:10 -0700, Chegu Vinod wrote: On 6/8/2012 9:46 AM, Alex Williamson wrote: On Fri, 2012-06-08 at 16:29 +0000, Chegu Vinod wrote:

Hello, I picked up a recent version of qemu (1.0.92 with some fixes) and tried it on an x86_64 server (with the host and the guest running the 3.4.1 kernel). BTW, I observe the same thing if I use the 1.1.50 version of qemu... not sure if this is really related to qemu. While trying to boot a large guest (80 VCPUs + 512GB) I observed that the guest took forever to boot up... ~1 hr or even more. [This wasn't the case when I was using RHEL 6.x related bits.]

Was either case using device assignment? Device assignment will map and pin each page of guest memory before startup, which can be a noticeable pause on smallish (16GB) guests. That should be linear scaling though, and if you're using qemu and not qemu-kvm, not related. Thanks.

I don't have any device assignment at this point. Yes, I am using qemu (not qemu-kvm)...

Just to be safe, are you using --enable-kvm with qemu? Yes...

Unless you are using current qemu.git master (where it is enabled by default), --enable-kvm does not activate the in-kernel irqchip support for you. Not sure if that can make such a huge difference, but it is a variation from qemu-kvm. You can enable it in qemu-1.1 with -machine kernel_irqchip=on. Thanks for pointing this out... I will add that. I was using qemu.git... not the master.

With "qemu.git master" I meant the latest version you can pull from the master branch of qemu.git. If you are using an older version, please specify the hash. BTW, you can check if the irqchip is on by looking at the output of 'info qtree' in the qemu monitor. One of the last devices listed must be called kvm-apic.

Sorry for the confusion. I was using qemu.git and I do see the kvm-apic stuff (in the 'info qtree' output) without specifying the -machine kernel_irqchip=on option.

-----
/etc/qemu-ifup tap0
/usr/local/bin/qemu-system-x86_64 -enable-kvm \
-cpu Westmere,+rdtscp,+pdpe1gb,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme \
-m 524288 -smp 80,sockets=80,cores=1,threads=1 \
-name vm1 \
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \
-drive file=/dev/libvirt_lvm/vm.img,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-monitor stdio \
-net nic,macaddr=52:54:00:71:01:01 \
-net tap,ifname=tap0,script=no,downscript=no \
-vnc :4
/etc/qemu-ifdown tap0
-----

The issue seems very basic... I was earlier running RHEL6.3 RC1 on the host and the guest, and both seemed to boot fine.

Note that RHEL is based on qemu-kvm. Thanks.

Yep.. knew that :) I was using upstream qemu-kvm and was encouraged to move away from it... to qemu.

And that is good. :) Is the problem present in current qemu-kvm.git? If yes, can you bisect when it was introduced?

Shall try out qemu-kvm.git (after finishing some experiments..).

Yes, please.

Just did that, and below are the results... BTW, another data point: if I try to boot the RHEL6.3 kernel in the guest (with the latest qemu.git and 3.4.1 on the host), it boots just fine. So, something to do with the 3.4.1 kernel in the guest and the existing udev... Need to dig deeper.

Maybe. But let's focus on getting the problematic guest running fine in some configuration. If that turns out to be impossible, we may see a guest issue.

Host config: 80 cores + 1 TB. Guest config: 40 VCPUs + 512 GB. I rebuilt the 3.4.1 kernel in the guest from scratch, retried my experiments, and measured the boot times:

a) Host: RHEL6.3 RC1 + qemu-kvm (that came with it), Guest: RHEL6.3 RC1 : ~1 min
b) Host: 3.4.1 + qemu-kvm.git, Guest: RHEL6.3 RC1 : ~1 min
c) Host: 3.4.1 + qemu-kvm.git, Guest: 3.4.1 : ~10 mins
d) Host: 3.4.1 + qemu.git, Guest: RHEL6.3 RC1 : ~1 min
e) Host: 3.4.1 + qemu.git, Guest: 3.4.1 : ~14 mins

In both cases (c) and (e) there were quite a few warning/error messages from udevd. FYI. Vinod. PS: Haven't had the time to do any further analysis... as the machine is being used for other experiments... Jan
Re: [Qemu-devel] Large sized guest taking for ever to boot...
On 6/12/2012 8:39 AM, Gleb Natapov wrote: On Tue, Jun 12, 2012 at 08:33:59AM -0700, Chegu Vinod wrote:

I rebuilt the 3.4.1 kernel in the guest from scratch, retried my experiments, and measured the boot times:

a) Host: RHEL6.3 RC1 + qemu-kvm (that came with it), Guest: RHEL6.3 RC1 : ~1 min
b) Host: 3.4.1 + qemu-kvm.git, Guest: RHEL6.3 RC1 : ~1 min
c) Host: 3.4.1 + qemu-kvm.git, Guest: 3.4.1 : ~10 mins
d) Host: 3.4.1 + qemu.git, Guest: RHEL6.3 RC1 : ~1 min
e) Host: 3.4.1 + qemu.git, Guest: 3.4.1 : ~14 mins

In both cases (c) and (e) there were quite a few warning/error messages from udevd. FYI. Vinod. PS: Haven't had the time to do any further analysis... as the machine is being used for other experiments...

Next time you get the machine, can you try running (b) but adding -hypervisor to your -cpu config?

I tried that and it didn't make any difference... i.e. the RHEL6.3 RC1 guest booted in about ~1 min. Vinod

-- Gleb.
[Qemu-devel] Live Migration of a large guest : guest frozen on the destination host
Hello, I am having some issues trying to live migrate a large guest and would like to get some pointers on how to go about debugging this. Here is some info on the configuration.

Hardware: Two DL980s, each with 80 Westmere cores + 1 TB of RAM, using a 10G NIC private link (back to back) between the two DL980s.

Host software used: 3.4.1 kernel. Qemu versions used: Case 1: upstream qemu (1.1.50), from qemu.git. Case 2: 1.0.92 + Juan Quintela's huge_memory changes.

Guest: 40 VCPUs + 512 GB. Guest software used: RHEL6.3 RC1 (had some basic boot issues with the 3.4.1 kernel and udevd). The guest is booted off an FC LUN (visible to both hosts).

[Note: I am not using virsh/virt-manager etc., but just qemu to start the guest, and I interact with the qemu monitor for live migration etc. I have set the migration speed to 10G but haven't changed the downtime (default: 30ms).]

Tried to live migrate this large guest using either of the qemus (i.e. Case 1 or Case 2) and observed the following: when the guest was idling, I was able to live migrate and have the guest come up fine on the other host, and I was able to interact with the guest on the destination host. With workloads (e.g. AIM7-compute, SPECjbb, or Google Stress App Test (SAT)) running in the guest, if we try to do live migration we observe that [after a while] the source host claims that the live migration is complete... but the guest on the destination host is often in a frozen/hung state: we can't really interact with it or ping it. Still trying to capture more information... but was also hoping to get some clues/tips from the experts on these mailing lists.

[BTW, is there a way to get a snapshot of the image of the guest on the source host just before the downtime (i.e. start of stage 3), and compare that with the image of the guest on the destination host just before it's about to resume? Is such a debugging feature already available?]

Thanks, Vinod
Re: [Qemu-devel] Large sized guest taking for ever to boot...
On 6/10/2012 2:30 AM, Gleb Natapov wrote: On Fri, Jun 08, 2012 at 11:20:53AM -0700, Chegu Vinod wrote: On 6/8/2012 11:08 AM, Jan Kiszka wrote:

BTW, another data point: if I try to boot the RHEL6.3 kernel in the guest (with the latest qemu.git and 3.4.1 on the host), it boots just fine. So, something to do with the 3.4.1 kernel in the guest and the existing udev... Need to dig deeper.

How many CPUs do you have on the host? RHEL6.3 uses an unfair spinlock when it runs as a guest.

80 active cores on the host. Vinod

-- Gleb.
Re: [Qemu-devel] [RFC 0/7] Fix migration with lots of memory
Hello, I did pick up these patches a while back and ran some migration tests while running simple workloads in the guest. Below are some results. FYI... Vinod

Config details: Guest: 10 VCPUs, 60 GB (running on a host that has 6 cores (12 threads) and 64 GB). The hosts are identical x86_64 blade servers connected via a private 10G link (for migration traffic). The guest was started using qemu (no virsh/virt-manager etc). Migration was initiated at the qemu monitor prompt, and migrate_set_speed was used to set it to 10G. No changes to the downtime.

Software: Guest and host OS: 3.4.0-rc7+. Vanilla: basic upstream qemu.git. huge_memory: changes from Juan's qemu.git tree. [Note: BTW, I did also try v11 of the XBZRLE patches... but ran into issues (guest crashed after migration); I have reported it to the author.]

Here are the simple workloads: 1) idling guest, 2) AIM7-compute (with 2000 users), 3) 10-way parallel make (of the kernel), 4) 2 instances of a memory r/w loop (exactly the same as in docs/xbzrle.txt), 5) SPECjbb2005.

Note: in the Vanilla case I had instrumented ram_save_live() to print out the total migration time and the MBs transferred.

1) Idling guest:
   Vanilla:     total mig. time 173016 ms, 1606 MB transferred
   huge_memory: total mig. time  48821 ms, 1620 MB transferred

2) AIM7-compute (2000 users):
   Vanilla:     total mig. time 241124 ms, 4827 MB transferred
   huge_memory: total mig. time  66716 ms, 4022 MB transferred

3) 10-way parallel make (of the linux kernel):
   Vanilla:     total mig. time 104319 ms, 2316 MB transferred
   huge_memory: total mig. time  55105 ms, 2995 MB transferred

4) 2 instances of memory r/w loop (refer to docs/xbzrle.txt):
   Vanilla:     total mig. time 112102 ms, 1739 MB transferred
   huge_memory: total mig. time  85504 ms, 1745 MB transferred

5) SPECjbb:
   Vanilla:     total mig. time 162189 ms, 5461 MB transferred
   huge_memory: total mig. time  67787 ms, 8528 MB transferred

[Expected] Observation: unlike the Vanilla case (and also the XBZRLE case), with these patches I was still able to interact with the qemu monitor prompt and also with the guest during the migration (i.e. during the iterative pre-copy phase).

--

On 5/22/2012 11:32 AM, Juan Quintela wrote: Hi. After a long, long time, this is v2. These are basically the changes that we have for RHEL, due to the problems that we have with big memory machines. I just rebased the patches and fixed the easy parts:

- buffered_file_limit is gone: we just use 50ms and call it a day.
- I let ram_addr_t remain a valid type for a counter (no, I still don't agree with Anthony on this, but it is not important).
- Print total time of migration always. Notice that I also print it when migration is completed. Luiz, could you take a look to see if I did something wrong (probably).
- Moved debug printfs to tracepoints. Thanks a lot to Stefan for helping with it. While at it, I had to put the traces in the middle of the trace-events file; if I put them at the end of the file, when I enabled them, the previous two tracepoints got generated instead of the ones I had just defined. Stefan is looking at that. The workaround is defining them anywhere else.
- Exit from cpu_physical_memory_reset_dirty(). Anthony wanted me to create an empty stub for kvm and maintain the code for tcg. The problem is that we can have both kvm and tcg running from the same binary. Instead of exiting in the middle of the function, I just refactored the code out. Is there a struct where I could add a new function pointer for this behaviour?
- Exit if we have been too long in the ram_save_live() loop. Anthony didn't like this; I will send a version based on the migration thread in the following days, but we just need something working for other people to test. Notice that I still get lots of more-than-50ms printfs. (Yes, there is a debugging printf there.)
- Bitmap handling. Still all code to count dirty pages; will try to get something saner based on bitmap optimizations.

Comments? Later, Juan.

v1:
---
Executive Summary: This series of patches fixes migration with lots of memory. With them, stalls are removed, and we honor max_downtime. I also add infrastructure to measure what is happening during migration (#define DEBUG_MIGRATION and DEBUG_SAVEVM). Migration is broken at the moment in the qemu tree; Michael's patch is needed to fix virtio migration. Measurements are given for the qemu-kvm tree. At the end, some measurements with the qemu tree.

Long version with measurements (for those that like numbers O:-)
---
8 vCPUs and 64GB RAM, a RHEL5 guest that is completely idle.

initial
---
savevm: save live iterate section id 3 name ram took 3266 milliseconds 46 times

We have 46 stalls, and missed the
Re: [Qemu-devel] Large sized guest taking for ever to boot...
On 6/8/2012 11:08 AM, Jan Kiszka wrote: [CC'ing qemu as this discusses its code base] On 2012-06-08 19:57, Chegu Vinod wrote: On 6/8/2012 10:42 AM, Alex Williamson wrote: On Fri, 2012-06-08 at 10:10 -0700, Chegu Vinod wrote: On 6/8/2012 9:46 AM, Alex Williamson wrote: On Fri, 2012-06-08 at 16:29 +0000, Chegu Vinod wrote:

Hello, I picked up a recent version of qemu (1.0.92 with some fixes) and tried it on an x86_64 server (with the host and the guest running the 3.4.1 kernel). BTW, I observe the same thing if I use the 1.1.50 version of qemu... not sure if this is really related to qemu. While trying to boot a large guest (80 VCPUs + 512GB) I observed that the guest took forever to boot up... ~1 hr or even more. [This wasn't the case when I was using RHEL 6.x related bits.]

Was either case using device assignment? Device assignment will map and pin each page of guest memory before startup, which can be a noticeable pause on smallish (16GB) guests. That should be linear scaling though, and if you're using qemu and not qemu-kvm, not related. Thanks.

I don't have any device assignment at this point. Yes, I am using qemu (not qemu-kvm)...

Just to be safe, are you using --enable-kvm with qemu? Yes...

Unless you are using current qemu.git master (where it is enabled by default), --enable-kvm does not activate the in-kernel irqchip support for you. Not sure if that can make such a huge difference, but it is a variation from qemu-kvm. You can enable it in qemu-1.1 with -machine kernel_irqchip=on.

Thanks for pointing this out... I will add that. I was using qemu.git... not the master.

-----
/etc/qemu-ifup tap0
/usr/local/bin/qemu-system-x86_64 -enable-kvm \
-cpu Westmere,+rdtscp,+pdpe1gb,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme \
-m 524288 -smp 80,sockets=80,cores=1,threads=1 \
-name vm1 \
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \
-drive file=/dev/libvirt_lvm/vm.img,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-monitor stdio \
-net nic,macaddr=52:54:00:71:01:01 \
-net tap,ifname=tap0,script=no,downscript=no \
-vnc :4
/etc/qemu-ifdown tap0
-----

The issue seems very basic... I was earlier running RHEL6.3 RC1 on the host and the guest, and both seemed to boot fine.

Note that RHEL is based on qemu-kvm. Thanks.

Yep.. knew that :) I was using upstream qemu-kvm and was encouraged to move away from it... to qemu.

And that is good. :) Is the problem present in current qemu-kvm.git? If yes, can you bisect when it was introduced?

Shall try out qemu-kvm.git (after finishing some experiments..). BTW, another data point: if I try to boot the RHEL6.3 kernel in the guest (with the latest qemu.git and 3.4.1 on the host), it boots just fine. So, something to do with the 3.4.1 kernel in the guest and the existing udev... Need to dig deeper. Vinod

Jan
Re: [Qemu-devel] Fwd: [PATCH v2 00/41] postcopy live migration
On 6/4/2012 6:13 AM, Isaku Yamahata wrote: On Mon, Jun 04, 2012 at 05:01:30AM -0700, Chegu Vinod wrote: Hello Isaku Yamahata, Hi.

I just saw your patches... Would it be possible to email me a tar bundle of these patches (makes it easier to apply the patches to a copy of the upstream qemu.git)?

I uploaded them to github for those who are interested:
git://github.com/yamahata/qemu.git qemu-postcopy-june-04-2012
git://github.com/yamahata/linux-umem.git linux-umem-june-04-2012

Thanks for the pointer...

BTW, I am also curious whether you have considered using any kind of RDMA features for optimizing the page faults during postcopy?

Yes, RDMA is an interesting topic. Can you share your use case/concerns/issues?

Looking at large-sized guests (256GB and higher) running cpu/memory-intensive enterprise workloads. The concerns are the same... i.e. having a predictable total migration time, minimal downtime/freeze-time, and of course minimal service degradation to the workload(s) in the VM or the co-located VMs... How large a guest have you tested your changes with, and what kind of workloads have you used so far?

Thus we can collaborate. You may want to see Benoit's results.

Yes, I have already seen some of Benoit's results. Hence the question about the use of RDMA techniques for postcopy.

As far as I know, he has not published his code yet.

Thanks, Vinod

--
Message: 1
Date: Mon, 4 Jun 2012 18:57:02 +0900
From: Isaku Yamahata <yamah...@valinux.co.jp>
To: qemu-devel@nongnu.org, k...@vger.kernel.org
Cc: benoit.hud...@gmail.com, aarca...@redhat.com, aligu...@us.ibm.com, quint...@redhat.com, stefa...@gmail.com, t.hirofu...@aist.go.jp, dl...@redhat.com, satoshi.i...@aist.go.jp, mdr...@linux.vnet.ibm.com, yoshikawa.tak...@oss.ntt.co.jp, owass...@redhat.com, a...@redhat.com, pbonz...@redhat.com
Subject: [Qemu-devel] [PATCH v2 00/41] postcopy live migration
Message-ID: <cover.1338802190.git.yamah...@valinux.co.jp>

After a long time, we have v2. This is the qemu part; the linux kernel part is sent separately.

Changes v1 -> v2:
- split up patches for review
- buffered file refactored
- many bug fixes. Especially PV drivers can work with postcopy
- optimization/heuristic

Patches:
1 - 30: refactoring existing code and preparation
31 - 37: implement postcopy itself (essential part)
38 - 41: some optimization/heuristics for postcopy

Intro
=
This patch series implements postcopy live migration. [1] As discussed at KVM Forum 2011, a dedicated character device is used for distributed shared memory between the migration source and destination. Now we can discuss/benchmark/compare with precopy. I believe there is much room for improvement.

[1] http://wiki.qemu.org/Features/PostCopyLiveMigration

Usage
=
You need to load the umem character device on the host before starting migration. Postcopy can be used with the tcg and kvm accelerators. The implementation depends only on the linux umem character device, but the driver-dependent code is split into its own file. I tested only the host page size == guest page size case, but the implementation allows host page size != guest page size.

The following options are added with this patch series:

- incoming part: command line options
  -postcopy [-postcopy-flags <flags>]
  where flags is for changing behavior for benchmark/debugging. Currently the following flags are available: 0: default, 1: enable touching page request.
  example: qemu -postcopy -incoming tcp:0: -monitor stdio -machine accel=kvm

- outgoing part: options for the migrate command
  migrate [-p [-n] [-m]] URI [prefault forward [prefault backward]]
  -p: indicate postcopy migration
  -n: disable background transferring of pages: this is for benchmark/debugging
  -m: move background transfer of postcopy mode
  prefault forward: the number of forward pages which are sent with on-demand
  prefault backward: the number of backward pages which are sent with on-demand
  example: migrate -p -n tcp:<dest ip address>:
           migrate -p -n -m tcp:<dest ip address>: 32 0

TODO
=
- benchmark/evaluation. Especially how async page fault affects the result.
- improve/optimization. At the moment, at least what I'm aware of is:
  - making the incoming socket non-blocking with a thread. As page compression is coming, it is impractical to do non-blocking reads and check whether the necessary data has been read.
  - touching pages in the incoming qemu process by fd handler seems suboptimal. Create a dedicated thread?
  - the outgoing handler seems suboptimal, causing latency.
- consider the FUSE/CUSE possibility
- don't fork umemd, but create a thread?

basic postcopy work flow:

    qemu on the destination
            |
            V
    open(/dev/umem)
            |
            V
    UMEM_INIT