Re: [Qemu-devel] [PATCH 05/10] migration: Fix the migrate auto converge process

2014-03-11 Thread Chegu Vinod

On 3/11/2014 1:48 PM, Juan Quintela wrote:

arei.gong...@huawei.com wrote:

From: ChenLiang chenlian...@huawei.com

It is inaccurate and complex to use the transfer speed of the
migration thread to determine whether the migration is converging:
the dirty pages may be compressed by XBZRLE or sent as zero pages,
which distorts the measured rate. If the migration cannot converge,
the counter of dirty bitmap updates will keep increasing instead.


@@ -530,21 +523,11 @@ static void migration_bitmap_sync(void)
  /* more than 1 second = 1000 millisecons */
  if (end_time > start_time + 1000) {
  if (migrate_auto_converge()) {
-/* The following detection logic can be refined later. For now:
-   Check to see if the dirtied bytes is 50% more than the approx.
-   amount of bytes that just got transferred since the last time we
-   were in this routine. If that happens N times (for now N==4)
-   we turn on the throttle down logic */
-bytes_xfer_now = ram_bytes_transferred();
-if (s->dirty_pages_rate &&
-   (num_dirty_pages_period * TARGET_PAGE_SIZE >
-   (bytes_xfer_now - bytes_xfer_prev)/2) &&
-   (dirty_rate_high_cnt++ > 4)) {
-trace_migration_throttle();
-mig_throttle_on = true;
-dirty_rate_high_cnt = 0;
- }
- bytes_xfer_prev = bytes_xfer_now;
+if (get_bitmap_sync_cnt() > 15) {
+/* It indicates that migration can't converge when the counter
+is larger than fifteen. Enable the feature of auto
  converge */

Comment is not needed, it says exactly what the code does.

But why 15?  It is not that I think that the older code is better or
worse than yours.  Just that we move from one magic number to another
(that is even bigger).

Shouldn't it be easier to just change mig_sleep_cpu()

to do something like:


static void mig_sleep_cpu(void *opq)
{
 qemu_mutex_unlock_iothread();
 g_usleep(2 * get_bitmap_sync_cnt() * 1000);
 qemu_mutex_lock_iothread();
}

This would get the 30ms on the 15th iteration.  I am open to changing that
formula to anything different, but what I want is to change this to
something where the less convergence, the more throttling.
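Juan's suggestion can be sketched as a small helper (a sketch only: `get_bitmap_sync_cnt()` is the counter from the patch under review, while the helper name and the explicit microsecond conversion are my own):

```c
/* Proportional back-off: sleep 2 ms per dirty-bitmap sync that failed
 * to converge, so the 15th iteration reproduces the 30 ms sleep that
 * the fixed-length throttle used. Returns microseconds for g_usleep(). */
static long mig_throttle_sleep_us(int sync_cnt)
{
    return 2L * sync_cnt * 1000;
}
```

mig_sleep_cpu() would then call g_usleep(mig_throttle_sleep_us(get_bitmap_sync_cnt())), which gives the "less convergence, more throttling" behavior without a magic threshold: the sleep simply keeps growing.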


I already got some feedback earlier on this and had this task in the 
list of things to work on... :)

Having the throttling start at some pre-defined degree and then have 
that degree gradually increase ...either


a) automatically as shown in Juan's example above (or)

b) via some TBD user level interface...

...is one way to help with ensuring convergence for all cases.

The issue of continuing to increase this degree of throttling is an 
obvious area of concern for the workload (that is still trying to run 
in the VM).   Would it be better to force the live migration to switch 
from the iterative pre-copy phase to the downtime phase if it fails 
to converge even after throttling it for a couple of iterations?  Doing 
so could result in a longer actual downtime. Hope to try this and 
see...  but if anyone has inputs (other than doing post-copy etc.) please 
do share.





BTW, are you testing this with any workload to see that it improves?


Yes. Please do share some data.





+mig_throttle_on = true;
+}

Vinod, what do you think?
As is noted in the current code... the logic to detect the lack of 
convergence needs to be refined. If there is a better way to help detect 
it (one which covers these other cases like XBZRLE etc.) then I am all 
for it.  I do agree with Juan about the choice of magic numbers (i.e. 
one may not be better than the other).


BTW, on a related note...

I haven't used XBZRLE in the recent past (after having tried it in the 
early days). Does it now perform well with larger sized VMs running real 
world workloads?   I assume that is where you found that there was still 
a need for forcing convergence?


Please do consider sharing some results about the type of workload and also 
the size of the VMs etc. that you have tried with XBZRLE.



Do you have a workload to test this?


Hmm... One can test this with memory intensive Java warehouse type of 
workloads (besides using synthetic workloads).


Vinod


Thanks, Juan.
.






Re: [Qemu-devel] [PATCH v2 00/39] bitmap handling optimization

2013-11-08 Thread Chegu Vinod

On 11/6/2013 5:04 AM, Juan Quintela wrote:

Hi

[v2]
In this version:
- fixed all the comments from last versions (thanks Eric)
- kvm migration bitmap is synchronized using bitmap operations
- qemu bitmap -> migration bitmap is synchronized using bitmap operations
If bitmaps are not properly aligned, we fall back to old code.
Code survives virt-tests, so should be in quite good shape.

ToDo list:

- vga ram by default is not aligned in a page number multiple of 64,

   it could be optimized.  Kraxel?  It syncs the kvm bitmap at least once
   a second or so? bitmap is only 2048 pages (16MB by default).
   We need to change the ram_addr only

- vga: still more, after we finish migration, vga code continues
   synchronizing the kvm bitmap on source machine.  Notice that there
   is no graphics client connected to the VGA.  Worth investigating?

- I haven't yet measured speed differences on big hosts.  Vinod?

- Depending of performance, more optimizations to do.

- debugging printf's still on the code, just to see if we are taking
   (or not) the optimized paths.

And that is all.  Please test & comment.

Thanks, Juan.

[v1]
This series split the dirty bitmap (8 bits per page, only three used)
into 3 individual bitmaps.  Once the conversion is done, operations
are handled by bitmap operations, not bit by bit.

- *_DIRTY_FLAG flags are gone, now we use memory.h DIRTY_MEMORY_*
everywhere.

- We set/reset each flag individually
   (set_dirty_flags(0xff & ~CODE_DIRTY_FLAG)) are gone.

- Rename several functions to clarify/make consistent things.

- I know it doesn't pass checkpatch for long lines; a proper submission
   should pass it. We have to have long lines, short variable names, or
   ugly line splitting :p

- DIRTY_MEMORY_NUM: how can one include exec/memory.h into cpu-all.h?
   #include of it doesn't work; as a workaround, I have copied its value, but
   any better idea?  I can always create exec/migration-flags.h, though.

- The meat of the code is patch 19.  Rest of patches are quite easy
(even that one is not too complex).

Only optimizations done so far are
set_dirty_range()/clear_dirty_range() that now operates with
bitmap_set/clear.
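The payoff of operating on whole words rather than bit by bit can be seen in a simplified sketch of the idea (illustrative only; the real code uses QEMU's bitmap_set()/bitmap_clear(), and the names below are made up):

```c
#include <stdint.h>

#define BITS_PER_WORD 64

/* Mark 'count' pages dirty starting at page 'start'. Aligned full words
 * are dirtied 64 pages at a time with a single store; only the unaligned
 * head and tail fall back to bit-by-bit updates. */
static void dirty_bitmap_set_range(uint64_t *map, long start, long count)
{
    while (count > 0) {
        long word = start / BITS_PER_WORD;
        long bit = start % BITS_PER_WORD;
        if (bit == 0 && count >= BITS_PER_WORD) {
            map[word] = ~(uint64_t)0;        /* 64 pages in one store */
            start += BITS_PER_WORD;
            count -= BITS_PER_WORD;
        } else {
            map[word] |= (uint64_t)1 << bit; /* unaligned head/tail */
            start++;
            count--;
        }
    }
}
```

This is also why the series falls back to the old code when the bitmaps are not properly aligned: the fast path only pays off on word-aligned, word-sized runs.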

Note for Xen: cpu_physical_memory_set_dirty_range() was wrong for xen,
see comment on patch.

It passes virt-test migration tests, so it should be perfect.

I post it to ask for comments.

ToDo list:

- create a lock for the bitmaps and fold migration bitmap into this
   one.  This would avoid a copy and make things easier?

- As this code uses/abuses bitmaps, we need to change the type of the
   index from int to long.  With an int index, we can only access a
   maximum of 8TB guest (yes, this is not urgent, we have a couple of
   years to do it).

- merging KVM -> QEMU bitmap as a bitmap and not bit-by-bit.

- splitting the KVM bitmap synchronization into chunks, i.e. not
   synchronize all memory, just enough to continue with migration.

Any further ideas/needs?

Thanks, Juan.

PD.  Why did it take so long?

  Because I was trying to integrate the bitmap into the MemoryRegion
  abstraction.  It would have made the code cleaner, but I hit dead-end
  after dead-end.  In practical terms, TCG doesn't know about
  MemoryRegions; it has been ported to run on top of them, but
  doesn't use them effectively.


The following changes since commit c2d30667760e3d7b81290d801e567d4f758825ca:

   rtc: remove dead SQW IRQ code (2013-11-05 20:04:03 -0800)

are available in the git repository at:

   git://github.com/juanquintela/qemu.git bitmap-v2.next

for you to fetch changes up to d91eff97e6f36612eb22d57c2b6c2623f73d3997:

   migration: synchronize memory bitmap 64bits at a time (2013-11-06 13:54:56 
+0100)


Juan Quintela (39):
   Move prototypes to memory.h
   memory: cpu_physical_memory_set_dirty_flags() result is never used
   memory: cpu_physical_memory_set_dirty_range() return void
   exec: use accessor function to know if memory is dirty
   memory: create function to set a single dirty bit
   exec: create function to get a single dirty bit
   memory: make cpu_physical_memory_is_dirty return bool
   exec: simplify notdirty_mem_write()
   memory: all users of cpu_physical_memory_get_dirty used only one flag
   memory: set single dirty flags when possible
   memory: cpu_physical_memory_set_dirty_range() allways dirty all flags
   memory: cpu_physical_memory_mask_dirty_range() always clear a single flag
   memory: use DIRTY_MEMORY_* instead of *_DIRTY_FLAG
   memory: use bit 2 for migration
   memory: make sure that client is always inside range
   memory: only resize dirty bitmap when memory size increases
   memory: cpu_physical_memory_clear_dirty_flag() result is never used
   bitmap: Add bitmap_zero_extend operation
   memory: split dirty bitmap into three
   memory: unfold cpu_physical_memory_clear_dirty_flag() in its only user
   

Re: [Qemu-devel] [PATCH v7 3/3] Force auto-convegence of live migration

2013-06-24 Thread Chegu Vinod

On 6/24/2013 6:01 AM, Paolo Bonzini wrote:

One nit and one question:

On 23/06/2013 22:11, Chegu Vinod wrote:

@@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
  
  /* more than 1 second = 1000 millisecons */

  if (end_time > start_time + 1000) {
+if (migrate_auto_converge()) {
+/* The following detection logic can be refined later. For now:
+   Check to see if the dirtied bytes is 50% more than the approx.
+   amount of bytes that just got transferred since the last time we
+   were in this routine. If that happens N times (for now N==4)
+   we turn on the throttle down logic */
+bytes_xfer_now = ram_bytes_transferred();
+if (s->dirty_pages_rate &&
+   (num_dirty_pages_period * TARGET_PAGE_SIZE >
+   (bytes_xfer_now - bytes_xfer_prev)/2) &&
+   (dirty_rate_high_cnt++ > 4)) {
+trace_migration_throttle();
+mig_throttle_on = true;
+dirty_rate_high_cnt = 0;
+ }
+ bytes_xfer_prev = bytes_xfer_now;
+}


Missing:

  else {
  mig_throttle_on = false;
  }


Ok.

+/* Stub function that's gets run on the vcpu when its brought out of the
+   VM to run inside qemu via async_run_on_cpu()*/
+static void mig_sleep_cpu(void *opq)
+{
+qemu_mutex_unlock_iothread();
+g_usleep(30*1000);
+qemu_mutex_lock_iothread();
+}
+
+/* If it has been more than 40 ms since the last time the guest
+ * was throttled then do it again.
+ */
+if (40 < (t1-t0)/1000000) {

You're stealing 75% of the CPU time, isn't that a lot?


Depends on the dirty rate vs. transfer rate... I had tried 50% too and 
it took much longer for the migration to converge.
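For reference, the duty-cycle arithmetic behind Paolo's 75% figure (the constants are the 30 ms sleep in mig_sleep_cpu() and the 40 ms spacing in the throttling check; the helper itself is just illustration):

```c
/* Percentage of wall-clock time stolen from a vCPU when each throttling
 * event sleeps for sleep_ms and events are spaced period_ms apart. */
static int throttle_duty_pct(int sleep_ms, int period_ms)
{
    return sleep_ms * 100 / period_ms;
}
```

30 ms of sleep in every 40 ms window gives 75%; the 50% variant Vinod mentions corresponds to 20 ms of sleep in the same window.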


Vinod




+mig_throttle_guest_down();
+t0 = t1;
+}
+}


Paolo

.






[Qemu-devel] [PATCH v8 3/3] Force auto-convegence of live migration

2013-06-24 Thread Chegu Vinod
If a user chooses to turn on the auto-converge migration capability
these changes detect the lack of convergence and throttle down the
guest. i.e. force the VCPUs out of the guest for some duration
and let the migration thread catch up and help converge.

Verified the convergence using the following :
 - Java Warehouse workload running on a 20VCPU/256G guest(~80% busy)
 - OLTP like workload running on a 80VCPU/512G guest (~80% busy)

Sample results with Java warehouse workload : (migrate speed set to 20Gb and
migrate downtime set to 4seconds).

 (qemu) info migrate
 capabilities: xbzrle: off auto-converge: off  
 Migration status: active
 total time: 1487503 milliseconds
 expected downtime: 519 milliseconds
 transferred ram: 383749347 kbytes
 remaining ram: 2753372 kbytes
 total ram: 268444224 kbytes
 duplicate: 65461532 pages
 skipped: 64901568 pages
 normal: 95750218 pages
 normal bytes: 383000872 kbytes
 dirty pages rate: 67551 pages

 ---

 (qemu) info migrate
 capabilities: xbzrle: off auto-converge: on   
 Migration status: completed
 total time: 241161 milliseconds
 downtime: 6373 milliseconds
 transferred ram: 28235307 kbytes
 remaining ram: 0 kbytes
 total ram: 268444224 kbytes
 duplicate: 64946416 pages
 skipped: 64903523 pages
 normal: 7044971 pages
 normal bytes: 28179884 kbytes

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 arch_init.c |   79 +++
 1 files changed, 79 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index a8b91ee..e7ca3b1 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,9 @@ int graphic_depth = 15;
 #endif
 
 const uint32_t arch_type = QEMU_ARCH;
+static bool mig_throttle_on;
+static int dirty_rate_high_cnt;
+static void check_guest_throttling(void);
 
 /***/
 /* ram save/restore */
@@ -378,8 +381,14 @@ static void migration_bitmap_sync(void)
 uint64_t num_dirty_pages_init = migration_dirty_pages;
 MigrationState *s = migrate_get_current();
 static int64_t start_time;
+static int64_t bytes_xfer_prev;
 static int64_t num_dirty_pages_period;
 int64_t end_time;
+int64_t bytes_xfer_now;
+
+if (!bytes_xfer_prev) {
+bytes_xfer_prev = ram_bytes_transferred();
+}
 
 if (!start_time) {
 start_time = qemu_get_clock_ms(rt_clock);
@@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
 
 /* more than 1 second = 1000 millisecons */
 if (end_time > start_time + 1000) {
+if (migrate_auto_converge()) {
+/* The following detection logic can be refined later. For now:
+   Check to see if the dirtied bytes is 50% more than the approx.
+   amount of bytes that just got transferred since the last time we
+   were in this routine. If that happens N times (for now N==4)
+   we turn on the throttle down logic */
+bytes_xfer_now = ram_bytes_transferred();
+if (s->dirty_pages_rate &&
+   (num_dirty_pages_period * TARGET_PAGE_SIZE >
+   (bytes_xfer_now - bytes_xfer_prev)/2) &&
+   (dirty_rate_high_cnt++ > 4)) {
+trace_migration_throttle();
+mig_throttle_on = true;
+dirty_rate_high_cnt = 0;
+ }
+ bytes_xfer_prev = bytes_xfer_now;
+} else {
+ mig_throttle_on = false;
+}
s->dirty_pages_rate = num_dirty_pages_period * 1000
/ (end_time - start_time);
s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -566,6 +592,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 migration_bitmap = bitmap_new(ram_pages);
 bitmap_set(migration_bitmap, 0, ram_pages);
 migration_dirty_pages = ram_pages;
+mig_throttle_on = false;
+dirty_rate_high_cnt = 0;
 
 if (migrate_use_xbzrle()) {
 XBZRLE.cache = cache_init(migrate_xbzrle_cache_size() /
@@ -628,6 +656,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 }
 total_sent += bytes_sent;
 acct_info.iterations++;
+check_guest_throttling();
 /* we want to check in the 1st loop, just in case it was the 1st time
and we had to sync the dirty bitmap.
qemu_get_clock_ns() is a bit expensive, so we only check each some
@@ -1097,3 +1126,53 @@ TargetInfo *qmp_query_target(Error **errp)
 
 return info;
 }
+
+/* Stub function that's gets run on the vcpu when its brought out of the
+   VM to run inside qemu via async_run_on_cpu()*/
+static void mig_sleep_cpu(void *opq)
+{
+qemu_mutex_unlock_iothread();
+g_usleep(30*1000);
+qemu_mutex_lock_iothread();
+}
+
+/* To reduce the dirty rate explicitly disallow the VCPUs from spending
+   much time in the VM. The migration thread will try to catchup.
+   Workload will experience a performance drop.
+*/
+static void

[Qemu-devel] [PATCH v8 0/3] Throttle-down guest to help with live migration convergence

2013-06-24 Thread Chegu Vinod
Busy enterprise workloads hosted on large-sized VMs tend to dirty
memory faster than the transfer rate achieved via live guest migration.
Despite some good recent improvements (using dedicated 10Gig NICs
between hosts) the live migration does NOT converge.

If a user chooses to force convergence of their migration via a new
migration capability auto-converge then this change will auto-detect
lack of convergence scenario and trigger a slow down of the workload
by explicitly disallowing the VCPUs from spending much time in the VM
context.

The migration thread tries to catch up and this eventually leads
to convergence in some deterministic amount of time. Yes it does
impact the performance of all the VCPUs but in my observation that
lasts only for a short duration of time. i.e. we end up entering
stage 3 (downtime phase) soon after that. No external trigger is
required.
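The lack-of-convergence detection used in patch 3/3 can be condensed into a small predicate (a sketch with the patch's static state made explicit; the function name is mine; like the patch, it trips once the bytes dirtied in a period exceed half the bytes transferred in it for several consecutive bitmap syncs):

```c
#include <stdbool.h>
#include <stdint.h>

static int dirty_rate_high_cnt;

/* Returns true (and resets the counter) once the dirty rate has outrun
 * the transfer rate often enough that throttling should be latched on. */
static bool detect_non_convergence(uint64_t dirtied_bytes,
                                   uint64_t transferred_bytes)
{
    if (dirtied_bytes > transferred_bytes / 2 && dirty_rate_high_cnt++ > 4) {
        dirty_rate_high_cnt = 0;
        return true;
    }
    return false;
}
```

Note the short-circuit: the counter only advances while the dirty rate stays high, so an occasional spike does not trigger throttling.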

Thanks to Juan and Paolo for their useful suggestions.

---
Changes from v7:
- added a missing else to patch 3/3.

Changes from v6:
- incorporated feedback from Paolo.
- rebased to latest qemu.git and removing RFC

Changes from v5:
- incorporated feedback from Paolo  Igor.
- rebased to latest qemu.git

Changes from v4:
- incorporated feedback from Paolo.
- split into 3 patches.

Changes from v3:
- incorporated feedback from Paolo and Eric
- rebased to latest qemu.git

Changes from v2:
- incorporated feedback from Orit, Juan and Eric
- stop the throttling thread at the start of stage 3
- rebased to latest qemu.git

Changes from v1:
- rebased to latest qemu.git
- added auto-converge capability(default off) - suggested by Anthony Liguori &
Eric Blake.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---

Chegu Vinod (3):
  Introduce async_run_on_cpu()
  Add 'auto-converge' migration capability
  Force auto-convegence of live migration

 arch_init.c   |   85 +
 cpus.c|   29 ++
 include/migration/migration.h |2 +
 include/qemu-common.h |1 +
 include/qom/cpu.h |   10 +
 migration.c   |9 
 qapi-schema.json  |5 ++-
 7 files changed, 140 insertions(+), 1 deletions(-)




[Qemu-devel] [PATCH v8 2/3] Add 'auto-converge' migration capability

2013-06-24 Thread Chegu Vinod
The auto-converge migration capability allows the user to specify whether they
want the live migration sequence to automatically detect lack of convergence
and force it.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
Reviewed-by: Paolo Bonzini pbonz...@redhat.com
Reviewed-by: Eric Blake ebl...@redhat.com
---
 include/migration/migration.h |2 ++
 migration.c   |9 +
 qapi-schema.json  |5 -
 3 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index e2acec6..ace91b0 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -127,4 +127,6 @@ int migrate_use_xbzrle(void);
 int64_t migrate_xbzrle_cache_size(void);
 
 int64_t xbzrle_cache_resize(int64_t new_size);
+
+bool migrate_auto_converge(void);
 #endif
diff --git a/migration.c b/migration.c
index 058f9e6..d0759c1 100644
--- a/migration.c
+++ b/migration.c
@@ -473,6 +473,15 @@ void qmp_migrate_set_downtime(double value, Error **errp)
 max_downtime = (uint64_t)value;
 }
 
+bool migrate_auto_converge(void)
+{
+MigrationState *s;
+
+s = migrate_get_current();
+
+return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
+}
+
 int migrate_use_xbzrle(void)
 {
 MigrationState *s;
diff --git a/qapi-schema.json b/qapi-schema.json
index a80ee40..c019fec 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -605,10 +605,13 @@
#  This feature allows us to minimize migration traffic for certain work
#  loads, by sending compressed difference of the pages
 #
+# @auto-converge: If enabled, QEMU will automatically throttle down the guest
+#  to speed up convergence of RAM migration. (since 1.6)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle'] }
+  'data': ['xbzrle', 'auto-converge'] }
 
 ##
 # @MigrationCapabilityStatus
-- 
1.7.1




[Qemu-devel] [PATCH v8 1/3] Introduce async_run_on_cpu()

2013-06-24 Thread Chegu Vinod
Introduce an asynchronous version of run_on_cpu() i.e. the caller
doesn't have to block till the call back routine finishes execution
on the target vcpu.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
Reviewed-by: Paolo Bonzini pbonz...@redhat.com
---
 cpus.c|   29 +
 include/qemu-common.h |1 +
 include/qom/cpu.h |   10 ++
 3 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/cpus.c b/cpus.c
index c8bc8ad..c7c90d0 100644
--- a/cpus.c
+++ b/cpus.c
@@ -653,6 +653,7 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), 
void *data)
 
 wi.func = func;
 wi.data = data;
+wi.free = false;
 if (cpu->queued_work_first == NULL) {
 cpu->queued_work_first = wi;
 } else {
@@ -671,6 +672,31 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), 
void *data)
 }
 }
 
+void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
+{
+struct qemu_work_item *wi;
+
+if (qemu_cpu_is_self(cpu)) {
+func(data);
+return;
+}
+
+wi = g_malloc0(sizeof(struct qemu_work_item));
+wi->func = func;
+wi->data = data;
+wi->free = true;
+if (cpu->queued_work_first == NULL) {
+cpu->queued_work_first = wi;
+} else {
+cpu->queued_work_last->next = wi;
+}
+cpu->queued_work_last = wi;
+wi->next = NULL;
+wi->done = false;
+
+qemu_cpu_kick(cpu);
+}
+
 static void flush_queued_work(CPUState *cpu)
 {
 struct qemu_work_item *wi;
@@ -683,6 +709,9 @@ static void flush_queued_work(CPUState *cpu)
 cpu->queued_work_first = wi->next;
 wi->func(wi->data);
 wi->done = true;
+if (wi->free) {
+g_free(wi);
+}
 }
 cpu->queued_work_last = NULL;
 qemu_cond_broadcast(qemu_work_cond);
diff --git a/include/qemu-common.h b/include/qemu-common.h
index 3c91375..9834dcb 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -291,6 +291,7 @@ struct qemu_work_item {
 void (*func)(void *data);
 void *data;
 int done;
+bool free;
 };
 
 #ifdef CONFIG_USER_ONLY
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index a5bb515..b555c22 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -288,6 +288,16 @@ bool cpu_is_stopped(CPUState *cpu);
 void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
 
 /**
+ * async_run_on_cpu:
+ * @cpu: The vCPU to run on.
+ * @func: The function to be executed.
+ * @data: Data to pass to the function.
+ *
+ * Schedules the function @func for execution on the vCPU @cpu asynchronously.
+ */
+void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
+
+/**
  * qemu_for_each_cpu:
  * @func: The function to be executed.
  * @data: Data to pass to the function.
-- 
1.7.1
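The ownership rule this patch introduces can be modeled in isolation: synchronous run_on_cpu() items live on the caller's stack, while async items are heap allocated and carry free = true so the vCPU thread that flushes the queue can release them. A self-contained sketch of that model (simplified, single-threaded, with made-up names — the real code also kicks the vCPU and uses QEMU's locking):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Simplified model of the qemu_work_item queue. */
struct work_item {
    void (*func)(void *data);
    void *data;
    bool free;               /* heap allocated, freed by the flusher */
    struct work_item *next;
};

static struct work_item *queue_head;

/* Enqueue without blocking: the item is heap allocated and marked
 * free = true, mirroring async_run_on_cpu(). */
static void queue_async(void (*func)(void *), void *data)
{
    struct work_item *wi = calloc(1, sizeof(*wi));
    wi->func = func;
    wi->data = data;
    wi->free = true;
    wi->next = queue_head;
    queue_head = wi;
}

/* Run every queued item, releasing the heap-allocated ones,
 * mirroring flush_queued_work(). */
static void flush_queue(void)
{
    while (queue_head) {
        struct work_item *wi = queue_head;
        queue_head = wi->next;
        wi->func(wi->data);
        if (wi->free) {
            free(wi);
        }
    }
}

/* Example callback: bump a counter passed via the data pointer. */
static void bump(void *data)
{
    ++*(int *)data;
}
```

Because the caller never waits, the callback's lifetime must not depend on the caller's stack, which is exactly why the free flag exists.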




Re: [Qemu-devel] [PATCH v8 3/3] Force auto-convegence of live migration

2013-06-24 Thread Chegu Vinod

On 6/24/2013 8:59 AM, Paolo Bonzini wrote:

On 24/06/2013 11:47, Chegu Vinod wrote:

If a user chooses to turn on the auto-converge migration capability
these changes detect the lack of convergence and throttle down the
guest. i.e. force the VCPUs out of the guest for some duration
and let the migration thread catch up and help converge.

Verified the convergence using the following :
  - Java Warehouse workload running on a 20VCPU/256G guest(~80% busy)
  - OLTP like workload running on a 80VCPU/512G guest (~80% busy)

Sample results with Java warehouse workload : (migrate speed set to 20Gb and
migrate downtime set to 4seconds).

  (qemu) info migrate
  capabilities: xbzrle: off auto-converge: off  
  Migration status: active
  total time: 1487503 milliseconds
  expected downtime: 519 milliseconds
  transferred ram: 383749347 kbytes
  remaining ram: 2753372 kbytes
  total ram: 268444224 kbytes
  duplicate: 65461532 pages
  skipped: 64901568 pages
  normal: 95750218 pages
  normal bytes: 383000872 kbytes
  dirty pages rate: 67551 pages

  ---

  (qemu) info migrate
  capabilities: xbzrle: off auto-converge: on   
  Migration status: completed
  total time: 241161 milliseconds
  downtime: 6373 milliseconds
  transferred ram: 28235307 kbytes
  remaining ram: 0 kbytes
  total ram: 268444224 kbytes
  duplicate: 64946416 pages
  skipped: 64903523 pages
  normal: 7044971 pages
  normal bytes: 28179884 kbytes

Signed-off-by: Chegu Vinod chegu_vi...@hp.com

As far as the algorithm is concerned,

Reviewed-by: Paolo Bonzini pbonz...@redhat.com


Thanks!


but are you sure that this passes checkpatch.pl?


Yes it does (had checked it before I posted).

# ./scripts/checkpatch.pl 
0003-Force-auto-convegence-of-live-migration.patch

total: 0 errors, 0 warnings, 114 lines checked

0003-Force-auto-convegence-of-live-migration.patch has no obvious style 
problems and is ready for submission.


Vinod



+/* The following detection logic can be refined later. For now:
+   Check to see if the dirtied bytes is 50% more than the approx.
+   amount of bytes that just got transferred since the last time we
+   were in this routine. If that happens N times (for now N==4)
+   we turn on the throttle down logic */
+bytes_xfer_now = ram_bytes_transferred();
+if (s->dirty_pages_rate &&
+   (num_dirty_pages_period * TARGET_PAGE_SIZE >
+   (bytes_xfer_now - bytes_xfer_prev)/2) &&
+   (dirty_rate_high_cnt++ > 4)) {

the spacing of the operators here looks like something checkpatch.pl
would complain about.  If you have to respin for that, keep my R-b and
please also remove all other superfluous parentheses.

Paolo


+trace_migration_throttle();
+mig_throttle_on = true;
+dirty_rate_high_cnt = 0;
+ }
+ bytes_xfer_prev = bytes_xfer_now;
+} else {
+ mig_throttle_on = false;
+}
  s->dirty_pages_rate = num_dirty_pages_period * 1000
  / (end_time - start_time);
  s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -566,6 +592,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
  migration_bitmap = bitmap_new(ram_pages);
  bitmap_set(migration_bitmap, 0, ram_pages);
  migration_dirty_pages = ram_pages;
+mig_throttle_on = false;
+dirty_rate_high_cnt = 0;
  
  if (migrate_use_xbzrle()) {

  XBZRLE.cache = cache_init(migrate_xbzrle_cache_size() /
@@ -628,6 +656,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
  }
  total_sent += bytes_sent;
  acct_info.iterations++;
+check_guest_throttling();
  /* we want to check in the 1st loop, just in case it was the 1st time
 and we had to sync the dirty bitmap.
 qemu_get_clock_ns() is a bit expensive, so we only check each some
@@ -1097,3 +1126,53 @@ TargetInfo *qmp_query_target(Error **errp)
  
  return info;

  }
+
+/* Stub function that's gets run on the vcpu when its brought out of the
+   VM to run inside qemu via async_run_on_cpu()*/
+static void mig_sleep_cpu(void *opq)
+{
+qemu_mutex_unlock_iothread();
+g_usleep(30*1000);
+qemu_mutex_lock_iothread();
+}
+
+/* To reduce the dirty rate explicitly disallow the VCPUs from spending
+   much time in the VM. The migration thread will try to catchup.
+   Workload will experience a performance drop.
+*/
+static void mig_throttle_cpu_down(CPUState *cpu, void *data)
+{
+async_run_on_cpu(cpu, mig_sleep_cpu, NULL);
+}
+
+static void mig_throttle_guest_down(void)
+{
+qemu_mutex_lock_iothread();
+qemu_for_each_cpu(mig_throttle_cpu_down, NULL);
+qemu_mutex_unlock_iothread();
+}
+
+static void check_guest_throttling(void)
+{
+static int64_t t0;
+int64_t t1;
+
+if (!mig_throttle_on) {
+return;
+}
+
+if (!t0

Re: [Qemu-devel] [RFC PATCH v6 3/3] Force auto-convegence of live migration

2013-06-23 Thread Chegu Vinod

On 6/20/2013 5:54 AM, Paolo Bonzini wrote:

On 14/06/2013 15:58, Chegu Vinod wrote:

If a user chooses to turn on the auto-converge migration capability
these changes detect the lack of convergence and throttle down the
guest. i.e. force the VCPUs out of the guest for some duration
and let the migration thread catch up and help converge.

Hi Vinod,

pretty much the same comments I sent you yesterday on the obsolete
version of the patch still apply.


Verified the convergence using the following :
  - Java Warehouse workload running on a 20VCPU/256G guest(~80% busy)
  - OLTP like workload running on a 80VCPU/512G guest (~80% busy)

Sample results with Java warehouse workload : (migrate speed set to 20Gb and
migrate downtime set to 4seconds).

  (qemu) info migrate
  capabilities: xbzrle: off auto-converge: off  
  Migration status: active
  total time: 1487503 milliseconds
  expected downtime: 519 milliseconds
  transferred ram: 383749347 kbytes
  remaining ram: 2753372 kbytes
  total ram: 268444224 kbytes
  duplicate: 65461532 pages
  skipped: 64901568 pages
  normal: 95750218 pages
  normal bytes: 383000872 kbytes
  dirty pages rate: 67551 pages

  ---

  (qemu) info migrate
  capabilities: xbzrle: off auto-converge: on   
  Migration status: completed
  total time: 241161 milliseconds
  downtime: 6373 milliseconds
  transferred ram: 28235307 kbytes
  remaining ram: 0 kbytes
  total ram: 268444224 kbytes
  duplicate: 64946416 pages
  skipped: 64903523 pages
  normal: 7044971 pages
  normal bytes: 28179884 kbytes

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
  arch_init.c |   85 +++
  1 files changed, 85 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 5d32ecf..69c6c8c 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,8 @@ int graphic_depth = 15;
  #endif
  
  const uint32_t arch_type = QEMU_ARCH;

+static bool mig_throttle_on;
+static void throttle_down_guest_to_converge(void);
  
  /***/

  /* ram save/restore */
@@ -378,8 +380,15 @@ static void migration_bitmap_sync(void)
  uint64_t num_dirty_pages_init = migration_dirty_pages;
  MigrationState *s = migrate_get_current();
  static int64_t start_time;
+static int64_t bytes_xfer_prev;
  static int64_t num_dirty_pages_period;
  int64_t end_time;
+int64_t bytes_xfer_now;
+static int dirty_rate_high_cnt;
+
+if (!bytes_xfer_prev) {
+bytes_xfer_prev = ram_bytes_transferred();
+}
  
  if (!start_time) {

  start_time = qemu_get_clock_ms(rt_clock);
@@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
  
  /* more than 1 second = 1000 millisecons */

  if (end_time > start_time + 1000) {
+if (migrate_auto_converge()) {
+/* The following detection logic can be refined later. For now:
+   Check to see if the dirtied bytes is 50% more than the approx.
+   amount of bytes that just got transferred since the last time we
+   were in this routine. If that happens N times (for now N==4)
+   we turn on the throttle down logic */
+bytes_xfer_now = ram_bytes_transferred();
+if (s->dirty_pages_rate &&
+((num_dirty_pages_period*TARGET_PAGE_SIZE) >
+((bytes_xfer_now - bytes_xfer_prev)/2))) {
+if (dirty_rate_high_cnt++ > 4) {

Too many parentheses, and please remove the nested if.


+DPRINTF("Unable to converge. Throttling down guest\n");

Please use tracepoint instead.


+mig_throttle_on = true;

Need to reset dirty_rate_high_cnt here, and both
dirty_rate_high_cnt/mig_throttle_on if you see !migrate_auto_converge().
  This ensures that throttling does not kick in automatically if you
disable and re-enable the feature.  It also lets you remove a bunch of
migrate_auto_converge() checks.

You also need to reset dirty_rate_high_cnt/mig_throttle_on in the setup
phase of migration.


+}
+ }
+ bytes_xfer_prev = bytes_xfer_now;
+}
  s->dirty_pages_rate = num_dirty_pages_period * 1000
  / (end_time - start_time);
  s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -628,6 +654,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
  }
  total_sent += bytes_sent;
  acct_info.iterations++;
+throttle_down_guest_to_converge();

You can use a shorter name, like check_cpu_throttling().


  /* we want to check in the 1st loop, just in case it was the 1st time
 and we had to sync the dirty bitmap.
 qemu_get_clock_ns() is a bit expensive, so we only check each some
@@ -1098,3 +1125,61 @@ TargetInfo *qmp_query_target(Error **errp)
  
  return info;

  }
+
+static bool throttling_needed(void)
+{
+if (!migrate_auto_converge

[Qemu-devel] [PATCH v7 1/3] Introduce async_run_on_cpu()

2013-06-23 Thread Chegu Vinod
Introduce an asynchronous version of run_on_cpu() i.e. the caller
doesn't have to block till the call back routine finishes execution
on the target vcpu.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
Reviewed-by: Paolo Bonzini pbonz...@redhat.com
---
 cpus.c|   29 +
 include/qemu-common.h |1 +
 include/qom/cpu.h |   10 ++
 3 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/cpus.c b/cpus.c
index c8bc8ad..c7c90d0 100644
--- a/cpus.c
+++ b/cpus.c
@@ -653,6 +653,7 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), 
void *data)
 
 wi.func = func;
 wi.data = data;
+wi.free = false;
 if (cpu->queued_work_first == NULL) {
 cpu->queued_work_first = &wi;
 } else {
@@ -671,6 +672,31 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), 
void *data)
 }
 }
 
+void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
+{
+struct qemu_work_item *wi;
+
+if (qemu_cpu_is_self(cpu)) {
+func(data);
+return;
+}
+
+wi = g_malloc0(sizeof(struct qemu_work_item));
+wi->func = func;
+wi->data = data;
+wi->free = true;
+if (cpu->queued_work_first == NULL) {
+cpu->queued_work_first = wi;
+} else {
+cpu->queued_work_last->next = wi;
+}
+cpu->queued_work_last = wi;
+wi->next = NULL;
+wi->done = false;
+
+qemu_cpu_kick(cpu);
+}
+
 static void flush_queued_work(CPUState *cpu)
 {
 struct qemu_work_item *wi;
@@ -683,6 +709,9 @@ static void flush_queued_work(CPUState *cpu)
 cpu->queued_work_first = wi->next;
 wi->func(wi->data);
 wi->done = true;
+if (wi->free) {
+g_free(wi);
+}
 }
 cpu->queued_work_last = NULL;
 qemu_cond_broadcast(&qemu_work_cond);
diff --git a/include/qemu-common.h b/include/qemu-common.h
index 3c91375..9834dcb 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -291,6 +291,7 @@ struct qemu_work_item {
 void (*func)(void *data);
 void *data;
 int done;
+bool free;
 };
 
 #ifdef CONFIG_USER_ONLY
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index a5bb515..b555c22 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -288,6 +288,16 @@ bool cpu_is_stopped(CPUState *cpu);
 void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
 
 /**
+ * async_run_on_cpu:
+ * @cpu: The vCPU to run on.
+ * @func: The function to be executed.
+ * @data: Data to pass to the function.
+ *
+ * Schedules the function @func for execution on the vCPU @cpu asynchronously.
+ */
+void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
+
+/**
  * qemu_for_each_cpu:
  * @func: The function to be executed.
  * @data: Data to pass to the function.
-- 
1.7.1




[Qemu-devel] [PATCH v7 0/3] Throttle-down guest to help with live migration convergence

2013-06-23 Thread Chegu Vinod
Busy enterprise workloads hosted on large-sized VMs tend to dirty
memory faster than the transfer rate achieved via live guest migration.
Despite some good recent improvements (using dedicated 10Gig NICs
between hosts) the live migration does NOT converge.

If a user chooses to force convergence of their migration via the new
migration capability "auto-converge" then this change will auto-detect
the lack-of-convergence scenario and trigger a slow down of the workload
by explicitly disallowing the VCPUs from spending much time in the VM
context.

The migration thread tries to catch up and this eventually leads
to convergence in some deterministic amount of time. Yes, it does
impact the performance of all the VCPUs, but in my observation that
lasts only for a short duration of time, i.e. we end up entering
stage 3 (the downtime phase) soon after that. No external trigger is
required.

Thanks to Juan and Paolo for their useful suggestions.

---
Changes from v6:
- incorporated feedback from Paolo.
- rebased to latest qemu.git and removing RFC

Changes from v5:
- incorporated feedback from Paolo  Igor.
- rebased to latest qemu.git

Changes from v4:
- incorporated feedback from Paolo.
- split into 3 patches.

Changes from v3:
- incorporated feedback from Paolo and Eric
- rebased to latest qemu.git

Changes from v2:
- incorporated feedback from Orit, Juan and Eric
- stop the throttling thread at the start of stage 3
- rebased to latest qemu.git

Changes from v1:
- rebased to latest qemu.git
- added auto-converge capability(default off) - suggested by Anthony Liguori 
Eric Blake.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---

Chegu Vinod (3):
  Introduce async_run_on_cpu()
  Add 'auto-converge' migration capability
  Force auto-convergence of live migration

 arch_init.c   |   85 +
 cpus.c|   29 ++
 include/migration/migration.h |2 +
 include/qemu-common.h |1 +
 include/qom/cpu.h |   10 +
 migration.c   |9 
 qapi-schema.json  |5 ++-
 7 files changed, 140 insertions(+), 1 deletions(-)




[Qemu-devel] [PATCH v7 2/3] Add 'auto-converge' migration capability

2013-06-23 Thread Chegu Vinod
If a user chooses to turn on the auto-converge migration capability
these changes detect the lack of convergence and throttle down the
guest. i.e. force the VCPUs out of the guest for some duration
and let the migration thread catch up and help converge.

Verified the convergence using the following :
 - Java Warehouse workload running on a 20VCPU/256G guest(~80% busy)
 - OLTP like workload running on a 80VCPU/512G guest (~80% busy)

Sample results with Java warehouse workload: (migrate speed set to 20Gb and
migrate downtime set to 4 seconds).

 (qemu) info migrate
 capabilities: xbzrle: off auto-converge: off  
 Migration status: active
 total time: 1487503 milliseconds
 expected downtime: 519 milliseconds
 transferred ram: 383749347 kbytes
 remaining ram: 2753372 kbytes
 total ram: 268444224 kbytes
 duplicate: 65461532 pages
 skipped: 64901568 pages
 normal: 95750218 pages
 normal bytes: 383000872 kbytes
 dirty pages rate: 67551 pages

 ---

 (qemu) info migrate
 capabilities: xbzrle: off auto-converge: on   
 Migration status: completed
 total time: 241161 milliseconds
 downtime: 6373 milliseconds
 transferred ram: 28235307 kbytes
 remaining ram: 0 kbytes
 total ram: 268444224 kbytes
 duplicate: 64946416 pages
 skipped: 64903523 pages
 normal: 7044971 pages
 normal bytes: 28179884 kbytes

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 arch_init.c |   79 +++
 1 files changed, 79 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index a8b91ee..e7ca3b1 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,9 @@ int graphic_depth = 15;
 #endif
 
 const uint32_t arch_type = QEMU_ARCH;
+static bool mig_throttle_on;
+static int dirty_rate_high_cnt;
+static void check_guest_throttling(void);
 
 /***/
 /* ram save/restore */
@@ -378,8 +381,14 @@ static void migration_bitmap_sync(void)
 uint64_t num_dirty_pages_init = migration_dirty_pages;
 MigrationState *s = migrate_get_current();
 static int64_t start_time;
+static int64_t bytes_xfer_prev;
 static int64_t num_dirty_pages_period;
 int64_t end_time;
+int64_t bytes_xfer_now;
+
+if (!bytes_xfer_prev) {
+bytes_xfer_prev = ram_bytes_transferred();
+}
 
 if (!start_time) {
 start_time = qemu_get_clock_ms(rt_clock);
@@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
 
 /* more than 1 second = 1000 millisecons */
 if (end_time > start_time + 1000) {
+if (migrate_auto_converge()) {
+/* The following detection logic can be refined later. For now:
+   Check to see if the dirtied bytes is 50% more than the approx.
+   amount of bytes that just got transferred since the last time we
+   were in this routine. If that happens N times (for now N==4)
+   we turn on the throttle down logic */
+bytes_xfer_now = ram_bytes_transferred();
+if (s->dirty_pages_rate &&
+   (num_dirty_pages_period * TARGET_PAGE_SIZE >
+   (bytes_xfer_now - bytes_xfer_prev)/2) &&
+   (dirty_rate_high_cnt++ > 4)) {
+trace_migration_throttle();
+mig_throttle_on = true;
+dirty_rate_high_cnt = 0;
+ }
+ bytes_xfer_prev = bytes_xfer_now;
+}
 s->dirty_pages_rate = num_dirty_pages_period * 1000
 / (end_time - start_time);
 s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -566,6 +592,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 migration_bitmap = bitmap_new(ram_pages);
 bitmap_set(migration_bitmap, 0, ram_pages);
 migration_dirty_pages = ram_pages;
+mig_throttle_on = false;
+dirty_rate_high_cnt = 0;
 
 if (migrate_use_xbzrle()) {
 XBZRLE.cache = cache_init(migrate_xbzrle_cache_size() /
@@ -628,6 +656,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 }
 total_sent += bytes_sent;
 acct_info.iterations++;
+check_guest_throttling();
 /* we want to check in the 1st loop, just in case it was the 1st time
and we had to sync the dirty bitmap.
qemu_get_clock_ns() is a bit expensive, so we only check each some
@@ -1097,3 +1126,53 @@ TargetInfo *qmp_query_target(Error **errp)
 
 return info;
 }
+
+/* Stub function that's gets run on the vcpu when its brought out of the
+   VM to run inside qemu via async_run_on_cpu()*/
+static void mig_sleep_cpu(void *opq)
+{
+qemu_mutex_unlock_iothread();
+g_usleep(30*1000);
+qemu_mutex_lock_iothread();
+}
+
+/* To reduce the dirty rate explicitly disallow the VCPUs from spending
+   much time in the VM. The migration thread will try to catchup.
+   Workload will experience a performance drop.
+*/
+static void mig_throttle_cpu_down(CPUState *cpu, void *data)
+{
+async_run_on_cpu

[Qemu-devel] [PATCH v7 2/3] Add 'auto-converge' migration capability

2013-06-23 Thread Chegu Vinod
The auto-converge migration capability allows the user to specify whether
they want the live migration sequence to automatically detect and force
convergence.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
Reviewed-by: Paolo Bonzini pbonz...@redhat.com
Reviewed-by: Eric Blake ebl...@redhat.com
---
 include/migration/migration.h |2 ++
 migration.c   |9 +
 qapi-schema.json  |5 -
 3 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index e2acec6..ace91b0 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -127,4 +127,6 @@ int migrate_use_xbzrle(void);
 int64_t migrate_xbzrle_cache_size(void);
 
 int64_t xbzrle_cache_resize(int64_t new_size);
+
+bool migrate_auto_converge(void);
 #endif
diff --git a/migration.c b/migration.c
index 058f9e6..d0759c1 100644
--- a/migration.c
+++ b/migration.c
@@ -473,6 +473,15 @@ void qmp_migrate_set_downtime(double value, Error **errp)
 max_downtime = (uint64_t)value;
 }
 
+bool migrate_auto_converge(void)
+{
+MigrationState *s;
+
+s = migrate_get_current();
+
+return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
+}
+
 int migrate_use_xbzrle(void)
 {
 MigrationState *s;
diff --git a/qapi-schema.json b/qapi-schema.json
index a80ee40..c019fec 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -605,10 +605,13 @@
 #  This feature allows us to minimize migration traffic for certain work
 #  loads, by sending compressed difference of the pages
 #
+# @auto-converge: If enabled, QEMU will automatically throttle down the guest
+#  to speed up convergence of RAM migration. (since 1.6)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle'] }
+  'data': ['xbzrle', 'auto-converge'] }
 
 ##
 # @MigrationCapabilityStatus
-- 
1.7.1




[Qemu-devel] [PATCH v7 3/3] Force auto-convergence of live migration

2013-06-23 Thread Chegu Vinod
If a user chooses to turn on the auto-converge migration capability
these changes detect the lack of convergence and throttle down the
guest. i.e. force the VCPUs out of the guest for some duration
and let the migration thread catch up and help converge.

Verified the convergence using the following :
 - Java Warehouse workload running on a 20VCPU/256G guest(~80% busy)
 - OLTP like workload running on a 80VCPU/512G guest (~80% busy)

Sample results with Java warehouse workload: (migrate speed set to 20Gb and
migrate downtime set to 4 seconds).

 (qemu) info migrate
 capabilities: xbzrle: off auto-converge: off  
 Migration status: active
 total time: 1487503 milliseconds
 expected downtime: 519 milliseconds
 transferred ram: 383749347 kbytes
 remaining ram: 2753372 kbytes
 total ram: 268444224 kbytes
 duplicate: 65461532 pages
 skipped: 64901568 pages
 normal: 95750218 pages
 normal bytes: 383000872 kbytes
 dirty pages rate: 67551 pages

 ---

 (qemu) info migrate
 capabilities: xbzrle: off auto-converge: on   
 Migration status: completed
 total time: 241161 milliseconds
 downtime: 6373 milliseconds
 transferred ram: 28235307 kbytes
 remaining ram: 0 kbytes
 total ram: 268444224 kbytes
 duplicate: 64946416 pages
 skipped: 64903523 pages
 normal: 7044971 pages
 normal bytes: 28179884 kbytes

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 arch_init.c |   79 +++
 1 files changed, 79 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index a8b91ee..e7ca3b1 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,9 @@ int graphic_depth = 15;
 #endif
 
 const uint32_t arch_type = QEMU_ARCH;
+static bool mig_throttle_on;
+static int dirty_rate_high_cnt;
+static void check_guest_throttling(void);
 
 /***/
 /* ram save/restore */
@@ -378,8 +381,14 @@ static void migration_bitmap_sync(void)
 uint64_t num_dirty_pages_init = migration_dirty_pages;
 MigrationState *s = migrate_get_current();
 static int64_t start_time;
+static int64_t bytes_xfer_prev;
 static int64_t num_dirty_pages_period;
 int64_t end_time;
+int64_t bytes_xfer_now;
+
+if (!bytes_xfer_prev) {
+bytes_xfer_prev = ram_bytes_transferred();
+}
 
 if (!start_time) {
 start_time = qemu_get_clock_ms(rt_clock);
@@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
 
 /* more than 1 second = 1000 millisecons */
 if (end_time > start_time + 1000) {
+if (migrate_auto_converge()) {
+/* The following detection logic can be refined later. For now:
+   Check to see if the dirtied bytes is 50% more than the approx.
+   amount of bytes that just got transferred since the last time we
+   were in this routine. If that happens N times (for now N==4)
+   we turn on the throttle down logic */
+bytes_xfer_now = ram_bytes_transferred();
+if (s->dirty_pages_rate &&
+   (num_dirty_pages_period * TARGET_PAGE_SIZE >
+   (bytes_xfer_now - bytes_xfer_prev)/2) &&
+   (dirty_rate_high_cnt++ > 4)) {
+trace_migration_throttle();
+mig_throttle_on = true;
+dirty_rate_high_cnt = 0;
+ }
+ bytes_xfer_prev = bytes_xfer_now;
+}
 s->dirty_pages_rate = num_dirty_pages_period * 1000
 / (end_time - start_time);
 s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -566,6 +592,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 migration_bitmap = bitmap_new(ram_pages);
 bitmap_set(migration_bitmap, 0, ram_pages);
 migration_dirty_pages = ram_pages;
+mig_throttle_on = false;
+dirty_rate_high_cnt = 0;
 
 if (migrate_use_xbzrle()) {
 XBZRLE.cache = cache_init(migrate_xbzrle_cache_size() /
@@ -628,6 +656,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 }
 total_sent += bytes_sent;
 acct_info.iterations++;
+check_guest_throttling();
 /* we want to check in the 1st loop, just in case it was the 1st time
and we had to sync the dirty bitmap.
qemu_get_clock_ns() is a bit expensive, so we only check each some
@@ -1097,3 +1126,53 @@ TargetInfo *qmp_query_target(Error **errp)
 
 return info;
 }
+
+/* Stub function that's gets run on the vcpu when its brought out of the
+   VM to run inside qemu via async_run_on_cpu()*/
+static void mig_sleep_cpu(void *opq)
+{
+qemu_mutex_unlock_iothread();
+g_usleep(30*1000);
+qemu_mutex_lock_iothread();
+}
+
+/* To reduce the dirty rate explicitly disallow the VCPUs from spending
+   much time in the VM. The migration thread will try to catchup.
+   Workload will experience a performance drop.
+*/
+static void mig_throttle_cpu_down(CPUState *cpu, void *data)
+{
+async_run_on_cpu

Re: [Qemu-devel] [PATCH v7 2/3] Add 'auto-converge' migration capability

2013-06-23 Thread Chegu Vinod



Oops!  A minor glitch on my side (pl. ignore the subject line of 
this...as this is actually patch 3/3 and not patch 2/3).   I just resent 
this as patch 3/3 with the correct subject line.


Thx
Vinod

On 6/23/2013 1:05 PM, Chegu Vinod wrote:

If a user chooses to turn on the auto-converge migration capability
these changes detect the lack of convergence and throttle down the
guest. i.e. force the VCPUs out of the guest for some duration
and let the migration thread catchup and help converge.

Verified the convergence using the following :
  - Java Warehouse workload running on a 20VCPU/256G guest(~80% busy)
  - OLTP like workload running on a 80VCPU/512G guest (~80% busy)

Sample results with Java warehouse workload : (migrate speed set to 20Gb and
migrate downtime set to 4seconds).

  (qemu) info migrate
  capabilities: xbzrle: off auto-converge: off  
  Migration status: active
  total time: 1487503 milliseconds
  expected downtime: 519 milliseconds
  transferred ram: 383749347 kbytes
  remaining ram: 2753372 kbytes
  total ram: 268444224 kbytes
  duplicate: 65461532 pages
  skipped: 64901568 pages
  normal: 95750218 pages
  normal bytes: 383000872 kbytes
  dirty pages rate: 67551 pages

  ---

  (qemu) info migrate
  capabilities: xbzrle: off auto-converge: on   
  Migration status: completed
  total time: 241161 milliseconds
  downtime: 6373 milliseconds
  transferred ram: 28235307 kbytes
  remaining ram: 0 kbytes
  total ram: 268444224 kbytes
  duplicate: 64946416 pages
  skipped: 64903523 pages
  normal: 7044971 pages
  normal bytes: 28179884 kbytes

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
  arch_init.c |   79 +++
  1 files changed, 79 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index a8b91ee..e7ca3b1 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,9 @@ int graphic_depth = 15;
  #endif
  
  const uint32_t arch_type = QEMU_ARCH;

+static bool mig_throttle_on;
+static int dirty_rate_high_cnt;
+static void check_guest_throttling(void);
  
  /***/

  /* ram save/restore */
@@ -378,8 +381,14 @@ static void migration_bitmap_sync(void)
  uint64_t num_dirty_pages_init = migration_dirty_pages;
  MigrationState *s = migrate_get_current();
  static int64_t start_time;
+static int64_t bytes_xfer_prev;
  static int64_t num_dirty_pages_period;
  int64_t end_time;
+int64_t bytes_xfer_now;
+
+if (!bytes_xfer_prev) {
+bytes_xfer_prev = ram_bytes_transferred();
+}
  
  if (!start_time) {

  start_time = qemu_get_clock_ms(rt_clock);
@@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
  
  /* more than 1 second = 1000 millisecons */

  if (end_time > start_time + 1000) {
+if (migrate_auto_converge()) {
+/* The following detection logic can be refined later. For now:
+   Check to see if the dirtied bytes is 50% more than the approx.
+   amount of bytes that just got transferred since the last time we
+   were in this routine. If that happens N times (for now N==4)
+   we turn on the throttle down logic */
+bytes_xfer_now = ram_bytes_transferred();
+if (s->dirty_pages_rate &&
+   (num_dirty_pages_period * TARGET_PAGE_SIZE >
+   (bytes_xfer_now - bytes_xfer_prev)/2) &&
+   (dirty_rate_high_cnt++ > 4)) {
+trace_migration_throttle();
+mig_throttle_on = true;
+dirty_rate_high_cnt = 0;
+ }
+ bytes_xfer_prev = bytes_xfer_now;
+}
  s->dirty_pages_rate = num_dirty_pages_period * 1000
  / (end_time - start_time);
  s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -566,6 +592,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
  migration_bitmap = bitmap_new(ram_pages);
  bitmap_set(migration_bitmap, 0, ram_pages);
  migration_dirty_pages = ram_pages;
+mig_throttle_on = false;
+dirty_rate_high_cnt = 0;
  
  if (migrate_use_xbzrle()) {

  XBZRLE.cache = cache_init(migrate_xbzrle_cache_size() /
@@ -628,6 +656,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
  }
  total_sent += bytes_sent;
  acct_info.iterations++;
+check_guest_throttling();
  /* we want to check in the 1st loop, just in case it was the 1st time
 and we had to sync the dirty bitmap.
 qemu_get_clock_ns() is a bit expensive, so we only check each some
@@ -1097,3 +1126,53 @@ TargetInfo *qmp_query_target(Error **errp)
  
  return info;

  }
+
+/* Stub function that's gets run on the vcpu when its brought out of the
+   VM to run inside qemu via async_run_on_cpu()*/
+static void mig_sleep_cpu(void *opq)
+{
+qemu_mutex_unlock_iothread();
+g_usleep(30*1000

Re: [Qemu-devel] [PATCH v9 14/14] rdma: add pin-all accounting timestamp to QMP statistics

2013-06-16 Thread Chegu Vinod

On 6/14/2013 1:35 PM, mrhi...@linux.vnet.ibm.com wrote:

From: Michael R. Hines mrhi...@us.ibm.com

For very large virtual machines, pinning can take a long time.
While this does not affect the migration's *actual* time itself,
it is still important for the user to know what's going on and
to know what component of the total time is actually taken up by
pinning.

For example, using a 14GB virtual machine, pinning can take as
long as 5 seconds, for which the user would not otherwise know
what was happening.

Reviewed-by: Paolo Bonzini pbonz...@redhat.com
Signed-off-by: Michael R. Hines mrhi...@us.ibm.com
---
  hmp.c |4 +++
  include/migration/migration.h |1 +
  migration-rdma.c  |   55 +++--
  migration.c   |   13 +-
  qapi-schema.json  |3 ++-
  5 files changed, 56 insertions(+), 20 deletions(-)

diff --git a/hmp.c b/hmp.c
index 148a3fb..90c55f2 100644
--- a/hmp.c
+++ b/hmp.c
@@ -164,6 +164,10 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
  monitor_printf(mon, "downtime: %" PRIu64 " milliseconds\n",
 info->downtime);
  }
+if (info->has_pin_all_time) {
+monitor_printf(mon, "pin-all: %" PRIu64 " milliseconds\n",
+   info->pin_all_time);
+}
  }
  
  if (info-has_ram) {

diff --git a/include/migration/migration.h b/include/migration/migration.h
index b49e68b..d2ca75b 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -49,6 +49,7 @@ struct MigrationState
  bool enabled_capabilities[MIGRATION_CAPABILITY_MAX];
  int64_t xbzrle_cache_size;
  double mbps;
+int64_t pin_all_time;
  };
  
  void process_incoming_migration(QEMUFile *f);

diff --git a/migration-rdma.c b/migration-rdma.c
index 853de18..e407dce 100644
--- a/migration-rdma.c
+++ b/migration-rdma.c
@@ -699,11 +699,11 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma)
  return 0;
  }
  
-static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma,
-RDMALocalBlocks *rdma_local_ram_blocks)
+static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
  {
  int i;
-uint64_t start = qemu_get_clock_ms(rt_clock);
+int64_t start = qemu_get_clock_ms(host_clock);
+RDMALocalBlocks *rdma_local_ram_blocks = &rdma->local_ram_blocks;
  (void)start;
  
 for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {

@@ -721,7 +721,8 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma,
  rdma->total_registrations++;
  }
  
-DPRINTF("lock time: %" PRIu64 "\n", qemu_get_clock_ms(rt_clock) - start);
+DPRINTF("local lock time: %" PRId64 "\n",
+qemu_get_clock_ms(host_clock) - start);
  
  if (i >= rdma_local_ram_blocks->num_blocks) {

  return 0;
@@ -1262,7 +1263,8 @@ static void qemu_rdma_move_header(RDMAContext *rdma, int 
idx,
   */
  static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
 uint8_t *data, RDMAControlHeader *resp,
-   int *resp_idx)
+   int *resp_idx,
+   int (*callback)(RDMAContext *rdma))
  {
  int ret = 0;
  int idx = 0;
@@ -1315,6 +1317,14 @@ static int qemu_rdma_exchange_send(RDMAContext *rdma, 
RDMAControlHeader *head,
   * If we're expecting a response, block and wait for it.
   */
  if (resp) {
+if (callback) {
+DPRINTF("Issuing callback before receiving response...\n");
+ret = callback(rdma);
+if (ret < 0) {
+return ret;
+}
+}
+
  DDPRINTF("Waiting for response %s\n", control_desc[resp->type]);
  ret = qemu_rdma_exchange_get_response(rdma, resp, resp->type, idx + 
1);
  
@@ -1464,7 +1474,7 @@ static int qemu_rdma_write_one(QEMUFile *f, RDMAContext *rdma,

  chunk, sge.length, current_index, offset);
  
  ret = qemu_rdma_exchange_send(rdma, head,

-(uint8_t *) comp, NULL, NULL);
+(uint8_t *) comp, NULL, NULL, NULL);
  
  if (ret < 0) {

  return -EIO;
@@ -1487,7 +1497,7 @@ static int qemu_rdma_write_one(QEMUFile *f, RDMAContext 
*rdma,
  chunk, sge.length, current_index, offset);
  
  ret = qemu_rdma_exchange_send(rdma, head, (uint8_t *) reg,

-resp, reg_result_idx);
+resp, reg_result_idx, NULL);
  if (ret < 0) {
  return ret;
  }
@@ -2126,7 +2136,7 @@ static int qemu_rdma_put_buffer(void *opaque, const 
uint8_t *buf,
  head.len = r->len;
  head.type = RDMA_CONTROL_QEMU_FILE;
  
-ret = qemu_rdma_exchange_send(rdma, head, data, NULL, 

Re: [Qemu-devel] RDMA: please pull and re-test freezing fixes

2013-06-15 Thread Chegu Vinod

On 6/14/2013 1:38 PM, Michael R. Hines wrote:

Chegu,

I sent a V9 to the mailing list:

The version goes even further, by explicitly timing the pinning
latency and pushing the value out to QMP so the user clearly knows
which component of total migration time is consumed by pinning.

If you're satisfied, I'd appreciate if I could add your Reviewed-By: =)



Pl. see below...and yes you can add me.
Thanks,
Vinod

The migration speed was set to 40G and the downtime to 2sec for all experiments
below. 

Note: Idle guests are not interesting due to tons of zero pages etc...but
including them here to highlight the overhead of pinning.

1) 20vcpu/64GB guest: (kind of a larger sized Cloud-type guest)

a) Idle guest with No pinning (default) :

capabilities: xbzrle: off x-rdma-pin-all: off 
Migration status: completed
total time: 51062 milliseconds
downtime: 1948 milliseconds
pin-all: 0 milliseconds
transferred ram: 1816547 kbytes
throughput: 6872.23 mbps
remaining ram: 0 kbytes
total ram: 67117632 kbytes
duplicate: 16331552 pages
skipped: 0 pages
normal: 450038 pages
normal bytes: 1800152 kbytes


b) Idle guest with Pinning :

capabilities: xbzrle: off x-rdma-pin-all: on 
Migration status: completed
total time: 47451 milliseconds
downtime: 2639 milliseconds
pin-all: 22780 milliseconds
transferred ram: 67136643 kbytes
throughput: 25222.91 mbps
remaining ram: 0 kbytes
total ram: 67117632 kbytes
duplicate: 0 pages
skipped: 0 pages
normal: 16780064 pages
normal bytes: 67120256 kbytes

There were no freezes observed in the guest at the start of the migration
but the qemu monitor prompt was not responsive for the duration of the 
memory pinning.

Total migration time was affected by the cost of pinning at the start of the 
migration as shown above (this issue can be pursued and optimized later).

c) Pinning + guest running a Java warehouse workload (I cranked the workload up
 to keep the guest 95+% busy)

capabilities: xbzrle: off x-rdma-pin-all: on 
Migration status: active
total time: 412706 milliseconds
expected downtime: 499 milliseconds
pin-all: 22758 milliseconds
transferred ram: 657243669 kbytes
throughput: 25241.89 mbps
remaining ram: 7281848 kbytes
total ram: 67117632 kbytes
duplicate: 0 pages
skipped: 0 pages
normal: 164270810 pages
normal bytes: 657083240 kbytes
dirty pages rate: 369925 pages

No Convergence ! (For workloads where the memory dirty rate is very high
there are other alternatives that have been discussed in the past...)

---

Enterprise-type guests tend to get fatter (more memory per CPU) than the 
larger Cloud guests...so here are a couple of them.


a) 20VCPU/256G Idle guest :

Default:

capabilities: xbzrle: off x-rdma-pin-all: off 
Migration status: completed
total time: 259259 milliseconds
downtime: 3924 milliseconds
pin-all: 0 milliseconds
transferred ram: 5522078 kbytes
throughput: 6586.06 mbps
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 65755168 pages
skipped: 0 pages
normal: 1364124 pages
normal bytes: 5456496 kbytes


Pinned:

capabilities: xbzrle: off x-rdma-pin-all: on 
Migration status: completed
total time: 219053 milliseconds
downtime: 4277 milliseconds
pin-all: 118153 milliseconds
transferred ram: 268512809 kbytes
throughput: 22209.32 mbps
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 0 pages
skipped: 0 pages
normal: 67111817 pages
normal bytes: 268447268 kbytes


b) 40VCPU/512GB Idle guest :


Default:

capabilities: xbzrle: off x-rdma-pin-all: off 
Migration status: completed
total time: 670577 milliseconds
downtime: 6139 milliseconds
pin-all: 0 milliseconds
transferred ram: 10279256 kbytes
throughput: 6150.93 mbps
remaining ram: 0 kbytes
total ram: 536879680 kbytes
duplicate: 131704099 pages
skipped: 0 pages
normal: 2537017 pages
normal bytes: 10148068 kbytes

Pinned:

capabilities: xbzrle: off x-rdma-pin-all: on 
Migration status: completed
total time: 527576 milliseconds
downtime: 6314 milliseconds
pin-all: 312984 milliseconds  
transferred ram: 537129685 kbytes
throughput: 20177.27 mbps
remaining ram: 0 kbytes
total ram: 536879680 kbytes
duplicate: 0 pages
skipped: 0 pages
normal: 134249644 pages
normal bytes: 536998576 kbytes

No freezes in the guest due to memory pinning. (Freezes were only due to the 
dirty bitmap sync-up stuff which is being done while the BQL is held. Juan is 
working on addressing this already for qemu 1.6.)


Re: [Qemu-devel] [PATCH v9 00/14] rdma: migration support

2013-06-15 Thread Chegu Vinod

On 6/14/2013 1:35 PM, mrhi...@linux.vnet.ibm.com wrote:

From: Michael R. Hines mrhi...@us.ibm.com

Changes since v8:
 For very large virtual machines, pinning can take a long time.
 While this does not affect the migration's *actual* time itself,
 it is still important for the user to know what's going on and
 to know what component of the total time is actually taken up by
 pinning.

 For example, using a 14GB virtual machine, pinning can take as
 long as 5 seconds, for which the user would not otherwise know
 what was happening.

Reviewed-by: Eric Blake ebl...@redhat.com
Reviewed-by: Paolo Bonzini pbonz...@redhat.com


Reviewed-by: Chegu Vinod chegu_vi...@hp.com
Tested-by: Chegu Vinod chegu_vi...@hp.com


Thx
Vinod


Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
Github: g...@github.com:hinesmr/qemu.git

Here is a brief summary of total migration time and downtime using RDMA:

Using a 40gbps infiniband link performing a worst-case stress test,
using an 8GB RAM virtual machine:
Using the following command:

$ apt-get install stress
$ stress --vm-bytes 7500M --vm 1 --vm-keep

RESULTS:

1. Migration throughput: 26 gigabits/second.
2. Downtime (stop time) varies between 15 and 100 milliseconds.

EFFECTS of memory registration on bulk phase round:

For example, in the same 8GB RAM example with all 8GB of memory in
active use and the VM itself is completely idle using the same 40 gbps
infiniband link:

1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps

These numbers would of course scale up to whatever size virtual machine
you have to migrate using RDMA.

Enabling this feature does *not* have any measurable effect on
migration *downtime*. This is because, without this feature, all of the
memory will have already been registered in advance during
the bulk round and does not need to be re-registered during the successive
iteration rounds.

The following changes since commit f3aa844bbb2922a5b8393d17620eca7d7e921ab3:

   build: include config-{, all-}devices.mak after defining CONFIG_SOFTMMU and 
CONFIG_USER_ONLY (2013-04-24 12:18:41 -0500)

are available in the git repository at:

   g...@github.com:hinesmr/qemu.git rdma_patch_v9

for you to fetch changes up to 75e6fac1f642885b93cefe6e1874d648e9850f8f:

   rdma: send pc.ram (2013-04-24 14:55:01 -0400)


Michael R. Hines (14):
   rdma: add documentation
   rdma: introduce qemu_update_position()
   rdma: export yield_until_fd_readable()
   rdma: export throughput w/ MigrationStats QMP
   rdma: introduce qemu_file_mode_is_not_valid()
   rdma: export qemu_fflush()
   rdma: introduce ram_handle_compressed()
   rdma: introduce qemu_ram_foreach_block()
   rdma: new QEMUFileOps hooks
   rdma: introduce capability x-rdma-pin-all
   rdma: core logic
   rdma: send pc.ram
   rdma: fix mlock() freezes and accounting
   rdma: add pin-all accounting timestamp to QMP statistics

  Makefile.objs |1 +
  arch_init.c   |   69 +-
  configure |   29 +
  docs/rdma.txt |  415 ++
  exec.c|9 +
  hmp.c |6 +
  include/block/coroutine.h |6 +
  include/exec/cpu-common.h |5 +
  include/migration/migration.h |   32 +
  include/migration/qemu-file.h |   32 +
  migration-rdma.c  | 2831 +
  migration.c   |   36 +-
  qapi-schema.json  |   15 +-
  qemu-coroutine-io.c   |   23 +
  savevm.c  |  114 +-
  15 files changed, 3574 insertions(+), 49 deletions(-)
  create mode 100644 docs/rdma.txt
  create mode 100644 migration-rdma.c






[Qemu-devel] [RFC PATCH v6 1/3] Introduce async_run_on_cpu()

2013-06-14 Thread Chegu Vinod
Introduce an asynchronous version of run_on_cpu() i.e. the caller
doesn't have to block till the call back routine finishes execution
on the target vcpu.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
Reviewed-by: Paolo Bonzini pbonz...@redhat.com
---
 cpus.c|   29 +
 include/qemu-common.h |1 +
 include/qom/cpu.h |   10 ++
 3 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/cpus.c b/cpus.c
index c232265..8cd4eab 100644
--- a/cpus.c
+++ b/cpus.c
@@ -653,6 +653,7 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
 
     wi.func = func;
     wi.data = data;
+    wi.free = false;
     if (cpu->queued_work_first == NULL) {
         cpu->queued_work_first = &wi;
     } else {
@@ -671,6 +672,31 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
     }
 }
 
+void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
+{
+    struct qemu_work_item *wi;
+
+    if (qemu_cpu_is_self(cpu)) {
+        func(data);
+        return;
+    }
+
+    wi = g_malloc0(sizeof(struct qemu_work_item));
+    wi->func = func;
+    wi->data = data;
+    wi->free = true;
+    if (cpu->queued_work_first == NULL) {
+        cpu->queued_work_first = wi;
+    } else {
+        cpu->queued_work_last->next = wi;
+    }
+    cpu->queued_work_last = wi;
+    wi->next = NULL;
+    wi->done = false;
+
+    qemu_cpu_kick(cpu);
+}
+
 static void flush_queued_work(CPUState *cpu)
 {
 struct qemu_work_item *wi;
@@ -683,6 +709,9 @@ static void flush_queued_work(CPUState *cpu)
         cpu->queued_work_first = wi->next;
         wi->func(wi->data);
         wi->done = true;
+        if (wi->free) {
+            g_free(wi);
+        }
     }
     cpu->queued_work_last = NULL;
     qemu_cond_broadcast(qemu_work_cond);
diff --git a/include/qemu-common.h b/include/qemu-common.h
index ed8b6e2..ac0ed38 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -302,6 +302,7 @@ struct qemu_work_item {
     void (*func)(void *data);
     void *data;
     int done;
+    bool free;
 };
 
 #ifdef CONFIG_USER_ONLY
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 7cd9442..46465e9 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -265,6 +265,16 @@ bool cpu_is_stopped(CPUState *cpu);
 void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
 
 /**
+ * async_run_on_cpu:
+ * @cpu: The vCPU to run on.
+ * @func: The function to be executed.
+ * @data: Data to pass to the function.
+ *
+ * Schedules the function @func for execution on the vCPU @cpu asynchronously.
+ */
+void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
+
+/**
  * qemu_for_each_cpu:
  * @func: The function to be executed.
  * @data: Data to pass to the function.
-- 
1.7.1
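The patch above distinguishes stack-allocated synchronous work items (`wi.free = false`) from heap-allocated asynchronous ones that the flush loop must free. The following is a minimal standalone sketch of that ownership pattern; the names (`WorkQueue`, `queue_async`, `flush_queue`, `demo_add`) are illustrative, not QEMU API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Standalone model of the queued-work pattern: async items are
 * heap-allocated with free = true and released by the flush loop,
 * mirroring how async_run_on_cpu() differs from the stack-allocated
 * items used by run_on_cpu(). */
typedef struct WorkItem {
    void (*func)(void *data);
    void *data;
    bool free;                 /* true => flush loop owns and frees it */
    struct WorkItem *next;
} WorkItem;

typedef struct {
    WorkItem *first;
    WorkItem *last;
} WorkQueue;

static void queue_async(WorkQueue *q, void (*func)(void *), void *data)
{
    WorkItem *wi = calloc(1, sizeof(*wi));
    wi->func = func;
    wi->data = data;
    wi->free = true;
    if (q->first == NULL) {
        q->first = wi;
    } else {
        q->last->next = wi;
    }
    q->last = wi;
}

static void flush_queue(WorkQueue *q)
{
    WorkItem *wi;
    while ((wi = q->first) != NULL) {
        q->first = wi->next;
        wi->func(wi->data);
        if (wi->free) {        /* only async items are heap-owned */
            free(wi);
        }
    }
    q->last = NULL;
}

/* demo callback for the usage example */
static int demo_sum;
static void demo_add(void *data)
{
    demo_sum += *(int *)data;
}
```

The `free` flag is what lets one flush loop serve both call styles: a blocking caller keeps its item on its own stack, while a fire-and-forget caller hands ownership to the consumer.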




[Qemu-devel] [RFC PATCH v6 3/3] Force auto-convergence of live migration

2013-06-14 Thread Chegu Vinod
If a user chooses to turn on the auto-converge migration capability,
these changes detect the lack of convergence and throttle down the
guest, i.e. force the VCPUs out of the guest for some duration
and let the migration thread catch up and help converge.

Verified the convergence using the following:
 - Java Warehouse workload running on a 20VCPU/256G guest (~80% busy)
 - OLTP-like workload running on an 80VCPU/512G guest (~80% busy)

Sample results with Java warehouse workload: (migrate speed set to 20Gb and
migrate downtime set to 4 seconds).

 (qemu) info migrate
 capabilities: xbzrle: off auto-converge: off  
 Migration status: active
 total time: 1487503 milliseconds
 expected downtime: 519 milliseconds
 transferred ram: 383749347 kbytes
 remaining ram: 2753372 kbytes
 total ram: 268444224 kbytes
 duplicate: 65461532 pages
 skipped: 64901568 pages
 normal: 95750218 pages
 normal bytes: 383000872 kbytes
 dirty pages rate: 67551 pages

 ---

 (qemu) info migrate
 capabilities: xbzrle: off auto-converge: on   
 Migration status: completed
 total time: 241161 milliseconds
 downtime: 6373 milliseconds
 transferred ram: 28235307 kbytes
 remaining ram: 0 kbytes
 total ram: 268444224 kbytes
 duplicate: 64946416 pages
 skipped: 64903523 pages
 normal: 7044971 pages
 normal bytes: 28179884 kbytes

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 arch_init.c |   85 +++
 1 files changed, 85 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 5d32ecf..69c6c8c 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,8 @@ int graphic_depth = 15;
 #endif
 
 const uint32_t arch_type = QEMU_ARCH;
+static bool mig_throttle_on;
+static void throttle_down_guest_to_converge(void);
 
 /***/
 /* ram save/restore */
@@ -378,8 +380,15 @@ static void migration_bitmap_sync(void)
 uint64_t num_dirty_pages_init = migration_dirty_pages;
 MigrationState *s = migrate_get_current();
 static int64_t start_time;
+static int64_t bytes_xfer_prev;
 static int64_t num_dirty_pages_period;
 int64_t end_time;
+int64_t bytes_xfer_now;
+static int dirty_rate_high_cnt;
+
+    if (!bytes_xfer_prev) {
+        bytes_xfer_prev = ram_bytes_transferred();
+    }
 
 if (!start_time) {
 start_time = qemu_get_clock_ms(rt_clock);
@@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
 
     /* more than 1 second = 1000 millisecons */
     if (end_time > start_time + 1000) {
+        if (migrate_auto_converge()) {
+            /* The following detection logic can be refined later. For now:
+               Check to see if the dirtied bytes is 50% more than the approx.
+               amount of bytes that just got transferred since the last time we
+               were in this routine. If that happens N times (for now N==4)
+               we turn on the throttle down logic */
+            bytes_xfer_now = ram_bytes_transferred();
+            if (s->dirty_pages_rate &&
+                ((num_dirty_pages_period * TARGET_PAGE_SIZE) >
+                 ((bytes_xfer_now - bytes_xfer_prev) / 2))) {
+                if (dirty_rate_high_cnt++ > 4) {
+                    DPRINTF("Unable to converge. Throttling down guest\n");
+                    mig_throttle_on = true;
+                }
+            }
+            bytes_xfer_prev = bytes_xfer_now;
+        }
         s->dirty_pages_rate = num_dirty_pages_period * 1000
             / (end_time - start_time);
         s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -628,6 +654,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
         }
         total_sent += bytes_sent;
         acct_info.iterations++;
+        throttle_down_guest_to_converge();
         /* we want to check in the 1st loop, just in case it was the 1st time
            and we had to sync the dirty bitmap.
            qemu_get_clock_ns() is a bit expensive, so we only check each some
@@ -1098,3 +1125,61 @@ TargetInfo *qmp_query_target(Error **errp)
 
 return info;
 }
+
+static bool throttling_needed(void)
+{
+    if (!migrate_auto_converge()) {
+        return false;
+    }
+    return mig_throttle_on;
+}
+
+/* Stub function that gets run on the vcpu when it's brought out of the
+   VM to run inside qemu via async_run_on_cpu() */
+static void mig_sleep_cpu(void *opq)
+{
+    qemu_mutex_unlock_iothread();
+    g_usleep(30*1000);
+    qemu_mutex_lock_iothread();
+}
+
+/* To reduce the dirty rate explicitly disallow the VCPUs from spending
+   much time in the VM. The migration thread will try to catch up.
+   Workload will experience a performance drop.
+*/
+static void mig_throttle_cpu_down(CPUState *cpu, void *data)
+{
+    async_run_on_cpu(cpu, mig_sleep_cpu, NULL);
+}
+
+static void mig_throttle_guest_down(void)
+{
+    if (throttling_needed()) {
+        qemu_mutex_lock_iothread();
+        qemu_for_each_cpu(mig_throttle_cpu_down, NULL
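The detection logic in this patch trips the throttle once the bytes dirtied in a bitmap-sync period exceed half the bytes transferred in that period for more than N periods. Below is a minimal standalone model of that heuristic; `detect_non_convergence` and its parameters are illustrative, not QEMU API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TARGET_PAGE_SIZE 4096

/* Standalone model of the convergence-detection heuristic: flag
 * non-convergence once bytes dirtied in a sync period exceed half the
 * bytes transferred in that period for more than `threshold` periods
 * (the patch never resets the counter). */
static bool detect_non_convergence(uint64_t dirty_pages_period,
                                   uint64_t bytes_xfer_delta,
                                   int *high_cnt, int threshold)
{
    if (dirty_pages_period * TARGET_PAGE_SIZE > bytes_xfer_delta / 2) {
        /* post-increment matches dirty_rate_high_cnt++ > N in the patch */
        if ((*high_cnt)++ > threshold) {
            return true;   /* turn throttling on */
        }
    }
    return false;
}
```

With threshold 4 the throttle engages on the sixth consecutive period in which dirtying outpaces transfer, which is the behavior the post-incremented counter in the patch produces.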

[Qemu-devel] [RFC PATCH v6 0/3] Throttle-down guest to help with live migration convergence

2013-06-14 Thread Chegu Vinod
Busy enterprise workloads hosted on large-sized VMs tend to dirty
memory faster than the transfer rate achieved via live guest migration.
Despite some good recent improvements (using dedicated 10Gig NICs
between hosts) the live migration does NOT converge.

If a user chooses to force convergence of their migration via the new
migration capability auto-converge, then this change will auto-detect
the lack-of-convergence scenario and trigger a slowdown of the workload
by explicitly disallowing the VCPUs from spending much time in the VM
context.

The migration thread tries to catch up and this eventually leads
to convergence in some deterministic amount of time. Yes, it does
impact the performance of all the VCPUs, but in our observation that
lasts only for a short duration; the guest ends up entering
stage 3 (the downtime phase) soon after that. No external monitoring/triggers
are required.

Thanks to Juan and Paolo for their useful suggestions.

---
Changes from v5:
- incorporated feedback from Paolo & Igor.
- rebased to latest qemu.git

Changes from v4:
- incorporated feedback from Paolo.
- split into 3 patches.

Changes from v3:
- incorporated feedback from Paolo and Eric
- rebased to latest qemu.git

Changes from v2:
- incorporated feedback from Orit, Juan and Eric
- stop the throttling thread at the start of stage 3
- rebased to latest qemu.git

Changes from v1:
- rebased to latest qemu.git
- added auto-converge capability (default off) - suggested by Anthony Liguori &
Eric Blake.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---

Chegu Vinod (3):
  Introduce async_run_on_cpu()
  Add 'auto-converge' migration capability
  Force auto-convergence of live migration

 arch_init.c   |   85 +
 cpus.c|   29 ++
 include/migration/migration.h |2 +
 include/qemu-common.h |1 +
 include/qom/cpu.h |   10 +
 migration.c   |9 
 qapi-schema.json  |5 ++-
 7 files changed, 140 insertions(+), 1 deletions(-)




[Qemu-devel] [RFC PATCH v6 2/3] Add 'auto-converge' migration capability

2013-06-14 Thread Chegu Vinod
The auto-converge migration capability allows the user to specify whether they
want the live migration sequence to automatically detect and force convergence.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
Reviewed-by: Paolo Bonzini pbonz...@redhat.com
Reviewed-by: Eric Blake ebl...@redhat.com
---
 include/migration/migration.h |2 ++
 migration.c   |9 +
 qapi-schema.json  |5 -
 3 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index e2acec6..ace91b0 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -127,4 +127,6 @@ int migrate_use_xbzrle(void);
 int64_t migrate_xbzrle_cache_size(void);
 
 int64_t xbzrle_cache_resize(int64_t new_size);
+
+bool migrate_auto_converge(void);
 #endif
diff --git a/migration.c b/migration.c
index 058f9e6..d0759c1 100644
--- a/migration.c
+++ b/migration.c
@@ -473,6 +473,15 @@ void qmp_migrate_set_downtime(double value, Error **errp)
 max_downtime = (uint64_t)value;
 }
 
+bool migrate_auto_converge(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
+}
+
 int migrate_use_xbzrle(void)
 {
 MigrationState *s;
diff --git a/qapi-schema.json b/qapi-schema.json
index 5ad6894..882a7fd 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -605,10 +605,13 @@
 #  This feature allows us to minimize migration traffic for certain work
 #  loads, by sending compressed difference of the pages
 #
+# @auto-converge: If enabled, QEMU will automatically throttle down the guest
+#  to speed up convergence of RAM migration. (since 1.6)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle'] }
+  'data': ['xbzrle', 'auto-converge'] }
 
 ##
 # @MigrationCapabilityStatus
-- 
1.7.1




Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support

2013-06-06 Thread Chegu Vinod

On 6/1/2013 9:09 PM, Michael R. Hines wrote:

All,

I have successfully performed over 1000+ back-to-back RDMA migrations 
automatically looped *in a row* using a heavy-weight memory-stress 
benchmark here at IBM.
Migration success is verified by capturing the actual serial console 
output of the virtual machine while the benchmark is running and 
redirecting each migration output to a file to verify that the output 
matches the expected output of a successful migration. For half of the 
1000 migrations, I used a 14GB virtual machine size (largest VM I can 
create) and the remaining 500 migrations I used a 2GB virtual machine 
(to make sure I was testing both 32-bit and 64-bit address 
boundaries). The benchmark is configured to have 75% stores and 25% 
loads and is configured to use 80% of the allocatable free memory of 
the VM (i.e. no swapping allowed).


I have defined a successful migration per the output file as follows:

1. The memory benchmark is still running and active (CPU near 100% and 
memory usage is high)
2. There are no kernel panics in the console output (regex keywords 
panic, BUG, oom, etc...)

3. The VM is still responding to network activity (pings)
4. The console is still responsive by printing periodic messages 
throughout the life of the VM to the console from inside the VM using 
the 'write' command in infinite loop.


With this method in a loop, I believe I've ironed out all the 
regression-testing bugs that I can find. You all may find the 
following bugs interesting. The original version of this patch was 
written in 2010 (Before my time @ IBM).


Bug #1: In the original 2010 patch, each write operation uses the same 
identifier. (A Work Request ID in infiniband terminology).
This is not typical (but allowed by the hardware) - and instead each 
operation should have its own unique identifier so that the write 
operation can be tracked properly as it completes.


Bug #2: Also in the original 2010 patch, write operations were grouped 
into separate signaled and unsignaled work requests, which is also 
not typical (but allowed by the hardware). Signalling is infiniband 
terminology which means to activate/deactivate notifying the sender 
whether or not the RDMA operation has already completed. (Note: the 
receiver is never notified - which is what a DMA is supposed to be). 
In normal operation per infiniband specifications, unsignaled 
operations (which indicate to the hardware *not* to notify the sender 
of completion) are *supposed* to be paired simultaneously with a 
signaled operation using the *same* work request identifier. Instead, 
the original patch was using *different* work requests for 
signaled/unsignaled writes, which means that most of the writes would 
be transmitted without ever being tracked for completion whatsoever. 
(Per infiniband specifications, signaled and unsignaled writes must be 
grouped together because the hardware ensures that completion 
notification is not given until *all* of the writes of the same 
request have actually completed).


Bug #3: Finally, in the original 2010 patch, ordering was not being 
handled. Per infiniband specifications, writes can happen completely 
out of order. Not only that, but PCI-express itself can change the 
order of the writes as well. It was only after the first 2 bugs 
were fixed that I could actually manifest this bug *in code*: What was 
happening was that a very large group of requests would burst from 
the QEMU migration thread. At which point, not all of the requests 
would finish. Then a short time later, the next iteration would start 
and the virtual machine's writable working set was still hovering 
somewhere in the same vicinity of the address space as the previous 
burst of writes that had not yet completed. When this happens, the new 
writes were much smaller (not a part of a larger chunk per our 
algorithms). Since the new writes were smaller they would complete 
faster than the larger, older writes in the same address range. Since 
they complete out of order, the newer writes would then get clobbered 
by the older writes - resulting in an inconsistent virtual machine. 
So, to solve this: during each new write, we now do a search to see 
if the address of the next requested write matches or overlaps with 
the address range of any of the previous outstanding writes that 
were still in transit, and I found several hits. This was easily 
solved by blocking until the conflicting write has completed before 
proceeding to issue a new write to the hardware.
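The conflict check described above — scan the in-flight writes for an address-range overlap before posting a new one — can be sketched as follows. This is a hypothetical standalone model, not the actual migration-rdma.c code; `InFlightWrite` and `find_conflict` are illustrative names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the ordering fix for Bug #3: before posting a new RDMA
 * write, scan the outstanding writes and report which one (if any)
 * overlaps the new address range.  A real implementation would block
 * on completion of that conflicting work request before posting. */
typedef struct {
    uint64_t addr;
    uint64_t len;
} InFlightWrite;

/* half-open ranges [a, a+alen) and [b, b+blen) intersect? */
static bool ranges_overlap(uint64_t a, uint64_t alen,
                           uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

/* return index of the first conflicting in-flight write, or -1 */
static int find_conflict(const InFlightWrite *pending, int n,
                         uint64_t addr, uint64_t len)
{
    for (int i = 0; i < n; i++) {
        if (ranges_overlap(pending[i].addr, pending[i].len, addr, len)) {
            return i;  /* must wait for this write to complete first */
        }
    }
    return -1;
}
```

Blocking whenever `find_conflict` returns a hit is what prevents a small, fast, newer write from being clobbered by a larger, slower, older write to the same region.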


- Michael



Hi Michael,

Got some limited time on the systems so gave your latest bits a quick 
try today (with the default no pinning) and it seems to be better than 
before.


Ran a Java warehouse workload where the guest was 85-90% busy...

For both cases
(qemu) migrate_set_speed 40G
(qemu) migrate_set_downtime 2
(qemu) migrate -d x-rdma:ip:port

...

20VCPU/256G guest

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration 

[Qemu-devel] OS crash while attempting to boot a 1TB guest.

2013-06-03 Thread Chegu Vinod


Hello,

For guest sizes >= 1TB RAM the guest OS is unable to boot up (please 
see attached GIF file for the Oops message). Wonder if this is a 
bug/regression in qemu/seabios, or does one have to enable/disable 
something else in the qemu command line (please see below)?


Thanks
Vinod


Host and Guest OS: 3.10-rc2 (from kvm.git) and qemu 1.5.5 (from qemu.git 
as of May 28th)


The qemu command line :


/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
-cpu host \
-name vm1 \
-m 1048576 -smp 80,sockets=80,cores=1,threads=1 \
-mem-path /dev/hugepages \
-no-hpet \
-chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc 
-no-shutdown \
-drive 
file=/var/lib/libvirt/images/vm1/vm1.img,if=none,id=drive-virtio-disk0,format=raw,cache=none 
\
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 
\

-monitor stdio \
-net nic,model=virtio,macaddr=...,netdev=nic-0 \
-netdev tap,id=nic-0,ifname=tap0,script=no,downscript=no,vhost=on \
-vnc :4


attachment: guest_panic2.GIF

Re: [Qemu-devel] [PATCH 3/2] vfio: Provide module option to disable vfio_iommu_type1 hugepage support

2013-05-30 Thread Chegu Vinod

On 5/28/2013 9:27 AM, Alex Williamson wrote:

Add a module option to vfio_iommu_type1 to disable IOMMU hugepage
support.  This causes iommu_map to only be called with single page
mappings, disabling the IOMMU driver's ability to use hugepages.
This option can be enabled by loading vfio_iommu_type1 with
disable_hugepages=1 or dynamically through sysfs.  If enabled
dynamically, only new mappings are restricted.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

As suggested by Konrad.  This is cleaner to add as a follow-on

  drivers/vfio/vfio_iommu_type1.c |   11 +++
  1 file changed, 11 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 6654a7e..8a2be4e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -48,6 +48,12 @@ module_param_named(allow_unsafe_interrupts,
  MODULE_PARM_DESC(allow_unsafe_interrupts,
		"Enable VFIO IOMMU support for on platforms without interrupt 
remapping support.");
  
+static bool disable_hugepages;
+module_param_named(disable_hugepages,
+		   disable_hugepages, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(disable_hugepages,
+		 "Disable VFIO IOMMU support for IOMMU hugepages.");
+
  struct vfio_iommu {
struct iommu_domain *domain;
 	struct mutex		lock;
@@ -270,6 +276,11 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
return -ENOMEM;
}
  
+	if (unlikely(disable_hugepages)) {
+		vfio_lock_acct(1);
+		return 1;
+	}
+
/* Lock all the consecutive pages from pfn_base */
 	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
unsigned long pfn = 0;

.



Tested-by: Chegu Vinod chegu_vi...@hp.com

I was able to verify your changes on a 2-socket Sandybridge-EP platform 
and observed about a 7-8% improvement in netperf TCP_RR 
performance.  The guest size was small (16vcpu/32GB).


Hopefully these changes also have an indirect benefit of avoiding soft 
lockups on the host side when larger guests (>256GB) are rebooted. 
Someone who has ready access to a larger Sandybridge-EP/EX platform can 
verify this.


FYI
Vinod




Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration

2013-05-10 Thread Chegu Vinod

On 5/10/2013 6:07 AM, Anthony Liguori wrote:

Chegu Vinod chegu_vi...@hp.com writes:


  If a user chooses to turn on the auto-converge migration capability
  these changes detect the lack of convergence and throttle down the
  guest. i.e. force the VCPUs out of the guest for some duration
  and let the migration thread catch up and help converge.

  Verified the convergence using the following :
  - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
  - OLTP like workload running on a 80VCPU/512G guest (~80% busy)

  Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
  migrate downtime set to 4 seconds).

Would it make sense to separate out the slow the VCPU down part of
this?

That would give a management tool more flexibility to create policies
around slowing the VCPU down to encourage migration.


I believe one can always enhance libvirt tools to monitor the migration 
statistics and control the shares/entitlements of the vcpus via 
cgroups, thereby slowing the guest down to allow for convergence (I had 
that listed in my earlier versions of the patches as an option and also 
noted that it requires external (i.e. tool-driven) monitoring and 
triggers... and that this alternative was kind of automatic after the 
initial setting of the capability).


Is that what you meant by your comment above (or) are you talking about 
something outside the scope of cgroups and from an implementation point 
of view also outside the migration code path...i.e. a new knob that an 
external tool can use to just throttle down the vcpus of a guest ?


Thanks
Vinod





In fact, I wonder if we need anything in the migration path if we just
expose the "slow the VCPU down" bit as a feature.

"Slow the VCPU down" is not quite the same as setting the priority of the
VCPU thread, largely because of the QBL, so I recognize the need to have
something for this in QEMU.

Regards,

Anthony Liguori


  (qemu) info migrate
  capabilities: xbzrle: off auto-converge: off  
  Migration status: active
  total time: 1487503 milliseconds
  expected downtime: 519 milliseconds
  transferred ram: 383749347 kbytes
  remaining ram: 2753372 kbytes
  total ram: 268444224 kbytes
  duplicate: 65461532 pages
  skipped: 64901568 pages
  normal: 95750218 pages
  normal bytes: 383000872 kbytes
  dirty pages rate: 67551 pages

  ---
  
  (qemu) info migrate

  capabilities: xbzrle: off auto-converge: on   
  Migration status: completed
  total time: 241161 milliseconds
  downtime: 6373 milliseconds
  transferred ram: 28235307 kbytes
  remaining ram: 0 kbytes
  total ram: 268444224 kbytes
  duplicate: 64946416 pages
  skipped: 64903523 pages
  normal: 7044971 pages
  normal bytes: 28179884 kbytes

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
  arch_init.c   |   68 +
  include/migration/migration.h |4 ++
  migration.c   |1 +
  3 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 49c5dc2..29788d6 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -49,6 +49,7 @@
  #include trace.h
  #include exec/cpu-all.h
  #include hw/acpi/acpi.h
+#include sysemu/cpus.h
  
  #ifdef DEBUG_ARCH_INIT

  #define DPRINTF(fmt, ...) \
@@ -104,6 +105,8 @@ int graphic_depth = 15;
  #endif
  
  const uint32_t arch_type = QEMU_ARCH;

+static bool mig_throttle_on;
+
  
  /***/

  /* ram save/restore */
@@ -378,8 +381,15 @@ static void migration_bitmap_sync(void)
  uint64_t num_dirty_pages_init = migration_dirty_pages;
  MigrationState *s = migrate_get_current();
  static int64_t start_time;
+static int64_t bytes_xfer_prev;
  static int64_t num_dirty_pages_period;
  int64_t end_time;
+int64_t bytes_xfer_now;
+static int dirty_rate_high_cnt;
+
+if (!bytes_xfer_prev) {
+bytes_xfer_prev = ram_bytes_transferred();
+}
  
  if (!start_time) {

  start_time = qemu_get_clock_ms(rt_clock);
@@ -404,6 +414,23 @@ static void migration_bitmap_sync(void)
  
  /* more than 1 second = 1000 millisecons */

   if (end_time > start_time + 1000) {
+        if (migrate_auto_converge()) {
+            /* The following detection logic can be refined later. For now:
+               Check to see if the dirtied bytes is 50% more than the approx.
+               amount of bytes that just got transferred since the last time we
+               were in this routine. If that happens N times (for now N==5)
+               we turn on the throttle down logic */
+            bytes_xfer_now = ram_bytes_transferred();
+            if (s->dirty_pages_rate &&
+                ((num_dirty_pages_period * TARGET_PAGE_SIZE) >
+                 ((bytes_xfer_now - bytes_xfer_prev) / 2))) {
+                if (dirty_rate_high_cnt++ > 5) {
+                    DPRINTF("Unable to converge. Throttling down guest\n");
+                    mig_throttle_on = true

[Qemu-devel] [RFC PATCH v5 2/3] Add 'auto-converge' migration capability

2013-05-09 Thread Chegu Vinod
 The auto-converge migration capability allows the user to specify whether they
 want the live migration sequence to automatically detect and force convergence.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 include/migration/migration.h |2 ++
 migration.c   |9 +
 qapi-schema.json  |5 -
 3 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index e2acec6..ace91b0 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -127,4 +127,6 @@ int migrate_use_xbzrle(void);
 int64_t migrate_xbzrle_cache_size(void);
 
 int64_t xbzrle_cache_resize(int64_t new_size);
+
+bool migrate_auto_converge(void);
 #endif
diff --git a/migration.c b/migration.c
index 3eb0fad..570cee5 100644
--- a/migration.c
+++ b/migration.c
@@ -474,6 +474,15 @@ void qmp_migrate_set_downtime(double value, Error **errp)
 max_downtime = (uint64_t)value;
 }
 
+bool migrate_auto_converge(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
+}
+
 int migrate_use_xbzrle(void)
 {
 MigrationState *s;
diff --git a/qapi-schema.json b/qapi-schema.json
index 199744a..b33839c 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -602,10 +602,13 @@
 #  This feature allows us to minimize migration traffic for certain work
 #  loads, by sending compressed difference of the pages
 #
+# @auto-converge: Migration supports automatic throttling down of guest
+#  to force convergence. (since 1.6)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle'] }
+  'data': ['xbzrle', 'auto-converge'] }
 
 ##
 # @MigrationCapabilityStatus
-- 
1.7.1




[Qemu-devel] [RFC PATCH v5 0/3] Throttle-down guest to help with live migration convergence.

2013-05-09 Thread Chegu Vinod
Busy enterprise workloads hosted on large-sized VMs tend to dirty
memory faster than the transfer rate achieved via live guest migration.
Despite some good recent improvements (using dedicated 10Gig NICs
between hosts) the live migration does NOT converge.

If a user chooses to force convergence of their migration via the new
migration capability auto-converge, then this change will auto-detect
the lack-of-convergence scenario and trigger a slowdown of the workload
by explicitly disallowing the VCPUs from spending much time in the VM
context.

The migration thread tries to catch up and this eventually leads
to convergence in some deterministic amount of time. Yes, it does
impact the performance of all the VCPUs, but in my observation that
lasts only for a short duration; the guest ends up entering
stage 3 (the downtime phase) soon after that. No external trigger is
required.

Thanks to Juan and Paolo for their useful suggestions.

---

Changes from v4:
- incorporated feedback from Paolo.
- split into 3 patches.

Changes from v3:
- incorporated feedback from Paolo and Eric
- rebased to latest qemu.git

Changes from v2:
- incorporated feedback from Orit, Juan and Eric
- stop the throttling thread at the start of stage 3
- rebased to latest qemu.git

Changes from v1:
- rebased to latest qemu.git
- added auto-converge capability (default off) - suggested by Anthony Liguori &
Eric Blake.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com

Chegu Vinod (3):
 Introduce async_run_on_cpu()
 Add 'auto-converge' migration capability
  Force auto-convergence of live migration

 arch_init.c   |   68 +
 cpus.c|   29 +
 include/migration/migration.h |6 +++
 include/qemu-common.h |1 +
 include/qom/cpu.h |   10 ++
 migration.c   |   10 ++
 qapi-schema.json  |5 ++-
 7 files changed, 128 insertions(+), 1 deletions(-)




[Qemu-devel] [RFC PATCH v5 1/3] Introduce async_run_on_cpu()

2013-05-09 Thread Chegu Vinod
 Introduce an asynchronous version of run_on_cpu() i.e. the caller
 doesn't have to block till the call back routine finishes execution
 on the target vcpu.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 cpus.c|   29 +
 include/qemu-common.h |1 +
 include/qom/cpu.h |   10 ++
 3 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/cpus.c b/cpus.c
index c232265..8cd4eab 100644
--- a/cpus.c
+++ b/cpus.c
@@ -653,6 +653,7 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
 
     wi.func = func;
     wi.data = data;
+    wi.free = false;
     if (cpu->queued_work_first == NULL) {
         cpu->queued_work_first = &wi;
     } else {
@@ -671,6 +672,31 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
     }
 }
 
+void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
+{
+    struct qemu_work_item *wi;
+
+    if (qemu_cpu_is_self(cpu)) {
+        func(data);
+        return;
+    }
+
+    wi = g_malloc0(sizeof(struct qemu_work_item));
+    wi->func = func;
+    wi->data = data;
+    wi->free = true;
+    if (cpu->queued_work_first == NULL) {
+        cpu->queued_work_first = wi;
+    } else {
+        cpu->queued_work_last->next = wi;
+    }
+    cpu->queued_work_last = wi;
+    wi->next = NULL;
+    wi->done = false;
+
+    qemu_cpu_kick(cpu);
+}
+
 static void flush_queued_work(CPUState *cpu)
 {
 struct qemu_work_item *wi;
@@ -683,6 +709,9 @@ static void flush_queued_work(CPUState *cpu)
         cpu->queued_work_first = wi->next;
         wi->func(wi->data);
         wi->done = true;
+        if (wi->free) {
+            g_free(wi);
+        }
     }
     cpu->queued_work_last = NULL;
     qemu_cond_broadcast(qemu_work_cond);
diff --git a/include/qemu-common.h b/include/qemu-common.h
index b399d85..bad6e1f 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -286,6 +286,7 @@ struct qemu_work_item {
     void (*func)(void *data);
     void *data;
     int done;
+    bool free;
 };
 
 #ifdef CONFIG_USER_ONLY
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 7cd9442..46465e9 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -265,6 +265,16 @@ bool cpu_is_stopped(CPUState *cpu);
 void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
 
 /**
+ * async_run_on_cpu:
+ * @cpu: The vCPU to run on.
+ * @func: The function to be executed.
+ * @data: Data to pass to the function.
+ *
+ * Schedules the function @func for execution on the vCPU @cpu asynchronously.
+ */
+void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
+
+/**
  * qemu_for_each_cpu:
  * @func: The function to be executed.
  * @data: Data to pass to the function.
-- 
1.7.1




[Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration

2013-05-09 Thread Chegu Vinod
 If a user chooses to turn on the auto-converge migration capability,
 these changes detect the lack of convergence and throttle down the
 guest, i.e. force the VCPUs out of the guest for some duration
 and let the migration thread catch up and help converge.

 Verified the convergence using the following:
 - SpecJbb2005 workload running on a 20VCPU/256G guest (~80% busy)
 - OLTP-like workload running on an 80VCPU/512G guest (~80% busy)

 Sample results with SpecJbb2005 workload: (migrate speed set to 20Gb and
 migrate downtime set to 4 seconds).

 (qemu) info migrate
 capabilities: xbzrle: off auto-converge: off  
 Migration status: active
 total time: 1487503 milliseconds
 expected downtime: 519 milliseconds
 transferred ram: 383749347 kbytes
 remaining ram: 2753372 kbytes
 total ram: 268444224 kbytes
 duplicate: 65461532 pages
 skipped: 64901568 pages
 normal: 95750218 pages
 normal bytes: 383000872 kbytes
 dirty pages rate: 67551 pages

 ---
 
 (qemu) info migrate
 capabilities: xbzrle: off auto-converge: on   
 Migration status: completed
 total time: 241161 milliseconds
 downtime: 6373 milliseconds
 transferred ram: 28235307 kbytes
 remaining ram: 0 kbytes
 total ram: 268444224 kbytes
 duplicate: 64946416 pages
 skipped: 64903523 pages
 normal: 7044971 pages
 normal bytes: 28179884 kbytes

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 arch_init.c   |   68 +
 include/migration/migration.h |4 ++
 migration.c   |1 +
 3 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 49c5dc2..29788d6 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -49,6 +49,7 @@
 #include "trace.h"
 #include "exec/cpu-all.h"
 #include "hw/acpi/acpi.h"
+#include "sysemu/cpus.h"
 
 #ifdef DEBUG_ARCH_INIT
 #define DPRINTF(fmt, ...) \
@@ -104,6 +105,8 @@ int graphic_depth = 15;
 #endif
 
 const uint32_t arch_type = QEMU_ARCH;
+static bool mig_throttle_on;
+
 
 /***/
 /* ram save/restore */
@@ -378,8 +381,15 @@ static void migration_bitmap_sync(void)
 uint64_t num_dirty_pages_init = migration_dirty_pages;
 MigrationState *s = migrate_get_current();
 static int64_t start_time;
+static int64_t bytes_xfer_prev;
 static int64_t num_dirty_pages_period;
 int64_t end_time;
+int64_t bytes_xfer_now;
+static int dirty_rate_high_cnt;
+
+if (!bytes_xfer_prev) {
+bytes_xfer_prev = ram_bytes_transferred();
+}
 
 if (!start_time) {
 start_time = qemu_get_clock_ms(rt_clock);
@@ -404,6 +414,23 @@ static void migration_bitmap_sync(void)
 
 /* more than 1 second = 1000 milliseconds */
 if (end_time > start_time + 1000) {
+if (migrate_auto_converge()) {
+/* The following detection logic can be refined later. For now:
+   Check to see if the dirtied bytes is 50% more than the approx.
+   amount of bytes that just got transferred since the last time we
+   were in this routine. If that happens N times (for now N==5)
+   we turn on the throttle down logic */
+bytes_xfer_now = ram_bytes_transferred();
+if (s->dirty_pages_rate &&
+((num_dirty_pages_period*TARGET_PAGE_SIZE) >
+((bytes_xfer_now - bytes_xfer_prev)/2))) {
+if (dirty_rate_high_cnt++ > 5) {
+DPRINTF("Unable to converge. Throttling down guest\n");
+mig_throttle_on = true;
+}
+}
+bytes_xfer_prev = bytes_xfer_now;
+}
 s->dirty_pages_rate = num_dirty_pages_period * 1000
 / (end_time - start_time);
 s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -496,6 +523,15 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
 return bytes_sent;
 }
 
+bool throttling_needed(void)
+{
+if (!migrate_auto_converge()) {
+return false;
+}
+
+return mig_throttle_on;
+}
+
 static uint64_t bytes_transferred;
 
 static ram_addr_t ram_save_remaining(void)
@@ -1098,3 +1134,35 @@ TargetInfo *qmp_query_target(Error **errp)
 
 return info;
 }
+
+static void mig_delay_vcpu(void)
+{
+qemu_mutex_unlock_iothread();
+g_usleep(50*1000);
+qemu_mutex_lock_iothread();
+}
+
+/* Stub used for getting the vcpu out of VM and into qemu via
+   run_on_cpu()*/
+static void mig_kick_cpu(void *opq)
+{
+mig_delay_vcpu();
+return;
+}
+
+/* To reduce the dirty rate explicitly disallow the VCPUs from spending
+   much time in the VM. The migration thread will try to catch up.
+   Workload will experience a performance drop.
+*/
+void migration_throttle_down(void)
+{
+if (throttling_needed()) {
+CPUArchState *penv = first_cpu;
+while (penv) {
+qemu_mutex_lock_iothread();
+async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
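The detection hunk in this patch can be modeled as a self-contained function. All names below are invented for illustration; this is a sketch of the posted logic, not QEMU code. Note that the posted code fires when the bytes dirtied in a period exceed *half* the bytes transferred in it, even though the comment in earlier versions speaks of "50% more than" the transferred bytes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TARGET_PAGE_SIZE 4096 /* illustrative; target-dependent in QEMU */

/* Invented state bundle standing in for the static locals in
 * migration_bitmap_sync(). */
typedef struct {
    int64_t bytes_xfer_prev;   /* bytes transferred as of the last sync */
    int dirty_rate_high_cnt;   /* periods in which dirtying outpaced transfer */
    bool throttle_on;          /* models mig_throttle_on */
} ConvergenceState;

/* One bitmap-sync period: turn throttling on once the bytes dirtied in
 * the period exceed half of the bytes transferred in it, more than five
 * times (cumulatively, as in the patch as posted). */
static void detect_non_convergence(ConvergenceState *s,
                                   int64_t bytes_xfer_now,
                                   int64_t num_dirty_pages_period)
{
    int64_t dirtied = num_dirty_pages_period * TARGET_PAGE_SIZE;
    int64_t transferred = bytes_xfer_now - s->bytes_xfer_prev;

    if (dirtied > transferred / 2) {
        if (s->dirty_rate_high_cnt++ > 5) {
            s->throttle_on = true;
        }
    }
    s->bytes_xfer_prev = bytes_xfer_now;
}
```

With the `> 5` post-increment test, the flag is set on the seventh consecutive "dirty-rate high" period, which is part of why the magic number drew review comments.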

Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support

2013-05-09 Thread Chegu Vinod

On 5/9/2013 10:20 AM, Michael R. Hines wrote:

Comments inline. FYI: please CC mrhi...@us.ibm.com,
because it helps me know when to scroll through the bazillion qemu-devel 
emails.


I have things separated out into folders and rules, but a direct CC is 
better =)




Sure will do.



On 05/03/2013 07:28 PM, Chegu Vinod wrote:


Hi Michael,

I picked up the qemu bits from your github branch and gave it a 
try.   (BTW the setup I was given temporary access to has a pair of 
MLX IB QDR cards connected back to back via QSFP cables)


Observed a couple of things and wanted to share... perhaps you may be 
aware of them already or perhaps these are unrelated to your specific 
changes? (Note: Still haven't finished the review of your changes.)


a) x-rdma-pin-all off case

Seems to work sometimes but fails at other times. Here is an 
example...


(qemu) rdma: Accepting rdma connection...
rdma: Memory pin all: disabled
rdma: verbs context after listen: 0x56757d50
rdma: dest_connect Source GID: fe80::2:c903:9:53a5, Dest GID: 
fe80::2:c903:9:5855

rdma: Accepted migration
qemu-system-x86_64: VQ 1 size 0x100 Guest index 0x4d2 inconsistent 
with Host index 0x4ec: delta 0xffe6
qemu: warning: error while loading state for instance 0x0 of device 
'virtio-net'

load of migration failed



Can you give me more details about the configuration of your VM?


The guest is a 10-VCPU/128GB one... and nothing really that fancy with 
respect to storage or networking.


Hosted on a large Westmere-EX box (target is a similarly configured 
Westmere-EX system). There is a shared SAN disk between the two hosts.
Both hosts have 3.9-rc7 kernel that I got at that time from kvm.git 
tree. The guest was also running the same kernel.


Since I was just trying it out I was not running any workload either.

On the source host the qemu command line :


/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
-cpu host \
-name vm1 \
-m 131072 -smp 10,sockets=1,cores=10,threads=1 \
-mem-path /dev/hugepages \
-chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \
-drive 
file=/dev/libvirt_lvm3/vm1,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native 
\
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 
\

-monitor stdio \
-net nic,model=virtio,macaddr=52:54:00:71:01:01,netdev=nic-0 \
-netdev tap,id=nic-0,ifname=tap0,script=no,downscript=no,vhost=on \
-vnc :4


On the destination host the command line was same as the above with the 
following additional arg...


-incoming x-rdma:<static private ipaddr of the IB>:<port #>






b) x-rdma-pin-all on case :

The guest is not resuming on the target host. i.e. the source host's 
qemu states that migration is complete but the guest is not 
responsive anymore... (doesn't seem to have crashed but its stuck 
somewhere).Have you seen this behavior before ? Any tips on how I 
could extract additional info ?


Is the QEMU monitor still responsive?


They were responsive.

Can you capture a screenshot of the guest's console to see if there is 
a panic?


No panic on the guest's console :(


What kind of storage is attached to the VM?



Simple virtio disk hosted on a SAN disk (see the qemu command line).





Besides the list of noted restrictions/issues around having to pin 
all of guest memory... if the pinning is done as part of starting 
the migration it ends up taking a noticeably long time for larger 
guests. Wonder whether that should be counted as part of the total 
migration time?




That's a good question: The pin-all option should not be slowing down 
your VM too much as the VM should still be running before the 
migration_thread() actually kicks in and starts the migration.


Well I had hoped that it would not have any serious impacts but it ended 
up freezing the guest...




I need more information on the configuration of your VM, guest 
operating system, architecture and so forth...


Pl. see above.

And similarly as before, whether or not QEMU is responsive or 
whether it's the guest that's panicked...


Guest just freezes... doesn't panic when this pinning is in progress 
(i.e. after I set the capability and start the migration). After the 
pinning completes the guest continues to run and the migration 
continues... till it completes (as per the source host's qemu)... but I 
never see it resume on the target host.


Also the act of pinning all the memory seems to freeze the guest. 
e.g.: For larger enterprise sized guests (say 128GB and higher) the 
guest is frozen for anywhere from nearly a minute (~50 seconds) to 
multiple minutes as the guest size increases... which imo kind of 
defeats the purpose of live guest migration.


That's bad =) There must be a bug somewhere... the largest VM I 
can create on my hardware is ~16GB - so let me give that a try and try 
to track down the problem.


Ok. Perhaps running a simple test inside the guest can help observe any 
scheduling

Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration

2013-05-09 Thread Chegu Vinod

On 5/9/2013 1:05 PM, Igor Mammedov wrote:

On Thu,  9 May 2013 12:43:20 -0700
Chegu Vinod chegu_vi...@hp.com wrote:


  If a user chooses to turn on the auto-converge migration capability
  these changes detect the lack of convergence and throttle down the
  guest. i.e. force the VCPUs out of the guest for some duration
  and let the migration thread catchup and help converge.


[...]

+void migration_throttle_down(void)
+{
+if (throttling_needed()) {
+CPUArchState *penv = first_cpu;
+while (penv) {
+qemu_mutex_lock_iothread();
+async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
+qemu_mutex_unlock_iothread();
+penv = penv->next_cpu;

could you replace the open-coded loop with qemu_for_each_cpu()?


Yes will try to replace it in the next version.
Vinod



+}
+}
+}
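The loop being discussed fans one asynchronous "kick" out to every vCPU in a linked list. A hypothetical standalone model of that pattern (all names invented; `model_async_kick()` stands in for async_run_on_cpu(), and qemu_for_each_cpu() would replace the open-coded walk):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a CPUState list: each CPU records how many work
 * items ("kicks") have been queued for it. */
typedef struct ModelCPU {
    struct ModelCPU *next;
    int pending_kicks;   /* models the vCPU's async work queue depth */
} ModelCPU;

/* Model of async_run_on_cpu(): just note that work was queued; the real
 * call returns immediately and the vCPU runs the callback later. */
static void model_async_kick(ModelCPU *cpu)
{
    cpu->pending_kicks++;
}

/* Model of migration_throttle_down(): queue exactly one kick per CPU. */
static void model_throttle_down(ModelCPU *first_cpu)
{
    for (ModelCPU *penv = first_cpu; penv != NULL; penv = penv->next) {
        model_async_kick(penv);
    }
}
```

The model makes the locking question concrete: the walk itself only reads the list, so the BQL discussion below is about protecting the list and the work queues, not the kick bookkeeping.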





Re: [Qemu-devel] [RFC PATCH v5 3/3] Force auto-convergence of live migration

2013-05-09 Thread Chegu Vinod

On 5/9/2013 1:24 PM, Igor Mammedov wrote:

On Thu,  9 May 2013 12:43:20 -0700
Chegu Vinod chegu_vi...@hp.com wrote:


  If a user chooses to turn on the auto-converge migration capability
  these changes detect the lack of convergence and throttle down the
  guest. i.e. force the VCPUs out of the guest for some duration
  and let the migration thread catchup and help converge.


[...]

+
+static void mig_delay_vcpu(void)
+{
+qemu_mutex_unlock_iothread();
+g_usleep(50*1000);
+qemu_mutex_lock_iothread();
+}
+
+/* Stub used for getting the vcpu out of VM and into qemu via
+   run_on_cpu()*/
+static void mig_kick_cpu(void *opq)
+{
+mig_delay_vcpu();
+return;
+}
+
+/* To reduce the dirty rate explicitly disallow the VCPUs from spending
+   much time in the VM. The migration thread will try to catch up.
+   Workload will experience a performance drop.
+*/
+void migration_throttle_down(void)
+{
+if (throttling_needed()) {
+CPUArchState *penv = first_cpu;
+while (penv) {
+qemu_mutex_lock_iothread();

Locking it here and then unlocking it inside of the queued work doesn't look nice.

Yes...but see below.

What exactly are you protecting with this lock?
It was my understanding that BQL is supposed to be held when the vcpu 
threads start entering and executing in the qemu context (as qemu is not 
MP safe)... Still true?


In this specific use case I was concerned about the fraction of the time 
when a given vcpu thread is in the qemu context but not executing the 
callback routine... and was hence holding the BQL. Holding the BQL and 
g_usleep'ing is not only bad but would slow down the migration 
thread... hence the "doesn't look nice" stuff :(


For this specific use case, if it's not really required to even bother 
with the BQL then pl. do let me know.


Also pl. refer to version 3 of my patch... I was doing a g_usleep() in 
kvm_cpu_exec() and was not messing much with the BQL... but that was 
deemed as not a good thing either.


Thanks
Vinod




+async_run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
+qemu_mutex_unlock_iothread();
+penv = penv->next_cpu;
+}
+}
+}








[Qemu-devel] [RFC PATCH v4] Throttle-down guest when live migration does not converge.

2013-05-06 Thread Chegu Vinod
Busy enterprise workloads hosted on large sized VM's tend to dirty
memory faster than the transfer rate achieved via live guest migration.
Despite some good recent improvements (using dedicated 10Gig NICs
between hosts) the live migration does NOT converge.

If a user chooses to force convergence of their migration via a new
migration capability "auto-converge" then this change will auto-detect
lack of convergence scenario and trigger a slow down of the workload
by explicitly disallowing the VCPUs from spending much time in the VM
context.

The migration thread tries to catch up and this eventually leads
to convergence in some deterministic amount of time. Yes it does
impact the performance of all the VCPUs but in my observation that
lasts only for a short duration of time. i.e. we end up entering
stage 3 (downtime phase) soon after that. No external trigger is
required.
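The convergence claim above can be made concrete with some back-of-the-envelope arithmetic: pre-copy drains the remaining RAM only while the transfer rate exceeds the dirty rate, and throttling works precisely by pushing the dirty rate below the link speed. A hypothetical helper (my own sketch, not part of the patch):

```c
#include <assert.h>

/* Rough estimate of seconds until the remaining RAM drains, given a
 * sustained transfer rate and a sustained dirty rate (both bytes/sec).
 * Returns a negative value when the migration cannot converge because
 * memory is being dirtied at least as fast as it is sent. */
static double est_completion_secs(double remaining_bytes,
                                  double xfer_bytes_per_sec,
                                  double dirty_bytes_per_sec)
{
    if (xfer_bytes_per_sec <= dirty_bytes_per_sec) {
        return -1.0; /* the non-converging case auto-converge targets */
    }
    return remaining_bytes / (xfer_bytes_per_sec - dirty_bytes_per_sec);
}
```

In the non-throttled sample run below, a dirty-page rate of 67551 pages/sec (roughly 276 MB/sec at 4 KiB pages) against an effective transfer rate that the workload keeps re-dirtying is exactly the regime where this denominator goes non-positive and the total time grows without bound.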

Thanks to Juan and Paolo for their useful suggestions.

Verified the convergence using the following :
- SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
- OLTP like workload running on a 80VCPU/512G guest (~80% busy)

Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
migrate downtime set to 4seconds).

(qemu) info migrate
capabilities: xbzrle: off auto-converge: off  
Migration status: active
total time: 1487503 milliseconds
expected downtime: 519 milliseconds
transferred ram: 383749347 kbytes
remaining ram: 2753372 kbytes
total ram: 268444224 kbytes
duplicate: 65461532 pages
skipped: 64901568 pages
normal: 95750218 pages
normal bytes: 383000872 kbytes
dirty pages rate: 67551 pages

---

(qemu) info migrate
capabilities: xbzrle: off auto-converge: on   
Migration status: completed
total time: 241161 milliseconds
downtime: 6373 milliseconds
transferred ram: 28235307 kbytes
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64946416 pages
skipped: 64903523 pages
normal: 7044971 pages
normal bytes: 28179884 kbytes

---

Changes from v3:
- incorporated feedback from Paolo and Eric
- rebased to latest qemu.git

Changes from v2:
- incorporated feedback from Orit, Juan and Eric
- stop the throttling thread at the start of stage 3
- rebased to latest qemu.git

Changes from v1:
- rebased to latest qemu.git
- added auto-converge capability (default off) - suggested by Anthony Liguori & 
Eric Blake.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 arch_init.c   |   61 -
 cpus.c|   41 +++
 include/migration/migration.h |7 +
 include/qemu-common.h |1 +
 include/qemu/main-loop.h  |3 ++
 include/qom/cpu.h |   10 +++
 kvm-all.c |   46 +++
 migration.c   |   18 
 qapi-schema.json  |5 +++-
 9 files changed, 190 insertions(+), 2 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 49c5dc2..2f703cf 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,7 @@ int graphic_depth = 15;
 #endif
 
 const uint32_t arch_type = QEMU_ARCH;
+static bool mig_throttle_on;
 
 /***/
 /* ram save/restore */
@@ -379,7 +380,14 @@ static void migration_bitmap_sync(void)
 MigrationState *s = migrate_get_current();
 static int64_t start_time;
 static int64_t num_dirty_pages_period;
+static int64_t bytes_xfer_prev;
 int64_t end_time;
+int64_t bytes_xfer_now;
+static int dirty_rate_high_cnt;
+
+if (!bytes_xfer_prev) {
+bytes_xfer_prev = ram_bytes_transferred();
+}
 
 if (!start_time) {
 start_time = qemu_get_clock_ms(rt_clock);
@@ -404,6 +412,27 @@ static void migration_bitmap_sync(void)
 
 /* more than 1 second = 1000 milliseconds */
 if (end_time > start_time + 1000) {
+if (migrate_auto_converge()) {
+/* The following detection logic can be refined later. For now:
+   Check to see if the dirtied bytes is 50% more than the approx.
+   amount of bytes that just got transferred since the last time we
+   were in this routine. If that happens N times (for now N==5)
+   we turn on the throttle down logic */
+bytes_xfer_now = ram_bytes_transferred();
+if (s->dirty_pages_rate &&
+((num_dirty_pages_period*TARGET_PAGE_SIZE) >
+((bytes_xfer_now - bytes_xfer_prev)/2))) {
+if (dirty_rate_high_cnt++ > 5) {
+DPRINTF("Unable to converge. Throttling down guest\n");
+qemu_mutex_lock_mig_throttle();
+if (!mig_throttle_on) {
+mig_throttle_on = true;
+}
+qemu_mutex_unlock_mig_throttle();
+}
+}
+bytes_xfer_prev = bytes_xfer_now;

Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support

2013-05-03 Thread Chegu Vinod


Hi Michael,

I picked up the qemu bits from your github branch and gave it a try.   
(BTW the setup I was given temporary access to has a pair of MLX IB 
QDR cards connected back to back via QSFP cables)


Observed a couple of things and wanted to share... perhaps you may be 
aware of them already or perhaps these are unrelated to your specific 
changes? (Note: Still haven't finished the review of your changes.)


a) x-rdma-pin-all off case

Seems to work sometimes but fails at other times. Here is an example...

(qemu) rdma: Accepting rdma connection...
rdma: Memory pin all: disabled
rdma: verbs context after listen: 0x56757d50
rdma: dest_connect Source GID: fe80::2:c903:9:53a5, Dest GID: 
fe80::2:c903:9:5855

rdma: Accepted migration
qemu-system-x86_64: VQ 1 size 0x100 Guest index 0x4d2 inconsistent with 
Host index 0x4ec: delta 0xffe6
qemu: warning: error while loading state for instance 0x0 of device 
'virtio-net'

load of migration failed


b) x-rdma-pin-all on case :

The guest is not resuming on the target host. i.e. the source host's 
qemu states that migration is complete but the guest is not responsive 
anymore... (doesn't seem to have crashed but it's stuck somewhere).
Have you seen this behavior before? Any tips on how I could extract 
additional info?


Besides the list of noted restrictions/issues around having to pin all 
of guest memory... if the pinning is done as part of starting the 
migration it ends up taking a noticeably long time for larger guests. 
Wonder whether that should be counted as part of the total migration 
time?


Also the act of pinning all the memory seems to freeze the guest. e.g.: 
For larger enterprise sized guests (say 128GB and higher) the guest is 
frozen for anywhere from nearly a minute (~50 seconds) to multiple 
minutes as the guest size increases... which imo kind of defeats the 
purpose of live guest migration.


Would like to hear if you have already thought about any other 
alternatives to address this issue ? for e.g. would it be better to pin 
all of the guest's memory as part of starting the guest itself ? Yes 
there are restrictions when we do pinning...but it can help with 
performance.

---
BTW, a different (yet sort of related) topic... recently a patch went 
into upstream that provided an option to qemu to mlock all of guest 
memory :


https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg03947.html .

but when attempting to do the mlock for larger guests a lot of time is 
spent bringing each page into cache and clearing/zeroing it etc. etc.


https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg04161.html




Note: The basic tcp based live guest migration in the same qemu version 
still works fine on the same hosts over a pair of non-RDMA cards 10Gb 
NICs connected back-to-back.


Thanks
Vinod





[Qemu-devel] [PATCH v3] Throttle-down guest when live migration does not converge.

2013-05-01 Thread Chegu Vinod
Busy enterprise workloads hosted on large sized VM's tend to dirty
memory faster than the transfer rate achieved via live guest migration.
Despite some good recent improvements (using dedicated 10Gig NICs
between hosts) the live migration does NOT converge.

If a user chooses to force convergence of their migration via a new
migration capability "auto-converge" then this change will auto-detect
lack of convergence scenario and trigger a slow down of the workload
by explicitly disallowing the VCPUs from spending much time in the VM
context.

The migration thread tries to catch up and this eventually leads
to convergence in some deterministic amount of time. Yes it does
impact the performance of all the VCPUs but in my observation that
lasts only for a short duration of time. i.e. we end up entering
stage 3 (downtime phase) soon after that. No external trigger is
required.

Thanks to Juan and Paolo for their useful suggestions.

Verified the convergence using the following :
- SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
- OLTP like workload running on a 80VCPU/512G guest (~80% busy)

Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
migrate downtime set to 4seconds).

(qemu) info migrate
capabilities: xbzrle: off auto-converge: off  
Migration status: active
total time: 1487503 milliseconds
expected downtime: 519 milliseconds
transferred ram: 383749347 kbytes
remaining ram: 2753372 kbytes
total ram: 268444224 kbytes
duplicate: 65461532 pages
skipped: 64901568 pages
normal: 95750218 pages
normal bytes: 383000872 kbytes
dirty pages rate: 67551 pages

---

(qemu) info migrate
capabilities: xbzrle: off auto-converge: on   
Migration status: completed
total time: 241161 milliseconds
downtime: 6373 milliseconds
transferred ram: 28235307 kbytes
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64946416 pages
skipped: 64903523 pages
normal: 7044971 pages
normal bytes: 28179884 kbytes

---

Changes from v2:
- incorporated feedback from Orit, Juan and Eric
- stop the throttling thread at the start of stage 3
- rebased to latest qemu.git

Changes from v1:
- rebased to latest qemu.git
- added auto-converge capability (default off) - suggested by Anthony Liguori & 
Eric Blake.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 arch_init.c   |   61 +
 cpus.c|   12 
 include/migration/migration.h |7 +
 include/qemu/main-loop.h  |3 ++
 kvm-all.c |   46 +++
 migration.c   |   18 
 qapi-schema.json  |7 -
 7 files changed, 153 insertions(+), 1 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 49c5dc2..7e03b2c 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,7 @@ int graphic_depth = 15;
 #endif
 
 const uint32_t arch_type = QEMU_ARCH;
+static bool mig_throttle_on;
 
 /***/
 /* ram save/restore */
@@ -379,7 +380,14 @@ static void migration_bitmap_sync(void)
 MigrationState *s = migrate_get_current();
 static int64_t start_time;
 static int64_t num_dirty_pages_period;
+static int64_t bytes_xfer_prev;
 int64_t end_time;
+int64_t bytes_xfer_now;
+static int dirty_rate_high_cnt;
+
+if (!bytes_xfer_prev) {
+bytes_xfer_prev = ram_bytes_transferred();
+}
 
 if (!start_time) {
 start_time = qemu_get_clock_ms(rt_clock);
@@ -404,6 +412,27 @@ static void migration_bitmap_sync(void)
 
 /* more than 1 second = 1000 milliseconds */
 if (end_time > start_time + 1000) {
+if (migrate_auto_converge()) {
+/* The following detection logic can be refined later. For now:
+   Check to see if the dirtied bytes is 50% more than the approx.
+   amount of bytes that just got transferred since the last time we
+   were in this routine. If that happens N times (for now N==5)
+   we turn on the throttle down logic */
+bytes_xfer_now = ram_bytes_transferred();
+if (s->dirty_pages_rate &&
+((num_dirty_pages_period*TARGET_PAGE_SIZE) >
+((bytes_xfer_now - bytes_xfer_prev)/2))) {
+if (dirty_rate_high_cnt++ > 5) {
+DPRINTF("Unable to converge. Throttling down guest\n");
+qemu_mutex_lock_mig_throttle();
+if (!mig_throttle_on) {
+mig_throttle_on = true;
+}
+qemu_mutex_unlock_mig_throttle();
+}
+}
+bytes_xfer_prev = bytes_xfer_now;
+}
 s->dirty_pages_rate = num_dirty_pages_period * 1000
 / (end_time - start_time);
 s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -496,6 +525,33 @@ static

Re: [Qemu-devel] [PATCH v3] Throttle-down guest when live migration does not converge.

2013-05-01 Thread Chegu Vinod

On 5/1/2013 5:38 AM, Eric Blake wrote:

On 05/01/2013 06:22 AM, Chegu Vinod wrote:

Busy enterprise workloads hosted on large sized VM's tend to dirty
memory faster than the transfer rate achieved via live guest migration.
Despite some good recent improvements ( using dedicated 10Gig NICs
between hosts) the live migration does NOT converge.
---

Changes from v2:
- incorporated feedback from Orit, Juan and Eric
- stop the throttling thread at the start of stage 3
- rebased to latest qemu.git

+++ b/qapi-schema.json
@@ -600,9 +600,14 @@
  #  loads, by sending compressed difference of the pages
  #
  # Since: 1.2
+#
+# @auto-converge: Migration supports automatic throttling down of guest
+#  to force convergence. Disabled by default.
+#
+# Since: 1.6
  ##

I've already argued that ALL new migration capabilities should be
disabled by default (see the thread on 'x-rdma-pin-all', which will be a
merge conflict if it gets applied before your patch).  So I don't think
that last sentence adds anything, and can be dropped.

I think this works, although it's the first instance of having two
top-level Since: tags on a single JSON entity.  I was envisioning:

@xbzrle: yadda... pages

@auto-converge: Migration supports... convergence (since 1.6)

Since: 1.2

to match the conventions elsewhere that the overall JSON entity (the
enum MigrationCapability) exists since 1.2, but the addition of
auto-converge happened in 1.6.

However, as nothing parses the .json file to turn it into formal docs
(yet), I'm not going to insist on a respin if this is the only problem
with your patch.  I'm not comfortable enough with my skills in reviewing
the rest of the patch, or I'd offer a reviewed-by.


I shall make the suggested changes.
Appreciate your review feedback on this part of the change.

Thanks
Vinod
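Putting Eric's two suggestions together (dropping the "Disabled by default" sentence and moving the capability's version into an inline tag, with a single top-level Since for the enum), the documentation block in qapi-schema.json would end up roughly like this sketch; the @xbzrle text stands in for the existing, unchanged entry:

```
##
# @MigrationCapability
#
# Migration capabilities enumeration
#
# @xbzrle: (existing documentation, unchanged)
#
# @auto-converge: Migration supports automatic throttling down of the
#                 guest to force convergence (since 1.6)
#
# Since: 1.2
##
```

This matches the convention Eric describes: the enum itself carries the overall "Since: 1.2" tag, while each later addition is dated inline.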



Re: [Qemu-devel] [PATCH v3] Throttle-down guest when live migration does not converge.

2013-05-01 Thread Chegu Vinod

On 5/1/2013 8:40 AM, Paolo Bonzini wrote:

I shall make the suggested changes.
Appreciate your review feedback on this part of the change.

Hi Paolo.,

Thanks for taking a look (BTW, I accidentally left out the "RFC" in the 
patch subject line... my bad!).

Hi Vinod,

I think unfortunately it is not acceptable to make this patch work only
for KVM.  (It cannot work for Xen, but that's not a problem since Xen
uses a different migration mechanism; but it should work for TCG).


Ok. I hadn't yet looked at TCG aspects etc. Will follow up offline...



Unfortunately, as you noted the run_on_cpu callbacks currently run
under the big QEMU lock.  We need to fix that first.  We have time
for that during 1.6.


Ok.  Was under the impression that anytime a vcpu thread enters to do 
anything in qemu the BQL had to be held. So chose to go with 
run_on_cpu(). Will follow up offline on alternatives.


Holding the vcpus in the host context (i.e. kvm module) itself is 
perhaps another way. Would need some handshakes (i.e. new ioctls) with 
the kernel. Would that be an acceptable way to proceed?


Thanks
Vinod



Paolo
.






Re: [Qemu-devel] [RFC PATCH v2] Throttle-down guest when live migration does not converge.

2013-04-30 Thread Chegu Vinod

On 4/30/2013 8:20 AM, Juan Quintela wrote:

Chegu Vinod chegu_vi...@hp.com wrote:

Busy enterprise workloads hosted on large sized VM's tend to dirty
memory faster than the transfer rate achieved via live guest migration.
Despite some good recent improvements ( using dedicated 10Gig NICs
between hosts) the live migration does NOT converge.

A few options that were discussed/being-pursued to help with
the convergence issue include:

1) Slow down guest considerably via cgroup's CPU controls - requires
libvirt client support to detect  trigger action, but conceptually
similar to this RFC change.

2) Speed up transfer rate:
- RDMA based Pre-copy - lower overhead and fast (Unfortunately
  has a few restrictions and some customers still choose not
  to deploy RDMA :-( ).
- Add parallelism to improve transfer rate and use multiple 10Gig
  connections (bonded). - could add some overhead on the host.

3) Post-copy (preferably with RDMA) or a Pre+Post copy hybrid - Sounds
promising but need to consider  handle newer failure scenarios.

If an enterprise user chooses to force convergence of their migration
via the new capability "auto-converge" then with this change we auto-detect
lack of convergence scenario and trigger a slow down of the workload
by explicitly disallowing the VCPUs from spending much time in the VM
context.

The migration thread tries to catch up and this eventually leads
to convergence in some deterministic amount of time. Yes it does
impact the performance of all the VCPUs but in my observation that
lasts only for a short duration of time. i.e. we end up entering
stage 3 (downtime phase) soon after that.

No external trigger is required (unlike option 1) and it can co-exist
with enhancements being pursued as part of Option 2 (e.g. RDMA).

Thanks to Juan and Paolo for their useful suggestions.

Verified the convergence using the following :
- SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
- OLTP like workload running on a 80VCPU/512G guest (~80% busy)

Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
migrate downtime set to 4seconds).

(qemu) info migrate
capabilities: xbzrle: off auto-converge: off  
Migration status: active
total time: 1487503 milliseconds

148 seconds


1487 seconds and still the Migration is not completed.




expected downtime: 519 milliseconds
transferred ram: 383749347 kbytes
remaining ram: 2753372 kbytes
total ram: 268444224 kbytes
duplicate: 65461532 pages
skipped: 64901568 pages
normal: 95750218 pages
normal bytes: 383000872 kbytes
dirty pages rate: 67551 pages

---

(qemu) info migrate
capabilities: xbzrle: off auto-converge: on   
Migration status: completed
total time: 241161 milliseconds
downtime: 6373 milliseconds

6.3 seconds and finished,  not bad at all O:-)
That's the *downtime*... The total time for migration to complete is 
241 secs. (SpecJBB is one of those workloads that dirties memory quite 
a bit).



How much does the guest throughput drop while we enter auto-converge mode?


Workload performance drops for some short duration... but it soon 
switches to stage 3.





transferred ram: 28235307 kbytes
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64946416 pages
skipped: 64903523 pages
normal: 7044971 pages
normal bytes: 28179884 kbytes

Changes from v1:
- rebased to latest qemu.git
- added auto-converge capability (default off) - suggested by Anthony Liguori & 
Eric Blake.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
@@ -379,12 +380,20 @@ static void migration_bitmap_sync(void)
  MigrationState *s = migrate_get_current();
  static int64_t start_time;
  static int64_t num_dirty_pages_period;
+static int64_t bytes_xfer_prev;
  int64_t end_time;
+int64_t bytes_xfer_now;
+static int dirty_rate_high_cnt;
+
+if (migrate_auto_converge() && !bytes_xfer_prev) {

Just do the !bytes_xfer_prev test here?  migrate_auto_converge() is more 
expensive to call than just doing the assignment.


Sure



+
+if (value) {
+return true;
+}
+return false;

this code is just:

return value;


ok




diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
index 6f0200a..9a3886d 100644
--- a/include/qemu/main-loop.h
+++ b/include/qemu/main-loop.h
@@ -299,6 +299,9 @@ void qemu_mutex_lock_iothread(void);
   */
  void qemu_mutex_unlock_iothread(void);
  
+void qemu_mutex_lock_mig_throttle(void);

+void qemu_mutex_unlock_mig_throttle(void);
+
  /* internal interfaces */
  
  void qemu_fd_register(int fd);

diff --git a/kvm-all.c b/kvm-all.c
index 2d92721..a92cb77 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -33,6 +33,8 @@
  #include exec/memory.h
  #include exec/address-spaces.h
  #include qemu/event_notifier.h
+#include sysemu/cpus.h
+#include migration/migration.h
  
  /* This check must be after config-host.h is included */

  #ifdef CONFIG_EVENTFD
@@ -116,6 +118,8 @@ static const

Re: [Qemu-devel] [RFC PATCH v2] Throttle-down guest when live migration does not converge.

2013-04-30 Thread Chegu Vinod

On 4/30/2013 9:01 AM, Juan Quintela wrote:

Chegu Vinod chegu_vi...@hp.com wrote:

On 4/30/2013 8:20 AM, Juan Quintela wrote:

(qemu) info migrate
capabilities: xbzrle: off auto-converge: off  
Migration status: active
total time: 1487503 milliseconds

148 seconds

1487 seconds and still the Migration is not completed.


expected downtime: 519 milliseconds
transferred ram: 383749347 kbytes
remaining ram: 2753372 kbytes
total ram: 268444224 kbytes
duplicate: 65461532 pages
skipped: 64901568 pages
normal: 95750218 pages
normal bytes: 383000872 kbytes
dirty pages rate: 67551 pages

---

(qemu) info migrate
capabilities: xbzrle: off auto-converge: on   
Migration status: completed
total time: 241161 milliseconds
downtime: 6373 milliseconds

6.3 seconds and finished,  not bad at all O:-)

That's the *downtime*..  The total time for migration to complete is
241 secs. (SpecJBB is
one of those workloads that dirties memory quite a bit).

Sorry,  you are right.  Impressive anyways for such a small change.


+/* To reduce the dirty rate explicitly disallow the VCPUs from spending
+   much time in the VM. The migration thread will try to catchup.
+   Workload will experience a greater performance drop but for a shorter
+   duration.
+*/
+void *migration_throttle_down(void *opaque)
+{
+throttling = true;
+while (throttling_needed()) {
+CPUArchState *penv = first_cpu;

I am not sure that we can follow the list without the iothread lock
here.

Hmm.. Is this due to vcpu hotplug that might happen at the time of
live migration (or) due
to something else? I was trying to avoid holding the iothread lock
for a longer duration and slowing
down the migration thread...

Well,  thinking back about it,  what we should do is disable cpu
hotplug/unplug during migration


I tend to agree.

For now I am not going to hold the iothread lock for following the list...


(it is not working well anyways as of
today).


Yes...and I see that Igor, Eduardo et al. are trying to fix this.

Vinod



Thanks,  Juan.
.






Re: [Qemu-devel] [RFC PATCH v2] Throttle-down guest when live migration does not converge.

2013-04-30 Thread Chegu Vinod

On 4/30/2013 8:04 AM, Orit Wasserman wrote:

On 04/27/2013 11:50 PM, Chegu Vinod wrote:

Busy enterprise workloads hosted on large sized VM's tend to dirty
memory faster than the transfer rate achieved via live guest migration.
Despite some good recent improvements ( using dedicated 10Gig NICs
between hosts) the live migration does NOT converge.

A few options that were discussed/being-pursued to help with
the convergence issue include:

1) Slow down guest considerably via cgroup's CPU controls - requires
libvirt client support to detect & trigger action, but conceptually
similar to this RFC change.

2) Speed up transfer rate:
- RDMA based Pre-copy - lower overhead and fast (Unfortunately
  has a few restrictions and some customers still choose not
  to deploy RDMA :-( ).
- Add parallelism to improve transfer rate and use multiple 10Gig
  connections (bonded). - could add some overhead on the host.

3) Post-copy (preferably with RDMA) or a Pre+Post copy hybrid - Sounds
promising but need to consider & handle newer failure scenarios.

If an enterprise user chooses to force convergence of their migration
via the new capability auto-converge then with this change we auto-detect
lack of convergence scenario and trigger a slow down of the workload
by explicitly disallowing the VCPUs from spending much time in the VM
context.

The migration thread tries to catchup and this eventually leads
to convergence in some deterministic amount of time. Yes it does
impact the performance of all the VCPUs but in my observation that
lasts only for a short duration of time. i.e. we end up entering
stage 3 (downtime phase) soon after that.

No external trigger is required (unlike option 1) and it can co-exist
with enhancements being pursued as part of Option 2 (e.g. RDMA).

Thanks to Juan and Paolo for their useful suggestions.

Verified the convergence using the following :
- SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
- OLTP like workload running on a 80VCPU/512G guest (~80% busy)

Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
migrate downtime set to 4seconds).

(qemu) info migrate
capabilities: xbzrle: off auto-converge: off  
Migration status: active
total time: 1487503 milliseconds
expected downtime: 519 milliseconds
transferred ram: 383749347 kbytes
remaining ram: 2753372 kbytes
total ram: 268444224 kbytes
duplicate: 65461532 pages
skipped: 64901568 pages
normal: 95750218 pages
normal bytes: 383000872 kbytes
dirty pages rate: 67551 pages

---

(qemu) info migrate
capabilities: xbzrle: off auto-converge: on   
Migration status: completed
total time: 241161 milliseconds
downtime: 6373 milliseconds
transferred ram: 28235307 kbytes
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64946416 pages
skipped: 64903523 pages
normal: 7044971 pages
normal bytes: 28179884 kbytes

Changes from v1:
- rebased to latest qemu.git
- added auto-converge capability (default off) - suggested by Anthony Liguori 
 & Eric Blake.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
  arch_init.c   |   44 +++
  cpus.c|   12 +
  include/migration/migration.h |   12 +
  include/qemu/main-loop.h  |3 ++
  kvm-all.c |   51 +
  migration.c   |   15 
  qapi-schema.json  |6 -
  7 files changed, 142 insertions(+), 1 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 92de1bd..6dcc742 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,7 @@ int graphic_depth = 15;
  #endif
  
  const uint32_t arch_type = QEMU_ARCH;

+static uint64_t mig_throttle_on;
  
  /***/

  /* ram save/restore */
@@ -379,12 +380,20 @@ static void migration_bitmap_sync(void)
  MigrationState *s = migrate_get_current();
  static int64_t start_time;
  static int64_t num_dirty_pages_period;
+static int64_t bytes_xfer_prev;
  int64_t end_time;
+int64_t bytes_xfer_now;
+static int dirty_rate_high_cnt;
+
+if (migrate_auto_converge() && !bytes_xfer_prev) {
+bytes_xfer_prev = ram_bytes_transferred();
+}
  
  if (!start_time) {

  start_time = qemu_get_clock_ms(rt_clock);
  }
  
+

  trace_migration_bitmap_sync_start();
  memory_global_sync_dirty_bitmap(get_system_memory());
  
@@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
  
  /* more than 1 second = 1000 milliseconds */

  if (end_time > start_time + 1000) {
+if (migrate_auto_converge()) {
+/* The following detection logic can be refined later. For now:
+   Check to see if the dirtied bytes is 50% more than the approx.
+   amount of bytes that just got transferred since the last time we
+   were

Re: [Qemu-devel] [RFC PATCH v2] Throttle-down guest when live migration does not converge.

2013-04-29 Thread Chegu Vinod

On 4/29/2013 7:53 AM, Eric Blake wrote:

On 04/27/2013 02:50 PM, Chegu Vinod wrote:

Busy enterprise workloads hosted on large sized VM's tend to dirty
memory faster than the transfer rate achieved via live guest migration.
Despite some good recent improvements ( using dedicated 10Gig NICs
between hosts) the live migration does NOT converge.

No exernal trigger is required (unlike option 1) and it can co-exist

s/exernal/external/


with enhancements being pursued as part of Option 2 (e.g. RDMA).

Thanks to Juan and Paolo for their useful suggestions.

---

(qemu) info migrate
capabilities: xbzrle: off auto-converge: on   

This part looks nice.

I'm not reviewing the entire patch (I'm not an expert on the internals
of migration), but just the interface:


Thanks for taking a look at this. I shall incorporate your suggested 
changes in the

next version.

Hoping to hear from Juan/Orit and others on the live migration part.

Thanks,
Vinod


+++ b/qapi-schema.json
@@ -599,10 +599,14 @@
  #  This feature allows us to minimize migration traffic for certain work
  #  loads, by sending compressed difference of the pages
  #
+# @auto-converge: Controls whether or not we want the migration to
+#  automaticially detect and force convergence by slowing

s/automaticially/automatically/


+#  down the guest. Disabled by default.

Missing a (since 1.6) designation.

Also, use of first-person (us, we) in docs seems a bit unprofessional,
although you were copying pre-existing usage.  How about:

@xbzrle: Migration supports xbzrle (Xor Based Zero Run Length Encoding),
  which minimizes migration traffic for certain workloads by
  sending compressed differences of active pages

@auto-converge: Migration supports automatic throttling of guest
 activity to force convergence (since 1.6)






[Qemu-devel] [RFC PATCH v2] Throttle-down guest when live migration does not converge.

2013-04-27 Thread Chegu Vinod
Busy enterprise workloads hosted on large sized VM's tend to dirty
memory faster than the transfer rate achieved via live guest migration.
Despite some good recent improvements ( using dedicated 10Gig NICs
between hosts) the live migration does NOT converge.

A few options that were discussed/being-pursued to help with
the convergence issue include:

1) Slow down guest considerably via cgroup's CPU controls - requires
   libvirt client support to detect & trigger action, but conceptually
   similar to this RFC change.

2) Speed up transfer rate:
   - RDMA based Pre-copy - lower overhead and fast (Unfortunately
 has a few restrictions and some customers still choose not
 to deploy RDMA :-( ).
   - Add parallelism to improve transfer rate and use multiple 10Gig
 connections (bonded). - could add some overhead on the host.

3) Post-copy (preferably with RDMA) or a Pre+Post copy hybrid - Sounds
   promising but need to consider & handle newer failure scenarios.

If an enterprise user chooses to force convergence of their migration
via the new capability auto-converge then with this change we auto-detect
lack of convergence scenario and trigger a slow down of the workload
by explicitly disallowing the VCPUs from spending much time in the VM
context.

The migration thread tries to catchup and this eventually leads
to convergence in some deterministic amount of time. Yes it does
impact the performance of all the VCPUs but in my observation that
lasts only for a short duration of time. i.e. we end up entering
stage 3 (downtime phase) soon after that.

No external trigger is required (unlike option 1) and it can co-exist
with enhancements being pursued as part of Option 2 (e.g. RDMA).

Thanks to Juan and Paolo for their useful suggestions.

Verified the convergence using the following :
- SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
- OLTP like workload running on a 80VCPU/512G guest (~80% busy)

Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
migrate downtime set to 4seconds).

(qemu) info migrate
capabilities: xbzrle: off auto-converge: off  
Migration status: active
total time: 1487503 milliseconds
expected downtime: 519 milliseconds
transferred ram: 383749347 kbytes
remaining ram: 2753372 kbytes
total ram: 268444224 kbytes
duplicate: 65461532 pages
skipped: 64901568 pages
normal: 95750218 pages
normal bytes: 383000872 kbytes
dirty pages rate: 67551 pages

---

(qemu) info migrate
capabilities: xbzrle: off auto-converge: on   
Migration status: completed
total time: 241161 milliseconds
downtime: 6373 milliseconds
transferred ram: 28235307 kbytes
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64946416 pages
skipped: 64903523 pages
normal: 7044971 pages
normal bytes: 28179884 kbytes

Changes from v1:
- rebased to latest qemu.git
- added auto-converge capability (default off) - suggested by Anthony Liguori 
& Eric Blake.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 arch_init.c   |   44 +++
 cpus.c|   12 +
 include/migration/migration.h |   12 +
 include/qemu/main-loop.h  |3 ++
 kvm-all.c |   51 +
 migration.c   |   15 
 qapi-schema.json  |6 -
 7 files changed, 142 insertions(+), 1 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 92de1bd..6dcc742 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,7 @@ int graphic_depth = 15;
 #endif
 
 const uint32_t arch_type = QEMU_ARCH;
+static uint64_t mig_throttle_on;
 
 /***/
 /* ram save/restore */
@@ -379,12 +380,20 @@ static void migration_bitmap_sync(void)
 MigrationState *s = migrate_get_current();
 static int64_t start_time;
 static int64_t num_dirty_pages_period;
+static int64_t bytes_xfer_prev;
 int64_t end_time;
+int64_t bytes_xfer_now;
+static int dirty_rate_high_cnt;
+
+if (migrate_auto_converge() && !bytes_xfer_prev) {
+bytes_xfer_prev = ram_bytes_transferred();
+}
 
 if (!start_time) {
 start_time = qemu_get_clock_ms(rt_clock);
 }
 
+
 trace_migration_bitmap_sync_start();
 memory_global_sync_dirty_bitmap(get_system_memory());
 
@@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
 
  /* more than 1 second = 1000 milliseconds */
  if (end_time > start_time + 1000) {
+if (migrate_auto_converge()) {
+/* The following detection logic can be refined later. For now:
+   Check to see if the dirtied bytes is 50% more than the approx.
+   amount of bytes that just got transferred since the last time we
+   were in this routine. If that happens N times (for now N==5)
+   we turn on the throttle down logic

[Qemu-devel] [RFC PATCH] Throttle-down guest when live migration does not converge.

2013-04-24 Thread Chegu Vinod
Busy enterprise workloads hosted on large sized VM's tend to dirty
memory faster than the transfer rate achieved via live guest migration.
Despite some good recent improvements ( using dedicated 10Gig NICs
between hosts) the live migration does NOT converge.

A few options that were discussed/being-pursued to help with
the convergence issue include:

1) Slow down guest considerably via cgroup's CPU controls - requires
   libvirt client support to detect & trigger action, but conceptually
   similar to this RFC change.

2) Speed up transfer rate:
   - RDMA based Pre-copy - lower overhead and fast (Unfortunately
 has a few restrictions and some customers still choose not
 to deploy RDMA :-( ).
   - Add parallelism to improve transfer rate and use multiple 10Gig
 connections (bonded). - could add some overhead on the host.

3) Post-copy (preferably with RDMA) or a Pre+Post copy hybrid - Sounds
   promising but need to consider & handle newer failure scenarios.

The following [RFC] change attempts to auto-detect lack of convergence
situation and trigger a slowdown of the workload by explicitly
disallowing the VCPUs from spending much time in the VM context. 
No external trigger is required (unlike option 1) and it can co-exist
with enhancements being pursued as part of Option 2 (e.g. RDMA).

The migration thread tries to catchup and this eventually leads
to convergence in some deterministic amount of time. Yes it does
impact the performance of all the VCPUs but in my observation that
lasts only for a short duration of time. i.e. we end up entering
stage 3 (downtime phase) soon after that.

Verified the convergence using the following:
- SpecJbb2005 workload running on a 20VCPU/128G guest(~80% busy)
- OLTP like workload running on a 80VCPU/512G guest (~80% busy)

Thanks to Juan and Paolo for some useful suggestions. More
refinement is needed (e.g. smarter way to detect & variable
throttling based on need etc). For now I was hoping to get
some feedback or hear about other more refined ideas.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 arch_init.c   |   37 +++
 cpus.c|   12 ++
 include/migration/migration.h |9 +++
 include/qemu/main-loop.h  |3 ++
 kvm-all.c |   49 +
 migration.c   |6 +
 6 files changed, 116 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 92de1bd..a06ff81 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,7 @@ int graphic_depth = 15;
 #endif
 
 const uint32_t arch_type = QEMU_ARCH;
+static uint64_t mig_throttle_on;
 
 /***/
 /* ram save/restore */
@@ -379,12 +380,19 @@ static void migration_bitmap_sync(void)
 MigrationState *s = migrate_get_current();
 static int64_t start_time;
 static int64_t num_dirty_pages_period;
+static int64_t bytes_xfer_prev;
 int64_t end_time;
+int64_t bytes_xfer_now;
+static int dirty_rate_high_cnt;
 
 if (!start_time) {
 start_time = qemu_get_clock_ms(rt_clock);
 }
 
+if (!bytes_xfer_prev) {
+bytes_xfer_prev = ram_bytes_transferred();
+}
+
 trace_migration_bitmap_sync_start();
 memory_global_sync_dirty_bitmap(get_system_memory());
 
@@ -404,6 +412,23 @@ static void migration_bitmap_sync(void)
 
 /* more than 1 second = 1000 milliseconds */
 if (end_time > start_time + 1000) {
+ /* The following detection logic can be refined later. For now:
+  Check to see if the dirtied bytes is 50% more than the approx.
+  amount of bytes that just got transferred since the last time we
+  were in this routine. If that happens N times (for now N==5)
+  we turn on the throttle down logic */
+ bytes_xfer_now = ram_bytes_transferred();
+ if (s->dirty_pages_rate &&
+ ((num_dirty_pages_period * TARGET_PAGE_SIZE) >
+ ((bytes_xfer_now - bytes_xfer_prev) / 2))) {
+ if (dirty_rate_high_cnt++ > 5) {
+ DPRINTF("Unable to converge. Throttling down guest\n");
+ mig_throttle_on = 1;
+ }
+}
+bytes_xfer_prev = bytes_xfer_now;
+
 s->dirty_pages_rate = num_dirty_pages_period * 1000
 / (end_time - start_time);
 s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -496,6 +521,18 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
 return bytes_sent;
 }
 
+bool throttling_needed(void)
+{
+bool value;
+
+qemu_mutex_lock_mig_throttle();
+value = mig_throttle_on;
+qemu_mutex_unlock_mig_throttle();
+
+if (value) {
+return true;
+}
+return false;
+}
+
 static uint64_t bytes_transferred;
 
 static ram_addr_t ram_save_remaining(void)
diff --git a/cpus.c b/cpus.c
index 5a98a37..eea6601 100644
--- a/cpus.c
+++ b/cpus.c
@@ -616,6 +616,7

Re: [Qemu-devel] [RFC PATCH] Throttle-down guest when live migration does not converge.

2013-04-24 Thread Chegu Vinod

On 4/24/2013 6:59 PM, Anthony Liguori wrote:
On Wed, Apr 24, 2013 at 6:42 PM, Chegu Vinod chegu_vi...@hp.com wrote:


Busy enterprise workloads hosted on large sized VM's tend to dirty
memory faster than the transfer rate achieved via live guest
migration.
Despite some good recent improvements ( using dedicated 10Gig NICs
between hosts) the live migration does NOT converge.

A few options that were discussed/being-pursued to help with
the convergence issue include:

1) Slow down guest considerably via cgroup's CPU controls - requires
   libvirt client support to detect & trigger action, but conceptually
   similar to this RFC change.

2) Speed up transfer rate:
   - RDMA based Pre-copy - lower overhead and fast (Unfortunately
 has a few restrictions and some customers still choose not
 to deploy RDMA :-( ).
   - Add parallelism to improve transfer rate and use multiple 10Gig
 connections (bonded). - could add some overhead on the host.

3) Post-copy (preferably with RDMA) or a Pre+Post copy hybrid - Sounds
   promising but need to consider & handle newer failure scenarios.

The following [RFC] change attempts to auto-detect lack of convergence
situation and trigger a slowdown of the workload by explicitly
disallowing the VCPUs from spending much time in the VM context.
No external trigger is required (unlike option 1) and it can co-exist
with enhancements being pursued as part of Option 2 (e.g. RDMA).

The migration thread tries to catchup and this eventually leads
to convergence in some deterministic amount of time. Yes it does
impact the performance of all the VCPUs but in my observation that
lasts only for a short duration of time. i.e. we end up entering
stage 3 (downtime phase) soon after that.


This is a reasonable idea and approach but it cannot be unconditional. 
 Sacrificing VCPU performance to encourage convergence is a management 
decision.  In some cases, VCPU performance is far more important than 
migration convergence.


I understand the concern and agree.

Would it be ok to pass in an additional argument to qemu as part of 
triggering the live migration, i.e. to indicate if it's ok to force 
convergence when it fails to converge on its own after N # of tries 
following the bulk transfer ?


Thanks!
Vinod



Regards,

Anthony Liguori

Verified the convergence using the following:
- SpecJbb2005 workload running on a 20VCPU/128G guest(~80% busy)
- OLTP like workload running on a 80VCPU/512G guest (~80% busy)

Thanks to Juan and Paolo for some useful suggestions. More
refinement is needed (e.g. smarter way to detect & variable
throttling based on need etc). For now I was hoping to get
some feedback or hear about other more refined ideas.

Signed-off-by: Chegu Vinod chegu_vi...@hp.com
---
 arch_init.c   |   37 +++
 cpus.c|   12 ++
 include/migration/migration.h |9 +++
 include/qemu/main-loop.h  |3 ++
 kvm-all.c |   49
+
 migration.c   |6 +
 6 files changed, 116 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 92de1bd..a06ff81 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,7 @@ int graphic_depth = 15;
 #endif

 const uint32_t arch_type = QEMU_ARCH;
+static uint64_t mig_throttle_on;

 /***/
 /* ram save/restore */
@@ -379,12 +380,19 @@ static void migration_bitmap_sync(void)
 MigrationState *s = migrate_get_current();
 static int64_t start_time;
 static int64_t num_dirty_pages_period;
+static int64_t bytes_xfer_prev;
 int64_t end_time;
+int64_t bytes_xfer_now;
+static int dirty_rate_high_cnt;

 if (!start_time) {
 start_time = qemu_get_clock_ms(rt_clock);
 }

+if (!bytes_xfer_prev) {
+bytes_xfer_prev = ram_bytes_transferred();
+}
+
 trace_migration_bitmap_sync_start();
 memory_global_sync_dirty_bitmap(get_system_memory());

@@ -404,6 +412,23 @@ static void migration_bitmap_sync(void)

 /* more than 1 second = 1000 milliseconds */
 if (end_time > start_time + 1000) {
+ /* The following detection logic can be refined later. For now:
+  Check to see if the dirtied bytes is 50% more than the approx.
+  amount of bytes that just got transferred since the last time we
+  were in this routine. If that happens N times (for now N==5)
+  we turn on the throttle down logic */
+ bytes_xfer_now = ram_bytes_transferred

Re: [Qemu-devel] [PATCH v4] Add option to mlock qemu and guest memory

2013-04-21 Thread Chegu Vinod

Hi Satoru,

FYI... I had tried to use this change earlier and it did show some 
improvements in perf. (due to reduced exits).


But as expected, mlockall() on large sized guests adds a considerable 
delay in boot time. For e.g., on an 8-socket Westmere box, a 256G guest 
took an additional ~2+ mins to boot and a 512G guest took an 
additional ~5+ mins to boot. This is mainly due to long time spent in 
trying to clear all the pages.


77.96% 35728  qemu-system-x86  [kernel.kallsyms] [k] clear_page_c

|
--- clear_page_c
hugetlb_no_page
hugetlb_fault
follow_hugetlb_page
__get_user_pages
__mlock_vma_pages_range
__mm_populate
vm_mmap_pgoff
sys_mmap_pgoff
sys_mmap
system_call
__GI___mmap64
qemu_ram_alloc_from_ptr
qemu_ram_alloc
memory_region_init_ram
pc_memory_init
pc_init1
pc_init_pci
main
__libc_start_main

Need to have a faster way to clear pages.
Vinod



[Qemu-devel] Large guest boot hangs the host.

2013-03-21 Thread Chegu Vinod

Hello,

I have been noticing host hangs when trying to boot large guests 
(>=40 Vcpus) with the current upstream qemu.


Host is running 3.8.2 kernel.
qemu is the latest one from qemu.git.

Example qemu command line listed below... this used to work with a 
slightly older qemu (about 1.5 weeks ago, on the same host with the 
3.8.2 kernel). 'am trying to determine the cause of the host 
hang...but wanted to check to see if anyone

else has seen it...

Thanks
Vinod





/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
-cpu host \
-smp sockets=8,cores=10,threads=1 \
-numa node,nodeid=0,cpus=0-9,mem=64g \
-numa node,nodeid=1,cpus=10-19,mem=64g \
-numa node,nodeid=2,cpus=20-29,mem=64g \
-numa node,nodeid=3,cpus=30-39,mem=64g \
-numa node,nodeid=4,cpus=40-49,mem=64g \
-numa node,nodeid=5,cpus=50-59,mem=64g \
-numa node,nodeid=6,cpus=60-69,mem=64g \
-numa node,nodeid=7,cpus=70-79,mem=64g \
-m 524288 \
-mem-path /dev/hugepages \
-name vm1 \
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \
-drive file=/dev/libvirt_lvm/vm1,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \

-monitor stdio \
-net nic,model=virtio,macaddr=52:54:00:71:01:01,netdev=nic-0 \
-netdev tap,id=nic-0,ifname=tap0,script=no,downscript=no,vhost=on \
-vnc :4





Re: [Qemu-devel] [PATCH 00/41] Migration cleanups and latency improvements

2013-02-19 Thread Chegu Vinod

On 2/15/2013 9:46 AM, Paolo Bonzini wrote:

This series does many of the improvements that the migration thread
promised.  It removes buffering, lets a large amount of code run outside
the big QEMU lock, and removes some duplication between incoming and
outgoing migration.

Patches 1 to 7 are simple cleanups.

Patches 8 to 14 simplify the lifecycle of the migration thread and
the migration QEMUFile.

Patches 15 to 18 add fine-grained locking to the block migration
data structures, so that patches 19 to 21 can move RAM/block live
migration out of the big QEMU lock.  At this point blocking writes
will not starve other threads seeking to grab the big QEMU mutex:
patches 22 to 24 removes the buffering and cleanup the code.

Patches 25 to 28 are more cleanups.

Patches 29 to 33 improve QEMUFile so that patches 34 and 35 can
use QEMUFile to write out data, instead of MigrationState.
Patches 36 to 41 then can remove the useless QEMUFile wrapper
that remains.

Please review and test!  You can find these patches at
git://github.com/bonzini/qemu.git, branch migration-thread-20130115.

Juan Quintela (1):
   Rename buffered_ to migration_

Paolo Bonzini (40):
   migration: simplify while loop
   migration: always use vm_stop_force_state
   migration: move more error handling to migrate_fd_cleanup
   migration: push qemu_savevm_state_cancel out of qemu_savevm_state_*
   block-migration: remove useless calls to blk_mig_cleanup
   qemu-file: pass errno from qemu_fflush via f-last_error
   migration: use qemu_file_set_error to pass error codes back to
 qemu_savevm_state
   qemu-file: temporarily expose qemu_file_set_error and qemu_fflush
   migration: flush all data to fd when buffered_flush is called
   migration: use qemu_file_set_error
   migration: simplify error handling
   migration: do not nest flushing of device data
   migration: prepare to access s-state outside critical sections
   migration: cleanup migration (including thread) in the iothread
   block-migration: remove variables that are never read
   block-migration: small preparatory changes for locking
   block-migration: document usage of state across threads
   block-migration: add lock
   migration: reorder SaveVMHandlers members
   migration: run pending/iterate callbacks out of big lock
   migration: run setup callbacks out of big lock
   migration: yay, buffering is gone
   qemu-file: make qemu_fflush and qemu_file_set_error private again
   migration: eliminate last_round
   migration: detect error before sleeping
   migration: remove useless qemu_file_get_error check
   migration: use qemu_file_rate_limit consistently
   migration: merge qemu_popen_cmd with qemu_popen
   qemu-file: fsync a writable stdio QEMUFile
   qemu-file: check exit status when closing a pipe QEMUFile
   qemu-file: add writable socket QEMUFile
   qemu-file: simplify and export qemu_ftell
   migration: use QEMUFile for migration channel lifetime
   migration: use QEMUFile for writing outgoing migration data
   migration: use qemu_ftell to compute bandwidth
   migration: small changes around rate-limiting
   migration: move rate limiting to QEMUFile
   migration: move contents of migration_close to migrate_fd_cleanup
   migration: eliminate s-migration_file
   migration: inline migrate_fd_close

  arch_init.c   |   14 ++-
  block-migration.c |  167 +++--
  docs/migration.txt|   20 +---
  include/migration/migration.h |   12 +--
  include/migration/qemu-file.h |   21 +--
  include/migration/vmstate.h   |   21 ++-
  include/qemu/atomic.h |1 +
  include/sysemu/sysemu.h   |6 +-
  migration-exec.c  |   39 +-
  migration-fd.c|   47 +--
  migration-tcp.c   |   33 +
  migration-unix.c  |   33 +
  migration.c   |  345 -
  savevm.c  |  214 +++---
  util/osdep.c  |6 +-
  15 files changed, 367 insertions(+), 612 deletions(-)

.


'am still in the midst of reviewing the changes but gave them a try. The 
following are my preliminary observations :


- The mult-second freezes at the start of migration of larger guests 
(i.e. 128GB and higher) aren't observable with the above changes. (The 
simple timer script that does a gettimeofday every 100ms didn't complain 
about delays etc.).


- Noticed  improvements in bandwidth utilization during the iterative 
pre-copy phase and during the downtime phase.


- The total migration time reduced...more for larger guests (Note: The 
undesirably large actual downtime for larger guests is a different 
topic that still needs to be addressed independent of these changes).


Some details follow below...

Thanks
Vinod


Details:
--

Host and Guest kernels are running : 3.8-rc5.

Comparing upstream (Qemu 1.4.50) vs. Paolo's branch(Qemu 1.3.92 based) i.e.
git clone 

Re: [Qemu-devel] [PATCH 0/4] migration stats fixes

2013-02-04 Thread Chegu Vinod

On 02/01/2013 02:32 PM, Juan Quintela wrote:



Hi

migration expected_downtime calculation was removed on commit
e4ed1541ac9413eac494a03532e34beaf8a7d1c5.

We add the calculation back.  Before doing the calculation we do:

- expected_downtime intial value is max_downtime.  Much, much better
  intial value than 0.

- we move when we measure the time.  We used to measure how much it
  took before we really sent the data.

- we introduce sleep_time concept.  While we are sleeping because we
  have sent all the allowed data for this second we shouldn't be
  accounting that time as sending.

- last patch just introduces the re-calculation of expected_downtime.

It just changes the stats value.  Well, patches 2 & 3 change the
bandwidth calculation for migration, but I think that we were
undercalculating it enough that it was a bug.

Without the 2 & 3 patches, the expected_downtime for an idle guest
was calculated as 80ms (with 30 ms default target value), and we ended
up having a downtime of around 15ms.

With these patches applied, we calculate an expected downtime of around
15ms or so, and then we spend around 18ms on downtime.  Notice that
we only calculate how much it takes to send the rest of the RAM; it
just happens that there is some more data to send than what we are 
calculating.


Review, please.

Later, Juan.


The following changes since commit 8a55ebf01507ab73cc458cfcd5b9cb856aba0b9e:

  Merge remote-tracking branch 'afaerber/qom-cpu' into staging (2013-01-31 19:37:33 -0600)

are available in the git repository at:


  git://repo.or.cz/qemu/quintela.git stats.next

for you to fetch changes up to 791128495e3546ccc88dd037ea4dfd31eca14a56:

  migration: calculate expected_downtime (2013-02-01 13:22:37 +0100)


Juan Quintela (4):
  migration: change initial value of expected_downtime
  migration: calculate end time after we have sent the data
  migration: don't account sleep time for calculating bandwidth
  migration: calculate expected_downtime

 arch_init.c   |  1 +
 include/migration/migration.h |  1 +
 migration.c   | 15 +--
 3 files changed, 15 insertions(+), 2 deletions(-)



Reviewed-by: Chegu Vinod chegu_vi...@hp.com





[Qemu-devel] vhost-net thread getting stuck ?

2013-01-09 Thread Chegu Vinod


Hello,

'am running into an issue with the latest bits. [ Pl. see below. The 
vhost thread seems to be getting
stuck while trying to memcpy... perhaps a bad address? ] Wondering if 
this is a known issue or some recent regression?

'am using the latest qemu (from qemu.git) and the latest kvm.git kernel 
on the host. Started the

guest using the following command line

/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
-cpu host \
-smp sockets=8,cores=10,threads=1 \
-numa node,nodeid=0,cpus=0-9,mem=64g \
-numa node,nodeid=1,cpus=10-19,mem=64g \
-numa node,nodeid=2,cpus=20-29,mem=64g \
-numa node,nodeid=3,cpus=30-39,mem=64g \
-numa node,nodeid=4,cpus=40-49,mem=64g \
-numa node,nodeid=5,cpus=50-59,mem=64g \
-numa node,nodeid=6,cpus=60-69,mem=64g \
-numa node,nodeid=7,cpus=70-79,mem=64g \
-m 524288 \
-mem-path /dev/hugepages \
-name vm2 \
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm2.monitor,server,nowait \
-drive file=/dev/libvirt_lvm2/vm2,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-monitor stdio \
-net nic,model=virtio,macaddr=52:54:00:71:01:02,netdev=nic-0 \
-netdev tap,id=nic-0,ifname=tap0,script=no,downscript=no,vhost=on \
-vnc :4


Was just doing a basic kernel build in the guest when it hung with the 
following in the dmesg of the host.

Thanks
Vinod

BUG: soft lockup - CPU#46 stuck for 23s! [vhost-135220:135231]
Modules linked in: kvm_intel kvm fuse ip6table_filter ip6_tables 
ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state 
nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter 
ip_tables bridge stp llc autofs4 sunrpc pcc_cpufreq ipv6 vhost_net 
macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp 
crc32c_intel ghash_clmulni_intel microcode pcspkr mlx4_core be2net 
lpc_ich mfd_core hpilo hpwdt i7core_edac edac_core sg netxen_nic ext4 
mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif aesni_intel ablk_helper 
cryptd lrw aes_x86_64 xts gf128mul pata_acpi ata_generic ata_piix hpsa 
lpfc scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm 
i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last 
unloaded: kvm]

CPU 46
Pid: 135231, comm: vhost-135220 Not tainted 3.7.0+ #1 HP ProLiant DL980 G7
RIP: 0010:[8147bab0]  [8147bab0] 
skb_flow_dissect+0x1b0/0x440

RSP: 0018:881ffd131bc8  EFLAGS: 0246
RAX: 8a1f7dc70c00 RBX:  RCX: 7fa0
RDX:  RSI: 881ffd131c68 RDI: 8a1ff1bd6c80
RBP: 881ffd131c58 R08: 881ffd131bf8 R09: 8a1ff1bd6c80
R10: 0010 R11: 0004 R12: 8a1ff1bd6c80
R13: 000b R14: 8147330b R15: 881ffd131b58
FS:  () GS:8a1fff98() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 003d5c810dc0 CR3: 009f77c04000 CR4: 27e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process vhost-135220 (pid: 135231, threadinfo 881ffd13, task 
881ffcb754c0)

Stack:
 881ffd131c18 81477b90 00e2 2b289bcc58ce
 881ffd131ce4 00a2  00a2
 00a2 00a2 881ffd131c88 937e754e
Call Trace:
 [81477b90] ? memcpy_fromiovecend+0x90/0xd0
 [8147f3ca] __skb_get_rxhash+0x1a/0xe0
 [a03c90f8] tun_get_user+0x468/0x660 [tun]
 [81090010] ? __sdt_alloc+0x80/0x1a0
 [a03c934d] tun_sendmsg+0x5d/0x80 [tun]
 [a0468e8a] handle_tx+0x34a/0x680 [vhost_net]
 [a04691f5] handle_tx_kick+0x15/0x20 [vhost_net]
 [a0466dfc] vhost_worker+0x10c/0x1c0 [vhost_net]
 [a0466cf0] ? vhost_attach_cgroups_work+0x30/0x30 [vhost_net]
 [a0466cf0] ? vhost_attach_cgroups_work+0x30/0x30 [vhost_net]
 [8107ecfe] kthread+0xce/0xe0
 [8107ec30] ? kthread_freezable_should_stop+0x70/0x70
 [815537ac] ret_from_fork+0x7c/0xb0
 [8107ec30] ? kthread_freezable_should_stop+0x70/0x70
Code: b6 50 06 48 89 ce 48 c1 ee 20 31 f1 41 89 0e 48 8b 48 20 48 33 48 
18 48 89 c8 48 c1 e8 20 31 c1 41 89 4e 04 e9 35 ff ff ff 66 90 0f b6 
50 09 e9 1a ff ff ff 0f 1f 80 00 00 00 00 41 8b 44 24 68

[root@hydra11 kvm_rik]#
Message from syslogd@hydra11 at Jan  9 13:06:58 ...
 kernel:BUG: soft lockup - CPU#46 stuck for 22s! [vhost-135220:135231]




Re: [Qemu-devel] vhost-net thread getting stuck ?

2013-01-09 Thread Chegu Vinod

On 1/9/2013 8:35 PM, Jason Wang wrote:

On 01/10/2013 04:25 AM, Chegu Vinod wrote:

Hello,

'am running into an issue with the latest bits. [ Pl. see below. The
vhost thread seems to be getting
stuck while trying to memcopy...perhaps a bad address?.  ] Wondering
if this is a known issue or
some recent regression ?

Hi:

Looks like the issue has been fixed in following commits, does you tree
contain these?

499744209b2cbca66c42119226e5470da3bb7040 and
76fe45812a3b134c39170ca32dfd4b7217d33145.

They have been merged in to Linus 3.8-rc tree.


I was using the kvm.git kernel (as of this morning)... looks like the 
fixes aren't there yet.

Will try Linus's 3.8-rc tree.

Thanks!
Vinod









Re: [Qemu-devel] Migration ToDo list

2012-11-13 Thread Chegu Vinod

On 11/13/2012 8:18 AM, Juan Quintela wrote:

Hi

If you have anything else to put, please add.

Migration Thread
* Plan is integrate it as one of first thing in December (me)
* Remove copies with buffered file (me)

Bitmap Optimization
* Finish moving to individual bitmaps for migration/vga/code
* Make sure we don't copy things around
* Shared memory bitmap with kvm?
* Move to 2MB pages bitmap and then fine grain?


If it's not already implied in the above... the long freezes observed 
at the start of the migration need to be addressed (it's most likely 
related to the BQL?).




QIDL
* Review the patches (me)

PostCopy
* Review patches?
* See what we can already integrate?
   I remember for last year that we could integrate the 1st third or so

RDMA
* Send RDMA/tcp/ library they already have (Benoit)
* This is required for postcopy
* This can be used for precopy


Not sure if what Benoit has can be directly used for pre-copy also.

As Paolo said... we need to look at the RDS APIs for pre-copy (I have 
just started looking at the same). Would like to know if SDP can be used...



General
* Change protocol to:
   a) being always 16byte aligned (paolo said that is faster)
   b) do scatter/gather of the pages?


Control of where the migration thread(s) run...

--

BTW, has anyone tried doing multiple guest migration from a host ? Are 
there limitations  (enforced via higher level management tools) as to 
how many guests can be migrated at once (in an attempt to quickly 
evacuate a flaky host) ?


Vinod


Fault Tolerance
* That is built on top of migration code, but I have nothing to add.

Any more ideas?

Later, Juan.






Re: [Qemu-devel] [PATCH 00/18] Migration thread lite (20121029)

2012-10-29 Thread Chegu Vinod

On 10/29/2012 9:21 AM, Vinod, Chegu wrote:

Date: Mon, 29 Oct 2012 15:11:25 +0100
From: Juan Quintela quint...@redhat.com
To: qemu-devel@nongnu.org
Cc: owass...@redhat.com, mtosa...@redhat.com, a...@redhat.com,
pbonz...@redhat.com
Subject: [Qemu-devel] [PATCH 00/18] Migration thread lite (20121029)

Hi

After discussing with Anthony and Paolo, this is the minimal migration thread 
support that we can do for 1.3.

- fixed all the comments (thanks eric, paolo and orit).
- buffered_file.c remains until 1.4.
- testing for vinod showed very nice results, that is why the bitmap
   handling optimizations remain.


Hi Juan,

Is there any value in calling the migration_bitmap_sync() routine in 
ram_save_setup()? All the pages were marked as dirty to begin with... so 
can't we just assume that all pages need to be sent and proceed?

migration_bitmap_sync() still remains an expensive call, and the very 
first call seems to be ~3x more expensive than the subsequent calls. For 
large guests (128G guests) this takes multiple seconds... and it freezes 
the OS instance.
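The suggestion above amounts to starting from an all-ones migration bitmap instead of paying for that first sync; a minimal sketch of the idea (illustrative only, not qemu's actual bitmap code):

```python
def initial_migration_bitmap(num_pages: int) -> int:
    """Every page is treated as dirty at the start of migration, so in
    principle the first (expensive) bitmap sync could be skipped and the
    bitmap simply initialized to all ones."""
    return (1 << num_pages) - 1  # num_pages bits, all set


def is_dirty(bitmap: int, page: int) -> bool:
    """Check whether a given page is marked dirty in the bitmap."""
    return bool(bitmap >> page & 1)
```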

Thanks
Vinod



Note: Writes have become blocking, and I had to remove the non-blocking 
feature in qemu-sockets.c. Checked that migration was the only user of 
that feature. If new users appear, they just need to add the 
socket_set_nonblock() by hand.
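What such a caller would have to do by hand is just the usual O_NONBLOCK dance; a Python sketch of a helper like that (the real socket_set_nonblock() is a C helper in qemu's source, this is only an illustration of the fcntl pattern):

```python
import fcntl
import os
import socket


def socket_set_nonblock(sock: socket.socket) -> None:
    """Flip O_NONBLOCK on the socket's file descriptor, which is what a
    new user must now do itself since migration writes became blocking."""
    flags = fcntl.fcntl(sock.fileno(), fcntl.F_GETFL)
    fcntl.fcntl(sock.fileno(), fcntl.F_SETFL, flags | os.O_NONBLOCK)
```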

Please, review.

Thanks, Juan.


The following changes since commit 50cd72148211c5e5f22ea2519d19ce024226e61f:

   hw/xtensa_sim: get rid of intermediate xtensa_sim_init (2012-10-27 15:04:00 
+)

are available in the git repository at:

   http://repo.or.cz/r/qemu/quintela.git migration-thread-20121029

for you to fetch changes up to 2c74654f19efc7db35117a87c0d9db4776931e1b:

   ram: optimize migration bitmap walking (2012-10-29 14:14:28 +0100)

Juan Quintela (15):
   buffered_file: Move from using a timer to use a thread
   migration: make qemu_fopen_ops_buffered() return void
   migration: stop all cpus correctly
   migration: make writes blocking
   migration: remove unfreeze logic
   migration: take finer locking
   buffered_file: Unfold the trick to restart generating migration data
   buffered_file: don't flush on put buffer
   buffered_file: unfold buffered_append in buffered_put_buffer
   savevm: New save live migration method: pending
   migration: include qemu-file.h
   migration-fd: remove duplicate include
   memory: introduce memory_region_test_and_clear_dirty
   ram: Use memory_region_test_and_clear_dirty
   ram: optimize migration bitmap walking

Paolo Bonzini (1):
   split MRU ram list

Umesh Deshpande (2):
   add a version number to ram_list
   protect the ramlist with a separate mutex

  arch_init.c   | 115 ---
  block-migration.c |  49 +---
  buffered_file.c   | 130 +-
  buffered_file.h   |   2 +-
  cpu-all.h |  15 ++-
  exec.c|  44 +++---
  memory.c  |  16 +++
  memory.h  |  18 
  migration-exec.c  |   3 +-
  migration-fd.c|   4 +-
  migration-tcp.c   |   2 +-
  migration-unix.c  |   2 +-
  migration.c   | 100 -
  migration.h   |   4 +-
  qemu-file.h   |   5 ---
  qemu-sockets.c|   4 --
  savevm.c  |  24 +++---
  sysemu.h  |   1 +
  vmstate.h |   1 +
  19 files changed, 280 insertions(+), 259 deletions(-)

--
1.7.11.7











Re: [Qemu-devel] Unable to enable +x2apic for the guest cpus...

2012-10-13 Thread Chegu Vinod

On 10/13/2012 12:32 AM, Gleb Natapov wrote:

On Fri, Oct 12, 2012 at 07:38:42PM -0700, Chegu Vinod wrote:

Hello,

I am using a very recent upstream version of qemu.git along with
kvm.git kernels (in the host and guest).
  [Guest kernel had been compiled with CONFIG_X86_X2APIC and
CONFIG_IRQ_REMAP both set]

When I attempt to start a guest with +x2apic flag (pl. see the qemu
cmd line below) I end up with a hang of the qemu and
a kernel BUG at /arch/x86/kvm/lapic.c:159 !Pl. see the attached
screen shot of the console for additional info.

I am able to boot the same guest without the +x2apic flag in the
qemu cmd line.

Not sure if this an issue (or) if I have something incorrectly
specified in the qemu cmd line ? If its the latter...pl. advise the
correct usage
for enabling x2apic for the guest cpus.. for the upstream bits.


This is the bug in how ldr in x2apic mode is calculated.

Try the following patch:

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c6e6b72..43e9fad 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1311,7 +1311,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 	vcpu->arch.apic_base = value;
 	if (apic_x2apic_mode(apic)) {
 		u32 id = kvm_apic_id(apic);
-		u32 ldr = ((id & ~0xf) << 16) | (1 << (id & 0xf));
+		u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));
 		kvm_apic_set_ldr(apic, ldr);
 	}
 	apic->base_address = apic->vcpu->arch.apic_base &
--
Gleb.
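The layout the one-liner fixes can be checked standalone: in x2APIC mode the LDR's high 16 bits hold the cluster (APIC ID >> 4) and the low 16 bits hold a one-hot position within the cluster (1 << (ID & 0xf)). A quick sketch mirroring the two expressions in the diff (illustrative, not kernel code):

```python
def ldr_x2apic(apic_id: int) -> int:
    """x2APIC logical destination: bits 31..16 = cluster (id >> 4),
    bits 15..0 = one-hot position within the cluster (id & 0xf)."""
    return ((apic_id >> 4) << 16) | (1 << (apic_id & 0xf))


def ldr_buggy(apic_id: int) -> int:
    """The pre-fix computation: it masks instead of shifting, so the
    cluster field lands 4 bits too high for any APIC ID >= 16."""
    return ((apic_id & ~0xf) << 16) | (1 << (apic_id & 0xf))
```

The two expressions agree only for APIC IDs 0..15 (cluster 0), which is why small guests booted fine and larger x2apic guests tripped the BUG.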




Retried with the above patch and the guest is booting fine (the x2apic 
flag shows up in the guest's /proc/cpuinfo).

Was this a recent regression?

Thanks!
Vinod



[Qemu-devel] Fwd: Re: [RFC 0/7] Migration stats

2012-08-13 Thread Chegu Vinod


Forwarding to the alias.
Thanks,
Vinod

 Original Message 
Subject:Re: [RFC 0/7] Migration stats
Date:   Mon, 13 Aug 2012 15:20:10 +0200
From:   Juan Quintela quint...@redhat.com
Reply-To:   quint...@redhat.com
To: Chegu Vinod chegu_vi...@hp.com
CC: 


[ snip ]

 - Prints the real downtime that we have had



   really, it prints the total downtime of the complete phase, but the
   downtime also includes the last ram_iterate phase.  Working on
   fixing that one.


Good one.


[...]


What do I want to know:

- is there any stat that you want?  Once here, adding a new one should
   be easy.





A few suggestions :

a) Total amount of time spent syncing up dirty bitmap logs for the
total duration of migration.


I can add that one, it is not difficult.  Notice that in the future I 
expect to do the syncs in smaller chunks (but that is pie in the sky).


b) Actual [average?] bandwidth that was used as compared to the
allocated bandwidth...  (I want to know how folks are observing
near line rate on a 10Gig... when I am not...).


Printing the average bandwidth is easy.  The hardware bandwidth is 
difficult to get from inside one application.
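The average can already be derived from the counters that `info migrate` prints; a sketch of the arithmetic using figures quoted later in this archive (illustrative, not qemu code):

```python
def avg_bandwidth_mbit(transferred_kbytes: int, total_time_ms: int) -> float:
    """Average migration bandwidth in megabits/second, computed from the
    'transferred ram' and 'total time' counters of 'info migrate'."""
    bits = transferred_kbytes * 1024 * 8
    seconds = total_time_ms / 1000.0
    return bits / seconds / 1e6

# The idle 128G-guest run quoted elsewhere in this archive
# (3811345 kbytes over 199743 ms) works out to roughly 156 Mbit/s,
# nowhere near line rate on a 10Gig link.
```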





I think it would be useful to know the approximate amount of [host] cpu
time that got used up by the migration-related thread(s) and any
related host-side services (like servicing the I/O interrupts while
driving traffic through the network). I assume there are alternate
methods to derive all these (and we don't need to overload the
migration stats?).


This one is not easy to do from inside qemu.  Much easier to get from
the outside.  As far as I know, it is not easy to monitor CPU usage from
inside the CPU that we want to measure.
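From the outside it is indeed straightforward: /proc/<pid>/task/<tid>/stat carries per-thread utime and stime (fields 14 and 15 per proc(5)). A sketch of the parsing, the field positions being the only assumption:

```python
def thread_cpu_seconds(stat_line: str, ticks_per_sec: int = 100) -> float:
    """Sum utime + stime (fields 14 and 15 of /proc/<pid>/task/<tid>/stat)
    and convert clock ticks to seconds.  The comm field can itself contain
    spaces, so split after its closing parenthesis before indexing."""
    after_comm = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(after_comm[11]), int(after_comm[12])  # fields 14, 15
    return (utime + stime) / ticks_per_sec
```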

Thanks for the comments, Juan.






Re: [Qemu-devel] FW: Fwd: [RFC 00/27] Migration thread (WIP)

2012-07-27 Thread Chegu Vinod

On 7/27/2012 7:11 AM, Vinod, Chegu wrote:


-Original Message-
From: Juan Quintela [mailto:quint...@redhat.com]
Sent: Friday, July 27, 2012 4:06 AM
To: Vinod, Chegu
Cc: qemu-devel@nongnu.org; Orit Wasserman
Subject: Re: Fwd: [RFC 00/27] Migration thread (WIP)

Chegu Vinod chegu_vi...@hp.com wrote:

On 7/26/2012 11:41 AM, Chegu Vinod wrote:


 
 
  Original Message 



  Subject:  [Qemu-devel] [RFC 00/27] Migration thread (WIP)

  Date: Tue, 24 Jul 2012 20:36:25 +0200

  From: Juan Quintela quint...@redhat.com

  To:   qemu-devel@nongnu.org


 
 
 Hi


This series are on top of the migration-next-v5 series just posted.

First of all, this is an RFC/Work in progress.  Just a lot of people
asked for it, and I would like review of the design.

 Hello,
 
 Thanks for sharing this early/WIP version for evaluation.
 
 Still in the middle of  code review..but wanted to share a couple

 of quick  observations.
 'tried to use it to migrate a 128G/10VCPU guest (speed set to 10G
 and downtime 2s).
 Once with no workload (i.e. idle guest) and the second was with a
 SpecJBB running in the guest.
 
 The idle guest case seemed to migrate fine...
 
 
 capabilities: xbzrle: off

 Migration status: completed
 transferred ram: 3811345 kbytes
 remaining ram: 0 kbytes
 total ram: 134226368 kbytes
 total time: 199743 milliseconds
 
 
 In the case of the SpecJBB I ran into issues during stage 3...the

 source host's qemu and the guest hung. I need to debug this
 more... (if  already have some hints pl. let me know.).
 
 
 capabilities: xbzrle: off

 Migration status: active
 transferred ram: 127618578 kbytes
 remaining ram: 2386832 kbytes
 total ram: 134226368 kbytes
 total time: 526139 milliseconds
 (qemu) qemu_savevm_state_complete called
 qemu_savevm_state_complete calling ram_save_complete
  
 ---  hung somewhere after this ('need to get more info).
 
 



Appears to be some race condition...as there are cases when it hangs
and in some cases it succeeds.

Weird guess, try to use less vcpus, same ram.


Ok..will try that.

The way that we stop cpus is _hacky_, to say the least.  Will try to think 
about that part.

Ok.

Thanks for the testing.  All my testing has been done with 8GB guests and 
2 vcpus.  Will try with more vcpus to see if it makes a difference.





(qemu) info migrate
capabilities: xbzrle: off
Migration status: completed
transferred ram: 129937687 kbytes
remaining ram: 0 kbytes
total ram: 134226368 kbytes
total time: 543228 milliseconds

Humm, _that_ is more strange.  This means that it finished.


There are cases where the migration is finishing just fine... even with 
larger guest configurations (256G/20VCPUs).




  Could you run qemu under gdb and sent me the stack traces?

I don't know your gdb thread kung-fu, so here are the instructions just in case:

gdb --args <exact qemu command line you used>
C-c to break when it hangs
(gdb) info threads   # you see all the threads running
(gdb) thread 1       # or whatever other number
(gdb) bt             # the backtrace of that thread


The hang is intermittent...
I ran it 4-5 times (under gdb) just now and I didn't see the issue :-(



I am specially interested in the backtrace of the migration thread and of the 
iothread.


Will keep re-trying with different configs. and see if i get lucky in 
reproducing it (under gdb).


Vinod


Thanks, Juan.


Need to review/debug...

Vinod



 ---
 
 As with the non-migration-thread version the Specjbb workload

 completed before the migration attempted to move to stage 3 (i.e.
 didn't converge while the workload was still active).
 
 BTW, with this version of the bits (i.e. while running SpecJBB

 which is supposed to dirty quite a bit of memory) I noticed that
 there wasn't much change in the b/w usage of the dedicated 10Gb
 private network link (It was still  ~1.5-3.0Gb/sec).   Expected
 this to be a little better since we have a separate thread...  not
 sure what else is in play here ? (numa locality of where the
 migration thread runs or something other basic tuning in the
 implementation ?)
 
 'have a hi-level design question... (perhaps folks have already

 thought about it..and categorized it as potential future
 optimization..?)
 
 Would it be possible to off load the iothread completely [from all

 migration related activity] and have one thread

Re: [Qemu-devel] Fwd: [RFC 00/27] Migration thread (WIP)

2012-07-26 Thread Chegu Vinod




 Original Message 
Subject:[Qemu-devel] [RFC 00/27] Migration thread (WIP)
Date:   Tue, 24 Jul 2012 20:36:25 +0200
From:   Juan Quintela quint...@redhat.com
To: qemu-devel@nongnu.org



Hi

This series are on top of the migration-next-v5 series just posted.

First of all, this is an RFC/Work in progress.  Just a lot of people
asked for it, and I would like review of the design.

Hello,

Thanks for sharing this early/WIP version for evaluation.

Still in the middle of code review... but wanted to share a couple of 
quick observations. I tried to use it to migrate a 128G/10VCPU guest 
(speed set to 10G and downtime 2s): once with no workload (i.e. idle 
guest) and a second time with SpecJBB running in the guest.


The idle guest case seemed to migrate fine...


capabilities: xbzrle: off
Migration status: completed
transferred ram: 3811345 kbytes
remaining ram: 0 kbytes
total ram: 134226368 kbytes
total time: 199743 milliseconds


In the case of SpecJBB I ran into issues during stage 3... the source 
host's qemu and the guest hung. I need to debug this more (if you 
already have some hints, please let me know).



capabilities: xbzrle: off
Migration status: active
transferred ram: 127618578 kbytes
remaining ram: 2386832 kbytes
total ram: 134226368 kbytes
total time: 526139 milliseconds
(qemu) qemu_savevm_state_complete called
qemu_savevm_state_complete calling ram_save_complete

---  hung somewhere after this ('need to get more info).


---

As with the non-migration-thread version, the SpecJBB workload completed 
before the migration attempted to move to stage 3 (i.e. it didn't 
converge while the workload was still active).


BTW, with this version of the bits (i.e. while running SpecJBB, which is 
supposed to dirty quite a bit of memory) I noticed that there wasn't 
much change in the b/w usage of the dedicated 10Gb private network link 
(it was still ~1.5-3.0Gb/sec).  Expected this to be a little better 
since we have a separate thread... not sure what else is in play here? 
(numa locality of where the migration thread runs, or some other basic 
tuning in the implementation?)


I have a high-level design question... (perhaps folks have already 
thought about it and categorized it as a potential future optimization?)

Would it be possible to off-load the iothread completely [from all 
migration-related activity] and have one thread (with the appropriate 
protection) get involved with getting the list of the dirty pages? Have 
one or more threads dedicated to trying to push multiple streams of 
data to saturate the allocated network bandwidth?  This may help in 
large + busy guests. Comments? There are perhaps other implications 
of doing all of this (like burning more host cpu cycles), but perhaps 
this can be configurable based on the user's needs... e.g. fewer but 
large guests on a host with no over-subscription.


Thanks
Vinod



It does:
- get a new bitmap for migration, and that bitmap uses 1 bit per page
- it unfolds migration_buffered_file.  Only one user existed.
- it simplifies buffered_file a lot.

- About the migration thread, special attention was given to trying to
   get the series review-able (reviewers will tell me if I got it).

Basic design:
- we create a new thread instead of a timer function
- we move all the migration work to that thread (but run everything
   except the waits with the iothread lock)
- we move all the writing to outside the iothread lock.  i.e.
   we walk the state with the iothread lock held, and copy everything to one buffer.
   then we write that buffer to the sockets outside the iothread lock.
- once here, we move to writing synchronously to the sockets.
- this allows us to simplify quite a lot.

And basically, that is it.  Notice that we still do the iterate page
walking with the iothread lock held.  Light testing shows that we get
similar speed and latencies to the version without the thread (notice
that almost no optimizations have been done here yet).
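The copy-under-lock / write-outside-lock step described above can be sketched like this (hypothetical names throughout; `serialize_dirty_pages` is a stand-in, not a qemu function):

```python
import threading

iothread_lock = threading.Lock()


def serialize_dirty_pages(state) -> bytes:
    # Hypothetical stand-in: pack the dirty pages into one buffer.
    return b"".join(state)


def migration_iteration(state, out) -> None:
    """Walk the guest state and copy it into a buffer while holding the
    iothread lock, then perform the blocking write with the lock
    released, so vcpus are not stalled behind slow socket I/O."""
    with iothread_lock:
        buf = serialize_dirty_pages(state)  # state walk under the lock
    out.write(buf)  # synchronous write happens outside the lock
```

The point of the pattern is that the only work done under the lock is the in-memory copy; the potentially slow, now-synchronous socket write never holds up the iothread.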

Apart from the review:
- Are there any locking issues that I have missed (I guess so)?
- stop all cpus correctly.  vm_stop should be called from the iothread;
   I use the trick of using a bottom half to get that working correctly,
   but this _implementation_ is ugly as hell.  Is there an easy way
   of doing it?
- Do I really have to export last_ram_offset()?  Is there no other way
   of knowing the amount of RAM?

Known issues:

- for some reason, when it has to start a 2nd round of bitmap
   handling, it decides to dirty all pages.  Haven't found still why
   this happens.

If you can test it and tell me where it breaks, that would also help.

Work is based on Umesh's thread work, and on work that Paolo Bonzini had
done on top of that.  All the migration thread code was done from scratch
because I was unable to debug why it was failing, but it owes a lot
to the previous design.

Thanks in advance, Juan.

The following changes since commit a21143486b9c6d7a50b7b62877c02b3c686943cb:

   Merge remote-tracking branch 

Re: [Qemu-devel] Fwd: [RFC 00/27] Migration thread (WIP)

2012-07-26 Thread Chegu Vinod

On 7/26/2012 11:41 AM, Chegu Vinod wrote:




 Original Message 
Subject:[Qemu-devel] [RFC 00/27] Migration thread (WIP)
Date:   Tue, 24 Jul 2012 20:36:25 +0200
From:   Juan Quintela quint...@redhat.com
To: qemu-devel@nongnu.org



Hi

This series are on top of the migration-next-v5 series just posted.

First of all, this is an RFC/Work in progress.  Just a lot of people
asked for it, and I would like review of the design.

Hello,

Thanks for sharing this early/WIP version for evaluation.

Still in the middle of  code review..but wanted to share a couple of 
quick  observations.
'tried to use it to migrate a 128G/10VCPU guest (speed set to 10G and 
downtime 2s).
Once with no workload (i.e. idle guest) and the second was with a 
SpecJBB running in the guest.


The idle guest case seemed to migrate fine...


capabilities: xbzrle: off
Migration status: completed
transferred ram: 3811345 kbytes
remaining ram: 0 kbytes
total ram: 134226368 kbytes
total time: 199743 milliseconds


In the case of the SpecJBB I ran into issues during stage 3...the 
source host's qemu and the guest hung. I need to debug this more... 
(if  already have some hints pl. let me know.).



capabilities: xbzrle: off
Migration status: active
transferred ram: 127618578 kbytes
remaining ram: 2386832 kbytes
total ram: 134226368 kbytes
total time: 526139 milliseconds
(qemu) qemu_savevm_state_complete called
qemu_savevm_state_complete calling ram_save_complete

---  hung somewhere after this ('need to get more info).




Appears to be some race condition...as there are cases when it hangs and 
in some cases it succeeds.


(qemu) info migrate
capabilities: xbzrle: off
Migration status: completed
transferred ram: 129937687 kbytes
remaining ram: 0 kbytes
total ram: 134226368 kbytes
total time: 543228 milliseconds

Need to review/debug...

Vinod





[Qemu-devel] [PATCH v4] Fixes related to processing of qemu's -numa option

2012-07-16 Thread Chegu Vinod
Changes since v3:
   - using bitmap_set() instead of set_bit() in the numa_add() routine.
   - removed the call to bitmap_zero() since bitmap_new() also zeroes the bitmap.
   - Rebased to the latest qemu.

Changes since v2:
   - Using unsigned long * for the node_cpumask[].
   - Use bitmap_new() instead of g_malloc0() for allocation.
   - Don't rely on max_cpus since it may not be initialized
 before the numa-related qemu options are parsed & processed.

Note: Continuing to use a new constant for allocation of
  the mask (This constant is currently set to 255 since
  with an 8bit APIC ID VCPUs can range from 0-254 in a
  guest. The APIC ID 255 (0xFF) is reserved for broadcast).

Changes since v1:

   - Use bitmap functions that are already in qemu (instead
 of cpu_set_t macro's from sched.h)
   - Added a check for endvalue = max_cpus.
   - Fix to address the round-robin assignment when
 cpus are not explicitly specified.
---

v1:
--

The -numa option to qemu is used to create [fake] numa nodes
and expose them to the guest OS instance.

There are a couple of issues with the -numa option:

a) The max VCPUs that can be specified for a guest while using
   qemu's -numa option is 64. Due to a typecasting issue,
   when the number of VCPUs is > 32 the VCPUs don't show up
   under the specified [fake] numa nodes.

b) KVM currently has support for 160 VCPUs per guest. The
   qemu -numa option only has support for up to 64 VCPUs
   per guest.
This patch addresses these two issues.
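The 32-VCPU cutoff in (a) is the classic shift-overflow on a 32-bit value. A sketch of the symptom and of the wide-bitmap fix (illustrative Python, not the qemu code; qemu's fix uses bitmap_set() on an unsigned long array):

```python
def node_mask_buggy(cpus):
    """Pre-fix behaviour sketch: building the node's cpu mask in a 32-bit
    quantity silently drops VCPUs >= 32 (C's `1 << cpu` on a 32-bit int)."""
    mask = 0
    for cpu in cpus:
        mask |= (1 << cpu) & 0xFFFFFFFF  # 32-bit truncation loses high cpus
    return mask


def node_mask_fixed(cpus, max_cpus=255):
    """Post-fix sketch: an arbitrarily wide bitmap sized for up to 255
    VCPUs (APIC ID 0xFF is reserved for broadcast), so any valid VCPU
    index is representable."""
    mask = 0
    for cpu in cpus:
        assert cpu < max_cpus
        mask |= 1 << cpu
    return mask
```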

Below are examples of (a) and (b)

a) 32 VCPUs are specified with the -numa option:

/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
71:01:01 \
-net tap,ifname=tap0,script=no,downscript=no \
-vnc :4

...
Upstream qemu :
--

QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
6 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 32 33 34 35 36 37 38 39 40 41
node 0 size: 131072 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51
node 1 size: 131072 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59
node 2 size: 131072 MB
node 3 cpus: 30
node 3 size: 131072 MB
node 4 cpus:
node 4 size: 131072 MB
node 5 cpus: 31
node 5 size: 131072 MB

With the patch applied :
---

QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
6 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 131072 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 131072 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29
node 2 size: 131072 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39
node 3 size: 131072 MB
node 4 cpus: 40 41 42 43 44 45 46 47 48 49
node 4 size: 131072 MB
node 5 cpus: 50 51 52 53 54 55 56 57 58 59
node 5 size: 131072 MB

b) 64 VCPUs specified with -numa option:

/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
-cpu 
Westmere,+rdtscp,+pdpe1gb,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+d-vnc
 :4

...

Upstream qemu :
--

only 63 CPUs in NUMA mode supported.
only 64 CPUs in NUMA mode supported.
QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
8 nodes
node 0 cpus: 6 7 8 9 38 39 40 41 70 71 72 73
node 0 size: 65536 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51 74 75 
76 77 78 79
node 1 size: 65536 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59 60 61
node 2 size: 65536 MB
node 3 cpus: 30 62
node 3 size: 65536 MB
node 4 cpus:
node 4 size: 65536 MB
node 5 cpus:
node 5 size: 65536 MB
node 6 cpus: 31 63
node 6 size: 65536 MB
node 7 cpus: 0 1 2 3 4 5 32 33 34 35 36 37 64 65 66 67 68 69
node 7 size: 65536 MB

With the patch applied :
---

QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
8 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 65536 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 65536 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29
node 2 size: 65536 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39
node 3 size: 65536 MB
node 4 cpus: 40 41 42 43 44 45 46 47 48 49
node 4 size: 65536 MB
node 5 cpus: 50 51 52 53 54 55 56 57 58 59
node 5 size: 65536 MB
node 6 cpus: 60 61 62 63 64 65 66 67 68 69
node 6 size: 65536 MB
node 7 cpus: 70 71 72 73 74 75 76 77 78 79

Signed-off-by: Chegu Vinod chegu_vi...@hp.com, Jim Hull jim.h...@hp.com, 
Craig Hada craig.h...@hp.com
---
 cpus.c   |3 ++-
 hw/pc.c  |3 ++-
 sysemu.h |3 ++-
 vl.c |   43 +--
 4 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/cpus.c b/cpus.c
index b182b3d..acccd08 100644
--- a/cpus.c
+++ b/cpus.c
@@ -36,6 +36,7 @@
 #include "cpus.h"
 #include "qtest.h"
 #include "main-loop.h"
+#include "bitmap.h"
 
 #ifndef _WIN32
 #include "compatfd.h"
@@ -1145,7 +1146,7 @@ void set_numa_modes(void)
 
 for (env = first_cpu; env != NULL; env = env->next_cpu) {
 for (i = 0; i < nb_numa_nodes; i

[Qemu-devel] [PATCH v3] Fixes related to processing of qemu's -numa option

2012-07-05 Thread Chegu Vinod
Signed-off-by: Chegu Vinod chegu_vi...@hp.com, Jim Hull jim.h...@hp.com, 
Craig Hada craig.h...@hp.com
Tested-by: Eduardo Habkost ehabkost at redhat.com
---
 cpus.c   |3 ++-
 hw/pc.c  |3 ++-
 sysemu.h |3 ++-
 vl.c |   48 ++--
 4 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/cpus.c b/cpus.c
index b182b3d..acccd08 100644
--- a/cpus.c
+++ b/cpus.c
@@ -36,6 +36,7 @@
 #include "cpus.h"
 #include "qtest.h"
 #include "main-loop.h"
+#include "bitmap.h"
 
 #ifndef _WIN32
 #include "compatfd.h"
@@ -1145,7 +1146,7 @@ void set_numa_modes(void)
 
 for (env = first_cpu; env != NULL; env = env->next_cpu) {
 for (i = 0; i < nb_numa_nodes; i++) {
-if (node_cpumask[i] & (1 << env->cpu_index)) {
+if (test_bit(env->cpu_index, node_cpumask[i])) {
 env->numa_node = i;
 }
 }
diff --git a/hw/pc.c b/hw/pc.c
index c7e9ab3..2edcc07 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -48,6 +48,7 @@
 #include "memory.h"
 #include "exec-memory.h"
 #include "arch_init.h"
+#include "bitmap.h"
 
 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
@@ -639,7 +640,7 @@ static void *bochs_bios_init(void)
 numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes);
 for (i = 0; i < max_cpus; i++) {
 for (j = 0; j < nb_numa_nodes; j++) {
-if (node_cpumask[j] & (1 << i)) {
+if (test_bit(i, node_cpumask[j])) {
 numa_fw_cfg[i + 1] = cpu_to_le64(j);
 break;
 }
diff --git a/sysemu.h b/sysemu.h
index bc2c788..2ce63fc 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -133,9 +133,10 @@ extern uint8_t qemu_extra_params_fw[2];
 extern QEMUClock *rtc_clock;
 
 #define MAX_NODES 64
+#define MAX_CPUMASK_BITS 255
 extern int nb_numa_nodes;
 extern uint64_t node_mem[MAX_NODES];
-extern uint64_t node_cpumask[MAX_NODES];
+extern unsigned long *node_cpumask[MAX_NODES];
 
 #define MAX_OPTION_ROMS 16
 typedef struct QEMUOptionRom {
diff --git a/vl.c b/vl.c
index 1329c30..fdd7b74 100644
--- a/vl.c
+++ b/vl.c
@@ -28,6 +28,7 @@
 #include <errno.h>
 #include <sys/time.h>
 #include <zlib.h>
+#include "bitmap.h"
 
 /* Needed early for CONFIG_BSD etc. */
 #include "config-host.h"
@@ -240,7 +241,7 @@ QTAILQ_HEAD(, FWBootEntry) fw_boot_order = 
QTAILQ_HEAD_INITIALIZER(fw_boot_order
 
 int nb_numa_nodes;
 uint64_t node_mem[MAX_NODES];
-uint64_t node_cpumask[MAX_NODES];
+unsigned long *node_cpumask[MAX_NODES];
 
 uint8_t qemu_uuid[16];
 
@@ -950,6 +951,9 @@ static void numa_add(const char *optarg)
 char *endptr;
 unsigned long long value, endvalue;
 int nodenr;
+int i;
+
+value = endvalue = 0ULL;
 
 optarg = get_opt_name(option, 128, optarg, ',') + 1;
 if (!strcmp(option, "node")) {
@@ -970,27 +974,25 @@ static void numa_add(const char *optarg)
 }
 node_mem[nodenr] = sval;
 }
-if (get_param_value(option, 128, "cpus", optarg) == 0) {
-node_cpumask[nodenr] = 0;
-} else {
+if (get_param_value(option, 128, "cpus", optarg) != 0) {
 value = strtoull(option, &endptr, 10);
-if (value >= 64) {
-value = 63;
-fprintf(stderr, "only 64 CPUs in NUMA mode supported.\n");
+if (*endptr == '-') {
+endvalue = strtoull(endptr+1, &endptr, 10);
 } else {
-if (*endptr == '-') {
-endvalue = strtoull(endptr+1, &endptr, 10);
-if (endvalue >= 63) {
-endvalue = 62;
-fprintf(stderr,
-"only 63 CPUs in NUMA mode supported.\n");
-}
-value = (2ULL << endvalue) - (1ULL << value);
-} else {
-value = 1ULL << value;
-}
+endvalue = value;
+}
+
+
+if (!(endvalue < MAX_CPUMASK_BITS)) {
+endvalue = MAX_CPUMASK_BITS - 1;
+fprintf(stderr,
+"A max of %d CPUs are supported in a guest\n",
+ MAX_CPUMASK_BITS);
+}
+
+for (i = value; i <= endvalue; ++i) {
+set_bit(i, node_cpumask[nodenr]);
 }
-node_cpumask[nodenr] = value;
 }
 nb_numa_nodes++;
 }
@@ -2331,7 +2333,8 @@ int main(int argc, char **argv, char **envp)
 
 for (i = 0; i  MAX_NODES; i++) {
 node_mem[i] = 0;
-node_cpumask[i] = 0;
+node_cpumask[i] = bitmap_new(MAX_CPUMASK_BITS);
+bitmap_zero(node_cpumask[i], MAX_CPUMASK_BITS);
 }
 
 nb_numa_nodes = 0;
@@ -3469,8 +3472,9 @@ int main(int argc, char **argv, char **envp)
 }
 
 for (i = 0; i < nb_numa_nodes; i++) {
-if (node_cpumask[i] != 0)
+if (!bitmap_empty(node_cpumask[i], MAX_CPUMASK_BITS)) {
 break;
+}
 }
 /* assigning the VCPUs round-robin is easier to implement
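The new cpus=<a>-<b> parsing in numa_add() can be illustrated with the standalone sketch below. Variable names mirror the patch (value, endvalue, MAX_CPUMASK_BITS), but parse_cpu_range() is a hypothetical helper written here for clarity, not the actual vl.c code; the real function also handles the node/mem options and sets bits in the node's bitmap.

```c
#include <stdlib.h>

#define MAX_CPUMASK_BITS 255

/* Parse "value[-endvalue]" as in "-numa node,cpus=10-19".  A bare number
 * ("cpus=7") means a single-CPU range.  Returns 1 if the upper bound had
 * to be clamped (the caller would print the warning), else 0. */
static int parse_cpu_range(const char *option,
                           unsigned long long *value,
                           unsigned long long *endvalue)
{
    char *endptr;

    *value = strtoull(option, &endptr, 10);
    if (*endptr == '-') {
        *endvalue = strtoull(endptr + 1, &endptr, 10);
    } else {
        *endvalue = *value;                  /* single CPU */
    }
    if (!(*endvalue < MAX_CPUMASK_BITS)) {
        *endvalue = MAX_CPUMASK_BITS - 1;    /* clamp, as the patch does */
        return 1;
    }
    return 0;
}
```

Note the design choice the review discussion turned on: the old code folded the range into a 64-bit mask with shifts (hence the 64-CPU ceiling), while the patch iterates i from value to endvalue and sets one bit per CPU, so the only limit is the bitmap size.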

[Qemu-devel] [PATCH v2] Fixes related to processing of qemu's -numa option

2012-06-19 Thread Chegu Vinod
From: root r...@hydra11.kio

Changes since v1:


 - Use bitmap functions that are already in qemu (instead
   of cpu_set_t macros)
 - Added a check for endvalue >= max_cpus.
 - Fix to address the round-robin assignment (for the case
   when cpus are not explicitly specified)

Note: Continuing to use a new constant for
  allocation of the cpumask (max_cpus was
  not getting set early enough).

---

v1:
--

The -numa option to qemu is used to create [fake] numa nodes
and expose them to the guest OS instance.

There are a couple of issues with the -numa option:

a) Max VCPUs that can be specified for a guest while using
   qemu's -numa option is 64. Due to a typecasting issue,
   when the number of VCPUs is > 32 the VCPUs don't show up
   under the specified [fake] numa nodes.

b) KVM currently has support for 160 VCPUs per guest. The
   qemu -numa option only has support for up to 64 VCPUs
   per guest.

This patch addresses these two issues.

Below are examples of (a) and (b)

a) 32 VCPUs are specified with the -numa option:

/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
71:01:01 \
-net tap,ifname=tap0,script=no,downscript=no \
-vnc :4

...

Upstream qemu :
--

QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
6 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 32 33 34 35 36 37 38 39 40 41
node 0 size: 131072 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51
node 1 size: 131072 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59
node 2 size: 131072 MB
node 3 cpus: 30
node 3 size: 131072 MB
node 4 cpus:
node 4 size: 131072 MB
node 5 cpus: 31
node 5 size: 131072 MB

With the patch applied :
---

QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
6 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 131072 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 131072 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29
node 2 size: 131072 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39
node 3 size: 131072 MB
node 4 cpus: 40 41 42 43 44 45 46 47 48 49
node 4 size: 131072 MB
node 5 cpus: 50 51 52 53 54 55 56 57 58 59
node 5 size: 131072 MB

b) 64 VCPUs specified with -numa option:

/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
-cpu 
Westmere,+rdtscp,+pdpe1gb,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
 \
-smp sockets=8,cores=10,threads=1 \
-numa node,nodeid=0,cpus=0-9,mem=64g \
-numa node,nodeid=1,cpus=10-19,mem=64g \
-numa node,nodeid=2,cpus=20-29,mem=64g \
-numa node,nodeid=3,cpus=30-39,mem=64g \
-numa node,nodeid=4,cpus=40-49,mem=64g \
-numa node,nodeid=5,cpus=50-59,mem=64g \
-numa node,nodeid=6,cpus=60-69,mem=64g \
-numa node,nodeid=7,cpus=70-79,mem=64g \
-m 524288 \
-name vm1 \
-chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \
-drive 
file=/dev/libvirt_lvm/vm.img,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native
 \
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 \
-monitor stdio \
-net nic,macaddr=52:54:00:71:01:01 \
-net tap,ifname=tap0,script=no,downscript=no \
-vnc :4

...

Upstream qemu :
--

only 63 CPUs in NUMA mode supported.
only 64 CPUs in NUMA mode supported.
QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
8 nodes
node 0 cpus: 6 7 8 9 38 39 40 41 70 71 72 73
node 0 size: 65536 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51 74 75 
76 77 78 79
node 1 size: 65536 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59 60 61
node 2 size: 65536 MB
node 3 cpus: 30 62
node 3 size: 65536 MB
node 4 cpus:
node 4 size: 65536 MB
node 5 cpus:
node 5 size: 65536 MB
node 6 cpus: 31 63
node 6 size: 65536 MB
node 7 cpus: 0 1 2 3 4 5 32 33 34 35 36 37 64 65 66 67 68 69
node 7 size: 65536 MB

With the patch applied :
---

QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
8 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 65536 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 65536 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29
node 2 size: 65536 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39
node 3 size: 65536 MB
node 4 cpus: 40 41 42 43 44 45 46 47 48 49
node 4 size: 65536 MB
node 5 cpus: 50 51 52 53 54 55 56 57 58 59
node 5 size: 65536 MB
node 6 cpus: 60 61 62 63 64 65 66 67 68 69
node 6 size: 65536 MB
node 7 cpus: 70 71 72 73 74 75 76 77 78 79

Signed-off-by: Chegu Vinod chegu_vi...@hp.com, Jim Hull jim.h...@hp.com, 
Craig Hada craig.h...@hp.com
---
 cpus.c   |3 ++-
 hw/pc.c  |4 +++-
 sysemu.h |3 ++-
 vl.c |   48 ++--
 4 files changed, 33 insertions(+), 25 deletions(-)

diff --git a/cpus.c b/cpus.c
index b182b3d..89ce04d 100644
--- a/cpus.c
+++ b/cpus.c
@@ -36,6 +36,7

Re: [Qemu-devel] [PATCH] Fixes related to processing of qemu's -numa option

2012-06-18 Thread Chegu Vinod

On 6/18/2012 1:29 PM, Eduardo Habkost wrote:

On Sun, Jun 17, 2012 at 01:12:31PM -0700, Chegu Vinod wrote:

The -numa option to qemu is used to create [fake] numa nodes
and expose them to the guest OS instance.

There are a couple of issues with the -numa option:

a) Max VCPUs that can be specified for a guest while using
qemu's -numa option is 64. Due to a typecasting issue,
when the number of VCPUs is > 32 the VCPUs don't show up
under the specified [fake] numa nodes.

b) KVM currently has support for 160 VCPUs per guest. The
qemu -numa option only has support for up to 64 VCPUs
per guest.

This patch addresses these two issues. [ Note: This
patch has been verified by Eduardo Habkost ].

I was going to add a Tested-by line, but this patch breaks the automatic
round-robin assignment when no CPU is added to any node on the
command-line.


Pl. see below.


Additional questions below:

[...]

diff --git a/cpus.c b/cpus.c
index b182b3d..f9cee60 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1145,7 +1145,7 @@ void set_numa_modes(void)
 
  for (env = first_cpu; env != NULL; env = env->next_cpu) {
  for (i = 0; i < nb_numa_nodes; i++) {
-if (node_cpumask[i] & (1 << env->cpu_index)) {
+if (CPU_ISSET_S(env->cpu_index, cpumask_size, node_cpumask[i])) {
  env->numa_node = i;
  }
  }
diff --git a/hw/pc.c b/hw/pc.c
index 8368701..f0c3665 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -639,7 +639,7 @@ static void *bochs_bios_init(void)
  numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes);
  for (i = 0; i < max_cpus; i++) {
  for (j = 0; j < nb_numa_nodes; j++) {
-if (node_cpumask[j] & (1 << i)) {
+if (CPU_ISSET_S(i, cpumask_size, node_cpumask[j])) {
  numa_fw_cfg[i + 1] = cpu_to_le64(j);
  break;
  }
diff --git a/sysemu.h b/sysemu.h
index bc2c788..6e4d342 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -9,6 +9,7 @@
  #include "qapi-types.h"
  #include "notify.h"
  #include "main-loop.h"
+#include <sched.h>

  /* vl.c */

@@ -133,9 +134,11 @@ extern uint8_t qemu_extra_params_fw[2];
  extern QEMUClock *rtc_clock;

  #define MAX_NODES 64
+#define KVM_MAX_VCPUS 254

Do we really need this constant? Why not just use max_cpus when
allocating the CPU sets, instead?


Hmm I had thought about this earlier too max_cpus was not getting 
set at the point where the allocations were being done. I have now moved 
that code to a bit later and switched to using

max_cpus.
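The dynamically sized cpu_set_t API being discussed here is a glibc extension (Linux-only, requires _GNU_SOURCE, which is also why the mingw32 portability question below is valid). The key constraint the review touches on is that a size must be fixed at CPU_ALLOC() time, and that same byte size must be passed to every later CPU_*_S() call. A minimal sketch, with alloc_cpumask() being an illustrative helper rather than anything in the patch:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>

/* Allocate and zero a dynamically sized CPU set big enough for `ncpus`
 * CPUs.  Stores in *size the byte size that every subsequent
 * CPU_SET_S()/CPU_ISSET_S()/CPU_ZERO_S() call on this set must be given
 * (the role cpumask_size plays in the patch). */
static cpu_set_t *alloc_cpumask(int ncpus, size_t *size)
{
    cpu_set_t *set = CPU_ALLOC(ncpus);
    *size = CPU_ALLOC_SIZE(ncpus);
    CPU_ZERO_S(*size, set);
    return set;
}
```

This is exactly why max_cpus not being set early enough was a problem: the allocation size has to be known before the -numa options are parsed, hence the fallback to a compile-time constant.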





  extern int nb_numa_nodes;
  extern uint64_t node_mem[MAX_NODES];
-extern uint64_t node_cpumask[MAX_NODES];
+extern cpu_set_t *node_cpumask[MAX_NODES];
+extern size_t cpumask_size;

  #define MAX_OPTION_ROMS 16
  typedef struct QEMUOptionRom {
diff --git a/vl.c b/vl.c
index 204d85b..1906412 100644
--- a/vl.c
+++ b/vl.c
@@ -28,6 +28,7 @@
  #include <errno.h>
  #include <sys/time.h>
  #include <zlib.h>
+#include <sched.h>

  /* Needed early for CONFIG_BSD etc. */
  #include "config-host.h"
@@ -240,7 +241,8 @@ QTAILQ_HEAD(, FWBootEntry) fw_boot_order = 
QTAILQ_HEAD_INITIALIZER(fw_boot_order

  int nb_numa_nodes;
  uint64_t node_mem[MAX_NODES];
-uint64_t node_cpumask[MAX_NODES];
+cpu_set_t *node_cpumask[MAX_NODES];
+size_t cpumask_size;

  uint8_t qemu_uuid[16];

@@ -950,6 +952,9 @@ static void numa_add(const char *optarg)
  char *endptr;
  unsigned long long value, endvalue;
  int nodenr;
+int i;
+
+value = endvalue = 0;

   optarg = get_opt_name(option, 128, optarg, ',') + 1;
   if (!strcmp(option, "node")) {
@@ -970,27 +975,17 @@ static void numa_add(const char *optarg)
  }
  node_mem[nodenr] = sval;
  }
-if (get_param_value(option, 128, "cpus", optarg) == 0) {
-node_cpumask[nodenr] = 0;
-} else {
+if (get_param_value(option, 128, "cpus", optarg) != 0) {
   value = strtoull(option, &endptr, 10);
-if (value >= 64) {
-value = 63;
-fprintf(stderr, "only 64 CPUs in NUMA mode supported.\n");
+if (*endptr == '-') {
+endvalue = strtoull(endptr+1, &endptr, 10);
   } else {
-if (*endptr == '-') {
-endvalue = strtoull(endptr+1, &endptr, 10);
-if (endvalue >= 63) {
-endvalue = 62;
-fprintf(stderr,
-"only 63 CPUs in NUMA mode supported.\n");
-}
-value = (2ULL << endvalue) - (1ULL << value);
-} else {
-value = 1ULL << value;
-}
+endvalue = value;
+}
+
+for (i = value; i <= endvalue; ++i) {
+CPU_SET_S(i, cpumask_size, node_cpumask[nodenr]);

What if endvalue is larger than the cpu mask size we allocated?


I can add a check (endvalue >= max_cpus) and print an error.
Should we force-set endvalue to max_cpus-1 in that case?

Re: [Qemu-devel] [PATCH] Fixes related to processing of qemu's -numa option

2012-06-18 Thread Chegu Vinod

On 6/18/2012 3:11 PM, Eric Blake wrote:

On 06/18/2012 04:05 PM, Andreas Färber wrote:

Am 17.06.2012 22:12, schrieb Chegu Vinod:

diff --git a/vl.c b/vl.c
index 204d85b..1906412 100644
--- a/vl.c
+++ b/vl.c
@@ -28,6 +28,7 @@
  #includeerrno.h
  #includesys/time.h
  #includezlib.h
+#includesched.h

Did you check whether this and the macros you're using are available on
POSIX and mingw32? vl.c is a pretty central file.

POSIX, yes.  mingw32, no.  Use of preprocessor conditionals is probably
in order.


Thanks. I will look into this.
Vinod



[Qemu-devel] [PATCH] Fixes related to processing of qemu's -numa option

2012-06-17 Thread Chegu Vinod
Signed-off-by: Chegu Vinod chegu_vi...@hp.com, Jim Hull jim.h...@hp.com, 
Craig Hada craig.h...@hp.com
---
 cpus.c   |2 +-
 hw/pc.c  |2 +-
 sysemu.h |5 -
 vl.c |   42 --
 4 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/cpus.c b/cpus.c
index b182b3d..f9cee60 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1145,7 +1145,7 @@ void set_numa_modes(void)
 
 for (env = first_cpu; env != NULL; env = env->next_cpu) {
 for (i = 0; i < nb_numa_nodes; i++) {
-if (node_cpumask[i] & (1 << env->cpu_index)) {
+if (CPU_ISSET_S(env->cpu_index, cpumask_size, node_cpumask[i])) {
 env->numa_node = i;
 }
 }
diff --git a/hw/pc.c b/hw/pc.c
index 8368701..f0c3665 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -639,7 +639,7 @@ static void *bochs_bios_init(void)
 numa_fw_cfg[0] = cpu_to_le64(nb_numa_nodes);
 for (i = 0; i < max_cpus; i++) {
 for (j = 0; j < nb_numa_nodes; j++) {
-if (node_cpumask[j] & (1 << i)) {
+if (CPU_ISSET_S(i, cpumask_size, node_cpumask[j])) {
 numa_fw_cfg[i + 1] = cpu_to_le64(j);
 break;
 }
diff --git a/sysemu.h b/sysemu.h
index bc2c788..6e4d342 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -9,6 +9,7 @@
 #include "qapi-types.h"
 #include "notify.h"
 #include "main-loop.h"
+#include <sched.h>
 
 /* vl.c */
 
@@ -133,9 +134,11 @@ extern uint8_t qemu_extra_params_fw[2];
 extern QEMUClock *rtc_clock;
 
 #define MAX_NODES 64
+#define KVM_MAX_VCPUS 254
 extern int nb_numa_nodes;
 extern uint64_t node_mem[MAX_NODES];
-extern uint64_t node_cpumask[MAX_NODES];
+extern cpu_set_t *node_cpumask[MAX_NODES];
+extern size_t cpumask_size;
 
 #define MAX_OPTION_ROMS 16
 typedef struct QEMUOptionRom {
diff --git a/vl.c b/vl.c
index 204d85b..1906412 100644
--- a/vl.c
+++ b/vl.c
@@ -28,6 +28,7 @@
 #include <errno.h>
 #include <sys/time.h>
 #include <zlib.h>
+#include <sched.h>
 
 /* Needed early for CONFIG_BSD etc. */
 #include "config-host.h"
@@ -240,7 +241,8 @@ QTAILQ_HEAD(, FWBootEntry) fw_boot_order = 
QTAILQ_HEAD_INITIALIZER(fw_boot_order
 
 int nb_numa_nodes;
 uint64_t node_mem[MAX_NODES];
-uint64_t node_cpumask[MAX_NODES];
+cpu_set_t *node_cpumask[MAX_NODES];
+size_t cpumask_size;
 
 uint8_t qemu_uuid[16];
 
@@ -950,6 +952,9 @@ static void numa_add(const char *optarg)
 char *endptr;
 unsigned long long value, endvalue;
 int nodenr;
+int i;
+
+value = endvalue = 0;
 
 optarg = get_opt_name(option, 128, optarg, ',') + 1;
 if (!strcmp(option, "node")) {
@@ -970,27 +975,17 @@ static void numa_add(const char *optarg)
 }
 node_mem[nodenr] = sval;
 }
-if (get_param_value(option, 128, "cpus", optarg) == 0) {
-node_cpumask[nodenr] = 0;
-} else {
+if (get_param_value(option, 128, "cpus", optarg) != 0) {
 value = strtoull(option, &endptr, 10);
-if (value >= 64) {
-value = 63;
-fprintf(stderr, "only 64 CPUs in NUMA mode supported.\n");
+if (*endptr == '-') {
+endvalue = strtoull(endptr+1, &endptr, 10);
 } else {
-if (*endptr == '-') {
-endvalue = strtoull(endptr+1, &endptr, 10);
-if (endvalue >= 63) {
-endvalue = 62;
-fprintf(stderr,
-"only 63 CPUs in NUMA mode supported.\n");
-}
-value = (2ULL << endvalue) - (1ULL << value);
-} else {
-value = 1ULL << value;
-}
+endvalue = value;
+}
+
+for (i = value; i <= endvalue; ++i) {
+CPU_SET_S(i, cpumask_size, node_cpumask[nodenr]);
 }
-node_cpumask[nodenr] = value;
 }
 nb_numa_nodes++;
 }
@@ -2331,7 +2326,9 @@ int main(int argc, char **argv, char **envp)
 
 for (i = 0; i  MAX_NODES; i++) {
 node_mem[i] = 0;
-node_cpumask[i] = 0;
+node_cpumask[i] = CPU_ALLOC(KVM_MAX_VCPUS);
+cpumask_size = CPU_ALLOC_SIZE(KVM_MAX_VCPUS);
+CPU_ZERO_S(cpumask_size, node_cpumask[i]);
 }
 
 nb_numa_nodes = 0;
@@ -3465,8 +3462,9 @@ int main(int argc, char **argv, char **envp)
 }
 
 for (i = 0; i < nb_numa_nodes; i++) {
-if (node_cpumask[i] != 0)
+if (node_cpumask[i] != NULL) {
 break;
+}
 }
 /* assigning the VCPUs round-robin is easier to implement, guest OSes
  * must cope with this anyway, because there are BIOSes out there in
@@ -3474,7 +3472,7 @@ int main(int argc, char **argv, char **envp)
  */
 if (i == nb_numa_nodes) {
 for (i = 0; i < max_cpus; i++) {
-node_cpumask[i % nb_numa_nodes] |= 1 << i;
+CPU_SET_S(i, cpumask_size, node_cpumask[i % nb_numa_nodes
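The round-robin fallback in the hunk above (used when no cpus= lists are given) can be sketched as a standalone function. This is an illustration, not the patched vl.c code, which uses CPU_SET_S on cpu_set_t masks; a plain 64-bit mask is kept here for brevity. It also shows why the pre-patch `|= 1 << i` was the typecasting bug: `1 << i` is int arithmetic and overflows for i >= 32, which is exactly why VCPUs beyond 32 vanished from their nodes.

```c
#define MAX_NODES 64

/* Assign CPU i to node (i % nb_numa_nodes) by setting bit i in that
 * node's mask.  Using 1ULL keeps the shift well-defined up to 64 CPUs;
 * the int-width shift in the old code was not. */
static void assign_round_robin(int max_cpus, int nb_numa_nodes,
                               unsigned long long node_mask[])
{
    int i;
    for (i = 0; i < max_cpus; i++) {
        node_mask[i % nb_numa_nodes] |= 1ULL << i;
    }
}
```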

Re: [Qemu-devel] Large sized guest taking for ever to boot...

2012-06-12 Thread Chegu Vinod

On 6/8/2012 11:37 AM, Jan Kiszka wrote:

On 2012-06-08 20:20, Chegu Vinod wrote:

On 6/8/2012 11:08 AM, Jan Kiszka wrote:

[CC'ing qemu as this discusses its code base]

On 2012-06-08 19:57, Chegu Vinod wrote:

On 6/8/2012 10:42 AM, Alex Williamson wrote:

On Fri, 2012-06-08 at 10:10 -0700, Chegu Vinod wrote:

On 6/8/2012 9:46 AM, Alex Williamson wrote:

On Fri, 2012-06-08 at 16:29 +, Chegu Vinod wrote:

Hello,

I picked up a recent version of the qemu (1.0.92 with some fixes)
and tried it
on x86_64 server (with host and the guest running 3.4.1 kernel).

BTW, I observe the same thing if i were to use 1.1.50 version of the
qemu... not sure if this is really
related to qemu...


While trying to boot a large guest (80 vcpus + 512GB) I observed
that the guest took forever to boot up... ~1 hr or even more. [This wasn't
the case when I was using RHEL 6.x related bits.]

Was either case using device assignment?  Device assignment will map
and
pin each page of guest memory before startup, which can be a noticeable
pause on smallish (16GB) guests.  That should be linear scaling though
and if you're using qemu and not qemu-kvm, not related.  Thanks,

I don't have any device assignment at this point . Yes I am using qemu
(not qemu-kvm)...

Just to be safe, are you using --enable-kvm with qemu?

Yes...

Unless you are using current qemu.git master (where it is enabled by
default), --enable-kvm does not activate the in-kernel irqchip support
for you. Not sure if that can make such a huge difference, but it is a
variation from qemu-kvm. You can enable it in qemu-1.1 with -machine
kernel_irqchip=on.

Thanks for pointing this out...i will add that.

I was using qemu.git... not the master

With qemu.git master I meant the latest version you can pull from the
master branch or qemu.git. If you are using an older version, please
specify the hash.

BTW, you can check if irqchip is on by looking at the output of info
qtree in the qemu monitor. One of the last devices listed must be
called kvm-apic.



Sorry for the confusion.  I was using the qemu.git  and I do see the 
kvm-apic stuff  (in the info qtree output) without specifying the 
-machine kernel_irqchip=on option..








-

/etc/qemu-ifup tap0

/usr/local/bin/qemu-system-x86_64 -enable-kvm \

-cpu
Westmere,+rdtscp,+pdpe1gb,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+s

s,+acpi,+ds,+vme \
-m 524288 -smp 80,sockets=80,cores=1,threads=1 \
-name vm1 \
-chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait
\
-drive
file=/dev/libvirt_lvm/vm.img,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native
\
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
\
-monitor stdio \
-net nic,macaddr=52:54:00:71:01:01 \
-net tap,ifname=tap0,script=no,downscript=no \
-vnc :4

/etc/qemu-ifdown tap0




The issue seems very basic... I was earlier running RHEL6.3 RC1 on the
host and the guest, and both seemed to boot fine..

Note that RHEL is based on qemu-kvm.  Thanks,

Yep..knew that :)

I was using upstream qemu-kvm and was encouraged to move away from
it...to qemu.

And that is good. :)

Is the problem present in current qemu-kvm.git? If yes, can you bisect
when it was introduced?

Shall try out the qemu-kvm.git  (after finishing some experiments..).

Yes, please.


Just did that and below are the results...




BTW, another data point... if I try to boot the RHEL6.3 kernel in the
guest (with the latest qemu.git and 3.4.1 on the host) it boots just
fine

So something to do with the 3.4.1 kernel in the guest and the existing
udev... Need to dig
deeper.

Maybe. But lets focus on getting the problematic guest running fine in
some configuration. If that turns out to be impossible, we may see a
guest issue.

Host config. : 80 Cores + 1TB.Guest config : 40VCPUs + 512GB.

I rebuilt the 3.4.1 kernel in the guest from scratch and retried my 
experiments and measured

the boot times...

a) Host :  RHEL6.3 RC1 + qemu-kvm (that came with it)   Guest : RHEL6.3 
RC1:  ~1 min


b) Host :3.4.1 + qemu-kvm.git  Guest : RHEL6.3RC1  -   ~1 min

c) Host :3.4.1 + qemu-kvm.git  Guest : 3.4.1  -   ~10 mins

d) Host :3.4.1 + qemu.git  Guest : RHEL6.3 RC1  -~1 min

e) Host :3.4.1 + qemu.git  Guest : 3.4.1   -~14 mins


In both cases (c) & (e) there were quite a few warning/error messages from 
udevd.


FYI..
Vinod
PS:  Haven't had the time to do any further analysis...as the machine is 
being used for

other experiments...




Jan






Re: [Qemu-devel] Large sized guest taking for ever to boot...

2012-06-12 Thread Chegu Vinod

On 6/12/2012 8:39 AM, Gleb Natapov wrote:

On Tue, Jun 12, 2012 at 08:33:59AM -0700, Chegu Vinod wrote:

I rebuilt the 3.4.1 kernel in the guest from scratch and retried my
experiments and measured
the boot times...

a) Host :  RHEL6.3 RC1 + qemu-kvm (that came with it)   Guest :
RHEL6.3 RC1:  ~1 min

b) Host :3.4.1 + qemu-kvm.git  Guest : RHEL6.3RC1  -   ~1 min

c) Host :3.4.1 + qemu-kvm.git  Guest : 3.4.1  -   ~10 mins

d) Host :3.4.1 + qemu.git  Guest : RHEL6.3 RC1  -~1 min

e) Host :3.4.1 + qemu.git  Guest : 3.4.1   -~14 mins


In both cases (c) & (e) there were quite a few warning/error messages
from udevd.

FYI..
Vinod
PS:  Haven't had the time to do any further analysis...as the
machine is being used for
other experiments...



Next time you get the machine can you try running (b) but adding
-hypervisor to your -cpu config.


I tried that and didn't make any difference...i.e. the RHEL6.3 RC1 guest 
booted in about ~1 min.


Vinod


--
Gleb.






[Qemu-devel] Live Migration of a large guest : guest frozen on the destination host

2012-06-11 Thread Chegu Vinod

Hello,

I am having some issues trying to live migrate a large guest and would 
like to get some pointers on how to go about debugging this. Here is 
some info on the configuration.


_Hardware :_
Two DL980's  each with 80 Westmere cores + 1 TB of RAM. Using a 10G NIC 
private link

(back to back) between two DL980's
_
Host software used:_
Host 3.4.1 kernel
Qemu versions used :
  Case 1: upstream qemu  (1.1.50) - from qemu.git
  Case 2 : 1.0.92 +  Juan Quintela's huge_memory changes
_
Guest :
_40VCPUs + 512GB

_Guest software used:_
RHEL6.3 RC1  (had some basic boot issues with 3.4.1 kernel and udevd..)
Guest is booted off an FC LUN (visible to both the hosts).

[Note: 'am not using virsh/virt-manager etc. but just the qemu to start 
the guest and also interact with
the qemu monitor for live migration etc. Have set the migration speed to 
10G but haven't changed the

downtime (default :  30ms) ]


Tried to live migrate this large guest..using either of the qemu's (i.e. 
Case 1 or Case2) and observed

the following :

When this guest was idling, I was able to live migrate it and have the guest
come up fine on the other host. I was able to interact with the guest on the
destination host.

With workloads (e.g. AIM7-compute or SpecJBB or Google Stress App Test
(SAT)) running in the guest, if we tried to do live migration we observe
that [after a while] the source host claims that the live migration is
complete... but the guest on the destination host is often in a
frozen/hung state.. can't really interact with it or ping it. Still
trying to capture more information... but was also hoping to get some
clues/tips from the experts on these mailing lists...

[ BTW, is there a way to get a snapshot of the image of the guest on
the source host just before the downtime (i.e. start of stage 3), and
compare that with the image of the guest on the destination host just
before it is about to resume? Is such a debugging feature already
available? ]


Thanks
Vinod



Re: [Qemu-devel] Large sized guest taking for ever to boot...

2012-06-10 Thread Chegu Vinod

On 6/10/2012 2:30 AM, Gleb Natapov wrote:

On Fri, Jun 08, 2012 at 11:20:53AM -0700, Chegu Vinod wrote:

On 6/8/2012 11:08 AM, Jan Kiszka wrote:
BTW,  another data point ...if I try to boot a the RHEL6.3 kernel in
the guest (with the latest qemu.git and the 3.4.1 on the host) it
boots just fine

So something to do with the 3.4.1 kernel in the guest and the
existing udev... Need to dig
deeper.


How many CPUs do you have on the host? RHEL6.3 uses unfair spinlock
when it runs as a guest.


80 active cores on the host.

Vinod


--
Gleb.






Re: [Qemu-devel] [RFC 0/7] Fix migration with lots of memory

2012-06-10 Thread Chegu Vinod

Hello,

I did pick up these patches a while back and did run some migration tests while
running simple workloads in the guest. Below are some results.

FYI...
Vinod



Config Details:

Guest 10vcps, 60GB (running on a host that is 6cores(12threads) and 64GB).
The hosts are identical X86_64 Blade servers  are connected via a private
10G link (for migration traffic)

Guest was started using qemu (no virsh/virt-manager etc).
Migration was initiated at the qemu monitor prompt
and the migration_set_speed was used to set to 10G. No changes
to the downtime.

Software:
- Guest  the Host OS : 3.4.0-rc7+
- Vanilla : basic upstream qemu.git
- huge_memory changes(Juan's qemu.git tree)


[ Note: BTW, I did also try vers:11 of the XBZRLE patches... but ran into
issues (guest crashed after migration); I have reported it to the author]


Here are the simple workloads and results:

1) Idling guest
2) AIM7-compute (with 2000 users).
3) 10way parallel make (of the kernel)
4) 2 instances of memory r/w loop (exactly the same as in docs/xbzrle.txt)
5) SPECJbb2005


Note: In the Vanilla case I had instrumented ram_save_live()
to print out the total migration time and the MB's transferred.

1) Idling guest:

Vanilla :
Total Mig. time: 173016 ms
Total MB's transferred : 1606MB

huge_memory:
Total Mig. time:  48821 ms
Total MB's transferred : 1620 MB

2) AIM7-compute  (2000 users)

Vanilla :
Total Mig. time: 241124 ms
Total MB's transferred : 4827MB

huge_memory:
Total Mig. time: 66716 ms
Total MB's transferred : 4022MB


3) 10 way parallel make: (of the linux kernel)

Vanilla :
Total Mig. time: 104319 ms
Total MB's transferred : 2316MB

huge_memory:
Total Mig. time: 55105 ms
Total MB's transferred : 2995MB


4) 2 instances of Memory r/w loop: (refer to docs/xbzrle.txt)

Vanilla :
Total Mig. time: 112102 ms
Total MB's transferred : 1739MB

huge_memory:
Total Mig. time: 85504 ms
Total MB's transferred : 1745MB


5) SPECJbb :

Vanilla :
Total Mig. time: 162189 ms
Total MB's transferred : 5461MB

huge_memory:
Total Mig. time: 67787 ms
Total MB's transferred : 8528MB


[Expected] Observation :

Unlike the Vanilla case (and also the XBZRLE case), with these patches I was
still able to interact with the qemu monitor prompt and also interact with the
guest during the migration (i.e. during the iterative pre-copy phase).





--


On 5/22/2012 11:32 AM, Juan Quintela wrote:

Hi

After a long, long time, this is v2.

This are basically the changes that we have for RHEL, due to the
problems that we have with big memory machines.  I just rebased the
patches and fixed the easy parts:

- buffered_file_limit is gone: we just use 50ms and call it a day

- I let ram_addr_t as a valid type for a counter (no, I still don't
   agree with Anthony on this, but it is not important).

- Print total time of migration always.  Notice that I also print it
   when migration is completed.  Luiz, could you take a look to see if
   I did something wrong (probably)?

- Moved debug printfs to tracepoints.  Thanks a lot to Stefan for
   helping with it.  Once here, I had to put the traces in the middle
   of the trace-events file; if I put them at the end of the file, when I
   enabled them, the previous two tracepoints got generated instead
   of the ones I had just defined.  Stefan is looking into that.  The
   workaround is defining them anywhere else.

- exit from cpu_physical_memory_reset_dirty().  Anthony wanted me to
   create an empty stub for kvm, and maintain the code for tcg.  The
   problem is that we can have both kvm and tcg running from the same
   binary.  Instead of exiting in the middle of the function, I just
   refactored the code out.  Is there a struct where I could add a new
   function pointer for this behaviour?

- exit if we have been too long in the ram_save_live() loop.  Anthony
   didn't like this; I will send a version based on the migration
   thread in the following days.  But we just need something working for
   other people to test.

   Notice that I still get lots of more-than-50ms printfs.  (Yes,
   there is a debugging printf there.)

- Bitmap handling.  Still all code to count dirty pages, will try to
   get something saner based on bitmap optimizations.

Comments?

Later, Juan.




v1:
---

Executive Summary
-

This series of patches fixes migration with lots of memory.  With them, stalls
are removed, and max_downtime is honored.
I also add infrastructure to measure what is happening during migration
(#define DEBUG_MIGRATION and DEBUG_SAVEVM).

Migration is broken at the moment in the qemu tree; Michael's patch is needed to
fix virtio migration.  Measurements are given for the qemu-kvm tree.  At the end,
there are some measurements with the qemu tree.

Long Version with measurements (for those that like numbers O:-)
--

8 vCPUs and 64GB RAM, a RHEL5 guest that is completely idle

initial
---

savevm: save live iterate section id 3 name ram took 3266 milliseconds 46 
times

We have 46 stalls, and missed the 

Re: [Qemu-devel] Large sized guest taking for ever to boot...

2012-06-08 Thread Chegu Vinod

On 6/8/2012 11:08 AM, Jan Kiszka wrote:

[CC'ing qemu as this discusses its code base]

On 2012-06-08 19:57, Chegu Vinod wrote:

On 6/8/2012 10:42 AM, Alex Williamson wrote:

On Fri, 2012-06-08 at 10:10 -0700, Chegu Vinod wrote:

On 6/8/2012 9:46 AM, Alex Williamson wrote:

On Fri, 2012-06-08 at 16:29 +, Chegu Vinod wrote:

Hello,

I picked up a recent version of the qemu (1.0.92 with some fixes)
and tried it
on x86_64 server (with host and the guest running 3.4.1 kernel).

BTW, I observe the same thing if I use the 1.1.50 version of
qemu... not sure if this is really
related to qemu...


While trying to boot a large guest (80 vcpus + 512GB) I observed
that the guest
took forever to boot up...  ~1 hr or even more. [This wasn't the
case when I
was using RHEL 6.x related bits]

Was either case using device assignment?  Device assignment will map
and
pin each page of guest memory before startup, which can be a noticeable
pause on smallish (16GB) guests.  That should be linear scaling though
and if you're using qemu and not qemu-kvm, not related.  Thanks,

I don't have any device assignment at this point . Yes I am using qemu
(not qemu-kvm)...

Just to be safe, are you using --enable-kvm with qemu?

Yes...

Unless you are using current qemu.git master (where it is enabled by
default), --enable-kvm does not activate the in-kernel irqchip support
for you. Not sure if that can make such a huge difference, but it is a
variation from qemu-kvm. You can enable it in qemu-1.1 with -machine
kernel_irqchip=on.


Thanks for pointing this out...i will add that.

I was using qemu.git... not the master



-

/etc/qemu-ifup tap0

/usr/local/bin/qemu-system-x86_64 -enable-kvm \

-cpu Westmere,+rdtscp,+pdpe1gb,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme \
-m 524288 -smp 80,sockets=80,cores=1,threads=1 \
-name vm1 \
-chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait
\
-drive
file=/dev/libvirt_lvm/vm.img,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native
\
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
\
-monitor stdio \
-net nic,macaddr=52:54:00:71:01:01 \
-net tap,ifname=tap0,script=no,downscript=no \
-vnc :4

/etc/qemu-ifdown tap0




The issue seems very basic... I was earlier running RHEL6.3 RC1 on the
host and the guest, and the guest seemed to boot fine..

Note that RHEL is based on qemu-kvm.  Thanks,

Yep..knew that :)

I was using upstream qemu-kvm and was encouraged to move away from
it...to qemu.

And that is good. :)

Is the problem present in current qemu-kvm.git? If yes, can you bisect
when it was introduced?

Shall try out the qemu-kvm.git  (after finishing some experiments..).

BTW, another data point... if I try to boot the RHEL6.3 kernel in the
guest (with the latest qemu.git and 3.4.1 on the host) it boots just
fine.


So it is something to do with the 3.4.1 kernel in the guest and the existing
udev... Need to dig deeper.

Vinod


Jan






Re: [Qemu-devel] Fwd: [PATCH v2 00/41] postcopy live migration

2012-06-04 Thread Chegu Vinod

On 6/4/2012 6:13 AM, Isaku Yamahata wrote:

On Mon, Jun 04, 2012 at 05:01:30AM -0700, Chegu Vinod wrote:

Hello Isaku Yamahata,

Hi.


I just saw your patches. Would it be possible to email me a tar bundle of these
patches (it makes it easier to apply them to a copy of the upstream
qemu.git)?

I uploaded them to github for those who are interested in it.

git://github.com/yamahata/qemu.git qemu-postcopy-june-04-2012
git://github.com/yamahata/linux-umem.git  linux-umem-june-04-2012



Thanks for the pointer...

BTW, I am also curious whether you have considered using any kind of RDMA
features for optimizing the page faults during postcopy?

Yes, RDMA is an interesting topic. Can you share your use case/concerns/issues?



Looking at large-sized guests (256GB and higher) running cpu/memory
intensive enterprise workloads.
The concerns are the same... i.e. having a predictable total migration
time, minimal downtime/freeze-time and of course minimal service
degradation to the workload(s) in the VM or the co-located VMs...


How large of a guest have you tested your changes with and what kind of 
workloads have you used so far ?



Thus we can collaborate.
You may want to see Benoit's results.


Yes. 'have already seen some of Benoit's results.

Hence the question about use of RDMA techniques for post copy.


As far as I know, he has not published
his code yet.


Thanks
Vinod



thanks,


Thanks
Vinod



--

Message: 1
Date: Mon,  4 Jun 2012 18:57:02 +0900
From: Isaku Yamahata yamah...@valinux.co.jp
To: qemu-devel@nongnu.org, k...@vger.kernel.org
Cc: benoit.hud...@gmail.com, aarca...@redhat.com, aligu...@us.ibm.com,
quint...@redhat.com, stefa...@gmail.com, t.hirofu...@aist.go.jp,
dl...@redhat.com, satoshi.i...@aist.go.jp,  
mdr...@linux.vnet.ibm.com,
yoshikawa.tak...@oss.ntt.co.jp, owass...@redhat.com, a...@redhat.com,
pbonz...@redhat.com
Subject: [Qemu-devel] [PATCH v2 00/41] postcopy live migration
Message-ID: cover.1338802190.git.yamah...@valinux.co.jp

After a long time, we have v2. This is the qemu part.
The linux kernel part is sent separately.

Changes v1 -> v2:
- split up patches for review
- buffered file refactored
- many bug fixes
   Especially, PV drivers can now work with postcopy
- optimization/heuristic

Patches
1 - 30: refactoring existing code and preparation
31 - 37: implement postcopy itself (essential part)
38 - 41: some optimization/heuristic for postcopy

Intro
=
This patch series implements postcopy live migration.[1]
As discussed at KVM Forum 2011, a dedicated character device is used for
distributed shared memory between the migration source and destination.
Now we can discuss/benchmark/compare with precopy. I believe there is
much room for improvement.

[1] http://wiki.qemu.org/Features/PostCopyLiveMigration


Usage
=
You need to load the umem character device on the host before starting migration.
Postcopy can be used with the tcg and kvm accelerators. The implementation depends
only on the linux umem character device, but the driver-dependent code is split
into a file.
I tested only the host page size == guest page size case, but the implementation
allows the host page size != guest page size case.

The following options are added with this patch series.
- incoming part
   command line options
   -postcopy [-postcopy-flags <flags>]
   where flags is for changing behavior for benchmark/debugging
   Currently the following flags are available
   0: default
   1: enable touching page request

   example:
   qemu -postcopy -incoming tcp:0: -monitor stdio -machine accel=kvm

- outgoing part
   options for the migrate command
   migrate [-p [-n] [-m]] URI [prefault forward [prefault backward]]
   -p: indicate postcopy migration
   -n: disable background transferring of pages: this is for benchmark/debugging
   -m: move background transfer to postcopy mode
   prefault forward: the number of forward pages which are sent on-demand
   prefault backward: the number of backward pages which are sent
on-demand

   example:
   migrate -p -n tcp:dest ip address:
   migrate -p -n -m tcp:dest ip address: 32 0


TODO

- benchmark/evaluation. Especially how async page faults affect the result.
- improve/optimization
   At the moment, at least what I'm aware of is:
   - making the incoming socket non-blocking with a thread
     As page compression is coming, it is impractical to do a non-blocking read
     and check whether the necessary data has been read.
   - touching pages in the incoming qemu process by an fd handler seems
     suboptimal. Create a dedicated thread?
   - the outgoing handler seems suboptimal, causing latency.
- consider the FUSE/CUSE possibility
- don't fork umemd, but create a thread?

basic postcopy work flow

 qemu on the destination
   |
   V
 open(/dev/umem)
   |
   V
 UMEM_INIT