Re: [PATCH] hw/mem/nvdimm: fix error message for 'unarmed' flag

2022-06-17 Thread Xiao Guangrong
On Wed, Jun 15, 2022 at 7:49 PM David Hildenbrand  wrote:
>
> On 15.06.22 13:17, Xiao Guangrong wrote:
> > On Wed, Jun 15, 2022 at 4:24 PM David Hildenbrand 
wrote:
> >
> >>>> Is that a temporary or a permanent thing? Do we know?
> >>>
> >>> No idea. But his last signed-off was three years ago.
> >>
> >> I sent a patch to Xiao, asking if he's still active in QEMU. If I don't
>
> s/patch/mail/ :)
>
> >> get a reply this week, I'll move forward with proposing an update to
> >> MAINTAINERS as described.
> >>
> >
> > Okay, please do it.
> >
> > Sorry, I usually just roughly follow the qemu & kvm mailing lists,
> > and do not have enough time to actively review or contribute in these
> > areas. :-(
>
> Not an issue, thanks for that information and thanks for your work in
> the past on that!
>
> Should I keep you entered as a reviewer for the new section?

Okay, that is good for me! :)


Re: [PATCH] hw/mem/nvdimm: fix error message for 'unarmed' flag

2022-06-15 Thread Xiao Guangrong
On Wed, Jun 15, 2022 at 4:24 PM David Hildenbrand  wrote:

> >> Is that a temporary or a permanent thing? Do we know?
> >
> > No idea. But his last signed-off was three years ago.
>
> I sent a patch to Xiao, asking if he's still active in QEMU. If I don't
> get a reply this week, I'll move forward with proposing an update to
> MAINTAINERS as described.
>

Okay, please do it.

Sorry, I usually just roughly follow the qemu & kvm mailing lists,
and do not have enough time to actively review or contribute in these
areas. :-(



Re: [Qemu-devel] About making QEMU to LIBs!

2019-03-28 Thread Xiao Guangrong




On 3/27/19 5:41 PM, Stefano Garzarella wrote:

Hi Yang, Xiao,
Just adding few things, because I'm currently exploring the QEMU modules
in order to reduce the boot time and the footprint.



Hi Stefan Hajnoczi and Stefano Garzarella,

The work exploring the QEMU modules looks really good. We will
evaluate it.

Hmm... how about making these modules usable outside of QEMU? It seems
doable, right? But I'm not sure how much effort we would need.



The idea is to create new QEMU modules (like the block/audio/ui
drivers) in order to reduce the dependency of the QEMU executable on
shared libraries that take a lot of time to initialize, so that we can
postpone their initialization until we actually use them.

Using callgrind I'm profiling the loading of dynamic libraries to
understand which part of QEMU should be modularized to speed up the boot
phase when they are not used.

I found that libgnutls, libspice-server and libnuma initialization takes
most of the time. I disabled these libraries (--disable-numa --disable-gnutls
--disable-spice) to understand the maximum speed up and I gained 10 ms
during the boot:
 - qemu_init_end: 50.92 ms
 - qemu_init_end: 40.17 ms (without numa, gnutls and spice)

I would start to modularize spice, the library is used in the spice
drivers (spice-display, spice-chardev, spice-input, etc.) but also in
the QXL emulation.

Maybe we can synchronize our work :)


Great work, Stefano!

Does the benefit come from the fact that your work avoids loading some
libraries during exec()? So it depends on the size of the
shared library?




Re: [Qemu-devel] About making QEMU to LIBs!

2019-03-26 Thread Xiao Guangrong




On 3/26/19 5:07 PM, Paolo Bonzini wrote:

On 26/03/19 08:00, Xiao Guangrong wrote:

On 3/26/19 7:18 AM, Paolo Bonzini wrote:

Also, what is the use case?  Is it to reduce the attack surface without
having multiple QEMU binaries?


Security is one concern: only the essential and audited
modules/libraries can be loaded into memory.

Another is that lightweight virtualization is becoming more and more
important for running VM-based workloads in the public cloud. However,
QEMU is too huge and cumbersome to fit our requirements; e.g., QEMU-lite
has been under development for a long time but still has no way into
mainline, and Intel's NEMU is even more radical.


What is your definition of lightweight?  To some extent, moving devices
to separate dynamic libraries _adds_ weight for the mechanisms to load
those libraries dynamically.


Lightweight = less resource usage + reaching the user's self-defined workload quickly.

For the new software, these libraries can be statically linked into it.



Would separate QEMU binaries be a solution?  I think I am not as opposed


Separate QEMU binaries mean we need multiple different build environments
and additional package maintenance in production...


to a "q35-lite" machine type these days, I find it preferrable to share
the northbridge and southbridge with Q35 and just get rid of IDE, VGA,
IOAPIC, legacy ISA devices etc.  The chipset would stay the same as q35
so that we keep secure boot, share the code for ACPI stuff (hotplug),
and if the OS needs it we can also add back the RTC.


Well, that is a big step forward. :)

I'd think the most used components, such as virtio devices, should be
statically compiled; that is the minimal requirement for the lightweight
scenario. Other components can be libraries which are dlopen()ed at
runtime; that indeed slows booting traditional VMs down a little, but
that is acceptable...
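For illustration only, a minimal sketch of loading an emulation module lazily
with plain dlopen()/dlsym(); the module file name and entry symbol below are
made-up examples, not QEMU's real module interface:

#include <dlfcn.h>
#include <stdio.h>

/* Load a device-emulation module only when it is first needed.
 * "path" and "entry" are hypothetical, e.g. "qemu-device-foo.so"
 * and "device_foo_register". */
static int load_device_module(const char *path, const char *entry)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    void (*register_fn)(void);

    if (!handle) {
        fprintf(stderr, "cannot load %s: %s\n", path, dlerror());
        return -1;
    }
    register_fn = (void (*)(void))dlsym(handle, entry);
    if (!register_fn) {
        fprintf(stderr, "missing symbol %s in %s\n", entry, path);
        dlclose(handle);
        return -1;
    }
    register_fn();    /* let the module register its devices */
    return 0;
}

The statically linked, most used components keep working as today; only the
rarely used emulations pay the dlopen() cost, and only when a VM actually
asks for them.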



Re: [Qemu-devel] About making QEMU to LIBs!

2019-03-26 Thread Xiao Guangrong




On 3/26/19 7:18 AM, Paolo Bonzini wrote:

On 25/03/19 12:46, Yang Zhong wrote:

Hello all,

Rust-VMM has started to turn all features and common modules into crates, and
CSPs can deploy their VMM on demand. This afternoon, Xiao Guangrong and I
talked about lightweight VM solutions, and we thought QEMU could do the same
thing as Rust-VMM. Although we can make all modules or features in QEMU
configurable, making features and modules into libs is still valuable for
QEMU. Any comments are welcome! Thanks!


What features/modules were you thinking of?  Were you thinking of
turning them into dynamically loaded libraries, or spinning them out
into separate projects (such as libslirp)?


We are considering making QEMU's device emulation into dynamic libraries,
including virtio devices, IO controllers and even vCPU emulation plus some
hooks into it, and so on.



Also, what is the use case?  Is it to reduce the attack surface without
having multiple QEMU binaries?



Security is one concern: only the essential and audited modules/libraries
can be loaded into memory.

Another is that lightweight virtualization is becoming more and more important
for running VM-based workloads in the public cloud. However, QEMU is too huge
and cumbersome to fit our requirements; e.g., QEMU-lite has been under
development for a long time but still has no way into mainline, and Intel's
NEMU is even more radical.

That's why we are going to redesign the userspace VMM; making QEMU into
libraries is definitely good for us.


Having dynamically linked libraries would go even beyond what rust-vmm
does; Rust binaries in the end statically link every crate that they
use, rust-vmm's purpose is mostly to share code across VMMs and avoid
forks.  It may also have a startup time penalty which has to be considered.



Rust-VMM is good indeed, but it's not friendly enough for us.

QEMU IS A GOOD GUY, no reason to abandon it. :)

QEMU is excellent and we have been using it in production for many years; its
code has proven robust. Reusing QEMU's code helps the new software reach
production-level quality.

Furthermore, we still need QEMU for the traditional workloads. Maintaining two
totally different code bases is not easy for a cloud provider, and most
developers at cloud companies are really good at QEMU. So leveraging QEMU for
the new software is more convenient for us.

Thanks!



Re: [Qemu-devel] [PATCH v2 0/3] PCDIMM cleanup

2019-02-20 Thread Xiao Guangrong




On 2/20/19 8:51 AM, Wei Yang wrote:

Three trivial cleanups for pc-dimm.

Patch [1] remove the check on class->hotpluggable since pc-dimm is always
hotpluggable.
Patch [2] remove nvdimm_realize
Patch [3] remove pcdimm realize-callback

v2:
   * fix warning in Patch 1
   * split Patch 2 into two

Wei Yang (3):
   pc-dimm: remove check on pc-dimm hotpluggable
   mem/nvdimm: remove nvdimm_realize



   pc-dimm: revert "introduce realize callback"


I think the word 'revert' is not quite precise, as it hints that
the commit was buggy; instead, it was refactored by later
commits and has become useless now.

Anyway, this patchset looks good to me.

Reviewed-by: Xiao Guangrong 



Re: [Qemu-devel] [PATCH v2 3/3] migration: introduce adaptive model for waiting thread

2019-02-18 Thread Xiao Guangrong




On 1/11/19 5:57 PM, Markus Armbruster wrote:

guangrong.x...@gmail.com writes:


From: Xiao Guangrong 

Currently we have two behaviors if all threads are busy doing compression:
the main thread must wait for one of them to become free if
@compress-wait-thread is set to on, or the main thread can directly return
without waiting and post the page out as a normal one.

Both of them have their benefits and shortcomings. However, when the bandwidth
limit is not so low that compression alone can use it all up, while at the same
time the migration process is easily throttled if we post too many pages as
normal pages, neither of them works properly.

In order to use the bandwidth more effectively, we introduce a third
behavior, adaptive, which makes the main thread wait if there is no bandwidth
left, or lets the page go out as a normal page if there is enough bandwidth,
to make sure the migration process will not be throttled.

Signed-off-by: Xiao Guangrong 
---
  hmp.c |   6 ++-
  migration/migration.c | 116 --
  migration/migration.h |  13 ++
  migration/ram.c   | 116 +-
  qapi/migration.json   |  39 +++--


You neglected to cc: the QAPI schema maintainers.
scripts/get_maintainer.pl can help you find the maintainers to cc: on
your patches.



Thank you for pointing it out, I will pay more attention to it.


  5 files changed, 251 insertions(+), 39 deletions(-)

[...]

diff --git a/qapi/migration.json b/qapi/migration.json
index c5babd03b0..0220a0945b 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -93,11 +93,16 @@
  #
  # @compression-rate: rate of compressed size
  #
+# @compress-no-wait-weight: it controls how many pages are directly posted
+# out as normal page when all compression threads are currently busy.
+# Only available if compress-wait-thread = adaptive. (Since 3.2)


"Only available" suggests the member is optional.


+#
  # Since: 3.1
  ##
  { 'struct': 'CompressionStats',
'data': {'pages': 'int', 'busy': 'int', 'busy-rate': 'number',
-  'compressed-size': 'int', 'compression-rate': 'number' } }
+  'compressed-size': 'int', 'compression-rate': 'number',
+  'compress-no-wait-weight': 'int'} }


It isn't.  Should it be optional?  If not, what's its value when
compress-wait-thread isn't adaptive?



It'd be better to make it optional... I will fix it. :)

  
  ##

  # @MigrationStatus:
@@ -489,9 +494,13 @@
  #  the compression thread count is an integer between 1 and 255.
  #
  # @compress-wait-thread: Controls behavior when all compression threads are
-#currently busy. If true (default), wait for a free
-#compression thread to become available; otherwise,
-#send the page uncompressed. (Since 3.1)
+#  currently busy. If 'true/on' (default), wait for a free



+#  compression thread to become available; if 'false/off', send the
+#  page uncompressed. (Since 3.1)
+#  If it is 'adaptive',  the behavior is adaptively controlled based on
+#  the rate limit. If it has enough bandwidth, it acts as
+#  compress-wait-thread is off. (Since 3.2)
+#
  #
  # @decompress-threads: Set decompression thread count to be used in live
  #  migration, the decompression thread count is an integer between 1
@@ -577,9 +586,12 @@
  # @compress-threads: compression thread count
  #
  # @compress-wait-thread: Controls behavior when all compression threads are
-#currently busy. If true (default), wait for a free
-#compression thread to become available; otherwise,
-#send the page uncompressed. (Since 3.1)
+#  currently busy. If 'true/on' (default), wait for a free
+#  compression thread to become available; if 'false/off', send the
+#  page uncompressed. (Since 3.1)
+#  If it is 'adaptive',  the behavior is adaptively controlled based on
+#  the rate limit. If it has enough bandwidth, it acts as
+#  compress-wait-thread is off. (Since 3.2)
  #
  # @decompress-threads: decompression thread count
  #
@@ -655,7 +667,7 @@
  { 'struct': 'MigrateSetParameters',
'data': { '*compress-level': 'int',
  '*compress-threads': 'int',
-'*compress-wait-thread': 'bool',
+'*compress-wait-thread': 'str',


Compatibility break.

You can add a separate flag like you did in v1 if I understand your cover
letter correctly.  Awkward.

You can use a suitable alternate of bool and enum.


‘alternate’ seems a good solution to me, will fix. :)
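For illustration only, a rough sketch of what such an alternate could look
like in qapi/migration.json; the type names are made up and not the final
schema:

{ 'enum': 'CompressWaitThreadMode',
  'data': [ 'off', 'on', 'adaptive' ] }

{ 'alternate': 'CompressWaitThreadValue',
  'data': { 'enabled': 'bool',
            'mode': 'CompressWaitThreadMode' } }

An existing 'compress-wait-thread': true keeps deserializing into the bool
branch, so introspection still reports a well-defined type while 'adaptive'
becomes expressible.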



'str' is not a good idea, because it defeats introspection.


will fix.




  '*decompress-threads': 'int',
  '*cpu-throttle-initial': 'int',
  '*cpu-throttle-increment': 'int',
@@ -697

Re: [Qemu-devel] [PATCH v2 3/3] migration: introduce adaptive model for waiting thread

2019-02-18 Thread Xiao Guangrong




On 1/16/19 2:40 PM, Peter Xu wrote:

On Fri, Jan 11, 2019 at 02:37:32PM +0800, guangrong.x...@gmail.com wrote:



+
+static void update_compress_wait_thread(MigrationState *s)
+{
+    s->compress_wait_thread = get_compress_wait_thread(&s->parameters);
+    assert(s->compress_wait_thread != COMPRESS_WAIT_THREAD_ERR);
+}


We can probably drop this chunk of code if you're going to use
the alternate struct or enum as suggested by Markus...



Yes, indeed.


I think Libvirt is not using this parameter, right?  And I believe the
parameter "compress-wait-thread" was only introduced in QEMU 3.1.
I'm not sure whether we can directly change it to an enum, assuming
that no one is really using it in production yet, so that the change
would break nobody.


I did a check in libvirt's code:
$ git grep compress-wait-thread
tests/qemucapabilitiesdata/caps_3.1.0.ppc64.replies:  "name": "compress-wait-thread",
tests/qemucapabilitiesdata/caps_3.1.0.ppc64.replies:  "name": "compress-wait-thread",
tests/qemucapabilitiesdata/caps_3.1.0.x86_64.replies:  "name": "compress-wait-thread",
tests/qemucapabilitiesdata/caps_3.1.0.x86_64.replies:  "name": "compress-wait-thread",
tests/qemucapabilitiesdata/caps_4.0.0.x86_64.replies:  "name": "compress-wait-thread",
tests/qemucapabilitiesdata/caps_4.0.0.x86_64.replies:  "name": "compress-wait-thread",

It seems changing this parameter will break libvirt's self-test?



Maybe we still have chance to quickly switch back to the name
"x-compress-wait-thread" just like the -global interface then we don't
need to worry much on QAPI breakage so far until the parameter proves
itself to remove the "x-".



Er... I am not sure I can follow that clearly. :(


[...]


@@ -1917,40 +2000,40 @@ bool migrate_postcopy_blocktime(void)
  return s->enabled_capabilities[MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME];
  }
  
-bool migrate_use_compression(void)

+int64_t migrate_max_bandwidth(void)
  {
  MigrationState *s;
  
  s = migrate_get_current();
  
-return s->enabled_capabilities[MIGRATION_CAPABILITY_COMPRESS];

+return s->parameters.max_bandwidth;
  }
  
-int migrate_compress_level(void)

+bool migrate_use_compression(void)
  {
  MigrationState *s;
  
  s = migrate_get_current();
  
-return s->parameters.compress_level;

+return s->enabled_capabilities[MIGRATION_CAPABILITY_COMPRESS];
  }
  
-int migrate_compress_threads(void)

+int migrate_compress_level(void)
  {
  MigrationState *s;
  
  s = migrate_get_current();
  
-return s->parameters.compress_threads;

+return s->parameters.compress_level;
  }
  
-int migrate_compress_wait_thread(void)

+int migrate_compress_threads(void)
  {
  MigrationState *s;
  
  s = migrate_get_current();
  
-return s->parameters.compress_wait_thread;

+return s->parameters.compress_threads;


I'm a bit confused on these diff... are you only adding
migrate_max_bandwidth() and not changing anything else?


I guess so.


 I'm curious
on how these chunk is generated since it looks really weird...


Looks weird to me too. :(



[...]


  /* State of RAM for migration */
  struct RAMState {
  /* QEMUFile used for this migration */
@@ -292,6 +294,19 @@ struct RAMState {
  bool ram_bulk_stage;
  /* How many times we have dirty too many pages */
  int dirty_rate_high_cnt;
+
+/* used by by compress-wait-thread-adaptive */


compress-wait-thread-adaptive is gone?


It's a typo, will fix.


+
+max_bw = (max_bw >> 20) * 8;
+remain_bw = abs(max_bw - (int64_t)(mbps));
+if (remain_bw <= (max_bw / 20)) {
+/* if we have used all the bandwidth, let's compress more. */
+if (compression_counters.compress_no_wait_weight) {
+compression_counters.compress_no_wait_weight--;
+}
+goto exit;
+}
+
+/* have enough bandwidth left, do not need to compress so aggressively */
+if (compression_counters.compress_no_wait_weight !=
+COMPRESS_BUSY_COUNT_PERIOD) {
+compression_counters.compress_no_wait_weight++;
+}
+
+exit:
+ram_state->compress_busy_count = 0;
+ram_state->compress_no_wait_left =
+compression_counters.compress_no_wait_weight;


The "goto" and the chunk seems awkward to me...  How about this?

   if (not_enough_remain_bw && weight)
 weight--;
   else if (weight <= MAX)
 weight++;

   (do the rest...)



Okay, will address your style.
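As a self-contained sketch of what the restructured logic might look like
(the 5% threshold mirrors the patch; COMPRESS_BUSY_COUNT_PERIOD's value and
the helper's standalone form are assumptions for illustration only):

#include <stdint.h>

#define COMPRESS_BUSY_COUNT_PERIOD 64   /* assumed cap, standing in for the series' value */

static unsigned int adjust_no_wait_weight(unsigned int weight,
                                          int64_t max_bw_mbps,
                                          int64_t used_mbps)
{
    int64_t remain_bw = max_bw_mbps > used_mbps ? max_bw_mbps - used_mbps
                                                : used_mbps - max_bw_mbps;

    if (remain_bw <= max_bw_mbps / 20) {
        /* bandwidth (almost) used up: compress more, post fewer normal pages */
        if (weight > 0) {
            weight--;
        }
    } else if (weight < COMPRESS_BUSY_COUNT_PERIOD) {
        /* plenty of bandwidth left: let more pages go out uncompressed */
        weight++;
    }
    return weight;
}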



Also, would you like to add some documentation to these compression
features into docs/devel/migration.rst?  It'll be good, but it's your
call. :)


It's useful indeed. I will do it.


  static void migration_update_rates(RAMState *rs, int64_t end_time)
  {
  uint64_t page_count = rs->target_page_count - rs->target_page_count_prev;
@@ -1975,7 +2057,11 @@ static int compress_page_with_multi_thread(RAMState *rs, 
RAMBlock *block,
 ram_addr_t offset)
  

Re: [Qemu-devel] [PATCH v2 2/3] migration: fix memory leak when updating tls-creds and tls-hostname

2019-02-18 Thread Xiao Guangrong




On 1/16/19 12:03 AM, Eric Blake wrote:

On 1/15/19 4:24 AM, Dr. David Alan Gilbert wrote:


I think the problem is that
migrate_params_check checks a MigrationParameters

while the QMP command gives us a MigrateSetParameters; but we also use
migrate_params_check for the global check you added (8b0b29dc) which is
against migrationParameters; so that's why migrate_params_check takes
a MigrationParameters.

It's horrible we've got stuff duped so much.


Indeed.



However, I don't like this fix because if someone later was to add
a test for tls parameters to migrate_params_check, then they would be
confused why the hostname/creds weren't checked.
So while we have migrate_params_test_apply, it should cover all
parameters.

I think a cleaner check would be to write a MigrateParameters_free
that free'd any strings, and call that in qmp_migrate_set_parameters
on both exit paths.


We already have it; it's named qapi_free_MigrationParameters(),
generated in qapi-types-migration.h.



Thank you all (and sorry for the delayed reply due to Chinese New Year :)),
I will use this interface instead in the next version.



Re: [Qemu-devel] [PATCH 0/2] optimize waiting for free thread to do compression(Internet mail)

2018-12-24 Thread xiaoguangrong(Xiao Guangrong)
On 12/21/18 4:11 PM, Peter Xu wrote:
> On Thu, Dec 13, 2018 at 03:57:25PM +0800, guangrong.x...@gmail.com wrote:
>> From: Xiao Guangrong 
>>
>> Currently we have two behaviors if all threads are busy doing compression:
>> the main thread must wait for one of them to become free if
>> @compress-wait-thread is set to on, or the main thread can directly return
>> without waiting and post the page out as a normal one.
>>
>> Both of them have their benefits and shortcomings. However, when the
>> bandwidth limit is not so low that compression alone can use it all up,
>> while at the same time the migration process is easily throttled if we post
>> too many pages as normal pages, neither of them works properly.
>>
>> In order to use the bandwidth more effectively, we introduce a third
>> behavior, compress-wait-thread-adaptive, which makes the main thread wait
>> if there is no bandwidth left, or lets the page go out as a normal page if
>> there is enough bandwidth, to make sure the migration process will not be
>> throttled.
>>
>> Another patch introduces a new statistic, pages-per-second, as bandwidth
>> or mbps is not enough to measure the performance of posting pages out:
>> we have compression and xbzrle, which can significantly reduce the data
>> size; pages-per-second is the metric we want instead.
>>
>> Performance data
>> 
>> We have limited the bandwidth to 300
>>
>>                                      Used Bandwidth    Pages-per-Second
>> compress-wait-thread = on            951.66 mbps       131784
>> compress-wait-thread = off           2491.74 mbps      93495
>> compress-wait-thread-adaptive = on   1982.94 mbps      162529
> 
> Sounds reasonable to me, though it looks like there're only three
> options: on, off, adaptive.  Should we squash the two parameters?

Thanks for your review, Peter!

That's good to me; if other guys do not have objections, I will do it in the
next version.



Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue

2018-12-09 Thread Xiao Guangrong




On 12/5/18 1:16 AM, Paolo Bonzini wrote:

On 04/12/18 16:49, Christophe de Dinechin wrote:

  Linux and QEMU's own qht work just fine with compile-time directives.


Wouldn’t it work fine without any compile-time directive at all?


Yes, that's what I meant.  Though there are certainly cases in which the
difference without proper cacheline alignment is an order of magnitude
less throughput or something like that; it would certainly be noticeable.


I don't think lock-free lists are easier.  Bitmaps smaller than 64
elements are both faster and easier to manage.


I believe that this is only true if you use a linked list for both freelist
management and for thread notification (i.e. to replace the bitmaps).
However, if you use an atomic list only for the free list, and keep
bitmaps for signaling, then performance is at least equal, often better.
Plus you get the added benefit of having a thread-safe API, i.e.
something that is truly lock-free.

I did a small experiment to test / prove this. Last commit on branch:
https://github.com/c3d/recorder/commits/181122-xiao_guangdong_introduce-threaded-workqueue
Take with a grain of salt, microbenchmarks are always suspect ;-)

The code in “thread_test.c” includes Xiao’s code with two variations,
plus some testing code lifted from the flight recorder library.
1. The FREE_LIST variation (sl_test) is what I would like to propose.
2. The BITMAP variation (bm_test) is the baseline
3. The DOUBLE_LIST variation (ll_test) is the slow double-list approach

To run it, you need to do “make opt-test”, then run “test_script”
which outputs a CSV file. The summary of my findings testing on
a ThinkPad, a Xeon machine and a MacBook is here:
https://imgur.com/a/4HmbB9K

Overall, the proposed approach:

- makes the API thread safe and lock free, addressing the one
drawback that Xiao was mentioning.

- delivers up to 30% more requests on the Macbook, while being
“within noise” (sometimes marginally better) for the other two.
I suspect an optimization opportunity found by clang, because
the Macbook delivers really high numbers.

- spends less time blocking when all threads are busy, which
accounts for the higher number of client loops.

If you think that makes sense, then either Xiao can adapt the code
from the branch above, or I can send a follow-up patch.
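For orientation, a very small sketch of a Treiber-stack style lock-free free
list, the kind of structure described above (illustrative only; the code on
Christophe's branch differs, and a matching pop() would also need ABA
protection, e.g. a generation counter, which this sketch omits):

#include <stdatomic.h>
#include <stddef.h>

struct req_node {
    struct req_node *next;
    /* request payload would follow */
};

static _Atomic(struct req_node *) free_list;

/* Push a request back onto the free list without taking a lock. */
static void free_list_push(struct req_node *n)
{
    struct req_node *old = atomic_load_explicit(&free_list, memory_order_relaxed);

    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(&free_list, &old, n,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}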


Having a follow-up patch would be best I think.  Thanks for
experimenting with this, it's always fun stuff. :)



Yup, Christophe, please post the follow-up patches and add yourself
to the author list if you like. I am looking forward to it. :)

Thanks!






Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue

2018-11-28 Thread Xiao Guangrong




On 11/27/18 8:49 PM, Christophe de Dinechin wrote:

(I did not finish the review, but decided to send what I already had).



On 22 Nov 2018, at 08:20, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

This modules implements the lockless and efficient threaded workqueue.


I’m not entirely convinced that it’s either “lockless” or “efficient”
in the current iteration. I believe that it’s relatively easy to fix, though.



I think Emilio has already replied to your concern why it is "lockless". :)



Three abstracted objects are used in this module:
- Request.
 It not only contains the data that the workqueue fetches out
to finish the request but also offers the space to save the result
after the workqueue handles the request.

It's flowed between user and workqueue. The user fills the request
data into it when it is owned by user. After it is submitted to the
workqueue, the workqueue fetched data out and save the result into
it after the request is handled.


fetched -> fetches
save -> saves


Will fix... My English is even worse than C. :(


+
+/*
+ * find a free request where the user can store the data that is needed to
+ * finish the request
+ *
+ * If all requests are used up, return NULL
+ */
+void *threaded_workqueue_get_request(Threads *threads);


Using void * to represent the payload makes it easy to get
the wrong pointer in there without the compiler noticing.
Consider adding a type for the payload?



Another option would be exporting ThreadRequest to the user and putting
it at the very beginning of the user-defined data struct.

However, that would expose internal design details to the user; I am
not sure it is a good idea...


+ *
+ * Author:
+ *   Xiao Guangrong 
+ *
+ * Copyright(C) 2018 Tencent Corporation.
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/bitmap.h"
+#include "qemu/threaded-workqueue.h"
+
+#define SMP_CACHE_BYTES 64


+1 on comments already made by others


Will improve it.




+
+/*
+ * the request representation which contains the internally used mete data,


mete -> meta


Will fix.




+ * it is the header of user-defined data.
+ *
+ * It should be aligned to the nature size of CPU.
+ */
+struct ThreadRequest {
+/*
+ * the request has been handled by the thread and need the user
+ * to fetch result out.
+ */
+uint8_t done;
+
+/*
+ * the index to Thread::requests.
+ * Save it to the padding space although it can be calculated at runtime.
+ */
+uint8_t request_index;


So no more than 256?

This is blocked by MAX_THREAD_REQUEST_NR test at the beginning
of threaded_workqueue_create, but I would make it more explicit either
with a compile-time assert that MAX_THREAD_REQUEST_NR is
below UINT8_MAX, or by adding a second test for UINT8_MAX in
threaded_workqueue_create.


It's good to me.

I prefer the former, a compile-time assert that MAX_THREAD_REQUEST_NR
is below UINT8_MAX.

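A sketch of how that compile-time check could look (QEMU_BUILD_BUG_ON is
QEMU's existing build-time assertion; MAX_THREAD_REQUEST_NR is the constant
from the patch):

/* request_index is a uint8_t, so the per-thread request count must fit. */
QEMU_BUILD_BUG_ON(MAX_THREAD_REQUEST_NR > UINT8_MAX);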


Also, an obvious extension would be to make bitmaps into arrays.

Do you think someone would want to use the package to assign
requests per CPU or per VCPU? If so, that could quickly go above 64.



Well... it specifies the depth of each single thread; using a larger
depth has a negative effect, as it makes
threaded_workqueue_wait_for_requests() too slow; at that point, the
user needs to wait for all the threads to exhaust all their requests.

Another point is that a u64 is more efficient than bitmaps, as we can
see from the performance data:
   https://ibb.co/hq7u5V

Based on those, I think 64 should be enough, at least for the present
user, the migration thread.




+
+/* the index to Threads::per_thread_data */
+unsigned int thread_index;


Don’t you want to use a size_t for that?


size_t is 8 bytes... I'd like to keep the request header as tiny as possible...




+} QEMU_ALIGNED(sizeof(unsigned long));


Nit: the alignment type is inconsistent with that given
to QEMU_BUILD_BUG_ON in threaded_workqueue_create.
(long vs. unsigned long).



Yup, will make them consistent.


Also, why is the alignment required? Aren’t you more interested
in cache-line alignment?



ThreadRequest actually is the header put at the very beginning of
the request. If it is not aligned to "long", the user-defined data
struct could be accessed without proper alignment.




+typedef struct ThreadRequest ThreadRequest;




+
+struct ThreadLocal {
+struct Threads *threads;
+
+/* the index of the thread */
+int self;


Why signed?


Mistake, will fix.




+
+/* thread is useless and needs to exit */
+bool quit;
+
+QemuThread thread;
+
+void *requests;
+
+   /*
+ * the bit in these two bitmaps indicates the index of the @requests
+ * respectively. If it's the same, the corresponding request is fr

Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue

2018-11-27 Thread Xiao Guangrong




On 11/26/18 6:28 PM, Paolo Bonzini wrote:

On 26/11/18 09:18, Xiao Guangrong wrote:



+static uint64_t get_free_request_bitmap(Threads *threads, ThreadLocal *thread)
+{
+    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
+
+    request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap);
+    request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap);
+    bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
+               threads->thread_requests_nr);


This is not wrong, but it's a bit ugly. Instead, I would:

- Introduce bitmap_xor_atomic in a previous patch
- Use bitmap_xor_atomic here, getting rid of the rcu reads


Hmm, however, we do not need an atomic xor operation here... that should be
slower than just two READ_ONCE calls.


Yeah, I'd just go with Guangrong's version.  Alternatively, add
find_{first,next}_{same,different}_bit functions (whatever subset of the
4 you need).


That's good to me. will try it. ;)
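For the <= 64-requests-per-thread case this series targets, such a helper
could be as small as the following sketch (the name mirrors Paolo's
suggestion; ctz64() is QEMU's count-trailing-zeros helper from
qemu/host-utils.h):

/* Return the index of the first bit that differs between a and b,
 * or 64 if the two words are identical. */
static int find_first_different_bit64(uint64_t a, uint64_t b)
{
    uint64_t diff = a ^ b;

    return diff ? ctz64(diff) : 64;
}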



Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue

2018-11-27 Thread Xiao Guangrong




On 11/27/18 2:55 AM, Emilio G. Cota wrote:

On Mon, Nov 26, 2018 at 15:57:25 +0800, Xiao Guangrong wrote:



On 11/23/18 7:02 PM, Dr. David Alan Gilbert wrote:


+#include "qemu/osdep.h"
+#include "qemu/bitmap.h"
+#include "qemu/threaded-workqueue.h"
+
+#define SMP_CACHE_BYTES 64


That's architecture dependent isn't it?



Yes, it's arch dependent indeed.

I just used 64 for simplification and I think it is <= 64 on most CPU arches,
so that can work.

Should I introduce a statically defined CACHE LINE SIZE for all arches? :(


No, at compile-time this is impossible to know.

We do query this info at run-time though (see util/cacheinfo.c),
but using that info here would complicate things too much.


I see.



You can just give it a different name, and perhaps add a comment.
See for instance what we do in qht.c with QHT_BUCKET_ALIGN.


That's really a good lesson to me, will follow it. :)
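A sketch of what that rename could look like, modeled on QHT_BUCKET_ALIGN in
qht.c (the value 64 is just the common cache-line size, not a per-architecture
guarantee):

/*
 * Module-local alignment for the hot bitmaps/events: 64 bytes covers the
 * cache-line size of the common host architectures.
 */
#define THREAD_QUEUE_ALIGN 64

uint64_t request_fill_bitmap QEMU_ALIGNED(THREAD_QUEUE_ALIGN);
uint64_t request_done_bitmap QEMU_ALIGNED(THREAD_QUEUE_ALIGN);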



Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue

2018-11-27 Thread Xiao Guangrong




On 11/27/18 2:49 AM, Emilio G. Cota wrote:

On Mon, Nov 26, 2018 at 16:06:37 +0800, Xiao Guangrong wrote:

+/* after the user fills the request, the bit is flipped. */
+uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
+/* after handles the request, the thread flips the bit. */
+uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);


Use DECLARE_BITMAP, otherwise you'll get type errors as David
pointed out.


If we do it, the field becomes a pointer... that complicates the
thing.


Not necessarily, see below.

On Mon, Nov 26, 2018 at 16:18:24 +0800, Xiao Guangrong wrote:

On 11/24/18 8:17 AM, Emilio G. Cota wrote:

On Thu, Nov 22, 2018 at 15:20:25 +0800, guangrong.x...@gmail.com wrote:

+static uint64_t get_free_request_bitmap(Threads *threads, ThreadLocal *thread)
+{
+    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
+
+    request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap);
+    request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap);
+    bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
+               threads->thread_requests_nr);


This is not wrong, but it's a bit ugly. Instead, I would:

- Introduce bitmap_xor_atomic in a previous patch
- Use bitmap_xor_atomic here, getting rid of the rcu reads


Hmm, however, we do not need an atomic xor operation here... that should be
slower than just two READ_ONCE calls.


If you use DECLARE_BITMAP, you get an in-place array. On a 64-bit
host, that'd be
unsigned long foo[1]; /* [2] on 32-bit */

Then again on 64-bit hosts, bitmap_xor_atomic would reduce
to 2 atomic reads:

static inline void bitmap_xor_atomic(unsigned long *dst,
    const unsigned long *src1, const unsigned long *src2, long nbits)
{
    if (small_nbits(nbits)) {
        *dst = atomic_read(src1) ^ atomic_read(src2);
    } else {
        slow_bitmap_xor_atomic(dst, src1, src2, nbits);


We needn't do an in-place xor operation, i.e., we just fetch the bitmaps into
local variables and do the xor locally.

So we need additional complexity to handle the !small_nbits(nbits) case
... but it is really not a big deal as you said, it is just a couple of lines.

However, using a u64, so that only 64 indexes are allowed, is more
straightforward and can be naturally understood. :)
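A short sketch of the approach described above, relying on the series'
64-requests-per-thread limit (field names follow the patch, the local
variable names are illustrative):

uint64_t fill = atomic_rcu_read(&thread->request_fill_bitmap);
uint64_t done = atomic_rcu_read(&thread->request_done_bitmap);

/* A bit that differs means the request is in flight (owned by a worker);
 * a bit that matches means the slot is free and owned by the user. */
uint64_t in_flight  = fill ^ done;
uint64_t free_slots = ~in_flight;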




Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue

2018-11-26 Thread Xiao Guangrong




On 11/26/18 6:56 PM, Dr. David Alan Gilbert wrote:

* Xiao Guangrong (guangrong.x...@gmail.com) wrote:



On 11/23/18 7:02 PM, Dr. David Alan Gilbert wrote:


+#include "qemu/osdep.h"
+#include "qemu/bitmap.h"
+#include "qemu/threaded-workqueue.h"
+
+#define SMP_CACHE_BYTES 64


That's architecture dependent isn't it?



Yes, it's arch dependent indeed.

I just used 64 for simplification and I think it is <= 64 on most CPU arches,
so that can work.

Should I introduce a statically defined CACHE LINE SIZE for all arches? :(


I think it depends why you need it; but we shouldn't have a constant
that is wrong, and we shouldn't define something architecture dependent
in here.



I see. I will address Emilio's suggestion to rename SMP_CACHE_BYTES to
THREAD_QUEUE_ALIGN and add additional comments.


+   /*
+ * the bit in these two bitmaps indicates the index of the @requests
+ * respectively. If it's the same, the corresponding request is free
+ * and owned by the user, i.e, where the user fills a request. Otherwise,
+ * it is valid and owned by the thread, i.e, where the thread fetches
+ * the request and write the result.
+ */
+
+/* after the user fills the request, the bit is flipped. */
+uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
+/* after handles the request, the thread flips the bit. */
+uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);


Patchew complained about some type mismatches; I think those are because
you're using the bitmap_* functions on these; those functions always
operate on 'long' not on uint64_t - and on some platforms they're
unfortunately not the same.


I guess you were talking about this error:
ERROR: externs should be avoided in .c files
#233: FILE: util/threaded-workqueue.c:65:
+uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);

The thing being complained about is "QEMU_ALIGNED(SMP_CACHE_BYTES)", as the
error goes away when the aligned attribute is removed...

The issue you pointed out can be avoided by using a type cast, like:
bitmap_xor(..., (void *)&thread->request_fill_bitmap)
can't it?


I thought the error was just due to long vs uint64_t rather than the
QEMU_ALIGNED.  I don't think it's just a casting problem, since I don't
think longs are always 64-bit.


Well, I made some adjustments that make check_patch.sh really happy :),
as follows:
$ git diff util/
diff --git a/util/threaded-workqueue.c b/util/threaded-workqueue.c
index 2ab37cee8d..e34c65a8eb 100644
--- a/util/threaded-workqueue.c
+++ b/util/threaded-workqueue.c
@@ -62,21 +62,30 @@ struct ThreadLocal {
  */

 /* after the user fills the request, the bit is flipped. */
-uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
+struct {
+uint64_t request_fill_bitmap;
+} QEMU_ALIGNED(SMP_CACHE_BYTES);
+
 /* after handles the request, the thread flips the bit. */
-uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
+struct {
+uint64_t request_done_bitmap;
+} QEMU_ALIGNED(SMP_CACHE_BYTES);

 /*
  * the event used to wake up the thread whenever a valid request has
  * been submitted
  */
-QemuEvent request_valid_ev QEMU_ALIGNED(SMP_CACHE_BYTES);
+struct {
+QemuEvent request_valid_ev;
+} QEMU_ALIGNED(SMP_CACHE_BYTES);

 /*
  * the event is notified whenever a request has been completed
  * (i.e, become free), which is used to wake up the user
  */
-QemuEvent request_free_ev QEMU_ALIGNED(SMP_CACHE_BYTES);
+struct {
+QemuEvent request_free_ev;
+} QEMU_ALIGNED(SMP_CACHE_BYTES);
 };
 typedef struct ThreadLocal ThreadLocal;

$ ./scripts/checkpatch.pl -f util/threaded-workqueue.c
total: 0 errors, 0 warnings, 472 lines checked

util/threaded-workqueue.c has no obvious style problems and is ready for 
submission.

check_patch.sh somehow treats QEMU_ALIGNED as a function before the 
modification.

And yes, u64 is not a long type on a 32-bit arch, it's long[2] instead. That's
fine when we pass the &(u64) to a function whose parameter is (long *). I think
this trick is widely used, e.g., the example in kvm that I mentioned in my
reply to Emilio:

static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu)
{
    return test_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
}






Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue

2018-11-26 Thread Xiao Guangrong




On 11/24/18 8:17 AM, Emilio G. Cota wrote:

On Thu, Nov 22, 2018 at 15:20:25 +0800, guangrong.x...@gmail.com wrote:

+static uint64_t get_free_request_bitmap(Threads *threads, ThreadLocal *thread)
+{
+uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
+
+request_fill_bitmap = atomic_rcu_read(>request_fill_bitmap);
+request_done_bitmap = atomic_rcu_read(>request_done_bitmap);
+bitmap_xor(_bitmap, _fill_bitmap, _done_bitmap,
+   threads->thread_requests_nr);


This is not wrong, but it's a big ugly. Instead, I would:

- Introduce bitmap_xor_atomic in a previous patch
- Use bitmap_xor_atomic here, getting rid of the rcu reads


Hmm, however, we do not need an atomic xor operation here... that should be
slower than just two READ_ONCE calls.



Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue

2018-11-26 Thread Xiao Guangrong




On 11/24/18 8:12 AM, Emilio G. Cota wrote:

On Thu, Nov 22, 2018 at 15:20:25 +0800, guangrong.x...@gmail.com wrote:

+   /*
+ * the bit in these two bitmaps indicates the index of the @requests


This @ is not ASCII, is it?



Good eyes. :)

Will fix it.


+ * respectively. If it's the same, the corresponding request is free
+ * and owned by the user, i.e, where the user fills a request. Otherwise,
+ * it is valid and owned by the thread, i.e, where the thread fetches
+ * the request and write the result.
+ */
+
+/* after the user fills the request, the bit is flipped. */
+uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
+/* after handles the request, the thread flips the bit. */
+uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);


Use DECLARE_BITMAP, otherwise you'll get type errors as David
pointed out.


If we do it, the field becomes a pointer... that complicates the
thing.

Hmm, I am using the same trick the kvm module applies when it handles
vcpu->requests:
static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu)
{
    return test_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
}

Is it good?



Re: [Qemu-devel] [PATCH v3 3/5] migration: use threaded workqueue for compression

2018-11-26 Thread Xiao Guangrong




On 11/24/18 2:29 AM, Dr. David Alan Gilbert wrote:


  static void
-update_compress_thread_counts(const CompressParam *param, int bytes_xmit)
+update_compress_thread_counts(CompressData *cd, int bytes_xmit)


Keep the const?


Yes, indeed. Will correct it in the next version.


+if (deflateInit(>stream, migrate_compress_level()) != Z_OK) {
+g_free(cd->originbuf);
+return -1;
+}


Please print errors if you fail in any case so we can easily tell what
happened.


Sure, will do.


+if (wait) {
+cpu_relax();
+goto retry;


Is there nothing better we can use to wait without eating CPU time?


There is a mechanism to wait without eating CPU time in the data
structure, but it makes sense to busy wait.  There are 4 threads in the
workqueue, so you have to compare 1/4th of the time spent compressing a
page, with the trip into the kernel to wake you up.  You're adding 20%
CPU usage, but I'm not surprised it's worthwhile.


Hmm OK; in that case it does at least need a comment because it's a bit
odd, and we should watch out how that scales - I guess it's less of
an overhead the more threads you use.



Sure, will add some comments to explain the purpose.
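A sketch of the retry loop with the kind of comment being asked for
(find_free_compress_data() is a made-up helper standing in for the patch's
slot lookup; cpu_relax() is QEMU's pause/yield hint):

    CompressData *cd;

retry:
    cd = find_free_compress_data();     /* hypothetical helper */
    if (!cd && wait) {
        /*
         * Busy-wait instead of sleeping: with only a handful of compression
         * threads, the expected wait is a fraction of the time needed to
         * compress one page, which is cheaper than a trip into the kernel
         * to be woken up.  Revisit if the thread count grows.
         */
        cpu_relax();
        goto retry;
    }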



Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue

2018-11-25 Thread Xiao Guangrong




On 11/23/18 7:02 PM, Dr. David Alan Gilbert wrote:


+#include "qemu/osdep.h"
+#include "qemu/bitmap.h"
+#include "qemu/threaded-workqueue.h"
+
+#define SMP_CACHE_BYTES 64


That's architecture dependent isn't it?



Yes, it's arch dependent indeed.

I just used 64 for simplification and I think it is <= 64 on most CPU arches,
so that can work.

Should I introduce a statically defined CACHE LINE SIZE for all arches? :(


+   /*
+ * the bit in these two bitmaps indicates the index of the @requests
+ * respectively. If it's the same, the corresponding request is free
+ * and owned by the user, i.e, where the user fills a request. Otherwise,
+ * it is valid and owned by the thread, i.e, where the thread fetches
+ * the request and write the result.
+ */
+
+/* after the user fills the request, the bit is flipped. */
+uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
+/* after handles the request, the thread flips the bit. */
+uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);


Patchew complained about some type mismatches; I think those are because
you're using the bitmap_* functions on these; those functions always
operate on 'long' not on uint64_t - and on some platforms they're
unfortunately not the same.


I guess you were talking about this error:
ERROR: externs should be avoided in .c files
#233: FILE: util/threaded-workqueue.c:65:
+uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);

The thing being complained about is "QEMU_ALIGNED(SMP_CACHE_BYTES)", as the
error goes away when the aligned attribute is removed...

The issue you pointed out can be avoided by using a type cast, like:
bitmap_xor(..., (void *)&thread->request_fill_bitmap)
can't it?

Thanks!



Re: [Qemu-devel] [PATCH v2 2/5] util: introduce threaded workqueue

2018-11-20 Thread Xiao Guangrong




On 11/14/18 2:38 AM, Emilio G. Cota wrote:

On Tue, Nov 06, 2018 at 20:20:22 +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

This modules implements the lockless and efficient threaded workqueue.

(snip)

+++ b/util/threaded-workqueue.c
+struct Threads {
+/*
+ * in order to avoid contention, the @requests is partitioned to
+ * @threads_nr pieces, each thread exclusively handles
+ * @thread_request_nr requests in the array.
+ */
+void *requests;

(snip)

+/*
+ * the bit in these two bitmaps indicates the index of the @requests
+ * respectively. If it's the same, the corresponding request is free
+ * and owned by the user, i.e, where the user fills a request. Otherwise,
+ * it is valid and owned by the thread, i.e, where the thread fetches
+ * the request and write the result.
+ */
+
+/* after the user fills the request, the bit is flipped. */
+unsigned long *request_fill_bitmap;
+/* after handles the request, the thread flips the bit. */
+unsigned long *request_done_bitmap;

(snip)

+/* the request is pushed to the thread with round-robin manner */
+unsigned int current_thread_index;

(snip)

+QemuEvent ev;

(snip)

+};


The fields I'm showing above are all shared by all worker threads.
This can lead to unnecessary contention. For example:
- Adjacent requests can share the same cache line, which might be
   written to by different worker threads (when setting request->done)

- The request_done bitmap is written to by worker threads every time
   a job completes. At high core counts with low numbers of job slots,
   this can result in high contention. For example, imagine we have
   16 threads with 4 jobs each. This only requires 64 bits == 8 bytes, i.e.
   much less than a cache line. Whenever a job completes, the cache line
   will be atomically updated by one of the 16 threads.

- The completion event (Threads.ev above) is written to by every thread.
   Again, this can result in contention at large core counts.

An orthogonal issue is the round-robin policy. This can give us fairness,
in that we guarantee that all workers get a similar number of jobs.
But giving one job at a time to each worker is suboptimal when the job
sizes are small-ish, because it misses out on the benefits of batching,
which amortize the cost of communication.
Given that the number of jobs that we have (at least in the benchmark)
are small, filling up a worker's queue before moving on to the next
can yield a significant speedup at high core counts.

I implemented the above on top of your series. The results are as follows:

  threaded-workqueue-bench -r 4 -m 2 -c 20 -t #N
  Host: AMD Opteron(tm) Processor 6376
  Thread pinning: #N+1 cores, same-socket first

  [ASCII throughput plot (y axis roughly 4 to 12) comparing three series
   across thread counts: "before", "+batching" and "+per-thread-state";
   the plot itself is not reproducible here.]
 
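For reference, a sketch of the batching policy measured above: keep handing
jobs to the current worker until its slots are full, then move on (the worker
struct and thread_queue_try_push() are made-up names for illustration):

static int submit_batched(struct worker *workers, int nr_workers,
                          void *job, int *cur)
{
    int tried;

    for (tried = 0; tried < nr_workers; tried++) {
        if (thread_queue_try_push(&workers[*cur], job)) {
            return 0;                    /* keep filling this worker */
        }
        *cur = (*cur + 1) % nr_workers;  /* its queue is full, move on */
    }
    return -1;  /* every queue is full; the caller waits for completions */
}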

Re: [Qemu-devel] [PATCH v2 0/5] migration: improve multithreads

2018-11-11 Thread Xiao Guangrong



Hi,

Ping...

On 11/6/18 8:20 PM, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Changelog in v2:
These changes are based on Paolo's suggestion:
1) rename the lockless multithreads model to threaded workqueue
2) hugely improve the internal design, that make all the request be
a large array, properly partition it, assign requests to threads
respectively and use bitmaps to sync up threads and the submitter,
after that ptr_ring and spinlock are dropped
3) introduce event wait for the submitter

These changes are based on Emilio's review:
4) make more detailed description for threaded workqueue
5) add a benchmark for threaded workqueue

The previous version can be found at
https://marc.info/?l=kvm=153968821910007=2

Here is a simple performance measurement comparing these two versions;
the environment is the same as we listed in the previous version.

Use 8 threads to compress the data in the source QEMU
- with compress-wait-thread = off


        total time    busy-ratio
--------------------------------
v1      125066        0.38
v2      120444        0.35

- with compress-wait-thread = on
        total time    busy-ratio
--------------------------------
v1      164426        0
v2      142609        0
The v2 wins slightly.

Xiao Guangrong (5):
   bitops: introduce change_bit_atomic
   util: introduce threaded workqueue
   migration: use threaded workqueue for compression
   migration: use threaded workqueue for decompression
   tests: add threaded-workqueue-bench

  include/qemu/bitops.h |  13 +
  include/qemu/threaded-workqueue.h |  94 +++
  migration/ram.c   | 538 ++
  tests/Makefile.include|   5 +-
  tests/threaded-workqueue-bench.c  | 256 ++
  util/Makefile.objs|   1 +
  util/threaded-workqueue.c | 466 +
  7 files changed, 1030 insertions(+), 343 deletions(-)
  create mode 100644 include/qemu/threaded-workqueue.h
  create mode 100644 tests/threaded-workqueue-bench.c
  create mode 100644 util/threaded-workqueue.c





Re: [Qemu-devel] [PATCH 2/4] migration: introduce lockless multithreads model

2018-10-28 Thread Xiao Guangrong




On 10/28/2018 03:50 PM, Paolo Bonzini wrote:

On 27/10/2018 01:33, Emilio G. Cota wrote:

On Wed, Oct 17, 2018 at 12:10:15 +0200, Paolo Bonzini wrote:

On 16/10/2018 13:10, guangrong.x...@gmail.com wrote:



An idea: the total number of requests is going to be very small, and a
PtrRing is not the nicest data structure for multiple producer/single
consumer.  So you could instead:

(snip)

- now that you have request indices, you can replace the completion
ptr_ring with a bitmap, and set a bit in the bitmap with set_bit_atomic
to report completion.  On the writer side you use find_next_bit to find

(snip)

Emilio, can you review the above ideas?


Sorry it took me a while to go through this.

I like your suggestions. Just one nit; I'm not sure I understood
the use case very well, but I think using a bitmap to signal
completion might be suboptimal, since we'd have several
thread spinning on the same cacheline yet caring about
different bits.


Requests are asynchronous, the bitmap is only used to find a free
submission slot.  You're right that the bitmap can bounce across
processors, but I'm not sure how else you would do that because you
don't know in advance how many submitting threads you have.  It wouldn't
be any worse if there was a spinlock.

However, in the migration case there is only one submitting thread, so
it's okay. :)



Yup.

The cache contention only exists in the work threads, the sumbiter thread
is totally free who is the main migration thread. Making the main thread
be faster is good.


Paolo


Xiao: a couple of suggestions

- Since you'll be adding a generic module, make its commit and
   description self-contained. That is, mentioning in the
   log that this will be used for migration is fine, but please
   describe the module (and the assumptions it makes about its
   users) in general, so that someone that doesn't know anything
   about migration can still understand this module (and hopefully
   adopt it for other use cases).


Good to me, I will add a more detailed description for this module in
the next version.



- I'd like to see a simple test program (or rather, benchmark)
   that shows how this works. This benchmark would be completely
   unrelated to migration; it should just be a simple test of
   the performance/scalability of this module.
   Having this benchmark would help (1) discuss and quantitately
   evaluate modifications to the module, and (2) help others to
   quickly understand what the module does.
   See tests/qht-bench.c for an example.



Can not agree with you more, will do. :)

Thank you, Emilio and Paolo!




Re: [Qemu-devel] [PATCH 2/4] migration: introduce lockless multithreads model

2018-10-18 Thread Xiao Guangrong




On 10/17/2018 06:10 PM, Paolo Bonzini wrote:



An idea: the total number of requests is going to be very small, and a
PtrRing is not the nicest data structure for multiple producer/single
consumer.  So you could instead:

- add the size of one request to the ops structure.  Move the allocation
in init_requests, so that you can have one single large array that
stores all requests.  thread_request_init gets the pointer to a single
request.

- now that you have a single array (let's call it threads->requests),
the request index can be found with "(req - threads->requests) /
threads->ops->request_size".  The thread index, furthermore, is just
request_index / threads->thread_ring_size and you can remove it from
ThreadRequest.

- now that you have request indices, you can replace the completion
ptr_ring with a bitmap, and set a bit in the bitmap with set_bit_atomic
to report completion.  On the writer side you use find_next_bit to find
a completed request and clear_bit_atomic to clear its index.  The index
passed to find_next_bit is threads->current_thread_index *
threads->thread_ring_size,  And you also don't need find_free_thread,
because you can update threads->current_thread_index to

 threads->current_thread_index =
((free_request_id / threads->thread_ring_size) + 1)
 % threads->thread_nr;

after you prepare a request, and threads will then naturally receive
requests in round-robin order.  (If find_next_bit fails it means you
have to move the current_thread_index to 0 and retry).

- you don't need the free requests list and free_requests_nr either: you
just initialize the completed request bitmap to all-ones, and
find_next_bit + clear_bit_atomic will do the work of free_requests.
ThreadRequest goes away completely now!

The advantage is that you get rid of the spinlock on the consumer side,
and many auxiliary data structures on the producer side: a bitmap is a
much nicer data structure for multiple producer/single consumer than the
PtrRing.  In addition, with this design the entire Threads structure
becomes read-mostly, which is nice for the cache.  The disadvantage is
that you add an atomic operation (clear_bit_atomic) to
threads_submit_request_prepare(*).



All your comments are good to me and you are a GENIUS; the idea
of making the requests a single array and partitioning it this
way simplifies things a lot.


The PtrRing is still useful for the other direction, because there you
have single producer/single consumer.

(*) It's probably possible to have two separate bitmaps, e.g.
where the producer and consumers *flip* bits and the
producer looks for mismatched bits between the two bitmaps.
I'm not asking you to flesh out and implement that; it's
just why I think you can ignore the extra cost of
clear_bit_atomic.  In fact, if the maintainers think this
is overkill you can go ahead with just the naming fixes and
I'll take a look at a rewrite when I have some time for fun
stuff. :)



LOL! Great minds think alike; the idea, "flipping bitmaps", was in my
mind too. :)

Besides that... I think we get the chance to remove ptr_ring gracefully,
as the bitmap can indicate the ownership of the request as well. If
the bit is 1 (supposing all bits are 1 by default), only the user can
operate on it; the bit is cleared after the user fills the info
into the request. After that, the thread sees the bit is cleared, so
it gets the ownership and finishes the request, then sets the bit in
the bitmap. The ownership is returned to the user again.

One possible disadvantage is that it can't differentiate between a
request that is empty and one that contains a result the user needs to
collect via threads_wait_done(); that will slow threads_wait_done() down
a little as it needs to check all requests, but it is not a big deal as
1) at the point where a flush is needed, it is highly possible that almost
   all requests have been used.
2) the total number of requests is going to be very small.

It is illustrated by following code by combining the "flip" bitmaps:

struct Threads {
   ..

   /*
* the bit in these two bitmaps indicates the index of the requests
* respectively. If it's the same, the request is owned by the user,
* i.e, only the use can use the request. Otherwise, it is owned by
* the thread.
*/

   /* after the user fills the request, the bit is flipped. */
   unsigned long *request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);

   /* after handles the request, the thread flips the bit. */
   unsigned long *request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
}

threads_submit_request_prepare()
{
    request_done_bitmap = READ_ONCE(threads->request_done_bitmap);
    result_bitmap = bitmap_xor(&request_done_bitmap,
                               threads->request_fill_bitmap);

    index = find_first_zero_bit(current-thread-to-request-index,
                                &result_bitmap);

    /* make sure we get the data the thread written. */

Re: [Qemu-devel] [PATCH 1/4] ptr_ring: port ptr_ring from linux kernel to QEMU

2018-10-18 Thread Xiao Guangrong




On 10/17/2018 04:14 PM, Paolo Bonzini wrote:

On 16/10/2018 18:40, Emilio G. Cota wrote:

+#define SMP_CACHE_BYTES  64
+#define cacheline_aligned_in_smp \
+__attribute__((__aligned__(SMP_CACHE_BYTES)))

You could use QEMU_ALIGNED() here.


Yes, you are right.




+
+#define WRITE_ONCE(ptr, val) \
+(*((volatile typeof(ptr) *)(&(ptr))) = (val))
+#define READ_ONCE(ptr) (*((volatile typeof(ptr) *)(&(ptr))))

Why not atomic_read/set, like in the rest of the QEMU code base?


Or even atomic_rcu_read/atomic_rcu_set, which includes the necessary
barriers.



Okay, will fix it, thank you and Emilio for pointing the
issue out.


Also, please do not use __ identifiers in QEMU code.
cacheline_aligned_in_smp can become just QEMU_ALIGNED(SMP_CACHE_BYTES).



Sure, will keep that in my mind. :)
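A small sketch of the replacement being suggested (atomic_rcu_read() and
atomic_rcu_set() come from qemu/atomic.h; the Ring type here is purely
illustrative, not the ported ptr_ring):

#include "qemu/osdep.h"
#include "qemu/atomic.h"

typedef struct Ring {
    void *slot;
} Ring;

static void ring_publish(Ring *r, void *p)
{
    atomic_rcu_set(&r->slot, p);       /* instead of WRITE_ONCE(r->slot, p) */
}

static void *ring_consume(Ring *r)
{
    return atomic_rcu_read(&r->slot);  /* instead of READ_ONCE(r->slot) */
}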




Re: [Qemu-devel] [PATCH v6 0/3] migration: compression optimization

2018-09-13 Thread Xiao Guangrong




On 09/06/2018 07:03 PM, Juan Quintela wrote:

guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Changelog in v6:

Thanks to Juan's review, in this version we
1) move flush compressed data to find_dirty_block() where it hits the end
of memblock
2) use save_page_use_compression instead of migrate_use_compression in
flush_compressed_data

Xiao Guangrong (3):
   migration: do not flush_compressed_data at the end of iteration
   migration: show the statistics of compression
   migration: use save_page_use_compression in flush_compressed_data

  hmp.c | 13 +++
  migration/migration.c | 12 ++
  migration/ram.c   | 63 +++
  migration/ram.h   |  1 +
  qapi/migration.json   | 26 -
  5 files changed, 105 insertions(+), 10 deletions(-)


queued



Hi Juan,

Could I ask where you have queued these patches? I did not find
them in your git tree at
   https://github.com/juanquintela/qemu migration/next or migration.next

I am working on the next part of migration, so it would be more convenient
to base it on your tree. :)

Thanks!





Re: [Qemu-devel] [PATCH v5 1/4] migration: do not flush_compressed_data at the end of each iteration

2018-09-03 Thread Xiao Guangrong




On 09/04/2018 11:54 AM, Xiao Guangrong wrote:



We will call it only if xbzrle is also enabled, at this case, we will
disable compression and xbzrle for the following pages, please refer

 ^and use xbzrle

Sorry for the typo.



Re: [Qemu-devel] [PATCH v5 1/4] migration: do not flush_compressed_data at the end of each iteration

2018-09-03 Thread Xiao Guangrong




On 09/04/2018 12:38 AM, Juan Quintela wrote:

guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

flush_compressed_data() needs to wait for all compression threads to
finish their work; after that all threads are free until the
migration feeds new requests to them, so reducing its calls can improve
the throughput and use CPU resources more effectively.
We do not need to flush all threads at the end of each iteration; the
data can be kept locally until the memory block is changed or
memory migration starts over, in which case we will meet a dirtied
page which may still exist in a compression thread's ring.

Signed-off-by: Xiao Guangrong 


I am not so sure about this patch.

Right now, we warantee that after each iteration, all data is written
before we start a new round.

This patch changes to only "flush" the compression threads if we have
"synched" with the kvm migration bitmap.  Idea is good but as far as I
can see:

- we already call flush_compressed_data() inside find_dirty_block() if we
   synchronize the bitmap. 


The one called in find_dirty_block() is as follows:

if (migrate_use_xbzrle()) {
/* If xbzrle is on, stop using the data compression at this
 * point. In theory, xbzrle can do better than compression.
 */
flush_compressed_data(rs);
}

We will call it only if xbzrle is also enabled; in that case, we will
disable compression and use xbzrle for the following pages, please refer
to save_page_use_compression(). So it cannot help us if we just enabled
compression separately.


So, at least, we need to update
   dirty_sync_count there.


I understand this is an optimization that reduces flush_compressed_data()
calls to the case where both xbzrle and compression are enabled. However, we
can avoid it by using save_page_use_compression() to replace
migrate_use_compression().

Furthermore, in our work which will be pushed out after this patchset
(https://marc.info/?l=kvm=152810616708459=2), we will make
flush_compressed_data() really light if there is nothing to be flushed.

So, how about keeping it as it is in this patch and optimizing it in our
next work? :)



- queued pages are "interesting", but I am not sure if compression and
   postcopy work well together.



Compression and postcopy cannot both be enabled; please refer to the
code in migrate_caps_check():

if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
/* The decompression threads asynchronously write into RAM
 * rather than use the atomic copies needed to avoid
 * userfaulting.  It should be possible to fix the decompression
 * threads for compatibility in future.
 */
error_setg(errp, "Postcopy is not currently compatible "
   "with compression");
return false;
}


So, if we don't need to call flush_compressed_data() every round, then
the one inside find_dirty_block() should be enough.  Otherwise, I can't
see why we need this other.


Independent of this patch:
- We always send data for every compression thread without testing if
there is any there.



Yes. That's because we will never handle the same page twice in one iteration. :)




@@ -3212,7 +3225,6 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
  }
  i++;
  }
-flush_compressed_data(rs);
  rcu_read_unlock();
  
  /*


Why is not enough just to remove this call to flush_compressed_data?


Consider this case; thanks to Dave for pointing it out. :)

===

  iteration one:
 thread 1: Save compressed page 'n'

  iteration two:
 thread 2: Save compressed page 'n'

What guarantees that the version of page 'n'
from thread 2 reaches the destination first without
this flush?
===

If we just remove it, we cannot get this guarantee. :)



Re: [Qemu-devel] [PATCH v3 10/10] migration: show the statistics of compression

2018-08-08 Thread Xiao Guangrong




On 08/08/2018 02:12 PM, Peter Xu wrote:

On Tue, Aug 07, 2018 at 05:12:09PM +0800, guangrong.x...@gmail.com wrote:

[...]


@@ -1602,6 +1614,26 @@ static void migration_update_rates(RAMState *rs, int64_t 
end_time)
  rs->xbzrle_cache_miss_prev) / page_count;
  rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
  }
+
+if (migrate_use_compression()) {
+compression_counters.busy_rate = (double)(compression_counters.busy -
+rs->compress_thread_busy_prev) / page_count;


So this is related to the previous patch - I still doubt its
correctness if page_count is the host pages count rather than the
guest pages'.  Other than that the patch looks good to me.


I think I can boldly treat that as your Reviewed-by then. :)



Re: [Qemu-devel] [PATCH v3 08/10] migration: handle the error condition properly

2018-08-08 Thread Xiao Guangrong




On 08/08/2018 10:11 PM, Dr. David Alan Gilbert wrote:

* Xiao Guangrong (guangrong.x...@gmail.com) wrote:



On 08/08/2018 01:08 PM, Peter Xu wrote:

On Tue, Aug 07, 2018 at 05:12:07PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

ram_find_and_save_block() can return negative if any error happens;
however, it is completely ignored in the current code


Could you hint me where we'll return an error?



I think control_save_page() may return an error condition, but I am not
familiar with that code... Other places look safe _currently_. These functions were
designed to return errors anyway.


ram_control_save_page's return is checked by control_save_page which
returns true/false but sets *pages to a return value.

What I'd need to follow closely is the case where ram_control_save_page
returns RAM_SAVE_CONTROL_DELAYED, in that case control_save_page I think
returns with *pages=-1 and returns true.
And I think in that case ram_save_target_page can leak that -1 - hmm.

Now, ram_save_host_page already checks for <0 and will return that,
but I think that would potentially loop in ram_find_and_save_block; I'm
not sure we want to change that or not!


ram_find_and_save_block() will continue the loop only if ram_save_host_page()
returns zero:

..
if (found) {
pages = ram_save_host_page(rs, , last_stage);
}
} while (!pages && again);

IMHO, how to change the code really depends on the semantics of these functions;
based on their comments, returning error conditions is the current
semantics.

Another choice would be to squash error conditions into the QEMUFile and
adapt the comments and code of these functions to reflect the new semantics
clearly.

Which one do you guys prefer? :)




Re: [Qemu-devel] [PATCH v3 08/10] migration: handle the error condition properly

2018-08-08 Thread Xiao Guangrong




On 08/08/2018 02:56 PM, Peter Xu wrote:

On Wed, Aug 08, 2018 at 02:29:52PM +0800, Xiao Guangrong wrote:



On 08/08/2018 01:08 PM, Peter Xu wrote:

On Tue, Aug 07, 2018 at 05:12:07PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

ram_find_and_save_block() can return negative if any error happens;
however, it is completely ignored in the current code


Could you hint me where we'll return an error?



I think control_save_page() may return an error condition, but I am not
familiar with that code... Other places look safe _currently_. These functions were
designed to return errors anyway.


Ah, the RDMA codes...

Then I feel like this patch would be more suitable to be put into some
of the RDMA series - at least we'd better be clear about what errors
we're going to capture.  For non-RDMA, it seems a bit helpless after
all - AFAIU we're depending on the few qemu_file_get_error() calls to
detect output errors.


So, are you talking about modifying the semantics of these functions,
ram_save_host_page(), ram_save_target_page(), etc., to make them
be:
"Returns the number of pages written, where zero means no dirty pages;
error conditions are indicated in the QEMUFile @rs->file if one
happened."

If that is what you want, I will update the comments and make the
implementation clearer to reflect this fact for these
functions.
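
A rough draft of how such a comment could read (wording only, not the final text):

/**
 * ram_save_host_page: save a whole host page
 *
 * Returns the number of target pages written, where zero means no
 * dirty pages were found.  Error conditions are not returned to the
 * caller directly; they are recorded in the QEMUFile (@rs->f), so the
 * caller is expected to check qemu_file_get_error() when it needs to
 * stop the iteration early.
 */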



Re: [Qemu-devel] [PATCH v3 09/10] migration: fix calculating xbzrle_counters.cache_miss_rate

2018-08-08 Thread Xiao Guangrong




On 08/08/2018 02:05 PM, Peter Xu wrote:

On Tue, Aug 07, 2018 at 05:12:08PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

As Peter pointed out:
| - xbzrle_counters.cache_miss is done in save_xbzrle_page(), so it's
|   per-guest-page granularity
|
| - RAMState.iterations is done for each ram_find_and_save_block(), so
|   it's per-host-page granularity
|
| An example is that when we migrate a 2M huge page in the guest, we
| will only increase the RAMState.iterations by 1 (since
| ram_find_and_save_block() will be called once), but we might increase
| xbzrle_counters.cache_miss for 2M/4K=512 times (we'll call
| save_xbzrle_page() that many times) if all the pages got cache miss.
| Then IMHO the cache miss rate will be 512/1=51200% (while it should
| actually be just 100% cache miss).

And he also suggested as xbzrle_counters.cache_miss_rate is the only
user of rs->iterations we can adapt it to count guest page numbers

After that, rename 'iterations' to 'handle_pages' to better reflect
its meaning

Suggested-by: Peter Xu 
Signed-off-by: Xiao Guangrong 
---
  migration/ram.c | 18 +-
  1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 09be01dca2..bd7c18d1f9 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -300,10 +300,10 @@ struct RAMState {
  uint64_t num_dirty_pages_period;
  /* xbzrle misses since the beginning of the period */
  uint64_t xbzrle_cache_miss_prev;
-/* number of iterations at the beginning of period */
-uint64_t iterations_prev;
-/* Iterations since start */
-uint64_t iterations;
+/* total handled pages at the beginning of period */
+uint64_t handle_pages_prev;
+/* total handled pages since start */
+uint64_t handle_pages;


The name is not that straightforward to me.  I would think about
"[guest|host]_page_count" or something better, or we just keep the old
naming but with a better comment would be fine too.


The field actually indicates the total pages (target pages, more precisely)
handled during live migration. 'iterations' confuses us completely.

Is target_page_count good for you?




  /* number of dirty bits in the bitmap */
  uint64_t migration_dirty_pages;
  /* last dirty_sync_count we have seen */
@@ -1587,19 +1587,19 @@ uint64_t ram_pagesize_summary(void)
  
  static void migration_update_rates(RAMState *rs, int64_t end_time)

  {
-uint64_t iter_count = rs->iterations - rs->iterations_prev;
+uint64_t page_count = rs->handle_pages - rs->handle_pages_prev;
  
  /* calculate period counters */

  ram_counters.dirty_pages_rate = rs->num_dirty_pages_period * 1000
  / (end_time - rs->time_last_bitmap_sync);
  
-if (!iter_count) {

+if (!page_count) {
  return;
  }
  
  if (migrate_use_xbzrle()) {

  xbzrle_counters.cache_miss_rate = (double)(xbzrle_counters.cache_miss 
-
-rs->xbzrle_cache_miss_prev) / iter_count;
+rs->xbzrle_cache_miss_prev) / page_count;
  rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
  }
  }
@@ -1657,7 +1657,7 @@ static void migration_bitmap_sync(RAMState *rs)
  
  migration_update_rates(rs, end_time);
  
-rs->iterations_prev = rs->iterations;

+rs->handle_pages_prev = rs->handle_pages;
  
  /* reset period counters */

  rs->time_last_bitmap_sync = end_time;
@@ -3209,7 +3209,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
  break;
  }
  
-rs->iterations++;

+rs->handle_pages += pages;


So it's still counting host pages, is this your intention to only
change the name in the patch?


Hmm... isn't the value returned by ram_find_and_save_block() the total
number of target pages posted out?

/**
 * ram_find_and_save_block: finds a dirty page and sends it to f
 *
 * Called within an RCU critical section.
 *
 * Returns the number of pages written where zero means no dirty pages,
 * or negative on error
...

 *
 * On systems where host-page-size > target-page-size it will send all the
 * pages in a host page that are dirty.
 */



Re: [Qemu-devel] [PATCH v3 08/10] migration: handle the error condition properly

2018-08-08 Thread Xiao Guangrong




On 08/08/2018 01:08 PM, Peter Xu wrote:

On Tue, Aug 07, 2018 at 05:12:07PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

ram_find_and_save_block() can return negative if any error happens;
however, it is completely ignored in the current code


Could you hint me where we'll return an error?



I think control_save_page() may return an error condition, but I am not
familiar with that code... Other places look safe _currently_. These functions were
designed to return errors anyway.


(Anyway I agree that the error handling is not that good, mostly
  because the QEMUFile APIs does not provide proper return code, e.g.,
  qemu_put_be64 returns void)



Yes, it is; the returned error condition is mixed between the file's API and
the function's return value... :(




Re: [Qemu-devel] [PATCH v3 07/10] migration: do not flush_compressed_data at the end of each iteration

2018-08-08 Thread Xiao Guangrong




On 08/08/2018 12:52 PM, Peter Xu wrote:

On Tue, Aug 07, 2018 at 05:12:06PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

flush_compressed_data() needs to wait for all compression threads to
finish their work; after that all threads are free until the
migration feeds new requests to them, so reducing its calls can improve
the throughput and use CPU resources more effectively.

We do not need to flush all threads at the end of each iteration; the
data can be kept locally until the memory block is changed or
memory migration starts over, in which case we will meet a dirtied
page which may still exist in a compression thread's ring.

Signed-off-by: Xiao Guangrong 
---
  migration/ram.c | 14 +-
  1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index 99ecf9b315..55966bc2c1 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -306,6 +306,8 @@ struct RAMState {
  uint64_t iterations;
  /* number of dirty bits in the bitmap */
  uint64_t migration_dirty_pages;
+/* last dirty_sync_count we have seen */
+uint64_t dirty_sync_count_prev;
  /* protects modification of the bitmap */
  QemuMutex bitmap_mutex;
  /* The RAMBlock used in the last src_page_requests */
@@ -3173,6 +3175,17 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
  
  ram_control_before_iterate(f, RAM_CONTROL_ROUND);
  
+/*

+ * if memory migration starts over, we will meet a dirtied page which
+ * may still exists in compression threads's ring, so we should flush
+ * the compressed data to make sure the new page is not overwritten by
+ * the old one in the destination.
+ */
+if (ram_counters.dirty_sync_count != rs->dirty_sync_count_prev) {
+rs->dirty_sync_count_prev = ram_counters.dirty_sync_count;
+flush_compressed_data(rs);


AFAIU this only happens when ram_save_pending() calls
migration_bitmap_sync().  Could we just simply flush there?  Then we
can avoid that new variable.



Yup, that's better indeed, will do it.
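
For the record, a possible shape of that change (a sketch only, not the final patch;
it assumes it is called from migration_bitmap_sync() right after
ram_counters.dirty_sync_count is bumped):

/*
 * Flush the compression threads whenever a new bitmap sync round starts,
 * instead of at the end of every ram_save_iterate() call.
 */
static void compress_flush_on_new_round(RAMState *rs)
{
    if (!migrate_use_compression()) {
        return;
    }
    /*
     * Pages queued before this point may be dirtied again in the new
     * round; push them out so that a newer copy of a page can never be
     * overwritten by an older one still sitting in a thread's buffer.
     */
    flush_compressed_data(rs);
}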



Re: [Qemu-devel] [PATCH v3 01/10] migration: do not wait for free thread

2018-08-08 Thread Xiao Guangrong




On 08/08/2018 11:51 AM, Peter Xu wrote:

On Tue, Aug 07, 2018 at 08:29:54AM -0500, Eric Blake wrote:

On 08/07/2018 04:12 AM, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Instead of putting the main thread to sleep state to wait for
free compression thread, we can directly post it out as normal
page that reduces the latency and uses CPUs more efficiently

A parameter, compress-wait-thread, is introduced, it can be
enabled if the user really wants the old behavior

Signed-off-by: Xiao Guangrong 
---



+++ b/qapi/migration.json
@@ -462,6 +462,11 @@
   # @compress-threads: Set compression thread count to be used in live 
migration,
   #  the compression thread count is an integer between 1 and 255.
   #
+# @compress-wait-thread: Wait if no thread is free to compress the memory page
+#  if it's enabled, otherwise, the page will be posted out immediately
+#  in the main thread without compression. It's true on default.
+#  (Since: 3.1)


Grammar suggestion:

@compress-wait-thread: Controls behavior when all compression threads are
currently busy. If true (default), wait for a free compression thread to
become available; otherwise, send the page uncompressed. (Since 3.1)


Eric's version seems better.  With that:


It's good to me too. Will apply Eric's suggestion in the next version.



Reviewed-by: Peter Xu 


Thank you, Peter and Eric.



Re: [Qemu-devel] [PATCH v2 1/8] migration: do not wait for free thread

2018-07-24 Thread Xiao Guangrong




On 07/24/2018 02:36 AM, Eric Blake wrote:

On 07/19/2018 07:15 AM, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Instead of putting the main thread to sleep state to wait for
free compression thread, we can directly post it out as normal
page that reduces the latency and uses CPUs more efficiently

A parameter, compress-wait-thread, is introduced, it can be
enabled if the user really wants the old behavior

Signed-off-by: Xiao Guangrong 
---
  hmp.c |  8 
  migration/migration.c | 21 +
  migration/migration.h |  1 +
  migration/ram.c   | 45 ++---
  qapi/migration.json   | 23 ++-
  5 files changed, 74 insertions(+), 24 deletions(-)




+++ b/qapi/migration.json
@@ -462,6 +462,11 @@
  # @compress-threads: Set compression thread count to be used in live 
migration,
  #  the compression thread count is an integer between 1 and 255.
  #
+# @compress-wait-thread: Wait if no thread is free to compress the memory page
+#  if it's enabled, otherwise, the page will be posted out immediately
+#  in the main thread without compression. It's off on default.
+#  (Since: 3.0)


Is this a bug fix? It's awfully late in the release cycle to be adding new 
features; is this something that we can live without until 3.1?



It's a performance improvement; I think it is not urgent. :)



Re: [Qemu-devel] [PATCH v2 6/8] migration: move handle of zero page to the thread

2018-07-24 Thread Xiao Guangrong




On 07/23/2018 05:40 PM, Peter Xu wrote:

On Mon, Jul 23, 2018 at 04:44:49PM +0800, Xiao Guangrong wrote:

[...]



However, it is not safe to do ram_release_pages() in the thread as it is
not protected against multiple threads. Fortunately, compression will be disabled
if it switches to post-copy, so I preferred to keep the current behavior and
defer fixing it until after this patchset has been merged.


Do you mean ram_release_pages() is not thread-safe?  Why?  I didn't
notice it before but I feel like it is safe.


bitmap_clear() called in the function is not safe.


Yeah, and the funny thing is that I don't think ram_release_pages()
should even touch the receivedmap...  It's possible that the
release-ram feature for postcopy is broken.

Never mind on that.  I'll post a patch to fix it, then I think the
ram_release_pages() will be thread safe.

Then this patch shouldn't be affected and it should be fine after that
fix


That would be great, thanks for your work.





Re: [Qemu-devel] [PATCH v2 3/8] migration: show the statistics of compression

2018-07-24 Thread Xiao Guangrong




On 07/23/2018 05:15 PM, Peter Xu wrote:

On Mon, Jul 23, 2018 at 04:40:29PM +0800, Xiao Guangrong wrote:



On 07/23/2018 04:05 PM, Peter Xu wrote:

On Mon, Jul 23, 2018 at 03:39:18PM +0800, Xiao Guangrong wrote:



On 07/23/2018 12:36 PM, Peter Xu wrote:

On Thu, Jul 19, 2018 at 08:15:15PM +0800, guangrong.x...@gmail.com wrote:

@@ -1597,6 +1608,24 @@ static void migration_update_rates(RAMState *rs, int64_t 
end_time)
rs->xbzrle_cache_miss_prev) / iter_count;
rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
}
+
+if (migrate_use_compression()) {
+uint64_t comp_pages;
+
+compression_counters.busy_rate = (double)(compression_counters.busy -
+rs->compress_thread_busy_prev) / iter_count;


Here I'm not sure it's correct...

"iter_count" stands for ramstate.iterations.  It's increased per
ram_find_and_save_block(), so IMHO it might contain multiple guest


ram_find_and_save_block() returns once a page is successfully posted, and
it only posts 1 page out at a time.


ram_find_and_save_block() calls ram_save_host_page(), and we should be
sending multiple guest pages in ram_save_host_page() if the host page
is a huge page?



You're right, thank you for pointing it out.

So, how about introducing a field, posted_pages, into RAMState that is used
to track the total pages posted out?

Then we will use this field to replace 'iter_count' for compression, and use
'RAMState.posted_pages - ram_counters.duplicate' to calculate
xbzrle_cache_miss, as zero pages are not handled by xbzrle.

Or introduce a new function, total_posted_pages, which returns the
sum of all page counts:

static uint64_t total_posted_pages(void)
{
    return ram_counters.normal + ram_counters.duplicate +
           compression_counters.pages + xbzrle_counters.pages;
}

but that would add a bit more complexity...


If below [1] is wrong too, then I'm thinking whether we could just
move the rs->iterations++ from ram_save_iterate() loop to
ram_save_host_page() loop, then we possibly fix both places...



Besides that, even if we fix iterations, xbzrle is still painful, as
zero pages should not be counted in the cache misses, i.e., xbzrle does
not handle zero pages at all.

Anyway, fixing iterations is a good start. :)


After all I don't see any other usages of rs->iterations so it seems
fine.  Dave/Juan?






pages.  However compression_counters.busy should be per guest page.



Actually, it's derived from xbzrle_counters.cache_miss_rate:
  xbzrle_counters.cache_miss_rate = (double)(xbzrle_counters.cache_miss 
-
  rs->xbzrle_cache_miss_prev) / iter_count;


Then this is suspecious to me too...


[1]






+rs->compress_thread_busy_prev = compression_counters.busy;
+
+comp_pages = compression_counters.pages - rs->compress_pages_prev;
+if (comp_pages) {
+compression_counters.compression_rate =
+(double)(compression_counters.reduced_size -
+rs->compress_reduced_size_prev) /
+(comp_pages * TARGET_PAGE_SIZE);
+rs->compress_pages_prev = compression_counters.pages;
+rs->compress_reduced_size_prev = compression_counters.reduced_size;
+}
+}
}
static void migration_bitmap_sync(RAMState *rs)
@@ -1872,6 +1901,9 @@ static void flush_compressed_data(RAMState *rs)
qemu_mutex_lock(_param[idx].mutex);
if (!comp_param[idx].quit) {
len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
+/* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
+compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8;


I would agree with Dave here - why we store the "reduced size" instead
of the size of the compressed data (which I think should be len - 8)?



len - 8 is the size of the data after compression, rather than the amount saved
by compression; with the former it is not straightforward for the user to see how
much improvement comes from applying compression.

Hmm... but it is not a big deal to me... :)


Yeah it might be a personal preference indeed. :)

It's just natural to do that this way for me since AFAIU the
compression ratio is defined as:

    compression ratio = compressed data size / original data size



Er, we do it as follows:
 compression_counters.compression_rate =
 (double)(compression_counters.reduced_size -
 rs->compress_reduced_size_prev) /
 (comp_pages * TARGET_PAGE_SIZE);

We use reduced_size, i.e.:

    compression ratio = (original data size - compressed data size) / original data size

for example, for 100 bytes of raw data, if we posted 99 bytes out, then
the compression ratio should be 1%.


I think it should 

Re: [Qemu-devel] [PATCH v2 8/8] migration: do not flush_compressed_data at the end of each iteration

2018-07-24 Thread Xiao Guangrong




On 07/23/2018 05:01 PM, Peter Xu wrote:


Yes, it's sufficient for the current thread model; I will drop it for now
and add it back when the lockless multithread model is applied. :)


Ah I think I see your point.  Even if so I would think it better to do
any extra cleanup directly in compress_threads_save_cleanup() if
possible.



Okay, got it.



Re: [Qemu-devel] [PATCH v2 8/8] migration: do not flush_compressed_data at the end of each iteration

2018-07-23 Thread Xiao Guangrong




On 07/23/2018 04:35 PM, Peter Xu wrote:

On Mon, Jul 23, 2018 at 04:05:21PM +0800, Xiao Guangrong wrote:



On 07/23/2018 01:49 PM, Peter Xu wrote:

On Thu, Jul 19, 2018 at 08:15:20PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

flush_compressed_data() needs to wait for all compression threads to
finish their work; after that all threads are free until the
migration feeds new requests to them, so reducing its calls can improve
the throughput and use CPU resources more effectively.

We do not need to flush all threads at the end of each iteration; the
data can be kept locally until the memory block is changed or
memory migration starts over, in which case we will meet a dirtied
page which may still exist in a compression thread's ring.

Signed-off-by: Xiao Guangrong 
---
   migration/ram.c | 15 ++-
   1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index 89305c7af5..fdab13821d 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -315,6 +315,8 @@ struct RAMState {
   uint64_t iterations;
   /* number of dirty bits in the bitmap */
   uint64_t migration_dirty_pages;
+/* last dirty_sync_count we have seen */
+uint64_t dirty_sync_count;


Better suffix it with "_prev" as well?  So that we can quickly
identify that it's only a cache and it can be different from the one
in the ram_counters.


Indeed, will update it.




   /* protects modification of the bitmap */
   QemuMutex bitmap_mutex;
   /* The RAMBlock used in the last src_page_requests */
@@ -2532,6 +2534,7 @@ static void ram_save_cleanup(void *opaque)
   }
   xbzrle_cleanup();
+flush_compressed_data(*rsp);


Could I ask why do we need this considering that we have
compress_threads_save_cleanup() right down there?


Dave asked about it too. :(

"This is for the error condition: if any error occurred during live migration,
there is no chance to call ram_save_complete(). After switching to the lockless
multithread model, we assert that all requests have been handled before destroying
the work threads."

That makes sure there is nothing left in the threads before doing
compress_threads_save_cleanup(), as in the current behavior. For the lockless
multithread model, we check that all requests are free before destroying
them.


But why do we need to explicitly flush it here?  Now in
compress_threads_save_cleanup() we have qemu_fclose() on the buffers,
which logically will flush the data and clean up everything too.
Would that suffice?



Yes, it's sufficient for the current thread model; I will drop it for now
and add it back when the lockless multithread model is applied. :)




Re: [Qemu-devel] [PATCH v2 6/8] migration: move handle of zero page to the thread

2018-07-23 Thread Xiao Guangrong




On 07/23/2018 04:28 PM, Peter Xu wrote:

On Mon, Jul 23, 2018 at 03:56:33PM +0800, Xiao Guangrong wrote:

[...]


@@ -2249,15 +2308,8 @@ static int ram_save_target_page(RAMState *rs, 
PageSearchStatus *pss,
   return res;
   }
-/*
- * When starting the process of a new block, the first page of
- * the block should be sent out before other pages in the same
- * block, and all the pages in last block should have been sent
- * out, keeping this order is important, because the 'cont' flag
- * is used to avoid resending the block name.
- */
-if (block != rs->last_sent_block && save_page_use_compression(rs)) {
-flush_compressed_data(rs);
+if (save_compress_page(rs, block, offset)) {
+return 1;


It's a bit tricky (though it seems to be a good idea too) to move the
zero detect into the compression thread, though I noticed that we also
do something else for zero pages:

  res = save_zero_page(rs, block, offset);
  if (res > 0) {
  /* Must let xbzrle know, otherwise a previous (now 0'd) cached
   * page would be stale
   */
  if (!save_page_use_compression(rs)) {
  XBZRLE_cache_lock();
  xbzrle_cache_zero_page(rs, block->offset + offset);
  XBZRLE_cache_unlock();
  }
  ram_release_pages(block->idstr, offset, res);
  return res;
  }

I'd guess that the xbzrle update of the zero page is not needed for
compression since after all xbzrle is not enabled when compression is


Yup. If they are both enabled, compression works only for the first
iteration (i.e., ram_bulk_stage); at that point, nothing is cached
in xbzrle's cache, in other words, xbzrle has posted nothing to the
destination.


enabled, however do we need to do ram_release_pages() somehow?



We have done it in the thread:

+static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
                                  ram_addr_t offset, uint8_t *source_buf)
 {
...
+    if (save_zero_page_to_file(rs, f, block, offset)) {
+        zero_page = true;
+        goto exit;
+    }
...
+exit:
     ram_release_pages(block->idstr, offset & TARGET_PAGE_MASK, 1);
+    return zero_page;
 }


Ah, then it seems fine.  Though I'd suggest you comment these into the
commit message in case people won't get it easily.



Okay, I will update the commit log to address your comments.



However, it is not safe to do ram_release_pages() in the thread as it is
not protected against multiple threads. Fortunately, compression will be disabled
if it switches to post-copy, so I preferred to keep the current behavior and
defer fixing it until after this patchset has been merged.


Do you mean ram_release_pages() is not thread-safe?  Why?  I didn't
notice it before but I feel like it is safe.


bitmap_clear() called in the function is not safe.




Re: [Qemu-devel] [PATCH v2 3/8] migration: show the statistics of compression

2018-07-23 Thread Xiao Guangrong




On 07/23/2018 04:05 PM, Peter Xu wrote:

On Mon, Jul 23, 2018 at 03:39:18PM +0800, Xiao Guangrong wrote:



On 07/23/2018 12:36 PM, Peter Xu wrote:

On Thu, Jul 19, 2018 at 08:15:15PM +0800, guangrong.x...@gmail.com wrote:

@@ -1597,6 +1608,24 @@ static void migration_update_rates(RAMState *rs, int64_t 
end_time)
   rs->xbzrle_cache_miss_prev) / iter_count;
   rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
   }
+
+if (migrate_use_compression()) {
+uint64_t comp_pages;
+
+compression_counters.busy_rate = (double)(compression_counters.busy -
+rs->compress_thread_busy_prev) / iter_count;


Here I'm not sure it's correct...

"iter_count" stands for ramstate.iterations.  It's increased per
ram_find_and_save_block(), so IMHO it might contain multiple guest


ram_find_and_save_block() returns once a page is successfully posted, and
it only posts 1 page out at a time.


ram_find_and_save_block() calls ram_save_host_page(), and we should be
sending multiple guest pages in ram_save_host_page() if the host page
is a huge page?



You're right, thank you for pointing it out.

So, how about introducing a field, posted_pages, into RAMState that is used
to track the total pages posted out?

Then we will use this field to replace 'iter_count' for compression, and use
'RAMState.posted_pages - ram_counters.duplicate' to calculate
xbzrle_cache_miss, as zero pages are not handled by xbzrle.

Or introduce a new function, total_posted_pages, which returns the
sum of all page counts:

   static uint64_t total_posted_pages(void)
   {
       return ram_counters.normal + ram_counters.duplicate +
              compression_counters.pages + xbzrle_counters.pages;
   }

but that would add a bit more complexity...




pages.  However compression_counters.busy should be per guest page.



Actually, it's derived from xbzrle_counters.cache_miss_rate:
 xbzrle_counters.cache_miss_rate = (double)(xbzrle_counters.cache_miss -
 rs->xbzrle_cache_miss_prev) / iter_count;


Then this is suspecious to me too...




+rs->compress_thread_busy_prev = compression_counters.busy;
+
+comp_pages = compression_counters.pages - rs->compress_pages_prev;
+if (comp_pages) {
+compression_counters.compression_rate =
+(double)(compression_counters.reduced_size -
+rs->compress_reduced_size_prev) /
+(comp_pages * TARGET_PAGE_SIZE);
+rs->compress_pages_prev = compression_counters.pages;
+rs->compress_reduced_size_prev = compression_counters.reduced_size;
+}
+}
   }
   static void migration_bitmap_sync(RAMState *rs)
@@ -1872,6 +1901,9 @@ static void flush_compressed_data(RAMState *rs)
   qemu_mutex_lock(_param[idx].mutex);
   if (!comp_param[idx].quit) {
   len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
+/* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
+compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8;


I would agree with Dave here - why we store the "reduced size" instead
of the size of the compressed data (which I think should be len - 8)?



len - 8 is the size of the data after compression, rather than the amount saved
by compression; with the former it is not straightforward for the user to see how
much improvement comes from applying compression.

Hmm... but it is not a big deal to me... :)


Yeah it might be a personal preference indeed. :)

It's just natural to do that this way for me since AFAIU the
compression ratio is defined as:

    compression ratio = compressed data size / original data size



Er, we do it as follows:
compression_counters.compression_rate =
(double)(compression_counters.reduced_size -
rs->compress_reduced_size_prev) /
(comp_pages * TARGET_PAGE_SIZE);

We use reduced_size, i.e.:

    compression ratio = (original data size - compressed data size) / original data size

for example, for 100 bytes of raw data, if we posted 99 bytes out, then
the compression ratio should be 1%.

So if I understand correctly, the reduced_size is really what you want? :)
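
To make the two definitions concrete with the 100-byte example above, here is a tiny
stand-alone illustration (pure arithmetic, not QEMU code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* 100 bytes of raw data, 99 bytes actually posted out */
    uint64_t original = 100, compressed = 99;

    /* "compressed / original" style of ratio */
    printf("compressed/original: %.2f\n", (double)compressed / original);  /* 0.99 */

    /* "reduced size / original" style of ratio, as reported by the patch */
    printf("reduced/original   : %.0f%%\n",
           100.0 * (original - compressed) / original);                    /* 1% */
    return 0;
}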




Re: [Qemu-devel] [PATCH v2 8/8] migration: do not flush_compressed_data at the end of each iteration

2018-07-23 Thread Xiao Guangrong




On 07/23/2018 01:49 PM, Peter Xu wrote:

On Thu, Jul 19, 2018 at 08:15:20PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

flush_compressed_data() needs to wait for all compression threads to
finish their work; after that all threads are free until the
migration feeds new requests to them, so reducing its calls can improve
the throughput and use CPU resources more effectively.

We do not need to flush all threads at the end of each iteration; the
data can be kept locally until the memory block is changed or
memory migration starts over, in which case we will meet a dirtied
page which may still exist in a compression thread's ring.

Signed-off-by: Xiao Guangrong 
---
  migration/ram.c | 15 ++-
  1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index 89305c7af5..fdab13821d 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -315,6 +315,8 @@ struct RAMState {
  uint64_t iterations;
  /* number of dirty bits in the bitmap */
  uint64_t migration_dirty_pages;
+/* last dirty_sync_count we have seen */
+uint64_t dirty_sync_count;


Better suffix it with "_prev" as well?  So that we can quickly
identify that it's only a cache and it can be different from the one
in the ram_counters.


Indeed, will update it.




  /* protects modification of the bitmap */
  QemuMutex bitmap_mutex;
  /* The RAMBlock used in the last src_page_requests */
@@ -2532,6 +2534,7 @@ static void ram_save_cleanup(void *opaque)
  }
  
  xbzrle_cleanup();

+flush_compressed_data(*rsp);


Could I ask why do we need this considering that we have
compress_threads_save_cleanup() right down there?


Dave asked about it too. :(

"This is for the error condition: if any error occurred during live migration,
there is no chance to call ram_save_complete(). After switching to the lockless
multithread model, we assert that all requests have been handled before destroying
the work threads."

That makes sure there is nothing left in the threads before doing
compress_threads_save_cleanup(), as in the current behavior. For the lockless
multithread model, we check that all requests are free before destroying
them.




  compress_threads_save_cleanup();
  ram_state_cleanup(rsp);
  }
@@ -3203,6 +3206,17 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
  
  ram_control_before_iterate(f, RAM_CONTROL_ROUND);
  
+/*

+ * if memory migration starts over, we will meet a dirtied page which
+ * may still exists in compression threads's ring, so we should flush
+ * the compressed data to make sure the new page is not overwritten by
+ * the old one in the destination.
+ */
+if (ram_counters.dirty_sync_count != rs->dirty_sync_count) {
+rs->dirty_sync_count = ram_counters.dirty_sync_count;
+flush_compressed_data(rs);
+}
+
  t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
  i = 0;
  while ((ret = qemu_file_rate_limit(f)) == 0 ||
@@ -3235,7 +3249,6 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
  }
  i++;
  }
-flush_compressed_data(rs);


This looks sane to me, but I'd like to see how other people would
think about it too...


Thank you a lot, Peter! :)




Re: [Qemu-devel] [PATCH v2 6/8] migration: move handle of zero page to the thread

2018-07-23 Thread Xiao Guangrong




On 07/23/2018 01:03 PM, Peter Xu wrote:

On Thu, Jul 19, 2018 at 08:15:18PM +0800, guangrong.x...@gmail.com wrote:

[...]


@@ -1950,12 +1971,16 @@ retry:
  set_compress_params(_param[idx], block, offset);
  qemu_cond_signal(_param[idx].cond);
  qemu_mutex_unlock(_param[idx].mutex);
-pages = 1;
-/* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
-compression_counters.reduced_size += TARGET_PAGE_SIZE -
- bytes_xmit + 8;
-compression_counters.pages++;
  ram_counters.transferred += bytes_xmit;
+pages = 1;


(moving of this line seems irrelevant; meanwhile more duplicated codes
so even better to have a helper now)



Good to me. :)


+if (comp_param[idx].zero_page) {
+ram_counters.duplicate++;
+} else {
+/* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
+compression_counters.reduced_size += TARGET_PAGE_SIZE -
+ bytes_xmit + 8;
+compression_counters.pages++;
+}
  break;
  }
  }


[...]


@@ -2249,15 +2308,8 @@ static int ram_save_target_page(RAMState *rs, 
PageSearchStatus *pss,
  return res;
  }
  
-/*

- * When starting the process of a new block, the first page of
- * the block should be sent out before other pages in the same
- * block, and all the pages in last block should have been sent
- * out, keeping this order is important, because the 'cont' flag
- * is used to avoid resending the block name.
- */
-if (block != rs->last_sent_block && save_page_use_compression(rs)) {
-flush_compressed_data(rs);
+if (save_compress_page(rs, block, offset)) {
+return 1;


It's a bit tricky (though it seems to be a good idea too) to move the
zero detect into the compression thread, though I noticed that we also
do something else for zero pages:

 res = save_zero_page(rs, block, offset);
 if (res > 0) {
 /* Must let xbzrle know, otherwise a previous (now 0'd) cached
  * page would be stale
  */
 if (!save_page_use_compression(rs)) {
 XBZRLE_cache_lock();
 xbzrle_cache_zero_page(rs, block->offset + offset);
 XBZRLE_cache_unlock();
 }
 ram_release_pages(block->idstr, offset, res);
 return res;
 }

I'd guess that the xbzrle update of the zero page is not needed for
compression since after all xbzrle is not enabled when compression is


Yup. If they are both enabled, compression works only for the first
iteration (i.e., ram_bulk_stage); at that point, nothing is cached
in xbzrle's cache, in other words, xbzrle has posted nothing to the
destination.


enabled, however do we need to do ram_release_pages() somehow?



We have done it in the thread:

+static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
                                  ram_addr_t offset, uint8_t *source_buf)
 {
...
+    if (save_zero_page_to_file(rs, f, block, offset)) {
+        zero_page = true;
+        goto exit;
+    }
...
+exit:
     ram_release_pages(block->idstr, offset & TARGET_PAGE_MASK, 1);
+    return zero_page;
 }

However, it is not safe to do ram_release_pages() in the thread as it is
not protected against multiple threads. Fortunately, compression will be disabled
if it switches to post-copy, so I preferred to keep the current behavior and
defer fixing it until after this patchset has been merged.



Re: [Qemu-devel] [PATCH v2 3/8] migration: show the statistics of compression

2018-07-23 Thread Xiao Guangrong




On 07/23/2018 12:36 PM, Peter Xu wrote:

On Thu, Jul 19, 2018 at 08:15:15PM +0800, guangrong.x...@gmail.com wrote:

@@ -1597,6 +1608,24 @@ static void migration_update_rates(RAMState *rs, int64_t 
end_time)
  rs->xbzrle_cache_miss_prev) / iter_count;
  rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
  }
+
+if (migrate_use_compression()) {
+uint64_t comp_pages;
+
+compression_counters.busy_rate = (double)(compression_counters.busy -
+rs->compress_thread_busy_prev) / iter_count;


Here I'm not sure it's correct...

"iter_count" stands for ramstate.iterations.  It's increased per
ram_find_and_save_block(), so IMHO it might contain multiple guest


ram_find_and_save_block() returns once a page is successfully posted, and
it only posts 1 page out at a time.


pages.  However compression_counters.busy should be per guest page.



Actually, it's derived from xbzrle_counters.cache_miss_rate:
xbzrle_counters.cache_miss_rate = (double)(xbzrle_counters.cache_miss -
rs->xbzrle_cache_miss_prev) / iter_count;


+rs->compress_thread_busy_prev = compression_counters.busy;
+
+comp_pages = compression_counters.pages - rs->compress_pages_prev;
+if (comp_pages) {
+compression_counters.compression_rate =
+(double)(compression_counters.reduced_size -
+rs->compress_reduced_size_prev) /
+(comp_pages * TARGET_PAGE_SIZE);
+rs->compress_pages_prev = compression_counters.pages;
+rs->compress_reduced_size_prev = compression_counters.reduced_size;
+}
+}
  }
  
  static void migration_bitmap_sync(RAMState *rs)

@@ -1872,6 +1901,9 @@ static void flush_compressed_data(RAMState *rs)
  qemu_mutex_lock(_param[idx].mutex);
  if (!comp_param[idx].quit) {
  len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
+/* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
+compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8;


I would agree with Dave here - why we store the "reduced size" instead
of the size of the compressed data (which I think should be len - 8)?



len - 8 is the size of the data after compression, rather than the amount saved
by compression; with the former it is not straightforward for the user to see how
much improvement comes from applying compression.

Hmm... but it is not a big deal to me... :)


Meanwhile, would a helper be nicer? Like:


Yup, that's nicer indeed.
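
Something along these lines, for example (the helper name and the exact bookkeeping
here are only a sketch, not the code that was eventually merged):

static void update_compress_counters(size_t bytes_xmit, bool zero_page)
{
    ram_counters.transferred += bytes_xmit;

    if (zero_page) {
        ram_counters.duplicate++;
        return;
    }

    /* 8 bytes is the page header with RAM_SAVE_FLAG_CONTINUE */
    compression_counters.reduced_size += TARGET_PAGE_SIZE - bytes_xmit + 8;
    compression_counters.pages++;
}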



Re: [Qemu-devel] [PATCH v2 1/8] migration: do not wait for free thread

2018-07-23 Thread Xiao Guangrong




On 07/23/2018 11:25 AM, Peter Xu wrote:

On Thu, Jul 19, 2018 at 08:15:13PM +0800, guangrong.x...@gmail.com wrote:

@@ -3113,6 +3132,8 @@ static Property migration_properties[] = {
  DEFINE_PROP_UINT8("x-compress-threads", MigrationState,
parameters.compress_threads,
DEFAULT_MIGRATE_COMPRESS_THREAD_COUNT),
+DEFINE_PROP_BOOL("x-compress-wait-thread", MigrationState,
+  parameters.compress_wait_thread, false),


This performance feature bit makes sense to me, but I would still
think it should be true by default to keep the old behavior:

- it might change the behavior drastically: we might be in a state
   between "normal" migration and "compressed" migration since we'll
   contain both of the pages.  Old compression users might not expect
   that.

- it might still even perform worse - an extreme case is that when
   network bandwidth is very very limited but instead we have plenty of
   CPU resources. [1]

So it's really a good tunable for me when people really needs to
understand what's it before turning it on.


That looks good to me.




  DEFINE_PROP_UINT8("x-decompress-threads", MigrationState,
parameters.decompress_threads,
DEFAULT_MIGRATE_DECOMPRESS_THREAD_COUNT),
diff --git a/migration/migration.h b/migration/migration.h
index 64a7b33735..a46b9e6c8d 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -271,6 +271,7 @@ bool migrate_use_return_path(void);
  bool migrate_use_compression(void);
  int migrate_compress_level(void);
  int migrate_compress_threads(void);
+int migrate_compress_wait_thread(void);
  int migrate_decompress_threads(void);
  bool migrate_use_events(void);
  bool migrate_postcopy_blocktime(void);
diff --git a/migration/ram.c b/migration/ram.c
index 52dd678092..0ad234c692 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1889,30 +1889,34 @@ static int compress_page_with_multi_thread(RAMState 
*rs, RAMBlock *block,
 ram_addr_t offset)
  {
  int idx, thread_count, bytes_xmit = -1, pages = -1;
+bool wait = migrate_compress_wait_thread();
  
  thread_count = migrate_compress_threads();

  qemu_mutex_lock(_done_lock);
-while (true) {
-for (idx = 0; idx < thread_count; idx++) {
-if (comp_param[idx].done) {
-comp_param[idx].done = false;
-bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
-qemu_mutex_lock(_param[idx].mutex);
-set_compress_params(_param[idx], block, offset);
-qemu_cond_signal(_param[idx].cond);
-qemu_mutex_unlock(_param[idx].mutex);
-pages = 1;
-ram_counters.normal++;
-ram_counters.transferred += bytes_xmit;
-break;
-}
-}
-if (pages > 0) {
+retry:
+for (idx = 0; idx < thread_count; idx++) {
+if (comp_param[idx].done) {
+comp_param[idx].done = false;
+bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
+qemu_mutex_lock(_param[idx].mutex);
+set_compress_params(_param[idx], block, offset);
+qemu_cond_signal(_param[idx].cond);
+qemu_mutex_unlock(_param[idx].mutex);
+pages = 1;
+ram_counters.normal++;
+ram_counters.transferred += bytes_xmit;
  break;
-} else {
-qemu_cond_wait(_done_cond, _done_lock);
  }
  }
+
+/*
+ * if there is no thread is free to compress the data and the user
+ * really expects the slowdown, wait it.


Considering [1] above, IMHO it might not really be a slow down but it
depends.  Maybe only mentioning about the fact that we're sending a
normal page instead of the compressed page if wait is not specified.



Okay, will update the comments based on your suggestion.


+ */
+if (pages < 0 && wait) {
+qemu_cond_wait(_done_cond, _done_lock);
+goto retry;
+}
  qemu_mutex_unlock(_done_lock);
  
  return pages;

@@ -2226,7 +2230,10 @@ static int ram_save_target_page(RAMState *rs, 
PageSearchStatus *pss,
   * CPU resource.
   */
  if (block == rs->last_sent_block && save_page_use_compression(rs)) {
-return compress_page_with_multi_thread(rs, block, offset);
+res = compress_page_with_multi_thread(rs, block, offset);
+if (res > 0) {
+return res;
+}
  } else if (migrate_use_multifd()) {
  return ram_save_multifd_page(rs, block, offset);
  }
diff --git a/qapi/migration.json b/qapi/migration.json
index 186e8a7303..b4f394844b 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -462,6 +462,11 @@
  # @compress-threads: Set compression thread count to be used in live 
migration,
  #  the compression thread count is an integer between 1 and 

Re: [Qemu-devel] [PATCH 06/12] migration: do not detect zero page for compression

2018-07-23 Thread Xiao Guangrong




On 07/23/2018 12:05 AM, Michael S. Tsirkin wrote:

On Wed, Jul 18, 2018 at 04:46:21PM +0800, Xiao Guangrong wrote:



On 07/17/2018 02:58 AM, Dr. David Alan Gilbert wrote:

* Xiao Guangrong (guangrong.x...@gmail.com) wrote:



On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote:

* Xiao Guangrong (guangrong.x...@gmail.com) wrote:


Hi Peter,

Sorry for the delay as i was busy on other things.

On 06/19/2018 03:30 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Detecting zero page is not a light work, we can disable it
for compression that can handle all zero data very well


Is there any number shows how the compression algo performs better
than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
be fast, depending on how init_accel() is done in util/bufferiszero.c.


This is the comparison between zero-detection and compression (the target
buffer is all zero bits):

Zero 810 ns Compression: 26905 ns.
Zero 417 ns Compression: 8022 ns.
Zero 408 ns Compression: 7189 ns.
Zero 400 ns Compression: 7255 ns.
Zero 412 ns Compression: 7016 ns.
Zero 411 ns Compression: 7035 ns.
Zero 413 ns Compression: 6994 ns.
Zero 399 ns Compression: 7024 ns.
Zero 416 ns Compression: 7053 ns.
Zero 405 ns Compression: 7041 ns.

Indeed, zero-detection is faster than compression.

However, during our profiling of the live_migration thread (after reverting this
patch), we noticed zero-detection costs lots of CPU:

12.01%  kqemu  qemu-system-x86_64[.] buffer_zero_sse2   




Interesting; what host are you running on?
Some hosts have support for the faster buffer_zero_ss4/avx2


The host is:

model name  : Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
...
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi
   mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art 
arch_perfmon pebs bts
   rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni 
pclmulqdq dtes64 monitor
   ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 
x2apic movbe popcnt
   tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch 
cpuid_fault epb cat_l3
   cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid 
fsgsbase tsc_adjust bmi1
   hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed 
adx smap clflushopt
   clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc 
cqm_occup_llc cqm_mbm_total
   cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req 
pku ospke

I checked and noticed "CONFIG_AVX2_OPT" has not been enabled; maybe it is due to
a too-old glibc/gcc version:
 gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC)
 glibc.x86_64 2.12


Yes, that's pretty old (RHEL6 ?) - I think you should get AVX2 in RHEL7.


Er, it is not easy to update glibc in the production env :(


But neither is QEMU updated in production all that easily. While we do
want to support older hosts functionally, it does not make
much sense to develop complex optimizations that only benefit
older hosts.



Cannot agree with you more. :)

So I benchmarked it on a production host with a newer distribution installed.
Here is the data:
 27.48%  kqemu  [kernel.kallsyms] [k] copy_user_enhanced_fast_string
 12.63%  kqemu  [kernel.kallsyms] [k] copy_page_rep
 10.82%  kqemu  qemu-system-x86_64[.] buffer_zero_avx2
  5.69%  kqemu  [kernel.kallsyms] [k] 
native_queued_spin_lock_slowpath
  4.61%  kqemu  qemu-system-x86_64[.] threads_submit_request_prepare
  4.39%  kqemu  qemu-system-x86_64[.] qemu_event_set
  4.12%  kqemu  qemu-system-x86_64[.] 
ram_find_and_save_block.part.24
  3.61%  kqemu  [kernel.kallsyms] [k] tcp_sendmsg
  2.62%  kqemu  libc-2.17.so  [.] __memcpy_ssse3_back
  1.89%  kqemu  qemu-system-x86_64[.] qemu_put_qemu_file
  1.32%  kqemu  qemu-system-x86_64[.] compress_thread_data_done

It does not help...




Re: [Qemu-devel] [PATCH 07/12] migration: hold the lock only if it is really needed

2018-07-18 Thread Xiao Guangrong




On 07/12/2018 04:26 PM, Peter Xu wrote:

On Thu, Jul 12, 2018 at 03:47:57PM +0800, Xiao Guangrong wrote:



On 07/11/2018 04:21 PM, Peter Xu wrote:

On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:



On 06/19/2018 03:36 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Try to hold src_page_req_mutex only if the queue is not
empty


Pure question: how much this patch would help?  Basically if you are
running compression tests then I think it means you are with precopy
(since postcopy cannot work with compression yet), then here the lock
has no contention at all.


Yes, you are right; however, we can observe it is among the top functions
(after reverting this patch):

Samples: 29K of event 'cycles', Event count (approx.): 22263412260
+   7.99%  kqemu  qemu-system-x86_64   [.] ram_bytes_total
+   6.95%  kqemu  [kernel.kallsyms][k] copy_user_enhanced_fast_string
+   6.23%  kqemu  qemu-system-x86_64   [.] qemu_put_qemu_file
+   6.20%  kqemu  qemu-system-x86_64   [.] qemu_event_set
+   5.80%  kqemu  qemu-system-x86_64   [.] __ring_put
+   4.82%  kqemu  qemu-system-x86_64   [.] compress_thread_data_done
+   4.11%  kqemu  qemu-system-x86_64   [.] ring_is_full
+   3.07%  kqemu  qemu-system-x86_64   [.] threads_submit_request_prepare
+   2.83%  kqemu  qemu-system-x86_64   [.] ring_mp_get
+   2.71%  kqemu  qemu-system-x86_64   [.] __ring_is_full
+   2.46%  kqemu  qemu-system-x86_64   [.] buffer_zero_sse2
+   2.40%  kqemu  qemu-system-x86_64   [.] add_to_iovec
+   2.21%  kqemu  qemu-system-x86_64   [.] ring_get
+   1.96%  kqemu  [kernel.kallsyms][k] __lock_acquire
+   1.90%  kqemu  libc-2.12.so [.] memcpy
+   1.55%  kqemu  qemu-system-x86_64   [.] ring_len
+   1.12%  kqemu  libpthread-2.12.so   [.] pthread_mutex_unlock
+   1.11%  kqemu  qemu-system-x86_64   [.] ram_find_and_save_block
+   1.07%  kqemu  qemu-system-x86_64   [.] ram_save_host_page
+   1.04%  kqemu  qemu-system-x86_64   [.] qemu_put_buffer
+   0.97%  kqemu  qemu-system-x86_64   [.] compress_page_with_multi_thread
+   0.96%  kqemu  qemu-system-x86_64   [.] ram_save_target_page
+   0.93%  kqemu  libpthread-2.12.so   [.] pthread_mutex_lock


(sorry to respond late; I was busy with other stuff for the
   release...)



You're welcome. :)


I am trying to find out anything related to unqueue_page() but I
failed.  Did I miss anything obvious there?



Indeed, unqueue_page() was not listed here; I think the function
itself is light enough (a check, then an immediate return) so it
did not leave a trace here.

This perf data was collected after reverting this patch, i.e., it's
based on the lockless multithread model, where unqueue_page() is
the only place using a mutex in the main thread.

And you can see the overhead of the mutex is gone after applying
this patch, in the mail I replied to Dave.


I see.  It's not a big portion of CPU resource, though of course I
don't have reason to object to this change as well.

Actually what interested me more is why ram_bytes_total() is such a
hot spot.  AFAIU it's only called in ram_find_and_save_block() per
call, and it should be mostly a constant if we don't plug/unplug
memories.  Not sure that means that's a better spot to work on.



I noticed it too. That could be another item we will work on. :)
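
Purely as a hypothetical sketch of that follow-up work (nothing like this is in the
current patchset, and the cached value would need to be invalidated on memory
hot(un)plug):

static uint64_t ram_bytes_total_cached;

static uint64_t ram_bytes_total_fast(void)
{
    /* walk the RAMBlock list once instead of on every ram_find_and_save_block() */
    if (!ram_bytes_total_cached) {
        ram_bytes_total_cached = ram_bytes_total();
    }
    return ram_bytes_total_cached;
}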




Re: [Qemu-devel] [PATCH 05/12] migration: show the statistics of compression

2018-07-18 Thread Xiao Guangrong




On 07/17/2018 03:01 AM, Dr. David Alan Gilbert wrote:

* Xiao Guangrong (guangrong.x...@gmail.com) wrote:



On 06/14/2018 12:25 AM, Dr. David Alan Gilbert wrote:
  }

   static void migration_bitmap_sync(RAMState *rs)
@@ -1412,6 +1441,9 @@ static void flush_compressed_data(RAMState *rs)
   qemu_mutex_lock(_param[idx].mutex);
   if (!comp_param[idx].quit) {
   len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
+/* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
+compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8;


I think I'd rather save just len+8 rather than than the subtraction.


Hmm, is this what you want?
   compression_counters.reduced_size += len - 8;

Then calculate the real reduced size in populate_ram_info() where we return this
info to the user:
   info->compression->reduced_size = compression_counters.pages * PAGE_SIZE -
                                     compression_counters.reduced_size;

Right?


I mean I'd rather see the actual size presented to the user rather than
the saving compared to uncompressed.



These statistics are used to help people see whether compression works
efficiently or not, so maybe reduced-size is more straightforward? :)



Re: [Qemu-devel] [PATCH 06/12] migration: do not detect zero page for compression

2018-07-18 Thread Xiao Guangrong




On 07/17/2018 02:58 AM, Dr. David Alan Gilbert wrote:

* Xiao Guangrong (guangrong.x...@gmail.com) wrote:



On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote:

* Xiao Guangrong (guangrong.x...@gmail.com) wrote:


Hi Peter,

Sorry for the delay as i was busy on other things.

On 06/19/2018 03:30 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Detecting zero page is not a light work, we can disable it
for compression that can handle all zero data very well


Is there any number shows how the compression algo performs better
than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
be fast, depending on how init_accel() is done in util/bufferiszero.c.


This is the comparison between zero-detection and compression (the target
buffer is all zero bit):

Zero 810 ns Compression: 26905 ns.
Zero 417 ns Compression: 8022 ns.
Zero 408 ns Compression: 7189 ns.
Zero 400 ns Compression: 7255 ns.
Zero 412 ns Compression: 7016 ns.
Zero 411 ns Compression: 7035 ns.
Zero 413 ns Compression: 6994 ns.
Zero 399 ns Compression: 7024 ns.
Zero 416 ns Compression: 7053 ns.
Zero 405 ns Compression: 7041 ns.

Indeed, zero-detection is faster than compression.

However, during our profiling of the live_migration thread (after reverting this
patch), we noticed zero-detection costs lots of CPU:

   12.01%  kqemu  qemu-system-x86_64    [.] buffer_zero_sse2


Interesting; what host are you running on?
Some hosts have support for the faster buffer_zero_sse4/avx2


The host is:

model name  : Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
...
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi
  mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art 
arch_perfmon pebs bts
  rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni 
pclmulqdq dtes64 monitor
  ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 
x2apic movbe popcnt
  tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch 
cpuid_fault epb cat_l3
  cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase 
tsc_adjust bmi1
  hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx 
smap clflushopt
  clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc 
cqm_occup_llc cqm_mbm_total
  cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req 
pku ospke

I checked and noticed "CONFIG_AVX2_OPT" has not been enabled; maybe it is due to
a too-old glibc/gcc version:
gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC)
glibc.x86_64 2.12


Yes, that's pretty old (RHEL6 ?) - I think you should get AVX2 in RHEL7.


Er, it is not easy to update glibc in the production env :(



Re: [Qemu-devel] [PATCH 08/12] migration: do not flush_compressed_data at the end of each iteration

2018-07-18 Thread Xiao Guangrong




On 07/14/2018 02:01 AM, Dr. David Alan Gilbert wrote:

* guangrong.x...@gmail.com (guangrong.x...@gmail.com) wrote:

From: Xiao Guangrong 

flush_compressed_data() needs to wait for all compression threads to
finish their work; after that all threads are free until the
migration feeds new requests to them. Reducing its calls can improve
the throughput and use CPU resources more effectively

We do not need to flush all threads at the end of iteration, the
data can be kept locally until the memory block is changed

Signed-off-by: Xiao Guangrong 
---
  migration/ram.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index f9a8646520..0a38c1c61e 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1994,6 +1994,7 @@ static void ram_save_cleanup(void *opaque)
  }
  
  xbzrle_cleanup();

+flush_compressed_data(*rsp);
  compress_threads_save_cleanup();
  ram_state_cleanup(rsp);
  }


I'm not sure why this change corresponds to the other removal.
We should already have sent all remaining data in ram_save_complete()'s
call to flush_compressed_data - so what is this one for?



This is for the error condition: if any error occurs during live migration,
there is no chance to call ram_save_complete. After switching to the lockless
multithreads model, we assert that all requests have been handled before
destroying the worker threads.


@@ -2690,7 +2691,6 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
  }
  i++;
  }
-flush_compressed_data(rs);
  rcu_read_unlock();


Hmm - are we sure there's no other cases that depend on ordering of all
of the compressed data being sent out between threads?


Err, i tried to think it over carefully; however, i still missed the case you
mentioned. :(
Anyway, doing flush_compressed_data() every 50ms hurts us too much.


I think the one I'd most worry about is the case where:

   iteration one:
  thread 1: Save compressed page 'n'

   iteration two:
  thread 2: Save compressed page 'n'

What guarantees that the version of page 'n'
from thread 2 reaches the destination first without
this flush?



Hmm... you are right, i missed this case. So how about avoiding it by doing this
check at the beginning of ram_save_iterate():

if (ram_counters.dirty_sync_count != rs.dirty_sync_count) {
flush_compressed_data(*rsp);
rs.dirty_sync_count = ram_counters.dirty_sync_count;
}
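
(A slightly fuller sketch of that idea, with stand-in types and names rather
than the real RAMState/ram_counters, so treat it as illustrative only:)

#include <stdint.h>

/* Stand-in for the relevant bit of migration state. */
typedef struct {
    uint64_t dirty_sync_count;   /* last bitmap-sync generation we flushed for */
} RAMStateSketch;

/* Bumped by the bitmap-sync code; stand-in for ram_counters.dirty_sync_count. */
static uint64_t global_dirty_sync_count;

static void flush_compressed_data_sketch(RAMStateSketch *rs) { (void)rs; /* ... */ }

/*
 * Flush only when a new dirty-bitmap sync round has started, instead of at
 * every iteration: within one sync round a page is sent at most once (its
 * dirty bit is cleared when it is sent), so compressed data still queued in
 * the worker threads cannot be overtaken by a newer copy of the same page.
 */
static void maybe_flush_compressed_data(RAMStateSketch *rs)
{
    if (rs->dirty_sync_count != global_dirty_sync_count) {
        flush_compressed_data_sketch(rs);
        rs->dirty_sync_count = global_dirty_sync_count;
    }
}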





Re: [Qemu-devel] [PATCH 10/12] migration: introduce lockless multithreads model

2018-07-18 Thread Xiao Guangrong




On 07/14/2018 12:24 AM, Dr. David Alan Gilbert wrote:


+static void *thread_run(void *opaque)
+{
+ThreadLocal *self_data = (ThreadLocal *)opaque;
+Threads *threads = self_data->threads;
+void (*handler)(ThreadRequest *data) = threads->thread_request_handler;
+ThreadRequest *request;
+int count, ret;
+
+for ( ; !atomic_read(&self_data->quit); ) {
+qemu_event_reset(&self_data->ev);
+
+count = 0;
+while ((request = ring_get(self_data->request_ring)) ||
+count < BUSY_WAIT_COUNT) {
+ /*
+ * wait some while before go to sleep so that the user
+ * needn't go to kernel space to wake up the consumer
+ * threads.
+ *
+ * That will waste some CPU resource indeed however it
+ * can significantly improve the case that the request
+ * will be available soon.
+ */
+ if (!request) {
+cpu_relax();
+count++;
+continue;
+}
+count = 0;


Things like busywait counts probably need isolating somewhere;
getting those counts right is quite hard.


Okay, i will make it a separate function.
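
(Something along these lines, reusing the names from the patch quoted above;
the shape of the helper and the BUSY_WAIT_COUNT value are placeholders, not
the final code:)

/*
 * Spin for a short while looking for a request before the caller falls back
 * to sleeping on its event.  Returns NULL if nothing showed up within
 * BUSY_WAIT_COUNT polls.
 */
static ThreadRequest *thread_busy_wait_for_request(ThreadLocal *self_data)
{
    ThreadRequest *request;
    int count;

    for (count = 0; count < BUSY_WAIT_COUNT; count++) {
        request = ring_get(self_data->request_ring);
        if (request) {
            return request;
        }
        cpu_relax();
    }

    return NULL;
}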



Re: [Qemu-devel] [PATCH 07/12] migration: hold the lock only if it is really needed

2018-07-12 Thread Xiao Guangrong




On 07/11/2018 04:21 PM, Peter Xu wrote:

On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:



On 06/19/2018 03:36 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Try to hold src_page_req_mutex only if the queue is not
empty


Pure question: how much this patch would help?  Basically if you are
running compression tests then I think it means you are with precopy
(since postcopy cannot work with compression yet), then here the lock
has no contention at all.


Yes, you are right; however, we can observe it is among the top functions
(after reverting this patch):

Samples: 29K of event 'cycles', Event count (approx.): 22263412260
+   7.99%  kqemu  qemu-system-x86_64   [.] ram_bytes_total
+   6.95%  kqemu  [kernel.kallsyms][k] copy_user_enhanced_fast_string
+   6.23%  kqemu  qemu-system-x86_64   [.] qemu_put_qemu_file
+   6.20%  kqemu  qemu-system-x86_64   [.] qemu_event_set
+   5.80%  kqemu  qemu-system-x86_64   [.] __ring_put
+   4.82%  kqemu  qemu-system-x86_64   [.] compress_thread_data_done
+   4.11%  kqemu  qemu-system-x86_64   [.] ring_is_full
+   3.07%  kqemu  qemu-system-x86_64   [.] threads_submit_request_prepare
+   2.83%  kqemu  qemu-system-x86_64   [.] ring_mp_get
+   2.71%  kqemu  qemu-system-x86_64   [.] __ring_is_full
+   2.46%  kqemu  qemu-system-x86_64   [.] buffer_zero_sse2
+   2.40%  kqemu  qemu-system-x86_64   [.] add_to_iovec
+   2.21%  kqemu  qemu-system-x86_64   [.] ring_get
+   1.96%  kqemu  [kernel.kallsyms][k] __lock_acquire
+   1.90%  kqemu  libc-2.12.so [.] memcpy
+   1.55%  kqemu  qemu-system-x86_64   [.] ring_len
+   1.12%  kqemu  libpthread-2.12.so   [.] pthread_mutex_unlock
+   1.11%  kqemu  qemu-system-x86_64   [.] ram_find_and_save_block
+   1.07%  kqemu  qemu-system-x86_64   [.] ram_save_host_page
+   1.04%  kqemu  qemu-system-x86_64   [.] qemu_put_buffer
+   0.97%  kqemu  qemu-system-x86_64   [.] compress_page_with_multi_thread
+   0.96%  kqemu  qemu-system-x86_64   [.] ram_save_target_page
+   0.93%  kqemu  libpthread-2.12.so   [.] pthread_mutex_lock


(sorry to respond late; I was busy with other stuff for the
  release...)



You're welcome. :)


I am trying to find out anything related to unqueue_page() but I
failed.  Did I miss anything obvious there?



unqueue_page() was not listed here indeed; i think the function
itself is light enough (a check, then an immediate return) so it
did not leave a trace here.

This perf data was captured after reverting this patch, i.e., it's
based on the lockless multithread model, so unqueue_page() is
the only place using a mutex in the main thread.

And you can see the overhead of the mutex was gone after applying
this patch in the mail i replied to Dave.




Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer

2018-07-03 Thread Xiao Guangrong




On 06/29/2018 09:08 PM, Michael S. Tsirkin wrote:

On Fri, Jun 29, 2018 at 03:30:44PM +0800, Xiao Guangrong wrote:


Hi Michael,

On 06/20/2018 08:38 PM, Michael S. Tsirkin wrote:

On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 





(1) 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
(2) http://dpdk.org/doc/api/rte__ring_8h.html

Signed-off-by: Xiao Guangrong 


So instead of all this super-optimized trickiness, how about
a simple port of ptr_ring from linux?

That one isn't lockless but it's known to outperform
most others for a single producer/single consumer case.
And with a ton of networking going on,
who said it's such a hot spot? OTOH this implementation
has more barriers which slows down each individual thread.
It's also a source of bugs.



Thank you for pointing it out.

I just quickly went through the code of ptr_ring; it is very nice and
really impressive. I will consider porting it to QEMU.


The port is pretty trivial. See below. It's a SPSC structure though.  So
you need to use it with lock.  Given the critical section is small, I


Why put these locks into this common struct? For our case, each thread
has its own ring which is SCSP, no lock is needed at all. Atomic operations
still slow things down, see [PATCH 07/12] migration: hold the lock only if
it is really needed. I'd move the inner locks to the user instead.




Re: [Qemu-devel] [PATCH 07/12] migration: hold the lock only if it is really needed

2018-07-03 Thread Xiao Guangrong




On 06/29/2018 07:22 PM, Dr. David Alan Gilbert wrote:

* Xiao Guangrong (guangrong.x...@gmail.com) wrote:



On 06/19/2018 03:36 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Try to hold src_page_req_mutex only if the queue is not
empty


Pure question: how much this patch would help?  Basically if you are
running compression tests then I think it means you are with precopy
(since postcopy cannot work with compression yet), then here the lock
has no contention at all.


Yes, you are right; however, we can observe it is among the top functions
(after reverting this patch):


Can you show the matching trace with the patch in?


Sure, here it is:

+   8.38%  kqemu  [kernel.kallsyms][k] copy_user_enhanced_fast_string
+   8.03%  kqemu  qemu-system-x86_64   [.] ram_bytes_total
+   6.62%  kqemu  qemu-system-x86_64   [.] qemu_event_set
+   6.02%  kqemu  qemu-system-x86_64   [.] qemu_put_qemu_file
+   5.81%  kqemu  qemu-system-x86_64   [.] __ring_put
+   5.04%  kqemu  qemu-system-x86_64   [.] compress_thread_data_done
+   4.48%  kqemu  qemu-system-x86_64   [.] ring_is_full
+   4.44%  kqemu  qemu-system-x86_64   [.] ring_mp_get
+   3.39%  kqemu  qemu-system-x86_64   [.] __ring_is_full
+   2.61%  kqemu  qemu-system-x86_64   [.] add_to_iovec
+   2.48%  kqemu  qemu-system-x86_64   [.] threads_submit_request_prepare
+   2.08%  kqemu  libc-2.12.so [.] memcpy
+   2.07%  kqemu  qemu-system-x86_64   [.] ring_len
+   1.91%  kqemu  [kernel.kallsyms][k] __lock_acquire
+   1.60%  kqemu  qemu-system-x86_64   [.] buffer_zero_sse2
+   1.16%  kqemu  qemu-system-x86_64   [.] ram_find_and_save_block
+   1.14%  kqemu  qemu-system-x86_64   [.] ram_save_target_page
+   1.12%  kqemu  qemu-system-x86_64   [.] compress_page_with_multi_thread
+   1.09%  kqemu  qemu-system-x86_64   [.] ram_save_host_page
+   1.07%  kqemu  qemu-system-x86_64   [.] test_and_clear_bit
+   1.07%  kqemu  qemu-system-x86_64   [.] qemu_put_buffer
+   1.03%  kqemu  qemu-system-x86_64   [.] qemu_put_byte
+   0.80%  kqemu  qemu-system-x86_64   [.] threads_submit_request_commit
+   0.74%  kqemu  qemu-system-x86_64   [.] migration_bitmap_clear_dirty
+   0.70%  kqemu  qemu-system-x86_64   [.] control_save_page
+   0.69%  kqemu  qemu-system-x86_64   [.] test_bit
+   0.69%  kqemu  qemu-system-x86_64   [.] ram_save_iterate
+   0.63%  kqemu  qemu-system-x86_64   [.] migration_bitmap_find_dirty
+   0.63%  kqemu  qemu-system-x86_64   [.] ram_control_save_page
+   0.62%  kqemu  qemu-system-x86_64   [.] rcu_read_lock
+   0.56%  kqemu  qemu-system-x86_64   [.] qemu_file_get_error
+   0.55%  kqemu  [kernel.kallsyms][k] lock_acquire
+   0.55%  kqemu  qemu-system-x86_64   [.] find_dirty_block
+   0.54%  kqemu  qemu-system-x86_64   [.] ring_index
+   0.53%  kqemu  qemu-system-x86_64   [.] ring_put
+   0.51%  kqemu  qemu-system-x86_64   [.] unqueue_page
+   0.50%  kqemu  qemu-system-x86_64   [.] migrate_use_compression
+   0.48%  kqemu  qemu-system-x86_64   [.] get_queued_page
+   0.46%  kqemu  qemu-system-x86_64   [.] ring_get
+   0.46%  kqemu  [i40e]   [k] i40e_clean_tx_irq
+   0.45%  kqemu  [kernel.kallsyms][k] lock_release
+   0.44%  kqemu  [kernel.kallsyms][k] native_sched_clock
+   0.38%  kqemu  qemu-system-x86_64   [.] migrate_get_current
+   0.38%  kqemu  [kernel.kallsyms][k] find_held_lock
+   0.34%  kqemu  [kernel.kallsyms][k] __lock_release
+   0.34%  kqemu  qemu-system-x86_64   [.] qemu_ram_pagesize
+   0.29%  kqemu  [kernel.kallsyms][k] lock_is_held_type
+   0.27%  kqemu  [kernel.kallsyms][k] update_load_avg
+   0.27%  kqemu  qemu-system-x86_64   [.] save_page_use_compression
+   0.24%  kqemu  qemu-system-x86_64   [.] qemu_file_rate_limit
+   0.23%  kqemu  [kernel.kallsyms][k] tcp_sendmsg
+   0.23%  kqemu  [kernel.kallsyms][k] match_held_lock
+   0.22%  kqemu  [kernel.kallsyms][k] do_raw_spin_trylock
+   0.22%  kqemu  [kernel.kallsyms][k] cyc2ns_read_begin




Re: [Qemu-devel] [PATCH 06/12] migration: do not detect zero page for compression

2018-07-02 Thread Xiao Guangrong




On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote:

* Xiao Guangrong (guangrong.x...@gmail.com) wrote:


Hi Peter,

Sorry for the delay as i was busy on other things.

On 06/19/2018 03:30 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Detecting zero page is not a light work, we can disable it
for compression that can handle all zero data very well


Is there any number shows how the compression algo performs better
than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
be fast, depending on how init_accel() is done in util/bufferiszero.c.


This is the comparison between zero-detection and compression (the target
buffer is all zero bit):

Zero 810 ns Compression: 26905 ns.
Zero 417 ns Compression: 8022 ns.
Zero 408 ns Compression: 7189 ns.
Zero 400 ns Compression: 7255 ns.
Zero 412 ns Compression: 7016 ns.
Zero 411 ns Compression: 7035 ns.
Zero 413 ns Compression: 6994 ns.
Zero 399 ns Compression: 7024 ns.
Zero 416 ns Compression: 7053 ns.
Zero 405 ns Compression: 7041 ns.

Indeed, zero-detection is faster than compression.

However, during our profiling of the live_migration thread (after reverting this
patch), we noticed zero-detection costs lots of CPU:

  12.01%  kqemu  qemu-system-x86_64    [.] buffer_zero_sse2


Interesting; what host are you running on?
Some hosts have support for the faster buffer_zero_sse4/avx2


The host is:

model name  : Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
...
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi
 mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art 
arch_perfmon pebs bts
 rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni 
pclmulqdq dtes64 monitor
 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 
x2apic movbe popcnt
 tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch 
cpuid_fault epb cat_l3
 cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase 
tsc_adjust bmi1
 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx 
smap clflushopt
 clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc 
cqm_occup_llc cqm_mbm_total
 cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req 
pku ospke

I checked and noticed "CONFIG_AVX2_OPT" has not been enabled; maybe it is due to
a too-old glibc/gcc version:
   gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC)
   glibc.x86_64 2.12





   7.60%  kqemu  qemu-system-x86_64    [.] ram_bytes_total
   6.56%  kqemu  qemu-system-x86_64    [.] qemu_event_set
   5.61%  kqemu  qemu-system-x86_64    [.] qemu_put_qemu_file
   5.00%  kqemu  qemu-system-x86_64    [.] __ring_put
   4.89%  kqemu  [kernel.kallsyms]     [k] copy_user_enhanced_fast_string
   4.71%  kqemu  qemu-system-x86_64    [.] compress_thread_data_done
   3.63%  kqemu  qemu-system-x86_64    [.] ring_is_full
   2.89%  kqemu  qemu-system-x86_64    [.] __ring_is_full
   2.68%  kqemu  qemu-system-x86_64    [.] threads_submit_request_prepare
   2.60%  kqemu  qemu-system-x86_64    [.] ring_mp_get

Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer

2018-06-29 Thread Xiao Guangrong




On 06/29/2018 02:15 PM, Jason Wang wrote:



On 2018年06月29日 11:59, Xiao Guangrong wrote:



On 06/28/2018 09:36 PM, Jason Wang wrote:



On 2018年06月04日 17:55, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong

It's the simple lockless ring buffer implement which supports both
single producer vs. single consumer and multiple producers vs.
single consumer.




Finally, it fetches the valid data out, set the entry to the initialized
state and update ring->out to make the entry be usable to the producer:

   data = *entry;
   *entry = NULL;
   ring->out++;

Memory barrier is omitted here, please refer to the comment in the code.

(1)https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
(2)http://dpdk.org/doc/api/rte__ring_8h.html

Signed-off-by: Xiao Guangrong
---


May I ask why you need a MPSC ring here? Can we just use N SPSC ring for 
submitting pages and another N SPSC ring for passing back results?


Sure.

We had this option in mind; however, it is not scalable and will slow
the main thread down. Instead, we'd rather speed up the main thread and move
a reasonable workload to the worker threads.


I'm not quite understand the scalability issue here. Is it because of main 
thread need go through all N rings (which I think not)?


Yes, it is.

The main thread needs to check every single thread and wait
for each one to finish, one by one...



Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer

2018-06-29 Thread Xiao Guangrong




On 06/29/2018 12:23 PM, Michael S. Tsirkin wrote:

On Thu, Jun 28, 2018 at 09:36:00PM +0800, Jason Wang wrote:



On 2018年06月04日 17:55, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong




Memory barrier is omitted here, please refer to the comment in the code.

(1)https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
(2)http://dpdk.org/doc/api/rte__ring_8h.html

Signed-off-by: Xiao Guangrong
---


May I ask why you need a MPSC ring here? Can we just use N SPSC ring for
submitting pages and another N SPSC ring for passing back results?

Thanks


Or just an SPSC ring + a lock.
How big of a gain is lockless access to a trivial structure
like the ring?



Okay, i will give it a try.

BTW, we tried to use a global ring + lock for input and a lockless ring for input;
the former did not show better performance. But we haven't tried to use a global
ring + lock for output yet.



Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer

2018-06-29 Thread Xiao Guangrong


Hi Michael,

On 06/20/2018 08:38 PM, Michael S. Tsirkin wrote:

On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 





(1) 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
(2) http://dpdk.org/doc/api/rte__ring_8h.html

Signed-off-by: Xiao Guangrong 


So instead of all this super-optimized trickiness, how about
a simple port of ptr_ring from linux?

That one isn't lockless but it's known to outperform
most others for a single producer/single consumer case.
And with a ton of networking going on,
who said it's such a hot spot? OTOH this implementation
has more barriers which slows down each individual thread.
It's also a source of bugs.



Thank you for pointing it out.

I just quickly went through the code of ptr_ring; it is very nice and
really impressive. I will consider porting it to QEMU.


Further, atomic tricks this one uses are not fair so some threads can get
completely starved while others make progress. There's also no
chance to mix aggressive polling and sleeping with this
kind of scheme, so the starved thread will consume lots of
CPU.

So I'd like to see a simple ring used, and then a patch on top
switching to this tricky one with performance comparison
along with that.



I agree with you; i will make a version that uses a lock for multiple
producers and do incremental optimizations based on it.


---
  migration/ring.h | 265 +++
  1 file changed, 265 insertions(+)
  create mode 100644 migration/ring.h

diff --git a/migration/ring.h b/migration/ring.h
new file mode 100644
index 00..da9b8bdcbb
--- /dev/null
+++ b/migration/ring.h
@@ -0,0 +1,265 @@
+/*
+ * Ring Buffer
+ *
+ * Multiple producers and single consumer are supported with lock free.
+ *
+ * Copyright (c) 2018 Tencent Inc
+ *
+ * Authors:
+ *  Xiao Guangrong 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef _RING__
+#define _RING__


Prefix Ring is too short.



Okay, will improve it.


+atomic_set(&ring->data[index], NULL);
+
+/*
+ * (B) smp_mb() is needed as we should read the entry out before
+ * updating ring->out as we did in __ring_get().
+ *
+ * (A) smp_wmb() is needed as we should make the entry be NULL before
+ * updating ring->out (which will make the entry be visible and usable).
+ */


I can't say I understand this all.
And the interaction of acquire/release semantics with smp_*
barriers is even scarier.



Hmm... the parallel accesses for these two indexes and the data stored
in the ring are subtle indeed. :(


+atomic_store_release(&ring->out, ring->out + 1);
+
+return data;
+}
+
+static inline int ring_put(Ring *ring, void *data)
+{
+if (ring->flags & RING_MULTI_PRODUCER) {
+return ring_mp_put(ring, data);
+}
+return __ring_put(ring, data);
+}
+
+static inline void *ring_get(Ring *ring)
+{
+if (ring->flags & RING_MULTI_PRODUCER) {
+return ring_mp_get(ring);
+}
+return __ring_get(ring);
+}
+#endif



A bunch of tricky barriers retries etc all over the place.  This sorely
needs *a lot of* unit tests. Where are they?


I used the code attached in this mail to test & benchmark the patches during
my development; it is not dedicated to the Ring itself, instead it is based
on the compression framework.

Yes, test cases are useful and really needed, i will do it... :)

#include "qemu/osdep.h"

#include "libqtest.h"
#include 

#include "qemu/osdep.h"
#include 
#include "qemu/cutils.h"
#include "qemu/bitops.h"
#include "qemu/bitmap.h"
#include "qemu/main-loop.h"
#include "migration/ram.h"
#include "migration/migration.h"
#include "migration/register.h"
#include "migration/misc.h"
#include "migration/page_cache.h"
#include "qemu/error-report.h"
#include "qapi/error.h"
#include "qapi/qapi-events-migration.h"
#include "qapi/qmp/qerror.h"
#include "trace.h"
//#include "exec/ram_addr.h"
#include "exec/target_page.h"
#include "qemu/rcu_queue.h"
#include "migration/colo.h"
#include "migration/block.h"
#include "migration/threads.h"

#include "migration/qemu-file.h"
#include "migration/threads.h"

CompressionStats compression_counters;

#define PAGE_SIZE 4096
#define PAGE_MASK ~(PAGE_SIZE - 1)

static ssize_t test_writev_buffer(void *opaque, struct iovec *iov, int iovcnt,
   int64_t pos)
{
int i, size = 0;

for (i = 0; i < iovcnt; i++) {
size += iov[i].iov_len;
}
return size;
}

static int test_fclose(void *opaque)
{
return 0;
}

static const QEMUFileOps test_write_ops = {

Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer

2018-06-28 Thread Xiao Guangrong




On 06/28/2018 09:36 PM, Jason Wang wrote:



On 2018年06月04日 17:55, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong

It's the simple lockless ring buffer implement which supports both
single producer vs. single consumer and multiple producers vs.
single consumer.




Finally, it fetches the valid data out, set the entry to the initialized
state and update ring->out to make the entry be usable to the producer:

   data = *entry;
   *entry = NULL;
   ring->out++;

Memory barrier is omitted here, please refer to the comment in the code.

(1)https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
(2)http://dpdk.org/doc/api/rte__ring_8h.html

Signed-off-by: Xiao Guangrong
---


May I ask why you need a MPSC ring here? Can we just use N SPSC ring for 
submitting pages and another N SPSC ring for passing back results?


Sure.

We had this option in mind; however, it is not scalable and will slow
the main thread down. Instead, we'd rather speed up the main thread and move
a reasonable workload to the worker threads.



Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer

2018-06-28 Thread Xiao Guangrong




On 06/28/2018 07:55 PM, Wei Wang wrote:

On 06/28/2018 06:02 PM, Xiao Guangrong wrote:


CC: Paul, Peter Zijlstra, Stefani, Lai who are all good at memory barrier.


On 06/20/2018 12:52 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

It's the simple lockless ring buffer implement which supports both
single producer vs. single consumer and multiple producers vs.
single consumer.

Many lessons were learned from Linux Kernel's kfifo (1) and DPDK's
rte_ring (2) before i wrote this implement. It corrects some bugs of
memory barriers in kfifo and it is the simpler lockless version of
rte_ring as currently multiple access is only allowed for producer.


Could you provide some more information about the kfifo bug? Any
pointer would be appreciated.



Sure, i reported one of the memory barrier issues to the Linux kernel:
   https://lkml.org/lkml/2018/5/11/58

Actually, beside that, there is another memory barrier issue in kfifo,
please consider this case:

   at the beginning
   ring->size = 4
   ring->out = 0
   ring->in = 4

 Consumer                               Producer
 --------                               --------
   index = ring->out; /* index == 0 */
   ring->out++; /* ring->out == 1 */
   < Re-Order >
                                        out = ring->out;
                                        if (ring->in - out >= ring->mask)
                                            return -EFULL;
                                        /* see the ring is not full */
                                        index = ring->in & ring->mask; /* index == 0 */
                                        ring->data[index] = new_data;
                                        ring->in++;

   data = ring->data[index];
   !! the old data is lost !!

So we need to make sure:
1) for the consumer, we should read the ring->data[] out before updating ring->out
2) for the producer, we should read ring->out before updating ring->data[]

as follows:
  Producer                              Consumer
  --------                              --------
  Reading ring->out                     Reading ring->data[index]
  smp_mb()                              smp_mb()
  Setting ring->data[index] = data      ring->out++

[ i used atomic_store_release() and atomic_load_acquire() instead of smp_mb()
  in the patch. ]

But i am not sure if we can use smp_acquire__after_ctrl_dep() in the producer?



I wonder if this could be solved by simply tweaking the above consumer 
implementation:

[1] index = ring->out;
[2] data = ring->data[index];
[3] index++;
[4] ring->out = index;

Now [2] and [3] forms a WAR dependency, which avoids the reordering.


It can not. [2] and [4] still do not have any dependency between them; the CPU
and the compiler can optimize 'index' away.
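
(To make the required pairing concrete, here is a tiny standalone C11 sketch of
an SPSC put/get with the acquire/release placed as discussed above.  It is
illustrative only - not the ring code from the patch - but atomic_load_acquire()
/ atomic_store_release() in QEMU map onto the same semantics:)

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4                 /* must be a power of two */

struct spsc_ring {
    _Atomic unsigned int in;        /* written only by the producer */
    _Atomic unsigned int out;       /* written only by the consumer */
    void *data[RING_SIZE];
};

static bool spsc_put(struct spsc_ring *r, void *p)
{
    unsigned int in  = atomic_load_explicit(&r->in, memory_order_relaxed);
    /* acquire: only treat the slot as free after seeing the consumer's release */
    unsigned int out = atomic_load_explicit(&r->out, memory_order_acquire);

    if (in - out >= RING_SIZE) {
        return false;                               /* full */
    }
    r->data[in & (RING_SIZE - 1)] = p;
    /* release: the data write must be visible before the new 'in' */
    atomic_store_explicit(&r->in, in + 1, memory_order_release);
    return true;
}

static void *spsc_get(struct spsc_ring *r)
{
    unsigned int out = atomic_load_explicit(&r->out, memory_order_relaxed);
    /* acquire: pairs with the producer's release of 'in' */
    unsigned int in  = atomic_load_explicit(&r->in, memory_order_acquire);
    void *p;

    if (in == out) {
        return NULL;                                /* empty */
    }
    p = r->data[out & (RING_SIZE - 1)];
    /* release: the data read must happen before 'out' becomes visible */
    atomic_store_explicit(&r->out, out + 1, memory_order_release);
    return p;
}

(The consumer's release of 'out' paired with the producer's acquire of 'out' is
exactly what prevents the "old data is lost" interleaving shown earlier.)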




Re: [Qemu-devel] [PATCH 06/12] migration: do not detect zero page for compression

2018-06-28 Thread Xiao Guangrong



Hi Daniel,

On 06/28/2018 05:36 PM, Daniel P. Berrangé wrote:

On Thu, Jun 28, 2018 at 05:12:39PM +0800, Xiao Guangrong wrote:





After this patch, the workload is moved to the worker thread, is it
acceptable?


It depends on your point of view. If you have spare / idle CPUs on the host,
then moving workload to a thread is ok, despite the CPU cost of compression
in that thread being much higher what what was replaced, since you won't be
taking CPU resources away from other contending workloads.

I'd venture to suggest though that we should probably *not* be optimizing for
the case of idle CPUs on the host. More realistic is to expect that the host
CPUs are near fully committed to work, and thus the (default) goal should be
to minimize CPU overhead for the host as a whole. From this POV, zero-page
detection is better than compression due to > x10 better speed.

Given the CPU overheads of compression, I think it has fairly narrow use
in migration in general when considering hosts are often highly committed
on CPU.


I understand your concern; however, it is not that bad.

First, we tolerate the case that the threads run a little slowly - we do
not wait for a thread to become free; instead, we directly post the page
out as a normal page (zero detection still works for the normal page), so it
at least makes the performance no worse than the case where compression is
not used.

Second, we think the reasonable configuration is that the system should
have enough CPU resource to run the number of threads the user
configured. BTW, the work we will post out will make these parameters
configurable at runtime, so a control daemon (e.g., libvirt) can
adjust them based on the current load of the system and the statistics
from QEMU. This is the topic we submitted to KVM Forum this year;
hope it can be accepted.

Lastly, we have a watermark to trigger live migration on a
load-balanced production system; the watermark makes sure there is some free
CPU resource left for other work.

Thanks!



Re: [Qemu-devel] [PATCH 10/12] migration: introduce lockless multithreads model

2018-06-28 Thread Xiao Guangrong




On 06/20/2018 02:52 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:18PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Current implementation of compression and decompression are very
hard to be enabled on productions. We noticed that too many wait-wakes
go to kernel space and CPU usages are very low even if the system
is really free




Not sure how other people think, for me these information suites
better as cover letter.  For commit message, I would prefer to know
about something like: what this thread model can do; how the APIs are
designed and used; what's the limitations, etc.  After all until this
patch nowhere is using the new model yet, so these numbers are a bit
misleading.



Yes, i completely agree with you; i will remove it from the changelog.



Signed-off-by: Xiao Guangrong 
---
  migration/Makefile.objs |   1 +
  migration/threads.c | 265 
  migration/threads.h | 116 +


Again, this model seems to be suitable for scenarios even outside
migration.  So I'm not sure whether you'd like to generalize it (I
still see e.g. constants and comments related to migration, but there
aren't much) and put it into util/.


Sure, that's good to me. :)




  3 files changed, 382 insertions(+)
  create mode 100644 migration/threads.c
  create mode 100644 migration/threads.h

diff --git a/migration/Makefile.objs b/migration/Makefile.objs
index c83ec47ba8..bdb61a7983 100644
--- a/migration/Makefile.objs
+++ b/migration/Makefile.objs
@@ -7,6 +7,7 @@ common-obj-y += qemu-file-channel.o
  common-obj-y += xbzrle.o postcopy-ram.o
  common-obj-y += qjson.o
  common-obj-y += block-dirty-bitmap.o
+common-obj-y += threads.o
  
  common-obj-$(CONFIG_RDMA) += rdma.o
  
diff --git a/migration/threads.c b/migration/threads.c

new file mode 100644
index 00..eecd3229b7
--- /dev/null
+++ b/migration/threads.c
@@ -0,0 +1,265 @@
+#include "threads.h"
+
+/* retry to see if there is avilable request before actually go to wait. */
+#define BUSY_WAIT_COUNT 1000
+
+static void *thread_run(void *opaque)
+{
+ThreadLocal *self_data = (ThreadLocal *)opaque;
+Threads *threads = self_data->threads;
+void (*handler)(ThreadRequest *data) = threads->thread_request_handler;
+ThreadRequest *request;
+int count, ret;
+
+for ( ; !atomic_read(&self_data->quit); ) {
+qemu_event_reset(&self_data->ev);
+
+count = 0;
+while ((request = ring_get(self_data->request_ring)) ||
+count < BUSY_WAIT_COUNT) {
+ /*
+ * wait some while before go to sleep so that the user
+ * needn't go to kernel space to wake up the consumer
+ * threads.
+ *
+ * That will waste some CPU resource indeed however it
+ * can significantly improve the case that the request
+ * will be available soon.
+ */
+ if (!request) {
+cpu_relax();
+count++;
+continue;
+}
+count = 0;
+
+handler(request);
+
+do {
+ret = ring_put(threads->request_done_ring, request);
+/*
+ * request_done_ring has enough room to contain all
+ * requests, however, theoretically, it still can be
+ * fail if the ring's indexes are overflow that would
+ * happen if there is more than 2^32 requests are


Could you elaborate why this ring_put() could fail, and why failure is
somehow related to 2^32 overflow?

Firstly, I don't understand why it will fail.


As we explained in the previous mail:

| Without it we can easily observe a "strange" behavior that the thread will
| put the result to the global ring failed even if we allocated enough room
| for the global ring (its capability >= total requests), that's because
| these two indexes can be updated at anytime, consider the case that multiple
| get and put operations can be finished between reading ring->in and ring->out
| so that very possibly ring->in can pass the value readed from ring->out.
|
| Having this code, the negative case only happens if these two indexes (32 bits)
| overflows to the same value, that can help us to catch potential bug in the
| code.


Meanwhile, AFAIU your ring can even live well with that 2^32 overflow.
Or did I misunderstood?


Please refer to the code:
+if (__ring_is_full(ring, in, out)) {
+if (atomic_read(&ring->in) == in &&
+atomic_read(&ring->out) == out) {
+return -ENOBUFS;
+}

As we allocated enough room for this global ring, the only case where putting
data can fail is when the two indexes have overflowed to the same value.

That would require roughly 2^32 get/put operations to happen on the other
threads and the main thread while this thread is reading these two indexes.
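
(The same double-read idea as a standalone C11 sketch, so it is easier to see
in context; this is illustrative only and not the patch code - the size, names
and error value are made up for the example:)

#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 256                       /* power of two, illustrative */

struct mp_ring {
    _Atomic unsigned int in;                /* advanced by producers via cmpxchg */
    _Atomic unsigned int out;               /* advanced by the single consumer   */
    _Atomic(void *) data[RING_SIZE];
};

static int mp_ring_put(struct mp_ring *r, void *p)
{
    unsigned int in, out;

    for (;;) {
        in  = atomic_load_explicit(&r->in,  memory_order_relaxed);
        out = atomic_load_explicit(&r->out, memory_order_acquire);

        if (in - out >= RING_SIZE) {
            /*
             * Re-read both indexes: they may have moved while we were
             * looking.  Only if both are still the same is the ring
             * genuinely full - and with a ring sized for the maximum
             * number of in-flight requests that can only happen after
             * the 32-bit indexes overflow to the same value, i.e. it
             * indicates a bug rather than a normal condition.
             */
            if (atomic_load_explicit(&r->in,  memory_order_relaxed) == in &&
                atomic_load_explicit(&r->out, memory_order_relaxed) == out) {
                return -ENOBUFS;
            }
            continue;               /* an entry was consumed meanwhile, retry */
        }

        /* Reserve slot 'in'; on failure another producer won the race. */
        if (atomic_compare_exchange_weak(&r->in, &in, in + 1)) {
            break;
        }
    }

    /* The slot at 'in' is now owned by this producer; publish the data. */
    atomic_store_explicit(&r->data[in & (RING_SIZE - 1)], p,
                          memory_order_release);
    return 0;
}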

Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer

2018-06-28 Thread Xiao Guangrong




On 06/20/2018 01:55 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.x...@gmail.com wrote:

[...]

(Some more comments/questions for the MP implementation...)


+static inline int ring_mp_put(Ring *ring, void *data)
+{
+unsigned int index, in, in_next, out;
+
+do {
+in = atomic_read(&ring->in);
+out = atomic_read(&ring->out);


[0]

Do we need to fetch "out" with load_acquire()?  Otherwise what's the
pairing of below store_release() at [1]?



The barrier paired with [1] is the full barrier implied in atomic_cmpxchg(),


This barrier exists in SP-SC case which makes sense to me, I assume
that's also needed for MP-SC case, am I right?


We needn't put a memory barrier here, as we do not need to care about the order
between these two indexes (in and out); instead, the memory barrier (for
SP-SC as well) is used to order reading ring->out against updating
ring->data[], as we explained in the previous mail.




+
+if (__ring_is_full(ring, in, out)) {
+if (atomic_read(&ring->in) == in &&
+atomic_read(&ring->out) == out) {


Why read again?  After all the ring API seems to be designed as
non-blocking.  E.g., I see the poll at [2] below makes more sense
since when reaches [2] it means that there must be a producer that is
_doing_ the queuing, so polling is very possible to complete fast.
However here it seems to be a pure busy poll without any hint.  Then
not sure whether we should just let the caller decide whether it wants
to call ring_put() again.



Without it we can easily observe a "strange" behavior where the thread fails to
put the result into the global ring even though we allocated enough room
for the global ring (its capacity >= total requests). That's because
these two indexes can be updated at any time; consider the case where multiple
get and put operations finish between reading ring->in and ring->out,
so that ring->in can very possibly pass the value read from ring->out.

With this code, the negative case only happens if these two 32-bit indexes
overflow to the same value, which can help us to catch a potential bug in the
code.


+return -ENOBUFS;
+}
+
+/* a entry has been fetched out, retry. */
+continue;
+}
+
+in_next = in + 1;
+} while (atomic_cmpxchg(&ring->in, in, in_next) != in);
+
+index = ring_index(ring, in);
+
+/*
+ * smp_rmb() paired with the memory barrier of (A) in ring_mp_get()
+ * is implied in atomic_cmpxchg() as we should read ring->out first
+ * before fetching the entry, otherwise this assert will fail.


Thanks for all these comments!  These are really helpful for
reviewers.

However I'm not sure whether I understand it correctly here on MB of
(A) for ring_mp_get() - AFAIU that should corresponds to a smp_rmb()
at [0] above when reading the "out" variable rather than this
assertion, and that's why I thought at [0] we should have something
like a load_acquire() there (which contains a rmb()).


Memory barrier (A) in ring_mp_get() makes sure the order between:
   ring->data[index] = NULL;
   smp_wmb();
   ring->out = out + 1;

And the memory barrier at [0] makes sure the order between:
   out = ring->out;
   /* smp_rmb() */
   cmpxchg();
   value = ring->data[index];
   assert(value);

[ note: the assertion and reading ring->out are across cmpxchg(). ]

Did i understand your question correctly?



 From content-wise, I think the code here is correct, since
atomic_cmpxchg() should have one implicit smp_mb() after all so we
don't need anything further barriers here.


Yes, it is.



Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer

2018-06-28 Thread Xiao Guangrong



CC: Paul, Peter Zijlstra, Stefani, Lai who are all good at memory barrier.


On 06/20/2018 12:52 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

It's the simple lockless ring buffer implement which supports both
single producer vs. single consumer and multiple producers vs.
single consumer.

Many lessons were learned from Linux Kernel's kfifo (1) and DPDK's
rte_ring (2) before i wrote this implement. It corrects some bugs of
memory barriers in kfifo and it is the simpler lockless version of
rte_ring as currently multiple access is only allowed for producer.


Could you provide some more information about the kfifo bug?  Any
pointer would be appreciated.



Sure, i reported one of the memory barrier issues to the Linux kernel:
   https://lkml.org/lkml/2018/5/11/58

Actually, beside that, there is another memory barrier issue in kfifo,
please consider this case:

   at the beginning
   ring->size = 4
   ring->out = 0
   ring->in = 4

 Consumer                               Producer
 --------                               --------
   index = ring->out; /* index == 0 */
   ring->out++; /* ring->out == 1 */
   < Re-Order >
                                        out = ring->out;
                                        if (ring->in - out >= ring->mask)
                                            return -EFULL;
                                        /* see the ring is not full */
                                        index = ring->in & ring->mask; /* index == 0 */
                                        ring->data[index] = new_data;
                                        ring->in++;

   data = ring->data[index];
   !! the old data is lost !!

So we need to make sure:
1) for the consumer, we should read the ring->data[] out before updating ring->out
2) for the producer, we should read ring->out before updating ring->data[]

as follows:
  Producer                              Consumer
  --------                              --------
  Reading ring->out                     Reading ring->data[index]
  smp_mb()                              smp_mb()
  Setting ring->data[index] = data      ring->out++

[ i used atomic_store_release() and atomic_load_acquire() instead of smp_mb()
  in the patch. ]

But i am not sure if we can use smp_acquire__after_ctrl_dep() in the producer?



If it has a single producer vs. a single consumer, it is the traditional fifo;
if it has multiple producers, it uses the following algorithm:

For the producer, it uses two steps to update the ring:
- first step, occupy the entry in the ring:

retry:
   in = ring->in
   if (cmpxchg(&ring->in, in, in + 1) != in)
 goto retry;

  after that the entry pointed by ring->data[in] has been owned by
  the producer.

  assert(ring->data[in] == NULL);

  Note, no other producer can touch this entry so that this entry
  should always be the initialized state.

- second step, write the data to the entry:

  ring->data[in] = data;

For the consumer, it first checks if there is available entry in the
ring and fetches the entry from the ring:

  if (!ring_is_empty(ring))
   entry = &data[ring->out];

  Note: the ring->out has not been updated so that the entry pointed
  by ring->out is completely owned by the consumer.

Then it checks if the data is ready:

retry:
  if (*entry == NULL)
 goto retry;
That means, the producer has updated the index but haven't written any
data to it.

Finally, it fetches the valid data out, set the entry to the initialized
state and update ring->out to make the entry be usable to the producer:

   data = *entry;
   *entry = NULL;
   ring->out++;

Memory barrier is omitted here, please refer to the comment in the code.

(1) 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
(2) http://dpdk.org/doc/api/rte__ring_8h.html

Signed-off-by: Xiao Guangrong 
---
  migration/ring.h | 265 +++


If this is a very general implementation, not sure whether we can move
this to util/ directory so that it can be used even outside migration
codes.


I thought about it too. Currently migration is the only user of it, so i put
it near the migration code. It's fine with me to move it to util/ if you
prefer.




  1 file changed, 265 insertions(+)
  create mode 100644 migration/ring.h

diff --git a/migration/ring.h b/migration/ring.h
new file mode 100644
index 00..da9b8bdcbb
--- /dev/null
+++ b/migration/ring.h
@@ -0,0 +1,265 @@
+/*
+ * Ring Buffer
+ *
+ * Multiple producers and single consumer are supported with lock free.
+ *
+ * Copyright (c) 2018 Tencent Inc
+ *
+ * Authors:
+ *  Xiao Guangrong 
+ *
+ * This work is licensed under the terms of 

Re: [Qemu-devel] [PATCH 07/12] migration: hold the lock only if it is really needed

2018-06-28 Thread Xiao Guangrong




On 06/19/2018 03:36 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Try to hold src_page_req_mutex only if the queue is not
empty


Pure question: how much this patch would help?  Basically if you are
running compression tests then I think it means you are with precopy
(since postcopy cannot work with compression yet), then here the lock
has no contention at all.


Yes, you are right; however, we can observe it is among the top functions
(after reverting this patch):

Samples: 29K of event 'cycles', Event count (approx.): 22263412260
+   7.99%  kqemu  qemu-system-x86_64   [.] ram_bytes_total
+   6.95%  kqemu  [kernel.kallsyms][k] copy_user_enhanced_fast_string
+   6.23%  kqemu  qemu-system-x86_64   [.] qemu_put_qemu_file
+   6.20%  kqemu  qemu-system-x86_64   [.] qemu_event_set
+   5.80%  kqemu  qemu-system-x86_64   [.] __ring_put
+   4.82%  kqemu  qemu-system-x86_64   [.] compress_thread_data_done
+   4.11%  kqemu  qemu-system-x86_64   [.] ring_is_full
+   3.07%  kqemu  qemu-system-x86_64   [.] threads_submit_request_prepare
+   2.83%  kqemu  qemu-system-x86_64   [.] ring_mp_get
+   2.71%  kqemu  qemu-system-x86_64   [.] __ring_is_full
+   2.46%  kqemu  qemu-system-x86_64   [.] buffer_zero_sse2
+   2.40%  kqemu  qemu-system-x86_64   [.] add_to_iovec
+   2.21%  kqemu  qemu-system-x86_64   [.] ring_get
+   1.96%  kqemu  [kernel.kallsyms][k] __lock_acquire
+   1.90%  kqemu  libc-2.12.so [.] memcpy
+   1.55%  kqemu  qemu-system-x86_64   [.] ring_len
+   1.12%  kqemu  libpthread-2.12.so   [.] pthread_mutex_unlock
+   1.11%  kqemu  qemu-system-x86_64   [.] ram_find_and_save_block
+   1.07%  kqemu  qemu-system-x86_64   [.] ram_save_host_page
+   1.04%  kqemu  qemu-system-x86_64   [.] qemu_put_buffer
+   0.97%  kqemu  qemu-system-x86_64   [.] compress_page_with_multi_thread
+   0.96%  kqemu  qemu-system-x86_64   [.] ram_save_target_page
+   0.93%  kqemu  libpthread-2.12.so   [.] pthread_mutex_lock

I guess its atomic operations cost CPU resources, and check-before-lock is
a common technique; i think it shouldn't have side effects, right? :)
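
(For reference, a standalone sketch of the check-before-lock pattern being
discussed; the queue structure and names below are simplified stand-ins, not
the actual QEMU ones:)

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct page_request {
    struct page_request *next;
    /* block/offset of the requested page would live here */
};

struct request_queue {
    pthread_mutex_t lock;
    struct page_request *_Atomic head;      /* NULL when the queue is empty */
};

/*
 * Fast path: peek at the head without the lock.  Precopy (without postcopy)
 * almost never queues pages, so most calls return immediately and the mutex
 * stays off the hot path.  The lock is still taken before actually dequeuing,
 * so correctness is unchanged - at worst we miss a request that is being
 * queued right now and pick it up on the next call.
 */
static struct page_request *unqueue_request(struct request_queue *q)
{
    struct page_request *req;

    if (!atomic_load_explicit(&q->head, memory_order_acquire)) {
        return NULL;
    }

    pthread_mutex_lock(&q->lock);
    req = q->head;
    if (req) {
        atomic_store_explicit(&q->head, req->next, memory_order_release);
    }
    pthread_mutex_unlock(&q->lock);
    return req;
}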




Re: [Qemu-devel] [PATCH 06/12] migration: do not detect zero page for compression

2018-06-28 Thread Xiao Guangrong



Hi Peter,

Sorry for the delay as i was busy on other things.

On 06/19/2018 03:30 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Detecting zero page is not a light work, we can disable it
for compression that can handle all zero data very well


Is there any number shows how the compression algo performs better
than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
be fast, depending on how init_accel() is done in util/bufferiszero.c.


This is the comparison between zero-detection and compression (the target
buffer is all zero bit):

Zero 810 ns Compression: 26905 ns.
Zero 417 ns Compression: 8022 ns.
Zero 408 ns Compression: 7189 ns.
Zero 400 ns Compression: 7255 ns.
Zero 412 ns Compression: 7016 ns.
Zero 411 ns Compression: 7035 ns.
Zero 413 ns Compression: 6994 ns.
Zero 399 ns Compression: 7024 ns.
Zero 416 ns Compression: 7053 ns.
Zero 405 ns Compression: 7041 ns.

Indeed, zero-detection is faster than compression.

However, during our profiling of the live_migration thread (after reverting this
patch), we noticed zero-detection costs lots of CPU:

 12.01%  kqemu  qemu-system-x86_64    [.] buffer_zero_sse2
  7.60%  kqemu  qemu-system-x86_64    [.] ram_bytes_total
  6.56%  kqemu  qemu-system-x86_64    [.] qemu_event_set
  5.61%  kqemu  qemu-system-x86_64    [.] qemu_put_qemu_file
  5.00%  kqemu  qemu-system-x86_64    [.] __ring_put
  4.89%  kqemu  [kernel.kallsyms]     [k] copy_user_enhanced_fast_string
  4.71%  kqemu  qemu-system-x86_64    [.] compress_thread_data_done
  3.63%  kqemu  qemu-system-x86_64    [.] ring_is_full
  2.89%  kqemu  qemu-system-x86_64    [.] __ring_is_full
  2.68%  kqemu  qemu-system-x86_64    [.] threads_submit_request_prepare
  2.60%  kqemu  qemu-system-x86_64    [.] ring_mp_get
  2.25%  kqemu  qemu-system-x86_64    [.] ring_get
  1.96%  kqemu  libc-2.12.so          [.] memcpy

After this patch, the workload is moved to the worker thread, is it
acceptable?



 From compression rate POV of course zero page algo wins since it
contains no data (but only a flag).



Yes it is. The compressed zero page is 45 bytes, which is small enough i think.

Hmm, if you do not like it, how about moving zero page detection to the worker thread?

Thanks!



Re: [Qemu-devel] [PATCH 05/12] migration: show the statistics of compression

2018-06-14 Thread Xiao Guangrong




On 06/14/2018 12:25 AM, Dr. David Alan Gilbert wrote:
 }
  
  static void migration_bitmap_sync(RAMState *rs)

@@ -1412,6 +1441,9 @@ static void flush_compressed_data(RAMState *rs)
   qemu_mutex_lock(&comp_param[idx].mutex);
  if (!comp_param[idx].quit) {
  len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
+/* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
+compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8;


I think I'd rather save just len+8 rather than than the subtraction.


Hmm, is this what you want?
  compression_counters.reduced_size += len - 8;

Then calculate the real reduced size in populate_ram_info() where we return this
info to the user:
  info->compression->reduced_size =
      compression_counters.pages * PAGE_SIZE - compression_counters.reduced_size;

Right?


I think other than that, and Eric's comments, it's OK.



Thanks.



Re: [Qemu-devel] [PATCH 04/12] migration: introduce migration_update_rates

2018-06-13 Thread Xiao Guangrong




On 06/14/2018 12:17 AM, Dr. David Alan Gilbert wrote:

* guangrong.x...@gmail.com (guangrong.x...@gmail.com) wrote:

From: Xiao Guangrong 

It is used to slightly clean the code up, no logic is changed


Actually, there is a slight change; iterations_prev is always updated
when previously it was only updated with xbzrle on; still the change
makes more sense.


Yes, indeed. I will update the changelog.




Re: [Qemu-devel] [PATCH 02/12] migration: fix counting normal page for compression

2018-06-13 Thread Xiao Guangrong




On 06/13/2018 11:51 PM, Dr. David Alan Gilbert wrote:

* guangrong.x...@gmail.com (guangrong.x...@gmail.com) wrote:

From: Xiao Guangrong 

The compressed page is not normal page


Is this the right reason?


I think the 'normal' page count shouldn't include compressed
pages and XBZRLE-ed pages (the current code does not treat
xbzrle pages as normal either).


I think we always increment some counter for a page - so
what gets incremented for a compressed page?


In the later patch, we will introduce the statistics of
compression which contains "pages":
   @pages: amount of pages compressed and transferred to the target VM


Is the real answer that we do:

   ram_save_target_page
  control_save_page
  compress_page_with_multi_thread

and control_save_page already increments the counter?


No :), control_save_page increments the counter only if it posted the
data out; in that case, the compression path is not invoked.

Thanks!



Re: [Qemu-devel] [PATCH 01/12] migration: do not wait if no free thread

2018-06-13 Thread Xiao Guangrong




On 06/13/2018 11:43 PM, Dr. David Alan Gilbert wrote:

* Peter Xu (pet...@redhat.com) wrote:

On Tue, Jun 12, 2018 at 10:42:25AM +0800, Xiao Guangrong wrote:



On 06/11/2018 03:39 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:09PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Instead of putting the main thread to sleep state to wait for
free compression thread, we can directly post it out as normal
page that reduces the latency and uses CPUs more efficiently


The feature looks good, though I'm not sure whether we should make a
capability flag for this feature since otherwise it'll be hard to
switch back to the old full-compression way no matter for what
reason.  Would that be a problem?



We assume this optimization should always be a win for all cases;
in particular, we introduced the compression statistics, so the user
can adjust the parameters based on those statistics if anything works
worse.


Ah, that'll be good.



Furthermore, we really need to improve this optimization if it hurts
any case rather than leaving an option to the user. :)


Yeah, even if we make it a parameter/capability we can still turn that
on by default in new versions but keep the old behavior in old
versions. :) The major difference is that, then we can still _have_ a
way to compress every page. I'm just thinking if we don't have a
switch for that then if someone wants to measure e.g.  how a new
compression algo could help VM migration, then he/she won't be
possible to do that again since the numbers will be meaningless if
that bit is out of control on which page will be compressed.

Though I don't know how much use it'll bring...  But if that won't be
too hard, it still seems good.  Not a strong opinion.


I think that is needed; it might be that some users have really awful
networking and need the compression; I'd expect that for people who turn
on compression they really expect the slowdown because they need it for
their network, so changing that is a bit odd.


People should make sure the system has enough CPU resources to do
compression as well, so the perfect usage is when the 'busy-rate'
is low enough, i think.

However, it's not a big deal; i will introduce a parameter,
maybe compress-wait-free-thread.

Thank you all, Dave and Peter! :)




Re: [Qemu-devel] [PATCH 00/12] migration: improve multithreads for compression and decompression

2018-06-11 Thread Xiao Guangrong




On 06/11/2018 04:00 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:08PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Background
--
Current implementation of compression and decompression are very
hard to be enabled on productions. We noticed that too many wait-wakes
go to kernel space and CPU usages are very low even if the system
is really free

The reasons are:
1) there are too many locks used to do synchronization, there
  is a global lock and each single thread has its own lock,
  migration thread and work threads need to go to sleep if
  these locks are busy

2) migration thread separately submits requests to the thread,
however, only one request can be pending, that means, the
thread has to go to sleep after finishing the request

Our Ideas
-
To make it work better, we introduce a new multithread model:
the user, currently the migration thread, submits requests
to each thread in a round-robin manner; each thread has its own
ring whose capacity is 4 and puts the result into a global ring
which is lockless for multiple producers; the user fetches results
out from the global ring and does the remaining operations for the
request, e.g., posting the compressed data out for migration on
the source QEMU

Other work in this patchset is offering some statistics to see
if compression works as we expected and making the migration thread
work faster so it can feed more requests to the threads


Hi, Guangrong,

I'm not sure whether my understanding is correct, but AFAIU the old
code has a major defect that it depends too much on the big lock.  The
critial section of the small lock seems to be very short always, and
also that's per-thread.  However we use the big lock in lots of
places: flush compress data, queue every page, or send the notifies in
the compression thread.



The lock is one issue; however, another issue is that the thread has
to go to sleep after finishing one request and the main thread (live
migration thread) needs to go to kernel space and wake the thread up
for every single request.

And we also observed that linearly scanning the threads one by one to
see which is free is not cache-friendly...


I haven't yet read the whole work, this work seems to be quite nice
according to your test results.  However have you thought about
firstly remove the big lock without touching much of other part of the
code, then continue to improve it?  Or have you ever tried to do so?
I don't think you need to do extra work for this, but I would
appreciate if you have existing test results to share.



If you really want the performance result, i will try it...

Actually, the first version we used in our production is a lockless
multi-thread model (only one atomic operation is needed for both the
producer and the consumer) but only one request can be fed to each
thread. It's comparable to your suggestion (and should be far faster
than your suggestion).

The shortcoming we observed in this solution is that too many waits and
wakeups trap into the kernel, so the CPU is idle and bandwidth is low.
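
(For context, a minimal standalone sketch of such a one-request-per-thread
mailbox.  The semaphores below stand in for the wait/wake primitive we actually
used, so names and primitives are illustrative only - the point is the kernel
wait/wake that ends up happening for every single request:)

#include <semaphore.h>
#include <stdatomic.h>
#include <stddef.h>

/* One in-flight request per worker thread. */
struct worker_mailbox {
    _Atomic(void *) slot;       /* NULL: worker is free; non-NULL: request pending */
    sem_t request_ready;        /* posted by the producer, waited on by the worker  */
    sem_t slot_free;            /* posted by the worker, waited on by the producer  */
};

static void mailbox_init(struct worker_mailbox *m)
{
    atomic_store(&m->slot, NULL);
    sem_init(&m->request_ready, 0, 0);
    sem_init(&m->slot_free, 0, 1);          /* at most one request in flight */
}

/* Producer side: a wake-up per request. */
static void mailbox_submit(struct worker_mailbox *m, void *req)
{
    sem_wait(&m->slot_free);                /* sleeps whenever the worker is busy */
    atomic_store_explicit(&m->slot, req, memory_order_release);
    sem_post(&m->request_ready);            /* kernel wake-up for every request */
}

/* Worker side: a sleep per request once the producer cannot keep up. */
static void *mailbox_fetch(struct worker_mailbox *m)
{
    void *req;

    sem_wait(&m->request_ready);
    req = atomic_exchange_explicit(&m->slot, NULL, memory_order_acq_rel);
    sem_post(&m->slot_free);
    return req;
}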


In other words, would it be nicer to separate the work into two
pieces?

- one to refactor the existing locks, to see what we can gain by
   simplify the locks to minimum.  AFAIU now the locking used is still
   not ideal, my thinking is that _maybe_ we can start by removing the
   big lock, and use a semaphore or something to replace the "done"
   notification while still keep the small lock?  Even some busy
   looping?



Note: no lock is used after this patchset...


- one to introduce the lockless ring buffer, to demostrate how the
   lockless data structure helps comparing to the locking ways

Then we can know which item contributed how much to the performance
numbers.  After all the new ring and thread model seems to be a big
chunk of work (sorry I haven't read them yet, but I will).


It is really a huge burden to refactor the old code and later completely
remove it.

We redesigned the data structures and algorithm completely and abstracted the
model to clean up the code used for compression and decompression; it's
not easy to modify the old code part by part... :(

But... if you think it is really needed, i will try to figure out a way
to address your suggestion. :)

Thanks!




Re: [Qemu-devel] [PATCH 01/12] migration: do not wait if no free thread

2018-06-11 Thread Xiao Guangrong




On 06/11/2018 03:39 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:09PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Instead of putting the main thread to sleep state to wait for
free compression thread, we can directly post it out as normal
page that reduces the latency and uses CPUs more efficiently


The feature looks good, though I'm not sure whether we should make a
capability flag for this feature since otherwise it'll be hard to
switch back to the old full-compression way no matter for what
reason.  Would that be a problem?



We assume this optimization should always be a win for all cases;
particularly, we introduced the statistics of compression, so the user
can adjust the parameters based on those statistics if anything works
worse.

Furthermore, we really need to improve this optimization if it hurts
any case rather than leaving an option to the user. :)



Signed-off-by: Xiao Guangrong 
---
  migration/ram.c | 34 +++---
  1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 5bcbf7a9f9..0caf32ab0a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1423,25 +1423,18 @@ static int compress_page_with_multi_thread(RAMState 
*rs, RAMBlock *block,
  
  thread_count = migrate_compress_threads();

  qemu_mutex_lock(&comp_done_lock);


Can we drop this lock in this case?


The lock is used to protect comp_param[].done...

Well, we are able to possibly remove it if we redesign the implementation, e.g, 
use atomic
access for comp_param.done, however, it still can not work efficiently i 
believe. Please see
more in the later reply to your comments in the cover-letter.




Re: [Qemu-devel] [PATCH] pc: Remove PC_COMPAT_2_12 from 3.0 machine-types

2018-06-10 Thread Xiao Guangrong




On 06/09/2018 03:29 AM, Eduardo Habkost wrote:

commit f548222c added PC_COMPAT_2_12 to the 3.0 PC machine-types.
I believe this happened during manual conflict resolution when
applying the patch.



Indeed!

Reviewed-by: Xiao Guangrong 



Re: [Qemu-devel] [PATCH 05/12] migration: show the statistics of compression

2018-06-06 Thread Xiao Guangrong




On 06/05/2018 06:31 AM, Eric Blake wrote:

On 06/04/2018 04:55 AM, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong 

Then the uses can adjust the parameters based on this info

Currently, it includes:
pages: amount of pages compressed and transferred to the target VM
busy: amount of count that no free thread to compress data
busy-rate: rate of thread busy
reduced-size: amount of bytes reduced by compression
compression-rate: rate of compressed size

Signed-off-by: Xiao Guangrong 
---



+++ b/qapi/migration.json
@@ -72,6 +72,26 @@
 'cache-miss': 'int', 'cache-miss-rate': 'number',
 'overflow': 'int' } }
+##
+# @CompressionStats:
+#
+# Detailed compression migration statistics


Sounds better as s/compression migration/migration compression/


Indeed.




+#
+# @pages: amount of pages compressed and transferred to the target VM
+#
+# @busy: amount of count that no free thread to compress data


Not sure what was meant, maybe:

@busy: count of times that no free thread was available to compress data



Yup, that's better.


+#
+# @busy-rate: rate of thread busy


In what unit? pages per second?


It's calculated by:
   pages-directly-posted-out-without-compression / total-page-posted-out




+#
+# @reduced-size: amount of bytes reduced by compression
+#
+# @compression-rate: rate of compressed size


In what unit?



It's calculated by:
   size-posted-out-after-compression / (compressed-pages * page_size), i.e.,
the denominator is the raw data size without compression
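For illustration, a small sketch of how the two rates could be derived from
the counters just described (the variable names here are made up, not the
actual QEMU counters):

```c
#include <stdint.h>
#include <stddef.h>

/* busy-rate: pages posted out directly (no free compression thread)
 * over all pages posted out */
static double calc_busy_rate(uint64_t pages_posted_without_compression,
                             uint64_t total_pages_posted)
{
    return (double)pages_posted_without_compression / total_pages_posted;
}

/* compression-rate: bytes posted after compression over the raw
 * (uncompressed) size of the pages that were compressed */
static double calc_compression_rate(uint64_t bytes_posted_after_compression,
                                    uint64_t compressed_pages,
                                    size_t page_size)
{
    return (double)bytes_posted_after_compression /
           ((double)compressed_pages * page_size);
}
```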


+#
+##


Missing a 'Since: 3.0' tag



Wow, directly upgrade to 3.0, big step. :-)
Will add this tag in the next version.


+{ 'struct': 'CompressionStats',
+  'data': {'pages': 'int', 'busy': 'int', 'busy-rate': 'number',
+   'reduced-size': 'int', 'compression-rate': 'number' } }
+
  ##
  # @MigrationStatus:
  #
@@ -169,6 +189,8 @@
  #   only present when the postcopy-blocktime migration capability
  #   is enabled. (Since 2.13)


Pre-existing - we need to fix this 2.13 to be 3.0 (if it isn't already fixed)


I should re-sync the repo before making patches next time.




  #
+# @compression: compression migration statistics, only returned if compression
+#   feature is on and status is 'active' or 'completed' (Since 2.14)


There will not be a 2.14 (for that matter, not even a 2.13).  The next release 
is 3.0.


Okay, will fix.



Re: [Qemu-devel] [PATCH v2] migration: introduce decompress-error-check

2018-05-03 Thread Xiao Guangrong

On 05/03/2018 04:06 PM, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong <xiaoguangr...@tencent.com>

QEMU 2.13 enables strict check for compression & decompression to
make the migration more robust, that depends on the source to fix
the internal design which triggers the unexpected error conditions

To make it work for migrating old version QEMU to 2.13 QEMU, we
introduce this parameter to disable the error check on the
destination which is the default behavior of the machine type
which is older than 2.13, alternately, the strict check can be
enabled explicitly as followings:
   -M pc-q35-2.11 -global migration.decompress-error-check=true



[ CC: Eric ]

Hi, Dave, Eric, Peter,

Could you please have a look if it is good to you guys.

Thanks!



Re: [Qemu-devel] [PATCH] migration: fix saving normal page even if it's been compressed

2018-05-02 Thread Xiao Guangrong



On 05/02/2018 10:46 AM, Peter Xu wrote:

On Sat, Apr 28, 2018 at 04:10:45PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong <xiaoguangr...@tencent.com>

Fix the bug introduced by da3f56cb2e767016 (migration: remove
ram_save_compressed_page()), It should be 'return' rather than
'res'

Sorry for this stupid mistake :(

Signed-off-by: Xiao Guangrong <xiaoguangr...@tencent.com>


Reviewed-by: Peter Xu <pet...@redhat.com>

So is that only a performance degradation without this fix (since
AFAIU the compressing pages will be sent twice)?  Thanks,


Yes, that's why we did not detect it in our test. :(



Re: [Qemu-devel] [PATCH] migration: introduce decompress-error-check

2018-04-28 Thread Xiao Guangrong



On 04/27/2018 07:29 PM, Dr. David Alan Gilbert wrote:



Yes, can not agree with you more. :)


The challenge is how to put something into the stream without breaking
an old version of QEMU that's receiving the stream.



Er, I did not think about this case :(.

The new parameter introduced by this patch is the only idea in my mind now...
not sure if Eric and Peter have another solution.



Re: [Qemu-devel] [PATCH] migration: introduce decompress-error-check

2018-04-27 Thread Xiao Guangrong



On 04/27/2018 05:31 PM, Peter Xu wrote:

On Fri, Apr 27, 2018 at 11:15:37AM +0800, Xiao Guangrong wrote:



On 04/26/2018 10:01 PM, Eric Blake wrote:

On 04/26/2018 04:15 AM, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong <xiaoguangr...@tencent.com>

QEMU 2.13 enables strict check for compression & decompression to
make the migration more robuster, that depends on the source to fix


s/robuster/robust/



Will fix, thank you for pointing it out.


the internal design which triggers the unexpected error conditions


2.13 hasn't been released yet.  Why do we need a knob to explicitly turn
off strict checking?  Can we not instead make 2.13 automatically smart
enough to tell if the incoming stream is coming from an older qemu
(which might fail if the strict checks are enabled) vs. a newer qemu
(the sender gave us what we need to ensure the strict checks are
worthwhile)?



Really smart.

How about introducing a new command, MIG_CMD_DECOMPRESS_ERR_CHECK?
The destination will do the strict check if it gets this command (i.e., a new
QEMU is running on the source); otherwise it turns the check off.


Why not we just introduce a compat bit for that?  I mean something
like: 15c3850325 ("migration: move skip_section_footers",
2017-06-28).  Then we turn that check bit off for <=2.12.

Would that work?


I am afraid it cannot. :(

The compat bit only impacts local behavior; however, in this case, we
need the source QEMU to tell the destination whether it supports the strict
error check.




Re: [Qemu-devel] [PATCH] migration: introduce decompress-error-check

2018-04-26 Thread Xiao Guangrong



On 04/26/2018 10:01 PM, Eric Blake wrote:

On 04/26/2018 04:15 AM, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong <xiaoguangr...@tencent.com>

QEMU 2.13 enables strict check for compression & decompression to
make the migration more robuster, that depends on the source to fix


s/robuster/robust/



Will fix, thank you for pointing it out.


the internal design which triggers the unexpected error conditions


2.13 hasn't been released yet.  Why do we need a knob to explicitly turn
off strict checking?  Can we not instead make 2.13 automatically smart
enough to tell if the incoming stream is coming from an older qemu
(which might fail if the strict checks are enabled) vs. a newer qemu
(the sender gave us what we need to ensure the strict checks are
worthwhile)?



Really smart.

How about introducing a new command, MIG_CMD_DECOMPRESS_ERR_CHECK?
The destination will do the strict check if it gets this command (i.e., a new
QEMU is running on the source); otherwise it turns the check off.



To make it work for migrating old version QEMU to 2.13 QEMU, we
introduce this parameter to disable the error check on the
destination

Signed-off-by: Xiao Guangrong <xiaoguangr...@tencent.com>
---



+++ b/qapi/migration.json
@@ -455,6 +455,17 @@
  #  compression, so set the decompress-threads to the number about 1/4
  #  of compress-threads is adequate.
  #
+# @decompress-error-check: check decompression errors. When false, the errors
+#  triggered by memory decompression are ignored.


What are the consequences of such an error?  Is it a security hole to
leave this at false, when a malicious migration stream can cause us to
misbehave by ignoring the errors?


The issue fixed by the strict error check is avoiding VM corruption if zlib fails
to compress & decompress the memory, i.e., catching the error conditions returned
by the zlib API.




+#  When true, migration is aborted if the errors are
+#  detected. For the old QEMU versions (< 2.13) the
+#  internal design will cause decompression to fail
+#  so the destination should completely ignore the
+#  error conditions, i.e, make it be false if these
+#  QEMUs are going to be migrated. Since 2.13, this
+#  design is fixed, make it be true to avoid corrupting
+#  the VM silently (Since 2.13)


Rather wordy; I'd suggest:

@decompress-error-check: Set to true to abort the migration if
 decompression errors are detected at the destination. Should be
 left at false (default) for qemu older than 2.13, since only
 newer qemu sends streams that do not trigger spurious
 decompression errors. (Since 2.13)



Yup, much better.


But that's if we even need it (it SHOULD be possible to design something
into the migration stream so that you can detect this property
automatically instead of relying on the user to set the property).



Yes, can not agree with you more. :)




Re: [Qemu-devel] [PATCH] migration: introduce decompress-error-check

2018-04-26 Thread Xiao Guangrong



On 04/26/2018 05:34 PM, Dr. David Alan Gilbert wrote:

* guangrong.x...@gmail.com (guangrong.x...@gmail.com) wrote:

From: Xiao Guangrong <xiaoguangr...@tencent.com>

QEMU 2.13 enables strict check for compression & decompression to
make the migration more robuster, that depends on the source to fix
the internal design which triggers the unexpected error conditions


Hmm that's painful!


Yup, I will think about it more carefully next time. :(




To make it work for migrating old version QEMU to 2.13 QEMU, we
introduce this parameter to disable the error check on the
destination

Signed-off-by: Xiao Guangrong <xiaoguangr...@tencent.com>
---
  hmp.c |  8 
  migration/migration.c | 19 +++
  migration/migration.h |  1 +
  migration/ram.c   |  2 +-
  qapi/migration.json   | 43 +++
  5 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/hmp.c b/hmp.c
index 898e25f3e1..f0b934368b 100644
--- a/hmp.c
+++ b/hmp.c
@@ -329,6 +329,10 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict 
*qdict)
  monitor_printf(mon, "%s: %u\n",
  MigrationParameter_str(MIGRATION_PARAMETER_DECOMPRESS_THREADS),
  params->decompress_threads);
+assert(params->has_decompress_error_check);
+monitor_printf(mon, "%s: %s\n",
+MigrationParameter_str(MIGRATION_PARAMETER_DECOMPRESS_ERROR_CHECK),
+params->decompress_error_check ? "on" : "off");


Since it's a bool, this should be a migration-capability rather than a
parameter I think.



The parameter @block-incremental is also a bool, so I think it is okay for
decompress-error-check as well; it is one of the configurations
of decompression, so it is better to group them. Right? :)


Also, we need to make sure it defaults to off for compatibility.


Yes, the parameter is allocated by zalloc, so it is already false by
default; do you think we should make it false explicitly?
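This is not from the patch, just a sketch of how the old default could also
be pinned for pre-2.13 machine types via a compat property, assuming the
usual compat-macro form and the "migration" global shown earlier
(-global migration.decompress-error-check=...):

```c
/* hypothetical compat entry; driver/property names follow the
 * "-global migration.decompress-error-check=..." usage above */
#define HW_COMPAT_2_12 \
    { \
        .driver   = "migration", \
        .property = "decompress-error-check", \
        .value    = "off", \
    },
```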



Other than that, I think it's OK.


Thank you, Dave!



Re: [Qemu-devel] [PATCH] migration: introduce decompress-error-check

2018-04-26 Thread Xiao Guangrong


Hi Dave,

On 04/26/2018 05:15 PM, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong <xiaoguangr...@tencent.com>

QEMU 2.13 enables strict check for compression & decompression to
make the migration more robuster, that depends on the source to fix
the internal design which triggers the unexpected error conditions

To make it work for migrating old version QEMU to 2.13 QEMU, we
introduce this parameter to disable the error check on the
destination


We noticed this issue when evaluating your pull request; could you
please review this fix and pull it into the 2.13 release? Sorry, I did
not realize it before. :(



Re: [Qemu-devel] [PATCH v3 00/10] migration: improve and cleanup compression

2018-04-07 Thread Xiao Guangrong


Hi Paolo, Michael, Stefan and others,

Could anyone merge this patchset if it is okay to you guys?

On 03/30/2018 03:51 PM, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong <xiaoguangr...@tencent.com>

Changelog in v3:
Following changes are from Peter's review:
1) use comp_param[i].file and decomp_param[i].compbuf to indicate if
the thread is properly init'd or not
2) save the file which is used by ram loader to the global variable
instead it is cached per decompression thread

Changelog in v2:
Thanks to the review from Dave, Peter, Wei and Jiang Biao, the changes
in this version are:
1) include the performance number in the cover letter
2) add some comments to explain how to use z_stream->opaque in the
patchset
3) allocate an internal buffer per thread to store the data to
be compressed
4) add a new patch that moves some code to ram_save_host_page() so
that 'goto' can be omitted gracefully
5) split the optimization of compression and decompression into two
separate patches
6) refine and correct code styles


This is the first part of our work to improve compression to make it
more useful in production.

The first patch resolves the problem that the migration thread spends
too much CPU resource compressing memory when it jumps to a new block,
which causes the network to be used very inefficiently.

The second patch fixes the performance issue that too many VM-exits
happen during live migration if compression is being used; it is caused
by huge amounts of memory being returned to the kernel frequently, as memory
is allocated and freed for every single call to compress2()

The remaining patches clean the code up dramatically

Performance numbers:
We have tested it on my desktop, i7-4790 + 16G, by locally live migrating
a VM which has 8 vCPUs + 6G memory, with the max-bandwidth limited to
350. During the migration, a workload with 8 threads repeatedly
wrote a total of 6G memory in the VM.

Before this patchset, its bandwidth is ~25 mbps; after applying, the
bandwidth is ~50 mbps.

We also collected the perf data for patches 2 and 3 in our production
environment; before the patchset:
+  57.88%  kqemu  [kernel.kallsyms][k] queued_spin_lock_slowpath
+  10.55%  kqemu  [kernel.kallsyms][k] __lock_acquire
+   4.83%  kqemu  [kernel.kallsyms][k] flush_tlb_func_common

-   1.16%  kqemu  [kernel.kallsyms]  [k] lock_acquire
   - lock_acquire
      - 15.68% _raw_spin_lock
         + 29.42% __schedule
         + 29.14% perf_event_context_sched_out
         + 23.60% tdp_page_fault
         + 10.54% do_anonymous_page
         + 2.07% kvm_mmu_notifier_invalidate_range_start
         + 1.83% zap_pte_range
         + 1.44% kvm_mmu_notifier_invalidate_range_end


apply our work:
+  51.92%  kqemu  [kernel.kallsyms][k] queued_spin_lock_slowpath
+  14.82%  kqemu  [kernel.kallsyms][k] __lock_acquire
+   1.47%  kqemu  [kernel.kallsyms][k] mark_lock.clone.0
+   1.46%  kqemu  [kernel.kallsyms][k] native_sched_clock
+   1.31%  kqemu  [kernel.kallsyms][k] lock_acquire
+   1.24%  kqemu  libc-2.12.so [.] __memset_sse2

-  14.82%  kqemu  [kernel.kallsyms]  [k] __lock_acquire
   - __lock_acquire
      - 99.75% lock_acquire
         - 18.38% _raw_spin_lock
            + 39.62% tdp_page_fault
            + 31.32% __schedule
            + 27.53% perf_event_context_sched_out
            + 0.58% hrtimer_interrupt


We can see the TLB flush and mmu-lock contention have gone.

Xiao Guangrong (10):
   migration: stop compressing page in migration thread
   migration: stop compression to allocate and free memory frequently
   migration: stop decompression to allocate and free memory frequently
   migration: detect compression and decompression errors
   migration: introduce control_save_page()
   migration: move some code to ram_save_host_page
   migration: move calling control_save_page to the common place
   migration: move calling save_

Re: [Qemu-devel] [PATCH V4] migration: add capability to bypass the shared memory

2018-04-04 Thread Xiao Guangrong



On 04/04/2018 07:47 PM, Lai Jiangshan wrote:

1) What's this

When the migration capability 'bypass-shared-memory'
is set, the shared memory will be bypassed when migration.

It is the key feature to enable several excellent features for
the qemu, such as qemu-local-migration, qemu-live-update,
extremely-fast-save-restore, vm-template, vm-fast-live-clone,
yet-another-post-copy-migration, etc..

The philosophy behind this key feature, including the resulting
advanced key features, is that a part of the memory management
is separated out from the qemu, and let the other toolkits
such as libvirt, kata-containers (https://github.com/kata-containers)
runv(https://github.com/hyperhq/runv/) or some multiple cooperative
qemu commands directly access to it, manage it, provide features on it.

2) Status in real world

The hyperhq(http://hyper.sh  http://hypercontainer.io/)
introduced the feature vm-template(vm-fast-live-clone)
to the hyper container for several years, it works perfect.
(see https://github.com/hyperhq/runv/pull/297).

The feature vm-template makes the containers(VMs) can
be started in 130ms and save 80M memory for every
container(VM). So that the hyper containers are fast
and high-density as normal containers.

kata-containers project (https://github.com/kata-containers)
which was launched by hyper, intel and friends and which descended
from runv (and clear-container) should have this feature enabled.
Unfortunately, due to the code confliction between runv,
this feature was temporary disabled and it is being brought
back by hyper and intel team.

3) How to use and bring up advanced features.

In current qemu command line, shared memory has
to be configured via memory-object.

a) feature: qemu-local-migration, qemu-live-update
Set the mem-path on the tmpfs and set share=on for it when
start the vm. example:
-object \
memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
-numa node,nodeid=0,cpus=0-7,memdev=mem

when you want to migrate the vm locally (after fixed a security bug
of the qemu-binary, or other reason), you can start a new qemu with
the same command line and -incoming, then you can migrate the
vm from the old qemu to the new qemu with the migration capability
'bypass-shared-memory' set. The migration will migrate the device-state
*ONLY*, the memory is the origin memory backed by tmpfs file.

b) feature: extremely-fast-save-restore
the same above, but the mem-path is on the persistent file system.

c)  feature: vm-template, vm-fast-live-clone
the template vm is started as 1), and paused when the guest reaches
the template point(example: the guest app is ready), then the template
vm is saved. (the qemu process of the template can be killed now, because
we need only the memory and the device state files (in tmpfs)).

Then we can launch one or multiple VMs base on the template vm states,
the new VMs are started without the “share=on”, all the new VMs share
the initial memory from the memory file, they save a lot of memory.
all the new VMs start from the template point, the guest app can go to
work quickly.

The new VM booted from template vm can’t become template again,
if you need this unusual chained-template feature, you can write
a cloneable-tmpfs kernel module for it.

The libvirt toolkit can’t manage vm-template currently, in the
hyperhq/runv, we use qemu wrapper script to do it. I hope someone add
“libvrit managed template” feature to libvirt.

d) feature: yet-another-post-copy-migration
It is a possible feature, no toolkit can do it well now.
Using nbd server/client on the memory file is reluctantly Ok but
inconvenient. A special feature for tmpfs might be needed to
fully complete this feature.
No one need yet another post copy migration method,
but it is possible when some crazy man need it.


Excellent work. :)

It's a brilliant feature that can improve our production a lot.

Reviewed-by: Xiao Guangrong <xiaoguangr...@tencent.com>



Re: [Qemu-devel] [PATCH v2 04/10] migration: detect compression and decompression errors

2018-03-29 Thread Xiao Guangrong



On 03/29/2018 12:25 PM, Peter Xu wrote:

On Thu, Mar 29, 2018 at 11:51:03AM +0800, Xiao Guangrong wrote:



On 03/28/2018 05:59 PM, Peter Xu wrote:

On Tue, Mar 27, 2018 at 05:10:37PM +0800, guangrong.x...@gmail.com wrote:

[...]


-static int compress_threads_load_setup(void)
+static int compress_threads_load_setup(QEMUFile *f)
   {
   int i, thread_count;
@@ -2665,6 +2685,7 @@ static int compress_threads_load_setup(void)
   }
    decomp_param[i].stream.opaque = &decomp_param[i];
+decomp_param[i].file = f;


On the source side the error will be set via:

  qemu_file_set_error(migrate_get_current()->to_dst_file, blen);

Maybe we can do similar things using migrate_incoming_get_current() to
avoid caching the QEMUFile multiple times?



I have considered it; however, it cannot work because the @file used by the ram
loader is not the file obtained from migrate_incoming_get_current() in some
cases.

For example, in colo_process_incoming_thread(), the file passed to
qemu_loadvm_state() is an internal buffer and it is not easy to switch it
to the incoming file.


I see. How about cache it in a global variable?  We have these
already:

 thread_count = migrate_decompress_threads();
 decompress_threads = g_new0(QemuThread, thread_count);
 decomp_param = g_new0(DecompressParam, thread_count);
 ...

IMHO we can add a new one too, at least we don't cache it multiple
times (after all decomp_param[i]s are global variables too).



Nice, that's good to me. Will add your Reviewed-by on this patch
as well if you do not mind. :)
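A rough sketch of that direction, using the names from this thread (the
final code may well differ):

```c
/* cached once in setup, so the decompression threads can report errors
 * without each DecompressParam carrying its own copy of the QEMUFile */
static QEMUFile *decomp_file;

static int compress_threads_load_setup(QEMUFile *f)
{
    if (!migrate_use_compression()) {
        return 0;
    }
    decomp_file = f;
    /* ... allocate decompress_threads / decomp_param as before ... */
    return 0;
}

/* in a decompression thread, when inflate() fails: */
static void decompress_report_error(int ret)
{
    qemu_file_set_error(decomp_file, ret);
}
```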



Re: [Qemu-devel] [PATCH v2 04/10] migration: detect compression and decompression errors

2018-03-28 Thread Xiao Guangrong



On 03/28/2018 05:59 PM, Peter Xu wrote:

On Tue, Mar 27, 2018 at 05:10:37PM +0800, guangrong.x...@gmail.com wrote:

[...]


-static int compress_threads_load_setup(void)
+static int compress_threads_load_setup(QEMUFile *f)
  {
  int i, thread_count;
  
@@ -2665,6 +2685,7 @@ static int compress_threads_load_setup(void)

  }
  decomp_param[i].stream.opaque = &decomp_param[i];
  
+decomp_param[i].file = f;


On the source side the error will be set via:

 qemu_file_set_error(migrate_get_current()->to_dst_file, blen);

Maybe we can do similar things using migrate_incoming_get_current() to
avoid caching the QEMUFile multiple times?



I have considered it; however, it cannot work because the @file used by the ram
loader is not the file obtained from migrate_incoming_get_current() in some
cases.

For example, in colo_process_incoming_thread(), the file passed to
qemu_loadvm_state() is an internal buffer and it is not easy to switch it
to the incoming file.




Re: [Qemu-devel] [PATCH v2 03/10] migration: stop decompression to allocate and free memory frequently

2018-03-28 Thread Xiao Guangrong



On 03/28/2018 05:42 PM, Peter Xu wrote:

On Tue, Mar 27, 2018 at 05:10:36PM +0800, guangrong.x...@gmail.com wrote:

[...]


+static int compress_threads_load_setup(void)
+{
+int i, thread_count;
+
+if (!migrate_use_compression()) {
+return 0;
+}
+
+thread_count = migrate_decompress_threads();
+decompress_threads = g_new0(QemuThread, thread_count);
+decomp_param = g_new0(DecompressParam, thread_count);
+qemu_mutex_init(&decomp_done_lock);
+qemu_cond_init(&decomp_done_cond);
+for (i = 0; i < thread_count; i++) {
+if (inflateInit(&decomp_param[i].stream) != Z_OK) {
+goto exit;
+}
+decomp_param[i].stream.opaque = &decomp_param[i];


Same question as the encoding patch here, otherwise looks good to me.


Thanks for pointing it out, will fix.

Hmm, can I treat it as your Reviewed-by for the next version?



Re: [Qemu-devel] [PATCH v2 02/10] migration: stop compression to allocate and free memory frequently

2018-03-28 Thread Xiao Guangrong



On 03/28/2018 05:25 PM, Peter Xu wrote:

On Tue, Mar 27, 2018 at 05:10:35PM +0800, guangrong.x...@gmail.com wrote:

[...]


@@ -357,10 +358,20 @@ static void compress_threads_save_cleanup(void)
  terminate_compression_threads();
  thread_count = migrate_compress_threads();
  for (i = 0; i < thread_count; i++) {
+/*
+ * stream.opaque can be used to store private data, we use it
+ * as a indicator which shows if the thread is properly init'd
+ * or not
+ */
+if (!comp_param[i].stream.opaque) {
+break;
+}


How about using comp_param[i].file?  The opaque seems to be hiding
deeper, and...



Yes, indeed, good suggestion.


  qemu_thread_join(compress_threads + i);
  qemu_fclose(comp_param[i].file);
  qemu_mutex_destroy(&comp_param[i].mutex);
  qemu_cond_destroy(&comp_param[i].cond);
+deflateEnd(&comp_param[i].stream);
+comp_param[i].stream.opaque = NULL;
  }
  qemu_mutex_destroy(&comp_done_lock);
  qemu_cond_destroy(&comp_done_cond);
@@ -370,12 +381,12 @@ static void compress_threads_save_cleanup(void)
  comp_param = NULL;
  }
  
-static void compress_threads_save_setup(void)

+static int compress_threads_save_setup(void)
  {
  int i, thread_count;
  
  if (!migrate_use_compression()) {

-return;
+return 0;
  }
  thread_count = migrate_compress_threads();
  compress_threads = g_new0(QemuThread, thread_count);
@@ -383,6 +394,12 @@ static void compress_threads_save_setup(void)
  qemu_cond_init(&comp_done_cond);
  qemu_mutex_init(&comp_done_lock);
  for (i = 0; i < thread_count; i++) {
+if (deflateInit(&comp_param[i].stream,
+   migrate_compress_level()) != Z_OK) {


(indent issue)



Will fix.


+goto exit;
+}
+comp_param[i].stream.opaque = &comp_param[i];


...here from document:

 ZEXTERN int ZEXPORT deflateInit OF((z_streamp strm, int level));

 Initializes the internal stream state for compression. The
 fields zalloc, zfree and opaque must be initialized before by
 the caller. If zalloc and zfree are set to Z_NULL, deflateInit
 updates them to use default allocation functions.

So shall we init opaque first?  Otherwise looks good to me.


No, opaque needs to be init'd only if zalloc and zfree are specified, which
is not the case in this patch.





Re: [Qemu-devel] [PATCH 3/8] migration: support todetectcompression and decompression errors

2018-03-28 Thread Xiao Guangrong



On 03/28/2018 12:20 PM, Peter Xu wrote:

On Wed, Mar 28, 2018 at 12:08:19PM +0800, jiang.bi...@zte.com.cn wrote:


On Tue, Mar 27, 2018 at 10:35:29PM +0800, Xiao Guangrong wrote:


No, we can't make the assumption that "error _must_ be caused by page update".
No document/ABI about compress/decompress promised it. :)


Indeed, I found no good documents about below errors that jiang.biao
pointed out.

Hi, Peter
The description about the errors comes from here,
http://www.zlib.net/manual.html
And about the error codes returned by inflate(), they are described as,
** inflate() returns
Z_OK if some progress has been made (more input processed or more output 
produced),
Z_STREAM_END if the end of the compressed data has been reached and all 
uncompressed output has been produced,
Z_NEED_DICT if a preset dictionary is needed at this point,
Z_DATA_ERROR if the input data was corrupted (input stream not conforming to the 
zlib format or incorrect check value, in which case strm->msg points to a 
string with a more specific error),
Z_STREAM_ERROR if the stream structure was inconsistent (for example next_in or 
next_out was Z_NULL, or the state was inadvertently written over by the 
application),
Z_MEM_ERROR if there was not enough memory,
Z_BUF_ERROR if no progress was possible or if there was not enough room in the 
output buffer when Z_FINISH is used. ...
**


Ah yes.  My bad to be so uncareful. :)


According to the above description, the error caused by page update looks
more like tend to return Z_DATA_ERROR, but I do not have env to verify that. :)


No, we still lack info to confirm that compressing data which is being
updated is the only case that returns Z_DATA_ERROR. And nothing guarantees
that no other error condition which corrupts data will be squeezed
into this error code.


As I understand it, the real compress/decompress error cases other than that
caused by page update should be rare, maybe the error code is enough to
distinguish those if we can verify the the error codes returned by page update
and other silent failures by test. If so, we can cut the cost of memcpy.


Please note, compared with other operations, e.g., compression, zero-page
detection, etc., memcpy() is not a hot function at all.


If not, I agree with Guangrong's idea too. I never read the zlib code and all my
information comes from the manual, so if anything inaccurate, pls ignore my
option. :)


So I suppose all of us know that alternative now, we just need a solid
way to confirm the uncertainty.  I'll leave this to Guangrong.


Yes, I still prefer memcpy() to make it safe enough to protect our production
unless we get enough certainty to figure out the error conditions.

Thanks!





Re: [Qemu-devel] [PATCH 1/8] migration: stop compressing page in migration thread

2018-03-27 Thread Xiao Guangrong



On 03/28/2018 11:01 AM, Wang, Wei W wrote:

On Tuesday, March 13, 2018 3:58 PM, Xiao Guangrong wrote:


As compression is a heavy work, do not do it in migration thread, instead, we
post it out as a normal page

Signed-off-by: Xiao Guangrong <xiaoguangr...@tencent.com>



Hi Guangrong,

Dave asked me to help review your patch, so I will just drop my 2 cents 
wherever possible, and hope that could be inspiring for your work.


Thank you both for the nice help on the work. :)





---
  migration/ram.c | 32 
  1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c index
7266351fd0..615693f180 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1132,7 +1132,7 @@ static int ram_save_compressed_page(RAMState
*rs, PageSearchStatus *pss,
  int pages = -1;
  uint64_t bytes_xmit = 0;
  uint8_t *p;
-int ret, blen;
+int ret;
  RAMBlock *block = pss->block;
  ram_addr_t offset = pss->page << TARGET_PAGE_BITS;

@@ -1162,23 +1162,23 @@ static int
ram_save_compressed_page(RAMState *rs, PageSearchStatus *pss,
  if (block != rs->last_sent_block) {
  flush_compressed_data(rs);
  pages = save_zero_page(rs, block, offset);
-if (pages == -1) {
-/* Make sure the first page is sent out before other pages */
-bytes_xmit = save_page_header(rs, rs->f, block, offset |
-  RAM_SAVE_FLAG_COMPRESS_PAGE);
-blen = qemu_put_compression_data(rs->f, p, TARGET_PAGE_SIZE,
- migrate_compress_level());
-if (blen > 0) {
-ram_counters.transferred += bytes_xmit + blen;
-ram_counters.normal++;
-pages = 1;
-} else {
-qemu_file_set_error(rs->f, blen);
-error_report("compressed data failed!");
-}
-}
  if (pages > 0) {
  ram_release_pages(block->idstr, offset, pages);
+} else {
+/*
+ * Make sure the first page is sent out before other pages.
+ *
+ * we post it as normal page as compression will take much
+ * CPU resource.
+ */
+ram_counters.transferred += save_page_header(rs, rs->f, block,
+offset | RAM_SAVE_FLAG_PAGE);
+qemu_put_buffer_async(rs->f, p, TARGET_PAGE_SIZE,
+  migrate_release_ram() &
+  migration_in_postcopy());
+ram_counters.transferred += TARGET_PAGE_SIZE;
+ram_counters.normal++;
+pages = 1;
  }
  } else {
  pages = save_zero_page(rs, block, offset);
--


I agree that this patch is an improvement for the current implementation. So 
just pile up mine here:
Reviewed-by: Wei Wang <wei.w.w...@intel.com>


Thanks.




If you are interested in something more aggressive, I can share an alternative 
approach, which I think would be better. Please see below.

Actually, we can use the multi-threaded compression for the first page as well, 
which will not block the migration thread progress. The advantage is that we 
can enjoy the compression benefit for the first page and meanwhile not blocking 
the migration thread - the page is given to a compression thread and compressed 
asynchronously to the migration thread execution.



Yes, it is a good point.


The main barrier to achieving the above that is that we need to make sure the 
first page of each block is sent first in the multi-threaded environment. We 
can twist the current implementation to achieve that, which is not hard:

For example, we can add a new flag to RAMBlock - bool first_page_added. In each 
thread of compression, they need
1) check if this is the first page of the block.
2) If it is the first page, set block->first_page_added after sending the page;
3) If it is not the first the page, wait to send the page only when 
block->first_page_added is set.



So there is another barrier introduced which hurts the parallelism...

Hmm, we need more deliberate consideration on this point; let me think it over
after this work.

Thank you.
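For reference only, a rough sketch (not part of this patchset) of the
per-block gating Wei Wang suggests above, with illustrative names; as noted,
every non-first page has to wait on the barrier:

```c
#include "qemu/osdep.h"
#include "qemu/thread.h"

/* one of these per RAMBlock (illustrative, not the patchset's data) */
typedef struct BlockOrder {
    bool first_page_sent;
    QemuMutex lock;
    QemuCond cond;
} BlockOrder;

/* called by a compression thread once its page is compressed */
static void send_in_order(BlockOrder *o, bool is_first_page,
                          void (*send_page)(void *opaque), void *opaque)
{
    qemu_mutex_lock(&o->lock);
    if (is_first_page) {
        send_page(opaque);
        o->first_page_sent = true;          /* release the other workers */
        qemu_cond_broadcast(&o->cond);
    } else {
        while (!o->first_page_sent) {       /* the extra barrier noted above */
            qemu_cond_wait(&o->cond, &o->lock);
        }
        send_page(opaque);
    }
    qemu_mutex_unlock(&o->lock);
}
```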




Re: [Qemu-devel] [PATCH 3/8] migration: support to detectcompression and decompression errors

2018-03-27 Thread Xiao Guangrong



On 03/28/2018 08:43 AM, jiang.bi...@zte.com.cn wrote:

On 03/27/2018 07:17 PM, Peter Xu wrote:

On Tue, Mar 27, 2018 at 03:42:32AM +0800, Xiao Guangrong wrote:

[...]


It'll be understandable to me if the problem is that the compress()
API does not allow the input buffer to be changed during the whole
period of the call.  If that is a must, this patch for sure helps.


Yes, that is exactly what i want to say. :)


So I think now I know what this patch is for. :) And yeah, it makes
sense.

Though another question would be: if the buffer is updated during
compress() and compress() returned error, would that pollute the whole
z_stream or it only fails the compress() call?



I guess deflateReset() can recover everything, i.e., keep the z_stream as
it was init'ed by deflateInit().
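To make that guess concrete, a sketch which assumes deflateReset() really
does behave like deflateEnd() + deflateInit() here, so a stream "polluted"
by a failed deflate() can be reused for the next page without reallocating
its internal buffers:

```c
#include <stdint.h>
#include <string.h>
#include <zlib.h>

/* returns the compressed length, or -1 after resetting the stream */
static int compress_one_page(z_stream *stream, uint8_t *dest, size_t dest_len,
                             const uint8_t *src, size_t src_len)
{
    stream->avail_in = src_len;
    stream->next_in = (uint8_t *)src;
    stream->avail_out = dest_len;
    stream->next_out = dest;

    if (deflate(stream, Z_FINISH) != Z_STREAM_END) {
        deflateReset(stream);        /* recover the stream for the next page */
        return -1;                   /* caller sends the raw page instead */
    }
    return dest_len - stream->avail_out;
}
```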


(Same question applies to decompress().)

If it's only a compress() error and it won't pollute z_stream (or say,
it can be recovered after a deflateReset() and then we can continue to
call deflate() without problem), then we'll actually have two
alternatives to solve this "buffer update" issue:

1. Use the approach of current patch: we copy the page every time, so
 deflate() never fails because update never happens.  But it's slow
 since we copy the pages every time.

2. Use the old approach, and when compress() fail, we just ignore that
 page (since now we know that error _must_ be caused by page update,
 then we are 100% sure that we'll send that page again so it'll be
 perfectly fine).



No, we can't make the assumption that "error _must_ be caused by page update".
No document/ABI about compress/decompress promised it. :)

So, as I metioned before, can we just distingush the decompress/compress errors
from errors caused by page update by the return code of inflate/deflate?
According to the zlib manual, there seems to be several error codes for 
different
cases,
#define Z_ERRNO(-1)
#define Z_STREAM_ERROR (-2)
#define Z_DATA_ERROR   (-3)
#define Z_MEM_ERROR(-4)
#define Z_BUF_ERROR(-5)
#define Z_VERSION_ERROR (-6)
Did you check the return code when silent failure(not caused by page update)
happened before? :)


I am afraid there is no such error code, and I guess zlib is not designed to
compress data which is being modified.




Re: [Qemu-devel] [PATCH 3/8] migration: support to detect compression and decompression errors

2018-03-27 Thread Xiao Guangrong



On 03/27/2018 07:17 PM, Peter Xu wrote:

On Tue, Mar 27, 2018 at 03:42:32AM +0800, Xiao Guangrong wrote:

[...]


It'll be understandable to me if the problem is that the compress()
API does not allow the input buffer to be changed during the whole
period of the call.  If that is a must, this patch for sure helps.


Yes, that is exactly what i want to say. :)


So I think now I know what this patch is for. :) And yeah, it makes
sense.

Though another question would be: if the buffer is updated during
compress() and compress() returned error, would that pollute the whole
z_stream or it only fails the compress() call?



I guess deflateReset() can recover everything, i.e., keep the z_stream as
it was init'ed by deflateInit().


(Same question applies to decompress().)

If it's only a compress() error and it won't pollute z_stream (or say,
it can be recovered after a deflateReset() and then we can continue to
call deflate() without problem), then we'll actually have two
alternatives to solve this "buffer update" issue:

1. Use the approach of current patch: we copy the page every time, so
deflate() never fails because update never happens.  But it's slow
since we copy the pages every time.

2. Use the old approach, and when compress() fail, we just ignore that
page (since now we know that error _must_ be caused by page update,
then we are 100% sure that we'll send that page again so it'll be
perfectly fine).



No, we can't make the assumption that "error _must_ be caused by page update".
No document/ABI about compress/decompress promised it. :)

Thanks!



Re: [Qemu-devel] [PATCH 3/8] migration: support to detect compression and decompression errors

2018-03-27 Thread Xiao Guangrong



On 03/27/2018 03:22 PM, Peter Xu wrote:

On Thu, Mar 22, 2018 at 08:03:53PM +0800, Xiao Guangrong wrote:



On 03/21/2018 06:00 PM, Peter Xu wrote:

On Tue, Mar 13, 2018 at 03:57:34PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong <xiaoguangr...@tencent.com>

Currently the page being compressed is allowed to be updated by
the VM on the source QEMU, correspondingly the destination QEMU
just ignores the decompression error. However, we completely miss
the chance to catch real errors, then the VM is corrupted silently

To make the migration more robuster, we copy the page to a buffer
first to avoid it being written by VM, then detect and handle the
errors of both compression and decompression errors properly


Not sure I missed anything important, but I'll just shoot my thoughts
as questions (again)...

Actually this is a more general question? Say, even without
compression, we can be sending a page that is being modified.

However, IMHO we don't need to worry that, since if that page is
modified, we'll definitely send that page again, so the new page will
replace the old.  So on destination side, even if decompress() failed
on a page it'll be fine IMHO.  Though now we are copying the corrupted
buffer.  On that point, I fully agree that we should not - maybe we
can just drop the page entirely?

For non-compress pages, we can't detect that, so we'll copy the page
even if corrupted.

The special part for compression would be: would the deflate() fail if
there is concurrent update to the buffer being compressed?  And would
that corrupt the whole compression stream, or it would only fail the
deflate() call?


It is not the same for normal pages and compressed pages.

For a normal page, the dirty-log mechanism in QEMU and the infrastructure
of the network (e.g., TCP) can make sure that the modified memory will
be posted to the destination without corruption.

However, nothing can guarantee compression/decompression is BUG-free;
e.g., consider the case where, in the last step, vCPUs & dirty-log are paused
and the memory is compressed and posted to the destination: if there is any
error in compression/decompression, the VM dies silently.


Here do you mean the compression error even if the VM is halted?  I'd
say in that case IMHO the extra memcpy() would still help little since
the coiped page should exactly be the same as the source page?


"compression error" means that compress2() in the original code returns an
error code.

If the data being compressed is being modified at the same time,
compression will fail, and that failure is not a real error. We move the data
to an internal buffer to avoid this case, so that we can catch the real
error conditions.
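A minimal sketch of that idea (GUEST_PAGE_SIZE and the helper name are
placeholders; the real patch keeps one buffer per compression thread and
uses a per-thread z_stream rather than compress2()):

```c
#include <stdint.h>
#include <string.h>
#include <zlib.h>

#define GUEST_PAGE_SIZE 4096              /* stands in for TARGET_PAGE_SIZE */

/* compress a private snapshot of the guest page: a concurrent guest write
 * can no longer make the call fail, so any reported error is a real one */
static int compress_snapshot(uint8_t *dest, unsigned long *dest_len,
                             const uint8_t *guest_page, int level)
{
    static uint8_t buf[GUEST_PAGE_SIZE];  /* real code: one buffer per thread */

    memcpy(buf, guest_page, GUEST_PAGE_SIZE);   /* stable copy of the page */
    return compress2(dest, dest_len, buf, GUEST_PAGE_SIZE, level);
}
```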



I'd say I don't know what we can really do if there are zlib bugs. I
was assuming we'll definitely fail in a strange way if there is any,
which should be hard to be detected from QEMU's POV (maybe a
destination VM crash, as you mentioned).  It'll be easy for us to
detect errors when we got error code returned from compress(), however
IMHO when we say "zlib bug" it can also mean that data is corrputed
even compress() and decompress() both returned with good state.



Ah, sorry, I abused the word "BUG".

It does not mean a BUG in the compression/decompression API; I mean the
failure conditions (the API returns an error code).


It'll be understandable to me if the problem is that the compress()
API does not allow the input buffer to be changed during the whole
period of the call.  If that is a must, this patch for sure helps.


Yes, that is exactly what i want to say. :)





Re: [Qemu-devel] [PATCH 1/8] migration: stop compressing page in migration thread

2018-03-26 Thread Xiao Guangrong



On 03/26/2018 05:02 PM, Peter Xu wrote:

On Thu, Mar 22, 2018 at 07:38:07PM +0800, Xiao Guangrong wrote:



On 03/21/2018 04:19 PM, Peter Xu wrote:

On Fri, Mar 16, 2018 at 04:05:14PM +0800, Xiao Guangrong wrote:


Hi David,

Thanks for your review.

On 03/15/2018 06:25 PM, Dr. David Alan Gilbert wrote:


migration/ram.c | 32 


Hi,
 Do you have some performance numbers to show this helps?  Were those
taken on a normal system or were they taken with one of the compression
accelerators (which I think the compression migration was designed for)?


Yes, I have tested it on my desktop, i7-4790 + 16G, by locally live migrating
the VM which has 8 vCPUs + 6G memory, with the max-bandwidth limited to 350.

During the migration, a workload with 8 threads repeatedly wrote a total of
6G memory in the VM. Before this patchset, its bandwidth is ~25 mbps; after
applying, the bandwidth is ~50 mbps.


Hi, Guangrong,

Not really review comments, but I got some questions. :)


Your comments are always valuable to me! :)



IIUC this patch will only change the behavior when last_sent_block
changed.  I see that the performance is doubled after the change,
which is really promising.  However I don't fully understand why it
brings such a big difference considering that IMHO current code is
sending dirty pages per-RAMBlock.  I mean, IMHO last_sent_block should
not change frequently?  Or am I wrong?


It depends on the configuration; each memory region which is RAM or
file backed has a RAMBlock.

Actually, more benefit comes from the fact that the performance & throughput
of the multiple threads have been improved, as the threads are fed by the
migration thread and the result is consumed by the migration
thread.


I'm not sure whether I got your points - I think you mean that the
compression threads and the migration thread can form a better
pipeline if the migration thread does not do any compression at all.

I think I agree with that.

However it does not really explain to me on why a very rare event
(sending the first page of a RAMBlock, considering bitmap sync is
rare) can greatly affect the performance (it shows a doubled boost).



I understand it is tricky indeed, but it is not very hard to explain.
The multiple threads (using 8 CPUs in our test) stay idle for a long time
with the original code; however, after our patch, the normal page is
posted out asynchronously, which is extremely fast as you said (the network
is almost idle in the current implementation), so there is a long stretch in
which the CPUs can be used effectively to generate more compressed data than
before.


Btw, about the numbers: IMHO the numbers might not be really "true
numbers".  Or say, even the bandwidth is doubled, IMHO it does not
mean the performance is doubled. Becasue the data has changed.

Previously there were only compressed pages, and now for each cycle of
RAMBlock looping we'll send a normal page (then we'll get more thing
to send).  So IMHO we don't really know whether we sent more pages
with this patch, we can only know we sent more bytes (e.g., an extreme
case is that the extra 25Mbps/s are all caused by those normal pages,
and we can be sending exactly the same number of pages like before, or
even worse?).



The current implementation uses the CPU very inefficiently (improving that is
our next work to be posted out), so the network is almost idle and posting
more data out is a better choice. Furthermore, the migration thread plays a
role in the parallelism, so it'd better be fast.





Another follow-up question would be: have you measured how long time
needed to compress a 4k page, and how many time to send it?  I think
"sending the page" is not really meaningful considering that we just
put a page into the buffer (which should be extremely fast since we
don't really flush it every time), however I would be curious on how
slow would compressing a page be.


I haven't benchmarked the performance of zlib; I think it is a CPU-intensive
workload, particularly as there is no compression accelerator (e.g., QAT) in
our production. BTW, we were using lzo instead of zlib, which worked
better for some workloads.


Never mind. Good to know about that.



Putting a page into the buffer should depend on the network, i.e., if the
network is congested it should take a long time. :)


Again, considering that I don't know much on compression (especially I
hardly used that) mine are only questions, which should not block your
patches to be either queued/merged/reposted when proper. :)


Yes, i see. The discussion can potentially raise a better solution.

Thanks for your comment, Peter!



Re: [Qemu-devel] [PATCH 3/8] migration: support to detect compression and decompression errors

2018-03-22 Thread Xiao Guangrong



On 03/21/2018 06:00 PM, Peter Xu wrote:

On Tue, Mar 13, 2018 at 03:57:34PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong <xiaoguangr...@tencent.com>

Currently the page being compressed is allowed to be updated by
the VM on the source QEMU, correspondingly the destination QEMU
just ignores the decompression error. However, we completely miss
the chance to catch real errors, then the VM is corrupted silently

To make the migration more robuster, we copy the page to a buffer
first to avoid it being written by VM, then detect and handle the
errors of both compression and decompression errors properly


Not sure I missed anything important, but I'll just shoot my thoughts
as questions (again)...

Actually this is a more general question? Say, even without
compression, we can be sending a page that is being modified.

However, IMHO we don't need to worry that, since if that page is
modified, we'll definitely send that page again, so the new page will
replace the old.  So on destination side, even if decompress() failed
on a page it'll be fine IMHO.  Though now we are copying the corrupted
buffer.  On that point, I fully agree that we should not - maybe we
can just drop the page entirely?

For non-compress pages, we can't detect that, so we'll copy the page
even if corrupted.

The special part for compression would be: would the deflate() fail if
there is concurrent update to the buffer being compressed?  And would
that corrupt the whole compression stream, or it would only fail the
deflate() call?


It is not the same for normal pages and compressed pages.

For a normal page, the dirty-log mechanism in QEMU and the infrastructure
of the network (e.g., TCP) can make sure that the modified memory will
be posted to the destination without corruption.

However, nothing can guarantee compression/decompression is BUG-free;
e.g., consider the case where, in the last step, vCPUs & dirty-log are paused
and the memory is compressed and posted to the destination: if there is any
error in compression/decompression, the VM dies silently.



Re: [Qemu-devel] [PATCH 2/8] migration: stop allocating and freeing memory frequently

2018-03-22 Thread Xiao Guangrong



On 03/21/2018 05:06 PM, Peter Xu wrote:

On Tue, Mar 13, 2018 at 03:57:33PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong <xiaoguangr...@tencent.com>

Current code uses compress2()/uncompress() to compress/decompress
memory, these two function manager memory allocation and release
internally, that causes huge memory is allocated and freed very
frequently

More worse, frequently returning memory to kernel will flush TLBs
and trigger invalidation callbacks on mmu-notification which
interacts with KVM MMU, that dramatically reduce the performance
of VM

So, we maintain the memory by ourselves and reuse it for each
compression and decompression

Signed-off-by: Xiao Guangrong <xiaoguangr...@tencent.com>
---
  migration/qemu-file.c |  34 ++--
  migration/qemu-file.h |   6 ++-
  migration/ram.c   | 142 +-
  3 files changed, 140 insertions(+), 42 deletions(-)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 2ab2bf362d..1ff33a1ffb 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -658,6 +658,30 @@ uint64_t qemu_get_be64(QEMUFile *f)
  return v;
  }
  
+/* return the size after compression, or negative value on error */

+static int qemu_compress_data(z_stream *stream, uint8_t *dest, size_t dest_len,
+  const uint8_t *source, size_t source_len)
+{
+int err;
+
+err = deflateReset(stream);


I'm not familiar with zlib, but I saw this in manual:

  https://www.zlib.net/manual.html

  This function is equivalent to deflateEnd followed by deflateInit,
  but does not free and reallocate the internal compression state. The
  stream will leave the compression level and any other attributes that
  may have been set unchanged.

I thought it was deflateInit() who is slow?  Can we avoid the reset as


deflateEnd() is worse as it frees memory back to the kernel, which triggers
TLB flushes and mmu-notifier callbacks.


long as we make sure to deflateInit() before doing anything else?


Actually, deflateReset() is cheap... :)



Meanwhile, is there any performance number for this single patch?
Since I thought the old code is calling compress2() which contains
deflateInit() and deflateEnd() too, just like what current patch do?


No, after the patch, we just call deflateInit() / deflateEnd() one
time (in the _setup() and _cleanup() handlers).

Yes. This is the perf data from our production;
after reverting this patch:
+  57.88%  kqemu  [kernel.kallsyms][k] queued_spin_lock_slowpath
+  10.55%  kqemu  [kernel.kallsyms][k] __lock_acquire
+   4.83%  kqemu  [kernel.kallsyms][k] flush_tlb_func_common

-   1.16%  kqemu  [kernel.kallsyms]  [k] lock_acquire
   - lock_acquire
      - 15.68% _raw_spin_lock
         + 29.42% __schedule
         + 29.14% perf_event_context_sched_out
         + 23.60% tdp_page_fault
         + 10.54% do_anonymous_page
         + 2.07% kvm_mmu_notifier_invalidate_range_start
         + 1.83% zap_pte_range
         + 1.44% kvm_mmu_notifier_invalidate_range_end


apply our work:
+  51.92%  kqemu  [kernel.kallsyms][k] queued_spin_lock_slowpath
+  14.82%  kqemu  [kernel.kallsyms][k] __lock_acquire
+   1.47%  kqemu  [kernel.kallsyms][k] mark_lock.clone.0
+   1.46%  kqemu  [kernel.kallsyms][k] native_sched_clock
+   1.31%  kqemu  [kernel.kallsyms][k] lock_acquire
+   1.24%  kqemu  libc-2.12.so [.] __memset_sse2

-  14.82%  kqemu  [kernel.kallsyms]  [k] __lock_acquire
   - __lock_acquire
      - 99.75% lock_acquire
         - 18.38% _raw_spin_lock
            + 39.62% tdp_page_fault
            + 31.32% __schedule
            + 27.53% perf_event_context_sched_out
            + 0.58% hrtimer_interrupt


You can see the TLB flush and mmu-lock contention have gone after this patch.



It would be nice too if we can split the patch into two (decode,
encode) if you want, but that's optional.


That's good 

Re: [Qemu-devel] [PATCH 1/8] migration: stop compressing page in migration thread

2018-03-22 Thread Xiao Guangrong



On 03/21/2018 04:19 PM, Peter Xu wrote:

On Fri, Mar 16, 2018 at 04:05:14PM +0800, Xiao Guangrong wrote:


Hi David,

Thanks for your review.

On 03/15/2018 06:25 PM, Dr. David Alan Gilbert wrote:


   migration/ram.c | 32 


Hi,
Do you have some performance numbers to show this helps?  Were those
taken on a normal system or were they taken with one of the compression
accelerators (which I think the compression migration was designed for)?


Yes, I have tested it on my desktop, i7-4790 + 16G, by locally live migrating
the VM which has 8 vCPUs + 6G memory, with the max-bandwidth limited to 350.

During the migration, a workload with 8 threads repeatedly wrote a total of
6G memory in the VM. Before this patchset, its bandwidth is ~25 mbps; after
applying, the bandwidth is ~50 mbps.


Hi, Guangrong,

Not really review comments, but I got some questions. :)


Your comments are always valuable to me! :)



IIUC this patch will only change the behavior when last_sent_block
changed.  I see that the performance is doubled after the change,
which is really promising.  However I don't fully understand why it
brings such a big difference considering that IMHO current code is
sending dirty pages per-RAMBlock.  I mean, IMHO last_sent_block should
not change frequently?  Or am I wrong?


It depends on the configuration; each memory region which is RAM or
file backed has a RAMBlock.

Actually, more benefit comes from the fact that the performance & throughput
of the multiple threads have been improved, as the threads are fed by the
migration thread and the result is consumed by the migration
thread.



Another follow-up question would be: have you measured how long time
needed to compress a 4k page, and how many time to send it?  I think
"sending the page" is not really meaningful considering that we just
put a page into the buffer (which should be extremely fast since we
don't really flush it every time), however I would be curious on how
slow would compressing a page be.


I haven't benchmarked the performance of zlib; I think it is a CPU-intensive
workload, particularly as there is no compression accelerator (e.g., QAT) in
our production. BTW, we were using lzo instead of zlib, which worked
better for some workloads.

Putting a page into the buffer should depend on the network, i.e., if the
network is congested it should take a long time. :)




Re: [Qemu-devel] [PATCH 2/8] migration: stop allocating and freeing memory frequently

2018-03-19 Thread Xiao Guangrong



On 03/19/2018 06:54 PM, Dr. David Alan Gilbert wrote:


+return 0;
+exit:
+compress_threads_load_cleanup();


I don't think this is safe; if inflateInit(..) fails in not-the-last
thread, compress_threads_load_cleanup() will try and destroy all the
mutex's and condition variables, even though they've not yet all been
_init'd.



That is exactly why we used 'opaque', please see more below...


However, other than that I think the patch is OK; a chat with Dan
Berrange has convinced me this probably doesn't affect the stream
format, so that's OK.

One thing I would like is a comment as to how the 'opaque' field is
being used; I don't think I quite understand what you're doing there.


The zlib.h file says that:
" The opaque value provided by the application will be passed as the first
parameter for calls of zalloc and zfree.  This can be useful for custom
memory management.  The compression library attaches no meaning to the
opaque value."
So we can use it to store our private data.

Here, we use it as an indicator of whether the thread is properly init'd
or not. If inflateInit() is successful we set it to non-NULL, otherwise it
stays NULL, so that the cleanup path can figure out the first thread that
failed inflateInit().
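
A minimal standalone sketch of that trick, reduced to plain zlib (not the
QEMU code; in the real patch the per-thread mutexes and conds are torn down
in the same cleanup loop):

/* build: gcc -O2 opaque_sketch.c -lz */
#include <zlib.h>

#define NSTREAMS 8

static z_stream streams[NSTREAMS];      /* zero-initialized, so opaque == NULL */

static int streams_setup(void)
{
    for (int i = 0; i < NSTREAMS; i++) {
        if (inflateInit(&streams[i]) != Z_OK) {
            return -1;                  /* streams[i].opaque is still NULL */
        }
        streams[i].opaque = &streams[i];    /* mark this one as properly init'd */
    }
    return 0;
}

static void streams_cleanup(void)
{
    for (int i = 0; i < NSTREAMS; i++) {
        if (!streams[i].opaque) {
            break;                      /* first stream that never finished init */
        }
        inflateEnd(&streams[i]);
    }
}

int main(void)
{
    if (streams_setup() < 0) {
        streams_cleanup();              /* only touches the init'd streams */
        return 1;
    }
    streams_cleanup();
    return 0;
}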


OK, so I think you just need to add a comment to explain that. Put it
above the 'if (!decomp_param[i].stream.opaque) {' in
compress_threads_load_cleanup  so it'll be easy to understand.


Yes, indeed, i will do.

Thanks!




Re: [Qemu-devel] [PATCH 3/8] migration: support to detect compression and decompression errors

2018-03-19 Thread Xiao Guangrong



On 03/19/2018 03:56 PM, jiang.bi...@zte.com.cn wrote:

Hi, guangrong

@@ -1051,11 +1052,13 @@ static int do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
{
RAMState *rs = ram_state;
int bytes_sent, blen;
-uint8_t *p = block->host + (offset & TARGET_PAGE_MASK);
+uint8_t buf[TARGET_PAGE_SIZE], *p;



+p = block->host + (offset & TARGET_PAGE_MASK);
bytes_sent = save_page_header(rs, f, block, offset |
RAM_SAVE_FLAG_COMPRESS_PAGE);
-blen = qemu_put_compression_data(f, stream, p, TARGET_PAGE_SIZE);
+memcpy(buf, p, TARGET_PAGE_SIZE);
+blen = qemu_put_compression_data(f, stream, buf, TARGET_PAGE_SIZE);

A memory copy operation for every page to be compressed is not cheap,
especially when the number of pages is huge, and it may not be necessary for
pages that are never updated during migration.


This is only for a 4k page.


Is there any possibility that we can distinguish real compress/decompress
errors from those caused by the source VM updating the page? For example, via
the return value of qemu_uncompress() (distinguishing Z_DATA_ERROR from the
other error codes returned by inflate())?


Unfortunately, no. :(



Re: [Qemu-devel] [PATCH 2/8] migration: stop allocating and freeing memory frequently

2018-03-18 Thread Xiao Guangrong



On 03/19/2018 09:49 AM, jiang.bi...@zte.com.cn wrote:

Hi, guangrong


+/* return the size after compression, or negative value on error */
+static int qemu_compress_data(z_stream *stream, uint8_t *dest, size_t dest_len,
+  const uint8_t *source, size_t source_len)
+{
+int err;
+
+err = deflateReset(stream);
+if (err != Z_OK) {
+return -1;
+}
+
+stream->avail_in = source_len;
+stream->next_in = (uint8_t *)source;
+stream->avail_out = dest_len;
+stream->next_out = dest;
+

This duplicates code with qemu_uncompress(); would initializing the stream
outside of qemu_compress_data() be better? In that case, we could pass far
fewer parameters down and avoid the duplicated code. Or could we encapsulate
some struct to ease the case?


There are multiple places that do compression/uncompression in QEMU; I am
going to introduce common functions to clean up these places. That can be
another patchset later...


+err = deflate(stream, Z_FINISH);
+if (err != Z_STREAM_END) {
+return -1;
+}
+
+return stream->next_out - dest;
+}
+

@@ -683,8 +707,10 @@ ssize_t qemu_put_compression_data(QEMUFile *f, const uint8_t *p, size_t size,
return -1;
}
}
-if (compress2(f->buf + f->buf_index + sizeof(int32_t), (uLongf *)&blen,
-  (Bytef *)p, size, level) != Z_OK) {
+
+blen = qemu_compress_data(stream, f->buf + f->buf_index + sizeof(int32_t),
+  blen, p, size);

The "level" parameter is never used after the patch, could we just removed it?
On the other hand, deflate() of zlib supports compression level too(by
deflateInit(stream, level)), should we just reuse the level properly?  If not, 
the
*migrate parameter compress_level* will be useless.


The 'level' has been pushed to @stream:
+if (deflateInit(&comp_param[i].stream,
+   migrate_compress_level()) != Z_OK) {
+goto exit;
+}
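
That works because of a zlib property worth spelling out: the level is bound
to the stream when it is created and survives every deflateReset(), so the
per-page path needs no level argument at all. A tiny sketch of the pattern
(not QEMU code; the helper names here are hypothetical):

/* build: gcc -O2 -c level_sketch.c */
#include <zlib.h>

static z_stream s;                  /* zero-initialized: zalloc/zfree/opaque are Z_NULL */

int stream_setup(int level)         /* level would come from migrate_compress_level() */
{
    return deflateInit(&s, level) == Z_OK ? 0 : -1;
}

int stream_recycle(void)
{
    /* equivalent to deflateEnd() + deflateInit() with the same level,
     * but without freeing and reallocating the internal state */
    return deflateReset(&s) == Z_OK ? 0 : -1;
}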


+if (blen < 0) {
error_report("Compress Failed!");
return 0;
}

+/* return the size after decompression, or negative value on error */
+static int qemu_uncompress(z_stream *stream, uint8_t *dest, size_t dest_len,
+   uint8_t *source, size_t source_len)

The name of *qemu_uncompress* does not quite match *qemu_compress_data*,
would *qemu_uncompress_data* be better?


That works for me; I will rename it.


Besides, the prototype is not consistent with *qemu_compress_data* either;
should the *source* parameter be *const* here as well?


Okay.

Thanks!
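
For reference, a rough sketch of what the renamed helper with a const source
could look like, mirroring the qemu_compress_data() quoted above; this is
only an illustration of the agreed rename and prototype, not the final patch:

/* return the size after decompression, or negative value on error */
static int qemu_uncompress_data(z_stream *stream, uint8_t *dest,
                                size_t dest_len, const uint8_t *source,
                                size_t source_len)
{
    int err;

    err = inflateReset(stream);
    if (err != Z_OK) {
        return -1;
    }

    stream->avail_in = source_len;
    stream->next_in = (uint8_t *)source;    /* zlib's next_in is not const-qualified */
    stream->avail_out = dest_len;
    stream->next_out = dest;

    err = inflate(stream, Z_NO_FLUSH);
    if (err != Z_STREAM_END) {
        return -1;
    }

    return stream->next_out - dest;
}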


