Re: [PATCH] hw/mem/nvdimm: fix error message for 'unarmed' flag
On Wed, Jun 15, 2022 at 7:49 PM David Hildenbrand wrote: > > On 15.06.22 13:17, Xiao Guangrong wrote: > > On Wed, Jun 15, 2022 at 4:24 PM David Hildenbrand wrote: > > > >>>> Is that a temporary or a permanent thing? Do we know? > >>> > >>> No idea. But his last signed-off was three years ago. > >> > >> I sent a patch to Xiao, asking if he's still active in QEMU. If I don't > > s/patch/mail/ :) > > >> get a reply this week, I'll move forward with proposing an update to > >> MAINTAINERS as described. > >> > > > > Okay, please do it. > > > > Sorry, I usually only roughly follow the qemu & kvm mailing lists > > and do not get enough time to actively review or contribute in these > > areas. :-( > > Not an issue, thanks for that information and thanks for your work in > the past on that! > > Should I keep you entered as a reviewer for the new section? Okay, that is good for me! :)
Re: [PATCH] hw/mem/nvdimm: fix error message for 'unarmed' flag
On Wed, Jun 15, 2022 at 4:24 PM David Hildenbrand wrote: > >> Is that a temporary or a permanent thing? Do we know? > > > > No idea. But his last signed-off was three years ago. > > I sent a patch to Xiao, asking if he's still active in QEMU. If I don't > get a reply this week, I'll move forward with proposing an update to > MAINTAINERS as described. > Okay, please do it. Sorry, I usually only roughly follow the qemu & kvm mailing lists and do not get enough time to actively review or contribute in these areas. :-(
Re: [Qemu-devel] About making QEMU to LIBs!
On 3/27/19 5:41 PM, Stefano Garzarella wrote: Hi Yang, Xiao, Just adding a few things, because I'm currently exploring the QEMU modules in order to reduce the boot time and the footprint. Hi Stefan Hajnoczi and Stefano Garzarella, The work exploring the QEMU modules looks really good. We will evaluate it. Hmm... how about making these modules usable outside of QEMU? It seems doable, right? But I'm not sure how much effort we would need. The idea is to create new QEMU modules (like the block/audio/ui drivers) in order to reduce the dependency of the QEMU executable on the shared libraries that require a lot of time for initialization, so we can postpone their initialization until we actually use them. Using callgrind I'm profiling the loading of dynamic libraries to understand which parts of QEMU should be modularized to speed up the boot phase when they are not used. I found that libgnutls, libspice-server and libnuma initialization takes most of the time. I disabled these libraries (--disable-numa --disable-gnutls --disable-spice) to understand the maximum speed up and I gained 10 ms during the boot: - qemu_init_end: 50.92 ms - qemu_init_end: 40.17 ms (without numa, gnutls and spice) I would start to modularize spice, the library is used in the spice drivers (spice-display, spice-chardev, spice-input, etc.) but also in the QXL emulation. Maybe we can synchronize our work :) Great work, Stefano! Does the benefit come from the fact that your work avoids loading some libraries during exec()? So does it depend on the size of the shared library?
Re: [Qemu-devel] About making QEMU to LIBs!
On 3/26/19 5:07 PM, Paolo Bonzini wrote: On 26/03/19 08:00, Xiao Guangrong wrote: On 3/26/19 7:18 AM, Paolo Bonzini wrote: Also, what is the use case? Is it to reduce the attack surface without having multiple QEMU binaries? Security is one concern: only the essential and audited modules/libraries can be loaded into memory. Another story is that lightweight virt. becomes more and more important for running VM-based workloads in the public cloud. However, QEMU is too huge and cumbersome to meet our requirements; e.g., QEMU-lite has been in development for a long time but still has no path into mainline, and Intel's NEMU is even more radical. What is your definition of lightweight? To some extent, moving devices to separate dynamic libraries _adds_ weight for the mechanisms to load those libraries dynamically. Lightweight = less resource usage + fast to reach the user's self-defined workload. For the new software, these libraries can be statically linked into it. Would separate QEMU binaries be a solution? I think I am not as opposed Separate QEMU binaries mean we need multiple different build environments and additional package maintenance in production... to a "q35-lite" machine type these days, I find it preferable to share the northbridge and southbridge with Q35 and just get rid of IDE, VGA, IOAPIC, legacy ISA devices etc. The chipset would stay the same as q35 so that we keep secure boot, share the code for ACPI stuff (hotplug), and if the OS needs it we can also add back the RTC. Well, that is a big step forward. :) I'd think statically compiling the most used components, such as virtio devices, is the minimal requirement for the lightweight scenario. Other components can be libraries dlopen()ed at runtime; that slows down booting traditional VMs a little indeed, but that is acceptable...
Re: [Qemu-devel] About making QEMU to LIBs!
On 3/26/19 7:18 AM, Paolo Bonzini wrote: On 25/03/19 12:46, Yang Zhong wrote: Hello all, Rust-VMM has started to make all features and common modules into crates, and CSPs can deploy their VMM on demand. This afternoon, Xiao Guangrong and I talked about the lightweight VM solutions, and we thought QEMU can do the same thing as Rust-VMM; although we can make all modules or features in QEMU configurable, making features and modules into libs is still valuable for QEMU. Any comments are welcome! thanks! What features/modules were you thinking of? Were you thinking of turning them into dynamically loaded libraries, or spinning them out into separate projects (such as libslirp)? We are considering making QEMU's device emulations into dynamic libraries, including virtio devices, IO controllers and even vCPU emulation plus some hooks into it, and so on. Also, what is the use case? Is it to reduce the attack surface without having multiple QEMU binaries? Security is one concern: only the essential and audited modules/libraries can be loaded into memory. Another story is that lightweight virt. becomes more and more important for running VM-based workloads in the public cloud. However, QEMU is too huge and cumbersome to meet our requirements; e.g., QEMU-lite has been in development for a long time but still has no path into mainline, and Intel's NEMU is even more radical. That's why we are going to redesign the userspace VMM; making QEMU into libraries is definitely good for us. Having dynamically linked libraries would go even beyond what rust-vmm does; Rust binaries in the end statically link every crate that they use, rust-vmm's purpose is mostly to share code across VMMs and avoid forks. It may also have a startup time penalty which has to be considered. Rust-VMM is good indeed, but it's not friendly enough for us. QEMU IS A GOOD GUY, no reason to abandon it. :) QEMU is excellent and we have been using it in production for many years; its code is trusted to be robust.
Reusing QEMU's code helps the new software achieve production-level quality. Furthermore, we still need QEMU for the traditional workloads; maintaining two totally different code bases is not easy for a cloud provider, and most developers in cloud companies are really good at QEMU. So leveraging QEMU in the new software is more convenient for us. Thanks!
Re: [Qemu-devel] [PATCH v2 0/3] PCDIMM cleanup
On 2/20/19 8:51 AM, Wei Yang wrote: Three trivial cleanups for pc-dimm. Patch [1] removes the check on class->hotpluggable since pc-dimm is always hotpluggable. Patch [2] removes nvdimm_realize. Patch [3] removes the pc-dimm realize callback. v2: * fix warning in Patch 1 * split Patch 2 into two Wei Yang (3): pc-dimm: remove check on pc-dimm hotpluggable mem/nvdimm: remove nvdimm_realize pc-dimm: revert "introduce realize callback" I think the word 'revert' is not precise, as it hints the commit was buggy; instead, it was refactored by later commits and has become useless now. Anyway, this patchset looks good to me. Reviewed-by: Xiao Guangrong
Re: [Qemu-devel] [PATCH v2 3/3] migration: introduce adaptive model for waiting thread
On 1/11/19 5:57 PM, Markus Armbruster wrote: guangrong.x...@gmail.com writes: From: Xiao Guangrong Currently we have two behaviors if all threads are busy doing compression: the main thread must wait for one of them to become free if @compress-wait-thread is set to on, or the main thread can directly return without waiting and post the page out as a normal one. Both of them have their benefits and shortcomings; however, if the bandwidth limit is not extreme, compression cannot use up all of the bandwidth, and at the same time the migration process is easily throttled if we post too many pages as normal pages. Neither of them works properly in this case. In order to use the bandwidth more effectively, we introduce the third behavior, adaptive, which makes the main thread wait if there is no bandwidth left, or lets the page go out as a normal page if there is enough bandwidth, to make sure the migration process will not be throttled. Signed-off-by: Xiao Guangrong --- hmp.c | 6 ++- migration/migration.c | 116 -- migration/migration.h | 13 ++ migration/ram.c | 116 +- qapi/migration.json | 39 +++-- You neglected to cc: the QAPI schema maintainers. scripts/get_maintainer.pl can help you find the maintainers to cc: on your patches. Thank you for pointing it out, I will pay more attention to it. 5 files changed, 251 insertions(+), 39 deletions(-) [...] diff --git a/qapi/migration.json b/qapi/migration.json index c5babd03b0..0220a0945b 100644 --- a/qapi/migration.json +++ b/qapi/migration.json @@ -93,11 +93,16 @@ # # @compression-rate: rate of compressed size # +# @compress-no-wait-weight: it controls how many pages are directly posted +# out as normal page when all compression threads are currently busy. +# Only available if compress-wait-thread = adaptive. (Since 3.2) "Only available" suggests the member is optional. 
+# # Since: 3.1 ## { 'struct': 'CompressionStats', 'data': {'pages': 'int', 'busy': 'int', 'busy-rate': 'number', - 'compressed-size': 'int', 'compression-rate': 'number' } } + 'compressed-size': 'int', 'compression-rate': 'number', + 'compress-no-wait-weight': 'int'} } It isn't. Should it be optional? If not, what's its value when compress-wait-thread isn't adaptive? It'd be better to make it optional... i will fix it. :) ## # @MigrationStatus: @@ -489,9 +494,13 @@ # the compression thread count is an integer between 1 and 255. # # @compress-wait-thread: Controls behavior when all compression threads are -#currently busy. If true (default), wait for a free -#compression thread to become available; otherwise, -#send the page uncompressed. (Since 3.1) +# currently busy. If 'true/on' (default), wait for a free +# compression thread to become available; if 'false/off', send the +# page uncompressed. (Since 3.1) +# If it is 'adaptive', the behavior is adaptively controlled based on +# the rate limit. If it has enough bandwidth, it acts as +# compress-wait-thread is off. (Since 3.2) +# # # @decompress-threads: Set decompression thread count to be used in live # migration, the decompression thread count is an integer between 1 @@ -577,9 +586,12 @@ # @compress-threads: compression thread count # # @compress-wait-thread: Controls behavior when all compression threads are -#currently busy. If true (default), wait for a free -#compression thread to become available; otherwise, -#send the page uncompressed. (Since 3.1) +# currently busy. If 'true/on' (default), wait for a free +# compression thread to become available; if 'false/off', send the +# page uncompressed. (Since 3.1) +# If it is 'adaptive', the behavior is adaptively controlled based on +# the rate limit. If it has enough bandwidth, it acts as +# compress-wait-thread is off. 
(Since 3.2) # # @decompress-threads: decompression thread count # @@ -655,7 +667,7 @@ { 'struct': 'MigrateSetParameters', 'data': { '*compress-level': 'int', '*compress-threads': 'int', -'*compress-wait-thread': 'bool', +'*compress-wait-thread': 'str', Compatibility break. You can add a separate flag like you did in v1 if I understand your cover letter correctly. Awkward. You can use a suitable alternate of bool and enum. ‘alternate’ seems a good solution to me, will fix. :) 'str' is not a good idea, because it defeats introspection. will fix. '*decompress-threads': 'int', '*cpu-throttle-initial': 'int', '*cpu-throttle-increment': 'int', @@ -697
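The alternate of bool and enum that Markus suggests could look roughly like the sketch below. All type and member names here are hypothetical illustrations of the idea, not what any actual patch added to qapi/migration.json:

```
##
# Hypothetical sketch: accept either a bool (back-compat) or an enum value.
##
{ 'enum': 'CompressWaitThreadMode',
  'data': [ 'on', 'off', 'adaptive' ] }

{ 'alternate': 'CompressWaitThread',
  'data': { 'enabled': 'bool',
            'mode': 'CompressWaitThreadMode' } }

# then, in MigrateSetParameters:
#   '*compress-wait-thread': 'CompressWaitThread',
```

With an alternate, existing callers passing true/false keep working, while new callers can pass the enum value 'adaptive'; unlike a bare 'str', the accepted values remain visible to introspection.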
Re: [Qemu-devel] [PATCH v2 3/3] migration: introduce adaptive model for waiting thread
On 1/16/19 2:40 PM, Peter Xu wrote: On Fri, Jan 11, 2019 at 02:37:32PM +0800, guangrong.x...@gmail.com wrote: + +static void update_compress_wait_thread(MigrationState *s) +{ +s->compress_wait_thread = get_compress_wait_thread(&s->parameters); +assert(s->compress_wait_thread != COMPRESS_WAIT_THREAD_ERR); +} We can probably deprecate these chunks of code if you're going to use alternative structs or enum as suggested by Markus... Yes, indeed. I think Libvirt is not using this parameter, right? And I believe the parameter "compress-wait-thread" was just introduced in QEMU 3.1. I'm not sure whether we can directly change it to an enum, assuming that no one is really using it in production yet, so it could possibly break nobody. I did a check in libvirt's code: $ git grep compress-wait-thread tests/qemucapabilitiesdata/caps_3.1.0.ppc64.replies: "name": "compress-wait-thread", tests/qemucapabilitiesdata/caps_3.1.0.ppc64.replies: "name": "compress-wait-thread", tests/qemucapabilitiesdata/caps_3.1.0.x86_64.replies: "name": "compress-wait-thread", tests/qemucapabilitiesdata/caps_3.1.0.x86_64.replies: "name": "compress-wait-thread", tests/qemucapabilitiesdata/caps_4.0.0.x86_64.replies: "name": "compress-wait-thread", tests/qemucapabilitiesdata/caps_4.0.0.x86_64.replies: "name": "compress-wait-thread", It seems changing this parameter will break libvirt's self-test? Maybe we still have a chance to quickly switch back to the name "x-compress-wait-thread", just like the -global interface; then we don't need to worry much about QAPI breakage until the parameter proves itself enough to remove the "x-". Er... I am not sure I follow. :( [...] 
@@ -1917,40 +2000,40 @@ bool migrate_postcopy_blocktime(void) return s->enabled_capabilities[MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME]; } -bool migrate_use_compression(void) +int64_t migrate_max_bandwidth(void) { MigrationState *s; s = migrate_get_current(); -return s->enabled_capabilities[MIGRATION_CAPABILITY_COMPRESS]; +return s->parameters.max_bandwidth; } -int migrate_compress_level(void) +bool migrate_use_compression(void) { MigrationState *s; s = migrate_get_current(); -return s->parameters.compress_level; +return s->enabled_capabilities[MIGRATION_CAPABILITY_COMPRESS]; } -int migrate_compress_threads(void) +int migrate_compress_level(void) { MigrationState *s; s = migrate_get_current(); -return s->parameters.compress_threads; +return s->parameters.compress_level; } -int migrate_compress_wait_thread(void) +int migrate_compress_threads(void) { MigrationState *s; s = migrate_get_current(); -return s->parameters.compress_wait_thread; +return s->parameters.compress_threads; I'm a bit confused on these diff... are you only adding migrate_max_bandwidth() and not changing anything else? I guess so. I'm curious on how these chunk is generated since it looks really weird... Looks weird for me too. :( [...] /* State of RAM for migration */ struct RAMState { /* QEMUFile used for this migration */ @@ -292,6 +294,19 @@ struct RAMState { bool ram_bulk_stage; /* How many times we have dirty too many pages */ int dirty_rate_high_cnt; + +/* used by by compress-wait-thread-adaptive */ compress-wait-thread-adaptive is gone? It's a typo, will fix. + +max_bw = (max_bw >> 20) * 8; +remain_bw = abs(max_bw - (int64_t)(mbps)); +if (remain_bw <= (max_bw / 20)) { +/* if we have used all the bandwidth, let's compress more. 
*/ +if (compression_counters.compress_no_wait_weight) { +compression_counters.compress_no_wait_weight--; +} +goto exit; +} + +/* have enough bandwidth left, do not need to compress so aggressively */ +if (compression_counters.compress_no_wait_weight != +COMPRESS_BUSY_COUNT_PERIOD) { +compression_counters.compress_no_wait_weight++; +} + +exit: +ram_state->compress_busy_count = 0; +ram_state->compress_no_wait_left = +compression_counters.compress_no_wait_weight; The "goto" and the chunk seems awkward to me... How about this? if (not_enough_remain_bw && weight) weight--; else if (weight <= MAX) weight++; (do the rest...) Okay, will address your style. Also, would you like to add some documentation to these compression features into docs/devel/migration.rst? It'll be good, but it's your call. :) It's useful indeed. I will do it. static void migration_update_rates(RAMState *rs, int64_t end_time) { uint64_t page_count = rs->target_page_count - rs->target_page_count_prev; @@ -1975,7 +2057,11 @@ static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block, ram_addr_t offset)
Re: [Qemu-devel] [PATCH v2 2/3] migration: fix memory leak when updating tls-creds and tls-hostname
On 1/16/19 12:03 AM, Eric Blake wrote: On 1/15/19 4:24 AM, Dr. David Alan Gilbert wrote: I think the problem is that migrate_params_check checks a MigrationParameters while the QMP command gives us a MigrateSetParameters; but we also use migrate_params_check for the global check you added (8b0b29dc) which is against MigrationParameters; so that's why migrate_params_check takes a MigrationParameters. It's horrible we've got stuff duped so much. Indeed. However, I don't like this fix because if someone later was to add a test for tls parameters to migrate_params_check, then they would be confused why the hostname/creds weren't checked. So while we have migrate_params_test_apply, it should cover all parameters. I think a cleaner check would be to write a MigrateParameters_free that free'd any strings, and call that in qmp_migrate_set_parameters on both exit paths. We already have it; it's named qapi_free_MigrationParameters(), generated in qapi-types-migration.h. Thank you all (and sorry for the delayed reply due to Chinese New Year :)), I will use this interface instead in the next version.
Re: [Qemu-devel] [PATCH 0/2] optimize waiting for free thread to do compression(Internet mail)
On 12/21/18 4:11 PM, Peter Xu wrote: > On Thu, Dec 13, 2018 at 03:57:25PM +0800, guangrong.x...@gmail.com wrote: >> From: Xiao Guangrong >> >> Currently we have two behaviors if all threads are busy doing compression: >> the main thread must wait for one of them to become free if @compress-wait-thread >> is set to on, or the main thread can directly return without waiting and post >> the page out as a normal one >> >> Both of them have their benefits and shortcomings; however, if the bandwidth >> limit is not extreme, compression cannot use up all of the >> bandwidth, >> and at the same time the migration process is easily throttled if we post too >> many pages as normal pages. Neither of them works properly in this case >> >> In order to use the bandwidth more effectively, we introduce the third >> behavior, compress-wait-thread-adaptive, which makes the main thread wait >> if there is no bandwidth left, or lets the page go out as a normal page if there >> is enough bandwidth, to make sure the migration process will not be >> throttled >> >> Another patch introduces a new statistic, pages-per-second, as bandwidth >> or mbps is not enough to measure the performance of posting pages out, as >> we have compression and xbzrle, which can significantly reduce the >> data size; instead, pages-per-second is the one we want >> >> Performance data >> >> We have limited the bandwidth to 300 >> >> Used Bandwidth Pages-per-Second >> compress-wait-thread = on 951.66 mbps 131784 >> compress-wait-thread = off 2491.74 mbps 93495 >> compress-wait-thread-adaptive = on 1982.94 mbps 162529 > > Sounds reasonable to me, though it looks like there're only three > options: on, off, adaptive. Should we squash the two parameters? Thanks for your review, Peter! That's good to me; if other folks do not have objections, I will do it in the next version.
Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue
On 12/5/18 1:16 AM, Paolo Bonzini wrote: On 04/12/18 16:49, Christophe de Dinechin wrote: Linux and QEMU's own qht work just fine with compile-time directives. Wouldn’t it work fine without any compile-time directive at all? Yes, that's what I meant. Though there are certainly cases in which the difference without proper cacheline alignment is an order of magnitude less throughput or something like that; it would certainly be noticeable. I don't think lock-free lists are easier. Bitmaps smaller than 64 elements are both faster and easier to manage. I believe that this is only true if you use a linked list for both freelist management and for thread notification (i.e. to replace the bitmaps). However, if you use an atomic list only for the free list, and keep bitmaps for signaling, then performance is at least equal, often better. Plus you get the added benefit of having a thread-safe API, i.e. something that is truly lock-free. I did a small experiment to test / prove this. Last commit on branch: https://github.com/c3d/recorder/commits/181122-xiao_guangdong_introduce-threaded-workqueue Take with a grain of salt, microbenchmarks are always suspect ;-) The code in “thread_test.c” includes Xiao’s code with two variations, plus some testing code lifted from the flight recorder library. 1. The FREE_LIST variation (sl_test) is what I would like to propose. 2. The BITMAP variation (bm_test) is the baseline 3. The DOUBLE_LIST variation (ll_test) is the slow double-list approach To run it, you need to do “make opt-test”, then run “test_script” which outputs a CSV file. The summary of my findings testing on a ThinkPad, a Xeon machine and a MacBook is here: https://imgur.com/a/4HmbB9K Overall, the proposed approach: - makes the API thread safe and lock free, addressing the one drawback that Xiao was mentioning. - delivers up to 30% more requests on the Macbook, while being “within noise” (sometimes marginally better) for the other two. 
I suspect an optimization opportunity found by clang, because the Macbook delivers really high numbers. - spends less time blocking when all threads are busy, which accounts for the higher number of client loops. If you think that makes sense, then either Xiao can adapt the code from the branch above, or I can send a follow-up patch. Having a follow-up patch would be best I think. Thanks for experimenting with this, it's always fun stuff. :) Yup, Christophe, please post the follow-up patches and add yourself to the author list if you like. I am looking forward to it. :) Thanks!
Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue
On 11/27/18 8:49 PM, Christophe de Dinechin wrote: (I did not finish the review, but decided to send what I already had). On 22 Nov 2018, at 08:20, guangrong.x...@gmail.com wrote: From: Xiao Guangrong This modules implements the lockless and efficient threaded workqueue. I’m not entirely convinced that it’s either “lockless” or “efficient” in the current iteration. I believe that it’s relatively easy to fix, though. I think Emilio has already replied to your concern why it is "lockless". :) Three abstracted objects are used in this module: - Request. It not only contains the data that the workqueue fetches out to finish the request but also offers the space to save the result after the workqueue handles the request. It's flowed between user and workqueue. The user fills the request data into it when it is owned by user. After it is submitted to the workqueue, the workqueue fetched data out and save the result into it after the request is handled. fetched -> fetches save -> saves Will fix... My English is even worse than C. :( + +/* + * find a free request where the user can store the data that is needed to + * finish the request + * + * If all requests are used up, return NULL + */ +void *threaded_workqueue_get_request(Threads *threads); Using void * to represent the payload makes it easy to get the wrong pointer in there without the compiler noticing. Consider adding a type for the payload? Another option could be to export the ThreadRequest to the user and put it at the very beginning of the user-defined data struct. However, that would expose internal design details to the user; I am not sure it is a good idea... + * + * Author: + * Xiao Guangrong + * + * Copyright(C) 2018 Tencent Corporation. + * + * This work is licensed under the terms of the GNU LGPL, version 2.1 or later. + * See the COPYING.LIB file in the top-level directory. 
+ */ + +#include "qemu/osdep.h" +#include "qemu/bitmap.h" +#include "qemu/threaded-workqueue.h" + +#define SMP_CACHE_BYTES 64 +1 on comments already made by others Will improve it. + +/* + * the request representation which contains the internally used mete data, mete -> meta Will fix. + * it is the header of user-defined data. + * + * It should be aligned to the nature size of CPU. + */ +struct ThreadRequest { +/* + * the request has been handled by the thread and need the user + * to fetch result out. + */ +uint8_t done; + +/* + * the index to Thread::requests. + * Save it to the padding space although it can be calculated at runtime. + */ +uint8_t request_index; So no more than 256? This is blocked by MAX_THREAD_REQUEST_NR test at the beginning of threaded_workqueue_create, but I would make it more explicit either with a compile-time assert that MAX_THREAD_REQUEST_NR is below UINT8_MAX, or by adding a second test for UINT8_MAX in threaded_workqueue_create. It's good to me. I prefer the former one, a "compile-time assert that MAX_THREAD_REQUEST_NR is below UINT8_MAX". Also, an obvious extension would be to make bitmaps into arrays. Do you think someone would want to use the package to assign requests per CPU or per VCPU? If so, that could quickly go above 64. Well... it specifies the depth of each single thread; a larger depth has a negative effect, as it makes threaded_workqueue_wait_for_requests() too slow: at that point, the user needs to wait for all the threads to exhaust all their requests. Another impact is that u64 is more efficient than bitmaps; we can see it from the performance data: https://ibb.co/hq7u5V Based on those, I think 64 should be enough, at least for the present user, the migration thread. + +/* the index to Threads::per_thread_data */ +unsigned int thread_index; Don’t you want to use a size_t for that? size_t is 8 bytes... I'd like to keep the request header tiny... 
+} QEMU_ALIGNED(sizeof(unsigned long)); Nit: the alignment type is inconsistent with that given to QEMU_BUILD_BUG_ON in threaded_workqueue_create. (long vs. unsigned long). Yup, will make them consistent. Also, why is the alignment required? Aren’t you more interested in cache-line alignment? ThreadRequest actually is the header put at the very beginning of the request. If it is not aligned to "long", the user-defined data struct could be accessed without proper alignment. +typedef struct ThreadRequest ThreadRequest; + +struct ThreadLocal { +struct Threads *threads; + +/* the index of the thread */ +int self; Why signed? Mistake, will fix. + +/* thread is useless and needs to exit */ +bool quit; + +QemuThread thread; + +void *requests; + + /* + * the bit in these two bitmaps indicates the index of the @requests + * respectively. If it's the same, the corresponding request is fr
Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue
On 11/26/18 6:28 PM, Paolo Bonzini wrote: On 26/11/18 09:18, Xiao Guangrong wrote: +static uint64_t get_free_request_bitmap(Threads *threads, ThreadLocal *thread) +{ + uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap; + + request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap); + request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap); + bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap, + threads->thread_requests_nr); This is not wrong, but it's a bit ugly. Instead, I would: - Introduce bitmap_xor_atomic in a previous patch - Use bitmap_xor_atomic here, getting rid of the rcu reads Hmm, however, we do not need an atomic xor operation here... that should be slower than just two READ_ONCE calls. Yeah, I'd just go with Guangrong's version. Alternatively, add find_{first,next}_{same,different}_bit functions (whatever subset of the 4 you need). That's good to me. will try it. ;)
Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue
On 11/27/18 2:55 AM, Emilio G. Cota wrote: On Mon, Nov 26, 2018 at 15:57:25 +0800, Xiao Guangrong wrote: On 11/23/18 7:02 PM, Dr. David Alan Gilbert wrote: +#include "qemu/osdep.h" +#include "qemu/bitmap.h" +#include "qemu/threaded-workqueue.h" + +#define SMP_CACHE_BYTES 64 That's architecture dependent isn't it? Yes, it's arch dependent indeed. I just used 64 for simplification and i think it is <= 64 on most CPU arch-es so that can work. Should i introduce statically defined CACHE LINE SIZE for all arch-es? :( No, at compile-time this is impossible to know. We do query this info at run-time though (see util/cacheinfo.c), but using that info here would complicate things too much. I see. You can just give it a different name, and perhaps add a comment. See for instance what we do in qht.c with QHT_BUCKET_ALIGN. That's really a good lesson to me, will follow it. :)
Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue
On 11/27/18 2:49 AM, Emilio G. Cota wrote: On Mon, Nov 26, 2018 at 16:06:37 +0800, Xiao Guangrong wrote: +/* after the user fills the request, the bit is flipped. */ +uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES); +/* after handles the request, the thread flips the bit. */ +uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES); Use DECLARE_BITMAP, otherwise you'll get type errors as David pointed out. If we do it, the field becomes a pointer... that complicates the thing. Not necessarily, see below. On Mon, Nov 26, 2018 at 16:18:24 +0800, Xiao Guangrong wrote: On 11/24/18 8:17 AM, Emilio G. Cota wrote: On Thu, Nov 22, 2018 at 15:20:25 +0800, guangrong.x...@gmail.com wrote: +static uint64_t get_free_request_bitmap(Threads *threads, ThreadLocal *thread) +{ +uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap; + +request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap); +request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap); +bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap, + threads->thread_requests_nr); This is not wrong, but it's a bit ugly. Instead, I would: - Introduce bitmap_xor_atomic in a previous patch - Use bitmap_xor_atomic here, getting rid of the rcu reads Hmm, however, we do not need an atomic xor operation here... that should be slower than just two READ_ONCE calls. If you use DECLARE_BITMAP, you get an in-place array. On a 64-bit host, that'd be unsigned long foo[1]; /* [2] on 32-bit */ Then again on 64-bit hosts, bitmap_xor_atomic would reduce to 2 atomic reads: static inline void bitmap_xor_atomic(unsigned long *dst, const unsigned long *src1, const unsigned long *src2, long nbits) { if (small_nbits(nbits)) { *dst = atomic_read(src1) ^ atomic_read(src2); } else { slow_bitmap_xor_atomic(dst, src1, src2, nbits); We needn't do an in-place xor operation, i.e., we just fetch the bitmaps into local variables and do the xor locally. So we need additional complexity to handle the case where !small_nbits(nbits) ... 
but it is really not a big deal as you said, it is just a couple of lines of code. However, using u64 to express that only 64 indexes are allowed is more straightforward and naturally understood. :)
Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue
On 11/26/18 6:56 PM, Dr. David Alan Gilbert wrote: * Xiao Guangrong (guangrong.x...@gmail.com) wrote: On 11/23/18 7:02 PM, Dr. David Alan Gilbert wrote: +#include "qemu/osdep.h" +#include "qemu/bitmap.h" +#include "qemu/threaded-workqueue.h" + +#define SMP_CACHE_BYTES 64 That's architecture dependent isn't it? Yes, it's arch dependent indeed. I just used 64 for simplification and i think it is <= 64 on most CPU arch-es so that can work. Should i introduce statically defined CACHE LINE SIZE for all arch-es? :( I think it depends why you need it; but we shouldn't have a constant that is wrong, and we shouldn't define something architecture dependent in here. I see. I will address Emilio's suggestion to rename SMP_CACHE_BYTES to THREAD_QUEUE_ALIGN and add comments. + /* + * the bit in these two bitmaps indicates the index of the @requests + * respectively. If it's the same, the corresponding request is free + * and owned by the user, i.e, where the user fills a request. Otherwise, + * it is valid and owned by the thread, i.e, where the thread fetches + * the request and write the result. + */ + +/* after the user fills the request, the bit is flipped. */ +uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES); +/* after handles the request, the thread flips the bit. */ +uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES); Patchew complained about some type mismatches; I think those are because you're using the bitmap_* functions on these; those functions always operate on 'long' not on uint64_t - and on some platforms they're unfortunately not the same. I guess you were talking about this error: ERROR: externs should be avoided in .c files #233: FILE: util/threaded-workqueue.c:65: +uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES); The complaint is about "QEMU_ALIGNED(SMP_CACHE_BYTES)", as it goes away when the alignment annotation is removed... 
The issue you pointed out can be avoided by using type-casting, like: bitmap_xor(..., (void *)&thread->request_fill_bitmap) cannot we? I thought the error was just due to long vs uint64_t rather than the qemu_aligned. I don't think it's just a casting problem, since I don't think the longs are always 64bit. Well, I made some adjustments that make check_patch.sh really happy :), as follows: $ git diff util/ diff --git a/util/threaded-workqueue.c b/util/threaded-workqueue.c index 2ab37cee8d..e34c65a8eb 100644 --- a/util/threaded-workqueue.c +++ b/util/threaded-workqueue.c @@ -62,21 +62,30 @@ struct ThreadLocal { */ /* after the user fills the request, the bit is flipped. */ -uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES); +struct { +uint64_t request_fill_bitmap; +} QEMU_ALIGNED(SMP_CACHE_BYTES); + /* after handling the request, the thread flips the bit. */ -uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES); +struct { +uint64_t request_done_bitmap; +} QEMU_ALIGNED(SMP_CACHE_BYTES); /* * the event used to wake up the thread whenever a valid request has * been submitted */ -QemuEvent request_valid_ev QEMU_ALIGNED(SMP_CACHE_BYTES); +struct { +QemuEvent request_valid_ev; +} QEMU_ALIGNED(SMP_CACHE_BYTES); /* * the event is notified whenever a request has been completed * (i.e, become free), which is used to wake up the user */ -QemuEvent request_free_ev QEMU_ALIGNED(SMP_CACHE_BYTES); +struct { +QemuEvent request_free_ev; +} QEMU_ALIGNED(SMP_CACHE_BYTES); }; typedef struct ThreadLocal ThreadLocal; $ ./scripts/checkpatch.pl -f util/threaded-workqueue.c total: 0 errors, 0 warnings, 472 lines checked util/threaded-workqueue.c has no obvious style problems and is ready for submission. check_patch.sh somehow treats QEMU_ALIGNED as a function before the modification. And yes, u64 is not a long type on a 32 bit arch; it's long[2] instead. That's fine when we pass the &(u64) to a function whose parameter is (long *). I think this trick is widely used. 
E.g., the example in kvm that I replied to Emilio with: static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu) { return test_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests); }
Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue
On 11/24/18 8:17 AM, Emilio G. Cota wrote: On Thu, Nov 22, 2018 at 15:20:25 +0800, guangrong.x...@gmail.com wrote: +static uint64_t get_free_request_bitmap(Threads *threads, ThreadLocal *thread) +{ +uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap; + +request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap); +request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap); +bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap, + threads->thread_requests_nr); This is not wrong, but it's a bit ugly. Instead, I would: - Introduce bitmap_xor_atomic in a previous patch - Use bitmap_xor_atomic here, getting rid of the rcu reads Hmm, however, we do not need an atomic xor operation here... that should be slower than just two READ_ONCE calls.
Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue
On 11/24/18 8:12 AM, Emilio G. Cota wrote: On Thu, Nov 22, 2018 at 15:20:25 +0800, guangrong.x...@gmail.com wrote: + /* + * the bit in these two bitmaps indicates the index of the @requests This @ is not ASCII, is it? Good eyes. :) Will fix it. + * respectively. If it's the same, the corresponding request is free + * and owned by the user, i.e, where the user fills a request. Otherwise, + * it is valid and owned by the thread, i.e, where the thread fetches + * the request and writes the result. + */ + +/* after the user fills the request, the bit is flipped. */ +uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES); +/* after handling the request, the thread flips the bit. */ +uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES); Use DECLARE_BITMAP, otherwise you'll get type errors as David pointed out. If we do it, the field becomes a pointer... that complicates things. Hmm, I am using the same trick applied by the kvm module when it handles vcpu->requests: static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu) { return test_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests); } Is it good?
Re: [Qemu-devel] [PATCH v3 3/5] migration: use threaded workqueue for compression
On 11/24/18 2:29 AM, Dr. David Alan Gilbert wrote: static void -update_compress_thread_counts(const CompressParam *param, int bytes_xmit) +update_compress_thread_counts(CompressData *cd, int bytes_xmit) Keep the const? Yes, indeed. Will correct it in the next version. +if (deflateInit(&cd->stream, migrate_compress_level()) != Z_OK) { +g_free(cd->originbuf); +return -1; +} Please print errors if you fail in any case so we can easily tell what happened. Sure, will do. +if (wait) { +cpu_relax(); +goto retry; Is there nothing better we can use to wait without eating CPU time? There is a mechanism to wait without eating CPU time in the data structure, but it makes sense to busy wait. There are 4 threads in the workqueue, so you have to compare 1/4th of the time spent compressing a page, with the trip into the kernel to wake you up. You're adding 20% CPU usage, but I'm not surprised it's worthwhile. Hmm OK; in that case it does at least need a comment because it's a bit odd, and we should watch out how that scales - I guess it's less of an overhead the more threads you use. Sure, will add some comments to explain the purpose.
Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue
On 11/23/18 7:02 PM, Dr. David Alan Gilbert wrote: +#include "qemu/osdep.h" +#include "qemu/bitmap.h" +#include "qemu/threaded-workqueue.h" + +#define SMP_CACHE_BYTES 64 That's architecture dependent isn't it? Yes, it's arch dependent indeed. I just used 64 for simplification and I think it is <= 64 on most CPU arch-es so that can work. Should I introduce a statically defined CACHE LINE SIZE for all arch-es? :( + /* + * the bit in these two bitmaps indicates the index of the @requests + * respectively. If it's the same, the corresponding request is free + * and owned by the user, i.e, where the user fills a request. Otherwise, + * it is valid and owned by the thread, i.e, where the thread fetches + * the request and writes the result. + */ + +/* after the user fills the request, the bit is flipped. */ +uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES); +/* after handling the request, the thread flips the bit. */ +uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES); Patchew complained about some type mismatches; I think those are because you're using the bitmap_* functions on these; those functions always operate on 'long' not on uint64_t - and on some platforms they're unfortunately not the same. I guess you were talking about this error: ERROR: externs should be avoided in .c files #233: FILE: util/threaded-workqueue.c:65: +uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES); The complained-about thing is "QEMU_ALIGNED(SMP_CACHE_BYTES)", as it went away when the aligned attribute was removed... The issue you pointed out can be avoided by using type-casting, like: bitmap_xor(..., (void *)&thread->request_fill_bitmap) cannot we? Thanks!
Re: [Qemu-devel] [PATCH v2 2/5] util: introduce threaded workqueue
On 11/14/18 2:38 AM, Emilio G. Cota wrote: On Tue, Nov 06, 2018 at 20:20:22 +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong This module implements a lockless and efficient threaded workqueue. (snip) +++ b/util/threaded-workqueue.c +struct Threads { +/* + * in order to avoid contention, the @requests is partitioned to + * @threads_nr pieces, each thread exclusively handles + * @thread_request_nr requests in the array. + */ +void *requests; (snip) +/* + * the bit in these two bitmaps indicates the index of the @requests + * respectively. If it's the same, the corresponding request is free + * and owned by the user, i.e, where the user fills a request. Otherwise, + * it is valid and owned by the thread, i.e, where the thread fetches + * the request and writes the result. + */ + +/* after the user fills the request, the bit is flipped. */ +unsigned long *request_fill_bitmap; +/* after handling the request, the thread flips the bit. */ +unsigned long *request_done_bitmap; (snip) +/* the request is pushed to the thread in a round-robin manner */ +unsigned int current_thread_index; (snip) +QemuEvent ev; (snip) +}; The fields I'm showing above are all shared by all worker threads. This can lead to unnecessary contention. For example: - Adjacent requests can share the same cache line, which might be written to by different worker threads (when setting request->done) - The request_done bitmap is written to by worker threads every time a job completes. At high core counts with low numbers of job slots, this can result in high contention. For example, imagine we have 16 threads with 4 jobs each. This only requires 64 bits == 8 bytes, i.e. much less than a cache line. Whenever a job completes, the cache line will be atomically updated by one of the 16 threads. - The completion event (Threads.ev above) is written to by every thread. Again, this can result in contention at large core counts. An orthogonal issue is the round-robin policy. 
This can give us fairness, in that we guarantee that all workers get a similar number of jobs. But giving one job at a time to each worker is suboptimal when the job sizes are small-ish, because it misses out on the benefits of batching, which amortize the cost of communication. Given that the number of jobs that we have (at least in the benchmark) is small, filling up a worker's queue before moving on to the next can yield a significant speedup at high core counts. I implemented the above on top of your series. The results are as follows: threaded-workqueue-bench -r 4 -m 2 -c 20 -t #N Host: AMD Opteron(tm) Processor 6376 Thread pinning: #N+1 cores, same-socket first [ASCII throughput plot omitted; three series: "before" (B), "+batching" (D), "+per-thread-state" (A). Both changes outperform the baseline as the thread count grows, with +per-thread-state on top of +batching performing best.]
Re: [Qemu-devel] [PATCH v2 0/5] migration: improve multithreads
Hi, Ping... On 11/6/18 8:20 PM, guangrong.x...@gmail.com wrote: From: Xiao Guangrong Changelog in v2: These changes are based on Paolo's suggestions: 1) rename the lockless multithreads model to threaded workqueue 2) hugely improve the internal design: make all the requests one large array, properly partition it, assign requests to threads respectively, and use bitmaps to sync up the threads and the submitter; after that, ptr_ring and spinlock are dropped 3) introduce event wait for the submitter These changes are based on Emilio's review: 4) write a more detailed description for the threaded workqueue 5) add a benchmark for the threaded workqueue The previous version can be found at https://marc.info/?l=kvm=153968821910007=2 Here is a simple performance measurement comparing the two versions; the environment is the same as listed for the previous version. Use 8 threads to compress the data in the source QEMU. - with compress-wait-thread = off (total time / busy-ratio): v1 125066 / 0.38, v2 120444 / 0.35 - with compress-wait-thread = on (total time / busy-ratio): v1 164426 / 0, v2 142609 / 0 v2 wins slightly. Xiao Guangrong (5): bitops: introduce change_bit_atomic util: introduce threaded workqueue migration: use threaded workqueue for compression migration: use threaded workqueue for decompression tests: add threaded-workqueue-bench include/qemu/bitops.h | 13 + include/qemu/threaded-workqueue.h | 94 +++ migration/ram.c | 538 ++ tests/Makefile.include | 5 +- tests/threaded-workqueue-bench.c | 256 ++ util/Makefile.objs | 1 + util/threaded-workqueue.c | 466 + 7 files changed, 1030 insertions(+), 343 deletions(-) create mode 100644 include/qemu/threaded-workqueue.h create mode 100644 tests/threaded-workqueue-bench.c create mode 100644 util/threaded-workqueue.c
Re: [Qemu-devel] [PATCH 2/4] migration: introduce lockless multithreads model
On 10/28/2018 03:50 PM, Paolo Bonzini wrote: On 27/10/2018 01:33, Emilio G. Cota wrote: On Wed, Oct 17, 2018 at 12:10:15 +0200, Paolo Bonzini wrote: On 16/10/2018 13:10, guangrong.x...@gmail.com wrote: An idea: the total number of requests is going to be very small, and a PtrRing is not the nicest data structure for multiple producer/single consumer. So you could instead: (snip) - now that you have request indices, you can replace the completion ptr_ring with a bitmap, and set a bit in the bitmap with set_bit_atomic to report completion. On the writer side you use find_next_bit to find (snip) Emilio, can you review the above ideas? Sorry it took me a while to go through this. I like your suggestions. Just one nit; I'm not sure I understood the use case very well, but I think using a bitmap to signal completion might be suboptimal, since we'd have several threads spinning on the same cacheline yet caring about different bits. Requests are asynchronous, the bitmap is only used to find a free submission slot. You're right that the bitmap can bounce across processors, but I'm not sure how else you would do that because you don't know in advance how many submitting threads you have. It wouldn't be any worse if there was a spinlock. However, in the migration case there is only one submitting thread, so it's okay. :) Yup. The cache contention only exists in the work threads; the submitter thread, which is the main migration thread, is totally free. Making the main thread faster is good. Paolo Xiao: a couple of suggestions - Since you'll be adding a generic module, make its commit and description self-contained. That is, mentioning in the log that this will be used for migration is fine, but please describe the module (and the assumptions it makes about its users) in general, so that someone that doesn't know anything about migration can still understand this module (and hopefully adopt it for other use cases). 
Good to me, I will add a more detailed description for this module in the next version. - I'd like to see a simple test program (or rather, benchmark) that shows how this works. This benchmark would be completely unrelated to migration; it should just be a simple test of the performance/scalability of this module. Having this benchmark would help (1) discuss and quantitatively evaluate modifications to the module, and (2) help others to quickly understand what the module does. See tests/qht-bench.c for an example. Cannot agree with you more, will do. :) Thank you, Emilio and Paolo!
Re: [Qemu-devel] [PATCH 2/4] migration: introduce lockless multithreads model
On 10/17/2018 06:10 PM, Paolo Bonzini wrote: An idea: the total number of requests is going to be very small, and a PtrRing is not the nicest data structure for multiple producer/single consumer. So you could instead: - add the size of one request to the ops structure. Move the allocation in init_requests, so that you can have one single large array that stores all requests. thread_request_init gets the pointer to a single request. - now that you have a single array (let's call it threads->requests), the request index can be found with "(req - threads->requests) / threads->ops->request_size". The thread index, furthermore, is just request_index / threads->thread_ring_size and you can remove it from ThreadRequest. - now that you have request indices, you can replace the completion ptr_ring with a bitmap, and set a bit in the bitmap with set_bit_atomic to report completion. On the writer side you use find_next_bit to find a completed request and clear_bit_atomic to clear its index. The index passed to find_next_bit is threads->current_thread_index * threads->thread_ring_size, And you also don't need find_free_thread, because you can update threads->current_thread_index to threads->current_thread_index = ((free_request_id / threads->thread_ring_size) + 1) % threads->thread_nr; after you prepare a request, and threads will then naturally receive requests in round-robin order. (If find_next_bit fails it means you have to move the current_thread_index to 0 and retry). - you don't need the free requests list and free_requests_nr either: you just initialize the completed request bitmap to all-ones, and find_next_bit + clear_bit_atomic will do the work of free_requests. ThreadRequest goes away completely now! The advantage is that you get rid of the spinlock on the consumer side, and many auxiliary data structures on the producer side: a bitmap is a much nicer data structure for multiple producer/single consumer than the PtrRing. 
In addition, with this design the entire Threads structure becomes read-mostly, which is nice for the cache. The disadvantage is that you add an atomic operation (clear_bit_atomic) to threads_submit_request_prepare(*). All your comments are good to me and you are a GENIUS; the idea of making the requests a single array and partitioning it this way simplifies things a lot. The PtrRing is still useful for the other direction, because there you have single producer/single consumer. (*) It's probably possible to have two separate bitmaps, e.g. where the producer and consumers *flip* bits and the producer looks for mismatched bits between the two bitmaps. I'm not asking you to flesh out and implement that; it's just why I think you can ignore the extra cost of clear_bit_atomic. In fact, if the maintainers think this is overkill you can go ahead with just the naming fixes and I'll take a look at a rewrite when I have some time for fun stuff. :) LOL! Great minds think alike, the idea, "flipping bitmaps", was in my mind too. :) Besides that... I think we get the chance to remove ptr_ring gracefully, as the bitmap can indicate the ownership of the request as well. If the bit is 1 (supposing all bits are 1 by default), only the user can operate it; the bit is cleared after the user fills the info into the request. After that, the thread sees the bit is cleared, gets the ownership and finishes the request, then sets the bit in the bitmap. The ownership is returned to the user again. One possible disadvantage is that it can't differentiate whether a request is empty or contains a result that needs the user to call threads_wait_done(); that will slow threads_wait_done() a little as it needs to check all requests, but it is not a big deal as 1) at the point needing a flush, it's highly possible that almost all requests have been used, and 2) the total number of requests is going to be very small. 
It is illustrated by the following code, combining the "flip" bitmaps: struct Threads { .. /* * the bit in these two bitmaps indicates the index of the requests * respectively. If it's the same, the request is owned by the user, * i.e, only the user can use the request. Otherwise, it is owned by * the thread. */ /* after the user fills the request, the bit is flipped. */ unsigned long *request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES); /* after handling the request, the thread flips the bit. */ unsigned long *request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES); } threads_submit_request_prepare() { request_done_bitmap = READ_ONCE(threads->request_done_bitmap); result_bitmap = bitmap_xor(&request_done_bitmap, threads->request_fill_bitmap); index = find_first_zero_bit(current-thread-to-request-index, &result_bitmap); /* make sure we get the data the thread has written. */
Re: [Qemu-devel] [PATCH 1/4] ptr_ring: port ptr_ring from linux kernel to QEMU
On 10/17/2018 04:14 PM, Paolo Bonzini wrote: On 16/10/2018 18:40, Emilio G. Cota wrote: +#define SMP_CACHE_BYTES 64 +#define cacheline_aligned_in_smp \ +__attribute__((__aligned__(SMP_CACHE_BYTES))) You could use QEMU_ALIGNED() here. Yes, you are right. + +#define WRITE_ONCE(ptr, val) \ +(*((volatile typeof(ptr) *)(&(ptr))) = (val)) +#define READ_ONCE(ptr) (*((volatile typeof(ptr) *)(&(ptr)))) Why not atomic_read/set, like in the rest of the QEMU code base? Or even atomic_rcu_read/atomic_rcu_set, which includes the necessary barriers. Okay, will fix it, thank you and Emilio for pointing the issue out. Also, please do not use __ identifiers in QEMU code. cacheline_aligned_in_smp can become just QEMU_ALIGNED(SMP_CACHE_BYTES). Sure, will keep that in mind. :)
Re: [Qemu-devel] [PATCH v6 0/3] migration: compression optimization
On 09/06/2018 07:03 PM, Juan Quintela wrote: guangrong.x...@gmail.com wrote: From: Xiao Guangrong Changelog in v6: Thanks to Juan's review, in this version we 1) move flush compressed data to find_dirty_block() where it hits the end of memblock 2) use save_page_use_compression instead of migrate_use_compression in flush_compressed_data Xiao Guangrong (3): migration: do not flush_compressed_data at the end of iteration migration: show the statistics of compression migration: use save_page_use_compression in flush_compressed_data hmp.c | 13 +++ migration/migration.c | 12 ++ migration/ram.c | 63 +++ migration/ram.h | 1 + qapi/migration.json | 26 - 5 files changed, 105 insertions(+), 10 deletions(-) queued Hi Juan, Could I ask where you queued these patches? I did not find them on your git tree at https://github.com/juanquintela/qemu migration/next or migration.next I am working on the next part of migration; it's more convenient to base it on your tree. :) Thanks!
Re: [Qemu-devel] [PATCH v5 1/4] migration: do not flush_compressed_data at the end of each iteration
On 09/04/2018 11:54 AM, Xiao Guangrong wrote: We will call it only if xbzrle is also enabled; in that case, we will disable compression and xbzrle for the following pages, please refer ^and use xbzrle Sorry for the typo.
Re: [Qemu-devel] [PATCH v5 1/4] migration: do not flush_compressed_data at the end of each iteration
On 09/04/2018 12:38 AM, Juan Quintela wrote: guangrong.x...@gmail.com wrote: From: Xiao Guangrong flush_compressed_data() needs to wait for all compression threads to finish their work; after that all threads are free until the migration feeds new requests to them, so reducing its calls can improve the throughput and use CPU resources more effectively. We do not need to flush all threads at the end of an iteration; the data can be kept locally until the memory block is changed or memory migration starts over, in which case we will meet a dirtied page that may still exist in the compression threads' rings Signed-off-by: Xiao Guangrong I am not so sure about this patch. Right now, we guarantee that after each iteration, all data is written before we start a new round. This patch changes to only "flush" the compression threads if we have "synched" with the kvm migration bitmap. Idea is good but as far as I can see: - we already call flush_compressed_data() inside find_dirty_block if we synchronize the bitmap. The one called in find_dirty_block is as follows: if (migrate_use_xbzrle()) { /* If xbzrle is on, stop using the data compression at this * point. In theory, xbzrle can do better than compression. */ flush_compressed_data(rs); } We will call it only if xbzrle is also enabled; in that case, we will disable compression and xbzrle for the following pages, please refer to save_page_use_compression(). So, it cannot help us if we just enabled compression separately. So, at least, we need to update dirty_sync_count there. I understand this is an optimization that reduces flush_compressed_data when xbzrle and compression are both enabled. However, we can avoid it by using save_page_use_compression() to replace migrate_use_compression(). Furthermore, in our work which will be pushed out after this patchset (https://marc.info/?l=kvm=152810616708459=2), we will make flush_compressed_data() really light if there is nothing to be flushed. 
So, how about just keeping it in this patch and optimizing it in our next work? :) - queued pages are "interesting", but I am not sure if compression and postcopy work well together. Compression and postcopy cannot both be enabled; please refer to the code in migrate_caps_check(): if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) { if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) { /* The decompression threads asynchronously write into RAM * rather than use the atomic copies needed to avoid * userfaulting. It should be possible to fix the decompression * threads for compatibility in future. */ error_setg(errp, "Postcopy is not currently compatible " "with compression"); return false; } So, if we don't need to call flush_compressed_data() every round, then the one inside find_dirty_block() should be enough. Otherwise, I can't see why we need this other one. Independent of this patch: - We always send data for every compression thread without testing if there is anything there. Yes. That's because we will never doubly handle the page in the iteration. :) @@ -3212,7 +3225,6 @@ static int ram_save_iterate(QEMUFile *f, void *opaque) } i++; } -flush_compressed_data(rs); rcu_read_unlock(); /* Why is it not enough just to remove this call to flush_compressed_data? Consider this case, thanks Dave for pointing it out. :) === iteration one: thread 1: Save compressed page 'n' iteration two: thread 2: Save compressed page 'n' What guarantees that the version of page 'n' from thread 2 reaches the destination first without this flush? === If we just remove it, we cannot get this guarantee. :)
Re: [Qemu-devel] [PATCH v3 10/10] migration: show the statistics of compression
On 08/08/2018 02:12 PM, Peter Xu wrote: On Tue, Aug 07, 2018 at 05:12:09PM +0800, guangrong.x...@gmail.com wrote: [...] @@ -1602,6 +1614,26 @@ static void migration_update_rates(RAMState *rs, int64_t end_time) rs->xbzrle_cache_miss_prev) / page_count; rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss; } + +if (migrate_use_compression()) { +compression_counters.busy_rate = (double)(compression_counters.busy - +rs->compress_thread_busy_prev) / page_count; So this is related to the previous patch - I still doubt its correctness if page_count is the host page count rather than the guest pages'. Other than that the patch looks good to me. I think I can boldly treat it as your Reviewed-by. :)
Re: [Qemu-devel] [PATCH v3 08/10] migration: handle the error condition properly
On 08/08/2018 10:11 PM, Dr. David Alan Gilbert wrote: * Xiao Guangrong (guangrong.x...@gmail.com) wrote: On 08/08/2018 01:08 PM, Peter Xu wrote: On Tue, Aug 07, 2018 at 05:12:07PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong ram_find_and_save_block() can return negative if any error happens; however, it is completely ignored in the current code Could you hint me where we'll return an error? I think control_save_page() may return an error condition but I am not good at it ... Other places look safe _currently_. These functions were designed to have errors returned anyway. ram_control_save_page's return is checked by control_save_page which returns true/false but sets *pages to a return value. What I'd need to follow closely is the case where ram_control_save_page returns RAM_SAVE_CONTROL_DELAYED; in that case control_save_page I think returns with *pages=-1 and returns true. And I think in that case ram_save_target_page can leak that -1 - hmm. Now, ram_save_host_page already checks for <0 and will return that, but I think that would potentially loop in ram_find_and_save_block; I'm not sure we want to change that or not! ram_find_and_save_block() will continue the loop only if ram_save_host_page returns zero: .. if (found) { pages = ram_save_host_page(rs, , last_stage); } } while (!pages && again); IMHO, how to change the code really depends on the semantics of these functions; based on their comments, returning error conditions is the current semantic. Another choice would be one that squashes error conditions into the QEMUFile and adapts the comments and code of these functions to reflect the new semantics clearly. Which one do you guys prefer? :)
Re: [Qemu-devel] [PATCH v3 08/10] migration: handle the error condition properly
On 08/08/2018 02:56 PM, Peter Xu wrote: On Wed, Aug 08, 2018 at 02:29:52PM +0800, Xiao Guangrong wrote: On 08/08/2018 01:08 PM, Peter Xu wrote: On Tue, Aug 07, 2018 at 05:12:07PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong ram_find_and_save_block() can return negative if any error happens; however, it is completely ignored in the current code Could you hint me where we'll return an error? I think control_save_page() may return an error condition but I am not good at it ... Other places look safe _currently_. These functions were designed to have errors returned anyway. Ah, the RDMA codes... Then I feel like this patch would be more suitable to be put into some of the RDMA series - at least we'd better be clear about what errors we're going to capture. For non-RDMA, it seems a bit helpless after all - AFAIU we're depending on the few qemu_file_get_error() calls to detect output errors. So, are you talking about modifying the semantics of these functions, ram_save_host_page(), ram_save_target_page(), etc., to make them: "Returns the number of pages written where zero means no dirty pages; error conditions are indicated in the QEMUFile @rs->file if they happened." If that's what you want, I will update the comments and make the implementation clearer to reflect this fact for these functions
Re: [Qemu-devel] [PATCH v3 09/10] migration: fix calculating xbzrle_counters.cache_miss_rate
On 08/08/2018 02:05 PM, Peter Xu wrote: On Tue, Aug 07, 2018 at 05:12:08PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong As Peter pointed out: | - xbzrle_counters.cache_miss is done in save_xbzrle_page(), so it's | per-guest-page granularity | | - RAMState.iterations is done for each ram_find_and_save_block(), so | it's per-host-page granularity | | An example is that when we migrate a 2M huge page in the guest, we | will only increase the RAMState.iterations by 1 (since | ram_find_and_save_block() will be called once), but we might increase | xbzrle_counters.cache_miss for 2M/4K=512 times (we'll call | save_xbzrle_page() that many times) if all the pages got cache miss. | Then IMHO the cache miss rate will be 512/1=51200% (while it should | actually be just 100% cache miss). And he also suggested as xbzrle_counters.cache_miss_rate is the only user of rs->iterations we can adapt it to count guest page numbers After that, rename 'iterations' to 'handle_pages' to better reflect its meaning Suggested-by: Peter Xu Signed-off-by: Xiao Guangrong --- migration/ram.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index 09be01dca2..bd7c18d1f9 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -300,10 +300,10 @@ struct RAMState { uint64_t num_dirty_pages_period; /* xbzrle misses since the beginning of the period */ uint64_t xbzrle_cache_miss_prev; -/* number of iterations at the beginning of period */ -uint64_t iterations_prev; -/* Iterations since start */ -uint64_t iterations; +/* total handled pages at the beginning of period */ +uint64_t handle_pages_prev; +/* total handled pages since start */ +uint64_t handle_pages; The name is not that straightforward to me. I would think about "[guest|host]_page_count" or something better, or we just keep the old naming but with a better comment would be fine too. 
The field actually indicates the total pages (target pages, more precisely) handled during live migration. 'iterations' confuses us completely. Is target_page_count good for you? /* number of dirty bits in the bitmap */ uint64_t migration_dirty_pages; /* last dirty_sync_count we have seen */ @@ -1587,19 +1587,19 @@ uint64_t ram_pagesize_summary(void) static void migration_update_rates(RAMState *rs, int64_t end_time) { -uint64_t iter_count = rs->iterations - rs->iterations_prev; +uint64_t page_count = rs->handle_pages - rs->handle_pages_prev; /* calculate period counters */ ram_counters.dirty_pages_rate = rs->num_dirty_pages_period * 1000 / (end_time - rs->time_last_bitmap_sync); -if (!iter_count) { +if (!page_count) { return; } if (migrate_use_xbzrle()) { xbzrle_counters.cache_miss_rate = (double)(xbzrle_counters.cache_miss - -rs->xbzrle_cache_miss_prev) / iter_count; +rs->xbzrle_cache_miss_prev) / page_count; rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss; } } @@ -1657,7 +1657,7 @@ static void migration_bitmap_sync(RAMState *rs) migration_update_rates(rs, end_time); -rs->iterations_prev = rs->iterations; +rs->handle_pages_prev = rs->handle_pages; /* reset period counters */ rs->time_last_bitmap_sync = end_time; @@ -3209,7 +3209,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque) break; } -rs->iterations++; +rs->handle_pages += pages; So it's still counting host pages, is this your intention to only change the name in the patch? Hmm... isn't the value returned by ram_find_and_save_block() the total target pages posted out? /** * ram_find_and_save_block: finds a dirty page and sends it to f * * Called within an RCU critical section. * * Returns the number of pages written where zero means no dirty pages, * or negative on error ... * * On systems where host-page-size > target-page-size it will send all the * pages in a host page that are dirty. */
Re: [Qemu-devel] [PATCH v3 08/10] migration: handle the error condition properly
On 08/08/2018 01:08 PM, Peter Xu wrote: On Tue, Aug 07, 2018 at 05:12:07PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong ram_find_and_save_block() can return a negative value if any error happens; however, it is completely ignored in the current code Could you hint to me where we'll return an error? I think control_save_page() may return an error condition, but I am not familiar with it ... Other places look safe _currently_. These functions were designed to return errors anyway. (Anyway, I agree that the error handling is not that good, mostly because the QEMUFile APIs do not provide proper return codes, e.g., qemu_put_be64 returns void) Yes, it is; the returned error condition is mixed between the file's API and the function's return value... :(
Re: [Qemu-devel] [PATCH v3 07/10] migration: do not flush_compressed_data at the end of each iteration
On 08/08/2018 12:52 PM, Peter Xu wrote: On Tue, Aug 07, 2018 at 05:12:06PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong flush_compressed_data() needs to wait for all compression threads to finish their work; after that, all threads are free until the migration feeds new requests to them, so reducing its calls can improve the throughput and use CPU resources more effectively We do not need to flush all threads at the end of each iteration; the data can be kept locally until the memory block changes or memory migration starts over, in which case we will meet a dirtied page that may still exist in the compression threads' ring Signed-off-by: Xiao Guangrong --- migration/ram.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/migration/ram.c b/migration/ram.c index 99ecf9b315..55966bc2c1 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -306,6 +306,8 @@ struct RAMState { uint64_t iterations; /* number of dirty bits in the bitmap */ uint64_t migration_dirty_pages; +/* last dirty_sync_count we have seen */ +uint64_t dirty_sync_count_prev; /* protects modification of the bitmap */ QemuMutex bitmap_mutex; /* The RAMBlock used in the last src_page_requests */ @@ -3173,6 +3175,17 @@ static int ram_save_iterate(QEMUFile *f, void *opaque) ram_control_before_iterate(f, RAM_CONTROL_ROUND); +/* + * if memory migration starts over, we will meet a dirtied page which + * may still exist in the compression threads' ring, so we should flush + * the compressed data to make sure the new page is not overwritten by + * the old one in the destination. + */ +if (ram_counters.dirty_sync_count != rs->dirty_sync_count_prev) { +rs->dirty_sync_count_prev = ram_counters.dirty_sync_count; +flush_compressed_data(rs); AFAIU this only happens when ram_save_pending() calls migration_bitmap_sync(). Could we just simply flush there? Then we can avoid that new variable. Yup, that's better indeed, will do it.
Re: [Qemu-devel] [PATCH v3 01/10] migration: do not wait for free thread
On 08/08/2018 11:51 AM, Peter Xu wrote: On Tue, Aug 07, 2018 at 08:29:54AM -0500, Eric Blake wrote: On 08/07/2018 04:12 AM, guangrong.x...@gmail.com wrote: From: Xiao Guangrong Instead of putting the main thread to sleep state to wait for free compression thread, we can directly post it out as normal page that reduces the latency and uses CPUs more efficiently A parameter, compress-wait-thread, is introduced, it can be enabled if the user really wants the old behavior Signed-off-by: Xiao Guangrong --- +++ b/qapi/migration.json @@ -462,6 +462,11 @@ # @compress-threads: Set compression thread count to be used in live migration, # the compression thread count is an integer between 1 and 255. # +# @compress-wait-thread: Wait if no thread is free to compress the memory page +# if it's enabled, otherwise, the page will be posted out immediately +# in the main thread without compression. It's true on default. +# (Since: 3.1) Grammar suggestion: @compress-wait-thread: Controls behavior when all compression threads are currently busy. If true (default), wait for a free compression thread to become available; otherwise, send the page uncompressed. (Since 3.1) Eric's version seems better. With that: It's good to me too. Will apply Eric's suggestion in the next version. Reviewed-by: Peter Xu Thank you, Peter and Eric.
Re: [Qemu-devel] [PATCH v2 1/8] migration: do not wait for free thread
On 07/24/2018 02:36 AM, Eric Blake wrote: On 07/19/2018 07:15 AM, guangrong.x...@gmail.com wrote: From: Xiao Guangrong Instead of putting the main thread to sleep to wait for a free compression thread, we can directly post the page out as a normal page, which reduces the latency and uses CPUs more efficiently A parameter, compress-wait-thread, is introduced; it can be enabled if the user really wants the old behavior Signed-off-by: Xiao Guangrong --- hmp.c | 8 migration/migration.c | 21 + migration/migration.h | 1 + migration/ram.c | 45 ++--- qapi/migration.json | 23 ++- 5 files changed, 74 insertions(+), 24 deletions(-) +++ b/qapi/migration.json @@ -462,6 +462,11 @@ # @compress-threads: Set compression thread count to be used in live migration, # the compression thread count is an integer between 1 and 255. # +# @compress-wait-thread: Wait if no thread is free to compress the memory page +# if it's enabled; otherwise, the page will be posted out immediately +# in the main thread without compression. It's off by default. +# (Since: 3.0) Is this a bug fix? It's awfully late in the release cycle to be adding new features; is this something that we can live without until 3.1? It's a performance improvement; I think it is not urgent. :)
Re: [Qemu-devel] [PATCH v2 6/8] migration: move handle of zero page to the thread
On 07/23/2018 05:40 PM, Peter Xu wrote: On Mon, Jul 23, 2018 at 04:44:49PM +0800, Xiao Guangrong wrote: [...] However, it is not safe to do ram_release_pages() in the thread as it's not protected against concurrent access from multiple threads. Fortunately, compression will be disabled if it switches to post-copy, so I preferred to keep the current behavior and defer the fix until after this patchset has been merged. Do you mean ram_release_pages() is not thread-safe? Why? I didn't notice it before but I feel like it is safe. bitmap_clear() called in the function is not safe. Yeah, and the funny thing is that I don't think ram_release_pages() should even touch the receivedmap... It's possible that the release-ram feature for postcopy is broken. Never mind on that. I'll post a patch to fix it; then I think ram_release_pages() will be thread-safe. Then this patch shouldn't be affected and it should be fine after that fix That would be great, thanks for your work.
Re: [Qemu-devel] [PATCH v2 3/8] migration: show the statistics of compression
On 07/23/2018 05:15 PM, Peter Xu wrote: On Mon, Jul 23, 2018 at 04:40:29PM +0800, Xiao Guangrong wrote: On 07/23/2018 04:05 PM, Peter Xu wrote: On Mon, Jul 23, 2018 at 03:39:18PM +0800, Xiao Guangrong wrote: On 07/23/2018 12:36 PM, Peter Xu wrote: On Thu, Jul 19, 2018 at 08:15:15PM +0800, guangrong.x...@gmail.com wrote: @@ -1597,6 +1608,24 @@ static void migration_update_rates(RAMState *rs, int64_t end_time) rs->xbzrle_cache_miss_prev) / iter_count; rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss; } + +if (migrate_use_compression()) { +uint64_t comp_pages; + +compression_counters.busy_rate = (double)(compression_counters.busy - +rs->compress_thread_busy_prev) / iter_count; Here I'm not sure it's correct... "iter_count" stands for ramstate.iterations. It's increased per ram_find_and_save_block(), so IMHO it might contain multiple guest ram_find_and_save_block() returns if a page is successfully posted and it only posts 1 page out at one time. ram_find_and_save_block() calls ram_save_host_page(), and we should be sending multiple guest pages in ram_save_host_page() if the host page is a huge page? You're right, thank you for pointing it out. So, how about introducing a field, posted_pages, into RAMState that is used to track the total pages posted out. Then we will use this field to replace 'iter_count' for compression, and use 'RAMState.posted_pages - ram_counters.duplicate' to calculate xbzrle_cache_miss, as zero pages are not handled by xbzrle. Or introduce a new function, total_posted_pages, which returns the sum of all page counts: static uint64_t total_posted_pages(void) { return ram_counters.normal + ram_counters.duplicate + compression_counters.pages + xbzrle_counters.pages; } that would add a bit more complexity... If below [1] is wrong too, then I'm thinking whether we could just move the rs->iterations++ from the ram_save_iterate() loop to the ram_save_host_page() loop, then we possibly fix both places...
Besides that, even if we fix iterations, xbzrle is still painful, as zero pages should not be counted as cache misses; i.e., xbzrle does not handle zero pages at all. Anyway, fixing iterations is a good start. :) After all, I don't see any other usages of rs->iterations, so it seems fine. Dave/Juan? pages. However compression_counters.busy should be per guest page. Actually, it's derived from xbzrle_counters.cache_miss_rate: xbzrle_counters.cache_miss_rate = (double)(xbzrle_counters.cache_miss - rs->xbzrle_cache_miss_prev) / iter_count; Then this is suspicious to me too... [1] +rs->compress_thread_busy_prev = compression_counters.busy; + +comp_pages = compression_counters.pages - rs->compress_pages_prev; +if (comp_pages) { +compression_counters.compression_rate = +(double)(compression_counters.reduced_size - +rs->compress_reduced_size_prev) / +(comp_pages * TARGET_PAGE_SIZE); +rs->compress_pages_prev = compression_counters.pages; +rs->compress_reduced_size_prev = compression_counters.reduced_size; +} +} } static void migration_bitmap_sync(RAMState *rs) @@ -1872,6 +1901,9 @@ static void flush_compressed_data(RAMState *rs) qemu_mutex_lock(_param[idx].mutex); if (!comp_param[idx].quit) { len = qemu_put_qemu_file(rs->f, comp_param[idx].file); +/* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */ +compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8; I would agree with Dave here - why do we store the "reduced size" instead of the size of the compressed data (which I think should be len - 8)? len - 8 is the size of the data after compression rather than the amount saved by compression; the latter makes it straightforward for the user to see how much improvement applying compression brings. Hmm... but it is not a big deal to me... :) Yeah it might be a personal preference indeed.
:) It's just natural to do it this way for me, since AFAIU the compression ratio is defined as: compression ratio = compressed data size / original data size Er, we do it as follows: compression_counters.compression_rate = (double)(compression_counters.reduced_size - rs->compress_reduced_size_prev) / (comp_pages * TARGET_PAGE_SIZE); We use reduced_size, i.e.: compression ratio = (original data size - compressed data size) / original data size For example, for 100 bytes of raw data, if we posted 99 bytes out, then the compression ratio should be 1%. I think it should
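The two ratio conventions being debated can be pinned down with a tiny sketch (hypothetical helper names, not QEMU code): the textbook ratio is compressed size over raw size, while the patch's reduced_size convention measures bytes saved over raw size.

```c
#include <assert.h>

/* Textbook convention: compressed data size / original data size.
 * Posting 99 bytes for a 100-byte page gives 0.99. */
static double ratio_compressed(unsigned long posted, unsigned long raw)
{
    return (double)posted / raw;
}

/* reduced_size convention used in the patch under review:
 * (original - compressed) / original, i.e. the fraction saved.
 * Posting 99 bytes for a 100-byte page gives 0.01, the "1%" in
 * the example above. */
static double ratio_reduced(unsigned long posted, unsigned long raw)
{
    return (double)(raw - posted) / raw;
}
```

The two are complements of each other (they sum to 1 for the same page), so the disagreement is purely about which number is more intuitive to report to the user.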
Re: [Qemu-devel] [PATCH v2 8/8] migration: do not flush_compressed_data at the end of each iteration
On 07/23/2018 05:01 PM, Peter Xu wrote: Yes, it's sufficient for the current thread model; I will drop it for now and add it back when the lockless multithread model is applied. :) Ah, I think I see your point. Even so, I would think it better to do any extra cleanup directly in compress_threads_save_cleanup() if possible. Okay, got it.
Re: [Qemu-devel] [PATCH v2 8/8] migration: do not flush_compressed_data at the end of each iteration
On 07/23/2018 04:35 PM, Peter Xu wrote: On Mon, Jul 23, 2018 at 04:05:21PM +0800, Xiao Guangrong wrote: On 07/23/2018 01:49 PM, Peter Xu wrote: On Thu, Jul 19, 2018 at 08:15:20PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong flush_compressed_data() needs to wait all compression threads to finish their work, after that all threads are free until the migration feeds new request to them, reducing its call can improve the throughput and use CPU resource more effectively We do not need to flush all threads at the end of iteration, the data can be kept locally until the memory block is changed or memory migration starts over in that case we will meet a dirtied page which may still exists in compression threads's ring Signed-off-by: Xiao Guangrong --- migration/ram.c | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/migration/ram.c b/migration/ram.c index 89305c7af5..fdab13821d 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -315,6 +315,8 @@ struct RAMState { uint64_t iterations; /* number of dirty bits in the bitmap */ uint64_t migration_dirty_pages; +/* last dirty_sync_count we have seen */ +uint64_t dirty_sync_count; Better suffix it with "_prev" as well? So that we can quickly identify that it's only a cache and it can be different from the one in the ram_counters. Indeed, will update it. /* protects modification of the bitmap */ QemuMutex bitmap_mutex; /* The RAMBlock used in the last src_page_requests */ @@ -2532,6 +2534,7 @@ static void ram_save_cleanup(void *opaque) } xbzrle_cleanup(); +flush_compressed_data(*rsp); Could I ask why do we need this considering that we have compress_threads_save_cleanup() right down there? Dave ask it too. :( "This is for the error condition, if any error occurred during live migration, there is no chance to call ram_save_complete. After using the lockless multithreads model, we assert all requests have been handled before destroy the work threads." 
That makes sure there is nothing left in the threads before doing compress_threads_save_cleanup(), as in the current behavior. For the lockless multithread model, we check that all requests are free before destroying them. But why do we need to explicitly flush it here? Now in compress_threads_save_cleanup() we have qemu_fclose() on the buffers, which logically will flush the data and clean up everything too. Would that suffice? Yes, it's sufficient for the current thread model; I will drop it for now and add it back when the lockless multithread model is applied. :)
Re: [Qemu-devel] [PATCH v2 6/8] migration: move handle of zero page to the thread
On 07/23/2018 04:28 PM, Peter Xu wrote: On Mon, Jul 23, 2018 at 03:56:33PM +0800, Xiao Guangrong wrote: [...] @@ -2249,15 +2308,8 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss, return res; } -/* - * When starting the process of a new block, the first page of - * the block should be sent out before other pages in the same - * block, and all the pages in last block should have been sent - * out, keeping this order is important, because the 'cont' flag - * is used to avoid resending the block name. - */ -if (block != rs->last_sent_block && save_page_use_compression(rs)) { -flush_compressed_data(rs); +if (save_compress_page(rs, block, offset)) { +return 1; It's a bit tricky (though it seems to be a good idea too) to move the zero detect into the compression thread, though I noticed that we also do something else for zero pages: res = save_zero_page(rs, block, offset); if (res > 0) { /* Must let xbzrle know, otherwise a previous (now 0'd) cached * page would be stale */ if (!save_page_use_compression(rs)) { XBZRLE_cache_lock(); xbzrle_cache_zero_page(rs, block->offset + offset); XBZRLE_cache_unlock(); } ram_release_pages(block->idstr, offset, res); return res; } I'd guess that the xbzrle update of the zero page is not needed for compression since after all xbzrle is not enabled when compression is Yup. if they are both enabled, compression works only for the first iteration (i.e, ram_bulk_stage), at that point, nothing is cached in xbzrle's cahe, in other words, xbzrle has posted nothing to the destination. enabled, however do we need to do ram_release_pages() somehow? We have done it in the thread: +static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block, ram_addr_t offset, uint8_t *source_buf) { +if (save_zero_page_to_file(rs, f, block, offset)) { +zero_page = true; +goto exit; +} .. +exit: ram_release_pages(block->idstr, offset & TARGET_PAGE_MASK, 1); +return zero_page; } Ah, then it seems fine. 
Though I'd suggest you comment these into the commit message in case people won't get it easily. Okay, will update the commit log addressing your comments. However, it is not safe to do ram_release_pages() in the thread as it's not protected against concurrent access from multiple threads. Fortunately, compression will be disabled if it switches to post-copy, so I preferred to keep the current behavior and defer the fix until after this patchset has been merged. Do you mean ram_release_pages() is not thread-safe? Why? I didn't notice it before but I feel like it is safe. bitmap_clear() called in the function is not safe.
Re: [Qemu-devel] [PATCH v2 3/8] migration: show the statistics of compression
On 07/23/2018 04:05 PM, Peter Xu wrote: On Mon, Jul 23, 2018 at 03:39:18PM +0800, Xiao Guangrong wrote: On 07/23/2018 12:36 PM, Peter Xu wrote: On Thu, Jul 19, 2018 at 08:15:15PM +0800, guangrong.x...@gmail.com wrote: @@ -1597,6 +1608,24 @@ static void migration_update_rates(RAMState *rs, int64_t end_time) rs->xbzrle_cache_miss_prev) / iter_count; rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss; } + +if (migrate_use_compression()) { +uint64_t comp_pages; + +compression_counters.busy_rate = (double)(compression_counters.busy - +rs->compress_thread_busy_prev) / iter_count; Here I'm not sure it's correct... "iter_count" stands for ramstate.iterations. It's increased per ram_find_and_save_block(), so IMHO it might contain multiple guest ram_find_and_save_block() returns if a page is successfully posted and it only posts 1 page out at one time. ram_find_and_save_block() calls ram_save_host_page(), and we should be sending multiple guest pages in ram_save_host_page() if the host page is a huge page? You're right, thank you for pointing it out. So, how about introduce a filed, posted_pages, into RAMState that is used to track total pages posted out. Then will use this filed to replace 'iter_count' for compression and use 'RAMState.posted_pages - ram_counters.duplicate' to calculate xbzrle_cache_miss as the zero page is not handled by xbzrle. Or introduce a new function, total_posted_pages, which returns the sum of all page counts: static total_posted_pages(void) { return ram_counters.normal + ram_counters.duplicate + compression_counters.pages + xbzrle_counters.pages; } that would be a bit more complexity... pages. However compression_counters.busy should be per guest page. Actually, it's derived from xbzrle_counters.cache_miss_rate: xbzrle_counters.cache_miss_rate = (double)(xbzrle_counters.cache_miss - rs->xbzrle_cache_miss_prev) / iter_count; Then this is suspecious to me too... 
+rs->compress_thread_busy_prev = compression_counters.busy; + +comp_pages = compression_counters.pages - rs->compress_pages_prev; +if (comp_pages) { +compression_counters.compression_rate = +(double)(compression_counters.reduced_size - +rs->compress_reduced_size_prev) / +(comp_pages * TARGET_PAGE_SIZE); +rs->compress_pages_prev = compression_counters.pages; +rs->compress_reduced_size_prev = compression_counters.reduced_size; +} +} } static void migration_bitmap_sync(RAMState *rs) @@ -1872,6 +1901,9 @@ static void flush_compressed_data(RAMState *rs) qemu_mutex_lock(_param[idx].mutex); if (!comp_param[idx].quit) { len = qemu_put_qemu_file(rs->f, comp_param[idx].file); +/* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */ +compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8; I would agree with Dave here - why we store the "reduced size" instead of the size of the compressed data (which I think should be len - 8)? len-8 is the size of data after compressed rather than the data improved by compression that is not straightforward for the user to see how much the improvement is by applying compression. Hmm... but it is not a big deal to me... :) Yeah it might be a personal preference indeed. :) It's just natural to do that this way for me since AFAIU the compression ratio is defined as: compressed data size compression ratio = original data size Er, we do it as following: compression_counters.compression_rate = (double)(compression_counters.reduced_size - rs->compress_reduced_size_prev) / (comp_pages * TARGET_PAGE_SIZE); We use reduced_size, i.e,: original data size - compressed data size compression ratio = original data size for example, for 100 bytes raw data, if we posted 99 bytes out, then the compression ration should be 1%. So if i understand correctly, the reduced_size is really you want? :)
Re: [Qemu-devel] [PATCH v2 8/8] migration: do not flush_compressed_data at the end of each iteration
On 07/23/2018 01:49 PM, Peter Xu wrote: On Thu, Jul 19, 2018 at 08:15:20PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong flush_compressed_data() needs to wait all compression threads to finish their work, after that all threads are free until the migration feeds new request to them, reducing its call can improve the throughput and use CPU resource more effectively We do not need to flush all threads at the end of iteration, the data can be kept locally until the memory block is changed or memory migration starts over in that case we will meet a dirtied page which may still exists in compression threads's ring Signed-off-by: Xiao Guangrong --- migration/ram.c | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/migration/ram.c b/migration/ram.c index 89305c7af5..fdab13821d 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -315,6 +315,8 @@ struct RAMState { uint64_t iterations; /* number of dirty bits in the bitmap */ uint64_t migration_dirty_pages; +/* last dirty_sync_count we have seen */ +uint64_t dirty_sync_count; Better suffix it with "_prev" as well? So that we can quickly identify that it's only a cache and it can be different from the one in the ram_counters. Indeed, will update it. /* protects modification of the bitmap */ QemuMutex bitmap_mutex; /* The RAMBlock used in the last src_page_requests */ @@ -2532,6 +2534,7 @@ static void ram_save_cleanup(void *opaque) } xbzrle_cleanup(); +flush_compressed_data(*rsp); Could I ask why do we need this considering that we have compress_threads_save_cleanup() right down there? Dave ask it too. :( "This is for the error condition, if any error occurred during live migration, there is no chance to call ram_save_complete. After using the lockless multithreads model, we assert all requests have been handled before destroy the work threads." That makes sure there is nothing left in the threads before doing compress_threads_save_cleanup() as current behavior. 
For the lockless multithread model, we check that all requests are free before destroying them. compress_threads_save_cleanup(); ram_state_cleanup(rsp); } @@ -3203,6 +3206,17 @@ static int ram_save_iterate(QEMUFile *f, void *opaque) ram_control_before_iterate(f, RAM_CONTROL_ROUND); +/* + * if memory migration starts over, we will meet a dirtied page which + * may still exist in the compression threads' ring, so we should flush + * the compressed data to make sure the new page is not overwritten by + * the old one in the destination. + */ +if (ram_counters.dirty_sync_count != rs->dirty_sync_count) { +rs->dirty_sync_count = ram_counters.dirty_sync_count; +flush_compressed_data(rs); +} + t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); i = 0; while ((ret = qemu_file_rate_limit(f)) == 0 || @@ -3235,7 +3249,6 @@ static int ram_save_iterate(QEMUFile *f, void *opaque) } i++; } -flush_compressed_data(rs); This looks sane to me, but I'd like to see what other people think about it too... Thank you a lot, Peter! :)
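The dirty_sync_count check in the hunk above reduces to a small state machine: flush only when a new bitmap sync round has started, instead of unconditionally at the end of every iteration. A simplified model (stub names, not QEMU's real structures):

```c
#include <assert.h>

/* Counts how many times the (stubbed) flush would run. */
static unsigned long flush_calls;

static void flush_compressed_data_stub(void)
{
    flush_calls++; /* the real function waits for all compression threads */
}

struct ram_state_sketch {
    unsigned long dirty_sync_count_prev; /* last sync round we flushed for */
};

/* Called at the top of each save iteration with the global sync count;
 * flushes at most once per sync round, modeling the patch's check. */
static void maybe_flush(struct ram_state_sketch *rs, unsigned long sync_count)
{
    if (sync_count != rs->dirty_sync_count_prev) {
        rs->dirty_sync_count_prev = sync_count;
        flush_compressed_data_stub();
    }
    /* ... send pages; no unconditional flush at the end of the loop ... */
}
```

Many iterations within the same sync round trigger a single flush; only a new round (a possibly re-dirtied page still sitting in a compression thread's ring) forces another one, which is where the throughput win comes from.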
Re: [Qemu-devel] [PATCH v2 6/8] migration: move handle of zero page to the thread
On 07/23/2018 01:03 PM, Peter Xu wrote: On Thu, Jul 19, 2018 at 08:15:18PM +0800, guangrong.x...@gmail.com wrote: [...] @@ -1950,12 +1971,16 @@ retry: set_compress_params(_param[idx], block, offset); qemu_cond_signal(_param[idx].cond); qemu_mutex_unlock(_param[idx].mutex); -pages = 1; -/* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */ -compression_counters.reduced_size += TARGET_PAGE_SIZE - - bytes_xmit + 8; -compression_counters.pages++; ram_counters.transferred += bytes_xmit; +pages = 1; (moving of this line seems irrelevant; meanwhile more duplicated codes so even better to have a helper now) Good to me. :) +if (comp_param[idx].zero_page) { +ram_counters.duplicate++; +} else { +/* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */ +compression_counters.reduced_size += TARGET_PAGE_SIZE - + bytes_xmit + 8; +compression_counters.pages++; +} break; } } [...] @@ -2249,15 +2308,8 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss, return res; } -/* - * When starting the process of a new block, the first page of - * the block should be sent out before other pages in the same - * block, and all the pages in last block should have been sent - * out, keeping this order is important, because the 'cont' flag - * is used to avoid resending the block name. 
- */ -if (block != rs->last_sent_block && save_page_use_compression(rs)) { -flush_compressed_data(rs); +if (save_compress_page(rs, block, offset)) { +return 1; It's a bit tricky (though it seems to be a good idea too) to move the zero detect into the compression thread, though I noticed that we also do something else for zero pages: res = save_zero_page(rs, block, offset); if (res > 0) { /* Must let xbzrle know, otherwise a previous (now 0'd) cached * page would be stale */ if (!save_page_use_compression(rs)) { XBZRLE_cache_lock(); xbzrle_cache_zero_page(rs, block->offset + offset); XBZRLE_cache_unlock(); } ram_release_pages(block->idstr, offset, res); return res; } I'd guess that the xbzrle update of the zero page is not needed for compression since after all xbzrle is not enabled when compression is Yup. if they are both enabled, compression works only for the first iteration (i.e, ram_bulk_stage), at that point, nothing is cached in xbzrle's cahe, in other words, xbzrle has posted nothing to the destination. enabled, however do we need to do ram_release_pages() somehow? We have done it in the thread: +static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block, ram_addr_t offset, uint8_t *source_buf) { +if (save_zero_page_to_file(rs, f, block, offset)) { +zero_page = true; +goto exit; +} .. +exit: ram_release_pages(block->idstr, offset & TARGET_PAGE_MASK, 1); +return zero_page; } However, it is not safe to do ram_release_pages in the thread as it's not protected it multithreads. Fortunately, compression will be disabled if it switches to post-copy, so i preferred to keep current behavior and deferred to fix it after this patchset has been merged.
Re: [Qemu-devel] [PATCH v2 3/8] migration: show the statistics of compression
On 07/23/2018 12:36 PM, Peter Xu wrote: On Thu, Jul 19, 2018 at 08:15:15PM +0800, guangrong.x...@gmail.com wrote: @@ -1597,6 +1608,24 @@ static void migration_update_rates(RAMState *rs, int64_t end_time) rs->xbzrle_cache_miss_prev) / iter_count; rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss; } + +if (migrate_use_compression()) { +uint64_t comp_pages; + +compression_counters.busy_rate = (double)(compression_counters.busy - +rs->compress_thread_busy_prev) / iter_count; Here I'm not sure it's correct... "iter_count" stands for ramstate.iterations. It's increased per ram_find_and_save_block(), so IMHO it might contain multiple guest ram_find_and_save_block() returns if a page is successfully posted and it only posts 1 page out at one time. pages. However compression_counters.busy should be per guest page. Actually, it's derived from xbzrle_counters.cache_miss_rate: xbzrle_counters.cache_miss_rate = (double)(xbzrle_counters.cache_miss - rs->xbzrle_cache_miss_prev) / iter_count; +rs->compress_thread_busy_prev = compression_counters.busy; + +comp_pages = compression_counters.pages - rs->compress_pages_prev; +if (comp_pages) { +compression_counters.compression_rate = +(double)(compression_counters.reduced_size - +rs->compress_reduced_size_prev) / +(comp_pages * TARGET_PAGE_SIZE); +rs->compress_pages_prev = compression_counters.pages; +rs->compress_reduced_size_prev = compression_counters.reduced_size; +} +} } static void migration_bitmap_sync(RAMState *rs) @@ -1872,6 +1901,9 @@ static void flush_compressed_data(RAMState *rs) qemu_mutex_lock(_param[idx].mutex); if (!comp_param[idx].quit) { len = qemu_put_qemu_file(rs->f, comp_param[idx].file); +/* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */ +compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8; I would agree with Dave here - why we store the "reduced size" instead of the size of the compressed data (which I think should be len - 8)? 
len - 8 is the size of the data after compression rather than the amount saved by compression; the latter makes it straightforward for the user to see how much improvement applying compression brings. Hmm... but it is not a big deal to me... :) Meanwhile, would a helper be nicer? Like: Yup, that's nicer indeed.
Re: [Qemu-devel] [PATCH v2 1/8] migration: do not wait for free thread
On 07/23/2018 11:25 AM, Peter Xu wrote:
On Thu, Jul 19, 2018 at 08:15:13PM +0800, guangrong.x...@gmail.com wrote:

@@ -3113,6 +3132,8 @@ static Property migration_properties[] = {
     DEFINE_PROP_UINT8("x-compress-threads", MigrationState,
                       parameters.compress_threads,
                       DEFAULT_MIGRATE_COMPRESS_THREAD_COUNT),
+    DEFINE_PROP_BOOL("x-compress-wait-thread", MigrationState,
+                     parameters.compress_wait_thread, false),

This performance feature bit makes sense to me, but I would still think it should be true by default to keep the old behavior:

- it might change the behavior drastically: we might be in a state between "normal" migration and "compressed" migration since we'll contain both of the pages. Old compression users might not expect that.

- it might still even perform worse - an extreme case is that when network bandwidth is very very limited but instead we have plenty of CPU resources. [1]

So it's really a good tunable for me when people really need to understand what it is before turning it on.

That looks good to me.

     DEFINE_PROP_UINT8("x-decompress-threads", MigrationState,
                       parameters.decompress_threads,
                       DEFAULT_MIGRATE_DECOMPRESS_THREAD_COUNT),

diff --git a/migration/migration.h b/migration/migration.h
index 64a7b33735..a46b9e6c8d 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -271,6 +271,7 @@ bool migrate_use_return_path(void);
 bool migrate_use_compression(void);
 int migrate_compress_level(void);
 int migrate_compress_threads(void);
+int migrate_compress_wait_thread(void);
 int migrate_decompress_threads(void);
 bool migrate_use_events(void);
 bool migrate_postcopy_blocktime(void);

diff --git a/migration/ram.c b/migration/ram.c
index 52dd678092..0ad234c692 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1889,30 +1889,34 @@ static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
                                            ram_addr_t offset)
 {
     int idx, thread_count, bytes_xmit = -1, pages = -1;
+    bool wait = migrate_compress_wait_thread();

     thread_count = migrate_compress_threads();
     qemu_mutex_lock(&comp_done_lock);
-    while (true) {
-        for (idx = 0; idx < thread_count; idx++) {
-            if (comp_param[idx].done) {
-                comp_param[idx].done = false;
-                bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
-                qemu_mutex_lock(&comp_param[idx].mutex);
-                set_compress_params(&comp_param[idx], block, offset);
-                qemu_cond_signal(&comp_param[idx].cond);
-                qemu_mutex_unlock(&comp_param[idx].mutex);
-                pages = 1;
-                ram_counters.normal++;
-                ram_counters.transferred += bytes_xmit;
-                break;
-            }
-        }
-        if (pages > 0) {
+retry:
+    for (idx = 0; idx < thread_count; idx++) {
+        if (comp_param[idx].done) {
+            comp_param[idx].done = false;
+            bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
+            qemu_mutex_lock(&comp_param[idx].mutex);
+            set_compress_params(&comp_param[idx], block, offset);
+            qemu_cond_signal(&comp_param[idx].cond);
+            qemu_mutex_unlock(&comp_param[idx].mutex);
+            pages = 1;
+            ram_counters.normal++;
+            ram_counters.transferred += bytes_xmit;
             break;
-        } else {
-            qemu_cond_wait(&comp_done_cond, &comp_done_lock);
         }
     }

+    /*
+     * if there is no thread is free to compress the data and the user
+     * really expects the slowdown, wait it.

Considering [1] above, IMHO it might not really be a slowdown but it depends. Maybe only mention the fact that we're sending a normal page instead of the compressed page if wait is not specified.

Okay, will update the comments based on your suggestion.

+     */
+    if (pages < 0 && wait) {
+        qemu_cond_wait(&comp_done_cond, &comp_done_lock);
+        goto retry;
+    }
     qemu_mutex_unlock(&comp_done_lock);

     return pages;
@@ -2226,7 +2230,10 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
      * CPU resource.
      */
     if (block == rs->last_sent_block && save_page_use_compression(rs)) {
-        return compress_page_with_multi_thread(rs, block, offset);
+        res = compress_page_with_multi_thread(rs, block, offset);
+        if (res > 0) {
+            return res;
+        }
     } else if (migrate_use_multifd()) {
         return ram_save_multifd_page(rs, block, offset);
     }

diff --git a/qapi/migration.json b/qapi/migration.json
index 186e8a7303..b4f394844b 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -462,6 +462,11 @@
 # @compress-threads: Set compression thread count to be used in live migration,
 #                    the compression thread count is an integer between 1 and
Re: [Qemu-devel] [PATCH 06/12] migration: do not detect zero page for compression
On 07/23/2018 12:05 AM, Michael S. Tsirkin wrote: On Wed, Jul 18, 2018 at 04:46:21PM +0800, Xiao Guangrong wrote: On 07/17/2018 02:58 AM, Dr. David Alan Gilbert wrote: * Xiao Guangrong (guangrong.x...@gmail.com) wrote: On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote: * Xiao Guangrong (guangrong.x...@gmail.com) wrote: Hi Peter, Sorry for the delay as i was busy on other things. On 06/19/2018 03:30 PM, Peter Xu wrote: On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong Detecting zero page is not a light work, we can disable it for compression that can handle all zero data very well Is there any number shows how the compression algo performs better than the zero-detect algo? Asked since AFAIU buffer_is_zero() might be fast, depending on how init_accel() is done in util/bufferiszero.c. This is the comparison between zero-detection and compression (the target buffer is all zero bit): Zero 810 ns Compression: 26905 ns. Zero 417 ns Compression: 8022 ns. Zero 408 ns Compression: 7189 ns. Zero 400 ns Compression: 7255 ns. Zero 412 ns Compression: 7016 ns. Zero 411 ns Compression: 7035 ns. Zero 413 ns Compression: 6994 ns. Zero 399 ns Compression: 7024 ns. Zero 416 ns Compression: 7053 ns. Zero 405 ns Compression: 7041 ns. Indeed, zero-detection is faster than compression. However during our profiling for the live_migration thread (after reverted this patch), we noticed zero-detection cost lots of CPU: 12.01% kqemu qemu-system-x86_64[.] buffer_zero_sse2 ◆ Interesting; what host are you running on? Some hosts have support for the faster buffer_zero_ss4/avx2 The host is: model name : Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz ... 
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke I checked and noticed "CONFIG_AVX2_OPT" has not been enabled, maybe is due to too old glib/gcc version: gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC) glibc.x86_64 2.12 Yes, that's pretty old (RHEL6 ?) - I think you should get AVX2 in RHEL7. Er, it is not easy to update glibc in the production env :( But neither is QEMU updated in production all that easily. While we do want to support older hosts functionally, it does not make much sense to devel complex optimizations that only benefit older hosts. Can not agree with you more. :) So i benchmarked in on the production with newer distribution installed. Here is the data: 27.48% kqemu [kernel.kallsyms] [k] copy_user_enhanced_fast_string 12.63% kqemu [kernel.kallsyms] [k] copy_page_rep 10.82% kqemu qemu-system-x86_64[.] buffer_zero_avx2 5.69% kqemu [kernel.kallsyms] [k] native_queued_spin_lock_slowpath 4.61% kqemu qemu-system-x86_64[.] threads_submit_request_prepare 4.39% kqemu qemu-system-x86_64[.] qemu_event_set 4.12% kqemu qemu-system-x86_64[.] ram_find_and_save_block.part.24 3.61% kqemu [kernel.kallsyms] [k] tcp_sendmsg 2.62% kqemu libc-2.17.so [.] 
__memcpy_ssse3_back 1.89% kqemu qemu-system-x86_64[.] qemu_put_qemu_file 1.32% kqemu qemu-system-x86_64[.] compress_thread_data_done It does not help...
Re: [Qemu-devel] [PATCH 07/12] migration: hold the lock only if it is really needed
On 07/12/2018 04:26 PM, Peter Xu wrote: On Thu, Jul 12, 2018 at 03:47:57PM +0800, Xiao Guangrong wrote: On 07/11/2018 04:21 PM, Peter Xu wrote: On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote: On 06/19/2018 03:36 PM, Peter Xu wrote: On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong Try to hold src_page_req_mutex only if the queue is not empty Pure question: how much this patch would help? Basically if you are running compression tests then I think it means you are with precopy (since postcopy cannot work with compression yet), then here the lock has no contention at all. Yes, you are right, however we can observe it is in the top functions (after revert this patch): Samples: 29K of event 'cycles', Event count (approx.): 22263412260 + 7.99% kqemu qemu-system-x86_64 [.] ram_bytes_total + 6.95% kqemu [kernel.kallsyms][k] copy_user_enhanced_fast_string + 6.23% kqemu qemu-system-x86_64 [.] qemu_put_qemu_file + 6.20% kqemu qemu-system-x86_64 [.] qemu_event_set + 5.80% kqemu qemu-system-x86_64 [.] __ring_put + 4.82% kqemu qemu-system-x86_64 [.] compress_thread_data_done + 4.11% kqemu qemu-system-x86_64 [.] ring_is_full + 3.07% kqemu qemu-system-x86_64 [.] threads_submit_request_prepare + 2.83% kqemu qemu-system-x86_64 [.] ring_mp_get + 2.71% kqemu qemu-system-x86_64 [.] __ring_is_full + 2.46% kqemu qemu-system-x86_64 [.] buffer_zero_sse2 + 2.40% kqemu qemu-system-x86_64 [.] add_to_iovec + 2.21% kqemu qemu-system-x86_64 [.] ring_get + 1.96% kqemu [kernel.kallsyms][k] __lock_acquire + 1.90% kqemu libc-2.12.so [.] memcpy + 1.55% kqemu qemu-system-x86_64 [.] ring_len + 1.12% kqemu libpthread-2.12.so [.] pthread_mutex_unlock + 1.11% kqemu qemu-system-x86_64 [.] ram_find_and_save_block + 1.07% kqemu qemu-system-x86_64 [.] ram_save_host_page + 1.04% kqemu qemu-system-x86_64 [.] qemu_put_buffer + 0.97% kqemu qemu-system-x86_64 [.] compress_page_with_multi_thread + 0.96% kqemu qemu-system-x86_64 [.] 
ram_save_target_page + 0.93% kqemu libpthread-2.12.so [.] pthread_mutex_lock (sorry to respond late; I was busy with other stuff for the release...) You're welcome. :) I am trying to find out anything related to unqueue_page() but I failed. Did I miss anything obvious there? unqueue_page() was not listed here indeed, i think the function itself is light enough (a check then directly return) so it did not leave a trace here. This perf data was got after reverting this patch, i.e, it's based on the lockless multithread model, then unqueue_page() is the only place using mutex in the main thread. And you can see the overload of mutext was gone after applying this patch in the mail i replied to Dave. I see. It's not a big portion of CPU resource, though of course I don't have reason to object to this change as well. Actually what interested me more is why ram_bytes_total() is such a hot spot. AFAIU it's only called in ram_find_and_save_block() per call, and it should be mostly a constant if we don't plug/unplug memories. Not sure that means that's a better spot to work on. I noticed it too. That could be another work we will work on. :)
Re: [Qemu-devel] [PATCH 05/12] migration: show the statistics of compression
On 07/17/2018 03:01 AM, Dr. David Alan Gilbert wrote:
* Xiao Guangrong (guangrong.x...@gmail.com) wrote:
On 06/14/2018 12:25 AM, Dr. David Alan Gilbert wrote:

 }
 static void migration_bitmap_sync(RAMState *rs)
@@ -1412,6 +1441,9 @@ static void flush_compressed_data(RAMState *rs)
         qemu_mutex_lock(&comp_param[idx].mutex);
         if (!comp_param[idx].quit) {
             len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
+            /* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
+            compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8;

I think I'd rather save just len+8 rather than the subtraction.

Hmm, is this what you want?

    compression_counters.reduced_size += len - 8;

Then calculate the real reduced size in populate_ram_info() where we return this info to the user:

    info->compression->reduced_size = compression_counters.pages * PAGE_SIZE - compression_counters.reduced_size;

Right?

I mean I'd rather see the actual size presented to the user rather than the saving compared to uncompressed.

These statistics are used to help people to see whether compression works efficiently or not, so maybe reduced-size is more straightforward? :)
Re: [Qemu-devel] [PATCH 06/12] migration: do not detect zero page for compression
On 07/17/2018 02:58 AM, Dr. David Alan Gilbert wrote: * Xiao Guangrong (guangrong.x...@gmail.com) wrote: On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote: * Xiao Guangrong (guangrong.x...@gmail.com) wrote: Hi Peter, Sorry for the delay as i was busy on other things. On 06/19/2018 03:30 PM, Peter Xu wrote: On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong Detecting zero page is not a light work, we can disable it for compression that can handle all zero data very well Is there any number shows how the compression algo performs better than the zero-detect algo? Asked since AFAIU buffer_is_zero() might be fast, depending on how init_accel() is done in util/bufferiszero.c. This is the comparison between zero-detection and compression (the target buffer is all zero bit): Zero 810 ns Compression: 26905 ns. Zero 417 ns Compression: 8022 ns. Zero 408 ns Compression: 7189 ns. Zero 400 ns Compression: 7255 ns. Zero 412 ns Compression: 7016 ns. Zero 411 ns Compression: 7035 ns. Zero 413 ns Compression: 6994 ns. Zero 399 ns Compression: 7024 ns. Zero 416 ns Compression: 7053 ns. Zero 405 ns Compression: 7041 ns. Indeed, zero-detection is faster than compression. However during our profiling for the live_migration thread (after reverted this patch), we noticed zero-detection cost lots of CPU: 12.01% kqemu qemu-system-x86_64[.] buffer_zero_sse2 ◆ Interesting; what host are you running on? Some hosts have support for the faster buffer_zero_ss4/avx2 The host is: model name : Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz ... 
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke I checked and noticed "CONFIG_AVX2_OPT" has not been enabled, maybe is due to too old glib/gcc version: gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC) glibc.x86_64 2.12 Yes, that's pretty old (RHEL6 ?) - I think you should get AVX2 in RHEL7. Er, it is not easy to update glibc in the production env :(
Re: [Qemu-devel] [PATCH 08/12] migration: do not flush_compressed_data at the end of each iteration
On 07/14/2018 02:01 AM, Dr. David Alan Gilbert wrote:
* guangrong.x...@gmail.com (guangrong.x...@gmail.com) wrote:
From: Xiao Guangrong

flush_compressed_data() needs to wait for all compression threads to finish their work; after that, all threads are free until the migration feeds new requests to them. Reducing its calls can improve the throughput and use CPU resources more effectively.

We do not need to flush all threads at the end of each iteration; the data can be kept locally until the memory block is changed.

Signed-off-by: Xiao Guangrong
---
 migration/ram.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index f9a8646520..0a38c1c61e 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1994,6 +1994,7 @@ static void ram_save_cleanup(void *opaque)
     }
     xbzrle_cleanup();
+    flush_compressed_data(*rsp);
     compress_threads_save_cleanup();
     ram_state_cleanup(rsp);
 }

I'm not sure why this change corresponds to the other removal. We should already have sent all remaining data in ram_save_complete()'s call to flush_compressed_data - so what is this one for?

This is for the error condition: if any error occurred during live migration, there is no chance to call ram_save_complete. After switching to the lockless multithreads model, we assert that all requests have been handled before destroying the worker threads.

@@ -2690,7 +2691,6 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
         }
         i++;
     }
-    flush_compressed_data(rs);
     rcu_read_unlock();

Hmm - are we sure there are no other cases that depend on the ordering of all of the compressed data being sent out between threads?

Err, I tried to think it over carefully, however, I still missed the case you mentioned. :( Anyway, doing flush_compressed_data() every 50ms hurts us too much.
I think the one I'd most worry about is the case where:

  iteration one: thread 1: Save compressed page 'n'
  iteration two: thread 2: Save compressed page 'n'

What guarantees that the version of page 'n' from thread 2 reaches the destination first without this flush?

Hmm... you are right, I missed this case. So how about avoiding it by doing this check at the beginning of ram_save_iterate():

    if (ram_counters.dirty_sync_count != rs.dirty_sync_count) {
        flush_compressed_data(*rsp);
        rs.dirty_sync_count = ram_counters.dirty_sync_count;
    }
Re: [Qemu-devel] [PATCH 10/12] migration: introduce lockless multithreads model
On 07/14/2018 12:24 AM, Dr. David Alan Gilbert wrote:

+static void *thread_run(void *opaque)
+{
+    ThreadLocal *self_data = (ThreadLocal *)opaque;
+    Threads *threads = self_data->threads;
+    void (*handler)(ThreadRequest *data) = threads->thread_request_handler;
+    ThreadRequest *request;
+    int count, ret;
+
+    for ( ; !atomic_read(&self_data->quit); ) {
+        qemu_event_reset(&self_data->ev);
+
+        count = 0;
+        while ((request = ring_get(self_data->request_ring)) ||
+               count < BUSY_WAIT_COUNT) {
+            /*
+             * wait some while before go to sleep so that the user
+             * needn't go to kernel space to wake up the consumer
+             * threads.
+             *
+             * That will waste some CPU resource indeed however it
+             * can significantly improve the case that the request
+             * will be available soon.
+             */
+            if (!request) {
+                cpu_relax();
+                count++;
+                continue;
+            }
+            count = 0;

Things like busywait counts probably need isolating somewhere; getting those counts right is quite hard.

Okay, I will make it a separate function.
Re: [Qemu-devel] [PATCH 07/12] migration: hold the lock only if it is really needed
On 07/11/2018 04:21 PM, Peter Xu wrote: On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote: On 06/19/2018 03:36 PM, Peter Xu wrote: On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong Try to hold src_page_req_mutex only if the queue is not empty Pure question: how much this patch would help? Basically if you are running compression tests then I think it means you are with precopy (since postcopy cannot work with compression yet), then here the lock has no contention at all. Yes, you are right, however we can observe it is in the top functions (after revert this patch): Samples: 29K of event 'cycles', Event count (approx.): 22263412260 + 7.99% kqemu qemu-system-x86_64 [.] ram_bytes_total + 6.95% kqemu [kernel.kallsyms][k] copy_user_enhanced_fast_string + 6.23% kqemu qemu-system-x86_64 [.] qemu_put_qemu_file + 6.20% kqemu qemu-system-x86_64 [.] qemu_event_set + 5.80% kqemu qemu-system-x86_64 [.] __ring_put + 4.82% kqemu qemu-system-x86_64 [.] compress_thread_data_done + 4.11% kqemu qemu-system-x86_64 [.] ring_is_full + 3.07% kqemu qemu-system-x86_64 [.] threads_submit_request_prepare + 2.83% kqemu qemu-system-x86_64 [.] ring_mp_get + 2.71% kqemu qemu-system-x86_64 [.] __ring_is_full + 2.46% kqemu qemu-system-x86_64 [.] buffer_zero_sse2 + 2.40% kqemu qemu-system-x86_64 [.] add_to_iovec + 2.21% kqemu qemu-system-x86_64 [.] ring_get + 1.96% kqemu [kernel.kallsyms][k] __lock_acquire + 1.90% kqemu libc-2.12.so [.] memcpy + 1.55% kqemu qemu-system-x86_64 [.] ring_len + 1.12% kqemu libpthread-2.12.so [.] pthread_mutex_unlock + 1.11% kqemu qemu-system-x86_64 [.] ram_find_and_save_block + 1.07% kqemu qemu-system-x86_64 [.] ram_save_host_page + 1.04% kqemu qemu-system-x86_64 [.] qemu_put_buffer + 0.97% kqemu qemu-system-x86_64 [.] compress_page_with_multi_thread + 0.96% kqemu qemu-system-x86_64 [.] ram_save_target_page + 0.93% kqemu libpthread-2.12.so [.] 
pthread_mutex_lock (sorry to respond late; I was busy with other stuff for the release...) You're welcome. :) I am trying to find out anything related to unqueue_page() but I failed. Did I miss anything obvious there? unqueue_page() was not listed here indeed, i think the function itself is light enough (a check then directly return) so it did not leave a trace here. This perf data was got after reverting this patch, i.e, it's based on the lockless multithread model, then unqueue_page() is the only place using mutex in the main thread. And you can see the overload of mutext was gone after applying this patch in the mail i replied to Dave.
Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer
On 06/29/2018 09:08 PM, Michael S. Tsirkin wrote: On Fri, Jun 29, 2018 at 03:30:44PM +0800, Xiao Guangrong wrote: Hi Michael, On 06/20/2018 08:38 PM, Michael S. Tsirkin wrote: On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong (1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h (2) http://dpdk.org/doc/api/rte__ring_8h.html Signed-off-by: Xiao Guangrong So instead of all this super-optimized trickiness, how about a simple port of ptr_ring from linux? That one isn't lockless but it's known to outperform most others for a single producer/single consumer case. And with a ton of networking going on, who said it's such a hot spot? OTOH this implementation has more barriers which slows down each individual thread. It's also a source of bugs. Thank you for pointing it out. I just quickly went through the code of ptr_ring that is very nice and really impressive. I will consider to port it to QEMU. The port is pretty trivial. See below. It's a SPSC structure though. So you need to use it with lock. Given the critical section is small, I Why put these locks into this common struct? For our case, each thread has its own ring which is SCSP, no lock is needed at all. Atomic operations still slow things down, see [PATCH 07/12] migration: hold the lock only if it is really needed. I'd move the inner locks to the user instead.
Re: [Qemu-devel] [PATCH 07/12] migration: hold the lock only if it is really needed
On 06/29/2018 07:22 PM, Dr. David Alan Gilbert wrote: * Xiao Guangrong (guangrong.x...@gmail.com) wrote: On 06/19/2018 03:36 PM, Peter Xu wrote: On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong Try to hold src_page_req_mutex only if the queue is not empty Pure question: how much this patch would help? Basically if you are running compression tests then I think it means you are with precopy (since postcopy cannot work with compression yet), then here the lock has no contention at all. Yes, you are right, however we can observe it is in the top functions (after revert this patch): Can you show the matching trace with the patch in? Sure, there is: + 8.38% kqemu [kernel.kallsyms][k] copy_user_enhanced_fast_string + 8.03% kqemu qemu-system-x86_64 [.] ram_bytes_total + 6.62% kqemu qemu-system-x86_64 [.] qemu_event_set + 6.02% kqemu qemu-system-x86_64 [.] qemu_put_qemu_file + 5.81% kqemu qemu-system-x86_64 [.] __ring_put + 5.04% kqemu qemu-system-x86_64 [.] compress_thread_data_done + 4.48% kqemu qemu-system-x86_64 [.] ring_is_full + 4.44% kqemu qemu-system-x86_64 [.] ring_mp_get + 3.39% kqemu qemu-system-x86_64 [.] __ring_is_full + 2.61% kqemu qemu-system-x86_64 [.] add_to_iovec + 2.48% kqemu qemu-system-x86_64 [.] threads_submit_request_prepare + 2.08% kqemu libc-2.12.so [.] memcpy + 2.07% kqemu qemu-system-x86_64 [.] ring_len + 1.91% kqemu [kernel.kallsyms][k] __lock_acquire + 1.60% kqemu qemu-system-x86_64 [.] buffer_zero_sse2 + 1.16% kqemu qemu-system-x86_64 [.] ram_find_and_save_block + 1.14% kqemu qemu-system-x86_64 [.] ram_save_target_page + 1.12% kqemu qemu-system-x86_64 [.] compress_page_with_multi_thread + 1.09% kqemu qemu-system-x86_64 [.] ram_save_host_page + 1.07% kqemu qemu-system-x86_64 [.] test_and_clear_bit + 1.07% kqemu qemu-system-x86_64 [.] qemu_put_buffer + 1.03% kqemu qemu-system-x86_64 [.] qemu_put_byte + 0.80% kqemu qemu-system-x86_64 [.] 
threads_submit_request_commit + 0.74% kqemu qemu-system-x86_64 [.] migration_bitmap_clear_dirty + 0.70% kqemu qemu-system-x86_64 [.] control_save_page + 0.69% kqemu qemu-system-x86_64 [.] test_bit + 0.69% kqemu qemu-system-x86_64 [.] ram_save_iterate + 0.63% kqemu qemu-system-x86_64 [.] migration_bitmap_find_dirty + 0.63% kqemu qemu-system-x86_64 [.] ram_control_save_page + 0.62% kqemu qemu-system-x86_64 [.] rcu_read_lock + 0.56% kqemu qemu-system-x86_64 [.] qemu_file_get_error + 0.55% kqemu [kernel.kallsyms][k] lock_acquire + 0.55% kqemu qemu-system-x86_64 [.] find_dirty_block + 0.54% kqemu qemu-system-x86_64 [.] ring_index + 0.53% kqemu qemu-system-x86_64 [.] ring_put + 0.51% kqemu qemu-system-x86_64 [.] unqueue_page + 0.50% kqemu qemu-system-x86_64 [.] migrate_use_compression + 0.48% kqemu qemu-system-x86_64 [.] get_queued_page + 0.46% kqemu qemu-system-x86_64 [.] ring_get + 0.46% kqemu [i40e] [k] i40e_clean_tx_irq + 0.45% kqemu [kernel.kallsyms][k] lock_release + 0.44% kqemu [kernel.kallsyms][k] native_sched_clock + 0.38% kqemu qemu-system-x86_64 [.] migrate_get_current + 0.38% kqemu [kernel.kallsyms][k] find_held_lock + 0.34% kqemu [kernel.kallsyms][k] __lock_release + 0.34% kqemu qemu-system-x86_64 [.] qemu_ram_pagesize + 0.29% kqemu [kernel.kallsyms][k] lock_is_held_type + 0.27% kqemu [kernel.kallsyms][k] update_load_avg + 0.27% kqemu qemu-system-x86_64 [.] save_page_use_compression + 0.24% kqemu qemu-system-x86_64 [.] qemu_file_rate_limit + 0.23% kqemu [kernel.kallsyms][k] tcp_sendmsg + 0.23% kqemu [kernel.kallsyms][k] match_held_lock + 0.22% kqemu [kernel.kallsyms][k] do_raw_spin_trylock + 0.22% kqemu [kernel.kallsyms][k] cyc2ns_read_begin
Re: [Qemu-devel] [PATCH 06/12] migration: do not detect zero page for compression
On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote: * Xiao Guangrong (guangrong.x...@gmail.com) wrote: Hi Peter, Sorry for the delay as i was busy on other things. On 06/19/2018 03:30 PM, Peter Xu wrote: On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong Detecting zero page is not a light work, we can disable it for compression that can handle all zero data very well Is there any number shows how the compression algo performs better than the zero-detect algo? Asked since AFAIU buffer_is_zero() might be fast, depending on how init_accel() is done in util/bufferiszero.c. This is the comparison between zero-detection and compression (the target buffer is all zero bit): Zero 810 ns Compression: 26905 ns. Zero 417 ns Compression: 8022 ns. Zero 408 ns Compression: 7189 ns. Zero 400 ns Compression: 7255 ns. Zero 412 ns Compression: 7016 ns. Zero 411 ns Compression: 7035 ns. Zero 413 ns Compression: 6994 ns. Zero 399 ns Compression: 7024 ns. Zero 416 ns Compression: 7053 ns. Zero 405 ns Compression: 7041 ns. Indeed, zero-detection is faster than compression. However during our profiling for the live_migration thread (after reverted this patch), we noticed zero-detection cost lots of CPU: 12.01% kqemu qemu-system-x86_64[.] buffer_zero_sse2 ◆ Interesting; what host are you running on? Some hosts have support for the faster buffer_zero_ss4/avx2 The host is: model name : Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz ... 
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke I checked and noticed "CONFIG_AVX2_OPT" has not been enabled, maybe is due to too old glib/gcc version: gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC) glibc.x86_64 2.12 7.60% kqemu qemu-system-x86_64[.] ram_bytes_total ▒ 6.56% kqemu qemu-system-x86_64[.] qemu_event_set ▒ 5.61% kqemu qemu-system-x86_64[.] qemu_put_qemu_file ▒ 5.00% kqemu qemu-system-x86_64[.] __ring_put ▒ 4.89% kqemu [kernel.kallsyms] [k] copy_user_enhanced_fast_string ▒ 4.71% kqemu qemu-system-x86_64[.] compress_thread_data_done ▒ 3.63% kqemu qemu-system-x86_64[.] ring_is_full ▒ 2.89% kqemu qemu-system-x86_64[.] __ring_is_full ▒ 2.68% kqemu qemu-system-x86_64[.] threads_submit_request_prepare ▒ 2.60% kqemu q
Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer
On 06/29/2018 02:15 PM, Jason Wang wrote: On 2018年06月29日 11:59, Xiao Guangrong wrote: On 06/28/2018 09:36 PM, Jason Wang wrote: On 2018年06月04日 17:55, guangrong.x...@gmail.com wrote: From: Xiao Guangrong It's the simple lockless ring buffer implement which supports both single producer vs. single consumer and multiple producers vs. single consumer. Finally, it fetches the valid data out, set the entry to the initialized state and update ring->out to make the entry be usable to the producer: data = *entry; *entry = NULL; ring->out++; Memory barrier is omitted here, please refer to the comment in the code. (1)https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h (2)http://dpdk.org/doc/api/rte__ring_8h.html Signed-off-by: Xiao Guangrong --- May I ask why you need a MPSC ring here? Can we just use N SPSC ring for submitting pages and another N SPSC ring for passing back results? Sure. We had this option in our mind, however, it is not scalable which will slow the main thread down, instead, we'd rather to speed up main thread and move reasonable workload to the threads. I'm not quite understand the scalability issue here. Is it because of main thread need go through all N rings (which I think not)? Yes, it is. The main thread need to check each single thread and wait it done one by one...
Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer
On 06/29/2018 12:23 PM, Michael S. Tsirkin wrote:
On Thu, Jun 28, 2018 at 09:36:00PM +0800, Jason Wang wrote:
On 2018-06-04 17:55, guangrong.x...@gmail.com wrote:
From: Xiao Guangrong

Memory barrier is omitted here, please refer to the comment in the code.

(1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
(2) http://dpdk.org/doc/api/rte__ring_8h.html

Signed-off-by: Xiao Guangrong
---

May I ask why you need an MPSC ring here? Can we just use N SPSC rings for submitting pages and another N SPSC rings for passing back results?

Thanks

Or just an SPSC ring + a lock. How big of a gain is lockless access to a trivial structure like the ring?

Okay, I will give it a try. BTW, we tried a global ring + lock for input versus a lockless ring for input; the former did not show better performance. But we haven't tried a global ring + lock for output yet.
Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer
Hi Michael, On 06/20/2018 08:38 PM, Michael S. Tsirkin wrote: On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong (1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h (2) http://dpdk.org/doc/api/rte__ring_8h.html Signed-off-by: Xiao Guangrong So instead of all this super-optimized trickiness, how about a simple port of ptr_ring from linux? That one isn't lockless but it's known to outperform most others for a single producer/single consumer case. And with a ton of networking going on, who said it's such a hot spot? OTOH this implementation has more barriers which slows down each individual thread. It's also a source of bugs. Thank you for pointing it out. I just quickly went through the code of ptr_ring that is very nice and really impressive. I will consider to port it to QEMU. Further, atomic tricks this one uses are not fair so some threads can get completely starved while others make progress. There's also no chance to mix aggressive polling and sleeping with this kind of scheme, so the starved thread will consume lots of CPU. So I'd like to see a simple ring used, and then a patch on top switching to this tricky one with performance comparison along with that. I agree with you, i will make a version that uses a lock for multiple producers and doing incremental optimizations based on it. --- migration/ring.h | 265 +++ 1 file changed, 265 insertions(+) create mode 100644 migration/ring.h diff --git a/migration/ring.h b/migration/ring.h new file mode 100644 index 00..da9b8bdcbb --- /dev/null +++ b/migration/ring.h @@ -0,0 +1,265 @@ +/* + * Ring Buffer + * + * Multiple producers and single consumer are supported with lock free. + * + * Copyright (c) 2018 Tencent Inc + * + * Authors: + * Xiao Guangrong + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. 
+ */
+
+#ifndef _RING__
+#define _RING__

Prefix Ring is too short.

Okay, will improve it.

+    atomic_set(&ring->data[index], NULL);
+
+    /*
+     * (B) smp_mb() is needed as we should read the entry out before
+     * updating ring->out as we did in __ring_get().
+     *
+     * (A) smp_wmb() is needed as we should make the entry be NULL before
+     * updating ring->out (which will make the entry be visible and usable).
+     */

I can't say I understand this all. And the interaction of acquire/release semantics with smp_* barriers is even scarier.

Hmm... the parallel accesses for these two indexes and the data stored in the ring are subtle indeed. :(

+    atomic_store_release(&ring->out, ring->out + 1);
+
+    return data;
+}
+
+static inline int ring_put(Ring *ring, void *data)
+{
+    if (ring->flags & RING_MULTI_PRODUCER) {
+        return ring_mp_put(ring, data);
+    }
+    return __ring_put(ring, data);
+}
+
+static inline void *ring_get(Ring *ring)
+{
+    if (ring->flags & RING_MULTI_PRODUCER) {
+        return ring_mp_get(ring);
+    }
+    return __ring_get(ring);
+}
+#endif

A bunch of tricky barriers, retries, etc. all over the place. This sorely needs *a lot of* unit tests. Where are they?

I used the code attached in this mail to test & benchmark the patches during my development; it is not dedicated to Ring, instead it is based on the framework of compression. Yes, test cases are useful and really needed, I will do it...
:)

#include "qemu/osdep.h"
#include "libqtest.h"
#include
#include "qemu/osdep.h"
#include
#include "qemu/cutils.h"
#include "qemu/bitops.h"
#include "qemu/bitmap.h"
#include "qemu/main-loop.h"
#include "migration/ram.h"
#include "migration/migration.h"
#include "migration/register.h"
#include "migration/misc.h"
#include "migration/page_cache.h"
#include "qemu/error-report.h"
#include "qapi/error.h"
#include "qapi/qapi-events-migration.h"
#include "qapi/qmp/qerror.h"
#include "trace.h"
//#include "exec/ram_addr.h"
#include "exec/target_page.h"
#include "qemu/rcu_queue.h"
#include "migration/colo.h"
#include "migration/block.h"
#include "migration/threads.h"
#include "migration/qemu-file.h"
#include "migration/threads.h"

CompressionStats compression_counters;

#define PAGE_SIZE 4096
#define PAGE_MASK ~(PAGE_SIZE - 1)

static ssize_t test_writev_buffer(void *opaque, struct iovec *iov,
                                  int iovcnt, int64_t pos)
{
    int i, size = 0;

    for (i = 0; i < iovcnt; i++) {
        size += iov[i].iov_len;
    }
    return size;
}

static int test_fclose(void *opaque)
{
    return 0;
}

static const QEMUFileOps test_write_ops = {
Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer
On 06/28/2018 09:36 PM, Jason Wang wrote:

On 06/04/2018 17:55, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong

It's a simple lockless ring buffer implementation which supports both single producer vs. single consumer and multiple producers vs. single consumer.

Finally, it fetches the valid data out, sets the entry to the initialized state and updates ring->out to make the entry usable to the producer:

    data = *entry;
    *entry = NULL;
    ring->out++;

The memory barrier is omitted here; please refer to the comment in the code.

(1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
(2) http://dpdk.org/doc/api/rte__ring_8h.html

Signed-off-by: Xiao Guangrong
---

May I ask why you need an MPSC ring here? Can we just use N SPSC rings for submitting pages and another N SPSC rings for passing back results?

Sure. We had this option in our mind; however, it is not scalable and will slow the main thread down. Instead, we'd rather speed up the main thread and move reasonable workload to the worker threads.
Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer
On 06/28/2018 07:55 PM, Wei Wang wrote:

On 06/28/2018 06:02 PM, Xiao Guangrong wrote:

CC: Paul, Peter Zijlstra, Stefani, Lai who are all good at memory barriers.

On 06/20/2018 12:52 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong

It's a simple lockless ring buffer implementation which supports both single producer vs. single consumer and multiple producers vs. single consumer. Many lessons were learned from Linux Kernel's kfifo (1) and DPDK's rte_ring (2) before I wrote this implementation. It corrects some bugs of memory barriers in kfifo and it is the simpler lockless version of rte_ring as currently multiple access is only allowed for the producer.

Could you provide some more information about the kfifo bug? Any pointer would be appreciated.

Sure, I reported one of the memory barrier issues to the Linux kernel:
   https://lkml.org/lkml/2018/5/11/58

Actually, besides that, there is another memory barrier issue in kfifo. Please consider this case:

    at the beginning
    ring->size = 4
    ring->out = 0
    ring->in = 4

    Consumer                            Producer
    ---------                           ---------
    index = ring->out; /* index == 0 */
    ring->out++;       /* ring->out == 1 */

    < Re-Order >
                                        out = ring->out;
                                        if (ring->in - out >= ring->mask)
                                            return -EFULL;
                                        /* see the ring is not full */
                                        index = ring->in & ring->mask; /* index == 0 */
                                        ring->data[index] = new_data;
                                        ring->in++;

    data = ring->data[index];
    !! the old data is lost !!

So we need to make sure:
1) for the consumer, we should read ring->data[] out before updating ring->out
2) for the producer, we should read ring->out before updating ring->data[]

as follows:

    Producer                            Consumer
    ---------                           ---------
    Reading ring->out                   Reading ring->data[index]
    smp_mb()                            smp_mb()
    Setting ring->data[index] = data    ring->out++

[ I used atomic_store_release() and atomic_load_acquire() instead of smp_mb() in the patch. ]

But I am not sure if we can use smp_acquire__after_ctrl_dep() in the producer?
I wonder if this could be solved by simply tweaking the above consumer implementation:

[1] index = ring->out;
[2] data = ring->data[index];
[3] index++;
[4] ring->out = index;

Now [2] and [3] form a WAR dependency, which avoids the reordering.

It cannot. [2] and [4] still do not form any dependency; the CPU and the compiler can elide 'index'.
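The ordering rules discussed in this thread can be sketched with C11 atomics. This is a standalone illustration, not QEMU's atomic.h API (the type name, fixed size, and helpers here are made up for the example): the acquire/release pairs on `in` and `out` give exactly the two guarantees listed above — the consumer's read of the slot completes before `out` is published, and the producer reads `out` before overwriting the slot.

```c
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 4u  /* must be a power of two */

typedef struct {
    _Atomic unsigned int in;   /* written only by the producer */
    _Atomic unsigned int out;  /* written only by the consumer */
    void *data[RING_SIZE];
} SpscRing;

/* Producer: acquire-load 'out' so the consumer's earlier read of
 * data[] (which it performed before bumping 'out') cannot appear to
 * happen after we overwrite the slot. */
static int spsc_put(SpscRing *r, void *p)
{
    unsigned int in = atomic_load_explicit(&r->in, memory_order_relaxed);
    unsigned int out = atomic_load_explicit(&r->out, memory_order_acquire);

    if (in - out >= RING_SIZE) {
        return -1; /* full */
    }
    r->data[in & (RING_SIZE - 1)] = p;
    /* Release: publish the slot contents before the new 'in'. */
    atomic_store_explicit(&r->in, in + 1, memory_order_release);
    return 0;
}

/* Consumer: acquire-load 'in' so the slot written before the producer
 * bumped 'in' is visible here. */
static void *spsc_get(SpscRing *r)
{
    unsigned int out = atomic_load_explicit(&r->out, memory_order_relaxed);
    unsigned int in = atomic_load_explicit(&r->in, memory_order_acquire);
    void *p;

    if (in == out) {
        return NULL; /* empty */
    }
    p = r->data[out & (RING_SIZE - 1)];
    /* Release: finish reading data[] before publishing 'out' - the
     * exact ordering the kfifo example above violates. */
    atomic_store_explicit(&r->out, out + 1, memory_order_release);
    return p;
}
```

The release store on `out` paired with the acquire load in `spsc_put()` replaces the explicit `smp_mb()` pair from the table above, mirroring the `atomic_store_release()`/`atomic_load_acquire()` choice mentioned in the mail.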
Re: [Qemu-devel] [PATCH 06/12] migration: do not detect zero page for compression
Hi Daniel,

On 06/28/2018 05:36 PM, Daniel P. Berrangé wrote:

On Thu, Jun 28, 2018 at 05:12:39PM +0800, Xiao Guangrong wrote:

After this patch, the workload is moved to the worker thread; is it acceptable?

It depends on your point of view. If you have spare / idle CPUs on the host, then moving workload to a thread is ok, despite the CPU cost of compression in that thread being much higher than what it replaced, since you won't be taking CPU resources away from other contending workloads. I'd venture to suggest though that we should probably *not* be optimizing for the case of idle CPUs on the host. More realistic is to expect that the host CPUs are near fully committed to work, and thus the (default) goal should be to minimize CPU overhead for the host as a whole. From this POV, zero-page detection is better than compression due to its > 10x better speed. Given the CPU overheads of compression, I think it has fairly narrow use in migration in general when considering hosts are often highly committed on CPU.

I understand your concern; however, it is not that bad.

First, we tolerate the case that a thread runs a little slowly - we do not wait for a thread to become free; instead, we directly post the page out as a normal page (zero detection still works for the normal page), so it at least makes the performance no worse than the case where compression is not used.

Second, we think the reasonable configuration is that the system should have enough CPU resource to run the number of threads the user configured. BTW, the work we will post out will make these parameters configurable at runtime, so a control daemon (e.g., libvirt) can adjust them based on the current load of the system and the statistics from QEMU. This is the topic we submitted to KVM Forum this year; hope it can be accepted.

At last, we have a watermark to trigger live migration on a load-balanced production system; the watermark makes sure there is some free CPU resource left for other work.

Thanks!
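For reference, the zero-page check being weighed against compression here is conceptually simple; QEMU's buffer_is_zero() implements it with SIMD acceleration (the buffer_zero_sse2 symbol seen in the profiles elsewhere in this thread). The following is a portable word-at-a-time sketch of the idea, not QEMU's actual implementation:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Word-at-a-time zero check: align, then test 8 bytes per iteration.
 * QEMU's buffer_is_zero() does the same with SSE/AVX vectors instead
 * of uint64_t words. */
static bool buffer_all_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t w;

    /* Head: byte-wise until 'p' is 8-byte aligned. */
    while (len && ((uintptr_t)p & 7)) {
        if (*p++) {
            return false;
        }
        len--;
    }
    /* Body: one 64-bit comparison covers 8 bytes at a time. */
    while (len >= 8) {
        memcpy(&w, p, 8); /* avoids strict-aliasing issues */
        if (w) {
            return false;
        }
        p += 8;
        len -= 8;
    }
    /* Tail: remaining bytes. */
    while (len) {
        if (*p++) {
            return false;
        }
        len--;
    }
    return true;
}
```

Even this scalar form touches every cache line of the page, which is why the thread moves it off the migration thread's critical path rather than trying to make the check itself cheaper.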
Re: [Qemu-devel] [PATCH 10/12] migration: introduce lockless multithreads model
On 06/20/2018 02:52 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:18PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong

Current implementation of compression and decompression are very hard to be enabled on productions. We noticed that too many wait-wakes go to kernel space and CPU usages are very low even if the system is really free.

Not sure how other people think; for me this information suits better as a cover letter. For a commit message, I would prefer to know about something like: what this thread model can do; how the APIs are designed and used; what the limitations are, etc. After all, until this patch nowhere is using the new model yet, so these numbers are a bit misleading.

Yes, I completely agree with you; I will remove it from the changelog.

Signed-off-by: Xiao Guangrong
---
 migration/Makefile.objs |   1 +
 migration/threads.c     | 265
 migration/threads.h     | 116 +

Again, this model seems to be suitable for scenarios even outside migration. So I'm not sure whether you'd like to generalize it (I still see e.g. constants and comments related to migration, but there aren't many) and put it into util/.

Sure, that's good to me. :)

 3 files changed, 382 insertions(+)
 create mode 100644 migration/threads.c
 create mode 100644 migration/threads.h

diff --git a/migration/Makefile.objs b/migration/Makefile.objs
index c83ec47ba8..bdb61a7983 100644
--- a/migration/Makefile.objs
+++ b/migration/Makefile.objs
@@ -7,6 +7,7 @@ common-obj-y += qemu-file-channel.o
 common-obj-y += xbzrle.o postcopy-ram.o
 common-obj-y += qjson.o
 common-obj-y += block-dirty-bitmap.o
+common-obj-y += threads.o
 common-obj-$(CONFIG_RDMA) += rdma.o

diff --git a/migration/threads.c b/migration/threads.c
new file mode 100644
index 00..eecd3229b7
--- /dev/null
+++ b/migration/threads.c
@@ -0,0 +1,265 @@
+#include "threads.h"
+
+/* retry to see if there is an available request before actually going to wait.
 */
+#define BUSY_WAIT_COUNT 1000
+
+static void *thread_run(void *opaque)
+{
+    ThreadLocal *self_data = (ThreadLocal *)opaque;
+    Threads *threads = self_data->threads;
+    void (*handler)(ThreadRequest *data) = threads->thread_request_handler;
+    ThreadRequest *request;
+    int count, ret;
+
+    for ( ; !atomic_read(&self_data->quit); ) {
+        qemu_event_reset(&self_data->ev);
+
+        count = 0;
+        while ((request = ring_get(self_data->request_ring)) ||
+               count < BUSY_WAIT_COUNT) {
+            /*
+             * wait a while before going to sleep so that the user
+             * needn't go to kernel space to wake up the consumer
+             * threads.
+             *
+             * That will waste some CPU resource indeed, however it
+             * can significantly improve the case that the request
+             * will be available soon.
+             */
+            if (!request) {
+                cpu_relax();
+                count++;
+                continue;
+            }
+            count = 0;
+
+            handler(request);
+
+            do {
+                ret = ring_put(threads->request_done_ring, request);
+                /*
+                 * request_done_ring has enough room to contain all
+                 * requests, however, theoretically, it still can
+                 * fail if the ring's indexes overflow, which would
+                 * happen if there are more than 2^32 requests

Could you elaborate why this ring_put() could fail, and why the failure is somehow related to 2^32 overflow? Firstly, I don't understand why it will fail.

As we explained in the previous mail:

| Without it we can easily observe a "strange" behavior that the thread will
| put the result to the global ring failed even if we allocated enough room
| for the global ring (its capacity >= total requests), that's because
| these two indexes can be updated at any time, consider the case that multiple
| get and put operations can be finished between reading ring->in and ring->out
| so that very possibly ring->in can pass the value read from ring->out.
|
| Having this code, the negative case only happens if these two indexes (32 bits)
| overflow to the same value, which can help us to catch potential bugs in the
| code.

Meanwhile, AFAIU your ring can even live well with that 2^32 overflow.
Or did I misunderstand? Please refer to the code:

+    if (__ring_is_full(ring, in, out)) {
+        if (atomic_read(&ring->in) == in &&
+            atomic_read(&ring->out) == out) {
+            return -ENOBUFS;
+        }

As we allocated enough room for this global ring, the only case where putting data can fail is that the indexes have overflowed to the same values. That implies 2^32 get/put operations happened on the other threads and the main thread while this thread was reading these two indexes.
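The free-running 32-bit indexes discussed here rely on unsigned modular arithmetic: `in - out` gives the correct occupancy even after either counter wraps past 2^32 - 1, and the sanity re-check only misfires if both indexes advance by an exact multiple of 2^32 between the two reads. A minimal demonstration (helper names are hypothetical, not from the patch):

```c
#include <stdint.h>

/* Occupancy of a ring with free-running 32-bit indexes.  Unsigned
 * subtraction is modulo 2^32, so the result stays correct even when
 * 'in' has wrapped around while 'out' has not yet. */
static uint32_t ring_len(uint32_t in, uint32_t out)
{
    return in - out;
}

static int ring_is_full(uint32_t in, uint32_t out, uint32_t size)
{
    return ring_len(in, out) >= size;
}
```

This is why the kfifo-style design never masks the stored indexes: masking happens only when indexing `data[]`, while the raw counters keep the producer/consumer distance well-defined across wraparound.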
Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer
On 06/20/2018 01:55 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.x...@gmail.com wrote:

[...]

(Some more comments/questions for the MP implementation...)

+static inline int ring_mp_put(Ring *ring, void *data)
+{
+    unsigned int index, in, in_next, out;
+
+    do {
+        in = atomic_read(&ring->in);
+        out = atomic_read(&ring->out);

[0] Do we need to fetch "out" with load_acquire()? Otherwise what's the pairing of below store_release() at [1]?

The barrier paired with [1] is the full barrier implied in atomic_cmpxchg().

This barrier exists in the SP-SC case, which makes sense to me. I assume that's also needed for the MP-SC case, am I right?

We needn't put a memory barrier here as we do not need to care about the order between these two indexes (in and out); instead, the memory barrier (for SP-SC as well) is used to order ring->out against updating ring->data[], as we explained in the previous mail.

+
+        if (__ring_is_full(ring, in, out)) {
+            if (atomic_read(&ring->in) == in &&
+                atomic_read(&ring->out) == out) {

Why read again? After all the ring API seems to be designed as non-blocking. E.g., I see the poll at [2] below makes more sense since when it reaches [2] it means that there must be a producer that is _doing_ the queuing, so polling is very likely to complete fast. However here it seems to be a pure busy poll without any hint. Then I'm not sure whether we should just let the caller decide whether it wants to call ring_put() again.

Without it we can easily observe a "strange" behavior that the thread fails to put the result into the global ring even if we allocated enough room for the global ring (its capacity >= total requests). That's because these two indexes can be updated at any time; consider the case that multiple get and put operations are finished between reading ring->in and ring->out, so that very possibly ring->in can pass the value read from ring->out.
Having this code, the negative case only happens if these two indexes (32 bits) overflow to the same values, which can help us to catch potential bugs in the code.

+            return -ENOBUFS;
+        }
+
+        /* an entry has been fetched out, retry. */
+        continue;
+    }
+
+    in_next = in + 1;
+} while (atomic_cmpxchg(&ring->in, in, in_next) != in);
+
+index = ring_index(ring, in);
+
+/*
+ * smp_rmb() paired with the memory barrier of (A) in ring_mp_get()
+ * is implied in atomic_cmpxchg() as we should read ring->out first
+ * before fetching the entry, otherwise this assert will fail.

Thanks for all these comments! These are really helpful for reviewers. However I'm not sure whether I understand it correctly here on the MB of (A) for ring_mp_get() - AFAIU that should correspond to a smp_rmb() at [0] above when reading the "out" variable rather than this assertion, and that's why I thought at [0] we should have something like a load_acquire() there (which contains an rmb()).

Memory barrier (A) in ring_mp_get() makes sure the order between:

    ring->data[index] = NULL;
    smp_wmb();
    ring->out = out + 1;

And the memory barrier at [0] makes sure the order between:

    out = ring->out;
    /* smp_rmb() */
    cmpxchg();
    value = ring->data[index];
    assert(value);

[ note: the assertion and reading ring->out are across cmpxchg(). ]

Did I understand your question correctly?

Content-wise, I think the code here is correct, since atomic_cmpxchg() should imply a full smp_mb() after all, so we don't need any further barriers here.

Yes, it is.
Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer
CC: Paul, Peter Zijlstra, Stefani, Lai who are all good at memory barriers.

On 06/20/2018 12:52 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong

It's a simple lockless ring buffer implementation which supports both single producer vs. single consumer and multiple producers vs. single consumer. Many lessons were learned from Linux Kernel's kfifo (1) and DPDK's rte_ring (2) before I wrote this implementation. It corrects some bugs of memory barriers in kfifo and it is the simpler lockless version of rte_ring as currently multiple access is only allowed for the producer.

Could you provide some more information about the kfifo bug? Any pointer would be appreciated.

Sure, I reported one of the memory barrier issues to the Linux kernel:
   https://lkml.org/lkml/2018/5/11/58

Actually, besides that, there is another memory barrier issue in kfifo. Please consider this case:

    at the beginning
    ring->size = 4
    ring->out = 0
    ring->in = 4

    Consumer                            Producer
    ---------                           ---------
    index = ring->out; /* index == 0 */
    ring->out++;       /* ring->out == 1 */

    < Re-Order >
                                        out = ring->out;
                                        if (ring->in - out >= ring->mask)
                                            return -EFULL;
                                        /* see the ring is not full */
                                        index = ring->in & ring->mask; /* index == 0 */
                                        ring->data[index] = new_data;
                                        ring->in++;

    data = ring->data[index];
    !! the old data is lost !!

So we need to make sure:
1) for the consumer, we should read ring->data[] out before updating ring->out
2) for the producer, we should read ring->out before updating ring->data[]

as follows:

    Producer                            Consumer
    ---------                           ---------
    Reading ring->out                   Reading ring->data[index]
    smp_mb()                            smp_mb()
    Setting ring->data[index] = data    ring->out++

[ I used atomic_store_release() and atomic_load_acquire() instead of smp_mb() in the patch. ]

But I am not sure if we can use smp_acquire__after_ctrl_dep() in the producer?

If it has a single producer vs.
single consumer, it is the traditional FIFO. If it has multiple producers, it uses the following algorithm:

For the producer, it uses two steps to update the ring:

- first step, occupy the entry in the ring:

      retry:
          in = ring->in
          if (cmpxchg(&ring->in, in, in + 1) != in)
              goto retry;

  After that, the entry pointed to by ring->data[in] has been owned by the producer.

      assert(ring->data[in] == NULL);

  Note: no other producer can touch this entry, so the entry should always be in the initialized state.

- second step, write the data to the entry:

      ring->data[in] = data;

For the consumer, it first checks if there is an available entry in the ring and fetches the entry from the ring:

    if (!ring_is_empty(ring))
        entry = &ring->data[ring->out];

Note: ring->out has not been updated, so the entry pointed to by ring->out is completely owned by the consumer.

Then it checks if the data is ready:

    retry:
        if (*entry == NULL)
            goto retry;

That means the producer has updated the index but hasn't written any data to it yet.

Finally, it fetches the valid data out, sets the entry to the initialized state and updates ring->out to make the entry usable to the producer:

    data = *entry;
    *entry = NULL;
    ring->out++;

The memory barriers are omitted here; please refer to the comments in the code.

(1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
(2) http://dpdk.org/doc/api/rte__ring_8h.html

Signed-off-by: Xiao Guangrong
---
 migration/ring.h | 265 +++

If this is a very general implementation, not sure whether we can move this to the util/ directory so that it can be used even outside migration code.

I thought about that too. Currently migration is the only user for it, so I put it near the migration code. It's fine by me to move it to util/ if you prefer.
 1 file changed, 265 insertions(+)
 create mode 100644 migration/ring.h

diff --git a/migration/ring.h b/migration/ring.h
new file mode 100644
index 00..da9b8bdcbb
--- /dev/null
+++ b/migration/ring.h
@@ -0,0 +1,265 @@
+/*
+ * Ring Buffer
+ *
+ * Multiple producers and single consumer are supported with lock free.
+ *
+ * Copyright (c) 2018 Tencent Inc
+ *
+ * Authors:
+ *   Xiao Guangrong
+ *
+ * This work is licensed under the terms of
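The two-step claim-then-fill scheme described in the commit message above can be sketched with C11 atomics. This is a standalone illustration rather than the patch's actual code (QEMU's atomic.h macros, the full-ring retry logic, and the busy-wait hints are simplified away; all names are made up): a producer claims a slot index with a compare-and-swap and then publishes the payload, and the consumer treats a NULL slot as "claimed but not ready yet".

```c
#include <stdatomic.h>
#include <stddef.h>

#define MP_RING_SIZE 4u /* must be a power of two */

typedef struct {
    _Atomic unsigned int in;            /* claimed by producers via CAS */
    _Atomic unsigned int out;           /* advanced by the single consumer */
    void *_Atomic data[MP_RING_SIZE];   /* NULL == slot empty / not ready */
} MpscRing;

/* Step 1: claim a slot with cmpxchg; step 2: fill it.  A slot becomes
 * visible to the consumer only once it is non-NULL. */
static int mpsc_put(MpscRing *r, void *p)
{
    unsigned int in;

    do {
        in = atomic_load_explicit(&r->in, memory_order_relaxed);
        unsigned int out = atomic_load_explicit(&r->out, memory_order_acquire);
        if (in - out >= MP_RING_SIZE) {
            return -1; /* full (the patch re-checks and retries here) */
        }
    } while (!atomic_compare_exchange_weak(&r->in, &in, in + 1));

    /* The claimed slot must still be NULL: the consumer resets it
     * before release-publishing 'out', which we acquire-loaded above. */
    atomic_store_explicit(&r->data[in & (MP_RING_SIZE - 1)], p,
                          memory_order_release);
    return 0;
}

static void *mpsc_get(MpscRing *r)
{
    unsigned int out = atomic_load_explicit(&r->out, memory_order_relaxed);
    unsigned int idx = out & (MP_RING_SIZE - 1);
    void *p = atomic_load_explicit(&r->data[idx], memory_order_acquire);

    if (!p) {
        return NULL; /* empty, or a producer claimed but hasn't filled yet */
    }
    atomic_store_explicit(&r->data[idx], NULL, memory_order_relaxed);
    /* Release: the slot is reset before 'out' is published to producers. */
    atomic_store_explicit(&r->out, out + 1, memory_order_release);
    return p;
}
```

Note how the NULL-check replaces any comparison against `in` on the consumer side, which is what makes the "claimed but not yet filled" window safe: the consumer simply sees the slot as not ready and tries again later.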
Re: [Qemu-devel] [PATCH 07/12] migration: hold the lock only if it is really needed
On 06/19/2018 03:36 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong

Try to hold src_page_req_mutex only if the queue is not empty.

Pure question: how much would this patch help? Basically if you are running compression tests then I think it means you are with precopy (since postcopy cannot work with compression yet), then here the lock has no contention at all.

Yes, you are right; however, we can observe it is in the top functions (after reverting this patch):

Samples: 29K of event 'cycles', Event count (approx.): 22263412260
+ 7.99% kqemu qemu-system-x86_64  [.] ram_bytes_total
+ 6.95% kqemu [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
+ 6.23% kqemu qemu-system-x86_64  [.] qemu_put_qemu_file
+ 6.20% kqemu qemu-system-x86_64  [.] qemu_event_set
+ 5.80% kqemu qemu-system-x86_64  [.] __ring_put
+ 4.82% kqemu qemu-system-x86_64  [.] compress_thread_data_done
+ 4.11% kqemu qemu-system-x86_64  [.] ring_is_full
+ 3.07% kqemu qemu-system-x86_64  [.] threads_submit_request_prepare
+ 2.83% kqemu qemu-system-x86_64  [.] ring_mp_get
+ 2.71% kqemu qemu-system-x86_64  [.] __ring_is_full
+ 2.46% kqemu qemu-system-x86_64  [.] buffer_zero_sse2
+ 2.40% kqemu qemu-system-x86_64  [.] add_to_iovec
+ 2.21% kqemu qemu-system-x86_64  [.] ring_get
+ 1.96% kqemu [kernel.kallsyms]   [k] __lock_acquire
+ 1.90% kqemu libc-2.12.so        [.] memcpy
+ 1.55% kqemu qemu-system-x86_64  [.] ring_len
+ 1.12% kqemu libpthread-2.12.so  [.] pthread_mutex_unlock
+ 1.11% kqemu qemu-system-x86_64  [.] ram_find_and_save_block
+ 1.07% kqemu qemu-system-x86_64  [.] ram_save_host_page
+ 1.04% kqemu qemu-system-x86_64  [.] qemu_put_buffer
+ 0.97% kqemu qemu-system-x86_64  [.] compress_page_with_multi_thread
+ 0.96% kqemu qemu-system-x86_64  [.] ram_save_target_page
+ 0.93% kqemu libpthread-2.12.so  [.] pthread_mutex_lock

I guess its atomic operations cost CPU resource, and check-before-lock is a common technique; I think it shouldn't have side effects, right? :)
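The check-before-lock pattern discussed here can be sketched as follows. This is a hypothetical stand-in for the real src_page_req_mutex queue (using C11 atomics plus a pthread mutex; names are made up): the unlocked peek may race, but a stale "empty" answer is harmless because the queue is simply re-polled later, while the authoritative check is redone under the lock.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the migration page-request queue. */
typedef struct {
    pthread_mutex_t lock;
    _Atomic int queue_len;
} ReqQueue;

static bool dequeue_one(ReqQueue *q)
{
    /* Cheap unlocked peek: skip the mutex entirely when the queue
     * looks empty.  A racy answer is tolerable here - the caller
     * polls again on its next iteration. */
    if (atomic_load_explicit(&q->queue_len, memory_order_relaxed) == 0) {
        return false;
    }

    pthread_mutex_lock(&q->lock);
    /* Authoritative check, redone under the lock. */
    bool got = atomic_load_explicit(&q->queue_len, memory_order_relaxed) > 0;
    if (got) {
        atomic_fetch_sub(&q->queue_len, 1);
    }
    pthread_mutex_unlock(&q->lock);
    return got;
}
```

The pattern is only safe when a wrong answer from the unlocked peek leads to a retry rather than a correctness problem, which is exactly the situation in the migration loop: a missed request is picked up on the next pass.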
Re: [Qemu-devel] [PATCH 06/12] migration: do not detect zero page for compression
Hi Peter,

Sorry for the delay as I was busy with other things.

On 06/19/2018 03:30 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong

Detecting zero pages is not light work; we can disable it for compression, which can handle all-zero data very well.

Is there any number that shows how the compression algo performs better than the zero-detect algo? Asked since AFAIU buffer_is_zero() might be fast, depending on how init_accel() is done in util/bufferiszero.c.

This is the comparison between zero detection and compression (the target buffer is all zero bits):

Zero 810 ns  Compression: 26905 ns
Zero 417 ns  Compression: 8022 ns
Zero 408 ns  Compression: 7189 ns
Zero 400 ns  Compression: 7255 ns
Zero 412 ns  Compression: 7016 ns
Zero 411 ns  Compression: 7035 ns
Zero 413 ns  Compression: 6994 ns
Zero 399 ns  Compression: 7024 ns
Zero 416 ns  Compression: 7053 ns
Zero 405 ns  Compression: 7041 ns

Indeed, zero detection is faster than compression. However, during our profiling of the live_migration thread (after reverting this patch), we noticed that zero detection costs lots of CPU:

12.01% kqemu qemu-system-x86_64 [.] buffer_zero_sse2
 7.60% kqemu qemu-system-x86_64 [.] ram_bytes_total
 6.56% kqemu qemu-system-x86_64 [.] qemu_event_set
 5.61% kqemu qemu-system-x86_64 [.] qemu_put_qemu_file
 5.00% kqemu qemu-system-x86_64 [.] __ring_put
 4.89% kqemu [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
 4.71% kqemu qemu-system-x86_64 [.] compress_thread_data_done
 3.63% kqemu qemu-system-x86_64 [.] ring_is_full
 2.89% kqemu qemu-system-x86_64 [.] __ring_is_full
 2.68% kqemu qemu-system-x86_64 [.] threads_submit_request_prepare
 2.60% kqemu qemu-system-x86_64 [.] ring_mp_get
 2.25% kqemu qemu-system-x86_64 [.] ring_get
 1.96% kqemu libc-2.12.so       [.] memcpy

After this patch, the workload is moved to the worker threads; is it acceptable?
From the compression-rate POV, of course the zero-page algo wins since it contains no data (but only a flag).

Yes, it is. The compressed zero page is 45 bytes, which is small enough, I think. Hmm, if you do not like it, how about moving zero-page detection to the worker threads?

Thanks!
Re: [Qemu-devel] [PATCH 05/12] migration: show the statistics of compression
On 06/14/2018 12:25 AM, Dr. David Alan Gilbert wrote:

 }

 static void migration_bitmap_sync(RAMState *rs)
@@ -1412,6 +1441,9 @@ static void flush_compressed_data(RAMState *rs)
         qemu_mutex_lock(&comp_param[idx].mutex);
         if (!comp_param[idx].quit) {
             len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
+            /* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
+            compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8;

I think I'd rather save just len + 8 rather than the subtraction.

Hmm, is this what you want?

    compression_counters.reduced_size += len - 8;

Then calculate the real reduced size in populate_ram_info() where we return this info to the user:

    info->compression->reduced_size =
        compression_counters.pages * PAGE_SIZE - compression_counters.reduced_size;

Right?

I think other than that, and Eric's comments, it's OK.

Thanks.
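Dave's suggestion — accumulate the bytes actually transferred and derive the saving only when reporting — could look roughly like the sketch below. This is a hypothetical illustration, not the actual patch (the struct and function names are made up; only TARGET_PAGE_SIZE and the 8-byte RAM_SAVE_FLAG_CONTINUE header come from the thread):

```c
#include <stdint.h>

#define TARGET_PAGE_SIZE 4096

typedef struct {
    uint64_t pages;           /* pages compressed and transferred */
    uint64_t compressed_size; /* bytes on the wire: len + 8-byte header each */
} CompressionCounters;

/* Hot path: record only what was actually sent for one compressed
 * page of wire length 'len' (plus the 8-byte header). */
static void account_compressed_page(CompressionCounters *c, uint64_t len)
{
    c->pages++;
    c->compressed_size += len + 8;
}

/* Reporting path (populate_ram_info()-style): derive the saving from
 * the raw size the pages would have occupied uncompressed. */
static uint64_t reduced_size(const CompressionCounters *c)
{
    return c->pages * TARGET_PAGE_SIZE - c->compressed_size;
}
```

Keeping the subtraction out of the per-page path means the counter stores a directly meaningful quantity (wire bytes), and the derived statistic cannot drift if the header size or page size accounting changes later.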
Re: [Qemu-devel] [PATCH 04/12] migration: introduce migration_update_rates
On 06/14/2018 12:17 AM, Dr. David Alan Gilbert wrote: * guangrong.x...@gmail.com (guangrong.x...@gmail.com) wrote: From: Xiao Guangrong It is used to slightly clean the code up, no logic is changed Actually, there is a slight change; iterations_prev is always updated when previously it was only updated with xbzrle on; still the change makes more sense. Yes, indeed. I will update the changelog.
Re: [Qemu-devel] [PATCH 02/12] migration: fix counting normal page for compression
On 06/13/2018 11:51 PM, Dr. David Alan Gilbert wrote:

* guangrong.x...@gmail.com (guangrong.x...@gmail.com) wrote:

From: Xiao Guangrong

The compressed page is not a normal page.

Is this the right reason?

I think the 'normal' page count shouldn't include compressed pages and XBZRLE-ed pages (the current code does not treat xbzrle pages as normal either).

I think we always increment some counter for a page - so what gets incremented for a compressed page?

In the later patch, we will introduce the statistics of compression, which contain "pages":

    @pages: amount of pages compressed and transferred to the target VM

Is the real answer that we do:

    ram_save_target_page
        control_save_page
        compress_page_with_multi_thread

and control_save_page already increments the counter?

No :), control_save_page increments the counter only if it posted data out; in that case, the compression path is not invoked.

Thanks!
Re: [Qemu-devel] [PATCH 01/12] migration: do not wait if no free thread
On 06/13/2018 11:43 PM, Dr. David Alan Gilbert wrote: * Peter Xu (pet...@redhat.com) wrote: On Tue, Jun 12, 2018 at 10:42:25AM +0800, Xiao Guangrong wrote: On 06/11/2018 03:39 PM, Peter Xu wrote: On Mon, Jun 04, 2018 at 05:55:09PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong Instead of putting the main thread to sleep state to wait for free compression thread, we can directly post it out as normal page that reduces the latency and uses CPUs more efficiently The feature looks good, though I'm not sure whether we should make a capability flag for this feature since otherwise it'll be hard to switch back to the old full-compression way no matter for what reason. Would that be a problem? We assume this optimization should always be optimistic for all cases, particularly, we introduced the statistics of compression, then the user should adjust its parameters based on those statistics if anything works worse. Ah, that'll be good. Furthermore, we really need to improve this optimization if it hurts any case rather than leaving a option to the user. :) Yeah, even if we make it a parameter/capability we can still turn that on by default in new versions but keep the old behavior in old versions. :) The major difference is that, then we can still _have_ a way to compress every page. I'm just thinking if we don't have a switch for that then if someone wants to measure e.g. how a new compression algo could help VM migration, then he/she won't be possible to do that again since the numbers will be meaningless if that bit is out of control on which page will be compressed. Though I don't know how much use it'll bring... But if that won't be too hard, it still seems good. Not a strong opinion. I think that is needed; it might be that some users have really awful networking and need the compression; I'd expect that for people who turn on compression they really expect the slowdown because they need it for their network, so changing that is a bit odd. 
People should make sure the system has enough CPU resource to do compression as well, so the ideal usage is that the 'busy-rate' is low enough, I think. However, it's not a big deal; I will introduce a parameter, maybe 'compress-wait-free-thread'.

Thank you all, Dave and Peter! :)
Re: [Qemu-devel] [PATCH 00/12] migration: improve multithreads for compression and decompression
On 06/11/2018 04:00 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:08PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong

Background
----------
The current implementation of compression and decompression is very hard to enable on production systems. We noticed that too many wait-wakes go to kernel space and CPU usage is very low even if the system is really free.

The reasons are:
1) there are too many locks used for synchronization; there is a global lock and each single thread has its own lock, so the migration thread and the worker threads need to go to sleep if these locks are busy
2) the migration thread separately submits requests to the threads; however, only one request can be pending per thread, which means the thread has to go to sleep after finishing the request

Our Ideas
---------
To make it work better, we introduce a new multithread model: the user, currently the migration thread, submits requests to each thread in a round-robin manner; each thread has its own ring whose capacity is 4 and puts the result into a global ring which is lockless for multiple producers; the user fetches results from the global ring and does the remaining operations for the request, e.g., posting the compressed data out for migration on the source QEMU.

Other work in this patchset is offering some statistics to see if compression works as we expected, and making the migration thread work fast so it can feed more requests to the threads.

Hi, Guangrong,

I'm not sure whether my understanding is correct, but AFAIU the old code has a major defect in that it depends too much on the big lock. The critical section of the small lock seems to be very short always, and it is also per-thread. However, we use the big lock in lots of places: flushing compressed data, queuing every page, or sending the notifies in the compression thread.
The lock is one issue; however, another issue is that the thread has to go to sleep after finishing one request, and the main thread (the live migration thread) needs to go to kernel space and wake the thread up for every single request. And we also observed that linearly scanning the threads one by one to see which is free is not cache-friendly...

I haven't yet read the whole work; this work seems to be quite nice according to your test results. However, have you thought about first removing the big lock without touching much of the other parts of the code, and then continuing to improve it? Or have you ever tried to do so? I don't think you need to do extra work for this, but I would appreciate it if you have existing test results to share.

If you really want the performance result, I will try it...

Actually, the first version we used in our production environment was a lockless multi-thread model (only one atomic operation is needed for both the producer and the consumer) where only one request can be fed to each thread. It's comparable to your suggestion (and should be far faster than your suggestion). We observed that the shortcoming of this solution is that too many waits and wakeups trap into the kernel, so the CPU is idle and the bandwidth is low.

In other words, would it be nicer to separate the work into two pieces?

- one to refactor the existing locks, to see what we can gain by simplifying the locks to a minimum. AFAIU the locking used now is still not ideal; my thinking is that _maybe_ we can start by removing the big lock, and use a semaphore or something to replace the "done" notification while still keeping the small lock? Even some busy looping?

Note: no lock is used after this patchset...

- one to introduce the lockless ring buffer, to demonstrate how the lockless data structure helps compared to the locking approaches.

Then we can know which item contributed how much to the performance numbers.
After all, the new ring and thread model seem to be a big chunk of work (sorry I haven't read them yet, but I will).

It is really a huge burden to refactor the old code and later completely remove it. We redesigned the data structures and algorithms completely and abstracted the model to clean up the code used for compression and decompression; it's not easy to modify the old code part by part... :(

But... if you think it is really needed, I will try to figure out a way to address your suggestion. :)

Thanks!
Re: [Qemu-devel] [PATCH 01/12] migration: do not wait if no free thread
On 06/11/2018 03:39 PM, Peter Xu wrote:

On Mon, Jun 04, 2018 at 05:55:09PM +0800, guangrong.x...@gmail.com wrote:

From: Xiao Guangrong

Instead of putting the main thread into a sleep state to wait for a free compression thread, we can directly post the page out as a normal page, which reduces the latency and uses CPUs more efficiently.

The feature looks good, though I'm not sure whether we should make a capability flag for this feature, since otherwise it'll be hard to switch back to the old full-compression way no matter for what reason. Would that be a problem?

We assume this optimization should always be optimistic for all cases; particularly, we introduced the statistics of compression, so the user should adjust its parameters based on those statistics if anything works worse. Furthermore, we really need to improve this optimization if it hurts any case, rather than leaving an option to the user. :)

Signed-off-by: Xiao Guangrong
---
 migration/ram.c | 34 +++---
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 5bcbf7a9f9..0caf32ab0a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1423,25 +1423,18 @@ static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
     thread_count = migrate_compress_threads();
     qemu_mutex_lock(&comp_done_lock);

Can we drop this lock in this case?

The lock is used to protect comp_param[].done... Well, we could possibly remove it if we redesign the implementation, e.g., use atomic access for comp_param.done; however, I believe it still cannot work efficiently. Please see more in the later reply to your comments in the cover letter.
Re: [Qemu-devel] [PATCH] pc: Remove PC_COMPAT_2_12 from 3.0 machine-types
On 06/09/2018 03:29 AM, Eduardo Habkost wrote: commit f548222c added PC_COMPAT_2_12 to the 3.0 PC machine-types. I believe this happened during manual conflict resolution when applying the patch. Indeed! Reviewed-by: Xiao Guangrong
Re: [Qemu-devel] [PATCH 05/12] migration: show the statistics of compression
On 06/05/2018 06:31 AM, Eric Blake wrote: On 06/04/2018 04:55 AM, guangrong.x...@gmail.com wrote: From: Xiao Guangrong Then the uses can adjust the parameters based on this info Currently, it includes: pages: amount of pages compressed and transferred to the target VM busy: amount of count that no free thread to compress data busy-rate: rate of thread busy reduced-size: amount of bytes reduced by compression compression-rate: rate of compressed size Signed-off-by: Xiao Guangrong --- +++ b/qapi/migration.json @@ -72,6 +72,26 @@ 'cache-miss': 'int', 'cache-miss-rate': 'number', 'overflow': 'int' } } +## +# @CompressionStats: +# +# Detailed compression migration statistics Sounds better as s/compression migration/migration compression/ Indeed. +# +# @pages: amount of pages compressed and transferred to the target VM +# +# @busy: amount of count that no free thread to compress data Not sure what was meant, maybe: @busy: count of times that no free thread was available to compress data Yup, that's better. +# +# @busy-rate: rate of thread busy In what unit? pages per second? It's calculated by: pages-directly-posted-out-without-compression / total-page-posted-out +# +# @reduced-size: amount of bytes reduced by compression +# +# @compression-rate: rate of compressed size In what unit? It's calculated by: size-posted-out-after-compression / (compressed-page * page_size, i.e, that is the raw data without compression) +# +## Missing a 'Since: 3.0' tag Wow, directly upgrade to 3.0, big step. :-) Will add this tag in the next version. +{ 'struct': 'CompressionStats', + 'data': {'pages': 'int', 'busy': 'int', 'busy-rate': 'number', + 'reduced-size': 'int', 'compression-rate': 'number' } } + ## # @MigrationStatus: # @@ -169,6 +189,8 @@ # only present when the postcopy-blocktime migration capability # is enabled. (Since 2.13) Pre-existing - we need to fix this 2.13 to be 3.0 (if it isn't already fixed) I should re-sync the repo before making patches next time. 
# +# @compression: compression migration statistics, only returned if compression +# feature is on and status is 'active' or 'completed' (Since 2.14) There will not be a 2.14 (for that matter, not even a 2.13). The next release is 3.0. Okay, will fix.
Re: [Qemu-devel] [PATCH v2] migration: introduce decompress-error-check
On 05/03/2018 04:06 PM, guangrong.x...@gmail.com wrote: From: Xiao Guangrong <xiaoguangr...@tencent.com> QEMU 2.13 enables strict checks for compression & decompression to make the migration more robust; that depends on the source to fix the internal design which triggers the unexpected error conditions. To make migration from an older QEMU version to 2.13 QEMU work, we introduce this parameter to disable the error check on the destination, which is the default behavior for machine types older than 2.13; alternatively, the strict check can be enabled explicitly as follows: -M pc-q35-2.11 -global migration.decompress-error-check=true [ CC: Eric ] Hi Dave, Eric, Peter, could you please have a look and see if it is good to you? Thanks!
Re: [Qemu-devel] [PATCH] migration: fix saving normal page even if it's been compressed
On 05/02/2018 10:46 AM, Peter Xu wrote: On Sat, Apr 28, 2018 at 04:10:45PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong <xiaoguangr...@tencent.com> Fix the bug introduced by da3f56cb2e767016 (migration: remove ram_save_compressed_page()); it should be 'return' rather than 'res'. Sorry for this stupid mistake :( Signed-off-by: Xiao Guangrong <xiaoguangr...@tencent.com> Reviewed-by: Peter Xu <pet...@redhat.com> So is this only a performance degradation without the fix (since AFAIU the compressed pages will be sent twice)? Thanks, Yes, that's why we did not detect it in our test. :(
Re: [Qemu-devel] [PATCH] migration: introduce decompress-error-check
On 04/27/2018 07:29 PM, Dr. David Alan Gilbert wrote: Yes, I cannot agree with you more. :) The challenge is how to put something into the stream without breaking an old version of QEMU that's receiving the stream. Er, I did not think of this case :(. The new parameter, as this patch does, is the only idea in my mind now... not sure if Eric and Peter have another solution.
Re: [Qemu-devel] [PATCH] migration: introduce decompress-error-check
On 04/27/2018 05:31 PM, Peter Xu wrote: On Fri, Apr 27, 2018 at 11:15:37AM +0800, Xiao Guangrong wrote: On 04/26/2018 10:01 PM, Eric Blake wrote: On 04/26/2018 04:15 AM, guangrong.x...@gmail.com wrote: From: Xiao Guangrong <xiaoguangr...@tencent.com> QEMU 2.13 enables strict check for compression & decompression to make the migration more robuster, that depends on the source to fix s/robuster/robust/ Will fix, thank you for pointing it out. the internal design which triggers the unexpected error conditions 2.13 hasn't been released yet. Why do we need a knob to explicitly turn off strict checking? Can we not instead make 2.13 automatically smart enough to tell if the incoming stream is coming from an older qemu (which might fail if the strict checks are enabled) vs. a newer qemu (the sender gave us what we need to ensure the strict checks are worthwhile)? Really smart. How about introduce a new command, MIG_CMD_DECOMPRESS_ERR_CHECK, the destination will do strict check if got this command (i.e, new QEMU is running on the source), otherwise, turn the check off. Why not we just introduce a compat bit for that? I mean something like: 15c3850325 ("migration: move skip_section_footers", 2017-06-28). Then we turn that check bit off for <=2.12. Would that work? I am afraid it can not. :( The compat bit only impacts local behavior, however, in this case, we need the source QEMU to tell the destination if it supports strict error check.
Re: [Qemu-devel] [PATCH] migration: introduce decompress-error-check
On 04/26/2018 10:01 PM, Eric Blake wrote: On 04/26/2018 04:15 AM, guangrong.x...@gmail.com wrote: From: Xiao Guangrong <xiaoguangr...@tencent.com> QEMU 2.13 enables strict check for compression & decompression to make the migration more robuster, that depends on the source to fix s/robuster/robust/ Will fix, thank you for pointing it out. the internal design which triggers the unexpected error conditions 2.13 hasn't been released yet. Why do we need a knob to explicitly turn off strict checking? Can we not instead make 2.13 automatically smart enough to tell if the incoming stream is coming from an older qemu (which might fail if the strict checks are enabled) vs. a newer qemu (the sender gave us what we need to ensure the strict checks are worthwhile)? Really smart. How about introduce a new command, MIG_CMD_DECOMPRESS_ERR_CHECK, the destination will do strict check if got this command (i.e, new QEMU is running on the source), otherwise, turn the check off. To make it work for migrating old version QEMU to 2.13 QEMU, we introduce this parameter to disable the error check on the destination Signed-off-by: Xiao Guangrong <xiaoguangr...@tencent.com> --- +++ b/qapi/migration.json @@ -455,6 +455,17 @@ # compression, so set the decompress-threads to the number about 1/4 # of compress-threads is adequate. # +# @decompress-error-check: check decompression errors. When false, the errors +# triggered by memory decompression are ignored. What are the consequences of such an error? Is it a security hole to leave this at false, when a malicious migration stream can cause us to misbehave by ignoring the errors? The issue fixed by strict error check is avoiding VM corruption if zlib failed to compress & decompress the memory, i.e, catch error conditions returned by zlib API. +# When true, migration is aborted if the errors are +# detected. 
For the old QEMU versions (< 2.13) the +# internal design will cause decompression to fail +# so the destination should completely ignore the +# error conditions, i.e, make it be false if these +# QEMUs are going to be migrated. Since 2.13, this +# design is fixed, make it be true to avoid corrupting +# the VM silently (Since 2.13) Rather wordy; I'd suggest: @decompress-error-check: Set to true to abort the migration if decompression errors are detected at the destination. Should be left at false (default) for qemu older than 2.13, since only newer qemu sends streams that do not trigger spurious decompression errors. (Since 2.13) Yup, much better. But that's if we even need it (it SHOULD be possible to design something into the migration stream so that you can detect this property automatically instead of relying on the user to set the property). Yes, can not agree with you more. :)
Re: [Qemu-devel] [PATCH] migration: introduce decompress-error-check
On 04/26/2018 05:34 PM, Dr. David Alan Gilbert wrote: * guangrong.x...@gmail.com (guangrong.x...@gmail.com) wrote: From: Xiao Guangrong <xiaoguangr...@tencent.com> QEMU 2.13 enables strict check for compression & decompression to make the migration more robuster, that depends on the source to fix the internal design which triggers the unexpected error conditions Hmm that's painful! Yup, I will think about it more carefully next time. :( To make it work for migrating old version QEMU to 2.13 QEMU, we introduce this parameter to disable the error check on the destination Signed-off-by: Xiao Guangrong <xiaoguangr...@tencent.com> --- hmp.c | 8 migration/migration.c | 19 +++ migration/migration.h | 1 + migration/ram.c | 2 +- qapi/migration.json | 43 +++ 5 files changed, 68 insertions(+), 5 deletions(-) diff --git a/hmp.c b/hmp.c index 898e25f3e1..f0b934368b 100644 --- a/hmp.c +++ b/hmp.c @@ -329,6 +329,10 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict) monitor_printf(mon, "%s: %u\n", MigrationParameter_str(MIGRATION_PARAMETER_DECOMPRESS_THREADS), params->decompress_threads); +assert(params->has_decompress_error_check); +monitor_printf(mon, "%s: %s\n", +MigrationParameter_str(MIGRATION_PARAMETER_DECOMPRESS_ERROR_CHECK), +params->decompress_error_check ? "on" : "off"); Since it's a bool, this should be a migration-capability rather than a parameter I think. The parameter @block-incremental is also a bool, so I think a parameter is okay for decompress-error-check as well; it is one of the configurations of decompression, so it is better to group them. Right? :) Also, we need to make sure it defaults to off for compatibility. Yes, the parameter struct is allocated by zalloc, so it is already false by default; do you think we should set it to false explicitly? Other than that, I think it's OK. Thank you, Dave!
Re: [Qemu-devel] [PATCH] migration: introduce decompress-error-check
Hi Dave, On 04/26/2018 05:15 PM, guangrong.x...@gmail.com wrote: From: Xiao Guangrong <xiaoguangr...@tencent.com> QEMU 2.13 enables strict check for compression & decompression to make the migration more robuster, that depends on the source to fix the internal design which triggers the unexpected error conditions To make it work for migrating old version QEMU to 2.13 QEMU, we introduce this parameter to disable the error check on the destination We noticed this issue when evaluating your pull request; could you please review this fix and pull it into the 2.13 release? Sorry, I did not realize it before. :(
Re: [Qemu-devel] [PATCH v3 00/10] migration: improve and cleanup compression
Hi Paolo, Michael, Stefan and others, could anyone merge this patchset if it is okay to you? On 03/30/2018 03:51 PM, guangrong.x...@gmail.com wrote: From: Xiao Guangrong <xiaoguangr...@tencent.com> Changelog in v3: The following changes are from Peter's review: 1) use comp_param[i].file and decomp_param[i].compbuf to indicate if the thread is properly init'd or not 2) save the file used by the ram loader to a global variable instead of caching it per decompression thread Changelog in v2: Thanks to the reviews from Dave, Peter, Wei and Jiang Biao, the changes in this version are: 1) include the performance numbers in the cover letter 2) add some comments to explain how z_stream->opaque is used in the patchset 3) allocate an internal buffer per thread to store the data to be compressed 4) add a new patch that moves some code to ram_save_host_page() so that 'goto' can be omitted gracefully 5) split the optimization of compression and decompression into two separate patches 6) refine and correct code styles This is the first part of our work to improve compression to make it more useful in production. The first patch resolves the problem that the migration thread spends too much CPU resource compressing memory if it jumps to a new block, which causes the network to be used very inefficiently. The second patch fixes the performance issue that too many VM-exits happen during live migration if compression is being used; it is caused by huge memory being returned to the kernel frequently, as the memory is allocated and freed for every single call to compress2(). The remaining patches clean the code up dramatically. Performance numbers: We have tested it on my desktop, i7-4790 + 16G, by locally live migrating a VM which has 8 vCPUs + 6G memory, with max-bandwidth limited to 350. During the migration, a workload with 8 threads repeatedly writes 6G of memory in total in the VM. Before this patchset, the bandwidth was ~25 mbps; after applying it, the bandwidth is ~50 mbps.
We also collected the perf data for patches 2 and 3 on our production, before the patchset: + 57.88% kqemu [kernel.kallsyms][k] queued_spin_lock_slowpath + 10.55% kqemu [kernel.kallsyms][k] __lock_acquire + 4.83% kqemu [kernel.kallsyms][k] flush_tlb_func_common - 1.16% kqemu [kernel.kallsyms][k] lock_acquire - lock_acquire - 15.68% _raw_spin_lock + 29.42% __schedule + 29.14% perf_event_context_sched_out + 23.60% tdp_page_fault + 10.54% do_anonymous_page + 2.07% kvm_mmu_notifier_invalidate_range_start + 1.83% zap_pte_range + 1.44% kvm_mmu_notifier_invalidate_range_end after applying our work: + 51.92% kqemu [kernel.kallsyms][k] queued_spin_lock_slowpath + 14.82% kqemu [kernel.kallsyms][k] __lock_acquire + 1.47% kqemu [kernel.kallsyms][k] mark_lock.clone.0 + 1.46% kqemu [kernel.kallsyms][k] native_sched_clock + 1.31% kqemu [kernel.kallsyms][k] lock_acquire + 1.24% kqemu libc-2.12.so [.] __memset_sse2 - 14.82% kqemu [kernel.kallsyms][k] __lock_acquire - __lock_acquire - 99.75% lock_acquire - 18.38% _raw_spin_lock + 39.62% tdp_page_fault + 31.32% __schedule + 27.53% perf_event_context_sched_out + 0.58% hrtimer_interrupt We can see the TLB flush and mmu-lock contention have gone. Xiao Guangrong (10): migration: stop compressing page in migration thread migration: stop compression to allocate and free memory frequently migration: stop decompression to allocate and free memory frequently migration: detect compression and decompression errors migration: introduce control_save_page() migration: move some code to ram_save_host_page migration: move calling control_save_page to the common place migration: move calling save_
Re: [Qemu-devel] [PATCH V4] migration: add capability to bypass the shared memory
On 04/04/2018 07:47 PM, Lai Jiangshan wrote: 1) What's this When the migration capability 'bypass-shared-memory' is set, the shared memory will be bypassed when migration. It is the key feature to enable several excellent features for the qemu, such as qemu-local-migration, qemu-live-update, extremely-fast-save-restore, vm-template, vm-fast-live-clone, yet-another-post-copy-migration, etc.. The philosophy behind this key feature, including the resulting advanced key features, is that a part of the memory management is separated out from the qemu, and let the other toolkits such as libvirt, kata-containers (https://github.com/kata-containers) runv(https://github.com/hyperhq/runv/) or some multiple cooperative qemu commands directly access to it, manage it, provide features on it. 2) Status in real world The hyperhq(http://hyper.sh http://hypercontainer.io/) introduced the feature vm-template(vm-fast-live-clone) to the hyper container for several years, it works perfect. (see https://github.com/hyperhq/runv/pull/297). The feature vm-template makes the containers(VMs) can be started in 130ms and save 80M memory for every container(VM). So that the hyper containers are fast and high-density as normal containers. kata-containers project (https://github.com/kata-containers) which was launched by hyper, intel and friends and which descended from runv (and clear-container) should have this feature enabled. Unfortunately, due to the code confliction between runv, this feature was temporary disabled and it is being brought back by hyper and intel team. 3) How to use and bring up advanced features. In current qemu command line, shared memory has to be configured via memory-object. a) feature: qemu-local-migration, qemu-live-update Set the mem-path on the tmpfs and set share=on for it when start the vm. 
example: -object \ memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \ -numa node,nodeid=0,cpus=0-7,memdev=mem when you want to migrate the vm locally (after fixed a security bug of the qemu-binary, or other reason), you can start a new qemu with the same command line and -incoming, then you can migrate the vm from the old qemu to the new qemu with the migration capability 'bypass-shared-memory' set. The migration will migrate the device-state *ONLY*, the memory is the origin memory backed by tmpfs file. b) feature: extremely-fast-save-restore the same above, but the mem-path is on the persistent file system. c) feature: vm-template, vm-fast-live-clone the template vm is started as 1), and paused when the guest reaches the template point(example: the guest app is ready), then the template vm is saved. (the qemu process of the template can be killed now, because we need only the memory and the device state files (in tmpfs)). Then we can launch one or multiple VMs base on the template vm states, the new VMs are started without the “share=on”, all the new VMs share the initial memory from the memory file, they save a lot of memory. all the new VMs start from the template point, the guest app can go to work quickly. The new VM booted from template vm can’t become template again, if you need this unusual chained-template feature, you can write a cloneable-tmpfs kernel module for it. The libvirt toolkit can’t manage vm-template currently, in the hyperhq/runv, we use qemu wrapper script to do it. I hope someone add “libvrit managed template” feature to libvirt. d) feature: yet-another-post-copy-migration It is a possible feature, no toolkit can do it well now. Using nbd server/client on the memory file is reluctantly Ok but inconvenient. A special feature for tmpfs might be needed to fully complete this feature. No one need yet another post copy migration method, but it is possible when some crazy man need it. Excellent work. 
:) It's a brilliant feature that can improve our production a lot. Reviewed-by: Xiao Guangrong <xiaoguangr...@tencent.com>
Re: [Qemu-devel] [PATCH v2 04/10] migration: detect compression and decompression errors
On 03/29/2018 12:25 PM, Peter Xu wrote: On Thu, Mar 29, 2018 at 11:51:03AM +0800, Xiao Guangrong wrote: On 03/28/2018 05:59 PM, Peter Xu wrote: On Tue, Mar 27, 2018 at 05:10:37PM +0800, guangrong.x...@gmail.com wrote: [...] -static int compress_threads_load_setup(void) +static int compress_threads_load_setup(QEMUFile *f) { int i, thread_count; @@ -2665,6 +2685,7 @@ static int compress_threads_load_setup(void) } decomp_param[i].stream.opaque = &decomp_param[i]; +decomp_param[i].file = f; On the source side the error will be set via: qemu_file_set_error(migrate_get_current()->to_dst_file, blen); Maybe we can do similar things using migrate_incoming_get_current() to avoid caching the QEMUFile multiple times? I have considered it; however, it cannot work, as the @file used by the ram loader is not the file got from migrate_incoming_get_current() in some cases. For example, in colo_process_incoming_thread(), the file passed to qemu_loadvm_state() is an internal buffer and it is not easy to switch it to the incoming file. I see. How about caching it in a global variable? We have these already: thread_count = migrate_decompress_threads(); decompress_threads = g_new0(QemuThread, thread_count); decomp_param = g_new0(DecompressParam, thread_count); ... IMHO we can add a new one too, at least we don't cache it multiple times (after all decomp_param[i]s are global variables too). Nice, that's good to me. Will add your Reviewed-by on this patch as well if you do not mind. :)
Re: [Qemu-devel] [PATCH v2 04/10] migration: detect compression and decompression errors
On 03/28/2018 05:59 PM, Peter Xu wrote: On Tue, Mar 27, 2018 at 05:10:37PM +0800, guangrong.x...@gmail.com wrote: [...] -static int compress_threads_load_setup(void) +static int compress_threads_load_setup(QEMUFile *f) { int i, thread_count; @@ -2665,6 +2685,7 @@ static int compress_threads_load_setup(void) } decomp_param[i].stream.opaque = &decomp_param[i]; +decomp_param[i].file = f; On the source side the error will be set via: qemu_file_set_error(migrate_get_current()->to_dst_file, blen); Maybe we can do similar things using migrate_incoming_get_current() to avoid caching the QEMUFile multiple times? I have considered it; however, it cannot work, as the @file used by the ram loader is not the file got from migrate_incoming_get_current() in some cases. For example, in colo_process_incoming_thread(), the file passed to qemu_loadvm_state() is an internal buffer and it is not easy to switch it to the incoming file.
Re: [Qemu-devel] [PATCH v2 03/10] migration: stop decompression to allocate and free memory frequently
On 03/28/2018 05:42 PM, Peter Xu wrote: On Tue, Mar 27, 2018 at 05:10:36PM +0800, guangrong.x...@gmail.com wrote: [...] +static int compress_threads_load_setup(void) +{ +int i, thread_count; + +if (!migrate_use_compression()) { +return 0; +} + +thread_count = migrate_decompress_threads(); +decompress_threads = g_new0(QemuThread, thread_count); +decomp_param = g_new0(DecompressParam, thread_count); +qemu_mutex_init(&decomp_done_lock); +qemu_cond_init(&decomp_done_cond); +for (i = 0; i < thread_count; i++) { +if (inflateInit(&decomp_param[i].stream) != Z_OK) { +goto exit; +} +decomp_param[i].stream.opaque = &decomp_param[i]; Same question as the encoding patch here, otherwise looks good to me. Thanks for pointing it out, will fix. Hmm, can I treat it as your Reviewed-by for the next version?
Re: [Qemu-devel] [PATCH v2 02/10] migration: stop compression to allocate and free memory frequently
On 03/28/2018 05:25 PM, Peter Xu wrote: On Tue, Mar 27, 2018 at 05:10:35PM +0800, guangrong.x...@gmail.com wrote: [...] @@ -357,10 +358,20 @@ static void compress_threads_save_cleanup(void) terminate_compression_threads(); thread_count = migrate_compress_threads(); for (i = 0; i < thread_count; i++) { +/* + * stream.opaque can be used to store private data, we use it + * as a indicator which shows if the thread is properly init'd + * or not + */ +if (!comp_param[i].stream.opaque) { +break; +} How about using comp_param[i].file? The opaque seems to be hiding deeper, and... Yes, indeed, good suggestion. qemu_thread_join(compress_threads + i); qemu_fclose(comp_param[i].file); qemu_mutex_destroy(&comp_param[i].mutex); qemu_cond_destroy(&comp_param[i].cond); +deflateEnd(&comp_param[i].stream); +comp_param[i].stream.opaque = NULL; } qemu_mutex_destroy(&comp_done_lock); qemu_cond_destroy(&comp_done_cond); @@ -370,12 +381,12 @@ static void compress_threads_save_cleanup(void) comp_param = NULL; } -static void compress_threads_save_setup(void) +static int compress_threads_save_setup(void) { int i, thread_count; if (!migrate_use_compression()) { -return; +return 0; } thread_count = migrate_compress_threads(); compress_threads = g_new0(QemuThread, thread_count); @@ -383,6 +394,12 @@ static void compress_threads_save_setup(void) qemu_cond_init(&comp_done_cond); qemu_mutex_init(&comp_done_lock); for (i = 0; i < thread_count; i++) { +if (deflateInit(&comp_param[i].stream, + migrate_compress_level()) != Z_OK) { (indent issue) Will fix. +goto exit; +} +comp_param[i].stream.opaque = &comp_param[i]; ...here from document: ZEXTERN int ZEXPORT deflateInit OF((z_streamp strm, int level)); Initializes the internal stream state for compression. The fields zalloc, zfree and opaque must be initialized before by the caller. If zalloc and zfree are set to Z_NULL, deflateInit updates them to use default allocation functions. So shall we init opaque first? Otherwise looks good to me.
No, opaque needs to be init'd only if zalloc and zfree are specified, which is not the case in this patch.
Re: [Qemu-devel] [PATCH 3/8] migration: support to detect compression and decompression errors
On 03/28/2018 12:20 PM, Peter Xu wrote: On Wed, Mar 28, 2018 at 12:08:19PM +0800, jiang.bi...@zte.com.cn wrote: On Tue, Mar 27, 2018 at 10:35:29PM +0800, Xiao Guangrong wrote: No, we can't make the assumption that "error _must_ be caused by page update". No document/ABI about compress/decompress promised it. :) Indeed, I found no good documents about the errors below that jiang.biao pointed out. Hi Peter, the description of the errors comes from here: http://www.zlib.net/manual.html The error codes returned by inflate() are described as: ** inflate() returns Z_OK if some progress has been made (more input processed or more output produced), Z_STREAM_END if the end of the compressed data has been reached and all uncompressed output has been produced, Z_NEED_DICT if a preset dictionary is needed at this point, Z_DATA_ERROR if the input data was corrupted (input stream not conforming to the zlib format or incorrect check value, in which case strm->msg points to a string with a more specific error), Z_STREAM_ERROR if the stream structure was inconsistent (for example next_in or next_out was Z_NULL, or the state was inadvertently written over by the application), Z_MEM_ERROR if there was not enough memory, Z_BUF_ERROR if no progress was possible or if there was not enough room in the output buffer when Z_FINISH is used. ... ** Ah yes. My bad to be so careless. :) According to the above description, an error caused by a page update looks more likely to return Z_DATA_ERROR, but I do not have an env to verify that. :) No, we still lack info to confirm that compressing data which is being updated is the only case that returns Z_DATA_ERROR. And nothing guarantees that no other error condition causing corrupted data will be squeezed into this error code.
As I understand it, the real compress/decompress error cases other than those caused by page update should be rare; maybe the error code is enough to distinguish them, if we can verify by test the error codes returned for page updates and for other silent failures. If so, we can cut the cost of the memcpy. Please note, compared with other operations, e.g., compression, zero-page detection, etc., memcpy() is not a hot function at all. If not, I agree with Guangrong's idea too. I have never read the zlib code and all my information comes from the manual, so if anything is inaccurate, please ignore my opinion. :) So I suppose all of us know the alternative now; we just need a solid way to confirm the uncertainty. I'll leave this to Guangrong. Yes, I still prefer memcpy() to make it safe enough to protect our production unless we get enough certainty to figure out the error conditions. Thanks!
Re: [Qemu-devel] [PATCH 1/8] migration: stop compressing page in migration thread
On 03/28/2018 11:01 AM, Wang, Wei W wrote: On Tuesday, March 13, 2018 3:58 PM, Xiao Guangrong wrote: As compression is a heavy work, do not do it in migration thread, instead, we post it out as a normal page Signed-off-by: Xiao Guangrong <xiaoguangr...@tencent.com> Hi Guangrong, Dave asked me to help review your patch, so I will just drop my 2 cents wherever possible, and hope that could be inspiring for your work. Thank you both for the nice help on the work. :) --- migration/ram.c | 32 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index 7266351fd0..615693f180 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1132,7 +1132,7 @@ static int ram_save_compressed_page(RAMState *rs, PageSearchStatus *pss, int pages = -1; uint64_t bytes_xmit = 0; uint8_t *p; -int ret, blen; +int ret; RAMBlock *block = pss->block; ram_addr_t offset = pss->page << TARGET_PAGE_BITS; @@ -1162,23 +1162,23 @@ static int ram_save_compressed_page(RAMState *rs, PageSearchStatus *pss, if (block != rs->last_sent_block) { flush_compressed_data(rs); pages = save_zero_page(rs, block, offset); -if (pages == -1) { -/* Make sure the first page is sent out before other pages */ -bytes_xmit = save_page_header(rs, rs->f, block, offset | - RAM_SAVE_FLAG_COMPRESS_PAGE); -blen = qemu_put_compression_data(rs->f, p, TARGET_PAGE_SIZE, - migrate_compress_level()); -if (blen > 0) { -ram_counters.transferred += bytes_xmit + blen; -ram_counters.normal++; -pages = 1; -} else { -qemu_file_set_error(rs->f, blen); -error_report("compressed data failed!"); -} -} if (pages > 0) { ram_release_pages(block->idstr, offset, pages); +} else { +/* + * Make sure the first page is sent out before other pages. + * + * we post it as normal page as compression will take much + * CPU resource. 
+ */ +ram_counters.transferred += save_page_header(rs, rs->f, block, +offset | RAM_SAVE_FLAG_PAGE); +qemu_put_buffer_async(rs->f, p, TARGET_PAGE_SIZE, + migrate_release_ram() & + migration_in_postcopy()); +ram_counters.transferred += TARGET_PAGE_SIZE; +ram_counters.normal++; +pages = 1; } } else { pages = save_zero_page(rs, block, offset); -- I agree that this patch is an improvement for the current implementation. So just pile up mine here: Reviewed-by: Wei Wang <wei.w.w...@intel.com> Thanks. If you are interested in something more aggressive, I can share an alternative approach, which I think would be better. Please see below. Actually, we can use the multi-threaded compression for the first page as well, which will not block the migration thread's progress. The advantage is that we can enjoy the compression benefit for the first page and meanwhile not block the migration thread - the page is given to a compression thread and compressed asynchronously to the migration thread's execution. Yes, it is a good point. The main barrier to achieving the above is that we need to make sure the first page of each block is sent first in the multi-threaded environment. We can twist the current implementation to achieve that, which is not hard. For example, we can add a new flag to RAMBlock - bool first_page_added. Each compression thread needs to: 1) check if this is the first page of the block; 2) if it is the first page, set block->first_page_added after sending the page; 3) if it is not the first page, wait to send the page until block->first_page_added is set. So another barrier is introduced, which hurts the parallelism... Hmm, we need more deliberate consideration on this point; let me think it over after this work. Thank you.
Re: [Qemu-devel] [PATCH 3/8] migration: support to detect compression and decompression errors
On 03/28/2018 08:43 AM, jiang.bi...@zte.com.cn wrote: On 03/27/2018 07:17 PM, Peter Xu wrote: On Tue, Mar 27, 2018 at 03:42:32AM +0800, Xiao Guangrong wrote: [...] It'll be understandable to me if the problem is that the compress() API does not allow the input buffer to be changed during the whole period of the call. If that is a must, this patch for sure helps. Yes, that is exactly what I want to say. :) So I think now I know what this patch is for. :) And yeah, it makes sense. Though another question would be: if the buffer is updated during compress() and compress() returned an error, would that pollute the whole z_stream or only fail the compress() call? I guess deflateReset() can recover everything, i.e., keep z_stream as it is init'ed by deflateInit(). (The same question applies to decompress().) If it's only a compress() error and it won't pollute z_stream (or say, it can be recovered after a deflateReset() and then we can continue to call deflate() without problem), then we'll actually have two alternatives to solve this "buffer update" issue: 1. Use the approach of the current patch: we copy the page every time, so deflate() never fails because an update never happens. But it's slow since we copy the pages every time. 2. Use the old approach, and when compress() fails, we just ignore that page (since now we know that the error _must_ be caused by a page update, we are 100% sure that we'll send that page again, so it'll be perfectly fine). No, we can't make the assumption that "error _must_ be caused by page update". No document/ABI about compress/decompress promises it. :) So, as I mentioned before, can we just distinguish the decompress/compress errors from errors caused by page update by the return code of inflate/deflate?
According to the zlib manual, there seem to be several error codes for different cases:

#define Z_ERRNO          (-1)
#define Z_STREAM_ERROR   (-2)
#define Z_DATA_ERROR     (-3)
#define Z_MEM_ERROR      (-4)
#define Z_BUF_ERROR      (-5)
#define Z_VERSION_ERROR  (-6)

Did you check the return code when the silent failure (not caused by page update) happened before? :) I am afraid there is no such error code, and i guess zlib is not designed to compress data which is being modified.
Re: [Qemu-devel] [PATCH 3/8] migration: support to detect compression and decompression errors
On 03/27/2018 07:17 PM, Peter Xu wrote: On Tue, Mar 27, 2018 at 03:42:32AM +0800, Xiao Guangrong wrote: [...] It'll be understandable to me if the problem is that the compress() API does not allow the input buffer to be changed during the whole period of the call. If that is a must, this patch for sure helps. Yes, that is exactly what i want to say. :) So I think now I know what this patch is for. :) And yeah, it makes sense. Though another question would be: if the buffer is updated during compress() and compress() returned an error, would that pollute the whole z_stream or would it only fail the compress() call? I guess deflateReset() can recover everything, i.e., keep z_stream as it is init'ed by deflate_init(). (Same question applies to decompress().) If it's only a compress() error and it won't pollute z_stream (or say, it can be recovered after a deflateReset() and then we can continue to call deflate() without problem), then we'll actually have two alternatives to solve this "buffer update" issue: 1. Use the approach of the current patch: we copy the page every time, so deflate() never fails because an update never happens. But it's slow since we copy the pages every time. 2. Use the old approach, and when compress() fails, we just ignore that page (since now we know that the error _must_ be caused by a page update, then we are 100% sure that we'll send that page again so it'll be perfectly fine). No, we can't make the assumption that "error _must_ be caused by page update". No document/ABI about compress/decompress promises it. :) Thanks!
Re: [Qemu-devel] [PATCH 3/8] migration: support to detect compression and decompression errors
On 03/27/2018 03:22 PM, Peter Xu wrote: On Thu, Mar 22, 2018 at 08:03:53PM +0800, Xiao Guangrong wrote: On 03/21/2018 06:00 PM, Peter Xu wrote: On Tue, Mar 13, 2018 at 03:57:34PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong <xiaoguangr...@tencent.com> Currently the page being compressed is allowed to be updated by the VM on the source QEMU; correspondingly, the destination QEMU just ignores the decompression error. However, we completely miss the chance to catch real errors, so the VM is corrupted silently. To make the migration more robust, we copy the page to a buffer first to avoid it being written by the VM, then detect and handle both compression and decompression errors properly. Not sure I missed anything important, but I'll just shoot my thoughts as questions (again)... Actually this is a more general question? Say, even without compression, we can be sending a page that is being modified. However, IMHO we don't need to worry about that, since if that page is modified, we'll definitely send that page again, so the new page will replace the old. So on the destination side, even if decompress() failed on a page it'll be fine IMHO. Though now we are copying the corrupted buffer. On that point, I fully agree that we should not - maybe we can just drop the page entirely? For non-compressed pages, we can't detect that, so we'll copy the page even if corrupted. The special part for compression would be: would the deflate() fail if there is a concurrent update to the buffer being compressed? And would that corrupt the whole compression stream, or would it only fail the deflate() call? It is not the same for normal pages and compressed pages. For a normal page, the dirty-log mechanism in QEMU and the infrastructure of the network (e.g., TCP) can make sure that the modified memory will be posted to the destination without corruption.
However, nothing can guarantee compression/decompression is BUG-free, e.g., consider the case: in the last step, vCPUs & dirty-log are paused and the memory is compressed and posted to the destination; if there is any error in compression/decompression, the VM dies silently. Here do you mean the compression error even if the VM is halted? I'd say in that case IMHO the extra memcpy() would still help little since the copied page should be exactly the same as the source page? "compression error" means that compress2() in the original code returns an error code. If the data being compressed is being modified at the same time, compression will fail, and this failure is negative. We move the data to an internal buffer to avoid this case, so that we can catch the real error condition. I'd say I don't know what we can really do if there are zlib bugs. I was assuming we'll definitely fail in a strange way if there is any, which should be hard to be detected from QEMU's POV (maybe a destination VM crash, as you mentioned). It'll be easy for us to detect errors when we get an error code returned from compress(); however, IMHO when we say "zlib bug" it can also mean that data is corrupted even though compress() and decompress() both returned with good state. Ah, sorry, i abused the word "BUG". It does not mean a BUG in the compression/decompression API; i mean the failure conditions (the API returns an error code). It'll be understandable to me if the problem is that the compress() API does not allow the input buffer to be changed during the whole period of the call. If that is a must, this patch for sure helps. Yes, that is exactly what i want to say. :)
Re: [Qemu-devel] [PATCH 1/8] migration: stop compressing page in migration thread
On 03/26/2018 05:02 PM, Peter Xu wrote: On Thu, Mar 22, 2018 at 07:38:07PM +0800, Xiao Guangrong wrote: On 03/21/2018 04:19 PM, Peter Xu wrote: On Fri, Mar 16, 2018 at 04:05:14PM +0800, Xiao Guangrong wrote: Hi David, Thanks for your review. On 03/15/2018 06:25 PM, Dr. David Alan Gilbert wrote: migration/ram.c | 32 Hi, Do you have some performance numbers to show this helps? Were those taken on a normal system or were they taken with one of the compression accelerators (which I think the compression migration was designed for)? Yes, i have tested it on my desktop, i7-4790 + 16G, by locally live migrating a VM which has 8 vCPUs + 6G memory, with the max-bandwidth limited to 350. During the migration, a workload with 8 threads repeatedly writes to a total of 6G of memory in the VM. Before this patchset, its bandwidth is ~25 mbps; after applying, the bandwidth is ~50 mbps. Hi, Guangrong, Not really review comments, but I got some questions. :) Your comments are always valuable to me! :) IIUC this patch will only change the behavior when last_sent_block changed. I see that the performance is doubled after the change, which is really promising. However I don't fully understand why it brings such a big difference considering that IMHO current code is sending dirty pages per-RAMBlock. I mean, IMHO last_sent_block should not change frequently? Or am I wrong? It depends on the configuration: each memory region which is RAM- or file-backed has a RAMBlock. Actually, more benefit comes from the fact that the performance & throughput of the multiple threads have been improved, as the threads are fed by the migration thread and the result is consumed by the migration thread. I'm not sure whether I got your points - I think you mean that the compression threads and the migration thread can form a better pipeline if the migration thread does not do any compression at all. I think I agree with that.
However it does not really explain to me on why a very rare event (sending the first page of a RAMBlock, considering bitmap sync is rare) can greatly affect the performance (it shows a doubled boost). I understand it is tricky indeed, but it is not very hard to explain. The multiple threads (using 8 CPUs in our test) stay idle for a long time with the original code; after our patch, as the normal page is posted out asynchronously, which is extremely fast as you said (the network is almost idle in the current implementation), the CPUs have a long time in which they can be used effectively to generate more compressed data than before. Btw, about the numbers: IMHO the numbers might not be really "true numbers". Or say, even if the bandwidth is doubled, IMHO it does not mean the performance is doubled, because the data has changed. Previously there were only compressed pages, and now for each cycle of RAMBlock looping we'll send a normal page (then we'll get more things to send). So IMHO we don't really know whether we sent more pages with this patch; we can only know we sent more bytes (e.g., an extreme case is that the extra 25Mbps/s are all caused by those normal pages, and we can be sending exactly the same number of pages as before, or even worse?). The current implementation uses the CPU very ineffectively (addressing that is our next work to be posted out) and the network is almost idle, so posting more data out is a better choice; furthermore, the migration thread plays a role in the parallelism, so it'd better be fast. Another follow-up question would be: have you measured how much time is needed to compress a 4k page, and how much to send it? I think "sending the page" is not really meaningful considering that we just put a page into the buffer (which should be extremely fast since we don't really flush it every time), however I would be curious on how slow would compressing a page be.
I haven't benchmarked the performance of zlib; i think it is a CPU-intensive workload, particularly as there is no compression accelerator (e.g., QAT) in our production environment. BTW, we were using lzo instead of zlib, which worked better for some workloads. Never mind. Good to know about that. Putting a page into the buffer should depend on the network, i.e., if the network is congested it should take a long time. :) Again, considering that I don't know much about compression (especially I hardly used it), mine are only questions, which should not block your patches from being queued/merged/reposted when proper. :) Yes, i see. The discussion can potentially raise a better solution. Thanks for your comment, Peter!
Re: [Qemu-devel] [PATCH 3/8] migration: support to detect compression and decompression errors
On 03/21/2018 06:00 PM, Peter Xu wrote: On Tue, Mar 13, 2018 at 03:57:34PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong <xiaoguangr...@tencent.com> Currently the page being compressed is allowed to be updated by the VM on the source QEMU; correspondingly, the destination QEMU just ignores the decompression error. However, we completely miss the chance to catch real errors, so the VM is corrupted silently. To make the migration more robust, we copy the page to a buffer first to avoid it being written by the VM, then detect and handle both compression and decompression errors properly. Not sure I missed anything important, but I'll just shoot my thoughts as questions (again)... Actually this is a more general question? Say, even without compression, we can be sending a page that is being modified. However, IMHO we don't need to worry about that, since if that page is modified, we'll definitely send that page again, so the new page will replace the old. So on the destination side, even if decompress() failed on a page it'll be fine IMHO. Though now we are copying the corrupted buffer. On that point, I fully agree that we should not - maybe we can just drop the page entirely? For non-compressed pages, we can't detect that, so we'll copy the page even if corrupted. The special part for compression would be: would the deflate() fail if there is a concurrent update to the buffer being compressed? And would that corrupt the whole compression stream, or would it only fail the deflate() call? It is not the same for normal pages and compressed pages. For a normal page, the dirty-log mechanism in QEMU and the infrastructure of the network (e.g., TCP) can make sure that the modified memory will be posted to the destination without corruption.
However, nothing can guarantee compression/decompression is BUG-free, e.g., consider the case: in the last step, vCPUs & dirty-log are paused and the memory is compressed and posted to the destination; if there is any error in compression/decompression, the VM dies silently.
Re: [Qemu-devel] [PATCH 2/8] migration: stop allocating and freeing memory frequently
On 03/21/2018 05:06 PM, Peter Xu wrote: On Tue, Mar 13, 2018 at 03:57:33PM +0800, guangrong.x...@gmail.com wrote: From: Xiao Guangrong <xiaoguangr...@tencent.com> Current code uses compress2()/uncompress() to compress/decompress memory. These two functions manage memory allocation and release internally, which causes a huge amount of memory to be allocated and freed very frequently. Worse, frequently returning memory to the kernel will flush TLBs and trigger invalidation callbacks on mmu-notification which interacts with the KVM MMU, and that dramatically reduces the performance of the VM. So, we maintain the memory by ourselves and reuse it for each compression and decompression. Signed-off-by: Xiao Guangrong <xiaoguangr...@tencent.com> ---
 migration/qemu-file.c |  34 ++--
 migration/qemu-file.h |   6 ++-
 migration/ram.c       | 142 +-
 3 files changed, 140 insertions(+), 42 deletions(-)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 2ab2bf362d..1ff33a1ffb 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -658,6 +658,30 @@ uint64_t qemu_get_be64(QEMUFile *f)
     return v;
 }

+/* return the size after compression, or negative value on error */
+static int qemu_compress_data(z_stream *stream, uint8_t *dest, size_t dest_len,
+                              const uint8_t *source, size_t source_len)
+{
+    int err;
+
+    err = deflateReset(stream);

I'm not familiar with zlib, but I saw this in the manual: https://www.zlib.net/manual.html "This function is equivalent to deflateEnd followed by deflateInit, but does not free and reallocate the internal compression state. The stream will leave the compression level and any other attributes that may have been set unchanged." I thought it was deflateInit() that is slow? Can we avoid the reset as long as we make sure to deflateInit() before doing anything else? deflateEnd() is worse, as it frees memory to the kernel, which triggers TLB flushes and mmu-notifiers. Actually, deflateReset() is cheap... :) Meanwhile, is there any performance number for this single patch?
Since I thought the old code is calling compress2() which contains deflateInit() and deflateEnd() too, just like what the current patch does? No, after the patch, we just call deflateInit() / deflateEnd() one time (in the _setup() handler and _cleanup() handler). Yes. This is the perf data from our production, after reverting this patch:

+  57.88%  kqemu  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
+  10.55%  kqemu  [kernel.kallsyms]  [k] __lock_acquire
+   4.83%  kqemu  [kernel.kallsyms]  [k] flush_tlb_func_common
-   1.16%  kqemu  [kernel.kallsyms]  [k] lock_acquire
   - lock_acquire
      - 15.68% _raw_spin_lock
         + 29.42% __schedule
         + 29.14% perf_event_context_sched_out
         + 23.60% tdp_page_fault
         + 10.54% do_anonymous_page
         +  2.07% kvm_mmu_notifier_invalidate_range_start
         +  1.83% zap_pte_range
         +  1.44% kvm_mmu_notifier_invalidate_range_end

After applying our work:

+  51.92%  kqemu  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
+  14.82%  kqemu  [kernel.kallsyms]  [k] __lock_acquire
+   1.47%  kqemu  [kernel.kallsyms]  [k] mark_lock.clone.0
+   1.46%  kqemu  [kernel.kallsyms]  [k] native_sched_clock
+   1.31%  kqemu  [kernel.kallsyms]  [k] lock_acquire
+   1.24%  kqemu  libc-2.12.so       [.] __memset_sse2
-  14.82%  kqemu  [kernel.kallsyms]  [k] __lock_acquire
   - __lock_acquire
      - 99.75% lock_acquire
         - 18.38% _raw_spin_lock
            + 39.62% tdp_page_fault
            + 31.32% __schedule
            + 27.53% perf_event_context_sched_out
            +  0.58% hrtimer_interrupt

You can see the TLB flush and mmu-lock contention have gone after this patch. It would be nice too if we can split the patch into two (decode, encode) if you want, but that's optional. That's good.
Re: [Qemu-devel] [PATCH 1/8] migration: stop compressing page in migration thread
On 03/21/2018 04:19 PM, Peter Xu wrote: On Fri, Mar 16, 2018 at 04:05:14PM +0800, Xiao Guangrong wrote: Hi David, Thanks for your review. On 03/15/2018 06:25 PM, Dr. David Alan Gilbert wrote: migration/ram.c | 32 Hi, Do you have some performance numbers to show this helps? Were those taken on a normal system or were they taken with one of the compression accelerators (which I think the compression migration was designed for)? Yes, i have tested it on my desktop, i7-4790 + 16G, by locally live migrating a VM which has 8 vCPUs + 6G memory, with the max-bandwidth limited to 350. During the migration, a workload with 8 threads repeatedly writes to a total of 6G of memory in the VM. Before this patchset, its bandwidth is ~25 mbps; after applying, the bandwidth is ~50 mbps. Hi, Guangrong, Not really review comments, but I got some questions. :) Your comments are always valuable to me! :) IIUC this patch will only change the behavior when last_sent_block changed. I see that the performance is doubled after the change, which is really promising. However I don't fully understand why it brings such a big difference considering that IMHO current code is sending dirty pages per-RAMBlock. I mean, IMHO last_sent_block should not change frequently? Or am I wrong? It depends on the configuration: each memory region which is RAM- or file-backed has a RAMBlock. Actually, more benefit comes from the fact that the performance & throughput of the multiple threads have been improved, as the threads are fed by the migration thread and the result is consumed by the migration thread. Another follow-up question would be: have you measured how much time is needed to compress a 4k page, and how much to send it? I think "sending the page" is not really meaningful considering that we just put a page into the buffer (which should be extremely fast since we don't really flush it every time), however I would be curious on how slow would compressing a page be.
I haven't benchmarked the performance of zlib; i think it is a CPU-intensive workload, particularly as there is no compression accelerator (e.g., QAT) in our production environment. BTW, we were using lzo instead of zlib, which worked better for some workloads. Putting a page into the buffer should depend on the network, i.e., if the network is congested it should take a long time. :)
Re: [Qemu-devel] [PATCH 2/8] migration: stop allocating and freeing memory frequently
On 03/19/2018 06:54 PM, Dr. David Alan Gilbert wrote:

+    return 0;
+
+exit:
+    compress_threads_load_cleanup();

I don't think this is safe; if inflateInit(..) fails in not-the-last thread, compress_threads_load_cleanup() will try and destroy all the mutexes and condition variables, even though they've not yet all been _init'd. That is exactly why we used 'opaque', please see more below... However, other than that I think the patch is OK; a chat with Dan Berrange has convinced me this probably doesn't affect the stream format, so that's OK. One thing I would like is a comment as to how the 'opaque' field is being used; I don't think I quite understand what you're doing there. The zlib.h file says that: "The opaque value provided by the application will be passed as the first parameter for calls of zalloc and zfree. This can be useful for custom memory management. The compression library attaches no meaning to the opaque value." So we can use it to store our private data. Here, we use it as an indicator which shows whether the thread is properly init'd or not. If inflateInit() is successful we set it to non-NULL, otherwise it is NULL, so that the cleanup path can figure out which threads failed to do inflateInit(). OK, so I think you just need to add a comment to explain that. Put it above the 'if (!decomp_param[i].stream.opaque) {' in compress_threads_load_cleanup so it'll be easy to understand. Yes, indeed, i will do. Thanks!
Re: [Qemu-devel] [PATCH 3/8] migration: support to detect compression and decompression errors
On 03/19/2018 03:56 PM, jiang.bi...@zte.com.cn wrote: Hi, guangrong

@@ -1051,11 +1052,13 @@ static int do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
 {
     RAMState *rs = ram_state;
     int bytes_sent, blen;
-    uint8_t *p = block->host + (offset & TARGET_PAGE_MASK);
+    uint8_t buf[TARGET_PAGE_SIZE], *p;

+    p = block->host + (offset & TARGET_PAGE_MASK);
     bytes_sent = save_page_header(rs, f, block, offset |
                                   RAM_SAVE_FLAG_COMPRESS_PAGE);
-    blen = qemu_put_compression_data(f, stream, p, TARGET_PAGE_SIZE);
+    memcpy(buf, p, TARGET_PAGE_SIZE);
+    blen = qemu_put_compression_data(f, stream, buf, TARGET_PAGE_SIZE);

The memory copy operation for every page to be compressed is not cheap, especially when the page number is huge, and it may not be necessary for pages never updated during migration. This copy is only of a 4k page. Is there any possibility that we can distinguish the real compress/decompress errors from those caused by the source VM updating? Such as the return value of qemu_uncompress() (distinguish Z_DATA_ERROR from other error codes returned by inflate())? Unfortunately, no. :(
Re: [Qemu-devel] [PATCH 2/8] migration: stop allocating and freeing memory frequently
On 03/19/2018 09:49 AM, jiang.bi...@zte.com.cn wrote: Hi, guangrong

+/* return the size after compression, or negative value on error */
+static int qemu_compress_data(z_stream *stream, uint8_t *dest, size_t dest_len,
+                              const uint8_t *source, size_t source_len)
+{
+    int err;
+
+    err = deflateReset(stream);
+    if (err != Z_OK) {
+        return -1;
+    }
+
+    stream->avail_in = source_len;
+    stream->next_in = (uint8_t *)source;
+    stream->avail_out = dest_len;
+    stream->next_out = dest;
+

Duplicated code with qemu_uncompress(); would initializing the stream outside of qemu_compress_data() be better? In that case, we could pass far fewer parameters down, and avoid the duplicated code. Or could we encapsulate some struct to ease the case? There are multiple places that do compression/uncompression in QEMU; i am going to introduce common functions to clean up these places, but that can be another patchset later...

+    err = deflate(stream, Z_FINISH);
+    if (err != Z_STREAM_END) {
+        return -1;
+    }
+
+    return stream->next_out - dest;
+}
+

@@ -683,8 +707,10 @@ ssize_t qemu_put_compression_data(QEMUFile *f, const uint8_t *p, size_t size,
         return -1;
     }
 }
-    if (compress2(f->buf + f->buf_index + sizeof(int32_t), (uLongf *),
-                  (Bytef *)p, size, level) != Z_OK) {
+
+    blen = qemu_compress_data(stream, f->buf + f->buf_index + sizeof(int32_t),
+                              blen, p, size);

The "level" parameter is never used after the patch; could we just remove it? On the other hand, deflate() of zlib supports a compression level too (via deflateInit(stream, level)); should we just reuse the level properly? If not, the migrate parameter *compress_level* will be useless.
The 'level' has been pushed to @stream:

+    if (deflateInit(_param[i].stream,
+                    migrate_compress_level()) != Z_OK) {
+        goto exit;
+    }

+    if (blen < 0) {
+        error_report("Compress Failed!");
+        return 0;
+    }

+/* return the size after decompression, or negative value on error */
+static int qemu_uncompress(z_stream *stream, uint8_t *dest, size_t dest_len,
+                           uint8_t *source, size_t source_len)

The name of *qemu_uncompress* does not quite match *qemu_compress_data*; would *qemu_uncompress_data* be better? It's good to me, will rename it. Besides, the prototype is not consistent with *qemu_compress_data* either; should the *source* be *const* also here? Okay. Thanks!