[PATCH v3] i386/cpu: fixup number of addressable IDs for processor cores in the physical package
When QEMU is started with:

  -cpu host,host-cache-info=on,l3-cache=off \
  -smp 2,sockets=1,dies=1,cores=1,threads=2

the guest can't acquire the maximum number of addressable IDs for
processor cores in the physical package from CPUID[04H].

When creating a CPU topology of 1 core per package, host-cache-info
only uses the host's addressable core IDs field
(CPUID.04H:EAX[bits 31-26]), resulting in a conflict (on a multi-core
host) between the guest core topology information in this field and
the guest's actual number of cores.

Fix it by removing the unnecessary condition, so the field is also
overridden in the 1-core-per-package case. This is safe because
cores_per_pkg cannot be 0 and will always be at least 1.

Fixes: d7caf13b5fcf ("x86: cpu: fixup number of addressable IDs for logical processors sharing cache")
Signed-off-by: Guixiong Wei
Signed-off-by: Yipeng Yin
Signed-off-by: Chuang Xu
---
 target/i386/cpu.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index bc2dceb647..b68f7460db 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6426,10 +6426,8 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         if (*eax & 31) {
             int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14);

-            if (cores_per_pkg > 1) {
-                *eax &= ~0xFC000000;
-                *eax |= max_core_ids_in_package(&topo_info) << 26;
-            }
+            *eax &= ~0xFC000000;
+            *eax |= max_core_ids_in_package(&topo_info) << 26;
             if (host_vcpus_per_cache > threads_per_pkg) {
                 *eax &= ~0x3FFC000;
-- 
2.20.1
[PATCH v2] i386/cpu: fixup number of addressable IDs for processor cores in the physical package
When QEMU is started with:

  -cpu host,host-cache-info=on,l3-cache=off \
  -smp 2,sockets=1,dies=1,cores=1,threads=2

the guest can't acquire the maximum number of addressable IDs for
processor cores in the physical package from CPUID[04H].

When testing Intel TDX, the guest attempts to acquire the extended
topology from CPUID[0BH], but because the TDX module doesn't provide
emulation of CPUID[0BH], the guest will instead acquire the extended
topology from CPUID[04H]. However, due to QEMU's inaccurate emulation
of CPUID[04H], one of the vcpus in a 2-core TDX guest would be
offline.

Fix it by removing the unnecessary condition.

Fixes: d7caf13b5fcf742e5680c1d3448ba070fc811644 ("x86: cpu: fixup number of addressable IDs for logical processors sharing cache")
Signed-off-by: Guixiong Wei
Signed-off-by: Yipeng Yin
Signed-off-by: Chuang Xu
---
 target/i386/cpu.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index bc2dceb647..b68f7460db 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6426,10 +6426,8 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         if (*eax & 31) {
             int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14);

-            if (cores_per_pkg > 1) {
-                *eax &= ~0xFC000000;
-                *eax |= max_core_ids_in_package(&topo_info) << 26;
-            }
+            *eax &= ~0xFC000000;
+            *eax |= max_core_ids_in_package(&topo_info) << 26;
             if (host_vcpus_per_cache > threads_per_pkg) {
                 *eax &= ~0x3FFC000;
-- 
2.20.1
Re: [PATCH] x86: cpu: fixup number of addressable IDs for processor cores in the physical package
Hi Zhao,

On 2024/5/28 10:31, Zhao Liu wrote:
> Hi Chuang,
>
> On Mon, May 27, 2024 at 11:13:33AM +0800, Chuang Xu wrote:
>> Date: Mon, 27 May 2024 11:13:33 +0800
>> From: Chuang Xu
>> Subject: [PATCH] x86: cpu: fixup number of addressable IDs for
>>  processor cores in the physical package
>
> According to the usual practice of QEMU commits, people tend to use
> "i386/cpu" as the subject prefix, which indicates the code path.
>
>> X-Mailer: git-send-email 2.24.3 (Apple Git-128)
>>
>> When QEMU is started with:
>>
>>   -cpu host,host-cache-info=on,l3-cache=off \
>
> Just a discussion: "l3-cache=off" doesn't work in the host cache
> passthrough case. Do you have a specific need that you don't want to
> see the L3 cache?

No specific need, just generated by libvirt.

>>   -smp 2,sockets=1,dies=1,cores=1,threads=2
>>
>> Guest can't acquire maximum number of addressable IDs for processor
>> cores in the physical package from CPUID[04H].
>>
>> This bug was introduced in commit
>> d7caf13b5fcf742e5680c1d3448ba070fc811644. Fix it by changing the
>> judgement condition to >= 1.
>
> Pls add a "Fixes" tag like:
>
> Fixes: d7caf13b5fcf ("x86: cpu: fixup number of addressable IDs for logical processors sharing cache")
>
> Since this is a historical issue that deserves to be ported to the
> stable branch, you can cc the stable list by:
>
> Cc: qemu-sta...@nongnu.org
>
>> Signed-off-by: Chuang Xu
>
> As the patch sender, it's better to put your signature on the last
> line. ;-)
>
>> Signed-off-by: Guixiong Wei
>> Signed-off-by: Yipeng Yin
>> ---
>>  target/i386/cpu.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
>> index cd16cb893d..0369c01153 100644
>> --- a/target/i386/cpu.c
>> +++ b/target/i386/cpu.c
>> @@ -6097,7 +6097,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>>          if (*eax & 31) {
>>              int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14);
>>              int vcpus_per_socket = cs->nr_cores * cs->nr_threads;
>> -            if (cs->nr_cores > 1) {
>> +            if (cs->nr_cores >= 1) {
>
> Like Igor suggested, this condition could be removed since
> cs->nr_cores can't be 0.
>
>>                  *eax &= ~0xFC000000;
>>                  *eax |= (pow2ceil(cs->nr_cores) - 1) << 26;
>>              }
>
> ...the code is outdated, pls rebase on the latest master branch.

My fault, sorry for forgetting to pull the latest code..

> Regards,
> Zhao

Thanks for all your suggestions!

Chuang
[PATCH] x86: cpu: fixup number of addressable IDs for processor cores in the physical package
When QEMU is started with:

  -cpu host,host-cache-info=on,l3-cache=off \
  -smp 2,sockets=1,dies=1,cores=1,threads=2

the guest can't acquire the maximum number of addressable IDs for
processor cores in the physical package from CPUID[04H].

This bug was introduced in commit
d7caf13b5fcf742e5680c1d3448ba070fc811644. Fix it by changing the
judgement condition to >= 1.

Signed-off-by: Chuang Xu
Signed-off-by: Guixiong Wei
Signed-off-by: Yipeng Yin
---
 target/i386/cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index cd16cb893d..0369c01153 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -6097,7 +6097,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         if (*eax & 31) {
             int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14);
             int vcpus_per_socket = cs->nr_cores * cs->nr_threads;
-            if (cs->nr_cores > 1) {
+            if (cs->nr_cores >= 1) {
                 *eax &= ~0xFC000000;
                 *eax |= (pow2ceil(cs->nr_cores) - 1) << 26;
             }
-- 
2.20.1
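To make the bit arithmetic in the patch above concrete, here is a toy sketch (not QEMU code; the helper names `pow2ceil_u32`, `cpuid4_core_field`, and `fixup_eax` are illustrative) of how CPUID.04H:EAX[bits 31-26] encodes "maximum addressable core IDs per package, minus 1", rounded up to a power of two, and how overriding the field unconditionally behaves for the 1-core case:

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to the next power of two (v >= 1). */
static uint32_t pow2ceil_u32(uint32_t v)
{
    uint32_t p = 1;
    while (p < v) {
        p <<= 1;
    }
    return p;
}

/*
 * Encode the "addressable core IDs" field of CPUID.04H:EAX for a given
 * core count: (pow2ceil(cores) - 1) placed in bits 31-26.
 */
static uint32_t cpuid4_core_field(uint32_t cores_per_pkg)
{
    return (pow2ceil_u32(cores_per_pkg) - 1) << 26;
}

/*
 * Patch the field into a host-provided EAX value, as the fix does
 * unconditionally (i.e. even when cores_per_pkg == 1, so the host's
 * multi-core value no longer leaks into the guest).
 */
static uint32_t fixup_eax(uint32_t host_eax, uint32_t cores_per_pkg)
{
    return (host_eax & ~0xFC000000u) | cpuid4_core_field(cores_per_pkg);
}
```

With 1 core per package the field must encode 0; before the fix, a multi-core host's value in bits 31-26 would survive untouched.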
Re: [PATCH v2 04/11] multifd: Count the number of bytes sent correctly
Hi, Juan,

On 2023/1/30 16:09, Juan Quintela wrote:
> Current code assumes that all pages are whole. That is not true, for
> example, for compression already. Fix it by creating a new field
> ->sent_bytes that includes it.
>
> All ram_counters are used only from the migration thread, so we have
> two options:
> - put a mutex and fill everything when we sent it (not only
>   ram_counters, also qemu_file->xfer_bytes).
> - Create a local variable that implements how much has been sent
>   through each channel. And when we push another packet, we "add" the
>   previous stats.
>
> I chose two due to less changes overall. On the previous code we
> increase transferred and then we sent. Current code goes the other
> way around. It sends the data, and after the fact, it updates the
> counters. Notice that each channel can have a maximum of half a
> megabyte of data without counting, so it is not very important.
>
> Signed-off-by: Juan Quintela
> ---
>  migration/multifd.h | 2 ++
>  migration/multifd.c | 6 ++++--
>  2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/migration/multifd.h b/migration/multifd.h
> index e2802a9ce2..36f899c56f 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -102,6 +102,8 @@ typedef struct {
>      uint32_t flags;
>      /* global number of generated multifd packets */
>      uint64_t packet_num;
> +    /* How many bytes have we sent on the last packet */
> +    uint64_t sent_bytes;
>      /* thread has work to do */
>      int pending_job;
>      /* array of pages to sent.
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 61cafe4c76..cd26b2fda9 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -394,7 +394,6 @@ static int multifd_send_pages(QEMUFile *f)
>      static int next_channel;
>      MultiFDSendParams *p = NULL; /* make happy gcc */
>      MultiFDPages_t *pages = multifd_send_state->pages;
> -    uint64_t transferred;
>
>      if (qatomic_read(&multifd_send_state->exiting)) {
>          return -1;
> @@ -429,7 +428,8 @@ static int multifd_send_pages(QEMUFile *f)
>      p->packet_num = multifd_send_state->packet_num++;
>      multifd_send_state->pages = p->pages;
>      p->pages = pages;
> -    transferred = ((uint64_t) pages->num) * p->page_size + p->packet_len;
> +    uint64_t transferred = p->sent_bytes;
> +    p->sent_bytes = 0;
>      qemu_file_acct_rate_limit(f, transferred);
>      qemu_mutex_unlock(&p->mutex);
>      stat64_add(&ram_atomic_counters.multifd_bytes, transferred);
> @@ -719,6 +719,8 @@ static void *multifd_send_thread(void *opaque)
>          }
>
>          qemu_mutex_lock(&p->mutex);
> +        p->sent_bytes += p->packet_len;
> +        p->sent_bytes += p->next_packet_size;

Consider a scenario where some normal pages are transmitted in the
first round, followed by several consecutive rounds of zero pages.
When zero pages are transmitted, the next_packet_size of the first
round is still incorrectly added to sent_bytes. If we set rate
limiting for dirty page transmission, the transmission performance of
the multifd zero-page check will degrade.

Maybe we should set next_packet_size to 0 in multifd_send_pages()?

>          p->pending_job--;
>          qemu_mutex_unlock(&p->mutex);
>
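The stale-counter scenario described above can be shown with a toy model (this is not QEMU code; `ToyChannel` and its fields merely mimic `packet_len`, `next_packet_size`, and `sent_bytes`). Each send accounts header plus payload; if the payload size is never reset, a zero-page-only round recounts the previous round's payload:

```c
#include <assert.h>
#include <stdint.h>

/* Toy per-channel accounting state, modeled on the discussion above. */
typedef struct {
    uint64_t packet_len;       /* fixed header size per packet */
    uint64_t next_packet_size; /* payload bytes of the queued packet */
    uint64_t sent_bytes;       /* bytes accounted since last flush */
} ToyChannel;

/* Account one packet send; optionally reset the stale payload size,
 * which is the reviewer's suggested fix. */
static void toy_send_packet(ToyChannel *p, int reset_payload_after)
{
    p->sent_bytes += p->packet_len;
    p->sent_bytes += p->next_packet_size;
    if (reset_payload_after) {
        p->next_packet_size = 0;
    }
}

/* Hand the accounted bytes to the rate limiter and clear the counter. */
static uint64_t toy_flush(ToyChannel *p)
{
    uint64_t transferred = p->sent_bytes;
    p->sent_bytes = 0;
    return transferred;
}
```

With a 64-byte header and a 4096-byte first-round payload, a following zero-page round overcounts by 4096 bytes unless the payload size is reset after the send.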
Re: [PATCH v8 0/6] migration: reduce time of loading non-iterable vmstate
Hi, Paolo,

A few months ago, Juan told me that this series requires feedback from
you or someone familiar with the memory API. Could you please review
it and provide some suggestions?

On 2023/3/17 16:18, Chuang Xu wrote:

In this version:

- delete useless line change.
- update comments and commit messages.

The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (starting with the timestamp of source qemu stop and
ending with the timestamp of target qemu start). Most of the time is
spent committing memory region changes repeatedly.

This patch packs all the changes to memory region during the period of
loading non-iterable vmstate in a single memory transaction. With the
increase of devices, this patch will greatly improve the performance.

Here are the test1 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net device
  - 16 4-queue vhost-user-blk device.

          time of loading non-iterable vmstate    downtime
before    112 ms                                  285 ms
after     20 ms                                   194 ms

In test2, we keep the number of devices the same as test1, and reduce
the number of queues per device. Here are the test2 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 8 1-queue vhost-net device
  - 16 1-queue vhost-user-blk device.

          time of loading non-iterable vmstate    downtime
before    65 ms                                   151 ms
after     19 ms                                   100 ms

In test3, we keep the number of queues per device the same as test1, and
reduce the number of devices. Here are the test3 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device.

          time of loading non-iterable vmstate    downtime
before    24 ms                                   51 ms
after     9 ms                                    36 ms

As we can see from the test results above, both the number of queues and
the number of devices have a great impact on the time of loading
non-iterable vmstate. The growth of the number of devices and queues
will lead to more mr commits, and the time consumption caused by the
flatview reconstruction will also increase.

Please review,
Chuang

Thanks!
[PATCH v8 3/6] memory: Introduce memory_region_transaction_do_commit()
Split memory_region_transaction_do_commit() from
memory_region_transaction_commit(). We'll call do_commit() in
address_space_to_flatview() in a later patch.

Signed-off-by: Chuang Xu
---
 softmmu/memory.c | 47 +++++++++++++++++++++++++++---------------------
 1 file changed, 27 insertions(+), 20 deletions(-)

diff --git a/softmmu/memory.c b/softmmu/memory.c
index a992a365d9..33ecc62ee9 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1093,34 +1093,41 @@ void memory_region_transaction_begin(void)
     ++memory_region_transaction_depth;
 }

-void memory_region_transaction_commit(void)
+void memory_region_transaction_do_commit(void)
 {
     AddressSpace *as;

-    assert(memory_region_transaction_depth);
     assert(qemu_mutex_iothread_locked());

-    --memory_region_transaction_depth;
-    if (!memory_region_transaction_depth) {
-        if (memory_region_update_pending) {
-            flatviews_reset();
+    if (memory_region_update_pending) {
+        flatviews_reset();

-            MEMORY_LISTENER_CALL_GLOBAL(begin, Forward);
+        MEMORY_LISTENER_CALL_GLOBAL(begin, Forward);

-            QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
-                address_space_set_flatview(as);
-                address_space_update_ioeventfds(as);
-            }
-            memory_region_update_pending = false;
-            ioeventfd_update_pending = false;
-            MEMORY_LISTENER_CALL_GLOBAL(commit, Forward);
-        } else if (ioeventfd_update_pending) {
-            QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
-                address_space_update_ioeventfds(as);
-            }
-            ioeventfd_update_pending = false;
+        QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
+            address_space_set_flatview(as);
+            address_space_update_ioeventfds(as);
+        }
+        memory_region_update_pending = false;
+        ioeventfd_update_pending = false;
+        MEMORY_LISTENER_CALL_GLOBAL(commit, Forward);
+    } else if (ioeventfd_update_pending) {
+        QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
+            address_space_update_ioeventfds(as);
         }
-    }
+        ioeventfd_update_pending = false;
+    }
+}
+
+void memory_region_transaction_commit(void)
+{
+    assert(memory_region_transaction_depth);
+    assert(qemu_mutex_iothread_locked());
+
+    --memory_region_transaction_depth;
+    if (!memory_region_transaction_depth) {
+        memory_region_transaction_do_commit();
+    }
+}

 static void memory_region_destructor_none(MemoryRegion *mr)
-- 
2.20.1
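The split above can be illustrated with a minimal sketch of the depth-counted transaction pattern (illustrative only, not QEMU code; the `txn_*` names are made up): commit() only decrements the nesting depth, and the real work in do_commit() runs once, when the outermost transaction ends:

```c
#include <assert.h>

/* Global transaction state, mirroring the pattern in the patch. */
static int txn_depth;      /* nesting depth of begin/commit pairs */
static int update_pending; /* something changed since last commit */
static int commits_done;   /* counts actual rebuilds, for illustration */

static void txn_begin(void)
{
    txn_depth++;
}

/* Perform the pending update unconditionally (the do_commit() part). */
static void txn_do_commit(void)
{
    if (update_pending) {
        commits_done++; /* e.g. rebuild flatviews, notify listeners */
        update_pending = 0;
    }
}

/* Only the outermost commit triggers the real work. */
static void txn_commit(void)
{
    assert(txn_depth > 0);
    if (--txn_depth == 0) {
        txn_do_commit();
    }
}

/* Simulate a memory-region change inside a transaction. */
static void txn_touch_region(void)
{
    update_pending = 1;
}
```

Nested begin/commit pairs therefore coalesce into a single do_commit() at the end of the outermost transaction.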
[PATCH v8 0/6] migration: reduce time of loading non-iterable vmstate
In this version:

- delete useless line change.
- update comments and commit messages.

The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (starting with the timestamp of source qemu stop and
ending with the timestamp of target qemu start). Most of the time is
spent committing memory region changes repeatedly.

This patch packs all the changes to memory region during the period of
loading non-iterable vmstate in a single memory transaction. With the
increase of devices, this patch will greatly improve the performance.

Here are the test1 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net device
  - 16 4-queue vhost-user-blk device.

          time of loading non-iterable vmstate    downtime
before    112 ms                                  285 ms
after     20 ms                                   194 ms

In test2, we keep the number of devices the same as test1, and reduce
the number of queues per device. Here are the test2 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 8 1-queue vhost-net device
  - 16 1-queue vhost-user-blk device.

          time of loading non-iterable vmstate    downtime
before    65 ms                                   151 ms
after     19 ms                                   100 ms

In test3, we keep the number of queues per device the same as test1, and
reduce the number of devices. Here are the test3 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device.

          time of loading non-iterable vmstate    downtime
before    24 ms                                   51 ms
after     9 ms                                    36 ms

As we can see from the test results above, both the number of queues and
the number of devices have a great impact on the time of loading
non-iterable vmstate. The growth of the number of devices and queues
will lead to more mr commits, and the time consumption caused by the
flatview reconstruction will also increase.
Please review,
Chuang

[v7]
- introduce address_space_to_flatview_rcu().
- squash peter's fix into patch 1.
- rebase to latest upstream.
- update test results.

[v6]
- add peter's patch.
- split mr_do_commit() from mr_commit().
- adjust the sanity check in address_space_to_flatview().
- rebase to latest upstream.
- replace 8260 with 8362 as testing host.
- update the latest test results.

[v5]
- rename rcu_read_locked() to rcu_read_is_locked().
- adjust the sanity check in address_space_to_flatview().
- improve some comments.

[v4]
- attach more information in the cover letter.
- remove changes on virtio_load.
- add rcu_read_locked() to detect holding of rcu lock.

[v3]
- move virtio_load_check_delay() from virtio_memory_listener_commit()
  to virtio_vmstate_change().
- add delay_check flag to VirtIODevice to make sure
  virtio_load_check_delay() will be called when delay_check is true.

[v2]
- rebase to latest upstream.
- add sanity check to address_space_to_flatview().
- postpone the init of the vring cache until migration's loading
  completes.

[v1]
The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (starting with the timestamp of source qemu stop and
ending with the timestamp of target qemu start). Most of the time is
spent committing memory region changes repeatedly.

This patch packs all the changes to memory region during the period of
loading non-iterable vmstate in a single memory transaction. With the
increase of devices, this patch will greatly improve the performance.

Here are the test results:
test vm info:
- 32 CPUs 128GB RAM
- 8 16-queue vhost-net device
- 16 4-queue vhost-user-blk device.

        time of loading non-iterable vmstate
before  about 210 ms
after   about 40 ms
[PATCH v8 2/6] rcu: Introduce rcu_read_is_locked()
Add rcu_read_is_locked() to detect holding of the rcu read lock.

Signed-off-by: Chuang Xu
---
 include/qemu/rcu.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index 313fc414bc..7bf45602e1 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -115,6 +115,13 @@ static inline void rcu_read_unlock(void)
     }
 }

+static inline bool rcu_read_is_locked(void)
+{
+    struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader();
+
+    return p_rcu_reader->depth > 0;
+}
+
 extern void synchronize_rcu(void);

 /*
-- 
2.20.1
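What the new helper checks can be sketched in a few lines (a simplified model, not QEMU's implementation, which keeps the depth in per-thread `rcu_reader_data`): the reader maintains a thread-local nesting depth, and "is locked" simply means the depth is nonzero, so it works correctly with nested read sections:

```c
#include <assert.h>

/* Thread-local nesting depth of the toy RCU read-side critical section. */
static _Thread_local int rcu_depth;

static void toy_rcu_read_lock(void)
{
    rcu_depth++;
}

static void toy_rcu_read_unlock(void)
{
    assert(rcu_depth > 0);
    rcu_depth--;
}

/* The analogue of rcu_read_is_locked(): depth > 0 means "inside". */
static int toy_rcu_read_is_locked(void)
{
    return rcu_depth > 0;
}
```

Because the depth is thread-local, the check only reports on the calling thread's read-side state, which is exactly what the later sanity checks in address_space_to_flatview() need.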
[PATCH v8 4/6] memory: Add do_commit() and sanity check in address_space_to_flatview
In the next patch, we wrap vm_load in a whole memory region
transaction to speed up vm_load. This brings a problem: old flatviews
may be referenced during vm_load. vm_load contains far more memory
updates than references to a specific flatview, hence we introduce
do_commit to make sure address_space_to_flatview() will return the
newest flatview; it should logically only be triggered in a few spots
during vm_load. Other than that, sanity check that the BQL or the RCU
read lock is held before using any flatview.

Signed-off-by: Chuang Xu
---
 include/exec/memory.h | 23 +++++++++++++++++++++++
 softmmu/memory.c      |  5 +++++
 2 files changed, 28 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 6fa0b071f0..d6fd89db64 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -27,6 +27,7 @@
 #include "qemu/notify.h"
 #include "qom/object.h"
 #include "qemu/rcu.h"
+#include "qemu/main-loop.h"

 #define RAM_ADDR_INVALID (~(ram_addr_t)0)

@@ -1095,8 +1096,30 @@ struct FlatView {
     MemoryRegion *root;
 };

+bool memory_region_transaction_in_progress(void);
+
+void memory_region_transaction_do_commit(void);
+
 static inline FlatView *address_space_to_flatview(AddressSpace *as)
 {
+    if (qemu_mutex_iothread_locked()) {
+        /* We exclusively own the flatview now.. */
+        if (memory_region_transaction_in_progress()) {
+            /*
+             * Fetch the flatview within a transaction in-progress, it
+             * means current_map may not be the latest, we need to update
+             * immediately to make sure the caller won't see obsolete
+             * mapping.
+             */
+            memory_region_transaction_do_commit();
+        }
+
+        /* No further protection needed to access current_map */
+        return as->current_map;
+    }
+
+    /* Otherwise we must have had the RCU lock or something went wrong */
+    assert(rcu_read_is_locked());
     return qatomic_rcu_read(&as->current_map);
 }

diff --git a/softmmu/memory.c b/softmmu/memory.c
index 33ecc62ee9..6a8e8b4e71 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1130,6 +1130,11 @@ void memory_region_transaction_commit(void)
     }
 }

+bool memory_region_transaction_in_progress(void)
+{
+    return memory_region_transaction_depth != 0;
+}
+
 static void memory_region_destructor_none(MemoryRegion *mr)
 {
 }
-- 
2.20.1
[PATCH v8 1/6] memory: Reference as->current_map directly in memory commit
From: Peter Xu

Calling RCU variants of address_space_get|to_flatview() during memory
commit (flatview updates, triggering memory listeners, or updating
ioeventfds, etc.) is not 100% accurate, because commit() requires BQL
rather than the RCU read lock, so the context exclusively owns
current_map and it can be directly referenced. Neither does it need a
refcount on current_map, because it cannot be freed from under the
caller.

Add address_space_to_flatview_raw() for the case where the context
holds the BQL rather than the RCU read lock, use it across the core
memory updates, and drop the extra refcounts on FlatView*.

Signed-off-by: Peter Xu
---
 softmmu/memory.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/softmmu/memory.c b/softmmu/memory.c
index 4699ba55ec..a992a365d9 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -61,6 +61,13 @@ struct AddrRange {
     Int128 size;
 };

+/* Called with BQL held */
+static inline FlatView *address_space_to_flatview_raw(AddressSpace *as)
+{
+    assert(qemu_mutex_iothread_locked());
+    return as->current_map;
+}
+
 static AddrRange addrrange_make(Int128 start, Int128 size)
 {
     return (AddrRange) { start, size };
@@ -155,7 +162,7 @@ enum ListenerDirection { Forward, Reverse };
 #define MEMORY_LISTENER_UPDATE_REGION(fr, as, dir, callback, _args...)  \
     do {                                                                \
         MemoryRegionSection mrs = section_from_flat_range(fr,           \
-                address_space_to_flatview(as));                         \
+                address_space_to_flatview_raw(as));                     \
         MEMORY_LISTENER_CALL(as, callback, dir, &mrs, ##_args);         \
     } while(0)

@@ -753,6 +760,7 @@ static FlatView *generate_memory_topology(MemoryRegion *mr)
 }

 static void address_space_add_del_ioeventfds(AddressSpace *as,
+                                             FlatView *view,
                                              MemoryRegionIoeventfd *fds_new,
                                              unsigned fds_new_nb,
                                              MemoryRegionIoeventfd *fds_old,
@@ -774,7 +782,7 @@ static void address_space_add_del_ioeventfds(AddressSpace *as,
                                                   &fds_new[inew]))) {
             fd = &fds_old[iold];
             section = (MemoryRegionSection) {
-                .fv = address_space_to_flatview(as),
+                .fv = view,
                 .offset_within_address_space = int128_get64(fd->addr.start),
                 .size = fd->addr.size,
             };
@@ -787,7 +795,7 @@ static void address_space_add_del_ioeventfds(AddressSpace *as,
                                                   &fds_old[iold]))) {
             fd = &fds_new[inew];
             section = (MemoryRegionSection) {
-                .fv = address_space_to_flatview(as),
+                .fv = view,
                 .offset_within_address_space = int128_get64(fd->addr.start),
                 .size = fd->addr.size,
             };
@@ -833,7 +841,7 @@ static void address_space_update_ioeventfds(AddressSpace *as)
     ioeventfd_max = QEMU_ALIGN_UP(as->ioeventfd_nb, 4);
     ioeventfds = g_new(MemoryRegionIoeventfd, ioeventfd_max);
-    view = address_space_get_flatview(as);
+    view = address_space_to_flatview_raw(as);
     FOR_EACH_FLAT_RANGE(fr, view) {
         for (i = 0; i < fr->mr->ioeventfd_nb; ++i) {
             tmp = addrrange_shift(fr->mr->ioeventfds[i].addr,
@@ -852,13 +860,12 @@ static void address_space_update_ioeventfds(AddressSpace *as)
         }
     }

-    address_space_add_del_ioeventfds(as, ioeventfds, ioeventfd_nb,
+    address_space_add_del_ioeventfds(as, view, ioeventfds, ioeventfd_nb,
                                      as->ioeventfds, as->ioeventfd_nb);

     g_free(as->ioeventfds);
     as->ioeventfds = ioeventfds;
     as->ioeventfd_nb = ioeventfd_nb;
-    flatview_unref(view);
 }

 /*
@@ -1026,7 +1033,7 @@ static void flatviews_reset(void)

 static void address_space_set_flatview(AddressSpace *as)
 {
-    FlatView *old_view = address_space_to_flatview(as);
+    FlatView *old_view = address_space_to_flatview_raw(as);
     MemoryRegion *physmr = memory_region_get_flatview_root(as->root);
     FlatView *new_view = g_hash_table_lookup(flat_views, physmr);
@@ -2979,8 +2986,7 @@ static void listener_add_address_space(MemoryListener *listener,
             listener->log_global_start(listener);
         }
     }
-
-    view = address_space_get_flatview(as);
+    view = address_space_to_flatview_raw(as);
     FOR_EACH_FLAT_RANGE(fr, view) {
         MemoryRegionSection section = section_from_flat_range(fr, view);
@@ -2994,7 +3000,6 @@ static void listener_add_address_space(MemoryListener *listener,
     if (listener->commit) {
         listener->commit(listener);
     }
-    flatview_unref(view);
 }

 static void listener_del_address_space(MemoryListener *listener,
@@
[PATCH v8 6/6] memory: Introduce address_space_to_flatview_rcu()
In the last patch, we wrapped vm_load with begin/commit; here we
introduce address_space_to_flatview_rcu() to avoid unnecessary
enforced commits during vm_load.

Signed-off-by: Chuang Xu
---
 include/exec/memory.h | 17 +++++++++++++++++
 softmmu/memory.c      |  2 +-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index d6fd89db64..2bf702dc94 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1123,6 +1123,23 @@ static inline FlatView *address_space_to_flatview(AddressSpace *as)
     return qatomic_rcu_read(&as->current_map);
 }

+/*
+ * We recommend using address_space_to_flatview() rather than this one.
+ * Note that if we use this during a memory region transaction, we may
+ * see obsolete flatviews. We must bear with an obsolete map until commit.
+ * And outside a memory region transaction, this is basically the same as
+ * address_space_to_flatview().
+ */
+static inline FlatView *address_space_to_flatview_rcu(AddressSpace *as)
+{
+    /*
+     * Before using any flatview, sanity check BQL or RCU is held.
+     */
+    assert(qemu_mutex_iothread_locked() || rcu_read_is_locked());
+
+    return qatomic_rcu_read(&as->current_map);
+}
+
 /**
  * typedef flatview_cb: callback for flatview_for_each_range()
  *
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 6a8e8b4e71..33d14e967d 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -815,7 +815,7 @@ FlatView *address_space_get_flatview(AddressSpace *as)
     RCU_READ_LOCK_GUARD();
     do {
-        view = address_space_to_flatview(as);
+        view = address_space_to_flatview_rcu(as);
         /* If somebody has replaced as->current_map concurrently,
          * flatview_ref returns false.
          */
-- 
2.20.1
[PATCH v8 5/6] migration: Reduce time of loading non-iterable vmstate
The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (starting with the timestamp of source qemu stop and
ending with the timestamp of target qemu start). Most of the time is
spent committing memory region changes repeatedly.

This patch packs all the changes to memory region during the period of
loading non-iterable vmstate in a single memory transaction. With the
increase of devices, this patch will greatly improve the performance.

Note that the following test results are based on the application of
the next patch. Without the next patch, the improvement will be
reduced.

Here are the test1 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net device
  - 16 4-queue vhost-user-blk device.

          time of loading non-iterable vmstate    downtime
before    about 112 ms                            285 ms
after     about 20 ms                             194 ms

In test2, we keep the number of devices the same as test1, and reduce
the number of queues per device. Here are the test2 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 8 1-queue vhost-net device
  - 16 1-queue vhost-user-blk device.

          time of loading non-iterable vmstate    downtime
before    about 65 ms                             about 151 ms
after     about 19 ms                             about 100 ms

In test3, we keep the number of queues per device the same as test1,
and reduce the number of devices. Here are the test3 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device.

          time of loading non-iterable vmstate    downtime
before    about 24 ms                             about 51 ms
after     about 9 ms                              about 36 ms

As we can see from the test results above, both the number of queues
and the number of devices have a great impact on the time of loading
non-iterable vmstate. The growth of the number of devices and queues
will lead to more mr commits, and the time consumption caused by the
flatview reconstruction will also increase.

Signed-off-by: Chuang Xu
---
 migration/savevm.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index aa54a67fda..ecf7b27000 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2787,7 +2787,25 @@ int qemu_loadvm_state(QEMUFile *f)

     cpu_synchronize_all_pre_loadvm();

+    /*
+     * Call memory_region_transaction_begin() before loading vmstate.
+     * This call is paired with memory_region_transaction_commit() at
+     * the end of qemu_loadvm_state_main(), in order to pack all the
+     * changes to memory region during the period of loading
+     * non-iterable vmstate in a single memory transaction.
+     * This operation will reduce time of loading non-iterable vmstate
+     */
+    memory_region_transaction_begin();
+
     ret = qemu_loadvm_state_main(f, mis);
+
+    /*
+     * Call memory_region_transaction_commit() after loading vmstate.
+     * At this point, qemu actually completes all the previous memory
+     * region transactions.
+     */
+    memory_region_transaction_commit();
+
     qemu_event_set(&mis->main_thread_load_event);

     trace_qemu_loadvm_state_post_main(ret);
-- 
2.20.1
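Why the outer begin/commit pair pays off can be modeled with a small sketch (illustrative only, not QEMU code; `load_one_device` and `load_all` are made-up names standing in for per-device vmstate loads): each device load touches memory regions inside its own nested transaction, so without an outer transaction every device triggers its own rebuild, while with one outer transaction all updates collapse into a single rebuild at the end:

```c
#include <assert.h>

/* Toy transaction state, as in the memory-region transaction pattern. */
static int depth;
static int pending;
static int rebuilds; /* stands in for flatview reconstructions */

static void begin(void)
{
    depth++;
}

static void commit(void)
{
    assert(depth > 0);
    if (--depth == 0 && pending) {
        rebuilds++; /* the expensive part: rebuild + notify listeners */
        pending = 0;
    }
}

/* One device's vmstate load: a begin/commit pair around its updates. */
static void load_one_device(void)
{
    begin();
    pending = 1; /* the device reconfigured some memory region */
    commit();
}

/* Load n devices, optionally batched inside one outer transaction. */
static int load_all(int n, int batched)
{
    rebuilds = 0;
    if (batched) {
        begin();
    }
    for (int i = 0; i < n; i++) {
        load_one_device();
    }
    if (batched) {
        commit();
    }
    return rebuilds;
}
```

This mirrors the measurements above: the cost scales with the number of commits, and batching makes it constant in the number of devices.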
Re: [PATCH v7 6/6] memory: Introduce address_space_to_flatview_rcu()
Hi, Peter,

On 2023/3/10 23:08, Peter Xu wrote:
> On Fri, Mar 10, 2023 at 10:24:25AM +0800, Chuang Xu wrote:
>> In last patch, we wrap vm_load with begin/commit, here we introduce
>> address_space_to_flatview_rcu() to avoid unnecessary enforce commit
>> during vm_load.
>>
>> Signed-off-by: Chuang Xu
>> ---
>>  include/exec/memory-internal.h |  2 +-
>>  include/exec/memory.h          | 20 ++++++++++++++++++++
>>  softmmu/memory.c               |  2 +-
>>  3 files changed, 22 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/exec/memory-internal.h b/include/exec/memory-internal.h
>> index 100c1237ac..1432240449 100644
>> --- a/include/exec/memory-internal.h
>> +++ b/include/exec/memory-internal.h
>> @@ -30,7 +30,7 @@ static inline AddressSpaceDispatch *flatview_to_dispatch(FlatView *fv)
>>
>>  static inline AddressSpaceDispatch *address_space_to_dispatch(AddressSpace *as)
>>  {
>> -    return flatview_to_dispatch(address_space_to_flatview(as));
>> +    return flatview_to_dispatch(address_space_to_flatview_rcu(as));
>>  }
>
> I'm not sure whether this one is always safe.

Previously I considered that there was no address_space_translate_iommu()
traced during vm_load, so I took it as safe. But my trace may not be able
to obtain all cases of triggering do_commit() during vm_load..

> tcg_commit() seems to be safe, but maybe
> address_space_translate_iommu() is not?  Maybe easier to leave this
> untouched?

Yes, I'll fix it in v8 later.

>> FlatView *address_space_get_flatview(AddressSpace *as);
>>
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index d6fd89db64..235e3017bc 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -1100,6 +1100,9 @@ bool memory_region_transaction_in_progress(void);
>>
>>  void memory_region_transaction_do_commit(void);
>>
>> +/*
>> + * We recommend using this by default.
>> + */
>
> I think this comment doesn't really help.. drop it?
>
>>  static inline FlatView *address_space_to_flatview(AddressSpace *as)

Thanks!
Re: [PATCH v7 3/6] memory: Introduce memory_region_transaction_do_commit()
Hi, Peter,

On 2023/3/10 22:51, Peter Xu wrote:
> On Fri, Mar 10, 2023 at 10:24:22AM +0800, Chuang Xu wrote:
>> Split memory_region_transaction_do_commit() from
>> memory_region_transaction_commit(). We'll call do_commit() in
>> address_space_to_flatview() in the later patch.
>>
>> Signed-off-by: Chuang Xu
>
> [...]
>
>> +void memory_region_transaction_commit(void)
>> +{
>> +    assert(memory_region_transaction_depth);
>> +    assert(qemu_mutex_iothread_locked());
>
> This context has nothing to assert BQL, so this can be dropped I think
> (you have one in do_commit).

do_commit() will be triggered only when depth is 0. Before do_commit()
is triggered, we must ensure that the BQL is held when the depth is
modified.

>> +
>> +    --memory_region_transaction_depth;
>> +    if (!memory_region_transaction_depth) {
>> +        memory_region_transaction_do_commit();
>> +    }
>> +}
>
> With above dropped:
>
> Reviewed-by: Peter Xu

Thanks!
[PATCH v7 6/6] memory: Introduce address_space_to_flatview_rcu()
In the last patch, we wrapped vm_load with begin/commit; here we introduce address_space_to_flatview_rcu() to avoid unnecessary enforced commits during vm_load. Signed-off-by: Chuang Xu --- include/exec/memory-internal.h | 2 +- include/exec/memory.h | 20 softmmu/memory.c | 2 +- 3 files changed, 22 insertions(+), 2 deletions(-) diff --git a/include/exec/memory-internal.h b/include/exec/memory-internal.h index 100c1237ac..1432240449 100644 --- a/include/exec/memory-internal.h +++ b/include/exec/memory-internal.h @@ -30,7 +30,7 @@ static inline AddressSpaceDispatch *flatview_to_dispatch(FlatView *fv) static inline AddressSpaceDispatch *address_space_to_dispatch(AddressSpace *as) { -return flatview_to_dispatch(address_space_to_flatview(as)); +return flatview_to_dispatch(address_space_to_flatview_rcu(as)); } FlatView *address_space_get_flatview(AddressSpace *as); diff --git a/include/exec/memory.h b/include/exec/memory.h index d6fd89db64..235e3017bc 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -1100,6 +1100,9 @@ bool memory_region_transaction_in_progress(void); void memory_region_transaction_do_commit(void); +/* + * We recommend using this by default. + */ static inline FlatView *address_space_to_flatview(AddressSpace *as) { if (qemu_mutex_iothread_locked()) { @@ -1123,6 +1126,23 @@ static inline FlatView *address_space_to_flatview(AddressSpace *as) return qatomic_rcu_read(&as->current_map); } +/* + * We recommend using address_space_to_flatview() rather than this one. + * Note that if we use this during a memory region transaction, we may + * see obsolete flatviews. We must bear with an obsolete map until commit. + * And outside a memory region transaction, this is basically the same as + * address_space_to_flatview(). + */ +static inline FlatView *address_space_to_flatview_rcu(AddressSpace *as) +{ +/* + * Before using any flatview, sanity check BQL or RCU is held. + */ +assert(qemu_mutex_iothread_locked() || rcu_read_is_locked()); + +return qatomic_rcu_read(&as->current_map); +} + /** * typedef flatview_cb: callback for flatview_for_each_range() * diff --git a/softmmu/memory.c b/softmmu/memory.c index 6a8e8b4e71..33d14e967d 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -815,7 +815,7 @@ FlatView *address_space_get_flatview(AddressSpace *as) RCU_READ_LOCK_GUARD(); do { -view = address_space_to_flatview(as); +view = address_space_to_flatview_rcu(as); /* If somebody has replaced as->current_map concurrently, * flatview_ref returns false. */ -- 2.20.1
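The "obsolete flatview" semantics described in the new comment can be illustrated with a toy model: a generation counter stands in for FlatView, the default accessor forces a commit so a BQL holder never sees an obsolete map, while the _rcu() variant returns whatever map is currently published, possibly stale mid-transaction. Everything below is an illustrative sketch, not QEMU code:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of address_space_to_flatview() vs. the _rcu() variant while a
 * memory region transaction is open. An int generation stands in for the
 * published FlatView; all names are illustrative. */
static bool bql_held = true;  /* assume the BQL is held in this sketch */
static int  depth;            /* open transaction nesting */
static int  pending_gen;      /* generation the next commit will publish */
static int  current_gen;      /* generation of the currently published map */

static void begin(void)     { depth++; }
static void do_commit(void) { current_gen = pending_gen; }
static void commit(void)    { if (--depth == 0) do_commit(); }
static void update(void)    { pending_gen++; }  /* a memory region change */

/* Default accessor: enforce a commit so the caller never sees a stale map */
static int to_flatview(void)
{
    assert(bql_held);
    if (depth > 0) {
        do_commit();
    }
    return current_gen;
}

/* _rcu() accessor: may return an obsolete map until the next commit */
static int to_flatview_rcu(void)
{
    return current_gen;
}
```

The sketch shows why _rcu() is cheap during vm_load: it skips the enforced commit, at the cost of possibly observing the pre-transaction map.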
[PATCH v7 4/6] memory: Add sanity check in address_space_to_flatview
Before using any flatview, sanity check whether the BQL or the RCU read lock is held. And if we're in the middle of a memory region transaction, try to update mappings immediately, otherwise the map can be stale. Signed-off-by: Chuang Xu --- include/exec/memory.h | 23 +++ softmmu/memory.c | 5 + 2 files changed, 28 insertions(+) diff --git a/include/exec/memory.h b/include/exec/memory.h index 6fa0b071f0..d6fd89db64 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -27,6 +27,7 @@ #include "qemu/notify.h" #include "qom/object.h" #include "qemu/rcu.h" +#include "qemu/main-loop.h" #define RAM_ADDR_INVALID (~(ram_addr_t)0) @@ -1095,8 +1096,30 @@ struct FlatView { MemoryRegion *root; }; +bool memory_region_transaction_in_progress(void); + +void memory_region_transaction_do_commit(void); + static inline FlatView *address_space_to_flatview(AddressSpace *as) { +if (qemu_mutex_iothread_locked()) { +/* We exclusively own the flatview now.. */ +if (memory_region_transaction_in_progress()) { +/* + * Fetch the flatview within a transaction in-progress, it + * means current_map may not be the latest, we need to update + * immediately to make sure the caller won't see obsolete + * mapping. + */ +memory_region_transaction_do_commit(); +} + +/* No further protection needed to access current_map */ +return as->current_map; +} + +/* Otherwise we must have had the RCU lock or something went wrong */ +assert(rcu_read_is_locked()); return qatomic_rcu_read(&as->current_map); } diff --git a/softmmu/memory.c b/softmmu/memory.c index 33ecc62ee9..6a8e8b4e71 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -1130,6 +1130,11 @@ void memory_region_transaction_commit(void) } } +bool memory_region_transaction_in_progress(void) +{ +return memory_region_transaction_depth != 0; +} + static void memory_region_destructor_none(MemoryRegion *mr) { } -- 2.20.1
[PATCH v7 2/6] rcu: Introduce rcu_read_is_locked()
Add rcu_read_is_locked() to detect whether the RCU read lock is held. Signed-off-by: Chuang Xu --- include/qemu/rcu.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h index 313fc414bc..7bf45602e1 100644 --- a/include/qemu/rcu.h +++ b/include/qemu/rcu.h @@ -115,6 +115,13 @@ static inline void rcu_read_unlock(void) } } +static inline bool rcu_read_is_locked(void) +{ +struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader(); + +return p_rcu_reader->depth > 0; +} + extern void synchronize_rcu(void); /* -- 2.20.1
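The helper simply reads the per-thread nesting depth that QEMU's RCU implementation already keeps for nested rcu_read_lock() calls. A minimal standalone sketch of the same idea (the __thread counter and names are illustrative stand-ins, not QEMU's actual rcu_reader_data):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a nesting-aware RCU read lock: a per-thread depth counter
 * is incremented/decremented on lock/unlock, and "is locked" is just a
 * check for depth > 0, exactly the shape of rcu_read_is_locked(). */
static __thread int rcu_depth;

static inline void rcu_read_lock(void)
{
    rcu_depth++;                 /* nested locking just deepens the count */
}

static inline void rcu_read_unlock(void)
{
    assert(rcu_depth > 0);       /* unlock without lock is a bug */
    rcu_depth--;
}

static inline bool rcu_read_is_locked(void)
{
    return rcu_depth > 0;
}
```

Because the counter is per-thread, the check is race-free without any atomics: a thread can only ask about its own read-side critical sections.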
[PATCH v7 3/6] memory: Introduce memory_region_transaction_do_commit()
Split memory_region_transaction_do_commit() from memory_region_transaction_commit(). We'll call do_commit() in address_space_to_flatview() in a later patch. Signed-off-by: Chuang Xu --- softmmu/memory.c | 47 +++ 1 file changed, 27 insertions(+), 20 deletions(-) diff --git a/softmmu/memory.c b/softmmu/memory.c index a992a365d9..33ecc62ee9 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -1093,34 +1093,41 @@ void memory_region_transaction_begin(void) ++memory_region_transaction_depth; } -void memory_region_transaction_commit(void) +void memory_region_transaction_do_commit(void) { AddressSpace *as; -assert(memory_region_transaction_depth); assert(qemu_mutex_iothread_locked()); ---memory_region_transaction_depth; -if (!memory_region_transaction_depth) { -if (memory_region_update_pending) { -flatviews_reset(); +if (memory_region_update_pending) { +flatviews_reset(); -MEMORY_LISTENER_CALL_GLOBAL(begin, Forward); +MEMORY_LISTENER_CALL_GLOBAL(begin, Forward); -QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) { -address_space_set_flatview(as); -address_space_update_ioeventfds(as); -} -memory_region_update_pending = false; -ioeventfd_update_pending = false; -MEMORY_LISTENER_CALL_GLOBAL(commit, Forward); -} else if (ioeventfd_update_pending) { -QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) { -address_space_update_ioeventfds(as); -} -ioeventfd_update_pending = false; +QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) { +address_space_set_flatview(as); +address_space_update_ioeventfds(as); +} +memory_region_update_pending = false; +ioeventfd_update_pending = false; +MEMORY_LISTENER_CALL_GLOBAL(commit, Forward); +} else if (ioeventfd_update_pending) { +QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) { +address_space_update_ioeventfds(as); } - } +ioeventfd_update_pending = false; +} +} + +void memory_region_transaction_commit(void) +{ +assert(memory_region_transaction_depth); +assert(qemu_mutex_iothread_locked()); + +--memory_region_transaction_depth; +if (!memory_region_transaction_depth) { +memory_region_transaction_do_commit(); +} } static void memory_region_destructor_none(MemoryRegion *mr) -- 2.20.1
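The effect of the split can be shown with a toy model: do_commit() unconditionally performs the (counted) rebuild work, while commit() only manages the nesting depth and runs do_commit() when the outermost transaction ends, so nested begin/commit pairs batch all pending updates into one rebuild. All names and counters below are illustrative stand-ins, not QEMU code:

```c
#include <assert.h>

/* Minimal model of the begin()/do_commit()/commit() split. A counter
 * stands in for the flatview rebuild that flatviews_reset() and the
 * listener calls perform in the real code. */
static int transaction_depth;
static int update_pending;
static int rebuild_count;          /* how many times "flatviews" were rebuilt */

static void transaction_do_commit(void)
{
    if (update_pending) {
        rebuild_count++;           /* stands in for the real rebuild work */
        update_pending = 0;
    }
}

static void transaction_begin(void)
{
    transaction_depth++;
}

static void transaction_commit(void)
{
    assert(transaction_depth > 0);
    if (--transaction_depth == 0) {
        transaction_do_commit();   /* only the outermost commit rebuilds */
    }
}

static void region_update(void)    /* a memory region change inside a txn */
{
    update_pending = 1;
}
```

With do_commit() factored out, a caller holding the BQL (such as the sanity-checked address_space_to_flatview() in the next patch) can force the pending state to be published without disturbing the depth bookkeeping.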
[PATCH v7 1/6] memory: Reference as->current_map directly in memory commit
From: Peter Xu Calling the RCU variant of address_space_get|to_flatview() during memory commit (flatview updates, triggering memory listeners, or updating ioeventfds, etc.) is not 100% accurate, because commit() requires BQL rather than RCU read lock, so the context exclusively owns current_map and can reference it directly. Neither does it need a refcount on current_map because it cannot be freed from under the caller. Add address_space_get_flatview_raw() for the case where the context holds BQL rather than RCU read lock and use it across the core memory updates. Drop the extra refcounts on FlatView*. Signed-off-by: Peter Xu --- softmmu/memory.c | 28 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/softmmu/memory.c b/softmmu/memory.c index 4699ba55ec..a992a365d9 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -61,6 +61,13 @@ struct AddrRange { Int128 size; }; +/* Called with BQL held */ +static inline FlatView *address_space_to_flatview_raw(AddressSpace *as) +{ +assert(qemu_mutex_iothread_locked()); +return as->current_map; +} + static AddrRange addrrange_make(Int128 start, Int128 size) { return (AddrRange) { start, size }; } @@ -155,7 +162,7 @@ enum ListenerDirection { Forward, Reverse }; #define MEMORY_LISTENER_UPDATE_REGION(fr, as, dir, callback, _args...) \ do {\ MemoryRegionSection mrs = section_from_flat_range(fr, \ -address_space_to_flatview(as)); \ +address_space_to_flatview_raw(as)); \ MEMORY_LISTENER_CALL(as, callback, dir, &mrs, ##_args); \ } while(0) @@ -753,6 +760,7 @@ static FlatView *generate_memory_topology(MemoryRegion *mr) } static void address_space_add_del_ioeventfds(AddressSpace *as, + FlatView *view, MemoryRegionIoeventfd *fds_new, unsigned fds_new_nb, MemoryRegionIoeventfd *fds_old, @@ -774,7 +782,7 @@ static void address_space_add_del_ioeventfds(AddressSpace *as, &fds_new[inew]))) { fd = &fds_old[iold]; section = (MemoryRegionSection) { -.fv = address_space_to_flatview(as), +.fv = view, .offset_within_address_space = int128_get64(fd->addr.start), .size = fd->addr.size, }; @@ -787,7 +795,7 @@ static void address_space_add_del_ioeventfds(AddressSpace *as, &fds_old[iold]))) { fd = &fds_new[inew]; section = (MemoryRegionSection) { -.fv = address_space_to_flatview(as), +.fv = view, .offset_within_address_space = int128_get64(fd->addr.start), .size = fd->addr.size, }; @@ -833,7 +841,7 @@ static void address_space_update_ioeventfds(AddressSpace *as) ioeventfd_max = QEMU_ALIGN_UP(as->ioeventfd_nb, 4); ioeventfds = g_new(MemoryRegionIoeventfd, ioeventfd_max); -view = address_space_get_flatview(as); +view = address_space_to_flatview_raw(as); FOR_EACH_FLAT_RANGE(fr, view) { for (i = 0; i < fr->mr->ioeventfd_nb; ++i) { tmp = addrrange_shift(fr->mr->ioeventfds[i].addr, @@ -852,13 +860,12 @@ static void address_space_update_ioeventfds(AddressSpace *as) } } -address_space_add_del_ioeventfds(as, ioeventfds, ioeventfd_nb, +address_space_add_del_ioeventfds(as, view, ioeventfds, ioeventfd_nb, as->ioeventfds, as->ioeventfd_nb); g_free(as->ioeventfds); as->ioeventfds = ioeventfds; as->ioeventfd_nb = ioeventfd_nb; -flatview_unref(view); } /* @@ -1026,7 +1033,7 @@ static void flatviews_reset(void) static void address_space_set_flatview(AddressSpace *as) { -FlatView *old_view = address_space_to_flatview(as); +FlatView *old_view = address_space_to_flatview_raw(as); MemoryRegion *physmr = memory_region_get_flatview_root(as->root); FlatView *new_view = g_hash_table_lookup(flat_views, physmr); @@ -2979,8 +2986,7 @@ static void listener_add_address_space(MemoryListener *listener, listener->log_global_start(listener); } } - -view = address_space_get_flatview(as); +view = address_space_to_flatview_raw(as); FOR_EACH_FLAT_RANGE(fr, view) { MemoryRegionSection section = section_from_flat_range(fr, view); @@ -2994,7 +3000,6 @@ static void listener_add_address_space(MemoryListener *listener, if (listener->commit) { listener->commit(listener); } -flatview_unref(view); } static void listener_del_address_space(MemoryListener *listener, @@
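The point of the _raw accessor is that under the BQL the published map cannot be swapped out from under the caller, so neither an RCU read-side critical section nor a reference count is needed. A toy model with a bool for the lock and a counter for FlatView refcounting (all names illustrative, not QEMU's):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch contrasting a refcounting getter with the BQL-only _raw one. */
static bool bql_held;
static int  view_refs = 1;       /* the published map starts with one ref */
static int  the_view  = 42;      /* stands in for as->current_map */

static void acquire_bql(void) { bql_held = true; }

/* _raw: caller must hold the BQL; no ref is taken, none must be dropped */
static int *to_flatview_raw(void)
{
    assert(bql_held);
    return &the_view;
}

/* get: RCU-style accessor that takes a reference for comparison */
static int *get_flatview(void)
{
    view_refs++;                 /* must be paired with flatview_unref() */
    return &the_view;
}

static void flatview_unref(int *v)
{
    (void)v;
    view_refs--;
}
```

This mirrors why the patch can drop the flatview_unref() calls in the commit paths: the _raw accessor never took a reference in the first place.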
[PATCH v7 5/6] migration: Reduce time of loading non-iterable vmstate
The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory regions made during the loading of non-iterable vmstate into a single memory transaction. As the number of devices increases, this patch will greatly improve performance. Note that the following test results are based on the application of the next patch. Without the next patch, the improvement will be reduced.

Here are the test1 results: test info: - Host - Intel(R) Xeon(R) Platinum 8362 CPU - Mellanox Technologies MT28841 - VM - 32 CPUs 128GB RAM VM - 8 16-queue vhost-net device - 16 4-queue vhost-user-blk device.

         time of loading non-iterable vmstate   downtime
before   about 112 ms                           285 ms
after    about 20 ms                            194 ms

In test2, we keep the number of devices the same as test1 and reduce the number of queues per device. Here are the test2 results: test info: - Host - Intel(R) Xeon(R) Platinum 8362 CPU - Mellanox Technologies MT28841 - VM - 32 CPUs 128GB RAM VM - 8 1-queue vhost-net device - 16 1-queue vhost-user-blk device.

         time of loading non-iterable vmstate   downtime
before   about 65 ms                            about 151 ms
after    about 19 ms                            about 100 ms

In test3, we keep the number of queues per device the same as test1 and reduce the number of devices. Here are the test3 results: test info: - Host - Intel(R) Xeon(R) Platinum 8362 CPU - Mellanox Technologies MT28841 - VM - 32 CPUs 128GB RAM VM - 1 16-queue vhost-net device - 1 4-queue vhost-user-blk device.

         time of loading non-iterable vmstate   downtime
before   about 24 ms                            about 51 ms
after    about 9 ms                             about 36 ms

As we can see from the test results above, both the number of queues and the number of devices have a great impact on the time of loading non-iterable vmstate. The growth of the number of devices and queues leads to more mr commits, and the time consumed by flatview reconstruction also increases. Signed-off-by: Chuang Xu --- migration/savevm.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/migration/savevm.c b/migration/savevm.c index aa54a67fda..9a7d3e40d6 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -2762,6 +2762,7 @@ out: goto retry; } } + return ret; } @@ -2787,7 +2788,25 @@ int qemu_loadvm_state(QEMUFile *f) cpu_synchronize_all_pre_loadvm(); +/* + * Call memory_region_transaction_begin() before loading vmstate. + * This call is paired with memory_region_transaction_commit() at + * the end of qemu_loadvm_state_main(), in order to pack all the + * changes to memory region during the period of loading + * non-iterable vmstate in a single memory transaction. + * This operation will reduce time of loading non-iterable vmstate + */ +memory_region_transaction_begin(); + ret = qemu_loadvm_state_main(f, mis); + +/* + * Call memory_region_transaction_commit() after loading vmstate. + * At this point, qemu actually completes all the previous memory + * region transactions. + */ +memory_region_transaction_commit(); + qemu_event_set(&mis->main_thread_load_event); trace_qemu_loadvm_state_post_main(ret); -- 2.20.1
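The win of wrapping the whole load in one transaction can be sketched with counters: each device section applied during load marks the memory map dirty, and without an enclosing transaction every section pays for a full rebuild, while the begin/commit pair amortizes them into one. All names and counters below are illustrative stand-ins, not QEMU code:

```c
#include <assert.h>

/* Toy model of qemu_loadvm_state() batching memory region commits. */
static int depth, pending, rebuilds;

static void begin(void)
{
    depth++;
}

static void commit(void)
{
    /* rebuild only when the outermost transaction ends */
    if (--depth == 0 && pending) {
        rebuilds++;              /* stands in for flatview reconstruction */
        pending = 0;
    }
}

static void load_section(void)   /* one device's vmstate being applied */
{
    begin();
    pending = 1;                 /* the section touched a memory region */
    commit();
}

/* Returns how many rebuilds loading nr_sections sections costs. */
static int load_vmstate(int nr_sections, int batched)
{
    rebuilds = 0;
    if (batched) begin();        /* the qemu_loadvm_state() wrapping */
    for (int i = 0; i < nr_sections; i++) {
        load_section();
    }
    if (batched) commit();
    return rebuilds;
}
```

With 16 devices the unbatched path rebuilds 16 times and the batched path once, which is the shape of the "before"/"after" numbers in the tables above.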
[PATCH v7 0/6] migration: reduce time of loading non-iterable vmstate
In this version: - introduce address_space_to_flatview_rcu() - squash peter's fix into patch 1 - rebase to latest upstream - update test results The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory regions made during the loading of non-iterable vmstate into a single memory transaction. As the number of devices increases, this patch will greatly improve performance. Here are the test1 results: test info: - Host - Intel(R) Xeon(R) Platinum 8362 CPU - Mellanox Technologies MT28841 - VM - 32 CPUs 128GB RAM VM - 8 16-queue vhost-net device - 16 4-queue vhost-user-blk device.

         time of loading non-iterable vmstate   downtime
before   112 ms                                 285 ms
after    20 ms                                  194 ms

In test2, we keep the number of devices the same as test1 and reduce the number of queues per device. Here are the test2 results: test info: - Host - Intel(R) Xeon(R) Platinum 8362 CPU - Mellanox Technologies MT28841 - VM - 32 CPUs 128GB RAM VM - 8 1-queue vhost-net device - 16 1-queue vhost-user-blk device.

         time of loading non-iterable vmstate   downtime
before   65 ms                                  151 ms
after    19 ms                                  100 ms

In test3, we keep the number of queues per device the same as test1 and reduce the number of devices. Here are the test3 results: test info: - Host - Intel(R) Xeon(R) Platinum 8362 CPU - Mellanox Technologies MT28841 - VM - 32 CPUs 128GB RAM VM - 1 16-queue vhost-net device - 1 4-queue vhost-user-blk device.

         time of loading non-iterable vmstate   downtime
before   24 ms                                  51 ms
after    9 ms                                   36 ms

As we can see from the test results above, both the number of queues and the number of devices have a great impact on the time of loading non-iterable vmstate.
The growth of the number of devices and queues will lead to more mr commits, and the time consumption caused by the flatview reconstruction will also increase. Please review, Chuang [v6] - add peter's patch. - split mr_do_commit() from mr_commit(). - adjust the sanity check in address_space_to_flatview(). - rebase to latest upstream. - replace 8260 with 8362 as testing host. - update the latest test results. [v5] - rename rcu_read_locked() to rcu_read_is_locked(). - adjust the sanity check in address_space_to_flatview(). - improve some comments. [v4] - attach more information in the cover letter. - remove changes on virtio_load. - add rcu_read_locked() to detect holding of rcu lock. [v3] - move virtio_load_check_delay() from virtio_memory_listener_commit() to virtio_vmstate_change(). - add delay_check flag to VirtIODevice to make sure virtio_load_check_delay() will be called when delay_check is true. [v2] - rebase to latest upstream. - add sanity check to address_space_to_flatview(). - postpone the init of the vring cache until migration's loading completes. [v1] The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory region during the period of loading non-iterable vmstate in a single memory transaction. With the increase of devices, this patch will greatly improve the performance. Here are the test results: test vm info: - 32 CPUs 128GB RAM - 8 16-queue vhost-net device - 16 4-queue vhost-user-blk device. time of loading non-iterable vmstate before about 210 ms after about 40 ms
Re: [PATCH RESEND v6 0/5] migration: reduce time of loading non-iterable vmstate
Hi, Peter, On 2023/3/8 at 11:46 PM, Peter Xu wrote: 1. squash fix into patch1 of yours. 2. introduce address_space_to_flatview_rcu() 3. add specific comment to define when to use which as_to_flat() This can be together with 2). We should suggest using address_space_to_flatview() by default in the comment, and only use _rcu() with caution, e.g. we can mention commit() hooks as an example, and also mention the possibility of seeing a very old (or purely empty) flatview during vm load. In that sense this can be the last patch of your set so there's the vm load context to reference. I hope there'll be no outliers that take only RCU (no bql) but still expect a very new flatview; then it'll crash easily if called in a vm load. But let's see.. I assume your test cases are already a much larger set so they cover a lot of code paths already. 4. Does enforce commit() need further modification or keep current status? Looks like you have some new thoughts on it? I don't. PS: I do have some thoughts but I don't think I mentioned them.. My thoughts were that we can actually avoid calling begin()/commit()/... hooks during a nested do_commit() at all but only update current_map. That'll further avoid the _rcu() patch to be introduced, but I think that needs more changes and may not be necessary at all. Ignore this. Got it. Are there any other missing points? No from my side. Note that 8.0 reached soft freeze. Sorry to say so, but it seems this work will only be possible (if no further objections coming) for the 8.1 merge window, so the earliest merge will be after the middle of April. Thanks for being consistent with it already so far. I also want to thank you for your long-term guidance and suggestions for this series. To tell the truth, as a newcomer, this is the first patch I have tried to send to the community.
Your active discussion in the emails makes me feel that I am doing something really meaningful, so I am willing to continuously devote my energy to participate in the discussion, and at the same time, I benefit a lot from the discussion with you. Thanks!
Re: [PATCH RESEND v6 0/5] migration: reduce time of loading non-iterable vmstate
Hi, Peter, On 2023/3/8 下午10:58, Peter Xu wrote: On Wed, Mar 08, 2023 at 06:03:45AM -0800, Chuang Xu wrote: IIUC, Do you mean that different ways to get flatview are tricky? Yes, and properly define when to use which. As you said, it's slightly beyond what this series does. Maybe it would be better if we discuss it in a new series and keep this series at v6? what's your take? Quotting your test result: time of loading non-iterable vmstate before 112 ms long's patch applied103 ms my patch applied 44 ms both applied 39 ms add as_to_flat_rcu 19 ms If introducing address_space_to_flatview_rcu() can further half the time, maybe still worth it? The thing is the extra _rcu() doesn't bring the major complexity, IMHO. It brings some on identifying which is really safe to not reference a latest flatview (it seems to me only during a commit() hook..). The major complexity still comes from the nested enforced commit() during address_space_to_flatview() but that is already in the patchset. Thanks, OK, let me continue to finish v7. Here I list some TODOs in v7: 1. squash fix into patch1 of yours. 2. introduce address_space_to_flatview_rcu() 3. add specific comment to define when to use which as_to_flat() 4. Does enforce commit() need further modification or keep current status? Looks like you have some new thoughts on it? Are there any other missing points? Thanks!
Re: [PATCH RESEND v6 0/5] migration: reduce time of loading non-iterable vmstate
Hi, Peter, On 2023/3/8 上午1:04, Peter Xu wrote: > On Tue, Mar 07, 2023 at 09:24:31PM +0800, Chuang Xu wrote: >>> Why do we need address_space_get_flatview_rcu()? I'm not sure whether you >> address_space_cahce_init() uses address_space_get_flatview() to acquire >> a ref-ed flatview. If we want to use address_space_to_flatview_rcu() and >> make the flatview ref-ed, maybe we need to add address_space_get_flatview_rcu()? >> Or if we use address_space_to_flatview_rcu() directly, then we'll get a >> unref-ed flatview, I'm not sure wheather it's safe in other cases.. At least >> I don't want the assertion to be triggered one day. > No we can't touch that, afaict. It was a real fix (447b0d0b9e) to a real > bug. > > What I meant is we make address_space_get_flatview() always use > address_space_to_flatview_rcu(). This time I clearly understand what you mean. > > I think it's safe, if you see the current callers of it (after my patch 1 > fixed version applied): > > memory_region_sync_dirty_bitmap[2249] view = address_space_get_flatview(as); > memory_region_update_coalesced_range[2457] view = address_space_get_flatview(as); > memory_region_clear_dirty_bitmap[2285] view = address_space_get_flatview(as); > mtree_info_flatview[3418] view = address_space_get_flatview(as); > address_space_cache_init[3350] cache->fv = address_space_get_flatview(as); > > Case 5 is what we're targeting for which I think it's fine. Logically any > commit() hook should just use address_space_get_flatview_raw() to reference > the flatview, but at least address_space_cache_init() is also called in > non-BQL sections, so _rcu is the only thing we can use here, iiuc.. > > Case 1-3 is never called during vm load, so I think it's safe. > > Case 4 can be accessing stalled flatview but I assume fine since it's just > for debugging. E.g. if someone tries "info mtree -f" during vm loading > after your patch applied, then one can see stalled info. 
I don't think it > can even happen because so far "info mtree" should also take BQL.. so it'll > be blocked until vm load completes. > > The whole thing is still tricky. I didn't yet make my mind through on how IIUC, Do you mean that different ways to get flatview are tricky? > to make it very clean, I think it's slightly beyond what this series does > and I need some more thinkings (or suggestions from others). > As you said, it's slightly beyond what this series does. Maybe it would be better if we discuss it in a new series and keep this series at v6? what's your take? Thanks!
Re: [PATCH RESEND v6 0/5] migration: reduce time of loading non-iterable vmstate
Hi, Peter, On 2023/3/7 上午4:48, Peter Xu wrote: On Mon, Mar 06, 2023 at 08:48:05PM +0800, Chuang Xu wrote: Hi, Peter, On 2023/3/6 上午6:05, Peter Xu wrote: 1.virtio_load->virtio_init_region_cache 2.virtio_load->virtio_set_features_nocheck What is this one specifically? I failed to see quickly why this needs to reference the flatview. 0x560f3bb26510 : memory_region_transaction_do_commit+0x0/0x1c0 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bb26e45 : address_space_get_flatview+0x95/0x100 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bb296b6 : memory_listener_register+0xf6/0x300 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3baec59f : virtio_device_realize+0x12f/0x1a0 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb3b1f : device_set_realized+0x2ff/0x660 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb6ec6 : property_set_bool+0x46/0x70 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb9173 : object_property_set+0x73/0x100 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbbc1ef : object_property_set_qobject+0x2f/0x50 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb93e4 : object_property_set_bool+0x34/0xa0 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3b9830d9 : virtio_pci_realize+0x299/0x4e0 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3b901204 : pci_qdev_realize+0x7e4/0x1090 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb3b1f : device_set_realized+0x2ff/0x660 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb6ec6 : property_set_bool+0x46/0x70 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb9173 : object_property_set+0x73/0x100 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbbc1ef : object_property_set_qobject+0x2f/0x50 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb93e4 : object_property_set_bool+0x34/0xa0 
[/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3b99e4a0 : qdev_device_add_from_qdict+0xb00/0xc40 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bac0c75 : virtio_net_set_features+0x385/0x490 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3baec238 : virtio_set_features_nocheck+0x58/0x70 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3baf1e9e : virtio_load+0x33e/0x820 [/data00/migration/qemu-open/build/qemu-system-x86_64] I think this one can also be avoided. Basically any memory listener related op can avoid referencing the latest flatview because even if it's during depth>0 it'll be synced again when depth==0. 3.vapic_post_load Same confusion to this one.. 0x557a57b0e510 : memory_region_transaction_do_commit+0x0/0x1c0 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x557a57b0e9b5 : memory_region_find_rcu+0x2e5/0x310 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x557a57b11165 : memory_region_find+0x55/0xf0 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x557a57a07638 : vapic_prepare+0x58/0x250 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x557a57a07a7b : vapic_post_load+0x1b/0x80 [/data00/migration/qemu-open/build/qemu-system-x86_64] AFAIU this one is the other one that truly need to reference the latest flatview; the other one is (5) on AHCI. It's a pity that it uses memory_region_find_rcu() even if it must already have BQL so it's kind of unclear (and enforced commit should never need to happen with RCU logically..). 4.tcg_commit This is not enabled if kvm is on, right? Yes. I obtained these results from qtests. And some qtests enabled tcg. 5.ahci_state_post_load During my test, virtio_init_region_cache() will frequently trigger do_commit() in address_space_to_flatview(), which will reduce the optimization effect of v6 compared with v1. 
IIU above 1 & 4 could leverage one address_space_to_flatview_rcu() which can keep the old semantics of address_space_to_flatview() by just assert rcu read lock and do qatomic_rcu_read() (e.g., tcg_commit() will run again at last stage of vm load). Not sure how much it'll help. This can really improve the performance of the existing patch, which is basically the same as that of v1, but it needs to add address_space_to_flatview_rcu() and address_space_get_flatview_rcu(). I have also considered this, but will others find it confusing? Because there will be to_flatview(), to_flatview_raw() and to_flatview_rcu().. Why do we need address_space_get_flatview_rcu()? I'm not sure whether you address_space_cahce_init() uses address_space_get_flatview() to acquire a ref-ed flatview. If we want to use address_space_to_flatview_rcu() and make the flatview ref-ed, maybe we need to add address_space_get_flatview_rcu()? Or if we use address_space_to_flatview_rcu() directly, then we'll get a unref-ed flatview, I'm not sure wheather it's safe in other cases.. At least I don't want
Re: [PATCH RESEND v6 0/5] migration: reduce time of loading non-iterable vmstate
Hi, Peter, On 2023/3/6 上午6:05, Peter Xu wrote: Hi, Chuang, On Fri, Mar 03, 2023 at 06:56:50PM +0800, Chuang Xu wrote: Sorry to forget to update the test results in the last patch of v6. In this version: - add peter's patch. - split mr_do_commit() from mr_commit(). - adjust the sanity check in address_space_to_flatview(). - rebase to latest upstream. - replace 8260 with 8362 as testing host. - update the latest test results. Here I list some cases which will trigger do_commit() in address_space_to_flatview(): I ran qtest cases and used systemtap to trace those do_commit(). 1.virtio_load->virtio_init_region_cache 2.virtio_load->virtio_set_features_nocheck What is this one specifically? I failed to see quickly why this needs to reference the flatview. 0x560f3bb26510 : memory_region_transaction_do_commit+0x0/0x1c0 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bb26e45 : address_space_get_flatview+0x95/0x100 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bb296b6 : memory_listener_register+0xf6/0x300 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3baec59f : virtio_device_realize+0x12f/0x1a0 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb3b1f : device_set_realized+0x2ff/0x660 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb6ec6 : property_set_bool+0x46/0x70 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb9173 : object_property_set+0x73/0x100 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbbc1ef : object_property_set_qobject+0x2f/0x50 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb93e4 : object_property_set_bool+0x34/0xa0 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3b9830d9 : virtio_pci_realize+0x299/0x4e0 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3b901204 : pci_qdev_realize+0x7e4/0x1090 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb3b1f : device_set_realized+0x2ff/0x660 
[/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb6ec6 : property_set_bool+0x46/0x70 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb9173 : object_property_set+0x73/0x100 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbbc1ef : object_property_set_qobject+0x2f/0x50 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bbb93e4 : object_property_set_bool+0x34/0xa0 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3b99e4a0 : qdev_device_add_from_qdict+0xb00/0xc40 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3bac0c75 : virtio_net_set_features+0x385/0x490 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3baec238 : virtio_set_features_nocheck+0x58/0x70 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x560f3baf1e9e : virtio_load+0x33e/0x820 [/data00/migration/qemu-open/build/qemu-system-x86_64] 3.vapic_post_load Same confusion to this one.. 0x557a57b0e510 : memory_region_transaction_do_commit+0x0/0x1c0 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x557a57b0e9b5 : memory_region_find_rcu+0x2e5/0x310 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x557a57b11165 : memory_region_find+0x55/0xf0 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x557a57a07638 : vapic_prepare+0x58/0x250 [/data00/migration/qemu-open/build/qemu-system-x86_64] 0x557a57a07a7b : vapic_post_load+0x1b/0x80 [/data00/migration/qemu-open/build/qemu-system-x86_64] 4.tcg_commit This is not enabled if kvm is on, right? Yes. I obtained these results from qtests. And some qtests enabled tcg. 5.ahci_state_post_load During my test, virtio_init_region_cache() will frequently trigger do_commit() in address_space_to_flatview(), which will reduce the optimization effect of v6 compared with v1. 
> IIUC, above 1 & 4 could leverage one address_space_to_flatview_rcu() which
> can keep the old semantics of address_space_to_flatview() by just
> asserting the rcu read lock and doing qatomic_rcu_read() (e.g.,
> tcg_commit() will run again at the last stage of vm load). Not sure how
> much it'll help.

This can really improve the performance of the existing patch, which is basically the same as that of v1, but it needs to add address_space_to_flatview_rcu() and address_space_get_flatview_rcu(). I have also considered this, but will others find it confusing? Because there will be to_flatview(), to_flatview_raw() and to_flatview_rcu()..

> You may also want to have a look at the other patch to optimize ioeventfd
> commit here; I think that'll also speed up vm load but not sure how much
> your series can further do upon:
> https://lore.kernel.org/all/20230228142514.2582-1-longpe...@huawei.com/
>
> Thanks,

Here are the latest test results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net device
  - 16 4-queue vhost-user-blk device

time of l
[PATCH RESEND v6 5/5] migration: Reduce time of loading non-iterable vmstate
The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory region during the period of loading non-iterable vmstate in a single memory transaction. As the number of devices grows, this patch will greatly improve the performance.

Here are the test1 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net device
  - 16 4-queue vhost-user-blk device

         time of loading non-iterable vmstate   downtime
before   about 112 ms                           285 ms
after    about 44 ms                            208 ms

In test2, we keep the number of devices the same as test1 and reduce the number of queues per device. Here are the test2 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 8 1-queue vhost-net device
  - 16 1-queue vhost-user-blk device

         time of loading non-iterable vmstate   downtime
before   about 65 ms                            about 151 ms
after    about 30 ms                            about 110 ms

In test3, we keep the number of queues per device the same as test1 and reduce the number of devices. Here are the test3 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device

         time of loading non-iterable vmstate   downtime
before   about 24 ms                            about 51 ms
after    about 12 ms                            about 38 ms

As we can see from the test results above, both the number of queues and the number of devices have a great impact on the time of loading non-iterable vmstate. The growth of the number of devices and queues will lead to more mr commits, and the time consumption caused by the flatview reconstruction will also increase.
Signed-off-by: Chuang Xu
---
 migration/savevm.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index b5e6962bb6..3dd9daabd8 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2770,6 +2770,7 @@ out:
             goto retry;
         }
     }
+
     return ret;
 }

@@ -2795,7 +2796,25 @@ int qemu_loadvm_state(QEMUFile *f)

     cpu_synchronize_all_pre_loadvm();

+    /*
+     * Call memory_region_transaction_begin() before loading vmstate.
+     * This call is paired with memory_region_transaction_commit() at
+     * the end of qemu_loadvm_state_main(), in order to pack all the
+     * changes to memory region during the period of loading
+     * non-iterable vmstate in a single memory transaction.
+     * This operation will reduce time of loading non-iterable vmstate.
+     */
+    memory_region_transaction_begin();
+
     ret = qemu_loadvm_state_main(f, mis);
+
+    /*
+     * Call memory_region_transaction_commit() after loading vmstate.
+     * At this point, qemu actually completes all the previous memory
+     * region transactions.
+     */
+    memory_region_transaction_commit();
+
     qemu_event_set(&mis->main_thread_load_event);

     trace_qemu_loadvm_state_post_main(ret);
--
2.20.1
[PATCH RESEND v6 3/5] memory: Introduce memory_region_transaction_do_commit()
Split memory_region_transaction_do_commit() from memory_region_transaction_commit(). We'll call do_commit() in address_space_to_flatview() in the later patch.

Signed-off-by: Chuang Xu
---
 softmmu/memory.c | 47 +++++++++++++++++++++++++----------------------
 1 file changed, 27 insertions(+), 20 deletions(-)

diff --git a/softmmu/memory.c b/softmmu/memory.c
index 213496802b..b89abf400e 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1093,34 +1093,41 @@ void memory_region_transaction_begin(void)
     ++memory_region_transaction_depth;
 }

-void memory_region_transaction_commit(void)
+void memory_region_transaction_do_commit(void)
 {
     AddressSpace *as;

-    assert(memory_region_transaction_depth);
     assert(qemu_mutex_iothread_locked());

-    --memory_region_transaction_depth;
-    if (!memory_region_transaction_depth) {
-        if (memory_region_update_pending) {
-            flatviews_reset();
+    if (memory_region_update_pending) {
+        flatviews_reset();

-            MEMORY_LISTENER_CALL_GLOBAL(begin, Forward);
+        MEMORY_LISTENER_CALL_GLOBAL(begin, Forward);

-            QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
-                address_space_set_flatview(as);
-                address_space_update_ioeventfds(as);
-            }
-            memory_region_update_pending = false;
-            ioeventfd_update_pending = false;
-            MEMORY_LISTENER_CALL_GLOBAL(commit, Forward);
-        } else if (ioeventfd_update_pending) {
-            QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
-                address_space_update_ioeventfds(as);
-            }
-            ioeventfd_update_pending = false;
+        QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
+            address_space_set_flatview(as);
+            address_space_update_ioeventfds(as);
+        }
+        memory_region_update_pending = false;
+        ioeventfd_update_pending = false;
+        MEMORY_LISTENER_CALL_GLOBAL(commit, Forward);
+    } else if (ioeventfd_update_pending) {
+        QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
+            address_space_update_ioeventfds(as);
         }
-    }
+        ioeventfd_update_pending = false;
+    }
+}
+
+void memory_region_transaction_commit(void)
+{
+    assert(memory_region_transaction_depth);
+    assert(qemu_mutex_iothread_locked());
+
+    --memory_region_transaction_depth;
+    if (!memory_region_transaction_depth) {
+        memory_region_transaction_do_commit();
+    }
 }

 static void memory_region_destructor_none(MemoryRegion *mr)
--
2.20.1
[PATCH RESEND v6 1/5] memory: Reference as->current_map directly in memory commit
From: Peter Xu

Calling the RCU variants of address_space_get|to_flatview() during memory commit (flatview updates, triggering memory listeners, or updating ioeventfds, etc.) is not 100% accurate, because commit() requires the BQL rather than the RCU read lock, so the context exclusively owns current_map and it can be referenced directly. Neither does it need a refcount on current_map, because it cannot be freed from under the caller.

Add address_space_to_flatview_raw() for the case where the context holds the BQL rather than the RCU read lock and use it across the core memory updates. Drop the extra refcounts on FlatView*.

Signed-off-by: Peter Xu
---
 softmmu/memory.c | 21 +++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/softmmu/memory.c b/softmmu/memory.c
index 9d64efca26..213496802b 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -61,6 +61,13 @@ struct AddrRange {
     Int128 size;
 };

+/* Called with BQL held */
+static inline FlatView *address_space_to_flatview_raw(AddressSpace *as)
+{
+    assert(qemu_mutex_iothread_locked());
+    return as->current_map;
+}
+
 static AddrRange addrrange_make(Int128 start, Int128 size)
 {
     return (AddrRange) { start, size };
@@ -155,7 +162,7 @@ enum ListenerDirection { Forward, Reverse };
 #define MEMORY_LISTENER_UPDATE_REGION(fr, as, dir, callback, _args...)  \
     do {                                                                \
         MemoryRegionSection mrs = section_from_flat_range(fr,           \
-                address_space_to_flatview(as));                         \
+                address_space_to_flatview_raw(as));                     \
         MEMORY_LISTENER_CALL(as, callback, dir, &mrs, ##_args);         \
     } while(0)

@@ -753,6 +760,7 @@ static FlatView *generate_memory_topology(MemoryRegion *mr)
 }

 static void address_space_add_del_ioeventfds(AddressSpace *as,
+                                             FlatView *view,
                                              MemoryRegionIoeventfd *fds_new,
                                              unsigned fds_new_nb,
                                              MemoryRegionIoeventfd *fds_old,
@@ -774,7 +782,7 @@ static void address_space_add_del_ioeventfds(AddressSpace *as,
                                                   &fds_new[inew]))) {
             fd = &fds_old[iold];
             section = (MemoryRegionSection) {
-                .fv = address_space_to_flatview(as),
+                .fv = view,
                 .offset_within_address_space = int128_get64(fd->addr.start),
                 .size = fd->addr.size,
             };
@@ -787,7 +795,7 @@ static void address_space_add_del_ioeventfds(AddressSpace *as,
                                                          &fds_old[iold]))) {
             fd = &fds_new[inew];
             section = (MemoryRegionSection) {
-                .fv = address_space_to_flatview(as),
+                .fv = view,
                 .offset_within_address_space = int128_get64(fd->addr.start),
                 .size = fd->addr.size,
             };
@@ -833,7 +841,7 @@ static void address_space_update_ioeventfds(AddressSpace *as)
     ioeventfd_max = QEMU_ALIGN_UP(as->ioeventfd_nb, 4);
     ioeventfds = g_new(MemoryRegionIoeventfd, ioeventfd_max);

-    view = address_space_get_flatview(as);
+    view = address_space_to_flatview_raw(as);
     FOR_EACH_FLAT_RANGE(fr, view) {
         for (i = 0; i < fr->mr->ioeventfd_nb; ++i) {
             tmp = addrrange_shift(fr->mr->ioeventfds[i].addr,
@@ -852,13 +860,12 @@ static void address_space_update_ioeventfds(AddressSpace *as)
         }
     }

-    address_space_add_del_ioeventfds(as, ioeventfds, ioeventfd_nb,
+    address_space_add_del_ioeventfds(as, view, ioeventfds, ioeventfd_nb,
                                      as->ioeventfds, as->ioeventfd_nb);
     g_free(as->ioeventfds);
     as->ioeventfds = ioeventfds;
     as->ioeventfd_nb = ioeventfd_nb;
-    flatview_unref(view);
 }

 /*
@@ -1026,7 +1033,7 @@ static void flatviews_reset(void)

 static void address_space_set_flatview(AddressSpace *as)
 {
-    FlatView *old_view = address_space_to_flatview(as);
+    FlatView *old_view = address_space_to_flatview_raw(as);
     MemoryRegion *physmr = memory_region_get_flatview_root(as->root);
     FlatView *new_view = g_hash_table_lookup(flat_views, physmr);
--
2.20.1
[PATCH RESEND v6 0/5] migration: reduce time of loading non-iterable vmstate
Sorry, I forgot to update the test results in the last patch of v6.

In this version:
- add peter's patch.
- split mr_do_commit() from mr_commit().
- adjust the sanity check in address_space_to_flatview().
- rebase to latest upstream.
- replace 8260 with 8362 as testing host.
- update the latest test results.

Here I list some cases which will trigger do_commit() in address_space_to_flatview():
1. virtio_load->virtio_init_region_cache
2. virtio_load->virtio_set_features_nocheck
3. vapic_post_load
4. tcg_commit
5. ahci_state_post_load

During my test, virtio_init_region_cache() frequently triggered do_commit() in address_space_to_flatview(), which reduces the optimization effect of v6 compared with v1.

The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory region during the period of loading non-iterable vmstate in a single memory transaction. As the number of devices grows, this patch will greatly improve the performance.

This time I replaced 8260 with 8362 as the testing host and used the latest spdk as the vhost-user-blk backend. The downtime results differ from the previous ones, but this doesn't affect the improvement comparison of loading vmstate.

Here are the test1 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net device
  - 16 4-queue vhost-user-blk device

         time of loading non-iterable vmstate   downtime
before   112 ms                                 285 ms
after    44 ms                                  208 ms

In test2, we keep the number of devices the same as test1 and reduce the number of queues per device. Here are the test2 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 8 1-queue vhost-net device
  - 16 1-queue vhost-user-blk device

         time of loading non-iterable vmstate   downtime
before   65 ms                                  151 ms
after    30 ms                                  110 ms

In test3, we keep the number of queues per device the same as test1 and reduce the number of devices. Here are the test3 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8362 CPU
  - Mellanox Technologies MT28841
- VM
  - 32 CPUs 128GB RAM VM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device

         time of loading non-iterable vmstate   downtime
before   24 ms                                  51 ms
after    12 ms                                  38 ms

As we can see from the test results above, both the number of queues and the number of devices have a great impact on the time of loading non-iterable vmstate. The growth of the number of devices and queues will lead to more mr commits, and the time consumption caused by the flatview reconstruction will also increase.

Please review,
Chuang

[v5]
- rename rcu_read_locked() to rcu_read_is_locked().
- adjust the sanity check in address_space_to_flatview().
- improve some comments.

[v4]
- attach more information in the cover letter.
- remove changes on virtio_load.
- add rcu_read_locked() to detect holding of rcu lock.

[v3]
- move virtio_load_check_delay() from virtio_memory_listener_commit() to virtio_vmstate_change().
- add delay_check flag to VirtIODevice to make sure virtio_load_check_delay() will be called when delay_check is true.

[v2]
- rebase to latest upstream.
- add sanity check to address_space_to_flatview().
- postpone the init of the vring cache until migration's loading completes.
[v1]
The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory region during the period of loading non-iterable vmstate in a single memory transaction. As the number of devices grows, this patch will greatly improve the performance.

Here are the test results:
test vm info:
- 32 CPUs 128GB RAM
- 8 16-queue vhost-net device
- 16 4-queue vhost-user-blk device

         time of loading non-iterable vmstate
before   about 210 ms
after    about 40 ms
[PATCH RESEND v6 2/5] rcu: Introduce rcu_read_is_locked()
Add rcu_read_is_locked() to detect holding of rcu lock.

Signed-off-by: Chuang Xu
---
 include/qemu/rcu.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index b063c6fde8..719916d9d3 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -119,6 +119,13 @@ static inline void rcu_read_unlock(void)
     }
 }

+static inline bool rcu_read_is_locked(void)
+{
+    struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader();
+
+    return p_rcu_reader->depth > 0;
+}
+
 extern void synchronize_rcu(void);

 /*
--
2.20.1
[PATCH RESEND v6 4/5] memory: Add sanity check in address_space_to_flatview
Before using any flatview, sanity-check that either the BQL or the rcu read lock is held. And if we're in the middle of a memory region transaction, update the mappings immediately, otherwise the map can be invalid.

Signed-off-by: Chuang Xu
---
 include/exec/memory.h | 23 +++++++++++++++++++++++
 softmmu/memory.c      |  5 +++++
 2 files changed, 28 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 2e602a2fad..84b531c6ff 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -27,6 +27,7 @@
 #include "qemu/notify.h"
 #include "qom/object.h"
 #include "qemu/rcu.h"
+#include "qemu/main-loop.h"

 #define RAM_ADDR_INVALID (~(ram_addr_t)0)

@@ -1095,8 +1096,30 @@ struct FlatView {
     MemoryRegion *root;
 };

+bool memory_region_transaction_in_progress(void);
+
+void memory_region_transaction_do_commit(void);
+
 static inline FlatView *address_space_to_flatview(AddressSpace *as)
 {
+    if (qemu_mutex_iothread_locked()) {
+        /* We exclusively own the flatview now.. */
+        if (memory_region_transaction_in_progress()) {
+            /*
+             * Fetch the flatview within a transaction in-progress, it
+             * means current_map may not be the latest, we need to update
+             * immediately to make sure the caller won't see obsolete
+             * mapping.
+             */
+            memory_region_transaction_do_commit();
+        }
+
+        /* No further protection needed to access current_map */
+        return as->current_map;
+    }
+
+    /* Otherwise we must have had the RCU lock or something went wrong */
+    assert(rcu_read_is_locked());
+
     return qatomic_rcu_read(&as->current_map);
 }

diff --git a/softmmu/memory.c b/softmmu/memory.c
index b89abf400e..1834e14cc8 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1130,6 +1130,11 @@ void memory_region_transaction_commit(void)
     }
 }

+bool memory_region_transaction_in_progress(void)
+{
+    return memory_region_transaction_depth != 0;
+}
+
 static void memory_region_destructor_none(MemoryRegion *mr)
 {
 }
--
2.20.1
[PATCH v6 5/5] migration: Reduce time of loading non-iterable vmstate
The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory region during the period of loading non-iterable vmstate in a single memory transaction. As the number of devices grows, this patch will greatly improve the performance.

Here are the test1 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net device
  - 16 4-queue vhost-user-blk device

         time of loading non-iterable vmstate   downtime
before   about 150 ms                           740+ ms
after    about 30 ms                            630+ ms

In test2, we keep the number of devices the same as test1 and reduce the number of queues per device. Here are the test2 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 1-queue vhost-net device
  - 16 1-queue vhost-user-blk device

         time of loading non-iterable vmstate   downtime
before   about 90 ms                            about 250 ms
after    about 25 ms                            about 160 ms

In test3, we keep the number of queues per device the same as test1 and reduce the number of devices. Here are the test3 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device

         time of loading non-iterable vmstate   downtime
before   about 20 ms                            about 70 ms
after    about 11 ms                            about 60 ms

As we can see from the test results above, both the number of queues and the number of devices have a great impact on the time of loading non-iterable vmstate. The growth of the number of devices and queues will lead to more mr commits, and the time consumption caused by the flatview reconstruction will also increase.
Signed-off-by: Chuang Xu
---
 migration/savevm.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index b5e6962bb6..3dd9daabd8 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2770,6 +2770,7 @@ out:
             goto retry;
         }
     }
+
     return ret;
 }

@@ -2795,7 +2796,25 @@ int qemu_loadvm_state(QEMUFile *f)

     cpu_synchronize_all_pre_loadvm();

+    /*
+     * Call memory_region_transaction_begin() before loading vmstate.
+     * This call is paired with memory_region_transaction_commit() at
+     * the end of qemu_loadvm_state_main(), in order to pack all the
+     * changes to memory region during the period of loading
+     * non-iterable vmstate in a single memory transaction.
+     * This operation will reduce time of loading non-iterable vmstate.
+     */
+    memory_region_transaction_begin();
+
     ret = qemu_loadvm_state_main(f, mis);
+
+    /*
+     * Call memory_region_transaction_commit() after loading vmstate.
+     * At this point, qemu actually completes all the previous memory
+     * region transactions.
+     */
+    memory_region_transaction_commit();
+
     qemu_event_set(&mis->main_thread_load_event);

     trace_qemu_loadvm_state_post_main(ret);
--
2.20.1
Re: [RFC v5 0/3] migration: reduce time of loading non-iterable vmstate
Hi, Peter

On 2023/2/25 11:32 PM, Peter Xu wrote:

On Thu, Feb 23, 2023 at 11:28:46AM +0800, Chuang Xu wrote:
Hi, Peter

Hi, Chuang,

On 2023/2/22 11:57 PM, Peter Xu wrote:
I don't see why it's wrong, and that's exactly what I wanted to guarantee: if memory_region_update_pending==true when updating ioeventfds, we may have some serious problem. For this, I split my patch into two parts and I put this change into the last patch. See the attachment. If you worry about this, you can e.g. try applying patch 1 only later, but I still don't see why it could.

Sorry, I made some mistakes in my previous understanding of the code. However, if this assertion is added, it will indeed trigger some coredumps in qtest with my patches. Here is the coredump (this is the only one I found):

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7fa5e7b17535 in __GI_abort () at abort.c:79
#2  0x7fa5e7b1740f in __assert_fail_base (fmt=0x7fa5e7c78ef0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x56305fc02d60 "qemu_mutex_iothread_locked() && !memory_region_update_pending", file=0x56305fc028cb "../softmmu/memory.c", line=855, function=) at assert.c:92
#3  0x7fa5e7b251a2 in __GI___assert_fail (assertion=assertion@entry=0x56305fc02d60 "qemu_mutex_iothread_locked() && !memory_region_update_pending", file=file@entry=0x56305fc028cb "../softmmu/memory.c", line=line@entry=855, function=function@entry=0x56305fc03cc0 <__PRETTY_FUNCTION__.38596> "address_space_update_ioeventfds") at assert.c:101
#4  0x56305f8a9194 in address_space_update_ioeventfds (as=as@entry=0x563061293648) at ../softmmu/memory.c:855
#5  0x56305f8afe4f in address_space_init (as=as@entry=0x563061293648, root=root@entry=0x5630612936a0, name=name@entry=0x5630612934f0 "virtio-net-pci") at ../softmmu/memory.c:3172
#6  0x56305f686e43 in do_pci_register_device (errp=0x7fa5e4f39850, devfn=, name=0x563061011c40 "virtio-net-pci", pci_dev=0x563061293410) at ../hw/pci/pci.c:1145
#7  pci_qdev_realize (qdev=0x563061293410, errp=0x7fa5e4f39850) at ../hw/pci/pci.c:2036
#8  0x56305f939daf in device_set_realized (obj=, value=true, errp=0x7fa5e4f39ae0) at ../hw/core/qdev.c:510
#9  0x56305f93d156 in property_set_bool (obj=0x563061293410, v=, name=, opaque=0x5630610221d0, errp=0x7fa5e4f39ae0) at ../qom/object.c:2285
#10 0x56305f93f403 in object_property_set (obj=obj@entry=0x563061293410, name=name@entry=0x56305fba6bc3 "realized", v=v@entry=0x563061e52a00, errp=errp@entry=0x7fa5e4f39ae0) at ../qom/object.c:1420
#11 0x56305f94247f in object_property_set_qobject (obj=obj@entry=0x563061293410, name=name@entry=0x56305fba6bc3 "realized", value=value@entry=0x563061e61cb0, errp=errp@entry=0x7fa5e4f39ae0) at ../qom/qom-qobject.c:28
#12 0x56305f93f674 in object_property_set_bool (obj=0x563061293410, name=name@entry=0x56305fba6bc3 "realized", value=value@entry=true, errp=errp@entry=0x7fa5e4f39ae0) at ../qom/object.c:1489
#13 0x56305f93959c in qdev_realize (dev=, bus=bus@entry=0x563061c9c400, errp=errp@entry=0x7fa5e4f39ae0) at ../hw/core/qdev.c:292
#14 0x56305f7244a0 in qdev_device_add_from_qdict (opts=0x563061e64c00, from_json=, errp=, errp@entry=0x7fa5e4f39ae0) at /data00/migration/qemu-open/include/hw/qdev-core.h:17
#15 0x56305f846c75 in failover_add_primary (errp=0x7fa5e4f39ad8, n=0x563062043530) at ../hw/net/virtio-net.c:933
#16 virtio_net_set_features (vdev=, features=4611687122246533156) at ../hw/net/virtio-net.c:1004
#17 0x56305f872238 in virtio_set_features_nocheck (vdev=vdev@entry=0x563062043530, val=val@entry=4611687122246533156) at ../hw/virtio/virtio.c:2851
#18 0x56305f877e9e in virtio_load (vdev=0x563062043530, f=0x56306125bde0, version_id=11) at ../hw/virtio/virtio.c:3027
#19 0x56305f73c601 in vmstate_load_state (f=f@entry=0x56306125bde0, vmsd=0x56305fef16c0 , opaque=0x563062043530, version_id=11) at ../migration/vmstate.c:137
#20 0x56305f757672 in vmstate_load (f=0x56306125bde0, se=0x563062176700) at ../migration/savevm.c:919
#21 0x56305f757905 in qemu_loadvm_section_start_full (f=f@entry=0x56306125bde0, mis=0x56306101d3e0) at ../migration/savevm.c:2503
#22 0x56305f75aca8 in qemu_loadvm_state_main (f=f@entry=0x56306125bde0, mis=mis@entry=0x56306101d3e0) at ../migration/savevm.c:2719
#23 0x56305f75c17a in qemu_loadvm_state (f=0x56306125bde0) at ../migration/savevm.c:2809
#24 0x56305f74980e in process_incoming_migration_co (opaque=) at ../migration/migration.c:606
#25 0x56305fab25cb in coroutine_trampoline (i0=, i1=) at ../util/coroutine-ucontext.c:177

I really think changing depth is a hack... :( Here IMHO the problem is we have other missing calls to address_space_to_flatview() during commit() and that caused the issue. There aren't a lot of those, and sorry to
Re: [RFC v5 0/3] migration: reduce time of loading non-iterable vmstate
Hi, Peter

On 2023/2/22 11:57 PM, Peter Xu wrote:

On Wed, Feb 22, 2023 at 02:27:55PM +0800, Chuang Xu wrote:
Hi, Peter

Hi, Chuang,

Note that as I mentioned in the comment, we temporarily replace this value to prevent commit() and address_space_to_flatview() call each other recursively, and eventually stack overflow.

Sorry to have overlooked that part. IMHO here it's not about the depth, but rather that we don't even need any RCU protection when updating ioeventfds because we exclusively own the FlatView* too there. I wanted to describe what I had in mind but instead I figured a patch may be needed to be accurate (with some cleanups alongside), hence attached. IIUC it can work with what I suggested before without fiddling with depth. Please have a look. The patch should apply cleanly to master branch so if it works it can be your 1st patch too.

PS: Paolo - I know I asked this before, but it'll be good to have your review comment on anything above. Thanks,

Here are two problems I can see:

1. It is inappropriate to use assert(qemu_mutex_iothread_locked() && !memory_region_update_pending) in update_ioeventfds(). For example, when entering commit(), if memory_region_update_pending is true, the assertion will be triggered immediately when update_ioeventfds is called.

2. The problem of stack overflow has not been solved. There are too many places where address_space_to_flatview() may be called.
Here is another coredump stack:

#8  0x55a3a769ed85 in memory_region_transaction_commit_force () at ../softmmu/memory.c:1154
#9  0x55a3a769fd75 in address_space_to_flatview (as=0x55a3a7ede180 ) at /data00/migration/qemu-open/include/exec/memory.h:1118
#10 address_space_update_topology_pass (as=as@entry=0x55a3a7ede180 , old_view=old_view@entry=0x55a3a9d44990, new_view=new_view@entry=0x55a3d6837390, adding=adding@entry=false) at ../softmmu/memory.c:955
#11 0x55a3a76a007c in address_space_set_flatview (as=as@entry=0x55a3a7ede180 ) at ../softmmu/memory.c:1062
#12 0x55a3a769e870 in address_space_update_flatview_all () at ../softmmu/memory.c:1107
#13 0x55a3a769ed85 in memory_region_transaction_commit_force () at ../softmmu/memory.c:1154
#14 0x55a3a769fd75 in address_space_to_flatview (as=0x55a3a7ede180 ) at /data00/migration/qemu-open/include/exec/memory.h:1118
#15 address_space_update_topology_pass (as=as@entry=0x55a3a7ede180 , old_view=old_view@entry=0x55a3a9d44990, new_view=new_view@entry=0x55a3d67f8d90, adding=adding@entry=false) at ../softmmu/memory.c:955
#16 0x55a3a76a007c in address_space_set_flatview (as=as@entry=0x55a3a7ede180 ) at ../softmmu/memory.c:1062
#17 0x55a3a769e870 in address_space_update_flatview_all () at ../softmmu/memory.c:1107
#18 0x55a3a769ed85 in memory_region_transaction_commit_force () at ../softmmu/memory.c:1154
#19 0x55a3a769fd75 in address_space_to_flatview (as=0x55a3a7ede180 ) at /data00/migration/qemu-open/include/exec/memory.h:1118
#20 address_space_update_topology_pass (as=as@entry=0x55a3a7ede180 , old_view=old_view@entry=0x55a3a9d44990, new_view=new_view@entry=0x55a3d67ba790, adding=adding@entry=false) at ../softmmu/memory.c:955
#21 0x55a3a76a007c in address_space_set_flatview (as=as@entry=0x55a3a7ede180 ) at ../softmmu/memory.c:1062
#22 0x55a3a769e870 in address_space_update_flatview_all () at ../softmmu/memory.c:1107
#23 0x55a3a769ed85 in memory_region_transaction_commit_force () at ../softmmu/memory.c:1154

And this may not be the only
case where stack overflow occurs. Thus, changing the depth value is the safest way I think. Thanks!
Re: [RFC v5 0/3] migration: reduce time of loading non-iterable vmstate
Hi, Peter

On 2023/2/22 4:36 AM, Peter Xu wrote:

On 2023/2/21 11:38 AM, Chuang Xu wrote:
I think we need a memory_region_transaction_commit_force() to force commit some transactions when loading vmstate. This function is designed like this:

/*
 * memory_region_transaction_commit_force() is designed to
 * force the mr transaction to be committed in the process
 * of loading vmstate.
 */
void memory_region_transaction_commit_force(void)

I would call this memory_region_transaction_do_commit(), and I don't think the manipulation of memory_region_transaction_depth is needed here, since we don't release the BQL during the whole process, so changing that depth isn't needed at all to me. So, I think we can...

{
    AddressSpace *as;
    unsigned int memory_region_transaction_depth_copy =
        memory_region_transaction_depth;

    /*
     * Temporarily replace memory_region_transaction_depth with 0 to prevent
     * memory_region_transaction_commit_force() and address_space_to_flatview()
     * call each other recursively.
     */
    memory_region_transaction_depth = 0;

... drop here ...

Note that as I mentioned in the comment, we temporarily replace this value to prevent commit() and address_space_to_flatview() calling each other recursively, and eventually stack overflow.
Part of the coredump call stack is attached here:

#8  0x558de5a998b5 in memory_region_transaction_do_commit () at ../softmmu/memory.c:1131
#9  0x558de5a99dfd in address_space_to_flatview (as=0x558de6516060 ) at /data00/migration/qemu-open/include/exec/memory.h:1130
#10 address_space_get_flatview (as=as@entry=0x558de6516060 ) at ../softmmu/memory.c:810
#11 0x558de5a9a199 in address_space_update_ioeventfds (as=as@entry=0x558de6516060 ) at ../softmmu/memory.c:836
#12 0x558de5a99900 in memory_region_transaction_do_commit () at ../softmmu/memory.c:1137
#13 memory_region_transaction_do_commit () at ../softmmu/memory.c:1125
#14 0x558de5a99dfd in address_space_to_flatview (as=0x558de6516060 ) at /data00/migration/qemu-open/include/exec/memory.h:1130
#15 address_space_get_flatview (as=as@entry=0x558de6516060 ) at ../softmmu/memory.c:810
#16 0x558de5a9a199 in address_space_update_ioeventfds (as=as@entry=0x558de6516060 ) at ../softmmu/memory.c:836
#17 0x558de5a99900 in memory_region_transaction_do_commit () at ../softmmu/memory.c:1137
#18 memory_region_transaction_do_commit () at ../softmmu/memory.c:1125
#19 0x558de5a99dfd in address_space_to_flatview (as=0x558de6516060 ) at /data00/migration/qemu-open/include/exec/memory.h:1130
#20 address_space_get_flatview (as=as@entry=0x558de6516060 ) at ../softmmu/memory.c:810
#21 0x558de5a9a199 in address_space_update_ioeventfds (as=as@entry=0x558de6516060 ) at ../softmmu/memory.c:836
#22 0x558de5a99900 in memory_region_transaction_do_commit () at ../softmmu/memory.c:1137
#23 memory_region_transaction_do_commit () at ../softmmu/memory.c:1125
#24 0x558de5a99dfd in address_space_to_flatview (as=0x558de6516060 ) at /data00/migration/qemu-open/include/exec/memory.h:1130

So I think we need to change the depth here.
    assert(qemu_mutex_iothread_locked());

    if (memory_region_update_pending) {
        flatviews_reset();

        MEMORY_LISTENER_CALL_GLOBAL(begin, Forward);

        QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
            address_space_set_flatview(as);
            address_space_update_ioeventfds(as);
        }
        memory_region_update_pending = false;
        ioeventfd_update_pending = false;
        MEMORY_LISTENER_CALL_GLOBAL(commit, Forward);
    } else if (ioeventfd_update_pending) {
        QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
            address_space_update_ioeventfds(as);
        }
        ioeventfd_update_pending = false;
    }

    /* recover memory_region_transaction_depth */
    memory_region_transaction_depth = memory_region_transaction_depth_copy;

... drop here ...

}

... then call this new memory_region_transaction_do_commit() in memory_region_transaction_commit():

void memory_region_transaction_commit(void)
{
    AddressSpace *as;

    assert(memory_region_transaction_depth);
    --memory_region_transaction_depth;
    memory_region_transaction_do_commit();
}

Then...

Now there are two options to use this function:

1. call it in address_space_to_flatview():

static inline FlatView *address_space_to_flatview(AddressSpace *as)
{
    /*
     * Before using any flatview, check whether we're during a memory
     * region transaction. If so, force commit the memory region transaction.
     */
    if (memory_region_transaction_in_progress())

Here we need to add the condition of BQL holding, or some threads running here without the BQL held will trigger the assertion in memory_region_transaction_commit_force(). I'm not sure whether this condition is sufficient; at least for the mr access in the load thread it is sufficient (because the load thread will hold the BQL
Re: [RFC v5 0/3] migration: reduce time of loading non-iterable vmstate
Hi, Peter

This email is a supplement to the previous one.

On 2023/2/21 11:38 AM, Chuang Xu wrote:

I think we need a memory_region_transaction_commit_force() to force commit some transactions when loading vmstate. This function is designed like this:

/*
 * memory_region_transaction_commit_force() is designed to
 * force the mr transaction to be committed in the process
 * of loading vmstate.
 */
void memory_region_transaction_commit_force(void)
{
    AddressSpace *as;
    unsigned int memory_region_transaction_depth_copy =
        memory_region_transaction_depth;

    /*
     * Temporarily replace memory_region_transaction_depth with 0 to prevent
     * memory_region_transaction_commit_force() and address_space_to_flatview()
     * call each other recursively.
     */
    memory_region_transaction_depth = 0;

    assert(qemu_mutex_iothread_locked());

    if (memory_region_update_pending) {
        flatviews_reset();

        MEMORY_LISTENER_CALL_GLOBAL(begin, Forward);

        QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
            address_space_set_flatview(as);
            address_space_update_ioeventfds(as);
        }
        memory_region_update_pending = false;
        ioeventfd_update_pending = false;
        MEMORY_LISTENER_CALL_GLOBAL(commit, Forward);
    } else if (ioeventfd_update_pending) {
        QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
            address_space_update_ioeventfds(as);
        }
        ioeventfd_update_pending = false;
    }

    /* recover memory_region_transaction_depth */
    memory_region_transaction_depth = memory_region_transaction_depth_copy;
}

Now there are two options to use this function:

1. call it in address_space_to_flatview():

static inline FlatView *address_space_to_flatview(AddressSpace *as)
{
    /*
     * Before using any flatview, check whether we're during a memory
     * region transaction. If so, force commit the memory region transaction.
     */
    if (memory_region_transaction_in_progress())

Here we need to add the condition of BQL holding, or some threads running here without the BQL held will trigger the assertion in memory_region_transaction_commit_force().
I'm not sure whether this condition is sufficient; at least for the mr access in the load thread it is sufficient (because the load thread will hold the BQL when accessing mr). But for other cases, it seems that we will return to our discussion on the sanity check.. Another point I worry about is whether the number of mr transaction commits increases in some other scenarios because of this force commit, although so far I haven't seen a simple virtual machine lifecycle trigger it. I did a little test: replace commit_force() with abort() and run qtest. Almost all the errors I can see are related to migration..

        memory_region_transaction_commit_force();

    return qatomic_rcu_read(&as->current_map);
}

2. call it before each post_load()

I prefer to use the former one; it is more general than the latter. And with this function, the sanity check is not necessary any more. Although we may inevitably call memory_region_transaction_commit_force() several times, in my actual test the number of calls to memory_region_transaction_commit_force() is still insignificant compared with the number of calls to memory_region_transaction_commit() before optimization. As a result, this code won't affect the optimization effect, but it can ensure reliability.

Looking forward to your opinion, Thanks!
Re: [RFC v5 0/3] migration: reduce time of loading non-iterable vmstate
Hi, Peter

It seems that there was a problem with the code format in my last email. I adjusted the format and am resending this to you. Hope the format of this email won't be wrong again.. :)

On 2023/2/17 11:52 PM, Peter Xu wrote:

Hello, Chuang,

On Fri, Feb 17, 2023 at 04:11:19PM +0800, Chuang Xu wrote:
Error 1 was triggered by our sanity check. I try to add RCU_READ_LOCK_GUARD() in address_space_init() and it works. But I'm not sure if this code change is appropriate. If this change is not appropriate, we may need to consider another sanity check.

I'd suggest not adding RCU locks without a good reason. address_space_init() is definitely a special context because the AS is exclusively owned by the caller before it returns. It means no RCU protection is needed at all because no one else is touching it; neither do we need qatomic_rcu_read() when reading. So I suggest we directly reference current_map, even though that'll need a rich comment:

 static void address_space_set_flatview(AddressSpace *as)
 {
-    FlatView *old_view = address_space_to_flatview(as);
+    /*
+     * NOTE: we don't use the RCU flavoured address_space_to_flatview()
+     * because we exclusively own as->current_map here: it's either during
+     * init of an address space, or during commit() with BQL held.
+     */
+    FlatView *old_view = as->current_map;

We can have address_space_to_flatview_raw() but since we'll directly modify as->current_map very soon in the same function, we may not even bother.

I agree with you. But now I am facing a new problem: our sanity check is not as reliable as we think, although my current code can pass all the tests that Juan told me about in the email.
But if I configure with nothing and run 'make check', my patch triggers an error in the ahci migrate test:

G_TEST_DBUS_DAEMON=/data00/migration/qemu-open/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-x86_64 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=1 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /data00/migration/qemu-open/build/tests/qtest/ahci-test --tap -k
✀
stderr:
qemu-system-x86_64: AHCI: Failed to start FIS receive engine: bad FIS receive buffer address
qemu-system-x86_64: Failed to load ich9_ahci:ahci
qemu-system-x86_64: error while loading state for instance 0x0 of device ':00:1f.2/ich9_ahci'
qemu-system-x86_64: load of migration failed: Operation not permitted
Migration did not complete, status: failed

It seems that ahci_state_post_load() will call memory_access_is_direct() and access mr->ram. Due to the mr transaction delay, this value doesn't meet expectations. And here is the call chain for you to read the code:

-> ahci_state_post_load
 -> ahci_cond_start_engines
  -> ahci_map_fis_address
   -> map_page
    -> dma_memory_map
     -> address_space_map
      -> memory_access_is_direct

I think we need a memory_region_transaction_commit_force() to force commit some transactions when loading vmstate. This function is designed like this:

/*
 * memory_region_transaction_commit_force() is designed to
 * force the mr transaction to be committed in the process
 * of loading vmstate.
 */
void memory_region_transaction_commit_force(void)
{
    AddressSpace *as;
    unsigned int memory_region_transaction_depth_copy =
        memory_region_transaction_depth;

    /*
     * Temporarily replace memory_region_transaction_depth with 0 to prevent
     * memory_region_transaction_commit_force() and address_space_to_flatview()
     * call each other recursively.
     */
    memory_region_transaction_depth = 0;

    assert(qemu_mutex_iothread_locked());

    if (memory_region_update_pending) {
        flatviews_reset();

        MEMORY_LISTENER_CALL_GLOBAL(begin, Forward);

        QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
            address_space_set_flatview(as);
            address_space_update_ioeventfds(as);
        }
        memory_region_update_pending = false;
        ioeventfd_update_pending = false;
        MEMORY_LISTENER_CALL_GLOBAL(commit, Forward);
    } else if (ioeventfd_update_pending) {
        QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
            address_space_update_ioeventfds(as);
        }
        ioeventfd_update_pending = false;
    }

    /* recover memory_region_transaction_depth */
    memory_region_transaction_depth = memory_region_transaction_depth_copy;
}

Now there are two options to use this function:

1. call it in address_space_to_flatview():

static inline FlatView *address_space_to_flatview(AddressSpace *as)
{
    /*
     * Before using any flatview, check whether we're during a memory
     * region transaction. If so, force commit the memory region transaction.
     */
    if (memory_region_transaction_in_progress())
        memory_region_transaction_commit_force();

    return qat
Re: [RFC v5 0/3] migration: reduce time of loading non-iterable vmstate
Hi, Peter

On 2023/2/17 11:52 PM, Peter Xu wrote:

Hello, Chuang,

On Fri, Feb 17, 2023 at 04:11:19PM +0800, Chuang Xu wrote:
Error 1 was triggered by our sanity check. I try to add RCU_READ_LOCK_GUARD() in address_space_init() and it works. But I'm not sure if this code change is appropriate. If this change is not appropriate, we may need to consider another sanity check.

I'd suggest not adding RCU locks without a good reason. address_space_init() is definitely a special context because the AS is exclusively owned by the caller before it returns. It means no RCU protection is needed at all because no one else is touching it; neither do we need qatomic_rcu_read() when reading. So I suggest we directly reference current_map, even though that'll need a rich comment:

 static void address_space_set_flatview(AddressSpace *as)
 {
-    FlatView *old_view = address_space_to_flatview(as);
+    /*
+     * NOTE: we don't use the RCU flavoured address_space_to_flatview()
+     * because we exclusively own as->current_map here: it's either during
+     * init of an address space, or during commit() with BQL held.
+     */
+    FlatView *old_view = as->current_map;

We can have address_space_to_flatview_raw() but since we'll directly modify as->current_map very soon in the same function, we may not even bother.

I agree with you. But now I am facing a new problem: our sanity check is not as reliable as we think, although my current code can pass all the tests that Juan told me about in the email.
But if I configure with nothing and run 'make check', my patch triggers an error in the ahci migrate test:

G_TEST_DBUS_DAEMON=/data00/migration/qemu-open/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-x86_64 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=1 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /data00/migration/qemu-open/build/tests/qtest/ahci-test --tap -k
✀
stderr:
qemu-system-x86_64: AHCI: Failed to start FIS receive engine: bad FIS receive buffer address
qemu-system-x86_64: Failed to load ich9_ahci:ahci
qemu-system-x86_64: error while loading state for instance 0x0 of device ':00:1f.2/ich9_ahci'
qemu-system-x86_64: load of migration failed: Operation not permitted
Migration did not complete, status: failed

It seems that ahci_state_post_load() will call memory_access_is_direct() and access mr->ram. Due to the mr transaction delay, this value doesn't meet expectations. And here is the call chain for you to read the code:

-> ahci_state_post_load
 -> ahci_cond_start_engines
  -> ahci_map_fis_address
   -> map_page
    -> dma_memory_map
     -> address_space_map
      -> memory_access_is_direct

I think we need a memory_region_transaction_commit_force() to force commit some transactions when loading vmstate. This function is designed like this:

/*
 * memory_region_transaction_commit_force() is designed to
 * force the mr transaction to be committed in the process
 * of loading vmstate.
 */
void memory_region_transaction_commit_force(void)
{
    AddressSpace *as;
    unsigned int memory_region_transaction_depth_copy =
        memory_region_transaction_depth;

    /*
     * Temporarily replace memory_region_transaction_depth with 0 to prevent
     * memory_region_transaction_commit_force() and address_space_to_flatview()
     * call each other recursively.
     */
    memory_region_transaction_depth = 0;

    assert(qemu_mutex_iothread_locked());

    if (memory_region_update_pending) {
        flatviews_reset();

        MEMORY_LISTENER_CALL_GLOBAL(begin, Forward);

        QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
            address_space_set_flatview(as);
            address_space_update_ioeventfds(as);
        }
        memory_region_update_pending = false;
        ioeventfd_update_pending = false;
        MEMORY_LISTENER_CALL_GLOBAL(commit, Forward);
    } else if (ioeventfd_update_pending) {
        QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
            address_space_update_ioeventfds(as);
        }
        ioeventfd_update_pending = false;
    }

    /* recover memory_region_transaction_depth */
    memory_region_transaction_depth = memory_region_transaction_depth_copy;
}

Now there are two options to use this function:

1. call it in address_space_to_flatview():

static inline FlatView *address_space_to_flatview(AddressSpace *as)
{
    /*
     * Before using any flatview, check whether we're during a memory
     * region transaction. If so, force commit the memory transaction.
     */
    if (memory_region_transaction_in_progress())
        memory_region_transaction_commit_force();

    return qatomic_rcu_read(&as->current_map);
}

2. call it before each post_load()

I prefer to use the former one; it is more general than the latter. And with this function, the sanity check is not necessary any more. Although we may inevitably call memory_region_transaction_commit_force() several times, in my actual test the number of calls to memory_region_transaction_commit_forc
Re: [RFC v5 0/3] migration: reduce time of loading non-iterable vmstate
Hi, Juan

On 2023/2/16 3:10 AM, Juan Quintela wrote:

Chuang Xu wrote:
In this version:

Hi

I had to drop this. It breaks migration of dbus-vmstate.

144/179 qemu:qtest+qtest-x86_64 / qtest-x86_64/virtio-net-failover ERROR 5.66s killed by signal 6 SIGABRT
G_TEST_DBUS_DAEMON=/mnt/code/qemu/multifd/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-x86_64 MALLOC_PERTURB_=145 /scratch/qemu/multifd/x64/tests/qtest/virtio-net-failover --tap -k
―― ✀ ――
stderr:
qemu-system-x86_64: /mnt/code/qemu/multifd/include/exec/memory.h:1112: address_space_to_flatview: Assertion `(!memory_region_transaction_in_progress() && qemu_mutex_iothread_locked()) || rcu_read_is_locked()' failed.
Broken pipe
../../../../mnt/code/qemu/multifd/tests/qtest/libqtest.c:190: kill_qemu() detected QEMU death from signal 6 (Aborted) (core dumped)

(test program exited with status code -6)

TAP parsing error: Too few tests run (expected 23, got 12)

Can you take a look at this? I reproduced it with "make check" and qemu compiled with the configure line attached.

Later, Juan.
configure --enable-trace-backends=log --prefix=/usr --sysconfdir=/etc/sysconfig/ --audio-drv-list=pa --target-list=x86_64-softmmu --with-coroutine=ucontext --with-git-submodules=validate --enable-fdt=system --enable-alsa --enable-attr --enable-auth-pam --enable-avx2 --enable-avx512f --enable-bochs --enable-bpf --enable-brlapi --disable-bsd-user --enable-bzip2 --enable-cap-ng --disable-capstone --disable-cfi --disable-cfi-debug --enable-cloop --disable-cocoa --enable-containers --disable-coreaudio --enable-coroutine-pool --enable-crypto-afalg --enable-curl --enable-curses --enable-dbus-display --enable-debug-info --disable-debug-mutex --disable-debug-stack-usage --disable-debug-tcg --enable-dmg --disable-docs --disable-dsound --enable-fdt --disable-fuse --disable-fuse-lseek --disable-fuzzing --disable-gcov --enable-gettext --enable-gio --enable-glusterfs --enable-gnutls --disable-gprof --enable-gtk --disable-guest-agent --disable-guest-agent-msi --disable-hax --disable-hvf --enable-iconv --enable-install-blobs --enable-jack --enable-keyring --enable-kvm --enable-l2tpv3 --enable-libdaxctl --enable-libiscsi --enable-libnfs --enable-libpmem --enable-libssh --enable-libudev --enable-libusb --enable-linux-aio --enable-linux-io-uring --disable-linux-user --enable-live-block-migration --disable-lto --disable-lzfse --enable-lzo --disable-malloc-trim --enable-membarrier --enable-module-upgrades --enable-modules --enable-mpath --enable-multiprocess --disable-netmap --enable-nettle --enable-numa --disable-nvmm --enable-opengl --enable-oss --enable-pa --enable-parallels --enable-pie --enable-plugins --enable-png --disable-profiler --enable-pvrdma --enable-qcow1 --enable-qed --disable-qom-cast-debug --enable-rbd --enable-rdma --enable-replication --enable-rng-none --disable-safe-stack --disable-sanitizers --enable-stack-protector --enable-sdl --enable-sdl-image --enable-seccomp --enable-selinux --enable-slirp --enable-slirp-smbd --enable-smartcard --enable-snappy --enable-sparse 
--enable-spice --enable-spice-protocol --enable-system --enable-tcg --disable-tcg-interpreter --disable-tools --enable-tpm --disable-tsan --disable-u2f --enable-usb-redir --disable-user --disable-vde --enable-vdi --enable-vhost-crypto --enable-vhost-kernel --enable-vhost-net --enable-vhost-user --enable-vhost-user-blk-server --enable-vhost-vdpa --enable-virglrenderer --enable-virtfs --enable-virtiofsd
Re: [RFC v5 0/3] migration: reduce time of loading non-iterable vmstate
Hi, Peter!

In my last email to Juan, I mentioned two errors. Now I want to discuss them with you.

On 2023/2/16 11:41 PM, Chuang Xu wrote:

I ran qtest with reference to your environment, and finally it reported two errors.

Error 1 (the same as yours):

QTEST_QEMU_BINARY=./qemu-system-x86_64 MALLOC_PERTURB_=87 G_TEST_DBUS_DAEMON=/data00/migration/qemu-open/tests/dbus-vmstate-daemon.sh /data00/migration/qemu-open/build/tests/qtest/virtio-net-failover --tap -k
― ✀ ―
stderr:
qemu-system-x86_64: /data00/migration/qemu-open/include/exec/memory.h:1114: address_space_to_flatview: Assertion `(!memory_region_transaction_in_progress() && qemu_mutex_iothread_locked()) || rcu_read_is_locked()' failed.
Broken pipe
../tests/qtest/libqtest.c:190: kill_qemu() detected QEMU death from signal 6 (Aborted) (core dumped)
(test program exited with status code -6)
TAP parsing error: Too few tests run (expected 23, got 12)

Coredump backtrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f3af64a8535 in __GI_abort () at abort.c:79
#2  0x7f3af64a840f in __assert_fail_base (fmt=0x7f3af6609ef0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55d9425f48a8 "(!memory_region_transaction_in_progress() && qemu_mutex_iothread_locked()) || rcu_read_is_locked()", file=0x55d9425f4870 "/data00/migration/qemu-open/include/exec/memory.h", line=1114, function=) at assert.c:92
#3  0x7f3af64b61a2 in __GI___assert_fail (assertion=assertion@entry=0x55d9425f48a8 "(!memory_region_transaction_in_progress() && qemu_mutex_iothread_locked()) || rcu_read_is_locked()", file=file@entry=0x55d9425f4870 "/data00/migration/qemu-open/include/exec/memory.h", line=line@entry=1114, function=function@entry=0x55d9426cdce0 <__PRETTY_FUNCTION__.20039> "address_space_to_flatview") at assert.c:101
#4  0x55d942373853 in address_space_to_flatview (as=0x55d944738648) at /data00/migration/qemu-open/include/exec/memory.h:1112
#5  0x55d9423746f5 in address_space_to_flatview (as=0x55d944738648) at /data00/migration/qemu-open/include/qemu/rcu.h:126
#6  address_space_set_flatview (as=as@entry=0x55d944738648) at ../softmmu/memory.c:1029
#7  0x55d94237ace3 in address_space_update_topology (as=0x55d944738648) at ../softmmu/memory.c:1080
#8  address_space_init (as=as@entry=0x55d944738648, root=root@entry=0x55d9447386a0, name=name@entry=0x55d9447384f0 "virtio-net-pci") at ../softmmu/memory.c:3082
#9  0x55d942151e43 in do_pci_register_device (errp=0x7f3aef7fe850, devfn=, name=0x55d9444b6c40 "virtio-net-pci", pci_dev=0x55d944738410) at ../hw/pci/pci.c:1145
#10 pci_qdev_realize (qdev=0x55d944738410, errp=0x7f3aef7fe850) at ../hw/pci/pci.c:2036
#11 0x55d942404a8f in device_set_realized (obj=, value=true, errp=0x7f3aef7feae0) at ../hw/core/qdev.c:510
#12 0x55d942407e36 in property_set_bool (obj=0x55d944738410, v=, name=, opaque=0x55d9444c71d0, errp=0x7f3aef7feae0) at ../qom/object.c:2285
#13 0x55d94240a0e3 in object_property_set (obj=obj@entry=0x55d944738410, name=name@entry=0x55d942670c23 "realized", v=v@entry=0x55d9452f7a00, errp=errp@entry=0x7f3aef7feae0) at ../qom/object.c:1420
#14 0x55d94240d15f in object_property_set_qobject (obj=obj@entry=0x55d944738410, name=name@entry=0x55d942670c23 "realized", value=value@entry=0x55d945306cb0, errp=errp@entry=0x7f3aef7feae0) at ../qom/qom-qobject.c:28
#15 0x55d94240a354 in object_property_set_bool (obj=0x55d944738410, name=name@entry=0x55d942670c23 "realized", value=value@entry=true, errp=errp@entry=0x7f3aef7feae0) at ../qom/object.c:1489
#16 0x55d94240427c in qdev_realize (dev=, bus=bus@entry=0x55d945141400, errp=errp@entry=0x7f3aef7feae0) at ../hw/core/qdev.c:292
#17 0x55d9421ef4a0 in qdev_device_add_from_qdict (opts=0x55d945309c00, from_json=, errp=, errp@entry=0x7f3aef7feae0) at /data00/migration/qemu-open/include/hw/qdev-core.h:17
#18 0x55d942311c85 in failover_add_primary (errp=0x7f3aef7fead8, n=0x55d9454e8530) at ../hw/net/virtio-net.c:933
#19 virtio_net_set_features (vdev=, features=4611687122246533156) at ../hw/net/virtio-net.c:1004
#20 0x55d94233d248 in virtio_set_features_nocheck (vdev=vdev@entry=0x55d9454e8530, val=val@entry=4611687122246533156) at ../hw/virtio/virtio.c:2851
#21 0x55d942342eae in virtio_load (vdev=0x55d9454e8530, f=0x55d944700de0, version_id=11) at ../hw/virtio/virtio.c:3027
#22 0x55d942207601 in vmstate_load_state (f=f@entry=0x55d944700de0, vmsd=0x55d9429baba0 , opaque=0x55d9454e8530, version_id=11) at ../migration/vms
Re: [RFC v5 0/3] migration: reduce time of loading non-iterable vmstate
Hi, Juan

Thanks for your test results!

On 2023/2/16 3:10 AM, Juan Quintela wrote:
> Chuang Xu wrote:
>> In this version:
> Hi
>
> I had to drop this. It breaks migration of dbus-vmstate.

Previously, I only focused on the precopy migration test in the normal scenario, but did not run qtest. So I need to apologize for my inexperience...

> 144/179 qemu:qtest+qtest-x86_64 / qtest-x86_64/virtio-net-failover ERROR 5.66s killed by signal 6 SIGABRT
>>>> G_TEST_DBUS_DAEMON=/mnt/code/qemu/multifd/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-x86_64 MALLOC_PERTURB_=145 /scratch/qemu/multifd/x64/tests/qtest/virtio-net-failover --tap -k
> ―― ✀ ――
> stderr:
> qemu-system-x86_64: /mnt/code/qemu/multifd/include/exec/memory.h:1112: address_space_to_flatview: Assertion `(!memory_region_transaction_in_progress() && qemu_mutex_iothread_locked()) || rcu_read_is_locked()' failed.
> Broken pipe
> ../../../../mnt/code/qemu/multifd/tests/qtest/libqtest.c:190: kill_qemu() detected QEMU death from signal 6 (Aborted) (core dumped)
>
> (test program exited with status code -6)
>
> TAP parsing error: Too few tests run (expected 23, got 12)
>
> Can you take a look at this?
>
> I reproduced it with "make check" and qemu compiled with the configure
> line attached.
>
> Later, Juan.
> > configure --enable-trace-backends=log > --prefix=/usr > --sysconfdir=/etc/sysconfig/ > --audio-drv-list=pa > --target-list=x86_64-softmmu > --with-coroutine=ucontext > --with-git-submodules=validate > --enable-fdt=system > --enable-alsa > --enable-attr > --enable-auth-pam > --enable-avx2 > --enable-avx512f > --enable-bochs > --enable-bpf > --enable-brlapi > --disable-bsd-user > --enable-bzip2 > --enable-cap-ng > --disable-capstone > --disable-cfi > --disable-cfi-debug > --enable-cloop > --disable-cocoa > --enable-containers > --disable-coreaudio > --enable-coroutine-pool > --enable-crypto-afalg > --enable-curl > --enable-curses > --enable-dbus-display > --enable-debug-info > --disable-debug-mutex > --disable-debug-stack-usage > --disable-debug-tcg > --enable-dmg > --disable-docs > --disable-dsound > --enable-fdt > --disable-fuse > --disable-fuse-lseek > --disable-fuzzing > --disable-gcov > --enable-gettext > --enable-gio > --enable-glusterfs > --enable-gnutls > --disable-gprof > --enable-gtk > --disable-guest-agent > --disable-guest-agent-msi > --disable-hax > --disable-hvf > --enable-iconv > --enable-install-blobs > --enable-jack > --enable-keyring > --enable-kvm > --enable-l2tpv3 > --enable-libdaxctl > --enable-libiscsi > --enable-libnfs > --enable-libpmem > --enable-libssh > --enable-libudev > --enable-libusb > --enable-linux-aio > --enable-linux-io-uring > --disable-linux-user > --enable-live-block-migration > --disable-lto > --disable-lzfse > --enable-lzo > --disable-malloc-trim > --enable-membarrier > --enable-module-upgrades > --enable-modules > --enable-mpath > --enable-multiprocess > --disable-netmap > --enable-nettle > --enable-numa > --disable-nvmm > --enable-opengl > --enable-oss > --enable-pa > --enable-parallels > --enable-pie > --enable-plugins > --enable-png > --disable-profiler > --enable-pvrdma > --enable-qcow1 > --enable-qed > --disable-qom-cast-debug > --enable-rbd > --enable-rdma > --enable-replication > --enable-rng-none > 
--disable-safe-stack > --disable-sanitizers > --enable-stack-protector > --enable-sdl > --enable-sdl-image > --enable-seccomp > --enable-selinux > --enable-slirp > --enable-slirp-smbd > --enable-smartcard > --enable-snappy > --enable-sparse > --enable-spice > --enable-spice-protocol > --enable-system > --enable-tcg > --disable-tcg-interpreter > --disable-tools > --enable-tpm > --disable-tsan > --disable-u2f > --enable-usb-redir > --disable-user > --disable-vde > --enable-vdi > --enable-vhost-crypto > --enable-vhost-kernel > --enable-vhost-net > --enable-vhost-user > --enable-vhost-user-blk-server > --enable-vhost-vdpa > --enable-virglrenderer > --enable-virtfs > --enable-virtiofsd > --enable-vnc > --enable-vnc-j
Re: [RFC v5 1/3] rcu: introduce rcu_read_is_locked()
Hi, Paolo!

On 2023/2/2 6:59 PM, Juan Quintela wrote:

Chuang Xu wrote:

add rcu_read_is_locked() to detect holding of rcu lock.

Signed-off-by: Chuang Xu

Reviewed-by: Juan Quintela

Although I think that getting a review from Paolo or anyone that knows more about RCU could be a good idea.

---
 include/qemu/rcu.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index b063c6fde8..719916d9d3 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -119,6 +119,13 @@ static inline void rcu_read_unlock(void)
     }
 }
 
+static inline bool rcu_read_is_locked(void)
+{
+    struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader();
+
+    return p_rcu_reader->depth > 0;
+}
+
 extern void synchronize_rcu(void);
 
 /*

Do you have any more suggestions about patch 1? Looking forward to your reply. Thanks!
[RFC v5 0/3] migration: reduce time of loading non-iterable vmstate
In this version:
- rename rcu_read_locked() to rcu_read_is_locked().
- adjust the sanity check in address_space_to_flatview().
- improve some comments.

The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory region during the period of loading non-iterable vmstate in a single memory transaction. With the increase of devices, this patch will greatly improve the performance.

Here are the test1 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net device
  - 16 4-queue vhost-user-blk device.

          time of loading non-iterable vmstate   downtime
  before  about 150 ms                           740+ ms
  after   about 30 ms                            630+ ms

(This result is different from that of v1. It may be that someone has changed something on my host... but it does not affect the display of the optimization effect.)

In test2, we keep the number of devices the same as test1, and reduce the number of queues per device:

Here are the test2 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 1-queue vhost-net device
  - 16 1-queue vhost-user-blk device.

          time of loading non-iterable vmstate   downtime
  before  about 90 ms                            about 250 ms
  after   about 25 ms                            about 160 ms

In test3, we keep the number of queues per device the same as test1, and reduce the number of devices:

Here are the test3 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device.

          time of loading non-iterable vmstate   downtime
  before  about 20 ms                            about 70 ms
  after   about 11 ms                            about 60 ms

As we can see from the test results above, both the number of queues and the number of devices have a great impact on the time of loading non-iterable vmstate. The growth of the number of devices and queues will lead to more mr commits, and the time consumption caused by the flatview reconstruction will also increase.

Please review, Chuang.

[v4]
- attach more information in the cover letter.
- remove changes on virtio_load.
- add rcu_read_locked() to detect holding of rcu lock.

[v3]
- move virtio_load_check_delay() from virtio_memory_listener_commit() to virtio_vmstate_change().
- add delay_check flag to VirtIODevice to make sure virtio_load_check_delay() will be called when delay_check is true.

[v2]
- rebase to latest upstream.
- add sanity check to address_space_to_flatview().
- postpone the init of the vring cache until migration's loading completes.

[v1]
The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory region during the period of loading non-iterable vmstate in a single memory transaction. With the increase of devices, this patch will greatly improve the performance.

Here are the test results:
test vm info:
- 32 CPUs 128GB RAM
- 8 16-queue vhost-net device
- 16 4-queue vhost-user-blk device.

          time of loading non-iterable vmstate
  before  about 210 ms
  after   about 40 ms
[RFC v5 3/3] migration: reduce time of loading non-iterable vmstate
The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory region during the period of loading non-iterable vmstate in a single memory transaction. With the increase of devices, this patch will greatly improve the performance.

Here are the test1 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net device
  - 16 4-queue vhost-user-blk device.

          time of loading non-iterable vmstate   downtime
  before  about 150 ms                           740+ ms
  after   about 30 ms                            630+ ms

In test2, we keep the number of devices the same as test1, and reduce the number of queues per device:

Here are the test2 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 1-queue vhost-net device
  - 16 1-queue vhost-user-blk device.

          time of loading non-iterable vmstate   downtime
  before  about 90 ms                            about 250 ms
  after   about 25 ms                            about 160 ms

In test3, we keep the number of queues per device the same as test1, and reduce the number of devices:

Here are the test3 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device.

          time of loading non-iterable vmstate   downtime
  before  about 20 ms                            about 70 ms
  after   about 11 ms                            about 60 ms

As we can see from the test results above, both the number of queues and the number of devices have a great impact on the time of loading non-iterable vmstate. The growth of the number of devices and queues will lead to more mr commits, and the time consumption caused by the flatview reconstruction will also increase.
Signed-off-by: Chuang Xu
---
 migration/savevm.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index a0cdb714f7..8ca6d396f4 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2617,6 +2617,16 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
     uint8_t section_type;
     int ret = 0;
 
+    /*
+     * Call memory_region_transaction_begin() before loading vmstate.
+     * This call is paired with memory_region_transaction_commit() at
+     * the end of qemu_loadvm_state_main(), in order to pack all the
+     * changes to memory region during the period of loading
+     * non-iterable vmstate in a single memory transaction.
+     * This operation will reduce time of loading non-iterable vmstate
+     */
+    memory_region_transaction_begin();
+
 retry:
     while (true) {
         section_type = qemu_get_byte(f);
@@ -2684,6 +2694,14 @@ out:
             goto retry;
         }
     }
+
+    /*
+     * Call memory_region_transaction_commit() after loading vmstate.
+     * At this point, qemu actually completes all the previous memory
+     * region transactions.
+     */
+    memory_region_transaction_commit();
+
     return ret;
 }
-- 
2.20.1
[RFC v5 1/3] rcu: introduce rcu_read_is_locked()
add rcu_read_is_locked() to detect holding of rcu lock.

Signed-off-by: Chuang Xu
---
 include/qemu/rcu.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index b063c6fde8..719916d9d3 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -119,6 +119,13 @@ static inline void rcu_read_unlock(void)
     }
 }
 
+static inline bool rcu_read_is_locked(void)
+{
+    struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader();
+
+    return p_rcu_reader->depth > 0;
+}
+
 extern void synchronize_rcu(void);
 
 /*
-- 
2.20.1
[RFC v5 2/3] memory: add depth assert in address_space_to_flatview
Before using any flatview, sanity check we're not during a memory region transaction or the map can be invalid.

Signed-off-by: Chuang Xu
---
 include/exec/memory.h | 15 +++++++++++++++
 softmmu/memory.c      |  5 +++++
 2 files changed, 20 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 91f8a2395a..ce13ebb763 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -27,6 +27,7 @@
 #include "qemu/notify.h"
 #include "qom/object.h"
 #include "qemu/rcu.h"
+#include "qemu/main-loop.h"
 
 #define RAM_ADDR_INVALID (~(ram_addr_t)0)
 
@@ -1069,8 +1070,22 @@ struct FlatView {
     MemoryRegion *root;
 };
 
+bool memory_region_transaction_in_progress(void);
+
 static inline FlatView *address_space_to_flatview(AddressSpace *as)
 {
+    /*
+     * Before using any flatview, sanity check we're not during a memory
+     * region transaction or the map can be invalid. Note that this can
+     * also be called during commit phase of memory transaction, but that
+     * should also only happen when the depth decreases to 0 first.
+     * Meanwhile it's safe to access current_map with RCU read lock held
+     * even if during a memory transaction. It means the user can bear
+     * with an obsolete map.
+     */
+    assert((!memory_region_transaction_in_progress() &&
+            qemu_mutex_iothread_locked()) ||
+           rcu_read_is_locked());
     return qatomic_rcu_read(&as->current_map);
 }
 
diff --git a/softmmu/memory.c b/softmmu/memory.c
index bc0be3f62c..856c37fd0a 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1116,6 +1116,11 @@ void memory_region_transaction_commit(void)
     }
 }
 
+bool memory_region_transaction_in_progress(void)
+{
+    return memory_region_transaction_depth != 0;
+}
+
 static void memory_region_destructor_none(MemoryRegion *mr)
 {
 }
-- 
2.20.1
Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview
Hi, Peter,

On 2023/1/12 11:13 PM, Peter Xu wrote:

We wanted to capture outliers when you apply the follow up patch to the vm load procedure. That will make depth>0 for the whole process of vm load during migration, and we wanted to make sure it's safe, hence this patch, right?

In my perspective, both BQL and RCU can ensure that the flatview will not be released when the worker thread accesses the flatview, because before the rcu thread releases the flatview, it will make sure itself holding BQL and the worker thread not holding RCU. So, whether the depth is 0 or not, as long as BQL or RCU is held, the worker thread will not access the obsolete flatview (IIUC 'obsolete' means that flatview is released).

To summarize, the original check didn't consider BQL, and if we consider BQL I think it should be something like:

    /* Guarantees valid access to the flatview, either lock works */
    assert(BQL_HELD() || RCU_HELD());

    /*
     * Guarantees any BQL holder is not reading obsolete flatview (e.g. when
     * during vm load)
     */
    if (BQL_HELD())
        assert(depth == 0);

IIUC it can be merged into:

    assert((BQL_HELD() && depth == 0) || RCU_HELD());

IMHO assert(BQL_HELD() || RCU_HELD()) is enough...

Yes, but IMHO this will guarantee safe use of the flatview only _before_ your follow up patch. Before that patch, the depth==0 should always stand (when BQL_HELD() stands) I think. After that patch, since depth will be increased at the entry of vm load, there's a risk that we can overlook code that references the flatview during vm load and that can reference an obsolete flatview. Since the whole process of qemu_loadvm_state_main() will have BQL held, we won't hit the assertion if we only check "BQL_HELD() || RCU_HELD()", because BQL_HELD always satisfies.

Or do you think that once a mr transaction is in progress, the old flatview has been obsolete? If we regard the flatview as obsolete when a mr transaction is in progress, how can RCU ensure that the flatview is not obsolete? AFAIU RCU cannot guarantee that.
So IIUC any RCU lock user needs to be able to tolerate obsolete flatviews being referenced, and it should not harm the system. If it needs the latest update, it should take care of that separately. For example, the virtio code we're looking at in this series uses the RCU lock to build an address space cache for the device vrings which is based on the current flatview of mem. It's safe to reference an obsolete flatview in this case (it means the flatview can be during an update when virtio is establishing the address space cache). IMHO that's fine because the address space cache will be updated again in virtio_memory_listener_commit(), so it'll be consistent at last. The intermediate phase of inconsistency should be fine in this case, just like any DMA that happens during a memory hotplug.

For this specific patch, IMHO the core is to check the depth>0 reference, and we need RCU_HELD to mask out where the obsolete references are fine.

Thanks,

Thanks a lot for your patience! I ignored the preconditions for this discussion, so I asked a stupid question.

Now I'm in favor of checking "(BQL_HELD() && depth==0) || RCU_HELD()" in v5. And what does Paolo think of Peter's solution?

Thanks again!
Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview
Hi, Peter, Paolo,

On 2023/1/10 10:45 PM, Peter Xu wrote:

On Tue, Jan 10, 2023 at 12:09:41AM -0800, Chuang Xu wrote:

Hi, Peter and Paolo,

Hi, Chuang, Paolo,

I'm sorry I didn't reply to your email in time. I was infected with COVID-19 two weeks ago, so I couldn't think about the problems discussed in your email for a long time.

On 2023/1/4 1:43 AM, Peter Xu wrote:

Hi, Paolo,

On Wed, Dec 28, 2022 at 09:27:50AM +0100, Paolo Bonzini wrote:

On Fri, Dec 23, 2022 at 16:54, Peter Xu wrote:

This is not valid because the transaction could happen in *another* thread. In that case memory_region_transaction_depth() will be > 0, but RCU is needed.

Do you mean the code is wrong, or the comment? Note that the code has checked rcu_read_locked() where introduced in patch 1, but maybe something else was missed?

The assertion is wrong. It will succeed even if RCU is unlocked in this thread but a transaction is in progress in another thread.

IIUC this is the case where the context:

(1) doesn't have RCU read lock held, and,
(2) doesn't have BQL held.

Is it safe at all to reference any flatview in such a context? The thing is I think the flatview pointer can be freed anytime if both locks are not taken.

Perhaps you can check (memory_region_transaction_depth() > 0 && !qemu_mutex_iothread_locked()) || rcu_read_locked() instead?

What if one thread calls address_space_to_flatview() with BQL held but not RCU read lock held? I assume it's a legal operation, but it seems to be able to trigger the assert already?

Thanks,

I'm not sure whether I understand the content of your discussion correctly, so here I want to first explain my understanding.

From my perspective, Paolo thinks that when thread 1 is in a transaction, thread 2 will trigger the assertion when accessing the flatview without holding the RCU read lock, although sometimes thread 2's access to the flatview is legal.
So Paolo suggests checking (memory_region_transaction_depth() > 0 && !qemu_mutex_iothread_locked()) || rcu_read_locked() instead.

And Peter thinks that as long as no thread holds the BQL or RCU read lock, the old flatview can be released (actually executed by the rcu thread with BQL held). When thread 1 is in a transaction, if thread 2 accesses the flatview with BQL held but not RCU read lock held, it's a legal operation. In this legal case, it seems that both my code and Paolo's code will trigger the assertion.

IIUC your original patch is fine in this case (BQL held, RCU not held), as long as depth==0. IMHO what we want to trap here is when BQL is held (while RCU is not) and depth>0, which can cause the unpredictable side effect of using an obsolete flatview.

I don't quite understand the side effects of depth>0 when BQL is held (while RCU is not). In my perspective, both BQL and RCU can ensure that the flatview will not be released when the worker thread accesses the flatview, because before the rcu thread releases the flatview, it will make sure itself holding BQL and the worker thread not holding RCU. So, whether the depth is 0 or not, as long as BQL or RCU is held, the worker thread will not access the obsolete flatview (IIUC 'obsolete' means that flatview is released).

To summarize, the original check didn't consider BQL, and if we consider BQL I think it should be something like:

    /* Guarantees valid access to the flatview, either lock works */
    assert(BQL_HELD() || RCU_HELD());

    /*
     * Guarantees any BQL holder is not reading obsolete flatview (e.g. when
     * during vm load)
     */
    if (BQL_HELD())
        assert(depth == 0);

IIUC it can be merged into:

    assert((BQL_HELD() && depth == 0) || RCU_HELD());

IMHO assert(BQL_HELD() || RCU_HELD()) is enough...

Or do you think that once a mr transaction is in progress, the old flatview has been obsolete? If we regard the flatview as obsolete when a mr transaction is in progress, how can RCU ensure that the flatview is not obsolete?

What does Paolo think of this check?
Thanks!

I'm not sure if I have a good understanding of your emails. I think checking (memory_region_transaction_get_depth() == 0 || rcu_read_locked() || qemu_mutex_iothread_locked()) should cover the case you discussed.

This still seems problematic, since the assert can pass even if neither BQL nor RCU is held (as long as depth==0).

Thanks,
Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview
Hi, Peter and Paolo,

I'm sorry I didn't reply to your email in time. I was infected with COVID-19 two weeks ago, so I couldn't think about the problems discussed in your email for a long time.

On 2023/1/4 1:43 AM, Peter Xu wrote:
> Hi, Paolo,
>
> On Wed, Dec 28, 2022 at 09:27:50AM +0100, Paolo Bonzini wrote:
>> On Fri, Dec 23, 2022 at 16:54, Peter Xu wrote:
>> This is not valid because the transaction could happen in *another*
>>> thread. In that case memory_region_transaction_depth() will be > 0, but RCU is needed.
>>> Do you mean the code is wrong, or the comment? Note that the code has
>>> checked rcu_read_locked() where introduced in patch 1, but maybe something
>>> else was missed?
>>>
>> The assertion is wrong. It will succeed even if RCU is unlocked in this
>> thread but a transaction is in progress in another thread.
> IIUC this is the case where the context:
>
> (1) doesn't have RCU read lock held, and,
> (2) doesn't have BQL held.
>
> Is it safe at all to reference any flatview in such a context? The thing
> is I think the flatview pointer can be freed anytime if both locks are not
> taken.
>
>> Perhaps you can check (memory_region_transaction_depth() > 0 &&
>> !qemu_mutex_iothread_locked()) || rcu_read_locked() instead?
> What if one thread calls address_space_to_flatview() with BQL held but not
> RCU read lock held? I assume it's a legal operation, but it seems to be
> able to trigger the assert already?
>
> Thanks,
>

I'm not sure whether I understand the content of your discussion correctly, so here I want to first explain my understanding.

From my perspective, Paolo thinks that when thread 1 is in a transaction, thread 2 will trigger the assertion when accessing the flatview without holding the RCU read lock, although sometimes thread 2's access to the flatview is legal. So Paolo suggests checking (memory_region_transaction_depth() > 0 && !qemu_mutex_iothread_locked()) || rcu_read_locked() instead.
And Peter thinks that as long as no thread holds the BQL or RCU read lock, the old flatview can be released (actually executed by the rcu thread with BQL held). When thread 1 is in a transaction, if thread 2 accesses the flatview with BQL held but not RCU read lock held, it's a legal operation. In this legal case, it seems that both my code and Paolo's code will trigger the assertion.

I'm not sure if I have a good understanding of your emails. I think checking (memory_region_transaction_get_depth() == 0 || rcu_read_locked() || qemu_mutex_iothread_locked()) should cover the case you discussed.

What's your take?

Thanks!
Re: [RFC v4 1/3] rcu: introduce rcu_read_locked()
On 2023/1/4 10:20 PM, Alex Bennée wrote:
> Chuang Xu writes:
>
>> add rcu_read_locked() to detect holding of rcu lock.
>>
>> Signed-off-by: Chuang Xu
>> ---
>> include/qemu/rcu.h | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
>> index b063c6fde8..42cbd0080f 100644
>> --- a/include/qemu/rcu.h
>> +++ b/include/qemu/rcu.h
>> @@ -119,6 +119,13 @@ static inline void rcu_read_unlock(void)
>> }
>> }
>>
>> +static inline bool rcu_read_locked(void)
> We use the locked suffix to indicate functions that should be called
> with a lock held. Perhaps renaming this to rcu_read_is_locked() would
> make the intent of the function clearer?

Yes, rcu_read_is_locked() does make the intent of the function clearer. I'll rename the function in v5.

Thanks!

>> +{
>> +    struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader();
>> +
>> +    return p_rcu_reader->depth > 0;
>> +}
>> +
>> extern void synchronize_rcu(void);
>>
>> /*
>
Re: [External] Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview
On 2022/12/28 6:50 PM, Philippe Mathieu-Daudé wrote:

On 23/12/22 15:23, Chuang Xu wrote:

Before using any flatview, sanity check we're not during a memory region transaction or the map can be invalid.

Signed-off-by: Chuang Xu
---
 include/exec/memory.h | 9 +++++++++
 softmmu/memory.c      | 5 +++++
 2 files changed, 14 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 91f8a2395a..66c43b4862 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1069,8 +1069,17 @@ struct FlatView {
     MemoryRegion *root;
 };
 
+int memory_region_transaction_get_depth(void);

Do we want to expose this; isn't the depth internal? If we need to expose something, can we restrict it to bool memory_region_in_transaction(void) or bool memory_region_transaction_in_progress(void)?

Yes, we'd better not expose the value of an internal variable. I'll make changes in v5.

Thanks!
Re: [RFC v4 3/3] migration: reduce time of loading non-iterable vmstate
On 2022/12/24 12:06 AM, David Hildenbrand wrote:

On 23.12.22 15:23, Chuang Xu wrote:

The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory region during the period of loading non-iterable vmstate in a single memory transaction. With the increase of devices, this patch will greatly improve the performance.

Here are the test1 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net device
  - 16 4-queue vhost-user-blk device.

          time of loading non-iterable vmstate   downtime
  before  about 150 ms                           740+ ms
  after   about 30 ms                            630+ ms

In test2, we keep the number of devices the same as test1, and reduce the number of queues per device:

Here are the test2 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 1-queue vhost-net device
  - 16 1-queue vhost-user-blk device.

          time of loading non-iterable vmstate   downtime
  before  about 90 ms                            about 250 ms
  after   about 25 ms                            about 160 ms

In test3, we keep the number of queues per device the same as test1, and reduce the number of devices:

Here are the test3 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device.

          time of loading non-iterable vmstate   downtime
  before  about 20 ms                            about 70 ms
  after   about 11 ms                            about 60 ms

As we can see from the test results above, both the number of queues and the number of devices have a great impact on the time of loading non-iterable vmstate. The growth of the number of devices and queues will lead to more mr commits, and the time consumption caused by the flatview reconstruction will also increase.

Signed-off-by: Chuang Xu
---
 migration/savevm.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index a0cdb714f7..19785e5a54 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2617,6 +2617,9 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
     uint8_t section_type;
     int ret = 0;
 
+    /* call memory_region_transaction_begin() before loading vmstate */

I'd suggest extending the comment *why* you are doing that, that it's a pure performance optimization, and how it achieves that.

Thanks! I'll extend the comment in v5.
Re: [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate
On 2022/12/23 11:50 PM, Peter Xu wrote:

Chuang,

On Fri, Dec 23, 2022 at 10:23:04PM +0800, Chuang Xu wrote:

In this version:
- attach more information in the cover letter.
- remove changes on virtio_load().
- add rcu_read_locked() to detect holding of rcu lock.

The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory region during the period of loading non-iterable vmstate in a single memory transaction. With the increase of devices, this patch will greatly improve the performance.

Here are the test1 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net device
  - 16 4-queue vhost-user-blk device.

          time of loading non-iterable vmstate   downtime
  before  about 150 ms                           740+ ms
  after   about 30 ms                            630+ ms

Have you investigated why multi-queue added so much downtime overhead with the same environment, compared to [1] below?

I have analyzed the downtime in detail. Both stopping and starting the device are time-consuming.

For stopping vhost-net devices, vhost_net_stop_one() will be called once more for each additional queue, while vhost_virtqueue_stop() will be called twice in vhost_dev_stop(). For example, we need to call vhost_virtqueue_stop() 32 (= 16 * 2) times to stop a 16-queue vhost-net device. In vhost_virtqueue_stop(), QEMU needs to negotiate with the vhost user daemon. The same is true for vhost-net device startup.

For stopping vhost-user-blk devices, vhost_virtqueue_stop() will be called once more for each additional queue. For example, we need to call vhost_virtqueue_stop() 4 times to stop a 4-queue vhost-user-blk device. The same is true for vhost-user-blk device startup.

It seems that the vhost-user-blk device is less affected by the number of queues than the vhost-net device. However, the vhost-user-blk device needs to prepare inflight when it is started. The negotiation with spdk in this step is also time-consuming. I tried to move this step to the startup phase of the target QEMU before the migration started. In my test, this optimization can greatly reduce the vhost-user-blk device startup time and thus reduce the downtime. I'm not sure whether this is hacky. If you are interested in this, maybe we can discuss it further.

(This result is different from that of v1. It may be that someone has changed something on my host... but it does not affect the display of the optimization effect.)

In test2, we keep the number of devices the same as test1, and reduce the number of queues per device:

Here are the test2 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 1-queue vhost-net device
  - 16 1-queue vhost-user-blk device.

          time of loading non-iterable vmstate   downtime
  before  about 90 ms                            about 250 ms
  after   about 25 ms                            about 160 ms

[1]

In test3, we keep the number of queues per device the same as test1, and reduce the number of devices:

Here are the test3 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device.

          time of loading non-iterable vmstate   downtime
  before  about 20 ms                            about 70 ms
  after   about 11 ms                            about 60 ms

As we can see from the test results above, both the number of queues and the number of devices have a great impact on the time of loading non-iterable vmstate. The growth of the number of devices and queues will lead to more mr commits, and the time consumption caused by the flatview reconstruction will also increase.
The downtime measured in precopy can be more complicated than in postcopy,
because the time of the switchover is calculated by QEMU based on the
downtime setup, and it also contains part of the RAM migration. Postcopy
should be more accurate in that regard because there's no calculation done,
and meanwhile there's no RAM transferred during downtime.

However, the postcopy downtime is not accurate either in its current
implementation in postcopy_start(), where the downtime is measured right
after we flushed the packed data, and right below it there's some idea of
optimizing it:

    if (migrate_postcopy_ram()) {
        /*
         * Although this ping is just for debug, it could potentially be
         * used for getting a better measurement of downtime at the source.
         */
        qemu_savevm_send_ping(ms->to_dst_file, 4);
    }

So maybe I'll have a look there. The current calculation of downtime is
really inaccurate, because the sou
[RFC v4 2/3] memory: add depth assert in address_space_to_flatview
Before using any flatview, sanity check we're not during a memory region
transaction or the map can be invalid.

Signed-off-by: Chuang Xu
---
 include/exec/memory.h | 9 +++++++++
 softmmu/memory.c      | 5 +++++
 2 files changed, 14 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 91f8a2395a..66c43b4862 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1069,8 +1069,17 @@ struct FlatView {
     MemoryRegion *root;
 };
 
+int memory_region_transaction_get_depth(void);
+
 static inline FlatView *address_space_to_flatview(AddressSpace *as)
 {
+    /*
+     * Before using any flatview, sanity check we're not during a memory
+     * region transaction or the map can be invalid. Note that this can
+     * also be called during commit phase of memory transaction, but that
+     * should also only happen when the depth decreases to 0 first.
+     */
+    assert(memory_region_transaction_get_depth() == 0 || rcu_read_locked());
     return qatomic_rcu_read(&as->current_map);
 }
 
diff --git a/softmmu/memory.c b/softmmu/memory.c
index bc0be3f62c..01192e2e5b 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1116,6 +1116,11 @@ void memory_region_transaction_commit(void)
     }
 }
 
+int memory_region_transaction_get_depth(void)
+{
+    return memory_region_transaction_depth;
+}
+
 static void memory_region_destructor_none(MemoryRegion *mr)
 {
 }
-- 
2.20.1
[RFC v4 1/3] rcu: introduce rcu_read_locked()
add rcu_read_locked() to detect holding of rcu lock.

Signed-off-by: Chuang Xu
---
 include/qemu/rcu.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index b063c6fde8..42cbd0080f 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -119,6 +119,13 @@ static inline void rcu_read_unlock(void)
     }
 }
 
+static inline bool rcu_read_locked(void)
+{
+    struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader();
+
+    return p_rcu_reader->depth > 0;
+}
+
 extern void synchronize_rcu(void);
 
 /*
-- 
2.20.1
[RFC v4 0/3] migration: reduce time of loading non-iterable vmstate
In this version:
- attach more information in the cover letter.
- remove changes on virtio_load().
- add rcu_read_locked() to detect holding of rcu lock.

The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (starting with the timestamp of source qemu stop and
ending with the timestamp of target qemu start). Most of the time is spent
committing memory region changes repeatedly.

This patch packs all the changes to memory region during the period of
loading non-iterable vmstate in a single memory transaction. With the
increase of devices, this patch will greatly improve the performance.

Here are the test1 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net device
  - 16 4-queue vhost-user-blk device.

            time of loading non-iterable vmstate    downtime
before      about 150 ms                            740+ ms
after       about 30 ms                             630+ ms

(This result is different from that of v1. It may be that someone has
changed something on my host.., but it does not affect the display of the
optimization effect.)

In test2, we keep the number of devices the same as test1 and reduce the
number of queues per device. Here are the test2 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 1-queue vhost-net device
  - 16 1-queue vhost-user-blk device.

            time of loading non-iterable vmstate    downtime
before      about 90 ms                             about 250 ms
after       about 25 ms                             about 160 ms

In test3, we keep the number of queues per device the same as test1 and
reduce the number of devices. Here are the test3 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device.
            time of loading non-iterable vmstate    downtime
before      about 20 ms                             about 70 ms
after       about 11 ms                             about 60 ms

As we can see from the test results above, both the number of queues and
the number of devices have a great impact on the time of loading
non-iterable vmstate. The growth of the number of devices and queues leads
to more memory region commits, and the time spent on flatview
reconstruction also increases.

Please review, Chuang.

[v3]
- move virtio_load_check_delay() from virtio_memory_listener_commit() to
  virtio_vmstate_change().
- add delay_check flag to VirtIODevice to make sure
  virtio_load_check_delay() will be called when delay_check is true.

[v2]
- rebase to latest upstream.
- add sanity check to address_space_to_flatview().
- postpone the init of the vring cache until migration's loading completes.

[v1]
The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (starting with the timestamp of source qemu stop and
ending with the timestamp of target qemu start). Most of the time is spent
committing memory region changes repeatedly.

This patch packs all the changes to memory region during the period of
loading non-iterable vmstate in a single memory transaction. With the
increase of devices, this patch will greatly improve the performance.

Here are the test results:
test vm info:
- 32 CPUs 128GB RAM
- 8 16-queue vhost-net device
- 16 4-queue vhost-user-blk device.

            time of loading non-iterable vmstate
before      about 210 ms
after       about 40 ms
[RFC v4 3/3] migration: reduce time of loading non-iterable vmstate
The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (starting with the timestamp of source qemu stop and
ending with the timestamp of target qemu start). Most of the time is spent
committing memory region changes repeatedly.

This patch packs all the changes to memory region during the period of
loading non-iterable vmstate in a single memory transaction. With the
increase of devices, this patch will greatly improve the performance.

Here are the test1 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net device
  - 16 4-queue vhost-user-blk device.

            time of loading non-iterable vmstate    downtime
before      about 150 ms                            740+ ms
after       about 30 ms                             630+ ms

In test2, we keep the number of devices the same as test1 and reduce the
number of queues per device. Here are the test2 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 1-queue vhost-net device
  - 16 1-queue vhost-user-blk device.

            time of loading non-iterable vmstate    downtime
before      about 90 ms                             about 250 ms
after       about 25 ms                             about 160 ms

In test3, we keep the number of queues per device the same as test1 and
reduce the number of devices. Here are the test3 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device.

            time of loading non-iterable vmstate    downtime
before      about 20 ms                             about 70 ms
after       about 11 ms                             about 60 ms

As we can see from the test results above, both the number of queues and
the number of devices have a great impact on the time of loading
non-iterable vmstate. The growth of the number of devices and queues leads
to more memory region commits, and the time spent on flatview
reconstruction also increases.
Signed-off-by: Chuang Xu
---
 migration/savevm.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index a0cdb714f7..19785e5a54 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2617,6 +2617,9 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
     uint8_t section_type;
     int ret = 0;
 
+    /* call memory_region_transaction_begin() before loading vmstate */
+    memory_region_transaction_begin();
+
 retry:
     while (true) {
         section_type = qemu_get_byte(f);
@@ -2684,6 +2687,10 @@ out:
             goto retry;
         }
     }
+
+    /* call memory_region_transaction_commit() after loading vmstate */
+    memory_region_transaction_commit();
+
     return ret;
 }
-- 
2.20.1
Re: [RFC v3 1/3] memory: add depth assert in address_space_to_flatview
On 2022/12/16 at 12:04 AM, Peter Xu wrote:
> On Wed, Dec 14, 2022 at 04:38:52PM -0500, Peter Xu wrote:
>> On Wed, Dec 14, 2022 at 08:03:38AM -0800, Chuang Xu wrote:
>>> On 2022/12/13 at 9:35 PM, Chuang Xu wrote:
>>>> Before using any flatview, sanity check we're not during a memory
>>>> region transaction or the map can be invalid.
>>>>
>>>> Signed-off-by: Chuang Xu
>>>> ---
>>>>  include/exec/memory.h | 9 +++++++++
>>>>  softmmu/memory.c      | 1 -
>>>>  2 files changed, 9 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>>>> index 91f8a2395a..b43cd46084 100644
>>>> --- a/include/exec/memory.h
>>>> +++ b/include/exec/memory.h
>>>> @@ -1069,8 +1069,17 @@ struct FlatView {
>>>>      MemoryRegion *root;
>>>>  };
>>>>
>>>> +static unsigned memory_region_transaction_depth;
>>>> +
>>>>  static inline FlatView *address_space_to_flatview(AddressSpace *as)
>>>>  {
>>>> +    /*
>>>> +     * Before using any flatview, sanity check we're not during a memory
>>>> +     * region transaction or the map can be invalid. Note that this can
>>>> +     * also be called during commit phase of memory transaction, but that
>>>> +     * should also only happen when the depth decreases to 0 first.
>>>> +     */
>>>> +    assert(memory_region_transaction_depth == 0);
>>>>      return qatomic_rcu_read(&as->current_map);
>>>>  }
>>>>
>>>> diff --git a/softmmu/memory.c b/softmmu/memory.c
>>>> index bc0be3f62c..f177c40cd8 100644
>>>> --- a/softmmu/memory.c
>>>> +++ b/softmmu/memory.c
>>>> @@ -37,7 +37,6 @@
>>>>  //#define DEBUG_UNASSIGNED
>>>>
>>>> -static unsigned memory_region_transaction_depth;
>>>>  static bool memory_region_update_pending;
>>>>  static bool ioeventfd_update_pending;
>>>>  unsigned int global_dirty_tracking;
>>>
>>> Here are some new situations to be synchronized. I found that there is a
>>> probability to trigger the assert in the QEMU startup phase.
Here is the coredump backtrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f7825e33535 in __GI_abort () at abort.c:79
#2  0x7f7825e3340f in __assert_fail_base (fmt=0x7f7825f94ef0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5653c643add8 "memory_region_transaction_depth == 0", file=0x5653c63dad78 "/data00/migration/qemu-open/include/exec/memory.h", line=1082, function=) at assert.c:92
#3  0x7f7825e411a2 in __GI___assert_fail (assertion=assertion@entry=0x5653c643add8 "memory_region_transaction_depth == 0", file=file@entry=0x5653c63dad78 "/data00/migration/qemu-open/include/exec/memory.h", line=line@entry=1082, function=function@entry=0x5653c643bd00 <__PRETTY_FUNCTION__.18101> "address_space_to_flatview") at assert.c:101
#4  0x5653c60f0383 in address_space_to_flatview (as=0x5653c6af2340 ) at /data00/migration/qemu-open/include/exec/memory.h:1082
#5  address_space_to_flatview (as=0x5653c6af2340 ) at /data00/migration/qemu-open/include/exec/memory.h:1074
#6  address_space_get_flatview (as=0x5653c6af2340 ) at ../softmmu/memory.c:809
#7  0x5653c60fef04 in address_space_cache_init (cache=cache@entry=0x7f781fff8420, as=, addr=63310635776, len=48, is_write=is_write@entry=false) at ../softmmu/physmem.c:3352
#8  0x5653c60c08c5 in virtqueue_split_pop (vq=0x7f781c576270, sz=264) at ../hw/virtio/virtio.c:1959
#9  0x5653c60c0b7d in virtqueue_pop (vq=vq@entry=0x7f781c576270, sz=) at ../hw/virtio/virtio.c:2177
#10 0x5653c609f14f in virtio_scsi_pop_req (s=s@entry=0x5653c9034300, vq=vq@entry=0x7f781c576270) at ../hw/scsi/virtio-scsi.c:219
#11 0x5653c60a04a3 in virtio_scsi_handle_cmd_vq (vq=0x7f781c576270, s=0x5653c9034300) at ../hw/scsi/virtio-scsi.c:735
#12 virtio_scsi_handle_cmd (vdev=0x5653c9034300, vq=0x7f781c576270) at ../hw/scsi/virtio-scsi.c:776
#13 0x5653c60ba72f in virtio_queue_notify_vq (vq=0x7f781c576270) at ../hw/virtio/virtio.c:2847
#14 0x5653c62d9706 in aio_dispatch_handler (ctx=ctx@entry=0x5653c84909e0, node=0x7f68e4007840) at
../util/aio-posix.c:369
#15 0x5653c62da254 in aio_dispatch_ready_handlers (ready_list=0x7f781fffe6a8, ctx=0x5653c84909e0) at ../util/aio-posix.c:399
#16 aio_poll (ctx=0x5653c84909e0, blocking=blocking@entry=true) at ../util/aio-posix.c:713
#17 0x5653c61b2296 in iothread_run (opaque=opaque@entry=0x5653c822c390) at ../iothread.c:67
#18 0x5653c62dcd8a in qemu_thread_start (args=) at ../util/qemu-thread-posix.c:505
#19 0x7f7825fd8fa3 in start_thread (arg=) at pthread_create.c:486
#20 0x7f7825f0a06f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This does look like a bug to me. Paolo/Michael?

Hmm, I found that virtqueue_split_pop() took the rcu lock.. then I think
it's fine.

Chuang, I think what you can try next is to add a helper to detect holding
of the rcu lock, then assert with "depth == 0 || rcu_read_locked()". I
think that's:

    static inline bool rcu_read_locked(void)
    {
        struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader();

        return p_rcu_reader->depth > 0;
    }

Then IIUC you can even drop patch 2, because virtio_load() also takes the
rcu lock.

Good idea! Later I'll try this way in my v4 patches. Thanks very much!
Re: [RFC v3 1/3] memory: add depth assert in address_space_to_flatview
On 2022/12/16 at 12:51 AM, Peter Maydell wrote:
> On Tue, 13 Dec 2022 at 13:36, Chuang Xu wrote:
>> Before using any flatview, sanity check we're not during a memory region
>> transaction or the map can be invalid.
>>
>> Signed-off-by: Chuang Xu
>> ---
>>  include/exec/memory.h | 9 +++++++++
>>  softmmu/memory.c      | 1 -
>>  2 files changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 91f8a2395a..b43cd46084 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -1069,8 +1069,17 @@ struct FlatView {
>>      MemoryRegion *root;
>>  };
>>
>> +static unsigned memory_region_transaction_depth;
>
> This looks odd. If you define a static variable in a header file then
> each .c file which directly or indirectly includes the header will get
> its own private copy of the variable. This probably isn't what you
> want...
>
> thanks
> -- PMM

Yes, maybe we should add a function to acquire the value.. I'll add this
part to v4. Thanks!
Re: [RFC v3 1/3] memory: add depth assert in address_space_to_flatview
On 2022/12/13 at 9:35 PM, Chuang Xu wrote:
> Before using any flatview, sanity check we're not during a memory region
> transaction or the map can be invalid.
>
> Signed-off-by: Chuang Xu
> ---
>  include/exec/memory.h | 9 +++++++++
>  softmmu/memory.c      | 1 -
>  2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 91f8a2395a..b43cd46084 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -1069,8 +1069,17 @@ struct FlatView {
>      MemoryRegion *root;
>  };
>
> +static unsigned memory_region_transaction_depth;
> +
>  static inline FlatView *address_space_to_flatview(AddressSpace *as)
>  {
> +    /*
> +     * Before using any flatview, sanity check we're not during a memory
> +     * region transaction or the map can be invalid. Note that this can
> +     * also be called during commit phase of memory transaction, but that
> +     * should also only happen when the depth decreases to 0 first.
> +     */
> +    assert(memory_region_transaction_depth == 0);
>      return qatomic_rcu_read(&as->current_map);
>  }
>
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index bc0be3f62c..f177c40cd8 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -37,7 +37,6 @@
>  //#define DEBUG_UNASSIGNED
>
> -static unsigned memory_region_transaction_depth;
>  static bool memory_region_update_pending;
>  static bool ioeventfd_update_pending;
>  unsigned int global_dirty_tracking;

Here are some new situations to be synchronized. I found that there is a
probability to trigger the assert in the QEMU startup phase.
Here is the coredump backtrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f7825e33535 in __GI_abort () at abort.c:79
#2  0x7f7825e3340f in __assert_fail_base (fmt=0x7f7825f94ef0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5653c643add8 "memory_region_transaction_depth == 0", file=0x5653c63dad78 "/data00/migration/qemu-open/include/exec/memory.h", line=1082, function=) at assert.c:92
#3  0x7f7825e411a2 in __GI___assert_fail (assertion=assertion@entry=0x5653c643add8 "memory_region_transaction_depth == 0", file=file@entry=0x5653c63dad78 "/data00/migration/qemu-open/include/exec/memory.h", line=line@entry=1082, function=function@entry=0x5653c643bd00 <__PRETTY_FUNCTION__.18101> "address_space_to_flatview") at assert.c:101
#4  0x5653c60f0383 in address_space_to_flatview (as=0x5653c6af2340 ) at /data00/migration/qemu-open/include/exec/memory.h:1082
#5  address_space_to_flatview (as=0x5653c6af2340 ) at /data00/migration/qemu-open/include/exec/memory.h:1074
#6  address_space_get_flatview (as=0x5653c6af2340 ) at ../softmmu/memory.c:809
#7  0x5653c60fef04 in address_space_cache_init (cache=cache@entry=0x7f781fff8420, as=, addr=63310635776, len=48, is_write=is_write@entry=false) at ../softmmu/physmem.c:3352
#8  0x5653c60c08c5 in virtqueue_split_pop (vq=0x7f781c576270, sz=264) at ../hw/virtio/virtio.c:1959
#9  0x5653c60c0b7d in virtqueue_pop (vq=vq@entry=0x7f781c576270, sz=) at ../hw/virtio/virtio.c:2177
#10 0x5653c609f14f in virtio_scsi_pop_req (s=s@entry=0x5653c9034300, vq=vq@entry=0x7f781c576270) at ../hw/scsi/virtio-scsi.c:219
#11 0x5653c60a04a3 in virtio_scsi_handle_cmd_vq (vq=0x7f781c576270, s=0x5653c9034300) at ../hw/scsi/virtio-scsi.c:735
#12 virtio_scsi_handle_cmd (vdev=0x5653c9034300, vq=0x7f781c576270) at ../hw/scsi/virtio-scsi.c:776
#13 0x5653c60ba72f in virtio_queue_notify_vq (vq=0x7f781c576270) at ../hw/virtio/virtio.c:2847
#14 0x5653c62d9706 in aio_dispatch_handler (ctx=ctx@entry=0x5653c84909e0, node=0x7f68e4007840) at
../util/aio-posix.c:369
#15 0x5653c62da254 in aio_dispatch_ready_handlers (ready_list=0x7f781fffe6a8, ctx=0x5653c84909e0) at ../util/aio-posix.c:399
#16 aio_poll (ctx=0x5653c84909e0, blocking=blocking@entry=true) at ../util/aio-posix.c:713
#17 0x5653c61b2296 in iothread_run (opaque=opaque@entry=0x5653c822c390) at ../iothread.c:67
#18 0x5653c62dcd8a in qemu_thread_start (args=) at ../util/qemu-thread-posix.c:505
#19 0x7f7825fd8fa3 in start_thread (arg=) at pthread_create.c:486
#20 0x7f7825f0a06f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I find that some functions don't take the BQL before calling
address_space_to_flatview(), as shown in the backtrace. I also found other
similar situations in the code. prepare_mmio_access() takes the BQL to
provide some protection, but I don't think it's sufficient. If we want to
ensure that reads and writes of the mr and modifications of the mr don't
occur at the same time, maybe we need to take the BQL in
address_space_to_flatview() like this:

    static FlatView *address_space_to_flatview(AddressSpace *as)
    {
        bool release_lock = false;
        FlatView *ret;

        if (!qemu_mutex_iothread_locked()) {
            qemu_mutex_lock_iothread();
            release_lock = true;
        }

        /*
         * Before using any flatview, sanity
Re: [RFC v3 2/3] virtio: support delay of checks in virtio_load()
On 2022/12/14 at 12:31 AM, Peter Xu wrote:
> On Tue, Dec 13, 2022 at 09:35:09PM +0800, Chuang Xu wrote:
>> Delay checks in virtio_load() to avoid a possible
>> address_space_to_flatview() call during memory region's begin/commit.
>
> I didn't notice virtio has the vm change handler already, looks good to
> reuse it. :) A few more comments though (before some real virtio
> developers chime in).
>
>> Signed-off-by: Chuang Xu
>> ---
>>  hw/virtio/virtio.c         | 37 +++++++++++++++++++++++++++----------
>>  include/hw/virtio/virtio.h |  2 ++
>>  2 files changed, 29 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
>> index eb6347ab5d..f556e565c6 100644
>> --- a/hw/virtio/virtio.c
>> +++ b/hw/virtio/virtio.c
>> @@ -3642,8 +3642,26 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
>>          vdev->start_on_kick = true;
>>      }
>>
>> +    vdev->delay_check = true;
>> +
>> +    if (vdc->post_load) {
>> +        ret = vdc->post_load(vdev);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void virtio_load_check_delay(VirtIODevice *vdev)
>> +{
>>      RCU_READ_LOCK_GUARD();
>> -    for (i = 0; i < num; i++) {
>> +    for (int i = 0; i < VIRTIO_QUEUE_MAX; i++) {
>> +        if (vdev->vq[i].vring.num == 0) {
>> +            break;
>> +        }
>> +
>>          if (vdev->vq[i].vring.desc) {
>>              uint16_t nheads;
>> @@ -3696,19 +3714,12 @@
>>                           i, vdev->vq[i].vring.num,
>>                           vdev->vq[i].last_avail_idx, vdev->vq[i].used_idx);
>> -            return -1;
>> +            abort();
>
> This is when the switchover finished. I'm not sure how severe this is and
> whether there can be something to remedy - abort() is probably the least
> we want to do here, since the admin may not want to crash the whole VM due
> to one vring failure on one device.

At this time, the vcpus are still stopped. If these checks fail in
virtio_load(), -1 will be returned and the migration will be rolled back.
But virtio_vmstate_change() returns nothing, so if we want to roll back the
migration after the checks fail, I think we need abort().
>>          }
>>      }
>>
>> -    if (vdc->post_load) {
>> -        ret = vdc->post_load(vdev);
>> -        if (ret) {
>> -            return ret;
>> -        }
>> -    }
>> -
>> -    return 0;
>> +    return;
>>  }
>>
>>  void virtio_cleanup(VirtIODevice *vdev)
>> @@ -3722,6 +3733,11 @@ static void virtio_vmstate_change(void *opaque, bool running, RunState state)
>>      BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
>>      VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
>>      bool backend_run = running && virtio_device_started(vdev, vdev->status);
>> +
>> +    if (vdev->delay_check) {
>> +        virtio_load_check_delay(vdev);
>> +        vdev->delay_check = false;
>> +    }
>>      vdev->vm_running = running;
>>
>>      if (backend_run) {
>> @@ -3789,6 +3805,7 @@ void virtio_init(VirtIODevice *vdev, uint16_t device_id, size_t config_size)
>>                            virtio_vmstate_change, vdev);
>>      vdev->device_endian = virtio_default_endian();
>>      vdev->use_guest_notifier_mask = true;
>> +    vdev->delay_check = false;
>>  }
>>
>>  /*
>> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
>> index acfd4df125..269e80d04a 100644
>> --- a/include/hw/virtio/virtio.h
>> +++ b/include/hw/virtio/virtio.h
>> @@ -135,6 +135,8 @@ struct VirtIODevice
>>      AddressSpace *dma_as;
>>      QLIST_HEAD(, VirtQueue) *vector_queues;
>>      QTAILQ_ENTRY(VirtIODevice) next;
>> +    /* @delay_check: delay checks in virtio_load */
>> +    bool delay_check;
>
> I think it covers more than the check? It also initializes variables like
> used_idx and shadow_avail_idx. I'm not sure how vital they are, but I'd
> just avoid using the word "check" if not sure (e.g. "load_delay", or
> "load_finalize"?).

OK. I prefer to use "load_finalize". Thanks!

>>  };
>>
>>  struct VirtioDeviceClass {
>> --
>> 2.20.1
[RFC v3 3/3] migration: reduce time of loading non-iterable vmstate
The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (starting with the timestamp of source qemu stop and
ending with the timestamp of target qemu start). Most of the time is spent
committing memory region changes repeatedly.

This patch packs all the changes to memory region during the period of
loading non-iterable vmstate in a single memory transaction. With the
increase of devices, this patch will greatly improve the performance.

Here are the test results:
test vm info:
- 32 CPUs 128GB RAM
- 8 16-queue vhost-net device
- 16 4-queue vhost-user-blk device.

            time of loading non-iterable vmstate
before      about 210 ms
after       about 40 ms

Signed-off-by: Chuang Xu
---
 migration/savevm.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index a0cdb714f7..19785e5a54 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2617,6 +2617,9 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
     uint8_t section_type;
     int ret = 0;
 
+    /* call memory_region_transaction_begin() before loading vmstate */
+    memory_region_transaction_begin();
+
 retry:
     while (true) {
         section_type = qemu_get_byte(f);
@@ -2684,6 +2687,10 @@ out:
             goto retry;
         }
     }
+
+    /* call memory_region_transaction_commit() after loading vmstate */
+    memory_region_transaction_commit();
+
     return ret;
 }
-- 
2.20.1
[RFC v3 2/3] virtio: support delay of checks in virtio_load()
Delay checks in virtio_load() to avoid a possible
address_space_to_flatview() call during memory region's begin/commit.

Signed-off-by: Chuang Xu
---
 hw/virtio/virtio.c         | 37 +++++++++++++++++++++++++++----------
 include/hw/virtio/virtio.h |  2 ++
 2 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index eb6347ab5d..f556e565c6 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3642,8 +3642,26 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
         vdev->start_on_kick = true;
     }
 
+    vdev->delay_check = true;
+
+    if (vdc->post_load) {
+        ret = vdc->post_load(vdev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+static void virtio_load_check_delay(VirtIODevice *vdev)
+{
     RCU_READ_LOCK_GUARD();
-    for (i = 0; i < num; i++) {
+    for (int i = 0; i < VIRTIO_QUEUE_MAX; i++) {
+        if (vdev->vq[i].vring.num == 0) {
+            break;
+        }
+
         if (vdev->vq[i].vring.desc) {
             uint16_t nheads;
@@ -3696,19 +3714,12 @@
                          i, vdev->vq[i].vring.num,
                          vdev->vq[i].last_avail_idx, vdev->vq[i].used_idx);
-            return -1;
+            abort();
         }
     }
 }
 
-    if (vdc->post_load) {
-        ret = vdc->post_load(vdev);
-        if (ret) {
-            return ret;
-        }
-    }
-
-    return 0;
+    return;
 }
 
 void virtio_cleanup(VirtIODevice *vdev)
@@ -3722,6 +3733,11 @@ static void virtio_vmstate_change(void *opaque, bool running, RunState state)
     BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
     bool backend_run = running && virtio_device_started(vdev, vdev->status);
+
+    if (vdev->delay_check) {
+        virtio_load_check_delay(vdev);
+        vdev->delay_check = false;
+    }
     vdev->vm_running = running;
 
     if (backend_run) {
@@ -3789,6 +3805,7 @@ void virtio_init(VirtIODevice *vdev, uint16_t device_id, size_t config_size)
                           virtio_vmstate_change, vdev);
     vdev->device_endian = virtio_default_endian();
     vdev->use_guest_notifier_mask = true;
+    vdev->delay_check = false;
 }
 
 /*
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index acfd4df125..269e80d04a
100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -135,6 +135,8 @@ struct VirtIODevice
     AddressSpace *dma_as;
     QLIST_HEAD(, VirtQueue) *vector_queues;
     QTAILQ_ENTRY(VirtIODevice) next;
+    /* @delay_check: delay checks in virtio_load */
+    bool delay_check;
 };
 
 struct VirtioDeviceClass {
-- 
2.20.1
[RFC v3 1/3] memory: add depth assert in address_space_to_flatview
Before using any flatview, sanity check we're not during a memory region
transaction or the map can be invalid.

Signed-off-by: Chuang Xu
---
 include/exec/memory.h | 9 +++++++++
 softmmu/memory.c      | 1 -
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 91f8a2395a..b43cd46084 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1069,8 +1069,17 @@ struct FlatView {
     MemoryRegion *root;
 };
 
+static unsigned memory_region_transaction_depth;
+
 static inline FlatView *address_space_to_flatview(AddressSpace *as)
 {
+    /*
+     * Before using any flatview, sanity check we're not during a memory
+     * region transaction or the map can be invalid. Note that this can
+     * also be called during commit phase of memory transaction, but that
+     * should also only happen when the depth decreases to 0 first.
+     */
+    assert(memory_region_transaction_depth == 0);
     return qatomic_rcu_read(&as->current_map);
 }
 
diff --git a/softmmu/memory.c b/softmmu/memory.c
index bc0be3f62c..f177c40cd8 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -37,7 +37,6 @@
 //#define DEBUG_UNASSIGNED
 
-static unsigned memory_region_transaction_depth;
 static bool memory_region_update_pending;
 static bool ioeventfd_update_pending;
 unsigned int global_dirty_tracking;
-- 
2.20.1
[RFC v3 0/3] migration: reduce time of loading non-iterable vmstate
Hi!

In this version:
- move virtio_load_check_delay() from virtio_memory_listener_commit() to
  virtio_vmstate_change().
- add delay_check flag to VirtIODevice to make sure
  virtio_load_check_delay() will be called when delay_check is true.

Please review, Chuang.

[v2]
- rebase to latest upstream.
- add sanity check to address_space_to_flatview().
- postpone the init of the vring cache until migration's loading completes.

[v1]
The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (starting with the timestamp of source qemu stop and
ending with the timestamp of target qemu start). Most of the time is spent
committing memory region changes repeatedly.

This patch packs all the changes to memory region during the period of
loading non-iterable vmstate in a single memory transaction. With the
increase of devices, this patch will greatly improve the performance.

Here are the test results:
test vm info:
- 32 CPUs 128GB RAM
- 8 16-queue vhost-net device
- 16 4-queue vhost-user-blk device.

            time of loading non-iterable vmstate
before      about 210 ms
after       about 40 ms
Re: [RFC v2 2/3] virtio: support delay of checks in virtio_load()
On 2022/12/13 at 4:18 AM, Peter Xu wrote:
> On Tue, Dec 13, 2022 at 12:49:41AM +0800, Chuang Xu wrote:
>> +bool migration_enable_load_check_delay;
>
> I'm just afraid this is still too hacky. One thing is that this variable
> itself is only set at a specific phase during migration to cover that
> commit(). The other thing is I'm not sure we can always rely on the
> commit() happening 100% of the time - what if there are no memory layout
> changes throughout the whole process of vm load? That'll be skipped if
> memory_region_update_pending == false, as I said.

Yes, you're right. I wanted to set memory_region_update_pending to true at
the beginning of qemu_loadvm_state_main(), but somehow I forgot this
detail..

> So far the best I can come up with is we allow each virtio device to
> register a vm state change handler (during virtio_load) to do the rest,
> then in the handler it unregisters itself so it only runs once right
> before the VM starts. But I'm not sure whether the virtio developers will
> be happy with it. Maybe worth a try. Feel free to have a look at e.g.
> kvmvapic_vm_state_change() if you think that idea is worth exploring.

That's a good idea! But I don't think it's necessary to register a new vm
state change handler. Maybe we just need to add a delay_check flag to
VirtIODevice and do those delayed checks in virtio_vmstate_change() when
delay_check is true. Later I'll upload the v3 patches. Thanks!
Re: [RFC v2 0/3] migration: reduce time of loading non-iterable vmstate
On 2022/12/13 at 4:23 AM, Peter Xu wrote:
> On Tue, Dec 13, 2022 at 12:49:39AM +0800, Chuang Xu wrote:
>> Hi!
>
> Chuang,
>
>> In this version:
>> - rebase to latest upstream.
>> - add sanity check to address_space_to_flatview().
>> - postpone the init of the vring cache until migration's loading
>>   completes.
>
> Since there'll be other changes besides migration, please consider also
> copying the relevant maintainers of both memory and virtio on your next
> post:
>
>   $ ./scripts/get_maintainer.pl -f softmmu/memory.c -f hw/virtio/virtio.c
>   Paolo Bonzini (supporter:Memory API)
>   Peter Xu (supporter:Memory API)
>   David Hildenbrand (supporter:Memory API)
>   "Philippe Mathieu-Daudé" (reviewer:Memory API)
>   "Michael S. Tsirkin" (supporter:virtio)
>   qemu-devel@nongnu.org (open list:All patches CC here)

Sorry, I forgot to update the cc list.. Thanks for your reminder!
[RFC v2 0/3] migration: reduce time of loading non-iterable vmstate
Hi!

In this version:
- rebase to latest upstream.
- add sanity check to address_space_to_flatview().
- postpone the init of the vring cache until migration's loading completes.

Please review, Chuang.

[v1]
The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (starting with the timestamp of source qemu stop and
ending with the timestamp of target qemu start). Most of the time is spent
committing memory region changes repeatedly.

This patch packs all the changes to memory region during the period of
loading non-iterable vmstate in a single memory transaction. With the
increase of devices, this patch will greatly improve the performance.

Here are the test results:
test vm info:
- 32 CPUs 128GB RAM
- 8 16-queue vhost-net device
- 16 4-queue vhost-user-blk device.

            time of loading non-iterable vmstate
before      about 210 ms
after       about 40 ms
[RFC v2 3/3] migration: reduce time of loading non-iterable vmstate
The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory regions during the period of loading non-iterable vmstate into a single memory transaction. As the number of devices increases, this patch will greatly improve the performance.

Here are the test results:

test vm info:
- 32 CPUs 128GB RAM
- 8 16-queue vhost-net devices
- 16 4-queue vhost-user-blk devices

time of loading non-iterable vmstate:
  before: about 210 ms
  after:  about 40 ms

Signed-off-by: Chuang Xu
---
 migration/savevm.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index a0cdb714f7..68a7a99b79 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2617,6 +2617,9 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
     uint8_t section_type;
     int ret = 0;
 
+    /* call memory_region_transaction_begin() before loading vmstate */
+    memory_region_transaction_begin();
+
 retry:
     while (true) {
         section_type = qemu_get_byte(f);
@@ -2684,6 +2687,16 @@ out:
             goto retry;
         }
     }
+
+    /*
+     * call memory_region_transaction_commit() after loading non-iterable
+     * vmstate, make sure the migration_enable_load_check_delay flag is
+     * true during commit.
+     */
+    migration_enable_load_check_delay = true;
+    memory_region_transaction_commit();
+    migration_enable_load_check_delay = false;
+
     return ret;
 }
-- 
2.20.1
[RFC v2 1/3] memory: add depth assert in address_space_to_flatview
Before using any flatview, sanity check we're not during a memory region transaction or the map can be invalid.

Signed-off-by: Chuang Xu
---
 include/exec/memory.h | 9 +++++++++
 softmmu/memory.c      | 1 -
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 91f8a2395a..b43cd46084 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1069,8 +1069,17 @@ struct FlatView {
     MemoryRegion *root;
 };
 
+static unsigned memory_region_transaction_depth;
+
 static inline FlatView *address_space_to_flatview(AddressSpace *as)
 {
+    /*
+     * Before using any flatview, sanity check we're not during a memory
+     * region transaction or the map can be invalid. Note that this can
+     * also be called during commit phase of memory transaction, but that
+     * should also only happen when the depth decreases to 0 first.
+     */
+    assert(memory_region_transaction_depth == 0);
     return qatomic_rcu_read(&as->current_map);
 }
 
diff --git a/softmmu/memory.c b/softmmu/memory.c
index bc0be3f62c..f177c40cd8 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -37,7 +37,6 @@
 
 //#define DEBUG_UNASSIGNED
 
-static unsigned memory_region_transaction_depth;
 static bool memory_region_update_pending;
 static bool ioeventfd_update_pending;
 unsigned int global_dirty_tracking;
-- 
2.20.1
[RFC v2 2/3] virtio: support delay of checks in virtio_load()
Delay checks in virtio_load() to avoid a possible address_space_to_flatview() call during memory region's begin/commit.

Signed-off-by: Chuang Xu
---
 hw/virtio/virtio.c      | 33 +++++++++++++++++++++-----------
 include/sysemu/sysemu.h |  1 +
 softmmu/globals.c       |  3 +++
 3 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index eb6347ab5d..3e3fa2a89d 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -33,6 +33,7 @@
 #include "hw/virtio/virtio-access.h"
 #include "sysemu/dma.h"
 #include "sysemu/runstate.h"
+#include "sysemu/sysemu.h"
 #include "standard-headers/linux/virtio_ids.h"
 #include "standard-headers/linux/vhost_types.h"
 #include "standard-headers/linux/virtio_blk.h"
@@ -3642,8 +3643,20 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
         vdev->start_on_kick = true;
     }
 
+    if (vdc->post_load) {
+        ret = vdc->post_load(vdev);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+static void virtio_load_check_delay(VirtIODevice *vdev)
+{
     RCU_READ_LOCK_GUARD();
-    for (i = 0; i < num; i++) {
+    for (int i = 0; i < VIRTIO_QUEUE_MAX; i++) {
         if (vdev->vq[i].vring.desc) {
             uint16_t nheads;
@@ -3696,19 +3709,12 @@
                              i, vdev->vq[i].vring.num,
                              vdev->vq[i].last_avail_idx,
                              vdev->vq[i].used_idx);
-                return -1;
+                abort();
             }
         }
     }
 
-    if (vdc->post_load) {
-        ret = vdc->post_load(vdev);
-        if (ret) {
-            return ret;
-        }
-    }
-
-    return 0;
+    return;
 }
 
 void virtio_cleanup(VirtIODevice *vdev)
@@ -4158,7 +4164,12 @@ static void virtio_memory_listener_commit(MemoryListener *listener)
         if (vdev->vq[i].vring.num == 0) {
             break;
         }
-        virtio_init_region_cache(vdev, i);
+
+        if (migration_enable_load_check_delay) {
+            virtio_load_check_delay(vdev);
+        } else {
+            virtio_init_region_cache(vdev, i);
+        }
     }
 }
 
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 6a7a31e64d..0523091445 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -12,6 +12,7 @@ extern int only_migratable;
 extern const char *qemu_name;
 extern QemuUUID qemu_uuid;
 extern bool qemu_uuid_set;
+extern bool migration_enable_load_check_delay;
 
 const char *qemu_get_vm_name(void);
 
diff --git a/softmmu/globals.c b/softmmu/globals.c
index 527edbefdd..1bd8f6c978 100644
--- a/softmmu/globals.c
+++ b/softmmu/globals.c
@@ -65,3 +65,6 @@ bool qemu_uuid_set;
 uint32_t xen_domid;
 enum xen_mode xen_mode = XEN_EMULATE;
 bool xen_domid_restrict;
+
+bool migration_enable_load_check_delay;
+
-- 
2.20.1
Re: [External] Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
Hi, Peter! This email is a supplement to my previous email 7 hours ago. On 2022/12/9 12:00 AM, Peter Xu wrote: If the assert will work that'll be even better. I'm actually worried this can trigger like what you mentioned in the virtio path; I didn't expect it to come that soon. So if there's a minimum of cases and we can fix them up easily that'll be great. Hopefully there aren't so many, or we'll need to revisit the whole idea. Thanks, Only delaying virtio_init_region_cache() will result in the failure of the checks and caches following the original virtio_init_region_cache(). Here are the patches related to these checks and cache operations:
https://gitlab.com/qemu-project/qemu/-/commit/1abeb5a65d515f8a8a9cfc4a82342f731bd9321f
https://gitlab.com/qemu-project/qemu/-/commit/be1fea9bc286f64c6c995bb0d7145a0b738aeddb
https://gitlab.com/qemu-project/qemu/-/commit/b796fcd1bf2978aed15748db04e054f34789e9eb
https://gitlab.com/qemu-project/qemu/-/commit/bccdef6b1a204db0f41ffb6e24ce373e4d7890d4
I think I should try to postpone these checks and caches too.. Thanks!
Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
On 2022/12/9 12:00 AM, Peter Xu wrote: On Thu, Dec 08, 2022 at 10:39:11PM +0800, Chuang Xu wrote: On 2022/12/8 6:08 AM, Peter Xu wrote: On Thu, Dec 08, 2022 at 12:07:03AM +0800, Chuang Xu wrote: On 2022/12/6 12:28 AM, Peter Xu wrote: Chuang, No worry on the delay; you're faster than when I read yours. :) On Mon, Dec 05, 2022 at 02:56:15PM +0800, Chuang Xu wrote: As a start, maybe you can try to poison address_space_to_flatview() (by e.g. checking the start_pack_mr_change flag and asserting it is not set) during this process to see whether any call stack can even try to dereference a flatview. It's just that I didn't figure out a good way to "prove" its validity, even if I think this is an interesting idea worth thinking about to shrink the downtime. Thanks for your suggestions! I used a thread-local variable to identify whether the current thread is a migration thread (the main thread of target qemu) and I modified the code of qemu_coroutine_switch to make sure the thread-local variable is true only in the process_incoming_migration_co call stack. If the target qemu detects that start_pack_mr_change is set and address_space_to_flatview() is called in a non-migrating thread or non-migrating coroutine, it will crash. Are you using the thread var just to avoid the assert triggering in the migration thread when committing memory changes? I think _maybe_ another cleaner way to sanity check this is directly upon the depth: static inline FlatView *address_space_to_flatview(AddressSpace *as) { /* * Before using any flatview, sanity check we're not during a memory * region transaction or the map can be invalid. Note that this can * also be called during commit phase of memory transaction, but that * should also only happen when the depth decreases to 0 first. */ assert(memory_region_transaction_depth == 0); return qatomic_rcu_read(&as->current_map); } That should also cover the safe cases of memory transaction commits during migration. Peter, I tried this way and found that the target qemu will crash.
Here is the gdb backtrace: #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 #1 0x7ff2929d851a in __GI_abort () at abort.c:118 #2 0x7ff2929cfe67 in __assert_fail_base (fmt=, assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h", line=line@entry=766, function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:92 #3 0x7ff2929cff12 in __GI___assert_fail (assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h", line=line@entry=766, function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:101 #4 0x55a324b2ed5e in address_space_to_flatview (as=0x55a326132580 ) at /data00/migration/qemu-5.2.0/include/exec/memory.h:766 #5 0x55a324e79559 in address_space_to_flatview (as=0x55a326132580 ) at ../softmmu/memory.c:811 #6 address_space_get_flatview (as=0x55a326132580 ) at ../softmmu/memory.c:805 #7 0x55a324e96474 in address_space_cache_init (cache=cache@entry=0x55a32a4fb000, as=, addr=addr@entry=68404985856, len=len@entry=4096, is_write=false) at ../softmmu/physmem.c:3307 #8 0x55a324ea9cba in virtio_init_region_cache (vdev=0x55a32985d9a0, n=0) at ../hw/virtio/virtio.c:185 #9 0x55a324eaa615 in virtio_load (vdev=0x55a32985d9a0, f=, version_id=) at ../hw/virtio/virtio.c:3203 #10 0x55a324c6ab96 in vmstate_load_state (f=f@entry=0x55a329dc0c00, vmsd=0x55a325fc1a60 , opaque=0x55a32985d9a0, version_id=1) at ../migration/vmstate.c:143 #11 0x55a324cda138 in vmstate_load (f=0x55a329dc0c00, se=0x55a329941c90) at ../migration/savevm.c:913 #12 0x55a324cdda34 in qemu_loadvm_section_start_full (mis=0x55a3284ef9e0, f=0x55a329dc0c00) at ../migration/savevm.c:2741 #13 qemu_loadvm_state_main (f=f@entry=0x55a329dc0c00, mis=mis@entry=0x55a3284ef9e0) at 
../migration/savevm.c:2939 #14 0x55a324cdf66a in qemu_loadvm_state (f=0x55a329dc0c00) at ../migration/savevm.c:3021 #15 0x55a324d14b4e in process_incoming_migration_co (opaque=) at ../migration/migration.c:574 #16 0x55a32501ae3b in coroutine_trampoline (i0=, i1=) at ../util/coroutine-ucontext.c:173 #17 0x7ff2929e8000 in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #18 0x7ffed80dc2a0 in ?? () #19 0x in ?? () address_space_cache_init() is the only caller of address_space_to_flatview I can find in vmstate_load call stack so far. Although I think the mr used by address_space_cache_init() won't be affected by the delay of memory_region_transaction_commit(), we really need a mechanism to prevent the modified mr from being used. Maybe we can build a stale
Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
On 2022/12/8 6:08 AM, Peter Xu wrote: On Thu, Dec 08, 2022 at 12:07:03AM +0800, Chuang Xu wrote: On 2022/12/6 12:28 AM, Peter Xu wrote: Chuang, No worry on the delay; you're faster than when I read yours. :) On Mon, Dec 05, 2022 at 02:56:15PM +0800, Chuang Xu wrote: As a start, maybe you can try to poison address_space_to_flatview() (by e.g. checking the start_pack_mr_change flag and asserting it is not set) during this process to see whether any call stack can even try to dereference a flatview. It's just that I didn't figure out a good way to "prove" its validity, even if I think this is an interesting idea worth thinking about to shrink the downtime. Thanks for your suggestions! I used a thread-local variable to identify whether the current thread is a migration thread (the main thread of target qemu) and I modified the code of qemu_coroutine_switch to make sure the thread-local variable is true only in the process_incoming_migration_co call stack. If the target qemu detects that start_pack_mr_change is set and address_space_to_flatview() is called in a non-migrating thread or non-migrating coroutine, it will crash. Are you using the thread var just to avoid the assert triggering in the migration thread when committing memory changes? I think _maybe_ another cleaner way to sanity check this is directly upon the depth: static inline FlatView *address_space_to_flatview(AddressSpace *as) { /* * Before using any flatview, sanity check we're not during a memory * region transaction or the map can be invalid. Note that this can * also be called during commit phase of memory transaction, but that * should also only happen when the depth decreases to 0 first. */ assert(memory_region_transaction_depth == 0); return qatomic_rcu_read(&as->current_map); } That should also cover the safe cases of memory transaction commits during migration. Peter, I tried this way and found that the target qemu will crash.
Here is the gdb backtrace: #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 #1 0x7ff2929d851a in __GI_abort () at abort.c:118 #2 0x7ff2929cfe67 in __assert_fail_base (fmt=, assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h", line=line@entry=766, function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:92 #3 0x7ff2929cff12 in __GI___assert_fail (assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h", line=line@entry=766, function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:101 #4 0x55a324b2ed5e in address_space_to_flatview (as=0x55a326132580 ) at /data00/migration/qemu-5.2.0/include/exec/memory.h:766 #5 0x55a324e79559 in address_space_to_flatview (as=0x55a326132580 ) at ../softmmu/memory.c:811 #6 address_space_get_flatview (as=0x55a326132580 ) at ../softmmu/memory.c:805 #7 0x55a324e96474 in address_space_cache_init (cache=cache@entry=0x55a32a4fb000, as=, addr=addr@entry=68404985856, len=len@entry=4096, is_write=false) at ../softmmu/physmem.c:3307 #8 0x55a324ea9cba in virtio_init_region_cache (vdev=0x55a32985d9a0, n=0) at ../hw/virtio/virtio.c:185 #9 0x55a324eaa615 in virtio_load (vdev=0x55a32985d9a0, f=, version_id=) at ../hw/virtio/virtio.c:3203 #10 0x55a324c6ab96 in vmstate_load_state (f=f@entry=0x55a329dc0c00, vmsd=0x55a325fc1a60 , opaque=0x55a32985d9a0, version_id=1) at ../migration/vmstate.c:143 #11 0x55a324cda138 in vmstate_load (f=0x55a329dc0c00, se=0x55a329941c90) at ../migration/savevm.c:913 #12 0x55a324cdda34 in qemu_loadvm_section_start_full (mis=0x55a3284ef9e0, f=0x55a329dc0c00) at ../migration/savevm.c:2741 #13 qemu_loadvm_state_main (f=f@entry=0x55a329dc0c00, mis=mis@entry=0x55a3284ef9e0) at 
../migration/savevm.c:2939 #14 0x55a324cdf66a in qemu_loadvm_state (f=0x55a329dc0c00) at ../migration/savevm.c:3021 #15 0x55a324d14b4e in process_incoming_migration_co (opaque=) at ../migration/migration.c:574 #16 0x55a32501ae3b in coroutine_trampoline (i0=, i1=) at ../util/coroutine-ucontext.c:173 #17 0x7ff2929e8000 in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #18 0x7ffed80dc2a0 in ?? () #19 0x in ?? () address_space_cache_init() is the only caller of address_space_to_flatview I can find in vmstate_load call stack so far. Although I think the mr used by address_space_cache_init() won't be affected by the delay of memory_region_transaction_commit(), we really need a mechanism to prevent the modified mr from being used. Maybe we can build a stale list: If a subregion is added, add its parent to the stale list(consider
Re: [External] Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
On 2022/12/6 12:28 AM, Peter Xu wrote: Chuang, No worry on the delay; you're faster than when I read yours. :) On Mon, Dec 05, 2022 at 02:56:15PM +0800, Chuang Xu wrote: As a start, maybe you can try to poison address_space_to_flatview() (by e.g. checking the start_pack_mr_change flag and asserting it is not set) during this process to see whether any call stack can even try to dereference a flatview. It's just that I didn't figure out a good way to "prove" its validity, even if I think this is an interesting idea worth thinking about to shrink the downtime. Thanks for your suggestions! I used a thread-local variable to identify whether the current thread is a migration thread (the main thread of target qemu) and I modified the code of qemu_coroutine_switch to make sure the thread-local variable is true only in the process_incoming_migration_co call stack. If the target qemu detects that start_pack_mr_change is set and address_space_to_flatview() is called in a non-migrating thread or non-migrating coroutine, it will crash. Are you using the thread var just to avoid the assert triggering in the migration thread when committing memory changes? I think _maybe_ another cleaner way to sanity check this is directly upon the depth: static inline FlatView *address_space_to_flatview(AddressSpace *as) { /* * Before using any flatview, sanity check we're not during a memory * region transaction or the map can be invalid. Note that this can * also be called during commit phase of memory transaction, but that * should also only happen when the depth decreases to 0 first. */ assert(memory_region_transaction_depth == 0); return qatomic_rcu_read(&as->current_map); } That should also cover the safe cases of memory transaction commits during migration. Peter, I tried this way and found that the target qemu will crash.
Here is the gdb backtrace: #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 #1 0x7ff2929d851a in __GI_abort () at abort.c:118 #2 0x7ff2929cfe67 in __assert_fail_base (fmt=, assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h", line=line@entry=766, function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:92 #3 0x7ff2929cff12 in __GI___assert_fail (assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h", line=line@entry=766, function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:101 #4 0x55a324b2ed5e in address_space_to_flatview (as=0x55a326132580 ) at /data00/migration/qemu-5.2.0/include/exec/memory.h:766 #5 0x55a324e79559 in address_space_to_flatview (as=0x55a326132580 ) at ../softmmu/memory.c:811 #6 address_space_get_flatview (as=0x55a326132580 ) at ../softmmu/memory.c:805 #7 0x55a324e96474 in address_space_cache_init (cache=cache@entry=0x55a32a4fb000, as=, addr=addr@entry=68404985856, len=len@entry=4096, is_write=false) at ../softmmu/physmem.c:3307 #8 0x55a324ea9cba in virtio_init_region_cache (vdev=0x55a32985d9a0, n=0) at ../hw/virtio/virtio.c:185 #9 0x55a324eaa615 in virtio_load (vdev=0x55a32985d9a0, f=, version_id=) at ../hw/virtio/virtio.c:3203 #10 0x55a324c6ab96 in vmstate_load_state (f=f@entry=0x55a329dc0c00, vmsd=0x55a325fc1a60 , opaque=0x55a32985d9a0, version_id=1) at ../migration/vmstate.c:143 #11 0x55a324cda138 in vmstate_load (f=0x55a329dc0c00, se=0x55a329941c90) at ../migration/savevm.c:913 #12 0x55a324cdda34 in qemu_loadvm_section_start_full (mis=0x55a3284ef9e0, f=0x55a329dc0c00) at ../migration/savevm.c:2741 #13 qemu_loadvm_state_main (f=f@entry=0x55a329dc0c00, mis=mis@entry=0x55a3284ef9e0) at 
../migration/savevm.c:2939 #14 0x55a324cdf66a in qemu_loadvm_state (f=0x55a329dc0c00) at ../migration/savevm.c:3021 #15 0x55a324d14b4e in process_incoming_migration_co (opaque=) at ../migration/migration.c:574 #16 0x55a32501ae3b in coroutine_trampoline (i0=, i1=) at ../util/coroutine-ucontext.c:173 #17 0x7ff2929e8000 in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #18 0x7ffed80dc2a0 in ?? () #19 0x in ?? () address_space_cache_init() is the only caller of address_space_to_flatview I can find in vmstate_load call stack so far. Although I think the mr used by address_space_cache_init() won't be affected by the delay of memory_region_transaction_commit(), we really need a mechanism to prevent the modified mr from being used. Maybe we can build a stale list: If a subregion is added, add its parent to the stale list(considering that new subregion's priority has uncertain effects on flatviews). If a subregion is deleted, add itself to
Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
Peter, I'm sorry I didn't reply to your email in time, because I was busy with other work last week. Here is my latest progress. On 2022/11/29 1:41 AM, Peter Xu wrote: On Mon, Nov 28, 2022 at 05:42:43PM +0800, Chuang Xu wrote: On 2022/11/25 12:40 AM, Peter Xu wrote: On Fri, Nov 18, 2022 at 04:36:48PM +0800, Chuang Xu wrote: The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory regions during the period of loading non-iterable vmstate into a single memory transaction. As the number of devices increases, this patch will greatly improve the performance. Here are the test results: test vm info: - 32 CPUs 128GB RAM - 8 16-queue vhost-net devices - 16 4-queue vhost-user-blk devices. time of loading non-iterable vmstate: before: about 210 ms; after: about 40 ms. Signed-off-by: Chuang Xu This is an interesting idea.. I think it means at least the address space operations will all be messed up if happening during the precopy loading Sorry, I don't quite understand the meaning of "messed up" here.. Maybe I need more information about how the address space operations will be messed up. AFAIK the major thing we do during commit of memory regions is to apply the memory region changes to the rest (flatviews, or ioeventfds); basically it makes everything match the new memory region layout. If we allow memory region commit to be postponed for the whole loading process, it means at least from the flat view pov any further things like: address_space_write(&address_space_memory, ...) could write to wrong places because the flat views are not updated. I have tested migration on normal qemu and optimized qemu repeatedly, and I haven't traced any other operation on target qemu's mr (such as address_space_write...) happening so far.
progress, but I don't directly see its happening either. For example, in most post_load()s of vmsd I think the devices should just write directly to its buffers, accessing MRs directly, even if they want DMAs or just update fields to correct states. Even so, I'm not super confident that holds And I'm not sure whether the "its happening" means "begin/commit happening" or "messed up happening"? If it's the former, Here are what I observe: the stage of loading iterable vmstate doesn't call begin/commit, but the stage of loading noniterable vmstate calls a large amount of begin/commit in field->info->get() operation. For example: #0 memory_region_transaction_commit () at ../softmmu/memory.c:1085 #1 0x559b6f683523 in pci_update_mappings (d=d@entry=0x7f5cd8682010) at ../hw/pci/pci.c:1361 #2 0x559b6f683a1f in get_pci_config_device (f=, pv=0x7f5cd86820a0, size=256, field=) at ../hw/pci/pci.c:545 #3 0x559b6f5fcd86 in vmstate_load_state (f=f@entry=0x559b757eb4b0, vmsd=vmsd@entry=0x559b70909d40 , opaque=opaque@entry=0x7f5cd8682010, version_id=2) at ../migration/vmstate.c:143 #4 0x559b6f68466f in pci_device_load (s=s@entry=0x7f5cd8682010, f=f@entry=0x559b757eb4b0) at ../hw/pci/pci.c:664 #5 0x559b6f6ad38a in virtio_pci_load_config (d=0x7f5cd8682010, f=0x559b757eb4b0) at ../hw/virtio/virtio-pci.c:181 #6 0x559b6f7dfe91 in virtio_load (vdev=0x7f5cd868a1a0, f=0x559b757eb4b0, version_id=1) at ../hw/virtio/virtio.c:3071 #7 0x559b6f5fcd86 in vmstate_load_state (f=f@entry=0x559b757eb4b0, vmsd=0x559b709ae260 , opaque=0x7f5cd868a1a0, version_id=1) at ../migration/vmstate.c:143 #8 0x559b6f62da48 in vmstate_load (f=0x559b757eb4b0, se=0x559b7591c010) at ../migration/savevm.c:913 #9 0x559b6f631334 in qemu_loadvm_section_start_full (mis=0x559b73f1a580, f=0x559b757eb4b0) at ../migration/savevm.c:2741 #10 qemu_loadvm_state_main (f=f@entry=0x559b757eb4b0, mis=mis@entry=0x559b73f1a580) at ../migration/savevm.c:2937 #11 0x559b6f632faa in qemu_loadvm_state (f=0x559b757eb4b0) at 
../migration/savevm.c:3018 #12 0x559b6f6d2ece in process_incoming_migration_co (opaque=) at ../migration/migration.c:574 #13 0x559b6f9f9f0b in coroutine_trampoline (i0=, i1=) at ../util/coroutine-ucontext.c:173 #14 0x7f5cfeecf000 in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #15 0x7fff04a2e8f0 in ?? () #16 0x in ?? () true, not to mention any other side effects (e.g., would we release bql during precopy for any reason?). Copy Paolo and PeterM for some extra eyes. What I observe is that during the loading process, migration thread will call Condwait to wait for the vcpu threads to complete tasks, such as kvm_apic_post_load, and rcu thread will acquire the bql to do the flatview_destroy operation. So far, I haven't seen the side effects of these two situations. Yes that's something I'd worry about. The current memory API should be
Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
On 2022/11/25 12:40 AM, Peter Xu wrote: On Fri, Nov 18, 2022 at 04:36:48PM +0800, Chuang Xu wrote: The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory regions during the period of loading non-iterable vmstate into a single memory transaction. As the number of devices increases, this patch will greatly improve the performance. Here are the test results: test vm info: - 32 CPUs 128GB RAM - 8 16-queue vhost-net devices - 16 4-queue vhost-user-blk devices. time of loading non-iterable vmstate: before: about 210 ms; after: about 40 ms. Signed-off-by: Chuang Xu This is an interesting idea.. I think it means at least the address space operations will all be messed up if happening during the precopy loading Sorry, I don't quite understand the meaning of "messed up" here.. Maybe I need more information about how the address space operations will be messed up. progress, but I don't directly see its happening either. For example, in most post_load()s of vmsd I think the devices should just write directly to their buffers, accessing MRs directly, even if they want DMAs or just update fields to correct states. Even so, I'm not super confident that holds And I'm not sure whether the "its happening" means "begin/commit happening" or "messed up happening"? If it's the former, here is what I observe: the stage of loading iterable vmstate doesn't call begin/commit, but the stage of loading non-iterable vmstate calls a large amount of begin/commit in field->info->get() operations.
For example: #0 memory_region_transaction_commit () at ../softmmu/memory.c:1085 #1 0x559b6f683523 in pci_update_mappings (d=d@entry=0x7f5cd8682010) at ../hw/pci/pci.c:1361 #2 0x559b6f683a1f in get_pci_config_device (f=, pv=0x7f5cd86820a0, size=256, field=) at ../hw/pci/pci.c:545 #3 0x559b6f5fcd86 in vmstate_load_state (f=f@entry=0x559b757eb4b0, vmsd=vmsd@entry=0x559b70909d40 , opaque=opaque@entry=0x7f5cd8682010, version_id=2) at ../migration/vmstate.c:143 #4 0x559b6f68466f in pci_device_load (s=s@entry=0x7f5cd8682010, f=f@entry=0x559b757eb4b0) at ../hw/pci/pci.c:664 #5 0x559b6f6ad38a in virtio_pci_load_config (d=0x7f5cd8682010, f=0x559b757eb4b0) at ../hw/virtio/virtio-pci.c:181 #6 0x559b6f7dfe91 in virtio_load (vdev=0x7f5cd868a1a0, f=0x559b757eb4b0, version_id=1) at ../hw/virtio/virtio.c:3071 #7 0x559b6f5fcd86 in vmstate_load_state (f=f@entry=0x559b757eb4b0, vmsd=0x559b709ae260 , opaque=0x7f5cd868a1a0, version_id=1) at ../migration/vmstate.c:143 #8 0x559b6f62da48 in vmstate_load (f=0x559b757eb4b0, se=0x559b7591c010) at ../migration/savevm.c:913 #9 0x559b6f631334 in qemu_loadvm_section_start_full (mis=0x559b73f1a580, f=0x559b757eb4b0) at ../migration/savevm.c:2741 #10 qemu_loadvm_state_main (f=f@entry=0x559b757eb4b0, mis=mis@entry=0x559b73f1a580) at ../migration/savevm.c:2937 #11 0x559b6f632faa in qemu_loadvm_state (f=0x559b757eb4b0) at ../migration/savevm.c:3018 #12 0x559b6f6d2ece in process_incoming_migration_co (opaque=) at ../migration/migration.c:574 #13 0x559b6f9f9f0b in coroutine_trampoline (i0=, i1=) at ../util/coroutine-ucontext.c:173 #14 0x7f5cfeecf000 in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #15 0x7fff04a2e8f0 in ?? () #16 0x in ?? () true, not to mention any other side effects (e.g., would we release bql during precopy for any reason?). Copy Paolo and PeterM for some extra eyes. 
What I observe is that during the loading process, the migration thread will call Condwait to wait for the vcpu threads to complete tasks, such as kvm_apic_post_load, and the rcu thread will acquire the bql to do the flatview_destroy operation. So far, I haven't seen the side effects of these two situations. --- migration/migration.c | 1 + migration/migration.h | 2 ++ migration/savevm.c | 8 ++++++++ 3 files changed, 11 insertions(+) diff --git a/migration/migration.c b/migration/migration.c index e6f8bc2478..ed20704552 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -224,6 +224,7 @@ void migration_object_init(void) qemu_sem_init(&current_incoming->postcopy_pause_sem_fast_load, 0); qemu_mutex_init(&current_incoming->page_request_mutex); current_incoming->page_requested = g_tree_new(page_request_addr_cmp); +current_incoming->start_pack_mr_change = false; migration_object_check(current_migration, &error_fatal); diff --git a/migration/migration.h b/migration/migration.h index 58b245b138..86597f5feb 100644 --- a/migration/migration.h +++ b/migration/migration.h @@ -186,6 +186,8 @@ struct MigrationIncomingState { * contains valid information. */ Qemu
[RFC PATCH] migration: reduce time of loading non-iterable vmstate
The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory regions during the period of loading non-iterable vmstate into a single memory transaction. As the number of devices increases, this patch will greatly improve the performance.

Here are the test results:

test vm info:
- 32 CPUs 128GB RAM
- 8 16-queue vhost-net devices
- 16 4-queue vhost-user-blk devices

time of loading non-iterable vmstate:
  before: about 210 ms
  after:  about 40 ms

Signed-off-by: Chuang Xu
---
 migration/migration.c | 1 +
 migration/migration.h | 2 ++
 migration/savevm.c    | 8 ++++++++
 3 files changed, 11 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index e6f8bc2478..ed20704552 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -224,6 +224,7 @@ void migration_object_init(void)
     qemu_sem_init(&current_incoming->postcopy_pause_sem_fast_load, 0);
     qemu_mutex_init(&current_incoming->page_request_mutex);
     current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
+    current_incoming->start_pack_mr_change = false;
 
     migration_object_check(current_migration, &error_fatal);
 
diff --git a/migration/migration.h b/migration/migration.h
index 58b245b138..86597f5feb 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -186,6 +186,8 @@ struct MigrationIncomingState {
      * contains valid information.
      */
     QemuMutex page_request_mutex;
+
+    bool start_pack_mr_change;
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 48e85c052c..a073009a74 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2630,6 +2630,12 @@ retry:
     switch (section_type) {
     case QEMU_VM_SECTION_START:
     case QEMU_VM_SECTION_FULL:
+        /* call memory_region_transaction_begin() before loading non-iterable vmstate */
+        if (section_type == QEMU_VM_SECTION_FULL && !mis->start_pack_mr_change) {
+            memory_region_transaction_begin();
+            mis->start_pack_mr_change = true;
+        }
+
         ret = qemu_loadvm_section_start_full(f, mis);
         if (ret < 0) {
             goto out;
@@ -2650,6 +2656,8 @@ retry:
         }
         break;
     case QEMU_VM_EOF:
+        /* call memory_region_transaction_commit() after loading non-iterable vmstate */
+        memory_region_transaction_commit();
         /* This is the end of migration */
         goto out;
     default:
-- 
2.20.1
Re: [PATCH v7 10/12] multifd: Support for zero pages transmission
On 2022/8/2 2:39 PM, Juan Quintela wrote:

This patch adds counters and similar. Logic will be added in the following patch.

Signed-off-by: Juan Quintela
---
Added counters for duplicated/non-duplicated pages.
Removed reviewed-by from David.
Add total_zero_pages.
---
 migration/multifd.h    | 17 ++++++++++++++++-
 migration/multifd.c    | 36 ++++++++++++++++++++++++++---------
 migration/ram.c        |  2 --
 migration/trace-events |  8 ++++----
 4 files changed, 49 insertions(+), 14 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index cd389d18d2..a1b852200d 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -47,7 +47,10 @@ typedef struct {
     /* size of the next packet that contains pages */
     uint32_t next_packet_size;
     uint64_t packet_num;
-    uint64_t unused[4];    /* Reserved for future use */
+    /* zero pages */
+    uint32_t zero_pages;
+    uint32_t unused32[1];    /* Reserved for future use */
+    uint64_t unused64[3];    /* Reserved for future use */
     char ramblock[256];
     uint64_t offset[];
 } __attribute__((packed)) MultiFDPacket_t;
@@ -127,6 +130,8 @@ typedef struct {
     uint64_t num_packets;
     /* non zero pages sent through this channel */
     uint64_t total_normal_pages;
+    /* zero pages sent through this channel */
+    uint64_t total_zero_pages;
     /* buffers to send */
     struct iovec *iov;
     /* number of iovs used */
@@ -135,6 +140,10 @@ typedef struct {
     ram_addr_t *normal;
     /* num of non zero pages */
     uint32_t normal_num;
+    /* Pages that are zero */
+    ram_addr_t *zero;
+    /* num of zero pages */
+    uint32_t zero_num;
     /* used for compression methods */
     void *data;
 } MultiFDSendParams;
@@ -184,12 +193,18 @@ typedef struct {
     uint8_t *host;
     /* non zero pages recv through this channel */
     uint64_t total_normal_pages;
+    /* zero pages recv through this channel */
+    uint64_t total_zero_pages;
     /* buffers to recv */
     struct iovec *iov;
     /* Pages that are not zero */
     ram_addr_t *normal;
     /* num of non zero pages */
     uint32_t normal_num;
+    /* Pages that are zero */
+    ram_addr_t *zero;
+    /* num of zero pages */
+    uint32_t zero_num;
     /* used for de-compression methods */
     void *data;
 } MultiFDRecvParams;
diff --git a/migration/multifd.c b/migration/multifd.c
index 68fc9f8e88..4473d9f834 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -263,6 +263,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
     packet->normal_pages = cpu_to_be32(p->normal_num);
     packet->next_packet_size = cpu_to_be32(p->next_packet_size);
     packet->packet_num = cpu_to_be64(p->packet_num);
+    packet->zero_pages = cpu_to_be32(p->zero_num);
 
     if (p->pages->block) {
         strncpy(packet->ramblock, p->pages->block->idstr, 256);
@@ -323,7 +324,15 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
     p->next_packet_size = be32_to_cpu(packet->next_packet_size);
     p->packet_num = be64_to_cpu(packet->packet_num);
 
-    if (p->normal_num == 0) {
+    p->zero_num = be32_to_cpu(packet->zero_pages);
+    if (p->zero_num > packet->pages_alloc - p->normal_num) {
+        error_setg(errp, "multifd: received packet "
+                   "with %u zero pages and expected maximum pages are %u",
+                   p->zero_num, packet->pages_alloc - p->normal_num);
+        return -1;
+    }
+
+    if (p->normal_num == 0 && p->zero_num == 0) {
         return 0;
     }
@@ -432,6 +441,8 @@ static int multifd_send_pages(QEMUFile *f)
     ram_counters.multifd_bytes += p->sent_bytes;
     qemu_file_acct_rate_limit(f, p->sent_bytes);
     p->sent_bytes = 0;
+    ram_counters.normal += p->normal_num;
+    ram_counters.duplicate += p->zero_num;
     qemu_mutex_unlock(&p->mutex);
     qemu_sem_post(&p->sem);
@@ -545,6 +556,8 @@ void multifd_save_cleanup(void)
         p->iov = NULL;
         g_free(p->normal);
         p->normal = NULL;
+        g_free(p->zero);
+        p->zero = NULL;
         multifd_send_state->ops->send_cleanup(p, &local_err);
         if (local_err) {
             migrate_set_error(migrate_get_current(), local_err);
@@ -666,6 +679,7 @@ static void *multifd_send_thread(void *opaque)
             qemu_mutex_unlock(&p->mutex);
 
             p->normal_num = 0;
+            p->zero_num = 0;
 
             if (use_zero_copy_send) {
                 p->iovs_num = 0;
@@ -687,8 +701,8 @@ static void *multifd_send_thread(void *opaque)
             }
             multifd_send_fill_packet(p);
-            trace_multifd_send(p->id, packet_num, p->normal_num, p->flags,
-                               p->next_packet_size);
+            trace_multifd_send(p->id, packet_num, p->normal_num, p->zero_num,
+                               p->flags, p->next_packet_size);
 
             if (use_zero_copy_send) {
                 /*
Re: [External] [PATCH v13 3/8] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX
On 2022/6/14 10:14 PM, Dr. David Alan Gilbert wrote:

I don't think we can tell which one of them triggered the error; so the only thing I can suggest is that we document the need for the optmem_max setting; I wonder how we get a better answer than 'a few 100KB'? I guess it's something like the number of packets in flight * sizeof(cmsghdr)?

Dave

Three cases with errno ENOBUFS are described in the official doc (https://www.kernel.org/doc/html/v5.12/networking/msg_zerocopy.html):
1. The socket option was not set
2. The socket exceeds its optmem limit
3. The user exceeds its ulimit on locked pages

For case 1, if the code logic is correct, this possibility can be ignored.

For case 2, I asked a kernel developer about the reason for "a few 100KB". He said the recommended value is meant to improve the performance of zero-copy send. If the NIC sends data more slowly than the data is generated, then even with optmem set to 100KB there is still some probability that sendmsg returns with errno ENOBUFS.

For case 3, if I do not set a max locked memory limit for QEMU, the max locked memory is unlimited. When I did set a max locked memory limit for QEMU, I found that once memory usage exceeds that limit, an OOM occurs. Does this mean that sendmsg cannot return with errno ENOBUFS at all when the user exceeds the ulimit on locked pages? If so, can we treat that errno as case 2?

I modified the code logic to call sendmsg again when the errno is ENOBUFS and set optmem back to the initial 20KB (echo 20480 > /proc/sys/net/core/optmem_max); now the multifd zero-copy migration goes well.
Here are the changes I made to the code:

Signed-off-by: chuang xu
---
 io/channel-socket.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/io/channel-socket.c b/io/channel-socket.c
index dc9c165de1..9267f55a1d 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -595,9 +595,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
 #ifdef QEMU_MSG_ZEROCOPY
         case ENOBUFS:
             if (sflags & MSG_ZEROCOPY) {
-                error_setg_errno(errp, errno,
-                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
-                return -1;
+                goto retry;
             }
             break;
 #endif
--

Dave, what's your take?

Best Regards,
chuang xu
Re: [External] [PATCH v13 3/8] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX
On 2022/5/13 2:28 PM, Leonardo Bras wrote:

@@ -557,15 +578,31 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
         memcpy(CMSG_DATA(cmsg), fds, fdsize);
     }
 
+#ifdef QEMU_MSG_ZEROCOPY
+    if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
+        sflags = MSG_ZEROCOPY;
+    }
+#endif
+
 retry:
-    ret = sendmsg(sioc->fd, &msg, 0);
+    ret = sendmsg(sioc->fd, &msg, sflags);
     if (ret <= 0) {
-        if (errno == EAGAIN) {
+        switch (errno) {
+        case EAGAIN:
             return QIO_CHANNEL_ERR_BLOCK;
-        }
-        if (errno == EINTR) {
+        case EINTR:
             goto retry;
+#ifdef QEMU_MSG_ZEROCOPY
+        case ENOBUFS:
+            if (sflags & MSG_ZEROCOPY) {
+                error_setg_errno(errp, errno,
+                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
+                return -1;
+            }
+            break;
+#endif
+        }
 
         error_setg_errno(errp, errno, "Unable to write to socket");
         return -1;

Hi, Leo. There are some other questions I would like to discuss with you.

I tested multifd zero-copy migration and found that sometimes, even when QEMU's max locked memory was set to 16GB (much greater than MULTIFD_PACKET_SIZE), the error "Process can't lock enough memory for using MSG_ZEROCOPY" was still reported.

I noticed that the doc (https://www.kernel.org/doc/html/v5.12/networking/msg_zerocopy.html) says: "A zerocopy failure will return -1 with errno ENOBUFS. This happens if the socket option was not set, the socket exceeds its optmem limit, or the user exceeds its ulimit on locked pages."

I also found that the RFC (https://lwn.net/Articles/715279/) says: "The change to allocate notification skbuffs from optmem requires ensuring that net.core.optmem is at least a few 100KB."

On my host, optmem was initially set to 20KB. I tried changing it to 100KB (echo 102400 > /proc/sys/net/core/optmem_max) as the RFC suggests. Then I tested multifd zero-copy migration repeatedly, and the error disappeared.

So when sendmsg returns -1 with errno ENOBUFS, should we distinguish between the error "socket exceeds its optmem limit" and the error "user exceeds its ulimit on locked pages"?
Or is there any better way to avoid this problem?

Best Regards,
chuang xu
Re: [External] [PATCH v13 3/8] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX
On 2022/6/8 1:24 PM, Leonardo Bras Soares Passos wrote:

I will send a fix shortly. Is that ok if I include a "Reported-by: 徐闯 " in the patch?

Okay.

Best Regards,
chuang xu