Re: [PATCH] mac_dbdma: Remove leftover `dma_memory_unmap` calls
On Tue, Sep 17, 2024 at 08:25:06AM +0200, Mattias Nissler wrote: > Mark, thanks for testing and confirming that this doesn't cause any > obvious breakage. > > For my curiosity, which path should this patch take to get into > master? Peter, are you going to respin your pull request with this > included? The pull has already landed in the master branch, so there'll be no respin. And considering this is a fix from the device side, it may make more sense for the maintainer to take it, which points to John Snow here. John, would you take it via your tree? Thanks, -- Peter Xu
Re: [PATCH 21/39] migration: replace assert(false) with g_assert_not_reached()
On Tue, Sep 10, 2024 at 03:15:48PM -0700, Pierrick Bouvier wrote: > Signed-off-by: Pierrick Bouvier Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH 08/39] migration: replace assert(0) with g_assert_not_reached()
On Tue, Sep 10, 2024 at 03:15:35PM -0700, Pierrick Bouvier wrote: > Signed-off-by: Pierrick Bouvier Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v2 01/19] qapi: Smarter camel_to_upper() to reduce need for 'prefix'
On Thu, Sep 05, 2024 at 04:52:36PM +0100, Daniel P. Berrangé wrote: > On Thu, Sep 05, 2024 at 07:59:13AM +0200, Markus Armbruster wrote: > > Daniel P. Berrangé writes: > > > > > On Wed, Sep 04, 2024 at 01:18:18PM +0200, Markus Armbruster wrote: > > >> camel_to_upper() converts its argument from camel case to upper case > > >> with '_' between words. Used for generated enumeration constant > > >> prefixes. > > > > > > > > >> > > >> Signed-off-by: Markus Armbruster > > >> Reviewed-by: Daniel P. Berrang?? > > > > > > The accent in my name is getting mangled in this series. > > > > Uh-oh! > > > > Checking... Hmm. It's correct in git, correct in output of > > git-format-patch, correct in the copy I got from git-send-email --bcc > > armbru via localhost MTA, and the copy I got from --to > > qemu-de...@nongnu.org, correct in lore.kernel.org[*], correct in an mbox > > downloaded from patchew. > > > > Could the culprit be on your side? > > I compared my received mail vs the mbox archive on nongnu.org for > qemu-devel. > > In both cases the actual mail body seems to be valid UTF-8 and is > identical. The message in the nongnu.org archive, however, has > > Content-Type: text/plain; charset=UTF-8 > > while the copy I got in my inbox has merely > > Content-Type: text/plain > > What I can't determine is whether your original sent message > had "charset=UTF-8" which then got stripped by redhat's incoming > mail server, or whether your original lacked 'charset=UTF8' and > it got added by mailman when saving the message to the mbox archives ? I didn't read into the details of what Markus hit, but I just remembered I hit similar things before and Dan reported a similar issue. At that time (as far as I can recall) it was because I used git-publish to send patches, which had an encoding issue. I tried to fix it with this branch: https://github.com/xzpeter/git-publish/commits/fix-print-2/ I also remember I tried to upstream that to Stefan's repo but I totally forgot what happened later; the result is I am still using this branch internally (I had completely forgotten which version I'm using... I only noticed when I saw this discussion and checked). Please ignore everything if git-publish is not used at all.. but just in case it's helpful.. -- Peter Xu
Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion
On Fri, Aug 30, 2024 at 03:11:38PM +0900, Akihiko Odaki wrote: > On 2024/08/30 4:48, Peter Xu wrote: > > On Thu, Aug 29, 2024 at 01:39:36PM +0900, Akihiko Odaki wrote: > > > > > I am calling the fact that embedded memory regions are accessible in > > > > > instance_finalize() "live". A device can perform operations on its > > > > > memory > > > > > regions during instance_finalize() and we should be aware of that. > > > > > > > > This part is true. I suppose we should still suggest device finalize() > > > > to > > > > properly detach MRs, and that should normally be done there. > > > > > > It is better to avoid manual resource deallocation in general because it > > > is > > > too error-prone. > > > > I had an impression that you mixed up "finalize()" and "free()" in the > > context of our discussion.. let us clarify this first before everything > > else below, just in case I overlook stuff.. > I am aware of that distinction. I dealt with it with patch "virtio-gpu: > Handle resource blob commands": > https://lore.kernel.org/r/20240822185110.1757429-12-dmitry.osipe...@collabora.com > > > > > MR is very special as an object, in that it should have no free() hook, > > hence by default nothing is going to be freed when mr->refcount==0. It > > means MRs need to be freed manually always. > > > > For example: > > > > (gdb) p system_memory->parent_obj->free > > $2 = (ObjectFree *) 0x0 > > > > It plays perfect because the majority of QEMU device model is using MR as a > > field (rather than a pointer) of another structure, so that's exactly what > > we're looking for: we don't want to free() the MR as it's allocated > > together with the owner device. That'll be released when the owner free(). > > > > When dynamic allocation gets into the picture for MR, it's more complicated > > for sure, because it means the user (like VFIOQuirk) will need to manually > > allocate the MRs, then it requires explicit object_unparent() to detach > > that from the device / owner when finalize(). NOTE! object_unparent() > > will NOT free the MR yet so far. The MR still need to be manually freed > > with an explicit g_free(), normally. Again, I'd suggest you refer to the > > VFIOQuirk code just as an example. In that case this part is done with > > e.g. vfio_bar_quirk_finalize(). > > > > for (i = 0; i < quirk->nr_mem; i++) { > > object_unparent(OBJECT(&quirk->mem[i])); > > } > > g_free(quirk->mem); > > > > Here quirk->mem is a pointer to an array of MR which can contain one or > > more MRs, but the idea is the same. > > > > > > > > I have an impression with QEMU code base that it is failing manual > > > resource > > > deallocation so frequently although such deallocation can be easily > > > automated by binding resources to objects and free them when objects die > > > by > > > providing a function like Linux's devm_kmalloc(). Unfortunately I haven't > > > found time to do that though. > > > > AFAICT, the property list is exactly what you're saying. IIUC normally an > > object will be properly finalized()ed + free()ed when the parent object is > > finalize()ed. Here MR is just special as it bypasses all the free() part. > > > > > > > > So my opinion here is 1) we should automate resource deallocation as much > > > as > > > possible but 2) we shouldn't disturb code that performs manual resource > > > management. > > > > Agree. However note again that in whatever case cross-device MR links will > > still require explicit detachment or it's prone to memory leak. 
> > I am not sure what you refer to with cross-device MR links so can you > clarify it? I was referring to Peter Maydell's example, where the MR can be used outside of its owner. In that case manual operation is a must before finalize(), as finalize() can only make sense to resolve internal links automatically. But now knowing that you're explicitly mentioning "deallocation" rather than "finalize()", and that you're aware of the difference between deallocation vs. finalize(), I suppose I misunderstood what you meant, and now I'm not sure I get what you're suggesting. > > > > > > > > > > instance_finalize() is for manual resource management. It is better to > >
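A minimal sketch of the dynamically-allocated MR lifecycle described above, assuming hypothetical MyDeviceState / MY_DEVICE / my_device_* names; only memory_region_init(), object_unparent() and g_free() are the real APIs being illustrated, mirroring the VFIOQuirk pattern:

#include "qemu/osdep.h"
#include "hw/qdev-core.h"
#include "exec/memory.h"

typedef struct MyDeviceState {
    DeviceState parent_obj;
    MemoryRegion *dyn_mr;    /* dynamically allocated, unlike embedded MRs */
} MyDeviceState;

static void my_device_realize(DeviceState *dev, Error **errp)
{
    MyDeviceState *s = MY_DEVICE(dev);

    s->dyn_mr = g_new0(MemoryRegion, 1);
    /* The device becomes the owner; the MR is attached as a child property. */
    memory_region_init(s->dyn_mr, OBJECT(dev), "my-dyn-mr", 0x1000);
}

static void my_device_finalize(Object *obj)
{
    MyDeviceState *s = MY_DEVICE(obj);

    /* Detach the MR from the owner's property list; this does NOT free it... */
    object_unparent(OBJECT(s->dyn_mr));
    /* ...so, like VFIOQuirk's quirk->mem, it still has to be freed manually. */
    g_free(s->dyn_mr);
}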
Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion
object_unparent() in > instance_finalize() is fine is wrong because the memory region is already > finalized and we shouldn't call object_unparent() for a finalized object. > > This also means a device does not have a control of its memory regions > during finalization and the device will not see the side effect of calling > memory_region_del_subregion(). However, if a device really needs to control > the order of memory region finalization, it can still call object_ref() for > memory regions to keep them alive until instance_finalize(). The effect of > memory_region_del_subregion() will be visible to the device if a device > does such a trick. While it is very unlikely that any device does such a > trick, it is still better to avoid having any assumption on devices. > > With the proposed object_ref(mr->owner == subregion->owner ? > OBJECT(subregion) : subregion->owner) call, we don't have to make any > assumption on how finalization works, which is too complicated as this > discussion shows. It certainly makes the use of two reference counters, but > it does not deviate from the current policy on their usage and it is also > trivial to adapt when the reference counter in memory region gets removed. > Considering all above, I believe my patch is the most straightforward > solution. Let's see how you think after you read the above first. I hope we can get more onto the same page, at least on the context. So far I still prefer using mr->refcount as little as possible. Now it plays the only role of "whether this MR is put onto an owner property list", and all the rest of the refcounts on an MR should always route to the owner. I still think we need to be extremely cautious about further complicating this refcount. It's already complicated indeed, but hopefully this model is the best we can trivially have right now, and so far it's clear to me, but it's always possible I overlooked something. Thanks, -- Peter Xu
Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion
On Wed, Aug 28, 2024 at 11:02:06PM +0900, Akihiko Odaki wrote: > On 2024/08/28 22:09, Peter Xu wrote: > > On Wed, Aug 28, 2024 at 02:33:59PM +0900, Akihiko Odaki wrote: > > > On 2024/08/28 1:11, Peter Xu wrote: > > > > On Tue, Aug 27, 2024 at 01:14:51PM +0900, Akihiko Odaki wrote: > > > > > On 2024/08/27 4:42, Peter Xu wrote: > > > > > > On Mon, Aug 26, 2024 at 06:10:25PM +0100, Peter Maydell wrote: > > > > > > > On Mon, 26 Aug 2024 at 16:22, Peter Xu wrote: > > > > > > > > > > > > > > > > On Fri, Aug 23, 2024 at 03:13:11PM +0900, Akihiko Odaki wrote: > > > > > > > > > memory_region_update_container_subregions() used to call > > > > > > > > > memory_region_ref(), which creates a reference to the owner > > > > > > > > > of the > > > > > > > > > subregion, on behalf of the owner of the container. This > > > > > > > > > results in a > > > > > > > > > circular reference if the subregion and container have the > > > > > > > > > same owner. > > > > > > > > > > > > > > > > > > memory_region_ref() creates a reference to the owner instead > > > > > > > > > of the > > > > > > > > > memory region to match the lifetime of the owner and memory > > > > > > > > > region. We > > > > > > > > > do not need such a hack if the subregion and container have > > > > > > > > > the same > > > > > > > > > owner because the owner will be alive as long as the > > > > > > > > > container is. > > > > > > > > > Therefore, create a reference to the subregion itself instead > > > > > > > > > ot its > > > > > > > > > owner in such a case; the reference to the subregion is still > > > > > > > > > necessary > > > > > > > > > to ensure that the subregion gets finalized after the > > > > > > > > > container. > > > > > > > > > > > > > > > > > > Signed-off-by: Akihiko Odaki > > > > > > > > > --- > > > > > > > > > system/memory.c | 8 ++-- > > > > > > > > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/system/memory.c b/system/memory.c > > > > > > > > > index 5e6eb459d5de..e4d3e9d1f427 100644 > > > > > > > > > --- a/system/memory.c > > > > > > > > > +++ b/system/memory.c > > > > > > > > > @@ -2612,7 +2612,9 @@ static void > > > > > > > > > memory_region_update_container_subregions(MemoryRegion > > > > > > > > > *subregion) > > > > > > > > > > > > > > > > > > memory_region_transaction_begin(); > > > > > > > > > > > > > > > > > > -memory_region_ref(subregion); > > > > > > > > > +object_ref(mr->owner == subregion->owner ? > > > > > > > > > + OBJECT(subregion) : subregion->owner); > > > > > > > > > > > > > > > > The only place that mr->refcount is used so far is the owner > > > > > > > > with the > > > > > > > > object property attached to the mr, am I right (ignoring > > > > > > > > name-less MRs)? > > > > > > > > > > > > > > > > I worry this will further complicate refcounting, now we're > > > > > > > > actively using > > > > > > > > two refcounts for MRs.. > > > > > > > > > > The actor of object_ref() is the owner of the memory region also in > > > > > this > > > > > case. We are calling object_ref() on behalf of mr->owner so we use > > > > > mr->refcount iff mr->owner == subregion->owner. In this sense there > > > > > is only > > > > > one user of mr->refcount even after this change. > > > > > > > > Yes it's still one user, b
Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion
On Wed, Aug 28, 2024 at 02:33:59PM +0900, Akihiko Odaki wrote: > On 2024/08/28 1:11, Peter Xu wrote: > > On Tue, Aug 27, 2024 at 01:14:51PM +0900, Akihiko Odaki wrote: > > > On 2024/08/27 4:42, Peter Xu wrote: > > > > On Mon, Aug 26, 2024 at 06:10:25PM +0100, Peter Maydell wrote: > > > > > On Mon, 26 Aug 2024 at 16:22, Peter Xu wrote: > > > > > > > > > > > > On Fri, Aug 23, 2024 at 03:13:11PM +0900, Akihiko Odaki wrote: > > > > > > > memory_region_update_container_subregions() used to call > > > > > > > memory_region_ref(), which creates a reference to the owner of the > > > > > > > subregion, on behalf of the owner of the container. This results > > > > > > > in a > > > > > > > circular reference if the subregion and container have the same > > > > > > > owner. > > > > > > > > > > > > > > memory_region_ref() creates a reference to the owner instead of > > > > > > > the > > > > > > > memory region to match the lifetime of the owner and memory > > > > > > > region. We > > > > > > > do not need such a hack if the subregion and container have the > > > > > > > same > > > > > > > owner because the owner will be alive as long as the container is. > > > > > > > Therefore, create a reference to the subregion itself instead ot > > > > > > > its > > > > > > > owner in such a case; the reference to the subregion is still > > > > > > > necessary > > > > > > > to ensure that the subregion gets finalized after the container. > > > > > > > > > > > > > > Signed-off-by: Akihiko Odaki > > > > > > > --- > > > > > > >system/memory.c | 8 ++-- > > > > > > >1 file changed, 6 insertions(+), 2 deletions(-) > > > > > > > > > > > > > > diff --git a/system/memory.c b/system/memory.c > > > > > > > index 5e6eb459d5de..e4d3e9d1f427 100644 > > > > > > > --- a/system/memory.c > > > > > > > +++ b/system/memory.c > > > > > > > @@ -2612,7 +2612,9 @@ static void > > > > > > > memory_region_update_container_subregions(MemoryRegion *subregion) > > > > > > > > > > > > > >memory_region_transaction_begin(); > > > > > > > > > > > > > > -memory_region_ref(subregion); > > > > > > > +object_ref(mr->owner == subregion->owner ? > > > > > > > + OBJECT(subregion) : subregion->owner); > > > > > > > > > > > > The only place that mr->refcount is used so far is the owner with > > > > > > the > > > > > > object property attached to the mr, am I right (ignoring name-less > > > > > > MRs)? > > > > > > > > > > > > I worry this will further complicate refcounting, now we're > > > > > > actively using > > > > > > two refcounts for MRs.. > > > > > > The actor of object_ref() is the owner of the memory region also in this > > > case. We are calling object_ref() on behalf of mr->owner so we use > > > mr->refcount iff mr->owner == subregion->owner. In this sense there is > > > only > > > one user of mr->refcount even after this change. > > > > Yes it's still one user, but it's not that straightforward to see, also > > it's still an extension to how we use mr->refcount right now. Currently > > it's about "true / false" just to describe, now it's a real counter. > > > > I wished that counter doesn't even exist if we'd like to stick with device > > / owner's counter. Adding this can definitely also make further effort > > harder if we want to remove mr->refcount. > > I don't think it will make removing mr->refcount harder. With this change, > mr->refcount will count the parent and container. 
If we remove mr->refcount, > we need to trigger object_finalize() in a way other than checking > mr->refcount, which can be achieved by simply evaluating OBJECT(mr)->parent > && mr->container.
Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion
On Tue, Aug 27, 2024 at 01:14:51PM +0900, Akihiko Odaki wrote: > On 2024/08/27 4:42, Peter Xu wrote: > > On Mon, Aug 26, 2024 at 06:10:25PM +0100, Peter Maydell wrote: > > > On Mon, 26 Aug 2024 at 16:22, Peter Xu wrote: > > > > > > > > On Fri, Aug 23, 2024 at 03:13:11PM +0900, Akihiko Odaki wrote: > > > > > memory_region_update_container_subregions() used to call > > > > > memory_region_ref(), which creates a reference to the owner of the > > > > > subregion, on behalf of the owner of the container. This results in a > > > > > circular reference if the subregion and container have the same owner. > > > > > > > > > > memory_region_ref() creates a reference to the owner instead of the > > > > > memory region to match the lifetime of the owner and memory region. We > > > > > do not need such a hack if the subregion and container have the same > > > > > owner because the owner will be alive as long as the container is. > > > > > Therefore, create a reference to the subregion itself instead ot its > > > > > owner in such a case; the reference to the subregion is still > > > > > necessary > > > > > to ensure that the subregion gets finalized after the container. > > > > > > > > > > Signed-off-by: Akihiko Odaki > > > > > --- > > > > > system/memory.c | 8 ++-- > > > > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/system/memory.c b/system/memory.c > > > > > index 5e6eb459d5de..e4d3e9d1f427 100644 > > > > > --- a/system/memory.c > > > > > +++ b/system/memory.c > > > > > @@ -2612,7 +2612,9 @@ static void > > > > > memory_region_update_container_subregions(MemoryRegion *subregion) > > > > > > > > > > memory_region_transaction_begin(); > > > > > > > > > > -memory_region_ref(subregion); > > > > > +object_ref(mr->owner == subregion->owner ? > > > > > + OBJECT(subregion) : subregion->owner); > > > > > > > > The only place that mr->refcount is used so far is the owner with the > > > > object property attached to the mr, am I right (ignoring name-less MRs)? > > > > > > > > I worry this will further complicate refcounting, now we're actively > > > > using > > > > two refcounts for MRs.. > > The actor of object_ref() is the owner of the memory region also in this > case. We are calling object_ref() on behalf of mr->owner so we use > mr->refcount iff mr->owner == subregion->owner. In this sense there is only > one user of mr->refcount even after this change. Yes it's still one user, but it's not that straightforward to see, also it's still an extension to how we use mr->refcount right now. Currently it's about "true / false" just to describe, now it's a real counter. I wished that counter doesn't even exist if we'd like to stick with device / owner's counter. Adding this can definitely also make further effort harder if we want to remove mr->refcount. > > > > > > > > > Continue discussion there: > > > > > > > > https://lore.kernel.org/r/067b17a4-cdfc-4f7e-b7e4-28c38e1c1...@daynix.com > > > > > > > > What I don't see is how mr->subregions differs from mr->container, so we > > > > allow subregions to be attached but not the container when finalize() > > > > (which is, afaict, the other way round). > > > > > > > > It seems easier to me that we allow both container and subregions to > > > > exist > > > > as long as within the owner itself, rather than start heavier use of > > > > mr->refcount. 
> > > > > > I don't think just "same owner" necessarily will be workable -- > > > you can have a setup like: > > >* device A has a container C_A > > >* device A has a child-device B > > >* device B has a memory region R_B > > >* device A's realize method puts R_B into C_A > > > > > > R_B's owner is B, and the container's owner is A, > > > but we still want to be able to get rid of A (in the process > > > getting rid of B because it gets unparented and unreffed, > > > and R_B and C_A also). > > > > For cross-device references, s
Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion
On Mon, Aug 26, 2024 at 06:10:25PM +0100, Peter Maydell wrote: > On Mon, 26 Aug 2024 at 16:22, Peter Xu wrote: > > > > On Fri, Aug 23, 2024 at 03:13:11PM +0900, Akihiko Odaki wrote: > > > memory_region_update_container_subregions() used to call > > > memory_region_ref(), which creates a reference to the owner of the > > > subregion, on behalf of the owner of the container. This results in a > > > circular reference if the subregion and container have the same owner. > > > > > > memory_region_ref() creates a reference to the owner instead of the > > > memory region to match the lifetime of the owner and memory region. We > > > do not need such a hack if the subregion and container have the same > > > owner because the owner will be alive as long as the container is. > > > Therefore, create a reference to the subregion itself instead ot its > > > owner in such a case; the reference to the subregion is still necessary > > > to ensure that the subregion gets finalized after the container. > > > > > > Signed-off-by: Akihiko Odaki > > > --- > > > system/memory.c | 8 ++-- > > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > > > diff --git a/system/memory.c b/system/memory.c > > > index 5e6eb459d5de..e4d3e9d1f427 100644 > > > --- a/system/memory.c > > > +++ b/system/memory.c > > > @@ -2612,7 +2612,9 @@ static void > > > memory_region_update_container_subregions(MemoryRegion *subregion) > > > > > > memory_region_transaction_begin(); > > > > > > -memory_region_ref(subregion); > > > +object_ref(mr->owner == subregion->owner ? > > > + OBJECT(subregion) : subregion->owner); > > > > The only place that mr->refcount is used so far is the owner with the > > object property attached to the mr, am I right (ignoring name-less MRs)? > > > > I worry this will further complicate refcounting, now we're actively using > > two refcounts for MRs.. > > > > Continue discussion there: > > > > https://lore.kernel.org/r/067b17a4-cdfc-4f7e-b7e4-28c38e1c1...@daynix.com > > > > What I don't see is how mr->subregions differs from mr->container, so we > > allow subregions to be attached but not the container when finalize() > > (which is, afaict, the other way round). > > > > It seems easier to me that we allow both container and subregions to exist > > as long as within the owner itself, rather than start heavier use of > > mr->refcount. > > I don't think just "same owner" necessarily will be workable -- > you can have a setup like: > * device A has a container C_A > * device A has a child-device B > * device B has a memory region R_B > * device A's realize method puts R_B into C_A > > R_B's owner is B, and the container's owner is A, > but we still want to be able to get rid of A (in the process > getting rid of B because it gets unparented and unreffed, > and R_B and C_A also). For cross-device references, should we rely on an explicit call to memory_region_del_subregion(), so as to detach the link between C_A and R_B? My understanding so far: logically when MR finalize() it should guarantee both (1) mr->container==NULL, and (2) mr->subregions empty. That's before commit 2e2b8eb70fdb7dfb and could be the ideal world (though at the very beginning we don't assert on ->container==NULL yet). It requires all device emulations to do proper unrealize() to unlink all the MRs. However what I'm guessing is QEMU probably used to have lots of devices that are not following the rules and leaking these links. 
Hence we have had 2e2b8eb70fdb7dfb, allowing that to happen as long as it's safe, and it's justified by comment in 2e2b8eb70fdb7dfb on why it's safe. What I was thinking is this comment seems to apply too to mr->container, so that it should be safe too to unlink ->container the same way as its own subregions. IIUC that means for device-internal MR links we should be fine leaving whatever link between MRs owned by such device; the device->refcount guarantees none of them will be visible in any AS. But then we need to always properly unlink the MRs when the link is across >1 device owners, otherwise it's prone to leak. -- Peter Xu
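A rough sketch of the explicit detachment discussed above for the cross-device case, using Peter Maydell's C_A / R_B example; all type and field names here are hypothetical, with memory_region_add_subregion() / memory_region_del_subregion() being the real calls:

/* Device A's realize puts R_B (owned by child device B) into its container C_A. */
static void device_a_realize(DeviceState *dev, Error **errp)
{
    DeviceAState *a = DEVICE_A(dev);

    memory_region_add_subregion(&a->c_a, 0, &a->child_b.r_b);
}

/* Cross-owner link: it must be undone explicitly (e.g. at unrealize),
 * otherwise the link is prone to leak as described above. */
static void device_a_unrealize(DeviceState *dev)
{
    DeviceAState *a = DEVICE_A(dev);

    memory_region_del_subregion(&a->c_a, &a->child_b.r_b);
}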
Re: [PATCH v4 5/7] memory: Clarify owner must not call memory_region_ref()
On Fri, Aug 23, 2024 at 03:13:10PM +0900, Akihiko Odaki wrote: > Signed-off-by: Akihiko Odaki > --- > include/exec/memory.h | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/include/exec/memory.h b/include/exec/memory.h > index d79415a3b159..6698e9d05eab 100644 > --- a/include/exec/memory.h > +++ b/include/exec/memory.h > @@ -1220,6 +1220,7 @@ void memory_region_init(MemoryRegion *mr, > * memory_region_ref: Add a reference to a memory region > * > * This function adds a reference to the owner if present. > + * The owner must not call this function as it results in a circular > reference. > * See docs/devel/memory.rst to know about owner. > * > * @mr: the #MemoryRegion Please consider merging all four doc pre-update patches into one. Thanks, -- Peter Xu
Re: [PATCH v4 7/7] tests/qtest: Delete previous boot file
On Fri, Aug 23, 2024 at 03:13:12PM +0900, Akihiko Odaki wrote: > A test run may create boot files several times. Delete the previous boot > file before creating a new one. > > Signed-off-by: Akihiko Odaki > Reviewed-by: Michael S. Tsirkin > Acked-by: Thomas Huth I didn't track which came first, but I think Fabiano has queued the other one here: https://lore.kernel.org/r/20240820144912.320744-2-peter.mayd...@linaro.org https://gitlab.com/farosas/qemu/-/commits/migration-staging/ So we should be good. -- Peter Xu
Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion
On Fri, Aug 23, 2024 at 03:13:11PM +0900, Akihiko Odaki wrote: > memory_region_update_container_subregions() used to call > memory_region_ref(), which creates a reference to the owner of the > subregion, on behalf of the owner of the container. This results in a > circular reference if the subregion and container have the same owner. > > memory_region_ref() creates a reference to the owner instead of the > memory region to match the lifetime of the owner and memory region. We > do not need such a hack if the subregion and container have the same > owner because the owner will be alive as long as the container is. > Therefore, create a reference to the subregion itself instead ot its > owner in such a case; the reference to the subregion is still necessary > to ensure that the subregion gets finalized after the container. > > Signed-off-by: Akihiko Odaki > --- > system/memory.c | 8 ++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/system/memory.c b/system/memory.c > index 5e6eb459d5de..e4d3e9d1f427 100644 > --- a/system/memory.c > +++ b/system/memory.c > @@ -2612,7 +2612,9 @@ static void > memory_region_update_container_subregions(MemoryRegion *subregion) > > memory_region_transaction_begin(); > > -memory_region_ref(subregion); > +object_ref(mr->owner == subregion->owner ? > + OBJECT(subregion) : subregion->owner); The only place that mr->refcount is used so far is the owner with the object property attached to the mr, am I right (ignoring name-less MRs)? I worry this will further complicate refcounting, now we're actively using two refcounts for MRs.. Continue discussion there: https://lore.kernel.org/r/067b17a4-cdfc-4f7e-b7e4-28c38e1c1...@daynix.com What I don't see is how mr->subregions differs from mr->container, so we allow subregions to be attached but not the container when finalize() (which is, afaict, the other way round). It seems easier to me that we allow both container and subregions to exist as long as within the owner itself, rather than start heavier use of mr->refcount. I tend to agree with you in another thread, where you mentioned it's better we get rid of one of the refcounts. If not trivial to get, we should still try to stick with one refcount to make it less chaos. > + > QTAILQ_FOREACH(other, &mr->subregions, subregions_link) { > if (subregion->priority >= other->priority) { > QTAILQ_INSERT_BEFORE(other, subregion, subregions_link); > @@ -2670,7 +2672,9 @@ void memory_region_del_subregion(MemoryRegion *mr, > assert(alias->mapped_via_alias >= 0); > } > QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link); > -memory_region_unref(subregion); > +object_unref(mr->owner == subregion->owner ? > + OBJECT(subregion) : subregion->owner); > + > memory_region_update_pending |= mr->enabled && subregion->enabled; > memory_region_transaction_commit(); > } > > -- > 2.46.0 > -- Peter Xu
Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)
On Thu, Jul 11, 2024 at 11:44:08AM -0300, Fabiano Rosas wrote: > But of course, that means we cannot claim to support all kinds of > forward migrations anymore. Only those in the 6 year period. That "6 years" comes from the machine type deprecation period, and migration compatibility is mostly only attached to machine types; we only ever allowed migration with the same machine type. It means >6 years migration will never work anyway as soon as we start to deprecate machine types (regardless of any reuse of UNUSED), because the >6 years machine types will simply be gone.. See configuration_post_load(), where it'll throw an error upfront when the machine type is mismatched. -- Peter Xu
Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)
On Thu, Jul 11, 2024 at 10:34:12AM -0300, Fabiano Rosas wrote: > Is there an easy way to look at a field and tell in which machine type's > timeframe it was introduced? I am not aware of any. > If the machine type of that era has been removed, then the field is free > to go as well. I'd prefer if we had a hard link instead of just counting > years. Maybe we should to that mapping at the machine deprecation time? > As in, "look at the unused fields introduced in that timeframe and mark > them free". We can do that, but it depends on how easy it would be. That could be overkill to me if it's non-trivial. When it becomes complicated, I'd rather make the machine compat property easier to use so we always stick with that. Currently it's not as easy to use. Maybe we shouldn't make it a common rule to let people reuse the UNUSED fields, even if in this case it's probably fine? E.g. I don't think it's a huge deal to keep all UNUSED fields forever - sending 512B zeros for only one specific device isn't an issue even if kept forever. If "over 6 years" would be okay and simple enough, then maybe we can stick with that (and only if people would like to reuse a field and ask; that's after all not required..). If more than that, I doubt whether we should spend time working on covering all the fields. -- Peter Xu
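For reference, a sketch of the machine compat property route mentioned above, assuming a hypothetical "migrate-vendor-data" property and SDState field; the exact array entry is illustrative only:

/* In hw/sd/sd.c: expose a switch for the new state (hypothetical names). */
static Property sd_properties[] = {
    DEFINE_PROP_BOOL("migrate-vendor-data", SDState, migrate_vendor_data, true),
    DEFINE_PROP_END_OF_LIST(),
};

/* And an entry appended to the compat array of the last machine version
 * predating the change, e.g. in hw/core/machine.c: */
GlobalProperty hw_compat_9_0[] = {
    { "sd-card", "migrate-vendor-data", "false" },
};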
Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)
On Wed, Jul 10, 2024 at 06:38:26PM -0300, Fabiano Rosas wrote: > Peter Xu writes: > > > On Wed, Jul 10, 2024 at 04:48:23PM -0300, Fabiano Rosas wrote: > >> Peter Xu writes: > >> > >> > On Wed, Jul 10, 2024 at 01:21:51PM -0300, Fabiano Rosas wrote: > >> >> It's not about trust, we simply don't support migrations other than > >> >> n->n+1 and (maybe) n->n-1. So QEMU from 2016 is certainly not included. > >> > > >> > Where does it come from? I thought we suppport that.. > >> > >> I'm taking that from: > >> > >> docs/devel/migration/main.rst: > >> "In general QEMU tries to maintain forward migration compatibility > >> (i.e. migrating from QEMU n->n+1) and there are users who benefit from > >> backward compatibility as well." > >> > >> But of course it doesn't say whether that comes with a transitive rule > >> allowing n->n+2 migrations. > > > > I'd say that "i.e." implies n->n+1 is not the only forward migration we > > would support. > > > > I _think_ we should support all forward migration as long as the machine > > type matches. > > > >> > >> > > >> > The same question would be: are we requesting an OpenStack cluster to > >> > always upgrade QEMU with +1 versions, otherwise migration will fail? > >> > >> Will an OpenStack cluster be using upstream QEMU? If not, then that's a > > > > It's an example to show what I meant! :) Nothing else. Definitely not > > saying that everyone should use an upstream released QEMU (but in reality, > > it's not a problem, I think, and I do feel like people use them, perhaps > > more with the stable releases). > > > >> question for the distro. In a very practical sense, we're not requesting > >> anything. We barely test n->n+1/n->n-1, even if we had a strong support > >> statement I wouldn't be confident saying migration from QEMU 2.7 -> QEMU > >> 9.1 should succeed. > > > > No matter what we test in CI, I don't think we should break that for >1 > > versions.. I hope 2.7->9.1 keeps working, otherwise I think it's legal to > > file a bug by anyone. > > > > For example, I randomly fetched a bug report: > > > > https://gitlab.com/qemu-project/qemu/-/issues/1937 > > > > QEMU version:6.2 and 7.2.5 > > > > And I believe that's the common case even for upstream. If we don't do > > that right for upstream, it can be impossible tasks for downstream and for > > all of us to maintain. > > But do we do that right currently? I have no idea. Have we ever done > it? And we're here discussing a hypothetical 2.7->9.1 ... > > So we cannot reuse the UNUSED field because QEMU from 2016 might send > their data and QEMU from 2024 would interpret it wrong. > > How do we proceed? Add a subsection. And make the code survive when > receiving 0. > > @Peter is that it? What about backwards-compat? We'll need a property as > well it seems. Compat property is definitely one way to go, but I think it's you who more or less persuaded me that reusing it seems possible! At least I can't yet think of anything bad if it's ancient unused buffers. And that's why I was asking about a sane way to describe the "magic year".. And I was very serious when I said "6 years" to follow the deprecation of machine types, because it'll be something we can follow to say when an unused buffer can be obsolete and it could make some sense: if we will start to deprecate machine types, then it may not make sense to keep any UNUSED compatible forever, after all. 
And we need that ruler to be as accurate as "always 6 years to follow machine type deprecation procedure", in case someone else tomorrow asks us something that was only UNUSED since 2018. We need a rule of thumb if we want to reuse it, and if all of you agree we can start with this one, perhaps with a comment above the field (before we think all through and make it a rule on deprecating UNUSED)? -- Peter Xu
Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)
On Wed, Jul 10, 2024 at 04:48:23PM -0300, Fabiano Rosas wrote: > Peter Xu writes: > > > On Wed, Jul 10, 2024 at 01:21:51PM -0300, Fabiano Rosas wrote: > >> It's not about trust, we simply don't support migrations other than > >> n->n+1 and (maybe) n->n-1. So QEMU from 2016 is certainly not included. > > > > Where does it come from? I thought we suppport that.. > > I'm taking that from: > > docs/devel/migration/main.rst: > "In general QEMU tries to maintain forward migration compatibility > (i.e. migrating from QEMU n->n+1) and there are users who benefit from > backward compatibility as well." > > But of course it doesn't say whether that comes with a transitive rule > allowing n->n+2 migrations. I'd say that "i.e." implies n->n+1 is not the only forward migration we would support. I _think_ we should support all forward migration as long as the machine type matches. > > > > > The same question would be: are we requesting an OpenStack cluster to > > always upgrade QEMU with +1 versions, otherwise migration will fail? > > Will an OpenStack cluster be using upstream QEMU? If not, then that's a It's an example to show what I meant! :) Nothing else. Definitely not saying that everyone should use an upstream released QEMU (but in reality, it's not a problem, I think, and I do feel like people use them, perhaps more with the stable releases). > question for the distro. In a very practical sense, we're not requesting > anything. We barely test n->n+1/n->n-1, even if we had a strong support > statement I wouldn't be confident saying migration from QEMU 2.7 -> QEMU > 9.1 should succeed. No matter what we test in CI, I don't think we should break that for >1 versions.. I hope 2.7->9.1 keeps working, otherwise I think it's legal to file a bug by anyone. For example, I randomly fetched a bug report: https://gitlab.com/qemu-project/qemu/-/issues/1937 QEMU version:6.2 and 7.2.5 And I believe that's the common case even for upstream. If we don't do that right for upstream, it can be impossible tasks for downstream and for all of us to maintain. -- Peter Xu
Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)
On Wed, Jul 10, 2024 at 01:21:51PM -0300, Fabiano Rosas wrote: > It's not about trust, we simply don't support migrations other than > n->n+1 and (maybe) n->n-1. So QEMU from 2016 is certainly not included. Where does it come from? I thought we support that.. The same question would be: are we requesting an OpenStack cluster to always upgrade QEMU with +1 versions, otherwise migration will fail? -- Peter Xu
Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)
On Wed, Jul 10, 2024 at 11:08:20AM -0300, Fabiano Rosas wrote: > >> I think it's ok: > >> > >> { > >> "field": "unused", > >> "version_id": 1, > >> "field_exists": false, > >> "size": 512 > >> }, > >> > >> vs. > >> > >> { > >> "field": "vendor_data", > >> "version_id": 0, > >> "field_exists": false, > >> "num": 512, > >> "size": 1 > >> }, > >> > >> The unused field was introduced in 2016 so there's no chance of > >> migrating a QEMU that old to/from 9.1. > > > > What happens if an old qemu 9.0 sends rubbish here to a new QEMU, while the > > new QEMU would consider it meaningful data? > > It will send zeros, no? The code will have to cope with that. The > alternative is to put the vendor_data in a subsection and the code will > also have to cope with the lack of data when the old QEMU doesn't send > it. Ah indeed, that "static const uint8_t buf[1024]" is there at least since 2017. So yes, probably always sending zeros. Nothing I can think of otherwise indeed, if we want to trust that nothing will migrate before 2016. It's just that we may want to know how that "2016" is justified to be safe if we would like to allow that in the future. One thing _could_ be that "rule of thumb" is we plan to obsolete machines with 6 years, so anything "UNUSED" older than 6 years can be over-written? -- Peter Xu
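A minimal sketch of the subsection alternative mentioned above, where the destination simply copes with the field not being sent by an older QEMU; the struct, field name and .needed logic are assumptions for illustration, not the actual patch:

static bool sd_vendor_data_needed(void *opaque)
{
    SDState *sd = opaque;

    /* Only send the new data when it is non-zero (this could also be gated
     * by a compat property on old machine types). */
    return !buffer_is_zero(sd->vendor_data, sizeof(sd->vendor_data));
}

static const VMStateDescription sd_vmstate_vendor_data = {
    .name = "sd-card/vendor-data",
    .version_id = 1,
    .minimum_version_id = 1,
    .needed = sd_vendor_data_needed,
    .fields = (const VMStateField[]) {
        VMSTATE_UINT8_ARRAY(vendor_data, SDState, 512),
        VMSTATE_END_OF_LIST()
    },
};

/* Hooked into the device's main VMStateDescription via:
 *     .subsections = (const VMStateDescription * const []) {
 *         &sd_vmstate_vendor_data,
 *         NULL
 *     },
 */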
Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)
On Tue, Jul 09, 2024 at 05:38:54PM -0300, Fabiano Rosas wrote: > Philippe Mathieu-Daudé writes: > > > "General command" (GEN_CMD, CMD56) is described as: > > > > GEN_CMD is the same as the single block read or write > > commands (CMD24 or CMD17). The difference is that [...] > > the data block is not a memory payload data but has a > > vendor specific format and meaning. > > > > Thus this block must not be stored overwriting data block > > on underlying storage drive. Keep it in a dedicated > > 'vendor_data[]' array. > > > > Signed-off-by: Philippe Mathieu-Daudé > > Tested-by: Cédric Le Goater > > --- > > RFC: Is it safe to reuse VMSTATE_UNUSED_V() (which happens > > to be the same size)? > > Hi, sorry it took some time to get to this, I had just left for vacation > when you first posted. And I totally overlooked there's the email.. until you replied. Welcome back. > > I think it's ok: > > { > "field": "unused", > "version_id": 1, > "field_exists": false, > "size": 512 > }, > > vs. > > { > "field": "vendor_data", > "version_id": 0, > "field_exists": false, > "num": 512, > "size": 1 > }, > > The unused field was introduced in 2016 so there's no chance of > migrating a QEMU that old to/from 9.1. What happens if an old qemu 9.0 sends rubbish here to a new QEMU, while the new QEMU would consider it meaningful data? -- Peter Xu
Re: [PATCH v2 09/15] memory: Do not create circular reference with subregion
On Mon, Jul 08, 2024 at 10:06:44AM +0200, Philippe Mathieu-Daudé wrote: > On 6/7/24 13:59, Akihiko Odaki wrote: > > On 2024/07/03 2:44, Peter Xu wrote: > > > On Thu, Jun 27, 2024 at 10:37:52PM +0900, Akihiko Odaki wrote: > > > > A memory region does not use their own reference counters, but instead > > > > piggybacks on another QOM object, "owner" (unless the owner is not the > > > > memory region itself). When creating a subregion, a new reference to the > > > > owner of the container must be created. However, if the subregion is > > > > owned by the same QOM object, this result in a self-reference, and make > > > > the owner immortal. Avoid such a self-reference. > > > > > > > > Signed-off-by: Akihiko Odaki > > > > --- > > > > system/memory.c | 11 +-- > > > > 1 file changed, 9 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/system/memory.c b/system/memory.c > > > > index 74cd73ebc78b..949f5016a68d 100644 > > > > --- a/system/memory.c > > > > +++ b/system/memory.c > > > > @@ -2638,7 +2638,10 @@ static void > > > > memory_region_update_container_subregions(MemoryRegion > > > > *subregion) > > > > memory_region_transaction_begin(); > > > > - memory_region_ref(subregion); > > > > + if (mr->owner != subregion->owner) { > > > > + memory_region_ref(subregion); > > > > + } > > > > + > > > > QTAILQ_FOREACH(other, &mr->subregions, subregions_link) { > > > > if (subregion->priority >= other->priority) { > > > > QTAILQ_INSERT_BEFORE(other, subregion, subregions_link); > > > > @@ -2696,7 +2699,11 @@ void > > > > memory_region_del_subregion(MemoryRegion *mr, > > > > assert(alias->mapped_via_alias >= 0); > > > > } > > > > QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link); > > > > - memory_region_unref(subregion); > > > > + > > > > + if (mr->owner != subregion->owner) { > > > > + memory_region_unref(subregion); > > > > + } > > > > + > > > > memory_region_update_pending |= mr->enabled && subregion->enabled; > > > > memory_region_transaction_commit(); > > > > } > > > > > > This does look like a real issue.. the patch looks reasonable to me, > > > but I > > > wonder whether we should start to add some good comments in code to > > > reflect > > > that complexity starting from this one. The MR refcount isn't easy to > > > understand to me. > > > > > > It also lets me start to wonder how MR refcount went through until > > > it looks > > > like today.. It's definitely not extremely intuitive to use mr->owner as > > > the object to do refcounting if mr itself does has its own QObject, > > > meanwhile it has other tricks around. > > > > > > E.g. the first thing I stumbled over when looking was the optimization > > > where we will avoid refcounting the mr when there's no owner, and IIUC it > > > was for the case when the "guest memory" (which will never be freed) used > > > to have no owner so we can speedup DMA if we know it won't go away. > > > > > > https://lore.kernel.org/qemu-devel/1450263601-2828-5-git-send-email-pbonz...@redhat.com/ > > > > > > commit 612263cf33062f7441a5d0e3b37c65991fdc3210 > > > Author: Paolo Bonzini > > > Date: Wed Dec 9 11:44:25 2015 +0100 > > > > > > memory: avoid unnecessary object_ref/unref > > > For the common case of DMA into non-hotplugged RAM, it is > > > unnecessary > > > but expensive to do object_ref/unref. Add back an owner field to > > > MemoryRegion, so that these memory regions can skip the reference > > > counting. > > > > > > If so, it looks like it will stop working with memory-backends get > > > involved? 
As I think those MRs will have owner set always, and I wonder > > > whether memory-backends should be the major way to specify guest > > > memory now > > > and in the future. So I'm not sure how important that optimization is as > > > of now, and whether we could "simplify" it back to always do the refcount > > > if the major scenarios wi
Re: [PATCH v2 09/15] memory: Do not create circular reference with subregion
On Thu, Jun 27, 2024 at 10:37:52PM +0900, Akihiko Odaki wrote: > A memory region does not use their own reference counters, but instead > piggybacks on another QOM object, "owner" (unless the owner is not the > memory region itself). When creating a subregion, a new reference to the > owner of the container must be created. However, if the subregion is > owned by the same QOM object, this result in a self-reference, and make > the owner immortal. Avoid such a self-reference. > > Signed-off-by: Akihiko Odaki > --- > system/memory.c | 11 +-- > 1 file changed, 9 insertions(+), 2 deletions(-) > > diff --git a/system/memory.c b/system/memory.c > index 74cd73ebc78b..949f5016a68d 100644 > --- a/system/memory.c > +++ b/system/memory.c > @@ -2638,7 +2638,10 @@ static void > memory_region_update_container_subregions(MemoryRegion *subregion) > > memory_region_transaction_begin(); > > -memory_region_ref(subregion); > +if (mr->owner != subregion->owner) { > +memory_region_ref(subregion); > +} > + > QTAILQ_FOREACH(other, &mr->subregions, subregions_link) { > if (subregion->priority >= other->priority) { > QTAILQ_INSERT_BEFORE(other, subregion, subregions_link); > @@ -2696,7 +2699,11 @@ void memory_region_del_subregion(MemoryRegion *mr, > assert(alias->mapped_via_alias >= 0); > } > QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link); > -memory_region_unref(subregion); > + > +if (mr->owner != subregion->owner) { > +memory_region_unref(subregion); > +} > + > memory_region_update_pending |= mr->enabled && subregion->enabled; > memory_region_transaction_commit(); > } This does look like a real issue.. the patch looks reasonable to me, but I wonder whether we should start to add some good comments in code to reflect that complexity starting from this one. The MR refcount isn't easy to understand to me. It also lets me start to wonder how MR refcount went through until it looks like today.. It's definitely not extremely intuitive to use mr->owner as the object to do refcounting if mr itself does has its own QObject, meanwhile it has other tricks around. E.g. the first thing I stumbled over when looking was the optimization where we will avoid refcounting the mr when there's no owner, and IIUC it was for the case when the "guest memory" (which will never be freed) used to have no owner so we can speedup DMA if we know it won't go away. https://lore.kernel.org/qemu-devel/1450263601-2828-5-git-send-email-pbonz...@redhat.com/ commit 612263cf33062f7441a5d0e3b37c65991fdc3210 Author: Paolo Bonzini Date: Wed Dec 9 11:44:25 2015 +0100 memory: avoid unnecessary object_ref/unref For the common case of DMA into non-hotplugged RAM, it is unnecessary but expensive to do object_ref/unref. Add back an owner field to MemoryRegion, so that these memory regions can skip the reference counting. If so, it looks like it will stop working with memory-backends get involved? As I think those MRs will have owner set always, and I wonder whether memory-backends should be the major way to specify guest memory now and in the future. So I'm not sure how important that optimization is as of now, and whether we could "simplify" it back to always do the refcount if the major scenarios will not adopt it. The other issue is we used owner refcount from the start of memory_region_ref() got introduced, since: commit 46637be269aaaceb9867ffdf176e906401138fff Author: Paolo Bonzini Date: Tue May 7 09:06:00 2013 +0200 memory: add ref/unref And we still have that in our document, even though I don't think it's true anymore: * ... 
MemoryRegions actually do not have their * own reference count; they piggyback on a QOM object, their "owner". * This function adds a reference to the owner. It looks like what happened is that when the change was introduced, MR was not a QOM object yet. But it is now.. I mentioned all these only because I found that _if_ we can keep mr refcounting as simple as other objects: memory_region_ref(mr) { object_ref(OBJECT(mr)); } then it looks like this "recursive refcount" problem can also go away. I'm curious whether you or anyone tried to explore that path, or whether the above doesn't make sense at all. Thanks, -- Peter Xu
Re: [PATCH 07/14] migration: Free removed SaveStateEntry
On Wed, Jun 26, 2024 at 08:06:30PM +0900, Akihiko Odaki wrote: > This suppresses LeakSanitizer warnings. > > Signed-off-by: Akihiko Odaki Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Wed, Jun 05, 2024 at 08:48:28PM +0000, Dr. David Alan Gilbert wrote: > > > I just noticed this thread; some random notes from a somewhat > > > fragmented memory of this: > > > > > > a) Long long ago, I also tried rsocket; > > > https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html > > > as I remember the library was quite flaky at the time. > > > > Hmm interesting. There also looks like a thread doing rpoll(). > > Yeh, I can't actually remember much more about what I did back then! Heh, that's understandable and fair. :) > > I hope Lei and his team has tested >4G mem, otherwise definitely worth > > checking. Lei also mentioned there're rsocket bugs they found in the cover > > letter, but not sure what's that about. > > It would probably be a good idea to keep track of what bugs > are in flight with it, and try it on a few RDMA cards to see > what problems get triggered. > I think I reported a few at the time, but I gave up after > feeling it was getting very hacky. Agreed. Maybe we can have a list of that in the cover letter or even QEMU's migration/rdma doc page. Lei, if you think that makes sense please do so in your upcoming posts. There'll need to be a list of things you encountered in the kernel driver and it'll be even better if there are further links to read on each problem. > > > > > > e) Someone made a good suggestion (sorry can't remember who) - that the > > > RDMA migration structure was the wrong way around - it should be the > > > destination which initiates an RDMA read, rather than the source > > > doing a write; then things might become a LOT simpler; you just need > > > to send page ranges to the destination and it can pull it. > > > That might work nicely for postcopy. > > > > I'm not sure whether it'll still be a problem if rdma recv side is based on > > zero-copy. It would be a matter of whether atomicity can be guaranteed so > > that we don't want the guest vcpus to see a partially copied page during > > on-flight DMAs. UFFDIO_COPY (or friend) is currently the only solution for > > that. > > Yes, but even ignoring that (and the UFFDIO_CONTINUE idea you mention), if > the destination can issue an RDMA read itself, it doesn't need to send > messages > to the source to ask for a page fetch; it just goes and grabs it itself, > that's got to be good for latency. Oh, that's pretty internal stuff of rdma to me and beyond my knowledge.. but from what I can tell it sounds very reasonable indeed! Thanks! -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Wed, Jun 05, 2024 at 10:10:57AM -0400, Peter Xu wrote: > > e) Someone made a good suggestion (sorry can't remember who) - that the > > RDMA migration structure was the wrong way around - it should be the > > destination which initiates an RDMA read, rather than the source > > doing a write; then things might become a LOT simpler; you just need > > to send page ranges to the destination and it can pull it. > > That might work nicely for postcopy. > > I'm not sure whether it'll still be a problem if rdma recv side is based on > zero-copy. It would be a matter of whether atomicity can be guaranteed so > that we don't want the guest vcpus to see a partially copied page during > on-flight DMAs. UFFDIO_COPY (or friend) is currently the only solution for > that. And when thinking about this (given UFFDIO_COPY's nature of not being able to do zero-copy...), the only way this will be able to do zero-copy is to use file memories (shmem/hugetlbfs), as the page cache can be prepopulated. So that when we do DMA we go through the page cache, which can be mapped at another virtual address besides what the vcpus are using. Then we can use UFFDIO_CONTINUE (rather than UFFDIO_COPY) to do atomic updates on the vcpu pgtables, avoiding the copy. QEMU doesn't have it, but it looks like there's one more reason we may want to make better use of shmem.. than anonymous. And actually when working on 4k faults on 1G hugetlb I added CONTINUE support. https://github.com/xzpeter/qemu/tree/doublemap https://github.com/xzpeter/qemu/commit/b8aff3a9d7654b1cf2c089a06894ff4899740dc5 Maybe it's worthwhile on its own now, because it also means we can use that in multifd to avoid one extra layer of buffering when supporting multifd+postcopy (which has the same issue here of directly copying data into guest pages). It'll also work with things like rdma I think in similar ways. It's just that it'll not work on anonymous memory. I definitely hijacked the thread to somewhere too far away. I'll stop here.. Thanks, -- Peter Xu
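For reference, a standalone sketch (not QEMU code) of the UFFDIO_CONTINUE flow described above, assuming the guest RAM is a shmem file registered with userfaultfd in minor-fault mode and that "alias" is a second mapping of that same file:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Install one received page at 'guest_addr' without UFFDIO_COPY's extra copy. */
static int install_page(int uffd, void *guest_addr, void *alias,
                        const void *data, size_t pagesize)
{
    struct uffdio_continue cont = {
        .range = {
            .start = (uintptr_t)guest_addr,
            .len = pagesize,
        },
        .mode = 0,
    };

    /* Stand-in for the data landing in the page cache through the alias
     * mapping (ideally the NIC would DMA straight into it). */
    memcpy(alias, data, pagesize);

    /* Atomically map the now-populated page cache page into the vcpu-visible
     * mapping; vcpus never observe a partially copied page. */
    return ioctl(uffd, UFFDIO_CONTINUE, &cont);
}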
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
Hey, Dave! On Wed, Jun 05, 2024 at 12:31:56AM +, Dr. David Alan Gilbert wrote: > * Michael Galaxy (mgal...@akamai.com) wrote: > > One thing to keep in mind here (despite me not having any hardware to test) > > was that one of the original goals here > > in the RDMA implementation was not simply raw throughput nor raw latency, > > but a lack of CPU utilization in kernel > > space due to the offload. While it is entirely possible that newer hardware > > w/ TCP might compete, the significant > > reductions in CPU usage in the TCP/IP stack were a big win at the time. > > > > Just something to consider while you're doing the testing > > I just noticed this thread; some random notes from a somewhat > fragmented memory of this: > > a) Long long ago, I also tried rsocket; > https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html > as I remember the library was quite flaky at the time. Hmm interesting. There also looks like a thread doing rpoll(). Btw, not sure whether you noticed, but there's the series posted for the latest rsocket conversion here: https://lore.kernel.org/r/1717503252-51884-1-git-send-email-arei.gong...@huawei.com I hope Lei and his team has tested >4G mem, otherwise definitely worth checking. Lei also mentioned there're rsocket bugs they found in the cover letter, but not sure what's that about. > > b) A lot of the complexity in the rdma migration code comes from > emulating a stream to carry the migration control data and interleaving > that with the actual RAM copy. I believe the original design used > a separate TCP socket for the control data, and just used the RDMA > for the data - that should be a lot simpler (but alas was rejected > in review early on) > > c) I can't rememmber the last benchmarks I did; but I think I did > manage to beat RDMA with multifd; but yes, multifd does eat host CPU > where as RDMA barely uses a whisper. I think my first impression on this matter came from you on this one. :) > > d) The 'zero-copy-send' option in migrate may well get some of that > CPU time back; but if I remember we were still bottle necked on > the receive side. (I can't remember if zero-copy-send worked with > multifd?) Yes, and zero-copy requires multifd for now. I think it's because we didn't want to complicate the header processings in the migration stream where it may not be page aligned. > > e) Someone made a good suggestion (sorry can't remember who) - that the > RDMA migration structure was the wrong way around - it should be the > destination which initiates an RDMA read, rather than the source > doing a write; then things might become a LOT simpler; you just need > to send page ranges to the destination and it can pull it. > That might work nicely for postcopy. I'm not sure whether it'll still be a problem if rdma recv side is based on zero-copy. It would be a matter of whether atomicity can be guaranteed so that we don't want the guest vcpus to see a partially copied page during on-flight DMAs. UFFDIO_COPY (or friend) is currently the only solution for that. Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
Lei, On Wed, May 29, 2024 at 02:43:46AM +, Gonglei (Arei) wrote: > For rdma programming, the current mainstream implementation is to use > rdma_cm to establish a connection, and then use verbs to transmit data. > rdma_cm and ibverbs create two FDs respectively. The two FDs have > different responsibilities. rdma_cm fd is used to notify connection > establishment events, and verbs fd is used to notify new CQEs. When > poll/epoll monitoring is directly performed on the rdma_cm fd, only a > pollin event can be monitored, which means that an rdma_cm event > occurs. When the verbs fd is directly polled/epolled, only the pollin > event can be listened, which indicates that a new CQE is generated. > > Rsocket is a sub-module attached to the rdma_cm library and provides > rdma calls that are completely similar to socket interfaces. However, > this library returns only the rdma_cm fd for listening to link > setup-related events and does not expose the verbs fd (readable and > writable events for listening to data). Only the rpoll interface provided > by the RSocket can be used to listen to related events. However, QEMU > uses the ppoll interface to listen to the rdma_cm fd (gotten by raccept > API). And cannot listen to the verbs fd event. Only some hacking methods > can be used to address this problem. Do you guys have any ideas? Thanks. I saw that you mentioned this elsewhere: > Right. But the question is QEMU do not use rpoll but glib's ppoll. :( So what I'm thinking may not make much sense, as I mentioned I don't think I know rdma at all.. and my idea also involves coroutine stuff which I don't know well either. But just in case it sheds some light in some form. IIUC we handle iochannel blocking with this, no matter whether it's a read or a write:

    if (len == QIO_CHANNEL_ERR_BLOCK) {
        if (qemu_in_coroutine()) {
            qio_channel_yield(ioc, G_IO_XXX);
        } else {
            qio_channel_wait(ioc, G_IO_XXX);
        }
        continue;
    }

One thing I'm wondering is whether we can provide a new feature bit for qiochannel, e.g., QIO_CHANNEL_FEATURE_POLL, so that the iochannel can define its own poll routine rather than using the default when possible. I think it may not work if it's in a coroutine, as I guess that'll block other fds from being woken up. Hence it should look like this:

    if (len == QIO_CHANNEL_ERR_BLOCK) {
        if (qemu_in_coroutine()) {
            qio_channel_yield(ioc, G_IO_XXX);
        } else if (qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_POLL)) {
            qio_channel_poll(ioc, G_IO_XXX);
        } else {
            qio_channel_wait(ioc, G_IO_XXX);
        }
        continue;
    }

Maybe we even want to forbid such a channel from being used in a coroutine at all, as when QIO_CHANNEL_FEATURE_POLL is set it may mean that this iochannel simply won't work with poll(), like in rdma's use case. Then the rdma iochannel can implement qio_channel_poll() using rpoll(). There's one other dependent issue here in that I _think_ the migration recv side is still in a coroutine.. so we may need to move that into a thread first. IIRC we don't yet have a major blocker to do that, but I didn't further check either. I've put that issue aside just to see whether this may or may not make sense. Thanks, -- Peter Xu
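To make that last idea a bit more concrete, here is a rough sketch of what such a poll hook could look like for an rsocket-backed channel. QIO_CHANNEL_FEATURE_POLL, qio_channel_poll() and the QIOChannelRDMA layout are hypothetical names following the proposal above, not existing QEMU API; rpoll() is the real librdmacm call from <rdma/rsocket.h>:

    /* Hypothetical sketch only: how an rsocket-based channel might back the
     * proposed qio_channel_poll() hook with rpoll() instead of ppoll(). */
    #include <errno.h>
    #include <poll.h>
    #include <rdma/rsocket.h>

    typedef struct QIOChannelRDMA {
        /* ... parent QIOChannel and other state would live here ... */
        int fd; /* fd returned by rsocket()/raccept(), not a kernel fd */
    } QIOChannelRDMA;

    static int qio_channel_rdma_poll(QIOChannelRDMA *rioc, short events,
                                     int timeout_ms)
    {
        struct pollfd pfd = {
            .fd = rioc->fd,
            .events = events, /* e.g. POLLIN / POLLOUT mapped from G_IO_IN / G_IO_OUT */
        };
        int ret;

        /* rpoll() understands rsocket fds, which plain poll()/ppoll() do not,
         * so it can wait for both connection and data (CQE) events behind
         * the rsocket. */
        do {
            ret = rpoll(&pfd, 1, timeout_ms);
        } while (ret < 0 && errno == EINTR);

        return ret;
    }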
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Tue, May 28, 2024 at 09:06:04AM +, Gonglei (Arei) wrote: > Hi Peter, > > > -Original Message- > > From: Peter Xu [mailto:pet...@redhat.com] > > Sent: Wednesday, May 22, 2024 6:15 AM > > To: Yu Zhang > > Cc: Michael Galaxy ; Jinpu Wang > > ; Elmar Gerdes ; > > zhengchuan ; Gonglei (Arei) > > ; Daniel P. Berrangé ; > > Markus Armbruster ; Zhijian Li (Fujitsu) > > ; qemu-de...@nongnu.org; Yuval Shaia > > ; Kevin Wolf ; Prasanna > > Kumar Kalever ; Cornelia Huck > > ; Michael Roth ; Prasanna > > Kumar Kalever ; Paolo Bonzini > > ; qemu-block@nongnu.org; de...@lists.libvirt.org; > > Hanna Reitz ; Michael S. Tsirkin ; > > Thomas Huth ; Eric Blake ; Song > > Gao ; Marc-André Lureau > > ; Alex Bennée ; > > Wainer dos Santos Moschetta ; Beraldo Leal > > ; Pannengyuan ; > > Xiexiangyou ; Fabiano Rosas > > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling > > > > On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote: > > > Hello Michael and Peter, > > > > Hi, > > > > > > > > Exactly, not so compelling, as I did it first only on servers widely > > > used for production in our data center. The network adapters are > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 > > > 2-port Gigabit Ethernet PCIe > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more > > reasonable. > > > > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15 > > wvaqk81vxtkzx-l...@mail.gmail.com/ > > > > Appreciate a lot for everyone helping on the testings. > > > > > InfiniBand controller: Mellanox Technologies MT27800 Family > > > [ConnectX-5] > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP for VM > > > migration. RDMA traffic is through InfiniBand and TCP through Ethernet > > > on these two hosts. One is standby while the other is active. > > > > > > Now I'll try on a server with more recent Ethernet and InfiniBand > > > network adapters. One of them has: > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01) > > > > > > The comparison between RDMA and TCP on the same NIC could make more > > sense. > > > > It looks to me NICs are powerful now, but again as I mentioned I don't > > think it's > > a reason we need to deprecate rdma, especially if QEMU's rdma migration has > > the chance to be refactored using rsocket. > > > > Is there anyone who started looking into that direction? Would it make > > sense > > we start some PoC now? > > > > My team has finished the PoC refactoring which works well. > > Progress: > 1. Implement io/channel-rdma.c, > 2. Add unit test tests/unit/test-io-channel-rdma.c and verifying it is > successful, > 3. Remove the original code from migration/rdma.c, > 4. Rewrite the rdma_start_outgoing_migration and > rdma_start_incoming_migration logic, > 5. Remove all rdma_xxx functions from migration/ram.c. (to prevent RDMA live > migration from polluting the core logic of live migration), > 6. The soft-RoCE implemented by software is used to test the RDMA live > migration. It's successful. > > We will be submit the patchset later. That's great news, thank you! -- Peter Xu
Re: [PATCH 5/6] migration: Rephrase message on failure to save / load Xen device state
On Mon, May 27, 2024 at 12:53:22PM +0200, Markus Armbruster wrote: > Peter Xu writes: > > > On Mon, May 13, 2024 at 04:17:02PM +0200, Markus Armbruster wrote: > >> Functions that use an Error **errp parameter to return errors should > >> not also report them to the user, because reporting is the caller's > >> job. When the caller does, the error is reported twice. When it > >> doesn't (because it recovered from the error), there is no error to > >> report, i.e. the report is bogus. > >> > >> qmp_xen_save_devices_state() and qmp_xen_load_devices_state() violate > >> this principle: they call qemu_save_device_state() and > >> qemu_loadvm_state(), which call error_report_err(). > >> > >> I wish I could clean this up now, but migration's error reporting is > >> too complicated (confused?) for me to mess with it. > > > > :-( > > If I understood how it's *supposed* to work, I might have a chance... > > I can see a mixture of reporting errors directly (with error_report() & > friends), passing them to callers (via Error **errp), and storing them > in / retrieving them from MigrationState member @error. This can't be > right. > > I think a necessary first step towards getting it right is a shared > understanding how errors are to be handled in migration code. This > includes how error data should flow from error source to error sink, and > what the possible sinks are. True. I think the sink should always be MigrationState.error so far. There's also the added complexity of detecting errors using either qemu_file_get_error() or migrate_get_error().. the error handling in migration is indeed due for a cleanup. I've added a cleanup entry to the migration ToDo page: https://wiki.qemu.org/ToDo/LiveMigration#Migration_error_detection_and_reporting Thanks, -- Peter Xu
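As a side note, the convention being referred to can be summarized with a small, purely illustrative sketch (do_save_step() is made up; error_setg(), error_report_err() and migrate_set_error() are the existing helpers being discussed, include paths simplified):

    /* Illustrative only: an error *source* fills Error **errp and returns
     * failure; the *caller* decides whether to report the error or to store
     * it in the MigrationState sink, but the source never reports directly. */
    #include "qapi/error.h"
    #include "migration/migration.h" /* path simplified for the example */

    static bool do_save_step(int fd, Error **errp)
    {
        if (fd < 0) {
            error_setg(errp, "invalid migration stream fd: %d", fd);
            return false; /* error handed to the caller, not reported here */
        }
        return true;
    }

    static void caller(MigrationState *s, int fd)
    {
        Error *err = NULL;

        if (!do_save_step(fd, &err)) {
            migrate_set_error(s, err); /* sink: stash the error in MigrationState */
            error_report_err(err);     /* report (and free) exactly once */
        }
    }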
Re: [PATCH 5/6] migration: Rephrase message on failure to save / load Xen device state
On Mon, May 13, 2024 at 04:17:02PM +0200, Markus Armbruster wrote: > Functions that use an Error **errp parameter to return errors should > not also report them to the user, because reporting is the caller's > job. When the caller does, the error is reported twice. When it > doesn't (because it recovered from the error), there is no error to > report, i.e. the report is bogus. > > qmp_xen_save_devices_state() and qmp_xen_load_devices_state() violate > this principle: they call qemu_save_device_state() and > qemu_loadvm_state(), which call error_report_err(). > > I wish I could clean this up now, but migration's error reporting is > too complicated (confused?) for me to mess with it. :-( > > Instead, I'm merely improving the error reported by > qmp_xen_save_devices_state() and qmp_xen_load_devices_state() to the > QMP core from > > An IO error has occurred > > to > saving Xen device state failed > > and > > loading Xen device state failed > > respectively. > > Signed-off-by: Markus Armbruster Acked-by: Peter Xu -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote: > Hello Michael and Peter, Hi, > > Exactly, not so compelling, as I did it first only on servers widely > used for production in our data center. The network adapters are > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 > 2-port Gigabit Ethernet PCIe Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more reasonable. https://lore.kernel.org/qemu-devel/camgffen-dkpmz4ta71mjydyemg0zda15wvaqk81vxtkzx-l...@mail.gmail.com/ Appreciate a lot for everyone helping on the testings. > InfiniBand controller: Mellanox Technologies MT27800 Family [ConnectX-5] > > which doesn't meet our purpose. I can choose RDMA or TCP for VM > migration. RDMA traffic is through InfiniBand and TCP through Ethernet > on these two hosts. One is standby while the other is active. > > Now I'll try on a server with more recent Ethernet and InfiniBand > network adapters. One of them has: > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01) > > The comparison between RDMA and TCP on the same NIC could make more sense. It looks to me NICs are powerful now, but again as I mentioned I don't think it's a reason we need to deprecate rdma, especially if QEMU's rdma migration has the chance to be refactored using rsocket. Is there anyone who started looking into that direction? Would it make sense we start some PoC now? Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote: > That's a good news to see the socket abstraction for RDMA! > When I was developed the series above, the most pain is the RDMA migration > has no QIOChannel abstraction and i need to take a 'fake channel' > for it which is awkward in code implementation. > So, as far as I know, we can do this by > i. the first thing is that we need to evaluate the rsocket is good enough to > satisfy our QIOChannel fundamental abstraction > ii. if it works right, then we will continue to see if it can give us > opportunity to hide the detail of rdma protocol > into rsocket by remove most of code in rdma.c and also some hack in > migration main process. > iii. implement the advanced features like multi-fd and multi-uri for rdma > migration. > > Since I am not familiar with rsocket, I need some times to look at it and do > some quick verify with rdma migration based on rsocket. > But, yes, I am willing to involved in this refactor work and to see if we can > make this migration feature more better:) Based on what we have now, it looks like we'd better halt the deprecation process a bit, so I think we shouldn't need to rush it at least in 9.1 then, and we'll need to see how it goes on the refactoring. It'll be perfect if rsocket works, otherwise supporting multifd with little overhead / exported APIs would also be a good thing in general with whatever approach. And obviously all based on the facts that we can get resources from companies to support this feature first. Note that so far nobody yet compared with rdma v.s. nic perf, so I hope if any of us can provide some test results please do so. Many people are saying RDMA is better, but I yet didn't see any numbers comparing it with modern TCP networks. I don't want to have old impressions floating around even if things might have changed.. When we have consolidated results, we should share them out and also reflect that in QEMU's migration docs when a rdma document page is ready. Chuan, please check the whole thread discussion, it may help to understand what we are looking for on rdma migrations [1]. Meanwhile please feel free to sync with Jinpu's team and see how to move forward with such a project. [1] https://lore.kernel.org/qemu-devel/87frwatp7n@suse.de/ Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Tue, May 07, 2024 at 01:50:43AM +, Gonglei (Arei) wrote: > Hello, > > > -Original Message- > > From: Peter Xu [mailto:pet...@redhat.com] > > Sent: Monday, May 6, 2024 11:18 PM > > To: Gonglei (Arei) > > Cc: Daniel P. Berrangé ; Markus Armbruster > > ; Michael Galaxy ; Yu Zhang > > ; Zhijian Li (Fujitsu) ; Jinpu > > Wang > > ; Elmar Gerdes ; > > qemu-de...@nongnu.org; Yuval Shaia ; Kevin Wolf > > ; Prasanna Kumar Kalever > > ; Cornelia Huck ; > > Michael Roth ; Prasanna Kumar Kalever > > ; integrat...@gluster.org; Paolo Bonzini > > ; qemu-block@nongnu.org; de...@lists.libvirt.org; > > Hanna Reitz ; Michael S. Tsirkin ; > > Thomas Huth ; Eric Blake ; Song > > Gao ; Marc-André Lureau > > ; Alex Bennée ; > > Wainer dos Santos Moschetta ; Beraldo Leal > > ; Pannengyuan ; > > Xiexiangyou > > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling > > > > On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote: > > > Hi, Peter > > > > Hey, Lei, > > > > Happy to see you around again after years. > > > Haha, me too. > > > > RDMA features high bandwidth, low latency (in non-blocking lossless > > > network), and direct remote memory access by bypassing the CPU (As you > > > know, CPU resources are expensive for cloud vendors, which is one of > > > the reasons why we introduced offload cards.), which TCP does not have. > > > > It's another cost to use offload cards, v.s. preparing more cpu resources? > > > Software and hardware offload converged architecture is the way to go for all > cloud vendors > (Including comprehensive benefits in terms of performance, cost, security, > and innovation speed), > it's not just a matter of adding the resource of a DPU card. > > > > In some scenarios where fast live migration is needed (extremely short > > > interruption duration and migration duration) is very useful. To this > > > end, we have also developed RDMA support for multifd. > > > > Will any of you upstream that work? I'm curious how intrusive would it be > > when adding it to multifd, if it can keep only 5 exported functions like > > what > > rdma.h does right now it'll be pretty nice. We also want to make sure it > > works > > with arbitrary sized loads and buffers, e.g. vfio is considering to add IO > > loads to > > multifd channels too. > > > > In fact, we sent the patchset to the community in 2021. Pls see: > https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/ I wasn't aware of that for sure in the past.. Multifd has changed quite a bit in the last 9.0 release, that may not apply anymore. One thing to mention is please look at Dan's comment on possible use of rsocket.h: https://lore.kernel.org/all/zjjm6rcqs5eho...@redhat.com/ And Jinpu did help provide an initial test result over the library: https://lore.kernel.org/qemu-devel/camgffek8wiknqmouyxcathgtiem2dwocf_w7t0vmcd-i30t...@mail.gmail.com/ It looks like we have a chance to apply that in QEMU. > > > > One thing to note that the question here is not about a pure performance > > comparison between rdma and nics only. It's about help us make a decision > > on whether to drop rdma, iow, even if rdma performs well, the community > > still > > has the right to drop it if nobody can actively work and maintain it. > > It's just that if nics can perform as good it's more a reason to drop, > > unless > > companies can help to provide good support and work together. > > > > We are happy to provide the necessary review and maintenance work for RDMA > if the community needs it. > > CC'ing Chuan Zheng. 
I'm not sure whether you and Jinpu's team would like to work together and provide a final solution for rdma over multifd. It could be much simpler than the original 2021 proposal if the rsocket API will work out. Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Mon, May 06, 2024 at 12:08:43PM +0200, Jinpu Wang wrote: > Hi Peter, hi Daniel, Hi, Jinpu, Thanks for sharing this test results. Sounds like a great news. What's your plan next? Would it then be worthwhile / possible moving QEMU into that direction? Would that greatly simplify rdma code as Dan mentioned? Thanks, > > On Fri, May 3, 2024 at 4:33 PM Peter Xu wrote: > > > > On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote: > > > I had a brief check in the rsocket changelog, there seems some > > > improvement over time, > > > might be worth revisiting this. due to socket abstraction, we can't > > > use some feature like > > > ODP, it won't be a small and easy task. > > > > It'll be good to know whether Dan's suggestion would work first, without > > rewritting everything yet so far. Not sure whether some perf test could > > help with the rsocket APIs even without QEMU's involvements (or looking for > > test data supporting / invalidate such conversions). > > > I did a quick test with iperf on 100 G environment and 40 G > environment, in summary rsocket works pretty well. > > iperf tests between 2 hosts with 40 G (IB), > first a few test with different num. of threads on top of ipoib > interface, later with preload rsocket on top of same ipoib interface. > > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 > > Client connecting to 10.43.3.145, TCP port 52000 > TCP window size: 165 KByte (default) > > [ 3] local 10.43.3.146 port 55602 connected with 10.43.3.145 port 52000 > [ ID] Interval Transfer Bandwidth > [ 3] 0.-10.0001 sec 2.85 GBytes 2.44 Gbits/sec > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 2 > > Client connecting to 10.43.3.145, TCP port 52000 > TCP window size: 165 KByte (default) > > [ 4] local 10.43.3.146 port 39640 connected with 10.43.3.145 port 52000 > [ 3] local 10.43.3.146 port 39626 connected with 10.43.3.145 port 52000 > [ ID] Interval Transfer Bandwidth > [ 3] 0.-10.0012 sec 2.85 GBytes 2.45 Gbits/sec > [ 4] 0.-10.0026 sec 2.86 GBytes 2.45 Gbits/sec > [SUM] 0.-10.0026 sec 5.71 GBytes 4.90 Gbits/sec > [ CT] final connect times (min/avg/max/stdev) = > 0.281/0.300/0.318/0.318 ms (tot/err) = 2/0 > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 4 > > Client connecting to 10.43.3.145, TCP port 52000 > TCP window size: 165 KByte (default) > > [ 4] local 10.43.3.146 port 46956 connected with 10.43.3.145 port 52000 > [ 6] local 10.43.3.146 port 46978 connected with 10.43.3.145 port 52000 > [ 3] local 10.43.3.146 port 46944 connected with 10.43.3.145 port 52000 > [ 5] local 10.43.3.146 port 46962 connected with 10.43.3.145 port 52000 > [ ID] Interval Transfer Bandwidth > [ 3] 0.-10.0017 sec 2.85 GBytes 2.45 Gbits/sec > [ 4] 0.-10.0015 sec 2.85 GBytes 2.45 Gbits/sec > [ 5] 0.-10.0026 sec 2.85 GBytes 2.45 Gbits/sec > [ 6] 0.-10.0005 sec 2.85 GBytes 2.45 Gbits/sec > [SUM] 0.-10.0005 sec 11.4 GBytes 9.80 Gbits/sec > [ CT] final connect times (min/avg/max/stdev) = > 0.274/0.312/0.360/0.212 ms (tot/err) = 4/0 > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 8 > > Client connecting to 10.43.3.145, TCP port 52000 > TCP window size: 165 KByte (default) > > [ 7] local 10.43.3.146 port 35062 connected with 10.43.3.145 port 52000 > [ 6] local 10.43.3.146 port 35058 connected with 10.43.3.145 port 52000 > [ 8] local 10.43.3.146 port 35066 connected with 10.43.3.145 port 52000 > [ 9] local 10.43.3.146 port 35074 connected with 10.43.3.145 port 52000 > [ 3] local 10.43.3.146 port 35038 connected with 10.43.3.145 port 52000 > [ 12] local 10.43.3.146 
port 35088 connected with 10.43.3.145 port 52000 > [ 5] local 10.43.3.146 port 35048 connected with 10.43.3.145 port 52000 > [ 4] local 10.43.3.146 port 35050 connected with 10.43.3.145 port 52000 > [ ID] Interval Transfer Bandwidth > [ 4] 0.-10.0005 sec 2.85 GBytes 2.44 Gbits/sec > [ 8] 0.-10.0011 sec 2.85 GBytes 2.45 Gbits/sec > [ 5] 0.-10. sec 2.85 GBytes 2.45 Gbits/sec > [ 12] 0.-10.0021 sec 2.85 GBytes 2.44 Gbits/sec > [ 3] 0.-10.0003 sec 2.85 GBytes
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote: > Hi, Peter Hey, Lei, Happy to see you around again after years. > RDMA features high bandwidth, low latency (in non-blocking lossless > network), and direct remote memory access by bypassing the CPU (As you > know, CPU resources are expensive for cloud vendors, which is one of the > reasons why we introduced offload cards.), which TCP does not have. It's another cost to use offload cards, v.s. preparing more cpu resources? > In some scenarios where fast live migration is needed (extremely short > interruption duration and migration duration) is very useful. To this > end, we have also developed RDMA support for multifd. Will any of you upstream that work? I'm curious how intrusive it would be when adding it to multifd; if it can keep only 5 exported functions like what rdma.h does right now, it'll be pretty nice. We also want to make sure it works with arbitrarily sized loads and buffers, e.g. vfio is considering adding IO loads to multifd channels too. One thing to note is that the question here is not just about a pure performance comparison between rdma and nics. It's about helping us make a decision on whether to drop rdma, iow, even if rdma performs well, the community still has the right to drop it if nobody can actively work on and maintain it. It's just that if nics can perform as well, it's more of a reason to drop it, unless companies can help to provide good support and work together. Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote: > I had a brief check in the rsocket changelog, there seems some > improvement over time, > might be worth revisiting this. due to socket abstraction, we can't > use some feature like > ODP, it won't be a small and easy task. It'll be good to know whether Dan's suggestion would work first, without rewriting everything just yet. Not sure whether some perf test could help with the rsocket APIs even without QEMU's involvement (or looking for test data supporting / invalidating such a conversion). Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Thu, May 02, 2024 at 03:30:58PM +0200, Jinpu Wang wrote: > Hi Michael, Hi Peter, > > > On Thu, May 2, 2024 at 3:23 PM Michael Galaxy wrote: > > > > Yu Zhang / Jinpu, > > > > Any possibility (at your lesiure, and within the disclosure rules of > > your company, IONOS) if you could share any of your performance > > information to educate the group? > > > > NICs have indeed changed, but not everybody has 100ge mellanox cards at > > their disposal. Some people don't. > Our staging env is with 100 Gb/s IB environment. > We will have a new setup in the coming months with Ethernet (RoCE), we > will run some performance > comparison when we have the environment ready. Thanks both. Please keep us posted. Just to double check, we're comparing "tcp:" v.s. "rdma:", RoCE is not involved, am I right? The other note is that the comparison needs to be with multifd enabled for the "tcp:" case. I'd suggest we start with 8 threads if it's 100Gbps. I think I can still fetch some 100Gbps or even 200Gbps nics around our labs without even waiting for months. If you want I can try to see how we can test together. And btw I don't think we need a cluster, IIUC we simply need two hosts, 100G nic on both sides? IOW, it seems to me we only need two cards just for experiments, systems that can drive the cards, and a wire supporting 100G? > > > > > - Michael > > Thx! > Jinpu > > > > On 5/1/24 11:16, Peter Xu wrote: > > > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote: > > >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote: > > >>> What I worry more is whether this is really what we want to keep rdma in > > >>> qemu, and that's also why I was trying to request for some serious > > >>> performance measurements comparing rdma v.s. nics. And here when I said > > >>> "we" I mean both QEMU community and any company that will support > > >>> keeping > > >>> rdma around. > > >>> > > >>> The problem is if NICs now are fast enough to perform at least equally > > >>> against rdma, and if it has a lower cost of overall maintenance, does it > > >>> mean that rdma migration will only be used by whoever wants to keep > > >>> them in > > >>> the products and existed already? In that case we should simply ask new > > >>> users to stick with tcp, and rdma users should only drop but not > > >>> increase. > > >>> > > >>> It seems also destined that most new migration features will not support > > >>> rdma: see how much we drop old features in migration now (which rdma > > >>> _might_ still leverage, but maybe not), and how much we add mostly > > >>> multifd > > >>> relevant which will probably not apply to rdma at all. So in general > > >>> what > > >>> I am worrying is a both-loss condition, if the company might be easier > > >>> to > > >>> either stick with an old qemu (depending on whether other new features > > >>> are > > >>> requested to be used besides RDMA alone), or do periodic rebase with > > >>> RDMA > > >>> downstream only. > > >> I don't know much about the originals of RDMA support in QEMU and why > > >> this particular design was taken. It is indeed a huge maint burden to > > >> have a completely different code flow for RDMA with 4000+ lines of > > >> custom protocol signalling which is barely understandable. > > >> > > >> I would note that /usr/include/rdma/rsocket.h provides a higher level > > >> API that is a 1-1 match of the normal kernel 'sockets' API. 
If we had > > >> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress > > >> type could almost[1] trivially have supported RDMA. There would have > > >> been almost no RDMA code required in the migration subsystem, and all > > >> the modern features like compression, multifd, post-copy, etc would > > >> "just work". > > >> > > >> I guess the 'rsocket.h' shim may well limit some of the possible > > >> performance gains, but it might still have been a better tradeoff > > >> to have not quite so good peak performance, but with massively > > >> less maint burden. > > > My understanding so far is RDMA is sololy for performance but nothing > > >
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote: > On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote: > > What I worry more is whether this is really what we want to keep rdma in > > qemu, and that's also why I was trying to request for some serious > > performance measurements comparing rdma v.s. nics. And here when I said > > "we" I mean both QEMU community and any company that will support keeping > > rdma around. > > > > The problem is if NICs now are fast enough to perform at least equally > > against rdma, and if it has a lower cost of overall maintenance, does it > > mean that rdma migration will only be used by whoever wants to keep them in > > the products and existed already? In that case we should simply ask new > > users to stick with tcp, and rdma users should only drop but not increase. > > > > It seems also destined that most new migration features will not support > > rdma: see how much we drop old features in migration now (which rdma > > _might_ still leverage, but maybe not), and how much we add mostly multifd > > relevant which will probably not apply to rdma at all. So in general what > > I am worrying is a both-loss condition, if the company might be easier to > > either stick with an old qemu (depending on whether other new features are > > requested to be used besides RDMA alone), or do periodic rebase with RDMA > > downstream only. > > I don't know much about the originals of RDMA support in QEMU and why > this particular design was taken. It is indeed a huge maint burden to > have a completely different code flow for RDMA with 4000+ lines of > custom protocol signalling which is barely understandable. > > I would note that /usr/include/rdma/rsocket.h provides a higher level > API that is a 1-1 match of the normal kernel 'sockets' API. If we had > leveraged that, then QIOChannelSocket class and the QAPI SocketAddress > type could almost[1] trivially have supported RDMA. There would have > been almost no RDMA code required in the migration subsystem, and all > the modern features like compression, multifd, post-copy, etc would > "just work". > > I guess the 'rsocket.h' shim may well limit some of the possible > performance gains, but it might still have been a better tradeoff > to have not quite so good peak performance, but with massively > less maint burden. My understanding so far is RDMA is sololy for performance but nothing else, then it's a question on whether rdma existing users would like to do so if it will run slower. Jinpu mentioned on the explicit usages of ib verbs but I am just mostly quotting that word as I don't really know such details: https://lore.kernel.org/qemu-devel/camgffem2twjxopcnqtq1sjytf5395dbztcmyikrqfxdzjws...@mail.gmail.com/ So not sure whether that applies here too, in that having qiochannel wrapper may not allow direct access to those ib verbs. Thanks, > > With regards, > Daniel > > [1] "almost" trivially, because the poll() integration for rsockets > requires a bit more magic sauce since rsockets FDs are not > really FDs from the kernel's POV. Still, QIOCHannel likely can > abstract that probme. > -- > |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o-https://fstop138.berrange.com :| > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :| > -- Peter Xu
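For reference, a minimal sketch of what the 1-1 mapping Dan describes looks like in practice with rsockets (rsocket()/rconnect()/rsend()/rrecv()/rclose() come from <rdma/rsocket.h> in librdmacm; addresses and error handling trimmed):

    /* Minimal rsockets client sketch: the calls mirror the BSD sockets API,
     * just with an 'r' prefix, which is why a QIOChannel built on top of it
     * could look almost identical to QIOChannelSocket. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <string.h>
    #include <rdma/rsocket.h> /* link with -lrdmacm */

    int rsocket_client_demo(const char *ip, unsigned short port)
    {
        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port = htons(port),
        };
        inet_pton(AF_INET, ip, &addr.sin_addr);

        int fd = rsocket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0 || rconnect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            return -1;
        }

        const char msg[] = "hello over rsockets";
        rsend(fd, msg, sizeof(msg), 0);

        char buf[64];
        rrecv(fd, buf, sizeof(buf), 0);

        rclose(fd);
        return 0;
    }

As the footnote above says, the main wrinkle is that these fds are not kernel fds, so anything built around poll()/ppoll() would still need rpoll() or an equivalent shim.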
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Tue, Apr 30, 2024 at 09:00:49AM +0100, Daniel P. Berrangé wrote: > On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote: > > Peter Xu writes: > > > > > On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote: > > >> Hi All (and Peter), > > > > > > Hi, Michael, > > > > > >> > > >> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name > > >> (highly irregular for a male) and yes, that's my real last name: > > >> https://www.linkedin.com/in/mrgalaxy/) > > >> > > >> I'm the original author of the RDMA implementation. I've been discussing > > >> with Yu Zhang for a little bit about potentially handing over > > >> maintainership > > >> of the codebase to his team. > > >> > > >> I simply have zero access to RoCE or Infiniband hardware at all, > > >> unfortunately. so I've never been able to run tests or use what I wrote > > >> at > > >> work, and as all of you know, if you don't have a way to test something, > > >> then you can't maintain it. > > >> > > >> Yu Zhang put a (very kind) proposal forward to me to ask the community if > > >> they feel comfortable training his team to maintain the codebase (and run > > >> tests) while they learn about it. > > > > > > The "while learning" part is fine at least to me. IMHO the "ownership" to > > > the code, or say, taking over the responsibility, may or may not need 100% > > > mastering the code base first. There should still be some fundamental > > > confidence to work on the code though as a starting point, then it's about > > > serious use case to back this up, and careful testings while getting more > > > familiar with it. > > > > How much experience we expect of maintainers depends on the subsystem > > and other circumstances. The hard requirement isn't experience, it's > > trust. See the recent attack on xz. > > > > I do not mean to express any doubts whatsoever on Yu Zhang's integrity! > > I'm merely reminding y'all what's at stake. > > I think we shouldn't overly obsess[1] about 'xz', because the overwhealmingly > common scenario is that volunteer maintainers are honest people. QEMU is > in a massively better peer review situation. With xz there was basically no > oversight of the new maintainer. With QEMU, we have oversight from 1000's > of people on the list, a huge pool of general maintainers, the specific > migration maintainers, and the release manager merging code. > > With a lack of historical experiance with QEMU maintainership, I'd suggest > that new RDMA volunteers would start by adding themselves to the "MAINTAINERS" > file with only the 'Reviewer' classification. The main migration maintainers > would still handle pull requests, but wait for a R-b from one of the RMDA > volunteers. After some period of time the RDMA folks could graduate to full > maintainer status if the migration maintainers needed to reduce their load. > I suspect that might prove unneccesary though, given RDMA isn't an area of > code with a high turnover of patches. Right, and we can do that as a start, it also follows our normal rules of starting from Reviewers to maintain something. I even considered Zhijian to be the previous rdma goto guy / maintainer no matter what role he used to have in the MAINTAINERS file. Here IMHO it's more about whether any company would like to stand up and provide help, without yet binding that to be able to send pull requests in the near future or even longer term. 
What I worry more about is whether keeping rdma in qemu is really what we want, and that's also why I was trying to request some serious performance measurements comparing rdma v.s. nics. And here when I said "we" I mean both the QEMU community and any company that will support keeping rdma around. The problem is, if NICs are now fast enough to perform at least equally against rdma, and if they have a lower cost of overall maintenance, does it mean that rdma migration will only be used by whoever already has it in their products and wants to keep it? In that case we should simply ask new users to stick with tcp, and rdma users should only decrease, not increase. It seems also destined that most new migration features will not support rdma: see how many old features we drop in migration now (which rdma _might_ still leverage, but maybe not), and how much of what we add is mostly multifd-relevant, which will probably not apply to rdma at all. So in general what I am worrying about is a lose-lose condition, where a company might find it easier to either stick with an old qemu (depending on whether other new features are requested besides RDMA alone), or do periodic rebases with RDMA downstream only. So even if we want to keep RDMA around, I hope with this chance we can at least have a clear picture of when we should still suggest any new user to use RDMA (with the reasons behind it). Or we simply shouldn't suggest any new user to use RDMA at all (because at least it'll lose many new migration features). Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote: > Hi All (and Peter), Hi, Michael, > > My name is Michael Galaxy (formerly Hines). Yes, I changed my last name > (highly irregular for a male) and yes, that's my real last name: > https://www.linkedin.com/in/mrgalaxy/) > > I'm the original author of the RDMA implementation. I've been discussing > with Yu Zhang for a little bit about potentially handing over maintainership > of the codebase to his team. > > I simply have zero access to RoCE or Infiniband hardware at all, > unfortunately. so I've never been able to run tests or use what I wrote at > work, and as all of you know, if you don't have a way to test something, > then you can't maintain it. > > Yu Zhang put a (very kind) proposal forward to me to ask the community if > they feel comfortable training his team to maintain the codebase (and run > tests) while they learn about it. The "while learning" part is fine at least to me. IMHO the "ownership" to the code, or say, taking over the responsibility, may or may not need 100% mastering the code base first. There should still be some fundamental confidence to work on the code though as a starting point, then it's about serious use case to back this up, and careful testings while getting more familiar with it. > > If you don't mind, I'd like to let him send over his (very detailed) > proposal, Yes please, it's exactly the time to share the plan. The hope is we try to reach a consensus before or around the middle of this release (9.1). Normally QEMU has a 3~4 months window for each release and 9.1 schedule is not yet out, but I think it means we make a decision before or around middle of June. Thanks, > > - Michael > > On 4/11/24 11:36, Yu Zhang wrote: > > > 1) Either a CI test covering at least the major RDMA paths, or at least > > > periodically tests for each QEMU release will be needed. > > We use a batch of regression test cases for the stack, which covers the > > test for QEMU. I did such test for most of the QEMU releases planned as > > candidates for rollout. > > > > The migration test needs a pair of (either physical or virtual) servers with > > InfiniBand network, which makes it difficult to do on a single server. The > > nested VM could be a possible approach, for which we may need virtual > > InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you > > know. > > > > [1] > > https://urldefense.com/v3/__https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWvGHE3ig$ > > > > Thanks and best regards! > > > > On Thu, Apr 11, 2024 at 4:20 PM Peter Xu wrote: > > > On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote: > > > > On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via > > > > wrote: > > > > > > > > > > on 4/10/2024 3:46 AM, Peter Xu wrote: > > > > > > > > > > > > Is there document/link about the unittest/CI for migration tests, > > > > > > > Why > > > > > > > are those tests missing? > > > > > > > Is it hard or very special to set up an environment for that? > > > > > > > maybe we > > > > > > > can help in this regards. > > > > > > See tests/qtest/migration-test.c. We put most of our migration > > > > > > tests > > > > > > there and that's covered in CI. > > > > > > > > > > > > I think one major issue is CI systems don't normally have rdma > > > > > > devices. > > > > > > Can rdma migration test be carried out without a real hardware? > > > > > Yeah, RXE aka. 
SOFT-RoCE is able to emulate the RDMA, for example > > > > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0 # on host > > > > > then we can get a new RDMA interface "rxe_eth0". > > > > > This new RDMA interface is able to do the QEMU RDMA migration. > > > > > > > > > > Also, the loopback(lo) device is able to emulate the RDMA interface > > > > > "rxe_lo", however when > > > > > I tried(years ago) to do RDMA migration over this > > > > > interface(rdma:127.0.0.1:) , it got something wrong. > > > > > So i gave up enabling the RDMA migration qtest at that time. > > > > Thanks, Zhijian. > > > > > > >
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
Yu, On Thu, Apr 11, 2024 at 06:36:54PM +0200, Yu Zhang wrote: > > 1) Either a CI test covering at least the major RDMA paths, or at least > > periodically tests for each QEMU release will be needed. > We use a batch of regression test cases for the stack, which covers the > test for QEMU. I did such test for most of the QEMU releases planned as > candidates for rollout. The least I can think of is a few tests per release. Definitely too little if one release can already break.. > > The migration test needs a pair of (either physical or virtual) servers with > InfiniBand network, which makes it difficult to do on a single server. The > nested VM could be a possible approach, for which we may need virtual > InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know. > > [1] https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce Does it require a kernel driver? The fewer host kernel / hardware / .. dependencies the better. I am wondering whether there can be a library doing everything in userspace, translating RDMA into e.g. socket messages (so maybe ultimately that's something like IP->rdma->IP.. just to cover the "rdma" procedures), so that it'll work for CI reliably. Please also see my full list, though, especially entry 4). Thanks already for looking for solutions on the tests, but I don't want to waste your time only to then find that the tests are not enough even when ready. I think we need people who understand this stuff well enough, have dedicated time, and will look after it. Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote: > On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via wrote: > > > > > > on 4/10/2024 3:46 AM, Peter Xu wrote: > > > > >> Is there document/link about the unittest/CI for migration tests, Why > > >> are those tests missing? > > >> Is it hard or very special to set up an environment for that? maybe we > > >> can help in this regards. > > > See tests/qtest/migration-test.c. We put most of our migration tests > > > there and that's covered in CI. > > > > > > I think one major issue is CI systems don't normally have rdma devices. > > > Can rdma migration test be carried out without a real hardware? > > > > Yeah, RXE aka. SOFT-RoCE is able to emulate the RDMA, for example > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0 # on host > > then we can get a new RDMA interface "rxe_eth0". > > This new RDMA interface is able to do the QEMU RDMA migration. > > > > Also, the loopback(lo) device is able to emulate the RDMA interface > > "rxe_lo", however when > > I tried(years ago) to do RDMA migration over this > > interface(rdma:127.0.0.1:) , it got something wrong. > > So i gave up enabling the RDMA migration qtest at that time. > > Thanks, Zhijian. > > I'm not sure adding an emu-link for rdma is doable for CI systems, though. > Maybe someone more familiar with how CI works can chim in. Some people got dropped from the cc list for unknown reasons; I'm adding them back (Fabiano, Peter Maydell, Phil). Let's make sure nobody is dropped by accident. I'll try to summarize what is still missing, and I think these will be greatly helpful if we don't want to deprecate rdma migration:

1) Either a CI test covering at least the major RDMA paths, or at least periodic tests for each QEMU release, will be needed.

2) Some performance tests between modern RDMA and NIC devices are welcomed. The current knowledge is that modern NICs can work similarly to RDMA in performance, so it's debatable why we still maintain so much rdma-specific code.

3) No need for solid patchsets for this one, but some plan to improve the RDMA migration code so that it is not almost isolated from the rest of the protocols.

4) Someone to look after this code for real.

For 2) and 3) more info is here: https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n Here 4) can be the most important as Markus pointed out. We just didn't get there yet in the discussions, but maybe Markus is right that we should talk about that first. Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via wrote: > > > on 4/10/2024 3:46 AM, Peter Xu wrote: > > >> Is there document/link about the unittest/CI for migration tests, Why > >> are those tests missing? > >> Is it hard or very special to set up an environment for that? maybe we > >> can help in this regards. > > See tests/qtest/migration-test.c. We put most of our migration tests > > there and that's covered in CI. > > > > I think one major issue is CI systems don't normally have rdma devices. > > Can rdma migration test be carried out without a real hardware? > > Yeah, RXE aka. SOFT-RoCE is able to emulate the RDMA, for example > $ sudo rdma link add rxe_eth0 type rxe netdev eth0 # on host > then we can get a new RDMA interface "rxe_eth0". > This new RDMA interface is able to do the QEMU RDMA migration. > > Also, the loopback(lo) device is able to emulate the RDMA interface > "rxe_lo", however when > I tried(years ago) to do RDMA migration over this > interface(rdma:127.0.0.1:) , it got something wrong. > So i gave up enabling the RDMA migration qtest at that time. Thanks, Zhijian. I'm not sure adding an emu-link for rdma is doable for CI systems, though. Maybe someone more familiar with how CI works can chim in. -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Tue, Apr 09, 2024 at 09:32:46AM +0200, Jinpu Wang wrote: > Hi Peter, > > On Mon, Apr 8, 2024 at 6:18 PM Peter Xu wrote: > > > > On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote: > > > Hi Peter, > > > > Jinpu, > > > > Thanks for joining the discussion. > > > > > > > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu wrote: > > > > > > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote: > > > > > Hello Peter und Zhjian, > > > > > > > > > > Thank you so much for letting me know about this. I'm also a bit > > > > > surprised at > > > > > the plan for deprecating the RDMA migration subsystem. > > > > > > > > It's not too late, since it looks like we do have users not yet notified > > > > from this, we'll redo the deprecation procedure even if it'll be the > > > > final > > > > plan, and it'll be 2 releases after this. > > > > > > > > > > > > > > > IMHO it's more important to know whether there are still users and > > > > > > whether > > > > > > they would still like to see it around. > > > > > > > > > > > I admit RDMA migration was lack of testing(unit/CI test), which led > > > > > > to the a few > > > > > > obvious bugs being noticed too late. > > > > > > > > > > Yes, we are a user of this subsystem. I was unaware of the lack of > > > > > test coverage > > > > > for this part. As soon as 8.2 was released, I saw that many of the > > > > > migration test > > > > > cases failed and came to realize that there might be a bug between 8.1 > > > > > and 8.2, but > > > > > was unable to confirm and report it quickly to you. > > > > > > > > > > The maintenance of this part could be too costly or difficult from > > > > > your point of view. > > > > > > > > It may or may not be too costly, it's just that we need real users of > > > > RDMA > > > > taking some care of it. Having it broken easily for >1 releases > > > > definitely > > > > is a sign of lack of users. It is an implication to the community that > > > > we > > > > should consider dropping some features so that we can get the best use > > > > of > > > > the community resources for the things that may have a broader audience. > > > > > > > > One thing majorly missing is a RDMA tester to guard all the merges to > > > > not > > > > break RDMA paths, hopefully in CI. That should not rely on RDMA > > > > hardwares > > > > but just to sanity check the migration+rdma code running all fine. RDMA > > > > taught us the lesson so we're requesting CI coverage for all other new > > > > features that will be merged at least for migration subsystem, so that > > > > we > > > > plan to not merge anything that is not covered by CI unless extremely > > > > necessary in the future. > > > > > > > > For sure CI is not the only missing part, but I'd say we should start > > > > with > > > > it, then someone should also take care of the code even if only in > > > > maintenance mode (no new feature to add on top). > > > > > > > > > > > > > > My concern is, this plan will forces a few QEMU users (not sure how > > > > > many) like us > > > > > either to stick to the RDMA migration by using an increasingly older > > > > > version of QEMU, > > > > > or to abandon the currently used RDMA migration. > > > > > > > > RDMA doesn't get new features anyway, if there's specific use case for > > > > RDMA > > > > migrations, would it work if such a scenario uses the old binary? Is it > > > > possible to switch to the TCP protocol with some good NICs? 
> > > We have used rdma migration with HCA from Nvidia for years, our > > > experience is RDMA migration works better than tcp (over ipoib). > > > > Please bare with me, as I know little on rdma stuff. > > > > I'm actually pretty confused (and since a long time ago..) on why we need > > to operation with rdma contexts when ipoib seems to provide all the tcp > > layers. I meant, can it work with
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote: > Hi Peter, Jinpu, Thanks for joining the discussion. > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu wrote: > > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote: > > > Hello Peter und Zhjian, > > > > > > Thank you so much for letting me know about this. I'm also a bit > > > surprised at > > > the plan for deprecating the RDMA migration subsystem. > > > > It's not too late, since it looks like we do have users not yet notified > > from this, we'll redo the deprecation procedure even if it'll be the final > > plan, and it'll be 2 releases after this. > > > > > > > > > IMHO it's more important to know whether there are still users and > > > > whether > > > > they would still like to see it around. > > > > > > > I admit RDMA migration was lack of testing(unit/CI test), which led to > > > > the a few > > > > obvious bugs being noticed too late. > > > > > > Yes, we are a user of this subsystem. I was unaware of the lack of test > > > coverage > > > for this part. As soon as 8.2 was released, I saw that many of the > > > migration test > > > cases failed and came to realize that there might be a bug between 8.1 > > > and 8.2, but > > > was unable to confirm and report it quickly to you. > > > > > > The maintenance of this part could be too costly or difficult from > > > your point of view. > > > > It may or may not be too costly, it's just that we need real users of RDMA > > taking some care of it. Having it broken easily for >1 releases definitely > > is a sign of lack of users. It is an implication to the community that we > > should consider dropping some features so that we can get the best use of > > the community resources for the things that may have a broader audience. > > > > One thing majorly missing is a RDMA tester to guard all the merges to not > > break RDMA paths, hopefully in CI. That should not rely on RDMA hardwares > > but just to sanity check the migration+rdma code running all fine. RDMA > > taught us the lesson so we're requesting CI coverage for all other new > > features that will be merged at least for migration subsystem, so that we > > plan to not merge anything that is not covered by CI unless extremely > > necessary in the future. > > > > For sure CI is not the only missing part, but I'd say we should start with > > it, then someone should also take care of the code even if only in > > maintenance mode (no new feature to add on top). > > > > > > > > My concern is, this plan will forces a few QEMU users (not sure how > > > many) like us > > > either to stick to the RDMA migration by using an increasingly older > > > version of QEMU, > > > or to abandon the currently used RDMA migration. > > > > RDMA doesn't get new features anyway, if there's specific use case for RDMA > > migrations, would it work if such a scenario uses the old binary? Is it > > possible to switch to the TCP protocol with some good NICs? > We have used rdma migration with HCA from Nvidia for years, our > experience is RDMA migration works better than tcp (over ipoib). Please bare with me, as I know little on rdma stuff. I'm actually pretty confused (and since a long time ago..) on why we need to operation with rdma contexts when ipoib seems to provide all the tcp layers. I meant, can it work with the current "tcp:" protocol with ipoib even if there's rdma/ib hardwares underneath? Is it because of performance improvements so that we must use a separate path comparing to generic "tcp:" protocol here? 
> > Switching back to TCP will lead us to the old problems which was > solved by RDMA migration. Can you elaborate on the problems, and why tcp won't work in this case? They may not be directly relevant to the issue we're discussing, but I'm happy to learn more. What are the NICs you were testing with before? Was the test carried out with modern ones (50Gbps-200Gbps NICs), or was it done when such hardware was not common? Per my recent knowledge of the new Intel hardware, at least the parts that support QPL, it's easy to achieve single-core 50Gbps+. https://lore.kernel.org/r/ph7pr11mb5941a91ac1e514bcc32896a6a3...@ph7pr11mb5941.namprd11.prod.outlook.com Quote from Yuan: Yes, I use iperf3 to check the bandwidth for one core, the bandwidth is 60Gbps. [ ID] Interval Transfer Bitrate Retr Cwnd [ 5]
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote: > Hello Peter und Zhjian, > > Thank you so much for letting me know about this. I'm also a bit surprised at > the plan for deprecating the RDMA migration subsystem. It's not too late, since it looks like we do have users not yet notified from this, we'll redo the deprecation procedure even if it'll be the final plan, and it'll be 2 releases after this. > > > IMHO it's more important to know whether there are still users and whether > > they would still like to see it around. > > > I admit RDMA migration was lack of testing(unit/CI test), which led to the > > a few > > obvious bugs being noticed too late. > > Yes, we are a user of this subsystem. I was unaware of the lack of test > coverage > for this part. As soon as 8.2 was released, I saw that many of the > migration test > cases failed and came to realize that there might be a bug between 8.1 > and 8.2, but > was unable to confirm and report it quickly to you. > > The maintenance of this part could be too costly or difficult from > your point of view. It may or may not be too costly, it's just that we need real users of RDMA taking some care of it. Having it broken easily for >1 releases definitely is a sign of lack of users. It is an implication to the community that we should consider dropping some features so that we can get the best use of the community resources for the things that may have a broader audience. One thing majorly missing is a RDMA tester to guard all the merges to not break RDMA paths, hopefully in CI. That should not rely on RDMA hardwares but just to sanity check the migration+rdma code running all fine. RDMA taught us the lesson so we're requesting CI coverage for all other new features that will be merged at least for migration subsystem, so that we plan to not merge anything that is not covered by CI unless extremely necessary in the future. For sure CI is not the only missing part, but I'd say we should start with it, then someone should also take care of the code even if only in maintenance mode (no new feature to add on top). > > My concern is, this plan will forces a few QEMU users (not sure how > many) like us > either to stick to the RDMA migration by using an increasingly older > version of QEMU, > or to abandon the currently used RDMA migration. RDMA doesn't get new features anyway, if there's specific use case for RDMA migrations, would it work if such scenario uses the old binary? Is it possible to switch to the TCP protocol with some good NICs? Per our best knowledge, RDMA users are rare, and please let anyone know if you are aware of such users. IIUC the major reason why RDMA stopped being the trend is because the network is not like ten years ago; I don't think I have good knowledge in RDMA at all nor network, but my understanding is it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make little sense to maintain multiple protocols, considering RDMA migration code is so special so that it has the most custom code comparing to other protocols. Thanks, -- Peter Xu
Re: [PATCH for-9.1] migration: Add Error** argument to add_bitmaps_to_list()
On Fri, Mar 29, 2024 at 11:56:27AM +0100, Cédric Le Goater wrote: > This allows to report more precise errors in the migration handler > dirty_bitmap_save_setup(). > > Suggested-by Vladimir Sementsov-Ogievskiy > Signed-off-by: Cédric Le Goater > --- > > To apply on top of : > https://lore.kernel.org/qemu-devel/20240320064911.545001-1-...@redhat.com/ queued, thanks. -- Peter Xu
Re: [PATCH 11/19] migration/block: fix -Werror=maybe-uninitialized false-positive
On Thu, Mar 28, 2024 at 02:20:44PM +0400, marcandre.lur...@redhat.com wrote: > From: Marc-André Lureau > > ../migration/block.c:966:16: error: ‘ret’ may be used uninitialized > [-Werror=maybe-uninitialized] > > Given that "cluster_size" must be <= BLK_MIG_BLOCK_SIZE, the previous > loop is entered at least once, so 'ret' is assigned a value in all conditions. > > Signed-off-by: Marc-André Lureau Acked-by: Peter Xu -- Peter Xu
Re: [PATCH 15/19] migration: fix -Werror=maybe-uninitialized false-positive
On Thu, Mar 28, 2024 at 02:20:48PM +0400, marcandre.lur...@redhat.com wrote: > From: Marc-André Lureau > > ../migration/ram.c:1873:23: error: ‘dirty’ may be used uninitialized > [-Werror=maybe-uninitialized] > > When 'block' != NULL, 'dirty' is initialized. > > Signed-off-by: Marc-André Lureau Acked-by: Peter Xu -- Peter Xu
Re: [PATCH 12/19] migration: fix -Werror=maybe-uninitialized false-positives
On Thu, Mar 28, 2024 at 02:20:45PM +0400, marcandre.lur...@redhat.com wrote: > From: Marc-André Lureau > > ../migration/dirtyrate.c:186:5: error: ‘records’ may be used uninitialized > [-Werror=maybe-uninitialized] > ../migration/dirtyrate.c:168:12: error: ‘gen_id’ may be used uninitialized > [-Werror=maybe-uninitialized] > ../migration/migration.c:2273:5: error: ‘file’ may be used uninitialized > [-Werror=maybe-uninitialized] > > Signed-off-by: Marc-André Lureau Acked-by: Peter Xu -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Thu, Mar 28, 2024 at 04:22:27PM +0100, Thomas Huth wrote: > Since e9a54265f5 was not very clear about rdma migration code, should we > maybe rather add a separate deprecation note for the migration part, and add > a proper warning message to the migration code in case someone tries to use > it there, and then only remove the rdma migration code after two more > releases? Definitely a valid option to me. So far RDMA isn't covered in tests (actually the same goes for COLO, and I wonder about our position on COLO too in this case..), so unfortunately we don't even know when it'll break, just like before. From other activities that I can see when new code comes in, maintaining the RDMA code should be fairly manageable so far (and whoever writes new rdma code in those two releases will also need to take the maintainer's role). We did it for those years, and we can keep that for two more releases. Hopefully such warnings can ring a louder alarm for the current users, so that people can either stick with old binaries, or invest developer/test resources in the community. Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote: > Philippe Mathieu-Daudé writes: > > > The whole RDMA subsystem was deprecated in commit e9a54265f5 > > ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem") > > released in v8.2. > > > > Remove: > > - RDMA handling from migration > > - dependencies on libibumad, libibverbs and librdmacm > > > > Keep the RAM_SAVE_FLAG_HOOK definition since it might appears > > in old migration streams. > > > > Cc: Peter Xu > > Cc: Li Zhijian > > Acked-by: Fabiano Rosas > > Signed-off-by: Philippe Mathieu-Daudé > > Just to be clear, because people raised the point in the last version, > the first link in the deprecation commit links to a thread comprising > entirely of rdma migration patches. I don't see any ambiguity on whether > the deprecation was intended to include migration. There's even an ack > from Juan. Yes I remember that's the plan. > > So on the basis of not reverting the previous maintainer's decision, my > Ack stands here. > > We also had pretty obvious bugs ([1], [2]) in the past that would have > been caught if we had any kind of testing for the feature, so I can't > even say this thing works currently. > > @Peter Xu, @Li Zhijian, what are your thoughts on this? Generally I definitely agree with such a removal sooner or later, as that's how deprecation works, and even after Juan's left I'm not aware of any other new RDMA users. Personally, I'd slightly prefer postponing it one more release which might help a bit of our downstream maintenance, however I assume that's not a blocker either, as I think we can also manage it. IMHO it's more important to know whether there are still users and whether they would still like to see it around. That's also one thing I notice that e9a54265f533f didn't yet get acks from RDMA users that we are aware, even if they're rare. According to [2] it could be that such user may only rely on the release versions of QEMU when it broke things. So I'm copying Yu too (while Zhijian is already in the loop), just in case someone would like to stand up and speak. Thanks, > > 1- https://lore.kernel.org/r/20230920090412.726725-1-lizhij...@fujitsu.com > 2- > https://lore.kernel.org/r/cahecvy7hxswn4ow_kog+q+tn6f_kmeichevz1qgm-fbxbpp...@mail.gmail.com > -- Peter Xu
Re: [PATCH for 9.0] migration: Skip only empty block devices
On Tue, Mar 12, 2024 at 05:34:26PM -0400, Stefan Hajnoczi wrote: > I understand now. I missed that returning from init_blk_migration_it() > did not abort iteration. Thank you! I queued it, thanks both! -- Peter Xu
Re: [PATCH for 9.0] migration: Skip only empty block devices
On Tue, Mar 12, 2024 at 01:04:31PM +0100, Cédric Le Goater wrote: > The block .save_setup() handler calls a helper routine > init_blk_migration() which builds a list of block devices to take into > account for migration. When one device is found to be empty (sectors > == 0), the loop exits and all the remaining devices are ignored. This > is a regression introduced when bdrv_iterate() was removed. > > Change that by skipping only empty devices. > > Cc: Markus Armbruster > Suggested: Kevin Wolf This should be: Suggested-by: Kevin Wolf I think the missed "by" caused Kevin not in the cc list, I added Kevin in. I'll hold a bit for Kevin to ACK, no repost needed just for above; I can fix it. Thanks, > Fixes: fea68bb6e9fa ("block: Eliminate bdrv_iterate(), use bdrv_next()") > Signed-off-by: Cédric Le Goater > --- > migration/block.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/migration/block.c b/migration/block.c > index > 8c6ebafacc1ffe930d1d4f19d968817b14852c69..2b9054889ad2ba739828594c50cf047703757e96 > 100644 > --- a/migration/block.c > +++ b/migration/block.c > @@ -402,7 +402,10 @@ static int init_blk_migration(QEMUFile *f) > } > > sectors = bdrv_nb_sectors(bs); > -if (sectors <= 0) { > +if (sectors == 0) { > +continue; > +} > +if (sectors < 0) { > ret = sectors; > bdrv_next_cleanup(&it); > goto out; > -- > 2.44.0 > > -- Peter Xu
Re: [PATCH 00/15] qapi: Require member documentation (with loophole)
On Mon, Feb 05, 2024 at 08:46:54AM +0100, Markus Armbruster wrote: > qapi/migration.json > MigrateSetParameters 1 It's tls-authz. I'll send a patch for this one. Thanks, -- Peter Xu
Re: [PATCH v2 2/2] hw/pflash: implement update buffer for block writes
that this is overlooked in the commit a19cbfb346 ("spice: add qxl device") where VMSTATE_VBUFFER_UINT32 is introduced. I don't have a good way to trivially fix it because we don't remember that type of size in vmsd. However a simple solution could be that we convert all existing VMSTATE_VBUFFER() users (potentially, VMSTATE_PARTIAL_VBUFFER users too; there seems to be only 1 caller..) to always use uint32_t. I don't think it's wise to try using a signed type for the size anyway, and it should be a compatible change as we doubled the size. I'll hold a bit to see whether there's any comment, then I can try to post a patch. > + * that is in some other struct field, but it's a runtime constant and > + * we can assume the memory has already been allocated. > +*/ > > #define VMSTATE_VBUFFER_UINT32(_field, _state, _version, _test, > _field_size) { \ > @@ -688,2 +698,9 @@ extern const VMStateInfo vmstate_info_qlist; > > +/** > + * VMSTATE_VBUFFER_ALLOC_UINT32: > + * > + * We need to migrate an array of uint32_t of variable size dependent > + * on the inbound migration data, and so the migration code must > + * allocate it. > +*/ > #define VMSTATE_VBUFFER_ALLOC_UINT32(_field, _state, _version, \ > --- > -- Peter Xu
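To make the suggested conversion concrete, a minimal sketch (not a posted patch, and assuming the size lookup in the VMSTATE_VBUFFER() macro is currently typed as int32_t via vmstate_offset_value()): the only change in the macro itself would be the type used for that lookup, with the corresponding device-struct size fields switched to uint32_t at the same time:

-    .size_offset  = vmstate_offset_value(_state, _field_size, int32_t),  \
+    .size_offset  = vmstate_offset_value(_state, _field_size, uint32_t), \

VMSTATE_PARTIAL_VBUFFER and its single caller would get the same treatment.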
Re: [PATCH 1/6] system/cpus: rename qemu_mutex_lock_iothread() to qemu_bql_lock()
On Thu, Nov 30, 2023 at 03:43:25PM -0500, Stefan Hajnoczi wrote: > On Thu, Nov 30, 2023 at 03:08:49PM -0500, Peter Xu wrote: > > On Wed, Nov 29, 2023 at 04:26:20PM -0500, Stefan Hajnoczi wrote: > > > The Big QEMU Lock (BQL) has many names and they are confusing. The > > > actual QemuMutex variable is called qemu_global_mutex but it's commonly > > > referred to as the BQL in discussions and some code comments. The > > > locking APIs, however, are called qemu_mutex_lock_iothread() and > > > qemu_mutex_unlock_iothread(). > > > > > > The "iothread" name is historic and comes from when the main thread was > > > split into into KVM vcpu threads and the "iothread" (now called the main > > > loop thread). I have contributed to the confusion myself by introducing > > > a separate --object iothread, a separate concept unrelated to the BQL. > > > > > > The "iothread" name is no longer appropriate for the BQL. Rename the > > > locking APIs to: > > > - void qemu_bql_lock(void) > > > - void qemu_bql_unlock(void) > > > - bool qemu_bql_locked(void) > > > > > > There are more APIs with "iothread" in their names. Subsequent patches > > > will rename them. There are also comments and documentation that will be > > > updated in later patches. > > > > > > Signed-off-by: Stefan Hajnoczi > > > > Acked-by: Peter Xu > > > > Two nickpicks: > > > > - BQL contains "QEMU" as the 2nd character, so maybe easier to further > > rename qemu_bql into bql_? > > Philippe wondered whether the variable name should end with _mutex (or > _lock is common too), so an alternative might be big_qemu_lock. That's IMHO mutex isn't important in this context, but an implementation detail of the "lock" as an abstract concept. For example, we won't need to rename it again then if the impl changes, e.g. using pure futex or a rwlock replacement. When that happens we don't need to change all call sites again. (never really meant to change the lock impl, just an example.. :) KVM actually has that example of KVM_MMU_LOCK() macro taking as the rwlock write lock when the spinlock is replaced with rwlock, while it'll keep to be the spinlock "lock()" when !KVM_HAVE_MMU_RWLOCK. > imperfect because it doesn't start with the usual qemu_ prefix. > qemu_big_lock is better in that regard but inconsistent with our BQL > abbreviation. > > I don't like putting an underscore at the end. It's unusual and would > make me wonder what that means. Ah, I meant replacing the "qemu_bql_" prefix with "bql_", as that contains QEMU already, rather than making "_" at the end. So they'll be bql_lock(), bql_unlock(), bql_locked(). > > Naming is hard, but please discuss and I'm open to change to BQL > variable's name to whatever we all agree on. I'm pretty okay with qemu_bql_lock(), etc. too. I prefer a tiny little bit on bql_ over qemu_bql_ in this regard, but frankly they're all names good enough to me. The "qemu_" prefix can still be a good thing saying "this is a qemu global function", even if contained inside "bql" itself. > > > > > - Could we keep the full spell of BQL at some places, so people can still > > reference it if not familiar? IIUC most of the BQL helpers will root > > back to the major three functions (_lock, _unlock, _locked), perhaps > > add a comment of "BQL stands for..." over these three functions as > > comment? > > Yes, I'll update the doc comments to say "Big QEMU Lock (BQL)" for each > of these functions. Thanks! -- Peter Xu
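For readers who haven't seen the KVM trick referenced above, it is roughly the following shape (simplified; treat it as a sketch rather than an exact copy of the Linux code): call sites only ever use the macro, so swapping the lock type underneath does not touch them.

#ifdef KVM_HAVE_MMU_RWLOCK
#define KVM_MMU_LOCK(kvm)    write_lock(&(kvm)->mmu_lock)
#define KVM_MMU_UNLOCK(kvm)  write_unlock(&(kvm)->mmu_lock)
#else
#define KVM_MMU_LOCK(kvm)    spin_lock(&(kvm)->mmu_lock)
#define KVM_MMU_UNLOCK(kvm)  spin_unlock(&(kvm)->mmu_lock)
#endif

The bql_lock()/bql_unlock()/bql_locked() naming leaves the same room: if the mutex behind the BQL were ever replaced, only the three definitions would change, not the call sites.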
Re: [PATCH 1/6] system/cpus: rename qemu_mutex_lock_iothread() to qemu_bql_lock()
On Wed, Nov 29, 2023 at 04:26:20PM -0500, Stefan Hajnoczi wrote: > The Big QEMU Lock (BQL) has many names and they are confusing. The > actual QemuMutex variable is called qemu_global_mutex but it's commonly > referred to as the BQL in discussions and some code comments. The > locking APIs, however, are called qemu_mutex_lock_iothread() and > qemu_mutex_unlock_iothread(). > > The "iothread" name is historic and comes from when the main thread was > split into into KVM vcpu threads and the "iothread" (now called the main > loop thread). I have contributed to the confusion myself by introducing > a separate --object iothread, a separate concept unrelated to the BQL. > > The "iothread" name is no longer appropriate for the BQL. Rename the > locking APIs to: > - void qemu_bql_lock(void) > - void qemu_bql_unlock(void) > - bool qemu_bql_locked(void) > > There are more APIs with "iothread" in their names. Subsequent patches > will rename them. There are also comments and documentation that will be > updated in later patches. > > Signed-off-by: Stefan Hajnoczi Acked-by: Peter Xu Two nitpicks: - BQL contains "QEMU" as the 2nd character, so maybe easier to further rename qemu_bql into bql_? - Could we keep the full spelling of BQL in some places, so people can still reference it if not familiar? IIUC most of the BQL helpers will root back to the three major functions (_lock, _unlock, _locked), perhaps add a "BQL stands for..." comment over these three functions? Please take or ignore these nitpicks; my ACK stands regardless. Thanks, -- Peter Xu
Re: [PATCH V2 4/6] cpr: relax vhost migration blockers
Copy qemu-block, virtio On Wed, Oct 25, 2023 at 12:44:27PM -0700, Steve Sistare wrote: > vhost blocks migration if logging is not supported to track dirty > memory, and vhost-user blocks it if the log cannot be saved to a shm fd. > > vhost-vdpa blocks migration if both hosts do not support all the device's > features using a shadow VQ, for tracking requests and dirty memory. > > vhost-scsi blocks migration if storage cannot be shared across hosts, > or if state cannot be migrated. > > None of these conditions apply if the old and new qemu processes do > not run concurrently, and if new qemu starts on the same host as old, > which is the case for cpr. > > Narrow the scope of these blockers so they only apply to normal mode. > They will not block cpr modes when they are added in subsequent patches. > > No functional change until a new mode is added. > > Signed-off-by: Steve Sistare > Reviewed-by: Juan Quintela > --- > hw/scsi/vhost-scsi.c | 2 +- > hw/virtio/vhost.c| 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c > index 14e23cc..bf528d5 100644 > --- a/hw/scsi/vhost-scsi.c > +++ b/hw/scsi/vhost-scsi.c > @@ -208,7 +208,7 @@ static void vhost_scsi_realize(DeviceState *dev, Error > **errp) > "When external environment supports it (Orchestrator > migrates " > "target SCSI device state or use shared storage over > network), " > "set 'migratable' property to true to enable migration."); > -if (migrate_add_blocker(&vsc->migration_blocker, errp) < 0) { > +if (migrate_add_blocker_normal(&vsc->migration_blocker, errp) < 0) { > goto free_virtio; > } > } > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index d737671..f5e9625 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -1527,7 +1527,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque, > } > > if (hdev->migration_blocker != NULL) { > -r = migrate_add_blocker(&hdev->migration_blocker, errp); > +r = migrate_add_blocker_normal(&hdev->migration_blocker, errp); > if (r < 0) { > goto fail_busyloop; > } > -- > 1.8.3.1 > -- Peter Xu
Re: [PATCH V2 3/6] cpr: relax blockdev migration blockers
Copy qemu-block and maintainers. On Wed, Oct 25, 2023 at 12:44:26PM -0700, Steve Sistare wrote: > Some blockdevs block migration because they do not support sharing across > hosts and/or do not support dirty bitmaps. These prohibitions do not apply > if the old and new qemu processes do not run concurrently, and if new qemu > starts on the same host as old, which is the case for cpr. Narrow the scope > of these blockers so they only apply to normal mode. They will not block > cpr modes when they are added in subsequent patches. > > No functional change until a new mode is added. > > Signed-off-by: Steve Sistare > Reviewed-by: Juan Quintela > --- > block/parallels.c | 2 +- > block/qcow.c | 2 +- > block/vdi.c | 2 +- > block/vhdx.c | 2 +- > block/vmdk.c | 2 +- > block/vpc.c | 2 +- > block/vvfat.c | 2 +- > 7 files changed, 7 insertions(+), 7 deletions(-) > > diff --git a/block/parallels.c b/block/parallels.c > index 1697a2e..8a520db 100644 > --- a/block/parallels.c > +++ b/block/parallels.c > @@ -1369,7 +1369,7 @@ static int parallels_open(BlockDriverState *bs, QDict > *options, int flags, > bdrv_get_device_or_node_name(bs)); > bdrv_graph_rdunlock_main_loop(); > > -ret = migrate_add_blocker(&s->migration_blocker, errp); > +ret = migrate_add_blocker_normal(&s->migration_blocker, errp); > if (ret < 0) { > error_setg(errp, "Migration blocker error"); > goto fail; > diff --git a/block/qcow.c b/block/qcow.c > index fdd4c83..eab68e3 100644 > --- a/block/qcow.c > +++ b/block/qcow.c > @@ -307,7 +307,7 @@ static int qcow_open(BlockDriverState *bs, QDict > *options, int flags, > bdrv_get_device_or_node_name(bs)); > bdrv_graph_rdunlock_main_loop(); > > -ret = migrate_add_blocker(&s->migration_blocker, errp); > +ret = migrate_add_blocker_normal(&s->migration_blocker, errp); > if (ret < 0) { > goto fail; > } > diff --git a/block/vdi.c b/block/vdi.c > index fd7e365..c647d72 100644 > --- a/block/vdi.c > +++ b/block/vdi.c > @@ -498,7 +498,7 @@ static int vdi_open(BlockDriverState *bs, QDict *options, > int flags, > bdrv_get_device_or_node_name(bs)); > bdrv_graph_rdunlock_main_loop(); > > -ret = migrate_add_blocker(&s->migration_blocker, errp); > +ret = migrate_add_blocker_normal(&s->migration_blocker, errp); > if (ret < 0) { > goto fail_free_bmap; > } > diff --git a/block/vhdx.c b/block/vhdx.c > index e37f8c0..a9d0874 100644 > --- a/block/vhdx.c > +++ b/block/vhdx.c > @@ -1096,7 +1096,7 @@ static int vhdx_open(BlockDriverState *bs, QDict > *options, int flags, > error_setg(&s->migration_blocker, "The vhdx format used by node '%s' " > "does not support live migration", > bdrv_get_device_or_node_name(bs)); > -ret = migrate_add_blocker(&s->migration_blocker, errp); > +ret = migrate_add_blocker_normal(&s->migration_blocker, errp); > if (ret < 0) { > goto fail; > } > diff --git a/block/vmdk.c b/block/vmdk.c > index 1335d39..85864b8 100644 > --- a/block/vmdk.c > +++ b/block/vmdk.c > @@ -1386,7 +1386,7 @@ static int vmdk_open(BlockDriverState *bs, QDict > *options, int flags, > error_setg(&s->migration_blocker, "The vmdk format used by node '%s' " > "does not support live migration", > bdrv_get_device_or_node_name(bs)); > -ret = migrate_add_blocker(&s->migration_blocker, errp); > +ret = migrate_add_blocker_normal(&s->migration_blocker, errp); > if (ret < 0) { > goto fail; > } > diff --git a/block/vpc.c b/block/vpc.c > index c30cf86..aa1a48a 100644 > --- a/block/vpc.c > +++ b/block/vpc.c > @@ -452,7 +452,7 @@ static int vpc_open(BlockDriverState *bs, QDict *options, > int flags, > bdrv_get_device_or_node_name(bs)); > 
bdrv_graph_rdunlock_main_loop(); > > -ret = migrate_add_blocker(&s->migration_blocker, errp); > +ret = migrate_add_blocker_normal(&s->migration_blocker, errp); > if (ret < 0) { > goto fail; > } > diff --git a/block/vvfat.c b/block/vvfat.c > index 266e036..9d050ba 100644 > --- a/block/vvfat.c > +++ b/block/vvfat.c > @@ -1268,7 +1268,7 @@ static int vvfat_open(BlockDriverState *bs, QDict > *options, int flags, > "The vvfat (rw) format used by node '%s' " > "does not support live migration", > bdrv_get_device_or_node_name(bs)); > -ret = migrate_add_blocker(&s->migration_blocker, errp); > +ret = migrate_add_blocker_normal(&s->migration_blocker, errp); > if (ret < 0) { > goto fail; > } > -- > 1.8.3.1 > -- Peter Xu
Re: [PATCH v2 5/9] migration: check required subsections are loaded, once
On Tue, Oct 24, 2023 at 12:41:56PM +0200, Juan Quintela wrote: > > @@ -509,6 +538,13 @@ static int vmstate_subsection_load(QEMUFile *f, const > > VMStateDescription *vmsd, > > } > > } > > > > +for (i = 0; i < n; i++) { > > +if (!visited[i] && vmstate_section_needed(vmsd->subsections[i], > > opaque)) { > > +trace_vmstate_subsection_load_bad(vmsd->name, > > vmsd->subsections[i]->name, "(not visited)"); > > +return -ENOENT; > > +} > > +} > > + > > trace_vmstate_subsection_load_good(vmsd->name); > > return 0; > > } > > This part is the only one where I can see there could be some > discussion. So I wil wait to see what other people think. Previous email: https://lore.kernel.org/qemu-devel/ZR2P1RbxCfBdYBaQ@x1n/ I still think it is safer to not fail unless justified that we won't hit surprises in the ->needed(). There are a lot so I assume it's non-trivial to justify. We can switch the tracepoint into error_report() in that case, though, as long as it won't fail the migration. Thanks, -- Peter Xu
Re: [PATCH 0/5] RFC: migration: check required entries and sections are loaded
On Tue, Sep 26, 2023 at 07:59:20PM +0400, marcandre.lur...@redhat.com wrote: > Marc-André Lureau (5): > block/fdc: 'phase' is not needed on load > virtio: make endian_needed() work during loading > net/slirp: use different IDs for each instance The first 3 patches are bug fixes, am I right? It'll be great if they can be acked (or even picked up earlier?) by subsystem maintainers if so, and copy stable if properly justified. Thanks, -- Peter Xu
Re: [PATCH 4/5] RFC: migration: check required subsections are loaded, once
On Tue, Sep 26, 2023 at 07:59:24PM +0400, marcandre.lur...@redhat.com wrote: > @@ -484,6 +513,13 @@ static int vmstate_subsection_load(QEMUFile *f, const > VMStateDescription *vmsd, > } > } > > +for (i = 0; i < n; i++) { > +if (!visited[i] && vmstate_save_needed(vmsd->subsections[i], > opaque)) { > +trace_vmstate_subsection_load_bad(vmsd->name, > vmsd->subsections[i]->name, "(not visited)"); > +return -ENOENT; > +} > +} One thing that might be tricky about calling needed() on the loading side is that, IMHO, the needed() hooks were normally designed to be called only on a complete VM state. IOW, I think it can reference any machine/device state, or whatever variable, assuming all of them contain valid data. But the load side may not yet contain everything.. here we can guarantee we have loaded the full device state of this one, as subsections should be the last to come, but that is all we've loaded so far. But what if it references something else outside what we've loaded? It looks possible that some special .needed() hook could return something unexpected. I assume most needed() hooks are fine (and it does look like we can find bugs with this, which means this might prove useful already, at least in some form or another). I just worry that something may start to break once we become strict on this. Maybe.. make the check only throw warnings, but not yet fail the migration? > + > trace_vmstate_subsection_load_good(vmsd->name); > return 0; > } > -- > 2.41.0 > -- Peter Xu
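A minimal sketch of the warn-only variant suggested above, reusing the loop from the quoted hunk (error_report() instead of returning -ENOENT; whether the tracepoint is kept alongside it is left open):

    for (i = 0; i < n; i++) {
        if (!visited[i] && vmstate_save_needed(vmsd->subsections[i], opaque)) {
            /* report, but do not fail the migration */
            error_report("%s: subsection '%s' expected but not received",
                         vmsd->name, vmsd->subsections[i]->name);
        }
    }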
Re: [RFC PATCH v2 22/22] softmmu/physmem: Clean up local variable shadowing
On Mon, Sep 04, 2023 at 05:31:30PM +0100, Daniel P. Berrangé wrote: > On Mon, Sep 04, 2023 at 06:12:34PM +0200, Philippe Mathieu-Daudé wrote: > > Fix: > > > > softmmu/physmem.c: In function > > ‘cpu_physical_memory_snapshot_and_clear_dirty’: > > softmmu/physmem.c:916:27: warning: declaration of ‘offset’ shadows a > > parameter [-Wshadow=compatible-local] > > 916 | unsigned long offset = page % DIRTY_MEMORY_BLOCK_SIZE; > > | ^~ > > softmmu/physmem.c:892:31: note: shadowed declaration is here > > 892 | (MemoryRegion *mr, hwaddr offset, hwaddr length, unsigned > > client) > > |~~~^~ > > > > Signed-off-by: Philippe Mathieu-Daudé > > --- > > RFC: Please double-check how 'offset' is used few lines later. > > I don't see an issue - those lines are in an outer scope, so won't > be accessing the 'offset' you've changed, they'll be the parameter > instead. If you want to sanity check though, presumably the asm > dissassembly for this method should be the same before/after this > change (and if it didn't do so then it's a bug..) > > > --- > > softmmu/physmem.c | 10 +- > > 1 file changed, 5 insertions(+), 5 deletions(-) > > Reviewed-by: Daniel P. Berrangé Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v2 21/22] softmmu/memory: Clean up local variable shadowing
On Mon, Sep 04, 2023 at 06:12:33PM +0200, Philippe Mathieu-Daudé wrote: > Fix: > > softmmu/memory.c: In function ‘mtree_print_mr’: > softmmu/memory.c:3236:27: warning: declaration of ‘ml’ shadows a previous > local [-Wshadow=compatible-local] >3236 | MemoryRegionList *ml; > | ^~ > softmmu/memory.c:3213:32: note: shadowed declaration is here >3213 | MemoryRegionList *new_ml, *ml, *next_ml; > |^~ > > Signed-off-by: Philippe Mathieu-Daudé Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH 1/2] vmstate: Mark VMStateInfo.get/put() coroutine_mixed_fn
On Tue, Sep 05, 2023 at 04:50:01PM +0200, Kevin Wolf wrote: > Migration code can run both in coroutine context (the usual case) and > non-coroutine context (at least savevm/loadvm for snapshots). This also > affects the VMState callbacks, and devices must consider this. Change > the callback definition in VMStateInfo to be explicit about it. > > Signed-off-by: Kevin Wolf Acked-by: Peter Xu -- Peter Xu
Re: [PATCH 2/7] migration: Clean up local variable shadowing
On Thu, Aug 31, 2023 at 03:25:41PM +0200, Markus Armbruster wrote: > Local variables shadowing other local variables or parameters make the > code needlessly hard to understand. Tracked down with -Wshadow=local. > Clean up: delete inner declarations when they are actually redundant, > else rename variables. > > Signed-off-by: Markus Armbruster Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH 1/7] migration/rdma: Fix save_page method to fail on polling error
On Thu, Aug 31, 2023 at 03:25:40PM +0200, Markus Armbruster wrote: > qemu_rdma_save_page() reports polling error with error_report(), then > succeeds anyway. This is because the variable holding the polling > status *shadows* the variable the function returns. The latter > remains zero. > > Broken since day one, and duplicated more recently. > > Fixes: 2da776db4846 (rdma: core logic) > Fixes: b390afd8c50b (migration/rdma: Fix out of order wrid) > Signed-off-by: Markus Armbruster Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v2 5/5] migration: Deprecate old compression method
On Thu, Jun 22, 2023 at 09:50:19PM +0200, Juan Quintela wrote: > Signed-off-by: Juan Quintela Acked-by: Peter Xu -- Peter Xu
Re: [PATCH v2 1/5] migration: Use proper indentation for migration.json
On Thu, Jun 22, 2023 at 09:50:15PM +0200, Juan Quintela wrote: > We broke it with dirtyrate limit patches. > > Signed-off-by: Juan Quintela Acked-by: Peter Xu -- Peter Xu
Re: [RFC 4/6] migration: Deprecate -incoming
On Fri, Jun 23, 2023 at 09:23:18AM +0100, Daniel P. Berrangé wrote: > On Thu, Jun 22, 2023 at 11:54:43AM -0400, Peter Xu wrote: > > On Thu, Jun 22, 2023 at 10:59:58AM +0100, Daniel P. Berrangé wrote: > > > I've mentioned several times before that the user should never need to > > > set this multifd-channels parameter (nor many other parameters) on the > > > destination in the first place. > > > > > > The QEMU migration stream should be changed to add a full > > > bi-directional handshake, with negotiation of most parameters. > > > IOW, the src QEMU should be configured with 16 channels, and > > > it should connect the primary control channel, and then directly > > > tell the dest that it wants to use 16 multifd channels. > > > > > > If we're expecting the user to pass this info across to the dest > > > manually we've already spectacularly failed wrt user friendliness. > > > > I can try to move the todo even higher. Trying to list the initial goals > > here: > > > > - One extra phase of handshake between src/dst (maybe the time to boost > > QEMU_VM_FILE_VERSION) before anything else happens. > > > > - Dest shouldn't need to apply any cap/param, it should get all from src. > > Dest still need to be setup with an URI and that should be all it needs. > > There are a few that the dest will still need set explicitly. Specifically > the TLS parameters - tls-authz and tls-creds, because those are both > related to --object parameters configured on the dst QEMU. Potentially > there's an argument to be made for the TLS parameters to be part fo the > initial 'migrate' and 'migrate-incoming' command data, as they're > specifically related to the connection establishment, while (most) of > the other params are related to the migration protocol running inside > the connection. Ideally we could even move the tls setup to after the main connection is established, IOW the gnutls handshake could be part of the generic handshake. But yeah, I agree that may involve much more work, so we may start by assuming the v2 handshake just happens on the tls channel as built for now. I think the new protocol should allow extension, so when we want to move the tls handshake into it, the v2 protocol should be able to first detect src/dst binary support for that and switch to it if we want - then we could even get a src qemu migration failure which tells us "dest qemu forgot to set up tls credentials in cmdlines", or reports anything wrong on dest during tls setup. > > A few other parameters are also related to the connection establishment, > most notably the enablement multifd, postcopy and postcopy-pre-empt. As I mentioned in the list, I plan to make this part of the default v2, where the v2 handshake will take care of managing the connections rather than relying on the old code. I'm not sure how complicated it'll be, but the v2 protocol just sounds like a good fit for such a major change in how we set up the channels, and a chance to get everything right from the start. > > I think with those ones we don't need to set them on the src either. > With the new migration handshake we should probably use multifd > codepaths unconditionally, with a single channel. The v2 handshake will be beneficial to !multifd as well. Right now I tend to make it also work for !multifd, e.g., it always makes sense to do a device tree comparison before migration, even if someone uses special tunneling so multifd cannot be enabled for whatever reason, as long as a return path is available so they can talk.
> By matching with the introduction of new protocol, we have a nice point > against which to deprecate the old non-multifd codepaths. We'll need to > keep the non-multifd code around *alot* longer than the normal > deprecation cycle though, as we need mig to/from very old QEMUs. I actually had a feeling that we should always keep it.. I'm not sure whether we must combine a new handshake with "making multifd the default". I do think we can make the !multifd path very simple though, e.g., I'd think we should start considering deprecating things like !multifd+compression etc. earlier than that. > > The enablement of postcopy could be automatic too - src & dst can > both detect if their host OS supports it. That would make all > migrations post-copy capable. The mgmt app just needs to trigger > the switch to post-copy mode *if* they want to use it. Sounds doable. > > Likewise we can just always assume postcopy-pre-empt is available. > > I think 'return-path' becomes another one we can just assume too. Right, the handshake cap (or, with the new QAPI URI replacement) should imply return path already. Thanks, -- Peter Xu
Re: [RFC 4/6] migration: Deprecate -incoming
On Fri, Jun 23, 2023 at 08:17:46AM +0100, Daniel P. Berrangé wrote: > On Thu, Jun 22, 2023 at 03:20:01PM -0400, Peter Xu wrote: > > On Thu, Jun 22, 2023 at 05:33:29PM +0100, Daniel P. Berrangé wrote: > > > On Thu, Jun 22, 2023 at 11:54:43AM -0400, Peter Xu wrote: > > > > I can try to move the todo even higher. Trying to list the initial > > > > goals > > > > here: > > > > > > > > - One extra phase of handshake between src/dst (maybe the time to boost > > > > QEMU_VM_FILE_VERSION) before anything else happens. > > > > > > > > - Dest shouldn't need to apply any cap/param, it should get all from > > > > src. > > > > Dest still need to be setup with an URI and that should be all it > > > > needs. > > > > > > > > - Src shouldn't need to worry on the binary version of dst anymore as > > > > long > > > > as dest qemu supports handshake, because src can fetch it from dest. > > > > > > I'm not sure that works in general. Even if we have a handshake and > > > bi-directional comms for live migration, we still haave the save/restore > > > to file codepath to deal with. The dst QEMU doesn't exist at the time > > > the save process is done, so we can't add logic to VMSate handling that > > > assumes knowledge of the dst version at time of serialization. > > > > My current thought was still based on a new cap or anything the user would > > need to specify first on both sides (but hopefully the last cap to set on > > dest). > > > > E.g. if with a new handshake cap we shouldn't set it on a exec: or file: > > protocol migration, and it should just fail on qmp_migrate() telling that > > the URI is not supported if the cap is set. Return path is definitely > > required here. > > exec can support bi-directional migration - we have both stdin + stdout > for the command. For exec it is mostly a documentation problem - you > can't merely use 'cat' for example, but if you used 'socat' that could > be made to work bi-directionally. Okay. Just an example that the handshake just cannot work for all the cases, and it should always be able to fail. So when exec doesn't properly provide return path, I think with handshake=on we should get a timeout of not getting response properly and fail the migration after the timeout, then. There're a bunch of implications and details that need to be investigated around such a handshake if it'll be proposed for real, so I'm not yet sure whether there's something that may be surprising. For channeltypes it seems all fine for now. Hopefully nothing obvious overlooked. > > > > I don't think its possible for QEMU to validate that it has a fully > > > bi-directional channel, without adding timeouts to its detection which I > > > think we should strive to avoid. > > > > > > I don't think we actually need self-bootstrapping anyway. > > > > > > I think the mgmt app can just indicate the new v2 bi-directional > > > protocol when issuing the 'migrate' and 'migrate-incoming' > > > commands. This becomes trivial when Het's refactoring of the > > > migrate address QAPI is accepted: > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg04851.html > > > > > > eg: > > > > > > { "execute": "migrate", > > > "arguments": { > > > "channels": [ { "channeltype": "main", > > > "addr": { "transport": "socket", "type": "inet", > > > "host": "10.12.34.9", > > > "port": "1050" } } ] } } > > > > > > note the 'channeltype' parameter here. 
If we declare the 'main' > > > refers to the existing migration protocol, then we merely need > > > to define a new 'channeltype' to use as an indicator for the > > > v2 migration handshake protocol. > > > > Using a new channeltype would also work at least on src qemu, but I'm not > > sure on how dest qemu would know that it needs a handshake in that case, > > because it knows nothing until the connection is established. > > In Het's series the 'migrate_incoming' command similarly has a chaneltype > parameter. Oh, yeah then that'll just work. Thanks. -- Peter Xu
Re: [RFC 4/6] migration: Deprecate -incoming
On Thu, Jun 22, 2023 at 05:33:29PM +0100, Daniel P. Berrangé wrote: > On Thu, Jun 22, 2023 at 11:54:43AM -0400, Peter Xu wrote: > > I can try to move the todo even higher. Trying to list the initial goals > > here: > > > > - One extra phase of handshake between src/dst (maybe the time to boost > > QEMU_VM_FILE_VERSION) before anything else happens. > > > > - Dest shouldn't need to apply any cap/param, it should get all from src. > > Dest still need to be setup with an URI and that should be all it needs. > > > > - Src shouldn't need to worry on the binary version of dst anymore as long > > as dest qemu supports handshake, because src can fetch it from dest. > > I'm not sure that works in general. Even if we have a handshake and > bi-directional comms for live migration, we still haave the save/restore > to file codepath to deal with. The dst QEMU doesn't exist at the time > the save process is done, so we can't add logic to VMSate handling that > assumes knowledge of the dst version at time of serialization. My current thought was still based on a new cap or anything the user would need to specify first on both sides (but hopefully the last cap to set on dest). E.g. if with a new handshake cap we shouldn't set it on a exec: or file: protocol migration, and it should just fail on qmp_migrate() telling that the URI is not supported if the cap is set. Return path is definitely required here. > > > - Handshake can always fail gracefully if anything wrong happened, it > > normally should mean dest qemu is not compatible with src's setup (either > > machine, device, or migration configs) for whatever reason. Src should > > be able to get a solid error from dest if so. > > > > - Handshake protocol should always be self-bootstrap-able, it means when we > > change the handshake protocol it should always works with old binaries. > > > > - When src is newer it should be able to know what's missing on dest and > > skip the new bits. > > > > - When dst is newer it should all rely on src (which is older) and it > > should always understand src's language. > > I'm not convinced it can reliably self-bootstrap in a backwards > compatible manner, precisely because the current migration stream > has no handshake and only requires a unidirectional channel. Yes, please see above. I meant when we grow the handshake protocol we should make sure we don't need anything new to be setup either on src/dst of qemu. It won't apply to before-handshake binaries. > I don't think its possible for QEMU to validate that it has a fully > bi-directional channel, without adding timeouts to its detection which I > think we should strive to avoid. > > I don't think we actually need self-bootstrapping anyway. > > I think the mgmt app can just indicate the new v2 bi-directional > protocol when issuing the 'migrate' and 'migrate-incoming' > commands. This becomes trivial when Het's refactoring of the > migrate address QAPI is accepted: > > https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg04851.html > > eg: > > { "execute": "migrate", > "arguments": { > "channels": [ { "channeltype": "main", > "addr": { "transport": "socket", "type": "inet", >"host": "10.12.34.9", > "port": "1050" } } ] } } > > note the 'channeltype' parameter here. If we declare the 'main' > refers to the existing migration protocol, then we merely need > to define a new 'channeltype' to use as an indicator for the > v2 migration handshake protocol. 
Using a new channeltype would also work at least on src qemu, but I'm not sure on how dest qemu would know that it needs a handshake in that case, because it knows nothing until the connection is established. Maybe we still need QEMU_VM_FILE_VERSION to be boosted at least in this case, so dest can read this at the very beginning, old binaries will fail immediately, new binaries will start to talk with v2 language. > > > - All !main channels need to be established later than the handshake - if > > we're going to do this anyway we probably should do it altogether to make > > channels named, so each channel used in migration needs to have a common > > header. Prepare to deprecate the old tricks of channel orderings. > > Once the primary channel involves a bi-directional handshake, >
Re: [RFC 4/6] migration: Deprecate -incoming
On Thu, Jun 22, 2023 at 10:59:58AM +0100, Daniel P. Berrangé wrote: > On Thu, Jun 22, 2023 at 10:52:12AM +0200, Juan Quintela wrote: > > Paolo Bonzini wrote: > > > On 6/12/23 22:51, Juan Quintela wrote: > > >>> Shall we just leave it there? Or is deprecating it helps us in any > > >>> form? > > >> See the patches two weeks ago when people complained that lisen(.., num) > > >> was too low. And there are other parameters that work the same way > > >> (that I convenientely had forgotten). So the easiest way to get things > > >> right is to use "defer" always. Using -incoming "uri" should only be > > >> for people that "know what they are doing", so we had to ways to do it: > > >> - review all migration options and see which ones work without defer > > >>and document it > > >> - deprecate everything that is not defer. > > > > > > "-incoming " is literally the same as running "migrate-incoming" > > > as the first thing on the monitor: > > > > > > if (incoming) { > > > Error *local_err = NULL; > > > if (strcmp(incoming, "defer") != 0) { > > > qmp_migrate_incoming(incoming, &local_err); > > > if (local_err) { > > > error_reportf_err(local_err, "-incoming %s: ", incoming); > > > exit(1); > > > } > > > } > > > } else if (autostart) { > > > qmp_cont(NULL); > > > } > > > > > > It's the only piece of code which distinguishes "-incoming defer" from > > > "-incoming ". > > > > > > So I'm not sure what the problem would be with keeping it? > > > > User friendliness. > > > > First of all, I use it all the time. And I know that it is useful for > > developers. I was the one asking peter to implement -global > > migration.foo to be able to test multifd with it. > > > > The problem is that if you use more than two channels with multifd, on > > the incoming side, you need to do: > > > > - migrate_set_parameter multifd-channels 16 > > - migrate_incoming > > > > And people continue to do: > > > > - qemu -incoming > > - migrate_set_parameter multifd-channels 16 (on the command line) > > > > And they complain that it fails, because we are calling listen with the > > wrong value. > > IMHO if we want to improve user friendliness then arguing about use > of the CLI vs QMP for migration is completely missing the bigger > picture IMHO. > > I've mentioned several times before that the user should never need to > set this multifd-channels parameter (nor many other parameters) on the > destination in the first place. > > The QEMU migration stream should be changed to add a full > bi-directional handshake, with negotiation of most parameters. > IOW, the src QEMU should be configured with 16 channels, and > it should connect the primary control channel, and then directly > tell the dest that it wants to use 16 multifd channels. > > If we're expecting the user to pass this info across to the dest > manually we've already spectacularly failed wrt user friendliness. I can try to move the todo even higher. Trying to list the initial goals here: - One extra phase of handshake between src/dst (maybe the time to boost QEMU_VM_FILE_VERSION) before anything else happens. - Dest shouldn't need to apply any cap/param, it should get all from src. Dest still need to be setup with an URI and that should be all it needs. - Src shouldn't need to worry on the binary version of dst anymore as long as dest qemu supports handshake, because src can fetch it from dest. 
- Handshake can always fail gracefully if anything goes wrong; it normally should mean dest qemu is not compatible with src's setup (either machine, device, or migration configs) for whatever reason. Src should be able to get a solid error from dest if so. - Handshake protocol should always be self-bootstrap-able, meaning that when we change the handshake protocol it should always work with old binaries. - When src is newer it should be able to know what's missing on dest and skip the new bits. - When dst is newer it should all rely on src (which is older) and it should always understand src's language. - All !main channels need to be established later than the handshake - if we're going to do this anyway, we probably should go all the way and make the channels named, so each channel used in migration needs to have a common header (see the sketch below). Prepare to deprecate the old tricks of channel orderings. Does it look reasonable? Anything to add/remove? Thanks, -- Peter Xu
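Purely as an illustration of the "common header" and self-bootstrapping points above (this is not a proposed wire format, and the field names are made up): each channel could start with a small fixed header that old binaries reject cleanly and new binaries use to negotiate.

typedef struct QEMU_PACKED {
    uint32_t magic;         /* identifies a v2 migration channel            */
    uint32_t version;       /* handshake protocol version                   */
    uint32_t channel_type;  /* main, multifd, postcopy-preempt, ...         */
    uint32_t payload_len;   /* length of the negotiation payload to follow  */
} MigChannelHeader;

The main channel's payload would then carry the src caps/params for dest to apply, and fields unknown to one side can be skipped by length, so older and newer binaries keep understanding each other.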
Re: [RFC 4/6] migration: Deprecate -incoming
On Thu, Jun 22, 2023 at 11:22:56AM +0200, Thomas Huth wrote: > Then simply forbid "migrate_set_parameter multifd-channels ..." if the uri > has been specified on the command line? Yeah, actually already in a pull (even though the pr may need a new one..): https://lore.kernel.org/r/20230622021320.66124-23-quint...@redhat.com -- Peter Xu
Re: [RFC 4/6] migration: Deprecate -incoming
On Thu, Jun 22, 2023 at 12:01:55PM +0200, Juan Quintela wrote: > Paolo Bonzini wrote: > > On 6/22/23 10:52, Juan Quintela wrote: > >> User friendliness. > >> The problem is that if you use more than two channels with multifd, on > >> the incoming side, you need to do: > > > > You're sacrificing user-friendliness for the 99.99% that don't use > > multifd, for an error (i.e. it's not even fixing the issue) for the > > 0.01% that use multifd. That's not user-friendly. > > You are forgeting of the 0.01% that uses postocopy preempt (that is easy > just changing the default value to 2), and the 0.001% that uses > compression (they have exactly the same problem with compress_threads, > compress_zlib, etc). > > >> - migrate_set_parameter multifd-channels 16 > >> - migrate_incoming > >> > >>> The issue is not how many features the command line has, but how > >>> they're implemented. > >> Or if they are confusing for the user? > > > > Anyone using multifd is not a typical user anyway. > > >>> If they're just QMP wrappers and as such they're self-contained in > >>> softmmu/vl.c, that's fine. > >>> > >>> In fact, even for parameters, we could use keyval to parse "-incoming" > >> What is keyval? > > > > util/keyval.c and include/qemu/keyval.h. It parses a list of > > key=value pairs into a QDict. Once you have removed the "source" key > > from the QDict you can use a visitor to parse the rest into a > > MigrateSetParameters. See the handling of QEMU_OPTION_audio, it could > > be something like > > > > > > case QEMU_OPTION_incoing: { > > Visitor *v; > > MigrateSetParameters *incoming_params = NULL; > > QDict *dict = keyval_parse(optarg, "source", NULL, > > &error_fatal); > > > > if (incoming) { > > if (qdict_haskey(dict, "source")) { > > error_setg(&error_fatal, "Parameter 'source' > > is duplicate"); > > } > > } else { > > if (!qdict_haskey(dict, "source")) { > > error_setg(&error_fatal, "Parameter 'source' > > is missing"); > > } > > runstate_set(RUN_STATE_INMIGRATE); > > incoming = g_strdup(qdict_get_str(dict, "source")); > > qdict_del(dict, "source"); > > } > > > > v = qobject_input_visitor_new_keyval(QOBJECT(dict)); > > qobject_unref(dict); > > visit_type_MigrateSetParameters(v, NULL, > > &incoming_params, &error_fatal); > > visit_free(v); > > qmp_migration_set_parameters(incoming_params, PS: we may want to postpone this to be later than migration_object_init(), when/if there's a real patch. > > &error_fatal); > > qapi_free_MigrateSetParameters(incoming_params); > > } > > > > > > For example "-incoming [source=]tcp:foo,multifd-channels=16" would > > desugar to > > > > migrate_set_parameter multifd-channels 16 > > migrate_incoming tcp:foo > > > > The only incompatibility is for people who are using "," in an URI, > > which is rare and only an issue for the "exec" protocol. If we worry on breaking anyone, we can apply the keyval parsing only when !exec. Not sure whether it matters a huge lot.. > > Aha, that makes sense. And will allow us to deprecate/remove the > --global migration.* stuff. We may still need a way to set the caps/params for src qemu?.. Thanks, -- Peter Xu
Re: [PATCH v2 06/20] qemu_file: total_transferred is not used anymore
On Thu, Jun 22, 2023 at 01:05:49AM +0200, Juan Quintela wrote: > Peter Xu wrote: > > On Tue, May 30, 2023 at 08:39:27PM +0200, Juan Quintela wrote: > >> Signed-off-by: Juan Quintela > >> --- > >> migration/qemu-file.c | 4 > >> 1 file changed, 4 deletions(-) > >> > >> diff --git a/migration/qemu-file.c b/migration/qemu-file.c > >> index eb0497e532..6b6deea19b 100644 > >> --- a/migration/qemu-file.c > >> +++ b/migration/qemu-file.c > >> @@ -41,9 +41,6 @@ struct QEMUFile { > >> QIOChannel *ioc; > >> bool is_writable; > >> > >> -/* The sum of bytes transferred on the wire */ > >> -uint64_t total_transferred; > >> - > >> int buf_index; > >> int buf_size; /* 0 when writing */ > >> uint8_t buf[IO_BUF_SIZE]; > >> @@ -287,7 +284,6 @@ void qemu_fflush(QEMUFile *f) > >> qemu_file_set_error_obj(f, -EIO, local_error); > >> } else { > >> uint64_t size = iov_size(f->iov, f->iovcnt); > >> -f->total_transferred += size; > > > > I think this patch is another example why I think sometimes the way patch > > is split are pretty much adding more complexity on review... > > It depends of taste. > > You are doing one thing in way1. > Then you find a better way to do it, lets call it way2. > > Now we have two options to see how we arrived there. > > a- You got any declarations/definition/initializations for way2 > b- You write way2 alongside way1 > c- You test that both ways give the same result, and you see that they >give the same result. > d- you remove the way1. > > Or you squash the four patches in a single patch. But then the reviewer > lost the place where one can see why it is the same than the old one. For this patch to remove total_transferred, IMHO as a reviewer it'll be easier to me if it's put in the same patch where it got replaced. It might be different if we're going to remove a large chunk of code, but for this patch it's a few lines of change. > > Sometimes is better the longer way, sometimes is better the short one. > > Clearly we don't agree about what is the best way in this case. > > > Here we removed a variable operation but it seems all fine if it's not used > > anywhere. But it also means current code base (before this patch applied) > > doesn't make sense already because it contains this useless addition. So > > IMHO it means some previous patch does it just wrong. > > No. It is how it is developed. And being respectful with the > reviewer. Given it enough information to do a proper review. I never doubted about that. I trust that you were trying to provide the best for a reviewer and I appreciate that. > > During the development of this series, there were lots of: > > if (old_counter != new_counter) >printf(""); I assume you meant when debugging? I may not have read all the patches; I hope we just don't do that if possible in any real git commit. > > traces were in the several thousand lines long. If I have to review > that change, I would love any help that writer can give me. That is why > it is done this way. Yeah, I think it's probably that even from reviewers' perspective the need can be different from individuals. > > > I think it means it's caused by a wrong split of patches, then each patch > > stops to make much sense as a standalone one. > > It stops making sense if you want each feature to be a single patch. > Before the patch no feature. After the patch full feature. That brings > us to very long patches. > > What is easier to review (to do the same) > > a - 1 x 1000 lines patch > b - 10 x 100 lines patch > > I will go with b any time. Except if the split is arbitrary. 
AFAIU, this is a different thing. I'm never against splitting patch, but about how to split. I was never asking for a 1000 LOC patch, right? :) > > > I can go back and try to find whatever patch on the list that will explain > > this. But it'll also go into git log. Anyone reads this later will be > > confused once again. Even harder for them to figure out what > > happened. > > As said before, I completely disagree here. And what is worse. If it > gets wrong, with your approach git bisect will not help as much than > with my appreach. > > > Do you think we could reorganize the patches so each of a single patch > > explains itself? > > No. See before. We go for a very spaguetti code to a much less > spaguety code. >
Re: [RFC 4/6] migration: Deprecate -incoming
On Tue, Jun 20, 2023 at 01:10:55PM +0100, Daniel P. Berrangé wrote: > In some cases it is worth having a convenience option for user friendliness. > > In this case, however, users are already needing to use QMP/HMP on the > source side to set migration parameters. I think it is reasonable to say > that doing *exactly* the same with QMP/HMP on the destination side is > actually more friendly than requiring use of -global on the dest side > which is different syntax. The -global parameters also work for the src side for my scripts, so I only need to attach "-incoming tcp:XXX" to the dst cmdline here. > > We don't document the way to use -global with migration parameters so > when people see '-incoming' I think we are steering them into a trap, > making it look like -incoming is sufficient on its own. Hence that user's > mistake recently about not setting migration parameters. > > Overall I agree with deprecating '-incoming' != 'defer', as I think it > will better guide users to following the correct steps, as is not a > usability burden as they're already using QMP/HMP on the source side. I'd say -global is definitely not for users but for developers. Just like we keep around HMP - I'm not sure whether it's used in production, I thought it was only for developers and we don't deprecate it eagerly. No strong opinions here if everyone wants to get rid of that, but before that I hope we can have some kind of cmdline tool that can easily tunnel qmp commands, so developers using pure cmdlines can switch to the new way easily. Thanks, -- Peter Xu
Re: [PATCH v2 19/20] qemu-file: Simplify qemu_file_shutdown()
On Tue, May 30, 2023 at 08:39:40PM +0200, Juan Quintela wrote: > Signed-off-by: Juan Quintela Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v2 20/20] qemu-file: Make qemu_file_get_error_obj() static
On Tue, May 30, 2023 at 08:39:41PM +0200, Juan Quintela wrote: > It was not used outside of qemu_file.c anyways. > > Signed-off-by: Juan Quintela Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v2 18/20] qemu_file: Make qemu_file_is_writable() static
On Tue, May 30, 2023 at 08:39:39PM +0200, Juan Quintela wrote: > It is not used outside of qemu_file, and it shouldn't. > > Signed-off-by: Juan Quintela Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v2 16/20] migration/rdma: Split qemu_fopen_rdma() into input/output functions
On Tue, May 30, 2023 at 08:39:37PM +0200, Juan Quintela wrote: > This is how everything else in QEMUFile is structured. > As a bonus they are three less lines of code. > > Signed-off-by: Juan Quintela > --- > migration/rdma.c | 35 --- > 1 file changed, 16 insertions(+), 19 deletions(-) > > diff --git a/migration/rdma.c b/migration/rdma.c > index 3ef35fc635..606837bd45 100644 > --- a/migration/rdma.c > +++ b/migration/rdma.c > @@ -4046,25 +4046,22 @@ static void qio_channel_rdma_register_types(void) > > type_init(qio_channel_rdma_register_types); > > -static QEMUFile *qemu_fopen_rdma(RDMAContext *rdma, const char *mode) > +static QEMUFile *rdma_new_input(RDMAContext *rdma) > { > -QIOChannelRDMA *rioc; > +QIOChannelRDMA *rioc = > QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA)); > +rioc->file = qemu_file_new_input(QIO_CHANNEL(rioc)); > +rioc->rdmain = rdma; > +rioc->rdmaout = rdma->return_path; > > -if (qemu_file_mode_is_not_valid(mode)) { We can also drop this function together. If with that: Reviewed-by: Peter Xu > -return NULL; > -} > +return rioc->file; > +} > > -rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA)); > - > -if (mode[0] == 'w') { > -rioc->file = qemu_file_new_output(QIO_CHANNEL(rioc)); > -rioc->rdmaout = rdma; > -rioc->rdmain = rdma->return_path; > -} else { > -rioc->file = qemu_file_new_input(QIO_CHANNEL(rioc)); > -rioc->rdmain = rdma; > -rioc->rdmaout = rdma->return_path; > -} > +static QEMUFile *rdma_new_output(RDMAContext *rdma) > +{ > +QIOChannelRDMA *rioc = > QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA)); > +rioc->file = qemu_file_new_output(QIO_CHANNEL(rioc)); > +rioc->rdmaout = rdma; > +rioc->rdmain = rdma->return_path; > > return rioc->file; > } > @@ -4090,9 +4087,9 @@ static void rdma_accept_incoming_migration(void *opaque) > return; > } > > -f = qemu_fopen_rdma(rdma, "rb"); > +f = rdma_new_input(rdma); > if (f == NULL) { > -fprintf(stderr, "RDMA ERROR: could not qemu_fopen_rdma\n"); > +fprintf(stderr, "RDMA ERROR: could not open RDMA for input\n"); > qemu_rdma_cleanup(rdma); > return; > } > @@ -4217,7 +4214,7 @@ void rdma_start_outgoing_migration(void *opaque, > trace_rdma_start_outgoing_migration_after_rdma_connect(); > > s->rdma_migration = true; > -s->to_dst_file = qemu_fopen_rdma(rdma, "wb"); > +s->to_dst_file = rdma_new_output(rdma); > migrate_fd_connect(s, NULL); > return; > return_path_err: > -- > 2.40.1 > -- Peter Xu
Re: [PATCH v2 06/20] qemu_file: total_transferred is not used anymore
On Tue, May 30, 2023 at 08:39:27PM +0200, Juan Quintela wrote: > Signed-off-by: Juan Quintela > --- > migration/qemu-file.c | 4 > 1 file changed, 4 deletions(-) > > diff --git a/migration/qemu-file.c b/migration/qemu-file.c > index eb0497e532..6b6deea19b 100644 > --- a/migration/qemu-file.c > +++ b/migration/qemu-file.c > @@ -41,9 +41,6 @@ struct QEMUFile { > QIOChannel *ioc; > bool is_writable; > > -/* The sum of bytes transferred on the wire */ > -uint64_t total_transferred; > - > int buf_index; > int buf_size; /* 0 when writing */ > uint8_t buf[IO_BUF_SIZE]; > @@ -287,7 +284,6 @@ void qemu_fflush(QEMUFile *f) > qemu_file_set_error_obj(f, -EIO, local_error); > } else { > uint64_t size = iov_size(f->iov, f->iovcnt); > -f->total_transferred += size; I think this patch is another example of why I think the way a patch series is split can add complexity to review... Here we removed a variable operation, which seems all fine because it's not used anywhere. But it also means the current code base (before this patch is applied) already doesn't make sense, because it contains this useless addition. So IMHO some previous patch got it wrong. I think it's caused by a wrong split of patches, so each patch stops making much sense as a standalone one. I can go back and try to find whatever patch on the list will explain this. But it'll also go into the git log. Anyone who reads this later will be confused once again, and it'll be even harder for them to figure out what happened. Do you think we could reorganize the patches so each single patch explains itself? The other thing is the priority of patches - I still have ~80 patches pending review on migration alone.. Do you think it makes sense to pick up the important ones first and merge them with higher priority? What I have in mind are: - The regression you mentioned on qemu_fflush() during ram save (I hope I understand the problem right, though...). - The slowness of migration-test. I'm not sure whether you have anything better than either Dan's approach or the switchover-hold flags; I just think that seems more important to resolve for us upstream. We can merge Dan's or mine, or you can propose something else, but IMHO that seems to be a higher priority? And whatever else I haven't noticed. I'll continue reading, but I'm sure you know this best.. so I'd really rely on you. What do you think? Thanks, -- Peter Xu
Re: [PATCH v2 04/20] qemu-file: We only call qemu_file_transferred_* on the sending side
On Tue, Jun 13, 2023 at 06:02:05PM +0200, Juan Quintela wrote: > Peter Xu wrote: > > On Tue, May 30, 2023 at 08:39:25PM +0200, Juan Quintela wrote: > >> Remove the increase in qemu_file_fill_buffer() and add asserts to > >> qemu_file_transferred* functions. > >> > >> Signed-off-by: Juan Quintela > > > > The read side accounting does look a bit weird and never caught my notice.. > > > > Maybe worth also touching the document of QEMUFile::total_transferred to > > clarify what it accounts? > > > > Reviewed-by: Peter Xu > > > > Though when I'm looking at the counters (didn't follow every single recent > > patch on this..), I found that now reading transferred value is actually > > more expensive - qemu_file_transferred() needs flushing, even if for the > > fast version, qemu_file_transferred_fast() loops over all possible iovs, > > which can be as large as MAX_IOV_SIZE==64. > > > > To be explicit, I _think_ for each guest page we now need to flush... > > > > ram_save_iterate > > migration_rate_exceeded > > migration_transferred_bytes > > qemu_file_transferred > > > > I hope I'm wrong.. > > See patch 7: > > diff --git a/migration/migration-stats.c b/migration/migration-stats.c > index 79eea8d865..1696185694 100644 > --- a/migration/migration-stats.c > +++ b/migration/migration-stats.c > @@ -62,7 +62,7 @@ uint64_t migration_transferred_bytes(QEMUFile *f) > { > uint64_t multifd = stat64_get(&mig_stats.multifd_bytes); > uint64_t rdma = stat64_get(&mig_stats.rdma_bytes); > -uint64_t qemu_file = qemu_file_transferred(f); > +uint64_t qemu_file = stat64_get(&mig_stats.qemu_file_transferred); > > trace_migration_transferred_bytes(qemu_file, multifd, rdma); > return qemu_file + multifd + rdma; If this is a known regression, should we make a first patchset fix it and make it higher priority to merge? It seems this is even not mentioned in the cover letter.. while IMHO this is the most important bit to have in it.. > > > > Does it mean that perhaps we simply need "sent and put into send buffer" > > more than "what really got transferred"? So I start to wonder what's the > > origianl purpose of this change, and which one is better.. > > That is basically what patch 5 and 6 do O:-) > > Problem is arriving to something that is bisectable (for correctness) > and is easy to review. > > And yes, my choices can be different from the ones tat you do. > > The other reason for the small patches is that: > a - sometimes I found a different test where things broke, and have to > bisect > b - small patches are much easier to rebase (that I am doing a lot) That's okay. Thanks, -- Peter Xu
Re: [RFC 4/6] migration: Deprecate -incoming
On Mon, Jun 12, 2023 at 10:51:08PM +0200, Juan Quintela wrote: > Peter Xu wrote: > > On Mon, Jun 12, 2023 at 09:33:42PM +0200, Juan Quintela wrote: > >> Only "defer" is recommended. After setting all migation parameters, > >> start incoming migration with "migrate-incoming uri" command. > >> > >> Signed-off-by: Juan Quintela > >> --- > >> docs/about/deprecated.rst | 7 +++ > >> softmmu/vl.c | 2 ++ > >> 2 files changed, 9 insertions(+) > >> > >> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst > >> index 47e98dc95e..518672722d 100644 > >> --- a/docs/about/deprecated.rst > >> +++ b/docs/about/deprecated.rst > >> @@ -447,3 +447,10 @@ The new way to modify migration is using migration > >> parameters. > >> ``blk`` functionality can be acchieved using > >> ``migrate_set_parameter block-incremental true``. > >> > >> +``-incoming uri`` (since 8.1) > >> +''''''''''''''''''''''''''''' > >> + > >> +Everything except ``-incoming defer`` are deprecated. This allows to > >> +setup parameters before launching the proper migration with > >> +``migrate-incoming uri``. > >> + > >> diff --git a/softmmu/vl.c b/softmmu/vl.c > >> index b0b96f67fa..7fe865ab59 100644 > >> --- a/softmmu/vl.c > >> +++ b/softmmu/vl.c > >> @@ -2651,6 +2651,8 @@ void qmp_x_exit_preconfig(Error **errp) > >> if (incoming) { > >> Error *local_err = NULL; > >> if (strcmp(incoming, "defer") != 0) { > >> +warn_report("-incoming %s is deprecated, use -incoming defer > >> and " > >> +" set the uri with migrate-incoming.", incoming); > > > > I still use uri for all my scripts, alongside with "-global migration.xxx" > > and it works. > > You know what you are doing (TM). > And remember that we don't support -gobal migration.x-foo. > Yes, I know, we should drop the "x-" prefixes. I hope they'll always be there. :) They're pretty handy for tests, when we want to boot a VM without the need to script the sequences of qmp cmds. Yes, we probably should just always drop the x-. We can always declare debugging purpose for all -global migration.* fields. > > > Shall we just leave it there? Or is deprecating it helps us in any form? > > See the patches two weeks ago when people complained that lisen(.., num) > was too low. And there are other parameters that work the same way > (that I convenientely had forgotten). So the easiest way to get things > right is to use "defer" always. Using -incoming "uri" should only be > for people that "know what they are doing", so we had to ways to do it: > - review all migration options and see which ones work without defer > and document it > - deprecate everything that is not defer. > > Anything else is not going to be very user unfriendly. > What do you think. IIRC Wei Wang had a series just for that, so after that patchset applied we should have fixed all issues cleanly? Is there one more thing that's not working right there? > > Later, Juan. > > PD. This series are RFC for multiple reasons O:-) Happy to know the rest (besides which I know will break my script :). -- Peter Xu
Re: [RFC 4/6] migration: Deprecate -incoming
On Mon, Jun 12, 2023 at 09:33:42PM +0200, Juan Quintela wrote: > Only "defer" is recommended. After setting all migation parameters, > start incoming migration with "migrate-incoming uri" command. > > Signed-off-by: Juan Quintela > --- > docs/about/deprecated.rst | 7 +++ > softmmu/vl.c | 2 ++ > 2 files changed, 9 insertions(+) > > diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst > index 47e98dc95e..518672722d 100644 > --- a/docs/about/deprecated.rst > +++ b/docs/about/deprecated.rst > @@ -447,3 +447,10 @@ The new way to modify migration is using migration > parameters. > ``blk`` functionality can be acchieved using > ``migrate_set_parameter block-incremental true``. > > +``-incoming uri`` (since 8.1) > +''''''''''''''''''''''''''''' > + > +Everything except ``-incoming defer`` are deprecated. This allows to > +setup parameters before launching the proper migration with > +``migrate-incoming uri``. > + > diff --git a/softmmu/vl.c b/softmmu/vl.c > index b0b96f67fa..7fe865ab59 100644 > --- a/softmmu/vl.c > +++ b/softmmu/vl.c > @@ -2651,6 +2651,8 @@ void qmp_x_exit_preconfig(Error **errp) > if (incoming) { > Error *local_err = NULL; > if (strcmp(incoming, "defer") != 0) { > +warn_report("-incoming %s is deprecated, use -incoming defer and > " > +" set the uri with migrate-incoming.", incoming); I still use uri for all my scripts, alongside with "-global migration.xxx" and it works. Shall we just leave it there? Or is deprecating it helps us in any form? -- Peter Xu
Re: [PATCH v2 05/20] qemu_file: Use a stat64 for qemu_file_transferred
On Tue, May 30, 2023 at 08:39:26PM +0200, Juan Quintela wrote: > This way we can read it from any thread. > I checked that it gives the same value than the current one. We never > use to qemu_files at the same time. > > Signed-off-by: Juan Quintela The follow up patch may be better to be squashed or it's very confusing.. Why do we need to convert mostly everything into atomics? Is it modified outside migration thread? AFAIR atomic ops are still expensive, and will get more expensive on larger hosts, IOW it'll be good for us to keep non-atomic when possible (and that's why when I changed some counters in postcopy work I only changed the limited set because then the rest are still accessed in 1 single thread and keep running fast). > --- > migration/migration-stats.h | 4 > migration/qemu-file.c | 5 +++-- > 2 files changed, 7 insertions(+), 2 deletions(-) > > diff --git a/migration/migration-stats.h b/migration/migration-stats.h > index 2358caad63..b7795e7914 100644 > --- a/migration/migration-stats.h > +++ b/migration/migration-stats.h > @@ -81,6 +81,10 @@ typedef struct { > * Number of bytes sent during precopy stage. > */ > Stat64 precopy_bytes; > +/* > + * Number of bytes transferred with QEMUFile. > + */ > +Stat64 qemu_file_transferred; > /* > * Amount of transferred data at the start of current cycle. > */ > diff --git a/migration/qemu-file.c b/migration/qemu-file.c > index be3dab85cb..eb0497e532 100644 > --- a/migration/qemu-file.c > +++ b/migration/qemu-file.c > @@ -288,6 +288,7 @@ void qemu_fflush(QEMUFile *f) > } else { > uint64_t size = iov_size(f->iov, f->iovcnt); > f->total_transferred += size; > +stat64_add(&mig_stats.qemu_file_transferred, size); > } > > qemu_iovec_release_ram(f); > @@ -628,7 +629,7 @@ int coroutine_mixed_fn qemu_get_byte(QEMUFile *f) > > uint64_t qemu_file_transferred_noflush(QEMUFile *f) > { > -uint64_t ret = f->total_transferred; > +uint64_t ret = stat64_get(&mig_stats.qemu_file_transferred); > int i; > > g_assert(qemu_file_is_writable(f)); > @@ -644,7 +645,7 @@ uint64_t qemu_file_transferred(QEMUFile *f) > { > g_assert(qemu_file_is_writable(f)); > qemu_fflush(f); > -return f->total_transferred; > +return stat64_get(&mig_stats.qemu_file_transferred); > } > > void qemu_put_be16(QEMUFile *f, unsigned int v) > -- > 2.40.1 > -- Peter Xu
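For reference, a minimal, self-contained sketch of the kind of counter being discussed (this only illustrates the Stat64 idea with plain C11 atomics; it is not QEMU's actual Stat64, which also carries a fallback path for hosts without native 64-bit atomics):

    #include <stdatomic.h>
    #include <stdint.h>

    /* Illustrative stand-in for Stat64: safe to update from any thread,
     * but every update is an atomic read-modify-write, which costs more
     * than a plain "+=" and scales worse on large hosts under contention. */
    typedef struct {
        _Atomic uint64_t value;
    } CounterStat64;

    static inline void counter_add(CounterStat64 *s, uint64_t v)
    {
        atomic_fetch_add_explicit(&s->value, v, memory_order_relaxed);
    }

    static inline uint64_t counter_get(CounterStat64 *s)
    {
        return atomic_load_explicit(&s->value, memory_order_relaxed);
    }

That is the trade-off raised above: the atomic form can be read from any thread, while counters that stay private to the migration thread can remain plain integers and stay cheap.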
Re: [PATCH v2 04/20] qemu-file: We only call qemu_file_transferred_* on the sending side
On Tue, May 30, 2023 at 08:39:25PM +0200, Juan Quintela wrote: > Remove the increase in qemu_file_fill_buffer() and add asserts to > qemu_file_transferred* functions. > > Signed-off-by: Juan Quintela The read side accounting does look a bit weird and never caught my attention.. Maybe worth also touching the documentation of QEMUFile::total_transferred to clarify what it accounts for? Reviewed-by: Peter Xu Though when I'm looking at the counters (didn't follow every single recent patch on this..), I found that reading the transferred value is now actually more expensive - qemu_file_transferred() needs flushing, and even the fast version, qemu_file_transferred_fast(), loops over all possible iovs, which can be as many as MAX_IOV_SIZE==64. To be explicit, I _think_ for each guest page we now need to flush... ram_save_iterate -> migration_rate_exceeded -> migration_transferred_bytes -> qemu_file_transferred I hope I'm wrong.. Does it mean that perhaps we simply need "sent and put into send buffer" more than "what really got transferred"? So I start to wonder what the original purpose of this change was, and which one is better.. -- Peter Xu
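The direction the series eventually takes (see the stat64 discussion elsewhere in this thread) is roughly the following; a sketch only, with names borrowed from the series and the surrounding fields assumed rather than copied from the real code:

    /* Sketch: qemu_fflush() adds the flushed size to an atomic counter,
     * so asking "how much has been transferred" becomes a few atomic
     * loads instead of forcing a flush for every guest page. */
    static uint64_t migration_transferred_bytes(void)
    {
        return stat64_get(&mig_stats.qemu_file_transferred) +
               stat64_get(&mig_stats.multifd_bytes) +
               stat64_get(&mig_stats.rdma_bytes);
    }

Whether that counts "put into the send buffer" or "actually on the wire" then depends only on where the stat64_add() calls are placed, which is exactly the question asked above.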
Re: [PATCH 21/21] migration/multifd: Compute transferred bytes correctly
On Thu, May 18, 2023 at 06:40:18PM +0200, Juan Quintela wrote: > Peter Xu wrote: > > On Mon, May 08, 2023 at 03:09:09PM +0200, Juan Quintela wrote: > >> In the past, we had to put the in the main thread all the operations > >> related with sizes due to qemu_file not beeing thread safe. As now > >> all counters are atomic, we can update the counters just after the > >> do the write. As an aditional bonus, we are able to use the right > >> value for the compression methods. Right now we were assuming that > >> there were no compression at all. > > > > Maybe worth mention that initial packet is also accounted after this. > > Ok. > >> > >> +stat64_add(&mig_stats.multifd_bytes, p->next_packet_size); > >> +stat64_add(&mig_stats.transferred, p->next_packet_size); > > > > Two nits: > > > > Maybe merge the two so half atomic operations? > > On my tree, to send after this got in: > > 77fdd3475c migration: Remove transferred atomic counter > > O:-) Ah this looks even better, indeed. :) What I meant was we can also do atomic update in one shot for both next_packet_size + packet_len. > > > Also maybe also worth having a inline helper for adding both multifd_bytes > > and transferred? > > I am removing it. > > After next set of packates: > > rate limit is calulated as: > > begining_period = migration_transferred_bytes(); > ... > > bytes_this_period = migration_transferred_bytes() - begining_period; > > transferred is calculated as: > - multifd_bytes + qemu_file_bytes; > > So things get really simple. As all counters are atomic, you do a > write and after the write to increse the write size to the qemu_file or > to the multifd_bytes. And that is it. Agreed. Thanks, -- Peter Xu
Re: [PATCH 21/21] migration/multifd: Compute transferred bytes correctly
On Mon, May 08, 2023 at 03:09:09PM +0200, Juan Quintela wrote: > In the past, we had to put the in the main thread all the operations > related with sizes due to qemu_file not beeing thread safe. As now > all counters are atomic, we can update the counters just after the > do the write. As an aditional bonus, we are able to use the right > value for the compression methods. Right now we were assuming that > there were no compression at all. Maybe worth mention that initial packet is also accounted after this. > > Signed-off-by: Juan Quintela Two more trivial nits: > --- > migration/multifd.c | 13 - > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff --git a/migration/multifd.c b/migration/multifd.c > index 9d2ade7abc..3a19d8e304 100644 > --- a/migration/multifd.c > +++ b/migration/multifd.c > @@ -175,6 +175,7 @@ void multifd_register_ops(int method, MultiFDMethods *ops) > static int multifd_send_initial_packet(MultiFDSendParams *p, Error **errp) > { > MultiFDInit_t msg = {}; > +size_t size = sizeof(msg); > int ret; > > msg.magic = cpu_to_be32(MULTIFD_MAGIC); > @@ -182,10 +183,12 @@ static int > multifd_send_initial_packet(MultiFDSendParams *p, Error **errp) > msg.id = p->id; > memcpy(msg.uuid, &qemu_uuid.data, sizeof(msg.uuid)); > > -ret = qio_channel_write_all(p->c, (char *)&msg, sizeof(msg), errp); > +ret = qio_channel_write_all(p->c, (char *)&msg, size, errp); > if (ret != 0) { > return -1; > } > +stat64_add(&mig_stats.multifd_bytes, size); > +stat64_add(&mig_stats.transferred, size); > return 0; > } > > @@ -396,7 +399,6 @@ static int multifd_send_pages(QEMUFile *f) > static int next_channel; > MultiFDSendParams *p = NULL; /* make happy gcc */ > MultiFDPages_t *pages = multifd_send_state->pages; > -uint64_t transferred; > > if (qatomic_read(&multifd_send_state->exiting)) { > return -1; > @@ -431,10 +433,7 @@ static int multifd_send_pages(QEMUFile *f) > p->packet_num = multifd_send_state->packet_num++; > multifd_send_state->pages = p->pages; > p->pages = pages; > -transferred = ((uint64_t) pages->num) * p->page_size + p->packet_len; > qemu_mutex_unlock(&p->mutex); > -stat64_add(&mig_stats.transferred, transferred); > -stat64_add(&mig_stats.multifd_bytes, transferred); > qemu_sem_post(&p->sem); > > return 1; > @@ -716,6 +715,8 @@ static void *multifd_send_thread(void *opaque) > if (ret != 0) { > break; > } > +stat64_add(&mig_stats.multifd_bytes, p->packet_len); > +stat64_add(&mig_stats.transferred, p->packet_len); > } else { > /* Send header using the same writev call */ > p->iov[0].iov_len = p->packet_len; > @@ -728,6 +729,8 @@ static void *multifd_send_thread(void *opaque) > break; > } > > +stat64_add(&mig_stats.multifd_bytes, p->next_packet_size); > +stat64_add(&mig_stats.transferred, p->next_packet_size); Two nits: Maybe merge the two so half atomic operations? Also maybe also worth having a inline helper for adding both multifd_bytes and transferred? With/without that, all look good: Reviewed-by: Peter Xu Thanks, > qemu_mutex_lock(&p->mutex); > p->pending_job--; > qemu_mutex_unlock(&p->mutex); > -- > 2.40.0 > -- Peter Xu
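A possible shape for the inline helper suggested in the nits, sketched with made-up names (not the final upstream API); when the header goes out in the same writev call as the payload, the two sizes can also be folded into a single addition per counter:

    static inline void multifd_account_sent(uint64_t bytes)
    {
        stat64_add(&mig_stats.multifd_bytes, bytes);
        stat64_add(&mig_stats.transferred, bytes);
    }

    /* in multifd_send_thread(), for the "header in the same writev" path: */
    multifd_account_sent(p->packet_len + p->next_packet_size);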
Re: [PATCH] migration: for snapshots, hold the BQL during setup callbacks
On Wed, May 10, 2023 at 08:31:13AM +0200, Juan Quintela wrote: > Peter Xu wrote: > > Hi > > [Adding Kevin to the party] > > > On Fri, May 05, 2023 at 03:46:52PM +0200, Fiona Ebner wrote: > >> To fix it, ensure that the BQL is held during setup. To avoid changing > >> the behavior for migration too, introduce conditionals for the setup > >> callbacks that need the BQL and only take the lock if it's not already > >> held. > > > > The major complexity of this patch is the "conditionally taking" part. > > Yeap. > > I don't want that bit. > > This is another case of: > - I have a problem > - I will use recursive mutexes to solve it > > Now you have two problems O:-) > > > Pure question: what is the benefit of not holding BQL always during > > save_setup(), if after all we have this coroutine issue to be solved? > > Dunno. > > I would like that paolo commented on this. I "reviewed the code" 10 > years ago. I don't remember at all why we wanted to change that. > > > I can understand that it helps us to avoid taking BQL too long, but we'll > > need to take it anyway during ramblock dirty track initializations, and so > > far IIUC it's the major time to be consumed during setup(). > > > > Commit message of 9b0950375277467 says, "Only the migration_bitmap_sync() > > call needs the iothread lock". Firstly I think it's also covering > > enablement of dirty tracking: > > > > +qemu_mutex_lock_iothread(); > > +qemu_mutex_lock_ramlist(); > > +bytes_transferred = 0; > > +reset_ram_globals(); > > + > > memory_global_dirty_log_start(); > > migration_bitmap_sync(); > > +qemu_mutex_unlock_iothread(); > > > > And I think enablement itself can be slow too, maybe even slower than > > migration_bitmap_sync() especially with KVM_DIRTY_LOG_INITIALLY_SET > > supported in the kernel. > > > > Meanwhile I always got confused on why we need to sync dirty bitmap when > > setup at all. Say, what if we drop migration_bitmap_sync() here? After > > all, shouldn't all pages be dirty from start (ram_list_init_bitmaps())? > > How do you convince KVM (or the other lists) to start doing dirty > tracking? Doing a bitmap sync. I think memory_global_dirty_log_start() kicks off the tracking already. Take KVM as example, normally the case is KVM_MEM_LOG_DIRTY_PAGES is set there, then ioctl(KVM_SET_USER_MEMORY_REGION) will start dirty tracking for all of the guest memory slots necessary (including wr-protect all pages). KVM_DIRTY_LOG_INITIALLY_SET is slightly special, it'll skip that procedure during ioctl(KVM_SET_USER_MEMORY_REGION), but that also means the kernel assumed the userapp (QEMU) marked all pages dirty initially (always the case for QEMU, I think..). Hence in this case the sync doesn't help either because we'll simply get no new dirty bits in this shot.. > > And yeap, probably there is a better way of doing it. > > > This is slightly off-topic, but I'd like to know if someone can help > > answer. > > > > My whole point is still questioning whether we can unconditionally take bql > > during save_setup(). > > I agree with you. > > > I could have missed something, though, where we want to do in setup() but > > avoid holding BQL. Help needed on figuring this out (and if there is, IMHO > > it'll be worthwhile to put that into comment of save_setup() hook). > > I am more towards revert completely > 9b0950375277467fd74a9075624477ae43b9bb22 > > and call it a day. On migration we don't use coroutines on the sending > side (I mean the migration code, the block layer uses coroutines for > everything/anything). 
> > Paolo, Stefan any clues for not run setup with the BKL? > > Later, Juan. > -- Peter Xu
Re: [PATCH] migration: for snapshots, hold the BQL during setup callbacks
On Fri, May 05, 2023 at 03:46:52PM +0200, Fiona Ebner wrote: > To fix it, ensure that the BQL is held during setup. To avoid changing > the behavior for migration too, introduce conditionals for the setup > callbacks that need the BQL and only take the lock if it's not already > held. The major complexity of this patch is the "conditionally taking" part. Pure question: what is the benefit of not holding BQL always during save_setup(), if after all we have this coroutine issue to be solved? I can understand that it helps us to avoid taking BQL too long, but we'll need to take it anyway during ramblock dirty track initializations, and so far IIUC it's the major time to be consumed during setup(). Commit message of 9b0950375277467 says, "Only the migration_bitmap_sync() call needs the iothread lock". Firstly I think it's also covering enablement of dirty tracking: +qemu_mutex_lock_iothread(); +qemu_mutex_lock_ramlist(); +bytes_transferred = 0; +reset_ram_globals(); + memory_global_dirty_log_start(); migration_bitmap_sync(); +qemu_mutex_unlock_iothread(); And I think enablement itself can be slow too, maybe even slower than migration_bitmap_sync() especially with KVM_DIRTY_LOG_INITIALLY_SET supported in the kernel. Meanwhile I always got confused on why we need to sync dirty bitmap when setup at all. Say, what if we drop migration_bitmap_sync() here? After all, shouldn't all pages be dirty from start (ram_list_init_bitmaps())? This is slightly off-topic, but I'd like to know if someone can help answer. My whole point is still questioning whether we can unconditionally take bql during save_setup(). I could have missed something, though, where we want to do in setup() but avoid holding BQL. Help needed on figuring this out (and if there is, IMHO it'll be worthwhile to put that into comment of save_setup() hook). Thanks, -- Peter Xu
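For context, the "conditionally taking" shape under discussion looks roughly like this; a sketch only, using the pre-9.0 BQL helper names and a simplified hook invocation:

    static int save_setup_maybe_bql(SaveStateEntry *se, QEMUFile *f)
    {
        bool bql_held = qemu_mutex_iothread_locked();
        int ret;

        /* Take the BQL only if the caller does not already hold it, so
         * the snapshot path gets the lock while the migration path keeps
         * its current behavior. */
        if (!bql_held) {
            qemu_mutex_lock_iothread();
        }
        ret = se->ops->save_setup(f, se->opaque);
        if (!bql_held) {
            qemu_mutex_unlock_iothread();
        }
        return ret;
    }

The alternative being floated above is to drop the conditionals and simply hold the BQL across every save_setup() call.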
Re: [PATCH 0/9] QEMU file cleanups
On Thu, May 04, 2023 at 04:56:46PM +0200, Juan Quintela wrote: > Peter Xu wrote: > > On Thu, May 04, 2023 at 01:38:32PM +0200, Juan Quintela wrote: > >> - convince and review code to see that everything is uint64_t. > > > > One general question to patches regarding this - what's the major benefit > > of using uint64_t? > > > > It doubles the possible numbers to hold, but it's already 64bits so I don't > > think it matters a lot. > > We were checking for negatives even when that can't be. > And we are doing this dance of > > int64_t x, y; > uint64_t a, b; > > x = a; > b = y; > > This is always confusing and not always right. Yeah this is confusing, but if anything can go wrong with this I assume we could have some bigger problem anyway.. > > > The thing is we're removing some code trying to > > detect negative which seems to be still helpful to detect e.g. overflows > > (even though I don't think it'll happen). I just still think it's good to > > know when overflow happens, and not sure what I missed on benefits of using > > unsigned here. > > If you grep through the code, you see that half of the things are > int64_t and the other half is uint64_t. I find it always confusing. Right, I'm personally curious whether we should just use int64_t always unless necessary. :) Another good thing with int64_t is it's also suitable for error report when used in retvals. But no strong opinion here, I don't think that's a huge deal for now. Having such an alignment on types makes sense to me. Thanks, -- Peter Xu
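A self-contained illustration of why the mixing is error-prone and why the sign matters for sanity checks (plain C, compiles and runs as-is):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int64_t signed_bytes = -1;               /* e.g. an error value that leaked */
        uint64_t unsigned_bytes = signed_bytes;  /* compiles silently, wraps around */

        /* Prints 18446744073709551615: the "negative" is now a huge count. */
        printf("%llu\n", (unsigned long long)unsigned_bytes);

        /* With a signed counter the cheap sanity check still works: */
        if (signed_bytes < 0) {
            printf("negative byte count detected\n");
        }
        return 0;
    }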
Re: [PATCH 0/9] QEMU file cleanups
On Thu, May 04, 2023 at 01:38:32PM +0200, Juan Quintela wrote: > - convince and review code to see that everything is uint64_t. One general question to patches regarding this - what's the major benefit of using uint64_t? It doubles the possible numbers to hold, but it's already 64bits so I don't think it matters a lot. The thing is we're removing some code trying to detect negative which seems to be still helpful to detect e.g. overflows (even though I don't think it'll happen). I just still think it's good to know when overflow happens, and not sure what I missed on benefits of using unsigned here. I've reviewed all the rest patches and all look good here. Thanks, -- Peter Xu
Re: [PATCH 9/9] qemu-file: Account for rate_limit usage on qemu_fflush()
On Thu, May 04, 2023 at 01:38:41PM +0200, Juan Quintela wrote: > That is the moment we know we have transferred something. > > Signed-off-by: Juan Quintela There'll be a slight side effect that qemu_file_rate_limit() can be triggered later than before because data cached in the qemufile won't be accounted until flushed. Two limits here: - IO_BUF_SIZE==32K, the real buffer - MAX_IOV_SIZE==64 (I think), the async buffer to put guest page ptrs directly, on x86 it's 64*4K=256K So the impact should be no more than 288KB on x86. Looks still fine.. Reviewed-by: Peter Xu -- Peter Xu
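Spelling out the bound with the two limits quoted above (the 4KiB page size is an x86 assumption):

    #define IO_BUF_SIZE    (32 * 1024)  /* internal QEMUFile buffer, per the mail above */
    #define MAX_IOV_SIZE   64           /* queued iov entries for guest pages */
    #define X86_PAGE_SIZE  4096

    /* 32KiB + 64 * 4KiB = 288KiB of data the rate limiter may not have
     * accounted for yet at any given time. */
    #define MAX_UNACCOUNTED_BYTES (IO_BUF_SIZE + MAX_IOV_SIZE * X86_PAGE_SIZE)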
Re: [PATCH 8/9] qemu-file: Make ram_control_save_page() use accessors for rate_limit
On Thu, May 04, 2023 at 01:38:40PM +0200, Juan Quintela wrote: > Signed-off-by: Juan Quintela Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH 6/9] qemu-file: remove shutdown member
On Thu, May 04, 2023 at 01:38:38PM +0200, Juan Quintela wrote: > The first thing that we do after setting the shutdown value is set the > error as -EIO if there is not a previous error. > > So this value is reduntant. Just remove it and use > qemu_file_get_error() in the places that it was tested. > > Signed-off-by: Juan Quintela Nit: I'd squash this with previous. Reviewed-by: Peter Xu -- Peter Xu
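The invariant both of these patches lean on can be sketched as below (simplified; the real qemu_file_shutdown() also checks that the channel supports shutdown at all):

    int qemu_file_shutdown(QEMUFile *f)
    {
        int ret = qio_channel_shutdown(f->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);

        /* Once shut down, the file always carries an error, so callers can
         * rely on qemu_file_get_error() instead of a separate flag. */
        if (!qemu_file_get_error(f)) {
            qemu_file_set_error(f, -EIO);
        }
        return ret;
    }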
Re: [PATCH 5/9] qemu-file: No need to check for shutdown in qemu_file_rate_limit
On Thu, May 04, 2023 at 01:38:37PM +0200, Juan Quintela wrote: > After calling qemu_file_shutdown() we set the error as -EIO if there > is no another previous error, so no need to check it here. > > Signed-off-by: Juan Quintela Reviewed-by: Peter Xu -- Peter Xu
Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
../migration/savevm.c:2954 > >> #15 0x558ed3e17fb1 in snapshot_save_job_bh (opaque=0x558ed68c5170) at > >> ../migration/savevm.c:3253 > >> #16 0x558ed42f050a in aio_bh_call (bh=0x558ed671ae00) at > >> ../util/async.c:155 > >> #17 0x558ed42f0615 in aio_bh_poll (ctx=0x558ed5c62910) at > >> ../util/async.c:184 > >> #18 0x558ed42d47b8 in aio_poll (ctx=0x558ed5c62910, blocking=true) at > >> ../util/aio-posix.c:721 > >> #19 0x558ed412df1c in bdrv_poll_co (s=0x7f8ffcff3eb0) at > >> /home/febner/repos/qemu/block/block-gen.h:43 > >> #20 0x558ed4130c3a in blk_pwrite (blk=0x558ed5ed4f60, > >> offset=230912, bytes=512, buf=0x7f8ffc438600, flags=0) at > >> block/block-gen.c:1650 > >> #21 0x558ed3ba9078 in pflash_update (pfl=0x558ed5eb7b30, > >> offset=230912, size=1) at ../hw/block/pflash_cfi01.c:394 > >> #22 0x558ed3ba9749 in pflash_write (pfl=0x558ed5eb7b30, > >> offset=231232, value=0, width=1, be=0) at > >> ../hw/block/pflash_cfi01.c:522 > >> #23 0x558ed3ba9cda in pflash_mem_write_with_attrs > >> (opaque=0x558ed5eb7b30, addr=231232, value=0, len=1, attrs=...) at > >> ../hw/block/pflash_cfi01.c:681 > >> #24 0x558ed402a36a in memory_region_write_with_attrs_accessor > >> (mr=0x558ed5eb7ef0, addr=231232, value=0x7f8ffcff40c8, size=1, > >> shift=0, mask=255, attrs=...) at ../softmmu/memory.c:514 > >> #25 0x558ed402a4a9 in access_with_adjusted_size (addr=231232, > >> value=0x7f8ffcff40c8, size=1, access_size_min=1, access_size_max=4, > >> access_fn=0x558ed402a270 , > >> mr=0x558ed5eb7ef0, attrs=...) at ../softmmu/memory.c:555 > >> #26 0x558ed402d5de in memory_region_dispatch_write > >> (mr=0x558ed5eb7ef0, addr=231232, data=0, op=MO_8, attrs=...) at > >> ../softmmu/memory.c:1522 > >> #27 0x558ed403a6f4 in flatview_write_continue > >> (fv=0x558ed66d62c0, addr=4291004224, attrs=..., ptr=0x7f9029957028, > >> len=1, addr1=231232, l=1, mr=0x558ed5eb7ef0) at > >> ../softmmu/physmem.c:2641 > >> #28 0x558ed403a857 in flatview_write (fv=0x558ed66d62c0, > >> addr=4291004224, attrs=..., buf=0x7f9029957028, len=1) at > >> ../softmmu/physmem.c:2683 > >> #29 0x558ed403ac07 in address_space_write (as=0x558ed4ca2b20 > >> , addr=4291004224, attrs=..., > >> buf=0x7f9029957028, len=1) at ../softmmu/physmem.c:2779 > >> #30 0x558ed403ac74 in address_space_rw (as=0x558ed4ca2b20 > >> , addr=4291004224, attrs=..., > >> buf=0x7f9029957028, len=1, is_write=true) at > >> ../softmmu/physmem.c:2789 > >> #31 0x558ed40cea88 in kvm_cpu_exec (cpu=0x558ed622a910) at > >> ../accel/kvm/kvm-all.c:2989 > >> #32 0x558ed40d179a in kvm_vcpu_thread_fn (arg=0x558ed622a910) at > >> ../accel/kvm/kvm-accel-ops.c:51 > >> #33 0x558ed42d925f in qemu_thread_start (args=0x558ed5c68c80) at > >> ../util/qemu-thread-posix.c:541 > >> #34 0x7f9028ab7ea7 in start_thread (arg=) at > >> pthread_create.c:477 > >> #35 0x7f9027c18a2f in clone () at > >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Totally unfamiliar with block jobs, but... does it mean that snapshot_*_job_bh()s should just always make sure BQL taken? Thanks, -- Peter Xu
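If the answer is yes, the change might be as small as the sketch below (the BH name comes from the trace above; the body is schematic, and whether the BH can simply take the lock depends on what its callers already hold):

    static void snapshot_save_job_bh(void *opaque)
    {
        /* Hold the BQL for the whole BH, matching the locking the
         * savevm/vmstate code expects when called from the monitor. */
        qemu_mutex_lock_iothread();
        /* ... existing snapshot-save work (save_snapshot() etc.) ... */
        qemu_mutex_unlock_iothread();
    }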
Re: [PULL 00/26] Next patches
On Sat, Feb 04, 2023 at 10:19:34AM +, Peter Maydell wrote: > On Thu, 2 Feb 2023 at 16:07, Juan Quintela wrote: > > > > The following changes since commit deabea6e88f7c4c3c12a36ee30051c6209561165: > > > > Merge tag 'for_upstream' of > > https://git.kernel.org/pub/scm/virt/kvm/mst/qemu into staging (2023-02-02 > > 10:10:07 +) > > > > are available in the Git repository at: > > > > https://gitlab.com/juan.quintela/qemu.git tags/next-pull-request > > > > for you to fetch changes up to 5ee6d3d1eeccd85aa2a835e82b8d9e1b4f7441e1: > > > > migration: check magic value for deciding the mapping of channels > > (2023-02-02 17:04:16 +0100) > > > > > > Migration PULL request, new try > > Fails to build on anything that isn't Linux: > > In file included from ../migration/postcopy-ram.c:40: > /private/var/folders/76/zy5ktkns50v6gt5g8r0sf6scgn/T/cirrus-ci-build/include/qemu/userfaultfd.h:18:10: > fatal error: 'linux/userfaultfd.h' file not found Oops, my fault. Juan, please feel free to drop patch "util/userfaultfd: Add uffd_open()". I'll respin with the whole set. -- Peter Xu
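One possible direction for the respin, sketched only (whether an include guard like this or a meson-level condition is the right fix depends on how the rest of the series uses the helpers; CONFIG_LINUX is QEMU's usual guard for Linux-only code):

    /* include/qemu/userfaultfd.h */
    #ifdef CONFIG_LINUX
    #include <linux/userfaultfd.h>

    int uffd_open(int flags);
    /* ... remaining uffd helpers ... */
    #endif /* CONFIG_LINUX */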