Re: [PATCH] mac_dbdma: Remove leftover `dma_memory_unmap` calls
On Tue, Sep 17, 2024 at 08:25:06AM +0200, Mattias Nissler wrote: > Mark, thanks for testing and confirming that this doesn't cause any > obvious breakage. > > For my curiosity, which path should this patch take to get into > master? Peter, are you going to respin your pull request with this > included? The pull has already landed in the master branch, so there'll be no respin. And considering this is a fix from the device side, it may make more sense for the maintainer to take it, which points to John Snow here. John, would you take it via your tree? Thanks, -- Peter Xu
Re: [PATCH 21/39] migration: replace assert(false) with g_assert_not_reached()
On Tue, Sep 10, 2024 at 03:15:48PM -0700, Pierrick Bouvier wrote: > Signed-off-by: Pierrick Bouvier Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH 08/39] migration: replace assert(0) with g_assert_not_reached()
On Tue, Sep 10, 2024 at 03:15:35PM -0700, Pierrick Bouvier wrote: > Signed-off-by: Pierrick Bouvier Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v2 01/19] qapi: Smarter camel_to_upper() to reduce need for 'prefix'
On Thu, Sep 05, 2024 at 04:52:36PM +0100, Daniel P. Berrangé wrote: > On Thu, Sep 05, 2024 at 07:59:13AM +0200, Markus Armbruster wrote: > > Daniel P. Berrangé writes: > > > > > On Wed, Sep 04, 2024 at 01:18:18PM +0200, Markus Armbruster wrote: > > >> camel_to_upper() converts its argument from camel case to upper case > > >> with '_' between words. Used for generated enumeration constant > > >> prefixes. > > > > > > > > >> > > >> Signed-off-by: Markus Armbruster > > >> Reviewed-by: Daniel P. Berrang?? > > > > > > The accent in my name is getting mangled in this series. > > > > Uh-oh! > > > > Checking... Hmm. It's correct in git, correct in output of > > git-format-patch, correct in the copy I got from git-send-email --bcc > > armbru via localhost MTA, and the copy I got from --to > > qemu-de...@nongnu.org, correct in lore.kernel.org[*], correct in an mbox > > downloaded from patchew. > > > > Could the culprit be on your side? > > I compared my received mail vs the mbox archive on nongnu.org for > qemu-devel. > > In both cases the actual mail body seems to be valid UTF-8 and is > identical. The message in the nongnu.org archive, however, has > > Content-Type: text/plain; charset=UTF-8 > > while the copy I got in my inbox has merely > > Content-Type: text/plain > > What I can't determine is whether your original sent message > had "charset=UTF-8" which then got stripped by redhat's incoming > mail server, or whether your original lacked 'charset=UTF8' and > it got added by mailman when saving the message to the mbox archives ? I didn't read into the details of what Markus hit, but I just remembered I hit similar things before and Dan reported a similar issue. At that time (as far as I can recall) it was because I used git-publish to send patches, which had an encoding issue. I tried to fix it with this branch: https://github.com/xzpeter/git-publish/commits/fix-print-2/ I also remember I tried to upstream that to Stefan's repo but I totally forgot what happened later; the result is I am still using this branch internally (I had completely forgotten which version I'm using... I only noticed when I saw this discussion and checked). Please ignore everything if git-publish is not used at all.. but just in case it's helpful.. -- Peter Xu
Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion
On Fri, Aug 30, 2024 at 03:11:38PM +0900, Akihiko Odaki wrote: > On 2024/08/30 4:48, Peter Xu wrote: > > On Thu, Aug 29, 2024 at 01:39:36PM +0900, Akihiko Odaki wrote: > > > > > I am calling the fact that embedded memory regions are accessible in > > > > > instance_finalize() "live". A device can perform operations on its > > > > > memory > > > > > regions during instance_finalize() and we should be aware of that. > > > > > > > > This part is true. I suppose we should still suggest device finalize() > > > > to > > > > properly detach MRs, and that should normally be done there. > > > > > > It is better to avoid manual resource deallocation in general because it > > > is > > > too error-prone. > > > > I had an impression that you mixed up "finalize()" and "free()" in the > > context of our discussion.. let us clarify this first before everything > > else below, just in case I overlook stuff.. > I am aware of that distinction. I dealt with it with patch "virtio-gpu: > Handle resource blob commands": > https://lore.kernel.org/r/20240822185110.1757429-12-dmitry.osipe...@collabora.com > > > > > MR is very special as an object, in that it should have no free() hook, > > hence by default nothing is going to be freed when mr->refcount==0. It > > means MRs need to be freed manually always. > > > > For example: > > > > (gdb) p system_memory->parent_obj->free > > $2 = (ObjectFree *) 0x0 > > > > It plays perfect because the majority of QEMU device model is using MR as a > > field (rather than a pointer) of another structure, so that's exactly what > > we're looking for: we don't want to free() the MR as it's allocated > > together with the owner device. That'll be released when the owner free(). > > > > When dynamic allocation gets into the picture for MR, it's more complicated > > for sure, because it means the user (like VFIOQuirk) will need to manually > > allocate the MRs, then it requires explicit object_unparent() to detach > > that from the device / owner when finalize(). NOTE! object_unparent() > > will NOT free the MR yet so far. The MR still need to be manually freed > > with an explicit g_free(), normally. Again, I'd suggest you refer to the > > VFIOQuirk code just as an example. In that case this part is done with > > e.g. vfio_bar_quirk_finalize(). > > > > for (i = 0; i < quirk->nr_mem; i++) { > > object_unparent(OBJECT(&quirk->mem[i])); > > } > > g_free(quirk->mem); > > > > Here quirk->mem is a pointer to an array of MR which can contain one or > > more MRs, but the idea is the same. > > > > > > > > I have an impression with QEMU code base that it is failing manual > > > resource > > > deallocation so frequently although such deallocation can be easily > > > automated by binding resources to objects and free them when objects die > > > by > > > providing a function like Linux's devm_kmalloc(). Unfortunately I haven't > > > found time to do that though. > > > > AFAICT, the property list is exactly what you're saying. IIUC normally an > > object will be properly finalized()ed + free()ed when the parent object is > > finalize()ed. Here MR is just special as it bypasses all the free() part. > > > > > > > > So my opinion here is 1) we should automate resource deallocation as much > > > as > > > possible but 2) we shouldn't disturb code that performs manual resource > > > management. > > > > Agree. However note again that in whatever case cross-device MR links will > > still require explicit detachment or it's prone to memory leak. 
> > I am not sure what you refer to with cross-device MR links so can you > clarify it? I was referring to Peter Maydell's example, where the MR can be used outside of its owner. In that case manual operation is a must before finalize(), as finalize() can only make sense to resolve internal links automatically. But now knowing that you're explicitly mentioning "deallocation" rather than "finalize()", and that you're aware of the difference between deallocation vs. finalize(), I suppose I misunderstood what you meant, and now I'm not sure I get what you're suggesting. > > > > > > > > > > instance_finalize() is for manual resource management. It is better to > >
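A minimal sketch of the dynamically-allocated MR lifecycle described above, assuming hypothetical MyDeviceState / MY_DEVICE / my_device_* names; only memory_region_init(), object_unparent() and g_free() are the real APIs being illustrated, mirroring the VFIOQuirk pattern:

#include "qemu/osdep.h"
#include "hw/qdev-core.h"
#include "exec/memory.h"

typedef struct MyDeviceState {
    DeviceState parent_obj;
    MemoryRegion *dyn_mr;    /* dynamically allocated, unlike embedded MRs */
} MyDeviceState;

static void my_device_realize(DeviceState *dev, Error **errp)
{
    MyDeviceState *s = MY_DEVICE(dev);

    s->dyn_mr = g_new0(MemoryRegion, 1);
    /* The device becomes the owner; the MR is attached as a child property. */
    memory_region_init(s->dyn_mr, OBJECT(dev), "my-dyn-mr", 0x1000);
}

static void my_device_finalize(Object *obj)
{
    MyDeviceState *s = MY_DEVICE(obj);

    /* Detach the MR from the owner's property list; this does NOT free it... */
    object_unparent(OBJECT(s->dyn_mr));
    /* ...so, like VFIOQuirk's quirk->mem, it still has to be freed manually. */
    g_free(s->dyn_mr);
}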
Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion
object_unparent() in > instance_finalize() is fine is wrong because the memory region is already > finalized and we shouldn't call object_unparent() for a finalized object. > > This also means a device does not have a control of its memory regions > during finalization and the device will not see the side effect of calling > memory_region_del_subregion(). However, if a device really needs to control > the order of memory region finalization, it can still call object_ref() for > memory regions to keep them alive until instance_finalize(). The effect of > memory_region_del_subregion() will be visible to the device if a device > does such a trick. While it is very unlikely that any device does such a > trick, it is still better to avoid having any assumption on devices. > > With the proposed object_ref(mr->owner == subregion->owner ? > OBJECT(subregion) : subregion->owner) call, we don't have to make any > assumption on how finalization works, which is too complicated as this > discussion shows. It certainly makes the use of two reference counters, but > it does not deviate from the current policy on their usage and it is also > trivial to adapt when the reference counter in memory region gets removed. > Considering all above, I believe my patch is the most straightforward > solution. Let's see how you think after you read the above first. I hope we can get more onto the same page, at least on the context. So far I still prefer using mr->refcount as little as possible. Now it plays the only role of "whether this MR is put onto an owner property list", and all the rest of the refcounts on an MR should always route to the owner. I still think we need to be extremely cautious about further complicating this refcount. It's already complicated indeed, but hopefully this model is the best we can trivially have right now, and so far it's clear to me, but it's always possible I overlooked something. Thanks, -- Peter Xu
Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion
On Wed, Aug 28, 2024 at 11:02:06PM +0900, Akihiko Odaki wrote: > On 2024/08/28 22:09, Peter Xu wrote: > > On Wed, Aug 28, 2024 at 02:33:59PM +0900, Akihiko Odaki wrote: > > > On 2024/08/28 1:11, Peter Xu wrote: > > > > On Tue, Aug 27, 2024 at 01:14:51PM +0900, Akihiko Odaki wrote: > > > > > On 2024/08/27 4:42, Peter Xu wrote: > > > > > > On Mon, Aug 26, 2024 at 06:10:25PM +0100, Peter Maydell wrote: > > > > > > > On Mon, 26 Aug 2024 at 16:22, Peter Xu wrote: > > > > > > > > > > > > > > > > On Fri, Aug 23, 2024 at 03:13:11PM +0900, Akihiko Odaki wrote: > > > > > > > > > memory_region_update_container_subregions() used to call > > > > > > > > > memory_region_ref(), which creates a reference to the owner > > > > > > > > > of the > > > > > > > > > subregion, on behalf of the owner of the container. This > > > > > > > > > results in a > > > > > > > > > circular reference if the subregion and container have the > > > > > > > > > same owner. > > > > > > > > > > > > > > > > > > memory_region_ref() creates a reference to the owner instead > > > > > > > > > of the > > > > > > > > > memory region to match the lifetime of the owner and memory > > > > > > > > > region. We > > > > > > > > > do not need such a hack if the subregion and container have > > > > > > > > > the same > > > > > > > > > owner because the owner will be alive as long as the > > > > > > > > > container is. > > > > > > > > > Therefore, create a reference to the subregion itself instead > > > > > > > > > ot its > > > > > > > > > owner in such a case; the reference to the subregion is still > > > > > > > > > necessary > > > > > > > > > to ensure that the subregion gets finalized after the > > > > > > > > > container. > > > > > > > > > > > > > > > > > > Signed-off-by: Akihiko Odaki > > > > > > > > > --- > > > > > > > > > system/memory.c | 8 ++-- > > > > > > > > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/system/memory.c b/system/memory.c > > > > > > > > > index 5e6eb459d5de..e4d3e9d1f427 100644 > > > > > > > > > --- a/system/memory.c > > > > > > > > > +++ b/system/memory.c > > > > > > > > > @@ -2612,7 +2612,9 @@ static void > > > > > > > > > memory_region_update_container_subregions(MemoryRegion > > > > > > > > > *subregion) > > > > > > > > > > > > > > > > > > memory_region_transaction_begin(); > > > > > > > > > > > > > > > > > > -memory_region_ref(subregion); > > > > > > > > > +object_ref(mr->owner == subregion->owner ? > > > > > > > > > + OBJECT(subregion) : subregion->owner); > > > > > > > > > > > > > > > > The only place that mr->refcount is used so far is the owner > > > > > > > > with the > > > > > > > > object property attached to the mr, am I right (ignoring > > > > > > > > name-less MRs)? > > > > > > > > > > > > > > > > I worry this will further complicate refcounting, now we're > > > > > > > > actively using > > > > > > > > two refcounts for MRs.. > > > > > > > > > > The actor of object_ref() is the owner of the memory region also in > > > > > this > > > > > case. We are calling object_ref() on behalf of mr->owner so we use > > > > > mr->refcount iff mr->owner == subregion->owner. In this sense there > > > > > is only > > > > > one user of mr->refcount even after this change. > > > > > > > > Yes it's still one user, b
Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion
On Wed, Aug 28, 2024 at 02:33:59PM +0900, Akihiko Odaki wrote: > On 2024/08/28 1:11, Peter Xu wrote: > > On Tue, Aug 27, 2024 at 01:14:51PM +0900, Akihiko Odaki wrote: > > > On 2024/08/27 4:42, Peter Xu wrote: > > > > On Mon, Aug 26, 2024 at 06:10:25PM +0100, Peter Maydell wrote: > > > > > On Mon, 26 Aug 2024 at 16:22, Peter Xu wrote: > > > > > > > > > > > > On Fri, Aug 23, 2024 at 03:13:11PM +0900, Akihiko Odaki wrote: > > > > > > > memory_region_update_container_subregions() used to call > > > > > > > memory_region_ref(), which creates a reference to the owner of the > > > > > > > subregion, on behalf of the owner of the container. This results > > > > > > > in a > > > > > > > circular reference if the subregion and container have the same > > > > > > > owner. > > > > > > > > > > > > > > memory_region_ref() creates a reference to the owner instead of > > > > > > > the > > > > > > > memory region to match the lifetime of the owner and memory > > > > > > > region. We > > > > > > > do not need such a hack if the subregion and container have the > > > > > > > same > > > > > > > owner because the owner will be alive as long as the container is. > > > > > > > Therefore, create a reference to the subregion itself instead ot > > > > > > > its > > > > > > > owner in such a case; the reference to the subregion is still > > > > > > > necessary > > > > > > > to ensure that the subregion gets finalized after the container. > > > > > > > > > > > > > > Signed-off-by: Akihiko Odaki > > > > > > > --- > > > > > > >system/memory.c | 8 ++-- > > > > > > >1 file changed, 6 insertions(+), 2 deletions(-) > > > > > > > > > > > > > > diff --git a/system/memory.c b/system/memory.c > > > > > > > index 5e6eb459d5de..e4d3e9d1f427 100644 > > > > > > > --- a/system/memory.c > > > > > > > +++ b/system/memory.c > > > > > > > @@ -2612,7 +2612,9 @@ static void > > > > > > > memory_region_update_container_subregions(MemoryRegion *subregion) > > > > > > > > > > > > > >memory_region_transaction_begin(); > > > > > > > > > > > > > > -memory_region_ref(subregion); > > > > > > > +object_ref(mr->owner == subregion->owner ? > > > > > > > + OBJECT(subregion) : subregion->owner); > > > > > > > > > > > > The only place that mr->refcount is used so far is the owner with > > > > > > the > > > > > > object property attached to the mr, am I right (ignoring name-less > > > > > > MRs)? > > > > > > > > > > > > I worry this will further complicate refcounting, now we're > > > > > > actively using > > > > > > two refcounts for MRs.. > > > > > > The actor of object_ref() is the owner of the memory region also in this > > > case. We are calling object_ref() on behalf of mr->owner so we use > > > mr->refcount iff mr->owner == subregion->owner. In this sense there is > > > only > > > one user of mr->refcount even after this change. > > > > Yes it's still one user, but it's not that straightforward to see, also > > it's still an extension to how we use mr->refcount right now. Currently > > it's about "true / false" just to describe, now it's a real counter. > > > > I wished that counter doesn't even exist if we'd like to stick with device > > / owner's counter. Adding this can definitely also make further effort > > harder if we want to remove mr->refcount. > > I don't think it will make removing mr->refcount harder. With this change, > mr->refcount will count the parent and container. 
If we remove mr->refcount, > we need to trigger object_finalize() in a way other than checking > mr->refcount, which can be achieved by simply evaluating OBJECT(mr)->parent > && mr->container.
Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion
On Tue, Aug 27, 2024 at 01:14:51PM +0900, Akihiko Odaki wrote: > On 2024/08/27 4:42, Peter Xu wrote: > > On Mon, Aug 26, 2024 at 06:10:25PM +0100, Peter Maydell wrote: > > > On Mon, 26 Aug 2024 at 16:22, Peter Xu wrote: > > > > > > > > On Fri, Aug 23, 2024 at 03:13:11PM +0900, Akihiko Odaki wrote: > > > > > memory_region_update_container_subregions() used to call > > > > > memory_region_ref(), which creates a reference to the owner of the > > > > > subregion, on behalf of the owner of the container. This results in a > > > > > circular reference if the subregion and container have the same owner. > > > > > > > > > > memory_region_ref() creates a reference to the owner instead of the > > > > > memory region to match the lifetime of the owner and memory region. We > > > > > do not need such a hack if the subregion and container have the same > > > > > owner because the owner will be alive as long as the container is. > > > > > Therefore, create a reference to the subregion itself instead ot its > > > > > owner in such a case; the reference to the subregion is still > > > > > necessary > > > > > to ensure that the subregion gets finalized after the container. > > > > > > > > > > Signed-off-by: Akihiko Odaki > > > > > --- > > > > > system/memory.c | 8 ++-- > > > > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/system/memory.c b/system/memory.c > > > > > index 5e6eb459d5de..e4d3e9d1f427 100644 > > > > > --- a/system/memory.c > > > > > +++ b/system/memory.c > > > > > @@ -2612,7 +2612,9 @@ static void > > > > > memory_region_update_container_subregions(MemoryRegion *subregion) > > > > > > > > > > memory_region_transaction_begin(); > > > > > > > > > > -memory_region_ref(subregion); > > > > > +object_ref(mr->owner == subregion->owner ? > > > > > + OBJECT(subregion) : subregion->owner); > > > > > > > > The only place that mr->refcount is used so far is the owner with the > > > > object property attached to the mr, am I right (ignoring name-less MRs)? > > > > > > > > I worry this will further complicate refcounting, now we're actively > > > > using > > > > two refcounts for MRs.. > > The actor of object_ref() is the owner of the memory region also in this > case. We are calling object_ref() on behalf of mr->owner so we use > mr->refcount iff mr->owner == subregion->owner. In this sense there is only > one user of mr->refcount even after this change. Yes it's still one user, but it's not that straightforward to see, also it's still an extension to how we use mr->refcount right now. Currently it's about "true / false" just to describe, now it's a real counter. I wished that counter doesn't even exist if we'd like to stick with device / owner's counter. Adding this can definitely also make further effort harder if we want to remove mr->refcount. > > > > > > > > > Continue discussion there: > > > > > > > > https://lore.kernel.org/r/067b17a4-cdfc-4f7e-b7e4-28c38e1c1...@daynix.com > > > > > > > > What I don't see is how mr->subregions differs from mr->container, so we > > > > allow subregions to be attached but not the container when finalize() > > > > (which is, afaict, the other way round). > > > > > > > > It seems easier to me that we allow both container and subregions to > > > > exist > > > > as long as within the owner itself, rather than start heavier use of > > > > mr->refcount. 
> > > > > > I don't think just "same owner" necessarily will be workable -- > > > you can have a setup like: > > >* device A has a container C_A > > >* device A has a child-device B > > >* device B has a memory region R_B > > >* device A's realize method puts R_B into C_A > > > > > > R_B's owner is B, and the container's owner is A, > > > but we still want to be able to get rid of A (in the process > > > getting rid of B because it gets unparented and unreffed, > > > and R_B and C_A also). > > > > For cross-device references, s
Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion
On Mon, Aug 26, 2024 at 06:10:25PM +0100, Peter Maydell wrote: > On Mon, 26 Aug 2024 at 16:22, Peter Xu wrote: > > > > On Fri, Aug 23, 2024 at 03:13:11PM +0900, Akihiko Odaki wrote: > > > memory_region_update_container_subregions() used to call > > > memory_region_ref(), which creates a reference to the owner of the > > > subregion, on behalf of the owner of the container. This results in a > > > circular reference if the subregion and container have the same owner. > > > > > > memory_region_ref() creates a reference to the owner instead of the > > > memory region to match the lifetime of the owner and memory region. We > > > do not need such a hack if the subregion and container have the same > > > owner because the owner will be alive as long as the container is. > > > Therefore, create a reference to the subregion itself instead ot its > > > owner in such a case; the reference to the subregion is still necessary > > > to ensure that the subregion gets finalized after the container. > > > > > > Signed-off-by: Akihiko Odaki > > > --- > > > system/memory.c | 8 ++-- > > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > > > diff --git a/system/memory.c b/system/memory.c > > > index 5e6eb459d5de..e4d3e9d1f427 100644 > > > --- a/system/memory.c > > > +++ b/system/memory.c > > > @@ -2612,7 +2612,9 @@ static void > > > memory_region_update_container_subregions(MemoryRegion *subregion) > > > > > > memory_region_transaction_begin(); > > > > > > -memory_region_ref(subregion); > > > +object_ref(mr->owner == subregion->owner ? > > > + OBJECT(subregion) : subregion->owner); > > > > The only place that mr->refcount is used so far is the owner with the > > object property attached to the mr, am I right (ignoring name-less MRs)? > > > > I worry this will further complicate refcounting, now we're actively using > > two refcounts for MRs.. > > > > Continue discussion there: > > > > https://lore.kernel.org/r/067b17a4-cdfc-4f7e-b7e4-28c38e1c1...@daynix.com > > > > What I don't see is how mr->subregions differs from mr->container, so we > > allow subregions to be attached but not the container when finalize() > > (which is, afaict, the other way round). > > > > It seems easier to me that we allow both container and subregions to exist > > as long as within the owner itself, rather than start heavier use of > > mr->refcount. > > I don't think just "same owner" necessarily will be workable -- > you can have a setup like: > * device A has a container C_A > * device A has a child-device B > * device B has a memory region R_B > * device A's realize method puts R_B into C_A > > R_B's owner is B, and the container's owner is A, > but we still want to be able to get rid of A (in the process > getting rid of B because it gets unparented and unreffed, > and R_B and C_A also). For cross-device references, should we rely on an explicit call to memory_region_del_subregion(), so as to detach the link between C_A and R_B? My understanding so far: logically when MR finalize() it should guarantee both (1) mr->container==NULL, and (2) mr->subregions empty. That's before commit 2e2b8eb70fdb7dfb and could be the ideal world (though at the very beginning we don't assert on ->container==NULL yet). It requires all device emulations to do proper unrealize() to unlink all the MRs. However what I'm guessing is QEMU probably used to have lots of devices that are not following the rules and leaking these links. 
Hence we have had 2e2b8eb70fdb7dfb, allowing that to happen as long as it's safe, and it's justified by comment in 2e2b8eb70fdb7dfb on why it's safe. What I was thinking is this comment seems to apply too to mr->container, so that it should be safe too to unlink ->container the same way as its own subregions. IIUC that means for device-internal MR links we should be fine leaving whatever link between MRs owned by such device; the device->refcount guarantees none of them will be visible in any AS. But then we need to always properly unlink the MRs when the link is across >1 device owners, otherwise it's prone to leak. -- Peter Xu
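A rough sketch of the explicit detachment discussed above for the cross-device case, using Peter Maydell's C_A / R_B example; all type and field names here are hypothetical, with memory_region_add_subregion() / memory_region_del_subregion() being the real calls:

/* Device A's realize puts R_B (owned by child device B) into its container C_A. */
static void device_a_realize(DeviceState *dev, Error **errp)
{
    DeviceAState *a = DEVICE_A(dev);

    memory_region_add_subregion(&a->c_a, 0, &a->child_b.r_b);
}

/* Cross-owner link: it must be undone explicitly (e.g. at unrealize),
 * otherwise the link is prone to leak as described above. */
static void device_a_unrealize(DeviceState *dev)
{
    DeviceAState *a = DEVICE_A(dev);

    memory_region_del_subregion(&a->c_a, &a->child_b.r_b);
}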
Re: [PATCH v4 5/7] memory: Clarify owner must not call memory_region_ref()
On Fri, Aug 23, 2024 at 03:13:10PM +0900, Akihiko Odaki wrote: > Signed-off-by: Akihiko Odaki > --- > include/exec/memory.h | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/include/exec/memory.h b/include/exec/memory.h > index d79415a3b159..6698e9d05eab 100644 > --- a/include/exec/memory.h > +++ b/include/exec/memory.h > @@ -1220,6 +1220,7 @@ void memory_region_init(MemoryRegion *mr, > * memory_region_ref: Add a reference to a memory region > * > * This function adds a reference to the owner if present. > + * The owner must not call this function as it results in a circular > reference. > * See docs/devel/memory.rst to know about owner. > * > * @mr: the #MemoryRegion Please consider merging all four doc pre-update patches into one. Thanks, -- Peter Xu
Re: [PATCH v4 7/7] tests/qtest: Delete previous boot file
On Fri, Aug 23, 2024 at 03:13:12PM +0900, Akihiko Odaki wrote: > A test run may create boot files several times. Delete the previous boot > file before creating a new one. > > Signed-off-by: Akihiko Odaki > Reviewed-by: Michael S. Tsirkin > Acked-by: Thomas Huth I didn't track which came first, but I think Fabiano has queued the other one here: https://lore.kernel.org/r/20240820144912.320744-2-peter.mayd...@linaro.org https://gitlab.com/farosas/qemu/-/commits/migration-staging/ So we should be good. -- Peter Xu
Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion
On Fri, Aug 23, 2024 at 03:13:11PM +0900, Akihiko Odaki wrote: > memory_region_update_container_subregions() used to call > memory_region_ref(), which creates a reference to the owner of the > subregion, on behalf of the owner of the container. This results in a > circular reference if the subregion and container have the same owner. > > memory_region_ref() creates a reference to the owner instead of the > memory region to match the lifetime of the owner and memory region. We > do not need such a hack if the subregion and container have the same > owner because the owner will be alive as long as the container is. > Therefore, create a reference to the subregion itself instead ot its > owner in such a case; the reference to the subregion is still necessary > to ensure that the subregion gets finalized after the container. > > Signed-off-by: Akihiko Odaki > --- > system/memory.c | 8 ++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/system/memory.c b/system/memory.c > index 5e6eb459d5de..e4d3e9d1f427 100644 > --- a/system/memory.c > +++ b/system/memory.c > @@ -2612,7 +2612,9 @@ static void > memory_region_update_container_subregions(MemoryRegion *subregion) > > memory_region_transaction_begin(); > > -memory_region_ref(subregion); > +object_ref(mr->owner == subregion->owner ? > + OBJECT(subregion) : subregion->owner); The only place that mr->refcount is used so far is the owner with the object property attached to the mr, am I right (ignoring name-less MRs)? I worry this will further complicate refcounting, now we're actively using two refcounts for MRs.. Continue discussion there: https://lore.kernel.org/r/067b17a4-cdfc-4f7e-b7e4-28c38e1c1...@daynix.com What I don't see is how mr->subregions differs from mr->container, so we allow subregions to be attached but not the container when finalize() (which is, afaict, the other way round). It seems easier to me that we allow both container and subregions to exist as long as within the owner itself, rather than start heavier use of mr->refcount. I tend to agree with you in another thread, where you mentioned it's better we get rid of one of the refcounts. If not trivial to get, we should still try to stick with one refcount to make it less chaos. > + > QTAILQ_FOREACH(other, &mr->subregions, subregions_link) { > if (subregion->priority >= other->priority) { > QTAILQ_INSERT_BEFORE(other, subregion, subregions_link); > @@ -2670,7 +2672,9 @@ void memory_region_del_subregion(MemoryRegion *mr, > assert(alias->mapped_via_alias >= 0); > } > QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link); > -memory_region_unref(subregion); > +object_unref(mr->owner == subregion->owner ? > + OBJECT(subregion) : subregion->owner); > + > memory_region_update_pending |= mr->enabled && subregion->enabled; > memory_region_transaction_commit(); > } > > -- > 2.46.0 > -- Peter Xu
Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)
On Thu, Jul 11, 2024 at 11:44:08AM -0300, Fabiano Rosas wrote: > But of course, that means we cannot claim to support all kinds of > forward migrations anymore. Only those in the 6 year period. That "6 years" comes from the machine type deprecation period, and migration compatibility is mostly only attached to machine types; we only ever allowed migration with the same machine type. It means >6 years migration will never work anyway as soon as we start to deprecate machine types (regardless of any reuse of UNUSED), because the >6 years machine types will simply be gone.. See configuration_post_load(), where it'll throw an error upfront when the machine type is mismatched. -- Peter Xu
Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)
On Thu, Jul 11, 2024 at 10:34:12AM -0300, Fabiano Rosas wrote: > Is there an easy way to look at a field and tell in which machine type's > timeframe it was introduced? I am not aware of any. > If the machine type of that era has been removed, then the field is free > to go as well. I'd prefer if we had a hard link instead of just counting > years. Maybe we should to that mapping at the machine deprecation time? > As in, "look at the unused fields introduced in that timeframe and mark > them free". We can do that, but it depends on how easy it would be. That could be overkill to me if it's non-trivial. When it becomes complicated, I'd rather make the machine compat property easier to use so we always stick with that. Currently it's not as easy to use. Maybe we shouldn't make it a common rule to let people reuse the UNUSED fields, even if in this case it's probably fine? E.g. I don't think it's a huge deal to keep all UNUSED fields forever - sending 512B zeros for only one specific device isn't an issue even if kept forever. If "over 6 years" would be okay and simple enough, then maybe we can stick with that (and only if people would like to reuse a field and ask; that's after all not required..). If more than that, I doubt whether we should spend time working on covering all the fields. -- Peter Xu
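For reference, a sketch of the machine compat property route mentioned above, assuming a hypothetical "migrate-vendor-data" property and SDState field; the exact array entry is illustrative only:

/* In hw/sd/sd.c: expose a switch for the new state (hypothetical names). */
static Property sd_properties[] = {
    DEFINE_PROP_BOOL("migrate-vendor-data", SDState, migrate_vendor_data, true),
    DEFINE_PROP_END_OF_LIST(),
};

/* And an entry appended to the compat array of the last machine version
 * predating the change, e.g. in hw/core/machine.c: */
GlobalProperty hw_compat_9_0[] = {
    { "sd-card", "migrate-vendor-data", "false" },
};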
Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)
On Wed, Jul 10, 2024 at 06:38:26PM -0300, Fabiano Rosas wrote: > Peter Xu writes: > > > On Wed, Jul 10, 2024 at 04:48:23PM -0300, Fabiano Rosas wrote: > >> Peter Xu writes: > >> > >> > On Wed, Jul 10, 2024 at 01:21:51PM -0300, Fabiano Rosas wrote: > >> >> It's not about trust, we simply don't support migrations other than > >> >> n->n+1 and (maybe) n->n-1. So QEMU from 2016 is certainly not included. > >> > > >> > Where does it come from? I thought we suppport that.. > >> > >> I'm taking that from: > >> > >> docs/devel/migration/main.rst: > >> "In general QEMU tries to maintain forward migration compatibility > >> (i.e. migrating from QEMU n->n+1) and there are users who benefit from > >> backward compatibility as well." > >> > >> But of course it doesn't say whether that comes with a transitive rule > >> allowing n->n+2 migrations. > > > > I'd say that "i.e." implies n->n+1 is not the only forward migration we > > would support. > > > > I _think_ we should support all forward migration as long as the machine > > type matches. > > > >> > >> > > >> > The same question would be: are we requesting an OpenStack cluster to > >> > always upgrade QEMU with +1 versions, otherwise migration will fail? > >> > >> Will an OpenStack cluster be using upstream QEMU? If not, then that's a > > > > It's an example to show what I meant! :) Nothing else. Definitely not > > saying that everyone should use an upstream released QEMU (but in reality, > > it's not a problem, I think, and I do feel like people use them, perhaps > > more with the stable releases). > > > >> question for the distro. In a very practical sense, we're not requesting > >> anything. We barely test n->n+1/n->n-1, even if we had a strong support > >> statement I wouldn't be confident saying migration from QEMU 2.7 -> QEMU > >> 9.1 should succeed. > > > > No matter what we test in CI, I don't think we should break that for >1 > > versions.. I hope 2.7->9.1 keeps working, otherwise I think it's legal to > > file a bug by anyone. > > > > For example, I randomly fetched a bug report: > > > > https://gitlab.com/qemu-project/qemu/-/issues/1937 > > > > QEMU version:6.2 and 7.2.5 > > > > And I believe that's the common case even for upstream. If we don't do > > that right for upstream, it can be impossible tasks for downstream and for > > all of us to maintain. > > But do we do that right currently? I have no idea. Have we ever done > it? And we're here discussing a hypothetical 2.7->9.1 ... > > So we cannot reuse the UNUSED field because QEMU from 2016 might send > their data and QEMU from 2024 would interpret it wrong. > > How do we proceed? Add a subsection. And make the code survive when > receiving 0. > > @Peter is that it? What about backwards-compat? We'll need a property as > well it seems. Compat property is definitely one way to go, but I think it's you who more or less persuaded me that reusing it seems possible! At least I can't yet think of anything bad if it's ancient unused buffers. And that's why I was asking about a sane way to describe the "magic year".. And I was very serious when I said "6 years" to follow the deprecation of machine types, because it'll be something we can follow to say when an unused buffer can be obsolete and it could make some sense: if we will start to deprecate machine types, then it may not make sense to keep any UNUSED compatible forever, after all. 
And we need that ruler to be as accurate as "always 6 years to follow machine type deprecation procedure", in case someone else tomorrow asks us something that was only UNUSED since 2018. We need a rule of thumb if we want to reuse it, and if all of you agree we can start with this one, perhaps with a comment above the field (before we think all through and make it a rule on deprecating UNUSED)? -- Peter Xu
Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)
On Wed, Jul 10, 2024 at 04:48:23PM -0300, Fabiano Rosas wrote: > Peter Xu writes: > > > On Wed, Jul 10, 2024 at 01:21:51PM -0300, Fabiano Rosas wrote: > >> It's not about trust, we simply don't support migrations other than > >> n->n+1 and (maybe) n->n-1. So QEMU from 2016 is certainly not included. > > > > Where does it come from? I thought we suppport that.. > > I'm taking that from: > > docs/devel/migration/main.rst: > "In general QEMU tries to maintain forward migration compatibility > (i.e. migrating from QEMU n->n+1) and there are users who benefit from > backward compatibility as well." > > But of course it doesn't say whether that comes with a transitive rule > allowing n->n+2 migrations. I'd say that "i.e." implies n->n+1 is not the only forward migration we would support. I _think_ we should support all forward migration as long as the machine type matches. > > > > > The same question would be: are we requesting an OpenStack cluster to > > always upgrade QEMU with +1 versions, otherwise migration will fail? > > Will an OpenStack cluster be using upstream QEMU? If not, then that's a It's an example to show what I meant! :) Nothing else. Definitely not saying that everyone should use an upstream released QEMU (but in reality, it's not a problem, I think, and I do feel like people use them, perhaps more with the stable releases). > question for the distro. In a very practical sense, we're not requesting > anything. We barely test n->n+1/n->n-1, even if we had a strong support > statement I wouldn't be confident saying migration from QEMU 2.7 -> QEMU > 9.1 should succeed. No matter what we test in CI, I don't think we should break that for >1 versions.. I hope 2.7->9.1 keeps working, otherwise I think it's legal to file a bug by anyone. For example, I randomly fetched a bug report: https://gitlab.com/qemu-project/qemu/-/issues/1937 QEMU version:6.2 and 7.2.5 And I believe that's the common case even for upstream. If we don't do that right for upstream, it can be impossible tasks for downstream and for all of us to maintain. -- Peter Xu
Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)
On Wed, Jul 10, 2024 at 01:21:51PM -0300, Fabiano Rosas wrote: > It's not about trust, we simply don't support migrations other than > n->n+1 and (maybe) n->n-1. So QEMU from 2016 is certainly not included. Where does it come from? I thought we support that.. The same question would be: are we requesting an OpenStack cluster to always upgrade QEMU with +1 versions, otherwise migration will fail? -- Peter Xu
Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)
On Wed, Jul 10, 2024 at 11:08:20AM -0300, Fabiano Rosas wrote: > >> I think it's ok: > >> > >> { > >> "field": "unused", > >> "version_id": 1, > >> "field_exists": false, > >> "size": 512 > >> }, > >> > >> vs. > >> > >> { > >> "field": "vendor_data", > >> "version_id": 0, > >> "field_exists": false, > >> "num": 512, > >> "size": 1 > >> }, > >> > >> The unused field was introduced in 2016 so there's no chance of > >> migrating a QEMU that old to/from 9.1. > > > > What happens if an old qemu 9.0 sends rubbish here to a new QEMU, while the > > new QEMU would consider it meaningful data? > > It will send zeros, no? The code will have to cope with that. The > alternative is to put the vendor_data in a subsection and the code will > also have to cope with the lack of data when the old QEMU doesn't send > it. Ah indeed, that "static const uint8_t buf[1024]" is there at least since 2017. So yes, probably always sending zeros. Nothing I can think of otherwise indeed, if we want to trust that nothing will migrate before 2016. It's just that we may want to know how that "2016" is justified to be safe if we would like to allow that in the future. One thing _could_ be that "rule of thumb" is we plan to obsolete machines with 6 years, so anything "UNUSED" older than 6 years can be over-written? -- Peter Xu
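A minimal sketch of the subsection alternative mentioned above, where the destination simply copes with the field not being sent by an older QEMU; the struct, field name and .needed logic are assumptions for illustration, not the actual patch:

static bool sd_vendor_data_needed(void *opaque)
{
    SDState *sd = opaque;

    /* Only send the new data when it is non-zero (this could also be gated
     * by a compat property on old machine types). */
    return !buffer_is_zero(sd->vendor_data, sizeof(sd->vendor_data));
}

static const VMStateDescription sd_vmstate_vendor_data = {
    .name = "sd-card/vendor-data",
    .version_id = 1,
    .minimum_version_id = 1,
    .needed = sd_vendor_data_needed,
    .fields = (const VMStateField[]) {
        VMSTATE_UINT8_ARRAY(vendor_data, SDState, 512),
        VMSTATE_END_OF_LIST()
    },
};

/* Hooked into the device's main VMStateDescription via:
 *     .subsections = (const VMStateDescription * const []) {
 *         &sd_vmstate_vendor_data,
 *         NULL
 *     },
 */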
Re: [PATCH v3 06/17] hw/sd/sdcard: Do not store vendor data on block drive (CMD56)
On Tue, Jul 09, 2024 at 05:38:54PM -0300, Fabiano Rosas wrote: > Philippe Mathieu-Daudé writes: > > > "General command" (GEN_CMD, CMD56) is described as: > > > > GEN_CMD is the same as the single block read or write > > commands (CMD24 or CMD17). The difference is that [...] > > the data block is not a memory payload data but has a > > vendor specific format and meaning. > > > > Thus this block must not be stored overwriting data block > > on underlying storage drive. Keep it in a dedicated > > 'vendor_data[]' array. > > > > Signed-off-by: Philippe Mathieu-Daudé > > Tested-by: Cédric Le Goater > > --- > > RFC: Is it safe to reuse VMSTATE_UNUSED_V() (which happens > > to be the same size)? > > Hi, sorry it took some time to get to this, I had just left for vacation > when you first posted. And I totally overlooked there's the email.. until you replied. Welcome back. > > I think it's ok: > > { > "field": "unused", > "version_id": 1, > "field_exists": false, > "size": 512 > }, > > vs. > > { > "field": "vendor_data", > "version_id": 0, > "field_exists": false, > "num": 512, > "size": 1 > }, > > The unused field was introduced in 2016 so there's no chance of > migrating a QEMU that old to/from 9.1. What happens if an old qemu 9.0 sends rubbish here to a new QEMU, while the new QEMU would consider it meaningful data? -- Peter Xu
Re: [PATCH v2 09/15] memory: Do not create circular reference with subregion
On Mon, Jul 08, 2024 at 10:06:44AM +0200, Philippe Mathieu-Daudé wrote: > On 6/7/24 13:59, Akihiko Odaki wrote: > > On 2024/07/03 2:44, Peter Xu wrote: > > > On Thu, Jun 27, 2024 at 10:37:52PM +0900, Akihiko Odaki wrote: > > > > A memory region does not use their own reference counters, but instead > > > > piggybacks on another QOM object, "owner" (unless the owner is not the > > > > memory region itself). When creating a subregion, a new reference to the > > > > owner of the container must be created. However, if the subregion is > > > > owned by the same QOM object, this result in a self-reference, and make > > > > the owner immortal. Avoid such a self-reference. > > > > > > > > Signed-off-by: Akihiko Odaki > > > > --- > > > > system/memory.c | 11 +-- > > > > 1 file changed, 9 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/system/memory.c b/system/memory.c > > > > index 74cd73ebc78b..949f5016a68d 100644 > > > > --- a/system/memory.c > > > > +++ b/system/memory.c > > > > @@ -2638,7 +2638,10 @@ static void > > > > memory_region_update_container_subregions(MemoryRegion > > > > *subregion) > > > > memory_region_transaction_begin(); > > > > - memory_region_ref(subregion); > > > > + if (mr->owner != subregion->owner) { > > > > + memory_region_ref(subregion); > > > > + } > > > > + > > > > QTAILQ_FOREACH(other, &mr->subregions, subregions_link) { > > > > if (subregion->priority >= other->priority) { > > > > QTAILQ_INSERT_BEFORE(other, subregion, subregions_link); > > > > @@ -2696,7 +2699,11 @@ void > > > > memory_region_del_subregion(MemoryRegion *mr, > > > > assert(alias->mapped_via_alias >= 0); > > > > } > > > > QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link); > > > > - memory_region_unref(subregion); > > > > + > > > > + if (mr->owner != subregion->owner) { > > > > + memory_region_unref(subregion); > > > > + } > > > > + > > > > memory_region_update_pending |= mr->enabled && subregion->enabled; > > > > memory_region_transaction_commit(); > > > > } > > > > > > This does look like a real issue.. the patch looks reasonable to me, > > > but I > > > wonder whether we should start to add some good comments in code to > > > reflect > > > that complexity starting from this one. The MR refcount isn't easy to > > > understand to me. > > > > > > It also lets me start to wonder how MR refcount went through until > > > it looks > > > like today.. It's definitely not extremely intuitive to use mr->owner as > > > the object to do refcounting if mr itself does has its own QObject, > > > meanwhile it has other tricks around. > > > > > > E.g. the first thing I stumbled over when looking was the optimization > > > where we will avoid refcounting the mr when there's no owner, and IIUC it > > > was for the case when the "guest memory" (which will never be freed) used > > > to have no owner so we can speedup DMA if we know it won't go away. > > > > > > https://lore.kernel.org/qemu-devel/1450263601-2828-5-git-send-email-pbonz...@redhat.com/ > > > > > > commit 612263cf33062f7441a5d0e3b37c65991fdc3210 > > > Author: Paolo Bonzini > > > Date: Wed Dec 9 11:44:25 2015 +0100 > > > > > > memory: avoid unnecessary object_ref/unref > > > For the common case of DMA into non-hotplugged RAM, it is > > > unnecessary > > > but expensive to do object_ref/unref. Add back an owner field to > > > MemoryRegion, so that these memory regions can skip the reference > > > counting. > > > > > > If so, it looks like it will stop working with memory-backends get > > > involved? 
As I think those MRs will have owner set always, and I wonder > > > whether memory-backends should be the major way to specify guest > > > memory now > > > and in the future. So I'm not sure how important that optimization is as > > > of now, and whether we could "simplify" it back to always do the refcount > > > if the major scenarios wi
Re: [PATCH v2 09/15] memory: Do not create circular reference with subregion
On Thu, Jun 27, 2024 at 10:37:52PM +0900, Akihiko Odaki wrote: > A memory region does not use their own reference counters, but instead > piggybacks on another QOM object, "owner" (unless the owner is not the > memory region itself). When creating a subregion, a new reference to the > owner of the container must be created. However, if the subregion is > owned by the same QOM object, this result in a self-reference, and make > the owner immortal. Avoid such a self-reference. > > Signed-off-by: Akihiko Odaki > --- > system/memory.c | 11 +-- > 1 file changed, 9 insertions(+), 2 deletions(-) > > diff --git a/system/memory.c b/system/memory.c > index 74cd73ebc78b..949f5016a68d 100644 > --- a/system/memory.c > +++ b/system/memory.c > @@ -2638,7 +2638,10 @@ static void > memory_region_update_container_subregions(MemoryRegion *subregion) > > memory_region_transaction_begin(); > > -memory_region_ref(subregion); > +if (mr->owner != subregion->owner) { > +memory_region_ref(subregion); > +} > + > QTAILQ_FOREACH(other, &mr->subregions, subregions_link) { > if (subregion->priority >= other->priority) { > QTAILQ_INSERT_BEFORE(other, subregion, subregions_link); > @@ -2696,7 +2699,11 @@ void memory_region_del_subregion(MemoryRegion *mr, > assert(alias->mapped_via_alias >= 0); > } > QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link); > -memory_region_unref(subregion); > + > +if (mr->owner != subregion->owner) { > +memory_region_unref(subregion); > +} > + > memory_region_update_pending |= mr->enabled && subregion->enabled; > memory_region_transaction_commit(); > } This does look like a real issue.. the patch looks reasonable to me, but I wonder whether we should start to add some good comments in code to reflect that complexity starting from this one. The MR refcount isn't easy to understand to me. It also lets me start to wonder how MR refcount went through until it looks like today.. It's definitely not extremely intuitive to use mr->owner as the object to do refcounting if mr itself does has its own QObject, meanwhile it has other tricks around. E.g. the first thing I stumbled over when looking was the optimization where we will avoid refcounting the mr when there's no owner, and IIUC it was for the case when the "guest memory" (which will never be freed) used to have no owner so we can speedup DMA if we know it won't go away. https://lore.kernel.org/qemu-devel/1450263601-2828-5-git-send-email-pbonz...@redhat.com/ commit 612263cf33062f7441a5d0e3b37c65991fdc3210 Author: Paolo Bonzini Date: Wed Dec 9 11:44:25 2015 +0100 memory: avoid unnecessary object_ref/unref For the common case of DMA into non-hotplugged RAM, it is unnecessary but expensive to do object_ref/unref. Add back an owner field to MemoryRegion, so that these memory regions can skip the reference counting. If so, it looks like it will stop working with memory-backends get involved? As I think those MRs will have owner set always, and I wonder whether memory-backends should be the major way to specify guest memory now and in the future. So I'm not sure how important that optimization is as of now, and whether we could "simplify" it back to always do the refcount if the major scenarios will not adopt it. The other issue is we used owner refcount from the start of memory_region_ref() got introduced, since: commit 46637be269aaaceb9867ffdf176e906401138fff Author: Paolo Bonzini Date: Tue May 7 09:06:00 2013 +0200 memory: add ref/unref And we still have that in our document, even though I don't think it's true anymore: * ... 
MemoryRegions actually do not have their * own reference count; they piggyback on a QOM object, their "owner". * This function adds a reference to the owner. It looks like what happened is that when the change was introduced, MR was not a QOM object yet. But it is now.. I mentioned all these only because I found that _if_ we can keep mr refcounting as simple as other objects: memory_region_ref(mr) { object_ref(OBJECT(mr)); } then it looks like this "recursive refcount" problem can also go away. I'm curious whether you or anyone tried to explore that path, or whether the above doesn't make sense at all. Thanks, -- Peter Xu
Re: [PATCH 07/14] migration: Free removed SaveStateEntry
On Wed, Jun 26, 2024 at 08:06:30PM +0900, Akihiko Odaki wrote: > This suppresses LeakSanitizer warnings. > > Signed-off-by: Akihiko Odaki Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Wed, Jun 05, 2024 at 08:48:28PM +0000, Dr. David Alan Gilbert wrote: > > > I just noticed this thread; some random notes from a somewhat > > > fragmented memory of this: > > > > > > a) Long long ago, I also tried rsocket; > > > https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html > > > as I remember the library was quite flaky at the time. > > > > Hmm interesting. There also looks like a thread doing rpoll(). > > Yeh, I can't actually remember much more about what I did back then! Heh, that's understandable and fair. :) > > I hope Lei and his team has tested >4G mem, otherwise definitely worth > > checking. Lei also mentioned there're rsocket bugs they found in the cover > > letter, but not sure what's that about. > > It would probably be a good idea to keep track of what bugs > are in flight with it, and try it on a few RDMA cards to see > what problems get triggered. > I think I reported a few at the time, but I gave up after > feeling it was getting very hacky. Agreed. Maybe we can have a list of that in the cover letter or even QEMU's migration/rdma doc page. Lei, if you think that makes sense please do so in your upcoming posts. There'll need to be a list of things you encountered in the kernel driver and it'll be even better if there are further links to read on each problem. > > > > > > e) Someone made a good suggestion (sorry can't remember who) - that the > > > RDMA migration structure was the wrong way around - it should be the > > > destination which initiates an RDMA read, rather than the source > > > doing a write; then things might become a LOT simpler; you just need > > > to send page ranges to the destination and it can pull it. > > > That might work nicely for postcopy. > > > > I'm not sure whether it'll still be a problem if rdma recv side is based on > > zero-copy. It would be a matter of whether atomicity can be guaranteed so > > that we don't want the guest vcpus to see a partially copied page during > > on-flight DMAs. UFFDIO_COPY (or friend) is currently the only solution for > > that. > > Yes, but even ignoring that (and the UFFDIO_CONTINUE idea you mention), if > the destination can issue an RDMA read itself, it doesn't need to send > messages > to the source to ask for a page fetch; it just goes and grabs it itself, > that's got to be good for latency. Oh, that's pretty internal stuff of rdma to me and beyond my knowledge.. but from what I can tell it sounds very reasonable indeed! Thanks! -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Wed, Jun 05, 2024 at 10:10:57AM -0400, Peter Xu wrote: > > e) Someone made a good suggestion (sorry can't remember who) - that the > > RDMA migration structure was the wrong way around - it should be the > > destination which initiates an RDMA read, rather than the source > > doing a write; then things might become a LOT simpler; you just need > > to send page ranges to the destination and it can pull it. > > That might work nicely for postcopy. > > I'm not sure whether it'll still be a problem if rdma recv side is based on > zero-copy. It would be a matter of whether atomicity can be guaranteed so > that we don't want the guest vcpus to see a partially copied page during > on-flight DMAs. UFFDIO_COPY (or friend) is currently the only solution for > that. And when thinking about this (given UFFDIO_COPY's nature of not being able to do zero-copy...), the only way this will be able to do zero-copy is to use file memories (shmem/hugetlbfs), as the page cache can be prepopulated. So that when we do DMA we go through the page cache, which can be mapped at another virtual address besides what the vcpus are using. Then we can use UFFDIO_CONTINUE (rather than UFFDIO_COPY) to do atomic updates on the vcpu pgtables, avoiding the copy. QEMU doesn't have it, but it looks like there's one more reason we may want to make better use of shmem.. than anonymous. And actually when working on 4k faults on 1G hugetlb I added CONTINUE support. https://github.com/xzpeter/qemu/tree/doublemap https://github.com/xzpeter/qemu/commit/b8aff3a9d7654b1cf2c089a06894ff4899740dc5 Maybe it's worthwhile on its own now, because it also means we can use that in multifd to avoid one extra layer of buffering when supporting multifd+postcopy (which has the same issue here of directly copying data into guest pages). It'll also work with things like rdma I think in similar ways. It's just that it'll not work on anonymous memory. I definitely hijacked the thread to somewhere too far away. I'll stop here.. Thanks, -- Peter Xu
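For reference, a standalone sketch (not QEMU code) of the UFFDIO_CONTINUE flow described above, assuming the guest RAM is a shmem file registered with userfaultfd in minor-fault mode and that "alias" is a second mapping of that same file:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Install one received page at 'guest_addr' without UFFDIO_COPY's extra copy. */
static int install_page(int uffd, void *guest_addr, void *alias,
                        const void *data, size_t pagesize)
{
    struct uffdio_continue cont = {
        .range = {
            .start = (uintptr_t)guest_addr,
            .len = pagesize,
        },
        .mode = 0,
    };

    /* Stand-in for the data landing in the page cache through the alias
     * mapping (ideally the NIC would DMA straight into it). */
    memcpy(alias, data, pagesize);

    /* Atomically map the now-populated page cache page into the vcpu-visible
     * mapping; vcpus never observe a partially copied page. */
    return ioctl(uffd, UFFDIO_CONTINUE, &cont);
}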
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
Hey, Dave! On Wed, Jun 05, 2024 at 12:31:56AM +, Dr. David Alan Gilbert wrote: > * Michael Galaxy (mgal...@akamai.com) wrote: > > One thing to keep in mind here (despite me not having any hardware to test) > > was that one of the original goals here > > in the RDMA implementation was not simply raw throughput nor raw latency, > > but a lack of CPU utilization in kernel > > space due to the offload. While it is entirely possible that newer hardware > > w/ TCP might compete, the significant > > reductions in CPU usage in the TCP/IP stack were a big win at the time. > > > > Just something to consider while you're doing the testing > > I just noticed this thread; some random notes from a somewhat > fragmented memory of this: > > a) Long long ago, I also tried rsocket; > https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html > as I remember the library was quite flaky at the time. Hmm interesting. There also looks like a thread doing rpoll(). Btw, not sure whether you noticed, but there's the series posted for the latest rsocket conversion here: https://lore.kernel.org/r/1717503252-51884-1-git-send-email-arei.gong...@huawei.com I hope Lei and his team has tested >4G mem, otherwise definitely worth checking. Lei also mentioned there're rsocket bugs they found in the cover letter, but not sure what's that about. > > b) A lot of the complexity in the rdma migration code comes from > emulating a stream to carry the migration control data and interleaving > that with the actual RAM copy. I believe the original design used > a separate TCP socket for the control data, and just used the RDMA > for the data - that should be a lot simpler (but alas was rejected > in review early on) > > c) I can't rememmber the last benchmarks I did; but I think I did > manage to beat RDMA with multifd; but yes, multifd does eat host CPU > where as RDMA barely uses a whisper. I think my first impression on this matter came from you on this one. :) > > d) The 'zero-copy-send' option in migrate may well get some of that > CPU time back; but if I remember we were still bottle necked on > the receive side. (I can't remember if zero-copy-send worked with > multifd?) Yes, and zero-copy requires multifd for now. I think it's because we didn't want to complicate the header processings in the migration stream where it may not be page aligned. > > e) Someone made a good suggestion (sorry can't remember who) - that the > RDMA migration structure was the wrong way around - it should be the > destination which initiates an RDMA read, rather than the source > doing a write; then things might become a LOT simpler; you just need > to send page ranges to the destination and it can pull it. > That might work nicely for postcopy. I'm not sure whether it'll still be a problem if rdma recv side is based on zero-copy. It would be a matter of whether atomicity can be guaranteed so that we don't want the guest vcpus to see a partially copied page during on-flight DMAs. UFFDIO_COPY (or friend) is currently the only solution for that. Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
Lei, On Wed, May 29, 2024 at 02:43:46AM +, Gonglei (Arei) wrote: > For rdma programming, the current mainstream implementation is to use > rdma_cm to establish a connection, and then use verbs to transmit data. > rdma_cm and ibverbs create two FDs respectively. The two FDs have > different responsibilities. rdma_cm fd is used to notify connection > establishment events, and verbs fd is used to notify new CQEs. When > poll/epoll monitoring is directly performed on the rdma_cm fd, only a > pollin event can be monitored, which means that an rdma_cm event > occurs. When the verbs fd is directly polled/epolled, only the pollin > event can be listened, which indicates that a new CQE is generated. > > Rsocket is a sub-module attached to the rdma_cm library and provides > rdma calls that are completely similar to socket interfaces. However, > this library returns only the rdma_cm fd for listening to link > setup-related events and does not expose the verbs fd (readable and > writable events for listening to data). Only the rpoll interface provided > by the RSocket can be used to listen to related events. However, QEMU > uses the ppoll interface to listen to the rdma_cm fd (gotten by raccept > API). And cannot listen to the verbs fd event. Only some hacking methods > can be used to address this problem. Do you guys have any ideas? Thanks. I saw that you mentioned this elsewhere: > Right. But the question is QEMU do not use rpoll but glib's ppoll. :( So what I'm thinking may not make much sense, as I mentioned I don't think I know rdma at all.. and my idea also involves coroutine stuff which I don't know well either. But just in case it sheds some light in some form. IIUC we handle iochannel blocking with this, no matter whether it's a read or a write:

    if (len == QIO_CHANNEL_ERR_BLOCK) {
        if (qemu_in_coroutine()) {
            qio_channel_yield(ioc, G_IO_XXX);
        } else {
            qio_channel_wait(ioc, G_IO_XXX);
        }
        continue;
    }

One thing I'm wondering is whether we can provide a new feature bit for qiochannel, e.g., QIO_CHANNEL_FEATURE_POLL, so that the iochannel can define its own poll routine rather than using the default when possible. I think it may not work if it's in a coroutine, as I guess that'll block other fds from being woken up. Hence it should look like this:

    if (len == QIO_CHANNEL_ERR_BLOCK) {
        if (qemu_in_coroutine()) {
            qio_channel_yield(ioc, G_IO_XXX);
        } else if (qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_POLL)) {
            qio_channel_poll(ioc, G_IO_XXX);
        } else {
            qio_channel_wait(ioc, G_IO_XXX);
        }
        continue;
    }

Maybe we even want to forbid such a channel from being used in a coroutine at all, as when QIO_CHANNEL_FEATURE_POLL is set it may mean that this iochannel simply won't work with poll(), like in rdma's use case. Then the rdma iochannel can implement qio_channel_poll() using rpoll(). There's one other dependent issue here in that I _think_ the migration recv side is still in a coroutine.. so we may need to move that into a thread first. IIRC we don't yet have a major blocker to do that, but I didn't further check either. I've put that issue aside just to see whether this may or may not make sense. Thanks, -- Peter Xu
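To make that last idea a bit more concrete, here is a rough sketch of what such a poll hook could look like for an rsocket-backed channel. QIO_CHANNEL_FEATURE_POLL, qio_channel_poll() and the QIOChannelRDMA layout are hypothetical names following the proposal above, not existing QEMU API; rpoll() is the real librdmacm call from <rdma/rsocket.h>:

    /* Hypothetical sketch only: how an rsocket-based channel might back the
     * proposed qio_channel_poll() hook with rpoll() instead of ppoll(). */
    #include <errno.h>
    #include <poll.h>
    #include <rdma/rsocket.h>

    typedef struct QIOChannelRDMA {
        /* ... parent QIOChannel and other state would live here ... */
        int fd; /* fd returned by rsocket()/raccept(), not a kernel fd */
    } QIOChannelRDMA;

    static int qio_channel_rdma_poll(QIOChannelRDMA *rioc, short events,
                                     int timeout_ms)
    {
        struct pollfd pfd = {
            .fd = rioc->fd,
            .events = events, /* e.g. POLLIN / POLLOUT mapped from G_IO_IN / G_IO_OUT */
        };
        int ret;

        /* rpoll() understands rsocket fds, which plain poll()/ppoll() do not,
         * so it can wait for both connection and data (CQE) events behind
         * the rsocket. */
        do {
            ret = rpoll(&pfd, 1, timeout_ms);
        } while (ret < 0 && errno == EINTR);

        return ret;
    }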
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Tue, May 28, 2024 at 09:06:04AM +, Gonglei (Arei) wrote: > Hi Peter, > > > -Original Message- > > From: Peter Xu [mailto:pet...@redhat.com] > > Sent: Wednesday, May 22, 2024 6:15 AM > > To: Yu Zhang > > Cc: Michael Galaxy ; Jinpu Wang > > ; Elmar Gerdes ; > > zhengchuan ; Gonglei (Arei) > > ; Daniel P. Berrangé ; > > Markus Armbruster ; Zhijian Li (Fujitsu) > > ; qemu-de...@nongnu.org; Yuval Shaia > > ; Kevin Wolf ; Prasanna > > Kumar Kalever ; Cornelia Huck > > ; Michael Roth ; Prasanna > > Kumar Kalever ; Paolo Bonzini > > ; qemu-block@nongnu.org; de...@lists.libvirt.org; > > Hanna Reitz ; Michael S. Tsirkin ; > > Thomas Huth ; Eric Blake ; Song > > Gao ; Marc-André Lureau > > ; Alex Bennée ; > > Wainer dos Santos Moschetta ; Beraldo Leal > > ; Pannengyuan ; > > Xiexiangyou ; Fabiano Rosas > > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling > > > > On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote: > > > Hello Michael and Peter, > > > > Hi, > > > > > > > > Exactly, not so compelling, as I did it first only on servers widely > > > used for production in our data center. The network adapters are > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 > > > 2-port Gigabit Ethernet PCIe > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more > > reasonable. > > > > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15 > > wvaqk81vxtkzx-l...@mail.gmail.com/ > > > > Appreciate a lot for everyone helping on the testings. > > > > > InfiniBand controller: Mellanox Technologies MT27800 Family > > > [ConnectX-5] > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP for VM > > > migration. RDMA traffic is through InfiniBand and TCP through Ethernet > > > on these two hosts. One is standby while the other is active. > > > > > > Now I'll try on a server with more recent Ethernet and InfiniBand > > > network adapters. One of them has: > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01) > > > > > > The comparison between RDMA and TCP on the same NIC could make more > > sense. > > > > It looks to me NICs are powerful now, but again as I mentioned I don't > > think it's > > a reason we need to deprecate rdma, especially if QEMU's rdma migration has > > the chance to be refactored using rsocket. > > > > Is there anyone who started looking into that direction? Would it make > > sense > > we start some PoC now? > > > > My team has finished the PoC refactoring which works well. > > Progress: > 1. Implement io/channel-rdma.c, > 2. Add unit test tests/unit/test-io-channel-rdma.c and verifying it is > successful, > 3. Remove the original code from migration/rdma.c, > 4. Rewrite the rdma_start_outgoing_migration and > rdma_start_incoming_migration logic, > 5. Remove all rdma_xxx functions from migration/ram.c. (to prevent RDMA live > migration from polluting the core logic of live migration), > 6. The soft-RoCE implemented by software is used to test the RDMA live > migration. It's successful. > > We will be submit the patchset later. That's great news, thank you! -- Peter Xu
Re: [PATCH 5/6] migration: Rephrase message on failure to save / load Xen device state
On Mon, May 27, 2024 at 12:53:22PM +0200, Markus Armbruster wrote: > Peter Xu writes: > > > On Mon, May 13, 2024 at 04:17:02PM +0200, Markus Armbruster wrote: > >> Functions that use an Error **errp parameter to return errors should > >> not also report them to the user, because reporting is the caller's > >> job. When the caller does, the error is reported twice. When it > >> doesn't (because it recovered from the error), there is no error to > >> report, i.e. the report is bogus. > >> > >> qmp_xen_save_devices_state() and qmp_xen_load_devices_state() violate > >> this principle: they call qemu_save_device_state() and > >> qemu_loadvm_state(), which call error_report_err(). > >> > >> I wish I could clean this up now, but migration's error reporting is > >> too complicated (confused?) for me to mess with it. > > > > :-( > > If I understood how it's *supposed* to work, I might have a chance... > > I can see a mixture of reporting errors directly (with error_report() & > friends), passing them to callers (via Error **errp), and storing them > in / retrieving them from MigrationState member @error. This can't be > right. > > I think a necessary first step towards getting it right is a shared > understanding how errors are to be handled in migration code. This > includes how error data should flow from error source to error sink, and > what the possible sinks are. True. I think the sink should always be MigrationState.error so far. There's also the added complexity of detecting errors using either qemu_file_get_error() or migrate_get_error().. the error handling in migration is indeed due for a cleanup. I've added a cleanup entry to the migration ToDo page: https://wiki.qemu.org/ToDo/LiveMigration#Migration_error_detection_and_reporting Thanks, -- Peter Xu
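As a side note, the convention being referred to can be summarized with a small, purely illustrative sketch (do_save_step() is made up; error_setg(), error_report_err() and migrate_set_error() are the existing helpers being discussed, include paths simplified):

    /* Illustrative only: an error *source* fills Error **errp and returns
     * failure; the *caller* decides whether to report the error or to store
     * it in the MigrationState sink, but the source never reports directly. */
    #include "qapi/error.h"
    #include "migration/migration.h" /* path simplified for the example */

    static bool do_save_step(int fd, Error **errp)
    {
        if (fd < 0) {
            error_setg(errp, "invalid migration stream fd: %d", fd);
            return false; /* error handed to the caller, not reported here */
        }
        return true;
    }

    static void caller(MigrationState *s, int fd)
    {
        Error *err = NULL;

        if (!do_save_step(fd, &err)) {
            migrate_set_error(s, err); /* sink: stash the error in MigrationState */
            error_report_err(err);     /* report (and free) exactly once */
        }
    }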
Re: [PATCH 5/6] migration: Rephrase message on failure to save / load Xen device state
On Mon, May 13, 2024 at 04:17:02PM +0200, Markus Armbruster wrote: > Functions that use an Error **errp parameter to return errors should > not also report them to the user, because reporting is the caller's > job. When the caller does, the error is reported twice. When it > doesn't (because it recovered from the error), there is no error to > report, i.e. the report is bogus. > > qmp_xen_save_devices_state() and qmp_xen_load_devices_state() violate > this principle: they call qemu_save_device_state() and > qemu_loadvm_state(), which call error_report_err(). > > I wish I could clean this up now, but migration's error reporting is > too complicated (confused?) for me to mess with it. :-( > > Instead, I'm merely improving the error reported by > qmp_xen_save_devices_state() and qmp_xen_load_devices_state() to the > QMP core from > > An IO error has occurred > > to > saving Xen device state failed > > and > > loading Xen device state failed > > respectively. > > Signed-off-by: Markus Armbruster Acked-by: Peter Xu -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote: > Hello Michael and Peter, Hi, > > Exactly, not so compelling, as I did it first only on servers widely > used for production in our data center. The network adapters are > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 > 2-port Gigabit Ethernet PCIe Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more reasonable. https://lore.kernel.org/qemu-devel/camgffen-dkpmz4ta71mjydyemg0zda15wvaqk81vxtkzx-l...@mail.gmail.com/ Appreciate a lot for everyone helping on the testings. > InfiniBand controller: Mellanox Technologies MT27800 Family [ConnectX-5] > > which doesn't meet our purpose. I can choose RDMA or TCP for VM > migration. RDMA traffic is through InfiniBand and TCP through Ethernet > on these two hosts. One is standby while the other is active. > > Now I'll try on a server with more recent Ethernet and InfiniBand > network adapters. One of them has: > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01) > > The comparison between RDMA and TCP on the same NIC could make more sense. It looks to me NICs are powerful now, but again as I mentioned I don't think it's a reason we need to deprecate rdma, especially if QEMU's rdma migration has the chance to be refactored using rsocket. Is there anyone who started looking into that direction? Would it make sense we start some PoC now? Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote: > That's a good news to see the socket abstraction for RDMA! > When I was developed the series above, the most pain is the RDMA migration > has no QIOChannel abstraction and i need to take a 'fake channel' > for it which is awkward in code implementation. > So, as far as I know, we can do this by > i. the first thing is that we need to evaluate the rsocket is good enough to > satisfy our QIOChannel fundamental abstraction > ii. if it works right, then we will continue to see if it can give us > opportunity to hide the detail of rdma protocol > into rsocket by remove most of code in rdma.c and also some hack in > migration main process. > iii. implement the advanced features like multi-fd and multi-uri for rdma > migration. > > Since I am not familiar with rsocket, I need some times to look at it and do > some quick verify with rdma migration based on rsocket. > But, yes, I am willing to involved in this refactor work and to see if we can > make this migration feature more better:) Based on what we have now, it looks like we'd better halt the deprecation process a bit, so I think we shouldn't need to rush it at least in 9.1 then, and we'll need to see how it goes on the refactoring. It'll be perfect if rsocket works, otherwise supporting multifd with little overhead / exported APIs would also be a good thing in general with whatever approach. And obviously all based on the facts that we can get resources from companies to support this feature first. Note that so far nobody yet compared with rdma v.s. nic perf, so I hope if any of us can provide some test results please do so. Many people are saying RDMA is better, but I yet didn't see any numbers comparing it with modern TCP networks. I don't want to have old impressions floating around even if things might have changed.. When we have consolidated results, we should share them out and also reflect that in QEMU's migration docs when a rdma document page is ready. Chuan, please check the whole thread discussion, it may help to understand what we are looking for on rdma migrations [1]. Meanwhile please feel free to sync with Jinpu's team and see how to move forward with such a project. [1] https://lore.kernel.org/qemu-devel/87frwatp7n@suse.de/ Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Tue, May 07, 2024 at 01:50:43AM +, Gonglei (Arei) wrote: > Hello, > > > -Original Message- > > From: Peter Xu [mailto:pet...@redhat.com] > > Sent: Monday, May 6, 2024 11:18 PM > > To: Gonglei (Arei) > > Cc: Daniel P. Berrangé ; Markus Armbruster > > ; Michael Galaxy ; Yu Zhang > > ; Zhijian Li (Fujitsu) ; Jinpu > > Wang > > ; Elmar Gerdes ; > > qemu-de...@nongnu.org; Yuval Shaia ; Kevin Wolf > > ; Prasanna Kumar Kalever > > ; Cornelia Huck ; > > Michael Roth ; Prasanna Kumar Kalever > > ; integrat...@gluster.org; Paolo Bonzini > > ; qemu-block@nongnu.org; de...@lists.libvirt.org; > > Hanna Reitz ; Michael S. Tsirkin ; > > Thomas Huth ; Eric Blake ; Song > > Gao ; Marc-André Lureau > > ; Alex Bennée ; > > Wainer dos Santos Moschetta ; Beraldo Leal > > ; Pannengyuan ; > > Xiexiangyou > > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling > > > > On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote: > > > Hi, Peter > > > > Hey, Lei, > > > > Happy to see you around again after years. > > > Haha, me too. > > > > RDMA features high bandwidth, low latency (in non-blocking lossless > > > network), and direct remote memory access by bypassing the CPU (As you > > > know, CPU resources are expensive for cloud vendors, which is one of > > > the reasons why we introduced offload cards.), which TCP does not have. > > > > It's another cost to use offload cards, v.s. preparing more cpu resources? > > > Software and hardware offload converged architecture is the way to go for all > cloud vendors > (Including comprehensive benefits in terms of performance, cost, security, > and innovation speed), > it's not just a matter of adding the resource of a DPU card. > > > > In some scenarios where fast live migration is needed (extremely short > > > interruption duration and migration duration) is very useful. To this > > > end, we have also developed RDMA support for multifd. > > > > Will any of you upstream that work? I'm curious how intrusive would it be > > when adding it to multifd, if it can keep only 5 exported functions like > > what > > rdma.h does right now it'll be pretty nice. We also want to make sure it > > works > > with arbitrary sized loads and buffers, e.g. vfio is considering to add IO > > loads to > > multifd channels too. > > > > In fact, we sent the patchset to the community in 2021. Pls see: > https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/ I wasn't aware of that for sure in the past.. Multifd has changed quite a bit in the last 9.0 release, that may not apply anymore. One thing to mention is please look at Dan's comment on possible use of rsocket.h: https://lore.kernel.org/all/zjjm6rcqs5eho...@redhat.com/ And Jinpu did help provide an initial test result over the library: https://lore.kernel.org/qemu-devel/camgffek8wiknqmouyxcathgtiem2dwocf_w7t0vmcd-i30t...@mail.gmail.com/ It looks like we have a chance to apply that in QEMU. > > > > One thing to note that the question here is not about a pure performance > > comparison between rdma and nics only. It's about help us make a decision > > on whether to drop rdma, iow, even if rdma performs well, the community > > still > > has the right to drop it if nobody can actively work and maintain it. > > It's just that if nics can perform as good it's more a reason to drop, > > unless > > companies can help to provide good support and work together. > > > > We are happy to provide the necessary review and maintenance work for RDMA > if the community needs it. > > CC'ing Chuan Zheng. 
I'm not sure whether you and Jinpu's team would like to work together and provide a final solution for rdma over multifd. It could be much simpler than the original 2021 proposal if the rsocket API will work out. Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Mon, May 06, 2024 at 12:08:43PM +0200, Jinpu Wang wrote: > Hi Peter, hi Daniel, Hi, Jinpu, Thanks for sharing this test results. Sounds like a great news. What's your plan next? Would it then be worthwhile / possible moving QEMU into that direction? Would that greatly simplify rdma code as Dan mentioned? Thanks, > > On Fri, May 3, 2024 at 4:33 PM Peter Xu wrote: > > > > On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote: > > > I had a brief check in the rsocket changelog, there seems some > > > improvement over time, > > > might be worth revisiting this. due to socket abstraction, we can't > > > use some feature like > > > ODP, it won't be a small and easy task. > > > > It'll be good to know whether Dan's suggestion would work first, without > > rewritting everything yet so far. Not sure whether some perf test could > > help with the rsocket APIs even without QEMU's involvements (or looking for > > test data supporting / invalidate such conversions). > > > I did a quick test with iperf on 100 G environment and 40 G > environment, in summary rsocket works pretty well. > > iperf tests between 2 hosts with 40 G (IB), > first a few test with different num. of threads on top of ipoib > interface, later with preload rsocket on top of same ipoib interface. > > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 > > Client connecting to 10.43.3.145, TCP port 52000 > TCP window size: 165 KByte (default) > > [ 3] local 10.43.3.146 port 55602 connected with 10.43.3.145 port 52000 > [ ID] Interval Transfer Bandwidth > [ 3] 0.-10.0001 sec 2.85 GBytes 2.44 Gbits/sec > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 2 > > Client connecting to 10.43.3.145, TCP port 52000 > TCP window size: 165 KByte (default) > > [ 4] local 10.43.3.146 port 39640 connected with 10.43.3.145 port 52000 > [ 3] local 10.43.3.146 port 39626 connected with 10.43.3.145 port 52000 > [ ID] Interval Transfer Bandwidth > [ 3] 0.-10.0012 sec 2.85 GBytes 2.45 Gbits/sec > [ 4] 0.-10.0026 sec 2.86 GBytes 2.45 Gbits/sec > [SUM] 0.-10.0026 sec 5.71 GBytes 4.90 Gbits/sec > [ CT] final connect times (min/avg/max/stdev) = > 0.281/0.300/0.318/0.318 ms (tot/err) = 2/0 > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 4 > > Client connecting to 10.43.3.145, TCP port 52000 > TCP window size: 165 KByte (default) > > [ 4] local 10.43.3.146 port 46956 connected with 10.43.3.145 port 52000 > [ 6] local 10.43.3.146 port 46978 connected with 10.43.3.145 port 52000 > [ 3] local 10.43.3.146 port 46944 connected with 10.43.3.145 port 52000 > [ 5] local 10.43.3.146 port 46962 connected with 10.43.3.145 port 52000 > [ ID] Interval Transfer Bandwidth > [ 3] 0.-10.0017 sec 2.85 GBytes 2.45 Gbits/sec > [ 4] 0.-10.0015 sec 2.85 GBytes 2.45 Gbits/sec > [ 5] 0.-10.0026 sec 2.85 GBytes 2.45 Gbits/sec > [ 6] 0.-10.0005 sec 2.85 GBytes 2.45 Gbits/sec > [SUM] 0.-10.0005 sec 11.4 GBytes 9.80 Gbits/sec > [ CT] final connect times (min/avg/max/stdev) = > 0.274/0.312/0.360/0.212 ms (tot/err) = 4/0 > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 8 > > Client connecting to 10.43.3.145, TCP port 52000 > TCP window size: 165 KByte (default) > > [ 7] local 10.43.3.146 port 35062 connected with 10.43.3.145 port 52000 > [ 6] local 10.43.3.146 port 35058 connected with 10.43.3.145 port 52000 > [ 8] local 10.43.3.146 port 35066 connected with 10.43.3.145 port 52000 > [ 9] local 10.43.3.146 port 35074 connected with 10.43.3.145 port 52000 > [ 3] local 10.43.3.146 port 35038 connected with 10.43.3.145 port 52000 > [ 12] local 10.43.3.146 
port 35088 connected with 10.43.3.145 port 52000 > [ 5] local 10.43.3.146 port 35048 connected with 10.43.3.145 port 52000 > [ 4] local 10.43.3.146 port 35050 connected with 10.43.3.145 port 52000 > [ ID] Interval Transfer Bandwidth > [ 4] 0.-10.0005 sec 2.85 GBytes 2.44 Gbits/sec > [ 8] 0.-10.0011 sec 2.85 GBytes 2.45 Gbits/sec > [ 5] 0.-10. sec 2.85 GBytes 2.45 Gbits/sec > [ 12] 0.-10.0021 sec 2.85 GBytes 2.44 Gbits/sec > [ 3] 0.-10.0003 sec 2.85 GBytes
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote: > Hi, Peter Hey, Lei, Happy to see you around again after years. > RDMA features high bandwidth, low latency (in non-blocking lossless > network), and direct remote memory access by bypassing the CPU (As you > know, CPU resources are expensive for cloud vendors, which is one of the > reasons why we introduced offload cards.), which TCP does not have. It's another cost to use offload cards, v.s. preparing more cpu resources? > In some scenarios where fast live migration is needed (extremely short > interruption duration and migration duration) is very useful. To this > end, we have also developed RDMA support for multifd. Will any of you upstream that work? I'm curious how intrusive it would be when adding it to multifd; if it can keep only 5 exported functions like what rdma.h does right now, it'll be pretty nice. We also want to make sure it works with arbitrarily sized loads and buffers, e.g. vfio is considering adding IO loads to multifd channels too. One thing to note is that the question here is not just about a pure performance comparison between rdma and nics. It's about helping us make a decision on whether to drop rdma, iow, even if rdma performs well, the community still has the right to drop it if nobody can actively work on and maintain it. It's just that if nics can perform as well, it's more of a reason to drop it, unless companies can help to provide good support and work together. Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote: > I had a brief check in the rsocket changelog, there seems some > improvement over time, > might be worth revisiting this. due to socket abstraction, we can't > use some feature like > ODP, it won't be a small and easy task. It'll be good to know whether Dan's suggestion would work first, without rewriting everything just yet. Not sure whether some perf test could help with the rsocket APIs even without QEMU's involvement (or looking for test data supporting / invalidating such a conversion). Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Thu, May 02, 2024 at 03:30:58PM +0200, Jinpu Wang wrote: > Hi Michael, Hi Peter, > > > On Thu, May 2, 2024 at 3:23 PM Michael Galaxy wrote: > > > > Yu Zhang / Jinpu, > > > > Any possibility (at your lesiure, and within the disclosure rules of > > your company, IONOS) if you could share any of your performance > > information to educate the group? > > > > NICs have indeed changed, but not everybody has 100ge mellanox cards at > > their disposal. Some people don't. > Our staging env is with 100 Gb/s IB environment. > We will have a new setup in the coming months with Ethernet (RoCE), we > will run some performance > comparison when we have the environment ready. Thanks both. Please keep us posted. Just to double check, we're comparing "tcp:" v.s. "rdma:", RoCE is not involved, am I right? The other note is that the comparison needs to be with multifd enabled for the "tcp:" case. I'd suggest we start with 8 threads if it's 100Gbps. I think I can still fetch some 100Gbps or even 200Gbps nics around our labs without even waiting for months. If you want I can try to see how we can test together. And btw I don't think we need a cluster, IIUC we simply need two hosts, 100G nic on both sides? IOW, it seems to me we only need two cards just for experiments, systems that can drive the cards, and a wire supporting 100G? > > > > > - Michael > > Thx! > Jinpu > > > > On 5/1/24 11:16, Peter Xu wrote: > > > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote: > > >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote: > > >>> What I worry more is whether this is really what we want to keep rdma in > > >>> qemu, and that's also why I was trying to request for some serious > > >>> performance measurements comparing rdma v.s. nics. And here when I said > > >>> "we" I mean both QEMU community and any company that will support > > >>> keeping > > >>> rdma around. > > >>> > > >>> The problem is if NICs now are fast enough to perform at least equally > > >>> against rdma, and if it has a lower cost of overall maintenance, does it > > >>> mean that rdma migration will only be used by whoever wants to keep > > >>> them in > > >>> the products and existed already? In that case we should simply ask new > > >>> users to stick with tcp, and rdma users should only drop but not > > >>> increase. > > >>> > > >>> It seems also destined that most new migration features will not support > > >>> rdma: see how much we drop old features in migration now (which rdma > > >>> _might_ still leverage, but maybe not), and how much we add mostly > > >>> multifd > > >>> relevant which will probably not apply to rdma at all. So in general > > >>> what > > >>> I am worrying is a both-loss condition, if the company might be easier > > >>> to > > >>> either stick with an old qemu (depending on whether other new features > > >>> are > > >>> requested to be used besides RDMA alone), or do periodic rebase with > > >>> RDMA > > >>> downstream only. > > >> I don't know much about the originals of RDMA support in QEMU and why > > >> this particular design was taken. It is indeed a huge maint burden to > > >> have a completely different code flow for RDMA with 4000+ lines of > > >> custom protocol signalling which is barely understandable. > > >> > > >> I would note that /usr/include/rdma/rsocket.h provides a higher level > > >> API that is a 1-1 match of the normal kernel 'sockets' API. 
If we had > > >> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress > > >> type could almost[1] trivially have supported RDMA. There would have > > >> been almost no RDMA code required in the migration subsystem, and all > > >> the modern features like compression, multifd, post-copy, etc would > > >> "just work". > > >> > > >> I guess the 'rsocket.h' shim may well limit some of the possible > > >> performance gains, but it might still have been a better tradeoff > > >> to have not quite so good peak performance, but with massively > > >> less maint burden. > > > My understanding so far is RDMA is sololy for performance but nothing > > >
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote: > On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote: > > What I worry more is whether this is really what we want to keep rdma in > > qemu, and that's also why I was trying to request for some serious > > performance measurements comparing rdma v.s. nics. And here when I said > > "we" I mean both QEMU community and any company that will support keeping > > rdma around. > > > > The problem is if NICs now are fast enough to perform at least equally > > against rdma, and if it has a lower cost of overall maintenance, does it > > mean that rdma migration will only be used by whoever wants to keep them in > > the products and existed already? In that case we should simply ask new > > users to stick with tcp, and rdma users should only drop but not increase. > > > > It seems also destined that most new migration features will not support > > rdma: see how much we drop old features in migration now (which rdma > > _might_ still leverage, but maybe not), and how much we add mostly multifd > > relevant which will probably not apply to rdma at all. So in general what > > I am worrying is a both-loss condition, if the company might be easier to > > either stick with an old qemu (depending on whether other new features are > > requested to be used besides RDMA alone), or do periodic rebase with RDMA > > downstream only. > > I don't know much about the originals of RDMA support in QEMU and why > this particular design was taken. It is indeed a huge maint burden to > have a completely different code flow for RDMA with 4000+ lines of > custom protocol signalling which is barely understandable. > > I would note that /usr/include/rdma/rsocket.h provides a higher level > API that is a 1-1 match of the normal kernel 'sockets' API. If we had > leveraged that, then QIOChannelSocket class and the QAPI SocketAddress > type could almost[1] trivially have supported RDMA. There would have > been almost no RDMA code required in the migration subsystem, and all > the modern features like compression, multifd, post-copy, etc would > "just work". > > I guess the 'rsocket.h' shim may well limit some of the possible > performance gains, but it might still have been a better tradeoff > to have not quite so good peak performance, but with massively > less maint burden. My understanding so far is RDMA is sololy for performance but nothing else, then it's a question on whether rdma existing users would like to do so if it will run slower. Jinpu mentioned on the explicit usages of ib verbs but I am just mostly quotting that word as I don't really know such details: https://lore.kernel.org/qemu-devel/camgffem2twjxopcnqtq1sjytf5395dbztcmyikrqfxdzjws...@mail.gmail.com/ So not sure whether that applies here too, in that having qiochannel wrapper may not allow direct access to those ib verbs. Thanks, > > With regards, > Daniel > > [1] "almost" trivially, because the poll() integration for rsockets > requires a bit more magic sauce since rsockets FDs are not > really FDs from the kernel's POV. Still, QIOCHannel likely can > abstract that probme. > -- > |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o-https://fstop138.berrange.com :| > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :| > -- Peter Xu
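For reference, a minimal sketch of what the 1-1 mapping Dan describes looks like in practice with rsockets (rsocket()/rconnect()/rsend()/rrecv()/rclose() come from <rdma/rsocket.h> in librdmacm; addresses and error handling trimmed):

    /* Minimal rsockets client sketch: the calls mirror the BSD sockets API,
     * just with an 'r' prefix, which is why a QIOChannel built on top of it
     * could look almost identical to QIOChannelSocket. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <string.h>
    #include <rdma/rsocket.h> /* link with -lrdmacm */

    int rsocket_client_demo(const char *ip, unsigned short port)
    {
        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port = htons(port),
        };
        inet_pton(AF_INET, ip, &addr.sin_addr);

        int fd = rsocket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0 || rconnect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            return -1;
        }

        const char msg[] = "hello over rsockets";
        rsend(fd, msg, sizeof(msg), 0);

        char buf[64];
        rrecv(fd, buf, sizeof(buf), 0);

        rclose(fd);
        return 0;
    }

As the footnote above says, the main wrinkle is that these fds are not kernel fds, so anything built around poll()/ppoll() would still need rpoll() or an equivalent shim.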
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Tue, Apr 30, 2024 at 09:00:49AM +0100, Daniel P. Berrangé wrote: > On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote: > > Peter Xu writes: > > > > > On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote: > > >> Hi All (and Peter), > > > > > > Hi, Michael, > > > > > >> > > >> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name > > >> (highly irregular for a male) and yes, that's my real last name: > > >> https://www.linkedin.com/in/mrgalaxy/) > > >> > > >> I'm the original author of the RDMA implementation. I've been discussing > > >> with Yu Zhang for a little bit about potentially handing over > > >> maintainership > > >> of the codebase to his team. > > >> > > >> I simply have zero access to RoCE or Infiniband hardware at all, > > >> unfortunately. so I've never been able to run tests or use what I wrote > > >> at > > >> work, and as all of you know, if you don't have a way to test something, > > >> then you can't maintain it. > > >> > > >> Yu Zhang put a (very kind) proposal forward to me to ask the community if > > >> they feel comfortable training his team to maintain the codebase (and run > > >> tests) while they learn about it. > > > > > > The "while learning" part is fine at least to me. IMHO the "ownership" to > > > the code, or say, taking over the responsibility, may or may not need 100% > > > mastering the code base first. There should still be some fundamental > > > confidence to work on the code though as a starting point, then it's about > > > serious use case to back this up, and careful testings while getting more > > > familiar with it. > > > > How much experience we expect of maintainers depends on the subsystem > > and other circumstances. The hard requirement isn't experience, it's > > trust. See the recent attack on xz. > > > > I do not mean to express any doubts whatsoever on Yu Zhang's integrity! > > I'm merely reminding y'all what's at stake. > > I think we shouldn't overly obsess[1] about 'xz', because the overwhealmingly > common scenario is that volunteer maintainers are honest people. QEMU is > in a massively better peer review situation. With xz there was basically no > oversight of the new maintainer. With QEMU, we have oversight from 1000's > of people on the list, a huge pool of general maintainers, the specific > migration maintainers, and the release manager merging code. > > With a lack of historical experiance with QEMU maintainership, I'd suggest > that new RDMA volunteers would start by adding themselves to the "MAINTAINERS" > file with only the 'Reviewer' classification. The main migration maintainers > would still handle pull requests, but wait for a R-b from one of the RMDA > volunteers. After some period of time the RDMA folks could graduate to full > maintainer status if the migration maintainers needed to reduce their load. > I suspect that might prove unneccesary though, given RDMA isn't an area of > code with a high turnover of patches. Right, and we can do that as a start, it also follows our normal rules of starting from Reviewers to maintain something. I even considered Zhijian to be the previous rdma goto guy / maintainer no matter what role he used to have in the MAINTAINERS file. Here IMHO it's more about whether any company would like to stand up and provide help, without yet binding that to be able to send pull requests in the near future or even longer term. 
What I worry more about is whether keeping rdma in qemu is really what we want, and that's also why I was trying to request some serious performance measurements comparing rdma v.s. nics. And here when I said "we" I mean both the QEMU community and any company that will support keeping rdma around. The problem is, if NICs are now fast enough to perform at least equally against rdma, and if they have a lower cost of overall maintenance, does it mean that rdma migration will only be used by whoever already has it in their products and wants to keep it? In that case we should simply ask new users to stick with tcp, and rdma users should only decrease, not increase. It seems also destined that most new migration features will not support rdma: see how many old features we drop in migration now (which rdma _might_ still leverage, but maybe not), and how much of what we add is mostly multifd-relevant, which will probably not apply to rdma at all. So in general what I am worrying about is a lose-lose condition, where a company might find it easier to either stick with an old qemu (depending on whether other new features are requested besides RDMA alone), or do periodic rebases with RDMA downstream only. So even if we want to keep RDMA around, I hope with this chance we can at least have a clear picture of when we should still suggest any new user to use RDMA (with the reasons behind it). Or we simply shouldn't suggest any new user to use RDMA at all (because at least it'll lose many new migration features). Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote: > Hi All (and Peter), Hi, Michael, > > My name is Michael Galaxy (formerly Hines). Yes, I changed my last name > (highly irregular for a male) and yes, that's my real last name: > https://www.linkedin.com/in/mrgalaxy/) > > I'm the original author of the RDMA implementation. I've been discussing > with Yu Zhang for a little bit about potentially handing over maintainership > of the codebase to his team. > > I simply have zero access to RoCE or Infiniband hardware at all, > unfortunately. so I've never been able to run tests or use what I wrote at > work, and as all of you know, if you don't have a way to test something, > then you can't maintain it. > > Yu Zhang put a (very kind) proposal forward to me to ask the community if > they feel comfortable training his team to maintain the codebase (and run > tests) while they learn about it. The "while learning" part is fine at least to me. IMHO the "ownership" to the code, or say, taking over the responsibility, may or may not need 100% mastering the code base first. There should still be some fundamental confidence to work on the code though as a starting point, then it's about serious use case to back this up, and careful testings while getting more familiar with it. > > If you don't mind, I'd like to let him send over his (very detailed) > proposal, Yes please, it's exactly the time to share the plan. The hope is we try to reach a consensus before or around the middle of this release (9.1). Normally QEMU has a 3~4 months window for each release and 9.1 schedule is not yet out, but I think it means we make a decision before or around middle of June. Thanks, > > - Michael > > On 4/11/24 11:36, Yu Zhang wrote: > > > 1) Either a CI test covering at least the major RDMA paths, or at least > > > periodically tests for each QEMU release will be needed. > > We use a batch of regression test cases for the stack, which covers the > > test for QEMU. I did such test for most of the QEMU releases planned as > > candidates for rollout. > > > > The migration test needs a pair of (either physical or virtual) servers with > > InfiniBand network, which makes it difficult to do on a single server. The > > nested VM could be a possible approach, for which we may need virtual > > InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you > > know. > > > > [1] > > https://urldefense.com/v3/__https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWvGHE3ig$ > > > > Thanks and best regards! > > > > On Thu, Apr 11, 2024 at 4:20 PM Peter Xu wrote: > > > On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote: > > > > On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via > > > > wrote: > > > > > > > > > > on 4/10/2024 3:46 AM, Peter Xu wrote: > > > > > > > > > > > > Is there document/link about the unittest/CI for migration tests, > > > > > > > Why > > > > > > > are those tests missing? > > > > > > > Is it hard or very special to set up an environment for that? > > > > > > > maybe we > > > > > > > can help in this regards. > > > > > > See tests/qtest/migration-test.c. We put most of our migration > > > > > > tests > > > > > > there and that's covered in CI. > > > > > > > > > > > > I think one major issue is CI systems don't normally have rdma > > > > > > devices. > > > > > > Can rdma migration test be carried out without a real hardware? > > > > > Yeah, RXE aka. 
SOFT-RoCE is able to emulate the RDMA, for example > > > > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0 # on host > > > > > then we can get a new RDMA interface "rxe_eth0". > > > > > This new RDMA interface is able to do the QEMU RDMA migration. > > > > > > > > > > Also, the loopback(lo) device is able to emulate the RDMA interface > > > > > "rxe_lo", however when > > > > > I tried(years ago) to do RDMA migration over this > > > > > interface(rdma:127.0.0.1:) , it got something wrong. > > > > > So i gave up enabling the RDMA migration qtest at that time. > > > > Thanks, Zhijian. > > > > > > >
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
Yu, On Thu, Apr 11, 2024 at 06:36:54PM +0200, Yu Zhang wrote: > > 1) Either a CI test covering at least the major RDMA paths, or at least > > periodically tests for each QEMU release will be needed. > We use a batch of regression test cases for the stack, which covers the > test for QEMU. I did such test for most of the QEMU releases planned as > candidates for rollout. The least I can think of is a few tests per release. Definitely too little if one release can already break.. > > The migration test needs a pair of (either physical or virtual) servers with > InfiniBand network, which makes it difficult to do on a single server. The > nested VM could be a possible approach, for which we may need virtual > InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know. > > [1] https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce Does it require a kernel driver? The fewer host kernel / hardware / .. dependencies the better. I am wondering whether there can be a library doing everything in userspace, translating RDMA into e.g. socket messages (so maybe ultimately that's something like IP->rdma->IP.. just to cover the "rdma" procedures), so that it'll work for CI reliably. Please also see my full list, though, especially entry 4). Thanks already for looking for solutions on the tests, but I don't want to waste your time only to then find that the tests are not enough even when ready. I think we need people who understand this stuff well enough, have dedicated time, and will look after it. Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote: > On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via wrote: > > > > > > on 4/10/2024 3:46 AM, Peter Xu wrote: > > > > >> Is there document/link about the unittest/CI for migration tests, Why > > >> are those tests missing? > > >> Is it hard or very special to set up an environment for that? maybe we > > >> can help in this regards. > > > See tests/qtest/migration-test.c. We put most of our migration tests > > > there and that's covered in CI. > > > > > > I think one major issue is CI systems don't normally have rdma devices. > > > Can rdma migration test be carried out without a real hardware? > > > > Yeah, RXE aka. SOFT-RoCE is able to emulate the RDMA, for example > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0 # on host > > then we can get a new RDMA interface "rxe_eth0". > > This new RDMA interface is able to do the QEMU RDMA migration. > > > > Also, the loopback(lo) device is able to emulate the RDMA interface > > "rxe_lo", however when > > I tried(years ago) to do RDMA migration over this > > interface(rdma:127.0.0.1:) , it got something wrong. > > So i gave up enabling the RDMA migration qtest at that time. > > Thanks, Zhijian. > > I'm not sure adding an emu-link for rdma is doable for CI systems, though. > Maybe someone more familiar with how CI works can chim in. Some people got dropped from the cc list for unknown reasons; I'm adding them back (Fabiano, Peter Maydell, Phil). Let's make sure nobody is dropped by accident. I'll try to summarize what is still missing, and I think these will be greatly helpful if we don't want to deprecate rdma migration:

1) Either a CI test covering at least the major RDMA paths, or at least periodic tests for each QEMU release, will be needed.

2) Some performance tests between modern RDMA and NIC devices are welcomed. The current knowledge is that modern NICs can work similarly to RDMA in performance, so it's debatable why we still maintain so much rdma-specific code.

3) No need for solid patchsets for this one, but some plan to improve the RDMA migration code so that it is not almost isolated from the rest of the protocols.

4) Someone to look after this code for real.

For 2) and 3) more info is here: https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n Here 4) can be the most important as Markus pointed out. We just didn't get there yet in the discussions, but maybe Markus is right that we should talk about that first. Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via wrote: > > > on 4/10/2024 3:46 AM, Peter Xu wrote: > > >> Is there document/link about the unittest/CI for migration tests, Why > >> are those tests missing? > >> Is it hard or very special to set up an environment for that? maybe we > >> can help in this regards. > > See tests/qtest/migration-test.c. We put most of our migration tests > > there and that's covered in CI. > > > > I think one major issue is CI systems don't normally have rdma devices. > > Can rdma migration test be carried out without a real hardware? > > Yeah, RXE aka. SOFT-RoCE is able to emulate the RDMA, for example > $ sudo rdma link add rxe_eth0 type rxe netdev eth0 # on host > then we can get a new RDMA interface "rxe_eth0". > This new RDMA interface is able to do the QEMU RDMA migration. > > Also, the loopback(lo) device is able to emulate the RDMA interface > "rxe_lo", however when > I tried(years ago) to do RDMA migration over this > interface(rdma:127.0.0.1:) , it got something wrong. > So i gave up enabling the RDMA migration qtest at that time. Thanks, Zhijian. I'm not sure adding an emu-link for rdma is doable for CI systems, though. Maybe someone more familiar with how CI works can chim in. -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Tue, Apr 09, 2024 at 09:32:46AM +0200, Jinpu Wang wrote: > Hi Peter, > > On Mon, Apr 8, 2024 at 6:18 PM Peter Xu wrote: > > > > On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote: > > > Hi Peter, > > > > Jinpu, > > > > Thanks for joining the discussion. > > > > > > > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu wrote: > > > > > > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote: > > > > > Hello Peter und Zhjian, > > > > > > > > > > Thank you so much for letting me know about this. I'm also a bit > > > > > surprised at > > > > > the plan for deprecating the RDMA migration subsystem. > > > > > > > > It's not too late, since it looks like we do have users not yet notified > > > > from this, we'll redo the deprecation procedure even if it'll be the > > > > final > > > > plan, and it'll be 2 releases after this. > > > > > > > > > > > > > > > IMHO it's more important to know whether there are still users and > > > > > > whether > > > > > > they would still like to see it around. > > > > > > > > > > > I admit RDMA migration was lack of testing(unit/CI test), which led > > > > > > to the a few > > > > > > obvious bugs being noticed too late. > > > > > > > > > > Yes, we are a user of this subsystem. I was unaware of the lack of > > > > > test coverage > > > > > for this part. As soon as 8.2 was released, I saw that many of the > > > > > migration test > > > > > cases failed and came to realize that there might be a bug between 8.1 > > > > > and 8.2, but > > > > > was unable to confirm and report it quickly to you. > > > > > > > > > > The maintenance of this part could be too costly or difficult from > > > > > your point of view. > > > > > > > > It may or may not be too costly, it's just that we need real users of > > > > RDMA > > > > taking some care of it. Having it broken easily for >1 releases > > > > definitely > > > > is a sign of lack of users. It is an implication to the community that > > > > we > > > > should consider dropping some features so that we can get the best use > > > > of > > > > the community resources for the things that may have a broader audience. > > > > > > > > One thing majorly missing is a RDMA tester to guard all the merges to > > > > not > > > > break RDMA paths, hopefully in CI. That should not rely on RDMA > > > > hardwares > > > > but just to sanity check the migration+rdma code running all fine. RDMA > > > > taught us the lesson so we're requesting CI coverage for all other new > > > > features that will be merged at least for migration subsystem, so that > > > > we > > > > plan to not merge anything that is not covered by CI unless extremely > > > > necessary in the future. > > > > > > > > For sure CI is not the only missing part, but I'd say we should start > > > > with > > > > it, then someone should also take care of the code even if only in > > > > maintenance mode (no new feature to add on top). > > > > > > > > > > > > > > My concern is, this plan will forces a few QEMU users (not sure how > > > > > many) like us > > > > > either to stick to the RDMA migration by using an increasingly older > > > > > version of QEMU, > > > > > or to abandon the currently used RDMA migration. > > > > > > > > RDMA doesn't get new features anyway, if there's specific use case for > > > > RDMA > > > > migrations, would it work if such a scenario uses the old binary? Is it > > > > possible to switch to the TCP protocol with some good NICs? 
> > > We have used rdma migration with HCA from Nvidia for years, our > > > experience is RDMA migration works better than tcp (over ipoib). > > > > Please bare with me, as I know little on rdma stuff. > > > > I'm actually pretty confused (and since a long time ago..) on why we need > > to operation with rdma contexts when ipoib seems to provide all the tcp > > layers. I meant, can it work with
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote: > Hi Peter, Jinpu, Thanks for joining the discussion. > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu wrote: > > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote: > > > Hello Peter und Zhjian, > > > > > > Thank you so much for letting me know about this. I'm also a bit > > > surprised at > > > the plan for deprecating the RDMA migration subsystem. > > > > It's not too late, since it looks like we do have users not yet notified > > from this, we'll redo the deprecation procedure even if it'll be the final > > plan, and it'll be 2 releases after this. > > > > > > > > > IMHO it's more important to know whether there are still users and > > > > whether > > > > they would still like to see it around. > > > > > > > I admit RDMA migration was lack of testing(unit/CI test), which led to > > > > the a few > > > > obvious bugs being noticed too late. > > > > > > Yes, we are a user of this subsystem. I was unaware of the lack of test > > > coverage > > > for this part. As soon as 8.2 was released, I saw that many of the > > > migration test > > > cases failed and came to realize that there might be a bug between 8.1 > > > and 8.2, but > > > was unable to confirm and report it quickly to you. > > > > > > The maintenance of this part could be too costly or difficult from > > > your point of view. > > > > It may or may not be too costly, it's just that we need real users of RDMA > > taking some care of it. Having it broken easily for >1 releases definitely > > is a sign of lack of users. It is an implication to the community that we > > should consider dropping some features so that we can get the best use of > > the community resources for the things that may have a broader audience. > > > > One thing majorly missing is a RDMA tester to guard all the merges to not > > break RDMA paths, hopefully in CI. That should not rely on RDMA hardwares > > but just to sanity check the migration+rdma code running all fine. RDMA > > taught us the lesson so we're requesting CI coverage for all other new > > features that will be merged at least for migration subsystem, so that we > > plan to not merge anything that is not covered by CI unless extremely > > necessary in the future. > > > > For sure CI is not the only missing part, but I'd say we should start with > > it, then someone should also take care of the code even if only in > > maintenance mode (no new feature to add on top). > > > > > > > > My concern is, this plan will forces a few QEMU users (not sure how > > > many) like us > > > either to stick to the RDMA migration by using an increasingly older > > > version of QEMU, > > > or to abandon the currently used RDMA migration. > > > > RDMA doesn't get new features anyway, if there's specific use case for RDMA > > migrations, would it work if such a scenario uses the old binary? Is it > > possible to switch to the TCP protocol with some good NICs? > We have used rdma migration with HCA from Nvidia for years, our > experience is RDMA migration works better than tcp (over ipoib). Please bare with me, as I know little on rdma stuff. I'm actually pretty confused (and since a long time ago..) on why we need to operation with rdma contexts when ipoib seems to provide all the tcp layers. I meant, can it work with the current "tcp:" protocol with ipoib even if there's rdma/ib hardwares underneath? Is it because of performance improvements so that we must use a separate path comparing to generic "tcp:" protocol here? 
> > Switching back to TCP will lead us to the old problems which was > solved by RDMA migration. Can you elaborate on the problems, and why tcp won't work in this case? They may not be directly relevant to the issue we're discussing, but I'm happy to learn more. What are the NICs you were testing with before? Was the test carried out with modern ones (50Gbps-200Gbps NICs), or was it done when such hardware was not common? Per my recent knowledge of the new Intel hardware, at least the parts that support QPL, it's easy to achieve single-core 50Gbps+. https://lore.kernel.org/r/ph7pr11mb5941a91ac1e514bcc32896a6a3...@ph7pr11mb5941.namprd11.prod.outlook.com Quote from Yuan: Yes, I use iperf3 to check the bandwidth for one core, the bandwidth is 60Gbps. [ ID] Interval Transfer Bitrate Retr Cwnd [ 5]
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote: > Hello Peter und Zhjian, > > Thank you so much for letting me know about this. I'm also a bit surprised at > the plan for deprecating the RDMA migration subsystem. It's not too late, since it looks like we do have users not yet notified from this, we'll redo the deprecation procedure even if it'll be the final plan, and it'll be 2 releases after this. > > > IMHO it's more important to know whether there are still users and whether > > they would still like to see it around. > > > I admit RDMA migration was lack of testing(unit/CI test), which led to the > > a few > > obvious bugs being noticed too late. > > Yes, we are a user of this subsystem. I was unaware of the lack of test > coverage > for this part. As soon as 8.2 was released, I saw that many of the > migration test > cases failed and came to realize that there might be a bug between 8.1 > and 8.2, but > was unable to confirm and report it quickly to you. > > The maintenance of this part could be too costly or difficult from > your point of view. It may or may not be too costly, it's just that we need real users of RDMA taking some care of it. Having it broken easily for >1 releases definitely is a sign of lack of users. It is an implication to the community that we should consider dropping some features so that we can get the best use of the community resources for the things that may have a broader audience. One thing majorly missing is a RDMA tester to guard all the merges to not break RDMA paths, hopefully in CI. That should not rely on RDMA hardwares but just to sanity check the migration+rdma code running all fine. RDMA taught us the lesson so we're requesting CI coverage for all other new features that will be merged at least for migration subsystem, so that we plan to not merge anything that is not covered by CI unless extremely necessary in the future. For sure CI is not the only missing part, but I'd say we should start with it, then someone should also take care of the code even if only in maintenance mode (no new feature to add on top). > > My concern is, this plan will forces a few QEMU users (not sure how > many) like us > either to stick to the RDMA migration by using an increasingly older > version of QEMU, > or to abandon the currently used RDMA migration. RDMA doesn't get new features anyway, if there's specific use case for RDMA migrations, would it work if such scenario uses the old binary? Is it possible to switch to the TCP protocol with some good NICs? Per our best knowledge, RDMA users are rare, and please let anyone know if you are aware of such users. IIUC the major reason why RDMA stopped being the trend is because the network is not like ten years ago; I don't think I have good knowledge in RDMA at all nor network, but my understanding is it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make little sense to maintain multiple protocols, considering RDMA migration code is so special so that it has the most custom code comparing to other protocols. Thanks, -- Peter Xu
Re: [PATCH for-9.1] migration: Add Error** argument to add_bitmaps_to_list()
On Fri, Mar 29, 2024 at 11:56:27AM +0100, Cédric Le Goater wrote: > This allows to report more precise errors in the migration handler > dirty_bitmap_save_setup(). > > Suggested-by Vladimir Sementsov-Ogievskiy > Signed-off-by: Cédric Le Goater > --- > > To apply on top of : > https://lore.kernel.org/qemu-devel/20240320064911.545001-1-...@redhat.com/ queued, thanks. -- Peter Xu
Re: [PATCH 11/19] migration/block: fix -Werror=maybe-uninitialized false-positive
On Thu, Mar 28, 2024 at 02:20:44PM +0400, marcandre.lur...@redhat.com wrote: > From: Marc-André Lureau > > ../migration/block.c:966:16: error: ‘ret’ may be used uninitialized > [-Werror=maybe-uninitialized] > > Given that "cluster_size" must be <= BLK_MIG_BLOCK_SIZE, the previous > loop is entered at least once, so 'ret' is assigned a value in all conditions. > > Signed-off-by: Marc-André Lureau Acked-by: Peter Xu -- Peter Xu
Re: [PATCH 15/19] migration: fix -Werror=maybe-uninitialized false-positive
On Thu, Mar 28, 2024 at 02:20:48PM +0400, marcandre.lur...@redhat.com wrote: > From: Marc-André Lureau > > ../migration/ram.c:1873:23: error: ‘dirty’ may be used uninitialized > [-Werror=maybe-uninitialized] > > When 'block' != NULL, 'dirty' is initialized. > > Signed-off-by: Marc-André Lureau Acked-by: Peter Xu -- Peter Xu
Re: [PATCH 12/19] migration: fix -Werror=maybe-uninitialized false-positives
On Thu, Mar 28, 2024 at 02:20:45PM +0400, marcandre.lur...@redhat.com wrote: > From: Marc-André Lureau > > ../migration/dirtyrate.c:186:5: error: ‘records’ may be used uninitialized > [-Werror=maybe-uninitialized] > ../migration/dirtyrate.c:168:12: error: ‘gen_id’ may be used uninitialized > [-Werror=maybe-uninitialized] > ../migration/migration.c:2273:5: error: ‘file’ may be used uninitialized > [-Werror=maybe-uninitialized] > > Signed-off-by: Marc-André Lureau Acked-by: Peter Xu -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Thu, Mar 28, 2024 at 04:22:27PM +0100, Thomas Huth wrote: > Since e9a54265f5 was not very clear about rdma migration code, should we > maybe rather add a separate deprecation note for the migration part, and add > a proper warning message to the migration code in case someone tries to use > it there, and then only remove the rdma migration code after two more > releases? Definitely a valid option to me. So far RDMA isn't covered in tests (actually the same goes for COLO, and I wonder about our position on COLO too in this case..), so unfortunately we don't even know when it'll break, just like before. From other activities that I can see when new code comes in, maintaining the RDMA code should be fairly manageable so far (and whoever writes new rdma code in those two releases will also need to take the maintainer's role). We did it for those years, and we can keep that for two more releases. Hopefully such warnings can ring a louder alarm for the current users, so that people can either stick with old binaries, or invest developer/test resources in the community. Thanks, -- Peter Xu
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote: > Philippe Mathieu-Daudé writes: > > > The whole RDMA subsystem was deprecated in commit e9a54265f5 > > ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem") > > released in v8.2. > > > > Remove: > > - RDMA handling from migration > > - dependencies on libibumad, libibverbs and librdmacm > > > > Keep the RAM_SAVE_FLAG_HOOK definition since it might appears > > in old migration streams. > > > > Cc: Peter Xu > > Cc: Li Zhijian > > Acked-by: Fabiano Rosas > > Signed-off-by: Philippe Mathieu-Daudé > > Just to be clear, because people raised the point in the last version, > the first link in the deprecation commit links to a thread comprising > entirely of rdma migration patches. I don't see any ambiguity on whether > the deprecation was intended to include migration. There's even an ack > from Juan. Yes I remember that's the plan. > > So on the basis of not reverting the previous maintainer's decision, my > Ack stands here. > > We also had pretty obvious bugs ([1], [2]) in the past that would have > been caught if we had any kind of testing for the feature, so I can't > even say this thing works currently. > > @Peter Xu, @Li Zhijian, what are your thoughts on this? Generally I definitely agree with such a removal sooner or later, as that's how deprecation works, and even after Juan's left I'm not aware of any other new RDMA users. Personally, I'd slightly prefer postponing it one more release which might help a bit of our downstream maintenance, however I assume that's not a blocker either, as I think we can also manage it. IMHO it's more important to know whether there are still users and whether they would still like to see it around. That's also one thing I notice that e9a54265f533f didn't yet get acks from RDMA users that we are aware, even if they're rare. According to [2] it could be that such user may only rely on the release versions of QEMU when it broke things. So I'm copying Yu too (while Zhijian is already in the loop), just in case someone would like to stand up and speak. Thanks, > > 1- https://lore.kernel.org/r/20230920090412.726725-1-lizhij...@fujitsu.com > 2- > https://lore.kernel.org/r/cahecvy7hxswn4ow_kog+q+tn6f_kmeichevz1qgm-fbxbpp...@mail.gmail.com > -- Peter Xu
Re: [PATCH for 9.0] migration: Skip only empty block devices
On Tue, Mar 12, 2024 at 05:34:26PM -0400, Stefan Hajnoczi wrote: > I understand now. I missed that returning from init_blk_migration_it() > did not abort iteration. Thank you! I queued it, thanks both! -- Peter Xu
Re: [PATCH for 9.0] migration: Skip only empty block devices
On Tue, Mar 12, 2024 at 01:04:31PM +0100, Cédric Le Goater wrote: > The block .save_setup() handler calls a helper routine > init_blk_migration() which builds a list of block devices to take into > account for migration. When one device is found to be empty (sectors > == 0), the loop exits and all the remaining devices are ignored. This > is a regression introduced when bdrv_iterate() was removed. > > Change that by skipping only empty devices. > > Cc: Markus Armbruster > Suggested: Kevin Wolf This should be: Suggested-by: Kevin Wolf I think the missed "by" caused Kevin not in the cc list, I added Kevin in. I'll hold a bit for Kevin to ACK, no repost needed just for above; I can fix it. Thanks, > Fixes: fea68bb6e9fa ("block: Eliminate bdrv_iterate(), use bdrv_next()") > Signed-off-by: Cédric Le Goater > --- > migration/block.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/migration/block.c b/migration/block.c > index > 8c6ebafacc1ffe930d1d4f19d968817b14852c69..2b9054889ad2ba739828594c50cf047703757e96 > 100644 > --- a/migration/block.c > +++ b/migration/block.c > @@ -402,7 +402,10 @@ static int init_blk_migration(QEMUFile *f) > } > > sectors = bdrv_nb_sectors(bs); > -if (sectors <= 0) { > +if (sectors == 0) { > +continue; > +} > +if (sectors < 0) { > ret = sectors; > bdrv_next_cleanup(&it); > goto out; > -- > 2.44.0 > > -- Peter Xu
Re: [PATCH 00/15] qapi: Require member documentation (with loophole)
On Mon, Feb 05, 2024 at 08:46:54AM +0100, Markus Armbruster wrote: > qapi/migration.json > MigrateSetParameters 1 It's tls-authz. I'll send a patch for this one. Thanks, -- Peter Xu
Re: [PATCH v2 2/2] hw/pflash: implement update buffer for block writes
that this is overlooked in the commit a19cbfb346 ("spice: add qxl device") where VMSTATE_VBUFFER_UINT32 is introduced. I don't have a good way to trivially fix it because we don't remember that type of size in vmsd. However a simple solution could be that we convert all existing VMSTATE_VBUFFER() users (potentially, VMSTATE_PARTIAL_VBUFFER users too; there seems to be only 1 caller..) to always use uint32_t. I don't think it's wise to try using a signed type for the size anyway, and it should be a compatible change as we doubled the size. I'll hold a bit to see whether there's any comment, then I can try to post a patch. > + * that is in some other struct field, but it's a runtime constant and > + * we can assume the memory has already been allocated. > +*/ > > #define VMSTATE_VBUFFER_UINT32(_field, _state, _version, _test, > _field_size) { \ > @@ -688,2 +698,9 @@ extern const VMStateInfo vmstate_info_qlist; > > +/** > + * VMSTATE_VBUFFER_ALLOC_UINT32: > + * > + * We need to migrate an array of uint32_t of variable size dependent > + * on the inbound migration data, and so the migration code must > + * allocate it. > +*/ > #define VMSTATE_VBUFFER_ALLOC_UINT32(_field, _state, _version, \ > --- > -- Peter Xu
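To make the suggested conversion concrete, a minimal sketch (not a posted patch, and assuming the size lookup in the VMSTATE_VBUFFER() macro is currently typed as int32_t via vmstate_offset_value()): the only change in the macro itself would be the type used for that lookup, with the corresponding device-struct size fields switched to uint32_t at the same time:

-    .size_offset  = vmstate_offset_value(_state, _field_size, int32_t),  \
+    .size_offset  = vmstate_offset_value(_state, _field_size, uint32_t), \

VMSTATE_PARTIAL_VBUFFER and its single caller would get the same treatment.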
Re: [PATCH 1/6] system/cpus: rename qemu_mutex_lock_iothread() to qemu_bql_lock()
On Thu, Nov 30, 2023 at 03:43:25PM -0500, Stefan Hajnoczi wrote: > On Thu, Nov 30, 2023 at 03:08:49PM -0500, Peter Xu wrote: > > On Wed, Nov 29, 2023 at 04:26:20PM -0500, Stefan Hajnoczi wrote: > > > The Big QEMU Lock (BQL) has many names and they are confusing. The > > > actual QemuMutex variable is called qemu_global_mutex but it's commonly > > > referred to as the BQL in discussions and some code comments. The > > > locking APIs, however, are called qemu_mutex_lock_iothread() and > > > qemu_mutex_unlock_iothread(). > > > > > > The "iothread" name is historic and comes from when the main thread was > > > split into into KVM vcpu threads and the "iothread" (now called the main > > > loop thread). I have contributed to the confusion myself by introducing > > > a separate --object iothread, a separate concept unrelated to the BQL. > > > > > > The "iothread" name is no longer appropriate for the BQL. Rename the > > > locking APIs to: > > > - void qemu_bql_lock(void) > > > - void qemu_bql_unlock(void) > > > - bool qemu_bql_locked(void) > > > > > > There are more APIs with "iothread" in their names. Subsequent patches > > > will rename them. There are also comments and documentation that will be > > > updated in later patches. > > > > > > Signed-off-by: Stefan Hajnoczi > > > > Acked-by: Peter Xu > > > > Two nickpicks: > > > > - BQL contains "QEMU" as the 2nd character, so maybe easier to further > > rename qemu_bql into bql_? > > Philippe wondered whether the variable name should end with _mutex (or > _lock is common too), so an alternative might be big_qemu_lock. That's IMHO mutex isn't important in this context, but an implementation detail of the "lock" as an abstract concept. For example, we won't need to rename it again then if the impl changes, e.g. using pure futex or a rwlock replacement. When that happens we don't need to change all call sites again. (never really meant to change the lock impl, just an example.. :) KVM actually has that example of KVM_MMU_LOCK() macro taking as the rwlock write lock when the spinlock is replaced with rwlock, while it'll keep to be the spinlock "lock()" when !KVM_HAVE_MMU_RWLOCK. > imperfect because it doesn't start with the usual qemu_ prefix. > qemu_big_lock is better in that regard but inconsistent with our BQL > abbreviation. > > I don't like putting an underscore at the end. It's unusual and would > make me wonder what that means. Ah, I meant replacing the "qemu_bql_" prefix with "bql_", as that contains QEMU already, rather than making "_" at the end. So they'll be bql_lock(), bql_unlock(), bql_locked(). > > Naming is hard, but please discuss and I'm open to change to BQL > variable's name to whatever we all agree on. I'm pretty okay with qemu_bql_lock(), etc. too. I prefer a tiny little bit on bql_ over qemu_bql_ in this regard, but frankly they're all names good enough to me. The "qemu_" prefix can still be a good thing saying "this is a qemu global function", even if contained inside "bql" itself. > > > > > - Could we keep the full spell of BQL at some places, so people can still > > reference it if not familiar? IIUC most of the BQL helpers will root > > back to the major three functions (_lock, _unlock, _locked), perhaps > > add a comment of "BQL stands for..." over these three functions as > > comment? > > Yes, I'll update the doc comments to say "Big QEMU Lock (BQL)" for each > of these functions. Thanks! -- Peter Xu
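For readers who haven't seen the KVM trick referenced above, it is roughly the following shape (simplified; treat it as a sketch rather than an exact copy of the Linux code): call sites only ever use the macro, so swapping the lock type underneath does not touch them.

#ifdef KVM_HAVE_MMU_RWLOCK
#define KVM_MMU_LOCK(kvm)    write_lock(&(kvm)->mmu_lock)
#define KVM_MMU_UNLOCK(kvm)  write_unlock(&(kvm)->mmu_lock)
#else
#define KVM_MMU_LOCK(kvm)    spin_lock(&(kvm)->mmu_lock)
#define KVM_MMU_UNLOCK(kvm)  spin_unlock(&(kvm)->mmu_lock)
#endif

The bql_lock()/bql_unlock()/bql_locked() naming leaves the same room: if the mutex behind the BQL were ever replaced, only the three definitions would change, not the call sites.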
Re: [PATCH 1/6] system/cpus: rename qemu_mutex_lock_iothread() to qemu_bql_lock()
On Wed, Nov 29, 2023 at 04:26:20PM -0500, Stefan Hajnoczi wrote: > The Big QEMU Lock (BQL) has many names and they are confusing. The > actual QemuMutex variable is called qemu_global_mutex but it's commonly > referred to as the BQL in discussions and some code comments. The > locking APIs, however, are called qemu_mutex_lock_iothread() and > qemu_mutex_unlock_iothread(). > > The "iothread" name is historic and comes from when the main thread was > split into into KVM vcpu threads and the "iothread" (now called the main > loop thread). I have contributed to the confusion myself by introducing > a separate --object iothread, a separate concept unrelated to the BQL. > > The "iothread" name is no longer appropriate for the BQL. Rename the > locking APIs to: > - void qemu_bql_lock(void) > - void qemu_bql_unlock(void) > - bool qemu_bql_locked(void) > > There are more APIs with "iothread" in their names. Subsequent patches > will rename them. There are also comments and documentation that will be > updated in later patches. > > Signed-off-by: Stefan Hajnoczi Acked-by: Peter Xu Two nitpicks: - BQL contains "QEMU" as the 2nd character, so maybe easier to further rename qemu_bql into bql_? - Could we keep the full spelling of BQL in some places, so people can still reference it if not familiar? IIUC most of the BQL helpers will root back to the three major functions (_lock, _unlock, _locked), perhaps add a "BQL stands for..." comment over these three functions? Please take or ignore these nitpicks; my ACK stands regardless. Thanks, -- Peter Xu
Re: [PATCH V2 4/6] cpr: relax vhost migration blockers
Copy qemu-block, virtio On Wed, Oct 25, 2023 at 12:44:27PM -0700, Steve Sistare wrote: > vhost blocks migration if logging is not supported to track dirty > memory, and vhost-user blocks it if the log cannot be saved to a shm fd. > > vhost-vdpa blocks migration if both hosts do not support all the device's > features using a shadow VQ, for tracking requests and dirty memory. > > vhost-scsi blocks migration if storage cannot be shared across hosts, > or if state cannot be migrated. > > None of these conditions apply if the old and new qemu processes do > not run concurrently, and if new qemu starts on the same host as old, > which is the case for cpr. > > Narrow the scope of these blockers so they only apply to normal mode. > They will not block cpr modes when they are added in subsequent patches. > > No functional change until a new mode is added. > > Signed-off-by: Steve Sistare > Reviewed-by: Juan Quintela > --- > hw/scsi/vhost-scsi.c | 2 +- > hw/virtio/vhost.c| 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c > index 14e23cc..bf528d5 100644 > --- a/hw/scsi/vhost-scsi.c > +++ b/hw/scsi/vhost-scsi.c > @@ -208,7 +208,7 @@ static void vhost_scsi_realize(DeviceState *dev, Error > **errp) > "When external environment supports it (Orchestrator > migrates " > "target SCSI device state or use shared storage over > network), " > "set 'migratable' property to true to enable migration."); > -if (migrate_add_blocker(&vsc->migration_blocker, errp) < 0) { > +if (migrate_add_blocker_normal(&vsc->migration_blocker, errp) < 0) { > goto free_virtio; > } > } > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index d737671..f5e9625 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -1527,7 +1527,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque, > } > > if (hdev->migration_blocker != NULL) { > -r = migrate_add_blocker(&hdev->migration_blocker, errp); > +r = migrate_add_blocker_normal(&hdev->migration_blocker, errp); > if (r < 0) { > goto fail_busyloop; > } > -- > 1.8.3.1 > -- Peter Xu
Re: [PATCH V2 3/6] cpr: relax blockdev migration blockers
Copy qemu-block and maintainers. On Wed, Oct 25, 2023 at 12:44:26PM -0700, Steve Sistare wrote: > Some blockdevs block migration because they do not support sharing across > hosts and/or do not support dirty bitmaps. These prohibitions do not apply > if the old and new qemu processes do not run concurrently, and if new qemu > starts on the same host as old, which is the case for cpr. Narrow the scope > of these blockers so they only apply to normal mode. They will not block > cpr modes when they are added in subsequent patches. > > No functional change until a new mode is added. > > Signed-off-by: Steve Sistare > Reviewed-by: Juan Quintela > --- > block/parallels.c | 2 +- > block/qcow.c | 2 +- > block/vdi.c | 2 +- > block/vhdx.c | 2 +- > block/vmdk.c | 2 +- > block/vpc.c | 2 +- > block/vvfat.c | 2 +- > 7 files changed, 7 insertions(+), 7 deletions(-) > > diff --git a/block/parallels.c b/block/parallels.c > index 1697a2e..8a520db 100644 > --- a/block/parallels.c > +++ b/block/parallels.c > @@ -1369,7 +1369,7 @@ static int parallels_open(BlockDriverState *bs, QDict > *options, int flags, > bdrv_get_device_or_node_name(bs)); > bdrv_graph_rdunlock_main_loop(); > > -ret = migrate_add_blocker(&s->migration_blocker, errp); > +ret = migrate_add_blocker_normal(&s->migration_blocker, errp); > if (ret < 0) { > error_setg(errp, "Migration blocker error"); > goto fail; > diff --git a/block/qcow.c b/block/qcow.c > index fdd4c83..eab68e3 100644 > --- a/block/qcow.c > +++ b/block/qcow.c > @@ -307,7 +307,7 @@ static int qcow_open(BlockDriverState *bs, QDict > *options, int flags, > bdrv_get_device_or_node_name(bs)); > bdrv_graph_rdunlock_main_loop(); > > -ret = migrate_add_blocker(&s->migration_blocker, errp); > +ret = migrate_add_blocker_normal(&s->migration_blocker, errp); > if (ret < 0) { > goto fail; > } > diff --git a/block/vdi.c b/block/vdi.c > index fd7e365..c647d72 100644 > --- a/block/vdi.c > +++ b/block/vdi.c > @@ -498,7 +498,7 @@ static int vdi_open(BlockDriverState *bs, QDict *options, > int flags, > bdrv_get_device_or_node_name(bs)); > bdrv_graph_rdunlock_main_loop(); > > -ret = migrate_add_blocker(&s->migration_blocker, errp); > +ret = migrate_add_blocker_normal(&s->migration_blocker, errp); > if (ret < 0) { > goto fail_free_bmap; > } > diff --git a/block/vhdx.c b/block/vhdx.c > index e37f8c0..a9d0874 100644 > --- a/block/vhdx.c > +++ b/block/vhdx.c > @@ -1096,7 +1096,7 @@ static int vhdx_open(BlockDriverState *bs, QDict > *options, int flags, > error_setg(&s->migration_blocker, "The vhdx format used by node '%s' " > "does not support live migration", > bdrv_get_device_or_node_name(bs)); > -ret = migrate_add_blocker(&s->migration_blocker, errp); > +ret = migrate_add_blocker_normal(&s->migration_blocker, errp); > if (ret < 0) { > goto fail; > } > diff --git a/block/vmdk.c b/block/vmdk.c > index 1335d39..85864b8 100644 > --- a/block/vmdk.c > +++ b/block/vmdk.c > @@ -1386,7 +1386,7 @@ static int vmdk_open(BlockDriverState *bs, QDict > *options, int flags, > error_setg(&s->migration_blocker, "The vmdk format used by node '%s' " > "does not support live migration", > bdrv_get_device_or_node_name(bs)); > -ret = migrate_add_blocker(&s->migration_blocker, errp); > +ret = migrate_add_blocker_normal(&s->migration_blocker, errp); > if (ret < 0) { > goto fail; > } > diff --git a/block/vpc.c b/block/vpc.c > index c30cf86..aa1a48a 100644 > --- a/block/vpc.c > +++ b/block/vpc.c > @@ -452,7 +452,7 @@ static int vpc_open(BlockDriverState *bs, QDict *options, > int flags, > bdrv_get_device_or_node_name(bs)); > 
bdrv_graph_rdunlock_main_loop(); > > -ret = migrate_add_blocker(&s->migration_blocker, errp); > +ret = migrate_add_blocker_normal(&s->migration_blocker, errp); > if (ret < 0) { > goto fail; > } > diff --git a/block/vvfat.c b/block/vvfat.c > index 266e036..9d050ba 100644 > --- a/block/vvfat.c > +++ b/block/vvfat.c > @@ -1268,7 +1268,7 @@ static int vvfat_open(BlockDriverState *bs, QDict > *options, int flags, > "The vvfat (rw) format used by node '%s' " > "does not support live migration", > bdrv_get_device_or_node_name(bs)); > -ret = migrate_add_blocker(&s->migration_blocker, errp); > +ret = migrate_add_blocker_normal(&s->migration_blocker, errp); > if (ret < 0) { > goto fail; > } > -- > 1.8.3.1 > -- Peter Xu
Re: [PATCH v2 5/9] migration: check required subsections are loaded, once
On Tue, Oct 24, 2023 at 12:41:56PM +0200, Juan Quintela wrote: > > @@ -509,6 +538,13 @@ static int vmstate_subsection_load(QEMUFile *f, const > > VMStateDescription *vmsd, > > } > > } > > > > +for (i = 0; i < n; i++) { > > +if (!visited[i] && vmstate_section_needed(vmsd->subsections[i], > > opaque)) { > > +trace_vmstate_subsection_load_bad(vmsd->name, > > vmsd->subsections[i]->name, "(not visited)"); > > +return -ENOENT; > > +} > > +} > > + > > trace_vmstate_subsection_load_good(vmsd->name); > > return 0; > > } > > This part is the only one where I can see there could be some > discussion. So I wil wait to see what other people think. Previous email: https://lore.kernel.org/qemu-devel/ZR2P1RbxCfBdYBaQ@x1n/ I still think it is safer to not fail unless justified that we won't hit surprises in the ->needed(). There are a lot so I assume it's non-trivial to justify. We can switch the tracepoint into error_report() in that case, though, as long as it won't fail the migration. Thanks, -- Peter Xu
Re: [PATCH 0/5] RFC: migration: check required entries and sections are loaded
On Tue, Sep 26, 2023 at 07:59:20PM +0400, marcandre.lur...@redhat.com wrote: > Marc-André Lureau (5): > block/fdc: 'phase' is not needed on load > virtio: make endian_needed() work during loading > net/slirp: use different IDs for each instance The first 3 patches are bug fixes, am I right? It'll be great if they can be acked (or even picked up earlier?) by subsystem maintainers if so, and copy stable if properly justified. Thanks, -- Peter Xu
Re: [PATCH 4/5] RFC: migration: check required subsections are loaded, once
On Tue, Sep 26, 2023 at 07:59:24PM +0400, marcandre.lur...@redhat.com wrote: > @@ -484,6 +513,13 @@ static int vmstate_subsection_load(QEMUFile *f, const > VMStateDescription *vmsd, > } > } > > +for (i = 0; i < n; i++) { > +if (!visited[i] && vmstate_save_needed(vmsd->subsections[i], > opaque)) { > +trace_vmstate_subsection_load_bad(vmsd->name, > vmsd->subsections[i]->name, "(not visited)"); > +return -ENOENT; > +} > +} One thing that might be tricky about calling needed() on the loading side is that, IMHO, the needed() hooks were normally designed to be called only on a complete VM state. IOW, I think it can reference any machine/device state, or whatever variable, assuming all of them contain valid data. But the load side may not yet contain everything.. here we can guarantee we have loaded the full device state of this one, as subsections should be the last to come, but that is all we've loaded so far. But what if it references something else outside what we've loaded? It looks possible that some special .needed() hook could return something unexpected. I assume most needed() hooks are fine (and it does look like we can find bugs with this, which means this might prove useful already, at least in some form or another). I just worry that something may start to break once we become strict on this. Maybe.. make the check only throw warnings, but not yet fail the migration? > + > trace_vmstate_subsection_load_good(vmsd->name); > return 0; > } > -- > 2.41.0 > -- Peter Xu
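A minimal sketch of the warn-only variant suggested above, reusing the loop from the quoted hunk (error_report() instead of returning -ENOENT; whether the tracepoint is kept alongside it is left open):

    for (i = 0; i < n; i++) {
        if (!visited[i] && vmstate_save_needed(vmsd->subsections[i], opaque)) {
            /* report, but do not fail the migration */
            error_report("%s: subsection '%s' expected but not received",
                         vmsd->name, vmsd->subsections[i]->name);
        }
    }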
Re: [RFC PATCH v2 22/22] softmmu/physmem: Clean up local variable shadowing
On Mon, Sep 04, 2023 at 05:31:30PM +0100, Daniel P. Berrangé wrote: > On Mon, Sep 04, 2023 at 06:12:34PM +0200, Philippe Mathieu-Daudé wrote: > > Fix: > > > > softmmu/physmem.c: In function > > ‘cpu_physical_memory_snapshot_and_clear_dirty’: > > softmmu/physmem.c:916:27: warning: declaration of ‘offset’ shadows a > > parameter [-Wshadow=compatible-local] > > 916 | unsigned long offset = page % DIRTY_MEMORY_BLOCK_SIZE; > > | ^~ > > softmmu/physmem.c:892:31: note: shadowed declaration is here > > 892 | (MemoryRegion *mr, hwaddr offset, hwaddr length, unsigned > > client) > > |~~~^~ > > > > Signed-off-by: Philippe Mathieu-Daudé > > --- > > RFC: Please double-check how 'offset' is used few lines later. > > I don't see an issue - those lines are in an outer scope, so won't > be accessing the 'offset' you've changed, they'll be the parameter > instead. If you want to sanity check though, presumably the asm > dissassembly for this method should be the same before/after this > change (and if it didn't do so then it's a bug..) > > > --- > > softmmu/physmem.c | 10 +- > > 1 file changed, 5 insertions(+), 5 deletions(-) > > Reviewed-by: Daniel P. Berrangé Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v2 21/22] softmmu/memory: Clean up local variable shadowing
On Mon, Sep 04, 2023 at 06:12:33PM +0200, Philippe Mathieu-Daudé wrote: > Fix: > > softmmu/memory.c: In function ‘mtree_print_mr’: > softmmu/memory.c:3236:27: warning: declaration of ‘ml’ shadows a previous > local [-Wshadow=compatible-local] >3236 | MemoryRegionList *ml; > | ^~ > softmmu/memory.c:3213:32: note: shadowed declaration is here >3213 | MemoryRegionList *new_ml, *ml, *next_ml; > |^~ > > Signed-off-by: Philippe Mathieu-Daudé Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH 1/2] vmstate: Mark VMStateInfo.get/put() coroutine_mixed_fn
On Tue, Sep 05, 2023 at 04:50:01PM +0200, Kevin Wolf wrote: > Migration code can run both in coroutine context (the usual case) and > non-coroutine context (at least savevm/loadvm for snapshots). This also > affects the VMState callbacks, and devices must consider this. Change > the callback definition in VMStateInfo to be explicit about it. > > Signed-off-by: Kevin Wolf Acked-by: Peter Xu -- Peter Xu
Re: [PATCH 2/7] migration: Clean up local variable shadowing
On Thu, Aug 31, 2023 at 03:25:41PM +0200, Markus Armbruster wrote: > Local variables shadowing other local variables or parameters make the > code needlessly hard to understand. Tracked down with -Wshadow=local. > Clean up: delete inner declarations when they are actually redundant, > else rename variables. > > Signed-off-by: Markus Armbruster Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH 1/7] migration/rdma: Fix save_page method to fail on polling error
On Thu, Aug 31, 2023 at 03:25:40PM +0200, Markus Armbruster wrote: > qemu_rdma_save_page() reports polling error with error_report(), then > succeeds anyway. This is because the variable holding the polling > status *shadows* the variable the function returns. The latter > remains zero. > > Broken since day one, and duplicated more recently. > > Fixes: 2da776db4846 (rdma: core logic) > Fixes: b390afd8c50b (migration/rdma: Fix out of order wrid) > Signed-off-by: Markus Armbruster Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v2 5/5] migration: Deprecate old compression method
On Thu, Jun 22, 2023 at 09:50:19PM +0200, Juan Quintela wrote: > Signed-off-by: Juan Quintela Acked-by: Peter Xu -- Peter Xu
Re: [PATCH v2 1/5] migration: Use proper indentation for migration.json
On Thu, Jun 22, 2023 at 09:50:15PM +0200, Juan Quintela wrote: > We broke it with dirtyrate limit patches. > > Signed-off-by: Juan Quintela Acked-by: Peter Xu -- Peter Xu
Re: [RFC 4/6] migration: Deprecate -incoming
On Fri, Jun 23, 2023 at 09:23:18AM +0100, Daniel P. Berrangé wrote: > On Thu, Jun 22, 2023 at 11:54:43AM -0400, Peter Xu wrote: > > On Thu, Jun 22, 2023 at 10:59:58AM +0100, Daniel P. Berrangé wrote: > > > I've mentioned several times before that the user should never need to > > > set this multifd-channels parameter (nor many other parameters) on the > > > destination in the first place. > > > > > > The QEMU migration stream should be changed to add a full > > > bi-directional handshake, with negotiation of most parameters. > > > IOW, the src QEMU should be configured with 16 channels, and > > > it should connect the primary control channel, and then directly > > > tell the dest that it wants to use 16 multifd channels. > > > > > > If we're expecting the user to pass this info across to the dest > > > manually we've already spectacularly failed wrt user friendliness. > > > > I can try to move the todo even higher. Trying to list the initial goals > > here: > > > > - One extra phase of handshake between src/dst (maybe the time to boost > > QEMU_VM_FILE_VERSION) before anything else happens. > > > > - Dest shouldn't need to apply any cap/param, it should get all from src. > > Dest still need to be setup with an URI and that should be all it needs. > > There are a few that the dest will still need set explicitly. Specifically > the TLS parameters - tls-authz and tls-creds, because those are both > related to --object parameters configured on the dst QEMU. Potentially > there's an argument to be made for the TLS parameters to be part fo the > initial 'migrate' and 'migrate-incoming' command data, as they're > specifically related to the connection establishment, while (most) of > the other params are related to the migration protocol running inside > the connection. Ideally we could even move the tls setup to after the main connection is established, IOW the gnutls handshake could be part of the generic handshake. But yeah, I agree that may involve much more work, so we may start by assuming the v2 handshake just happens on the tls channel as built for now. I think the new protocol should allow extension, so when we want to move the tls handshake into it, the v2 protocol should be able to first detect src/dst binary support for that and switch to it if we want - then we could even get a src qemu migration failure which tells us "dest qemu forgot to set up tls credentials in cmdlines", or reports anything wrong on dest during tls setup. > > A few other parameters are also related to the connection establishment, > most notably the enablement multifd, postcopy and postcopy-pre-empt. As I mentioned in the list, I plan to make this part of the default v2, where the v2 handshake will take care of managing the connections rather than relying on the old code. I'm not sure how complicated it'll be, but the v2 protocol just sounds like a good fit for such a major change in how we set up the channels, and a chance to get everything right from the start. > > I think with those ones we don't need to set them on the src either. > With the new migration handshake we should probably use multifd > codepaths unconditionally, with a single channel. The v2 handshake will be beneficial to !multifd as well. Right now I tend to make it also work for !multifd, e.g., it always makes sense to do a device tree comparison before migration, even if someone uses special tunneling so multifd cannot be enabled for whatever reason, as long as a return path is available so they can talk.
> By matching with the introduction of new protocol, we have a nice point > against which to deprecate the old non-multifd codepaths. We'll need to > keep the non-multifd code around *alot* longer than the normal > deprecation cycle though, as we need mig to/from very old QEMUs. I actually had a feeling that we should always keep it.. I'm not sure whether we must combine a new handshake with "making multifd the default". I do think we can make the !multifd path very simple though, e.g., I'd think we should start considering deprecating things like !multifd+compression etc. earlier than that. > > The enablement of postcopy could be automatic too - src & dst can > both detect if their host OS supports it. That would make all > migrations post-copy capable. The mgmt app just needs to trigger > the switch to post-copy mode *if* they want to use it. Sounds doable. > > Likewise we can just always assume postcopy-pre-empt is available. > > I think 'return-path' becomes another one we can just assume too. Right, the handshake cap (or, with the new QAPI URI replacement) should imply return path already. Thanks, -- Peter Xu
Re: [RFC 4/6] migration: Deprecate -incoming
On Fri, Jun 23, 2023 at 08:17:46AM +0100, Daniel P. Berrangé wrote: > On Thu, Jun 22, 2023 at 03:20:01PM -0400, Peter Xu wrote: > > On Thu, Jun 22, 2023 at 05:33:29PM +0100, Daniel P. Berrangé wrote: > > > On Thu, Jun 22, 2023 at 11:54:43AM -0400, Peter Xu wrote: > > > > I can try to move the todo even higher. Trying to list the initial > > > > goals > > > > here: > > > > > > > > - One extra phase of handshake between src/dst (maybe the time to boost > > > > QEMU_VM_FILE_VERSION) before anything else happens. > > > > > > > > - Dest shouldn't need to apply any cap/param, it should get all from > > > > src. > > > > Dest still need to be setup with an URI and that should be all it > > > > needs. > > > > > > > > - Src shouldn't need to worry on the binary version of dst anymore as > > > > long > > > > as dest qemu supports handshake, because src can fetch it from dest. > > > > > > I'm not sure that works in general. Even if we have a handshake and > > > bi-directional comms for live migration, we still haave the save/restore > > > to file codepath to deal with. The dst QEMU doesn't exist at the time > > > the save process is done, so we can't add logic to VMSate handling that > > > assumes knowledge of the dst version at time of serialization. > > > > My current thought was still based on a new cap or anything the user would > > need to specify first on both sides (but hopefully the last cap to set on > > dest). > > > > E.g. if with a new handshake cap we shouldn't set it on a exec: or file: > > protocol migration, and it should just fail on qmp_migrate() telling that > > the URI is not supported if the cap is set. Return path is definitely > > required here. > > exec can support bi-directional migration - we have both stdin + stdout > for the command. For exec it is mostly a documentation problem - you > can't merely use 'cat' for example, but if you used 'socat' that could > be made to work bi-directionally. Okay. Just an example that the handshake just cannot work for all the cases, and it should always be able to fail. So when exec doesn't properly provide return path, I think with handshake=on we should get a timeout of not getting response properly and fail the migration after the timeout, then. There're a bunch of implications and details that need to be investigated around such a handshake if it'll be proposed for real, so I'm not yet sure whether there's something that may be surprising. For channeltypes it seems all fine for now. Hopefully nothing obvious overlooked. > > > > I don't think its possible for QEMU to validate that it has a fully > > > bi-directional channel, without adding timeouts to its detection which I > > > think we should strive to avoid. > > > > > > I don't think we actually need self-bootstrapping anyway. > > > > > > I think the mgmt app can just indicate the new v2 bi-directional > > > protocol when issuing the 'migrate' and 'migrate-incoming' > > > commands. This becomes trivial when Het's refactoring of the > > > migrate address QAPI is accepted: > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg04851.html > > > > > > eg: > > > > > > { "execute": "migrate", > > > "arguments": { > > > "channels": [ { "channeltype": "main", > > > "addr": { "transport": "socket", "type": "inet", > > > "host": "10.12.34.9", > > > "port": "1050" } } ] } } > > > > > > note the 'channeltype' parameter here. 
If we declare the 'main' > > > refers to the existing migration protocol, then we merely need > > > to define a new 'channeltype' to use as an indicator for the > > > v2 migration handshake protocol. > > > > Using a new channeltype would also work at least on src qemu, but I'm not > > sure on how dest qemu would know that it needs a handshake in that case, > > because it knows nothing until the connection is established. > > In Het's series the 'migrate_incoming' command similarly has a chaneltype > parameter. Oh, yeah then that'll just work. Thanks. -- Peter Xu
Re: [RFC 4/6] migration: Deprecate -incoming
On Thu, Jun 22, 2023 at 05:33:29PM +0100, Daniel P. Berrangé wrote: > On Thu, Jun 22, 2023 at 11:54:43AM -0400, Peter Xu wrote: > > I can try to move the todo even higher. Trying to list the initial goals > > here: > > > > - One extra phase of handshake between src/dst (maybe the time to boost > > QEMU_VM_FILE_VERSION) before anything else happens. > > > > - Dest shouldn't need to apply any cap/param, it should get all from src. > > Dest still need to be setup with an URI and that should be all it needs. > > > > - Src shouldn't need to worry on the binary version of dst anymore as long > > as dest qemu supports handshake, because src can fetch it from dest. > > I'm not sure that works in general. Even if we have a handshake and > bi-directional comms for live migration, we still haave the save/restore > to file codepath to deal with. The dst QEMU doesn't exist at the time > the save process is done, so we can't add logic to VMSate handling that > assumes knowledge of the dst version at time of serialization. My current thought was still based on a new cap or anything the user would need to specify first on both sides (but hopefully the last cap to set on dest). E.g. if with a new handshake cap we shouldn't set it on a exec: or file: protocol migration, and it should just fail on qmp_migrate() telling that the URI is not supported if the cap is set. Return path is definitely required here. > > > - Handshake can always fail gracefully if anything wrong happened, it > > normally should mean dest qemu is not compatible with src's setup (either > > machine, device, or migration configs) for whatever reason. Src should > > be able to get a solid error from dest if so. > > > > - Handshake protocol should always be self-bootstrap-able, it means when we > > change the handshake protocol it should always works with old binaries. > > > > - When src is newer it should be able to know what's missing on dest and > > skip the new bits. > > > > - When dst is newer it should all rely on src (which is older) and it > > should always understand src's language. > > I'm not convinced it can reliably self-bootstrap in a backwards > compatible manner, precisely because the current migration stream > has no handshake and only requires a unidirectional channel. Yes, please see above. I meant when we grow the handshake protocol we should make sure we don't need anything new to be setup either on src/dst of qemu. It won't apply to before-handshake binaries. > I don't think its possible for QEMU to validate that it has a fully > bi-directional channel, without adding timeouts to its detection which I > think we should strive to avoid. > > I don't think we actually need self-bootstrapping anyway. > > I think the mgmt app can just indicate the new v2 bi-directional > protocol when issuing the 'migrate' and 'migrate-incoming' > commands. This becomes trivial when Het's refactoring of the > migrate address QAPI is accepted: > > https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg04851.html > > eg: > > { "execute": "migrate", > "arguments": { > "channels": [ { "channeltype": "main", > "addr": { "transport": "socket", "type": "inet", >"host": "10.12.34.9", > "port": "1050" } } ] } } > > note the 'channeltype' parameter here. If we declare the 'main' > refers to the existing migration protocol, then we merely need > to define a new 'channeltype' to use as an indicator for the > v2 migration handshake protocol. 
Using a new channeltype would also work at least on src qemu, but I'm not sure on how dest qemu would know that it needs a handshake in that case, because it knows nothing until the connection is established. Maybe we still need QEMU_VM_FILE_VERSION to be boosted at least in this case, so dest can read this at the very beginning, old binaries will fail immediately, new binaries will start to talk with v2 language. > > > - All !main channels need to be established later than the handshake - if > > we're going to do this anyway we probably should do it altogether to make > > channels named, so each channel used in migration needs to have a common > > header. Prepare to deprecate the old tricks of channel orderings. > > Once the primary channel involves a bi-directional handshake, >
Re: [RFC 4/6] migration: Deprecate -incoming
On Thu, Jun 22, 2023 at 10:59:58AM +0100, Daniel P. Berrangé wrote: > On Thu, Jun 22, 2023 at 10:52:12AM +0200, Juan Quintela wrote: > > Paolo Bonzini wrote: > > > On 6/12/23 22:51, Juan Quintela wrote: > > >>> Shall we just leave it there? Or is deprecating it helps us in any > > >>> form? > > >> See the patches two weeks ago when people complained that lisen(.., num) > > >> was too low. And there are other parameters that work the same way > > >> (that I convenientely had forgotten). So the easiest way to get things > > >> right is to use "defer" always. Using -incoming "uri" should only be > > >> for people that "know what they are doing", so we had to ways to do it: > > >> - review all migration options and see which ones work without defer > > >>and document it > > >> - deprecate everything that is not defer. > > > > > > "-incoming " is literally the same as running "migrate-incoming" > > > as the first thing on the monitor: > > > > > > if (incoming) { > > > Error *local_err = NULL; > > > if (strcmp(incoming, "defer") != 0) { > > > qmp_migrate_incoming(incoming, &local_err); > > > if (local_err) { > > > error_reportf_err(local_err, "-incoming %s: ", incoming); > > > exit(1); > > > } > > > } > > > } else if (autostart) { > > > qmp_cont(NULL); > > > } > > > > > > It's the only piece of code which distinguishes "-incoming defer" from > > > "-incoming ". > > > > > > So I'm not sure what the problem would be with keeping it? > > > > User friendliness. > > > > First of all, I use it all the time. And I know that it is useful for > > developers. I was the one asking peter to implement -global > > migration.foo to be able to test multifd with it. > > > > The problem is that if you use more than two channels with multifd, on > > the incoming side, you need to do: > > > > - migrate_set_parameter multifd-channels 16 > > - migrate_incoming > > > > And people continue to do: > > > > - qemu -incoming > > - migrate_set_parameter multifd-channels 16 (on the command line) > > > > And they complain that it fails, because we are calling listen with the > > wrong value. > > IMHO if we want to improve user friendliness then arguing about use > of the CLI vs QMP for migration is completely missing the bigger > picture IMHO. > > I've mentioned several times before that the user should never need to > set this multifd-channels parameter (nor many other parameters) on the > destination in the first place. > > The QEMU migration stream should be changed to add a full > bi-directional handshake, with negotiation of most parameters. > IOW, the src QEMU should be configured with 16 channels, and > it should connect the primary control channel, and then directly > tell the dest that it wants to use 16 multifd channels. > > If we're expecting the user to pass this info across to the dest > manually we've already spectacularly failed wrt user friendliness. I can try to move the todo even higher. Trying to list the initial goals here: - One extra phase of handshake between src/dst (maybe the time to boost QEMU_VM_FILE_VERSION) before anything else happens. - Dest shouldn't need to apply any cap/param, it should get all from src. Dest still need to be setup with an URI and that should be all it needs. - Src shouldn't need to worry on the binary version of dst anymore as long as dest qemu supports handshake, because src can fetch it from dest. 
- Handshake can always fail gracefully if anything goes wrong; it normally should mean dest qemu is not compatible with src's setup (either machine, device, or migration configs) for whatever reason. Src should be able to get a solid error from dest if so. - Handshake protocol should always be self-bootstrap-able, meaning that when we change the handshake protocol it should always work with old binaries. - When src is newer it should be able to know what's missing on dest and skip the new bits. - When dst is newer it should all rely on src (which is older) and it should always understand src's language. - All !main channels need to be established later than the handshake - if we're going to do this anyway, we probably should go all the way and make the channels named, so each channel used in migration needs to have a common header (see the sketch below). Prepare to deprecate the old tricks of channel orderings. Does it look reasonable? Anything to add/remove? Thanks, -- Peter Xu
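Purely as an illustration of the "common header" and self-bootstrapping points above (this is not a proposed wire format, and the field names are made up): each channel could start with a small fixed header that old binaries reject cleanly and new binaries use to negotiate.

typedef struct QEMU_PACKED {
    uint32_t magic;         /* identifies a v2 migration channel            */
    uint32_t version;       /* handshake protocol version                   */
    uint32_t channel_type;  /* main, multifd, postcopy-preempt, ...         */
    uint32_t payload_len;   /* length of the negotiation payload to follow  */
} MigChannelHeader;

The main channel's payload would then carry the src caps/params for dest to apply, and fields unknown to one side can be skipped by length, so older and newer binaries keep understanding each other.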
Re: [RFC 4/6] migration: Deprecate -incoming
On Thu, Jun 22, 2023 at 11:22:56AM +0200, Thomas Huth wrote: > Then simply forbid "migrate_set_parameter multifd-channels ..." if the uri > has been specified on the command line? Yeah, actually already in a pull (even though the pr may need a new one..): https://lore.kernel.org/r/20230622021320.66124-23-quint...@redhat.com -- Peter Xu
Re: [RFC 4/6] migration: Deprecate -incoming
On Thu, Jun 22, 2023 at 12:01:55PM +0200, Juan Quintela wrote: > Paolo Bonzini wrote: > > On 6/22/23 10:52, Juan Quintela wrote: > >> User friendliness. > >> The problem is that if you use more than two channels with multifd, on > >> the incoming side, you need to do: > > > > You're sacrificing user-friendliness for the 99.99% that don't use > > multifd, for an error (i.e. it's not even fixing the issue) for the > > 0.01% that use multifd. That's not user-friendly. > > You are forgeting of the 0.01% that uses postocopy preempt (that is easy > just changing the default value to 2), and the 0.001% that uses > compression (they have exactly the same problem with compress_threads, > compress_zlib, etc). > > >> - migrate_set_parameter multifd-channels 16 > >> - migrate_incoming > >> > >>> The issue is not how many features the command line has, but how > >>> they're implemented. > >> Or if they are confusing for the user? > > > > Anyone using multifd is not a typical user anyway. > > >>> If they're just QMP wrappers and as such they're self-contained in > >>> softmmu/vl.c, that's fine. > >>> > >>> In fact, even for parameters, we could use keyval to parse "-incoming" > >> What is keyval? > > > > util/keyval.c and include/qemu/keyval.h. It parses a list of > > key=value pairs into a QDict. Once you have removed the "source" key > > from the QDict you can use a visitor to parse the rest into a > > MigrateSetParameters. See the handling of QEMU_OPTION_audio, it could > > be something like > > > > > > case QEMU_OPTION_incoing: { > > Visitor *v; > > MigrateSetParameters *incoming_params = NULL; > > QDict *dict = keyval_parse(optarg, "source", NULL, > > &error_fatal); > > > > if (incoming) { > > if (qdict_haskey(dict, "source")) { > > error_setg(&error_fatal, "Parameter 'source' > > is duplicate"); > > } > > } else { > > if (!qdict_haskey(dict, "source")) { > > error_setg(&error_fatal, "Parameter 'source' > > is missing"); > > } > > runstate_set(RUN_STATE_INMIGRATE); > > incoming = g_strdup(qdict_get_str(dict, "source")); > > qdict_del(dict, "source"); > > } > > > > v = qobject_input_visitor_new_keyval(QOBJECT(dict)); > > qobject_unref(dict); > > visit_type_MigrateSetParameters(v, NULL, > > &incoming_params, &error_fatal); > > visit_free(v); > > qmp_migration_set_parameters(incoming_params, PS: we may want to postpone this to be later than migration_object_init(), when/if there's a real patch. > > &error_fatal); > > qapi_free_MigrateSetParameters(incoming_params); > > } > > > > > > For example "-incoming [source=]tcp:foo,multifd-channels=16" would > > desugar to > > > > migrate_set_parameter multifd-channels 16 > > migrate_incoming tcp:foo > > > > The only incompatibility is for people who are using "," in an URI, > > which is rare and only an issue for the "exec" protocol. If we worry on breaking anyone, we can apply the keyval parsing only when !exec. Not sure whether it matters a huge lot.. > > Aha, that makes sense. And will allow us to deprecate/remove the > --global migration.* stuff. We may still need a way to set the caps/params for src qemu?.. Thanks, -- Peter Xu
Re: [PATCH v2 06/20] qemu_file: total_transferred is not used anymore
On Thu, Jun 22, 2023 at 01:05:49AM +0200, Juan Quintela wrote: > Peter Xu wrote: > > On Tue, May 30, 2023 at 08:39:27PM +0200, Juan Quintela wrote: > >> Signed-off-by: Juan Quintela > >> --- > >> migration/qemu-file.c | 4 > >> 1 file changed, 4 deletions(-) > >> > >> diff --git a/migration/qemu-file.c b/migration/qemu-file.c > >> index eb0497e532..6b6deea19b 100644 > >> --- a/migration/qemu-file.c > >> +++ b/migration/qemu-file.c > >> @@ -41,9 +41,6 @@ struct QEMUFile { > >> QIOChannel *ioc; > >> bool is_writable; > >> > >> -/* The sum of bytes transferred on the wire */ > >> -uint64_t total_transferred; > >> - > >> int buf_index; > >> int buf_size; /* 0 when writing */ > >> uint8_t buf[IO_BUF_SIZE]; > >> @@ -287,7 +284,6 @@ void qemu_fflush(QEMUFile *f) > >> qemu_file_set_error_obj(f, -EIO, local_error); > >> } else { > >> uint64_t size = iov_size(f->iov, f->iovcnt); > >> -f->total_transferred += size; > > > > I think this patch is another example why I think sometimes the way patch > > is split are pretty much adding more complexity on review... > > It depends of taste. > > You are doing one thing in way1. > Then you find a better way to do it, lets call it way2. > > Now we have two options to see how we arrived there. > > a- You got any declarations/definition/initializations for way2 > b- You write way2 alongside way1 > c- You test that both ways give the same result, and you see that they >give the same result. > d- you remove the way1. > > Or you squash the four patches in a single patch. But then the reviewer > lost the place where one can see why it is the same than the old one. For this patch to remove total_transferred, IMHO as a reviewer it'll be easier to me if it's put in the same patch where it got replaced. It might be different if we're going to remove a large chunk of code, but for this patch it's a few lines of change. > > Sometimes is better the longer way, sometimes is better the short one. > > Clearly we don't agree about what is the best way in this case. > > > Here we removed a variable operation but it seems all fine if it's not used > > anywhere. But it also means current code base (before this patch applied) > > doesn't make sense already because it contains this useless addition. So > > IMHO it means some previous patch does it just wrong. > > No. It is how it is developed. And being respectful with the > reviewer. Given it enough information to do a proper review. I never doubted about that. I trust that you were trying to provide the best for a reviewer and I appreciate that. > > During the development of this series, there were lots of: > > if (old_counter != new_counter) >printf(""); I assume you meant when debugging? I may not have read all the patches; I hope we just don't do that if possible in any real git commit. > > traces were in the several thousand lines long. If I have to review > that change, I would love any help that writer can give me. That is why > it is done this way. Yeah, I think it's probably that even from reviewers' perspective the need can be different from individuals. > > > I think it means it's caused by a wrong split of patches, then each patch > > stops to make much sense as a standalone one. > > It stops making sense if you want each feature to be a single patch. > Before the patch no feature. After the patch full feature. That brings > us to very long patches. > > What is easier to review (to do the same) > > a - 1 x 1000 lines patch > b - 10 x 100 lines patch > > I will go with b any time. Except if the split is arbitrary. 
AFAIU, this is a different thing. I'm never against splitting patch, but about how to split. I was never asking for a 1000 LOC patch, right? :) > > > I can go back and try to find whatever patch on the list that will explain > > this. But it'll also go into git log. Anyone reads this later will be > > confused once again. Even harder for them to figure out what > > happened. > > As said before, I completely disagree here. And what is worse. If it > gets wrong, with your approach git bisect will not help as much than > with my appreach. > > > Do you think we could reorganize the patches so each of a single patch > > explains itself? > > No. See before. We go for a very spaguetti code to a much less > spaguety code. >
Re: [RFC 4/6] migration: Deprecate -incoming
On Tue, Jun 20, 2023 at 01:10:55PM +0100, Daniel P. Berrangé wrote: > In some cases it is worth having a convenience option for user friendliness. > > In this case, however, users are already needing to use QMP/HMP on the > source side to set migration parameters. I think it is reasonable to say > that doing *exactly* the same with QMP/HMP on the destination side is > actually more friendly than requiring use of -global on the dest side > which is different syntax. The -global parameters also work for the src side for my scripts, so I only need to attach "-incoming tcp:XXX" to the dst cmdline here. > > We don't document the way to use -global with migration parameters so > when people see '-incoming' I think we are steering them into a trap, > making it look like -incoming is sufficient on its own. Hence that user's > mistake recently about not setting migration parameters. > > Overall I agree with deprecating '-incoming' != 'defer', as I think it > will better guide users to following the correct steps, as is not a > usability burden as they're already using QMP/HMP on the source side. I'd say -global is definitely not for users but for developers. Just like we keep around HMP - I'm not sure whether it's used in production, I thought it was only for developers and we don't deprecate it eagerly. No strong opinions here if everyone wants to get rid of that, but before that I hope we can have some kind of cmdline tool that can easily tunnel qmp commands, so developers using pure cmdlines can switch to the new way easily. Thanks, -- Peter Xu
Re: [PATCH v2 19/20] qemu-file: Simplify qemu_file_shutdown()
On Tue, May 30, 2023 at 08:39:40PM +0200, Juan Quintela wrote: > Signed-off-by: Juan Quintela Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v2 20/20] qemu-file: Make qemu_file_get_error_obj() static
On Tue, May 30, 2023 at 08:39:41PM +0200, Juan Quintela wrote: > It was not used outside of qemu_file.c anyways. > > Signed-off-by: Juan Quintela Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v2 18/20] qemu_file: Make qemu_file_is_writable() static
On Tue, May 30, 2023 at 08:39:39PM +0200, Juan Quintela wrote: > It is not used outside of qemu_file, and it shouldn't. > > Signed-off-by: Juan Quintela Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v2 16/20] migration/rdma: Split qemu_fopen_rdma() into input/output functions
On Tue, May 30, 2023 at 08:39:37PM +0200, Juan Quintela wrote: > This is how everything else in QEMUFile is structured. > As a bonus they are three less lines of code. > > Signed-off-by: Juan Quintela > --- > migration/rdma.c | 35 --- > 1 file changed, 16 insertions(+), 19 deletions(-) > > diff --git a/migration/rdma.c b/migration/rdma.c > index 3ef35fc635..606837bd45 100644 > --- a/migration/rdma.c > +++ b/migration/rdma.c > @@ -4046,25 +4046,22 @@ static void qio_channel_rdma_register_types(void) > > type_init(qio_channel_rdma_register_types); > > -static QEMUFile *qemu_fopen_rdma(RDMAContext *rdma, const char *mode) > +static QEMUFile *rdma_new_input(RDMAContext *rdma) > { > -QIOChannelRDMA *rioc; > +QIOChannelRDMA *rioc = > QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA)); > +rioc->file = qemu_file_new_input(QIO_CHANNEL(rioc)); > +rioc->rdmain = rdma; > +rioc->rdmaout = rdma->return_path; > > -if (qemu_file_mode_is_not_valid(mode)) { We can also drop this function together. If with that: Reviewed-by: Peter Xu > -return NULL; > -} > +return rioc->file; > +} > > -rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA)); > - > -if (mode[0] == 'w') { > -rioc->file = qemu_file_new_output(QIO_CHANNEL(rioc)); > -rioc->rdmaout = rdma; > -rioc->rdmain = rdma->return_path; > -} else { > -rioc->file = qemu_file_new_input(QIO_CHANNEL(rioc)); > -rioc->rdmain = rdma; > -rioc->rdmaout = rdma->return_path; > -} > +static QEMUFile *rdma_new_output(RDMAContext *rdma) > +{ > +QIOChannelRDMA *rioc = > QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA)); > +rioc->file = qemu_file_new_output(QIO_CHANNEL(rioc)); > +rioc->rdmaout = rdma; > +rioc->rdmain = rdma->return_path; > > return rioc->file; > } > @@ -4090,9 +4087,9 @@ static void rdma_accept_incoming_migration(void *opaque) > return; > } > > -f = qemu_fopen_rdma(rdma, "rb"); > +f = rdma_new_input(rdma); > if (f == NULL) { > -fprintf(stderr, "RDMA ERROR: could not qemu_fopen_rdma\n"); > +fprintf(stderr, "RDMA ERROR: could not open RDMA for input\n"); > qemu_rdma_cleanup(rdma); > return; > } > @@ -4217,7 +4214,7 @@ void rdma_start_outgoing_migration(void *opaque, > trace_rdma_start_outgoing_migration_after_rdma_connect(); > > s->rdma_migration = true; > -s->to_dst_file = qemu_fopen_rdma(rdma, "wb"); > +s->to_dst_file = rdma_new_output(rdma); > migrate_fd_connect(s, NULL); > return; > return_path_err: > -- > 2.40.1 > -- Peter Xu
Re: [PATCH v2 06/20] qemu_file: total_transferred is not used anymore
On Tue, May 30, 2023 at 08:39:27PM +0200, Juan Quintela wrote: > Signed-off-by: Juan Quintela > --- > migration/qemu-file.c | 4 > 1 file changed, 4 deletions(-) > > diff --git a/migration/qemu-file.c b/migration/qemu-file.c > index eb0497e532..6b6deea19b 100644 > --- a/migration/qemu-file.c > +++ b/migration/qemu-file.c > @@ -41,9 +41,6 @@ struct QEMUFile { > QIOChannel *ioc; > bool is_writable; > > -/* The sum of bytes transferred on the wire */ > -uint64_t total_transferred; > - > int buf_index; > int buf_size; /* 0 when writing */ > uint8_t buf[IO_BUF_SIZE]; > @@ -287,7 +284,6 @@ void qemu_fflush(QEMUFile *f) > qemu_file_set_error_obj(f, -EIO, local_error); > } else { > uint64_t size = iov_size(f->iov, f->iovcnt); > -f->total_transferred += size; I think this patch is another example of why I think the way a patch series is split can add complexity to review... Here we removed a variable operation, which seems all fine because it's not used anywhere. But it also means the current code base (before this patch is applied) already doesn't make sense, because it contains this useless addition. So IMHO some previous patch got it wrong. I think it's caused by a wrong split of patches, so each patch stops making much sense as a standalone one. I can go back and try to find whatever patch on the list will explain this. But it'll also go into the git log. Anyone who reads this later will be confused once again, and it'll be even harder for them to figure out what happened. Do you think we could reorganize the patches so each single patch explains itself? The other thing is the priority of patches - I still have ~80 patches pending review on migration alone.. Do you think it makes sense to pick up the important ones first and merge them with higher priority? What I have in mind are: - The regression you mentioned on qemu_fflush() during ram save (I hope I understand the problem right, though...). - The slowness of migration-test. I'm not sure whether you have anything better than either Dan's approach or the switchover-hold flags; I just think that seems more important to resolve for us upstream. We can merge Dan's or mine, or you can propose something else, but IMHO that seems to be a higher priority? And whatever else I haven't noticed. I'll continue reading, but I'm sure you know this best.. so I'd really rely on you. What do you think? Thanks, -- Peter Xu
Re: [PATCH v2 04/20] qemu-file: We only call qemu_file_transferred_* on the sending side
On Tue, Jun 13, 2023 at 06:02:05PM +0200, Juan Quintela wrote: > Peter Xu wrote: > > On Tue, May 30, 2023 at 08:39:25PM +0200, Juan Quintela wrote: > >> Remove the increase in qemu_file_fill_buffer() and add asserts to > >> qemu_file_transferred* functions. > >> > >> Signed-off-by: Juan Quintela > > > > The read side accounting does look a bit weird and never caught my notice.. > > > > Maybe worth also touching the document of QEMUFile::total_transferred to > > clarify what it accounts? > > > > Reviewed-by: Peter Xu > > > > Though when I'm looking at the counters (didn't follow every single recent > > patch on this..), I found that now reading transferred value is actually > > more expensive - qemu_file_transferred() needs flushing, even if for the > > fast version, qemu_file_transferred_fast() loops over all possible iovs, > > which can be as large as MAX_IOV_SIZE==64. > > > > To be explicit, I _think_ for each guest page we now need to flush... > > > > ram_save_iterate > > migration_rate_exceeded > > migration_transferred_bytes > > qemu_file_transferred > > > > I hope I'm wrong.. > > See patch 7: > > diff --git a/migration/migration-stats.c b/migration/migration-stats.c > index 79eea8d865..1696185694 100644 > --- a/migration/migration-stats.c > +++ b/migration/migration-stats.c > @@ -62,7 +62,7 @@ uint64_t migration_transferred_bytes(QEMUFile *f) > { > uint64_t multifd = stat64_get(&mig_stats.multifd_bytes); > uint64_t rdma = stat64_get(&mig_stats.rdma_bytes); > -uint64_t qemu_file = qemu_file_transferred(f); > +uint64_t qemu_file = stat64_get(&mig_stats.qemu_file_transferred); > > trace_migration_transferred_bytes(qemu_file, multifd, rdma); > return qemu_file + multifd + rdma; If this is a known regression, should we make a first patchset fix it and make it higher priority to merge? It seems this is even not mentioned in the cover letter.. while IMHO this is the most important bit to have in it.. > > > > Does it mean that perhaps we simply need "sent and put into send buffer" > > more than "what really got transferred"? So I start to wonder what's the > > origianl purpose of this change, and which one is better.. > > That is basically what patch 5 and 6 do O:-) > > Problem is arriving to something that is bisectable (for correctness) > and is easy to review. > > And yes, my choices can be different from the ones tat you do. > > The other reason for the small patches is that: > a - sometimes I found a different test where things broke, and have to > bisect > b - small patches are much easier to rebase (that I am doing a lot) That's okay. Thanks, -- Peter Xu
Re: [RFC 4/6] migration: Deprecate -incoming
On Mon, Jun 12, 2023 at 10:51:08PM +0200, Juan Quintela wrote: > Peter Xu wrote: > > On Mon, Jun 12, 2023 at 09:33:42PM +0200, Juan Quintela wrote: > >> Only "defer" is recommended. After setting all migation parameters, > >> start incoming migration with "migrate-incoming uri" command. > >> > >> Signed-off-by: Juan Quintela > >> --- > >> docs/about/deprecated.rst | 7 +++ > >> softmmu/vl.c | 2 ++ > >> 2 files changed, 9 insertions(+) > >> > >> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst > >> index 47e98dc95e..518672722d 100644 > >> --- a/docs/about/deprecated.rst > >> +++ b/docs/about/deprecated.rst > >> @@ -447,3 +447,10 @@ The new way to modify migration is using migration > >> parameters. > >> ``blk`` functionality can be acchieved using > >> ``migrate_set_parameter block-incremental true``. > >> > >> +``-incoming uri`` (since 8.1) > >> +''''''''''''''''''''''''''''' > >> + > >> +Everything except ``-incoming defer`` are deprecated. This allows to > >> +setup parameters before launching the proper migration with > >> +``migrate-incoming uri``. > >> + > >> diff --git a/softmmu/vl.c b/softmmu/vl.c > >> index b0b96f67fa..7fe865ab59 100644 > >> --- a/softmmu/vl.c > >> +++ b/softmmu/vl.c > >> @@ -2651,6 +2651,8 @@ void qmp_x_exit_preconfig(Error **errp) > >> if (incoming) { > >> Error *local_err = NULL; > >> if (strcmp(incoming, "defer") != 0) { > >> +warn_report("-incoming %s is deprecated, use -incoming defer > >> and " > >> +" set the uri with migrate-incoming.", incoming); > > > > I still use uri for all my scripts, alongside with "-global migration.xxx" > > and it works. > > You know what you are doing (TM). > And remember that we don't support -gobal migration.x-foo. > Yes, I know, we should drop the "x-" prefixes. I hope they'll always be there. :) They're pretty handy for tests, when we want to boot a VM without the need to script the sequences of qmp cmds. Yes, we probably should just always drop the x-. We can always declare debugging purpose for all -global migration.* fields. > > > Shall we just leave it there? Or is deprecating it helps us in any form? > > See the patches two weeks ago when people complained that lisen(.., num) > was too low. And there are other parameters that work the same way > (that I convenientely had forgotten). So the easiest way to get things > right is to use "defer" always. Using -incoming "uri" should only be > for people that "know what they are doing", so we had to ways to do it: > - review all migration options and see which ones work without defer > and document it > - deprecate everything that is not defer. > > Anything else is not going to be very user unfriendly. > What do you think. IIRC Wei Wang had a series just for that, so after that patchset applied we should have fixed all issues cleanly? Is there one more thing that's not working right there? > > Later, Juan. > > PD. This series are RFC for multiple reasons O:-) Happy to know the rest (besides which I know will break my script :). -- Peter Xu
Re: [RFC 4/6] migration: Deprecate -incoming
On Mon, Jun 12, 2023 at 09:33:42PM +0200, Juan Quintela wrote: > Only "defer" is recommended. After setting all migation parameters, > start incoming migration with "migrate-incoming uri" command. > > Signed-off-by: Juan Quintela > --- > docs/about/deprecated.rst | 7 +++ > softmmu/vl.c | 2 ++ > 2 files changed, 9 insertions(+) > > diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst > index 47e98dc95e..518672722d 100644 > --- a/docs/about/deprecated.rst > +++ b/docs/about/deprecated.rst > @@ -447,3 +447,10 @@ The new way to modify migration is using migration > parameters. > ``blk`` functionality can be acchieved using > ``migrate_set_parameter block-incremental true``. > > +``-incoming uri`` (since 8.1) > +''''''''''''''''''''''''''''' > + > +Everything except ``-incoming defer`` are deprecated. This allows to > +setup parameters before launching the proper migration with > +``migrate-incoming uri``. > + > diff --git a/softmmu/vl.c b/softmmu/vl.c > index b0b96f67fa..7fe865ab59 100644 > --- a/softmmu/vl.c > +++ b/softmmu/vl.c > @@ -2651,6 +2651,8 @@ void qmp_x_exit_preconfig(Error **errp) > if (incoming) { > Error *local_err = NULL; > if (strcmp(incoming, "defer") != 0) { > +warn_report("-incoming %s is deprecated, use -incoming defer and > " > +" set the uri with migrate-incoming.", incoming); I still use uri for all my scripts, alongside with "-global migration.xxx" and it works. Shall we just leave it there? Or is deprecating it helps us in any form? -- Peter Xu
Re: [PATCH v2 05/20] qemu_file: Use a stat64 for qemu_file_transferred
On Tue, May 30, 2023 at 08:39:26PM +0200, Juan Quintela wrote: > This way we can read it from any thread. > I checked that it gives the same value than the current one. We never > use to qemu_files at the same time. > > Signed-off-by: Juan Quintela The follow up patch may be better to be squashed or it's very confusing.. Why do we need to convert mostly everything into atomics? Is it modified outside migration thread? AFAIR atomic ops are still expensive, and will get more expensive on larger hosts, IOW it'll be good for us to keep non-atomic when possible (and that's why when I changed some counters in postcopy work I only changed the limited set because then the rest are still accessed in 1 single thread and keep running fast). > --- > migration/migration-stats.h | 4 > migration/qemu-file.c | 5 +++-- > 2 files changed, 7 insertions(+), 2 deletions(-) > > diff --git a/migration/migration-stats.h b/migration/migration-stats.h > index 2358caad63..b7795e7914 100644 > --- a/migration/migration-stats.h > +++ b/migration/migration-stats.h > @@ -81,6 +81,10 @@ typedef struct { > * Number of bytes sent during precopy stage. > */ > Stat64 precopy_bytes; > +/* > + * Number of bytes transferred with QEMUFile. > + */ > +Stat64 qemu_file_transferred; > /* > * Amount of transferred data at the start of current cycle. > */ > diff --git a/migration/qemu-file.c b/migration/qemu-file.c > index be3dab85cb..eb0497e532 100644 > --- a/migration/qemu-file.c > +++ b/migration/qemu-file.c > @@ -288,6 +288,7 @@ void qemu_fflush(QEMUFile *f) > } else { > uint64_t size = iov_size(f->iov, f->iovcnt); > f->total_transferred += size; > +stat64_add(&mig_stats.qemu_file_transferred, size); > } > > qemu_iovec_release_ram(f); > @@ -628,7 +629,7 @@ int coroutine_mixed_fn qemu_get_byte(QEMUFile *f) > > uint64_t qemu_file_transferred_noflush(QEMUFile *f) > { > -uint64_t ret = f->total_transferred; > +uint64_t ret = stat64_get(&mig_stats.qemu_file_transferred); > int i; > > g_assert(qemu_file_is_writable(f)); > @@ -644,7 +645,7 @@ uint64_t qemu_file_transferred(QEMUFile *f) > { > g_assert(qemu_file_is_writable(f)); > qemu_fflush(f); > -return f->total_transferred; > +return stat64_get(&mig_stats.qemu_file_transferred); > } > > void qemu_put_be16(QEMUFile *f, unsigned int v) > -- > 2.40.1 > -- Peter Xu
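For reference, a minimal, self-contained sketch of the kind of counter being discussed (this only illustrates the Stat64 idea with plain C11 atomics; it is not QEMU's actual Stat64, which also carries a fallback path for hosts without native 64-bit atomics):

    #include <stdatomic.h>
    #include <stdint.h>

    /* Illustrative stand-in for Stat64: safe to update from any thread,
     * but every update is an atomic read-modify-write, which costs more
     * than a plain "+=" and scales worse on large hosts under contention. */
    typedef struct {
        _Atomic uint64_t value;
    } CounterStat64;

    static inline void counter_add(CounterStat64 *s, uint64_t v)
    {
        atomic_fetch_add_explicit(&s->value, v, memory_order_relaxed);
    }

    static inline uint64_t counter_get(CounterStat64 *s)
    {
        return atomic_load_explicit(&s->value, memory_order_relaxed);
    }

That is the trade-off raised above: the atomic form can be read from any thread, while counters that stay private to the migration thread can remain plain integers and stay cheap.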
Re: [PATCH v2 04/20] qemu-file: We only call qemu_file_transferred_* on the sending side
On Tue, May 30, 2023 at 08:39:25PM +0200, Juan Quintela wrote: > Remove the increase in qemu_file_fill_buffer() and add asserts to > qemu_file_transferred* functions. > > Signed-off-by: Juan Quintela The read side accounting does look a bit weird and never caught my attention.. Maybe worth also touching the documentation of QEMUFile::total_transferred to clarify what it accounts for? Reviewed-by: Peter Xu Though when I'm looking at the counters (didn't follow every single recent patch on this..), I found that reading the transferred value is now actually more expensive - qemu_file_transferred() needs flushing, and even the fast version, qemu_file_transferred_fast(), loops over all possible iovs, which can be as many as MAX_IOV_SIZE==64. To be explicit, I _think_ for each guest page we now need to flush... ram_save_iterate -> migration_rate_exceeded -> migration_transferred_bytes -> qemu_file_transferred I hope I'm wrong.. Does it mean that perhaps we simply need "sent and put into send buffer" more than "what really got transferred"? So I start to wonder what the original purpose of this change was, and which one is better.. -- Peter Xu
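The direction the series eventually takes (see the stat64 discussion elsewhere in this thread) is roughly the following; a sketch only, with names borrowed from the series and the surrounding fields assumed rather than copied from the real code:

    /* Sketch: qemu_fflush() adds the flushed size to an atomic counter,
     * so asking "how much has been transferred" becomes a few atomic
     * loads instead of forcing a flush for every guest page. */
    static uint64_t migration_transferred_bytes(void)
    {
        return stat64_get(&mig_stats.qemu_file_transferred) +
               stat64_get(&mig_stats.multifd_bytes) +
               stat64_get(&mig_stats.rdma_bytes);
    }

Whether that counts "put into the send buffer" or "actually on the wire" then depends only on where the stat64_add() calls are placed, which is exactly the question asked above.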
Re: [PATCH 21/21] migration/multifd: Compute transferred bytes correctly
On Thu, May 18, 2023 at 06:40:18PM +0200, Juan Quintela wrote: > Peter Xu wrote: > > On Mon, May 08, 2023 at 03:09:09PM +0200, Juan Quintela wrote: > >> In the past, we had to put the in the main thread all the operations > >> related with sizes due to qemu_file not beeing thread safe. As now > >> all counters are atomic, we can update the counters just after the > >> do the write. As an aditional bonus, we are able to use the right > >> value for the compression methods. Right now we were assuming that > >> there were no compression at all. > > > > Maybe worth mention that initial packet is also accounted after this. > > Ok. > >> > >> +stat64_add(&mig_stats.multifd_bytes, p->next_packet_size); > >> +stat64_add(&mig_stats.transferred, p->next_packet_size); > > > > Two nits: > > > > Maybe merge the two so half atomic operations? > > On my tree, to send after this got in: > > 77fdd3475c migration: Remove transferred atomic counter > > O:-) Ah this looks even better, indeed. :) What I meant was we can also do atomic update in one shot for both next_packet_size + packet_len. > > > Also maybe also worth having a inline helper for adding both multifd_bytes > > and transferred? > > I am removing it. > > After next set of packates: > > rate limit is calulated as: > > begining_period = migration_transferred_bytes(); > ... > > bytes_this_period = migration_transferred_bytes() - begining_period; > > transferred is calculated as: > - multifd_bytes + qemu_file_bytes; > > So things get really simple. As all counters are atomic, you do a > write and after the write to increse the write size to the qemu_file or > to the multifd_bytes. And that is it. Agreed. Thanks, -- Peter Xu
Re: [PATCH 21/21] migration/multifd: Compute transferred bytes correctly
On Mon, May 08, 2023 at 03:09:09PM +0200, Juan Quintela wrote: > In the past, we had to put the in the main thread all the operations > related with sizes due to qemu_file not beeing thread safe. As now > all counters are atomic, we can update the counters just after the > do the write. As an aditional bonus, we are able to use the right > value for the compression methods. Right now we were assuming that > there were no compression at all. Maybe worth mention that initial packet is also accounted after this. > > Signed-off-by: Juan Quintela Two more trivial nits: > --- > migration/multifd.c | 13 - > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff --git a/migration/multifd.c b/migration/multifd.c > index 9d2ade7abc..3a19d8e304 100644 > --- a/migration/multifd.c > +++ b/migration/multifd.c > @@ -175,6 +175,7 @@ void multifd_register_ops(int method, MultiFDMethods *ops) > static int multifd_send_initial_packet(MultiFDSendParams *p, Error **errp) > { > MultiFDInit_t msg = {}; > +size_t size = sizeof(msg); > int ret; > > msg.magic = cpu_to_be32(MULTIFD_MAGIC); > @@ -182,10 +183,12 @@ static int > multifd_send_initial_packet(MultiFDSendParams *p, Error **errp) > msg.id = p->id; > memcpy(msg.uuid, &qemu_uuid.data, sizeof(msg.uuid)); > > -ret = qio_channel_write_all(p->c, (char *)&msg, sizeof(msg), errp); > +ret = qio_channel_write_all(p->c, (char *)&msg, size, errp); > if (ret != 0) { > return -1; > } > +stat64_add(&mig_stats.multifd_bytes, size); > +stat64_add(&mig_stats.transferred, size); > return 0; > } > > @@ -396,7 +399,6 @@ static int multifd_send_pages(QEMUFile *f) > static int next_channel; > MultiFDSendParams *p = NULL; /* make happy gcc */ > MultiFDPages_t *pages = multifd_send_state->pages; > -uint64_t transferred; > > if (qatomic_read(&multifd_send_state->exiting)) { > return -1; > @@ -431,10 +433,7 @@ static int multifd_send_pages(QEMUFile *f) > p->packet_num = multifd_send_state->packet_num++; > multifd_send_state->pages = p->pages; > p->pages = pages; > -transferred = ((uint64_t) pages->num) * p->page_size + p->packet_len; > qemu_mutex_unlock(&p->mutex); > -stat64_add(&mig_stats.transferred, transferred); > -stat64_add(&mig_stats.multifd_bytes, transferred); > qemu_sem_post(&p->sem); > > return 1; > @@ -716,6 +715,8 @@ static void *multifd_send_thread(void *opaque) > if (ret != 0) { > break; > } > +stat64_add(&mig_stats.multifd_bytes, p->packet_len); > +stat64_add(&mig_stats.transferred, p->packet_len); > } else { > /* Send header using the same writev call */ > p->iov[0].iov_len = p->packet_len; > @@ -728,6 +729,8 @@ static void *multifd_send_thread(void *opaque) > break; > } > > +stat64_add(&mig_stats.multifd_bytes, p->next_packet_size); > +stat64_add(&mig_stats.transferred, p->next_packet_size); Two nits: Maybe merge the two so half atomic operations? Also maybe also worth having a inline helper for adding both multifd_bytes and transferred? With/without that, all look good: Reviewed-by: Peter Xu Thanks, > qemu_mutex_lock(&p->mutex); > p->pending_job--; > qemu_mutex_unlock(&p->mutex); > -- > 2.40.0 > -- Peter Xu
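A possible shape for the inline helper suggested in the nits, sketched with made-up names (not the final upstream API); when the header goes out in the same writev call as the payload, the two sizes can also be folded into a single addition per counter:

    static inline void multifd_account_sent(uint64_t bytes)
    {
        stat64_add(&mig_stats.multifd_bytes, bytes);
        stat64_add(&mig_stats.transferred, bytes);
    }

    /* in multifd_send_thread(), for the "header in the same writev" path: */
    multifd_account_sent(p->packet_len + p->next_packet_size);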
Re: [PATCH] migration: for snapshots, hold the BQL during setup callbacks
On Wed, May 10, 2023 at 08:31:13AM +0200, Juan Quintela wrote: > Peter Xu wrote: > > Hi > > [Adding Kevin to the party] > > > On Fri, May 05, 2023 at 03:46:52PM +0200, Fiona Ebner wrote: > >> To fix it, ensure that the BQL is held during setup. To avoid changing > >> the behavior for migration too, introduce conditionals for the setup > >> callbacks that need the BQL and only take the lock if it's not already > >> held. > > > > The major complexity of this patch is the "conditionally taking" part. > > Yeap. > > I don't want that bit. > > This is another case of: > - I have a problem > - I will use recursive mutexes to solve it > > Now you have two problems O:-) > > > Pure question: what is the benefit of not holding BQL always during > > save_setup(), if after all we have this coroutine issue to be solved? > > Dunno. > > I would like that paolo commented on this. I "reviewed the code" 10 > years ago. I don't remember at all why we wanted to change that. > > > I can understand that it helps us to avoid taking BQL too long, but we'll > > need to take it anyway during ramblock dirty track initializations, and so > > far IIUC it's the major time to be consumed during setup(). > > > > Commit message of 9b0950375277467 says, "Only the migration_bitmap_sync() > > call needs the iothread lock". Firstly I think it's also covering > > enablement of dirty tracking: > > > > +qemu_mutex_lock_iothread(); > > +qemu_mutex_lock_ramlist(); > > +bytes_transferred = 0; > > +reset_ram_globals(); > > + > > memory_global_dirty_log_start(); > > migration_bitmap_sync(); > > +qemu_mutex_unlock_iothread(); > > > > And I think enablement itself can be slow too, maybe even slower than > > migration_bitmap_sync() especially with KVM_DIRTY_LOG_INITIALLY_SET > > supported in the kernel. > > > > Meanwhile I always got confused on why we need to sync dirty bitmap when > > setup at all. Say, what if we drop migration_bitmap_sync() here? After > > all, shouldn't all pages be dirty from start (ram_list_init_bitmaps())? > > How do you convince KVM (or the other lists) to start doing dirty > tracking? Doing a bitmap sync. I think memory_global_dirty_log_start() kicks off the tracking already. Take KVM as example, normally the case is KVM_MEM_LOG_DIRTY_PAGES is set there, then ioctl(KVM_SET_USER_MEMORY_REGION) will start dirty tracking for all of the guest memory slots necessary (including wr-protect all pages). KVM_DIRTY_LOG_INITIALLY_SET is slightly special, it'll skip that procedure during ioctl(KVM_SET_USER_MEMORY_REGION), but that also means the kernel assumed the userapp (QEMU) marked all pages dirty initially (always the case for QEMU, I think..). Hence in this case the sync doesn't help either because we'll simply get no new dirty bits in this shot.. > > And yeap, probably there is a better way of doing it. > > > This is slightly off-topic, but I'd like to know if someone can help > > answer. > > > > My whole point is still questioning whether we can unconditionally take bql > > during save_setup(). > > I agree with you. > > > I could have missed something, though, where we want to do in setup() but > > avoid holding BQL. Help needed on figuring this out (and if there is, IMHO > > it'll be worthwhile to put that into comment of save_setup() hook). > > I am more towards revert completely > 9b0950375277467fd74a9075624477ae43b9bb22 > > and call it a day. On migration we don't use coroutines on the sending > side (I mean the migration code, the block layer uses coroutines for > everything/anything). 
> > Paolo, Stefan any clues for not run setup with the BKL? > > Later, Juan. > -- Peter Xu
Re: [PATCH] migration: for snapshots, hold the BQL during setup callbacks
On Fri, May 05, 2023 at 03:46:52PM +0200, Fiona Ebner wrote: > To fix it, ensure that the BQL is held during setup. To avoid changing > the behavior for migration too, introduce conditionals for the setup > callbacks that need the BQL and only take the lock if it's not already > held. The major complexity of this patch is the "conditionally taking" part. Pure question: what is the benefit of not holding BQL always during save_setup(), if after all we have this coroutine issue to be solved? I can understand that it helps us to avoid taking BQL too long, but we'll need to take it anyway during ramblock dirty track initializations, and so far IIUC it's the major time to be consumed during setup(). Commit message of 9b0950375277467 says, "Only the migration_bitmap_sync() call needs the iothread lock". Firstly I think it's also covering enablement of dirty tracking: +qemu_mutex_lock_iothread(); +qemu_mutex_lock_ramlist(); +bytes_transferred = 0; +reset_ram_globals(); + memory_global_dirty_log_start(); migration_bitmap_sync(); +qemu_mutex_unlock_iothread(); And I think enablement itself can be slow too, maybe even slower than migration_bitmap_sync() especially with KVM_DIRTY_LOG_INITIALLY_SET supported in the kernel. Meanwhile I always got confused on why we need to sync dirty bitmap when setup at all. Say, what if we drop migration_bitmap_sync() here? After all, shouldn't all pages be dirty from start (ram_list_init_bitmaps())? This is slightly off-topic, but I'd like to know if someone can help answer. My whole point is still questioning whether we can unconditionally take bql during save_setup(). I could have missed something, though, where we want to do in setup() but avoid holding BQL. Help needed on figuring this out (and if there is, IMHO it'll be worthwhile to put that into comment of save_setup() hook). Thanks, -- Peter Xu
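For context, the "conditionally taking" shape under discussion looks roughly like this; a sketch only, using the pre-9.0 BQL helper names and a simplified hook invocation:

    static int save_setup_maybe_bql(SaveStateEntry *se, QEMUFile *f)
    {
        bool bql_held = qemu_mutex_iothread_locked();
        int ret;

        /* Take the BQL only if the caller does not already hold it, so
         * the snapshot path gets the lock while the migration path keeps
         * its current behavior. */
        if (!bql_held) {
            qemu_mutex_lock_iothread();
        }
        ret = se->ops->save_setup(f, se->opaque);
        if (!bql_held) {
            qemu_mutex_unlock_iothread();
        }
        return ret;
    }

The alternative being floated above is to drop the conditionals and simply hold the BQL across every save_setup() call.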
Re: [PATCH 0/9] QEMU file cleanups
On Thu, May 04, 2023 at 04:56:46PM +0200, Juan Quintela wrote: > Peter Xu wrote: > > On Thu, May 04, 2023 at 01:38:32PM +0200, Juan Quintela wrote: > >> - convince and review code to see that everything is uint64_t. > > > > One general question to patches regarding this - what's the major benefit > > of using uint64_t? > > > > It doubles the possible numbers to hold, but it's already 64bits so I don't > > think it matters a lot. > > We were checking for negatives even when that can't be. > And we are doing this dance of > > int64_t x, y; > uint64_t a, b; > > x = a; > b = y; > > This is always confusing and not always right. Yeah this is confusing, but if anything can go wrong with this I assume we could have some bigger problem anyway.. > > > The thing is we're removing some code trying to > > detect negative which seems to be still helpful to detect e.g. overflows > > (even though I don't think it'll happen). I just still think it's good to > > know when overflow happens, and not sure what I missed on benefits of using > > unsigned here. > > If you grep through the code, you see that half of the things are > int64_t and the other half is uint64_t. I find it always confusing. Right, I'm personally curious whether we should just use int64_t always unless necessary. :) Another good thing with int64_t is it's also suitable for error report when used in retvals. But no strong opinion here, I don't think that's a huge deal for now. Having such an alignment on types makes sense to me. Thanks, -- Peter Xu
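A self-contained illustration of why the mixing is error-prone and why the sign matters for sanity checks (plain C, compiles and runs as-is):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int64_t signed_bytes = -1;               /* e.g. an error value that leaked */
        uint64_t unsigned_bytes = signed_bytes;  /* compiles silently, wraps around */

        /* Prints 18446744073709551615: the "negative" is now a huge count. */
        printf("%llu\n", (unsigned long long)unsigned_bytes);

        /* With a signed counter the cheap sanity check still works: */
        if (signed_bytes < 0) {
            printf("negative byte count detected\n");
        }
        return 0;
    }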
Re: [PATCH 0/9] QEMU file cleanups
On Thu, May 04, 2023 at 01:38:32PM +0200, Juan Quintela wrote: > - convince and review code to see that everything is uint64_t. One general question to patches regarding this - what's the major benefit of using uint64_t? It doubles the possible numbers to hold, but it's already 64bits so I don't think it matters a lot. The thing is we're removing some code trying to detect negative which seems to be still helpful to detect e.g. overflows (even though I don't think it'll happen). I just still think it's good to know when overflow happens, and not sure what I missed on benefits of using unsigned here. I've reviewed all the rest patches and all look good here. Thanks, -- Peter Xu
Re: [PATCH 9/9] qemu-file: Account for rate_limit usage on qemu_fflush()
On Thu, May 04, 2023 at 01:38:41PM +0200, Juan Quintela wrote: > That is the moment we know we have transferred something. > > Signed-off-by: Juan Quintela There'll be a slight side effect that qemu_file_rate_limit() can be triggered later than before because data cached in the qemufile won't be accounted until flushed. Two limits here: - IO_BUF_SIZE==32K, the real buffer - MAX_IOV_SIZE==64 (I think), the async buffer to put guest page ptrs directly, on x86 it's 64*4K=256K So the impact should be no more than 288KB on x86. Looks still fine.. Reviewed-by: Peter Xu -- Peter Xu
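Spelling out the bound with the two limits quoted above (the 4KiB page size is an x86 assumption):

    #define IO_BUF_SIZE    (32 * 1024)  /* internal QEMUFile buffer, per the mail above */
    #define MAX_IOV_SIZE   64           /* queued iov entries for guest pages */
    #define X86_PAGE_SIZE  4096

    /* 32KiB + 64 * 4KiB = 288KiB of data the rate limiter may not have
     * accounted for yet at any given time. */
    #define MAX_UNACCOUNTED_BYTES (IO_BUF_SIZE + MAX_IOV_SIZE * X86_PAGE_SIZE)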
Re: [PATCH 8/9] qemu-file: Make ram_control_save_page() use accessors for rate_limit
On Thu, May 04, 2023 at 01:38:40PM +0200, Juan Quintela wrote: > Signed-off-by: Juan Quintela Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH 6/9] qemu-file: remove shutdown member
On Thu, May 04, 2023 at 01:38:38PM +0200, Juan Quintela wrote: > The first thing that we do after setting the shutdown value is set the > error as -EIO if there is not a previous error. > > So this value is reduntant. Just remove it and use > qemu_file_get_error() in the places that it was tested. > > Signed-off-by: Juan Quintela Nit: I'd squash this with previous. Reviewed-by: Peter Xu -- Peter Xu
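The invariant both of these patches lean on can be sketched as below (simplified; the real qemu_file_shutdown() also checks that the channel supports shutdown at all):

    int qemu_file_shutdown(QEMUFile *f)
    {
        int ret = qio_channel_shutdown(f->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);

        /* Once shut down, the file always carries an error, so callers can
         * rely on qemu_file_get_error() instead of a separate flag. */
        if (!qemu_file_get_error(f)) {
            qemu_file_set_error(f, -EIO);
        }
        return ret;
    }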
Re: [PATCH 5/9] qemu-file: No need to check for shutdown in qemu_file_rate_limit
On Thu, May 04, 2023 at 01:38:37PM +0200, Juan Quintela wrote: > After calling qemu_file_shutdown() we set the error as -EIO if there > is no another previous error, so no need to check it here. > > Signed-off-by: Juan Quintela Reviewed-by: Peter Xu -- Peter Xu
Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll
../migration/savevm.c:2954 > >> #15 0x558ed3e17fb1 in snapshot_save_job_bh (opaque=0x558ed68c5170) at > >> ../migration/savevm.c:3253 > >> #16 0x558ed42f050a in aio_bh_call (bh=0x558ed671ae00) at > >> ../util/async.c:155 > >> #17 0x558ed42f0615 in aio_bh_poll (ctx=0x558ed5c62910) at > >> ../util/async.c:184 > >> #18 0x558ed42d47b8 in aio_poll (ctx=0x558ed5c62910, blocking=true) at > >> ../util/aio-posix.c:721 > >> #19 0x558ed412df1c in bdrv_poll_co (s=0x7f8ffcff3eb0) at > >> /home/febner/repos/qemu/block/block-gen.h:43 > >> #20 0x558ed4130c3a in blk_pwrite (blk=0x558ed5ed4f60, > >> offset=230912, bytes=512, buf=0x7f8ffc438600, flags=0) at > >> block/block-gen.c:1650 > >> #21 0x558ed3ba9078 in pflash_update (pfl=0x558ed5eb7b30, > >> offset=230912, size=1) at ../hw/block/pflash_cfi01.c:394 > >> #22 0x558ed3ba9749 in pflash_write (pfl=0x558ed5eb7b30, > >> offset=231232, value=0, width=1, be=0) at > >> ../hw/block/pflash_cfi01.c:522 > >> #23 0x558ed3ba9cda in pflash_mem_write_with_attrs > >> (opaque=0x558ed5eb7b30, addr=231232, value=0, len=1, attrs=...) at > >> ../hw/block/pflash_cfi01.c:681 > >> #24 0x558ed402a36a in memory_region_write_with_attrs_accessor > >> (mr=0x558ed5eb7ef0, addr=231232, value=0x7f8ffcff40c8, size=1, > >> shift=0, mask=255, attrs=...) at ../softmmu/memory.c:514 > >> #25 0x558ed402a4a9 in access_with_adjusted_size (addr=231232, > >> value=0x7f8ffcff40c8, size=1, access_size_min=1, access_size_max=4, > >> access_fn=0x558ed402a270 , > >> mr=0x558ed5eb7ef0, attrs=...) at ../softmmu/memory.c:555 > >> #26 0x558ed402d5de in memory_region_dispatch_write > >> (mr=0x558ed5eb7ef0, addr=231232, data=0, op=MO_8, attrs=...) at > >> ../softmmu/memory.c:1522 > >> #27 0x558ed403a6f4 in flatview_write_continue > >> (fv=0x558ed66d62c0, addr=4291004224, attrs=..., ptr=0x7f9029957028, > >> len=1, addr1=231232, l=1, mr=0x558ed5eb7ef0) at > >> ../softmmu/physmem.c:2641 > >> #28 0x558ed403a857 in flatview_write (fv=0x558ed66d62c0, > >> addr=4291004224, attrs=..., buf=0x7f9029957028, len=1) at > >> ../softmmu/physmem.c:2683 > >> #29 0x558ed403ac07 in address_space_write (as=0x558ed4ca2b20 > >> , addr=4291004224, attrs=..., > >> buf=0x7f9029957028, len=1) at ../softmmu/physmem.c:2779 > >> #30 0x558ed403ac74 in address_space_rw (as=0x558ed4ca2b20 > >> , addr=4291004224, attrs=..., > >> buf=0x7f9029957028, len=1, is_write=true) at > >> ../softmmu/physmem.c:2789 > >> #31 0x558ed40cea88 in kvm_cpu_exec (cpu=0x558ed622a910) at > >> ../accel/kvm/kvm-all.c:2989 > >> #32 0x558ed40d179a in kvm_vcpu_thread_fn (arg=0x558ed622a910) at > >> ../accel/kvm/kvm-accel-ops.c:51 > >> #33 0x558ed42d925f in qemu_thread_start (args=0x558ed5c68c80) at > >> ../util/qemu-thread-posix.c:541 > >> #34 0x7f9028ab7ea7 in start_thread (arg=) at > >> pthread_create.c:477 > >> #35 0x7f9027c18a2f in clone () at > >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Totally unfamiliar with block jobs, but... does it mean that snapshot_*_job_bh()s should just always make sure BQL taken? Thanks, -- Peter Xu
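If the answer is yes, the change might be as small as the sketch below (the BH name comes from the trace above; the body is schematic, and whether the BH can simply take the lock depends on what its callers already hold):

    static void snapshot_save_job_bh(void *opaque)
    {
        /* Hold the BQL for the whole BH, matching the locking the
         * savevm/vmstate code expects when called from the monitor. */
        qemu_mutex_lock_iothread();
        /* ... existing snapshot-save work (save_snapshot() etc.) ... */
        qemu_mutex_unlock_iothread();
    }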
Re: [PULL 00/26] Next patches
On Sat, Feb 04, 2023 at 10:19:34AM +, Peter Maydell wrote: > On Thu, 2 Feb 2023 at 16:07, Juan Quintela wrote: > > > > The following changes since commit deabea6e88f7c4c3c12a36ee30051c6209561165: > > > > Merge tag 'for_upstream' of > > https://git.kernel.org/pub/scm/virt/kvm/mst/qemu into staging (2023-02-02 > > 10:10:07 +) > > > > are available in the Git repository at: > > > > https://gitlab.com/juan.quintela/qemu.git tags/next-pull-request > > > > for you to fetch changes up to 5ee6d3d1eeccd85aa2a835e82b8d9e1b4f7441e1: > > > > migration: check magic value for deciding the mapping of channels > > (2023-02-02 17:04:16 +0100) > > > > > > Migration PULL request, new try > > Fails to build on anything that isn't Linux: > > In file included from ../migration/postcopy-ram.c:40: > /private/var/folders/76/zy5ktkns50v6gt5g8r0sf6scgn/T/cirrus-ci-build/include/qemu/userfaultfd.h:18:10: > fatal error: 'linux/userfaultfd.h' file not found Oops, my fault. Juan, please feel free to drop patch "util/userfaultfd: Add uffd_open()". I'll respin with the whole set. -- Peter Xu
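One possible direction for the respin, sketched only (whether an include guard like this or a meson-level condition is the right fix depends on how the rest of the series uses the helpers; CONFIG_LINUX is QEMU's usual guard for Linux-only code):

    /* include/qemu/userfaultfd.h */
    #ifdef CONFIG_LINUX
    #include <linux/userfaultfd.h>

    int uffd_open(int flags);
    /* ... remaining uffd helpers ... */
    #endif /* CONFIG_LINUX */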