Re: [PATCH] qemu: add support for qemu switchover-ack

2024-06-20 Thread Peter Xu
On Thu, Jun 20, 2024 at 07:45:42PM +, Jon Kohler wrote:
> 
> 
> > On Jun 20, 2024, at 4:30 AM, Jiri Denemark  wrote:
> > 
> > 
> > On Tue, Jun 18, 2024 at 16:14:29 +0100, Daniel P. Berrangé wrote:
> >> On Tue, Jun 18, 2024 at 08:06:06AM -0700, Jon Kohler wrote:
> >>> diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
> >>> index 2f5b01bbfe..9543629f30 100644
> >>> --- a/include/libvirt/libvirt-domain.h
> >>> +++ b/include/libvirt/libvirt-domain.h
> >>> @@ -1100,6 +1100,17 @@ typedef enum {
> >>>  * Since: 8.5.0
> >>>  */
> >>> VIR_MIGRATE_ZEROCOPY = (1 << 20),
> >>> +
> >>> +/* Use switchover ack migration capability to reduce downtime on VFIO
> >>> + * device migration. This prevents the source from stopping the VM and
> >>> + * completing the migration until an ACK is received from the destination
> >>> + * that it's OK to do so. Thus, a VFIO device can make sure that its
> >>> + * initial bytes were sent and loaded in the destination before the
> >>> + * source VM is stopped.
> >>> + *
> >>> + * Since: 10.5.0
> >>> + */
> >>> +VIR_MIGRATE_SWITCHOVER_ACK = (1 << 21),
> >>> } virDomainMigrateFlags;
> >> 
> >> Do we really need a flag for this ?  Is there a credible scenario
> >> in which this flag works, and yet shouldn't be used by libvirt ?
> >> 
> >> IOW, can we just "do the right thing" and always enable this,
> >> except for TUNNELLED mode.
> > 
> > I discussed this capability some time ago with Peter (I think) and IIRC
> > there was some downside when the capability is enabled for domains
> > that do not use VFIO. I don't remember exactly what it was about, but
> > perhaps introducing an extra delay in migration switchover? Peter, can
> > you add the details, please?
> 
> Thanks - @Peter, if you have additional info on that, would love to know
> what the non-VFIO downsides are here. 

So far, VFIO is the only one that will register this "ACK needed" hook.
When nobody registers with it, the ACK will be sent up front when the return
path is established. That happens at the very beginning of a migration, and
the ACK is completely meaningless in that case.

That said, it may not be too bad to have that meaningless ACK if it
simplifies Libvirt.  It only happens once per migration, and after it is sent
once, everything works exactly the same as when switchover-ack is not enabled.
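
For reference, a minimal sketch (not part of this patch) of how an application
could request the behavior through the libvirt API once the flag above lands;
virDomainMigrateToURI3() and VIR_MIGRATE_LIVE are existing libvirt API, while
VIR_MIGRATE_SWITCHOVER_ACK is what this series proposes:

    #include <libvirt/libvirt.h>

    /* Sketch only: live-migrate a domain and ask for switchover-ack. */
    int migrate_with_switchover_ack(virDomainPtr dom, const char *dsturi)
    {
        unsigned int flags = VIR_MIGRATE_LIVE | VIR_MIGRATE_SWITCHOVER_ACK;

        /* Returns 0 on success, -1 on error (per the libvirt convention). */
        return virDomainMigrateToURI3(dom, dsturi, NULL, 0, flags);
    }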

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-05 Thread Peter Xu
On Wed, Jun 05, 2024 at 08:48:28PM +, Dr. David Alan Gilbert wrote:
> > > I just noticed this thread; some random notes from a somewhat
> > > fragmented memory of this:
> > > 
> > >   a) Long long ago, I also tried rsocket; 
> > >   https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
> > >  as I remember the library was quite flaky at the time.
> > 
> > Hmm, interesting.  It also looks like there's a thread doing rpoll().
> 
> Yeh, I can't actually remember much more about what I did back then!

Heh, that's understandable and fair. :)

> > I hope Lei and his team have tested >4G mem, otherwise it's definitely worth
> > checking.  Lei also mentioned there are rsocket bugs they found in the cover
> > letter, but I'm not sure what that's about.
> 
> It would probably be a good idea to keep track of what bugs
> are in flight with it, and try it on a few RDMA cards to see
> what problems get triggered.
> I think I reported a few at the time, but I gave up after
> feeling it was getting very hacky.

Agreed.  Maybe we can have a list of those in the cover letter or even
QEMU's migration/rdma doc page.

Lei, if you think that makes sense please do so in your upcoming posts.
There'll need to be a list of things you encountered in the kernel driver,
and it'll be even better if there are further links to read on each problem.

> > > 
> > >   e) Someone made a good suggestion (sorry can't remember who) - that the
> > >  RDMA migration structure was the wrong way around - it should be the
> > >  destination which initiates an RDMA read, rather than the source
> > >  doing a write; then things might become a LOT simpler; you just need
> > >  to send page ranges to the destination and it can pull it.
> > >  That might work nicely for postcopy.
> > 
> > I'm not sure whether it'll still be a problem if the rdma recv side is based
> > on zero-copy.  It's a matter of whether atomicity can be guaranteed, so that
> > the guest vcpus never see a partially copied page during in-flight DMAs.
> > UFFDIO_COPY (or friends) is currently the only solution for that.
> 
> Yes, but even ignoring that (and the UFFDIO_CONTINUE idea you mention), if
> the destination can issue an RDMA read itself, it doesn't need to send 
> messages
> to the source to ask for a page fetch; it just goes and grabs it itself,
> that's got to be good for latency.

Oh, that's pretty internal stuff of rdma to me and beyond my knowledge..
but from what I can tell it sounds very reasonable indeed!

Thanks!

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-05 Thread Peter Xu
On Wed, Jun 05, 2024 at 10:10:57AM -0400, Peter Xu wrote:
> >   e) Someone made a good suggestion (sorry can't remember who) - that the
> >  RDMA migration structure was the wrong way around - it should be the
> >  destination which initiates an RDMA read, rather than the source
> >  doing a write; then things might become a LOT simpler; you just need
> >  to send page ranges to the destination and it can pull it.
> >  That might work nicely for postcopy.
> 
> I'm not sure whether it'll still be a problem if the rdma recv side is based
> on zero-copy.  It's a matter of whether atomicity can be guaranteed, so that
> the guest vcpus never see a partially copied page during in-flight DMAs.
> UFFDIO_COPY (or friends) is currently the only solution for that.

And when thinking about this (given UFFDIO_COPY by nature cannot do
zero-copy...), the only way to get zero-copy is to use file-backed memory
(shmem/hugetlbfs), as the page cache can be prepopulated.  Then when we do DMA
we go through the page cache, which can be mapped at another virtual address
besides the one the vcpus are using.

Then we can use UFFDIO_CONTINUE (rather than UFFDIO_COPY) to do atomic
updates of the vcpu pgtables, avoiding the copy.  QEMU doesn't have that yet,
but it looks like one more reason we may want to make better use of shmem
rather than anonymous memory.  And actually, when working on 4k faults on 1G
hugetlb I added CONTINUE support:

https://github.com/xzpeter/qemu/tree/doublemap
https://github.com/xzpeter/qemu/commit/b8aff3a9d7654b1cf2c089a06894ff4899740dc5
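
For reference, a minimal sketch of the UFFDIO_CONTINUE idea described above
(not QEMU code): the page data is written into the shmem page cache through a
second "alias" mapping, then the vcpu-visible page tables are populated
atomically without another copy.  It assumes a userfaultfd already registered
with UFFDIO_REGISTER_MODE_MINOR on the guest mapping (Linux 5.13+); error
handling is omitted:

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <string.h>

    static int install_page(int uffd, void *alias_addr, void *guest_addr,
                            const void *data, size_t pagesize)
    {
        struct uffdio_continue cont;

        /* 1) Fill the shmem page cache via the alias mapping (memcpy or DMA). */
        memcpy(alias_addr, data, pagesize);

        /* 2) Atomically map that page-cache page at the guest-visible address. */
        cont.range.start = (unsigned long)guest_addr;
        cont.range.len = pagesize;
        cont.mode = 0;
        return ioctl(uffd, UFFDIO_CONTINUE, &cont);
    }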

Maybe it's worthwhile on its own now, because it also means we can use that
in multifd to avoid one extra layer of buffering when supporting
multifd+postcopy (which has the same issue of directly copying data into
guest pages).  It'll also work with things like rdma, I think, in similar
ways.  It's just that it won't work on anonymous memory.

I definitely hijacked the thread to somewhere too far away.  I'll stop
here..

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-05 Thread Peter Xu
Hey, Dave!

On Wed, Jun 05, 2024 at 12:31:56AM +, Dr. David Alan Gilbert wrote:
> * Michael Galaxy (mgal...@akamai.com) wrote:
> > One thing to keep in mind here (despite me not having any hardware to test)
> > was that one of the original goals here in the RDMA implementation was not
> > simply raw throughput nor raw latency, but a lack of CPU utilization in
> > kernel space due to the offload. While it is entirely possible that newer
> > hardware w/ TCP might compete, the significant reductions in CPU usage in
> > the TCP/IP stack were a big win at the time.
> > 
> > Just something to consider while you're doing the testing
> 
> I just noticed this thread; some random notes from a somewhat
> fragmented memory of this:
> 
>   a) Long long ago, I also tried rsocket; 
>   https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
>  as I remember the library was quite flaky at the time.

Hmm, interesting.  It also looks like there's a thread doing rpoll().

Btw, not sure whether you noticed, but there's the series posted for the
latest rsocket conversion here:

https://lore.kernel.org/r/1717503252-51884-1-git-send-email-arei.gong...@huawei.com

I hope Lei and his team have tested >4G mem, otherwise it's definitely worth
checking.  Lei also mentioned there are rsocket bugs they found in the cover
letter, but I'm not sure what that's about.

> 
>   b) A lot of the complexity in the rdma migration code comes from
> emulating a stream to carry the migration control data and interleaving
> that with the actual RAM copy.   I believe the original design used
> a separate TCP socket for the control data, and just used the RDMA
> for the data - that should be a lot simpler (but alas was rejected
> in review early on)
> 
>   c) I can't remember the last benchmarks I did; but I think I did
> manage to beat RDMA with multifd; but yes, multifd does eat host CPU
> whereas RDMA barely uses a whisper.

I think my first impression on this matter came from you on this one. :)

> 
>   d) The 'zero-copy-send' option in migrate may well get some of that
>  CPU time back; but if I remember we were still bottle necked on
>  the receive side. (I can't remember if zero-copy-send worked with
>  multifd?)

Yes, and zero-copy requires multifd for now. I think it's because we didn't
want to complicate the header processing in the migration stream, which may
not be page aligned.
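
For context, a minimal sketch of the kernel mechanism that 'zero-copy-send'
builds on (SO_ZEROCOPY / MSG_ZEROCOPY, Linux 4.14+ and a recent glibc); this
is not the QEMU code, and reaping the completion notifications from the
socket error queue is omitted:

    #include <sys/socket.h>

    /* Enable zero-copy on the socket (idempotent), then flag the send. */
    static ssize_t send_page_zerocopy(int sockfd, const void *page, size_t len)
    {
        int one = 1;

        setsockopt(sockfd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
        return send(sockfd, page, len, MSG_ZEROCOPY);
    }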

> 
>   e) Someone made a good suggestion (sorry can't remember who) - that the
>  RDMA migration structure was the wrong way around - it should be the
>  destination which initiates an RDMA read, rather than the source
>  doing a write; then things might become a LOT simpler; you just need
>  to send page ranges to the destination and it can pull it.
>  That might work nicely for postcopy.

I'm not sure whether it'll still be a problem if rdma recv side is based on
zero-copy.  It would be a matter of whether atomicity can be guaranteed so
that we don't want the guest vcpus to see a partially copied page during
on-flight DMAs.  UFFDIO_COPY (or friend) is currently the only solution for
that.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Peter Xu
Lei,

On Wed, May 29, 2024 at 02:43:46AM +, Gonglei (Arei) wrote:
> For rdma programming, the current mainstream implementation is to use
> rdma_cm to establish a connection, and then use verbs to transmit data.
> rdma_cm and ibverbs create two FDs respectively. The two FDs have
> different responsibilities. rdma_cm fd is used to notify connection
> establishment events, and verbs fd is used to notify new CQEs. When
> poll/epoll monitoring is directly performed on the rdma_cm fd, only a
> pollin event can be monitored, which means that an rdma_cm event
> occurs. When the verbs fd is directly polled/epolled, only the pollin
> event can be listened for, which indicates that a new CQE has been generated.
>
> Rsocket is a sub-module attached to the rdma_cm library and provides
> rdma calls that are completely similar to socket interfaces. However,
> this library returns only the rdma_cm fd for listening to link
> setup-related events and does not expose the verbs fd (readable and
> writable events for listening to data). Only the rpoll interface provided
> by the RSocket can be used to listen to related events. However, QEMU
> uses the ppoll interface to listen to the rdma_cm fd (obtained via the
> raccept API), and cannot listen to the verbs fd event. Only some hacking
> methods can be used to address this problem.  Do you guys have any ideas?
> Thanks.

I saw that you mentioned this elsewhere:

> Right. But the question is that QEMU does not use rpoll but glib's ppoll. :(

So what I'm thinking may not make much sense; as I mentioned, I don't think
I know rdma at all.. and my idea also involves coroutine stuff, which I don't
know well either. But just in case it sheds some light in some form.

IIUC we do iochannel blocking with this for both read and write:

    if (len == QIO_CHANNEL_ERR_BLOCK) {
        if (qemu_in_coroutine()) {
            qio_channel_yield(ioc, G_IO_XXX);
        } else {
            qio_channel_wait(ioc, G_IO_XXX);
        }
        continue;
    }

One thing I'm wondering is whether we can provide a new feature bit for
qiochannel, e.g., QIO_CHANNEL_FEATURE_POLL, so that the iochannel can
define its own poll routine rather than using the default when possible.

I think it may not work if it's in a coroutine, as I guess that'll block
other fds from being woken up.  Hence it should look like this:

    if (len == QIO_CHANNEL_ERR_BLOCK) {
        if (qemu_in_coroutine()) {
            qio_channel_yield(ioc, G_IO_XXX);
        } else if (qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_POLL)) {
            qio_channel_poll(ioc, G_IO_XXX);
        } else {
            qio_channel_wait(ioc, G_IO_XXX);
        }
        continue;
    }

Maybe we even want to forbid such a channel from being used in a coroutine
entirely, as having QIO_CHANNEL_FEATURE_POLL set may mean that this iochannel
simply won't work with poll(), as in rdma's use case.

Then rdma iochannel can implement qio_channel_poll() using rpoll().
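
As a rough sketch of that idea (the QIO_CHANNEL_FEATURE_POLL bit and the
QIOChannelRDMA type/field names are hypothetical here, following the naming
above; rpoll() and struct pollfd are real APIs from <rdma/rsocket.h> and
<poll.h>):

    #include <rdma/rsocket.h>
    #include <poll.h>
    #include <glib.h>

    /* Hypothetical channel type for the proposed io/channel-rdma.c. */
    typedef struct QIOChannelRDMA {
        int rsock;   /* fd returned by rsocket()/raccept() */
    } QIOChannelRDMA;

    static int qio_channel_rdma_poll(QIOChannelRDMA *rioc, GIOCondition cond)
    {
        struct pollfd pfd = {
            .fd = rioc->rsock,
            .events = ((cond & G_IO_IN) ? POLLIN : 0) |
                      ((cond & G_IO_OUT) ? POLLOUT : 0),
        };

        /* Block until the rsocket fd is ready; -1 means no timeout.  rpoll()
         * handles rsocket fds that plain poll()/ppoll() cannot. */
        return rpoll(&pfd, 1, -1);
    }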

There's one other dependent issue here in that I _think_ the migration recv
side is still in a coroutine.. so we may need to move that into a thread
first.  IIRC we don't yet have a major blocker to do that, but I didn't
further check either.  I've put that issue aside just to see whether this
may or may not make sense.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-28 Thread Peter Xu
On Tue, May 28, 2024 at 09:06:04AM +, Gonglei (Arei) wrote:
> Hi Peter,
> 
> > -Original Message-
> > From: Peter Xu [mailto:pet...@redhat.com]
> > Sent: Wednesday, May 22, 2024 6:15 AM
> > To: Yu Zhang 
> > Cc: Michael Galaxy ; Jinpu Wang
> > ; Elmar Gerdes ;
> > zhengchuan ; Gonglei (Arei)
> > ; Daniel P. Berrangé ;
> > Markus Armbruster ; Zhijian Li (Fujitsu)
> > ; qemu-de...@nongnu.org; Yuval Shaia
> > ; Kevin Wolf ; Prasanna
> > Kumar Kalever ; Cornelia Huck
> > ; Michael Roth ; Prasanna
> > Kumar Kalever ; Paolo Bonzini
> > ; qemu-bl...@nongnu.org; devel@lists.libvirt.org;
> > Hanna Reitz ; Michael S. Tsirkin ;
> > Thomas Huth ; Eric Blake ; Song
> > Gao ; Marc-André Lureau
> > ; Alex Bennée ;
> > Wainer dos Santos Moschetta ; Beraldo Leal
> > ; Pannengyuan ;
> > Xiexiangyou ; Fabiano Rosas 
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> > 
> > On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> > > Hello Michael and Peter,
> > 
> > Hi,
> > 
> > >
> > > Exactly, not so compelling, as I did it first only on servers widely
> > > used for production in our data center. The network adapters are
> > >
> > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> > > 2-port Gigabit Ethernet PCIe
> > 
> > Hmm... I definitely think Jinpu's Mellanox ConnectX-6 looks more
> > reasonable.
> > 
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > wvaqk81vxtkzx-l...@mail.gmail.com/
> > 
> > Appreciate a lot for everyone helping on the testings.
> > 
> > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > [ConnectX-5]
> > >
> > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> > > on these two hosts. One is standby while the other is active.
> > >
> > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > network adapters. One of them has:
> > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > >
> > > The comparison between RDMA and TCP on the same NIC could make more
> > sense.
> > 
> > It looks to me that NICs are powerful now, but again, as I mentioned, I
> > don't think that's a reason we need to deprecate rdma, especially if QEMU's
> > rdma migration has a chance to be refactored using rsocket.
> > 
> > Has anyone started looking into that direction?  Would it make sense to
> > start some PoC now?
> > 
> 
> My team has finished the PoC refactoring, which works well.
> 
> Progress:
> 1.  Implement io/channel-rdma.c,
> 2.  Add unit test tests/unit/test-io-channel-rdma.c and verify it is successful,
> 3.  Remove the original code from migration/rdma.c,
> 4.  Rewrite the rdma_start_outgoing_migration and rdma_start_incoming_migration logic,
> 5.  Remove all rdma_xxx functions from migration/ram.c (to prevent RDMA live
> migration from polluting the core logic of live migration),
> 6.  Use the software-implemented soft-RoCE to test RDMA live migration. It's successful.
> 
> We will submit the patchset later.

That's great news, thank you!

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-21 Thread Peter Xu
On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> Hello Michael and Peter,

Hi,

> 
> Exactly, not so compelling, as I did it first only on servers widely
> used for production in our data center. The network adapters are
> 
> Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> 2-port Gigabit Ethernet PCIe

Hmm... I definitely think Jinpu's Mellanox ConnectX-6 looks more
reasonable.

https://lore.kernel.org/qemu-devel/camgffen-dkpmz4ta71mjydyemg0zda15wvaqk81vxtkzx-l...@mail.gmail.com/

Appreciate a lot for everyone helping on the testings.

> InfiniBand controller: Mellanox Technologies MT27800 Family [ConnectX-5]
> 
> which doesn't meet our purpose. I can choose RDMA or TCP for VM
> migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> on these two hosts. One is standby while the other is active.
> 
> Now I'll try on a server with more recent Ethernet and InfiniBand
> network adapters. One of them has:
> BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> 
> The comparison between RDMA and TCP on the same NIC could make more sense.

It looks to me that NICs are powerful now, but again, as I mentioned, I don't
think that's a reason we need to deprecate rdma, especially if QEMU's rdma
migration has a chance to be refactored using rsocket.

Has anyone started looking into that direction?  Would it make sense to
start some PoC now?

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-09 Thread Peter Xu
On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:
> That's good news, the socket abstraction for RDMA!
> When I developed the series above, the biggest pain was that RDMA migration
> has no QIOChannel abstraction and I needed to use a 'fake channel' for it,
> which is awkward in the code implementation.
> So, as far as I know, we can do this by:
> i. first evaluate whether rsocket is good enough to satisfy our QIOChannel
> fundamental abstraction,
> ii. if it works right, then continue to see if it gives us the opportunity
> to hide the details of the rdma protocol inside rsocket by removing most of
> the code in rdma.c and also some hacks in the migration main process,
> iii. implement the advanced features like multi-fd and multi-uri for rdma
> migration.
> 
> Since I am not familiar with rsocket, I need some time to look at it and do
> some quick verification of rdma migration based on rsocket.
> But, yes, I am willing to be involved in this refactoring work and to see if
> we can make this migration feature better :)

Based on what we have now, it looks like we'd better halt the deprecation
process a bit, so I think we shouldn't need to rush it at least in 9.1
then, and we'll need to see how it goes on the refactoring.

It'll be perfect if rsocket works, otherwise supporting multifd with little
overhead / exported APIs would also be a good thing in general with
whatever approach.  And obviously all based on the facts that we can get
resources from companies to support this feature first.

Note that so far nobody has yet compared rdma vs. nic perf, so if any of us
can provide some test results please do so.  Many people say RDMA is better,
but I haven't yet seen any numbers comparing it with modern TCP networks.  I
don't want old impressions floating around even if things might have
changed..  When we have consolidated results, we should share them out and
also reflect them in QEMU's migration docs when an rdma document page is
ready.

Chuan, please check the whole thread discussion, it may help to understand
what we are looking for on rdma migrations [1].  Meanwhile please feel free
to sync with Jinpu's team and see how to move forward with such a project.

[1] https://lore.kernel.org/qemu-devel/87frwatp7n@suse.de/

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-07 Thread Peter Xu
On Tue, May 07, 2024 at 01:50:43AM +, Gonglei (Arei) wrote:
> Hello,
> 
> > -Original Message-
> > From: Peter Xu [mailto:pet...@redhat.com]
> > Sent: Monday, May 6, 2024 11:18 PM
> > To: Gonglei (Arei) 
> > Cc: Daniel P. Berrangé ; Markus Armbruster
> > ; Michael Galaxy ; Yu Zhang
> > ; Zhijian Li (Fujitsu) ; Jinpu 
> > Wang
> > ; Elmar Gerdes ;
> > qemu-de...@nongnu.org; Yuval Shaia ; Kevin Wolf
> > ; Prasanna Kumar Kalever
> > ; Cornelia Huck ;
> > Michael Roth ; Prasanna Kumar Kalever
> > ; integrat...@gluster.org; Paolo Bonzini
> > ; qemu-bl...@nongnu.org; devel@lists.libvirt.org;
> > Hanna Reitz ; Michael S. Tsirkin ;
> > Thomas Huth ; Eric Blake ; Song
> > Gao ; Marc-André Lureau
> > ; Alex Bennée ;
> > Wainer dos Santos Moschetta ; Beraldo Leal
> > ; Pannengyuan ;
> > Xiexiangyou 
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> > 
> > On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote:
> > > Hi, Peter
> > 
> > Hey, Lei,
> > 
> > Happy to see you around again after years.
> > 
> Haha, me too.
> 
> > > RDMA features high bandwidth, low latency (in non-blocking lossless
> > > network), and direct remote memory access by bypassing the CPU (As you
> > > know, CPU resources are expensive for cloud vendors, which is one of
> > > the reasons why we introduced offload cards.), which TCP does not have.
> > 
> > It's another cost to use offload cards, v.s. preparing more cpu resources?
> > 
> A converged software and hardware offload architecture is the way to go for
> all cloud vendors (including comprehensive benefits in terms of performance,
> cost, security, and innovation speed); it's not just a matter of adding the
> resources of a DPU card.
> 
> > > It is very useful in some scenarios where fast live migration is needed
> > > (extremely short interruption duration and migration duration). To this
> > > end, we have also developed RDMA support for multifd.
> > 
> > Will any of you upstream that work?  I'm curious how intrusive it would be
> > to add it to multifd; if it can keep only 5 exported functions like what
> > rdma.h does right now, it'll be pretty nice.  We also want to make sure it
> > works with arbitrarily sized loads and buffers, e.g. vfio is considering
> > adding IO loads to multifd channels too.
> > 
> 
> In fact, we sent the patchset to the community in 2021. Pls see:
> https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/

I wasn't aware of that for sure in the past..

Multifd has changed quite a bit in the recent 9.0 release, so that may not
apply anymore.  One thing to mention: please look at Dan's comment on the
possible use of rsocket.h:

https://lore.kernel.org/all/zjjm6rcqs5eho...@redhat.com/

And Jinpu did help provide an initial test result over the library:

https://lore.kernel.org/qemu-devel/camgffek8wiknqmouyxcathgtiem2dwocf_w7t0vmcd-i30t...@mail.gmail.com/

It looks like we have a chance to apply that in QEMU.

> 
> 
> > One thing to note is that the question here is not only about a pure
> > performance comparison between rdma and nics.  It's about helping us make a
> > decision on whether to drop rdma; iow, even if rdma performs well, the
> > community still has the right to drop it if nobody can actively work on and
> > maintain it.  It's just that if nics can perform as well, that's more of a
> > reason to drop it, unless companies can help to provide good support and
> > work together.
> > 
> 
> We are happy to provide the necessary review and maintenance work for RDMA
> if the community needs it.
> 
> CC'ing Chuan Zheng.

I'm not sure whether you and Jinpu's team would like to work together and
provide a final solution for rdma over multifd.  It could be much simpler
than the original 2021 proposal if the rsocket API will work out.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-06 Thread Peter Xu
On Mon, May 06, 2024 at 12:08:43PM +0200, Jinpu Wang wrote:
> Hi Peter, hi Daniel,

Hi, Jinpu,

Thanks for sharing these test results.  Sounds like great news.

What's your plan next?  Would it then be worthwhile / possible moving QEMU
into that direction?  Would that greatly simplify rdma code as Dan
mentioned?

Thanks,

> 
> On Fri, May 3, 2024 at 4:33 PM Peter Xu  wrote:
> >
> > On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> > > I had a brief check in the rsocket changelog, there seems some
> > > improvement over time,
> > >  might be worth revisiting this. due to socket abstraction, we can't
> > > use some feature like
> > >  ODP, it won't be a small and easy task.
> >
> > It'll be good to know whether Dan's suggestion would work first, without
> > rewritting everything yet so far.  Not sure whether some perf test could
> > help with the rsocket APIs even without QEMU's involvements (or looking for
> > test data supporting / invalidate such conversions).
> >
> I did a quick test with iperf on 100 G environment and 40 G
> environment, in summary rsocket works pretty well.
> 
> iperf tests between 2 hosts with 40 G (IB),
> first  a few test with different num. of threads on top of ipoib
> interface, later with preload rsocket on top of same ipoib interface.
> 
> jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145
> 
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> 
> [  3] local 10.43.3.146 port 55602 connected with 10.43.3.145 port 52000
> [ ID] Interval   Transfer Bandwidth
> [  3] 0.-10.0001 sec  2.85 GBytes  2.44 Gbits/sec
> jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 2
> 
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> 
> [  4] local 10.43.3.146 port 39640 connected with 10.43.3.145 port 52000
> [  3] local 10.43.3.146 port 39626 connected with 10.43.3.145 port 52000
> [ ID] Interval   Transfer Bandwidth
> [  3] 0.-10.0012 sec  2.85 GBytes  2.45 Gbits/sec
> [  4] 0.-10.0026 sec  2.86 GBytes  2.45 Gbits/sec
> [SUM] 0.-10.0026 sec  5.71 GBytes  4.90 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 0.281/0.300/0.318/0.318 ms (tot/err) = 2/0
> jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 4
> 
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> 
> [  4] local 10.43.3.146 port 46956 connected with 10.43.3.145 port 52000
> [  6] local 10.43.3.146 port 46978 connected with 10.43.3.145 port 52000
> [  3] local 10.43.3.146 port 46944 connected with 10.43.3.145 port 52000
> [  5] local 10.43.3.146 port 46962 connected with 10.43.3.145 port 52000
> [ ID] Interval   Transfer Bandwidth
> [  3] 0.-10.0017 sec  2.85 GBytes  2.45 Gbits/sec
> [  4] 0.-10.0015 sec  2.85 GBytes  2.45 Gbits/sec
> [  5] 0.-10.0026 sec  2.85 GBytes  2.45 Gbits/sec
> [  6] 0.-10.0005 sec  2.85 GBytes  2.45 Gbits/sec
> [SUM] 0.-10.0005 sec  11.4 GBytes  9.80 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 0.274/0.312/0.360/0.212 ms (tot/err) = 4/0
> jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 8
> 
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> 
> [  7] local 10.43.3.146 port 35062 connected with 10.43.3.145 port 52000
> [  6] local 10.43.3.146 port 35058 connected with 10.43.3.145 port 52000
> [  8] local 10.43.3.146 port 35066 connected with 10.43.3.145 port 52000
> [  9] local 10.43.3.146 port 35074 connected with 10.43.3.145 port 52000
> [  3] local 10.43.3.146 port 35038 connected with 10.43.3.145 port 52000
> [ 12] local 10.43.3.146 port 35088 connected with 10.43.3.145 port 52000
> [  5] local 10.43.3.146 port 35048 connected with 10.43.3.145 port 52000
> [  4] local 10.43.3.146 port 35050 connected with 10.43.3.145 port 52000
> [ ID] Interval   Transfer Bandwidth
> [  4] 0.-10.0005 sec  2.85 GBytes  2.44 Gbits/sec
> [  8] 0.-10.0011 sec  2.85 GBytes  2.45 Gbits/sec
> [  5] 0.-10. sec  2.85 GBytes  2.45 Gbits/sec
> [ 12] 0.-10.0021 sec  2.85 GBytes  2.44 Gbits/sec
> [  3] 0.-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
> [  7] 0.

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-06 Thread Peter Xu
On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote:
> Hi, Peter

Hey, Lei,

Happy to see you around again after years.

> RDMA features high bandwidth, low latency (in non-blocking lossless
> network), and direct remote memory access by bypassing the CPU (As you
> know, CPU resources are expensive for cloud vendors, which is one of the
> reasons why we introduced offload cards.), which TCP does not have.

It's another cost to use offload cards, v.s. preparing more cpu resources?

> It is very useful in some scenarios where fast live migration is needed
> (extremely short interruption duration and migration duration). To this
> end, we have also developed RDMA support for multifd.

Will any of you upstream that work?  I'm curious how intrusive it would be
to add it to multifd; if it can keep only 5 exported functions like what
rdma.h does right now, it'll be pretty nice.  We also want to make sure it
works with arbitrarily sized loads and buffers, e.g. vfio is considering
adding IO loads to multifd channels too.

One thing to note is that the question here is not only about a pure
performance comparison between rdma and nics.  It's about helping us make a
decision on whether to drop rdma; iow, even if rdma performs well, the
community still has the right to drop it if nobody can actively work on and
maintain it.  It's just that if nics can perform as well, that's more of a
reason to drop it, unless companies can help to provide good support and work
together.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-03 Thread Peter Xu
On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> I had a brief check in the rsocket changelog, there seems some
> improvement over time,
>  might be worth revisiting this. due to socket abstraction, we can't
> use some feature like
>  ODP, it won't be a small and easy task.

It'll be good to know whether Dan's suggestion would work first, without
rewritting everything yet so far.  Not sure whether some perf test could
help with the rsocket APIs even without QEMU's involvements (or looking for
test data supporting / invalidate such conversions).

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-02 Thread Peter Xu
On Thu, May 02, 2024 at 03:30:58PM +0200, Jinpu Wang wrote:
> Hi Michael, Hi Peter,
> 
> 
> On Thu, May 2, 2024 at 3:23 PM Michael Galaxy  wrote:
> >
> > Yu Zhang / Jinpu,
> >
> > Any possibility (at your lesiure, and within the disclosure rules of
> > your company, IONOS) if you could share any of your performance
> > information to educate the group?
> >
> > NICs have indeed changed, but not everybody has 100ge mellanox cards at
> > their disposal. Some people don't.
> Our staging env is with 100 Gb/s IB environment.
> We will have a new setup in the coming months with Ethernet (RoCE), we
> will run some performance
> comparison when we have the environment ready.

Thanks both.  Please keep us posted.

Just to double check, we're comparing "tcp:" v.s. "rdma:", RoCE is not
involved, am I right?

The other note is that the comparison needs to be with multifd enabled for
the "tcp:" case.  I'd suggest we start with 8 threads if it's 100Gbps.

I think I can still fetch some 100Gbps or even 200Gbps nics around our labs
without even waiting for months.  If you want I can try to see how we can
test together.  And btw I don't think we need a cluster, IIUC we simply
need two hosts, 100G nic on both sides?  IOW, it seems to me we only need
two cards just for experiments, systems that can drive the cards, and a
wire supporting 100G?

> 
> >
> > - Michael
> 
> Thx!
> Jinpu
> >
> > On 5/1/24 11:16, Peter Xu wrote:
> > > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> > >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > >>> What I worry more is whether this is really what we want to keep rdma in
> > >>> qemu, and that's also why I was trying to request for some serious
> > >>> performance measurements comparing rdma v.s. nics.  And here when I said
> > >>> "we" I mean both QEMU community and any company that will support 
> > >>> keeping
> > >>> rdma around.
> > >>>
> > >>> The problem is if NICs now are fast enough to perform at least equally
> > >>> against rdma, and if it has a lower cost of overall maintenance, does it
> > >>> mean that rdma migration will only be used by whoever wants to keep 
> > >>> them in
> > >>> the products and existed already?  In that case we should simply ask new
> > >>> users to stick with tcp, and rdma users should only drop but not 
> > >>> increase.
> > >>>
> > >>> It seems also destined that most new migration features will not support
> > >>> rdma: see how much we drop old features in migration now (which rdma
> > >>> _might_ still leverage, but maybe not), and how much we add mostly 
> > >>> multifd
> > >>> relevant which will probably not apply to rdma at all.  So in general 
> > >>> what
> > >>> I am worrying is a both-loss condition, if the company might be easier 
> > >>> to
> > >>> either stick with an old qemu (depending on whether other new features 
> > >>> are
> > >>> requested to be used besides RDMA alone), or do periodic rebase with 
> > >>> RDMA
> > >>> downstream only.
> > >> I don't know much about the originals of RDMA support in QEMU and why
> > >> this particular design was taken. It is indeed a huge maint burden to
> > >> have a completely different code flow for RDMA with 4000+ lines of
> > >> custom protocol signalling which is barely understandable.
> > >>
> > >> I would note that /usr/include/rdma/rsocket.h provides a higher level
> > >> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> > >> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> > >> type could almost[1] trivially have supported RDMA. There would have
> > >> been almost no RDMA code required in the migration subsystem, and all
> > >> the modern features like compression, multifd, post-copy, etc would
> > >> "just work".
> > >>
> > >> I guess the 'rsocket.h' shim may well limit some of the possible
> > >> performance gains, but it might still have been a better tradeoff
> > >> to have not quite so good peak performance, but with massively
> > >> less maint burden.
> > > My understanding so far is RDMA is solely for performance but nothing else,
> > > then it's a question on whethe

Re: [PATCH v3 3/6] migration: Remove 'blk/-b' option from migrate commands

2024-05-02 Thread Peter Xu
On Thu, May 02, 2024 at 01:35:06PM +, Dr. David Alan Gilbert wrote:
> * Markus Armbruster (arm...@redhat.com) wrote:
> > Fabiano Rosas  writes:
> > 
> > > The block migration is considered obsolete and has been deprecated in
> > > 8.2. Remove the migrate command option that enables it. This only
> > > affects the QMP and HMP commands, the feature can still be accessed by
> > > setting the migration 'block' capability. The whole feature will be
> > > removed in a future patch.
> > >
> > > Deprecation commit 8846b5bfca ("migration: migrate 'blk' command
> > > option is deprecated.").
> > >
> > > Reviewed-by: Markus Armbruster 
> > > Signed-off-by: Fabiano Rosas 
> > 
> > [...]
> > 
> > > diff --git a/hmp-commands.hx b/hmp-commands.hx
> > > index 7978302949..ebca2cdced 100644
> > > --- a/hmp-commands.hx
> > > +++ b/hmp-commands.hx
> > > @@ -909,21 +909,17 @@ ERST
> > >  
> > >  {
> > >  .name   = "migrate",
> > > -.args_type  = "detach:-d,blk:-b,resume:-r,uri:s",
> > > -.params = "[-d] [-b] [-r] uri",
> > > +.args_type  = "detach:-d,resume:-r,uri:s",
> > > +.params = "[-d] [-r] uri",
> > >  .help   = "migrate to URI (using -d to not wait for 
> > > completion)"
> > > -   "\n\t\t\t -b for migration without shared storage with"
> > > -   " full copy of disk\n\t\t\t -r to resume a paused 
> > > migration",
> > > +   "\n\t\t\t -r to resume a paused migration",
> > >  .cmd= hmp_migrate,
> > >  },
> > >  
> > >  
> > >  SRST
> > > -``migrate [-d] [-b]`` *uri*
> > > +``migrate [-d]`` *uri*
> > >Migrate to *uri* (using -d to not wait for completion).
> > > -
> > > -  ``-b``
> > > -for migration with full copy of disk
> > >  ERST
> > 
> > Not this patch's fault, but here goes anyway: -r is undocumented here.
> 
> Probably one for Peter I guess.

Yes, and I'll send a patch! :-D

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-01 Thread Peter Xu
On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > What I worry more is whether this is really what we want to keep rdma in
> > qemu, and that's also why I was trying to request for some serious
> > performance measurements comparing rdma v.s. nics.  And here when I said
> > "we" I mean both QEMU community and any company that will support keeping
> > rdma around.
> > 
> > The problem is if NICs now are fast enough to perform at least equally
> > against rdma, and if it has a lower cost of overall maintenance, does it
> > mean that rdma migration will only be used by whoever wants to keep them in
> > the products and existed already?  In that case we should simply ask new
> > users to stick with tcp, and rdma users should only drop but not increase.
> > 
> > It seems also destined that most new migration features will not support
> > rdma: see how much we drop old features in migration now (which rdma
> > _might_ still leverage, but maybe not), and how much we add mostly multifd
> > relevant which will probably not apply to rdma at all.  So in general what
> > I am worrying is a both-loss condition, if the company might be easier to
> > either stick with an old qemu (depending on whether other new features are
> > requested to be used besides RDMA alone), or do periodic rebase with RDMA
> > downstream only.
> 
> I don't know much about the originals of RDMA support in QEMU and why
> this particular design was taken. It is indeed a huge maint burden to
> have a completely different code flow for RDMA with 4000+ lines of
> custom protocol signalling which is barely understandable.
> 
> I would note that /usr/include/rdma/rsocket.h provides a higher level
> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> type could almost[1] trivially have supported RDMA. There would have
> been almost no RDMA code required in the migration subsystem, and all
> the modern features like compression, multifd, post-copy, etc would
> "just work".
> 
> I guess the 'rsocket.h' shim may well limit some of the possible
> performance gains, but it might still have been a better tradeoff
> to have not quite so good peak performance, but with massively
> less maint burden.

My understanding so far is that RDMA is solely for performance and nothing
else, so it's a question of whether existing rdma users would still want it
if it will run slower.

Jinpu mentioned the explicit usage of ib verbs, but I am mostly just quoting
that as I don't really know such details:

https://lore.kernel.org/qemu-devel/camgffem2twjxopcnqtq1sjytf5395dbztcmyikrqfxdzjws...@mail.gmail.com/

So I'm not sure whether that applies here too, in that having a qiochannel
wrapper may not allow direct access to those ib verbs.
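
For illustration, the kind of 1-1 mapping Dan refers to looks roughly like
this; rsocket(), rconnect(), rsend() and rclose() are the real counterparts
from <rdma/rsocket.h>, while the example itself is just a sketch with no
error handling:

    #include <rdma/rsocket.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <stddef.h>

    /* Same shape as socket()/connect()/send()/close(), just r-prefixed. */
    static void rdma_connect_and_send(const char *ip, int port,
                                      const void *buf, size_t len)
    {
        struct sockaddr_in dst = { .sin_family = AF_INET,
                                   .sin_port = htons(port) };
        int fd = rsocket(AF_INET, SOCK_STREAM, 0);

        inet_pton(AF_INET, ip, &dst.sin_addr);
        rconnect(fd, (struct sockaddr *)&dst, sizeof(dst));
        rsend(fd, buf, len, 0);
        rclose(fd);
    }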

Thanks,

> 
> With regards,
> Daniel
> 
> [1] "almost" trivially, because the poll() integration for rsockets
> requires a bit more magic sauce since rsockets FDs are not
> really FDs from the kernel's POV. Still, QIOChannel likely can
> abstract that problem.
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
> 

-- 
Peter Xu


Re: [PATCH v3 0/6] migration removals & deprecations

2024-05-01 Thread Peter Xu
On Tue, Apr 30, 2024 at 11:27:31AM -0300, Fabiano Rosas wrote:
> since v2:
> 
> - removed some more stuff which I missed:
>blk/inc options from hmp-commands.hx
>the entire ram-compress.h
>unused declarations from options.h
>unused compression functions from qemu-file.c
> 
> - removed must_remove_block_options earlier in the 'blk' patch
> 
> - added a deprecation warning to outgoing/incoming fd
> 
> CI run: https://gitlab.com/farosas/qemu/-/pipelines/1272385260
> 
> v2:
> https://lore.kernel.org/r/20240426131408.25410-1-faro...@suse.de
> v1:
> https://lore.kernel.org/r/20240425150939.19268-1-faro...@suse.de
> 
> Hi everyone,
> 
> Here's some cleaning up of deprecated code. It removes the old block
> migration and compression code. Both have suitable replacements in the
> form of the blockdev-mirror driver and multifd compression,
> respectively.
> 
> There's also a deprecation for fd: + file to cope with the fact that
> the new MigrationAddress API defines transports instead of protocols
> (loose terms) like the old string API did. So we cannot map 1:1 from
> fd: to any transport because fd: allows *both* file migration and
> socket migration.
> 
> Fabiano Rosas (6):
>   migration: Remove 'skipped' field from MigrationStats
>   migration: Remove 'inc' option from migrate command
>   migration: Remove 'blk/-b' option from migrate commands
>   migration: Remove block migration
>   migration: Remove non-multifd compression
>   migration: Deprecate fd: for file migration

Reviewed-by: Peter Xu 

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-01 Thread Peter Xu
On Tue, Apr 30, 2024 at 09:00:49AM +0100, Daniel P. Berrangé wrote:
> On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote:
> > Peter Xu  writes:
> > 
> > > On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> > >> Hi All (and Peter),
> > >
> > > Hi, Michael,
> > >
> > >> 
> > >> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> > >> (highly irregular for a male) and yes, that's my real last name:
> > >> https://www.linkedin.com/in/mrgalaxy/)
> > >> 
> > >> I'm the original author of the RDMA implementation. I've been discussing
> > >> with Yu Zhang for a little bit about potentially handing over 
> > >> maintainership
> > >> of the codebase to his team.
> > >> 
> > >> I simply have zero access to RoCE or Infiniband hardware at all,
> > >> unfortunately. so I've never been able to run tests or use what I wrote 
> > >> at
> > >> work, and as all of you know, if you don't have a way to test something,
> > >> then you can't maintain it.
> > >> 
> > >> Yu Zhang put a (very kind) proposal forward to me to ask the community if
> > >> they feel comfortable training his team to maintain the codebase (and run
> > >> tests) while they learn about it.
> > >
> > > The "while learning" part is fine at least to me.  IMHO the "ownership" to
> > > the code, or say, taking over the responsibility, may or may not need 100%
> > > mastering the code base first.  There should still be some fundamental
> > > confidence to work on the code though as a starting point, then it's about
> > > serious use case to back this up, and careful testings while getting more
> > > familiar with it.
> > 
> > How much experience we expect of maintainers depends on the subsystem
> > and other circumstances.  The hard requirement isn't experience, it's
> > trust.  See the recent attack on xz.
> > 
> > I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
> > I'm merely reminding y'all what's at stake.
> 
> I think we shouldn't overly obsess[1] about 'xz', because the overwhelmingly
> common scenario is that volunteer maintainers are honest people. QEMU is
> in a massively better peer review situation. With xz there was basically no
> oversight of the new maintainer. With QEMU, we have oversight from 1000's
> of people on the list, a huge pool of general maintainers, the specific
> migration maintainers, and the release manager merging code.
> 
> With a lack of historical experience with QEMU maintainership, I'd suggest
> that new RDMA volunteers would start by adding themselves to the "MAINTAINERS"
> file with only the 'Reviewer' classification. The main migration maintainers
> would still handle pull requests, but wait for a R-b from one of the RMDA
> volunteers. After some period of time the RDMA folks could graduate to full
> maintainer status if the migration maintainers needed to reduce their load.
> I suspect that might prove unnecessary though, given RDMA isn't an area of
> code with a high turnover of patches.

Right, and we can do that as a start, it also follows our normal rules of
starting from Reviewers to maintain something.  I even considered Zhijian
to be the previous rdma goto guy / maintainer no matter what role he used
to have in the MAINTAINERS file.

Here IMHO it's more about whether any company would like to stand up and
provide help, without yet binding that to be able to send pull requests in
the near future or even longer term.

What I worry more is whether this is really what we want to keep rdma in
qemu, and that's also why I was trying to request for some serious
performance measurements comparing rdma v.s. nics.  And here when I said
"we" I mean both QEMU community and any company that will support keeping
rdma around.

The problem is if NICs now are fast enough to perform at least equally
against rdma, and if it has a lower cost of overall maintenance, does it
mean that rdma migration will only be used by whoever wants to keep them in
the products and existed already?  In that case we should simply ask new
users to stick with tcp, and rdma users should only drop but not increase.

It seems also destined that most new migration features will not support
rdma: see how much we drop old features in migration now (which rdma
_might_ still leverage, but maybe not), and how much we add mostly multifd
relevant which will probably not apply to rdma at all.  So in general what
I am worrying is a both-loss condition, if the company might be easier to
either sti

Re: [PATCH v2 5/6] migration: Remove non-multifd compression

2024-04-29 Thread Peter Xu
On Fri, Apr 26, 2024 at 10:14:07AM -0300, Fabiano Rosas wrote:
> The 'compress' migration capability enables the old compression code
> which has shown issues over the years and is thought to be less stable
> and tested than the more recent multifd-based compression. The old
> compression code has been deprecated in 8.2 and now is time to remove
> it.
> 
> Deprecation commit 864128df46 ("migration: Deprecate old compression
> method").
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu


Re: [PATCH v2 2/6] migration: Remove 'inc' option from migrate command

2024-04-29 Thread Peter Xu
On Fri, Apr 26, 2024 at 10:14:04AM -0300, Fabiano Rosas wrote:
> The block incremental option for block migration has been deprecated
> in 8.2 in favor of using the block-mirror feature. Remove it now.
> 
> Deprecation commit 40101f320d ("migration: migrate 'inc' command
> option is deprecated.").
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu


Re: [PATCH v2 1/6] migration: Remove 'skipped' field from MigrationStats

2024-04-29 Thread Peter Xu
On Fri, Apr 26, 2024 at 10:14:03AM -0300, Fabiano Rosas wrote:
> The 'skipped' field of the MigrationStats struct has been deprecated
> in 8.1. Time to remove it.
> 
> Deprecation commit 7b24d32634 ("migration: skipped field is really
> obsolete.").
> 
> Reviewed-by: Markus Armbruster 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu


Re: [PATCH v2 6/6] migration: Deprecate fd: for file migration

2024-04-29 Thread Peter Xu
On Mon, Apr 29, 2024 at 03:47:39PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Fri, Apr 26, 2024 at 10:14:08AM -0300, Fabiano Rosas wrote:
> >> The fd: URI can currently trigger two different types of migration, a
> >> TCP migration using sockets and a file migration using a plain
> >> file. This is in conflict with the recently introduced (8.2) QMP
> >> migrate API that takes structured data as JSON-like format. We cannot
> >> keep the same backend for both types of migration because with the new
> >> API the code is more tightly coupled to the type of transport. This
> >> means a TCP migration must use the 'socket' transport and a file
> >> migration must use the 'file' transport.
> >> 
> >> If we keep allowing fd: when using a file, this creates an issue when
> >> the user converts the old-style (fd:) to the new style ("transport":
> >> "socket") invocation because the file descriptor in question has
> >> previously been allowed to be either a plain file or a socket.
> >> 
> >> To avoid creating too much confusion, we can simply deprecate the fd:
> >> + file usage, which is thought to be rarely used currently and instead
> >> establish a 1:1 correspondence between fd: URI and socket transport,
> >> and file: URI and file transport.
> >> 
> >> Signed-off-by: Fabiano Rosas 
> >> ---
> >>  docs/about/deprecated.rst | 14 ++
> >>  1 file changed, 14 insertions(+)
> >> 
> >> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
> >> index 0fb5c82640..813f7996fe 100644
> >> --- a/docs/about/deprecated.rst
> >> +++ b/docs/about/deprecated.rst
> >> @@ -464,3 +464,17 @@ both, older and future versions of QEMU.
> >>  The ``blacklist`` config file option has been renamed to ``block-rpcs``
> >>  (to be in sync with the renaming of the corresponding command line
> >>  option).
> >> +
> >> +Migration
> >> +-
> >> +
> >> +``fd:`` URI when used for file migration (since 9.1)
> >> +
> >> +
> >> +The ``fd:`` URI can currently provide a file descriptor that
> >> +references either a socket or a plain file. These are two different
> >> +types of migration. In order to reduce ambiguity, the ``fd:`` URI
> >> +usage of providing a file descriptor to a plain file has been
> >> +deprecated in favor of explicitly using the ``file:`` URI with the
> >> +file descriptor being passed as an ``fdset``. Refer to the ``add-fd``
> >> +command documentation for details on the ``fdset`` usage.
> >
> > Wanna do some warn_report() alongside when non-socket fds are detected?  Looks
> > like we previously did this for all deprecations.
> 
> Yes, good point.
> 
> >
> > What's the plan for when its support is removed?  I'm imagining that we sanity
> > check fstat() + S_ISSOCK on the fd and fail otherwise?  In that case we can
> > have the code there, dump warn_report(), then switch to failing qmp migrate
> > (and the incoming side) later on?
> 
> Something along those lines. We currently use fd_is_socket():
> 
> bool fd_is_socket(int fd)
> {
>     int optval;
>     socklen_t optlen = sizeof(optval);
>     return !getsockopt(fd, SOL_SOCKET, SO_TYPE, &optval, &optlen);
> }
> 
> I'm thinking of this in fd_start_outgoing_migation():
> 
> if (!fd_is_socket(fd)) {
>     warn_report("fd: migration to a file is deprecated."
>                 " Use file: instead.");
> }

Sounds good, perhaps also in fd_start_incoming_migration().
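
For what it's worth, a sketch of the eventual hard check could be as small as
the below (fstat()/S_ISSOCK are standard; the bool/Error/error_setg usage
assumes QEMU's usual qemu/osdep.h and qapi/error.h includes, and the exact
error wording is made up here):

    #include <sys/stat.h>

    static bool fd_check_is_socket(int fd, Error **errp)
    {
        struct stat st;

        /* Reject anything that is not a socket once the deprecation ends. */
        if (fstat(fd, &st) < 0 || !S_ISSOCK(st.st_mode)) {
            error_setg(errp, "fd: URI requires a socket file descriptor");
            return false;
        }
        return true;
    }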

-- 
Peter Xu


Re: [PATCH v2 3/6] migration: Remove 'blk/-b' option from migrate commands

2024-04-29 Thread Peter Xu
On Mon, Apr 29, 2024 at 03:35:02PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Mon, Apr 29, 2024 at 02:18:57PM -0300, Fabiano Rosas wrote:
> >> Peter Xu  writes:
> >> 
> >> > On Fri, Apr 26, 2024 at 10:14:05AM -0300, Fabiano Rosas wrote:
> >> >> @@ -2003,21 +1997,7 @@ static bool migrate_prepare(MigrationState *s, 
> >> >> bool blk, bool resume,
> >> >>  }
> >> >>  }
> >> >>  
> >> >> -if (blk) {

[1]

> >> >> -if (migrate_colo()) {
> >> >> -error_setg(errp, "No disk migration is required in COLO 
> >> >> mode");
> >> >> -return false;
> >> >> -}
> >> >> -if (migrate_block()) {
> >> >> -error_setg(errp, "Command options are incompatible with "
> >> >> -   "current migration capabilities");
> >> >> -return false;
> >> >> -}
> >> >> -if (!migrate_cap_set(MIGRATION_CAPABILITY_BLOCK, true, errp)) {
> >> >> -return false;
> >> >> -}
> >> >> -s->must_remove_block_options = true;
> >> >> -}
> >> >> +s->must_remove_block_options = true;
> >> >
> >> > Can we drop this var too?  Perhaps with block_cleanup_parameters()?
> >> 
> >> Yes, Markus mentioned it in v1 already. Take a look there. There's
> >> several other declarations I missed. v3 is coming soon.
> >
> > Right, noticed that it's removed actually in the next patch.
> >
> > But iiuc it can already been removed in this patch.  If we want to remove
> > it in the next, logically we should set must_remove_block_options=false
> > here, though..  So maybe easier to just drop it here.
> 
> Ah I see what you mean. I thought you're just asking for the removal
> overall.
> 
> But block_cleanup_parameters sets the block capability to false and the
> whole block migration only goes away in the next patch. I think we need
> to keep this as true to preserve behavior. In theory, after this patch
> people could still use the block migration just fine by setting the cap.

I'm reading this patch the same as "blk==false" always above [1].  In that
case, only if we keep must_remove_block_options=false could we maintain the
behavior?  Otherwise at the end of migration (or cancellations) QEMU can
wrongly clear MIGRATION_CAPABILITY_BLOCK bit if the user set it manually?

(But hey, we're discussing whoever applies this patch only without the
 rest..  definitely not a huge deal :)

-- 
Peter Xu


Re: [PATCH v2 6/6] migration: Deprecate fd: for file migration

2024-04-29 Thread Peter Xu
On Fri, Apr 26, 2024 at 10:14:08AM -0300, Fabiano Rosas wrote:
> The fd: URI can currently trigger two different types of migration, a
> TCP migration using sockets and a file migration using a plain
> file. This is in conflict with the recently introduced (8.2) QMP
> migrate API that takes structured data as JSON-like format. We cannot
> keep the same backend for both types of migration because with the new
> API the code is more tightly coupled to the type of transport. This
> means a TCP migration must use the 'socket' transport and a file
> migration must use the 'file' transport.
> 
> If we keep allowing fd: when using a file, this creates an issue when
> the user converts the old-style (fd:) to the new style ("transport":
> "socket") invocation because the file descriptor in question has
> previously been allowed to be either a plain file or a socket.
> 
> To avoid creating too much confusion, we can simply deprecate the fd:
> + file usage, which is thought to be rarely used currently and instead
> establish a 1:1 correspondence between fd: URI and socket transport,
> and file: URI and file transport.
> 
> Signed-off-by: Fabiano Rosas 
> ---
>  docs/about/deprecated.rst | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
> index 0fb5c82640..813f7996fe 100644
> --- a/docs/about/deprecated.rst
> +++ b/docs/about/deprecated.rst
> @@ -464,3 +464,17 @@ both, older and future versions of QEMU.
>  The ``blacklist`` config file option has been renamed to ``block-rpcs``
>  (to be in sync with the renaming of the corresponding command line
>  option).
> +
> +Migration
> +-
> +
> +``fd:`` URI when used for file migration (since 9.1)
> +
> +
> +The ``fd:`` URI can currently provide a file descriptor that
> +references either a socket or a plain file. These are two different
> +types of migration. In order to reduce ambiguity, the ``fd:`` URI
> +usage of providing a file descriptor to a plain file has been
> +deprecated in favor of explicitly using the ``file:`` URI with the
> +file descriptor being passed as an ``fdset``. Refer to the ``add-fd``
> +command documentation for details on the ``fdset`` usage.

Wanna do some warn_report() alongside when a non-socket fd is detected?  Looks
like we previously did this for all deprecations.

What's the plan for when its support is removed?  I'm imagining that we sanity
check fstat() + S_ISSOCK on the fd and fail otherwise?  In that case we can
have the code there now, dump warn_report(), then switch to failing the qmp
migrate (and the incoming side) later on?
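
A minimal sketch of the kind of check described above -- the helper name and
call site are hypothetical; only fstat()/S_ISSOCK and the usual
warn_report()/error_setg_errno() helpers are assumed:

    /* hypothetical helper, assumed to live in migration code where
     * qemu/osdep.h and qemu/error-report.h are already included */
    static bool migration_fd_is_socket(int fd, Error **errp)
    {
        struct stat st;

        if (fstat(fd, &st) < 0) {
            error_setg_errno(errp, errno, "cannot stat migration fd %d", fd);
            return false;
        }
        if (!S_ISSOCK(st.st_mode)) {
            /* deprecation period: only warn; later this (and the incoming
             * side) could turn into a hard failure of the migrate command */
            warn_report("fd: migration with a non-socket fd is deprecated, "
                        "use file: with an fdset instead");
        }
        return true;
    }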

-- 
Peter Xu


Re: [PATCH v2 3/6] migration: Remove 'blk/-b' option from migrate commands

2024-04-29 Thread Peter Xu
On Mon, Apr 29, 2024 at 02:18:57PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Fri, Apr 26, 2024 at 10:14:05AM -0300, Fabiano Rosas wrote:
> >> @@ -2003,21 +1997,7 @@ static bool migrate_prepare(MigrationState *s, bool 
> >> blk, bool resume,
> >>  }
> >>  }
> >>  
> >> -if (blk) {
> >> -if (migrate_colo()) {
> >> -error_setg(errp, "No disk migration is required in COLO 
> >> mode");
> >> -return false;
> >> -}
> >> -if (migrate_block()) {
> >> -error_setg(errp, "Command options are incompatible with "
> >> -   "current migration capabilities");
> >> -return false;
> >> -}
> >> -if (!migrate_cap_set(MIGRATION_CAPABILITY_BLOCK, true, errp)) {
> >> -return false;
> >> -}
> >> -s->must_remove_block_options = true;
> >> -}
> >> +s->must_remove_block_options = true;
> >
> > Can we drop this var too?  Perhaps with block_cleanup_parameters()?
> 
> Yes, Markus mentioned it in v1 already. Take a look there. There's
> several other declarations I missed. v3 is coming soon.

Right, I noticed that it's actually removed in the next patch.

But IIUC it can already be removed in this patch.  If we want to remove
it in the next, logically we should set must_remove_block_options=false
here, though..  So maybe it's easier to just drop it here.

-- 
Peter Xu


Re: [PATCH v2 3/6] migration: Remove 'blk/-b' option from migrate commands

2024-04-29 Thread Peter Xu
On Fri, Apr 26, 2024 at 10:14:05AM -0300, Fabiano Rosas wrote:
> @@ -2003,21 +1997,7 @@ static bool migrate_prepare(MigrationState *s, bool 
> blk, bool resume,
>  }
>  }
>  
> -if (blk) {
> -if (migrate_colo()) {
> -error_setg(errp, "No disk migration is required in COLO mode");
> -return false;
> -}
> -if (migrate_block()) {
> -error_setg(errp, "Command options are incompatible with "
> -   "current migration capabilities");
> -return false;
> -}
> -if (!migrate_cap_set(MIGRATION_CAPABILITY_BLOCK, true, errp)) {
> -return false;
> -}
> -s->must_remove_block_options = true;
> -}
> +s->must_remove_block_options = true;

Can we drop this var too?  Perhaps with block_cleanup_parameters()?

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-29 Thread Peter Xu
On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> Hi All (and Peter),

Hi, Michael,

> 
> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> (highly irregular for a male) and yes, that's my real last name:
> https://www.linkedin.com/in/mrgalaxy/)
> 
> I'm the original author of the RDMA implementation. I've been discussing
> with Yu Zhang for a little bit about potentially handing over maintainership
> of the codebase to his team.
> 
> I simply have zero access to RoCE or Infiniband hardware at all,
> unfortunately. so I've never been able to run tests or use what I wrote at
> work, and as all of you know, if you don't have a way to test something,
> then you can't maintain it.
> 
> Yu Zhang put a (very kind) proposal forward to me to ask the community if
> they feel comfortable training his team to maintain the codebase (and run
> tests) while they learn about it.

The "while learning" part is fine at least to me.  IMHO the "ownership" to
the code, or say, taking over the responsibility, may or may not need 100%
mastering the code base first.  There should still be some fundamental
confidence to work on the code though as a starting point, then it's about
serious use case to back this up, and careful testings while getting more
familiar with it.

> 
> If you don't mind, I'd like to let him send over his (very detailed)
> proposal,

Yes please, it's exactly the time to share the plan.  The hope is that we
reach a consensus before or around the middle of this release (9.1).
Normally QEMU has a 3~4 month window for each release and the 9.1 schedule is
not yet out, but I think it means making a decision before or around the
middle of June.

Thanks,

> 
> - Michael
> 
> On 4/11/24 11:36, Yu Zhang wrote:
> > > 1) Either a CI test covering at least the major RDMA paths, or at least
> > >  periodically tests for each QEMU release will be needed.
> > We use a batch of regression test cases for the stack, which covers the
> > test for QEMU. I did such test for most of the QEMU releases planned as
> > candidates for rollout.
> > 
> > The migration test needs a pair of (either physical or virtual) servers with
> > InfiniBand network, which makes it difficult to do on a single server. The
> > nested VM could be a possible approach, for which we may need virtual
> > InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you 
> > know.
> > 
> > [1]  
> > https://urldefense.com/v3/__https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWvGHE3ig$
> > 
> > Thanks and best regards!
> > 
> > On Thu, Apr 11, 2024 at 4:20 PM Peter Xu  wrote:
> > > On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> > > > On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via 
> > > > wrote:
> > > > > 
> > > > > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > > > > 
> > > > > > > Is there document/link about the unittest/CI for migration tests, 
> > > > > > > Why
> > > > > > > are those tests missing?
> > > > > > > Is it hard or very special to set up an environment for that? 
> > > > > > > maybe we
> > > > > > > can help in this regards.
> > > > > > See tests/qtest/migration-test.c.  We put most of our migration 
> > > > > > tests
> > > > > > there and that's covered in CI.
> > > > > > 
> > > > > > I think one major issue is CI systems don't normally have rdma 
> > > > > > devices.
> > > > > > Can rdma migration test be carried out without a real hardware?
> > > > > Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> > > > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > > > > then we can get a new RDMA interface "rxe_eth0".
> > > > > This new RDMA interface is able to do the QEMU RDMA migration.
> > > > > 
> > > > > Also, the loopback(lo) device is able to emulate the RDMA interface
> > > > > "rxe_lo", however when
> > > > > I tried(years ago) to do RDMA migration over this
> > > > > interface(rdma:127.0.0.1:) , it got something wrong.
> > > > > So i gave up enabling the RDMA migration qtest at that time.
> > > > Thanks, Zhijian.
> > > > 
> > > > I'm not sure adding an emu-link for rdma is doable for CI systems, though.

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-12 Thread Peter Xu
Yu,

On Thu, Apr 11, 2024 at 06:36:54PM +0200, Yu Zhang wrote:
> > 1) Either a CI test covering at least the major RDMA paths, or at least
> > periodically tests for each QEMU release will be needed.
> We use a batch of regression test cases for the stack, which covers the
> test for QEMU. I did such test for most of the QEMU releases planned as
> candidates for rollout.

The least I can think of is a few test runs per release.  Definitely too
few if a single release can already break it..

> 
> The migration test needs a pair of (either physical or virtual) servers with
> InfiniBand network, which makes it difficult to do on a single server. The
> nested VM could be a possible approach, for which we may need virtual
> InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.
> 
> [1]  https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce

Does it require a kernel driver?  The fewer host kernel / hardware /
.. dependencies the better.

I am wondering whether there can be a library doing everything in
userspace, translating RDMA into e.g. socket messages (so maybe ultimately
that's something like IP->rdma->IP.. just to cover the "rdma" procedures),
so that it would work reliably in CI.

Please also see my full list, though, especially entry 4).  Thanks already
for looking into solutions for the tests, but I don't want to waste your time
only to find that tests are not enough even once they're ready.  I think we
need people who understand this stuff well enough, have dedicated time, and
can look after it.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-11 Thread Peter Xu
On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via wrote:
> > 
> > 
> > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > 
> > >> Is there document/link about the unittest/CI for migration tests, Why
> > >> are those tests missing?
> > >> Is it hard or very special to set up an environment for that? maybe we
> > >> can help in this regards.
> > > See tests/qtest/migration-test.c.  We put most of our migration tests
> > > there and that's covered in CI.
> > >
> > > I think one major issue is CI systems don't normally have rdma devices.
> > > Can rdma migration test be carried out without a real hardware?
> > 
> > Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > then we can get a new RDMA interface "rxe_eth0".
> > This new RDMA interface is able to do the QEMU RDMA migration.
> > 
> > Also, the loopback(lo) device is able to emulate the RDMA interface 
> > "rxe_lo", however when
> > I tried(years ago) to do RDMA migration over this 
> > interface(rdma:127.0.0.1:) , it got something wrong.
> > So i gave up enabling the RDMA migration qtest at that time.
> 
> Thanks, Zhijian.
> 
> I'm not sure adding an emu-link for rdma is doable for CI systems, though.
> Maybe someone more familiar with how CI works can chim in.

Some people got dropped from the cc list for an unknown reason; I'm adding them
back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
accident.

I'll try to summarize what is still missing, and I think these will be
greatly helpful if we don't want to deprecate rdma migration:

  1) Either a CI test covering at least the major RDMA paths, or at least
 periodic tests for each QEMU release, will be needed.

  2) Some performance tests comparing modern RDMA and NIC devices are
 welcome.  The current understanding is that a modern NIC can perform
 similarly to RDMA, in which case it's debatable why we still maintain
 so much rdma-specific code.

  3) No need for solid patchsets for this one, but some plan to improve the
 RDMA migration code so that it is not almost isolated from the rest of
 the protocols.

  4) Someone to look after this code for real.

For 2) and 3) more info is here:

https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n

Here 4) can be the most important, as Markus pointed out.  We just haven't
got there yet in the discussions, but maybe Markus is right that we should
talk about that first.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-10 Thread Peter Xu
On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via wrote:
> 
> 
> on 4/10/2024 3:46 AM, Peter Xu wrote:
> 
> >> Is there document/link about the unittest/CI for migration tests, Why
> >> are those tests missing?
> >> Is it hard or very special to set up an environment for that? maybe we
> >> can help in this regards.
> > See tests/qtest/migration-test.c.  We put most of our migration tests
> > there and that's covered in CI.
> >
> > I think one major issue is CI systems don't normally have rdma devices.
> > Can rdma migration test be carried out without a real hardware?
> 
> Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> then we can get a new RDMA interface "rxe_eth0".
> This new RDMA interface is able to do the QEMU RDMA migration.
> 
> Also, the loopback(lo) device is able to emulate the RDMA interface 
> "rxe_lo", however when
> I tried(years ago) to do RDMA migration over this 
> interface(rdma:127.0.0.1:) , it got something wrong.
> So i gave up enabling the RDMA migration qtest at that time.

Thanks, Zhijian.

I'm not sure adding an emu-link for rdma is doable for CI systems, though.
Maybe someone more familiar with how CI works can chime in.
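
(For anyone who wants to experiment, a rough sketch of a software-only setup
on a test box -- the interface names and the final migration step are
assumptions; only the "rdma link add" command comes from Zhijian's mail above:)

    # soft-RoCE (rxe) sketch: emulate an RDMA link without real hardware
    sudo modprobe rdma_rxe                            # assumes the rxe module is available
    sudo rdma link add rxe_eth0 type rxe netdev eth0  # from Zhijian's example
    rdma link show                                    # rxe_eth0 should now be listed
    # then point a migration at eth0's IP, e.g. on the destination:
    #   qemu-system-x86_64 ... -incoming rdma:<eth0-ip>:4444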

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-09 Thread Peter Xu
On Tue, Apr 09, 2024 at 09:32:46AM +0200, Jinpu Wang wrote:
> Hi Peter,
> 
> On Mon, Apr 8, 2024 at 6:18 PM Peter Xu  wrote:
> >
> > On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > > Hi Peter,
> >
> > Jinpu,
> >
> > Thanks for joining the discussion.
> >
> > >
> > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
> > > >
> > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > > Hello Peter und Zhjian,
> > > > >
> > > > > Thank you so much for letting me know about this. I'm also a bit 
> > > > > surprised at
> > > > > the plan for deprecating the RDMA migration subsystem.
> > > >
> > > > It's not too late, since it looks like we do have users not yet notified
> > > > from this, we'll redo the deprecation procedure even if it'll be the 
> > > > final
> > > > plan, and it'll be 2 releases after this.
> > > >
> > > > >
> > > > > > IMHO it's more important to know whether there are still users and 
> > > > > > whether
> > > > > > they would still like to see it around.
> > > > >
> > > > > > I admit RDMA migration was lack of testing(unit/CI test), which led 
> > > > > > to the a few
> > > > > > obvious bugs being noticed too late.
> > > > >
> > > > > Yes, we are a user of this subsystem. I was unaware of the lack of 
> > > > > test coverage
> > > > > for this part. As soon as 8.2 was released, I saw that many of the
> > > > > migration test
> > > > > cases failed and came to realize that there might be a bug between 8.1
> > > > > and 8.2, but
> > > > > was unable to confirm and report it quickly to you.
> > > > >
> > > > > The maintenance of this part could be too costly or difficult from
> > > > > your point of view.
> > > >
> > > > It may or may not be too costly, it's just that we need real users of 
> > > > RDMA
> > > > taking some care of it.  Having it broken easily for >1 releases 
> > > > definitely
> > > > is a sign of lack of users.  It is an implication to the community that 
> > > > we
> > > > should consider dropping some features so that we can get the best use 
> > > > of
> > > > the community resources for the things that may have a broader audience.
> > > >
> > > > One thing majorly missing is a RDMA tester to guard all the merges to 
> > > > not
> > > > break RDMA paths, hopefully in CI.  That should not rely on RDMA 
> > > > hardwares
> > > > but just to sanity check the migration+rdma code running all fine.  RDMA
> > > > taught us the lesson so we're requesting CI coverage for all other new
> > > > features that will be merged at least for migration subsystem, so that 
> > > > we
> > > > plan to not merge anything that is not covered by CI unless extremely
> > > > necessary in the future.
> > > >
> > > > For sure CI is not the only missing part, but I'd say we should start 
> > > > with
> > > > it, then someone should also take care of the code even if only in
> > > > maintenance mode (no new feature to add on top).
> > > >
> > > > >
> > > > > My concern is, this plan will forces a few QEMU users (not sure how
> > > > > many) like us
> > > > > either to stick to the RDMA migration by using an increasingly older
> > > > > version of QEMU,
> > > > > or to abandon the currently used RDMA migration.
> > > >
> > > > RDMA doesn't get new features anyway, if there's specific use case for 
> > > > RDMA
> > > > migrations, would it work if such a scenario uses the old binary?  Is it
> > > > possible to switch to the TCP protocol with some good NICs?
> > > We have used rdma migration with HCA from Nvidia for years, our
> > > experience is RDMA migration works better than tcp (over ipoib).
> >
> > Please bare with me, as I know little on rdma stuff.
> >
> > I'm actually pretty confused (and since a long time ago..) on why we need
> > to operation with rdma contexts when ipoib seems to provide all the tcp
> > layers.  I meant, can it work with the current "tcp:" protocol with ipoib
> > e

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-08 Thread Peter Xu
On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> Hi Peter,

Jinpu,

Thanks for joining the discussion.

> 
> On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
> >
> > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > Hello Peter und Zhjian,
> > >
> > > Thank you so much for letting me know about this. I'm also a bit 
> > > surprised at
> > > the plan for deprecating the RDMA migration subsystem.
> >
> > It's not too late, since it looks like we do have users not yet notified
> > from this, we'll redo the deprecation procedure even if it'll be the final
> > plan, and it'll be 2 releases after this.
> >
> > >
> > > > IMHO it's more important to know whether there are still users and 
> > > > whether
> > > > they would still like to see it around.
> > >
> > > > I admit RDMA migration was lack of testing(unit/CI test), which led to 
> > > > the a few
> > > > obvious bugs being noticed too late.
> > >
> > > Yes, we are a user of this subsystem. I was unaware of the lack of test 
> > > coverage
> > > for this part. As soon as 8.2 was released, I saw that many of the
> > > migration test
> > > cases failed and came to realize that there might be a bug between 8.1
> > > and 8.2, but
> > > was unable to confirm and report it quickly to you.
> > >
> > > The maintenance of this part could be too costly or difficult from
> > > your point of view.
> >
> > It may or may not be too costly, it's just that we need real users of RDMA
> > taking some care of it.  Having it broken easily for >1 releases definitely
> > is a sign of lack of users.  It is an implication to the community that we
> > should consider dropping some features so that we can get the best use of
> > the community resources for the things that may have a broader audience.
> >
> > One thing majorly missing is a RDMA tester to guard all the merges to not
> > break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> > but just to sanity check the migration+rdma code running all fine.  RDMA
> > taught us the lesson so we're requesting CI coverage for all other new
> > features that will be merged at least for migration subsystem, so that we
> > plan to not merge anything that is not covered by CI unless extremely
> > necessary in the future.
> >
> > For sure CI is not the only missing part, but I'd say we should start with
> > it, then someone should also take care of the code even if only in
> > maintenance mode (no new feature to add on top).
> >
> > >
> > > My concern is, this plan will forces a few QEMU users (not sure how
> > > many) like us
> > > either to stick to the RDMA migration by using an increasingly older
> > > version of QEMU,
> > > or to abandon the currently used RDMA migration.
> >
> > RDMA doesn't get new features anyway, if there's specific use case for RDMA
> > migrations, would it work if such a scenario uses the old binary?  Is it
> > possible to switch to the TCP protocol with some good NICs?
> We have used rdma migration with HCA from Nvidia for years, our
> experience is RDMA migration works better than tcp (over ipoib).

Please bear with me, as I know little about rdma stuff.

I'm actually pretty confused (and have been since a long time ago..) about why
we need to operate with rdma contexts when ipoib seems to provide all the tcp
layers.  I mean, can it work with the current "tcp:" protocol with ipoib
even if there's rdma/ib hardware underneath?  Is it because of performance
improvements that we must use a separate path compared to the generic
"tcp:" protocol here?

> 
> Switching back to TCP will lead us to the old problems which was
> solved by RDMA migration.

Can you elaborate on the problems, and why tcp won't work in this case?  They
may not be directly relevant to the issue we're discussing, but I'm happy
to learn more.

What NICs were you testing with before?  Was the test carried out with
modern ones (50Gbps-200Gbps NICs), or was it done when
such hardware was not yet common?

Per my recent knowledge of the new Intel hardware, at least the parts that
support QPL, it's easy to achieve 50Gbps+ on a single core.

https://lore.kernel.org/r/ph7pr11mb5941a91ac1e514bcc32896a6a3...@ph7pr11mb5941.namprd11.prod.outlook.com

Quote from Yuan:

  Yes, I use iperf3 to check the bandwidth for one core, the bandwith is 60Gbps.
  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
  [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
  [  5]   1.00-2.00 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-02 Thread Peter Xu
On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> Hello Peter und Zhjian,
> 
> Thank you so much for letting me know about this. I'm also a bit surprised at
> the plan for deprecating the RDMA migration subsystem.

It's not too late: since it looks like we do have users not yet notified
of this, we'll redo the deprecation procedure even if it'll be the final
plan, and it'll be 2 releases after this.

> 
> > IMHO it's more important to know whether there are still users and whether
> > they would still like to see it around.
> 
> > I admit RDMA migration was lack of testing(unit/CI test), which led to the 
> > a few
> > obvious bugs being noticed too late.
> 
> Yes, we are a user of this subsystem. I was unaware of the lack of test 
> coverage
> for this part. As soon as 8.2 was released, I saw that many of the
> migration test
> cases failed and came to realize that there might be a bug between 8.1
> and 8.2, but
> was unable to confirm and report it quickly to you.
> 
> The maintenance of this part could be too costly or difficult from
> your point of view.

It may or may not be too costly; it's just that we need real users of RDMA
taking some care of it.  Having it break easily for more than one release is
definitely a sign of a lack of users.  It is an indication to the community
that we should consider dropping some features so that we can get the best use
of the community resources for the things that may have a broader audience.

One major missing thing is an RDMA tester to guard all the merges so they do
not break RDMA paths, hopefully in CI.  That should not rely on RDMA hardware
but just sanity check that the migration+rdma code runs fine.  RDMA
taught us the lesson, so we're requesting CI coverage for all other new
features that will be merged, at least for the migration subsystem; the plan
is to not merge anything that is not covered by CI unless extremely
necessary in the future.

For sure CI is not the only missing part, but I'd say we should start with
it, then someone should also take care of the code even if only in
maintenance mode (no new feature to add on top).

> 
> My concern is, this plan will forces a few QEMU users (not sure how
> many) like us
> either to stick to the RDMA migration by using an increasingly older
> version of QEMU,
> or to abandon the currently used RDMA migration.

RDMA doesn't get new features anyway; if there's a specific use case for RDMA
migrations, would it work if such a scenario uses the old binary?  Is it
possible to switch to the TCP protocol with some good NICs?

To the best of our knowledge, RDMA users are rare; please let us know if
you are aware of such users.  IIUC the major reason RDMA stopped being
the trend is that networking is not what it was ten years ago; I don't claim
good knowledge of RDMA or networking, but my understanding is that
it's pretty easy to get a modern NIC that outperforms RDMA, and then it may
make little sense to maintain multiple protocols, considering the RDMA
migration code is so special that it has the most custom code compared to the
other protocols.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-03-28 Thread Peter Xu
On Thu, Mar 28, 2024 at 04:22:27PM +0100, Thomas Huth wrote:
> Since e9a54265f5 was not very clear about rdma migration code, should we
> maybe rather add a separate deprecation note for the migration part, and add
> a proper warning message to the migration code in case someone tries to use
> it there, and then only remove the rdma migration code after two more
> releases?

Definitely a valid option to me.

So far RDMA isn't covered in tests (actually the same goes for COLO, and I
wonder about our position on COLO too in this case..), so unfortunately we
won't even know when it breaks, just like before.

From the other activity I can see when new code comes in, maintaining the RDMA
code should be fairly manageable so far (and whoever writes new rdma
code in those two releases will also need to take the maintainer's
role). We did it for all those years, and we can keep that up for two more
releases. Hopefully such warnings can ring a louder alarm for the current
users, so that people can either stick with old binaries, or invest
developer/test resources in the community.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-03-28 Thread Peter Xu
On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote:
> Philippe Mathieu-Daudé  writes:
> 
> > The whole RDMA subsystem was deprecated in commit e9a54265f5
> > ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
> > released in v8.2.
> >
> > Remove:
> >  - RDMA handling from migration
> >  - dependencies on libibumad, libibverbs and librdmacm
> >
> > Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
> > in old migration streams.
> >
> > Cc: Peter Xu 
> > Cc: Li Zhijian 
> > Acked-by: Fabiano Rosas 
> > Signed-off-by: Philippe Mathieu-Daudé 
> 
> Just to be clear, because people raised the point in the last version,
> the first link in the deprecation commit links to a thread comprising
> entirely of rdma migration patches. I don't see any ambiguity on whether
> the deprecation was intended to include migration. There's even an ack
> from Juan.

Yes I remember that's the plan.

> 
> So on the basis of not reverting the previous maintainer's decision, my
> Ack stands here.
> 
> We also had pretty obvious bugs ([1], [2]) in the past that would have
> been caught if we had any kind of testing for the feature, so I can't
> even say this thing works currently.
> 
> @Peter Xu, @Li Zhijian, what are your thoughts on this?

Generally I definitely agree with such a removal sooner or later, as that's
how deprecation works, and even after Juan left I'm not aware of any
other new RDMA users.  Personally, I'd slightly prefer postponing it one
more release, which might help a bit with our downstream maintenance; however,
I assume that's not a blocker either, as I think we can also manage it.

IMHO it's more important to know whether there are still users and whether
they would still like to see it around. That's also one thing I notice:
e9a54265f533f didn't yet get acks from the RDMA users that we are aware of,
even if they're rare. According to [2] it could be that such users only rely
on the release versions of QEMU, which is when it broke things.

So I'm copying Yu too (while Zhijian is already in the loop), just in case
someone would like to stand up and speak.

Thanks,

> 
> 1- https://lore.kernel.org/r/20230920090412.726725-1-lizhij...@fujitsu.com
> 2- 
> https://lore.kernel.org/r/cahecvy7hxswn4ow_kog+q+tn6f_kmeichevz1qgm-fbxbpp...@mail.gmail.com
> 

-- 
Peter Xu