Re: [PATCH] qemu: add support for qemu switchover-ack

2024-06-20 Thread Peter Xu
On Thu, Jun 20, 2024 at 07:45:42PM +, Jon Kohler wrote:
> 
> 
> > On Jun 20, 2024, at 4:30 AM, Jiri Denemark  wrote:
> > 
> > 
> > On Tue, Jun 18, 2024 at 16:14:29 +0100, Daniel P. Berrangé wrote:
> >> On Tue, Jun 18, 2024 at 08:06:06AM -0700, Jon Kohler wrote:
> >>> diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
> >>> index 2f5b01bbfe..9543629f30 100644
> >>> --- a/include/libvirt/libvirt-domain.h
> >>> +++ b/include/libvirt/libvirt-domain.h
> >>> @@ -1100,6 +1100,17 @@ typedef enum {
> >>>  * Since: 8.5.0
> >>>  */
> >>> VIR_MIGRATE_ZEROCOPY = (1 << 20),
> >>> +
> >>> +/* Use switchover ack migration capability to reduce downtime on VFIO
> >>> + * device migration. This prevents the source from stopping the VM and
> >>> + * completing the migration until an ACK is received from the destination
> >>> + * that it's OK to do so. Thus, a VFIO device can make sure that its
> >>> + * initial bytes were sent and loaded in the destination before the
> >>> + * source VM is stopped.
> >>> + *
> >>> + * Since: 10.5.0
> >>> + */
> >>> +VIR_MIGRATE_SWITCHOVER_ACK = (1 << 21),
> >>> } virDomainMigrateFlags;
> >> 
> >> Do we really need a flag for this ?  Is there a credible scenario
> >> in which this flag works, and yet shouldn't be used by libvirt ?
> >> 
> >> IOW, can we just "do the right thing" and always enable this,
> >> except for TUNNELLED mode.
> > 
> > I discussed this capability some time ago with Peter (I think) and IIRC
> > there was some downside when the capability is enabled for domains
> > that do not use VFIO. I don't remember exactly what it was about, but
> > perhaps introducing an extra delay in migration switchover? Peter, can
> > you add the details, please?
> 
> Thanks - @Peter, if you have additional info on that, would love to know
> what the non-VFIO downsides are here. 

So far, VFIO is the only one that will register this "ACK needed" hook.
When nobody registers with it, the ACK will be sent up front when the return
path is established. That happens at the very beginning of a migration, and
the ACK is completely meaningless in that case.

That said, it may not be too bad to have that meaningless ACK if it
simplifies Libvirt.  It only happens once per migration, and after it is sent
once, everything works exactly the same as when switchover-ack is not enabled.
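
For reference, a minimal sketch (not part of this patch) of how an application
could request the behavior through the libvirt API once the flag above lands;
virDomainMigrateToURI3() and VIR_MIGRATE_LIVE are existing libvirt API, while
VIR_MIGRATE_SWITCHOVER_ACK is what this series proposes:

    #include <libvirt/libvirt.h>

    /* Sketch only: live-migrate a domain and ask for switchover-ack. */
    int migrate_with_switchover_ack(virDomainPtr dom, const char *dsturi)
    {
        unsigned int flags = VIR_MIGRATE_LIVE | VIR_MIGRATE_SWITCHOVER_ACK;

        /* Returns 0 on success, -1 on error (per the libvirt convention). */
        return virDomainMigrateToURI3(dom, dsturi, NULL, 0, flags);
    }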

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-05 Thread Peter Xu
On Wed, Jun 05, 2024 at 08:48:28PM +, Dr. David Alan Gilbert wrote:
> > > I just noticed this thread; some random notes from a somewhat
> > > fragmented memory of this:
> > > 
> > >   a) Long long ago, I also tried rsocket; 
> > >   https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
> > >  as I remember the library was quite flaky at the time.
> > 
> > Hmm, interesting.  It also looks like there's a thread doing rpoll().
> 
> Yeh, I can't actually remember much more about what I did back then!

Heh, that's understandable and fair. :)

> > I hope Lei and his team have tested >4G mem, otherwise it's definitely worth
> > checking.  Lei also mentioned there are rsocket bugs they found in the cover
> > letter, but I'm not sure what that's about.
> 
> It would probably be a good idea to keep track of what bugs
> are in flight with it, and try it on a few RDMA cards to see
> what problems get triggered.
> I think I reported a few at the time, but I gave up after
> feeling it was getting very hacky.

Agreed.  Maybe we can have a list of those in the cover letter or even
QEMU's migration/rdma doc page.

Lei, if you think that makes sense please do so in your upcoming posts.
There'll need to be a list of things you encountered in the kernel driver,
and it'll be even better if there are further links to read on each problem.

> > > 
> > >   e) Someone made a good suggestion (sorry can't remember who) - that the
> > >  RDMA migration structure was the wrong way around - it should be the
> > >  destination which initiates an RDMA read, rather than the source
> > >  doing a write; then things might become a LOT simpler; you just need
> > >  to send page ranges to the destination and it can pull it.
> > >  That might work nicely for postcopy.
> > 
> > I'm not sure whether it'll still be a problem if the rdma recv side is based
> > on zero-copy.  It's a matter of whether atomicity can be guaranteed, so that
> > the guest vcpus never see a partially copied page during in-flight DMAs.
> > UFFDIO_COPY (or friends) is currently the only solution for that.
> 
> Yes, but even ignoring that (and the UFFDIO_CONTINUE idea you mention), if
> the destination can issue an RDMA read itself, it doesn't need to send 
> messages
> to the source to ask for a page fetch; it just goes and grabs it itself,
> that's got to be good for latency.

Oh, that's pretty internal stuff of rdma to me and beyond my knowledge..
but from what I can tell it sounds very reasonable indeed!

Thanks!

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-05 Thread Peter Xu
On Wed, Jun 05, 2024 at 10:10:57AM -0400, Peter Xu wrote:
> >   e) Someone made a good suggestion (sorry can't remember who) - that the
> >  RDMA migration structure was the wrong way around - it should be the
> >  destination which initiates an RDMA read, rather than the source
> >  doing a write; then things might become a LOT simpler; you just need
> >  to send page ranges to the destination and it can pull it.
> >  That might work nicely for postcopy.
> 
> I'm not sure whether it'll still be a problem if the rdma recv side is based
> on zero-copy.  It's a matter of whether atomicity can be guaranteed, so that
> the guest vcpus never see a partially copied page during in-flight DMAs.
> UFFDIO_COPY (or friends) is currently the only solution for that.

And when thinking about this (given UFFDIO_COPY by nature cannot do
zero-copy...), the only way to get zero-copy is to use file-backed memory
(shmem/hugetlbfs), as the page cache can be prepopulated.  Then when we do DMA
we go through the page cache, which can be mapped at another virtual address
besides the one the vcpus are using.

Then we can use UFFDIO_CONTINUE (rather than UFFDIO_COPY) to do atomic
updates of the vcpu pgtables, avoiding the copy.  QEMU doesn't have that yet,
but it looks like one more reason we may want to make better use of shmem
rather than anonymous memory.  And actually, when working on 4k faults on 1G
hugetlb I added CONTINUE support:

https://github.com/xzpeter/qemu/tree/doublemap
https://github.com/xzpeter/qemu/commit/b8aff3a9d7654b1cf2c089a06894ff4899740dc5
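
For reference, a minimal sketch of the UFFDIO_CONTINUE idea described above
(not QEMU code): the page data is written into the shmem page cache through a
second "alias" mapping, then the vcpu-visible page tables are populated
atomically without another copy.  It assumes a userfaultfd already registered
with UFFDIO_REGISTER_MODE_MINOR on the guest mapping (Linux 5.13+); error
handling is omitted:

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <string.h>

    static int install_page(int uffd, void *alias_addr, void *guest_addr,
                            const void *data, size_t pagesize)
    {
        struct uffdio_continue cont;

        /* 1) Fill the shmem page cache via the alias mapping (memcpy or DMA). */
        memcpy(alias_addr, data, pagesize);

        /* 2) Atomically map that page-cache page at the guest-visible address. */
        cont.range.start = (unsigned long)guest_addr;
        cont.range.len = pagesize;
        cont.mode = 0;
        return ioctl(uffd, UFFDIO_CONTINUE, &cont);
    }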

Maybe it's worthwhile on its own now, because it also means we can use that
in multifd to avoid one extra layer of buffering when supporting
multifd+postcopy (which has the same issue of directly copying data into
guest pages).  It'll also work with things like rdma, I think, in similar
ways.  It's just that it won't work on anonymous memory.

I definitely hijacked the thread to somewhere too far away.  I'll stop
here..

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-05 Thread Peter Xu
Hey, Dave!

On Wed, Jun 05, 2024 at 12:31:56AM +, Dr. David Alan Gilbert wrote:
> * Michael Galaxy (mgal...@akamai.com) wrote:
> > One thing to keep in mind here (despite me not having any hardware to test)
> > was that one of the original goals here in the RDMA implementation was not
> > simply raw throughput nor raw latency, but a lack of CPU utilization in
> > kernel space due to the offload. While it is entirely possible that newer
> > hardware w/ TCP might compete, the significant reductions in CPU usage in
> > the TCP/IP stack were a big win at the time.
> > 
> > Just something to consider while you're doing the testing
> 
> I just noticed this thread; some random notes from a somewhat
> fragmented memory of this:
> 
>   a) Long long ago, I also tried rsocket; 
>   https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
>  as I remember the library was quite flaky at the time.

Hmm, interesting.  It also looks like there's a thread doing rpoll().

Btw, not sure whether you noticed, but there's the series posted for the
latest rsocket conversion here:

https://lore.kernel.org/r/1717503252-51884-1-git-send-email-arei.gong...@huawei.com

I hope Lei and his team have tested >4G mem, otherwise it's definitely worth
checking.  Lei also mentioned there are rsocket bugs they found in the cover
letter, but I'm not sure what that's about.

> 
>   b) A lot of the complexity in the rdma migration code comes from
> emulating a stream to carry the migration control data and interleaving
> that with the actual RAM copy.   I believe the original design used
> a separate TCP socket for the control data, and just used the RDMA
> for the data - that should be a lot simpler (but alas was rejected
> in review early on)
> 
>   c) I can't remember the last benchmarks I did; but I think I did
> manage to beat RDMA with multifd; but yes, multifd does eat host CPU
> whereas RDMA barely uses a whisper.

I think my first impression on this matter came from you on this one. :)

> 
>   d) The 'zero-copy-send' option in migrate may well get some of that
>  CPU time back; but if I remember we were still bottle necked on
>  the receive side. (I can't remember if zero-copy-send worked with
>  multifd?)

Yes, and zero-copy requires multifd for now. I think it's because we didn't
want to complicate the header processing in the migration stream, which may
not be page aligned.
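
For context, a minimal sketch of the kernel mechanism that 'zero-copy-send'
builds on (SO_ZEROCOPY / MSG_ZEROCOPY, Linux 4.14+ and a recent glibc); this
is not the QEMU code, and reaping the completion notifications from the
socket error queue is omitted:

    #include <sys/socket.h>

    /* Enable zero-copy on the socket (idempotent), then flag the send. */
    static ssize_t send_page_zerocopy(int sockfd, const void *page, size_t len)
    {
        int one = 1;

        setsockopt(sockfd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
        return send(sockfd, page, len, MSG_ZEROCOPY);
    }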

> 
>   e) Someone made a good suggestion (sorry can't remember who) - that the
>  RDMA migration structure was the wrong way around - it should be the
>  destination which initiates an RDMA read, rather than the source
>  doing a write; then things might become a LOT simpler; you just need
>  to send page ranges to the destination and it can pull it.
>  That might work nicely for postcopy.

I'm not sure whether it'll still be a problem if rdma recv side is based on
zero-copy.  It would be a matter of whether atomicity can be guaranteed so
that we don't want the guest vcpus to see a partially copied page during
on-flight DMAs.  UFFDIO_COPY (or friend) is currently the only solution for
that.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Peter Xu
Lei,

On Wed, May 29, 2024 at 02:43:46AM +, Gonglei (Arei) wrote:
> For rdma programming, the current mainstream implementation is to use
> rdma_cm to establish a connection, and then use verbs to transmit data.
> rdma_cm and ibverbs create two FDs respectively. The two FDs have
> different responsibilities. rdma_cm fd is used to notify connection
> establishment events, and verbs fd is used to notify new CQEs. When
> poll/epoll monitoring is directly performed on the rdma_cm fd, only a
> pollin event can be monitored, which means that an rdma_cm event
> occurs. When the verbs fd is directly polled/epolled, only the pollin
> event can be listened for, which indicates that a new CQE has been generated.
>
> Rsocket is a sub-module attached to the rdma_cm library and provides
> rdma calls that are completely similar to socket interfaces. However,
> this library returns only the rdma_cm fd for listening to link
> setup-related events and does not expose the verbs fd (readable and
> writable events for listening to data). Only the rpoll interface provided
> by the RSocket can be used to listen to related events. However, QEMU
> uses the ppoll interface to listen to the rdma_cm fd (obtained via the
> raccept API), and cannot listen to the verbs fd event. Only some hacking
> methods can be used to address this problem.  Do you guys have any ideas?
> Thanks.

I saw that you mentioned this elsewhere:

> Right. But the question is that QEMU does not use rpoll but glib's ppoll. :(

So what I'm thinking may not make much sense; as I mentioned, I don't think
I know rdma at all.. and my idea also involves coroutine stuff, which I don't
know well either. But just in case it sheds some light in some form.

IIUC we do iochannel blocking with this for both read and write:

    if (len == QIO_CHANNEL_ERR_BLOCK) {
        if (qemu_in_coroutine()) {
            qio_channel_yield(ioc, G_IO_XXX);
        } else {
            qio_channel_wait(ioc, G_IO_XXX);
        }
        continue;
    }

One thing I'm wondering is whether we can provide a new feature bit for
qiochannel, e.g., QIO_CHANNEL_FEATURE_POLL, so that the iochannel can
define its own poll routine rather than using the default when possible.

I think it may not work if it's in a coroutine, as I guess that'll block
other fds from being woken up.  Hence it should look like this:

    if (len == QIO_CHANNEL_ERR_BLOCK) {
        if (qemu_in_coroutine()) {
            qio_channel_yield(ioc, G_IO_XXX);
        } else if (qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_POLL)) {
            qio_channel_poll(ioc, G_IO_XXX);
        } else {
            qio_channel_wait(ioc, G_IO_XXX);
        }
        continue;
    }

Maybe we even want to forbid such a channel from being used in a coroutine
entirely, as having QIO_CHANNEL_FEATURE_POLL set may mean that this iochannel
simply won't work with poll(), as in rdma's use case.

Then rdma iochannel can implement qio_channel_poll() using rpoll().
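
As a rough sketch of that idea (the QIO_CHANNEL_FEATURE_POLL bit and the
QIOChannelRDMA type/field names are hypothetical here, following the naming
above; rpoll() and struct pollfd are real APIs from <rdma/rsocket.h> and
<poll.h>):

    #include <rdma/rsocket.h>
    #include <poll.h>
    #include <glib.h>

    /* Hypothetical channel type for the proposed io/channel-rdma.c. */
    typedef struct QIOChannelRDMA {
        int rsock;   /* fd returned by rsocket()/raccept() */
    } QIOChannelRDMA;

    static int qio_channel_rdma_poll(QIOChannelRDMA *rioc, GIOCondition cond)
    {
        struct pollfd pfd = {
            .fd = rioc->rsock,
            .events = ((cond & G_IO_IN) ? POLLIN : 0) |
                      ((cond & G_IO_OUT) ? POLLOUT : 0),
        };

        /* Block until the rsocket fd is ready; -1 means no timeout.  rpoll()
         * handles rsocket fds that plain poll()/ppoll() cannot. */
        return rpoll(&pfd, 1, -1);
    }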

There's one other dependent issue here in that I _think_ the migration recv
side is still in a coroutine.. so we may need to move that into a thread
first.  IIRC we don't yet have a major blocker to do that, but I didn't
further check either.  I've put that issue aside just to see whether this
may or may not make sense.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-28 Thread Peter Xu
On Tue, May 28, 2024 at 09:06:04AM +, Gonglei (Arei) wrote:
> Hi Peter,
> 
> > -Original Message-
> > From: Peter Xu [mailto:pet...@redhat.com]
> > Sent: Wednesday, May 22, 2024 6:15 AM
> > To: Yu Zhang 
> > Cc: Michael Galaxy ; Jinpu Wang
> > ; Elmar Gerdes ;
> > zhengchuan ; Gonglei (Arei)
> > ; Daniel P. Berrangé ;
> > Markus Armbruster ; Zhijian Li (Fujitsu)
> > ; qemu-de...@nongnu.org; Yuval Shaia
> > ; Kevin Wolf ; Prasanna
> > Kumar Kalever ; Cornelia Huck
> > ; Michael Roth ; Prasanna
> > Kumar Kalever ; Paolo Bonzini
> > ; qemu-bl...@nongnu.org; devel@lists.libvirt.org;
> > Hanna Reitz ; Michael S. Tsirkin ;
> > Thomas Huth ; Eric Blake ; Song
> > Gao ; Marc-André Lureau
> > ; Alex Bennée ;
> > Wainer dos Santos Moschetta ; Beraldo Leal
> > ; Pannengyuan ;
> > Xiexiangyou ; Fabiano Rosas 
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> > 
> > On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> > > Hello Michael and Peter,
> > 
> > Hi,
> > 
> > >
> > > Exactly, not so compelling, as I did it first only on servers widely
> > > used for production in our data center. The network adapters are
> > >
> > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> > > 2-port Gigabit Ethernet PCIe
> > 
> > Hmm... I definitely think Jinpu's Mellanox ConnectX-6 looks more
> > reasonable.
> > 
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > wvaqk81vxtkzx-l...@mail.gmail.com/
> > 
> > Appreciate a lot for everyone helping on the testings.
> > 
> > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > [ConnectX-5]
> > >
> > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> > > on these two hosts. One is standby while the other is active.
> > >
> > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > network adapters. One of them has:
> > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > >
> > > The comparison between RDMA and TCP on the same NIC could make more
> > sense.
> > 
> > It looks to me that NICs are powerful now, but again, as I mentioned, I
> > don't think that's a reason we need to deprecate rdma, especially if QEMU's
> > rdma migration has a chance to be refactored using rsocket.
> > 
> > Has anyone started looking into that direction?  Would it make sense to
> > start some PoC now?
> > 
> 
> My team has finished the PoC refactoring, which works well.
> 
> Progress:
> 1.  Implement io/channel-rdma.c,
> 2.  Add unit test tests/unit/test-io-channel-rdma.c and verify it is successful,
> 3.  Remove the original code from migration/rdma.c,
> 4.  Rewrite the rdma_start_outgoing_migration and rdma_start_incoming_migration logic,
> 5.  Remove all rdma_xxx functions from migration/ram.c (to prevent RDMA live
> migration from polluting the core logic of live migration),
> 6.  Use the software-implemented soft-RoCE to test RDMA live migration. It's successful.
> 
> We will submit the patchset later.

That's great news, thank you!

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-21 Thread Peter Xu
On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> Hello Michael and Peter,

Hi,

> 
> Exactly, not so compelling, as I did it first only on servers widely
> used for production in our data center. The network adapters are
> 
> Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> 2-port Gigabit Ethernet PCIe

Hmm... I definitely think Jinpu's Mellanox ConnectX-6 looks more
reasonable.

https://lore.kernel.org/qemu-devel/camgffen-dkpmz4ta71mjydyemg0zda15wvaqk81vxtkzx-l...@mail.gmail.com/

Appreciate a lot for everyone helping on the testings.

> InfiniBand controller: Mellanox Technologies MT27800 Family [ConnectX-5]
> 
> which doesn't meet our purpose. I can choose RDMA or TCP for VM
> migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> on these two hosts. One is standby while the other is active.
> 
> Now I'll try on a server with more recent Ethernet and InfiniBand
> network adapters. One of them has:
> BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> 
> The comparison between RDMA and TCP on the same NIC could make more sense.

It looks to me that NICs are powerful now, but again, as I mentioned, I don't
think that's a reason we need to deprecate rdma, especially if QEMU's rdma
migration has a chance to be refactored using rsocket.

Has anyone started looking into that direction?  Would it make sense to
start some PoC now?

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-09 Thread Peter Xu
On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:
> That's good news, the socket abstraction for RDMA!
> When I developed the series above, the biggest pain was that RDMA migration
> has no QIOChannel abstraction and I needed to use a 'fake channel' for it,
> which is awkward in the code implementation.
> So, as far as I know, we can do this by:
> i. first evaluate whether rsocket is good enough to satisfy our QIOChannel
> fundamental abstraction,
> ii. if it works right, then continue to see if it gives us the opportunity
> to hide the details of the rdma protocol inside rsocket by removing most of
> the code in rdma.c and also some hacks in the migration main process,
> iii. implement the advanced features like multi-fd and multi-uri for rdma
> migration.
> 
> Since I am not familiar with rsocket, I need some time to look at it and do
> some quick verification of rdma migration based on rsocket.
> But, yes, I am willing to be involved in this refactoring work and to see if
> we can make this migration feature better :)

Based on what we have now, it looks like we'd better halt the deprecation
process a bit, so I think we shouldn't need to rush it at least in 9.1
then, and we'll need to see how it goes on the refactoring.

It'll be perfect if rsocket works, otherwise supporting multifd with little
overhead / exported APIs would also be a good thing in general with
whatever approach.  And obviously all based on the facts that we can get
resources from companies to support this feature first.

Note that so far nobody has yet compared rdma vs. nic perf, so if any of us
can provide some test results please do so.  Many people say RDMA is better,
but I haven't yet seen any numbers comparing it with modern TCP networks.  I
don't want old impressions floating around even if things might have
changed..  When we have consolidated results, we should share them out and
also reflect them in QEMU's migration docs when an rdma document page is
ready.

Chuan, please check the whole thread discussion, it may help to understand
what we are looking for on rdma migrations [1].  Meanwhile please feel free
to sync with Jinpu's team and see how to move forward with such a project.

[1] https://lore.kernel.org/qemu-devel/87frwatp7n@suse.de/

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-07 Thread Peter Xu
On Tue, May 07, 2024 at 01:50:43AM +, Gonglei (Arei) wrote:
> Hello,
> 
> > -Original Message-
> > From: Peter Xu [mailto:pet...@redhat.com]
> > Sent: Monday, May 6, 2024 11:18 PM
> > To: Gonglei (Arei) 
> > Cc: Daniel P. Berrangé ; Markus Armbruster
> > ; Michael Galaxy ; Yu Zhang
> > ; Zhijian Li (Fujitsu) ; Jinpu 
> > Wang
> > ; Elmar Gerdes ;
> > qemu-de...@nongnu.org; Yuval Shaia ; Kevin Wolf
> > ; Prasanna Kumar Kalever
> > ; Cornelia Huck ;
> > Michael Roth ; Prasanna Kumar Kalever
> > ; integrat...@gluster.org; Paolo Bonzini
> > ; qemu-bl...@nongnu.org; devel@lists.libvirt.org;
> > Hanna Reitz ; Michael S. Tsirkin ;
> > Thomas Huth ; Eric Blake ; Song
> > Gao ; Marc-André Lureau
> > ; Alex Bennée ;
> > Wainer dos Santos Moschetta ; Beraldo Leal
> > ; Pannengyuan ;
> > Xiexiangyou 
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> > 
> > On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote:
> > > Hi, Peter
> > 
> > Hey, Lei,
> > 
> > Happy to see you around again after years.
> > 
> Haha, me too.
> 
> > > RDMA features high bandwidth, low latency (in non-blocking lossless
> > > network), and direct remote memory access by bypassing the CPU (As you
> > > know, CPU resources are expensive for cloud vendors, which is one of
> > > the reasons why we introduced offload cards.), which TCP does not have.
> > 
> > It's another cost to use offload cards, v.s. preparing more cpu resources?
> > 
> A converged software and hardware offload architecture is the way to go for
> all cloud vendors (including comprehensive benefits in terms of performance,
> cost, security, and innovation speed); it's not just a matter of adding the
> resources of a DPU card.
> 
> > > It is very useful in some scenarios where fast live migration is needed
> > > (extremely short interruption duration and migration duration). To this
> > > end, we have also developed RDMA support for multifd.
> > 
> > Will any of you upstream that work?  I'm curious how intrusive it would be
> > to add it to multifd; if it can keep only 5 exported functions like what
> > rdma.h does right now, it'll be pretty nice.  We also want to make sure it
> > works with arbitrarily sized loads and buffers, e.g. vfio is considering
> > adding IO loads to multifd channels too.
> > 
> 
> In fact, we sent the patchset to the community in 2021. Pls see:
> https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/

I wasn't aware of that for sure in the past..

Multifd has changed quite a bit in the recent 9.0 release, so that may not
apply anymore.  One thing to mention: please look at Dan's comment on the
possible use of rsocket.h:

https://lore.kernel.org/all/zjjm6rcqs5eho...@redhat.com/

And Jinpu did help provide an initial test result over the library:

https://lore.kernel.org/qemu-devel/camgffek8wiknqmouyxcathgtiem2dwocf_w7t0vmcd-i30t...@mail.gmail.com/

It looks like we have a chance to apply that in QEMU.

> 
> 
> > One thing to note is that the question here is not only about a pure
> > performance comparison between rdma and nics.  It's about helping us make a
> > decision on whether to drop rdma; iow, even if rdma performs well, the
> > community still has the right to drop it if nobody can actively work on and
> > maintain it.  It's just that if nics can perform as well, that's more of a
> > reason to drop it, unless companies can help to provide good support and
> > work together.
> > 
> 
> We are happy to provide the necessary review and maintenance work for RDMA
> if the community needs it.
> 
> CC'ing Chuan Zheng.

I'm not sure whether you and Jinpu's team would like to work together and
provide a final solution for rdma over multifd.  It could be much simpler
than the original 2021 proposal if the rsocket API will work out.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-06 Thread Peter Xu
On Mon, May 06, 2024 at 12:08:43PM +0200, Jinpu Wang wrote:
> Hi Peter, hi Daniel,

Hi, Jinpu,

Thanks for sharing these test results.  Sounds like great news.

What's your plan next?  Would it then be worthwhile / possible moving QEMU
into that direction?  Would that greatly simplify rdma code as Dan
mentioned?

Thanks,

> 
> On Fri, May 3, 2024 at 4:33 PM Peter Xu  wrote:
> >
> > On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> > > I had a brief check in the rsocket changelog, there seems some
> > > improvement over time,
> > >  might be worth revisiting this. due to socket abstraction, we can't
> > > use some feature like
> > >  ODP, it won't be a small and easy task.
> >
> > It'll be good to know whether Dan's suggestion would work first, without
> > rewritting everything yet so far.  Not sure whether some perf test could
> > help with the rsocket APIs even without QEMU's involvements (or looking for
> > test data supporting / invalidate such conversions).
> >
> I did a quick test with iperf on 100 G environment and 40 G
> environment, in summary rsocket works pretty well.
> 
> iperf tests between 2 hosts with 40 G (IB),
> first  a few test with different num. of threads on top of ipoib
> interface, later with preload rsocket on top of same ipoib interface.
> 
> jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145
> 
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> 
> [  3] local 10.43.3.146 port 55602 connected with 10.43.3.145 port 52000
> [ ID] Interval   Transfer Bandwidth
> [  3] 0.-10.0001 sec  2.85 GBytes  2.44 Gbits/sec
> jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 2
> 
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> 
> [  4] local 10.43.3.146 port 39640 connected with 10.43.3.145 port 52000
> [  3] local 10.43.3.146 port 39626 connected with 10.43.3.145 port 52000
> [ ID] Interval   Transfer Bandwidth
> [  3] 0.-10.0012 sec  2.85 GBytes  2.45 Gbits/sec
> [  4] 0.-10.0026 sec  2.86 GBytes  2.45 Gbits/sec
> [SUM] 0.-10.0026 sec  5.71 GBytes  4.90 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 0.281/0.300/0.318/0.318 ms (tot/err) = 2/0
> jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 4
> 
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> 
> [  4] local 10.43.3.146 port 46956 connected with 10.43.3.145 port 52000
> [  6] local 10.43.3.146 port 46978 connected with 10.43.3.145 port 52000
> [  3] local 10.43.3.146 port 46944 connected with 10.43.3.145 port 52000
> [  5] local 10.43.3.146 port 46962 connected with 10.43.3.145 port 52000
> [ ID] Interval   Transfer Bandwidth
> [  3] 0.-10.0017 sec  2.85 GBytes  2.45 Gbits/sec
> [  4] 0.-10.0015 sec  2.85 GBytes  2.45 Gbits/sec
> [  5] 0.-10.0026 sec  2.85 GBytes  2.45 Gbits/sec
> [  6] 0.-10.0005 sec  2.85 GBytes  2.45 Gbits/sec
> [SUM] 0.-10.0005 sec  11.4 GBytes  9.80 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 0.274/0.312/0.360/0.212 ms (tot/err) = 4/0
> jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 8
> 
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> 
> [  7] local 10.43.3.146 port 35062 connected with 10.43.3.145 port 52000
> [  6] local 10.43.3.146 port 35058 connected with 10.43.3.145 port 52000
> [  8] local 10.43.3.146 port 35066 connected with 10.43.3.145 port 52000
> [  9] local 10.43.3.146 port 35074 connected with 10.43.3.145 port 52000
> [  3] local 10.43.3.146 port 35038 connected with 10.43.3.145 port 52000
> [ 12] local 10.43.3.146 port 35088 connected with 10.43.3.145 port 52000
> [  5] local 10.43.3.146 port 35048 connected with 10.43.3.145 port 52000
> [  4] local 10.43.3.146 port 35050 connected with 10.43.3.145 port 52000
> [ ID] Interval   Transfer Bandwidth
> [  4] 0.-10.0005 sec  2.85 GBytes  2.44 Gbits/sec
> [  8] 0.-10.0011 sec  2.85 GBytes  2.45 Gbits/sec
> [  5] 0.-10. sec  2.85 GBytes  2.45 Gbits/sec
> [ 12] 0.-10.0021 sec  2.85 GBytes  2.44 Gbits/sec
> [  3] 0.-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
> [  7] 0.

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-06 Thread Peter Xu
On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote:
> Hi, Peter

Hey, Lei,

Happy to see you around again after years.

> RDMA features high bandwidth, low latency (in non-blocking lossless
> network), and direct remote memory access by bypassing the CPU (As you
> know, CPU resources are expensive for cloud vendors, which is one of the
> reasons why we introduced offload cards.), which TCP does not have.

It's another cost to use offload cards, v.s. preparing more cpu resources?

> It is very useful in some scenarios where fast live migration is needed
> (extremely short interruption duration and migration duration). To this
> end, we have also developed RDMA support for multifd.

Will any of you upstream that work?  I'm curious how intrusive it would be
to add it to multifd; if it can keep only 5 exported functions like what
rdma.h does right now, it'll be pretty nice.  We also want to make sure it
works with arbitrarily sized loads and buffers, e.g. vfio is considering
adding IO loads to multifd channels too.

One thing to note is that the question here is not only about a pure
performance comparison between rdma and nics.  It's about helping us make a
decision on whether to drop rdma; iow, even if rdma performs well, the
community still has the right to drop it if nobody can actively work on and
maintain it.  It's just that if nics can perform as well, that's more of a
reason to drop it, unless companies can help to provide good support and work
together.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-03 Thread Peter Xu
On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> I had a brief check in the rsocket changelog, there seems some
> improvement over time,
>  might be worth revisiting this. due to socket abstraction, we can't
> use some feature like
>  ODP, it won't be a small and easy task.

It'll be good to know whether Dan's suggestion would work first, without
rewritting everything yet so far.  Not sure whether some perf test could
help with the rsocket APIs even without QEMU's involvements (or looking for
test data supporting / invalidate such conversions).

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-02 Thread Peter Xu
On Thu, May 02, 2024 at 03:30:58PM +0200, Jinpu Wang wrote:
> Hi Michael, Hi Peter,
> 
> 
> On Thu, May 2, 2024 at 3:23 PM Michael Galaxy  wrote:
> >
> > Yu Zhang / Jinpu,
> >
> > Any possibility (at your lesiure, and within the disclosure rules of
> > your company, IONOS) if you could share any of your performance
> > information to educate the group?
> >
> > NICs have indeed changed, but not everybody has 100ge mellanox cards at
> > their disposal. Some people don't.
> Our staging env is with 100 Gb/s IB environment.
> We will have a new setup in the coming months with Ethernet (RoCE), we
> will run some performance
> comparison when we have the environment ready.

Thanks both.  Please keep us posted.

Just to double check, we're comparing "tcp:" v.s. "rdma:", RoCE is not
involved, am I right?

The other note is that the comparison needs to be with multifd enabled for
the "tcp:" case.  I'd suggest we start with 8 threads if it's 100Gbps.

I think I can still fetch some 100Gbps or even 200Gbps nics around our labs
without even waiting for months.  If you want I can try to see how we can
test together.  And btw I don't think we need a cluster, IIUC we simply
need two hosts, 100G nic on both sides?  IOW, it seems to me we only need
two cards just for experiments, systems that can drive the cards, and a
wire supporting 100G?

> 
> >
> > - Michael
> 
> Thx!
> Jinpu
> >
> > On 5/1/24 11:16, Peter Xu wrote:
> > > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> > >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > >>> What I worry more is whether this is really what we want to keep rdma in
> > >>> qemu, and that's also why I was trying to request for some serious
> > >>> performance measurements comparing rdma v.s. nics.  And here when I said
> > >>> "we" I mean both QEMU community and any company that will support 
> > >>> keeping
> > >>> rdma around.
> > >>>
> > >>> The problem is if NICs now are fast enough to perform at least equally
> > >>> against rdma, and if it has a lower cost of overall maintenance, does it
> > >>> mean that rdma migration will only be used by whoever wants to keep 
> > >>> them in
> > >>> the products and existed already?  In that case we should simply ask new
> > >>> users to stick with tcp, and rdma users should only drop but not 
> > >>> increase.
> > >>>
> > >>> It seems also destined that most new migration features will not support
> > >>> rdma: see how much we drop old features in migration now (which rdma
> > >>> _might_ still leverage, but maybe not), and how much we add mostly 
> > >>> multifd
> > >>> relevant which will probably not apply to rdma at all.  So in general 
> > >>> what
> > >>> I am worrying is a both-loss condition, if the company might be easier 
> > >>> to
> > >>> either stick with an old qemu (depending on whether other new features 
> > >>> are
> > >>> requested to be used besides RDMA alone), or do periodic rebase with 
> > >>> RDMA
> > >>> downstream only.
> > >> I don't know much about the originals of RDMA support in QEMU and why
> > >> this particular design was taken. It is indeed a huge maint burden to
> > >> have a completely different code flow for RDMA with 4000+ lines of
> > >> custom protocol signalling which is barely understandable.
> > >>
> > >> I would note that /usr/include/rdma/rsocket.h provides a higher level
> > >> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> > >> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> > >> type could almost[1] trivially have supported RDMA. There would have
> > >> been almost no RDMA code required in the migration subsystem, and all
> > >> the modern features like compression, multifd, post-copy, etc would
> > >> "just work".
> > >>
> > >> I guess the 'rsocket.h' shim may well limit some of the possible
> > >> performance gains, but it might still have been a better tradeoff
> > >> to have not quite so good peak performance, but with massively
> > >> less maint burden.
> > > My understanding so far is RDMA is solely for performance but nothing else,
> > > then it's a question on whethe

Re: [PATCH v3 3/6] migration: Remove 'blk/-b' option from migrate commands

2024-05-02 Thread Peter Xu
On Thu, May 02, 2024 at 01:35:06PM +, Dr. David Alan Gilbert wrote:
> * Markus Armbruster (arm...@redhat.com) wrote:
> > Fabiano Rosas  writes:
> > 
> > > The block migration is considered obsolete and has been deprecated in
> > > 8.2. Remove the migrate command option that enables it. This only
> > > affects the QMP and HMP commands, the feature can still be accessed by
> > > setting the migration 'block' capability. The whole feature will be
> > > removed in a future patch.
> > >
> > > Deprecation commit 8846b5bfca ("migration: migrate 'blk' command
> > > option is deprecated.").
> > >
> > > Reviewed-by: Markus Armbruster 
> > > Signed-off-by: Fabiano Rosas 
> > 
> > [...]
> > 
> > > diff --git a/hmp-commands.hx b/hmp-commands.hx
> > > index 7978302949..ebca2cdced 100644
> > > --- a/hmp-commands.hx
> > > +++ b/hmp-commands.hx
> > > @@ -909,21 +909,17 @@ ERST
> > >  
> > >  {
> > >  .name   = "migrate",
> > > -.args_type  = "detach:-d,blk:-b,resume:-r,uri:s",
> > > -.params = "[-d] [-b] [-r] uri",
> > > +.args_type  = "detach:-d,resume:-r,uri:s",
> > > +.params = "[-d] [-r] uri",
> > >  .help   = "migrate to URI (using -d to not wait for 
> > > completion)"
> > > -   "\n\t\t\t -b for migration without shared storage with"
> > > -   " full copy of disk\n\t\t\t -r to resume a paused 
> > > migration",
> > > +   "\n\t\t\t -r to resume a paused migration",
> > >  .cmd= hmp_migrate,
> > >  },
> > >  
> > >  
> > >  SRST
> > > -``migrate [-d] [-b]`` *uri*
> > > +``migrate [-d]`` *uri*
> > >Migrate to *uri* (using -d to not wait for completion).
> > > -
> > > -  ``-b``
> > > -for migration with full copy of disk
> > >  ERST
> > 
> > Not this patch's fault, but here goes anyway: -r is undocumented here.
> 
> Probably one for Peter I guess.

Yes, and I'll send a patch! :-D

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-01 Thread Peter Xu
On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > What I worry more is whether this is really what we want to keep rdma in
> > qemu, and that's also why I was trying to request for some serious
> > performance measurements comparing rdma v.s. nics.  And here when I said
> > "we" I mean both QEMU community and any company that will support keeping
> > rdma around.
> > 
> > The problem is if NICs now are fast enough to perform at least equally
> > against rdma, and if it has a lower cost of overall maintenance, does it
> > mean that rdma migration will only be used by whoever wants to keep them in
> > the products and existed already?  In that case we should simply ask new
> > users to stick with tcp, and rdma users should only drop but not increase.
> > 
> > It seems also destined that most new migration features will not support
> > rdma: see how much we drop old features in migration now (which rdma
> > _might_ still leverage, but maybe not), and how much we add mostly multifd
> > relevant which will probably not apply to rdma at all.  So in general what
> > I am worrying is a both-loss condition, if the company might be easier to
> > either stick with an old qemu (depending on whether other new features are
> > requested to be used besides RDMA alone), or do periodic rebase with RDMA
> > downstream only.
> 
> I don't know much about the originals of RDMA support in QEMU and why
> this particular design was taken. It is indeed a huge maint burden to
> have a completely different code flow for RDMA with 4000+ lines of
> custom protocol signalling which is barely understandable.
> 
> I would note that /usr/include/rdma/rsocket.h provides a higher level
> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> type could almost[1] trivially have supported RDMA. There would have
> been almost no RDMA code required in the migration subsystem, and all
> the modern features like compression, multifd, post-copy, etc would
> "just work".
> 
> I guess the 'rsocket.h' shim may well limit some of the possible
> performance gains, but it might still have been a better tradeoff
> to have not quite so good peak performance, but with massively
> less maint burden.

My understanding so far is that RDMA is solely for performance and nothing
else, so it's a question of whether existing rdma users would still want it
if it will run slower.

Jinpu mentioned the explicit usage of ib verbs, but I am mostly just quoting
that as I don't really know such details:

https://lore.kernel.org/qemu-devel/camgffem2twjxopcnqtq1sjytf5395dbztcmyikrqfxdzjws...@mail.gmail.com/

So I'm not sure whether that applies here too, in that having a qiochannel
wrapper may not allow direct access to those ib verbs.
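
For illustration, the kind of 1-1 mapping Dan refers to looks roughly like
this; rsocket(), rconnect(), rsend() and rclose() are the real counterparts
from <rdma/rsocket.h>, while the example itself is just a sketch with no
error handling:

    #include <rdma/rsocket.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <stddef.h>

    /* Same shape as socket()/connect()/send()/close(), just r-prefixed. */
    static void rdma_connect_and_send(const char *ip, int port,
                                      const void *buf, size_t len)
    {
        struct sockaddr_in dst = { .sin_family = AF_INET,
                                   .sin_port = htons(port) };
        int fd = rsocket(AF_INET, SOCK_STREAM, 0);

        inet_pton(AF_INET, ip, &dst.sin_addr);
        rconnect(fd, (struct sockaddr *)&dst, sizeof(dst));
        rsend(fd, buf, len, 0);
        rclose(fd);
    }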

Thanks,

> 
> With regards,
> Daniel
> 
> [1] "almost" trivially, because the poll() integration for rsockets
> requires a bit more magic sauce since rsockets FDs are not
> really FDs from the kernel's POV. Still, QIOChannel likely can
> abstract that problem.
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
> 

-- 
Peter Xu


Re: [PATCH v3 0/6] migration removals & deprecations

2024-05-01 Thread Peter Xu
On Tue, Apr 30, 2024 at 11:27:31AM -0300, Fabiano Rosas wrote:
> since v2:
> 
> - removed some more stuff which I missed:
>blk/inc options from hmp-commands.hx
>the entire ram-compress.h
>unused declarations from options.h
>unused compression functions from qemu-file.c
> 
> - removed must_remove_block_options earlier in the 'blk' patch
> 
> - added a deprecation warning to outgoing/incoming fd
> 
> CI run: https://gitlab.com/farosas/qemu/-/pipelines/1272385260
> 
> v2:
> https://lore.kernel.org/r/20240426131408.25410-1-faro...@suse.de
> v1:
> https://lore.kernel.org/r/20240425150939.19268-1-faro...@suse.de
> 
> Hi everyone,
> 
> Here's some cleaning up of deprecated code. It removes the old block
> migration and compression code. Both have suitable replacements in the
> form of the blockdev-mirror driver and multifd compression,
> respectively.
> 
> There's also a deprecation for fd: + file to cope with the fact that
> the new MigrationAddress API defines transports instead of protocols
> (loose terms) like the old string API did. So we cannot map 1:1 from
> fd: to any transport because fd: allows *both* file migration and
> socket migration.
> 
> Fabiano Rosas (6):
>   migration: Remove 'skipped' field from MigrationStats
>   migration: Remove 'inc' option from migrate command
>   migration: Remove 'blk/-b' option from migrate commands
>   migration: Remove block migration
>   migration: Remove non-multifd compression
>   migration: Deprecate fd: for file migration

Reviewed-by: Peter Xu 

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-01 Thread Peter Xu
On Tue, Apr 30, 2024 at 09:00:49AM +0100, Daniel P. Berrangé wrote:
> On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote:
> > Peter Xu  writes:
> > 
> > > On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> > >> Hi All (and Peter),
> > >
> > > Hi, Michael,
> > >
> > >> 
> > >> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> > >> (highly irregular for a male) and yes, that's my real last name:
> > >> https://www.linkedin.com/in/mrgalaxy/)
> > >> 
> > >> I'm the original author of the RDMA implementation. I've been discussing
> > >> with Yu Zhang for a little bit about potentially handing over 
> > >> maintainership
> > >> of the codebase to his team.
> > >> 
> > >> I simply have zero access to RoCE or Infiniband hardware at all,
> > >> unfortunately. so I've never been able to run tests or use what I wrote 
> > >> at
> > >> work, and as all of you know, if you don't have a way to test something,
> > >> then you can't maintain it.
> > >> 
> > >> Yu Zhang put a (very kind) proposal forward to me to ask the community if
> > >> they feel comfortable training his team to maintain the codebase (and run
> > >> tests) while they learn about it.
> > >
> > > The "while learning" part is fine at least to me.  IMHO the "ownership" to
> > > the code, or say, taking over the responsibility, may or may not need 100%
> > > mastering the code base first.  There should still be some fundamental
> > > confidence to work on the code though as a starting point, then it's about
> > > serious use case to back this up, and careful testings while getting more
> > > familiar with it.
> > 
> > How much experience we expect of maintainers depends on the subsystem
> > and other circumstances.  The hard requirement isn't experience, it's
> > trust.  See the recent attack on xz.
> > 
> > I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
> > I'm merely reminding y'all what's at stake.
> 
> I think we shouldn't overly obsess[1] about 'xz', because the overwhelmingly
> common scenario is that volunteer maintainers are honest people. QEMU is
> in a massively better peer review situation. With xz there was basically no
> oversight of the new maintainer. With QEMU, we have oversight from 1000's
> of people on the list, a huge pool of general maintainers, the specific
> migration maintainers, and the release manager merging code.
> 
> With a lack of historical experience with QEMU maintainership, I'd suggest
> that new RDMA volunteers would start by adding themselves to the "MAINTAINERS"
> file with only the 'Reviewer' classification. The main migration maintainers
> would still handle pull requests, but wait for a R-b from one of the RMDA
> volunteers. After some period of time the RDMA folks could graduate to full
> maintainer status if the migration maintainers needed to reduce their load.
> I suspect that might prove unnecessary though, given RDMA isn't an area of
> code with a high turnover of patches.

Right, and we can do that as a start, it also follows our normal rules of
starting from Reviewers to maintain something.  I even considered Zhijian
to be the previous rdma goto guy / maintainer no matter what role he used
to have in the MAINTAINERS file.

Here IMHO it's more about whether any company would like to stand up and
provide help, without yet binding that to be able to send pull requests in
the near future or even longer term.

What I worry more is whether this is really what we want to keep rdma in
qemu, and that's also why I was trying to request for some serious
performance measurements comparing rdma v.s. nics.  And here when I said
"we" I mean both QEMU community and any company that will support keeping
rdma around.

The problem is if NICs now are fast enough to perform at least equally
against rdma, and if it has a lower cost of overall maintenance, does it
mean that rdma migration will only be used by whoever wants to keep them in
the products and existed already?  In that case we should simply ask new
users to stick with tcp, and rdma users should only drop but not increase.

It seems also destined that most new migration features will not support
rdma: see how much we drop old features in migration now (which rdma
_might_ still leverage, but maybe not), and how much we add mostly multifd
relevant which will probably not apply to rdma at all.  So in general what
I am worrying is a both-loss condition, if the company might be easier to
either sti

Re: [PATCH v2 5/6] migration: Remove non-multifd compression

2024-04-29 Thread Peter Xu
On Fri, Apr 26, 2024 at 10:14:07AM -0300, Fabiano Rosas wrote:
> The 'compress' migration capability enables the old compression code
> which has shown issues over the years and is thought to be less stable
> and tested than the more recent multifd-based compression. The old
> compression code has been deprecated in 8.2 and now is time to remove
> it.
> 
> Deprecation commit 864128df46 ("migration: Deprecate old compression
> method").
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu


Re: [PATCH v2 2/6] migration: Remove 'inc' option from migrate command

2024-04-29 Thread Peter Xu
On Fri, Apr 26, 2024 at 10:14:04AM -0300, Fabiano Rosas wrote:
> The block incremental option for block migration has been deprecated
> in 8.2 in favor of using the block-mirror feature. Remove it now.
> 
> Deprecation commit 40101f320d ("migration: migrate 'inc' command
> option is deprecated.").
> 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu


Re: [PATCH v2 1/6] migration: Remove 'skipped' field from MigrationStats

2024-04-29 Thread Peter Xu
On Fri, Apr 26, 2024 at 10:14:03AM -0300, Fabiano Rosas wrote:
> The 'skipped' field of the MigrationStats struct has been deprecated
> in 8.1. Time to remove it.
> 
> Deprecation commit 7b24d32634 ("migration: skipped field is really
> obsolete.").
> 
> Reviewed-by: Markus Armbruster 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: Peter Xu 

-- 
Peter Xu


Re: [PATCH v2 6/6] migration: Deprecate fd: for file migration

2024-04-29 Thread Peter Xu
On Mon, Apr 29, 2024 at 03:47:39PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Fri, Apr 26, 2024 at 10:14:08AM -0300, Fabiano Rosas wrote:
> >> The fd: URI can currently trigger two different types of migration, a
> >> TCP migration using sockets and a file migration using a plain
> >> file. This is in conflict with the recently introduced (8.2) QMP
> >> migrate API that takes structured data as JSON-like format. We cannot
> >> keep the same backend for both types of migration because with the new
> >> API the code is more tightly coupled to the type of transport. This
> >> means a TCP migration must use the 'socket' transport and a file
> >> migration must use the 'file' transport.
> >> 
> >> If we keep allowing fd: when using a file, this creates an issue when
> >> the user converts the old-style (fd:) to the new style ("transport":
> >> "socket") invocation because the file descriptor in question has
> >> previously been allowed to be either a plain file or a socket.
> >> 
> >> To avoid creating too much confusion, we can simply deprecate the fd:
> >> + file usage, which is thought to be rarely used currently and instead
> >> establish a 1:1 correspondence between fd: URI and socket transport,
> >> and file: URI and file transport.
> >> 
> >> Signed-off-by: Fabiano Rosas 
> >> ---
> >>  docs/about/deprecated.rst | 14 ++
> >>  1 file changed, 14 insertions(+)
> >> 
> >> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
> >> index 0fb5c82640..813f7996fe 100644
> >> --- a/docs/about/deprecated.rst
> >> +++ b/docs/about/deprecated.rst
> >> @@ -464,3 +464,17 @@ both, older and future versions of QEMU.
> >>  The ``blacklist`` config file option has been renamed to ``block-rpcs``
> >>  (to be in sync with the renaming of the corresponding command line
> >>  option).
> >> +
> >> +Migration
> >> +-
> >> +
> >> +``fd:`` URI when used for file migration (since 9.1)
> >> +
> >> +
> >> +The ``fd:`` URI can currently provide a file descriptor that
> >> +references either a socket or a plain file. These are two different
> >> +types of migration. In order to reduce ambiguity, the ``fd:`` URI
> >> +usage of providing a file descriptor to a plain file has been
> >> +deprecated in favor of explicitly using the ``file:`` URI with the
> >> +file descriptor being passed as an ``fdset``. Refer to the ``add-fd``
> >> +command documentation for details on the ``fdset`` usage.
> >
> > Wanna do some warn_report() alongside when non-socket fds are detected?  Looks
> > like we previously did this for all deprecations.
> 
> Yes, good point.
> 
> >
> > What's the plan for when its support is removed?  I'm imagining that we sanity
> > check fstat() + S_ISSOCK on the fd and fail otherwise?  In that case we can
> > have the code there, dump warn_report(), then switch to failing qmp migrate
> > (and the incoming side) later on?
> 
> Something along those lines. We currently use fd_is_socket():
> 
> bool fd_is_socket(int fd)
> {
>     int optval;
>     socklen_t optlen = sizeof(optval);
>     return !getsockopt(fd, SOL_SOCKET, SO_TYPE, &optval, &optlen);
> }
> 
> I'm thinking of this in fd_start_outgoing_migation():
> 
> if (!fd_is_socket(fd)) {
>     warn_report("fd: migration to a file is deprecated."
>                 " Use file: instead.");
> }

Sounds good, perhaps also in fd_start_incoming_migration().
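
For what it's worth, a sketch of the eventual hard check could be as small as
the below (fstat()/S_ISSOCK are standard; the bool/Error/error_setg usage
assumes QEMU's usual qemu/osdep.h and qapi/error.h includes, and the exact
error wording is made up here):

    #include <sys/stat.h>

    static bool fd_check_is_socket(int fd, Error **errp)
    {
        struct stat st;

        /* Reject anything that is not a socket once the deprecation ends. */
        if (fstat(fd, &st) < 0 || !S_ISSOCK(st.st_mode)) {
            error_setg(errp, "fd: URI requires a socket file descriptor");
            return false;
        }
        return true;
    }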

-- 
Peter Xu


Re: [PATCH v2 3/6] migration: Remove 'blk/-b' option from migrate commands

2024-04-29 Thread Peter Xu
On Mon, Apr 29, 2024 at 03:35:02PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Mon, Apr 29, 2024 at 02:18:57PM -0300, Fabiano Rosas wrote:
> >> Peter Xu  writes:
> >> 
> >> > On Fri, Apr 26, 2024 at 10:14:05AM -0300, Fabiano Rosas wrote:
> >> >> @@ -2003,21 +1997,7 @@ static bool migrate_prepare(MigrationState *s, 
> >> >> bool blk, bool resume,
> >> >>  }
> >> >>  }
> >> >>  
> >> >> -if (blk) {

[1]

> >> >> -if (migrate_colo()) {
> >> >> -error_setg(errp, "No disk migration is required in COLO 
> >> >> mode");
> >> >> -return false;
> >> >> -}
> >> >> -if (migrate_block()) {
> >> >> -error_setg(errp, "Command options are incompatible with "
> >> >> -   "current migration capabilities");
> >> >> -return false;
> >> >> -}
> >> >> -if (!migrate_cap_set(MIGRATION_CAPABILITY_BLOCK, true, errp)) {
> >> >> -return false;
> >> >> -}
> >> >> -s->must_remove_block_options = true;
> >> >> -}
> >> >> +s->must_remove_block_options = true;
> >> >
> >> > Can we drop this var too?  Perhaps with block_cleanup_parameters()?
> >> 
> >> Yes, Markus mentioned it in v1 already. Take a look there. There's
> >> several other declarations I missed. v3 is coming soon.
> >
> > Right, noticed that it's removed actually in the next patch.
> >
> > But iiuc it can already been removed in this patch.  If we want to remove
> > it in the next, logically we should set must_remove_block_options=false
> > here, though..  So maybe easier to just drop it here.
> 
> Ah I see what you mean. I thought you're just asking for the removal
> overall.
> 
> But block_cleanup_parameters sets the block capability to false and the
> whole block migration only goes away in the next patch. I think we need
> to keep this as true to preserve behavior. In theory, after this patch
> people could still use the block migration just fine by setting the cap.

I'm reading this patch the same as "blk==false" always above [1].  In that
case, only if we keep must_remove_block_options=false could we maintain the
behavior?  Otherwise at the end of migration (or cancellations) QEMU can
wrongly clear MIGRATION_CAPABILITY_BLOCK bit if the user set it manually?

(But hey, we're discussing whoever applies this patch only without the
 rest..  definitely not a huge deal :)

-- 
Peter Xu


Re: [PATCH v2 6/6] migration: Deprecate fd: for file migration

2024-04-29 Thread Peter Xu
On Fri, Apr 26, 2024 at 10:14:08AM -0300, Fabiano Rosas wrote:
> The fd: URI can currently trigger two different types of migration, a
> TCP migration using sockets and a file migration using a plain
> file. This is in conflict with the recently introduced (8.2) QMP
> migrate API that takes structured data as JSON-like format. We cannot
> keep the same backend for both types of migration because with the new
> API the code is more tightly coupled to the type of transport. This
> means a TCP migration must use the 'socket' transport and a file
> migration must use the 'file' transport.
> 
> If we keep allowing fd: when using a file, this creates an issue when
> the user converts the old-style (fd:) to the new style ("transport":
> "socket") invocation because the file descriptor in question has
> previously been allowed to be either a plain file or a socket.
> 
> To avoid creating too much confusion, we can simply deprecate the fd:
> + file usage, which is thought to be rarely used currently and instead
> establish a 1:1 correspondence between fd: URI and socket transport,
> and file: URI and file transport.
> 
> Signed-off-by: Fabiano Rosas 
> ---
>  docs/about/deprecated.rst | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
> index 0fb5c82640..813f7996fe 100644
> --- a/docs/about/deprecated.rst
> +++ b/docs/about/deprecated.rst
> @@ -464,3 +464,17 @@ both, older and future versions of QEMU.
>  The ``blacklist`` config file option has been renamed to ``block-rpcs``
>  (to be in sync with the renaming of the corresponding command line
>  option).
> +
> +Migration
> +-
> +
> +``fd:`` URI when used for file migration (since 9.1)
> +
> +
> +The ``fd:`` URI can currently provide a file descriptor that
> +references either a socket or a plain file. These are two different
> +types of migration. In order to reduce ambiguity, the ``fd:`` URI
> +usage of providing a file descriptor to a plain file has been
> +deprecated in favor of explicitly using the ``file:`` URI with the
> +file descriptor being passed as an ``fdset``. Refer to the ``add-fd``
> +command documentation for details on the ``fdset`` usage.

Wanna do some warn_report() alongside when a non-socket fd is detected?  Looks
like we previously did this for all deprecations.

What's the plan for when its support is removed?  I'm imagining that we sanity
check fstat() + S_ISSOCK on the fd and fail otherwise?  In that case we can
have the code there now, dump warn_report(), then switch to failing the qmp
migrate (and the incoming side) later on?
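
A minimal sketch of the kind of check described above -- the helper name and
call site are hypothetical; only fstat()/S_ISSOCK and the usual
warn_report()/error_setg_errno() helpers are assumed:

    /* hypothetical helper, assumed to live in migration code where
     * qemu/osdep.h and qemu/error-report.h are already included */
    static bool migration_fd_is_socket(int fd, Error **errp)
    {
        struct stat st;

        if (fstat(fd, &st) < 0) {
            error_setg_errno(errp, errno, "cannot stat migration fd %d", fd);
            return false;
        }
        if (!S_ISSOCK(st.st_mode)) {
            /* deprecation period: only warn; later this (and the incoming
             * side) could turn into a hard failure of the migrate command */
            warn_report("fd: migration with a non-socket fd is deprecated, "
                        "use file: with an fdset instead");
        }
        return true;
    }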

-- 
Peter Xu


Re: [PATCH v2 3/6] migration: Remove 'blk/-b' option from migrate commands

2024-04-29 Thread Peter Xu
On Mon, Apr 29, 2024 at 02:18:57PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Fri, Apr 26, 2024 at 10:14:05AM -0300, Fabiano Rosas wrote:
> >> @@ -2003,21 +1997,7 @@ static bool migrate_prepare(MigrationState *s, bool 
> >> blk, bool resume,
> >>  }
> >>  }
> >>  
> >> -if (blk) {
> >> -if (migrate_colo()) {
> >> -error_setg(errp, "No disk migration is required in COLO 
> >> mode");
> >> -return false;
> >> -}
> >> -if (migrate_block()) {
> >> -error_setg(errp, "Command options are incompatible with "
> >> -   "current migration capabilities");
> >> -return false;
> >> -}
> >> -if (!migrate_cap_set(MIGRATION_CAPABILITY_BLOCK, true, errp)) {
> >> -return false;
> >> -}
> >> -s->must_remove_block_options = true;
> >> -}
> >> +s->must_remove_block_options = true;
> >
> > Can we drop this var too?  Perhaps with block_cleanup_parameters()?
> 
> Yes, Markus mentioned it in v1 already. Take a look there. There's
> several other declarations I missed. v3 is coming soon.

Right, I noticed that it's actually removed in the next patch.

But IIUC it can already be removed in this patch.  If we want to remove
it in the next, logically we should set must_remove_block_options=false
here, though..  So maybe it's easier to just drop it here.

-- 
Peter Xu


Re: [PATCH v2 3/6] migration: Remove 'blk/-b' option from migrate commands

2024-04-29 Thread Peter Xu
On Fri, Apr 26, 2024 at 10:14:05AM -0300, Fabiano Rosas wrote:
> @@ -2003,21 +1997,7 @@ static bool migrate_prepare(MigrationState *s, bool 
> blk, bool resume,
>  }
>  }
>  
> -if (blk) {
> -if (migrate_colo()) {
> -error_setg(errp, "No disk migration is required in COLO mode");
> -return false;
> -}
> -if (migrate_block()) {
> -error_setg(errp, "Command options are incompatible with "
> -   "current migration capabilities");
> -return false;
> -}
> -if (!migrate_cap_set(MIGRATION_CAPABILITY_BLOCK, true, errp)) {
> -return false;
> -}
> -s->must_remove_block_options = true;
> -}
> +s->must_remove_block_options = true;

Can we drop this var too?  Perhaps with block_cleanup_parameters()?

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-29 Thread Peter Xu
On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> Hi All (and Peter),

Hi, Michael,

> 
> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> (highly irregular for a male) and yes, that's my real last name:
> https://www.linkedin.com/in/mrgalaxy/)
> 
> I'm the original author of the RDMA implementation. I've been discussing
> with Yu Zhang for a little bit about potentially handing over maintainership
> of the codebase to his team.
> 
> I simply have zero access to RoCE or Infiniband hardware at all,
> unfortunately. so I've never been able to run tests or use what I wrote at
> work, and as all of you know, if you don't have a way to test something,
> then you can't maintain it.
> 
> Yu Zhang put a (very kind) proposal forward to me to ask the community if
> they feel comfortable training his team to maintain the codebase (and run
> tests) while they learn about it.

The "while learning" part is fine at least to me.  IMHO the "ownership" to
the code, or say, taking over the responsibility, may or may not need 100%
mastering the code base first.  There should still be some fundamental
confidence to work on the code though as a starting point, then it's about
serious use case to back this up, and careful testings while getting more
familiar with it.

> 
> If you don't mind, I'd like to let him send over his (very detailed)
> proposal,

Yes please, it's exactly the time to share the plan.  The hope is that we
reach a consensus before or around the middle of this release (9.1).
Normally QEMU has a 3~4 month window for each release and the 9.1 schedule is
not yet out, but I think it means making a decision before or around the
middle of June.

Thanks,

> 
> - Michael
> 
> On 4/11/24 11:36, Yu Zhang wrote:
> > > 1) Either a CI test covering at least the major RDMA paths, or at least
> > >  periodically tests for each QEMU release will be needed.
> > We use a batch of regression test cases for the stack, which covers the
> > test for QEMU. I did such test for most of the QEMU releases planned as
> > candidates for rollout.
> > 
> > The migration test needs a pair of (either physical or virtual) servers with
> > InfiniBand network, which makes it difficult to do on a single server. The
> > nested VM could be a possible approach, for which we may need virtual
> > InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you 
> > know.
> > 
> > [1]  
> > https://urldefense.com/v3/__https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWvGHE3ig$
> > 
> > Thanks and best regards!
> > 
> > On Thu, Apr 11, 2024 at 4:20 PM Peter Xu  wrote:
> > > On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> > > > On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via 
> > > > wrote:
> > > > > 
> > > > > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > > > > 
> > > > > > > Is there document/link about the unittest/CI for migration tests, 
> > > > > > > Why
> > > > > > > are those tests missing?
> > > > > > > Is it hard or very special to set up an environment for that? 
> > > > > > > maybe we
> > > > > > > can help in this regards.
> > > > > > See tests/qtest/migration-test.c.  We put most of our migration 
> > > > > > tests
> > > > > > there and that's covered in CI.
> > > > > > 
> > > > > > I think one major issue is CI systems don't normally have rdma 
> > > > > > devices.
> > > > > > Can rdma migration test be carried out without a real hardware?
> > > > > Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> > > > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > > > > then we can get a new RDMA interface "rxe_eth0".
> > > > > This new RDMA interface is able to do the QEMU RDMA migration.
> > > > > 
> > > > > Also, the loopback(lo) device is able to emulate the RDMA interface
> > > > > "rxe_lo", however when
> > > > > I tried(years ago) to do RDMA migration over this
> > > > > interface(rdma:127.0.0.1:) , it got something wrong.
> > > > > So i gave up enabling the RDMA migration qtest at that time.
> > > > Thanks, Zhijian.
> > > > 
> > > > I'm not sure adding an emu-link for rdma is doable for CI systems, though.

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-12 Thread Peter Xu
Yu,

On Thu, Apr 11, 2024 at 06:36:54PM +0200, Yu Zhang wrote:
> > 1) Either a CI test covering at least the major RDMA paths, or at least
> > periodically tests for each QEMU release will be needed.
> We use a batch of regression test cases for the stack, which covers the
> test for QEMU. I did such test for most of the QEMU releases planned as
> candidates for rollout.

The least I can think of is a few test runs per release.  Definitely too
few if a single release can already break it..

> 
> The migration test needs a pair of (either physical or virtual) servers with
> InfiniBand network, which makes it difficult to do on a single server. The
> nested VM could be a possible approach, for which we may need virtual
> InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.
> 
> [1]  https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce

Does it require a kernel driver?  The fewer host kernel / hardware /
.. dependencies the better.

I am wondering whether there can be a library doing everything in
userspace, translating RDMA into e.g. socket messages (so maybe ultimately
that's something like IP->rdma->IP.. just to cover the "rdma" procedures),
so that it would work reliably in CI.

Please also see my full list, though, especially entry 4).  Thanks already
for looking into solutions for the tests, but I don't want to waste your time
only to find that tests are not enough even once they're ready.  I think we
need people who understand this stuff well enough, have dedicated time, and
can look after it.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-11 Thread Peter Xu
On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via wrote:
> > 
> > 
> > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > 
> > >> Is there document/link about the unittest/CI for migration tests, Why
> > >> are those tests missing?
> > >> Is it hard or very special to set up an environment for that? maybe we
> > >> can help in this regards.
> > > See tests/qtest/migration-test.c.  We put most of our migration tests
> > > there and that's covered in CI.
> > >
> > > I think one major issue is CI systems don't normally have rdma devices.
> > > Can rdma migration test be carried out without a real hardware?
> > 
> > Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > then we can get a new RDMA interface "rxe_eth0".
> > This new RDMA interface is able to do the QEMU RDMA migration.
> > 
> > Also, the loopback(lo) device is able to emulate the RDMA interface 
> > "rxe_lo", however when
> > I tried(years ago) to do RDMA migration over this 
> > interface(rdma:127.0.0.1:) , it got something wrong.
> > So i gave up enabling the RDMA migration qtest at that time.
> 
> Thanks, Zhijian.
> 
> I'm not sure adding an emu-link for rdma is doable for CI systems, though.
> Maybe someone more familiar with how CI works can chim in.

Some people got dropped from the cc list for an unknown reason; I'm adding them
back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
accident.

I'll try to summarize what is still missing, and I think these will be
greatly helpful if we don't want to deprecate rdma migration:

  1) Either a CI test covering at least the major RDMA paths, or at least
 periodic tests for each QEMU release, will be needed.

  2) Some performance tests comparing modern RDMA and NIC devices are
 welcome.  The current understanding is that a modern NIC can perform
 similarly to RDMA, in which case it's debatable why we still maintain
 so much rdma-specific code.

  3) No need for solid patchsets for this one, but some plan to improve the
 RDMA migration code so that it is not almost isolated from the rest of
 the protocols.

  4) Someone to look after this code for real.

For 2) and 3) more info is here:

https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n

Here 4) can be the most important, as Markus pointed out.  We just haven't
got there yet in the discussions, but maybe Markus is right that we should
talk about that first.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-10 Thread Peter Xu
On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via wrote:
> 
> 
> on 4/10/2024 3:46 AM, Peter Xu wrote:
> 
> >> Is there document/link about the unittest/CI for migration tests, Why
> >> are those tests missing?
> >> Is it hard or very special to set up an environment for that? maybe we
> >> can help in this regards.
> > See tests/qtest/migration-test.c.  We put most of our migration tests
> > there and that's covered in CI.
> >
> > I think one major issue is CI systems don't normally have rdma devices.
> > Can rdma migration test be carried out without a real hardware?
> 
> Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> then we can get a new RDMA interface "rxe_eth0".
> This new RDMA interface is able to do the QEMU RDMA migration.
> 
> Also, the loopback(lo) device is able to emulate the RDMA interface 
> "rxe_lo", however when
> I tried(years ago) to do RDMA migration over this 
> interface(rdma:127.0.0.1:) , it got something wrong.
> So i gave up enabling the RDMA migration qtest at that time.

Thanks, Zhijian.

I'm not sure adding an emu-link for rdma is doable for CI systems, though.
Maybe someone more familiar with how CI works can chime in.
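
(For anyone who wants to experiment, a rough sketch of a software-only setup
on a test box -- the interface names and the final migration step are
assumptions; only the "rdma link add" command comes from Zhijian's mail above:)

    # soft-RoCE (rxe) sketch: emulate an RDMA link without real hardware
    sudo modprobe rdma_rxe                            # assumes the rxe module is available
    sudo rdma link add rxe_eth0 type rxe netdev eth0  # from Zhijian's example
    rdma link show                                    # rxe_eth0 should now be listed
    # then point a migration at eth0's IP, e.g. on the destination:
    #   qemu-system-x86_64 ... -incoming rdma:<eth0-ip>:4444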

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-09 Thread Peter Xu
On Tue, Apr 09, 2024 at 09:32:46AM +0200, Jinpu Wang wrote:
> Hi Peter,
> 
> On Mon, Apr 8, 2024 at 6:18 PM Peter Xu  wrote:
> >
> > On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > > Hi Peter,
> >
> > Jinpu,
> >
> > Thanks for joining the discussion.
> >
> > >
> > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
> > > >
> > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > > Hello Peter und Zhjian,
> > > > >
> > > > > Thank you so much for letting me know about this. I'm also a bit 
> > > > > surprised at
> > > > > the plan for deprecating the RDMA migration subsystem.
> > > >
> > > > It's not too late, since it looks like we do have users not yet notified
> > > > from this, we'll redo the deprecation procedure even if it'll be the 
> > > > final
> > > > plan, and it'll be 2 releases after this.
> > > >
> > > > >
> > > > > > IMHO it's more important to know whether there are still users and 
> > > > > > whether
> > > > > > they would still like to see it around.
> > > > >
> > > > > > I admit RDMA migration was lack of testing(unit/CI test), which led 
> > > > > > to the a few
> > > > > > obvious bugs being noticed too late.
> > > > >
> > > > > Yes, we are a user of this subsystem. I was unaware of the lack of 
> > > > > test coverage
> > > > > for this part. As soon as 8.2 was released, I saw that many of the
> > > > > migration test
> > > > > cases failed and came to realize that there might be a bug between 8.1
> > > > > and 8.2, but
> > > > > was unable to confirm and report it quickly to you.
> > > > >
> > > > > The maintenance of this part could be too costly or difficult from
> > > > > your point of view.
> > > >
> > > > It may or may not be too costly, it's just that we need real users of 
> > > > RDMA
> > > > taking some care of it.  Having it broken easily for >1 releases 
> > > > definitely
> > > > is a sign of lack of users.  It is an implication to the community that 
> > > > we
> > > > should consider dropping some features so that we can get the best use 
> > > > of
> > > > the community resources for the things that may have a broader audience.
> > > >
> > > > One thing majorly missing is a RDMA tester to guard all the merges to 
> > > > not
> > > > break RDMA paths, hopefully in CI.  That should not rely on RDMA 
> > > > hardwares
> > > > but just to sanity check the migration+rdma code running all fine.  RDMA
> > > > taught us the lesson so we're requesting CI coverage for all other new
> > > > features that will be merged at least for migration subsystem, so that 
> > > > we
> > > > plan to not merge anything that is not covered by CI unless extremely
> > > > necessary in the future.
> > > >
> > > > For sure CI is not the only missing part, but I'd say we should start 
> > > > with
> > > > it, then someone should also take care of the code even if only in
> > > > maintenance mode (no new feature to add on top).
> > > >
> > > > >
> > > > > My concern is, this plan will forces a few QEMU users (not sure how
> > > > > many) like us
> > > > > either to stick to the RDMA migration by using an increasingly older
> > > > > version of QEMU,
> > > > > or to abandon the currently used RDMA migration.
> > > >
> > > > RDMA doesn't get new features anyway, if there's specific use case for 
> > > > RDMA
> > > > migrations, would it work if such a scenario uses the old binary?  Is it
> > > > possible to switch to the TCP protocol with some good NICs?
> > > We have used rdma migration with HCA from Nvidia for years, our
> > > experience is RDMA migration works better than tcp (over ipoib).
> >
> > Please bare with me, as I know little on rdma stuff.
> >
> > I'm actually pretty confused (and since a long time ago..) on why we need
> > to operation with rdma contexts when ipoib seems to provide all the tcp
> > layers.  I meant, can it work with the current "tcp:" protocol with ipoib
> > e

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-08 Thread Peter Xu
On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> Hi Peter,

Jinpu,

Thanks for joining the discussion.

> 
> On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
> >
> > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > Hello Peter und Zhjian,
> > >
> > > Thank you so much for letting me know about this. I'm also a bit 
> > > surprised at
> > > the plan for deprecating the RDMA migration subsystem.
> >
> > It's not too late, since it looks like we do have users not yet notified
> > from this, we'll redo the deprecation procedure even if it'll be the final
> > plan, and it'll be 2 releases after this.
> >
> > >
> > > > IMHO it's more important to know whether there are still users and 
> > > > whether
> > > > they would still like to see it around.
> > >
> > > > I admit RDMA migration was lack of testing(unit/CI test), which led to 
> > > > the a few
> > > > obvious bugs being noticed too late.
> > >
> > > Yes, we are a user of this subsystem. I was unaware of the lack of test 
> > > coverage
> > > for this part. As soon as 8.2 was released, I saw that many of the
> > > migration test
> > > cases failed and came to realize that there might be a bug between 8.1
> > > and 8.2, but
> > > was unable to confirm and report it quickly to you.
> > >
> > > The maintenance of this part could be too costly or difficult from
> > > your point of view.
> >
> > It may or may not be too costly, it's just that we need real users of RDMA
> > taking some care of it.  Having it broken easily for >1 releases definitely
> > is a sign of lack of users.  It is an implication to the community that we
> > should consider dropping some features so that we can get the best use of
> > the community resources for the things that may have a broader audience.
> >
> > One thing majorly missing is a RDMA tester to guard all the merges to not
> > break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> > but just to sanity check the migration+rdma code running all fine.  RDMA
> > taught us the lesson so we're requesting CI coverage for all other new
> > features that will be merged at least for migration subsystem, so that we
> > plan to not merge anything that is not covered by CI unless extremely
> > necessary in the future.
> >
> > For sure CI is not the only missing part, but I'd say we should start with
> > it, then someone should also take care of the code even if only in
> > maintenance mode (no new feature to add on top).
> >
> > >
> > > My concern is, this plan will forces a few QEMU users (not sure how
> > > many) like us
> > > either to stick to the RDMA migration by using an increasingly older
> > > version of QEMU,
> > > or to abandon the currently used RDMA migration.
> >
> > RDMA doesn't get new features anyway, if there's specific use case for RDMA
> > migrations, would it work if such a scenario uses the old binary?  Is it
> > possible to switch to the TCP protocol with some good NICs?
> We have used rdma migration with HCA from Nvidia for years, our
> experience is RDMA migration works better than tcp (over ipoib).

Please bear with me, as I know little about rdma stuff.

I'm actually pretty confused (and have been since a long time ago..) about why
we need to operate with rdma contexts when ipoib seems to provide all the tcp
layers.  I mean, can it work with the current "tcp:" protocol with ipoib
even if there's rdma/ib hardware underneath?  Is it because of performance
improvements that we must use a separate path compared to the generic
"tcp:" protocol here?

> 
> Switching back to TCP will lead us to the old problems which was
> solved by RDMA migration.

Can you elaborate on the problems, and why tcp won't work in this case?  They
may not be directly relevant to the issue we're discussing, but I'm happy
to learn more.

What NICs were you testing with before?  Was the test carried out with
modern ones (50Gbps-200Gbps NICs), or was it done when
such hardware was not yet common?

Per my recent knowledge of the new Intel hardware, at least the parts that
support QPL, it's easy to achieve 50Gbps+ on a single core.

https://lore.kernel.org/r/ph7pr11mb5941a91ac1e514bcc32896a6a3...@ph7pr11mb5941.namprd11.prod.outlook.com

Quote from Yuan:

  Yes, I use iperf3 to check the bandwidth for one core, the bandwith is 60Gbps.
  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
  [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
  [  5]   1.00-2.00 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-02 Thread Peter Xu
On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> Hello Peter und Zhjian,
> 
> Thank you so much for letting me know about this. I'm also a bit surprised at
> the plan for deprecating the RDMA migration subsystem.

It's not too late: since it looks like we do have users not yet notified
of this, we'll redo the deprecation procedure even if it'll be the final
plan, and it'll be 2 releases after this.

> 
> > IMHO it's more important to know whether there are still users and whether
> > they would still like to see it around.
> 
> > I admit RDMA migration was lack of testing(unit/CI test), which led to the 
> > a few
> > obvious bugs being noticed too late.
> 
> Yes, we are a user of this subsystem. I was unaware of the lack of test 
> coverage
> for this part. As soon as 8.2 was released, I saw that many of the
> migration test
> cases failed and came to realize that there might be a bug between 8.1
> and 8.2, but
> was unable to confirm and report it quickly to you.
> 
> The maintenance of this part could be too costly or difficult from
> your point of view.

It may or may not be too costly; it's just that we need real users of RDMA
taking some care of it.  Having it break easily for more than one release is
definitely a sign of a lack of users.  It is an indication to the community
that we should consider dropping some features so that we can get the best use
of the community resources for the things that may have a broader audience.

One major missing thing is an RDMA tester to guard all the merges so they do
not break RDMA paths, hopefully in CI.  That should not rely on RDMA hardware
but just sanity check that the migration+rdma code runs fine.  RDMA
taught us the lesson, so we're requesting CI coverage for all other new
features that will be merged, at least for the migration subsystem; the plan
is to not merge anything that is not covered by CI unless extremely
necessary in the future.

For sure CI is not the only missing part, but I'd say we should start with
it, then someone should also take care of the code even if only in
maintenance mode (no new feature to add on top).

> 
> My concern is, this plan will forces a few QEMU users (not sure how
> many) like us
> either to stick to the RDMA migration by using an increasingly older
> version of QEMU,
> or to abandon the currently used RDMA migration.

RDMA doesn't get new features anyway; if there's a specific use case for RDMA
migrations, would it work if such a scenario uses the old binary?  Is it
possible to switch to the TCP protocol with some good NICs?

To the best of our knowledge, RDMA users are rare; please let us know if
you are aware of such users.  IIUC the major reason RDMA stopped being
the trend is that networking is not what it was ten years ago; I don't claim
good knowledge of RDMA or networking, but my understanding is that
it's pretty easy to get a modern NIC that outperforms RDMA, and then it may
make little sense to maintain multiple protocols, considering the RDMA
migration code is so special that it has the most custom code compared to the
other protocols.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-03-28 Thread Peter Xu
On Thu, Mar 28, 2024 at 04:22:27PM +0100, Thomas Huth wrote:
> Since e9a54265f5 was not very clear about rdma migration code, should we
> maybe rather add a separate deprecation note for the migration part, and add
> a proper warning message to the migration code in case someone tries to use
> it there, and then only remove the rdma migration code after two more
> releases?

Definitely a valid option to me.

So far RDMA isn't covered in tests (actually the same goes for COLO, and I
wonder about our position on COLO too in this case..), so unfortunately we
won't even know when it breaks, just like before.

From the other activity I can see when new code comes in, maintaining the RDMA
code should be fairly manageable so far (and whoever writes new rdma
code in those two releases will also need to take the maintainer's
role). We did it for all those years, and we can keep that up for two more
releases. Hopefully such warnings can ring a louder alarm for the current
users, so that people can either stick with old binaries, or invest
developer/test resources in the community.

Thanks,

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-03-28 Thread Peter Xu
On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote:
> Philippe Mathieu-Daudé  writes:
> 
> > The whole RDMA subsystem was deprecated in commit e9a54265f5
> > ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
> > released in v8.2.
> >
> > Remove:
> >  - RDMA handling from migration
> >  - dependencies on libibumad, libibverbs and librdmacm
> >
> > Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
> > in old migration streams.
> >
> > Cc: Peter Xu 
> > Cc: Li Zhijian 
> > Acked-by: Fabiano Rosas 
> > Signed-off-by: Philippe Mathieu-Daudé 
> 
> Just to be clear, because people raised the point in the last version,
> the first link in the deprecation commit links to a thread comprising
> entirely of rdma migration patches. I don't see any ambiguity on whether
> the deprecation was intended to include migration. There's even an ack
> from Juan.

Yes I remember that's the plan.

> 
> So on the basis of not reverting the previous maintainer's decision, my
> Ack stands here.
> 
> We also had pretty obvious bugs ([1], [2]) in the past that would have
> been caught if we had any kind of testing for the feature, so I can't
> even say this thing works currently.
> 
> @Peter Xu, @Li Zhijian, what are your thoughts on this?

Generally I definitely agree with such a removal sooner or later, as that's
how deprecation works, and even after Juan left I'm not aware of any
other new RDMA users.  Personally, I'd slightly prefer postponing it one
more release, which might help a bit with our downstream maintenance; however,
I assume that's not a blocker either, as I think we can also manage it.

IMHO it's more important to know whether there are still users and whether
they would still like to see it around. That's also one thing I notice:
e9a54265f533f didn't yet get acks from the RDMA users that we are aware of,
even if they're rare. According to [2] it could be that such users only rely
on the release versions of QEMU, which is when it broke things.

So I'm copying Yu too (while Zhijian is already in the loop), just in case
someone would like to stand up and speak.

Thanks,

> 
> 1- https://lore.kernel.org/r/20230920090412.726725-1-lizhij...@fujitsu.com
> 2- 
> https://lore.kernel.org/r/cahecvy7hxswn4ow_kog+q+tn6f_kmeichevz1qgm-fbxbpp...@mail.gmail.com
> 

-- 
Peter Xu