RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-07 Thread Gonglei (Arei)
Hi,

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Thursday, June 6, 2024 5:19 AM
> To: Dr. David Alan Gilbert 
> Cc: Michael Galaxy ; zhengchuan
> ; Gonglei (Arei) ;
> Daniel P. Berrangé ; Markus Armbruster
> ; Yu Zhang ; Zhijian Li (Fujitsu)
> ; Jinpu Wang ; Elmar Gerdes
> ; qemu-de...@nongnu.org; Yuval Shaia
> ; Kevin Wolf ; Prasanna
> Kumar Kalever ; Cornelia Huck
> ; Michael Roth ; Prasanna
> Kumar Kalever ; integrat...@gluster.org; Paolo
> Bonzini ; qemu-block@nongnu.org;
> de...@lists.libvirt.org; Hanna Reitz ; Michael S. Tsirkin
> ; Thomas Huth ; Eric Blake
> ; Song Gao ; Marc-André
> Lureau ; Alex Bennée
> ; Wainer dos Santos Moschetta
> ; Beraldo Leal ; Pannengyuan
> ; Xiexiangyou 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Wed, Jun 05, 2024 at 08:48:28PM +, Dr. David Alan Gilbert wrote:
> > > > I just noticed this thread; some random notes from a somewhat
> > > > fragmented memory of this:
> > > >
> > > >   a) Long long ago, I also tried rsocket;
> > > >
> https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
> > > >  as I remember the library was quite flaky at the time.
> > >
> > > Hmm interesting.  There also looks like a thread doing rpoll().
> >
> > Yeh, I can't actually remember much more about what I did back then!
> 
> Heh, that's understandable and fair. :)
> 
> > > I hope Lei and his team has tested >4G mem, otherwise definitely
> > > worth checking.  Lei also mentioned there're rsocket bugs they found
> > > in the cover letter, but not sure what's that about.
> >
> > It would probably be a good idea to keep track of what bugs are in
> > flight with it, and try it on a few RDMA cards to see what problems
> > get triggered.
> > I think I reported a few at the time, but I gave up after feeling it
> > was getting very hacky.
> 
> Agreed.  Maybe we can have a list of that in the cover letter or even QEMU's
> migration/rdma doc page.
> 
> Lei, if you think that makes sense please do so in your upcoming posts.
> There'll need to have a list of things you encountered in the kernel driver 
> and
> it'll be even better if there're further links to read on each problem.
> 
OK, no problem. There are two bugs:

Bug 1:

https://github.com/linux-rdma/rdma-core/commit/23985e25aebb559b761872313f8cab4e811c5a3d#diff-5ddbf83c6f021688166096ca96c9bba874dffc3cab88ded2e9d8b2176faa084cR3302-R3303

This commit introduces a bug that causes QEMU to hang: when the timeout parameter 
of rpoll() is neither -1 nor 0, the program occasionally gets stuck.

Problem analysis:

First rpoll() call:
- At line 3297, rs_poll_enter() increments pollcnt, so pollcnt == 1.
- At line 3302, the timeout expires and the function returns. Note that 
rs_poll_exit() is not called on this path, so pollcnt is not decremented and 
stays at 1.

Second rpoll() call:
- rs_poll_enter() at line 3297 increments pollcnt again, so pollcnt == 2.
- If the poll does not time out and returns a value greater than 0, 
rs_poll_stop() runs; because the "if (--pollcnt)" condition is false, 
suspendpoll is set to 1.
- Back in rpoll()'s do/while loop, rs_poll_enter() is called again. Now the 
"if (suspendpoll)" condition is true, so it calls pthread_yield() and returns 
-EBUSY. The do/while loop in rpoll() then sees "if (rs_poll_enter())" as true 
and, after the continue, calls rs_poll_enter() yet again. The loop never makes 
progress, so the program hangs.

Root cause: on the timeout path at line 3302, the function returns without 
calling rs_poll_exit().
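
To make the pattern concrete, here is a simplified illustration in C (hypothetical 
names like poll_enter()/poll_exit(); this is not the actual rsocket.c code): an 
early return on the timeout path skips the decrement, so the counter drifts upward 
across calls, while the fixed shape balances every exit path.

    /* Simplified illustration only; hypothetical names, not rsocket.c code. */
    static int pollcnt;                 /* how many callers are inside the poll */

    static void poll_enter(void) { pollcnt++; }
    static void poll_exit(void)  { pollcnt--; }

    /* Buggy shape: the timeout path returns without undoing poll_enter(),
     * so pollcnt keeps growing across calls (mirrors the missing
     * rs_poll_exit() on the timeout path at rsocket.c line 3302). */
    int do_poll_buggy(int timed_out)
    {
        poll_enter();
        if (timed_out) {
            return 0;                   /* BUG: pollcnt stays elevated */
        }
        /* ... wait for events ... */
        poll_exit();
        return 1;
    }

    /* Fixed shape: every exit path balances the counter. */
    int do_poll_fixed(int timed_out)
    {
        poll_enter();
        if (timed_out) {
            poll_exit();                /* keep pollcnt balanced */
            return 0;
        }
        /* ... wait for events ... */
        poll_exit();
        return 1;
    }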


Bug 2:

In rsocket.c there is an accept queue, int accept_queue[2], implemented with a 
socketpair. The listen_svc thread in rsocket.c is responsible for receiving 
connections and writing them to accept_queue[1]; when raccept() is called, a 
connection is read from accept_queue[0].

In our test case, qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN) waits for a 
readable event (i.e. an incoming connection), and rpoll() is expected to check 
whether accept_queue[0] is readable. However, this particular poll does not 
include accept_queue[0], so the readable event on accept_queue[0] is only picked 
up by rs_poll_arm() after the timeout expires.

Impact:
The accept operation only completes after 5000 ms. This interval can be shortened 
by writing a smaller value (in milliseconds) to /etc/rdma/rsocket/wake_up_interval.
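
As a simplified model of the accept path described above (hypothetical names; the 
real rsocket.c code is more involved): the listener thread hands a new connection 
over the socketpair, and the accepting side only notices it if its poll actually 
covers accept_queue[0].

    #include <poll.h>
    #include <pthread.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int accept_queue[2];        /* [1]: listener writes, [0]: raccept reads */

    /* Stand-in for rsocket's listen_svc thread: push a new connection fd. */
    static void *listen_svc(void *arg)
    {
        int new_conn_fd = *(int *)arg;
        write(accept_queue[1], &new_conn_fd, sizeof(new_conn_fd));
        return NULL;
    }

    /* Stand-in for raccept(): only sees the connection if accept_queue[0]
     * is part of the fd set being polled; otherwise it waits for the
     * timeout (5000 ms mirrors the default wake_up_interval). */
    static int my_raccept(int timeout_ms)
    {
        struct pollfd pfd = { .fd = accept_queue[0], .events = POLLIN };
        int fd = -1;

        if (poll(&pfd, 1, timeout_ms) > 0) {
            read(accept_queue[0], &fd, sizeof(fd));
        }
        return fd;
    }

    int main(void)
    {
        pthread_t t;
        int conn = 42;                 /* stand-in for a real connection fd */

        socketpair(AF_UNIX, SOCK_STREAM, 0, accept_queue);
        pthread_create(&t, NULL, listen_svc, &conn);
        int got = my_raccept(5000);
        pthread_join(t, NULL);
        close(accept_queue[0]);
        close(accept_queue[1]);
        return got == conn ? 0 : 1;
    }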


Regards,
-Gonglei

> > > >
> > > >   e) Someone made a good suggestion (sorry can't remember who) -
> that the
> > > >  RDMA migration structure was the wrong way around - it should
> be the

RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Gonglei (Arei)
Hi,

> -Original Message-
> > >
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > > > 15
> > > > > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > > > >
> > > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > > >
> > > > > > > > > InfiniBand controller: Mellanox Technologies MT27800
> > > > > > > > > Family [ConnectX-5]
> > > > > > > > >
> > > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP
> > > > > > > > > for VM migration. RDMA traffic is through InfiniBand and
> > > > > > > > > TCP through Ethernet on these two hosts. One is standby
> > > > > > > > > while the other
> > > is active.
> > > > > > > > >
> > > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
> > > > > > > > > (rev
> > > > > > > > > 01)
> > > > > > > > >
> > > > > > > > > The comparison between RDMA and TCP on the same NIC
> > > > > > > > > could make more
> > > > > > > > sense.
> > > > > > > >
> > > > > > > > It looks to me NICs are powerful now, but again as I
> > > > > > > > mentioned I don't think it's a reason we need to deprecate
> > > > > > > > rdma, especially if QEMU's rdma migration has the chance
> > > > > > > > to be refactored
> > > using rsocket.
> > > > > > > >
> > > > > > > > Is there anyone who started looking into that direction?
> > > > > > > > Would it make sense we start some PoC now?
> > > > > > > >
> > > > > > >
> > > > > > > My team has finished the PoC refactoring which works well.
> > > > > > >
> > > > > > > Progress:
> > > > > > > 1.  Implement io/channel-rdma.c, 2.  Add unit test
> > > > > > > tests/unit/test-io-channel-rdma.c and verifying it is
> > > > > > > successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx
> > > > > > > functions from migration/ram.c. (to prevent RDMA live
> > > > > > > migration from polluting the
> > > > > > core logic of live migration), 6.  The soft-RoCE implemented
> > > > > > by software is used to test the RDMA live migration. It's 
> > > > > > successful.
> > > > > > >
> > > > > > > We will submit the patchset later.
> > > > > >
> > > > > > That's great news, thank you!
> > > > > >
> > > > > > --
> > > > > > Peter Xu
> > > > >
> > > > > For rdma programming, the current mainstream implementation is
> > > > > to use
> > > rdma_cm to establish a connection, and then use verbs to transmit data.
> > > > >
> > > > > rdma_cm and ibverbs create two FDs respectively. The two FDs
> > > > > have different responsibilities. rdma_cm fd is used to notify
> > > > > connection establishment events, and verbs fd is used to notify
> > > > > new CQEs. When
> > > poll/epoll monitoring is directly performed on the rdma_cm fd, only
> > > a pollin event can be monitored, which means that an rdma_cm event
> > > occurs. When the verbs fd is directly polled/epolled, only the
> > > pollin event can be listened, which indicates that a new CQE is generated.
> > > > >
> > > > > Rsocket is a sub-module attached to the rdma_cm library and
> > > > > provides rdma calls that are completely similar to socket interfaces.
> > > > > However, this library returns only the rdma_cm fd for listening
> > > > > to link
> > > setup-related events and does not expose the verbs fd (readable and
> > > writable events for listening to data). Only the rpoll interface
> > > provided by the RSocket can be used to listen to related events.
> > > However, QEMU uses the ppoll interface to listen to the rdma_cm fd
> (gotten by raccept API).
> > > > > And cannot listen to the verbs fd event.
> I'm confused. See rs_poll_arm:
> https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/rsocket.c#L3290
> For STREAM sockets, rpoll() sets up fds for both the CQ fd and the CM fd.
> 

Right. But the problem is that QEMU does not use rpoll but glib's ppoll. :(


Regards,
-Gonglei



RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Gonglei (Arei)


> -Original Message-
> From: Jinpu Wang [mailto:jinpu.w...@ionos.com]
> Sent: Wednesday, May 29, 2024 5:18 PM
> To: Gonglei (Arei) 
> Cc: Greg Sword ; Peter Xu ;
> Yu Zhang ; Michael Galaxy ;
> Elmar Gerdes ; zhengchuan
> ; Daniel P. Berrangé ;
> Markus Armbruster ; Zhijian Li (Fujitsu)
> ; qemu-de...@nongnu.org; Yuval Shaia
> ; Kevin Wolf ; Prasanna
> Kumar Kalever ; Cornelia Huck
> ; Michael Roth ; Prasanna
> Kumar Kalever ; Paolo Bonzini
> ; qemu-block@nongnu.org; de...@lists.libvirt.org;
> Hanna Reitz ; Michael S. Tsirkin ;
> Thomas Huth ; Eric Blake ; Song
> Gao ; Marc-André Lureau
> ; Alex Bennée ;
> Wainer dos Santos Moschetta ; Beraldo Leal
> ; Pannengyuan ;
> Xiexiangyou ; Fabiano Rosas ;
> RDMA mailing list ; she...@nvidia.com; Haris
> Iqbal 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> Hi Gonglei,
> 
> On Wed, May 29, 2024 at 10:31 AM Gonglei (Arei) 
> wrote:
> >
> >
> >
> > > -Original Message-
> > > From: Greg Sword [mailto:gregswo...@gmail.com]
> > > Sent: Wednesday, May 29, 2024 2:06 PM
> > > To: Jinpu Wang 
> > > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol
> > > handling
> > >
> > > On Wed, May 29, 2024 at 12:33 PM Jinpu Wang 
> > > wrote:
> > > >
> > > > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei)
> > > > 
> > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > > -Original Message-
> > > > > > From: Peter Xu [mailto:pet...@redhat.com]
> > > > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > > > Exactly, not so compelling, as I did it first only on
> > > > > > > > > servers widely used for production in our data center.
> > > > > > > > > The network adapters are
> > > > > > > > >
> > > > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries
> > > > > > > > > NetXtreme
> > > > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > > > >
> > > > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6
> > > > > > > > looks more
> > > > > > reasonable.
> > > > > > > >
> > > > > > > >
> > > > > >
> > >
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > > > 15
> > > > > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > > > >
> > > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > > >
> > > > > > > > > InfiniBand controller: Mellanox Technologies MT27800
> > > > > > > > > Family [ConnectX-5]
> > > > > > > > >
> > > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP
> > > > > > > > > for VM migration. RDMA traffic is through InfiniBand and
> > > > > > > > > TCP through Ethernet on these two hosts. One is standby
> > > > > > > > > while the other
> > > is active.
> > > > > > > > >
> > > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
> > > > > > > > > (rev
> > > > > > > > > 01)
> > > > > > > > >
> > > > > > > > > The comparison between RDMA and TCP on the same NIC
> > > > > > > > > could make more
> > > > > > > > sense.
> > > > > > > >
> > > > > > > > It looks to me NICs are powerful now, but again as I
> > > > > > > > mentioned I don't think it's a reason we need to deprecate
> > > > > > > > rdma, especially if QEMU's rdma migration has the chance
> > > > > > > > to be refactored
> > > using rsocket.
> > > > > > > >
> > > > > > > > Is there anyone who started looking into that direction?
> > > > > > > > Would it make

RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Gonglei (Arei)


> -Original Message-
> From: Greg Sword [mailto:gregswo...@gmail.com]
> Sent: Wednesday, May 29, 2024 2:06 PM
> To: Jinpu Wang 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Wed, May 29, 2024 at 12:33 PM Jinpu Wang 
> wrote:
> >
> > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei) 
> wrote:
> > >
> > > Hi,
> > >
> > > > -Original Message-
> > > > From: Peter Xu [mailto:pet...@redhat.com]
> > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > Exactly, not so compelling, as I did it first only on
> > > > > > > servers widely used for production in our data center. The
> > > > > > > network adapters are
> > > > > > >
> > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries
> > > > > > > NetXtreme
> > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > >
> > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks
> > > > > > more
> > > > reasonable.
> > > > > >
> > > > > >
> > > >
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > 15
> > > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > >
> > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > >
> > > > > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > > > > [ConnectX-5]
> > > > > > >
> > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP for
> > > > > > > VM migration. RDMA traffic is through InfiniBand and TCP
> > > > > > > through Ethernet on these two hosts. One is standby while the 
> > > > > > > other
> is active.
> > > > > > >
> > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev
> > > > > > > 01)
> > > > > > >
> > > > > > > The comparison between RDMA and TCP on the same NIC could
> > > > > > > make more
> > > > > > sense.
> > > > > >
> > > > > > It looks to me NICs are powerful now, but again as I mentioned
> > > > > > I don't think it's a reason we need to deprecate rdma,
> > > > > > especially if QEMU's rdma migration has the chance to be refactored
> using rsocket.
> > > > > >
> > > > > > Is there anyone who started looking into that direction?
> > > > > > Would it make sense we start some PoC now?
> > > > > >
> > > > >
> > > > > My team has finished the PoC refactoring which works well.
> > > > >
> > > > > Progress:
> > > > > 1.  Implement io/channel-rdma.c, 2.  Add unit test
> > > > > tests/unit/test-io-channel-rdma.c and verifying it is
> > > > > successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx
> > > > > functions from migration/ram.c. (to prevent RDMA live migration
> > > > > from polluting the
> > > > core logic of live migration), 6.  The soft-RoCE implemented by
> > > > software is used to test the RDMA live migration. It's successful.
> > > > >
> > > > > We will submit the patchset later.
> > > >
> > > > That's great news, thank you!
> > > >
> > > > --
> > > > Peter Xu
> > >
> > > For rdma programming, the current mainstream implementation is to use
> rdma_cm to establish a connection, and then use verbs to transmit data.
> > >
> > > rdma_cm and ibverbs create two FDs respectively. The two FDs have
> > > different responsibilities. rdma_cm fd is used to notify connection
> > > establishment events, and verbs fd is used to notify new CQEs. When
> poll/epoll monitoring is directly performed on the rdma_cm fd, only a pollin
> event can be monitored, which means that an rdma_cm event occurs. When
> the verbs fd is directly polled/epolled, only the pollin event can be 
> listened,
> which indicates that a new CQE is generated.
> > >
> > > Rsocket is a sub-module attached to the rdma_cm library and provides
> > > rdma calls that are completely similar to socket interfaces.
> > > However, this library returns only the rdma_cm fd for listening to link
> setup-related events and does not expose the verbs fd (readable and writable
> events for listening to data). Only the rpoll interface provided by the 
> RSocket
> can be used to listen to related events. However, QEMU uses the ppoll
> interface to listen to the rdma_cm fd (gotten by raccept API).
> > > And cannot listen to the verbs fd event. Only some hacking methods can be
> used to address this problem.
> > >
> > > Do you guys have any ideas? Thanks.
> > +cc linux-rdma
> 
> Why include rdma community?
> 

Can rdma/rsocket provide an API to expose the verbs fd? 
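
For example, something along the following lines (purely hypothetical; no such 
call exists in rsocket today, and the name rget_verbs_fd() is made up) is the kind 
of accessor we have in mind, so that an external ppoll/glib-style event loop could 
watch both fds:

    #include <poll.h>

    /* Hypothetical accessor (does not exist in rsocket): it would return the
     * verbs completion-channel fd behind an rsocket, so data events could be
     * watched by an external event loop. */
    int rget_verbs_fd(int rsocket_fd);

    int wait_for_events(int rsocket_fd, int cm_fd, int timeout_ms)
    {
        struct pollfd pfds[2] = {
            { .fd = cm_fd,                     .events = POLLIN }, /* CM events  */
            { .fd = rget_verbs_fd(rsocket_fd), .events = POLLIN }, /* CQE events */
        };
        return poll(pfds, 2, timeout_ms);
    }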


Regards,
-Gonglei

> > +cc Sean
> >
> >
> >
> > >
> > >
> > > Regards,
> > > -Gonglei
> >


RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-28 Thread Gonglei (Arei)
Hi,

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Tuesday, May 28, 2024 11:55 PM
> > > > Exactly, not so compelling, as I did it first only on servers
> > > > widely used for production in our data center. The network
> > > > adapters are
> > > >
> > > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme
> > > > BCM5720 2-port Gigabit Ethernet PCIe
> > >
> > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more
> reasonable.
> > >
> > >
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > >
> > > Appreciate a lot for everyone helping on the testings.
> > >
> > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > [ConnectX-5]
> > > >
> > > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > > migration. RDMA traffic is through InfiniBand and TCP through
> > > > Ethernet on these two hosts. One is standby while the other is active.
> > > >
> > > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > > network adapters. One of them has:
> > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > > >
> > > > The comparison between RDMA and TCP on the same NIC could make
> > > > more
> > > sense.
> > >
> > > It looks to me NICs are powerful now, but again as I mentioned I
> > > don't think it's a reason we need to deprecate rdma, especially if
> > > QEMU's rdma migration has the chance to be refactored using rsocket.
> > >
> > > Is there anyone who started looking into that direction?  Would it
> > > make sense we start some PoC now?
> > >
> >
> > My team has finished the PoC refactoring which works well.
> >
> > Progress:
> > 1.  Implement io/channel-rdma.c,
> > 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it
> > is successful, 3.  Remove the original code from migration/rdma.c, 4.
> > Rewrite the rdma_start_outgoing_migration and
> > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx functions
> > from migration/ram.c. (to prevent RDMA live migration from polluting the
> core logic of live migration), 6.  The soft-RoCE implemented by software is
> used to test the RDMA live migration. It's successful.
> >
> > We will submit the patchset later.
> 
> That's great news, thank you!
> 
> --
> Peter Xu

For RDMA programming, the current mainstream approach is to use rdma_cm to 
establish a connection and then use verbs to transmit data.

rdma_cm and ibverbs each create their own fd, and the two fds have different 
responsibilities: the rdma_cm fd is used to signal connection-establishment 
events, and the verbs fd is used to signal new CQEs. When poll/epoll monitors the 
rdma_cm fd directly, only a POLLIN event can be observed, which means an rdma_cm 
event has occurred. When the verbs fd is polled/epolled directly, only a POLLIN 
event can be observed, which indicates that a new CQE has been generated.

Rsocket is a sub-module attached to the rdma_cm library and provides RDMA calls 
that closely mirror the socket interfaces. However, the library returns only the 
rdma_cm fd (for listening to connection-setup events) and does not expose the 
verbs fd (for readable/writable data events); only the rpoll() interface provided 
by rsocket can be used to wait for both kinds of events. QEMU, however, uses the 
ppoll interface to listen on the rdma_cm fd (obtained via the raccept API) and 
therefore cannot observe verbs fd events. Only hacks can be used to work around 
this problem.
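
To illustrate the mismatch, a minimal sketch in C (not QEMU code; it assumes only 
the public rsocket API from <rdma/rsocket.h>, and the helper names are made up): 
rpoll() waits on the rsocket itself and internally arms both the rdma_cm fd and 
the CQ fd, whereas a ppoll/glib-style loop that is handed only the rdma_cm fd 
never wakes up for new CQEs.

    #include <poll.h>
    #include <rdma/rsocket.h>

    /* rsocket's intended usage: pass the rsocket descriptor to rpoll(),
     * which internally covers both the rdma_cm fd and the verbs CQ fd. */
    int wait_readable_rpoll(int rsock, int timeout_ms)
    {
        struct pollfd pfd = { .fd = rsock, .events = POLLIN };
        return rpoll(&pfd, 1, timeout_ms);   /* >0: readable, 0: timeout */
    }

    /* What a ppoll/glib-based loop effectively does: poll the one fd it was
     * given (the rdma_cm fd), so a new CQE never wakes it up. */
    int wait_readable_ppoll(int cm_fd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = cm_fd, .events = POLLIN };
        return poll(&pfd, 1, timeout_ms);    /* sees connection events only */
    }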

Do you guys have any ideas? Thanks.


Regards,
-Gonglei


RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-28 Thread Gonglei (Arei)
Hi Peter,

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Wednesday, May 22, 2024 6:15 AM
> To: Yu Zhang 
> Cc: Michael Galaxy ; Jinpu Wang
> ; Elmar Gerdes ;
> zhengchuan ; Gonglei (Arei)
> ; Daniel P. Berrangé ;
> Markus Armbruster ; Zhijian Li (Fujitsu)
> ; qemu-de...@nongnu.org; Yuval Shaia
> ; Kevin Wolf ; Prasanna
> Kumar Kalever ; Cornelia Huck
> ; Michael Roth ; Prasanna
> Kumar Kalever ; Paolo Bonzini
> ; qemu-block@nongnu.org; de...@lists.libvirt.org;
> Hanna Reitz ; Michael S. Tsirkin ;
> Thomas Huth ; Eric Blake ; Song
> Gao ; Marc-André Lureau
> ; Alex Bennée ;
> Wainer dos Santos Moschetta ; Beraldo Leal
> ; Pannengyuan ;
> Xiexiangyou ; Fabiano Rosas 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> > Hello Michael and Peter,
> 
> Hi,
> 
> >
> > Exactly, not so compelling, as I did it first only on servers widely
> > used for production in our data center. The network adapters are
> >
> > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> > 2-port Gigabit Ethernet PCIe
> 
> Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more reasonable.
> 
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> wvaqk81vxtkzx-l...@mail.gmail.com/
> 
> Appreciate a lot for everyone helping on the testings.
> 
> > InfiniBand controller: Mellanox Technologies MT27800 Family
> > [ConnectX-5]
> >
> > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> > on these two hosts. One is standby while the other is active.
> >
> > Now I'll try on a server with more recent Ethernet and InfiniBand
> > network adapters. One of them has:
> > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> >
> > The comparison between RDMA and TCP on the same NIC could make more
> sense.
> 
> It looks to me NICs are powerful now, but again as I mentioned I don't think 
> it's
> a reason we need to deprecate rdma, especially if QEMU's rdma migration has
> the chance to be refactored using rsocket.
> 
> Is there anyone who started looking into that direction?  Would it make sense
> we start some PoC now?
> 

My team has finished the PoC refactoring, and it works well.

Progress:
1.  Implement io/channel-rdma.c.
2.  Add the unit test tests/unit/test-io-channel-rdma.c and verify that it passes.
3.  Remove the original code from migration/rdma.c.
4.  Rewrite the rdma_start_outgoing_migration and rdma_start_incoming_migration 
logic.
5.  Remove all rdma_xxx functions from migration/ram.c (to keep RDMA live 
migration from polluting the core live-migration logic).
6.  Use software-implemented soft-RoCE to test RDMA live migration; the test 
passes.

We will submit the patchset later.
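
For reference, a rough sketch of the direction only (this is not the actual 
patchset; it assumes nothing beyond the public rsocket API in <rdma/rsocket.h>, 
which mirrors the BSD socket API, and the helper names below are made up):

    #include <netdb.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <rdma/rsocket.h>

    /* Hypothetical helper: open an outgoing RDMA "socket" to host:port. */
    int rdma_channel_connect(const char *host, const char *port)
    {
        struct addrinfo hints = { .ai_family = AF_INET, .ai_socktype = SOCK_STREAM };
        struct addrinfo *res = NULL;
        int fd = -1;

        if (getaddrinfo(host, port, &hints, &res) != 0) {
            return -1;
        }
        fd = rsocket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd >= 0 && rconnect(fd, res->ai_addr, res->ai_addrlen) != 0) {
            rclose(fd);
            fd = -1;
        }
        freeaddrinfo(res);
        return fd;
    }

    /* Hypothetical helper: send a migration buffer over the rsocket fd. */
    ssize_t rdma_channel_send(int fd, const void *buf, size_t len)
    {
        return rwrite(fd, buf, len);
    }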


Regards,
-Gonglei

> Thanks,
> 
> --
> Peter Xu



RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-06 Thread Gonglei (Arei)
Hello,

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Monday, May 6, 2024 11:18 PM
> To: Gonglei (Arei) 
> Cc: Daniel P. Berrangé ; Markus Armbruster
> ; Michael Galaxy ; Yu Zhang
> ; Zhijian Li (Fujitsu) ; Jinpu Wang
> ; Elmar Gerdes ;
> qemu-de...@nongnu.org; Yuval Shaia ; Kevin Wolf
> ; Prasanna Kumar Kalever
> ; Cornelia Huck ;
> Michael Roth ; Prasanna Kumar Kalever
> ; integrat...@gluster.org; Paolo Bonzini
> ; qemu-block@nongnu.org; de...@lists.libvirt.org;
> Hanna Reitz ; Michael S. Tsirkin ;
> Thomas Huth ; Eric Blake ; Song
> Gao ; Marc-André Lureau
> ; Alex Bennée ;
> Wainer dos Santos Moschetta ; Beraldo Leal
> ; Pannengyuan ;
> Xiexiangyou 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote:
> > Hi, Peter
> 
> Hey, Lei,
> 
> Happy to see you around again after years.
> 
Haha, me too.

> > RDMA features high bandwidth, low latency (in non-blocking lossless
> > network), and direct remote memory access by bypassing the CPU (As you
> > know, CPU resources are expensive for cloud vendors, which is one of
> > the reasons why we introduced offload cards.), which TCP does not have.
> 
> It's another cost to use offload cards, v.s. preparing more cpu resources?
> 
A converged software/hardware offload architecture is the way to go for all cloud 
vendors (considering the combined benefits in performance, cost, security, and 
speed of innovation); it is not just a matter of adding a DPU card as an extra 
resource.

> > In some scenarios where fast live migration is needed (extremely short
> > interruption duration and migration duration) is very useful. To this
> > end, we have also developed RDMA support for multifd.
> 
> Will any of you upstream that work?  I'm curious how intrusive would it be
> when adding it to multifd, if it can keep only 5 exported functions like what
> rdma.h does right now it'll be pretty nice.  We also want to make sure it 
> works
> with arbitrary sized loads and buffers, e.g. vfio is considering to add IO 
> loads to
> multifd channels too.
> 

In fact, we sent the patchset to the community in 2021. Pls see:
https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/


> One thing to note that the question here is not about a pure performance
> comparison between rdma and nics only.  It's about help us make a decision
> on whether to drop rdma, iow, even if rdma performs well, the community still
> has the right to drop it if nobody can actively work and maintain it.
> It's just that if nics can perform as good it's more a reason to drop, unless
> companies can help to provide good support and work together.
> 

We are happy to provide the necessary review and maintenance work for RDMA
if the community needs it.

CC'ing Chuan Zheng.


Regards,
-Gonglei



RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-05 Thread Gonglei (Arei)
Hi, Peter

RDMA offers high bandwidth, low latency (on a non-blocking, lossless network), and 
direct remote memory access that bypasses the CPU (as you know, CPU resources are 
expensive for cloud vendors, which is one of the reasons we introduced offload 
cards), none of which TCP provides.

RDMA is very useful in scenarios that need fast live migration, i.e. extremely 
short interruption and total migration times. To this end, we have also developed 
RDMA support for multifd.

Regards,
-Gonglei

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Wednesday, May 1, 2024 11:31 PM
> To: Daniel P. Berrangé 
> Cc: Markus Armbruster ; Michael Galaxy
> ; Yu Zhang ; Zhijian Li (Fujitsu)
> ; Jinpu Wang ; Elmar Gerdes
> ; qemu-de...@nongnu.org; Yuval Shaia
> ; Kevin Wolf ; Prasanna
> Kumar Kalever ; Cornelia Huck
> ; Michael Roth ; Prasanna
> Kumar Kalever ; integrat...@gluster.org; Paolo
> Bonzini ; qemu-block@nongnu.org;
> de...@lists.libvirt.org; Hanna Reitz ; Michael S. Tsirkin
> ; Thomas Huth ; Eric Blake
> ; Song Gao ; Marc-André
> Lureau ; Alex Bennée
> ; Wainer dos Santos Moschetta
> ; Beraldo Leal ; Gonglei (Arei)
> ; Pannengyuan 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Tue, Apr 30, 2024 at 09:00:49AM +0100, Daniel P. Berrangé wrote:
> > On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote:
> > > Peter Xu  writes:
> > >
> > > > On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> > > >> Hi All (and Peter),
> > > >
> > > > Hi, Michael,
> > > >
> > > >>
> > > >> My name is Michael Galaxy (formerly Hines). Yes, I changed my
> > > >> last name (highly irregular for a male) and yes, that's my real last 
> > > >> name:
> > > >> https://www.linkedin.com/in/mrgalaxy/)
> > > >>
> > > >> I'm the original author of the RDMA implementation. I've been
> > > >> discussing with Yu Zhang for a little bit about potentially
> > > >> handing over maintainership of the codebase to his team.
> > > >>
> > > >> I simply have zero access to RoCE or Infiniband hardware at all,
> > > >> unfortunately. so I've never been able to run tests or use what I
> > > >> wrote at work, and as all of you know, if you don't have a way to
> > > >> test something, then you can't maintain it.
> > > >>
> > > >> Yu Zhang put a (very kind) proposal forward to me to ask the
> > > >> community if they feel comfortable training his team to maintain
> > > >> the codebase (and run
> > > >> tests) while they learn about it.
> > > >
> > > > The "while learning" part is fine at least to me.  IMHO the
> > > > "ownership" to the code, or say, taking over the responsibility,
> > > > may or may not need 100% mastering the code base first.  There
> > > > should still be some fundamental confidence to work on the code
> > > > though as a starting point, then it's about serious use case to
> > > > back this up, and careful testings while getting more familiar with it.
> > >
> > > How much experience we expect of maintainers depends on the
> > > subsystem and other circumstances.  The hard requirement isn't
> > > experience, it's trust.  See the recent attack on xz.
> > >
> > > I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
> > > I'm merely reminding y'all what's at stake.
> >
> > I think we shouldn't overly obsess[1] about 'xz', because the
> > overwhealmingly common scenario is that volunteer maintainers are
> > honest people. QEMU is in a massively better peer review situation.
> > With xz there was basically no oversight of the new maintainer. With
> > QEMU, we have oversight from 1000's of people on the list, a huge pool
> > of general maintainers, the specific migration maintainers, and the release
> manager merging code.
> >
> > With a lack of historical experiance with QEMU maintainership, I'd
> > suggest that new RDMA volunteers would start by adding themselves to the
> "MAINTAINERS"
> > file with only the 'Reviewer' classification. The main migration
> > maintainers would still handle pull requests, but wait for a R-b from
> > one of the RMDA volunteers. After some period of time the RDMA folks
> > could graduate to full maintainer status if the migration maintainers needed
> to reduce their load.
> > I suspect

RE: [PATCH 05/46] virtio-crypto-pci: Tidy up virtio_crypto_pci_realize()

2020-06-27 Thread Gonglei (Arei)


> -Original Message-
> From: Markus Armbruster [mailto:arm...@redhat.com]
> Sent: Thursday, June 25, 2020 12:43 AM
> To: qemu-de...@nongnu.org
> Cc: pbonz...@redhat.com; berra...@redhat.com; ehabk...@redhat.com;
> qemu-block@nongnu.org; peter.mayd...@linaro.org;
> vsement...@virtuozzo.com; Gonglei (Arei) ;
> Michael S . Tsirkin 
> Subject: [PATCH 05/46] virtio-crypto-pci: Tidy up virtio_crypto_pci_realize()
> 
> virtio_crypto_pci_realize() continues after realization of its 
> "virtio-crypto-device"
> fails.  Only an object_property_set_link() follows; looks harmless to me.  
> Tidy
> up anyway: return after failure, just like virtio_rng_pci_realize() does.
> 
> Cc: "Gonglei (Arei)" 
> Cc: Michael S. Tsirkin 
> Signed-off-by: Markus Armbruster 
> ---
>  hw/virtio/virtio-crypto-pci.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 

Reviewed-by: Gonglei <arei.gong...@huawei.com>

> diff --git a/hw/virtio/virtio-crypto-pci.c b/hw/virtio/virtio-crypto-pci.c 
> index
> 72be531c95..0755722288 100644
> --- a/hw/virtio/virtio-crypto-pci.c
> +++ b/hw/virtio/virtio-crypto-pci.c
> @@ -54,7 +54,9 @@ static void virtio_crypto_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
>      }
> 
>      virtio_pci_force_virtio_1(vpci_dev);
> -    qdev_realize(vdev, BUS(&vpci_dev->bus), errp);
> +    if (!qdev_realize(vdev, BUS(&vpci_dev->bus), errp)) {
> +        return;
> +    }
>      object_property_set_link(OBJECT(vcrypto),
>                               OBJECT(vcrypto->vdev.conf.cryptodev), "cryptodev",
>                               NULL);
> --
> 2.26.2




Re: [Qemu-block] [PATCH] nvme: generate OpenFirmware device path in the "bootorder" fw_cfg file

2016-01-26 Thread Gonglei (Arei)
>
> Subject: [PATCH] nvme: generate OpenFirmware device path in the "bootorder"
> fw_cfg file
> 
> Background on QEMU boot indices
> ---
> 
> Normally, the "bootindex" property is configured for bootable devices
> with:
> 
>   DEVICE_instance_init()
> device_add_bootindex_property(..., "bootindex", ...)
>   object_property_add(..., device_get_bootindex,
>   device_set_bootindex, ...)
> 
> and when the bootindex is set on the QEMU command line, with
> 
>   -device DEVICE,...,bootindex=N
> 
> the setter that was configured above is invoked:
> 
>   device_set_bootindex()
> /* parse boot index */
> visit_type_int32()
> 
> /* verify unicity */
> check_boot_index()
> 
> /* store parsed boot index */
> ...
> 
> /* insert device path to boot order */
> add_boot_device_path()
> 
> In the last step, add_boot_device_path() ensures that an OpenFirmware
> device path will show up in the "bootorder" fw_cfg file, at a position
> corresponding to the device's boot index. Thus guest firmware (SeaBIOS and
> OVMF) can try to boot off the device with the right priority.
> 
> NVMe boot index
> ---
> 
> In QEMU commit 33739c712982,
> 
>   nvma: ide: add bootindex to qom property
> 
> the following generic setters / getters:
> - device_set_bootindex()
> - device_get_bootindex()
> 
> were open-coded for NVMe, under the names
> - nvme_set_bootindex()
> - nvme_get_bootindex()
> 
> Plus nvme_instance_init() was added to configure the "bootindex" property
> manually, designating the open-coded getter & setter, rather than calling
> device_add_bootindex_property().
> 
> Crucially, nvme_set_bootindex() avoided the final add_boot_device_path()
> call. This fact is spelled out in the message of commit 33739c712982, and
> it was presumably the entire reason for all of the code duplication.
> 
> Now, Vladislav filed an RFE for OVMF
> ; OVMF should boot off NVMe
> devices. It is simple to build edk2's existent NvmExpressDxe driver into
> OVMF, but the boot order matching logic in OVMF can only handle NVMe if
> the "bootorder" fw_cfg file includes such devices.
> 
> Therefore this patch converts the NVMe device model to
> device_set_bootindex() all the way.
> 
> Device paths
> 
> 
> device_set_bootindex() accepts an optional parameter called "suffix". When
> present, it is expected to take the form of an OpenFirmware device path
> node, and it gets appended as last node to the otherwise auto-generated
> OFW path.
> 
> For NVMe, the auto-generated part is
> 
>   /pci@i0cf8/pci8086,5845@6[,1]
>    ^         ^            ^  ^
>    |         |            PCI slot and (present when nonzero)
>    |         |            function of the NVMe controller, both hex
>    |         "driver name" component, built from PCI vendor & device IDs
>    PCI root at system bus port, PIO
> 
> to which here we append the suffix
> 
>   /namespace@1,0
>              ^ ^
>              | big endian (MSB at lowest address) numeric interpretation
>              | of the 64-bit IEEE Extended Unique Identifier, aka EUI-64,
>              | hex
>              32-bit NVMe namespace identifier, aka NSID, hex
> 
> resulting in the OFW device path
> 
>   /pci@i0cf8/pci8086,5845@6[,1]/namespace@1,0
> 
> The reason for including the NSID and the EUI-64 is that an NVMe device
> can in theory produce several different namespaces (distinguished by
> NSID). Additionally, each of those may (optionally) have an EUI-64 value.
> 
> For now, QEMU only provides namespace 1.
> 
> Furthermore, QEMU doesn't even represent the EUI-64 as a standalone field;
> it is embedded (and left unused) inside the "NvmeIdNs.res30" array, at the
> last eight bytes. (Which is fine, since EUI-64 can be left zero-filled if
> unsupported by the device.)
> 
> Based on the above, we set the "unit address" part of the last
> ("namespace") node to fixed "1,0".
> 
> OVMF will then map the above OFW device path to the following UEFI device
> path fragment, for boot order processing:
> 
>   PciRoot(0x0)/Pci(0x6,0x1)/NVMe(0x1,00-00-00-00-00-00-00-00)
>   ^            ^            ^    ^   ^
>   |            |            |    |   octets of the EUI-64 in address order
>   |            |            |    NSID
>   |            |            NVMe namespace messaging device path node
>   |            PCI slot and function
>   PCI root bridge
> 
> Cc: Keith Busch  (supporter:nvme)
> Cc: Kevin Wolf  (supporter:Block layer core)
> Cc: qemu-block@nongnu.org (open list:nvme)
> Cc: Gonglei 
> Cc: Vladislav Vovchenko 
> Cc: Feng Tian 
> Cc: Gerd Hoffmann 
> Cc: Kevin O'Connor 
> Signed-off-by: Laszlo Ersek 
> ---
>  hw/block/nvme.c | 42 +-
>  1