Re: [PATCH 2/6] io: add QIOChannelRDMA class

2024-06-10 Thread Jinpu Wang
On Tue, Jun 4, 2024 at 2:14 PM Gonglei  wrote:
>
> From: Jialin Wang 
>
> Implement a QIOChannelRDMA subclass that is based on the rsocket
> API (similar to socket API).
>
> Signed-off-by: Jialin Wang 
> Signed-off-by: Gonglei 
> ---
>  include/io/channel-rdma.h | 152 +
>  io/channel-rdma.c | 437 ++
>  io/meson.build|   1 +
>  io/trace-events   |  14 ++
>  4 files changed, 604 insertions(+)
>  create mode 100644 include/io/channel-rdma.h
>  create mode 100644 io/channel-rdma.c
>
> diff --git a/include/io/channel-rdma.h b/include/io/channel-rdma.h
> new file mode 100644
> index 00..8cab2459e5
> --- /dev/null
> +++ b/include/io/channel-rdma.h
> @@ -0,0 +1,152 @@
> +/*
> + * QEMU I/O channels RDMA driver
> + *
> + * Copyright (c) 2024 HUAWEI TECHNOLOGIES CO., LTD.
> + *
> + * Authors:
> + *  Jialin Wang 
> + *  Gonglei 
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <https://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef QIO_CHANNEL_RDMA_H
> +#define QIO_CHANNEL_RDMA_H
> +
> +#include "io/channel.h"
> +#include "io/task.h"
> +#include "qemu/sockets.h"
> +#include "qom/object.h"
> +
> +#define TYPE_QIO_CHANNEL_RDMA "qio-channel-rdma"
> +OBJECT_DECLARE_SIMPLE_TYPE(QIOChannelRDMA, QIO_CHANNEL_RDMA)
> +
> +/**
> + * QIOChannelRDMA:
> + *
> + * The QIOChannelRDMA object provides a channel implementation
> + * that performs I/O over an RDMA connection using the rsocket API.
> + */
> +struct QIOChannelRDMA {
> +    QIOChannel parent;
> +    /* the rsocket fd */
> +    int fd;
> +
> +    struct sockaddr_storage localAddr;
> +    socklen_t localAddrLen;
> +    struct sockaddr_storage remoteAddr;
> +    socklen_t remoteAddrLen;
> +};
> +
> +/**
> + * qio_channel_rdma_new:
> + *
> + * Create a channel for performing I/O on an RDMA
> + * connection that is initially closed. After
> + * creating the channel, it must be set up as a client
> + * connection or a server.
> + *
> + * Returns: the rdma channel object
> + */
> +QIOChannelRDMA *qio_channel_rdma_new(void);
> +
> +/**
> + * qio_channel_rdma_connect_sync:
> + * @ioc: the rdma channel object
> + * @addr: the address to connect to
> + * @errp: pointer to a NULL-initialized error object
> + *
> + * Attempt to connect to the address @addr. This method
> + * will run in the foreground so the caller will not regain
> + * execution control until the connection is established or
> + * an error occurs.
> + */
> +int qio_channel_rdma_connect_sync(QIOChannelRDMA *ioc, InetSocketAddress *addr,
> +                                  Error **errp);
> +
> +/**
> + * qio_channel_rdma_connect_async:
> + * @ioc: the rdma channel object
> + * @addr: the address to connect to
> + * @callback: the function to invoke on completion
> + * @opaque: user data to pass to @callback
> + * @destroy: the function to free @opaque
> + * @context: the context to run the async task. If %NULL, the default
> + *   context will be used.
> + *
> + * Attempt to connect to the address @addr. This method
> + * will run in the background so the caller will regain
> + * execution control immediately. The function @callback
> + * will be invoked on completion or failure. The @addr
> + * parameter will be copied, so may be freed as soon
> + * as this function returns without waiting for completion.
> + */
> +void qio_channel_rdma_connect_async(QIOChannelRDMA *ioc,
> +                                    InetSocketAddress *addr,
> +                                    QIOTaskFunc callback, gpointer opaque,
> +                                    GDestroyNotify destroy,
> +                                    GMainContext *context);
> +
> +/**
> + * qio_channel_rdma_listen_sync:
> + * @ioc: the rdma channel object
> + * @addr: the address to listen on
> + * @num: the expected number of connections
> + * @errp: pointer to a NULL-initialized error object
> + *
> + * Attempt to listen on the address @addr. This method
> + * will run in the foreground so the caller will not regain
> + * execution control until the connection is established or
> + * an error occurs.
> + */
> +int qio_channel_rdma_listen_sync(QIOChannelRDMA *ioc, InetSocketAddress *addr,
> +                                 int num, Error **errp);
> +
> +/**
> + * qio_channel_rdma_listen_async:
> + * @ioc: the rdma channel object
> + * @addr: the address to 
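
A minimal client-side usage sketch of the API declared above (illustrative only, not part of the patch: the helper name rdma_send_hello() is made up, error handling is abbreviated, and qio_channel_write_all()/qio_channel_close() are the generic QIOChannel helpers):

/* Hypothetical caller of the QIOChannelRDMA API quoted above. */
#include "qemu/osdep.h"
#include "io/channel-rdma.h"
#include "qapi/error.h"

static int rdma_send_hello(const char *host, const char *port)
{
    QIOChannelRDMA *rioc = qio_channel_rdma_new();
    InetSocketAddress addr = { .host = (char *)host, .port = (char *)port };
    Error *err = NULL;
    const char msg[] = "hello";

    /* Foreground connect: blocks until the rsocket connection is established */
    if (qio_channel_rdma_connect_sync(rioc, &addr, &err) < 0) {
        error_report_err(err);
        object_unref(OBJECT(rioc));
        return -1;
    }

    /* Once connected, the generic QIOChannel I/O helpers apply as usual */
    if (qio_channel_write_all(QIO_CHANNEL(rioc), msg, sizeof(msg), &err) < 0) {
        error_report_err(err);
        qio_channel_close(QIO_CHANNEL(rioc), NULL);
        object_unref(OBJECT(rioc));
        return -1;
    }

    qio_channel_close(QIO_CHANNEL(rioc), NULL);
    object_unref(OBJECT(rioc));
    return 0;
}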

Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API

2024-06-06 Thread Jinpu Wang
Hi Gonglei, hi folks on the list,

On Tue, Jun 4, 2024 at 2:14 PM Gonglei  wrote:
>
> From: Jialin Wang 
>
> Hi,
>
> This patch series attempts to refactor RDMA live migration by
> introducing a new QIOChannelRDMA class based on the rsocket API.
>
> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> that is a 1-1 match of the normal kernel 'sockets' API, which hides the
> detail of rdma protocol into rsocket and allows us to add support for
> some modern features like multifd more easily.
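
To illustrate that 1-1 mapping, a small sketch (not part of this series; the helper name rsocket_ping() is made up, and it only assumes <rdma/rsocket.h> as shipped by rdma-core, where each r* call mirrors the corresponding BSD sockets call):

/* Illustrative rsocket client: rsocket()/rconnect()/rsend()/rrecv()/rclose()
 * behave like socket()/connect()/send()/recv()/close(). */
#include <rdma/rsocket.h>
#include <sys/socket.h>
#include <netdb.h>

static int rsocket_ping(const char *host, const char *port)
{
    struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *res;
    char buf[64];
    int fd, ret = -1;

    if (getaddrinfo(host, port, &hints, &res) != 0) {
        return -1;
    }
    fd = rsocket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd >= 0 && rconnect(fd, res->ai_addr, res->ai_addrlen) == 0) {
        rsend(fd, "ping", 4, 0);                            /* like send(2) */
        ret = rrecv(fd, buf, sizeof(buf), 0) > 0 ? 0 : -1;  /* like recv(2) */
    }
    if (fd >= 0) {
        rclose(fd);                                         /* like close(2) */
    }
    freeaddrinfo(res);
    return ret;
}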
>
> Here is the previous discussion on refactoring RDMA live migration using
> the rsocket API:
>
> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-phi...@linaro.org/
>
> We have encountered some bugs when using rsocket and plan to submit them to
> the rdma-core community.
>
> In addition, the use of rsocket makes our programming more convenient,
> but it must be noted that this method introduces multiple memory copies,
> which can be imagined that there will be a certain performance degradation,
> hoping that friends with RDMA network cards can help verify, thank you!
First, thanks for the effort. We are running migration tests on our IB
fabric with different generations of Mellanox HCAs; the migration works
OK overall, with a few failures. Yu will share the results separately.

The one blocker for the change is that the old implementation and the
new rsocket implementation don't talk to each other, because they use
different wire protocols during connection establishment. For example,
the old RDMA migration exchanges its own control messages during the
migration flow, while rsocket uses different control messages, so there
is no way to migrate a VM over the rdma transport from a pre-rsocket
QEMU to a new version with the rsocket implementation.

We should probably keep both implementations for a while, mark the old
one as deprecated, promote the new one, and highlight in the
documentation that they are not compatible.

Regards!
Jinpu



>
> Jialin Wang (6):
>   migration: remove RDMA live migration temporarily
>   io: add QIOChannelRDMA class
>   io/channel-rdma: support working in coroutine
>   tests/unit: add test-io-channel-rdma.c
>   migration: introduce new RDMA live migration
>   migration/rdma: support multifd for RDMA migration
>
>  docs/rdma.txt |  420 ---
>  include/io/channel-rdma.h |  165 ++
>  io/channel-rdma.c |  798 ++
>  io/meson.build|1 +
>  io/trace-events   |   14 +
>  meson.build   |6 -
>  migration/meson.build |3 +-
>  migration/migration-stats.c   |5 +-
>  migration/migration-stats.h   |4 -
>  migration/migration.c |   13 +-
>  migration/migration.h |9 -
>  migration/multifd.c   |   10 +
>  migration/options.c   |   16 -
>  migration/options.h   |2 -
>  migration/qemu-file.c |1 -
>  migration/ram.c   |   90 +-
>  migration/rdma.c  | 4205 +
>  migration/rdma.h  |   67 +-
>  migration/savevm.c|2 +-
>  migration/trace-events|   68 +-
>  qapi/migration.json   |   13 +-
>  scripts/analyze-migration.py  |3 -
>  tests/unit/meson.build|1 +
>  tests/unit/test-io-channel-rdma.c |  276 ++
>  24 files changed, 1360 insertions(+), 4832 deletions(-)
>  delete mode 100644 docs/rdma.txt
>  create mode 100644 include/io/channel-rdma.h
>  create mode 100644 io/channel-rdma.c
>  create mode 100644 tests/unit/test-io-channel-rdma.c
>
> --
> 2.43.0
>



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Jinpu Wang
On Wed, May 29, 2024 at 11:35 AM Gonglei (Arei) 
wrote:

>
>
> > -Original Message-
> > From: Jinpu Wang [mailto:jinpu.w...@ionos.com]
> > Sent: Wednesday, May 29, 2024 5:18 PM
> > To: Gonglei (Arei) 
> > Cc: Greg Sword ; Peter Xu ;
> > Yu Zhang ; Michael Galaxy ;
> > Elmar Gerdes ; zhengchuan
> > ; Daniel P. Berrangé ;
> > Markus Armbruster ; Zhijian Li (Fujitsu)
> > ; qemu-devel@nongnu.org; Yuval Shaia
> > ; Kevin Wolf ; Prasanna
> > Kumar Kalever ; Cornelia Huck
> > ; Michael Roth ; Prasanna
> > Kumar Kalever ; Paolo Bonzini
> > ; qemu-bl...@nongnu.org; de...@lists.libvirt.org;
> > Hanna Reitz ; Michael S. Tsirkin ;
> > Thomas Huth ; Eric Blake ; Song
> > Gao ; Marc-André Lureau
> > ; Alex Bennée ;
> > Wainer dos Santos Moschetta ; Beraldo Leal
> > ; Pannengyuan ;
> > Xiexiangyou ; Fabiano Rosas ;
> > RDMA mailing list ; she...@nvidia.com; Haris
> > Iqbal 
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol
> handling
> >
> > Hi Gonglei,
> >
> > On Wed, May 29, 2024 at 10:31 AM Gonglei (Arei)  >
> > wrote:
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: Greg Sword [mailto:gregswo...@gmail.com]
> > > > Sent: Wednesday, May 29, 2024 2:06 PM
> > > > To: Jinpu Wang 
> > > > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol
> > > > handling
> > > >
> > > > On Wed, May 29, 2024 at 12:33 PM Jinpu Wang 
> > > > wrote:
> > > > >
> > > > > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei)
> > > > > 
> > > > wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > -Original Message-
> > > > > > > From: Peter Xu [mailto:pet...@redhat.com]
> > > > > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > > > > Exactly, not so compelling, as I did it first only on
> > > > > > > > > > servers widely used for production in our data center.
> > > > > > > > > > The network adapters are
> > > > > > > > > >
> > > > > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries
> > > > > > > > > > NetXtreme
> > > > > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > > > > >
> > > > > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6
> > > > > > > > > looks more
> > > > > > > reasonable.
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > >
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > > > > 15
> > > > > > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > > > > >
> > > > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > > > >
> > > > > > > > > > InfiniBand controller: Mellanox Technologies MT27800
> > > > > > > > > > Family [ConnectX-5]
> > > > > > > > > >
> > > > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP
> > > > > > > > > > for VM migration. RDMA traffic is through InfiniBand and
> > > > > > > > > > TCP through Ethernet on these two hosts. One is standby
> > > > > > > > > > while the other
> > > > is active.
> > > > > > > > > >
> > > > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
> > > > > > > > > > (rev
> > > > > > > > > > 01)
> > > > > > > > > >
> > > > > > > > > > The comparison between RDMA and TCP on the same NIC
> > > > > > > > > > could make more
> > > > > > > > > sense.
> > > > > > > > >
> > 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Jinpu Wang
Hi Gonglei,

On Wed, May 29, 2024 at 10:31 AM Gonglei (Arei)  wrote:
>
>
>
> > -Original Message-
> > From: Greg Sword [mailto:gregswo...@gmail.com]
> > Sent: Wednesday, May 29, 2024 2:06 PM
> > To: Jinpu Wang 
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> >
> > On Wed, May 29, 2024 at 12:33 PM Jinpu Wang 
> > wrote:
> > >
> > > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei) 
> > wrote:
> > > >
> > > > Hi,
> > > >
> > > > > -Original Message-
> > > > > From: Peter Xu [mailto:pet...@redhat.com]
> > > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > > Exactly, not so compelling, as I did it first only on
> > > > > > > > servers widely used for production in our data center. The
> > > > > > > > network adapters are
> > > > > > > >
> > > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries
> > > > > > > > NetXtreme
> > > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > > >
> > > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks
> > > > > > > more
> > > > > reasonable.
> > > > > > >
> > > > > > >
> > > > >
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > > 15
> > > > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > > >
> > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > >
> > > > > > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > > > > > [ConnectX-5]
> > > > > > > >
> > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP for
> > > > > > > > VM migration. RDMA traffic is through InfiniBand and TCP
> > > > > > > > through Ethernet on these two hosts. One is standby while the 
> > > > > > > > other
> > is active.
> > > > > > > >
> > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev
> > > > > > > > 01)
> > > > > > > >
> > > > > > > > The comparison between RDMA and TCP on the same NIC could
> > > > > > > > make more
> > > > > > > sense.
> > > > > > >
> > > > > > > It looks to me NICs are powerful now, but again as I mentioned
> > > > > > > I don't think it's a reason we need to deprecate rdma,
> > > > > > > especially if QEMU's rdma migration has the chance to be 
> > > > > > > refactored
> > using rsocket.
> > > > > > >
> > > > > > > Is there anyone who started looking into that direction?
> > > > > > > Would it make sense we start some PoC now?
> > > > > > >
> > > > > >
> > > > > > My team has finished the PoC refactoring which works well.
> > > > > >
> > > > > > Progress:
> > > > > > 1.  Implement io/channel-rdma.c, 2.  Add unit test
> > > > > > tests/unit/test-io-channel-rdma.c and verifying it is
> > > > > > successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx
> > > > > > functions from migration/ram.c. (to prevent RDMA live migration
> > > > > > from polluting the
> > > > > core logic of live migration), 6.  The soft-RoCE implemented by
> > > > > software is used to test the RDMA live migration. It's successful.
> > > > > >
> > > > > > We will be submit the patchset later.
> > > > >
> > > > > That's great news, thank you!
> > > > >
> > > > > --
> > > > > Peter Xu
> > > >
> > > > For rdma programming, the current mainstream impl

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Jinpu Wang
On Wed, May 29, 2024 at 8:08 AM Greg Sword  wrote:
>
> On Wed, May 29, 2024 at 12:33 PM Jinpu Wang  wrote:
> >
> > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei)  
> > wrote:
> > >
> > > Hi,
> > >
> > > > -Original Message-
> > > > From: Peter Xu [mailto:pet...@redhat.com]
> > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > Exactly, not so compelling, as I did it first only on servers
> > > > > > > widely used for production in our data center. The network
> > > > > > > adapters are
> > > > > > >
> > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme
> > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > >
> > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more
> > > > reasonable.
> > > > > >
> > > > > >
> > > > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > >
> > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > >
> > > > > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > > > > [ConnectX-5]
> > > > > > >
> > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > > > > > migration. RDMA traffic is through InfiniBand and TCP through
> > > > > > > Ethernet on these two hosts. One is standby while the other is 
> > > > > > > active.
> > > > > > >
> > > > > > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > > > > > network adapters. One of them has:
> > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > > > > > >
> > > > > > > The comparison between RDMA and TCP on the same NIC could make
> > > > > > > more
> > > > > > sense.
> > > > > >
> > > > > > It looks to me NICs are powerful now, but again as I mentioned I
> > > > > > don't think it's a reason we need to deprecate rdma, especially if
> > > > > > QEMU's rdma migration has the chance to be refactored using rsocket.
> > > > > >
> > > > > > Is there anyone who started looking into that direction?  Would it
> > > > > > make sense we start some PoC now?
> > > > > >
> > > > >
> > > > > My team has finished the PoC refactoring which works well.
> > > > >
> > > > > Progress:
> > > > > 1.  Implement io/channel-rdma.c,
> > > > > 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it
> > > > > is successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx functions
> > > > > from migration/ram.c. (to prevent RDMA live migration from polluting 
> > > > > the
> > > > core logic of live migration), 6.  The soft-RoCE implemented by 
> > > > software is
> > > > used to test the RDMA live migration. It's successful.
> > > > >
> > > > > We will be submit the patchset later.
> > > >
> > > > That's great news, thank you!
> > > >
> > > > --
> > > > Peter Xu
> > >
> > > For rdma programming, the current mainstream implementation is to use 
> > > rdma_cm to establish a connection, and then use verbs to transmit data.
> > >
> > > rdma_cm and ibverbs create two FDs respectively. The two FDs have 
> > > different responsibilities. rdma_cm fd is used to notify connection 
> > > establishment events,
> > > and verbs fd is used to notify new CQEs. When poll/epoll monitoring is 
> > > directly performed on the rdma_cm fd, only a pollin event can be 
> > > monitored, which means
> > > that an rdma_cm event occurs. When the verbs fd is directly 
> > > polled/epolled, only the pollin event can be listened, which indicates 
> > > that a new CQE is generated.
> > >
> > > Rsocket is a sub-module attached to the rdma_cm library and provides rdma 
> > > calls that are completely similar to socket interfaces. However, this 
> > > library returns
> > > only the rdma_cm fd for listening to link setup-related events and does 
> > > not expose the verbs fd (readable and writable events for listening to 
> > > data). Only the rpoll
> > > interface provided by the RSocket can be used to listen to related 
> > > events. However, QEMU uses the ppoll interface to listen to the rdma_cm 
> > > fd (gotten by raccept API).
> > > And cannot listen to the verbs fd event. Only some hacking methods can be 
> > > used to address this problem.
> > >
> > > Do you guys have any ideas? Thanks.
> > +cc linux-rdma
>
> Why include rdma community?
Because the rdma community has a lot of people with experience in rdma/rsocket.
>
> > +cc Sean
> >
> >
> >
> > >
> > >
> > > Regards,
> > > -Gonglei
> >



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-28 Thread Jinpu Wang
On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei)  wrote:
>
> Hi,
>
> > -Original Message-
> > From: Peter Xu [mailto:pet...@redhat.com]
> > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > Exactly, not so compelling, as I did it first only on servers
> > > > > widely used for production in our data center. The network
> > > > > adapters are
> > > > >
> > > > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme
> > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > >
> > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more
> > reasonable.
> > > >
> > > >
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > >
> > > > Appreciate a lot for everyone helping on the testings.
> > > >
> > > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > > [ConnectX-5]
> > > > >
> > > > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > > > migration. RDMA traffic is through InfiniBand and TCP through
> > > > > Ethernet on these two hosts. One is standby while the other is active.
> > > > >
> > > > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > > > network adapters. One of them has:
> > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > > > >
> > > > > The comparison between RDMA and TCP on the same NIC could make
> > > > > more
> > > > sense.
> > > >
> > > > It looks to me NICs are powerful now, but again as I mentioned I
> > > > don't think it's a reason we need to deprecate rdma, especially if
> > > > QEMU's rdma migration has the chance to be refactored using rsocket.
> > > >
> > > > Is there anyone who started looking into that direction?  Would it
> > > > make sense we start some PoC now?
> > > >
> > >
> > > My team has finished the PoC refactoring which works well.
> > >
> > > Progress:
> > > 1.  Implement io/channel-rdma.c,
> > > 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it
> > > is successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > Rewrite the rdma_start_outgoing_migration and
> > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx functions
> > > from migration/ram.c. (to prevent RDMA live migration from polluting the
> > core logic of live migration), 6.  The soft-RoCE implemented by software is
> > used to test the RDMA live migration. It's successful.
> > >
> > > We will be submit the patchset later.
> >
> > That's great news, thank you!
> >
> > --
> > Peter Xu
>
> For rdma programming, the current mainstream implementation is to use rdma_cm 
> to establish a connection, and then use verbs to transmit data.
>
> rdma_cm and ibverbs create two FDs respectively. The two FDs have different 
> responsibilities. rdma_cm fd is used to notify connection establishment 
> events,
> and verbs fd is used to notify new CQEs. When poll/epoll monitoring is 
> directly performed on the rdma_cm fd, only a pollin event can be monitored, 
> which means
> that an rdma_cm event occurs. When the verbs fd is directly polled/epolled, 
> only the pollin event can be listened, which indicates that a new CQE is 
> generated.
>
> Rsocket is a sub-module attached to the rdma_cm library and provides rdma 
> calls that are completely similar to socket interfaces. However, this library 
> returns
> only the rdma_cm fd for listening to link setup-related events and does not 
> expose the verbs fd (readable and writable events for listening to data). 
> Only the rpoll
> interface provided by the RSocket can be used to listen to related events. 
> However, QEMU uses the ppoll interface to listen to the rdma_cm fd (gotten by 
> raccept API).
> And cannot listen to the verbs fd event. Only some hacking methods can be 
> used to address this problem.
>
> Do you guys have any ideas? Thanks.
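
To make the rpoll() point concrete, a minimal sketch (illustrative only, not QEMU code; the helper name rsocket_wait_readable() is made up, and only <rdma/rsocket.h> from rdma-core is assumed) of waiting for readability on an rsocket fd, which is what any QIOChannel integration would have to use instead of ppoll():

/* Sketch: rsocket fds are not regular kernel fds, so they must be waited on
 * with rpoll() rather than poll()/ppoll(). */
#include <rdma/rsocket.h>
#include <poll.h>

/* Returns >0 if readable, 0 on timeout, <0 on error. */
static int rsocket_wait_readable(int rfd, int timeout_ms)
{
    struct pollfd pfd = {
        .fd = rfd,          /* fd obtained from rsocket()/raccept() */
        .events = POLLIN,
    };
    int ret = rpoll(&pfd, 1, timeout_ms);   /* rpoll(), not ppoll() */

    if (ret > 0) {
        return (pfd.revents & POLLIN) ? 1 : 0;
    }
    return ret;
}

(How to hook such a wait into QEMU's main loop / GSource machinery is exactly the open question above.)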
+cc linux-rdma
+cc Sean



>
>
> Regards,
> -Gonglei



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-28 Thread Jinpu Wang
Hi Gonglei,

On Tue, May 28, 2024 at 11:06 AM Gonglei (Arei)  wrote:
>
> Hi Peter,
>
> > -Original Message-
> > From: Peter Xu [mailto:pet...@redhat.com]
> > Sent: Wednesday, May 22, 2024 6:15 AM
> > To: Yu Zhang 
> > Cc: Michael Galaxy ; Jinpu Wang
> > ; Elmar Gerdes ;
> > zhengchuan ; Gonglei (Arei)
> > ; Daniel P. Berrangé ;
> > Markus Armbruster ; Zhijian Li (Fujitsu)
> > ; qemu-devel@nongnu.org; Yuval Shaia
> > ; Kevin Wolf ; Prasanna
> > Kumar Kalever ; Cornelia Huck
> > ; Michael Roth ; Prasanna
> > Kumar Kalever ; Paolo Bonzini
> > ; qemu-bl...@nongnu.org; de...@lists.libvirt.org;
> > Hanna Reitz ; Michael S. Tsirkin ;
> > Thomas Huth ; Eric Blake ; Song
> > Gao ; Marc-André Lureau
> > ; Alex Bennée ;
> > Wainer dos Santos Moschetta ; Beraldo Leal
> > ; Pannengyuan ;
> > Xiexiangyou ; Fabiano Rosas 
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> >
> > On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> > > Hello Michael and Peter,
> >
> > Hi,
> >
> > >
> > > Exactly, not so compelling, as I did it first only on servers widely
> > > used for production in our data center. The network adapters are
> > >
> > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> > > 2-port Gigabit Ethernet PCIe
> >
> > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more 
> > reasonable.
> >
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > wvaqk81vxtkzx-l...@mail.gmail.com/
> >
> > Appreciate a lot for everyone helping on the testings.
> >
> > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > [ConnectX-5]
> > >
> > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> > > on these two hosts. One is standby while the other is active.
> > >
> > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > network adapters. One of them has:
> > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > >
> > > The comparison between RDMA and TCP on the same NIC could make more
> > sense.
> >
> > It looks to me NICs are powerful now, but again as I mentioned I don't 
> > think it's
> > a reason we need to deprecate rdma, especially if QEMU's rdma migration has
> > the chance to be refactored using rsocket.
> >
> > Is there anyone who started looking into that direction?  Would it make 
> > sense
> > we start some PoC now?
> >
>
> My team has finished the PoC refactoring which works well.
>
> Progress:
> 1.  Implement io/channel-rdma.c,
> 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it is 
> successful,
> 3.  Remove the original code from migration/rdma.c,
> 4.  Rewrite the rdma_start_outgoing_migration and 
> rdma_start_incoming_migration logic,
> 5.  Remove all rdma_xxx functions from migration/ram.c. (to prevent RDMA live 
> migration from polluting the core logic of live migration),
> 6.  The soft-RoCE implemented by software is used to test the RDMA live 
> migration. It's successful.
>
> We will be submit the patchset later.
>
Thanks for working on this PoC and sharing the progress; we are
looking forward to the patchset.

>
> Regards,
> -Gonglei
Regards!
Jinpu
>
> > Thanks,
> >
> > --
> > Peter Xu
>



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-13 Thread Jinpu Wang
Hi Peter, Hi Chuan,

On Thu, May 9, 2024 at 4:14 PM Peter Xu  wrote:
>
> On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:
> > That's a good news to see the socket abstraction for RDMA!
> > When I was developed the series above, the most pain is the RDMA migration 
> > has no QIOChannel abstraction and i need to take a 'fake channel'
> > for it which is awkward in code implementation.
> > So, as far as I know, we can do this by
> > i. the first thing is that we need to evaluate the rsocket is good enough 
> > to satisfy our QIOChannel fundamental abstraction
> > ii. if it works right, then we will continue to see if it can give us 
> > opportunity to hide the detail of rdma protocol
> > into rsocket by remove most of code in rdma.c and also some hack in 
> > migration main process.
> > iii. implement the advanced features like multi-fd and multi-uri for rdma 
> > migration.
> >
> > Since I am not familiar with rsocket, I need some times to look at it and 
> > do some quick verify with rdma migration based on rsocket.
> > But, yes, I am willing to involved in this refactor work and to see if we 
> > can make this migration feature more better:)
>
> Based on what we have now, it looks like we'd better halt the deprecation
> process a bit, so I think we shouldn't need to rush it at least in 9.1
> then, and we'll need to see how it goes on the refactoring.
>
> It'll be perfect if rsocket works, otherwise supporting multifd with little
> overhead / exported APIs would also be a good thing in general with
> whatever approach.  And obviously all based on the facts that we can get
> resources from companies to support this feature first.
>
> Note that so far nobody yet compared with rdma v.s. nic perf, so I hope if
> any of us can provide some test results please do so.  Many people are
> saying RDMA is better, but I yet didn't see any numbers comparing it with
> modern TCP networks.  I don't want to have old impressions floating around
> even if things might have changed..  When we have consolidated results, we
> should share them out and also reflect that in QEMU's migration docs when a
> rdma document page is ready.
I also did a test with a Mellanox ConnectX-6 100 G RoCE NIC. The
results are mixed: with fewer than 3 streams native Ethernet is faster,
and with more than 3 streams rsocket performs better.

root@x4-right:~# iperf -c 1.1.1.16 -P 1

Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  325 KByte (default)

[  3] local 1.1.1.15 port 44214 connected with 1.1.1.16 port 5001
[ ID] Interval   Transfer Bandwidth
[  3] 0.-10. sec  52.9 GBytes  45.4 Gbits/sec
root@x4-right:~# iperf -c 1.1.1.16 -P 2
[  3] local 1.1.1.15 port 33118 connected with 1.1.1.16 port 5001
[  4] local 1.1.1.15 port 33130 connected with 1.1.1.16 port 5001

Client connecting to 1.1.1.16, TCP port 5001
TCP window size: 4.00 MByte (default)

[ ID] Interval   Transfer Bandwidth
[  3] 0.-10.0001 sec  45.0 GBytes  38.7 Gbits/sec
[  4] 0.-10. sec  43.9 GBytes  37.7 Gbits/sec
[SUM] 0.-10. sec  88.9 GBytes  76.4 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.172/0.189/0.205/0.172 ms (tot/err) = 2/0
root@x4-right:~# iperf -c 1.1.1.16 -P 4

Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  325 KByte (default)

[  5] local 1.1.1.15 port 50748 connected with 1.1.1.16 port 5001
[  4] local 1.1.1.15 port 50734 connected with 1.1.1.16 port 5001
[  6] local 1.1.1.15 port 50764 connected with 1.1.1.16 port 5001
[  3] local 1.1.1.15 port 50730 connected with 1.1.1.16 port 5001
[ ID] Interval   Transfer Bandwidth
[  6] 0.-10. sec  24.7 GBytes  21.2 Gbits/sec
[  3] 0.-10.0004 sec  23.6 GBytes  20.3 Gbits/sec
[  4] 0.-10. sec  27.8 GBytes  23.9 Gbits/sec
[  5] 0.-10. sec  28.0 GBytes  24.0 Gbits/sec
[SUM] 0.-10. sec   104 GBytes  89.4 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.104/0.156/0.204/0.124 ms (tot/err) = 4/0
root@x4-right:~# iperf -c 1.1.1.16 -P 8
[  4] local 1.1.1.15 port 55588 connected with 1.1.1.16 port 5001
[  5] local 1.1.1.15 port 55600 connected with 1.1.1.16 port 5001

Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  325 KByte (default)

[ 10] local 1.1.1.15 port 55628 connected with 1.1.1.16 port 5001
[ 15] local 1.1.1.15 port 55648 connected with 1.1.1.16 port 5001
[  7] local 1.1.1.15 port 55620 connected with 1.1.1.16 port 5001
[  3] local 1.1.1.15 port 55584 connected with 1.1.1.16 port 5001
[ 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-06 Thread Jinpu Wang
Hi Peter, hi Daniel,
On Mon, May 6, 2024 at 5:29 PM Peter Xu  wrote:
>
> On Mon, May 06, 2024 at 12:08:43PM +0200, Jinpu Wang wrote:
> > Hi Peter, hi Daniel,
>
> Hi, Jinpu,
>
> Thanks for sharing this test results.  Sounds like a great news.
>
> What's your plan next?  Would it then be worthwhile / possible moving QEMU
> into that direction?  Would that greatly simplify rdma code as Dan
> mentioned?
I'm not very familiar with QEMU migration yet. From the test results,
I think it's a possible direction; we just need to be based on a fairly
recent rdma-core release, such as v33, with proper 'fork' support.

Maybe Dan or you could give more detail about what you have in mind
for using rsocket as a replacement in the future.
We will also look into the implementation details in the meantime.

Thx!
J

>
> Thanks,
>
> >
> > On Fri, May 3, 2024 at 4:33 PM Peter Xu  wrote:
> > >
> > > On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> > > > I had a brief check in the rsocket changelog, there seems some
> > > > improvement over time,
> > > >  might be worth revisiting this. due to socket abstraction, we can't
> > > > use some feature like
> > > >  ODP, it won't be a small and easy task.
> > >
> > > It'll be good to know whether Dan's suggestion would work first, without
> > > rewritting everything yet so far.  Not sure whether some perf test could
> > > help with the rsocket APIs even without QEMU's involvements (or looking 
> > > for
> > > test data supporting / invalidate such conversions).
> > >
> > I did a quick test with iperf on 100 G environment and 40 G
> > environment, in summary rsocket works pretty well.
> >
> > iperf tests between 2 hosts with 40 G (IB),
> > first  a few test with different num. of threads on top of ipoib
> > interface, later with preload rsocket on top of same ipoib interface.
> >
> > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145
> > 
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > 
> > [  3] local 10.43.3.146 port 55602 connected with 10.43.3.145 port 52000
> > [ ID] Interval   Transfer Bandwidth
> > [  3] 0.-10.0001 sec  2.85 GBytes  2.44 Gbits/sec
> > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 2
> > 
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > 
> > [  4] local 10.43.3.146 port 39640 connected with 10.43.3.145 port 52000
> > [  3] local 10.43.3.146 port 39626 connected with 10.43.3.145 port 52000
> > [ ID] Interval   Transfer Bandwidth
> > [  3] 0.-10.0012 sec  2.85 GBytes  2.45 Gbits/sec
> > [  4] 0.-10.0026 sec  2.86 GBytes  2.45 Gbits/sec
> > [SUM] 0.-10.0026 sec  5.71 GBytes  4.90 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 0.281/0.300/0.318/0.318 ms (tot/err) = 2/0
> > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 4
> > 
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > 
> > [  4] local 10.43.3.146 port 46956 connected with 10.43.3.145 port 52000
> > [  6] local 10.43.3.146 port 46978 connected with 10.43.3.145 port 52000
> > [  3] local 10.43.3.146 port 46944 connected with 10.43.3.145 port 52000
> > [  5] local 10.43.3.146 port 46962 connected with 10.43.3.145 port 52000
> > [ ID] Interval   Transfer Bandwidth
> > [  3] 0.-10.0017 sec  2.85 GBytes  2.45 Gbits/sec
> > [  4] 0.-10.0015 sec  2.85 GBytes  2.45 Gbits/sec
> > [  5] 0.-10.0026 sec  2.85 GBytes  2.45 Gbits/sec
> > [  6] 0.-10.0005 sec  2.85 GBytes  2.45 Gbits/sec
> > [SUM] 0.-10.0005 sec  11.4 GBytes  9.80 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 0.274/0.312/0.360/0.212 ms (tot/err) = 4/0
> > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 8
> > 
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > 
> > [  7] local 10.43.3.146 port 35062 connected with 10.43.3.145 port 52000

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-06 Thread Jinpu Wang
Hi Peter, hi Daniel,

On Fri, May 3, 2024 at 4:33 PM Peter Xu  wrote:
>
> On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> > I had a brief check in the rsocket changelog, there seems some
> > improvement over time,
> >  might be worth revisiting this. due to socket abstraction, we can't
> > use some feature like
> >  ODP, it won't be a small and easy task.
>
> It'll be good to know whether Dan's suggestion would work first, without
> rewritting everything yet so far.  Not sure whether some perf test could
> help with the rsocket APIs even without QEMU's involvements (or looking for
> test data supporting / invalidate such conversions).
>
I did a quick test with iperf in a 100 G environment and a 40 G
environment; in summary, rsocket works pretty well.

The iperf runs were between 2 hosts over 40 G (IB): first a few tests
with different numbers of threads on top of the ipoib interface, then
with rsocket preloaded on top of the same ipoib interface.

jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145

Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)

[  3] local 10.43.3.146 port 55602 connected with 10.43.3.145 port 52000
[ ID] Interval   Transfer Bandwidth
[  3] 0.-10.0001 sec  2.85 GBytes  2.44 Gbits/sec
jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 2

Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)

[  4] local 10.43.3.146 port 39640 connected with 10.43.3.145 port 52000
[  3] local 10.43.3.146 port 39626 connected with 10.43.3.145 port 52000
[ ID] Interval   Transfer Bandwidth
[  3] 0.-10.0012 sec  2.85 GBytes  2.45 Gbits/sec
[  4] 0.-10.0026 sec  2.86 GBytes  2.45 Gbits/sec
[SUM] 0.-10.0026 sec  5.71 GBytes  4.90 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.281/0.300/0.318/0.318 ms (tot/err) = 2/0
jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 4

Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)

[  4] local 10.43.3.146 port 46956 connected with 10.43.3.145 port 52000
[  6] local 10.43.3.146 port 46978 connected with 10.43.3.145 port 52000
[  3] local 10.43.3.146 port 46944 connected with 10.43.3.145 port 52000
[  5] local 10.43.3.146 port 46962 connected with 10.43.3.145 port 52000
[ ID] Interval   Transfer Bandwidth
[  3] 0.-10.0017 sec  2.85 GBytes  2.45 Gbits/sec
[  4] 0.-10.0015 sec  2.85 GBytes  2.45 Gbits/sec
[  5] 0.-10.0026 sec  2.85 GBytes  2.45 Gbits/sec
[  6] 0.-10.0005 sec  2.85 GBytes  2.45 Gbits/sec
[SUM] 0.-10.0005 sec  11.4 GBytes  9.80 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.274/0.312/0.360/0.212 ms (tot/err) = 4/0
jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 8

Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)

[  7] local 10.43.3.146 port 35062 connected with 10.43.3.145 port 52000
[  6] local 10.43.3.146 port 35058 connected with 10.43.3.145 port 52000
[  8] local 10.43.3.146 port 35066 connected with 10.43.3.145 port 52000
[  9] local 10.43.3.146 port 35074 connected with 10.43.3.145 port 52000
[  3] local 10.43.3.146 port 35038 connected with 10.43.3.145 port 52000
[ 12] local 10.43.3.146 port 35088 connected with 10.43.3.145 port 52000
[  5] local 10.43.3.146 port 35048 connected with 10.43.3.145 port 52000
[  4] local 10.43.3.146 port 35050 connected with 10.43.3.145 port 52000
[ ID] Interval   Transfer Bandwidth
[  4] 0.-10.0005 sec  2.85 GBytes  2.44 Gbits/sec
[  8] 0.-10.0011 sec  2.85 GBytes  2.45 Gbits/sec
[  5] 0.-10. sec  2.85 GBytes  2.45 Gbits/sec
[ 12] 0.-10.0021 sec  2.85 GBytes  2.44 Gbits/sec
[  3] 0.-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
[  7] 0.-10.0065 sec  2.50 GBytes  2.14 Gbits/sec
[  9] 0.-10.0077 sec  2.52 GBytes  2.16 Gbits/sec
[  6] 0.-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
[SUM] 0.-10.0003 sec  22.1 GBytes  19.0 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.096/0.226/0.339/0.109 ms (tot/err) = 8/0
jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 16
[  3] local 10.43.3.146 port 49540 connected with 10.43.3.145 port 52000

Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)

[  6] local 10.43.3.146 port 49554 connected with 10.43.3.145 p

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-03 Thread Jinpu Wang
Hi Daniel,

On Wed, May 1, 2024 at 6:00 PM Daniel P. Berrangé  wrote:
>
> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > What I worry more is whether this is really what we want to keep rdma in
> > qemu, and that's also why I was trying to request for some serious
> > performance measurements comparing rdma v.s. nics.  And here when I said
> > "we" I mean both QEMU community and any company that will support keeping
> > rdma around.
> >
> > The problem is if NICs now are fast enough to perform at least equally
> > against rdma, and if it has a lower cost of overall maintenance, does it
> > mean that rdma migration will only be used by whoever wants to keep them in
> > the products and existed already?  In that case we should simply ask new
> > users to stick with tcp, and rdma users should only drop but not increase.
> >
> > It seems also destined that most new migration features will not support
> > rdma: see how much we drop old features in migration now (which rdma
> > _might_ still leverage, but maybe not), and how much we add mostly multifd
> > relevant which will probably not apply to rdma at all.  So in general what
> > I am worrying is a both-loss condition, if the company might be easier to
> > either stick with an old qemu (depending on whether other new features are
> > requested to be used besides RDMA alone), or do periodic rebase with RDMA
> > downstream only.
>
> I don't know much about the originals of RDMA support in QEMU and why
> this particular design was taken. It is indeed a huge maint burden to
> have a completely different code flow for RDMA with 4000+ lines of
> custom protocol signalling which is barely understandable.
>
> I would note that /usr/include/rdma/rsocket.h provides a higher level
> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> type could almost[1] trivially have supported RDMA. There would have
> been almost no RDMA code required in the migration subsystem, and all
> the modern features like compression, multifd, post-copy, etc would
> "just work".
I guess at the time rsocket was less mature and less performant
compared to using uverbs directly.



>
> I guess the 'rsocket.h' shim may well limit some of the possible
> performance gains, but it might still have been a better tradeoff
> to have not quite so good peak performance, but with massively
> less maint burden.
I had a brief look at the rsocket changelog; there seems to be some
improvement over time, so it might be worth revisiting this. Due to
the socket abstraction we can't use some features like ODP, so it
won't be a small and easy task.
> With regards,
> Daniel
Thanks for the suggestion.
>
> [1] "almost" trivially, because the poll() integration for rsockets
> requires a bit more magic sauce since rsockets FDs are not
> really FDs from the kernel's POV. Still, QIOCHannel likely can
> abstract that probme.
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-02 Thread Jinpu Wang
Hi Peter

On Thu, May 2, 2024 at 6:20 PM Peter Xu  wrote:
>
> On Thu, May 02, 2024 at 03:30:58PM +0200, Jinpu Wang wrote:
> > Hi Michael, Hi Peter,
> >
> >
> > On Thu, May 2, 2024 at 3:23 PM Michael Galaxy  wrote:
> > >
> > > Yu Zhang / Jinpu,
> > >
> > > Any possibility (at your lesiure, and within the disclosure rules of
> > > your company, IONOS) if you could share any of your performance
> > > information to educate the group?
> > >
> > > NICs have indeed changed, but not everybody has 100ge mellanox cards at
> > > their disposal. Some people don't.
> > Our staging env is with 100 Gb/s IB environment.
> > We will have a new setup in the coming months with Ethernet (RoCE), we
> > will run some performance
> > comparison when we have the environment ready.
>
> Thanks both.  Please keep us posted.
>
> Just to double check, we're comparing "tcp:" v.s. "rdma:", RoCE is not
> involved, am I right?
Kind of. Our new hardware is RDMA capable; we can configure it to use
either the "rdma" transport or "tcp", which makes for a more direct
comparison. When the "rdma" transport is used, RoCE is involved, e.g.
the rdma-core/ibverbs/rdmacm/vendor verbs drivers are used.
>
> The other note is that the comparison needs to be with multifd enabled for
> the "tcp:" case.  I'd suggest we start with 8 threads if it's 100Gbps.
>
> I think I can still fetch some 100Gbps or even 200Gbps nics around our labs
> without even waiting for months.  If you want I can try to see how we can
> test together.  And btw I don't think we need a cluster, IIUC we simply
> need two hosts, 100G nic on both sides?  IOW, it seems to me we only need
> two cards just for experiments, systems that can drive the cards, and a
> wire supporting 100G?

Yes, the simple setup can be just two hosts directly connected. This reminds me,
I may also be able to find a test setup with a 100 G NIC in the lab; I will
keep you posted.

Regards!
>
> >
> > >
> > > - Michael
> >
> > Thx!
> > Jinpu
> > >
> > > On 5/1/24 11:16, Peter Xu wrote:
> > > > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> > > >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > > >>> What I worry more is whether this is really what we want to keep rdma 
> > > >>> in
> > > >>> qemu, and that's also why I was trying to request for some serious
> > > >>> performance measurements comparing rdma v.s. nics.  And here when I 
> > > >>> said
> > > >>> "we" I mean both QEMU community and any company that will support 
> > > >>> keeping
> > > >>> rdma around.
> > > >>>
> > > >>> The problem is if NICs now are fast enough to perform at least equally
> > > >>> against rdma, and if it has a lower cost of overall maintenance, does 
> > > >>> it
> > > >>> mean that rdma migration will only be used by whoever wants to keep 
> > > >>> them in
> > > >>> the products and existed already?  In that case we should simply ask 
> > > >>> new
> > > >>> users to stick with tcp, and rdma users should only drop but not 
> > > >>> increase.
> > > >>>
> > > >>> It seems also destined that most new migration features will not 
> > > >>> support
> > > >>> rdma: see how much we drop old features in migration now (which rdma
> > > >>> _might_ still leverage, but maybe not), and how much we add mostly 
> > > >>> multifd
> > > >>> relevant which will probably not apply to rdma at all.  So in general 
> > > >>> what
> > > >>> I am worrying is a both-loss condition, if the company might be 
> > > >>> easier to
> > > >>> either stick with an old qemu (depending on whether other new 
> > > >>> features are
> > > >>> requested to be used besides RDMA alone), or do periodic rebase with 
> > > >>> RDMA
> > > >>> downstream only.
> > > >> I don't know much about the originals of RDMA support in QEMU and why
> > > >> this particular design was taken. It is indeed a huge maint burden to
> > > >> have a completely different code flow for RDMA with 4000+ lines of
> > > >> custom protocol signalling which is barely understandable.
> > > >>

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-02 Thread Jinpu Wang
Hi Michael, Hi Peter,


On Thu, May 2, 2024 at 3:23 PM Michael Galaxy  wrote:
>
> Yu Zhang / Jinpu,
>
> Any possibility (at your lesiure, and within the disclosure rules of
> your company, IONOS) if you could share any of your performance
> information to educate the group?
>
> NICs have indeed changed, but not everybody has 100ge mellanox cards at
> their disposal. Some people don't.
Our staging env is a 100 Gb/s IB environment.
We will have a new setup with Ethernet (RoCE) in the coming months,
and we will run some performance comparisons when the environment
is ready.

>
> - Michael

Thx!
Jinpu
>
> On 5/1/24 11:16, Peter Xu wrote:
> > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> >>> What I worry more is whether this is really what we want to keep rdma in
> >>> qemu, and that's also why I was trying to request for some serious
> >>> performance measurements comparing rdma v.s. nics.  And here when I said
> >>> "we" I mean both QEMU community and any company that will support keeping
> >>> rdma around.
> >>>
> >>> The problem is if NICs now are fast enough to perform at least equally
> >>> against rdma, and if it has a lower cost of overall maintenance, does it
> >>> mean that rdma migration will only be used by whoever wants to keep them 
> >>> in
> >>> the products and existed already?  In that case we should simply ask new
> >>> users to stick with tcp, and rdma users should only drop but not increase.
> >>>
> >>> It seems also destined that most new migration features will not support
> >>> rdma: see how much we drop old features in migration now (which rdma
> >>> _might_ still leverage, but maybe not), and how much we add mostly multifd
> >>> relevant which will probably not apply to rdma at all.  So in general what
> >>> I am worrying is a both-loss condition, if the company might be easier to
> >>> either stick with an old qemu (depending on whether other new features are
> >>> requested to be used besides RDMA alone), or do periodic rebase with RDMA
> >>> downstream only.
> >> I don't know much about the originals of RDMA support in QEMU and why
> >> this particular design was taken. It is indeed a huge maint burden to
> >> have a completely different code flow for RDMA with 4000+ lines of
> >> custom protocol signalling which is barely understandable.
> >>
> >> I would note that /usr/include/rdma/rsocket.h provides a higher level
> >> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> >> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> >> type could almost[1] trivially have supported RDMA. There would have
> >> been almost no RDMA code required in the migration subsystem, and all
> >> the modern features like compression, multifd, post-copy, etc would
> >> "just work".
> >>
> >> I guess the 'rsocket.h' shim may well limit some of the possible
> >> performance gains, but it might still have been a better tradeoff
> >> to have not quite so good peak performance, but with massively
> >> less maint burden.
> > My understanding so far is RDMA is sololy for performance but nothing else,
> > then it's a question on whether rdma existing users would like to do so if
> > it will run slower.
> >
> > Jinpu mentioned on the explicit usages of ib verbs but I am just mostly
> > quotting that word as I don't really know such details:
> >
> > https://lore.kernel.org/qemu-devel/camgffem2twjxopcnqtq1sjytf5395dbztcmyikrqfxdzjws...@mail.gmail.com/
> >
> > So not sure whether that applies here too, in that having qiochannel
> > wrapper may not allow direct access to those ib verbs.
> >
> > Thanks,
> >
> >> With regards,
> >> Daniel
> >>
> >> [1] "almost" trivially, because the poll() integration for rsockets
> >>  requires a bit more magic sauce since rsockets FDs are not
> >>  really FDs from the kernel's POV. Still, QIOCHannel likely can
> >>  abstract that probme.
> >> --
> >> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> >> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> >> |: https://entangle-photo.org -o-

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-11 Thread Jinpu Wang
Hi Peter,

On Tue, Apr 9, 2024 at 9:47 PM Peter Xu  wrote:
>
> On Tue, Apr 09, 2024 at 09:32:46AM +0200, Jinpu Wang wrote:
> > Hi Peter,
> >
> > On Mon, Apr 8, 2024 at 6:18 PM Peter Xu  wrote:
> > >
> > > On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > > > Hi Peter,
> > >
> > > Jinpu,
> > >
> > > Thanks for joining the discussion.
> > >
> > > >
> > > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
> > > > >
> > > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > > > Hello Peter und Zhjian,
> > > > > >
> > > > > > Thank you so much for letting me know about this. I'm also a bit 
> > > > > > surprised at
> > > > > > the plan for deprecating the RDMA migration subsystem.
> > > > >
> > > > > It's not too late, since it looks like we do have users not yet 
> > > > > notified
> > > > > from this, we'll redo the deprecation procedure even if it'll be the 
> > > > > final
> > > > > plan, and it'll be 2 releases after this.
> > > > >
> > > > > >
> > > > > > > IMHO it's more important to know whether there are still users 
> > > > > > > and whether
> > > > > > > they would still like to see it around.
> > > > > >
> > > > > > > I admit RDMA migration was lack of testing(unit/CI test), which 
> > > > > > > led to the a few
> > > > > > > obvious bugs being noticed too late.
> > > > > >
> > > > > > Yes, we are a user of this subsystem. I was unaware of the lack of 
> > > > > > test coverage
> > > > > > for this part. As soon as 8.2 was released, I saw that many of the
> > > > > > migration test
> > > > > > cases failed and came to realize that there might be a bug between 
> > > > > > 8.1
> > > > > > and 8.2, but
> > > > > > was unable to confirm and report it quickly to you.
> > > > > >
> > > > > > The maintenance of this part could be too costly or difficult from
> > > > > > your point of view.
> > > > >
> > > > > It may or may not be too costly, it's just that we need real users of 
> > > > > RDMA
> > > > > taking some care of it.  Having it broken easily for >1 releases 
> > > > > definitely
> > > > > is a sign of lack of users.  It is an implication to the community 
> > > > > that we
> > > > > should consider dropping some features so that we can get the best 
> > > > > use of
> > > > > the community resources for the things that may have a broader 
> > > > > audience.
> > > > >
> > > > > One thing majorly missing is a RDMA tester to guard all the merges to 
> > > > > not
> > > > > break RDMA paths, hopefully in CI.  That should not rely on RDMA 
> > > > > hardwares
> > > > > but just to sanity check the migration+rdma code running all fine.  
> > > > > RDMA
> > > > > taught us the lesson so we're requesting CI coverage for all other new
> > > > > features that will be merged at least for migration subsystem, so 
> > > > > that we
> > > > > plan to not merge anything that is not covered by CI unless extremely
> > > > > necessary in the future.
> > > > >
> > > > > For sure CI is not the only missing part, but I'd say we should start 
> > > > > with
> > > > > it, then someone should also take care of the code even if only in
> > > > > maintenance mode (no new feature to add on top).
> > > > >
> > > > > >
> > > > > > My concern is, this plan will forces a few QEMU users (not sure how
> > > > > > many) like us
> > > > > > either to stick to the RDMA migration by using an increasingly older
> > > > > > version of QEMU,
> > > > > > or to abandon the currently used RDMA migration.
> > > > >
> > > > > RDMA doesn't get new features anyway, if there's specific use case 
> > > > > for RDMA
> > > > > migrations, would it work if such a scenario uses the old binary?  Is 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-09 Thread Jinpu Wang
Hi Peter,

On Mon, Apr 8, 2024 at 6:18 PM Peter Xu  wrote:
>
> On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > Hi Peter,
>
> Jinpu,
>
> Thanks for joining the discussion.
>
> >
> > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
> > >
> > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > Hello Peter und Zhjian,
> > > >
> > > > Thank you so much for letting me know about this. I'm also a bit 
> > > > surprised at
> > > > the plan for deprecating the RDMA migration subsystem.
> > >
> > > It's not too late, since it looks like we do have users not yet notified
> > > from this, we'll redo the deprecation procedure even if it'll be the final
> > > plan, and it'll be 2 releases after this.
> > >
> > > >
> > > > > IMHO it's more important to know whether there are still users and 
> > > > > whether
> > > > > they would still like to see it around.
> > > >
> > > > > I admit RDMA migration was lack of testing(unit/CI test), which led 
> > > > > to the a few
> > > > > obvious bugs being noticed too late.
> > > >
> > > > Yes, we are a user of this subsystem. I was unaware of the lack of test 
> > > > coverage
> > > > for this part. As soon as 8.2 was released, I saw that many of the
> > > > migration test
> > > > cases failed and came to realize that there might be a bug between 8.1
> > > > and 8.2, but
> > > > was unable to confirm and report it quickly to you.
> > > >
> > > > The maintenance of this part could be too costly or difficult from
> > > > your point of view.
> > >
> > > It may or may not be too costly, it's just that we need real users of RDMA
> > > taking some care of it.  Having it broken easily for >1 releases 
> > > definitely
> > > is a sign of lack of users.  It is an implication to the community that we
> > > should consider dropping some features so that we can get the best use of
> > > the community resources for the things that may have a broader audience.
> > >
> > > One thing majorly missing is a RDMA tester to guard all the merges to not
> > > break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> > > but just to sanity check the migration+rdma code running all fine.  RDMA
> > > taught us the lesson so we're requesting CI coverage for all other new
> > > features that will be merged at least for migration subsystem, so that we
> > > plan to not merge anything that is not covered by CI unless extremely
> > > necessary in the future.
> > >
> > > For sure CI is not the only missing part, but I'd say we should start with
> > > it, then someone should also take care of the code even if only in
> > > maintenance mode (no new feature to add on top).
> > >
> > > >
> > > > My concern is, this plan will forces a few QEMU users (not sure how
> > > > many) like us
> > > > either to stick to the RDMA migration by using an increasingly older
> > > > version of QEMU,
> > > > or to abandon the currently used RDMA migration.
> > >
> > > RDMA doesn't get new features anyway, if there's specific use case for 
> > > RDMA
> > > migrations, would it work if such a scenario uses the old binary?  Is it
> > > possible to switch to the TCP protocol with some good NICs?
> > We have used rdma migration with HCA from Nvidia for years, our
> > experience is RDMA migration works better than tcp (over ipoib).
>
> Please bear with me, as I know little about rdma stuff.
>
> I'm actually pretty confused (and since a long time ago..) on why we need
> to operation with rdma contexts when ipoib seems to provide all the tcp
> layers.  I meant, can it work with the current "tcp:" protocol with ipoib
> even if there's rdma/ib hardwares underneath?  Is it because of performance
> improvements so that we must use a separate path comparing to generic
> "tcp:" protocol here?
Using the RDMA protocol with IB verbs, we can get the full benefit of RDMA by
talking directly to the NIC, which bypasses the kernel overhead and gives
lower CPU utilization and better performance.

IPoIB, in contrast, is mainly for compatibility with applications that use
TCP, and cannot get the full benefit of RDMA. With mixed generations of IB
devices there are also performance issues on IPoIB: we have seen a 40G HCA
reach only 2 Gb/s over IPoIB, while raw RDMA reaches full line rate.
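
For reference, the two paths compared above are driven like this (addresses
and port numbers are only illustrative):

  # RDMA migration (verbs path)
  destination: qemu-system-x86_64 ... -incoming rdma:[::]:4444
  source (HMP): migrate -d rdma:<destination-ip>:4444

  # TCP migration over IPoIB
  destination: qemu-system-x86_64 ... -incoming tcp:[::]:4444
  source (HMP): migrate -d tcp:<destination-ipoib-ip>:4444

Both work functionally; the difference we see is in CPU utilization and
throughput, as described above.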

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-08 Thread Jinpu Wang
Hi Peter,

On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
>
> On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > Hello Peter und Zhjian,
> >
> > Thank you so much for letting me know about this. I'm also a bit surprised 
> > at
> > the plan for deprecating the RDMA migration subsystem.
>
> It's not too late, since it looks like we do have users not yet notified
> from this, we'll redo the deprecation procedure even if it'll be the final
> plan, and it'll be 2 releases after this.
>
> >
> > > IMHO it's more important to know whether there are still users and whether
> > > they would still like to see it around.
> >
> > > I admit RDMA migration was lack of testing(unit/CI test), which led to 
> > > the a few
> > > obvious bugs being noticed too late.
> >
> > Yes, we are a user of this subsystem. I was unaware of the lack of test 
> > coverage
> > for this part. As soon as 8.2 was released, I saw that many of the
> > migration test
> > cases failed and came to realize that there might be a bug between 8.1
> > and 8.2, but
> > was unable to confirm and report it quickly to you.
> >
> > The maintenance of this part could be too costly or difficult from
> > your point of view.
>
> It may or may not be too costly, it's just that we need real users of RDMA
> taking some care of it.  Having it broken easily for >1 releases definitely
> is a sign of lack of users.  It is an implication to the community that we
> should consider dropping some features so that we can get the best use of
> the community resources for the things that may have a broader audience.
>
> One thing majorly missing is a RDMA tester to guard all the merges to not
> break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> but just to sanity check the migration+rdma code running all fine.  RDMA
> taught us the lesson so we're requesting CI coverage for all other new
> features that will be merged at least for migration subsystem, so that we
> plan to not merge anything that is not covered by CI unless extremely
> necessary in the future.
>
> For sure CI is not the only missing part, but I'd say we should start with
> it, then someone should also take care of the code even if only in
> maintenance mode (no new feature to add on top).
>
> >
> > My concern is, this plan will forces a few QEMU users (not sure how
> > many) like us
> > either to stick to the RDMA migration by using an increasingly older
> > version of QEMU,
> > or to abandon the currently used RDMA migration.
>
> RDMA doesn't get new features anyway, if there's specific use case for RDMA
> migrations, would it work if such a scenario uses the old binary?  Is it
> possible to switch to the TCP protocol with some good NICs?
We have used RDMA migration with HCAs from Nvidia for years; our
experience is that RDMA migration works better than TCP (over IPoIB).

Switching back to TCP would bring back the old problems that were
solved by RDMA migration.

>
> Per our best knowledge, RDMA users are rare, and please let anyone know if
> you are aware of such users.  IIUC the major reason why RDMA stopped being
> the trend is because the network is not like ten years ago; I don't think I
> have good knowledge in RDMA at all nor network, but my understanding is
> it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
> little sense to maintain multiple protocols, considering RDMA migration
> code is so special so that it has the most custom code comparing to other
> protocols.
+cc some guys from Huawei.

I'm surprised that RDMA users are rare; I guess many are just
working with a different code base.
>
> Thanks,
>
> --
> Peter Xu

Thx!
Jinpu Wang
>



[RFC] Convert VMWARE vmdk (snapshot) to raw disk

2024-02-16 Thread Jinpu Wang
Hi,

We want to convert some VMware VMs to KVM, and to reduce the VM down
time we want to use a two-step copy approach:

1. a snapshot is taken
- the source VM continues to run on VMware => a diff is created

2. the snapshot is available for download as vmdk
- we need software to copy the snapshot to the target VM's raw disk

3. the source VM is shut down
- the diff is available

4. the diff needs to be applied to the target raw disk

Is qemu-img able to do this, or is there another tool?  I saw commit
98eb9733f4cf2eeab6d12db7e758665d2fd5367b
Author: Sam Eiderman 
Date:   Thu Jun 20 12:10:57 2019 +0300

vmdk: Add read-only support for seSparse snapshots

So it seems the QEMU vmdk driver already supports the seSparse snapshot format,
but it is unclear to us how to connect all these pieces together.

In short, we want to:
1. vmdk => raw (full size)
2. vmdk delta => same raw disk (later, with less content)

Can you give us some suggestions?
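
Something like the following is what we have in mind, if qemu-img can do it
(the image names are made up, and the second step assumes the downloaded
delta vmdk references the first snapshot as its backing file):

  # step 2: copy the downloaded snapshot to the target raw disk
  qemu-img convert -p -f vmdk -O raw snapshot.vmdk target.raw

  # step 4: after the source VM is shut down, convert the final state onto
  # the same raw disk; -n reuses the existing target instead of recreating
  # it, though it still rewrites all allocated data rather than only the diff
  qemu-img convert -p -n -f vmdk -O raw final.vmdk target.raw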

Thx!
Jinpu Wang



Re: [PATCH v2] target/i386: Export GDS_NO bit to guests

2023-09-14 Thread Jinpu Wang
Hi Paolo,

Ping!

Thx!

On Tue, Aug 15, 2023 at 7:44 AM Xiaoyao Li  wrote:
>
> On 8/15/2023 12:54 PM, Pawan Gupta wrote:
> > Gather Data Sampling (GDS) is a side-channel attack using Gather
> > instructions. Some Intel processors will set ARCH_CAP_GDS_NO bit in
> > MSR IA32_ARCH_CAPABILITIES to report that they are not vulnerable to
> > GDS.
> >
> > Make this bit available to guests.
> >
> > Closes: 
> > https://lore.kernel.org/qemu-devel/camgffemg6tnq0n3+4ojagxc8j0oevy60khzekxcbs3lok9v...@mail.gmail.com/
> > Reported-by: Jack Wang 
> > Signed-off-by: Pawan Gupta 
> > Tested-by: Jack Wang 
> > Tested-by: Daniel Sneddon 
>
> Reviewed-by: Xiaoyao Li 
>
> > ---
> > v2: Added commit tags
> >
> > v1: 
> > https://lore.kernel.org/qemu-devel/c373f3f92b542b738f296d44bb6a916a1cded7bd.1691774049.git.pawan.kumar.gu...@linux.intel.com/
> >
> >   target/i386/cpu.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> > index 97ad229d8ba3..48709b77689f 100644
> > --- a/target/i386/cpu.c
> > +++ b/target/i386/cpu.c
> > @@ -1155,7 +1155,7 @@ FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
> >   NULL, "sbdr-ssdp-no", "fbsdp-no", "psdp-no",
> >   NULL, "fb-clear", NULL, NULL,
> >   NULL, NULL, NULL, NULL,
> > -"pbrsb-no", NULL, NULL, NULL,
> > +"pbrsb-no", NULL, "gds-no", NULL,
> >   NULL, NULL, NULL, NULL,
> >   },
> >   .msr = {
>



Re: RFC: guest INTEL GDS mitigation status on patched host

2023-08-14 Thread Jinpu Wang
Hi Pawan, hi Daniel

Thanks for the patch.

I tried a similar patch on an Icelake server:
Architecture:   x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes:  46 bits physical, 57 bits virtual
CPU(s): 64
On-line CPU(s) list:0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s):  2
NUMA node(s):   2
Vendor ID:  GenuineIntel
CPU family: 6
Model:  106
Model name: Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz
Stepping:   6
CPU MHz:3100.000
CPU max MHz:3600,
CPU min MHz:800,
BogoMIPS:   6200.00
Virtualization: VT-x
L1d cache:  1,5 MiB
L1i cache:  1 MiB
L2 cache:   40 MiB
L3 cache:   72 MiB
NUMA node0 CPU(s):  0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62
NUMA node1 CPU(s):  1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds:  Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data:  Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:   Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:   Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:Not affected
Vulnerability Tsx async abort:  Not affected

 target/i386/cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 97ad229d8ba3..48709b77689f 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -1155,7 +1155,7 @@ FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
 NULL, "sbdr-ssdp-no", "fbsdp-no", "psdp-no",
 NULL, "fb-clear", NULL, NULL,
 NULL, NULL, NULL, NULL,
-"pbrsb-no", NULL, NULL, NULL,
+"pbrsb-no", NULL, "gds-no", NULL,
 NULL, NULL, NULL, NULL,
 },
 .msr = {
-- 
2.34.1
For the change Pawan provided, I tested it on an Icelake server and it works as expected.
Somehow I'm not cc'ed on the patch, but please consider it tested:

Reported-by: Jack Wang 
Tested-by: Jack Wang 

Thx!
Jinpu Wang


while if I patch QEMU as below:



On Fri, Aug 11, 2023 at 3:12 PM Jinpu Wang  wrote:
>
> Hi folks on the list:
>
> I'm testing the latest Downfall cpu vulnerability mitigation. what I
> notice is when both host and guest are using patched kernel +
> microcode eg kernel 5.15.125 +  intel-microcode 20230808 on affected
> server eg Icelake server.
>
> The mitigation status inside guest is:
>
> Vulnerabilities:
>   Gather data sampling:  Unknown: Dependent on hypervisor status
> ---> this one.
>   Itlb multihit: Not affected
>   L1tf:  Not affected
>   Mds:   Not affected
>   Meltdown:  Not affected
>   Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
>   Retbleed:  Not affected
>   Spec rstack overflow:  Not affected
>   Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
>   Spectre v1:Mitigation; usercopy/swapgs barriers and __user pointer sanitization
>   Spectre v2:Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence

RFC: guest INTEL GDS mitigation status on patched host

2023-08-11 Thread Jinpu Wang
Hi folks on the list:

I'm testing the latest Downfall CPU vulnerability mitigation, with both
host and guest using a patched kernel + microcode (e.g. kernel 5.15.125 +
intel-microcode 20230808) on an affected server (e.g. an Icelake server).

The mitigation status inside the guest is:

Vulnerabilities:
  Gather data sampling:  Unknown: Dependent on hyp
 ervisor status
---> this one.
  Itlb multihit: Not affected
  L1tf:  Not affected
  Mds:   Not affected
  Meltdown:  Not affected
  Mmio stale data:   Vulnerable: Clear CPU buf
 fers attempted, no microc
 ode; SMT Host state unkno
 wn
  Retbleed:  Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass: Mitigation; Speculative S
 tore Bypass disabled via
 prctl and seccomp
  Spectre v1:Mitigation; usercopy/swap
 gs barriers and __user po
 inter sanitization
  Spectre v2:Mitigation; Enhanced IBRS
 , IBPB conditional, RSB f
 illing, PBRSB-eIBRS SW se
 quence
  Srbds: Not affected
  Tsx async abort:   Not affected

According to the kernel commit below:
commit 81ac7e5d741742d650b4ed6186c4826c1a0631a7
Author: Daniel Sneddon 
Date:   Wed Jul 12 19:43:14 2023 -0700

KVM: Add GDS_NO support to KVM

Gather Data Sampling (GDS) is a transient execution attack using
gather instructions from the AVX2 and AVX512 extensions. This attack
allows malicious code to infer data that was previously stored in
vector registers. Systems that are not vulnerable to GDS will set the
GDS_NO bit of the IA32_ARCH_CAPABILITIES MSR. This is useful for VM
guests that may think they are on vulnerable systems that are, in
fact, not affected. Guests that are running on affected hosts where
the mitigation is enabled are protected as if they were running
on an unaffected system.

On all hosts that are not affected or that are mitigated, set the
GDS_NO bit.

Signed-off-by: Daniel Sneddon 
Signed-off-by: Dave Hansen 
Acked-by: Josh Poimboeuf 

KVM already supports GDS_NO, but it seems the QEMU side doesn't pass
the info on to the guest; that's why it shows up as unknown. IMO QEMU
should pass GDS_NO through if the host is already patched.

Is Intel or anyone else already working on a QEMU patch? I know it's not
a must, but it would be good to have.
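
For completeness, what I would expect to be able to do once QEMU knows about
the bit is roughly the following (the explicit flag name is my assumption,
modelled on how other IA32_ARCH_CAPABILITIES bits such as pbrsb-no are
exposed):

  qemu-system-x86_64 -enable-kvm -cpu host,+gds-no ...

after which the guest should report something other than "Unknown: Dependent
on hypervisor status" in:

  cat /sys/devices/system/cpu/vulnerabilities/gather_data_sampling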

Thx!
Jinpu Wang @ IONOS Cloud



Re: an issue for device hot-unplug

2023-04-04 Thread Jinpu Wang
Hi Yu and Laurent,

On Mon, Apr 3, 2023 at 6:59 PM Yu Zhang  wrote:
>
> Dear Laurent,
>
> Thank you for your quick reply. We used qemu-7.1, but it is reproducible with 
> qemu from v6.2 to the recent v8.0 release candidates.
> I found that it's introduced by the commit  9323f892b39 (between v6.2.0-rc2 
> and v6.2.0-rc3).
>
> If it doesn't break anything else, it suffices to remove the line below from 
> acpi_pcihp_device_unplug_request_cb():
>
> pdev->qdev.pending_deleted_event = true;
>
> but you may have a reason to keep it. First of all, I'll open a bug in the 
> bug tracker and let you know.
>
We opened an issue here:
https://gitlab.com/qemu-project/qemu/-/issues/1577
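
For reference, a workaround on the management side is to issue device_del
once and then wait for the DEVICE_DELETED event before treating the unplug
as finished, instead of re-issuing device_del (the device id below is only
an example):

  -> { "execute": "device_del", "arguments": { "id": "virtio-disk1" } }
  <- { "return": {} }
  ... guest completes the unplug ...
  <- { "event": "DEVICE_DELETED",
       "data": { "device": "virtio-disk1",
                 "path": "/machine/peripheral/virtio-disk1" } }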

Regards!
Jinpu Wang
> Best regards,
> Yu Zhang
>
> On Mon, Apr 3, 2023 at 6:32 PM Laurent Vivier  wrote:
>>
>> Hi Yu,
>>
>> please open a bug in the bug tracker:
>>
>> https://gitlab.com/qemu/qemu/-/issues
>>
>> It's easier to track the problem.
>>
>> What is the version of QEMU you are using?
>> Could you provide QEMU command line?
>>
>> Thanks,
>> Laurent
>>
>>
>> On 4/3/23 15:24, Yu Zhang wrote:
>> > Dear Laurent,
>> >
>> > recently we run into an issue with the following error:
>> >
>> > command '{ "execute": "device_del", "arguments": { "id": "virtio-diskX" } 
>> > }' for VM "id"
>> > failed ({ "return": {"class": "GenericError", "desc": "Device virtio-diskX 
>> > is already in
>> > the process of unplug"} }).
>> >
>> > The issue is reproducible. With a few seconds delay before hot-unplug, 
>> > hot-unplug just
>> > works fine.
>> >
>> > After a few digging, we found that the commit 9323f892b39 may incur the 
>> > issue.
>> > --
>> >  failover: fix unplug pending detection
>> >
>> >  Failover needs to detect the end of the PCI unplug to start migration
>> >  after the VFIO card has been unplugged.
>> >
>> >  To do that, a flag is set in pcie_cap_slot_unplug_request_cb() and 
>> > reset in
>> >  pcie_unplug_device().
>> >
>> >  But since
>> >  17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default on 
>> > Q35")
>> >  we have switched to ACPI unplug and these functions are not called 
>> > anymore
>> >  and the flag not set. So failover migration is not able to detect if 
>> > card
>> >  is really unplugged and acts as it's done as soon as it's started. So 
>> > it
>> >  doesn't wait the end of the unplug to start the migration. We don't 
>> > see any
>> >  problem when we test that because ACPI unplug is faster than PCIe 
>> > native
>> >  hotplug and when the migration really starts the unplug operation is
>> >  already done.
>> >
>> >  See c000a9bd06ea ("pci: mark device having guest unplug request 
>> > pending")
>> >  a99c4da9fc2a ("pci: mark devices partially unplugged")
>> >
>> >  Signed-off-by: Laurent Vivier <lviv...@redhat.com>
>> >  Reviewed-by: Ani Sinha <a...@anisinha.ca>
>> >  Message-Id: <2028133225.324937-4-lviv...@redhat.com>
>> >  Reviewed-by: Michael S. Tsirkin <m...@redhat.com>
>> >  Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
>> > --
>> > The purpose is for detecting the end of the PCI device hot-unplug. 
>> > However, we feel the
>> > error confusing. How is it possible that a disk "is already in the process 
>> > of unplug"
>> > during the first hot-unplug attempt? So far as I know, the issue was also 
>> > encountered by
>> > libvirt, but they simply ignored it:
>> >
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1878659
>> >
>> > Hence, a question is: should we have the line below in  
>> > acpi_pcihp_device_unplug_request_cb()?
>> >
>> > pdev->qdev.pending_deleted_event = true;
>> >
>> > It would be great if you as the author could give us a few hints.
>> >
>> > Thank you very much for your reply!
>> >
>> > Sincerely,
>> >
>> > Yu Zhang @ Compute Platform IONOS
>> > 03.04.2013
>>



Re: an issue for device hot-unplug

2023-04-04 Thread Jinpu Wang
Hi Yu,

On Mon, Apr 3, 2023 at 6:59 PM Yu Zhang  wrote:
>
> Dear Laurent,
>
> Thank you for your quick reply. We used qemu-7.1, but it is reproducible with 
> qemu from v6.2 to the recent v8.0 release candidates.
> I found that it's introduced by the commit  9323f892b39 (between v6.2.0-rc2 
> and v6.2.0-rc3).
>
> If it doesn't break anything else, it suffices to remove the line below from 
> acpi_pcihp_device_unplug_request_cb():
>
> pdev->qdev.pending_deleted_event = true;
>
> but you may have a reason to keep it. First of all, I'll open a bug in the 
> bug tracker and let you know.
>
> Best regards,
> Yu Zhang
This patch from Igor Mammedov seems relevant,
https://lore.kernel.org/qemu-devel/20230403131833-mutt-send-email-...@kernel.org/T/#t
Can you try it out?

Regards!
Jinpu
>
> On Mon, Apr 3, 2023 at 6:32 PM Laurent Vivier  wrote:
>>
>> Hi Yu,
>>
>> please open a bug in the bug tracker:
>>
>> https://gitlab.com/qemu/qemu/-/issues
>>
>> It's easier to track the problem.
>>
>> What is the version of QEMU you are using?
>> Could you provide QEMU command line?
>>
>> Thanks,
>> Laurent
>>
>>
>> On 4/3/23 15:24, Yu Zhang wrote:
>> > Dear Laurent,
>> >
>> > recently we run into an issue with the following error:
>> >
>> > command '{ "execute": "device_del", "arguments": { "id": "virtio-diskX" } 
>> > }' for VM "id"
>> > failed ({ "return": {"class": "GenericError", "desc": "Device virtio-diskX 
>> > is already in
>> > the process of unplug"} }).
>> >
>> > The issue is reproducible. With a few seconds delay before hot-unplug, 
>> > hot-unplug just
>> > works fine.
>> >
>> > After a few digging, we found that the commit 9323f892b39 may incur the 
>> > issue.
>> > --
>> >  failover: fix unplug pending detection
>> >
>> >  Failover needs to detect the end of the PCI unplug to start migration
>> >  after the VFIO card has been unplugged.
>> >
>> >  To do that, a flag is set in pcie_cap_slot_unplug_request_cb() and 
>> > reset in
>> >  pcie_unplug_device().
>> >
>> >  But since
>> >  17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default on 
>> > Q35")
>> >  we have switched to ACPI unplug and these functions are not called 
>> > anymore
>> >  and the flag not set. So failover migration is not able to detect if 
>> > card
>> >  is really unplugged and acts as it's done as soon as it's started. So 
>> > it
>> >  doesn't wait the end of the unplug to start the migration. We don't 
>> > see any
>> >  problem when we test that because ACPI unplug is faster than PCIe 
>> > native
>> >  hotplug and when the migration really starts the unplug operation is
>> >  already done.
>> >
>> >  See c000a9bd06ea ("pci: mark device having guest unplug request 
>> > pending")
>> >  a99c4da9fc2a ("pci: mark devices partially unplugged")
>> >
>> >  Signed-off-by: Laurent Vivier <lviv...@redhat.com>
>> >  Reviewed-by: Ani Sinha <a...@anisinha.ca>
>> >  Message-Id: <2028133225.324937-4-lviv...@redhat.com>
>> >  Reviewed-by: Michael S. Tsirkin <m...@redhat.com>
>> >  Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
>> > --
>> > The purpose is for detecting the end of the PCI device hot-unplug. 
>> > However, we feel the
>> > error confusing. How is it possible that a disk "is already in the process 
>> > of unplug"
>> > during the first hot-unplug attempt? So far as I know, the issue was also 
>> > encountered by
>> > libvirt, but they simply ignored it:
>> >
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1878659
>> > 
>> >
>> > Hence, a question is: should we have the line below in  
>> > acpi_pcihp_device_unplug_request_cb()?
>> >
>> > pdev->qdev.pending_deleted_event = true;
>> >
>> > It would be great if you as the author could give us a few hints.
>> >
>> > Thank you very much for your reply!
>> >
>> > Sincerely,
>> >
>> > Yu Zhang @ Compute Platform IONOS
>> > 03.04.2013
>>



Re: RFC: sgx-epc is not listed in machine type help

2022-04-28 Thread Jinpu Wang
On Fri, Apr 29, 2022 at 4:22 AM Yang Zhong  wrote:
>
> On Thu, Apr 28, 2022 at 02:56:50PM +0200, Jinpu Wang wrote:
> > On Thu, Apr 28, 2022 at 2:32 PM Yang Zhong  wrote:
> > >
> > > On Thu, Apr 28, 2022 at 02:18:54PM +0200, Jinpu Wang wrote:
> > > > On Thu, Apr 28, 2022 at 2:05 PM Yang Zhong  wrote:
> > > > >
> > > > > On Thu, Apr 28, 2022 at 01:59:33PM +0200, Jinpu Wang wrote:
> > > > > > Hi Yang, hi Paolo,
> > > > > >
> > > > > > We noticed sgx-epc machine type is not listed in the output of
> > > > > > "qemu-system-x86_64 -M ?",
> > > > snip
> > > > > >
> > > > > >
> > > > > > I think this would cause confusion to users, is there a reason 
> > > > > > behind this?
> > > > > >
> > > > >
> > > > >   No specific machine type for SGX, and SGX is only supported in Qemu 
> > > > > PC and Q35 platform.
> > > > Hi Yang,
> > > >
> > > > Thanks for your quick reply. Sorry for the stupid question.
> > > > The information I've got from intel or the help sample from
> > > > https://www.qemu.org/docs/master/system/i386/sgx.html, We need to
> > > > specify commands something like this to run SGX-EPC guest:
> > > > qemu-system-x86-64 -m 2G -nographic -enable-kvm -cpu
> > > > host,+sgx-provisionkey  -object
> > > > memory-backend-epc,id=mem1,size=512M,prealloc=on -M
> > > > sgx-epc.0.memdev=mem1,sgx-epc.0.node=0 /tmp/volume-name.img
> > > >
> > > > Do you mean internally QEMU is converting -M sgx-epc to PC or Q35, can
> > > > I choose which one to use?
> > > >
> > >
> > >   Qemu will replace object with compound key, in that time, Paolo asked me
> > >   to use "-M sgx-epc..." to replace "-object sgx-epc..." from Qemu 
> > > command line.
> > >
> > >   So the "-M sgx-epc..." will get sgx-epc's parameters from hash key, and
> > >   do not covert sgx-epc to PC or Q35.
> > >
> > >   SGX is only one Intel cpu feature, and no dedicated SGX Qemu machine 
> > > type for SGX.
> > >
> > >   Another compound key example:
> > >   "-M pc,smp.cpus=4,smp.cores=1,smp.threads=1"
> > >
> > >   Yang
> > ah, ok. thx for the sharing.
> > so if I specify "-M pc -M sgx-epc.." it will be the explicit way to
> > choose PC machine type with sgx feature.
> > and "-M q35 -M sgx-epc.." qemu will use Q35 machine type?
>
>   The below command is okay,
>   "-M pc,sgx-epc.." or "-M q35,sgx-epc.."
Thanks!
>
>   Yang
>
> > >
> > >
> > > > Thanks!
> > > > Jinpu



Re: RFC: sgx-epc is not listed in machine type help

2022-04-28 Thread Jinpu Wang
Hi Daniel,

On Thu, Apr 28, 2022 at 2:33 PM Daniel P. Berrangé  wrote:
>
> On Thu, Apr 28, 2022 at 02:18:54PM +0200, Jinpu Wang wrote:
> > On Thu, Apr 28, 2022 at 2:05 PM Yang Zhong  wrote:
> > >
> > > On Thu, Apr 28, 2022 at 01:59:33PM +0200, Jinpu Wang wrote:
> > > > Hi Yang, hi Paolo,
> > > >
> > > > We noticed sgx-epc machine type is not listed in the output of
> > > > "qemu-system-x86_64 -M ?",
> > snip
> > > >
> > > >
> > > > I think this would cause confusion to users, is there a reason behind 
> > > > this?
> > > >
> > >
> > >   No specific machine type for SGX, and SGX is only supported in Qemu PC 
> > > and Q35 platform.
> > Hi Yang,
> >
> > Thanks for your quick reply. Sorry for the stupid question.
> > The information I've got from intel or the help sample from
> > https://www.qemu.org/docs/master/system/i386/sgx.html, We need to
> > specify commands something like this to run SGX-EPC guest:
> > qemu-system-x86-64 -m 2G -nographic -enable-kvm -cpu
> > host,+sgx-provisionkey  -object
> > memory-backend-epc,id=mem1,size=512M,prealloc=on -M
> > sgx-epc.0.memdev=mem1,sgx-epc.0.node=0 /tmp/volume-name.img
>
> That isn't an sgx-epc machine type.
>
> That is an (implicit) i440fx  machine type, with an sgx-epc.0.memdev
> property being set.
>
>
> With regards,
> Daniel
Thanks for your reply, I have a better understanding now.
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>



Re: RFC: sgx-epc is not listed in machine type help

2022-04-28 Thread Jinpu Wang
On Thu, Apr 28, 2022 at 2:32 PM Yang Zhong  wrote:
>
> On Thu, Apr 28, 2022 at 02:18:54PM +0200, Jinpu Wang wrote:
> > On Thu, Apr 28, 2022 at 2:05 PM Yang Zhong  wrote:
> > >
> > > On Thu, Apr 28, 2022 at 01:59:33PM +0200, Jinpu Wang wrote:
> > > > Hi Yang, hi Paolo,
> > > >
> > > > We noticed sgx-epc machine type is not listed in the output of
> > > > "qemu-system-x86_64 -M ?",
> > snip
> > > >
> > > >
> > > > I think this would cause confusion to users, is there a reason behind 
> > > > this?
> > > >
> > >
> > >   No specific machine type for SGX, and SGX is only supported in Qemu PC 
> > > and Q35 platform.
> > Hi Yang,
> >
> > Thanks for your quick reply. Sorry for the stupid question.
> > The information I've got from intel or the help sample from
> > https://www.qemu.org/docs/master/system/i386/sgx.html, We need to
> > specify commands something like this to run SGX-EPC guest:
> > qemu-system-x86-64 -m 2G -nographic -enable-kvm -cpu
> > host,+sgx-provisionkey  -object
> > memory-backend-epc,id=mem1,size=512M,prealloc=on -M
> > sgx-epc.0.memdev=mem1,sgx-epc.0.node=0 /tmp/volume-name.img
> >
> > Do you mean internally QEMU is converting -M sgx-epc to PC or Q35, can
> > I choose which one to use?
> >
>
>   Qemu will replace object with compound key, in that time, Paolo asked me
>   to use "-M sgx-epc..." to replace "-object sgx-epc..." from Qemu command 
> line.
>
>   So the "-M sgx-epc..." will get sgx-epc's parameters from hash key, and
>   do not covert sgx-epc to PC or Q35.
>
>   SGX is only one Intel cpu feature, and no dedicated SGX Qemu machine type 
> for SGX.
>
>   Another compound key example:
>   "-M pc,smp.cpus=4,smp.cores=1,smp.threads=1"
>
>   Yang
Ah, OK, thanks for sharing.
So if I specify "-M pc -M sgx-epc..", is that the explicit way to
choose the PC machine type with the SGX feature,
and with "-M q35 -M sgx-epc.." will QEMU use the Q35 machine type?
>
>
> > Thanks!
> > Jinpu



Re: RFC: sgx-epc is not listed in machine type help

2022-04-28 Thread Jinpu Wang
On Thu, Apr 28, 2022 at 2:05 PM Yang Zhong  wrote:
>
> On Thu, Apr 28, 2022 at 01:59:33PM +0200, Jinpu Wang wrote:
> > Hi Yang, hi Paolo,
> >
> > We noticed sgx-epc machine type is not listed in the output of
> > "qemu-system-x86_64 -M ?",
snip
> >
> >
> > I think this would cause confusion to users, is there a reason behind this?
> >
>
>   No specific machine type for SGX, and SGX is only supported in Qemu PC and 
> Q35 platform.
Hi Yang,

Thanks for your quick reply, and sorry for the stupid question.
According to the information I've got from Intel and the help sample at
https://www.qemu.org/docs/master/system/i386/sgx.html, we need to
specify a command something like this to run an SGX EPC guest:
qemu-system-x86-64 -m 2G -nographic -enable-kvm -cpu
host,+sgx-provisionkey  -object
memory-backend-epc,id=mem1,size=512M,prealloc=on -M
sgx-epc.0.memdev=mem1,sgx-epc.0.node=0 /tmp/volume-name.img

Do you mean that internally QEMU converts -M sgx-epc to PC or Q35? Can
I choose which one to use?

Thanks!
Jinpu



RFC: sgx-epc is not listed in machine type help

2022-04-28 Thread Jinpu Wang
Hi Yang, hi Paolo,

We noticed that the sgx-epc machine type is not listed in the output of
"qemu-system-x86_64 -M ?".
This is what I got with qemu-7.0:
Supported machines are:
microvm  microvm (i386)
pc   Standard PC (i440FX + PIIX, 1996) (alias of pc-i440fx-7.0)
pc-i440fx-7.0Standard PC (i440FX + PIIX, 1996) (default)
pc-i440fx-6.2Standard PC (i440FX + PIIX, 1996)
pc-i440fx-6.1Standard PC (i440FX + PIIX, 1996)
pc-i440fx-6.0Standard PC (i440FX + PIIX, 1996)
pc-i440fx-5.2Standard PC (i440FX + PIIX, 1996)
pc-i440fx-5.1Standard PC (i440FX + PIIX, 1996)
pc-i440fx-5.0Standard PC (i440FX + PIIX, 1996)
pc-i440fx-4.2Standard PC (i440FX + PIIX, 1996)
pc-i440fx-4.1Standard PC (i440FX + PIIX, 1996)
pc-i440fx-4.0Standard PC (i440FX + PIIX, 1996)
pc-i440fx-3.1Standard PC (i440FX + PIIX, 1996)
pc-i440fx-3.0Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.9Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.8Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.7Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.6Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.5Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.4Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.3Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.2Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.12   Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.11   Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.10   Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.1Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.0Standard PC (i440FX + PIIX, 1996)
pc-i440fx-1.7Standard PC (i440FX + PIIX, 1996) (deprecated)
pc-i440fx-1.6Standard PC (i440FX + PIIX, 1996) (deprecated)
pc-i440fx-1.5Standard PC (i440FX + PIIX, 1996) (deprecated)
pc-i440fx-1.4Standard PC (i440FX + PIIX, 1996) (deprecated)
q35  Standard PC (Q35 + ICH9, 2009) (alias of pc-q35-7.0)
pc-q35-7.0   Standard PC (Q35 + ICH9, 2009)
pc-q35-6.2   Standard PC (Q35 + ICH9, 2009)
pc-q35-6.1   Standard PC (Q35 + ICH9, 2009)
pc-q35-6.0   Standard PC (Q35 + ICH9, 2009)
pc-q35-5.2   Standard PC (Q35 + ICH9, 2009)
pc-q35-5.1   Standard PC (Q35 + ICH9, 2009)
pc-q35-5.0   Standard PC (Q35 + ICH9, 2009)
pc-q35-4.2   Standard PC (Q35 + ICH9, 2009)
pc-q35-4.1   Standard PC (Q35 + ICH9, 2009)
pc-q35-4.0.1 Standard PC (Q35 + ICH9, 2009)
pc-q35-4.0   Standard PC (Q35 + ICH9, 2009)
pc-q35-3.1   Standard PC (Q35 + ICH9, 2009)
pc-q35-3.0   Standard PC (Q35 + ICH9, 2009)
pc-q35-2.9   Standard PC (Q35 + ICH9, 2009)
pc-q35-2.8   Standard PC (Q35 + ICH9, 2009)
pc-q35-2.7   Standard PC (Q35 + ICH9, 2009)
pc-q35-2.6   Standard PC (Q35 + ICH9, 2009)
pc-q35-2.5   Standard PC (Q35 + ICH9, 2009)
pc-q35-2.4   Standard PC (Q35 + ICH9, 2009)
pc-q35-2.12  Standard PC (Q35 + ICH9, 2009)
pc-q35-2.11  Standard PC (Q35 + ICH9, 2009)
pc-q35-2.10  Standard PC (Q35 + ICH9, 2009)
isapcISA-only PC
none empty machine
x-remote Experimental remote machine


I think this could cause confusion for users; is there a reason behind this?

Thanks!
Jinpu Wang @ IONOS Cloud



Re: [PATCH 2/2] migration/rdma: set the REUSEADDR option for destination

2022-02-02 Thread Jinpu Wang
On Wed, Feb 2, 2022 at 11:15 AM Dr. David Alan Gilbert
 wrote:
>
> * Jack Wang (jinpu.w...@ionos.com) wrote:
> > This allow address could be reused to avoid rdma_bind_addr error
> > out.
>
> In what case do you get the error - after a failed migrate and then a
> retry?

Yes. What I saw is that in case of an error, the mgmt daemon picks one
migration port and gets:
-incoming rdma:[::]:8089: RDMA ERROR: Error: could not rdma_bind_addr

It then tries another port, e.g. -incoming rdma:[::]:8103; sometimes that
works, sometimes it needs yet another try with a different port number.

With this patch, I don't see the error anymore.
>
> Dave
Thanks!
>
> > Signed-off-by: Jack Wang 
> > ---
> >  migration/rdma.c | 7 +++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/migration/rdma.c b/migration/rdma.c
> > index 2e223170d06d..b498ef013c77 100644
> > --- a/migration/rdma.c
> > +++ b/migration/rdma.c
> > @@ -2705,6 +2705,7 @@ static int qemu_rdma_dest_init(RDMAContext *rdma, 
> > Error **errp)
> >  char ip[40] = "unknown";
> >  struct rdma_addrinfo *res, *e;
> >  char port_str[16];
> > +int reuse = 1;
> >
> >  for (idx = 0; idx < RDMA_WRID_MAX; idx++) {
> >  rdma->wr_data[idx].control_len = 0;
> > @@ -2740,6 +2741,12 @@ static int qemu_rdma_dest_init(RDMAContext *rdma, 
> > Error **errp)
> >  goto err_dest_init_bind_addr;
> >  }
> >
> > +ret = rdma_set_option(listen_id, RDMA_OPTION_ID, 
> > RDMA_OPTION_ID_REUSEADDR,
> > +  &reuse, sizeof reuse);
> > +if (ret) {
> > +ERROR(errp, "Error: could not set REUSEADDR option");
> > +goto err_dest_init_bind_addr;
> > +}
> >  for (e = res; e != NULL; e = e->ai_next) {
> >  inet_ntop(e->ai_family,
> >  &((struct sockaddr_in *) e->ai_dst_addr)->sin_addr, ip, sizeof 
> > ip);
> > --
> > 2.25.1
> >
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>



Re: [PATCH 1/2] migration/rdma: Increase the backlog from 5 to 128

2022-02-02 Thread Jinpu Wang
On Wed, Feb 2, 2022 at 10:20 AM Dr. David Alan Gilbert
 wrote:
>
> * Pankaj Gupta (pankaj.gu...@ionos.com) wrote:
> > > > > >  migration/rdma.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/migration/rdma.c b/migration/rdma.c
> > > > > > index c7c7a384875b..2e223170d06d 100644
> > > > > > --- a/migration/rdma.c
> > > > > > +++ b/migration/rdma.c
> > > > > > @@ -4238,7 +4238,7 @@ void rdma_start_incoming_migration(const char 
> > > > > > *host_port, Error **errp)
> > > > > >
> > > > > >  trace_rdma_start_incoming_migration_after_dest_init();
> > > > > >
> > > > > > -ret = rdma_listen(rdma->listen_id, 5);
> > > > > > +ret = rdma_listen(rdma->listen_id, 128);
> > > > >
> > > > > 128 backlog seems too much to me. Any reason for choosing this number.
> > > > > Any rationale to choose this number?
> > > > >
> > > > 128 is the default value of SOMAXCONN, I can use that if it is 
> > > > preferred.
> > >
> > > AFAICS backlog is only applicable with RDMA iWARP CM mode. Maybe we
> > > can increase it to 128.these many
> >
> > Or maybe we first increase it to 20 or 32? or so to avoid memory
> > overhead if we are not
> > using these many connections at the same time.
>
> Can you explain why you're requiring more than 1?  Is this with multifd
> patches?

No, I'm not using the multifd patches; this was just from code reading,
and I felt 5 was too small for the backlog setting.

As Pankaj rightly mentioned, for RDMA the backlog only has some effect
with the iWARP CM; it does nothing for InfiniBand and RoCE.

Please ignore this patch; we can revisit this when we introduce
multifd with RDMA.

Thanks!
Jinpu Wang
>
> Dave
>
> > > Maybe you can also share any testing data for multiple concurrent live
> > > migrations using RDMA, please.
> > >
> > > Thanks,
> > > Pankaj
> > >
> > > Thanks,
> > > Pankaj
> >
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>



Re: [PATCH 2/2] migration/rdma: set the REUSEADDR option for destination

2022-02-01 Thread Jinpu Wang
On Tue, Feb 1, 2022 at 7:39 PM Pankaj Gupta  wrote:
>
> > This allow address could be reused to avoid rdma_bind_addr error
> > out.
>
> Seems we are proposing to allow multiple connections on same source ip
> port pair?
According to the man page, it's more about the destination side, which
is the incoming side. [1]
We hit the error on the migration target when there are many migration
tests running in parallel:
"RDMA ERROR: Error: could not rdma_bind_addr!"

[1]https://manpages.debian.org/testing/librdmacm-dev/rdma_set_option.3.en.html
> >
> > Signed-off-by: Jack Wang 
> > ---
> >  migration/rdma.c | 7 +++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/migration/rdma.c b/migration/rdma.c
> > index 2e223170d06d..b498ef013c77 100644
> > --- a/migration/rdma.c
> > +++ b/migration/rdma.c
> > @@ -2705,6 +2705,7 @@ static int qemu_rdma_dest_init(RDMAContext *rdma, 
> > Error **errp)
> >  char ip[40] = "unknown";
> >  struct rdma_addrinfo *res, *e;
> >  char port_str[16];
> > +int reuse = 1;
> >
> >  for (idx = 0; idx < RDMA_WRID_MAX; idx++) {
> >  rdma->wr_data[idx].control_len = 0;
> > @@ -2740,6 +2741,12 @@ static int qemu_rdma_dest_init(RDMAContext *rdma, 
> > Error **errp)
> >  goto err_dest_init_bind_addr;
> >  }
> >
> > +ret = rdma_set_option(listen_id, RDMA_OPTION_ID, 
> > RDMA_OPTION_ID_REUSEADDR,
> > + &reuse, sizeof reuse);
>
> maybe we can just write '1' directly on the argument list of 
> 'rdma_set_option'.
> Assuming reuseaddr does not effect core rdma transport? change seems ok to me.
I feel it's cleaner to do it with a variable than to force a conversion
of 1 to void *.
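
Just to spell out the alternative being discussed, the variable-free version
would be something like this (a C99 compound literal; purely illustrative):

ret = rdma_set_option(listen_id, RDMA_OPTION_ID, RDMA_OPTION_ID_REUSEADDR,
                      &(int){ 1 }, sizeof(int));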

It's bound to the cm_id, which is newly created a few lines above, so it
does not affect the core RDMA transport.

>
> Thanks,
> Pankaj
Thanks for the review!

Jinpu Wang
>
> > +if (ret) {
> > +ERROR(errp, "Error: could not set REUSEADDR option");
> > +goto err_dest_init_bind_addr;
> > +}
> >  for (e = res; e != NULL; e = e->ai_next) {
> >  inet_ntop(e->ai_family,
> >  &((struct sockaddr_in *) e->ai_dst_addr)->sin_addr, ip, sizeof 
> > ip);
> > --
> > 2.25.1
> >



Re: [PATCH 1/2] migration/rdma: Increase the backlog from 5 to 128

2022-02-01 Thread Jinpu Wang
On Tue, Feb 1, 2022 at 7:19 PM Pankaj Gupta  wrote:
>
> > So it can handle more incoming requests.
> >
> > Signed-off-by: Jack Wang 
> > ---
> >  migration/rdma.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/migration/rdma.c b/migration/rdma.c
> > index c7c7a384875b..2e223170d06d 100644
> > --- a/migration/rdma.c
> > +++ b/migration/rdma.c
> > @@ -4238,7 +4238,7 @@ void rdma_start_incoming_migration(const char 
> > *host_port, Error **errp)
> >
> >  trace_rdma_start_incoming_migration_after_dest_init();
> >
> > -ret = rdma_listen(rdma->listen_id, 5);
> > +ret = rdma_listen(rdma->listen_id, 128);
>
> 128 backlog seems too much to me. Any reason for choosing this number.
> Any rationale to choose this number?
>
128 is the default value of SOMAXCONN; I can use that if it is preferred.
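
i.e. roughly:

ret = rdma_listen(rdma->listen_id, SOMAXCONN);

(with SOMAXCONN coming from <sys/socket.h>, which would need to be included
if it isn't already pulled in.)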

> Thanks,
> Pankaj

Thanks!
>
> >
> >  if (ret) {
> >  ERROR(errp, "listening on socket!");
> > --
> > 2.25.1
> >