RE: [PATCH 3/6] io/channel-rdma: support working in coroutine

2024-06-07 Thread Gonglei (Arei)
Hi Daniel,

> -Original Message-
> From: Daniel P. Berrangé [mailto:berra...@redhat.com]
> Sent: Friday, June 7, 2024 5:04 PM
> To: Gonglei (Arei) 
> Cc: qemu-devel@nongnu.org; pet...@redhat.com; yu.zh...@ionos.com;
> mgal...@akamai.com; elmar.ger...@ionos.com; zhengchuan
> ; arm...@redhat.com; lizhij...@fujitsu.com;
> pbonz...@redhat.com; m...@redhat.com; Xiexiangyou
> ; linux-r...@vger.kernel.org; lixiao (H)
> ; jinpu.w...@ionos.com; Wangjialin
> 
> Subject: Re: [PATCH 3/6] io/channel-rdma: support working in coroutine
> 
> On Tue, Jun 04, 2024 at 08:14:09PM +0800, Gonglei wrote:
> > From: Jialin Wang 
> >
> > It is not feasible to obtain RDMA completion queue notifications
> > through poll/ppoll on the rsocket fd. Therefore, we create a thread
> > named rpoller for each rsocket fd and two eventfds: pollin_eventfd and
> > pollout_eventfd.
> >
> > When io_create_watch or io_set_aio_fd_handler waits for POLLIN
> > or POLLOUT events, it actually polls/ppolls on the pollin_eventfd
> > and pollout_eventfd instead of the rsocket fd.
> >
> > The rpoller calls rpoll() on the rsocket fd to receive POLLIN and POLLOUT
> > events.
> > When a POLLIN event occurs, the rpoller writes to the pollin_eventfd, and
> > then poll/ppoll will return the POLLIN event.
> > When a POLLOUT event occurs, the rpoller reads from the pollout_eventfd, and
> > then poll/ppoll will return the POLLOUT event.
> >
> > For a non-blocking rsocket fd, if rread/rwrite returns EAGAIN, it will
> > read/write the pollin/pollout_eventfd, preventing poll/ppoll from
> > returning POLLIN/POLLOUT events.
> >
> > Known limitations:
> >
> >   For a blocking rsocket fd, if we use io_create_watch to wait for
> >   POLLIN or POLLOUT events, since the rsocket fd is blocking, we
> >   cannot determine when it is not ready to read/write as we can with
> >   non-blocking fds. Therefore, once an event occurs, it will occur
> >   continuously, potentially leaving QEMU hanging. So we need to be
> >   cautious to avoid hanging when using io_create_watch.
> >
> > Luckily, channel-rdma works well in coroutines :)
> >
> > Signed-off-by: Jialin Wang 
> > Signed-off-by: Gonglei 
> > ---
> >  include/io/channel-rdma.h |  15 +-
> >  io/channel-rdma.c | 363 +-
> >  2 files changed, 376 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/io/channel-rdma.h b/include/io/channel-rdma.h
> > index 8cab2459e5..cb56127d76 100644
> > --- a/include/io/channel-rdma.h
> > +++ b/include/io/channel-rdma.h
> > @@ -47,6 +47,18 @@ struct QIOChannelRDMA {
> >  socklen_t localAddrLen;
> >  struct sockaddr_storage remoteAddr;
> >  socklen_t remoteAddrLen;
> > +
> > +/* private */
> > +
> > +/* qemu g_poll/ppoll() POLLIN event on it */
> > +int pollin_eventfd;
> > +/* qemu g_poll/ppoll() POLLOUT event on it */
> > +int pollout_eventfd;
> > +
> > +/* the index in the rpoller's fds array */
> > +int index;
> > +/* rpoller will rpoll() rpoll_events on the rsocket fd */
> > +short int rpoll_events;
> >  };
> >
> >  /**
> > @@ -147,6 +159,7 @@ void qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress *addr,
> >   *
> >   * Returns: the new client channel, or NULL on error
> >   */
> > -QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *ioc, Error **errp);
> > +QIOChannelRDMA *coroutine_mixed_fn qio_channel_rdma_accept(QIOChannelRDMA *ioc,
> > +   Error **errp);
> >
> >  #endif /* QIO_CHANNEL_RDMA_H */
> > diff --git a/io/channel-rdma.c b/io/channel-rdma.c
> > index 92c362df52..9792add5cf 100644
> > --- a/io/channel-rdma.c
> > +++ b/io/channel-rdma.c
> > @@ -23,10 +23,15 @@
> >
> >  #include "qemu/osdep.h"
> >  #include "io/channel-rdma.h"
> > +#include "io/channel-util.h"
> > +#include "io/channel-watch.h"
> >  #include "io/channel.h"
> >  #include "qapi/clone-visitor.h"
> >  #include "qapi/error.h"
> >  #include "qapi/qapi-visit-sockets.h"
> > +#include "qemu/atomic.h"
> > +#include "qemu/error-report.h"
> > +#include "qemu/thread.h"
> >  #include "trace.h"
> >  #include 
> >  #include 
> > @@ -39,11 +44,274 @@
> >  #include 
> >  #include 
> >
> > +typedef enum {
> > +CLEAR_POLLIN,
> > + 

RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-07 Thread Gonglei (Arei)
Hi,

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Thursday, June 6, 2024 5:19 AM
> To: Dr. David Alan Gilbert 
> Cc: Michael Galaxy ; zhengchuan
> ; Gonglei (Arei) ;
> Daniel P. Berrangé ; Markus Armbruster
> ; Yu Zhang ; Zhijian Li (Fujitsu)
> ; Jinpu Wang ; Elmar Gerdes
> ; qemu-devel@nongnu.org; Yuval Shaia
> ; Kevin Wolf ; Prasanna
> Kumar Kalever ; Cornelia Huck
> ; Michael Roth ; Prasanna
> Kumar Kalever ; integrat...@gluster.org; Paolo
> Bonzini ; qemu-bl...@nongnu.org;
> de...@lists.libvirt.org; Hanna Reitz ; Michael S. Tsirkin
> ; Thomas Huth ; Eric Blake
> ; Song Gao ; Marc-André
> Lureau ; Alex Bennée
> ; Wainer dos Santos Moschetta
> ; Beraldo Leal ; Pannengyuan
> ; Xiexiangyou 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Wed, Jun 05, 2024 at 08:48:28PM +, Dr. David Alan Gilbert wrote:
> > > > I just noticed this thread; some random notes from a somewhat
> > > > fragmented memory of this:
> > > >
> > > >   a) Long long ago, I also tried rsocket;
> > > >
> https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
> > > >  as I remember the library was quite flaky at the time.
> > >
> > > Hmm interesting.  There also looks like a thread doing rpoll().
> >
> > Yeh, I can't actually remember much more about what I did back then!
> 
> Heh, that's understandable and fair. :)
> 
> > > I hope Lei and his team has tested >4G mem, otherwise definitely
> > > worth checking.  Lei also mentioned there're rsocket bugs they found
> > > in the cover letter, but not sure what's that about.
> >
> > It would probably be a good idea to keep track of what bugs are in
> > flight with it, and try it on a few RDMA cards to see what problems
> > get triggered.
> > I think I reported a few at the time, but I gave up after feeling it
> > was getting very hacky.
> 
> Agreed.  Maybe we can have a list of that in the cover letter or even QEMU's
> migration/rdma doc page.
> 
> Lei, if you think that makes sense please do so in your upcoming posts.
> There'll need to have a list of things you encountered in the kernel
> driver and it'll be even better if there're further links to read on each
> problem.
> 
OK, no problem. There are two bugs:

Bug 1:

https://github.com/linux-rdma/rdma-core/commit/23985e25aebb559b761872313f8cab4e811c5a3d#diff-5ddbf83c6f021688166096ca96c9bba874dffc3cab88ded2e9d8b2176faa084cR3302-R3303

This commit introduces a bug that causes a QEMU hang.
When the timeout parameter of rpoll() is neither -1 nor 0, the program
occasionally gets stuck.

Problem analysis:
During the first rpoll():
at line 3297, rs_poll_enter() performs pollcnt++, so pollcnt is 1;
at line 3302, the timeout expires and the function returns. Note that
rs_poll_exit(), which would perform --pollcnt, is not called on this path,
so pollcnt remains 1.
During the second rpoll(), rs_poll_enter() at line 3297 performs pollcnt++
again, so pollcnt is now 2.
If the timeout does not expire and poll() returns a value greater than 0,
rs_poll_stop() is executed. Because the decrement leaves pollcnt at 1
instead of 0, rs_poll_stop() takes the branch that executes suspendpoll = 1.
Back in the do/while loop inside rpoll(), rs_poll_enter() is called again;
now the if (suspendpoll) condition is true, so it executes pthread_yield()
and returns -EBUSY. Control returns to rpoll()'s do/while loop, and since
rs_poll_enter() keeps failing, the loop continues and calls it once more.
As a result, the program spins there forever and QEMU hangs.

Root cause: at line 3302, rs_poll_exit() is not executed before the
function returns on timeout expiry.
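
The broken accounting can be reproduced in isolation. The following is a
standalone simulation reconstructed from the analysis above (a sketch whose
names follow rsocket.c; it is not the actual rdma-core source):

    #include <errno.h>
    #include <sched.h>
    #include <stdio.h>

    static int pollcnt;
    static int suspendpoll;

    static int rs_poll_enter(void)
    {
        if (suspendpoll) {
            sched_yield();          /* rsocket.c calls pthread_yield() */
            return -EBUSY;          /* rpoll()'s do/while just retries */
        }
        pollcnt++;
        return 0;
    }

    static void rs_poll_stop(void)
    {
        if (!--pollcnt) {
            /* last poller: wake up the waiters */
        } else {
            suspendpoll = 1;        /* the stale count (1, not 0) lands here */
        }
    }

    int main(void)
    {
        rs_poll_enter();            /* 1st rpoll(): pollcnt = 1; the timeout
                                     * path returns WITHOUT rs_poll_exit() */
        rs_poll_enter();            /* 2nd rpoll(): pollcnt = 2 */
        rs_poll_stop();             /* --pollcnt == 1 -> suspendpoll = 1 */

        /* rpoll()'s do/while loop now spins here forever */
        for (int i = 0; i < 3; i++) {
            printf("rs_poll_enter() = %d\n", rs_poll_enter());
        }
        return 0;
    }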


Bug 2:

In rsocket.c, there is a receive queue, int accept_queue[2], implemented as
a socketpair. The listen_svc thread in rsocket.c is responsible for
receiving connections and writing them to accept_queue[1]. When raccept()
is called, a connection is read from accept_queue[0].
In the test case, qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN); waits for a
readable event (i.e. for a connection). rpoll() checks whether
accept_queue[0] has a readable event; however, the underlying poll does not
include accept_queue[0] in its fd set. Only after the timeout expires does
rpoll() pick up the readable event on accept_queue[0] from rs_poll_arm()
again.

Impact:
The accept operation can complete only after 5000 ms. This time can be
shortened by writing a smaller interval, in milliseconds, to
/etc/rdma/rsocket/wake_up_interval.
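
To make the handoff concrete, here is a minimal standalone model of the
mechanism described above (an illustrative sketch, not the rdma-core
source; the 5000 ms value stands in for the default wake_up_interval). It
polls accept_queue[0] directly, exactly the fd the buggy poll set is
missing, so the pending connection is seen immediately instead of after
the timeout:

    #include <poll.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int accept_queue[2];

    /* stands in for rsocket.c's listen_svc thread */
    static void *listen_svc(void *arg)
    {
        int conn_index = 42;            /* a new connection arrives... */
        (void)arg;
        usleep(100 * 1000);
        write(accept_queue[1], &conn_index, sizeof(conn_index));
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        struct pollfd pfd;
        int conn_index, ret;

        socketpair(AF_UNIX, SOCK_STREAM, 0, accept_queue);
        pthread_create(&t, NULL, listen_svc, NULL);

        /* accept_queue[0] is in the poll set, so this wakes after ~100 ms;
         * in the buggy path it is absent and we'd sleep the full 5000 ms */
        pfd.fd = accept_queue[0];
        pfd.events = POLLIN;
        ret = poll(&pfd, 1, 5000);
        printf("poll() = %d (1 means the connection was seen at once)\n", ret);

        read(accept_queue[0], &conn_index, sizeof(conn_index));
        printf("raccept() would now return connection %d\n", conn_index);
        pthread_join(t, NULL);
        return 0;
    }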


Regards,
-Gonglei

> > > >
> > > >   e) Someone made a good suggestion (sorry can't remember who) - that the
> > > >  RDMA migration structure was the wrong way around - it should be the

RE: [PATCH 0/6] refactor RDMA live migration based on rsocket API

2024-06-07 Thread Gonglei (Arei)


> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Wednesday, June 5, 2024 10:19 PM
> To: Gonglei (Arei) 
> Cc: qemu-devel@nongnu.org; yu.zh...@ionos.com; mgal...@akamai.com;
> elmar.ger...@ionos.com; zhengchuan ;
> berra...@redhat.com; arm...@redhat.com; lizhij...@fujitsu.com;
> pbonz...@redhat.com; m...@redhat.com; Xiexiangyou
> ; linux-r...@vger.kernel.org; lixiao (H)
> ; jinpu.w...@ionos.com; Wangjialin
> ; Fabiano Rosas 
> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> 
> On Wed, Jun 05, 2024 at 10:09:43AM +, Gonglei (Arei) wrote:
> > Hi Peter,
> >
> > > -Original Message-
> > > From: Peter Xu [mailto:pet...@redhat.com]
> > > Sent: Wednesday, June 5, 2024 3:32 AM
> > > To: Gonglei (Arei) 
> > > Cc: qemu-devel@nongnu.org; yu.zh...@ionos.com;
> mgal...@akamai.com;
> > > elmar.ger...@ionos.com; zhengchuan ;
> > > berra...@redhat.com; arm...@redhat.com; lizhij...@fujitsu.com;
> > > pbonz...@redhat.com; m...@redhat.com; Xiexiangyou
> > > ; linux-r...@vger.kernel.org; lixiao (H)
> > > ; jinpu.w...@ionos.com; Wangjialin
> > > ; Fabiano Rosas 
> > > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on
> > > rsocket API
> > >
> > > Hi, Lei, Jialin,
> > >
> > > Thanks a lot for working on this!
> > >
> > > I think we'll need to wait a bit on feedbacks from Jinpu and his
> > > team on RDMA side, also Daniel for iochannels.  Also, please
> > > remember to copy Fabiano Rosas in any relevant future posts.  We'd
> > > also like to know whether he has any comments too.  I have him copied in
> this reply.
> > >
> > > On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > > > From: Jialin Wang 
> > > >
> > > > Hi,
> > > >
> > > > This patch series attempts to refactor RDMA live migration by
> > > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > > >
> > > > The /usr/include/rdma/rsocket.h provides a higher level rsocket
> > > > API that is a 1-1 match of the normal kernel 'sockets' API, which
> > > > hides the detail of rdma protocol into rsocket and allows us to
> > > > add support for some modern features like multifd more easily.
> > > >
> > > > Here is the previous discussion on refactoring RDMA live migration
> > > > using the rsocket API:
> > > >
> > > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
> > > >
> > > > We have encountered some bugs when using rsocket and plan to
> > > > submit them to the rdma-core community.
> > > >
> > > > In addition, the use of rsocket makes our programming more
> > > > convenient, but it must be noted that this approach introduces
> > > > extra memory copies, so some performance degradation can be
> > > > expected. We hope that those with RDMA network cards can help
> > > > verify this, thank you!
> > >
> > > It'll be good to elaborate if you tested it in-house. What people
> > > should expect on the numbers exactly?  Is that okay from Huawei's POV?
> > >
> > > Besides that, the code looks pretty good at a first glance to me.
> > > Before others chim in, here're some high level comments..
> > >
> > > Firstly, can we avoid using coroutine when listen()?  Might be
> > > relevant when I see that rdma_accept_incoming_migration() runs in a
> > > loop to do raccept(), but would that also hang the qemu main loop
> > > even with the coroutine, before all channels are ready?  I'm not a
> > > coroutine person, but I think the hope is that we can make dest QEMU
> > > run in a thread in the future just like the src QEMU, so the less
> > > coroutine the better in this path.
> > >
> >
> > Because rsocket is set to non-blocking, raccept will return EAGAIN
> > when no connection is received, coroutine will yield, and will not hang
> > the qemu main loop.
> 
> Ah that's ok.  And also I just noticed it may not be a big deal either
> as long as we're before migration_incoming_process().
> 
> I'm wondering whether it can do it similarly like what we do with sockets
> in qio_net_listener_set_client_func_full().  After all, rsocket wants to
> mimic the socket API.  It'll make sense if rsocket code tries to match
> with socket,

RE: [PATCH 3/6] io/channel-rdma: support working in coroutine

2024-06-07 Thread Gonglei (Arei)


> -Original Message-
> From: Haris Iqbal [mailto:haris.iq...@ionos.com]
> Sent: Thursday, June 6, 2024 9:35 PM
> To: Gonglei (Arei) 
> Cc: qemu-devel@nongnu.org; pet...@redhat.com; yu.zh...@ionos.com;
> mgal...@akamai.com; elmar.ger...@ionos.com; zhengchuan
> ; berra...@redhat.com; arm...@redhat.com;
> lizhij...@fujitsu.com; pbonz...@redhat.com; m...@redhat.com; Xiexiangyou
> ; linux-r...@vger.kernel.org; lixiao (H)
> ; jinpu.w...@ionos.com; Wangjialin
> 
> Subject: Re: [PATCH 3/6] io/channel-rdma: support working in coroutine
> 
> On Tue, Jun 4, 2024 at 2:14 PM Gonglei  wrote:
> >
> > From: Jialin Wang 
> >
> > It is not feasible to obtain RDMA completion queue notifications
> > through poll/ppoll on the rsocket fd. Therefore, we create a thread
> > named rpoller for each rsocket fd and two eventfds: pollin_eventfd and
> > pollout_eventfd.
> >
> > When io_create_watch or io_set_aio_fd_handler waits for POLLIN
> > or POLLOUT events, it actually polls/ppolls on the pollin_eventfd
> > and pollout_eventfd instead of the rsocket fd.
> >
> > The rpoller calls rpoll() on the rsocket fd to receive POLLIN and POLLOUT
> > events.
> > When a POLLIN event occurs, the rpoller writes to the pollin_eventfd, and
> > then poll/ppoll will return the POLLIN event.
> > When a POLLOUT event occurs, the rpoller reads from the pollout_eventfd, and
> > then poll/ppoll will return the POLLOUT event.
> >
> > For a non-blocking rsocket fd, if rread/rwrite returns EAGAIN, it will
> > read/write the pollin/pollout_eventfd, preventing poll/ppoll from
> > returning POLLIN/POLLOUT events.
> >
> > Known limitations:
> >
> >   For a blocking rsocket fd, if we use io_create_watch to wait for
> >   POLLIN or POLLOUT events, since the rsocket fd is blocking, we
> >   cannot determine when it is not ready to read/write as we can with
> >   non-blocking fds. Therefore, once an event occurs, it will occur
> >   continuously, potentially leaving QEMU hanging. So we need to be
> >   cautious to avoid hanging when using io_create_watch.
> >
> > Luckily, channel-rdma works well in coroutines :)
> >
> > Signed-off-by: Jialin Wang 
> > Signed-off-by: Gonglei 
> > ---
> >  include/io/channel-rdma.h |  15 +-
> >  io/channel-rdma.c | 363 +-
> >  2 files changed, 376 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/io/channel-rdma.h b/include/io/channel-rdma.h
> > index 8cab2459e5..cb56127d76 100644
> > --- a/include/io/channel-rdma.h
> > +++ b/include/io/channel-rdma.h
> > @@ -47,6 +47,18 @@ struct QIOChannelRDMA {
> >  socklen_t localAddrLen;
> >  struct sockaddr_storage remoteAddr;
> >  socklen_t remoteAddrLen;
> > +
> > +/* private */
> > +
> > +/* qemu g_poll/ppoll() POLLIN event on it */
> > +int pollin_eventfd;
> > +/* qemu g_poll/ppoll() POLLOUT event on it */
> > +int pollout_eventfd;
> > +
> > +/* the index in the rpoller's fds array */
> > +int index;
> > +/* rpoller will rpoll() rpoll_events on the rsocket fd */
> > +short int rpoll_events;
> >  };
> >
> >  /**
> > @@ -147,6 +159,7 @@ void qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress *addr,
> >   *
> >   * Returns: the new client channel, or NULL on error
> >   */
> > -QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *ioc, Error **errp);
> > +QIOChannelRDMA *coroutine_mixed_fn qio_channel_rdma_accept(QIOChannelRDMA *ioc,
> > +   Error **errp);
> >
> >  #endif /* QIO_CHANNEL_RDMA_H */
> > diff --git a/io/channel-rdma.c b/io/channel-rdma.c
> > index 92c362df52..9792add5cf 100644
> > --- a/io/channel-rdma.c
> > +++ b/io/channel-rdma.c
> > @@ -23,10 +23,15 @@
> >
> >  #include "qemu/osdep.h"
> >  #include "io/channel-rdma.h"
> > +#include "io/channel-util.h"
> > +#include "io/channel-watch.h"
> >  #include "io/channel.h"
> >  #include "qapi/clone-visitor.h"
> >  #include "qapi/error.h"
> >  #include "qapi/qapi-visit-sockets.h"
> > +#include "qemu/atomic.h"
> > +#include "qemu/error-report.h"
> > +#include "qemu/thread.h"
> >  #include "trace.h"
> >  #include 
> >  #include 
> > @@ -39,11 +44,274 @@
> >  #include 
> >  #include 
> >
> > +typedef enum {
> > +CLEAR_POLLIN,

RE: [PATCH 0/6] refactor RDMA live migration based on rsocket API

2024-06-07 Thread Gonglei (Arei)


> -Original Message-
> From: Jinpu Wang [mailto:jinpu.w...@ionos.com]
> Sent: Friday, June 7, 2024 1:54 PM
> To: Gonglei (Arei) 
> Cc: qemu-devel@nongnu.org; pet...@redhat.com; yu.zh...@ionos.com;
> mgal...@akamai.com; elmar.ger...@ionos.com; zhengchuan
> ; berra...@redhat.com; arm...@redhat.com;
> lizhij...@fujitsu.com; pbonz...@redhat.com; m...@redhat.com; Xiexiangyou
> ; linux-r...@vger.kernel.org; lixiao (H)
> ; Wangjialin 
> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> 
> Hi Gonglei, hi folks on the list,
> 
> On Tue, Jun 4, 2024 at 2:14 PM Gonglei  wrote:
> >
> > From: Jialin Wang 
> >
> > Hi,
> >
> > This patch series attempts to refactor RDMA live migration by
> > introducing a new QIOChannelRDMA class based on the rsocket API.
> >
> > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > the detail of rdma protocol into rsocket and allows us to add support
> > for some modern features like multifd more easily.
> >
> > Here is the previous discussion on refactoring RDMA live migration
> > using the rsocket API:
> >
> > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > o.org/
> >
> > We have encountered some bugs when using rsocket and plan to submit
> > them to the rdma-core community.
> >
> > In addition, the use of rsocket makes our programming more convenient,
> > but it must be noted that this approach introduces extra memory copies,
> > so some performance degradation can be expected. We hope that those
> > with RDMA network cards can help verify this, thank you!
> First thx for the effort, we are running migration tests on our IB fabric,
> different generation of HCA from mellanox, the migration works ok, there
> are a few failures,  Yu will share the result later separately.
> 

Thank you so much. 

> The one blocker for the change is that the old implementation and the new
> rsocket implementation don't talk to each other, due to the different wire
> protocols used during connection establishment.
> E.g. the old RDMA migration has special control messages during the
> migration flow, while rsocket uses different control messages, so there is
> no way to migrate a VM over the rdma transport from a pre-rsocket QEMU to
> a new version with the rsocket implementation.
> 
> Probably we should keep both implementation for a while, mark the old
> implementation as deprecated, and promote the new implementation, and
> high light in doc, they are not compatible.
> 

IMO it makes sense. What's your opinion, @Peter?


Regards,
-Gonglei

> Regards!
> Jinpu
> 
> 
> 
> >
> > Jialin Wang (6):
> >   migration: remove RDMA live migration temporarily
> >   io: add QIOChannelRDMA class
> >   io/channel-rdma: support working in coroutine
> >   tests/unit: add test-io-channel-rdma.c
> >   migration: introduce new RDMA live migration
> >   migration/rdma: support multifd for RDMA migration
> >
> >  docs/rdma.txt |  420 ---
> >  include/io/channel-rdma.h |  165 ++
> >  io/channel-rdma.c |  798 ++
> >  io/meson.build|1 +
> >  io/trace-events   |   14 +
> >  meson.build   |6 -
> >  migration/meson.build |3 +-
> >  migration/migration-stats.c   |5 +-
> >  migration/migration-stats.h   |4 -
> >  migration/migration.c |   13 +-
> >  migration/migration.h |9 -
> >  migration/multifd.c   |   10 +
> >  migration/options.c   |   16 -
> >  migration/options.h   |2 -
> >  migration/qemu-file.c |1 -
> >  migration/ram.c   |   90 +-
> >  migration/rdma.c  | 4205 +
> >  migration/rdma.h  |   67 +-
> >  migration/savevm.c|2 +-
> >  migration/trace-events|   68 +-
> >  qapi/migration.json   |   13 +-
> >  scripts/analyze-migration.py  |3 -
> >  tests/unit/meson.build|1 +
> >  tests/unit/test-io-channel-rdma.c |  276 ++
> >  24 files changed, 1360 insertions(+), 4832 deletions(-)
> >  delete mode 100644 docs/rdma.txt
> >  create mode 100644 include/io/channel-rdma.h
> >  create mode 100644 io/channel-rdma.c
> >  create mode 100644 tests/unit/test-io-channel-rdma.c
> >
> > --
> > 2.43.0
> >



RE: [PATCH 0/6] refactor RDMA live migration based on rsocket API

2024-06-05 Thread Gonglei (Arei)
Hi Peter,

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Wednesday, June 5, 2024 3:32 AM
> To: Gonglei (Arei) 
> Cc: qemu-devel@nongnu.org; yu.zh...@ionos.com; mgal...@akamai.com;
> elmar.ger...@ionos.com; zhengchuan ;
> berra...@redhat.com; arm...@redhat.com; lizhij...@fujitsu.com;
> pbonz...@redhat.com; m...@redhat.com; Xiexiangyou
> ; linux-r...@vger.kernel.org; lixiao (H)
> ; jinpu.w...@ionos.com; Wangjialin
> ; Fabiano Rosas 
> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> 
> Hi, Lei, Jialin,
> 
> Thanks a lot for working on this!
> 
> I think we'll need to wait a bit on feedbacks from Jinpu and his team on RDMA
> side, also Daniel for iochannels.  Also, please remember to copy Fabiano
> Rosas in any relevant future posts.  We'd also like to know whether he has any
> comments too.  I have him copied in this reply.
> 
> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > From: Jialin Wang 
> >
> > Hi,
> >
> > This patch series attempts to refactor RDMA live migration by
> > introducing a new QIOChannelRDMA class based on the rsocket API.
> >
> > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > the detail of rdma protocol into rsocket and allows us to add support
> > for some modern features like multifd more easily.
> >
> > Here is the previous discussion on refactoring RDMA live migration
> > using the rsocket API:
> >
> > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > o.org/
> >
> > We have encountered some bugs when using rsocket and plan to submit
> > them to the rdma-core community.
> >
> > In addition, the use of rsocket makes our programming more convenient,
> > but it must be noted that this approach introduces extra memory copies,
> > so some performance degradation can be expected. We hope that those
> > with RDMA network cards can help verify this, thank you!
> 
> It'll be good to elaborate if you tested it in-house. What people should
> expect on the numbers exactly?  Is that okay from Huawei's POV?
> 
> Besides that, the code looks pretty good at a first glance to me.  Before
> others chim in, here're some high level comments..
> 
> Firstly, can we avoid using coroutine when listen()?  Might be relevant when I
> see that rdma_accept_incoming_migration() runs in a loop to do raccept(), but
> would that also hang the qemu main loop even with the coroutine, before all
> channels are ready?  I'm not a coroutine person, but I think the hope is that
> we can make dest QEMU run in a thread in the future just like the src QEMU, so
> the less coroutine the better in this path.
> 

Because rsocket is set to non-blocking, raccept will return EAGAIN when no
connection is received; the coroutine will yield, and will not hang the
qemu main loop.
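
(In sketch form, assuming QEMU's coroutine I/O helpers; the real loop lives
inside qio_channel_rdma_accept(), and "lfd" here is the listening rsocket
fd wrapped by "ioc":)

    int fd;
    for (;;) {
        fd = raccept(lfd, NULL, NULL);
        if (fd >= 0) {
            break;                  /* got a connection */
        }
        if (errno != EAGAIN) {
            return NULL;            /* real error */
        }
        /* no connection yet: yield back to the main loop, resume when
         * the rpoller reports the listening fd readable */
        qio_channel_yield(QIO_CHANNEL(ioc), G_IO_IN);
    }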

> I think I also left a comment elsewhere on whether it would be possible
> to allow iochannels implement their own poll() functions to avoid the
> per-channel poll thread that is proposed in this series.
> 
> https://lore.kernel.org/r/ZldY21xVExtiMddB@x1n
> 

We noticed that, and it's a big undertaking. I'm not sure that's a better way.

> Personally I think even with the thread proposal it's better than the old
> rdma code, but I just still want to double check with you guys.  E.g.,
> maybe that just won't work at all?  Again, that'll also be based on the
> fact that we move migration incoming into a thread first to keep the dest
> QEMU main loop intact, I think, but I hope we will reach that irrelevant
> of rdma, IOW it'll be nice to happen even earlier if possible.
> 
Yep. This is a fairly big change, I wonder what other people's suggestions are?

> Another nitpick is that qio_channel_rdma_listen_async() doesn't look used
> and may be prone to removal.
> 

Yes. This is because when we wrote the test case, we wanted to test
qio_channel_rdma_connect_async, and I also added
qio_channel_rdma_listen_async. It is not used in the RDMA live migration
code.

Regards,
-Gonglei



RE: [PATCH 1/6] migration: remove RDMA live migration temporarily

2024-06-05 Thread Gonglei (Arei)


> -Original Message-
> From: David Hildenbrand [mailto:da...@redhat.com]
> Sent: Tuesday, June 4, 2024 10:02 PM
> To: Gonglei (Arei) ; qemu-devel@nongnu.org
> Cc: pet...@redhat.com; yu.zh...@ionos.com; mgal...@akamai.com;
> elmar.ger...@ionos.com; zhengchuan ;
> berra...@redhat.com; arm...@redhat.com; lizhij...@fujitsu.com;
> pbonz...@redhat.com; m...@redhat.com; Xiexiangyou
> ; linux-r...@vger.kernel.org; lixiao (H)
> ; jinpu.w...@ionos.com; Wangjialin
> 
> Subject: Re: [PATCH 1/6] migration: remove RDMA live migration temporarily
> 
> On 04.06.24 14:14, Gonglei via wrote:
> > From: Jialin Wang 
> >
> > The new RDMA live migration will be introduced in the upcoming few
> > commits.
> >
> > Signed-off-by: Jialin Wang 
> > Signed-off-by: Gonglei 
> > ---
> 
> [...]
> 
> > -
> > -/* Avoid ram_block_discard_disable(), cannot change during migration.
> */
> > -if (ram_block_discard_is_required()) {
> > -error_setg(errp, "RDMA: cannot disable RAM discard");
> > -return;
> > -}
> 
> I'm particularly interested in the interaction with virtio-balloon/virtio-mem.
> 
> Do we still have to disable discarding of RAM, and where would you do that in
> the rewrite?
> 

Yes, we do. We didn't change the logic. Thanks for catching that.

Regards,
-Gonglei

> --
> Cheers,
> 
> David / dhildenb



RE: [PATCH 0/6] refactor RDMA live migration based on rsocket API

2024-06-05 Thread Gonglei (Arei)



> -Original Message-
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Wednesday, June 5, 2024 3:57 PM
> To: Gonglei (Arei) 
> Cc: qemu-devel@nongnu.org; pet...@redhat.com; yu.zh...@ionos.com;
> mgal...@akamai.com; elmar.ger...@ionos.com; zhengchuan
> ; berra...@redhat.com; arm...@redhat.com;
> lizhij...@fujitsu.com; pbonz...@redhat.com; Xiexiangyou
> ; linux-r...@vger.kernel.org; lixiao (H)
> ; jinpu.w...@ionos.com; Wangjialin
> 
> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> 
> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > From: Jialin Wang 
> >
> > Hi,
> >
> > This patch series attempts to refactor RDMA live migration by
> > introducing a new QIOChannelRDMA class based on the rsocket API.
> >
> > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > the detail of rdma protocol into rsocket and allows us to add support
> > for some modern features like multifd more easily.
> >
> > Here is the previous discussion on refactoring RDMA live migration
> > using the rsocket API:
> >
> > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > o.org/
> >
> > We have encountered some bugs when using rsocket and plan to submit
> > them to the rdma-core community.
> >
> > In addition, the use of rsocket makes our programming more convenient,
> > but it must be noted that this approach introduces extra memory copies,
> > so some performance degradation can be expected. We hope that those
> > with RDMA network cards can help verify this, thank you!
> 
> So you didn't test it with an RDMA card?

Yep, we tested it with Soft-RoCE.

> You really should test with an RDMA card though, for correctness as much as
> performance.
> 
We will; we just don't have an RDMA card environment on hand at the moment.

Regards,
-Gonglei

> 
> > Jialin Wang (6):
> >   migration: remove RDMA live migration temporarily
> >   io: add QIOChannelRDMA class
> >   io/channel-rdma: support working in coroutine
> >   tests/unit: add test-io-channel-rdma.c
> >   migration: introduce new RDMA live migration
> >   migration/rdma: support multifd for RDMA migration
> >
> >  docs/rdma.txt |  420 ---
> >  include/io/channel-rdma.h |  165 ++
> >  io/channel-rdma.c |  798 ++
> >  io/meson.build|1 +
> >  io/trace-events   |   14 +
> >  meson.build   |6 -
> >  migration/meson.build |3 +-
> >  migration/migration-stats.c   |5 +-
> >  migration/migration-stats.h   |4 -
> >  migration/migration.c |   13 +-
> >  migration/migration.h |9 -
> >  migration/multifd.c   |   10 +
> >  migration/options.c   |   16 -
> >  migration/options.h   |2 -
> >  migration/qemu-file.c |1 -
> >  migration/ram.c   |   90 +-
> >  migration/rdma.c  | 4205 +
> >  migration/rdma.h  |   67 +-
> >  migration/savevm.c|2 +-
> >  migration/trace-events|   68 +-
> >  qapi/migration.json   |   13 +-
> >  scripts/analyze-migration.py  |3 -
> >  tests/unit/meson.build|1 +
> >  tests/unit/test-io-channel-rdma.c |  276 ++
> >  24 files changed, 1360 insertions(+), 4832 deletions(-)
> >  delete mode 100644 docs/rdma.txt
> >  create mode 100644 include/io/channel-rdma.h
> >  create mode 100644 io/channel-rdma.c
> >  create mode 100644 tests/unit/test-io-channel-rdma.c
> >
> > --
> > 2.43.0




[PATCH 0/6] refactor RDMA live migration based on rsocket API

2024-06-04 Thread Gonglei via
From: Jialin Wang 

Hi,

This patch series attempts to refactor RDMA live migration by
introducing a new QIOChannelRDMA class based on the rsocket API.

The /usr/include/rdma/rsocket.h provides a higher level rsocket API
that is a 1-1 match of the normal kernel 'sockets' API, which hides the
detail of rdma protocol into rsocket and allows us to add support for
some modern features like multifd more easily.
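
To illustrate the 1-1 mapping, a minimal sketch (using only calls declared
in <rdma/rsocket.h>; link with -lrdmacm):

    #include <rdma/rsocket.h>
    #include <sys/socket.h>

    /* socket()/bind()/listen()/close() become rsocket()/rbind()/
     * rlisten()/rclose(); the returned fd then works with raccept(),
     * rrecv(), rsend(), rpoll(), ... */
    static int rdma_listen_fd(const struct sockaddr *addr, socklen_t len)
    {
        int fd = rsocket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            return -1;
        }
        if (rbind(fd, addr, len) < 0 || rlisten(fd, 1) < 0) {
            rclose(fd);
            return -1;
        }
        return fd;
    }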

Here is the previous discussion on refactoring RDMA live migration using
the rsocket API:

https://lore.kernel.org/qemu-devel/20240328130255.52257-1-phi...@linaro.org/

We have encountered some bugs when using rsocket and plan to submit them to
the rdma-core community.

In addition, the use of rsocket makes our programming more convenient,
but it must be noted that this approach introduces extra memory copies,
so some performance degradation can be expected. We hope that those with
RDMA network cards can help verify this, thank you!

Jialin Wang (6):
  migration: remove RDMA live migration temporarily
  io: add QIOChannelRDMA class
  io/channel-rdma: support working in coroutine
  tests/unit: add test-io-channel-rdma.c
  migration: introduce new RDMA live migration
  migration/rdma: support multifd for RDMA migration

 docs/rdma.txt |  420 ---
 include/io/channel-rdma.h |  165 ++
 io/channel-rdma.c |  798 ++
 io/meson.build|1 +
 io/trace-events   |   14 +
 meson.build   |6 -
 migration/meson.build |3 +-
 migration/migration-stats.c   |5 +-
 migration/migration-stats.h   |4 -
 migration/migration.c |   13 +-
 migration/migration.h |9 -
 migration/multifd.c   |   10 +
 migration/options.c   |   16 -
 migration/options.h   |2 -
 migration/qemu-file.c |1 -
 migration/ram.c   |   90 +-
 migration/rdma.c  | 4205 +
 migration/rdma.h  |   67 +-
 migration/savevm.c|2 +-
 migration/trace-events|   68 +-
 qapi/migration.json   |   13 +-
 scripts/analyze-migration.py  |3 -
 tests/unit/meson.build|1 +
 tests/unit/test-io-channel-rdma.c |  276 ++
 24 files changed, 1360 insertions(+), 4832 deletions(-)
 delete mode 100644 docs/rdma.txt
 create mode 100644 include/io/channel-rdma.h
 create mode 100644 io/channel-rdma.c
 create mode 100644 tests/unit/test-io-channel-rdma.c

-- 
2.43.0




[PATCH 6/6] migration/rdma: support multifd for RDMA migration

2024-06-04 Thread Gonglei via
From: Jialin Wang 

Signed-off-by: Jialin Wang 
Signed-off-by: Gonglei 
---
 migration/multifd.c | 10 ++
 migration/rdma.c| 27 +++
 migration/rdma.h|  6 ++
 3 files changed, 43 insertions(+)

diff --git a/migration/multifd.c b/migration/multifd.c
index f317bff077..cee9858ad1 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -32,6 +32,7 @@
 #include "io/channel-file.h"
 #include "io/channel-socket.h"
 #include "yank_functions.h"
+#include "rdma.h"
 
 /* Multiple fd's */
 
@@ -793,6 +794,9 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
 static void multifd_send_cleanup_state(void)
 {
 file_cleanup_outgoing_migration();
+#ifdef CONFIG_RDMA
+rdma_cleanup_outgoing_migration();
+#endif
 socket_cleanup_outgoing_migration();
 qemu_sem_destroy(&multifd_send_state->channels_created);
 qemu_sem_destroy(&multifd_send_state->channels_ready);
@@ -1139,6 +1143,12 @@ static bool multifd_new_send_channel_create(gpointer opaque, Error **errp)
 return file_send_channel_create(opaque, errp);
 }
 
+#ifdef CONFIG_RDMA
+if (rdma_send_channel_create(multifd_new_send_channel_async, opaque)) {
+return true;
+}
+#endif
+
 socket_send_channel_create(multifd_new_send_channel_async, opaque);
 return true;
 }
diff --git a/migration/rdma.c b/migration/rdma.c
index 09a4de7f59..af4d2b5a5a 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -19,6 +19,7 @@
 #include "qapi/qapi-visit-sockets.h"
 #include "channel.h"
 #include "migration.h"
+#include "options.h"
 #include "rdma.h"
 #include "trace.h"
 #include 
@@ -27,6 +28,28 @@ static struct RDMAOutgoingArgs {
 InetSocketAddress *addr;
 } outgoing_args;
 
+bool rdma_send_channel_create(QIOTaskFunc f, void *data)
+{
+QIOChannelRDMA *rioc;
+
+if (!outgoing_args.addr) {
+return false;
+}
+
+rioc = qio_channel_rdma_new();
+qio_channel_rdma_connect_async(rioc, outgoing_args.addr, f, data, NULL,
+   NULL);
+return true;
+}
+
+void rdma_cleanup_outgoing_migration(void)
+{
+if (outgoing_args.addr) {
+qapi_free_InetSocketAddress(outgoing_args.addr);
+outgoing_args.addr = NULL;
+}
+}
+
 static void rdma_outgoing_migration(QIOTask *task, gpointer opaque)
 {
 MigrationState *s = opaque;
@@ -74,6 +97,10 @@ void rdma_start_incoming_migration(InetSocketAddress *addr, Error **errp)
 
 qio_channel_set_name(QIO_CHANNEL(rioc), "migration-rdma-listener");
 
+if (migrate_multifd()) {
+num = migrate_multifd_channels();
+}
+
 if (qio_channel_rdma_listen_sync(rioc, addr, num, errp) < 0) {
 object_unref(OBJECT(rioc));
 return;
diff --git a/migration/rdma.h b/migration/rdma.h
index 4c3eb9a972..cefccac61c 100644
--- a/migration/rdma.h
+++ b/migration/rdma.h
@@ -16,6 +16,12 @@
 
 #include "qemu/sockets.h"
 
+#include 
+
+bool rdma_send_channel_create(QIOTaskFunc f, void *data);
+
+void rdma_cleanup_outgoing_migration(void);
+
 void rdma_start_outgoing_migration(MigrationState *s, InetSocketAddress *addr,
Error **errp);
 
-- 
2.43.0




[PATCH 3/6] io/channel-rdma: support working in coroutine

2024-06-04 Thread Gonglei via
From: Jialin Wang 

It is not feasible to obtain RDMA completion queue notifications
through poll/ppoll on the rsocket fd. Therefore, we create a thread
named rpoller for each rsocket fd and two eventfds: pollin_eventfd
and pollout_eventfd.

When io_create_watch or io_set_aio_fd_handler waits for POLLIN
or POLLOUT events, it actually polls/ppolls on the pollin_eventfd
and pollout_eventfd instead of the rsocket fd.

The rpoller calls rpoll() on the rsocket fd to receive POLLIN and
POLLOUT events.
When a POLLIN event occurs, the rpoller writes to the pollin_eventfd,
and then poll/ppoll will return the POLLIN event.
When a POLLOUT event occurs, the rpoller reads from the pollout_eventfd,
and then poll/ppoll will return the POLLOUT event.

For a non-blocking rsocket fd, if rread/rwrite returns EAGAIN, it will
read/write the pollin/pollout_eventfd, preventing poll/ppoll from
returning POLLIN/POLLOUT events.

Known limitations:

  For a blocking rsocket fd, if we use io_create_watch to wait for
  POLLIN or POLLOUT events, since the rsocket fd is blocking, we
  cannot determine when it is not ready to read/write as we can with
  non-blocking fds. Therefore, once an event occurs, it will occur
  continuously, potentially leaving QEMU hanging. So we need to be
  cautious to avoid hanging when using io_create_watch.

Luckily, channel-rdma works well in coroutines :)

Signed-off-by: Jialin Wang 
Signed-off-by: Gonglei 
---
 include/io/channel-rdma.h |  15 +-
 io/channel-rdma.c | 363 +-
 2 files changed, 376 insertions(+), 2 deletions(-)

diff --git a/include/io/channel-rdma.h b/include/io/channel-rdma.h
index 8cab2459e5..cb56127d76 100644
--- a/include/io/channel-rdma.h
+++ b/include/io/channel-rdma.h
@@ -47,6 +47,18 @@ struct QIOChannelRDMA {
 socklen_t localAddrLen;
 struct sockaddr_storage remoteAddr;
 socklen_t remoteAddrLen;
+
+/* private */
+
+/* qemu g_poll/ppoll() POLLIN event on it */
+int pollin_eventfd;
+/* qemu g_poll/ppoll() POLLOUT event on it */
+int pollout_eventfd;
+
+/* the index in the rpoller's fds array */
+int index;
+/* rpoller will rpoll() rpoll_events on the rsocket fd */
+short int rpoll_events;
 };
 
 /**
@@ -147,6 +159,7 @@ void qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress *addr,
  *
  * Returns: the new client channel, or NULL on error
  */
-QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *ioc, Error **errp);
+QIOChannelRDMA *coroutine_mixed_fn qio_channel_rdma_accept(QIOChannelRDMA *ioc,
+   Error **errp);
 
 #endif /* QIO_CHANNEL_RDMA_H */
diff --git a/io/channel-rdma.c b/io/channel-rdma.c
index 92c362df52..9792add5cf 100644
--- a/io/channel-rdma.c
+++ b/io/channel-rdma.c
@@ -23,10 +23,15 @@
 
 #include "qemu/osdep.h"
 #include "io/channel-rdma.h"
+#include "io/channel-util.h"
+#include "io/channel-watch.h"
 #include "io/channel.h"
 #include "qapi/clone-visitor.h"
 #include "qapi/error.h"
 #include "qapi/qapi-visit-sockets.h"
+#include "qemu/atomic.h"
+#include "qemu/error-report.h"
+#include "qemu/thread.h"
 #include "trace.h"
 #include 
 #include 
@@ -39,11 +44,274 @@
 #include 
 #include 
 
+typedef enum {
+CLEAR_POLLIN,
+CLEAR_POLLOUT,
+SET_POLLIN,
+SET_POLLOUT,
+} UpdateEvent;
+
+typedef enum {
+RP_CMD_ADD_IOC,
+RP_CMD_DEL_IOC,
+RP_CMD_UPDATE,
+} RpollerCMD;
+
+typedef struct {
+RpollerCMD cmd;
+QIOChannelRDMA *rioc;
+} RpollerMsg;
+
+/*
+ * rpoll() on the rsocket fd with rpoll_events, when POLLIN/POLLOUT event
+ * occurs, it will write/read the pollin_eventfd/pollout_eventfd to allow
+ * qemu g_poll/ppoll() get the POLLIN/POLLOUT event
+ */
+static struct Rpoller {
+QemuThread thread;
+bool is_running;
+int sock[2];
+int count; /* the number of rsocket fds being rpoll() */
+int size; /* the size of fds/riocs */
+struct pollfd *fds;
+QIOChannelRDMA **riocs;
+} rpoller;
+
+static void qio_channel_rdma_notify_rpoller(QIOChannelRDMA *rioc,
+RpollerCMD cmd)
+{
+RpollerMsg msg;
+int ret;
+
+msg.cmd = cmd;
+msg.rioc = rioc;
+
+ret = RETRY_ON_EINTR(write(rpoller.sock[0], &msg, sizeof msg));
+if (ret != sizeof msg) {
+error_report("%s: failed to send msg, errno: %d", __func__, errno);
+}
+}
+
+static void qio_channel_rdma_update_poll_event(QIOChannelRDMA *rioc,
+   UpdateEvent action,
+   bool notify_rpoller)
+{
+/* An eventfd with the value of ULLONG_MAX - 1 is readable but unwritable */
+unsigned long long buf = ULLONG_MAX - 1;
+
+switch (action) {
+/* only rpoller do SET_* action, to allow qemu ppoll() get the event */
+case SET_POLLIN:
+  
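
(The eventfd trick relied on in qio_channel_rdma_update_poll_event() above
can be demonstrated standalone. Below is a minimal sketch assuming Linux
eventfd(2) semantics, independent of the patch itself:)

    #include <limits.h>
    #include <poll.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void)
    {
        int efd = eventfd(0, EFD_NONBLOCK);
        uint64_t buf = ULLONG_MAX - 1;
        struct pollfd pfd = { .fd = efd, .events = POLLIN | POLLOUT };

        poll(&pfd, 1, 0);   /* counter == 0: POLLOUT only */
        printf("empty:   POLLIN=%d POLLOUT=%d\n",
               !!(pfd.revents & POLLIN), !!(pfd.revents & POLLOUT));

        /* counter == ULLONG_MAX - 1: readable but NOT writable */
        write(efd, &buf, sizeof(buf));
        poll(&pfd, 1, 0);
        printf("full:    POLLIN=%d POLLOUT=%d\n",
               !!(pfd.revents & POLLIN), !!(pfd.revents & POLLOUT));

        /* read() drains the counter back to 0: POLLOUT only again */
        read(efd, &buf, sizeof(buf));
        poll(&pfd, 1, 0);
        printf("drained: POLLIN=%d POLLOUT=%d\n",
               !!(pfd.revents & POLLIN), !!(pfd.revents & POLLOUT));

        close(efd);
        return 0;
    }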

[PATCH 2/6] io: add QIOChannelRDMA class

2024-06-04 Thread Gonglei via
From: Jialin Wang 

Implement a QIOChannelRDMA subclass that is based on the rsocket
API (similar to socket API).

Signed-off-by: Jialin Wang 
Signed-off-by: Gonglei 
---
 include/io/channel-rdma.h | 152 +
 io/channel-rdma.c | 437 ++
 io/meson.build|   1 +
 io/trace-events   |  14 ++
 4 files changed, 604 insertions(+)
 create mode 100644 include/io/channel-rdma.h
 create mode 100644 io/channel-rdma.c

diff --git a/include/io/channel-rdma.h b/include/io/channel-rdma.h
new file mode 100644
index 00..8cab2459e5
--- /dev/null
+++ b/include/io/channel-rdma.h
@@ -0,0 +1,152 @@
+/*
+ * QEMU I/O channels RDMA driver
+ *
+ * Copyright (c) 2024 HUAWEI TECHNOLOGIES CO., LTD.
+ *
+ * Authors:
+ *  Jialin Wang 
+ *  Gonglei 
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef QIO_CHANNEL_RDMA_H
+#define QIO_CHANNEL_RDMA_H
+
+#include "io/channel.h"
+#include "io/task.h"
+#include "qemu/sockets.h"
+#include "qom/object.h"
+
+#define TYPE_QIO_CHANNEL_RDMA "qio-channel-rdma"
+OBJECT_DECLARE_SIMPLE_TYPE(QIOChannelRDMA, QIO_CHANNEL_RDMA)
+
+/**
+ * QIOChannelRDMA:
+ *
+ * The QIOChannelRDMA object provides a channel implementation
+ * that discards all writes and returns EOF for all reads.
+ */
+struct QIOChannelRDMA {
+QIOChannel parent;
+/* the rsocket fd */
+int fd;
+
+struct sockaddr_storage localAddr;
+socklen_t localAddrLen;
+struct sockaddr_storage remoteAddr;
+socklen_t remoteAddrLen;
+};
+
+/**
+ * qio_channel_rdma_new:
+ *
+ * Create a channel for performing I/O on a rdma
+ * connection, that is initially closed. After
+ * creating the rdma, it must be setup as a client
+ * connection or server.
+ *
+ * Returns: the rdma channel object
+ */
+QIOChannelRDMA *qio_channel_rdma_new(void);
+
+/**
+ * qio_channel_rdma_connect_sync:
+ * @ioc: the rdma channel object
+ * @addr: the address to connect to
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Attempt to connect to the address @addr. This method
+ * will run in the foreground so the caller will not regain
+ * execution control until the connection is established or
+ * an error occurs.
+ */
+int qio_channel_rdma_connect_sync(QIOChannelRDMA *ioc, InetSocketAddress *addr,
+  Error **errp);
+
+/**
+ * qio_channel_rdma_connect_async:
+ * @ioc: the rdma channel object
+ * @addr: the address to connect to
+ * @callback: the function to invoke on completion
+ * @opaque: user data to pass to @callback
+ * @destroy: the function to free @opaque
+ * @context: the context to run the async task. If %NULL, the default
+ *   context will be used.
+ *
+ * Attempt to connect to the address @addr. This method
+ * will run in the background so the caller will regain
+ * execution control immediately. The function @callback
+ * will be invoked on completion or failure. The @addr
+ * parameter will be copied, so may be freed as soon
+ * as this function returns without waiting for completion.
+ */
+void qio_channel_rdma_connect_async(QIOChannelRDMA *ioc,
+InetSocketAddress *addr,
+QIOTaskFunc callback, gpointer opaque,
+GDestroyNotify destroy,
+GMainContext *context);
+
+/**
+ * qio_channel_rdma_listen_sync:
+ * @ioc: the rdma channel object
+ * @addr: the address to listen to
+ * @num: the expected amount of connections
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Attempt to listen to the address @addr. This method
+ * will run in the foreground so the caller will not regain
+ * execution control until the connection is established or
+ * an error occurs.
+ */
+int qio_channel_rdma_listen_sync(QIOChannelRDMA *ioc, InetSocketAddress *addr,
+ int num, Error **errp);
+
+/**
+ * qio_channel_rdma_listen_async:
+ * @ioc: the rdma channel object
+ * @addr: the address to listen to
+ * @num: the expected amount of connections
+ * @callback: the function to invoke on completion
+ * @opaque: user data to pass to @callback
+ * @destroy: the function to free @opaque
+ * @context: the context to run the async task. If %NULL, the default
+ *  

[PATCH 5/6] migration: introduce new RDMA live migration

2024-06-04 Thread Gonglei via
From: Jialin Wang 

Signed-off-by: Jialin Wang 
Signed-off-by: Gonglei 
---
 migration/meson.build |  2 +
 migration/migration.c | 11 +-
 migration/rdma.c  | 88 +++
 migration/rdma.h  | 24 
 4 files changed, 124 insertions(+), 1 deletion(-)
 create mode 100644 migration/rdma.c
 create mode 100644 migration/rdma.h

diff --git a/migration/meson.build b/migration/meson.build
index 4e8a9ccf3e..04e2e16239 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -42,3 +42,5 @@ system_ss.add(when: zstd, if_true: files('multifd-zstd.c'))
 specific_ss.add(when: 'CONFIG_SYSTEM_ONLY',
 if_true: files('ram.c',
'target.c'))
+
+system_ss.add(when: rdma, if_true: files('rdma.c'))
diff --git a/migration/migration.c b/migration/migration.c
index 6b9ad4ff5f..77c301d351 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -25,6 +25,7 @@
 #include "sysemu/runstate.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/cpu-throttle.h"
+#include "rdma.h"
 #include "ram.h"
 #include "migration/global_state.h"
 #include "migration/misc.h"
@@ -145,7 +146,7 @@ static bool transport_supports_multi_channels(MigrationAddress *addr)
 } else if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
 return migrate_mapped_ram();
 } else {
-return false;
+return addr->transport == MIGRATION_ADDRESS_TYPE_RDMA;
 }
 }
 
@@ -644,6 +645,10 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
 } else if (saddr->type == SOCKET_ADDRESS_TYPE_FD) {
 fd_start_incoming_migration(saddr->u.fd.str, errp);
 }
+#ifdef CONFIG_RDMA
+} else if (addr->transport == MIGRATION_ADDRESS_TYPE_RDMA) {
+rdma_start_incoming_migration(&addr->u.rdma, errp);
+#endif
 } else if (addr->transport == MIGRATION_ADDRESS_TYPE_EXEC) {
 exec_start_incoming_migration(addr->u.exec.args, errp);
 } else if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
@@ -2046,6 +2051,10 @@ void qmp_migrate(const char *uri, bool has_channels,
 } else if (saddr->type == SOCKET_ADDRESS_TYPE_FD) {
 fd_start_outgoing_migration(s, saddr->u.fd.str, &local_err);
 }
+#ifdef CONFIG_RDMA
+} else if (addr->transport == MIGRATION_ADDRESS_TYPE_RDMA) {
+rdma_start_outgoing_migration(s, &addr->u.rdma, &local_err);
+#endif
 } else if (addr->transport == MIGRATION_ADDRESS_TYPE_EXEC) {
 exec_start_outgoing_migration(s, addr->u.exec.args, &local_err);
 } else if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
diff --git a/migration/rdma.c b/migration/rdma.c
new file mode 100644
index 00..09a4de7f59
--- /dev/null
+++ b/migration/rdma.c
@@ -0,0 +1,88 @@
+/*
+ * QEMU live migration via RDMA
+ *
+ * Copyright (c) 2024 HUAWEI TECHNOLOGIES CO., LTD.
+ *
+ * Authors:
+ *  Jialin Wang 
+ *  Gonglei 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "io/channel-rdma.h"
+#include "io/channel.h"
+#include "qapi/clone-visitor.h"
+#include "qapi/qapi-types-sockets.h"
+#include "qapi/qapi-visit-sockets.h"
+#include "channel.h"
+#include "migration.h"
+#include "rdma.h"
+#include "trace.h"
+#include 
+
+static struct RDMAOutgoingArgs {
+InetSocketAddress *addr;
+} outgoing_args;
+
+static void rdma_outgoing_migration(QIOTask *task, gpointer opaque)
+{
+MigrationState *s = opaque;
+QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(qio_task_get_source(task));
+
+migration_channel_connect(s, QIO_CHANNEL(rioc), outgoing_args.addr->host,
+  NULL);
+object_unref(OBJECT(rioc));
+}
+
+void rdma_start_outgoing_migration(MigrationState *s, InetSocketAddress *iaddr,
+   Error **errp)
+{
+QIOChannelRDMA *rioc = qio_channel_rdma_new();
+
+/* in case previous migration leaked it */
+qapi_free_InetSocketAddress(outgoing_args.addr);
+outgoing_args.addr = QAPI_CLONE(InetSocketAddress, iaddr);
+
+qio_channel_set_name(QIO_CHANNEL(rioc), "migration-rdma-outgoing");
+qio_channel_rdma_connect_async(rioc, iaddr, rdma_outgoing_migration, s,
+   NULL, NULL);
+}
+
+static void coroutine_fn rdma_accept_incoming_migration(void *opaque)
+{
+QIOChannelRDMA *rioc = opaque;
+QIOChannelRDMA *cioc;
+
+while (!migration_has_all_channels()) {
+cioc = qio_channel_rdma_accept(rioc, NULL);
+
+qio_channel_set_name(QIO_CHANNEL(cioc), "migration-rdma-incoming");
+migration_channel_process_incoming(QIO_CHANNEL(cioc));
+object_unref(OBJECT(cioc));
+}

[PATCH 4/6] tests/unit: add test-io-channel-rdma.c

2024-06-04 Thread Gonglei via
From: Jialin Wang 

Signed-off-by: Jialin Wang 
Signed-off-by: Gonglei 
---
 tests/unit/meson.build|   1 +
 tests/unit/test-io-channel-rdma.c | 276 ++
 2 files changed, 277 insertions(+)
 create mode 100644 tests/unit/test-io-channel-rdma.c

diff --git a/tests/unit/meson.build b/tests/unit/meson.build
index 26c109c968..c44020a3b5 100644
--- a/tests/unit/meson.build
+++ b/tests/unit/meson.build
@@ -85,6 +85,7 @@ if have_block
 'test-authz-listfile': [authz],
 'test-io-task': [testblock],
 'test-io-channel-socket': ['socket-helpers.c', 'io-channel-helpers.c', io],
+'test-io-channel-rdma': ['io-channel-helpers.c', io],
 'test-io-channel-file': ['io-channel-helpers.c', io],
 'test-io-channel-command': ['io-channel-helpers.c', io],
 'test-io-channel-buffer': ['io-channel-helpers.c', io],
diff --git a/tests/unit/test-io-channel-rdma.c b/tests/unit/test-io-channel-rdma.c
new file mode 100644
index 00..e96b55c8c7
--- /dev/null
+++ b/tests/unit/test-io-channel-rdma.c
@@ -0,0 +1,276 @@
+/*
+ * QEMU I/O channel RDMA test
+ *
+ * Copyright (c) 2024 HUAWEI TECHNOLOGIES CO., LTD.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "io/channel-rdma.h"
+#include "qapi/error.h"
+#include "qemu/main-loop.h"
+#include "qemu/module.h"
+#include "io-channel-helpers.h"
+#include "qapi-types-sockets.h"
+#include 
+
+static SocketAddress *l_addr;
+static SocketAddress *c_addr;
+
+static void test_io_channel_set_rdma_bufs(QIOChannel *src, QIOChannel *dst)
+{
+int buflen = 64 * 1024;
+
+/*
+ * Make the socket buffers small so that we see
+ * the effects of partial reads/writes
+ */
+rsetsockopt(((QIOChannelRDMA *)src)->fd, SOL_SOCKET, SO_SNDBUF,
+(char *)&buflen, sizeof(buflen));
+
+rsetsockopt(((QIOChannelRDMA *)dst)->fd, SOL_SOCKET, SO_SNDBUF,
+(char *)&buflen, sizeof(buflen));
+}
+
+static void test_io_channel_setup_sync(InetSocketAddress *listen_addr,
+   InetSocketAddress *connect_addr,
+   QIOChannel **srv, QIOChannel **src,
+   QIOChannel **dst)
+{
+QIOChannelRDMA *lioc;
+
+lioc = qio_channel_rdma_new();
+qio_channel_rdma_listen_sync(lioc, listen_addr, 1, &error_abort);
+
+*src = QIO_CHANNEL(qio_channel_rdma_new());
+qio_channel_rdma_connect_sync(QIO_CHANNEL_RDMA(*src), connect_addr,
+  &error_abort);
+qio_channel_set_delay(*src, false);
+
+qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN);
+*dst = QIO_CHANNEL(qio_channel_rdma_accept(lioc, &error_abort));
+g_assert(*dst);
+
+test_io_channel_set_rdma_bufs(*src, *dst);
+
+*srv = QIO_CHANNEL(lioc);
+}
+
+struct TestIOChannelData {
+bool err;
+GMainLoop *loop;
+};
+
+static void test_io_channel_complete(QIOTask *task, gpointer opaque)
+{
+struct TestIOChannelData *data = opaque;
+data->err = qio_task_propagate_error(task, NULL);
+g_main_loop_quit(data->loop);
+}
+
+static void test_io_channel_setup_async(InetSocketAddress *listen_addr,
+InetSocketAddress *connect_addr,
+QIOChannel **srv, QIOChannel **src,
+QIOChannel **dst)
+{
+QIOChannelRDMA *lioc;
+struct TestIOChannelData data;
+
+data.loop = g_main_loop_new(g_main_context_default(), TRUE);
+
+lioc = qio_channel_rdma_new();
+qio_channel_rdma_listen_async(lioc, listen_addr, 1,
+  test_io_channel_complete, &data, NULL, NULL);
+
+g_main_loop_run(data.loop);
+g_main_context_iteration(g_main_context_default(), FALSE);
+
+g_assert(!data.err);
+
+*src = QIO_CHANNEL(qio_channel_rdma_new());
+
+qio_channel_rdma_connect_async(QIO_CHANNEL_RDMA(*src), connect_addr,
+   test_io_channel_complete, &data, NULL, NULL);
+
+g_main_loop_run(data.loop);
+g_main_context_iteration(g_main_context_default(), FALSE);
+
+g_assert(!data.err);
+
+if (qemu_in_coroutine()) {
+qio_channel_yield(QIO_CHANNEL(lioc), G_IO_IN);
+} else {
+qio_channel_wait(QIO_CHAN

RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Gonglei (Arei)
Hi,

> -Original Message-
> > > > > > > > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > > > >
> > > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > > >
> > > > > > > > > InfiniBand controller: Mellanox Technologies MT27800
> > > > > > > > > Family [ConnectX-5]
> > > > > > > > >
> > > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP
> > > > > > > > > for VM migration. RDMA traffic is through InfiniBand and
> > > > > > > > > TCP through Ethernet on these two hosts. One is standby
> > > > > > > > > while the other
> > > is active.
> > > > > > > > >
> > > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
> > > > > > > > > (rev
> > > > > > > > > 01)
> > > > > > > > >
> > > > > > > > > The comparison between RDMA and TCP on the same NIC
> > > > > > > > > could make more
> > > > > > > > sense.
> > > > > > > >
> > > > > > > > It looks to me NICs are powerful now, but again as I
> > > > > > > > mentioned I don't think it's a reason we need to deprecate
> > > > > > > > rdma, especially if QEMU's rdma migration has the chance
> > > > > > > > to be refactored
> > > using rsocket.
> > > > > > > >
> > > > > > > > Is there anyone who started looking into that direction?
> > > > > > > > Would it make sense we start some PoC now?
> > > > > > > >
> > > > > > >
> > > > > > > My team has finished the PoC refactoring which works well.
> > > > > > >
> > > > > > > Progress:
> > > > > > > 1.  Implement io/channel-rdma.c,
> > > > > > > 2.  Add unit test tests/unit/test-io-channel-rdma.c and verify
> > > > > > >     it is successful,
> > > > > > > 3.  Remove the original code from migration/rdma.c,
> > > > > > > 4.  Rewrite the rdma_start_outgoing_migration and
> > > > > > >     rdma_start_incoming_migration logic,
> > > > > > > 5.  Remove all rdma_xxx functions from migration/ram.c (to
> > > > > > >     prevent RDMA live migration from polluting the core logic
> > > > > > >     of live migration),
> > > > > > > 6.  The soft-RoCE implemented by software is used to test the
> > > > > > >     RDMA live migration. It's successful.
> > > > > > >
> > > > > > > We will be submit the patchset later.
> > > > > >
> > > > > > That's great news, thank you!
> > > > > >
> > > > > > --
> > > > > > Peter Xu
> > > > >
> > > > > For rdma programming, the current mainstream implementation is
> > > > > to use
> > > rdma_cm to establish a connection, and then use verbs to transmit data.
> > > > >
> > > > > rdma_cm and ibverbs create two FDs respectively. The two FDs
> > > > > have different responsibilities. rdma_cm fd is used to notify
> > > > > connection establishment events, and verbs fd is used to notify
> > > > > new CQEs. When poll/epoll monitoring is directly performed on the
> > > > > rdma_cm fd, only a pollin event can be monitored, which means
> > > > > that an rdma_cm event occurs. When the verbs fd is directly
> > > > > polled/epolled, only the pollin event can be listened, which
> > > > > indicates that a new CQE is generated.
> > > > >
> > > > > Rsocket is a sub-module attached to the rdma_cm library and
> > > > > provides rdma calls that are completely similar to socket
> > > > > interfaces. However, this library returns only the rdma_cm fd for
> > > > > listening to link setup-related events and does not expose the
> > > > > verbs fd (readable and writable events for listening to data).
> > > > > Only the rpoll interface provided by the rsocket can be used to
> > > > > listen to related events. However, QEMU uses the ppoll interface
> > > > > to listen to the rdma_cm fd (gotten by the raccept API), and
> > > > > cannot listen to the verbs fd event.
> I'm confused, the rs_poll_arm
> :https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/rsocket.c#
> L3290
> For STREAM, rpoll setup fd for both cq fd and cm fd.
> 

Right. But the problem is that QEMU does not use rpoll but glib's ppoll. :(
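
For reference, a minimal sketch of the rpoll()-based wait that rsocket expects
(assuming librdmacm's <rdma/rsocket.h>; the helper name is illustrative):

    #include <poll.h>
    #include <rdma/rsocket.h>

    /* Wait for readability on an rsocket fd. A plain poll()/ppoll() on
     * the fd only reports rdma_cm events; rpoll() arms both the cm fd
     * and the cq fd internally (see rs_poll_arm), so new CQEs are
     * reported as POLLIN as well. */
    static int rsocket_wait_readable(int rfd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = rfd, .events = POLLIN };

        return rpoll(&pfd, 1, timeout_ms);
    }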


Regards,
-Gonglei



RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Gonglei (Arei)


> -Original Message-
> From: Jinpu Wang [mailto:jinpu.w...@ionos.com]
> Sent: Wednesday, May 29, 2024 5:18 PM
> To: Gonglei (Arei) 
> Cc: Greg Sword ; Peter Xu ;
> Yu Zhang ; Michael Galaxy ;
> Elmar Gerdes ; zhengchuan
> ; Daniel P. Berrangé ;
> Markus Armbruster ; Zhijian Li (Fujitsu)
> ; qemu-devel@nongnu.org; Yuval Shaia
> ; Kevin Wolf ; Prasanna
> Kumar Kalever ; Cornelia Huck
> ; Michael Roth ; Prasanna
> Kumar Kalever ; Paolo Bonzini
> ; qemu-bl...@nongnu.org; de...@lists.libvirt.org;
> Hanna Reitz ; Michael S. Tsirkin ;
> Thomas Huth ; Eric Blake ; Song
> Gao ; Marc-André Lureau
> ; Alex Bennée ;
> Wainer dos Santos Moschetta ; Beraldo Leal
> ; Pannengyuan ;
> Xiexiangyou ; Fabiano Rosas ;
> RDMA mailing list ; she...@nvidia.com; Haris
> Iqbal 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> Hi Gonglei,
> 
> On Wed, May 29, 2024 at 10:31 AM Gonglei (Arei) 
> wrote:
> >
> >
> >
> > > -Original Message-
> > > From: Greg Sword [mailto:gregswo...@gmail.com]
> > > Sent: Wednesday, May 29, 2024 2:06 PM
> > > To: Jinpu Wang 
> > > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol
> > > handling
> > >
> > > On Wed, May 29, 2024 at 12:33 PM Jinpu Wang 
> > > wrote:
> > > >
> > > > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei)
> > > > 
> > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > > -Original Message-
> > > > > > From: Peter Xu [mailto:pet...@redhat.com]
> > > > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > > > Exactly, not so compelling, as I did it first only on
> > > > > > > > > servers widely used for production in our data center.
> > > > > > > > > The network adapters are
> > > > > > > > >
> > > > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries
> > > > > > > > > NetXtreme
> > > > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > > > >
> > > > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6
> > > > > > > > looks more
> > > > > > reasonable.
> > > > > > > >
> > > > > > > >
> > > > > >
> > >
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > > > 15
> > > > > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > > > >
> > > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > > >
> > > > > > > > > InfiniBand controller: Mellanox Technologies MT27800
> > > > > > > > > Family [ConnectX-5]
> > > > > > > > >
> > > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP
> > > > > > > > > for VM migration. RDMA traffic is through InfiniBand and
> > > > > > > > > TCP through Ethernet on these two hosts. One is standby
> > > > > > > > > while the other
> > > is active.
> > > > > > > > >
> > > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
> > > > > > > > > (rev
> > > > > > > > > 01)
> > > > > > > > >
> > > > > > > > > The comparison between RDMA and TCP on the same NIC
> > > > > > > > > could make more
> > > > > > > > sense.
> > > > > > > >
> > > > > > > > It looks to me NICs are powerful now, but again as I
> > > > > > > > mentioned I don't think it's a reason we need to deprecate
> > > > > > > > rdma, especially if QEMU's rdma migration has the chance
> > > > > > > > to be refactored
> > > using rsocket.
> > > > > > > >
> > > > > > > > Is there anyone who started looking into that direction?
> > > > > > > > Would it make sense we start some PoC now?

RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Gonglei (Arei)


> -Original Message-
> From: Greg Sword [mailto:gregswo...@gmail.com]
> Sent: Wednesday, May 29, 2024 2:06 PM
> To: Jinpu Wang 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Wed, May 29, 2024 at 12:33 PM Jinpu Wang 
> wrote:
> >
> > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei) 
> wrote:
> > >
> > > Hi,
> > >
> > > > -Original Message-
> > > > From: Peter Xu [mailto:pet...@redhat.com]
> > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > Exactly, not so compelling, as I did it first only on
> > > > > > > servers widely used for production in our data center. The
> > > > > > > network adapters are
> > > > > > >
> > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries
> > > > > > > NetXtreme
> > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > >
> > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks
> > > > > > more
> > > > reasonable.
> > > > > >
> > > > > >
> > > >
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > 15
> > > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > >
> > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > >
> > > > > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > > > > [ConnectX-5]
> > > > > > >
> > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP for
> > > > > > > VM migration. RDMA traffic is through InfiniBand and TCP
> > > > > > > through Ethernet on these two hosts. One is standby while the 
> > > > > > > other
> is active.
> > > > > > >
> > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev
> > > > > > > 01)
> > > > > > >
> > > > > > > The comparison between RDMA and TCP on the same NIC could
> > > > > > > make more
> > > > > > sense.
> > > > > >
> > > > > > It looks to me NICs are powerful now, but again as I mentioned
> > > > > > I don't think it's a reason we need to deprecate rdma,
> > > > > > especially if QEMU's rdma migration has the chance to be refactored
> using rsocket.
> > > > > >
> > > > > > Is there anyone who started looking into that direction?
> > > > > > Would it make sense we start some PoC now?
> > > > > >
> > > > >
> > > > > My team has finished the PoC refactoring which works well.
> > > > >
> > > > > Progress:
> > > > > 1.  Implement io/channel-rdma.c, 2.  Add unit test
> > > > > tests/unit/test-io-channel-rdma.c and verifying it is
> > > > > successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx
> > > > > functions from migration/ram.c. (to prevent RDMA live migration
> > > > > from polluting the
> > > > core logic of live migration), 6.  The soft-RoCE implemented by
> > > > software is used to test the RDMA live migration. It's successful.
> > > > >
> > > > > We will be submit the patchset later.
> > > >
> > > > That's great news, thank you!
> > > >
> > > > --
> > > > Peter Xu
> > >
> > > For rdma programming, the current mainstream implementation is to use
> rdma_cm to establish a connection, and then use verbs to transmit data.
> > >
> > > rdma_cm and ibverbs create two FDs respectively. The two FDs have
> > > different responsibilities. rdma_cm fd is used to notify connection
> > > establishment events, and verbs fd is used to notify new CQEs. When
> poll/epoll monitoring is directly performed on the rdma_cm fd, only a pollin
> event can be monitored, which means that an rdma_cm event occurs. When
> the verbs fd is directly polled/epolled, only the pollin event can be 
> listened,
> which indicates that a new CQE is generated.
> > >
> > > Rsocket is a sub-module attached to the rdma_cm library and provides
> > > rdma calls that are completely similar to socket interfaces.
> > > However, this library returns only the rdma_cm fd for listening to link
> setup-related events and does not expose the verbs fd (readable and writable
> events for listening to data). Only the rpoll interface provided by the 
> RSocket
> can be used to listen to related events. However, QEMU uses the ppoll
> interface to listen to the rdma_cm fd (gotten by raccept API).
> > > And cannot listen to the verbs fd event. Only some hacking methods can be
> used to address this problem.
> > >
> > > Do you guys have any ideas? Thanks.
> > +cc linux-rdma
> 
> Why include rdma community?
> 

Can rdma/rsocket provide an API to expose the verbs fd? 


Regards,
-Gonglei

> > +cc Sean
> >
> >
> >
> > >
> > >
> > > Regards,
> > > -Gonglei
> >


RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-28 Thread Gonglei (Arei)
Hi,

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Tuesday, May 28, 2024 11:55 PM
> > > > Exactly, not so compelling, as I did it first only on servers
> > > > widely used for production in our data center. The network
> > > > adapters are
> > > >
> > > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme
> > > > BCM5720 2-port Gigabit Ethernet PCIe
> > >
> > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more
> reasonable.
> > >
> > >
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > >
> > > Appreciate a lot for everyone helping on the testings.
> > >
> > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > [ConnectX-5]
> > > >
> > > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > > migration. RDMA traffic is through InfiniBand and TCP through
> > > > Ethernet on these two hosts. One is standby while the other is active.
> > > >
> > > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > > network adapters. One of them has:
> > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > > >
> > > > The comparison between RDMA and TCP on the same NIC could make
> > > > more
> > > sense.
> > >
> > > It looks to me NICs are powerful now, but again as I mentioned I
> > > don't think it's a reason we need to deprecate rdma, especially if
> > > QEMU's rdma migration has the chance to be refactored using rsocket.
> > >
> > > Is there anyone who started looking into that direction?  Would it
> > > make sense we start some PoC now?
> > >
> >
> > My team has finished the PoC refactoring which works well.
> >
> > Progress:
> > 1.  Implement io/channel-rdma.c,
> > 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it
> > is successful, 3.  Remove the original code from migration/rdma.c, 4.
> > Rewrite the rdma_start_outgoing_migration and
> > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx functions
> > from migration/ram.c. (to prevent RDMA live migration from polluting the
> core logic of live migration), 6.  The soft-RoCE implemented by software is
> used to test the RDMA live migration. It's successful.
> >
> > We will be submit the patchset later.
> 
> That's great news, thank you!
> 
> --
> Peter Xu

For RDMA programming, the current mainstream implementation uses rdma_cm to
establish a connection and then verbs to transmit data.

rdma_cm and ibverbs each create their own FD, and the two FDs have different
responsibilities: the rdma_cm fd signals connection establishment events, while
the verbs fd signals new CQEs. When poll/epoll monitors the rdma_cm fd
directly, only a POLLIN event can be observed, meaning an rdma_cm event has
occurred. When the verbs fd is polled/epolled directly, again only a POLLIN
event can be observed, indicating that a new CQE has been generated.
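
A minimal sketch of the two-fd split described above (standard
librdmacm/libibverbs calls; error handling omitted):

    #include <sys/epoll.h>
    #include <rdma/rdma_cma.h>
    #include <infiniband/verbs.h>

    /* Watch both fds: the rdma_cm one for connection events and the
     * completion-channel one for new-CQE notifications. */
    static int watch_rdma_fds(struct rdma_event_channel *cm_chan,
                              struct ibv_comp_channel *comp_chan)
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN };

        ev.data.fd = cm_chan->fd;   /* POLLIN: rdma_cm event pending */
        epoll_ctl(epfd, EPOLL_CTL_ADD, cm_chan->fd, &ev);

        ev.data.fd = comp_chan->fd; /* POLLIN: new CQE notification */
        epoll_ctl(epfd, EPOLL_CTL_ADD, comp_chan->fd, &ev);

        return epfd;
    }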

Rsocket is a sub-module attached to the rdma_cm library that provides RDMA
calls closely mirroring the socket interfaces. However, this library returns
only the rdma_cm fd, for listening to link setup-related events; it does not
expose the verbs fd (for readable/writable data events). Only the rpoll
interface provided by rsocket can be used to listen for those events. QEMU,
however, uses the ppoll interface to listen to the rdma_cm fd (obtained via
the raccept API), so it cannot observe verbs fd events. Only hacks could be
used to work around this problem.

Do you guys have any ideas? Thanks.


Regards,
-Gonglei


RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-28 Thread Gonglei (Arei)
Hi Peter,

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Wednesday, May 22, 2024 6:15 AM
> To: Yu Zhang 
> Cc: Michael Galaxy ; Jinpu Wang
> ; Elmar Gerdes ;
> zhengchuan ; Gonglei (Arei)
> ; Daniel P. Berrangé ;
> Markus Armbruster ; Zhijian Li (Fujitsu)
> ; qemu-devel@nongnu.org; Yuval Shaia
> ; Kevin Wolf ; Prasanna
> Kumar Kalever ; Cornelia Huck
> ; Michael Roth ; Prasanna
> Kumar Kalever ; Paolo Bonzini
> ; qemu-bl...@nongnu.org; de...@lists.libvirt.org;
> Hanna Reitz ; Michael S. Tsirkin ;
> Thomas Huth ; Eric Blake ; Song
> Gao ; Marc-André Lureau
> ; Alex Bennée ;
> Wainer dos Santos Moschetta ; Beraldo Leal
> ; Pannengyuan ;
> Xiexiangyou ; Fabiano Rosas 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> > Hello Michael and Peter,
> 
> Hi,
> 
> >
> > Exactly, not so compelling, as I did it first only on servers widely
> > used for production in our data center. The network adapters are
> >
> > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> > 2-port Gigabit Ethernet PCIe
> 
> Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more reasonable.
> 
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> wvaqk81vxtkzx-l...@mail.gmail.com/
> 
> Appreciate a lot for everyone helping on the testings.
> 
> > InfiniBand controller: Mellanox Technologies MT27800 Family
> > [ConnectX-5]
> >
> > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> > on these two hosts. One is standby while the other is active.
> >
> > Now I'll try on a server with more recent Ethernet and InfiniBand
> > network adapters. One of them has:
> > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> >
> > The comparison between RDMA and TCP on the same NIC could make more
> sense.
> 
> It looks to me NICs are powerful now, but again as I mentioned I don't think 
> it's
> a reason we need to deprecate rdma, especially if QEMU's rdma migration has
> the chance to be refactored using rsocket.
> 
> Is there anyone who started looking into that direction?  Would it make sense
> we start some PoC now?
> 

My team has finished the PoC refactoring, and it works well.

Progress:
1.  Implement io/channel-rdma.c,
2.  Add the unit test tests/unit/test-io-channel-rdma.c and verify that it
passes,
3.  Remove the original code from migration/rdma.c,
4.  Rewrite the rdma_start_outgoing_migration and rdma_start_incoming_migration
logic (a rough sketch follows below),
5.  Remove all rdma_xxx functions from migration/ram.c (to keep RDMA live
migration from polluting the core logic of live migration),
6.  Test the RDMA live migration over soft-RoCE, the software RDMA
implementation. It works.

We will submit the patchset later.
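
As a rough sketch of what step 4 amounts to (purely illustrative:
qio_channel_rdma_connect_sync() is a hypothetical helper here, and the real
signatures may differ), the outgoing side becomes a thin wrapper that opens a
QIOChannelRDMA and hands it to the generic migration channel layer:

    void rdma_start_outgoing_migration(MigrationState *s,
                                       InetSocketAddress *addr, Error **errp)
    {
        QIOChannelRDMA *rioc = qio_channel_rdma_new();

        /* hypothetical synchronous connect helper */
        if (qio_channel_rdma_connect_sync(rioc, addr, errp) < 0) {
            object_unref(OBJECT(rioc));
            return;
        }
        migration_channel_connect(s, QIO_CHANNEL(rioc), NULL, NULL);
        object_unref(OBJECT(rioc));
    }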


Regards,
-Gonglei

> Thanks,
> 
> --
> Peter Xu



RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-06 Thread Gonglei (Arei)
Hello,

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Monday, May 6, 2024 11:18 PM
> To: Gonglei (Arei) 
> Cc: Daniel P. Berrangé ; Markus Armbruster
> ; Michael Galaxy ; Yu Zhang
> ; Zhijian Li (Fujitsu) ; Jinpu Wang
> ; Elmar Gerdes ;
> qemu-devel@nongnu.org; Yuval Shaia ; Kevin Wolf
> ; Prasanna Kumar Kalever
> ; Cornelia Huck ;
> Michael Roth ; Prasanna Kumar Kalever
> ; integrat...@gluster.org; Paolo Bonzini
> ; qemu-bl...@nongnu.org; de...@lists.libvirt.org;
> Hanna Reitz ; Michael S. Tsirkin ;
> Thomas Huth ; Eric Blake ; Song
> Gao ; Marc-André Lureau
> ; Alex Bennée ;
> Wainer dos Santos Moschetta ; Beraldo Leal
> ; Pannengyuan ;
> Xiexiangyou 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote:
> > Hi, Peter
> 
> Hey, Lei,
> 
> Happy to see you around again after years.
> 
Haha, me too.

> > RDMA features high bandwidth, low latency (in non-blocking lossless
> > network), and direct remote memory access by bypassing the CPU (As you
> > know, CPU resources are expensive for cloud vendors, which is one of
> > the reasons why we introduced offload cards.), which TCP does not have.
> 
> It's another cost to use offload cards, v.s. preparing more cpu resources?
> 
A converged software and hardware offload architecture is the way to go for
all cloud vendors, given the combined benefits in performance, cost, security,
and speed of innovation; it's not just a matter of adding the resources of a
DPU card.

> > In some scenarios where fast live migration is needed (extremely short
> > interruption duration and migration duration) is very useful. To this
> > end, we have also developed RDMA support for multifd.
> 
> Will any of you upstream that work?  I'm curious how intrusive would it be
> when adding it to multifd, if it can keep only 5 exported functions like what
> rdma.h does right now it'll be pretty nice.  We also want to make sure it 
> works
> with arbitrary sized loads and buffers, e.g. vfio is considering to add IO 
> loads to
> multifd channels too.
> 

In fact, we sent the patchset to the community in 2021. Please see:
https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/


> One thing to note that the question here is not about a pure performance
> comparison between rdma and nics only.  It's about help us make a decision
> on whether to drop rdma, iow, even if rdma performs well, the community still
> has the right to drop it if nobody can actively work and maintain it.
> It's just that if nics can perform as good it's more a reason to drop, unless
> companies can help to provide good support and work together.
> 

We are happy to provide the necessary review and maintenance work for RDMA
if the community needs it.

CC'ing Chuan Zheng.


Regards,
-Gonglei



RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-06 Thread Gonglei (Arei)
Hi, Peter

RDMA features high bandwidth, low latency (on a non-blocking lossless
network), and direct remote memory access that bypasses the CPU (as you know,
CPU resources are expensive for cloud vendors, which is one of the reasons we
introduced offload cards), none of which TCP offers.

It is very useful in scenarios that need fast live migration (extremely short
interruption and migration durations). To this end, we have also developed
RDMA support for multifd.

Regards,
-Gonglei

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Wednesday, May 1, 2024 11:31 PM
> To: Daniel P. Berrangé 
> Cc: Markus Armbruster ; Michael Galaxy
> ; Yu Zhang ; Zhijian Li (Fujitsu)
> ; Jinpu Wang ; Elmar Gerdes
> ; qemu-devel@nongnu.org; Yuval Shaia
> ; Kevin Wolf ; Prasanna
> Kumar Kalever ; Cornelia Huck
> ; Michael Roth ; Prasanna
> Kumar Kalever ; integrat...@gluster.org; Paolo
> Bonzini ; qemu-bl...@nongnu.org;
> de...@lists.libvirt.org; Hanna Reitz ; Michael S. Tsirkin
> ; Thomas Huth ; Eric Blake
> ; Song Gao ; Marc-André
> Lureau ; Alex Bennée
> ; Wainer dos Santos Moschetta
> ; Beraldo Leal ; Gonglei (Arei)
> ; Pannengyuan 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Tue, Apr 30, 2024 at 09:00:49AM +0100, Daniel P. Berrangé wrote:
> > On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote:
> > > Peter Xu  writes:
> > >
> > > > On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> > > >> Hi All (and Peter),
> > > >
> > > > Hi, Michael,
> > > >
> > > >>
> > > >> My name is Michael Galaxy (formerly Hines). Yes, I changed my
> > > >> last name (highly irregular for a male) and yes, that's my real last 
> > > >> name:
> > > >> https://www.linkedin.com/in/mrgalaxy/)
> > > >>
> > > >> I'm the original author of the RDMA implementation. I've been
> > > >> discussing with Yu Zhang for a little bit about potentially
> > > >> handing over maintainership of the codebase to his team.
> > > >>
> > > >> I simply have zero access to RoCE or Infiniband hardware at all,
> > > >> unfortunately. so I've never been able to run tests or use what I
> > > >> wrote at work, and as all of you know, if you don't have a way to
> > > >> test something, then you can't maintain it.
> > > >>
> > > >> Yu Zhang put a (very kind) proposal forward to me to ask the
> > > >> community if they feel comfortable training his team to maintain
> > > >> the codebase (and run
> > > >> tests) while they learn about it.
> > > >
> > > > The "while learning" part is fine at least to me.  IMHO the
> > > > "ownership" to the code, or say, taking over the responsibility,
> > > > may or may not need 100% mastering the code base first.  There
> > > > should still be some fundamental confidence to work on the code
> > > > though as a starting point, then it's about serious use case to
> > > > back this up, and careful testings while getting more familiar with it.
> > >
> > > How much experience we expect of maintainers depends on the
> > > subsystem and other circumstances.  The hard requirement isn't
> > > experience, it's trust.  See the recent attack on xz.
> > >
> > > I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
> > > I'm merely reminding y'all what's at stake.
> >
> > I think we shouldn't overly obsess[1] about 'xz', because the
> > overwhealmingly common scenario is that volunteer maintainers are
> > honest people. QEMU is in a massively better peer review situation.
> > With xz there was basically no oversight of the new maintainer. With
> > QEMU, we have oversight from 1000's of people on the list, a huge pool
> > of general maintainers, the specific migration maintainers, and the release
> manager merging code.
> >
> > With a lack of historical experiance with QEMU maintainership, I'd
> > suggest that new RDMA volunteers would start by adding themselves to the
> "MAINTAINERS"
> > file with only the 'Reviewer' classification. The main migration
> > maintainers would still handle pull requests, but wait for a R-b from
> > one of the RMDA volunteers. After some period of time the RDMA folks
> > could graduate to full maintainer status if the migration maintainers needed
> to reduce their load.
> > I suspect

RE: [PATCH-for-8.2 v2] backends/cryptodev: Do not ignore throttle/backends Errors

2023-11-20 Thread Gonglei (Arei)


> -Original Message-
> From: Philippe Mathieu-Daudé [mailto:phi...@linaro.org]
> Sent: Monday, November 20, 2023 11:04 PM
> To: qemu-devel@nongnu.org
> Cc: Zhenwei Pi ; Gonglei (Arei)
> ; Markus Armbruster ;
> Daniel P . Berrangé ; Philippe Mathieu-Daudé
> ; qemu-sta...@nongnu.org
> Subject: [PATCH-for-8.2 v2] backends/cryptodev: Do not ignore
> throttle/backends Errors
> 
> Both cryptodev_backend_set_throttle() and CryptoDevBackendClass::init() can
> set their Error** argument. Do not ignore them, return early on failure. Use
> the ERRP_GUARD() macro as suggested in commit ae7c80a7bd
> ("error: New macro ERRP_GUARD()").
> 
> Cc: qemu-sta...@nongnu.org
> Fixes: e7a775fd9f ("cryptodev: Account statistics")
> Fixes: 2580b452ff ("cryptodev: support QoS")
> Signed-off-by: Philippe Mathieu-Daudé 
> ---

Reviewed-by: Gonglei 


Regards,
-Gonglei



RE: [PATCH v2] virtio-crypto: fix NULL pointer dereference in virtio_crypto_free_request

2023-05-09 Thread Gonglei (Arei)



> -Original Message-
> From: Mauro Matteo Cascella [mailto:mcasc...@redhat.com]
> Sent: Tuesday, May 9, 2023 3:53 PM
> To: qemu-devel@nongnu.org
> Cc: m...@redhat.com; Gonglei (Arei) ;
> pizhen...@bytedance.com; ta...@zju.edu.cn; mcasc...@redhat.com
> Subject: [PATCH v2] virtio-crypto: fix NULL pointer dereference in
> virtio_crypto_free_request
> 
> Ensure op_info is not NULL in case of QCRYPTODEV_BACKEND_ALG_SYM
> algtype.
> 
> Fixes: 0e660a6f90a ("crypto: Introduce RSA algorithm")
> Signed-off-by: Mauro Matteo Cascella 
> Reported-by: Yiming Tao 
> ---
> v2:
> - updated 'Fixes:' tag
> 
>  hw/virtio/virtio-crypto.c | 20 +++-
>  1 file changed, 11 insertions(+), 9 deletions(-)
> 

Reviewed-by: Gonglei 


Regards,
-Gonglei

> diff --git a/hw/virtio/virtio-crypto.c b/hw/virtio/virtio-crypto.c index
> 2fe804510f..c729a1f79e 100644
> --- a/hw/virtio/virtio-crypto.c
> +++ b/hw/virtio/virtio-crypto.c
> @@ -476,15 +476,17 @@ static void
> virtio_crypto_free_request(VirtIOCryptoReq *req)
>  size_t max_len;
>  CryptoDevBackendSymOpInfo *op_info =
> req->op_info.u.sym_op_info;
> 
> -max_len = op_info->iv_len +
> -  op_info->aad_len +
> -  op_info->src_len +
> -  op_info->dst_len +
> -  op_info->digest_result_len;
> -
> -/* Zeroize and free request data structure */
> -memset(op_info, 0, sizeof(*op_info) + max_len);
> -g_free(op_info);
> +if (op_info) {
> +max_len = op_info->iv_len +
> +  op_info->aad_len +
> +  op_info->src_len +
> +  op_info->dst_len +
> +  op_info->digest_result_len;
> +
> +/* Zeroize and free request data structure */
> +memset(op_info, 0, sizeof(*op_info) + max_len);
> +g_free(op_info);
> +}
>  } else if (req->flags == QCRYPTODEV_BACKEND_ALG_ASYM) {
>  CryptoDevBackendAsymOpInfo *op_info =
> req->op_info.u.asym_op_info;
>  if (op_info) {
> --
> 2.40.1




RE: [PATCH] virtio-crypto: fix NULL pointer dereference in virtio_crypto_free_request

2023-05-08 Thread Gonglei (Arei)



> -Original Message-
> From: Mauro Matteo Cascella [mailto:mcasc...@redhat.com]
> Sent: Monday, May 8, 2023 11:02 PM
> To: qemu-devel@nongnu.org
> Cc: m...@redhat.com; Gonglei (Arei) ;
> pizhen...@bytedance.com; ta...@zju.edu.cn; mcasc...@redhat.com
> Subject: [PATCH] virtio-crypto: fix NULL pointer dereference in
> virtio_crypto_free_request
> 
> Ensure op_info is not NULL in case of QCRYPTODEV_BACKEND_ALG_SYM
> algtype.
> 
> Fixes: 02ed3e7c ("virtio-crypto: zeroize the key material before free")

I have to say the Fixes tag is incorrect. The bug was introduced by commit
0e660a6f90a, which changed the semantic meaning of request->flags.

Regards,
-Gonglei




RE: RE: [PATCH v8 1/1] crypto: Introduce RSA algorithm

2022-05-31 Thread Gonglei (Arei)


> -Original Message-
> From: zhenwei pi [mailto:pizhen...@bytedance.com]
> Sent: Tuesday, May 31, 2022 9:48 AM
> To: Gonglei (Arei) 
> Cc: qemu-devel@nongnu.org; m...@redhat.com;
> virtualizat...@lists.linux-foundation.org; helei.si...@bytedance.com;
> berra...@redhat.com
> Subject: Re: RE: [PATCH v8 1/1] crypto: Introduce RSA algorithm
> 
> On 5/30/22 21:31, Gonglei (Arei) wrote:
> >
> >
> >> -Original Message-
> >> From: zhenwei pi [mailto:pizhen...@bytedance.com]
> >> Sent: Friday, May 27, 2022 4:48 PM
> >> To: m...@redhat.com; Gonglei (Arei) 
> >> Cc: qemu-devel@nongnu.org; virtualizat...@lists.linux-foundation.org;
> >> helei.si...@bytedance.com; berra...@redhat.com; zhenwei pi
> >> 
> >> Subject: [PATCH v8 1/1] crypto: Introduce RSA algorithm
> >>
> >>
> > Skip...
> >
> >> +static int64_t
> >> +virtio_crypto_create_asym_session(VirtIOCrypto *vcrypto,
> >> +   struct virtio_crypto_akcipher_create_session_req
> >> *sess_req,
> >> +   uint32_t queue_id, uint32_t opcode,
> >> +   struct iovec *iov, unsigned int out_num) {
> >> +VirtIODevice *vdev = VIRTIO_DEVICE(vcrypto);
> >> +CryptoDevBackendSessionInfo info = {0};
> >> +CryptoDevBackendAsymSessionInfo *asym_info;
> >> +int64_t session_id;
> >> +int queue_index;
> >> +uint32_t algo, keytype, keylen;
> >> +g_autofree uint8_t *key = NULL;
> >> +Error *local_err = NULL;
> >> +
> >> +algo = ldl_le_p(&sess_req->para.algo);
> >> +keytype = ldl_le_p(&sess_req->para.keytype);
> >> +keylen = ldl_le_p(&sess_req->para.keylen);
> >> +
> >> +if ((keytype != VIRTIO_CRYPTO_AKCIPHER_KEY_TYPE_PUBLIC)
> >> + && (keytype !=
> VIRTIO_CRYPTO_AKCIPHER_KEY_TYPE_PRIVATE)) {
> >> +error_report("unsupported asym keytype: %d", keytype);
> >> +return -VIRTIO_CRYPTO_NOTSUPP;
> >> +}
> >> +
> >> +if (keylen) {
> >> +key = g_malloc(keylen);
> >> +if (iov_to_buf(iov, out_num, 0, key, keylen) != keylen) {
> >> +virtio_error(vdev, "virtio-crypto asym key incorrect");
> >> +return -EFAULT;
> >
> > Memory leak.
> >
> >> +}
> >> +iov_discard_front(&iov, &out_num, keylen);
> >> +}
> >> +
> >> +info.op_code = opcode;
> >> +asym_info = &info.u.asym_sess_info;
> >> +asym_info->algo = algo;
> >> +asym_info->keytype = keytype;
> >> +asym_info->keylen = keylen;
> >> +asym_info->key = key;
> >> +switch (asym_info->algo) {
> >> +case VIRTIO_CRYPTO_AKCIPHER_RSA:
> >> +asym_info->u.rsa.padding_algo =
> >> +ldl_le_p(&sess_req->para.u.rsa.padding_algo);
> >> +asym_info->u.rsa.hash_algo =
> >> +ldl_le_p(&sess_req->para.u.rsa.hash_algo);
> >> +break;
> >> +
> >> +/* TODO DSA handling */
> >> +
> >> +default:
> >> +return -VIRTIO_CRYPTO_ERR;
> >> +}
> >> +
> >> +queue_index = virtio_crypto_vq2q(queue_id);
> >> +session_id =
> >> + cryptodev_backend_create_session(vcrypto->cryptodev, &info,
> >> + queue_index, &local_err);
> >> +if (session_id < 0) {
> >> +if (local_err) {
> >> +error_report_err(local_err);
> >> +}
> >> +return -VIRTIO_CRYPTO_ERR;
> >> +}
> >> +
> >> +return session_id;
> >
> > Where to free the key at both normal and exceptional paths?
> >
> 
> Hi, Lei
> 
> The key is declared with g_autofree:
> g_autofree uint8_t *key = NULL;
> 

OK. For the patch:

Reviewed-by: Gonglei 


Regards,
-Gonglei




RE: [PATCH v8 1/1] crypto: Introduce RSA algorithm

2022-05-30 Thread Gonglei (Arei)



> -Original Message-
> From: zhenwei pi [mailto:pizhen...@bytedance.com]
> Sent: Friday, May 27, 2022 4:48 PM
> To: m...@redhat.com; Gonglei (Arei) 
> Cc: qemu-devel@nongnu.org; virtualizat...@lists.linux-foundation.org;
> helei.si...@bytedance.com; berra...@redhat.com; zhenwei pi
> 
> Subject: [PATCH v8 1/1] crypto: Introduce RSA algorithm
> 
> 
Skip...

> +static int64_t
> +virtio_crypto_create_asym_session(VirtIOCrypto *vcrypto,
> +   struct virtio_crypto_akcipher_create_session_req
> *sess_req,
> +   uint32_t queue_id, uint32_t opcode,
> +   struct iovec *iov, unsigned int out_num) {
> +VirtIODevice *vdev = VIRTIO_DEVICE(vcrypto);
> +CryptoDevBackendSessionInfo info = {0};
> +CryptoDevBackendAsymSessionInfo *asym_info;
> +int64_t session_id;
> +int queue_index;
> +uint32_t algo, keytype, keylen;
> +g_autofree uint8_t *key = NULL;
> +Error *local_err = NULL;
> +
> +algo = ldl_le_p(&sess_req->para.algo);
> +keytype = ldl_le_p(&sess_req->para.keytype);
> +keylen = ldl_le_p(&sess_req->para.keylen);
> +
> +if ((keytype != VIRTIO_CRYPTO_AKCIPHER_KEY_TYPE_PUBLIC)
> + && (keytype != VIRTIO_CRYPTO_AKCIPHER_KEY_TYPE_PRIVATE)) {
> +error_report("unsupported asym keytype: %d", keytype);
> +return -VIRTIO_CRYPTO_NOTSUPP;
> +}
> +
> +if (keylen) {
> +key = g_malloc(keylen);
> +if (iov_to_buf(iov, out_num, 0, key, keylen) != keylen) {
> +virtio_error(vdev, "virtio-crypto asym key incorrect");
> +return -EFAULT;

Memory leak.

> +}
> +iov_discard_front(&iov, &out_num, keylen);
> +}
> +
> +info.op_code = opcode;
> +asym_info = &info.u.asym_sess_info;
> +asym_info->algo = algo;
> +asym_info->keytype = keytype;
> +asym_info->keylen = keylen;
> +asym_info->key = key;
> +switch (asym_info->algo) {
> +case VIRTIO_CRYPTO_AKCIPHER_RSA:
> +asym_info->u.rsa.padding_algo =
> +ldl_le_p(&sess_req->para.u.rsa.padding_algo);
> +asym_info->u.rsa.hash_algo =
> +ldl_le_p(&sess_req->para.u.rsa.hash_algo);
> +break;
> +
> +/* TODO DSA handling */
> +
> +default:
> +return -VIRTIO_CRYPTO_ERR;
> +}
> +
> +queue_index = virtio_crypto_vq2q(queue_id);
> +session_id = cryptodev_backend_create_session(vcrypto->cryptodev, &info,
> + queue_index, &local_err);
> +if (session_id < 0) {
> +if (local_err) {
> +error_report_err(local_err);
> +}
> +return -VIRTIO_CRYPTO_ERR;
> +}
> +
> +return session_id;

Where to free the key at both normal and exceptional paths?


Regards,
-Gonglei





RE: [PATCH 9/9] crypto: Introduce RSA algorithm

2022-05-26 Thread Gonglei (Arei)



> -Original Message-
> From: Lei He [mailto:helei.si...@bytedance.com]
> Sent: Wednesday, May 25, 2022 5:01 PM
> To: m...@redhat.com; Gonglei (Arei) ;
> berra...@redhat.com
> Cc: qemu-devel@nongnu.org; virtualizat...@lists.linux-foundation.org;
> linux-cry...@vger.kernel.org; jasow...@redhat.com; coh...@redhat.com;
> pizhen...@bytedance.com; helei.si...@bytedance.com
> Subject: [PATCH 9/9] crypto: Introduce RSA algorithm
> 
> From: zhenwei pi 
> 
> There are two parts in this patch:
> 1, support akcipher service by cryptodev-builtin driver
> 2, virtio-crypto driver supports akcipher service
> 
> In principle, we should separate this into two patches, to avoid compiling 
> error,
> merge them into one.
> 
> Then virtio-crypto gets request from guest side, and forwards the request to
> builtin driver to handle it.
> 
> Test with a guest linux:
> 1, The self-test framework of crypto layer works fine in guest kernel
> 2, Test with Linux guest(with asym support), the following script test(note
> that pkey_XXX is supported only in a newer version of keyutils):
>   - both public key & private key
>   - create/close session
>   - encrypt/decrypt/sign/verify basic driver operation
>   - also test with kernel crypto layer(pkey add/query)
> 
> All the cases work fine.
> 
> Run script in guest:
> rm -rf *.der *.pem *.pfx
> modprobe pkcs8_key_parser # if CONFIG_PKCS8_PRIVATE_KEY_PARSER=m
> rm -rf /tmp/data
> dd if=/dev/random of=/tmp/data count=1 bs=20
> 
> openssl req -nodes -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem \
>   -subj "/C=CN/ST=BJ/L=HD/O=qemu/OU=dev/CN=qemu/emailAddress=qemu@qemu.org"
> openssl pkcs8 -in key.pem -topk8 -nocrypt -outform DER -out key.der
> openssl x509 -in cert.pem -inform PEM -outform DER -out cert.der
> 
> PRIV_KEY_ID=`cat key.der | keyctl padd asymmetric test_priv_key @s`
> echo "priv key id = "$PRIV_KEY_ID
> PUB_KEY_ID=`cat cert.der | keyctl padd asymmetric test_pub_key @s`
> echo "pub key id = "$PUB_KEY_ID
> 
> keyctl pkey_query $PRIV_KEY_ID 0
> keyctl pkey_query $PUB_KEY_ID 0
> 
> echo "Enc with priv key..."
> keyctl pkey_encrypt $PRIV_KEY_ID 0 /tmp/data enc=pkcs1 >/tmp/enc.priv
> echo "Dec with pub key..."
> keyctl pkey_decrypt $PRIV_KEY_ID 0 /tmp/enc.priv enc=pkcs1 >/tmp/dec
> cmp /tmp/data /tmp/dec
> 
> echo "Sign with priv key..."
> keyctl pkey_sign $PRIV_KEY_ID 0 /tmp/data enc=pkcs1 hash=sha1 > /tmp/sig
> echo "Verify with pub key..."
> keyctl pkey_verify $PRIV_KEY_ID 0 /tmp/data /tmp/sig enc=pkcs1 hash=sha1
> 
> echo "Enc with pub key..."
> keyctl pkey_encrypt $PUB_KEY_ID 0 /tmp/data enc=pkcs1 >/tmp/enc.pub
> echo "Dec with priv key..."
> keyctl pkey_decrypt $PRIV_KEY_ID 0 /tmp/enc.pub enc=pkcs1 >/tmp/dec
> cmp /tmp/data /tmp/dec
> 
> echo "Verify with pub key..."
> keyctl pkey_verify $PUB_KEY_ID 0 /tmp/data /tmp/sig enc=pkcs1 hash=sha1
> 
> Signed-off-by: zhenwei pi 
> Signed-off-by: lei he 
> ---
>  backends/cryptodev-builtin.c  | 272
> +++-
>  backends/cryptodev-vhost-user.c   |  34 +++-
>  backends/cryptodev.c  |  32 ++--
>  hw/virtio/virtio-crypto.c | 323
> ++
>  include/hw/virtio/virtio-crypto.h |   5 +-
>  include/sysemu/cryptodev.h|  83 --
>  6 files changed, 604 insertions(+), 145 deletions(-)
> 
> diff --git a/backends/cryptodev-builtin.c b/backends/cryptodev-builtin.c index
> 0671bf9f3e..388aedd8df 100644
> --- a/backends/cryptodev-builtin.c
> +++ b/backends/cryptodev-builtin.c
> @@ -26,6 +26,7 @@
>  #include "qapi/error.h"
>  #include "standard-headers/linux/virtio_crypto.h"
>  #include "crypto/cipher.h"
> +#include "crypto/akcipher.h"
>  #include "qom/object.h"
> 
> 
> @@ -41,11 +42,12 @@ OBJECT_DECLARE_SIMPLE_TYPE(CryptoDevBackendBuiltin, CRYPTODEV_BACKEND_BUILTIN)
>  typedef struct CryptoDevBackendBuiltinSession {
>  QCryptoCipher *cipher;
>  uint8_t direction; /* encryption or decryption */
> -uint8_t type; /* cipher? hash? aead? */
> +uint8_t type; /* cipher? hash? aead? akcipher? */

Do you actually use the type for akcipher?

> +QCryptoAkCipher *akcipher;
>  QTAILQ_ENTRY(CryptoDevBackendBuiltinSession) next;  }
> CryptoDevBackendBuiltinSession;
> 
> -/* Max number of symmetric sessions */
> +/* Max number of symmetric/asymmetric sessions */
>  #define MAX_NUM_SESSIONS 256
> 
>  #define CRYPTODEV_BUITLIN_MAX_AUTH_KEY_LEN512
> @@ -80,15 +82,17 @@ static void cryptodev_builtin_init(
>  backend-

RE: [PATCH v7 0/9] Introduce akcipher service for virtio-crypto

2022-05-26 Thread Gonglei (Arei)


> -Original Message-
> From: Daniel P. Berrangé [mailto:berra...@redhat.com]
> Sent: Thursday, May 26, 2022 6:48 PM
> To: Lei He 
> Cc: m...@redhat.com; Gonglei (Arei) ;
> qemu-devel@nongnu.org; virtualizat...@lists.linux-foundation.org;
> linux-cry...@vger.kernel.org; jasow...@redhat.com; coh...@redhat.com;
> pizhen...@bytedance.com
> Subject: Re: [PATCH v7 0/9] Introduce akcipher service for virtio-crypto
> 
> I've sent a pull request containing all the crypto/ changes, as that covers 
> stuff I
> maintain. ie patches 2-8
> 
> Patches 1 and 9, I'll leave for MST to review & queue since the virtual 
> hardware
> is not my area of knowledge.
> 

Thanks for your work, Daniel.

Regards,
-Gonglei

> On Wed, May 25, 2022 at 05:01:09PM +0800, Lei He wrote:
> > v6 -> v7:
> > - Fix serval build errors for some specific platforms/configurations.
> > - Use '%zu' instead of '%lu' for size_t parameters.
> > - AkCipher-gcrypt: avoid setting wrong error messages when parsing RSA
> >   keys.
> > - AkCipher-benchmark: process constant amount of sign/verify instead
> > of running sign/verify for a constant duration.
> >
> > v5 -> v6:
> > - Fix build errors and codestyles.
> > - Add parameter 'Error **errp' for qcrypto_akcipher_rsakey_parse.
> > - Report more detailed errors.
> > - Fix buffer length check and return values of akcipher-nettle, allows
> > caller to  pass a buffer with larger size than actual needed.
> >
> > A million thanks to Daniel!
> >
> > v4 -> v5:
> > - Move QCryptoAkCipher into akcipherpriv.h, and modify the related
> comments.
> > - Rename asn1_decoder.c to der.c.
> > - Code style fix: use 'cleanup' & 'error' lables.
> > - Allow autoptr type to auto-free.
> > - Add test cases for rsakey to handle DER error.
> > - Other minor fixes.
> >
> > v3 -> v4:
> > - Coding style fix: Akcipher -> AkCipher, struct XXX -> XXX, Rsa ->
> > RSA, XXX-alg -> XXX-algo.
> > - Change version info in qapi/crypto.json, from 7.0 -> 7.1.
> > - Remove ecdsa from qapi/crypto.json, it would be introduced with the
> implemetion later.
> > - Use QCryptoHashAlgothrim instead of QCryptoRSAHashAlgorithm(removed)
> in qapi/crypto.json.
> > - Rename arguments of qcrypto_akcipher_XXX to keep aligned with
> qcrypto_cipher_XXX(dec/enc/sign/vefiry -> in/out/in2), and add
> qcrypto_akcipher_max_XXX APIs.
> > - Add new API: qcrypto_akcipher_supports.
> > - Change the return value of qcrypto_akcipher_enc/dec/sign, these functions
> return the actual length of result.
> > - Separate ASN.1 source code and test case clean.
> > - Disable RSA raw encoding for akcipher-nettle.
> > - Separate RSA key parser into rsakey.{hc}, and implememts it with
> builtin-asn1-decoder and nettle respectivly.
> > - Implement RSA(pkcs1 and raw encoding) algorithm by gcrypt. This has
> higher priority than nettle.
> > - For some akcipher operations(eg, decryption of pkcs1pad(rsa)), the
> > length of returned result maybe less than the dst buffer size, return
> > the actual length of result instead of the buffer length to the guest
> > side. (in function virtio_crypto_akcipher_input_data_helper)
> > - Other minor changes.
> >
> > Thanks to Daniel!
> >
> > Eric pointed out this missing part of use case, send it here again.
> >
> > In our plan, the feature is designed for HTTPS offloading case and other
> applications which use kernel RSA/ecdsa by keyctl syscall. The full picture
> shows bellow:
> >
> >
> >          Nginx/openssl[1] ... Apps
> > Guest   ---------------------------------
> >             virtio-crypto driver[2]
> >         ---------------------------------
> >             virtio-crypto backend[3]
> > Host    ---------------------------------
> >            /          |          \
> >      builtin[4]     vhost      keyctl[5] ...
> >
> >
> > [1] User applications can offload RSA calculation to kernel by keyctl 
> > syscall.
> There is no keyctl engine in openssl currently, we developed a engine and 
> tried
> to contribute it to openssl upstream, but openssl 1.x does not accept new
> feature. Link:
> >https://github.com/openssl/openssl/pull/16689
> >
> > This branch is available and maintained by Lei 
> >
> > https://github.com/TousakaRin/openssl/tree/OpenSSL_1_1_1-kctl_engine
> >
> > We tested nginx(change config file only) with openssl keyctl engine, it 
> > works
> fine.
> >
> > [2] virtio-crypto driver is used to

RE: [PATCH v2 1/3] virtio-crypto: header update

2022-02-17 Thread Gonglei (Arei)



> -Original Message-
> From: zhenwei pi [mailto:pizhen...@bytedance.com]
> Sent: Friday, February 11, 2022 4:44 PM
> To: Gonglei (Arei) ; m...@redhat.com
> Cc: jasow...@redhat.com; virtualizat...@lists.linux-foundation.org;
> linux-cry...@vger.kernel.org; qemu-devel@nongnu.org;
> helei.si...@bytedance.com; herb...@gondor.apana.org.au; zhenwei pi
> 
> Subject: [PATCH v2 1/3] virtio-crypto: header update
> 
> Update header from linux, support akcipher service.
> 
> Signed-off-by: lei he 
> Signed-off-by: zhenwei pi 
> ---
>  .../standard-headers/linux/virtio_crypto.h| 82 ++-
>  1 file changed, 81 insertions(+), 1 deletion(-)
> 

Reviewed-by: Gonglei 


> diff --git a/include/standard-headers/linux/virtio_crypto.h
> b/include/standard-headers/linux/virtio_crypto.h
> index 5ff0b4ee59..68066dafb6 100644
> --- a/include/standard-headers/linux/virtio_crypto.h
> +++ b/include/standard-headers/linux/virtio_crypto.h
> @@ -37,6 +37,7 @@
>  #define VIRTIO_CRYPTO_SERVICE_HASH   1
>  #define VIRTIO_CRYPTO_SERVICE_MAC    2
>  #define VIRTIO_CRYPTO_SERVICE_AEAD   3
> +#define VIRTIO_CRYPTO_SERVICE_AKCIPHER 4
> 
>  #define VIRTIO_CRYPTO_OPCODE(service, op)   (((service) << 8) | (op))
> 
> @@ -57,6 +58,10 @@ struct virtio_crypto_ctrl_header {
>  VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AEAD, 0x02)
> #define VIRTIO_CRYPTO_AEAD_DESTROY_SESSION \
>  VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AEAD, 0x03)
> +#define VIRTIO_CRYPTO_AKCIPHER_CREATE_SESSION \
> +VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AKCIPHER, 0x04)
> +#define VIRTIO_CRYPTO_AKCIPHER_DESTROY_SESSION \
> +VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AKCIPHER, 0x05)
>   uint32_t opcode;
>   uint32_t algo;
>   uint32_t flag;
> @@ -180,6 +185,58 @@ struct virtio_crypto_aead_create_session_req {
>   uint8_t padding[32];
>  };
> 
> +struct virtio_crypto_rsa_session_para {
> +#define VIRTIO_CRYPTO_RSA_RAW_PADDING   0
> +#define VIRTIO_CRYPTO_RSA_PKCS1_PADDING 1
> + uint32_t padding_algo;
> +
> +#define VIRTIO_CRYPTO_RSA_NO_HASH   0
> +#define VIRTIO_CRYPTO_RSA_MD2   1
> +#define VIRTIO_CRYPTO_RSA_MD3   2
> +#define VIRTIO_CRYPTO_RSA_MD4   3
> +#define VIRTIO_CRYPTO_RSA_MD5   4
> +#define VIRTIO_CRYPTO_RSA_SHA1  5
> +#define VIRTIO_CRYPTO_RSA_SHA256  6
> +#define VIRTIO_CRYPTO_RSA_SHA384  7
> +#define VIRTIO_CRYPTO_RSA_SHA512  8
> +#define VIRTIO_CRYPTO_RSA_SHA224  9
> + uint32_t hash_algo;
> +};
> +
> +struct virtio_crypto_ecdsa_session_para {
> +#define VIRTIO_CRYPTO_CURVE_UNKNOWN   0
> +#define VIRTIO_CRYPTO_CURVE_NIST_P192 1
> +#define VIRTIO_CRYPTO_CURVE_NIST_P224 2
> +#define VIRTIO_CRYPTO_CURVE_NIST_P256 3
> +#define VIRTIO_CRYPTO_CURVE_NIST_P384 4
> +#define VIRTIO_CRYPTO_CURVE_NIST_P521 5
> + uint32_t curve_id;
> + uint32_t padding;
> +};
> +
> +struct virtio_crypto_akcipher_session_para {
> +#define VIRTIO_CRYPTO_NO_AKCIPHER    0
> +#define VIRTIO_CRYPTO_AKCIPHER_RSA   1
> +#define VIRTIO_CRYPTO_AKCIPHER_DSA   2
> +#define VIRTIO_CRYPTO_AKCIPHER_ECDSA 3
> + uint32_t algo;
> +
> +#define VIRTIO_CRYPTO_AKCIPHER_KEY_TYPE_PUBLIC  1
> +#define VIRTIO_CRYPTO_AKCIPHER_KEY_TYPE_PRIVATE 2
> + uint32_t keytype;
> + uint32_t keylen;
> +
> + union {
> + struct virtio_crypto_rsa_session_para rsa;
> + struct virtio_crypto_ecdsa_session_para ecdsa;
> + } u;
> +};
> +
> +struct virtio_crypto_akcipher_create_session_req {
> + struct virtio_crypto_akcipher_session_para para;
> + uint8_t padding[36];
> +};
> +
>  struct virtio_crypto_alg_chain_session_para {
> #define VIRTIO_CRYPTO_SYM_ALG_CHAIN_ORDER_HASH_THEN_CIPHER  1
> #define VIRTIO_CRYPTO_SYM_ALG_CHAIN_ORDER_CIPHER_THEN_HASH  2
> @@ -247,6 +304,8 @@ struct virtio_crypto_op_ctrl_req {
>   mac_create_session;
>   struct virtio_crypto_aead_create_session_req
>   aead_create_session;
> + struct virtio_crypto_akcipher_create_session_req
> + akcipher_create_session;
>   struct virtio_crypto_destroy_session_req
>   destroy_session;
>   uint8_t padding[56];
> @@ -266,6 +325,14 @@ struct virtio_crypto_op_header {
>   VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AEAD, 0x00)
> #define VIRTIO_CRYPTO_AEAD_DECRYPT \
>   VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AEAD, 0x01)
> +#define VIRTIO_CRYPTO_AKCIPHER_ENCRYPT \
> + VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AKCIPHER, 0x00)
> +#define VIRTIO_CRYPTO_AKCIPHER_DECRYPT \
> + VIRTIO_CRYPTO_OPCODE(VIRTIO_CRYPTO_SERVICE_AKCIPHER, 0x01)

RE: [PATCH v2] MAINTAINERS: Change my email address

2021-12-14 Thread Gonglei (Arei)


> -Original Message-
> From: Daniel P. Berrangé [mailto:berra...@redhat.com]
> Sent: Tuesday, December 14, 2021 5:22 PM
> To: Philippe Mathieu-Daudé 
> Cc: Hailiang Zhang ;
> qemu-devel@nongnu.org; Gonglei (Arei) ;
> Wencongyang (HongMeng) ;
> dgilb...@redhat.com; quint...@redhat.com
> Subject: Re: [PATCH v2] MAINTAINERS: Change my email address
> 
> On Tue, Dec 14, 2021 at 10:04:03AM +0100, Philippe Mathieu-Daudé
> wrote:
> > On 12/14/21 08:54, Hailiang Zhang wrote:
> > > The zhang.zhanghaili...@huawei.com email address has been
> stopped.
> > > Change it to my new email address.
> > >
> > > Signed-off-by: Hailiang Zhang 
> > > ---
> > > hi Juan & Dave,
> > >
> > > Firstly, thank you for your working on maintaining the COLO
> framework.
> > > I didn't have much time on it in the past days.
> > >
> > > I may have some time in the next days since my job has changed.
> > >
> > > Because of my old email being stopped, i can not use it to send this
> patch.
> > > Please help me to merge this patch.
> >
> > Can we have an Ack-by from someone working at Huawei?
> 
> Why do we need that ? Subsystems are not owned by companies.
> 
> If someone moves company and wants to carry on in their existing role as
> maintainer that is fine and doesn't need approva from their old company
> IMHO.
> 

Agreed. I'm just confirming HaiLiang's identity. 

Acked-by: Gonglei 

Good luck, bro. @Hailiang

Thanks,
-Gonglei

> Regards,
> Daniel
> --
> |: https://berrange.com  -o-
> https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-
> https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-
> https://www.instagram.com/dberrange :|



RE: [PATCH 01/24] cryptodev-vhost-user: Register "chardev" as class property

2020-09-21 Thread Gonglei (Arei)



> -Original Message-
> From: Eduardo Habkost [mailto:ehabk...@redhat.com]
> Sent: Tuesday, September 22, 2020 6:10 AM
> To: qemu-devel@nongnu.org
> Cc: Paolo Bonzini ; Daniel P. Berrange
> ; John Snow ; Gonglei (Arei)
> 
> Subject: [PATCH 01/24] cryptodev-vhost-user: Register "chardev" as class
> property
> 
> Class properties make QOM introspection simpler and easier, as they don't
> require an object to be instantiated.
> 
> Signed-off-by: Eduardo Habkost 
> ---
> Cc: "Gonglei (Arei)" 
> Cc: qemu-devel@nongnu.org
> ---

Reviewed-by: Gonglei 

Regards,
-Gonglei

>  backends/cryptodev-vhost-user.c | 13 +
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/backends/cryptodev-vhost-user.c
> b/backends/cryptodev-vhost-user.c index 41089dede15..690738c6c95
> 100644
> --- a/backends/cryptodev-vhost-user.c
> +++ b/backends/cryptodev-vhost-user.c
> @@ -336,13 +336,6 @@ cryptodev_vhost_user_get_chardev(Object *obj,
> Error **errp)
>  return NULL;
>  }
> 
> -static void cryptodev_vhost_user_instance_int(Object *obj)
> -{
> -object_property_add_str(obj, "chardev",
> -cryptodev_vhost_user_get_chardev,
> -cryptodev_vhost_user_set_chardev);
> -}
> -
>  static void cryptodev_vhost_user_finalize(Object *obj)
>  {
>  CryptoDevBackendVhostUser *s =
> @@ -363,13 +356,17 @@ cryptodev_vhost_user_class_init(ObjectClass *oc,
> void *data)
>  bc->create_session = cryptodev_vhost_user_sym_create_session;
>  bc->close_session = cryptodev_vhost_user_sym_close_session;
>  bc->do_sym_op = NULL;
> +
> +object_class_property_add_str(oc, "chardev",
> +  cryptodev_vhost_user_get_chardev,
> +  cryptodev_vhost_user_set_chardev);
> +
>  }
> 
>  static const TypeInfo cryptodev_vhost_user_info = {
>  .name = TYPE_CRYPTODEV_BACKEND_VHOST_USER,
>  .parent = TYPE_CRYPTODEV_BACKEND,
>  .class_init = cryptodev_vhost_user_class_init,
> -.instance_init = cryptodev_vhost_user_instance_int,
>  .instance_finalize = cryptodev_vhost_user_finalize,
>  .instance_size = sizeof(CryptoDevBackendVhostUser),
>  };
> --
> 2.26.2




RE: [PATCH 02/24] cryptodev-backend: Register "chardev" as class property

2020-09-21 Thread Gonglei (Arei)



> -Original Message-
> From: Eduardo Habkost [mailto:ehabk...@redhat.com]
> Sent: Tuesday, September 22, 2020 6:10 AM
> To: qemu-devel@nongnu.org
> Cc: Paolo Bonzini ; Daniel P. Berrange
> ; John Snow ; Gonglei (Arei)
> 
> Subject: [PATCH 02/24] cryptodev-backend: Register "chardev" as class
> property
> 
> Class properties make QOM introspection simpler and easier, as they don't
> require an object to be instantiated.
> 
> Signed-off-by: Eduardo Habkost 
> ---
> Cc: "Gonglei (Arei)" 
> Cc: qemu-devel@nongnu.org
> ---
>  backends/cryptodev.c | 8 ----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 

Reviewed-by: Gonglei 

Regards,
-Gonglei


> diff --git a/backends/cryptodev.c b/backends/cryptodev.c index
> ada4ebe78b1..3f141f61ed6 100644
> --- a/backends/cryptodev.c
> +++ b/backends/cryptodev.c
> @@ -206,10 +206,6 @@ cryptodev_backend_can_be_deleted(UserCreatable
> *uc)
> 
>  static void cryptodev_backend_instance_init(Object *obj)
>  {
> -object_property_add(obj, "queues", "uint32",
> -  cryptodev_backend_get_queues,
> -  cryptodev_backend_set_queues,
> -  NULL, NULL);
>  /* Initialize devices' queues property to 1 */
>  object_property_set_int(obj, "queues", 1, NULL);
>  }
> @@ -230,6 +226,10 @@ cryptodev_backend_class_init(ObjectClass *oc, void *data)
>  ucc->can_be_deleted = cryptodev_backend_can_be_deleted;
> 
>  QTAILQ_INIT(&crypto_clients);
> +object_class_property_add(oc, "queues", "uint32",
> +  cryptodev_backend_get_queues,
> +  cryptodev_backend_set_queues,
> +  NULL, NULL);
>  }
> 
>  static const TypeInfo cryptodev_backend_info = {
> --
> 2.26.2




RE: [PATCH 05/46] virtio-crypto-pci: Tidy up virtio_crypto_pci_realize()

2020-06-27 Thread Gonglei (Arei)


> -Original Message-
> From: Markus Armbruster [mailto:arm...@redhat.com]
> Sent: Thursday, June 25, 2020 12:43 AM
> To: qemu-devel@nongnu.org
> Cc: pbonz...@redhat.com; berra...@redhat.com; ehabk...@redhat.com;
> qemu-bl...@nongnu.org; peter.mayd...@linaro.org;
> vsement...@virtuozzo.com; Gonglei (Arei) ;
> Michael S . Tsirkin 
> Subject: [PATCH 05/46] virtio-crypto-pci: Tidy up virtio_crypto_pci_realize()
> 
> virtio_crypto_pci_realize() continues after realization of its 
> "virtio-crypto-device"
> fails.  Only an object_property_set_link() follows; looks harmless to me.  
> Tidy
> up anyway: return after failure, just like virtio_rng_pci_realize() does.
> 
> Cc: "Gonglei (Arei)" 
> Cc: Michael S. Tsirkin 
> Signed-off-by: Markus Armbruster 
> ---
>  hw/virtio/virtio-crypto-pci.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 

Reviewed-by: Gonglei < arei.gong...@huawei.com>

> diff --git a/hw/virtio/virtio-crypto-pci.c b/hw/virtio/virtio-crypto-pci.c 
> index
> 72be531c95..0755722288 100644
> --- a/hw/virtio/virtio-crypto-pci.c
> +++ b/hw/virtio/virtio-crypto-pci.c
> @@ -54,7 +54,9 @@ static void virtio_crypto_pci_realize(VirtIOPCIProxy
> *vpci_dev, Error **errp)
>  }
> 
>  virtio_pci_force_virtio_1(vpci_dev);
> -qdev_realize(vdev, BUS(&vpci_dev->bus), errp);
> +if (!qdev_realize(vdev, BUS(&vpci_dev->bus), errp)) {
> +return;
> +}
>  object_property_set_link(OBJECT(vcrypto),
>   OBJECT(vcrypto->vdev.conf.cryptodev), "cryptodev",
>   NULL);
> --
> 2.26.2




RE: [PATCH v1 29/59] cryptodev-vhost.c: remove unneeded 'err' label in cryptodev_vhost_start

2020-01-07 Thread Gonglei (Arei)


> -Original Message-
> From: Daniel Henrique Barboza [mailto:danielhb...@gmail.com]
> Sent: Tuesday, January 7, 2020 2:24 AM
> To: qemu-devel@nongnu.org
> Cc: qemu-triv...@nongnu.org; Daniel Henrique Barboza
> ; Gonglei (Arei) 
> Subject: [PATCH v1 29/59] cryptodev-vhost.c: remove unneeded 'err' label in
> cryptodev_vhost_start
> 
> 'err' can be replaced by 'return r'.
> 
> CC: Gonglei 
> Signed-off-by: Daniel Henrique Barboza 
> ---
>  backends/cryptodev-vhost.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 

Reviewed-by: Gonglei 


> diff --git a/backends/cryptodev-vhost.c b/backends/cryptodev-vhost.c index
> 8337c9a495..907ca21fa7 100644
> --- a/backends/cryptodev-vhost.c
> +++ b/backends/cryptodev-vhost.c
> @@ -201,7 +201,7 @@ int cryptodev_vhost_start(VirtIODevice *dev, int
> total_queues)
>  r = k->set_guest_notifiers(qbus->parent, total_queues, true);
>  if (r < 0) {
>  error_report("error binding guest notifier: %d", -r);
> -goto err;
> +return r;
>  }
> 
>  for (i = 0; i < total_queues; i++) { @@ -236,7 +236,7 @@ err_start:
>  if (e < 0) {
>  error_report("vhost guest notifier cleanup failed: %d", e);
>  }
> -err:
> +
>  return r;
>  }
> 
> --
> 2.24.1




RE: [PATCH v6] backends/cryptodev: drop local_err from cryptodev_backend_complete()

2019-11-27 Thread Gonglei (Arei)
CCing qemu-triv...@nongnu.org

Reviewed-by: Gonglei 


Regards,
-Gonglei

> -Original Message-
> From: Vladimir Sementsov-Ogievskiy [mailto:vsement...@virtuozzo.com]
> Sent: Thursday, November 28, 2019 3:46 AM
> To: qemu-devel@nongnu.org
> Cc: Gonglei (Arei) ; marcandre.lur...@gmail.com;
> phi...@redhat.com; vsement...@virtuozzo.com
> Subject: [PATCH v6] backends/cryptodev: drop local_err from
> cryptodev_backend_complete()
> 
> No reason for local_err here, use errp directly instead.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> Reviewed-by: Philippe Mathieu-Daudé 
> Reviewed-by: Marc-André Lureau 
> ---
> 
> v6: add r-b by Philippe and Marc-André
> 
>  backends/cryptodev.c | 11 +--
>  1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/backends/cryptodev.c b/backends/cryptodev.c index
> 3c071eab95..5a9735684e 100644
> --- a/backends/cryptodev.c
> +++ b/backends/cryptodev.c
> @@ -176,19 +176,10 @@ cryptodev_backend_complete(UserCreatable *uc,
> Error **errp)  {
>  CryptoDevBackend *backend = CRYPTODEV_BACKEND(uc);
>  CryptoDevBackendClass *bc = CRYPTODEV_BACKEND_GET_CLASS(uc);
> -Error *local_err = NULL;
> 
>  if (bc->init) {
> -bc->init(backend, &local_err);
> -if (local_err) {
> -goto out;
> -}
> +bc->init(backend, errp);
>  }
> -
> -return;
> -
> -out:
> -error_propagate(errp, local_err);
>  }
> 
>  void cryptodev_backend_set_used(CryptoDevBackend *backend, bool used)
> --
> 2.21.0



Re: [Qemu-devel] [PATCH] backends: cryptodev: fix oob access issue

2019-03-17 Thread Gonglei (Arei)
Hi Michael,

Could you please apply this patch to your tree?

Thanks,
-Gonglei


> -Original Message-
> From: Li Qiang [mailto:liq...@163.com]
> Sent: Monday, March 18, 2019 9:12 AM
> To: Gonglei (Arei) 
> Cc: qemu-devel@nongnu.org; Li Qiang 
> Subject: [PATCH] backends: cryptodev: fix oob access issue
> 
> The 'queue_index' of create/close_session function
> is from guest and can be exceed 'MAX_CRYPTO_QUEUE_NUM'.
> This leads oob access. This patch avoid this.
> 
> Signed-off-by: Li Qiang 
> ---
>  backends/cryptodev-builtin.c| 4 
>  backends/cryptodev-vhost-user.c | 4 
>  2 files changed, 8 insertions(+)
> 

Reviewed-by: Gonglei 


> diff --git a/backends/cryptodev-builtin.c b/backends/cryptodev-builtin.c
> index 9fb0bd57a6..c3a65b2f5f 100644
> --- a/backends/cryptodev-builtin.c
> +++ b/backends/cryptodev-builtin.c
> @@ -249,6 +249,8 @@ static int64_t cryptodev_builtin_sym_create_session(
> CryptoDevBackendSymSessionInfo *sess_info,
> uint32_t queue_index, Error **errp)
>  {
> +assert(queue_index < MAX_CRYPTO_QUEUE_NUM);
> +
>  CryptoDevBackendBuiltin *builtin =
>CRYPTODEV_BACKEND_BUILTIN(backend);
>  int64_t session_id = -1;
> @@ -280,6 +282,8 @@ static int cryptodev_builtin_sym_close_session(
> uint64_t session_id,
> uint32_t queue_index, Error **errp)
>  {
> +assert(queue_index < MAX_CRYPTO_QUEUE_NUM);
> +
>  CryptoDevBackendBuiltin *builtin =
>CRYPTODEV_BACKEND_BUILTIN(backend);
> 
> diff --git a/backends/cryptodev-vhost-user.c b/backends/cryptodev-vhost-user.c
> index 1052a5d0e9..36a40eeb4d 100644
> --- a/backends/cryptodev-vhost-user.c
> +++ b/backends/cryptodev-vhost-user.c
> @@ -236,6 +236,8 @@ static int64_t
> cryptodev_vhost_user_sym_create_session(
> CryptoDevBackendSymSessionInfo *sess_info,
> uint32_t queue_index, Error **errp)
>  {
> +assert(queue_index < MAX_CRYPTO_QUEUE_NUM);
> +
>  CryptoDevBackendClient *cc =
> backend->conf.peers.ccs[queue_index];
>  CryptoDevBackendVhost *vhost_crypto;
> @@ -262,6 +264,8 @@ static int cryptodev_vhost_user_sym_close_session(
> uint64_t session_id,
> uint32_t queue_index, Error **errp)
>  {
> +assert(queue_index < MAX_CRYPTO_QUEUE_NUM);
> +
>  CryptoDevBackendClient *cc =
>backend->conf.peers.ccs[queue_index];
>  CryptoDevBackendVhost *vhost_crypto;
> --
> 2.17.1
> 




Re: [Qemu-devel] [PATCH] cryptodev-vhost-user: fix a oob access

2019-03-17 Thread Gonglei (Arei)
Hi,

> -Original Message-
> From: Li Qiang [mailto:liq...@163.com]
> Sent: Sunday, March 17, 2019 5:10 PM
> To: Gonglei (Arei) 
> Cc: qemu-devel@nongnu.org; Li Qiang 
> Subject: [PATCH] cryptodev-vhost-user: fix a oob access
> 
> The 'queue_index' of the create/close_session functions
> comes from the guest and can exceed 'MAX_CRYPTO_QUEUE_NUM'.
> This leads to OOB access. This patch avoids it.
> 
> Signed-off-by: Li Qiang 
> ---
>  backends/cryptodev-vhost-user.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/backends/cryptodev-vhost-user.c b/backends/cryptodev-vhost-user.c
> index 1052a5d0e9..36a40eeb4d 100644
> --- a/backends/cryptodev-vhost-user.c
> +++ b/backends/cryptodev-vhost-user.c
> @@ -236,6 +236,8 @@ static int64_t
> cryptodev_vhost_user_sym_create_session(
> CryptoDevBackendSymSessionInfo *sess_info,
> uint32_t queue_index, Error **errp)
>  {
> +assert(queue_index < MAX_CRYPTO_QUEUE_NUM);
> +
>  CryptoDevBackendClient *cc =
> backend->conf.peers.ccs[queue_index];
>  CryptoDevBackendVhost *vhost_crypto;
> @@ -262,6 +264,8 @@ static int cryptodev_vhost_user_sym_close_session(
> uint64_t session_id,
> uint32_t queue_index, Error **errp)
>  {
> +assert(queue_index < MAX_CRYPTO_QUEUE_NUM);
> +
>  CryptoDevBackendClient *cc =
>backend->conf.peers.ccs[queue_index];
>  CryptoDevBackendVhost *vhost_crypto;
> --
> 2.17.1
> 

Pls add an assertion for the cryptodev-builtin backend as well, even though
queue_index isn't used there currently.

Thanks,
-Gonglei




Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)
> >
> > > -Original Message-
> > > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > > Sent: Thursday, February 21, 2019 10:05 AM
> > > To: Gonglei (Arei) 
> > > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > > coh...@redhat.com; shuangtai@alibaba-inc.com;
> dgilb...@redhat.com;
> > > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > > jonathan.dav...@nutanix.com; changpeng@intel.com;
> ken@amd.com;
> > > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > > k...@vger.kernel.org
> > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > >
> > > > >
> > > > > > 5) About log sync, why not register log_global_start/stop in
> > > > > vfio_memory_listener?
> > > > > >
> > > > > >
> > > > > seems log_global_start/stop cannot be iterately called in pre-copy
> phase?
> > > > > for dirty pages in system memory, it's better to transfer dirty data
> > > > > iteratively to reduce down time, right?
> > > > >
> > > >
> > > > We just need invoking only once for start and stop logging. Why we need
> to
> > > call
> > > > them literately? See memory_listener of vhost.
> > > >
> > > the dirty pages in system memory produces by device is incremental.
> > > if it can be got iteratively, the dirty pages in stop-and-copy phase can 
> > > be
> > > minimal.
> > > :)
> > >
> > I mean starting or stopping the capability of logging, not log sync.
> >
> > We register the below callbacks:
> >
> > .log_sync = vfio_log_sync,
> > .log_global_start = vfio_log_global_start,
> > .log_global_stop = vfio_log_global_stop,
> >
> .log_global_start is also a good point to notify logging state.
> But if notifying in .save_setup handler, we can do fine-grained
> control of when to notify of logging starting together with get_buffer
> operation.
> Is there any special benefit in registering .log_global_start/stop?
> 

There is a performance benefit when one VM has multiple identical vfio devices.


Regards,
-Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)







> -Original Message-
> From: Zhao Yan [mailto:yan.y.z...@intel.com]
> Sent: Thursday, February 21, 2019 12:08 PM
> To: Gonglei (Arei) 
> Cc: c...@nvidia.com; k...@vger.kernel.org; a...@ozlabs.ru;
> zhengxiao...@alibaba-inc.com; shuangtai@alibaba-inc.com;
> qemu-devel@nongnu.org; kwankh...@nvidia.com; eau...@redhat.com;
> yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> mlevi...@redhat.com; pa...@linux.ibm.com; fel...@nutanix.com;
> ken@amd.com; kevin.t...@intel.com; dgilb...@redhat.com;
> alex.william...@redhat.com; intel-gvt-...@lists.freedesktop.org;
> changpeng@intel.com; coh...@redhat.com; zhi.a.w...@intel.com;
> jonathan.dav...@nutanix.com
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> On Thu, Feb 21, 2019 at 03:33:24AM +, Gonglei (Arei) wrote:
> >
> > > -Original Message-
> > > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > > Sent: Thursday, February 21, 2019 9:59 AM
> > > To: Gonglei (Arei) 
> > > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > > coh...@redhat.com; shuangtai@alibaba-inc.com;
> dgilb...@redhat.com;
> > > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > > jonathan.dav...@nutanix.com; changpeng@intel.com;
> ken@amd.com;
> > > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > > k...@vger.kernel.org
> > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > >
> > > On Thu, Feb 21, 2019 at 01:35:43AM +, Gonglei (Arei) wrote:
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > > > > Sent: Thursday, February 21, 2019 8:25 AM
> > > > > To: Gonglei (Arei) 
> > > > > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > > > > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > > > > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > > > > coh...@redhat.com; shuangtai@alibaba-inc.com;
> > > dgilb...@redhat.com;
> > > > > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > > > > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > > > > jonathan.dav...@nutanix.com; changpeng@intel.com;
> > > ken@amd.com;
> > > > > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > > > > k...@vger.kernel.org
> > > > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > > > >
> > > > > On Wed, Feb 20, 2019 at 11:56:01AM +, Gonglei (Arei) wrote:
> > > > > > Hi yan,
> > > > > >
> > > > > > Thanks for your work.
> > > > > >
> > > > > > I have some suggestions or questions:
> > > > > >
> > > > > > 1) Would you add msix mode support,? if not, pls add a check in
> > > > > vfio_pci_save_config(), likes Nvidia's solution.
> > > > > ok.
> > > > >
> > > > > > 2) We should start vfio devices before vcpu resumes, so we can't 
> > > > > > rely
> on
> > > vm
> > > > > start change handler completely.
> > > > > vfio devices is by default set to running state.
> > > > > In the target machine, its state transition flow is
> running->stop->running.
> > > >
> > > > That's confusing. We should start vfio devices after vfio_load_state,
> > > otherwise
> > > > how can you keep the devices' information are the same between source
> side
> > > > and destination side?
> > > >
> > > so, your meaning is to set device state to running in the first call to
> > > vfio_load_state?
> > >
> > No, it should start devices after vfio_load_state and before vcpu resuming.
> >
> 
> What about set device state to running in load_cleanup handler ?
> 

The timing is fine, but you should also think about whether the device state
should be set to running in failure branches when calling the load_cleanup
handler.

Regards,
-Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)


> -Original Message-
> From: Zhao Yan [mailto:yan.y.z...@intel.com]
> Sent: Thursday, February 21, 2019 9:59 AM
> To: Gonglei (Arei) 
> Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> k...@vger.kernel.org
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> On Thu, Feb 21, 2019 at 01:35:43AM +, Gonglei (Arei) wrote:
> >
> >
> > > -Original Message-
> > > From: Zhao Yan [mailto:yan.y.z...@intel.com]
> > > Sent: Thursday, February 21, 2019 8:25 AM
> > > To: Gonglei (Arei) 
> > > Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> > > intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> > > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > > coh...@redhat.com; shuangtai@alibaba-inc.com;
> dgilb...@redhat.com;
> > > zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> > > a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> > > jonathan.dav...@nutanix.com; changpeng@intel.com;
> ken@amd.com;
> > > kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> > > k...@vger.kernel.org
> > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > >
> > > On Wed, Feb 20, 2019 at 11:56:01AM +, Gonglei (Arei) wrote:
> > > > Hi yan,
> > > >
> > > > Thanks for your work.
> > > >
> > > > I have some suggestions or questions:
> > > >
> > > > 1) Would you add msix mode support,? if not, pls add a check in
> > > vfio_pci_save_config(), likes Nvidia's solution.
> > > ok.
> > >
> > > > 2) We should start vfio devices before vcpu resumes, so we can't rely on
> vm
> > > start change handler completely.
> > > vfio devices is by default set to running state.
> > > In the target machine, its state transition flow is 
> > > running->stop->running.
> >
> > That's confusing. We should start vfio devices after vfio_load_state,
> otherwise
> > how can you keep the devices' information are the same between source side
> > and destination side?
> >
> so, your meaning is to set device state to running in the first call to
> vfio_load_state?
> 
No, it should start the devices after vfio_load_state and before the vCPUs
resume.

> > > so, maybe you can ignore the stop notification in kernel?
> > > > 3) We'd better support live migration rollback since have many failure
> > > scenarios,
> > > >  register a migration notifier is a good choice.
> > > I think this patchset can also handle the failure case well.
> > > if migration failure or cancelling happens,
> > > in cleanup handler, LOGGING state is cleared. device state(running or
> > > stopped) keeps as it is).
> >
> > IIRC there're many failure paths don't calling cleanup handler.
> >
> could you take an example?

Never mind, that's another bug I think. 

> > > then,
> > > if vm switches back to running, device state will be set to running;
> > > if vm stayes at stopped state, device state is also stopped (it has no
> > > meaning to let it in running state).
> > > Do you think so ?
> > >
> > IF the underlying state machine is complicated,
> > We should tell the canceling state to vendor driver proactively.
> >
> That makes sense.
> 
> > > > 4) Four memory region for live migration is too complicated IMHO.
> > > one big region requires the sub-regions well padded.
> > > like for the first control fields, they have to be padded to 4K.
> > > the same for other data fields.
> > > Otherwise, mmap simply fails, because the start-offset and size for mmap
> > > both need to be PAGE aligned.
> > >
> > But if we don't need use mmap for control filed and device state, they are
> small basically.
> > The performance is enough using pread/pwrite.
> >
> We don't mmap control fields. But if data fields go immediately after
> control fields (e.g. just 64 bytes), we can't mmap the data fields
> successfully because their start offset is 64. Therefore control fields have
> to be padded to 4k to let data fields start from 4k.
> That's the drawback of one big region holding both control and data fields.
> 
> > > Also, 4 regions is clearer in my view :)
> > >
> > > > 5) About log sync, why not register log_global_start/stop in
> > > vfio_memory_listener?
> > > >
> > > >
> > > seems log_global_start/stop cannot be iterately called in pre-copy phase?
> > > for dirty pages in system memory, it's better to transfer dirty data
> > > iteratively to reduce down time, right?
> > >
> >
> > We just need invoking only once for start and stop logging. Why we need to
> call
> > them literately? See memory_listener of vhost.
> >
> 
> 
> 
> > Regards,
> > -Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)




> -Original Message-
> From: Zhao Yan [mailto:yan.y.z...@intel.com]
> Sent: Thursday, February 21, 2019 10:05 AM
> To: Gonglei (Arei) 
> Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> k...@vger.kernel.org
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> > >
> > > > 5) About log sync, why not register log_global_start/stop in
> > > vfio_memory_listener?
> > > >
> > > >
> > > seems log_global_start/stop cannot be iterately called in pre-copy phase?
> > > for dirty pages in system memory, it's better to transfer dirty data
> > > iteratively to reduce down time, right?
> > >
> >
> > We just need invoking only once for start and stop logging. Why we need to
> call
> > them literately? See memory_listener of vhost.
> >
> the dirty pages in system memory produced by the device are incremental.
> if they can be got iteratively, the dirty pages in the stop-and-copy phase
> can be minimal.
> :)
> 
I mean starting or stopping the logging capability, not log sync.

We register the below callbacks:

.log_sync = vfio_log_sync,
.log_global_start = vfio_log_global_start,
.log_global_stop = vfio_log_global_stop,
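
For illustration, wiring them up would look roughly like this (the handler
bodies and the registration point are my assumptions, not actual code):

/* Sketch only: dirty-logging callbacks hung off a MemoryListener. The
 * memory core invokes log_global_start/stop exactly once around
 * migration, and log_sync iteratively while it runs. */
static void vfio_log_sync(MemoryListener *listener,
                          MemoryRegionSection *section)
{
    /* query the vendor driver's dirty bitmap for this section */
}

static void vfio_log_global_start(MemoryListener *listener)
{
    /* tell the vendor driver to start dirty tracking (called once) */
}

static void vfio_log_global_stop(MemoryListener *listener)
{
    /* tell the vendor driver to stop dirty tracking (called once) */
}

static MemoryListener vfio_memory_listener = {
    .log_sync         = vfio_log_sync,
    .log_global_start = vfio_log_global_start,
    .log_global_stop  = vfio_log_global_stop,
};

static void vfio_listener_init(void)
{
    memory_listener_register(&vfio_memory_listener, &address_space_memory);
}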

Regards,
-Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)



> -Original Message-
> From: Zhao Yan [mailto:yan.y.z...@intel.com]
> Sent: Thursday, February 21, 2019 8:25 AM
> To: Gonglei (Arei) 
> Cc: alex.william...@redhat.com; qemu-devel@nongnu.org;
> intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com;
> k...@vger.kernel.org
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> On Wed, Feb 20, 2019 at 11:56:01AM +, Gonglei (Arei) wrote:
> > Hi yan,
> >
> > Thanks for your work.
> >
> > I have some suggestions or questions:
> >
> > 1) Would you add msix mode support,? if not, pls add a check in
> vfio_pci_save_config(), likes Nvidia's solution.
> ok.
> 
> > 2) We should start vfio devices before vcpu resumes, so we can't rely on vm
> start change handler completely.
> vfio devices is by default set to running state.
> In the target machine, its state transition flow is running->stop->running.

That's confusing. We should start vfio devices after vfio_load_state; otherwise
how can you keep the devices' information the same between the source side
and the destination side?

> so, maybe you can ignore the stop notification in kernel?
> > 3) We'd better support live migration rollback since have many failure
> scenarios,
> >  register a migration notifier is a good choice.
> I think this patchset can also handle the failure case well.
> if migration failure or cancelling happens,
> in the cleanup handler, the LOGGING state is cleared. The device state (running
> or stopped) keeps as it is.

IIRC there are many failure paths that don't call the cleanup handler.

> then,
> if vm switches back to running, device state will be set to running;
> if vm stays at stopped state, device state is also stopped (it has no
> meaning to let it in running state).
> Do you think so ?
> 
If the underlying state machine is complicated,
we should proactively tell the vendor driver about the cancelling state.

> > 4) Four memory region for live migration is too complicated IMHO.
> one big region requires the sub-regions well padded.
> like for the first control fields, they have to be padded to 4K.
> the same for other data fields.
> Otherwise, mmap simply fails, because the start-offset and size for mmap
> both need to be PAGE aligned.
> 
But if we don't need to use mmap for the control field and device state, they
are basically small.
pread/pwrite performance is sufficient for them.

> Also, 4 regions is clearer in my view :)
> 
> > 5) About log sync, why not register log_global_start/stop in
> vfio_memory_listener?
> >
> >
> seems log_global_start/stop cannot be iteratively called in pre-copy phase?
> for dirty pages in system memory, it's better to transfer dirty data
> iteratively to reduce down time, right?
> 

We just need to invoke them once to start and stop logging. Why would we need
to call them iteratively? See the memory_listener of vhost.

Regards,
-Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)



> -Original Message-
> From: Cornelia Huck [mailto:coh...@redhat.com]
> Sent: Wednesday, February 20, 2019 7:43 PM
> To: Gonglei (Arei) 
> Cc: Dr. David Alan Gilbert ; Zhao Yan
> ; c...@nvidia.com; k...@vger.kernel.org;
> a...@ozlabs.ru; zhengxiao...@alibaba-inc.com; shuangtai@alibaba-inc.com;
> qemu-devel@nongnu.org; kwankh...@nvidia.com; eau...@redhat.com;
> yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> mlevi...@redhat.com; pa...@linux.ibm.com; fel...@nutanix.com;
> ken@amd.com; kevin.t...@intel.com; alex.william...@redhat.com;
> intel-gvt-...@lists.freedesktop.org; changpeng@intel.com;
> zhi.a.w...@intel.com; jonathan.dav...@nutanix.com
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> On Wed, 20 Feb 2019 11:28:46 +
> "Gonglei (Arei)"  wrote:
> 
> > > -Original Message-
> > > From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com]
> > > Sent: Wednesday, February 20, 2019 7:02 PM
> > > To: Zhao Yan 
> > > Cc: c...@nvidia.com; k...@vger.kernel.org; a...@ozlabs.ru;
> > > zhengxiao...@alibaba-inc.com; shuangtai@alibaba-inc.com;
> > > qemu-devel@nongnu.org; kwankh...@nvidia.com; eau...@redhat.com;
> > > yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> > > mlevi...@redhat.com; pa...@linux.ibm.com; Gonglei (Arei)
> > > ; fel...@nutanix.com; ken@amd.com;
> > > kevin.t...@intel.com; alex.william...@redhat.com;
> > > intel-gvt-...@lists.freedesktop.org; changpeng@intel.com;
> > > coh...@redhat.com; zhi.a.w...@intel.com;
> jonathan.dav...@nutanix.com
> > > Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> > >
> > > * Zhao Yan (yan.y.z...@intel.com) wrote:
> > > > On Tue, Feb 19, 2019 at 11:32:13AM +, Dr. David Alan Gilbert wrote:
> > > > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > > > This patchset enables VFIO devices to have live migration 
> > > > > > capability.
> > > > > > Currently it does not support post-copy phase.
> > > > > >
> > > > > > It follows Alex's comments on last version of VFIO live migration
> patches,
> > > > > > including device states, VFIO device state region layout, dirty 
> > > > > > bitmap's
> > > > > > query.
> 
> > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > version of device?  Or say to a device with older firmware or perhaps
> > > > > a device that has less device memory ?
> > > > Actually it's still an open for VFIO migration. Need to think about
> > > > whether it's better to check that in libvirt or qemu (like a device 
> > > > magic
> > > > along with verion ?).
> >
> > We must keep the hardware generation is the same with one POD of public
> cloud
> > providers. But we still think about the live migration between from the the
> lower
> > generation of hardware migrated to the higher generation.
> 
> Agreed, lower->higher is the one direction that might make sense to
> support.
> 
> But regardless of that, I think we need to make sure that incompatible
> devices/versions fail directly instead of failing in a subtle, hard to
> debug way. Might be useful to do some initial sanity checks in libvirt
> as well.
> 
> How easy is it to obtain that information in a form that can be
> consumed by higher layers? Can we find out the device type at least?
> What about some kind of revision?

We can provide an interface to query whether the VM supports live migration
in the prepare phase of libvirt.

Can we get the revision_id from the vendor driver before invoking

register_savevm_live(NULL, TYPE_VFIO_PCI, -1,
                     revision_id,
                     &savevm_vfio_handlers,
                     vdev);

and then disallow live migration from higher generations to lower generations?
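
For illustration (vfio_get_vendor_revision() is a hypothetical helper): since
the loading side rejects an incoming section whose version_id is newer than
the locally registered one, passing the vendor revision as version_id would
refuse higher-gen to lower-gen migration automatically.

/* Sketch only: use the vendor-reported revision as the version_id. */
uint32_t revision_id = vfio_get_vendor_revision(vbasedev); /* hypothetical */

register_savevm_live(NULL, TYPE_VFIO_PCI, -1,
                     revision_id,
                     &savevm_vfio_handlers,
                     vdev);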

Regards,
-Gonglei



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)
Hi yan,

Thanks for your work.

I have some suggestions or questions:

1) Would you add MSI-X mode support? If not, pls add a check in
vfio_pci_save_config(), like Nvidia's solution.
2) We should start vfio devices before the vCPUs resume, so we can't rely on
the vm start change handler completely.
3) We'd better support live migration rollback, since there are many failure
scenarios; registering a migration notifier is a good choice (see the sketch
below).
4) Four memory regions for live migration is too complicated IMHO.
5) About log sync, why not register log_global_start/stop in
vfio_memory_listener?
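
For point 3), a rough sketch of what I mean (the handler name and the
rollback action are illustrative, not existing code):

/* Sketch only: react to migration failure/cancel so the device can be
 * rolled back to its normal running state. */
static void vfio_migration_state_notifier(Notifier *notifier, void *data)
{
    MigrationState *s = data;

    if (s->state == MIGRATION_STATUS_FAILED ||
        s->state == MIGRATION_STATUS_CANCELLED) {
        /* tell the vendor driver to clear LOGGING and resume RUNNING */
    }
}

static Notifier vfio_migration_state = {
    .notify = vfio_migration_state_notifier,
};

static void vfio_migration_register_notifier(void)
{
    add_migration_state_change_notifier(&vfio_migration_state);
}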


Regards,
-Gonglei


> -Original Message-
> From: Yan Zhao [mailto:yan.y.z...@intel.com]
> Sent: Tuesday, February 19, 2019 4:51 PM
> To: alex.william...@redhat.com; qemu-devel@nongnu.org
> Cc: intel-gvt-...@lists.freedesktop.org; zhengxiao...@alibaba-inc.com;
> yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> a...@ozlabs.ru; eau...@redhat.com; fel...@nutanix.com;
> jonathan.dav...@nutanix.com; changpeng@intel.com; ken@amd.com;
> kwankh...@nvidia.com; kevin.t...@intel.com; c...@nvidia.com; Gonglei (Arei)
> ; k...@vger.kernel.org; Yan Zhao
> 
> Subject: [PATCH 0/5] QEMU VFIO live migration
> 
> This patchset enables VFIO devices to have live migration capability.
> Currently it does not support post-copy phase.
> 
> It follows Alex's comments on last version of VFIO live migration patches,
> including device states, VFIO device state region layout, dirty bitmap's
> query.
> 
> Device Data
> ---
> Device data is divided into three types: device memory, device config,
> and system memory dirty pages produced by device.
> 
> Device config: data like MMIOs, page tables...
> Every device is supposed to possess device config data.
>   Usually device config's size is small (no bigger than 10M), and it
> needs to be loaded in certain strict order.
> Therefore, device config only needs to be saved/loaded in
> stop-and-copy phase.
> The data of device config is held in device config region.
> Size of device config data is smaller than or equal to that of
> device config region.
> 
> Device Memory: device's internal memory, standalone and outside system
> memory. It is usually very big.
> This kind of data needs to be saved / loaded in pre-copy and
> stop-and-copy phase.
> The data of device memory is held in device memory region.
> Size of device memory is usually larger than that of the device
> memory region. qemu needs to save/load it in chunks of size of
> device memory region.
> Not all devices have device memory. IGD, for example, only uses system memory.
> 
> System memory dirty pages: If a device produces dirty pages in system
> memory, it is able to get dirty bitmap for certain range of system
> memory. This dirty bitmap is queried in pre-copy and stop-and-copy
> phase in .log_sync callback. By setting dirty bitmap in .log_sync
> callback, dirty pages in system memory will be save/loaded by ram's
> live migration code.
> The dirty bitmap of system memory is held in dirty bitmap region.
> If system memory range is larger than that dirty bitmap region can
> hold, qemu will cut it into several chunks and get dirty bitmap in
> succession.
> 
> 
> Device State Regions
> 
> Vendor driver is required to expose two mandatory regions and another two
> optional regions if it plans to support device state management.
> 
> So, there are up to four regions in total.
> One control region: mandatory.
> Get access via read/write system call.
> Its layout is defined in struct vfio_device_state_ctl
> Three data regions: mmaped into qemu.
> device config region: mandatory, holding data of device config
> device memory region: optional, holding data of device memory
> dirty bitmap region: optional, holding bitmap of system memory
> dirty pages
> 
> (The reason why four separate regions are defined is that the unit of mmap
> system call is PAGE_SIZE, i.e. 4k bytes. So one read/write region for
> control and three mmaped regions for data seems better than one big region
> padded and sparse mmaped).
> 
> 
> kernel device state interface [1]
> --
> #define VFIO_DEVICE_STATE_INTERFACE_VERSION 1
> #define VFIO_DEVICE_DATA_CAP_DEVICE_MEMORY 1
> #define VFIO_DEVICE_DATA_CAP_SYSTEM_MEMORY 2
> 
> #define VFIO_DEVICE_STATE_RUNNING 0
> 

Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-02-20 Thread Gonglei (Arei)


> -Original Message-
> From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com]
> Sent: Wednesday, February 20, 2019 7:02 PM
> To: Zhao Yan 
> Cc: c...@nvidia.com; k...@vger.kernel.org; a...@ozlabs.ru;
> zhengxiao...@alibaba-inc.com; shuangtai@alibaba-inc.com;
> qemu-devel@nongnu.org; kwankh...@nvidia.com; eau...@redhat.com;
> yi.l@intel.com; eskul...@redhat.com; ziye.y...@intel.com;
> mlevi...@redhat.com; pa...@linux.ibm.com; Gonglei (Arei)
> ; fel...@nutanix.com; ken@amd.com;
> kevin.t...@intel.com; alex.william...@redhat.com;
> intel-gvt-...@lists.freedesktop.org; changpeng@intel.com;
> coh...@redhat.com; zhi.a.w...@intel.com; jonathan.dav...@nutanix.com
> Subject: Re: [PATCH 0/5] QEMU VFIO live migration
> 
> * Zhao Yan (yan.y.z...@intel.com) wrote:
> > On Tue, Feb 19, 2019 at 11:32:13AM +, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > This patchset enables VFIO devices to have live migration capability.
> > > > Currently it does not support post-copy phase.
> > > >
> > > > It follows Alex's comments on last version of VFIO live migration 
> > > > patches,
> > > > including device states, VFIO device state region layout, dirty bitmap's
> > > > query.
> > >
> > > Hi,
> > >   I've sent minor comments to later patches; but some minor general
> > > comments:
> > >
> > >   a) Never trust the incoming migrations stream - it might be corrupt,
> > > so check when you can.
> > hi Dave
> > Thanks for this suggestion. I'll add more checks for migration streams.
> >
> >
> > >   b) How do we detect if we're migrating from/to the wrong device or
> > > version of device?  Or say to a device with older firmware or perhaps
> > > a device that has less device memory ?
> > Actually it's still an open for VFIO migration. Need to think about
> > whether it's better to check that in libvirt or qemu (like a device magic
> > along with verion ?).

We must keep the hardware generation the same within one POD for public cloud
providers. But we are still thinking about live migration from the lower
generation of hardware to the higher generation.

> > This patchset is intended to settle down the main device state interfaces
> > for VFIO migration. So that we can work on that and improve it.
> >
> >
> > >   c) Consider using the trace_ mechanism - it's really useful to
> > > add to loops writing/reading data so that you can see when it fails.
> > >
> > > Dave
> > >
> > Got it. many thanks~~
> >
> >
> > > (P.S. You have a few typo's grep your code for 'devcie', 'devie' and
> > > 'migrtion'
> >
> > sorry :)
> 
> No problem.
> 
> Given the mails, I'm guessing you've mostly tested this on graphics
> devices?  Have you also checked with VFIO network cards?
> 
> Also see the mail I sent in reply to Kirti's series; we need to boil
> these down to one solution.
> 
> Dave
> 
> > >
> > > > Device Data
> > > > ---
> > > > Device data is divided into three types: device memory, device config,
> > > > and system memory dirty pages produced by device.
> > > >
> > > > Device config: data like MMIOs, page tables...
> > > > Every device is supposed to possess device config data.
> > > > Usually device config's size is small (no big than 10M), and it
> > > > needs to be loaded in certain strict order.
> > > > Therefore, device config only needs to be saved/loaded in
> > > > stop-and-copy phase.
> > > > The data of device config is held in device config region.
> > > > Size of device config data is smaller than or equal to that of
> > > > device config region.
> > > >
> > > > Device Memory: device's internal memory, standalone and outside
> system
> > > > memory. It is usually very big.
> > > > This kind of data needs to be saved / loaded in pre-copy and
> > > > stop-and-copy phase.
> > > > The data of device memory is held in device memory region.
> > > > Size of devie memory is usually larger than that of device
> > > > memory region. qemu needs to save/load it in chunks of size of
> > > > device memory region.
> > > > Not all device has device memory. Like IGD only uses system

Re: [Qemu-devel] [PATCH] vfio: assign idstr for VFIO's mmaped regions for migration

2019-02-20 Thread Gonglei (Arei)
> > >
> > >  ret = vfio_region_setup(OBJECT(vdev), vbasedev,
> > >  &vdev->bars[i].region, i, name);
> > > @@ -3180,6 +3185,7 @@ static void vfio_instance_init(Object *obj)
> > >  static Property vfio_pci_dev_properties[] = {
> > >  DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
> > >  DEFINE_PROP_STRING("sysfsdev", VFIOPCIDevice,
> vbasedev.sysfsdev),
> > > +DEFINE_PROP_STRING("vfioid", VFIOPCIDevice, vbasedev.vfioid),
> > >  DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
> > >  display, ON_OFF_AUTO_OFF),
> > >  DEFINE_PROP_UINT32("x-intx-mmap-timeout-ms", VFIOPCIDevice,
> > > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > > index 1b434d02f6..84bab94f52 100644
> > > --- a/include/hw/vfio/vfio-common.h
> > > +++ b/include/hw/vfio/vfio-common.h
> > > @@ -108,6 +108,7 @@ typedef struct VFIODevice {
> > >  struct VFIOGroup *group;
> > >  char *sysfsdev;
> > >  char *name;
> > > +char *vfioid;
> > >  DeviceState *dev;
> > >  int fd;
> > >  int type;
> > > diff --git a/memory.c b/memory.c
> > > index d14c6dec1d..dbb29fa989 100644
> > > --- a/memory.c
> > > +++ b/memory.c
> > > @@ -1588,6 +1588,7 @@ void
> memory_region_init_ram_ptr(MemoryRegion *mr,
> > >  uint64_t size,
> > >  void *ptr)
> > >  {
> > > +DeviceState *owner_dev;
> > >  memory_region_init(mr, owner, name, size);
> > >  mr->ram = true;
> > >  mr->terminates = true;
> > > @@ -1597,6 +1598,9 @@ void
> memory_region_init_ram_ptr(MemoryRegion *mr,
> > >  /* qemu_ram_alloc_from_ptr cannot fail with ptr != NULL.  */
> > >  assert(ptr != NULL);
> > >  mr->ram_block = qemu_ram_alloc_from_ptr(size, ptr, mr,
> &error_fatal);
> > > +
> > > +owner_dev = DEVICE(owner);
> > > +vmstate_register_ram(mr, owner_dev);
> >
> > Where does the corresponding vmstate_unregister_ram() call occur when
> > unplugged?  Thanks,
> >
> sorry, I just updated my qemu code base and found that in migration/ram.c
> now it will not save/restore ramblocks that do not call
> vmstate_register_ram().
> therefore, the vmstate_register_ram() may not be necessary for memory
> region mapped to device resources, as it's better to save/restore that part
> of memory from vendor driver side.
> So, do you think it's ok to just call qemu_ram_set_idstr() to set idstr for
> ramblocks of mmaped region?
> 
> Thanks
> Yan
> 
Why not invoke vmstate_register_ram() in vfio_region_mmap() and
vmstate_unregister_ram() in vfio_region_exit()? Something like the sketch
below.
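
A rough sketch of that pairing (existing bodies and error handling omitted):

/* Sketch only: register the RAM idstr when the region is mmapped and
 * drop it again on unplug. */
int vfio_region_mmap(VFIORegion *region)
{
    /* ... existing mmap of the region's sparse areas ... */
    vmstate_register_ram(region->mem, region->vbasedev->dev);
    return 0;
}

void vfio_region_exit(VFIORegion *region)
{
    vmstate_unregister_ram(region->mem, region->vbasedev->dev);
    /* ... existing teardown ... */
}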

Regards,
-Gonglei



Re: [Qemu-devel] About live migration rollback

2019-01-02 Thread Gonglei (Arei)
Hi,

> 
> * Gonglei (Arei) (arei.gong...@huawei.com) wrote:
> > Hi Dave,
> >
> > We discussed some live migration fallback scenarios in this year's KVM 
> > forum,
> > and now I can provide another scenario, perhaps the upstream should
> consider rolling
> > back for this situation.
> >
> > Environments information:
> >
> > host A: cpu E5620(model WestmereEP without flag xsave)
> > host B: cpu E5-2643(model SandyBridgeEP with flag xsave)
> >
> > The reproduce steps is :
> > 1. Start a windows 2008 vm with -cpu host(which means host-passthrough).
> 
> Well we don't guarantee migration across -cpu host - does this problem
> go away if both qemu's are started with matching CPU flags
> (corresponding to the Westmere) ?
> 
Sorry, we didn't test other cpu model scenarios, since we must ensure that
live migration works from lower generation CPUs to higher generation
CPUs. :(


> > 2. Migrate the vm to host B when cr4.OSXSAVE=0.
> > 3. Vm runs on host B for a while so that cr4.OSXSAVE changes to 1.
> > 4. Then migrate the vm to host A successfully, but vm was paused, and qemu
> printed log as followed:
> >
> > KVM: entry failed, hardware error 0x8021
> >
> > If you're running a guest on an Intel machine without unrestricted mode
> > support, the failure can be most likely due to the guest entering an invalid
> > state for Intel VT. For example, the guest maybe running in big real mode
> > which is not supported on less recent Intel processors.
> >
> > EAX=019b3bb0 EBX=01a3ae80 ECX=01a61ce8 EDX=
> > ESI=01a62000 EDI= EBP= ESP=01718b20
> > EIP=0185d982 EFL=0286 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> > ES =   9300
> > CS =f000   9b00
> > SS =   9300
> > DS =   9300
> > FS =   9300
> > GS =   9300
> > LDT=   8200
> > TR =   8b00
> > GDT=  
> > IDT=  
> > CR0=6010 CR2= CR3= CR4=
> > DR0= DR1= DR2=
> DR3=
> > DR6=0ff0 DR7=0400
> > EFER=
> > Code=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00
> >
> > Problem happened when kvm_put_sregs returns err -22(called by
> kvm_arch_put_registers(qemu)).
> >
> > Because kvm_arch_vcpu_ioctl_set_sregs(kvm module) checked that
> > guest_cpuid_has no X86_FEATURE_XSAVE but cr4.OSXSAVE=1.
> > We should cancel migration if kvm_arch_put_registers returns error.
> 
> Do you have a backtrace of when the kvm_arch_put_registers is called
> when it fails?

The main backtrace is below:

 qemu_loadvm_state
   cpu_synchronize_all_post_init            --> w/o return value
     cpu_synchronize_post_init              --> w/o return value
       kvm_cpu_synchronize_post_init        --> w/o return value
         run_on_cpu                         --> w/o return value
           do_kvm_cpu_synchronize_post_init --> w/o return value
             kvm_arch_put_registers         --> w/ return value

The root cause is that some functions have no return value, so the migration
thread can't detect those failures. Paolo?
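
One possible shape of a fix (sketch only; the failure flag and how the
migration thread checks it are assumptions):

/* Sketch: let the post-init synchronization record the failure so the
 * load path can fail instead of entering the guest in a bad state. */
static void do_kvm_cpu_synchronize_post_init(CPUState *cpu,
                                             run_on_cpu_data arg)
{
    int ret = kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);

    if (ret < 0) {
        error_report("kvm_arch_put_registers failed: %d", ret);
        cpu->put_failed = true;   /* hypothetical flag checked by caller */
    }
    cpu->vcpu_dirty = false;
}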

> If it's called during the loading of the device state then we should be
> able to detect it and fail the migration; however if it's only failing
> after the CPU is restarted after the migration then it's a bit too late.
> 
Actually the CPUs haven't started in this scenario.

Thanks,
-Gonglei



[Qemu-devel] About live migration rollback

2018-12-18 Thread Gonglei (Arei)
Hi Dave,

We discussed some live migration fallback scenarios at this year's KVM Forum,
and now I can provide another scenario; perhaps upstream should consider
rolling back in this situation.

Environments information:

host A: cpu E5620(model WestmereEP without flag xsave)
host B: cpu E5-2643(model SandyBridgeEP with flag xsave)

The reproduction steps are:
1. Start a windows 2008 vm with -cpu host(which means host-passthrough).
2. Migrate the vm to host B when cr4.OSXSAVE=0.
3. Vm runs on host B for a while so that cr4.OSXSAVE changes to 1.
4. Then migrate the vm to host A; the migration succeeds, but the vm is paused
and qemu prints the following log:

KVM: entry failed, hardware error 0x8021

If you're running a guest on an Intel machine without unrestricted mode
support, the failure can be most likely due to the guest entering an invalid
state for Intel VT. For example, the guest maybe running in big real mode
which is not supported on less recent Intel processors.

EAX=019b3bb0 EBX=01a3ae80 ECX=01a61ce8 EDX=
ESI=01a62000 EDI= EBP= ESP=01718b20
EIP=0185d982 EFL=0286 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =   9300
CS =f000   9b00
SS =   9300
DS =   9300
FS =   9300
GS =   9300
LDT=   8200
TR =   8b00
GDT=  
IDT=  
CR0=6010 CR2= CR3= CR4=
DR0= DR1= DR2= 
DR3=
DR6=0ff0 DR7=0400
EFER=
Code=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

The problem happens when kvm_put_sregs returns err -22 (called by
kvm_arch_put_registers in qemu).

This is because kvm_arch_vcpu_ioctl_set_sregs (kvm module) checks that
guest_cpuid_has() reports no X86_FEATURE_XSAVE while cr4.OSXSAVE=1.
We should cancel the migration if kvm_arch_put_registers returns an error.

Thanks,
-Gonglei



Re: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices

2018-12-17 Thread Gonglei (Arei)
Hi,

It's great to see this patch series, which is a very important step, although
it currently only considers GPU mdev devices for live migration support.

However, this is based on the VFIO framework after all, so we expect
that we can make this live migration framework more general.

For example, the vfio_save_pending() callback is used to obtain device
memory (such as GPU memory), but what if the device (such as a network card)
has no special proprietary memory, only system memory?
It is too costly for this kind of device to perform a null operation by
writing memory to the vendor driver in kernel space.

I think we can query the capability from the vendor driver before using this.
If there is device memory that needs iterative copying, the vendor driver
returns true, otherwise false. Then QEMU implements the specific logic,
otherwise it returns directly. Just like getting the capability list of the
KVM module. Can we do that? A sketch of the idea follows.
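
Something like this, purely as a sketch (the ioctl, flag and struct are
made-up names for illustration, not an existing VFIO interface):

/* Hypothetical capability query: only devices that actually have
 * proprietary device memory opt in to the iterative pre-copy path. */
struct vfio_mig_caps {
    uint32_t argsz;
    uint32_t flags;     /* VFIO_MIG_CAP_* bits set by the vendor driver */
};

static bool vfio_device_has_device_memory(VFIODevice *vbasedev)
{
    struct vfio_mig_caps caps = { .argsz = sizeof(caps) };

    if (ioctl(vbasedev->fd, VFIO_DEVICE_GET_MIG_CAPS, &caps) < 0) {
        return false;
    }
    return caps.flags & VFIO_MIG_CAP_DEVICE_MEMORY;
}

/* .save_live_pending could then report no pending device memory for
 * system-memory-only devices (e.g. NICs), avoiding null reads/writes
 * into the kernel vendor driver. */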


Regards,
-Gonglei


> -Original Message-
> From: Qemu-devel
> [mailto:qemu-devel-bounces+arei.gonglei=huawei@nongnu.org] On
> Behalf Of Kirti Wankhede
> Sent: Wednesday, November 21, 2018 4:40 AM
> To: alex.william...@redhat.com; c...@nvidia.com
> Cc: zhengxiao...@alibaba-inc.com; kevin.t...@intel.com; yi.l@intel.com;
> eskul...@redhat.com; ziye.y...@intel.com; qemu-devel@nongnu.org;
> coh...@redhat.com; shuangtai@alibaba-inc.com; dgilb...@redhat.com;
> zhi.a.w...@intel.com; mlevi...@redhat.com; pa...@linux.ibm.com;
> a...@ozlabs.ru; Kirti Wankhede ;
> eau...@redhat.com; fel...@nutanix.com; jonathan.dav...@nutanix.com;
> changpeng@intel.com; ken@amd.com
> Subject: [Qemu-devel] [PATCH 3/5] Add migration functions for VFIO devices
> 
> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
> - Added SaveVMHandlers and implemented all basic functions required for live
>   migration.
> - Added VM state change handler to know running or stopped state of VM.
> - Added migration state change notifier to get notification on migration state
>   change. This state is translated to VFIO device state and conveyed to vendor
>   driver.
> - Whether a VFIO device supports migration is decided based on the migration
>   region query. If the migration region query is successful then migration is
>   supported, else migration is blocked.
> - Structure vfio_device_migration_info is mapped at 0th offset of the
>   migration region and should always be trapped by the VFIO device's driver.
>   Added both types of access support, trapped or mmapped, for the data
>   section of the region.
> - To save device state, read data offset and size using structure
>   vfio_device_migration_info.data, accordingly copy data from the region.
> - To restore device state, write data offset and size in the structure and 
> write
>   data in the region.
> - To get dirty page bitmap, write start address and pfn count then read count 
> of
>   pfns copied and accordingly read those from the rest of the region or
> mmaped
>   part of the region. This copy is iterated till page bitmap for all requested
>   pfns are copied.
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> ---
>  hw/vfio/Makefile.objs |   2 +-
>  hw/vfio/migration.c   | 729
> ++
>  include/hw/vfio/vfio-common.h |  23 ++
>  3 files changed, 753 insertions(+), 1 deletion(-)
>  create mode 100644 hw/vfio/migration.c
> 
[skip]

> +
> +static SaveVMHandlers savevm_vfio_handlers = {
> +.save_setup = vfio_save_setup,
> +.save_live_iterate = vfio_save_iterate,
> +.save_live_complete_precopy = vfio_save_complete_precopy,
> +.save_live_pending = vfio_save_pending,
> +.save_cleanup = vfio_save_cleanup,
> +.load_state = vfio_load_state,
> +.load_setup = vfio_load_setup,
> +.load_cleanup = vfio_load_cleanup,
> +.is_active_iterate = vfio_is_active_iterate,
> +};
> +

 



Re: [Qemu-devel] [PATCH v3 00/16] Virtio devices split from virtio-pci

2018-12-14 Thread Gonglei (Arei)
> -Original Message-
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Friday, December 14, 2018 8:53 PM
> To: Gonglei (Arei) 
> Cc: Juan Quintela ; qemu-devel@nongnu.org; Thomas
> Huth ; Gerd Hoffmann 
> Subject: Re: [PATCH v3 00/16] Virtio devices split from virtio-pci
> 
> On Fri, Dec 14, 2018 at 07:07:44AM +, Gonglei (Arei) wrote:
> >
> > > -Original Message-
> > > From: Juan Quintela [mailto:quint...@redhat.com]
> > > Sent: Friday, December 14, 2018 5:01 AM
> > > To: qemu-devel@nongnu.org
> > > Cc: Michael S. Tsirkin ; Thomas Huth
> ;
> > > Gerd Hoffmann ; Gonglei (Arei)
> > > ; Juan Quintela 
> > > Subject: [PATCH v3 00/16] Virtio devices split from virtio-pci
> > >
> > > Hi
> > >
> > > v3:
> > > - rebase to master
> > > - only compile them if CONFIG_PCI is set (thomas)
> > >
> > > Please review.
> > >
> > > Later, Juan.
> > >
> > > V2:
> > >
> > > - Rebase on top of master
> > >
> > > Please review.
> > >
> > > Later, Juan.
> > >
> > > [v1]
> > > From previous verision (in the middle of make check tests):
> > > - split also the bits of virtio-pci.h (mst suggestion)
> > > - add gpu, crypt and gpg bits
> > > - more cleanups
> > > - fix all the copyrights (the ones not changed have been there
> > >   foverever)
> > > - be consistent with naming, vhost-* or virtio-*
> > >
> > > Please review, Juan.
> > >
> > > Juan Quintela (16):
> > >   virtio: split vhost vsock bits from virtio-pci
> > >   virtio: split virtio input host bits from virtio-pci
> > >   virtio: split virtio input bits from virtio-pci
> > >   virtio: split virtio rng bits from virtio-pci
> > >   virtio: split virtio balloon bits from virtio-pci
> > >   virtio: split virtio 9p bits from virtio-pci
> > >   virtio: split vhost user blk bits from virtio-pci
> > >   virtio: split vhost user scsi bits from virtio-pci
> > >   virtio: split vhost scsi bits from virtio-pci
> > >   virtio: split virtio scsi bits from virtio-pci
> > >   virtio: split virtio blk bits rom virtio-pci
> > >   virtio: split virtio net bits rom virtio-pci
> > >   virtio: split virtio serial bits rom virtio-pci
> > >   virtio: split virtio gpu bits rom virtio-pci.h
> > >   virtio: split virtio crypto bits rom virtio-pci.h
> > >   virtio: virtio 9p really requires CONFIG_VIRTFS to work
> > >
> > >  default-configs/virtio.mak|   3 +-
> > >  hw/display/virtio-gpu-pci.c   |  14 +
> > >  hw/display/virtio-vga.c   |   1 +
> > >  hw/virtio/Makefile.objs   |  15 +
> > >  hw/virtio/vhost-scsi-pci.c|  95 
> > >  hw/virtio/vhost-user-blk-pci.c| 101 
> > >  hw/virtio/vhost-user-scsi-pci.c   | 101 
> > >  hw/virtio/vhost-vsock-pci.c   |  82 
> > >  hw/virtio/virtio-9p-pci.c |  86 
> > >  hw/virtio/virtio-balloon-pci.c|  94 
> > >  hw/virtio/virtio-blk-pci.c|  97 
> > >  hw/virtio/virtio-crypto-pci.c |  14 +
> > >  hw/virtio/virtio-input-host-pci.c |  45 ++
> > >  hw/virtio/virtio-input-pci.c  | 154 ++
> > >  hw/virtio/virtio-net-pci.c|  96 
> > >  hw/virtio/virtio-pci.c| 783 --
> > >  hw/virtio/virtio-pci.h| 234 -
> > >  hw/virtio/virtio-rng-pci.c|  86 
> > >  hw/virtio/virtio-scsi-pci.c   | 106 
> > >  hw/virtio/virtio-serial-pci.c | 112 +
> > >  tests/Makefile.include|  20 +-
> > >  21 files changed, 1311 insertions(+), 1028 deletions(-)
> > >  create mode 100644 hw/virtio/vhost-scsi-pci.c
> > >  create mode 100644 hw/virtio/vhost-user-blk-pci.c
> > >  create mode 100644 hw/virtio/vhost-user-scsi-pci.c
> > >  create mode 100644 hw/virtio/vhost-vsock-pci.c
> > >  create mode 100644 hw/virtio/virtio-9p-pci.c
> > >  create mode 100644 hw/virtio/virtio-balloon-pci.c
> > >  create mode 100644 hw/virtio/virtio-blk-pci.c
> > >  create mode 100644 hw/virtio/virtio-input-host-pci.c
> > >  create mode 100644 hw/virtio/virtio-input-pci.c
> > >  create mode 100644 hw/virtio/virtio-net-pci.c
> > >  create mode 100644 hw/virtio/virtio-rng-pci.c
> > >  create mode 100644 hw/virtio/virtio-scsi-pci.c
> > >  create mode 100644 hw/virtio/virtio-serial-pci.c
> > >
> > > --
> > > 2.19.2
> >
> > For series:
> > Reviewed-by: Gonglei 
> >
> >
> > Thanks,
> > -Gonglei
> 
> Thanks!
> Can you pls align Reviewed-by: tag at the 1st column in the future?
> Makes it easier to apply the tag.

OK, I will, thanks :)

Thanks,
-Gonglei



Re: [Qemu-devel] [PATCH v3 00/16] Virtio devices split from virtio-pci

2018-12-13 Thread Gonglei (Arei)


> -Original Message-
> From: Juan Quintela [mailto:quint...@redhat.com]
> Sent: Friday, December 14, 2018 5:01 AM
> To: qemu-devel@nongnu.org
> Cc: Michael S. Tsirkin ; Thomas Huth ;
> Gerd Hoffmann ; Gonglei (Arei)
> ; Juan Quintela 
> Subject: [PATCH v3 00/16] Virtio devices split from virtio-pci
> 
> Hi
> 
> v3:
> - rebase to master
> - only compile them if CONFIG_PCI is set (thomas)
> 
> Please review.
> 
> Later, Juan.
> 
> V2:
> 
> - Rebase on top of master
> 
> Please review.
> 
> Later, Juan.
> 
> [v1]
> From the previous version (in the middle of make check tests):
> - split also the bits of virtio-pci.h (mst suggestion)
> - add gpu, crypt and gpg bits
> - more cleanups
> - fix all the copyrights (the ones not changed have been there
>   forever)
> - be consistent with naming, vhost-* or virtio-*
> 
> Please review, Juan.
> 
> Juan Quintela (16):
>   virtio: split vhost vsock bits from virtio-pci
>   virtio: split virtio input host bits from virtio-pci
>   virtio: split virtio input bits from virtio-pci
>   virtio: split virtio rng bits from virtio-pci
>   virtio: split virtio balloon bits from virtio-pci
>   virtio: split virtio 9p bits from virtio-pci
>   virtio: split vhost user blk bits from virtio-pci
>   virtio: split vhost user scsi bits from virtio-pci
>   virtio: split vhost scsi bits from virtio-pci
>   virtio: split virtio scsi bits from virtio-pci
>   virtio: split virtio blk bits rom virtio-pci
>   virtio: split virtio net bits rom virtio-pci
>   virtio: split virtio serial bits rom virtio-pci
>   virtio: split virtio gpu bits rom virtio-pci.h
>   virtio: split virtio crypto bits rom virtio-pci.h
>   virtio: virtio 9p really requires CONFIG_VIRTFS to work
> 
>  default-configs/virtio.mak|   3 +-
>  hw/display/virtio-gpu-pci.c   |  14 +
>  hw/display/virtio-vga.c   |   1 +
>  hw/virtio/Makefile.objs   |  15 +
>  hw/virtio/vhost-scsi-pci.c|  95 
>  hw/virtio/vhost-user-blk-pci.c| 101 
>  hw/virtio/vhost-user-scsi-pci.c   | 101 
>  hw/virtio/vhost-vsock-pci.c   |  82 
>  hw/virtio/virtio-9p-pci.c |  86 
>  hw/virtio/virtio-balloon-pci.c|  94 
>  hw/virtio/virtio-blk-pci.c|  97 
>  hw/virtio/virtio-crypto-pci.c |  14 +
>  hw/virtio/virtio-input-host-pci.c |  45 ++
>  hw/virtio/virtio-input-pci.c  | 154 ++
>  hw/virtio/virtio-net-pci.c|  96 
>  hw/virtio/virtio-pci.c| 783 --
>  hw/virtio/virtio-pci.h| 234 -
>  hw/virtio/virtio-rng-pci.c|  86 
>  hw/virtio/virtio-scsi-pci.c   | 106 
>  hw/virtio/virtio-serial-pci.c | 112 +
>  tests/Makefile.include|  20 +-
>  21 files changed, 1311 insertions(+), 1028 deletions(-)
>  create mode 100644 hw/virtio/vhost-scsi-pci.c
>  create mode 100644 hw/virtio/vhost-user-blk-pci.c
>  create mode 100644 hw/virtio/vhost-user-scsi-pci.c
>  create mode 100644 hw/virtio/vhost-vsock-pci.c
>  create mode 100644 hw/virtio/virtio-9p-pci.c
>  create mode 100644 hw/virtio/virtio-balloon-pci.c
>  create mode 100644 hw/virtio/virtio-blk-pci.c
>  create mode 100644 hw/virtio/virtio-input-host-pci.c
>  create mode 100644 hw/virtio/virtio-input-pci.c
>  create mode 100644 hw/virtio/virtio-net-pci.c
>  create mode 100644 hw/virtio/virtio-rng-pci.c
>  create mode 100644 hw/virtio/virtio-scsi-pci.c
>  create mode 100644 hw/virtio/virtio-serial-pci.c
> 
> --
> 2.19.2

For series:
Reviewed-by: Gonglei 

 
Thanks,
-Gonglei



Re: [Qemu-devel] [RFC PATCH v1 1/4] VFIO KABI for migration interface

2018-10-17 Thread Gonglei (Arei)
enum for flags rather than bits.
> 
> >> + * Flag VFIO_MIGRATION_GET_PENDING:
> >> + *  To get pending bytes yet to be migrated from vendor driver
> >> + *  threshold_size [Input] : threshold of buffer in User space app.
> >> + *  pending_precopy_only [output] : pending data which must be
> migrated in
> >> + *  precopy phase or in stopped state, in other words - before
> target
> >> + *  vm start
> >> + *  pending_compatible [output] : pending data which may be
> migrated in any
> >> + *   phase
> >> + *  pending_postcopy_only [output] : pending data which must be
> migrated in
> >> + *   postcopy phase or in stopped state, in other words - after
> source
> >> + *   vm stop
> >> + *  Sum of pending_precopy_only, pending_compatible and
> >> + *  pending_postcopy_only is the whole amount of pending data.
> >
> > What's the significance of the provided threshold, are the pending
> > bytes limited to threshold size?  It makes me nervous to define a
> > kernel API in terms of the internal API of a userspace program that can
> > change.  I wonder if it makes sense to define these in terms of the
> > state of the devices, pending initial data, runtime data, post-offline
> > data.
> >
> 
> Threshold is required, because that will tell size in bytes that user
> space application buffer can accommodate. Driver can copy data less than
> threshold, but copying data more than threshold doesn't make sense
> because user space application won't be able to copy that extra data and
> that data might get overwritten or lost.
> 
> 
> >> + *
> >> + * Flag VFIO_MIGRATION_GET_BUFFER:
> >> + *  On this flag, vendor driver should write data to migration
> >> region and
> >> + *  return number of bytes written in the region.
> >> + *  bytes_written [output] : number of bytes written in
> >> migration buffer by
> >> + *  vendor driver
> >
> > Does the data the user gets here depend on the device state set
> > earlier?  For example the user gets pending_precopy_only data while
> > PRECOPY_ACTIVE is the device state and pending_postcopy_only data
> > in STOPNCOPY_ACTIVE?  The user should continue to call GET_BUFFER
> > in a given state until the associated pending field reaches zero?
> > Jumping between the region and ioctl is rather awkward.
> >
> 
> User gets pending_precopy_only data when device is in PRECOPY_ACTIVE
> state, but each time when user calls GET_BUFFER, pending bytes might
> change.
> VFIO device's driver is producer of data and user/QEMU is consumer of
> data. In pre-copy phase, when vCPUs are still running, driver will try
> to accumulate as much data as possible in this phase, but vCPUs are
> running and user of that device/application in guest is actively using
> that device, then there are chances that during next iteration of
> GET_BUFFER, driver might have more data.
> Even in case of STOPNCOPY_ACTIVE state of device, driver can start
> sending data in parts while a thread in vendor driver can still generate
> data after device has halted, producer and consumer can run in parallel.
> So User has to call GET_BUFFER until pending bytes are returned 0.
> 
How should we understand "the driver still generates data after the device has
halted"? Are interrupts still generated after the device halts? If so, the
interrupt information in pi_desc.pir will be lost.
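
Also, to confirm my understanding of the producer/consumer flow, the
userspace consumption loop would look roughly like this (the two helpers
wrapping the proposed VFIO_MIGRATION_GET_PENDING / VFIO_MIGRATION_GET_BUFFER
flags are hypothetical):

/* Sketch only: keep draining the migration region until the vendor
 * driver reports no more pending bytes. */
static int vfio_save_device_data(QEMUFile *f, VFIODevice *vbasedev,
                                 uint8_t *buf, uint64_t threshold)
{
    uint64_t pending, written;

    for (;;) {
        if (vfio_migration_get_pending(vbasedev, threshold, &pending)) {
            return -1;
        }
        if (!pending) {
            break;              /* producer is done */
        }
        if (vfio_migration_get_buffer(vbasedev, buf, &written)) {
            return -1;
        }
        qemu_put_buffer(f, buf, written);
    }
    return 0;
}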

Thanks,
-Gonglei


Re: [Qemu-devel] [virtio-dev] Re: [PATCH v25 0/2] virtio-crypto: virtio crypto device specification

2018-08-28 Thread Gonglei (Arei)
> 
> On Tue, Aug 28, 2018 at 03:31:02AM +, Gonglei (Arei) wrote:
> >
> > > -Original Message-
> > > From: Michael S. Tsirkin [mailto:m...@redhat.com]
> > > Sent: Friday, August 24, 2018 8:54 PM
> > >
> > > On Fri, Aug 24, 2018 at 12:07:44PM +, Gonglei (Arei) wrote:
> > > > Hi Michael,
> > > >
> > > > > -Original Message-
> > > > > From: virtio-...@lists.oasis-open.org
> > > [mailto:virtio-...@lists.oasis-open.org]
> > > > > On Behalf Of Michael S. Tsirkin
> > > > > Sent: Friday, August 24, 2018 7:23 PM
> > > > > To: longpeng 
> > > > > Cc: xin.z...@intel.com; Gonglei (Arei) ;
> > > > > pa...@linux.vnet.ibm.com; qemu-devel@nongnu.org;
> > > > > virtio-...@lists.oasis-open.org; coh...@redhat.com;
> > > stefa...@redhat.com;
> > > > > denglin...@chinamobile.com; Jani Kokkonen
> > > ;
> > > > > ola.liljed...@arm.com; varun.se...@freescale.com;
> > > > > brian.a.keat...@intel.com; liang.j...@intel.com;
> john.grif...@intel.com;
> > > > > ag...@suse.de; jasow...@redhat.com; vincent.jar...@6wind.com;
> > > > > Huangweidong (C) ; wangxin (U)
> > > > > ; Zhoujian (jay)
> > > 
> > > > > Subject: [virtio-dev] Re: [PATCH v25 0/2] virtio-crypto: virtio crypto
> device
> > > > > specification
> > > > >
> > > > > Is there a github issue? If not pls create one.
> > > > >
> > > >
> > > > I just created one issue:
> > > >
> > > > https://github.com/oasis-tcs/virtio-spec/issues/19
> > >
> > > All set to start voting whenever you request it.
> > >
> >
> > Hi Michael,
> >
> > Since no comments currently, pls help to start a ballot for virtio crypto 
> > spec if
> you can. :)
> >
> >
> > Thanks,
> > -Gonglei
> 
> Done. In the future please add a link to mailing list archives.
> 

Sure. Ballot created at URL: 
https://www.oasis-open.org/committees/ballot.php?id=3242


Thanks,
-Gonglei



Re: [Qemu-devel] [virtio-dev] Re: [PATCH v25 0/2] virtio-crypto: virtio crypto device specification

2018-08-27 Thread Gonglei (Arei)


> -Original Message-
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Friday, August 24, 2018 8:54 PM
> 
> On Fri, Aug 24, 2018 at 12:07:44PM +, Gonglei (Arei) wrote:
> > Hi Michael,
> >
> > > -Original Message-
> > > From: virtio-...@lists.oasis-open.org
> [mailto:virtio-...@lists.oasis-open.org]
> > > On Behalf Of Michael S. Tsirkin
> > > Sent: Friday, August 24, 2018 7:23 PM
> > > To: longpeng 
> > > Cc: xin.z...@intel.com; Gonglei (Arei) ;
> > > pa...@linux.vnet.ibm.com; qemu-devel@nongnu.org;
> > > virtio-...@lists.oasis-open.org; coh...@redhat.com;
> stefa...@redhat.com;
> > > denglin...@chinamobile.com; Jani Kokkonen
> ;
> > > ola.liljed...@arm.com; varun.se...@freescale.com;
> > > brian.a.keat...@intel.com; liang.j...@intel.com; john.grif...@intel.com;
> > > ag...@suse.de; jasow...@redhat.com; vincent.jar...@6wind.com;
> > > Huangweidong (C) ; wangxin (U)
> > > ; Zhoujian (jay)
> 
> > > Subject: [virtio-dev] Re: [PATCH v25 0/2] virtio-crypto: virtio crypto 
> > > device
> > > specification
> > >
> > > Is there a github issue? If not pls create one.
> > >
> >
> > I just created one issue:
> >
> > https://github.com/oasis-tcs/virtio-spec/issues/19
> 
> All set to start voting whenever you request it.
> 

Hi Michael,

Since there are no comments currently, pls help to start a ballot for the
virtio crypto spec if you can. :)


Thanks,
-Gonglei



Re: [Qemu-devel] [virtio-dev] Re: [PATCH v25 0/2] virtio-crypto: virtio crypto device specification

2018-08-24 Thread Gonglei (Arei)
Hi Michael,

> -Original Message-
> From: virtio-...@lists.oasis-open.org [mailto:virtio-...@lists.oasis-open.org]
> On Behalf Of Michael S. Tsirkin
> Sent: Friday, August 24, 2018 7:23 PM
> To: longpeng 
> Cc: xin.z...@intel.com; Gonglei (Arei) ;
> pa...@linux.vnet.ibm.com; qemu-devel@nongnu.org;
> virtio-...@lists.oasis-open.org; coh...@redhat.com; stefa...@redhat.com;
> denglin...@chinamobile.com; Jani Kokkonen ;
> ola.liljed...@arm.com; varun.se...@freescale.com;
> brian.a.keat...@intel.com; liang.j...@intel.com; john.grif...@intel.com;
> ag...@suse.de; jasow...@redhat.com; vincent.jar...@6wind.com;
> Huangweidong (C) ; wangxin (U)
> ; Zhoujian (jay) 
> Subject: [virtio-dev] Re: [PATCH v25 0/2] virtio-crypto: virtio crypto device
> specification
> 
> Is there a github issue? If not pls create one.
> 

I just created one issue:

https://github.com/oasis-tcs/virtio-spec/issues/19


Thanks,
-Gonglei



Re: [Qemu-devel] [PATCH] cryptodev: remove dead code

2018-07-30 Thread Gonglei (Arei)


> -Original Message-
> From: Peter Maydell [mailto:peter.mayd...@linaro.org]
> Sent: Monday, July 30, 2018 6:49 PM
> To: Paolo Bonzini 
> Cc: QEMU Developers ; Gonglei (Arei)
> 
> Subject: Re: [Qemu-devel] [PATCH] cryptodev: remove dead code
> 
> On 30 July 2018 at 09:51, Paolo Bonzini  wrote:
> > Reported by Coverity as CID 1390600.
> >
> > Signed-off-by: Paolo Bonzini 
> > ---
> 
> This already has a reviewed patch on-list for this from
> back in April:
> 
> https://patchwork.ozlabs.org/patch/906041/
> 
> so I think we should just apply that.
> 
Oh, yes. Would you pick it up directly, or should it go via qemu-trivial?

Thanks,
-Gonglei


Re: [Qemu-devel] [PATCH] cryptodev: remove dead code

2018-07-30 Thread Gonglei (Arei)


> -Original Message-
> From: Qemu-devel
> [mailto:qemu-devel-bounces+arei.gonglei=huawei@nongnu.org] On
> Behalf Of Paolo Bonzini
> Sent: Monday, July 30, 2018 4:51 PM
> To: qemu-devel@nongnu.org
> Subject: [Qemu-devel] [PATCH] cryptodev: remove dead code
> 
> Reported by Coverity as CID 1390600.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  backends/cryptodev-vhost-user.c | 5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/backends/cryptodev-vhost-user.c b/backends/cryptodev-vhost-user.c
> index d52daccfcd..d539f14d59 100644
> --- a/backends/cryptodev-vhost-user.c
> +++ b/backends/cryptodev-vhost-user.c
> @@ -157,7 +157,6 @@ static void cryptodev_vhost_user_event(void *opaque,
> int event)
>  {
>  CryptoDevBackendVhostUser *s = opaque;
>  CryptoDevBackend *b = CRYPTODEV_BACKEND(s);
> -Error *err = NULL;
>  int queues = b->conf.peers.queues;
> 
>  assert(queues < MAX_CRYPTO_QUEUE_NUM);
> @@ -174,10 +173,6 @@ static void cryptodev_vhost_user_event(void
> *opaque, int event)
>  cryptodev_vhost_user_stop(queues, s);
>  break;
>  }
> -
> -if (err) {
> -    error_report_err(err);
> -}
>  }
> 
>  static void cryptodev_vhost_user_init(
> --
> 2.17.1
> 

Reviewed-by: Gonglei 

Thanks,
-Gonglei



[Qemu-devel] about live memory snapshot

2018-06-29 Thread Gonglei (Arei)
Hi Peter,

As we discussed at LC3 China, the current "migration to file" scheme
doesn't fit a production environment: the snapshot file grows bigger
and bigger while the guest is under heavy memory pressure, so we can't
bound the size of the snapshot file in advance.

Pls have a look at whether there is a simple method to resolve the problem. :)

PS: the below link is zhanghailiang's scheme based on userfaultfd.

https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg00664.html


Thanks,
-Gonglei
 



Re: [Qemu-devel] [RFC v1 1/1] virtio-crypto: Allow disabling of cipher algorithms for virtio-crypto device

2018-06-14 Thread Gonglei (Arei)


> -Original Message-
> From: Daniel P. Berrangé [mailto:berra...@redhat.com]
> Sent: Thursday, June 14, 2018 11:11 PM
> To: Farhan Ali 
> Cc: Halil Pasic ; qemu-devel@nongnu.org;
> fran...@linux.ibm.com; m...@redhat.com; borntrae...@de.ibm.com; Gonglei
> (Arei) ; longpeng ;
> Viktor Mihajlovski ;
> mjros...@linux.vnet.ibm.com
> Subject: Re: [Qemu-devel] [RFC v1 1/1] virtio-crypto: Allow disabling of 
> cipher
> algorithms for virtio-crypto device
> 
> On Thu, Jun 14, 2018 at 10:50:40AM -0400, Farhan Ali wrote:
> >
> >
> > On 06/14/2018 04:21 AM, Daniel P. Berrangé wrote:
> > > On Wed, Jun 13, 2018 at 07:28:08PM +0200, Halil Pasic wrote:
> > > >
> > > >
> > > > On 06/13/2018 05:05 PM, Daniel P. Berrangé wrote:
> > > > > On Wed, Jun 13, 2018 at 11:01:05AM -0400, Farhan Ali wrote:
> > > > > > Hi Daniel
> > > > > >
> > > > > > On 06/13/2018 05:37 AM, Daniel P. Berrangé wrote:
> > > > > > > On Tue, Jun 12, 2018 at 03:48:34PM -0400, Farhan Ali wrote:
> > > > > > > > The virtio-crypto driver currently propagates to the guest
> > > > > > > > all the cipher algorithms that the backend cryptodev can
> > > > > > > > support. But in certain cases where the guest has more
> > > > > > > > performant mechanism to handle some algorithms, it would be
> > > > > > > > useful to propagate only a subset of the algorithms.
> > > > > > >
> > > > > > > I'm not really convinced by this.
> > > > > > >
> > > > > > > The performance of crypto algorithms has many influencing
> > > > > > > factors, making it pretty hard to decide which is best
> > > > > > > without actively testing specific impls and comparing
> > > > > > > them in a manner which matches the application usage
> > > > > > > pattern. eg in theory the kernel crypto impl of an alg
> > > > > > > is faster than a userspace impl, if the kernel uses
> > > > > > > hardware accel and userspace does not. This, however,
> > > > > > > ignores the overhead of the kernel/userspace switch.
> > > > > > > The real world performance winner, thus depends on the
> > > > > > > amount of data being processed in each operation. Some
> > > > > > > times userspace can win & sometimes kernel space can
> > > > > > > win. This is even more relevant to virtio-crypto as
> > > > > > > it has more expensive context switches.
> > > > > >
> > > > > > True. But what if the guest can perform some crypto algorithms
> > > > > > without incurring a VM exit? For example in s390 we have the cpacf
> instructions to
> > > > > > perform crypto and this instruction is implemented for us by our
> hardware
> > > > > > virtualization technology. In such a case it would be better not to 
> > > > > > use
> > > > > > virtio-crypto's implementation of such a crypto algorithm.
> > > > > >
> > > > > > At the same time we would like to take advantage of virtio-crypto's
> > > > > > acceleration capabilities for certain crypto algorithms for which 
> > > > > > there
> is
> > > > > > no hardware assistance.
> > > > >
> > > > > IIUC, the kernel's crypto layer can support multiple implementations 
> > > > > of
> > > > > any algorithm. Providers can report a priority against implementations
> > > > > which influences which impl is used in practice. So if there's a 
> > > > > native
> > > > > instruction for a particular algorithm I would expect the impl 
> > > > > registered
> > > > > for that to be designated higher priority than other impls, so that 
> > > > > it is
> > > > > used in preference to other impls.
> > > > >
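
A minimal sketch of how a provider expresses that preference with the
kernel's crypto_register_alg() interface (the driver name and priority
value below are made up for illustration):

    /* a higher .cra_priority wins when several implementations of the
     * same algorithm are registered */
    static struct crypto_alg my_aes_cbc_alg = {
        .cra_name        = "cbc(aes)",
        .cra_driver_name = "cbc-aes-mydriver", /* hypothetical name */
        .cra_priority    = 400,                /* made-up value */
        /* remaining fields elided for brevity */
    };

    /* crypto_register_alg(&my_aes_cbc_alg); */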
> > > >
> > > > AFAIR the problem here is that in (the guest) kernel the virtio-crypto
> > > > driver has to register its crypto algo implementations with a priority
> > > > (single number), which dictates if it's going to be the preferred (used)
> > > > implementation of the algorithm or not. The virtio-crypto driver does 
> > > > this
> > > > without having information

Re: [Qemu-devel] [RFC v1 1/1] virtio-crypto: Allow disabling of cipher algorithms for virtio-crypto device

2018-06-12 Thread Gonglei (Arei)


> -Original Message-
> From: Farhan Ali [mailto:al...@linux.ibm.com]
> Sent: Wednesday, June 13, 2018 3:49 AM
> To: qemu-devel@nongnu.org
> Cc: m...@redhat.com; Gonglei (Arei) ; longpeng
> ; pa...@linux.ibm.com; borntrae...@de.ibm.com;
> fran...@linux.ibm.com; al...@linux.ibm.com
> Subject: [RFC v1 1/1] virtio-crypto: Allow disabling of cipher algorithms for
> virtio-crypto device
> 
> The virtio-crypto driver currently propagates to the guest
> all the cipher algorithms that the backend cryptodev can
> support. But in certain cases where the guest has more
> performant mechanism to handle some algorithms, it would be
> useful to propagate only a subset of the algorithms.
> 

It makes sense to me. E.g. current Intel CPUs have the AES-NI
instructions for accelerating AES, so we don't need to propagate the
AES algorithms in that case.

> This patch adds support for disabling the cipher
> algorithms of the backend cryptodev.
> 
> eg:
>  -object cryptodev-backend-builtin,id=cryptodev0
>  -device virtio-crypto-ccw,id=crypto0,cryptodev=cryptodev0,cipher-aes-cbc=off
> 
> Signed-off-by: Farhan Ali 
> ---
> 
> Please note this patch is not complete, and there are TODOs to handle
> for other types of algorithms such Hash, AEAD and MAC algorithms.
> 
> This is mainly intended to get some feedback on the design approach
> from the community.
> 
> 
>  hw/virtio/virtio-crypto.c | 46
> ---
>  include/hw/virtio/virtio-crypto.h |  3 +++
>  2 files changed, 46 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/virtio/virtio-crypto.c b/hw/virtio/virtio-crypto.c
> index 9a9fa49..4aed9ca 100644
> --- a/hw/virtio/virtio-crypto.c
> +++ b/hw/virtio/virtio-crypto.c
> @@ -754,12 +754,22 @@ static void virtio_crypto_reset(VirtIODevice *vdev)
>  static void virtio_crypto_init_config(VirtIODevice *vdev)
>  {
>  VirtIOCrypto *vcrypto = VIRTIO_CRYPTO(vdev);
> +    uint32_t user_crypto_services = (1u << VIRTIO_CRYPTO_SERVICE_CIPHER) |
> +                                    (1u << VIRTIO_CRYPTO_SERVICE_HASH) |
> +                                    (1u << VIRTIO_CRYPTO_SERVICE_AEAD) |
> +                                    (1u << VIRTIO_CRYPTO_SERVICE_MAC);
> +
> +    if (vcrypto->user_cipher_algo_l & (1u << VIRTIO_CRYPTO_NO_CIPHER)) {
> +        vcrypto->user_cipher_algo_l = 1u << VIRTIO_CRYPTO_NO_CIPHER;
> +        vcrypto->user_cipher_algo_h = 0;
> +        user_crypto_services &= ~(1u << VIRTIO_CRYPTO_SERVICE_CIPHER);
> +    }
> 
> -vcrypto->conf.crypto_services =
> +vcrypto->conf.crypto_services = user_crypto_services &
>   vcrypto->conf.cryptodev->conf.crypto_services;
> -vcrypto->conf.cipher_algo_l =
> +vcrypto->conf.cipher_algo_l = vcrypto->user_cipher_algo_l &
>   vcrypto->conf.cryptodev->conf.cipher_algo_l;
> -vcrypto->conf.cipher_algo_h =
> +vcrypto->conf.cipher_algo_h = vcrypto->user_cipher_algo_h &
>   vcrypto->conf.cryptodev->conf.cipher_algo_h;
>  vcrypto->conf.hash_algo = vcrypto->conf.cryptodev->conf.hash_algo;
>  vcrypto->conf.mac_algo_l = vcrypto->conf.cryptodev->conf.mac_algo_l;
> @@ -853,6 +863,34 @@ static const VMStateDescription
> vmstate_virtio_crypto = {
>  static Property virtio_crypto_properties[] = {
>  DEFINE_PROP_LINK("cryptodev", VirtIOCrypto, conf.cryptodev,
>   TYPE_CRYPTODEV_BACKEND, CryptoDevBackend
> *),
> +DEFINE_PROP_BIT("no-cipher", VirtIOCrypto, user_cipher_algo_l,
> +VIRTIO_CRYPTO_CIPHER_ARC4, false),

s/VIRTIO_CRYPTO_CIPHER_ARC4/VIRTIO_CRYPTO_NO_CIPHER/

> +DEFINE_PROP_BIT("cipher-arc4", VirtIOCrypto, user_cipher_algo_l,
> +VIRTIO_CRYPTO_CIPHER_ARC4, false),
> +DEFINE_PROP_BIT("cipher-aes-ecb", VirtIOCrypto, user_cipher_algo_l,
> +VIRTIO_CRYPTO_CIPHER_AES_ECB, false),
> +DEFINE_PROP_BIT("cipher-aes-cbc", VirtIOCrypto, user_cipher_algo_l,
> +VIRTIO_CRYPTO_CIPHER_AES_CBC, false),
> +DEFINE_PROP_BIT("cipher-aes-ctr", VirtIOCrypto, user_cipher_algo_l,
> +VIRTIO_CRYPTO_CIPHER_AES_CTR, false),
> +DEFINE_PROP_BIT("cipher-des-ecb", VirtIOCrypto, user_cipher_algo_l,
> +VIRTIO_CRYPTO_CIPHER_DES_ECB, false),
> +DEFINE_PROP_BIT("cipher-3des-ecb", VirtIOCrypto, user_cipher_algo_l,
> +VIRTIO_CRYPTO_CIPHER_3DES_ECB, false),
> +DEFINE_PROP_BIT("cipher-3des-cbc", VirtIOCrypto, 

Re: [Qemu-devel] [RFC v1 2/2] crypto/virtio-crypto: Register an algo only if it's supported

2018-06-12 Thread Gonglei (Arei)


> -Original Message-
> From: Farhan Ali [mailto:al...@linux.ibm.com]
> Sent: Wednesday, June 13, 2018 1:08 AM
> To: Gonglei (Arei) ; linux-ker...@vger.kernel.org;
> k...@vger.kernel.org
> Cc: m...@redhat.com; qemu-devel@nongnu.org; longpeng
> ; pa...@linux.ibm.com; fran...@linux.ibm.com;
> borntrae...@de.ibm.com
> Subject: Re: [RFC v1 2/2] crypto/virtio-crypto: Register an algo only if it's
> supported
> 
> 
> 
> On 06/11/2018 04:48 AM, Gonglei (Arei) wrote:
> >
> >
> >> -Original Message-
> >> From: Farhan Ali [mailto:al...@linux.ibm.com]
> >> Sent: Saturday, June 09, 2018 3:09 AM
> >> To: linux-ker...@vger.kernel.org; k...@vger.kernel.org
> >> Cc: m...@redhat.com; qemu-devel@nongnu.org; Gonglei (Arei)
> >> ; longpeng ;
> >> pa...@linux.ibm.com; fran...@linux.ibm.com; borntrae...@de.ibm.com;
> >> al...@linux.ibm.com
> >> Subject: [RFC v1 2/2] crypto/virtio-crypto: Register an algo only if it's
> supported
> >>
> >> From: Farhan Ali 
> >>
> >> Register a crypto algo with the Linux crypto layer only if
> >> the algorithm is supported by the backend virtio-crypto
> >> device.
> >>
> >> Also route crypto requests to a virtio-crypto
> >> device, only if it can support the requested service and
> >> algorithm.
> >>
> >> Signed-off-by: Farhan Ali 
> >> ---
> >>   drivers/crypto/virtio/virtio_crypto_algs.c   | 110
> >> ++-
> >>   drivers/crypto/virtio/virtio_crypto_common.h |  11 ++-
> >>   drivers/crypto/virtio/virtio_crypto_mgr.c|  81
> ++--
> >>   3 files changed, 158 insertions(+), 44 deletions(-)
> >>
> >> diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c
> >> b/drivers/crypto/virtio/virtio_crypto_algs.c
> >> index ba190cf..fef112a 100644
> >> --- a/drivers/crypto/virtio/virtio_crypto_algs.c
> >> +++ b/drivers/crypto/virtio/virtio_crypto_algs.c
> >> @@ -49,12 +49,18 @@ struct virtio_crypto_sym_request {
> >>bool encrypt;
> >>   };
> >>
> >> +struct virtio_crypto_algo {
> >> +  uint32_t algonum;
> >> +  uint32_t service;
> >> +  unsigned int active_devs;
> >> +  struct crypto_alg algo;
> >> +};
> >> +
> >>   /*
> >>* The algs_lock protects the below global virtio_crypto_active_devs
> >>* and crypto algorithms registration.
> >>*/
> >>   static DEFINE_MUTEX(algs_lock);
> >> -static unsigned int virtio_crypto_active_devs;
> >>   static void virtio_crypto_ablkcipher_finalize_req(
> >>struct virtio_crypto_sym_request *vc_sym_req,
> >>struct ablkcipher_request *req,
> >> @@ -312,13 +318,19 @@ static int virtio_crypto_ablkcipher_setkey(struct
> >> crypto_ablkcipher *tfm,
> >> unsigned int keylen)
> >>   {
> >>struct virtio_crypto_ablkcipher_ctx *ctx =
> crypto_ablkcipher_ctx(tfm);
> >> +  uint32_t alg;
> >>int ret;
> >>
> >> +  ret = virtio_crypto_alg_validate_key(keylen, &alg);
> >> +  if (ret)
> >> +  return ret;
> >> +
> >>if (!ctx->vcrypto) {
> >>    /* New key */
> >>int node = virtio_crypto_get_current_node();
> >>struct virtio_crypto *vcrypto =
> >> -        virtcrypto_get_dev_node(node);
> >> +virtcrypto_get_dev_node(node,
> >> +VIRTIO_CRYPTO_SERVICE_CIPHER, alg);
> >>if (!vcrypto) {
> >>pr_err("virtio_crypto: Could not find a virtio device 
> >> in the
> >> system\n");
> >
> > We'd better change the above error message now. What about:
> >   " virtio_crypto: Could not find a virtio device in the system or 
> > unsupported
> algo" ?
> >
> > Regards,
> > -Gonglei
> 
> 
> Sure, I will update the error message. But other than that does the rest
> of the code look good to you?
> 
Yes, good work. You can add my ack in v2:

Acked-by: Gonglei 

Regards,
-Gonglei





Re: [Qemu-devel] [RFC v1 1/2] crypto/virtio-crypto: Read crypto services and algorithm masks

2018-06-12 Thread Gonglei (Arei)

> -Original Message-
> From: Farhan Ali [mailto:al...@linux.ibm.com]
> Sent: Wednesday, June 13, 2018 1:07 AM
> To: Gonglei (Arei) ; linux-ker...@vger.kernel.org;
> k...@vger.kernel.org
> Cc: m...@redhat.com; qemu-devel@nongnu.org; longpeng
> ; pa...@linux.ibm.com; fran...@linux.ibm.com;
> borntrae...@de.ibm.com
> Subject: Re: [RFC v1 1/2] crypto/virtio-crypto: Read crypto services and
> algorithm masks
> 
> Hi Arei
> 
> On 06/11/2018 02:43 AM, Gonglei (Arei) wrote:
> >
> >> -Original Message-
> >> From: Farhan Ali [mailto:al...@linux.ibm.com]
> >> Sent: Saturday, June 09, 2018 3:09 AM
> >> To: linux-ker...@vger.kernel.org; k...@vger.kernel.org
> >> Cc: m...@redhat.com; qemu-devel@nongnu.org; Gonglei (Arei)
> >> ; longpeng ;
> >> pa...@linux.ibm.com; fran...@linux.ibm.com; borntrae...@de.ibm.com;
> >> al...@linux.ibm.com
> >> Subject: [RFC v1 1/2] crypto/virtio-crypto: Read crypto services and
> algorithm
> >> masks
> >>
> >> Read the crypto services and algorithm masks which provides
> >> information about the services and algorithms supported by
> >> virtio-crypto backend.
> >>
> >> Signed-off-by: Farhan Ali 
> >> ---
> >>   drivers/crypto/virtio/virtio_crypto_common.h | 14 ++
> >>   drivers/crypto/virtio/virtio_crypto_core.c   | 29
> >> 
> >>   2 files changed, 43 insertions(+)
> >>
> >> diff --git a/drivers/crypto/virtio/virtio_crypto_common.h
> >> b/drivers/crypto/virtio/virtio_crypto_common.h
> >> index 66501a5..05eca12e 100644
> >> --- a/drivers/crypto/virtio/virtio_crypto_common.h
> >> +++ b/drivers/crypto/virtio/virtio_crypto_common.h
> >> @@ -55,6 +55,20 @@ struct virtio_crypto {
> >>/* Number of queue currently used by the driver */
> >>u32 curr_queue;
> >>
> >> +  /*
> >> +   * Specifies the services mask which the device support,
> >> +   * see VIRTIO_CRYPTO_SERVICE_* above
> >> +   */
> >
> > Pls update the above comments. Except that:
> >
> > Acked-by: Gonglei 
> >
> 
> Sure will update the comment. How about " Specifies the services mask
> which the device support, * see VIRTIO_CRYPTO_SERVICE_*" ?
> 
It makes sense IMHO :)

Regards,
-Gonglei

> or should I specify the file where the VIRTIO_CRYPTO_SERVICE_* are defined?
> 
> Thanks
> Farhan
> 
> >> +  u32 crypto_services;
> >> +
> >> +  /* Detailed algorithms mask */
> >> +  u32 cipher_algo_l;
> >> +  u32 cipher_algo_h;
> >> +  u32 hash_algo;
> >> +  u32 mac_algo_l;
> >> +  u32 mac_algo_h;
> >> +  u32 aead_algo;
> >> +
> >>/* Maximum length of cipher key */
> >>u32 max_cipher_key_len;
> >>/* Maximum length of authenticated key */
> >> diff --git a/drivers/crypto/virtio/virtio_crypto_core.c
> >> b/drivers/crypto/virtio/virtio_crypto_core.c
> >> index 8332698..8f745f2 100644
> >> --- a/drivers/crypto/virtio/virtio_crypto_core.c
> >> +++ b/drivers/crypto/virtio/virtio_crypto_core.c
> >> @@ -303,6 +303,13 @@ static int virtcrypto_probe(struct virtio_device
> *vdev)
> >>u32 max_data_queues = 0, max_cipher_key_len = 0;
> >>u32 max_auth_key_len = 0;
> >>u64 max_size = 0;
> >> +  u32 cipher_algo_l = 0;
> >> +  u32 cipher_algo_h = 0;
> >> +  u32 hash_algo = 0;
> >> +  u32 mac_algo_l = 0;
> >> +  u32 mac_algo_h = 0;
> >> +  u32 aead_algo = 0;
> >> +  u32 crypto_services = 0;
> >>
> >>if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1))
> >>return -ENODEV;
> >> @@ -339,6 +346,20 @@ static int virtcrypto_probe(struct virtio_device
> *vdev)
> >>   max_auth_key_len, &max_auth_key_len);
> >>   virtio_cread(vdev, struct virtio_crypto_config,
> >>   max_size, &max_size);
> >> +  virtio_cread(vdev, struct virtio_crypto_config,
> >> +  crypto_services, &crypto_services);
> >> +  virtio_cread(vdev, struct virtio_crypto_config,
> >> +  cipher_algo_l, &cipher_algo_l);
> >> +  virtio_cread(vdev, struct virtio_crypto_config,
> >> +  cipher_algo_h, &cipher_algo_h);
> >> +  virtio_cread(vdev, struct virtio_crypto_config,
> >> +  hash_algo, &hash_algo);
> >> +  virtio_cread(vdev, struct virtio_crypto_config,
> >> +  mac_algo_l, &mac_algo_l);
> >> +  virtio_cread(vdev, struct virtio_cryp

Re: [Qemu-devel] An emulation failure occurs, if I hotplug vcpus immediately after the VM start

2018-06-11 Thread Gonglei (Arei)

> -Original Message-
> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
> Behalf Of David Hildenbrand
> Sent: Monday, June 11, 2018 8:36 PM
> To: Gonglei (Arei) ; 浙大邮箱 
> Subject: Re: An emulation failure occurs,if I hotplug vcpus immediately after 
> the
> VM start
> 
> On 11.06.2018 14:25, Gonglei (Arei) wrote:
> >
> > Hi David and Paolo,
> >
> >> -Original Message-
> >> From: David Hildenbrand [mailto:da...@redhat.com]
> >> Sent: Monday, June 11, 2018 6:44 PM
> >> To: 浙大邮箱 
> >> Cc: Paolo Bonzini ; Gonglei (Arei)
> >> ; Igor Mammedov ;
> >> xuyandong ; Zhanghailiang
> >> ; wangxin (U)
> >> ; lidonglin ;
> >> k...@vger.kernel.org; qemu-devel@nongnu.org; Huangweidong (C)
> >> 
> >> Subject: Re: An emulation failure occurs,if I hotplug vcpus immediately 
> >> after
> the
> >> VM start
> >>
> >> On 07.06.2018 18:03, 浙大邮箱 wrote:
> >>> Hi,all
> >>> I still have a question after reading your discussion: Will seabios 
> >>> detect the
> >> change of address space even if we add_region and del_region
> automatically? I
> >> guess that seabios may not take this change into consideration.
> >>
> >> Hi,
> >>
> >> We would just change the way how KVM memory slots are updated. This is
> >> right now not atomic, but would be later on. It should not have any
> >> other effect.
> >>
> > Yes. Do you have any plans to do that?
> 
> Well, I have plans to work on atomically resizable memory regions
> (atomic del + add), and what Paolo described could also work for that
> use case. However, I won't have time to look into that in the near
> future. So if somebody else wants to jump it, perfect. If not, it will
> have to wait unfortunately.
> 
Got it. :)

Thanks,
-Gonglei


Re: [Qemu-devel] An emulation failure occurs, if I hotplug vcpus immediately after the VM start

2018-06-11 Thread Gonglei (Arei)

Hi David and Paolo,

> -Original Message-
> From: David Hildenbrand [mailto:da...@redhat.com]
> Sent: Monday, June 11, 2018 6:44 PM
> To: 浙大邮箱 
> Cc: Paolo Bonzini ; Gonglei (Arei)
> ; Igor Mammedov ;
> xuyandong ; Zhanghailiang
> ; wangxin (U)
> ; lidonglin ;
> k...@vger.kernel.org; qemu-devel@nongnu.org; Huangweidong (C)
> 
> Subject: Re: An emulation failure occurs,if I hotplug vcpus immediately after 
> the
> VM start
> 
> On 07.06.2018 18:03, 浙大邮箱 wrote:
> > Hi,all
> > I still have a question after reading your discussion: Will seabios detect 
> > the
> change of address space even if we add_region and del_region automatically? I
> guess that seabios may not take this change into consideration.
> 
> Hi,
> 
> We would just change the way how KVM memory slots are updated. This is
> right now not atomic, but would be later on. It should not have any
> other effect.
> 
Yes. Do you have any plans to do that? 

Thanks,
-Gonglei


Re: [Qemu-devel] [RFC v1 2/2] crypto/virtio-crypto: Register an algo only if it's supported

2018-06-11 Thread Gonglei (Arei)



> -Original Message-
> From: Farhan Ali [mailto:al...@linux.ibm.com]
> Sent: Saturday, June 09, 2018 3:09 AM
> To: linux-ker...@vger.kernel.org; k...@vger.kernel.org
> Cc: m...@redhat.com; qemu-devel@nongnu.org; Gonglei (Arei)
> ; longpeng ;
> pa...@linux.ibm.com; fran...@linux.ibm.com; borntrae...@de.ibm.com;
> al...@linux.ibm.com
> Subject: [RFC v1 2/2] crypto/virtio-crypto: Register an algo only if it's 
> supported
> 
> From: Farhan Ali 
> 
> Register a crypto algo with the Linux crypto layer only if
> the algorithm is supported by the backend virtio-crypto
> device.
> 
> Also route crypto requests to a virtio-crypto
> device, only if it can support the requested service and
> algorithm.
> 
> Signed-off-by: Farhan Ali 
> ---
>  drivers/crypto/virtio/virtio_crypto_algs.c   | 110
> ++-
>  drivers/crypto/virtio/virtio_crypto_common.h |  11 ++-
>  drivers/crypto/virtio/virtio_crypto_mgr.c|  81 ++--
>  3 files changed, 158 insertions(+), 44 deletions(-)
> 
> diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c
> b/drivers/crypto/virtio/virtio_crypto_algs.c
> index ba190cf..fef112a 100644
> --- a/drivers/crypto/virtio/virtio_crypto_algs.c
> +++ b/drivers/crypto/virtio/virtio_crypto_algs.c
> @@ -49,12 +49,18 @@ struct virtio_crypto_sym_request {
>   bool encrypt;
>  };
> 
> +struct virtio_crypto_algo {
> + uint32_t algonum;
> + uint32_t service;
> + unsigned int active_devs;
> + struct crypto_alg algo;
> +};
> +
>  /*
>   * The algs_lock protects the below global virtio_crypto_active_devs
>   * and crypto algorithms registration.
>   */
>  static DEFINE_MUTEX(algs_lock);
> -static unsigned int virtio_crypto_active_devs;
>  static void virtio_crypto_ablkcipher_finalize_req(
>   struct virtio_crypto_sym_request *vc_sym_req,
>   struct ablkcipher_request *req,
> @@ -312,13 +318,19 @@ static int virtio_crypto_ablkcipher_setkey(struct
> crypto_ablkcipher *tfm,
>unsigned int keylen)
>  {
>   struct virtio_crypto_ablkcipher_ctx *ctx = crypto_ablkcipher_ctx(tfm);
> + uint32_t alg;
>   int ret;
> 
> + ret = virtio_crypto_alg_validate_key(keylen, &alg);
> + if (ret)
> + return ret;
> +
>   if (!ctx->vcrypto) {
>   /* New key */
>   int node = virtio_crypto_get_current_node();
>   struct virtio_crypto *vcrypto =
> -   virtcrypto_get_dev_node(node);
> +   virtcrypto_get_dev_node(node,
> +   VIRTIO_CRYPTO_SERVICE_CIPHER, alg);
>   if (!vcrypto) {
>   pr_err("virtio_crypto: Could not find a virtio device 
> in the
> system\n");

We'd better change the above error message now. What about:
 " virtio_crypto: Could not find a virtio device in the system or unsupported 
algo" ?

Regards,
-Gonglei






Re: [Qemu-devel] [RFC v1 1/2] crypto/virtio-crypto: Read crypto services and algorithm masks

2018-06-11 Thread Gonglei (Arei)


> -Original Message-
> From: Farhan Ali [mailto:al...@linux.ibm.com]
> Sent: Saturday, June 09, 2018 3:09 AM
> To: linux-ker...@vger.kernel.org; k...@vger.kernel.org
> Cc: m...@redhat.com; qemu-devel@nongnu.org; Gonglei (Arei)
> ; longpeng ;
> pa...@linux.ibm.com; fran...@linux.ibm.com; borntrae...@de.ibm.com;
> al...@linux.ibm.com
> Subject: [RFC v1 1/2] crypto/virtio-crypto: Read crypto services and algorithm
> masks
> 
> Read the crypto services and algorithm masks which provides
> information about the services and algorithms supported by
> virtio-crypto backend.
> 
> Signed-off-by: Farhan Ali 
> ---
>  drivers/crypto/virtio/virtio_crypto_common.h | 14 ++
>  drivers/crypto/virtio/virtio_crypto_core.c   | 29
> 
>  2 files changed, 43 insertions(+)
> 
> diff --git a/drivers/crypto/virtio/virtio_crypto_common.h
> b/drivers/crypto/virtio/virtio_crypto_common.h
> index 66501a5..05eca12e 100644
> --- a/drivers/crypto/virtio/virtio_crypto_common.h
> +++ b/drivers/crypto/virtio/virtio_crypto_common.h
> @@ -55,6 +55,20 @@ struct virtio_crypto {
>   /* Number of queue currently used by the driver */
>   u32 curr_queue;
> 
> + /*
> +  * Specifies the services mask which the device support,
> +  * see VIRTIO_CRYPTO_SERVICE_* above
> +  */

Pls update the above comments. Except that:

Acked-by: Gonglei 

> + u32 crypto_services;
> +
> + /* Detailed algorithms mask */
> + u32 cipher_algo_l;
> + u32 cipher_algo_h;
> + u32 hash_algo;
> + u32 mac_algo_l;
> + u32 mac_algo_h;
> + u32 aead_algo;
> +
>   /* Maximum length of cipher key */
>   u32 max_cipher_key_len;
>   /* Maximum length of authenticated key */
> diff --git a/drivers/crypto/virtio/virtio_crypto_core.c
> b/drivers/crypto/virtio/virtio_crypto_core.c
> index 8332698..8f745f2 100644
> --- a/drivers/crypto/virtio/virtio_crypto_core.c
> +++ b/drivers/crypto/virtio/virtio_crypto_core.c
> @@ -303,6 +303,13 @@ static int virtcrypto_probe(struct virtio_device *vdev)
>   u32 max_data_queues = 0, max_cipher_key_len = 0;
>   u32 max_auth_key_len = 0;
>   u64 max_size = 0;
> + u32 cipher_algo_l = 0;
> + u32 cipher_algo_h = 0;
> + u32 hash_algo = 0;
> + u32 mac_algo_l = 0;
> + u32 mac_algo_h = 0;
> + u32 aead_algo = 0;
> + u32 crypto_services = 0;
> 
>   if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1))
>   return -ENODEV;
> @@ -339,6 +346,20 @@ static int virtcrypto_probe(struct virtio_device *vdev)
>   max_auth_key_len, &max_auth_key_len);
>   virtio_cread(vdev, struct virtio_crypto_config,
>   max_size, &max_size);
> + virtio_cread(vdev, struct virtio_crypto_config,
> + crypto_services, &crypto_services);
> + virtio_cread(vdev, struct virtio_crypto_config,
> + cipher_algo_l, &cipher_algo_l);
> + virtio_cread(vdev, struct virtio_crypto_config,
> + cipher_algo_h, &cipher_algo_h);
> + virtio_cread(vdev, struct virtio_crypto_config,
> + hash_algo, &hash_algo);
> + virtio_cread(vdev, struct virtio_crypto_config,
> + mac_algo_l, &mac_algo_l);
> + virtio_cread(vdev, struct virtio_crypto_config,
> + mac_algo_h, &mac_algo_h);
> + virtio_cread(vdev, struct virtio_crypto_config,
> + aead_algo, &aead_algo);
> 
>   /* Add virtio crypto device to global table */
>   err = virtcrypto_devmgr_add_dev(vcrypto);
> @@ -358,6 +379,14 @@ static int virtcrypto_probe(struct virtio_device *vdev)
>   vcrypto->max_cipher_key_len = max_cipher_key_len;
>   vcrypto->max_auth_key_len = max_auth_key_len;
>   vcrypto->max_size = max_size;
> + vcrypto->crypto_services = crypto_services;
> + vcrypto->cipher_algo_l = cipher_algo_l;
> + vcrypto->cipher_algo_h = cipher_algo_h;
> + vcrypto->mac_algo_l = mac_algo_l;
> + vcrypto->mac_algo_h = mac_algo_h;
> + vcrypto->hash_algo = hash_algo;
> + vcrypto->aead_algo = aead_algo;
> +
> 
>   dev_info(&vdev->dev,
>   "max_queues: %u, max_cipher_key_len: %u, max_auth_key_len: %u,
> max_size 0x%llx\n",
> --
> 2.7.4




Re: [Qemu-devel] An emulation failure occurs, if I hotplug vcpus immediately after the VM start

2018-06-07 Thread Gonglei (Arei)

> -Original Message-
> From: David Hildenbrand [mailto:da...@redhat.com]
> Sent: Thursday, June 07, 2018 6:40 PM
> Subject: Re: An emulation failure occurs,if I hotplug vcpus immediately after 
> the
> VM start
> 
> On 06.06.2018 15:57, Paolo Bonzini wrote:
> > On 06/06/2018 15:28, Gonglei (Arei) wrote:
> >> gonglei: mem.slot: 3, mem.guest_phys_addr=0xc,
> >> mem.userspace_addr=0x7fc343ec, mem.flags=0, memory_size=0x0
> >> gonglei: mem.slot: 3, mem.guest_phys_addr=0xc,
> >> mem.userspace_addr=0x7fc343ec, mem.flags=0,
> memory_size=0x9000
> >>
> >> When the memory region is cleared, KVM marks the slot as invalid
> >> (it is set to KVM_MEMSLOT_INVALID).
> >>
> >> If SeaBIOS accesses this memory and causes a page fault, the fault
> >> handler will find an invalid value according to the gfn (via
> >> __gfn_to_pfn_memslot) and will ultimately return a failure.
> >>
> >> So, My questions are:
> >>
> >> 1) Why don't we hold kvm->slots_lock during page fault processing?
> >
> > Because it's protected by SRCU.  We don't need kvm->slots_lock on the
> > read side.
> >
> >> 2) How do we assure that vcpus will not access the corresponding
> >> region when deleting an memory slot?
> >
> > We don't.  It's generally a guest bug if they do, but the problem here
> > is that QEMU is splitting a memory region in two parts and that is not
> > atomic.
> 
> BTW, one ugly (but QEMU-only) fix would be to temporarily pause all
> VCPUs, do the change and then unpause all VCPUs.
> 
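
A minimal sketch of that suggestion, assuming the update could be driven
from a context that is allowed to pause vCPUs (the slot helpers are
hypothetical stand-ins for the two kvm_set_user_memory_region() calls):

    pause_all_vcpus();        /* no vCPU can fault on the stale slot */
    kvm_del_slot(kml, slot);  /* hypothetical: drop the old slot     */
    kvm_add_slot(kml, slot);  /* hypothetical: install the new one   */
    resume_all_vcpus();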

The memory region update is triggered by the vCPU thread, though, not
by the main thread.

Thanks,
-Gonglei

> >
> > One fix could be to add a KVM_SET_USER_MEMORY_REGIONS ioctl that
> > replaces the entire memory map atomically.
> >
> > Paolo
> >
> 
> 
> --
> 
> Thanks,
> 
> David / dhildenb


Re: [Qemu-devel] [PATCH] ps2: check PS2Queue wptr pointer in post_load routine

2018-06-07 Thread Gonglei (Arei)



> -Original Message-
> From: liujunjie (A)
> Sent: Thursday, June 07, 2018 4:03 PM
> To: kra...@redhat.com; berra...@redhat.com
> Cc: Gonglei (Arei) ; wangxin (U)
> ; Huangweidong (C)
> ; fangying ;
> qemu-devel@nongnu.org; liujunjie (A) 
> Subject: [PATCH] ps2: check PS2Queue wptr pointer in post_load routine
> 
> Commit 802cbcb7300 fixed most issues with migrating this state. But we
> still need to check whether the queue size equals PS2_QUEUE_SIZE; if
> so, wptr should be set to 0. Otherwise wptr would be larger than
> PS2_QUEUE_SIZE and never wrap back when ps2_queue_noirq is called.
> This could lead to an OOB access; add a check to avoid it.
> 
> Signed-off-by: liujunjie 
> ---
>  hw/input/ps2.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/input/ps2.c b/hw/input/ps2.c
> index eeec618..fdfcadf 100644
> --- a/hw/input/ps2.c
> +++ b/hw/input/ps2.c
> @@ -927,7 +927,7 @@ static void ps2_common_post_load(PS2State *s)
> 
>  /* reset rptr/wptr/count */
>  q->rptr = 0;
> -q->wptr = size;
> +q->wptr = (size == PS2_QUEUE_SIZE) ? 0 : size;
>  q->count = size;
>  s->update_irq(s->update_arg, q->count != 0);
>  }
> --

Reviewed-by: Gonglei 




Re: [Qemu-devel] An emulation failure occurs, if I hotplug vcpus immediately after the VM start

2018-06-06 Thread Gonglei (Arei)
Hi Igor,

Thanks for your response. :)

> -Original Message-
> From: Igor Mammedov [mailto:imamm...@redhat.com]
> Sent: Friday, June 01, 2018 6:23 PM
> 
> On Fri, 1 Jun 2018 08:17:12 +
> xuyandong  wrote:
> 
> > Hi there,
> >
> > I am doing some test on qemu vcpu hotplug and I run into some trouble.
> > An emulation failure occurs and qemu prints the following msg:
> >
> > KVM internal error. Suberror: 1
> > emulation failure
> > EAX= EBX= ECX= EDX=0600
> > ESI= EDI= EBP= ESP=fff8
> > EIP=ff53 EFL=00010082 [--S] CPL=0 II=0 A20=1 SMM=0 HLT=0
> > ES =   9300
> > CS =f000 000f  9b00
> > SS =   9300
> > DS =   9300
> > FS =   9300
> > GS =   9300
> > LDT=   8200
> > TR =   8b00if
> > GDT=  
> > IDT=  
> > CR0=6010 CR2= CR3= CR4=
> > DR0= DR1= DR2=
> DR3=
> > DR6=0ff0 DR7=0400
> > EFER=
> > Code=31 d2 eb 04 66 83 ca ff 66 89 d0 66 5b 66 c3 66 89 d0 66 c3  66 68
> 21 8a 00 00 e9 08 d7 66 56 66 53 66 83 ec 0c 66 89 c3 66 e8 ce 7b ff ff 66 89 
> c6
> >
> > I notice that guest is still running SeabBIOS in real mode when the vcpu has
> just been pluged.
> > This emulation failure can be steadly reproduced if I am doing vcpu hotplug
> during VM launch process.
> > After some digging, I find this KVM internal error shows up because KVM
> cannot emulate some MMIO (gpa 0xfff53 ).
> >
> > So I am confused,
> > (1) does qemu support vcpu hotplug even if guest is running seabios ?
> There is no code that forbids it, and I would expect it not to trigger error
> and be NOP.
> 
> > (2) the gpa (0xfff53) is an address of BIOS ROM section, why does kvm
> confirm it as a mmio address incorrectly?
> KVM trace and bios debug log might give more information to guess where to
> look
> or even better would be to debug Seabios and find out what exactly
> goes wrong if you could do it.

This issue can't be reproduced when we enable the SeaBIOS debug log or KVM
tracing. :(

After a few days of debugging, we found that this problem occurs every
time the memory region is cleared (memory_size is 0) while a VFIO device
is hot-plugged.

The key function is kvm_set_user_memory_region(), I added some logs in it.

gonglei: mem.slot: 3, mem.guest_phys_addr=0xc, 
mem.userspace_addr=0x7fc751e0, mem.flags=0, memory_size=0x20000
gonglei: mem.slot: 3, mem.guest_phys_addr=0xc, 
mem.userspace_addr=0x7fc751e0, mem.flags=0, memory_size=0x0
gonglei: mem.slot: 3, mem.guest_phys_addr=0xc, 
mem.userspace_addr=0x7fc343ec, mem.flags=0, memory_size=0x1
gonglei: mem.slot: 3, mem.guest_phys_addr=0xc, 
mem.userspace_addr=0x7fc343ec, mem.flags=0, memory_size=0x0
gonglei: mem.slot: 3, mem.guest_phys_addr=0xc, 
mem.userspace_addr=0x7fc343ec, mem.flags=0, memory_size=0xbff4
gonglei: mem.slot: 3, mem.guest_phys_addr=0xc, 
mem.userspace_addr=0x7fc343ec, mem.flags=0, memory_size=0x0
gonglei: mem.slot: 3, mem.guest_phys_addr=0xc, 
mem.userspace_addr=0x7fc343ec, mem.flags=0, memory_size=0xbff4
gonglei: mem.slot: 3, mem.guest_phys_addr=0xc, 
mem.userspace_addr=0x7fc343ec, mem.flags=0, memory_size=0x0
gonglei: mem.slot: 3, mem.guest_phys_addr=0xc, 
mem.userspace_addr=0x7fc343ec, mem.flags=0, memory_size=0x9000

When the memory region is cleared, KVM marks the slot as invalid
(it is set to KVM_MEMSLOT_INVALID).

If SeaBIOS accesses this memory and causes a page fault, the fault handler
will find an invalid value according to the gfn (via __gfn_to_pfn_memslot)
and will ultimately return a failure.

The call chain in KVM is as follows:

kvm_mmu_page_fault
tdp_page_fault
try_async_pf
__gfn_to_pfn_memslot
    __direct_map // return true;
x86_emulate_instruction
handle_emulation_failure

The call chain in QEMU is as follows:

Breakpoint 1, kvm_set_user_memory_region (kml=0x564aa1e2c890, 
slot=0x564aa1e2d230) at /mnt/sdb/gonglei/qemu/kvm-all.c:261
(gdb) bt
#0  kvm_set_user_memory_region (kml=0x564aa1e2c890, slot=0x564aa1e2d230) at 
/mnt/sdb/gonglei/qemu/kvm-all.c:261
#1  0x564a9e7e3096 in kvm_set_phys_mem (kml=0x564aa1e2c890, 
section=0x7febeb296500, add=false) at /mnt/sdb/gon

Re: [Qemu-devel] [PATCH] socket: dont't free msgfds if error equals EAGAIN

2018-05-30 Thread Gonglei (Arei)


> -Original Message-
> From: Eric Blake [mailto:ebl...@redhat.com]
> Sent: Wednesday, May 30, 2018 3:33 AM
> To: linzhecheng ; Marc-André Lureau
> 
> Cc: QEMU ; Paolo Bonzini ;
> wangxin (U) ; Gonglei (Arei)
> ; pet...@redhat.com; berra...@redhat.com
> Subject: Re: [Qemu-devel] [PATCH] socket: dont't free msgfds if error equals
> EAGAIN
> 
> On 05/29/2018 04:33 AM, linzhecheng wrote:
> > I think this patch doesn't fix my issue. For more details, please see 
> > Gonglei's
> reply.
> > https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg06296.html
> 
> Your mailer is not honoring threading (it failed to include
> 'In-Reply-To:' and 'References:' headers that refer to the message you
> are replying to), and you are top-posting, both of which make it
> difficult to follow your comments on a technical list.
> 
> 
Agree.

@Zhecheng, pls resend the patch with a commit message, Cc'ing these guys.

Regards,
-Gonglei


Re: [Qemu-devel] [PATCH] socket: dont't free msgfds if error equals EAGAIN

2018-05-29 Thread Gonglei (Arei)
Hi all,

The issue is easy to reproduce when we configure the multi-queue function
for vhost-user NICs.

The main call chain is as follows (steps numbered in execution order):

vhost_user_write                    ==> 0) sets s->write_msgfds_num to 8
  qemu_chr_fe_write_all
    qemu_chr_fe_write_buffer        ==> 4) retries because (ret < 0 && errno == EAGAIN)
      tcp_chr_write                 ==> 3) frees s->write_msgfds and sets s->write_msgfds_num to 0
        io_channel_send_full        ==> 2) sets errno = EAGAIN and returns -1
          qio_channel_socket_writev ==> 1) returns QIO_CHANNEL_ERR_BLOCK when ret < 0 && errno == EAGAIN

Step 4) above may then cause undefined behavior on the vhost-user server
side, because the control message is re-sent without its fds.
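
The caller's retry behaviour is roughly the simplified sketch below (not
the exact QEMU code), which is why freeing the fds at step 3) is harmful:
the retry at step 4) resends the message without them.

    while (*offset < len) {
        ret = tcp_chr_write(chr, buf + *offset, len - *offset);
        if (ret < 0 && errno == EAGAIN) {
            g_usleep(100);
            continue;  /* retries, but s->write_msgfds is already freed */
        }
        if (ret <= 0) {
            break;
        }
        *offset += ret;
    }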

So we submitted a patch to fix it. What's your opinion?

Regards,
-Gonglei

> -Original Message-
> From: linzhecheng
> Sent: Tuesday, May 29, 2018 4:20 PM
> To: qemu-devel@nongnu.org
> Cc: pbonz...@redhat.com; wangxin (U) ;
> berra...@redhat.com; pet...@redhat.com; marcandre.lur...@redhat.com;
> ebl...@redhat.com; Gonglei (Arei) 
> Subject: RE: [PATCH] socket: dont't free msgfds if error equals EAGAIN
> 
> CC'ing Daniel P. Berrangé , Peter Xu, Marc-André Lureau, Eric Blake, Gonglei
> 
> > -Original Message-
> > From: linzhecheng
> > Sent: Tuesday, May 29, 2018 10:53 AM
> > To: qemu-devel@nongnu.org
> > Cc: pbonz...@redhat.com; wangxin (U)
> ;
> > Subject: [PATCH] socket: dont't free msgfds if error equals EAGAIN
> > 主题: [PATCH] socket: dont't free msgfds if error equals EAGAIN
> >
> > Signed-off-by: linzhecheng 
> >
> > diff --git a/chardev/char-socket.c b/chardev/char-socket.c index
> > 159e69c3b1..17519ec589 100644
> > --- a/chardev/char-socket.c
> > +++ b/chardev/char-socket.c
> > @@ -134,8 +134,8 @@ static int tcp_chr_write(Chardev *chr, const uint8_t
> > *buf, int len)
> >  s->write_msgfds,
> >  s->write_msgfds_num);
> >
> > -/* free the written msgfds, no matter what */
> > -if (s->write_msgfds_num) {
> > +/* free the written msgfds in any cases other than errno==EAGAIN
> */
> > +if (EAGAIN != errno && s->write_msgfds_num) {
> >  g_free(s->write_msgfds);
> >  s->write_msgfds = 0;
> >  s->write_msgfds_num = 0;
> > --
> > 2.12.2.windows.2
> >



Re: [Qemu-devel] [PATCH] i386: Allow monitor / mwait cpuid override

2018-02-27 Thread Gonglei (Arei)
Hi all,

Guests can achieve good performance in 'Message Passing Workloads'
scenarios when the X86_FEATURE_MWAIT feature is presented by QEMU.
The reason is that, once the guest knows about that feature, it can
idle using MWAIT, which avoids VM exits and achieves high performance
in latency-sensitive scenarios.
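
With the patch applied, the feature would be enabled roughly as below
(the property name comes from the DEFINE_PROP_BOOL("mwait", ...) entry
in the patch):

    qemu-system-x86_64 -enable-kvm -cpu host,mwait=on ...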

Is there any plan for this patch? 

Or May I send a updated version based on yours? @Alex?

Thanks,
-Gonglei


> -Original Message-
> From: Qemu-devel
> [mailto:qemu-devel-bounces+arei.gonglei=huawei@nongnu.org] On
> Behalf Of Alexander Graf
> Sent: Monday, March 27, 2017 10:27 PM
> To: qemu-devel@nongnu.org
> Cc: Paolo Bonzini; Eduardo Habkost; Richard Henderson
> Subject: [Qemu-devel] [PATCH] i386: Allow monitor / mwait cpuid override
> 
> KVM allows trap and emulate (read: NOP) of the MONITOR and MWAIT
> instructions. There is work undergoing to enable actual execution
> of these inside of KVM, but nobody really wants to expose the feature
> to the guest by default, as it would eat up all of the host CPU.
> 
> So today there is no streamlined way to actually notify the guest that
> it's ok to execute MONITOR / MWAIT, even when we want to explicitly
> leave the guest in guest context.
> 
> This patch adds a new -cpu parameter called "mwait" which - when
> enabled - force enables the MONITOR / MWAIT CPUID flag, even when
> the underlying accel framework does not explicitly advertise support.
> 
> With that in place, we can explicitly allow users to specify that
> they want have the guest execute MONITOR / MWAIT in its idle loop.
> 
> Signed-off-by: Alexander Graf <ag...@suse.de>
> ---
>  target/i386/cpu.c | 5 +++++
>  target/i386/cpu.h | 1 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 7aa7622..c44020b 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -3460,6 +3460,10 @@ static int x86_cpu_filter_features(X86CPU *cpu)
>  x86_cpu_get_supported_feature_word(w, false);
>  uint32_t requested_features = env->features[w];
>  env->features[w] &= host_feat;
> +if (cpu->expose_monitor && (w == FEAT_1_ECX)) {
> +/* Force monitor feature in */
> +env->features[w] |= CPUID_EXT_MONITOR;
> +}
>  cpu->filtered_features[w] = requested_features &
> ~env->features[w];
>  if (cpu->filtered_features[w]) {
>  rv = 1;
> @@ -3988,6 +3992,7 @@ static Property x86_cpu_properties[] = {
>  DEFINE_PROP_BOOL("check", X86CPU, check_cpuid, true),
>  DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, false),
>  DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true),
> +DEFINE_PROP_BOOL("mwait", X86CPU, expose_monitor, false),
>  DEFINE_PROP_UINT32("phys-bits", X86CPU, phys_bits, 0),
>  DEFINE_PROP_BOOL("host-phys-bits", X86CPU, host_phys_bits, false),
>  DEFINE_PROP_BOOL("fill-mtrr-mask", X86CPU, fill_mtrr_mask, true),
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index 07401ad..7400d00 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -1214,6 +1214,7 @@ struct X86CPU {
>  bool check_cpuid;
>  bool enforce_cpuid;
>  bool expose_kvm;
> +bool expose_monitor;
>  bool migratable;
>  bool max_features; /* Enable all supported features automatically */
>  uint32_t apic_id;
> --
> 1.8.5.6
> 




Re: [Qemu-devel] [PATCH v3] rtc: placing RTC memory region outside BQL

2018-02-22 Thread Gonglei (Arei)
Ping...


Regards,
-Gonglei


> -Original Message-
> From: Gonglei (Arei)
> Sent: Monday, February 12, 2018 4:58 PM
> To: qemu-devel@nongnu.org
> Cc: pbonz...@redhat.com; Huangweidong (C); peter.mayd...@linaro.org;
> Gonglei (Arei)
> Subject: [PATCH v3] rtc: placing RTC memory region outside BQL
> 
> Windows guests use the RTC as the clock source device and access it
> frequently. Let's move the RTC memory region outside the BQL to
> decrease overhead for Windows guests. Meanwhile, add a new lock to
> prevent different vCPUs from accessing the RTC at the same time.
> 
> I tested PCMark 8 (https://www.futuremark.com/benchmarks/pcmark)
> in win7 guest and got the below results:
> 
> Guest: 2U2G
> 
> Before applying the patch:
> 
> Your Work 2.0 score: 2000
> Web Browsing - JunglePin 0.334s
> Web Browsing - Amazonia  0.132s
> Writing  3.59s
> Spreadsheet  70.13s
> Video Chat v2/Video Chat playback 1 v2   22.8 fps
> Video Chat v2/Video Chat encoding v2 307.0 ms
> Benchmark duration   1h 35min 46s
> 
> After applying the patch:
> 
> Your Work 2.0 score: 2040
> Web Browsing - JunglePin 0.345s
> Web Browsing - Amazonia  0.132s
> Writing  3.56s
> Spreadsheet  67.83s
> Video Chat v2/Video Chat playback 1 v2   28.7 fps
> Video Chat v2/Video Chat encoding v2 324.7 ms
> Benchmark duration   1h 32min 5s
> 
> Test results show that the optimization is effective under
> stressful conditions.
> 
> Signed-off-by: Gonglei <arei.gong...@huawei.com>
> ---
> v3->v2:
>  a) fix a typo, 's/rasie/raise/' [Peter]
>  b) change commit message [Peter]
> 
> v2->v1:
>  a)Adding a new lock to avoid different vCPUs
>access the RTC together. [Paolo]
>  b)Taking the BQL before raising the outbound IRQ line. [Peter]
>  c)Don't hold BQL if it was holden. [Peter]
> 
>  hw/timer/mc146818rtc.c | 55
> ++
>  1 file changed, 47 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/timer/mc146818rtc.c b/hw/timer/mc146818rtc.c
> index 35a05a6..f0a2a62 100644
> --- a/hw/timer/mc146818rtc.c
> +++ b/hw/timer/mc146818rtc.c
> @@ -85,6 +85,7 @@ typedef struct RTCState {
>  uint16_t irq_reinject_on_ack_count;
>  uint32_t irq_coalesced;
>  uint32_t period;
> +QemuMutex rtc_lock;
>  QEMUTimer *coalesced_timer;
>  Notifier clock_reset_notifier;
>  LostTickPolicy lost_tick_policy;
> @@ -125,6 +126,36 @@ static void rtc_coalesced_timer_update(RTCState *s)
>  }
>  }
> 
> +static void rtc_raise_irq(RTCState *s)
> +{
> +bool unlocked = !qemu_mutex_iothread_locked();
> +
> +if (unlocked) {
> +qemu_mutex_lock_iothread();
> +}
> +
> +qemu_irq_raise(s->irq);
> +
> +if (unlocked) {
> +qemu_mutex_unlock_iothread();
> +}
> +}
> +
> +static void rtc_lower_irq(RTCState *s)
> +{
> +bool unlocked = !qemu_mutex_iothread_locked();
> +
> +if (unlocked) {
> +qemu_mutex_lock_iothread();
> +}
> +
> +qemu_irq_lower(s->irq);
> +
> +if (unlocked) {
> +qemu_mutex_unlock_iothread();
> +}
> +}
> +
>  static QLIST_HEAD(, RTCState) rtc_devices =
>  QLIST_HEAD_INITIALIZER(rtc_devices);
> 
> @@ -141,7 +172,7 @@ void qmp_rtc_reset_reinjection(Error **errp)
>  static bool rtc_policy_slew_deliver_irq(RTCState *s)
>  {
>  apic_reset_irq_delivered();
> -qemu_irq_raise(s->irq);
> +rtc_raise_irq(s);
>  return apic_get_irq_delivered();
>  }
> 
> @@ -277,8 +308,9 @@ static void rtc_periodic_timer(void *opaque)
>  DPRINTF_C("cmos: coalesced irqs increased to %d\n",
>s->irq_coalesced);
>  }
> -} else
> -qemu_irq_raise(s->irq);
> +} else {
> +rtc_raise_irq(s);
> +}
>  }
>  }
> 
> @@ -459,7 +491,7 @@ static void rtc_update_timer(void *opaque)
>  s->cmos_data[RTC_REG_C] |= irqs;
>  if ((new_irqs & s->cmos_data[RTC_REG_B]) != 0) {
>  s->cmos_data[RTC_REG_C] |= REG_C_IRQF;
> -qemu_irq_raise(s->irq);
> +rtc_raise_irq(s);
>  }
>  check_update_timer(s);
>  }
> @@ -471,6 +503,7 @@ static void cmos_ioport_write(void *opaque, hwaddr
> addr,
>  uint32_t old_period;
>  bool update_periodic_timer;
> 
> +qemu_mutex_lock(&s->rtc_lock);
>

[Qemu-devel] [PATCH v3] rtc: placing RTC memory region outside BQL

2018-02-12 Thread Gonglei
Windows guests use the RTC as the clock source device and access it
frequently. Let's move the RTC memory region outside the BQL to
decrease overhead for Windows guests. Meanwhile, add a new lock to
prevent different vCPUs from accessing the RTC at the same time.

I tested PCMark 8 (https://www.futuremark.com/benchmarks/pcmark) 
in win7 guest and got the below results:

Guest: 2U2G

Before applying the patch:

Your Work 2.0 score: 2000
Web Browsing - JunglePin 0.334s
Web Browsing - Amazonia  0.132s
Writing  3.59s
Spreadsheet  70.13s
Video Chat v2/Video Chat playback 1 v2   22.8 fps
Video Chat v2/Video Chat encoding v2 307.0 ms
Benchmark duration   1h 35min 46s

After applying the patch:

Your Work 2.0 score: 2040
Web Browsing - JunglePin 0.345s
Web Browsing - Amazonia  0.132s
Writing  3.56s
Spreadsheet  67.83s
Video Chat v2/Video Chat playback 1 v2   28.7 fps
Video Chat v2/Video Chat encoding v2 324.7 ms
Benchmark duration   1h 32min 5s

Test results show that the optimization is effective under
stressful conditions.

Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
v3->v2:
 a) fix a typo, 's/rasie/raise/' [Peter]
 b) change commit message [Peter]

v2->v1:
 a)Adding a new lock to avoid different vCPUs
   access the RTC together. [Paolo]
 b)Taking the BQL before raising the outbound IRQ line. [Peter]
 c)Don't hold BQL if it was holden. [Peter]

 hw/timer/mc146818rtc.c | 55 ++
 1 file changed, 47 insertions(+), 8 deletions(-)

diff --git a/hw/timer/mc146818rtc.c b/hw/timer/mc146818rtc.c
index 35a05a6..f0a2a62 100644
--- a/hw/timer/mc146818rtc.c
+++ b/hw/timer/mc146818rtc.c
@@ -85,6 +85,7 @@ typedef struct RTCState {
 uint16_t irq_reinject_on_ack_count;
 uint32_t irq_coalesced;
 uint32_t period;
+QemuMutex rtc_lock;
 QEMUTimer *coalesced_timer;
 Notifier clock_reset_notifier;
 LostTickPolicy lost_tick_policy;
@@ -125,6 +126,36 @@ static void rtc_coalesced_timer_update(RTCState *s)
 }
 }
 
+static void rtc_raise_irq(RTCState *s)
+{
+bool unlocked = !qemu_mutex_iothread_locked();
+
+if (unlocked) {
+qemu_mutex_lock_iothread();
+}
+
+qemu_irq_raise(s->irq);
+
+if (unlocked) {
+qemu_mutex_unlock_iothread();
+}
+}
+
+static void rtc_lower_irq(RTCState *s)
+{
+bool unlocked = !qemu_mutex_iothread_locked();
+
+if (unlocked) {
+qemu_mutex_lock_iothread();
+}
+
+qemu_irq_lower(s->irq);
+
+if (unlocked) {
+qemu_mutex_unlock_iothread();
+}
+}
+
 static QLIST_HEAD(, RTCState) rtc_devices =
 QLIST_HEAD_INITIALIZER(rtc_devices);
 
@@ -141,7 +172,7 @@ void qmp_rtc_reset_reinjection(Error **errp)
 static bool rtc_policy_slew_deliver_irq(RTCState *s)
 {
 apic_reset_irq_delivered();
-qemu_irq_raise(s->irq);
+rtc_raise_irq(s);
 return apic_get_irq_delivered();
 }
 
@@ -277,8 +308,9 @@ static void rtc_periodic_timer(void *opaque)
 DPRINTF_C("cmos: coalesced irqs increased to %d\n",
   s->irq_coalesced);
 }
-} else
-qemu_irq_raise(s->irq);
+} else {
+rtc_raise_irq(s);
+}
 }
 }
 
@@ -459,7 +491,7 @@ static void rtc_update_timer(void *opaque)
 s->cmos_data[RTC_REG_C] |= irqs;
 if ((new_irqs & s->cmos_data[RTC_REG_B]) != 0) {
 s->cmos_data[RTC_REG_C] |= REG_C_IRQF;
-qemu_irq_raise(s->irq);
+rtc_raise_irq(s);
 }
 check_update_timer(s);
 }
@@ -471,6 +503,7 @@ static void cmos_ioport_write(void *opaque, hwaddr addr,
 uint32_t old_period;
 bool update_periodic_timer;
 
+qemu_mutex_lock(&s->rtc_lock);
 if ((addr & 1) == 0) {
 s->cmos_index = data & 0x7f;
 } else {
@@ -560,10 +593,10 @@ static void cmos_ioport_write(void *opaque, hwaddr addr,
  * becomes enabled, raise an interrupt immediately.  */
 if (data & s->cmos_data[RTC_REG_C] & REG_C_MASK) {
 s->cmos_data[RTC_REG_C] |= REG_C_IRQF;
-qemu_irq_raise(s->irq);
+rtc_raise_irq(s);
 } else {
 s->cmos_data[RTC_REG_C] &= ~REG_C_IRQF;
-qemu_irq_lower(s->irq);
+rtc_lower_irq(s);
 }
 s->cmos_data[RTC_REG_B] = data;
 
@@ -583,6 +616,7 @@ static void cmos_ioport_write(void *opaque, hwaddr addr,
 break;
 }
 }
+qemu_mutex_unlock(&s->rtc_lock);
 }
 
 static inline int rtc_to_bcd(RTCState *s, int a)
@@ -710,6 +744,7 @@ static uint64_t cmos_ioport_read(void *opaque, hwaddr addr,
 if ((addr & 1) == 0) {
 retur

Re: [Qemu-devel] [PULL 00/26] virtio, vhost, pci, pc: features, fixes and cleanups

2018-02-10 Thread Gonglei (Arei)
> -Original Message-
> From: Qemu-devel
> [mailto:qemu-devel-bounces+arei.gonglei=huawei@nongnu.org] On
> Behalf Of Peter Maydell
> Sent: Friday, February 09, 2018 6:07 PM
> To: Michael S. Tsirkin
> Cc: QEMU Developers
> Subject: Re: [Qemu-devel] [PULL 00/26] virtio, vhost, pci, pc: features, 
> fixes and
> cleanups
> 
> On 8 February 2018 at 19:08, Michael S. Tsirkin <m...@redhat.com> wrote:
> > The following changes since commit
> 008a51bbb343972dd8cf09126da8c3b87f4e1c96:
> >
> >   Merge remote-tracking branch 'remotes/famz/tags/staging-pull-request'
> into staging (2018-02-08 14:31:51 +)
> >
> > are available in the git repository at:
> >
> >   git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream
> >
> > for you to fetch changes up to
> f4ac9b2e04e8d98854a97bc473353207765aa9e7:
> >
> >   virtio-balloon: include statistics of disk/file caches (2018-02-08 
> > 21:06:42
> +0200)
> >
> > 
> > virtio,vhost,pci,pc: features, fixes and cleanups
> >
> > - a new vhost crypto device
> > - new stats in virtio balloon
> > - virtio eventfd rework for boot speedup
> > - vhost memory rework for boot speedup
> > - fixes and cleanups all over the place
> >
> > Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> >
> 
> Hi. This has some format-string issues:
> 
> /home/peter.maydell/qemu/backends/cryptodev-vhost-user.c: In function
> 'cryptodev_vhost_user_start':
> /home/peter.maydell/qemu/backends/cryptodev-vhost-user.c:112:26:
> error: format '%lu' expects argument of type 'long unsigned int', but
> argument 2 has type 'size_t {aka unsigned int}' [-Werror=format=]
>  error_report("failed to init vhost_crypto for queue %lu", i);
>   ^
> /home/peter.maydell/qemu/backends/cryptodev-vhost-user.c: In function
> 'cryptodev_vhost_user_init':
> /home/peter.maydell/qemu/backends/cryptodev-vhost-user.c:205:40:
> error: format '%lu' expects argument of type 'long unsigned int', but
> argument 2 has type 'size_t {aka unsigned int}' [-Werror=format=]
>  cc->info_str = g_strdup_printf("cryptodev-vhost-user%lu to %s ",
> ^
> 
Using %zu instead of %lu is correct. Michael, could you pls fix it
directly?
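
That is, something like the sketch below; %zu is the correct conversion
specifier for size_t on both 32-bit and 64-bit hosts:

    size_t i = 0;
    error_report("failed to init vhost_crypto for queue %zu", i);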

Very sorry for the inconvenience. :(

Thanks,
-Gonglei



Re: [Qemu-devel] [PATCH v2] rtc: placing RTC memory region outside BQL

2018-02-09 Thread Gonglei (Arei)
>
> > >
> > > $ cat strace_c.sh
> > > strace -tt -p $1 -c -o result_$1.log &
> > > sleep $2
> > > pid=$(pidof strace)
> > > kill $pid
> > > cat result_$1.log
> > >
> > > Before applying this change:
> > > $ ./strace_c.sh 10528 30
> > > % time     seconds  usecs/call     calls    errors syscall
> > > ------ ----------- ----------- --------- --------- ----------------
> > >  93.87    0.119070          30      4000           ppoll
> > >   3.27    0.004148           2      2038           ioctl
> > >   2.66    0.003370           2      2014           futex
> > >   0.09    0.000113           1       106           read
> > >   0.09    0.000109           1       104           io_getevents
> > >   0.02    0.000029           1        30           poll
> > >   0.00    0.000000           0         1           write
> > > ------ ----------- ----------- --------- --------- ----------------
> > > 100.00    0.126839                  8293           total
> > >
> > > After applying the change:
> > > $ ./strace_c.sh 23829 30
> > > % time     seconds  usecs/call     calls    errors syscall
> > > ------ ----------- ----------- --------- --------- ----------------
> > >  92.86    0.067441          16      4094           ppoll
> > >   4.85    0.003522           2      2136           ioctl
> > >   1.17    0.000850           4       189           futex
> > >   0.54    0.000395           2       202           read
> > >   0.52    0.000379           2       202           io_getevents
> > >   0.05    0.000037           1        30           poll
> > > ------ ----------- ----------- --------- --------- ----------------
> > > 100.00    0.072624                  6853           total
> > >
> > > The number of futex calls decreases by ~90.6% on an idle Windows 7 guest.
> >
> > These are the same figures as from v1 -- it would be interesting
> > to check whether the additional locking that v2 adds has affected
> > the results.
> >
> Oh, yes. The futex count for v2 doesn't decline much compared to v1,
> because it now takes the BQL before raising the outbound IRQ line.
> 
> Before applying v2:
> # ./strace_c.sh 8776 30
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  78.01    0.164188          26      6436           ppoll
>   8.39    0.017650           5      3700        39 futex
>   7.68    0.016157           6      2758           ioctl
>   5.48    0.011530           3      4586      1113 read
>   0.30    0.000640          20        32           io_submit
>   0.15    0.000317           4        89           write
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.210482                 17601      1152 total
> 
> After applying v2:
> # ./strace_c.sh 15968 30
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  78.28    0.171117          27      6272           ppoll
>   8.50    0.018571           5      3663        21 futex
>   7.76    0.016973           6      2732           ioctl
>   4.85    0.010597           3      4115       853 read
>   0.31    0.000672          11        63           io_submit
>   0.30    0.000659           4       180           write
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.218589                 17025       874 total
> 
> > Does the patch improve performance in a more interesting use
> > case than "the guest is just idle" ?
> >
> I think so; after all, the scope of the locking is reduced.
> Besides this, can we optimize the RTC timer to avoid holding the BQL,
> by using separate threads?
> 
Hi Peter, Paolo

I tested PCMark 8 (https://www.futuremark.com/benchmarks/pcmark)
in a win7 guest and got the results below:

Guest: 2 vCPUs, 2 GB memory (2U2G)

Before applying v2:

Your Work 2.0 score:                        2000
Web Browsing - JunglePin                    0.334s
Web Browsing - Amazonia                     0.132s
Writing                                     3.59s
Spreadsheet                                 70.13s
Video Chat v2/Video Chat playback 1 v2      22.8 fps
Video Chat v2/Video Chat encoding v2        307.0 ms
Benchmark duration                          1h 35min 46s

After applying v2:

Your Work 2.0 score:                        2040
Web Browsing - JunglePin                    0.345s
Web Browsing - Amazonia                     0.132s
Writing                                     3.56s
Spreadsheet                                 67.83s
Video Chat v2/Video Chat playback 1 v2      28.7 fps
Video Chat v2/Video Chat encoding v2        324.7 ms
Benchmark duration                          1h 32min 5s

The test results show that the optimization is effective under load as well.

Thanks,
-Gonglei



Re: [Qemu-devel] [PATCH] vl: fix possible int overflow for qemu_timedate_diff()

2018-02-07 Thread Gonglei (Arei)
> -Original Message-
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: Tuesday, February 06, 2018 11:52 PM
> To: Gonglei (Arei); qemu-devel@nongnu.org
> Cc: shenghualong
> Subject: Re: [PATCH] vl: fix possible int overflow for qemu_timedate_diff()
> 
> On 01/02/2018 12:59, Gonglei wrote:
> > From: shenghualong <shenghual...@huawei.com>
> >
> > When a Windows guest user sets the time to year 2099,
> > the return value of qemu_timedate_diff() will overflow
> > in variable clock mode, as below:
> >
> >  
> >
> > Let's change the return value of qemu_timedate_diff() from
> > int to time_t to fix the possible overflow problem.
> >
> > Signed-off-by: shenghualong <shenghual...@huawei.com>
> > Signed-off-by: Gonglei <arei.gong...@huawei.com>
> 
> Thanks, this makes sense.  However, looking at the users, you should
> also change the type of:
> 
> - the diff variable in hw/timer/m48t59.c function set_alarm;
> 
> - the offset argument of the RTC_CHANGE QAPI event (to int64)
> 
> - the sec_offset and alm_sec fields of MenelausState in hw/timer/twl92230.c
> 
> - the offset argument of qemu_get_timedate.
> 
OK, will do.

Thanks,
-Gonglei

> Thanks,
> 
> Paolo
> 
> > ---
> >  include/qemu-common.h | 2 +-
> >  vl.c  | 4 ++--
> >  2 files changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/qemu-common.h b/include/qemu-common.h
> > index 05319b9..6fb80aa 100644
> > --- a/include/qemu-common.h
> > +++ b/include/qemu-common.h
> > @@ -33,7 +33,7 @@ int qemu_main(int argc, char **argv, char **envp);
> >  #endif
> >
> >  void qemu_get_timedate(struct tm *tm, int offset);
> > -int qemu_timedate_diff(struct tm *tm);
> > +time_t qemu_timedate_diff(struct tm *tm);
> >
> > >  #define qemu_isalnum(c)    isalnum((unsigned char)(c))
> > >  #define qemu_isalpha(c)    isalpha((unsigned char)(c))
> > diff --git a/vl.c b/vl.c
> > index e517a8d..9d225da 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -146,7 +146,7 @@ int nb_nics;
> >  NICInfo nd_table[MAX_NICS];
> >  int autostart;
> >  static int rtc_utc = 1;
> > -static int rtc_date_offset = -1; /* -1 means no change */
> > +static time_t rtc_date_offset = -1; /* -1 means no change */
> >  QEMUClockType rtc_clock;
> >  int vga_interface_type = VGA_NONE;
> >  static int full_screen = 0;
> > @@ -812,7 +812,7 @@ void qemu_get_timedate(struct tm *tm, int offset)
> >  }
> >  }
> >
> > -int qemu_timedate_diff(struct tm *tm)
> > +time_t qemu_timedate_diff(struct tm *tm)
> >  {
> >  time_t seconds;
> >
> >



Re: [Qemu-devel] [PATCH v2] rtc: placing RTC memory region outside BQL

2018-02-07 Thread Gonglei (Arei)
> -Original Message-
> From: Peter Maydell [mailto:peter.mayd...@linaro.org]
> Sent: Tuesday, February 06, 2018 10:36 PM
> To: Gonglei (Arei)
> Cc: QEMU Developers; Paolo Bonzini; Huangweidong (C)
> Subject: Re: [PATCH v2] rtc: placing RTC memory region outside BQL
> 
> On 6 February 2018 at 14:07, Gonglei <arei.gong...@huawei.com> wrote:
> > As Windows guests use the RTC as the clock source device
> > and access it frequently, let's move the RTC memory
> > region outside the BQL to decrease overhead for Windows guests.
> > Meanwhile, add a new lock to avoid different vCPUs
> > accessing the RTC together.
> >
> > $ cat strace_c.sh
> > strace -tt -p $1 -c -o result_$1.log &
> > sleep $2
> > pid=$(pidof strace)
> > kill $pid
> > cat result_$1.log
> >
> > Before applying this change:
> > $ ./strace_c.sh 10528 30
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ------------
> >  93.87    0.119070          30      4000           ppoll
> >   3.27    0.004148           2      2038           ioctl
> >   2.66    0.003370           2      2014           futex
> >   0.09    0.000113           1       106           read
> >   0.09    0.000109           1       104           io_getevents
> >   0.02    0.000029           1        30           poll
> >   0.00    0.000000           0         1           write
> > ------ ----------- ----------- --------- --------- ------------
> > 100.00    0.126839                  8293           total
> >
> > After applying the change:
> > $ ./strace_c.sh 23829 30
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ------------
> >  92.86    0.067441          16      4094           ppoll
> >   4.85    0.003522           2      2136           ioctl
> >   1.17    0.000850           4       189           futex
> >   0.54    0.000395           2       202           read
> >   0.52    0.000379           2       202           io_getevents
> >   0.05    0.000037           1        30           poll
> > ------ ----------- ----------- --------- --------- ------------
> > 100.00    0.072624                  6853           total
> >
> > The futex call number decreases by ~90.6% on an idle Windows 7 guest.
> 
> These are the same figures as from v1 -- it would be interesting
> to check whether the additional locking that v2 adds has affected
> the results.
> 
Oh, yes. The futex count of v2 doesn't decline as much compared to v1, because it
now takes the BQL before raising the outbound IRQ line.

Before applying v2:
# ./strace_c.sh 8776 30
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------
 78.01    0.164188          26      6436           ppoll
  8.39    0.017650           5      3700        39 futex
  7.68    0.016157           6      2758           ioctl
  5.48    0.011530           3      4586      1113 read
  0.30    0.000640          20        32           io_submit
  0.15    0.000317           4        89           write
------ ----------- ----------- --------- --------- ------------
100.00    0.210482                 17601      1152 total

After applying v2:
# ./strace_c.sh 15968 30
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------
 78.28    0.171117          27      6272           ppoll
  8.50    0.018571           5      3663        21 futex
  7.76    0.016973           6      2732           ioctl
  4.85    0.010597           3      4115       853 read
  0.31    0.000672          11        63           io_submit
  0.30    0.000659           4       180           write
------ ----------- ----------- --------- --------- ------------
100.00    0.218589                 17025       874 total

> Does the patch improve performance in a more interesting use
> case than "the guest is just idle" ?
> 
I think so; after all, the scope of the locking is reduced.
Besides this, can we optimize the RTC timer to avoid holding the BQL,
by using separate threads?

> > +static void rtc_rasie_irq(RTCState *s)
> 
> Typo: should be "raise".
> 
Good catch. :)

Thanks,
-Gonglei


[Qemu-devel] [PATCH v2] rtc: placing RTC memory region outside BQL

2018-02-06 Thread Gonglei
As Windows guests use the RTC as the clock source device
and access it frequently, let's move the RTC memory
region outside the BQL to decrease overhead for Windows guests.
Meanwhile, add a new lock to avoid different vCPUs
accessing the RTC together.

$ cat strace_c.sh
strace -tt -p $1 -c -o result_$1.log &
sleep $2
pid=$(pidof strace)
kill $pid
cat result_$1.log

Before applying this change:
$ ./strace_c.sh 10528 30
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------
 93.87    0.119070          30      4000           ppoll
  3.27    0.004148           2      2038           ioctl
  2.66    0.003370           2      2014           futex
  0.09    0.000113           1       106           read
  0.09    0.000109           1       104           io_getevents
  0.02    0.000029           1        30           poll
  0.00    0.000000           0         1           write
------ ----------- ----------- --------- --------- ------------
100.00    0.126839                  8293           total

After applying the change:
$ ./strace_c.sh 23829 30
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------
 92.86    0.067441          16      4094           ppoll
  4.85    0.003522           2      2136           ioctl
  1.17    0.000850           4       189           futex
  0.54    0.000395           2       202           read
  0.52    0.000379           2       202           io_getevents
  0.05    0.000037           1        30           poll
------ ----------- ----------- --------- --------- ------------
100.00    0.072624                  6853           total

The futex call number decreases by ~90.6% on an idle Windows 7 guest.

Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
v1->v2:
 a) Add a new lock to avoid different vCPUs
    accessing the RTC together. [Paolo]
 b) Take the BQL before raising the outbound IRQ line. [Peter]
 c) Don't take the BQL if it is already held. [Peter]

 hw/timer/mc146818rtc.c | 55 ++
 1 file changed, 47 insertions(+), 8 deletions(-)

diff --git a/hw/timer/mc146818rtc.c b/hw/timer/mc146818rtc.c
index 35a05a6..f0a2a62 100644
--- a/hw/timer/mc146818rtc.c
+++ b/hw/timer/mc146818rtc.c
@@ -85,6 +85,7 @@ typedef struct RTCState {
     uint16_t irq_reinject_on_ack_count;
     uint32_t irq_coalesced;
     uint32_t period;
+    QemuMutex rtc_lock;
     QEMUTimer *coalesced_timer;
     Notifier clock_reset_notifier;
     LostTickPolicy lost_tick_policy;
@@ -125,6 +126,36 @@ static void rtc_coalesced_timer_update(RTCState *s)
     }
 }
 
+static void rtc_rasie_irq(RTCState *s)
+{
+    bool unlocked = !qemu_mutex_iothread_locked();
+
+    if (unlocked) {
+        qemu_mutex_lock_iothread();
+    }
+
+    qemu_irq_raise(s->irq);
+
+    if (unlocked) {
+        qemu_mutex_unlock_iothread();
+    }
+}
+
+static void rtc_lower_irq(RTCState *s)
+{
+    bool unlocked = !qemu_mutex_iothread_locked();
+
+    if (unlocked) {
+        qemu_mutex_lock_iothread();
+    }
+
+    qemu_irq_lower(s->irq);
+
+    if (unlocked) {
+        qemu_mutex_unlock_iothread();
+    }
+}
+
 static QLIST_HEAD(, RTCState) rtc_devices =
     QLIST_HEAD_INITIALIZER(rtc_devices);
 
@@ -141,7 +172,7 @@ void qmp_rtc_reset_reinjection(Error **errp)
 static bool rtc_policy_slew_deliver_irq(RTCState *s)
 {
     apic_reset_irq_delivered();
-    qemu_irq_raise(s->irq);
+    rtc_rasie_irq(s);
     return apic_get_irq_delivered();
 }
 
@@ -277,8 +308,9 @@ static void rtc_periodic_timer(void *opaque)
 DPRINTF_C("cmos: coalesced irqs increased to %d\n",
   s->irq_coalesced);
 }
-} else
-qemu_irq_raise(s->irq);
+} else {
+rtc_rasie_irq(s);
+}
 }
 }
 
@@ -459,7 +491,7 @@ static void rtc_update_timer(void *opaque)
     s->cmos_data[RTC_REG_C] |= irqs;
     if ((new_irqs & s->cmos_data[RTC_REG_B]) != 0) {
         s->cmos_data[RTC_REG_C] |= REG_C_IRQF;
-        qemu_irq_raise(s->irq);
+        rtc_rasie_irq(s);
     }
     check_update_timer(s);
 }
@@ -471,6 +503,7 @@ static void cmos_ioport_write(void *opaque, hwaddr addr,
     uint32_t old_period;
     bool update_periodic_timer;
 
+    qemu_mutex_lock(&s->rtc_lock);
     if ((addr & 1) == 0) {
         s->cmos_index = data & 0x7f;
     } else {
@@ -560,10 +593,10 @@ static void cmos_ioport_write(void *opaque, hwaddr addr,
              * becomes enabled, raise an interrupt immediately.  */
             if (data & s->cmos_data[RTC_REG_C] & REG_C_MASK) {
                 s->cmos_data[RTC_REG_C] |= REG_C_IRQF;
-                qemu_irq_raise(s->irq);
+                rtc_rasie_irq(s);
             } else {
                 s->cmos_data[RTC_REG_C] &= ~REG_C_IRQF;
-                qemu_irq_lower(s->irq);
+                rtc_lower_irq(s);

Re: [Qemu-devel] [PATCH] rtc: placing RTC memory region outside BQL

2018-02-06 Thread Gonglei (Arei)

> -Original Message-
> From: Peter Maydell [mailto:peter.mayd...@linaro.org]
> Sent: Tuesday, February 06, 2018 5:49 PM
> To: Gonglei (Arei)
> Cc: Paolo Bonzini; QEMU Developers; Huangweidong (C)
> Subject: Re: [Qemu-devel] [PATCH] rtc: placing RTC memory region outside BQL
> 
> On 6 February 2018 at 08:24, Gonglei (Arei) <arei.gong...@huawei.com>
> wrote:
> > So taking the BQL is necessary, and what we can do is try our best to narrow
> > down the scope of the locking? For example, with the following wrappers:
> >
> > static void rtc_rasie_irq(RTCState *s)
> > {
> >     qemu_mutex_lock_iothread();
> >     qemu_irq_raise(s->irq);
> >     qemu_mutex_unlock_iothread();
> > }
> >
> > static void rtc_lower_irq(RTCState *s)
> > {
> >     qemu_mutex_lock_iothread();
> >     qemu_irq_lower(s->irq);
> >     qemu_mutex_unlock_iothread();
> > }
> 
> If you do that you'll also need to be careful about not calling
> those functions from contexts where you already hold the iothread
> mutex (eg timer callbacks), since you can't lock a mutex you
> already have locked.
> 
Exactly, all those contexts come from the main loop. :)
Three timer callbacks, plus the calls from rtc_reset().

Thanks,
-Gonglei


Re: [Qemu-devel] [PATCH] rtc: placing RTC memory region outside BQL

2018-02-06 Thread Gonglei (Arei)

> -Original Message-
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: Monday, February 05, 2018 10:04 PM
> To: Peter Maydell
> Cc: Gonglei (Arei); QEMU Developers; Huangweidong (C)
> Subject: Re: [Qemu-devel] [PATCH] rtc: placing RTC memory region outside BQL
> 
> On 04/02/2018 19:02, Peter Maydell wrote:
> > On 1 February 2018 at 14:23, Paolo Bonzini <pbonz...@redhat.com> wrote:
> >> On 01/02/2018 08:47, Gonglei wrote:
> >>> diff --git a/hw/timer/mc146818rtc.c b/hw/timer/mc146818rtc.c
> >>> index 35a05a6..d9d99c5 100644
> >>> --- a/hw/timer/mc146818rtc.c
> >>> +++ b/hw/timer/mc146818rtc.c
> >>> @@ -986,6 +986,7 @@ static void rtc_realizefn(DeviceState *dev, Error
> **errp)
> >>>      qemu_register_suspend_notifier(&s->suspend_notifier);
> >>>
> >>>      memory_region_init_io(&s->io, OBJECT(s), &cmos_ops, s, "rtc", 2);
> >>> +    memory_region_clear_global_locking(&s->io);
> >>>      isa_register_ioport(isadev, &s->io, base);
> >>>
> >>>      qdev_set_legacy_instance_id(dev, base, 3);
> >>>
> >>
> >> This is not enough, you need to add a new lock or something like that.
> >> Otherwise two vCPUs can access the RTC together and make a mess.
> >
> > Do you also need to do something to take the global lock before
> > raising the outbound IRQ line (since it might be connected to a device
> > that does need the global lock), or am I confused ?
> 
> Yes, that's a good point.  Most of the time the IRQ line is raised in a
> timer, but not always.
> 
So taking the BQL is necessary, and what we can do is try our best to narrow
down the scope of the locking? For example, with the following wrappers:

static void rtc_rasie_irq(RTCState *s)
{
    qemu_mutex_lock_iothread();
    qemu_irq_raise(s->irq);
    qemu_mutex_unlock_iothread();
}

static void rtc_lower_irq(RTCState *s)
{
    qemu_mutex_lock_iothread();
    qemu_irq_lower(s->irq);
    qemu_mutex_unlock_iothread();
}

Thanks,
-Gonglei


Re: [Qemu-devel] [PATCH] rtc: placing RTC memory region outside BQL

2018-02-03 Thread Gonglei (Arei)
> -Original Message-
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: Thursday, February 01, 2018 10:24 PM
> To: Gonglei (Arei); qemu-devel@nongnu.org
> Cc: Huangweidong (C)
> Subject: Re: [PATCH] rtc: placing RTC memory region outside BQL
> 
> On 01/02/2018 08:47, Gonglei wrote:
> > As Windows guests use the RTC as the clock source device
> > and access it frequently, let's move the RTC memory
> > region outside the BQL to decrease overhead for Windows guests.
> >
> > strace -tt -p $1 -c -o result_$1.log &
> > sleep $2
> > pid=$(pidof strace)
> > kill $pid
> > cat result_$1.log
> >
> > Before applying this change:
> >
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ------------
> >  93.87    0.119070          30      4000           ppoll
> >   3.27    0.004148           2      2038           ioctl
> >   2.66    0.003370           2      2014           futex
> >   0.09    0.000113           1       106           read
> >   0.09    0.000109           1       104           io_getevents
> >   0.02    0.000029           1        30           poll
> >   0.00    0.000000           0         1           write
> > ------ ----------- ----------- --------- --------- ------------
> > 100.00    0.126839                  8293           total
> >
> > After applying the change:
> >
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ------------
> >  92.86    0.067441          16      4094           ppoll
> >   4.85    0.003522           2      2136           ioctl
> >   1.17    0.000850           4       189           futex
> >   0.54    0.000395           2       202           read
> >   0.52    0.000379           2       202           io_getevents
> >   0.05    0.000037           1        30           poll
> > ------ ----------- ----------- --------- --------- ------------
> > 100.00    0.072624                  6853           total
> >
> > The futex call number decreases by ~90.6% on an idle Windows 7 guest.
> >
> > Signed-off-by: Gonglei <arei.gong...@huawei.com>
> > ---
> >  hw/timer/mc146818rtc.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/hw/timer/mc146818rtc.c b/hw/timer/mc146818rtc.c
> > index 35a05a6..d9d99c5 100644
> > --- a/hw/timer/mc146818rtc.c
> > +++ b/hw/timer/mc146818rtc.c
> > @@ -986,6 +986,7 @@ static void rtc_realizefn(DeviceState *dev, Error
> **errp)
> > >      qemu_register_suspend_notifier(&s->suspend_notifier);
> > >
> > >      memory_region_init_io(&s->io, OBJECT(s), &cmos_ops, s, "rtc", 2);
> > > +    memory_region_clear_global_locking(&s->io);
> > >      isa_register_ioport(isadev, &s->io, base);
> > >
> > >      qdev_set_legacy_instance_id(dev, base, 3);
> >
> 
> This is not enough, you need to add a new lock or something like that.
> Otherwise two vCPUs can access the RTC together and make a mess.
> 

Hi Paolo,

Yes, that's true, although I have not encountered any problems yet.
Let me enhance it in v2.

Thanks,
-Gonglei


[Qemu-devel] [PATCH] rtc: placing RTC memory region outside BQL

2018-02-01 Thread Gonglei
As Windows guests use the RTC as the clock source device
and access it frequently, let's move the RTC memory
region outside the BQL to decrease overhead for Windows guests.

$ cat strace_c.sh
strace -tt -p $1 -c -o result_$1.log &
sleep $2
pid=$(pidof strace)
kill $pid
cat result_$1.log

Before applying this change:
$ ./strace_c.sh 10528 30
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------
 93.87    0.119070          30      4000           ppoll
  3.27    0.004148           2      2038           ioctl
  2.66    0.003370           2      2014           futex
  0.09    0.000113           1       106           read
  0.09    0.000109           1       104           io_getevents
  0.02    0.000029           1        30           poll
  0.00    0.000000           0         1           write
------ ----------- ----------- --------- --------- ------------
100.00    0.126839                  8293           total

After applying the change:
$ ./strace_c.sh 23829 30
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------
 92.86    0.067441          16      4094           ppoll
  4.85    0.003522           2      2136           ioctl
  1.17    0.000850           4       189           futex
  0.54    0.000395           2       202           read
  0.52    0.000379           2       202           io_getevents
  0.05    0.000037           1        30           poll
------ ----------- ----------- --------- --------- ------------
100.00    0.072624                  6853           total

The futex call number decreases by ~90.6% on an idle Windows 7 guest.

Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 hw/timer/mc146818rtc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/timer/mc146818rtc.c b/hw/timer/mc146818rtc.c
index 35a05a6..d9d99c5 100644
--- a/hw/timer/mc146818rtc.c
+++ b/hw/timer/mc146818rtc.c
@@ -986,6 +986,7 @@ static void rtc_realizefn(DeviceState *dev, Error **errp)
     qemu_register_suspend_notifier(&s->suspend_notifier);
 
     memory_region_init_io(&s->io, OBJECT(s), &cmos_ops, s, "rtc", 2);
+    memory_region_clear_global_locking(&s->io);
     isa_register_ioport(isadev, &s->io, base);
 
 qdev_set_legacy_instance_id(dev, base, 3);
-- 
1.8.3.1





[Qemu-devel] [PATCH] rtc: placing RTC memory region outside BQL

2018-02-01 Thread Gonglei
As Windows guests use the RTC as the clock source device
and access it frequently, let's move the RTC memory
region outside the BQL to decrease overhead for Windows guests.

strace -tt -p $1 -c -o result_$1.log &
sleep $2
pid=$(pidof strace)
kill $pid
cat result_$1.log

Before applying this change:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------
 93.87    0.119070          30      4000           ppoll
  3.27    0.004148           2      2038           ioctl
  2.66    0.003370           2      2014           futex
  0.09    0.000113           1       106           read
  0.09    0.000109           1       104           io_getevents
  0.02    0.000029           1        30           poll
  0.00    0.000000           0         1           write
------ ----------- ----------- --------- --------- ------------
100.00    0.126839                  8293           total

After applying the change:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------
 92.86    0.067441          16      4094           ppoll
  4.85    0.003522           2      2136           ioctl
  1.17    0.000850           4       189           futex
  0.54    0.000395           2       202           read
  0.52    0.000379           2       202           io_getevents
  0.05    0.000037           1        30           poll
------ ----------- ----------- --------- --------- ------------
100.00    0.072624                  6853           total

The futex call number decreases by ~90.6% on an idle Windows 7 guest.

Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 hw/timer/mc146818rtc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/timer/mc146818rtc.c b/hw/timer/mc146818rtc.c
index 35a05a6..d9d99c5 100644
--- a/hw/timer/mc146818rtc.c
+++ b/hw/timer/mc146818rtc.c
@@ -986,6 +986,7 @@ static void rtc_realizefn(DeviceState *dev, Error **errp)
     qemu_register_suspend_notifier(&s->suspend_notifier);
 
     memory_region_init_io(&s->io, OBJECT(s), &cmos_ops, s, "rtc", 2);
+    memory_region_clear_global_locking(&s->io);
     isa_register_ioport(isadev, &s->io, base);
 
 qdev_set_legacy_instance_id(dev, base, 3);
-- 
1.8.3.1





[Qemu-devel] [PATCH] vl: fix possible int overflow for qemu_timedate_diff()

2018-02-01 Thread Gonglei
From: shenghualong <shenghual...@huawei.com>

When a Windows guest user sets the time to year 2099,
the return value of qemu_timedate_diff() will overflow
in variable clock mode, as below:

 

Let's change the return value of qemu_timedate_diff() from
int to time_t to fix the possible overflow problem.

Signed-off-by: shenghualong <shenghual...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 include/qemu-common.h | 2 +-
 vl.c  | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/qemu-common.h b/include/qemu-common.h
index 05319b9..6fb80aa 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -33,7 +33,7 @@ int qemu_main(int argc, char **argv, char **envp);
 #endif
 
 void qemu_get_timedate(struct tm *tm, int offset);
-int qemu_timedate_diff(struct tm *tm);
+time_t qemu_timedate_diff(struct tm *tm);
 
 #define qemu_isalnum(c)    isalnum((unsigned char)(c))
 #define qemu_isalpha(c)    isalpha((unsigned char)(c))
diff --git a/vl.c b/vl.c
index e517a8d..9d225da 100644
--- a/vl.c
+++ b/vl.c
@@ -146,7 +146,7 @@ int nb_nics;
 NICInfo nd_table[MAX_NICS];
 int autostart;
 static int rtc_utc = 1;
-static int rtc_date_offset = -1; /* -1 means no change */
+static time_t rtc_date_offset = -1; /* -1 means no change */
 QEMUClockType rtc_clock;
 int vga_interface_type = VGA_NONE;
 static int full_screen = 0;
@@ -812,7 +812,7 @@ void qemu_get_timedate(struct tm *tm, int offset)
 }
 }
 
-int qemu_timedate_diff(struct tm *tm)
+time_t qemu_timedate_diff(struct tm *tm)
 {
 time_t seconds;
 
-- 
1.8.3.1
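
As a side note, a small standalone sketch of why the old int return type
overflows (not QEMU code; it assumes only the C library and a 64-bit time_t):

#include <limits.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    /* 2099-01-01 00:00:00 */
    struct tm tm = { .tm_year = 2099 - 1900, .tm_mon = 0, .tm_mday = 1 };
    time_t when = mktime(&tm);

    /* roughly 4.07e9 seconds since the Epoch: too large for a 32-bit int */
    printf("seconds = %lld, INT_MAX = %d, fits in int: %s\n",
           (long long)when, INT_MAX, when <= INT_MAX ? "yes" : "no");
    return 0;
}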





Re: [Qemu-devel] [PATCH v3 1/4] cryptodev: add vhost-user as a new cryptodev backend

2018-01-17 Thread Gonglei (Arei)


> -Original Message-
> From: Zhoujian (jay)
> Sent: Wednesday, January 17, 2018 1:01 PM
> To: Michael S. Tsirkin
> Cc: pa...@linux.vnet.ibm.com; Huangweidong (C); xin.z...@intel.com;
> qemu-devel@nongnu.org; Gonglei (Arei); roy.fan.zh...@intel.com;
> stefa...@redhat.com; pbonz...@redhat.com; longpeng
> Subject: RE: [Qemu-devel] [PATCH v3 1/4] cryptodev: add vhost-user as a new
> cryptodev backend
> 
> > -Original Message-
> > From: Qemu-devel [mailto:qemu-devel-
> > bounces+jianjay.zhou=huawei@nongnu.org] On Behalf Of Michael S.
> Tsirkin
> > Sent: Wednesday, January 17, 2018 12:41 AM
> > To: Zhoujian (jay) <jianjay.z...@huawei.com>
> > Cc: pa...@linux.vnet.ibm.com; Huangweidong (C)
> <weidong.hu...@huawei.com>;
> > xin.z...@intel.com; qemu-devel@nongnu.org; Gonglei (Arei)
> > <arei.gong...@huawei.com>; roy.fan.zh...@intel.com;
> stefa...@redhat.com;
> > pbonz...@redhat.com; longpeng <longpe...@huawei.com>
> > Subject: Re: [Qemu-devel] [PATCH v3 1/4] cryptodev: add vhost-user as a new
> > cryptodev backend
> >
> > On Tue, Jan 16, 2018 at 10:06:50PM +0800, Jay Zhou wrote:
> > > From: Gonglei <arei.gong...@huawei.com>
> > >
> > > Usage:
> > >  -chardev socket,id=charcrypto0,path=/path/to/your/socket
> > >  -object cryptodev-vhost-user,id=cryptodev0,chardev=charcrypto0
> > >  -device virtio-crypto-pci,id=crypto0,cryptodev=cryptodev0
> > >
> > > Signed-off-by: Gonglei <arei.gong...@huawei.com>
> > > Signed-off-by: Longpeng(Mike) <longpe...@huawei.com>
> > > Signed-off-by: Jay Zhou <jianjay.z...@huawei.com>
> > > ---
> > >  backends/Makefile.objs   |   4 +
> > >  backends/cryptodev-vhost-user.c  | 333
> > +++
> > >  backends/cryptodev-vhost.c   |  73 +
> > >  include/sysemu/cryptodev-vhost.h | 154 ++
> > >  qemu-options.hx  |  21 +++
> > >  vl.c |   4 +
> > >  6 files changed, 589 insertions(+)
> > >  create mode 100644 backends/cryptodev-vhost-user.c  create mode
> > > 100644 backends/cryptodev-vhost.c  create mode 100644
> > > include/sysemu/cryptodev-vhost.h
> > >
> > > diff --git a/backends/Makefile.objs b/backends/Makefile.objs index
> > > 0400799..9e1fb76 100644
> > > --- a/backends/Makefile.objs
> > > +++ b/backends/Makefile.objs
> > > @@ -8,3 +8,7 @@ common-obj-$(CONFIG_LINUX) += hostmem-file.o
> > >
> > >  common-obj-y += cryptodev.o
> > >  common-obj-y += cryptodev-builtin.o
> > > +
> > > +ifeq ($(CONFIG_VIRTIO),y)
> > > +common-obj-$(CONFIG_LINUX) += cryptodev-vhost.o
> > > +cryptodev-vhost-user.o endif
> >
> > Shouldn't this depend on CONFIG_VHOST_USER?
> 
> Yes, you're right. Will fix it soon.
> 
Hi Michael,

Can we apply this patch set first and then fix it up on top, based on the other
comments?

Thanks,
-Gonglei



Re: [Qemu-devel] [PATCH 3/7] i386: Add spec-ctrl CPUID bit

2018-01-16 Thread Gonglei (Arei)

> -Original Message-
> From: Eduardo Habkost [mailto:ehabk...@redhat.com]
> Sent: Monday, January 15, 2018 8:23 PM
> To: Gonglei (Arei)
> Cc: qemu-devel@nongnu.org; Paolo Bonzini
> Subject: Re: [Qemu-devel] [PATCH 3/7] i386: Add spec-ctrl CPUID bit
> 
> On Sat, Jan 13, 2018 at 03:04:44AM +, Gonglei (Arei) wrote:
> >
> > > -Original Message-
> > > From: Qemu-devel
> > > [mailto:qemu-devel-bounces+arei.gonglei=huawei@nongnu.org] On
> > > Behalf Of Eduardo Habkost
> > > Sent: Tuesday, January 09, 2018 11:45 PM
> > > To: qemu-devel@nongnu.org
> > > Cc: Paolo Bonzini
> > > Subject: [Qemu-devel] [PATCH 3/7] i386: Add spec-ctrl CPUID bit
> > >
> > > Add the feature name and a CPUID_7_0_EDX_SPEC_CTRL macro.
> > >
> > > Signed-off-by: Eduardo Habkost <ehabk...@redhat.com>
> > > ---
> > >  target/i386/cpu.h | 1 +
> > >  target/i386/cpu.c | 2 +-
> > >  2 files changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> > > index 07f47997d6..de387c1311 100644
> > > --- a/target/i386/cpu.h
> > > +++ b/target/i386/cpu.h
> > > @@ -667,6 +667,7 @@ typedef uint32_t
> > > FeatureWordArray[FEATURE_WORDS];
> > >
> > >  #define CPUID_7_0_EDX_AVX512_4VNNIW (1U << 2) /* AVX512 Neural
> > > Network Instructions */
> > >  #define CPUID_7_0_EDX_AVX512_4FMAPS (1U << 3) /* AVX512 Multiply
> > > Accumulation Single Precision */
> > > +#define CPUID_7_0_EDX_SPEC_CTRL (1U << 26) /* Speculation
> Control
> > > */
> > >
> > >  #define CPUID_XSAVE_XSAVEOPT   (1U << 0)
> > >  #define CPUID_XSAVE_XSAVEC (1U << 1)
> > > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> > > index 9f4f949899..1be1642eb2 100644
> > > --- a/target/i386/cpu.c
> > > +++ b/target/i386/cpu.c
> > > @@ -459,7 +459,7 @@ static FeatureWordInfo
> > > feature_word_info[FEATURE_WORDS] = {
> > >  NULL, NULL, NULL, NULL,
> > >  NULL, NULL, NULL, NULL,
> > >  NULL, NULL, NULL, NULL,
> > > -NULL, NULL, NULL, NULL,
> > > +NULL, NULL, "spec-ctrl", NULL,
> > >  NULL, NULL, NULL, NULL,
> > >  },
> > >  .cpuid_eax = 7,
> > > --
> > > 2.14.3
> > >
> > Don't we need to pass through cpuid_7_edx to the guest when configuring
> > '-cpu host'?
> > Otherwise how can guests use the IBPB/IBRS/STIBP capabilities?
> 
> We already do.  See the check for cpu->max_features at
> x86_cpu_expand_features().
> 
> Do you see something else missing?
> 
No, thank you. My bad. :(

Thanks,
-Gonglei
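
As a side note, one can verify the bit from a guest or host with a tiny
program using GCC's <cpuid.h> (illustrative only, not part of the patch):

#include <cpuid.h>
#include <stdio.h>

/* CPUID.(EAX=7,ECX=0):EDX[26], the bit the patch names "spec-ctrl" */
#define SPEC_CTRL_BIT (1U << 26)

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        printf("spec-ctrl (IBRS/IBPB): %s\n",
               (edx & SPEC_CTRL_BIT) ? "present" : "absent");
    }
    return 0;
}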




Re: [Qemu-devel] [PATCH 3/7] i386: Add spec-ctrl CPUID bit

2018-01-12 Thread Gonglei (Arei)

> -Original Message-
> From: Qemu-devel
> [mailto:qemu-devel-bounces+arei.gonglei=huawei@nongnu.org] On
> Behalf Of Eduardo Habkost
> Sent: Tuesday, January 09, 2018 11:45 PM
> To: qemu-devel@nongnu.org
> Cc: Paolo Bonzini
> Subject: [Qemu-devel] [PATCH 3/7] i386: Add spec-ctrl CPUID bit
> 
> Add the feature name and a CPUID_7_0_EDX_SPEC_CTRL macro.
> 
> Signed-off-by: Eduardo Habkost <ehabk...@redhat.com>
> ---
>  target/i386/cpu.h | 1 +
>  target/i386/cpu.c | 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index 07f47997d6..de387c1311 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -667,6 +667,7 @@ typedef uint32_t
> FeatureWordArray[FEATURE_WORDS];
> 
>  #define CPUID_7_0_EDX_AVX512_4VNNIW (1U << 2) /* AVX512 Neural
> Network Instructions */
>  #define CPUID_7_0_EDX_AVX512_4FMAPS (1U << 3) /* AVX512 Multiply
> Accumulation Single Precision */
> +#define CPUID_7_0_EDX_SPEC_CTRL (1U << 26) /* Speculation Control
> */
> 
>  #define CPUID_XSAVE_XSAVEOPT   (1U << 0)
>  #define CPUID_XSAVE_XSAVEC (1U << 1)
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 9f4f949899..1be1642eb2 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -459,7 +459,7 @@ static FeatureWordInfo
> feature_word_info[FEATURE_WORDS] = {
>  NULL, NULL, NULL, NULL,
>  NULL, NULL, NULL, NULL,
>  NULL, NULL, NULL, NULL,
> -NULL, NULL, NULL, NULL,
> +NULL, NULL, "spec-ctrl", NULL,
>  NULL, NULL, NULL, NULL,
>  },
>      .cpuid_eax = 7,
> --
> 2.14.3
> 
Don't we need to pass through cpuid_7_edx to the guest when configuring '-cpu host'?
Otherwise how can guests use the IBPB/IBRS/STIBP capabilities?

Thanks,
-Gonglei




Re: [Qemu-devel] [PATCH 0/4] cryptodev: add vhost support

2017-12-21 Thread Gonglei (Arei)

> -Original Message-
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Thursday, December 21, 2017 10:25 PM
> To: Gonglei (Arei)
> Cc: qemu-devel@nongnu.org; pbonz...@redhat.com; Huangweidong (C);
> stefa...@redhat.com; Zhoujian (jay); pa...@linux.vnet.ibm.com; longpeng;
> xin.z...@intel.com; roy.fan.zh...@intel.com
> Subject: Re: [PATCH 0/4] cryptodev: add vhost support
> 
> On Tue, Nov 28, 2017 at 05:03:05PM +0800, Gonglei wrote:
> > I posted the RFC version five months ago for the DPDK
> > vhost-crypto implementation, and now it's time to send
> > the formal version, because we need a user-space scheme
> > for better performance.
> >
> > The vhost user crypto server side patches had been
> > sent to DPDK community, pls see
> >
> > [RFC PATCH 0/6] lib/librte_vhost: introduce new vhost_user crypto
> backend support
> > http://dpdk.org/ml/archives/dev/2017-November/081048.html
> >
> > You also can get virtio-crypto polling mode driver from:
> >
> > [PATCH] virtio: add new driver for crypto devices
> > http://dpdk.org/ml/archives/dev/2017-November/081985.html
> >
> 
> This makes build on mingw break:
> 
>   CC  sparc64-softmmu/hw/scsi/virtio-scsi-dataplane.o
> hw/virtio/virtio-crypto.o: In function `virtio_crypto_vhost_status':
> /scm/qemu/hw/virtio/virtio-crypto.c:898: undefined reference to
> `cryptodev_get_vhost'
> /scm/qemu/hw/virtio/virtio-crypto.c:910: undefined reference to
> `cryptodev_vhost_start'
> /scm/qemu/hw/virtio/virtio-crypto.c:917: undefined reference to
> `cryptodev_vhost_stop'
> hw/virtio/virtio-crypto.o: In function `virtio_crypto_guest_notifier_pending':
> /scm/qemu/hw/virtio/virtio-crypto.c:947: undefined reference to
> `cryptodev_vhost_virtqueue_pending'
> hw/virtio/virtio-crypto.o: In function `virtio_crypto_guest_notifier_mask':
> /scm/qemu/hw/virtio/virtio-crypto.c:937: undefined reference to
> `cryptodev_vhost_virtqueue_mask'
> collect2: error: ld returned 1 exit status
> make[1]: *** [Makefile:193: qemu-system-i386.exe] Error 1
> make: *** [Makefile:383: subdir-i386-softmmu] Error 2
> 
> 
Sorry about that. We'll build it in a cross-compiler environment next time.

Thanks,
-Gonglei



Re: [Qemu-devel] [PATCH 0/4] cryptodev: add vhost support

2017-12-20 Thread Gonglei (Arei)


> -Original Message-
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Thursday, December 21, 2017 1:39 AM
> To: Gonglei (Arei)
> Cc: qemu-devel@nongnu.org; pbonz...@redhat.com; Huangweidong (C);
> stefa...@redhat.com; Zhoujian (jay); pa...@linux.vnet.ibm.com; longpeng;
> xin.z...@intel.com; roy.fan.zh...@intel.com
> Subject: Re: [PATCH 0/4] cryptodev: add vhost support
> 
> On Mon, Dec 18, 2017 at 09:03:16AM +, Gonglei (Arei) wrote:
> > Ping...
> >
> > Fan (working for DPDK parts) is waiting for those patches upstreamed. :)
> >
> > Thanks,
> > -Gonglei
> 
> As far as I am concerned, the main issue is that it says it assumes
> polling.  virtio does not work like this right now.  As long as spec
> does not support interrupt mode, I don't think we can merge this.
> 
Sorry, Michael, this confuses me, because the QEMU part of vhost-user crypto
doesn't make that assumption. The main controversial point, raised by Paolo,
is whether session operations should be added to the vhost-user protocol.
And we gave an explanation.

Thanks,
-Gonglei

> >
> > > -Original Message-
> > > From: Gonglei (Arei)
> > > Sent: Tuesday, November 28, 2017 5:03 PM
> > > To: qemu-devel@nongnu.org
> > > Cc: m...@redhat.com; pbonz...@redhat.com; Huangweidong (C);
> > > stefa...@redhat.com; Zhoujian (jay); pa...@linux.vnet.ibm.com;
> longpeng;
> > > xin.z...@intel.com; roy.fan.zh...@intel.com; Gonglei (Arei)
> > > Subject: [PATCH 0/4] cryptodev: add vhost support
> > >
> > > I posted the RFC version five months ago for the DPDK
> > > vhost-crypto implementation, and now it's time to send
> > > the formal version, because we need a user-space scheme
> > > for better performance.
> > >
> > > The vhost user crypto server side patches had been
> > > sent to DPDK community, pls see
> > >
> > > [RFC PATCH 0/6] lib/librte_vhost: introduce new   vhost_user crypto
> backend
> > > support
> > > http://dpdk.org/ml/archives/dev/2017-November/081048.html
> > >
> > > You also can get virtio-crypto polling mode driver from:
> > >
> > > [PATCH] virtio: add new driver for crypto devices
> > > http://dpdk.org/ml/archives/dev/2017-November/081985.html
> > >
> > >
> > > Gonglei (4):
> > >   cryptodev: add vhost-user as a new cryptodev backend
> > >   cryptodev: add vhost support
> > >   cryptodev-vhost-user: add crypto session handler
> > >   cryptodev-vhost-user: set the key length
> > >
> > >  backends/Makefile.objs|   4 +
> > >  backends/cryptodev-builtin.c  |   1 +
> > >  backends/cryptodev-vhost-user.c   | 381
> > > ++
> > >  backends/cryptodev-vhost.c| 297
> > > ++
> > >  docs/interop/vhost-user.txt   |  19 ++
> > >  hw/virtio/vhost-user.c|  89 
> > >  hw/virtio/virtio-crypto.c |  70 +++
> > >  include/hw/virtio/vhost-backend.h |   8 +
> > >  include/hw/virtio/virtio-crypto.h |   1 +
> > >  include/sysemu/cryptodev-vhost-user.h |  47 +
> > >  include/sysemu/cryptodev-vhost.h  | 154 ++
> > >  include/sysemu/cryptodev.h|   8 +
> > >  qemu-options.hx   |  21 ++
> > >  vl.c  |   4 +
> > >  14 files changed, 1104 insertions(+)
> > >  create mode 100644 backends/cryptodev-vhost-user.c
> > >  create mode 100644 backends/cryptodev-vhost.c
> > >  create mode 100644 include/sysemu/cryptodev-vhost-user.h
> > >  create mode 100644 include/sysemu/cryptodev-vhost.h
> > >
> > > --
> > > 1.8.3.1
> > >



Re: [Qemu-devel] [PATCH 0/4] cryptodev: add vhost support

2017-12-18 Thread Gonglei (Arei)
Ping...

Fan (working for DPDK parts) is waiting for those patches upstreamed. :)

Thanks,
-Gonglei


> -Original Message-
> From: Gonglei (Arei)
> Sent: Tuesday, November 28, 2017 5:03 PM
> To: qemu-devel@nongnu.org
> Cc: m...@redhat.com; pbonz...@redhat.com; Huangweidong (C);
> stefa...@redhat.com; Zhoujian (jay); pa...@linux.vnet.ibm.com; longpeng;
> xin.z...@intel.com; roy.fan.zh...@intel.com; Gonglei (Arei)
> Subject: [PATCH 0/4] cryptodev: add vhost support
> 
> I posted the RFC version five months ago for the DPDK
> vhost-crypto implementation, and now it's time to send
> the formal version, because we need a user-space scheme
> for better performance.
> 
> The vhost user crypto server side patches had been
> sent to DPDK community, pls see
> 
> [RFC PATCH 0/6] lib/librte_vhost: introduce new   vhost_user crypto 
> backend
> support
> http://dpdk.org/ml/archives/dev/2017-November/081048.html
> 
> You also can get virtio-crypto polling mode driver from:
> 
> [PATCH] virtio: add new driver for crypto devices
> http://dpdk.org/ml/archives/dev/2017-November/081985.html
> 
> 
> Gonglei (4):
>   cryptodev: add vhost-user as a new cryptodev backend
>   cryptodev: add vhost support
>   cryptodev-vhost-user: add crypto session handler
>   cryptodev-vhost-user: set the key length
> 
>  backends/Makefile.objs|   4 +
>  backends/cryptodev-builtin.c  |   1 +
>  backends/cryptodev-vhost-user.c   | 381
> ++
>  backends/cryptodev-vhost.c| 297
> ++
>  docs/interop/vhost-user.txt   |  19 ++
>  hw/virtio/vhost-user.c|  89 
>  hw/virtio/virtio-crypto.c |  70 +++
>  include/hw/virtio/vhost-backend.h |   8 +
>  include/hw/virtio/virtio-crypto.h |   1 +
>  include/sysemu/cryptodev-vhost-user.h |  47 +
>  include/sysemu/cryptodev-vhost.h  | 154 ++
>  include/sysemu/cryptodev.h|   8 +
>  qemu-options.hx   |  21 ++
>  vl.c  |   4 +
>  14 files changed, 1104 insertions(+)
>  create mode 100644 backends/cryptodev-vhost-user.c
>  create mode 100644 backends/cryptodev-vhost.c
>  create mode 100644 include/sysemu/cryptodev-vhost-user.h
>  create mode 100644 include/sysemu/cryptodev-vhost.h
> 
> --
> 1.8.3.1
> 




Re: [Qemu-devel] [v22 1/2] virtio-crypto: Add virtio crypto device specification

2017-12-07 Thread Gonglei (Arei)
> 
> On 12/06/2017 08:37 AM, Longpeng(Mike) wrote:
> > +\field{outcome_len} is the size of struct virtio_crypto_session_input or
> > +ZERO for the session-destroy operation.
> 
> This ain't correct. It should have been something like
> virtio_crypto_destroy_session_input.
> 
Right, will fix it.

> > +
> > +
> > +\paragraph{Session operation}\label{sec:Device Types / Crypto Device /
> Device
> > +Operation / Control Virtqueue / Session operation}
> > +
> > +The session is a handle which describes the cryptographic parameters to be
> > +applied to a number of buffers.
> > +
> > +The following structure stores the result of session creation set by the
> device:
> > +
> > +\begin{lstlisting}
> > +struct virtio_crypto_session_input {
> > +/* Device write only portion */
> > +le64 session_id;
> > +le32 status;
> > +le32 padding;
> > +};
> > +\end{lstlisting}
> > +
> > +A request to destroy a session includes the following information:
> > +
> > +\begin{lstlisting}
> > +struct virtio_crypto_destroy_session_flf {
> > +/* Device read only portion */
> > +le64  session_id;
> > +/* Device write only portion */
> 
> This is the device writable portion and thus what we call op_outcome above.
> So it should have been
> };
> 
> 
> struct virtio_crypto_destroy_session_input {
> > +le32  status;
> > +le32  padding;
> > +};
> 
> If we aren't consistent about it the dividing into parts (like op specific
> fixed and variable length (output) fields, operation outcome (input))
> isn't really helpful.
> 
It's OK with us, we can do it. Any other comments?

Thanks,
-Gonglei
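
For clarity, the layout the review above asks for would look like this (struct
name as suggested by the reviewer; le32 is the virtio spec's notation for a
little-endian 32-bit field):

struct virtio_crypto_destroy_session_input {
    /* Device write only portion */
    le32  status;
    le32  padding;
};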



Re: [Qemu-devel] About the light VM solution!

2017-12-06 Thread Gonglei (Arei)
> -Original Message-
> From: Stefan Hajnoczi [mailto:stefa...@redhat.com]
> Sent: Wednesday, December 06, 2017 11:10 PM
> To: Gonglei (Arei)
> Cc: Paolo Bonzini; Yang Zhong; Stefan Hajnoczi; qemu-devel
> Subject: Re: [Qemu-devel] About the light VM solution!
> 
> On Wed, Dec 06, 2017 at 09:21:55AM +, Gonglei (Arei) wrote:
> >
> > > -Original Message-
> > > From: Qemu-devel
> > > [mailto:qemu-devel-bounces+arei.gonglei=huawei@nongnu.org] On
> > > Behalf Of Stefan Hajnoczi
> > > Sent: Wednesday, December 06, 2017 12:31 AM
> > > To: Paolo Bonzini
> > > Cc: Yang Zhong; Stefan Hajnoczi; qemu-devel
> > > Subject: Re: [Qemu-devel] About the light VM solution!
> > >
> > > On Tue, Dec 05, 2017 at 03:00:10PM +0100, Paolo Bonzini wrote:
> > > > On 05/12/2017 14:47, Stefan Hajnoczi wrote:
> > > > > On Tue, Dec 5, 2017 at 1:35 PM, Paolo Bonzini <pbonz...@redhat.com>
> > > wrote:
> > > > >> On 05/12/2017 13:06, Stefan Hajnoczi wrote:
> > > > >>> On Tue, Dec 05, 2017 at 02:33:13PM +0800, Yang Zhong wrote:
> > > > >>>> As you know, AWS has decided to switch to KVM in their clouds. This
> > > news make almost all
> > > > >>>> china CSPs(clouds service provider) pay more attention on
> KVM/Qemu,
> > > especially light VM
> > > > >>>> solution.
> > > > >>>>
> > > > >>>> Below are intel solution for light VM, qemu-lite.
> > > > >>>>
> > >
> http://events.linuxfoundation.org/sites/events/files/slides/Light%20weight%2
> > > 0virtualization%20with%20QEMU%26KVM_0.pdf
> > > > >>>>
> > > > >>>> My question is whether community has some plan to implement
> light
> > > VM or alternative solutions? If no, whether our
> > > > >>>> qemu-lite solution is suitable for upstream again? Many thanks!
> > > > >>>
> > > > >>> What caused a lot of discussion and held back progress was the
> approach
> > > > >>> that was taken.  The basic philosophy seems to be bypassing or
> > > > >>> special-casing components in order to avoid slow operations.  This
> > > > >>> requires special QEMU, firmware, and/or guest kernel binaries and
> > > causes
> > > > >>> extra work for the management stack, distributions, and testers.
> > > > >>
> > > > >> I think having a special firmware (be it qboot or a special-purpose
> > > > >> SeaBIOS) is acceptable.
> > > > >
> > > > > The work Marc Mari Barcelo did in 2015 showed that SeaBIOS can boot
> > > > > guests quickly.  The guest kernel was entered in <35 milliseconds
> > > > > IIRC.  Why is special firmware necessary?
> > > >
> > > > I thought that wasn't the "conventional" SeaBIOS, but rather one with
> > > > reduced configuration options, but I may be remembering wrong.
> > >
> > > Marc didn't spend much time on optimizing SeaBIOS, he used the build
> > > options that were suggested.  An extra flag can be added in
> > > qemu_preinit() to skip slow init that's unnecessary on optimized
> > > machines.  That would allow a single SeaBIOS binary to run both full and
> > > lite systems.
> > >
> > Which options do you remember, Stefan? Or do you have any links to that
> > thread? I'm interested in this topic.
> 
> Here is what I found:
> 
> Marc Mari's fastest SeaBIOS build took 8 ms from the first guest CPU
> instruction to entering the guest kernel.  CBFS was used instead of a
> normal boot device (e.g. virtio-blk).  Most hardware support was
> disabled.
> 
> https://mail.coreboot.org/pipermail/seabios/2015-July/009554.html
> 
> The SeaBIOS configuration file is here:
> 
> https://mail.coreboot.org/pipermail/seabios/2015-July/009548.html
> 
Thanks for your information. :)
 
Thanks,
-Gonglei



Re: [Qemu-devel] About the light VM solution!

2017-12-06 Thread Gonglei (Arei)

> -Original Message-
> From: Qemu-devel
> [mailto:qemu-devel-bounces+arei.gonglei=huawei@nongnu.org] On
> Behalf Of Stefan Hajnoczi
> Sent: Wednesday, December 06, 2017 12:31 AM
> To: Paolo Bonzini
> Cc: Yang Zhong; Stefan Hajnoczi; qemu-devel
> Subject: Re: [Qemu-devel] About the light VM solution!
> 
> On Tue, Dec 05, 2017 at 03:00:10PM +0100, Paolo Bonzini wrote:
> > On 05/12/2017 14:47, Stefan Hajnoczi wrote:
> > > On Tue, Dec 5, 2017 at 1:35 PM, Paolo Bonzini <pbonz...@redhat.com>
> wrote:
> > >> On 05/12/2017 13:06, Stefan Hajnoczi wrote:
> > >>> On Tue, Dec 05, 2017 at 02:33:13PM +0800, Yang Zhong wrote:
> > >>>> As you know, AWS has decided to switch to KVM in their clouds. This
> news make almost all
> > >>>> china CSPs(clouds service provider) pay more attention on KVM/Qemu,
> especially light VM
> > >>>> solution.
> > >>>>
> > >>>> Below are intel solution for light VM, qemu-lite.
> > >>>>
> http://events.linuxfoundation.org/sites/events/files/slides/Light%20weight%2
> 0virtualization%20with%20QEMU%26KVM_0.pdf
> > >>>>
> > >>>> My question is whether community has some plan to implement light
> VM or alternative solutions? If no, whether our
> > >>>> qemu-lite solution is suitable for upstream again? Many thanks!
> > >>>
> > >>> What caused a lot of discussion and held back progress was the approach
> > >>> that was taken.  The basic philosophy seems to be bypassing or
> > >>> special-casing components in order to avoid slow operations.  This
> > >>> requires special QEMU, firmware, and/or guest kernel binaries and
> causes
> > >>> extra work for the management stack, distributions, and testers.
> > >>
> > >> I think having a special firmware (be it qboot or a special-purpose
> > >> SeaBIOS) is acceptable.
> > >
> > > The work Marc Mari Barcelo did in 2015 showed that SeaBIOS can boot
> > > guests quickly.  The guest kernel was entered in <35 milliseconds
> > > IIRC.  Why is special firmware necessary?
> >
> > I thought that wasn't the "conventional" SeaBIOS, but rather one with
> > reduced configuration options, but I may be remembering wrong.
> 
> Marc didn't spend much time on optimizing SeaBIOS, he used the build
> options that were suggested.  An extra flag can be added in
> qemu_preinit() to skip slow init that's unnecessary on optimized
> machines.  That would allow a single SeaBIOS binary to run both full and
> lite systems.
> 
Which options do you remember, Stefan? Or do you have any links to that
thread? I'm interested in this topic.

Thanks,
-Gonglei



[Qemu-devel] Re: [BUG] Windows 7 got stuck easily while run PCMark10 application

2017-12-01 Thread Gonglei (Arei)
I also think it's a Windows bug; the problem is that it doesn't occur on the Xen
platform. And there is some other work that needs to be done while reading REG_C.
So I wrote that patch.

Thanks,
Gonglei
From: Paolo Bonzini
To: Gonglei, Zhanghailiang, qemu-devel, Michael S. Tsirkin
Cc: Huangweidong, Wangxin, Xiexiangyou
Date: 2017-12-02 01:10:08
Subject: Re: [BUG] Windows 7 got stuck easily while run PCMark10 application

On 01/12/2017 08:08, Gonglei (Arei) wrote:
> First write to 0x70, cmos_index = 0xc & 0x7f = 0xc
>    CPU 0/KVM-15566 kvm_pio: pio_write at 0x70 size 1 count 1 val 0xc
> Second write to 0x70, cmos_index = 0x86 & 0x7f = 0x6
>    CPU 1/KVM-15567 kvm_pio: pio_write at 0x70 size 1 count 1 val 0x86
> vcpu0 read 0x6 because cmos_index is 0x6 now:
>    CPU 0/KVM-15566 kvm_pio: pio_read at 0x71 size 1 count 1 val 0x6
> vcpu1 read 0x6:
>    CPU 1/KVM-15567 kvm_pio: pio_read at 0x71 size 1 count 1 val 0x6
This seems to be a Windows bug.  The easiest workaround that I
can think of is to clear the interrupts already when 0xc is written,
without waiting for the read (because REG_C can only be read).

What do you think?

Thanks,

Paolo


Re: [Qemu-devel] [BUG] Windows 7 got stuck easily while run PCMark10 application

2017-12-01 Thread Gonglei (Arei)
Please see the trace of kvm_pio:

   CPU 1/KVM-15567 [003]  209311.762579: kvm_pio: pio_read at 0x70 size 1 count 1 val 0xff
   CPU 1/KVM-15567 [003]  209311.762582: kvm_pio: pio_write at 0x70 size 1 count 1 val 0x89
   CPU 1/KVM-15567 [003]  209311.762590: kvm_pio: pio_read at 0x71 size 1 count 1 val 0x17
   CPU 0/KVM-15566 [005]  209311.762611: kvm_pio: pio_write at 0x70 size 1 count 1 val 0xc
   CPU 1/KVM-15567 [003]  209311.762615: kvm_pio: pio_read at 0x70 size 1 count 1 val 0xff
   CPU 1/KVM-15567 [003]  209311.762619: kvm_pio: pio_write at 0x70 size 1 count 1 val 0x88
   CPU 1/KVM-15567 [003]  209311.762627: kvm_pio: pio_read at 0x71 size 1 count 1 val 0x12
   CPU 0/KVM-15566 [005]  209311.762632: kvm_pio: pio_read at 0x71 size 1 count 1 val 0x12
   CPU 1/KVM-15567 [003]  209311.762633: kvm_pio: pio_read at 0x70 size 1 count 1 val 0xff
   CPU 0/KVM-15566 [005]  209311.762634: kvm_pio: pio_write at 0x70 size 1 count 1 val 0xc   <-- first write to 0x70, cmos_index = 0xc & 0x7f = 0xc
   CPU 1/KVM-15567 [003]  209311.762636: kvm_pio: pio_write at 0x70 size 1 count 1 val 0x86  <-- second write to 0x70, cmos_index = 0x86 & 0x7f = 0x6, overwriting the cmos_index set by the first write
   CPU 0/KVM-15566 [005]  209311.762641: kvm_pio: pio_read at 0x71 size 1 count 1 val 0x6    <-- vcpu0 reads 0x6 because cmos_index is 0x6 now
   CPU 1/KVM-15567 [003]  209311.762644: kvm_pio: pio_read at 0x71 size 1 count 1 val 0x6    <-- vcpu1 reads 0x6
   CPU 1/KVM-15567 [003]  209311.762649: kvm_pio: pio_read at 0x70 size 1 count 1 val 0xff
   CPU 1/KVM-15567 [003]  209311.762669: kvm_pio: pio_write at 0x70 size 1 count 1 val 0x87
   CPU 1/KVM-15567 [003]  209311.762678: kvm_pio: pio_read at 0x71 size 1 count 1 val 0x1
   CPU 1/KVM-15567 [003]  209311.762683: kvm_pio: pio_read at 0x70 size 1 count 1 val 0xff
   CPU 1/KVM-15567 [003]  209311.762686: kvm_pio: pio_write at 0x70 size 1 count 1 val 0x84
   CPU 1/KVM-15567 [003]  209311.762693: kvm_pio: pio_read at 0x71 size 1 count 1 val 0x10
   CPU 1/KVM-15567 [003]  209311.762699: kvm_pio: pio_read at 0x70 size 1 count 1 val 0xff
   CPU 1/KVM-15567 [003]  209311.762702: kvm_pio: pio_write at 0x70 size 1 count 1 val 0x82
   CPU 1/KVM-15567 [003]  209311.762709: kvm_pio: pio_read at 0x71 size 1 count 1 val 0x25
   CPU 1/KVM-15567 [003]  209311.762714: kvm_pio: pio_read at 0x70 size 1 count 1 val 0xff
   CPU 1/KVM-15567 [003]  209311.762717: kvm_pio: pio_write at 0x70 size 1 count 1 val 0x80


Regards,
-Gonglei

From: Zhanghailiang
Sent: Friday, December 01, 2017 3:03 AM
To: qemu-devel@nongnu.org; m...@redhat.com; Paolo Bonzini
Cc: Huangweidong (C); Gonglei (Arei); wangxin (U); Xiexiangyou
Subject: [BUG] Windows 7 got stuck easily while run PCMark10 application

Hi,

We hit a bug in our test while running PCMark 10 in a Windows 7 VM:
the VM got stuck and the wallclock hung after several minutes of running
PCMark 10 in it.
It is quite easy to reproduce the bug with upstream KVM and QEMU.

We found that KVM cannot inject any RTC irq into the VM after it hangs; it
fails to deliver the irq in ioapic_set_irq() because the RTC irq is still
pending in ioapic->irr.

static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq,
                          int irq_level, bool line_status)
{
    ...
    if (!irq_level) {
        ioapic->irr &= ~mask;
        ret = 1;
        goto out;
    }
    ...
    if ((edge && old_irr == ioapic->irr) ||
        (!edge && entry.fields.remote_irr)) {
        ret = 0;
        goto out;
    }

According to the RTC spec, after the RTC injects a high-level irq, the OS reads
CMOS register C to clear the irq flag and pull down the irq electric pin.

In QEMU, we emulate the read operation in cmos_ioport_read(), but the guest OS
first fires a write operation to announce which register the following read
will target; we use s->cmos_index to record that register.

But in our test, we found a possible situation in which a vCPU fails to read
RTC_REG_C to clear the irq. This can happen while two vCPUs are writing/reading
registers at the same time. For example, vcpu0 is trying to read RTC_REG_C,
so it writes RTC_REG_C first, setting s->cmos_index to RTC_REG_C. But before
it reads register C, vcpu1 starts to read RTC_YEAR and changes s->cmos_index
to RTC_YEAR with its own write. The next read by vcpu0 therefore returns
RTC_YEAR. In this case, we miss calling qemu_irq_lower(s->irq) to clear the
irq. After this, KVM never injects an RTC irq again, and the Windows VM hangs.
static void cmos_ioport_write(void *opaque, hwaddr addr,
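
To make the interleaving concrete, here is a minimal user-space sketch of the
shared-index race (hypothetical names and a plain global instead of QEMU device
state; it only models the non-atomic two-step index/data protocol):

#include <pthread.h>
#include <stdio.h>

static volatile int cmos_index;        /* shared, like s->cmos_index */

/* The classic CMOS access: select a register via the index port (0x70),
 * then read it via the data port (0x71). Nothing makes this atomic. */
static int cmos_read(int reg)
{
    cmos_index = reg;                  /* step 1: write index port */
    return cmos_index;                 /* step 2: read data port   */
}

static void *vcpu(void *arg)
{
    int want = *(int *)arg;

    for (int i = 0; i < 1000000; i++) {
        if (cmos_read(want) != want) {
            printf("lost selection: wanted 0x%x\n", want);
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    int reg_c = 0x0c, year = 0x09;

    pthread_create(&t1, NULL, vcpu, &reg_c);
    pthread_create(&t2, NULL, vcpu, &year);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

(Compile with -pthread; each "lost selection" line is one instance of the race.)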
 

[Qemu-devel] [PATCH] rtc: fix windows guest hang problem when Reading RTC_REG_C is overwritten

2017-12-01 Thread Gonglei
We hit a bug in our test while running PCMark 10 in a Windows 7 VM:
the VM got stuck and the wallclock hung after several minutes of running
PCMark 10 in it.
It is quite easy to reproduce the bug with upstream KVM and QEMU.

We found that KVM cannot inject any RTC irq into the VM after it hangs; it
fails to deliver the irq in ioapic_set_irq() because the RTC irq is still
pending in ioapic->irr.

static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq,
                          int irq_level, bool line_status)
{
    ...
    if (!irq_level) {
        ioapic->irr &= ~mask;
        ret = 1;
        goto out;
    }
    ...
    if ((edge && old_irr == ioapic->irr) ||
        (!edge && entry.fields.remote_irr)) {
        ret = 0;
        goto out;
    }

According to the RTC spec, after the RTC injects a high-level irq, the OS reads
CMOS register C to clear the irq flag and pull down the irq electric pin.

In QEMU, we emulate the read operation in cmos_ioport_read(), but the guest OS
first fires a write operation to announce which register the following read
will target; we use s->cmos_index to record that register.

But in our test, we found a possible situation in which a vCPU fails to read
RTC_REG_C to clear the irq. This can happen while two vCPUs are writing/reading
registers at the same time. For example, vcpu0 is trying to read RTC_REG_C,
so it writes RTC_REG_C first, setting s->cmos_index to RTC_REG_C. But before
it reads register C, vcpu1 starts to read RTC_YEAR and changes s->cmos_index
to RTC_YEAR with its own write. The next read by vcpu0 therefore returns
RTC_YEAR. In this case, we miss calling qemu_irq_lower(s->irq) to clear the
irq. After this, KVM never injects an RTC irq again, and the Windows VM hangs.

Let's add a global variable to record the status of accessing RTC_REG_C to
avoid the issue.

Tested by Windows guests with PCMark benchmark.

Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 hw/timer/mc146818rtc.c | 65 +++---
 1 file changed, 46 insertions(+), 19 deletions(-)

diff --git a/hw/timer/mc146818rtc.c b/hw/timer/mc146818rtc.c
index 7764be2..c7702e7 100644
--- a/hw/timer/mc146818rtc.c
+++ b/hw/timer/mc146818rtc.c
@@ -98,6 +98,9 @@ static void rtc_set_cmos(RTCState *s, const struct tm *tm);
 static inline int rtc_from_bcd(RTCState *s, int a);
 static uint64_t get_next_alarm(RTCState *s);
 
+/* Used to indicate that RTC_REG_C is about to be accessed */
+bool ready_to_access_rtc_reg_c;
+
 static inline bool rtc_running(RTCState *s)
 {
     return (!(s->cmos_data[RTC_REG_B] & REG_B_SET) &&
@@ -473,6 +476,10 @@ static void cmos_ioport_write(void *opaque, hwaddr addr,
 
     if ((addr & 1) == 0) {
         s->cmos_index = data & 0x7f;
+
+        if (s->cmos_index == RTC_REG_C) {
+            ready_to_access_rtc_reg_c = true;
+        }
     } else {
         CMOS_DPRINTF("cmos: write index=0x%02x val=0x%02" PRIx64 "\n",
                      s->cmos_index, data);
@@ -575,6 +582,8 @@ static void cmos_ioport_write(void *opaque, hwaddr addr,
             check_update_timer(s);
             break;
         case RTC_REG_C:
+            ready_to_access_rtc_reg_c = false;
+            break;
         case RTC_REG_D:
             /* cannot write to them */
             break;
@@ -702,6 +711,32 @@ static int update_in_progress(RTCState *s)
     return 0;
 }
 
+static int cmos_ioport_read_rtc_reg_c(RTCState *s)
+{
+    int ret;
+
+    ret = s->cmos_data[RTC_REG_C];
+    qemu_irq_lower(s->irq);
+    s->cmos_data[RTC_REG_C] = 0x00;
+    if (ret & (REG_C_UF | REG_C_AF)) {
+        check_update_timer(s);
+    }
+
+    if (s->irq_coalesced &&
+        (s->cmos_data[RTC_REG_B] & REG_B_PIE) &&
+        s->irq_reinject_on_ack_count < RTC_REINJECT_ON_ACK_COUNT) {
+        s->irq_reinject_on_ack_count++;
+        s->cmos_data[RTC_REG_C] |= REG_C_IRQF | REG_C_PF;
+        DPRINTF_C("cmos: injecting on ack\n");
+        if (rtc_policy_slew_deliver_irq(s)) {
+            s->irq_coalesced--;
+            DPRINTF_C("cmos: coalesced irqs decreased to %d\n",
+                      s->irq_coalesced);
+        }
+    }
+    return ret;
+}
+
 static uint64_t cmos_ioport_read(void *opaque, hwaddr addr,
                                  unsigned size)
 {
@@ -710,6 +745,15 @@ static uint64_t cmos_ioport_read(void *opaque, hwaddr addr,
     if ((addr & 1) == 0) {
         return 0xff;
     } else {
+        /*
+         * This indicates that the cmos_index for RTC_REG_C was overwritten;
+         * if the condition is met, we should perform the RTC_REG_C read
+         * manually.
+         */
+        if (ready_to_access_rtc_reg_c && s->cmos_index != RTC_REG_C) {

Re: [Qemu-devel] [PATCH v4] thread: move detach_thread from creating thread to created thread

2017-11-29 Thread Gonglei (Arei)

> -Original Message-
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: Thursday, November 30, 2017 12:39 AM
> To: Gonglei (Arei); Eric Blake; linzhecheng; qemu-devel@nongnu.org
> Cc: f...@redhat.com; wangxin (U)
> Subject: Re: [Qemu-devel] [PATCH v4] thread: move detach_thread from
> creating thread to created thread
> 
> On 29/11/2017 17:28, Gonglei (Arei) wrote:
> >> The root cause of this problem is a bug in glibc (version 2.17; the
> >> latest version has the same bug),
> >>> let's see what happened in glibc's code.
> >> Have you reported this bug to the glibc folks, and if so, can we include
> >> a URL to the glibc bugzilla?
> >>
> > No, we didn't do that yet. :(
> 
> It's here:
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=19951.
> 
> I've added a note to the commit message.
> 
Nice~ :)

Thanks,
-Gonglei


Re: [Qemu-devel] [PATCH v4] thread: move detach_thread from creating thread to created thread

2017-11-29 Thread Gonglei (Arei)


> -Original Message-
> From: Eric Blake [mailto:ebl...@redhat.com]
> Sent: Thursday, November 30, 2017 12:19 AM
> To: linzhecheng; qemu-devel@nongnu.org
> Cc: aligu...@us.ibm.com; f...@redhat.com; wangxin (U); Gonglei (Arei);
> pbonz...@redhat.com
> Subject: Re: [Qemu-devel] [PATCH v4] thread: move detach_thread from
> creating thread to created thread
> 
> On 11/27/2017 10:46 PM, linzhecheng wrote:
> > If we create a thread in QEMU_THREAD_DETACHED mode, QEMU may
> > get a segfault with low probability.
> >
> 
> >
> > The root cause of this problem is a bug in glibc (version 2.17; the
> > latest version has the same bug),
> > let's see what happened in glibc's code.
> 
> Have you reported this bug to the glibc folks, and if so, can we include
> a URL to the glibc bugzilla?
> 
No, we didn't do that yet. :(


> Working around the glibc bug is nice, but glibc should really be fixed
> so that other projects do not have to continue working around it.
> 
> 
Yes, agree.


Regards,
-Gonglei

> >
> > QEMU gets a segfault at line 50, because pd is an invalid address.
> > pd is still valid at line 38 when setting pd->joinid = pd; at this moment,
> > the created thread is just exiting (only keeps runing for a short time),
> 
> s/runing/running/
> 
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.   +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org


Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default

2017-11-28 Thread Gonglei (Arei)


> -Original Message-
> From: rka...@virtuozzo.com [mailto:rka...@virtuozzo.com]
> Sent: Wednesday, November 29, 2017 1:56 PM
> To: Gonglei (Arei)
> Cc: Eduardo Habkost; Denis V. Lunev; longpeng; Michael S. Tsirkin; Denis
> Plotnikov; pbonz...@redhat.com; r...@twiddle.net; qemu-devel@nongnu.org;
> huangpeng; Zhaoshenglong
> Subject: Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
> 
> On Wed, Nov 29, 2017 at 01:57:14AM +, Gonglei (Arei) wrote:
> > > On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> > > > On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> > > > > On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> > > > >> Commit 14c985cffa "target-i386: present virtual L3 cache info for
> vcpus"
> > > > >> introduced and set by default exposing l3 to the guest.
> > > > >>
> > > > >> The motivation behind it was that in the Linux scheduler, when waking
> up
> > > > >> a task on a sibling CPU, the task was put onto the target CPU's
> runqueue
> > > > >> directly, without sending a reschedule IPI.  Reduction in the IPI 
> > > > >> count
> > > > >> led to performance gain.
> > > > >>
> >
> > Yes, that's one thing.
> >
> > The other reason for enabling L3 cache is the performance of accessing
> memory.
> 
> I guess you're talking about the super-smart buffer size tuning glibc
> does in its memcpy and friends.  We try to control that with an atomic
> test for memcpy, and we didn't notice a difference.  We'll need to
> double-check...
> 
> > We tested it with the STREAM benchmark; the performance is better with
> > l3-cache=on.
> 
> This one: https://www.cs.virginia.edu/stream/ ?  Thanks, we'll have a
> look, too.
> 
Yes. :)

Thanks,
-Gonglei



  1   2   3   4   5   6   7   8   9   10   >