Re: [PATCH 08/28] hw/acpi: Avoid truncating acpi_data_len() to 32-bit

2021-09-08 Thread Igor Mammedov
On Fri,  3 Sep 2021 13:06:42 +0200
Philippe Mathieu-Daudé  wrote:

> acpi_data_len() returns an unsigned type, which might be bigger
> than 32 bits (although it is unlikely such a value is ever returned).
> Hold the returned value in an 'unsigned' type to avoid an unlikely
> size truncation.
> 
> Signed-off-by: Philippe Mathieu-Daudé 

Acked-by: Igor Mammedov 

> ---
>  hw/arm/virt-acpi-build.c | 2 +-
>  hw/i386/acpi-build.c | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 037cc1fd82c..95543d43e2a 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -885,7 +885,7 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
>  
>  static void acpi_ram_update(MemoryRegion *mr, GArray *data)
>  {
> -uint32_t size = acpi_data_len(data);
> +unsigned size = acpi_data_len(data);
>  
>  /* Make sure RAM size is correct - in case it got changed
>   * e.g. by migration */
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index a33ac8b91e1..aa269914b49 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -2660,7 +2660,7 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
>  
>  static void acpi_ram_update(MemoryRegion *mr, GArray *data)
>  {
> -uint32_t size = acpi_data_len(data);
> +unsigned size = acpi_data_len(data);
>  
>  /* Make sure RAM size is correct - in case it got changed e.g. by migration */
>  memory_region_ram_resize(mr, size, &error_abort);
> @@ -2783,7 +2783,7 @@ void acpi_setup(void)
>   * Though RSDP is small, its contents isn't immutable, so
>   * we'll update it along with the rest of tables on guest access.
>   */
> -uint32_t rsdp_size = acpi_data_len(tables.rsdp);
> +unsigned rsdp_size = acpi_data_len(tables.rsdp);
>  
>  build_state->rsdp = g_memdup(tables.rsdp->data, rsdp_size);
>  fw_cfg_add_file_callback(x86ms->fw_cfg, ACPI_BUILD_RSDP_FILE,




Re: [PATCH 09/28] hw/acpi: Replace g_memdup() by g_memdup2_qemu()

2021-09-08 Thread Igor Mammedov
On Fri,  3 Sep 2021 13:06:43 +0200
Philippe Mathieu-Daudé  wrote:

> Per 
> https://discourse.gnome.org/t/port-your-module-from-g-memdup-to-g-memdup2-now/5538
> 
>   The old API took the size of the memory to duplicate as a guint,
>   whereas most memory functions take memory sizes as a gsize. This
>   made it easy to accidentally pass a gsize to g_memdup(). For large
>   values, that would lead to a silent truncation of the size from 64
>   to 32 bits, and result in a heap area being returned which is
>   significantly smaller than what the caller expects. This can likely
>   be exploited in various modules to cause a heap buffer overflow.
> 
> Replace g_memdup() by the safer g_memdup2_qemu() wrapper.
> 
> Signed-off-by: Philippe Mathieu-Daudé 

Acked-by: Igor Mammedov 


> ---
>  hw/acpi/core.c   | 3 ++-
>  hw/i386/acpi-build.c | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/acpi/core.c b/hw/acpi/core.c
> index 1e004d0078d..9dd2cf09a0b 100644
> --- a/hw/acpi/core.c
> +++ b/hw/acpi/core.c
> @@ -637,7 +637,8 @@ void acpi_pm1_cnt_init(ACPIREGS *ar, MemoryRegion *parent,
>  suspend[3] = 1 | ((!disable_s3) << 7);
>  suspend[4] = s4_val | ((!disable_s4) << 7);
>  
> -fw_cfg_add_file(fw_cfg, "etc/system-states", g_memdup(suspend, 6), 6);
> +fw_cfg_add_file(fw_cfg, "etc/system-states",
> +g_memdup2_qemu(suspend, 6), 6);
>  }
>  }
>  
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index aa269914b49..54494ca1f65 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -2785,7 +2785,7 @@ void acpi_setup(void)
>   */
>  unsigned rsdp_size = acpi_data_len(tables.rsdp);
>  
> -build_state->rsdp = g_memdup(tables.rsdp->data, rsdp_size);
> +build_state->rsdp = g_memdup2_qemu(tables.rsdp->data, rsdp_size);
>  fw_cfg_add_file_callback(x86ms->fw_cfg, ACPI_BUILD_RSDP_FILE,
>   acpi_build_update, NULL, build_state,
>   build_state->rsdp, rsdp_size, true);
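
For readers unfamiliar with the GLib change: g_memdup() takes its size as a
32-bit guint, while g_memdup2() (GLib >= 2.68) takes a gsize. Below is a sketch
of what a compatibility wrapper of this kind looks like; it is illustrative
only, and the actual g_memdup2_qemu() helper in QEMU may be implemented
differently.

#include <glib.h>
#include <string.h>

/*
 * Illustrative only, not QEMU's actual g_memdup2_qemu().  The key point
 * is the gsize parameter: a 64-bit size is passed through intact instead
 * of being silently truncated to 32 bits as with g_memdup()'s guint.
 */
static gpointer memdup2_sketch(gconstpointer mem, gsize byte_size)
{
    gpointer new_mem = NULL;

    if (mem && byte_size != 0) {
        new_mem = g_malloc(byte_size);  /* aborts on OOM, like g_memdup() */
        memcpy(new_mem, mem, byte_size);
    }
    return new_mem;
}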




Re: [PATCH v3 9/9] iotests: add nbd-reconnect-on-open test

2021-09-08 Thread Vladimir Sementsov-Ogievskiy

07.09.2021 23:51, Eric Blake wrote:

+def check_fail_to_connect(open_timeout):
+log(f'Check fail to connect with {open_timeout} seconds of timeout')
+
+start_t = time.time()
+qemu_io_log(*create_args(open_timeout))
+delta_t = time.time() - start_t
+
+max_delta = open_timeout + 0.2

Is this fractional delay going to bite us on heavily-loaded CI machines?



You mean that it's too small and may be racy? Hmm. But increasing it now means 
wasting extra time... I'd adjust it if CI failures actually happen.

--
Best regards,
Vladimir



Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-08 Thread Dr. David Alan Gilbert
* Jason Wang (jasow...@redhat.com) wrote:
> On Wed, Sep 8, 2021 at 2:32 AM Peter Xu  wrote:
> >
> > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > I don't think it is valid to unconditionally enable this feature due to 
> > > > the
> > > > resource usage implications
> > > >
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > >
> > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > >if the socket option was not set, the socket exceeds its optmem
> > > >limit or the user exceeds its ulimit on locked pages."
> > >
> > > You are correct, I unfortunately missed this part in the docs :(
> > >
> > > > The limit on locked pages is something that looks very likely to be
> > > > exceeded unless you happen to be running a QEMU config that already
> > > > implies locked memory (eg PCI assignment)
> > >
> > > Do you mean the limit an user has on locking memory?
> > >
> > > If so, that makes sense. I remember I needed to set the upper limit of 
> > > locked
> > > memory for the user before using it, or adding a capability to qemu 
> > > before.
> >
> > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
> >
> > The thing is IIUC that's accounting for pinned pages only with either 
> > mlock()
> > (FOLL_MLOCK) or vfio (FOLL_PIN).
> >
> > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> 
> It happens probably here:
> 
> E.g
> 
> __ip_append_data()
> msg_zerocopy_realloc()
> mm_account_pinned_pages()

When do they get uncounted?  i.e. is it just the data that's in flight
that's marked as pinned?

Dave

> Thanks
> 
> >
> > Say, I think there're plenty of iov_iter_get_pages() callers and normal 
> > GUPs, I
> > think most of them do no need such accountings.
> >
> > --
> > Peter Xu
> >
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
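
As background for the locked-pages concern above: the ulimit in question is
RLIMIT_MEMLOCK. A minimal user-space sketch of how a sender could sanity-check
it before opting in to MSG_ZEROCOPY follows; the threshold policy is purely
illustrative and not something QEMU currently does.

#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

/* Illustrative check only; any real policy would live in the caller. */
static bool memlock_limit_covers(size_t max_bytes_in_flight)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        return false;
    }
    return rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= max_bytes_in_flight;
}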




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-08 Thread Dr. David Alan Gilbert
* Peter Xu (pet...@redhat.com) wrote:
> On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote:
> > > > What if we do the 'flush()' before we start post-copy, instead of after 
> > > > each
> > > > iteration? would that be enough?
> > > 
> > > Possibly, yes. This really need David G's input since he understands
> > > the code in way more detail than me.
> > 
> > Hmm I'm not entirely sure why we have the sync after each iteration;
> 
> We don't have that yet, do we?

I think multifd does; I think it calls multifd_send_sync_main sometimes,
which I *think* corresponds to iterations.

Dave

> > the case I can think of is if we're doing async sending, we could have
> > two versions of the same page in flight (one from each iteration) -
> > you'd want those to get there in the right order.
> 
> From MSG_ZEROCOPY document:
> 
> For protocols that acknowledge data in-order, like TCP, each
> notification can be squashed into the previous one, so that no more
> than one notification is outstanding at any one point.
> 
> Ordered delivery is the common case, but not guaranteed. Notifications
> may arrive out of order on retransmission and socket teardown.
> 
> So no matter whether we have a flush() per iteration before, it seems we may
> want one when zero copy enabled?
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
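
For reference, the notifications being discussed are delivered on the socket's
error queue, as described in the kernel's msg_zerocopy documentation quoted
above. Below is a rough sketch of a flush that drains those completions; it
assumes a non-blocking socket and is not the QIOChannel integration proposed
in this series.

#include <errno.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

/*
 * Drain MSG_ZEROCOPY completion notifications from the error queue.
 * Sketch only: a real flush must track how many zero-copy send() calls
 * are still outstanding and wait (e.g. with poll()) until all of them
 * are covered by the ee_info..ee_data ranges reported here.
 */
static int zerocopy_drain_errqueue(int sockfd)
{
    for (;;) {
        char control[256];
        struct msghdr msg = {
            .msg_control = control,
            .msg_controllen = sizeof(control),
        };
        struct cmsghdr *cm;

        if (recvmsg(sockfd, &msg, MSG_ERRQUEUE) < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                return 0;               /* no notification queued right now */
            }
            return -1;
        }

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
            struct sock_extended_err *serr =
                (struct sock_extended_err *)CMSG_DATA(cm);

            if (serr->ee_errno != 0 ||
                serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
                continue;               /* not a zero-copy notification */
            }
            /*
             * serr->ee_info .. serr->ee_data is the range of completed
             * zero-copy send() calls; SO_EE_CODE_ZEROCOPY_COPIED in
             * ee_code means the kernel fell back to copying that range.
             */
        }
    }
}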




[RFC PATCH 0/4] block layer: split block APIs in graph and I/O

2021-09-08 Thread Emanuele Giuseppe Esposito
Currently, block layer APIs like block-backend.h contain a mix of
functions that either run in the main loop under the BQL, or are
thread-safe and run in iothreads performing I/O.
The functions running under the BQL also take care of modifying the
block graph, by using drain and/or aio_context_acquire/release.
This makes it very confusing to understand where each function
runs, and what assumptions it provides with regard to thread
safety.

We call the functions running under the BQL the "graph API", and
distinguish them from the thread-safe "I/O API".

The aim of this series is to split the relevant block headers into
graph and I/O sub-headers. The division will be done in this way:
header.h will be split into header-graph.h, header-io.h and
header-common.h. The latter will just contain the data structures
needed by header-graph and header-io. header.h will remain for
legacy reasons and to avoid changing all includes in all QEMU C files,
but will only include the two new headers.
Once we split all relevant headers, it will be much easier to see what
uses the AioContext lock and remove it, which is the overall main
goal of this and other series that I have posted/will post.

RFC: I am happy to change all includes, if you think that it would
be better than leaving an empty header.h file.
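
For illustration, the nearly empty legacy header described above would
presumably boil down to something like this for block-backend.h (a sketch; the
real file keeps its copyright banner and may differ in which sub-headers it
pulls in):

#ifndef BLOCK_BACKEND_H
#define BLOCK_BACKEND_H

/* Sketch only: header-common.h is assumed to be included by the two
 * sub-headers themselves. */
#include "sysemu/block-backend-graph.h"
#include "sysemu/block-backend-io.h"

#endif /* BLOCK_BACKEND_H */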

The relevant headers I found so far are:
- block-backend.h
- block.h
- block_int.h
- backup-top.h and blockdev.h (but they contain very few functions)
Once these are done, we would also need to audit the callbacks
offered by struct JobDriver, BlockDevOps and BlockDriver.

Each function in the graph API will have an assertion, checking
that it is always running under the BQL.
I/O functions are instead thread-safe (or should be), meaning
that they *can* run under the BQL, but also in an iothread in another
AioContext. Therefore they do not get any assertion, and
need to be audited manually to verify their correctness.
RFC: Any idea on how to guarantee the correctness of I/O functions is welcome.

Adding assertions has already helped find a bug, as shown in
patch 2.

RFC: Because this task is very time consuming and requires a lot
of auditing, this series only provides the block-backend.h split,
to gather initial feedback.

Tested this series by running unit tests, qemu-iotests and qtests
(x86_64).
Some functions in the graph API are used everywhere but not
properly tested. Therefore their assertion is never actually run in
the tests, so despite my very careful auditing, it cannot be excluded
that some will trigger while actually using QEMU.

Patch 1 introduces qemu_in_main_thread(), the function used in
all assertions. This had to be introduced, otherwise all unit tests
would fail, since they run in the main loop but use the code in
stubs/iothread-lock.c.
Patch 2 fixes a bug that came up when auditing the code.
Patch 3 splits the block-backend header, and to reduce the diff
I moved the assertions to patch 4.

Next steps:
1) if this series gets positive comments, convert block.h and block_int.h
2) audit the graph API and replace the AioContext lock with drains,
or remove it when not necessary (requires further discussion,
not needed now).
3) [optional, as it should already be the case] audit the I/O API
and check that thread safety is guaranteed

Signed-off-by: Emanuele Giuseppe Esposito 

Emanuele Giuseppe Esposito (4):
  main-loop.h: introduce qemu_in_main_thread()
  migration: block-dirty-bitmap: add missing qemu_mutex_lock_iothread
  include/sysemu/block-backend: split header into I/O and graph API
  block/block-backend.c: assertions for block-backend

 block/block-backend.c | 102 +-
 include/qemu/main-loop.h  |  13 ++
 include/sysemu/block-backend-common.h |  74 
 include/sysemu/block-backend-graph.h  | 132 +
 include/sysemu/block-backend-io.h | 123 
 include/sysemu/block-backend.h| 262 +-
 migration/block-dirty-bitmap.c|   5 +-
 softmmu/cpus.c|   5 +
 softmmu/qdev-monitor.c|   2 +
 stubs/iothread-lock.c |   5 +
 10 files changed, 459 insertions(+), 264 deletions(-)
 create mode 100644 include/sysemu/block-backend-common.h
 create mode 100644 include/sysemu/block-backend-graph.h
 create mode 100644 include/sysemu/block-backend-io.h

-- 
2.27.0




[RFC PATCH 2/4] migration: block-dirty-bitmap: add missing qemu_mutex_lock_iothread

2021-09-08 Thread Emanuele Giuseppe Esposito
init_dirty_bitmap_migration() assumes the iothread lock (BQL)
to be held, but it isn't.

Instead of adding the lock to qemu_savevm_state_setup(),
follow the same pattern as the other ->save_setup callbacks
and lock+unlock inside dirty_bitmap_save_setup().

Signed-off-by: Emanuele Giuseppe Esposito 
---
 migration/block-dirty-bitmap.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 35f5ef688d..9aba7d9c22 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -1215,7 +1215,10 @@ static int dirty_bitmap_save_setup(QEMUFile *f, void *opaque)
 {
 DBMSaveState *s = &((DBMState *)opaque)->save;
 SaveBitmapState *dbms = NULL;
+
+qemu_mutex_lock_iothread();
 if (init_dirty_bitmap_migration(s) < 0) {
+qemu_mutex_unlock_iothread();
 return -1;
 }
 
@@ -1223,7 +1226,7 @@ static int dirty_bitmap_save_setup(QEMUFile *f, void *opaque)
 send_bitmap_start(f, s, dbms);
 }
 qemu_put_bitmap_flags(f, DIRTY_BITMAP_MIG_FLAG_EOS);
-
+qemu_mutex_unlock_iothread();
 return 0;
 }
 
-- 
2.27.0




[RFC PATCH 1/4] main-loop.h: introduce qemu_in_main_thread()

2021-09-08 Thread Emanuele Giuseppe Esposito
When invoked from the main loop, this function is the same
as qemu_mutex_iothread_locked, and returns true if the BQL is held.
When invoked from iothreads or tests, it returns true only
if the current AioContext is the main loop's.

This essentially just extends qemu_mutex_iothread_locked to also work
for unit tests and other users like the storage daemon, which run
in the main loop but end up using the implementation in
stubs/iothread-lock.c.

Using qemu_mutex_iothread_locked in unit tests defaults to false
because they use the implementation in stubs/iothread-lock.c,
making all assertions added in the next patches fail even though the
AioContext is still the main loop.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/qemu/main-loop.h | 13 +
 softmmu/cpus.c   |  5 +
 stubs/iothread-lock.c|  5 +
 3 files changed, 23 insertions(+)

diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
index 8dbc6fcb89..c6547207f7 100644
--- a/include/qemu/main-loop.h
+++ b/include/qemu/main-loop.h
@@ -245,6 +245,19 @@ AioContext *iohandler_get_aio_context(void);
  */
 bool qemu_mutex_iothread_locked(void);
 
+/**
+ * qemu_in_main_thread: Return true if the function runs with BQL
+ * or in the main loop AioContext.
+ *
+ * This function falls back to qemu_mutex_iothread_locked() if
+ * called from the main loop, otherwise it checks if the current
+ * AioContext is the main loop. This is useful to check that the BQL
+ * is held, and not make it return false when invoked by unit
+ * tests or other users like storage-daemon that end up using
+ * stubs/iothread-lock.c implementation.
+ */
+bool qemu_in_main_thread(void);
+
 /**
  * qemu_mutex_lock_iothread: Lock the main loop mutex.
  *
diff --git a/softmmu/cpus.c b/softmmu/cpus.c
index 071085f840..3f61a3c31d 100644
--- a/softmmu/cpus.c
+++ b/softmmu/cpus.c
@@ -481,6 +481,11 @@ bool qemu_mutex_iothread_locked(void)
 return iothread_locked;
 }
 
+bool qemu_in_main_thread(void)
+{
+return qemu_mutex_iothread_locked();
+}
+
 /*
  * The BQL is taken from so many places that it is worth profiling the
  * callers directly, instead of funneling them all through a single function.
diff --git a/stubs/iothread-lock.c b/stubs/iothread-lock.c
index 5b45b7fc8b..ff7386e42c 100644
--- a/stubs/iothread-lock.c
+++ b/stubs/iothread-lock.c
@@ -6,6 +6,11 @@ bool qemu_mutex_iothread_locked(void)
 return false;
 }
 
+bool qemu_in_main_thread(void)
+{
+return qemu_get_current_aio_context() == qemu_get_aio_context();
+}
+
 void qemu_mutex_lock_iothread_impl(const char *file, int line)
 {
 }
-- 
2.27.0




[RFC PATCH 3/4] include/sysemu/block-backend: split header into I/O and graph API

2021-09-08 Thread Emanuele Giuseppe Esposito
block-backend.h currently contains a mix of functions:
some of them run under the BQL and modify the block layer graph,
others are instead thread-safe and perform I/O in iothreads.
It is not easy to understand which function is part of which
group (I/O vs graph), and this patch aims to clarify it.

The graph functions need the BQL, and often use
aio_context_acquire/release and/or drain to be sure they
can modify the graph safely.
The I/O functions are instead thread-safe, and can run in
any AioContext.

By splitting the header into two files, block-backend-io.h
and block-backend-graph.h, we have a clearer view of what
needs which kind of protection. block-backend-common.h
instead contains common structures shared by both headers.

In addition, remove the block/block.h include, as it seems
it is not necessary anymore, together with qemu/iov.h.

block-backend.h is left there for legacy reasons and to avoid changing
all includes in all C files that use the block-backend APIs.

Assertions are added in the next patch.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c |   9 +-
 include/sysemu/block-backend-common.h |  74 
 include/sysemu/block-backend-graph.h  | 132 +
 include/sysemu/block-backend-io.h | 123 
 include/sysemu/block-backend.h| 262 +-
 5 files changed, 338 insertions(+), 262 deletions(-)
 create mode 100644 include/sysemu/block-backend-common.h
 create mode 100644 include/sysemu/block-backend-graph.h
 create mode 100644 include/sysemu/block-backend-io.h

diff --git a/block/block-backend.c b/block/block-backend.c
index deb55c272e..7d5216a9a0 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -78,6 +78,7 @@ struct BlockBackend {
 bool allow_aio_context_change;
 bool allow_write_beyond_eof;
 
+/* Protected by BQL lock */
 NotifierList remove_bs_notifiers, insert_bs_notifiers;
 QLIST_HEAD(, BlockBackendAioNotifier) aio_notifiers;
 
@@ -110,12 +111,14 @@ static const AIOCBInfo block_backend_aiocb_info = {
 static void drive_info_del(DriveInfo *dinfo);
 static BlockBackend *bdrv_first_blk(BlockDriverState *bs);
 
-/* All BlockBackends */
+/* All BlockBackends. Protected by the BQL lock. */
 static QTAILQ_HEAD(, BlockBackend) block_backends =
 QTAILQ_HEAD_INITIALIZER(block_backends);
 
-/* All BlockBackends referenced by the monitor and which are iterated through by
- * blk_next() */
+/*
+ * All BlockBackends referenced by the monitor and which are iterated through by
+ * blk_next(). Protected by the BQL lock.
+ */
 static QTAILQ_HEAD(, BlockBackend) monitor_block_backends =
 QTAILQ_HEAD_INITIALIZER(monitor_block_backends);
 
diff --git a/include/sysemu/block-backend-common.h b/include/sysemu/block-backend-common.h
new file mode 100644
index 00..52ff6a4d26
--- /dev/null
+++ b/include/sysemu/block-backend-common.h
@@ -0,0 +1,74 @@
+/*
+ * QEMU Block backends
+ *
+ * Copyright (C) 2014-2016 Red Hat, Inc.
+ *
+ * Authors:
+ *  Markus Armbruster ,
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1
+ * or later.  See the COPYING.LIB file in the top-level directory.
+ */
+
+#ifndef BLOCK_BACKEND_COMMON_H
+#define BLOCK_BACKEND_COMMON_H
+
+#include "block/throttle-groups.h"
+
+/* Callbacks for block device models */
+typedef struct BlockDevOps {
+/*
+ * Runs when virtual media changed (monitor commands eject, change)
+ * Argument load is true on load and false on eject.
+ * Beware: doesn't run when a host device's physical media
+ * changes.  Sure would be useful if it did.
+ * Device models with removable media must implement this callback.
+ */
+void (*change_media_cb)(void *opaque, bool load, Error **errp);
+/*
+ * Runs when an eject request is issued from the monitor, the tray
+ * is closed, and the medium is locked.
+ * Device models that do not implement is_medium_locked will not need
+ * this callback.  Device models that can lock the medium or tray might
+ * want to implement the callback and unlock the tray when "force" is
+ * true, even if they do not support eject requests.
+ */
+void (*eject_request_cb)(void *opaque, bool force);
+/*
+ * Is the virtual tray open?
+ * Device models implement this only when the device has a tray.
+ */
+bool (*is_tray_open)(void *opaque);
+/*
+ * Is the virtual medium locked into the device?
+ * Device models implement this only when device has such a lock.
+ */
+bool (*is_medium_locked)(void *opaque);
+/*
+ * Runs when the size changed (e.g. monitor command block_resize)
+ */
+void (*resize_cb)(void *opaque);
+/*
+ * Runs when the backend receives a drain request.
+ */
+void (*drained_begin)(void *opaque);
+/*
+ * Runs when the backend's last drain request ends.
+ */
+void (*drained_end)(void *opaque);
+/*
+ * Is the device still busy?
+ */

[RFC PATCH 4/4] block/block-backend.c: assertions for block-backend

2021-09-08 Thread Emanuele Giuseppe Esposito
All the graph API functions will check that
qemu_in_main_thread() returns true. If not, it means
that running under the BQL cannot be guaranteed, and
they need to be moved to the I/O API.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c  | 93 +-
 softmmu/qdev-monitor.c |  2 +
 2 files changed, 94 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 7d5216a9a0..7289c4f62e 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -227,6 +227,7 @@ static void blk_root_activate(BdrvChild *child, Error **errp)
 
 void blk_set_force_allow_inactivate(BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
 blk->force_allow_inactivate = true;
 }
 
@@ -345,6 +346,8 @@ BlockBackend *blk_new(AioContext *ctx, uint64_t perm, uint64_t shared_perm)
 {
 BlockBackend *blk;
 
+g_assert(qemu_in_main_thread());
+
 blk = g_new0(BlockBackend, 1);
 blk->refcnt = 1;
 blk->ctx = ctx;
@@ -382,6 +385,8 @@ BlockBackend *blk_new_with_bs(BlockDriverState *bs, uint64_t perm,
 {
 BlockBackend *blk = blk_new(bdrv_get_aio_context(bs), perm, shared_perm);
 
+g_assert(qemu_in_main_thread());
+
 if (blk_insert_bs(blk, bs, errp) < 0) {
 blk_unref(blk);
 return NULL;
@@ -410,6 +415,8 @@ BlockBackend *blk_new_open(const char *filename, const char *reference,
 uint64_t perm = 0;
 uint64_t shared = BLK_PERM_ALL;
 
+g_assert(qemu_in_main_thread());
+
 /*
  * blk_new_open() is mainly used in .bdrv_create implementations and the
  * tools where sharing isn't a major concern because the BDS stays private
@@ -487,6 +494,7 @@ static void drive_info_del(DriveInfo *dinfo)
 
 int blk_get_refcnt(BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
 return blk ? blk->refcnt : 0;
 }
 
@@ -497,6 +505,7 @@ int blk_get_refcnt(BlockBackend *blk)
 void blk_ref(BlockBackend *blk)
 {
 assert(blk->refcnt > 0);
+g_assert(qemu_in_main_thread());
 blk->refcnt++;
 }
 
@@ -507,6 +516,7 @@ void blk_ref(BlockBackend *blk)
  */
 void blk_unref(BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
 if (blk) {
 assert(blk->refcnt > 0);
 if (blk->refcnt > 1) {
@@ -527,6 +537,7 @@ void blk_unref(BlockBackend *blk)
  */
 BlockBackend *blk_all_next(BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
 return blk ? QTAILQ_NEXT(blk, link)
: QTAILQ_FIRST(&block_backends);
 }
@@ -535,6 +546,8 @@ void blk_remove_all_bs(void)
 {
 BlockBackend *blk = NULL;
 
+g_assert(qemu_in_main_thread());
+
 while ((blk = blk_all_next(blk)) != NULL) {
 AioContext *ctx = blk_get_aio_context(blk);
 
@@ -558,6 +571,7 @@ void blk_remove_all_bs(void)
  */
 BlockBackend *blk_next(BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
 return blk ? QTAILQ_NEXT(blk, monitor_link)
: QTAILQ_FIRST(&monitor_block_backends);
 }
@@ -624,6 +638,7 @@ static void bdrv_next_reset(BdrvNextIterator *it)
 
 BlockDriverState *bdrv_first(BdrvNextIterator *it)
 {
+g_assert(qemu_in_main_thread());
 bdrv_next_reset(it);
 return bdrv_next(it);
 }
@@ -661,6 +676,7 @@ bool monitor_add_blk(BlockBackend *blk, const char *name, Error **errp)
 {
 assert(!blk->name);
 assert(name && name[0]);
+g_assert(qemu_in_main_thread());
 
 if (!id_wellformed(name)) {
 error_setg(errp, "Invalid device name");
@@ -688,6 +704,8 @@ bool monitor_add_blk(BlockBackend *blk, const char *name, Error **errp)
  */
 void monitor_remove_blk(BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
+
 if (!blk->name) {
 return;
 }
@@ -703,6 +721,7 @@ void monitor_remove_blk(BlockBackend *blk)
  */
 const char *blk_name(const BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
 return blk->name ?: "";
 }
 
@@ -714,6 +733,7 @@ BlockBackend *blk_by_name(const char *name)
 {
 BlockBackend *blk = NULL;
 
+g_assert(qemu_in_main_thread());
 assert(name);
 while ((blk = blk_next(blk)) != NULL) {
 if (!strcmp(name, blk->name)) {
@@ -748,6 +768,7 @@ static BlockBackend *bdrv_first_blk(BlockDriverState *bs)
  */
 bool bdrv_has_blk(BlockDriverState *bs)
 {
+g_assert(qemu_in_main_thread());
 return bdrv_first_blk(bs) != NULL;
 }
 
@@ -758,6 +779,7 @@ bool bdrv_is_root_node(BlockDriverState *bs)
 {
 BdrvChild *c;
 
+g_assert(qemu_in_main_thread());
 QLIST_FOREACH(c, &bs->parents, next_parent) {
 if (c->klass != &child_root) {
 return false;
@@ -807,6 +829,7 @@ BlockBackend *blk_by_legacy_dinfo(DriveInfo *dinfo)
  */
 BlockBackendPublic *blk_get_public(BlockBackend *blk)
 {
+g_assert(qemu_in_main_thread());
 return &blk->public;
 }
 
@@ -815,6 +838,7 @@ BlockBackendPublic *blk_get_public(BlockBackend *blk)
  */
 BlockBackend *blk_by_public(BlockBackendPublic *public)
 {
+g_assert(qemu_in_main_thread());
 return container_of(publi

Re: [PATCH v1 2/4] virtio: increase virtuqueue size for virtio-scsi and virtio-blk

2021-09-08 Thread Stefano Garzarella

Hi Denis,
I just found this discussion since we still have the following line in 
hw/core/machine.c:

{ "vhost-blk-device", "seg_max_adjust", "off"}

IIUC it was a typo, and I think we should fix it since in the future we 
can have `vhost-blk-device`.


So, I think we have 2 options:
1. remove that line since for now it is useless
2. replace with "vhost-scsi"

I'm not sure which is the best, what do you suggest?

Thanks,
Stefano

On Fri, Feb 07, 2020 at 11:48:05AM +0300, Denis Plotnikov wrote:

On 05.02.2020 14:19, Stefan Hajnoczi wrote:

On Tue, Feb 04, 2020 at 12:59:04PM +0300, Denis Plotnikov wrote:

On 30.01.2020 17:58, Stefan Hajnoczi wrote:

On Wed, Jan 29, 2020 at 05:07:00PM +0300, Denis Plotnikov wrote:

The goal is to reduce the number of requests issued by a guest on
1M reads/writes. This raises the performance by up to 4% on that kind of
disk access pattern.

The maximum chunk size to be used for guest disk access is
limited by the seg_max parameter, which represents the max number of
pieces in the scatter-gather list in one guest disk request.

Since seg_max is virtqueue_size dependent, increasing the virtqueue
size increases seg_max, which, in turn, increases the maximum size
of data to be read/written from the guest disk.

More details in the original problem statement:
https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg03721.html

Suggested-by: Denis V. Lunev 
Signed-off-by: Denis Plotnikov 
---
  hw/core/machine.c  | 3 +++
  include/hw/virtio/virtio.h | 2 +-
  2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 3e288bfceb..8bc401d8b7 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -28,6 +28,9 @@
  #include "hw/mem/nvdimm.h"
  GlobalProperty hw_compat_4_2[] = {
+{ "virtio-blk-device", "queue-size", "128"},
+{ "virtio-scsi-device", "virtqueue_size", "128"},
+{ "vhost-blk-device", "virtqueue_size", "128"},

vhost-blk-device?!  Who has this?  It's not in qemu.git so please omit
this line. ;-)

So in this case the line:

{ "vhost-blk-device", "seg_max_adjust", "off"},

introduced by my patch:

commit 1bf8a989a566b2ba41c197004ec2a02562a766a4
Author: Denis Plotnikov 
Date:   Fri Dec 20 17:09:04 2019 +0300

    virtio: make seg_max virtqueue size dependent

is also wrong. It should be:

{ "vhost-scsi-device", "seg_max_adjust", "off"},

Am I right?

It's just called "vhost-scsi":

include/hw/virtio/vhost-scsi.h:#define TYPE_VHOST_SCSI "vhost-scsi"


On the other hand, do you want to do this for the vhost-user-blk,
vhost-user-scsi, and vhost-scsi devices that exist in qemu.git?  Those
devices would benefit from better performance too.
After thinking about that for a while, I think we shouldn't extend
queue sizes for vhost-user-blk, vhost-user-scsi and vhost-scsi.
This is because increasing the queue sizes seems to be just useless
for them: the whole thing is about increasing the queue sizes to
increase seg_max (it limits the max block request size from the
guest). For virtio-blk-device and virtio-scsi-device it makes sense,
since they have the seg-max-adjust property which, if true, sets seg_max
to virtqueue_size-2. vhost-scsi also has this property but it seems
the property just doesn't affect anything (remove it?).
Also vhost-user-blk, vhost-user-scsi and vhost-scsi don't do any
seg_max settings. If I understand correctly, their backends are meant
to be responsible for doing that.
So, what about changing the queue sizes just for virtio-blk-device and
virtio-scsi-device?


Denis


It seems to be so. We also have the test checking those settings:
tests/acceptance/virtio_seg_max_adjust.py
For now it checks virtio-scsi-pci and virtio-blk-pci.
I'm going to extend it to also check the virtqueue size.
If I change vhost-user-blk, vhost-user-scsi and vhost-scsi it's worth
checking those devices too. But I don't know how to form a command line
for those 3 devices since they would need some third-party components as
backends (kernel modules, DPDK, etc.), and those seem to be unavailable in
the qemu git.
Is there any way to do it with some stubs available in qemu.git or something
else?
If so, could you please point out the proper way to do it?

qemu.git has contrib/vhost-user-blk/ and contrib/vhost-user-scsi/ if
you need to test those vhost-user devices without external dependencies.

Stefan








Re: [PATCH v1 2/4] virtio: increase virtuqueue size for virtio-scsi and virtio-blk

2021-09-08 Thread Stefano Garzarella

The message bounced, so I'm using Denis's new email address.

On Wed, Sep 08, 2021 at 03:17:16PM +0200, Stefano Garzarella wrote:

Hi Denis,
I just found this discussion since we still have the following line in 
hw/core/machine.c:

   { "vhost-blk-device", "seg_max_adjust", "off"}

IIUC it was a typo, and I think we should fix it since in the future 
we can have `vhost-blk-device`.


So, I think we have 2 options:
1. remove that line since for now it is useless
2. replace with "vhost-scsi"

I'm not sure which is the best, what do you suggest?

Thanks,
Stefano

On Fri, Feb 07, 2020 at 11:48:05AM +0300, Denis Plotnikov wrote:

On 05.02.2020 14:19, Stefan Hajnoczi wrote:

On Tue, Feb 04, 2020 at 12:59:04PM +0300, Denis Plotnikov wrote:

On 30.01.2020 17:58, Stefan Hajnoczi wrote:

On Wed, Jan 29, 2020 at 05:07:00PM +0300, Denis Plotnikov wrote:

The goal is to reduce the number of requests issued by a guest on
1M reads/writes. This raises the performance by up to 4% on that kind of
disk access pattern.

The maximum chunk size to be used for guest disk access is
limited by the seg_max parameter, which represents the max number of
pieces in the scatter-gather list in one guest disk request.

Since seg_max is virtqueue_size dependent, increasing the virtqueue
size increases seg_max, which, in turn, increases the maximum size
of data to be read/written from the guest disk.

More details in the original problem statement:
https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg03721.html

Suggested-by: Denis V. Lunev 
Signed-off-by: Denis Plotnikov 
---
 hw/core/machine.c  | 3 +++
 include/hw/virtio/virtio.h | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 3e288bfceb..8bc401d8b7 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -28,6 +28,9 @@
 #include "hw/mem/nvdimm.h"
 GlobalProperty hw_compat_4_2[] = {
+{ "virtio-blk-device", "queue-size", "128"},
+{ "virtio-scsi-device", "virtqueue_size", "128"},
+{ "vhost-blk-device", "virtqueue_size", "128"},

vhost-blk-device?!  Who has this?  It's not in qemu.git so please omit
this line. ;-)

So in this case the line:

{ "vhost-blk-device", "seg_max_adjust", "off"},

introduced by my patch:

commit 1bf8a989a566b2ba41c197004ec2a02562a766a4
Author: Denis Plotnikov 
Date:   Fri Dec 20 17:09:04 2019 +0300

    virtio: make seg_max virtqueue size dependent

is also wrong. It should be:

{ "vhost-scsi-device", "seg_max_adjust", "off"},

Am I right?

It's just called "vhost-scsi":

include/hw/virtio/vhost-scsi.h:#define TYPE_VHOST_SCSI "vhost-scsi"


On the other hand, do you want to do this for the vhost-user-blk,
vhost-user-scsi, and vhost-scsi devices that exist in qemu.git?  Those
devices would benefit from better performance too.
After thinking about that for a while, I think we shouldn't extend
queue sizes for vhost-user-blk, vhost-user-scsi and vhost-scsi.
This is because increasing the queue sizes seems to be just useless
for them: the whole thing is about increasing the queue sizes to
increase seg_max (it limits the max block request size from the
guest). For virtio-blk-device and virtio-scsi-device it makes sense,
since they have the seg-max-adjust property which, if true, sets seg_max
to virtqueue_size-2. vhost-scsi also has this property but it seems
the property just doesn't affect anything (remove it?).
Also vhost-user-blk, vhost-user-scsi and vhost-scsi don't do any
seg_max settings. If I understand correctly, their backends are meant
to be responsible for doing that.
So, what about changing the queue sizes just for virtio-blk-device
and virtio-scsi-device?


Denis


It seems to be so. We also have the test checking those settings:
tests/acceptance/virtio_seg_max_adjust.py
For now it checks virtio-scsi-pci and virtio-blk-pci.
I'm going to extend it to also check the virtqueue size.
If I change vhost-user-blk, vhost-user-scsi and vhost-scsi it's worth
checking those devices too. But I don't know how to form a command line
for those 3 devices since they would need some third-party components as
backends (kernel modules, DPDK, etc.), and those seem to be unavailable in
the qemu git.
Is there any way to do it with some stubs available in qemu.git or something
else?
If so, could you please point out the proper way to do it?

qemu.git has contrib/vhost-user-blk/ and contrib/vhost-user-scsi/ if
you need to test those vhost-user devices without external dependencies.

Stefan








Re: [PATCH] block/vvfat: Fix ro shared folder

2021-09-08 Thread Kevin Wolf
Am 31.08.2021 um 16:17 hat Guillaume Roche geschrieben:
> QEMU exits with an error when passing a vfat shared folder in read-only mode.
> 
> To fix this issue, this patch removes any potential write permission
> from cumulative_perms, when a read-only block device is in use.
> 
> Buglink: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=918950
> 
> Signed-off-by: Guillaume Roche 
> ---
> This is an attempt to fix this behavior, but it feels a bit hacky to me
> since this patch checks for the vvfat format in a generic function.
> 
> However, I'd be glad to have some advice to make it better. Anyway, I
> ran the block tests to ensure this does not introduce any regression.
> 
> To add some context: I know that this can be worked around by setting
> the shared folder in rw mode. But our use case requires using both
> shared folders and VM snapshots, and QEMU prevents using snapshots with a rw
> shared folder.

I don't think the behaviour is actually a bug: ide-hd requires a
writable backend, so attaching a read-only vvfat node is a real error.

You can either specify -drive read-only=on and use a device that can
accept read-only backends (such as virtio-blk or ide-cd), or add a
temporary writable qcow2 overlay with -snapshot or -drive snapshot=on so
that the ide-hd device actually does get the writable backend it needs.

Kevin




Re: [PATCH v1 2/4] virtio: increase virtuqueue size for virtio-scsi and virtio-blk

2021-09-08 Thread Denis Plotnikov



On 08.09.2021 16:22, Stefano Garzarella wrote:

The message bounced, so I'm using Denis's new email address.

On Wed, Sep 08, 2021 at 03:17:16PM +0200, Stefano Garzarella wrote:

Hi Denis,
I just found this discussion since we still have the following line 
in hw/core/machine.c:

   { "vhost-blk-device", "seg_max_adjust", "off"}

IIUC it was a typo, and I think we should fix it since in the future 
we can have `vhost-blk-device`.


So, I think we have 2 options:
1. remove that line since for now it is useless
2. replace with "vhost-scsi"

I'm not sure which is the best, what do you suggest?

Thanks,
Stefano


Hi Stefano

I prefer to just remove the line without replacing it. This will keep
things exactly as they are now.


Replacing it with "vhost-scsi" will affect seg_max (limiting it to 126) on newly
created VMs with machine types using "hw_compat_4_2" and older.


Now, because of the typo (error), their seg_max is queue-size dependent.
I'm not sure if replacing the line may cause any problems, for example
on migration: source (queue-size 256, seg_max 254) -> destination
(queue-size 256, seg_max 126). But it will definitely introduce two
different behaviors for VMs with "hw_compat_4_2" and older. So, I'd
choose the lesser of two evils and keep things as they are now.


Thanks!

Denis



On Fri, Feb 07, 2020 at 11:48:05AM +0300, Denis Plotnikov wrote:

On 05.02.2020 14:19, Stefan Hajnoczi wrote:

On Tue, Feb 04, 2020 at 12:59:04PM +0300, Denis Plotnikov wrote:

On 30.01.2020 17:58, Stefan Hajnoczi wrote:

On Wed, Jan 29, 2020 at 05:07:00PM +0300, Denis Plotnikov wrote:

The goal is to reduce the number of requests issued by a guest on
1M reads/writes. This raises the performance by up to 4% on that kind of
disk access pattern.

The maximum chunk size to be used for guest disk access is
limited by the seg_max parameter, which represents the max number of
pieces in the scatter-gather list in one guest disk request.

Since seg_max is virtqueue_size dependent, increasing the virtqueue
size increases seg_max, which, in turn, increases the maximum size
of data to be read/written from the guest disk.

More details in the original problem statement:
https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg03721.html

Suggested-by: Denis V. Lunev 
Signed-off-by: Denis Plotnikov 
---
 hw/core/machine.c  | 3 +++
 include/hw/virtio/virtio.h | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 3e288bfceb..8bc401d8b7 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -28,6 +28,9 @@
 #include "hw/mem/nvdimm.h"
 GlobalProperty hw_compat_4_2[] = {
+    { "virtio-blk-device", "queue-size", "128"},
+    { "virtio-scsi-device", "virtqueue_size", "128"},
+    { "vhost-blk-device", "virtqueue_size", "128"},
vhost-blk-device?!  Who has this?  It's not in qemu.git so please omit
this line. ;-)

So in this case the line:

{ "vhost-blk-device", "seg_max_adjust", "off"},

introduced by my patch:

commit 1bf8a989a566b2ba41c197004ec2a02562a766a4
Author: Denis Plotnikov 
Date:   Fri Dec 20 17:09:04 2019 +0300

    virtio: make seg_max virtqueue size dependent

is also wrong. It should be:

{ "vhost-scsi-device", "seg_max_adjust", "off"},

Am I right?

It's just called "vhost-scsi":

include/hw/virtio/vhost-scsi.h:#define TYPE_VHOST_SCSI "vhost-scsi"


On the other hand, do you want to do this for the vhost-user-blk,
vhost-user-scsi, and vhost-scsi devices that exist in qemu.git?  
Those

devices would benefit from better performance too.
After thinking about that for a while, I think we shouldn't extend
queue sizes for vhost-user-blk, vhost-user-scsi and vhost-scsi.
This is because increasing the queue sizes seems to be just useless
for them: the whole thing is about increasing the queue sizes to
increase seg_max (it limits the max block request size from the
guest). For virtio-blk-device and virtio-scsi-device it makes sense,
since they have the seg-max-adjust property which, if true, sets seg_max
to virtqueue_size-2. vhost-scsi also has this property but it seems
the property just doesn't affect anything (remove it?).
Also vhost-user-blk, vhost-user-scsi and vhost-scsi don't do any
seg_max settings. If I understand correctly, their backends are meant
to be responsible for doing that.
So, what about changing the queue sizes just for virtio-blk-device
and virtio-scsi-device?


Denis


It seems to be so. We also have the test checking those settings:
tests/acceptance/virtio_seg_max_adjust.py
For now it checks virtio-scsi-pci and virtio-blk-pci.
I'm going to extend it to also check the virtqueue size.
If I change vhost-user-blk, vhost-user-scsi and vhost-scsi it's worth
checking those devices too. But I don't know how to form a command line
for those 3 devices since they would need some third-party components as
backends (kernel modules, DPDK, etc.), and those seem to be unavailable in
the qemu git.
Is there any way to do it with some stubs available in qemu.git or
som

Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-08 Thread Peter Xu
On Wed, Sep 08, 2021 at 09:19:27AM +0100, Dr. David Alan Gilbert wrote:
> * Jason Wang (jasow...@redhat.com) wrote:
> > On Wed, Sep 8, 2021 at 2:32 AM Peter Xu  wrote:
> > >
> > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos 
> > > wrote:
> > > > > I don't think it is valid to unconditionally enable this feature due 
> > > > > to the
> > > > > resource usage implications
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > >
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > >if the socket option was not set, the socket exceeds its optmem
> > > > >limit or the user exceeds its ulimit on locked pages."
> > > >
> > > > You are correct, I unfortunately missed this part in the docs :(
> > > >
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > >
> > > > Do you mean the limit an user has on locking memory?
> > > >
> > > > If so, that makes sense. I remember I needed to set the upper limit of 
> > > > locked
> > > > memory for the user before using it, or adding a capability to qemu 
> > > > before.
> > >
> > > So I'm a bit confused on why MSG_ZEROCOPY requires checking 
> > > RLIMIT_MEMLOCK.
> > >
> > > The thing is IIUC that's accounting for pinned pages only with either 
> > > mlock()
> > > (FOLL_MLOCK) or vfio (FOLL_PIN).
> > >
> > > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> > 
> > It happens probably here:
> > 
> > E.g
> > 
> > __ip_append_data()
> > msg_zerocopy_realloc()
> > mm_account_pinned_pages()
> 
> When do they get uncounted?  i.e. is it just the data that's in flight
> that's marked as pinned?

I think so - there's __msg_zerocopy_callback() -> mm_unaccount_pinned_pages()
correspondingly.  Thanks,

-- 
Peter Xu




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-08 Thread Peter Xu
On Wed, Sep 08, 2021 at 09:30:58AM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (pet...@redhat.com) wrote:
> > On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote:
> > > > > What if we do the 'flush()' before we start post-copy, instead of 
> > > > > after each
> > > > > iteration? would that be enough?
> > > > 
> > > > Possibly, yes. This really need David G's input since he understands
> > > > the code in way more detail than me.
> > > 
> > > Hmm I'm not entirely sure why we have the sync after each iteration;
> > 
> > We don't have that yet, do we?
> 
> I think multifd does; I think it calls multifd_send_sync_main sometimes,
> which I *think* corresponds to iterations.

Oh.. Then we may need to:

  (1) Make that sync work for zero-copy too; say, semaphores may not be enough
  with it - we need to make sure all async buffers are consumed by checking
  the error queue of the sockets,

  (2) When we make zero-copy work without multi-fd, we may want some similar
  sync on the main stream too

Thanks,

> 
> Dave
> 
> > > the case I can think of is if we're doing async sending, we could have
> > > two versions of the same page in flight (one from each iteration) -
> > > you'd want those to get there in the right order.
> > 
> > From MSG_ZEROCOPY document:
> > 
> > For protocols that acknowledge data in-order, like TCP, each
> > notification can be squashed into the previous one, so that no more
> > than one notification is outstanding at any one point.
> > 
> > Ordered delivery is the common case, but not guaranteed. 
> > Notifications
> > may arrive out of order on retransmission and socket teardown.
> > 
> > So no matter whether we have a flush() per iteration before, it seems we may
> > want one when zero copy enabled?
> > 
> > Thanks,
> > 
> > -- 
> > Peter Xu
> > 
> -- 
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> 

-- 
Peter Xu




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-08 Thread Daniel P . Berrangé
On Tue, Sep 07, 2021 at 12:13:28PM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berra...@redhat.com) wrote:
> > On Wed, Sep 01, 2021 at 11:35:33AM -0400, Peter Xu wrote:
> > > On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
> > > > On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> > > > > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd 
> > > > > > > thread.
> > > > > > > 
> > > > > > > Change the send_write() interface of multifd, allowing it to pass 
> > > > > > > down
> > > > > > > flags for qio_channel_write*().
> > > > > > > 
> > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while 
> > > > > > > keeping the
> > > > > > > other data being sent at the default copying approach.
> > > > > > > 
> > > > > > > Signed-off-by: Leonardo Bras 
> > > > > > > ---
> > > > > > >  migration/multifd-zlib.c | 7 ---
> > > > > > >  migration/multifd-zstd.c | 7 ---
> > > > > > >  migration/multifd.c  | 9 ++---
> > > > > > >  migration/multifd.h  | 3 ++-
> > > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > > > 
> > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > > >  }
> > > > > > >  
> > > > > > >  if (used) {
> > > > > > > -ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > > +ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > > +  &local_err);
> > > > > > 
> > > > > > I don't think it is valid to unconditionally enable this feature 
> > > > > > due to the
> > > > > > resource usage implications
> > > > > > 
> > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > > 
> > > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This 
> > > > > > happens 
> > > > > >if the socket option was not set, the socket exceeds its optmem 
> > > > > >limit or the user exceeds its ulimit on locked pages."
> > > > > > 
> > > > > > The limit on locked pages is something that looks very likely to be
> > > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > > implies locked memory (eg PCI assignment)
> > > > > 
> > > > > Yes it would be great to be a migration capability in parallel to 
> > > > > multifd. At
> > > > > initial phase if it's easy to be implemented on multi-fd only, we can 
> > > > > add a
> > > > > dependency between the caps.  In the future we can remove that 
> > > > > dependency when
> > > > > the code is ready to go without multifd.  Thanks,
> > > > 
> > > > Also, I'm wondering how zerocopy support interacts with kernel support
> > > > for kTLS and multipath-TCP, both of which we want to be able to use
> > > > with migration.
> > > 
> > > Copying Jason Wang for net implications between these features on kernel 
> > > side
> > > and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).
> > > 
> > > From the safe side we may want to only enable one of them until we prove
> > > they'll work together I guess..
> > 
> > MPTCP is good when we're network limited for migration
> > 
> > KTLS will be good when we're CPU limited on AES for migration,
> > which is essentially always when TLS is used.
> > 
> > ZEROCOPY will be good when we're CPU limited for data copy
> > on migration, or to reduce the impact on other concurrent
> > VMs on the same CPUs.
> > 
> > Ultimately we woudld benefit from all of them at the same
> > time, if it were technically possible todo.
> 
> I think last time I spoke to Paolo Abeni there were some interactions
> between them; I can't remember what though (I think mptcp and ktls
> didn't play at the time).

MPTCP and KTLS use the same kernel hook in the network layer and
only 1 hook is permitted at a time, making them mutually exclusive
for now. In theory this can be fixed but no one is actively working
on it yet.


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v1 4/4] tests: rename virtio_seg_max_adjust to virtio_check_params

2021-09-08 Thread Philippe Mathieu-Daudé
On 1/29/20 3:07 PM, Denis Plotnikov wrote:
> Since virtio_seg_max_adjust checks not only seg_max, but also the
> virtqueue_size parameter, let's make the test more general and
> add new parameters to be checked there in the future.
> 
> Signed-off-by: Denis Plotnikov 
> ---
>  .../{virtio_seg_max_adjust.py => virtio_check_params.py}  | 0
>  1 file changed, 0 insertions(+), 0 deletions(-)
>  rename tests/acceptance/{virtio_seg_max_adjust.py => virtio_check_params.py} (100%)

Old one... reminds me of
https://lore.kernel.org/qemu-devel/20200129212345.20547-1-phi...@redhat.com/
:~(

Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH v4 05/12] job: @force parameter for job_cancel_sync()

2021-09-08 Thread Eric Blake
On Tue, Sep 07, 2021 at 02:42:38PM +0200, Hanna Reitz wrote:
> Callers should be able to specify whether they want job_cancel_sync() to
> force-cancel the job or not.
> 
> In fact, almost all invocations do not care about consistency of the
> result and just want the job to terminate as soon as possible, so they
> should pass force=true.  The replication block driver is the exception,
> specifically the active commit job it runs.
> 
> As for job_cancel_sync_all(), all callers want it to force-cancel all
> jobs, because that is the point of it: To cancel all remaining jobs as
> quickly as possible (generally on process termination).  So make it
> invoke job_cancel_sync() with force=true.
> 
> This changes some iotest outputs, because quitting qemu while a mirror
> job is active will now lead to it being cancelled instead of completed,
> which is what we want.  (Cancelling a READY mirror job with force=false
> may take an indefinite amount of time, which we do not want when
> quitting.  If users want consistent results, they must have all jobs be
> done before they quit qemu.)
> 
> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
> Signed-off-by: Hanna Reitz 
> ---
>  include/qemu/job.h| 10 ++---
>  block/replication.c   |  4 +-
>  blockdev.c|  4 +-
>  job.c | 18 ++--
>  tests/unit/test-blockjob.c|  2 +-
>  tests/qemu-iotests/109.out| 60 +++
>  tests/qemu-iotests/tests/qsd-jobs.out |  2 +-
>  7 files changed, 50 insertions(+), 50 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v4 05/12] job: @force parameter for job_cancel_sync()

2021-09-08 Thread Vladimir Sementsov-Ogievskiy

07.09.2021 15:42, Hanna Reitz wrote:

Callers should be able to specify whether they want job_cancel_sync() to
force-cancel the job or not.

In fact, almost all invocations do not care about consistency of the
result and just want the job to terminate as soon as possible, so they
should pass force=true.  The replication block driver is the exception,
specifically the active commit job it runs.

As for job_cancel_sync_all(), all callers want it to force-cancel all
jobs, because that is the point of it: To cancel all remaining jobs as
quickly as possible (generally on process termination).  So make it
invoke job_cancel_sync() with force=true.

This changes some iotest outputs, because quitting qemu while a mirror
job is active will now lead to it being cancelled instead of completed,
which is what we want.  (Cancelling a READY mirror job with force=false
may take an indefinite amount of time, which we do not want when
quitting.  If users want consistent results, they must have all jobs be
done before they quit qemu.)

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Hanna Reitz


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir



Re: [PATCH] vmdk: allow specification of tools version

2021-09-08 Thread Eric Blake
On Wed, Sep 08, 2021 at 07:42:50PM +0200, Thomas Weißschuh wrote:
> VMDK files support an attribute that represents the version of the guest
> tools that are installed on the disk.
> This attribute is used by vSphere before a machine has been started to
> determine if the VM has the guest tools installed.
> This is important when configuring "Operating system customizations" in
> vSphere, as it checks for the presence of the guest tools before
> allowing those customizations.
> Thus when the VM has not yet booted normally it would be impossible to
> customize it, therefore preventing a customized first-boot.
> 
> The attribute should not hurt on disks that do not have the guest tools
> installed and indeed the VMware tools also unconditionally add this
> attribute.
> (Defaulting to the value "2147483647", as is done in this patch)
> 
> Signed-off-by: Thomas Weißschuh 
> ---
>  block/vmdk.c | 24 
>  qapi/block-core.json |  2 ++
>  2 files changed, 22 insertions(+), 4 deletions(-)

UI review:

> +++ b/qapi/block-core.json
> @@ -4597,6 +4597,7 @@
>  # @adapter-type: The adapter type used to fill in the descriptor. Default: ide.
>  # @hwversion: Hardware version. The meaningful options are "4" or "6".
>  # Default: "4".
> +# @toolsversion: VMware guest tools version.

Missing a '(since 6.2)' blurb, and a description of its default value.

>  # @zeroed-grain: Whether to enable zeroed-grain feature for sparse subformats.
>  #Default: false.
>  #
> @@ -4610,6 +4611,7 @@
>  '*backing-file':'str',
>  '*adapter-type':'BlockdevVmdkAdapterType',
>  '*hwversion':   'str',
> +'*toolsversion':'str',

Is it an arbitrary string, or must a valid value always be parseable
as a numeric value?  If the latter, then make the QMP representation
numeric rather than string.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-08 Thread Leonardo Bras Soares Passos
On Tue, Sep 7, 2021 at 1:44 PM Peter Xu  wrote:
>
> On Thu, Sep 02, 2021 at 03:59:25AM -0300, Leonardo Bras Soares Passos wrote:
> > I also suggested something like that, but I thought it could be good if we 
> > could
> > fall back to io_writev() if we didn't have the zerocopy feature (or
> > the async feature).
> >
> > What do you think?
>
> That fallback looks safe and ok, I'm just not sure whether it'll be of great
> help.  E.g. if we provide an QIO api that allows both sync write and zero-copy
> write (then we do the fallback when necessary), it means the buffer 
> implication
> applies too to this api, so it's easier to me to just detect the zero copy
> capability and use one alternative.  Thanks,
>
> --
> Peter Xu
>

I was thinking that this way (an async method with a fallback) we would
allow code using the QIO API to simply try io_async_writev() whenever it
seems fit, let the QIO channel layer decide whether it can be used
(implementation provided, features available), and just fall back to
io_writev() when it can't be used.
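
(Purely for illustration, such a fallback wrapper could look like the sketch
below. qio_channel_async_writev() and QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY are
hypothetical names taken from this discussion, not existing QIOChannel API;
qio_channel_has_feature() and qio_channel_writev() do exist today.)

#include "qemu/osdep.h"
#include "io/channel.h"

/* Hypothetical API from this discussion -- does not exist yet: */
ssize_t qio_channel_async_writev(QIOChannel *ioc, const struct iovec *iov,
                                 size_t niov, Error **errp);

/* Try the async/zero-copy path when the channel advertises it
 * (QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY is likewise hypothetical),
 * otherwise fall back to the plain synchronous writev. */
static ssize_t qio_channel_writev_maybe_async(QIOChannel *ioc,
                                              const struct iovec *iov,
                                              size_t niov,
                                              Error **errp)
{
    if (qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY)) {
        /* Caller must not touch the buffers until a later flush/completion. */
        return qio_channel_async_writev(ioc, iov, niov, errp);
    }
    /* Synchronous path: data is copied, no completion handling needed. */
    return qio_channel_writev(ioc, iov, niov, errp);
}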

Best regards,
Leo




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-08 Thread Leonardo Bras Soares Passos
On Tue, Sep 7, 2021 at 8:06 AM Dr. David Alan Gilbert
 wrote:
> > Possibly, yes. This really need David G's input since he understands
> > the code in way more detail than me.
>
> Hmm I'm not entirely sure why we have the sync after each iteration;
> the case I can think of is if we're doing async sending, we could have
> two versions of the same page in flight (one from each iteration) -
> you'd want those to get there in the right order.
>
> Dave

Well, that's the thing: since we don't copy the buffer with MSG_ZEROCOPY,
we will in fact have the same page in flight twice, rather than two
versions, given that the buffer is sent as-is during transmission.

To me it looks like there will be no change to the current algorithm, but
I am still a beginner in migration code and I am probably missing
something.

Best regards,
Leo




[PATCH] vmdk: allow specification of tools version

2021-09-08 Thread Thomas Weißschuh
VMDK files support an attribute that represents the version of the guest
tools that are installed on the disk.
This attribute is used by vSphere before a machine has been started to
determine if the VM has the guest tools installed.
This is important when configuring "Operating system customizations" in
vSphere, as it checks for the presence of the guest tools before
allowing those customizations.
Thus when the VM has not yet booted normally it would be impossible to
customize it, therefore preventing a customized first-boot.

The attribute should be harmless on disks that do not have the guest tools
installed; indeed, the VMware tools themselves also unconditionally add this
attribute (defaulting to the value "2147483647", as is done in this patch).

Signed-off-by: Thomas Weißschuh 
---
 block/vmdk.c | 24 
 qapi/block-core.json |  2 ++
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/block/vmdk.c b/block/vmdk.c
index 4499f136bd..93ef6426b0 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -60,6 +60,7 @@
 #define VMDK_ZEROED  (-3)
 
 #define BLOCK_OPT_ZEROED_GRAIN "zeroed_grain"
+#define BLOCK_OPT_TOOLSVERSION "toolsversion"
 
 typedef struct {
 uint32_t version;
@@ -2344,6 +2345,7 @@ static int coroutine_fn vmdk_co_do_create(int64_t size,
   BlockdevVmdkAdapterType adapter_type,
   const char *backing_file,
   const char *hw_version,
+  const char *toolsversion,
   bool compat6,
   bool zeroed_grain,
   vmdk_create_extent_fn extent_fn,
@@ -2384,7 +2386,8 @@ static int coroutine_fn vmdk_co_do_create(int64_t size,
 "ddb.geometry.cylinders = \"%" PRId64 "\"\n"
 "ddb.geometry.heads = \"%" PRIu32 "\"\n"
 "ddb.geometry.sectors = \"63\"\n"
-"ddb.adapterType = \"%s\"\n";
+"ddb.adapterType = \"%s\"\n"
+"ddb.toolsVersion = \"%s\"\n";
 
 ext_desc_lines = g_string_new(NULL);
 
@@ -2401,6 +2404,9 @@ static int coroutine_fn vmdk_co_do_create(int64_t size,
 if (!hw_version) {
 hw_version = "4";
 }
+if (!toolsversion) {
+toolsversion = "2147483647";
+}
 
 if (adapter_type != BLOCKDEV_VMDK_ADAPTER_TYPE_IDE) {
 /* that's the number of heads with which vmware operates when
@@ -2525,7 +2531,8 @@ static int coroutine_fn vmdk_co_do_create(int64_t size,
size /
(int64_t)(63 * number_heads * BDRV_SECTOR_SIZE),
number_heads,
-   BlockdevVmdkAdapterType_str(adapter_type));
+   BlockdevVmdkAdapterType_str(adapter_type),
+   toolsversion);
 desc_len = strlen(desc);
 /* the descriptor offset = 0x200 */
 if (!split && !flat) {
@@ -2617,6 +2624,7 @@ static int coroutine_fn vmdk_co_create_opts(BlockDriver 
*drv,
 BlockdevVmdkAdapterType adapter_type_enum;
 char *backing_file = NULL;
 char *hw_version = NULL;
+char *toolsversion = NULL;
 char *fmt = NULL;
 BlockdevVmdkSubformat subformat;
 int ret = 0;
@@ -2649,6 +2657,7 @@ static int coroutine_fn vmdk_co_create_opts(BlockDriver 
*drv,
 adapter_type = qemu_opt_get_del(opts, BLOCK_OPT_ADAPTER_TYPE);
 backing_file = qemu_opt_get_del(opts, BLOCK_OPT_BACKING_FILE);
 hw_version = qemu_opt_get_del(opts, BLOCK_OPT_HWVERSION);
+toolsversion = qemu_opt_get_del(opts, BLOCK_OPT_TOOLSVERSION);
 compat6 = qemu_opt_get_bool_del(opts, BLOCK_OPT_COMPAT6, false);
 if (strcmp(hw_version, "undefined") == 0) {
 g_free(hw_version);
@@ -2692,14 +2701,15 @@ static int coroutine_fn vmdk_co_create_opts(BlockDriver 
*drv,
 .opts = opts,
 };
 ret = vmdk_co_do_create(total_size, subformat, adapter_type_enum,
-backing_file, hw_version, compat6, zeroed_grain,
-vmdk_co_create_opts_cb, &data, errp);
+backing_file, hw_version, toolsversion, compat6,
+zeroed_grain, vmdk_co_create_opts_cb, &data, errp);
 
 exit:
 g_free(backing_fmt);
 g_free(adapter_type);
 g_free(backing_file);
 g_free(hw_version);
+g_free(toolsversion);
 g_free(fmt);
 g_free(desc);
 g_free(path);
@@ -2782,6 +2792,7 @@ static int coroutine_fn 
vmdk_co_create(BlockdevCreateOptions *create_options,
 opts->adapter_type,
 opts->backing_file,
 opts->hwversion,
+opts->toolsversion,
 false,
 opts->zeroed_grain,
 vmdk_co_create_cb,
@@ -3031,6 +3042,11 @@ static QemuOptsList vmdk_c

Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-08 Thread Peter Xu
On Wed, Sep 08, 2021 at 05:13:40PM -0300, Leonardo Bras Soares Passos wrote:
> On Tue, Sep 7, 2021 at 1:44 PM Peter Xu  wrote:
> >
> > On Thu, Sep 02, 2021 at 03:59:25AM -0300, Leonardo Bras Soares Passos wrote:
> > > I also suggested something like that, but I thought it could be good if 
> > > we could
> > > fall back to io_writev() if we didn't have the zerocopy feature (or
> > > the async feature).
> > >
> > > What do you think?
> >
> > That fallback looks safe and ok, I'm just not sure whether it'll be of great
> > help.  E.g. if we provide an QIO api that allows both sync write and 
> > zero-copy
> > write (then we do the fallback when necessary), it means the buffer 
> > implication
> > applies too to this api, so it's easier to me to just detect the zero copy
> > capability and use one alternative.  Thanks,
> >
> > --
> > Peter Xu
> >
> 
> I was thinking this way (async method with fallback) we would allow code using
> the QIO api to just try to use the io_async_writev() whenever the code
> seems fit, and
> let the QIO channel layer to decide when it can be used
> (implementation provides,
> features available), and just fallback to io_writev() when it can't be used.

Sure, I think it depends on whether the whole sync/async interface will be
the same; e.g., if the async API needs a callback registered, then the
interface won't be the same and the caller will be aware of that anyway.
Otherwise it looks fine indeed.  Thanks,

-- 
Peter Xu




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-08 Thread Peter Xu
On Wed, Sep 08, 2021 at 05:25:50PM -0300, Leonardo Bras Soares Passos wrote:
> On Tue, Sep 7, 2021 at 8:06 AM Dr. David Alan Gilbert
>  wrote:
> > > Possibly, yes. This really need David G's input since he understands
> > > the code in way more detail than me.
> >
> > Hmm I'm not entirely sure why we have the sync after each iteration;
> > the case I can think of is if we're doing async sending, we could have
> > two versions of the same page in flight (one from each iteration) -
> > you'd want those to get there in the right order.
> >
> > Dave
> 
> Well, that's the thing: as we don't copy the buffer in MSG_ZEROCOPY,  we will 
> in
> fact have the same page in flight twice, instead of two versions,
> given the buffer is
> sent as it is during transmission.

That's an interesting point, and it even looks valid... :)

There can still be two versions depending on when the page is read and fed
to the NIC, since the page can keep changing during that window, but as
long as the later-sent page always lands on the destination node after the
earlier one, it looks OK.

-- 
Peter Xu




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-08 Thread Daniel P . Berrangé
On Wed, Sep 08, 2021 at 05:09:33PM -0400, Peter Xu wrote:
> On Wed, Sep 08, 2021 at 05:25:50PM -0300, Leonardo Bras Soares Passos wrote:
> > On Tue, Sep 7, 2021 at 8:06 AM Dr. David Alan Gilbert
> >  wrote:
> > > > Possibly, yes. This really need David G's input since he understands
> > > > the code in way more detail than me.
> > >
> > > Hmm I'm not entirely sure why we have the sync after each iteration;
> > > the case I can think of is if we're doing async sending, we could have
> > > two versions of the same page in flight (one from each iteration) -
> > > you'd want those to get there in the right order.
> > >
> > > Dave
> > 
> > Well, that's the thing: as we don't copy the buffer in MSG_ZEROCOPY,  we 
> > will in
> > fact have the same page in flight twice, instead of two versions,
> > given the buffer is
> > sent as it is during transmission.
> 
> That's an interesting point, which looks even valid... :)
> 
> There can still be two versions depending on when the page is read and feed to
> the NICs as the page can be changing during the window, but as long as the
> latter sent page always lands later than the former page on dest node then it
> looks ok.

The really strange scenario is around TCP re-transmits. The kernel may
transmit page version 1, then we have version 2 written. The kernel now
needs to retransmit a packet due to network loss. The re-transmission will
contain the version 2 payload, which differs from the originally lost
packet's payload.

IOW, the supposed "reliable" TCP stream is no longer reliable and actually
changes its data retrospectively because we've intentionally violated the
rules the kernel documented for use of MSG_ZEROCOPY.
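
(For reference, a minimal sketch of the contract being violated, per the
kernel's MSG_ZEROCOPY documentation (Linux 4.14+); 'fd', 'buf' and 'len'
are placeholders. The buffer must stay untouched between the send and the
completion read from the error queue -- which is exactly what a live,
still-changing guest page cannot promise.)

#include <linux/errqueue.h>
#include <string.h>
#include <sys/socket.h>

static void zerocopy_send_then_reuse(int fd, const void *buf, size_t len)
{
    int one = 1;
    struct msghdr msg;
    char control[256];

    setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
    send(fd, buf, len, MSG_ZEROCOPY);   /* pages referenced, not copied */

    /* The contract: do not modify 'buf' here. */

    memset(&msg, 0, sizeof(msg));
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);
    while (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
        /* EAGAIN: completion not queued yet; real code would poll(). */
    }
    /* The cmsg holds a sock_extended_err with ee_origin ==
     * SO_EE_ORIGIN_ZEROCOPY; only now may 'buf' be reused or changed. */
}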

We think we're probably OK with migration, as we are going to rely on the
fact that we eventually pause the guest to stop page changes during the
final switchover. Nonetheless I really strongly dislike the idea of
not honouring the kernel API contract, despite the potential performance
benefits it brings.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




[RFC PATCH 00/10] security: Introduce qemu_security_policy_taint() API

2021-09-08 Thread Philippe Mathieu-Daudé
Hi,

This series is experimental! The goal is to better delimit the
boundary between code that is considered security critical and
code that is less critical (but still important!).

This approach was briefly discussed a few months ago with Markus
and then Daniel. Instead of classifying the code on a file-path
basis (see [1]), we insert (runtime) hints into the code, which
survive code movement. Offending unsafe code can taint the global
security policy. By default this policy is 'none', the current
behavior. It can be changed on the command line to 'warn' to
display warnings, or to 'strict' to prohibit QEMU from running
with a tainted policy.

As examples, I started implementing unsafe-code tainting in 3
different pieces of code:
- accelerators (KVM and Xen in the allow-list)
- block drivers (vvfat and, partially, null-co in the deny-list)
- qdev (hobbyist devices regularly hit by the fuzzer)

I do not want to discourage security researchers from fuzzing QEMU's
unsafe areas, but I'd like to make it clearer what the community's
priority is (currently 47 open issues on [3]).

Regards,

Phil.

[1] 
https://lore.kernel.org/qemu-devel/20200714083631.888605-2-ppan...@redhat.com/
[2] https://www.qemu.org/contribute/security-process/
[3] https://gitlab.com/qemu-project/qemu/-/issues?label_name[]=Fuzzer

Philippe Mathieu-Daudé (10):
  sysemu: Introduce qemu_security_policy_taint() API
  accel: Use qemu_security_policy_taint(), mark KVM and Xen as safe
  block: Use qemu_security_policy_taint() API
  block/vvfat: Mark the driver as unsafe
  block/null: Mark 'read-zeroes=off' option as unsafe
  qdev: Use qemu_security_policy_taint() API
  hw/display: Mark ATI and Artist devices as unsafe
  hw/misc: Mark testdev devices as unsafe
  hw/net: Mark Tulip device as unsafe
  hw/sd: Mark sdhci-pci device as unsafe

 qapi/run-state.json| 16 +
 include/block/block_int.h  |  6 +++-
 include/hw/qdev-core.h |  6 
 include/qemu-common.h  | 19 +++
 include/qemu/accel.h   |  5 +++
 accel/kvm/kvm-all.c|  1 +
 accel/xen/xen-all.c|  1 +
 block.c|  6 
 block/null.c   |  8 +
 block/vvfat.c  |  6 
 hw/core/qdev.c | 11 ++
 hw/display/artist.c|  1 +
 hw/display/ati.c   |  1 +
 hw/hyperv/hyperv_testdev.c |  1 +
 hw/misc/pc-testdev.c   |  1 +
 hw/misc/pci-testdev.c  |  1 +
 hw/net/tulip.c |  1 +
 hw/sd/sdhci-pci.c  |  1 +
 softmmu/vl.c   | 70 ++
 qemu-options.hx| 17 +
 20 files changed, 178 insertions(+), 1 deletion(-)

-- 
2.31.1





[RFC PATCH 01/10] sysemu: Introduce qemu_security_policy_taint() API

2021-09-08 Thread Philippe Mathieu-Daudé
Introduce qemu_security_policy_taint(), which allows unsafe (read:
"not very maintained") code to 'taint' the QEMU security policy.

The "security policy" is the @SecurityPolicy QAPI enum, composed of:
- "none"   (no policy, current behavior)
- "warn"   (display a warning when the policy is tainted, keep going)
- "strict" (once tainted, exit QEMU before starting the VM)

The qemu_security_policy_is_strict() helper is also provided, which
will prove useful once a VM is started (for example, we do not want
to kill a running VM if an unsafe device is hot-added).
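
(For illustration only, the intended call pattern, mirroring how the later
patches in this series use the API; "frob" is a placeholder device name:)

#include "qemu/osdep.h"
#include "qemu-common.h"
#include "qapi/error.h"

/* At realize/open time an unsafe subsystem taints the policy: under
 * 'strict' this exits QEMU, under 'warn' it reports a warning, under
 * 'none' it is a no-op. */
static void frob_realize(void)
{
    qemu_security_policy_taint(true, "device type %s", "frob");
}

/* Once the VM is running we cannot simply exit, so code paths such as
 * hotplug check the policy explicitly instead. */
static void frob_hotplug_check(Error **errp)
{
    if (qemu_security_policy_is_strict()) {
        error_setg(errp, "device 'frob' not allowed under 'strict' policy");
    }
}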

Signed-off-by: Philippe Mathieu-Daudé 
---
 qapi/run-state.json   | 16 +++
 include/qemu-common.h | 19 
 softmmu/vl.c  | 67 +++
 qemu-options.hx   | 17 +++
 4 files changed, 119 insertions(+)

diff --git a/qapi/run-state.json b/qapi/run-state.json
index 43d66d700fc..b15a107fa01 100644
--- a/qapi/run-state.json
+++ b/qapi/run-state.json
@@ -638,3 +638,19 @@
 { 'struct': 'MemoryFailureFlags',
   'data': { 'action-required': 'bool',
 'recursive': 'bool'} }
+
+##
+# @SecurityPolicy:
+#
+# An enumeration of the actions taken when the security policy is tainted.
+#
+# @none: do nothing.
+#
+# @warn: display a warning.
+#
+# @strict: prohibit QEMU to start a VM.
+#
+# Since: 6.2
+##
+{ 'enum': 'SecurityPolicy',
+  'data': [ 'none', 'warn', 'strict' ] }
diff --git a/include/qemu-common.h b/include/qemu-common.h
index 73bcf763ed8..bf0b054bb66 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -139,4 +139,23 @@ void page_size_init(void);
  * returned. */
 bool dump_in_progress(void);
 
+/**
+ * qemu_security_policy_taint:
+ * @tainting: whether the security policy is being tainted (compromised).
+ * @fmt: taint reason format string
+ * ...: list of arguments to interpolate into @fmt, like printf().
+ *
+ * Allow unsafe code path to taint the global security policy.
+ * See #SecurityPolicy.
+ */
+void qemu_security_policy_taint(bool tainting, const char *fmt, ...)
+GCC_FMT_ATTR(2, 3);
+
+/**
+ * qemu_security_policy_is_strict:
+ *
+ * Return %true if the global security policy is 'strict', %false otherwise.
+ */
+bool qemu_security_policy_is_strict(void);
+
 #endif
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 55ab70eb97f..92c05ac97ee 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -489,6 +489,20 @@ static QemuOptsList qemu_action_opts = {
 },
 };
 
+static QemuOptsList qemu_security_policy_opts = {
+.name = "security-policy",
+.implied_opt_name = "policy",
+.merge_lists = true,
+.head = QTAILQ_HEAD_INITIALIZER(qemu_security_policy_opts.head),
+.desc = {
+{
+.name = "policy",
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};
+
 const char *qemu_get_vm_name(void)
 {
 return qemu_name;
@@ -600,6 +614,52 @@ static int cleanup_add_fd(void *opaque, QemuOpts *opts, 
Error **errp)
 }
 #endif
 
+static SecurityPolicy security_policy = SECURITY_POLICY_NONE;
+
+bool qemu_security_policy_is_strict(void)
+{
+return security_policy == SECURITY_POLICY_STRICT;
+}
+
+static int select_security_policy(const char *p)
+{
+int policy;
+char *qapi_value;
+
+qapi_value = g_ascii_strdown(p, -1);
+policy = qapi_enum_parse(&SecurityPolicy_lookup, qapi_value, -1, NULL);
+g_free(qapi_value);
+if (policy < 0) {
+return -1;
+}
+security_policy = policy;
+
+return 0;
+}
+
+void qemu_security_policy_taint(bool tainting, const char *fmt, ...)
+{
+va_list ap;
+g_autofree char *efmt = NULL;
+
+if (security_policy == SECURITY_POLICY_NONE || !tainting) {
+return;
+}
+
+va_start(ap, fmt);
+if (security_policy == SECURITY_POLICY_STRICT) {
+efmt = g_strdup_printf("%s taints QEMU security policy, exiting.", 
fmt);
+error_vreport(efmt, ap);
+exit(EXIT_FAILURE);
+} else if (security_policy == SECURITY_POLICY_WARN) {
+efmt = g_strdup_printf("%s taints QEMU security policy.", fmt);
+warn_vreport(efmt, ap);
+} else {
+g_assert_not_reached();
+}
+va_end(ap);
+}
+
 /***/
 /* QEMU Block devices */
 
@@ -2764,6 +2824,7 @@ void qemu_init(int argc, char **argv, char **envp)
 qemu_add_opts(&qemu_semihosting_config_opts);
 qemu_add_opts(&qemu_fw_cfg_opts);
 qemu_add_opts(&qemu_action_opts);
+qemu_add_opts(&qemu_security_policy_opts);
 module_call_init(MODULE_INIT_OPTS);
 
 error_init(argv[0]);
@@ -3230,6 +3291,12 @@ void qemu_init(int argc, char **argv, char **envp)
 exit(1);
 }
 break;
+case QEMU_OPTION_security_policy:
+if (select_security_policy(optarg) == -1) {
+error_report("unknown -security-policy parameter");
+exit(1);
+}
+break;
 case

[RFC PATCH 04/10] block/vvfat: Mark the driver as unsafe

2021-09-08 Thread Philippe Mathieu-Daudé
While listed as 'supported' in MAINTAINERS, this driver does not have
many reviewers and contains various /* TODO */ markers that have been
unattended for years. It is not safe enough for a production
environment, so have it taint the global security policy.

Signed-off-by: Philippe Mathieu-Daudé 
---
 block/vvfat.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/block/vvfat.c b/block/vvfat.c
index 34bf1e3a86e..993e40727d6 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -3199,6 +3199,11 @@ static void vvfat_close(BlockDriverState *bs)
 }
 }
 
+static bool vvfat_taints_security_policy(BlockDriverState *bs)
+{
+return true;
+}
+
 static const char *const vvfat_strong_runtime_opts[] = {
 "dir",
 "fat-type",
@@ -3219,6 +3224,7 @@ static BlockDriver bdrv_vvfat = {
 .bdrv_refresh_limits= vvfat_refresh_limits,
 .bdrv_close = vvfat_close,
 .bdrv_child_perm= vvfat_child_perm,
+.bdrv_taints_security_policy = vvfat_taints_security_policy,
 
 .bdrv_co_preadv = vvfat_co_preadv,
 .bdrv_co_pwritev= vvfat_co_pwritev,
-- 
2.31.1




[RFC PATCH 05/10] block/null: Mark 'read-zeroes=off' option as unsafe

2021-09-08 Thread Philippe Mathieu-Daudé
See commit b317006a3f1 ("docs/secure-coding-practices: Describe how
to use 'null-co' block driver") for rationale.

Signed-off-by: Philippe Mathieu-Daudé 
---
 block/null.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/block/null.c b/block/null.c
index cc9b1d4ea72..11e428f3cc2 100644
--- a/block/null.c
+++ b/block/null.c
@@ -99,6 +99,13 @@ static int null_file_open(BlockDriverState *bs, QDict 
*options, int flags,
 return ret;
 }
 
+static bool null_taints_security_policy(BlockDriverState *bs)
+{
+BDRVNullState *s = bs->opaque;
+
+return !s->read_zeroes;
+}
+
 static int64_t null_getlength(BlockDriverState *bs)
 {
 BDRVNullState *s = bs->opaque;
@@ -283,6 +290,7 @@ static BlockDriver bdrv_null_co = {
 .bdrv_parse_filename= null_co_parse_filename,
 .bdrv_getlength = null_getlength,
 .bdrv_get_allocated_file_size = null_allocated_file_size,
+.bdrv_taints_security_policy = null_taints_security_policy,
 
 .bdrv_co_preadv = null_co_preadv,
 .bdrv_co_pwritev= null_co_pwritev,
-- 
2.31.1




[RFC PATCH 06/10] qdev: Use qemu_security_policy_taint() API

2021-09-08 Thread Philippe Mathieu-Daudé
Add a DeviceClass::taints_security_policy field to allow an
unsafe device to potentially taint the global security policy
when it is realized.

Signed-off-by: Philippe Mathieu-Daudé 
---
 include/hw/qdev-core.h |  6 ++
 hw/core/qdev.c | 11 +++
 2 files changed, 17 insertions(+)

diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index bafc311bfa1..ff9ce6671be 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -122,6 +122,12 @@ struct DeviceClass {
  */
 bool user_creatable;
 bool hotpluggable;
+/*
+ * %false if the device is within the QEMU security policy boundary,
+ * %true if there is no guarantee this device can be used safely.
+ * See: https://www.qemu.org/contribute/security-process/
+ */
+bool taints_security_policy;
 
 /* callbacks */
 /*
diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index cefc5eaa0a9..a5a00f3564c 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -31,6 +31,7 @@
 #include "qapi/qmp/qerror.h"
 #include "qapi/visitor.h"
 #include "qemu/error-report.h"
+#include "qemu-common.h"
 #include "qemu/option.h"
 #include "hw/hotplug.h"
 #include "hw/irq.h"
@@ -257,6 +258,13 @@ bool qdev_hotplug_allowed(DeviceState *dev, Error **errp)
 MachineClass *mc;
 Object *m_obj = qdev_get_machine();
 
+if (qemu_security_policy_is_strict()
+&& DEVICE_GET_CLASS(dev)->taints_security_policy) {
+error_setg(errp, "Device '%s' can not be hotplugged when"
+ " 'strict' security policy is in place",
+   object_get_typename(OBJECT(dev)));
+}
+
 if (object_dynamic_cast(m_obj, TYPE_MACHINE)) {
 machine = MACHINE(m_obj);
 mc = MACHINE_GET_CLASS(machine);
@@ -385,6 +393,9 @@ bool qdev_realize(DeviceState *dev, BusState *bus, Error 
**errp)
 } else {
 assert(!DEVICE_GET_CLASS(dev)->bus_type);
 }
+qemu_security_policy_taint(DEVICE_GET_CLASS(dev)->taints_security_policy,
+   "device type %s",
+   object_get_typename(OBJECT(dev)));
 
 return object_property_set_bool(OBJECT(dev), "realized", true, errp);
 }
-- 
2.31.1




[RFC PATCH 02/10] accel: Use qemu_security_policy_taint(), mark KVM and Xen as safe

2021-09-08 Thread Philippe Mathieu-Daudé
Add the AccelClass::secure_policy_supported field to classify
safe (within security boundary) vs unsafe accelerators.

Signed-off-by: Philippe Mathieu-Daudé 
---
 include/qemu/accel.h | 5 +
 accel/kvm/kvm-all.c  | 1 +
 accel/xen/xen-all.c  | 1 +
 softmmu/vl.c | 3 +++
 4 files changed, 10 insertions(+)

diff --git a/include/qemu/accel.h b/include/qemu/accel.h
index 4f4c283f6fc..895e30be0de 100644
--- a/include/qemu/accel.h
+++ b/include/qemu/accel.h
@@ -44,6 +44,11 @@ typedef struct AccelClass {
hwaddr start_addr, hwaddr size);
 #endif
 bool *allowed;
+/*
+ * Whether the accelerator is within the QEMU security policy boundary.
+ * See: https://www.qemu.org/contribute/security-process/
+ */
+bool secure_policy_supported;
 /*
  * Array of global properties that would be applied when specific
  * accelerator is chosen. It works like MachineClass.compat_props
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 0125c17edb8..eb6b9e44df2 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3623,6 +3623,7 @@ static void kvm_accel_class_init(ObjectClass *oc, void 
*data)
 ac->init_machine = kvm_init;
 ac->has_memory = kvm_accel_has_memory;
 ac->allowed = &kvm_allowed;
+ac->secure_policy_supported = true;
 
 object_class_property_add(oc, "kernel-irqchip", "on|off|split",
 NULL, kvm_set_kernel_irqchip,
diff --git a/accel/xen/xen-all.c b/accel/xen/xen-all.c
index 69aa7d018b2..57867af5faf 100644
--- a/accel/xen/xen-all.c
+++ b/accel/xen/xen-all.c
@@ -198,6 +198,7 @@ static void xen_accel_class_init(ObjectClass *oc, void 
*data)
 ac->setup_post = xen_setup_post;
 ac->allowed = &xen_allowed;
 ac->compat_props = g_ptr_array_new();
+ac->secure_policy_supported = true;
 
 compat_props_add(ac->compat_props, compat, G_N_ELEMENTS(compat));
 
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 92c05ac97ee..e4f94e159c3 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -2388,6 +2388,9 @@ static int do_configure_accelerator(void *opaque, 
QemuOpts *opts, Error **errp)
 return 0;
 }
 
+qemu_security_policy_taint(!ac->secure_policy_supported,
+   "%s accelerator", acc);
+
 return 1;
 }
 
-- 
2.31.1




[RFC PATCH 03/10] block: Use qemu_security_policy_taint() API

2021-09-08 Thread Philippe Mathieu-Daudé
Add the BlockDriver::bdrv_taints_security_policy() handler.
Drivers implementing it might taint the global QEMU security
policy.

Signed-off-by: Philippe Mathieu-Daudé 
---
 include/block/block_int.h | 6 +-
 block.c   | 6 ++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index f1a54db0f8c..0ec0a5c06e9 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -169,7 +169,11 @@ struct BlockDriver {
 int (*bdrv_file_open)(BlockDriverState *bs, QDict *options, int flags,
   Error **errp);
 void (*bdrv_close)(BlockDriverState *bs);
-
+/*
+ * Return %true if the driver taints the QEMU security policy (i.e. is
+ * outside the security boundary), %false otherwise.
+ * See: https://www.qemu.org/contribute/security-process/
+ */
+bool (*bdrv_taints_security_policy)(BlockDriverState *bs);
 
 int coroutine_fn (*bdrv_co_create)(BlockdevCreateOptions *opts,
Error **errp);
diff --git a/block.c b/block.c
index b2b66263f9a..696ba486001 100644
--- a/block.c
+++ b/block.c
@@ -49,6 +49,7 @@
 #include "qemu/timer.h"
 #include "qemu/cutils.h"
 #include "qemu/id.h"
+#include "qemu-common.h"
 #include "block/coroutines.h"
 
 #ifdef CONFIG_BSD
@@ -1587,6 +1588,11 @@ static int bdrv_open_driver(BlockDriverState *bs, 
BlockDriver *drv,
 }
 }
 
+if (drv->bdrv_taints_security_policy) {
+qemu_security_policy_taint(drv->bdrv_taints_security_policy(bs),
+   "Block protocol '%s'", drv->format_name);
+}
+
 return 0;
 open_failed:
 bs->drv = NULL;
-- 
2.31.1




[RFC PATCH 08/10] hw/misc: Mark testdev devices as unsafe

2021-09-08 Thread Philippe Mathieu-Daudé
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/hyperv/hyperv_testdev.c | 1 +
 hw/misc/pc-testdev.c   | 1 +
 hw/misc/pci-testdev.c  | 1 +
 3 files changed, 3 insertions(+)

diff --git a/hw/hyperv/hyperv_testdev.c b/hw/hyperv/hyperv_testdev.c
index 9a56ddf83fe..6a75c350389 100644
--- a/hw/hyperv/hyperv_testdev.c
+++ b/hw/hyperv/hyperv_testdev.c
@@ -310,6 +310,7 @@ static void hv_test_dev_class_init(ObjectClass *klass, void 
*data)
 
 set_bit(DEVICE_CATEGORY_MISC, dc->categories);
 dc->realize = hv_test_dev_realizefn;
+dc->taints_security_policy = true;
 }
 
 static const TypeInfo hv_test_dev_info = {
diff --git a/hw/misc/pc-testdev.c b/hw/misc/pc-testdev.c
index e3896518694..6294b80ec1b 100644
--- a/hw/misc/pc-testdev.c
+++ b/hw/misc/pc-testdev.c
@@ -199,6 +199,7 @@ static void testdev_class_init(ObjectClass *klass, void 
*data)
 
 set_bit(DEVICE_CATEGORY_MISC, dc->categories);
 dc->realize = testdev_realizefn;
+dc->taints_security_policy = true;
 }
 
 static const TypeInfo testdev_info = {
diff --git a/hw/misc/pci-testdev.c b/hw/misc/pci-testdev.c
index 03845c8de34..189eb9bf1bb 100644
--- a/hw/misc/pci-testdev.c
+++ b/hw/misc/pci-testdev.c
@@ -340,6 +340,7 @@ static void pci_testdev_class_init(ObjectClass *klass, void 
*data)
 set_bit(DEVICE_CATEGORY_MISC, dc->categories);
 dc->reset = qdev_pci_testdev_reset;
 device_class_set_props(dc, pci_testdev_properties);
+dc->taints_security_policy = true;
 }
 
 static const TypeInfo pci_testdev_info = {
-- 
2.31.1




[RFC PATCH 09/10] hw/net: Mark Tulip device as unsafe

2021-09-08 Thread Philippe Mathieu-Daudé
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/net/tulip.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/net/tulip.c b/hw/net/tulip.c
index ca69f7ea5e1..eaad3266212 100644
--- a/hw/net/tulip.c
+++ b/hw/net/tulip.c
@@ -1025,6 +1025,7 @@ static void tulip_class_init(ObjectClass *klass, void 
*data)
 device_class_set_props(dc, tulip_properties);
 dc->reset = tulip_qdev_reset;
 set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
+dc->taints_security_policy = true;
 }
 
 static const TypeInfo tulip_info = {
-- 
2.31.1




[RFC PATCH 07/10] hw/display: Mark ATI and Artist devices as unsafe

2021-09-08 Thread Philippe Mathieu-Daudé
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/display/artist.c | 1 +
 hw/display/ati.c| 1 +
 2 files changed, 2 insertions(+)

diff --git a/hw/display/artist.c b/hw/display/artist.c
index 21b7fd1b440..067a4b2cb59 100644
--- a/hw/display/artist.c
+++ b/hw/display/artist.c
@@ -1482,6 +1482,7 @@ static void artist_class_init(ObjectClass *klass, void 
*data)
 dc->vmsd = &vmstate_artist;
 dc->reset = artist_reset;
 device_class_set_props(dc, artist_properties);
+dc->taints_security_policy = true;
 }
 
 static const TypeInfo artist_info = {
diff --git a/hw/display/ati.c b/hw/display/ati.c
index 31f22754dce..2f27ab69a87 100644
--- a/hw/display/ati.c
+++ b/hw/display/ati.c
@@ -1024,6 +1024,7 @@ static void ati_vga_class_init(ObjectClass *klass, void 
*data)
 device_class_set_props(dc, ati_vga_properties);
 dc->hotpluggable = false;
 set_bit(DEVICE_CATEGORY_DISPLAY, dc->categories);
+dc->taints_security_policy = true;
 
 k->class_id = PCI_CLASS_DISPLAY_VGA;
 k->vendor_id = PCI_VENDOR_ID_ATI;
-- 
2.31.1




[RFC PATCH 10/10] hw/sd: Mark sdhci-pci device as unsafe

2021-09-08 Thread Philippe Mathieu-Daudé
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/sd/sdhci-pci.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/sd/sdhci-pci.c b/hw/sd/sdhci-pci.c
index c737c8b930e..7a36f88fd87 100644
--- a/hw/sd/sdhci-pci.c
+++ b/hw/sd/sdhci-pci.c
@@ -64,6 +64,7 @@ static void sdhci_pci_class_init(ObjectClass *klass, void 
*data)
 k->device_id = PCI_DEVICE_ID_REDHAT_SDHCI;
 k->class_id = PCI_CLASS_SYSTEM_SDHCI;
 device_class_set_props(dc, sdhci_pci_properties);
+dc->taints_security_policy = true;
 
 sdhci_common_class_init(klass, data);
 }
-- 
2.31.1




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-08 Thread Jason Wang
On Wed, Sep 8, 2021 at 11:19 PM Peter Xu  wrote:
>
> On Wed, Sep 08, 2021 at 09:19:27AM +0100, Dr. David Alan Gilbert wrote:
> > * Jason Wang (jasow...@redhat.com) wrote:
> > > On Wed, Sep 8, 2021 at 2:32 AM Peter Xu  wrote:
> > > >
> > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos 
> > > > wrote:
> > > > > > I don't think it is valid to unconditionally enable this feature 
> > > > > > due to the
> > > > > > resource usage implications
> > > > > >
> > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > >
> > > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This 
> > > > > > happens
> > > > > >if the socket option was not set, the socket exceeds its optmem
> > > > > >limit or the user exceeds its ulimit on locked pages."
> > > > >
> > > > > You are correct, I unfortunately missed this part in the docs :(
> > > > >
> > > > > > The limit on locked pages is something that looks very likely to be
> > > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > > implies locked memory (eg PCI assignment)
> > > > >
> > > > > Do you mean the limit an user has on locking memory?
> > > > >
> > > > > If so, that makes sense. I remember I needed to set the upper limit 
> > > > > of locked
> > > > > memory for the user before using it, or adding a capability to qemu 
> > > > > before.
> > > >
> > > > So I'm a bit confused on why MSG_ZEROCOPY requires checking 
> > > > RLIMIT_MEMLOCK.
> > > >
> > > > The thing is IIUC that's accounting for pinned pages only with either 
> > > > mlock()
> > > > (FOLL_MLOCK) or vfio (FOLL_PIN).
> > > >
> > > > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking 
> > > > at
> > > > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> > >
> > > It happens probably here:
> > >
> > > E.g
> > >
> > > __ip_append_data()
> > > msg_zerocopy_realloc()
> > > mm_account_pinned_pages()
> >
> > When do they get uncounted?  i.e. is it just the data that's in flight
> > that's marked as pinned?
>
> I think so - there's __msg_zerocopy_callback() -> mm_unaccount_pinned_pages()
> correspondingly.  Thanks,

Yes, and actually the memory that can be pinned is limited by the
sndbuf of the TCP socket, so we are fine with the rlimit (e.g. we
don't need to pin all guest pages).
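
(To illustrate the point with a rough sketch: the socket send buffer is
what bounds how much data -- and hence how many pinned guest pages -- can
be in flight for MSG_ZEROCOPY on a given socket; 'sockfd' is a placeholder.)

#include <stdio.h>
#include <sys/socket.h>

static void show_zerocopy_pin_bound(int sockfd)
{
    int sndbuf = 0;
    socklen_t optlen = sizeof(sndbuf);

    /* Data queued for transmission (and therefore pinned by
     * MSG_ZEROCOPY) cannot exceed the send buffer size. */
    if (getsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &optlen) == 0) {
        printf("at most ~%d bytes pinned in flight on this socket\n", sndbuf);
    }
}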

Thanks

>
> --
> Peter Xu
>




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-08 Thread Peter Xu
On Wed, Sep 08, 2021 at 10:57:06PM +0100, Daniel P. Berrangé wrote:
> We think we're probably ok with migration as we are going to rely on the
> face that we eventually pause the guest to stop page changes during the
> final switchover. None the less I really strongly dislike the idea of
> not honouring the kernel API contract, despite the potential performance
> benefits it brings.

Yes, understandable, and it does look tricky. But these are guest pages,
and to me that's just by nature how it works (sending a page happens in
parallel with the page being modified).

I think the MSG_ZEROCOPY doc page mentions that, and it's the userspace
program's own choice if that happens. So even if it's not by design and
not suggested, it doesn't seem forbidden either.

We can write-protect the pages (using something like userfaultfd-wp)
during sending and unprotect them when it's done via a QIO callback;
that would guarantee the buffer does not change during sending.
However, we'd gain nothing besides "API cleanliness" then..

Thanks,

-- 
Peter Xu




RE: [PATCH] vmdk: allow specification of tools version

2021-09-08 Thread Weissschuh, Thomas [ext]
> -Original Message-
> From: Eric Blake 
> Sent: Wednesday, September 8, 2021 7:56 PM
> To: Weissschuh, Thomas [ext] 
> Cc: Fam Zheng ; Kevin Wolf ; Hanna Reitz
> ; Markus Armbruster ; qemu-
> bl...@nongnu.org; tho...@t-8ch.de; qemu-de...@nongnu.org
> Subject: Re: [PATCH] vmdk: allow specification of tools version
> 
> On Wed, Sep 08, 2021 at 07:42:50PM +0200, Thomas Weißschuh wrote:
> > VMDK files support an attribute that represents the version of the
> > guest tools that are installed on the disk.
> > This attribute is used by vSphere before a machine has been started to
> > determine if the VM has the guest tools installed.
> > This is important when configuring "Operating system customizations"
> > in vSphere, as it checks for the presence of the guest tools before
> > allowing those customizations.
> > Thus when the VM has not yet booted normally it would be impossible to
> > customize it, therefore preventing a customized first-boot.
> >
> > The attribute should not hurt on disks that do not have the guest
> > tools installed and indeed the VMware tools also unconditionally add
> > this attribute.
> > (Defaulting to the value "2147483647", as is done in this patch)
> >
> > Signed-off-by: Thomas Weißschuh 
> > ---
> >  block/vmdk.c | 24 
> >  qapi/block-core.json |  2 ++
> >  2 files changed, 22 insertions(+), 4 deletions(-)
> 
> UI review:
> 
> > +++ b/qapi/block-core.json
> > @@ -4597,6 +4597,7 @@
> >  # @adapter-type: The adapter type used to fill in the descriptor.
> Default: ide.
> >  # @hwversion: Hardware version. The meaningful options are "4" or "6".
> >  # Default: "4".
> > +# @toolsversion: VMware guest tools version.
> 
> Missing a '(since 6.2)' blurb, and a description of its default value.

I'll add that to v2, which I'll send as soon as some more review comments
come in.

> >  # @zeroed-grain: Whether to enable zeroed-grain feature for sparse
> subformats.
> >  #Default: false.
> >  #
> > @@ -4610,6 +4611,7 @@
> >  '*backing-file':'str',
> >  '*adapter-type':'BlockdevVmdkAdapterType',
> >  '*hwversion':   'str',
> > +'*toolsversion':'str',
> 
> Is it an arbitrary string, or must a valid value always be parseable as a
> numeric value?  If the latter, then make the QMP representation numeric
> rather than string.

In the VMDK itself it is an arbitrary string, but the actual values seem
to always be numeric. I do have a very faint recollection of seeing
regular version numbers (a.b.c) there at some point, but I can't confirm
this at the moment.

Thomas


Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-08 Thread Leonardo Bras Soares Passos
On Wed, Sep 8, 2021 at 11:05 PM Peter Xu  wrote:
>
> On Wed, Sep 08, 2021 at 10:57:06PM +0100, Daniel P. Berrangé wrote:
> > We think we're probably ok with migration as we are going to rely on the
> > face that we eventually pause the guest to stop page changes during the
> > final switchover. None the less I really strongly dislike the idea of
> > not honouring the kernel API contract, despite the potential performance
> > benefits it brings.
>
> Yes understandable, and it does looks tricky. But it's guest page and it's 
> just
> by nature how it works to me (sending page happening in parallel with page
> being modified).
>
> I think the MSG_ZEROCOPY doc page mentioned that and it's userspace program's
> own choice if that happens. So even if it's not by design and not suggested, 
> it
> seems not forbidden either.
>
> We can wr-protect the page (using things like userfaultfd-wp) during sending
> and unprotect them when it's done with a qio callback, that'll guarantee the
> buffer not changing during sending, however we gain nothing besides the "api
> cleaness" then..
>
> Thanks,
>
> --
> Peter Xu
>

I can't stress enough how inexperienced I am with the QEMU codebase, but
based on the current discussion, in my humble opinion it would make sense
for QIOChannel to offer something like an async_writev + async_flush
method (or a callback instead), and leave the responsibility for "which
flags to use", "when to lock", and "how/when to flush" to the code using
it.

I mean, it's clear that the sync requirements will be very different
depending on what the calling code sends with it, so it makes sense to me
that we offer the tool and let the caller decide how it should be used
(and possibly deal with any consequences).

Is there anything that could go wrong with the above that I am missing?

About the proposed callback, I am not sure how that would work
efficiently. Could someone help me with that?

FWIW, what I had in mind for a (theoretical) migration setup with
io_async_writev() + io_async_flush() (a rough sketch follows below):
- for guest RAM we can decide not to rw_lock memory on zerocopy,
  because there is no need;
- for headers, we can decide not to use async (use io_writev() instead);
- flush() can happen on each migration iteration, every N seconds, or
  at the end.
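
(To make that concrete, a sketch of the shape such hooks could take --
every name here is hypothetical, mirroring the io_async_writev() /
io_async_flush() idea above rather than any existing QIOChannel API:)

#include "qemu/osdep.h"
#include "io/channel.h"

/* Hypothetical per-channel hooks: the caller picks which writes go
 * async/zero-copy and when to wait; the channel decides whether it
 * can provide them at all. */
typedef struct QIOChannelAsyncOps {
    /* Queue a (possibly zero-copy) write; the iov buffers must stay
     * unchanged until a subsequent flush returns. */
    ssize_t (*io_async_writev)(QIOChannel *ioc,
                               const struct iovec *iov, size_t niov,
                               Error **errp);
    /* Block until every previously queued async write has completed. */
    int (*io_async_flush)(QIOChannel *ioc, Error **errp);
} QIOChannelAsyncOps;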

Thank you for the great discussion :)
Leonardo Bras