Re: [PATCH v2 2/4] mirror: allow specifying working bitmap

2024-03-07 Thread Markus Armbruster
Fiona Ebner  writes:

> From: John Snow 
>
> for the mirror job. The bitmap's granularity is used as the job's
> granularity.
>
> The new @bitmap parameter is marked unstable in the QAPI and can
> currently only be used for @sync=full mode.
>
> Clusters initially dirty in the bitmap as well as new writes are
> copied to the target.
>
> Using block-dirty-bitmap-clear and block-dirty-bitmap-merge API,
> callers can simulate the three kinds of @BitmapSyncMode (which is used
> by backup):
> 1. always: default, just pass bitmap as working bitmap.
> 2. never: copy bitmap and pass copy to the mirror job.
> 3. on-success: copy bitmap and pass copy to the mirror job and if
>successful, merge bitmap into original afterwards.
>
> When the target image is a fresh "diff image", i.e. one that was not
> used as the target of a previous mirror and the target image's cluster
> size is larger than the bitmap's granularity, or when
> @copy-mode=write-blocking is used, there is a pitfall, because the
> cluster in the target image will be allocated, but not contain all the
> data corresponding to the same region in the source image.
>
> An idea to avoid the limitation would be to mark clusters which are
> affected by unaligned writes and are not allocated in the target image
> dirty, so they would be copied fully later. However, for migration,
> the invariant that an actively synced mirror stays actively synced
> (unless an error happens) is useful, because without that invariant,
> migration might inactivate block devices when mirror still got work
> to do and run into an assertion failure [0].
>
> Another approach would be to read the missing data from the source
> upon unaligned writes to be able to write the full target cluster
> instead.
>
> But certain targets like NBD do not allow querying the cluster size.
> To avoid limiting/breaking the use case of syncing to an existing
> target, which is arguably more common than the diff image use case,
> document the limiation in QAPI.
>
> This patch was originally based on one by Ma Haocong, but it has since
> been modified pretty heavily, first by John and then again by Fiona.
>
> [0]: 
> https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/
>
> Suggested-by: Ma Haocong 
> Signed-off-by: Ma Haocong 
> Signed-off-by: John Snow 
> [FG: switch to bdrv_dirty_bitmap_merge_internal]
> Signed-off-by: Fabian Grünbichler 
> Signed-off-by: Thomas Lamprecht 
> [FE: rebase for 9.0
>  get rid of bitmap mode parameter
>  use caller-provided bitmap as working bitmap
>  turn bitmap parameter experimental]
> Signed-off-by: Fiona Ebner 

[...]

> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 59d75b0793..4859fffd48 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2191,6 +2191,18 @@
>  # destination (all the disk, only the sectors allocated in the
>  # topmost image, or only new I/O).
>  #
> +# @bitmap: The name of a bitmap to use as a working bitmap for
> +# sync=full mode.  This argument must be not be present for other
> +# sync modes and not at the same time as @granularity.  The
> +# bitmap's granularity is used as the job's granularity.  When
> +# the target is a diff image, i.e. one that should only contain
> +# the delta and was not synced to previously, the target's
> +# cluster size must not be larger than the bitmap's granularity.
> +# For a diff image target, using copy-mode=write-blocking should
> +# not be used, because unaligned writes will lead to allocated
> +# clusters with partial data in the target image!  The bitmap
> +# will be enabled after the job finishes.  (Since 9.0)

That's a lot of restrictions and caveats.  Okay as long as the thing
remains experimental, I guess.

> +#
>  # @granularity: granularity of the dirty bitmap, default is 64K if the
>  # image format doesn't have clusters, 4K if the clusters are
>  # smaller than that, else the cluster size.  Must be a power of 2
> @@ -2228,12 +2240,18 @@
>  # disappear from the query list without user intervention.
>  # Defaults to true.  (Since 3.1)
>  #
> +# Features:
> +#
> +# @unstable: Member @bitmap is experimental.
> +#
>  # Since: 1.3
>  ##
>  { 'struct': 'DriveMirror',
>'data': { '*job-id': 'str', 'device': 'str', 'target': 'str',
>  '*format': 'str', '*node-name': 'str', '*replaces': 'str',
> -'sync': 'MirrorSyncMode', '*mode': 'NewImageMode',
> +'sync': 'MirrorSyncMode',
> +'*bitmap': { 'type': 'str', 'features': [ 'unstable' ] },
> +'*mode': 'NewImageMode',
>  '*speed': 'int', '*granularity': 'uint32',
>  '*buf-size': 'int', '*on-source-error': 'BlockdevOnError',
>  '*on-target-error': 'BlockdevOnError',
> @@ -2513,6 +2531,18 @@
>  # destination (all the disk, only the sectors allocated in the
>  # topmost image, or only new I/O).
>  #
> +# @bitmap: The name

Re: [PATCH v2 1/4] qapi/block-core: avoid the re-use of MirrorSyncMode for backup

2024-03-07 Thread Markus Armbruster
Fiona Ebner  writes:

> Backup supports all modes listed in MirrorSyncMode, while mirror does
> not. Introduce BackupSyncMode by copying the current MirrorSyncMode
> and drop the variants mirror does not support from MirrorSyncMode as
> well as the corresponding manual check in mirror_start().

Results in tighter introspection: query-qmp-schema no longer reports
drive-mirror and blockdev-mirror accepting @sync values they actually
reject.  Suggest to mention this in the commit message.

> Suggested-by: Vladimir Sementsov-Ogievskiy 
> Signed-off-by: Fiona Ebner 
> ---
>
> I felt like keeping the "Since: X.Y" as before makes the most sense as
> to not lose history. Or is it necessary to change this for
> BackupSyncMode (and its members) since it got a new name?

Doc comments are for users of the QMP interface.  Type names do not
matter there.  I agree with your decision not to update the "since"
tags.

[...]

> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 1874f880a8..59d75b0793 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -1304,10 +1304,10 @@
>'data': ['report', 'ignore', 'enospc', 'stop', 'auto'] }
>  
>  ##
> -# @MirrorSyncMode:
> +# @BackupSyncMode:
>  #
> -# An enumeration of possible behaviors for the initial synchronization
> -# phase of storage mirroring.
> +# An enumeration of possible behaviors for image synchronization used
> +# by backup jobs.
>  #
>  # @top: copies data in the topmost image to the destination
>  #
> @@ -1323,7 +1323,7 @@
>  #
>  # Since: 1.3
>  ##
> -{ 'enum': 'MirrorSyncMode',
> +{ 'enum': 'BackupSyncMode',
>'data': ['top', 'full', 'none', 'incremental', 'bitmap'] }
>  
>  ##
> @@ -1347,6 +1347,23 @@
>  { 'enum': 'BitmapSyncMode',
>'data': ['on-success', 'never', 'always'] }
>  
> +##
> +# @MirrorSyncMode:
> +#
> +# An enumeration of possible behaviors for the initial synchronization
> +# phase of storage mirroring.
> +#
> +# @top: copies data in the topmost image to the destination
> +#
> +# @full: copies data from all images to the destination
> +#
> +# @none: only copy data written from now on
> +#
> +# Since: 1.3
> +##
> +{ 'enum': 'MirrorSyncMode',
> +  'data': ['top', 'full', 'none'] }
> +
>  ##
>  # @MirrorCopyMode:
>  #
> @@ -1624,7 +1641,7 @@
>  ##
>  { 'struct': 'BackupCommon',
>'data': { '*job-id': 'str', 'device': 'str',
> -'sync': 'MirrorSyncMode', '*speed': 'int',
> +'sync': 'BackupSyncMode', '*speed': 'int',
>  '*bitmap': 'str', '*bitmap-mode': 'BitmapSyncMode',
>  '*compress': 'bool',
>  '*on-source-error': 'BlockdevOnError',

QAPI schema
Acked-by: Markus Armbruster 




Re: [PATCH v2 00/10] mirror: allow switching from background to active mode

2024-03-07 Thread Vladimir Sementsov-Ogievskiy

On 04.03.24 14:09, Peter Krempa wrote:

On Mon, Mar 04, 2024 at 11:48:54 +0100, Kevin Wolf wrote:

Am 28.02.2024 um 19:07 hat Vladimir Sementsov-Ogievskiy geschrieben:

On 03.11.23 18:56, Markus Armbruster wrote:

Kevin Wolf  writes:


[...]


Is the job abstraction a failure?

We have

  block-job- command  since   job- commandsince
  -
  block-job-set-speed 1.1
  block-job-cancel1.1 job-cancel  3.0
  block-job-pause 1.3 job-pause   3.0
  block-job-resume1.3 job-resume  3.0
  block-job-complete  1.3 job-complete3.0
  block-job-dismiss   2.12job-dismiss 3.0
  block-job-finalize  2.12job-finalize3.0
  block-job-change8.2
  query-block-jobs1.1 query-jobs


[...]


I consider these strictly optional. We don't really have strong reasons
to deprecate these commands (they are just thin wrappers), and I think
libvirt still uses block-job-* in some places.


Libvirt uses 'block-job-cancel' because it has different semantics from
'job-cancel' which libvirt documented as the behaviour of the API that
uses it. (Semantics regarding the expectation of what is written to the
destination node at the point when the job is cancelled).



That's the following semantics:

  # Note that if you issue 'block-job-cancel' after 'drive-mirror' has
  # indicated (via the event BLOCK_JOB_READY) that the source and
  # destination are synchronized, then the event triggered by this
  # command changes to BLOCK_JOB_COMPLETED, to indicate that the
  # mirroring has ended and the destination now has a point-in-time copy
  # tied to the time of the cancellation.

Hmm. Looking at this, it looks for me, that should probably a 
'block-job-complete" command (as leading to BLOCK_JOB_COMPLETED).

Actually, what is the difference between block-job-complete and 
block-job-cancel(force=false) for mirror in ready state?

I only see the following differencies:

1. block-job-complete documents that it completes the job synchronously.. But 
looking at mirror code I see it just set s->should_complete = true, which will 
be then handled asynchronously.. So I doubt that documentation is correct.

2. block-job-complete will trigger final graph changes. block-job-cancel will 
not.

Is [2] really useful? Seems yes: in case of some failure before starting 
migration target, we'd like to continue executing source. So, no reason to 
break block-graph in source, better keep it unchanged.

But I think, such behavior better be setup by mirror-job start parameter, 
rather then by special option for cancel (or even compelete) command, useful 
only for mirror.

So, what about the following substitution for block-job-cancel:

block-job-cancel(force=true)  -->  use job-cancel

block-job-cancel(force=false) for backup, stream, commit  -->  use job-cancel

block-job-cancel(force=false) for mirror in ready mode  -->

  instead, use block-job-complete. If you don't need final graph modification 
which mirror job normally does, use graph-change=false parameter for 
blockdev-mirror command.


(I can hardly remember, that we've already discussed something like this long 
time ago, but I don't remember the results)

I also a bit unsure about active commit soft-cancelling semantics. Is it 
actually useful? If yes, block-commit command will need similar option.

--
Best regards,
Vladimir




Re: [PATCH] linux-aio: add IO_CMD_FDSYNC command support

2024-03-07 Thread Prasad Pandit
Hi,

On Thu, 7 Mar 2024 at 19:21, Kevin Wolf  wrote:
> Kernel support for this is "relatively" new, added in 2018. Should we
> fall back to the thread pool if we receive -EINVAL?

laio_co_submit
  laio_do_submit
ioq_submit
  io_submit

* When an aio operation is not supported by the kernel, io_submit()
call would return '-EINVAL', indicating that the specified (_FDSYNC)
operation is invalid for the file descriptor. ie. an error (-EINVAL)
needs to reach the 'laio_co_submit' routine above, to make its caller
fall back on the thread-pool functionality as default.

* Strangely the 'ioq_submit' routine above ignores the return value
from io_submit and returns void. I think we need to change that to be
able to fall back on the thread-pool functionality. I'll try to see if
such a change works as expected. Please let me know if there's another
way to fix this.

Thank you.
---
  - Prasad




Re: [PATCH v2 3/6] qdev-monitor: add option to report GenericError from find_device_state

2024-03-07 Thread Vladimir Sementsov-Ogievskiy

On 07.03.24 12:46, Markus Armbruster wrote:

Vladimir Sementsov-Ogievskiy  writes:


Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  system/qdev-monitor.c | 15 +++
  1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
index 9febb743f1..cf7481e416 100644
--- a/system/qdev-monitor.c
+++ b/system/qdev-monitor.c
@@ -877,13 +877,20 @@ void qmp_device_add(QDict *qdict, QObject **ret_data, 
Error **errp)
  object_unref(OBJECT(dev));
  }
  
-static DeviceState *find_device_state(const char *id, Error **errp)

+/*
+ * Note that creating new APIs using error classes other than GenericError is
+ * not recommended. Set use_generic_error=true for new interfaces.
+ */
+static DeviceState *find_device_state(const char *id, bool use_generic_error,
+  Error **errp)
  {
  Object *obj = object_resolve_path_at(qdev_get_peripheral(), id);
  DeviceState *dev;
  
  if (!obj) {

-error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,
+error_set(errp,
+  (use_generic_error ?
+   ERROR_CLASS_GENERIC_ERROR : ERROR_CLASS_DEVICE_NOT_FOUND),
"Device '%s' not found", id);
  return NULL;
  }
@@ -947,7 +954,7 @@ void qdev_unplug(DeviceState *dev, Error **errp)
  
  void qmp_device_del(const char *id, Error **errp)

  {
-DeviceState *dev = find_device_state(id, errp);
+DeviceState *dev = find_device_state(id, false, errp);
  if (dev != NULL) {
  if (dev->pending_deleted_event &&
  (dev->pending_deleted_expires_ms == 0 ||
@@ -1067,7 +1074,7 @@ BlockBackend *blk_by_qdev_id(const char *id, Error **errp)
  
  GLOBAL_STATE_CODE();
  
-dev = find_device_state(id, errp);

+dev = find_device_state(id, false, errp);
  if (dev == NULL) {
  return NULL;
  }


I appreciate the attempt to curb the spread of DeviceNotFound errors.
Two issues:

* Copy-pasting find_device_state() with a false argument is an easy
   error to make.

* Most uses of find_device_state() are via blk_by_qdev_id() and
   qmp_get_blk().  Any new uses of qemu_get_blk() will still produce
   DeviceNotFound.



Hm. But what to do?

- rename all three functions to FOO_base(), and add a boolean parameter
- add FOO() wrappers, passing true (use generic error)
- add FOO_deprecated() wrappers, passing false (use device not found error)
- change existing callers to use FOO_deprecated()

Still, new generic-error-based blk_by_qdev_id() and qmp_get_blk() will be 
unused..

That seem too ugly for me. And that doesn't give 100% sure that nobody will 
simply duplicate call to _deprecated() function...

Maybe better don't care for now (and continue use device-not-found for new 
APIs, that reuse find_device_state()), and just declare the deprecation for 
ERROR_CLASS_DEVICE_NOT_FOUND? And drop it usage everywhere after 3-releases 
deprecation cycle.

Or do we want deprecate everything except for generic-error, and deprecate 
error/class field in qmp return value altogether?

--
Best regards,
Vladimir




[PATCH v2 1/4] qapi/block-core: avoid the re-use of MirrorSyncMode for backup

2024-03-07 Thread Fiona Ebner
Backup supports all modes listed in MirrorSyncMode, while mirror does
not. Introduce BackupSyncMode by copying the current MirrorSyncMode
and drop the variants mirror does not support from MirrorSyncMode as
well as the corresponding manual check in mirror_start().

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Fiona Ebner 
---

I felt like keeping the "Since: X.Y" as before makes the most sense as
to not lose history. Or is it necessary to change this for
BackupSyncMode (and its members) since it got a new name?

 block/backup.c | 18 -
 block/mirror.c |  7 ---
 block/monitor/block-hmp-cmds.c |  2 +-
 block/replication.c|  2 +-
 blockdev.c | 26 -
 include/block/block_int-global-state.h |  2 +-
 qapi/block-core.json   | 27 +-
 7 files changed, 47 insertions(+), 37 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index ec29d6b810..1cc4e055c6 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -37,7 +37,7 @@ typedef struct BackupBlockJob {
 
 BdrvDirtyBitmap *sync_bitmap;
 
-MirrorSyncMode sync_mode;
+BackupSyncMode sync_mode;
 BitmapSyncMode bitmap_mode;
 BlockdevOnError on_source_error;
 BlockdevOnError on_target_error;
@@ -111,7 +111,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp)
 
 assert(block_job_driver(job) == &backup_job_driver);
 
-if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+if (backup_job->sync_mode != BACKUP_SYNC_MODE_NONE) {
 error_setg(errp, "The backup job only supports block checkpoint in"
" sync=none mode");
 return;
@@ -231,11 +231,11 @@ static void backup_init_bcs_bitmap(BackupBlockJob *job)
 uint64_t estimate;
 BdrvDirtyBitmap *bcs_bitmap = block_copy_dirty_bitmap(job->bcs);
 
-if (job->sync_mode == MIRROR_SYNC_MODE_BITMAP) {
+if (job->sync_mode == BACKUP_SYNC_MODE_BITMAP) {
 bdrv_clear_dirty_bitmap(bcs_bitmap, NULL);
 bdrv_dirty_bitmap_merge_internal(bcs_bitmap, job->sync_bitmap, NULL,
  true);
-} else if (job->sync_mode == MIRROR_SYNC_MODE_TOP) {
+} else if (job->sync_mode == BACKUP_SYNC_MODE_TOP) {
 /*
  * We can't hog the coroutine to initialize this thoroughly.
  * Set a flag and resume work when we are able to yield safely.
@@ -254,7 +254,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
 
 backup_init_bcs_bitmap(s);
 
-if (s->sync_mode == MIRROR_SYNC_MODE_TOP) {
+if (s->sync_mode == BACKUP_SYNC_MODE_TOP) {
 int64_t offset = 0;
 int64_t count;
 
@@ -282,7 +282,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp)
 block_copy_set_skip_unallocated(s->bcs, false);
 }
 
-if (s->sync_mode == MIRROR_SYNC_MODE_NONE) {
+if (s->sync_mode == BACKUP_SYNC_MODE_NONE) {
 /*
  * All bits are set in bcs bitmap to allow any cluster to be copied.
  * This does not actually require them to be copied.
@@ -354,7 +354,7 @@ static const BlockJobDriver backup_job_driver = {
 
 BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
   BlockDriverState *target, int64_t speed,
-  MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap,
+  BackupSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap,
   BitmapSyncMode bitmap_mode,
   bool compress,
   const char *filter_node_name,
@@ -376,8 +376,8 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
 GLOBAL_STATE_CODE();
 
 /* QMP interface protects us from these cases */
-assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL);
-assert(sync_bitmap || sync_mode != MIRROR_SYNC_MODE_BITMAP);
+assert(sync_mode != BACKUP_SYNC_MODE_INCREMENTAL);
+assert(sync_bitmap || sync_mode != BACKUP_SYNC_MODE_BITMAP);
 
 if (bs == target) {
 error_setg(errp, "Source and target cannot be the same");
diff --git a/block/mirror.c b/block/mirror.c
index 5145eb53e1..1609354db3 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -2011,13 +2011,6 @@ void mirror_start(const char *job_id, BlockDriverState 
*bs,
 
 GLOBAL_STATE_CODE();
 
-if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) ||
-(mode == MIRROR_SYNC_MODE_BITMAP)) {
-error_setg(errp, "Sync mode '%s' not supported",
-   MirrorSyncMode_str(mode));
-return;
-}
-
 bdrv_graph_rdlock_main_loop();
 is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
 base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL;
diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index d954bec6f1..9633d000c0 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -266,7 +266,7 @@ void hmp_d

[PATCH v2 0/4] mirror: allow specifying working bitmap

2024-03-07 Thread Fiona Ebner
Changes from RFC/v1 (discussion here [0]):
* Add patch to split BackupSyncMode and MirrorSyncMode.
* Drop bitmap-mode parameter and use passed-in bitmap as the working
  bitmap instead. Users can get the desired behaviors by
  using the block-dirty-bitmap-clear and block-dirty-bitmap-merge
  calls (see commit message in patch 2/4 for how exactly).
* Add patch to check whether target image's cluster size is at most
  mirror job's granularity. Optional, it's an extra safety check
  that's useful when the target is a "diff" image that does not have
  previously synced data.

Use cases:
* Possibility to resume a failed mirror later.
* Possibility to only mirror deltas to a previously mirrored volume.
* Possibility to (efficiently) mirror an drive that was previously
  mirrored via some external mechanism (e.g. ZFS replication).

We are using the last one in production without any issues since about
4 years now. In particular, like mentioned in [1]:

> - create bitmap(s)
> - (incrementally) replicate storage volume(s) out of band (using ZFS)
> - incrementally drive mirror as part of a live migration of VM
> - drop bitmap(s)


Now, the IO test added in patch 4/4 actually contains yet another use
case, namely doing incremental mirrors to stand-alone qcow2 "diff"
images, that only contain the delta and can be rebased later. I had to
adapt the IO test, because its output expected the mirror bitmap to
still be dirty, but nowadays the mirror is apparently already done
when the bitmaps are queried. So I thought, I'll just use
'write-blocking' mode to avoid any potential timing issues.

But this exposed an issue with the diff image approach. If a write is
not aligned to the granularity of the mirror target, then rebasing the
diff image onto a backing image will not yield the desired result,
because the full cluster is considered to be allocated and will "hide"
some part of the base/backing image. The failure can be seen by either
using 'write-blocking' mode in the IO test or setting the (bitmap)
granularity to 32 KiB rather than the current 64 KiB.

For the latter case, patch 4/4 adds a check. For the former, the
limitation is documented (I'd expect this to be a niche use case in
practice).

[0]: 
https://lore.kernel.org/qemu-devel/b91dba34-7969-4d51-ba40-96a91038c...@yandex-team.ru/T/#m4ae27dc8ca1fb053e0a32cc4ffa2cfab6646805c
[1]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/


Fabian Grünbichler (1):
  iotests: add test for bitmap mirror

Fiona Ebner (2):
  qapi/block-core: avoid the re-use of MirrorSyncMode for backup
  blockdev: mirror: check for target's cluster size when using bitmap

John Snow (1):
  mirror: allow specifying working bitmap

 block/backup.c|   18 +-
 block/mirror.c|  102 +-
 block/monitor/block-hmp-cmds.c|2 +-
 block/replication.c   |2 +-
 blockdev.c|   84 +-
 include/block/block_int-global-state.h|7 +-
 qapi/block-core.json  |   64 +-
 tests/qemu-iotests/tests/bitmap-sync-mirror   |  571 
 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2946 +
 tests/unit/test-block-iothread.c  |2 +-
 10 files changed, 3729 insertions(+), 69 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror
 create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out

-- 
2.39.2





Re: [PATCH] linux-aio: add IO_CMD_FDSYNC command support

2024-03-07 Thread Kevin Wolf
Am 07.03.2024 um 12:47 hat Prasad Pandit geschrieben:
> From: Prasad Pandit 
> 
> Libaio defines IO_CMD_FDSYNC command to sync all outstanding
> asynchronous I/O operations, by flushing out file data to the
> disk storage.
> 
> Enable linux-aio to submit such aio request. This helps to
> reduce latency induced via pthread_create calls by
> thread-pool (aio=threads).
> 
> Signed-off-by: Prasad Pandit 

Kernel support for this is "relatively" new, added in 2018. Should we
fall back to the thread pool if we receive -EINVAL?

Kevin




[PATCH v2 3/4] iotests: add test for bitmap mirror

2024-03-07 Thread Fiona Ebner
From: Fabian Grünbichler 

heavily based on/practically forked off iotest 257 for bitmap backups,
but:

- no writes to filter node 'mirror-top' between completion and
finalization, as those seem to deadlock?
- extra set of reference/test mirrors to verify that writes in parallel
with active mirror work

Intentionally keeping copyright and ownership of original test case to
honor provenance.

The test was originally adapted by Fabian from 257, but has seen
rather big changes, because the interface for mirror with bitmap was
changed, i.e. no @bitmap-mode parameter anymore and bitmap is used as
the working bitmap.

Signed-off-by: Fabian Grünbichler 
Signed-off-by: Thomas Lamprecht 
[FE: rebase for 9.0
 adapt to changes to mirror bitmap interface
 rename test from '384' to 'bitmap-sync-mirror']
Signed-off-by: Fiona Ebner 
---
 tests/qemu-iotests/tests/bitmap-sync-mirror   |  565 
 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2939 +
 2 files changed, 3504 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror
 create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out

diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror 
b/tests/qemu-iotests/tests/bitmap-sync-mirror
new file mode 100755
index 00..898f1f4ba4
--- /dev/null
+++ b/tests/qemu-iotests/tests/bitmap-sync-mirror
@@ -0,0 +1,565 @@
+#!/usr/bin/env python3
+# group: rw
+#
+# Test bitmap-sync mirrors (incremental, differential, and partials)
+#
+# Copyright (c) 2019 John Snow for Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+# owner=js...@redhat.com
+
+import math
+import os
+
+import iotests
+from iotests import log, qemu_img
+
+SIZE = 64 * 1024 * 1024
+GRANULARITY = 64 * 1024
+
+
+class Pattern:
+def __init__(self, byte, offset, size=GRANULARITY):
+self.byte = byte
+self.offset = offset
+self.size = size
+
+def bits(self, granularity):
+lower = self.offset // granularity
+upper = (self.offset + self.size - 1) // granularity
+return set(range(lower, upper + 1))
+
+
+class PatternGroup:
+"""Grouping of Pattern objects. Initialize with an iterable of Patterns."""
+def __init__(self, patterns):
+self.patterns = patterns
+
+def bits(self, granularity):
+"""Calculate the unique bits dirtied by this pattern grouping"""
+res = set()
+for pattern in self.patterns:
+res |= pattern.bits(granularity)
+return res
+
+
+GROUPS = [
+PatternGroup([
+# Batch 0: 4 clusters
+Pattern('0x49', 0x000),
+Pattern('0x6c', 0x010),   # 1M
+Pattern('0x6f', 0x200),   # 32M
+Pattern('0x76', 0x3ff)]), # 64M - 64K
+PatternGroup([
+# Batch 1: 6 clusters (3 new)
+Pattern('0x65', 0x000),   # Full overwrite
+Pattern('0x77', 0x00f8000),   # Partial-left (1M-32K)
+Pattern('0x72', 0x2008000),   # Partial-right (32M+32K)
+Pattern('0x69', 0x3fe)]), # Adjacent-left (64M - 128K)
+PatternGroup([
+# Batch 2: 7 clusters (3 new)
+Pattern('0x74', 0x001),   # Adjacent-right
+Pattern('0x69', 0x00e8000),   # Partial-left  (1M-96K)
+Pattern('0x6e', 0x2018000),   # Partial-right (32M+96K)
+Pattern('0x67', 0x3fe,
+2*GRANULARITY)]), # Overwrite [(64M-128K)-64M)
+PatternGroup([
+# Batch 3: 8 clusters (5 new)
+# Carefully chosen such that nothing re-dirties the one cluster
+# that copies out successfully before failure in Group #1.
+Pattern('0xaa', 0x001,
+3*GRANULARITY),   # Overwrite and 2x Adjacent-right
+Pattern('0xbb', 0x00d8000),   # Partial-left (1M-160K)
+Pattern('0xcc', 0x2028000),   # Partial-right (32M+160K)
+Pattern('0xdd', 0x3fc)]), # New; leaving a gap to the right
+]
+
+
+class EmulatedBitmap:
+def __init__(self, granularity=GRANULARITY):
+self._bits = set()
+self.granularity = granularity
+
+def dirty_bits(self, bits):
+self._bits |= set(bits)
+
+def dirty_group(self, n):
+self.dirty_bits(GROUPS[n].bits(self.granularity))
+
+def clear(self):
+self._bits = set()
+
+def clear_bits(self, bits):
+self._bits -= set(bits)
+
+def clear_bit(self, bit):
+self.c

[PATCH v2 2/4] mirror: allow specifying working bitmap

2024-03-07 Thread Fiona Ebner
From: John Snow 

for the mirror job. The bitmap's granularity is used as the job's
granularity.

The new @bitmap parameter is marked unstable in the QAPI and can
currently only be used for @sync=full mode.

Clusters initially dirty in the bitmap as well as new writes are
copied to the target.

Using block-dirty-bitmap-clear and block-dirty-bitmap-merge API,
callers can simulate the three kinds of @BitmapSyncMode (which is used
by backup):
1. always: default, just pass bitmap as working bitmap.
2. never: copy bitmap and pass copy to the mirror job.
3. on-success: copy bitmap and pass copy to the mirror job and if
   successful, merge bitmap into original afterwards.

When the target image is a fresh "diff image", i.e. one that was not
used as the target of a previous mirror and the target image's cluster
size is larger than the bitmap's granularity, or when
@copy-mode=write-blocking is used, there is a pitfall, because the
cluster in the target image will be allocated, but not contain all the
data corresponding to the same region in the source image.

An idea to avoid the limitation would be to mark clusters which are
affected by unaligned writes and are not allocated in the target image
dirty, so they would be copied fully later. However, for migration,
the invariant that an actively synced mirror stays actively synced
(unless an error happens) is useful, because without that invariant,
migration might inactivate block devices when mirror still got work
to do and run into an assertion failure [0].

Another approach would be to read the missing data from the source
upon unaligned writes to be able to write the full target cluster
instead.

But certain targets like NBD do not allow querying the cluster size.
To avoid limiting/breaking the use case of syncing to an existing
target, which is arguably more common than the diff image use case,
document the limiation in QAPI.

This patch was originally based on one by Ma Haocong, but it has since
been modified pretty heavily, first by John and then again by Fiona.

[0]: 
https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/

Suggested-by: Ma Haocong 
Signed-off-by: Ma Haocong 
Signed-off-by: John Snow 
[FG: switch to bdrv_dirty_bitmap_merge_internal]
Signed-off-by: Fabian Grünbichler 
Signed-off-by: Thomas Lamprecht 
[FE: rebase for 9.0
 get rid of bitmap mode parameter
 use caller-provided bitmap as working bitmap
 turn bitmap parameter experimental]
Signed-off-by: Fiona Ebner 
---
 block/mirror.c | 95 --
 blockdev.c | 39 +--
 include/block/block_int-global-state.h |  5 +-
 qapi/block-core.json   | 37 +-
 tests/unit/test-block-iothread.c   |  2 +-
 5 files changed, 146 insertions(+), 32 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 1609354db3..5c9a00b574 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -51,7 +51,7 @@ typedef struct MirrorBlockJob {
 BlockDriverState *to_replace;
 /* Used to block operations on the drive-mirror-replace target */
 Error *replace_blocker;
-bool is_none_mode;
+MirrorSyncMode sync_mode;
 BlockMirrorBackingMode backing_mode;
 /* Whether the target image requires explicit zero-initialization */
 bool zero_target;
@@ -73,6 +73,11 @@ typedef struct MirrorBlockJob {
 size_t buf_size;
 int64_t bdev_length;
 unsigned long *cow_bitmap;
+/*
+ * Whether the bitmap is created locally or provided by the caller (for
+ * incremental sync).
+ */
+bool dirty_bitmap_is_local;
 BdrvDirtyBitmap *dirty_bitmap;
 BdrvDirtyBitmapIter *dbi;
 uint8_t *buf;
@@ -687,7 +692,11 @@ static int mirror_exit_common(Job *job)
 bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs);
 }
 
-bdrv_release_dirty_bitmap(s->dirty_bitmap);
+if (s->dirty_bitmap_is_local) {
+bdrv_release_dirty_bitmap(s->dirty_bitmap);
+} else {
+bdrv_enable_dirty_bitmap(s->dirty_bitmap);
+}
 
 /* Make sure that the source BDS doesn't go away during bdrv_replace_node,
  * before we can call bdrv_drained_end */
@@ -718,7 +727,8 @@ static int mirror_exit_common(Job *job)
  &error_abort);
 
 if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) {
-BlockDriverState *backing = s->is_none_mode ? src : s->base;
+BlockDriverState *backing;
+backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? src : s->base;
 BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs);
 
 if (bdrv_cow_bs(unfiltered_target) != backing) {
@@ -815,6 +825,16 @@ static void mirror_abort(Job *job)
 assert(ret == 0);
 }
 
+/* Always called after commit/abort. */
+static void mirror_clean(Job *job)
+{
+MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
+
+if (!s->dirty_bitmap_is_local && s->dirty_bitmap) {
+bdrv_

[PATCH v2 4/4] blockdev: mirror: check for target's cluster size when using bitmap

2024-03-07 Thread Fiona Ebner
When using mirror with a bitmap and the target is a diff image, i.e.
one that should only contain the delta and was not synced to
previously, a too large cluster size for the target can be
problematic. In particular, when the mirror sends data to the target
aligned to the jobs granularity, but not aligned to the larger target
image's cluster size, the target's cluster would be allocated but only
be filled partially. When rebasing such a diff image later, the
corresponding cluster of the base image would get "masked" and the
part of the cluster not in the diff image is not accessible anymore.

Unfortunately, it is not always possible to check for the target
image's cluster size, e.g. when it's NBD. Because the limitation is
already documented in the QAPI description for the @bitmap parameter
and it's only required for special diff image use-case, simply skip
the check then.

Signed-off-by: Fiona Ebner 
---
 blockdev.c| 19 +++
 tests/qemu-iotests/tests/bitmap-sync-mirror   |  6 ++
 .../qemu-iotests/tests/bitmap-sync-mirror.out |  7 +++
 3 files changed, 32 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index c76eb97a4c..968d44cd3b 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2847,6 +2847,9 @@ static void blockdev_mirror_common(const char *job_id, 
BlockDriverState *bs,
 }
 
 if (bitmap_name) {
+BlockDriverInfo bdi;
+uint32_t bitmap_granularity;
+
 if (sync != MIRROR_SYNC_MODE_FULL) {
 error_setg(errp, "Sync mode '%s' not supported with bitmap.",
MirrorSyncMode_str(sync));
@@ -2863,6 +2866,22 @@ static void blockdev_mirror_common(const char *job_id, 
BlockDriverState *bs,
 return;
 }
 
+bitmap_granularity = bdrv_dirty_bitmap_granularity(bitmap);
+/*
+ * Ignore if unable to get the info, e.g. when target is NBD. It's only
+ * relevant for syncing to a diff image and the documentation already
+ * states that the target's cluster size needs to small enough then.
+ */
+if (bdrv_get_info(target, &bdi) >= 0) {
+if (bitmap_granularity < bdi.cluster_size ||
+bitmap_granularity % bdi.cluster_size != 0) {
+error_setg(errp, "Bitmap granularity %u is not a multiple of "
+   "the target image's cluster size %u",
+   bitmap_granularity, bdi.cluster_size);
+return;
+}
+}
+
 if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_ALLOW_RO, errp)) {
 return;
 }
diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror 
b/tests/qemu-iotests/tests/bitmap-sync-mirror
index 898f1f4ba4..cbd5cc99cc 100755
--- a/tests/qemu-iotests/tests/bitmap-sync-mirror
+++ b/tests/qemu-iotests/tests/bitmap-sync-mirror
@@ -552,6 +552,12 @@ def test_mirror_api():
 bitmap=bitmap)
 log('')
 
+log("-- Test bitmap with too small granularity --\n".format(sync_mode))
+vm.qmp_log("block-dirty-bitmap-add", node=drive0.node,
+   name="bitmap-small", granularity=(GRANULARITY // 2))
+blockdev_mirror(drive0.vm, drive0.node, "mirror_target", "full",
+job_id='api_job', bitmap="bitmap-small")
+log('')
 
 def main():
 for bsync_mode in ("never", "on-success", "always"):
diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror.out 
b/tests/qemu-iotests/tests/bitmap-sync-mirror.out
index c05b4788c6..d40ea7d689 100644
--- a/tests/qemu-iotests/tests/bitmap-sync-mirror.out
+++ b/tests/qemu-iotests/tests/bitmap-sync-mirror.out
@@ -2937,3 +2937,10 @@ qemu_img compare "TEST_DIR/PID-img" 
"TEST_DIR/PID-fmirror3" ==> Identical, OK!
 {"execute": "blockdev-mirror", "arguments": {"bitmap": "bitmap0", "device": 
"drive0", "filter-node-name": "mirror-top", "job-id": "api_job", "sync": 
"none", "target": "mirror_target"}}
 {"error": {"class": "GenericError", "desc": "Sync mode 'none' not supported 
with bitmap."}}
 
+-- Test bitmap with too small granularity --
+
+{"execute": "block-dirty-bitmap-add", "arguments": {"granularity": 32768, 
"name": "bitmap-small", "node": "drive0"}}
+{"return": {}}
+{"execute": "blockdev-mirror", "arguments": {"bitmap": "bitmap-small", 
"device": "drive0", "filter-node-name": "mirror-top", "job-id": "api_job", 
"sync": "full", "target": "mirror_target"}}
+{"error": {"class": "GenericError", "desc": "Bitmap granularity 32768 is not a 
multiple of the target image's cluster size 65536"}}
+
-- 
2.39.2





[PATCH] linux-aio: add IO_CMD_FDSYNC command support

2024-03-07 Thread Prasad Pandit
From: Prasad Pandit 

Libaio defines IO_CMD_FDSYNC command to sync all outstanding
asynchronous I/O operations, by flushing out file data to the
disk storage.

Enable linux-aio to submit such aio request. This helps to
reduce latency induced via pthread_create calls by
thread-pool (aio=threads).

Signed-off-by: Prasad Pandit 
---
 block/file-posix.c | 5 +
 block/linux-aio.c  | 5 -
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 35684f7e21..4ef6c6c9e5 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2599,6 +2599,11 @@ static int coroutine_fn 
raw_co_flush_to_disk(BlockDriverState *bs)
 if (raw_check_linux_io_uring(s)) {
 return luring_co_submit(bs, s->fd, 0, NULL, QEMU_AIO_FLUSH);
 }
+#endif
+#ifdef CONFIG_LINUX_AIO
+if (raw_check_linux_aio(s)) {
+return laio_co_submit(s->fd, 0, NULL, QEMU_AIO_FLUSH, 0);
+}
 #endif
 return raw_thread_pool_submit(handle_aiocb_flush, &acb);
 }
diff --git a/block/linux-aio.c b/block/linux-aio.c
index ec05d946f3..d940d029e3 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -384,6 +384,9 @@ static int laio_do_submit(int fd, struct qemu_laiocb 
*laiocb, off_t offset,
 case QEMU_AIO_READ:
 io_prep_preadv(iocbs, fd, qiov->iov, qiov->niov, offset);
 break;
+case QEMU_AIO_FLUSH:
+io_prep_fdsync(iocbs, fd);
+break;
 /* Currently Linux kernel does not support other operations */
 default:
 fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
@@ -412,7 +415,7 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, 
QEMUIOVector *qiov,
 AioContext *ctx = qemu_get_current_aio_context();
 struct qemu_laiocb laiocb = {
 .co = qemu_coroutine_self(),
-.nbytes = qiov->size,
+.nbytes = qiov ? qiov->size : 0,
 .ctx= aio_get_linux_aio(ctx),
 .ret= -EINPROGRESS,
 .is_read= (type == QEMU_AIO_READ),
-- 
2.44.0




Re: [PATCH v2 6/6] qapi: introduce CONFIG_READ event

2024-03-07 Thread Vladimir Sementsov-Ogievskiy

On 07.03.24 13:01, Markus Armbruster wrote:

Vladimir Sementsov-Ogievskiy  writes:


Send a new event when guest reads virtio-pci config after
virtio_notify_config() call.

That's useful to check that guest fetched modified config, for example
after resizing disk backend.

Signed-off-by: Vladimir Sementsov-Ogievskiy 


[...]


diff --git a/qapi/qdev.json b/qapi/qdev.json
index 6ece164172..ffc5e3be18 100644
--- a/qapi/qdev.json
+++ b/qapi/qdev.json
@@ -179,3 +179,29 @@
  { 'command': 'device-sync-config',
'features': [ 'unstable' ],
'data': {'id': 'str'} }
+
+##
+# @VIRTIO_CONFIG_READ:
+#
+# Emitted whenever guest reads virtio device config after config change.


Let's not abbreviate "configuration" to "config".


Oops, me again.




+#
+# @device: device name
+#
+# @path: device path
+#
+# Features:
+#
+# @unstable: The event is experimental.
+#
+# Since: 9.0
+#
+# Example:
+#
+# <- { "event": "VIRTIO_CONFIG_READ",
+#  "data": { "device": "virtio-net-pci-0",
+#"path": "/machine/peripheral/virtio-net-pci-0" },
+#  "timestamp": { "seconds": 1265044230, "microseconds": 450486 } }
+##


As for PATCH 4, I'd like to see some guidance on intended use.


Will do




+{ 'event': 'VIRTIO_CONFIG_READ',
+  'features': [ 'unstable' ],
+  'data': { '*device': 'str', 'path': 'str' } }


[...]



Thanks for reviewing my patches!

--
Best regards,
Vladimir




Re: [PATCH v2 5/6] qapi: device-sync-config: check runstate

2024-03-07 Thread Vladimir Sementsov-Ogievskiy

On 07.03.24 12:57, Markus Armbruster wrote:

Vladimir Sementsov-Ogievskiy  writes:


Command result is racy if allow it during migration. Let's allow the
sync only in RUNNING state.

Signed-off-by: Vladimir Sementsov-Ogievskiy 


If I understand this correctly, the previous commit introduces a race,
and this one fixes it.


Sure, I will merge it. I recently argued against such things, bad for me to 
break the rule myself)



We normally avoid such temporary bugs.  When avoidance is impractical,
we point them out in commit message and FIXME comment.
>> ---

  include/sysemu/runstate.h |  1 +
  system/qdev-monitor.c | 27 ++-
  system/runstate.c |  5 +
  3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
index 0117d243c4..296af52322 100644
--- a/include/sysemu/runstate.h
+++ b/include/sysemu/runstate.h
@@ -5,6 +5,7 @@
  #include "qemu/notify.h"
  
  bool runstate_check(RunState state);

+const char *current_run_state_str(void);
  void runstate_set(RunState new_state);
  RunState runstate_get(void);
  bool runstate_is_running(void);
diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
index e3107a12d7..b83b5d23c9 100644
--- a/system/qdev-monitor.c
+++ b/system/qdev-monitor.c
@@ -23,6 +23,7 @@
  #include "monitor/monitor.h"
  #include "monitor/qdev.h"
  #include "sysemu/arch_init.h"
+#include "sysemu/runstate.h"
  #include "qapi/error.h"
  #include "qapi/qapi-commands-qdev.h"
  #include "qapi/qmp/dispatch.h"
@@ -983,7 +984,31 @@ int qdev_sync_config(DeviceState *dev, Error **errp)
  
  void qmp_device_sync_config(const char *id, Error **errp)

  {
-DeviceState *dev = find_device_state(id, true, errp);
+MigrationState *s = migrate_get_current();
+DeviceState *dev;
+
+/*
+ * During migration there is a race between syncing`config and migrating 
it,
+ * so let's just not allow it.
+ *
+ * Moreover, let's not rely on setting up interrupts in paused state, which
+ * may be a part of migration process.


Wrap your comment lines around column 70, please.


OK.

I never remember this recommendation.. Probably we'd better have a check in 
checkpatch for it




+ */
+
+if (migration_is_running(s->state)) {
+error_setg(errp, "Config synchronization is not allowed "
+   "during migration.");
+return;
+}
+
+if (!runstate_is_running()) {
+error_setg(errp, "Config synchronization allowed only in '%s' state, "
+   "current state is '%s'", RunState_str(RUN_STATE_RUNNING),
+   current_run_state_str());
+return;
+}
+
+dev = find_device_state(id, true, errp);
  if (!dev) {
  return;
  }
diff --git a/system/runstate.c b/system/runstate.c
index d6ab860eca..8fd89172ae 100644
--- a/system/runstate.c
+++ b/system/runstate.c
@@ -189,6 +189,11 @@ bool runstate_check(RunState state)
  return current_run_state == state;
  }
  
+const char *current_run_state_str(void)

+{
+return RunState_str(current_run_state);
+}
+
  static void runstate_init(void)
  {
  const RunStateTransition *p;




--
Best regards,
Vladimir




Re: [PATCH v1 0/8] virtio,vhost: Add VIRTIO_F_NOTIFICATION_DATA support

2024-03-07 Thread Eugenio Perez Martin
On Wed, Mar 6, 2024 at 8:34 AM Michael S. Tsirkin  wrote:
>
> On Wed, Mar 06, 2024 at 08:07:31AM +0100, Eugenio Perez Martin wrote:
> > On Wed, Mar 6, 2024 at 6:34 AM Jason Wang  wrote:
> > >
> > > On Tue, Mar 5, 2024 at 3:46 AM Jonah Palmer  
> > > wrote:
> > > >
> > > > The goal of these patches are to add support to a variety of virtio and
> > > > vhost devices for the VIRTIO_F_NOTIFICATION_DATA transport feature. This
> > > > feature indicates that a driver will pass extra data (instead of just a
> > > > virtqueue's index) when notifying the corresponding device.
> > > >
> > > > The data passed in by the driver when this feature is enabled varies in
> > > > format depending on if the device is using a split or packed virtqueue
> > > > layout:
> > > >
> > > >  Split VQ
> > > >   - Upper 16 bits: shadow_avail_idx
> > > >   - Lower 16 bits: virtqueue index
> > > >
> > > >  Packed VQ
> > > >   - Upper 16 bits: 1-bit wrap counter & 15-bit shadow_avail_idx
> > > >   - Lower 16 bits: virtqueue index
> > > >
> > > > Also, due to the limitations of ioeventfd not being able to carry the
> > > > extra provided by the driver, ioeventfd is left disabled for any devices
> > > > using this feature.
> > >
> > > Is there any method to overcome this? This might help for vhost.
> > >
> >
> > As a half-baked idea, read(2)ing an eventfd descriptor returns an
> > 8-byte integer already. The returned value of read depends on eventfd
> > flags, but both have to do with the number of writes of the other end.
> >
> > My proposal is to replace this value with the last value written by
> > the guest, so we can extract the virtio notification data from there.
> > The behavior of read is similar to not-EFD_SEMAPHORE, reading a value
> > and then blocking if read again without writes. The behavior of KVM
> > writes is different, as it is not a counter anymore.
> >
> > Thanks!
>
>
> I doubt you will be able to support this in ioeventfd...

I agree.

> But vhost does not really need the value at all.
> So why mask out ioeventfd with vhost?

The interface should not be able to start with vhost-kernel because
the feature is not offered by the vhost-kernel device. So ioeventfd is
always enabled with vhost-kernel.

Or do you mean we should allow it and let vhost-kernel fetch data from
the avail ring as usual? I'm ok with that but then the guest can place
any value to it, so the driver cannot be properly "validated by
software" that way.

> vhost-vdpa is probably the only one that might need it...

Right, but vhost-vdpa already supports doorbell memory regions so I
guess it has little use, isn't it?

Thanks!

>
>
>
> > > Thanks
> > >
> > > >
> > > > A significant aspect of this effort has been to maintain compatibility
> > > > across different backends. As such, the feature is offered by backend
> > > > devices only when supported, with fallback mechanisms where backend
> > > > support is absent.
> > > >
> > > > Jonah Palmer (8):
> > > >   virtio/virtio-pci: Handle extra notification data
> > > >   virtio-pci: Lock ioeventfd state with VIRTIO_F_NOTIFICATION_DATA
> > > >   virtio-mmio: Handle extra notification data
> > > >   virtio-mmio: Lock ioeventfd state with VIRTIO_F_NOTIFICATION_DATA
> > > >   virtio-ccw: Handle extra notification data
> > > >   virtio-ccw: Lock ioeventfd state with VIRTIO_F_NOTIFICATION_DATA
> > > >   vhost/vhost-user: Add VIRTIO_F_NOTIFICATION_DATA to vhost feature bits
> > > >   virtio: Add VIRTIO_F_NOTIFICATION_DATA property definition
> > > >
> > > >  hw/block/vhost-user-blk.c|  1 +
> > > >  hw/net/vhost_net.c   |  2 ++
> > > >  hw/s390x/s390-virtio-ccw.c   | 16 
> > > >  hw/s390x/virtio-ccw.c|  6 --
> > > >  hw/scsi/vhost-scsi.c |  1 +
> > > >  hw/scsi/vhost-user-scsi.c|  1 +
> > > >  hw/virtio/vhost-user-fs.c|  2 +-
> > > >  hw/virtio/vhost-user-vsock.c |  1 +
> > > >  hw/virtio/virtio-mmio.c  | 15 +++
> > > >  hw/virtio/virtio-pci.c   | 16 +++-
> > > >  hw/virtio/virtio.c   | 18 ++
> > > >  include/hw/virtio/virtio.h   |  5 -
> > > >  net/vhost-vdpa.c |  1 +
> > > >  13 files changed, 68 insertions(+), 17 deletions(-)
> > > >
> > > > --
> > > > 2.39.3
> > > >
> > >
>




Re: [PATCH v5 3/3] hw/nvme: Add SPDM over DOE support

2024-03-07 Thread Philippe Mathieu-Daudé

Hi Alistair, Wilfred,

On 7/3/24 01:58, Alistair Francis wrote:

From: Wilfred Mallawa 

Setup Data Object Exchance (DOE) as an extended capability for the NVME
controller and connect SPDM to it (CMA) to it.

Signed-off-by: Wilfred Mallawa 
Signed-off-by: Alistair Francis 
Reviewed-by: Jonathan Cameron 
Acked-by: Klaus Jensen 
---
  docs/specs/index.rst|   1 +
  docs/specs/spdm.rst | 122 
  include/hw/pci/pci_device.h |   5 ++
  include/hw/pci/pcie_doe.h   |   3 +
  hw/nvme/ctrl.c  |  53 
  5 files changed, 184 insertions(+)
  create mode 100644 docs/specs/spdm.rst




diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index d3dd0f64b2..b8379c78f1 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -3,6 +3,7 @@
  
  #include "hw/pci/pci.h"

  #include "hw/pci/pcie.h"
+#include "hw/pci/pcie_doe.h"
  
  #define TYPE_PCI_DEVICE "pci-device"

  typedef struct PCIDeviceClass PCIDeviceClass;
@@ -157,6 +158,10 @@ struct PCIDevice {
  MSIVectorReleaseNotifier msix_vector_release_notifier;
  MSIVectorPollNotifier msix_vector_poll_notifier;
  
+/* DOE */

+DOECap doe_spdm;
+uint16_t spdm_port;
+
  /* ID of standby device in net_failover pair */
  char *failover_pair_id;
  uint32_t acpi_index;
diff --git a/include/hw/pci/pcie_doe.h b/include/hw/pci/pcie_doe.h
index 15d94661f9..eb8f4e393d 100644
--- a/include/hw/pci/pcie_doe.h
+++ b/include/hw/pci/pcie_doe.h
@@ -108,6 +108,9 @@ struct DOECap {
  /* Protocols and its callback response */
  DOEProtocol *protocols;
  uint16_t protocol_num;
+
+/* Used for spdm-socket */
+int socket;


Why not name it 'spdm_socket' like 'spdm_port' earlier?


  };




  static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
  {
  ERRP_GUARD();
@@ -8126,6 +8149,24 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice 
*pci_dev, Error **errp)
  
  nvme_update_msixcap_ts(pci_dev, n->conf_msix_qsize);
  
+pcie_cap_deverr_init(pci_dev);

+
+/* DOE Initialisation */
+if (pci_dev->spdm_port) {
+uint16_t doe_offset = n->params.sriov_max_vfs ?
+  PCI_CONFIG_SPACE_SIZE + PCI_ARI_SIZEOF
+  : PCI_CONFIG_SPACE_SIZE;
+
+pcie_doe_init(pci_dev, &pci_dev->doe_spdm, doe_offset, doe_spdm_prot, 
true, 0);
+
+pci_dev->doe_spdm.socket = spdm_socket_connect(pci_dev->spdm_port, 
errp);
+
+if (pci_dev->doe_spdm.socket < 0 ) {
+error_setg(errp, "Failed to connect to SPDM socket");


spdm_socket_connect() already sets errp in case of failure.


+return -ENOTSUP;


This function returns a boolean, so this return value is casted
to 'true'. IIUC true means success, so this is incorrect.


+}
+}
+
  if (n->params.cmb_size_mb) {
  nvme_init_cmb(n, pci_dev);
  }
@@ -8412,6 +8453,7 @@ static Property nvme_props[] = {
params.sriov_max_vi_per_vf, 0),
  DEFINE_PROP_UINT8("sriov_max_vq_per_vf", NvmeCtrl,
params.sriov_max_vq_per_vf, 0),
+DEFINE_PROP_UINT16("spdm", PCIDevice, spdm_port, 0),
  DEFINE_PROP_END_OF_LIST(),
  };


Regards,

Phil.



Re: [PATCH v5 2/3] backends: Initial support for SPDM socket support

2024-03-07 Thread Philippe Mathieu-Daudé

On 7/3/24 01:58, Alistair Francis wrote:

From: Huai-Cheng Kuo 

SPDM enables authentication, attestation and key exchange to assist in
providing infrastructure security enablement. It's a standard published
by the DMTF [1].

SPDM supports multiple transports, including PCIe DOE and MCTP.
This patch adds support to QEMU to connect to an external SPDM
instance.

SPDM support can be added to any QEMU device by exposing a
TCP socket to a SPDM server. The server can then implement the SPDM
decoding/encoding support, generally using libspdm [2].

This is similar to how the current TPM implementation works and means
that the heavy lifting of setting up certificate chains, capabilities,
measurements and complex crypto can be done outside QEMU by a well
supported and tested library.

1: https://www.dmtf.org/standards/SPDM
2: https://github.com/DMTF/libspdm

Signed-off-by: Huai-Cheng Kuo 
Signed-off-by: Chris Browy 
Co-developed-by: Jonathan Cameron 
Signed-off-by: Jonathan Cameron 
[ Changes by WM
  - Bug fixes from testing
]
Signed-off-by: Wilfred Mallawa 
[ Changes by AF:
  - Convert to be more QEMU-ified
  - Move to backends as it isn't PCIe specific
]
Signed-off-by: Alistair Francis 
---
  MAINTAINERS  |   6 +
  include/sysemu/spdm-socket.h |  44 +++
  backends/spdm-socket.c   | 216 +++
  backends/Kconfig |   4 +
  backends/meson.build |   2 +
  5 files changed, 272 insertions(+)
  create mode 100644 include/sysemu/spdm-socket.h
  create mode 100644 backends/spdm-socket.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 4183f2f3ab..a07706c225 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3395,6 +3395,12 @@ F: tests/qtest/*tpm*
  F: docs/specs/tpm.rst
  T: git https://github.com/stefanberger/qemu-tpm.git tpm-next
  
+SPDM

+M: Alistair Francis 
+S: Maintained
+F: backends/spdm-socket.c
+F: include/sysemu/spdm-socket.h
+
  Checkpatch
  S: Odd Fixes
  F: scripts/checkpatch.pl
diff --git a/include/sysemu/spdm-socket.h b/include/sysemu/spdm-socket.h
new file mode 100644
index 00..24e6fccb83
--- /dev/null
+++ b/include/sysemu/spdm-socket.h
@@ -0,0 +1,44 @@
+/*
+ * QEMU SPDM socket support
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#ifndef SPDM_REQUESTER_H
+#define SPDM_REQUESTER_H
+
+int spdm_socket_connect(uint16_t port, Error **errp);


Could we have a short description on what this function returns
and its arguments?


+uint32_t spdm_socket_rsp(const int socket, uint32_t transport_type,
+ void *req, uint32_t req_len,
+ void *rsp, uint32_t rsp_len);


Ditto.


+void spdm_socket_close(const int socket, uint32_t transport_type);
+
+#define SPDM_SOCKET_COMMAND_NORMAL0x0001
+#define SPDM_SOCKET_COMMAND_OOB_ENCAP_KEY_UPDATE  0x8001
+#define SPDM_SOCKET_COMMAND_CONTINUE  0xFFFD
+#define SPDM_SOCKET_COMMAND_SHUTDOWN  0xFFFE
+#define SPDM_SOCKET_COMMAND_UNKOWN0x
+#define SPDM_SOCKET_COMMAND_TEST  0xDEAD
+
+#define SPDM_SOCKET_TRANSPORT_TYPE_MCTP   0x01
+#define SPDM_SOCKET_TRANSPORT_TYPE_PCI_DOE0x02
+
+#define SPDM_SOCKET_MAX_MESSAGE_BUFFER_SIZE   0x1200
+
+#endif





Re: [PATCH v2 4/6] qapi: introduce device-sync-config

2024-03-07 Thread Vladimir Sementsov-Ogievskiy

On 07.03.24 12:54, Markus Armbruster wrote:

Vladimir Sementsov-Ogievskiy  writes:


Add command to sync config from vhost-user backend to the device. It
may be helpful when VHOST_USER_SLAVE_CONFIG_CHANGE_MSG failed or not
triggered interrupt to the guest or just not available (not supported
by vhost-user server).

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  hw/block/vhost-user-blk.c | 27 ---
  hw/virtio/virtio-pci.c|  9 +
  include/hw/qdev-core.h|  3 +++
  qapi/qdev.json| 17 +
  system/qdev-monitor.c | 23 +++
  5 files changed, 72 insertions(+), 7 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 9e6bbc6950..2f301f380c 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -88,27 +88,39 @@ static void vhost_user_blk_set_config(VirtIODevice *vdev, 
const uint8_t *config)
  s->blkcfg.wce = blkcfg->wce;
  }
  
+static int vhost_user_blk_sync_config(DeviceState *dev, Error **errp)

+{
+int ret;
+VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+VHostUserBlk *s = VHOST_USER_BLK(vdev);
+
+ret = vhost_dev_get_config(&s->dev, (uint8_t *)&s->blkcfg,
+   vdev->config_len, errp);
+if (ret < 0) {
+return ret;
+}
+
+memcpy(vdev->config, &s->blkcfg, vdev->config_len);
+virtio_notify_config(vdev);
+
+return 0;
+}
+
  static int vhost_user_blk_handle_config_change(struct vhost_dev *dev)
  {
  int ret;
-VirtIODevice *vdev = dev->vdev;
-VHostUserBlk *s = VHOST_USER_BLK(dev->vdev);
  Error *local_err = NULL;
  
  if (!dev->started) {

  return 0;
  }
  
-ret = vhost_dev_get_config(dev, (uint8_t *)&s->blkcfg,

-   vdev->config_len, &local_err);
+ret = vhost_user_blk_sync_config(DEVICE(dev->vdev), &local_err);
  if (ret < 0) {
  error_report_err(local_err);
  return ret;
  }
  
-memcpy(dev->vdev->config, &s->blkcfg, vdev->config_len);

-virtio_notify_config(dev->vdev);
-
  return 0;
  }
  
@@ -576,6 +588,7 @@ static void vhost_user_blk_class_init(ObjectClass *klass, void *data)
  
  device_class_set_props(dc, vhost_user_blk_properties);

  dc->vmsd = &vmstate_vhost_user_blk;
+dc->sync_config = vhost_user_blk_sync_config;
  set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
  vdc->realize = vhost_user_blk_device_realize;
  vdc->unrealize = vhost_user_blk_device_unrealize;
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 1a7039fb0c..1197341a3c 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -2318,6 +2318,14 @@ static void virtio_pci_dc_realize(DeviceState *qdev, 
Error **errp)
  vpciklass->parent_dc_realize(qdev, errp);
  }
  
+static int virtio_pci_sync_config(DeviceState *dev, Error **errp)

+{
+VirtIOPCIProxy *proxy = VIRTIO_PCI(dev);
+VirtIODevice *vdev = virtio_bus_get_device(&proxy->bus);
+
+return qdev_sync_config(DEVICE(vdev), errp);
+}
+
  static void virtio_pci_class_init(ObjectClass *klass, void *data)
  {
  DeviceClass *dc = DEVICE_CLASS(klass);
@@ -2334,6 +2342,7 @@ static void virtio_pci_class_init(ObjectClass *klass, 
void *data)
  device_class_set_parent_realize(dc, virtio_pci_dc_realize,
  &vpciklass->parent_dc_realize);
  rc->phases.hold = virtio_pci_bus_reset_hold;
+dc->sync_config = virtio_pci_sync_config;
  }
  
  static const TypeInfo virtio_pci_info = {

diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 9228e96c87..87135bdcdf 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -95,6 +95,7 @@ typedef void (*DeviceUnrealize)(DeviceState *dev);
  typedef void (*DeviceReset)(DeviceState *dev);
  typedef void (*BusRealize)(BusState *bus, Error **errp);
  typedef void (*BusUnrealize)(BusState *bus);
+typedef int (*DeviceSyncConfig)(DeviceState *dev, Error **errp);
  
  /**

   * struct DeviceClass - The base class for all devices.
@@ -162,6 +163,7 @@ struct DeviceClass {
  DeviceReset reset;
  DeviceRealize realize;
  DeviceUnrealize unrealize;
+DeviceSyncConfig sync_config;
  
  /**

   * @vmsd: device state serialisation description for
@@ -546,6 +548,7 @@ bool qdev_hotplug_allowed(DeviceState *dev, Error **errp);
   */
  HotplugHandler *qdev_get_hotplug_handler(DeviceState *dev);
  void qdev_unplug(DeviceState *dev, Error **errp);
+int qdev_sync_config(DeviceState *dev, Error **errp);
  void qdev_simple_device_unplug_cb(HotplugHandler *hotplug_dev,
DeviceState *dev, Error **errp);
  void qdev_machine_creation_done(void);
diff --git a/qapi/qdev.json b/qapi/qdev.json
index 32ffaee644..6ece164172 100644
--- a/qapi/qdev.json
+++ b/qapi/qdev.json
@@ -162,3 +162,20 @@
  ##
  { 'event': 'DEVICE_UNPLUG_GUEST_ERROR',
'data': { '*device': 'str', 'path': 'str' } }
+
+##
+# @d

Re: [PATCH v2 3/6] qdev-monitor: add option to report GenericError from find_device_state

2024-03-07 Thread Vladimir Sementsov-Ogievskiy

On 07.03.24 12:46, Markus Armbruster wrote:

Vladimir Sementsov-Ogievskiy  writes:


Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  system/qdev-monitor.c | 15 +++
  1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
index 9febb743f1..cf7481e416 100644
--- a/system/qdev-monitor.c
+++ b/system/qdev-monitor.c
@@ -877,13 +877,20 @@ void qmp_device_add(QDict *qdict, QObject **ret_data, 
Error **errp)
  object_unref(OBJECT(dev));
  }
  
-static DeviceState *find_device_state(const char *id, Error **errp)

+/*
+ * Note that creating new APIs using error classes other than GenericError is
+ * not recommended. Set use_generic_error=true for new interfaces.
+ */
+static DeviceState *find_device_state(const char *id, bool use_generic_error,
+  Error **errp)
  {
  Object *obj = object_resolve_path_at(qdev_get_peripheral(), id);
  DeviceState *dev;
  
  if (!obj) {

-error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,
+error_set(errp,
+  (use_generic_error ?
+   ERROR_CLASS_GENERIC_ERROR : ERROR_CLASS_DEVICE_NOT_FOUND),
"Device '%s' not found", id);
  return NULL;
  }
@@ -947,7 +954,7 @@ void qdev_unplug(DeviceState *dev, Error **errp)
  
  void qmp_device_del(const char *id, Error **errp)

  {
-DeviceState *dev = find_device_state(id, errp);
+DeviceState *dev = find_device_state(id, false, errp);
  if (dev != NULL) {
  if (dev->pending_deleted_event &&
  (dev->pending_deleted_expires_ms == 0 ||
@@ -1067,7 +1074,7 @@ BlockBackend *blk_by_qdev_id(const char *id, Error **errp)
  
  GLOBAL_STATE_CODE();
  
-dev = find_device_state(id, errp);

+dev = find_device_state(id, false, errp);
  if (dev == NULL) {
  return NULL;
  }


I appreciate the attempt to curb the spread of DeviceNotFound errors.
Two issues:

* Copy-pasting find_device_state() with a false argument is an easy
   error to make.

* Most uses of find_device_state() are via blk_by_qdev_id() and
   qmp_get_blk().  Any new uses of qemu_get_blk() will still produce
   DeviceNotFound.

Hmm.


Hmm. Right. Wait a bit, I can make the change stricter.

Could you still explain (or give a link), why and when we decided to use only 
GenericError? I think, having different "error-codes" is a good thing, why we 
are trying to get rid of it?

--
Best regards,
Vladimir




Re: [PATCH v2 6/6] qapi: introduce CONFIG_READ event

2024-03-07 Thread Markus Armbruster
Vladimir Sementsov-Ogievskiy  writes:

> Send a new event when guest reads virtio-pci config after
> virtio_notify_config() call.
>
> That's useful to check that guest fetched modified config, for example
> after resizing disk backend.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy 

[...]

> diff --git a/qapi/qdev.json b/qapi/qdev.json
> index 6ece164172..ffc5e3be18 100644
> --- a/qapi/qdev.json
> +++ b/qapi/qdev.json
> @@ -179,3 +179,29 @@
>  { 'command': 'device-sync-config',
>'features': [ 'unstable' ],
>'data': {'id': 'str'} }
> +
> +##
> +# @VIRTIO_CONFIG_READ:
> +#
> +# Emitted whenever guest reads virtio device config after config change.

Let's not abbreviate "configuration" to "config".

> +#
> +# @device: device name
> +#
> +# @path: device path
> +#
> +# Features:
> +#
> +# @unstable: The event is experimental.
> +#
> +# Since: 9.0
> +#
> +# Example:
> +#
> +# <- { "event": "VIRTIO_CONFIG_READ",
> +#  "data": { "device": "virtio-net-pci-0",
> +#"path": "/machine/peripheral/virtio-net-pci-0" },
> +#  "timestamp": { "seconds": 1265044230, "microseconds": 450486 } }
> +##

As for PATCH 4, I'd like to see some guidance on intended use.

> +{ 'event': 'VIRTIO_CONFIG_READ',
> +  'features': [ 'unstable' ],
> +  'data': { '*device': 'str', 'path': 'str' } }

[...]




Re: [PATCH v2 5/6] qapi: device-sync-config: check runstate

2024-03-07 Thread Markus Armbruster
Vladimir Sementsov-Ogievskiy  writes:

> Command result is racy if allow it during migration. Let's allow the
> sync only in RUNNING state.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy 

If I understand this correctly, the previous commit introduces a race,
and this one fixes it.

We normally avoid such temporary bugs.  When avoidance is impractical,
we point them out in commit message and FIXME comment.

> ---
>  include/sysemu/runstate.h |  1 +
>  system/qdev-monitor.c | 27 ++-
>  system/runstate.c |  5 +
>  3 files changed, 32 insertions(+), 1 deletion(-)
>
> diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
> index 0117d243c4..296af52322 100644
> --- a/include/sysemu/runstate.h
> +++ b/include/sysemu/runstate.h
> @@ -5,6 +5,7 @@
>  #include "qemu/notify.h"
>  
>  bool runstate_check(RunState state);
> +const char *current_run_state_str(void);
>  void runstate_set(RunState new_state);
>  RunState runstate_get(void);
>  bool runstate_is_running(void);
> diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
> index e3107a12d7..b83b5d23c9 100644
> --- a/system/qdev-monitor.c
> +++ b/system/qdev-monitor.c
> @@ -23,6 +23,7 @@
>  #include "monitor/monitor.h"
>  #include "monitor/qdev.h"
>  #include "sysemu/arch_init.h"
> +#include "sysemu/runstate.h"
>  #include "qapi/error.h"
>  #include "qapi/qapi-commands-qdev.h"
>  #include "qapi/qmp/dispatch.h"
> @@ -983,7 +984,31 @@ int qdev_sync_config(DeviceState *dev, Error **errp)
>  
>  void qmp_device_sync_config(const char *id, Error **errp)
>  {
> -DeviceState *dev = find_device_state(id, true, errp);
> +MigrationState *s = migrate_get_current();
> +DeviceState *dev;
> +
> +/*
> + * During migration there is a race between syncing`config and migrating 
> it,
> + * so let's just not allow it.
> + *
> + * Moreover, let's not rely on setting up interrupts in paused state, 
> which
> + * may be a part of migration process.

Wrap your comment lines around column 70, please.

> + */
> +
> +if (migration_is_running(s->state)) {
> +error_setg(errp, "Config synchronization is not allowed "
> +   "during migration.");
> +return;
> +}
> +
> +if (!runstate_is_running()) {
> +error_setg(errp, "Config synchronization allowed only in '%s' state, 
> "
> +   "current state is '%s'", RunState_str(RUN_STATE_RUNNING),
> +   current_run_state_str());
> +return;
> +}
> +
> +dev = find_device_state(id, true, errp);
>  if (!dev) {
>  return;
>  }
> diff --git a/system/runstate.c b/system/runstate.c
> index d6ab860eca..8fd89172ae 100644
> --- a/system/runstate.c
> +++ b/system/runstate.c
> @@ -189,6 +189,11 @@ bool runstate_check(RunState state)
>  return current_run_state == state;
>  }
>  
> +const char *current_run_state_str(void)
> +{
> +return RunState_str(current_run_state);
> +}
> +
>  static void runstate_init(void)
>  {
>  const RunStateTransition *p;




Re: [PATCH v2 4/6] qapi: introduce device-sync-config

2024-03-07 Thread Markus Armbruster
Vladimir Sementsov-Ogievskiy  writes:

> Add command to sync config from vhost-user backend to the device. It
> may be helpful when VHOST_USER_SLAVE_CONFIG_CHANGE_MSG failed or not
> triggered interrupt to the guest or just not available (not supported
> by vhost-user server).
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  hw/block/vhost-user-blk.c | 27 ---
>  hw/virtio/virtio-pci.c|  9 +
>  include/hw/qdev-core.h|  3 +++
>  qapi/qdev.json| 17 +
>  system/qdev-monitor.c | 23 +++
>  5 files changed, 72 insertions(+), 7 deletions(-)
>
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index 9e6bbc6950..2f301f380c 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -88,27 +88,39 @@ static void vhost_user_blk_set_config(VirtIODevice *vdev, 
> const uint8_t *config)
>  s->blkcfg.wce = blkcfg->wce;
>  }
>  
> +static int vhost_user_blk_sync_config(DeviceState *dev, Error **errp)
> +{
> +int ret;
> +VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> +VHostUserBlk *s = VHOST_USER_BLK(vdev);
> +
> +ret = vhost_dev_get_config(&s->dev, (uint8_t *)&s->blkcfg,
> +   vdev->config_len, errp);
> +if (ret < 0) {
> +return ret;
> +}
> +
> +memcpy(vdev->config, &s->blkcfg, vdev->config_len);
> +virtio_notify_config(vdev);
> +
> +return 0;
> +}
> +
>  static int vhost_user_blk_handle_config_change(struct vhost_dev *dev)
>  {
>  int ret;
> -VirtIODevice *vdev = dev->vdev;
> -VHostUserBlk *s = VHOST_USER_BLK(dev->vdev);
>  Error *local_err = NULL;
>  
>  if (!dev->started) {
>  return 0;
>  }
>  
> -ret = vhost_dev_get_config(dev, (uint8_t *)&s->blkcfg,
> -   vdev->config_len, &local_err);
> +ret = vhost_user_blk_sync_config(DEVICE(dev->vdev), &local_err);
>  if (ret < 0) {
>  error_report_err(local_err);
>  return ret;
>  }
>  
> -memcpy(dev->vdev->config, &s->blkcfg, vdev->config_len);
> -virtio_notify_config(dev->vdev);
> -
>  return 0;
>  }
>  
> @@ -576,6 +588,7 @@ static void vhost_user_blk_class_init(ObjectClass *klass, 
> void *data)
>  
>  device_class_set_props(dc, vhost_user_blk_properties);
>  dc->vmsd = &vmstate_vhost_user_blk;
> +dc->sync_config = vhost_user_blk_sync_config;
>  set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
>  vdc->realize = vhost_user_blk_device_realize;
>  vdc->unrealize = vhost_user_blk_device_unrealize;
> diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
> index 1a7039fb0c..1197341a3c 100644
> --- a/hw/virtio/virtio-pci.c
> +++ b/hw/virtio/virtio-pci.c
> @@ -2318,6 +2318,14 @@ static void virtio_pci_dc_realize(DeviceState *qdev, 
> Error **errp)
>  vpciklass->parent_dc_realize(qdev, errp);
>  }
>  
> +static int virtio_pci_sync_config(DeviceState *dev, Error **errp)
> +{
> +VirtIOPCIProxy *proxy = VIRTIO_PCI(dev);
> +VirtIODevice *vdev = virtio_bus_get_device(&proxy->bus);
> +
> +return qdev_sync_config(DEVICE(vdev), errp);
> +}
> +
>  static void virtio_pci_class_init(ObjectClass *klass, void *data)
>  {
>  DeviceClass *dc = DEVICE_CLASS(klass);
> @@ -2334,6 +2342,7 @@ static void virtio_pci_class_init(ObjectClass *klass, 
> void *data)
>  device_class_set_parent_realize(dc, virtio_pci_dc_realize,
>  &vpciklass->parent_dc_realize);
>  rc->phases.hold = virtio_pci_bus_reset_hold;
> +dc->sync_config = virtio_pci_sync_config;
>  }
>  
>  static const TypeInfo virtio_pci_info = {
> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
> index 9228e96c87..87135bdcdf 100644
> --- a/include/hw/qdev-core.h
> +++ b/include/hw/qdev-core.h
> @@ -95,6 +95,7 @@ typedef void (*DeviceUnrealize)(DeviceState *dev);
>  typedef void (*DeviceReset)(DeviceState *dev);
>  typedef void (*BusRealize)(BusState *bus, Error **errp);
>  typedef void (*BusUnrealize)(BusState *bus);
> +typedef int (*DeviceSyncConfig)(DeviceState *dev, Error **errp);
>  
>  /**
>   * struct DeviceClass - The base class for all devices.
> @@ -162,6 +163,7 @@ struct DeviceClass {
>  DeviceReset reset;
>  DeviceRealize realize;
>  DeviceUnrealize unrealize;
> +DeviceSyncConfig sync_config;
>  
>  /**
>   * @vmsd: device state serialisation description for
> @@ -546,6 +548,7 @@ bool qdev_hotplug_allowed(DeviceState *dev, Error **errp);
>   */
>  HotplugHandler *qdev_get_hotplug_handler(DeviceState *dev);
>  void qdev_unplug(DeviceState *dev, Error **errp);
> +int qdev_sync_config(DeviceState *dev, Error **errp);
>  void qdev_simple_device_unplug_cb(HotplugHandler *hotplug_dev,
>DeviceState *dev, Error **errp);
>  void qdev_machine_creation_done(void);
> diff --git a/qapi/qdev.json b/qapi/qdev.json
> index 32ffaee644..6ece164172 100644
> --- a/qapi/qdev.json

Re: [PATCH v2 3/6] qdev-monitor: add option to report GenericError from find_device_state

2024-03-07 Thread Markus Armbruster
Vladimir Sementsov-Ogievskiy  writes:

> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  system/qdev-monitor.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
> index 9febb743f1..cf7481e416 100644
> --- a/system/qdev-monitor.c
> +++ b/system/qdev-monitor.c
> @@ -877,13 +877,20 @@ void qmp_device_add(QDict *qdict, QObject **ret_data, 
> Error **errp)
>  object_unref(OBJECT(dev));
>  }
>  
> -static DeviceState *find_device_state(const char *id, Error **errp)
> +/*
> + * Note that creating new APIs using error classes other than GenericError is
> + * not recommended. Set use_generic_error=true for new interfaces.
> + */
> +static DeviceState *find_device_state(const char *id, bool use_generic_error,
> +  Error **errp)
>  {
>  Object *obj = object_resolve_path_at(qdev_get_peripheral(), id);
>  DeviceState *dev;
>  
>  if (!obj) {
> -error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,
> +error_set(errp,
> +  (use_generic_error ?
> +   ERROR_CLASS_GENERIC_ERROR : ERROR_CLASS_DEVICE_NOT_FOUND),
>"Device '%s' not found", id);
>  return NULL;
>  }
> @@ -947,7 +954,7 @@ void qdev_unplug(DeviceState *dev, Error **errp)
>  
>  void qmp_device_del(const char *id, Error **errp)
>  {
> -DeviceState *dev = find_device_state(id, errp);
> +DeviceState *dev = find_device_state(id, false, errp);
>  if (dev != NULL) {
>  if (dev->pending_deleted_event &&
>  (dev->pending_deleted_expires_ms == 0 ||
> @@ -1067,7 +1074,7 @@ BlockBackend *blk_by_qdev_id(const char *id, Error 
> **errp)
>  
>  GLOBAL_STATE_CODE();
>  
> -dev = find_device_state(id, errp);
> +dev = find_device_state(id, false, errp);
>  if (dev == NULL) {
>  return NULL;
>  }

I appreciate the attempt to curb the spread of DeviceNotFound errors.
Two issues:

* Copy-pasting find_device_state() with a false argument is an easy
  error to make.

* Most uses of find_device_state() are via blk_by_qdev_id() and
  qmp_get_blk().  Any new uses of qemu_get_blk() will still produce
  DeviceNotFound.

Hmm.




Re: [PATCH v5 0/3] Initial support for SPDM Responders

2024-03-07 Thread Jonathan Cameron via
On Thu,  7 Mar 2024 10:58:56 +1000
Alistair Francis  wrote:

> The Security Protocol and Data Model (SPDM) Specification defines
> messages, data objects, and sequences for performing message exchanges
> over a variety of transport and physical media.
>  - 
> https://www.dmtf.org/sites/default/files/standards/documents/DSP0274_1.3.0.pdf
> 
> SPDM currently supports PCIe DOE and MCTP transports, but it can be
> extended to support others in the future. This series adds
> support to QEMU to connect to an external SPDM instance.
> 
> SPDM support can be added to any QEMU device by exposing a
> TCP socket to a SPDM server. The server can then implement the SPDM
> decoding/encoding support, generally using libspdm [1].
> 
> This is similar to how the current TPM implementation works and means
> that the heavy lifting of setting up certificate chains, capabilities,
> measurements and complex crypto can be done outside QEMU by a well
> supported and tested library.
> 
> This series implements socket support and exposes SPDM for a NVMe device.

Thanks Alastair,

I'm really keen to seen this land soon as I have the CXL infrastructure
for this backed up behind it.  Also will be needed for PCI (IDE) and CXL link
encryption emulation and most if not all of the confidential computing stacks
with QEMU emulating the host system + peripherals.

I believe it's just waiting for a PCI Maintainer Ack at this point? Klaus said 
he
was happy to take it through NVME but wanted a PCI Ack first.

Michael / Marcel, if you have time to look at it that would be great.

Thanks,

Jonathan


> 
> 1: https://github.com/DMTF/libspdm
> 
> v5:
>  - Update MAINTAINERS
> v4:
>  - Rebase
> v3:
>  - Spelling fixes
>  - Support for SPDM-Utils
> v2:
>  - Add cover letter
>  - A few code fixes based on comments
>  - Document SPDM-Utils
>  - A few tweaks and clarifications to the documentation
> 
> Alistair Francis (1):
>   hw/pci: Add all Data Object Types defined in PCIe r6.0
> 
> Huai-Cheng Kuo (1):
>   backends: Initial support for SPDM socket support
> 
> Wilfred Mallawa (1):
>   hw/nvme: Add SPDM over DOE support
> 
>  MAINTAINERS  |   6 +
>  docs/specs/index.rst |   1 +
>  docs/specs/spdm.rst  | 122 
>  include/hw/pci/pci_device.h  |   5 +
>  include/hw/pci/pcie_doe.h|   5 +
>  include/sysemu/spdm-socket.h |  44 +++
>  backends/spdm-socket.c   | 216 +++
>  hw/nvme/ctrl.c   |  53 +
>  backends/Kconfig |   4 +
>  backends/meson.build |   2 +
>  10 files changed, 458 insertions(+)
>  create mode 100644 docs/specs/spdm.rst
>  create mode 100644 include/sysemu/spdm-socket.h
>  create mode 100644 backends/spdm-socket.c
> 




Re: [RFC 0/4] mirror: implement incremental and bitmap modes

2024-03-07 Thread Vladimir Sementsov-Ogievskiy

On 06.03.24 16:44, Fiona Ebner wrote:

Am 29.02.24 um 13:47 schrieb Fiona Ebner:

Am 29.02.24 um 12:48 schrieb Vladimir Sementsov-Ogievskiy:

On 29.02.24 13:11, Fiona Ebner wrote:


The iotest creates a new target image for each incremental sync which
only records the diff relative to the previous mirror and those diff
images are later rebased onto each other to get the full picture.

Thus, it can be that a previous mirror job (not just background process
or previous write) already copied a cluster, and in particular, copied
it to a different target!


Aha understand.

For simplicity, let's consider case, when source "cluster size" = "job
cluster size" = "bitmap granularity" = "target cluster size".

Which types of clusters we should consider, when we want to handle guest
write?

1. Clusters, that should be copied by background process

These are dirty clusters from user-given bitmap, or if we do a full-disk
mirror, all clusters, not yet copied by background process.

For such clusters we simply ignore the unaligned write. We can even
ignore the aligned write too: less disturbing the guest by delays.



Since do_sync_target_write() currently doesn't ignore aligned writes, I
wouldn't change it. Of course they can count towards the "done_bitmap"
you propose below.


2. Clusters, already copied by background process during this mirror job
and not dirtied by guest since this time.

For such clusters we are safe to do unaligned write, as target cluster
must be allocated.



Right.


3. Clusters, not marked initially by dirty bitmap.

What to do with them? We can't do unaligned write. I see two variants:

- do additional read from source, to fill the whole cluster, which seems
a bit too heavy



Yes, I'd rather only do that as a last resort.


- just mark the cluster as dirty for background job. So we behave like
in "background" mode. But why not? The maximum count of such "hacks" is
limited to number of "clear" clusters at start of mirror job, which
means that we don't seriously affect the convergence. Mirror is
guaranteed to converge anyway. And the whole sense of "write-blocking"
mode is to have a guaranteed convergence. What do you think?



It could lead to a lot of flips between job->actively_synced == true and
== false. AFAIU, currently, we only switch back from true to false when
an error happens. While I don't see a concrete issue with it, at least
it might be unexpected to users, so it better be documented.

I'll try going with this approach, thanks!



These flips are actually a problem. When using live-migration with disk
mirroring, it's good that an actively synced image stays actively
synced. Otherwise, migration could finish at an inconvenient time and
try to inactivate the block device while mirror still got something to
do which would lead to an assertion failure [0].


Hmm right. So, when mirror is actively-synced, we have to read the whole 
cluster from source to make an aligned write on target.



The IO test added by this series is what uses the possibility to sync to
"diff images" which contain only the delta. In production, we are only
syncing to a previously mirrored target image. Non-aligned writes are
not an issue later like with a diff image. (Even if the initial
mirroring happened via ZFS replication outside of QEMU).

So copy-mode=write-blocking would work fine for our use case, but if I
go with the "mark clusters for unaligned writes dirty"-approach, it
would not work fine anymore.

Should I rather just document the limitation for the combination "target
is a diff image" and copy-mode=write-blocking?


Of course, simply documenting the limitation is better than implementing a new 
feature, if you don't need the feature for production)



I'd still add the check for the granularity and target cluster size.


Check is good too.


While also only needed for diff images, it would allow using background
mode safely for those.

Best Regards,
Fiona

[0]:
https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/



--
Best regards,
Vladimir




Re: [PATCH v2 2/6] qdev-monitor: fix error message in find_device_state()

2024-03-07 Thread Markus Armbruster
Vladimir Sementsov-Ogievskiy  writes:

> This "hotpluggable" here is misleading. Actually we check is object a
> device or not. Let's drop the word.
>
> SUggested-by: Markus Armbruster 

"Suggested"

> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  system/qdev-monitor.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
> index a13db763e5..9febb743f1 100644
> --- a/system/qdev-monitor.c
> +++ b/system/qdev-monitor.c
> @@ -890,7 +890,7 @@ static DeviceState *find_device_state(const char *id, 
> Error **errp)
>  
>  dev = (DeviceState *)object_dynamic_cast(obj, TYPE_DEVICE);
>  if (!dev) {
> -error_setg(errp, "%s is not a hotpluggable device", id);
> +error_setg(errp, "%s is not a device", id);
>  return NULL;
>  }

Perhaps including what @id resolved to could be useful, but that's
out of scope for this patch.

Reviewed-by: Markus Armbruster