Re: [PATCH v8 09/10] hw/nvme: add reservation protocol command

2024-07-11 Thread Stefan Hajnoczi
On Tue, Jul 09, 2024 at 10:47:05AM +0800, Changqi Lu wrote:
> Add reservation acquire, reservation register,
> reservation release and reservation report commands
> in the nvme device layer.
> 
> By introducing these commands, this enables the nvme
> device to perform reservation-related tasks, including
> querying keys, querying reservation status, registering
> reservation keys, initiating and releasing reservations,
> as well as clearing and preempting reservations held by
> other keys.
> 
> These commands are crucial for management and control of
> shared storage resources in a persistent manner.
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> Acked-by: Klaus Jensen 
> ---
>  hw/nvme/ctrl.c   | 323 ++-
>  hw/nvme/nvme.h   |   4 +
>  include/block/nvme.h |  37 +
>  3 files changed, 363 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index ad212de723..a69a499078 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -294,6 +294,10 @@ static const uint32_t nvme_cse_iocs_nvm[256] = {
>  [NVME_CMD_COMPARE]  = NVME_CMD_EFF_CSUPP,
>  [NVME_CMD_IO_MGMT_RECV] = NVME_CMD_EFF_CSUPP,
>  [NVME_CMD_IO_MGMT_SEND] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
> +[NVME_CMD_RESV_REGISTER]= NVME_CMD_EFF_CSUPP,
> +[NVME_CMD_RESV_REPORT]  = NVME_CMD_EFF_CSUPP,
> +[NVME_CMD_RESV_ACQUIRE] = NVME_CMD_EFF_CSUPP,
> +[NVME_CMD_RESV_RELEASE] = NVME_CMD_EFF_CSUPP,
>  };
>  
>  static const uint32_t nvme_cse_iocs_zoned[256] = {
> @@ -308,6 +312,10 @@ static const uint32_t nvme_cse_iocs_zoned[256] = {
>  [NVME_CMD_ZONE_APPEND]  = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
>  [NVME_CMD_ZONE_MGMT_SEND]   = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
>  [NVME_CMD_ZONE_MGMT_RECV]   = NVME_CMD_EFF_CSUPP,
> +[NVME_CMD_RESV_REGISTER]= NVME_CMD_EFF_CSUPP,
> +[NVME_CMD_RESV_REPORT]  = NVME_CMD_EFF_CSUPP,
> +[NVME_CMD_RESV_ACQUIRE] = NVME_CMD_EFF_CSUPP,
> +[NVME_CMD_RESV_RELEASE] = NVME_CMD_EFF_CSUPP,
>  };
>  
>  static void nvme_process_sq(void *opaque);
> @@ -1745,6 +1753,7 @@ static void nvme_aio_err(NvmeRequest *req, int ret)
>  
>  switch (req->cmd.opcode) {
>  case NVME_CMD_READ:
> +case NVME_CMD_RESV_REPORT:
>  status = NVME_UNRECOVERED_READ;
>  break;
>  case NVME_CMD_FLUSH:
> @@ -1752,6 +1761,9 @@ static void nvme_aio_err(NvmeRequest *req, int ret)
>  case NVME_CMD_WRITE_ZEROES:
>  case NVME_CMD_ZONE_APPEND:
>  case NVME_CMD_COPY:
> +case NVME_CMD_RESV_REGISTER:
> +case NVME_CMD_RESV_ACQUIRE:
> +case NVME_CMD_RESV_RELEASE:
>  status = NVME_WRITE_FAULT;
>  break;
>  default:
> @@ -2127,7 +2139,10 @@ static inline bool nvme_is_write(NvmeRequest *req)
>  
>  return rw->opcode == NVME_CMD_WRITE ||
> rw->opcode == NVME_CMD_ZONE_APPEND ||
> -   rw->opcode == NVME_CMD_WRITE_ZEROES;
> +   rw->opcode == NVME_CMD_WRITE_ZEROES ||
> +   rw->opcode == NVME_CMD_RESV_REGISTER ||
> +   rw->opcode == NVME_CMD_RESV_ACQUIRE ||
> +   rw->opcode == NVME_CMD_RESV_RELEASE;
>  }

Why is this change necessary? The only nvme_is_write() caller I see is:

  void nvme_rw_complete_cb(void *opaque, int ret)
  {
  ...
  if (ns->params.zoned && nvme_is_write(req)) {
  nvme_finalize_zoned_write(ns, req);
  }

nvme_finalize_zoned_write() must not be called on reservation requests
because they don't use NvmeRwCmd:

  static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req)
  {
  NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;



Re: [PATCH v8 10/10] block/iscsi: add persistent reservation in/out driver

2024-07-11 Thread Stefan Hajnoczi
On Tue, Jul 09, 2024 at 10:47:06AM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations for iscsi driver.
> The following methods are implemented: bdrv_co_pr_read_keys,
> bdrv_co_pr_read_reservation, bdrv_co_pr_register, bdrv_co_pr_reserve,
> bdrv_co_pr_release, bdrv_co_pr_clear and bdrv_co_pr_preempt.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  block/iscsi.c | 425 ++
>  1 file changed, 425 insertions(+)

Aside from the g_free() issue I commented on:

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH v8 10/10] block/iscsi: add persistent reservation in/out driver

2024-07-11 Thread Stefan Hajnoczi
On Tue, Jul 09, 2024 at 10:47:06AM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations for iscsi driver.
> The following methods are implemented: bdrv_co_pr_read_keys,
> bdrv_co_pr_read_reservation, bdrv_co_pr_register, bdrv_co_pr_reserve,
> bdrv_co_pr_release, bdrv_co_pr_clear and bdrv_co_pr_preempt.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  block/iscsi.c | 425 ++
>  1 file changed, 425 insertions(+)
> 
> diff --git a/block/iscsi.c b/block/iscsi.c
> index 2ff14b7472..ba51f6d016 100644
> --- a/block/iscsi.c
> +++ b/block/iscsi.c
> @@ -96,6 +96,7 @@ typedef struct IscsiLun {
>  unsigned long *allocmap_valid;
>  long allocmap_size;
>  int cluster_size;
> +uint8_t pr_cap;
>  bool use_16_for_rw;
>  bool write_protected;
>  bool lbpme;
> @@ -280,6 +281,10 @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int 
> status,
>  iTask->err_code = -error;
>  iTask->err_str = g_strdup(iscsi_get_error(iscsi));
>  }
> +} else if (status == SCSI_STATUS_RESERVATION_CONFLICT) {
> +iTask->err_code = -EBADE;
> +error_report("iSCSI Persistent Reservation Conflict: %s",
> + iscsi_get_error(iscsi));
>  }
>  }
>  }
> @@ -1792,6 +1797,50 @@ static void iscsi_save_designator(IscsiLun *lun,
>  }
>  }
>  
> +/*
> + *  Ensure that iscsi_open() succeeds, whether or not the target
> + *  implements SCSI_PR_IN_REPORT_CAPABILITIES.
> + */
> +static void iscsi_get_pr_cap_sync(IscsiLun *iscsilun)
> +{
> +struct scsi_task *task = NULL;
> +struct scsi_persistent_reserve_in_report_capabilities *rc = NULL;
> +int retries = ISCSI_CMD_RETRIES;
> +int xferlen = sizeof(struct 
> scsi_persistent_reserve_in_report_capabilities);
> +
> +do {
> +if (task != NULL) {
> +scsi_free_scsi_task(task);
> +task = NULL;
> +}
> +
> +task = iscsi_persistent_reserve_in_sync(iscsilun->iscsi,
> +   iscsilun->lun, SCSI_PR_IN_REPORT_CAPABILITIES, xferlen);
> +if (task != NULL && task->status == SCSI_STATUS_GOOD) {
> +rc = scsi_datain_unmarshall(task);
> +if (rc == NULL) {
> +error_report("iSCSI: Failed to unmarshall "
> + "report capabilities data.");
> +} else {
> +iscsilun->pr_cap =
> +
> scsi_pr_cap_to_block(rc->persistent_reservation_type_mask);
> +iscsilun->pr_cap |= (rc->ptpl_a) ? BLK_PR_CAP_PTPL : 0;
> +}
> +break;
> +}
> +} while (task != NULL && task->status == SCSI_STATUS_CHECK_CONDITION
> + && task->sense.key == SCSI_SENSE_UNIT_ATTENTION
> + && retries-- > 0);
> +
> +if (task == NULL || task->status != SCSI_STATUS_GOOD) {
> +error_report("iSCSI: failed to send report capabilities command.");
> +}
> +
> +if (task) {
> +scsi_free_scsi_task(task);
> +}
> +}
> +
>  static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
>Error **errp)
>  {
> @@ -2024,6 +2073,7 @@ static int iscsi_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
>  }
>  
> +iscsi_get_pr_cap_sync(iscsilun);
>  out:
>  qemu_opts_del(opts);
>  g_free(initiator_name);
> @@ -2110,6 +2160,8 @@ static void iscsi_refresh_limits(BlockDriverState *bs, 
> Error **errp)
>  bs->bl.opt_transfer = pow2floor(iscsilun->bl.opt_xfer_len *
>  iscsilun->block_size);
>  }
> +
> +bs->bl.pr_cap = iscsilun->pr_cap;
>  }
>  
>  /* Note that this will not re-establish a connection with an iSCSI target - 
> it
> @@ -2408,6 +2460,371 @@ out_unlock:
>  return r;
>  }
>  
> +static int coroutine_fn
> +iscsi_co_pr_read_keys(BlockDriverState *bs, uint32_t *generation,
> +  uint32_t num_keys, uint64_t *keys)
> +{
> +IscsiLun *iscsilun = bs->opaque;
> +QEMUIOVector qiov;
> +struct IscsiTask iTask;
> +int xferlen = sizeof(struct scsi_persistent_reserve_in_read_keys) +
> +  sizeof(uint64_t) * num_keys;
> +g_autofree uint8_t *buf = g_malloc0(xferlen);
> +int32_t num_collect_keys = 0;
> +int r = 0;
> +
> +qemu_iovec_init_buf(&qiov, buf, xferlen);
> +iscsi_co_init_iscsitask(iscsilun, &iTask);
> +qemu_mutex_lock(&iscsilun->mutex);
> +retry:
> +iTask.task = iscsi_persistent_reserve_in_task(iscsilun->iscsi,
> + iscsilun->lun, SCSI_PR_IN_READ_KEYS, xferlen,
> + iscsi_co_generic_cb, &iTask);
> +
> +if (iTask.task == NULL) {
> +qemu_mutex_unlock(&iscsilun->mutex);
> +return -ENOMEM;
> +}
> +
> +scsi_task_set_iov_in(iTask.task, (struct scsi_iovec 

Re: [PATCH v8 08/10] hw/nvme: enable ONCS and rescap function

2024-07-11 Thread Stefan Hajnoczi
On Tue, Jul 09, 2024 at 10:47:04AM +0800, Changqi Lu wrote:
> This commit enables ONCS to support the reservation
> function at the controller level. Also enables rescap
> function in the namespace by detecting the supported reservation
> function in the backend driver.
> 
> Reviewed-by: Klaus Jensen 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  hw/nvme/ctrl.c   | 3 ++-
>  hw/nvme/ns.c | 5 +
>  include/block/nvme.h | 2 +-
>  3 files changed, 8 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH v8 05/10] hw/scsi: add persistent reservation in/out api for scsi device

2024-07-11 Thread Stefan Hajnoczi
On Tue, Jul 09, 2024 at 10:47:01AM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations in the
> SCSI device layer. By introducing the persistent
> reservation in/out api, this enables the SCSI device
> to perform reservation-related tasks, including querying
> keys, querying reservation status, registering reservation
> keys, initiating and releasing reservations, as well as
> clearing and preempting reservations held by other keys.
> 
> These operations are crucial for management and control of
> shared storage resources in a persistent manner.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  hw/scsi/scsi-disk.c | 368 
>  1 file changed, 368 insertions(+)

Paolo: Please review this patch, I'm not familiar enough with hw/scsi/.

Reviewed-by: Stefan Hajnoczi 



[PULL 2/2] Consider discard option when writing zeros

2024-07-11 Thread Stefan Hajnoczi
From: Nir Soffer 

When opening an image with discard=off, we punch hole in the image when
writing zeroes, making the image sparse. This breaks users that want to
ensure that writes cannot fail with ENOSPACE by using fully allocated
images[1].

bdrv_co_pwrite_zeroes() correctly disables BDRV_REQ_MAY_UNMAP if we
opened the child without discard=unmap or discard=on. But we don't go
through this function when accessing the top node. Move the check down
to bdrv_co_do_pwrite_zeroes() which seems to be used in all code paths.

This change implements the documented behavior, punching holes only when
opening the image with discard=on or discard=unmap. This may not be the
best default but can improve it later.

The test depends on a file system supporting discard, deallocating the
entire file when punching hole with the length of the entire file.
Tested with xfs, ext4, and tmpfs.

[1] https://lists.nongnu.org/archive/html/qemu-discuss/2024-06/msg3.html

Signed-off-by: Nir Soffer 
Message-id: 20240628202058.1964986-3-nsof...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 block/io.c|   9 +-
 tests/qemu-iotests/tests/write-zeroes-unmap   | 127 ++
 .../qemu-iotests/tests/write-zeroes-unmap.out |  81 +++
 3 files changed, 213 insertions(+), 4 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/write-zeroes-unmap
 create mode 100644 tests/qemu-iotests/tests/write-zeroes-unmap.out

diff --git a/block/io.c b/block/io.c
index 7217cf811b..301514c880 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1862,6 +1862,11 @@ bdrv_co_do_pwrite_zeroes(BlockDriverState *bs, int64_t 
offset, int64_t bytes,
 return -EINVAL;
 }
 
+/* If opened with discard=off we should never unmap. */
+if (!(bs->open_flags & BDRV_O_UNMAP)) {
+flags &= ~BDRV_REQ_MAY_UNMAP;
+}
+
 /* Invalidate the cached block-status data range if this write overlaps */
 bdrv_bsc_invalidate_range(bs, offset, bytes);
 
@@ -2315,10 +2320,6 @@ int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, 
int64_t offset,
 trace_bdrv_co_pwrite_zeroes(child->bs, offset, bytes, flags);
 assert_bdrv_graph_readable();
 
-if (!(child->bs->open_flags & BDRV_O_UNMAP)) {
-flags &= ~BDRV_REQ_MAY_UNMAP;
-}
-
 return bdrv_co_pwritev(child, offset, bytes, NULL,
BDRV_REQ_ZERO_WRITE | flags);
 }
diff --git a/tests/qemu-iotests/tests/write-zeroes-unmap 
b/tests/qemu-iotests/tests/write-zeroes-unmap
new file mode 100755
index 00..7cfeeaf839
--- /dev/null
+++ b/tests/qemu-iotests/tests/write-zeroes-unmap
@@ -0,0 +1,127 @@
+#!/usr/bin/env bash
+# group: quick
+#
+# Test write zeros unmap.
+#
+# Copyright (C) Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+trap _cleanup_test_img exit
+
+# get standard environment, filters and checks
+cd ..
+. ./common.rc
+. ./common.filter
+
+_supported_fmt raw
+_supported_proto file
+_supported_os Linux
+
+create_test_image() {
+_make_test_img -f $IMGFMT 1m
+}
+
+filter_command() {
+_filter_testdir | _filter_qemu_io | _filter_qemu | _filter_hmp
+}
+
+print_disk_usage() {
+du -sh $TEST_IMG | _filter_testdir
+}
+
+echo
+echo "=== defaults - write zeros ==="
+echo
+
+create_test_image
+echo -e 'qemu-io none0 "write -z 0 1m"\nquit' \
+| $QEMU -monitor stdio -drive if=none,file=$TEST_IMG,format=$IMGFMT \
+| filter_command
+print_disk_usage
+
+echo
+echo "=== defaults - write zeros unmap ==="
+echo
+
+create_test_image
+echo -e 'qemu-io none0 "write -zu 0 1m"\nquit' \
+| $QEMU -monitor stdio -drive if=none,file=$TEST_IMG,format=$IMGFMT \
+| filter_command
+print_disk_usage
+
+
+echo
+echo "=== defaults - write actual zeros ==="
+echo
+
+create_test_image
+echo -e 'qemu-io none0 "write -P 0 0 1m"\nquit' \
+| $QEMU -monitor stdio -drive if=none,file=$TEST_IMG,format=$IMGFMT \
+| filter_command
+print_disk_usage
+
+echo
+echo "=== discard=off - write zeroes unmap ==="
+echo
+
+create_test_image
+echo -e 'qemu-io none0 "write -zu 0 1m"\nquit' \
+| $QEMU -monitor stdio -drive 
if=none,file=$TEST_IMG,format=$IMGFMT,discard=off \
+| filter_command
+print_disk_usage
+

[PULL 1/2] qemu-iotest/245: Add missing discard=unmap

2024-07-11 Thread Stefan Hajnoczi
From: Nir Soffer 

The test works since we punch holes by default even when opening the
image without discard=on or discard=unmap. Fix the test to enable
discard.

Signed-off-by: Stefan Hajnoczi 
---
 tests/qemu-iotests/245 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/245 b/tests/qemu-iotests/245
index a934c9d1e6..f96610f510 100755
--- a/tests/qemu-iotests/245
+++ b/tests/qemu-iotests/245
@@ -592,7 +592,7 @@ class TestBlockdevReopen(iotests.QMPTestCase):
 @iotests.skip_if_unsupported(['compress'])
 def test_insert_compress_filter(self):
 # Add an image to the VM: hd (raw) -> hd0 (qcow2) -> hd0-file (file)
-opts = {'driver': 'raw', 'node-name': 'hd', 'file': hd_opts(0)}
+opts = {'driver': 'raw', 'node-name': 'hd', 'file': hd_opts(0), 
'discard': 'unmap'}
 self.vm.cmd('blockdev-add', conv_keys = False, **opts)
 
 # Add a 'compress' filter
-- 
2.45.2




[PULL 0/2] Block patches

2024-07-11 Thread Stefan Hajnoczi
The following changes since commit 59084feb256c617063e0dbe7e64821ae8852d7cf:

  Merge tag 'pull-aspeed-20240709' of https://github.com/legoater/qemu into 
staging (2024-07-09 07:13:55 -0700)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to d05ae948cc887054495977855b0859d0d4ab2613:

  Consider discard option when writing zeros (2024-07-11 11:06:36 +0200)


Pull request

A discard fix from Nir Soffer.



Nir Soffer (2):
  qemu-iotest/245: Add missing discard=unmap
  Consider discard option when writing zeros

 block/io.c|   9 +-
 tests/qemu-iotests/245|   2 +-
 tests/qemu-iotests/tests/write-zeroes-unmap   | 127 ++
 .../qemu-iotests/tests/write-zeroes-unmap.out |  81 +++
 4 files changed, 214 insertions(+), 5 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/write-zeroes-unmap
 create mode 100644 tests/qemu-iotests/tests/write-zeroes-unmap.out

-- 
2.45.2




Re: [PATCH v3 0/2] Consider discard option when writing zeros

2024-07-11 Thread Stefan Hajnoczi
On Fri, Jun 28, 2024 at 11:20:56PM +0300, Nir Soffer wrote:
> Punch holes only when the image is opened with discard=on or discard=unmap.
> 
> Tested by:
> - new write-zeroes-unmap iotest on xfs, ext4, and tmpfs
> - tests/qemu-iotests/check -raw
> - tests/qemu-iotests/check -qcow2
> 
> Changes since v2
> - Add write-zeroes-unmap iotest
> - Fix iotest missing discard=unmap
> 
> v2 was here:
> https://lists.nongnu.org/archive/html/qemu-block/2024-06/msg00231.html
> 
> Nir Soffer (2):
>   qemu-iotest/245: Add missing discard=unmap
>   Consider discard option when writing zeros
> 
>  block/io.c|   9 +-
>  tests/qemu-iotests/245|   2 +-
>  tests/qemu-iotests/tests/write-zeroes-unmap   | 127 ++
>  .../qemu-iotests/tests/write-zeroes-unmap.out |  81 +++
>  4 files changed, 214 insertions(+), 5 deletions(-)
>  create mode 100755 tests/qemu-iotests/tests/write-zeroes-unmap
>  create mode 100644 tests/qemu-iotests/tests/write-zeroes-unmap.out
> 
> -- 
> 2.45.2
> 

Thanks, applied to my block tree:
https://gitlab.com/stefanha/qemu/commits/block

Stefan



Re: [PATCH v3 2/2] Consider discard option when writing zeros

2024-07-11 Thread Stefan Hajnoczi
On Fri, Jun 28, 2024 at 11:20:58PM +0300, Nir Soffer wrote:
> When opening an image with discard=off, we punch hole in the image when
> writing zeroes, making the image sparse. This breaks users that want to
> ensure that writes cannot fail with ENOSPACE by using fully allocated
> images[1].
> 
> bdrv_co_pwrite_zeroes() correctly disables BDRV_REQ_MAY_UNMAP if we
> opened the child without discard=unmap or discard=on. But we don't go
> through this function when accessing the top node. Move the check down
> to bdrv_co_do_pwrite_zeroes() which seems to be used in all code paths.
> 
> This change implements the documented behavior, punching holes only when
> opening the image with discard=on or discard=unmap. This may not be the
> best default but can improve it later.
> 
> The test depends on a file system supporting discard, deallocating the
> entire file when punching hole with the length of the entire file.
> Tested with xfs, ext4, and tmpfs.
> 
> [1] https://lists.nongnu.org/archive/html/qemu-discuss/2024-06/msg3.html
> 
> Signed-off-by: Nir Soffer 
> ---
>  block/io.c|   9 +-
>  tests/qemu-iotests/tests/write-zeroes-unmap   | 127 ++
>  .../qemu-iotests/tests/write-zeroes-unmap.out |  81 +++
>  3 files changed, 213 insertions(+), 4 deletions(-)
>  create mode 100755 tests/qemu-iotests/tests/write-zeroes-unmap
>  create mode 100644 tests/qemu-iotests/tests/write-zeroes-unmap.out

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH v3 1/2] qemu-iotest/245: Add missing discard=unmap

2024-07-11 Thread Stefan Hajnoczi
On Fri, Jun 28, 2024 at 11:20:57PM +0300, Nir Soffer wrote:
> The test works since we punch holes by default even when opening the
> image without discard=on or discard=unmap. Fix the test to enable
> discard.
> ---
>  tests/qemu-iotests/245 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 



Re: [RFC PATCH v2 0/5] vhost-user: Add SHMEM_MAP/UNMAP requests

2024-07-11 Thread Stefan Hajnoczi
On Fri, Jun 28, 2024 at 04:57:05PM +0200, Albert Esteve wrote:
> Hi all,
> 
> v1->v2:
> - Corrected typos and clarifications from
>   first review
> - Added SHMEM_CONFIG frontend request to
>   query VIRTIO shared memory regions from
>   backends
> - vhost-user-device to use SHMEM_CONFIG
>   to request and initialise regions
> - Added MEM_READ/WRITE backend requests
>   in case address translation fails
>   accessing VIRTIO Shared Memory Regions
>   with MMAPs

Hi Albert,
I will be offline next week. I've posted comments.

I think the hard part will be adjusting vhost-user backend code to make
use of MEM_READ/MEM_WRITE when address translation fails. Ideally every
guest memory access (including vring accesses) should fall back to
MEM_READ/MEM_WRITE.
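
Roughly the shape I have in mind, as an untested sketch on top of this
series (vu_gpa_to_va() already exists in libvhost-user; vu_send_mem_read()
is the helper proposed in this series):

  static bool vu_read_guest(VuDev *dev, uint64_t guest_addr,
                            void *buf, uint32_t size)
  {
      uint64_t plen = size;
      /* Fast path: the address is covered by the memory table */
      void *ptr = vu_gpa_to_va(dev, &plen, guest_addr);

      if (ptr && plen == size) {
          memcpy(buf, ptr, size);
          return true;
      }

      /* Fallback: ask the front-end to perform the access for us */
      return vu_send_mem_read(dev, guest_addr, size, buf);
  }

Writes would follow the same pattern with the corresponding write helper.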

A good test for MEM_READ/MEM_WRITE is to completely skip setting the
memory table from the frontend and fall back for every guest memory
access. If the vhost-user backend still works without a memory table
then you know MEM_READ/MEM_WRITE is working too.

The vhost-user spec should probably contain a comment explaining that
MEM_READ/MEM_WRITE may be necessary when other device backends use
SHMEM_MAP due to the incomplete memory table that prevents translating
those memory addresses. In other words, if the guest has a device that
uses SHMEM_MAP, then all other vhost-user devices should support
MEM_READ/MEM_WRITE in order to ensure that DMA works with Shared Memory
Regions.

Stefan

> 
> This is an update of my attempt to have
> backends support dynamic fd mapping into VIRTIO
> Shared Memory Regions. After the first review
> I have added more commits and new messages
> to the vhost-user protocol.
> However, I still have some doubts as to
> how will this work, specially regarding
> the MEM_READ and MEM_WRITE commands.
> Thus, I am still looking for feedback,
> to ensure that I am going in the right
> direction with the implementation.
> 
> The usecase for this patch is, e.g., to support
> vhost-user-gpu RESOURCE_BLOB operations,
> or DAX Window request for virtio-fs. In
> general, any operation where a backend
> need to request the frontend to mmap an
> fd into a VIRTIO Shared Memory Region,
> so that the guest can then access it.
> 
> After receiving the SHMEM_MAP/UNMAP request,
> the frontend will perform the mmap with the
> instructed parameters (i.e., shmid, shm_offset,
> fd_offset, fd, length).
> 
> As there are already a couple devices
> that could benefit of such a feature,
> and more could require it in the future,
> the goal is to make the implementation
> generic.
> 
> To that end, the VIRTIO Shared Memory
> Region list is declared in the `VirtIODevice`
> struct.
> 
> This patch also includes:
> SHMEM_CONFIG frontend request that is
> specifically meant to allow generic
> vhost-user-device frontend to be able to
> query VIRTIO Shared Memory settings from the
> backend (as this device is generic and agnostic
> of the actual backend configuration).
> 
> Finally, MEM_READ/WRITE backend requests are
> added to deal with a potential issue when having
> any backend sharing a descriptor that references
> a mapping to another backend. The first
> backend will not be able to see these
> mappings. So these requests are a fallback
> for vhost-user memory translation fails.
> 
> Albert Esteve (5):
>   vhost-user: Add VIRTIO Shared Memory map request
>   vhost_user: Add frontend command for shmem config
>   vhost-user-dev: Add cache BAR
>   vhost_user: Add MEM_READ/WRITE backend requests
>   vhost_user: Implement mem_read/mem_write handlers
> 
>  docs/interop/vhost-user.rst   |  58 ++
>  hw/virtio/vhost-user-base.c   |  39 +++-
>  hw/virtio/vhost-user-device-pci.c |  37 +++-
>  hw/virtio/vhost-user.c| 221 ++
>  hw/virtio/virtio.c|  12 ++
>  include/hw/virtio/vhost-backend.h |   6 +
>  include/hw/virtio/vhost-user.h|   1 +
>  include/hw/virtio/virtio.h|   5 +
>  subprojects/libvhost-user/libvhost-user.c | 149 +++
>  subprojects/libvhost-user/libvhost-user.h |  91 +
>  10 files changed, 614 insertions(+), 5 deletions(-)
> 
> -- 
> 2.45.2
> 



Re: [RFC PATCH v2 5/5] vhost_user: Implement mem_read/mem_write handlers

2024-07-11 Thread Stefan Hajnoczi
On Fri, Jun 28, 2024 at 04:57:10PM +0200, Albert Esteve wrote:
> Implement function handlers for memory read and write
> operations.
> 
> Signed-off-by: Albert Esteve 
> ---
>  hw/virtio/vhost-user.c | 34 ++
>  1 file changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 18cacb2d68..79becbc87b 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -1884,16 +1884,42 @@ static int
>  vhost_user_backend_handle_mem_read(struct vhost_dev *dev,
> VhostUserMemRWMsg *mem_rw)
>  {
> -/* TODO */
> -return -EPERM;
> +ram_addr_t offset;
> +int fd;
> +MemoryRegion *mr;
> +
> +mr = vhost_user_get_mr_data(mem_rw->guest_address, &offset, &fd);
> +
> +if (!mr) {
> +error_report("Failed to get memory region with address %" PRIx64,
> + mem_rw->guest_address);
> +return -EFAULT;
> +}
> +
> +memcpy(mem_rw->data, memory_region_get_ram_ptr(mr) + offset, 
> mem_rw->size);

Don't try to write this from scratch. Use address_space_read/write(). It
supports corner cases like crossing MemoryRegions.
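
Something along these lines (untested sketch; I'm assuming the usual
dev->vdev back-pointer to the VirtIODevice):

  static int
  vhost_user_backend_handle_mem_read(struct vhost_dev *dev,
                                     VhostUserMemRWMsg *mem_rw)
  {
      VirtIODevice *vdev = dev->vdev;
      MemTxResult res;

      /* The memory API handles MemoryRegion boundaries for us */
      res = address_space_read(vdev->dma_as, mem_rw->guest_address,
                               MEMTXATTRS_UNSPECIFIED, mem_rw->data,
                               mem_rw->size);
      return res == MEMTX_OK ? 0 : -EFAULT;
  }

The write handler is the same with address_space_write().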

> +
> +return 0;
>  }
>  
>  static int
>  vhost_user_backend_handle_mem_write(struct vhost_dev *dev,
> VhostUserMemRWMsg *mem_rw)
>  {
> -/* TODO */
> -return -EPERM;
> +ram_addr_t offset;
> +int fd;
> +MemoryRegion *mr;
> +
> +mr = vhost_user_get_mr_data(mem_rw->guest_address, &offset, &fd);
> +
> +if (!mr) {
> +error_report("Failed to get memory region with address %" PRIx64,
> + mem_rw->guest_address);
> +return -EFAULT;
> +}
> +
> +memcpy(memory_region_get_ram_ptr(mr) + offset, mem_rw->data, 
> mem_rw->size);
> +
> +return 0;
>  }
>  
>  static void close_backend_channel(struct vhost_user *u)
> -- 
> 2.45.2
> 



Re: [RFC PATCH v2 4/5] vhost_user: Add MEM_READ/WRITE backend requests

2024-07-11 Thread Stefan Hajnoczi
On Fri, Jun 28, 2024 at 04:57:09PM +0200, Albert Esteve wrote:
> With SHMEM_MAP messages, sharing descriptors between
> devices will cause that these devices do not see the
> mappings, and fail to access these memory regions.
> 
> To solve this, introduce MEM_READ/WRITE requests
> that will get triggered as a fallback when
> vhost-user memory translation fails.
> 
> Signed-off-by: Albert Esteve 
> ---
>  hw/virtio/vhost-user.c| 31 +
>  subprojects/libvhost-user/libvhost-user.c | 84 +++
>  subprojects/libvhost-user/libvhost-user.h | 38 ++
>  3 files changed, 153 insertions(+)
> 
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 57406dc8b4..18cacb2d68 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -118,6 +118,8 @@ typedef enum VhostUserBackendRequest {
>  VHOST_USER_BACKEND_SHARED_OBJECT_LOOKUP = 8,
>  VHOST_USER_BACKEND_SHMEM_MAP = 9,
>  VHOST_USER_BACKEND_SHMEM_UNMAP = 10,
> +VHOST_USER_BACKEND_MEM_READ = 11,
> +VHOST_USER_BACKEND_MEM_WRITE = 12,
>  VHOST_USER_BACKEND_MAX
>  }  VhostUserBackendRequest;
>  
> @@ -145,6 +147,12 @@ typedef struct VhostUserShMemConfig {
>  uint64_t memory_sizes[VHOST_MEMORY_BASELINE_NREGIONS];
>  } VhostUserShMemConfig;
>  
> +typedef struct VhostUserMemRWMsg {
> +uint64_t guest_address;
> +uint32_t size;
> +uint8_t data[];

I don't think flexible array members work in VhostUserMsg payload
structs in its current form. It would be necessary to move the
VhostUserMsg.payload field to the end of the VhostUserMsg and then
heap-allocate VhostUserMsg with the additional size required for
VhostUserMemRWMsg.data[].

Right now this patch is calling memcpy() on memory beyond
VhostUserMsg.payload because the VhostUserMsg struct does not have size
bytes of extra space and the payload field is in the middle of the
struct where flexible array members cannot be used.
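
To illustrate the general pattern (hypothetical names, not the actual
vhost-user structs): the variably-sized data has to be the last member of
the outermost struct, and the message has to be heap-allocated with the
extra space:

  typedef struct {
      uint32_t request;
      uint32_t flags;
      uint32_t size;
      uint64_t guest_address;
      uint32_t mem_size;
      uint8_t data[];      /* flexible array member, must come last */
  } MemRWMsgExample;

  static MemRWMsgExample *mem_rw_msg_new(uint32_t payload_bytes)
  {
      /* allocate room for data[payload_bytes] after the fixed fields */
      MemRWMsgExample *msg = g_malloc0(sizeof(*msg) + payload_bytes);
      msg->mem_size = payload_bytes;
      return msg;
  }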

> +} VhostUserMemRWMsg;
> +
>  typedef struct VhostUserLog {
>  uint64_t mmap_size;
>  uint64_t mmap_offset;
> @@ -253,6 +261,7 @@ typedef union {
>  VhostUserTransferDeviceState transfer_state;
>  VhostUserMMap mmap;
>  VhostUserShMemConfig shmem;
> +VhostUserMemRWMsg mem_rw;
>  } VhostUserPayload;
>  
>  typedef struct VhostUserMsg {
> @@ -1871,6 +1880,22 @@ vhost_user_backend_handle_shmem_unmap(struct vhost_dev 
> *dev,
>  return 0;
>  }
>  
> +static int
> +vhost_user_backend_handle_mem_read(struct vhost_dev *dev,
> +   VhostUserMemRWMsg *mem_rw)
> +{
> +/* TODO */
> +return -EPERM;
> +}
> +
> +static int
> +vhost_user_backend_handle_mem_write(struct vhost_dev *dev,
> +   VhostUserMemRWMsg *mem_rw)
> +{
> +/* TODO */
> +return -EPERM;
> +}

Reading/writing guest memory can be done via
address_space_read/write(vdev->dma_as, ...).

> +
>  static void close_backend_channel(struct vhost_user *u)
>  {
>  g_source_destroy(u->backend_src);
> @@ -1946,6 +1971,12 @@ static gboolean backend_read(QIOChannel *ioc, 
> GIOCondition condition,
>  case VHOST_USER_BACKEND_SHMEM_UNMAP:
> +ret = vhost_user_backend_handle_shmem_unmap(dev, &payload.mmap);
>  break;
> +case VHOST_USER_BACKEND_MEM_READ:
> +ret = vhost_user_backend_handle_mem_read(dev, &payload.mem_rw);
> +break;
> +case VHOST_USER_BACKEND_MEM_WRITE:
> +ret = vhost_user_backend_handle_mem_write(dev, &payload.mem_rw);
> +break;
>  default:
>  error_report("Received unexpected msg type: %d.", hdr.request);
>  ret = -EINVAL;
> diff --git a/subprojects/libvhost-user/libvhost-user.c 
> b/subprojects/libvhost-user/libvhost-user.c
> index 28556d183a..b5184064b5 100644
> --- a/subprojects/libvhost-user/libvhost-user.c
> +++ b/subprojects/libvhost-user/libvhost-user.c
> @@ -1651,6 +1651,90 @@ vu_shmem_unmap(VuDev *dev, uint8_t shmid, uint64_t 
> fd_offset,
>  return vu_process_message_reply(dev, );
>  }
>  
> +bool
> +vu_send_mem_read(VuDev *dev, uint64_t guest_addr, uint32_t size,
> + uint8_t *data)
> +{
> +VhostUserMsg msg_reply;
> +VhostUserMsg msg = {
> +.request = VHOST_USER_BACKEND_MEM_READ,
> +.size = sizeof(msg.payload.mem_rw),
> +.flags = VHOST_USER_VERSION | VHOST_USER_NEED_REPLY_MASK,
> +.payload = {
> +.mem_rw = {
> +.guest_address = guest_addr,
> +.size = size,
> +}
> +}
> +};
> +
> +pthread_mutex_lock(&dev->backend_mutex);
> +if (!vu_message_write(dev, dev->backend_fd, &msg)) {
> +goto out_err;
> +}
> +
> +if (!vu_message_read_default(dev, dev->backend_fd, &msg_reply)) {
> +goto out_err;
> +}
> +
> +if (msg_reply.request != msg.request) {
> +DPRINT("Received unexpected msg type. Expected %d, received %d",
> +   msg.request, msg_reply.request);
> +goto out_err;
> +}
> +
> +if 

Re: [RFC PATCH v2 3/5] vhost-user-dev: Add cache BAR

2024-07-11 Thread Stefan Hajnoczi
On Fri, Jun 28, 2024 at 04:57:08PM +0200, Albert Esteve wrote:
> Add a cache BAR in the vhost-user-device
> into which files can be directly mapped.
> 
> The number, shmid, and size of the VIRTIO Shared
> Memory subregions is retrieved through a get_shmem_config
> message sent by the vhost-user-base module
> on the realize step, after virtio_init().
> 
> By default, if VHOST_USER_PROTOCOL_F_SHMEM
> feature is not supported by the backend,
> there is no cache.
> 
> Signed-off-by: Albert Esteve 

Michael: Please review vhost_user_device_pci_realize() below regarding
virtio-pci BAR layout. Thanks!

> ---
>  hw/virtio/vhost-user-base.c   | 39 +--
>  hw/virtio/vhost-user-device-pci.c | 37 ++---
>  2 files changed, 71 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/virtio/vhost-user-base.c b/hw/virtio/vhost-user-base.c
> index a83167191e..e47c568a55 100644
> --- a/hw/virtio/vhost-user-base.c
> +++ b/hw/virtio/vhost-user-base.c
> @@ -268,7 +268,9 @@ static void vub_device_realize(DeviceState *dev, Error 
> **errp)
>  {
>  VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>  VHostUserBase *vub = VHOST_USER_BASE(dev);
> -int ret;
> +uint64_t memory_sizes[8];
> +void *cache_ptr;
> +int i, ret, nregions;
>  
>  if (!vub->chardev.chr) {
>  error_setg(errp, "vhost-user-base: missing chardev");
> @@ -311,7 +313,7 @@ static void vub_device_realize(DeviceState *dev, Error 
> **errp)
>  
>  /* Allocate queues */
>  vub->vqs = g_ptr_array_sized_new(vub->num_vqs);
> -for (int i = 0; i < vub->num_vqs; i++) {
> +for (i = 0; i < vub->num_vqs; i++) {
>  g_ptr_array_add(vub->vqs,
>  virtio_add_queue(vdev, vub->vq_size,
>   vub_handle_output));
> @@ -328,6 +330,39 @@ static void vub_device_realize(DeviceState *dev, Error 
> **errp)
>  do_vhost_user_cleanup(vdev, vub);
>  }
>  
> +ret = vub->vhost_dev.vhost_ops->vhost_get_shmem_config(&vub->vhost_dev,
> +   &nregions,
> +   memory_sizes,
> +   errp);
> +
> +if (ret < 0) {
> +do_vhost_user_cleanup(vdev, vub);
> +}
> +
> +for (i = 0; i < nregions; i++) {
> +if (memory_sizes[i]) {
> +if (!is_power_of_2(memory_sizes[i]) ||
> +memory_sizes[i] < qemu_real_host_page_size()) {

Or just if (memory_sizes[i] % qemu_real_host_page_size() != 0)?

> +error_setg(errp, "Shared memory %d size must be a power of 2 
> "
> + "no smaller than the page size", i);
> +return;
> +}
> +
> +cache_ptr = mmap(NULL, memory_sizes[i], PROT_READ,

Should this be PROT_NONE like in
vhost_user_backend_handle_shmem_unmap()?

> +MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> +if (cache_ptr == MAP_FAILED) {
> +error_setg(errp, "Unable to mmap blank cache: %s",
> +   strerror(errno));

error_setg_errno() can be used here.

> +return;
> +}
> +
> +virtio_new_shmem_region(vdev);
> +memory_region_init_ram_ptr(&vdev->shmem_list[i],
> +OBJECT(vdev), "vub-shm-" + i,
> +memory_sizes[i], cache_ptr);
> +}
> +}
> +
>  qemu_chr_fe_set_handlers(>chardev, NULL, NULL, vub_event, NULL,
>   dev, NULL, true);
>  }
> diff --git a/hw/virtio/vhost-user-device-pci.c 
> b/hw/virtio/vhost-user-device-pci.c
> index efaf55d3dd..314bacfb7a 100644
> --- a/hw/virtio/vhost-user-device-pci.c
> +++ b/hw/virtio/vhost-user-device-pci.c
> @@ -8,14 +8,18 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qapi/error.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/virtio/vhost-user-base.h"
>  #include "hw/virtio/virtio-pci.h"
>  
> +#define VIRTIO_DEVICE_PCI_CACHE_BAR 2
> +
>  struct VHostUserDevicePCI {
>  VirtIOPCIProxy parent_obj;
>  
>  VHostUserBase vub;
> +MemoryRegion cachebar;
>  };
>  
>  #define TYPE_VHOST_USER_DEVICE_PCI "vhost-user-device-pci-base"
> @@ -25,10 +29,37 @@ OBJECT_DECLARE_SIMPLE_TYPE(VHostUserDevicePCI, 
> VHOST_USER_DEVICE_PCI)
>  static void vhost_user_device_pci_realize(VirtIOPCIProxy *vpci_dev, Error 
> **errp)
>  {
>  VHostUserDevicePCI *dev = VHOST_USER_DEVICE_PCI(vpci_dev);
> -DeviceState *vdev = DEVICE(&dev->vub);
> -
> +DeviceState *dev_state = DEVICE(&dev->vub);
> +VirtIODevice *vdev = VIRTIO_DEVICE(dev_state);
> +uint64_t offset = 0, cache_size = 0;
> +int i;
> +
>  vpci_dev->nvectors = 1;
> -qdev_realize(vdev, BUS(&vpci_dev->bus), errp);
> +qdev_realize(dev_state, BUS(&vpci_dev->bus), errp);
> +
> +for (i = 0; i < vdev->n_shmem_regions; i++) {
> +if 

Re: [RFC PATCH v2 2/5] vhost_user: Add frontend command for shmem config

2024-07-11 Thread Stefan Hajnoczi
On Fri, Jun 28, 2024 at 04:57:07PM +0200, Albert Esteve wrote:
> The frontend can use this command to retrieve
> VIRTIO Shared Memory Regions configuration from
> the backend. The response contains the number of
> shared memory regions, their size, and shmid.
> 
> This is useful when the frontend is unaware of
> specific backend type and configuration,
> for example, in the `vhost-user-device` case.
> 
> Signed-off-by: Albert Esteve 
> ---
>  docs/interop/vhost-user.rst   | 31 +++
>  hw/virtio/vhost-user.c| 42 +++
>  include/hw/virtio/vhost-backend.h |  6 +
>  include/hw/virtio/vhost-user.h|  1 +
>  4 files changed, 80 insertions(+)
> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index d52ba719d5..51f01d1d84 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -348,6 +348,19 @@ Device state transfer parameters
>In the future, additional phases might be added e.g. to allow
>iterative migration while the device is running.
>  
> +VIRTIO Shared Memory Region configuration
> +^
> +
> ++-+-++++
> +| num regions | padding | mem size 0 | .. | mem size 7 |
> ++-+-++++
> +
> +:num regions: a 32-bit number of regions
> +
> +:padding: 32-bit
> +
> +:mem size: 64-bit size of VIRTIO Shared Memory Region
> +
>  C structure
>  ---
>  
> @@ -369,6 +382,10 @@ In QEMU the vhost-user message is implemented with the 
> following struct:
>VhostUserConfig config;
>VhostUserVringArea area;
>VhostUserInflight inflight;
> +  VhostUserShared object;
> +  VhostUserTransferDeviceState transfer_state;
> +  VhostUserMMap mmap;
> +  VhostUserShMemConfig shmem;
>};
>} QEMU_PACKED VhostUserMsg;
>  
> @@ -1051,6 +1068,7 @@ Protocol features
>#define VHOST_USER_PROTOCOL_F_XEN_MMAP 17
>#define VHOST_USER_PROTOCOL_F_SHARED_OBJECT18
>#define VHOST_USER_PROTOCOL_F_DEVICE_STATE 19
> +  #define VHOST_USER_PROTOCOL_F_SHMEM20
>  
>  Front-end message types
>  ---
> @@ -1725,6 +1743,19 @@ Front-end message types
>Using this function requires prior negotiation of the
>``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature.
>  
> +``VHOST_USER_GET_SHMEM_CONFIG``
> +  :id: 44
> +  :equivalent ioctl: N/A
> +  :request payload: N/A
> +  :reply payload: ``struct VhostUserShMemConfig``
> +
> +  When the ``VHOST_USER_PROTOCOL_F_SHMEM`` protocol feature has been
> +  successfully negotiated, this message can be submitted by the front-end
> +  to gather the VIRTIO Shared Memory Region configuration. Back-end will 
> respond
> +  with the number of VIRTIO Shared Memory Regions it requires, and each 
> shared memory
> +  region size in an array. The shared memory IDs are represented by the index
> +  of the array.

Please add:
- The Shared Memory Region size must be a multiple of the page size supported 
by mmap(2).
- The size may be 0 if the region is unused. This can happen when the
  device does not support an optional feature but does support a feature
  that uses a higher shmid.



Re: [RFC PATCH v2 2/5] vhost_user: Add frontend command for shmem config

2024-07-11 Thread Stefan Hajnoczi
On Fri, Jun 28, 2024 at 04:57:07PM +0200, Albert Esteve wrote:
> The frontend can use this command to retrieve
> VIRTIO Shared Memory Regions configuration from
> the backend. The response contains the number of
> shared memory regions, their size, and shmid.
> 
> This is useful when the frontend is unaware of
> specific backend type and configuration,
> for example, in the `vhost-user-device` case.
> 
> Signed-off-by: Albert Esteve 
> ---
>  docs/interop/vhost-user.rst   | 31 +++
>  hw/virtio/vhost-user.c| 42 +++
>  include/hw/virtio/vhost-backend.h |  6 +
>  include/hw/virtio/vhost-user.h|  1 +
>  4 files changed, 80 insertions(+)
> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index d52ba719d5..51f01d1d84 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -348,6 +348,19 @@ Device state transfer parameters
>In the future, additional phases might be added e.g. to allow
>iterative migration while the device is running.
>  
> +VIRTIO Shared Memory Region configuration
> +^
> +
> ++-+-++++
> +| num regions | padding | mem size 0 | .. | mem size 7 |
> ++-+-++++

8 regions may not be enough. The max according to the VIRTIO spec is
256 because virtio-pci uses an 8-bit cap.id field for the shmid. I think
the maximum number should be 256 here.

(I haven't checked the QEMU vhost-user code to see whether it's
reasonable to hardcode to 256 or some logic if needed to dynamically
size the buffer depending on the "num regions" field.)
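
For reference, this is the virtio-pci capability layout from the VIRTIO
spec (quoted from memory, so double-check against the spec); the shmid is
carried in the single-byte id field:

  struct virtio_pci_cap {
      u8 cap_vndr;    /* Generic PCI field: PCI_CAP_ID_VNDR */
      u8 cap_next;    /* Generic PCI field: next ptr */
      u8 cap_len;     /* Capability length */
      u8 cfg_type;    /* Identifies the structure */
      u8 bar;         /* Which BAR to find it in */
      u8 id;          /* Distinguishes capabilities of the same type (shmid) */
      u8 padding[2];  /* Pad to full dword */
      le32 offset;    /* Offset within bar */
      le32 length;    /* Length of the structure, in bytes */
  };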

> +
> +:num regions: a 32-bit number of regions
> +
> +:padding: 32-bit
> +
> +:mem size: 64-bit size of VIRTIO Shared Memory Region
> +
>  C structure
>  ---
>  
> @@ -369,6 +382,10 @@ In QEMU the vhost-user message is implemented with the 
> following struct:
>VhostUserConfig config;
>VhostUserVringArea area;
>VhostUserInflight inflight;
> +  VhostUserShared object;
> +  VhostUserTransferDeviceState transfer_state;
> +  VhostUserMMap mmap;

Why are these added by this patch? Please add them in the same patch
where they are introduced.

> +  VhostUserShMemConfig shmem;
>};
>} QEMU_PACKED VhostUserMsg;
>  
> @@ -1051,6 +1068,7 @@ Protocol features
>#define VHOST_USER_PROTOCOL_F_XEN_MMAP 17
>#define VHOST_USER_PROTOCOL_F_SHARED_OBJECT18
>#define VHOST_USER_PROTOCOL_F_DEVICE_STATE 19
> +  #define VHOST_USER_PROTOCOL_F_SHMEM20
>  
>  Front-end message types
>  ---
> @@ -1725,6 +1743,19 @@ Front-end message types
>Using this function requires prior negotiation of the
>``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature.
>  
> +``VHOST_USER_GET_SHMEM_CONFIG``
> +  :id: 44
> +  :equivalent ioctl: N/A
> +  :request payload: N/A
> +  :reply payload: ``struct VhostUserShMemConfig``
> +
> +  When the ``VHOST_USER_PROTOCOL_F_SHMEM`` protocol feature has been
> +  successfully negotiated, this message can be submitted by the front-end
> +  to gather the VIRTIO Shared Memory Region configuration. Back-end will 
> respond
> +  with the number of VIRTIO Shared Memory Regions it requires, and each 
> shared memory
> +  region size in an array. The shared memory IDs are represented by the index
> +  of the array.

Is the information returned by SHMEM_CONFIG valid and unchanged for the
entire lifetime of the vhost-user connection?

I think the answer is yes because the enumeration that virtio-pci and
virtio-mmio transports support is basically a one-time operation at
driver startup and it is static (Shared Memory Regions do not appear or
go away at runtime). Please be explicit how VHOST_USER_GET_SHMEM_CONFIG
is intended to be used.

> +
>  Back-end message types
>  --
>  
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 7ee8a472c6..57406dc8b4 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -104,6 +104,7 @@ typedef enum VhostUserRequest {
>  VHOST_USER_GET_SHARED_OBJECT = 41,
>  VHOST_USER_SET_DEVICE_STATE_FD = 42,
>  VHOST_USER_CHECK_DEVICE_STATE = 43,
> +VHOST_USER_GET_SHMEM_CONFIG = 44,
>  VHOST_USER_MAX
>  } VhostUserRequest;
>  
> @@ -138,6 +139,12 @@ typedef struct VhostUserMemRegMsg {
>  VhostUserMemoryRegion region;
>  } VhostUserMemRegMsg;
>  
> +typedef struct VhostUserShMemConfig {
> +uint32_t nregions;
> +uint32_t padding;
> +uint64_t memory_sizes[VHOST_MEMORY_BASELINE_NREGIONS];
> +} VhostUserShMemConfig;
> +
>  typedef struct VhostUserLog {
>  uint64_t mmap_size;
>  uint64_t mmap_offset;
> @@ -245,6 +252,7 @@ typedef union {
>  VhostUserShared object;
>  VhostUserTransferDeviceState transfer_state;
>  VhostUserMMap mmap;

Re: [RFC PATCH v2 1/5] vhost-user: Add VIRTIO Shared Memory map request

2024-07-11 Thread Stefan Hajnoczi
On Fri, Jun 28, 2024 at 04:57:06PM +0200, Albert Esteve wrote:
> Add SHMEM_MAP/UNMAP requests to vhost-user to
> handle VIRTIO Shared Memory mappings.
> 
> This request allows backends to dynamically map
> fds into a VIRTIO Shared Memory Region identified
> by its `shmid`. Then, the fd memory is advertised
> to the driver as a base address + offset, so it
> can be read/written (depending on the mmap flags
> requested) while it is valid.
> 
> The backend can munmap the memory range
> in a given VIRTIO Shared Memory Region (again,
> identified by its `shmid`), to free it. Upon
> receiving this message, the front-end must
> mmap the regions with PROT_NONE to reserve
> the virtual memory space.
> 
> The device model needs to create MemoryRegion
> instances for the VIRTIO Shared Memory Regions
> and add them to the `VirtIODevice` instance.
> 
> Signed-off-by: Albert Esteve 
> ---
>  docs/interop/vhost-user.rst   |  27 +
>  hw/virtio/vhost-user.c| 122 ++
>  hw/virtio/virtio.c|  12 +++
>  include/hw/virtio/virtio.h|   5 +
>  subprojects/libvhost-user/libvhost-user.c |  65 
>  subprojects/libvhost-user/libvhost-user.h |  53 ++
>  6 files changed, 284 insertions(+)
> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index d8419fd2f1..d52ba719d5 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -1859,6 +1859,33 @@ is sent by the front-end.
>when the operation is successful, or non-zero otherwise. Note that if the
>operation fails, no fd is sent to the backend.
>  
> +``VHOST_USER_BACKEND_SHMEM_MAP``
> +  :id: 9
> +  :equivalent ioctl: N/A
> +  :request payload: fd and ``struct VhostUserMMap``
> +  :reply payload: N/A
> +
> +  This message can be submitted by the backends to advertise a new mapping
> +  to be made in a given VIRTIO Shared Memory Region. Upon receiving the 
> message,
> +  The front-end will mmap the given fd into the VIRTIO Shared Memory Region
> +  with the requested ``shmid``. A reply is generated indicating whether 
> mapping
> +  succeeded.
> +
> +  Mapping over an already existing map is not allowed and the request shall fail.
> +  Therefore, the memory range in the request must correspond with a valid,
> +  free region of the VIRTIO Shared Memory Region.
> +
> +``VHOST_USER_BACKEND_SHMEM_UNMAP``
> +  :id: 10
> +  :equivalent ioctl: N/A
> +  :request payload: ``struct VhostUserMMap``
> +  :reply payload: N/A
> +
> +  This message can be submitted by the backends so that the front-end un-mmap
> +  a given range (``offset``, ``len``) in the VIRTIO Shared Memory Region with

s/offset/shm_offset/

> +  the requested ``shmid``.

Please clarify that (shm_offset, len) must correspond to the entirety of a
valid mapped region.

By the way, the VIRTIO 1.3 gives the following behavior for the virtiofs
DAX Window:

  When a FUSE_SETUPMAPPING request perfectly overlaps a previous
  mapping, the previous mapping is replaced. When a mapping partially
  overlaps a previous mapping, the previous mapping is split into one or
  two smaller mappings. When a mapping is partially unmapped it is also
  split into one or two smaller mappings.

  Establishing new mappings or splitting existing mappings consumes
  resources. If the device runs out of resources the FUSE_SETUPMAPPING
  request fails until resources are available again following
  FUSE_REMOVEMAPPING.

I think SETUPMAPPING/REMOVMAPPING can be implemented using
SHMEM_MAP/UNMAP. SHMEM_MAP/UNMAP do not allow atomically replacing
partial ranges, but as far as I know that's not necessary for virtiofs
in practice.

It's worth mentioning that mappings consume resources and that SHMEM_MAP
can fail when there are no resources available. The process-wide limit
is vm.max_map_count on Linux although a vhost-user frontend may reduce
it further to control vhost-user resource usage.

> +  A reply is generated indicating whether unmapping succeeded.
> +
>  .. _reply_ack:
>  
>  VHOST_USER_PROTOCOL_F_REPLY_ACK
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index cdf9af4a4b..7ee8a472c6 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -115,6 +115,8 @@ typedef enum VhostUserBackendRequest {
>  VHOST_USER_BACKEND_SHARED_OBJECT_ADD = 6,
>  VHOST_USER_BACKEND_SHARED_OBJECT_REMOVE = 7,
>  VHOST_USER_BACKEND_SHARED_OBJECT_LOOKUP = 8,
> +VHOST_USER_BACKEND_SHMEM_MAP = 9,
> +VHOST_USER_BACKEND_SHMEM_UNMAP = 10,
>  VHOST_USER_BACKEND_MAX
>  }  VhostUserBackendRequest;
>  
> @@ -192,6 +194,24 @@ typedef struct VhostUserShared {
>  unsigned char uuid[16];
>  } VhostUserShared;
>  
> +/* For the flags field of VhostUserMMap */
> +#define VHOST_USER_FLAG_MAP_R (1u << 0)
> +#define VHOST_USER_FLAG_MAP_W (1u << 1)
> +
> +typedef struct {
> +/* VIRTIO Shared Memory Region ID */
> +uint8_t shmid;
> +uint8_t padding[7];
> +/* File offset */
> +

Re: [PATCH 0/2] block/graph-lock: Make WITH_GRAPH_RDLOCK_GUARD() fully checked

2024-07-11 Thread Stefan Hajnoczi
On Thu, Jun 27, 2024 at 08:12:43PM +0200, Kevin Wolf wrote:
> Newer clang versions allow us to check scoped guards more thoroughly.
> Surprisingly, we only seem to have missed one instance with the old
> incomplete checks.
> 
> Kevin Wolf (2):
>   block-copy: Fix missing graph lock
>   block/graph-lock: Make WITH_GRAPH_RDLOCK_GUARD() fully checked
> 
>  include/block/graph-lock.h | 21 ++---
>  block/block-copy.c |  4 +++-
>  meson.build| 14 +-
>  3 files changed, 30 insertions(+), 9 deletions(-)
> 
> -- 
> 2.45.2
> 

Reviewed-by: Stefan Hajnoczi 



Re: [RFC] Per-request private data in virtio-block

2024-07-10 Thread Stefan Hajnoczi
On Wed, Jul 10, 2024 at 01:08:03PM +0300, Dmitry Fleytman wrote:
> Hello QEMU-DEVEL! It's been a while…
> 
> I work on a solution for "smart" IO caching on the host side.
> The configuration is virtio-block device backed by SPDK with OCF/OpenCAS
> on top of remote storage.
> 
> To improve decision making for caching, I would like to have an additional
> per-request contextual information from the guest side. It might include
> information about the process issuing an IO request, a file this request is tied
> to and so on. In general, I'd like the set of collected information to be
> flexible and configurable.
> 
> I searched mailing lists and other related sources and surprisingly found no
> mentions of the topic of having custom per-request data in virtio rings.
> This makes me think that either I'm missing an obvious way of doing this
> or the concept itself is severely broken up to the point it's not even
> getting discussed. There is also a possibility that I'm just missing a proper
> search keywords...
> 
> I understand there might be be security implications to be considered.
> Also having custom kernel patches to have a flexibility of choosing which
> data to collect is probably not a viable solution.
> 
> Please share your thoughts.
> 
> I would like understand what is the right way of doing what I'm looking for.
> If there are new mechanisms to be implemented in virtio or other parts of
> the codebase, I'll gladly work on this for the sake of community.

Hi Dmitry,
Welcome back! The struct virtio_blk_outhdr 32-bit ioprio field is
currently ignored by many device implementations. Is I/O priority
directly related to cache behavior? If yes, your virtio-blk device
implementation could interpret this field. If not, then it's probably
better to add a separate field.
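
For context, this is the per-request header from the Linux uapi
(include/uapi/linux/virtio_blk.h):

  struct virtio_blk_outhdr {
      __virtio32 type;    /* VIRTIO_BLK_T_IN, VIRTIO_BLK_T_OUT, ... */
      __virtio32 ioprio;  /* I/O priority hint, ignored by many devices */
      __virtio64 sector;  /* Starting sector for read/write requests */
  };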

It would be technically possible to add a virtio-blk feature bit for
per-request metadata. Plumbing this new metadata through the stack
requires changes to multiple software components though and might be the
reason why something like this does not exist.

virtio-blk generally sticks to the generic block device model that the
Linux kernel block layer implements. If the Linux block layer doesn't
have the cache metadata concept, then it may require some convincing of
the VIRTIO and Linux communities to add this new concept. Is the concept
of cache metadata (separate from I/O priority in Linux) a thing?

If you can show examples from other storage protocols like NVMe or SCSI,
then that would help in designing a virtio-blk solution too.

Stefan

> 
> Thank you,
> Dmitry
> 



Re: [PATCH] virtio: Always reset vhost devices

2024-07-10 Thread Stefan Hajnoczi
On Wed, 10 Jul 2024 at 13:25, Hanna Czenczek  wrote:
>
> Requiring `vhost_started` to be true for resetting vhost devices in
> `virtio_reset()` seems like the wrong condition: Most importantly, the
> preceding `virtio_set_status(vdev, 0)` call will (for vhost devices) end
> up in `vhost_dev_stop()` (through vhost devices' `.set_status`
> implementations), setting `vdev->vhost_started = false`.  Therefore, the
> gated `vhost_reset_device()` call is unreachable.
>
> `vhost_started` is not documented, so it is hard to say what exactly it
> is supposed to mean, but judging from the fact that `vhost_dev_start()`
> sets it and `vhost_dev_stop()` clears it, it seems like it indicates
> whether there is a vhost back-end, and whether that back-end is
> currently running and processing virtio requests.
>
> Making a reset conditional on whether the vhost back-end is processing
> virtio requests seems wrong; in fact, it is probably better to reset it
> only when it is not currently processing requests, which is exactly the
> current order of operations in `virtio_reset()`: First, the back-end is
> stopped through `virtio_set_status(vdev, 0)`, then we want to send a
> reset.
>
> Therefore, we should drop the `vhost_started` condition, but in its
> stead we then have to verify that we can indeed send a reset to this
> vhost device, by not just checking `k->get_vhost != NULL` (introduced by
> commit 95e1019a4a9), but also that the vhost back-end is connected
> (`hdev = k->get_vhost(); hdev != NULL && hdev->vhost_ops != NULL`).
>
> Signed-off-by: Hanna Czenczek 

I think an additional SET_STATUS 0 call is made to the vDPA vhost
backend after this patch, but that seems fine.

Reviewed-by: Stefan Hajnoczi 

> ---
>  hw/virtio/virtio.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 893a072c9d..4410d62126 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -2146,8 +2146,12 @@ void virtio_reset(void *opaque)
>  vdev->device_endian = virtio_default_endian();
>  }
>
> -if (vdev->vhost_started && k->get_vhost) {
> -vhost_reset_device(k->get_vhost(vdev));
> +if (k->get_vhost) {
> +struct vhost_dev *hdev = k->get_vhost(vdev);
> +/* Only reset when vhost back-end is connected */
> +if (hdev && hdev->vhost_ops) {
> +vhost_reset_device(hdev);
> +}
>  }
>
>  if (k->reset) {
> --
> 2.45.2
>
>



[PATCH] qdev-monitor: QAPIfy QMP device_add

2024-07-08 Thread Stefan Hajnoczi
The QMP device_add monitor command converts the QDict arguments to
QemuOpts and then back again to QDict. This process only supports scalar
types. Device properties like virtio-blk-pci's iothread-vq-mapping (an
array of objects) are silently dropped by qemu_opts_from_qdict() during
the QemuOpts conversion even though QAPI is capable of validating them.
As a result, hotplugging virtio-blk-pci devices with the
iothread-vq-mapping property does not work as expected (the property is
ignored). It's time to QAPIfy QMP device_add!

Get rid of the QemuOpts conversion in qmp_device_add() and call
qdev_device_add_from_qdict() with from_json=true. Using the QMP
command's QDict arguments directly allows non-scalar properties.

The HMP is also adjusted since qmp_device_add() now expects properly
typed JSON arguments and cannot be used from HMP anymore. Move the code
that was previously in qmp_device_add() (with QemuOpts conversion and
from_json=false) into hmp_device_add() so that its behavior is
unchanged.

This patch changes the behavior of QMP device_add but not HMP
device_add. QMP clients that sent incorrectly typed device_add QMP
commands no longer work. This is a breaking change but clients should be
using the correct types already. See the netdev_add QAPIfication in
commit db2a380c8457 for similar reasoning.

Markus helped me figure this out and even provided a draft patch. The
code ended up very close to what he suggested.

Suggested-by: Markus Armbruster 
Cc: Daniel P. Berrangé 
Signed-off-by: Stefan Hajnoczi 
---
 system/qdev-monitor.c | 41 -
 1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
index 6af6ef7d66..1427aa173c 100644
--- a/system/qdev-monitor.c
+++ b/system/qdev-monitor.c
@@ -849,18 +849,9 @@ void hmp_info_qdm(Monitor *mon, const QDict *qdict)
 
 void qmp_device_add(QDict *qdict, QObject **ret_data, Error **errp)
 {
-QemuOpts *opts;
 DeviceState *dev;
 
-opts = qemu_opts_from_qdict(qemu_find_opts("device"), qdict, errp);
-if (!opts) {
-return;
-}
-if (!monitor_cur_is_qmp() && qdev_device_help(opts)) {
-qemu_opts_del(opts);
-return;
-}
-dev = qdev_device_add(opts, errp);
+dev = qdev_device_add_from_qdict(qdict, true, errp);
 if (!dev) {
 /*
  * Drain all pending RCU callbacks. This is done because
@@ -872,8 +863,6 @@ void qmp_device_add(QDict *qdict, QObject **ret_data, Error 
**errp)
  * to the user
  */
 drain_call_rcu();
-
-qemu_opts_del(opts);
 return;
 }
 object_unref(OBJECT(dev));
@@ -967,8 +956,34 @@ void qmp_device_del(const char *id, Error **errp)
 void hmp_device_add(Monitor *mon, const QDict *qdict)
 {
 Error *err = NULL;
+QemuOpts *opts;
+DeviceState *dev;
 
-qmp_device_add((QDict *)qdict, NULL, &err);
+opts = qemu_opts_from_qdict(qemu_find_opts("device"), qdict, &err);
+if (!opts) {
+goto out;
+}
+if (qdev_device_help(opts)) {
+qemu_opts_del(opts);
+return;
+}
+dev = qdev_device_add(opts, &err);
+if (!dev) {
+/*
+ * Drain all pending RCU callbacks. This is done because
+ * some bus related operations can delay a device removal
+ * (in this case this can happen if device is added and then
+ * removed due to a configuration error)
+ * to a RCU callback, but user might expect that this interface
+ * will finish its job completely once qmp command returns result
+ * to the user
+ */
+drain_call_rcu();
+
+qemu_opts_del(opts);
+}
+object_unref(OBJECT(dev));
+out:
 hmp_handle_error(mon, err);
 }
 
-- 
2.45.2




Re: [PATCH v7 00/10] Support persistent reservation operations

2024-07-08 Thread Stefan Hajnoczi
On Fri, Jul 05, 2024 at 06:56:04PM +0800, Changqi Lu wrote:
> Hi,
> 
> Patch v7 has been modified.
> Thanks again to Stefan for reviewing the code.
> 
> v6->v7:
> - Add buferlen size check at SCSI layer.
> - Add pr_cap calculation in bdrv_merge_limits() function at block layer,
>   so the ugly bs->file->bs->bl.pr_cap in scsi and nvme layers was
>   changed to bs->bl.pr_cap.
> - Fix memory leak at iscsi driver, and some other spelling errors.

I have left comments. Thanks!

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v7 06/10] block/nvme: add reservation command protocol constants

2024-07-08 Thread Stefan Hajnoczi
On Fri, Jul 05, 2024 at 06:56:10PM +0800, Changqi Lu wrote:
> Add constants for the NVMe persistent command protocol.
> The constants include the reservation command opcode and
> reservation type values defined in section 7 of the NVMe
> 2.0 specification.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  include/block/nvme.h | 61 
>  1 file changed, 61 insertions(+)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH v7 10/10] block/iscsi: add persistent reservation in/out driver

2024-07-08 Thread Stefan Hajnoczi
On Fri, Jul 05, 2024 at 06:56:14PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations for iscsi driver.
> The following methods are implemented: bdrv_co_pr_read_keys,
> bdrv_co_pr_read_reservation, bdrv_co_pr_register, bdrv_co_pr_reserve,
> bdrv_co_pr_release, bdrv_co_pr_clear and bdrv_co_pr_preempt.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  block/iscsi.c | 431 ++
>  1 file changed, 431 insertions(+)
> 
> diff --git a/block/iscsi.c b/block/iscsi.c
> index 2ff14b7472..9a546f48de 100644
> --- a/block/iscsi.c
> +++ b/block/iscsi.c
> @@ -96,6 +96,7 @@ typedef struct IscsiLun {
>  unsigned long *allocmap_valid;
>  long allocmap_size;
>  int cluster_size;
> +uint8_t pr_cap;
>  bool use_16_for_rw;
>  bool write_protected;
>  bool lbpme;
> @@ -280,6 +281,10 @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int 
> status,
>  iTask->err_code = -error;
>  iTask->err_str = g_strdup(iscsi_get_error(iscsi));
>  }
> +} else if (status == SCSI_STATUS_RESERVATION_CONFLICT) {
> +iTask->err_code = -EBADE;
> +error_report("iSCSI Persistent Reservation Conflict: %s",
> + iscsi_get_error(iscsi));
>  }
>  }
>  }
> @@ -1792,6 +1797,52 @@ static void iscsi_save_designator(IscsiLun *lun,
>  }
>  }
>  
> +static void iscsi_get_pr_cap_sync(IscsiLun *iscsilun, Error **errp)
> +{
> +struct scsi_task *task = NULL;
> +struct scsi_persistent_reserve_in_report_capabilities *rc = NULL;
> +int retries = ISCSI_CMD_RETRIES;
> +int xferlen = sizeof(struct 
> scsi_persistent_reserve_in_report_capabilities);
> +
> +do {
> +if (task != NULL) {
> +scsi_free_scsi_task(task);
> +task = NULL;
> +}
> +
> +task = iscsi_persistent_reserve_in_sync(iscsilun->iscsi,
> +   iscsilun->lun, SCSI_PR_IN_REPORT_CAPABILITIES, xferlen);
> +if (task != NULL && task->status == SCSI_STATUS_GOOD) {
> +rc = scsi_datain_unmarshall(task);
> +if (rc == NULL) {
> +error_setg(errp,
> +"iSCSI: Failed to unmarshall report capabilities data.");
> +} else {
> +iscsilun->pr_cap =
> +
> scsi_pr_cap_to_block(rc->persistent_reservation_type_mask);
> +iscsilun->pr_cap |= (rc->ptpl_a) ? BLK_PR_CAP_PTPL : 0;
> +}
> +break;
> +}
> +
> +if (task != NULL && task->status == SCSI_STATUS_CHECK_CONDITION
> +&& task->sense.key == SCSI_SENSE_UNIT_ATTENTION) {
> +break;
> +}
> +
> +} while (task != NULL && task->status == SCSI_STATUS_CHECK_CONDITION
> + && task->sense.key == SCSI_SENSE_UNIT_ATTENTION
> + && retries-- > 0);

The if statement is the same condition as the while statement (except
for the retry counter)? It looks like the retry logic won't work in
practice because the if statement breaks out of the loop early.
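
One possible restructuring (a rough, untested sketch based only on the
quoted code) is to drop the early-break if statement and let the loop
condition alone decide whether to retry on UNIT ATTENTION:

  do {
      if (task != NULL) {
          scsi_free_scsi_task(task);
          task = NULL;
      }

      task = iscsi_persistent_reserve_in_sync(iscsilun->iscsi,
              iscsilun->lun, SCSI_PR_IN_REPORT_CAPABILITIES, xferlen);
      if (task != NULL && task->status == SCSI_STATUS_GOOD) {
          break;    /* success, unmarshall the capabilities below */
      }
  } while (task != NULL && task->status == SCSI_STATUS_CHECK_CONDITION
           && task->sense.key == SCSI_SENSE_UNIT_ATTENTION
           && retries-- > 0);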

> +
> +if (task == NULL || task->status != SCSI_STATUS_GOOD) {
> +error_setg(errp, "iSCSI: failed to send report capabilities 
> command");
> +}

Did you test this function against a SCSI target that does not implement
the optional PERSISTENT RESERVE IN operation? iscsi_open() must succeed
when the target does not implement this.
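
For illustration, one way to keep iscsi_open() working on such targets
(a sketch; it assumes libiscsi reports the unsupported service action as
a CHECK CONDITION with ILLEGAL REQUEST sense, which should be verified)
is to treat that case as "no PR support" rather than an error:

  if (task != NULL && task->status == SCSI_STATUS_CHECK_CONDITION &&
      task->sense.key == SCSI_SENSE_ILLEGAL_REQUEST) {
      /* Target lacks PERSISTENT RESERVE IN: leave iscsilun->pr_cap at 0
       * and let iscsi_open() continue instead of setting errp. */
  } else if (task == NULL || task->status != SCSI_STATUS_GOOD) {
      error_setg(errp, "iSCSI: failed to send report capabilities command");
  }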

> +
> +if (task) {
> +scsi_free_scsi_task(task);
> +}
> +}
> +
>  static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
>Error **errp)
>  {
> @@ -2024,6 +2075,11 @@ static int iscsi_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
>  }
>  
> +iscsi_get_pr_cap_sync(iscsilun, &local_err);
> +if (local_err != NULL) {
> +error_propagate(errp, local_err);
> +ret = -EINVAL;
> +}
>  out:
>  qemu_opts_del(opts);
>  g_free(initiator_name);
> @@ -2110,6 +2166,8 @@ static void iscsi_refresh_limits(BlockDriverState *bs, 
> Error **errp)
>  bs->bl.opt_transfer = pow2floor(iscsilun->bl.opt_xfer_len *
>  iscsilun->block_size);
>  }
> +
> +bs->bl.pr_cap = iscsilun->pr_cap;
>  }
>  
>  /* Note that this will not re-establish a connection with an iSCSI target - 
> it
> @@ -2408,6 +2466,371 @@ out_unlock:
>  return r;
>  }
>  
> +static int coroutine_fn
> +iscsi_co_pr_read_keys(BlockDriverState *bs, uint32_t *generation,
> +  uint32_t num_keys, uint64_t *keys)
> +{
> +IscsiLun *iscsilun = bs->opaque;
> +QEMUIOVector qiov;
> +struct IscsiTask iTask;
> +int xferlen = sizeof(struct scsi_persistent_reserve_in_read_keys) +
> +  sizeof(uint64_t) * num_keys;
> +uint8_t *buf = 

Re: [PATCH v7 05/10] hw/scsi: add persistent reservation in/out api for scsi device

2024-07-08 Thread Stefan Hajnoczi
On Fri, Jul 05, 2024 at 06:56:09PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations in the
> SCSI device layer. By introducing the persistent
> reservation in/out api, this enables the SCSI device
> to perform reservation-related tasks, including querying
> keys, querying reservation status, registering reservation
> keys, initiating and releasing reservations, as well as
> clearing and preempting reservations held by other keys.
> 
> These operations are crucial for management and control of
> shared storage resources in a persistent manner.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  hw/scsi/scsi-disk.c | 368 
>  1 file changed, 368 insertions(+)
> 
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index 4bd7af9d0c..f0c3ce774f 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -32,6 +32,7 @@
>  #include "migration/vmstate.h"
>  #include "hw/scsi/emulation.h"
>  #include "scsi/constants.h"
> +#include "scsi/utils.h"
>  #include "sysemu/block-backend.h"
>  #include "sysemu/blockdev.h"
>  #include "hw/block/block.h"
> @@ -42,6 +43,7 @@
>  #include "qemu/cutils.h"
>  #include "trace.h"
>  #include "qom/object.h"
> +#include "block/block_int.h"
>  
>  #ifdef __linux
>  #include 
> @@ -1474,6 +1476,362 @@ static void scsi_disk_emulate_read_data(SCSIRequest 
> *req)
>  scsi_req_complete(&r->req, GOOD);
>  }
>  
> +typedef struct SCSIPrReadKeys {
> +uint32_t generation;
> +uint32_t num_keys;
> +uint64_t *keys;
> +SCSIDiskReq *req;
> +} SCSIPrReadKeys;
> +
> +typedef struct SCSIPrReadReservation {
> +uint32_t generation;
> +uint64_t key;
> +BlockPrType type;
> +SCSIDiskReq *req;
> +} SCSIPrReadReservation;
> +
> +static void scsi_pr_read_keys_complete(void *opaque, int ret)
> +{
> +int num_keys;
> +uint8_t *buf;
> +SCSIPrReadKeys *blk_keys = (SCSIPrReadKeys *)opaque;
> +SCSIDiskReq *r = blk_keys->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +num_keys = MIN(blk_keys->num_keys, ret);

The behavior of scsi_disk_req_check_error() above is strange for
pr_read_keys operations. When --drive ...,rerror=ignore is used and ret < 0,
this line is reached and we don't want a negative num_keys value. It would be
safer to use (ret > 0 ? ret : 0) instead of the raw value of ret.
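
Concretely, the suggested clamp would look something like:

  num_keys = MIN(blk_keys->num_keys, ret > 0 ? ret : 0);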

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v7 08/10] hw/nvme: enable ONCS and rescap function

2024-07-08 Thread Stefan Hajnoczi
On Fri, Jul 05, 2024 at 06:56:12PM +0800, Changqi Lu wrote:
> This commit enables ONCS to support the reservation
> function at the controller level. Also enables rescap
> function in the namespace by detecting the supported reservation
> function in the backend driver.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  hw/nvme/ctrl.c   | 3 ++-
>  hw/nvme/ns.c | 5 +
>  include/block/nvme.h | 2 +-
>  3 files changed, 8 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH v6 08/10] hw/nvme: enable ONCS and rescap function

2024-07-05 Thread Stefan Hajnoczi
On Thu, Jul 04, 2024 at 08:20:31PM +0200, Stefan Hajnoczi wrote:
> On Thu, Jun 13, 2024 at 03:13:25PM +0800, Changqi Lu wrote:
> > This commit enables ONCS to support the reservation
> > function at the controller level. Also enables rescap
> > function in the namespace by detecting the supported reservation
> > function in the backend driver.
> > 
> > Signed-off-by: Changqi Lu 
> > Signed-off-by: zhenwei pi 
> > ---
> >  hw/nvme/ctrl.c | 3 ++-
> >  hw/nvme/ns.c   | 5 +
> >  2 files changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > index 127c3d2383..182307a48b 100644
> > --- a/hw/nvme/ctrl.c
> > +++ b/hw/nvme/ctrl.c
> > @@ -8248,7 +8248,8 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
> > *pci_dev)
> >  id->nn = cpu_to_le32(NVME_MAX_NAMESPACES);
> >  id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROES | NVME_ONCS_TIMESTAMP |
> > NVME_ONCS_FEATURES | NVME_ONCS_DSM |
> > -   NVME_ONCS_COMPARE | NVME_ONCS_COPY);
> > +   NVME_ONCS_COMPARE | NVME_ONCS_COPY |
> > +   NVME_ONCS_RESRVATIONS);
> 
> RESRVATIONS -> RESERVATIONS typo?
> 
> >  
> >  /*
> >   * NOTE: If this device ever supports a command set that does NOT use 
> > 0x0
> > diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
> > index ea8db175db..320c9bf658 100644
> > --- a/hw/nvme/ns.c
> > +++ b/hw/nvme/ns.c
> > @@ -20,6 +20,7 @@
> >  #include "qemu/bitops.h"
> >  #include "sysemu/sysemu.h"
> >  #include "sysemu/block-backend.h"
> > +#include "block/block_int.h"
> >  
> >  #include "nvme.h"
> >  #include "trace.h"
> > @@ -33,6 +34,7 @@ void nvme_ns_init_format(NvmeNamespace *ns)
> >  BlockDriverInfo bdi;
> >  int npdg, ret;
> >  int64_t nlbas;
> > +uint8_t blk_pr_cap;
> >  
> >  ns->lbaf = id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
> >  ns->lbasz = 1 << ns->lbaf.ds;
> > @@ -55,6 +57,9 @@ void nvme_ns_init_format(NvmeNamespace *ns)
> >  }
> >  
> >  id_ns->npda = id_ns->npdg = npdg - 1;
> > +
> > +blk_pr_cap = blk_bs(ns->blkconf.blk)->file->bs->bl.pr_cap;
> 
> Kevin: This unprotected block graph access and the assumption that
> ->file->bs exists could be problematic. What is the best practice for
> making this code safe and defensive?

I posted the following reply in another sub-thread and it seems worth
mentioning here:

"->file could be NULL if the SCSI disk points directly to
--blockdev file without a --blockdev raw on top. I think the block layer
should propagate pr_cap from the leaves of the block graph to the root
node via bdrv_merge_limits() so that traversing the graph (->file) is
not necessary. Instead this line should just be bs->bl.pr_cap."

I think ->file shouldn't be accessed at all. That also sidesteps the
block graph locking question.
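
A sketch of that propagation (assuming bdrv_merge_limits() in block/io.c
is where the other BlockLimits fields are already combined, and reusing
the helper names from this series):

  /* block/io.c */
  static void bdrv_merge_limits(BlockLimits *dst, const BlockLimits *src)
  {
      /* ... existing merging of the other limits ... */
      dst->pr_cap |= src->pr_cap;
  }

  /* hw/nvme/ns.c then only looks at the top-level node: */
  blk_pr_cap = blk_bs(ns->blkconf.blk)->bl.pr_cap;
  id_ns->rescap = block_pr_cap_to_nvme(blk_pr_cap);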

> 
> > +id_ns->rescap = block_pr_cap_to_nvme(blk_pr_cap);
> >  }
> >  
> >  static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
> > -- 
> > 2.20.1
> > 




signature.asc
Description: PGP signature


Re: [PATCH v6 09/10] hw/nvme: add reservation protocal command

2024-07-04 Thread Stefan Hajnoczi
I will skip this since Klaus Jensen's review is required for NVMe anyway.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v6 10/10] block/iscsi: add persistent reservation in/out driver

2024-07-04 Thread Stefan Hajnoczi
On Thu, Jun 13, 2024 at 03:13:27PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations for iscsi driver.
> The following methods are implemented: bdrv_co_pr_read_keys,
> bdrv_co_pr_read_reservation, bdrv_co_pr_register, bdrv_co_pr_reserve,
> bdrv_co_pr_release, bdrv_co_pr_clear and bdrv_co_pr_preempt.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  block/iscsi.c | 443 ++
>  1 file changed, 443 insertions(+)
> 
> diff --git a/block/iscsi.c b/block/iscsi.c
> index 2ff14b7472..d94ebe35bd 100644
> --- a/block/iscsi.c
> +++ b/block/iscsi.c
> @@ -96,6 +96,7 @@ typedef struct IscsiLun {
>  unsigned long *allocmap_valid;
>  long allocmap_size;
>  int cluster_size;
> +uint8_t pr_cap;
>  bool use_16_for_rw;
>  bool write_protected;
>  bool lbpme;
> @@ -280,6 +281,8 @@ iscsi_co_generic_cb(struct iscsi_context *iscsi, int 
> status,
>  iTask->err_code = -error;
>  iTask->err_str = g_strdup(iscsi_get_error(iscsi));
>  }
> +} else if (status == SCSI_STATUS_RESERVATION_CONFLICT) {
> +iTask->err_code = -EBADE;

Should err_str be set too? For example, iscsi_co_writev() seems to
assume err_str is set if the iSCSI task fails.
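
Following the pattern of the branch just above it, the conflict case
could fill in both fields, e.g. (sketch):

  } else if (status == SCSI_STATUS_RESERVATION_CONFLICT) {
      iTask->err_code = -EBADE;
      iTask->err_str = g_strdup(iscsi_get_error(iscsi));
  }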

>  }
>  }
>  }
> @@ -1792,6 +1795,52 @@ static void iscsi_save_designator(IscsiLun *lun,
>  }
>  }
>  
> +static void iscsi_get_pr_cap_sync(IscsiLun *iscsilun, Error **errp)
> +{
> +struct scsi_task *task = NULL;
> +struct scsi_persistent_reserve_in_report_capabilities *rc = NULL;
> +int retries = ISCSI_CMD_RETRIES;
> +int xferlen = sizeof(struct 
> scsi_persistent_reserve_in_report_capabilities);
> +
> +do {
> +if (task != NULL) {
> +scsi_free_scsi_task(task);
> +task = NULL;
> +}
> +
> +task = iscsi_persistent_reserve_in_sync(iscsilun->iscsi,
> +   iscsilun->lun, SCSI_PR_IN_REPORT_CAPABILITIES, xferlen);
> +if (task != NULL && task->status == SCSI_STATUS_GOOD) {
> +rc = scsi_datain_unmarshall(task);
> +if (rc == NULL) {
> +error_setg(errp,
> +"iSCSI: Failed to unmarshall report capabilities data.");
> +} else {
> +iscsilun->pr_cap =
> +
> scsi_pr_cap_to_block(rc->persistent_reservation_type_mask);
> +iscsilun->pr_cap |= (rc->ptpl_a) ? BLK_PR_CAP_PTPL : 0;
> +}
> +break;
> +}
> +
> +if (task != NULL && task->status == SCSI_STATUS_CHECK_CONDITION
> +&& task->sense.key == SCSI_SENSE_UNIT_ATTENTION) {
> +break;
> +}
> +
> +} while (task != NULL && task->status == SCSI_STATUS_CHECK_CONDITION
> + && task->sense.key == SCSI_SENSE_UNIT_ATTENTION
> + && retries-- > 0);
> +
> +if (task == NULL || task->status != SCSI_STATUS_GOOD) {
> +error_setg(errp, "iSCSI: failed to send report capabilities 
> command");
> +}
> +
> +if (task) {
> +scsi_free_scsi_task(task);
> +}
> +}
> +
>  static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
>Error **errp)
>  {
> @@ -2024,6 +2073,11 @@ static int iscsi_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
>  }
>  
> +iscsi_get_pr_cap_sync(iscsilun, &local_err);
> +if (local_err != NULL) {
> +error_propagate(errp, local_err);
> +ret = -EINVAL;
> +}
>  out:
>  qemu_opts_del(opts);
>  g_free(initiator_name);
> @@ -2110,6 +2164,8 @@ static void iscsi_refresh_limits(BlockDriverState *bs, 
> Error **errp)
>  bs->bl.opt_transfer = pow2floor(iscsilun->bl.opt_xfer_len *
>  iscsilun->block_size);
>  }
> +
> +bs->bl.pr_cap = iscsilun->pr_cap;
>  }
>  
>  /* Note that this will not re-establish a connection with an iSCSI target - 
> it
> @@ -2408,6 +2464,385 @@ out_unlock:
>  return r;
>  }
>  
> +static int coroutine_fn
> +iscsi_co_pr_read_keys(BlockDriverState *bs, uint32_t *generation,
> +  uint32_t num_keys, uint64_t *keys)
> +{
> +IscsiLun *iscsilun = bs->opaque;
> +QEMUIOVector qiov;
> +struct IscsiTask iTask;
> +int xferlen = sizeof(struct scsi_persistent_reserve_in_read_keys) +
> +  sizeof(uint64_t) * num_keys;
> +uint8_t *buf = g_malloc0(xferlen);
> +int32_t num_collect_keys = 0;
> +int r = 0;
> +
> +qemu_iovec_init_buf(&qiov, buf, xferlen);
> +iscsi_co_init_iscsitask(iscsilun, &iTask);
> +qemu_mutex_lock(&iscsilun->mutex);
> +retry:
> +iTask.task = iscsi_persistent_reserve_in_task(iscsilun->iscsi,
> + iscsilun->lun, SCSI_PR_IN_READ_KEYS, xferlen,
> + iscsi_co_generic_cb, &iTask);
> +
> +if 

Re: [PATCH v6 06/10] block/nvme: add reservation command protocol constants

2024-07-04 Thread Stefan Hajnoczi
On Thu, Jun 13, 2024 at 03:13:23PM +0800, Changqi Lu wrote:
> Add constants for the NVMe persistent command protocol.
> The constants include the reservation command opcode and
> reservation type values defined in section 7 of the NVMe
> 2.0 specification.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  include/block/nvme.h | 61 
>  1 file changed, 61 insertions(+)
> 
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index bb231d0b9a..da6ccb0f3b 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -633,6 +633,10 @@ enum NvmeIoCommands {
>  NVME_CMD_WRITE_ZEROES   = 0x08,
>  NVME_CMD_DSM= 0x09,
>  NVME_CMD_VERIFY = 0x0c,
> +NVME_CMD_RESV_REGISTER  = 0x0d,
> +NVME_CMD_RESV_REPORT= 0x0e,
> +NVME_CMD_RESV_ACQUIRE   = 0x11,
> +NVME_CMD_RESV_RELEASE   = 0x15,
>  NVME_CMD_IO_MGMT_RECV   = 0x12,

Keep NVME_CMD_IO_MGMT_RECV (0x12) before NVME_CMD_RESV_RELEASE (0x15) in
sorted order?
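
i.e. keeping the enum in ascending opcode order, roughly:

  NVME_CMD_RESV_REGISTER  = 0x0d,
  NVME_CMD_RESV_REPORT    = 0x0e,
  NVME_CMD_RESV_ACQUIRE   = 0x11,
  NVME_CMD_IO_MGMT_RECV   = 0x12,
  NVME_CMD_RESV_RELEASE   = 0x15,
  NVME_CMD_COPY           = 0x19,
  NVME_CMD_IO_MGMT_SEND   = 0x1d,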

>  NVME_CMD_COPY   = 0x19,
>  NVME_CMD_IO_MGMT_SEND   = 0x1d,
> @@ -641,6 +645,63 @@ enum NvmeIoCommands {
>  NVME_CMD_ZONE_APPEND= 0x7d,
>  };
>  
> +typedef enum {
> +NVME_RESV_REGISTER_ACTION_REGISTER  = 0x00,
> +NVME_RESV_REGISTER_ACTION_UNREGISTER= 0x01,
> +NVME_RESV_REGISTER_ACTION_REPLACE   = 0x02,
> +} NvmeReservationRegisterAction;
> +
> +typedef enum {
> +NVME_RESV_RELEASE_ACTION_RELEASE= 0x00,
> +NVME_RESV_RELEASE_ACTION_CLEAR  = 0x01,
> +} NvmeReservationReleaseAction;
> +
> +typedef enum {
> +NVME_RESV_ACQUIRE_ACTION_ACQUIRE= 0x00,
> +NVME_RESV_ACQUIRE_ACTION_PREEMPT= 0x01,
> +NVME_RESV_ACQUIRE_ACTION_PREEMPT_AND_ABORT  = 0x02,
> +} NvmeReservationAcquireAction;
> +
> +typedef enum {
> +NVME_RESV_WRITE_EXCLUSIVE   = 0x01,
> +NVME_RESV_EXCLUSIVE_ACCESS  = 0x02,
> +NVME_RESV_WRITE_EXCLUSIVE_REGS_ONLY = 0x03,
> +NVME_RESV_EXCLUSIVE_ACCESS_REGS_ONLY= 0x04,
> +NVME_RESV_WRITE_EXCLUSIVE_ALL_REGS  = 0x05,
> +NVME_RESV_EXCLUSIVE_ACCESS_ALL_REGS = 0x06,
> +} NvmeResvType;
> +
> +typedef enum {
> +NVME_RESV_PTPL_NO_CHANGE = 0x00,
> +NVME_RESV_PTPL_DISABLE   = 0x02,
> +NVME_RESV_PTPL_ENABLE= 0x03,
> +} NvmeResvPTPL;
> +
> +typedef enum NVMEPrCap {
> +/* Persist Through Power Loss */
> +NVME_PR_CAP_PTPL = 1 << 0,
> +/* Write Exclusive reservation type */
> +NVME_PR_CAP_WR_EX = 1 << 1,
> +/* Exclusive Access reservation type */
> +NVME_PR_CAP_EX_AC = 1 << 2,
> +/* Write Exclusive Registrants Only reservation type */
> +NVME_PR_CAP_WR_EX_RO = 1 << 3,
> +/* Exclusive Access Registrants Only reservation type */
> +NVME_PR_CAP_EX_AC_RO = 1 << 4,
> +/* Write Exclusive All Registrants reservation type */
> +NVME_PR_CAP_WR_EX_AR = 1 << 5,
> +/* Exclusive Access All Registrants reservation type */
> +NVME_PR_CAP_EX_AC_AR = 1 << 6,
> +
> +NVME_PR_CAP_ALL = (NVME_PR_CAP_PTPL |
> +  NVME_PR_CAP_WR_EX |
> +  NVME_PR_CAP_EX_AC |
> +  NVME_PR_CAP_WR_EX_RO |
> +  NVME_PR_CAP_EX_AC_RO |
> +  NVME_PR_CAP_WR_EX_AR |
> +  NVME_PR_CAP_EX_AC_AR),
> +} NvmePrCap;
> +
>  typedef struct QEMU_PACKED NvmeDeleteQ {
>  uint8_t opcode;
>  uint8_t flags;
> -- 
> 2.20.1
> 


signature.asc
Description: PGP signature


Re: [PATCH v6 08/10] hw/nvme: enable ONCS and rescap function

2024-07-04 Thread Stefan Hajnoczi
On Thu, Jun 13, 2024 at 03:13:25PM +0800, Changqi Lu wrote:
> This commit enables ONCS to support the reservation
> function at the controller level. Also enables rescap
> function in the namespace by detecting the supported reservation
> function in the backend driver.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  hw/nvme/ctrl.c | 3 ++-
>  hw/nvme/ns.c   | 5 +
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 127c3d2383..182307a48b 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -8248,7 +8248,8 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
> *pci_dev)
>  id->nn = cpu_to_le32(NVME_MAX_NAMESPACES);
>  id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROES | NVME_ONCS_TIMESTAMP |
> NVME_ONCS_FEATURES | NVME_ONCS_DSM |
> -   NVME_ONCS_COMPARE | NVME_ONCS_COPY);
> +   NVME_ONCS_COMPARE | NVME_ONCS_COPY |
> +   NVME_ONCS_RESRVATIONS);

RESRVATIONS -> RESERVATIONS typo?

>  
>  /*
>   * NOTE: If this device ever supports a command set that does NOT use 0x0
> diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
> index ea8db175db..320c9bf658 100644
> --- a/hw/nvme/ns.c
> +++ b/hw/nvme/ns.c
> @@ -20,6 +20,7 @@
>  #include "qemu/bitops.h"
>  #include "sysemu/sysemu.h"
>  #include "sysemu/block-backend.h"
> +#include "block/block_int.h"
>  
>  #include "nvme.h"
>  #include "trace.h"
> @@ -33,6 +34,7 @@ void nvme_ns_init_format(NvmeNamespace *ns)
>  BlockDriverInfo bdi;
>  int npdg, ret;
>  int64_t nlbas;
> +uint8_t blk_pr_cap;
>  
>  ns->lbaf = id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
>  ns->lbasz = 1 << ns->lbaf.ds;
> @@ -55,6 +57,9 @@ void nvme_ns_init_format(NvmeNamespace *ns)
>  }
>  
>  id_ns->npda = id_ns->npdg = npdg - 1;
> +
> +blk_pr_cap = blk_bs(ns->blkconf.blk)->file->bs->bl.pr_cap;

Kevin: This unprotected block graph access and the assumption that
->file->bs exists could be problematic. What is the best practice for
making this code safe and defensive?

> +id_ns->rescap = block_pr_cap_to_nvme(blk_pr_cap);
>  }
>  
>  static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
> -- 
> 2.20.1
> 


signature.asc
Description: PGP signature


Re: [PATCH v6 05/10] hw/scsi: add persistent reservation in/out api for scsi device

2024-07-04 Thread Stefan Hajnoczi
On Thu, Jun 13, 2024 at 03:13:22PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations in the
> SCSI device layer. By introducing the persistent
> reservation in/out api, this enables the SCSI device
> to perform reservation-related tasks, including querying
> keys, querying reservation status, registering reservation
> keys, initiating and releasing reservations, as well as
> clearing and preempting reservations held by other keys.
> 
> These operations are crucial for management and control of
> shared storage resources in a persistent manner.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  hw/scsi/scsi-disk.c | 352 
>  1 file changed, 352 insertions(+)
> 
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index 4bd7af9d0c..0e964dbd87 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -32,6 +32,7 @@
>  #include "migration/vmstate.h"
>  #include "hw/scsi/emulation.h"
>  #include "scsi/constants.h"
> +#include "scsi/utils.h"
>  #include "sysemu/block-backend.h"
>  #include "sysemu/blockdev.h"
>  #include "hw/block/block.h"
> @@ -42,6 +43,7 @@
>  #include "qemu/cutils.h"
>  #include "trace.h"
>  #include "qom/object.h"
> +#include "block/block_int.h"
>  
>  #ifdef __linux
>  #include 
> @@ -1474,6 +1476,346 @@ static void scsi_disk_emulate_read_data(SCSIRequest 
> *req)
>  scsi_req_complete(&r->req, GOOD);
>  }
>  
> +typedef struct SCSIPrReadKeys {
> +uint32_t generation;
> +uint32_t num_keys;
> +uint64_t *keys;
> +void *req;
> +} SCSIPrReadKeys;
> +
> +typedef struct SCSIPrReadReservation {
> +uint32_t generation;
> +uint64_t key;
> +BlockPrType type;
> +void *req;
> +} SCSIPrReadReservation;
> +
> +static void scsi_pr_read_keys_complete(void *opaque, int ret)
> +{
> +int num_keys;
> +uint8_t *buf;
> +SCSIPrReadKeys *blk_keys = (SCSIPrReadKeys *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_keys->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +num_keys = MIN(blk_keys->num_keys, ret);
> +blk_keys->generation = cpu_to_be32(blk_keys->generation);
> +memcpy(&buf[0], &blk_keys->generation, 4);
> +for (int i = 0; i < num_keys; i++) {
> +blk_keys->keys[i] = cpu_to_be64(blk_keys->keys[i]);
> +memcpy(&buf[8 + i * 8], &blk_keys->keys[i], 8);
> +}
> +num_keys = cpu_to_be32(num_keys * 8);
> +memcpy(&buf[4], &num_keys, 4);
> +
> +scsi_req_data(&r->req, r->buflen);
> +done:
> +scsi_req_unref(&r->req);
> +g_free(blk_keys->keys);
> +g_free(blk_keys);
> +}
> +
> +static int scsi_disk_emulate_pr_read_keys(SCSIRequest *req)
> +{
> +SCSIPrReadKeys *blk_keys;
> +SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
> +int buflen = MIN(r->req.cmd.xfer, r->buflen);
> +int num_keys = (buflen - sizeof(uint32_t) * 2) / sizeof(uint64_t);
> +
> +blk_keys = g_new0(SCSIPrReadKeys, 1);
> +blk_keys->generation = 0;
> +/* num_keys is the maximum number of keys that can be transmitted */
> +blk_keys->num_keys = num_keys;
> +blk_keys->keys = g_malloc(sizeof(uint64_t) * num_keys);
> +blk_keys->req = r;
> +
> +/* The request is used as the AIO opaque value, so add a ref.  */
> +scsi_req_ref(&r->req);
> +r->req.aiocb = blk_aio_pr_read_keys(s->qdev.conf.blk, 
> &blk_keys->generation,
> +blk_keys->num_keys, blk_keys->keys,
> +scsi_pr_read_keys_complete, 
> blk_keys);
> +return 0;
> +}
> +
> +static void scsi_pr_read_reservation_complete(void *opaque, int ret)
> +{
> +uint8_t *buf;
> +uint32_t additional_len = 0;
> +SCSIPrReadReservation *blk_rsv = (SCSIPrReadReservation *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_rsv->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +blk_rsv->generation = cpu_to_be32(blk_rsv->generation);
> +memcpy(&buf[0], &blk_rsv->generation, 4);
> +if (ret) {
> +additional_len = cpu_to_be32(16);
> +blk_rsv->key = cpu_to_be64(blk_rsv->key);
> +memcpy(&buf[8], &blk_rsv->key, 8);
> +buf[21] = block_pr_type_to_scsi(blk_rsv->type) & 0xf;
> +} else {
> +additional_len = cpu_to_be32(0);
> +}
> +
> +memcpy(&buf[4], &additional_len, 4);
> + 

Re: [PATCH v6 05/10] hw/scsi: add persistent reservation in/out api for scsi device

2024-07-04 Thread Stefan Hajnoczi
On Thu, Jun 13, 2024 at 03:13:22PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations in the
> SCSI device layer. By introducing the persistent
> reservation in/out api, this enables the SCSI device
> to perform reservation-related tasks, including querying
> keys, querying reservation status, registering reservation
> keys, initiating and releasing reservations, as well as
> clearing and preempting reservations held by other keys.
> 
> These operations are crucial for management and control of
> shared storage resources in a persistent manner.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 

As mentioned in my reply to a previous version, I don't understand the
buffer allocation/sizing in hw/scsi/ so I haven't been able to fully
review this code for buffer overflows and input validation. cmd.xfer
isn't consistently used for size checks in the new functions. Maybe some
checks are missing?

> ---
>  hw/scsi/scsi-disk.c | 352 
>  1 file changed, 352 insertions(+)
> 
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index 4bd7af9d0c..0e964dbd87 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -32,6 +32,7 @@
>  #include "migration/vmstate.h"
>  #include "hw/scsi/emulation.h"
>  #include "scsi/constants.h"
> +#include "scsi/utils.h"
>  #include "sysemu/block-backend.h"
>  #include "sysemu/blockdev.h"
>  #include "hw/block/block.h"
> @@ -42,6 +43,7 @@
>  #include "qemu/cutils.h"
>  #include "trace.h"
>  #include "qom/object.h"
> +#include "block/block_int.h"
>  
>  #ifdef __linux
>  #include 
> @@ -1474,6 +1476,346 @@ static void scsi_disk_emulate_read_data(SCSIRequest 
> *req)
>  scsi_req_complete(&r->req, GOOD);
>  }
>  
> +typedef struct SCSIPrReadKeys {
> +uint32_t generation;
> +uint32_t num_keys;
> +uint64_t *keys;
> +void *req;

Why is this field void * instead of SCSIDiskReq *?

> +} SCSIPrReadKeys;
> +
> +typedef struct SCSIPrReadReservation {
> +uint32_t generation;
> +uint64_t key;
> +BlockPrType type;
> +void *req;

Same here.

> +} SCSIPrReadReservation;
> +
> +static void scsi_pr_read_keys_complete(void *opaque, int ret)
> +{
> +int num_keys;
> +uint8_t *buf;
> +SCSIPrReadKeys *blk_keys = (SCSIPrReadKeys *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_keys->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +num_keys = MIN(blk_keys->num_keys, ret);
> +blk_keys->generation = cpu_to_be32(blk_keys->generation);
> +memcpy(&buf[0], &blk_keys->generation, 4);
> +for (int i = 0; i < num_keys; i++) {
> +blk_keys->keys[i] = cpu_to_be64(blk_keys->keys[i]);
> +memcpy(&buf[8 + i * 8], &blk_keys->keys[i], 8);
> +}
> +num_keys = cpu_to_be32(num_keys * 8);
> +memcpy(&buf[4], &num_keys, 4);
> +
> +scsi_req_data(&r->req, r->buflen);
> +done:
> +scsi_req_unref(&r->req);
> +g_free(blk_keys->keys);
> +g_free(blk_keys);
> +}
> +
> +static int scsi_disk_emulate_pr_read_keys(SCSIRequest *req)
> +{
> +SCSIPrReadKeys *blk_keys;
> +SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
> +int buflen = MIN(r->req.cmd.xfer, r->buflen);
> +int num_keys = (buflen - sizeof(uint32_t) * 2) / sizeof(uint64_t);

If buflen is an untrusted input then num_keys < 0 and maybe num_keys ==
0 need to be rejected with an error.
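
For example, a minimal guard (sketch only; the exact status/sense code
to return and how the caller consumes the return value are up to the
author) might be:

  int buflen = MIN(r->req.cmd.xfer, r->buflen);
  int num_keys;

  /* Reject buffers that cannot even hold the 8-byte READ KEYS header */
  if (buflen < (int)(2 * sizeof(uint32_t))) {
      scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
      return 0;
  }
  num_keys = (buflen - 2 * sizeof(uint32_t)) / sizeof(uint64_t);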

> +
> +blk_keys = g_new0(SCSIPrReadKeys, 1);
> +blk_keys->generation = 0;
> +/* num_keys is the maximum number of keys that can be transmitted */
> +blk_keys->num_keys = num_keys;
> +blk_keys->keys = g_malloc(sizeof(uint64_t) * num_keys);
> +blk_keys->req = r;
> +
> +/* The request is used as the AIO opaque value, so add a ref.  */
> +scsi_req_ref(&r->req);
> +r->req.aiocb = blk_aio_pr_read_keys(s->qdev.conf.blk, 
> &blk_keys->generation,
> +blk_keys->num_keys, blk_keys->keys,
> +scsi_pr_read_keys_complete, 
> blk_keys);
> +return 0;
> +}
> +
> +static void scsi_pr_read_reservation_complete(void *opaque, int ret)
> +{
> +uint8_t *buf;
> +uint32_t additional_len = 0;
> +SCSIPrReadReservation *blk_rsv = (SCSIPrReadReservation *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_rsv->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto 

Re: [PATCH v6 04/10] scsi/util: add helper functions for persistent reservation types conversion

2024-06-26 Thread Stefan Hajnoczi
On Thu, Jun 13, 2024 at 03:13:21PM +0800, Changqi Lu wrote:
> This commit introduces two helper functions
> that facilitate the conversion between the
> persistent reservation types used in the SCSI
> protocol and those used in the block layer.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  include/scsi/utils.h |  8 +
>  scsi/utils.c | 81 
>  2 files changed, 89 insertions(+)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH v6 03/10] scsi/constant: add persistent reservation in/out protocol constants

2024-06-26 Thread Stefan Hajnoczi
On Thu, Jun 13, 2024 at 03:13:20PM +0800, Changqi Lu wrote:
> Add constants for the persistent reservation in/out protocol
> in the scsi/constant module. The constants include the persistent
> reservation command, type, and scope values defined in sections
> 6.13 and 6.14 of the SCSI Primary Commands-4 (SPC-4) specification.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  include/scsi/constants.h | 52 
>  1 file changed, 52 insertions(+)

These new constants are not copied from Linux include/scsi/scsi_proto.h
like the rest of the file, but it's okay because constants.h is not
kept in sync with the Linux headers.

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH v6 02/10] block/raw: add persistent reservation in/out driver

2024-06-26 Thread Stefan Hajnoczi
On Thu, Jun 13, 2024 at 03:13:19PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations for raw driver.
> The following methods are implemented: bdrv_co_pr_read_keys,
> bdrv_co_pr_read_reservation, bdrv_co_pr_register, bdrv_co_pr_reserve,
> bdrv_co_pr_release, bdrv_co_pr_clear and bdrv_co_pr_preempt.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  block/raw-format.c | 56 ++
>  1 file changed, 56 insertions(+)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH v6 01/10] block: add persistent reservation in/out api

2024-06-26 Thread Stefan Hajnoczi
On Thu, Jun 13, 2024 at 03:13:18PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations
> at the block level. The following operations
> are included:
> 
> - read_keys:retrieves the list of registered keys.
> - read_reservation: retrieves the current reservation status.
> - register: registers a new reservation key.
> - reserve:  initiates a reservation for a specific key.
> - release:  releases a reservation for a specific key.
> - clear:clears all existing reservations.
> - preempt:  preempts a reservation held by another key.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  block/block-backend.c | 403 ++
>  block/io.c| 163 
>  include/block/block-common.h  |  40 +++
>  include/block/block-io.h  |  20 ++
>  include/block/block_int-common.h  |  84 +++
>  include/sysemu/block-backend-io.h |  24 ++
>  6 files changed, 734 insertions(+)
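
For orientation, the driver hook behind the first of these operations,
as implemented by the iSCSI patch later in this series, has roughly the
following shape (a sketch of the interface only; the other callbacks
follow the same pattern):

  /* include/block/block_int-common.h (sketch) */
  int coroutine_fn (*bdrv_co_pr_read_keys)(BlockDriverState *bs,
                                           uint32_t *generation,
                                           uint32_t num_keys,
                                           uint64_t *keys);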

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [RFC PATCH 1/1] vhost-user: add shmem mmap request

2024-06-26 Thread Stefan Hajnoczi
On Wed, 26 Jun 2024 at 03:54, Albert Esteve  wrote:
>
> Hi Stefan,
>
> On Wed, Jun 5, 2024 at 4:28 PM Stefan Hajnoczi  wrote:
>>
>> On Wed, Jun 05, 2024 at 10:13:32AM +0200, Albert Esteve wrote:
>> > On Tue, Jun 4, 2024 at 8:54 PM Stefan Hajnoczi  wrote:
>> >
>> > > On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:
>> > > > Add SHMEM_MAP/UNMAP requests to vhost-user.
>> > > >
>> > > > This request allows backends to dynamically map
>> > > > fds into a shared memory region identified by
>> > >
>> > > Please call this "VIRTIO Shared Memory Region" everywhere (code,
>> > > vhost-user spec, commit description, etc) so it's clear that this is not
>> > > about vhost-user shared memory tables/regions.
>> > >
>> > > > its `shmid`. Then, the fd memory is advertised
>> > > > to the frontend through a BAR+offset, so it can
>> > > > be read by the driver while it's valid.
>> > >
>> > > Why is a PCI BAR mentioned here? vhost-user does not know about the
>> > > VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
>> > > report VIRTIO Shared Memory Regions to the driver.
>> > >
>> > >
>> > I will remove PCI BAR, as it is true that it depends on the
>> > transport. I was trying to explain that the driver
>> > will use the shm_base + shm_offset to access
>> > the mapped memory.
>> >
>> >
>> > > >
>> > > > Then, the backend can munmap the memory range
>> > > > in a given shared memory region (again, identified
>> > > > by its `shmid`), to free it. After this, the
>> > > > region becomes private and shall not be accessed
>> > > > by the frontend anymore.
>> > >
>> > > What does "private" mean?
>> > >
>> > > The frontend must mmap PROT_NONE to reserve the virtual memory space
>> > > when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
>> > > unrelated mmap(NULL, ...) might use that address range and the guest
>> > > would have access to the host memory! This is a security issue and needs
>> > > to be mentioned explicitly in the spec.
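
As an illustration of that reservation scheme (a sketch with made-up
names such as shm_size, map_offset, map_size, fd and fd_offset; not code
from the patch): the frontend keeps the whole VIRTIO Shared Memory
Region backed by a PROT_NONE mapping and only overlays backend-provided
fds on sub-ranges of it:

  /* Reserve the region's full address range up front. */
  void *shm = mmap(NULL, shm_size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  /* SHMEM_MAP: overlay the backend's fd on a sub-range. */
  mmap((char *)shm + map_offset, map_size, PROT_READ | PROT_WRITE,
       MAP_SHARED | MAP_FIXED, fd, fd_offset);

  /* SHMEM_UNMAP: drop access again without releasing the range. */
  mmap((char *)shm + map_offset, map_size, PROT_NONE,
       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);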
>> > >
>> >
>> > I mentioned private because it changes the mapping from MAP_SHARED
>> > to MAP_PRIVATE. I will highlight PROT_NONE instead.
>>
>> I see. Then "MAP_PRIVATE" would be clearer. I wasn't sure whether you
>> meant the mmap flag or something like the memory range no longer being
>> accessible to the driver.
>>
>> >
>> >
>> > >
>> > > >
>> > > > Initializing the memory region is the responsibility
>> > > > of the PCI device that will be using it.
>> > >
>> > > What does this mean?
>> > >
>> >
>> > The MemoryRegion is declared in `struct VirtIODevice`,
>> > but it is uninitialized in this commit. So I was trying to say
>> > that the initialization will happen in, e.g., vhost-user-gpu-pci.c
>> > with something like `memory_region_init` , and later `pci_register_bar`.
>>
>> Okay. The device model needs to create MemoryRegion instances for the
>> device's Shared Memory Regions and add them to the VirtIODevice.
>>
>> --device vhost-user-device will need to query the backend since, unlike
>> vhost-user-gpu-pci and friends, it doesn't have knowledge of specific
>> device types. It will need to create MemoryRegions enumerated from the
>> backend.
>>
>> By the way, the VIRTIO MMIO Transport also supports VIRTIO Shared Memory
>> Regions so this work should not be tied to PCI.
>>
>> >
>> > I am testing that part still.
>> >
>> >
>> > >
>> > > >
>> > > > Signed-off-by: Albert Esteve 
>> > > > ---
>> > > >  docs/interop/vhost-user.rst |  23 
>> > > >  hw/virtio/vhost-user.c  | 106 
>> > > >  hw/virtio/virtio.c  |   2 +
>> > > >  include/hw/virtio/virtio.h  |   3 +
>> > > >  4 files changed, 134 insertions(+)
>> > >
>> > > Two missing pieces:
>> > >
>> > > 1. QEMU's --device vhost-user-device needs a way to enumerate VIRTIO
>> > > Shared Memory Regions from the vhost-user backend. vhost-user-device is
>> > > a generic vhost-user 

Re: [RFC PATCH v3 2/5] rust: add bindgen step as a meson dependency

2024-06-24 Thread Stefan Hajnoczi
On Thu, 20 Jun 2024 at 14:35, Manos Pitsidianakis
 wrote:
>
> On Thu, 20 Jun 2024 15:32, Alex Bennée  wrote:
> >Manos Pitsidianakis  writes:
> >
> >> Add mechanism to generate rust hw targets that depend on a custom
> >> bindgen target for rust bindings to C.
> >>
> >> This way bindings will be created before the rust crate is compiled.
> >>
> >> The bindings will end up in BUILDDIR/{target}-generated.rs and have the 
> >> same name
> >> as a target:
> >>
> >> ninja aarch64-softmmu-generated.rs
> >>
> >
> >> +
> >> +
> >> +rust_targets = {}
> >> +
> >> +cargo_wrapper = [
> >> +  find_program(meson.global_source_root() / 'scripts/cargo_wrapper.py'),
> >> +  '--config-headers', meson.project_build_root() / 'config-host.h',
> >> +  '--meson-build-root', meson.project_build_root(),
> >> +  '--meson-build-dir', meson.current_build_dir(),
> >> +  '--meson-source-dir', meson.current_source_dir(),
> >> +]
> >
> >I'm unclear what the difference between meson-build-root and
> >meson-build-dir is?
>
> Build-dir is the subdir of the current subdir(...) meson.build file
>
> So if we are building under qemu/build, meson_build_root is qemu/build
> and meson_build_dir is qemu/build/rust
>
> >
> >We also end up defining crate-dir and outdir. Aren't these all
> >derivable from whatever module we are building?
>
> Crate dir is the source directory (i.e. qemu/rust/pl011) that contains
> the crate's manifest file Cargo.toml.
>
> Outdir is where to put the final build artifact for meson to find. We
> could derive that from the build directories and package names somehow
> but I chose to be explicit instead of doing indirect logic to make the
> process less magic.
>
> I know it's a lot so I'm open to simplifications. The only problem is
> that all of these directories, except the crate source code, are defined
> from meson and can change with any refactor we do from the meson side of
> things.

Expanding the help text for these command-line options would make it
easier to understand. It would be great to include an example path
too.

Stefan



Re: [PATCH v2] Consider discard option when writing zeros

2024-06-24 Thread Stefan Hajnoczi
On Wed, Jun 19, 2024 at 08:43:25PM +0300, Nir Soffer wrote:
> Tested using:

Hi Nir,
This looks like a good candidate for the qemu-iotests test suite. Adding
it to the automated tests will protect against future regressions.

Please add the script and the expected output to
tests/qemu-iotests/test/write-zeroes-unmap and run it using
`(cd build && tests/qemu-iotests/check write-zeroes-unmap)`.

See the existing test cases in tests/qemu-iotests/ and
tests/qemu-iotests/tests/ for examples. Some are shell scripts and
others are Python. I think shell makes sense for this test case. You can
copy the test framework boilerplate from an existing test case.

Thanks,
Stefan

> 
> $ cat test-unmap.sh
> #!/bin/sh
> 
> qemu=${1:?Usage: $0 qemu-executable}
> img=/tmp/test.raw
> 
> echo
> echo "defaults - write zeroes"
> fallocate -l 1m $img
> echo -e 'qemu-io none0 "write -z 0 1m"\nquit' | $qemu -monitor stdio \
> -drive if=none,file=$img,format=raw >/dev/null
> du -sh $img
> 
> echo
> echo "defaults - write zeroes unmap"
> fallocate -l 1m $img
> echo -e 'qemu-io none0 "write -zu 0 1m"\nquit' | $qemu -monitor stdio \
> -drive if=none,file=$img,format=raw >/dev/null
> du -sh $img
> 
> echo
> echo "defaults - write actual zeros"
> fallocate -l 1m $img
> echo -e 'qemu-io none0 "write -P 0 0 1m"\nquit' | $qemu -monitor stdio \
> -drive if=none,file=$img,format=raw >/dev/null
> du -sh $img
> 
> echo
> echo "discard=off - write zeroes unmap"
> fallocate -l 1m $img
> echo -e 'qemu-io none0 "write -zu 0 1m"\nquit' | $qemu -monitor stdio \
> -drive if=none,file=$img,format=raw,discard=off >/dev/null
> du -sh $img
> 
> echo
> echo "detect-zeros=on - write actual zeros"
> fallocate -l 1m $img
> echo -e 'qemu-io none0 "write -P 0 0 1m"\nquit' | $qemu -monitor stdio \
> -drive if=none,file=$img,format=raw,detect-zeroes=on >/dev/null
> du -sh $img
> 
> echo
> echo "detect-zeros=unmap,discard=unmap - write actual zeros"
> fallocate -l 1m $img
> echo -e 'qemu-io none0 "write -P 0 0 1m"\nquit' |  $qemu -monitor stdio \
> -drive if=none,file=$img,format=raw,detect-zeroes=unmap,discard=unmap
> >/dev/null
> du -sh $img
> 
> echo
> echo "discard=unmap - write zeroes"
> fallocate -l 1m $img
> echo -e 'qemu-io none0 "write -z 0 1m"\nquit' | $qemu -monitor stdio \
> -drive if=none,file=$img,format=raw,discard=unmap >/dev/null
> du -sh $img
> 
> echo
> echo "discard=unmap - write zeroes unmap"
> fallocate -l 1m $img
> echo -e 'qemu-io none0 "write -zu 0 1m"\nquit' | $qemu -monitor stdio \
> -drive if=none,file=$img,format=raw,discard=unmap >/dev/null
> du -sh $img
> 
> rm $img
> 
> 
> Before this change:
> 
> $ cat before.out
> 
> defaults - write zeroes
> 1.0M /tmp/test.raw
> 
> defaults - write zeroes unmap
> 0 /tmp/test.raw
> 
> defaults - write actual zeros
> 1.0M /tmp/test.raw
> 
> discard=off - write zeroes unmap
> 0 /tmp/test.raw
> 
> detect-zeros=on - write actual zeros
> 1.0M /tmp/test.raw
> 
> detect-zeros=unmap,discard=unmap - write actual zeros
> 0 /tmp/test.raw
> 
> discard=unmap - write zeroes
> 1.0M /tmp/test.raw
> 
> discard=unmap - write zeroes unmap
> 0 /tmp/test.raw
> [nsoffer build (consider-discard-option)]$
> 
> 
> After this change:
> 
> $ cat after.out
> 
> defaults - write zeroes
> 1.0M /tmp/test.raw
> 
> defaults - write zeroes unmap
> 1.0M /tmp/test.raw
> 
> defaults - write actual zeros
> 1.0M /tmp/test.raw
> 
> discard=off - write zeroes unmap
> 1.0M /tmp/test.raw
> 
> detect-zeros=on - write actual zeros
> 1.0M /tmp/test.raw
> 
> detect-zeros=unmap,discard=unmap - write actual zeros
> 0 /tmp/test.raw
> 
> discard=unmap - write zeroes
> 1.0M /tmp/test.raw
> 
> discard=unmap - write zeroes unmap
> 0 /tmp/test.raw
> 
> 
> Differences:
> 
> $ diff -u before.out after.out
> --- before.out 2024-06-19 20:24:09.234083713 +0300
> +++ after.out 2024-06-19 20:24:20.526165573 +0300
> @@ -3,13 +3,13 @@
>  1.0M /tmp/test.raw
> 
>  defaults - write zeroes unmap
> -0 /tmp/test.raw
> +1.0M /tmp/test.raw
> 
>  defaults - write actual zeros
>  1.0M /tmp/test.raw
> 
>  discard=off - write zeroes unmap
> -0 /tmp/test.raw
> +1.0M /tmp/test.raw
> 
> On Wed, Jun 19, 2024 at 8:40 PM Nir Soffer  wrote:
> 
> > When opening an image with discard=off, we punch hole in the image when
> > writing zeroes, making the image sparse. This breaks users that want to
> > ensure that writes cannot fail with ENOSPACE by using fully allocated
> > images.
> >
> > bdrv_co_pwrite_zeroes() correctly disable BDRV_REQ_MAY_UNMAP if we
> > opened the child without discard=unmap or discard=on. But we don't go
> > through this function when accessing the top node. Move the check down
> > to bdrv_co_do_pwrite_zeroes() which seems to be used in all code paths.
> >
> > Issues:
> > - We don't punch hole by default, so images are kept allocated. Before
> >   this change we punched holes by default. I'm not sure this is a good
> >   change in behavior.
> > - Need to run all block tests
> > - Not sure that we have tests covering unmapping, 

Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context

2024-06-18 Thread Stefan Hajnoczi
On Fri, Jun 14, 2024 at 11:29:13AM +0200, Fiona Ebner wrote:
> On 12.06.24 at 17:34, Stefan Hajnoczi wrote:
> > 
> > Thank you for investigating! It looks like we would be trading one
> > issue (the assertion failures you mentioned) for another (a rare, but
> > possible, hang).
> > 
> > I'm not sure what the best solution is. It seems like vm_stop() is the
> > first place where things go awry. It's where we should exit device
> > emulation code. Doing that probably requires an asynchronous API that
> > takes a callback. Do you want to try that?
> > 
> 
> I can try, but I'm afraid it will be a while (at least a few weeks)
> until I can get around to it.

I am wrapping current work up and then going on vacation at the end of
June until mid-July. I'll let you know if I get a chance to look at it
when I'm back.

Stefan


signature.asc
Description: PGP signature


Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context

2024-06-12 Thread Stefan Hajnoczi
On Wed, 12 Jun 2024 at 05:21, Fiona Ebner  wrote:
>
> On 11.06.24 at 16:04, Stefan Hajnoczi wrote:
> > On Tue, Jun 11, 2024 at 02:08:49PM +0200, Fiona Ebner wrote:
> >> On 06.06.24 at 20:36, Stefan Hajnoczi wrote:
> >>> On Wed, Jun 05, 2024 at 02:08:48PM +0200, Fiona Ebner wrote:
> >>>> The fact that the snapshot_save_job_bh() is scheduled in the main
> >>>> loop's qemu_aio_context AioContext means that it might get executed
> >>>> during a vCPU thread's aio_poll(). But saving of the VM state cannot
> >>>> happen while the guest or devices are active and can lead to assertion
> >>>> failures. See issue #2111 for two examples. Avoid the problem by
> >>>> scheduling the snapshot_save_job_bh() in the iohandler AioContext,
> >>>> which is not polled by vCPU threads.
> >>>>
> >>>> Solves Issue #2111.
> >>>>
> >>>> This change also solves the following issue:
> >>>>
> >>>> Since commit effd60c878 ("monitor: only run coroutine commands in
> >>>> qemu_aio_context"), the 'snapshot-save' QMP call would not respond
> >>>> right after starting the job anymore, but only after the job finished,
> >>>> which can take a long time. The reason is, because after commit
> >>>> effd60c878, do_qmp_dispatch_bh() runs in the iohandler AioContext.
> >>>> When do_qmp_dispatch_bh() wakes the qmp_dispatch() coroutine, the
> >>>> coroutine cannot be entered immediately anymore, but needs to be
> >>>> scheduled to the main loop's qemu_aio_context AioContext. But
> >>>> snapshot_save_job_bh() was scheduled first to the same AioContext and
> >>>> thus gets executed first.
> >>>>
> >>>> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111
> >>>> Signed-off-by: Fiona Ebner 
> >>>> ---
> >>>>
> >>>> While initial smoke testing seems fine, I'm not familiar enough with
> >>>> this to rule out any pitfalls with the approach. Any reason why
> >>>> scheduling to the iohandler AioContext could be wrong here?
> >>>
> >>> If something waits for a BlockJob to finish using aio_poll() from
> >>> qemu_aio_context then a deadlock is possible since the iohandler_ctx
> >>> won't get a chance to execute. The only suspicious code path I found was
> >>> job_completed_txn_abort_locked() -> job_finish_sync_locked() but I'm not
> >>> sure whether it triggers this scenario. Please check that code path.
> >>>
> >>
> >> Sorry, I don't understand. Isn't executing the scheduled BH the only
> >> additional progress that the iohandler_ctx needs to make compared to
> >> before the patch? How exactly would that cause issues when waiting for a
> >> BlockJob?
> >>
> >> Or do you mean something waiting for the SnapshotJob from
> >> qemu_aio_context before snapshot_save_job_bh had the chance to run?
> >
> > Yes, exactly. job_finish_sync_locked() will hang since iohandler_ctx has
> > no chance to execute. But I haven't audited the code to understand
> > whether this can happen.
> So job_finish_sync_locked() is executed in
> job_completed_txn_abort_locked() when the following branch is taken
>
> > if (!job_is_completed_locked(other_job))
>
> and there is no other job in the transaction, so we can assume other_job
> being the snapshot-save job itself.
>
> The callers of job_completed_txn_abort_locked():
>
> 1. in job_do_finalize_locked() if job->ret is non-zero. The callers of
> which are:
>
> 1a. in job_finalize_locked() if JOB_VERB_FINALIZE is allowed, meaning
> job->status is JOB_STATUS_PENDING, so job_is_completed_locked() will be
> true.
>
> 1b. job_completed_txn_success_locked() sets job->status to
> JOB_STATUS_WAITING before, so job_is_completed_locked() will be true.
>
> 2. in job_completed_locked() it is only done if job->ret is non-zero, in
> which case job->status was set to JOB_STATUS_ABORTING by the preceding
> job_update_rc_locked(), and thus job_is_completed_locked() will be true.
>
> 3. in job_cancel_locked() if job->deferred_to_main_loop is true, which
> is set in job_co_entry() before job_exit() is scheduled as a BH and is
> also set in job_do_dismiss_locked(). In the former case, the
> snapshot_save_job_bh has already been executed. In the latter case,
> job_is_completed_locked() will be true (since job_early_fail() is not
> used for the snapshot job).
>
>
> Howeve

Re: [RFC PATCH v1 1/6] build-sys: Add rust feature option

2024-06-11 Thread Stefan Hajnoczi
On Tue, 11 Jun 2024 at 13:54, Manos Pitsidianakis
 wrote:
>
> On Tue, 11 Jun 2024 at 17:05, Stefan Hajnoczi  wrote:
> >
> > On Mon, Jun 10, 2024 at 09:22:36PM +0300, Manos Pitsidianakis wrote:
> > > Add options for Rust in meson_options.txt, meson.build, configure to
> > > prepare for adding Rust code in the followup commits.
> > >
> > > `rust` is a reserved meson name, so we have to use an alternative.
> > > `with_rust` was chosen.
> > >
> > > Signed-off-by: Manos Pitsidianakis 
> > > ---
> > > The cargo wrapper script hardcodes some rust target triples. This is
> > > just temporary.
> > > ---
> > >  .gitignore   |   2 +
> > >  configure|  12 +++
> > >  meson.build  |  11 ++
> > >  meson_options.txt|   4 +
> > >  scripts/cargo_wrapper.py | 211 +++
> > >  5 files changed, 240 insertions(+)
> > >  create mode 100644 scripts/cargo_wrapper.py
> > >
> > > diff --git a/.gitignore b/.gitignore
> > > index 61fa39967b..f42b0d937e 100644
> > > --- a/.gitignore
> > > +++ b/.gitignore
> > > @@ -2,6 +2,8 @@
> > >  /build/
> > >  /.cache/
> > >  /.vscode/
> > > +/target/
> > > +rust/**/target
> >
> > Are these necessary since the cargo build command-line below uses
> > --target-dir ?
> >
> > Adding new build output directories outside build/ makes it harder to
> > clean up the source tree and ensure no state from previous builds
> > remains.
>
> Agreed! These build directories would show up when using cargo
> directly instead of through the cargo_wrapper.py script, i.e. during
> development. I'd consider it an edge case, it won't happen much and if
> it does it's better to gitignore them than accidentally checking them
> in. Also, whatever artifacts are in a `target` directory won't be used
> for compilation with qemu inside a build directory.

Why would someone bypass the build system? I don't think we should
encourage developers to do this.

>
>
> > >  *.pyc
> > >  .sdk
> > >  .stgit-*
> > > diff --git a/configure b/configure
> > > index 38ee257701..c195630771 100755
> > > --- a/configure
> > > +++ b/configure
> > > @@ -302,6 +302,9 @@ else
> > >objcc="${objcc-${cross_prefix}clang}"
> > >  fi
> > >
> > > +with_rust="auto"
> > > +with_rust_target_triple=""
> > > +
> > >  ar="${AR-${cross_prefix}ar}"
> > >  as="${AS-${cross_prefix}as}"
> > >  ccas="${CCAS-$cc}"
> > > @@ -760,6 +763,12 @@ for opt do
> > >;;
> > >--gdb=*) gdb_bin="$optarg"
> > >;;
> > > +  --enable-rust) with_rust=enabled
> > > +  ;;
> > > +  --disable-rust) with_rust=disabled
> > > +  ;;
> > > +  --rust-target-triple=*) with_rust_target_triple="$optarg"
> > > +  ;;
> > ># everything else has the same name in configure and meson
> > >--*) meson_option_parse "$opt" "$optarg"
> > >;;
> > > @@ -1796,6 +1805,9 @@ if test "$skip_meson" = no; then
> > >test -n "${LIB_FUZZING_ENGINE+xxx}" && meson_option_add 
> > > "-Dfuzzing_engine=$LIB_FUZZING_ENGINE"
> > >test "$plugins" = yes && meson_option_add "-Dplugins=true"
> > >test "$tcg" != enabled && meson_option_add "-Dtcg=$tcg"
> > > +  test "$with_rust" != enabled && meson_option_add 
> > > "-Dwith_rust=$with_rust"
> > > +  test "$with_rust" != enabled && meson_option_add 
> > > "-Dwith_rust=$with_rust"
> >
> > Duplicate line.
>
> Thanks!
>
> >
> > > +  test "$with_rust_target_triple" != "" && meson_option_add 
> > > "-Dwith_rust_target_triple=$with_rust_target_triple"
> > >run_meson() {
> > >  NINJA=$ninja $meson setup "$@" "$PWD" "$source_path"
> > >}
> > > diff --git a/meson.build b/meson.build
> > > index a9de71d450..3533889852 100644
> > > --- a/meson.build
> > > +++ b/meson.build
> > > @@ -290,6 +290,12 @@ foreach lang : all_languages
> > >endif
> > >  endforeach
> > >
> > > +cargo = not_found
> > >

Re: Re: [PATCH v5 00/10] Support persistent reservation operations

2024-06-11 Thread Stefan Hajnoczi
On Mon, Jun 10, 2024 at 07:55:20PM -0700, 卢长奇 wrote:
> Hi,
> 
> Sorry, I explained it in patch2 and forgot to reply your email.
> 
> The existing PRManager only works with local SCSI devices. This series
> completely decouples devices and drivers. The device can be not only
> SCSI but also another device type such as NVMe, and the same is true for
> the driver, which is completely unrestricted.
> 
> block/file-posix.c can also implement the new block driver callbacks;
> pr_manager can then be invoked once the ioctl commands have been assembled
> in that driver. This will be implemented in subsequent patches.

Thanks for explaining!

Stefan

> 
> On 2024/6/11 01:18, Stefan Hajnoczi wrote:
> > On Thu, Jun 06, 2024 at 08:24:34PM +0800, Changqi Lu wrote:
> >> Hi,
> >>
> >> patchv5 has been modified.
> >>
> >> Sincerely hope that everyone can help review the
> >> code and provide some suggestions.
> >>
> >> v4->v5:
> >> - Fixed a memory leak bug at hw/nvme/ctrl.c.
> >>
> >> v3->v4:
> >> - At the nvme layer, the two patches of enabling the ONCS
> >> function and enabling rescap are combined into one.
> >> - At the nvme layer, add helper functions for pr capacity
> >> conversion between the block layer and the nvme layer.
> >>
> >> v2->v3:
> >> In v2, Persist Through Power Loss (PTPL) is enabled by default.
> >> In v3, PTPL is supported and is passed as a parameter.
> >>
> >> v1->v2:
> >> - Add sg_persist --report-capabilities for SCSI protocol and enable
> >> oncs and rescap for NVMe protocol.
> >> - Add persistent reservation capabilities constants and helper functions
> >> for SCSI and NVMe protocol.
> >> - Add comments for necessary APIs.
> >>
> >> v1:
> >> - Add seven APIs about persistent reservation command for block layer.
> >> These APIs including reading keys, reading reservations, registering,
> >> reserving, releasing, clearing and preempting.
> >> - Add the necessary pr-related operation APIs for both the
> >> SCSI protocol and NVMe protocol at the device layer.
> >> - Add scsi driver at the driver layer to verify the functions
> >
> > My question from v1 is unanswered:
> >
> > What is the relationship to the existing PRManager functionality
> > (docs/interop/pr-helper.rst) where block/file-posix.c interprets SCSI
> > ioctls and sends persistent reservation requests to an external helper
> > process?
> >
> > I wonder if block/file-posix.c can implement the new block driver
> > callbacks using pr_mgr (while keeping the existing scsi-generic
> > support).
> >
> > Thanks,
> > Stefan
> >
> >>
> >>
> >> Changqi Lu (10):
> >> block: add persistent reservation in/out api
> >> block/raw: add persistent reservation in/out driver
> >> scsi/constant: add persistent reservation in/out protocol constants
> >> scsi/util: add helper functions for persistent reservation types
> >> conversion
> >> hw/scsi: add persistent reservation in/out api for scsi device
> >> block/nvme: add reservation command protocol constants
> >> hw/nvme: add helper functions for converting reservation types
> >> hw/nvme: enable ONCS and rescap function
> >> hw/nvme: add reservation protocal command
> >> block/iscsi: add persistent reservation in/out driver
> >>
> >> block/block-backend.c | 397 ++
> >> block/io.c | 163 +++
> >> block/iscsi.c | 443 ++
> >> block/raw-format.c | 56 
> >> hw/nvme/ctrl.c | 326 +-
> >> hw/nvme/ns.c | 5 +
> >> hw/nvme/nvme.h | 84 ++
> >> hw/scsi/scsi-disk.c | 352 
> >> include/block/block-common.h | 40 +++
> >> include/block/block-io.h | 20 ++
> >> include/block/block_int-common.h | 84 ++
> >> include/block/nvme.h | 98 +++
> >> include/scsi/constants.h | 52 
> >> include/scsi/utils.h | 8 +
> >> include/sysemu/block-backend-io.h | 24 ++
> >> scsi/utils.c | 81 ++
> >> 16 files changed, 2231 insertions(+), 2 deletions(-)
> >>
> >> --
> >> 2.20.1
> >>


signature.asc
Description: PGP signature


Re: [RFC PATCH v1 1/6] build-sys: Add rust feature option

2024-06-11 Thread Stefan Hajnoczi
On Mon, Jun 10, 2024 at 09:22:36PM +0300, Manos Pitsidianakis wrote:
> Add options for Rust in meson_options.txt, meson.build, configure to
> prepare for adding Rust code in the followup commits.
> 
> `rust` is a reserved meson name, so we have to use an alternative.
> `with_rust` was chosen.
> 
> Signed-off-by: Manos Pitsidianakis 
> ---
> The cargo wrapper script hardcodes some rust target triples. This is 
> just temporary.
> ---
>  .gitignore   |   2 +
>  configure|  12 +++
>  meson.build  |  11 ++
>  meson_options.txt|   4 +
>  scripts/cargo_wrapper.py | 211 +++
>  5 files changed, 240 insertions(+)
>  create mode 100644 scripts/cargo_wrapper.py
> 
> diff --git a/.gitignore b/.gitignore
> index 61fa39967b..f42b0d937e 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -2,6 +2,8 @@
>  /build/
>  /.cache/
>  /.vscode/
> +/target/
> +rust/**/target

Are these necessary since the cargo build command-line below uses
--target-dir ?

Adding new build output directories outside build/ makes it harder to
clean up the source tree and ensure no state from previous builds
remains.

>  *.pyc
>  .sdk
>  .stgit-*
> diff --git a/configure b/configure
> index 38ee257701..c195630771 100755
> --- a/configure
> +++ b/configure
> @@ -302,6 +302,9 @@ else
>objcc="${objcc-${cross_prefix}clang}"
>  fi
>  
> +with_rust="auto"
> +with_rust_target_triple=""
> +
>  ar="${AR-${cross_prefix}ar}"
>  as="${AS-${cross_prefix}as}"
>  ccas="${CCAS-$cc}"
> @@ -760,6 +763,12 @@ for opt do
>;;
>--gdb=*) gdb_bin="$optarg"
>;;
> +  --enable-rust) with_rust=enabled
> +  ;;
> +  --disable-rust) with_rust=disabled
> +  ;;
> +  --rust-target-triple=*) with_rust_target_triple="$optarg"
> +  ;;
># everything else has the same name in configure and meson
>--*) meson_option_parse "$opt" "$optarg"
>;;
> @@ -1796,6 +1805,9 @@ if test "$skip_meson" = no; then
>test -n "${LIB_FUZZING_ENGINE+xxx}" && meson_option_add 
> "-Dfuzzing_engine=$LIB_FUZZING_ENGINE"
>test "$plugins" = yes && meson_option_add "-Dplugins=true"
>test "$tcg" != enabled && meson_option_add "-Dtcg=$tcg"
> +  test "$with_rust" != enabled && meson_option_add "-Dwith_rust=$with_rust"
> +  test "$with_rust" != enabled && meson_option_add "-Dwith_rust=$with_rust"

Duplicate line.

> +  test "$with_rust_target_triple" != "" && meson_option_add 
> "-Dwith_rust_target_triple=$with_rust_target_triple"
>run_meson() {
>  NINJA=$ninja $meson setup "$@" "$PWD" "$source_path"
>}
> diff --git a/meson.build b/meson.build
> index a9de71d450..3533889852 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -290,6 +290,12 @@ foreach lang : all_languages
>endif
>  endforeach
>  
> +cargo = not_found
> +if get_option('with_rust').allowed()
> +  cargo = find_program('cargo', required: get_option('with_rust'))
> +endif
> +with_rust = cargo.found()
> +
>  # default flags for all hosts
>  # We use -fwrapv to tell the compiler that we require a C dialect where
>  # left shift of signed integers is well defined and has the expected
> @@ -2066,6 +2072,7 @@ endif
>  
>  config_host_data = configuration_data()
>  
> +config_host_data.set('CONFIG_WITH_RUST', with_rust)
>  audio_drivers_selected = []
>  if have_system
>audio_drivers_available = {
> @@ -4190,6 +4197,10 @@ if 'objc' in all_languages
>  else
>summary_info += {'Objective-C compiler': false}
>  endif
> +summary_info += {'Rust support':  with_rust}
> +if with_rust and get_option('with_rust_target_triple') != ''
> +  summary_info += {'Rust target': get_option('with_rust_target_triple')}
> +endif
>  option_cflags = (get_option('debug') ? ['-g'] : [])
>  if get_option('optimization') != 'plain'
>option_cflags += ['-O' + get_option('optimization')]
> diff --git a/meson_options.txt b/meson_options.txt
> index 4c1583eb40..223491b731 100644
> --- a/meson_options.txt
> +++ b/meson_options.txt
> @@ -366,3 +366,7 @@ option('qemu_ga_version', type: 'string', value: '',
>  
>  option('hexagon_idef_parser', type : 'boolean', value : true,
> description: 'use idef-parser to automatically generate TCG code for 
> the Hexagon frontend')
> +option('with_rust', type: 'feature', value: 'auto',
> +   description: 'Enable Rust support')
> +option('with_rust_target_triple', type : 'string', value: '',
> +   description: 'Rust target triple')
> diff --git a/scripts/cargo_wrapper.py b/scripts/cargo_wrapper.py
> new file mode 100644
> index 00..d338effdaa
> --- /dev/null
> +++ b/scripts/cargo_wrapper.py
> @@ -0,0 +1,211 @@
> +#!/usr/bin/env python3
> +# Copyright (c) 2020 Red Hat, Inc.
> +# Copyright (c) 2023 Linaro Ltd.
> +#
> +# Authors:
> +#  Manos Pitsidianakis 
> +#  Marc-André Lureau 
> +#
> +# This work is licensed under the terms of the GNU GPL, version 2 or
> +# later.  See the COPYING file in the top-level directory.
> +
> +import argparse
> +import configparser
> +import distutils.file_util

Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context

2024-06-11 Thread Stefan Hajnoczi
On Tue, Jun 11, 2024 at 02:08:49PM +0200, Fiona Ebner wrote:
> Am 06.06.24 um 20:36 schrieb Stefan Hajnoczi:
> > On Wed, Jun 05, 2024 at 02:08:48PM +0200, Fiona Ebner wrote:
> >> The fact that the snapshot_save_job_bh() is scheduled in the main
> >> loop's qemu_aio_context AioContext means that it might get executed
> >> during a vCPU thread's aio_poll(). But saving of the VM state cannot
> >> happen while the guest or devices are active and can lead to assertion
> >> failures. See issue #2111 for two examples. Avoid the problem by
> >> scheduling the snapshot_save_job_bh() in the iohandler AioContext,
> >> which is not polled by vCPU threads.
> >>
> >> Solves Issue #2111.
> >>
> >> This change also solves the following issue:
> >>
> >> Since commit effd60c878 ("monitor: only run coroutine commands in
> >> qemu_aio_context"), the 'snapshot-save' QMP call would not respond
> >> right after starting the job anymore, but only after the job finished,
> >> which can take a long time. The reason is, because after commit
> >> effd60c878, do_qmp_dispatch_bh() runs in the iohandler AioContext.
> >> When do_qmp_dispatch_bh() wakes the qmp_dispatch() coroutine, the
> >> coroutine cannot be entered immediately anymore, but needs to be
> >> scheduled to the main loop's qemu_aio_context AioContext. But
> >> snapshot_save_job_bh() was scheduled first to the same AioContext and
> >> thus gets executed first.
> >>
> >> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111
> >> Signed-off-by: Fiona Ebner 
> >> ---
> >>
> >> While initial smoke testing seems fine, I'm not familiar enough with
> >> this to rule out any pitfalls with the approach. Any reason why
> >> scheduling to the iohandler AioContext could be wrong here?
> > 
> > If something waits for a BlockJob to finish using aio_poll() from
> > qemu_aio_context then a deadlock is possible since the iohandler_ctx
> > won't get a chance to execute. The only suspicious code path I found was
> > job_completed_txn_abort_locked() -> job_finish_sync_locked() but I'm not
> > sure whether it triggers this scenario. Please check that code path.
> > 
> 
> Sorry, I don't understand. Isn't executing the scheduled BH the only
> additional progress that the iohandler_ctx needs to make compared to
> before the patch? How exactly would that cause issues when waiting for a
> BlockJob?
> 
> Or do you mean something waiting for the SnapshotJob from
> qemu_aio_context before snapshot_save_job_bh had the chance to run?

Yes, exactly. job_finish_sync_locked() will hang since iohandler_ctx has
no chance to execute. But I haven't audited the code to understand
whether this can happen.

Stefan


signature.asc
Description: PGP signature


Re: [RFC PATCH v1 0/6] Implement ARM PL011 in Rust

2024-06-10 Thread Stefan Hajnoczi
On Mon, 10 Jun 2024 at 16:27, Manos Pitsidianakis
 wrote:
>
> On Mon, 10 Jun 2024 22:59, Stefan Hajnoczi  wrote:
> >> What are the issues with not using the compiler, rustc, directly?
> >> -
> >> [whataretheissueswith] Back to [TOC]
> >>
> >> 1. Tooling
> >>Mostly writing up the build-sys tooling to do so. Ideally we'd
> >>compile everything without cargo but rustc directly.
> >
> >Why would that be ideal?
>
> It removes the indirection level of meson<->cargo<->rustc. I don't have a
> concrete idea on how to tackle this, but if cargo ends up not strictly
> necessary, I don't see why we cannot use one build system.

The convenience of being able to use cargo dependencies without
special QEMU meson build system effort seems worth the overhead of
meson<->cargo<->rustc to me. There is a blog post that explores using
cargo crates using meson's wrap dependencies here, and it seems like
extra work:
https://coaxion.net/blog/2023/04/building-a-gstreamer-plugin-in-rust-with-meson-instead-of-cargo/

It's possible to use just meson today, but I don't think it's
practical when using cargo dependencies.

>
> >
> >>
> >>If we decide we need Rust's `std` library support, we could
> >>investigate whether building it from scratch is a good solution. This
> >>will only build the bits we need in our devices.
> >
> >Whether or not to use std is a fundamental decision. It might be
> >difficult to move away from std later on. This is something that should be
> >discussed in more detail.
> >
> >Do you want to avoid std for maximum flexibility in the future, or are
> >there QEMU use cases today where std is unavailable?
>
> For flexibility, and for being compatible with more versions.
>
> But I do not want to avoid it; what I am saying is that we can do a custom
> build of it instead of linking to the rust toolchain's prebuilt version.

What advantages does a custom build of std bring?

>
> >
> >>
> >> 2. Rust dependencies
> >>We could go without them completely. I chose deliberately to include
> >>one dependency in my UART implementation, `bilge`[0], because it has
> >>an elegant way of representing typed bitfields for the UART's
> >>registers.
> >>
> >> [0]: Article: https://hecatia-elegua.github.io/blog/no-more-bit-fiddling/
> >>  Crates.io page: https://crates.io/crates/bilge
> >>  Repository: https://github.com/hecatia-elegua/bilge
> >
> >I guess there will be interest in using rust-vmm crates in some way.
> >
> >Bindings to platform features that are not available in core or std
> >will also be desirable. We probably don't want to reinvent them.
>
>
> Agreed.
>
> >
> >>
> >> Should QEMU use third-party dependencies?
> >> -
> >> [shouldqemuusethirdparty] Back to [TOC]
> >>
> >> In my personal opinion, if we need a dependency we need a strong
> >> argument for it. A dependency needs a trusted upstream source, a QEMU
> >> maintainer to make sure it us up-to-date in QEMU etc.
> >>
> >> We already fetch some projects with meson subprojects, so this is not a
> >> new reality. Cargo allows you to define "locked" dependencies which is
> >> the same as only fetching specific commits by SHA. No suspicious
> >> tarballs, and no disappearing dependencies a la left-pad in npm.
> >>
> >> However, I believe it's worth considering vendoring every dependency by
> >> default, if they prove to be few, for the sake of having a local QEMU
> >> git clone buildable without network access.
> >
> >Do you mean vendoring by committing them to qemu.git or just the
> >practice of running `cargo vendor` locally for users who decide they
> >want to keep a copy of the dependencies?
>
>
> Committing, with an option to opt-out. They are generally not big in
> size. I am not of strong opinion on this one, I'm very open to
> alternatives.

Fedora and Debian want Rust applications to use distro-packaged
crates. No vendoring and no crates.io online access. It's a bit of a
pain because Rust developers need to make sure their code works with
whatever version of crates Fedora and Debian provide.

The `cargo vendor` command makes it easy for anyone wishing to collect
the required dependencies for offline builds (something I've used for
CentOS builds where vendoring is allowed).

I suggest not vendoring packages in qemu.git. Users can still run
`cargo vendor` for easy offline builds.

Re: [RFC PATCH v1 0/6] Implement ARM PL011 in Rust

2024-06-10 Thread Stefan Hajnoczi
On Mon, 10 Jun 2024 at 14:23, Manos Pitsidianakis
 wrote:
>
> Hello everyone,
>
> This is an early draft of my work on implementing a very simple device,
> in this case the ARM PL011 (which in C code resides in hw/char/pl011.c
> and is used in hw/arm/virt.c).
>
> The device is functional, with copied logic from the C code but with
> effort not to make a direct C to Rust translation. In other words, do
> not write Rust as a C developer would.
>
> That goal is not fully met yet, but it is a best-effort attempt. To give a specific
> example, register values are typed but interrupt bit flags are not (but
> could be). I will leave such minutiae for later iterations.
>
> By the way, the wiki page for Rust was revived to keep track of all
> current series on the mailing list https://wiki.qemu.org/RustInQemu
>
> a #qemu-rust IRC channel was also created for rust-specific discussion
> that might flood #qemu
>
> 
> A request: keep comments to Rust in relation to the QEMU project and no
> debates on the merits of the language itself. These are valid concerns,
> but it'd be better if they were on separate mailing list threads.
> 
>
> Table of contents: [TOC]
>
> - How can I try it? [howcanItryit]
> - What are the most important points to focus on, at this point?
>   [whatarethemostimportant]
>   - What are the issues with not using the compiler, rustc, directly?
> [whataretheissueswith]
> 1. Tooling
> 2. Rust dependencies
>
>   - Should QEMU use third-party dependencies? [shouldqemuusethirdparty]
>   - Should QEMU provide wrapping Rust APIs over QEMU internals?
> [qemuprovidewrappingrustapis]
>   - Will QEMU now depend on Rust and thus not build on my XYZ platform?
> [qemudependonrustnotbuildonxyz]
> - How is the compilation structured? [howisthecompilationstructured]
> - The generated.rs rust file includes a bunch of junk definitions?
>   [generatedrsincludesjunk]
> - The staticlib artifact contains a bunch of mangled .o objects?
>   [staticlibmangledobjects]
>
> How can I try it?
> =
> [howcanItryit] Back to [TOC]
>
> Hopefully applying these patches (or checking out the `master` branch from
> https://gitlab.com/epilys/rust-for-qemu/ current commit
> de81929e0e9d470deac2c6b449b7a5183325e7ee )
>
> Tag for this RFC is rust-pl011-rfc-v1
>
> Rustdoc documentation is hosted on
>
> https://rust-for-qemu-epilys-aebb06ca9f9adfe6584811c14ae44156501d935ba4.gitlab.io/pl011/index.html
>
> If `cargo` and `bindgen` are installed on your system, you should be able
> to build qemu-system-aarch64 with configure flag --enable-rust and
> launch an arm virt VM. One of the patches hardcodes the default UART of
> the machine to the Rust one, so if something goes wrong you will see it
> upon launching qemu-system-aarch64.
>
> To confirm it is there for sure, run e.g. info qom-tree on the monitor
> and look for x-pl011-rust.
>
>
> What are the most important points to focus on, at this point?
> ==
> [whatarethemostimportant] Back to [TOC]
>
> In my opinion, integration of the go-to Rust build system (Cargo and
> crates.io) with the build system we use in QEMU. This is "easily" done
> in some definition of the word with a python wrapper script.
>
> What are the issues with not using the compiler, rustc, directly?
> -
> [whataretheissueswith] Back to [TOC]
>
> 1. Tooling
>Mostly writing up the build-sys tooling to do so. Ideally we'd
>compile everything without cargo but rustc directly.

Why would that be ideal?

>
>If we decide we need Rust's `std` library support, we could
>investigate whether building it from scratch is a good solution. This
>will only build the bits we need in our devices.

Whether or not to use std is a fundamental decision. It might be
difficult to move away from std later on. This is something that should be
discussed in more detail.

Do you want to avoid std for maximum flexibility in the future, or are
there QEMU use cases today where std is unavailable?

>
> 2. Rust dependencies
>We could go without them completely. I chose deliberately to include
>one dependency in my UART implementation, `bilge`[0], because it has
>an elegant way of representing typed bitfields for the UART's
>registers.
>
> [0]: Article: https://hecatia-elegua.github.io/blog/no-more-bit-fiddling/
>  Crates.io page: https://crates.io/crates/bilge
>  Repository: https://github.com/hecatia-elegua/bilge

I guess there will be interest in using rust-vmm crates in some way.

Bindings to platform features that are not available in core or std
will also be desirable. We probably don't want to reinvent them.

>
> Should QEMU use third-party dependencies?
> -
> [shouldqemuusethirdparty] Back to 

Re: [PATCH v5 01/10] block: add persistent reservation in/out api

2024-06-10 Thread Stefan Hajnoczi
On Thu, Jun 06, 2024 at 08:24:35PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations
> at the block level. The following operations
> are included:
> 
> - read_keys:retrieves the list of registered keys.
> - read_reservation: retrieves the current reservation status.
> - register: registers a new reservation key.
> - reserve:  initiates a reservation for a specific key.
> - release:  releases a reservation for a specific key.
> - clear:clears all existing reservations.
> - preempt:  preempts a reservation held by another key.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  block/block-backend.c | 397 ++
>  block/io.c| 163 
>  include/block/block-common.h  |  40 +++
>  include/block/block-io.h  |  20 ++
>  include/block/block_int-common.h  |  84 +++
>  include/sysemu/block-backend-io.h |  24 ++
>  6 files changed, 728 insertions(+)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index db6f9b92a3..6707d94df7 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1770,6 +1770,403 @@ BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned 
> long int req, void *buf,
>  return blk_aio_prwv(blk, req, 0, buf, blk_aio_ioctl_entry, 0, cb, 
> opaque);
>  }
>  
> +typedef struct BlkPrInCo {
> +BlockBackend *blk;
> +uint32_t *generation;
> +uint32_t num_keys;
> +BlockPrType *type;
> +uint64_t *keys;
> +int ret;
> +} BlkPrInCo;
> +
> +typedef struct BlkPrInCB {
> +BlockAIOCB common;
> +BlkPrInCo prco;
> +bool has_returned;
> +} BlkPrInCB;
> +
> +static const AIOCBInfo blk_pr_in_aiocb_info = {
> +.aiocb_size = sizeof(BlkPrInCB),
> +};
> +
> +static void blk_pr_in_complete(BlkPrInCB *acb)
> +{
> +if (acb->has_returned) {
> +acb->common.cb(acb->common.opaque, acb->prco.ret);
> +blk_dec_in_flight(acb->prco.blk);

Did you receive my replies to v1 of this patch series?

Please take a look at them and respond:
https://lore.kernel.org/qemu-devel/20240508093629.441057-1-luchangqi@bytedance.com/

Thanks,
Stefan

> +qemu_aio_unref(acb);
> +}
> +}
> +
> +static void blk_pr_in_complete_bh(void *opaque)
> +{
> +BlkPrInCB *acb = opaque;
> +assert(acb->has_returned);
> +blk_pr_in_complete(acb);
> +}
> +
> +static BlockAIOCB *blk_aio_pr_in(BlockBackend *blk, uint32_t *generation,
> + uint32_t num_keys, BlockPrType *type,
> + uint64_t *keys, CoroutineEntry co_entry,
> + BlockCompletionFunc *cb, void *opaque)
> +{
> +BlkPrInCB *acb;
> +Coroutine *co;
> +
> +blk_inc_in_flight(blk);
> +acb = blk_aio_get(_pr_in_aiocb_info, blk, cb, opaque);
> +acb->prco = (BlkPrInCo) {
> +.blk= blk,
> +.generation = generation,
> +.num_keys   = num_keys,
> +.type   = type,
> +.ret= NOT_DONE,
> +.keys   = keys,
> +};
> +acb->has_returned = false;
> +
> +co = qemu_coroutine_create(co_entry, acb);
> +aio_co_enter(qemu_get_current_aio_context(), co);
> +
> +acb->has_returned = true;
> +if (acb->prco.ret != NOT_DONE) {
> +replay_bh_schedule_oneshot_event(qemu_get_current_aio_context(),
> + blk_pr_in_complete_bh, acb);
> +}
> +
> +return >common;
> +}
> +
> +/* To be called between exactly one pair of blk_inc/dec_in_flight() */
> +static int coroutine_fn
> +blk_aio_pr_do_read_keys(BlockBackend *blk, uint32_t *generation,
> +uint32_t num_keys, uint64_t *keys)
> +{
> +IO_CODE();
> +
> +blk_wait_while_drained(blk);
> +GRAPH_RDLOCK_GUARD();
> +
> +if (!blk_co_is_available(blk)) {
> +return -ENOMEDIUM;
> +}
> +
> +return bdrv_co_pr_read_keys(blk_bs(blk), generation, num_keys, keys);
> +}
> +
> +static void coroutine_fn blk_aio_pr_read_keys_entry(void *opaque)
> +{
> +BlkPrInCB *acb = opaque;
> +BlkPrInCo *prco = >prco;
> +
> +prco->ret = blk_aio_pr_do_read_keys(prco->blk, prco->generation,
> +prco->num_keys, prco->keys);
> +blk_pr_in_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_pr_read_keys(BlockBackend *blk, uint32_t *generation,
> + uint32_t num_keys, uint64_t *keys,
> + BlockCompletionFunc *cb, void *opaque)
> +{
> +IO_CODE();
> +return blk_aio_pr_in(blk, generation, num_keys, NULL, keys,
> + blk_aio_pr_read_keys_entry, cb, opaque);
> +}
> +
> +/* To be called between exactly one pair of blk_inc/dec_in_flight() */
> +static int coroutine_fn
> +blk_aio_pr_do_read_reservation(BlockBackend *blk, uint32_t *generation,
> +   uint64_t *key, BlockPrType 

Re: [PATCH v5 00/10] Support persistent reservation operations

2024-06-10 Thread Stefan Hajnoczi
On Thu, Jun 06, 2024 at 08:24:34PM +0800, Changqi Lu wrote:
> Hi,
> 
> patchv5 has been modified. 
> 
> Sincerely hope that everyone can help review the
> code and provide some suggestions.
> 
> v4->v5:
> - Fixed a memory leak bug at hw/nvme/ctrl.c.
> 
> v3->v4:
> - At the nvme layer, the two patches of enabling the ONCS
>   function and enabling rescap are combined into one.
> - At the nvme layer, add helper functions for pr capacity
>   conversion between the block layer and the nvme layer.
> 
> v2->v3:
> In v2, Persist Through Power Loss (PTPL) is enabled by default.
> In v3, PTPL is supported and is passed as a parameter.
> 
> v1->v2:
> - Add sg_persist --report-capabilities for SCSI protocol and enable
>   oncs and rescap for NVMe protocol.
> - Add persistent reservation capabilities constants and helper functions for
>   SCSI and NVMe protocol.
> - Add comments for necessary APIs.
> 
> v1:
> - Add seven APIs about persistent reservation command for block layer.
>   These APIs including reading keys, reading reservations, registering,
>   reserving, releasing, clearing and preempting.
> - Add the necessary pr-related operation APIs for both the
>   SCSI protocol and NVMe protocol at the device layer.
> - Add scsi driver at the driver layer to verify the functions

My question from v1 is unanswered:

  What is the relationship to the existing PRManager functionality
  (docs/interop/pr-helper.rst) where block/file-posix.c interprets SCSI
  ioctls and sends persistent reservation requests to an external helper
  process?

  I wonder if block/file-posix.c can implement the new block driver
  callbacks using pr_mgr (while keeping the existing scsi-generic
  support).

Thanks,
Stefan

> 
> 
> Changqi Lu (10):
>   block: add persistent reservation in/out api
>   block/raw: add persistent reservation in/out driver
>   scsi/constant: add persistent reservation in/out protocol constants
>   scsi/util: add helper functions for persistent reservation types
> conversion
>   hw/scsi: add persistent reservation in/out api for scsi device
>   block/nvme: add reservation command protocol constants
>   hw/nvme: add helper functions for converting reservation types
>   hw/nvme: enable ONCS and rescap function
>   hw/nvme: add reservation protocal command
>   block/iscsi: add persistent reservation in/out driver
> 
>  block/block-backend.c | 397 ++
>  block/io.c| 163 +++
>  block/iscsi.c | 443 ++
>  block/raw-format.c|  56 
>  hw/nvme/ctrl.c| 326 +-
>  hw/nvme/ns.c  |   5 +
>  hw/nvme/nvme.h|  84 ++
>  hw/scsi/scsi-disk.c   | 352 
>  include/block/block-common.h  |  40 +++
>  include/block/block-io.h  |  20 ++
>  include/block/block_int-common.h  |  84 ++
>  include/block/nvme.h  |  98 +++
>  include/scsi/constants.h  |  52 
>  include/scsi/utils.h  |   8 +
>  include/sysemu/block-backend-io.h |  24 ++
>  scsi/utils.c  |  81 ++
>  16 files changed, 2231 insertions(+), 2 deletions(-)
> 
> -- 
> 2.20.1
> 


signature.asc
Description: PGP signature


[PULL 5/6] hw/vfio: Remove newline character in trace events

2024-06-10 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Trace events aren't designed to be multi-line.
Remove the newline characters.

Signed-off-by: Philippe Mathieu-Daudé 
Acked-by: Mads Ynddal 
Reviewed-by: Daniel P. Berrangé 
Message-id: 20240606103943.79116-5-phi...@linaro.org
Signed-off-by: Stefan Hajnoczi 
---
 hw/vfio/trace-events | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 64161bf6f4..e16179b507 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -19,7 +19,7 @@ vfio_msix_fixup(const char *name, int bar, uint64_t start, 
uint64_t end) " (%s)
 vfio_msix_relo(const char *name, int bar, uint64_t offset) " (%s) BAR %d 
offset 0x%"PRIx64""
 vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI 
vectors"
 vfio_msi_disable(const char *name) " (%s)"
-vfio_pci_load_rom(const char *name, unsigned long size, unsigned long offset, 
unsigned long flags) "Device %s ROM:\n  size: 0x%lx, offset: 0x%lx, flags: 
0x%lx"
+vfio_pci_load_rom(const char *name, unsigned long size, unsigned long offset, 
unsigned long flags) "Device '%s' ROM: size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
 vfio_rom_read(const char *name, uint64_t addr, int size, uint64_t data) " (%s, 
0x%"PRIx64", 0x%x) = 0x%"PRIx64
 vfio_pci_size_rom(const char *name, int size) "%s ROM size 0x%x"
 vfio_vga_write(uint64_t addr, uint64_t data, int size) " (0x%"PRIx64", 
0x%"PRIx64", %d)"
@@ -35,7 +35,7 @@ vfio_pci_hot_reset(const char *name, const char *type) " (%s) 
%s"
 vfio_pci_hot_reset_has_dep_devices(const char *name) "%s: hot reset dependent 
devices:"
 vfio_pci_hot_reset_dep_devices(int domain, int bus, int slot, int function, 
int group_id) "\t%04x:%02x:%02x.%x group %d"
 vfio_pci_hot_reset_result(const char *name, const char *result) "%s hot reset: 
%s"
-vfio_populate_device_config(const char *name, unsigned long size, unsigned 
long offset, unsigned long flags) "Device %s config:\n  size: 0x%lx, offset: 
0x%lx, flags: 0x%lx"
+vfio_populate_device_config(const char *name, unsigned long size, unsigned 
long offset, unsigned long flags) "Device '%s' config: size: 0x%lx, offset: 
0x%lx, flags: 0x%lx"
 vfio_populate_device_get_irq_info_failure(const char *errstr) 
"VFIO_DEVICE_GET_IRQ_INFO failure: %s"
 vfio_attach_device(const char *name, int group_id) " (%s) group %d"
 vfio_detach_device(const char *name, int group_id) " (%s) group %d"
-- 
2.45.1




[PULL 6/6] tracetool: Forbid newline character in event format

2024-06-10 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Events aren't designed to be multi-line. Multiple events
can be used instead. Prevent formats from using multiple lines
by forbidding the newline character.

Signed-off-by: Philippe Mathieu-Daudé 
Acked-by: Mads Ynddal 
Reviewed-by: Daniel P. Berrangé 
Message-id: 20240606103943.79116-6-phi...@linaro.org
Signed-off-by: Stefan Hajnoczi 
---
 scripts/tracetool/__init__.py | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/scripts/tracetool/__init__.py b/scripts/tracetool/__init__.py
index 7237abe0e8..bc03238c0f 100644
--- a/scripts/tracetool/__init__.py
+++ b/scripts/tracetool/__init__.py
@@ -301,6 +301,8 @@ def build(line_str, lineno, filename):
 if fmt.endswith(r'\n"'):
 raise ValueError("Event format must not end with a newline "
  "character")
+if '\\n' in fmt:
+raise ValueError("Event format must not use new line character")
 
 if len(fmt_trans) > 0:
 fmt = [fmt_trans, fmt]
-- 
2.45.1




[PULL 4/6] hw/usb: Remove newline character in trace events

2024-06-10 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Trace events aren't designed to be multi-line.
Remove the newline characters.

Signed-off-by: Philippe Mathieu-Daudé 
Acked-by: Mads Ynddal 
Reviewed-by: Daniel P. Berrangé 
Message-id: 20240606103943.79116-4-phi...@linaro.org
Signed-off-by: Stefan Hajnoczi 
---
 hw/usb/trace-events | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/usb/trace-events b/hw/usb/trace-events
index fd7b90d70c..46732717a9 100644
--- a/hw/usb/trace-events
+++ b/hw/usb/trace-events
@@ -15,7 +15,7 @@ usb_ohci_exit(const char *s) "%s"
 
 # hcd-ohci.c
 usb_ohci_iso_td_read_failed(uint32_t addr) "ISO_TD read error at 0x%x"
-usb_ohci_iso_td_head(uint32_t head, uint32_t tail, uint32_t flags, uint32_t 
bp, uint32_t next, uint32_t be, uint32_t framenum, uint32_t startframe, 
uint32_t framecount, int rel_frame_num) "ISO_TD ED head 0x%.8x tailp 
0x%.8x\n0x%.8x 0x%.8x 0x%.8x 0x%.8x\nframe_number 0x%.8x starting_frame 
0x%.8x\nframe_count  0x%.8x relative %d"
+usb_ohci_iso_td_head(uint32_t head, uint32_t tail, uint32_t flags, uint32_t 
bp, uint32_t next, uint32_t be, uint32_t framenum, uint32_t startframe, 
uint32_t framecount, int rel_frame_num) "ISO_TD ED head 0x%.8x tailp 0x%.8x, 
flags 0x%.8x bp 0x%.8x next 0x%.8x be 0x%.8x, frame_number 0x%.8x 
starting_frame 0x%.8x, frame_count 0x%.8x relative %d"
 usb_ohci_iso_td_head_offset(uint32_t o0, uint32_t o1, uint32_t o2, uint32_t 
o3, uint32_t o4, uint32_t o5, uint32_t o6, uint32_t o7) "0x%.8x 0x%.8x 0x%.8x 
0x%.8x 0x%.8x 0x%.8x 0x%.8x 0x%.8x"
 usb_ohci_iso_td_relative_frame_number_neg(int rel) "ISO_TD R=%d < 0"
 usb_ohci_iso_td_relative_frame_number_big(int rel, int count) "ISO_TD R=%d > 
FC=%d"
@@ -23,7 +23,7 @@ usb_ohci_iso_td_bad_direction(int dir) "Bad direction %d"
 usb_ohci_iso_td_bad_bp_be(uint32_t bp, uint32_t be) "ISO_TD bp 0x%.8x be 
0x%.8x"
 usb_ohci_iso_td_bad_cc_not_accessed(uint32_t start, uint32_t next) "ISO_TD cc 
!= not accessed 0x%.8x 0x%.8x"
 usb_ohci_iso_td_bad_cc_overrun(uint32_t start, uint32_t next) "ISO_TD 
start_offset=0x%.8x > next_offset=0x%.8x"
-usb_ohci_iso_td_so(uint32_t so, uint32_t eo, uint32_t s, uint32_t e, const 
char *str, ssize_t len, int ret) "0x%.8x eo 0x%.8x\nsa 0x%.8x ea 0x%.8x\ndir %s 
len %zu ret %d"
+usb_ohci_iso_td_so(uint32_t so, uint32_t eo, uint32_t s, uint32_t e, const 
char *str, ssize_t len, int ret) "0x%.8x eo 0x%.8x sa 0x%.8x ea 0x%.8x dir %s 
len %zu ret %d"
 usb_ohci_iso_td_data_overrun(int ret, ssize_t len) "DataOverrun %d > %zu"
 usb_ohci_iso_td_data_underrun(int ret) "DataUnderrun %d"
 usb_ohci_iso_td_nak(int ret) "got NAK/STALL %d"
@@ -55,7 +55,7 @@ usb_ohci_td_pkt_full(const char *dir, const char *buf) "%s 
data: %s"
 usb_ohci_td_too_many_pending(int ep) "ep=%d"
 usb_ohci_td_packet_status(int status) "status=%d"
 usb_ohci_ed_read_error(uint32_t addr) "ED read error at 0x%x"
-usb_ohci_ed_pkt(uint32_t cur, int h, int c, uint32_t head, uint32_t tail, 
uint32_t next) "ED @ 0x%.8x h=%u c=%u\n  head=0x%.8x tailp=0x%.8x next=0x%.8x"
+usb_ohci_ed_pkt(uint32_t cur, int h, int c, uint32_t head, uint32_t tail, 
uint32_t next) "ED @ 0x%.8x h=%u c=%u head=0x%.8x tailp=0x%.8x next=0x%.8x"
 usb_ohci_ed_pkt_flags(uint32_t fa, uint32_t en, uint32_t d, int s, int k, int 
f, uint32_t mps) "fa=%u en=%u d=%u s=%u k=%u f=%u mps=%u"
 usb_ohci_hcca_read_error(uint32_t addr) "HCCA read error at 0x%x"
 usb_ohci_mem_read(uint32_t size, const char *name, uint32_t addr, uint32_t 
offs, uint32_t val) "%d %s 0x%x %d -> 0x%x"
-- 
2.45.1




[PULL 3/6] hw/sh4: Remove newline character in trace events

2024-06-10 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Trace events aren't designed to be multi-line. Remove
the newline character which doesn't bring much value.

Signed-off-by: Philippe Mathieu-Daudé 
Acked-by: Mads Ynddal 
Reviewed-by: Daniel P. Berrangé 
Message-id: 20240606103943.79116-3-phi...@linaro.org
Signed-off-by: Stefan Hajnoczi 
---
 hw/sh4/trace-events | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/sh4/trace-events b/hw/sh4/trace-events
index 4b61cd56c8..6bfd7eebc4 100644
--- a/hw/sh4/trace-events
+++ b/hw/sh4/trace-events
@@ -1,3 +1,3 @@
 # sh7750.c
-sh7750_porta(uint16_t prev, uint16_t cur, uint16_t pdtr, uint16_t pctr) "porta 
changed from 0x%04x to 0x%04x\npdtra=0x%04x, pctra=0x%08x"
-sh7750_portb(uint16_t prev, uint16_t cur, uint16_t pdtr, uint16_t pctr) "portb 
changed from 0x%04x to 0x%04x\npdtrb=0x%04x, pctrb=0x%08x"
+sh7750_porta(uint16_t prev, uint16_t cur, uint16_t pdtr, uint16_t pctr) "porta 
changed from 0x%04x to 0x%04x (pdtra=0x%04x, pctra=0x%08x)"
+sh7750_portb(uint16_t prev, uint16_t cur, uint16_t pdtr, uint16_t pctr) "portb 
changed from 0x%04x to 0x%04x (pdtrb=0x%04x, pctrb=0x%08x)"
-- 
2.45.1




[PULL 2/6] backends/tpm: Remove newline character in trace event

2024-06-10 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

Split the 'tpm_util_show_buffer' event in two to avoid
using a newline character.

Signed-off-by: Philippe Mathieu-Daudé 
Acked-by: Mads Ynddal 
Reviewed-by: Daniel P. Berrangé 
Reviewed-by: Stefan Berger 
Message-id: 20240606103943.79116-2-phi...@linaro.org
Signed-off-by: Stefan Hajnoczi 
---
 backends/tpm/tpm_util.c   | 5 +++--
 backends/tpm/trace-events | 3 ++-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/backends/tpm/tpm_util.c b/backends/tpm/tpm_util.c
index 1856589c3b..cf138551df 100644
--- a/backends/tpm/tpm_util.c
+++ b/backends/tpm/tpm_util.c
@@ -339,10 +339,11 @@ void tpm_util_show_buffer(const unsigned char *buffer,
 size_t len, i;
 char *line_buffer, *p;
 
-if (!trace_event_get_state_backends(TRACE_TPM_UTIL_SHOW_BUFFER)) {
+if (!trace_event_get_state_backends(TRACE_TPM_UTIL_SHOW_BUFFER_CONTENT)) {
 return;
 }
 len = MIN(tpm_cmd_get_size(buffer), buffer_size);
+trace_tpm_util_show_buffer_header(string, len);
 
 /*
  * allocate enough room for 3 chars per buffer entry plus a
@@ -356,7 +357,7 @@ void tpm_util_show_buffer(const unsigned char *buffer,
 }
 p += sprintf(p, "%.2X ", buffer[i]);
 }
-trace_tpm_util_show_buffer(string, len, line_buffer);
+trace_tpm_util_show_buffer_content(line_buffer);
 
 g_free(line_buffer);
 }
diff --git a/backends/tpm/trace-events b/backends/tpm/trace-events
index 1ecef42a07..cb5cfa6510 100644
--- a/backends/tpm/trace-events
+++ b/backends/tpm/trace-events
@@ -10,7 +10,8 @@ tpm_util_get_buffer_size_len(uint32_t len, size_t expected) 
"tpm_resp->len = %u,
 tpm_util_get_buffer_size_hdr_len2(uint32_t len, size_t expected) 
"tpm2_resp->hdr.len = %u, expected = %zu"
 tpm_util_get_buffer_size_len2(uint32_t len, size_t expected) "tpm2_resp->len = 
%u, expected = %zu"
 tpm_util_get_buffer_size(size_t len) "buffersize of device: %zu"
-tpm_util_show_buffer(const char *direction, size_t len, const char *buf) 
"direction: %s len: %zu\n%s"
+tpm_util_show_buffer_header(const char *direction, size_t len) "direction: %s 
len: %zu"
+tpm_util_show_buffer_content(const char *buf) "%s"
 
 # tpm_emulator.c
 tpm_emulator_set_locality(uint8_t locty) "setting locality to %d"
-- 
2.45.1




[PULL 1/6] tracetool: Remove unused vcpu.py script

2024-06-10 Thread Stefan Hajnoczi
From: Philippe Mathieu-Daudé 

vcpu.py is pointless since commit 89aafcf2a7 ("trace:
remove code that depends on setting vcpu"), remove it.

Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Daniel P. Berrangé 
Reviewed-by: Zhao Liu 
Message-id: 20240606102631.78152-1-phi...@linaro.org
Signed-off-by: Stefan Hajnoczi 
---
 meson.build   |  1 -
 scripts/tracetool/__init__.py |  8 +
 scripts/tracetool/vcpu.py | 59 ---
 3 files changed, 1 insertion(+), 67 deletions(-)
 delete mode 100644 scripts/tracetool/vcpu.py

diff --git a/meson.build b/meson.build
index ec59effca2..91278667ea 100644
--- a/meson.build
+++ b/meson.build
@@ -3232,7 +3232,6 @@ tracetool_depends = files(
   'scripts/tracetool/format/log_stap.py',
   'scripts/tracetool/format/stap.py',
   'scripts/tracetool/__init__.py',
-  'scripts/tracetool/vcpu.py'
 )
 
 qemu_version_cmd = [find_program('scripts/qemu-version.sh'),
diff --git a/scripts/tracetool/__init__.py b/scripts/tracetool/__init__.py
index b887540a55..7237abe0e8 100644
--- a/scripts/tracetool/__init__.py
+++ b/scripts/tracetool/__init__.py
@@ -306,13 +306,7 @@ def build(line_str, lineno, filename):
 fmt = [fmt_trans, fmt]
 args = Arguments.build(groups["args"])
 
-event = Event(name, props, fmt, args, lineno, filename)
-
-# add implicit arguments when using the 'vcpu' property
-import tracetool.vcpu
-event = tracetool.vcpu.transform_event(event)
-
-return event
+return Event(name, props, fmt, args, lineno, filename)
 
 def __repr__(self):
 """Evaluable string representation for this object."""
diff --git a/scripts/tracetool/vcpu.py b/scripts/tracetool/vcpu.py
deleted file mode 100644
index d232cb1d06..00
--- a/scripts/tracetool/vcpu.py
+++ /dev/null
@@ -1,59 +0,0 @@
-# -*- coding: utf-8 -*-
-
-"""
-Generic management for the 'vcpu' property.
-
-"""
-
-__author__ = "Lluís Vilanova "
-__copyright__  = "Copyright 2016, Lluís Vilanova "
-__license__= "GPL version 2 or (at your option) any later version"
-
-__maintainer__ = "Stefan Hajnoczi"
-__email__  = "stefa...@redhat.com"
-
-
-from tracetool import Arguments, try_import
-
-
-def transform_event(event):
-"""Transform event to comply with the 'vcpu' property (if present)."""
-if "vcpu" in event.properties:
-event.args = Arguments([("void *", "__cpu"), event.args])
-fmt = "\"cpu=%p \""
-event.fmt = fmt + event.fmt
-return event
-
-
-def transform_args(format, event, *args, **kwargs):
-"""Transforms the arguments to suit the specified format.
-
-The format module must implement function 'vcpu_args', which receives the
-implicit arguments added by the 'vcpu' property, and must return suitable
-arguments for the given format.
-
-The function is only called for events with the 'vcpu' property.
-
-Parameters
-==
-format : str
-Format module name.
-event : Event
-args, kwargs
-Passed to 'vcpu_transform_args'.
-
-Returns
-===
-Arguments
-The transformed arguments, including the non-implicit ones.
-
-"""
-if "vcpu" in event.properties:
-ok, func = try_import("tracetool.format." + format,
-  "vcpu_transform_args")
-assert ok
-assert func
-return Arguments([func(event.args[:1], *args, **kwargs),
-  event.args[1:]])
-else:
-return event.args
-- 
2.45.1




[PULL 0/6] Tracing patches

2024-06-10 Thread Stefan Hajnoczi
The following changes since commit 80e8f0602168f451a93e71cbb1d59e93d745e62e:

  Merge tag 'bsd-user-misc-2024q2-pull-request' of gitlab.com:bsdimp/qemu into 
staging (2024-06-09 11:21:55 -0700)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/tracing-pull-request

for you to fetch changes up to 4c2b6f328742084a5bd770af7c3a2ef07828c41c:

  tracetool: Forbid newline character in event format (2024-06-10 13:05:27 
-0400)


Pull request

Cleanups from Philippe Mathieu-Daudé.



Philippe Mathieu-Daudé (6):
  tracetool: Remove unused vcpu.py script
  backends/tpm: Remove newline character in trace event
  hw/sh4: Remove newline character in trace events
  hw/usb: Remove newline character in trace events
  hw/vfio: Remove newline character in trace events
  tracetool: Forbid newline character in event format

 meson.build   |  1 -
 backends/tpm/tpm_util.c   |  5 +--
 backends/tpm/trace-events |  3 +-
 hw/sh4/trace-events   |  4 +--
 hw/usb/trace-events   |  6 ++--
 hw/vfio/trace-events  |  4 +--
 scripts/tracetool/__init__.py | 10 ++
 scripts/tracetool/vcpu.py | 59 ---
 8 files changed, 15 insertions(+), 77 deletions(-)
 delete mode 100644 scripts/tracetool/vcpu.py

-- 
2.45.1




Re: [PATCH 0/5] trace: Remove and forbid newline characters in event format

2024-06-10 Thread Stefan Hajnoczi
On Thu, Jun 06, 2024 at 12:39:38PM +0200, Philippe Mathieu-Daudé wrote:
> Trace events aren't designed to be multi-line.
> Few format use the newline character: remove it
> and forbid further uses.
> 
> Philippe Mathieu-Daudé (5):
>   backends/tpm: Remove newline character in trace event
>   hw/sh4: Remove newline character in trace events
>   hw/usb: Remove newline character in trace events
>   hw/vfio: Remove newline character in trace events
>   tracetool: Forbid newline character in event format
> 
>  backends/tpm/tpm_util.c   | 5 +++--
>  backends/tpm/trace-events | 3 ++-
>  hw/sh4/trace-events   | 4 ++--
>  hw/usb/trace-events   | 6 +++---
>  hw/vfio/trace-events  | 4 ++--
>  scripts/tracetool/__init__.py | 2 ++
>  6 files changed, 14 insertions(+), 10 deletions(-)
> 
> -- 
> 2.41.0
> 

Thanks, applied to my tracing tree:
https://gitlab.com/stefanha/qemu/commits/tracing

Stefan


signature.asc
Description: PGP signature


Re: [PATCH] tracetool: Remove unused vcpu.py script

2024-06-10 Thread Stefan Hajnoczi
On Thu, Jun 06, 2024 at 12:26:31PM +0200, Philippe Mathieu-Daudé wrote:
> vcpu.py is pointless since commit 89aafcf2a7 ("trace:
> remove code that depends on setting vcpu"), remove it.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  meson.build   |  1 -
>  scripts/tracetool/__init__.py |  8 +
>  scripts/tracetool/vcpu.py | 59 ---
>  3 files changed, 1 insertion(+), 67 deletions(-)
>  delete mode 100644 scripts/tracetool/vcpu.py

Thanks, applied to my tracing tree:
https://gitlab.com/stefanha/qemu/commits/tracing

Stefan


signature.asc
Description: PGP signature


Re: [RFC PATCH] migration/savevm: do not schedule snapshot_save_job_bh in qemu_aio_context

2024-06-06 Thread Stefan Hajnoczi
On Wed, Jun 05, 2024 at 02:08:48PM +0200, Fiona Ebner wrote:
> The fact that the snapshot_save_job_bh() is scheduled in the main
> loop's qemu_aio_context AioContext means that it might get executed
> during a vCPU thread's aio_poll(). But saving of the VM state cannot
> happen while the guest or devices are active and can lead to assertion
> failures. See issue #2111 for two examples. Avoid the problem by
> scheduling the snapshot_save_job_bh() in the iohandler AioContext,
> which is not polled by vCPU threads.
> 
> Solves Issue #2111.
> 
> This change also solves the following issue:
> 
> Since commit effd60c878 ("monitor: only run coroutine commands in
> qemu_aio_context"), the 'snapshot-save' QMP call would not respond
> right after starting the job anymore, but only after the job finished,
> which can take a long time. The reason is, because after commit
> effd60c878, do_qmp_dispatch_bh() runs in the iohandler AioContext.
> When do_qmp_dispatch_bh() wakes the qmp_dispatch() coroutine, the
> coroutine cannot be entered immediately anymore, but needs to be
> scheduled to the main loop's qemu_aio_context AioContext. But
> snapshot_save_job_bh() was scheduled first to the same AioContext and
> thus gets executed first.
> 
> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/2111
> Signed-off-by: Fiona Ebner 
> ---
> 
> While initial smoke testing seems fine, I'm not familiar enough with
> this to rule out any pitfalls with the approach. Any reason why
> scheduling to the iohandler AioContext could be wrong here?

If something waits for a BlockJob to finish using aio_poll() from
qemu_aio_context then a deadlock is possible since the iohandler_ctx
won't get a chance to execute. The only suspicious code path I found was
job_completed_txn_abort_locked() -> job_finish_sync_locked() but I'm not
sure whether it triggers this scenario. Please check that code path.

> Should the same be done for the snapshot_load_job_bh and
> snapshot_delete_job_bh to keep it consistent?

In the long term it would be cleaner to move away from synchronous APIs
that rely on nested event loops. They have been a source of bugs for
years.

If vm_stop() and perhaps other operations in save_snapshot() were
asynchronous, then it would be safe to run the operation in
qemu_aio_context without using iohandler_ctx. vm_stop() wouldn't invoke
its callback until vCPUs were quiesced and outside device emulation
code.

I think this patch is fine as a one-line bug fix, but we should be
careful about falling back on this trick because it makes the codebase
harder to understand and more fragile.

> 
>  migration/savevm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/migration/savevm.c b/migration/savevm.c
> index c621f2359b..0086b76ab0 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -3459,7 +3459,7 @@ static int coroutine_fn snapshot_save_job_run(Job *job, 
> Error **errp)
>  SnapshotJob *s = container_of(job, SnapshotJob, common);
>  s->errp = errp;
>  s->co = qemu_coroutine_self();
> -aio_bh_schedule_oneshot(qemu_get_aio_context(),
> +aio_bh_schedule_oneshot(iohandler_get_aio_context(),
>  snapshot_save_job_bh, job);
>  qemu_coroutine_yield();
>  return s->ret ? 0 : -1;
> -- 
> 2.39.2


signature.asc
Description: PGP signature


Re: [PATCH] target/s390x: Fix tracing header path in TCG mem_helper.c

2024-06-06 Thread Stefan Hajnoczi
On Thu, Jun 06, 2024 at 12:30:26PM +0200, Philippe Mathieu-Daudé wrote:
> Commit c9274b6bf0 ("target/s390x: start moving TCG-only code
> to tcg/") moved mem_helper.c, but the trace-events file is
> still in the parent directory, as is the generated trace.h.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
> Ideally we should only use trace events from current directory.

Yes, that would be cleaner. Is it possible to move the relevant trace
events to the trace-events file in target/s390x/tcg/?

> ---
>  target/s390x/tcg/mem_helper.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/target/s390x/tcg/mem_helper.c b/target/s390x/tcg/mem_helper.c
> index 6a308c5553..1fb6cbb6cf 100644
> --- a/target/s390x/tcg/mem_helper.c
> +++ b/target/s390x/tcg/mem_helper.c
> @@ -30,7 +30,7 @@
>  #include "hw/core/tcg-cpu-ops.h"
>  #include "qemu/int128.h"
>  #include "qemu/atomic128.h"
> -#include "trace.h"
> +#include "../trace.h"
>  
>  #if !defined(CONFIG_USER_ONLY)
>  #include "hw/s390x/storage-keys.h"
> -- 
> 2.41.0
> 


signature.asc
Description: PGP signature


Re: [RFC PATCH 1/1] vhost-user: add shmem mmap request

2024-06-05 Thread Stefan Hajnoczi
On Wed, Jun 5, 2024, 12:02 David Hildenbrand  wrote:

> On 05.06.24 17:19, Stefan Hajnoczi wrote:
> > On Wed, 5 Jun 2024 at 10:29, Stefan Hajnoczi 
> wrote:
> >>
> >> On Wed, Jun 05, 2024 at 10:13:32AM +0200, Albert Esteve wrote:
> >>> On Tue, Jun 4, 2024 at 8:54 PM Stefan Hajnoczi 
> wrote:
> >>>
> >>>> On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:
> >>>>> Add SHMEM_MAP/UNMAP requests to vhost-user.
> >>>>>
> >>>>> This request allows backends to dynamically map
> >>>>> fds into a shared memory region identified by
> >>>>
> >>>> Please call this "VIRTIO Shared Memory Region" everywhere (code,
> >>>> vhost-user spec, commit description, etc) so it's clear that this is
> not
> >>>> about vhost-user shared memory tables/regions.
> >>>>
> >>>>> its `shmid`. Then, the fd memory is advertised
> >>>>> to the frontend through a BAR+offset, so it can
> >>>>> be read by the driver while it's valid.
> >>>>
> >>>> Why is a PCI BAR mentioned here? vhost-user does not know about the
> >>>> VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
> >>>> report VIRTIO Shared Memory Regions to the driver.
> >>>>
> >>>>
> >>> I will remove PCI BAR, as it is true that it depends on the
> >>> transport. I was trying to explain that the driver
> >>> will use the shm_base + shm_offset to access
> >>> the mapped memory.
> >>>
> >>>
> >>>>>
> >>>>> Then, the backend can munmap the memory range
> >>>>> in a given shared memory region (again, identified
> >>>>> by its `shmid`), to free it. After this, the
> >>>>> region becomes private and shall not be accessed
> >>>>> by the frontend anymore.
> >>>>
> >>>> What does "private" mean?
> >>>>
> >>>> The frontend must mmap PROT_NONE to reserve the virtual memory space
> >>>> when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
> >>>> unrelated mmap(NULL, ...) might use that address range and the guest
> >>>> would have access to the host memory! This is a security issue and
> needs
> >>>> to be mentioned explicitly in the spec.
> >>>>
> >>>
> >>> I mentioned private because it changes the mapping from MAP_SHARED
> >>> to MAP_PRIVATE. I will highlight PROT_NONE instead.
> >>
> >> I see. Then "MAP_PRIVATE" would be clearer. I wasn't sure whether you
> >> mean mmap flags or something like the memory range is no longer
> >> accessible to the driver.
> >
> > One more thing: please check whether kvm.ko memory regions need to be
> > modified or split to match the SHMEM_MAP mapping's read/write
> > permissions.
> >
> > The VIRTIO Shared Memory Area pages can have PROT_READ, PROT_WRITE,
> > PROT_READ|PROT_WRITE, or PROT_NONE.
> >
> > kvm.ko memory regions are read/write or read-only. I'm not sure what
> > happens when the guest accesses a kvm.ko memory region containing
> > mappings with permissions more restrictive than its kvm.ko memory
> > region.
>
> IIRC, the KVM R/O memory region requests could allow to further reduce
> permissions (assuming your mmap is R/W you could map it R/O into the KVM
> MMU), but I might remember things incorrectly.
>

I'm thinking about the opposite case where KVM is configured for R/W but
the mmap is more restrictive. This patch series makes this scenario
possible.


>
> > In other words, the kvm.ko memory region would allow the
> > access but the Linux virtual memory configuration would cause a page
> > fault.
> >
> > For example, imagine a QEMU MemoryRegion containing a SHMEM_MAP
> > mapping with PROT_READ. The kvm.ko memory region would be read/write
> > (unless extra steps were taken to tell kvm.ko about the permissions).
> > When the guest stores to the PROT_READ page, kvm.ko will process a
> > fault...and I'm not sure what happens next.
> >
> > A similar scenario occurs when a PROT_NONE mapping exists within a
> > kvm.ko memory region. I don't remember how kvm.ko behaves when the
> > guest tries to access the pages.
> >
> > It's worth figuring this out before going further because it could
> > become tricky if issues are discovered later. I have CCed David

Re: [RFC PATCH 1/1] vhost-user: add shmem mmap request

2024-06-05 Thread Stefan Hajnoczi
On Wed, 5 Jun 2024 at 10:29, Stefan Hajnoczi  wrote:
>
> On Wed, Jun 05, 2024 at 10:13:32AM +0200, Albert Esteve wrote:
> > On Tue, Jun 4, 2024 at 8:54 PM Stefan Hajnoczi  wrote:
> >
> > > On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:
> > > > Add SHMEM_MAP/UNMAP requests to vhost-user.
> > > >
> > > > This request allows backends to dynamically map
> > > > fds into a shared memory region identified by
> > >
> > > Please call this "VIRTIO Shared Memory Region" everywhere (code,
> > > vhost-user spec, commit description, etc) so it's clear that this is not
> > > about vhost-user shared memory tables/regions.
> > >
> > > > its `shmid`. Then, the fd memory is advertised
> > > > to the frontend through a BAR+offset, so it can
> > > > be read by the driver while it's valid.
> > >
> > > Why is a PCI BAR mentioned here? vhost-user does not know about the
> > > VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
> > > report VIRTIO Shared Memory Regions to the driver.
> > >
> > >
> > I will remove PCI BAR, as it is true that it depends on the
> > transport. I was trying to explain that the driver
> > will use the shm_base + shm_offset to access
> > the mapped memory.
> >
> >
> > > >
> > > > Then, the backend can munmap the memory range
> > > > in a given shared memory region (again, identified
> > > > by its `shmid`), to free it. After this, the
> > > > region becomes private and shall not be accessed
> > > > by the frontend anymore.
> > >
> > > What does "private" mean?
> > >
> > > The frontend must mmap PROT_NONE to reserve the virtual memory space
> > > when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
> > > unrelated mmap(NULL, ...) might use that address range and the guest
> > > would have access to the host memory! This is a security issue and needs
> > > to be mentioned explicitly in the spec.
> > >
> >
> > I mentioned private because it changes the mapping from MAP_SHARED
> > to MAP_PRIVATE. I will highlight PROT_NONE instead.
>
> I see. Then "MAP_PRIVATE" would be clearer. I wasn't sure whether you
> mean mmap flags or something like the memory range is no longer
> accessible to the driver.

One more thing: please check whether kvm.ko memory regions need to be
modified or split to match the SHMEM_MAP mapping's read/write
permissions.

The VIRTIO Shared Memory Area pages can have PROT_READ, PROT_WRITE,
PROT_READ|PROT_WRITE, or PROT_NONE.

kvm.ko memory regions are read/write or read-only. I'm not sure what
happens when the guest accesses a kvm.ko memory region containing
mappings with permissions more restrictive than its kvm.ko memory
region. In other words, the kvm.ko memory region would allow the
access but the Linux virtual memory configuration would cause a page
fault.

For example, imagine a QEMU MemoryRegion containing a SHMEM_MAP
mapping with PROT_READ. The kvm.ko memory region would be read/write
(unless extra steps were taken to tell kvm.ko about the permissions).
When the guest stores to the PROT_READ page, kvm.ko will process a
fault...and I'm not sure what happens next.

A similar scenario occurs when a PROT_NONE mapping exists within a
kvm.ko memory region. I don't remember how kvm.ko behaves when the
guest tries to access the pages.

It's worth figuring this out before going further because it could
become tricky if issues are discovered later. I have CCed David
Hildenbrand in case he knows.
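
For reference, this is a minimal sketch (not QEMU code; the slot number,
guest physical address, and size are hypothetical) of how a read-only
slot is registered with kvm.ko via KVM_SET_USER_MEMORY_REGION and
KVM_MEM_READONLY, which is the granularity kvm.ko works at:

  /* Illustrative only: register host memory as a read-only KVM slot.
   * Guest writes to this range exit to userspace as KVM_EXIT_MMIO. */
  #include <linux/kvm.h>
  #include <string.h>
  #include <sys/ioctl.h>

  static int register_readonly_slot(int vm_fd, void *host_addr, size_t size)
  {
      struct kvm_userspace_memory_region region;

      memset(&region, 0, sizeof(region));
      region.slot = 1;                          /* hypothetical slot */
      region.flags = KVM_MEM_READONLY;
      region.guest_phys_addr = 0x100000000ULL;  /* hypothetical GPA */
      region.memory_size = size;
      region.userspace_addr = (unsigned long)host_addr;

      /* vm_fd comes from KVM_CREATE_VM */
      return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
  }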

Stefan



Re: [RFC PATCH 1/1] vhost-user: add shmem mmap request

2024-06-05 Thread Stefan Hajnoczi
On Wed, Jun 05, 2024 at 10:13:32AM +0200, Albert Esteve wrote:
> On Tue, Jun 4, 2024 at 8:54 PM Stefan Hajnoczi  wrote:
> 
> > On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:
> > > Add SHMEM_MAP/UNMAP requests to vhost-user.
> > >
> > > This request allows backends to dynamically map
> > > fds into a shared memory region identified by
> >
> > Please call this "VIRTIO Shared Memory Region" everywhere (code,
> > vhost-user spec, commit description, etc) so it's clear that this is not
> > about vhost-user shared memory tables/regions.
> >
> > > its `shmid`. Then, the fd memory is advertised
> > > to the frontend through a BAR+offset, so it can
> > > be read by the driver while it's valid.
> >
> > Why is a PCI BAR mentioned here? vhost-user does not know about the
> > VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
> > report VIRTIO Shared Memory Regions to the driver.
> >
> >
> I will remove PCI BAR, as it is true that it depends on the
> transport. I was trying to explain that the driver
> will use the shm_base + shm_offset to access
> the mapped memory.
> 
> 
> > >
> > > Then, the backend can munmap the memory range
> > > in a given shared memory region (again, identified
> > > by its `shmid`), to free it. After this, the
> > > region becomes private and shall not be accessed
> > > by the frontend anymore.
> >
> > What does "private" mean?
> >
> > The frontend must mmap PROT_NONE to reserve the virtual memory space
> > when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
> > unrelated mmap(NULL, ...) might use that address range and the guest
> > would have access to the host memory! This is a security issue and needs
> > to be mentioned explicitly in the spec.
> >
> 
> I mentioned private because it changes the mapping from MAP_SHARED
> to MAP_PRIVATE. I will highlight PROT_NONE instead.

I see. Then "MAP_PRIVATE" would be clearer. I wasn't sure whether you
mean mmap flags or something like the memory range is no longer
accessible to the driver.

> 
> 
> >
> > >
> > > Initializing the memory region is the responsibility
> > > of the PCI device that will be using it.
> >
> > What does this mean?
> >
> 
> The MemoryRegion is declared in `struct VirtIODevice`,
> but it is uninitialized in this commit. So I was trying to say
> that the initialization will happen in, e.g., vhost-user-gpu-pci.c
> with something like `memory_region_init`, and later `pci_register_bar`.

Okay. The device model needs to create MemoryRegion instances for the
device's Shared Memory Regions and add them to the VirtIODevice.

--device vhost-user-device will need to query the backend since, unlike
vhost-user-gpu-pci and friends, it doesn't have knowledge of specific
device types. It will need to create MemoryRegions enumerated from the
backend.

By the way, the VIRTIO MMIO Transport also supports VIRTIO Shared Memory
Regions so this work should not be tied to PCI.
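
As a rough sketch of what the PCI-transport part could look like (the BAR
number, region name and helper are illustrative only; memory_region_init()
and pci_register_bar() are the existing QEMU APIs mentioned above, while the
VIRTIO Shared Memory Region plumbing itself is still what this series has to
add):

#include "qemu/osdep.h"
#include "hw/pci/pci.h"
#include "hw/virtio/virtio.h"

#define EXAMPLE_SHM_BAR 2   /* illustrative BAR number */

static void example_init_shmem_bar(PCIDevice *pci_dev, VirtIODevice *vdev,
                                   uint64_t shm_size)
{
    MemoryRegion *shm = g_new0(MemoryRegion, 1);

    /* Reserve the guest-visible window; fds are mapped into it later as
     * subregions when SHMEM_MAP requests arrive. */
    memory_region_init(shm, OBJECT(vdev), "virtio-shmem-0", shm_size);

    /* PCI transport only: expose the window as a 64-bit prefetchable BAR.
     * An MMIO-transport device would advertise the same MemoryRegion
     * through its shared-memory registers instead. */
    pci_register_bar(pci_dev, EXAMPLE_SHM_BAR,
                     PCI_BASE_ADDRESS_SPACE_MEMORY |
                     PCI_BASE_ADDRESS_MEM_TYPE_64 |
                     PCI_BASE_ADDRESS_MEM_PREFETCH,
                     shm);
}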

> 
> I am testing that part still.
> 
> 
> >
> > >
> > > Signed-off-by: Albert Esteve 
> > > ---
> > >  docs/interop/vhost-user.rst |  23 
> > >  hw/virtio/vhost-user.c  | 106 
> > >  hw/virtio/virtio.c  |   2 +
> > >  include/hw/virtio/virtio.h  |   3 +
> > >  4 files changed, 134 insertions(+)
> >
> > Two missing pieces:
> >
> > 1. QEMU's --device vhost-user-device needs a way to enumerate VIRTIO
> > Shared Memory Regions from the vhost-user backend. vhost-user-device is
> > a generic vhost-user frontend without knowledge of the device type, so
> > it doesn't know what the valid shmids are and what size the regions
> > have.
> >
> 
> Ok. I was assuming that if a device (backend) makes a request without a
> valid shmid or without enough space in the region to perform the mmap, it
> would just fail in the VMM when performing the actual mmap operation. So it
> would not necessarily need to know about valid shmids or region sizes.

But then --device vhost-user-device wouldn't be able to support VIRTIO
Shared Memory Regions, which means this patch series is incomplete. New
vhost-user features need to support both --device vhost-user-*
and --device vhost-user-device.

What's needed is:
1. New vhost-user messages so the frontend can query the shmids and
   sizes from the backend.
2. QEMU --device vhost-user-device code that queries the VIRTIO Shared
   Memory Regions from the vhost-user backend and then creates
   MemoryRegions for them.
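
Something along the following lines would do; the message name, payload
layout and limit are hypothetical and only meant to show the shape of the
enumeration, they are not part of the posted series or the vhost-user spec:

#include <stdint.h>

#define VHOST_USER_MAX_SHMEM_REGIONS 8   /* made-up limit */

/* Hypothetical reply payload for a new request such as
 * VHOST_USER_GET_SHMEM_CONFIG. */
typedef struct VhostUserShMemRegion {
    uint64_t shmid;   /* VIRTIO Shared Memory Region id */
    uint64_t size;    /* size in bytes the frontend must reserve */
} VhostUserShMemRegion;

typedef struct VhostUserShMemConfig {
    uint64_t nregions;
    VhostUserShMemRegion regions[VHOST_USER_MAX_SHMEM_REGIONS];
} VhostUserShMemConfig;

/*
 * Intended flow for --device vhost-user-device:
 *   1. the frontend sends the new request during setup;
 *   2. the backend replies with nregions and the (shmid, size) pairs;
 *   3. the frontend creates one MemoryRegion per entry and attaches it to
 *      the VirtIODevice, just like a device-specific frontend would.
 */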

> 
> 
> >
>

Re: [RFC PATCH 0/1] vhost-user: Add SHMEM_MAP/UNMAP requests

2024-06-05 Thread Stefan Hajnoczi
On Wed, Jun 05, 2024 at 09:24:36AM +0200, Albert Esteve wrote:
> On Tue, Jun 4, 2024 at 8:16 PM Stefan Hajnoczi  wrote:
> 
> > On Thu, May 30, 2024 at 05:22:22PM +0200, Albert Esteve wrote:
> > > Hi all,
> > >
> > > This is an early attempt to have backends
> > > support dynamic fd mapping into shared
> > > memory regions. As such, there are a few
> > > things that need settling, so I wanted to
> > > post this first to have some early feedback.
> > >
> > > The usecase for this is, e.g., to support
> > > vhost-user-gpu RESOURCE_BLOB operations,
> > > or DAX Window request for virtio-fs. In
> > > general, any operation where a backend
> > > would need to mmap an fd to a shared
> > > memory so that the guest can access it.
> >
> > I wanted to mention that this sentence confuses me because:
> >
> > - The frontend will mmap an fd into the guest's memory space so that a
> >   VIRTIO Shared Memory Region is exposed to the guest. The backend
> >   requests the frontend to perform this operation. The backend does not
> >   invoke mmap itself.
> >
> 
> Sorry for the confusing wording. It is true that the backend does not
> do the mmap, but requests it to be done. One point of confusion for
> me from your sentence is that I refer to the driver as the frontend,

They are different concepts. Frontend is defined in the vhost-user spec
and driver is defined in the VIRTIO spec.

The frontend is the application that uses vhost-user protocol messages
to communicate with the backend.

The driver uses VIRTIO device model interfaces like virtqueues to
communicate with the device.

> and the mapping is done by the VMM (i.e., QEMU).
> 
> But yeah, I agree and the scenario you describe is what
> I had in mind. Thanks for pointing it out. I will rephrase it
> in follow-up patches.

Thanks!

> 
> 
> >
> > - "Shared memory" is ambiguous. Please call it VIRTIO Shared Memory
> >   Region to differentiate from vhost-user shared memory tables/regions.
> >
> 
> Ok!
> 
> 
> >
> > > The request will be processed by the VMM,
> > > which will, in turn, trigger an mmap with
> > > the instructed parameters (i.e., shmid,
> > > shm_offset, fd_offset, fd, length).
> > >
> > > As there are already a couple of devices
> > > that could benefit from such a feature,
> > > and more could require it in the future,
> > > my intention was to make it generic.
> > >
> > > To that end, I declared the shared
> > > memory region list in `VirtIODevice`.
> > > I could add a couple commodity
> > > functions to add new regions to the list,
> > > so that the devices can use them. But
> > > I wanted to gather some feedback before
> > > refining it further, as I am probably
> > > missing some required steps/or security
> > > concerns that I am not taking into account.
> > >
> > > Albert Esteve (1):
> > >   vhost-user: add shmem mmap request
> > >
> > >  docs/interop/vhost-user.rst |  23 
> > >  hw/virtio/vhost-user.c  | 106 
> > >  hw/virtio/virtio.c  |   2 +
> > >  include/hw/virtio/virtio.h  |   3 +
> > >  4 files changed, 134 insertions(+)
> > >
> > > --
> > > 2.44.0
> > >
> >


signature.asc
Description: PGP signature


Re: [RFC PATCH 1/1] vhost-user: add shmem mmap request

2024-06-04 Thread Stefan Hajnoczi
On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:
> Add SHMEM_MAP/UNMAP requests to vhost-user.
> 
> This request allows backends to dynamically map
> fds into a shared memory region identified by

Please call this "VIRTIO Shared Memory Region" everywhere (code,
vhost-user spec, commit description, etc) so it's clear that this is not
about vhost-user shared memory tables/regions.

> its `shmid`. Then, the fd memory is advertised
> to the frontend through a BAR+offset, so it can
> be read by the driver while it's valid.

Why is a PCI BAR mentioned here? vhost-user does not know about the
VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
report VIRTIO Shared Memory Regions to the driver.

> 
> Then, the backend can munmap the memory range
> in a given shared memory region (again, identified
> by its `shmid`), to free it. After this, the
> region becomes private and shall not be accessed
> by the frontend anymore.

What does "private" mean?

The frontend must mmap PROT_NONE to reserve the virtual memory space
when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
unrelated mmap(NULL, ...) might use that address range and the guest
would have access to the host memory! This is a security issue and needs
to be mentioned explicitly in the spec.
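
A minimal sketch of that lifecycle on the frontend side (sizes and offsets
invented for illustration): reserve the whole VIRTIO Shared Memory Region
with PROT_NONE up front, let SHMEM_MAP install an fd over a sub-range with
MAP_FIXED, and on SHMEM_UNMAP map PROT_NONE back over the sub-range instead
of leaving a hole with munmap().

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t region_size = 64 * 1024 * 1024;   /* made-up region size */

    /* Reserve the whole region: nothing is mapped yet, but no unrelated
     * mmap(NULL, ...) can land in this address range. */
    void *base = mmap(NULL, region_size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) {
        perror("reserve");
        return 1;
    }

    /* SHMEM_MAP: map the received fd over a sub-range.  MAP_FIXED replaces
     * the reservation atomically, so there is never a window where an
     * unrelated mapping could appear. */
    int fd = -1;                       /* fd received from the backend */
    size_t shm_offset = 16 * (size_t)page;
    size_t len = 4 * (size_t)page;
    if (fd >= 0 &&
        mmap((char *)base + shm_offset, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
        perror("SHMEM_MAP");
    }

    /* SHMEM_UNMAP: do not munmap() the sub-range, which would leave a hole;
     * put a PROT_NONE mapping back over it instead. */
    if (mmap((char *)base + shm_offset, len, PROT_NONE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_NORESERVE,
             -1, 0) == MAP_FAILED) {
        perror("SHMEM_UNMAP");
    }

    munmap(base, region_size);
    return 0;
}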

> 
> Initializing the memory region is the responsibility
> of the PCI device that will be using it.

What does this mean?

> 
> Signed-off-by: Albert Esteve 
> ---
>  docs/interop/vhost-user.rst |  23 
>  hw/virtio/vhost-user.c  | 106 
>  hw/virtio/virtio.c  |   2 +
>  include/hw/virtio/virtio.h  |   3 +
>  4 files changed, 134 insertions(+)

Two missing pieces:

1. QEMU's --device vhost-user-device needs a way to enumerate VIRTIO
Shared Memory Regions from the vhost-user backend. vhost-user-device is
a generic vhost-user frontend without knowledge of the device type, so
it doesn't know what the valid shmids are and what size the regions
have.

2. Other backends don't see these mappings. If the guest submits a vring
descriptor referencing a mapping to another backend, then that backend
won't be able to access this memory. David Gilbert hit this problem when
working on the virtiofs DAX Window. Either the frontend needs to forward
all SHMEM_MAP/UNMAP messages to the other backends (inefficient and
maybe racy!) or a new "memcpy" message is needed as a fallback for when
vhost-user memory region translation fails.

> 
> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index d8419fd2f1..3caf2a290c 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -1859,6 +1859,29 @@ is sent by the front-end.
>when the operation is successful, or non-zero otherwise. Note that if the
>operation fails, no fd is sent to the backend.
>  
> +``VHOST_USER_BACKEND_SHMEM_MAP``
> +  :id: 9
> +  :equivalent ioctl: N/A
> +  :request payload: fd and ``struct VhostUserMMap``
> +  :reply payload: N/A
> +
> +  This message can be submitted by the backends to advertise a new mapping
> +  to be made in a given shared memory region. Upon receiving the message,
> +  QEMU will mmap the given fd into the shared memory region with the

s/QEMU/the frontend/

> +  requested ``shmid``. A reply is generated indicating whether mapping
> +  succeeded.

Please document whether mapping over an existing mapping is allowed. I
think it should be allowed because it might be useful to atomically
update a mapping without a race where the driver sees unmapped memory.

If mapping over an existing mapping is allowed, does the new mapping
need to cover the old mapping exactly, or can it span multiple previous
mappings or a subset of an existing mapping?

From a security point of view we need to be careful here. A potentially
untrusted backend process now has the ability to mmap an arbitrary file
descriptor into the frontend process. The backend can cause
denial of service by creating many small mappings to exhaust the OS
limits on virtual memory areas. The backend can map memory to use as
part of a security compromise, so we need to be sure the virtual memory
addresses are not leaked to the backend and only read/write page
permissions are available.
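
In practice that means the frontend has to validate every request before it
touches mmap at all. Here is a sketch of the kind of checks implied here (the
field names follow the parameters listed in the cover letter: shmid,
shm_offset, fd_offset, len; the real VhostUserMMap layout in the patch may
differ):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t shmid;
    uint64_t fd_offset;
    uint64_t shm_offset;
    uint64_t len;
} ShmemMapRequest;

static bool shmem_map_request_valid(const ShmemMapRequest *req,
                                    uint64_t nregions,
                                    const uint64_t *region_sizes,
                                    uint64_t page_size)
{
    if (req->shmid >= nregions) {
        return false;                      /* unknown region */
    }
    if (req->len == 0 ||
        req->shm_offset % page_size || req->len % page_size) {
        return false;                      /* not page-granular */
    }
    if (req->shm_offset > region_sizes[req->shmid] ||
        req->len > region_sizes[req->shmid] - req->shm_offset) {
        return false;                      /* overflow-safe bounds check */
    }
    /* On top of this: cap the number of live mappings, allow only
     * read/write page permissions (never PROT_EXEC), and never return the
     * resulting virtual addresses to the backend. */
    return true;
}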

> +``VHOST_USER_BACKEND_SHMEM_UNMAP``
> +  :id: 10
> +  :equivalent ioctl: N/A
> +  :request payload: ``struct VhostUserMMap``
> +  :reply payload: N/A
> +
> +  This message can be submitted by the backends so that QEMU un-mmap

s/QEMU/the frontend/

> +  a given range (``offset``, ``len``) in the shared memory region with the
> +  requested ``shmid``.

Does the range need to correspond to a previously-mapped VhostUserMMap
or can it cross multiple VhostUserMMaps, be a subset of a VhostUserMMap,
etc?

> +  A reply is generated indicating whether unmapping succeeded.
> +
>  .. _reply_ack:
>  
>  VHOST_USER_PROTOCOL_F_REPLY_ACK
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index cdf9af4a4b..9526b9d07f 100644
> --- 

Re: [RFC PATCH 0/1] vhost-user: Add SHMEM_MAP/UNMAP requests

2024-06-04 Thread Stefan Hajnoczi
On Thu, May 30, 2024 at 05:22:22PM +0200, Albert Esteve wrote:
> Hi all,
> 
> This is an early attempt to have backends
> support dynamic fd mapping into shared
> memory regions. As such, there are a few
> things that need settling, so I wanted to
> post this first to have some early feedback.
> 
> The usecase for this is, e.g., to support
> vhost-user-gpu RESOURCE_BLOB operations,
> or DAX Window request for virtio-fs. In
> general, any operation where a backend
> would need to mmap an fd to a shared
> memory so that the guest can access it.

I wanted to mention that this sentence confuses me because:

- The frontend will mmap an fd into the guest's memory space so that a
  VIRTIO Shared Memory Region is exposed to the guest. The backend
  requests the frontend to perform this operation. The backend does not
  invoke mmap itself.

- "Shared memory" is ambiguous. Please call it VIRTIO Shared Memory
  Region to differentiate from vhost-user shared memory tables/regions.

> The request will be processed by the VMM,
> which will, in turn, trigger an mmap with
> the instructed parameters (i.e., shmid,
> shm_offset, fd_offset, fd, length).
> 
> As there are already a couple of devices
> that could benefit from such a feature,
> and more could require it in the future,
> my intention was to make it generic.
> 
> To that end, I declared the shared
> memory region list in `VirtIODevice`.
> I could add a couple commodity
> functions to add new regions to the list,
> so that the devices can use them. But
> I wanted to gather some feedback before
> refining it further, as I am probably
> missing some required steps/or security
> concerns that I am not taking into account.
> 
> Albert Esteve (1):
>   vhost-user: add shmem mmap request
> 
>  docs/interop/vhost-user.rst |  23 
>  hw/virtio/vhost-user.c  | 106 
>  hw/virtio/virtio.c  |   2 +
>  include/hw/virtio/virtio.h  |   3 +
>  4 files changed, 134 insertions(+)
> 
> -- 
> 2.44.0
> 


signature.asc
Description: PGP signature


Re: [RFC 0/6] scripts: Rewrite simpletrace printer in Rust

2024-05-29 Thread Stefan Hajnoczi
On Wed, May 29, 2024 at 10:10:00PM +0800, Zhao Liu wrote:
> Hi Stefan and Mads,
> 
> On Wed, May 29, 2024 at 11:33:42AM +0200, Mads Ynddal wrote:
> > Date: Wed, 29 May 2024 11:33:42 +0200
> > From: Mads Ynddal 
> > Subject: Re: [RFC 0/6] scripts: Rewrite simpletrace printer in Rust
> > X-Mailer: Apple Mail (2.3774.600.62)
> > 
> > 
> > >> Maybe later, Rust-simpletrace and python-simpletrace can differ, e.g.
> > >> the former goes for performance and the latter for scalability.
> > > 
> > > Rewriting an existing, maintained component without buy-in from the
> > > maintainers is risky. Mads is the maintainer of simpletrace.py and I am
> > > the overall tracing maintainer. While the performance improvement is
> > > nice, I'm a skeptical about the need for this and wonder whether thought
> > > was put into how simpletrace should evolve.
> > > 
> > > There are disadvantages to maintaining multiple implementations:
> > > - File format changes need to be coordinated across implementations to
> > >  prevent compatibility issues. In other words, changing the
> > >  trace-events format becomes harder and discourages future work.
> > > - Multiple implementations makes life harder for users because they need
> > >  to decide between implementations and understand the trade-offs.
> > > - There is more maintenance overall.
> > > 
> > > I think we should have a single simpletrace implementation to avoid
> > > these issues. The Python implementation is more convenient for
> > > user-written trace analysis scripts. The Rust implementation has better
> > > performance (although I'm not aware of efforts to improve the Python
> > > implementation's performance, so who knows).
> > > 
> > > I'm ambivalent about why a reimplementation is necessary. What I would
> > > like to see first is the TCG binary tracing functionality. Find the
> > > limits of the Python simpletrace implementation and then it will be
> > > clear whether a Rust reimplementation makes sense.
> > > 
> > > If Mads agrees, I am happy with a Rust reimplementation, but please
> > > demonstrate why a reimplementation is necessary first.
> > > 
> > > Stefan
> > 
> > I didn't want to shoot down the idea, since it seemed like somebody had a
> > plan with GSoC. But I actually agree, that I'm not quite convinced.
> > 
> > I think I'd need to see some data that showed the Python version is
> > inadequate. I know Python is not the fastest, but is it so prohibitively
> > slow, that we cannot make the TCG analysis? I'm not saying it can't be
> > true, but it'd be nice to see it demonstrated before making decisions.
> > 
> > Because, as you point out, there's a lot of downsides to having two
> > versions. So the benefits have to clearly outweigh the additional work.
> > 
> > I have a lot of other questions, but let's maybe start with the core idea
> > first.
> > 
> > —
> > Mads Ynddal
> >
> 
> I really appreciate your patience and explanations, and your feedback
> and review has helped me a lot!
> 
> Yes, simple repetition creates much maintenance burden (though I'm happy
> to help with), and the argument for current performance isn't convincing
> enough.
> 
> Getting back to the project itself, as you have said, the core is still
> further support for TCG-related traces, and I'll continue to work on it,
> and then look back based on such work to see what issues there are with
> traces or what improvements can be made.

Thanks for doing that and sorry for holding up the work you have already
done!

Stefan


signature.asc
Description: PGP signature


Re: [RFC 1/6] scripts/simpletrace-rust: Add the basic cargo framework

2024-05-29 Thread Stefan Hajnoczi
On Wed, May 29, 2024 at 10:30:13PM +0800, Zhao Liu wrote:
> Hi Stefan,
> 
> On Tue, May 28, 2024 at 10:14:01AM -0400, Stefan Hajnoczi wrote:
> > Date: Tue, 28 May 2024 10:14:01 -0400
> > From: Stefan Hajnoczi 
> > Subject: Re: [RFC 1/6] scripts/simpletrace-rust: Add the basic cargo
> >  framework
> > 
> > On Tue, May 28, 2024 at 03:53:55PM +0800, Zhao Liu wrote:
> > > Hi Stefan,
> > > 
> > > [snip]
> > > 
> > > > > diff --git a/scripts/simpletrace-rust/.rustfmt.toml 
> > > > > b/scripts/simpletrace-rust/.rustfmt.toml
> > > > > new file mode 100644
> > > > > index ..97a97c24ebfb
> > > > > --- /dev/null
> > > > > +++ b/scripts/simpletrace-rust/.rustfmt.toml
> > > > > @@ -0,0 +1,9 @@
> > > > > +brace_style = "AlwaysNextLine"
> > > > > +comment_width = 80
> > > > > +edition = "2021"
> > > > > +group_imports = "StdExternalCrate"
> > > > > +imports_granularity = "item"
> > > > > +max_width = 80
> > > > > +use_field_init_shorthand = true
> > > > > +use_try_shorthand = true
> > > > > +wrap_comments = true
> > > > 
> > > > There should be QEMU-wide policy. That said, why is it necessary to 
> > > > customize rustfmt?
> > > 
> > > Indeed, but QEMU's style for Rust is currently undefined, so I'm trying
> > > to add this to make it easier to check the style...I will separate it
> > > out as a style policy proposal.
> > 
> > Why is a config file necessary? QEMU should use the default Rust style.
> > 
> 
> There are some that may be overdone, but I think some basic ones may still
> be necessary, like "comment_width = 80", "max_width = 80",
> "wrap_comments". Is it necessary to specify the width, as in C?

Let's agree to follow the Rust coding style from the start, then the
problem is solved. My view is that deviating from the standard Rust
coding style in order to make QEMU Rust code resemble QEMU C code is
less helpful than following Rust conventions so our Rust code looks like
Rust.

> 
> And, "group_imports" and "imports_granularity" (refered from crosvm),
> can also be used to standardize including styles and improve
> readability, since importing can be done in many different styles.
> 
> This fmt config is something like ./script/check_patch.pl for QEMU/linux.
> Different programs have different practices, so I feel like that's an
> open too!

In languages like Rust that have a standard, let's follow the standard
instead of inventing our own way of formatting code.

> This certainly also depends on the maintainer's/your preferences, ;-)
> whatever looks more comfortable/convenient is best, and going
> completely with the defaults is also fine.

This will probably affect all Rust code in QEMU so everyone's opinion
counts.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH] Use "void *" as parameter for functions that are used for aio_set_event_notifier()

2024-05-29 Thread Stefan Hajnoczi
On Wed, May 29, 2024 at 07:49:48PM +0200, Thomas Huth wrote:
> aio_set_event_notifier() and aio_set_event_notifier_poll() in
> util/aio-posix.c and util/aio-win32.c are casting function pointers of
> functions that take an "EventNotifier *" pointer as parameter to function
> pointers that take a "void *" pointer as parameter (i.e. the IOHandler
> type). When those function pointers are later used to call the referenced
> function, this triggers undefined behavior errors with the latest version
> of Clang in Fedora 40 when compiling with the option "-fsanitize=undefined".
> And this also prevents enabling the strict mode of CFI which is currently
> disabled with -fsanitize-cfi-icall-generalize-pointers. Thus let us avoid
> the problem by using "void *" as parameter in all spots where it is needed.
> 
> Signed-off-by: Thomas Huth 
> ---
>  Yes, I know, the patch looks ugly ... but I don't see a better way to
>  tackle this. If someone has a better idea, suggestions are welcome!

An alternative is adding EventNotifierHandler *io_read, *io_poll_ready,
*io_poll_begin, and *io_poll_end fields to EventNotifier so that
aio_set_event_notifier() and aio_set_event_notifier_poll() can pass
helper functions to the underlying aio_set_fd_handler() and
aio_set_fd_poll() APIs. These helper functions then invoke the
EventNotifier callbacks:

/* Helpers */
static void event_notifier_io_read(void *opaque)
{
EventNotifier *notifier = opaque;
notifier->io_read(notifier);
}

static void event_notifier_io_poll_ready(void *opaque)
{
EventNotifier *notifier = opaque;
notifier->io_poll_ready(notifier);
}

void aio_set_event_notifier(AioContext *ctx,
EventNotifier *notifier,
EventNotifierHandler *io_read,
AioPollFn *io_poll,
EventNotifierHandler *io_poll_ready)
{
notifier->io_read = io_read;
notifier->io_poll_ready = io_poll_ready;

aio_set_fd_handler(ctx, event_notifier_get_fd(notifier),
   io_read ? event_notifier_io_read : NULL,
   NULL, io_poll,
   io_poll_ready ? event_notifier_io_poll_ready : NULL,
   notifier);
}

...same for aio_set_event_notifier_poll()...

This is not beautiful either but keeps the API type safe and simpler for
users.
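
For reference, the undefined behaviour the commit message is about boils down
to the following pattern (stand-in types, not the real QEMU declarations):
the function is defined with one prototype but called through a pointer of an
incompatible type, which is exactly what UBSan and strict CFI reject.

#include <stdio.h>

typedef struct EventNotifier { int dummy; } EventNotifier;

typedef void IOHandler(void *opaque);
typedef void EventNotifierHandler(EventNotifier *n);

static void notifier_read(EventNotifier *n)
{
    printf("notifier %p\n", (void *)n);
}

int main(void)
{
    EventNotifier n = { 0 };

    /* What the casts in aio_set_event_notifier() amount to: calling through
     * an incompatible function pointer type is undefined behaviour. */
    IOHandler *bad = (IOHandler *)notifier_read;
    bad(&n);

    /* The safe pattern: call through the function's real type, or go through
     * a void * helper as sketched above. */
    EventNotifierHandler *good = notifier_read;
    good(&n);
    return 0;
}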

> 
>  include/block/aio.h|  8 
>  include/hw/virtio/virtio.h |  2 +-
>  include/qemu/main-loop.h   |  3 +--
>  block/linux-aio.c  |  6 +++---
>  block/nvme.c   |  8 
>  block/win32-aio.c  |  4 ++--
>  hw/hyperv/hyperv.c |  6 +++---
>  hw/hyperv/hyperv_testdev.c |  5 +++--
>  hw/hyperv/vmbus.c  |  8 
>  hw/nvme/ctrl.c |  8 
>  hw/usb/ccid-card-emulated.c|  5 +++--
>  hw/virtio/vhost-shadow-virtqueue.c | 11 ++-
>  hw/virtio/vhost.c  |  5 +++--
>  hw/virtio/virtio.c | 26 ++
>  tests/unit/test-aio.c  |  9 +
>  tests/unit/test-nested-aio-poll.c  |  8 
>  util/aio-posix.c   | 14 ++
>  util/aio-win32.c   | 10 +-
>  util/async.c   |  6 +++---
>  util/main-loop.c   |  3 +--
>  20 files changed, 79 insertions(+), 76 deletions(-)
> 
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 8378553eb9..01e7ea069d 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -476,9 +476,9 @@ void aio_set_fd_handler(AioContext *ctx,
>   */
>  void aio_set_event_notifier(AioContext *ctx,
>  EventNotifier *notifier,
> -EventNotifierHandler *io_read,
> +IOHandler *io_read,
>  AioPollFn *io_poll,
> -EventNotifierHandler *io_poll_ready);
> +IOHandler *io_poll_ready);
>  
>  /*
>   * Set polling begin/end callbacks for an event notifier that has already 
> been
> @@ -491,8 +491,8 @@ void aio_set_event_notifier(AioContext *ctx,
>   */
>  void aio_set_event_notifier_poll(AioContext *ctx,
>   EventNotifier *notifier,
> - EventNotifierHandler *io_poll_begin,
> - EventNotifierHandler *io_poll_end);
> + IOHandler *io_poll_begin,
> + IOHandler *io_poll_end);
>  
>  /* Return a GSource that lets the main loop poll the file descriptors 
> attached
>   * to this AioContext.
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 7d5ffdc145..e98cecfdd7 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -396,7 +396,7 @@ void 

Re: [PATCH 0/2] block/crypto: do not require number of threads upfront

2024-05-29 Thread Stefan Hajnoczi
On Wed, May 29, 2024 at 06:50:34PM +0200, Kevin Wolf wrote:
> Am 27.05.2024 um 17:58 hat Stefan Hajnoczi geschrieben:
> > The block layer does not know how many threads will perform I/O. It is 
> > possible
> > to exceed the number of threads that is given to qcrypto_block_open() and 
> > this
> > can trigger an assertion failure in qcrypto_block_pop_cipher().
> > 
> > This patch series removes the n_threads argument and instead handles an
> > arbitrary number of threads.
> > ---
> > Is it secure to store the key in QCryptoBlock? In this series I assumed the
> > answer is yes since the QCryptoBlock's cipher state is equally sensitive, 
> > but
> > I'm not familiar with this code or a crypto expert.
> 
> I would assume the same, but I'm not merging this yet because I think
> you said you'd like to have input from danpb?
> 
> Reviewed-by: Kevin Wolf 

Yes, please wait until Dan comments.

Thanks,
Stefan


signature.asc
Description: PGP signature


Re: [RFC 1/6] scripts/simpletrace-rust: Add the basic cargo framework

2024-05-28 Thread Stefan Hajnoczi
On Tue, May 28, 2024 at 03:53:55PM +0800, Zhao Liu wrote:
> Hi Stefan,
> 
> [snip]
> 
> > > diff --git a/scripts/simpletrace-rust/.rustfmt.toml 
> > > b/scripts/simpletrace-rust/.rustfmt.toml
> > > new file mode 100644
> > > index ..97a97c24ebfb
> > > --- /dev/null
> > > +++ b/scripts/simpletrace-rust/.rustfmt.toml
> > > @@ -0,0 +1,9 @@
> > > +brace_style = "AlwaysNextLine"
> > > +comment_width = 80
> > > +edition = "2021"
> > > +group_imports = "StdExternalCrate"
> > > +imports_granularity = "item"
> > > +max_width = 80
> > > +use_field_init_shorthand = true
> > > +use_try_shorthand = true
> > > +wrap_comments = true
> > 
> > There should be QEMU-wide policy. That said, why is it necessary to 
> > customize rustfmt?
> 
> Indeed, but QEMU's style for Rust is currently undefined, so I'm trying
> to add this to make it easier to check the style...I will separate it
> out as a style policy proposal.

Why is a config file necessary? QEMU should use the default Rust style.

Stefan


signature.asc
Description: PGP signature


Re: [RFC 0/6] scripts: Rewrite simpletrace printer in Rust

2024-05-28 Thread Stefan Hajnoczi
On Tue, May 28, 2024 at 02:48:42PM +0800, Zhao Liu wrote:
> Hi Stefan,
> 
> On Mon, May 27, 2024 at 03:59:44PM -0400, Stefan Hajnoczi wrote:
> > Date: Mon, 27 May 2024 15:59:44 -0400
> > From: Stefan Hajnoczi 
> > Subject: Re: [RFC 0/6] scripts: Rewrite simpletrace printer in Rust
> > 
> > On Mon, May 27, 2024 at 04:14:15PM +0800, Zhao Liu wrote:
> > > Hi maintainers and list,
> > > 
> > > This RFC series attempts to re-implement simpletrace.py with Rust, which
> > > is the 1st task of Paolo's GSoC 2024 proposal.
> > > 
> > > There are two motivations for this work:
> > > 1. This is an open chance to discuss how to integrate Rust into QEMU.
> > > 2. Rust delivers faster parsing.
> > > 
> > > 
> > > Introduction
> > > 
> > > 
> > > Code framework
> > > --
> > > 
> > > I choose "cargo" to organize the code, because the current
> > > implementation depends on external crates (Rust's library), such as
> > > "backtrace" for getting frameinfo, "clap" for parsing the cli, "rex" for
> > > regular matching, and so on. (Meson's support for external crates is
> > > still incomplete. [2])
> > > 
> > > The simpletrace-rust created in this series is not yet integrated into
> > > the QEMU compilation chain, so it can only be compiled independently, e.g.
> > > under ./scripts/simpletrace/, compile it be:
> > > 
> > > cargo build --release
> > 
> > Please make sure it's built by .gitlab-ci.d/ so that the continuous
> > integration system prevents bitrot. You can add a job that runs the
> > cargo build.
> 
> Thanks! I'll do this.
> 
> > > 
> > > The code tree for the entire simpletrace-rust is as follows:
> > > 
> > > $ script/simpletrace-rust .
> > > .
> > > ├── Cargo.toml
> > > └── src
> > > └── main.rs   // The simpletrace logic (similar to simpletrace.py).
> > > └── trace.rs  // The Argument and Event abstraction (refer to
> > >   // tracetool/__init__.py).
> > > 
> > > My question about meson v.s. cargo, I put it at the end of the cover
> > > letter (the section "Opens on Rust Support").
> > > 
> > > The following two sections are lessons I've learned from this Rust
> > > practice.
> > > 
> > > 
> > > Performance
> > > ---
> > > 
> > > I did the performance comparison using the rust-simpletrace prototype with
> > > the python one:
> > > 
> > > * On the i7-10700 CPU @ 2.90GHz machine, parsing and outputting a 35M
> > > trace binary file 10 times for each item:
> > > 
> > >   AVE (ms)   Rust v.s. Python
> > > Rust   (stdout)   12687.16   114.46%
> > > Python (stdout)   14521.85
> > > 
> > > Rust   (file)  1422.44   264.99%
> > > Python (file)  3769.37
> > > 
> > > - The "stdout" lines represent output to the screen.
> > > - The "file" lines represent output to a file (via "> file").
> > > 
> > > This Rust version contains some optimizations (including print, regular
> > > matching, etc.), but there should be plenty of room for optimization.
> > > 
> > > The current performance bottleneck is reading the binary trace file,
> > > since I am parsing headers and event payloads one after the other, so
> > > that the IO read overhead accounts for 33%, which can be further
> > > optimized in the future.
> > 
> > Performance will become more important when large amounts of TCG data is
> > captured, as described in the project idea:
> > https://wiki.qemu.org/Internships/ProjectIdeas/TCGBinaryTracing
> > 
> > While I can't think of a time in the past where simpletrace.py's
> > performance bothered me, improving performance is still welcome. Just
> > don't spend too much time on performance (and making the code more
> > complex) unless there is a real need.
> 
> Yes, I agree that it shouldn't be over-optimized.
> 
> The logic in the current Rust version is pretty much a carbon copy of
> the Python version, without additional complex logic introduced, but the
> ~2.6x improvement was obtained by optimizing IO:
> 
> * reading the event configuration file, where I called the buffered
>   interface,
> * and the output formatted trace log, w

Re: [RFC 4/6] scripts/simpletrace-rust: Parse and check trace recode file

2024-05-27 Thread Stefan Hajnoczi
On Mon, May 27, 2024 at 04:14:19PM +0800, Zhao Liu wrote:
> Refer to scripts/simpletrace.py, parse and check the simple trace
> backend binary trace file.
> 
> Note, in order to keep certain backtrace information to get frame,
> adjust the cargo debug level for release version to "line-tables-only",
> which slows down the program, but is necessary.
> 
> Suggested-by: Paolo Bonzini 
> Signed-off-by: Zhao Liu 
> ---
>  scripts/simpletrace-rust/Cargo.lock  |  79 +
>  scripts/simpletrace-rust/Cargo.toml  |   4 +
>  scripts/simpletrace-rust/src/main.rs | 253 ++-
>  3 files changed, 329 insertions(+), 7 deletions(-)
> 
> diff --git a/scripts/simpletrace-rust/Cargo.lock 
> b/scripts/simpletrace-rust/Cargo.lock
> index 3d815014eb44..37d80974ffe7 100644
> --- a/scripts/simpletrace-rust/Cargo.lock
> +++ b/scripts/simpletrace-rust/Cargo.lock
> @@ -2,6 +2,21 @@
>  # It is not intended for manual editing.
>  version = 3
>  
> +[[package]]
> +name = "addr2line"
> +version = "0.21.0"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "8a30b2e23b9e17a9f90641c7ab1549cd9b44f296d3ccbf309d2863cfe398a0cb"
> +dependencies = [
> + "gimli",
> +]
> +
> +[[package]]
> +name = "adler"
> +version = "1.0.2"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "f26201604c87b1e01bd3d98f8d5d9a8fcbb815e8cedb41ffccbeb4bf593a35fe"
> +
>  [[package]]
>  name = "aho-corasick"
>  version = "1.1.3"
> @@ -60,6 +75,33 @@ dependencies = [
>   "windows-sys",
>  ]
>  
> +[[package]]
> +name = "backtrace"
> +version = "0.3.71"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "26b05800d2e817c8b3b4b54abd461726265fa9789ae34330622f2db9ee696f9d"
> +dependencies = [
> + "addr2line",
> + "cc",
> + "cfg-if",
> + "libc",
> + "miniz_oxide",
> + "object",
> + "rustc-demangle",
> +]
> +
> +[[package]]
> +name = "cc"
> +version = "1.0.98"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "41c270e7540d725e65ac7f1b212ac8ce349719624d7bcff99f8e2e488e8cf03f"
> +
> +[[package]]
> +name = "cfg-if"
> +version = "1.0.0"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "baf1de4339761588bc0619e3cbc0120ee582ebb74b53b4efbf79117bd2da40fd"
> +
>  [[package]]
>  name = "clap"
>  version = "4.5.4"
> @@ -93,18 +135,48 @@ version = "1.0.1"
>  source = "registry+https://github.com/rust-lang/crates.io-index"
>  checksum = "0b6a852b24ab71dffc585bcb46eaf7959d175cb865a7152e35b348d1b2960422"
>  
> +[[package]]
> +name = "gimli"
> +version = "0.28.1"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "4271d37baee1b8c7e4b708028c57d816cf9d2434acb33a549475f78c181f6253"
> +
>  [[package]]
>  name = "is_terminal_polyfill"
>  version = "1.70.0"
>  source = "registry+https://github.com/rust-lang/crates.io-index"
>  checksum = "f8478577c03552c21db0e2724ffb8986a5ce7af88107e6be5d2ee6e158c12800"
>  
> +[[package]]
> +name = "libc"
> +version = "0.2.155"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "97b3888a4aecf77e811145cadf6eef5901f4782c53886191b2f693f24761847c"
> +
>  [[package]]
>  name = "memchr"
>  version = "2.7.2"
>  source = "registry+https://github.com/rust-lang/crates.io-index"
>  checksum = "6c8640c5d730cb13ebd907d8d04b52f55ac9a2eec55b440c8892f40d56c76c1d"
>  
> +[[package]]
> +name = "miniz_oxide"
> +version = "0.7.3"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "87dfd01fe195c66b572b37921ad8803d010623c0aca821bea2302239d155cdae"
> +dependencies = [
> + "adler",
> +]
> +
> +[[package]]
> +name = "object"
> +version = "0.32.2"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "a6a622008b6e321afc04970976f62ee297fdbaa6f95318ca343e3eebb9648441"
> +dependencies = [
> + "memchr",
> +]
> +
>  [[package]]
>  name = "once_cell"
>  version = "1.19.0"
> @@ -158,10 +230,17 @@ version = "0.8.3"
>  source = "registry+https://github.com/rust-lang/crates.io-index"
>  checksum = "adad44e29e4c806119491a7f06f03de4d1af22c3a680dd47f1e6e179439d1f56"
>  
> +[[package]]
> +name = "rustc-demangle"
> +version = "0.1.24"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "719b953e2095829ee67db738b3bfa9fa368c94900df327b3f07fe6e794d2fe1f"
> +
>  [[package]]
>  name = "simpletrace-rust"
>  version = "0.1.0"
>  dependencies = [
> + "backtrace",
>   "clap",
>   "once_cell",
>   "regex",
> diff --git a/scripts/simpletrace-rust/Cargo.toml 
> b/scripts/simpletrace-rust/Cargo.toml
> index 24a79d04e566..23a3179de01c 100644
> --- a/scripts/simpletrace-rust/Cargo.toml
> +++ b/scripts/simpletrace-rust/Cargo.toml
> @@ -7,7 +7,11 @@ authors = ["Zhao Liu "]
>  license = "GPL-2.0-or-later"
>  
>  [dependencies]
> +backtrace = "0.3"
>  clap = "4.5.4"
>  once_cell = "1.19.0"
>  regex = "1.10.4"
>  thiserror = "1.0.20"
> +
> +[profile.release]
> +debug = "line-tables-only"
> diff 

Re: [RFC 3/6] scripts/simpletrace-rust: Add helpers to parse trace file

2024-05-27 Thread Stefan Hajnoczi
On Mon, May 27, 2024 at 04:14:18PM +0800, Zhao Liu wrote:
> Refer to scripts/simpletrace.py, add the helpers to read the trace file
> and parse the record type field, record header and log header.
> 
> Suggested-by: Paolo Bonzini 
> Signed-off-by: Zhao Liu 
> ---
>  scripts/simpletrace-rust/src/main.rs | 151 +++
>  1 file changed, 151 insertions(+)
> 
> diff --git a/scripts/simpletrace-rust/src/main.rs 
> b/scripts/simpletrace-rust/src/main.rs
> index 2d2926b7658d..b3b8baee7c66 100644
> --- a/scripts/simpletrace-rust/src/main.rs
> +++ b/scripts/simpletrace-rust/src/main.rs
> @@ -14,21 +14,172 @@
>  mod trace;
>  
>  use std::env;
> +use std::fs::File;
> +use std::io::Error as IOError;
> +use std::io::ErrorKind;
> +use std::io::Read;
>  
>  use clap::Arg;
>  use clap::Command;
>  use thiserror::Error;
>  use trace::Event;
>  
> +const RECORD_TYPE_MAPPING: u64 = 0;
> +const RECORD_TYPE_EVENT: u64 = 1;
> +
>  #[derive(Error, Debug)]
>  pub enum Error
>  {
>  #[error("usage: {0} [--no-header]  ")]
>  CliOptionUnmatch(String),
> +#[error("Failed to read file: {0}")]
> +ReadFile(IOError),
> +#[error("Unknown record type ({0})")]
> +UnknownRecType(u64),
>  }
>  
>  pub type Result = std::result::Result;
>  
> +enum RecordType
> +{
> +Empty,
> +Mapping,
> +Event,
> +}
> +
> +#[repr(C)]
> +#[derive(Clone, Copy, Default)]
> +struct RecordRawType
> +{
> +rtype: u64,
> +}
> +
> +impl RecordType
> +{
> +fn read_type(mut fobj: ) -> Result
> +{
> +let mut tbuf = [0u8; 8];
> +if let Err(e) = fobj.read_exact(&mut tbuf) {
> +if e.kind() == ErrorKind::UnexpectedEof {
> +return Ok(RecordType::Empty);
> +} else {
> +return Err(Error::ReadFile(e));
> +}
> +}
> +
> +/*
> + * Safe because the layout of the trace record requires us to parse
> + * the type first, and then there is a check on the validity of the
> + * record type.
> + */
> +let raw_t =
> +unsafe { std::mem::transmute::<[u8; 8], RecordRawType>(tbuf) };

A safe alternative: 
https://doc.rust-lang.org/std/primitive.u64.html#method.from_ne_bytes?

> +match raw_t.rtype {
> +RECORD_TYPE_MAPPING => Ok(RecordType::Mapping),
> +RECORD_TYPE_EVENT => Ok(RecordType::Event),
> +_ => Err(Error::UnknownRecType(raw_t.rtype)),
> +}
> +}
> +}
> +
> +trait ReadHeader
> +{
> +fn read_header(fobj: ) -> Result
> +where
> +Self: Sized;
> +}
> +
> +#[repr(C)]
> +#[derive(Clone, Copy)]
> +struct LogHeader
> +{
> +event_id: u64,
> +magic: u64,
> +version: u64,
> +}
> +
> +impl ReadHeader for LogHeader
> +{
> +fn read_header(mut fobj: ) -> Result
> +{
> +let mut raw_hdr = [0u8; 24];
> +fobj.read_exact(&mut raw_hdr).map_err(Error::ReadFile)?;
> +
> +/*
> + * Safe because the size of log header (struct LogHeader)
> + * is 24 bytes, which is ensured by simple trace backend.
> + */
> +let hdr =
> +unsafe { std::mem::transmute::<[u8; 24], LogHeader>(raw_hdr) };

Or u64::from_ne_bytes() for each field.

> +Ok(hdr)
> +}
> +}
> +
> +#[derive(Default)]
> +struct RecordInfo
> +{
> +event_id: u64,
> +timestamp_ns: u64,
> +record_pid: u32,
> +args_payload: Vec,
> +}
> +
> +impl RecordInfo
> +{
> +fn new() -> Self
> +{
> +Default::default()
> +}
> +}
> +
> +#[repr(C)]
> +#[derive(Clone, Copy)]
> +struct RecordHeader
> +{
> +event_id: u64,
> +timestamp_ns: u64,
> +record_length: u32,
> +record_pid: u32,
> +}
> +
> +impl RecordHeader
> +{
> +fn extract_record(, mut fobj: ) -> Result
> +{
> +let mut info = RecordInfo::new();
> +
> +info.event_id = self.event_id;
> +info.timestamp_ns = self.timestamp_ns;
> +info.record_pid = self.record_pid;
> +info.args_payload = vec![
> +0u8;
> +self.record_length as usize
> +- std::mem::size_of::()
> +];
> +fobj.read_exact(&mut info.args_payload)
> +.map_err(Error::ReadFile)?;
> +
> +Ok(info)
> +}
> +}
> +
> +impl ReadHeader for RecordHeader
> +{
> +fn read_header(mut fobj: ) -> Result
> +{
> +let mut raw_hdr = [0u8; 24];
> +fobj.read_exact(&mut raw_hdr).map_err(Error::ReadFile)?;
> +
> +/*
> + * Safe because the size of record header (struct RecordHeader)
> + * is 24 bytes, which is ensured by simple trace backend.
> + */
> +let hdr: RecordHeader =
> +unsafe { std::mem::transmute::<[u8; 24], RecordHeader>(raw_hdr) 
> };

Or u64::from_ne_bytes() and u32::from_ne_bytes() for all fields.

> +Ok(hdr)
> +}
> +}
> +
>  pub struct EventArgPayload {}
>  
>  trait Analyzer
> -- 
> 2.34.1
> 


signature.asc
Description: PGP signature


Re: [RFC 2/6] scripts/simpletrace-rust: Support Event & Arguments in trace module

2024-05-27 Thread Stefan Hajnoczi
On Mon, May 27, 2024 at 04:14:17PM +0800, Zhao Liu wrote:
> Refer to scripts/tracetool/__init__.py, add Event & Arguments
> abstractions in trace module.
> 
> Suggested-by: Paolo Bonzini 
> Signed-off-by: Zhao Liu 
> ---
>  scripts/simpletrace-rust/Cargo.lock   |  52 
>  scripts/simpletrace-rust/Cargo.toml   |   2 +
>  scripts/simpletrace-rust/src/trace.rs | 330 +-
>  3 files changed, 383 insertions(+), 1 deletion(-)
> 
> diff --git a/scripts/simpletrace-rust/Cargo.lock 
> b/scripts/simpletrace-rust/Cargo.lock
> index 4a0ff8092dcb..3d815014eb44 100644
> --- a/scripts/simpletrace-rust/Cargo.lock
> +++ b/scripts/simpletrace-rust/Cargo.lock
> @@ -2,6 +2,15 @@
>  # It is not intended for manual editing.
>  version = 3
>  
> +[[package]]
> +name = "aho-corasick"
> +version = "1.1.3"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "8e60d3430d3a69478ad0993f19238d2df97c507009a52b3c10addcd7f6bcb916"
> +dependencies = [
> + "memchr",
> +]
> +
>  [[package]]
>  name = "anstream"
>  version = "0.6.14"
> @@ -90,6 +99,18 @@ version = "1.70.0"
>  source = "registry+https://github.com/rust-lang/crates.io-index"
>  checksum = "f8478577c03552c21db0e2724ffb8986a5ce7af88107e6be5d2ee6e158c12800"
>  
> +[[package]]
> +name = "memchr"
> +version = "2.7.2"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "6c8640c5d730cb13ebd907d8d04b52f55ac9a2eec55b440c8892f40d56c76c1d"
> +
> +[[package]]
> +name = "once_cell"
> +version = "1.19.0"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "3fdb12b2476b595f9358c5161aa467c2438859caa136dec86c26fdd2efe17b92"
> +
>  [[package]]
>  name = "proc-macro2"
>  version = "1.0.83"
> @@ -108,11 +129,42 @@ dependencies = [
>   "proc-macro2",
>  ]
>  
> +[[package]]
> +name = "regex"
> +version = "1.10.4"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "c117dbdfde9c8308975b6a18d71f3f385c89461f7b3fb054288ecf2a2058ba4c"
> +dependencies = [
> + "aho-corasick",
> + "memchr",
> + "regex-automata",
> + "regex-syntax",
> +]
> +
> +[[package]]
> +name = "regex-automata"
> +version = "0.4.6"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "86b83b8b9847f9bf95ef68afb0b8e6cdb80f498442f5179a29fad448fcc1eaea"
> +dependencies = [
> + "aho-corasick",
> + "memchr",
> + "regex-syntax",
> +]
> +
> +[[package]]
> +name = "regex-syntax"
> +version = "0.8.3"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "adad44e29e4c806119491a7f06f03de4d1af22c3a680dd47f1e6e179439d1f56"
> +
>  [[package]]
>  name = "simpletrace-rust"
>  version = "0.1.0"
>  dependencies = [
>   "clap",
> + "once_cell",
> + "regex",
>   "thiserror",
>  ]
>  
> diff --git a/scripts/simpletrace-rust/Cargo.toml 
> b/scripts/simpletrace-rust/Cargo.toml
> index b44ba1569dad..24a79d04e566 100644
> --- a/scripts/simpletrace-rust/Cargo.toml
> +++ b/scripts/simpletrace-rust/Cargo.toml
> @@ -8,4 +8,6 @@ license = "GPL-2.0-or-later"
>  
>  [dependencies]
>  clap = "4.5.4"
> +once_cell = "1.19.0"
> +regex = "1.10.4"
>  thiserror = "1.0.20"
> diff --git a/scripts/simpletrace-rust/src/trace.rs 
> b/scripts/simpletrace-rust/src/trace.rs
> index 3a4b06435b8b..f41d6e0b5bc3 100644
> --- a/scripts/simpletrace-rust/src/trace.rs
> +++ b/scripts/simpletrace-rust/src/trace.rs
> @@ -8,4 +8,332 @@
>   * SPDX-License-Identifier: GPL-2.0-or-later
>   */
>  
> -pub struct Event {}
> +#![allow(dead_code)]
> +
> +use std::fs::read_to_string;
> +
> +use once_cell::sync::Lazy;
> +use regex::Regex;
> +use thiserror::Error;
> +
> +#[derive(Error, Debug)]
> +pub enum Error
> +{
> +#[error("Empty argument (did you forget to use 'void'?)")]
> +EmptyArg,
> +#[error("Event '{0}' has more than maximum permitted argument count")]
> +InvalidArgCnt(String),
> +#[error("{0} does not end with a new line")]
> +InvalidEventFile(String),
> +#[error("Invalid format: {0}")]
> +InvalidFormat(String),
> +#[error(
> +"Argument type '{0}' is not allowed. \
> +Only standard C types and fixed size integer \
> +types should be used. struct, union, and \
> +other complex pointer types should be \
> +declared as 'void *'"
> +)]
> +InvalidType(String),
> +#[error("Error at {0}:{1}: {2}")]
> +ReadEventFail(String, usize, String),
> +#[error("Unknown event: {0}")]
> +UnknownEvent(String),
> +#[error("Unknown properties: {0}")]
> +UnknownProp(String),
> +}
> +
> +pub type Result = std::result::Result;
> +
> +/*
> + * Refer to the description of ALLOWED_TYPES in
> + * scripts/tracetool/__init__.py.

Please don't reference the Python implementation because this will not
age well. It may bitrot if the Python code changes or if the Python
implementation is deprecated then the source file will go away
altogether. Make the Rust implementation self-contained. If there are
common file format 

Re: [RFC 1/6] scripts/simpletrace-rust: Add the basic cargo framework

2024-05-27 Thread Stefan Hajnoczi
On Mon, May 27, 2024 at 04:14:16PM +0800, Zhao Liu wrote:
> Define the basic cargo framework to support compiling simpletrace-rust
> via cargo, and add the Rust code style (with some nightly features)
> check items to make Rust style as close to the QEMU C code as possible.
> 
> With the base cargo package, define the basic code framework for
> simpletrace-rust, approximating the Python version, and also abstract
> Analyzer operations for simpletrace-rust. Event and other future
> trace-related structures are placed in the trace module.
> 
> Additionally, support basic command line parsing for simpletrace-rust as
> a start.
> 
> Suggested-by: Paolo Bonzini 
> Signed-off-by: Zhao Liu 
> ---
>  scripts/simpletrace-rust/.gitignore|   1 +
>  scripts/simpletrace-rust/.rustfmt.toml |   9 +
>  scripts/simpletrace-rust/Cargo.lock| 239 +
>  scripts/simpletrace-rust/Cargo.toml|  11 ++
>  scripts/simpletrace-rust/src/main.rs   | 173 ++
>  scripts/simpletrace-rust/src/trace.rs  |  11 ++
>  6 files changed, 444 insertions(+)
>  create mode 100644 scripts/simpletrace-rust/.gitignore
>  create mode 100644 scripts/simpletrace-rust/.rustfmt.toml
>  create mode 100644 scripts/simpletrace-rust/Cargo.lock
>  create mode 100644 scripts/simpletrace-rust/Cargo.toml
>  create mode 100644 scripts/simpletrace-rust/src/main.rs
>  create mode 100644 scripts/simpletrace-rust/src/trace.rs
> 
> diff --git a/scripts/simpletrace-rust/.gitignore 
> b/scripts/simpletrace-rust/.gitignore
> new file mode 100644
> index ..2f7896d1d136
> --- /dev/null
> +++ b/scripts/simpletrace-rust/.gitignore
> @@ -0,0 +1 @@
> +target/
> diff --git a/scripts/simpletrace-rust/.rustfmt.toml 
> b/scripts/simpletrace-rust/.rustfmt.toml
> new file mode 100644
> index ..97a97c24ebfb
> --- /dev/null
> +++ b/scripts/simpletrace-rust/.rustfmt.toml
> @@ -0,0 +1,9 @@
> +brace_style = "AlwaysNextLine"
> +comment_width = 80
> +edition = "2021"
> +group_imports = "StdExternalCrate"
> +imports_granularity = "item"
> +max_width = 80
> +use_field_init_shorthand = true
> +use_try_shorthand = true
> +wrap_comments = true

There should be QEMU-wide policy. That said, why is it necessary to customize 
rustfmt?

> diff --git a/scripts/simpletrace-rust/Cargo.lock 
> b/scripts/simpletrace-rust/Cargo.lock
> new file mode 100644
> index ..4a0ff8092dcb
> --- /dev/null
> +++ b/scripts/simpletrace-rust/Cargo.lock
> @@ -0,0 +1,239 @@
> +# This file is automatically @generated by Cargo.
> +# It is not intended for manual editing.
> +version = 3
> +
> +[[package]]
> +name = "anstream"
> +version = "0.6.14"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "418c75fa768af9c03be99d17643f93f79bbba589895012a80e3452a19ddda15b"
> +dependencies = [
> + "anstyle",
> + "anstyle-parse",
> + "anstyle-query",
> + "anstyle-wincon",
> + "colorchoice",
> + "is_terminal_polyfill",
> + "utf8parse",
> +]
> +
> +[[package]]
> +name = "anstyle"
> +version = "1.0.7"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "038dfcf04a5feb68e9c60b21c9625a54c2c0616e79b72b0fd87075a056ae1d1b"
> +
> +[[package]]
> +name = "anstyle-parse"
> +version = "0.2.4"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "c03a11a9034d92058ceb6ee011ce58af4a9bf61491aa7e1e59ecd24bd40d22d4"
> +dependencies = [
> + "utf8parse",
> +]
> +
> +[[package]]
> +name = "anstyle-query"
> +version = "1.0.3"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "a64c907d4e79225ac72e2a354c9ce84d50ebb4586dee56c82b3ee73004f537f5"
> +dependencies = [
> + "windows-sys",
> +]
> +
> +[[package]]
> +name = "anstyle-wincon"
> +version = "3.0.3"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "61a38449feb7068f52bb06c12759005cf459ee52bb4adc1d5a7c4322d716fb19"
> +dependencies = [
> + "anstyle",
> + "windows-sys",
> +]
> +
> +[[package]]
> +name = "clap"
> +version = "4.5.4"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "90bc066a67923782aa8515dbaea16946c5bcc5addbd668bb80af688e53e548a0"
> +dependencies = [
> + "clap_builder",
> +]
> +
> +[[package]]
> +name = "clap_builder"
> +version = "4.5.2"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "ae129e2e766ae0ec03484e609954119f123cc1fe650337e155d03b022f24f7b4"
> +dependencies = [
> + "anstream",
> + "anstyle",
> + "clap_lex",
> + "strsim",
> +]
> +
> +[[package]]
> +name = "clap_lex"
> +version = "0.7.0"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "98cc8fbded0c607b7ba9dd60cd98df59af97e84d24e49c8557331cfc26d301ce"
> +
> +[[package]]
> +name = "colorchoice"
> +version = "1.0.1"
> +source = "registry+https://github.com/rust-lang/crates.io-index"
> +checksum = "0b6a852b24ab71dffc585bcb46eaf7959d175cb865a7152e35b348d1b2960422"
> +
> +[[package]]
> +name = "is_terminal_polyfill"

Re: [RFC 0/6] scripts: Rewrite simpletrace printer in Rust

2024-05-27 Thread Stefan Hajnoczi
On Mon, May 27, 2024 at 04:14:15PM +0800, Zhao Liu wrote:
> Hi maintainers and list,
> 
> This RFC series attempts to re-implement simpletrace.py with Rust, which
> is the 1st task of Paolo's GSoC 2024 proposal.
> 
> There are two motivations for this work:
> 1. This is an open chance to discuss how to integrate Rust into QEMU.
> 2. Rust delivers faster parsing.
> 
> 
> Introduction
> 
> 
> Code framework
> --
> 
> I choose "cargo" to organize the code, because the current
> implementation depends on external crates (Rust's library), such as
> "backtrace" for getting frameinfo, "clap" for parsing the cli, "rex" for
> regular matching, and so on. (Meson's support for external crates is
> still incomplete. [2])
> 
> The simpletrace-rust created in this series is not yet integrated into
> the QEMU compilation chain, so it can only be compiled independently, e.g.
> under ./scripts/simpletrace/, compile it be:
> 
> cargo build --release

Please make sure it's built by .gitlab-ci.d/ so that the continuous
integration system prevents bitrot. You can add a job that runs the
cargo build.

> 
> The code tree for the entire simpletrace-rust is as follows:
> 
> $ script/simpletrace-rust .
> .
> ├── Cargo.toml
> └── src
> └── main.rs   // The simpletrace logic (similar to simpletrace.py).
> └── trace.rs  // The Argument and Event abstraction (refer to
>   // tracetool/__init__.py).
> 
> My question about meson v.s. cargo, I put it at the end of the cover
> letter (the section "Opens on Rust Support").
> 
> The following two sections are lessons I've learned from this Rust
> practice.
> 
> 
> Performance
> ---
> 
> I did the performance comparison using the rust-simpletrace prototype with
> the python one:
> 
> * On the i7-10700 CPU @ 2.90GHz machine, parsing and outputting a 35M
> trace binary file 10 times for each item:
> 
>   AVE (ms)   Rust v.s. Python
> Rust   (stdout)   12687.16   114.46%
> Python (stdout)   14521.85
> 
> Rust   (file)  1422.44   264.99%
> Python (file)  3769.37
> 
> - The "stdout" lines represent output to the screen.
> - The "file" lines represent output to a file (via "> file").
> 
> This Rust version contains some optimizations (including print, regular
> matching, etc.), but there should be plenty of room for optimization.
> 
> The current performance bottleneck is reading the binary trace file,
> since I am parsing headers and event payloads one after the other, so
> that the IO read overhead accounts for 33%, which can be further
> optimized in the future.

Performance will become more important when large amounts of TCG data is
captured, as described in the project idea:
https://wiki.qemu.org/Internships/ProjectIdeas/TCGBinaryTracing

While I can't think of a time in the past where simpletrace.py's
performance bothered me, improving performance is still welcome. Just
don't spend too much time on performance (and making the code more
complex) unless there is a real need.

> Security
> 
> 
> This is an example.
> 
> Rust is very strict about type-checking, and it found a timestamp reversal
> issue in simpletrace-rust [3] (sorry, I haven't gotten around to digging
> deeper yet)... in this RFC, I worked around it by allowing
> negative values. The Python version just silently covered this
> issue up.
>
> Opens on Rust Support
> =
> 
> Meson v.s. Cargo
> 
> 
> The first question is whether all Rust code (including under scripts)
> must be integrated into meson?
> 
> If so, then because of [2] I would have to discard the external crates and
> reinvent some wheels of my own in Rust to replace the previous external
> crates.
> 
> For the main part of the QEMU code, I think the answer must be Yes, but
> for the tools in the scripts directory, would it be possible to allow
> the use of cargo to build small tools/programs for flexibility and
> migrate to meson later (as meson's support for rust becomes more
> mature)?

I have not seen a satisfying way to natively build Rust code using
meson. I remember reading about a tool that converts Cargo.toml files to
meson wrap files or something similar. That still doesn't feel great
because upstream works with Cargo and duplicating build information in
meson is a drag.

Calling cargo from meson is not ideal either, but it works and avoids
duplicating build information. This is the approach I would use for now
unless someone can point to an example of native Rust support in meson
that is clean.

Here is how libblkio calls cargo from meson:
https://gitlab.com/libblkio/libblkio/-/blob/main/src/meson.build
https://gitlab.com/libblkio/libblkio/-/blob/main/src/cargo-build.sh

> 
> 
> External crates
> ---
> 
> This is an additional question that naturally follows from the above
> one: do we have requirements for Rust's external crates? Is only std
> allowed?

There is no 

[PATCH 0/2] block/crypto: do not require number of threads upfront

2024-05-27 Thread Stefan Hajnoczi
The block layer does not know how many threads will perform I/O. It is possible
to exceed the number of threads that is given to qcrypto_block_open() and this
can trigger an assertion failure in qcrypto_block_pop_cipher().

This patch series removes the n_threads argument and instead handles an
arbitrary number of threads.
---
Is it secure to store the key in QCryptoBlock? In this series I assumed the
answer is yes since the QCryptoBlock's cipher state is equally sensitive, but
I'm not familiar with this code or a crypto expert.

Stefan Hajnoczi (2):
  block/crypto: create ciphers on demand
  crypto/block: drop qcrypto_block_open() n_threads argument

 crypto/blockpriv.h |  13 ++--
 include/crypto/block.h |   2 -
 block/crypto.c |   1 -
 block/qcow.c   |   2 +-
 block/qcow2.c  |   5 +-
 crypto/block-luks.c|   4 +-
 crypto/block-qcow.c|   8 +--
 crypto/block.c | 116 -
 tests/unit/test-crypto-block.c |   4 --
 9 files changed, 85 insertions(+), 70 deletions(-)

-- 
2.45.1




[PATCH 1/2] block/crypto: create ciphers on demand

2024-05-27 Thread Stefan Hajnoczi
Ciphers are pre-allocated by qcrypto_block_init_cipher() depending on
the given number of threads. The -device
virtio-blk-pci,iothread-vq-mapping= feature allows users to assign
multiple IOThreads to a virtio-blk device, but the association between
the virtio-blk device and the block driver happens after the block
driver is already open.

When the number of threads given to qcrypto_block_init_cipher() is
smaller than the actual number of threads at runtime, the
block->n_free_ciphers > 0 assertion in qcrypto_block_pop_cipher() can
fail.

Get rid of qcrypto_block_init_cipher() n_thread's argument and allocate
ciphers on demand.

Reported-by: Qing Wang 
Buglink: https://issues.redhat.com/browse/RHEL-36159
Signed-off-by: Stefan Hajnoczi 
---
 crypto/blockpriv.h  |  12 +++--
 crypto/block-luks.c |   3 +-
 crypto/block-qcow.c |   2 +-
 crypto/block.c  | 113 ++--
 4 files changed, 79 insertions(+), 51 deletions(-)
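
Reduced to a standalone sketch (generic names, not the actual QCryptoBlock
code), the shape of the change is: keep a mutex-protected free list plus the
cipher parameters, and create a new cipher only when the list is empty, so no
pre-computed thread count is needed. Error handling is trimmed for brevity.

#include <pthread.h>
#include <stdlib.h>

typedef struct Cipher { int dummy; } Cipher;  /* stand-in for QCryptoCipher */

static Cipher *cipher_new(const unsigned char *key, size_t nkey)
{
    (void)key;
    (void)nkey;
    return calloc(1, sizeof(Cipher));
}

typedef struct {
    pthread_mutex_t lock;
    Cipher **free_ciphers;        /* idle cipher contexts */
    size_t n_free;
    size_t max_free;              /* grows on demand */
    unsigned char *key;           /* parameters kept for late creation */
    size_t nkey;
} CipherPool;

/* Pop an idle cipher or create one on demand; there is no longer a fixed
 * capacity that an unexpected number of threads could exceed. */
static Cipher *pool_pop(CipherPool *p)
{
    Cipher *c = NULL;

    pthread_mutex_lock(&p->lock);
    if (p->n_free > 0) {
        c = p->free_ciphers[--p->n_free];
    }
    pthread_mutex_unlock(&p->lock);

    return c ? c : cipher_new(p->key, p->nkey);
}

/* Return a cipher to the pool, growing the free list if needed. */
static void pool_push(CipherPool *p, Cipher *c)
{
    pthread_mutex_lock(&p->lock);
    if (p->n_free == p->max_free) {
        p->max_free = p->max_free ? p->max_free * 2 : 4;
        p->free_ciphers = realloc(p->free_ciphers,
                                  p->max_free * sizeof(*p->free_ciphers));
    }
    p->free_ciphers[p->n_free++] = c;
    pthread_mutex_unlock(&p->lock);
}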

diff --git a/crypto/blockpriv.h b/crypto/blockpriv.h
index 836f3b4726..4bf6043d5d 100644
--- a/crypto/blockpriv.h
+++ b/crypto/blockpriv.h
@@ -32,8 +32,14 @@ struct QCryptoBlock {
 const QCryptoBlockDriver *driver;
 void *opaque;
 
-QCryptoCipher **ciphers;
-size_t n_ciphers;
+/* Cipher parameters */
+QCryptoCipherAlgorithm alg;
+QCryptoCipherMode mode;
+uint8_t *key;
+size_t nkey;
+
+QCryptoCipher **free_ciphers;
+size_t max_free_ciphers;
 size_t n_free_ciphers;
 QCryptoIVGen *ivgen;
 QemuMutex mutex;
@@ -130,7 +136,7 @@ int qcrypto_block_init_cipher(QCryptoBlock *block,
   QCryptoCipherAlgorithm alg,
   QCryptoCipherMode mode,
   const uint8_t *key, size_t nkey,
-  size_t n_threads, Error **errp);
+  Error **errp);
 
 void qcrypto_block_free_cipher(QCryptoBlock *block);
 
diff --git a/crypto/block-luks.c b/crypto/block-luks.c
index 3ee928fb5a..3357852c0a 100644
--- a/crypto/block-luks.c
+++ b/crypto/block-luks.c
@@ -1262,7 +1262,6 @@ qcrypto_block_luks_open(QCryptoBlock *block,
   luks->cipher_mode,
   masterkey,
   luks->header.master_key_len,
-  n_threads,
   errp) < 0) {
 goto fail;
 }
@@ -1456,7 +1455,7 @@ qcrypto_block_luks_create(QCryptoBlock *block,
 /* Setup the block device payload encryption objects */
 if (qcrypto_block_init_cipher(block, luks_opts.cipher_alg,
   luks_opts.cipher_mode, masterkey,
-  luks->header.master_key_len, 1, errp) < 0) {
+  luks->header.master_key_len, errp) < 0) {
 goto error;
 }
 
diff --git a/crypto/block-qcow.c b/crypto/block-qcow.c
index 4d7cf36a8f..02305058e3 100644
--- a/crypto/block-qcow.c
+++ b/crypto/block-qcow.c
@@ -75,7 +75,7 @@ qcrypto_block_qcow_init(QCryptoBlock *block,
 ret = qcrypto_block_init_cipher(block, QCRYPTO_CIPHER_ALG_AES_128,
 QCRYPTO_CIPHER_MODE_CBC,
 keybuf, G_N_ELEMENTS(keybuf),
-n_threads, errp);
+errp);
 if (ret < 0) {
 ret = -ENOTSUP;
 goto fail;
diff --git a/crypto/block.c b/crypto/block.c
index 506ea1d1a3..ba6d1cebc7 100644
--- a/crypto/block.c
+++ b/crypto/block.c
@@ -20,6 +20,7 @@
 
 #include "qemu/osdep.h"
 #include "qapi/error.h"
+#include "qemu/lockable.h"
 #include "blockpriv.h"
 #include "block-qcow.h"
 #include "block-luks.h"
@@ -57,6 +58,8 @@ QCryptoBlock *qcrypto_block_open(QCryptoBlockOpenOptions 
*options,
 {
 QCryptoBlock *block = g_new0(QCryptoBlock, 1);
 
+qemu_mutex_init(&block->mutex);
+
 block->format = options->format;
 
 if (options->format >= G_N_ELEMENTS(qcrypto_block_drivers) ||
@@ -76,8 +79,6 @@ QCryptoBlock *qcrypto_block_open(QCryptoBlockOpenOptions 
*options,
 return NULL;
 }
 
-qemu_mutex_init(&block->mutex);
-
 return block;
 }
 
@@ -92,6 +93,8 @@ QCryptoBlock *qcrypto_block_create(QCryptoBlockCreateOptions 
*options,
 {
 QCryptoBlock *block = g_new0(QCryptoBlock, 1);
 
+qemu_mutex_init(&block->mutex);
+
 block->format = options->format;
 
 if (options->format >= G_N_ELEMENTS(qcrypto_block_drivers) ||
@@ -111,8 +114,6 @@ QCryptoBlock 
*qcrypto_block_create(QCryptoBlockCreateOptions *options,
 return NULL;
 }
 
-qemu_mutex_init(&block->mutex);
-
 return block;
 }
 
@@ -227,37 +228,42 @@ QCryptoCipher *qcrypto_block_get_cipher(QCryptoBlock 
*block)
  * This function is used only in test with one thread (it's safe to skip

[PATCH 2/2] crypto/block: drop qcrypto_block_open() n_threads argument

2024-05-27 Thread Stefan Hajnoczi
The n_threads argument is no longer used since the previous commit.
Remove it.

Signed-off-by: Stefan Hajnoczi 
---
 crypto/blockpriv.h | 1 -
 include/crypto/block.h | 2 --
 block/crypto.c | 1 -
 block/qcow.c   | 2 +-
 block/qcow2.c  | 5 ++---
 crypto/block-luks.c| 1 -
 crypto/block-qcow.c| 6 ++
 crypto/block.c | 3 +--
 tests/unit/test-crypto-block.c | 4 
 9 files changed, 6 insertions(+), 19 deletions(-)

diff --git a/crypto/blockpriv.h b/crypto/blockpriv.h
index 4bf6043d5d..b8f77cb5eb 100644
--- a/crypto/blockpriv.h
+++ b/crypto/blockpriv.h
@@ -59,7 +59,6 @@ struct QCryptoBlockDriver {
 QCryptoBlockReadFunc readfunc,
 void *opaque,
 unsigned int flags,
-size_t n_threads,
 Error **errp);
 
 int (*create)(QCryptoBlock *block,
diff --git a/include/crypto/block.h b/include/crypto/block.h
index 92e823c9f2..5b5d039800 100644
--- a/include/crypto/block.h
+++ b/include/crypto/block.h
@@ -76,7 +76,6 @@ typedef enum {
  * @readfunc: callback for reading data from the volume
  * @opaque: data to pass to @readfunc
  * @flags: bitmask of QCryptoBlockOpenFlags values
- * @n_threads: allow concurrent I/O from up to @n_threads threads
  * @errp: pointer to a NULL-initialized error object
  *
  * Create a new block encryption object for an existing
@@ -113,7 +112,6 @@ QCryptoBlock *qcrypto_block_open(QCryptoBlockOpenOptions 
*options,
  QCryptoBlockReadFunc readfunc,
  void *opaque,
  unsigned int flags,
- size_t n_threads,
  Error **errp);
 
 typedef enum {
diff --git a/block/crypto.c b/block/crypto.c
index 21eed909c1..4eed3ffa6a 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -363,7 +363,6 @@ static int block_crypto_open_generic(QCryptoBlockFormat 
format,
block_crypto_read_func,
bs,
cflags,
-   1,
errp);
 
 if (!crypto->block) {
diff --git a/block/qcow.c b/block/qcow.c
index ca8e1d5ec8..c2f89db055 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -211,7 +211,7 @@ static int qcow_open(BlockDriverState *bs, QDict *options, 
int flags,
 cflags |= QCRYPTO_BLOCK_OPEN_NO_IO;
 }
 s->crypto = qcrypto_block_open(crypto_opts, "encrypt.",
-   NULL, NULL, cflags, 1, errp);
+   NULL, NULL, cflags, errp);
 if (!s->crypto) {
 ret = -EINVAL;
 goto fail;
diff --git a/block/qcow2.c b/block/qcow2.c
index 956128b409..10883a2494 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -321,7 +321,7 @@ qcow2_read_extensions(BlockDriverState *bs, uint64_t 
start_offset,
 }
 s->crypto = qcrypto_block_open(s->crypto_opts, "encrypt.",
qcow2_crypto_hdr_read_func,
-   bs, cflags, QCOW2_MAX_THREADS, 
errp);
+   bs, cflags, errp);
 if (!s->crypto) {
 return -EINVAL;
 }
@@ -1701,8 +1701,7 @@ qcow2_do_open(BlockDriverState *bs, QDict *options, int 
flags,
 cflags |= QCRYPTO_BLOCK_OPEN_NO_IO;
 }
 s->crypto = qcrypto_block_open(s->crypto_opts, "encrypt.",
-   NULL, NULL, cflags,
-   QCOW2_MAX_THREADS, errp);
+   NULL, NULL, cflags, errp);
 if (!s->crypto) {
 ret = -EINVAL;
 goto fail;
diff --git a/crypto/block-luks.c b/crypto/block-luks.c
index 3357852c0a..5b777c15d3 100644
--- a/crypto/block-luks.c
+++ b/crypto/block-luks.c
@@ -1189,7 +1189,6 @@ qcrypto_block_luks_open(QCryptoBlock *block,
 QCryptoBlockReadFunc readfunc,
 void *opaque,
 unsigned int flags,
-size_t n_threads,
 Error **errp)
 {
 QCryptoBlockLUKS *luks = NULL;
diff --git a/crypto/block-qcow.c b/crypto/block-qcow.c
index 02305058e3..42e9556e42 100644
--- a/crypto/block-qcow.c
+++ b/crypto/block-qcow.c
@@ -44,7 +44,6 @@ qcrypto_block_qcow_has_format(const uint8_t *buf 
G_GNUC_UNUSED,
 static int
 qcrypto_block_qcow_init(QCryptoBlock *block,
 const char *keysecret,
-size_t n_threads,
 Error **errp)
 {
 char *password;
@@ -1

Re: [PATCH v2 0/3] docs: define policy forbidding use of "AI" / LLM code generators

2024-05-21 Thread Stefan Hajnoczi
On Thu, 16 May 2024 at 12:23, Daniel P. Berrangé  wrote:
>
> This patch kicks the hornet's nest of AI / LLM code generators.
>
> With the increasing interest in code generators in recent times,
> it is inevitable that QEMU contributions will include AI generated
> code. Thus far we have remained silent on the matter. Given that
> everyone knows these tools exist, our current position has to be
> considered tacit acceptance of the use of AI generated code in QEMU.
>
> The question for the project is whether that is a good position for
> QEMU to take or not ?
>
> IANAL, but I like to think I'm reasonably proficient at understanding
> open source licensing. I am not inherantly against the use of AI tools,
> rather I am anti-risk. I also want to see OSS licenses respected and
> complied with.
>
> AFAICT at its current state of (im)maturity the question of licensing
> of AI code generator output does not have a broadly accepted / settled
> legal position. This is an inherant bias/self-interest from the vendors
> promoting their usage, who tend to minimize/dismiss the legal questions.
> From my POV, this puts such tools in a position of elevated legal risk.
>
> Given the fuzziness over the legal position of generated code from
> such tools, I don't consider it credible (today) for a contributor
> to assert compliance with the DCO terms (b) or (c) (which is a stated
> pre-requisite for QEMU accepting patches) when a patch includes (or is
> derived from) AI generated code.
>
> By implication, I think that QEMU must (for now) explicitly decline
> to (knowingly) accept AI generated code.
>
> Perhaps a few years down the line the legal uncertainty will have
> reduced and we can re-evaluate this policy.
>
> Discuss...

Although this policy is unenforceable, I think it's a valid position
to take until the legal situation becomes clear.

Acked-by: Stefan Hajnoczi 



Re: [PATCH 00/20] qapi: new sphinx qapi domain pre-requisites

2024-05-16 Thread Stefan Hajnoczi
  |   4 +-
>  scripts/qapi/visit.py |   9 +-
>  tests/qapi-schema/doc-empty-section.err   |   2 +-
>  tests/qapi-schema/doc-empty-section.json  |   2 +-
>  tests/qapi-schema/doc-good.json   |  18 +-
>  tests/qapi-schema/doc-good.out|  61 +++---
>  tests/qapi-schema/doc-good.txt    |  31 +--
>  .../qapi-schema/doc-interleaved-section.json  |   2 +-
>  47 files changed, 1152 insertions(+), 753 deletions(-)
>  create mode 100755 scripts/qapi-lint.sh
>  create mode 100644 scripts/qapi/Makefile
> 
> -- 
> 2.44.0
> 
> 

For block-core.json/block-export.json/block.json:

Acked-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH] scripts/simpletrace: Mark output with unstable timestamp as WARN

2024-05-14 Thread Stefan Hajnoczi
On Tue, May 14, 2024, 03:57 Zhao Liu  wrote:

> Hi Stefan,
>
> > QEMU uses clock_gettime(CLOCK_MONOTONIC) on Linux hosts. The man page
> > says:
> >
> >   All CLOCK_MONOTONIC variants guarantee that the time returned by
> >   consecutive  calls  will  not go backwards, but successive calls
> >   may—depending  on  the  architecture—return  identical  (not-in‐
> >   creased) time values.
> >
> > trace_record_start() calls clock_gettime(CLOCK_MONOTONIC) so trace events
> > should have monotonically increasing timestamps.
> >
> > I don't see a scenario where trace record A's timestamp is greater than
> > trace record B's timestamp unless the clock is non-monotonic.
> >
> > Which host CPU architecture and operating system are you running?
>
> I tested on these 2 machines:
> * CML (intel 10th) with Ubuntu 22.04 + kernel v6.5.0-28
> * MTL (intel 14th) with Ubuntu 22.04.2 + kernel v6.9.0
>
> > Please attach to the QEMU process with gdb and print out the value of
> > the use_rt_clock variable or add a printf in init_get_clock(). The value
> > should be 1.
>
> Thanks, on both above machines, use_rt_clock is 1 and there're both
> timestamp reversal issues with the following debug print:
>
> diff --git a/include/qemu/timer.h b/include/qemu/timer.h
> index 9a366e551fb3..7657785c27dc 100644
> --- a/include/qemu/timer.h
> +++ b/include/qemu/timer.h
> @@ -831,10 +831,17 @@ extern int use_rt_clock;
>
>  static inline int64_t get_clock(void)
>  {
> +static int64_t clock = 0;
>

Please try with a thread local variable (__thread) to check whether this
happens within a single thread.

If it only happens with a global variable then we'd need to look more
closely at race conditions in the patch below. I don't think the patch is a
reliable way to detect non-monotonic timestamps in a multi-threaded program.
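
Something along these lines (untested sketch; assumes <stdio.h>,
<inttypes.h> and <time.h> are available for this debug hack):

  static inline int64_t get_clock(void)
  {
      static __thread int64_t last;   /* thread-local: no cross-thread races */
      struct timespec ts;
      int64_t now;

      clock_gettime(CLOCK_MONOTONIC, &ts);
      now = ts.tv_sec * 1000000000LL + ts.tv_nsec;
      if (now < last) {
          printf("non-monotonic within one thread: %" PRId64 " < %" PRId64 "\n",
                 now, last);
      }
      last = now;
      return now;
  }

If this never fires while the global-variable version does, then the
reversal is an artifact of unsynchronized accesses from multiple threads
rather than a truly non-monotonic clock.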

 if (use_rt_clock) {
>  struct timespec ts;
>  clock_gettime(CLOCK_MONOTONIC, &ts);
> -return ts.tv_sec * 1000000000LL + ts.tv_nsec;
> +int64_t tmp = ts.tv_sec * 1000000000LL + ts.tv_nsec;
> +if (tmp <= clock) {
> +printf("get_clock: strange, clock: %ld, tmp: %ld\n", clock,
> tmp);
> +}
> +assert(tmp > clock);
> +clock = tmp;
> +return clock;
>  } else {
>  /* XXX: using gettimeofday leads to problems if the date
> changes, so it should be avoided. */
> diff --git a/util/qemu-timer-common.c b/util/qemu-timer-common.c
> index cc1326f72646..3bf06eb4a4ce 100644
> --- a/util/qemu-timer-common.c
> +++ b/util/qemu-timer-common.c
> @@ -59,5 +59,6 @@ static void __attribute__((constructor))
> init_get_clock(void)
>  use_rt_clock = 1;
>  }
>  clock_start = get_clock();
> +printf("init_get_clock: use_rt_clock: %d\n", use_rt_clock);
>  }
>  #endif
>
> ---
> The timestamp interval is very small, for example:
> get_clock: strange, clock: 3302130503505, tmp: 3302130503503
>
> or
>
> get_clock: strange, clock: 2761577819846455, tmp: 2761577819846395
>
> I also tried to use CLOCK_MONOTONIC_RAW, but there's still the reversal
> issue.
>
> Thanks,
> Zhao
>
>


Re: [PATCH v4 00/12] vhost-user: support any POSIX system (tested on macOS, FreeBSD, OpenBSD)

2024-05-09 Thread Stefan Hajnoczi
jects/libvhost-user/libvhost-user.h |   2 +-
>  backends/hostmem-shm.c| 123 ++
>  contrib/vhost-user-blk/vhost-user-blk.c   |  27 +++--
>  contrib/vhost-user-input/main.c   |  16 +--
>  hw/net/vhost_net.c|   5 +
>  subprojects/libvhost-user/libvhost-user.c |  76 -
>  tests/qtest/vhost-user-blk-test.c |   2 +-
>  tests/qtest/vhost-user-test.c |  23 
>  util/vhost-user-server.c  |  12 +++
>  backends/meson.build  |   1 +
>  hw/block/Kconfig  |   2 +-
>  qemu-options.hx   |  13 +++
>  util/meson.build  |   4 +-
>  16 files changed, 305 insertions(+), 28 deletions(-)
>  create mode 100644 backends/hostmem-shm.c
> 
> -- 
> 2.45.0
> 

Acked-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH 5/9] hw/scsi: add persistent reservation in/out api for scsi device

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:25PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations in the
> SCSI device layer. By introducing the persistent
> reservation in/out api, this enables the SCSI device
> to perform reservation-related tasks, including querying
> keys, querying reservation status, registering reservation
> keys, initiating and releasing reservations, as well as
> clearing and preempting reservations held by other keys.
> 
> These operations are crucial for management and control of
> shared storage resources in a persistent manner.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  hw/scsi/scsi-disk.c | 302 
>  1 file changed, 302 insertions(+)

Does sg_persist --report-capabilities
(https://linux.die.net/man/8/sg_persist) show that persistent
reservations are supported? Maybe some INQUIRY data still needs to be
added.

> 
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index 4bd7af9d0c..bdd66c4026 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -32,6 +32,7 @@
>  #include "migration/vmstate.h"
>  #include "hw/scsi/emulation.h"
>  #include "scsi/constants.h"
> +#include "scsi/utils.h"
>  #include "sysemu/block-backend.h"
>  #include "sysemu/blockdev.h"
>  #include "hw/block/block.h"
> @@ -1474,6 +1475,296 @@ static void scsi_disk_emulate_read_data(SCSIRequest 
> *req)
>  scsi_req_complete(&r->req, GOOD);
>  }
>  
> +typedef struct SCSIPrReadKeys {
> +uint32_t generation;
> +uint32_t num_keys;
> +uint64_t *keys;
> +void *req;
> +} SCSIPrReadKeys;
> +
> +typedef struct SCSIPrReadReservation {
> +uint32_t generation;
> +uint64_t key;
> +BlockPrType type;
> +void *req;
> +} SCSIPrReadReservation;
> +
> +static void scsi_pr_read_keys_complete(void *opaque, int ret)
> +{
> +int num_keys;
> +uint8_t *buf;
> +SCSIPrReadKeys *blk_keys = (SCSIPrReadKeys *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_keys->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +num_keys = MIN(blk_keys->num_keys, ret);
> +blk_keys->generation = cpu_to_be32(blk_keys->generation);
> +memcpy(&buf[0], &blk_keys->generation, 4);
> +for (int i = 0; i < num_keys; i++) {
> +blk_keys->keys[i] = cpu_to_be64(blk_keys->keys[i]);
> +memcpy(&buf[8 + i * 8], &blk_keys->keys[i], 8);
> +}
> +num_keys = cpu_to_be32(num_keys * 8);
> +memcpy(&buf[4], &num_keys, 4);
> +
> +scsi_req_data(&r->req, r->buflen);
> +done:
> +scsi_req_unref(&r->req);
> +g_free(blk_keys->keys);
> +g_free(blk_keys);
> +}
> +
> +static int scsi_disk_emulate_pr_read_keys(SCSIRequest *req)
> +{
> +SCSIPrReadKeys *blk_keys;
> +SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
> +int buflen = MIN(r->req.cmd.xfer, r->buflen);
> +int num_keys = (buflen - sizeof(uint32_t) * 2) / sizeof(uint64_t);
> +
> +blk_keys = g_new0(SCSIPrReadKeys, 1);
> +blk_keys->generation = 0;
> +/* num_keys is the maximum number of keys that can be transmitted */
> +blk_keys->num_keys = num_keys;
> +blk_keys->keys = g_malloc(sizeof(uint64_t) * num_keys);
> +blk_keys->req = r;
> +
> +scsi_req_ref(&r->req);
> +r->req.aiocb = blk_aio_pr_read_keys(s->qdev.conf.blk, 
> &blk_keys->generation,
> +blk_keys->num_keys, blk_keys->keys,
> +scsi_pr_read_keys_complete, 
> blk_keys);
> +return 0;
> +}
> +
> +static void scsi_pr_read_reservation_complete(void *opaque, int ret)
> +{
> +uint8_t *buf;
> +uint32_t num_keys = 0;
> +SCSIPrReadReservation *blk_rsv = (SCSIPrReadReservation *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_rsv->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +blk_rsv->generation = cpu_to_be32(blk_rsv->generation);
> +memcpy(&buf[0], &blk_rsv->generation, 4);
> +if (ret) {
> +num_keys = cpu_to_be32(16);
> +blk_rsv->key = cpu_to_be64(blk_rsv->key);
> +memcpy(&buf[8], &blk_rsv->key, 8);
> +buf[21] = block_pr_type_to_scsi(blk_rsv->type) & 0xf;
> +} else {
> +num_keys = cpu_to_be32(0);
> +}
> +
> +memcpy(&buf[4], &num_keys, 4);
> +scsi_req_data(&r->req, r->buflen);
> +
> +done:
> +

Re: [PATCH 0/9] Support persistent reservation operations

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:20PM +0800, Changqi Lu wrote:
> Hi,
> 
> I am going to introduce persistent reservation for QEMU block.
> There are three parts in this series:
> 
> Firstly, at the block layer, the commit abstracts seven APIs related to
> the persistent reservation command. These APIs including reading keys,
> reading reservations, registering, reserving, releasing, clearing and 
> preempting.
> 
> Next, the commit implements the necessary pr-related operation APIs for both 
> the
> SCSI protocol and NVMe protocol at the device layer. This ensures that the 
> necessary
> functionality is available for handling persistent reservations in these 
> protocols.
> 
> Finally, the commit includes adaptations to the iscsi driver at the driver 
> layer
> to verify the correct implementation and functionality of the changes.
> 
> With these changes, GFS works fine in the guest. Also, sg-utils(for SCSI 
> block) and
> nvme-cli(for NVMe block) work fine too.

What is the relationship to the existing PRManager functionality
(docs/interop/pr-helper.rst) where block/file-posix.c interprets SCSI
ioctls and sends persistent reservation requests to an external helper
process?

I wonder if block/file-posix.c can implement the new block driver
callbacks using pr_mgr (while keeping the existing scsi-generic
support).

Stefan


signature.asc
Description: PGP signature


Re: [PATCH 7/9] hw/nvme: add helper functions for converting reservation types

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:27PM +0800, Changqi Lu wrote:
> This commit introduces two helper functions
> that facilitate the conversion between the
> reservation types used in the NVME protocol
> and those used in the block layer.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  hw/nvme/nvme.h | 40 
>  1 file changed, 40 insertions(+)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH 6/9] block/nvme: add reservation command protocol constants

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:26PM +0800, Changqi Lu wrote:
> Add constants for the NVMe persistent command protocol.
> The constants include the reservation command opcode and
> reservation type values defined in section 7 of the NVMe
> 2.0 specification.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  include/block/nvme.h | 30 ++
>  1 file changed, 30 insertions(+)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH 5/9] hw/scsi: add persistent reservation in/out api for scsi device

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:25PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations in the
> SCSI device layer. By introducing the persistent
> reservation in/out api, this enables the SCSI device
> to perform reservation-related tasks, including querying
> keys, querying reservation status, registering reservation
> keys, initiating and releasing reservations, as well as
> clearing and preempting reservations held by other keys.
> 
> These operations are crucial for management and control of
> shared storage resources in a persistent manner.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  hw/scsi/scsi-disk.c | 302 
>  1 file changed, 302 insertions(+)

Paolo: Please review for buffer overflows. I find the buffer size
assumption in the SCSI layer mysterious (e.g. scsi_req_get_buf() returns
a void* and it's not obvious how large the buffer is), so I didn't check
for buffer overflows.
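
For what it's worth, a helper along these lines would at least make the
bound explicit before the memcpy() loop (sketch only; I'm assuming
r->buflen is the size of the buffer returned by scsi_req_get_buf()):

  /*
   * Clamp the key count to what fits in the data-in buffer:
   * 8 bytes of header (PRGENERATION + ADDITIONAL LENGTH) plus
   * 8 bytes per key.
   */
  static uint32_t pr_read_keys_fit(uint32_t num_keys, size_t buflen)
  {
      size_t max_keys = buflen > 8 ? (buflen - 8) / 8 : 0;

      return num_keys < max_keys ? num_keys : (uint32_t)max_keys;
  }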

> 
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index 4bd7af9d0c..bdd66c4026 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -32,6 +32,7 @@
>  #include "migration/vmstate.h"
>  #include "hw/scsi/emulation.h"
>  #include "scsi/constants.h"
> +#include "scsi/utils.h"
>  #include "sysemu/block-backend.h"
>  #include "sysemu/blockdev.h"
>  #include "hw/block/block.h"
> @@ -1474,6 +1475,296 @@ static void scsi_disk_emulate_read_data(SCSIRequest 
> *req)
>  scsi_req_complete(&r->req, GOOD);
>  }
>  
> +typedef struct SCSIPrReadKeys {
> +uint32_t generation;
> +uint32_t num_keys;
> +uint64_t *keys;
> +void *req;
> +} SCSIPrReadKeys;
> +
> +typedef struct SCSIPrReadReservation {
> +uint32_t generation;
> +uint64_t key;
> +BlockPrType type;
> +void *req;
> +} SCSIPrReadReservation;
> +
> +static void scsi_pr_read_keys_complete(void *opaque, int ret)
> +{
> +int num_keys;
> +uint8_t *buf;
> +SCSIPrReadKeys *blk_keys = (SCSIPrReadKeys *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_keys->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +num_keys = MIN(blk_keys->num_keys, ret);
> +blk_keys->generation = cpu_to_be32(blk_keys->generation);
> +memcpy(&buf[0], &blk_keys->generation, 4);
> +for (int i = 0; i < num_keys; i++) {
> +blk_keys->keys[i] = cpu_to_be64(blk_keys->keys[i]);
> +memcpy(&buf[8 + i * 8], &blk_keys->keys[i], 8);
> +}
> +num_keys = cpu_to_be32(num_keys * 8);
> +memcpy(&buf[4], &num_keys, 4);
> +
> +scsi_req_data(&r->req, r->buflen);
> +done:
> +scsi_req_unref(&r->req);
> +g_free(blk_keys->keys);
> +g_free(blk_keys);
> +}
> +
> +static int scsi_disk_emulate_pr_read_keys(SCSIRequest *req)
> +{
> +SCSIPrReadKeys *blk_keys;
> +SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
> +int buflen = MIN(r->req.cmd.xfer, r->buflen);
> +int num_keys = (buflen - sizeof(uint32_t) * 2) / sizeof(uint64_t);
> +
> +blk_keys = g_new0(SCSIPrReadKeys, 1);
> +blk_keys->generation = 0;
> +/* num_keys is the maximum number of keys that can be transmitted */
> +blk_keys->num_keys = num_keys;
> +blk_keys->keys = g_malloc(sizeof(uint64_t) * num_keys);
> +blk_keys->req = r;
> +
> +scsi_req_ref(&r->req);
> +r->req.aiocb = blk_aio_pr_read_keys(s->qdev.conf.blk, 
> &blk_keys->generation,
> +blk_keys->num_keys, blk_keys->keys,
> +scsi_pr_read_keys_complete, 
> blk_keys);
> +return 0;
> +}
> +
> +static void scsi_pr_read_reservation_complete(void *opaque, int ret)
> +{
> +uint8_t *buf;
> +uint32_t num_keys = 0;
> +SCSIPrReadReservation *blk_rsv = (SCSIPrReadReservation *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_rsv->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +blk_rsv->generation = cpu_to_be32(blk_rsv->generation);
> +memcpy(&buf[0], &blk_rsv->generation, 4);
> +if (ret) {
> +num_keys = cpu_to_be32(16);

The SPC-6 calls this field Additional Length, which is clearer than
num_keys (there is only 1 key here!). Could you rename it to
additional_len to avoid confusion?
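
i.e. something like (illustrative only):

  uint32_t additional_len = cpu_to_be32(16);  /* one 16-byte reservation descriptor */
  memcpy(&buf[4], &additional_len, 4);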

> +blk_rsv->key = cpu_to_be64(blk_rsv->key);
> +memcpy(&buf[8], &blk_rsv->key, 8);
> +

Re: [PATCH 4/9] scsi/util: add helper functions for persistent reservation types conversion

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:24PM +0800, Changqi Lu wrote:
> This commit introduces two helper functions
> that facilitate the conversion between the
> persistent reservation types used in the SCSI
> protocol and those used in the block layer.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  include/scsi/utils.h |  5 +
>  scsi/utils.c | 40 
>  2 files changed, 45 insertions(+)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH 3/9] scsi/constant: add persistent reservation in/out protocol constants

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:23PM +0800, Changqi Lu wrote:
> Add constants for the persistent reservation in/out protocol
> in the scsi/constant module. The constants include the persistent
> reservation command, type, and scope values defined in sections
> 6.13 and 6.14 of the SCSI Primary Commands-4 (SPC-4) specification.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  include/scsi/constants.h | 29 +
>  1 file changed, 29 insertions(+)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH 2/9] block/raw: add persistent reservation in/out driver

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:22PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations for raw driver.
> The following methods are implemented: bdrv_co_pr_read_keys,
> bdrv_co_pr_read_reservation, bdrv_co_pr_register, bdrv_co_pr_reserve,
> bdrv_co_pr_release, bdrv_co_pr_clear and bdrv_co_pr_preempt.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  block/raw-format.c | 55 ++
>  1 file changed, 55 insertions(+)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH 1/9] block: add persistent reservation in/out api

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:21PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations
> at the block level. The following operations
> are included:
> 
> - read_keys:retrieves the list of registered keys.
> - read_reservation: retrieves the current reservation status.
> - register: registers a new reservation key.
> - reserve:  initiates a reservation for a specific key.
> - release:  releases a reservation for a specific key.
> - clear:clears all existing reservations.
> - preempt:  preempts a reservation held by another key.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  block/block-backend.c | 386 ++
>  block/io.c| 161 +
>  include/block/block-common.h  |   9 +
>  include/block/block-io.h  |  19 ++
>  include/block/block_int-common.h  |  31 +++
>  include/sysemu/block-backend-io.h |  22 ++
>  6 files changed, 628 insertions(+)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index db6f9b92a3..ec67937d28 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1770,6 +1770,392 @@ BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned 
> long int req, void *buf,
>  return blk_aio_prwv(blk, req, 0, buf, blk_aio_ioctl_entry, 0, cb, 
> opaque);
>  }
>  
> +typedef struct BlkPrInCo {
> +BlockBackend *blk;
> +uint32_t *generation;
> +uint32_t num_keys;
> +BlockPrType *type;
> +uint64_t *keys;
> +int ret;
> +} BlkPrInCo;
> +
> +typedef struct BlkPrInCB {
> +BlockAIOCB common;
> +BlkPrInCo prco;
> +bool has_returned;
> +} BlkPrInCB;
> +
> +static const AIOCBInfo blk_pr_in_aiocb_info = {
> +.aiocb_size = sizeof(BlkPrInCB),
> +};
> +
> +static void blk_pr_in_complete(BlkPrInCB *acb)
> +{
> +if (acb->has_returned) {
> +acb->common.cb(acb->common.opaque, acb->prco.ret);
> +blk_dec_in_flight(acb->prco.blk);

Please add a comment identifying which blk_inc_in_flight() call this dec
is paired with. That makes it easier for people trying to understand
in-flight reference counting.
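
For example (just to illustrate what I mean):

  static void blk_pr_in_complete(BlkPrInCB *acb)
  {
      if (acb->has_returned) {
          acb->common.cb(acb->common.opaque, acb->prco.ret);

          /* Paired with blk_inc_in_flight() in blk_aio_pr_in() */
          blk_dec_in_flight(acb->prco.blk);

          qemu_aio_unref(acb);
      }
  }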

> +qemu_aio_unref(acb);
> +}
> +}
> +
> +static void blk_pr_in_complete_bh(void *opaque)
> +{
> +BlkPrInCB *acb = opaque;
> +assert(acb->has_returned);
> +blk_pr_in_complete(acb);
> +}
> +
> +static BlockAIOCB *blk_aio_pr_in(BlockBackend *blk, uint32_t *generation,
> + uint32_t num_keys, BlockPrType *type,
> + uint64_t *keys, CoroutineEntry co_entry,
> + BlockCompletionFunc *cb, void *opaque)
> +{
> +BlkPrInCB *acb;
> +Coroutine *co;
> +
> +blk_inc_in_flight(blk);

Please add a comment identifying which blk_dec_in_flight() call this inc
is paired with.

> +acb = blk_aio_get(&blk_pr_in_aiocb_info, blk, cb, opaque);
> +acb->prco = (BlkPrInCo) {
> +.blk= blk,
> +.generation = generation,
> +.num_keys   = num_keys,
> +.type   = type,
> +.ret= NOT_DONE,
> +.keys   = keys,
> +};
> +acb->has_returned = false;
> +
> +co = qemu_coroutine_create(co_entry, acb);
> +aio_co_enter(qemu_get_current_aio_context(), co);
> +
> +acb->has_returned = true;
> +if (acb->prco.ret != NOT_DONE) {
> +replay_bh_schedule_oneshot_event(qemu_get_current_aio_context(),
> + blk_pr_in_complete_bh, acb);
> +}
> +
> +return &acb->common;
> +}
> +
> +
> +static int coroutine_fn
> +blk_aio_pr_do_read_keys(BlockBackend *blk, uint32_t *generation,
> +uint32_t num_keys, uint64_t *keys)
> +{
> +IO_CODE();
> +
> +blk_wait_while_drained(blk);
> +GRAPH_RDLOCK_GUARD();
> +
> +if (!blk_co_is_available(blk)) {
> +return -ENOMEDIUM;
> +}
> +
> +return bdrv_co_pr_read_keys(blk_bs(blk), generation, num_keys, keys);
> +}
> +
> +static void coroutine_fn blk_aio_pr_read_keys_entry(void *opaque)
> +{
> +BlkPrInCB *acb = opaque;
> +BlkPrInCo *prco = &acb->prco;
> +
> +prco->ret = blk_aio_pr_do_read_keys(prco->blk, prco->generation,
> +prco->num_keys, prco->keys);
> +blk_pr_in_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_pr_read_keys(BlockBackend *blk, uint32_t *generation,
> + uint32_t num_keys, uint64_t *keys,
> + BlockCompletionFunc *cb, void *opaque)
> +{
> +IO_CODE();
> +return blk_aio_pr_in(blk, generation, num_keys, NULL, keys,
> + blk_aio_pr_read_keys_entry, cb, opaque);
> +}
> +
> +
> +static int coroutine_fn
> +blk_aio_pr_do_read_reservation(BlockBackend *blk, uint32_t *generation,
> +   uint64_t *key, BlockPrType *type)
> +{
> +IO_CODE();
> +
> +blk_wait_while_drained(blk);
> +

Re: [PATCH] scripts/simpletrace: Mark output with unstable timestamp as WARN

2024-05-09 Thread Stefan Hajnoczi
On Thu, May 09, 2024 at 11:59:10AM +0800, Zhao Liu wrote:
> On Wed, May 08, 2024 at 02:05:04PM -0400, Stefan Hajnoczi wrote:
> > Date: Wed, 8 May 2024 14:05:04 -0400
> > From: Stefan Hajnoczi 
> > Subject: Re: [PATCH] scripts/simpletrace: Mark output with unstable
> >  timestamp as WARN
> > 
> > On Wed, 8 May 2024 at 00:19, Zhao Liu  wrote:
> > >
> > > In some trace log, there're unstable timestamp breaking temporal
> > > ordering of trace records. For example:
> > >
> > > kvm_run_exit -0.015 pid=3289596 cpu_index=0x0 reason=0x6
> > > kvm_vm_ioctl -0.020 pid=3289596 type=0xc008ae67 arg=0x7ffeefb5aa60
> > > kvm_vm_ioctl -0.021 pid=3289596 type=0xc008ae67 arg=0x7ffeefb5aa60
> > >
> > > Negative delta intervals tend to get drowned in the massive trace logs,
> > > and an unstable timestamp can corrupt the calculation of intervals
> > > between two events adjacent to it.
> > >
> > > Therefore, mark the outputs with unstable timestamps as WARN like:
> > 
> > Why are the timestamps non-monotonic?
> > 
> > In a situation like that maybe not only the negative timestamps are
> > useless but even some positive timestamps are incorrect. I think it's
> > worth understanding the nature of the instability before merging a
> > fix.
> 
> I grabbed more traces (by -trace "*" to cover as many events as possible
> ) and have a couple observations:
> 
> * It's not an issue with kvm's ioctl, and that qemu_mutex_lock/
>   object_dynamic_cast_assert accounted for the vast majority of all
>   exception timestamps.
> * The total exception timestamp occurrence probability was roughly 0.013%
>   (608 / 4,616,938) in a 398M trace file.
> * And the intervals between the "wrong" timestamp and its pre event are
>   almost all within 50ns, even more concentrated within 20ns (there are
>   even quite a few 1~10ns).
> 
> Just a guess:
> 
> Would it be possible that a trace event which is too short of an interval,
> and happen to meet a delay in signaling to send to writeout_thread?
> 
> It seems that this short interval like a lack of real-time guarantees in
> the underlying mechanism...
> 
> If it's QEMU's own issue, I feel like the intervals should be randomized,
> not just within 50ns...
> 
> May I ask what you think? Any suggestions for researching this situation
> ;-)

QEMU uses clock_gettime(CLOCK_MONOTONIC) on Linux hosts. The man page
says:

  All CLOCK_MONOTONIC variants guarantee that the time returned by
  consecutive  calls  will  not go backwards, but successive calls
  may—depending  on  the  architecture—return  identical  (not-in‐
  creased) time values.

trace_record_start() calls clock_gettime(CLOCK_MONOTONIC) so trace events
should have monotonically increasing timestamps.

I don't see a scenario where trace record A's timestamp is greater than
trace record B's timestamp unless the clock is non-monotonic.

Which host CPU architecture and operating system are you running?

Please attach to the QEMU process with gdb and print out the value of
the use_rt_clock variable or add a printf in init_get_clock(). The value
should be 1.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH] scripts/simpletrace: Mark output with unstable timestamp as WARN

2024-05-08 Thread Stefan Hajnoczi
On Wed, 8 May 2024 at 00:19, Zhao Liu  wrote:
>
> In some trace log, there're unstable timestamp breaking temporal
> ordering of trace records. For example:
>
> kvm_run_exit -0.015 pid=3289596 cpu_index=0x0 reason=0x6
> kvm_vm_ioctl -0.020 pid=3289596 type=0xc008ae67 arg=0x7ffeefb5aa60
> kvm_vm_ioctl -0.021 pid=3289596 type=0xc008ae67 arg=0x7ffeefb5aa60
>
> Negative delta intervals tend to get drowned in the massive trace logs,
> and an unstable timestamp can corrupt the calculation of intervals
> between two events adjacent to it.
>
> Therefore, mark the outputs with unstable timestamps as WARN like:

Why are the timestamps non-monotonic?

In a situation like that maybe not only the negative timestamps are
useless but even some positive timestamps are incorrect. I think it's
worth understanding the nature of the instability before merging a
fix.

Stefan

>
> WARN: skip unstable timestamp: kvm_run_exit 
> cur(8497404907761146)-pre(8497404907761161) pid=3289596 cpu_index=0x0 
> reason=0x6
> WARN: skip unstable timestamp: kvm_vm_ioctl 
> cur(8497404908603653)-pre(8497404908603673) pid=3289596 
> type=0xc008ae67 arg=0x7ffeefb5aa60
> WARN: skip unstable timestamp: kvm_vm_ioctl 
> cur(8497404908625787)-pre(8497404908625808) pid=3289596 
> type=0xc008ae67 arg=0x7ffeefb5aa60
>
> This would help to identify unusual events.
>
> And skip them without updating Formatter2.last_timestamp_ns to avoid
> time back.
>
> Signed-off-by: Zhao Liu 
> ---
>  scripts/simpletrace.py | 11 +++
>  1 file changed, 11 insertions(+)
>
> diff --git a/scripts/simpletrace.py b/scripts/simpletrace.py
> index cef81b0707f0..23911dddb8a6 100755
> --- a/scripts/simpletrace.py
> +++ b/scripts/simpletrace.py
> @@ -343,6 +343,17 @@ def __init__(self):
>  def catchall(self, *rec_args, event, timestamp_ns, pid, event_id):
>  if self.last_timestamp_ns is None:
>  self.last_timestamp_ns = timestamp_ns
> +
> +if timestamp_ns < self.last_timestamp_ns:
> +fields = [
> +f'{name}={r}' if is_string(type) else f'{name}=0x{r:x}'
> +for r, (type, name) in zip(rec_args, event.args)
> +]
> +print(f'WARN: skip unstable timestamp: {event.name} '
> +  f'cur({timestamp_ns})-pre({self.last_timestamp_ns}) 
> {pid=} ' +
> +  f' '.join(fields))
> +return
> +
>  delta_ns = timestamp_ns - self.last_timestamp_ns
>  self.last_timestamp_ns = timestamp_ns
>
> --
> 2.34.1
>
>



[PATCH] qemu-io: add cvtnum() error handling for zone commands

2024-05-07 Thread Stefan Hajnoczi
cvtnum() parses positive int64_t values and returns a negative errno on
failure. Print errors and return early when cvtnum() fails.

While we're at it, also reject nr_zones values greater than or equal to 2^32
since they cannot be represented in an unsigned int.

Reported-by: Peter Maydell 
Cc: Sam Li 
Signed-off-by: Stefan Hajnoczi 
---
 qemu-io-cmds.c | 48 +++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index f5d7202a13..e2fab57183 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1739,12 +1739,26 @@ static int zone_report_f(BlockBackend *blk, int argc, 
char **argv)
 {
 int ret;
 int64_t offset;
+int64_t val;
 unsigned int nr_zones;
 
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
-nr_zones = cvtnum(argv[optind]);
+val = cvtnum(argv[optind]);
+if (val < 0) {
+print_cvtnum_err(val, argv[optind]);
+return val;
+}
+if (val > UINT_MAX) {
+printf("Number of zones must be less than 2^32\n");
+return -ERANGE;
+}
+nr_zones = val;
 
 g_autofree BlockZoneDescriptor *zones = NULL;
 zones = g_new(BlockZoneDescriptor, nr_zones);
@@ -1780,8 +1794,16 @@ static int zone_open_f(BlockBackend *blk, int argc, char 
**argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len);
 if (ret < 0) {
 printf("zone open failed: %s\n", strerror(-ret));
@@ -1805,8 +1827,16 @@ static int zone_close_f(BlockBackend *blk, int argc, 
char **argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len);
 if (ret < 0) {
 printf("zone close failed: %s\n", strerror(-ret));
@@ -1830,8 +1860,16 @@ static int zone_finish_f(BlockBackend *blk, int argc, 
char **argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len);
 if (ret < 0) {
 printf("zone finish failed: %s\n", strerror(-ret));
@@ -1855,8 +1893,16 @@ static int zone_reset_f(BlockBackend *blk, int argc, 
char **argv)
 int64_t offset, len;
 ++optind;
 offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
 ++optind;
 len = cvtnum(argv[optind]);
+if (len < 0) {
+print_cvtnum_err(len, argv[optind]);
+return len;
+}
 ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len);
 if (ret < 0) {
 printf("zone reset failed: %s\n", strerror(-ret));
-- 
2.45.0




Re: [PULL v2 03/16] block/block-backend: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2024-05-07 Thread Stefan Hajnoczi
On Fri, May 03, 2024 at 01:33:51PM +0100, Peter Maydell wrote:
> On Mon, 15 May 2023 at 17:07, Stefan Hajnoczi  wrote:
> >
> > From: Sam Li 
> >
> > Add zoned device option to host_device BlockDriver. It will be presented 
> > only
> > for zoned host block devices. By adding zone management operations to the
> > host_block_device BlockDriver, users can use the new block layer APIs
> > including Report Zone and four zone management operations
> > (open, close, finish, reset, reset_all).
> >
> > Qemu-io uses the new APIs to perform zoned storage commands of the device:
> > zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> > zone_finish(zf).
> >
> > For example, to test zone_report, use following command:
> > $ ./build/qemu-io --image-opts -n driver=host_device, filename=/dev/nullb0
> > -c "zrp offset nr_zones"
> 
> Hi; Coverity points out an issue in this commit (CID 1544771):
> 
> > +static int zone_report_f(BlockBackend *blk, int argc, char **argv)
> > +{
> > +int ret;
> > +int64_t offset;
> > +unsigned int nr_zones;
> > +
> > +++optind;
> > +offset = cvtnum(argv[optind]);
> > +++optind;
> > +nr_zones = cvtnum(argv[optind]);
> 
> cvtnum() can fail and return a negative value on error
> (e.g. if the number in the string is out of range),
> but we are not checking for that. Instead we stuff
> the value into an 'unsigned int' and then pass that to
> g_new(), which will result in our trying to allocate a large
> amount of memory.
> 
> Here, and also in the other functions below that use cvtnum(),
> I think we should follow the pattern for use of that function
> that is used in the pre-existing code in this function:
> 
>  int64_t foo; /* NB: not an unsigned or some smaller type */
> 
>  foo = cvtnum(arg)
>  if (foo < 0) {
>  print_cvtnum_err(foo, arg);
>  return foo; /* or otherwise handle returning an error upward */
>  }
> 
> It looks like all the uses of cvtnum in this patch should be
> adjusted to handle errors.

Thanks for letting me know. I will send a patch.

Stefan


signature.asc
Description: PGP signature


[PATCH 2/2] aio: warn about iohandler_ctx special casing

2024-05-06 Thread Stefan Hajnoczi
The main loop has two AioContexts: qemu_aio_context and iohandler_ctx.
The main loop runs them both, but nested aio_poll() calls on
qemu_aio_context exclude iohandler_ctx.

Which one should qemu_get_current_aio_context() return when called from
the main loop? Document that it's always qemu_aio_context.

This has subtle effects on functions that use
qemu_get_current_aio_context(). For example, aio_co_reschedule_self()
does not work when moving from iohandler_ctx to qemu_aio_context because
qemu_get_current_aio_context() does not differentiate these two
AioContexts.
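
For example (illustrative sketch, not part of this patch; it assumes a
coroutine that was scheduled on iohandler_ctx and uses
qemu_get_aio_context() to name the main loop AioContext):

  static void coroutine_fn example_co(void *opaque)
  {
      /*
       * We are running in iohandler_ctx, but
       * qemu_get_current_aio_context() reports qemu_aio_context, so
       * aio_co_reschedule_self() concludes we are already in the target
       * context and returns without moving the coroutine.
       */
      aio_co_reschedule_self(qemu_get_aio_context());

      /* Still running in iohandler_ctx here. */
  }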

Document this in order to reduce the chance of future bugs.

Signed-off-by: Stefan Hajnoczi 
---
 include/block/aio.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/block/aio.h b/include/block/aio.h
index 8378553eb9..4ee81936ed 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -629,6 +629,9 @@ void aio_co_schedule(AioContext *ctx, Coroutine *co);
  *
  * Move the currently running coroutine to new_ctx. If the coroutine is already
  * running in new_ctx, do nothing.
+ *
+ * Note that this function cannot reschedule from iohandler_ctx to
+ * qemu_aio_context.
  */
 void coroutine_fn aio_co_reschedule_self(AioContext *new_ctx);
 
@@ -661,6 +664,9 @@ void aio_co_enter(AioContext *ctx, Coroutine *co);
  * If called from an IOThread this will be the IOThread's AioContext.  If
  * called from the main thread or with the "big QEMU lock" taken it
  * will be the main loop AioContext.
+ *
+ * Note that the return value is never the main loop's iohandler_ctx and the
+ * return value is the main loop AioContext instead.
  */
 AioContext *qemu_get_current_aio_context(void);
 
-- 
2.45.0



