Re: [PATCH v4 12/12] tests/qtest/vhost-user-test: add a test case for memory-backend-shm

2024-05-09 Thread Thomas Huth

On 08/05/2024 09.44, Stefano Garzarella wrote:

`memory-backend-shm` can be used with vhost-user devices, so let's
add a new test case for it.

Signed-off-by: Stefano Garzarella 
---
  tests/qtest/vhost-user-test.c | 23 +++
  1 file changed, 23 insertions(+)

diff --git a/tests/qtest/vhost-user-test.c b/tests/qtest/vhost-user-test.c
index d4e437265f..8c1d903b2a 100644
--- a/tests/qtest/vhost-user-test.c
+++ b/tests/qtest/vhost-user-test.c
@@ -44,6 +44,8 @@
  "mem-path=%s,share=on -numa node,memdev=mem"
  #define QEMU_CMD_MEMFD  " -m %d -object memory-backend-memfd,id=mem,size=%dM," \
  " -numa node,memdev=mem"
+#define QEMU_CMD_SHM" -m %d -object memory-backend-shm,id=mem,size=%dM," \
+" -numa node,memdev=mem"
  #define QEMU_CMD_CHR" -chardev socket,id=%s,path=%s%s"
  #define QEMU_CMD_NETDEV " -netdev vhost-user,id=hs0,chardev=%s,vhostforce=on"
  
@@ -195,6 +197,7 @@ enum test_memfd {

  TEST_MEMFD_AUTO,
  TEST_MEMFD_YES,
  TEST_MEMFD_NO,
+TEST_MEMFD_SHM,
  };
  
  static void append_vhost_net_opts(TestServer *s, GString *cmd_line,

@@ -228,6 +231,8 @@ static void append_mem_opts(TestServer *server, GString *cmd_line,
  
  if (memfd == TEST_MEMFD_YES) {

  g_string_append_printf(cmd_line, QEMU_CMD_MEMFD, size, size);
+} else if (memfd == TEST_MEMFD_SHM) {
+g_string_append_printf(cmd_line, QEMU_CMD_SHM, size, size);
  } else {
  const char *root = init_hugepagefs() ? : server->tmpfs;
  
@@ -788,6 +793,19 @@ static void *vhost_user_test_setup_memfd(GString *cmd_line, void *arg)

  return server;
  }
  
+static void *vhost_user_test_setup_shm(GString *cmd_line, void *arg)

+{
+TestServer *server = test_server_new("vhost-user-test", arg);
+test_server_listen(server);
+
+append_mem_opts(server, cmd_line, 256, TEST_MEMFD_SHM);
+server->vu_ops->append_opts(server, cmd_line, "");
+
+g_test_queue_destroy(vhost_user_test_cleanup, server);
+
+return server;
+}
+
  static void test_read_guest_mem(void *obj, void *arg, QGuestAllocator *alloc)
  {
  TestServer *server = arg;
@@ -1081,6 +1099,11 @@ static void register_vhost_user_test(void)
   "virtio-net",
   test_read_guest_mem, &opts);
  
+opts.before = vhost_user_test_setup_shm;

+qos_add_test("vhost-user/read-guest-mem/shm",
+ "virtio-net",
+ test_read_guest_mem, &opts);
+
  if (qemu_memfd_check(MFD_ALLOW_SEALING)) {
  opts.before = vhost_user_test_setup_memfd;
  qos_add_test("vhost-user/read-guest-mem/memfd",


Acked-by: Thomas Huth 




Re: [PATCH v4 11/12] tests/qtest/vhost-user-blk-test: use memory-backend-shm

2024-05-09 Thread Thomas Huth

On 08/05/2024 09.44, Stefano Garzarella wrote:

`memory-backend-memfd` is available only on Linux while the new
`memory-backend-shm` can be used on any POSIX-compliant operating
system. Let's use it so we can run the test in multiple environments.

Signed-off-by: Stefano Garzarella 
---
  tests/qtest/vhost-user-blk-test.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qtest/vhost-user-blk-test.c b/tests/qtest/vhost-user-blk-test.c
index 117b9acd10..e945f6abf2 100644
--- a/tests/qtest/vhost-user-blk-test.c
+++ b/tests/qtest/vhost-user-blk-test.c
@@ -906,7 +906,7 @@ static void start_vhost_user_blk(GString *cmd_line, int vus_instances,
 vhost_user_blk_bin);
  
  g_string_append_printf(cmd_line,

-" -object memory-backend-memfd,id=mem,size=256M,share=on "
+" -object memory-backend-shm,id=mem,size=256M,share=on "
  " -M memory-backend=mem -m 256M ");
  
  for (i = 0; i < vus_instances; i++) {


Acked-by: Thomas Huth 




Re: Re: [PATCH 1/9] block: add persistent reservation in/out api

2024-05-09 Thread zhenwei pi




On 5/10/24 02:22, Stefan Hajnoczi wrote:

On Wed, May 08, 2024 at 05:36:21PM +0800, Changqi Lu wrote:


[SNIP]


+
+/**
+ * Persist Through Power Loss (PTPL) is considered required by the QEMU
+ * block layer; the block driver must always enable PTPL.
+ */


What is the reasoning behind this? Will applications that rely on PTPL=0
work?



Hi Stefan,

In theory PTPL needs to be supported at the QEMU block layer, including
reporting the PTPL capability and the PTPL flag in the PR OUT command.
However, in real production environments the SCSI driver, the NVMe driver,
and even the Linux block layer always enable PTPL unconditionally on a
Linux platform.


Ref the latest code:
1, SCSI:
https://github.com/torvalds/linux/blob/master/drivers/scsi/sd.c#L1978
static int sd_pr_register(struct block_device *bdev, u64 old_key, u64 
new_key,

u32 flags)
{
if (flags & ~PR_FL_IGNORE_KEY)
return -EOPNOTSUPP;
return sd_pr_out_command(bdev, (flags & PR_FL_IGNORE_KEY) ? 0x06 : 0x00,
old_key, new_key, 0,
(1 << 0) /* APTPL */);
}

2, NVMe:
https://github.com/torvalds/linux/blob/master/drivers/nvme/host/pr.c#L127
static int nvme_pr_register(struct block_device *bdev, u64 old,
u64 new, unsigned flags)
{
u32 cdw10;

if (flags & ~PR_FL_IGNORE_KEY)
return -EOPNOTSUPP;

cdw10 = old ? 2 : 0;
cdw10 |= (flags & PR_FL_IGNORE_KEY) ? 1 << 3 : 0;
cdw10 |= (1 << 30) | (1 << 31); /* PTPL=1 */
return nvme_pr_command(bdev, cdw10, old, new, nvme_cmd_resv_register);
}

3, linux block layers also hides PTPL flag:
https://github.com/torvalds/linux/blob/master/block/ioctl.c#L283
static int blkdev_pr_register(struct block_device *bdev, blk_mode_t mode,
struct pr_registration __user *arg)
{
const struct pr_ops *ops = bdev->bd_disk->fops->pr_ops;
struct pr_registration reg;

if (!blkdev_pr_allowed(bdev, mode))
return -EPERM;
if (!ops || !ops->pr_register)
return -EOPNOTSUPP;
if (copy_from_user(&reg, arg, sizeof(reg)))
return -EFAULT;

if (reg.flags & ~PR_FL_IGNORE_KEY)
return -EOPNOTSUPP;
return ops->pr_register(bdev, reg.old_key, reg.new_key, reg.flags);
}


So we (Changqi and I) wanted to keep PR a bit simpler in the QEMU block layer:
- consider PTPL as required in the QEMU block layer, so we don't need an extra flag
- the block backend driver always requests PR OUT with the PTPL flag

Then: Will applications that rely on PTPL=0 work?
Yes, a guest PR OUT without PTPL will work, but the backend uses PTPL=1
instead.


Will this request succeed?
If the backend driver supports the PTPL capability, it will succeed.
Otherwise it will fail.
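
For illustration only, the behaviour described above (always requesting PTPL
from the backend) could look roughly like the following when building an SPC-4
PERSISTENT RESERVE OUT parameter list. This is a hedged sketch mirroring the
kernel snippets quoted above; the helper name is hypothetical and it is not the
series' actual code:

#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build an SPC-4 PERSISTENT RESERVE OUT parameter list.
 * As in the kernel drivers quoted above, the APTPL (Persist Through Power
 * Loss) bit is set unconditionally instead of being exposed to callers as a
 * separate flag. */
static void build_pr_out_param_list(uint8_t buf[24],
                                    uint64_t old_key, uint64_t new_key)
{
    memset(buf, 0, 24);
    for (int i = 0; i < 8; i++) {
        buf[i]     = old_key >> (56 - i * 8); /* reservation key, big-endian */
        buf[8 + i] = new_key >> (56 - i * 8); /* service action res. key */
    }
    buf[20] = 1 << 0; /* APTPL=1: PTPL is always requested */
}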


--
zhenwei pi



Re: [PATCH v4 00/12] vhost-user: support any POSIX system (tested on macOS, FreeBSD, OpenBSD)

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 09:44:44AM +0200, Stefano Garzarella wrote:
> v1: https://patchew.org/QEMU/20240228114759.44758-1-sgarz...@redhat.com/
> v2: https://patchew.org/QEMU/20240326133936.125332-1-sgarz...@redhat.com/
> v3: https://patchew.org/QEMU/20240404122330.92710-1-sgarz...@redhat.com/
> v4:
>   - rebased on master (commit e116b92d01c2cd75957a9f8ad1d4932292867b81)
>   - added patch 6 to move the switch to QEMU bswap helper functions into a
> separate patch (Phil)
>   - fail if we find "share=off" in shm_backend_memory_alloc() (David)
>   - added Phil's R-b and David's A-b
> 
> The vhost-user protocol is not really Linux-specific, so let's try to support
> QEMU's frontends and backends (including libvhost-user) on any POSIX system
> with this series. The main use case is to be able to use virtio devices that
> are not built into QEMU (e.g. virtiofsd, vhost-user-vsock, etc.) even
> on non-Linux systems.
> 
> The first 5 patches are more like fixes discovered at runtime on macOS or
> FreeBSD that could go even independently of this series.
> 
> Patches 6, 7, 8, and 9 enable building of frontends and backends (including
> libvhost-user) with associated code changes to succeed in compilation.
> 
> Patch 10 adds `memory-backend-shm` that uses the POSIX shm_open() API to
> create shared memory which is identified by an fd that can be shared with
> vhost-user backends. This is useful on those systems (like macOS) where
> we don't have memfd_create() or special filesystems like "/dev/shm".
> 
> Patches 11 and 12 use `memory-backend-shm` in some vhost-user tests.
> 
> Maybe the first 5 patches can go separately, but I only discovered those
> problems after testing patches 6 - 9, so I have included them in this series
> for now. Please let me know if you prefer that I send them separately.
> 
> I tested this series using vhost-user-blk and QSD on macOS Sonoma 14.4
> (aarch64), FreeBSD 14 (x86_64), OpenBSD 7.4 (x86_64), and Fedora 39 (x86_64)
> in this way:
> 
> - Start vhost-user-blk or QSD (same commands for all systems)
> 
>   vhost-user-blk -s /tmp/vhost.socket \
> -b Fedora-Cloud-Base-39-1.5.x86_64.raw
> 
>   qemu-storage-daemon \
> --blockdev 
> file,filename=Fedora-Cloud-Base-39-1.5.x86_64.qcow2,node-name=file \
> --blockdev qcow2,file=file,node-name=qcow2 \
> --export 
> vhost-user-blk,addr.type=unix,addr.path=/tmp/vhost.socket,id=vub,num-queues=1,node-name=qcow2,writable=on
> 
> - macOS (aarch64): start QEMU (using hvf accelerator)
> 
>   qemu-system-aarch64 -smp 2 -cpu host -M virt,accel=hvf,memory-backend=mem \
> -drive 
> file=./build/pc-bios/edk2-aarch64-code.fd,if=pflash,format=raw,readonly=on \
> -device virtio-net-device,netdev=net0 -netdev user,id=net0 \
> -device ramfb -device usb-ehci -device usb-kbd \
> -object memory-backend-shm,id=mem,size=512M \
> -device vhost-user-blk-pci,num-queues=1,disable-legacy=on,chardev=char0 \
> -chardev socket,id=char0,path=/tmp/vhost.socket
> 
> - FreeBSD/OpenBSD (x86_64): start QEMU (no accelerators available)
> 
>   qemu-system-x86_64 -smp 2 -M q35,memory-backend=mem \
> -object memory-backend-shm,id=mem,size="512M" \
> -device vhost-user-blk-pci,num-queues=1,chardev=char0 \
> -chardev socket,id=char0,path=/tmp/vhost.socket
> 
> - Fedora (x86_64): start QEMU (using kvm accelerator)
> 
>   qemu-system-x86_64 -smp 2 -M q35,accel=kvm,memory-backend=mem \
> -object memory-backend-shm,size="512M" \
> -device vhost-user-blk-pci,num-queues=1,chardev=char0 \
> -chardev socket,id=char0,path=/tmp/vhost.socket
> 
> Branch pushed (and CI started) at 
> https://gitlab.com/sgarzarella/qemu/-/tree/macos-vhost-user?ref_type=heads
> 
> Thanks,
> Stefano
> 
> Stefano Garzarella (12):
>   libvhost-user: set msg.msg_control to NULL when it is empty
>   libvhost-user: fail vu_message_write() if sendmsg() is failing
>   libvhost-user: mask F_INFLIGHT_SHMFD if memfd is not supported
>   vhost-user-server: do not set memory fd non-blocking
>   contrib/vhost-user-blk: fix bind() using the right size of the address
>   contrib/vhost-user-*: use QEMU bswap helper functions
>   vhost-user: enable frontends on any POSIX system
>   libvhost-user: enable it on any POSIX system
>   contrib/vhost-user-blk: enable it on any POSIX system
>   hostmem: add a new memory backend based on POSIX shm_open()
>   tests/qtest/vhost-user-blk-test: use memory-backend-shm
>   tests/qtest/vhost-user-test: add a test case for memory-backend-shm
> 
>  docs/system/devices/vhost-user.rst|   5 +-
>  meson.build   |   5 +-
>  qapi/qom.json |  17 +++
>  subprojects/libvhost-user/libvhost-user.h |   2 +-
>  backends/hostmem-shm.c| 123 ++
>  contrib/vhost-user-blk/vhost-user-blk.c   |  27 +++--
>  contrib/vhost-user-input/main.c   |  16 +--
>  hw/net/vhost_net.c|   5 +
>  subprojects/libvhost-user/libvhost-user.c |  76 
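
As background for patch 10 in the list above, the POSIX shared-memory
allocation that `memory-backend-shm` relies on is roughly the sequence below.
This is a simplified sketch, not the actual backends/hostmem-shm.c code; the
shm object name here is a placeholder:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a POSIX shm object, size it, map it shared, and keep the fd so it
 * can later be passed to a vhost-user backend over the control socket. */
static void *alloc_shm_ram(size_t size, int *out_fd)
{
    /* Placeholder name; a real backend generates a unique name and
     * shm_unlink()s it right away so nothing is left behind. */
    const char *name = "/qemu-shm-example";
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0) {
        perror("shm_open");
        return NULL;
    }
    shm_unlink(name); /* only the fd keeps the object alive */

    if (ftruncate(fd, size) < 0) {
        perror("ftruncate");
        close(fd);
        return NULL;
    }

    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return NULL;
    }

    *out_fd = fd; /* shareable with the vhost-user backend */
    return ptr;
}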

Re: [PATCH 5/9] hw/scsi: add persistent reservation in/out api for scsi device

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:25PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations in the
> SCSI device layer. By introducing the persistent
> reservation in/out api, this enables the SCSI device
> to perform reservation-related tasks, including querying
> keys, querying reservation status, registering reservation
> keys, initiating and releasing reservations, as well as
> clearing and preempting reservations held by other keys.
> 
> These operations are crucial for management and control of
> shared storage resources in a persistent manner.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  hw/scsi/scsi-disk.c | 302 
>  1 file changed, 302 insertions(+)

Does sg_persist --report-capabilities
(https://linux.die.net/man/8/sg_persist) show that persistent
reservations are supported? Maybe some INQUIRY data still needs to be
added.

> 
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index 4bd7af9d0c..bdd66c4026 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -32,6 +32,7 @@
>  #include "migration/vmstate.h"
>  #include "hw/scsi/emulation.h"
>  #include "scsi/constants.h"
> +#include "scsi/utils.h"
>  #include "sysemu/block-backend.h"
>  #include "sysemu/blockdev.h"
>  #include "hw/block/block.h"
> @@ -1474,6 +1475,296 @@ static void scsi_disk_emulate_read_data(SCSIRequest *req)
>  scsi_req_complete(&r->req, GOOD);
>  }
>  
> +typedef struct SCSIPrReadKeys {
> +uint32_t generation;
> +uint32_t num_keys;
> +uint64_t *keys;
> +void *req;
> +} SCSIPrReadKeys;
> +
> +typedef struct SCSIPrReadReservation {
> +uint32_t generation;
> +uint64_t key;
> +BlockPrType type;
> +void *req;
> +} SCSIPrReadReservation;
> +
> +static void scsi_pr_read_keys_complete(void *opaque, int ret)
> +{
> +int num_keys;
> +uint8_t *buf;
> +SCSIPrReadKeys *blk_keys = (SCSIPrReadKeys *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_keys->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +num_keys = MIN(blk_keys->num_keys, ret);
> +blk_keys->generation = cpu_to_be32(blk_keys->generation);
> +memcpy(&buf[0], &blk_keys->generation, 4);
> +for (int i = 0; i < num_keys; i++) {
> +blk_keys->keys[i] = cpu_to_be64(blk_keys->keys[i]);
> +memcpy(&buf[8 + i * 8], &blk_keys->keys[i], 8);
> +}
> +num_keys = cpu_to_be32(num_keys * 8);
> +memcpy(&buf[4], &num_keys, 4);
> +
> +scsi_req_data(&r->req, r->buflen);
> +done:
> +scsi_req_unref(&r->req);
> +g_free(blk_keys->keys);
> +g_free(blk_keys);
> +}
> +
> +static int scsi_disk_emulate_pr_read_keys(SCSIRequest *req)
> +{
> +SCSIPrReadKeys *blk_keys;
> +SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
> +int buflen = MIN(r->req.cmd.xfer, r->buflen);
> +int num_keys = (buflen - sizeof(uint32_t) * 2) / sizeof(uint64_t);
> +
> +blk_keys = g_new0(SCSIPrReadKeys, 1);
> +blk_keys->generation = 0;
> +/* num_keys is the maximum number of keys that can be transmitted */
> +blk_keys->num_keys = num_keys;
> +blk_keys->keys = g_malloc(sizeof(uint64_t) * num_keys);
> +blk_keys->req = r;
> +
> +scsi_req_ref(&r->req);
> +r->req.aiocb = blk_aio_pr_read_keys(s->qdev.conf.blk, &blk_keys->generation,
> +blk_keys->num_keys, blk_keys->keys,
> +scsi_pr_read_keys_complete, blk_keys);
> +return 0;
> +}
> +
> +static void scsi_pr_read_reservation_complete(void *opaque, int ret)
> +{
> +uint8_t *buf;
> +uint32_t num_keys = 0;
> +SCSIPrReadReservation *blk_rsv = (SCSIPrReadReservation *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_rsv->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +blk_rsv->generation = cpu_to_be32(blk_rsv->generation);
> +memcpy(&buf[0], &blk_rsv->generation, 4);
> +if (ret) {
> +num_keys = cpu_to_be32(16);
> +blk_rsv->key = cpu_to_be64(blk_rsv->key);
> +memcpy(&buf[8], &blk_rsv->key, 8);
> +buf[21] = block_pr_type_to_scsi(blk_rsv->type) & 0xf;
> +} else {
> +num_keys = cpu_to_be32(0);
> +}
> +
> +memcpy(&buf[4], &num_keys, 4);
> +scsi_req_data(&r->req, r->buflen);
> +
> +done:
> +

Re: [PATCH 0/9] Support persistent reservation operations

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:20PM +0800, Changqi Lu wrote:
> Hi,
> 
> I am going to introduce persistent reservation for QEMU block.
> There are three parts in this series:
> 
> Firstly, at the block layer, the commit abstracts seven APIs related to
> the persistent reservation command. These APIs including reading keys,
> reading reservations, registering, reserving, releasing, clearing and 
> preempting.
> 
> Next, the commit implements the necessary pr-related operation APIs for both
> the SCSI protocol and NVMe protocol at the device layer. This ensures that
> the necessary functionality is available for handling persistent reservations
> in these protocols.
> 
> Finally, the commit includes adaptations to the iscsi driver at the driver
> layer to verify the correct implementation and functionality of the changes.
> 
> With these changes, GFS works fine in the guest. Also, sg-utils (for SCSI
> block) and nvme-cli (for NVMe block) work fine too.

What is the relationship to the existing PRManager functionality
(docs/interop/pr-helper.rst) where block/file-posix.c interprets SCSI
ioctls and sends persistent reservation requests to an external helper
process?

I wonder if block/file-posix.c can implement the new block driver
callbacks using pr_mgr (while keeping the existing scsi-generic
support).
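
To make the idea concrete, a file-posix style callback that funnels the new
"register" operation through the existing PR manager could look roughly like
the sketch below. The callback name and argument list are made up, and it
assumes pr_manager_execute() keeps its current (fd, SG_IO command, sg_io_hdr)
shape; this is not a reviewed or proposed implementation:

#include "scsi/pr-manager.h"   /* PRManager, pr_manager_execute() */
#include <scsi/sg.h>
#include <stdint.h>
#include <string.h>

#define PR_OUT_OPCODE      0x5f   /* PERSISTENT RESERVE OUT */
#define PR_OUT_SA_REGISTER 0x00

static int pr_register_via_pr_mgr(PRManager *pr_mgr, int fd,
                                  uint64_t old_key, uint64_t new_key)
{
    uint8_t cdb[10] = { PR_OUT_OPCODE, PR_OUT_SA_REGISTER };
    uint8_t param[24] = { 0 };
    struct sg_io_hdr io_hdr = { 0 };

    cdb[8] = sizeof(param);              /* parameter list length (LSB) */
    for (int i = 0; i < 8; i++) {
        param[i]     = old_key >> (56 - i * 8);   /* reservation key */
        param[8 + i] = new_key >> (56 - i * 8);   /* service action key */
    }
    param[20] = 1;                       /* APTPL, as discussed elsewhere */

    io_hdr.interface_id    = 'S';
    io_hdr.dxfer_direction = SG_DXFER_TO_DEV;
    io_hdr.cmdp            = cdb;
    io_hdr.cmd_len         = sizeof(cdb);
    io_hdr.dxferp          = param;
    io_hdr.dxfer_len       = sizeof(param);

    /* Assumed call shape: forward to the external pr-helper process. */
    return pr_manager_execute(pr_mgr, fd, SG_IO, &io_hdr);
}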

Stefan




Re: [PATCH 7/9] hw/nvme: add helper functions for converting reservation types

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:27PM +0800, Changqi Lu wrote:
> This commit introduces two helper functions
> that facilitate the conversion between the
> reservation types used in the NVME protocol
> and those used in the block layer.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  hw/nvme/nvme.h | 40 
>  1 file changed, 40 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 6/9] block/nvme: add reservation command protocol constants

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:26PM +0800, Changqi Lu wrote:
> Add constants for the NVMe persistent command protocol.
> The constants include the reservation command opcode and
> reservation type values defined in section 7 of the NVMe
> 2.0 specification.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  include/block/nvme.h | 30 ++
>  1 file changed, 30 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 5/9] hw/scsi: add persistent reservation in/out api for scsi device

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:25PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations in the
> SCSI device layer. By introducing the persistent
> reservation in/out api, this enables the SCSI device
> to perform reservation-related tasks, including querying
> keys, querying reservation status, registering reservation
> keys, initiating and releasing reservations, as well as
> clearing and preempting reservations held by other keys.
> 
> These operations are crucial for management and control of
> shared storage resources in a persistent manner.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  hw/scsi/scsi-disk.c | 302 
>  1 file changed, 302 insertions(+)

Paolo: Please review for buffer overflows. I find the buffer size
assumption in the SCSI layer mysterious (e.g. scsi_req_get_buf() returns
a void* and it's not obvious how large the buffer is), so I didn't check
for buffer overflows.
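
To illustrate the kind of bound being asked about: whatever scsi_req_get_buf()
hands back, the completion could clamp what it writes to the buffer size the
SCSI layer allocated (r->buflen in the quoted code). A standalone sketch of
such a bounded encoder, simplified and not a reviewed fix:

#include <stdint.h>

/* Sketch: encode a READ KEYS response without ever writing past 'buflen'
 * bytes (the size of the buffer the request owns).  The number of keys
 * copied is clamped to what fits after the 8-byte header.  Simplified: a
 * real implementation would still report the full key count in
 * ADDITIONAL LENGTH per SPC. */
static int encode_pr_read_keys(uint8_t *buf, int buflen, uint32_t generation,
                               const uint64_t *keys, int nr_keys)
{
    if (buflen < 8) {
        return 0;
    }

    int max_keys = (buflen - 8) / 8;
    int n = nr_keys < max_keys ? nr_keys : max_keys;

    buf[0] = generation >> 24;
    buf[1] = generation >> 16;
    buf[2] = generation >> 8;
    buf[3] = generation;

    uint32_t add_len = n * 8; /* ADDITIONAL LENGTH: key bytes that follow */
    buf[4] = add_len >> 24;
    buf[5] = add_len >> 16;
    buf[6] = add_len >> 8;
    buf[7] = add_len;

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < 8; j++) {
            buf[8 + i * 8 + j] = keys[i] >> (56 - j * 8);
        }
    }
    return 8 + n * 8;
}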

> 
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index 4bd7af9d0c..bdd66c4026 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -32,6 +32,7 @@
>  #include "migration/vmstate.h"
>  #include "hw/scsi/emulation.h"
>  #include "scsi/constants.h"
> +#include "scsi/utils.h"
>  #include "sysemu/block-backend.h"
>  #include "sysemu/blockdev.h"
>  #include "hw/block/block.h"
> @@ -1474,6 +1475,296 @@ static void scsi_disk_emulate_read_data(SCSIRequest *req)
>  scsi_req_complete(&r->req, GOOD);
>  }
>  
> +typedef struct SCSIPrReadKeys {
> +uint32_t generation;
> +uint32_t num_keys;
> +uint64_t *keys;
> +void *req;
> +} SCSIPrReadKeys;
> +
> +typedef struct SCSIPrReadReservation {
> +uint32_t generation;
> +uint64_t key;
> +BlockPrType type;
> +void *req;
> +} SCSIPrReadReservation;
> +
> +static void scsi_pr_read_keys_complete(void *opaque, int ret)
> +{
> +int num_keys;
> +uint8_t *buf;
> +SCSIPrReadKeys *blk_keys = (SCSIPrReadKeys *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_keys->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +num_keys = MIN(blk_keys->num_keys, ret);
> +blk_keys->generation = cpu_to_be32(blk_keys->generation);
> +memcpy(&buf[0], &blk_keys->generation, 4);
> +for (int i = 0; i < num_keys; i++) {
> +blk_keys->keys[i] = cpu_to_be64(blk_keys->keys[i]);
> +memcpy(&buf[8 + i * 8], &blk_keys->keys[i], 8);
> +}
> +num_keys = cpu_to_be32(num_keys * 8);
> +memcpy(&buf[4], &num_keys, 4);
> +
> +scsi_req_data(&r->req, r->buflen);
> +done:
> +scsi_req_unref(&r->req);
> +g_free(blk_keys->keys);
> +g_free(blk_keys);
> +}
> +
> +static int scsi_disk_emulate_pr_read_keys(SCSIRequest *req)
> +{
> +SCSIPrReadKeys *blk_keys;
> +SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
> +int buflen = MIN(r->req.cmd.xfer, r->buflen);
> +int num_keys = (buflen - sizeof(uint32_t) * 2) / sizeof(uint64_t);
> +
> +blk_keys = g_new0(SCSIPrReadKeys, 1);
> +blk_keys->generation = 0;
> +/* num_keys is the maximum number of keys that can be transmitted */
> +blk_keys->num_keys = num_keys;
> +blk_keys->keys = g_malloc(sizeof(uint64_t) * num_keys);
> +blk_keys->req = r;
> +
> +scsi_req_ref(&r->req);
> +r->req.aiocb = blk_aio_pr_read_keys(s->qdev.conf.blk, &blk_keys->generation,
> +blk_keys->num_keys, blk_keys->keys,
> +scsi_pr_read_keys_complete, blk_keys);
> +return 0;
> +}
> +
> +static void scsi_pr_read_reservation_complete(void *opaque, int ret)
> +{
> +uint8_t *buf;
> +uint32_t num_keys = 0;
> +SCSIPrReadReservation *blk_rsv = (SCSIPrReadReservation *)opaque;
> +SCSIDiskReq *r = (SCSIDiskReq *)blk_rsv->req;
> +SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +assert(blk_get_aio_context(s->qdev.conf.blk) ==
> +qemu_get_current_aio_context());
> +
> +assert(r->req.aiocb != NULL);
> +r->req.aiocb = NULL;
> +
> +if (scsi_disk_req_check_error(r, ret, true)) {
> +goto done;
> +}
> +
> +buf = scsi_req_get_buf(&r->req);
> +blk_rsv->generation = cpu_to_be32(blk_rsv->generation);
> +memcpy(&buf[0], &blk_rsv->generation, 4);
> +if (ret) {
> +num_keys = cpu_to_be32(16);

The SPC-6 calls this field Additional Length, which is clearer than
num_keys (there is only 1 key here!). Could you rename it to
additional_len to avoid confusion?

> +blk_rsv->key = cpu_to_be64(blk_rsv->key);
> +memcpy(&buf[8], &blk_rsv->key, 8);
> +

Re: [PATCH 3/9] scsi/constant: add persistent reservation in/out protocol constants

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:23PM +0800, Changqi Lu wrote:
> Add constants for the persistent reservation in/out protocol
> in the scsi/constant module. The constants include the persistent
> reservation command, type, and scope values defined in sections
> 6.13 and 6.14 of the SCSI Primary Commands-4 (SPC-4) specification.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  include/scsi/constants.h | 29 +
>  1 file changed, 29 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 4/9] scsi/util: add helper functions for persistent reservation types conversion

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:24PM +0800, Changqi Lu wrote:
> This commit introduces two helper functions
> that facilitate the conversion between the
> persistent reservation types used in the SCSI
> protocol and those used in the block layer.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  include/scsi/utils.h |  5 +
>  scsi/utils.c | 40 
>  2 files changed, 45 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 2/9] block/raw: add persistent reservation in/out driver

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:22PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations for raw driver.
> The following methods are implemented: bdrv_co_pr_read_keys,
> bdrv_co_pr_read_reservation, bdrv_co_pr_register, bdrv_co_pr_reserve,
> bdrv_co_pr_release, bdrv_co_pr_clear and bdrv_co_pr_preempt.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  block/raw-format.c | 55 ++
>  1 file changed, 55 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH 1/9] block: add persistent reservation in/out api

2024-05-09 Thread Stefan Hajnoczi
On Wed, May 08, 2024 at 05:36:21PM +0800, Changqi Lu wrote:
> Add persistent reservation in/out operations
> at the block level. The following operations
> are included:
> 
> - read_keys:retrieves the list of registered keys.
> - read_reservation: retrieves the current reservation status.
> - register: registers a new reservation key.
> - reserve:  initiates a reservation for a specific key.
> - release:  releases a reservation for a specific key.
> - clear:clears all existing reservations.
> - preempt:  preempts a reservation held by another key.
> 
> Signed-off-by: Changqi Lu 
> Signed-off-by: zhenwei pi 
> ---
>  block/block-backend.c | 386 ++
>  block/io.c| 161 +
>  include/block/block-common.h  |   9 +
>  include/block/block-io.h  |  19 ++
>  include/block/block_int-common.h  |  31 +++
>  include/sysemu/block-backend-io.h |  22 ++
>  6 files changed, 628 insertions(+)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index db6f9b92a3..ec67937d28 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1770,6 +1770,392 @@ BlockAIOCB *blk_aio_ioctl(BlockBackend *blk, unsigned long int req, void *buf,
>  return blk_aio_prwv(blk, req, 0, buf, blk_aio_ioctl_entry, 0, cb, opaque);
>  }
>  
> +typedef struct BlkPrInCo {
> +BlockBackend *blk;
> +uint32_t *generation;
> +uint32_t num_keys;
> +BlockPrType *type;
> +uint64_t *keys;
> +int ret;
> +} BlkPrInCo;
> +
> +typedef struct BlkPrInCB {
> +BlockAIOCB common;
> +BlkPrInCo prco;
> +bool has_returned;
> +} BlkPrInCB;
> +
> +static const AIOCBInfo blk_pr_in_aiocb_info = {
> +.aiocb_size = sizeof(BlkPrInCB),
> +};
> +
> +static void blk_pr_in_complete(BlkPrInCB *acb)
> +{
> +if (acb->has_returned) {
> +acb->common.cb(acb->common.opaque, acb->prco.ret);
> +blk_dec_in_flight(acb->prco.blk);

Please add a comment identifying which blk_inc_in_flight() call this dec
is paired with. That makes it easier for people trying to understand
in-flight reference counting.

> +qemu_aio_unref(acb);
> +}
> +}
> +
> +static void blk_pr_in_complete_bh(void *opaque)
> +{
> +BlkPrInCB *acb = opaque;
> +assert(acb->has_returned);
> +blk_pr_in_complete(acb);
> +}
> +
> +static BlockAIOCB *blk_aio_pr_in(BlockBackend *blk, uint32_t *generation,
> + uint32_t num_keys, BlockPrType *type,
> + uint64_t *keys, CoroutineEntry co_entry,
> + BlockCompletionFunc *cb, void *opaque)
> +{
> +BlkPrInCB *acb;
> +Coroutine *co;
> +
> +blk_inc_in_flight(blk);

Please add a comment identifying which blk_dec_in_flight() call this inc
is paired with.
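
For example, the requested pairing comments could read along these lines
(wording illustrative only, annotating the code quoted above):

/* In blk_aio_pr_in(): */
/* Decremented by blk_dec_in_flight() in blk_pr_in_complete(). */
blk_inc_in_flight(blk);

/* In blk_pr_in_complete(), once the callback has run: */
/* Paired with the blk_inc_in_flight() in blk_aio_pr_in(). */
blk_dec_in_flight(acb->prco.blk);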

> +acb = blk_aio_get(&blk_pr_in_aiocb_info, blk, cb, opaque);
> +acb->prco = (BlkPrInCo) {
> +.blk= blk,
> +.generation = generation,
> +.num_keys   = num_keys,
> +.type   = type,
> +.ret= NOT_DONE,
> +.keys   = keys,
> +};
> +acb->has_returned = false;
> +
> +co = qemu_coroutine_create(co_entry, acb);
> +aio_co_enter(qemu_get_current_aio_context(), co);
> +
> +acb->has_returned = true;
> +if (acb->prco.ret != NOT_DONE) {
> +replay_bh_schedule_oneshot_event(qemu_get_current_aio_context(),
> + blk_pr_in_complete_bh, acb);
> +}
> +
> +return &acb->common;
> +}
> +
> +
> +static int coroutine_fn
> +blk_aio_pr_do_read_keys(BlockBackend *blk, uint32_t *generation,
> +uint32_t num_keys, uint64_t *keys)
> +{
> +IO_CODE();
> +
> +blk_wait_while_drained(blk);
> +GRAPH_RDLOCK_GUARD();
> +
> +if (!blk_co_is_available(blk)) {
> +return -ENOMEDIUM;
> +}
> +
> +return bdrv_co_pr_read_keys(blk_bs(blk), generation, num_keys, keys);
> +}
> +
> +static void coroutine_fn blk_aio_pr_read_keys_entry(void *opaque)
> +{
> +BlkPrInCB *acb = opaque;
> +BlkPrInCo *prco = &acb->prco;
> +
> +prco->ret = blk_aio_pr_do_read_keys(prco->blk, prco->generation,
> +prco->num_keys, prco->keys);
> +blk_pr_in_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_pr_read_keys(BlockBackend *blk, uint32_t *generation,
> + uint32_t num_keys, uint64_t *keys,
> + BlockCompletionFunc *cb, void *opaque)
> +{
> +IO_CODE();
> +return blk_aio_pr_in(blk, generation, num_keys, NULL, keys,
> + blk_aio_pr_read_keys_entry, cb, opaque);
> +}
> +
> +
> +static int coroutine_fn
> +blk_aio_pr_do_read_reservation(BlockBackend *blk, uint32_t *generation,
> +   uint64_t *key, BlockPrType *type)
> +{
> +IO_CODE();
> +
> +blk_wait_while_drained(blk);
> +

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-09 Thread Peter Xu
On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:
> That's good news, to see a socket abstraction for RDMA!
> When I developed the series above, the biggest pain was that the RDMA
> migration had no QIOChannel abstraction and I needed to use a 'fake channel'
> for it, which was awkward in the code implementation.
> So, as far as I can tell, we can do this by:
> i. first, evaluate whether rsocket is good enough to satisfy our fundamental
> QIOChannel abstraction
> ii. if it works, see whether it gives us the opportunity to hide the details
> of the rdma protocol inside rsocket, by removing most of the code in rdma.c
> and also some hacks in the migration main process.
> iii. implement the advanced features like multi-fd and multi-uri for rdma
> migration.
> 
> Since I am not familiar with rsocket, I need some time to look at it and do
> some quick verification of rdma migration based on rsocket.
> But, yes, I am willing to be involved in this refactoring work and to see if
> we can make this migration feature better :)

Based on what we have now, it looks like we'd better halt the deprecation
process a bit, so I think we shouldn't need to rush it at least in 9.1
then, and we'll need to see how it goes on the refactoring.

It'll be perfect if rsocket works, otherwise supporting multifd with little
overhead / exported APIs would also be a good thing in general with
whatever approach.  And obviously all based on the facts that we can get
resources from companies to support this feature first.

Note that so far nobody has compared rdma vs. nic performance, so if any of
us can provide some test results, please do so.  Many people are saying RDMA
is better, but I haven't yet seen any numbers comparing it with modern TCP
networks.  I don't want old impressions floating around even if things might
have changed.  When we have consolidated results, we
should share them out and also reflect that in QEMU's migration docs when a
rdma document page is ready.

Chuan, please check the whole thread discussion, it may help to understand
what we are looking for on rdma migrations [1].  Meanwhile please feel free
to sync with Jinpu's team and see how to move forward with such a project.

[1] https://lore.kernel.org/qemu-devel/87frwatp7n@suse.de/

Thanks,

-- 
Peter Xu
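
For readers not familiar with it, the rsocket API referenced above (from
librdmacm's <rdma/rsocket.h>) mirrors BSD sockets almost call for call, which
is what makes a QIOChannel implementation on top of it plausible. A minimal,
purely illustrative client sketch; address, port, and payload are placeholders:

#include <rdma/rsocket.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <stdio.h>
#include <string.h>

/* Connect over RDMA using the socket-like rsocket calls and send one
 * buffer.  Error handling is minimal on purpose. */
static int rsocket_ping(const char *payload)
{
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(4444),                  /* placeholder port */
        .sin_addr.s_addr = htonl(INADDR_LOOPBACK) /* placeholder address */
    };

    int fd = rsocket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("rsocket");
        return -1;
    }
    if (rconnect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("rconnect");
        rclose(fd);
        return -1;
    }
    if (rsend(fd, payload, strlen(payload), 0) < 0) {
        perror("rsend");
    }
    rclose(fd);
    return 0;
}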




Re: [PATCH 3/6] virtio: virtqueue_ordered_fill - VIRTIO_F_IN_ORDER support

2024-05-09 Thread Eugenio Perez Martin
On Mon, May 6, 2024 at 5:05 PM Jonah Palmer  wrote:
>
> Add VIRTIO_F_IN_ORDER feature support for virtqueue_fill operations.
>
> The goal of the virtqueue_fill operation when the VIRTIO_F_IN_ORDER
> feature has been negotiated is to search for this now-used element,
> set its length, and mark the element as filled in the VirtQueue's
> used_elems array.
>
> By marking the element as filled, it will indicate that this element is
> ready to be flushed, so long as the element is in-order.
>
> Tested-by: Lei Yang 
> Signed-off-by: Jonah Palmer 
> ---
>  hw/virtio/virtio.c | 26 +-
>  1 file changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index e6eb1bb453..064046b5e2 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -873,6 +873,28 @@ static void virtqueue_packed_fill(VirtQueue *vq, const VirtQueueElement *elem,
>  vq->used_elems[idx].ndescs = elem->ndescs;
>  }
>
> +static void virtqueue_ordered_fill(VirtQueue *vq, const VirtQueueElement *elem,
> +   unsigned int len)
> +{
> +unsigned int i = vq->used_idx;
> +
> +/* Search for element in vq->used_elems */
> +while (i != vq->last_avail_idx) {
> +/* Found element, set length and mark as filled */
> +if (vq->used_elems[i].index == elem->index) {
> +vq->used_elems[i].len = len;
> +vq->used_elems[i].filled = true;
> +break;
> +}
> +
> +i += vq->used_elems[i].ndescs;
> +
> +if (i >= vq->vring.num) {
> +i -= vq->vring.num;
> +}
> +}

This has a subtle problem: ndescs and elems->id are controlled by the
guest, so it could make QEMU loop forever looking for the right
descriptor. For each iteration, the code must ensure that the variable
"i" changes for the next iteration, and that there will be no more than
vq->last_avail_idx - vq->used_idx iterations.

Apart from that, I think it makes more sense to split the logical
sections of the function this way:
/* declarations */
i = vq->used_idx

/* Search for element in vq->used_elems */
while (vq->used_elems[i].index != elem->index &&
       i != vq->last_avail_idx && ...) {
...
}

/* Set length and mark as filled */
vq->used_elems[i].len = len;
vq->used_elems[i].filled = true;
---

But I'm ok either way.
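
To make the bound explicit, a variant of the search that caps the number of
visited slots at vring.num and always advances by at least one slot could look
like this (a sketch against the quoted patch, not tested code):

static void virtqueue_ordered_fill_bounded(VirtQueue *vq,
                                           const VirtQueueElement *elem,
                                           unsigned int len)
{
    unsigned int i = vq->used_idx;
    unsigned int steps = 0;

    /* Guest-controlled ndescs/index values can no longer make this spin
     * forever: at most vring.num slots are visited. */
    while (steps < vq->vring.num) {
        unsigned int ndescs = vq->used_elems[i].ndescs ?: 1;

        if (vq->used_elems[i].index == elem->index) {
            vq->used_elems[i].len = len;
            vq->used_elems[i].filled = true;
            return;
        }

        i = (i + ndescs) % vq->vring.num;
        steps += ndescs;
    }
    /* Not found: nothing to mark; the caller may want to warn here. */
}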

> +}
> +
>  static void virtqueue_packed_fill_desc(VirtQueue *vq,
> const VirtQueueElement *elem,
> unsigned int idx,
> @@ -923,7 +945,9 @@ void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
>  return;
>  }
>
> -if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
> +if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_IN_ORDER)) {
> +virtqueue_ordered_fill(vq, elem, len);
> +} else if (virtio_vdev_has_feature(vq->vdev, VIRTIO_F_RING_PACKED)) {
>  virtqueue_packed_fill(vq, elem, len, idx);
>  } else {
>  virtqueue_split_fill(vq, elem, len, idx);
> --
> 2.39.3
>




Re: [PATCH 2/6] virtio: virtqueue_pop - VIRTIO_F_IN_ORDER support

2024-05-09 Thread Eugenio Perez Martin
On Mon, May 6, 2024 at 5:06 PM Jonah Palmer  wrote:
>
> Add VIRTIO_F_IN_ORDER feature support in virtqueue_split_pop and
> virtqueue_packed_pop.
>
> VirtQueueElements popped from the available/descriptor ring are added to
> the VirtQueue's used_elems array in-order and in the same fashion as
> they would be added to the used and descriptor rings, respectively.
>
> This will allow us to keep track of the current order, what elements
> have been written, as well as an element's essential data after being
> processed.
>
> Tested-by: Lei Yang 
> Signed-off-by: Jonah Palmer 
> ---
>  hw/virtio/virtio.c | 17 -
>  1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 893a072c9d..e6eb1bb453 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -1506,7 +1506,7 @@ static void *virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned in_nu
>
>  static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
>  {
> -unsigned int i, head, max;
> +unsigned int i, j, head, max;
>  VRingMemoryRegionCaches *caches;
>  MemoryRegionCache indirect_desc_cache;
>  MemoryRegionCache *desc_cache;
> @@ -1539,6 +1539,8 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
>  goto done;
>  }
>
> +j = vq->last_avail_idx;
> +
>  if (!virtqueue_get_head(vq, vq->last_avail_idx++, &head)) {
>  goto done;
>  }
> @@ -1630,6 +1632,12 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
>  elem->in_sg[i] = iov[out_num + i];
>  }
>
> +if (virtio_vdev_has_feature(vdev, VIRTIO_F_IN_ORDER)) {
> +vq->used_elems[j].index = elem->index;
> +vq->used_elems[j].len = elem->len;
> +vq->used_elems[j].ndescs = elem->ndescs;
> +}
> +
>  vq->inuse++;
>
>  trace_virtqueue_pop(vq, elem, elem->in_num, elem->out_num);
> @@ -1758,6 +1766,13 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
>
>  elem->index = id;
>  elem->ndescs = (desc_cache == &indirect_desc_cache) ? 1 : elem_entries;
> +
> +if (virtio_vdev_has_feature(vdev, VIRTIO_F_IN_ORDER)) {
> +vq->used_elems[vq->last_avail_idx].index = elem->index;
> +vq->used_elems[vq->last_avail_idx].len = elem->len;
> +vq->used_elems[vq->last_avail_idx].ndescs = elem->ndescs;
> +}
> +

I suggest using a consistent style between packed and split: Either
always use vq->last_avail_idx or j. If you use j, please rename to
something more related to the usage, as j is usually for iterations.

In my opinion I think vq->last_avail_idx is better.


>  vq->last_avail_idx += elem->ndescs;
>  vq->inuse += elem->ndescs;
>
> --
> 2.39.3
>




[PATCH] hw/nvme: Add CLI options for PCI vendor/device IDs and IEEE-OUI ID

2024-05-09 Thread Saif Abrar
Add CLI options for user specified
- PCI vendor, device, subsystem vendor and subsystem IDs
- IEEE-OUI ID

e.g. PCI IDs to be specified as follows:
-device nvme,id_vendor=0xABCD,id_device=0xA0B0,id_subsys_vendor=0xEF00,id_subsys=0xEF01

IEEE-OUI ID (Identify Controller bytes 75:73) is to be
specified in LE format.
(e.g. ieee_oui=0xABCDEF => Byte[73]=0xEF, Byte[74]=0xCD, Byte[75]=0xAB).

Signed-off-by: Saif Abrar 
---
 hw/nvme/nvme.h |  5 +
 hw/nvme/ctrl.c | 44 
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index bed8191bd5..6e19a479d1 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -537,6 +537,11 @@ typedef struct NvmeParams {
 uint8_t  sriov_max_vq_per_vf;
 uint8_t  sriov_max_vi_per_vf;
 bool msix_exclusive_bar;
+uint16_t id_vendor;
+uint16_t id_device;
+uint16_t id_subsys_vendor;
+uint16_t id_subsys;
+uint32_t ieee_oui;
 } NvmeParams;
 
 typedef struct NvmeCtrl {
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 127c3d2383..35aeb48e0b 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -8050,8 +8050,9 @@ out:
 
 static void nvme_init_sriov(NvmeCtrl *n, PCIDevice *pci_dev, uint16_t offset)
 {
-uint16_t vf_dev_id = n->params.use_intel_id ?
- PCI_DEVICE_ID_INTEL_NVME : PCI_DEVICE_ID_REDHAT_NVME;
+uint16_t vf_dev_id = n->params.id_device ? n->params.id_device :
+(n->params.use_intel_id ?
+ PCI_DEVICE_ID_INTEL_NVME : PCI_DEVICE_ID_REDHAT_NVME);
 NvmePriCtrlCap *cap = &n->pri_ctrl_cap;
 uint64_t bar_size = nvme_mbar_size(le16_to_cpu(cap->vqfrsm),
   le16_to_cpu(cap->vifrsm),
@@ -8098,7 +8099,13 @@ static bool nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 pci_conf[PCI_INTERRUPT_PIN] = 1;
 pci_config_set_prog_interface(pci_conf, 0x2);
 
-if (n->params.use_intel_id) {
+if (n->params.id_vendor) {
+pci_config_set_vendor_id(pci_conf, n->params.id_vendor);
+pci_config_set_device_id(pci_conf, n->params.id_device);
+pci_set_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID,
+ n->params.id_subsys_vendor);
+pci_set_word(pci_conf + PCI_SUBSYSTEM_ID, n->params.id_subsys);
+} else if (n->params.use_intel_id) {
 pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
 pci_config_set_device_id(pci_conf, PCI_DEVICE_ID_INTEL_NVME);
 } else {
@@ -8206,7 +8213,11 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
 
 id->rab = 6;
 
-if (n->params.use_intel_id) {
+if (n->params.ieee_oui) {
+id->ieee[0] = extract32(n->params.ieee_oui, 0,  8);
+id->ieee[1] = extract32(n->params.ieee_oui, 8,  8);
+id->ieee[2] = extract32(n->params.ieee_oui, 16, 8);
+} else if (n->params.use_intel_id) {
 id->ieee[0] = 0xb3;
 id->ieee[1] = 0x02;
 id->ieee[2] = 0x00;
@@ -8419,6 +8430,24 @@ static void nvme_exit(PCIDevice *pci_dev)
 memory_region_del_subregion(&n->bar0, &n->iomem);
 }
 
+static void nvme_prop_ieee_set(Object *obj, Visitor *v, const char *name,
+void *opaque, Error **errp)
+{
+Property *prop = opaque;
+uint32_t *val = object_field_prop_ptr(obj, prop);
+if (!visit_type_uint32(v, name, val, errp)) {
+return;
+}
+}
+
+static const PropertyInfo nvme_prop_ieee = {
+.name  = "uint32",
+.description = "IEEE OUI: Identify Controller bytes 75:73\
+ in LE format. (e.g. ieee_oui=0xABCDEF => Byte[73]=0xEF, Byte[74]=0xCD,\
+ Byte[75]=0xAB)",
+.set = nvme_prop_ieee_set,
+};
+
 static Property nvme_props[] = {
 DEFINE_BLOCK_PROPERTIES(NvmeCtrl, namespace.blkconf),
 DEFINE_PROP_LINK("pmrdev", NvmeCtrl, pmr.dev, TYPE_MEMORY_BACKEND,
@@ -8451,6 +8480,13 @@ static Property nvme_props[] = {
   params.sriov_max_vq_per_vf, 0),
 DEFINE_PROP_BOOL("msix-exclusive-bar", NvmeCtrl, params.msix_exclusive_bar,
  false),
+DEFINE_PROP_UINT16("id_vendor", NvmeCtrl, params.id_vendor, 0),
+DEFINE_PROP_UINT16("id_device", NvmeCtrl, params.id_device, 0),
+DEFINE_PROP_UINT16("id_subsys_vendor", NvmeCtrl,
+   params.id_subsys_vendor, 0),
+DEFINE_PROP_UINT16("id_subsys", NvmeCtrl, params.id_subsys, 0),
+DEFINE_PROP("ieee_oui", NvmeCtrl, params.ieee_oui, nvme_prop_ieee,
+uint32_t),
 DEFINE_PROP_END_OF_LIST(),
 };
 
-- 
2.39.3




Re: [PATCH 1/6] virtio: Add bool to VirtQueueElement

2024-05-09 Thread Eugenio Perez Martin
On Mon, May 6, 2024 at 5:06 PM Jonah Palmer  wrote:
>
> Add the boolean 'filled' member to the VirtQueueElement structure. The
> use of this boolean will signify if the element has been written to the
> used / descriptor ring or not. This boolean is used to support the
> VIRTIO_F_IN_ORDER feature.
>
> Tested-by: Lei Yang 
> Signed-off-by: Jonah Palmer 
> ---
>  include/hw/virtio/virtio.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 7d5ffdc145..9ed9c3763c 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -69,6 +69,7 @@ typedef struct VirtQueueElement
>  unsigned int ndescs;
>  unsigned int out_num;
>  unsigned int in_num;
> +bool filled;

in_order_filled? I cannot come up with a good name for this. Maybe we can
add a comment on top of the variable so we know what it is used for?
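
For instance, the declaration could carry the rename and the comment along
these lines (wording illustrative only):

/* Set once the element has been processed and its used-ring entry prepared;
 * the flush to the guest is deferred until all prior elements are also
 * filled.  Only used when VIRTIO_F_IN_ORDER has been negotiated. */
bool in_order_filled;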

>  hwaddr *in_addr;
>  hwaddr *out_addr;
>  struct iovec *in_sg;
> --
> 2.39.3
>




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-09 Thread Zheng Chuan via
Hi, Peter,Lei,Jinpu.

On 2024/5/8 0:28, Peter Xu wrote:
> On Tue, May 07, 2024 at 01:50:43AM +, Gonglei (Arei) wrote:
>> Hello,
>>
>>> -Original Message-
>>> From: Peter Xu [mailto:pet...@redhat.com]
>>> Sent: Monday, May 6, 2024 11:18 PM
>>> To: Gonglei (Arei) 
>>> Cc: Daniel P. Berrangé ; Markus Armbruster
>>> ; Michael Galaxy ; Yu Zhang
>>> ; Zhijian Li (Fujitsu) ; Jinpu 
>>> Wang
>>> ; Elmar Gerdes ;
>>> qemu-de...@nongnu.org; Yuval Shaia ; Kevin Wolf
>>> ; Prasanna Kumar Kalever
>>> ; Cornelia Huck ;
>>> Michael Roth ; Prasanna Kumar Kalever
>>> ; integrat...@gluster.org; Paolo Bonzini
>>> ; qemu-block@nongnu.org; de...@lists.libvirt.org;
>>> Hanna Reitz ; Michael S. Tsirkin ;
>>> Thomas Huth ; Eric Blake ; Song
>>> Gao ; Marc-André Lureau
>>> ; Alex Bennée ;
>>> Wainer dos Santos Moschetta ; Beraldo Leal
>>> ; Pannengyuan ;
>>> Xiexiangyou 
>>> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
>>>
>>> On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote:
 Hi, Peter
>>>
>>> Hey, Lei,
>>>
>>> Happy to see you around again after years.
>>>
>> Haha, me too.
>>
 RDMA features high bandwidth, low latency (in non-blocking lossless
 network), and direct remote memory access by bypassing the CPU (As you
 know, CPU resources are expensive for cloud vendors, which is one of
 the reasons why we introduced offload cards.), which TCP does not have.
>>>
>>> It's another cost to use offload cards, v.s. preparing more cpu resources?
>>>
>> Software and hardware offload converged architecture is the way to go for 
>> all cloud vendors 
>> (Including comprehensive benefits in terms of performance, cost, security, 
>> and innovation speed), 
>> it's not just a matter of adding the resource of a DPU card.
>>
 In some scenarios where fast live migration is needed (extremely short
 interruption duration and migration duration) is very useful. To this
 end, we have also developed RDMA support for multifd.
>>>
>>> Will any of you upstream that work?  I'm curious how intrusive would it be
>>> when adding it to multifd, if it can keep only 5 exported functions like 
>>> what
>>> rdma.h does right now it'll be pretty nice.  We also want to make sure it 
>>> works
>>> with arbitrary sized loads and buffers, e.g. vfio is considering to add IO 
>>> loads to
>>> multifd channels too.
>>>
>>
>> In fact, we sent the patchset to the community in 2021. Pls see:
>> https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/
> 

Yes, I sent the patchset adding multifd support for rdma migration, taking
over from my colleague, and I am sorry for not keeping up this work at the
time for various reasons.
I also strongly agree with Lei that the RDMA protocol has some special
advantages over TCP in some scenarios, and we do indeed use it in our product.

> I wasn't aware of that for sure in the past..
> 
> Multifd has changed quite a bit in the last 9.0 release, that may not apply
> anymore.  One thing to mention is please look at Dan's comment on possible
> use of rsocket.h:
> 
> https://lore.kernel.org/all/zjjm6rcqs5eho...@redhat.com/
> 
> And Jinpu did help provide an initial test result over the library:
> 
> https://lore.kernel.org/qemu-devel/camgffek8wiknqmouyxcathgtiem2dwocf_w7t0vmcd-i30t...@mail.gmail.com/
> 
> It looks like we have a chance to apply that in QEMU.
> 
>>
>>
>>> One thing to note that the question here is not about a pure performance
>>> comparison between rdma and nics only.  It's about help us make a decision
>>> on whether to drop rdma, iow, even if rdma performs well, the community 
>>> still
>>> has the right to drop it if nobody can actively work and maintain it.
>>> It's just that if nics can perform as good it's more a reason to drop, 
>>> unless
>>> companies can help to provide good support and work together.
>>>
>>
>> We are happy to provide the necessary review and maintenance work for RDMA
>> if the community needs it.
>>
>> CC'ing Chuan Zheng.
> 
> I'm not sure whether you and Jinpu's team would like to work together and
> provide a final solution for rdma over multifd.  It could be much simpler
> than the original 2021 proposal if the rsocket API will work out.
> 
> Thanks,
> 
That's good news, to see a socket abstraction for RDMA!
When I developed the series above, the biggest pain was that the RDMA
migration had no QIOChannel abstraction and I needed to use a 'fake channel'
for it, which was awkward in the code implementation.
So, as far as I can tell, we can do this by:
i. first, evaluate whether rsocket is good enough to satisfy our fundamental
QIOChannel abstraction
ii. if it works, see whether it gives us the opportunity to hide the details
of the rdma protocol inside rsocket, by removing most of the code in rdma.c
and also some hacks in the migration main process.
iii. implement the advanced features like multi-fd and multi-uri for rdma
migration.

Since I am not familiar