Re: [PATCH RFC v3 0/6] virtio-net: add support for SR-IOV emulation

2024-04-09 Thread Yui Washizu



On 2024/03/05 17:58, Akihiko Odaki wrote:

Based-on: <20240228-reuse-v8-0-282660281...@daynix.com>
("[PATCH v8 00/15] hw/pci: SR-IOV related fixes and improvements")

Introduction
------------

This series is based on the RFC series submitted by Yui Washizu[1].
See also [2] for the context.

This series enables SR-IOV emulation for virtio-net. It is useful
to test SR-IOV support on the guest, or to expose several vDPA devices
in a VM. vDPA devices can also provide an L2 switching feature for
offloading, though allowing the guest to configure such a feature is
out of scope.

The PF side code resides in virtio-pci. The VF side code resides in
the PCI common infrastructure, but it is restricted to work only for
virtio-net-pci because of lack of validation.

User Interface
--------------

A user can configure a SR-IOV capable virtio-net device by adding
virtio-net-pci functions to a bus. Below is a command line example:
   -netdev user,id=n -netdev user,id=o
   -netdev user,id=p -netdev user,id=q
   -device pcie-root-port,id=b
   -device virtio-net-pci,bus=b,addr=0x0.0x3,netdev=q,sriov-pf=f
   -device virtio-net-pci,bus=b,addr=0x0.0x2,netdev=p,sriov-pf=f
   -device virtio-net-pci,bus=b,addr=0x0.0x1,netdev=o,sriov-pf=f
   -device virtio-net-pci,bus=b,addr=0x0.0x0,netdev=n,id=f

The VFs specify the paired PF with the "sriov-pf" property. The PF must be
added after all of its VFs. It is the user's responsibility to ensure that
the VFs have function numbers larger than that of the PF and that the
function numbers have a consistent stride.



QEMU crashed in the same way as with RFC v2:

https://lore.kernel.org/all/20231210-sriov-v2-6-b959e8a6d...@daynix.com/T/#mb16bdef667a0b0b7c08f487fcd18e8d627694bf6




Keeping VF instances
--------------------

A problem with SR-IOV emulation is that it needs to hotplug the VFs as
the guest requests. Previously, this behavior was implemented by
realizing and unrealizing VFs at runtime. However, this strategy does
not work well for the proposed virtio-net emulation; in this proposal,
device options passed in the command line must be maintained as VFs
are hotplugged, but they are consumed when the machine starts and not
available after that, which makes realizing VFs at runtime impossible.

As an alternative strategy to runtime realization/unrealization, this
series proposes to reuse the code that powers down PCI Express devices.
When a PCI Express device is powered down, it will be hidden from the
guest but will be kept realized. This effectively implements the
behavior we need for the SR-IOV emulation.

Summary
-------

Patch 1 disables ROM BAR, which virtio-net-pci enables by default, for
VFs.
Patches 2 and 3 add validations.
Patch 4 adds user-created SR-IOV VF infrastructure.
Patch 5 makes virtio-pci work as SR-IOV PF for user-created VFs.
Patch 6 allows users to create SR-IOV VFs with virtio-net-pci.

[1] 
https://patchew.org/QEMU/1689731808-3009-1-git-send-email-yui.wash...@gmail.com/
[2] https://lore.kernel.org/all/5d46f455-f530-4e5e-9ae7-13a2297d4...@daynix.com/

Co-developed-by: Yui Washizu 
Signed-off-by: Akihiko Odaki 
---
Changes in v3:
- Rebased.
- Link to v2: 
https://lore.kernel.org/r/20231210-sriov-v2-0-b959e8a6d...@daynix.com

Changes in v2:
- Changed to keep VF instances.
- Link to v1: 
https://lore.kernel.org/r/20231202-sriov-v1-0-32b3570f7...@daynix.com

---
Akihiko Odaki (6):
   hw/pci: Do not add ROM BAR for SR-IOV VF
   pcie_sriov: Ensure PF and VF are mutually exclusive
   pcie_sriov: Check PCI Express for SR-IOV PF
   pcie_sriov: Allow user to create SR-IOV device
   virtio-pci: Implement SR-IOV PF
   virtio-net: Implement SR-IOV VF

  include/hw/pci/pci_device.h |   6 +-
  include/hw/pci/pcie_sriov.h |  19 +++
  hw/pci/pci.c|  70 +++
  hw/pci/pcie_sriov.c | 299 +++-
  hw/virtio/virtio-net-pci.c  |   1 +
  hw/virtio/virtio-pci.c  |   7 ++
  6 files changed, 319 insertions(+), 83 deletions(-)
---
base-commit: 2c4eb0476e461b8a4b2f745d25f987e831c7f640
change-id: 20231202-sriov-9402fb262be8

Best regards,




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-09 Thread Zhijian Li (Fujitsu)


on 4/10/2024 3:46 AM, Peter Xu wrote:

>> Is there document/link about the unittest/CI for migration tests, Why
>> are those tests missing?
>> Is it hard or very special to set up an environment for that? maybe we
>> can help in this regards.
> See tests/qtest/migration-test.c.  We put most of our migration tests
> there and that's covered in CI.
>
> I think one major issue is CI systems don't normally have rdma devices.
> Can the rdma migration test be carried out without real hardware?

Yeah, RXE (aka Soft-RoCE) is able to emulate RDMA, for example
$ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
then we can get a new RDMA interface "rxe_eth0".
This new RDMA interface is able to do the QEMU RDMA migration.

Also, the loopback (lo) device is able to emulate the RDMA interface 
"rxe_lo"; however, when
I tried (years ago) to do RDMA migration over this 
interface (rdma:127.0.0.1:), it got something wrong,
so I gave up enabling the RDMA migration qtest at that time.



Thanks
Zhijian



 

>
>>> It seems there can still be people joining this discussion.  I'll hold off
>>> a bit on merging this patch to provide enough window for anyone to chim in.
>> Thx for discussion and understanding.
> Thanks for all these inputs so far.  These can help us make a wiser and
> clearer step no matter which way we choose.


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-09 Thread Peter Xu
On Tue, Apr 09, 2024 at 09:32:46AM +0200, Jinpu Wang wrote:
> Hi Peter,
> 
> On Mon, Apr 8, 2024 at 6:18 PM Peter Xu  wrote:
> >
> > On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > > Hi Peter,
> >
> > Jinpu,
> >
> > Thanks for joining the discussion.
> >
> > >
> > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
> > > >
> > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > > Hello Peter und Zhjian,
> > > > >
> > > > > Thank you so much for letting me know about this. I'm also a bit 
> > > > > surprised at
> > > > > the plan for deprecating the RDMA migration subsystem.
> > > >
> > > > It's not too late, since it looks like we do have users not yet notified
> > > > from this, we'll redo the deprecation procedure even if it'll be the 
> > > > final
> > > > plan, and it'll be 2 releases after this.
> > > >
> > > > >
> > > > > > IMHO it's more important to know whether there are still users and 
> > > > > > whether
> > > > > > they would still like to see it around.
> > > > >
> > > > > > I admit RDMA migration was lacking testing (unit/CI tests), which led
> > > > > > to a few
> > > > > > obvious bugs being noticed too late.
> > > > >
> > > > > Yes, we are a user of this subsystem. I was unaware of the lack of 
> > > > > test coverage
> > > > > for this part. As soon as 8.2 was released, I saw that many of the
> > > > > migration test
> > > > > cases failed and came to realize that there might be a bug between 8.1
> > > > > and 8.2, but
> > > > > was unable to confirm and report it quickly to you.
> > > > >
> > > > > The maintenance of this part could be too costly or difficult from
> > > > > your point of view.
> > > >
> > > > It may or may not be too costly, it's just that we need real users of 
> > > > RDMA
> > > > taking some care of it.  Having it broken easily for >1 releases 
> > > > definitely
> > > > is a sign of lack of users.  It is an implication to the community that 
> > > > we
> > > > should consider dropping some features so that we can get the best use 
> > > > of
> > > > the community resources for the things that may have a broader audience.
> > > >
> > > > One thing majorly missing is a RDMA tester to guard all the merges to 
> > > > not
> > > > break RDMA paths, hopefully in CI.  That should not rely on RDMA 
> > > > hardwares
> > > > but just to sanity check the migration+rdma code running all fine.  RDMA
> > > > taught us the lesson so we're requesting CI coverage for all other new
> > > > features that will be merged at least for migration subsystem, so that 
> > > > we
> > > > plan to not merge anything that is not covered by CI unless extremely
> > > > necessary in the future.
> > > >
> > > > For sure CI is not the only missing part, but I'd say we should start 
> > > > with
> > > > it, then someone should also take care of the code even if only in
> > > > maintenance mode (no new feature to add on top).
> > > >
> > > > >
> > > > > My concern is, this plan will force a few QEMU users (not sure how
> > > > > many) like us
> > > > > either to stick to the RDMA migration by using an increasingly older
> > > > > version of QEMU,
> > > > > or to abandon the currently used RDMA migration.
> > > >
> > > > RDMA doesn't get new features anyway, if there's specific use case for 
> > > > RDMA
> > > > migrations, would it work if such a scenario uses the old binary?  Is it
> > > > possible to switch to the TCP protocol with some good NICs?
> > > We have used rdma migration with HCA from Nvidia for years, our
> > > experience is RDMA migration works better than tcp (over ipoib).
> >
> > Please bear with me, as I know little about rdma stuff.
> >
> > I'm actually pretty confused (and have been since a long time ago..) on why we
> > need to operate with rdma contexts when ipoib seems to provide all the tcp
> > layers.  I mean, can it work with the current "tcp:" protocol with ipoib
> > even if there's rdma/ib hardware underneath?  Is it because of performance
> > improvements that we must use a separate path compared to the generic
> > "tcp:" protocol here?
> Using the rdma protocol with ib verbs, we can leverage the full benefit of RDMA
> by talking directly to the NIC, which bypasses the kernel overhead, giving lower
> cpu utilization and better performance.
> 
> IPoIB, by contrast, is more for compatibility with applications using tcp, and
> can't get the full benefit of RDMA.  When you have mixed generations of IB
> devices, there are performance issues on IPoIB; we've seen a 40G HCA that can
> only reach 2 Gb/s on IPoIB, but with raw RDMA it can reach full line
> speed.
> 
> I just run a simple iperf3 test via ipoib and ib_send_bw on same hosts:
> 
> iperf 3.9
> Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
> 07:19:34 UTC 2024 x86_64
> ---
> Server listening on 5201
> ---
> Time: Tue, 09 Apr 2024 06:55:02 GMT
> Accepted connection from 2a02:247f:401:4:2:0:b:3, port 41130
> 

Re: [PATCH] block/virtio-blk: Fix memory leak from virtio_blk_zone_report

2024-04-09 Thread Michael Tokarev

04.04.2024 15:00, Zheyu Ma wrote:

This modification ensures that in scenarios where the buffer size is
insufficient for a zone report, the function will now properly set an
error status and proceed to a cleanup label, instead of merely
returning.

The following ASAN log reveals it:

==1767400==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 312 byte(s) in 1 object(s) allocated from:
 #0 0x64ac7b3280cd in malloc 
llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:129:3
 #1 0x735b02fb9738 in g_malloc 
(/lib/x86_64-linux-gnu/libglib-2.0.so.0+0x5e738)
 #2 0x64ac7d23be96 in virtqueue_split_pop hw/virtio/virtio.c:1612:12
 #3 0x64ac7d23728a in virtqueue_pop hw/virtio/virtio.c:1783:16
 #4 0x64ac7cfcaacd in virtio_blk_get_request hw/block/virtio-blk.c:228:27
 #5 0x64ac7cfca7c7 in virtio_blk_handle_vq hw/block/virtio-blk.c:1123:23
 #6 0x64ac7cfecb95 in virtio_blk_handle_output hw/block/virtio-blk.c:1157:5

Signed-off-by: Zheyu Ma 
---
  hw/block/virtio-blk.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 92de315f17..bb86e65f65 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -768,7 +768,8 @@ static void virtio_blk_handle_zone_report(VirtIOBlockReq 
*req,
  sizeof(struct virtio_blk_zone_report) +
  sizeof(struct virtio_blk_zone_descriptor)) {
  virtio_error(vdev, "in buffer too small for zone report");
-return;
+err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+goto out;
  }
  
  /* start byte offset of the zone report */


Is this -stable material, or is it not worth picking up for older release(s)?

Thanks,

/mjt



[PATCH v3 10/11] block: Convert qmp_query_block and qmp_query_named_block_nodes to coroutine

2024-04-09 Thread Fabiano Rosas
From: Lin Ma 

Convert the remaining functions to make the QMP commands query-block
and query-named-block-nodes run in their entirety in a coroutine. With
this, any yield from those commands will return all the way back to
the main loop. This releases the BQL, unblocks the main loop, and avoids
having the QMP command block another, more important task from running.

Both commands need to be converted at once because hmp_info_block
calls both and it needs to be moved to a coroutine as well.

Now the wrapper for bdrv_co_get_allocated_file_size() can be made not
mixed and the wrapper for bdrv_co_block_device_info() can be removed.

Signed-off-by: Lin Ma 
Signed-off-by: Fabiano Rosas 
---
 block.c|  8 
 block/monitor/block-hmp-cmds.c |  2 +-
 block/qapi.c   | 12 ++--
 blockdev.c |  8 
 hmp-commands-info.hx   |  1 +
 include/block/block-global-state.h |  3 ++-
 include/block/block-hmp-cmds.h |  2 +-
 include/block/block-io.h   |  2 +-
 include/block/qapi.h   |  3 ---
 qapi/block-core.json   |  5 +++--
 10 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/block.c b/block.c
index 01478c5471..cba28f07fc 100644
--- a/block.c
+++ b/block.c
@@ -6207,18 +6207,18 @@ BlockDriverState *bdrv_find_node(const char *node_name)
 }
 
 /* Put this QMP function here so it can access the static graph_bdrv_states. */
-BlockDeviceInfoList *bdrv_named_nodes_list(bool flat,
-   Error **errp)
+BlockDeviceInfoList *coroutine_fn bdrv_co_named_nodes_list(bool flat,
+   Error **errp)
 {
 BlockDeviceInfoList *list;
 BlockDriverState *bs;
 
 GLOBAL_STATE_CODE();
-GRAPH_RDLOCK_GUARD_MAINLOOP();
+GRAPH_RDLOCK_GUARD();
 
 list = NULL;
QTAILQ_FOREACH(bs, &graph_bdrv_states, node_list) {
-BlockDeviceInfo *info = bdrv_block_device_info(NULL, bs, flat, errp);
+BlockDeviceInfo *info = bdrv_co_block_device_info(NULL, bs, flat, 
errp);
 if (!info) {
 qapi_free_BlockDeviceInfoList(list);
 return NULL;
diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index 9db587c661..8ceff59688 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -738,7 +738,7 @@ static void print_block_info(Monitor *mon, BlockInfo *info,
 }
 }
 
-void hmp_info_block(Monitor *mon, const QDict *qdict)
+void coroutine_fn hmp_info_block(Monitor *mon, const QDict *qdict)
 {
 BlockInfoList *block_list, *info;
 BlockDeviceInfoList *blockdev_list, *blockdev;
diff --git a/block/qapi.c b/block/qapi.c
index 9a59e5606f..c4514295ec 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -418,8 +418,8 @@ fail:
 }
 
 /* @p_info will be set only on success. */
-static void GRAPH_RDLOCK
-bdrv_query_info(BlockBackend *blk, BlockInfo **p_info, Error **errp)
+static void GRAPH_RDLOCK coroutine_fn
+bdrv_co_query_info(BlockBackend *blk, BlockInfo **p_info, Error **errp)
 {
 BlockInfo *info = g_malloc0(sizeof(*info));
 BlockDriverState *bs = blk_bs(blk);
@@ -451,7 +451,7 @@ bdrv_query_info(BlockBackend *blk, BlockInfo **p_info, 
Error **errp)
 }
 
 if (bs && bs->drv) {
-info->inserted = bdrv_block_device_info(blk, bs, false, errp);
+info->inserted = bdrv_co_block_device_info(blk, bs, false, errp);
 if (info->inserted == NULL) {
 goto err;
 }
@@ -661,13 +661,13 @@ bdrv_query_bds_stats(BlockDriverState *bs, bool blk_level)
 return s;
 }
 
-BlockInfoList *qmp_query_block(Error **errp)
+BlockInfoList *coroutine_fn qmp_query_block(Error **errp)
 {
BlockInfoList *head = NULL, **p_next = &head;
 BlockBackend *blk;
 Error *local_err = NULL;
 
-GRAPH_RDLOCK_GUARD_MAINLOOP();
+GRAPH_RDLOCK_GUARD();
 
 for (blk = blk_all_next(NULL); blk; blk = blk_all_next(blk)) {
 BlockInfoList *info;
@@ -677,7 +677,7 @@ BlockInfoList *qmp_query_block(Error **errp)
 }
 
 info = g_malloc0(sizeof(*info));
-bdrv_query_info(blk, &info->value, &local_err);
+bdrv_co_query_info(blk, &info->value, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 g_free(info);
diff --git a/blockdev.c b/blockdev.c
index 057601dcf0..fe3226c8c4 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2744,13 +2744,13 @@ void qmp_drive_backup(DriveBackup *backup, Error **errp)
blockdev_do_action(&action, errp);
 }
 
-BlockDeviceInfoList *qmp_query_named_block_nodes(bool has_flat,
- bool flat,
- Error **errp)
+BlockDeviceInfoList *coroutine_fn qmp_query_named_block_nodes(bool has_flat,
+  bool flat,
+  Error **errp)
 {
bool return_flat = has_flat && flat;

[PATCH v3 07/11] block: Convert bdrv_query_image_info to coroutine

2024-04-09 Thread Fabiano Rosas
This function is a caller of bdrv_do_query_node_info(), which has
been converted to a coroutine. Convert this function as well so we're
closer to having the whole qmp_query_block as a single coroutine.

Also remove the wrapper for bdrv_co_do_query_node_info() now that all
its callers are converted.

Signed-off-by: Fabiano Rosas 
---
 block/qapi.c | 16 +++-
 include/block/qapi.h |  8 
 2 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index 7b1cf48246..5e263960a9 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -304,7 +304,7 @@ bdrv_co_do_query_node_info(BlockDriverState *bs, 
BlockNodeInfo *info,
 }
 
 /**
- * bdrv_query_image_info:
+ * bdrv_co_query_image_info:
  * @bs: block node to examine
  * @p_info: location to store image information
  * @flat: skip backing node information
@@ -325,17 +325,15 @@ bdrv_co_do_query_node_info(BlockDriverState *bs, 
BlockNodeInfo *info,
  *
  * @p_info will be set only on success. On error, store error in @errp.
  */
-void bdrv_query_image_info(BlockDriverState *bs,
-   ImageInfo **p_info,
-   bool flat,
-   bool skip_implicit_filters,
-   Error **errp)
+void coroutine_fn
+bdrv_co_query_image_info(BlockDriverState *bs, ImageInfo **p_info, bool flat,
+ bool skip_implicit_filters, Error **errp)
 {
 ERRP_GUARD();
 ImageInfo *info;
 
 info = g_new0(ImageInfo, 1);
-bdrv_do_query_node_info(bs, qapi_ImageInfo_base(info), errp);
+bdrv_co_do_query_node_info(bs, qapi_ImageInfo_base(info), errp);
 if (*errp) {
 goto fail;
 }
@@ -353,8 +351,8 @@ void bdrv_query_image_info(BlockDriverState *bs,
 }
 
 if (backing) {
-bdrv_query_image_info(backing, &info->backing_image, false,
-  skip_implicit_filters, errp);
+bdrv_co_query_image_info(backing, &info->backing_image, false,
+ skip_implicit_filters, errp);
 if (*errp) {
 goto fail;
 }
diff --git a/include/block/qapi.h b/include/block/qapi.h
index 76be9cc3e5..5f7e3a690e 100644
--- a/include/block/qapi.h
+++ b/include/block/qapi.h
@@ -38,7 +38,10 @@ int GRAPH_RDLOCK
 bdrv_query_snapshot_info_list(BlockDriverState *bs,
   SnapshotInfoList **p_list,
   Error **errp);
-void GRAPH_RDLOCK
+void coroutine_fn GRAPH_RDLOCK
+bdrv_co_query_image_info(BlockDriverState *bs, ImageInfo **p_info, bool flat,
+ bool skip_implicit_filters, Error **errp);
+void co_wrapper_bdrv_rdlock
 bdrv_query_image_info(BlockDriverState *bs, ImageInfo **p_info, bool flat,
   bool skip_implicit_filters, Error **errp);
 
@@ -58,7 +61,4 @@ void bdrv_node_info_dump(BlockNodeInfo *info, int 
indentation, bool protocol);
 void coroutine_fn GRAPH_RDLOCK
 bdrv_co_do_query_node_info(BlockDriverState *bs, BlockNodeInfo *info,
Error **errp);
-void co_wrapper_bdrv_rdlock
-bdrv_do_query_node_info(BlockDriverState *bs, BlockNodeInfo *info,
-Error **errp);
 #endif
-- 
2.35.3




Re: [PATCH v5 2/2] nbd/server: Mark negotiation functions as coroutine_fn

2024-04-09 Thread Eric Blake
On Tue, Apr 09, 2024 at 09:30:39AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 08.04.24 19:00, Eric Blake wrote:
> > nbd_negotiate() is already marked coroutine_fn.  And given the fix in
> > the previous patch to have nbd_negotiate_handle_starttls not create
> > and wait on a g_main_loop (as that would violate coroutine
> > constraints), it is worth marking the rest of the related static
> > functions reachable only during option negotiation as also being
> > coroutine_fn.
> > 
> > Suggested-by: Vladimir Sementsov-Ogievskiy 
> > Signed-off-by: Eric Blake 
> > ---
> >   nbd/server.c | 102 +--
> >   1 file changed, 59 insertions(+), 43 deletions(-)
> > 
> > diff --git a/nbd/server.c b/nbd/server.c
> > index 98ae0e16326..1857fba51c1 100644
> 
> [..]
> 
> >   {
> >   int rc;
> >   g_autofree char *name = NULL;
> > @@ -755,7 +764,8 @@ struct NBDTLSServerHandshakeData {
> >   Coroutine *co;
> >   };
> > 
> > -static void nbd_server_tls_handshake(QIOTask *task, void *opaque)
> > +static coroutine_fn void
> 
This is not; that's a callback for the tls handshake, which is not coroutine 
context, as I understand it.

The call chain is:

nbd_negotiate() - coroutine_fn before this patch
  nbd_negotiate_options() - marked coroutine_fn here
nbd_negotiate_handle_starttls() - marked coroutine_fn here
  qio_channel_tls_handshake() - in io/channel-tls.c; not marked 
coroutine_fn, but
works both in and out of coroutine context
...
nbd_server_tls_handshake() - renamed in 1/2 of this series

> without this hunk:

I can drop that particular marking.  There are cases where the
callback is reached without having done the qemu_coroutine_yield() in
nbd_negotiate_handle_starttls; but there are also cases where the IO
took enough time that the coroutine yielded and it is that callback
that reawakens it.

> 
> 
> Reviewed-by: Vladimir Sementsov-Ogievskiy 

Thanks.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




Re: [PATCH-for-9.0 v2] hw/sd/sdhci: Do not update TRNMOD when Command Inhibit (DAT) is set

2024-04-09 Thread Peter Maydell
On Tue, 9 Apr 2024 at 15:55, Philippe Mathieu-Daudé  wrote:
>
> Per "SD Host Controller Standard Specification Version 3.00":
>
>   * 2.2.5 Transfer Mode Register (Offset 00Ch)
>
> Writes to this register shall be ignored when the Command
> Inhibit (DAT) in the Present State register is 1.
>
> Do not update the TRNMOD register when Command Inhibit (DAT)
> bit is set to avoid the present-status register going out of
> sync, leading to malicious guest using DMA mode and overflowing
> the FIFO buffer:
>
>   $ cat << EOF | qemu-system-i386 \
>  -display none -nodefaults \
>  -machine accel=qtest -m 512M \
>  -device sdhci-pci,sd-spec-version=3 \
>  -device sd-card,drive=mydrive \
>  -drive 
> if=none,index=0,file=null-co://,format=raw,id=mydrive -nographic \
>  -qtest stdio
>   outl 0xcf8 0x80001013
>   outl 0xcfc 0x91
>   outl 0xcf8 0x80001001
>   outl 0xcfc 0x0600
>   write 0x912c 0x1 0x05
>   write 0x9158 0x1 0x16
>   write 0x9105 0x1 0x04
>   write 0x9128 0x1 0x08
>   write 0x16 0x1 0x21
>   write 0x19 0x1 0x20
>   write 0x910c 0x1 0x01
>   write 0x910e 0x1 0x20
>   write 0x910f 0x1 0x00
>   write 0x910c 0x1 0x00
>   write 0x9120 0x1 0x00
>   EOF
>
> Stack trace (part):
> =
> ==89993==ERROR: AddressSanitizer: heap-buffer-overflow on address
> 0x61529900 at pc 0x55d5f885700d bp 0x7ffc1e1e9470 sp 0x7ffc1e1e9468
> WRITE of size 1 at 0x61529900 thread T0
> #0 0x55d5f885700c in sdhci_write_dataport hw/sd/sdhci.c:564:39
> #1 0x55d5f8849150 in sdhci_write hw/sd/sdhci.c:1223:13
> #2 0x55d5fa01db63 in memory_region_write_accessor system/memory.c:497:5
> #3 0x55d5fa01d245 in access_with_adjusted_size system/memory.c:573:18
> #4 0x55d5fa01b1a9 in memory_region_dispatch_write system/memory.c:1521:16
> #5 0x55d5fa09f5c9 in flatview_write_continue system/physmem.c:2711:23
> #6 0x55d5fa08f78b in flatview_write system/physmem.c:2753:12
> #7 0x55d5fa08f258 in address_space_write system/physmem.c:2860:18
> ...
> 0x61529900 is located 0 bytes to the right of 512-byte region
> [0x61529700,0x61529900) allocated by thread T0 here:
> #0 0x55d5f7237b27 in __interceptor_calloc
> #1 0x7f9e36dd4c50 in g_malloc0
> #2 0x55d5f88672f7 in sdhci_pci_realize hw/sd/sdhci-pci.c:36:5
> #3 0x55d5f844b582 in pci_qdev_realize hw/pci/pci.c:2092:9
> #4 0x55d5fa2ee74b in device_set_realized hw/core/qdev.c:510:13
> #5 0x55d5fa325bfb in property_set_bool qom/object.c:2358:5
> #6 0x55d5fa31ea45 in object_property_set qom/object.c:1472:5
> #7 0x55d5fa332509 in object_property_set_qobject om/qom-qobject.c:28:10
> #8 0x55d5fa31f6ed in object_property_set_bool qom/object.c:1541:15
> #9 0x55d5fa2e2948 in qdev_realize hw/core/qdev.c:292:12
> #10 0x55d5f8eed3f1 in qdev_device_add_from_qdict 
> system/qdev-monitor.c:719:10
> #11 0x55d5f8eef7ff in qdev_device_add system/qdev-monitor.c:738:11
> #12 0x55d5f8f211f0 in device_init_func system/vl.c:1200:11
> #13 0x55d5fad0877d in qemu_opts_foreach util/qemu-option.c:1135:14
> #14 0x55d5f8f0df9c in qemu_create_cli_devices system/vl.c:2638:5
> #15 0x55d5f8f0db24 in qmp_x_exit_preconfig system/vl.c:2706:5
> #16 0x55d5f8f14dc0 in qemu_init system/vl.c:3737:9
> ...
> SUMMARY: AddressSanitizer: heap-buffer-overflow hw/sd/sdhci.c:564:39
> in sdhci_write_dataport
>
> Add assertions to ensure the fifo_buffer[] is not overflowed by
> malicious accesses to the Buffer Data Port register.
>
> Fixes: CVE-2024-3447
> Cc: qemu-sta...@nongnu.org
> Fixes: d7dfca0807 ("hw/sdhci: introduce standard SD host controller")
> Buglink: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=58813
> Reported-by: Alexander Bulekov 
> Reported-by: Chuhong Yuan 
> Suggested-by: Peter Maydell 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
> Peter, since it is your patch, can I replace the Suggested-by your
> S-o-b tag?

Sure, you can have my Signed-off-by or Reviewed-by tag, whichever
you prefer.

thanks
-- PMM



[PATCH v3 05/11] block: Run bdrv_do_query_node_info in a coroutine

2024-04-09 Thread Fabiano Rosas
Move this function into a coroutine so we can convert the whole
qmp_query_block command into a coroutine in the next patches.

Placing the entire command in a coroutine allows us to yield all the
way back to the main loop, releasing the BQL and unblocking the main
loop.

When the whole conversion is completed, we'll be able to avoid a
priority inversion that happens when a QMP command calls a slow
(buggy) system call and blocks the vcpu thread from doing mmio due to
contention on the BQL.

About coroutine safety:

Most callees have coroutine versions themselves and thus are safe to
call in a coroutine. The remaining ones:

- bdrv_refresh_filename, bdrv_get_full_backing_filename: String
  manipulation, nothing that would be unsafe for use in coroutines;

- bdrv_get_format_name: Just accesses a field;

- bdrv_get_specific_info, bdrv_query_snapshot_info_list: No locks or
  anything that would poll or block.

(using a mixed wrapper for now, but after all callers are converted,
this can become a coroutine exclusively)

Signed-off-by: Fabiano Rosas 
---
- used the coroutine version of the called functions when available
---
 block/qapi.c | 11 ++-
 include/block/qapi.h |  8 
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index 2b5793f1d9..26a9510345 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -225,8 +225,9 @@ int bdrv_query_snapshot_info_list(BlockDriverState *bs,
  * Helper function for other query info functions.  Store information about @bs
  * in @info, setting @errp on error.
  */
-static void GRAPH_RDLOCK
-bdrv_do_query_node_info(BlockDriverState *bs, BlockNodeInfo *info, Error 
**errp)
+void coroutine_fn
+bdrv_co_do_query_node_info(BlockDriverState *bs, BlockNodeInfo *info,
+   Error **errp)
 {
 int64_t size;
 const char *backing_filename;
@@ -234,7 +235,7 @@ bdrv_do_query_node_info(BlockDriverState *bs, BlockNodeInfo 
*info, Error **errp)
 int ret;
 Error *err = NULL;
 
-size = bdrv_getlength(bs);
+size = bdrv_co_getlength(bs);
 if (size < 0) {
 error_setg_errno(errp, -size, "Can't get image size '%s'",
  bs->exact_filename);
@@ -246,13 +247,13 @@ bdrv_do_query_node_info(BlockDriverState *bs, 
BlockNodeInfo *info, Error **errp)
 info->filename= g_strdup(bs->filename);
 info->format  = g_strdup(bdrv_get_format_name(bs));
 info->virtual_size= size;
-info->actual_size = bdrv_get_allocated_file_size(bs);
+info->actual_size = bdrv_co_get_allocated_file_size(bs);
 info->has_actual_size = info->actual_size >= 0;
 if (bs->encrypted) {
 info->encrypted = true;
 info->has_encrypted = true;
 }
-if (bdrv_get_info(bs, &bdi) >= 0) {
+if (bdrv_co_get_info(bs, &bdi) >= 0) {
 if (bdi.cluster_size != 0) {
 info->cluster_size = bdi.cluster_size;
 info->has_cluster_size = true;
diff --git a/include/block/qapi.h b/include/block/qapi.h
index 54c48de26a..234730b3de 100644
--- a/include/block/qapi.h
+++ b/include/block/qapi.h
@@ -25,6 +25,7 @@
 #ifndef BLOCK_QAPI_H
 #define BLOCK_QAPI_H
 
+#include "block/block-common.h"
 #include "block/graph-lock.h"
 #include "block/snapshot.h"
 #include "qapi/qapi-types-block-core.h"
@@ -49,4 +50,11 @@ void bdrv_image_info_specific_dump(ImageInfoSpecific 
*info_spec,
const char *prefix,
int indentation);
 void bdrv_node_info_dump(BlockNodeInfo *info, int indentation, bool protocol);
+
+void coroutine_fn GRAPH_RDLOCK
+bdrv_co_do_query_node_info(BlockDriverState *bs, BlockNodeInfo *info,
+   Error **errp);
+void co_wrapper_bdrv_rdlock
+bdrv_do_query_node_info(BlockDriverState *bs, BlockNodeInfo *info,
+Error **errp);
 #endif
-- 
2.35.3




[PATCH v3 04/11] block: Reschedule query-block during qcow2 invalidation

2024-04-09 Thread Fabiano Rosas
There is a small window at the end of block device migration when
devices are being re-activated. This includes a resetting of some
fields of BDRVQcow2State at qcow2_co_invalidate_cache(). A concurrent
QMP query-block command can call qcow2_get_specific_info() during this
window and see the cleared values, which leads to an assert:

  qcow2_get_specific_info: Assertion `false' failed

This is the same issue as Gitlab #1933, which has already been
resolved[1], but there the fix applied only to non-coroutine
commands. Once we move query-block to a coroutine the problem will
manifest again.

Add an operation blocker to the invalidation function to block the
query info path during this window.

Instead of failing query-block, which would be disruptive to users,
use the blocker to know when to reschedule the coroutine back into the
iohandler so it doesn't run while the BDRVQcow2State is inconsistent.

To avoid failing query-block when all block operations are blocked,
unblock the INFO operation at various places. This preserves the prior
situations where query-block used to work.

1 - https://gitlab.com/qemu-project/qemu/-/issues/1933

Signed-off-by: Fabiano Rosas 
---
 block.c  |  1 +
 block/mirror.c   |  1 +
 block/qcow2.c| 20 
 block/replication.c  |  1 +
 blockjob.c   |  1 +
 include/block/block-common.h |  1 +
 migration/block.c|  1 +
 7 files changed, 26 insertions(+)

diff --git a/block.c b/block.c
index 468cf5e67d..01478c5471 100644
--- a/block.c
+++ b/block.c
@@ -1296,6 +1296,7 @@ static void GRAPH_WRLOCK bdrv_backing_attach(BdrvChild *c)
 parent->backing_blocker);
 bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET,
 parent->backing_blocker);
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_INFO, parent->backing_blocker);
 }
 
 static void bdrv_backing_detach(BdrvChild *c)
diff --git a/block/mirror.c b/block/mirror.c
index 1bdce3b657..9f952f3fdd 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1191,6 +1191,7 @@ static void mirror_complete(Job *job, Error **errp)
error_setg(&s->replace_blocker,
"block device is in use by block-job-complete");
 bdrv_op_block_all(s->to_replace, s->replace_blocker);
+bdrv_op_unblock(s->to_replace, BLOCK_OP_TYPE_INFO, s->replace_blocker);
 bdrv_ref(s->to_replace);
 }
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 956128b409..b0ec8009e3 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2834,6 +2834,7 @@ qcow2_co_invalidate_cache(BlockDriverState *bs, Error 
**errp)
 BdrvChild *data_file;
 int flags = s->flags;
 QCryptoBlock *crypto = NULL;
+Error *blocker = NULL;
 QDict *options;
 int ret;
 
@@ -2845,6 +2846,17 @@ qcow2_co_invalidate_cache(BlockDriverState *bs, Error 
**errp)
 crypto = s->crypto;
 s->crypto = NULL;
 
+/*
+ * When qcow2_do_open() below reads the qcow header, it yields to
+ * wait for the I/O which allows a concurrent QMP query-block
+ * command to be dispatched on the same context before
+ * BDRVQcow2State has been completely repopulated. Block the
+ * query-info operation during this window to avoid having
+ * qcow2_get_specific_info() access bogus values.
+ */
+error_setg(&blocker, "invalidating cached metadata");
+bdrv_op_block(bs, BLOCK_OP_TYPE_INFO, blocker);
+
 /*
  * Do not reopen s->data_file (i.e., have qcow2_do_close() not close it,
  * and then prevent qcow2_do_open() from opening it), because this function
@@ -2864,6 +2876,8 @@ qcow2_co_invalidate_cache(BlockDriverState *bs, Error **errp)
 qemu_co_mutex_lock(&s->lock);
 ret = qcow2_do_open(bs, options, flags, false, errp);
 qemu_co_mutex_unlock(&s->lock);
+bdrv_op_unblock(bs, BLOCK_OP_TYPE_INFO, blocker);
+g_free(blocker);
 qobject_unref(options);
 if (ret < 0) {
 error_prepend(errp, "Could not reopen qcow2 layer: ");
@@ -5240,6 +5254,12 @@ qcow2_get_specific_info(BlockDriverState *bs, Error **errp)
 ImageInfoSpecific *spec_info;
 QCryptoBlockInfo *encrypt_info = NULL;
 
+if (qemu_in_coroutine() &&
+bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_INFO, errp)) {
+errp = NULL;
+aio_co_reschedule_self(iohandler_get_aio_context());
+}
+
 if (s->crypto != NULL) {
 encrypt_info = qcrypto_block_get_info(s->crypto, errp);
 if (!encrypt_info) {
diff --git a/block/replication.c b/block/replication.c
index ca6bd0a720..459d644673 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -577,6 +577,7 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
 }
 bdrv_op_block_all(top_bs, s->blocker);
 bdrv_op_unblock(top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
+bdrv_op_unblock(top_bs, BLOCK_OP_TYPE_INFO, s->blocker);
 
 bdrv_graph_wrunlock();
 
diff --git a/blockjob.c 

Re: [PATCH-for-9.0 v2] hw/sd/sdhci: Do not update TRNMOD when Command Inhibit (DAT) is set

2024-04-09 Thread Philippe Mathieu-Daudé

On 9/4/24 17:01, Peter Maydell wrote:

On Tue, 9 Apr 2024 at 15:55, Philippe Mathieu-Daudé  wrote:


Peter, since it is your patch, can I replace the Suggested-by with your
S-o-b tag?


Sure, you can have my Signed-off-by or Reviewed-by tag, whichever
you prefer.


Thanks!

Patch queued.



[PATCH v3 06/11] block: Convert bdrv_query_block_graph_info to coroutine

2024-04-09 Thread Fabiano Rosas
We're converting callers of bdrv_co_get_allocated_file_size() to run
in coroutines because that function will be made asynchronous when
called (indirectly) from the QMP dispatcher.

This function is a candidate because it calls bdrv_do_query_node_info(),
which in turn calls bdrv_co_get_allocated_file_size().

All the functions called from bdrv_do_query_node_info() onwards are
coroutine-safe, either have a coroutine version themselves[1] or are
mostly simple code/string manipulation[2].

1) bdrv_co_getlength(), bdrv_co_get_allocated_file_size(),
   bdrv_co_get_info();

2) bdrv_refresh_filename(), bdrv_get_format_name(),
   bdrv_get_full_backing_filename(), bdrv_query_snapshot_info_list(),
   bdrv_get_specific_info();

Signed-off-by: Fabiano Rosas 
Reviewed-by: Hanna Czenczek 
---
 block/qapi.c | 14 --
 include/block/qapi.h |  6 +-
 qemu-img.c   |  3 ---
 3 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index 26a9510345..7b1cf48246 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -370,7 +370,7 @@ fail:
 }
 
 /**
- * bdrv_query_block_graph_info:
+ * bdrv_co_query_block_graph_info:
  * @bs: root node to start from
  * @p_info: location to store image information
  * @errp: location to store error information
@@ -379,17 +379,19 @@ fail:
  *
  * @p_info will be set only on success. On error, store error in @errp.
  */
-void bdrv_query_block_graph_info(BlockDriverState *bs,
- BlockGraphInfo **p_info,
- Error **errp)
+void coroutine_fn
+bdrv_co_query_block_graph_info(BlockDriverState *bs, BlockGraphInfo **p_info,
+   Error **errp)
 {
 ERRP_GUARD();
 BlockGraphInfo *info;
 BlockChildInfoList **children_list_tail;
 BdrvChild *c;
 
+assert_bdrv_graph_readable();
+
 info = g_new0(BlockGraphInfo, 1);
-bdrv_do_query_node_info(bs, qapi_BlockGraphInfo_base(info), errp);
+bdrv_co_do_query_node_info(bs, qapi_BlockGraphInfo_base(info), errp);
 if (*errp) {
 goto fail;
 }
@@ -403,7 +405,7 @@ void bdrv_query_block_graph_info(BlockDriverState *bs,
 QAPI_LIST_APPEND(children_list_tail, c_info);
 
 c_info->name = g_strdup(c->name);
-bdrv_query_block_graph_info(c->bs, &c_info->info, errp);
+bdrv_co_query_block_graph_info(c->bs, &c_info->info, errp);
 if (*errp) {
 goto fail;
 }
diff --git a/include/block/qapi.h b/include/block/qapi.h
index 234730b3de..76be9cc3e5 100644
--- a/include/block/qapi.h
+++ b/include/block/qapi.h
@@ -41,7 +41,11 @@ bdrv_query_snapshot_info_list(BlockDriverState *bs,
 void GRAPH_RDLOCK
 bdrv_query_image_info(BlockDriverState *bs, ImageInfo **p_info, bool flat,
   bool skip_implicit_filters, Error **errp);
-void GRAPH_RDLOCK
+
+void coroutine_fn GRAPH_RDLOCK
+bdrv_co_query_block_graph_info(BlockDriverState *bs, BlockGraphInfo **p_info,
+   Error **errp);
+void co_wrapper_bdrv_rdlock
 bdrv_query_block_graph_info(BlockDriverState *bs, BlockGraphInfo **p_info,
 Error **errp);
 
diff --git a/qemu-img.c b/qemu-img.c
index 7668f86769..19db8f18fc 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2958,10 +2958,7 @@ static BlockGraphInfoList *collect_image_info_list(bool image_opts,
  * duplicate the backing chain information that we obtain by walking
  * the chain manually here.
  */
-bdrv_graph_rdlock_main_loop();
 bdrv_query_block_graph_info(bs, &info, &err);
-bdrv_graph_rdunlock_main_loop();
-
 if (err) {
 error_report_err(err);
 blk_unref(blk);
-- 
2.35.3




[PATCH v3 11/11] block: Add a thread-pool version of fstat

2024-04-09 Thread Fabiano Rosas
From: João Silva 

The fstat call can take a long time to finish when running over
NFS. Add a version of it that runs in the thread pool.

Adapt one of its users, raw_co_get_allocated_file_size, to use the new
version. That function is called via QMP under the qemu_global_mutex,
so it has a large chance of blocking vCPU threads if it takes too
long to finish.

Reviewed-by: Claudio Fontana 
Reviewed-by: Hanna Czenczek 
Signed-off-by: João Silva 
Signed-off-by: Fabiano Rosas 
---
 block/file-posix.c  | 40 +---
 include/block/raw-aio.h |  4 +++-
 2 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 35684f7e21..6fbf961244 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -226,6 +226,9 @@ typedef struct RawPosixAIOData {
 struct {
 unsigned long op;
 } zone_mgmt;
+struct {
+struct stat *st;
+} fstat;
 };
 } RawPosixAIOData;
 
@@ -2616,6 +2619,34 @@ static void raw_close(BlockDriverState *bs)
 }
 }
 
+static int handle_aiocb_fstat(void *opaque)
+{
+RawPosixAIOData *aiocb = opaque;
+
+if (fstat(aiocb->aio_fildes, aiocb->fstat.st) < 0) {
+return -errno;
+}
+
+return 0;
+}
+
+static int coroutine_fn raw_co_fstat(BlockDriverState *bs, struct stat *st)
+{
+BDRVRawState *s = bs->opaque;
+RawPosixAIOData acb;
+
+acb = (RawPosixAIOData) {
+.bs = bs,
+.aio_fildes = s->fd,
+.aio_type   = QEMU_AIO_FSTAT,
+.fstat  = {
+.st = st,
+},
+};
+
+return raw_thread_pool_submit(handle_aiocb_fstat, &acb);
+}
+
 /**
  * Truncates the given regular file @fd to @offset and, when growing, fills the
  * new space according to @prealloc.
@@ -2860,11 +2891,14 @@ static int64_t coroutine_fn raw_co_getlength(BlockDriverState *bs)
 static int64_t coroutine_fn raw_co_get_allocated_file_size(BlockDriverState *bs)
 {
 struct stat st;
-BDRVRawState *s = bs->opaque;
+int ret;
 
-if (fstat(s->fd, &st) < 0) {
-return -errno;
+ret = raw_co_fstat(bs, &st);
+
+if (ret) {
+return ret;
 }
+
 return (int64_t)st.st_blocks * 512;
 }
 
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 20e000b8ef..0c6af8dc32 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -31,6 +31,7 @@
 #define QEMU_AIO_ZONE_REPORT  0x0100
 #define QEMU_AIO_ZONE_MGMT0x0200
 #define QEMU_AIO_ZONE_APPEND  0x0400
+#define QEMU_AIO_FSTAT0x0800
 #define QEMU_AIO_TYPE_MASK \
 (QEMU_AIO_READ | \
  QEMU_AIO_WRITE | \
@@ -42,7 +43,8 @@
  QEMU_AIO_TRUNCATE | \
  QEMU_AIO_ZONE_REPORT | \
  QEMU_AIO_ZONE_MGMT | \
- QEMU_AIO_ZONE_APPEND)
+ QEMU_AIO_ZONE_APPEND | \
+ QEMU_AIO_FSTAT)
 
 /* AIO flags */
 #define QEMU_AIO_MISALIGNED   0x1000
-- 
2.35.3




[PATCH v3 09/11] block: Don't query all block devices at hmp_nbd_server_start

2024-04-09 Thread Fabiano Rosas
We're currently doing a full query-block just to enumerate the devices
for qmp_nbd_server_add and then discarding the BlockInfoList
afterwards. Alter hmp_nbd_server_start to instead iterate explicitly
over the block_backends list.

This allows the removal of the dependency on qmp_query_block from
hmp_nbd_server_start. This is desirable because we're about to move
qmp_query_block into a coroutine and don't need to change the NBD code
at the same time.

Add the GRAPH_RDLOCK_GUARD_MAINLOOP macro because
bdrv_skip_implicit_filters() needs the graph lock.

Signed-off-by: Fabiano Rosas 
---
- add a comment explaining some checks are done to preserve previous
  behavior;

- we need the strdup when assigning .device to preserve const. Just
  add a matching g_free();

- about the possible leak at qmp_nbd_server_add() unrelated to this
  patch, commit 8675cbd68b ("nbd: Utilize QAPI_CLONE for type
  conversion") mentions that the QAPI visitor will already free
  arg->name.
---
 block/monitor/block-hmp-cmds.c | 32 
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index d954bec6f1..9db587c661 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -387,10 +387,12 @@ void hmp_nbd_server_start(Monitor *mon, const QDict *qdict)
 bool writable = qdict_get_try_bool(qdict, "writable", false);
 bool all = qdict_get_try_bool(qdict, "all", false);
 Error *local_err = NULL;
-BlockInfoList *block_list, *info;
+BlockBackend *blk;
 SocketAddress *addr;
 NbdServerAddOptions export;
 
+GRAPH_RDLOCK_GUARD_MAINLOOP();
+
 if (writable && !all) {
 error_setg(&local_err, "-w only valid together with -a");
 goto exit;
@@ -415,29 +417,43 @@ void hmp_nbd_server_start(Monitor *mon, const QDict *qdict)
 /* Then try adding all block devices.  If one fails, close all and
  * exit.
  */
-block_list = qmp_query_block(NULL);
+for (blk = blk_all_next(NULL); blk; blk = blk_all_next(blk)) {
+BlockDriverState *bs = blk_bs(blk);
 
-for (info = block_list; info; info = info->next) {
-if (!info->value->inserted) {
+if (!*blk_name(blk)) {
+continue;
+}
+
+/*
+ * Note: historically we used to call qmp_query_block() to get
+ * the list of block devices. The two 'continue' cases below
+ * are the same as used by that function and are here to
+ * preserve behavior.
+ */
+
+if (!blk_get_attached_dev(blk)) {
+continue;
+}
+
+bs = bdrv_skip_implicit_filters(bs);
+if (!bs || !bs->drv) {
 continue;
 }
 
 export = (NbdServerAddOptions) {
-.device = info->value->device,
+.device = g_strdup(blk_name(blk)),
 .has_writable   = true,
 .writable   = writable,
 };
 
 qmp_nbd_server_add(&export, &local_err);
-
+g_free(export.device);
 if (local_err != NULL) {
 qmp_nbd_server_stop(NULL);
 break;
 }
 }
 
-qapi_free_BlockInfoList(block_list);
-
 exit:
 hmp_handle_error(mon, local_err);
 }
-- 
2.35.3




[PATCH v3 08/11] block: Convert bdrv_block_device_info into co_wrapper

2024-04-09 Thread Fabiano Rosas
We're converting callers of bdrv_co_get_allocated_file_size() to run
in coroutines because that function will be made asynchronous when
called (indirectly) from the QMP dispatcher.

This function is a candidate because it calls bdrv_query_image_info()
-> bdrv_co_do_query_node_info() -> bdrv_co_get_allocated_file_size().

It is safe to turn this into a coroutine because the code it calls is
made up of either simple accessors and string manipulation functions
[1] or code that has already been determined to be safe [2].

1) bdrv_refresh_filename(), bdrv_is_read_only(),
   blk_enable_write_cache(), bdrv_cow_bs(), blk_get_public(),
   throttle_group_get_name(), bdrv_write_threshold_get(),
   bdrv_query_dirty_bitmaps(), throttle_group_get_config(),
   bdrv_filter_or_cow_bs(), bdrv_skip_implicit_filters()

2) bdrv_co_do_query_node_info() (see previous commits);

This was the only caller of bdrv_query_image_info(), so we can remove
the wrapper for that function now.

Signed-off-by: Fabiano Rosas 
---
- used co_wrapper_bdrv_rdlock instead of co_wrapper
---
 block/qapi.c | 10 +-
 include/block/qapi.h | 13 ++---
 2 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index 5e263960a9..9a59e5606f 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -41,10 +41,10 @@
 #include "qemu/qemu-print.h"
 #include "sysemu/block-backend.h"
 
-BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
-BlockDriverState *bs,
-bool flat,
-Error **errp)
+BlockDeviceInfo *coroutine_fn bdrv_co_block_device_info(BlockBackend *blk,
+BlockDriverState *bs,
+bool flat,
+Error **errp)
 {
 ERRP_GUARD();
 ImageInfo **p_image_info;
@@ -152,7 +152,7 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
  * Skip automatically inserted nodes that the user isn't aware of for
  * query-block (blk != NULL), but not for query-named-block-nodes
  */
-bdrv_query_image_info(bs, p_image_info, flat, blk != NULL, errp);
+bdrv_co_query_image_info(bs, p_image_info, flat, blk != NULL, errp);
 if (*errp) {
 qapi_free_BlockDeviceInfo(info);
 return NULL;
diff --git a/include/block/qapi.h b/include/block/qapi.h
index 5f7e3a690e..9f0e957963 100644
--- a/include/block/qapi.h
+++ b/include/block/qapi.h
@@ -30,10 +30,12 @@
 #include "block/snapshot.h"
 #include "qapi/qapi-types-block-core.h"
 
-BlockDeviceInfo * GRAPH_RDLOCK
-bdrv_block_device_info(BlockBackend *blk, BlockDriverState *bs,
-   bool flat, Error **errp);
-
+BlockDeviceInfo *coroutine_fn GRAPH_RDLOCK
+bdrv_co_block_device_info(BlockBackend *blk, BlockDriverState *bs, bool flat,
+  Error **errp);
+BlockDeviceInfo *co_wrapper_bdrv_rdlock
+bdrv_block_device_info(BlockBackend *blk, BlockDriverState *bs, bool flat,
+   Error **errp);
 int GRAPH_RDLOCK
 bdrv_query_snapshot_info_list(BlockDriverState *bs,
   SnapshotInfoList **p_list,
@@ -41,9 +43,6 @@ bdrv_query_snapshot_info_list(BlockDriverState *bs,
 void coroutine_fn GRAPH_RDLOCK
 bdrv_co_query_image_info(BlockDriverState *bs, ImageInfo **p_info, bool flat,
  bool skip_implicit_filters, Error **errp);
-void co_wrapper_bdrv_rdlock
-bdrv_query_image_info(BlockDriverState *bs, ImageInfo **p_info, bool flat,
-  bool skip_implicit_filters, Error **errp);
 
 void coroutine_fn GRAPH_RDLOCK
 bdrv_co_query_block_graph_info(BlockDriverState *bs, BlockGraphInfo **p_info,
-- 
2.35.3




[PATCH v3 03/11] block: Take the graph lock in bdrv_snapshot_list

2024-04-09 Thread Fabiano Rosas
This function has up until now always run in the main loop, outside of
a coroutine. We're about to make it run inside a coroutine, so start
actually taking the graph lock.

Signed-off-by: Fabiano Rosas 
---
 block/snapshot.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/snapshot.c b/block/snapshot.c
index 8242b4abac..3db9536ca2 100644
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -389,7 +389,7 @@ int bdrv_snapshot_list(BlockDriverState *bs,
QEMUSnapshotInfo **psn_info)
 {
 GLOBAL_STATE_CODE();
-GRAPH_RDLOCK_GUARD_MAINLOOP();
+GRAPH_RDLOCK_GUARD();
 
 BlockDriver *drv = bs->drv;
 BlockDriverState *fallback_bs = bdrv_snapshot_fallback(bs);
-- 
2.35.3




[PATCH v3 02/11] block: Temporarily mark bdrv_co_get_allocated_file_size as mixed

2024-04-09 Thread Fabiano Rosas
Some callers of this function are about to be converted to run in
coroutines, so allow it to be executed both inside and outside a
coroutine while we convert all the callers.

This will be reverted once all callers of bdrv_do_query_node_info run
in a coroutine.

Signed-off-by: Fabiano Rosas 
Reviewed-by: Eric Blake 
Reviewed-by: Hanna Czenczek 
---
 include/block/block-io.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/block/block-io.h b/include/block/block-io.h
index b49e0537dd..349d7760a1 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -86,7 +86,7 @@ int64_t co_wrapper_mixed_bdrv_rdlock bdrv_getlength(BlockDriverState *bs);
 int64_t coroutine_fn GRAPH_RDLOCK
 bdrv_co_get_allocated_file_size(BlockDriverState *bs);
 
-int64_t co_wrapper_bdrv_rdlock
+int64_t co_wrapper_mixed_bdrv_rdlock
 bdrv_get_allocated_file_size(BlockDriverState *bs);
 
 BlockMeasureInfo *bdrv_measure(BlockDriver *drv, QemuOpts *opts,
-- 
2.35.3




[PATCH v3 00/11] block: Convert qmp_query_block into a coroutine

2024-04-09 Thread Fabiano Rosas
Hi, it's been a while since the last version, so a recap:

This series converts qmp_query_block() & qmp_query_named_block_nodes()
to coroutines so we can yield from them all the way back into the main
loop. This addresses a vcpu softlockup encountered when querying a
disk placed on NFS.

If the NFS server happens to have high latency, an fstat() issued from
raw_co_get_allocated_file_size() can take seconds while the whole
QMP command holds the BQL, blocking vCPU threads that exit the
guest to handle I/O.

This scenario is clearly undesirable, since a query command is of much
lower priority than the vcpu thread doing actual work.

Move the 'fstat' call into the thread-pool and make the necessary
adaptations to ensure the whole QMP command that calls
raw_co_get_allocated_file_size() runs in a coroutine.

Changes since v2:

- Do the changes more gradually to make it easier to reason about the
  safety of the change.

- Patch 4 addresses the issue I asked about recently on the ml [1]
  about how to avoid dispatching the QMP command during an aio_poll().

- Converted qmp_query_block and qmp_query_named_block_nodes in a
  single patch to avoid having hmp_info_block call a coroutine_fn out
  of coroutine context.

On v2, Hanna asked:

  "I wonder how the threading is actually supposed to work.  I assume
  QMP coroutines run in the main thread, so now we run
  bdrv_co_get_allocated_file_size() in the main thread – is that
  correct, or do we need to use bdrv_co_enter() like qmp_block_resize()
  does?  And so, if we run it in the main thread, is it OK not to
  acquire the AioContext around it to prevent interference from a
  potential I/O thread?"

The QMP coroutines and also bdrv_co_get_allocated_file_size() run in
the main thread. This series doesn't change that. The difference is
that instead of bdrv_co_get_allocated_file_size() yielding back to
bdrv_poll(), it now yields back to the main loop.

As for thread safety, that's basically what I asked about in [1], so
I'm still gathering information and don't have a definite answer for
it. Since we don't have the AioContext lock anymore, it seems that
safety is now dependent on not dispatching the QMP command while other
operations are ongoing.

Still, for this particular case of fstat(), I don't think interference
from an I/O thread could cause any problems, as long as the file
descriptor is not closed prematurely. The fstat() manual already
mentions that it is susceptible to returning old information in some
cases.

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1244905208

1- Advice on block QMP command coroutines
https://lore.kernel.org/r/87bk6trl9i@suse.de

Initial discussion:
https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg03141.html
v1:
https://lore.kernel.org/r/20230523213903.18418-1-faro...@suse.de
v2:
https://lore.kernel.org/r/20230609201910.12100-1-faro...@suse.de

Fabiano Rosas (9):
  block: Allow the wrapper script to see functions declared in qapi.h
  block: Temporarily mark bdrv_co_get_allocated_file_size as mixed
  block: Take the graph lock in bdrv_snapshot_list
  block: Reschedule query-block during qcow2 invalidation
  block: Run bdrv_do_query_node_info in a coroutine
  block: Convert bdrv_query_block_graph_info to coroutine
  block: Convert bdrv_query_image_info to coroutine
  block: Convert bdrv_block_device_info into co_wrapper
  block: Don't query all block devices at hmp_nbd_server_start

João Silva (1):
  block: Add a thread-pool version of fstat

Lin Ma (1):
  block: Convert qmp_query_block and qmp_query_named_block_nodes to
coroutine

 block.c|  9 +++--
 block/file-posix.c | 40 +--
 block/meson.build  |  1 +
 block/mirror.c |  1 +
 block/monitor/block-hmp-cmds.c | 34 +++-
 block/qapi.c   | 63 +++---
 block/qcow2.c  | 20 ++
 block/replication.c|  1 +
 block/snapshot.c   |  2 +-
 blockdev.c |  8 ++--
 blockjob.c |  1 +
 hmp-commands-info.hx   |  1 +
 include/block/block-common.h   |  1 +
 include/block/block-global-state.h |  3 +-
 include/block/block-hmp-cmds.h |  2 +-
 include/block/qapi.h   | 24 
 include/block/raw-aio.h|  4 +-
 migration/block.c  |  1 +
 qapi/block-core.json   |  5 ++-
 qemu-img.c |  3 --
 scripts/block-coroutine-wrapper.py |  1 +
 21 files changed, 157 insertions(+), 68 deletions(-)

-- 
2.35.3




[PATCH v3 01/11] block: Allow the wrapper script to see functions declared in qapi.h

2024-04-09 Thread Fabiano Rosas
The following patches will add co_wrapper annotations to functions
declared in qapi.h. Add that header to the set of files used by
block-coroutine-wrapper.py.

Reviewed-by: Hanna Czenczek 
Signed-off-by: Fabiano Rosas 
---
 block/meson.build  | 1 +
 scripts/block-coroutine-wrapper.py | 1 +
 2 files changed, 2 insertions(+)

diff --git a/block/meson.build b/block/meson.build
index e1f03fd773..8fe684d301 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -154,6 +154,7 @@ block_gen_c = custom_target('block-gen.c',
   '../include/block/dirty-bitmap.h',
   '../include/block/block_int-io.h',
   '../include/block/block-global-state.h',
+  '../include/block/qapi.h',
   
'../include/sysemu/block-backend-global-state.h',
   '../include/sysemu/block-backend-io.h',
   'coroutines.h'
diff --git a/scripts/block-coroutine-wrapper.py b/scripts/block-coroutine-wrapper.py
index dbbde99e39..067b203801 100644
--- a/scripts/block-coroutine-wrapper.py
+++ b/scripts/block-coroutine-wrapper.py
@@ -44,6 +44,7 @@ def gen_header():
 #include "block/block-gen.h"
 #include "block/block_int.h"
 #include "block/dirty-bitmap.h"
+#include "block/qapi.h"
 """
 
 
-- 
2.35.3




Re: [PATCH-for-9.0] hw/sd/sdhci: Discard excess of data written to Buffer Data Port register

2024-04-09 Thread Philippe Mathieu-Daudé

On 9/4/24 13:35, Peter Maydell wrote:

On Mon, 8 Apr 2024 at 17:42, Peter Maydell  wrote:

So another approach here would be...


That said, this is all quite complicated looking, so
for 9.0 and backports at least this patch is fine.


Your patch looks like the correct fix, and doesn't seem that
complicated (nor hard to backport), so I'll send a v2 in case.


Reviewed-by: Peter Maydell 

thanks
-- PMM





[PATCH-for-9.0 v2] hw/sd/sdhci: Do not update TRNMOD when Command Inhibit (DAT) is set

2024-04-09 Thread Philippe Mathieu-Daudé
Per "SD Host Controller Standard Specification Version 3.00":

  * 2.2.5 Transfer Mode Register (Offset 00Ch)

Writes to this register shall be ignored when the Command
Inhibit (DAT) in the Present State register is 1.

Do not update the TRNMOD register when the Command Inhibit (DAT)
bit is set, to avoid the present-state register going out of
sync, which would let a malicious guest use DMA mode and overflow
the FIFO buffer:

  $ cat << EOF | qemu-system-i386 \
 -display none -nodefaults \
 -machine accel=qtest -m 512M \
 -device sdhci-pci,sd-spec-version=3 \
 -device sd-card,drive=mydrive \
 -drive if=none,index=0,file=null-co://,format=raw,id=mydrive -nographic \
 -qtest stdio
  outl 0xcf8 0x80001013
  outl 0xcfc 0x91
  outl 0xcf8 0x80001001
  outl 0xcfc 0x0600
  write 0x912c 0x1 0x05
  write 0x9158 0x1 0x16
  write 0x9105 0x1 0x04
  write 0x9128 0x1 0x08
  write 0x16 0x1 0x21
  write 0x19 0x1 0x20
  write 0x910c 0x1 0x01
  write 0x910e 0x1 0x20
  write 0x910f 0x1 0x00
  write 0x910c 0x1 0x00
  write 0x9120 0x1 0x00
  EOF

Stack trace (part):
=
==89993==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x61529900 at pc 0x55d5f885700d bp 0x7ffc1e1e9470 sp 0x7ffc1e1e9468
WRITE of size 1 at 0x61529900 thread T0
#0 0x55d5f885700c in sdhci_write_dataport hw/sd/sdhci.c:564:39
#1 0x55d5f8849150 in sdhci_write hw/sd/sdhci.c:1223:13
#2 0x55d5fa01db63 in memory_region_write_accessor system/memory.c:497:5
#3 0x55d5fa01d245 in access_with_adjusted_size system/memory.c:573:18
#4 0x55d5fa01b1a9 in memory_region_dispatch_write system/memory.c:1521:16
#5 0x55d5fa09f5c9 in flatview_write_continue system/physmem.c:2711:23
#6 0x55d5fa08f78b in flatview_write system/physmem.c:2753:12
#7 0x55d5fa08f258 in address_space_write system/physmem.c:2860:18
...
0x61529900 is located 0 bytes to the right of 512-byte region
[0x61529700,0x61529900) allocated by thread T0 here:
#0 0x55d5f7237b27 in __interceptor_calloc
#1 0x7f9e36dd4c50 in g_malloc0
#2 0x55d5f88672f7 in sdhci_pci_realize hw/sd/sdhci-pci.c:36:5
#3 0x55d5f844b582 in pci_qdev_realize hw/pci/pci.c:2092:9
#4 0x55d5fa2ee74b in device_set_realized hw/core/qdev.c:510:13
#5 0x55d5fa325bfb in property_set_bool qom/object.c:2358:5
#6 0x55d5fa31ea45 in object_property_set qom/object.c:1472:5
#7 0x55d5fa332509 in object_property_set_qobject qom/qom-qobject.c:28:10
#8 0x55d5fa31f6ed in object_property_set_bool qom/object.c:1541:15
#9 0x55d5fa2e2948 in qdev_realize hw/core/qdev.c:292:12
#10 0x55d5f8eed3f1 in qdev_device_add_from_qdict system/qdev-monitor.c:719:10
#11 0x55d5f8eef7ff in qdev_device_add system/qdev-monitor.c:738:11
#12 0x55d5f8f211f0 in device_init_func system/vl.c:1200:11
#13 0x55d5fad0877d in qemu_opts_foreach util/qemu-option.c:1135:14
#14 0x55d5f8f0df9c in qemu_create_cli_devices system/vl.c:2638:5
#15 0x55d5f8f0db24 in qmp_x_exit_preconfig system/vl.c:2706:5
#16 0x55d5f8f14dc0 in qemu_init system/vl.c:3737:9
...
SUMMARY: AddressSanitizer: heap-buffer-overflow hw/sd/sdhci.c:564:39
in sdhci_write_dataport

Add assertions to ensure the fifo_buffer[] is not overflowed by
malicious accesses to the Buffer Data Port register.

Fixes: CVE-2024-3447
Cc: qemu-sta...@nongnu.org
Fixes: d7dfca0807 ("hw/sdhci: introduce standard SD host controller")
Buglink: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=58813
Reported-by: Alexander Bulekov 
Reported-by: Chuhong Yuan 
Suggested-by: Peter Maydell 
Signed-off-by: Philippe Mathieu-Daudé 
---
Peter, since it is your patch, can I replace the Suggested-by with your
S-o-b tag?

Supersedes: <20240404085549.16987-1-phi...@linaro.org>
---
 hw/sd/sdhci.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/hw/sd/sdhci.c b/hw/sd/sdhci.c
index c5e0bc018b..27673e1c70 100644
--- a/hw/sd/sdhci.c
+++ b/hw/sd/sdhci.c
@@ -473,6 +473,7 @@ static uint32_t sdhci_read_dataport(SDHCIState *s, unsigned size)
 }
 
 for (i = 0; i < size; i++) {
+assert(s->data_count < s->buf_maxsz);
 value |= s->fifo_buffer[s->data_count] << i * 8;
 s->data_count++;
 /* check if we've read all valid data (blksize bytes) from buffer */
@@ -561,6 +562,7 @@ static void sdhci_write_dataport(SDHCIState *s, uint32_t value, unsigned size)
 }
 
 for (i = 0; i < size; i++) {
+assert(s->data_count < s->buf_maxsz);
 s->fifo_buffer[s->data_count] = value & 0xFF;
 s->data_count++;
 value >>= 8;
@@ -1208,6 +1210,12 @@ sdhci_write(void *opaque, hwaddr offset, uint64_t val, unsigned size)
 if (!(s->capareg & R_SDHC_CAPAB_SDMA_MASK)) {
 value &= ~SDHC_TRNS_DMA;
 }
+
+/* TRNMOD writes are inhibited while 

Re: [PATCH-for-9.0 v2 0/3] hw/block/nand: Fix out-of-bound access in NAND block buffer

2024-04-09 Thread Philippe Mathieu-Daudé

On 9/4/24 16:18, Kevin Wolf wrote:

Am 09.04.2024 um 15:59 hat Philippe Mathieu-Daudé geschrieben:

Fix for https://gitlab.com/qemu-project/qemu/-/issues/1446

Since v1:
- Addressed Kevin trivial suggestions (unsigned offset)


You already kept the Reviewed-by tags, but looks good to me.


Less work on your side ;)

The changes seemed trivial enough to keep them, but better
be safe than sorry.

Thanks!

Series queued.



Re: [PATCH-for-9.0 v2 0/3] hw/block/nand: Fix out-of-bound access in NAND block buffer

2024-04-09 Thread Kevin Wolf
Am 09.04.2024 um 15:59 hat Philippe Mathieu-Daudé geschrieben:
> Fix for https://gitlab.com/qemu-project/qemu/-/issues/1446
> 
> Since v1:
> - Addressed Kevin trivial suggestions (unsigned offset)

You already kept the Reviewed-by tags, but looks good to me.

Kevin




Re: [PATCH-for-9.0 v2 0/3] hw/block/nand: Fix out-of-bound access in NAND block buffer

2024-04-09 Thread Philippe Mathieu-Daudé

On 9/4/24 15:59, Philippe Mathieu-Daudé wrote:

Fix for https://gitlab.com/qemu-project/qemu/-/issues/1446

Since v1:
- Addressed Kevin's trivial suggestions (unsigned offset)


$ git backport-diff
Key:
[] : patches are identical
[] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively


001/   3:[0009] [FC] 'hw/block/nand: Factor nand_load_iolen() method out'
002/   3:[0004] [FC] 'hw/block/nand: Have blk_load() return boolean indicating success'
003/   3:[] [-C] 'hw/block/nand: Fix out-of-bound access in NAND block buffer'


$ git diff
diff --git a/hw/block/nand.c b/hw/block/nand.c
index d90dc965a1..e2433c25bd 100644
--- a/hw/block/nand.c
+++ b/hw/block/nand.c
@@ -88,7 +88,7 @@ struct NANDFlashState {
  * Returns %true when block containing (@addr + @offset) is
  * successfully loaded, otherwise %false.
  */
-bool (*blk_load)(NANDFlashState *s, uint64_t addr, int offset);
+bool (*blk_load)(NANDFlashState *s, uint64_t addr, unsigned offset);

 uint32_t ioaddr_vmstate;
 };
@@ -251,18 +251,21 @@ static inline void nand_pushio_byte(NANDFlashState *s, uint8_t value)

  * nand_load_block: Load block containing (s->addr + @offset).
  * Returns length of data available at @offset in this block.
  */
-static int nand_load_block(NANDFlashState *s, int offset)
+static unsigned nand_load_block(NANDFlashState *s, unsigned offset)
 {
-int iolen;
+unsigned iolen;

 if (!s->blk_load(s, s->addr, offset)) {
 return 0;
 }

-iolen = (1 << s->page_shift) - offset;
+iolen = (1 << s->page_shift);
 if (s->gnd) {
 iolen += 1 << s->oob_shift;
 }
+assert(offset <= iolen);
+iolen -= offset;
+
 return iolen;
 }

@@ -776,7 +779,7 @@ static void glue(nand_blk_erase_, NAND_PAGE_SIZE)(NANDFlashState *s)

 }

 static bool glue(nand_blk_load_, NAND_PAGE_SIZE)(NANDFlashState *s,
-uint64_t addr, int offset)
+uint64_t addr, unsigned offset)
 {
 if (PAGE(addr) >= s->pages) {
 return false;
---



Philippe Mathieu-Daudé (3):
   hw/block/nand: Factor nand_load_iolen() method out
   hw/block/nand: Have blk_load() take unsigned offset and return boolean
   hw/block/nand: Fix out-of-bound access in NAND block buffer

  hw/block/nand.c | 55 ++---
  1 file changed, 38 insertions(+), 17 deletions(-)






[PATCH-for-9.0 v2 3/3] hw/block/nand: Fix out-of-bound access in NAND block buffer

2024-04-09 Thread Philippe Mathieu-Daudé
nand_command() and nand_getio() don't check that @offset points
into the block, nor that the available data length (s->iolen) is
non-negative.

In order to fix:

- check the offset is in range in nand_blk_load_NAND_PAGE_SIZE(),
- do not set @iolen if blk_load() failed.

Reproducer:

  $ cat << EOF | qemu-system-arm -machine tosa \
 -monitor none -serial none \
 -display none -qtest stdio
  write 0x1111 0x1 0xca
  write 0x1104 0x1 0x47
  write 0x1000ca04 0x1 0xd7
  write 0x1000ca01 0x1 0xe0
  write 0x1000ca04 0x1 0x71
  write 0x1000ca00 0x1 0x50
  write 0x1000ca04 0x1 0xd7
  read 0x1000ca02 0x1
  write 0x1000ca01 0x1 0x10
  EOF

=
==15750==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61f00de0 at pc 0x560e61557210 bp 0x7ffcfc4a59f0 sp 0x7ffcfc4a59e8
READ of size 1 at 0x61f00de0 thread T0
#0 0x560e6155720f in mem_and hw/block/nand.c:101:20
#1 0x560e6155ac9c in nand_blk_write_512 hw/block/nand.c:663:9
#2 0x560e61544200 in nand_command hw/block/nand.c:293:13
#3 0x560e6153cc83 in nand_setio hw/block/nand.c:520:13
#4 0x560e61a0a69e in tc6393xb_nand_writeb hw/display/tc6393xb.c:380:13
#5 0x560e619f9bf7 in tc6393xb_writeb hw/display/tc6393xb.c:524:9
#6 0x560e647c7d03 in memory_region_write_accessor softmmu/memory.c:492:5
#7 0x560e647c7641 in access_with_adjusted_size softmmu/memory.c:554:18
#8 0x560e647c5f66 in memory_region_dispatch_write softmmu/memory.c:1514:16
#9 0x560e6485409e in flatview_write_continue softmmu/physmem.c:2825:23
#10 0x560e648421eb in flatview_write softmmu/physmem.c:2867:12
#11 0x560e64841ca8 in address_space_write softmmu/physmem.c:2963:18
#12 0x560e61170162 in qemu_writeb tests/qtest/videzzo/videzzo_qemu.c:1080:5
#13 0x560e6116eef7 in dispatch_mmio_write tests/qtest/videzzo/videzzo_qemu.c:1227:28

0x61f00de0 is located 0 bytes to the right of 3424-byte region [0x61f00080,0x61f00de0)
allocated by thread T0 here:
#0 0x560e611276cf in malloc /root/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3
#1 0x7f7959a87e98 in g_malloc (/lib/x86_64-linux-gnu/libglib-2.0.so.0+0x57e98)
#2 0x560e64b98871 in object_new qom/object.c:749:12
#3 0x560e64b5d1a1 in qdev_new hw/core/qdev.c:153:19
#4 0x560e61547ea5 in nand_init hw/block/nand.c:639:11
#5 0x560e619f8772 in tc6393xb_init hw/display/tc6393xb.c:558:16
#6 0x560e6390bad2 in tosa_init hw/arm/tosa.c:250:12

SUMMARY: AddressSanitizer: heap-buffer-overflow hw/block/nand.c:101:20 in mem_and
==15750==ABORTING

Broken since introduction in commit 3e3d5815cb ("NAND Flash memory
emulation and ECC calculation helpers for use by NAND controllers").

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1445
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1446
Reported-by: Qiang Liu 
Reviewed-by: Richard Henderson 
Reviewed-by: Kevin Wolf 
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/block/nand.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/block/nand.c b/hw/block/nand.c
index 5a31d78b6b..e2433c25bd 100644
--- a/hw/block/nand.c
+++ b/hw/block/nand.c
@@ -255,7 +255,9 @@ static unsigned nand_load_block(NANDFlashState *s, unsigned offset)
 {
 unsigned iolen;
 
-s->blk_load(s, s->addr, offset);
+if (!s->blk_load(s, s->addr, offset)) {
+return 0;
+}
 
 iolen = (1 << s->page_shift);
 if (s->gnd) {
@@ -783,6 +785,10 @@ static bool glue(nand_blk_load_, NAND_PAGE_SIZE)(NANDFlashState *s,
 return false;
 }
 
+if (offset > NAND_PAGE_SIZE + OOB_SIZE) {
+return false;
+}
+
 if (s->blk) {
 if (s->mem_oob) {
 if (blk_pread(s->blk, SECTOR(addr) << BDRV_SECTOR_BITS,
-- 
2.41.0




[PATCH-for-9.0 v2 1/3] hw/block/nand: Factor nand_load_iolen() method out

2024-04-09 Thread Philippe Mathieu-Daudé
Reviewed-by: Richard Henderson 
Reviewed-by: Kevin Wolf 
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/block/nand.c | 35 ++-
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/hw/block/nand.c b/hw/block/nand.c
index d1435f2207..f33eb2d552 100644
--- a/hw/block/nand.c
+++ b/hw/block/nand.c
@@ -243,9 +243,28 @@ static inline void nand_pushio_byte(NANDFlashState *s, uint8_t value)
 }
 }
 
+/*
+ * nand_load_block: Load block containing (s->addr + @offset).
+ * Returns length of data available at @offset in this block.
+ */
+static unsigned nand_load_block(NANDFlashState *s, unsigned offset)
+{
+unsigned iolen;
+
+s->blk_load(s, s->addr, offset);
+
+iolen = (1 << s->page_shift);
+if (s->gnd) {
+iolen += 1 << s->oob_shift;
+}
+assert(offset <= iolen);
+iolen -= offset;
+
+return iolen;
+}
+
 static void nand_command(NANDFlashState *s)
 {
-unsigned int offset;
 switch (s->cmd) {
 case NAND_CMD_READ0:
 s->iolen = 0;
@@ -271,12 +290,7 @@ static void nand_command(NANDFlashState *s)
 case NAND_CMD_NOSERIALREAD2:
 if (!(nand_flash_ids[s->chip_id].options & NAND_SAMSUNG_LP))
 break;
-offset = s->addr & ((1 << s->addr_shift) - 1);
-s->blk_load(s, s->addr, offset);
-if (s->gnd)
-s->iolen = (1 << s->page_shift) - offset;
-else
-s->iolen = (1 << s->page_shift) + (1 << s->oob_shift) - offset;
+s->iolen = nand_load_block(s, s->addr & ((1 << s->addr_shift) - 1));
 break;
 
 case NAND_CMD_RESET:
@@ -597,12 +611,7 @@ uint32_t nand_getio(DeviceState *dev)
 if (!s->iolen && s->cmd == NAND_CMD_READ0) {
 offset = (int) (s->addr & ((1 << s->addr_shift) - 1)) + s->offset;
 s->offset = 0;
-
-s->blk_load(s, s->addr, offset);
-if (s->gnd)
-s->iolen = (1 << s->page_shift) - offset;
-else
-s->iolen = (1 << s->page_shift) + (1 << s->oob_shift) - offset;
+s->iolen = nand_load_block(s, offset);
 }
 
 if (s->ce || s->iolen <= 0) {
-- 
2.41.0




[PATCH-for-9.0 v2 2/3] hw/block/nand: Have blk_load() take unsigned offset and return boolean

2024-04-09 Thread Philippe Mathieu-Daudé
Negative offset is meaningless, use unsigned type.
Return a boolean value indicating success.

Reviewed-by: Richard Henderson 
Reviewed-by: Kevin Wolf 
Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/block/nand.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/hw/block/nand.c b/hw/block/nand.c
index f33eb2d552..5a31d78b6b 100644
--- a/hw/block/nand.c
+++ b/hw/block/nand.c
@@ -84,7 +84,11 @@ struct NANDFlashState {
 
 void (*blk_write)(NANDFlashState *s);
 void (*blk_erase)(NANDFlashState *s);
-void (*blk_load)(NANDFlashState *s, uint64_t addr, int offset);
+/*
+ * Returns %true when block containing (@addr + @offset) is
+ * successfully loaded, otherwise %false.
+ */
+bool (*blk_load)(NANDFlashState *s, uint64_t addr, unsigned offset);
 
 uint32_t ioaddr_vmstate;
 };
@@ -772,11 +776,11 @@ static void glue(nand_blk_erase_, NAND_PAGE_SIZE)(NANDFlashState *s)
 }
 }
 
-static void glue(nand_blk_load_, NAND_PAGE_SIZE)(NANDFlashState *s,
-uint64_t addr, int offset)
+static bool glue(nand_blk_load_, NAND_PAGE_SIZE)(NANDFlashState *s,
+ uint64_t addr, unsigned offset)
 {
 if (PAGE(addr) >= s->pages) {
-return;
+return false;
 }
 
 if (s->blk) {
@@ -804,6 +808,8 @@ static void glue(nand_blk_load_, NAND_PAGE_SIZE)(NANDFlashState *s,
 offset, NAND_PAGE_SIZE + OOB_SIZE - offset);
 s->ioaddr = s->io;
 }
+
+return true;
 }
 
 static void glue(nand_init_, NAND_PAGE_SIZE)(NANDFlashState *s)
-- 
2.41.0




[PATCH-for-9.0 v2 0/3] hw/block/nand: Fix out-of-bound access in NAND block buffer

2024-04-09 Thread Philippe Mathieu-Daudé
Fix for https://gitlab.com/qemu-project/qemu/-/issues/1446

Since v1:
- Addressed Kevin's trivial suggestions (unsigned offset)

Philippe Mathieu-Daudé (3):
  hw/block/nand: Factor nand_load_iolen() method out
  hw/block/nand: Have blk_load() take unsigned offset and return boolean
  hw/block/nand: Fix out-of-bound access in NAND block buffer

 hw/block/nand.c | 55 ++---
 1 file changed, 38 insertions(+), 17 deletions(-)

-- 
2.41.0




Re: [PATCH-for-9.0? 0/3] hw/block/nand: Fix out-of-bound access in NAND block buffer

2024-04-09 Thread Philippe Mathieu-Daudé

On 8/4/24 17:45, Mauro Matteo Cascella wrote:

On Mon, Apr 8, 2024 at 10:36 AM Philippe Mathieu-Daudé wrote:


Fix for https://gitlab.com/qemu-project/qemu/-/issues/1446


Does hw/block/nand meet the security requirements for CVE assignment?

=> https://www.qemu.org/docs/master/system/security.html


I don't think this device model is used in virtualization,
so I don't think so. (Cc'ing qemu-arm@ in case).
Thanks!




Philippe Mathieu-Daudé (3):
   hw/block/nand: Factor nand_load_iolen() method out
   hw/block/nand: Have blk_load() return boolean indicating success
   hw/block/nand: Fix out-of-bound access in NAND block buffer

  hw/block/nand.c | 50 +
  1 file changed, 34 insertions(+), 16 deletions(-)

--
2.41.0








Re: [PATCH-for-9.0] hw/sd/sdhci: Discard excess of data written to Buffer Data Port register

2024-04-09 Thread Peter Maydell
On Mon, 8 Apr 2024 at 17:42, Peter Maydell  wrote:
> So another approach here would be...

That said, this is all quite complicated looking, so
for 9.0 and backports at least this patch is fine.

Reviewed-by: Peter Maydell 

thanks
-- PMM



Re: [PATCH-for-9.0? 0/3] hw/block/nand: Fix out-of-bound access in NAND block buffer

2024-04-09 Thread Kevin Wolf
Am 08.04.2024 um 10:36 hat Philippe Mathieu-Daudé geschrieben:
> Fix for https://gitlab.com/qemu-project/qemu/-/issues/1446
> 
> Philippe Mathieu-Daudé (3):
>   hw/block/nand: Factor nand_load_iolen() method out
>   hw/block/nand: Have blk_load() return boolean indicating success
>   hw/block/nand: Fix out-of-bound access in NAND block buffer

As we're short on time for 9.0:

Reviewed-by: Kevin Wolf 

But it feels to me like this device could use some more cleanup to make
the code more robust.

Kevin




Re: [PATCH-for-9.0? 1/3] hw/block/nand: Factor nand_load_iolen() method out

2024-04-09 Thread Kevin Wolf
Am 08.04.2024 um 10:36 hat Philippe Mathieu-Daudé geschrieben:
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  hw/block/nand.c | 32 +++-
>  1 file changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/block/nand.c b/hw/block/nand.c
> index d1435f2207..6fa9038bb5 100644
> --- a/hw/block/nand.c
> +++ b/hw/block/nand.c
> @@ -243,9 +243,25 @@ static inline void nand_pushio_byte(NANDFlashState *s, uint8_t value)
>  }
>  }
>  
> +/*
> + * nand_load_block: Load block containing (s->addr + @offset).
> + * Returns length of data available at @offset in this block.
> + */
> +static int nand_load_block(NANDFlashState *s, int offset)
> +{
> +int iolen;
> +
> +s->blk_load(s, s->addr, offset);

Wouldn't it make more sense for @offset to be unsigned, like in
nand_command() before this patch?

I think the values are guaranteed to be small enough to fit in either
signed or unsigned, but we never check for < 0 (maybe that should be
done in your patch to s->blk_load() anyway).

> +iolen = (1 << s->page_shift) - offset;

This is not new, but I'm confused. Can this legitimately be negative?
offset seems to be limited to (1 << s->addr_shift) + s->offset in
practice, but addr_shift > page_shift for NAND_PAGE_SIZE == 2048.

After patch 3, offset is implicitly limited to NAND_PAGE_SIZE + OOB_SIZE
because we return early if s->blk_load() fails. I wonder if it wouldn't
be better to catch this in the callers already and only assert in
s->blk_load().

Anyway, after patch 3 iolen can only temporarily become negative here...

> +if (s->gnd) {
> +iolen += 1 << s->oob_shift;

...before it becomes non-negative again here.

> +}
> +return iolen;
> +}

So none of this makes the code technically incorrect after applying the
full series, but let someone modify it who doesn't understand these
intricacies and I think chances are that it will fall apart.

Kevin




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-09 Thread Markus Armbruster
Peter Xu  writes:

> On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
>> Hi Peter,
>
> Jinpu,
>
> Thanks for joining the discussion.
>
>> 
>> On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
>> >
>> > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
>> > > Hello Peter und Zhjian,
>> > >
>> > > Thank you so much for letting me know about this. I'm also a bit 
>> > > surprised at
>> > > the plan for deprecating the RDMA migration subsystem.
>> >
>> > It's not too late, since it looks like we do have users not yet notified
>> > from this, we'll redo the deprecation procedure even if it'll be the final
>> > plan, and it'll be 2 releases after this.

[...]

>> > Per our best knowledge, RDMA users are rare, and please let anyone know if
>> > you are aware of such users.  IIUC the major reason why RDMA stopped being
>> > the trend is because the network is not like ten years ago; I don't think I
>> > have good knowledge in RDMA at all nor network, but my understanding is
>> > it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
>> > little sense to maintain multiple protocols, considering RDMA migration
>> > code is so special so that it has the most custom code comparing to other
>> > protocols.
>> +cc some guys from Huawei.
>> 
>> I'm surprised RDMA users are rare,  I guess maybe many are just
>> working with different code base.
>
> Yes, please cc whoever might be interested (or surprised.. :) to know this,
> and let's be open to all possibilities.
>
> I don't think it makes sense if there're a lot of users of a feature then
> we deprecate that without a good reason.  However there's always the
> resource limitation issue we're facing, so it could still have the
> possibility that this gets deprecated if nobody is working on our upstream
> branch. Say, if people use private branches anyway to support rdma without
> collaborating upstream, keeping such feature upstream then may not make
> much sense either, unless there's some way to collaborate.  We'll see.
>
> It seems there can still be people joining this discussion.  I'll hold off
> a bit on merging this patch to provide enough window for anyone to chim in.

Users are not enough.  Only maintainers are.

At some point, people cared enough about RDMA in QEMU to contribute the
code.  That's why have the code.

To keep the code, we need people who care enough about RDMA in QEMU to
maintain it.  Without such people, the case for keeping it remains
dangerously weak, and no amount of talk or even benchmarks can change
that.




[PULL 6/7] vhost-user-blk: simplify and fix vhost_user_blk_handle_config_change

2024-04-09 Thread Michael S. Tsirkin
From: Vladimir Sementsov-Ogievskiy 

Let's not care about what was changed and update the whole config,
reasons:

1. config->geometry should be updated together with capacity, so we fix
   a bug.

2. Vhost-user protocol doesn't say anything about config change
   limitation. Silent ignore of changes doesn't seem to be correct.

3. vhost-user-vsock reads the whole config

4. on realize we don't do any checks on retrieved config, so no reason
   to care here

The "valid for resize only" comment has existed since the introduction of
the whole hw/block/vhost-user-blk.c in commit
   00343e4b54ba0685e9ebe928ec5713b0cf7f1d1c
"vhost-user-blk: introduce a new vhost-user-blk host device";
it seems it was just an extra limitation.

Also, let's notify guest unconditionally:

1. So does vhost-user-vsock

2. We are going to reuse the functionality in new cases when we do want
   to notify the guest unconditionally. So, no reason to create extra
   branches in the logic.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Acked-by: Raphael Norwitz 
Message-Id: <20240329183758.3360733-2-vsement...@yandex-team.ru>
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
---
 hw/block/vhost-user-blk.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 6a856ad51a..9e6bbc6950 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -91,7 +91,6 @@ static void vhost_user_blk_set_config(VirtIODevice *vdev, const uint8_t *config)
 static int vhost_user_blk_handle_config_change(struct vhost_dev *dev)
 {
 int ret;
-struct virtio_blk_config blkcfg;
 VirtIODevice *vdev = dev->vdev;
 VHostUserBlk *s = VHOST_USER_BLK(dev->vdev);
 Error *local_err = NULL;
@@ -100,19 +99,15 @@ static int vhost_user_blk_handle_config_change(struct vhost_dev *dev)
 return 0;
 }
 
-ret = vhost_dev_get_config(dev, (uint8_t *)&blkcfg,
+ret = vhost_dev_get_config(dev, (uint8_t *)&s->blkcfg,
vdev->config_len, &local_err);
 if (ret < 0) {
 error_report_err(local_err);
 return ret;
 }
 
-/* valid for resize only */
-if (blkcfg.capacity != s->blkcfg.capacity) {
-s->blkcfg.capacity = blkcfg.capacity;
-memcpy(dev->vdev->config, &s->blkcfg, vdev->config_len);
-virtio_notify_config(dev->vdev);
-}
+memcpy(dev->vdev->config, &s->blkcfg, vdev->config_len);
+virtio_notify_config(dev->vdev);
 
 return 0;
 }
-- 
MST




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-09 Thread Jinpu Wang
Hi Peter,

On Mon, Apr 8, 2024 at 6:18 PM Peter Xu  wrote:
>
> On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > Hi Peter,
>
> Jinpu,
>
> Thanks for joining the discussion.
>
> >
> > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
> > >
> > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > Hello Peter und Zhjian,
> > > >
> > > > Thank you so much for letting me know about this. I'm also a bit 
> > > > surprised at
> > > > the plan for deprecating the RDMA migration subsystem.
> > >
> > > It's not too late, since it looks like we do have users not yet notified
> > > from this, we'll redo the deprecation procedure even if it'll be the final
> > > plan, and it'll be 2 releases after this.
> > >
> > > >
> > > > > IMHO it's more important to know whether there are still users and 
> > > > > whether
> > > > > they would still like to see it around.
> > > >
> > > > > I admit RDMA migration was lack of testing(unit/CI test), which led 
> > > > > to the a few
> > > > > obvious bugs being noticed too late.
> > > >
> > > > Yes, we are a user of this subsystem. I was unaware of the lack of test 
> > > > coverage
> > > > for this part. As soon as 8.2 was released, I saw that many of the
> > > > migration test
> > > > cases failed and came to realize that there might be a bug between 8.1
> > > > and 8.2, but
> > > > was unable to confirm and report it quickly to you.
> > > >
> > > > The maintenance of this part could be too costly or difficult from
> > > > your point of view.
> > >
> > > It may or may not be too costly, it's just that we need real users of RDMA
> > > taking some care of it.  Having it broken easily for >1 releases 
> > > definitely
> > > is a sign of lack of users.  It is an implication to the community that we
> > > should consider dropping some features so that we can get the best use of
> > > the community resources for the things that may have a broader audience.
> > >
> > > One thing majorly missing is a RDMA tester to guard all the merges to not
> > > break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> > > but just to sanity check the migration+rdma code running all fine.  RDMA
> > > taught us the lesson so we're requesting CI coverage for all other new
> > > features that will be merged at least for migration subsystem, so that we
> > > plan to not merge anything that is not covered by CI unless extremely
> > > necessary in the future.
> > >
> > > For sure CI is not the only missing part, but I'd say we should start with
> > > it, then someone should also take care of the code even if only in
> > > maintenance mode (no new feature to add on top).
> > >
> > > >
> > > > My concern is, this plan will forces a few QEMU users (not sure how
> > > > many) like us
> > > > either to stick to the RDMA migration by using an increasingly older
> > > > version of QEMU,
> > > > or to abandon the currently used RDMA migration.
> > >
> > > RDMA doesn't get new features anyway, if there's specific use case for 
> > > RDMA
> > > migrations, would it work if such a scenario uses the old binary?  Is it
> > > possible to switch to the TCP protocol with some good NICs?
> > We have used rdma migration with HCA from Nvidia for years, our
> > experience is RDMA migration works better than tcp (over ipoib).
>
> Please bare with me, as I know little on rdma stuff.
>
> I'm actually pretty confused (and since a long time ago..) on why we need
> to operation with rdma contexts when ipoib seems to provide all the tcp
> layers.  I meant, can it work with the current "tcp:" protocol with ipoib
> even if there's rdma/ib hardwares underneath?  Is it because of performance
> improvements so that we must use a separate path comparing to generic
> "tcp:" protocol here?
Using the RDMA protocol with IB verbs, we can leverage the full benefit of
RDMA by talking directly to the NIC, which bypasses kernel overhead, lowers
CPU utilization, and gives better performance.

IPoIB, by contrast, exists mainly for compatibility with applications using
TCP and can't get the full benefit of RDMA. When you have mixed generations
of IB devices, there are performance issues with IPoIB: we've seen a 40G HCA
reach only 2 Gb/s over IPoIB, while raw RDMA reaches full line speed.

I just ran a simple iperf3 test via IPoIB and ib_send_bw on the same hosts:

iperf 3.9
Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
07:19:34 UTC 2024 x86_64
---
Server listening on 5201
---
Time: Tue, 09 Apr 2024 06:55:02 GMT
Accepted connection from 2a02:247f:401:4:2:0:b:3, port 41130
  Cookie: cer2hexlldrowclq6izh7gbg5toviffqbcwt
  TCP MSS: 0 (default)
[  5] local 2a02:247f:401:4:2:0:a:3 port 5201 connected to
2a02:247f:401:4:2:0:b:3 port 41136
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting
0 seconds, 10 second test, tos 0
[ ID] Interval   Transfer Bitrate
[  5]   0.00-1.00   sec  

[PATCH] tests/avocado: add hotplug_blk test

2024-04-09 Thread Vladimir Sementsov-Ogievskiy
Introduce a test, that checks that plug/unplug of virtio-blk device
works.

(the test is developed by copying hotplug_cpu.py, so keep original
copyright)

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/avocado/hotplug_blk.py | 69 
 1 file changed, 69 insertions(+)
 create mode 100644 tests/avocado/hotplug_blk.py

diff --git a/tests/avocado/hotplug_blk.py b/tests/avocado/hotplug_blk.py
new file mode 100644
index 00..5dc30f6616
--- /dev/null
+++ b/tests/avocado/hotplug_blk.py
@@ -0,0 +1,69 @@
+# Functional test that hotplugs a virtio blk disk and checks it on a Linux
+# guest
+#
+# Copyright (c) 2021 Red Hat, Inc.
+# Copyright (c) Yandex
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or
+# later.  See the COPYING file in the top-level directory.
+
+import time
+
+from avocado_qemu import LinuxTest
+
+
+class HotPlug(LinuxTest):
+def blockdev_add(self) -> None:
+self.vm.cmd('blockdev-add', **{
+'driver': 'null-co',
+'size': 1073741824,
+'node-name': 'disk'
+})
+
+def assert_vda(self) -> None:
+self.ssh_command('test -e /sys/block/vda')
+
+def assert_no_vda(self) -> None:
+with self.assertRaises(AssertionError):
+self.assert_vda()
+
+def plug(self) -> None:
+args = {
+'driver': 'virtio-blk-pci',
+'drive': 'disk',
+'id': 'virtio-disk0',
+'bus': 'pci.1',
+'addr': 1
+}
+
+self.assert_no_vda()
+self.vm.cmd('device_add', args)
+try:
+self.assert_vda()
+except AssertionError:
+time.sleep(1)
+self.assert_vda()
+
+def unplug(self) -> None:
+self.vm.cmd('device_del', id='virtio-disk0')
+
+self.vm.event_wait('DEVICE_DELETED', 1.0,
+   match={'data': {'device': 'virtio-disk0'}})
+
+self.assert_no_vda()
+
+def test(self) -> None:
+"""
+:avocado: tags=arch:x86_64
+:avocado: tags=machine:q35
+:avocado: tags=accel:kvm
+"""
+self.require_accelerator('kvm')
+self.vm.add_args('-accel', 'kvm')
+self.vm.add_args('-device', 'pcie-pci-bridge,id=pci.1,bus=pcie.0')
+
+self.launch_and_wait()
+self.blockdev_add()
+
+self.plug()
+self.unplug()
-- 
2.34.1




Re: [PATCH v5 2/2] nbd/server: Mark negotiation functions as coroutine_fn

2024-04-09 Thread Vladimir Sementsov-Ogievskiy

On 08.04.24 19:00, Eric Blake wrote:

nbd_negotiate() is already marked coroutine_fn.  And given the fix in
the previous patch to have nbd_negotiate_handle_starttls not create
and wait on a g_main_loop (as that would violate coroutine
constraints), it is worth marking the rest of the related static
functions reachable only during option negotiation as also being
coroutine_fn.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Eric Blake 
---
  nbd/server.c | 102 +--
  1 file changed, 59 insertions(+), 43 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index 98ae0e16326..1857fba51c1 100644


[..]


  {
  int rc;
  g_autofree char *name = NULL;
@@ -755,7 +764,8 @@ struct NBDTLSServerHandshakeData {
  Coroutine *co;
  };

-static void nbd_server_tls_handshake(QIOTask *task, void *opaque)
+static coroutine_fn void


This one is not: it's a callback for the TLS handshake, which does not run
in coroutine context, as I understand it.
Without this hunk:


Reviewed-by: Vladimir Sementsov-Ogievskiy 



+nbd_server_tls_handshake(QIOTask *task, void *opaque)
  {
  struct NBDTLSServerHandshakeData *data = opaque;

@@ -768,8 +778,8 @@ static void nbd_server_tls_handshake(QIOTask *task, void 
*opaque)



[..]

--
Best regards,
Vladimir




Re: [PATCH v5 1/2] nbd/server: do not poll within a coroutine context

2024-04-09 Thread Vladimir Sementsov-Ogievskiy

On 08.04.24 19:00, Eric Blake wrote:

From: Zhu Yangyang

Coroutines are not supposed to block. Instead, they should yield.

The client performs TLS upgrade outside of an AIOContext, during
synchronous handshake; this still requires g_main_loop.  But the
server responds to TLS upgrade inside a coroutine, so a nested
g_main_loop is wrong.  Since the two callbacks no longer share more
than the setting of data.complete and data.error, it's just as easy to
use static helpers instead of trying to share a common code path.  It
is also possible to add assertions that no other code is interfering
with the eventual path to qio reaching the callback, whether or not it
required a yield or main loop.

Fixes: f95910f ("nbd: implement TLS support in the protocol negotiation")
Signed-off-by: Zhu Yangyang
[eblake: move callbacks to their use point, add assertions]
Signed-off-by: Eric Blake


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir