Re: [PATCH 1/7] migration/rdma: Fix save_page method to fail on polling error

2023-09-05 Thread Zhijian Li (Fujitsu)


On 31/08/2023 21:25, Markus Armbruster wrote:
> qemu_rdma_save_page() reports polling error with error_report(), then
> succeeds anyway.  This is because the variable holding the polling
> status *shadows* the variable the function returns.  The latter
> remains zero.
> 
> Broken since day one, and duplicated more recently.
> 
> Fixes: 2da776db4846 (rdma: core logic)
> Fixes: b390afd8c50b (migration/rdma: Fix out of order wrid)

Thanks for the fixes


> Signed-off-by: Markus Armbruster 


Reviewed-by: Li Zhijian 




> ---
>   migration/rdma.c | 6 --
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/rdma.c b/migration/rdma.c
> index ca430d319d..b2e869aced 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -3281,7 +3281,8 @@ static size_t qemu_rdma_save_page(QEMUFile *f,
>*/
>   while (1) {
>   uint64_t wr_id, wr_id_in;
> -int ret = qemu_rdma_poll(rdma, rdma->recv_cq, _id_in, NULL);
> +ret = qemu_rdma_poll(rdma, rdma->recv_cq, _id_in, NULL);
> +
>   if (ret < 0) {
>   error_report("rdma migration: polling error! %d", ret);
>   goto err;
> @@ -3296,7 +3297,8 @@ static size_t qemu_rdma_save_page(QEMUFile *f,
>   
>   while (1) {
>   uint64_t wr_id, wr_id_in;
> -int ret = qemu_rdma_poll(rdma, rdma->send_cq, _id_in, NULL);
> +ret = qemu_rdma_poll(rdma, rdma->send_cq, _id_in, NULL);
> +
>   if (ret < 0) {
>   error_report("rdma migration: polling error! %d", ret);
>   goto err;

Re: [Qemu-devel] [PATCH] ahci: enable pci bus master MemoryRegion before loading ahci engines

2023-09-05 Thread Michael S. Tsirkin
On Tue, Sep 10, 2019 at 10:08:20AM -0400, John Snow wrote:
> 
> 
> On 9/10/19 9:58 AM, Michael S. Tsirkin wrote:
> > On Tue, Sep 10, 2019 at 09:50:41AM -0400, John Snow wrote:
> >>
> >>
> >> On 9/10/19 3:04 AM, Michael S. Tsirkin wrote:
> >>> On Tue, Sep 10, 2019 at 01:18:37AM +0800, andychiu wrote:
>  If Windows 10 guests have enabled 'turn off hard disk after idle'
>  option in power settings, and the guest has a SATA disk plugged in,
>  the SATA disk will be turned off after a specified idle time.
>  If the guest is live migrated or saved/loaded with its SATA disk
>  turned off, the following error will occur:
> 
>  qemu-system-x86_64: AHCI: Failed to start FIS receive engine: bad FIS 
>  receive buffer address
>  qemu-system-x86_64: Failed to load ich9_ahci:ahci
>  qemu-system-x86_64: error while loading state for instance 0x0 of device 
>  ':00:1a.0/ich9_ahci'
>  qemu-system-x86_64: load of migration failed: Operation not permitted
> 
>  Observation from trace logs shows that a while after Windows 10 turns off
>  a SATA disk (IDE disks don't have the following behavior),
>  it will disable the PCI_COMMAND_MASTER flag of the pci device containing
>  the ahci device. When the the disk is turning back on,
>  the PCI_COMMAND_MASTER flag will be restored first.
>  But if the guest is migrated or saved/loaded while the disk is off,
>  the post_load callback of ahci device, ahci_state_post_load(), will fail
>  at ahci_cond_start_engines() if the MemoryRegion
>  pci_dev->bus_master_enable_region is not enabled, with pci_dev pointing
>  to the PCIDevice struct containing the ahci device.
> 
>  This patch enables pci_dev->bus_master_enable_region before calling
>  ahci_cond_start_engines() in ahci_state_post_load(), and restore the
>  MemoryRegion to its original state afterwards.
> 
>  Signed-off-by: andychiu 
> >>>
> >>> Poking at PCI device internals like this seems fragile.  And force
> >>> enabling bus master can lead to unpleasantness like corrupting guest
> >>> memory, unhandled interrupts, etc.  E.g. it's quite reasonable,
> >>> spec-wise, for the guest to move thing in memory around while bus
> >>> mastering is off.
> >>>
> >>> Can you teach ahci that region being disabled
> >>> during migration is ok, and recover from it?
> >>
> >> That's what I'm wondering.
> >>
> >> I could try to just disable the FIS RX engine if the mapping fails, but
> >> that will require a change to guest visible state.
> >>
> >> My hunch, though, is that when windows re-enables the device it will
> >> need to re-program the address registers anyway, so it might cope well
> >> with the FIS RX bit getting switched off.
> >>
> >> (I'm wondering if it isn't a mistake that QEMU is trying to re-map this
> >> address in the first place. Is it legal that the PCI device has pci bus
> >> master disabled but we've held on to a mapping?
> > 
> > If you are poking at guest memory when bus master is off, then most likely 
> > yes.
> > 
> >> Should there be some
> >> callback where AHCI knows to invalidate mappings at that point...?)
> > 
> > ATM the callback is the config write, you check
> > proxy->pci_dev.config[PCI_COMMAND] & PCI_COMMAND_MASTER
> > and if disabled invalidate the mapping.
> > 
> > virtio at least has code that pokes at
> > proxy->pci_dev.config[PCI_COMMAND] too, I'm quite
> > open to a function along the lines of
> > pci_is_bus_master_enabled()
> > that will do this.
> > 
> 
> Well, that's not a callback. I don't think it's right to check the
> PCI_COMMAND register *every* time AHCI does anything at all to see if
> its mappings are still valid.
> 
> AHCI makes a mapping *once* when FIS RX is turned on, and it unmaps it
> when it's turned off. It assumes it remains valid that whole time. When
> we migrate, it checks to see if it was running, and performs the
> mappings again to re-boot the state machine.
> 
> What I'm asking is; what are the implications of a guest disabling
> PCI_COMMAND_MASTER? (I don't know PCI as well as you do.)

The implication is that no reads or writes must be initiated by device:
either memory or IO reads, or sending MSI. INT#x is unaffected.
writes into device memory are unaffected. whether reads from
device memory are affected kind of depends, but maybe not.

Whether device caches anything internally has nothing to do
with PCI_COMMAND_MASTER and PCI spec says nothing about it.
Windows uses PCI_COMMAND_MASTER to emulate surprise removal
so there's that.


> What should that mean for the AHCI state machine?
> 
> Does this *necessarily* invalidate the mappings?
> (In which case -- it's an error that AHCI held on to them after Windows
> disabled the card, even if AHCI isn't being engaged by the guest
> anymore. Essentially, we were turned off but didn't clean up a dangling
> pointer, but we need the event that tells us to clean the dangling mapping.)

It does not have to but it 

Re: [PATCH v3 07/20] virtio: add vhost-user-base and a generic vhost-user-device

2023-09-05 Thread Alex Bennée


Matias Ezequiel Vara Larsen  writes:

> On Mon, Jul 10, 2023 at 04:35:09PM +0100, Alex Bennée wrote:
>> In theory we shouldn't need to repeat so much boilerplate to support
>> vhost-user backends. This provides a generic vhost-user-base QOM
>> object and a derived vhost-user-device for which the user needs to
>> provide the few bits of information that aren't currently provided by
>> the vhost-user protocol. This should provide a baseline implementation
>> from which the other vhost-user stub can specialise.
>> 
>> Signed-off-by: Alex Bennée 
>> 
>> ---
>> v2
>>   - split into vub and vud

>> +
>> +/*
>> + * Disable guest notifiers, by default all notifications will be via the
>> + * asynchronous vhost-user socket.
>> + */
>> +vdev->use_guest_notifier_mask = false;
>> +
>> +/* Allocate queues */
>> +vub->vqs = g_ptr_array_sized_new(vub->num_vqs);
>> +for (int i = 0; i < vub->num_vqs; i++) {
>> +g_ptr_array_add(vub->vqs,
>> +virtio_add_queue(vdev, 4, vub_handle_output));
>> +}
>> +
>
> Hello Alex, apologies if someone already asked this. If I understand
> correctly, the second parameter of virtio_add_queue() is the len of the
> queue. Why have you chosen "4" as its value? Shall qemu query the len of
> the queue from the vhost-user device instead?

Hmm yeah that is inherited from the virtio-rng backend which has a
pretty short queue. I don't think it is intrinsic to the device
implementation (although I guess that depends if a device will have
multiple requests in flight).

I propose making is some useful ^2 (like 64) and adding a config knob to
increase it if needed.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



Re: [PATCH 05/21] block: Introduce bdrv_schedule_unref()

2023-09-05 Thread Kevin Wolf
Am 22.08.2023 um 21:01 hat Stefan Hajnoczi geschrieben:
> On Thu, Aug 17, 2023 at 02:50:04PM +0200, Kevin Wolf wrote:
> > bdrv_unref() is called by a lot of places that need to hold the graph
> > lock (it naturally happens in the context of operations that change the
> > graph). However, bdrv_unref() takes the graph writer lock internally, so
> > it can't actually be called while already holding a graph lock without
> > causing a deadlock.
> > 
> > bdrv_unref() also can't just become GRAPH_WRLOCK because it drains the
> > node before closing it, and draining requires that the graph is
> > unlocked.
> > 
> > The solution is to defer deleting the node until we don't hold the lock
> > any more and draining is possible again.
> > 
> > Note that keeping images open for longer than necessary can create
> > problems, too: You can't open an image again before it is really closed
> > (if image locking didn't prevent it, it would cause corruption).
> > Reopening an image immediately happens at least during bdrv_open() and
> > bdrv_co_create().
> > 
> > In order to solve this problem, make sure to run the deferred unref in
> > bdrv_graph_wrunlock(), i.e. the first possible place where we can drain
> > again. This is also why bdrv_schedule_unref() is marked GRAPH_WRLOCK.
> > 
> > The output of iotest 051 is updated because the additional polling
> > changes the order of HMP output, resulting in a new "(qemu)" prompt in
> > the test output that was previously on a separate line and filtered out.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >  include/block/block-global-state.h |  1 +
> >  block.c|  9 +
> >  block/graph-lock.c | 23 ---
> >  tests/qemu-iotests/051.pc.out  |  6 +++---
> >  4 files changed, 29 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/block/block-global-state.h 
> > b/include/block/block-global-state.h
> > index f347199bff..e570799f85 100644
> > --- a/include/block/block-global-state.h
> > +++ b/include/block/block-global-state.h
> > @@ -224,6 +224,7 @@ void bdrv_img_create(const char *filename, const char 
> > *fmt,
> >  void bdrv_ref(BlockDriverState *bs);
> >  void no_coroutine_fn bdrv_unref(BlockDriverState *bs);
> >  void coroutine_fn no_co_wrapper bdrv_co_unref(BlockDriverState *bs);
> > +void GRAPH_WRLOCK bdrv_schedule_unref(BlockDriverState *bs);
> >  void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
> >  BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
> >   BlockDriverState *child_bs,
> > diff --git a/block.c b/block.c
> > index 6376452768..9c4f24f4b9 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -7033,6 +7033,15 @@ void bdrv_unref(BlockDriverState *bs)
> >  }
> >  }
> >  
> > +void bdrv_schedule_unref(BlockDriverState *bs)
> 
> Please add a doc comment explaining when and why this should be used.

Ok.

> > +{
> > +if (!bs) {
> > +return;
> > +}
> > +aio_bh_schedule_oneshot(qemu_get_aio_context(),
> > +(QEMUBHFunc *) bdrv_unref, bs);
> > +}
> > +
> >  struct BdrvOpBlocker {
> >  Error *reason;
> >  QLIST_ENTRY(BdrvOpBlocker) list;
> > diff --git a/block/graph-lock.c b/block/graph-lock.c
> > index 5e66f01ae8..395d387651 100644
> > --- a/block/graph-lock.c
> > +++ b/block/graph-lock.c
> > @@ -163,17 +163,26 @@ void bdrv_graph_wrlock(BlockDriverState *bs)
> >  void bdrv_graph_wrunlock(void)
> >  {
> >  GLOBAL_STATE_CODE();
> > -QEMU_LOCK_GUARD(_context_list_lock);
> >  assert(qatomic_read(_writer));
> >  
> > +WITH_QEMU_LOCK_GUARD(_context_list_lock) {
> > +/*
> > + * No need for memory barriers, this works in pair with
> > + * the slow path of rdlock() and both take the lock.
> > + */
> > +qatomic_store_release(_writer, 0);
> > +
> > +/* Wake up all coroutine that are waiting to read the graph */
> 
> s/coroutine/coroutines/

I only changed the indentation, but I guess I can just fix it while I
touch it.

> > +qemu_co_enter_all(_queue, _context_list_lock);
> > +}
> > +
> >  /*
> > - * No need for memory barriers, this works in pair with
> > - * the slow path of rdlock() and both take the lock.
> > + * Run any BHs that were scheduled during the wrlock section and that
> > + * callers might expect to have finished (e.g. bdrv_unref() calls). Do 
> > this
> 
> Referring directly to bdrv_schedule_unref() would help make it clearer
> what you mean.
> 
> > + * only after restarting coroutines so that nested event loops in BHs 
> > don't
> > + * deadlock if their condition relies on the coroutine making progress.
> >   */
> > -qatomic_store_release(_writer, 0);
> > -
> > -/* Wake up all coroutine that are waiting to read the graph */
> > -qemu_co_enter_all(_queue, _context_list_lock);
> > +aio_bh_poll(qemu_get_aio_context());
> 
> Keeping a dedicated list of BDSes pending 

Re: [RFC PATCH v2 22/22] softmmu/physmem: Clean up local variable shadowing

2023-09-05 Thread Peter Xu
On Mon, Sep 04, 2023 at 05:31:30PM +0100, Daniel P. Berrangé wrote:
> On Mon, Sep 04, 2023 at 06:12:34PM +0200, Philippe Mathieu-Daudé wrote:
> > Fix:
> > 
> >   softmmu/physmem.c: In function 
> > ‘cpu_physical_memory_snapshot_and_clear_dirty’:
> >   softmmu/physmem.c:916:27: warning: declaration of ‘offset’ shadows a 
> > parameter [-Wshadow=compatible-local]
> > 916 | unsigned long offset = page % DIRTY_MEMORY_BLOCK_SIZE;
> > |   ^~
> >   softmmu/physmem.c:892:31: note: shadowed declaration is here
> > 892 | (MemoryRegion *mr, hwaddr offset, hwaddr length, unsigned 
> > client)
> > |~~~^~
> > 
> > Signed-off-by: Philippe Mathieu-Daudé 
> > ---
> > RFC: Please double-check how 'offset' is used few lines later.
> 
> I don't see an issue - those lines are in an outer scope, so won't
> be accessing the 'offset' you've changed, they'll be the parameter
> instead. If you want to sanity check though, presumably the asm
> dissassembly for this method should be the same before/after this
> change

(and if it didn't do so then it's a bug..)

> 
> > ---
> >  softmmu/physmem.c | 10 +-
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> Reviewed-by: Daniel P. Berrangé 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2 21/22] softmmu/memory: Clean up local variable shadowing

2023-09-05 Thread Peter Xu
On Mon, Sep 04, 2023 at 06:12:33PM +0200, Philippe Mathieu-Daudé wrote:
> Fix:
> 
>   softmmu/memory.c: In function ‘mtree_print_mr’:
>   softmmu/memory.c:3236:27: warning: declaration of ‘ml’ shadows a previous 
> local [-Wshadow=compatible-local]
>3236 | MemoryRegionList *ml;
> |   ^~
>   softmmu/memory.c:3213:32: note: shadowed declaration is here
>3213 | MemoryRegionList *new_ml, *ml, *next_ml;
> |^~
> 
> Signed-off-by: Philippe Mathieu-Daudé 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PULL v2 00/40] Misc patches for 2023-08-31

2023-09-05 Thread Stefan Hajnoczi
Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/8.2 for any 
user-visible changes.


signature.asc
Description: PGP signature


Re: [PATCH 17/21] block: Take graph rdlock in bdrv_drop_intermediate()

2023-09-05 Thread Kevin Wolf
Am 22.08.2023 um 21:35 hat Stefan Hajnoczi geschrieben:
> On Thu, Aug 17, 2023 at 02:50:16PM +0200, Kevin Wolf wrote:
> > The function reads the parents list, so it needs to hold the graph lock.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >  block.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/block.c b/block.c
> > index 7df8780d6e..a82389f742 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -5934,9 +5934,11 @@ int bdrv_drop_intermediate(BlockDriverState *top, 
> > BlockDriverState *base,
> >  backing_file_str = base->filename;
> >  }
> >  
> > +bdrv_graph_rdlock_main_loop();
> >  QLIST_FOREACH(c, >parents, next_parent) {
> >  updated_children = g_slist_prepend(updated_children, c);
> >  }
> > +bdrv_graph_rdunlock_main_loop();
> 
> This is GLOBAL_STATE_CODE, so why take the read lock? I thought nothing
> can modify the graph here. If it could, then stashing the parents in
> the updated_children probably wouldn't be safe anyway.

The only thing bdrv_graph_rdlock_main_loop() does is asserting that the
conditions for doing nothing are met (GLOBAL_STATE_CODE + non-coroutine
context) and providing the right TSA attributes to make the compiler
happy.

Kevin


signature.asc
Description: PGP signature


Re: [PATCH 1/2] vmstate: Mark VMStateInfo.get/put() coroutine_mixed_fn

2023-09-05 Thread Peter Xu
On Tue, Sep 05, 2023 at 04:50:01PM +0200, Kevin Wolf wrote:
> Migration code can run both in coroutine context (the usual case) and
> non-coroutine context (at least savevm/loadvm for snapshots). This also
> affects the VMState callbacks, and devices must consider this. Change
> the callback definition in VMStateInfo to be explicit about it.
> 
> Signed-off-by: Kevin Wolf 

Acked-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v3 07/20] virtio: add vhost-user-base and a generic vhost-user-device

2023-09-05 Thread Matias Ezequiel Vara Larsen
On Mon, Jul 10, 2023 at 04:35:09PM +0100, Alex Bennée wrote:
> In theory we shouldn't need to repeat so much boilerplate to support
> vhost-user backends. This provides a generic vhost-user-base QOM
> object and a derived vhost-user-device for which the user needs to
> provide the few bits of information that aren't currently provided by
> the vhost-user protocol. This should provide a baseline implementation
> from which the other vhost-user stub can specialise.
> 
> Signed-off-by: Alex Bennée 
> 
> ---
> v2
>   - split into vub and vud
> ---
>  include/hw/virtio/vhost-user-device.h |  45 
>  hw/virtio/vhost-user-device.c | 324 ++
>  hw/virtio/meson.build |   2 +
>  3 files changed, 371 insertions(+)
>  create mode 100644 include/hw/virtio/vhost-user-device.h
>  create mode 100644 hw/virtio/vhost-user-device.c
> 
> diff --git a/include/hw/virtio/vhost-user-device.h 
> b/include/hw/virtio/vhost-user-device.h
> new file mode 100644
> index 00..9105011e25
> --- /dev/null
> +++ b/include/hw/virtio/vhost-user-device.h
> @@ -0,0 +1,45 @@
> +/*
> + * Vhost-user generic virtio device
> + *
> + * Copyright (c) 2023 Linaro Ltd
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef QEMU_VHOST_USER_DEVICE_H
> +#define QEMU_VHOST_USER_DEVICE_H
> +
> +#include "hw/virtio/vhost.h"
> +#include "hw/virtio/vhost-user.h"
> +
> +#define TYPE_VHOST_USER_BASE "vhost-user-base"
> +
> +OBJECT_DECLARE_TYPE(VHostUserBase, VHostUserBaseClass, VHOST_USER_BASE)
> +
> +struct VHostUserBase {
> +VirtIODevice parent;
> +/* Properties */
> +CharBackend chardev;
> +uint16_t virtio_id;
> +uint32_t num_vqs;
> +/* State tracking */
> +VhostUserState vhost_user;
> +struct vhost_virtqueue *vhost_vq;
> +struct vhost_dev vhost_dev;
> +GPtrArray *vqs;
> +bool connected;
> +};
> +
> +/* needed so we can use the base realize after specialisation
> +   tweaks */
> +struct VHostUserBaseClass {
> +/*< private >*/
> +VirtioDeviceClass parent_class;
> +/*< public >*/
> +DeviceRealize parent_realize;
> +};
> +
> +/* shared for the benefit of the derived pci class */
> +#define TYPE_VHOST_USER_DEVICE "vhost-user-device"
> +
> +#endif /* QEMU_VHOST_USER_DEVICE_H */
> diff --git a/hw/virtio/vhost-user-device.c b/hw/virtio/vhost-user-device.c
> new file mode 100644
> index 00..b0239fa033
> --- /dev/null
> +++ b/hw/virtio/vhost-user-device.c
> @@ -0,0 +1,324 @@
> +/*
> + * Generic vhost-user stub. This can be used to connect to any
> + * vhost-user backend. All configuration details must be handled by
> + * the vhost-user daemon itself
> + *
> + * Copyright (c) 2023 Linaro Ltd
> + * Author: Alex Bennée 
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "hw/qdev-properties.h"
> +#include "hw/virtio/virtio-bus.h"
> +#include "hw/virtio/vhost-user-device.h"
> +#include "qemu/error-report.h"
> +
> +static void vub_start(VirtIODevice *vdev)
> +{
> +BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
> +VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
> +VHostUserBase *vub = VHOST_USER_BASE(vdev);
> +int ret, i;
> +
> +if (!k->set_guest_notifiers) {
> +error_report("binding does not support guest notifiers");
> +return;
> +}
> +
> +ret = vhost_dev_enable_notifiers(>vhost_dev, vdev);
> +if (ret < 0) {
> +error_report("Error enabling host notifiers: %d", -ret);
> +return;
> +}
> +
> +ret = k->set_guest_notifiers(qbus->parent, vub->vhost_dev.nvqs, true);
> +if (ret < 0) {
> +error_report("Error binding guest notifier: %d", -ret);
> +goto err_host_notifiers;
> +}
> +
> +vub->vhost_dev.acked_features = vdev->guest_features;
> +
> +ret = vhost_dev_start(>vhost_dev, vdev, true);
> +if (ret < 0) {
> +error_report("Error starting vhost-user-device: %d", -ret);
> +goto err_guest_notifiers;
> +}
> +
> +/*
> + * guest_notifier_mask/pending not used yet, so just unmask
> + * everything here. virtio-pci will do the right thing by
> + * enabling/disabling irqfd.
> + */
> +for (i = 0; i < vub->vhost_dev.nvqs; i++) {
> +vhost_virtqueue_mask(>vhost_dev, vdev, i, false);
> +}
> +
> +return;
> +
> +err_guest_notifiers:
> +k->set_guest_notifiers(qbus->parent, vub->vhost_dev.nvqs, false);
> +err_host_notifiers:
> +vhost_dev_disable_notifiers(>vhost_dev, vdev);
> +}
> +
> +static void vub_stop(VirtIODevice *vdev)
> +{
> +VHostUserBase *vub = VHOST_USER_BASE(vdev);
> +BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
> +VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
> +int ret;
> +
> +if (!k->set_guest_notifiers) {
> +return;
> +}
> +
> +vhost_dev_stop(>vhost_dev, vdev, true);
> +
> +ret = k->set_guest_notifiers(qbus->parent, 

[PATCH 2/2] virtio: Drop out of coroutine context in virtio_load()

2023-09-05 Thread Kevin Wolf
virtio_load() as a whole should run in coroutine context because it
reads from the migration stream and we don't want this to block.

However, it calls virtio_set_features_nocheck() and devices don't
expect their .set_features callback to run in a coroutine and therefore
call functions that may not be called in coroutine context. To fix this,
drop out of coroutine context for calling virtio_set_features_nocheck().

Without this fix, the following crash was reported:

  #0  __pthread_kill_implementation (threadid=, 
signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
  #1  0x7efc738c05d3 in __pthread_kill_internal (signo=6, 
threadid=) at pthread_kill.c:78
  #2  0x7efc73873d26 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/posix/raise.c:26
  #3  0x7efc738477f3 in __GI_abort () at abort.c:79
  #4  0x7efc7384771b in __assert_fail_base (fmt=0x7efc739dbcb8 "", 
assertion=assertion@entry=0x560aebfbf5cf "!qemu_in_coroutine()",
 file=file@entry=0x560aebfcd2d4 "../block/graph-lock.c", 
line=line@entry=275, function=function@entry=0x560aebfcd34d "void 
bdrv_graph_rdlock_main_loop(void)") at assert.c:92
  #5  0x7efc7386ccc6 in __assert_fail (assertion=0x560aebfbf5cf 
"!qemu_in_coroutine()", file=0x560aebfcd2d4 "../block/graph-lock.c", line=275,
 function=0x560aebfcd34d "void bdrv_graph_rdlock_main_loop(void)") at 
assert.c:101
  #6  0x560aebcd8dd6 in bdrv_register_buf ()
  #7  0x560aeb97ed97 in ram_block_added.llvm ()
  #8  0x560aebb8303f in ram_block_add.llvm ()
  #9  0x560aebb834fa in qemu_ram_alloc_internal.llvm ()
  #10 0x560aebb2ac98 in vfio_region_mmap ()
  #11 0x560aebb3ea0f in vfio_bars_register ()
  #12 0x560aebb3c628 in vfio_realize ()
  #13 0x560aeb90f0c2 in pci_qdev_realize ()
  #14 0x560aebc40305 in device_set_realized ()
  #15 0x560aebc48e07 in property_set_bool.llvm ()
  #16 0x560aebc46582 in object_property_set ()
  #17 0x560aebc4cd58 in object_property_set_qobject ()
  #18 0x560aebc46ba7 in object_property_set_bool ()
  #19 0x560aeb98b3ca in qdev_device_add_from_qdict ()
  #20 0x560aebb1fbaf in virtio_net_set_features ()
  #21 0x560aebb46b51 in virtio_set_features_nocheck ()
  #22 0x560aebb47107 in virtio_load ()
  #23 0x560aeb9ae7ce in vmstate_load_state ()
  #24 0x560aeb9d2ee9 in qemu_loadvm_state_main ()
  #25 0x560aeb9d45e1 in qemu_loadvm_state ()
  #26 0x560aeb9bc32c in process_incoming_migration_co.llvm ()
  #27 0x560aebeace56 in coroutine_trampoline.llvm ()

Cc: qemu-sta...@nongnu.org
Buglink: https://issues.redhat.com/browse/RHEL-832
Signed-off-by: Kevin Wolf 
---
 hw/virtio/virtio.c | 45 -
 1 file changed, 40 insertions(+), 5 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 309038fd46..969c25f4cf 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -2825,8 +2825,9 @@ static int virtio_device_put(QEMUFile *f, void *opaque, 
size_t size,
 }
 
 /* A wrapper for use as a VMState .get function */
-static int virtio_device_get(QEMUFile *f, void *opaque, size_t size,
- const VMStateField *field)
+static int coroutine_mixed_fn
+virtio_device_get(QEMUFile *f, void *opaque, size_t size,
+  const VMStateField *field)
 {
 VirtIODevice *vdev = VIRTIO_DEVICE(opaque);
 DeviceClass *dc = DEVICE_CLASS(VIRTIO_DEVICE_GET_CLASS(vdev));
@@ -2853,6 +2854,39 @@ static int virtio_set_features_nocheck(VirtIODevice 
*vdev, uint64_t val)
 return bad ? -1 : 0;
 }
 
+typedef struct VirtioSetFeaturesNocheckData {
+Coroutine *co;
+VirtIODevice *vdev;
+uint64_t val;
+int ret;
+} VirtioSetFeaturesNocheckData;
+
+static void virtio_set_features_nocheck_bh(void *opaque)
+{
+VirtioSetFeaturesNocheckData *data = opaque;
+
+data->ret = virtio_set_features_nocheck(data->vdev, data->val);
+aio_co_wake(data->co);
+}
+
+static int coroutine_mixed_fn
+virtio_set_features_nocheck_maybe_co(VirtIODevice *vdev, uint64_t val)
+{
+if (qemu_in_coroutine()) {
+VirtioSetFeaturesNocheckData data = {
+.co = qemu_coroutine_self(),
+.vdev = vdev,
+.val = val,
+};
+aio_bh_schedule_oneshot(qemu_get_current_aio_context(),
+virtio_set_features_nocheck_bh, );
+qemu_coroutine_yield();
+return data.ret;
+} else {
+return virtio_set_features_nocheck(vdev, val);
+}
+}
+
 int virtio_set_features(VirtIODevice *vdev, uint64_t val)
 {
 int ret;
@@ -2906,7 +2940,8 @@ size_t virtio_get_config_size(const 
VirtIOConfigSizeParams *params,
 return config_size;
 }
 
-int virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
+int coroutine_mixed_fn
+virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
 {
 int i, ret;
 int32_t config_len;
@@ -3023,14 +3058,14 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int 
version_id)
  

[PATCH 1/2] vmstate: Mark VMStateInfo.get/put() coroutine_mixed_fn

2023-09-05 Thread Kevin Wolf
Migration code can run both in coroutine context (the usual case) and
non-coroutine context (at least savevm/loadvm for snapshots). This also
affects the VMState callbacks, and devices must consider this. Change
the callback definition in VMStateInfo to be explicit about it.

Signed-off-by: Kevin Wolf 
---
 include/migration/vmstate.h | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index d1b8abe08d..e4db910339 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -41,9 +41,11 @@ typedef struct VMStateField VMStateField;
  */
 struct VMStateInfo {
 const char *name;
-int (*get)(QEMUFile *f, void *pv, size_t size, const VMStateField *field);
-int (*put)(QEMUFile *f, void *pv, size_t size, const VMStateField *field,
-   JSONWriter *vmdesc);
+int coroutine_mixed_fn (*get)(QEMUFile *f, void *pv, size_t size,
+  const VMStateField *field);
+int coroutine_mixed_fn (*put)(QEMUFile *f, void *pv, size_t size,
+  const VMStateField *field,
+  JSONWriter *vmdesc);
 };
 
 enum VMStateFlags {
-- 
2.41.0




[PATCH 0/2] virtio: Drop out of coroutine context in virtio_load()

2023-09-05 Thread Kevin Wolf
This fixes a recently introduced assertion failure that was reported to
happen when migrating virtio-net with a failover. The latent bug that
we're executing code in coroutine context that was never supposed to run
there has existed for a long time. However, the new assertion that
callers of bdrv_graph_rdlock_main_loop() don't run in coroutine context
makes it very visible because it's now always a crash.

Kevin Wolf (2):
  vmstate: Mark VMStateInfo.get/put() coroutine_mixed_fn
  virtio: Drop out of coroutine context in virtio_load()

 include/migration/vmstate.h |  8 ---
 hw/virtio/virtio.c  | 45 -
 2 files changed, 45 insertions(+), 8 deletions(-)

-- 
2.41.0




Re: [PATCH v6 04/17] nbd: Prepare for 64-bit request effect lengths

2023-09-05 Thread Vladimir Sementsov-Ogievskiy

On 05.09.23 17:24, Eric Blake wrote:

On Mon, Sep 04, 2023 at 07:15:04PM +0300, Vladimir Sementsov-Ogievskiy wrote:

On 29.08.23 20:58, Eric Blake wrote:

Widen the length field of NBDRequest to 64-bits, although we can
assert that all current uses are still under 32 bits: either because
of NBD_MAX_BUFFER_SIZE which is even smaller (and where size_t can
still be appropriate, even on 32-bit platforms), or because nothing
ever puts us into NBD_MODE_EXTENDED yet (and while future patches will
allow larger transactions, the lengths in play here are still capped
at 32-bit).  There are no semantic changes, other than a typo fix in a
couple of error messages.

Signed-off-by: Eric Blake 
---

v6: fix sign extension bug

v5: tweak commit message, adjust a few more spots [Vladimir]

v4: split off enum changes to earlier patches [Vladimir]


[..]


--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1165,7 +1165,7 @@ static int nbd_negotiate_options(NBDClient *client, Error 
**errp)
   client->optlen = length;

   if (length > NBD_MAX_BUFFER_SIZE) {
-error_setg(errp, "len (%" PRIu32" ) is larger than max len (%u)",
+error_setg(errp, "len (%" PRIu32 ") is larger than max len (%u)",
  length, NBD_MAX_BUFFER_SIZE);
   return -EINVAL;
   }
@@ -1441,7 +1441,7 @@ static int coroutine_fn nbd_receive_request(NBDClient 
*client, NBDRequest *reque
   request->type   = lduw_be_p(buf + 6);
   request->cookie = ldq_be_p(buf + 8);
   request->from   = ldq_be_p(buf + 16);
-request->len= ldl_be_p(buf + 24);
+request->len= (uint32_t)ldl_be_p(buf + 24); /* widen 32 to 64 bits */


should it be "(uint64_t)" ?


No. ldl_be_p() returns an int.  Widening int to 64 bits sign-extends;
we have to first make it unsigned (by casting to uint32_t) so that we
then have an unsigned value that widens by zero-extension.

Maybe we should fix ldl_bd_p() and friends to favor unsigned values.
'git grep ldul_be' has zero hits; even though Peter just touched the
docs patch claiming it exists:

https://lists.gnu.org/archive/html/qemu-devel/2023-09/msg00645.html






   trace_nbd_receive_request(magic, request->flags, request->type,
 request->from, request->len);
@@ -1899,7 +1899,7 @@ static int coroutine_fn 
nbd_co_send_simple_reply(NBDClient *client,
NBDRequest *request,
uint32_t error,
void *data,
- size_t len,
+ uint64_t len,
Error **errp)
   {
   NBDSimpleReply reply;
@@ -1910,6 +1910,7 @@ static int coroutine_fn 
nbd_co_send_simple_reply(NBDClient *client,
   };

   assert(!len || !nbd_err);
+assert(len <= NBD_MAX_STRING_SIZE);


NBD_MAX_BUFFER_SIZE ?


No. MAX_STRING_SIZE is 4k, MAX_BUFFER_SIZE is 32M.  The smaller size
is the correct bound here (an error message has to be transmitted as a
single string, and the recipient does not expect more than a 4k error
message).



I miss something.. Why it's an error message? It's may be a simple reply for 
CMD_READ, where len is request->len


--
Best regards,
Vladimir




Re: [PATCH v6 06/17] nbd/server: Support a request payload

2023-09-05 Thread Vladimir Sementsov-Ogievskiy

On 29.08.23 20:58, Eric Blake wrote:

Upcoming additions to support NBD 64-bit effect lengths allow for the
possibility to distinguish between payload length (capped at 32M) and
effect length (64 bits, although we generally assume 63 bits because
of off_t limitations).  Without that extension, only the NBD_CMD_WRITE
request has a payload; but with the extension, it makes sense to allow
at least NBD_CMD_BLOCK_STATUS to have both a payload and effect length
in a future patch (where the payload is a limited-size struct that in
turn gives the real effect length as well as a subset of known ids for
which status is requested).  Other future NBD commands may also have a
request payload, so the 64-bit extension introduces a new
NBD_CMD_FLAG_PAYLOAD_LEN that distinguishes between whether the header
length is a payload length or an effect length, rather than
hard-coding the decision based on the command; although a client
should never send a command with a payload without the negotiation
phase proving such extension is available, we are now able to
gracefully fail unexpected client payloads while keeping the
connection alive.  Note that we do not support the payload version of
BLOCK_STATUS yet.

Signed-off-by: Eric Blake 
---

v5: retitled from v4 13/24, rewrite on top of previous patch's switch
statement [Vladimir]

v4: less indentation on several 'if's [Vladimir]
---
  nbd/server.c | 33 -
  nbd/trace-events |  1 +
  2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index dd3ab59224c..adcfcdeacb7 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2334,7 +2334,8 @@ static int coroutine_fn 
nbd_co_receive_request(NBDRequestData *req,
 Error **errp)
  {
  NBDClient *client = req->client;
-bool check_length = false;
+bool extended_with_payload;
+bool check_length;
  bool check_rofs = false;
  bool allocate_buffer = false;
  unsigned payload_len = 0;
@@ -2350,6 +2351,9 @@ static int coroutine_fn 
nbd_co_receive_request(NBDRequestData *req,

  trace_nbd_co_receive_request_decode_type(request->cookie, request->type,
   nbd_cmd_lookup(request->type));
+check_length = extended_with_payload = client->mode >= NBD_MODE_EXTENDED &&
+request->flags & NBD_CMD_FLAG_PAYLOAD_LEN;
+
  switch (request->type) {
  case NBD_CMD_DISC:
  /* Special case: we're going to disconnect without a reply,
@@ -2366,6 +2370,14 @@ static int coroutine_fn 
nbd_co_receive_request(NBDRequestData *req,
  break;

  case NBD_CMD_WRITE:
+if (client->mode >= NBD_MODE_EXTENDED) {
+if (!extended_with_payload) {
+/* The client is noncompliant. Trace it, but proceed. */
+trace_nbd_co_receive_ext_payload_compliance(request->from,
+request->len);
+}
+valid_flags |= NBD_CMD_FLAG_PAYLOAD_LEN;
+}
  payload_len = request->len;
  check_length = true;
  allocate_buffer = true;
@@ -2407,6 +2419,15 @@ static int coroutine_fn 
nbd_co_receive_request(NBDRequestData *req,


more context:

/* Payload and buffer handling. */
if (!payload_len) {
req->complete = true;
}
if (check_length && request->len > NBD_MAX_BUFFER_SIZE) {
/* READ, WRITE, CACHE */
error_setg(errp, "len (%" PRIu64 ") is larger than max len (%u)",
   request->len, NBD_MAX_BUFFER_SIZE);
return -EINVAL;
}



+if (extended_with_payload && !allocate_buffer) {


it's correct but strange, as allocate_buffer is (READ or WRITE), and READ is 
totally unrelated here.


+/*
+ * For now, we don't support payloads on other commands; but
+ * we can keep the connection alive by ignoring the payload.


Was it in specification, that we can safely ignore unknown payload for known 
commands?


+ */
+assert(request->type != NBD_CMD_WRITE);
+payload_len = request->len;


what I don't like here, is that we already set req->complete to true for this 
request, when payload_len was zero.

Probably, for extended_with_payload requests we should initialize payload_len 
at top of the function?


+request->len = 0;
+}
  if (allocate_buffer) {>   /* READ, WRITE */
  req->data = blk_try_blockalign(client->exp->common.blk,
@@ -2417,10 +2438,12 @@ static int coroutine_fn 
nbd_co_receive_request(NBDRequestData *req,
  }
  }
  if (payload_len) {
-/* WRITE */
-assert(req->data);
-ret = nbd_read(client->ioc, req->data, payload_len,
-   "CMD_WRITE data", errp);
+if (req->data) {


and req->data is actually (READ or WRITE) ( == allocated_buffer) as well.

I'd prefer here "if (request->type == NBD_CMD_WRITE)" here


+ret 

Re: [PATCH v6 04/17] nbd: Prepare for 64-bit request effect lengths

2023-09-05 Thread Eric Blake
On Mon, Sep 04, 2023 at 07:15:04PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 29.08.23 20:58, Eric Blake wrote:
> > Widen the length field of NBDRequest to 64-bits, although we can
> > assert that all current uses are still under 32 bits: either because
> > of NBD_MAX_BUFFER_SIZE which is even smaller (and where size_t can
> > still be appropriate, even on 32-bit platforms), or because nothing
> > ever puts us into NBD_MODE_EXTENDED yet (and while future patches will
> > allow larger transactions, the lengths in play here are still capped
> > at 32-bit).  There are no semantic changes, other than a typo fix in a
> > couple of error messages.
> > 
> > Signed-off-by: Eric Blake 
> > ---
> > 
> > v6: fix sign extension bug
> > 
> > v5: tweak commit message, adjust a few more spots [Vladimir]
> > 
> > v4: split off enum changes to earlier patches [Vladimir]
> 
> [..]
> 
> > --- a/nbd/server.c
> > +++ b/nbd/server.c
> > @@ -1165,7 +1165,7 @@ static int nbd_negotiate_options(NBDClient *client, 
> > Error **errp)
> >   client->optlen = length;
> > 
> >   if (length > NBD_MAX_BUFFER_SIZE) {
> > -error_setg(errp, "len (%" PRIu32" ) is larger than max len 
> > (%u)",
> > +error_setg(errp, "len (%" PRIu32 ") is larger than max len 
> > (%u)",
> >  length, NBD_MAX_BUFFER_SIZE);
> >   return -EINVAL;
> >   }
> > @@ -1441,7 +1441,7 @@ static int coroutine_fn nbd_receive_request(NBDClient 
> > *client, NBDRequest *reque
> >   request->type   = lduw_be_p(buf + 6);
> >   request->cookie = ldq_be_p(buf + 8);
> >   request->from   = ldq_be_p(buf + 16);
> > -request->len= ldl_be_p(buf + 24);
> > +request->len= (uint32_t)ldl_be_p(buf + 24); /* widen 32 to 64 bits 
> > */
> 
> should it be "(uint64_t)" ?

No. ldl_be_p() returns an int.  Widening int to 64 bits sign-extends;
we have to first make it unsigned (by casting to uint32_t) so that we
then have an unsigned value that widens by zero-extension.

Maybe we should fix ldl_bd_p() and friends to favor unsigned values.
'git grep ldul_be' has zero hits; even though Peter just touched the
docs patch claiming it exists:

https://lists.gnu.org/archive/html/qemu-devel/2023-09/msg00645.html


> 
> > 
> >   trace_nbd_receive_request(magic, request->flags, request->type,
> > request->from, request->len);
> > @@ -1899,7 +1899,7 @@ static int coroutine_fn 
> > nbd_co_send_simple_reply(NBDClient *client,
> >NBDRequest *request,
> >uint32_t error,
> >void *data,
> > - size_t len,
> > + uint64_t len,
> >Error **errp)
> >   {
> >   NBDSimpleReply reply;
> > @@ -1910,6 +1910,7 @@ static int coroutine_fn 
> > nbd_co_send_simple_reply(NBDClient *client,
> >   };
> > 
> >   assert(!len || !nbd_err);
> > +assert(len <= NBD_MAX_STRING_SIZE);
> 
> NBD_MAX_BUFFER_SIZE ?

No. MAX_STRING_SIZE is 4k, MAX_BUFFER_SIZE is 32M.  The smaller size
is the correct bound here (an error message has to be transmitted as a
single string, and the recipient does not expect more than a 4k error
message).

> 
> with that fixed:
> Reviewed-by: Vladimir Sementsov-Ogievskiy 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




Re: [PATCH 2/2] block: Make more BlockDriver definitions static

2023-09-05 Thread Philippe Mathieu-Daudé

On 5/9/23 15:06, Kevin Wolf wrote:

Most block driver implementations don't have any reason for their
BlockDriver to be public. The only exceptions are bdrv_file, bdrv_raw
and bdrv_qcow2, which are actually used in other source files.

Make all other BlockDriver definitions static if they aren't yet.

Signed-off-by: Kevin Wolf 
---
  block/copy-before-write.c | 2 +-
  block/preallocate.c   | 2 +-
  block/snapshot-access.c   | 2 +-
  3 files changed, 3 insertions(+), 3 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH 1/2] block/meson.build: Restore alphabetical order of files

2023-09-05 Thread Philippe Mathieu-Daudé

On 5/9/23 15:06, Kevin Wolf wrote:

When commit 5e5733e5999 created block/meson.build, the list of
unconditionally added files was in alphabetical order. Later commits
added new files in random places. Reorder the list to be alphabetical
again. (As for ordering foo.c against foo-*.c, there are both ways used
currently; standardise on having foo.c first, even though this is
different from the original commit 5e5733e5999.)

Signed-off-by: Kevin Wolf 
---
  block/meson.build | 12 ++--
  1 file changed, 6 insertions(+), 6 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 




[PATCH 2/2] block: Make more BlockDriver definitions static

2023-09-05 Thread Kevin Wolf
Most block driver implementations don't have any reason for their
BlockDriver to be public. The only exceptions are bdrv_file, bdrv_raw
and bdrv_qcow2, which are actually used in other source files.

Make all other BlockDriver definitions static if they aren't yet.

Signed-off-by: Kevin Wolf 
---
 block/copy-before-write.c | 2 +-
 block/preallocate.c   | 2 +-
 block/snapshot-access.c   | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index b866e42271..9a0e2b69d9 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -503,7 +503,7 @@ static void cbw_close(BlockDriverState *bs)
 s->bcs = NULL;
 }
 
-BlockDriver bdrv_cbw_filter = {
+static BlockDriver bdrv_cbw_filter = {
 .format_name = "copy-before-write",
 .instance_size = sizeof(BDRVCopyBeforeWriteState),
 
diff --git a/block/preallocate.c b/block/preallocate.c
index 4d82125036..3d0f621003 100644
--- a/block/preallocate.c
+++ b/block/preallocate.c
@@ -535,7 +535,7 @@ static void preallocate_child_perm(BlockDriverState *bs, 
BdrvChild *c,
 }
 }
 
-BlockDriver bdrv_preallocate_filter = {
+static BlockDriver bdrv_preallocate_filter = {
 .format_name = "preallocate",
 .instance_size = sizeof(BDRVPreallocateState),
 
diff --git a/block/snapshot-access.c b/block/snapshot-access.c
index 67ea339da9..8d4e8932b8 100644
--- a/block/snapshot-access.c
+++ b/block/snapshot-access.c
@@ -108,7 +108,7 @@ static void snapshot_access_child_perm(BlockDriverState 
*bs, BdrvChild *c,
 *nshared = BLK_PERM_ALL;
 }
 
-BlockDriver bdrv_snapshot_access_drv = {
+static BlockDriver bdrv_snapshot_access_drv = {
 .format_name = "snapshot-access",
 
 .bdrv_open  = snapshot_access_open,
-- 
2.41.0




[PATCH 1/2] block/meson.build: Restore alphabetical order of files

2023-09-05 Thread Kevin Wolf
When commit 5e5733e5999 created block/meson.build, the list of
unconditionally added files was in alphabetical order. Later commits
added new files in random places. Reorder the list to be alphabetical
again. (As for ordering foo.c against foo-*.c, there are both ways used
currently; standardise on having foo.c first, even though this is
different from the original commit 5e5733e5999.)

Signed-off-by: Kevin Wolf 
---
 block/meson.build | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/meson.build b/block/meson.build
index 529fc172c6..f351b9d0d3 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -4,41 +4,41 @@ block_ss.add(files(
   'aio_task.c',
   'amend.c',
   'backup.c',
-  'copy-before-write.c',
   'blkdebug.c',
   'blklogwrites.c',
   'blkverify.c',
   'block-backend.c',
   'block-copy.c',
-  'graph-lock.c',
   'commit.c',
+  'copy-before-write.c',
   'copy-on-read.c',
-  'preallocate.c',
-  'progress_meter.c',
   'create.c',
   'crypto.c',
   'dirty-bitmap.c',
   'filter-compress.c',
+  'graph-lock.c',
   'io.c',
   'mirror.c',
   'nbd.c',
   'null.c',
   'plug.c',
+  'preallocate.c',
+  'progress_meter.c',
   'qapi.c',
+  'qcow2.c',
   'qcow2-bitmap.c',
   'qcow2-cache.c',
   'qcow2-cluster.c',
   'qcow2-refcount.c',
   'qcow2-snapshot.c',
   'qcow2-threads.c',
-  'qcow2.c',
   'quorum.c',
   'raw-format.c',
   'reqlist.c',
   'snapshot.c',
   'snapshot-access.c',
-  'throttle-groups.c',
   'throttle.c',
+  'throttle-groups.c',
   'write-threshold.c',
 ), zstd, zlib, gnutls)
 
-- 
2.41.0




[PATCH 0/2] block: Minor cleanups

2023-09-05 Thread Kevin Wolf
I had to go through all block drivers and noticed some details that
aren't as expected, so I thought I might as well fix them...

Kevin Wolf (2):
  block/meson.build: Restore alphabetical order of files
  block: Make more BlockDriver definitions static

 block/copy-before-write.c |  2 +-
 block/preallocate.c   |  2 +-
 block/snapshot-access.c   |  2 +-
 block/meson.build | 12 ++--
 4 files changed, 9 insertions(+), 9 deletions(-)

-- 
2.41.0




Re: deadlock when using iothread during backup_clean()

2023-09-05 Thread Paolo Bonzini

On 9/5/23 12:01, Fiona Ebner wrote:

Can we assume block_job_remove_all_bdrv() to always hold the job's
AioContext?


I think so, see job_unref_locked(), job_prepare_locked() and 
job_finalize_single_locked().  These call the callbacks that ultimately 
get to block_job_remove_all_bdrv().



And if yes, can we just tell bdrv_graph_wrlock() that it
needs to release that before polling to fix the deadlock?


No, but I think it should be released and re-acquired in 
block_job_remove_all_bdrv() itself.


mirror_exit_common() however holds _two_ AioContext locks at the time it 
calls block_job_remove_all_bdrv(), qemu_get_aio_context() has to be 
released and reacquired in mirror_exit_common() itself.


Paolo




Re: deadlock when using iothread during backup_clean()

2023-09-05 Thread Fiona Ebner
Am 05.09.23 um 12:01 schrieb Fiona Ebner:
> 
> Can we assume block_job_remove_all_bdrv() to always hold the job's
> AioContext? And if yes, can we just tell bdrv_graph_wrlock() that it
> needs to release that before polling to fix the deadlock?
> 

I tried a doing something similar as a proof-of-concept

> diff --git a/blockjob.c b/blockjob.c
> index 58c5d64539..1a696241a0 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -198,19 +198,19 @@ void block_job_remove_all_bdrv(BlockJob *job)
>   * one to make sure that such a concurrent access does not attempt
>   * to process an already freed BdrvChild.
>   */
> -bdrv_graph_wrlock(NULL);
>  while (job->nodes) {
>  GSList *l = job->nodes;
>  BdrvChild *c = l->data;
>  
>  job->nodes = l->next;
>  
> +bdrv_graph_wrlock(c->bs);
>  bdrv_op_unblock_all(c->bs, job->blocker);
>  bdrv_root_unref_child(c);
> +bdrv_graph_wrunlock();
>  
>  g_slist_free_1(l);
>  }
> -bdrv_graph_wrunlock();
>  }

and while it did get slightly further, I ran into another deadlock with

> #0  0x7f1941155136 in __ppoll (fds=0x55992068fb20, nfds=2, 
> timeout=, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
> #1  0x55991c6a1a3f in qemu_poll_ns (fds=0x55992068fb20, nfds=2, 
> timeout=-1) at ../util/qemu-timer.c:339
> #2  0x55991c67ed6c in fdmon_poll_wait (ctx=0x55991f058810, 
> ready_list=0x7ffda8c987b0, timeout=-1) at ../util/fdmon-poll.c:79
> #3  0x55991c67e6a8 in aio_poll (ctx=0x55991f058810, blocking=true) at 
> ../util/aio-posix.c:670
> #4  0x55991c50a763 in bdrv_graph_wrlock (bs=0x0) at 
> ../block/graph-lock.c:145
> #5  0x55991c4daf85 in bdrv_close (bs=0x55991fff2f30) at ../block.c:5166
> #6  0x55991c4dc050 in bdrv_delete (bs=0x55991fff2f30) at ../block.c:5606
> #7  0x55991c4df205 in bdrv_unref (bs=0x55991fff2f30) at ../block.c:7173
> #8  0x55991c4fb8ca in bdrv_cbw_drop (bs=0x55991fff2f30) at 
> ../block/copy-before-write.c:566
> #9  0x55991c4f9685 in backup_clean (job=0x55992016d0b0) at 
> ../block/backup.c:105




deadlock when using iothread during backup_clean()

2023-09-05 Thread Fiona Ebner
Hi,

as the title says, a deadlock is possible in backup_clean() when using a
drive with an IO thread. The main thread keeps waiting in
bdrv_graph_wrlock() for reader_count() to become 0, while the IO thread
waits for its AioContext lock, which the main thread holds (because it
also is the backup job's AioContext). See the stack trace below [1].

The reason why it happens becomes clear with following commit message:

> commit 31b2ddfea304afc498aca8cac171020ef33eb89b
> Author: Kevin Wolf 
> Date:   Mon Jun 5 10:57:10 2023 +0200
> 
> graph-lock: Unlock the AioContext while polling
> 
> If the caller keeps the AioContext lock for a block node in an iothread,
> polling in bdrv_graph_wrlock() deadlocks if the condition isn't
> fulfilled immediately.
> 
> Now that all callers make sure to actually have the AioContext locked
> when they call bdrv_replace_child_noperm() like they should, we can
> change bdrv_graph_wrlock() to take a BlockDriverState whose AioContext
> lock the caller holds (NULL if it doesn't) and unlock it temporarily
> while polling.
>

and noting that bdrv_graph_wrlock() is called with bs being NULL while
the caller holds the IO thread's AioContext lock.

The question is where should the IO thread's AioContext lock be dropped
by the main thread? The "Graph locking part 4 (node management)" series
[0] (also reproduced the dealock with that applied) moves the
bdrv_graph_wrlock() further up to block_job_remove_all_bdrv(), but it
still passes along NULL as the argument.

Can we assume block_job_remove_all_bdrv() to always hold the job's
AioContext? And if yes, can we just tell bdrv_graph_wrlock() that it
needs to release that before polling to fix the deadlock?

Best Regards,
Fiona

[0]: https://lists.nongnu.org/archive/html/qemu-devel/2023-08/msg02831.html

[1]:

IO thread:

> Thread 3 (Thread 0x7fa2209596c0 (LWP 119031) "qemu-system-x86"):
> #0  futex_wait (private=0, expected=2, futex_word=0x5648fc93b2c0) at 
> ../sysdeps/nptl/futex-internal.h:146
> #1  __GI___lll_lock_wait (futex=futex@entry=0x5648fc93b2c0, private=0) at 
> ./nptl/lowlevellock.c:49
> #2  0x7fa2240e532a in lll_mutex_lock_optimized (mutex=0x5648fc93b2c0) at 
> ./nptl/pthread_mutex_lock.c:48
> #3  ___pthread_mutex_lock (mutex=0x5648fc93b2c0) at 
> ./nptl/pthread_mutex_lock.c:128
> #4  0x5648fa1742f8 in qemu_mutex_lock_impl (mutex=0x5648fc93b2c0, 
> file=0x5648fa446579 "../util/async.c", line=728) at 
> ../util/qemu-thread-posix.c:94
> #5  0x5648fa174528 in qemu_rec_mutex_lock_impl (mutex=0x5648fc93b2c0, 
> file=0x5648fa446579 "../util/async.c", line=728) at 
> ../util/qemu-thread-posix.c:149
> #6  0x5648fa18dd6f in aio_context_acquire (ctx=0x5648fc93b260) at 
> ../util/async.c:728
> #7  0x5648fa18dce4 in aio_co_enter (ctx=0x5648fc93b260, 
> co=0x7fa217f49210) at ../util/async.c:710
> #8  0x5648fa18dc31 in aio_co_wake (co=0x7fa217f49210) at 
> ../util/async.c:695
> #9  0x5648fa08e4a2 in luring_process_completions (s=0x5648fdf2dd00) at 
> ../block/io_uring.c:216
> #10 0x5648fa08e6c7 in luring_process_completions_and_submit 
> (s=0x5648fdf2dd00) at ../block/io_uring.c:268
> #11 0x5648fa08e727 in qemu_luring_completion_cb (opaque=0x5648fdf2dd00) 
> at ../block/io_uring.c:284
> #12 0x5648fa16f433 in aio_dispatch_handler (ctx=0x5648fc93b260, 
> node=0x5648fdf2de50) at ../util/aio-posix.c:372
> #13 0x5648fa16f506 in aio_dispatch_ready_handlers (ctx=0x5648fc93b260, 
> ready_list=0x7fa220954180) at ../util/aio-posix.c:401
> #14 0x5648fa16ff7d in aio_poll (ctx=0x5648fc93b260, blocking=true) at 
> ../util/aio-posix.c:723
> #15 0x5648f9fbaa59 in iothread_run (opaque=0x5648fc5ed6b0) at 
> ../iothread.c:63
> #16 0x5648fa175046 in qemu_thread_start (args=0x5648fc93b8f0) at 
> ../util/qemu-thread-posix.c:541
> #17 0x7fa2240e2044 in start_thread (arg=) at 
> ./nptl/pthread_create.c:442
> #18 0x7fa2241625fc in clone3 () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
> 

main thread:

> Thread 1 (Thread 0x7fa2214bf340 (LWP 119029) "qemu-system-x86"):
> #0  0x7fa224155136 in __ppoll (fds=0x5648fdf2ce50, nfds=2, 
> timeout=, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
> #1  0x5648fa193102 in qemu_poll_ns (fds=0x5648fdf2ce50, nfds=2, 
> timeout=-1) at ../util/qemu-timer.c:339
> #2  0x5648fa17042f in fdmon_poll_wait (ctx=0x5648fc6c9740, 
> ready_list=0x72757260, timeout=-1) at ../util/fdmon-poll.c:79
> #3  0x5648fa16fd6b in aio_poll (ctx=0x5648fc6c9740, blocking=true) at 
> ../util/aio-posix.c:670
> #4  0x5648f9ffc0bc in bdrv_graph_wrlock (bs=0x0) at 
> ../block/graph-lock.c:145
> #5  0x5648f9fc78fa in bdrv_replace_child_noperm (child=0x5648fd729fd0, 
> new_bs=0x0) at ../block.c:2897
> #6  0x5648f9fc8487 in bdrv_root_unref_child (child=0x5648fd729fd0) at 
> ../block.c:3227
> #7  0x5648f9fd4b50 in block_job_remove_all_bdrv (job=0x5648fe0b6960) at 
> ../blockjob.c:208
> #8  0x5648f9feb11b in 

[PATCH v5 2/3] hw/i2c: add mctp core

2023-09-05 Thread Klaus Jensen
From: Klaus Jensen 

Add an abstract MCTP over I2C endpoint model. This implements MCTP
control message handling as well as handling the actual I2C transport
(packetization).

Devices are intended to derive from this and implement the class
methods.

Parts of this implementation is inspired by code[1] previously posted by
Jonathan Cameron.

Squashed a fix[2] from Matt Johnston.

  [1]: 
https://lore.kernel.org/qemu-devel/20220520170128.4436-1-jonathan.came...@huawei.com/
  [2]: 
https://lore.kernel.org/qemu-devel/20221121080445.ga29...@codeconstruct.com.au/

Tested-by: Jonathan Cameron 
Reviewed-by: Jonathan Cameron 
Signed-off-by: Klaus Jensen 
---
 MAINTAINERS   |   7 +
 hw/arm/Kconfig|   1 +
 hw/i2c/Kconfig|   4 +
 hw/i2c/mctp.c | 432 ++
 hw/i2c/meson.build|   1 +
 hw/i2c/trace-events   |  13 ++
 include/hw/i2c/mctp.h | 125 +++
 include/net/mctp.h|  35 
 8 files changed, 618 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 6111b6b4d928..8ca71167607d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3395,6 +3395,13 @@ F: tests/qtest/adm1272-test.c
 F: tests/qtest/max34451-test.c
 F: tests/qtest/isl_pmbus_vr-test.c
 
+MCTP I2C Transport
+M: Klaus Jensen 
+S: Maintained
+F: hw/i2c/mctp.c
+F: include/hw/i2c/mctp.h
+F: include/net/mctp.h
+
 Firmware schema specifications
 M: Philippe Mathieu-Daudé 
 R: Daniel P. Berrange 
diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
index 7e6834844051..5bcb1e0e8a6f 100644
--- a/hw/arm/Kconfig
+++ b/hw/arm/Kconfig
@@ -541,6 +541,7 @@ config ASPEED_SOC
 select DS1338
 select FTGMAC100
 select I2C
+select I2C_MCTP
 select DPS310
 select PCA9552
 select SERIAL
diff --git a/hw/i2c/Kconfig b/hw/i2c/Kconfig
index 14886b35dac2..2b2a50b83d1e 100644
--- a/hw/i2c/Kconfig
+++ b/hw/i2c/Kconfig
@@ -6,6 +6,10 @@ config I2C_DEVICES
 # to any board's i2c bus
 bool
 
+config I2C_MCTP
+bool
+select I2C
+
 config SMBUS
 bool
 select I2C
diff --git a/hw/i2c/mctp.c b/hw/i2c/mctp.c
new file mode 100644
index ..8d8e74567745
--- /dev/null
+++ b/hw/i2c/mctp.c
@@ -0,0 +1,432 @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * SPDX-FileCopyrightText: Copyright (c) 2023 Samsung Electronics Co., Ltd.
+ * SPDX-FileContributor: Klaus Jensen 
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/main-loop.h"
+
+#include "hw/qdev-properties.h"
+#include "hw/i2c/i2c.h"
+#include "hw/i2c/smbus_master.h"
+#include "hw/i2c/mctp.h"
+#include "net/mctp.h"
+
+#include "trace.h"
+
+/* DSP0237 1.2.0, Figure 1 */
+typedef struct MCTPI2CPacketHeader {
+uint8_t dest;
+#define MCTP_I2C_COMMAND_CODE 0xf
+uint8_t command_code;
+uint8_t byte_count;
+uint8_t source;
+} MCTPI2CPacketHeader;
+
+typedef struct MCTPI2CPacket {
+MCTPI2CPacketHeader i2c;
+MCTPPacket  mctp;
+} MCTPI2CPacket;
+
+#define i2c_mctp_payload_offset offsetof(MCTPI2CPacket, mctp.payload)
+#define i2c_mctp_payload(buf) (buf + i2c_mctp_payload_offset)
+
+/* DSP0236 1.3.0, Figure 20 */
+typedef struct MCTPControlMessage {
+#define MCTP_MESSAGE_TYPE_CONTROL 0x0
+uint8_t type;
+#define MCTP_CONTROL_FLAGS_RQ   (1 << 7)
+#define MCTP_CONTROL_FLAGS_D(1 << 6)
+uint8_t flags;
+uint8_t command_code;
+uint8_t data[];
+} MCTPControlMessage;
+
+enum MCTPControlCommandCodes {
+MCTP_CONTROL_SET_EID= 0x01,
+MCTP_CONTROL_GET_EID= 0x02,
+MCTP_CONTROL_GET_VERSION= 0x04,
+MCTP_CONTROL_GET_MESSAGE_TYPE_SUPPORT   = 0x05,
+};
+
+#define MCTP_CONTROL_ERROR_UNSUPPORTED_CMD 0x5
+
+#define i2c_mctp_control_data_offset \
+(i2c_mctp_payload_offset + offsetof(MCTPControlMessage, data))
+#define i2c_mctp_control_data(buf) (buf + i2c_mctp_control_data_offset)
+
+/**
+ * The byte count field in the SMBUS Block Write containers the number of bytes
+ * *following* the field itself.
+ *
+ * This is at least 5.
+ *
+ * 1 byte for the MCTP/I2C piggy-backed I2C source address in addition to the
+ * size of the MCTP transport/packet header.
+ */
+#define MCTP_I2C_BYTE_COUNT_OFFSET (sizeof(MCTPPacketHeader) + 1)
+
+void i2c_mctp_schedule_send(MCTPI2CEndpoint *mctp)
+{
+I2CBus *i2c = I2C_BUS(qdev_get_parent_bus(DEVICE(mctp)));
+
+mctp->tx.state = I2C_MCTP_STATE_TX_START_SEND;
+
+i2c_bus_master(i2c, mctp->tx.bh);
+}
+
+static void i2c_mctp_tx(void *opaque)
+{
+DeviceState *dev = DEVICE(opaque);
+I2CBus *i2c = I2C_BUS(qdev_get_parent_bus(dev));
+I2CSlave *slave = I2C_SLAVE(dev);
+MCTPI2CEndpoint *mctp = MCTP_I2C_ENDPOINT(dev);
+MCTPI2CEndpointClass *mc = MCTP_I2C_ENDPOINT_GET_CLASS(mctp);
+MCTPI2CPacket *pkt = (MCTPI2CPacket *)mctp->buffer;
+uint8_t flags = 0;
+
+switch (mctp->tx.state) {
+case I2C_MCTP_STATE_TX_SEND_BYTE:
+if (mctp->pos < mctp->len) {
+uint8_t byte = mctp->buffer[mctp->pos];
+
+

[PATCH v5 0/3] hw/{i2c,nvme}: mctp endpoint, nvme management interface model

2023-09-05 Thread Klaus Jensen
This adds a generic MCTP endpoint model that other devices may derive
from.

Also included is a very basic implementation of an NVMe-MI device,
supporting only a small subset of the required commands.

Since this all relies on i2c target mode, this can currently only be
used with an SoC that includes the Aspeed I2C controller.

The easiest way to get up and running with this, is to grab my buildroot
overlay[1] (aspeed_ast2600evb_nmi_defconfig). It includes modified a
modified dts as well as a couple of required packages.

QEMU can then be launched along these lines:

  qemu-system-arm \
-nographic \
-M ast2600-evb \
-kernel output/images/zImage \
-initrd output/images/rootfs.cpio \
-dtb output/images/aspeed-ast2600-evb-nmi.dtb \
-nic user,hostfwd=tcp::-:22 \
-device nmi-i2c,address=0x3a \
-serial mon:stdio

>From within the booted system,

  mctp addr add 8 dev mctpi2c15
  mctp link set mctpi2c15 up
  mctp route add 9 via mctpi2c15
  mctp neigh add 9 dev mctpi2c15 lladdr 0x3a
  mi-mctp 1 9 info

Comments are very welcome!

  [1]: https://github.com/birkelund/hwtests/tree/main/br2-external

Signed-off-by: Klaus Jensen 
---
Changes in v5:
- Added a nmi_scratch_append() that asserts available space in the
  scratch buffer. This is a similar defensive strategy as used in
  hw/i2c/mctp.c
- Various small fixups in response to review (Jonathan)
- Link to v4: 
https://lore.kernel.org/r/20230823-nmi-i2c-v4-0-2b0f86e5b...@samsung.com

---
Klaus Jensen (3):
  hw/i2c: add smbus pec utility function
  hw/i2c: add mctp core
  hw/nvme: add nvme management interface model

 MAINTAINERS   |   7 +
 hw/arm/Kconfig|   1 +
 hw/i2c/Kconfig|   4 +
 hw/i2c/mctp.c | 432 ++
 hw/i2c/meson.build|   1 +
 hw/i2c/smbus_master.c |  26 +++
 hw/i2c/trace-events   |  13 ++
 hw/nvme/Kconfig   |   4 +
 hw/nvme/meson.build   |   1 +
 hw/nvme/nmi-i2c.c | 424 +
 hw/nvme/trace-events  |   6 +
 include/hw/i2c/mctp.h | 125 
 include/hw/i2c/smbus_master.h |   2 +
 include/net/mctp.h|  35 
 14 files changed, 1081 insertions(+)
---
base-commit: 17780edd81d27fcfdb7a802efc870a99788bd2fc
change-id: 20230822-nmi-i2c-d804ed5be7e6

Best regards,
-- 
Klaus Jensen 




[PATCH v5 1/3] hw/i2c: add smbus pec utility function

2023-09-05 Thread Klaus Jensen
From: Klaus Jensen 

Add i2c_smbus_pec() to calculate the SMBus Packet Error Code for a
message.

Reviewed-by: Jonathan Cameron 
Signed-off-by: Klaus Jensen 
---
 hw/i2c/smbus_master.c | 26 ++
 include/hw/i2c/smbus_master.h |  2 ++
 2 files changed, 28 insertions(+)

diff --git a/hw/i2c/smbus_master.c b/hw/i2c/smbus_master.c
index 6a53c34e70b7..01a8e4700222 100644
--- a/hw/i2c/smbus_master.c
+++ b/hw/i2c/smbus_master.c
@@ -15,6 +15,32 @@
 #include "hw/i2c/i2c.h"
 #include "hw/i2c/smbus_master.h"
 
+static uint8_t crc8(uint16_t data)
+{
+int i;
+
+for (i = 0; i < 8; i++) {
+if (data & 0x8000) {
+data ^= 0x1070U << 3;
+}
+
+data <<= 1;
+}
+
+return (uint8_t)(data >> 8);
+}
+
+uint8_t i2c_smbus_pec(uint8_t crc, uint8_t *buf, size_t len)
+{
+int i;
+
+for (i = 0; i < len; i++) {
+crc = crc8((crc ^ buf[i]) << 8);
+}
+
+return crc;
+}
+
 /* Master device commands.  */
 int smbus_quick_command(I2CBus *bus, uint8_t addr, int read)
 {
diff --git a/include/hw/i2c/smbus_master.h b/include/hw/i2c/smbus_master.h
index bb13bc423c22..d90f81767d86 100644
--- a/include/hw/i2c/smbus_master.h
+++ b/include/hw/i2c/smbus_master.h
@@ -27,6 +27,8 @@
 
 #include "hw/i2c/i2c.h"
 
+uint8_t i2c_smbus_pec(uint8_t crc, uint8_t *buf, size_t len);
+
 /* Master device commands.  */
 int smbus_quick_command(I2CBus *bus, uint8_t addr, int read);
 int smbus_receive_byte(I2CBus *bus, uint8_t addr);

-- 
2.42.0




[PATCH v5 3/3] hw/nvme: add nvme management interface model

2023-09-05 Thread Klaus Jensen
From: Klaus Jensen 

Add the 'nmi-i2c' device that emulates an NVMe Management Interface
controller.

Initial support is very basic (Read NMI DS, Configuration Get).

This is based on previously posted code by Padmakar Kalghatgi, Arun
Kumar Agasar and Saurav Kumar.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/Kconfig  |   4 +
 hw/nvme/meson.build  |   1 +
 hw/nvme/nmi-i2c.c| 424 +++
 hw/nvme/trace-events |   6 +
 4 files changed, 435 insertions(+)

diff --git a/hw/nvme/Kconfig b/hw/nvme/Kconfig
index 8ac90942e55e..1d89a4f4ecea 100644
--- a/hw/nvme/Kconfig
+++ b/hw/nvme/Kconfig
@@ -2,3 +2,7 @@ config NVME_PCI
 bool
 default y if PCI_DEVICES
 depends on PCI
+
+config NVME_NMI_I2C
+bool
+default y if I2C_MCTP
diff --git a/hw/nvme/meson.build b/hw/nvme/meson.build
index 1a6a2ca2f307..7bc85f31c149 100644
--- a/hw/nvme/meson.build
+++ b/hw/nvme/meson.build
@@ -1 +1,2 @@
 system_ss.add(when: 'CONFIG_NVME_PCI', if_true: files('ctrl.c', 'dif.c', 
'ns.c', 'subsys.c'))
+system_ss.add(when: 'CONFIG_NVME_NMI_I2C', if_true: files('nmi-i2c.c'))
diff --git a/hw/nvme/nmi-i2c.c b/hw/nvme/nmi-i2c.c
new file mode 100644
index ..6e0ba7bd2a37
--- /dev/null
+++ b/hw/nvme/nmi-i2c.c
@@ -0,0 +1,424 @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * SPDX-FileCopyrightText: Copyright (c) 2023 Samsung Electronics Co., Ltd.
+ *
+ * SPDX-FileContributor: Padmakar Kalghatgi 
+ * SPDX-FileContributor: Arun Kumar Agasar 
+ * SPDX-FileContributor: Saurav Kumar 
+ * SPDX-FileContributor: Klaus Jensen 
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/crc32c.h"
+#include "hw/registerfields.h"
+#include "hw/i2c/i2c.h"
+#include "hw/i2c/mctp.h"
+#include "net/mctp.h"
+#include "trace.h"
+
+/* NVM Express Management Interface 1.2c, Section 3.1 */
+#define NMI_MAX_MESSAGE_LENGTH 4224
+
+#define TYPE_NMI_I2C_DEVICE "nmi-i2c"
+OBJECT_DECLARE_SIMPLE_TYPE(NMIDevice, NMI_I2C_DEVICE)
+
+typedef struct NMIDevice {
+MCTPI2CEndpoint mctp;
+
+uint8_t buffer[NMI_MAX_MESSAGE_LENGTH];
+uint8_t scratch[NMI_MAX_MESSAGE_LENGTH];
+
+size_t  len;
+int64_t pos;
+} NMIDevice;
+
+FIELD(NMI_MCTPD, MT, 0, 7)
+FIELD(NMI_MCTPD, IC, 7, 1)
+
+#define NMI_MCTPD_MT_NMI 0x4
+#define NMI_MCTPD_IC_ENABLED 0x1
+
+FIELD(NMI_NMP, ROR, 7, 1)
+FIELD(NMI_NMP, NMIMT, 3, 4)
+
+#define NMI_NMP_NMIMT_NVME_MI 0x1
+#define NMI_NMP_NMIMT_NVME_ADMIN 0x2
+
+typedef struct NMIMessage {
+uint8_t mctpd;
+uint8_t nmp;
+uint8_t rsvd2[2];
+uint8_t payload[]; /* includes the Message Integrity Check */
+} NMIMessage;
+
+typedef struct NMIRequest {
+   uint8_t opc;
+   uint8_t rsvd1[3];
+   uint32_t dw0;
+   uint32_t dw1;
+   uint32_t mic;
+} NMIRequest;
+
+FIELD(NMI_CMD_READ_NMI_DS_DW0, DTYP, 24, 8)
+
+typedef enum NMIReadDSType {
+NMI_CMD_READ_NMI_DS_SUBSYSTEM   = 0x0,
+NMI_CMD_READ_NMI_DS_PORTS   = 0x1,
+NMI_CMD_READ_NMI_DS_CTRL_LIST   = 0x2,
+NMI_CMD_READ_NMI_DS_CTRL_INFO   = 0x3,
+NMI_CMD_READ_NMI_DS_OPT_CMD_SUPPORT = 0x4,
+NMI_CMD_READ_NMI_DS_MEB_CMD_SUPPORT = 0x5,
+} NMIReadDSType;
+
+#define NMI_STATUS_INVALID_PARAMETER 0x4
+
+static void nmi_scratch_append(NMIDevice *nmi, const void *buf, size_t count)
+{
+assert(nmi->pos + count <= NMI_MAX_MESSAGE_LENGTH);
+
+memcpy(nmi->scratch + nmi->pos, buf, count);
+nmi->pos += count;
+}
+
+static void nmi_set_parameter_error(NMIDevice *nmi, uint8_t bit, uint16_t byte)
+{
+/* NVM Express Management Interface 1.2c, Figure 30 */
+struct resp {
+uint8_t  status;
+uint8_t  bit;
+uint16_t byte;
+};
+
+struct resp buf = {
+.status = NMI_STATUS_INVALID_PARAMETER,
+.bit = bit & 0x3,
+.byte = byte,
+};
+
+nmi_scratch_append(nmi, , sizeof(buf));
+}
+
+static void nmi_set_error(NMIDevice *nmi, uint8_t status)
+{
+const uint8_t buf[4] = {status,};
+
+nmi_scratch_append(nmi, buf, sizeof(buf));
+}
+
+static void nmi_handle_mi_read_nmi_ds(NMIDevice *nmi, NMIRequest *request)
+{
+I2CSlave *i2c = I2C_SLAVE(nmi);
+
+uint32_t dw0 = le32_to_cpu(request->dw0);
+uint8_t dtyp = FIELD_EX32(dw0, NMI_CMD_READ_NMI_DS_DW0, DTYP);
+const uint8_t *buf;
+size_t len;
+
+trace_nmi_handle_mi_read_nmi_ds(dtyp);
+
+static const uint8_t nmi_ds_subsystem[36] = {
+0x00,   /* success */
+0x20, 0x00, /* response data length */
+0x00,   /* reserved */
+0x00,   /* number of ports */
+0x01,   /* major version */
+0x01,   /* minor version */
+};
+
+/*
+ * Cannot be static (or const) since we need to patch in the i2c
+ * address.
+ */
+const uint8_t nmi_ds_ports[36] = {
+0x00,   /* success */
+0x20, 0x00, /* response data length */
+0x00,   /* reserved */
+0x02,   /* port type (smbus) */
+0x00,   /* reserved */
+0x40, 0x00, /* maximum mctp transission unit