from:"Akihiko Odaki"

Re: [PATCH v3 03/14] ui/cocoa: Adds non-app runloop on main thread mode

2024-10-02 Thread Akihiko Odaki


On 2024/09/28 17:57, Phil Dennis-Jordan wrote:

Various system frameworks on macOS and other Apple platforms
require a main runloop to be processing events on the process’s
main thread. The Cocoa UI’s requirement to run the process as a
Cocoa application automatically enables this runloop, but it
can be useful to have the runloop handling events even without
the Cocoa UI active.

This change adds a non-app runloop mode to the cocoa_main
function. This can be requested by other code, while the Cocoa UI
additionally enables app mode. This arrangement ensures there is
only one qemu_main function switcheroo, and the Cocoa UI’s app
mode requirement and other subsystems’ runloop requests don’t
conflict with each other.


gtk and sdl need to run in the main thread so stealing the main thread 
by setting qemu_main will break them. Please investigate the possibility 
of running CFRunLoop in another thread.




The main runloop is required for the AppleGFX PV graphics device,
so the runloop request call has been added to its initialisation.


Please move this patch before any other patches that require it.

Regards,
Akihiko Odaki

Re: [PATCH v3 02/14] hw/display/apple-gfx: Adds PCI implementation

2024-10-02 Thread Akihiko Odaki


On 2024/09/28 17:57, Phil Dennis-Jordan wrote:

This change wires up the PCI variant of the paravirtualised
graphics device, mainly useful for x86-64 macOS guests, implemented
by macOS's ParavirtualizedGraphics.framework. It builds on code
shared with the vmapple/mmio variant of the PVG device.

Signed-off-by: Phil Dennis-Jordan 
---
  hw/display/Kconfig |   5 ++
  hw/display/apple-gfx-pci.m | 138 +
  hw/display/meson.build |   1 +
  3 files changed, 144 insertions(+)
  create mode 100644 hw/display/apple-gfx-pci.m

diff --git a/hw/display/Kconfig b/hw/display/Kconfig
index 179a479d220..c2ec268f8e9 100644
--- a/hw/display/Kconfig
+++ b/hw/display/Kconfig
@@ -152,3 +152,8 @@ config MAC_PVG_VMAPPLE
  bool
  depends on MAC_PVG
  depends on ARM
+
+config MAC_PVG_PCI
+bool
+depends on MAC_PVG && PCI
+default y if PCI_DEVICES
diff --git a/hw/display/apple-gfx-pci.m b/hw/display/apple-gfx-pci.m
new file mode 100644
index 000..9370258ee46
--- /dev/null
+++ b/hw/display/apple-gfx-pci.m
@@ -0,0 +1,138 @@
+/*
+ * QEMU Apple ParavirtualizedGraphics.framework device, PCI variant
+ *
+ * Copyright © 2023-2024 Phil Dennis-Jordan
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.


Please use SPDX-License-Identifier instead.


+ *
+ * ParavirtualizedGraphics.framework is a set of libraries that macOS provides
+ * which implements 3d graphics passthrough to the host as well as a
+ * proprietary guest communication channel to drive it. This device model
+ * implements support to drive that library from within QEMU as a PCI device
+ * aimed primarily at x86-64 macOS VMs.
+ */
+
+#include "apple-gfx.h"
+#include "hw/pci/pci_device.h"
+#include "hw/pci/msi.h"
+#include "qapi/error.h"
+#include "trace.h"
+#import 


Please add #include "qemu/osdep.h" at top and reorder according to 
"Include directives" section in: docs/devel/style.rst



+
+typedef struct AppleGFXPCIState {
+PCIDevice parent_obj;
+
+AppleGFXState common;
+} AppleGFXPCIState;
+
+OBJECT_DECLARE_SIMPLE_TYPE(AppleGFXPCIState, APPLE_GFX_PCI)
+
+static const char* apple_gfx_pci_option_rom_path = NULL;
+
+static void apple_gfx_init_option_rom_path(void)
+{
+NSURL *option_rom_url = PGCopyOptionROMURL();
+const char *option_rom_path = option_rom_url.fileSystemRepresentation;
+if (option_rom_url.fileURL && option_rom_path != NULL) {


option_rom_path != NULL is unnecessary; NSURL.h has 
NS_HEADER_AUDIT_BEGIN(nullability, sendability), which means any 
non-annotated member is non-nullable.



+apple_gfx_pci_option_rom_path = g_strdup(option_rom_path);
+}
+[option_rom_url release];
+}
+
+static void apple_gfx_pci_init(Object *obj)
+{
+AppleGFXPCIState *s = APPLE_GFX_PCI(obj);
+
+if (!apple_gfx_pci_option_rom_path) {
+/* Done on device not class init to avoid -daemonize ObjC fork crash */


It is unclear what "-daemonize ObjC fork crash" means. Please add more 
details.



+PCIDeviceClass *pci = PCI_DEVICE_CLASS(object_get_class(obj));
+apple_gfx_init_option_rom_path();
+pci->romfile = apple_gfx_pci_option_rom_path;
+}
+
+apple_gfx_common_init(obj, &s->common, TYPE_APPLE_GFX_PCI);
+}
+
+static void apple_gfx_pci_interrupt(PCIDevice *dev, AppleGFXPCIState *s,


s is unused.


+uint32_t vector)
+{
+bool msi_ok;
+trace_apple_gfx_raise_irq(vector);
+
+msi_ok = msi_enabled(dev);
+if (msi_ok) {
+msi_notify(dev, vector);
+}
+}
+
+static void apple_gfx_pci_realize(PCIDevice *dev, Error **errp)
+{
+AppleGFXPCIState *s = APPLE_GFX_PCI(dev);
+Error *err = NULL;
+int ret;
+
+pci_register_bar(dev, PG_PCI_BAR_MMIO,
+ PCI_BASE_ADDRESS_SPACE_MEMORY, &s->common.iomem_gfx);
+
+ret = msi_init(dev, 0x0 /* config offset; 0 = find space */,
+   PG_PCI_MAX_MSI_VECTORS, true /* msi64bit */,
+   false /*msi_per_vector_mask*/, &err);
+if (ret != 0) {
+error_propagate(errp, err);


You can just pass errp to msi_init().
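
A minimal sketch of what this suggestion amounts to, written against
stand-in types rather than the real QEMU headers. Error, stub_msi_init()
and stub_realize() are hypothetical names used only for illustration;
the point is that the callee fills in the caller's errp directly, so no
local Error pointer or error_propagate() call is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for QEMU's Error object, for illustration only. */
typedef struct Error { const char *msg; } Error;

static Error msi_error = { "msi_init failed" };

/* Stub with the same "int return + Error **errp" shape as msi_init(). */
static int stub_msi_init(bool fail, Error **errp)
{
    if (fail) {
        if (errp) {
            /* By convention the caller passes an errp that holds no error yet. */
            assert(*errp == NULL);
            *errp = &msi_error;
        }
        return -1;
    }
    return 0;
}

/* The caller forwards its own errp instead of using a local Error. */
static void stub_realize(bool fail, Error **errp)
{
    if (stub_msi_init(fail, errp) != 0) {
        return; /* errp was already filled in by the callee */
    }
    /* ... remaining realize steps would go here ... */
}
```
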


+return;
+}
+
+@autoreleasepool {
+PGDeviceDescriptor *desc = [PGDeviceDescriptor new];
+desc.raiseInterrupt = ^(uint32_t vector) {
+apple_gfx_pci_interrupt(dev, s, vector);
+};
+
+apple_gfx_common_realize(&s->common, desc);
+[desc release];
+desc = nil;
+}
+}
+
+static void apple_gfx_pci_reset(Object *obj, ResetType type)
+{
+AppleGFXPCIState *s = APPLE_GFX_PCI(obj);
+[s->common.pgdev reset];
+}
+
+static void apple_gfx_pci_class_init(ObjectClass *klass, void *data)
+{
+DeviceClass *dc = DEVICE_CLASS(klass);
+PCIDeviceClass *pci = PCI_DEVICE_CLASS(klass);
+ResettableClass *rc = RESETTABLE_CLASS(klass);
+
+assert(rc->phases.hold == NULL);
+rc->phases.hold = apple_gfx_pci_reset;
+dc->

Re: [PATCH v3 01/14] hw/display/apple-gfx: Introduce ParavirtualizedGraphics.Framework support

2024-10-01 Thread Akihiko Odaki


On 2024/09/28 17:57, Phil Dennis-Jordan wrote:

macOS provides a framework (library) called
ParavirtualizedGraphics.Framework (PVG) that allows any VMM to implement
paravirtualized 3D graphics passthrough to the host Metal stack. The
library abstracts away almost every aspect of the paravirtualized device
model; it only provides and receives callbacks on MMIO access and for
sharing memory address space between the VM and PVG.

This patch implements a QEMU device that drives PVG for the VMApple
variant of it.


I think it is better to name it MMIO variant instead of VMApple. There 
is nothing specific to VMApple in: hw/display/apple-gfx-vmapple.m




Signed-off-by: Alexander Graf 
Co-authored-by: Alexander Graf 

Subsequent changes:

  * Cherry-pick/rebase conflict fixes
  * BQL function renaming
  * Moved from hw/vmapple/ (useful outside that machine type)
  * Code review comments: Switched to DEFINE_TYPES macro & little endian
MMIO.
  * Removed some dead/superfluous code
  * Made set_mode thread & memory safe
  * Added migration blocker due to lack of (de-)serialisation.
  * Fixes to ObjC refcounting and autorelease pool usage.
  * Fixed ObjC new/init misuse
  * Switched to ObjC category extension for private property.
  * Simplified task memory mapping and made it thread safe.
  * Refactoring to split generic and vmapple MMIO variant specific
code.
  * Switched to asynchronous MMIO writes on x86-64
  * Rendering and graphics update are now done asynchronously
  * Fixed cursor handling
  * Coding convention fixes
  * Removed software cursor compositing

Signed-off-by: Phil Dennis-Jordan 

---

v3:

  * Rebased on latest upstream, fixed breakages including switching to 
Resettable methods.
  * Squashed patches dealing with dGPUs, MMIO area size, and GPU picking.
  * Allow re-entrant MMIO; this simplifies the code and solves the divergence
between x86-64 and arm64 variants.

  hw/display/Kconfig |   9 +
  hw/display/apple-gfx-vmapple.m | 215 +
  hw/display/apple-gfx.h |  57 
  hw/display/apple-gfx.m | 536 +
  hw/display/meson.build |   2 +
  hw/display/trace-events|  26 ++
  meson.build|   4 +
  7 files changed, 849 insertions(+)
  create mode 100644 hw/display/apple-gfx-vmapple.m
  create mode 100644 hw/display/apple-gfx.h
  create mode 100644 hw/display/apple-gfx.m

diff --git a/hw/display/Kconfig b/hw/display/Kconfig
index a4552c8ed78..179a479d220 100644
--- a/hw/display/Kconfig
+++ b/hw/display/Kconfig
@@ -143,3 +143,12 @@ config XLNX_DISPLAYPORT
  
  config DM163

  bool
+
+config MAC_PVG
+bool
+default y
+
+config MAC_PVG_VMAPPLE
+bool
+depends on MAC_PVG
+depends on ARM


Use AARCH64 instead.


diff --git a/hw/display/apple-gfx-vmapple.m b/hw/display/apple-gfx-vmapple.m
new file mode 100644
index 000..d8fc7651dde
--- /dev/null
+++ b/hw/display/apple-gfx-vmapple.m
@@ -0,0 +1,215 @@
+/*
+ * QEMU Apple ParavirtualizedGraphics.framework device, vmapple (arm64) variant
+ *
+ * Copyright © 2023 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ * ParavirtualizedGraphics.framework is a set of libraries that macOS provides
+ * which implements 3d graphics passthrough to the host as well as a
+ * proprietary guest communication channel to drive it. This device model
+ * implements support to drive that library from within QEMU as an MMIO-based
+ * system device for macOS on arm64 VMs.
+ */
+
+#include "apple-gfx.h"
+#include "monitor/monitor.h"
+#include "hw/sysbus.h"
+#include "hw/irq.h"
+#include "trace.h"
+#import 
+
+_Static_assert(__aarch64__, "");


I don't think this assertion is worthwhile. This assertion will trigger 
if you accidentally remove depends on AARCH64 from Kconfig, but I don't 
think such code change happens by accident, and there is no reason to 
believe that this assertion won't be removed in such a case.



+
+/*
+ * ParavirtualizedGraphics.Framework only ships header files for the PCI
+ * variant which does not include IOSFC descriptors and host devices. We add
+ * their definitions here so that we can also work with the ARM version.
+ */
+typedef bool(^IOSFCRaiseInterrupt)(uint32_t vector);
+typedef bool(^IOSFCUnmapMemory)(
+void *a, void *b, void *c, void *d, void *e, void *f);


Omit dummy parameter names.


+typedef bool(^IOSFCMapMemory)(
+uint64_t phys, uint64_t len, bool ro, void **va, void *e, void *f);
+
+@interface PGDeviceDescriptor (IOSurfaceMapper)
+@property (readwrite, nonatomic) bool usingIOSurfaceMapper;
+@end
+
+@interface PGIOSurfaceHostDeviceDescriptor : NSObject
+-(PGIOSurfaceHostDeviceDescriptor *)init;
+@property (readwrite, nonatomic, copy, nullable) IOSFCMapMemory mapMemory;
+@property (readwrite, nonatomic, copy, nullable) IOSFCUnmapMemory unmapMemo

Re: [PATCH v16 04/13] s390x/pci: Avoid creating zpci for VFs

2024-09-18 Thread Akihiko Odaki


On 2024/09/18 17:02, Cédric Le Goater wrote:

Hello,

On 9/13/24 05:44, Akihiko Odaki wrote:

VFs are automatically created by PF, and creating zpci for them will
result in unexpected usage of fids. Currently QEMU does not support
multifunction for s390x so we don't need zpci for VFs anyway.

Signed-off-by: Akihiko Odaki 
---
  hw/s390x/s390-pci-bus.c | 19 +--
  1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 3e57d5faca18..1a620f5b2a04 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -1080,6 +1080,16 @@ static void s390_pcihost_plug(HotplugHandler *hotplug_dev, DeviceState *dev,

  pbdev = s390_pci_find_dev_by_target(s, dev->id);
  if (!pbdev) {
+    /*
+ * VFs are automatically created by PF, and creating zpci for them
+ * will result in unexpected usage of fids. Currently QEMU does not
+ * support multifunction for s390x so we don't need zpci for VFs
+ * anyway.
+ */
+    if (pci_is_vf(pdev)) {
+    return;
+    }
+
  pbdev = s390_pci_device_new(s, dev->id, errp);
  if (!pbdev) {
  return;
@@ -1167,7 +1177,9 @@ static void s390_pcihost_unplug(HotplugHandler *hotplug_dev, DeviceState *dev,

  int32_t devfn;
  pbdev = s390_pci_find_dev_by_pci(s, PCI_DEVICE(dev));
-    g_assert(pbdev);
+    if (!pbdev) {
+    return;
+    }



I don't understand this change. Could you please explain ?


We need to tolerate pbdev being NULL here: VFs no longer have a zpci 
device, so the lookup returns NULL for them.
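
A minimal sketch of the plug/unplug pairing described above, assuming
simplified stand-in structures. SketchPciDev, SketchZpci and the
sketch_* functions are hypothetical and do not match the real QEMU
types. Because plug skips VFs, unplug has to treat a failed lookup as a
normal case rather than asserting.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified device records; not the real QEMU structures. */
typedef struct SketchPciDev { bool is_vf; } SketchPciDev;
typedef struct SketchZpci {
    const SketchPciDev *pdev;
    bool plugged;
} SketchZpci;

/* Plug: VFs are skipped, so no zpci entry is ever created for them. */
static SketchZpci *sketch_plug(SketchZpci *slot, const SketchPciDev *pdev)
{
    if (pdev->is_vf) {
        return NULL;
    }
    slot->pdev = pdev;
    slot->plugged = true;
    return slot;
}

/* Lookup returns NULL when no zpci entry refers to this PCI device. */
static SketchZpci *sketch_find(SketchZpci *slot, const SketchPciDev *pdev)
{
    return (slot->plugged && slot->pdev == pdev) ? slot : NULL;
}

/* Unplug: a missing entry is expected for VFs, so return instead of asserting. */
static bool sketch_unplug(SketchZpci *slot, const SketchPciDev *pdev)
{
    SketchZpci *pbdev = sketch_find(slot, pdev);

    if (!pbdev) {
        return false; /* nothing to tear down */
    }
    pbdev->plugged = false;
    pbdev->pdev = NULL;
    return true;
}
```
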


Regards,
Akihiko Odaki

[PATCH RFC v3 03/11] virtio-net: Move virtio_net_get_features() down

2024-09-14 Thread Akihiko Odaki

Move virtio_net_get_features() to the later part of the file so that
it can call other functions.

Signed-off-by: Akihiko Odaki 
---
 hw/net/virtio-net.c | 146 ++--
 1 file changed, 73 insertions(+), 73 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index b4a3fb575c7c..206b0335169d 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -750,79 +750,6 @@ static void virtio_net_set_queue_pairs(VirtIONet *n)
 
 static void virtio_net_set_multiqueue(VirtIONet *n, int multiqueue);
 
-static uint64_t virtio_net_get_features(VirtIODevice *vdev, uint64_t features,
-Error **errp)
-{
-VirtIONet *n = VIRTIO_NET(vdev);
-NetClientState *nc = qemu_get_queue(n->nic);
-
-/* Firstly sync all virtio-net possible supported features */
-features |= n->host_features;
-
-virtio_add_feature(&features, VIRTIO_NET_F_MAC);
-
-if (!peer_has_vnet_hdr(n)) {
-virtio_clear_feature(&features, VIRTIO_NET_F_CSUM);
-virtio_clear_feature(&features, VIRTIO_NET_F_HOST_TSO4);
-virtio_clear_feature(&features, VIRTIO_NET_F_HOST_TSO6);
-virtio_clear_feature(&features, VIRTIO_NET_F_HOST_ECN);
-
-virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_CSUM);
-virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_TSO4);
-virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_TSO6);
-virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_ECN);
-
-virtio_clear_feature(&features, VIRTIO_NET_F_HOST_USO);
-virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_USO4);
-virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_USO6);
-
-virtio_clear_feature(&features, VIRTIO_NET_F_HASH_REPORT);
-}
-
-if (!peer_has_vnet_hdr(n) || !peer_has_ufo(n)) {
-virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_UFO);
-virtio_clear_feature(&features, VIRTIO_NET_F_HOST_UFO);
-}
-
-if (!peer_has_uso(n)) {
-virtio_clear_feature(&features, VIRTIO_NET_F_HOST_USO);
-virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_USO4);
-virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_USO6);
-}
-
-if (!get_vhost_net(nc->peer)) {
-return features;
-}
-
-if (!ebpf_rss_is_loaded(&n->ebpf_rss)) {
-virtio_clear_feature(&features, VIRTIO_NET_F_RSS);
-}
-features = vhost_net_get_features(get_vhost_net(nc->peer), features);
-vdev->backend_features = features;
-
-if (n->mtu_bypass_backend &&
-(n->host_features & 1ULL << VIRTIO_NET_F_MTU)) {
-features |= (1ULL << VIRTIO_NET_F_MTU);
-}
-
-/*
- * Since GUEST_ANNOUNCE is emulated the feature bit could be set without
- * enabled. This happens in the vDPA case.
- *
- * Make sure the feature set is not incoherent, as the driver could refuse
- * to start.
- *
- * TODO: QEMU is able to emulate a CVQ just for guest_announce purposes,
- * helping guest to notify the new location with vDPA devices that does not
- * support it.
- */
-if (!virtio_has_feature(vdev->backend_features, VIRTIO_NET_F_CTRL_VQ)) {
-virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_ANNOUNCE);
-}
-
-return features;
-}
-
 static uint64_t virtio_net_bad_features(VirtIODevice *vdev)
 {
 uint64_t features = 0;
@@ -3041,6 +2968,79 @@ static void virtio_net_set_multiqueue(VirtIONet *n, int multiqueue)
 virtio_net_set_queue_pairs(n);
 }
 
+static uint64_t virtio_net_get_features(VirtIODevice *vdev, uint64_t features,
+Error **errp)
+{
+VirtIONet *n = VIRTIO_NET(vdev);
+NetClientState *nc = qemu_get_queue(n->nic);
+
+/* Firstly sync all virtio-net possible supported features */
+features |= n->host_features;
+
+virtio_add_feature(&features, VIRTIO_NET_F_MAC);
+
+if (!peer_has_vnet_hdr(n)) {
+virtio_clear_feature(&features, VIRTIO_NET_F_CSUM);
+virtio_clear_feature(&features, VIRTIO_NET_F_HOST_TSO4);
+virtio_clear_feature(&features, VIRTIO_NET_F_HOST_TSO6);
+virtio_clear_feature(&features, VIRTIO_NET_F_HOST_ECN);
+
+virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_CSUM);
+virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_TSO4);
+virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_TSO6);
+virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_ECN);
+
+virtio_clear_feature(&features, VIRTIO_NET_F_HOST_USO);
+virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_USO4);
+virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_USO6);
+
+virtio_clear_feature(&features, VIRTIO_NET_F_HASH_REPORT);
+}
+
+if (!peer_has_vnet_h

[PATCH RFC v3 11/11] docs/devel/ebpf_rss.rst: Update for peer RSS

2024-09-14 Thread Akihiko Odaki

The eBPF RSS virtio-net documentation was written on the assumption that
there is only one alternative RSS implementation: 'in-qemu' RSS. That is
no longer true; there is now another implementation, namely peer RSS.

Signed-off-by: Akihiko Odaki 
---
 docs/devel/ebpf_rss.rst | 23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/docs/devel/ebpf_rss.rst b/docs/devel/ebpf_rss.rst
index 4a68682b31ac..06b09e8a3fed 100644
--- a/docs/devel/ebpf_rss.rst
+++ b/docs/devel/ebpf_rss.rst
@@ -5,9 +5,22 @@ eBPF RSS virtio-net support
 RSS(Receive Side Scaling) is used to distribute network packets to guest virtqueues
 by calculating packet hash. Usually every queue is processed then by a specific guest CPU core.
 
-For now there are 2 RSS implementations in qemu:
-- 'in-qemu' RSS (functions if qemu receives network packets, i.e. vhost=off)
-- eBPF RSS (can function with also with vhost=on)
+For now there are 3 RSS implementations in qemu:
+1. Peer RSS
+2. eBPF RSS
+3. 'In-QEMU' RSS
+
+'In-QEMU' RSS is incompatible with vhost since the packets are not routed to
+QEMU. eBPF RSS requires Linux 5.8+. Peer RSS requires the peer to implement RSS.
+Currently QEMU can use the RSS implementation of vDPA and Linux's TUN module
+with the following patch applied:
+https://lore.kernel.org/r/20240915-rss-v3-0-c630015db...@daynix.com/
+
+eBPF RSS does not support hash reporting. Peer RSS may support limited hash
+types.
+
+virtio-net automatically chooses the RSS implementation to use. Peer RSS is
+the most preferred, and 'in-QEMU' RSS is the least.
 
 eBPF support (CONFIG_EBPF) is enabled by 'configure' script.
 To enable eBPF RSS support use './configure --enable-bpf'.
@@ -47,9 +60,6 @@ eBPF RSS turned on by different combinations of vhost-net, vitrio-net and tap co
 
 tap,vhost=on & virtio-net-pci,rss=on,hash=on
 
-If CONFIG_EBPF is not set then only 'in-qemu' RSS is supported.
-Also 'in-qemu' RSS, as a fallback, is used if the eBPF program failed to load or set to TUN.
-
 RSS eBPF program
 
 
@@ -65,7 +75,6 @@ Prerequisites to recompile the eBPF program (regenerate ebpf/rss.bpf.skeleton.h)
 $ make -f Makefile.ebpf
 
 Current eBPF RSS implementation uses 'bounded loops' with 'backward jump instructions' which present in the last kernels.
-Overall eBPF RSS works on kernels 5.8+.
 
 eBPF RSS implementation
 ---

-- 
2.46.0
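
The preference order stated in the updated documentation can be sketched
as a small selection function. This is a simplified illustration, not
the virtio-net logic itself: SketchRssImpl, sketch_choose_rss() and its
four boolean inputs are hypothetical, and the real code also has to
check details such as which hash types the peer supports.

```c
#include <stdbool.h>

/* Hypothetical labels for the three implementations named in the docs. */
typedef enum {
    SKETCH_RSS_NONE,
    SKETCH_RSS_PEER,
    SKETCH_RSS_EBPF,
    SKETCH_RSS_IN_QEMU
} SketchRssImpl;

/*
 * Pick an implementation in the documented order of preference:
 * peer RSS first, then eBPF RSS, then 'in-QEMU' RSS.
 *   - eBPF RSS is skipped when hash reporting is wanted.
 *   - 'in-QEMU' RSS is skipped with vhost, since packets bypass QEMU.
 */
static SketchRssImpl sketch_choose_rss(bool peer_ok, bool ebpf_ok,
                                       bool vhost, bool hash_report)
{
    if (peer_ok) {
        return SKETCH_RSS_PEER;
    }
    if (ebpf_ok && !hash_report) {
        return SKETCH_RSS_EBPF;
    }
    if (!vhost) {
        return SKETCH_RSS_IN_QEMU;
    }
    return SKETCH_RSS_NONE;
}
```
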

[PATCH RFC v3 08/11] virtio-net: Use qemu_set_vnet_hash()

2024-09-14 Thread Akihiko Odaki

This is necessary to offload hashing to tap.

Signed-off-by: Akihiko Odaki 
---
 hw/net/virtio-net.c | 77 -
 1 file changed, 64 insertions(+), 13 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 38ccd706f956..be6759d1c0f4 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1209,20 +1209,65 @@ static void virtio_net_detach_epbf_rss(VirtIONet *n)
 
 static void virtio_net_commit_rss_config(VirtIONet *n)
 {
-if (n->rss_data.peer_hash_available) {
-return;
-}
-
 if (n->rss_data.enabled) {
-n->rss_data.enabled_software_rss = n->rss_data.populate_hash;
-if (n->rss_data.populate_hash) {
-virtio_net_detach_epbf_rss(n);
-} else if (!virtio_net_attach_epbf_rss(n)) {
-if (get_vhost_net(qemu_get_queue(n->nic)->peer)) {
-warn_report("Can't load eBPF RSS for vhost");
+if (n->rss_data.peer_hash_available &&
+(n->rss_data.peer_hash_types & n->rss_data.runtime_hash_types) ==
+n->rss_data.runtime_hash_types) {
+NetVnetHash hash = {
+.flags = (n->rss_data.redirect ? NET_VNET_HASH_RSS : 0) |
+ (n->rss_data.populate_hash ? NET_VNET_HASH_REPORT : 0),
+.types = n->rss_data.runtime_hash_types
+};
+
+if (n->rss_data.redirect) {
+size_t indirection_table_size =
+n->rss_data.indirections_len *
+sizeof(*n->rss_data.indirections_table);
+
+size_t hash_size = sizeof(NetVnetHash) +
+   sizeof(NetVnetHashRss) +
+   indirection_table_size +
+   sizeof(n->rss_data.key);
+
+g_autofree struct {
+NetVnetHash hdr;
+NetVnetHashRss rss;
+uint8_t footer[];
+} *rss = g_malloc(hash_size);
+
+rss->hdr = hash;
+rss->rss.indirection_table_mask =
+n->rss_data.indirections_len - 1;
+rss->rss.unclassified_queue = n->rss_data.default_queue;
+
+memcpy(rss->footer, n->rss_data.indirections_table,
+   indirection_table_size);
+
+memcpy(rss->footer + indirection_table_size, n->rss_data.key,
+   sizeof(n->rss_data.key));
+
+qemu_set_vnet_hash(qemu_get_queue(n->nic)->peer, &rss->hdr);
 } else {
-warn_report("Can't load eBPF RSS - fallback to software RSS");
-n->rss_data.enabled_software_rss = true;
+qemu_set_vnet_hash(qemu_get_queue(n->nic)->peer, &hash);
+}
+
+n->rss_data.enabled_software_rss = false;
+} else {
+if (n->rss_data.peer_hash_available) {
+NetVnetHash hash = { .flags = 0 };
+qemu_set_vnet_hash(qemu_get_queue(n->nic)->peer, &hash);
+}
+
+n->rss_data.enabled_software_rss = n->rss_data.populate_hash;
+if (n->rss_data.populate_hash) {
+virtio_net_detach_epbf_rss(n);
+} else if (!virtio_net_attach_epbf_rss(n)) {
+if (get_vhost_net(qemu_get_queue(n->nic)->peer)) {
+warn_report("Can't load eBPF RSS for vhost");
+} else {
+warn_report("Can't load eBPF RSS - fallback to software RSS");
+n->rss_data.enabled_software_rss = true;
+}
 }
 }
 
@@ -1230,7 +1275,13 @@ static void virtio_net_commit_rss_config(VirtIONet *n)
 n->rss_data.indirections_len,
 sizeof(n->rss_data.key));
 } else {
-virtio_net_detach_epbf_rss(n);
+if (n->rss_data.peer_hash_available) {
+NetVnetHash hash = { .flags = 0 };
+qemu_set_vnet_hash(qemu_get_queue(n->nic)->peer, &hash);
+} else {
+virtio_net_detach_epbf_rss(n);
+}
+
 trace_virtio_net_rss_disable();
 }
 }

-- 
2.46.0
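
The buffer that the virtio_net_commit_rss_config() hunk above assembles
for qemu_set_vnet_hash() is laid out as header, RSS fields, indirection
table, then key. A minimal standalone sketch of that layout follows. The
two structs and the flag values are copied from the include/net/net.h
hunk of patch 07/11; sketch_pack_rss() is a hypothetical helper, and the
uint16_t table element type and power-of-two table length are
assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NET_VNET_HASH_REPORT 1
#define NET_VNET_HASH_RSS 2

typedef struct NetVnetHash {
    uint16_t flags;
    uint8_t pad[2];
    uint32_t types;
} NetVnetHash;

typedef struct NetVnetHashRss {
    uint16_t indirection_table_mask;
    uint16_t unclassified_queue;
} NetVnetHashRss;

/*
 * Hypothetical helper: allocate one buffer and pack, in order,
 * NetVnetHash, NetVnetHashRss, the indirection table and the key.
 * table_len is assumed to be a power of two so that (len - 1) is a mask.
 * The caller frees the returned buffer with free().
 */
static void *sketch_pack_rss(uint32_t types,
                             const uint16_t *table, uint16_t table_len,
                             uint16_t default_queue,
                             const uint8_t *key, size_t key_len,
                             size_t *out_size)
{
    size_t table_size = (size_t)table_len * sizeof(*table);
    size_t size = sizeof(NetVnetHash) + sizeof(NetVnetHashRss) +
                  table_size + key_len;
    NetVnetHash hdr = { .flags = NET_VNET_HASH_RSS, .types = types };
    NetVnetHashRss rss = {
        .indirection_table_mask = (uint16_t)(table_len - 1),
        .unclassified_queue = default_queue,
    };
    uint8_t *buf = calloc(1, size);

    if (!buf) {
        return NULL;
    }
    memcpy(buf, &hdr, sizeof(hdr));
    memcpy(buf + sizeof(hdr), &rss, sizeof(rss));
    memcpy(buf + sizeof(hdr) + sizeof(rss), table, table_size);
    memcpy(buf + sizeof(hdr) + sizeof(rss) + table_size, key, key_len);
    *out_size = size;
    return buf;
}
```

Packing everything into one allocation lets a single pointer to the
header be handed to the backend, which matches how the hunk passes
&rss->hdr.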

[PATCH RFC v3 09/11] virtio-net: Offload hashing without vhost

2024-09-14 Thread Akihiko Odaki

This is necessary to offload hashing to tap.

Signed-off-by: Akihiko Odaki 
---
 hw/net/virtio-net.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index be6759d1c0f4..72493b652bf5 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1695,7 +1695,11 @@ static size_t receive_header(VirtIONet *n, struct virtio_net_hdr *hdr,
 {
 size_t hdr_len = n->guest_hdr_len;
 
-memcpy(hdr, buf, sizeof(struct virtio_net_hdr));
+memcpy(hdr, buf,
+   n->rss_data.populate_hash &&
+   n->rss_data.enabled && !n->rss_data.enabled_software_rss ?
+   sizeof(struct virtio_net_hdr_v1_hash) :
+   sizeof(struct virtio_net_hdr));
 
 *buf_offset = n->host_hdr_len;
 work_around_broken_dhclient(hdr, &hdr_len, buf, buf_size, buf_offset);
@@ -3072,11 +3076,13 @@ static uint64_t virtio_net_get_features(VirtIODevice *vdev, uint64_t features,
 }
 
 if (!get_vhost_net(nc->peer)) {
-if (!use_own_hash) {
-virtio_clear_feature(&features, VIRTIO_NET_F_HASH_REPORT);
-virtio_clear_feature(&features, VIRTIO_NET_F_RSS);
-} else if (virtio_has_feature(features, VIRTIO_NET_F_RSS)) {
-virtio_net_load_ebpf(n);
+if (!use_peer_hash) {
+if (!use_own_hash) {
+virtio_clear_feature(&features, VIRTIO_NET_F_HASH_REPORT);
+virtio_clear_feature(&features, VIRTIO_NET_F_RSS);
+} else if (virtio_has_feature(features, VIRTIO_NET_F_RSS)) {
+virtio_net_load_ebpf(n);
+}
 }
 
 return features;

-- 
2.46.0

[PATCH RFC v3 05/11] net/vhost-vdpa: Remove dummy SetSteeringEBPF

2024-09-14 Thread Akihiko Odaki

It is no longer used.

Signed-off-by: Akihiko Odaki 
---
 net/vhost-vdpa.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 4af87ea226b4..5d846db5e71f 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -245,12 +245,6 @@ static void vhost_vdpa_cleanup(NetClientState *nc)
 g_free(s->vhost_vdpa.shared);
 }
 
-/** Dummy SetSteeringEBPF to support RSS for vhost-vdpa backend  */
-static bool vhost_vdpa_set_steering_ebpf(NetClientState *nc, int prog_fd)
-{
-return true;
-}
-
 static bool vhost_vdpa_has_vnet_hdr(NetClientState *nc)
 {
 assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
@@ -465,7 +459,6 @@ static NetClientInfo net_vhost_vdpa_info = {
 .get_vnet_hash_supported_types = vhost_vdpa_get_vnet_hash_supported_types,
 .has_ufo = vhost_vdpa_has_ufo,
 .check_peer_type = vhost_vdpa_check_peer_type,
-.set_steering_ebpf = vhost_vdpa_set_steering_ebpf,
 };
 
 static int64_t vhost_vdpa_get_vring_group(int device_fd, unsigned vq_index,
@@ -1333,7 +1326,6 @@ static NetClientInfo net_vhost_vdpa_cvq_info = {
 .get_vnet_hash_supported_types = vhost_vdpa_get_vnet_hash_supported_types,
 .has_ufo = vhost_vdpa_has_ufo,
 .check_peer_type = vhost_vdpa_check_peer_type,
-.set_steering_ebpf = vhost_vdpa_set_steering_ebpf,
 };
 
 /*

-- 
2.46.0

[PATCH RFC v3 02/11] net/vhost-vdpa: Report hashing capability

2024-09-14 Thread Akihiko Odaki

Report hashing capability so that virtio-net can deliver the correct
capability information to the guest.

Signed-off-by: Akihiko Odaki 
---
 include/net/net.h |  3 +++
 net/net.c |  9 +
 net/vhost-vdpa.c  | 28 
 3 files changed, 40 insertions(+)

diff --git a/include/net/net.h b/include/net/net.h
index c8f679761bf9..099616c8cbe3 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -60,6 +60,7 @@ typedef bool (HasVnetHdrLen)(NetClientState *, int);
 typedef void (SetOffload)(NetClientState *, int, int, int, int, int, int, int);
 typedef int (GetVnetHdrLen)(NetClientState *);
 typedef void (SetVnetHdrLen)(NetClientState *, int);
+typedef bool (GetVnetHashSupportedTypes)(NetClientState *, uint32_t *);
 typedef int (SetVnetLE)(NetClientState *, bool);
 typedef int (SetVnetBE)(NetClientState *, bool);
 typedef struct SocketReadState SocketReadState;
@@ -89,6 +90,7 @@ typedef struct NetClientInfo {
 SetVnetHdrLen *set_vnet_hdr_len;
 SetVnetLE *set_vnet_le;
 SetVnetBE *set_vnet_be;
+GetVnetHashSupportedTypes *get_vnet_hash_supported_types;
 NetAnnounce *announce;
 SetSteeringEBPF *set_steering_ebpf;
 NetCheckPeerType *check_peer_type;
@@ -192,6 +194,7 @@ void qemu_set_offload(NetClientState *nc, int csum, int tso4, int tso6,
   int ecn, int ufo, int uso4, int uso6);
 int qemu_get_vnet_hdr_len(NetClientState *nc);
 void qemu_set_vnet_hdr_len(NetClientState *nc, int len);
+bool qemu_get_vnet_hash_supported_types(NetClientState *nc, uint32_t *types);
 int qemu_set_vnet_le(NetClientState *nc, bool is_le);
 int qemu_set_vnet_be(NetClientState *nc, bool is_be);
 void qemu_macaddr_default_if_unset(MACAddr *macaddr);
diff --git a/net/net.c b/net/net.c
index 6938da05e077..3b04f8fe5d6b 100644
--- a/net/net.c
+++ b/net/net.c
@@ -559,6 +559,15 @@ void qemu_set_vnet_hdr_len(NetClientState *nc, int len)
 nc->info->set_vnet_hdr_len(nc, len);
 }
 
+bool qemu_get_vnet_hash_supported_types(NetClientState *nc, uint32_t *types)
+{
+if (!nc || !nc->info->get_vnet_hash_supported_types) {
+return false;
+}
+
+return nc->info->get_vnet_hash_supported_types(nc, types);
+}
+
 int qemu_set_vnet_le(NetClientState *nc, bool is_le)
 {
 #if HOST_BIG_ENDIAN
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 03457ead663a..af0c3c448c1f 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -258,6 +258,32 @@ static bool vhost_vdpa_has_vnet_hdr(NetClientState *nc)
 return true;
 }
 
+static bool vhost_vdpa_get_vnet_hash_supported_types(NetClientState *nc,
+ uint32_t *types)
+{
+assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+uint64_t features = s->vhost_vdpa.dev->features;
+int fd = s->vhost_vdpa.shared->device_fd;
+struct {
+struct vhost_vdpa_config hdr;
+uint32_t supported_hash_types;
+} config;
+
+if (!virtio_has_feature(features, VIRTIO_NET_F_HASH_REPORT) &&
+!virtio_has_feature(features, VIRTIO_NET_F_RSS)) {
+return false;
+}
+
+config.hdr.off = offsetof(struct virtio_net_config, supported_hash_types);
+config.hdr.len = sizeof(config.supported_hash_types);
+
+assert(!ioctl(fd, VHOST_VDPA_GET_CONFIG, &config));
+*types = le32_to_cpu(config.supported_hash_types);
+
+return true;
+}
+
 static bool vhost_vdpa_has_ufo(NetClientState *nc)
 {
 assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
@@ -436,6 +462,7 @@ static NetClientInfo net_vhost_vdpa_info = {
 .stop = vhost_vdpa_net_client_stop,
 .cleanup = vhost_vdpa_cleanup,
 .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
+.get_vnet_hash_supported_types = vhost_vdpa_get_vnet_hash_supported_types,
 .has_ufo = vhost_vdpa_has_ufo,
 .check_peer_type = vhost_vdpa_check_peer_type,
 .set_steering_ebpf = vhost_vdpa_set_steering_ebpf,
@@ -1303,6 +1330,7 @@ static NetClientInfo net_vhost_vdpa_cvq_info = {
 .stop = vhost_vdpa_net_cvq_stop,
 .cleanup = vhost_vdpa_cleanup,
 .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
+.get_vnet_hash_supported_types = vhost_vdpa_get_vnet_hash_supported_types,
 .has_ufo = vhost_vdpa_has_ufo,
 .check_peer_type = vhost_vdpa_check_peer_type,
 .set_steering_ebpf = vhost_vdpa_set_steering_ebpf,

-- 
2.46.0

[PATCH RFC v3 07/11] net: Allow configuring virtio hashing

2024-09-14 Thread Akihiko Odaki

This adds set_vnet_hash() to configure virtio hashing and implements it
for Linux's tap. vDPA will have an empty function as configuring virtio
hashing is done with the load().

Signed-off-by: Akihiko Odaki 
---
 include/net/net.h | 17 +
 net/tap-linux.h   |  1 +
 net/tap_int.h |  2 ++
 net/net.c |  5 +
 net/tap-bsd.c |  5 +
 net/tap-linux.c   |  5 +
 net/tap-solaris.c |  5 +
 net/tap-stub.c|  5 +
 net/tap.c |  7 +++
 net/vhost-vdpa.c  |  7 +++
 10 files changed, 59 insertions(+)

diff --git a/include/net/net.h b/include/net/net.h
index 099616c8cbe3..0c7e3513cf5f 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -35,6 +35,20 @@ typedef struct NICConf {
 int32_t bootindex;
 } NICConf;
 
+#define NET_VNET_HASH_REPORT 1
+#define NET_VNET_HASH_RSS 2
+
+typedef struct NetVnetHash {
+uint16_t flags;
+uint8_t pad[2];
+uint32_t types;
+} NetVnetHash;
+
+typedef struct NetVnetHashRss {
+uint16_t indirection_table_mask;
+uint16_t unclassified_queue;
+} NetVnetHashRss;
+
 #define DEFINE_NIC_PROPERTIES(_state, _conf)\
 DEFINE_PROP_MACADDR("mac",   _state, _conf.macaddr),\
 DEFINE_PROP_NETDEV("netdev", _state, _conf.peers)
@@ -61,6 +75,7 @@ typedef void (SetOffload)(NetClientState *, int, int, int, int, int, int, int);
 typedef int (GetVnetHdrLen)(NetClientState *);
 typedef void (SetVnetHdrLen)(NetClientState *, int);
 typedef bool (GetVnetHashSupportedTypes)(NetClientState *, uint32_t *);
+typedef void (SetVnetHash)(NetClientState *, const NetVnetHash *);
 typedef int (SetVnetLE)(NetClientState *, bool);
 typedef int (SetVnetBE)(NetClientState *, bool);
 typedef struct SocketReadState SocketReadState;
@@ -91,6 +106,7 @@ typedef struct NetClientInfo {
 SetVnetLE *set_vnet_le;
 SetVnetBE *set_vnet_be;
 GetVnetHashSupportedTypes *get_vnet_hash_supported_types;
+SetVnetHash *set_vnet_hash;
 NetAnnounce *announce;
 SetSteeringEBPF *set_steering_ebpf;
 NetCheckPeerType *check_peer_type;
@@ -195,6 +211,7 @@ void qemu_set_offload(NetClientState *nc, int csum, int tso4, int tso6,
 int qemu_get_vnet_hdr_len(NetClientState *nc);
 void qemu_set_vnet_hdr_len(NetClientState *nc, int len);
 bool qemu_get_vnet_hash_supported_types(NetClientState *nc, uint32_t *types);
+void qemu_set_vnet_hash(NetClientState *nc, const NetVnetHash *hash);
 int qemu_set_vnet_le(NetClientState *nc, bool is_le);
 int qemu_set_vnet_be(NetClientState *nc, bool is_be);
 void qemu_macaddr_default_if_unset(MACAddr *macaddr);
diff --git a/net/tap-linux.h b/net/tap-linux.h
index 9a58cecb7f47..5fac64c24f99 100644
--- a/net/tap-linux.h
+++ b/net/tap-linux.h
@@ -32,6 +32,7 @@
 #define TUNSETVNETLE _IOW('T', 220, int)
 #define TUNSETVNETBE _IOW('T', 222, int)
 #define TUNSETSTEERINGEBPF _IOR('T', 224, int)
+#define TUNSETVNETHASH _IOW('T', 229, NetVnetHash)
 
 #endif
 
diff --git a/net/tap_int.h b/net/tap_int.h
index 8857ff299d22..e1b53e343397 100644
--- a/net/tap_int.h
+++ b/net/tap_int.h
@@ -27,6 +27,7 @@
 #define NET_TAP_INT_H
 
 #include "qapi/qapi-types-net.h"
+#include "net/net.h"
 
 int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
  int vnet_hdr_required, int mq_required, Error **errp);
@@ -40,6 +41,7 @@ int tap_probe_has_uso(int fd);
 void tap_fd_set_offload(int fd, int csum, int tso4, int tso6, int ecn, int ufo,
 int uso4, int uso6);
 void tap_fd_set_vnet_hdr_len(int fd, int len);
+void tap_fd_set_vnet_hash(int fd, const NetVnetHash *hash);
 int tap_fd_set_vnet_le(int fd, int vnet_is_le);
 int tap_fd_set_vnet_be(int fd, int vnet_is_be);
 int tap_fd_enable(int fd);
diff --git a/net/net.c b/net/net.c
index 3b04f8fe5d6b..db365b6ec211 100644
--- a/net/net.c
+++ b/net/net.c
@@ -568,6 +568,11 @@ bool qemu_get_vnet_hash_supported_types(NetClientState *nc, uint32_t *types)
 return nc->info->get_vnet_hash_supported_types(nc, types);
 }
 
+void qemu_set_vnet_hash(NetClientState *nc, const NetVnetHash *hash)
+{
+nc->info->set_vnet_hash(nc, hash);
+}
+
 int qemu_set_vnet_le(NetClientState *nc, bool is_le)
 {
 #if HOST_BIG_ENDIAN
diff --git a/net/tap-bsd.c b/net/tap-bsd.c
index b4c84441ba8b..2eee0c0a0ec5 100644
--- a/net/tap-bsd.c
+++ b/net/tap-bsd.c
@@ -221,6 +221,11 @@ void tap_fd_set_vnet_hdr_len(int fd, int len)
 {
 }
 
+void tap_fd_set_vnet_hash(int fd, const NetVnetHash *hash)
+{
+g_assert_not_reached();
+}
+
 int tap_fd_set_vnet_le(int fd, int is_le)
 {
 return -EINVAL;
diff --git a/net/tap-linux.c b/net/tap-linux.c
index 1226d5fda2d9..e96d38eec922 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -194,6 +194,11 @@ void tap_fd_set_vnet_hdr_len(int fd, int len)
 }
 }
 
+void tap_fd_set_vnet_hash(int fd, const NetVnetHash *hash)
+{
+assert(!ioctl(fd, TUNSETVNETHASH, hash));
+}
+
 int tap_fd_set_vnet_le(int f

[PATCH RFC v3 06/11] virtio-net: Add hash type options

2024-09-14 Thread Akihiko Odaki

By default, virtio-net limits the hash types that will be advertised to
the guest so that all hash types are covered by the offloading
capability the client provides. This change allows overriding this
behavior and advertising hash types that require user-space hash
calculation by specifying "on" for the corresponding properties.

Signed-off-by: Akihiko Odaki 
---
 include/hw/virtio/virtio-net.h |  1 +
 hw/net/virtio-net.c| 45 --
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index 202016ec74fc..cc6da6ad6a1b 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -148,6 +148,7 @@ typedef struct VirtioNetRssData {
 uint32_t runtime_hash_types;
 uint32_t supported_hash_types;
 uint32_t peer_hash_types;
+OnOffAutoBit specified_hash_types;
 uint8_t key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
 uint16_t indirections_len;
 uint16_t *indirections_table;
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 3da15a60eaa5..38ccd706f956 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3808,9 +3808,14 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 if (qemu_get_vnet_hash_supported_types(qemu_get_queue(n->nic)->peer,
&n->rss_data.peer_hash_types)) {
 n->rss_data.peer_hash_available = true;
-n->rss_data.supported_hash_types = n->rss_data.peer_hash_types;
+n->rss_data.supported_hash_types =
+n->rss_data.specified_hash_types.on_bits |
+(n->rss_data.specified_hash_types.auto_bits &
+ n->rss_data.peer_hash_types);
 } else {
-n->rss_data.supported_hash_types = VIRTIO_NET_RSS_SUPPORTED_HASHES;
+n->rss_data.supported_hash_types =
+n->rss_data.specified_hash_types.on_bits |
+n->rss_data.specified_hash_types.auto_bits;
 }
 }
 
@@ -4035,6 +4040,42 @@ static Property virtio_net_properties[] = {
   VIRTIO_NET_F_GUEST_USO6, true),
 DEFINE_PROP_BIT64("host_uso", VirtIONet, host_features,
   VIRTIO_NET_F_HOST_USO, true),
+DEFINE_PROP_ON_OFF_AUTO_BIT("hash-ipv4", VirtIONet,
+rss_data.specified_hash_types,
+VIRTIO_NET_HASH_REPORT_IPv4 - 1,
+ON_OFF_AUTO_AUTO),
+DEFINE_PROP_ON_OFF_AUTO_BIT("hash-tcp4", VirtIONet,
+rss_data.specified_hash_types,
+VIRTIO_NET_HASH_REPORT_TCPv4 - 1,
+ON_OFF_AUTO_AUTO),
+DEFINE_PROP_ON_OFF_AUTO_BIT("hash-udp4", VirtIONet,
+rss_data.specified_hash_types,
+VIRTIO_NET_HASH_REPORT_UDPv4 - 1,
+ON_OFF_AUTO_AUTO),
+DEFINE_PROP_ON_OFF_AUTO_BIT("hash-ipv6", VirtIONet,
+rss_data.specified_hash_types,
+VIRTIO_NET_HASH_REPORT_IPv6 - 1,
+ON_OFF_AUTO_AUTO),
+DEFINE_PROP_ON_OFF_AUTO_BIT("hash-tcp6", VirtIONet,
+rss_data.specified_hash_types,
+VIRTIO_NET_HASH_REPORT_TCPv6 - 1,
+ON_OFF_AUTO_AUTO),
+DEFINE_PROP_ON_OFF_AUTO_BIT("hash-udp6", VirtIONet,
+rss_data.specified_hash_types,
+VIRTIO_NET_HASH_REPORT_UDPv6 - 1,
+ON_OFF_AUTO_AUTO),
+DEFINE_PROP_ON_OFF_AUTO_BIT("hash-ipv6ex", VirtIONet,
+rss_data.specified_hash_types,
+VIRTIO_NET_HASH_REPORT_IPv6_EX - 1,
+ON_OFF_AUTO_AUTO),
+DEFINE_PROP_ON_OFF_AUTO_BIT("hash-tcp6ex", VirtIONet,
+rss_data.specified_hash_types,
+VIRTIO_NET_HASH_REPORT_TCPv6_EX - 1,
+ON_OFF_AUTO_AUTO),
+DEFINE_PROP_ON_OFF_AUTO_BIT("hash-udp6ex", VirtIONet,
+rss_data.specified_hash_types,
+VIRTIO_NET_HASH_REPORT_UDPv6_EX - 1,
+ON_OFF_AUTO_AUTO),
 DEFINE_PROP_END_OF_LIST(),
 };
 

-- 
2.46.0
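
The selection rule this patch adds to virtio_net_device_realize() can be sketched in isolation as follows. This is only an illustration, not code from the series: DemoHashSpec and demo_advertised_hash_types() are hypothetical names, and the bit values a caller passes in are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the OnOffAutoBit storage used by the series. */
typedef struct DemoHashSpec {
    uint32_t on_bits;   /* hash types the user forced "on" */
    uint32_t auto_bits; /* hash types left at "auto" */
} DemoHashSpec;

/*
 * "on" types are always advertised. "auto" types are advertised only if
 * the peer can offload them; without peer hashing every "auto" type is
 * advertised. Types set to "off" appear in neither bitmap and are never
 * advertised.
 */
static uint32_t demo_advertised_hash_types(DemoHashSpec spec,
                                           bool peer_hash_available,
                                           uint32_t peer_hash_types)
{
    if (peer_hash_available) {
        return spec.on_bits | (spec.auto_bits & peer_hash_types);
    }

    return spec.on_bits | spec.auto_bits;
}
```

With on_bits = 0x4 and auto_bits = 0x3, a peer that only offloads type 0x1 yields 0x5, so the forced-on type 0x4 has to be hashed in user space.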

[PATCH RFC v3 01/11] qdev-properties: DEFINE_PROP_ON_OFF_AUTO_BIT()

2024-09-14 Thread Akihiko Odaki

Signed-off-by: Akihiko Odaki 
---
 include/hw/qdev-properties.h | 18 
 hw/core/qdev-properties.c| 66 +++-
 2 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/include/hw/qdev-properties.h b/include/hw/qdev-properties.h
index 09aa04ca1e27..678837569784 100644
--- a/include/hw/qdev-properties.h
+++ b/include/hw/qdev-properties.h
@@ -43,10 +43,21 @@ struct PropertyInfo {
 ObjectPropertyRelease *release;
 };
 
+/**
+ * struct OnOffAutoBit - OnOffAuto storage with 32 elements.
+ * @on_bits: Bitmap of elements with "on".
+ * @auto_bits: Bitmap of elements with "auto".
+ */
+typedef struct OnOffAutoBit {
+uint32_t on_bits;
+uint32_t auto_bits;
+} OnOffAutoBit;
+
 
 /*** qdev-properties.c ***/
 
 extern const PropertyInfo qdev_prop_bit;
+extern const PropertyInfo qdev_prop_on_off_auto_bit;
 extern const PropertyInfo qdev_prop_bit64;
 extern const PropertyInfo qdev_prop_bool;
 extern const PropertyInfo qdev_prop_enum;
@@ -86,6 +97,13 @@ extern const PropertyInfo qdev_prop_link;
 .set_default = true,\
 .defval.u= (bool)_defval)
 
+#define DEFINE_PROP_ON_OFF_AUTO_BIT(_name, _state, _field, _bit, _defval) \
+DEFINE_PROP(_name, _state, _field, qdev_prop_on_off_auto_bit, \
+OnOffAutoBit, \
+.bitnr= (_bit),   \
+.set_default = true,  \
+.defval.i = (OnOffAuto)_defval)
+
 #define DEFINE_PROP_UNSIGNED(_name, _state, _field, _defval, _prop, _type) \
 DEFINE_PROP(_name, _state, _field, _prop, _type,   \
 .set_default = true,   \
diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
index 86a583574dd0..e1ff992e7177 100644
--- a/hw/core/qdev-properties.c
+++ b/hw/core/qdev-properties.c
@@ -133,7 +133,8 @@ const PropertyInfo qdev_prop_enum = {
 
 static uint32_t qdev_get_prop_mask(Property *prop)
 {
-assert(prop->info == &qdev_prop_bit);
+assert(prop->info == &qdev_prop_bit ||
+   prop->info == &qdev_prop_on_off_auto_bit);
 return 0x1 << prop->bitnr;
 }
 
@@ -183,6 +184,69 @@ const PropertyInfo qdev_prop_bit = {
 .set_default_value = set_default_value_bool,
 };
 
+static void prop_get_on_off_auto_bit(Object *obj, Visitor *v,
+ const char *name, void *opaque,
+ Error **errp)
+{
+Property *prop = opaque;
+OnOffAutoBit *p = object_field_prop_ptr(obj, prop);
+int value;
+uint32_t mask = qdev_get_prop_mask(prop);
+
+if (p->auto_bits & mask) {
+value = ON_OFF_AUTO_AUTO;
+} else if (p->on_bits & mask) {
+value = ON_OFF_AUTO_ON;
+} else {
+value = ON_OFF_AUTO_OFF;
+}
+
+visit_type_enum(v, name, &value, &OnOffAuto_lookup, errp);
+}
+
+static void prop_set_on_off_auto_bit(Object *obj, Visitor *v,
+ const char *name, void *opaque,
+ Error **errp)
+{
+Property *prop = opaque;
+OnOffAutoBit *p = object_field_prop_ptr(obj, prop);
+bool bool_value;
+int value;
+uint32_t mask = qdev_get_prop_mask(prop);
+
+if (visit_type_bool(v, name, &bool_value, NULL)) {
+value = bool_value ? ON_OFF_AUTO_ON : ON_OFF_AUTO_OFF;
+} else if (!visit_type_enum(v, name, &value, &OnOffAuto_lookup, errp)) {
+return;
+}
+
+switch (value) {
+case ON_OFF_AUTO_AUTO:
+p->on_bits &= ~mask;
+p->auto_bits |= mask;
+break;
+
+case ON_OFF_AUTO_ON:
+p->on_bits |= mask;
+p->auto_bits &= ~mask;
+break;
+
+case ON_OFF_AUTO_OFF:
+p->on_bits &= ~mask;
+p->auto_bits &= ~mask;
+break;
+}
+}
+
+const PropertyInfo qdev_prop_on_off_auto_bit = {
+.name  = "OnOffAuto",
+.description = "on/off/auto",
+.enum_table = &OnOffAuto_lookup,
+.get = prop_get_on_off_auto_bit,
+.set = prop_set_on_off_auto_bit,
+.set_default_value = qdev_propinfo_set_default_value_enum,
+};
+
 /* Bit64 */
 
 static uint64_t qdev_get_prop_mask64(Property *prop)

-- 
2.46.0
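
The two-bitmap encoding used by OnOffAutoBit can be tried out standalone with the sketch below. It assumes nothing from QEMU: DemoTriState, DemoTriBits, demo_tri_set() and demo_tri_get() are hypothetical names that mirror the getter and setter in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tri-state; QEMU itself uses the QAPI OnOffAuto enum. */
typedef enum DemoTriState {
    DEMO_AUTO,
    DEMO_ON,
    DEMO_OFF,
} DemoTriState;

typedef struct DemoTriBits {
    uint32_t on_bits;   /* bit set: element is "on" */
    uint32_t auto_bits; /* bit set: element is "auto" */
} DemoTriBits;

/* Store one element; "off" is encoded as both bits cleared. */
static void demo_tri_set(DemoTriBits *p, unsigned bit, DemoTriState value)
{
    uint32_t mask;

    assert(bit < 32);
    mask = UINT32_C(1) << bit;

    p->on_bits &= ~mask;
    p->auto_bits &= ~mask;

    if (value == DEMO_ON) {
        p->on_bits |= mask;
    } else if (value == DEMO_AUTO) {
        p->auto_bits |= mask;
    }
}

/* Read one element back, checking "auto" first as the patch does. */
static DemoTriState demo_tri_get(const DemoTriBits *p, unsigned bit)
{
    uint32_t mask;

    assert(bit < 32);
    mask = UINT32_C(1) << bit;

    if (p->auto_bits & mask) {
        return DEMO_AUTO;
    }

    return (p->on_bits & mask) ? DEMO_ON : DEMO_OFF;
}
```

Because each element needs one bit in each 32-bit bitmap, the structure holds at most 32 independent on/off/auto values.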

[PATCH RFC v3 10/11] tap: Report virtio-net hashing support on Linux

2024-09-14 Thread Akihiko Odaki

This allows offloading virtio-net hashing to tap on Linux.

Signed-off-by: Akihiko Odaki 
---
 net/tap-linux.h   |  1 +
 net/tap_int.h |  1 +
 net/tap-bsd.c |  5 +
 net/tap-linux.c   | 13 +
 net/tap-solaris.c |  5 +
 net/tap-stub.c|  5 +
 net/tap.c |  8 
 7 files changed, 38 insertions(+)

diff --git a/net/tap-linux.h b/net/tap-linux.h
index 5fac64c24f99..c773609c799e 100644
--- a/net/tap-linux.h
+++ b/net/tap-linux.h
@@ -32,6 +32,7 @@
 #define TUNSETVNETLE _IOW('T', 220, int)
 #define TUNSETVNETBE _IOW('T', 222, int)
 #define TUNSETSTEERINGEBPF _IOR('T', 224, int)
+#define TUNGETVNETHASHCAP _IOR('T', 228, NetVnetHash)
 #define TUNSETVNETHASH _IOW('T', 229, NetVnetHash)
 
 #endif
diff --git a/net/tap_int.h b/net/tap_int.h
index e1b53e343397..84a88841b720 100644
--- a/net/tap_int.h
+++ b/net/tap_int.h
@@ -36,6 +36,7 @@ ssize_t tap_read_packet(int tapfd, uint8_t *buf, int maxlen);
 
 void tap_set_sndbuf(int fd, const NetdevTapOptions *tap, Error **errp);
 int tap_probe_vnet_hdr(int fd, Error **errp);
+bool tap_probe_vnet_hash_supported_types(int fd, uint32_t *types);
 int tap_probe_has_ufo(int fd);
 int tap_probe_has_uso(int fd);
 void tap_fd_set_offload(int fd, int csum, int tso4, int tso6, int ecn, int ufo,
diff --git a/net/tap-bsd.c b/net/tap-bsd.c
index 2eee0c0a0ec5..142e1abe0420 100644
--- a/net/tap-bsd.c
+++ b/net/tap-bsd.c
@@ -217,6 +217,11 @@ int tap_probe_has_uso(int fd)
 return 0;
 }
 
+bool tap_probe_vnet_hash_supported_types(int fd, uint32_t *types)
+{
+return false;
+}
+
 void tap_fd_set_vnet_hdr_len(int fd, int len)
 {
 }
diff --git a/net/tap-linux.c b/net/tap-linux.c
index e96d38eec922..a601cb1ed2d9 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -185,6 +185,19 @@ int tap_probe_has_uso(int fd)
 return 1;
 }
 
+bool tap_probe_vnet_hash_supported_types(int fd, uint32_t *types)
+{
+NetVnetHash hash;
+
+if (ioctl(fd, TUNGETVNETHASHCAP, &hash)) {
+return false;
+}
+
+*types = hash.types;
+
+return true;
+}
+
 void tap_fd_set_vnet_hdr_len(int fd, int len)
 {
 if (ioctl(fd, TUNSETVNETHDRSZ, &len) == -1) {
diff --git a/net/tap-solaris.c b/net/tap-solaris.c
index c65104b84e93..00d1c850680d 100644
--- a/net/tap-solaris.c
+++ b/net/tap-solaris.c
@@ -221,6 +221,11 @@ int tap_probe_has_uso(int fd)
 return 0;
 }
 
+bool tap_probe_vnet_hash_supported_types(int fd, uint32_t *types)
+{
+return false;
+}
+
 void tap_fd_set_vnet_hdr_len(int fd, int len)
 {
 }
diff --git a/net/tap-stub.c b/net/tap-stub.c
index 5bdc76216b7f..a4718654fbeb 100644
--- a/net/tap-stub.c
+++ b/net/tap-stub.c
@@ -52,6 +52,11 @@ int tap_probe_has_uso(int fd)
 return 0;
 }
 
+bool tap_probe_vnet_hash_supported_types(int fd, uint32_t *types)
+{
+return false;
+}
+
 void tap_fd_set_vnet_hdr_len(int fd, int len)
 {
 }
diff --git a/net/tap.c b/net/tap.c
index 8d451c745d70..e17565c2ac3c 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -248,6 +248,13 @@ static void tap_set_vnet_hdr_len(NetClientState *nc, int len)
 s->using_vnet_hdr = true;
 }
 
+static bool tap_get_vnet_hash_supported_types(NetClientState *nc,
+  uint32_t *types)
+{
+TAPState *s = DO_UPCAST(TAPState, nc, nc);
+return tap_probe_vnet_hash_supported_types(s->fd, types);
+}
+
 static void tap_set_vnet_hash(NetClientState *nc, const NetVnetHash *hash)
 {
 TAPState *s = DO_UPCAST(TAPState, nc, nc);
@@ -350,6 +357,7 @@ static NetClientInfo net_tap_info = {
 .has_vnet_hdr_len = tap_has_vnet_hdr_len,
 .set_offload = tap_set_offload,
 .set_vnet_hdr_len = tap_set_vnet_hdr_len,
+.get_vnet_hash_supported_types = tap_get_vnet_hash_supported_types,
 .set_vnet_hash = tap_set_vnet_hash,
 .set_vnet_le = tap_set_vnet_le,
 .set_vnet_be = tap_set_vnet_be,

-- 
2.46.0

[PATCH RFC v3 04/11] virtio-net: Retrieve peer hashing capability

2024-09-14 Thread Akihiko Odaki

Retrieve peer hashing capability instead of hardcoding.

Signed-off-by: Akihiko Odaki 
---
 include/hw/virtio/virtio-net.h |  5 +++-
 hw/net/virtio-net.c| 67 ++
 net/vhost-vdpa.c   |  4 +--
 3 files changed, 60 insertions(+), 16 deletions(-)

diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index 060c23c04d2d..202016ec74fc 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -144,7 +144,10 @@ typedef struct VirtioNetRssData {
 boolenabled_software_rss;
 boolredirect;
 boolpopulate_hash;
-uint32_t hash_types;
+boolpeer_hash_available;
+uint32_t runtime_hash_types;
+uint32_t supported_hash_types;
+uint32_t peer_hash_types;
 uint8_t key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
 uint16_t indirections_len;
 uint16_t *indirections_table;
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 206b0335169d..3da15a60eaa5 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -157,7 +157,7 @@ static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
  virtio_host_has_feature(vdev, VIRTIO_NET_F_RSS) ?
  VIRTIO_NET_RSS_MAX_TABLE_LEN : 1);
 virtio_stl_p(vdev, &netcfg.supported_hash_types,
- VIRTIO_NET_RSS_SUPPORTED_HASHES);
+ n->rss_data.supported_hash_types);
 memcpy(config, &netcfg, n->config_size);
 
 /*
@@ -1175,7 +1175,7 @@ static void rss_data_to_rss_config(struct VirtioNetRssData *data,
 {
 config->redirect = data->redirect;
 config->populate_hash = data->populate_hash;
-config->hash_types = data->hash_types;
+config->hash_types = data->runtime_hash_types;
 config->indirections_len = data->indirections_len;
 config->default_queue = data->default_queue;
 }
@@ -1209,6 +1209,10 @@ static void virtio_net_detach_epbf_rss(VirtIONet *n)
 
 static void virtio_net_commit_rss_config(VirtIONet *n)
 {
+if (n->rss_data.peer_hash_available) {
+return;
+}
+
 if (n->rss_data.enabled) {
 n->rss_data.enabled_software_rss = n->rss_data.populate_hash;
 if (n->rss_data.populate_hash) {
@@ -1222,7 +1226,7 @@ static void virtio_net_commit_rss_config(VirtIONet *n)
 }
 }
 
-trace_virtio_net_rss_enable(n->rss_data.hash_types,
+trace_virtio_net_rss_enable(n->rss_data.runtime_hash_types,
 n->rss_data.indirections_len,
 sizeof(n->rss_data.key));
 } else {
@@ -1324,7 +1328,7 @@ static uint16_t virtio_net_handle_rss(VirtIONet *n,
 err_value = (uint32_t)s;
 goto error;
 }
-n->rss_data.hash_types = virtio_ldl_p(vdev, &cfg.hash_types);
+n->rss_data.runtime_hash_types = virtio_ldl_p(vdev, &cfg.hash_types);
 n->rss_data.indirections_len =
 virtio_lduw_p(vdev, &cfg.indirection_table_mask);
 n->rss_data.indirections_len++;
@@ -1387,12 +1391,12 @@ static uint16_t virtio_net_handle_rss(VirtIONet *n,
 err_value = temp.b;
 goto error;
 }
-if (!temp.b && n->rss_data.hash_types) {
+if (!temp.b && n->rss_data.runtime_hash_types) {
 err_msg = "No key provided";
 err_value = 0;
 goto error;
 }
-if (!temp.b && !n->rss_data.hash_types) {
+if (!temp.b && !n->rss_data.runtime_hash_types) {
 virtio_net_disable_rss(n);
 return queue_pairs;
 }
@@ -1793,7 +1797,7 @@ static int virtio_net_process_rss(NetClientState *nc, const uint8_t *buf,
 net_rx_pkt_set_protocols(pkt, &iov, 1, n->host_hdr_len);
 net_rx_pkt_get_protocols(pkt, &hasip4, &hasip6, &l4hdr_proto);
 net_hash_type = virtio_net_get_hash_type(hasip4, hasip6, l4hdr_proto,
- n->rss_data.hash_types);
+ n->rss_data.runtime_hash_types);
 if (net_hash_type > NetPktRssIpV6UdpEx) {
 if (n->rss_data.populate_hash) {
 hdr->hash_value = VIRTIO_NET_HASH_REPORT_NONE;
@@ -2973,6 +2977,14 @@ static uint64_t virtio_net_get_features(VirtIODevice *vdev, uint64_t features,
 {
 VirtIONet *n = VIRTIO_NET(vdev);
 NetClientState *nc = qemu_get_queue(n->nic);
+uint32_t supported_hash_types = n->rss_data.supported_hash_types;
+uint32_t peer_hash_types = n->rss_data.peer_hash_types;
+bool use_own_hash =
+(supported_hash_types & VIRTIO_NET_RSS_SUPPORTED_HASHES) ==
+supported_hash_types;
+bool use_peer_hash =
+n->rss_data.peer_hash_available &&
+(supported_hash_types & peer_hash_types) == supported_hash_types;
 
 /* Firstly sync all virtio-net possible supported featur

[PATCH RFC v3 00/11] virtio-net: Offload hashing without eBPF

2024-09-14 Thread Akihiko Odaki

Based-on: <20240915-queue-v1-0-b49bd49b9...@daynix.com>
("[PATCH 0/7] virtio-net fixes")

I'm proposing to add a feature to offload virtio-net RSS/hash report to
Linux. This series contains patches to utilize the proposed Linux feature.
The patches for Linux are available at:
https://lore.kernel.org/r/20240915-rss-v3-0-c630015db...@daynix.com/

This work will be presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/

---
Akihiko Odaki (11):
  qdev-properties: DEFINE_PROP_ON_OFF_AUTO_BIT()
  net/vhost-vdpa: Report hashing capability
  virtio-net: Move virtio_net_get_features() down
  virtio-net: Retrieve peer hashing capability
  net/vhost-vdpa: Remove dummy SetSteeringEBPF
  virtio-net: Add hash type options
  net: Allow configuring virtio hashing
  virtio-net: Use qemu_set_vnet_hash()
  virtio-net: Offload hashing without vhost
  tap: Report virtio-net hashing support on Linux
  docs/devel/ebpf_rss.rst: Update for peer RSS

 docs/devel/ebpf_rss.rst|  23 ++-
 include/hw/qdev-properties.h   |  18 +++
 include/hw/virtio/virtio-net.h |   6 +-
 include/net/net.h  |  20 +++
 net/tap-linux.h|   2 +
 net/tap_int.h  |   3 +
 hw/core/qdev-properties.c  |  66 -
 hw/net/virtio-net.c| 327 +
 net/net.c  |  14 ++
 net/tap-bsd.c  |  10 ++
 net/tap-linux.c|  18 +++
 net/tap-solaris.c  |  10 ++
 net/tap-stub.c |  10 ++
 net/tap.c  |  15 ++
 net/vhost-vdpa.c   |  41 +-
 15 files changed, 473 insertions(+), 110 deletions(-)
---
base-commit: decf357a35b1201b34cc37c47b4b027f9601855e
change-id: 20240828-hash-628329a45d4d

Best regards,
-- 
Akihiko Odaki

[PATCH 6/7] virtio-net: Copy received header to buffer

2024-09-14 Thread Akihiko Odaki

receive_header() used to cast away the const qualifier of the pointer to
the received packet in order to modify the header. Avoid this by copying
the received header to a buffer.

Signed-off-by: Akihiko Odaki 
---
 hw/net/virtio-net.c | 85 +
 1 file changed, 46 insertions(+), 39 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 3fc1d10cb9e0..ca4e22344f78 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1685,41 +1685,44 @@ static void virtio_net_hdr_swap(VirtIODevice *vdev, struct virtio_net_hdr *hdr)
  * cache.
  */
 static void work_around_broken_dhclient(struct virtio_net_hdr *hdr,
-uint8_t *buf, size_t size)
+size_t *hdr_len, const uint8_t *buf,
+size_t buf_size, size_t *buf_offset)
 {
 size_t csum_size = ETH_HLEN + sizeof(struct ip_header) +
sizeof(struct udp_header);
 
+buf += *buf_offset;
+buf_size -= *buf_offset;
+
 if ((hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) && /* missing csum */
-(size >= csum_size && size < 1500) && /* normal sized MTU */
+(buf_size >= csum_size && buf_size < 1500) && /* normal sized MTU */
 (buf[12] == 0x08 && buf[13] == 0x00) && /* ethertype == IPv4 */
 (buf[23] == 17) && /* ip.protocol == UDP */
 (buf[34] == 0 && buf[35] == 67)) { /* udp.srcport == bootps */
-net_checksum_calculate(buf, size, CSUM_UDP);
+memcpy((uint8_t *)hdr + *hdr_len, buf, csum_size);
+net_checksum_calculate((uint8_t *)hdr + *hdr_len, csum_size, CSUM_UDP);
 hdr->flags &= ~VIRTIO_NET_HDR_F_NEEDS_CSUM;
+*hdr_len += csum_size;
+*buf_offset += csum_size;
 }
 }
 
-static void receive_header(VirtIONet *n, const struct iovec *iov, int iov_cnt,
-   const void *buf, size_t size)
+static size_t receive_header(VirtIONet *n, struct virtio_net_hdr *hdr,
+ const void *buf, size_t buf_size,
+ size_t *buf_offset)
 {
-if (n->has_vnet_hdr) {
-/* FIXME this cast is evil */
-void *wbuf = (void *)buf;
-work_around_broken_dhclient(wbuf, wbuf + n->host_hdr_len,
-size - n->host_hdr_len);
+size_t hdr_len = n->guest_hdr_len;
 
-if (n->needs_vnet_hdr_swap) {
-virtio_net_hdr_swap(VIRTIO_DEVICE(n), wbuf);
-}
-iov_from_buf(iov, iov_cnt, 0, buf, sizeof(struct virtio_net_hdr));
-} else {
-struct virtio_net_hdr hdr = {
-.flags = 0,
-.gso_type = VIRTIO_NET_HDR_GSO_NONE
-};
-iov_from_buf(iov, iov_cnt, 0, &hdr, sizeof hdr);
+memcpy(hdr, buf, sizeof(struct virtio_net_hdr));
+
+*buf_offset = n->host_hdr_len;
+work_around_broken_dhclient(hdr, &hdr_len, buf, buf_size, buf_offset);
+
+if (n->needs_vnet_hdr_swap) {
+virtio_net_hdr_swap(VIRTIO_DEVICE(n), hdr);
 }
+
+return hdr_len;
 }
 
 static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
@@ -1887,6 +1890,13 @@ static int virtio_net_process_rss(NetClientState *nc, const uint8_t *buf,
 return (index == new_index) ? -1 : new_index;
 }
 
+typedef struct Header {
+struct virtio_net_hdr_v1_hash virtio_net;
+struct eth_header eth;
+struct ip_header ip;
+struct udp_header udp;
+} Header;
+
 static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
   size_t size)
 {
@@ -1896,15 +1906,15 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
 VirtQueueElement *elems[VIRTQUEUE_MAX_SIZE];
 size_t lens[VIRTQUEUE_MAX_SIZE];
 struct iovec mhdr_sg[VIRTQUEUE_MAX_SIZE];
-struct virtio_net_hdr_v1_hash extra_hdr;
+Header hdr;
 unsigned mhdr_cnt = 0;
 size_t offset, i, guest_offset, j;
 ssize_t err;
 
-memset(&extra_hdr, 0, sizeof(extra_hdr));
+memset(&hdr.virtio_net, 0, sizeof(hdr.virtio_net));
 
 if (n->rss_data.enabled && n->rss_data.enabled_software_rss) {
-int index = virtio_net_process_rss(nc, buf, size, &extra_hdr);
+int index = virtio_net_process_rss(nc, buf, size, &hdr.virtio_net);
 if (index >= 0) {
 nc = qemu_get_subqueue(n->nic, index);
 }
@@ -1969,21 +1979,18 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
 if (n->mergeable_rx_bufs) {
 mhdr_cnt = iov_copy(mhdr_sg, ARRAY_SIZE(mhdr_sg),
 sg, elem->in_num,
-offsetof(typeof(extra_hdr), hdr.num_buffers),
-

[PATCH 1/7] net: checksum: Convert data to void *

2024-09-14 Thread Akihiko Odaki

Convert the data parameter of net_checksum_calculate() to void * to
save unnecessary casts for callers.

Signed-off-by: Akihiko Odaki 
---
 include/net/checksum.h | 2 +-
 net/checksum.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/net/checksum.h b/include/net/checksum.h
index 7dec37e56c78..188e4cca0b7f 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -30,7 +30,7 @@ uint32_t net_checksum_add_cont(int len, uint8_t *buf, int seq);
 uint16_t net_checksum_finish(uint32_t sum);
 uint16_t net_checksum_tcpudp(uint16_t length, uint16_t proto,
  uint8_t *addrs, uint8_t *buf);
-void net_checksum_calculate(uint8_t *data, int length, int csum_flag);
+void net_checksum_calculate(void *data, int length, int csum_flag);
 
 static inline uint32_t
 net_checksum_add(int len, uint8_t *buf)
diff --git a/net/checksum.c b/net/checksum.c
index 1a957e4c0b10..537457d89d07 100644
--- a/net/checksum.c
+++ b/net/checksum.c
@@ -57,7 +57,7 @@ uint16_t net_checksum_tcpudp(uint16_t length, uint16_t proto,
 return net_checksum_finish(sum);
 }
 
-void net_checksum_calculate(uint8_t *data, int length, int csum_flag)
+void net_checksum_calculate(void *data, int length, int csum_flag)
 {
 int mac_hdr_len, ip_len;
 struct ip_header *ip;
@@ -101,7 +101,7 @@ void net_checksum_calculate(uint8_t *data, int length, int csum_flag)
 return;
 }
 
-ip = (struct ip_header *)(data + mac_hdr_len);
+ip = (struct ip_header *)((uint8_t *)data + mac_hdr_len);
 
 if (IP_HEADER_VERSION(ip) != IP_HEADER_VERSION_4) {
 return; /* not IPv4 */

-- 
2.46.0
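
The idiom this patch relies on can be shown with a small standalone sketch: a void * parameter accepts any object pointer without a cast at the call site, while the callee converts to uint8_t * before doing byte arithmetic. DemoHeader and demo_sum_from() are hypothetical names and do not exist in QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a header structure a caller might pass in. */
typedef struct DemoHeader {
    uint8_t bytes[4];
} DemoHeader;

/*
 * Sum the bytes of an object starting at the given offset. The pointer
 * is converted to uint8_t * before adding the offset because arithmetic
 * on void * is not standard C.
 */
static uint32_t demo_sum_from(const void *data, size_t offset, size_t length)
{
    const uint8_t *p = (const uint8_t *)data + offset;
    uint32_t sum = 0;
    size_t i;

    assert(offset <= length);

    for (i = 0; i < length - offset; i++) {
        sum += p[i];
    }

    return sum;
}
```

A caller can then write demo_sum_from(&hdr, 1, sizeof(hdr)) with no (uint8_t *) cast, which is the convenience the patch is after for net_checksum_calculate().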

[PATCH 5/7] virtio-net: Initialize hash reporting values

2024-09-14 Thread Akihiko Odaki

The specification says hash_report should be set to
VIRTIO_NET_HASH_REPORT_NONE if VIRTIO_NET_F_HASH_REPORT is negotiated
but not configured with VIRTIO_NET_CTRL_MQ_RSS_CONFIG. However,
virtio_net_receive_rcu() instead wrote out the content of the extra_hdr
variable, which is uninitialized in such a case.

Fix this by zeroing the extra_hdr.

Fixes: e22f0603fb2f ("virtio-net: reference implementation of hash report")
Signed-off-by: Akihiko Odaki 
---
 hw/net/virtio-net.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 3753c6aaca83..3fc1d10cb9e0 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1901,6 +1901,8 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
 size_t offset, i, guest_offset, j;
 ssize_t err;
 
+memset(&extra_hdr, 0, sizeof(extra_hdr));
+
 if (n->rss_data.enabled && n->rss_data.enabled_software_rss) {
 int index = virtio_net_process_rss(nc, buf, size, &extra_hdr);
 if (index >= 0) {

-- 
2.46.0
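
The effect of the added memset() can be sketched as below. The structure is a simplified, hypothetical stand-in for the hash fields of the virtio-net header, and DEMO_HASH_REPORT_NONE is assumed to equal VIRTIO_NET_HASH_REPORT_NONE, which the specification defines as 0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_HASH_REPORT_NONE 0 /* assumed equal to VIRTIO_NET_HASH_REPORT_NONE */

/* Hypothetical, simplified stand-in for the hash fields of the header. */
typedef struct DemoHashHdr {
    uint32_t hash_value;
    uint16_t hash_report;
    uint16_t padding;
} DemoHashHdr;

/*
 * Zero the header first so that hash_report reads as "none" unless the
 * RSS path actually fills it in.
 */
static void demo_prepare_hdr(DemoHashHdr *hdr, int rss_configured,
                             uint32_t hash_value, uint16_t hash_report)
{
    memset(hdr, 0, sizeof(*hdr));

    if (rss_configured) {
        hdr->hash_value = hash_value;
        hdr->hash_report = hash_report;
    }
}
```

Without the memset(), the unconfigured case would hand the guest whatever happened to be on the stack.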

[PATCH 3/7] virtio-net: Do not check for the queue before RSS

2024-09-14 Thread Akihiko Odaki

virtio_net_can_receive() checks if the queue is ready, but RSS may change
the queue to use, so strictly speaking we may still be able to receive
the packet even if the queue initially provided is not ready. Perform RSS
before virtio_net_can_receive() to cover such a case.

Fixes: 4474e37a5b3a ("virtio-net: implement RX RSS processing")
Signed-off-by: Akihiko Odaki 
---
 hw/net/virtio-net.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 755530c035e4..3ee1ebd88daa 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1901,10 +1901,6 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
 size_t offset, i, guest_offset, j;
 ssize_t err;
 
-if (!virtio_net_can_receive(nc)) {
-return -1;
-}
-
 if (!no_rss && n->rss_data.enabled && n->rss_data.enabled_software_rss) {
 int index = virtio_net_process_rss(nc, buf, size, &extra_hdr);
 if (index >= 0) {
@@ -1913,6 +1909,10 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
 }
 }
 
+if (!virtio_net_can_receive(nc)) {
+return -1;
+}
+
 /* hdr_len refers to the header we supply to the guest */
 if (!virtio_net_has_buffers(q, size + n->guest_hdr_len - n->host_hdr_len)) {
 return 0;

-- 
2.46.0
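
The difference in ordering can be modelled with a toy sketch. Queues are reduced to an array of readiness flags and RSS to a precomputed target index (negative meaning "no change"); all names here are hypothetical and only illustrate the control flow described in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_DROP (-1)

/* Old order: give up as soon as the initially provided queue is not ready. */
static int demo_check_then_steer(const bool *ready, int initial, int rss_index)
{
    int target;

    if (!ready[initial]) {
        return DEMO_DROP;
    }

    target = rss_index >= 0 ? rss_index : initial;
    return ready[target] ? target : DEMO_DROP;
}

/* New order: let RSS pick the queue first, then check only that queue. */
static int demo_steer_then_check(const bool *ready, int initial, int rss_index)
{
    int target = rss_index >= 0 ? rss_index : initial;

    return ready[target] ? target : DEMO_DROP;
}
```

With queue 0 not ready and queue 1 ready, a packet that RSS steers from queue 0 to queue 1 is dropped by the old order but delivered by the new one.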

[PATCH 7/7] virtio-net: Fix num_buffers for version 1

2024-09-14 Thread Akihiko Odaki

The specification says the device MUST set num_buffers to 1 if
VIRTIO_NET_F_MRG_RXBUF has not been negotiated.

Fixes: df91055db5c9 ("virtio-net: enable virtio 1.0")
Signed-off-by: Akihiko Odaki 
---
 hw/net/virtio-net.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index ca4e22344f78..b4a3fb575c7c 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1982,6 +1982,8 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
 offsetof(typeof(hdr),
  virtio_net.hdr.num_buffers),
 sizeof(hdr.virtio_net.hdr.num_buffers));
+} else {
+hdr.virtio_net.hdr.num_buffers = cpu_to_le16(1);
 }
 
 guest_offset = n->has_vnet_hdr ?

-- 
2.46.0
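
As a rough illustration of the rule being enforced, the sketch below writes num_buffers into a raw 12-byte virtio 1.0 header in little-endian order. The helper names are hypothetical; the offset of 10 assumes the usual virtio_net_hdr_v1 layout (flags, gso_type, hdr_len, gso_size, csum_start, csum_offset, num_buffers).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    DEMO_NUM_BUFFERS_OFFSET = 10, /* assumed offset within virtio_net_hdr_v1 */
    DEMO_HDR_V1_LEN = 12,
};

/* Portable little-endian store, playing the role of cpu_to_le16(). */
static void demo_store_le16(uint8_t *dst, uint16_t value)
{
    dst[0] = (uint8_t)(value & 0xff);
    dst[1] = (uint8_t)(value >> 8);
}

/*
 * With mergeable RX buffers the real buffer count is reported; without
 * them the field must still be set, and always to 1.
 */
static void demo_set_num_buffers(uint8_t hdr[DEMO_HDR_V1_LEN], bool mergeable,
                                 uint16_t merged_count)
{
    demo_store_le16(hdr + DEMO_NUM_BUFFERS_OFFSET,
                    mergeable ? merged_count : 1);
}
```

The explicit little-endian store matters on big-endian hosts, where a plain assignment of 1 would be read by the guest as 256.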

[PATCH 0/7] virtio-net fixes

2024-09-14 Thread Akihiko Odaki

Most of this series consists of fixes for software RSS and hash
reporting, which should have no production users.

However, there is one exception: patch "virtio-net: Fix size check in
dhclient workaround" fixes an out-of-bounds access that can be triggered
for anyone who doesn't use vhost. It has Cc: qemu-sta...@nongnu.org and
can be applied independently.

Signed-off-by: Akihiko Odaki 
---
Akihiko Odaki (7):
  net: checksum: Convert data to void *
  virtio-net: Fix size check in dhclient workaround
  virtio-net: Do not check for the queue before RSS
  virtio-net: Fix hash reporting when the queue changes
  virtio-net: Initialize hash reporting values
  virtio-net: Copy received header to buffer
  virtio-net: Fix num_buffers for version 1

 include/net/checksum.h |   2 +-
 hw/net/virtio-net.c| 109 -
 net/checksum.c |   4 +-
 3 files changed, 65 insertions(+), 50 deletions(-)
---
base-commit: 31669121a01a14732f57c49400bc239cf9fd505f
change-id: 20240907-queue-f425937a730f

Best regards,
-- 
Akihiko Odaki

[PATCH 2/7] virtio-net: Fix size check in dhclient workaround

2024-09-14 Thread Akihiko Odaki

work_around_broken_dhclient() accesses IP and UDP headers to detect
relevant packets and to calculate checksums, but it didn't check whether
the packet is large enough to hold them, risking out-of-bounds accesses.
Fix this by correcting the size requirement.

Fixes: 1d41b0c1ec66 ("Work around dhclient brokenness")
Cc: qemu-sta...@nongnu.org
Signed-off-by: Akihiko Odaki 
---
 hw/net/virtio-net.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 8f3097270869..755530c035e4 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1687,8 +1687,11 @@ static void virtio_net_hdr_swap(VirtIODevice *vdev, struct virtio_net_hdr *hdr)
 static void work_around_broken_dhclient(struct virtio_net_hdr *hdr,
 uint8_t *buf, size_t size)
 {
+size_t csum_size = ETH_HLEN + sizeof(struct ip_header) +
+   sizeof(struct udp_header);
+
 if ((hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) && /* missing csum */
-(size > 27 && size < 1500) && /* normal sized MTU */
+(size >= csum_size && size < 1500) && /* normal sized MTU */
 (buf[12] == 0x08 && buf[13] == 0x00) && /* ethertype == IPv4 */
 (buf[23] == 17) && /* ip.protocol == UDP */
 (buf[34] == 0 && buf[35] == 67)) { /* udp.srcport == bootps */

-- 
2.46.0
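
The arithmetic behind the corrected bound can be checked with a short sketch. It assumes the usual header sizes (14-byte Ethernet, 20-byte IPv4 without options, 8-byte UDP); the function names are hypothetical and only compare the old and new conditions against the highest index the workaround reads, buf[35].

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum {
    DEMO_ETH_HLEN = 14,          /* assumed Ethernet header length */
    DEMO_IPV4_HLEN = 20,         /* assumed sizeof(struct ip_header) */
    DEMO_UDP_HLEN = 8,           /* assumed sizeof(struct udp_header) */
    DEMO_HIGHEST_INDEX_READ = 35 /* buf[35] is the low byte of udp.srcport */
};

/* Size condition before the fix. */
static bool demo_old_check(size_t size)
{
    return size > 27 && size < 1500;
}

/* Size condition after the fix: all three headers must fit. */
static bool demo_new_check(size_t size)
{
    size_t csum_size = DEMO_ETH_HLEN + DEMO_IPV4_HLEN + DEMO_UDP_HLEN;

    return size >= csum_size && size < 1500;
}

/* True if every byte the detection logic inspects lies inside the buffer. */
static bool demo_reads_in_bounds(size_t size)
{
    return size > DEMO_HIGHEST_INDEX_READ;
}
```

A 28-byte packet passes the old check although buf[34] and buf[35] lie past its end, whereas the new bound of 42 bytes covers every index the function reads.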

[PATCH 4/7] virtio-net: Fix hash reporting when the queue changes

2024-09-14 Thread Akihiko Odaki

virtio_net_process_rss() fills the values used for hash reporting, but
the values used to be thrown away with a recursive function call if
the queue changes after RSS. Avoid the function call to keep the values.

Fixes: a4c960eedcd2 ("virtio-net: Do not write hashes to peer buffer")
Signed-off-by: Akihiko Odaki 
---
 hw/net/virtio-net.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 3ee1ebd88daa..3753c6aaca83 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1888,10 +1888,10 @@ static int virtio_net_process_rss(NetClientState *nc, const uint8_t *buf,
 }
 
 static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
-  size_t size, bool no_rss)
+  size_t size)
 {
 VirtIONet *n = qemu_get_nic_opaque(nc);
-VirtIONetQueue *q = virtio_net_get_subqueue(nc);
+VirtIONetQueue *q;
 VirtIODevice *vdev = VIRTIO_DEVICE(n);
 VirtQueueElement *elems[VIRTQUEUE_MAX_SIZE];
 size_t lens[VIRTQUEUE_MAX_SIZE];
@@ -1901,11 +1901,10 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
 size_t offset, i, guest_offset, j;
 ssize_t err;
 
-if (!no_rss && n->rss_data.enabled && n->rss_data.enabled_software_rss) {
+if (n->rss_data.enabled && n->rss_data.enabled_software_rss) {
 int index = virtio_net_process_rss(nc, buf, size, &extra_hdr);
 if (index >= 0) {
-NetClientState *nc2 = qemu_get_subqueue(n->nic, index);
-return virtio_net_receive_rcu(nc2, buf, size, true);
+nc = qemu_get_subqueue(n->nic, index);
 }
 }
 
@@ -1913,6 +1912,8 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
 return -1;
 }
 
+q = virtio_net_get_subqueue(nc);
+
 /* hdr_len refers to the header we supply to the guest */
 if (!virtio_net_has_buffers(q, size + n->guest_hdr_len - n->host_hdr_len)) {
 return 0;
@@ -2038,7 +2039,7 @@ static ssize_t virtio_net_do_receive(NetClientState *nc, const uint8_t *buf,
 {
 RCU_READ_LOCK_GUARD();
 
-return virtio_net_receive_rcu(nc, buf, size, false);
+return virtio_net_receive_rcu(nc, buf, size);
 }
 
 static void virtio_net_rsc_extract_unit4(VirtioNetRscChain *chain,

-- 
2.46.0
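
The bug pattern this patch removes can be sketched outside QEMU. The following is a minimal standalone illustration, not the virtio-net code: `struct hash_hdr`, `classify()`, `receive_recursive()` and `receive_redirect()` are hypothetical names, and the toy hash is an assumption made only so the example is deterministic. The classifier fills a header held in the caller's local variable. If the caller then re-enters itself for the selected queue, the new invocation starts with an empty header and the hash is lost. Redirecting the queue in place keeps it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the extra header that carries the reported hash. */
struct hash_hdr {
    uint32_t hash_value;
    uint16_t hash_report;
};

/* What finally reaches the guest in this sketch: a queue and a header. */
struct delivered {
    int queue;
    struct hash_hdr hdr;
};

/*
 * Hypothetical classifier: fills *hdr and returns the new queue index,
 * or -1 if the packet should stay on the current queue.
 */
static int classify(const uint8_t *buf, size_t size, int cur_queue,
                    struct hash_hdr *hdr)
{
    uint32_t h = 0;
    size_t i;
    int q;

    for (i = 0; i < size; i++) {
        h = h * 31 + buf[i];
    }
    hdr->hash_value = h;
    hdr->hash_report = 1;
    q = (int)(h % 4);
    return q == cur_queue ? -1 : q;
}

/* Old shape: recurse for the new queue; the filled header is discarded. */
static struct delivered receive_recursive(int queue, const uint8_t *buf,
                                          size_t size, bool no_rss)
{
    struct hash_hdr hdr;
    struct delivered d;

    memset(&hdr, 0, sizeof(hdr));
    if (!no_rss) {
        int index = classify(buf, size, queue, &hdr);
        if (index >= 0) {
            /* The callee has its own zeroed hdr and skips classification. */
            return receive_recursive(index, buf, size, true);
        }
    }
    d.queue = queue;
    d.hdr = hdr;
    return d;
}

/* New shape: switch the queue in place and keep the filled header. */
static struct delivered receive_redirect(int queue, const uint8_t *buf,
                                         size_t size)
{
    struct hash_hdr hdr;
    struct delivered d;
    int index;

    memset(&hdr, 0, sizeof(hdr));
    index = classify(buf, size, queue, &hdr);
    if (index >= 0) {
        queue = index;
    }
    d.queue = queue;
    d.hdr = hdr;
    return d;
}
```

Both variants pick the same queue; only the second one still has a populated header when the packet is delivered, which mirrors why the patch looks up `q` only after the queue has been chosen.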

[PATCH v16 06/13] s390x/pci: Check for multifunction after device realization

2024-09-12 Thread Akihiko Odaki

The SR-IOV PFs set the multifunction bit during device realization so
check them after that. There is no functional change because we
explicitly ignore the multifunction bit for SR-IOV devices.

Signed-off-by: Akihiko Odaki 
---
 hw/s390x/s390-pci-bus.c | 28 +---
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index eab9a4f97830..e645192562ae 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -971,21 +971,7 @@ static void s390_pcihost_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 "this device");
 }
 
-if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
-PCIDevice *pdev = PCI_DEVICE(dev);
-
-/*
- * Multifunction is not supported due to the lack of CLP. However,
- * do not check for multifunction capability for SR-IOV devices because
- * SR-IOV devices automatically add the multifunction capability whether
- * the user intends to use the functions other than the PF.
- */
-if (pdev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION &&
-!pdev->exp.sriov_cap) {
-error_setg(errp, "multifunction not supported in s390");
-return;
-}
-} else if (object_dynamic_cast(OBJECT(dev), TYPE_S390_PCI_DEVICE)) {
+if (object_dynamic_cast(OBJECT(dev), TYPE_S390_PCI_DEVICE)) {
 S390PCIBusDevice *pbdev = S390_PCI_DEVICE(dev);
 
 if (!s390_pci_alloc_idx(s, pbdev)) {
@@ -1076,6 +1062,18 @@ static void s390_pcihost_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 } else if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
 pdev = PCI_DEVICE(dev);
 
+/*
+ * Multifunction is not supported due to the lack of CLP. However,
+ * do not check for multifunction capability for SR-IOV devices because
+ * SR-IOV devices automatically add the multifunction capability whether
+ * the user intends to use the functions other than the PF.
+ */
+if (pdev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION &&
+!pdev->exp.sriov_cap) {
+error_setg(errp, "multifunction not supported in s390");
+return;
+}
+
 if (!dev->id) {
 /* In the case the PCI device does not define an id */
 /* we generate one based on the PCI address */

-- 
2.46.0

[PATCH v16 10/13] pcie_sriov: Remove num_vfs from PCIESriovPF

2024-09-12 Thread Akihiko Odaki

num_vfs is not migrated so use PCI_SRIOV_CTRL_VFE and PCI_SRIOV_NUM_VF
instead.

Signed-off-by: Akihiko Odaki 
---
 include/hw/pci/pcie_sriov.h |  1 -
 hw/pci/pcie_sriov.c | 38 +++---
 hw/pci/trace-events |  2 +-
 3 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
index 70649236c18a..5148c5b77dd1 100644
--- a/include/hw/pci/pcie_sriov.h
+++ b/include/hw/pci/pcie_sriov.h
@@ -16,7 +16,6 @@
 #include "hw/pci/pci.h"
 
 typedef struct PCIESriovPF {
-uint16_t num_vfs;   /* Number of virtual functions created */
 uint8_t vf_bar_type[PCI_NUM_REGIONS];   /* Store type for each VF bar */
 PCIDevice **vf; /* Pointer to an array of num_vfs VF devices */
 } PCIESriovPF;
diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index ac8c4013bc88..47028e150eac 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -45,7 +45,6 @@ bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
 pcie_add_capability(dev, PCI_EXT_CAP_ID_SRIOV, 1,
 offset, PCI_EXT_CAP_SRIOV_SIZEOF);
 dev->exp.sriov_cap = offset;
-dev->exp.sriov_pf.num_vfs = 0;
 dev->exp.sriov_pf.vf = NULL;
 
 pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
@@ -182,29 +181,28 @@ static void register_vfs(PCIDevice *dev)
 
 assert(sriov_cap > 0);
 num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
-if (num_vfs > pci_get_word(dev->config + sriov_cap + PCI_SRIOV_TOTAL_VF)) {
-return;
-}
 
 trace_sriov_register_vfs(dev->name, PCI_SLOT(dev->devfn),
  PCI_FUNC(dev->devfn), num_vfs);
 for (i = 0; i < num_vfs; i++) {
 pci_set_enabled(dev->exp.sriov_pf.vf[i], true);
 }
-dev->exp.sriov_pf.num_vfs = num_vfs;
+
+pci_set_word(dev->wmask + sriov_cap + PCI_SRIOV_NUM_VF, 0);
 }
 
 static void unregister_vfs(PCIDevice *dev)
 {
-uint16_t num_vfs = dev->exp.sriov_pf.num_vfs;
+uint8_t *cfg = dev->config + dev->exp.sriov_cap;
 uint16_t i;
 
 trace_sriov_unregister_vfs(dev->name, PCI_SLOT(dev->devfn),
-   PCI_FUNC(dev->devfn), num_vfs);
-for (i = 0; i < num_vfs; i++) {
+   PCI_FUNC(dev->devfn));
+for (i = 0; i < pci_get_word(cfg + PCI_SRIOV_TOTAL_VF); i++) {
 pci_set_enabled(dev->exp.sriov_pf.vf[i], false);
 }
-dev->exp.sriov_pf.num_vfs = 0;
+
+pci_set_word(dev->wmask + dev->exp.sriov_cap + PCI_SRIOV_NUM_VF, 0xffff);
 }
 
 void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
@@ -230,6 +228,17 @@ void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
 } else {
 unregister_vfs(dev);
 }
+} else if (range_covers_byte(off, len, PCI_SRIOV_NUM_VF)) {
+uint8_t *cfg = dev->config + sriov_cap;
+uint8_t *wmask = dev->wmask + sriov_cap;
+uint16_t num_vfs = pci_get_word(cfg + PCI_SRIOV_NUM_VF);
+uint16_t wmask_val = PCI_SRIOV_CTRL_MSE | PCI_SRIOV_CTRL_ARI;
+
+if (num_vfs <= pci_get_word(cfg + PCI_SRIOV_TOTAL_VF)) {
+wmask_val |= PCI_SRIOV_CTRL_VFE;
+}
+
+pci_set_word(wmask + PCI_SRIOV_CTRL, wmask_val);
 }
 }
 
@@ -246,6 +255,8 @@ void pcie_sriov_pf_reset(PCIDevice *dev)
 unregister_vfs(dev);
 
 pci_set_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF, 0);
+pci_set_word(dev->wmask + sriov_cap + PCI_SRIOV_CTRL,
+ PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE | PCI_SRIOV_CTRL_ARI);
 
 /*
  * Default is to use 4K pages, software can modify it
@@ -292,7 +303,7 @@ PCIDevice *pcie_sriov_get_pf(PCIDevice *dev)
 PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n)
 {
 assert(!pci_is_vf(dev));
-if (n < dev->exp.sriov_pf.num_vfs) {
+if (n < pcie_sriov_num_vfs(dev)) {
 return dev->exp.sriov_pf.vf[n];
 }
 return NULL;
@@ -300,5 +311,10 @@ PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n)
 
 uint16_t pcie_sriov_num_vfs(PCIDevice *dev)
 {
-return dev->exp.sriov_pf.num_vfs;
+uint16_t sriov_cap = dev->exp.sriov_cap;
+uint8_t *cfg = dev->config + sriov_cap;
+
+return sriov_cap &&
+   (pci_get_word(cfg + PCI_SRIOV_CTRL) & PCI_SRIOV_CTRL_VFE) ?
+   pci_get_word(cfg + PCI_SRIOV_NUM_VF) : 0;
 }
diff --git a/hw/pci/trace-events b/hw/pci/trace-events
index 19643aa8c6b0..e98f575a9d19 100644
--- a/hw/pci/trace-events
+++ b/hw/pci/trace-events
@@ -14,7 +14,7 @@ msix_write_config(char *name, bool enabled, bool masked) "dev 
%s enabled %d mask
 
 # hw/pci/pcie_sriov.c
 sriov_register_vfs(const char *name, int slot, int function, int num_vfs) "%s %02x:%x: creating %d vf devs"
-sriov_unregister_vfs(const char *name, int slot, int function, int num_vfs) 
"%s %02x:%x

[PATCH v16 01/13] hw/pci: Rename has_power to enabled

2024-09-12 Thread Akihiko Odaki

The renamed state will represent not only the power state of PFs, but
also SR-IOV VF enablement in the future.

Signed-off-by: Akihiko Odaki 
---
 include/hw/pci/pci.h|  7 ++-
 include/hw/pci/pci_device.h |  2 +-
 hw/pci/pci.c| 14 +++---
 hw/pci/pci_host.c   |  4 ++--
 4 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index eb26cac81098..fe04b4fafd04 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -678,6 +678,11 @@ static inline void pci_irq_pulse(PCIDevice *pci_dev)
 }
 
 MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
-void pci_set_power(PCIDevice *pci_dev, bool state);
+void pci_set_enabled(PCIDevice *pci_dev, bool state);
+
+static inline void pci_set_power(PCIDevice *pci_dev, bool state)
+{
+pci_set_enabled(pci_dev, state);
+}
 
 #endif
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index 15694f248948..f38fb3111954 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -57,7 +57,7 @@ typedef struct PCIReqIDCache PCIReqIDCache;
 struct PCIDevice {
 DeviceState qdev;
 bool partially_hotplugged;
-bool has_power;
+bool enabled;
 
 /* PCI config space */
 uint8_t *config;
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index fab86d056721..b532888e8f6c 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1525,7 +1525,7 @@ static void pci_update_mappings(PCIDevice *d)
 continue;
 
 new_addr = pci_bar_address(d, i, r->type, r->size);
-if (!d->has_power) {
+if (!d->enabled) {
 new_addr = PCI_BAR_UNMAPPED;
 }
 
@@ -1613,7 +1613,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
 pci_update_irq_disabled(d, was_irq_disabled);
 memory_region_set_enabled(&d->bus_master_enable_region,
   (pci_get_word(d->config + PCI_COMMAND)
-   & PCI_COMMAND_MASTER) && d->has_power);
+   & PCI_COMMAND_MASTER) && d->enabled);
 }
 
 msi_write_config(d, addr, val_in, l);
@@ -2884,18 +2884,18 @@ MSIMessage pci_get_msi_message(PCIDevice *dev, int vector)
 return msg;
 }
 
-void pci_set_power(PCIDevice *d, bool state)
+void pci_set_enabled(PCIDevice *d, bool state)
 {
-if (d->has_power == state) {
+if (d->enabled == state) {
 return;
 }
 
-d->has_power = state;
+d->enabled = state;
 pci_update_mappings(d);
 memory_region_set_enabled(&d->bus_master_enable_region,
   (pci_get_word(d->config + PCI_COMMAND)
-   & PCI_COMMAND_MASTER) && d->has_power);
-if (!d->has_power) {
+   & PCI_COMMAND_MASTER) && d->enabled);
+if (!d->enabled) {
 pci_device_reset(d);
 }
 }
diff --git a/hw/pci/pci_host.c b/hw/pci/pci_host.c
index dfe6fe618401..0d82727cc9dd 100644
--- a/hw/pci/pci_host.c
+++ b/hw/pci/pci_host.c
@@ -86,7 +86,7 @@ void pci_host_config_write_common(PCIDevice *pci_dev, uint32_t addr,
  * allowing direct removal of unexposed functions.
  */
 if ((pci_dev->qdev.hotplugged && !pci_get_function_0(pci_dev)) ||
-!pci_dev->has_power || is_pci_dev_ejected(pci_dev)) {
+!pci_dev->enabled || is_pci_dev_ejected(pci_dev)) {
 return;
 }
 
@@ -111,7 +111,7 @@ uint32_t pci_host_config_read_common(PCIDevice *pci_dev, uint32_t addr,
  * allowing direct removal of unexposed functions.
  */
 if ((pci_dev->qdev.hotplugged && !pci_get_function_0(pci_dev)) ||
-!pci_dev->has_power || is_pci_dev_ejected(pci_dev)) {
+!pci_dev->enabled || is_pci_dev_ejected(pci_dev)) {
 return ~0x0;
 }
 

-- 
2.46.0

[PATCH v16 11/13] pcie_sriov: Register VFs after migration

2024-09-12 Thread Akihiko Odaki

pcie_sriov doesn't have code to restore its state after migration, but
igb, which uses pcie_sriov, naively claimed its migration capability.

Add code to register VFs after migration and fix igb migration.

Fixes: 3a977deebe6b ("Intrdocue igb device emulation")
Signed-off-by: Akihiko Odaki 
---
 include/hw/pci/pcie_sriov.h | 2 ++
 hw/pci/pci.c| 7 +++
 hw/pci/pcie_sriov.c | 7 +++
 3 files changed, 16 insertions(+)

diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
index 5148c5b77dd1..c5d2d318d330 100644
--- a/include/hw/pci/pcie_sriov.h
+++ b/include/hw/pci/pcie_sriov.h
@@ -57,6 +57,8 @@ void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t opt_sup_pgsize);
 void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
  uint32_t val, int len);
 
+void pcie_sriov_pf_post_load(PCIDevice *dev);
+
 /* Reset SR/IOV */
 void pcie_sriov_pf_reset(PCIDevice *dev);
 
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 5c0050e1786a..4c7be5295110 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -733,10 +733,17 @@ static bool migrate_is_not_pcie(void *opaque, int version_id)
 return !pci_is_express((PCIDevice *)opaque);
 }
 
+static int pci_post_load(void *opaque, int version_id)
+{
+pcie_sriov_pf_post_load(opaque);
+return 0;
+}
+
 const VMStateDescription vmstate_pci_device = {
 .name = "PCIDevice",
 .version_id = 2,
 .minimum_version_id = 1,
+.post_load = pci_post_load,
 .fields = (const VMStateField[]) {
 VMSTATE_INT32_POSITIVE_LE(version_id, PCIDevice),
 VMSTATE_BUFFER_UNSAFE_INFO_TEST(config, PCIDevice,
diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index 47028e150eac..a1cb1214af27 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -242,6 +242,13 @@ void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
 }
 }
 
+void pcie_sriov_pf_post_load(PCIDevice *dev)
+{
+if (dev->exp.sriov_cap) {
+register_vfs(dev);
+}
+}
+
 
 /* Reset SR/IOV */
 void pcie_sriov_pf_reset(PCIDevice *dev)

-- 
2.46.0
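
Re-registering VFs in a post-load hook works because, after the earlier "Remove num_vfs from PCIESriovPF" patch, the VF count is derived from config space, which is migrated. A minimal standalone sketch of that derivation follows. It is not the QEMU implementation: the `SRIOV_*` offsets are values assumed for this example, and `get_word()`, `set_word()`, `num_vfs()` and `migrate_config()` are hypothetical helpers operating on a plain byte buffer that stands in for the SR-IOV capability.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets inside the SR-IOV capability; assumed values for this sketch. */
#define SRIOV_CTRL      0x08
#define SRIOV_CTRL_VFE  0x0001  /* VF Enable bit in the control register */
#define SRIOV_TOTAL_VF  0x0e
#define SRIOV_NUM_VF    0x10

/* Little-endian 16-bit accessors on a raw config-space buffer. */
static uint16_t get_word(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static void set_word(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/*
 * Number of active VFs, computed only from config space:
 * NumVFs counts only while VF Enable is set.
 */
static uint16_t num_vfs(const uint8_t *cap)
{
    if (!(get_word(cap + SRIOV_CTRL) & SRIOV_CTRL_VFE)) {
        return 0;
    }
    return get_word(cap + SRIOV_NUM_VF);
}

/* Stand-in for migration: only the config bytes travel to the destination. */
static void migrate_config(uint8_t *dst, const uint8_t *src, size_t len)
{
    memcpy(dst, src, len);
}
```

Because no state lives outside the buffer, the destination computes the same VF count from the copied bytes, which is the property the post-load hook relies on before enabling the VFs again.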

[PATCH v16 09/13] pcie_sriov: Release VFs failed to realize

2024-09-12 Thread Akihiko Odaki

Release VFs that failed to realize, just as we do in unregister_vfs().

Fixes: 7c0fa8dff811 ("pcie: Add support for Single Root I/O Virtualization (SR/IOV)")
Signed-off-by: Akihiko Odaki 
---
 hw/pci/pcie_sriov.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index 4bffe6c97f66..ac8c4013bc88 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -87,6 +87,8 @@ bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
 vf->exp.sriov_vf.vf_number = i;
 
 if (!qdev_realize(&vf->qdev, bus, errp)) {
+object_unparent(OBJECT(vf));
+object_unref(vf);
 unparent_vfs(dev, i);
 return false;
 }

-- 
2.46.0

[PATCH v16 13/13] hw/qdev: Remove opts member

2024-09-12 Thread Akihiko Odaki

It is no longer used.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Markus Armbruster 
---
 include/hw/qdev-core.h |  4 
 hw/core/qdev.c |  1 -
 system/qdev-monitor.c  | 12 +++-
 3 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 77bfcbdf732a..a3757e6769f8 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -237,10 +237,6 @@ struct DeviceState {
  * @pending_deleted_expires_ms: optional timeout for deletion events
  */
 int64_t pending_deleted_expires_ms;
-/**
- * @opts: QDict of options for the device
- */
-QDict *opts;
 /**
  * @hotplugged: was device added after PHASE_MACHINE_READY?
  */
diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index f3a996f57dee..2fc84699d432 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -706,7 +706,6 @@ static void device_finalize(Object *obj)
 dev->canonical_path = NULL;
 }
 
-qobject_unref(dev->opts);
 g_free(dev->id);
 }
 
diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
index 6af6ef7d667f..3551989d5153 100644
--- a/system/qdev-monitor.c
+++ b/system/qdev-monitor.c
@@ -624,6 +624,7 @@ DeviceState *qdev_device_add_from_qdict(const QDict *opts,
 char *id;
 DeviceState *dev = NULL;
 BusState *bus = NULL;
+QDict *properties;
 
 driver = qdict_get_try_str(opts, "driver");
 if (!driver) {
@@ -705,13 +706,14 @@ DeviceState *qdev_device_add_from_qdict(const QDict *opts,
 }
 
 /* set properties */
-dev->opts = qdict_clone_shallow(opts);
-qdict_del(dev->opts, "driver");
-qdict_del(dev->opts, "bus");
-qdict_del(dev->opts, "id");
+properties = qdict_clone_shallow(opts);
+qdict_del(properties, "driver");
+qdict_del(properties, "bus");
+qdict_del(properties, "id");
 
-object_set_properties_from_keyval(&dev->parent_obj, dev->opts, from_json,
+object_set_properties_from_keyval(&dev->parent_obj, properties, from_json,
   errp);
+qobject_unref(properties);
 if (*errp) {
 goto err_del_dev;
 }

-- 
2.46.0

[PATCH v16 12/13] hw/pci: Use -1 as the default value for rombar

2024-09-12 Thread Akihiko Odaki

vfio_pci_size_rom() distinguishes whether rombar is explicitly set to 1
by checking dev->opts, bypassing the QOM property infrastructure.

Use -1 as the default value for rombar to tell if the user explicitly
set it to 1. The property is also converted from unsigned to signed.
-1 is signed so it is safe to give it a new meaning. The values in
[2 ^ 31, 2 ^ 32) become invalid, but nobody should have typed these
values by chance.

Suggested-by: Markus Armbruster 
Signed-off-by: Akihiko Odaki 
Reviewed-by: Markus Armbruster 
---
 include/hw/pci/pci_device.h | 2 +-
 hw/pci/pci.c| 2 +-
 hw/vfio/pci.c   | 5 ++---
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index 1ff3ce94e25b..8fa845beee5e 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -148,7 +148,7 @@ struct PCIDevice {
 uint32_t romsize;
 bool has_rom;
 MemoryRegion rom;
-uint32_t rom_bar;
+int32_t rom_bar;
 
 /* INTx routing notifier */
 PCIINTxRoutingNotifier intx_routing_notifier;
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 4c7be5295110..d2eaf0c51dde 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -71,7 +71,7 @@ static Property pci_props[] = {
 DEFINE_PROP_PCI_DEVFN("addr", PCIDevice, devfn, -1),
 DEFINE_PROP_STRING("romfile", PCIDevice, romfile),
 DEFINE_PROP_UINT32("romsize", PCIDevice, romsize, UINT32_MAX),
-DEFINE_PROP_UINT32("rombar",  PCIDevice, rom_bar, 1),
+DEFINE_PROP_INT32("rombar",  PCIDevice, rom_bar, -1),
 DEFINE_PROP_BIT("multifunction", PCIDevice, cap_present,
 QEMU_PCI_CAP_MULTIFUNCTION_BITNR, false),
 DEFINE_PROP_BIT("x-pcie-lnksta-dllla", PCIDevice, cap_present,
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 2407720c3530..dc53837eac73 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1012,7 +1012,6 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
 {
 uint32_t orig, size = cpu_to_le32((uint32_t)PCI_ROM_ADDRESS_MASK);
 off_t offset = vdev->config_offset + PCI_ROM_ADDRESS;
-DeviceState *dev = DEVICE(vdev);
 char *name;
 int fd = vdev->vbasedev.fd;
 
@@ -1046,12 +1045,12 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
 }
 
 if (vfio_opt_rom_in_denylist(vdev)) {
-if (dev->opts && qdict_haskey(dev->opts, "rombar")) {
+if (vdev->pdev.rom_bar > 0) {
 warn_report("Device at %s is known to cause system instability"
 " issues during option rom execution",
 vdev->vbasedev.name);
 error_printf("Proceeding anyway since user specified"
- " non zero value for rombar\n");
+ " positive value for rombar\n");
 } else {
 warn_report("Rom loading for device at %s has been disabled"
 " due to system instability issues",

-- 
2.46.0
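
The idea of using -1 as a default to tell "left unset" from "explicitly set to 1" can be sketched in isolation. The code below is an illustration only, not QEMU's property machinery: `struct dev`, `dev_init()`, `rom_bar_enabled()` and `rom_bar_explicitly_enabled()` are hypothetical names, and treating any non-zero value as "enabled" is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sentinel meaning "the user did not set the property". */
#define ROMBAR_DEFAULT (-1)

/* Hypothetical device with a signed rombar property. */
struct dev {
    int32_t rom_bar;
};

/* Property defaults are applied before any user-supplied value. */
static void dev_init(struct dev *d)
{
    d->rom_bar = ROMBAR_DEFAULT;
}

/* Assumed semantics: any non-zero value, including the default, enables it. */
static bool rom_bar_enabled(const struct dev *d)
{
    return d->rom_bar != 0;
}

/* Only a positive value can have come from the user asking for it. */
static bool rom_bar_explicitly_enabled(const struct dev *d)
{
    return d->rom_bar > 0;
}
```

With an unsigned property defaulting to 1, the default and an explicit `rombar=1` are indistinguishable once the value is stored; the signed sentinel keeps the default behaviour while making the user's choice visible to code such as the denylist check in the patch.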

[PATCH v16 03/13] hw/ppc/spapr_pci: Do not reject VFs created after a PF

2024-09-12 Thread Akihiko Odaki

A PF may automatically create VFs and the PF may be function 0.

Signed-off-by: Akihiko Odaki 
---
 hw/ppc/spapr_pci.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index f63182a03c41..ed4454bbf79e 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1573,7 +1573,9 @@ static void spapr_pci_pre_plug(HotplugHandler *plug_handler,
  * hotplug, we do not allow functions to be hotplugged to a
  * slot that already has function 0 present
  */
-if (plugged_dev->hotplugged && bus->devices[PCI_DEVFN(slotnr, 0)] &&
+if (plugged_dev->hotplugged &&
+!pci_is_vf(pdev) &&
+bus->devices[PCI_DEVFN(slotnr, 0)] &&
 PCI_FUNC(pdev->devfn) != 0) {
 error_setg(errp, "PCI: slot %d function 0 already occupied by %s,"
" additional functions can no longer be exposed to guest.",

-- 
2.46.0

[PATCH v16 04/13] s390x/pci: Avoid creating zpci for VFs

2024-09-12 Thread Akihiko Odaki

VFs are automatically created by PF, and creating zpci for them will
result in unexpected usage of fids. Currently QEMU does not support
multifunction for s390x so we don't need zpci for VFs anyway.

Signed-off-by: Akihiko Odaki 
---
 hw/s390x/s390-pci-bus.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 3e57d5faca18..1a620f5b2a04 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -1080,6 +1080,16 @@ static void s390_pcihost_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 
 pbdev = s390_pci_find_dev_by_target(s, dev->id);
 if (!pbdev) {
+/*
+ * VFs are automatically created by PF, and creating zpci for them
+ * will result in unexpected usage of fids. Currently QEMU does not
+ * support multifunction for s390x so we don't need zpci for VFs
+ * anyway.
+ */
+if (pci_is_vf(pdev)) {
+return;
+}
+
 pbdev = s390_pci_device_new(s, dev->id, errp);
 if (!pbdev) {
 return;
@@ -1167,7 +1177,9 @@ static void s390_pcihost_unplug(HotplugHandler *hotplug_dev, DeviceState *dev,
 int32_t devfn;
 
 pbdev = s390_pci_find_dev_by_pci(s, PCI_DEVICE(dev));
-g_assert(pbdev);
+if (!pbdev) {
+return;
+}
 
 s390_pci_generate_plug_event(HP_EVENT_STANDBY_TO_RESERVED,
  pbdev->fh, pbdev->fid);
@@ -1206,7 +1218,10 @@ static void s390_pcihost_unplug_request(HotplugHandler *hotplug_dev,
  * we've checked the PCI device already (to prevent endless recursion).
  */
 pbdev = s390_pci_find_dev_by_pci(s, PCI_DEVICE(dev));
-g_assert(pbdev);
+if (!pbdev) {
+return;
+}
+
 pbdev->pci_unplug_request_processed = true;
 qdev_unplug(DEVICE(pbdev), errp);
 } else if (object_dynamic_cast(OBJECT(dev), TYPE_S390_PCI_DEVICE)) {

-- 
2.46.0

[PATCH v16 08/13] pcie_sriov: Reuse SR-IOV VF device instances

2024-09-12 Thread Akihiko Odaki

Disable SR-IOV VF devices by reusing code to power down PCI devices
instead of removing them when the guest requests to disable VFs. This
allows realizing devices and reporting VF realization errors at PF
realization time.

Signed-off-by: Akihiko Odaki 
---
 docs/pcie_sriov.txt |   8 ++--
 include/hw/pci/pci.h|   5 ---
 include/hw/pci/pci_device.h |  15 +++
 include/hw/pci/pcie_sriov.h |   6 +--
 hw/net/igb.c|  13 --
 hw/nvme/ctrl.c  |  24 +++
 hw/pci/pci.c|   2 +-
 hw/pci/pcie_sriov.c | 102 +++-
 8 files changed, 95 insertions(+), 80 deletions(-)

diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
index a47aad0bfab0..ab2142807f79 100644
--- a/docs/pcie_sriov.txt
+++ b/docs/pcie_sriov.txt
@@ -52,9 +52,11 @@ setting up a BAR for a VF.
   ...
 
   /* Add and initialize the SR/IOV capability */
-  pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
-   vf_devid, initial_vfs, total_vfs,
-   fun_offset, stride);
+  if (!pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
+  vf_devid, initial_vfs, total_vfs,
+  fun_offset, stride, errp)) {
+ return;
+  }
 
   /* Set up individual VF BARs (parameters as for normal BARs) */
   pcie_sriov_pf_init_vf_bar( ... )
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index fe04b4fafd04..14a869eeaa71 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -680,9 +680,4 @@ static inline void pci_irq_pulse(PCIDevice *pci_dev)
 MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
 void pci_set_enabled(PCIDevice *pci_dev, bool state);
 
-static inline void pci_set_power(PCIDevice *pci_dev, bool state)
-{
-pci_set_enabled(pci_dev, state);
-}
-
 #endif
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index f38fb3111954..1ff3ce94e25b 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -212,6 +212,21 @@ static inline uint16_t pci_get_bdf(PCIDevice *dev)
 return PCI_BUILD_BDF(pci_bus_num(pci_get_bus(dev)), dev->devfn);
 }
 
+static inline void pci_set_power(PCIDevice *pci_dev, bool state)
+{
+/*
+ * Don't change the enabled state of VFs when powering on/off the device.
+ *
+ * When powering on, VFs must not be enabled immediately but they must
+ * wait until the guest configures SR-IOV.
+ * When powering off, their corresponding PFs will be reset and disable
+ * VFs.
+ */
+if (!pci_is_vf(pci_dev)) {
+pci_set_enabled(pci_dev, state);
+}
+}
+
 uint16_t pci_requester_id(PCIDevice *dev);
 
 /* DMA access functions */
diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
index 450cbef6c201..70649236c18a 100644
--- a/include/hw/pci/pcie_sriov.h
+++ b/include/hw/pci/pcie_sriov.h
@@ -18,7 +18,6 @@
 typedef struct PCIESriovPF {
 uint16_t num_vfs;   /* Number of virtual functions created */
 uint8_t vf_bar_type[PCI_NUM_REGIONS];   /* Store type for each VF bar */
-const char *vfname; /* Reference to the device type used for the VFs */
 PCIDevice **vf; /* Pointer to an array of num_vfs VF devices */
 } PCIESriovPF;
 
@@ -27,10 +26,11 @@ typedef struct PCIESriovVF {
 uint16_t vf_number; /* Logical VF number of this function */
 } PCIESriovVF;
 
-void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
+bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
 const char *vfname, uint16_t vf_dev_id,
 uint16_t init_vfs, uint16_t total_vfs,
-uint16_t vf_offset, uint16_t vf_stride);
+uint16_t vf_offset, uint16_t vf_stride,
+Error **errp);
 void pcie_sriov_pf_exit(PCIDevice *dev);
 
 /* Set up a VF bar in the SR/IOV bar area */
diff --git a/hw/net/igb.c b/hw/net/igb.c
index b92bba402e0d..b6ca2f1b8aee 100644
--- a/hw/net/igb.c
+++ b/hw/net/igb.c
@@ -446,9 +446,16 @@ static void igb_pci_realize(PCIDevice *pci_dev, Error **errp)
 
 pcie_ari_init(pci_dev, 0x150);
 
-pcie_sriov_pf_init(pci_dev, IGB_CAP_SRIOV_OFFSET, TYPE_IGBVF,
-IGB_82576_VF_DEV_ID, IGB_MAX_VF_FUNCTIONS, IGB_MAX_VF_FUNCTIONS,
-IGB_VF_OFFSET, IGB_VF_STRIDE);
+if (!pcie_sriov_pf_init(pci_dev, IGB_CAP_SRIOV_OFFSET,
+TYPE_IGBVF, IGB_82576_VF_DEV_ID,
+IGB_MAX_VF_FUNCTIONS, IGB_MAX_VF_FUNCTIONS,
+IGB_VF_OFFSET, IGB_VF_STRIDE,
+errp)) {
+pcie_cap_exit(pci_dev);
+igb_cleanup_msix(s);
+msi_uninit(pci_dev);
+return;
+}
 
 pcie_sriov_pf_init_vf_bar(pci_dev, IGBVF_MMIO_BAR_IDX,
 PCI_BASE_ADDRESS_MEM_TYPE_64 | PCI_BASE_ADDRESS_MEM_PREFETCH,
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index c6d4f61a47f9..

[PATCH v16 07/13] pcie_sriov: Do not manually unrealize

2024-09-12 Thread Akihiko Odaki

A device gets automatically unrealized when being unparented.

Signed-off-by: Akihiko Odaki 
---
 hw/pci/pcie_sriov.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index e9b23221d713..499becd5273f 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -204,11 +204,7 @@ static void unregister_vfs(PCIDevice *dev)
 trace_sriov_unregister_vfs(dev->name, PCI_SLOT(dev->devfn),
PCI_FUNC(dev->devfn), num_vfs);
 for (i = 0; i < num_vfs; i++) {
-Error *err = NULL;
 PCIDevice *vf = dev->exp.sriov_pf.vf[i];
-if (!object_property_set_bool(OBJECT(vf), "realized", false, &err)) {
-error_reportf_err(err, "Failed to unplug: ");
-}
 object_unparent(OBJECT(vf));
 object_unref(OBJECT(vf));
 }

-- 
2.46.0

[PATCH v16 05/13] s390x/pci: Allow plugging SR-IOV devices

2024-09-12 Thread Akihiko Odaki

The guest cannot use VFs due to the lack of multifunction support but
can use PFs.

Signed-off-by: Akihiko Odaki 
---
 hw/s390x/s390-pci-bus.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 1a620f5b2a04..eab9a4f97830 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -974,7 +974,14 @@ static void s390_pcihost_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
 PCIDevice *pdev = PCI_DEVICE(dev);
 
-if (pdev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+/*
+ * Multifunction is not supported due to the lack of CLP. However,
+ * do not check for multifunction capability for SR-IOV devices because
+ * SR-IOV devices automatically add the multifunction capability whether
+ * the user intends to use the functions other than the PF.
+ */
+if (pdev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION &&
+!pdev->exp.sriov_cap) {
 error_setg(errp, "multifunction not supported in s390");
 return;
 }

-- 
2.46.0

[PATCH v16 00/13] hw/pci: SR-IOV related fixes and improvements

2024-09-12 Thread Akihiko Odaki

Supersedes: <20240714-rombar-v2-0-af1504ef5...@daynix.com>
("[PATCH v2 0/4] hw/pci: Convert rom_bar into OnOffAuto")

I submitted a RFC series[1] to add support for SR-IOV emulation to
virtio-net-pci. During the development of the series, I fixed some
trivial bugs and made improvements that I think are independently
useful. This series extracts those fixes and improvements from the RFC
series.

[1]: https://patchew.org/QEMU/20231210-sriov-v2-0-b959e8a6d...@daynix.com/

Signed-off-by: Akihiko Odaki 
---
Changes in v16:
- Added patch "s390x/pci: Avoid creating zpci for VFs".
- Added patch "s390x/pci: Allow plugging SR-IOV devices".
- Link to v15: 
https://lore.kernel.org/r/20240823-reuse-v15-0-eddcb960e...@daynix.com

Changes in v15:
- Fixed variable shadowing in patch
  "pcie_sriov: Remove num_vfs from PCIESriovPF"
- Link to v14: 
https://lore.kernel.org/r/20240813-reuse-v14-0-4c15bc6ee...@daynix.com

Changes in v14:
- Dropped patch "pcie_sriov: Ensure VF function number does not
  overflow" as I found the restriction it imposes is unnecessary.
- Link to v13: 
https://lore.kernel.org/r/20240805-reuse-v13-0-aaeaa4d7d...@daynix.com

Changes in v13:
- Added patch "s390x/pci: Check for multifunction after device
  realization". I found SR-IOV devices, which are multifunction devices,
  are not supposed to work at all with s390x on QEMU.
- Link to v12: 
https://lore.kernel.org/r/20240804-reuse-v12-0-d3930c411...@daynix.com

Changes in v12:
- Changed to ignore invalid PCI_SRIOV_NUM_VF writes as done for
  PCI_SRIOV_CTRL_VFE.
- Updated the message for patch "hw/pci: Use -1 as the default value for
  rombar". (Markus Armbruster)
- Link to v11: 
https://lore.kernel.org/r/20240802-reuse-v11-0-fb83bb8c1...@daynix.com

Changes in v11:
- Rebased.
- Dropped patch "hw/pci: Convert rom_bar into OnOffAuto".
- Added patch "hw/pci: Use -1 as the default value for rombar".
- Added for-9.2 to give a testing period for possible breakage with
  libvirt/s390x.
- Link to v10: 
https://lore.kernel.org/r/20240627-reuse-v10-0-7ca0b8ed3...@daynix.com

Changes in v10:
- Added patch "hw/ppc/spapr_pci: Do not reject VFs created after a PF".
- Added patch "hw/ppc/spapr_pci: Do not create DT for disabled PCI device".
- Added patch "hw/pci: Convert rom_bar into OnOffAuto".
- Dropped patch "hw/pci: Determine if rombar is explicitly enabled".
- Dropped patch "hw/qdev: Remove opts member".
- Link to v9: 
https://lore.kernel.org/r/20240315-reuse-v9-0-67aa69af4...@daynix.com

Changes in v9:
- Rebased.
- Restored '#include "qapi/error.h"' (Michael S. Tsirkin)
- Added patch "pcie_sriov: Ensure VF function number does not overflow"
  to fix abortion with wrong PF addr.
- Link to v8: 
https://lore.kernel.org/r/20240228-reuse-v8-0-282660281...@daynix.com

Changes in v8:
- Clarified that "hw/pci: Replace -1 with UINT32_MAX for romsize" is
  not a bug fix. (Markus Armbruster)
- Squashed patch "vfio: Avoid inspecting option QDict for rombar" into
  "hw/pci: Determine if rombar is explicitly enabled".
  (Markus Armbruster)
- Noted the minor semantics change for patch "hw/pci: Determine if
  rombar is explicitly enabled". (Markus Armbruster)
- Link to v7: 
https://lore.kernel.org/r/20240224-reuse-v7-0-29c14bcb9...@daynix.com

Changes in v7:
- Replaced -1 with UINT32_MAX when expressing uint32_t.
  (Markus Armbruster)
- Added patch "hw/pci: Replace -1 with UINT32_MAX for romsize".
- Link to v6: 
https://lore.kernel.org/r/20240220-reuse-v6-0-2e42a28b0...@daynix.com

Changes in v6:
- Fixed migration.
- Added patch "pcie_sriov: Do not manually unrealize".
- Restored patch "pcie_sriov: Release VFs failed to realize" that was
  missed in v5.
- Link to v5: 
https://lore.kernel.org/r/20240218-reuse-v5-0-e4fc1c19b...@daynix.com

Changes in v5:
- Added patch "hw/pci: Always call pcie_sriov_pf_reset()".
- Added patch "pcie_sriov: Reset SR-IOV extended capability".
- Removed a reference to PCI_SRIOV_CTRL_VFE in hw/nvme.
  (Michael S. Tsirkin)
- Noted the impact on the guest of patch "pcie_sriov: Do not reset
  NumVFs after unregistering VFs". (Michael S. Tsirkin)
- Changed to use pcie_sriov_num_vfs().
- Restored pci_set_power() and changed it to call pci_set_enabled() only
  for PFs with an explanation. (Michael S. Tsirkin)
- Reordered patches.
- Link to v4: 
https://lore.kernel.org/r/20240214-reuse-v4-0-89ad093a0...@daynix.com

Changes in v4:
- Reverted the change to pci_rom_bar_explicitly_enabled().
  (Michael S. Tsirkin)
- Added patch "pcie_sriov: Do not reset NumVFs after unregistering VFs".
- Added patch "hw/nvme: Refer to dev->exp.sriov_pf.num_vfs".
- Link to v3: 
https://lore.kernel.org/r/20240212-reuse-v3-0-8017b689c...@daynix.com

Changes in v3:
- Extracted patc

[PATCH v16 02/13] hw/ppc/spapr_pci: Do not create DT for disabled PCI device

2024-09-12 Thread Akihiko Odaki

Disabled means it is a disabled SR-IOV VF or it is powered off, and
hidden from the guest.

Signed-off-by: Akihiko Odaki 
---
 hw/ppc/spapr_pci.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 7cf9904c3546..f63182a03c41 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1296,6 +1296,10 @@ static void spapr_dt_pci_device_cb(PCIBus *bus, 
PCIDevice *pdev,
 return;
 }
 
+if (!pdev->enabled) {
+return;
+}
+
 err = spapr_dt_pci_device(p->sphb, pdev, p->fdt, p->offset);
 if (err < 0) {
 p->err = err;

-- 
2.46.0

Re: [PATCH for-9.2 v15 04/11] s390x/pci: Check for multifunction after device realization

2024-09-11 Thread Akihiko Odaki


On 2024/09/12 6:11, Matthew Rosato wrote:

On 9/11/24 11:15 AM, Akihiko Odaki wrote:

On 2024/09/11 22:53, Matthew Rosato wrote:

On 9/11/24 6:58 AM, Akihiko Odaki wrote:

On 2024/09/11 18:38, Cédric Le Goater wrote:

+Matthew +Eric

Side note for the maintainers :

Before this change, the igb device, which is multifunction, was working
fine under Linux.

Was there a fix in Linux since :

     57da367b9ec4 ("s390x/pci: forbid multifunction pci device")
     6069bcdeacee ("s390x/pci: Move some hotplug checks to the pre_plug 
handler")

?

The timing of those particular commits predates the linux s390 kernel support 
of multifunction/SR-IOV.  At that time it was simply not possible on s390.


Is it OK to remove this check of multifunction now?


Yes, I think removing the check (which AFAIU was broken since 6069bcdeacee) 
will get us back to the behavior Cedric was seeing, where a device that reports 
multifunction in the config space is still allowed through but only the PF will 
be usable in the guest.


Commit 6069bcdeacee predates the introduction of SR-IOV (commit 
7c0fa8dff811 ["pcie: Add support for Single Root I/O Virtualization 
(SR/IOV)"]) so it did not break anything, strictly speaking. Ideally, we 
should have taken care of the check when introducing SR-IOV.






This code is not working properly with SR-IOV and is misleading. It is better to 
remove the code if it does no good.

It would be nice if anyone confirms multifunction works for s390x with the code 
removed.


Even if you remove the check, multifunction itself won't work in the s390x 
guest without these missing CLP pieces too.  When I have some time I'll hack 
something together to fabricate some CLP data and try it out, but it sounds 
like Cedric could use his setup to right now at least verify that the PF itself 
should remain usable in the guest (current behavior).


Well, it seems better to keep the check for non-SR-IOV multifunction 
devices while not enforcing the restriction for SR-IOV.


Multifunction devices without SR-IOV are created with explicit requests 
by specifying multifunction=on for PCI devices. Such requests cannot be 
fulfilled without multifunction CLP so we should reject them.


The situation is different for SR-IOV. SR-IOV is a feature inherent to 
specific types of devices and gets automatically enabled for these 
devices. It may make more sense to ignore just the SR-IOV part of such 
devices and keep the other functionality working.


The current code implements such a behavior, but it is accidental and 
semantically wrong. I will correct the code to explicitly allow 
multifunction for SR-IOV.
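
A minimal sketch of that policy, assuming stand-in types and flag names 
(sketch_pci_dev, SKETCH_CAP_MULTIFUNCTION and SKETCH_CAP_SRIOV_PF are 
hypothetical and are not QEMU's real definitions). It only illustrates 
the intended decision, not the actual s390_pcihost_plug() code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the device state; not QEMU's real types. */
enum {
    SKETCH_CAP_MULTIFUNCTION = 1u << 0, /* multifunction bit is set */
    SKETCH_CAP_SRIOV_PF      = 1u << 1, /* device is an SR-IOV PF */
};

struct sketch_pci_dev {
    unsigned cap_present;
};

/*
 * Return true if the device may be plugged.
 *
 * An explicit multifunction request cannot be fulfilled without
 * multifunction CLP, so it is rejected. A multifunction bit that comes
 * with an SR-IOV PF is tolerated so that the PF itself stays usable.
 */
static bool sketch_s390_plug_allowed(const struct sketch_pci_dev *dev)
{
    assert(dev != NULL);

    bool multifunction = (dev->cap_present & SKETCH_CAP_MULTIFUNCTION) != 0;
    bool sriov_pf = (dev->cap_present & SKETCH_CAP_SRIOV_PF) != 0;

    return !multifunction || sriov_pf;
}
```

With this shape, a device created with multifunction=on is rejected, 
while an igb-like SR-IOV PF is let through.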


Regards,
Akihiko Odaki

Re: [PATCH for-9.2 v15 04/11] s390x/pci: Check for multifunction after device realization

2024-09-11 Thread Akihiko Odaki


On 2024/09/11 22:53, Matthew Rosato wrote:

On 9/11/24 6:58 AM, Akihiko Odaki wrote:

On 2024/09/11 18:38, Cédric Le Goater wrote:

+Matthew +Eric

Side note for the maintainers :

Before this change, the igb device, which is multifunction, was working
fine under Linux.

Was there a fix in Linux since :

    57da367b9ec4 ("s390x/pci: forbid multifunction pci device")
    6069bcdeacee ("s390x/pci: Move some hotplug checks to the pre_plug handler")

?

The timing of those particular commits predates the linux s390 kernel support 
of multifunction/SR-IOV.  At that time it was simply not possible on s390.


Is it OK to remove this check of multifunction now?

This code is not working properly with SR-IOV and is misleading. It is 
better to remove the code if it does no good.


It would be nice if anyone confirms multifunction works for s390x with 
the code removed.






s390 PCI devices do not have extended capabilities, so the igb device
does not expose the SRIOV capability and only the PF is accessible but
it doesn't seem to be an issue. (Btw, CONFIG_PCI_IOV is set to y in the
default Linux config which is unexpected)


The linux config option makes sense because the s390 kernel now supports 
SR-IOV/multifunction.



Doesn't s390x really see extended capabilities? hw/s390x/s390-pci-inst.c calls 
pci_config_size() and pci_host_config_write_common(), which means it is 
exposing the whole PCI Express configuration space. Why can't s390x use 
extended capabilities then?



So, rather than poking around in config space, s390 (and thus the s390 kernel) 
has an extra layer of 'capabilities' that it generally relies on to determine 
device functionality called 'CLP'.  Basically, there are pieces of CLP that are 
not currently generated (or forwarded from the host in the case of passthrough) 
by QEMU that would be needed by the guest to recognize the SRIOV/multifunction 
capability of a device, despite what config space has in it.  I suspect this is 
exactly why only the PF was available to your igb device then (missing CLP info 
made the device appear to not have multifunction capability as far as the s390 
guest is concerned - fwiw adding CLP emulation to enable that is on our todo 
list).


What is expected to happen if you poke the configuration space anyway? I 
also wonder if there is some public documentation of CLP and the relevant 
aspects of PCI support in s390x.




Sounds like the short-term solution here would be to continue allowing the PF 
without multifunction being visible to the guest (so as to not regress prior 
functionality) and then aim for proper support after with the necessary CLP 
pieces.


I agree; we can keep the PF working.

Regards,
Akihiko Odaki

Re: [PATCH for-9.2 v15 04/11] s390x/pci: Check for multifunction after device realization

2024-09-11 Thread Akihiko Odaki


On 2024/09/11 18:38, Cédric Le Goater wrote:

+Matthew +Eric

Side note for the maintainers :

Before this change, the igb device, which is multifunction, was working
fine under Linux.

Was there a fix in Linux since :

   57da367b9ec4 ("s390x/pci: forbid multifunction pci device")
   6069bcdeacee ("s390x/pci: Move some hotplug checks to the pre_plug 
handler")


?

s390 PCI devices do not have extended capabilities, so the igb device
does not expose the SRIOV capability and only the PF is accessible but
it doesn't seem to be an issue. (Btw, CONFIG_PCI_IOV is set to y in the
default Linux config which is unexpected)


Doesn't s390x really see extended capabilities? hw/s390x/s390-pci-inst.c 
calls pci_config_size() and pci_host_config_write_common(), which 
means it is exposing the whole PCI Express configuration space. Why 
can't s390x use extended capabilities then?


The best fix would be to replace the SR-IOV implementation with a stub 
if s390x cannot use the SR-IOV capability. However, I still need to 
know at what level I should change the implementation (e.g., is it fine 
to remove the entire capability, or should I keep the capability while 
making writes to its registers no-ops?)


Regards,
Akihiko Odaki



Thanks,

C.



On 8/23/24 07:00, Akihiko Odaki wrote:

The SR-IOV PFs set the multifunction bits during device realization so
check them after that. This forbids adding SR-IOV devices to s390x.

Signed-off-by: Akihiko Odaki 
---
  hw/s390x/s390-pci-bus.c | 14 ++
  1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 3e57d5faca18..00b2c1f6157b 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -971,14 +971,7 @@ static void s390_pcihost_pre_plug(HotplugHandler 
*hotplug_dev, DeviceState *dev,

  "this device");
  }
-    if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
-    PCIDevice *pdev = PCI_DEVICE(dev);
-
-    if (pdev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
-    error_setg(errp, "multifunction not supported in s390");
-    return;
-    }
-    } else if (object_dynamic_cast(OBJECT(dev), TYPE_S390_PCI_DEVICE)) {
+    if (object_dynamic_cast(OBJECT(dev), TYPE_S390_PCI_DEVICE)) {
  S390PCIBusDevice *pbdev = S390_PCI_DEVICE(dev);
  if (!s390_pci_alloc_idx(s, pbdev)) {
@@ -1069,6 +1062,11 @@ static void s390_pcihost_plug(HotplugHandler 
*hotplug_dev, DeviceState *dev,

  } else if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
  pdev = PCI_DEVICE(dev);
+    if (pdev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+    error_setg(errp, "multifunction not supported in s390");
+    return;
+    }
+
  if (!dev->id) {
  /* In the case the PCI device does not define an id */
  /* we generate one based on the PCI address */

Re: [PATCH for-9.2 v15 00/11] hw/pci: SR-IOV related fixes and improvements

2024-09-10 Thread Akihiko Odaki


On 2024/09/11 0:27, Michael S. Tsirkin wrote:

On Tue, Sep 10, 2024 at 04:13:14PM +0200, Cédric Le Goater wrote:

On 9/10/24 15:34, Michael S. Tsirkin wrote:

On Tue, Sep 10, 2024 at 03:21:54PM +0200, Cédric Le Goater wrote:

On 9/10/24 11:33, Akihiko Odaki wrote:

On 2024/09/10 18:21, Michael S. Tsirkin wrote:

On Fri, Aug 23, 2024 at 02:00:37PM +0900, Akihiko Odaki wrote:

Supersedes: <20240714-rombar-v2-0-af1504ef5...@daynix.com>
("[PATCH v2 0/4] hw/pci: Convert rom_bar into OnOffAuto")

I submitted a RFC series[1] to add support for SR-IOV emulation to
virtio-net-pci. During the development of the series, I fixed some
trivial bugs and made improvements that I think are independently
useful. This series extracts those fixes and improvements from the RFC
series.

[1]: https://patchew.org/QEMU/20231210-sriov-v2-0-b959e8a6d...@daynix.com/

Signed-off-by: Akihiko Odaki 


I don't think Cédric's issues have been addressed, am I wrong?
Cédric, what is your take?


I put the URI to Cédric's report here:
https://lore.kernel.org/r/75cbc7d9-b48e-4235-85cf-49dacf3c7...@redhat.com

This issue was dealt with by patch "s390x/pci: Check for multifunction after device 
realization". I found that s390x on QEMU does not support multifunction, and SR-IOV 
devices accidentally circumvent this restriction, which means igb was never supposed to 
work with s390x. The patch prevents adding SR-IOV devices to s390x to ensure the 
restriction is properly enforced.


yes, indeed and it seems to fix :

6069bcdeacee ("s390x/pci: Move some hotplug checks to the pre_plug handler")

I will update patch 4.


Thanks,

C.


That said, the igb device worked perfectly fine under the s390x machine.


And it works for you after this patchset, yes?


ah no, IGB is not an available device for the s390x machine anymore :

   qemu-system-s390x: -device igb,netdev=net1,mac=C0:FF:EE:00:00:13: 
multifunction not supported in s390



So patch 4 didn't really help.



This is what commit 57da367b9ec4 ("s390x/pci: forbid multifunction
pci device") initially required (and later broken by 6069bcdeacee).
So I guess we are fine with the expected behavior.

Thanks,

C.


Better to fix than to guess if there are users, I think.


Yes, but it will require some knowledge of s390x, which I cannot provide.

Commit 57da367b9ec4 ("s390x/pci: forbid multifunction pci device") says 
having a multifunction device will make the guest spin forever. That is 
not what Cédric observed with igb so it may no longer be relevant, but I 
cannot be sure that the problem is resolved without understanding how 
multifunction devices are supposed to work with s390x.


Ideally someone with s390x expertise should check relevant hardware 
documentation and confirm QEMU properly implements multifunction 
devices. Let's keep the restriction until then.


Regards,
Akihiko Odaki

Re: [PATCH for-9.2 v15 00/11] hw/pci: SR-IOV related fixes and improvements

2024-09-10 Thread Akihiko Odaki


On 2024/09/10 18:21, Michael S. Tsirkin wrote:

On Fri, Aug 23, 2024 at 02:00:37PM +0900, Akihiko Odaki wrote:

Supersedes: <20240714-rombar-v2-0-af1504ef5...@daynix.com>
("[PATCH v2 0/4] hw/pci: Convert rom_bar into OnOffAuto")

I submitted a RFC series[1] to add support for SR-IOV emulation to
virtio-net-pci. During the development of the series, I fixed some
trivial bugs and made improvements that I think are independently
useful. This series extracts those fixes and improvements from the RFC
series.

[1]: https://patchew.org/QEMU/20231210-sriov-v2-0-b959e8a6d...@daynix.com/

Signed-off-by: Akihiko Odaki 


I don't think Cédric's issues have been addressed, am I wrong?
Cédric, what is your take?


I put the URI to Cédric's report here:
https://lore.kernel.org/r/75cbc7d9-b48e-4235-85cf-49dacf3c7...@redhat.com

This issue was dealt with by patch "s390x/pci: Check for multifunction 
after device realization". I found that s390x on QEMU does not support 
multifunction, and SR-IOV devices accidentally circumvent this 
restriction, which means igb was never supposed to work with s390x. The 
patch prevents adding SR-IOV devices to s390x to ensure the restriction 
is properly enforced.


Regards,
Akihiko Odaki

Re: [PATCH] block: support locking on change medium

2024-09-09 Thread Akihiko Odaki


On 2024/09/09 23:18, Joelle van Dyne wrote:

On Mon, Sep 9, 2024 at 12:36 AM Akihiko Odaki  wrote:


On 2024/09/09 10:58, Joelle van Dyne wrote:

New optional argument for 'blockdev-change-medium' QAPI command to allow
the caller to specify if they wish to enable file locking.

Signed-off-by: Joelle van Dyne 
---
   qapi/block.json| 23 ++-
   block/monitor/block-hmp-cmds.c |  2 +-
   block/qapi-sysemu.c| 22 ++
   ui/cocoa.m |  1 +
   4 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/qapi/block.json b/qapi/block.json
index e6f5c6..35e8e2e191 100644
--- a/qapi/block.json
+++ b/qapi/block.json
@@ -309,6 +309,23 @@
   { 'enum': 'BlockdevChangeReadOnlyMode',
 'data': ['retain', 'read-only', 'read-write'] }

+##
+# @BlockdevChangeFileLockingMode:
+#
+# Specifies the new locking mode of a file image passed to the
+# @blockdev-change-medium command.
+#
+# @auto: Use locking if API is available
+#
+# @off: Disable file image locking
+#
+# @on: Enable file image locking
+#
+# Since: 9.2
+##
+{ 'enum': 'BlockdevChangeFileLockingMode',
+  'data': ['auto', 'off', 'on'] }


You can use OnOffAuto type instead of defining your own.


This can be done. I had thought that defining a new type makes the
argument more explicit about the meaning.


Speaking of semantics, it would be better to use OnOffAuto to match 
BlockdevOptionsFile's locking property.


We could also argue that having a dedicated type would make this 
consistent with the read-only-mode property, which has such a type, but 
there are other properties that use existing types like str and bool so 
I think it is fine to use an existing type here too.
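
As a rough sketch of how the handler could look with a three-state 
on/off/auto argument instead of a dedicated enum: SketchOnOffAuto below 
is a local stand-in that only mirrors the shape of QEMU's OnOffAuto, and 
file_locking_option() is a hypothetical helper, not an existing function.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in mirroring the shape of a three-state on/off/auto enum. */
typedef enum {
    SKETCH_ON_OFF_AUTO_AUTO,
    SKETCH_ON_OFF_AUTO_ON,
    SKETCH_ON_OFF_AUTO_OFF,
} SketchOnOffAuto;

/*
 * Return the string to store under the "file.locking" option, or NULL
 * when the option should be left unset so the default behaviour applies.
 */
static const char *file_locking_option(SketchOnOffAuto mode)
{
    switch (mode) {
    case SKETCH_ON_OFF_AUTO_ON:
        return "on";
    case SKETCH_ON_OFF_AUTO_OFF:
        return "off";
    case SKETCH_ON_OFF_AUTO_AUTO:
        return NULL;
    }

    /* Not reached for valid enum values. */
    assert(!"invalid SketchOnOffAuto value");
    return NULL;
}
```

The caller would put the returned string into the options dictionary 
only when it is not NULL, which matches the "auto" case of the posted 
patch where nothing is added.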







+
   ##
   # @blockdev-change-medium:
   #
@@ -330,6 +347,9 @@
   # @read-only-mode: change the read-only mode of the device; defaults
   # to 'retain'
   #
+# @file-locking-mode: change the locking mode of the file image; defaults
+# to 'auto' (since: 9.2)
+#
   # @force: if false (the default), an eject request through
   # blockdev-open-tray will be sent to the guest if it has locked
   # the tray (and the tray will not be opened immediately); if true,
@@ -378,7 +398,8 @@
   'filename': 'str',
   '*format': 'str',
   '*force': 'bool',
-'*read-only-mode': 'BlockdevChangeReadOnlyMode' } }
+'*read-only-mode': 'BlockdevChangeReadOnlyMode',
+'*file-locking-mode': 'BlockdevChangeFileLockingMode' } }

   ##
   # @DEVICE_TRAY_MOVED:
diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index bdf2eb50b6..ff64020a80 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -1007,5 +1007,5 @@ void hmp_change_medium(Monitor *mon, const char *device, 
const char *target,
   }

   qmp_blockdev_change_medium(device, NULL, target, arg, true, force,
-   !!read_only, read_only_mode, errp);
+   !!read_only, read_only_mode, false, 0, errp);
   }
diff --git a/block/qapi-sysemu.c b/block/qapi-sysemu.c
index e4282631d2..8064bdfb3a 100644
--- a/block/qapi-sysemu.c
+++ b/block/qapi-sysemu.c
@@ -311,6 +311,8 @@ void qmp_blockdev_change_medium(const char *device,
   bool has_force, bool force,
   bool has_read_only,
   BlockdevChangeReadOnlyMode read_only,
+bool has_file_locking_mode,
+BlockdevChangeFileLockingMode 
file_locking_mode,
   Error **errp)
   {
   BlockBackend *blk;
@@ -362,6 +364,26 @@ void qmp_blockdev_change_medium(const char *device,
   qdict_put_str(options, "driver", format);
   }

+if (!has_file_locking_mode) {
+file_locking_mode = BLOCKDEV_CHANGE_FILE_LOCKING_MODE_AUTO;
+}
+
+switch (file_locking_mode) {
+case BLOCKDEV_CHANGE_FILE_LOCKING_MODE_AUTO:
+break;
+
+case BLOCKDEV_CHANGE_FILE_LOCKING_MODE_OFF:
+qdict_put_str(options, "file.locking", "off");
+break;
+
+case BLOCKDEV_CHANGE_FILE_LOCKING_MODE_ON:
+qdict_put_str(options, "file.locking", "on");
+break;
+
+default:
+abort();
+}
+
   medium_bs = bdrv_open(filename, NULL, options, bdrv_flags, errp);

   if (!medium_bs) {
diff --git a/ui/cocoa.m b/ui/cocoa.m
index 4c2dd33532..6e73c6e13e 100644
--- a/ui/cocoa.m
+++ b/ui/cocoa.m
@@ -1611,6 +1611,7 @@ - (void)changeDeviceMedia:(id)sender
  "raw&q

Re: [PATCH] block: support locking on change medium

2024-09-09 Thread Akihiko Odaki


On 2024/09/09 10:58, Joelle van Dyne wrote:

New optional argument for 'blockdev-change-medium' QAPI command to allow
the caller to specify if they wish to enable file locking.

Signed-off-by: Joelle van Dyne 
---
  qapi/block.json| 23 ++-
  block/monitor/block-hmp-cmds.c |  2 +-
  block/qapi-sysemu.c| 22 ++
  ui/cocoa.m |  1 +
  4 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/qapi/block.json b/qapi/block.json
index e6f5c6..35e8e2e191 100644
--- a/qapi/block.json
+++ b/qapi/block.json
@@ -309,6 +309,23 @@
  { 'enum': 'BlockdevChangeReadOnlyMode',
'data': ['retain', 'read-only', 'read-write'] }
  
+##

+# @BlockdevChangeFileLockingMode:
+#
+# Specifies the new locking mode of a file image passed to the
+# @blockdev-change-medium command.
+#
+# @auto: Use locking if API is available
+#
+# @off: Disable file image locking
+#
+# @on: Enable file image locking
+#
+# Since: 9.2
+##
+{ 'enum': 'BlockdevChangeFileLockingMode',
+  'data': ['auto', 'off', 'on'] }


You can use OnOffAuto type instead of defining your own.


+
  ##
  # @blockdev-change-medium:
  #
@@ -330,6 +347,9 @@
  # @read-only-mode: change the read-only mode of the device; defaults
  # to 'retain'
  #
+# @file-locking-mode: change the locking mode of the file image; defaults
+# to 'auto' (since: 9.2)
+#
  # @force: if false (the default), an eject request through
  # blockdev-open-tray will be sent to the guest if it has locked
  # the tray (and the tray will not be opened immediately); if true,
@@ -378,7 +398,8 @@
  'filename': 'str',
  '*format': 'str',
  '*force': 'bool',
-'*read-only-mode': 'BlockdevChangeReadOnlyMode' } }
+'*read-only-mode': 'BlockdevChangeReadOnlyMode',
+'*file-locking-mode': 'BlockdevChangeFileLockingMode' } }
  
  ##

  # @DEVICE_TRAY_MOVED:
diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index bdf2eb50b6..ff64020a80 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -1007,5 +1007,5 @@ void hmp_change_medium(Monitor *mon, const char *device, 
const char *target,
  }
  
  qmp_blockdev_change_medium(device, NULL, target, arg, true, force,

-   !!read_only, read_only_mode, errp);
+   !!read_only, read_only_mode, false, 0, errp);
  }
diff --git a/block/qapi-sysemu.c b/block/qapi-sysemu.c
index e4282631d2..8064bdfb3a 100644
--- a/block/qapi-sysemu.c
+++ b/block/qapi-sysemu.c
@@ -311,6 +311,8 @@ void qmp_blockdev_change_medium(const char *device,
  bool has_force, bool force,
  bool has_read_only,
  BlockdevChangeReadOnlyMode read_only,
+bool has_file_locking_mode,
+BlockdevChangeFileLockingMode 
file_locking_mode,
  Error **errp)
  {
  BlockBackend *blk;
@@ -362,6 +364,26 @@ void qmp_blockdev_change_medium(const char *device,
  qdict_put_str(options, "driver", format);
  }
  
+if (!has_file_locking_mode) {

+file_locking_mode = BLOCKDEV_CHANGE_FILE_LOCKING_MODE_AUTO;
+}
+
+switch (file_locking_mode) {
+case BLOCKDEV_CHANGE_FILE_LOCKING_MODE_AUTO:
+break;
+
+case BLOCKDEV_CHANGE_FILE_LOCKING_MODE_OFF:
+qdict_put_str(options, "file.locking", "off");
+break;
+
+case BLOCKDEV_CHANGE_FILE_LOCKING_MODE_ON:
+qdict_put_str(options, "file.locking", "on");
+break;
+
+default:
+abort();
+}
+
  medium_bs = bdrv_open(filename, NULL, options, bdrv_flags, errp);
  
  if (!medium_bs) {

diff --git a/ui/cocoa.m b/ui/cocoa.m
index 4c2dd33532..6e73c6e13e 100644
--- a/ui/cocoa.m
+++ b/ui/cocoa.m
@@ -1611,6 +1611,7 @@ - (void)changeDeviceMedia:(id)sender
 "raw",
 true, false,
 false, 0,
+   false, 0,


This change is irrelevant.

Regards,
Akihiko Odaki

Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion

2024-08-30 Thread Akihiko Odaki


On 2024/08/31 0:05, Peter Xu wrote:

On Fri, Aug 30, 2024 at 03:11:38PM +0900, Akihiko Odaki wrote:

On 2024/08/30 4:48, Peter Xu wrote:

On Thu, Aug 29, 2024 at 01:39:36PM +0900, Akihiko Odaki wrote:

I am calling the fact that embedded memory regions are accessible in
instance_finalize() "live". A device can perform operations on its memory
regions during instance_finalize() and we should be aware of that.


This part is true.  I suppose we should still suggest device finalize() to
properly detach MRs, and that should normally be done there.


It is better to avoid manual resource deallocation in general because it is
too error-prone.


I had an impression that you mixed up "finalize()" and "free()" in the
context of our discussion.. let us clarify this first before everything
else below, just in case I overlook stuff..

I am aware of that distinction. I dealt with it in patch "virtio-gpu:
Handle resource blob commands":
https://lore.kernel.org/r/20240822185110.1757429-12-dmitry.osipe...@collabora.com



MR is very special as an object, in that it should have no free() hook,
hence by default nothing is going to be freed when mr->refcount==0.  It
means MRs need to be freed manually always.

For example:

(gdb) p system_memory->parent_obj->free
$2 = (ObjectFree *) 0x0

It plays perfect because the majority of QEMU device model is using MR as a
field (rather than a pointer) of another structure, so that's exactly what
we're looking for: we don't want to free() the MR as it's allocated
together with the owner device.  That'll be released when the owner free().

When dynamic allocation gets into the picture for MR, it's more complicated
for sure, because it means the user (like VFIOQuirk) will need to manually
allocate the MRs, then it requires explicit object_unparent() to detach
that from the device / owner when finalize().  NOTE!  object_unparent()
will NOT free the MR yet so far.  The MR still need to be manually freed
with an explicit g_free(), normally.  Again, I'd suggest you refer to the
VFIOQuirk code just as an example.  In that case this part is done with
e.g. vfio_bar_quirk_finalize().

  for (i = 0; i < quirk->nr_mem; i++) {
  object_unparent(OBJECT(&quirk->mem[i]));
  }
  g_free(quirk->mem);

Here quirk->mem is a pointer to an array of MR which can contain one or
more MRs, but the idea is the same.



I have the impression that the QEMU code base fails at manual resource
deallocation quite frequently, although such deallocation could easily be
automated by binding resources to objects and freeing them when the objects
die, by providing a function like Linux's devm_kmalloc(). Unfortunately, I
haven't found time to do that.


AFAICT, the property list is exactly what you're saying.  IIUC normally an
object will be properly finalized()ed + free()ed when the parent object is
finalize()ed.  Here MR is just special as it bypasses all the free() part.



So my opinion here is 1) we should automate resource deallocation as much as
possible but 2) we shouldn't disturb code that performs manual resource
management.


Agree.  However note again that in whatever case cross-device MR links will
still require explicit detachment or it's prone to memory leak.


I am not sure what you refer to with cross-device MR links so can you
clarify it?


I was referring to Peter Maydell's example, where the MR can be used
outside of its owner.  In that case manual operation is a must before
finalize(), as finalize() can only make sense to resolve internal links
automatically.

But now knowing that you're explicitly mentioning "deallocation" rather
than "finalize()", and you're aware of diff between deallocation
v.s. finalize(), I suppose I misunderstood what you meant, and now I'm not
sure I get what you're suggesting.


I referred to both finalize() and free() with "resource 
deallocation"; actually I referred to any resource that requires some 
operation after its use finishes.


Ideally, any resource should be automatically released. It should also 
be possible to manually release a resource to enforce ordering.
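
A minimal sketch of such a scheme, loosely modelled on the idea behind 
Linux's devm_kmalloc(). All names here (sketch_owner, sketch_bind(), 
sketch_release(), sketch_finalize(), sketch_mark_released()) are 
hypothetical and nothing like this exists in QEMU as far as this thread 
shows:

```c
#include <assert.h>
#include <stdlib.h>

/* One resource bound to an owner, with the callback that releases it. */
struct sketch_res {
    struct sketch_res *next;
    void (*release)(void *data);
    void *data;
};

/* The owner keeps a list of bound resources, most recently bound first. */
struct sketch_owner {
    struct sketch_res *resources;
};

/* Example release callback: flags the int that data points to. */
static void sketch_mark_released(void *data)
{
    *(int *)data = 1;
}

/* Bind data to the owner; release(data) runs when the owner is finalized. */
static int sketch_bind(struct sketch_owner *o, void *data,
                       void (*release)(void *))
{
    assert(o != NULL && release != NULL);

    struct sketch_res *r = malloc(sizeof(*r));
    if (!r) {
        return -1;
    }
    r->data = data;
    r->release = release;
    r->next = o->resources;
    o->resources = r;
    return 0;
}

/* Release one resource early to enforce ordering; -1 if it is not bound. */
static int sketch_release(struct sketch_owner *o, void *data)
{
    for (struct sketch_res **p = &o->resources; *p; p = &(*p)->next) {
        if ((*p)->data == data) {
            struct sketch_res *r = *p;
            *p = r->next;
            r->release(r->data);
            free(r);
            return 0;
        }
    }
    return -1;
}

/* Release everything that is still bound, newest first. */
static void sketch_finalize(struct sketch_owner *o)
{
    while (o->resources) {
        struct sketch_res *r = o->resources;
        o->resources = r->next;
        r->release(r->data);
        free(r);
    }
}
```

Because an early sketch_release() removes the entry from the list, the 
later sketch_finalize() cannot release the same resource twice.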










instance_finalize() is for manual resource management. It is better to have
less code in instance_finalize() and fortunately MemoryRegion doesn't require
any code in instance_finalize() in most cases. If instance_finalize() still
insists on calling object_unparent(), we shouldn't prevent that. (I changed my
mind regarding this particular case of object_unparent() however as I
describe below.)





object_unparent() is such an example. instance_finalize() of a device can
call object_unparent() for a subregion and for its container. If we
automatically finalize the container when calling object_unparent() for the
subregion, calling object_unparent() for its container will result

Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion

2024-08-29 Thread Akihiko Odaki


On 2024/08/30 4:48, Peter Xu wrote:

On Thu, Aug 29, 2024 at 01:39:36PM +0900, Akihiko Odaki wrote:

I am calling the fact that embedded memory regions are accessible in
instance_finalize() "live". A device can perform operations on its memory
regions during instance_finalize() and we should be aware of that.


This part is true.  I suppose we should still suggest device finalize() to
properly detach MRs, and that should normally be done there.


It is better to avoid manual resource deallocation in general because it is
too error-prone.


I had an impression that you mixed up "finalize()" and "free()" in the
context of our discussion.. let us clarify this first before everything
else below, just in case I overlook stuff..

I am aware of that distinction. I dealt with it in patch "virtio-gpu: 
Handle resource blob commands":

https://lore.kernel.org/r/20240822185110.1757429-12-dmitry.osipe...@collabora.com



MR is very special as an object, in that it should have no free() hook,
hence by default nothing is going to be freed when mr->refcount==0.  It
means MRs need to be freed manually always.

For example:

(gdb) p system_memory->parent_obj->free
$2 = (ObjectFree *) 0x0

It plays perfect because the majority of QEMU device model is using MR as a
field (rather than a pointer) of another structure, so that's exactly what
we're looking for: we don't want to free() the MR as it's allocated
together with the owner device.  That'll be released when the owner free().

When dynamic allocation gets into the picture for MR, it's more complicated
for sure, because it means the user (like VFIOQuirk) will need to manually
allocate the MRs, then it requires explicit object_unparent() to detach
that from the device / owner when finalize().  NOTE!  object_unparent()
will NOT free the MR yet so far.  The MR still need to be manually freed
with an explicit g_free(), normally.  Again, I'd suggest you refer to the
VFIOQuirk code just as an example.  In that case this part is done with
e.g. vfio_bar_quirk_finalize().

 for (i = 0; i < quirk->nr_mem; i++) {
 object_unparent(OBJECT(&quirk->mem[i]));
 }
 g_free(quirk->mem);

Here quirk->mem is a pointer to an array of MR which can contain one or
more MRs, but the idea is the same.



I have the impression that the QEMU code base fails at manual resource
deallocation quite frequently, although such deallocation could easily be
automated by binding resources to objects and freeing them when the objects
die, by providing a function like Linux's devm_kmalloc(). Unfortunately, I
haven't found time to do that.


AFAICT, the property list is exactly what you're saying.  IIUC normally an
object will be properly finalized()ed + free()ed when the parent object is
finalize()ed.  Here MR is just special as it bypasses all the free() part.



So my opinion here is 1) we should automate resource deallocation as much as
possible but 2) we shouldn't disturb code that performs manual resource
management.


Agree.  However note again that in whatever case cross-device MR links will
still require explicit detachment or it's prone to memory leak.


I am not sure what you refer to with cross-device MR links so can you 
clarify it?






instance_finalize() is for manual resource management. It is better to have
less code in instance_finalize() and fortunately MemoryRegion doesn't require
any code in instance_finalize() in most cases. If instance_finalize() still
insists on calling object_unparent(), we shouldn't prevent that. (I changed my
mind regarding this particular case of object_unparent() however as I
describe below.)





object_unparent() is such an example. instance_finalize() of a device can
call object_unparent() for a subregion and for its container. If we
automatically finalize the container when calling object_unparent() for the
subregion, calling object_unparent() for its container will result in the
second finalization, which is not good.


IMHO we don't finalize the container at all - what I suggested was we call
del_subregion() for the case where container != NULL.  Since in this case
both container & mr belong to the same owner, it shouldn't change any
refcount, but only remove the link.

However I think I see what you pointed out.  I wonder why we remove all
properties now before reaching instance_finalize(): shouldn't finalize() be
allowed to access some of the properties?

It goes back to this commit:

commit 76a6e1cc7cc3ad022e7159b37b291b75bc4615bf
Author: Paolo Bonzini 
Date:   Wed Jun 11 11:58:30 2014 +0200

  qom: object: delete properties before calling instance_finalize
  This ensures that the children's unparent callback will still
  have a usable parent.
  Reviewed-by: Peter Crosthwaite 
  Signed-off-by: Paolo Bonzini 

  From this series (as the 1st patch the

[PATCH] docs/devel: Prohibit calling object_unparent() for memory region

2024-08-28 Thread Akihiko Odaki

Previously it was allowed to call object_unparent() for a memory region
in instance_finalize() of its parent. However, such a call typically
has no effect because child objects get unparented before
instance_finalize().

Worse, memory regions typically get finalized when they get unparented
before instance_finalize(). This means calling object_unparent() for
them in instance_finalize() is to call the function for an object
already finalized, which should be avoided.

Signed-off-by: Akihiko Odaki 
---
 docs/devel/memory.rst | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/devel/memory.rst b/docs/devel/memory.rst
index 69c5e3f914ac..83760279e3db 100644
--- a/docs/devel/memory.rst
+++ b/docs/devel/memory.rst
@@ -168,11 +168,10 @@ and VFIOQuirk in hw/vfio/pci.c.
 
 You must not destroy a memory region as long as it may be in use by a
 device or CPU.  In order to do this, as a general rule do not create or
-destroy memory regions dynamically during a device's lifetime, and only
-call object_unparent() in the memory region owner's instance_finalize
-callback.  The dynamically allocated data structure that contains the
-memory region then should obviously be freed in the instance_finalize
-callback as well.
+destroy memory regions dynamically during a device's lifetime, and do not
+call object_unparent().  The dynamically allocated data structure that contains
+the memory region then should be freed in the instance_finalize callback, which
+is called after it gets unparented.
 
 If you break this rule, the following situation can happen:
 
@@ -199,8 +198,9 @@ but nevertheless it is used in a few places.
 
 For regions that "have no owner" (NULL is passed at creation time), the
 machine object is actually used as the owner.  Since instance_finalize is
-never called for the machine object, you must never call object_unparent
-on regions that have no owner, unless they are aliases or containers.
+never called for the machine object, you must never free regions that have no
+owner, unless they are aliases or containers, which you can manually call
+object_unparent() for.
 
 
 Overlapping regions and priority

---
base-commit: 31669121a01a14732f57c49400bc239cf9fd505f
change-id: 20240829-memory-cfd3ee0af44d

Best regards,
-- 
Akihiko Odaki

Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion

2024-08-28 Thread Akihiko Odaki


On 2024/08/29 0:56, Peter Xu wrote:

On Wed, Aug 28, 2024 at 11:02:06PM +0900, Akihiko Odaki wrote:

On 2024/08/28 22:09, Peter Xu wrote:

On Wed, Aug 28, 2024 at 02:33:59PM +0900, Akihiko Odaki wrote:

On 2024/08/28 1:11, Peter Xu wrote:

On Tue, Aug 27, 2024 at 01:14:51PM +0900, Akihiko Odaki wrote:

On 2024/08/27 4:42, Peter Xu wrote:

On Mon, Aug 26, 2024 at 06:10:25PM +0100, Peter Maydell wrote:

On Mon, 26 Aug 2024 at 16:22, Peter Xu  wrote:


On Fri, Aug 23, 2024 at 03:13:11PM +0900, Akihiko Odaki wrote:

memory_region_update_container_subregions() used to call
memory_region_ref(), which creates a reference to the owner of the
subregion, on behalf of the owner of the container. This results in a
circular reference if the subregion and container have the same owner.

memory_region_ref() creates a reference to the owner instead of the
memory region to match the lifetime of the owner and memory region. We
do not need such a hack if the subregion and container have the same
owner because the owner will be alive as long as the container is.
Therefore, create a reference to the subregion itself instead of its
owner in such a case; the reference to the subregion is still necessary
to ensure that the subregion gets finalized after the container.

Signed-off-by: Akihiko Odaki 
---
 system/memory.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/system/memory.c b/system/memory.c
index 5e6eb459d5de..e4d3e9d1f427 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2612,7 +2612,9 @@ static void 
memory_region_update_container_subregions(MemoryRegion *subregion)

 memory_region_transaction_begin();

-memory_region_ref(subregion);
+object_ref(mr->owner == subregion->owner ?
+   OBJECT(subregion) : subregion->owner);


The only place that mr->refcount is used so far is the owner with the
object property attached to the mr, am I right (ignoring name-less MRs)?

I worry this will further complicate refcounting, now we're actively using
two refcounts for MRs..


The actor of object_ref() is the owner of the memory region also in this
case. We are calling object_ref() on behalf of mr->owner so we use
mr->refcount iff mr->owner == subregion->owner. In this sense there is only
one user of mr->refcount even after this change.


Yes it's still one user, but it's not that straightforward to see, also
it's still an extension to how we use mr->refcount right now.  Currently
it's about "true / false" just to describe, now it's a real counter.

I wished that counter doesn't even exist if we'd like to stick with device
/ owner's counter.  Adding this can definitely also make further effort
harder if we want to remove mr->refcount.


I don't think it will make removing mr->refcount harder. With this change,
mr->refcount will count the parent and container. If we remove mr->refcount,
we need to trigger object_finalize() in a way other than checking
mr->refcount, which can be achieved by simply evaluating OBJECT(mr)->parent
&& mr->container.







Continue discussion there:

https://lore.kernel.org/r/067b17a4-cdfc-4f7e-b7e4-28c38e1c1...@daynix.com

What I don't see is how mr->subregions differs from mr->container, so we
allow subregions to be attached but not the container when finalize()
(which is, afaict, the other way round).

It seems easier to me that we allow both container and subregions to exist
as long as within the owner itself, rather than start heavier use of
mr->refcount.


I don't think just "same owner" necessarily will be workable --
you can have a setup like:
  * device A has a container C_A
  * device A has a child-device B
  * device B has a memory region R_B
  * device A's realize method puts R_B into C_A

R_B's owner is B, and the container's owner is A,
but we still want to be able to get rid of A (in the process
getting rid of B because it gets unparented and unreffed,
and R_B and C_A also).


For cross-device references, should we rely on an explicit call to
memory_region_del_subregion(), so as to detach the link between C_A and
R_B?


Yes, I agree.



My understanding so far: logically when MR finalize() it should guarantee
both (1) mr->container==NULL, and (2) mr->subregions empty.  That's before
commit 2e2b8eb70fdb7dfb and could be the ideal world (though at the very
beginning we don't assert on ->container==NULL yet).  It requires all
device emulations to do proper unrealize() to unlink all the MRs.

However what I'm guessing is QEMU probably used to have lots of devices
that are not following the rules and leaking these links.  Hence we have
had 2e2b8eb70fdb7dfb, allowing that to happen as long as it's safe, and
it's justified by comment in 2e2b8eb70fdb7dfb on why it's safe.

What I was thinking is this comment seems to

Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion

2024-08-28 Thread Akihiko Odaki


On 2024/08/28 22:09, Peter Xu wrote:

On Wed, Aug 28, 2024 at 02:33:59PM +0900, Akihiko Odaki wrote:

On 2024/08/28 1:11, Peter Xu wrote:

On Tue, Aug 27, 2024 at 01:14:51PM +0900, Akihiko Odaki wrote:

On 2024/08/27 4:42, Peter Xu wrote:

On Mon, Aug 26, 2024 at 06:10:25PM +0100, Peter Maydell wrote:

On Mon, 26 Aug 2024 at 16:22, Peter Xu  wrote:


On Fri, Aug 23, 2024 at 03:13:11PM +0900, Akihiko Odaki wrote:

memory_region_update_container_subregions() used to call
memory_region_ref(), which creates a reference to the owner of the
subregion, on behalf of the owner of the container. This results in a
circular reference if the subregion and container have the same owner.

memory_region_ref() creates a reference to the owner instead of the
memory region to match the lifetime of the owner and memory region. We
do not need such a hack if the subregion and container have the same
owner because the owner will be alive as long as the container is.
Therefore, create a reference to the subregion itself instead of its
owner in such a case; the reference to the subregion is still necessary
to ensure that the subregion gets finalized after the container.

Signed-off-by: Akihiko Odaki 
---
system/memory.c | 8 ++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/system/memory.c b/system/memory.c
index 5e6eb459d5de..e4d3e9d1f427 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2612,7 +2612,9 @@ static void 
memory_region_update_container_subregions(MemoryRegion *subregion)

memory_region_transaction_begin();

-memory_region_ref(subregion);
+object_ref(mr->owner == subregion->owner ?
+   OBJECT(subregion) : subregion->owner);


The only place that mr->refcount is used so far is the owner with the
object property attached to the mr, am I right (ignoring name-less MRs)?

I worry this will further complicate refcounting, now we're actively using
two refcounts for MRs..


The actor of object_ref() is the owner of the memory region also in this
case. We are calling object_ref() on behalf of mr->owner so we use
mr->refcount iff mr->owner == subregion->owner. In this sense there is only
one user of mr->refcount even after this change.


Yes it's still one user, but it's not that straightforward to see, also
it's still an extension to how we use mr->refcount right now.  Currently
it's about "true / false" just to describe, now it's a real counter.

I wished that counter doesn't even exist if we'd like to stick with device
/ owner's counter.  Adding this can definitely also make further effort
harder if we want to remove mr->refcount.


I don't think it will make removing mr->refcount harder. With this change,
mr->refcount will count the parent and container. If we remove mr->refcount,
we need to trigger object_finalize() in a way other than checking
mr->refcount, which can be achieved by simply evaluating OBJECT(mr)->parent
&& mr->container.







Continue discussion there:

https://lore.kernel.org/r/067b17a4-cdfc-4f7e-b7e4-28c38e1c1...@daynix.com

What I don't see is how mr->subregions differs from mr->container, so we
allow subregions to be attached but not the container when finalize()
(which is, afaict, the other way round).

It seems easier to me that we allow both container and subregions to exist
as long as within the owner itself, rather than start heavier use of
mr->refcount.


I don't think just "same owner" necessarily will be workable --
you can have a setup like:
 * device A has a container C_A
 * device A has a child-device B
 * device B has a memory region R_B
 * device A's realize method puts R_B into C_A

R_B's owner is B, and the container's owner is A,
but we still want to be able to get rid of A (in the process
getting rid of B because it gets unparented and unreffed,
and R_B and C_A also).


For cross-device references, should we rely on an explicit call to
memory_region_del_subregion(), so as to detach the link between C_A and
R_B?


Yes, I agree.



My understanding so far: logically when MR finalize() it should guarantee
both (1) mr->container==NULL, and (2) mr->subregions empty.  That's before
commit 2e2b8eb70fdb7dfb and could be the ideal world (though at the very
beginning we don't assert on ->container==NULL yet).  It requires all
device emulations to do proper unrealize() to unlink all the MRs.

However what I'm guessing is QEMU probably used to have lots of devices
that are not following the rules and leaking these links.  Hence we have
had 2e2b8eb70fdb7dfb, allowing that to happen as long as it's safe, and
it's justified by comment in 2e2b8eb70fdb7dfb on why it's safe.

What I was thinking is this comment seems to apply too to mr->container, so
that it should be safe too to unlink ->container the same way as its

Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion

2024-08-27 Thread Akihiko Odaki


On 2024/08/28 1:11, Peter Xu wrote:

On Tue, Aug 27, 2024 at 01:14:51PM +0900, Akihiko Odaki wrote:

On 2024/08/27 4:42, Peter Xu wrote:

On Mon, Aug 26, 2024 at 06:10:25PM +0100, Peter Maydell wrote:

On Mon, 26 Aug 2024 at 16:22, Peter Xu  wrote:


On Fri, Aug 23, 2024 at 03:13:11PM +0900, Akihiko Odaki wrote:

memory_region_update_container_subregions() used to call
memory_region_ref(), which creates a reference to the owner of the
subregion, on behalf of the owner of the container. This results in a
circular reference if the subregion and container have the same owner.

memory_region_ref() creates a reference to the owner instead of the
memory region to match the lifetime of the owner and memory region. We
do not need such a hack if the subregion and container have the same
owner because the owner will be alive as long as the container is.
Therefore, create a reference to the subregion itself instead of its
owner in such a case; the reference to the subregion is still necessary
to ensure that the subregion gets finalized after the container.

Signed-off-by: Akihiko Odaki 
---
   system/memory.c | 8 ++--
   1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/system/memory.c b/system/memory.c
index 5e6eb459d5de..e4d3e9d1f427 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2612,7 +2612,9 @@ static void 
memory_region_update_container_subregions(MemoryRegion *subregion)

   memory_region_transaction_begin();

-memory_region_ref(subregion);
+object_ref(mr->owner == subregion->owner ?
+   OBJECT(subregion) : subregion->owner);


The only place that mr->refcount is used so far is the owner with the
object property attached to the mr, am I right (ignoring name-less MRs)?

I worry this will further complicate refcounting, now we're actively using
two refcounts for MRs..


The actor of object_ref() is the owner of the memory region also in this
case. We are calling object_ref() on behalf of mr->owner so we use
mr->refcount iff mr->owner == subregion->owner. In this sense there is only
one user of mr->refcount even after this change.


Yes it's still one user, but it's not that straightforward to see, also
it's still an extension to how we use mr->refcount right now.  Currently
it's about "true / false" just to describe, now it's a real counter.

I wished that counter doesn't even exist if we'd like to stick with device
/ owner's counter.  Adding this can definitely also make further effort
harder if we want to remove mr->refcount.


I don't think it will make removing mr->refcount harder. With this 
change, mr->refcount will count the parent and container. If we remove 
mr->refcount, we need to trigger object_finalize() in a way other than 
checking mr->refcount, which can be achieved by simply evaluating 
OBJECT(mr)->parent && mr->container.








Continue discussion there:

https://lore.kernel.org/r/067b17a4-cdfc-4f7e-b7e4-28c38e1c1...@daynix.com

What I don't see is how mr->subregions differs from mr->container, so we
allow subregions to be attached but not the container when finalize()
(which is, afaict, the other way round).

It seems easier to me that we allow both container and subregions to exist
as long as within the owner itself, rather than start heavier use of
mr->refcount.


I don't think just "same owner" necessarily will be workable --
you can have a setup like:
* device A has a container C_A
* device A has a child-device B
* device B has a memory region R_B
* device A's realize method puts R_B into C_A

R_B's owner is B, and the container's owner is A,
but we still want to be able to get rid of A (in the process
getting rid of B because it gets unparented and unreffed,
and R_B and C_A also).


For cross-device references, should we rely on an explicit call to
memory_region_del_subregion(), so as to detach the link between C_A and
R_B?


Yes, I agree.



My understanding so far: logically when MR finalize() it should guarantee
both (1) mr->container==NULL, and (2) mr->subregions empty.  That's before
commit 2e2b8eb70fdb7dfb and could be the ideal world (though at the very
beginning we don't assert on ->container==NULL yet).  It requires all
device emulations to do proper unrealize() to unlink all the MRs.

However what I'm guessing is QEMU probably used to have lots of devices
that are not following the rules and leaking these links.  Hence we have
had 2e2b8eb70fdb7dfb, allowing that to happen as long as it's safe, and
it's justified by comment in 2e2b8eb70fdb7dfb on why it's safe.

What I was thinking is this comment seems to apply too to mr->container, so
that it should be safe too to unlink ->container the same way as its own
subregions.
IIUC that means for device-internal MR links we should be fine leaving
whatever lin

Re: [PATCH v4 6/7] memory: Do not create circular reference with subregion

2024-08-26 Thread Akihiko Odaki


On 2024/08/27 4:42, Peter Xu wrote:

On Mon, Aug 26, 2024 at 06:10:25PM +0100, Peter Maydell wrote:

On Mon, 26 Aug 2024 at 16:22, Peter Xu  wrote:


On Fri, Aug 23, 2024 at 03:13:11PM +0900, Akihiko Odaki wrote:

memory_region_update_container_subregions() used to call
memory_region_ref(), which creates a reference to the owner of the
subregion, on behalf of the owner of the container. This results in a
circular reference if the subregion and container have the same owner.

memory_region_ref() creates a reference to the owner instead of the
memory region to match the lifetime of the owner and memory region. We
do not need such a hack if the subregion and container have the same
owner because the owner will be alive as long as the container is.
Therefore, create a reference to the subregion itself instead of its
owner in such a case; the reference to the subregion is still necessary
to ensure that the subregion gets finalized after the container.

Signed-off-by: Akihiko Odaki 
---
  system/memory.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/system/memory.c b/system/memory.c
index 5e6eb459d5de..e4d3e9d1f427 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2612,7 +2612,9 @@ static void 
memory_region_update_container_subregions(MemoryRegion *subregion)

  memory_region_transaction_begin();

-memory_region_ref(subregion);
+object_ref(mr->owner == subregion->owner ?
+   OBJECT(subregion) : subregion->owner);


The only place that mr->refcount is used so far is the owner with the
object property attached to the mr, am I right (ignoring name-less MRs)?

I worry this will further complicate refcounting, now we're actively using
two refcounts for MRs..


The actor of object_ref() is the owner of the memory region also in this 
case. We are calling object_ref() on behalf of mr->owner so we use 
mr->refcount iff mr->owner == subregion->owner. In this sense there is 
only one user of mr->refcount even after this change.




Continue discussion there:

https://lore.kernel.org/r/067b17a4-cdfc-4f7e-b7e4-28c38e1c1...@daynix.com

What I don't see is how mr->subregions differs from mr->container, so we
allow subregions to be attached but not the container when finalize()
(which is, afaict, the other way round).

It seems easier to me that we allow both container and subregions to exist
as long as within the owner itself, rather than start heavier use of
mr->refcount.


I don't think just "same owner" necessarily will be workable --
you can have a setup like:
   * device A has a container C_A
   * device A has a child-device B
   * device B has a memory region R_B
   * device A's realize method puts R_B into C_A

R_B's owner is B, and the container's owner is A,
but we still want to be able to get rid of A (in the process
getting rid of B because it gets unparented and unreffed,
and R_B and C_A also).


For cross-device references, should we rely on an explicit call to
memory_region_del_subregion(), so as to detach the link between C_A and
R_B?


Yes, I agree.



My understanding so far: logically when MR finalize() it should guarantee
both (1) mr->container==NULL, and (2) mr->subregions empty.  That's before
commit 2e2b8eb70fdb7dfb and could be the ideal world (though at the very
beginning we don't assert on ->container==NULL yet).  It requires all
device emulations to do proper unrealize() to unlink all the MRs.

However what I'm guessing is QEMU probably used to have lots of devices
that are not following the rules and leaking these links.  Hence we have
had 2e2b8eb70fdb7dfb, allowing that to happen as long as it's safe, and
it's justified by comment in 2e2b8eb70fdb7dfb on why it's safe.

What I was thinking is this comment seems to apply too to mr->container, so
that it should be safe too to unlink ->container the same way as its own
subregions.
IIUC that means for device-internal MR links we should be fine leaving
whatever link between MRs owned by such device; the device->refcount
guarantees none of them will be visible in any AS.  But then we need to
always properly unlink the MRs when the link is across >1 device owners,
otherwise it's prone to leak.


There is one principle we must satisfy in general: keep a reference to a 
memory region if it is visible to the guest.


It is safe to call memory_region_del_subregion() and to trigger the 
finalization of subregions when the container is not referenced because 
they are no longer visible. This is not true for the other way around; 
even when subregions are not referenced by anyone else, they are still 
visible to the guest as long as the container is visible to the guest. 
It is not safe to unref and finalize them in such a case.


A memory region and its owner will leak if the memory region is kept
visible for too long, whether or not the chain of references contains a
container/subregion relationship.
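
As an illustration only, the reference choice made by the conditional in
[PATCH v4 6/7] can be modelled with a small toy in C. ToyObject, ToyRegion
and the toy_* functions are hypothetical stand-ins invented for this sketch;
they are not QEMU types or APIs, and the real object_ref()/object_unref()
do more than bump a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a QOM object: only the reference count is modelled. */
typedef struct ToyObject {
    int refcount;
} ToyObject;

/* Toy stand-in for a memory region with an optional owner. */
typedef struct ToyRegion {
    ToyObject obj;     /* the region's own reference count */
    ToyObject *owner;  /* owning device, may be NULL */
} ToyRegion;

/*
 * Pick the object to reference when linking sub into container.
 * Same owner: reference the subregion itself, so the owner does not end
 * up holding a reference to itself. Different owner: reference the
 * subregion's owner, as memory_region_ref() is described to do.
 */
static ToyObject *toy_ref_target(const ToyRegion *container, ToyRegion *sub)
{
    return container->owner == sub->owner ? &sub->obj : sub->owner;
}

static void toy_add_subregion(ToyRegion *container, ToyRegion *sub)
{
    ToyObject *target = toy_ref_target(container, sub);

    if (target) {
        target->refcount++;
    }
}

static void toy_del_subregion(ToyRegion *container, ToyRegion *sub)
{
    ToyObject *target = toy_ref_target(container, sub);

    if (target) {
        assert(target->refcount > 0);
        target->refcount--;
    }
}
```

In this toy, linking two regions of the same device leaves the device's
count untouched, while linking a region owned by another device pins that
other device until the link is removed.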


Regards,
Akihiko Odaki

Re: [PATCH v2 09/15] memory: Do not create circular reference with subregion

2024-08-22 Thread Akihiko Odaki


On 2024/08/23 6:01, Peter Xu wrote:

On Thu, Aug 22, 2024 at 06:10:43PM +0100, Peter Maydell wrote:

On Thu, 27 Jun 2024 at 14:40, Akihiko Odaki  wrote:


A memory region does not use its own reference counter, but instead
piggybacks on another QOM object, its "owner" (unless the owner is the
memory region itself). When creating a subregion, a new reference to the
owner of the container must be created. However, if the subregion is
owned by the same QOM object, this results in a self-reference and makes
the owner immortal. Avoid such a self-reference.

Signed-off-by: Akihiko Odaki 
---
  system/memory.c | 11 +--
  1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/system/memory.c b/system/memory.c
index 74cd73ebc78b..949f5016a68d 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2638,7 +2638,10 @@ static void 
memory_region_update_container_subregions(MemoryRegion *subregion)

  memory_region_transaction_begin();

-memory_region_ref(subregion);
+if (mr->owner != subregion->owner) {
+memory_region_ref(subregion);
+}
+
  QTAILQ_FOREACH(other, &mr->subregions, subregions_link) {
  if (subregion->priority >= other->priority) {
  QTAILQ_INSERT_BEFORE(other, subregion, subregions_link);
@@ -2696,7 +2699,11 @@ void memory_region_del_subregion(MemoryRegion *mr,
  assert(alias->mapped_via_alias >= 0);
  }
  QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link);
-memory_region_unref(subregion);
+
+if (mr->owner != subregion->owner) {
+memory_region_unref(subregion);
+}
+
  memory_region_update_pending |= mr->enabled && subregion->enabled;
  memory_region_transaction_commit();
  }


I was having another look at leaks this week, and I tried
this patch to see how many of the leaks I was seeing it
fixed. I found however that for arm it results in an
assertion when the device-introspection-test exercises
the "imx7.analog" device. By-hand repro:

$ ./build/asan/qemu-system-aarch64 -display none -machine none -accel
qtest -monitor stdio
==712838==WARNING: ASan doesn't fully support makecontext/swapcontext
functions and may produce false positives in some cases!
QEMU 9.0.92 monitor - type 'help' for more information
(qemu) device_add imx7.analog,help
qemu-system-aarch64: ../../system/memory.c:1777: void
memory_region_finalize(Object *): Assertion `!mr->container' failed.
Aborted (core dumped)

It may well be that this is a preexisting bug that's only
exposed by this refcount change causing us to actually try
to dispose of the memory regions.

I think that what's happening here is that the device
object has multiple MemoryRegions, each of which is a child
QOM property. One of these MRs is a "container MR", and the
other three are actual-content MRs which the device put into
the container when it created them. When we deref the device,
we go through all the child QOM properties unparenting and
unreffing them. However, there's no particular ordering
here, and it happens that we try to unref one of the
actual-content MRs first. That MR is still inside the
container MR, so we hit the assert. If we had happened to
unref the container MR first then memory_region_finalize()
would have removed all the subregions from it, avoiding
the problem.

I'm not sure what the best fix would be here -- that assert
is there as a guard that the region isn't visible in
any address space, so maybe it needs to be made a bit
cleverer about the condition it checks? e.g. in this
example although mr->container is not NULL,
mr->container->container is NULL.
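
The ordering problem described above can be sketched with a toy model in C.
ToyMR, toy_add() and toy_finalize() are hypothetical names made up for this
illustration, not QEMU code; toy_finalize() returns false at the point where
the real memory_region_finalize() would hit its `!mr->container` assertion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_SUBREGIONS 4

/* Toy memory region: only the container/subregion links are modelled. */
typedef struct ToyMR {
    struct ToyMR *container;
    struct ToyMR *subregions[TOY_MAX_SUBREGIONS];
    size_t nb_subregions;
} ToyMR;

/* Link sub into container, as a device would when it creates its MRs. */
static void toy_add(ToyMR *container, ToyMR *sub)
{
    assert(container->nb_subregions < TOY_MAX_SUBREGIONS);
    container->subregions[container->nb_subregions++] = sub;
    sub->container = container;
}

/*
 * Finalize one region. A region that is still inside a container is
 * rejected (the toy equivalent of the failing assertion). A container
 * detaches all of its subregions before it goes away.
 */
static bool toy_finalize(ToyMR *mr)
{
    size_t i;

    if (mr->container) {
        return false;
    }
    for (i = 0; i < mr->nb_subregions; i++) {
        mr->subregions[i]->container = NULL;
    }
    mr->nb_subregions = 0;
    return true;
}
```

Finalizing a subregion before its container fails in this toy, while
finalizing the container first detaches the subregions and lets the rest
proceed in any order.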


If we keep looking at ->container we'll always see NULL, IIUC, because
either it's removed from its parent MR so it's NULL already, or at some
point it can start to point to a root mr of an address space, where should
also be NULL, afaiu.


Or we could check whether the mr->container->owner is the same as the
mr->owner and allow a non-NULL mr->container in that case.  I don't know
this subsystem well enough so I'm just making random stabs here, though.


If with the assumption of this patch applied, then looks like it's pretty
legal a container MR and the child MRs be finalized in any order when the
owner device is being destroyed.

IIUC the MR should be destined to be invisible until this point, with or
without the fact that mr->container is NULL.  It's because anyone who
references the MR should do memory_region_ref() first, which takes the
owner's refcount.  Here if MR finalize() is reached I think it means the
owner refcount must be zero.  So it looks to me the only possible case when
mr->container is non-NULL is it's used internally like this.  Then it's
invisible and also safe to be detached even if container != NULL.


It is still nice if we can protec

[PATCH v4 2/7] memory: Do not refer to "memory region's reference count"

2024-08-22 Thread Akihiko Odaki

Now MemoryRegions do have their own reference counts, but they will not
be used when their owners are not themselves. However, the documentation
of memory_region_ref() says it adds "1 to a memory region's reference
count", which is confusing. Avoid referring to "memory region's
reference count" and just say: "Add a reference to a memory region".
Make a similar change to memory_region_unref() too.

Signed-off-by: Akihiko Odaki 
---
 include/exec/memory.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 02f7528ec060..b9f0ad09bfad 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1217,7 +1217,7 @@ void memory_region_init(MemoryRegion *mr,
 uint64_t size);
 
 /**
- * memory_region_ref: Add 1 to a memory region's reference count
+ * memory_region_ref: Add a reference to a memory region
  *
  * Whenever memory regions are accessed outside the BQL, they need to be
  * preserved against hot-unplug.  MemoryRegions actually do not have their
@@ -1234,7 +1234,7 @@ void memory_region_init(MemoryRegion *mr,
 void memory_region_ref(MemoryRegion *mr);
 
 /**
- * memory_region_unref: Remove 1 to a memory region's reference count
+ * memory_region_unref: Remove a reference to a memory region
  *
  * Whenever memory regions are accessed outside the BQL, they need to be
  * preserved against hot-unplug.  MemoryRegions actually do not have their

-- 
2.46.0

[PATCH v4 5/7] memory: Clarify owner must not call memory_region_ref()

2024-08-22 Thread Akihiko Odaki

Signed-off-by: Akihiko Odaki 
---
 include/exec/memory.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index d79415a3b159..6698e9d05eab 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1220,6 +1220,7 @@ void memory_region_init(MemoryRegion *mr,
  * memory_region_ref: Add a reference to a memory region
  *
  * This function adds a reference to the owner if present.
+ * The owner must not call this function as it results in a circular reference.
  * See docs/devel/memory.rst to know about owner.
  *
  * @mr: the #MemoryRegion

-- 
2.46.0

[PATCH v4 6/7] memory: Do not create circular reference with subregion

2024-08-22 Thread Akihiko Odaki

memory_region_update_container_subregions() used to call
memory_region_ref(), which creates a reference to the owner of the
subregion, on behalf of the owner of the container. This results in a
circular reference if the subregion and container have the same owner.

memory_region_ref() creates a reference to the owner instead of the
memory region to match the lifetime of the owner and memory region. We
do not need such a hack if the subregion and container have the same
owner because the owner will be alive as long as the container is.
Therefore, create a reference to the subregion itself instead of its
owner in such a case; the reference to the subregion is still necessary
to ensure that the subregion gets finalized after the container.

Signed-off-by: Akihiko Odaki 
---
 system/memory.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/system/memory.c b/system/memory.c
index 5e6eb459d5de..e4d3e9d1f427 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2612,7 +2612,9 @@ static void memory_region_update_container_subregions(MemoryRegion *subregion)
 
     memory_region_transaction_begin();
 
-    memory_region_ref(subregion);
+    object_ref(mr->owner == subregion->owner ?
+               OBJECT(subregion) : subregion->owner);
+
     QTAILQ_FOREACH(other, &mr->subregions, subregions_link) {
         if (subregion->priority >= other->priority) {
             QTAILQ_INSERT_BEFORE(other, subregion, subregions_link);
@@ -2670,7 +2672,9 @@ void memory_region_del_subregion(MemoryRegion *mr,
         assert(alias->mapped_via_alias >= 0);
     }
     QTAILQ_REMOVE(&mr->subregions, subregion, subregions_link);
-    memory_region_unref(subregion);
+    object_unref(mr->owner == subregion->owner ?
+                 OBJECT(subregion) : subregion->owner);
+
     memory_region_update_pending |= mr->enabled && subregion->enabled;
     memory_region_transaction_commit();
 }

-- 
2.46.0

[PATCH v4 4/7] memory: Clarify that owner may be missing

2024-08-22 Thread Akihiko Odaki

A memory region may not have an owner, and memory_region_ref() and
memory_region_unref() do nothing for such.

Signed-off-by: Akihiko Odaki 
---
 include/exec/memory.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 461e42d03491..d79415a3b159 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1219,7 +1219,7 @@ void memory_region_init(MemoryRegion *mr,
 /**
  * memory_region_ref: Add a reference to a memory region
  *
- * This function adds a reference to the owner.
+ * This function adds a reference to the owner if present.
  * See docs/devel/memory.rst to know about owner.
  *
  * @mr: the #MemoryRegion
@@ -1229,8 +1229,8 @@ void memory_region_ref(MemoryRegion *mr);
 /**
  * memory_region_unref: Remove a reference to a memory region
  *
- * This function removes a reference to the owner and possibly destroys it.
- * See docs/devel/memory.rst to know about owner.
+ * This function removes a reference to the owner and possibly destroys it if
+ * present. See docs/devel/memory.rst to know about owner.
  *
  * @mr: the #MemoryRegion
  */

-- 
2.46.0

[PATCH v4 1/7] migration: Free removed SaveStateEntry

2024-08-22 Thread Akihiko Odaki

This fixes LeakSanitizer warnings.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Peter Xu 
Reviewed-by: Michael S. Tsirkin 
---
 migration/savevm.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index deb57833f8a8..85958d7b09cd 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -874,6 +874,8 @@ int vmstate_replace_hack_for_ppc(VMStateIf *obj, int instance_id,
 
 if (se) {
 savevm_state_handler_remove(se);
+g_free(se->compat);
+g_free(se);
 }
 return vmstate_register(obj, instance_id, vmsd, opaque);
 }

-- 
2.46.0

[PATCH v4 7/7] tests/qtest: Delete previous boot file

2024-08-22 Thread Akihiko Odaki

A test run may create boot files several times. Delete the previous boot
file before creating a new one.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Michael S. Tsirkin 
Acked-by: Thomas Huth 
---
 tests/qtest/migration-test.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 70b606b88864..6c06100d91e2 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -144,12 +144,23 @@ static char *bootpath;
 #include "tests/migration/ppc64/a-b-kernel.h"
 #include "tests/migration/s390x/a-b-bios.h"
 
+static void bootfile_delete(void)
+{
+unlink(bootpath);
+g_free(bootpath);
+bootpath = NULL;
+}
+
 static void bootfile_create(char *dir, bool suspend_me)
 {
 const char *arch = qtest_get_arch();
 unsigned char *content;
 size_t len;
 
+if (bootpath) {
+bootfile_delete();
+}
+
 bootpath = g_strdup_printf("%s/bootsect", dir);
 if (strcmp(arch, "i386") == 0 || strcmp(arch, "x86_64") == 0) {
 /* the assembled x86 boot sector should be exactly one sector large */
@@ -177,13 +188,6 @@ static void bootfile_create(char *dir, bool suspend_me)
 fclose(bootfile);
 }
 
-static void bootfile_delete(void)
-{
-unlink(bootpath);
-g_free(bootpath);
-bootpath = NULL;
-}
-
 /*
  * Wait for some output in the serial output file,
  * we get an 'A' followed by an endless string of 'B's

-- 
2.46.0
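
The pattern used by "tests/qtest: Delete previous boot file" can be sketched outside the test harness as follows. This is a simplified model under stated assumptions: it uses plain libc instead of GLib, the `name` parameter and the file contents are hypothetical, and only the "delete the old file before creating a new one" step mirrors the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Path of the boot file currently on disk, or NULL if there is none. */
static char *bootpath;

/* Remove the current boot file, if any, and forget its path. */
static void bootfile_delete(void)
{
    if (!bootpath) {
        return;
    }
    unlink(bootpath);
    free(bootpath);
    bootpath = NULL;
}

/*
 * Create dir/name, deleting any previously created boot file first.
 * Returns 0 on success and -1 on failure.
 */
static int bootfile_create(const char *dir, const char *name)
{
    size_t len;
    FILE *f;

    assert(dir && name);

    /* Without this call the previous file and its path would leak. */
    bootfile_delete();

    len = (size_t)snprintf(NULL, 0, "%s/%s", dir, name) + 1;
    bootpath = malloc(len);
    if (!bootpath) {
        return -1;
    }
    snprintf(bootpath, len, "%s/%s", dir, name);

    f = fopen(bootpath, "wb");
    if (!f) {
        free(bootpath);
        bootpath = NULL;
        return -1;
    }
    fputs("boot", f);
    fclose(f);
    return 0;
}
```

Calling `bootfile_create()` twice in a row leaves only the second file behind, which is the property the patch establishes for repeated test setup.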

[PATCH v4 3/7] memory: Refer to docs/devel/memory.rst for "owner"

2024-08-22 Thread Akihiko Odaki

memory_region_ref() and memory_region_unref() used to have their own
descriptions of "owner", but they are somewhat out-of-date and
misleading.

In particular, they say "whenever memory regions are accessed outside
the BQL, they need to be preserved against hot-unplug", but protecting
against hot-unplug is not mandatory if it is known that they will never
be hot-unplugged. They also say "MemoryRegions actually do not have
their own reference count", but they actually do. They just will not be
used unless their owners are not themselves.

Refer to docs/devel/memory.rst as the single source of truth instead of
maintaining duplicate descriptions of "owner".

Signed-off-by: Akihiko Odaki 
---
 include/exec/memory.h | 13 ++---
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index b9f0ad09bfad..461e42d03491 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1219,15 +1219,8 @@ void memory_region_init(MemoryRegion *mr,
 /**
  * memory_region_ref: Add a reference to a memory region
  *
- * Whenever memory regions are accessed outside the BQL, they need to be
- * preserved against hot-unplug.  MemoryRegions actually do not have their
- * own reference count; they piggyback on a QOM object, their "owner".
  * This function adds a reference to the owner.
- *
- * All MemoryRegions must have an owner if they can disappear, even if the
- * device they belong to operates exclusively under the BQL.  This is because
- * the region could be returned at any time by memory_region_find, and this
- * is usually under guest control.
+ * See docs/devel/memory.rst to know about owner.
  *
  * @mr: the #MemoryRegion
  */
@@ -1236,10 +1229,8 @@ void memory_region_ref(MemoryRegion *mr);
 /**
  * memory_region_unref: Remove a reference to a memory region
  *
- * Whenever memory regions are accessed outside the BQL, they need to be
- * preserved against hot-unplug.  MemoryRegions actually do not have their
- * own reference count; they piggyback on a QOM object, their "owner".
  * This function removes a reference to the owner and possibly destroys it.
+ * See docs/devel/memory.rst to know about owner.
  *
  * @mr: the #MemoryRegion
  */

-- 
2.46.0

[PATCH v4 0/7] Fix check-qtest-ppc64 sanitizer errors

2024-08-22 Thread Akihiko Odaki

I saw various sanitizer errors when running check-qtest-ppc64. While
I could just turn off sanitizers, I decided to tackle them this time.

Unfortunately, GLib versions older than 2.81.0 do not free test data in
some cases so some sanitizer errors remain. All sanitizer errors will be
gone with this patch series combined with the following change for GLib:
https://gitlab.gnome.org/GNOME/glib/-/merge_requests/4120

Signed-off-by: Akihiko Odaki 
---
Changes in v4:
- Changed to create a reference to the subregion instead of its owner
  when its owner is the same as the container's owner.
- Dropped R-b from patch "memory: Do not create circular reference with
  subregion".
- Rebased.
- Link to v3: 
https://lore.kernel.org/r/20240708-san-v3-0-b03f671c4...@daynix.com

Changes in v3:
- Added patch "memory: Clarify that we use owner's reference count".
- Added patch "memory: Refer to docs/devel/memory.rst for 'owner'".
- Fixed the message of patch
  "memory: Do not create circular reference with subregion".
- Dropped patch "cpu: Free cpu_ases" in favor of:
  https://lore.kernel.org/r/20240607115649.214622-7-salil.me...@huawei.com/
  ("[PATCH V13 6/8] physmem: Add helper function to destroy CPU
  AddressSpace")
- Dropped patches "hw/ide: Convert macio ide_irq into GPIO line" and
  "hw/ide: Remove internal DMA qemu_irq" in favor of commit efb359346c7a
  ("hw/ide/macio: switch from using qemu_allocate_irq() to qdev input
  GPIOs")
- Dropped patch "hw/isa/vt82c686: Define a GPIO line between vt82c686
  and i8259" in favor of:
  https://patchew.org/QEMU/20240704205854.18537-1-shen...@gmail.com/
  ("[PATCH 0/3] Resolve vt82c686 and piix4 qemu_irq memory leaks")
- Dropped pulled patches.
- Link to v2: 
https://lore.kernel.org/r/20240627-san-v2-0-750bb0946...@daynix.com

Changes in v2:
- Rebased to "[PATCH] cpu: fix memleak of 'halt_cond' and 'thread'".
  (Philippe Mathieu-Daudé)
- Converted IRQs into GPIO lines and removed one qemu_irq usage.
  (Peter Maydell)
- s/suppresses/fixes/ (Michael S. Tsirkin)
- Corrected title of patch "hw/virtio: Free vqs after vhost_dev_cleanup()"
  (was "hw/virtio: Free vqs before vhost_dev_cleanup()")
- Link to v1: 
https://lore.kernel.org/r/20240626-san-v1-0-f3cc42302...@daynix.com

---
Akihiko Odaki (7):
  migration: Free removed SaveStateEntry
  memory: Do not refer to "memory region's reference count"
  memory: Refer to docs/devel/memory.rst for "owner"
  memory: Clarify that owner may be missing
  memory: Clarify owner must not call memory_region_ref()
  memory: Do not create circular reference with subregion
  tests/qtest: Delete previous boot file

 include/exec/memory.h| 22 +++---
 migration/savevm.c   |  2 ++
 system/memory.c  |  8 ++--
 tests/qtest/migration-test.c | 18 +++---
 4 files changed, 26 insertions(+), 24 deletions(-)
---
base-commit: 31669121a01a14732f57c49400bc239cf9fd505f
change-id: 20240625-san-097afaf4f1c2

Best regards,
-- 
Akihiko Odaki

Re: [PATCH for-9.2 v7 0/9] virtio-net: add support for SR-IOV emulation

2024-08-22 Thread Akihiko Odaki


On 2024/08/21 19:18, Yui Washizu wrote:


On 2024/08/13 15:36, Akihiko Odaki wrote:

Based-on: <20240802-reuse-v11-0-fb83bb8c1...@daynix.com>
("[PATCH for-9.2 v11 00/11] hw/pci: SR-IOV related fixes and 
improvements")


I couldn't apply this patch series after applying
"[PATCH for-9.2 v11 00/11] hw/pci: SR-IOV related fixes and improvements".


That was a mistake; this series is intended to be applied on
"[PATCH for-9.2 v14 00/11] hw/pci: SR-IOV related fixes and improvements":

https://patchew.org/QEMU/20240813-reuse-v14-0-4c15bc6ee...@daynix.com/

It can also be applied cleanly on the new version I have just sent,
"[PATCH for-9.2 v15 00/11] hw/pci: SR-IOV related fixes and improvements":

https://patchew.org/QEMU/20240823-reuse-v15-0-eddcb960e...@daynix.com/

Regards,
Akihiko Odaki

[PATCH for-9.2 v15 10/11] hw/pci: Use -1 as the default value for rombar

2024-08-22 Thread Akihiko Odaki

vfio_pci_size_rom() distinguishes whether rombar is explicitly set to 1
by checking dev->opts, bypassing the QOM property infrastructure.

Use -1 as the default value for rombar to tell if the user explicitly
set it to 1. The property is also converted from unsigned to signed.
-1 is signed so it is safe to give it a new meaning. The values in
[2 ^ 31, 2 ^ 32) become invalid, but nobody should have typed these
values by chance.

Suggested-by: Markus Armbruster 
Signed-off-by: Akihiko Odaki 
Reviewed-by: Markus Armbruster 
---
 include/hw/pci/pci_device.h | 2 +-
 hw/pci/pci.c| 2 +-
 hw/vfio/pci.c   | 5 ++---
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index 1ff3ce94e25b..8fa845beee5e 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -148,7 +148,7 @@ struct PCIDevice {
 uint32_t romsize;
 bool has_rom;
 MemoryRegion rom;
-uint32_t rom_bar;
+int32_t rom_bar;
 
 /* INTx routing notifier */
 PCIINTxRoutingNotifier intx_routing_notifier;
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 4c7be5295110..d2eaf0c51dde 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -71,7 +71,7 @@ static Property pci_props[] = {
 DEFINE_PROP_PCI_DEVFN("addr", PCIDevice, devfn, -1),
 DEFINE_PROP_STRING("romfile", PCIDevice, romfile),
 DEFINE_PROP_UINT32("romsize", PCIDevice, romsize, UINT32_MAX),
-DEFINE_PROP_UINT32("rombar",  PCIDevice, rom_bar, 1),
+DEFINE_PROP_INT32("rombar",  PCIDevice, rom_bar, -1),
 DEFINE_PROP_BIT("multifunction", PCIDevice, cap_present,
 QEMU_PCI_CAP_MULTIFUNCTION_BITNR, false),
 DEFINE_PROP_BIT("x-pcie-lnksta-dllla", PCIDevice, cap_present,
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 2407720c3530..dc53837eac73 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1012,7 +1012,6 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
 {
 uint32_t orig, size = cpu_to_le32((uint32_t)PCI_ROM_ADDRESS_MASK);
 off_t offset = vdev->config_offset + PCI_ROM_ADDRESS;
-DeviceState *dev = DEVICE(vdev);
 char *name;
 int fd = vdev->vbasedev.fd;
 
@@ -1046,12 +1045,12 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
 }
 
 if (vfio_opt_rom_in_denylist(vdev)) {
-if (dev->opts && qdict_haskey(dev->opts, "rombar")) {
+if (vdev->pdev.rom_bar > 0) {
 warn_report("Device at %s is known to cause system instability"
 " issues during option rom execution",
 vdev->vbasedev.name);
 error_printf("Proceeding anyway since user specified"
- " non zero value for rombar\n");
+ " positive value for rombar\n");
 } else {
 warn_report("Rom loading for device at %s has been disabled"
 " due to system instability issues",

-- 
2.46.0
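
The sentinel approach in "hw/pci: Use -1 as the default value for rombar" can be illustrated with a minimal sketch. The helper names and the `ROMBAR_DEFAULT` macro are hypothetical, and the sketch assumes that other code only tests the value against zero, so the default of -1 still counts as enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the new default: the user did not set rombar. */
#define ROMBAR_DEFAULT (-1)

/* Assumed pre-existing semantics: any non-zero value enables the ROM BAR. */
static bool rombar_enabled(int32_t rom_bar)
{
    return rom_bar != 0;
}

/* Only a positive value can have come from an explicit user setting. */
static bool rombar_explicitly_enabled(int32_t rom_bar)
{
    return rom_bar > 0;
}

/* Usage sketch covering the three interesting inputs. */
static void rombar_demo(void)
{
    /* Default: enabled, but not at the user's request. */
    assert(rombar_enabled(ROMBAR_DEFAULT));
    assert(!rombar_explicitly_enabled(ROMBAR_DEFAULT));

    /* rombar=1: enabled at the user's request. */
    assert(rombar_enabled(1));
    assert(rombar_explicitly_enabled(1));

    /* rombar=0: disabled. */
    assert(!rombar_enabled(0));
    assert(!rombar_explicitly_enabled(0));
}
```

The point of the signed property is that the "explicitly enabled" question can be answered from the property value alone, without consulting the option dictionary.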

[PATCH for-9.2 v15 04/11] s390x/pci: Check for multifunction after device realization

2024-08-22 Thread Akihiko Odaki

The SR-IOV PFs set the multifunction bits during device realization so
check them after that. This forbids adding SR-IOV devices to s390x.

Signed-off-by: Akihiko Odaki 
---
 hw/s390x/s390-pci-bus.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 3e57d5faca18..00b2c1f6157b 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -971,14 +971,7 @@ static void s390_pcihost_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 "this device");
 }
 
-if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
-PCIDevice *pdev = PCI_DEVICE(dev);
-
-if (pdev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
-error_setg(errp, "multifunction not supported in s390");
-return;
-}
-} else if (object_dynamic_cast(OBJECT(dev), TYPE_S390_PCI_DEVICE)) {
+if (object_dynamic_cast(OBJECT(dev), TYPE_S390_PCI_DEVICE)) {
 S390PCIBusDevice *pbdev = S390_PCI_DEVICE(dev);
 
 if (!s390_pci_alloc_idx(s, pbdev)) {
@@ -1069,6 +1062,11 @@ static void s390_pcihost_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 } else if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
 pdev = PCI_DEVICE(dev);
 
+if (pdev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+error_setg(errp, "multifunction not supported in s390");
+return;
+}
+
 if (!dev->id) {
 /* In the case the PCI device does not define an id */
 /* we generate one based on the PCI address */

-- 
2.46.0

[PATCH for-9.2 v15 07/11] pcie_sriov: Release VFs failed to realize

2024-08-22 Thread Akihiko Odaki

Release VFs that failed to realize, just as we do in unregister_vfs().

Fixes: 7c0fa8dff811 ("pcie: Add support for Single Root I/O Virtualization (SR/IOV)")
Signed-off-by: Akihiko Odaki 
---
 hw/pci/pcie_sriov.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index 4bffe6c97f66..ac8c4013bc88 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -87,6 +87,8 @@ bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
 vf->exp.sriov_vf.vf_number = i;
 
 if (!qdev_realize(&vf->qdev, bus, errp)) {
+object_unparent(OBJECT(vf));
+object_unref(vf);
 unparent_vfs(dev, i);
 return false;
 }

-- 
2.46.0

[PATCH for-9.2 v15 03/11] hw/ppc/spapr_pci: Do not reject VFs created after a PF

2024-08-22 Thread Akihiko Odaki

A PF may automatically create VFs and the PF may be function 0.

Signed-off-by: Akihiko Odaki 
---
 hw/ppc/spapr_pci.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index f63182a03c41..ed4454bbf79e 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1573,7 +1573,9 @@ static void spapr_pci_pre_plug(HotplugHandler *plug_handler,
  * hotplug, we do not allow functions to be hotplugged to a
  * slot that already has function 0 present
  */
-if (plugged_dev->hotplugged && bus->devices[PCI_DEVFN(slotnr, 0)] &&
+if (plugged_dev->hotplugged &&
+!pci_is_vf(pdev) &&
+bus->devices[PCI_DEVFN(slotnr, 0)] &&
 PCI_FUNC(pdev->devfn) != 0) {
 error_setg(errp, "PCI: slot %d function 0 already occupied by %s,"
" additional functions can no longer be exposed to guest.",

-- 
2.46.0

[PATCH for-9.2 v15 05/11] pcie_sriov: Do not manually unrealize

2024-08-22 Thread Akihiko Odaki

A device gets automatically unrealized when being unparented.

Signed-off-by: Akihiko Odaki 
---
 hw/pci/pcie_sriov.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index e9b23221d713..499becd5273f 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -204,11 +204,7 @@ static void unregister_vfs(PCIDevice *dev)
 trace_sriov_unregister_vfs(dev->name, PCI_SLOT(dev->devfn),
PCI_FUNC(dev->devfn), num_vfs);
 for (i = 0; i < num_vfs; i++) {
-Error *err = NULL;
 PCIDevice *vf = dev->exp.sriov_pf.vf[i];
-if (!object_property_set_bool(OBJECT(vf), "realized", false, &err)) {
-error_reportf_err(err, "Failed to unplug: ");
-}
 object_unparent(OBJECT(vf));
 object_unref(OBJECT(vf));
 }

-- 
2.46.0

[PATCH for-9.2 v15 06/11] pcie_sriov: Reuse SR-IOV VF device instances

2024-08-22 Thread Akihiko Odaki

Disable SR-IOV VF devices by reusing the code that powers down PCI
devices instead of removing them when the guest requests to disable VFs.
This allows VFs to be realized, and VF realization errors to be
reported, at PF realization time.

Signed-off-by: Akihiko Odaki 
---
 docs/pcie_sriov.txt |   8 ++--
 include/hw/pci/pci.h|   5 ---
 include/hw/pci/pci_device.h |  15 +++
 include/hw/pci/pcie_sriov.h |   6 +--
 hw/net/igb.c|  13 --
 hw/nvme/ctrl.c  |  24 +++
 hw/pci/pci.c|   2 +-
 hw/pci/pcie_sriov.c | 102 +++-
 8 files changed, 95 insertions(+), 80 deletions(-)

diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
index a47aad0bfab0..ab2142807f79 100644
--- a/docs/pcie_sriov.txt
+++ b/docs/pcie_sriov.txt
@@ -52,9 +52,11 @@ setting up a BAR for a VF.
   ...
 
   /* Add and initialize the SR/IOV capability */
-  pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
-   vf_devid, initial_vfs, total_vfs,
-   fun_offset, stride);
+  if (!pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
+  vf_devid, initial_vfs, total_vfs,
+  fun_offset, stride, errp)) {
+ return;
+  }
 
   /* Set up individual VF BARs (parameters as for normal BARs) */
   pcie_sriov_pf_init_vf_bar( ... )
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index fe04b4fafd04..14a869eeaa71 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -680,9 +680,4 @@ static inline void pci_irq_pulse(PCIDevice *pci_dev)
 MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
 void pci_set_enabled(PCIDevice *pci_dev, bool state);
 
-static inline void pci_set_power(PCIDevice *pci_dev, bool state)
-{
-pci_set_enabled(pci_dev, state);
-}
-
 #endif
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index f38fb3111954..1ff3ce94e25b 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -212,6 +212,21 @@ static inline uint16_t pci_get_bdf(PCIDevice *dev)
 return PCI_BUILD_BDF(pci_bus_num(pci_get_bus(dev)), dev->devfn);
 }
 
+static inline void pci_set_power(PCIDevice *pci_dev, bool state)
+{
+/*
+ * Don't change the enabled state of VFs when powering on/off the device.
+ *
+ * When powering on, VFs must not be enabled immediately but they must
+ * wait until the guest configures SR-IOV.
+ * When powering off, their corresponding PFs will be reset and disable
+ * VFs.
+ */
+if (!pci_is_vf(pci_dev)) {
+pci_set_enabled(pci_dev, state);
+}
+}
+
 uint16_t pci_requester_id(PCIDevice *dev);
 
 /* DMA access functions */
diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
index 450cbef6c201..70649236c18a 100644
--- a/include/hw/pci/pcie_sriov.h
+++ b/include/hw/pci/pcie_sriov.h
@@ -18,7 +18,6 @@
 typedef struct PCIESriovPF {
 uint16_t num_vfs;   /* Number of virtual functions created */
 uint8_t vf_bar_type[PCI_NUM_REGIONS];   /* Store type for each VF bar */
-const char *vfname; /* Reference to the device type used for the VFs */
 PCIDevice **vf; /* Pointer to an array of num_vfs VF devices */
 } PCIESriovPF;
 
@@ -27,10 +26,11 @@ typedef struct PCIESriovVF {
 uint16_t vf_number; /* Logical VF number of this function */
 } PCIESriovVF;
 
-void pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
+bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
 const char *vfname, uint16_t vf_dev_id,
 uint16_t init_vfs, uint16_t total_vfs,
-uint16_t vf_offset, uint16_t vf_stride);
+uint16_t vf_offset, uint16_t vf_stride,
+Error **errp);
 void pcie_sriov_pf_exit(PCIDevice *dev);
 
 /* Set up a VF bar in the SR/IOV bar area */
diff --git a/hw/net/igb.c b/hw/net/igb.c
index b92bba402e0d..b6ca2f1b8aee 100644
--- a/hw/net/igb.c
+++ b/hw/net/igb.c
@@ -446,9 +446,16 @@ static void igb_pci_realize(PCIDevice *pci_dev, Error **errp)
 
 pcie_ari_init(pci_dev, 0x150);
 
-pcie_sriov_pf_init(pci_dev, IGB_CAP_SRIOV_OFFSET, TYPE_IGBVF,
-IGB_82576_VF_DEV_ID, IGB_MAX_VF_FUNCTIONS, IGB_MAX_VF_FUNCTIONS,
-IGB_VF_OFFSET, IGB_VF_STRIDE);
+if (!pcie_sriov_pf_init(pci_dev, IGB_CAP_SRIOV_OFFSET,
+TYPE_IGBVF, IGB_82576_VF_DEV_ID,
+IGB_MAX_VF_FUNCTIONS, IGB_MAX_VF_FUNCTIONS,
+IGB_VF_OFFSET, IGB_VF_STRIDE,
+errp)) {
+pcie_cap_exit(pci_dev);
+igb_cleanup_msix(s);
+msi_uninit(pci_dev);
+return;
+}
 
 pcie_sriov_pf_init_vf_bar(pci_dev, IGBVF_MMIO_BAR_IDX,
 PCI_BASE_ADDRESS_MEM_TYPE_64 | PCI_BASE_ADDRESS_MEM_PREFETCH,
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index c6d4f61a47f9..

[PATCH for-9.2 v15 08/11] pcie_sriov: Remove num_vfs from PCIESriovPF

2024-08-22 Thread Akihiko Odaki

num_vfs is not migrated so use PCI_SRIOV_CTRL_VFE and PCI_SRIOV_NUM_VF
instead.

Signed-off-by: Akihiko Odaki 
---
 include/hw/pci/pcie_sriov.h |  1 -
 hw/pci/pcie_sriov.c | 38 +++---
 hw/pci/trace-events |  2 +-
 3 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
index 70649236c18a..5148c5b77dd1 100644
--- a/include/hw/pci/pcie_sriov.h
+++ b/include/hw/pci/pcie_sriov.h
@@ -16,7 +16,6 @@
 #include "hw/pci/pci.h"
 
 typedef struct PCIESriovPF {
-uint16_t num_vfs;   /* Number of virtual functions created */
 uint8_t vf_bar_type[PCI_NUM_REGIONS];   /* Store type for each VF bar */
 PCIDevice **vf; /* Pointer to an array of num_vfs VF devices */
 } PCIESriovPF;
diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index ac8c4013bc88..47028e150eac 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -45,7 +45,6 @@ bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
 pcie_add_capability(dev, PCI_EXT_CAP_ID_SRIOV, 1,
 offset, PCI_EXT_CAP_SRIOV_SIZEOF);
 dev->exp.sriov_cap = offset;
-dev->exp.sriov_pf.num_vfs = 0;
 dev->exp.sriov_pf.vf = NULL;
 
 pci_set_word(cfg + PCI_SRIOV_VF_OFFSET, vf_offset);
@@ -182,29 +181,28 @@ static void register_vfs(PCIDevice *dev)
 
 assert(sriov_cap > 0);
 num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
-if (num_vfs > pci_get_word(dev->config + sriov_cap + PCI_SRIOV_TOTAL_VF)) {
-return;
-}
 
 trace_sriov_register_vfs(dev->name, PCI_SLOT(dev->devfn),
  PCI_FUNC(dev->devfn), num_vfs);
 for (i = 0; i < num_vfs; i++) {
 pci_set_enabled(dev->exp.sriov_pf.vf[i], true);
 }
-dev->exp.sriov_pf.num_vfs = num_vfs;
+
+pci_set_word(dev->wmask + sriov_cap + PCI_SRIOV_NUM_VF, 0);
 }
 
 static void unregister_vfs(PCIDevice *dev)
 {
-uint16_t num_vfs = dev->exp.sriov_pf.num_vfs;
+uint8_t *cfg = dev->config + dev->exp.sriov_cap;
 uint16_t i;
 
 trace_sriov_unregister_vfs(dev->name, PCI_SLOT(dev->devfn),
-   PCI_FUNC(dev->devfn), num_vfs);
-for (i = 0; i < num_vfs; i++) {
+   PCI_FUNC(dev->devfn));
+for (i = 0; i < pci_get_word(cfg + PCI_SRIOV_TOTAL_VF); i++) {
 pci_set_enabled(dev->exp.sriov_pf.vf[i], false);
 }
-dev->exp.sriov_pf.num_vfs = 0;
+
+pci_set_word(dev->wmask + dev->exp.sriov_cap + PCI_SRIOV_NUM_VF, 0xffff);
 }
 
 void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
@@ -230,6 +228,17 @@ void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
 } else {
 unregister_vfs(dev);
 }
+} else if (range_covers_byte(off, len, PCI_SRIOV_NUM_VF)) {
+uint8_t *cfg = dev->config + sriov_cap;
+uint8_t *wmask = dev->wmask + sriov_cap;
+uint16_t num_vfs = pci_get_word(cfg + PCI_SRIOV_NUM_VF);
+uint16_t wmask_val = PCI_SRIOV_CTRL_MSE | PCI_SRIOV_CTRL_ARI;
+
+if (num_vfs <= pci_get_word(cfg + PCI_SRIOV_TOTAL_VF)) {
+wmask_val |= PCI_SRIOV_CTRL_VFE;
+}
+
+pci_set_word(wmask + PCI_SRIOV_CTRL, wmask_val);
 }
 }
 
@@ -246,6 +255,8 @@ void pcie_sriov_pf_reset(PCIDevice *dev)
 unregister_vfs(dev);
 
 pci_set_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF, 0);
+pci_set_word(dev->wmask + sriov_cap + PCI_SRIOV_CTRL,
+ PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE | PCI_SRIOV_CTRL_ARI);
 
 /*
  * Default is to use 4K pages, software can modify it
@@ -292,7 +303,7 @@ PCIDevice *pcie_sriov_get_pf(PCIDevice *dev)
 PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n)
 {
 assert(!pci_is_vf(dev));
-if (n < dev->exp.sriov_pf.num_vfs) {
+if (n < pcie_sriov_num_vfs(dev)) {
 return dev->exp.sriov_pf.vf[n];
 }
 return NULL;
@@ -300,5 +311,10 @@ PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n)
 
 uint16_t pcie_sriov_num_vfs(PCIDevice *dev)
 {
-return dev->exp.sriov_pf.num_vfs;
+uint16_t sriov_cap = dev->exp.sriov_cap;
+uint8_t *cfg = dev->config + sriov_cap;
+
+return sriov_cap &&
+   (pci_get_word(cfg + PCI_SRIOV_CTRL) & PCI_SRIOV_CTRL_VFE) ?
+   pci_get_word(cfg + PCI_SRIOV_NUM_VF) : 0;
 }
diff --git a/hw/pci/trace-events b/hw/pci/trace-events
index 19643aa8c6b0..e98f575a9d19 100644
--- a/hw/pci/trace-events
+++ b/hw/pci/trace-events
@@ -14,7 +14,7 @@ msix_write_config(char *name, bool enabled, bool masked) "dev %s enabled %d mask
 
 # hw/pci/pcie_sriov.c
 sriov_register_vfs(const char *name, int slot, int function, int num_vfs) "%s 
%02x:%x: creating %d vf devs"
-sriov_unregister_vfs(const char *name, int slot, int function, int num_vfs) 
"%s %02x:%x

[PATCH for-9.2 v15 11/11] hw/qdev: Remove opts member

2024-08-22 Thread Akihiko Odaki

It is no longer used.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Markus Armbruster 
---
 include/hw/qdev-core.h |  4 
 hw/core/qdev.c |  1 -
 system/qdev-monitor.c  | 12 +++-
 3 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 77bfcbdf732a..a3757e6769f8 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -237,10 +237,6 @@ struct DeviceState {
  * @pending_deleted_expires_ms: optional timeout for deletion events
  */
 int64_t pending_deleted_expires_ms;
-/**
- * @opts: QDict of options for the device
- */
-QDict *opts;
 /**
  * @hotplugged: was device added after PHASE_MACHINE_READY?
  */
diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index f3a996f57dee..2fc84699d432 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -706,7 +706,6 @@ static void device_finalize(Object *obj)
 dev->canonical_path = NULL;
 }
 
-qobject_unref(dev->opts);
 g_free(dev->id);
 }
 
diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
index 6af6ef7d667f..3551989d5153 100644
--- a/system/qdev-monitor.c
+++ b/system/qdev-monitor.c
@@ -624,6 +624,7 @@ DeviceState *qdev_device_add_from_qdict(const QDict *opts,
 char *id;
 DeviceState *dev = NULL;
 BusState *bus = NULL;
+QDict *properties;
 
 driver = qdict_get_try_str(opts, "driver");
 if (!driver) {
@@ -705,13 +706,14 @@ DeviceState *qdev_device_add_from_qdict(const QDict *opts,
 }
 
 /* set properties */
-dev->opts = qdict_clone_shallow(opts);
-qdict_del(dev->opts, "driver");
-qdict_del(dev->opts, "bus");
-qdict_del(dev->opts, "id");
+properties = qdict_clone_shallow(opts);
+qdict_del(properties, "driver");
+qdict_del(properties, "bus");
+qdict_del(properties, "id");
 
-object_set_properties_from_keyval(&dev->parent_obj, dev->opts, from_json,
+object_set_properties_from_keyval(&dev->parent_obj, properties, from_json,
   errp);
+qobject_unref(properties);
 if (*errp) {
 goto err_del_dev;
 }

-- 
2.46.0

[PATCH for-9.2 v15 02/11] hw/ppc/spapr_pci: Do not create DT for disabled PCI device

2024-08-22 Thread Akihiko Odaki

Disabled means it is a disabled SR-IOV VF or it is powered off, and
hidden from the guest.

Signed-off-by: Akihiko Odaki 
---
 hw/ppc/spapr_pci.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 7cf9904c3546..f63182a03c41 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1296,6 +1296,10 @@ static void spapr_dt_pci_device_cb(PCIBus *bus, PCIDevice *pdev,
 return;
 }
 
+if (!pdev->enabled) {
+return;
+}
+
 err = spapr_dt_pci_device(p->sphb, pdev, p->fdt, p->offset);
 if (err < 0) {
 p->err = err;

-- 
2.46.0

[PATCH for-9.2 v15 00/11] hw/pci: SR-IOV related fixes and improvements

2024-08-22 Thread Akihiko Odaki

Supersedes: <20240714-rombar-v2-0-af1504ef5...@daynix.com>
("[PATCH v2 0/4] hw/pci: Convert rom_bar into OnOffAuto")

I submitted a RFC series[1] to add support for SR-IOV emulation to
virtio-net-pci. During the development of the series, I fixed some
trivial bugs and made improvements that I think are independently
useful. This series extracts those fixes and improvements from the RFC
series.

[1]: https://patchew.org/QEMU/20231210-sriov-v2-0-b959e8a6d...@daynix.com/

Signed-off-by: Akihiko Odaki 
---
Changes in v15:
- Fixed variable shadowing in patch
  "pcie_sriov: Remove num_vfs from PCIESriovPF"
- Link to v14: 
https://lore.kernel.org/r/20240813-reuse-v14-0-4c15bc6ee...@daynix.com

Changes in v14:
- Dropped patch "pcie_sriov: Ensure VF function number does not
  overflow" as I found the restriction it imposes is unnecessary.
- Link to v13: 
https://lore.kernel.org/r/20240805-reuse-v13-0-aaeaa4d7d...@daynix.com

Changes in v13:
- Added patch "s390x/pci: Check for multifunction after device
  realization". I found SR-IOV devices, which are multifunction devices,
  are not supposed to work at all with s390x on QEMU.
- Link to v12: 
https://lore.kernel.org/r/20240804-reuse-v12-0-d3930c411...@daynix.com

Changes in v12:
- Changed to ignore invalid PCI_SRIOV_NUM_VF writes as done for
  PCI_SRIOV_CTRL_VFE.
- Updated the message for patch "hw/pci: Use -1 as the default value for
  rombar". (Markus Armbruster)
- Link to v11: 
https://lore.kernel.org/r/20240802-reuse-v11-0-fb83bb8c1...@daynix.com

Changes in v11:
- Rebased.
- Dropped patch "hw/pci: Convert rom_bar into OnOffAuto".
- Added patch "hw/pci: Use -1 as the default value for rombar".
- Added for-9.2 to give a testing period for possible breakage with
  libvirt/s390x.
- Link to v10: 
https://lore.kernel.org/r/20240627-reuse-v10-0-7ca0b8ed3...@daynix.com

Changes in v10:
- Added patch "hw/ppc/spapr_pci: Do not reject VFs created after a PF".
- Added patch "hw/ppc/spapr_pci: Do not create DT for disabled PCI device".
- Added patch "hw/pci: Convert rom_bar into OnOffAuto".
- Dropped patch "hw/pci: Determine if rombar is explicitly enabled".
- Dropped patch "hw/qdev: Remove opts member".
- Link to v9: 
https://lore.kernel.org/r/20240315-reuse-v9-0-67aa69af4...@daynix.com

Changes in v9:
- Rebased.
- Restored '#include "qapi/error.h"' (Michael S. Tsirkin)
- Added patch "pcie_sriov: Ensure VF function number does not overflow"
  to fix abortion with wrong PF addr.
- Link to v8: 
https://lore.kernel.org/r/20240228-reuse-v8-0-282660281...@daynix.com

Changes in v8:
- Clarified that "hw/pci: Replace -1 with UINT32_MAX for romsize" is
  not a bug fix. (Markus Armbruster)
- Squashed patch "vfio: Avoid inspecting option QDict for rombar" into
  "hw/pci: Determine if rombar is explicitly enabled".
  (Markus Armbruster)
- Noted the minor semantics change for patch "hw/pci: Determine if
  rombar is explicitly enabled". (Markus Armbruster)
- Link to v7: 
https://lore.kernel.org/r/20240224-reuse-v7-0-29c14bcb9...@daynix.com

Changes in v7:
- Replaced -1 with UINT32_MAX when expressing uint32_t.
  (Markus Armbruster)
- Added patch "hw/pci: Replace -1 with UINT32_MAX for romsize".
- Link to v6: 
https://lore.kernel.org/r/20240220-reuse-v6-0-2e42a28b0...@daynix.com

Changes in v6:
- Fixed migration.
- Added patch "pcie_sriov: Do not manually unrealize".
- Restored patch "pcie_sriov: Release VFs failed to realize" that was
  missed in v5.
- Link to v5: 
https://lore.kernel.org/r/20240218-reuse-v5-0-e4fc1c19b...@daynix.com

Changes in v5:
- Added patch "hw/pci: Always call pcie_sriov_pf_reset()".
- Added patch "pcie_sriov: Reset SR-IOV extended capability".
- Removed a reference to PCI_SRIOV_CTRL_VFE in hw/nvme.
  (Michael S. Tsirkin)
- Noted the impact on the guest of patch "pcie_sriov: Do not reset
  NumVFs after unregistering VFs". (Michael S. Tsirkin)
- Changed to use pcie_sriov_num_vfs().
- Restored pci_set_power() and changed it to call pci_set_enabled() only
  for PFs with an explanation. (Michael S. Tsirkin)
- Reordered patches.
- Link to v4: 
https://lore.kernel.org/r/20240214-reuse-v4-0-89ad093a0...@daynix.com

Changes in v4:
- Reverted the change to pci_rom_bar_explicitly_enabled().
  (Michael S. Tsirkin)
- Added patch "pcie_sriov: Do not reset NumVFs after unregistering VFs".
- Added patch "hw/nvme: Refer to dev->exp.sriov_pf.num_vfs".
- Link to v3: 
https://lore.kernel.org/r/20240212-reuse-v3-0-8017b689c...@daynix.com

Changes in v3:
- Extracted patch "hw/pci: Use -1 as a default value for rombar" from
  patch "hw/pci: Determine if rombar is explicitly enabled"
  (Philippe Mathieu-Daudé)
- Added an audit result of PCIDevice::rom_bar to the message of patch
  "

[PATCH for-9.2 v15 09/11] pcie_sriov: Register VFs after migration

2024-08-22 Thread Akihiko Odaki

pcie_sriov doesn't have code to restore its state after migration, but
igb, which uses pcie_sriov, naively claimed its migration capability.

Add code to register VFs after migration and fix igb migration.

Fixes: 3a977deebe6b ("Intrdocue igb device emulation")
Signed-off-by: Akihiko Odaki 
---
 include/hw/pci/pcie_sriov.h | 2 ++
 hw/pci/pci.c| 7 +++
 hw/pci/pcie_sriov.c | 7 +++
 3 files changed, 16 insertions(+)

diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
index 5148c5b77dd1..c5d2d318d330 100644
--- a/include/hw/pci/pcie_sriov.h
+++ b/include/hw/pci/pcie_sriov.h
@@ -57,6 +57,8 @@ void pcie_sriov_pf_add_sup_pgsize(PCIDevice *dev, uint16_t opt_sup_pgsize);
 void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
                              uint32_t val, int len);
 
+void pcie_sriov_pf_post_load(PCIDevice *dev);
+
 /* Reset SR/IOV */
 void pcie_sriov_pf_reset(PCIDevice *dev);
 
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 5c0050e1786a..4c7be5295110 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -733,10 +733,17 @@ static bool migrate_is_not_pcie(void *opaque, int version_id)
     return !pci_is_express((PCIDevice *)opaque);
 }
 
+static int pci_post_load(void *opaque, int version_id)
+{
+    pcie_sriov_pf_post_load(opaque);
+    return 0;
+}
+
 const VMStateDescription vmstate_pci_device = {
     .name = "PCIDevice",
     .version_id = 2,
     .minimum_version_id = 1,
+    .post_load = pci_post_load,
     .fields = (const VMStateField[]) {
         VMSTATE_INT32_POSITIVE_LE(version_id, PCIDevice),
         VMSTATE_BUFFER_UNSAFE_INFO_TEST(config, PCIDevice,
diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index 47028e150eac..a1cb1214af27 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -242,6 +242,13 @@ void pcie_sriov_config_write(PCIDevice *dev, uint32_t address,
     }
 }
 
+void pcie_sriov_pf_post_load(PCIDevice *dev)
+{
+    if (dev->exp.sriov_cap) {
+        register_vfs(dev);
+    }
+}
+
 
 /* Reset SR/IOV */
 void pcie_sriov_pf_reset(PCIDevice *dev)

-- 
2.46.0

[PATCH for-9.2 v15 01/11] hw/pci: Rename has_power to enabled

2024-08-22 Thread Akihiko Odaki

The renamed state will represent not only the power state of PFs but
also, in the future, SR-IOV VF enablement.

Signed-off-by: Akihiko Odaki 
---
 include/hw/pci/pci.h        |  7 ++++++-
 include/hw/pci/pci_device.h |  2 +-
 hw/pci/pci.c                | 14 +++++++-------
 hw/pci/pci_host.c           |  4 ++--
 4 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index eb26cac81098..fe04b4fafd04 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -678,6 +678,11 @@ static inline void pci_irq_pulse(PCIDevice *pci_dev)
 }
 
 MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
-void pci_set_power(PCIDevice *pci_dev, bool state);
+void pci_set_enabled(PCIDevice *pci_dev, bool state);
+
+static inline void pci_set_power(PCIDevice *pci_dev, bool state)
+{
+    pci_set_enabled(pci_dev, state);
+}
 
 #endif
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index 15694f248948..f38fb3111954 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -57,7 +57,7 @@ typedef struct PCIReqIDCache PCIReqIDCache;
 struct PCIDevice {
     DeviceState qdev;
     bool partially_hotplugged;
-    bool has_power;
+    bool enabled;
 
     /* PCI config space */
     uint8_t *config;
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index fab86d056721..b532888e8f6c 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1525,7 +1525,7 @@ static void pci_update_mappings(PCIDevice *d)
             continue;
 
         new_addr = pci_bar_address(d, i, r->type, r->size);
-        if (!d->has_power) {
+        if (!d->enabled) {
             new_addr = PCI_BAR_UNMAPPED;
         }
 
@@ -1613,7 +1613,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val_in, int
         pci_update_irq_disabled(d, was_irq_disabled);
         memory_region_set_enabled(&d->bus_master_enable_region,
                                   (pci_get_word(d->config + PCI_COMMAND)
-                                   & PCI_COMMAND_MASTER) && d->has_power);
+                                   & PCI_COMMAND_MASTER) && d->enabled);
     }
 
     msi_write_config(d, addr, val_in, l);
@@ -2884,18 +2884,18 @@ MSIMessage pci_get_msi_message(PCIDevice *dev, int vector)
     return msg;
 }
 
-void pci_set_power(PCIDevice *d, bool state)
+void pci_set_enabled(PCIDevice *d, bool state)
 {
-    if (d->has_power == state) {
+    if (d->enabled == state) {
         return;
     }
 
-    d->has_power = state;
+    d->enabled = state;
     pci_update_mappings(d);
     memory_region_set_enabled(&d->bus_master_enable_region,
                               (pci_get_word(d->config + PCI_COMMAND)
-                               & PCI_COMMAND_MASTER) && d->has_power);
-    if (!d->has_power) {
+                               & PCI_COMMAND_MASTER) && d->enabled);
+    if (!d->enabled) {
         pci_device_reset(d);
     }
 }
diff --git a/hw/pci/pci_host.c b/hw/pci/pci_host.c
index dfe6fe618401..0d82727cc9dd 100644
--- a/hw/pci/pci_host.c
+++ b/hw/pci/pci_host.c
@@ -86,7 +86,7 @@ void pci_host_config_write_common(PCIDevice *pci_dev, uint32_t addr,
      * allowing direct removal of unexposed functions.
      */
     if ((pci_dev->qdev.hotplugged && !pci_get_function_0(pci_dev)) ||
-        !pci_dev->has_power || is_pci_dev_ejected(pci_dev)) {
+        !pci_dev->enabled || is_pci_dev_ejected(pci_dev)) {
         return;
     }
 
@@ -111,7 +111,7 @@ uint32_t pci_host_config_read_common(PCIDevice *pci_dev, uint32_t addr,
      * allowing direct removal of unexposed functions.
      */
     if ((pci_dev->qdev.hotplugged && !pci_get_function_0(pci_dev)) ||
-        !pci_dev->has_power || is_pci_dev_ejected(pci_dev)) {
+        !pci_dev->enabled || is_pci_dev_ejected(pci_dev)) {
         return ~0x0;
     }
 

-- 
2.46.0

Re: [PATCH] vnc: fix crash when no console attached

2024-08-21 Thread Akihiko Odaki


On 2024/08/20 22:11, marcandre.lur...@redhat.com wrote:

From: Marc-André Lureau 

Since commit e99441a3793b5 ("ui/curses: Do not use console_select()")
qemu_text_console_put_keysym() no longer checks for NULL console
argument, which leads to a later crash:

Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
0x559ee186 in qemu_text_console_handle_keysym (s=0x0, keysym=31) at ../ui/console-vc.c:332
332 } else if (s->echo && (keysym == '\r' || keysym == '\n')) {
(gdb) bt
  #0  0x559ee186 in qemu_text_console_handle_keysym (s=0x0, keysym=31) at ../ui/console-vc.c:332
  #1  0x559e18e5 in qemu_text_console_put_keysym (s=<optimized out>, keysym=<optimized out>) at ../ui/console.c:303
  #2  0x559f2e88 in do_key_event (vs=vs@entry=0x579045c0, down=down@entry=1, keycode=keycode@entry=60, sym=sym@entry=65471) at ../ui/vnc.c:2034
  #3  0x559f845c in ext_key_event (vs=0x579045c0, down=1, sym=65471, keycode=<optimized out>) at ../ui/vnc.c:2070
  #4  protocol_client_msg (vs=0x579045c0, data=<optimized out>, len=<optimized out>) at ../ui/vnc.c:2514
  #5  0x559f515c in vnc_client_read (vs=0x579045c0) at ../ui/vnc.c:1607

Fixes: e99441a3793b5 ("ui/curses: Do not use console_select()")
Fixes: https://issues.redhat.com/browse/RHEL-50529
Cc: qemu-sta...@nongnu.org
Signed-off-by: Marc-André Lureau 


Reviewed-by: Akihiko Odaki

Re: [PATCH v2 4/4] virtio-net: Add support for USO features

2024-08-18 Thread Akihiko Odaki


On 2024/08/18 16:03, Michael S. Tsirkin wrote:

On Sun, Aug 18, 2024 at 02:04:29PM +0900, Akihiko Odaki wrote:

On 2024/08/09 21:50, Fabiano Rosas wrote:

Peter Xu  writes:


On Thu, Aug 08, 2024 at 10:47:28AM -0400, Michael S. Tsirkin wrote:

On Thu, Aug 08, 2024 at 10:15:36AM -0400, Peter Xu wrote:

On Thu, Aug 08, 2024 at 07:12:14AM -0400, Michael S. Tsirkin wrote:

This is too big of a hammer. People already use what you call "cross
migrate" and have for years. We are not going to stop developing
features just because someone suddenly became aware of some such bit.
If you care, you will have to work to solve the problem properly -
nacking half baked hacks is the only tool maintainers have to make
people work on hard problems.


IMHO this is totally different thing.  It's not about proposing a new
feature yet so far, it's about how we should fix a breakage first.

And that's why I think we should fix it even in the simple way first, then
we consider anything more beneficial from perf side without breaking
anything, which should be on top of that.

Thanks,


As I said, once the quick hack is merged people stop caring.


IMHO it's not a hack. It's a proper fix to me to disable it by default for
now.

OTOH, having it ON always even knowing it can break migration is a hack to
me, when we don't have anything else to guard the migration.


Mixing different kernel versions in migration is esoteric enough for
this not to matter to most people. There's no rush I think, address
it properly.


Exactly mixing kernel versions will be tricky to users to identify, but
that's, AFAICT, exactly happening everywhere.  We can't urge user to always
use the exact same kernels when we're talking about a VM cluster.  That's
why I think allowing migration to work across those kernels matter.


I also worry a bit about the scenario where the cluster changes slightly
and now all VMs are already restricted by some option that requires the
exact same kernel. Specifically, kernel changes in a cloud environment
also happen due to factors completely unrelated to migration. I'm not
sure the people managing the infra (who care about migration) will be
gating kernel changes just because QEMU has been configured in a
specific manner.


I wrote a bit about my expectations of the platform earlier[1], but let
me summarize them here.

1. I expect the user will not downgrade the platform of hosts after setting
up a VM. This is essential to enable any platform feature.

2. The user is allowed to upgrade the platform of hosts gradually. This
results in a situation with mixed platforms. The oldest platform is still
not older than the platform the VM is set up for. This enables the gradual
deployment strategy.

3. The user is allowed to downgrade the platform of hosts to the version
used when setting up the VM. This enables rollbacks in case of regression.

With these expectations, we can ensure migratability by a) enabling platform
features available on all hosts when setting up the VM and b) saving the
enabled features. This is covered with my
-dump-platform/-merge-platform/-use-platform proposal[2].


I really like [2]. Do you plan to work on it? Does anyone else?


No, but I want to move "[PATCH v3 0/5] virtio-net: Convert feature 
properties to OnOffAuto" forward:

https://patchew.org/QEMU/20240714-auto-v3-0-e27401aab...@daynix.com/

This will clarify the existence of the "auto" semantics, which is to 
enable a platform feature based on availability. [2] will be regarded as 
a feature to improve the handling of the "auto" semantics once this 
change lands.


Regards,
Akihiko Odaki




Regards,
Akihiko Odaki

[1]
https://lore.kernel.org/r/2b62780c-a6cb-4262-beb5-81d54c14f...@daynix.com
[2]
https://lore.kernel.org/all/2da4ebcd-2058-49c3-a4ec-8e60536e5...@daynix.com/

Re: [PATCH v2 4/4] virtio-net: Add support for USO features

2024-08-17 Thread Akihiko Odaki


On 2024/08/09 0:25, Peter Xu wrote:

On Thu, Aug 08, 2024 at 10:47:28AM -0400, Michael S. Tsirkin wrote:

On Thu, Aug 08, 2024 at 10:15:36AM -0400, Peter Xu wrote:

On Thu, Aug 08, 2024 at 07:12:14AM -0400, Michael S. Tsirkin wrote:

This is too big of a hammer. People already use what you call "cross
migrate" and have for years. We are not going to stop developing
features just because someone suddenly became aware of some such bit.
If you care, you will have to work to solve the problem properly -
nacking half baked hacks is the only tool maintainers have to make
people work on hard problems.


IMHO this is totally different thing.  It's not about proposing a new
feature yet so far, it's about how we should fix a breakage first.

And that's why I think we should fix it even in the simple way first, then
we consider anything more beneficial from perf side without breaking
anything, which should be on top of that.

Thanks,


As I said, once the quick hack is merged people stop caring.


IMHO it's not a hack. It's a proper fix to me to disable it by default for
now.

OTOH, having it ON always even knowing it can break migration is a hack to
me, when we don't have anything else to guard the migration.


I think neither of them is a hack; they just deal with different 
scenarios summarized in [1]. We need to apply a solution appropriate for 
each scenario, or we will end up with a broken system.


Regards,
Akihiko Odaki

[1] 
https://lore.kernel.org/r/770300ac-7ed3-4aba-addb-b3f987cc6...@daynix.com/

Re: [PATCH v2 4/4] virtio-net: Add support for USO features

2024-08-17 Thread Akihiko Odaki


On 2024/08/09 21:50, Fabiano Rosas wrote:

Peter Xu  writes:


On Thu, Aug 08, 2024 at 10:47:28AM -0400, Michael S. Tsirkin wrote:

On Thu, Aug 08, 2024 at 10:15:36AM -0400, Peter Xu wrote:

On Thu, Aug 08, 2024 at 07:12:14AM -0400, Michael S. Tsirkin wrote:

This is too big of a hammer. People already use what you call "cross
migrate" and have for years. We are not going to stop developing
features just because someone suddenly became aware of some such bit.
If you care, you will have to work to solve the problem properly -
nacking half baked hacks is the only tool maintainers have to make
people work on hard problems.


IMHO this is totally different thing.  It's not about proposing a new
feature yet so far, it's about how we should fix a breakage first.

And that's why I think we should fix it even in the simple way first, then
we consider anything more beneficial from perf side without breaking
anything, which should be on top of that.

Thanks,


As I said, once the quick hack is merged people stop caring.


IMHO it's not a hack. It's a proper fix to me to disable it by default for
now.

OTOH, having it ON always even knowing it can break migration is a hack to
me, when we don't have anything else to guard the migration.


Mixing different kernel versions in migration is esoteric enough for
this not to matter to most people. There's no rush I think, address
it properly.


Exactly mixing kernel versions will be tricky to users to identify, but
that's, AFAICT, exactly happening everywhere.  We can't urge user to always
use the exact same kernels when we're talking about a VM cluster.  That's
why I think allowing migration to work across those kernels matter.


I also worry a bit about the scenario where the cluster changes slightly
and now all VMs are already restricted by some option that requires the
exact same kernel. Specifically, kernel changes in a cloud environment
also happen due to factors completely unrelated to migration. I'm not
sure the people managing the infra (who care about migration) will be
gating kernel changes just because QEMU has been configured in a
specific manner.


I wrote a bit about my expectations of the platform earlier[1], but 
let me summarize them here.


1. I expect the user will not downgrade the platform of hosts after 
setting up a VM. This is essential to enable any platform feature.


2. The user is allowed to upgrade the platform of hosts gradually. This 
results in a situation with mixed platforms. The oldest platform is 
still not older than the platform the VM is set up for. This enables the 
gradual deployment strategy.


3. The user is allowed to downgrade the platform of hosts to the version 
used when setting up the VM. This enables rollbacks in case of regression.


With these expectations, we can ensure migratability by a) enabling 
platform features available on all hosts when setting up the VM and b) 
saving the enabled features. This is covered with my 
-dump-platform/-merge-platform/-use-platform proposal[2].


Regards,
Akihiko Odaki

[1] 
https://lore.kernel.org/r/2b62780c-a6cb-4262-beb5-81d54c14f...@daynix.com
[2] 
https://lore.kernel.org/all/2da4ebcd-2058-49c3-a4ec-8e60536e5...@daynix.com/

[PATCH] net: Check if nc is NULL in qemu_get_vnet_hdr_len()

2024-08-17 Thread Akihiko Odaki

A netdev may not have a peer specified, resulting in NULL. We should
make it behave like /dev/null in such a case instead of letting it
cause a segmentation fault.

Fixes: 4b52d63249a5 ("tap: Remove qemu_using_vnet_hdr()")
Reported-by: Jonathan Cameron 
Signed-off-by: Akihiko Odaki 
---
 net/net.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/net.c b/net/net.c
index 6938da05e077..4c21d91f9450 100644
--- a/net/net.c
+++ b/net/net.c
@@ -542,6 +542,10 @@ void qemu_set_offload(NetClientState *nc, int csum, int tso4, int tso6,
 
 int qemu_get_vnet_hdr_len(NetClientState *nc)
 {
+    if (!nc) {
+        return 0;
+    }
+
     return nc->vnet_hdr_len;
 }
 

---
base-commit: 31669121a01a14732f57c49400bc239cf9fd505f
change-id: 20240817-net-dc461895a295

Best regards,
-- 
Akihiko Odaki

[PATCH v4] meson: Use -fno-sanitize=function when available

2024-08-16 Thread Akihiko Odaki

Commit 23ef50ae2d0c (".gitlab-ci.d/buildtest.yml: Use
-fno-sanitize=function in the clang-system job") adds
-fno-sanitize=function for the CI but does not add the flag in other
contexts. Add it to meson.build to cover them. It is not removed from
.gitlab-ci.d/buildtest.yml because -fno-sanitize=function in meson.build
does not affect --extra-cflags due to argument ordering.

Signed-off-by: Akihiko Odaki 
---
Changes in v4:
- Moved -fno-sanitize=function immediately after -fsanitize=undefined
- Link to v3: 
https://lore.kernel.org/r/20240816-function-v3-1-32ff225e5...@daynix.com

Changes in v3:
- I was not properly dropping the change of .gitlab-ci.d/buildtest.yml
  but only updated the message. v3 fixes this. (Thomas Huth)
- Link to v2: 
https://lore.kernel.org/r/20240729-function-v2-1-2401ab18b...@daynix.com

Changes in v2:
- Dropped the change of: .gitlab-ci.d/buildtest.yml
- Link to v1: 
https://lore.kernel.org/r/20240714-function-v1-1-cc2acb417...@daynix.com
---
 meson.build | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/meson.build b/meson.build
index 5613b62a4f42..b7e50358a88a 100644
--- a/meson.build
+++ b/meson.build
@@ -483,8 +483,12 @@ if get_option('sanitizers')
   # Detect static linking issue with ubsan - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84285
   if cc.links('int main(int argc, char **argv) { return argc + 1; }',
               args: [qemu_ldflags, '-fsanitize=undefined'])
-    qemu_cflags = ['-fsanitize=undefined'] + qemu_cflags
-    qemu_ldflags = ['-fsanitize=undefined'] + qemu_ldflags
+    qemu_cflags = ['-fsanitize=undefined'] + \
+                  cc.get_supported_arguments('-fno-sanitize=function') + \
+                  qemu_cflags
+    qemu_ldflags = ['-fsanitize=undefined'] + \
+                   cc.get_supported_arguments('-fno-sanitize=function') + \
+                   qemu_ldflags
   endif
 endif
 

---
base-commit: 93b799fafd9170da3a79a533ea6f73a18de82e22
change-id: 20240714-function-7d32c723abbc

Best regards,
-- 
Akihiko Odaki

Re: [PATCH v3] meson: Use -fno-sanitize=function when available

2024-08-16 Thread Akihiko Odaki





On 2024/08/16 17:46, Richard Henderson wrote:

On 8/16/24 18:27, Akihiko Odaki wrote:

On 2024/08/16 17:24, Thomas Huth wrote:

On 16/08/2024 10.21, Akihiko Odaki wrote:

On 2024/08/16 17:03, Thomas Huth wrote:

On 16/08/2024 09.30, Akihiko Odaki wrote:

On 2024/08/16 16:27, Thomas Huth wrote:

On 16/08/2024 09.12, Akihiko Odaki wrote:

On 2024/08/16 16:03, Thomas Huth wrote:

On 16/08/2024 08.22, Akihiko Odaki wrote:

Commit 23ef50ae2d0c (".gitlab-ci.d/buildtest.yml: Use
-fno-sanitize=function in the clang-system job") adds
-fno-sanitize=function for the CI but doesn't add the flag in the
other context. Add it to meson.build for such. It is not 
removed from
.gitlab-ci.d/buildtest.yml because -fno-sanitize=function in 
meson.build

does not affect --extra-cflags due to argument ordering.

Signed-off-by: Akihiko Odaki 
---
Changes in v3:
- I was not properly dropping the change of 
.gitlab-ci.d/buildtest.yml

   but only updated the message. v3 fixes this. (Thomas Huth)
- Link to v2: 
https://lore.kernel.org/r/20240729-function-v2-1-2401ab18b...@daynix.com


Changes in v2:
- Dropped the change of: .gitlab-ci.d/buildtest.yml
- Link to v1: 
https://lore.kernel.org/r/20240714-function-v1-1-cc2acb417...@daynix.com

---
  meson.build | 1 +
  1 file changed, 1 insertion(+)

diff --git a/meson.build b/meson.build
index 5613b62a4f42..a4169c572ba9 100644
--- a/meson.build
+++ b/meson.build
@@ -609,6 +609,7 @@ if host_os != 'openbsd' and \
  endif
  qemu_common_flags += 
cc.get_supported_arguments(hardening_flags)
+qemu_common_flags += 
cc.get_supported_arguments('-fno-sanitize=function')


As I mentioned in my last mail: I think it would make sense to 
move this at the end of the "if get_option('tsan')" block in 
meson.build, since this apparently only fixes the use of 
"--enable-sanitizers", and cannot fix the "--extra-cflags" that 
a user might have specified?


Sorry, I missed it. It cannot fix --extra-cflags, but it should 
be able to fix compiler flags specified by compiler distributor.


Oh, you mean that there are distros that enable 
-fsanitize=function by default? Can you name one? If so, I think 
that information should go into the patch description...?


No, it is just a precaution.


Ok. I don't think any normal distro will enable this by default 
since this impacts performance of the programs, so it's either the 
user specifying --enable-sanitizers or the user specifying 
--extra-cflags="-fsanitize=...". In the latter case, your patch 
does not help. In the former case, I think this setting should go 
into the same code block as where we set -fsanitize=undefined in 
our meson.build file, so that it is clear where it belongs to.


It does not look like -fno-sanitize=function belongs to the code 
block to me. Putting -fno-sanitize=function in the code block will 
make it seem to say that we should disable function sanitizer 
because the user requests to enable sanitizers, which makes little 
sense.


As far as I understood, -fsanitize=undefined turns on 
-fsanitize=function, too, or did I get that wrong?
If not, how did you run into this problem? How did you enable the 
function sanitizer if not using --enable-sanitizers ?


The point is we don't care who enables sanitizers, and unconditionally 
setting -fno-sanitize=function will clarify that.




Argument ordering is important.  You cannot just drop this in the middle 
of meson.build and expect anything reasonable to happen.


That is a good point. We should add -fno-sanitize=function immediately 
after -fsanitize=undefined; I will submit v4 with that change.


Regards,
Akihiko Odaki

Re: [PATCH v3] meson: Use -fno-sanitize=function when available

2024-08-16 Thread Akihiko Odaki


On 2024/08/16 17:24, Thomas Huth wrote:

On 16/08/2024 10.21, Akihiko Odaki wrote:

On 2024/08/16 17:03, Thomas Huth wrote:

On 16/08/2024 09.30, Akihiko Odaki wrote:

On 2024/08/16 16:27, Thomas Huth wrote:

On 16/08/2024 09.12, Akihiko Odaki wrote:

On 2024/08/16 16:03, Thomas Huth wrote:

On 16/08/2024 08.22, Akihiko Odaki wrote:

Commit 23ef50ae2d0c (".gitlab-ci.d/buildtest.yml: Use
-fno-sanitize=function in the clang-system job") adds
-fno-sanitize=function for the CI but doesn't add the flag in the
other context. Add it to meson.build for such. It is not removed 
from
.gitlab-ci.d/buildtest.yml because -fno-sanitize=function in 
meson.build

does not affect --extra-cflags due to argument ordering.

Signed-off-by: Akihiko Odaki 
---
Changes in v3:
- I was not properly dropping the change of 
.gitlab-ci.d/buildtest.yml

   but only updated the message. v3 fixes this. (Thomas Huth)
- Link to v2: 
https://lore.kernel.org/r/20240729-function-v2-1-2401ab18b...@daynix.com


Changes in v2:
- Dropped the change of: .gitlab-ci.d/buildtest.yml
- Link to v1: 
https://lore.kernel.org/r/20240714-function-v1-1-cc2acb417...@daynix.com

---
  meson.build | 1 +
  1 file changed, 1 insertion(+)

diff --git a/meson.build b/meson.build
index 5613b62a4f42..a4169c572ba9 100644
--- a/meson.build
+++ b/meson.build
@@ -609,6 +609,7 @@ if host_os != 'openbsd' and \
  endif
  qemu_common_flags += cc.get_supported_arguments(hardening_flags)
+qemu_common_flags += 
cc.get_supported_arguments('-fno-sanitize=function')


As I mentioned in my last mail: I think it would make sense to 
move this at the end of the "if get_option('tsan')" block in 
meson.build, since this apparently only fixes the use of 
"--enable-sanitizers", and cannot fix the "--extra-cflags" that a 
user might have specified?


Sorry, I missed it. It cannot fix --extra-cflags, but it should be 
able to fix compiler flags specified by compiler distributor.


Oh, you mean that there are distros that enable -fsanitize=function 
by default? Can you name one? If so, I think that information 
should go into the patch description...?


No, it is just a precaution.


Ok. I don't think any normal distro will enable this by default since 
this impacts performance of the programs, so it's either the user 
specifying --enable-sanitizers or the user specifying 
--extra-cflags="-fsanitize=...". In the latter case, your patch does 
not help. In the former case, I think this setting should go into the 
same code block as where we set -fsanitize=undefined in our 
meson.build file, so that it is clear where it belongs to.


It does not look like -fno-sanitize=function belongs to the code block 
to me. Putting -fno-sanitize=function in the code block will make it 
seem to say that we should disable function sanitizer because the user 
requests to enable sanitizers, which makes little sense.


As far as I understood, -fsanitize=undefined turns on 
-fsanitize=function, too, or did I get that wrong?
If not, how did you run into this problem? How did you enable the 
function sanitizer if not using --enable-sanitizers ?


The point is we don't care who enables sanitizers, and unconditionally 
setting -fno-sanitize=function will clarify that.

Re: [PATCH v3] meson: Use -fno-sanitize=function when available

2024-08-16 Thread Akihiko Odaki


On 2024/08/16 17:03, Thomas Huth wrote:

On 16/08/2024 09.30, Akihiko Odaki wrote:

On 2024/08/16 16:27, Thomas Huth wrote:

On 16/08/2024 09.12, Akihiko Odaki wrote:

On 2024/08/16 16:03, Thomas Huth wrote:

On 16/08/2024 08.22, Akihiko Odaki wrote:

Commit 23ef50ae2d0c (".gitlab-ci.d/buildtest.yml: Use
-fno-sanitize=function in the clang-system job") adds
-fno-sanitize=function for the CI but doesn't add the flag in the
other context. Add it to meson.build for such. It is not removed from
.gitlab-ci.d/buildtest.yml because -fno-sanitize=function in 
meson.build

does not affect --extra-cflags due to argument ordering.

Signed-off-by: Akihiko Odaki 
---
Changes in v3:
- I was not properly dropping the change of 
.gitlab-ci.d/buildtest.yml

   but only updated the message. v3 fixes this. (Thomas Huth)
- Link to v2: 
https://lore.kernel.org/r/20240729-function-v2-1-2401ab18b...@daynix.com


Changes in v2:
- Dropped the change of: .gitlab-ci.d/buildtest.yml
- Link to v1: 
https://lore.kernel.org/r/20240714-function-v1-1-cc2acb417...@daynix.com

---
  meson.build | 1 +
  1 file changed, 1 insertion(+)

diff --git a/meson.build b/meson.build
index 5613b62a4f42..a4169c572ba9 100644
--- a/meson.build
+++ b/meson.build
@@ -609,6 +609,7 @@ if host_os != 'openbsd' and \
  endif
  qemu_common_flags += cc.get_supported_arguments(hardening_flags)
+qemu_common_flags += 
cc.get_supported_arguments('-fno-sanitize=function')


As I mentioned in my last mail: I think it would make sense to move 
this at the end of the "if get_option('tsan')" block in 
meson.build, since this apparently only fixes the use of 
"--enable-sanitizers", and cannot fix the "--extra-cflags" that a 
user might have specified?


Sorry, I missed it. It cannot fix --extra-cflags, but it should be 
able to fix compiler flags specified by compiler distributor.


Oh, you mean that there are distros that enable -fsanitize=function 
by default? Can you name one? If so, I think that information should 
go into the patch description...?


No, it is just a precaution.


Ok. I don't think any normal distro will enable this by default since 
this impacts performance of the programs, so it's either the user 
specifying --enable-sanitizers or the user specifying 
--extra-cflags="-fsanitize=...". In the latter case, your patch does not 
help. In the former case, I think this setting should go into the same 
code block as where we set -fsanitize=undefined in our meson.build file, 
so that it is clear where it belongs to.


It does not look like -fno-sanitize=function belongs to the code block 
to me. Putting -fno-sanitize=function in the code block will make it 
seem to say that we should disable function sanitizer because the user 
requests to enable sanitizers, which makes little sense.

Re: [PATCH v3] meson: Use -fno-sanitize=function when available

2024-08-16 Thread Akihiko Odaki


On 2024/08/16 16:27, Thomas Huth wrote:

On 16/08/2024 09.12, Akihiko Odaki wrote:

On 2024/08/16 16:03, Thomas Huth wrote:

On 16/08/2024 08.22, Akihiko Odaki wrote:

Commit 23ef50ae2d0c (".gitlab-ci.d/buildtest.yml: Use
-fno-sanitize=function in the clang-system job") adds
-fno-sanitize=function for the CI but doesn't add the flag in the
other context. Add it to meson.build for such. It is not removed from
.gitlab-ci.d/buildtest.yml because -fno-sanitize=function in 
meson.build

does not affect --extra-cflags due to argument ordering.

Signed-off-by: Akihiko Odaki 
---
Changes in v3:
- I was not properly dropping the change of .gitlab-ci.d/buildtest.yml
   but only updated the message. v3 fixes this. (Thomas Huth)
- Link to v2: 
https://lore.kernel.org/r/20240729-function-v2-1-2401ab18b...@daynix.com


Changes in v2:
- Dropped the change of: .gitlab-ci.d/buildtest.yml
- Link to v1: 
https://lore.kernel.org/r/20240714-function-v1-1-cc2acb417...@daynix.com

---
  meson.build | 1 +
  1 file changed, 1 insertion(+)

diff --git a/meson.build b/meson.build
index 5613b62a4f42..a4169c572ba9 100644
--- a/meson.build
+++ b/meson.build
@@ -609,6 +609,7 @@ if host_os != 'openbsd' and \
  endif
  qemu_common_flags += cc.get_supported_arguments(hardening_flags)
+qemu_common_flags += 
cc.get_supported_arguments('-fno-sanitize=function')


As I mentioned in my last mail: I think it would make sense to move 
this at the end of the "if get_option('tsan')" block in meson.build, 
since this apparently only fixes the use of "--enable-sanitizers", 
and cannot fix the "--extra-cflags" that a user might have specified?


Sorry, I missed it. It cannot fix --extra-cflags, but it should be 
able to fix compiler flags specified by compiler distributor.


Oh, you mean that there are distros that enable -fsanitize=function by 
default? Can you name one? If so, I think that information should go 
into the patch description...?


No, it is just a precaution.

Regards,
Akihiko Odaki

Re: [PATCH v3] meson: Use -fno-sanitize=function when available

2024-08-16 Thread Akihiko Odaki


On 2024/08/16 16:03, Thomas Huth wrote:

On 16/08/2024 08.22, Akihiko Odaki wrote:

Commit 23ef50ae2d0c (".gitlab-ci.d/buildtest.yml: Use
-fno-sanitize=function in the clang-system job") adds
-fno-sanitize=function for the CI but doesn't add the flag in the
other context. Add it to meson.build for such. It is not removed from
.gitlab-ci.d/buildtest.yml because -fno-sanitize=function in meson.build
does not affect --extra-cflags due to argument ordering.

Signed-off-by: Akihiko Odaki 
---
Changes in v3:
- I was not properly dropping the change of .gitlab-ci.d/buildtest.yml
   but only updated the message. v3 fixes this. (Thomas Huth)
- Link to v2: 
https://lore.kernel.org/r/20240729-function-v2-1-2401ab18b...@daynix.com


Changes in v2:
- Dropped the change of: .gitlab-ci.d/buildtest.yml
- Link to v1: 
https://lore.kernel.org/r/20240714-function-v1-1-cc2acb417...@daynix.com

---
  meson.build | 1 +
  1 file changed, 1 insertion(+)

diff --git a/meson.build b/meson.build
index 5613b62a4f42..a4169c572ba9 100644
--- a/meson.build
+++ b/meson.build
@@ -609,6 +609,7 @@ if host_os != 'openbsd' and \
  endif
  qemu_common_flags += cc.get_supported_arguments(hardening_flags)
+qemu_common_flags += 
cc.get_supported_arguments('-fno-sanitize=function')


As I mentioned in my last mail: I think it would make sense to move this 
at the end of the "if get_option('tsan')" block in meson.build, since 
this apparently only fixes the use of "--enable-sanitizers", and cannot 
fix the "--extra-cflags" that a user might have specified?


Sorry, I missed it. It cannot fix --extra-cflags, but it should be able 
to fix compiler flags specified by compiler distributor.


Regards,
Akihiko Odaki

[PATCH v3] meson: Use -fno-sanitize=function when available

2024-08-15 Thread Akihiko Odaki

Commit 23ef50ae2d0c (".gitlab-ci.d/buildtest.yml: Use
-fno-sanitize=function in the clang-system job") adds
-fno-sanitize=function for the CI but does not add the flag in other
contexts. Add it to meson.build to cover them. It is not removed from
.gitlab-ci.d/buildtest.yml because -fno-sanitize=function in meson.build
does not affect --extra-cflags due to argument ordering.

Signed-off-by: Akihiko Odaki 
---
Changes in v3:
- I was not properly dropping the change of .gitlab-ci.d/buildtest.yml
  but only updated the message. v3 fixes this. (Thomas Huth)
- Link to v2: 
https://lore.kernel.org/r/20240729-function-v2-1-2401ab18b...@daynix.com

Changes in v2:
- Dropped the change of: .gitlab-ci.d/buildtest.yml
- Link to v1: 
https://lore.kernel.org/r/20240714-function-v1-1-cc2acb417...@daynix.com
---
 meson.build | 1 +
 1 file changed, 1 insertion(+)

diff --git a/meson.build b/meson.build
index 5613b62a4f42..a4169c572ba9 100644
--- a/meson.build
+++ b/meson.build
@@ -609,6 +609,7 @@ if host_os != 'openbsd' and \
 endif
 
 qemu_common_flags += cc.get_supported_arguments(hardening_flags)
+qemu_common_flags += cc.get_supported_arguments('-fno-sanitize=function')
 
 add_global_arguments(qemu_common_flags, native: false, language: all_languages)
 add_global_link_arguments(qemu_ldflags, native: false, language: all_languages)

---
base-commit: 93b799fafd9170da3a79a533ea6f73a18de82e22
change-id: 20240714-function-7d32c723abbc

Best regards,
-- 
Akihiko Odaki

[PATCH v3] contrib/plugins: Add a plugin to generate basic block vectors

2024-08-15 Thread Akihiko Odaki

SimPoint[1] is a widely used tool for finding ideal microarchitecture
simulation points, so Valgrind[2] and Pin[3] support generating basic
block vectors for use with it. Let's add a corresponding plugin to
QEMU too.

Note that this plugin has a different goal from tests/plugin/bb.c.

This plugin creates a vector for each constant interval instead of
counting the execution of basic blocks for the entire run, so it is able
to describe changes in execution behavior. Its output is also
syntactically simple and better suited for parsing, while the output of
tests/plugin/bb.c is more human-readable.
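
The per-interval behavior described above can be sketched in plain C. This is
a minimal illustration under stated assumptions, not the plugin's code: the
IntervalCounter type and interval_account() are hypothetical names, and the
only detail taken from the patch is that the pending count is reduced by
the interval when a vector is emitted.

```c
#include <stdint.h>

/* Hypothetical per-vCPU state for illustration only. */
typedef struct IntervalCounter {
    uint64_t pending;  /* instructions executed since the last vector */
    uint64_t emitted;  /* number of vectors emitted so far */
} IntervalCounter;

/*
 * Account for one executed basic block of n_insns instructions.
 * Returns 1 when an interval boundary was crossed and a vector should be
 * written, 0 otherwise. The excess beyond the interval is carried over.
 */
static int interval_account(IntervalCounter *c, uint64_t interval,
                            uint64_t n_insns)
{
    c->pending += n_insns;
    if (c->pending < interval) {
        return 0;
    }

    c->pending -= interval;
    c->emitted++;
    return 1;
}
```

With interval=100 and blocks of 40 instructions, the third block crosses the
boundary and 20 instructions carry over into the next interval.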

[1] https://cseweb.ucsd.edu/~calder/simpoint/
[2] https://valgrind.org/docs/manual/bbv-manual.html
[3] 
https://www.intel.com/content/www/us/en/developer/articles/tool/pin-a-dynamic-binary-instrumentation-tool.html

Signed-off-by: Yotaro Nada 
Signed-off-by: Akihiko Odaki 
---
Changes in v3:
- Protect the entire operation with bbs in vcpu_tb_trans().
- Reduce memory allocations.
- Link to v2: https://lore.kernel.org/r/20240815-bb-v2-1-6222ee982...@daynix.com

Changes in v2:
- Merged files variable into the global scoreboard.
- Added a lock for bbs.
- Added a summary to contrib/plugins/bbv.c.
- Rebased.
- Link to v1: https://lore.kernel.org/r/20240813-bb-v1-1-effbb77da...@daynix.com
---
 docs/about/emulation.rst |  30 +
 contrib/plugins/bbv.c    | 158 +++
 contrib/plugins/Makefile |   1 +
 3 files changed, 189 insertions(+)

diff --git a/docs/about/emulation.rst b/docs/about/emulation.rst
index c03033e4e956..72d7846ab6f8 100644
--- a/docs/about/emulation.rst
+++ b/docs/about/emulation.rst
@@ -381,6 +381,36 @@ run::
   160  1  0
   135  1  0
 
+Basic Block Vectors
+...................
+
+``contrib/plugins/bbv.c``
+
+The bbv plugin allows you to generate basic block vectors for use with the
+`SimPoint <https://cseweb.ucsd.edu/~calder/simpoint/>`__ analysis tool.
+
+.. list-table:: Basic block vectors arguments
+  :widths: 20 80
+  :header-rows: 1
+
+  * - Option
+    - Description
+  * - interval=N
+    - The interval to generate a basic block vector specified by the number of
+      instructions (Default: N = 1)
+  * - outfile=PATH
+    - The path to output files.
+      It will be suffixed with ``.N.bb`` where ``N`` is a vCPU index.
+
+Example::
+
+  $ qemu-aarch64 \
+    -plugin contrib/plugins/libbbv.so,interval=100,outfile=sha1 \
+    tests/tcg/aarch64-linux-user/sha1
+  SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
+  $ du sha1.0.bb
+  23128   sha1.0.bb
+
 Hot Blocks
 ..........
 
diff --git a/contrib/plugins/bbv.c b/contrib/plugins/bbv.c
new file mode 100644
index ..a5256517dd44
--- /dev/null
+++ b/contrib/plugins/bbv.c
@@ -0,0 +1,158 @@
+/*
+ * Generate basic block vectors for use with the SimPoint analysis tool.
+ * SimPoint: https://cseweb.ucsd.edu/~calder/simpoint/
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include <stdio.h>
+#include <glib.h>
+
+#include <qemu-plugin.h>
+
+typedef struct Bb {
+    uint64_t vaddr;
+    struct qemu_plugin_scoreboard *count;
+    unsigned int index;
+} Bb;
+
+typedef struct Vcpu {
+    uint64_t count;
+    FILE *file;
+} Vcpu;
+
+QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;
+static GHashTable *bbs;
+static GRWLock bbs_lock;
+static char *filename;
+static struct qemu_plugin_scoreboard *vcpus;
+static uint64_t interval = 1;
+
+static void plugin_exit(qemu_plugin_id_t id, void *p)
+{
+    for (int i = 0; i < qemu_plugin_num_vcpus(); i++) {
+        fclose(((Vcpu *)qemu_plugin_scoreboard_find(vcpus, i))->file);
+    }
+
+    g_hash_table_unref(bbs);
+    g_free(filename);
+    qemu_plugin_scoreboard_free(vcpus);
+}
+
+static void free_bb(void *data)
+{
+    qemu_plugin_scoreboard_free(((Bb *)data)->count);
+    g_free(data);
+}
+
+static qemu_plugin_u64 count_u64(void)
+{
+    return qemu_plugin_scoreboard_u64_in_struct(vcpus, Vcpu, count);
+}
+
+static qemu_plugin_u64 bb_count_u64(Bb *bb)
+{
+    return qemu_plugin_scoreboard_u64(bb->count);
+}
+
+static void vcpu_init(qemu_plugin_id_t id, unsigned int vcpu_index)
+{
+    g_autofree gchar *vcpu_filename = NULL;
+    Vcpu *vcpu = qemu_plugin_scoreboard_find(vcpus, vcpu_index);
+
+    vcpu_filename = g_strdup_printf("%s.%u.bb", filename, vcpu_index);
+    vcpu->file = fopen(vcpu_filename, "w");
+}
+
+static void vcpu_interval_exec(unsigned int vcpu_index, void *udata)
+{
+    Vcpu *vcpu = qemu_plugin_scoreboard_find(vcpus, vcpu_index);
+    GHashTableIter iter;
+    void *value;
+
+    if (!vcpu->file) {
+        return;
+    }
+
+    vcpu->count -= interval;
+
+    fputc('T', vcpu->file);
+
+    g_rw_lock_reader_lock(&bbs_lock);
+    g_hash_table_iter_init(&iter, bbs);
+
+    while (g_hash_table_iter_next(&iter, NULL, &value)) {
+        Bb *bb = value;
+        uint64_t bb_count = qemu_plugin_u64_get(bb_count_u64(bb), vcpu_index);
+
+

Re: [PATCH v2] contrib/plugins: Add a plugin to generate basic block vectors

2024-08-15 Thread Akihiko Odaki


On 2024/08/16 14:13, Akihiko Odaki wrote:

On 2024/08/15 14:48, Pierrick Bouvier wrote:

On 8/14/24 20:04, Akihiko Odaki wrote:

SimPoint[1] is a widely used tool for finding ideal microarchitecture
simulation points, so Valgrind[2] and Pin[3] support generating basic
block vectors for use with it. Let's add a corresponding plugin to
QEMU too.

Note that this plugin has a different goal from tests/plugin/bb.c.

This plugin creates a vector for each constant interval instead of
counting the execution of basic blocks for the entire run, so it is able
to describe changes in execution behavior. Its output is also
syntactically simple and better suited for parsing, while the output of
tests/plugin/bb.c is more human-readable.

[1] https://cseweb.ucsd.edu/~calder/simpoint/
[2] https://valgrind.org/docs/manual/bbv-manual.html
[3] 
https://www.intel.com/content/www/us/en/developer/articles/tool/pin-a-dynamic-binary-instrumentation-tool.html


Signed-off-by: Yotaro Nada 
Signed-off-by: Akihiko Odaki 
---
Changes in v2:
- Merged files variable into the global scoreboard.
- Added a lock for bbs.
- Added a summary to contrib/plugins/bbv.c.
- Rebased.
- Link to v1: 
https://lore.kernel.org/r/20240813-bb-v1-1-effbb77da...@daynix.com

---
  docs/about/emulation.rst |  30 +
  contrib/plugins/bbv.c    | 158 +++
  contrib/plugins/Makefile |   1 +
  3 files changed, 189 insertions(+)

diff --git a/docs/about/emulation.rst b/docs/about/emulation.rst
index c03033e4e956..72d7846ab6f8 100644
--- a/docs/about/emulation.rst
+++ b/docs/about/emulation.rst
@@ -381,6 +381,36 @@ run::
    160  1  0
    135  1  0
+Basic Block Vectors
+...................
+
+``contrib/plugins/bbv.c``
+
+The bbv plugin allows you to generate basic block vectors for use with the
+`SimPoint <https://cseweb.ucsd.edu/~calder/simpoint/>`__ analysis tool.
+
+.. list-table:: Basic block vectors arguments
+  :widths: 20 80
+  :header-rows: 1
+
+  * - Option
+    - Description
+  * - interval=N
+    - The interval to generate a basic block vector specified by the number of
+      instructions (Default: N = 1)
+  * - outfile=PATH
+    - The path to output files.
+      It will be suffixed with ``.N.bb`` where ``N`` is a vCPU index.
+
+Example::
+
+  $ qemu-aarch64 \
+    -plugin contrib/plugins/libbbv.so,interval=100,outfile=sha1 \
+    tests/tcg/aarch64-linux-user/sha1
+  SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
+  $ du sha1.0.bb
+  23128   sha1.0.bb
+
  Hot Blocks
  ..........
diff --git a/contrib/plugins/bbv.c b/contrib/plugins/bbv.c
new file mode 100644
index ..41139f423fe2
--- /dev/null
+++ b/contrib/plugins/bbv.c
@@ -0,0 +1,158 @@
+/*
+ * Generate basic block vectors for use with the SimPoint analysis tool.
+ * SimPoint: https://cseweb.ucsd.edu/~calder/simpoint/
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include <stdio.h>
+#include <glib.h>
+
+#include <qemu-plugin.h>
+
+typedef struct Bb {
+    struct qemu_plugin_scoreboard *count;
+    unsigned int index;
+} Bb;
+
+typedef struct Vcpu {
+    uint64_t count;
+    FILE *file;
+} Vcpu;
+
+QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;
+static GHashTable *bbs;
+static GRWLock bbs_lock;
+static char *filename;
+static struct qemu_plugin_scoreboard *vcpus;
+static uint64_t interval = 1;
+
+static void plugin_exit(qemu_plugin_id_t id, void *p)
+{
+    for (int i = 0; i < qemu_plugin_num_vcpus(); i++) {
+        fclose(((Vcpu *)qemu_plugin_scoreboard_find(vcpus, i))->file);
+    }
+
+    g_hash_table_unref(bbs);
+    g_free(filename);
+    qemu_plugin_scoreboard_free(vcpus);
+}
+
+static void free_bb(void *data)
+{
+    qemu_plugin_scoreboard_free(((Bb *)data)->count);
+    g_free(data);
+}
+
+static qemu_plugin_u64 count_u64(void)
+{
+    return qemu_plugin_scoreboard_u64_in_struct(vcpus, Vcpu, count);
+}
+
+static qemu_plugin_u64 bb_count_u64(Bb *bb)
+{
+    return qemu_plugin_scoreboard_u64(bb->count);
+}
+
+static void vcpu_init(qemu_plugin_id_t id, unsigned int vcpu_index)
+{
+    g_autofree gchar *vcpu_filename = NULL;
+    Vcpu *vcpu = qemu_plugin_scoreboard_find(vcpus, vcpu_index);
+
+    vcpu_filename = g_strdup_printf("%s.%u.bb", filename, vcpu_index);
+    vcpu->file = fopen(vcpu_filename, "w");
+}
+
+static void vcpu_interval_exec(unsigned int vcpu_index, void *udata)
+{
+    Vcpu *vcpu = qemu_plugin_scoreboard_find(vcpus, vcpu_index);
+    GHashTableIter iter;
+    void *value;
+
+    if (!vcpu->file) {
+        return;
+    }
+
+    vcpu->count -= interval;
+
+    fputc('T', vcpu->file);
+
+    g_rw_lock_reader_lock(&bbs_lock);
+    g_hash_table_iter_init(&iter, bbs);
+
+    while (g_hash_table_iter_next(&iter, NULL, &value)) {
+        Bb *bb = value;
+        uint64_t bb_count = qemu_plugin_u64_get(bb_count_u64(bb), vcpu_index);
+
+        if (!bb_count) {
+            continue;

Re: [PATCH v2] contrib/plugins: Add a plugin to generate basic block vectors

2024-08-15 Thread Akihiko Odaki


On 2024/08/15 14:48, Pierrick Bouvier wrote:

On 8/14/24 20:04, Akihiko Odaki wrote:

SimPoint[1] is a widely used tool for finding ideal microarchitecture
simulation points, so Valgrind[2] and Pin[3] support generating basic
block vectors for use with it. Let's add a corresponding plugin to
QEMU too.

Note that this plugin has a different goal from tests/plugin/bb.c.

This plugin creates a vector for each constant interval instead of
counting the execution of basic blocks for the entire run, so it is able
to describe changes in execution behavior. Its output is also
syntactically simple and better suited for parsing, while the output of
tests/plugin/bb.c is more human-readable.

[1] https://cseweb.ucsd.edu/~calder/simpoint/
[2] https://valgrind.org/docs/manual/bbv-manual.html
[3] 
https://www.intel.com/content/www/us/en/developer/articles/tool/pin-a-dynamic-binary-instrumentation-tool.html


Signed-off-by: Yotaro Nada 
Signed-off-by: Akihiko Odaki 
---
Changes in v2:
- Merged files variable into the global scoreboard.
- Added a lock for bbs.
- Added a summary to contrib/plugins/bbv.c.
- Rebased.
- Link to v1: 
https://lore.kernel.org/r/20240813-bb-v1-1-effbb77da...@daynix.com

---
  docs/about/emulation.rst |  30 +
  contrib/plugins/bbv.c    | 158 +++
  contrib/plugins/Makefile |   1 +
  3 files changed, 189 insertions(+)

diff --git a/docs/about/emulation.rst b/docs/about/emulation.rst
index c03033e4e956..72d7846ab6f8 100644
--- a/docs/about/emulation.rst
+++ b/docs/about/emulation.rst
@@ -381,6 +381,36 @@ run::
    160  1  0
    135  1  0
+Basic Block Vectors
+...................
+
+``contrib/plugins/bbv.c``
+
+The bbv plugin allows you to generate basic block vectors for use with the
+`SimPoint <https://cseweb.ucsd.edu/~calder/simpoint/>`__ analysis tool.
+
+.. list-table:: Basic block vectors arguments
+  :widths: 20 80
+  :header-rows: 1
+
+  * - Option
+    - Description
+  * - interval=N
+    - The interval to generate a basic block vector specified by the number of
+      instructions (Default: N = 1)
+  * - outfile=PATH
+    - The path to output files.
+      It will be suffixed with ``.N.bb`` where ``N`` is a vCPU index.
+
+Example::
+
+  $ qemu-aarch64 \
+    -plugin contrib/plugins/libbbv.so,interval=100,outfile=sha1 \
+    tests/tcg/aarch64-linux-user/sha1
+  SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
+  $ du sha1.0.bb
+  23128   sha1.0.bb
+
  Hot Blocks
  ..........
diff --git a/contrib/plugins/bbv.c b/contrib/plugins/bbv.c
new file mode 100644
index ..41139f423fe2
--- /dev/null
+++ b/contrib/plugins/bbv.c
@@ -0,0 +1,158 @@
+/*
+ * Generate basic block vectors for use with the SimPoint analysis tool.
+ * SimPoint: https://cseweb.ucsd.edu/~calder/simpoint/
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include <stdio.h>
+#include <glib.h>
+
+#include <qemu-plugin.h>
+
+typedef struct Bb {
+    struct qemu_plugin_scoreboard *count;
+    unsigned int index;
+} Bb;
+
+typedef struct Vcpu {
+    uint64_t count;
+    FILE *file;
+} Vcpu;
+
+QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;
+static GHashTable *bbs;
+static GRWLock bbs_lock;
+static char *filename;
+static struct qemu_plugin_scoreboard *vcpus;
+static uint64_t interval = 1;
+
+static void plugin_exit(qemu_plugin_id_t id, void *p)
+{
+    for (int i = 0; i < qemu_plugin_num_vcpus(); i++) {
+        fclose(((Vcpu *)qemu_plugin_scoreboard_find(vcpus, i))->file);
+    }
+
+    g_hash_table_unref(bbs);
+    g_free(filename);
+    qemu_plugin_scoreboard_free(vcpus);
+}
+
+static void free_bb(void *data)
+{
+    qemu_plugin_scoreboard_free(((Bb *)data)->count);
+    g_free(data);
+}
+
+static qemu_plugin_u64 count_u64(void)
+{
+    return qemu_plugin_scoreboard_u64_in_struct(vcpus, Vcpu, count);
+}
+
+static qemu_plugin_u64 bb_count_u64(Bb *bb)
+{
+    return qemu_plugin_scoreboard_u64(bb->count);
+}
+
+static void vcpu_init(qemu_plugin_id_t id, unsigned int vcpu_index)
+{
+    g_autofree gchar *vcpu_filename = NULL;
+    Vcpu *vcpu = qemu_plugin_scoreboard_find(vcpus, vcpu_index);
+
+    vcpu_filename = g_strdup_printf("%s.%u.bb", filename, vcpu_index);
+    vcpu->file = fopen(vcpu_filename, "w");
+}
+
+static void vcpu_interval_exec(unsigned int vcpu_index, void *udata)
+{
+    Vcpu *vcpu = qemu_plugin_scoreboard_find(vcpus, vcpu_index);
+    GHashTableIter iter;
+    void *value;
+
+    if (!vcpu->file) {
+        return;
+    }
+
+    vcpu->count -= interval;
+
+    fputc('T', vcpu->file);
+
+    g_rw_lock_reader_lock(&bbs_lock);
+    g_hash_table_iter_init(&iter, bbs);
+
+    while (g_hash_table_iter_next(&iter, NULL, &value)) {
+        Bb *bb = value;
+        uint64_t bb_count = qemu_plugin_u64_get(bb_count_u64(bb), vcpu_index);
+
+        if (!bb_count) {
+            continue;
+        }
+
+        fprintf(vcpu->file, ":%u:%"

[PATCH v2] contrib/plugins: Add a plugin to generate basic block vectors

2024-08-14 Thread Akihiko Odaki

SimPoint[1] is a widely used tool for finding ideal microarchitecture
simulation points, so Valgrind[2] and Pin[3] support generating basic
block vectors for use with it. Let's add a corresponding plugin to
QEMU too.

Note that this plugin has a different goal from tests/plugin/bb.c.

This plugin creates a vector for each constant interval instead of
counting the execution of basic blocks for the entire run, so it is able
to describe changes in execution behavior. Its output is also
syntactically simple and better suited for parsing, while the output of
tests/plugin/bb.c is more human-readable.

[1] https://cseweb.ucsd.edu/~calder/simpoint/
[2] https://valgrind.org/docs/manual/bbv-manual.html
[3] 
https://www.intel.com/content/www/us/en/developer/articles/tool/pin-a-dynamic-binary-instrumentation-tool.html

Signed-off-by: Yotaro Nada 
Signed-off-by: Akihiko Odaki 
---
Changes in v2:
- Merged files variable into the global scoreboard.
- Added a lock for bbs.
- Added a summary to contrib/plugins/bbv.c.
- Rebased.
- Link to v1: https://lore.kernel.org/r/20240813-bb-v1-1-effbb77da...@daynix.com
---
 docs/about/emulation.rst |  30 +
 contrib/plugins/bbv.c    | 158 +++
 contrib/plugins/Makefile |   1 +
 3 files changed, 189 insertions(+)

diff --git a/docs/about/emulation.rst b/docs/about/emulation.rst
index c03033e4e956..72d7846ab6f8 100644
--- a/docs/about/emulation.rst
+++ b/docs/about/emulation.rst
@@ -381,6 +381,36 @@ run::
   160  1  0
   135  1  0
 
+Basic Block Vectors
+...................
+
+``contrib/plugins/bbv.c``
+
+The bbv plugin allows you to generate basic block vectors for use with the
+`SimPoint <https://cseweb.ucsd.edu/~calder/simpoint/>`__ analysis tool.
+
+.. list-table:: Basic block vectors arguments
+  :widths: 20 80
+  :header-rows: 1
+
+  * - Option
+    - Description
+  * - interval=N
+    - The interval to generate a basic block vector specified by the number of
+      instructions (Default: N = 1)
+  * - outfile=PATH
+    - The path to output files.
+      It will be suffixed with ``.N.bb`` where ``N`` is a vCPU index.
+
+Example::
+
+  $ qemu-aarch64 \
+    -plugin contrib/plugins/libbbv.so,interval=100,outfile=sha1 \
+    tests/tcg/aarch64-linux-user/sha1
+  SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
+  $ du sha1.0.bb
+  23128   sha1.0.bb
+
 Hot Blocks
 ..........
 
diff --git a/contrib/plugins/bbv.c b/contrib/plugins/bbv.c
new file mode 100644
index ..41139f423fe2
--- /dev/null
+++ b/contrib/plugins/bbv.c
@@ -0,0 +1,158 @@
+/*
+ * Generate basic block vectors for use with the SimPoint analysis tool.
+ * SimPoint: https://cseweb.ucsd.edu/~calder/simpoint/
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include <stdio.h>
+#include <glib.h>
+
+#include <qemu-plugin.h>
+
+typedef struct Bb {
+    struct qemu_plugin_scoreboard *count;
+    unsigned int index;
+} Bb;
+
+typedef struct Vcpu {
+    uint64_t count;
+    FILE *file;
+} Vcpu;
+
+QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;
+static GHashTable *bbs;
+static GRWLock bbs_lock;
+static char *filename;
+static struct qemu_plugin_scoreboard *vcpus;
+static uint64_t interval = 1;
+
+static void plugin_exit(qemu_plugin_id_t id, void *p)
+{
+    for (int i = 0; i < qemu_plugin_num_vcpus(); i++) {
+        fclose(((Vcpu *)qemu_plugin_scoreboard_find(vcpus, i))->file);
+    }
+
+    g_hash_table_unref(bbs);
+    g_free(filename);
+    qemu_plugin_scoreboard_free(vcpus);
+}
+
+static void free_bb(void *data)
+{
+    qemu_plugin_scoreboard_free(((Bb *)data)->count);
+    g_free(data);
+}
+
+static qemu_plugin_u64 count_u64(void)
+{
+    return qemu_plugin_scoreboard_u64_in_struct(vcpus, Vcpu, count);
+}
+
+static qemu_plugin_u64 bb_count_u64(Bb *bb)
+{
+    return qemu_plugin_scoreboard_u64(bb->count);
+}
+
+static void vcpu_init(qemu_plugin_id_t id, unsigned int vcpu_index)
+{
+    g_autofree gchar *vcpu_filename = NULL;
+    Vcpu *vcpu = qemu_plugin_scoreboard_find(vcpus, vcpu_index);
+
+    vcpu_filename = g_strdup_printf("%s.%u.bb", filename, vcpu_index);
+    vcpu->file = fopen(vcpu_filename, "w");
+}
+
+static void vcpu_interval_exec(unsigned int vcpu_index, void *udata)
+{
+    Vcpu *vcpu = qemu_plugin_scoreboard_find(vcpus, vcpu_index);
+    GHashTableIter iter;
+    void *value;
+
+    if (!vcpu->file) {
+        return;
+    }
+
+    vcpu->count -= interval;
+
+    fputc('T', vcpu->file);
+
+    g_rw_lock_reader_lock(&bbs_lock);
+    g_hash_table_iter_init(&iter, bbs);
+
+    while (g_hash_table_iter_next(&iter, NULL, &value)) {
+        Bb *bb = value;
+        uint64_t bb_count = qemu_plugin_u64_get(bb_count_u64(bb), vcpu_index);
+
+        if (!bb_count) {
+            continue;
+        }
+
+        fprintf(vcpu->file, ":%u:%" PRIu64 " ", bb->index, bb_count);
+        qemu_plugin_u64_set(bb_count_u64(bb), vcpu_i

Re: [PATCH] contrib/plugins: Add a plugin to generate basic block vectors

2024-08-13 Thread Akihiko Odaki


On 2024/08/14 14:41, Pierrick Bouvier wrote:

On 8/13/24 21:56, Akihiko Odaki wrote:

On 2024/08/14 4:20, Pierrick Bouvier wrote:

Hi Akihiko, and thanks for contributing this new plugin.


Hi,

Thanks for reviewing



Recently, plugins documentation has been modified, and list of plugins
and their doc is now in "docs/about/emulation.rst". You may want to
rebase on top of master.


I see. I'll rebase and update the documentation with v2.



Globally, I'm ok with this plugin and the implementation. Just a few
fixes are needed for concurrent accesses.

On 8/12/24 23:46, Akihiko Odaki wrote:

SimPoint[1] is a widely used tool for finding ideal microarchitecture
simulation points, so Valgrind[2] and Pin[3] support generating basic
block vectors for use with it. Let's add a corresponding plugin to
QEMU too.

Note that this plugin has a different goal from tests/plugin/bb.c.

This plugin creates a vector for each constant interval instead of
counting the execution of basic blocks for the entire run, so it is able
to describe changes in execution behavior. Its output is also
syntactically simple and better suited for parsing, while the output of
tests/plugin/bb.c is more human-readable.



I think it can be confusing to have two plugins named bb. How about
simpoint, or bbv?


How about renaming tests/tcg/plugins/bb.c to simple-bb.c instead and
keep the name of contrib/plugins/bb.c concise?

tests/tcg/plugins/bb.c is simple and good as a sample, but less useful
in practice than this plugin due to the differences described earlier. On
the other hand, this plugin is designed for practical use, so I want to
keep its name short and save typing.



I would recommend using the <Tab> key to save typing. I'm kidding :)

More seriously, I get your point, but I don't think this new plugin is 
generic enough to qualify as "the" bb plugin (neither is 
tests/tcg/plugins/bb, but it predates this new plugin). It outputs the 
SimPoint format, so how about simpoint (accessible after typing si<Tab>)?


Perhaps bbv is a better name. I expect there are more use cases for 
basic block vectors than feeding them to SimPoint. Searching for "basic 
block vectors" on Google Scholar shows that this term has a consistent 
meaning and is used in several research papers.
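
To make the term concrete, here is a minimal sketch of how one basic block
vector could be written as a line of text. It follows the 'T' prefix and the
":index:count " fields visible in the v2 patch; the helper name
bbv_format_line(), the 1-based block indices and the trailing newline are
assumptions made for illustration and are not taken from the patch.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format one interval's counts as "T:<index>:<count> ...\n" into buf and
 * reset the counts for the next interval. Blocks with a zero count are
 * skipped. Returns 0 on success, -1 if buf is too small.
 */
static int bbv_format_line(char *buf, size_t size, uint64_t *counts, size_t n)
{
    size_t off = 0;
    int len;

    if (size < 2) {
        return -1;
    }
    buf[off++] = 'T';

    for (size_t i = 0; i < n; i++) {
        if (!counts[i]) {
            continue;
        }

        /* Assumption: block indices are 1-based in the output. */
        len = snprintf(buf + off, size - off, ":%zu:%" PRIu64 " ",
                       i + 1, counts[i]);
        if (len < 0 || (size_t)len >= size - off) {
            return -1;
        }
        off += (size_t)len;
        counts[i] = 0;
    }

    if (size - off < 2) {
        return -1;
    }
    buf[off++] = '\n';
    buf[off] = '\0';
    return 0;
}
```

For counts {5, 0, 12} this produces the line "T:1:5 :3:12 " followed by a
newline, and leaves all counts at zero for the next interval.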







[1] https://cseweb.ucsd.edu/~calder/simpoint/
[2] https://valgrind.org/docs/manual/bbv-manual.html
[3]
https://www.intel.com/content/www/us/en/developer/articles/tool/pin-a-dynamic-binary-instrumentation-tool.html

Signed-off-by: Yotaro Nada 
Signed-off-by: Akihiko Odaki 
---
   docs/devel/tcg-plugins.rst |  20 ++
   contrib/plugins/bb.c       | 153 +
   contrib/plugins/Makefile   |   1 +
   3 files changed, 174 insertions(+)

diff --git a/docs/devel/tcg-plugins.rst b/docs/devel/tcg-plugins.rst
index 9cc09d8c3da1..2859eecc13b9 100644
--- a/docs/devel/tcg-plugins.rst
+++ b/docs/devel/tcg-plugins.rst
@@ -332,6 +332,26 @@ run::
 160  1  0
 135  1  0
+- contrib/plugins/bb.c
+
+The bb plugin allows you to generate basic block vectors for use with the
+`SimPoint <https://cseweb.ucsd.edu/~calder/simpoint/>`__ analysis tool.
+
+It has two options, ``interval`` and ``outfile``. ``interval`` specifies the
+interval to generate a basic block vector by the number of instructions. It is
+optional, and its default value is 1. ``outfile`` is the path to
+output files, and it will be suffixed with ``.N.bb`` where ``N`` is a vCPU
+index.
+
+Example::
+
+  $ qemu-aarch64 \
+    -plugin contrib/plugins/libbb.so,interval=100,outfile=sha1 \
+    tests/tcg/aarch64-linux-user/sha1
+  SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
+  $ du sha1.0.bb
+  23128   sha1.0.bb
+
   - contrib/plugins/hotblocks.c
   The hotblocks plugin allows you to examine the where hot paths of
diff --git a/contrib/plugins/bb.c b/contrib/plugins/bb.c
new file mode 100644
index ..4f1266d07ff5
--- /dev/null
+++ b/contrib/plugins/bb.c
@@ -0,0 +1,153 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+


A brief summary and the link to simpoint page can be added as a comment
here.


I'll add them with v2.




+#include <stdio.h>
+#include <glib.h>
+
+#include <qemu-plugin.h>
+
+typedef struct Bb {
+    struct qemu_plugin_scoreboard *count;
+    unsigned int index;
+} Bb;
+
+QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;
+static GHashTable *bbs;
+static GPtrArray *files;
+static char *filename;
+static struct qemu_plugin_scoreboard *count;
+static uint64_t interval = 1;
+
+static void plugin_exit(qemu_plugin_id_t id, void *p)
+{
+    g_hash_table_unref(bbs);
+    g_ptr_array_unref(files);
+    g_free(filename);
+    qemu_plugin_scoreboard_free(count);
+}
+
+static void free_bb(void *data)
+{
+    qemu_plugin_scoreboard_free(((Bb *)data)->count);
+    g_free(data);
+}
+
+static void free_file(void *data)
+{
+    fclose(data);
+}
+
+static

Re: [PATCH] contrib/plugins: Add a plugin to generate basic block vectors

2024-08-13 Thread Akihiko Odaki


On 2024/08/14 4:20, Pierrick Bouvier wrote:

Hi Akihiko, and thanks for contributing this new plugin.


Hi,

Thanks for reviewing



Recently, plugins documentation has been modified, and list of plugins 
and their doc is now in "docs/about/emulation.rst". You may want to 
rebase on top of master.


I see. I'll rebase and update the documentation with v2.



Globally, I'm ok with this plugin and the implementation. Just a few 
fixes are needed for concurrent accesses.


On 8/12/24 23:46, Akihiko Odaki wrote:

SimPoint[1] is a widely used tool for finding ideal microarchitecture
simulation points, so Valgrind[2] and Pin[3] support generating basic
block vectors for use with it. Let's add a corresponding plugin to
QEMU too.

Note that this plugin has a different goal from tests/plugin/bb.c.

This plugin creates a vector for each constant interval instead of
counting the execution of basic blocks for the entire run, so it is able
to describe changes in execution behavior. Its output is also
syntactically simple and better suited for parsing, while the output of
tests/plugin/bb.c is more human-readable.



I think it can be confusing to have two plugins named bb. How about 
simpoint, or bbv?


How about renaming tests/tcg/plugins/bb.c to simple-bb.c instead and 
keep the name of contrib/plugins/bb.c concise?


tests/tcg/plugins/bb.c is simple and good as a sample, but less useful 
in practice than this plugin due to the differences described earlier. On 
the other hand, this plugin is designed for practical use, so I want to 
keep its name short and save typing.





[1] https://cseweb.ucsd.edu/~calder/simpoint/
[2] https://valgrind.org/docs/manual/bbv-manual.html
[3] 
https://www.intel.com/content/www/us/en/developer/articles/tool/pin-a-dynamic-binary-instrumentation-tool.html


Signed-off-by: Yotaro Nada 
Signed-off-by: Akihiko Odaki 
---
  docs/devel/tcg-plugins.rst |  20 ++
  contrib/plugins/bb.c       | 153 +
  contrib/plugins/Makefile   |   1 +
  3 files changed, 174 insertions(+)

diff --git a/docs/devel/tcg-plugins.rst b/docs/devel/tcg-plugins.rst
index 9cc09d8c3da1..2859eecc13b9 100644
--- a/docs/devel/tcg-plugins.rst
+++ b/docs/devel/tcg-plugins.rst
@@ -332,6 +332,26 @@ run::
    160  1  0
    135  1  0
+- contrib/plugins/bb.c
+
+The bb plugin allows you to generate basic block vectors for use with the
+`SimPoint <https://cseweb.ucsd.edu/~calder/simpoint/>`__ analysis tool.
+
+It has two options, ``interval`` and ``outfile``. ``interval`` specifies the
+interval to generate a basic block vector by the number of instructions. It is
+optional, and its default value is 1. ``outfile`` is the path to
+output files, and it will be suffixed with ``.N.bb`` where ``N`` is a vCPU
+index.
+
+Example::
+
+  $ qemu-aarch64 \
+    -plugin contrib/plugins/libbb.so,interval=100,outfile=sha1 \
+    tests/tcg/aarch64-linux-user/sha1
+  SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
+  $ du sha1.0.bb
+  23128   sha1.0.bb
+
  - contrib/plugins/hotblocks.c
  The hotblocks plugin allows you to examine the where hot paths of
diff --git a/contrib/plugins/bb.c b/contrib/plugins/bb.c
new file mode 100644
index ..4f1266d07ff5
--- /dev/null
+++ b/contrib/plugins/bb.c
@@ -0,0 +1,153 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+


A brief summary and the link to simpoint page can be added as a comment 
here.


I'll add them with v2.




+#include <stdio.h>
+#include <glib.h>
+
+#include <qemu-plugin.h>
+
+typedef struct Bb {
+    struct qemu_plugin_scoreboard *count;
+    unsigned int index;
+} Bb;
+
+QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;
+static GHashTable *bbs;
+static GPtrArray *files;
+static char *filename;
+static struct qemu_plugin_scoreboard *count;
+static uint64_t interval = 1;
+
+static void plugin_exit(qemu_plugin_id_t id, void *p)
+{
+    g_hash_table_unref(bbs);
+    g_ptr_array_unref(files);
+    g_free(filename);
+    qemu_plugin_scoreboard_free(count);
+}
+
+static void free_bb(void *data)
+{
+    qemu_plugin_scoreboard_free(((Bb *)data)->count);
+    g_free(data);
+}
+
+static void free_file(void *data)
+{
+    fclose(data);
+}
+
+static qemu_plugin_u64 count_u64(void)
+{
+    return qemu_plugin_scoreboard_u64(count);
+}
+
+static qemu_plugin_u64 bb_count_u64(Bb *bb)
+{
+    return qemu_plugin_scoreboard_u64(bb->count);
+}
+
+static void vcpu_init(qemu_plugin_id_t id, unsigned int vcpu_index)
+{
+    g_autofree gchar *vcpu_filename = NULL;
+
+    if (vcpu_index >= files->len) {
+        g_ptr_array_set_size(files, vcpu_index + 1);
+    } else if (g_ptr_array_index(files, vcpu_index)) {
+        return;
+    }
+


You need a lock for the files array for expansion/access.


I will replace GPtrArray with a scoreboard instead.
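
To illustrate the concern raised here: a growable array shared between
threads needs its expansion and its accesses serialized, because growing may
move the storage while another thread is reading it. The sketch below shows
the locking alternative that was suggested, using a POSIX mutex. It is not
what the later revisions do (they switch to a scoreboard), and SlotArray and
its helper functions are hypothetical names.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical growable array of per-vCPU pointers guarded by a mutex. */
typedef struct SlotArray {
    pthread_mutex_t lock;
    void **slots;
    size_t len;
} SlotArray;

static void slot_array_init(SlotArray *a)
{
    pthread_mutex_init(&a->lock, NULL);
    a->slots = NULL;
    a->len = 0;
}

/* Store value at index, growing the array if needed. Returns 0 or -1. */
static int slot_array_set(SlotArray *a, size_t index, void *value)
{
    int ret = 0;

    pthread_mutex_lock(&a->lock);
    if (index >= a->len) {
        void **grown = realloc(a->slots, (index + 1) * sizeof(*grown));

        if (!grown) {
            ret = -1;
            goto out;
        }
        /* New slots start out empty. */
        for (size_t i = a->len; i <= index; i++) {
            grown[i] = NULL;
        }
        a->slots = grown;
        a->len = index + 1;
    }
    a->slots[index] = value;
out:
    pthread_mutex_unlock(&a->lock);
    return ret;
}

/* Return the value at index, or NULL if the slot does not exist yet. */
static void *slot_array_get(SlotArray *a, size_t index)
{
    void *value = NULL;

    pthread_mutex_lock(&a->lock);
    if (index < a->len) {
        value = a->slots[index];
    }
    pthread_mutex_unlock(&a->lock);
    return value;
}
```

Every read pays for the lock in this design, which is one reason a structure
that gives each vCPU its own slot can be the more attractive choice.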




+    vcpu_filename = g_strdup_printf("%s.%u.bb", filename, vcpu_index);
+    g_ptr_array_index(files, vcpu_index

[PATCH] contrib/plugins: Add a plugin to generate basic block vectors

2024-08-12 Thread Akihiko Odaki

SimPoint[1] is a widely used tool for finding ideal microarchitecture
simulation points, so Valgrind[2] and Pin[3] support generating basic
block vectors for use with it. Let's add a corresponding plugin to
QEMU too.

Note that this plugin has a different goal from tests/plugin/bb.c.

This plugin creates a vector for each constant interval instead of
counting the execution of basic blocks for the entire run, so it is able
to describe changes in execution behavior. Its output is also
syntactically simple and better suited for parsing, while the output of
tests/plugin/bb.c is more human-readable.

[1] https://cseweb.ucsd.edu/~calder/simpoint/
[2] https://valgrind.org/docs/manual/bbv-manual.html
[3] 
https://www.intel.com/content/www/us/en/developer/articles/tool/pin-a-dynamic-binary-instrumentation-tool.html

Signed-off-by: Yotaro Nada 
Signed-off-by: Akihiko Odaki 
---
 docs/devel/tcg-plugins.rst |  20 ++
 contrib/plugins/bb.c   | 153 +
 contrib/plugins/Makefile   |   1 +
 3 files changed, 174 insertions(+)

diff --git a/docs/devel/tcg-plugins.rst b/docs/devel/tcg-plugins.rst
index 9cc09d8c3da1..2859eecc13b9 100644
--- a/docs/devel/tcg-plugins.rst
+++ b/docs/devel/tcg-plugins.rst
@@ -332,6 +332,26 @@ run::
   160  1  0
   135  1  0
 
+- contrib/plugins/bb.c
+
+The bb plugin allows you to generates basic block vectors for use with the
+`SimPoint <https://cseweb.ucsd.edu/~calder/simpoint/>`__ analysis tool.
+
+It has two options, ``interval`` and ``outfile``. ``interval`` specifies the
+interval to generate a basic block vector by the number of instructions. It is
+optional, and its default value is 1. ``outfile`` is the path to
+output files, and it will be suffixed with ``.N.bb`` where ``N`` is a vCPU
+index.
+
+Example::
+
+  $ qemu-aarch64 \
+-plugin contrib/plugins/libbb.so,interval=100,outfile=sha1 \
+tests/tcg/aarch64-linux-user/sha1
+  SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
+  $ du sha1.0.bb
+  23128   sha1.0.bb
+
 - contrib/plugins/hotblocks.c
 
 The hotblocks plugin allows you to examine the where hot paths of
diff --git a/contrib/plugins/bb.c b/contrib/plugins/bb.c
new file mode 100644
index 000000000000..4f1266d07ff5
--- /dev/null
+++ b/contrib/plugins/bb.c
@@ -0,0 +1,153 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#include <stdio.h>
+#include <glib.h>
+
+#include <qemu-plugin.h>
+
+typedef struct Bb {
+struct qemu_plugin_scoreboard *count;
+unsigned int index;
+} Bb;
+
+QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;
+static GHashTable *bbs;
+static GPtrArray *files;
+static char *filename;
+static struct qemu_plugin_scoreboard *count;
+static uint64_t interval = 1;
+
+static void plugin_exit(qemu_plugin_id_t id, void *p)
+{
+g_hash_table_unref(bbs);
+g_ptr_array_unref(files);
+g_free(filename);
+qemu_plugin_scoreboard_free(count);
+}
+
+static void free_bb(void *data)
+{
+qemu_plugin_scoreboard_free(((Bb *)data)->count);
+g_free(data);
+}
+
+static void free_file(void *data)
+{
+fclose(data);
+}
+
+static qemu_plugin_u64 count_u64(void)
+{
+return qemu_plugin_scoreboard_u64(count);
+}
+
+static qemu_plugin_u64 bb_count_u64(Bb *bb)
+{
+return qemu_plugin_scoreboard_u64(bb->count);
+}
+
+static void vcpu_init(qemu_plugin_id_t id, unsigned int vcpu_index)
+{
+g_autofree gchar *vcpu_filename = NULL;
+
+if (vcpu_index >= files->len) {
+g_ptr_array_set_size(files, vcpu_index + 1);
+} else if (g_ptr_array_index(files, vcpu_index)) {
+return;
+}
+
+vcpu_filename = g_strdup_printf("%s.%u.bb", filename, vcpu_index);
+g_ptr_array_index(files, vcpu_index) = fopen(vcpu_filename, "w");
+}
+
+static void vcpu_tb_exec(unsigned int vcpu_index, void *udata)
+{
+FILE *file = g_ptr_array_index(files, vcpu_index);
+uint64_t count = qemu_plugin_u64_get(count_u64(), vcpu_index) - interval;
+GHashTableIter iter;
+void *value;
+
+if (!file) {
+return;
+}
+
+qemu_plugin_u64_set(count_u64(), vcpu_index, count);
+
+fputc('T', file);
+
+g_hash_table_iter_init(&iter, bbs);
+
+while (g_hash_table_iter_next(&iter, NULL, &value)) {
+Bb *bb = value;
+uint64_t bb_count = qemu_plugin_u64_get(bb_count_u64(bb), vcpu_index);
+
+if (!bb_count) {
+continue;
+}
+
+fprintf(file, ":%u:%" PRIu64 " ", bb->index, bb_count);
+qemu_plugin_u64_set(bb_count_u64(bb), vcpu_index, 0);
+}
+
+fputc('\n', file);
+}
+
+static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
+{
+uint64_t n_insns = qemu_plugin_tb_n_insns(tb);
+uint64_t vaddr = qemu_plugin_tb_vaddr(tb);
+Bb *bb = g_hash_table_lookup(bbs, &vaddr);
+
+if (!bb) {
+uint64_t *key = g_new(uint64_t, 1);
+
+*key = vaddr;
+bb = g_new(Bb, 1);
+

[PATCH for-9.2 v7 6/9] virtio-pci: Implement SR-IOV PF

2024-08-12 Thread Akihiko Odaki

Allow the user to attach SR-IOV VFs to a virtio-pci PF.

Signed-off-by: Akihiko Odaki 
---
 include/hw/virtio/virtio-pci.h |  1 +
 hw/virtio/virtio-pci.c | 20 +++-
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/hw/virtio/virtio-pci.h b/include/hw/virtio/virtio-pci.h
index 9e67ba38c748..34539f2f6722 100644
--- a/include/hw/virtio/virtio-pci.h
+++ b/include/hw/virtio/virtio-pci.h
@@ -152,6 +152,7 @@ struct VirtIOPCIProxy {
 uint32_t modern_io_bar_idx;
 uint32_t modern_mem_bar_idx;
 int config_cap;
+uint16_t last_pcie_cap_offset;
 uint32_t flags;
 bool disable_modern;
 bool ignore_backend_features;
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 9534730bba19..0c8fcc5627d5 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -1955,6 +1955,7 @@ static void virtio_pci_device_plugged(DeviceState *d, Error **errp)
 uint8_t *config;
 uint32_t size;
 VirtIODevice *vdev = virtio_bus_get_device(bus);
+int16_t res;
 
 /*
  * Virtio capabilities present without
@@ -2100,6 +2101,14 @@ static void virtio_pci_device_plugged(DeviceState *d, Error **errp)
 pci_register_bar(&proxy->pci_dev, proxy->legacy_io_bar_idx,
  PCI_BASE_ADDRESS_SPACE_IO, &proxy->bar);
 }
+
+res = pcie_sriov_pf_init_from_user_created_vfs(&proxy->pci_dev,
+   proxy->last_pcie_cap_offset,
+   errp);
+if (res > 0) {
+proxy->last_pcie_cap_offset += res;
+virtio_add_feature(&vdev->host_features, VIRTIO_F_SR_IOV);
+}
 }
 
 static void virtio_pci_device_unplugged(DeviceState *d)
@@ -2187,7 +2196,7 @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp)
 
 if (pcie_port && pci_is_express(pci_dev)) {
 int pos;
-uint16_t last_pcie_cap_offset = PCI_CONFIG_SPACE_SIZE;
+proxy->last_pcie_cap_offset = PCI_CONFIG_SPACE_SIZE;
 
 pos = pcie_endpoint_cap_init(pci_dev, 0);
 assert(pos > 0);
@@ -2207,9 +2216,9 @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp)
 pci_set_word(pci_dev->config + pos + PCI_PM_PMC, 0x3);
 
 if (proxy->flags & VIRTIO_PCI_FLAG_AER) {
-pcie_aer_init(pci_dev, PCI_ERR_VER, last_pcie_cap_offset,
+pcie_aer_init(pci_dev, PCI_ERR_VER, proxy->last_pcie_cap_offset,
   PCI_ERR_SIZEOF, NULL);
-last_pcie_cap_offset += PCI_ERR_SIZEOF;
+proxy->last_pcie_cap_offset += PCI_ERR_SIZEOF;
 }
 
 if (proxy->flags & VIRTIO_PCI_FLAG_INIT_DEVERR) {
@@ -2234,9 +2243,9 @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp)
 }
 
 if (proxy->flags & VIRTIO_PCI_FLAG_ATS) {
-pcie_ats_init(pci_dev, last_pcie_cap_offset,
+pcie_ats_init(pci_dev, proxy->last_pcie_cap_offset,
   proxy->flags & VIRTIO_PCI_FLAG_ATS_PAGE_ALIGNED);
-last_pcie_cap_offset += PCI_EXT_CAP_ATS_SIZEOF;
+proxy->last_pcie_cap_offset += PCI_EXT_CAP_ATS_SIZEOF;
 }
 
 if (proxy->flags & VIRTIO_PCI_FLAG_INIT_FLR) {
@@ -2263,6 +2272,7 @@ static void virtio_pci_exit(PCIDevice *pci_dev)
 bool pcie_port = pci_bus_is_express(pci_get_bus(pci_dev)) &&
  !pci_bus_is_root(pci_get_bus(pci_dev));
 
+pcie_sriov_pf_exit(&proxy->pci_dev);
 msix_uninit_exclusive_bar(pci_dev);
 if (proxy->flags & VIRTIO_PCI_FLAG_AER && pcie_port &&
 pci_is_express(pci_dev)) {

-- 
2.46.0

[PATCH for-9.2 v7 5/9] pcie_sriov: Allow user to create SR-IOV device

2024-08-12 Thread Akihiko Odaki

A user can create an SR-IOV device by specifying the PF with the
sriov-pf property of the VFs. The VFs must be added before the PF.

A user-creatable VF must have PCIDeviceClass::sriov_vf_user_creatable
set. Such a VF cannot refer to the PF because it is created before the
PF.

A PF to which user-creatable VFs can be attached calls
pcie_sriov_pf_init_from_user_created_vfs() during realization and
pcie_sriov_pf_exit() when exiting.

Signed-off-by: Akihiko Odaki 
---
 include/hw/pci/pci_device.h |   6 +-
 include/hw/pci/pcie_sriov.h |  18 +++
 hw/pci/pci.c|  62 ++
 hw/pci/pcie_sriov.c | 279 +++-
 4 files changed, 286 insertions(+), 79 deletions(-)

diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index 8fa845beee5e..1d31099dd4dc 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -38,6 +38,8 @@ struct PCIDeviceClass {
 uint16_t subsystem_id;  /* only for header type = 0 */
 
 const char *romfile;/* rom bar */
+
+bool sriov_vf_user_creatable;
 };
 
 enum PCIReqIDType {
@@ -167,6 +169,8 @@ struct PCIDevice {
 /* ID of standby device in net_failover pair */
 char *failover_pair_id;
 uint32_t acpi_index;
+
+char *sriov_pf;
 };
 
 static inline int pci_intx(PCIDevice *pci_dev)
@@ -199,7 +203,7 @@ static inline int pci_is_express_downstream_port(const PCIDevice *d)
 
 static inline int pci_is_vf(const PCIDevice *d)
 {
-return d->exp.sriov_vf.pf != NULL;
+return d->sriov_pf || d->exp.sriov_vf.pf != NULL;
 }
 
 static inline uint32_t pci_config_size(const PCIDevice *d)
diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
index c5d2d318d330..f75b8f22ee92 100644
--- a/include/hw/pci/pcie_sriov.h
+++ b/include/hw/pci/pcie_sriov.h
@@ -18,6 +18,7 @@
 typedef struct PCIESriovPF {
 uint8_t vf_bar_type[PCI_NUM_REGIONS];   /* Store type for each VF bar */
 PCIDevice **vf; /* Pointer to an array of num_vfs VF devices */
+bool vf_user_created; /* If VFs are created by user */
 } PCIESriovPF;
 
 typedef struct PCIESriovVF {
@@ -40,6 +41,23 @@ void pcie_sriov_pf_init_vf_bar(PCIDevice *dev, int region_num,
 void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
 MemoryRegion *memory);
 
+/**
+ * pcie_sriov_pf_init_from_user_created_vfs() - Initialize PF with user-created
+ *  VFs.
+ * @dev: A PCIe device being realized.
+ * @offset: The offset of the SR-IOV capability.
+ * @errp: pointer to Error*, to store an error if it happens.
+ *
+ * Return: The size of added capability. 0 if the user did not create VFs.
+ * -1 if failed.
+ */
+int16_t pcie_sriov_pf_init_from_user_created_vfs(PCIDevice *dev,
+ uint16_t offset,
+ Error **errp);
+
+bool pcie_sriov_register_device(PCIDevice *dev, Error **errp);
+void pcie_sriov_unregister_device(PCIDevice *dev);
+
 /*
  * Default (minimal) page size support values
  * as required by the SR/IOV standard:
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 0956fe5eb444..e693f5b1e044 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -85,6 +85,7 @@ static Property pci_props[] = {
 QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
 DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
 QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
+DEFINE_PROP_STRING("sriov-pf", PCIDevice, sriov_pf),
 DEFINE_PROP_END_OF_LIST()
 };
 
@@ -959,13 +960,8 @@ static void pci_init_multifunction(PCIBus *bus, PCIDevice *dev, Error **errp)
 dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
 }
 
-/*
- * With SR/IOV and ARI, a device at function 0 need not be a multifunction
- * device, as it may just be a VF that ended up with function 0 in
- * the legacy PCI interpretation. Avoid failing in such cases:
- */
-if (pci_is_vf(dev) &&
-dev->exp.sriov_vf.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+/* SR/IOV is not handled here. */
+if (pci_is_vf(dev)) {
 return;
 }
 
@@ -998,7 +994,8 @@ static void pci_init_multifunction(PCIBus *bus, PCIDevice *dev, Error **errp)
 }
 /* function 0 indicates single function, so function > 0 must be NULL */
 for (func = 1; func < PCI_FUNC_MAX; ++func) {
-if (bus->devices[PCI_DEVFN(slot, func)]) {
+PCIDevice *device = bus->devices[PCI_DEVFN(slot, func)];
+if (device && !pci_is_vf(device)) {
 error_setg(errp, "PCI: %x.0 indicates single function, "
"but %x.%x is already populated.",
slot, slot, func);
@@ -1283,6 +1280,7 @@ static void pci_qdev_unrealize(DeviceState *dev)
 
 pci_unregister_io_reg

[PATCH for-9.2 v7 9/9] pcie_sriov: Make a PCI device with user-created VF ARI-capable

2024-08-12 Thread Akihiko Odaki

Signed-off-by: Akihiko Odaki 
---
 docs/system/sriov.rst   |  3 ++-
 include/hw/pci/pcie_sriov.h |  7 +--
 hw/pci/pcie_sriov.c |  8 +++-
 hw/virtio/virtio-pci.c  | 16 ++--
 4 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/docs/system/sriov.rst b/docs/system/sriov.rst
index a851a66a4b8b..d12178f3c319 100644
--- a/docs/system/sriov.rst
+++ b/docs/system/sriov.rst
@@ -28,7 +28,8 @@ virtio-net-pci functions to a bus. Below is a command line example:
 The VFs specify the paired PF with ``sriov-pf`` property. The PF must be
 added after all VFs. It is the user's responsibility to ensure that VFs have
 function numbers larger than one of the PF, and that the function numbers
-have a consistent stride.
+have a consistent stride. Both the PF and VFs are ARI-capable so you can have
+255 VFs at maximum.
 
 You may also need to perform additional steps to activate the SR-IOV feature on
 your guest. For Linux, refer to [1]_.
diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
index f75b8f22ee92..aeaa38cf3456 100644
--- a/include/hw/pci/pcie_sriov.h
+++ b/include/hw/pci/pcie_sriov.h
@@ -43,12 +43,15 @@ void pcie_sriov_vf_register_bar(PCIDevice *dev, int region_num,
 
 /**
  * pcie_sriov_pf_init_from_user_created_vfs() - Initialize PF with user-created
- *  VFs.
+ *  VFs, adding ARI to PF
  * @dev: A PCIe device being realized.
  * @offset: The offset of the SR-IOV capability.
  * @errp: pointer to Error*, to store an error if it happens.
  *
- * Return: The size of added capability. 0 if the user did not create VFs.
+ * Initializes a PF with user-created VFs, adding the ARI extended capability to
+ * the PF. The VFs should call pcie_ari_init() to form an ARI device.
+ *
+ * Return: The size of added capabilities. 0 if the user did not create VFs.
  * -1 if failed.
  */
 int16_t pcie_sriov_pf_init_from_user_created_vfs(PCIDevice *dev,
diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index f5b83a92a00c..94ab92d8c80d 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -238,6 +238,7 @@ int16_t pcie_sriov_pf_init_from_user_created_vfs(PCIDevice *dev,
 PCIDevice **vfs;
 BusState *bus = qdev_get_parent_bus(DEVICE(dev));
 uint16_t ven_id = pci_get_word(dev->config + PCI_VENDOR_ID);
+uint16_t size = PCI_EXT_CAP_SRIOV_SIZEOF;
 uint16_t vf_dev_id;
 uint16_t vf_offset;
 uint16_t vf_stride;
@@ -304,6 +305,11 @@ int16_t pcie_sriov_pf_init_from_user_created_vfs(PCIDevice *dev,
 return -1;
 }
 
+if (!pcie_find_capability(dev, PCI_EXT_CAP_ID_ARI)) {
+pcie_ari_init(dev, offset + size);
+size += PCI_ARI_SIZEOF;
+}
+
 for (i = 0; i < pf->len; i++) {
 vfs[i]->exp.sriov_vf.pf = dev;
 vfs[i]->exp.sriov_vf.vf_number = i;
@@ -324,7 +330,7 @@ int16_t pcie_sriov_pf_init_from_user_created_vfs(PCIDevice *dev,
 }
 }
 
-return PCI_EXT_CAP_SRIOV_SIZEOF;
+return size;
 }
 
 bool pcie_sriov_register_device(PCIDevice *dev, Error **errp)
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 0c8fcc5627d5..b19e2983ee22 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -2102,12 +2102,16 @@ static void virtio_pci_device_plugged(DeviceState *d, Error **errp)
  PCI_BASE_ADDRESS_SPACE_IO, &proxy->bar);
 }
 
-res = pcie_sriov_pf_init_from_user_created_vfs(&proxy->pci_dev,
-   proxy->last_pcie_cap_offset,
-   errp);
-if (res > 0) {
-proxy->last_pcie_cap_offset += res;
-virtio_add_feature(&vdev->host_features, VIRTIO_F_SR_IOV);
+if (pci_is_vf(&proxy->pci_dev)) {
+pcie_ari_init(&proxy->pci_dev, proxy->last_pcie_cap_offset);
+proxy->last_pcie_cap_offset += PCI_ARI_SIZEOF;
+} else {
+res = pcie_sriov_pf_init_from_user_created_vfs(
+&proxy->pci_dev, proxy->last_pcie_cap_offset, errp);
+if (res > 0) {
+proxy->last_pcie_cap_offset += res;
+virtio_add_feature(&vdev->host_features, VIRTIO_F_SR_IOV);
+}
 }
 }
 

-- 
2.46.0

[PATCH for-9.2 v7 8/9] docs: Document composable SR-IOV device

2024-08-12 Thread Akihiko Odaki

Signed-off-by: Akihiko Odaki 
---
 MAINTAINERS   |  1 +
 docs/system/index.rst |  1 +
 docs/system/sriov.rst | 36 
 3 files changed, 38 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index e34c2bd4cda2..72b3c6736088 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2011,6 +2011,7 @@ F: hw/pci-bridge/*
 F: qapi/pci.json
 F: docs/pci*
 F: docs/specs/*pci*
+F: docs/system/sriov.rst
 
 PCIE DOE
 M: Huai-Cheng Kuo 
diff --git a/docs/system/index.rst b/docs/system/index.rst
index c21065e51932..718e9d3c56bb 100644
--- a/docs/system/index.rst
+++ b/docs/system/index.rst
@@ -39,3 +39,4 @@ or Hypervisor.Framework.
multi-process
confidential-guest-support
vm-templating
+   sriov
diff --git a/docs/system/sriov.rst b/docs/system/sriov.rst
new file mode 100644
index 000000000000..a851a66a4b8b
--- /dev/null
+++ b/docs/system/sriov.rst
@@ -0,0 +1,36 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+Composable SR-IOV device
+========================
+
+SR-IOV (Single Root I/O Virtualization) is an optional extended capability of a
+PCI Express device. It allows a single physical function (PF) to appear as
+multiple virtual functions (VFs) for the main purpose of eliminating software
+overhead in I/O from virtual machines.
+
+There are devices with predefined SR-IOV configurations, but it is also possible
+to compose an SR-IOV device yourself. Composing an SR-IOV device is currently
+only supported by virtio-net-pci.
+
+Users can configure an SR-IOV-capable virtio-net device by adding
+virtio-net-pci functions to a bus. Below is a command line example:
+
+.. code-block:: shell
+
+-netdev user,id=n -netdev user,id=o
+-netdev user,id=p -netdev user,id=q
+-device pcie-root-port,id=b
+-device virtio-net-pci,bus=b,addr=0x0.0x3,netdev=q,sriov-pf=f
+-device virtio-net-pci,bus=b,addr=0x0.0x2,netdev=p,sriov-pf=f
+-device virtio-net-pci,bus=b,addr=0x0.0x1,netdev=o,sriov-pf=f
+-device virtio-net-pci,bus=b,addr=0x0.0x0,netdev=n,id=f
+
+The VFs specify the paired PF with ``sriov-pf`` property. The PF must be
+added after all VFs. It is the user's responsibility to ensure that VFs have
+function numbers larger than one of the PF, and that the function numbers
+have a consistent stride.
+
+You may also need to perform additional steps to activate the SR-IOV feature on
+your guest. For Linux, refer to [1]_.
+
+.. [1] https://docs.kernel.org/PCI/pci-iov-howto.html

-- 
2.46.0

[PATCH for-9.2 v7 4/9] pcie_sriov: Check PCI Express for SR-IOV PF

2024-08-12 Thread Akihiko Odaki

SR-IOV requires PCI Express.

Signed-off-by: Akihiko Odaki 
---
 hw/pci/pcie_sriov.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index e1b4ecf79ff9..2daea6ecdb6a 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -42,6 +42,11 @@ bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
 uint8_t *cfg = dev->config + offset;
 uint8_t *wmask;
 
+if (!pci_is_express(dev)) {
+error_setg(errp, "PCI Express is required for SR-IOV PF");
+return false;
+}
+
 if (pci_is_vf(dev)) {
 error_setg(errp, "a device cannot be a SR-IOV PF and a VF at the same time");
 return false;

-- 
2.46.0

[PATCH for-9.2 v7 0/9] virtio-net: add support for SR-IOV emulation

2024-08-12 Thread Akihiko Odaki

Based-on: <20240802-reuse-v11-0-fb83bb8c1...@daynix.com>
("[PATCH for-9.2 v11 00/11] hw/pci: SR-IOV related fixes and improvements")

Introduction
------------

This series is based on the RFC series submitted by Yui Washizu[1].
See also [2] for the context.

This series enables SR-IOV emulation for virtio-net. It is useful
for testing SR-IOV support in the guest, or for exposing several vDPA
devices in a VM. vDPA devices can also provide an L2 switching feature
for offloading, though allowing the guest to configure such a feature is
out of scope.

The PF side code resides in virtio-pci. The VF side code resides in
the PCI common infrastructure, but it is restricted to work only for
virtio-net-pci because of lack of validation.

User Interface
--------------

A user can configure an SR-IOV-capable virtio-net device by adding
virtio-net-pci functions to a bus. Below is a command line example:
  -netdev user,id=n -netdev user,id=o
  -netdev user,id=p -netdev user,id=q
  -device pcie-root-port,id=b
  -device virtio-net-pci,bus=b,addr=0x0.0x3,netdev=q,sriov-pf=f
  -device virtio-net-pci,bus=b,addr=0x0.0x2,netdev=p,sriov-pf=f
  -device virtio-net-pci,bus=b,addr=0x0.0x1,netdev=o,sriov-pf=f
  -device virtio-net-pci,bus=b,addr=0x0.0x0,netdev=n,id=f

The VFs specify the paired PF with the "sriov-pf" property. The PF must
be added after all VFs. It is the user's responsibility to ensure that
VFs have function numbers larger than that of the PF, and that the function numbers
have a consistent stride.
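
To make the layout rule concrete, here is a minimal C sketch of such a
check. It is an illustration only, not the code in hw/pci/pcie_sriov.c:
check_vf_layout() is a hypothetical helper, and it assumes the VF
device/function numbers are passed in ascending order. A single VF is
accepted with a stride of zero, in line with patch 2.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: verify that every VF has a function number larger
 * than the PF's and that consecutive VFs are evenly spaced.
 * On success, stores the offset of the first VF and the stride.
 */
static bool check_vf_layout(uint8_t pf_devfn, const uint8_t *vf_devfns,
                            size_t num_vfs, uint16_t *offset,
                            uint16_t *stride)
{
    uint16_t step = 0; /* a single VF has nothing to be spaced against */

    if (num_vfs == 0 || vf_devfns[0] <= pf_devfn) {
        return false;
    }

    for (size_t i = 1; i < num_vfs; i++) {
        uint16_t gap;

        if (vf_devfns[i] <= vf_devfns[i - 1]) {
            return false; /* not strictly ascending */
        }

        gap = vf_devfns[i] - vf_devfns[i - 1];
        if (i == 1) {
            step = gap; /* the first gap defines the stride */
        } else if (gap != step) {
            return false; /* inconsistent stride */
        }
    }

    *offset = vf_devfns[0] - pf_devfn;
    *stride = step;
    return true;
}
```

For the command line above, the PF is function 0 and the VFs are
functions 1, 2 and 3, so the sketch reports an offset of 1 and a stride
of 1.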

Keeping VF instances
--------------------

A problem with SR-IOV emulation is that it needs to hotplug the VFs as
the guest requests. Previously, this behavior was implemented by
realizing and unrealizing VFs at runtime. However, this strategy does
not work well for the proposed virtio-net emulation; in this proposal,
device options passed in the command line must be maintained as VFs
are hotplugged, but they are consumed when the machine starts and not
available after that, which makes realizing VFs at runtime impossible.

As an alternative strategy to runtime realization/unrealization, this
series proposes to reuse the code to power down PCI Express devices.
When a PCI Express device is powered down, it will be hidden from the
guest but will be kept realized. This effectively implements the
behavior we need for the SR-IOV emulation.
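
The idea can be pictured with a toy model in C. This is only a sketch
under simplifying assumptions: ToyVF and toy_set_num_vfs() are
hypothetical names and do not correspond to QEMU's PCIDevice or to any
function in this series. The point it shows is that a change in the
number of VFs requested by the guest toggles visibility only, while
every VF stays realized with the options it was given at startup.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a VF; not QEMU's PCIDevice. */
typedef struct ToyVF {
    bool realized; /* set once at startup from the command-line options */
    bool enabled;  /* whether the guest can currently see the function */
} ToyVF;

/*
 * Hypothetical handler for the guest changing the number of enabled VFs.
 * No VF is realized or unrealized here; only visibility changes.
 */
static void toy_set_num_vfs(ToyVF *vfs, size_t total_vfs, size_t num_vfs)
{
    for (size_t i = 0; i < total_vfs; i++) {
        vfs[i].enabled = vfs[i].realized && i < num_vfs;
    }
}
```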

Summary
-------

Patch 1 disables ROM BAR, which virtio-net-pci enables by default, for
VFs.
Patch 2 makes zero stride valid for 1 VF configuration.
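Patches 3 and 4 add validations.
Patch 5 adds user-created SR-IOV VF infrastructure.
Patch 6 makes virtio-pci work as SR-IOV PF for user-created VFs.
Patch 7 allows the user to create SR-IOV VFs with virtio-net-pci.
Patch 8 documents the composable SR-IOV device.
Patch 9 makes a PCI device with user-created VFs ARI-capable.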

[1] https://patchew.org/QEMU/1689731808-3009-1-git-send-email-yui.wash...@gmail.com/
[2] https://lore.kernel.org/all/5d46f455-f530-4e5e-9ae7-13a2297d4...@daynix.com/

Co-developed-by: Yui Washizu 
Signed-off-by: Akihiko Odaki 
---
Changes in v7:
- Removed #include , which is no longer needed.
- Rebased.
- Link to v6: https://lore.kernel.org/r/20240802-sriov-v6-0-0c8ff49c4...@daynix.com

Changes in v6:
- Added ARI extended capability.
- Rebased.
- Link to v5: https://lore.kernel.org/r/20240715-sriov-v5-0-3f5539093...@daynix.com

Changes in v5:
- Dropped the RFC tag.
- Fixed device unrealization.
- Rebased.
- Link to v4: https://lore.kernel.org/r/20240428-sriov-v4-0-ac8ac6212...@daynix.com

Changes in v4:
- Added patch "hw/pci: Fix SR-IOV VF number calculation" to fix division
  by zero reported by Yui Washizu.
- Rebased.
- Link to v3: https://lore.kernel.org/r/20240305-sriov-v3-0-abdb75770...@daynix.com

Changes in v3:
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231210-sriov-v2-0-b959e8a6d...@daynix.com

Changes in v2:
- Changed to keep VF instances.
- Link to v1: https://lore.kernel.org/r/20231202-sriov-v1-0-32b3570f7...@daynix.com

---
Akihiko Odaki (9):
  hw/pci: Do not add ROM BAR for SR-IOV VF
  hw/pci: Fix SR-IOV VF number calculation
  pcie_sriov: Ensure PF and VF are mutually exclusive
  pcie_sriov: Check PCI Express for SR-IOV PF
  pcie_sriov: Allow user to create SR-IOV device
  virtio-pci: Implement SR-IOV PF
  virtio-net: Implement SR-IOV VF
  docs: Document composable SR-IOV device
  pcie_sriov: Make a PCI device with user-created VF ARI-capable

 MAINTAINERS|   1 +
 docs/system/index.rst  |   1 +
 docs/system/sriov.rst  |  37 ++
 include/hw/pci/pci_device.h|   6 +-
 include/hw/pci/pcie_sriov.h|  21 +++
 include/hw/virtio/virtio-pci.h |   1 +
 hw/pci/pci.c   |  76 +++
 hw/pci/pcie_sriov.c| 295 +
 hw/virtio/virtio-net-pci.c |   1 +
 hw/virtio/virtio-pci.c |  24 +++-
 10 files changed, 378 insertions(+), 85 deletions(-)
---
base-commit: f5cebc77fe020e6ca0c33d8e06cd36edf3ff1d4c
change-id: 20231202-sriov-9402fb262be8

Best regards,
-- 
Akihiko Odaki

[PATCH for-9.2 v7 3/9] pcie_sriov: Ensure PF and VF are mutually exclusive

2024-08-12 Thread Akihiko Odaki

A device cannot be a SR-IOV PF and a VF at the same time.

Signed-off-by: Akihiko Odaki 
---
 hw/pci/pcie_sriov.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
index 1eae9f0a0acf..e1b4ecf79ff9 100644
--- a/hw/pci/pcie_sriov.c
+++ b/hw/pci/pcie_sriov.c
@@ -42,6 +42,11 @@ bool pcie_sriov_pf_init(PCIDevice *dev, uint16_t offset,
 uint8_t *cfg = dev->config + offset;
 uint8_t *wmask;
 
+if (pci_is_vf(dev)) {
+error_setg(errp, "a device cannot be a SR-IOV PF and a VF at the same time");
+return false;
+}
+
 pcie_add_capability(dev, PCI_EXT_CAP_ID_SRIOV, 1,
 offset, PCI_EXT_CAP_SRIOV_SIZEOF);
 dev->exp.sriov_cap = offset;

-- 
2.46.0
