Re: [PATCH] MAINTAINERS: add Edgar as Xen maintainer

2024-07-11 Thread Anthony PERARD
On Wed, Jul 10, 2024 at 01:28:52PM -0700, Stefano Stabellini wrote:
> Add Edgar as Xen subsystem maintainer in QEMU. Edgar has been a QEMU
> maintainer for years, and has already made key changes to one of the
> most difficult areas of the Xen subsystem (the mapcache).
> 
> Edgar volunteered to help us maintain the Xen subsystem in QEMU and we
> are very happy to welcome him to the team. His knowledge of and expertise
> with QEMU internals will be of great help.
> 
> Signed-off-by: Stefano Stabellini 
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 6725913c8b..63e11095a2 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -536,6 +536,7 @@ X86 Xen CPUs
>  M: Stefano Stabellini 
>  M: Anthony PERARD 
>  M: Paul Durrant 
> +M: Edgar E. Iglesias 
>  L: xen-de...@lists.xenproject.org
>  S: Supported
>  F: */xen*

Acked-by: Anthony PERARD 

Welcome!
Cheers,

-- 
Anthony PERARD



Re: [PATCH v1 2/2] xen: mapcache: Fix unmapping of first entries in buckets

2024-07-04 Thread Anthony PERARD
On Tue, Jul 02, 2024 at 12:44:21AM +0200, Edgar E. Iglesias wrote:
> From: "Edgar E. Iglesias" 
> 
> This fixes the clobbering of the entry->next pointer when
> unmapping the first entry in a bucket of a mapcache.
> 
> Fixes: 123acd816d ("xen: mapcache: Unmap first entries in buckets")
> Reported-by: Anthony PERARD 
> Signed-off-by: Edgar E. Iglesias 
> ---
>  hw/xen/xen-mapcache.c | 12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/xen/xen-mapcache.c b/hw/xen/xen-mapcache.c
> index 5f23b0adbe..18ba7b1d8f 100644
> --- a/hw/xen/xen-mapcache.c
> +++ b/hw/xen/xen-mapcache.c
> @@ -597,7 +597,17 @@ static void xen_invalidate_map_cache_entry_unlocked(MapCache *mc,
>  pentry->next = entry->next;
>  g_free(entry);
>  } else {
> -memset(entry, 0, sizeof *entry);
> +/*
> + * Invalidate mapping but keep entry->next pointing to the rest
> + * of the list.
> + *
> + * Note that lock is already zero here, otherwise we don't unmap.
> + */
> +entry->paddr_index = 0;
> +entry->vaddr_base = NULL;
> +entry->valid_mapping = NULL;
> +entry->flags = 0;
> +entry->size = 0;

This kind of feels like mc->entry should be an array of pointers rather
than an array of MapCacheEntry, but that seems to work well enough, and
it's not the first time entries have been cleared like that.
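
The head-by-value layout discussed above can be illustrated with a small standalone sketch; the type and field names below are simplified stand-ins, not QEMU's actual MapCache API. It shows why a head entry must be cleared field by field, keeping ->next intact, while chained entries can simply be unlinked and freed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for a mapcache entry (hypothetical fields). */
typedef struct Entry {
    unsigned long paddr_index;
    void *vaddr_base;
    size_t size;
    struct Entry *next;   /* must survive invalidation of a head entry */
} Entry;

/*
 * Invalidate an entry; pentry is its predecessor, NULL when entry is
 * the bucket head. The head is stored by value in the hash table, so
 * it cannot be freed, and a memset() would clobber its ->next pointer
 * and orphan the rest of the chain.
 */
static void invalidate(Entry *pentry, Entry *entry)
{
    if (pentry) {
        pentry->next = entry->next;   /* unlink and free a chained entry */
        free(entry);
    } else {
        /* Head: clear the mapping fields one by one, keep entry->next. */
        entry->paddr_index = 0;
        entry->vaddr_base = NULL;
        entry->size = 0;
    }
}
```

The field-wise clear is what the patch above implements; an array of pointers would allow the head to be freed like any other node, at the cost of an extra allocation per bucket.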

Reviewed-by: Anthony PERARD 

Thanks,

-- 

Anthony Perard | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech



Re: [PATCH v8 2/8] xen: mapcache: Unmap first entries in buckets

2024-07-01 Thread Anthony PERARD
On Mon, Jul 01, 2024 at 04:34:53PM +0200, Edgar E. Iglesias wrote:
> On Mon, Jul 1, 2024 at 4:30 PM Edgar E. Iglesias 
> wrote:
> > On Mon, Jul 1, 2024 at 3:58 PM Edgar E. Iglesias 
> > wrote:
> >> Any chance you could try to get a backtrace from QEMU when it failed?

Here it is:


#3  0x7fa8762f4472 in __GI_abort () at ./stdlib/abort.c:79
save_stage = 1
act = {__sigaction_handler = {sa_handler = 0x20, sa_sigaction = 0x20}, 
sa_mask = {__val = {94603440166168, 18446744073709551615, 94603406369640, 0, 0, 
94603406346720, 94603440166168, 140361486774256, 0, 140361486773376, 
94603401285536, 140361496232688, 94603440166096, 140361486773456, 
94603401289576, 140360849280256}}, sa_flags = -1804462896, sa_restorer = 
0x748f2d40}
#4  0x560a92230f0d in qemu_get_ram_block (addr=18446744073709551615) at 
../system/physmem.c:801
block = 0x0
#5  0x560a922350ab in qemu_ram_block_from_host (ptr=0x7fa84e8fcd00, 
round_offset=false, offset=0x7fa8748f2de8) at ../system/physmem.c:2280
ram_addr = 18446744073709551615
_rcu_read_auto = 0x1
block = 0x0
host = 0x7fa84e8fcd00 ""
_rcu_read_auto = 0x7fa8751f8288
#6  0x560a92229669 in memory_region_from_host (ptr=0x7fa84e8fcd00, 
offset=0x7fa8748f2de8) at ../system/memory.c:2440
block = 0x0
#7  0x560a92237418 in address_space_unmap (as=0x560a94b05a20, 
buffer=0x7fa84e8fcd00, len=32768, is_write=true, access_len=32768) at 
../system/physmem.c:3246
mr = 0x0
addr1 = 0
__PRETTY_FUNCTION__ = "address_space_unmap"
#8  0x560a91fd6cd3 in dma_memory_unmap (as=0x560a94b05a20, 
buffer=0x7fa84e8fcd00, len=32768, dir=DMA_DIRECTION_FROM_DEVICE, 
access_len=32768) at /root/build/qemu/include/sysemu/dma.h:236
#9  0x560a91fd763d in dma_blk_unmap (dbs=0x560a94d87400) at 
../system/dma-helpers.c:93
i = 1
#10 0x560a91fd76e6 in dma_complete (dbs=0x560a94d87400, ret=0) at 
../system/dma-helpers.c:105
__PRETTY_FUNCTION__ = "dma_complete"
#11 0x560a91fd781c in dma_blk_cb (opaque=0x560a94d87400, ret=0) at 
../system/dma-helpers.c:129
dbs = 0x560a94d87400
ctx = 0x560a9448da90
cur_addr = 0
cur_len = 0
mem = 0x0
__PRETTY_FUNCTION__ = "dma_blk_cb"
#12 0x560a9232e974 in blk_aio_complete (acb=0x560a9448d5f0) at 
../block/block-backend.c:1555
#13 0x560a9232ebd1 in blk_aio_read_entry (opaque=0x560a9448d5f0) at 
../block/block-backend.c:1610
acb = 0x560a9448d5f0
rwco = 0x560a9448d618
qiov = 0x560a94d87460
__PRETTY_FUNCTION__ = "blk_aio_read_entry"

> > One more thing, regarding this specific patch. I don't think we should
> > clear the
> > entire entry, the next field should be kept, otherwise we'll disconnect
> > following
> > mappings that will never be found again. IIUC, this could very well be
> > causing the problem you see.
> >
> > Does the following make sense?
> >
> And here without double-freeing entry->valid_mapping:
>
> diff --git a/hw/xen/xen-mapcache.c b/hw/xen/xen-mapcache.c
> index 5f23b0adbe..667807b3b6 100644
> --- a/hw/xen/xen-mapcache.c
> +++ b/hw/xen/xen-mapcache.c
> @@ -597,7 +597,13 @@ static void xen_invalidate_map_cache_entry_unlocked(MapCache *mc,
>  pentry->next = entry->next;
>  g_free(entry);
>  } else {
> -memset(entry, 0, sizeof *entry);
> +/* Invalidate mapping.  */
> +entry->paddr_index = 0;
> +entry->vaddr_base = NULL;
> +entry->size = 0;
> +entry->valid_mapping = NULL;
> +entry->flags = 0;
> +/* Keep entry->next pointing to the rest of the list.  */
>  }
>  }

I've tried this patch, and it fixes the issue I've seen. I'll run more
tests on it, just in case, but there's no reason it would break
something else.

Cheers,


--

Anthony Perard | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech




[PULL 2/3] xen: fix stubdom PCI addr

2024-07-01 Thread anthony
From: Marek Marczykowski-Górecki 

When running in a stubdomain, the config space access via sysfs needs to
use the BDF as seen inside the stubdomain (connected via xen-pcifront),
which is different from the real BDF. For other purposes (hypercall
parameters etc.), the real BDF needs to be used.
Get the in-stubdomain BDF by looking up the relevant PV PCI xenstore
entries.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Anthony PERARD 
Message-Id: 
<35049e99da634a74578a1ff2cb3ae4cc436ede33.1711506237.git-series.marma...@invisiblethingslab.com>
Signed-off-by: Anthony PERARD 
---
 hw/xen/xen-host-pci-device.c | 76 +++-
 hw/xen/xen-host-pci-device.h |  6 +++
 2 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/hw/xen/xen-host-pci-device.c b/hw/xen/xen-host-pci-device.c
index 8c6e9a1716..eaf32f2710 100644
--- a/hw/xen/xen-host-pci-device.c
+++ b/hw/xen/xen-host-pci-device.c
@@ -9,6 +9,8 @@
 #include "qemu/osdep.h"
 #include "qapi/error.h"
 #include "qemu/cutils.h"
+#include "hw/xen/xen-legacy-backend.h"
+#include "hw/xen/xen-bus-helper.h"
 #include "xen-host-pci-device.h"
 
 #define XEN_HOST_PCI_MAX_EXT_CAP \
@@ -33,13 +35,73 @@
 #define IORESOURCE_PREFETCH 0x1000  /* No side effects */
 #define IORESOURCE_MEM_64   0x0010
 
+/*
+ * Non-passthrough (dom0) accesses are local PCI devices and use the given BDF.
+ * Passthrough (stubdom) accesses are through a PV frontend PCI device.  Those
+ * either have a BDF identical to the backend's BDF (xen-backend.passthrough=1)
+ * or a local virtual BDF (xen-backend.passthrough=0).
+ *
+ * We are always given the backend's BDF and need to look up the appropriate
+ * local BDF for sysfs access.
+ */
+static void xen_host_pci_fill_local_addr(XenHostPCIDevice *d, Error **errp)
+{
+unsigned int num_devs, len, i;
+unsigned int domain, bus, dev, func;
+char *be_path = NULL;
+char path[16];
+
+be_path = qemu_xen_xs_read(xenstore, 0, "device/pci/0/backend", &len);
+if (!be_path) {
+error_setg(errp, "Failed to read device/pci/0/backend");
+goto out;
+}
+
+if (xs_node_scanf(xenstore, 0, be_path, "num_devs", NULL,
+  "%d", &num_devs) != 1) {
+error_setg(errp, "Failed to read or parse %s/num_devs", be_path);
+goto out;
+}
+
+for (i = 0; i < num_devs; i++) {
+snprintf(path, sizeof(path), "dev-%d", i);
+if (xs_node_scanf(xenstore, 0, be_path, path, NULL,
+  "%x:%x:%x.%x", &domain, &bus, &dev, &func) != 4) {
+error_setg(errp, "Failed to read or parse %s/%s", be_path, path);
+goto out;
+}
+if (domain != d->domain ||
+bus != d->bus ||
+dev != d->dev ||
+func != d->func)
+continue;
+snprintf(path, sizeof(path), "vdev-%d", i);
+if (xs_node_scanf(xenstore, 0, be_path, path, NULL,
+  "%x:%x:%x.%x", &domain, &bus, &dev, &func) != 4) {
+error_setg(errp, "Failed to read or parse %s/%s", be_path, path);
+goto out;
+}
+d->local_domain = domain;
+d->local_bus = bus;
+d->local_dev = dev;
+d->local_func = func;
+goto out;
+}
+error_setg(errp, "Failed to find PCI device %x:%x:%x.%x in xenstore",
+   d->domain, d->bus, d->dev, d->func);
+
+out:
+free(be_path);
+}
+
 static void xen_host_pci_sysfs_path(const XenHostPCIDevice *d,
 const char *name, char *buf, ssize_t size)
 {
 int rc;
 
 rc = snprintf(buf, size, "/sys/bus/pci/devices/%04x:%02x:%02x.%d/%s",
-  d->domain, d->bus, d->dev, d->func, name);
+  d->local_domain, d->local_bus, d->local_dev, d->local_func,
+  name);
 assert(rc >= 0 && rc < size);
 }
 
@@ -342,6 +404,18 @@ void xen_host_pci_device_get(XenHostPCIDevice *d, uint16_t domain,
 d->dev = dev;
 d->func = func;
 
+if (xen_is_stubdomain) {
+xen_host_pci_fill_local_addr(d, errp);
+if (*errp) {
+goto error;
+}
+} else {
+d->local_domain = d->domain;
+d->local_bus = d->bus;
+d->local_dev = d->dev;
+d->local_func = d->func;
+}
+
 xen_host_pci_config_open(d, errp);
 if (*errp) {
 goto error;
diff --git a/hw/xen/xen-host-pci-device.h b/hw/xen/xen-host-pci-device.h
index 4d8d34ecb0..270dcb27f7 100644
--- a/hw/xen/xen-host-pci-device.h
+++ b/hw/xen/xen-host-pci-device.h
@@ -23,6 +23,12 @@ typedef struct XenHostPCIDevice {
 uint8_t dev;
 uint8_t func;
 
+/* different from the above in case of stubdomain */
+uint16_t local_domain;
+uint8_t local_bus;
+uint8_t local_dev;
+uint8_t local_func;
+
 uint16_t vendor_id;
 uint16_t device_id;
 uint32_t class_code;
-- 
Anthony PERARD
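
As a rough, self-contained illustration of the dev-%d/vdev-%d matching done by xen_host_pci_fill_local_addr() above: xenstore's "dev-%d" entries hold backend BDFs as "dddd:bb:ss.f" strings, and the matching index i selects the "vdev-%d" entry holding the BDF valid inside the stubdomain. The helper names and the array-based stand-in for xenstore below are hypothetical, not QEMU code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    unsigned int domain, bus, dev, func;
} Bdf;

/* Parse a "dddd:bb:ss.f" string; returns true on success. */
static bool parse_bdf(const char *s, Bdf *out)
{
    return sscanf(s, "%x:%x:%x.%x",
                  &out->domain, &out->bus, &out->dev, &out->func) == 4;
}

static bool bdf_equal(const Bdf *a, const Bdf *b)
{
    return a->domain == b->domain && a->bus == b->bus &&
           a->dev == b->dev && a->func == b->func;
}

/*
 * Given parallel arrays of backend ("dev-%d") and local ("vdev-%d") BDF
 * strings, fill *local with the local BDF matching `want`; returns
 * false if `want` is absent or a string fails to parse.
 */
static bool lookup_local_bdf(const char *devs[], const char *vdevs[],
                             int n, const Bdf *want, Bdf *local)
{
    for (int i = 0; i < n; i++) {
        Bdf cur;
        if (!parse_bdf(devs[i], &cur)) {
            return false;
        }
        if (bdf_equal(&cur, want)) {
            return parse_bdf(vdevs[i], local);
        }
    }
    return false;
}
```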




[PULL 1/3] hw/xen: detect when running inside stubdomain

2024-07-01 Thread anthony
From: Marek Marczykowski-Górecki 

Introduce a global xen_is_stubdomain variable, set when QEMU is running
inside a stubdomain instead of dom0. This will be relevant for subsequent
patches, as a few things like accessing PCI config space need to be done
differently.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Anthony PERARD 
Message-Id: 

Signed-off-by: Anthony PERARD 
---
 hw/i386/xen/xen-hvm.c | 22 ++
 include/hw/xen/xen.h  |  1 +
 system/globals.c  |  1 +
 3 files changed, 24 insertions(+)

diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c
index 006d219ad5..4f6446600c 100644
--- a/hw/i386/xen/xen-hvm.c
+++ b/hw/i386/xen/xen-hvm.c
@@ -584,6 +584,26 @@ static void xen_wakeup_notifier(Notifier *notifier, void *data)
 xc_set_hvm_param(xen_xc, xen_domid, HVM_PARAM_ACPI_S_STATE, 0);
 }
 
+static bool xen_check_stubdomain(struct xs_handle *xsh)
+{
+char *dm_path = g_strdup_printf(
+"/local/domain/%d/image/device-model-domid", xen_domid);
+char *val;
+int32_t dm_domid;
+bool is_stubdom = false;
+
+val = xs_read(xsh, 0, dm_path, NULL);
+if (val) {
+if (sscanf(val, "%d", &dm_domid) == 1) {
+is_stubdom = dm_domid != 0;
+}
+free(val);
+}
+
+g_free(dm_path);
+return is_stubdom;
+}
+
 void xen_hvm_init_pc(PCMachineState *pcms, MemoryRegion **ram_memory)
 {
 MachineState *ms = MACHINE(pcms);
@@ -596,6 +616,8 @@ void xen_hvm_init_pc(PCMachineState *pcms, MemoryRegion **ram_memory)
 
 xen_register_ioreq(state, max_cpus, &xen_memory_listener);
 
+xen_is_stubdomain = xen_check_stubdomain(state->xenstore);
+
 QLIST_INIT(&xen_physmap);
 xen_read_physmap(state);
 
diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h
index 37ecc91fc3..ecb89ecfc1 100644
--- a/include/hw/xen/xen.h
+++ b/include/hw/xen/xen.h
@@ -36,6 +36,7 @@ enum xen_mode {
 extern uint32_t xen_domid;
 extern enum xen_mode xen_mode;
 extern bool xen_domid_restrict;
+extern bool xen_is_stubdomain;
 
 int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num);
 int xen_set_pci_link_route(uint8_t link, uint8_t irq);
diff --git a/system/globals.c b/system/globals.c
index e353584201..d602a04fa2 100644
--- a/system/globals.c
+++ b/system/globals.c
@@ -60,6 +60,7 @@ bool qemu_uuid_set;
 uint32_t xen_domid;
 enum xen_mode xen_mode = XEN_DISABLED;
 bool xen_domid_restrict;
+bool xen_is_stubdomain;
 struct evtchn_backend_ops *xen_evtchn_ops;
 struct gnttab_backend_ops *xen_gnttab_ops;
 struct foreignmem_backend_ops *xen_foreignmem_ops;
-- 
Anthony PERARD




[PULL 0/3] xen queue 2024-07-01

2024-07-01 Thread anthony
From: Anthony PERARD 

The following changes since commit b6d32a06fc0984e537091cba08f2e1ed9f775d74:

  Merge tag 'pull-trivial-patches' of https://gitlab.com/mjt0k/qemu into 
staging (2024-06-30 16:12:24 -0700)

are available in the Git repository at:

  https://xenbits.xen.org/git-http/people/aperard/qemu-dm.git 
tags/pull-xen-20240701

for you to fetch changes up to 410b4d560dfa3b38a11ad19cf00180238651d9b7:

  xen-hvm: Avoid livelock while handling buffered ioreqs (2024-07-01 14:57:18 
+0200)


Xen queue:

* Improvement for running QEMU in a stubdomain.
* Improve handling of buffered ioreqs.


Marek Marczykowski-Górecki (2):
  hw/xen: detect when running inside stubdomain
  xen: fix stubdom PCI addr

Ross Lagerwall (1):
  xen-hvm: Avoid livelock while handling buffered ioreqs

 hw/i386/xen/xen-hvm.c| 22 +
 hw/xen/xen-host-pci-device.c | 76 +++-
 hw/xen/xen-host-pci-device.h |  6 
 hw/xen/xen-hvm-common.c  | 26 +--
 include/hw/xen/xen.h |  1 +
 system/globals.c |  1 +
 6 files changed, 122 insertions(+), 10 deletions(-)



[PULL 3/3] xen-hvm: Avoid livelock while handling buffered ioreqs

2024-07-01 Thread anthony
From: Ross Lagerwall 

A malicious or buggy guest may generate buffered ioreqs faster than
QEMU can process them in handle_buffered_iopage(). The result is a
livelock - QEMU continuously processes ioreqs on the main thread without
iterating through the main loop, which prevents handling other events,
processing timers, etc. Without QEMU handling other events, the guest
often becomes unusable and it is difficult to stop the source of
buffered ioreqs.

To avoid this, if we process a full page of buffered ioreqs, stop and
reschedule an immediate timer to continue processing them. This lets
QEMU go back to the main loop and catch up.
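
The scheme above can be modeled with a toy sketch: drain at most one page worth of buffered ioreqs per invocation, then decide whether to reschedule immediately, go idle, or re-arm the slow timer. The names below are illustrative, and IOREQ_BUFFER_SLOT_NUM is assumed to mirror the slot count of a buffered iopage:

```c
#include <assert.h>

#define IOREQ_BUFFER_SLOT_NUM 511   /* assumed slots per buffered iopage */

typedef enum {
    RESCHEDULE_NOW,   /* full page handled: more work likely pending */
    GO_IDLE,          /* nothing handled: wait for the next event */
    ARM_SLOW_TIMER,   /* partial page: poll again after a delay */
} NextAction;

/* Drain up to a page of requests; returns how many slots were consumed. */
static unsigned int drain_bounded(unsigned int pending)
{
    unsigned int handled = 0;
    while (handled < IOREQ_BUFFER_SLOT_NUM && handled < pending) {
        handled++;   /* stand-in for handling one ioreq and bumping rdptr */
    }
    return handled;
}

/* Mirror of the three-way decision in handle_buffered_io() above. */
static NextAction next_action(unsigned int handled)
{
    if (handled >= IOREQ_BUFFER_SLOT_NUM) {
        return RESCHEDULE_NOW;
    } else if (handled == 0) {
        return GO_IDLE;
    }
    return ARM_SLOW_TIMER;
}
```

The key property is that drain_bounded() always returns to its caller, so a guest flooding the ring can delay other main-loop work by at most one page of requests per timer tick.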

Signed-off-by: Ross Lagerwall 
Reviewed-by: Paul Durrant 
Message-Id: <20240404140833.1557953-1-ross.lagerw...@citrix.com>
Signed-off-by: Anthony PERARD 
---
 hw/xen/xen-hvm-common.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/hw/xen/xen-hvm-common.c b/hw/xen/xen-hvm-common.c
index b8ace1c368..3a9d6f981b 100644
--- a/hw/xen/xen-hvm-common.c
+++ b/hw/xen/xen-hvm-common.c
@@ -475,11 +475,11 @@ static void handle_ioreq(XenIOState *state, ioreq_t *req)
 }
 }
 
-static bool handle_buffered_iopage(XenIOState *state)
+static unsigned int handle_buffered_iopage(XenIOState *state)
 {
 buffered_iopage_t *buf_page = state->buffered_io_page;
 buf_ioreq_t *buf_req = NULL;
-bool handled_ioreq = false;
+unsigned int handled = 0;
 ioreq_t req;
 int qw;
 
@@ -492,7 +492,7 @@ static bool handle_buffered_iopage(XenIOState *state)
 req.count = 1;
 req.dir = IOREQ_WRITE;
 
-for (;;) {
+do {
 uint32_t rdptr = buf_page->read_pointer, wrptr;
 
 xen_rmb();
@@ -533,22 +533,30 @@ static bool handle_buffered_iopage(XenIOState *state)
 assert(!req.data_is_ptr);
 
 qatomic_add(&buf_page->read_pointer, qw + 1);
-handled_ioreq = true;
-}
+handled += qw + 1;
+} while (handled < IOREQ_BUFFER_SLOT_NUM);
 
-return handled_ioreq;
+return handled;
 }
 
 static void handle_buffered_io(void *opaque)
 {
+unsigned int handled;
 XenIOState *state = opaque;
 
-if (handle_buffered_iopage(state)) {
+handled = handle_buffered_iopage(state);
+if (handled >= IOREQ_BUFFER_SLOT_NUM) {
+/* We handled a full page of ioreqs. Schedule a timer to continue
+ * processing while giving other stuff a chance to run.
+ */
 timer_mod(state->buffered_io_timer,
-BUFFER_IO_MAX_DELAY + qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
-} else {
+qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
+} else if (handled == 0) {
 timer_del(state->buffered_io_timer);
 qemu_xen_evtchn_unmask(state->xce_handle, state->bufioreq_local_port);
+} else {
+timer_mod(state->buffered_io_timer,
+BUFFER_IO_MAX_DELAY + qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
 }
 }
 
-- 
Anthony PERARD




Re: [PATCH v8 2/8] xen: mapcache: Unmap first entries in buckets

2024-07-01 Thread Anthony PERARD
Hi all,

Following this commit, a test which installs Debian in a guest with OVMF
as firmware started to fail. QEMU exits with an error when GRUB is
running on the freshly installed Debian (I don't know if GRUB is
starting Linux or not).
The error is:
Bad ram offset 

Some logs:
http://logs.test-lab.xenproject.org/osstest/logs/186611/test-amd64-amd64-xl-qemuu-ovmf-amd64/info.html

Any idea? Something is trying to do something with the address "-1" when
it shouldn't?

Cheers,

Anthony

On Wed, May 29, 2024 at 04:07:33PM +0200, Edgar E. Iglesias wrote:
> From: "Edgar E. Iglesias" 
> 
> When invalidating memory ranges, if we happen to hit the first
> entry in a bucket we were never unmapping it. This was harmless
> for foreign mappings but now that we're looking to reuse the
> mapcache for transient grant mappings, we must unmap entries
> when invalidated.
> 
> Signed-off-by: Edgar E. Iglesias 
> Reviewed-by: Stefano Stabellini 
> ---
>  hw/xen/xen-mapcache.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/xen/xen-mapcache.c b/hw/xen/xen-mapcache.c
> index bc860f4373..ec95445696 100644
> --- a/hw/xen/xen-mapcache.c
> +++ b/hw/xen/xen-mapcache.c
> @@ -491,18 +491,23 @@ static void xen_invalidate_map_cache_entry_unlocked(MapCache *mc,
>  return;
>  }
>  entry->lock--;
> -if (entry->lock > 0 || pentry == NULL) {
> +if (entry->lock > 0) {
>  return;
>  }
>  
> -pentry->next = entry->next;
>  ram_block_notify_remove(entry->vaddr_base, entry->size, entry->size);
>  if (munmap(entry->vaddr_base, entry->size) != 0) {
>  perror("unmap fails");
>  exit(-1);
>  }
> +
>  g_free(entry->valid_mapping);
> -g_free(entry);
> +if (pentry) {
> +pentry->next = entry->next;
> +g_free(entry);
> +} else {
> +memset(entry, 0, sizeof *entry);
> +}
>  }
>  
>  typedef struct XenMapCacheData {
> -- 
> 2.40.1
> 
> 
-- 

Anthony Perard | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech



Re: [PATCH v6 0/3] Add support for the RAPL MSRs series

2024-06-26 Thread Anthony Harivel


Just a gentle ping for the above patch series.


Anthony Harivel, May 22, 2024 at 17:34:
> Dear maintainers, 
>
> First of all, thank you very much for your review of my patch 
> [1].
>
> In this version (v6), I have attempted to address all the problems 
> raised by Daniel and Paolo during the last review. 
>
> However, two open questions remain unanswered that would require the 
> attention of an x86 maintainer: 
>
> 1) Should I move the rapl feature from -kvm to -cpu? [2]
>
> 2) Should I already rename to "rapl_vmsr_*" in order to anticipate the 
>   future TMPI architecture? [end of 3] 
>
> Thank you again for your continued guidance. 
>
> v5 -> v6
> 
> - Better error consistency in qio_channel_get_peerpid()
> - Memory leak in g_strdup_printf/g_build_filename usage corrected
> - Renamed several structs with a "vmsr_*" prefix for a cleaner namespace
> - Renamed several structs with a "guest_*" prefix for better comprehension
> - Optimizations suggested by Daniel
> - Crash problem solved [4]
>
> v4 -> v5
> 
>
> - correct qio_channel_get_peerpid: return pid = -1 in case of error
> - Vmsr_helper: compile only for x86
> - Vmsr_helper: use qio_channel_read/write_all
> - Vmsr_helper: abandon user/group
> - Vmsr_energy.c: correct all error_report
> - Vmsr thread: compute default socket path only once
> - Vmsr thread: open socket only once
> - Pass relevant QEMU CI
>
> v3 -> v4
> 
>
> - Correct memory leaks with AddressSanitizer  
> - Add a sanity check in QEMU and qemu-vmsr-helper to verify that the host 
>   is Intel and that RAPL is activated.
> - Rename poorly named variables for easier comprehension
> - Move the code that checks the host before creating the VMSR thread
> - Get rid of libnuma: create a function that reads the host topology 
>   from sysfs instead
>
> v2 -> v3
> 
>
> - Move all memory allocations from Clib to Glib
> - Compile on *BSD (works on Linux only)
> - No more limitation on the virtual package: each vCPU that belongs to 
>   the same virtual package gives the same results, as expected on 
>   a real CPU.
>   This has been tested with topologies like:
>  -smp 4,sockets=2
>  -smp 16,sockets=4,cores=2,threads=2
>
> v1 -> v2
> 
>
> - To overcome CVE-2020-8694, a socket communication channel to a 
>   privileged helper is created
> - Add the privileged helper (qemu-vmsr-helper)
> - Add SO_PEERCRED in qio channel socket
>
> RFC -> v1
> -----
>
> - Add a vmsr_* prefix to all vmsr-specific functions
> - Replace malloc()/calloc()... with their glib equivalents
> - Pre-allocate all dynamic memory when possible
> - Add documentation of the implementation, limitations and usage
>
> Best regards,
> Anthony
>
> [1]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg01570.html
> [2]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg03947.html
> [3]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02350.html
> [4]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02481.html
>
> Anthony Harivel (3):
>   qio: add support for SO_PEERCRED for socket channel
>   tools: build qemu-vmsr-helper
>   Add support for RAPL MSRs in KVM/Qemu
>
>  accel/kvm/kvm-all.c  |  27 ++
>  contrib/systemd/qemu-vmsr-helper.service |  15 +
>  contrib/systemd/qemu-vmsr-helper.socket  |   9 +
>  docs/specs/index.rst |   1 +
>  docs/specs/rapl-msr.rst  | 155 +++
>  docs/tools/index.rst |   1 +
>  docs/tools/qemu-vmsr-helper.rst  |  89 
>  include/io/channel.h |  21 +
>  include/sysemu/kvm_int.h |  32 ++
>  io/channel-socket.c  |  28 ++
>  io/channel.c |  13 +
>  meson.build  |   7 +
>  target/i386/cpu.h|   8 +
>  target/i386/kvm/kvm.c| 431 +-
>  target/i386/kvm/meson.build  |   1 +
>  target/i386/kvm/vmsr_energy.c| 337 ++
>  target/i386/kvm/vmsr_energy.h|  99 +
>  tools/i386/qemu-vmsr-helper.c| 530 +++
>  tools/i386/rapl-msr-index.h  |  28 ++
>  19 files changed, 1831 insertions(+), 1 deletion(-)
>  create mode 100644 contrib/systemd/qemu-vmsr-helper.service
>  create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
>  create mode 100644 docs/specs/rapl-msr.rst
>  create mode 100644 docs/tools/qemu-vmsr-helper.rst
>  create mode 100644 target/i386/kvm/vmsr_energy.c
>  create mode 100644 target/i386/kvm/vmsr_energy.h
>  create mode 100644 tools/i386/qemu-vmsr-helper.c
>  create mode 100644 tools/i386/rapl-msr-index.h
>
> -- 
> 2.45.1






Re: [PATCH 05/16] vfio/helpers: Make vfio_device_get_name() return bool

2024-05-24 Thread Anthony Krowiak



On 5/15/24 4:20 AM, Zhenzhong Duan wrote:

This is to follow the coding standard in qapi/error.h to return bool
for bool-valued functions.

Suggested-by: Cédric Le Goater 
Signed-off-by: Zhenzhong Duan 
---
  include/hw/vfio/vfio-common.h | 2 +-
  hw/vfio/ap.c  | 2 +-
  hw/vfio/ccw.c | 2 +-
  hw/vfio/helpers.c | 8 
  hw/vfio/pci.c | 2 +-
  hw/vfio/platform.c| 5 ++---
  6 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fdce13f0f2..d9891c796f 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -280,7 +280,7 @@ int vfio_get_dirty_bitmap(const VFIOContainerBase *bcontainer, uint64_t iova,
uint64_t size, ram_addr_t ram_addr, Error **errp);
  
  /* Returns 0 on success, or a negative errno. */

-int vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
+bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
  void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
  void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
DeviceState *dev, bool ram_discard);
diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index d8a9615fee..c12531a788 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -158,7 +158,7 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
  VFIOAPDevice *vapdev = VFIO_AP_DEVICE(dev);
  VFIODevice *vbasedev = &vapdev->vdev;
  
-if (vfio_device_get_name(vbasedev, errp) < 0) {

+if (!vfio_device_get_name(vbasedev, errp)) {
  return;
  }



snip ...


  
diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c

index 93e6fef6de..a69b4411e5 100644
--- a/hw/vfio/helpers.c
+++ b/hw/vfio/helpers.c
@@ -605,7 +605,7 @@ bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type)
  return ret;
  }
  
-int vfio_device_get_name(VFIODevice *vbasedev, Error **errp)

+bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
  {
  ERRP_GUARD();
  struct stat st;
@@ -614,7 +614,7 @@ int vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
   if (stat(vbasedev->sysfsdev, &st) < 0) {
  error_setg_errno(errp, errno, "no such host device");
  error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->sysfsdev);
-return -errno;
+return false;
  }
  /* User may specify a name, e.g: VFIO platform device */
  if (!vbasedev->name) {
@@ -623,7 +623,7 @@ int vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
  } else {
  if (!vbasedev->iommufd) {
  error_setg(errp, "Use FD passing only with iommufd backend");
-return -EINVAL;
+return false;
  }
  /*
   * Give a name with fd so any function printing out vbasedev->name
@@ -634,7 +634,7 @@ int vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
  }
  }
  
-return 0;

+return true;
  }



For the two functions above:

Reviewed-by: Anthony Krowiak 


  



snip ...





Re: [PATCH 04/16] vfio/helpers: Make vfio_set_irq_signaling() return bool

2024-05-24 Thread Anthony Krowiak



On 5/15/24 4:20 AM, Zhenzhong Duan wrote:

This is to follow the coding standard in qapi/error.h to return bool
for bool-valued functions.

Suggested-by: Cédric Le Goater 
Signed-off-by: Zhenzhong Duan 
---
  include/hw/vfio/vfio-common.h |  4 ++--
  hw/vfio/ap.c  |  8 +++
  hw/vfio/ccw.c |  8 +++
  hw/vfio/helpers.c | 18 ++--
  hw/vfio/pci.c | 40 ++-
  hw/vfio/platform.c| 18 +++-
  6 files changed, 46 insertions(+), 50 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 2d8da32df4..fdce13f0f2 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -207,8 +207,8 @@ void vfio_spapr_container_deinit(VFIOContainer *container);
  void vfio_disable_irqindex(VFIODevice *vbasedev, int index);
  void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index);
  void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index);
-int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,
-   int action, int fd, Error **errp);
+bool vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,
+int action, int fd, Error **errp);
  void vfio_region_write(void *opaque, hwaddr addr,
 uint64_t data, unsigned size);
  uint64_t vfio_region_read(void *opaque,
diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index ba653ef70f..d8a9615fee 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -117,8 +117,8 @@ static bool vfio_ap_register_irq_notifier(VFIOAPDevice *vapdev,
  fd = event_notifier_get_fd(notifier);
  qemu_set_fd_handler(fd, fd_read, NULL, vapdev);
  
-if (vfio_set_irq_signaling(vdev, irq, 0, VFIO_IRQ_SET_ACTION_TRIGGER, fd,

-   errp)) {
+if (!vfio_set_irq_signaling(vdev, irq, 0, VFIO_IRQ_SET_ACTION_TRIGGER, fd,
+errp)) {
  qemu_set_fd_handler(fd, NULL, NULL, vapdev);
  event_notifier_cleanup(notifier);
  }
@@ -141,8 +141,8 @@ static void vfio_ap_unregister_irq_notifier(VFIOAPDevice *vapdev,
  return;
  }
  
-if (vfio_set_irq_signaling(&vapdev->vdev, irq, 0,
-   VFIO_IRQ_SET_ACTION_TRIGGER, -1, &err)) {
+if (!vfio_set_irq_signaling(&vapdev->vdev, irq, 0,
+VFIO_IRQ_SET_ACTION_TRIGGER, -1, &err)) {
  warn_reportf_err(err, VFIO_MSG_PREFIX, vapdev->vdev.name);
  }
  
diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c

index 89bb980167..1f578a3c75 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -434,8 +434,8 @@ static bool vfio_ccw_register_irq_notifier(VFIOCCWDevice *vcdev,
  fd = event_notifier_get_fd(notifier);
  qemu_set_fd_handler(fd, fd_read, NULL, vcdev);
  
-if (vfio_set_irq_signaling(vdev, irq, 0,

-   VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
+if (!vfio_set_irq_signaling(vdev, irq, 0,
+VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
  qemu_set_fd_handler(fd, NULL, NULL, vcdev);
  event_notifier_cleanup(notifier);
  }
@@ -464,8 +464,8 @@ static void vfio_ccw_unregister_irq_notifier(VFIOCCWDevice *vcdev,
  return;
  }
  
-if (vfio_set_irq_signaling(&vcdev->vdev, irq, 0,
-   VFIO_IRQ_SET_ACTION_TRIGGER, -1, &err)) {
+if (!vfio_set_irq_signaling(&vcdev->vdev, irq, 0,
+VFIO_IRQ_SET_ACTION_TRIGGER, -1, &err)) {
  warn_reportf_err(err, VFIO_MSG_PREFIX, vcdev->vdev.name);
  }
  
diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c

index 0bb7b40a6a..93e6fef6de 100644
--- a/hw/vfio/helpers.c
+++ b/hw/vfio/helpers.c
@@ -107,12 +107,12 @@ static const char *index_to_str(VFIODevice *vbasedev, int index)
  }
  }
  
-int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,

-   int action, int fd, Error **errp)
+bool vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,
+int action, int fd, Error **errp)
  {
  ERRP_GUARD();
  g_autofree struct vfio_irq_set *irq_set = NULL;
-int argsz, ret = 0;
+int argsz;
  const char *name;
  int32_t *pfd;
  
@@ -127,15 +127,11 @@ int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,

   pfd = (int32_t *)&irq_set->data;
  *pfd = fd;
  
-if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {

-ret = -errno;
+if (!ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
+return true;



With this change, I don't see where the allocation of irq_set is freed.

g_free(irq_set);

What am I missing?



  }
  
-if (!ret) {

-return 0;
-}
-
-error_setg_errno(errp, -ret, "VFIO_DEVICE_SET_IRQS failure");
+error_setg_errno(errp, errno, "VFIO_DEVICE_SET_IRQS failure");
  
  name = 

Re: [PATCH 3/7] hw/s390x/ccw: Remove local Error variable from s390_ccw_realize()

2024-05-24 Thread Anthony Krowiak



On 5/22/24 1:01 PM, Cédric Le Goater wrote:

Use the 'Error **errp' argument of s390_ccw_realize() instead and
remove the error_propagate() call.

Signed-off-by: Cédric Le Goater 



Reviewed-by: Anthony Krowiak 



---
  hw/s390x/s390-ccw.c | 13 +
  1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/hw/s390x/s390-ccw.c b/hw/s390x/s390-ccw.c
index 
4b8ede701df90949720262b6fc1b65f4e505e34d..b3d14c61d732880a651edcf28a040ca723cb9f5b
 100644
--- a/hw/s390x/s390-ccw.c
+++ b/hw/s390x/s390-ccw.c
@@ -115,13 +115,12 @@ static void s390_ccw_realize(S390CCWDevice *cdev, char *sysfsdev, Error **errp)
  DeviceState *parent = DEVICE(ccw_dev);
  SubchDev *sch;
  int ret;
-Error *err = NULL;
  
-if (!s390_ccw_get_dev_info(cdev, sysfsdev, &err)) {

-goto out_err_propagate;
+if (!s390_ccw_get_dev_info(cdev, sysfsdev, errp)) {
+return;
  }
  
-sch = css_create_sch(ccw_dev->devno, &err);

+sch = css_create_sch(ccw_dev->devno, errp);
  if (!sch) {
  goto out_mdevid_free;
  }
@@ -132,12 +131,12 @@ static void s390_ccw_realize(S390CCWDevice *cdev, char *sysfsdev, Error **errp)
  ccw_dev->sch = sch;
  ret = css_sch_build_schib(sch, >hostid);
  if (ret) {
-error_setg_errno(&err, -ret, "%s: Failed to build initial schib",
+error_setg_errno(errp, -ret, "%s: Failed to build initial schib",
   __func__);
  goto out_err;
  }
  
-if (!ck->realize(ccw_dev, &err)) {

+if (!ck->realize(ccw_dev, errp)) {
  goto out_err;
  }
  
@@ -151,8 +150,6 @@ out_err:

  g_free(sch);
  out_mdevid_free:
  g_free(cdev->mdevid);
-out_err_propagate:
-error_propagate(errp, err);
  }
  
  static void s390_ccw_unrealize(S390CCWDevice *cdev)




Re: [PATCH 4/7] s390x/css: Make S390CCWDeviceClass::realize return bool

2024-05-24 Thread Anthony Krowiak



On 5/22/24 1:01 PM, Cédric Le Goater wrote:

Since the realize() handler of S390CCWDeviceClass takes an 'Error **'
argument, best practices suggest returning a bool. See the qapi/error.h
Rules section. While at it, modify the call in vfio_ccw_realize().

Signed-off-by: Cédric Le Goater 



Reviewed-by: Anthony Krowiak 



---
  include/hw/s390x/s390-ccw.h | 2 +-
  hw/s390x/s390-ccw.c | 7 ---
  hw/vfio/ccw.c   | 3 +--
  3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/hw/s390x/s390-ccw.h b/include/hw/s390x/s390-ccw.h
index 
2c807ee3a1ae8d85460fe65be8a62c64f212fe4b..2e0a70998132070996d6b0d083b8ddba5b9b87dc
 100644
--- a/include/hw/s390x/s390-ccw.h
+++ b/include/hw/s390x/s390-ccw.h
@@ -31,7 +31,7 @@ struct S390CCWDevice {
  
  struct S390CCWDeviceClass {

  CCWDeviceClass parent_class;
-void (*realize)(S390CCWDevice *dev, char *sysfsdev, Error **errp);
+bool (*realize)(S390CCWDevice *dev, char *sysfsdev, Error **errp);
  void (*unrealize)(S390CCWDevice *dev);
  IOInstEnding (*handle_request) (SubchDev *sch);
  int (*handle_halt) (SubchDev *sch);
diff --git a/hw/s390x/s390-ccw.c b/hw/s390x/s390-ccw.c
index 
b3d14c61d732880a651edcf28a040ca723cb9f5b..3c0975055089c3629dd76ce2e1484a4ef66d8d41
 100644
--- a/hw/s390x/s390-ccw.c
+++ b/hw/s390x/s390-ccw.c
@@ -108,7 +108,7 @@ static bool s390_ccw_get_dev_info(S390CCWDevice *cdev,
  return true;
  }
  
-static void s390_ccw_realize(S390CCWDevice *cdev, char *sysfsdev, Error **errp)

+static bool s390_ccw_realize(S390CCWDevice *cdev, char *sysfsdev, Error **errp)
  {
  CcwDevice *ccw_dev = CCW_DEVICE(cdev);
  CCWDeviceClass *ck = CCW_DEVICE_GET_CLASS(ccw_dev);
@@ -117,7 +117,7 @@ static void s390_ccw_realize(S390CCWDevice *cdev, char 
*sysfsdev, Error **errp)
  int ret;
  
  if (!s390_ccw_get_dev_info(cdev, sysfsdev, errp)) {

-return;
+return false;
  }
  
  sch = css_create_sch(ccw_dev->devno, errp);

@@ -142,7 +142,7 @@ static void s390_ccw_realize(S390CCWDevice *cdev, char 
*sysfsdev, Error **errp)
  
  css_generate_sch_crws(sch->cssid, sch->ssid, sch->schid,

parent->hotplugged, 1);
-return;
+return true;
  
  out_err:

  css_subch_assign(sch->cssid, sch->ssid, sch->schid, sch->devno, NULL);
@@ -150,6 +150,7 @@ out_err:
  g_free(sch);
  out_mdevid_free:
  g_free(cdev->mdevid);
+return false;
  }
  
  static void s390_ccw_unrealize(S390CCWDevice *cdev)

diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index 
2600e62e37238779800dc2b3a0bd315d7633017b..9a8e052711fe2f7c067c52808b2af30d0ebfee0c
 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -582,8 +582,7 @@ static void vfio_ccw_realize(DeviceState *dev, Error **errp)
  
  /* Call the class init function for subchannel. */

  if (cdc->realize) {
-cdc->realize(cdev, vcdev->vdev.sysfsdev, &err);
-if (err) {
+if (!cdc->realize(cdev, vcdev->vdev.sysfsdev, &err)) {
  goto out_err_propagate;
  }
  }




Re: [PATCH 7/7] vfio/{ap, ccw}: Use warn_report_err() for IRQ notifier registration errors

2024-05-24 Thread Anthony Krowiak



On 5/22/24 1:01 PM, Cédric Le Goater wrote:

vfio_ccw_register_irq_notifier() and vfio_ap_register_irq_notifier()
errors are currently reported using error_report_err(). Since they are
not considered as failing conditions, using warn_report_err() is more
appropriate.

Signed-off-by: Cédric Le Goater 



Reviewed-by: Anthony Krowiak 



---
  hw/vfio/ap.c  | 2 +-
  hw/vfio/ccw.c | 2 +-
  2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index 
c12531a7886a2fe87598be0861fba5923bd2c206..0c4354e3e70169ec072e16da0919936647d1d351
 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -172,7 +172,7 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
   * Report this error, but do not make it a failing condition.
   * Lack of this IRQ in the host does not prevent normal operation.
   */
-error_report_err(err);
+warn_report_err(err);
  }
  
  return;

diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index 
36f2677a448c5e31523dcc3de7d973ec70e4a13c..1f8e1272c7555cd0a770481d1ae92988f6e2e62e
 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -616,7 +616,7 @@ static void vfio_ccw_realize(DeviceState *dev, Error **errp)
   * Report this error, but do not make it a failing condition.
   * Lack of this IRQ in the host does not prevent normal operation.
   */
-error_report_err(err);
+warn_report_err(err);
  }
  
  return;




Re: [PATCH 2/7] s390x/css: Make CCWDeviceClass::realize return bool

2024-05-24 Thread Anthony Krowiak



On 5/22/24 1:01 PM, Cédric Le Goater wrote:

Since the realize() handler of CCWDeviceClass takes an 'Error **'
argument, best practices suggest returning a bool. See the qapi/error.h
Rules section. While at it, modify the call in s390_ccw_realize().

Signed-off-by: Cédric Le Goater 



Reviewed-by: Anthony Krowiak 



---
  hw/s390x/ccw-device.h | 2 +-
  hw/s390x/ccw-device.c | 3 ++-
  hw/s390x/s390-ccw.c   | 3 +--
  3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/s390x/ccw-device.h b/hw/s390x/ccw-device.h
index 
6dff95225df11c63f9b66975019026b215c8c448..5feeb0ee7a268b8709043b5bbc56b06e707a448d
 100644
--- a/hw/s390x/ccw-device.h
+++ b/hw/s390x/ccw-device.h
@@ -36,7 +36,7 @@ extern const VMStateDescription vmstate_ccw_dev;
  struct CCWDeviceClass {
  DeviceClass parent_class;
  void (*unplug)(HotplugHandler *, DeviceState *, Error **);
-void (*realize)(CcwDevice *, Error **);
+bool (*realize)(CcwDevice *, Error **);
  void (*refill_ids)(CcwDevice *);
  };
  
diff --git a/hw/s390x/ccw-device.c b/hw/s390x/ccw-device.c

index 
fb8c1acc64d5002c861a4913f292d8346dbef192..a7d682e5af9ce90e7e2fad8c24b30e39328c7cf4
 100644
--- a/hw/s390x/ccw-device.c
+++ b/hw/s390x/ccw-device.c
@@ -31,9 +31,10 @@ static void ccw_device_refill_ids(CcwDevice *dev)
  dev->subch_id.valid = true;
  }
  
-static void ccw_device_realize(CcwDevice *dev, Error **errp)

+static bool ccw_device_realize(CcwDevice *dev, Error **errp)
  {
  ccw_device_refill_ids(dev);
+return true;
  }
  
  static Property ccw_device_properties[] = {

diff --git a/hw/s390x/s390-ccw.c b/hw/s390x/s390-ccw.c
index 
a06e91dfb318e3500324851488c56806fa46c08d..4b8ede701df90949720262b6fc1b65f4e505e34d
 100644
--- a/hw/s390x/s390-ccw.c
+++ b/hw/s390x/s390-ccw.c
@@ -137,8 +137,7 @@ static void s390_ccw_realize(S390CCWDevice *cdev, char 
*sysfsdev, Error **errp)
  goto out_err;
  }
  
-ck->realize(ccw_dev, &err);

-if (err) {
+if (!ck->realize(ccw_dev, &err)) {
  goto out_err;
  }
  




Re: [PATCH 5/7] vfio/ccw: Use the 'Error **errp' argument of vfio_ccw_realize()

2024-05-24 Thread Anthony Krowiak



On 5/22/24 1:01 PM, Cédric Le Goater wrote:

The local error variable is kept for vfio_ccw_register_irq_notifier()
because it is not considered as a failing condition. We will change
how error reporting is done in following changes.

Remove the error_propagate() call.

Cc: Zhenzhong Duan 
Signed-off-by: Cédric Le Goater 



Reviewed-by: Anthony Krowiak 



---
  hw/vfio/ccw.c | 12 +---
  1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index 
9a8e052711fe2f7c067c52808b2af30d0ebfee0c..a468fa2342b97e0ee36bd5fb8443025cc90a0453
 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -582,8 +582,8 @@ static void vfio_ccw_realize(DeviceState *dev, Error **errp)
  
  /* Call the class init function for subchannel. */

  if (cdc->realize) {
-if (!cdc->realize(cdev, vcdev->vdev.sysfsdev, &err)) {
-goto out_err_propagate;
+if (!cdc->realize(cdev, vcdev->vdev.sysfsdev, errp)) {
+return;
  }
  }
  
@@ -596,17 +596,17 @@ static void vfio_ccw_realize(DeviceState *dev, Error **errp)

  goto out_attach_dev_err;
  }
  
-if (!vfio_ccw_get_region(vcdev, &err)) {

+if (!vfio_ccw_get_region(vcdev, errp)) {
  goto out_region_err;
  }
  
-if (!vfio_ccw_register_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX, &err)) {

+if (!vfio_ccw_register_irq_notifier(vcdev, VFIO_CCW_IO_IRQ_INDEX, errp)) {
  goto out_io_notifier_err;
  }
  
  if (vcdev->crw_region) {

  if (!vfio_ccw_register_irq_notifier(vcdev, VFIO_CCW_CRW_IRQ_INDEX,
-&err)) {
+errp)) {
  goto out_irq_notifier_err;
  }
  }
@@ -634,8 +634,6 @@ out_attach_dev_err:
  if (cdc->unrealize) {
  cdc->unrealize(cdev);
  }
-out_err_propagate:
-error_propagate(errp, err);
  }
  
  static void vfio_ccw_unrealize(DeviceState *dev)




Re: [PATCH 1/7] hw/s390x/ccw: Make s390_ccw_get_dev_info() return a bool

2024-05-24 Thread Anthony Krowiak



On 5/22/24 1:01 PM, Cédric Le Goater wrote:

Since s390_ccw_get_dev_info() takes an 'Error **' argument, best
practices suggest returning a bool. See the qapi/error.h Rules
section. While at it, modify the call in s390_ccw_realize().

Signed-off-by: Cédric Le Goater 



Reviewed-by: Anthony Krowiak 



---
  hw/s390x/s390-ccw.c | 12 ++--
  1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/hw/s390x/s390-ccw.c b/hw/s390x/s390-ccw.c
index 
5261e66724f1cc3157b9413b0d5fdf5289c92503..a06e91dfb318e3500324851488c56806fa46c08d
 100644
--- a/hw/s390x/s390-ccw.c
+++ b/hw/s390x/s390-ccw.c
@@ -71,7 +71,7 @@ IOInstEnding s390_ccw_store(SubchDev *sch)
  return ret;
  }
  
-static void s390_ccw_get_dev_info(S390CCWDevice *cdev,

+static bool s390_ccw_get_dev_info(S390CCWDevice *cdev,
char *sysfsdev,
Error **errp)
  {
@@ -84,12 +84,12 @@ static void s390_ccw_get_dev_info(S390CCWDevice *cdev,
  error_setg(errp, "No host device provided");
  error_append_hint(errp,
"Use -device vfio-ccw,sysfsdev=PATH_TO_DEVICE\n");
-return;
+return false;
  }
  
  if (!realpath(sysfsdev, dev_path)) {

  error_setg_errno(errp, errno, "Host device '%s' not found", sysfsdev);
-return;
+return false;
  }
  
  cdev->mdevid = g_path_get_basename(dev_path);

@@ -98,13 +98,14 @@ static void s390_ccw_get_dev_info(S390CCWDevice *cdev,
  tmp = g_path_get_basename(tmp_dir);
  if (sscanf(tmp, "%2x.%1x.%4x", &cssid, &ssid, &devid) != 3) {
  error_setg_errno(errp, errno, "Failed to read %s", tmp);
-return;
+return false;
  }
  
  cdev->hostid.cssid = cssid;

  cdev->hostid.ssid = ssid;
  cdev->hostid.devid = devid;
  cdev->hostid.valid = true;
+return true;
  }
  
  static void s390_ccw_realize(S390CCWDevice *cdev, char *sysfsdev, Error **errp)

@@ -116,8 +117,7 @@ static void s390_ccw_realize(S390CCWDevice *cdev, char 
*sysfsdev, Error **errp)
  int ret;
  Error *err = NULL;
  
-s390_ccw_get_dev_info(cdev, sysfsdev, &err);

-if (err) {
+if (!s390_ccw_get_dev_info(cdev, sysfsdev, &err)) {
  goto out_err_propagate;
  }
  




[PATCH v6 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-05-22 Thread Anthony Harivel
Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
interface (Running Average Power Limit) for advertising the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM,
etc.).

The consumption is reported via MSRs (model specific registers) like
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
64 bits registers that represent the accumulated energy consumption in
micro Joules. They are updated by microcode every ~1ms.
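For context, not part of the patch: the arithmetic behind these counters can be sketched as follows, assuming the Intel SDM layout in which MSR_RAPL_POWER_UNIT (0x606) carries the Energy Status Units field in bits 12:8 and the MSR_PKG_ENERGY_STATUS counter accumulates in its low 32 bits (wrapping modulo 2^32):

```python
# Sketch (not from the patch): decoding RAPL package-energy readings.
MSR_RAPL_POWER_UNIT = 0x606    # energy unit in bits 12:8 (per Intel SDM)
MSR_PKG_ENERGY_STATUS = 0x611  # 32-bit accumulating counter

def energy_unit_joules(power_unit_msr: int) -> float:
    """Energy unit in joules: 1 / 2^ESU, with ESU taken from bits 12:8."""
    esu = (power_unit_msr >> 8) & 0x1F
    return 1.0 / (1 << esu)

def counter_delta(now: int, prev: int) -> int:
    """The counter wraps at 32 bits; take the delta modulo 2^32."""
    return (now - prev) & 0xFFFFFFFF

def consumed_joules(now: int, prev: int, power_unit_msr: int) -> float:
    """Energy consumed between two samples of MSR_PKG_ENERGY_STATUS."""
    return counter_delta(now, prev) * energy_unit_joules(power_unit_msr)
```

With the common ESU of 16 the unit is 2^-16 J (about 15.3 µJ per count), which is why the counter can wrap within hours and the wrap handling matters.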

For now, KVM always returns 0 when the guest requests the value of
these MSRs. Use the KVM MSR filtering mechanism to allow QEMU to handle
these MSRs dynamically in userspace.

To limit the amount of system calls for every MSR call, create a new
thread in QEMU that updates the "virtual" MSR values asynchronously.

Each vCPU has its own vMSR to reflect the independence of vCPUs. The
thread updates the vMSR values with the ratio of energy consumed of
the whole physical CPU package the vCPU thread runs on and the
thread's utime and stime values.

All other non-vCPU threads are also taken into account. Their energy
consumption is evenly distributed among all vCPUs threads running on
the same physical CPU package.
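The apportioning described above can be sketched like this (a simplified model, not the code from target/i386/kvm/vmsr_energy.c): each vCPU gets a share of the package's energy delta proportional to its utime+stime ticks, and the energy attributed to non-vCPU threads is split evenly among the vCPUs:

```python
# Sketch only: distributing a package energy delta among vCPU threads.
def apportion(pkg_delta_uj: float,
              vcpu_ticks: list[int],
              nonvcpu_ticks: int) -> list[float]:
    """Per-vCPU energy shares (micro-joules) for one sampling period."""
    total = sum(vcpu_ticks) + nonvcpu_ticks
    if total == 0:
        return [0.0] * len(vcpu_ticks)
    # Energy consumed by the non-vCPU threads, spread evenly over the vCPUs.
    evenly = (pkg_delta_uj * nonvcpu_ticks / total) / len(vcpu_ticks)
    return [pkg_delta_uj * t / total + evenly for t in vcpu_ticks]
```

Note that the shares sum back to the package delta, so the guest never sees more energy than the host package actually reported.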

To overcome the problem that reading the RAPL MSRs requires privileged
access, socket communication between QEMU and the qemu-vmsr-helper is
mandatory. You can specify the socket path in the parameter.

This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock

Actual limitation:
- Works only on Intel host CPUs because AMD CPUs use different MSR
  addresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
  the moment.

Signed-off-by: Anthony Harivel 
---
 accel/kvm/kvm-all.c   |  27 +++
 docs/specs/index.rst  |   1 +
 docs/specs/rapl-msr.rst   | 155 
 include/sysemu/kvm_int.h  |  32 +++
 target/i386/cpu.h |   8 +
 target/i386/kvm/kvm.c | 431 +-
 target/i386/kvm/meson.build   |   1 +
 target/i386/kvm/vmsr_energy.c | 344 +++
 target/i386/kvm/vmsr_energy.h |  99 
 9 files changed, 1097 insertions(+), 1 deletion(-)
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index c0be9f5eedb8..f455e6b987b4 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3745,6 +3745,21 @@ static void kvm_set_device(Object *obj,
 s->device = g_strdup(value);
 }
 
+static void kvm_set_kvm_rapl(Object *obj, bool value, Error **errp)
+{
+KVMState *s = KVM_STATE(obj);
+s->msr_energy.enable = value;
+}
+
+static void kvm_set_kvm_rapl_socket_path(Object *obj,
+ const char *str,
+ Error **errp)
+{
+KVMState *s = KVM_STATE(obj);
+g_free(s->msr_energy.socket_path);
+s->msr_energy.socket_path = g_strdup(str);
+}
+
 static void kvm_accel_instance_init(Object *obj)
 {
 KVMState *s = KVM_STATE(obj);
@@ -3764,6 +3779,7 @@ static void kvm_accel_instance_init(Object *obj)
 s->xen_gnttab_max_frames = 64;
 s->xen_evtchn_max_pirq = 256;
 s->device = NULL;
+s->msr_energy.enable = false;
 }
 
 /**
@@ -3808,6 +3824,17 @@ static void kvm_accel_class_init(ObjectClass *oc, void 
*data)
 object_class_property_set_description(oc, "device",
 "Path to the device node to use (default: /dev/kvm)");
 
+object_class_property_add_bool(oc, "rapl",
+   NULL,
+   kvm_set_kvm_rapl);
+object_class_property_set_description(oc, "rapl",
+"Allow energy related MSRs for RAPL interface in Guest");
+
+object_class_property_add_str(oc, "rapl-helper-socket", NULL,
+  kvm_set_kvm_rapl_socket_path);
+object_class_property_set_description(oc, "rapl-helper-socket",
+"Socket Path for comminucating with the Virtual MSR helper daemon");
+
 kvm_arch_accel_class_init(oc);
 }
 
diff --git a/docs/specs/index.rst b/docs/specs/index.rst
index 1484e3e76077..e738ea7d102f 100644
--- a/docs/specs/index.rst
+++ b/docs/specs/index.rst
@@ -33,3 +33,4 @@ guest hardware that is specific to QEMU.
virt-ctlr
vmcoreinfo
vmgenid
+   rapl-msr
diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
new file mode 100644
index ..1202ee89bee0
--- /dev/null
+++ b/docs/specs/rapl-msr.rst
@@ -0,0 +1,155 @@
+
+RAPL MSR support
+
+
+The RAPL interface (Running Average Power Limit) is advertising the accumulated
+energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).
+
+The consumption is reported via MSRs (model specific registers) like
+MSR

[PATCH v6 1/3] qio: add support for SO_PEERCRED for socket channel

2024-05-22 Thread Anthony Harivel
The function qio_channel_get_peercred() returns a pointer to the
credentials of the peer process connected to this socket.

This credentials structure is defined in <sys/socket.h> as follows:

struct ucred {
pid_t pid;/* Process ID of the sending process */
uid_t uid;/* User ID of the sending process */
gid_t gid;/* Group ID of the sending process */
};

The use of this function is possible only for connected AF_UNIX stream
sockets and for AF_UNIX stream and datagram socket pairs.

On platforms other than Linux, the function returns 0.

Signed-off-by: Anthony Harivel 
---
 include/io/channel.h | 21 +
 io/channel-socket.c  | 28 
 io/channel.c | 13 +
 3 files changed, 62 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index 7986c49c713a..bdf0bca92ae2 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -160,6 +160,9 @@ struct QIOChannelClass {
   void *opaque);
 int (*io_flush)(QIOChannel *ioc,
 Error **errp);
+int (*io_peerpid)(QIOChannel *ioc,
+   unsigned int *pid,
+   Error **errp);
 };
 
 /* General I/O handling functions */
@@ -981,4 +984,22 @@ int coroutine_mixed_fn 
qio_channel_writev_full_all(QIOChannel *ioc,
 int qio_channel_flush(QIOChannel *ioc,
   Error **errp);
 
+/**
+ * qio_channel_get_peercred:
+ * @ioc: the channel object
+ * @pid: pointer to pid
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Returns the pid of the peer process connected to this socket.
+ *
+ * The use of this function is possible only for connected
+ * AF_UNIX stream sockets and for AF_UNIX stream and datagram
+ * socket pairs on Linux.
+ * Returns -1 on error, with *pid set to -1 on non-Linux platforms.
+ *
+ */
+int qio_channel_get_peerpid(QIOChannel *ioc,
+ unsigned int *pid,
+ Error **errp);
+
 #endif /* QIO_CHANNEL_H */
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 3a899b060858..608bcf066ecd 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -841,6 +841,33 @@ qio_channel_socket_set_cork(QIOChannel *ioc,
 socket_set_cork(sioc->fd, v);
 }
 
+static int
+qio_channel_socket_get_peerpid(QIOChannel *ioc,
+   unsigned int *pid,
+   Error **errp)
+{
+#ifdef CONFIG_LINUX
+QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+Error *err = NULL;
+socklen_t len = sizeof(struct ucred);
+
+struct ucred cred;
+if (getsockopt(sioc->fd,
+   SOL_SOCKET, SO_PEERCRED,
+   , ) == -1) {
+error_setg_errno(&err, errno, "Unable to get peer credentials");
+error_propagate(errp, err);
+*pid = -1;
+return -1;
+}
+*pid = (unsigned int)cred.pid;
+return 0;
+#else
+error_setg(errp, "Unsupported feature");
+*pid = -1;
+return -1;
+#endif
+}
 
 static int
 qio_channel_socket_close(QIOChannel *ioc,
@@ -938,6 +965,7 @@ static void qio_channel_socket_class_init(ObjectClass 
*klass,
 #ifdef QEMU_MSG_ZEROCOPY
 ioc_klass->io_flush = qio_channel_socket_flush;
 #endif
+ioc_klass->io_peerpid = qio_channel_socket_get_peerpid;
 }
 
 static const TypeInfo qio_channel_socket_info = {
diff --git a/io/channel.c b/io/channel.c
index a1f12f8e9096..e3f17c24a00f 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -548,6 +548,19 @@ void qio_channel_set_cork(QIOChannel *ioc,
 }
 }
 
+int qio_channel_get_peerpid(QIOChannel *ioc,
+ unsigned int *pid,
+ Error **errp)
+{
+QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+if (!klass->io_peerpid) {
+error_setg(errp, "Channel does not support peer pid");
+return -1;
+}
+klass->io_peerpid(ioc, pid, errp);
+return 0;
+}
 
 off_t qio_channel_io_seek(QIOChannel *ioc,
   off_t offset,
-- 
2.45.1




[PATCH v6 2/3] tools: build qemu-vmsr-helper

2024-05-22 Thread Anthony Harivel
Introduce a privileged helper to access RAPL MSR.

The privileged helper tool, qemu-vmsr-helper, is designed to provide
virtual machines with the ability to read specific RAPL (Running Average
Power Limit) MSRs without requiring CAP_SYS_RAWIO privileges or relying
on external, out-of-tree patches.

The helper tool leverages Unix permissions and SO_PEERCRED socket
options to enforce access control, ensuring that only processes
explicitly requesting read access via readmsr() from a valid Thread ID
can access these MSRs.
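One plausible way to validate such a request (a sketch, not necessarily the helper's exact logic) is to check, after SO_PEERCRED yields the peer's pid, that the thread ID named in the request really belongs to that process, using the Linux /proc layout:

```python
# Sketch: verify that a requesting TID is a thread of the connected peer.
# Relies on the Linux convention that /proc/<pid>/task/<tid> exists
# exactly when <tid> is a thread of process <pid>.
import os

def tid_belongs_to_peer(peer_pid: int, tid: int) -> bool:
    return os.path.isdir(f"/proc/{peer_pid}/task/{tid}")
```

A helper would reject any readmsr() request whose TID fails this check, so one VM cannot probe the energy counters of threads it does not own.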

The list of RAPL MSRs that are allowed to be read by the helper tool is
defined in rapl-msr-index.h. This list corresponds to the RAPL MSRs that
will be supported in the next commit titled "Add support for RAPL MSRs
in KVM/QEMU."

The tool is intentionally designed to run on the Linux x86 platform.
This initial implementation is tailored for Intel CPUs but can be
extended to support AMD CPUs in the future.

Signed-off-by: Anthony Harivel 
---
 contrib/systemd/qemu-vmsr-helper.service |  15 +
 contrib/systemd/qemu-vmsr-helper.socket  |   9 +
 docs/tools/index.rst |   1 +
 docs/tools/qemu-vmsr-helper.rst  |  89 
 meson.build  |   7 +
 tools/i386/qemu-vmsr-helper.c| 530 +++
 tools/i386/rapl-msr-index.h  |  28 ++
 7 files changed, 679 insertions(+)
 create mode 100644 contrib/systemd/qemu-vmsr-helper.service
 create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

diff --git a/contrib/systemd/qemu-vmsr-helper.service 
b/contrib/systemd/qemu-vmsr-helper.service
new file mode 100644
index ..8fd397bf79a9
--- /dev/null
+++ b/contrib/systemd/qemu-vmsr-helper.service
@@ -0,0 +1,15 @@
+[Unit]
+Description=Virtual RAPL MSR Daemon for QEMU
+
+[Service]
+WorkingDirectory=/tmp
+Type=simple
+ExecStart=/usr/bin/qemu-vmsr-helper
+PrivateTmp=yes
+ProtectSystem=strict
+ReadWritePaths=/var/run
+RestrictAddressFamilies=AF_UNIX
+Restart=always
+RestartSec=0
+
+[Install]
diff --git a/contrib/systemd/qemu-vmsr-helper.socket 
b/contrib/systemd/qemu-vmsr-helper.socket
new file mode 100644
index ..183e8304d6e2
--- /dev/null
+++ b/contrib/systemd/qemu-vmsr-helper.socket
@@ -0,0 +1,9 @@
+[Unit]
+Description=Virtual RAPL MSR helper for QEMU
+
+[Socket]
+ListenStream=/run/qemu-vmsr-helper.sock
+SocketMode=0600
+
+[Install]
+WantedBy=multi-user.target
diff --git a/docs/tools/index.rst b/docs/tools/index.rst
index 8e65ce0dfc7b..33ad438e86f6 100644
--- a/docs/tools/index.rst
+++ b/docs/tools/index.rst
@@ -16,3 +16,4 @@ command line utilities and other standalone programs.
qemu-pr-helper
qemu-trace-stap
virtfs-proxy-helper
+   qemu-vmsr-helper
diff --git a/docs/tools/qemu-vmsr-helper.rst b/docs/tools/qemu-vmsr-helper.rst
new file mode 100644
index ..6ec87b49d962
--- /dev/null
+++ b/docs/tools/qemu-vmsr-helper.rst
@@ -0,0 +1,89 @@
+==
+QEMU virtual RAPL MSR helper
+==
+
+Synopsis
+
+
+**qemu-vmsr-helper** [*OPTION*]
+
+Description
+---
+
+Implements the virtual RAPL MSR helper for QEMU.
+
+Accessing the RAPL (Running Average Power Limit) MSR enables the RAPL powercap
+driver to advertise and monitor the power consumption or accumulated energy
+consumption of different power domains, such as CPU packages, DRAM, and other
+components when available.
+
+However those registers are accessible only under privileged access (CAP_SYS_RAWIO).
+QEMU can use an external helper to access those privileged registers.
+
+:program:`qemu-vmsr-helper` is that external helper; it creates a listener
+socket which will accept incoming connections for communication with QEMU.
+
+If you want to run VMs in a setup like this, this helper should be started as a
+system service, and you should read the QEMU manual section on "RAPL MSR
+support" to find out how to configure QEMU to connect to the socket created by
+:program:`qemu-vmsr-helper`.
+
+After connecting to the socket, :program:`qemu-vmsr-helper` can
+optionally drop root privileges, except for those capabilities that
+are needed for its operation.
+
+:program:`qemu-vmsr-helper` can also use the systemd socket activation
+protocol.  In this case, the systemd socket unit should specify a
+Unix stream socket, like this::
+
+[Socket]
+ListenStream=/var/run/qemu-vmsr-helper.sock
+
+Options
+---
+
+.. program:: qemu-vmsr-helper
+
+.. option:: -d, --daemon
+
+  run in the background (and create a PID file)
+
+.. option:: -q, --quiet
+
+  decrease verbosity
+
+.. option:: -v, --verbose
+
+  increase verbosity
+
+.. option:: -f, --pidfile=PATH
+
+  PID file when running as a daemon. By default the PID file
+  is created in the system runtime state directory, for example
+  :file:`/var/run/

[PATCH v6 0/3] Add support for the RAPL MSRs series

2024-05-22 Thread Anthony Harivel
Dear maintainers, 

First of all, thank you very much for your review of my patch 
[1].

In this version (v6), I have attempted to address all the problems
raised by Daniel and Paolo during the last review.

However, two open questions remain unanswered that would require the
attention of an x86 maintainer:

1) Should I move the rapl feature from -kvm to -cpu? [2]

2) Should I already rename to "rapl_vmsr_*" in order to anticipate the
   future TMPI architecture? [end of 3]

Thank you again for your continued guidance. 

v5 -> v6

- Better error consistency in qio_channel_get_peerpid()
- Memory leak g_strdup_printf/g_build_filename corrected
- Renaming several struct with "vmsr_*" for better namespace
- Renamed several struct with "guest_*" for better comprehension
- Optimizations suggested by Daniel
- Crash problem solved [4]

v4 -> v5


- correct qio_channel_get_peerpid: return pid = -1 in case of error
- Vmsr_helper: compile only for x86
- Vmsr_helper: use qio_channel_read/write_all
- Vmsr_helper: abandon user/group
- Vmsr_energy.c: correct all error_report
- Vmsr thread: compute default socket path only once
- Vmsr thread: open socket only once
- Pass relevant QEMU CI

v3 -> v4


- Correct memory leaks with AddressSanitizer  
- Add sanity checks in QEMU and qemu-vmsr-helper to verify that the host
  is Intel and that RAPL is activated.
- Rename poorly named variables for easier comprehension
- Move code that checks Host before creating the VMSR thread
- Get rid of libnuma: create a function that reads sysfs to obtain the
  host topology instead

v2 -> v3


- Move all memory allocations from Clib to Glib
- Compile on *BSD (working on Linux only)
- No more limitation on the virtual package: each vCPU that belongs to
  the same virtual package gives the same results, as expected on
  a real CPU.
  This has been tested topology like:
 -smp 4,sockets=2
 -smp 16,sockets=4,cores=2,threads=2

v1 -> v2


- To overcome CVE-2020-8694, a socket communication is created
  to a privileged helper
- Add the privileged helper (qemu-vmsr-helper)
- Add SO_PEERCRED in qio channel socket

RFC -> v1
-

- Add vmsr_* in front of all vmsr specific function
- Change malloc()/calloc()... with all glib equivalent
- Pre-allocate all dynamic memories when possible
- Add a Documentation of implementation, limitation and usage

Best regards,
Anthony

[1]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg01570.html
[2]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg03947.html
[3]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02350.html
[4]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02481.html

Anthony Harivel (3):
  qio: add support for SO_PEERCRED for socket channel
  tools: build qemu-vmsr-helper
  Add support for RAPL MSRs in KVM/Qemu

 accel/kvm/kvm-all.c  |  27 ++
 contrib/systemd/qemu-vmsr-helper.service |  15 +
 contrib/systemd/qemu-vmsr-helper.socket  |   9 +
 docs/specs/index.rst |   1 +
 docs/specs/rapl-msr.rst  | 155 +++
 docs/tools/index.rst |   1 +
 docs/tools/qemu-vmsr-helper.rst  |  89 
 include/io/channel.h |  21 +
 include/sysemu/kvm_int.h |  32 ++
 io/channel-socket.c  |  28 ++
 io/channel.c |  13 +
 meson.build  |   7 +
 target/i386/cpu.h|   8 +
 target/i386/kvm/kvm.c| 431 +-
 target/i386/kvm/meson.build  |   1 +
 target/i386/kvm/vmsr_energy.c| 337 ++
 target/i386/kvm/vmsr_energy.h|  99 +
 tools/i386/qemu-vmsr-helper.c| 530 +++
 tools/i386/rapl-msr-index.h  |  28 ++
 19 files changed, 1831 insertions(+), 1 deletion(-)
 create mode 100644 contrib/systemd/qemu-vmsr-helper.service
 create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

-- 
2.45.1




Re: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-05-06 Thread Anthony Harivel
Anthony Harivel, Apr 26, 2024 at 10:36:
>
> Hi Paolo,
>
> Daniel P. Berrangé, Apr 25, 2024 at 17:42:
> > On Thu, Apr 25, 2024 at 05:34:52PM +0200, Anthony Harivel wrote:
> > > Hi Daniel,
> > > 
> > > Daniel P. Berrangé, Apr 18, 2024 at 18:42:
> > > 
> > > > > +if (kvm_is_rapl_feat_enable(cs)) {
> > > > > +if (!IS_INTEL_CPU(env)) {
> > > > > +error_setg(errp, "RAPL feature can only be\
> > > > > +  enabled with Intel CPU models");
> > > > > +return false;
> > > > > +}
> > > > > +}
> > > >
> > > > I see a crash in kvm_is_rapl_feat_enable() from this caller,
> > > > when I run with this kind of command line:
> > > >
> > > >  $ qemu-system-x86_64 \
> > > >   -kernel /lib/modules/6.6.9-100.fc38.x86_64/vmlinuz \
> > > >   -initrd tiny-initrd.img  -m 2000 -serial stdio -nodefaults \
> > > >   -display none -accel kvm -append "console=ttyS0 quiet"
> > > >
> > > >
> > > > #0  0x55bc14b7 in kvm_is_rapl_feat_enable 
> > > > (cs=cs@entry=0x57b83470) at ../target/i386/kvm/kvm.c:2531
> > > > #1  0x55bc7534 in kvm_cpu_realizefn (cs=0x57b83470, 
> > > > errp=0x7fffd2a0) at ../target/i386/kvm/kvm-cpu.c:54
> > > > #2  0x55d2432a in accel_cpu_common_realize (cpu=0x57b83470, 
> > > > errp=0x7fffd2a0) at ../accel/accel-target.c:130
> > > > #3  0x55cdd955 in cpu_exec_realizefn 
> > > > (cpu=cpu@entry=0x57b83470, errp=errp@entry=0x7fffd2a0) at 
> > > > ../cpu-target.c:137
> > > > #4  0x55c14b89 in x86_cpu_realizefn (dev=0x57b83470, 
> > > > errp=0x7fffd310) at ../target/i386/cpu.c:7320
> > > > #5  0x55d58f4b in device_set_realized (obj=, 
> > > > value=, errp=0x7fffd390) at ../hw/core/qdev.c:510
> > > > #6  0x55d5d78d in property_set_bool (obj=0x57b83470, 
> > > > v=, name=, opaque=0x578558e0, 
> > > > errp=0x7fffd390)
> > > > at ../qom/object.c:2358
> > > > #7  0x55d60b0b in object_property_set 
> > > > (obj=obj@entry=0x57b83470, name=name@entry=0x5607c799 
> > > > "realized", v=v@entry=0x57b8ccb0, errp=0x7fffd390, 
> > > > errp@entry=0x56e210d8 ) at ../qom/object.c:1472
> > > > #8  0x55d6444f in object_property_set_qobject
> > > > (obj=obj@entry=0x57b83470, name=name@entry=0x5607c799 
> > > > "realized", value=value@entry=0x57854800, 
> > > > errp=errp@entry=0x56e210d8 )
> > > > at ../qom/qom-qobject.c:28
> > > > #9  0x55d61174 in object_property_set_bool
> > > > (obj=0x57b83470, name=name@entry=0x5607c799 "realized", 
> > > > value=value@entry=true, errp=errp@entry=0x56e210d8 ) 
> > > > at ../qom/object.c:1541
> > > > #10 0x55d59a3c in qdev_realize (dev=, 
> > > > bus=bus@entry=0x0, errp=errp@entry=0x56e210d8 ) at 
> > > > ../hw/core/qdev.c:292
> > > > #11 0x55bd51e0 in x86_cpu_new (x86ms=, 
> > > > apic_id=0, errp=0x56e210d8 ) at ../hw/i386/x86.c:105
> > > > #12 0x55bd52ce in x86_cpus_init 
> > > > (x86ms=x86ms@entry=0x57aaed30, default_cpu_version=) 
> > > > at ../hw/i386/x86.c:156
> > > > #13 0x55bdc1a7 in pc_init1 (machine=0x57aaed30, 
> > > > pci_type=0x5604aa61 "i440FX") at ../hw/i386/pc_piix.c:185
> > > > #14 0x55947a11 in machine_run_board_init 
> > > > (machine=0x57aaed30, mem_path=, errp= > > > out>, errp@entry=0x56e210d8 )
> > > > at ../hw/core/machine.c:1547
> > > > #15 0x55b020ed in qemu_init_board () at ../system/vl.c:2613
> > > > #16 qmp_x_exit_preconfig (errp=0x56e210d8 ) at 
> > > > ../system/vl.c:2705
> > > > #17 0x55b0611e in qemu_init (argc=, 
> > > > argv=) at ../system/vl.c:3739
> > > > #18 0x55897ca9 in main (argc=, argv= > > > out>) at ../system/main.c:47
> > > >
> > > > The problem is that 'cs->kvm_state' is NULL here
> > > >
> > > 
> > > After some investigation it seems that kvm_state is not yet committed 
> > >

[PATCH] MAINTAINERS: Update my email address

2024-04-29 Thread Anthony PERARD
From: Anthony PERARD 

Signed-off-by: Anthony PERARD 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 302b6fd00c..ea9672fc52 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -532,7 +532,7 @@ Guest CPU Cores (Xen)
 -
 X86 Xen CPUs
 M: Stefano Stabellini 
-M: Anthony Perard 
+M: Anthony PERARD 
 M: Paul Durrant 
 L: xen-de...@lists.xenproject.org
 S: Supported
-- 
Anthony PERARD




Re: [PATCH v2 1/4] vfio/ap: Use g_autofree variable in vfio_ap_register_irq_notifier()

2024-04-26 Thread Anthony Krowiak



On 4/25/24 5:02 AM, Cédric Le Goater wrote:

Signed-off-by: Cédric Le Goater 
---
  hw/vfio/ap.c | 10 +++---
  1 file changed, 3 insertions(+), 7 deletions(-)



LGTM

Reviewed-by: Anthony Krowiak 




diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index 7c4caa5938636937680fec87e999249ac84a4498..03f8ffaa5e2bf13cf8daa2f44aa4cf17809abd94 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -77,7 +77,7 @@ static void vfio_ap_register_irq_notifier(VFIOAPDevice 
*vapdev,
  size_t argsz;
  IOHandler *fd_read;
  EventNotifier *notifier;
-struct vfio_irq_info *irq_info;
+g_autofree struct vfio_irq_info *irq_info = NULL;
   VFIODevice *vdev = &vapdev->vdev;
  
  switch (irq) {

@@ -104,14 +104,14 @@ static void vfio_ap_register_irq_notifier(VFIOAPDevice 
*vapdev,
  if (ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO,
irq_info) < 0 || irq_info->count < 1) {
  error_setg_errno(errp, errno, "vfio: Error getting irq info");
-goto out_free_info;
+return;
  }
  
  if (event_notifier_init(notifier, 0)) {

  error_setg_errno(errp, errno,
   "vfio: Unable to init event notifier for irq (%d)",
   irq);
-goto out_free_info;
+return;
  }
  
  fd = event_notifier_get_fd(notifier);

@@ -122,10 +122,6 @@ static void vfio_ap_register_irq_notifier(VFIOAPDevice 
*vapdev,
  qemu_set_fd_handler(fd, NULL, NULL, vapdev);
  event_notifier_cleanup(notifier);
  }
-
-out_free_info:
-g_free(irq_info);
-
  }
  
  static void vfio_ap_unregister_irq_notifier(VFIOAPDevice *vapdev,
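The `g_autofree` attribute frees the pointer automatically when it goes out of scope, which is what makes the `out_free_info` label and explicit `g_free()` unnecessary above. A rough sketch of the underlying mechanism, using the plain GCC/Clang cleanup attribute rather than GLib's actual macro:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified version of what g_autofree expands to: a scope-exit
 * hook installed via the GCC/Clang cleanup attribute. */
static void autofree_helper(void *pp)
{
    free(*(void **)pp);
}
#define AUTOFREE __attribute__((cleanup(autofree_helper)))

static int demo(void)
{
    AUTOFREE char *buf = malloc(16);   /* freed automatically on return */
    if (buf == NULL) {
        return -1;
    }
    strcpy(buf, "irq info");
    return 0;   /* no goto/free needed on any exit path */
}
```

Every return path runs the cleanup, which is why the early `return` statements in the patch above no longer leak `irq_info`.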




Re: [PATCH] vfio/ap: Use g_autofree variable

2024-04-26 Thread Anthony Krowiak



On 4/24/24 8:54 AM, Cédric Le Goater wrote:

Also change the return value of vfio_ap_register_irq_notifier() to be
a bool since it takes an 'Error **' argument. See the qapi/error.h
Rules section.
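The convention referenced here — a bool return paired with an 'Error **' out-parameter — can be sketched outside QEMU like this; the `Error` type and `error_setg_stub()` below are simplified stand-ins, not QEMU's real qapi/error.h API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for QEMU's Error object; the real API lives in qapi/error.h. */
typedef struct Error { char msg[64]; } Error;

static void error_setg_stub(Error **errp, const char *msg)
{
    if (errp) {
        *errp = malloc(sizeof(Error));
        snprintf((*errp)->msg, sizeof((*errp)->msg), "%s", msg);
    }
}

/* Convention: return true on success, false on failure with *errp set,
 * so callers can write `if (!fn(..., &err))` instead of probing err. */
static bool register_notifier(int irq, Error **errp)
{
    if (irq < 0) {
        error_setg_stub(errp, "unsupported irq");
        return false;
    }
    return true;
}
```

This is exactly the shape the patch below gives vfio_ap_register_irq_notifier(): the caller tests the return value instead of checking whether `err` was set.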



LGTM

Signed-off-by: Anthony Krowiak 




Signed-off-by: Cédric Le Goater 
---
  hw/vfio/ap.c | 19 ---
  1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index 7c4caa5938636937680fec87e999249ac84a4498..8bb024e2fde4a1d72346dee4b662d762374326b9 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -70,14 +70,14 @@ static void vfio_ap_req_notifier_handler(void *opaque)
  }
  }
  
-static void vfio_ap_register_irq_notifier(VFIOAPDevice *vapdev,

+static bool vfio_ap_register_irq_notifier(VFIOAPDevice *vapdev,
unsigned int irq, Error **errp)
  {
  int fd;
  size_t argsz;
  IOHandler *fd_read;
  EventNotifier *notifier;
-struct vfio_irq_info *irq_info;
+g_autofree struct vfio_irq_info *irq_info = NULL;
   VFIODevice *vdev = &vapdev->vdev;
  
  switch (irq) {

@@ -87,13 +87,13 @@ static void vfio_ap_register_irq_notifier(VFIOAPDevice 
*vapdev,
  break;
  default:
  error_setg(errp, "vfio: Unsupported device irq(%d)", irq);
-return;
+return false;
  }
  
  if (vdev->num_irqs < irq + 1) {

  error_setg(errp, "vfio: IRQ %u not available (number of irqs %u)",
 irq, vdev->num_irqs);
-return;
+return false;
  }
  
  argsz = sizeof(*irq_info);

@@ -104,14 +104,14 @@ static void vfio_ap_register_irq_notifier(VFIOAPDevice 
*vapdev,
  if (ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO,
irq_info) < 0 || irq_info->count < 1) {
  error_setg_errno(errp, errno, "vfio: Error getting irq info");
-goto out_free_info;
+return false;
  }
  
  if (event_notifier_init(notifier, 0)) {

  error_setg_errno(errp, errno,
   "vfio: Unable to init event notifier for irq (%d)",
   irq);
-goto out_free_info;
+return false;
  }
  
  fd = event_notifier_get_fd(notifier);

@@ -123,9 +123,7 @@ static void vfio_ap_register_irq_notifier(VFIOAPDevice 
*vapdev,
  event_notifier_cleanup(notifier);
  }
  
-out_free_info:

-g_free(irq_info);
-
+return true;
  }
  
  static void vfio_ap_unregister_irq_notifier(VFIOAPDevice *vapdev,

@@ -171,8 +169,7 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
  goto error;
  }
  
-vfio_ap_register_irq_notifier(vapdev, VFIO_AP_REQ_IRQ_INDEX, &err);

-if (err) {
+if (!vfio_ap_register_irq_notifier(vapdev, VFIO_AP_REQ_IRQ_INDEX, &err)) {
  /*
   * Report this error, but do not make it a failing condition.
   * Lack of this IRQ in the host does not prevent normal operation.




Re: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-04-26 Thread Anthony Harivel


Hi Paolo,

Daniel P. Berrangé, Apr 25, 2024 at 17:42:
> On Thu, Apr 25, 2024 at 05:34:52PM +0200, Anthony Harivel wrote:
> > Hi Daniel,
> > 
> > Daniel P. Berrangé, Apr 18, 2024 at 18:42:
> > 
> > > > +if (kvm_is_rapl_feat_enable(cs)) {
> > > > +if (!IS_INTEL_CPU(env)) {
> > > > +error_setg(errp, "RAPL feature can only be\
> > > > +  enabled with Intel CPU models");
> > > > +return false;
> > > > +}
> > > > +}
> > >
> > > I see a crash in kvm_is_rapl_feat_enable() from this caller,
> > > when I run with this kind of command line:
> > >
> > >  $ qemu-system-x86_64 \
> > >   -kernel /lib/modules/6.6.9-100.fc38.x86_64/vmlinuz \
> > >   -initrd tiny-initrd.img  -m 2000 -serial stdio -nodefaults \
> > >   -display none -accel kvm -append "console=ttyS0 quiet"
> > >
> > >
> > > #0  0x55bc14b7 in kvm_is_rapl_feat_enable 
> > > (cs=cs@entry=0x57b83470) at ../target/i386/kvm/kvm.c:2531
> > > #1  0x55bc7534 in kvm_cpu_realizefn (cs=0x57b83470, 
> > > errp=0x7fffd2a0) at ../target/i386/kvm/kvm-cpu.c:54
> > > #2  0x55d2432a in accel_cpu_common_realize (cpu=0x57b83470, 
> > > errp=0x7fffd2a0) at ../accel/accel-target.c:130
> > > #3  0x55cdd955 in cpu_exec_realizefn 
> > > (cpu=cpu@entry=0x57b83470, errp=errp@entry=0x7fffd2a0) at 
> > > ../cpu-target.c:137
> > > #4  0x55c14b89 in x86_cpu_realizefn (dev=0x57b83470, 
> > > errp=0x7fffd310) at ../target/i386/cpu.c:7320
> > > #5  0x55d58f4b in device_set_realized (obj=, 
> > > value=, errp=0x7fffd390) at ../hw/core/qdev.c:510
> > > #6  0x55d5d78d in property_set_bool (obj=0x57b83470, 
> > > v=, name=, opaque=0x578558e0, 
> > > errp=0x7fffd390)
> > > at ../qom/object.c:2358
> > > #7  0x55d60b0b in object_property_set 
> > > (obj=obj@entry=0x57b83470, name=name@entry=0x5607c799 "realized", 
> > > v=v@entry=0x57b8ccb0, errp=0x7fffd390, 
> > > errp@entry=0x56e210d8 ) at ../qom/object.c:1472
> > > #8  0x55d6444f in object_property_set_qobject
> > > (obj=obj@entry=0x57b83470, name=name@entry=0x5607c799 
> > > "realized", value=value@entry=0x57854800, 
> > > errp=errp@entry=0x56e210d8 )
> > > at ../qom/qom-qobject.c:28
> > > #9  0x55d61174 in object_property_set_bool
> > > (obj=0x57b83470, name=name@entry=0x5607c799 "realized", 
> > > value=value@entry=true, errp=errp@entry=0x56e210d8 ) at 
> > > ../qom/object.c:1541
> > > #10 0x55d59a3c in qdev_realize (dev=, 
> > > bus=bus@entry=0x0, errp=errp@entry=0x56e210d8 ) at 
> > > ../hw/core/qdev.c:292
> > > #11 0x55bd51e0 in x86_cpu_new (x86ms=, apic_id=0, 
> > > errp=0x56e210d8 ) at ../hw/i386/x86.c:105
> > > #12 0x55bd52ce in x86_cpus_init 
> > > (x86ms=x86ms@entry=0x57aaed30, default_cpu_version=) 
> > > at ../hw/i386/x86.c:156
> > > #13 0x55bdc1a7 in pc_init1 (machine=0x57aaed30, 
> > > pci_type=0x5604aa61 "i440FX") at ../hw/i386/pc_piix.c:185
> > > #14 0x55947a11 in machine_run_board_init (machine=0x57aaed30, 
> > > mem_path=, errp=, errp@entry=0x56e210d8 
> > > )
> > > at ../hw/core/machine.c:1547
> > > #15 0x55b020ed in qemu_init_board () at ../system/vl.c:2613
> > > #16 qmp_x_exit_preconfig (errp=0x56e210d8 ) at 
> > > ../system/vl.c:2705
> > > #17 0x55b0611e in qemu_init (argc=, 
> > > argv=) at ../system/vl.c:3739
> > > #18 0x55897ca9 in main (argc=, argv= > > out>) at ../system/main.c:47
> > >
> > > The problem is that 'cs->kvm_state' is NULL here
> > >
> > 
> > After some investigation it seems that kvm_state is not yet committed 
> > at this point. Shame, because GDB showed me that we have already pass 
> > the kvm_accel_instance_init() in accel/kvm/kvm-all.c that sets the 
> > value "msr_energy.enable" in kvm_state...
> > 
> > So should I dig more to still do the sanity check in kvm_cpu_realizefn() 
> > or should I already move the RAPL feature  from -kvm to -cpu 
> > like suggested by Zhao from Intel and then access it from 

Re: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-04-25 Thread Anthony Harivel
Hi Daniel,

Daniel P. Berrangé, Apr 18, 2024 at 18:42:

> > +if (kvm_is_rapl_feat_enable(cs)) {
> > +if (!IS_INTEL_CPU(env)) {
> > +error_setg(errp, "RAPL feature can only be\
> > +  enabled with Intel CPU models");
> > +return false;
> > +}
> > +}
>
> I see a crash in kvm_is_rapl_feat_enable() from this caller,
> when I run with this kind of command line:
>
>  $ qemu-system-x86_64 \
>   -kernel /lib/modules/6.6.9-100.fc38.x86_64/vmlinuz \
>   -initrd tiny-initrd.img  -m 2000 -serial stdio -nodefaults \
>   -display none -accel kvm -append "console=ttyS0 quiet"
>
>
> #0  0x55bc14b7 in kvm_is_rapl_feat_enable 
> (cs=cs@entry=0x57b83470) at ../target/i386/kvm/kvm.c:2531
> #1  0x55bc7534 in kvm_cpu_realizefn (cs=0x57b83470, 
> errp=0x7fffd2a0) at ../target/i386/kvm/kvm-cpu.c:54
> #2  0x55d2432a in accel_cpu_common_realize (cpu=0x57b83470, 
> errp=0x7fffd2a0) at ../accel/accel-target.c:130
> #3  0x55cdd955 in cpu_exec_realizefn (cpu=cpu@entry=0x57b83470, 
> errp=errp@entry=0x7fffd2a0) at ../cpu-target.c:137
> #4  0x55c14b89 in x86_cpu_realizefn (dev=0x57b83470, 
> errp=0x7fffd310) at ../target/i386/cpu.c:7320
> #5  0x55d58f4b in device_set_realized (obj=, 
> value=, errp=0x7fffd390) at ../hw/core/qdev.c:510
> #6  0x55d5d78d in property_set_bool (obj=0x57b83470, v= out>, name=, opaque=0x578558e0, errp=0x7fffd390)
> at ../qom/object.c:2358
> #7  0x55d60b0b in object_property_set (obj=obj@entry=0x57b83470, 
> name=name@entry=0x5607c799 "realized", v=v@entry=0x57b8ccb0, 
> errp=0x7fffd390, 
> errp@entry=0x56e210d8 ) at ../qom/object.c:1472
> #8  0x55d6444f in object_property_set_qobject
> (obj=obj@entry=0x57b83470, name=name@entry=0x5607c799 "realized", 
> value=value@entry=0x57854800, errp=errp@entry=0x56e210d8 
> )
> at ../qom/qom-qobject.c:28
> #9  0x55d61174 in object_property_set_bool
> (obj=0x57b83470, name=name@entry=0x5607c799 "realized", 
> value=value@entry=true, errp=errp@entry=0x56e210d8 ) at 
> ../qom/object.c:1541
> #10 0x55d59a3c in qdev_realize (dev=, 
> bus=bus@entry=0x0, errp=errp@entry=0x56e210d8 ) at 
> ../hw/core/qdev.c:292
> #11 0x55bd51e0 in x86_cpu_new (x86ms=, apic_id=0, 
> errp=0x56e210d8 ) at ../hw/i386/x86.c:105
> #12 0x55bd52ce in x86_cpus_init (x86ms=x86ms@entry=0x57aaed30, 
> default_cpu_version=) at ../hw/i386/x86.c:156
> #13 0x55bdc1a7 in pc_init1 (machine=0x57aaed30, 
> pci_type=0x5604aa61 "i440FX") at ../hw/i386/pc_piix.c:185
> #14 0x55947a11 in machine_run_board_init (machine=0x57aaed30, 
> mem_path=, errp=, errp@entry=0x56e210d8 
> )
> at ../hw/core/machine.c:1547
> #15 0x55b020ed in qemu_init_board () at ../system/vl.c:2613
> #16 qmp_x_exit_preconfig (errp=0x56e210d8 ) at 
> ../system/vl.c:2705
> #17 0x55b0611e in qemu_init (argc=, argv= out>) at ../system/vl.c:3739
> #18 0x55897ca9 in main (argc=, argv=) 
> at ../system/main.c:47
>
> The problem is that 'cs->kvm_state' is NULL here
>

After some investigation it seems that kvm_state is not yet committed 
at this point. A shame, because GDB showed me that we have already 
passed kvm_accel_instance_init() in accel/kvm/kvm-all.c, which sets 
the value "msr_energy.enable" in kvm_state...

So should I dig more to still do the sanity check in kvm_cpu_realizefn(), 
or should I move the RAPL feature  from -kvm to -cpu as suggested 
by Zhao from Intel, and then access it from the CPUState?
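In the meantime, a defensive guard would avoid the NULL dereference while the design question is settled. A minimal sketch, using hypothetical stand-in types rather than QEMU's real CPUState/KVMState definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for QEMU's KVMState/CPUState. */
typedef struct {
    struct { bool enable; } msr_energy;
} KVMState;

typedef struct {
    KVMState *kvm_state;   /* still NULL early during cpu realize */
} CPUState;

/* Sketch: treat the feature as disabled while kvm_state is unset,
 * instead of dereferencing a NULL pointer. */
static bool kvm_is_rapl_feat_enable(CPUState *cs)
{
    KVMState *s = cs->kvm_state;

    return s != NULL && s->msr_energy.enable;
}
```

Whether returning false here is acceptable (versus moving the check to a point where kvm_state is guaranteed to be set) is exactly the open question above.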

The latter would require more work, but since I would need to do it 
anyway, skipping a new iteration would save me time in the end. 

Thanks

Regards,
Anthony





Re: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-04-18 Thread Anthony Harivel


Hi Daniel,

Daniel P. Berrangé, Apr 18, 2024 at 18:42:
> On Thu, Apr 11, 2024 at 02:14:34PM +0200, Anthony Harivel wrote:
> > Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
> > interface (Running Average Power Limit) for advertising the accumulated
> > energy consumption of various power domains (e.g. CPU packages, DRAM,
> > etc.).
> > 
> > The consumption is reported via MSRs (model specific registers) like
> > MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
> > 64 bits registers that represent the accumulated energy consumption in
> > micro Joules. They are updated by microcode every ~1ms.
> > 
> > For now, KVM always returns 0 when the guest requests the value of
> > these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle
> > these MSRs dynamically in userspace.
> > 
> > To limit the amount of system calls for every MSR call, create a new
> > thread in QEMU that updates the "virtual" MSR values asynchronously.
> > 
> > Each vCPU has its own vMSR to reflect the independence of vCPUs. The
> > thread updates the vMSR values with the ratio of energy consumed of
> > the whole physical CPU package the vCPU thread runs on and the
> > thread's utime and stime values.
> > 
> > All other non-vCPU threads are also taken into account. Their energy
> > consumption is evenly distributed among all vCPUs threads running on
> > the same physical CPU package.
> > 
> > To overcome the problem that reading the RAPL MSR requires priviliged
> > access, a socket communication between QEMU and the qemu-vmsr-helper is
> > mandatory. You can specified the socket path in the parameter.
> > 
> > This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock
> > 
> > Actual limitation:
> > - Works only on Intel host CPU because AMD CPUs are using different MSR
> >   adresses.
> > 
> > - Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
> >   the moment.
> > 
> > Signed-off-by: Anthony Harivel 
> > ---
> >  accel/kvm/kvm-all.c   |  27 +++
> >  docs/specs/index.rst  |   1 +
> >  docs/specs/rapl-msr.rst   | 155 
> >  include/sysemu/kvm.h  |   2 +
> >  include/sysemu/kvm_int.h  |  32 +++
> >  target/i386/cpu.h |   8 +
> >  target/i386/kvm/kvm-cpu.c |   9 +
> >  target/i386/kvm/kvm.c | 428 ++
> >  target/i386/kvm/meson.build   |   1 +
> >  target/i386/kvm/vmsr_energy.c | 335 ++
> >  target/i386/kvm/vmsr_energy.h |  99 
> >  11 files changed, 1097 insertions(+)
> >  create mode 100644 docs/specs/rapl-msr.rst
> >  create mode 100644 target/i386/kvm/vmsr_energy.c
> >  create mode 100644 target/i386/kvm/vmsr_energy.h
> > 
>
> > diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
> > index 9c791b7b0520..eafb625839b8 100644
> > --- a/target/i386/kvm/kvm-cpu.c
> > +++ b/target/i386/kvm/kvm-cpu.c
> > @@ -50,6 +50,15 @@ static bool kvm_cpu_realizefn(CPUState *cs, Error **errp)
> > MSR_IA32_UCODE_REV);
> >  }
> >  }
> > +
> > +if (kvm_is_rapl_feat_enable(cs)) {
> > +if (!IS_INTEL_CPU(env)) {
> > +error_setg(errp, "RAPL feature can only be\
> > +  enabled with Intel CPU models");
> > +return false;
> > +}
> > +}
>
> I see a crash in kvm_is_rapl_feat_enable() from this caller,
> when I run with this kind of command line:
>
>  $ qemu-system-x86_64 \
>   -kernel /lib/modules/6.6.9-100.fc38.x86_64/vmlinuz \
>   -initrd tiny-initrd.img  -m 2000 -serial stdio -nodefaults \
>   -display none -accel kvm -append "console=ttyS0 quiet"
>
>
> #0  0x55bc14b7 in kvm_is_rapl_feat_enable 
> (cs=cs@entry=0x57b83470) at ../target/i386/kvm/kvm.c:2531
> #1  0x55bc7534 in kvm_cpu_realizefn (cs=0x57b83470, 
> errp=0x7fffd2a0) at ../target/i386/kvm/kvm-cpu.c:54
> #2  0x55d2432a in accel_cpu_common_realize (cpu=0x57b83470, 
> errp=0x7fffd2a0) at ../accel/accel-target.c:130
> #3  0x55cdd955 in cpu_exec_realizefn (cpu=cpu@entry=0x57b83470, 
> errp=errp@entry=0x7fffd2a0) at ../cpu-target.c:137
> #4  0x55c14b89 in x86_cpu_realizefn (dev=0x57b83470, 
> errp=0x7fffd310) at ../target/i386/cpu.c:7320
> #5  0x55d58f4b in device_set_realized (obj=, 
> value=, errp=0x7ff

Re: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-04-18 Thread Anthony Harivel


Hi Zhao,

Zhao Liu, Apr 17, 2024 at 12:07:
> Hi Anthony,
>
> May I ask what your usage scenario is? Is it to measure Guest's energy
> consumption and to charged per watt consumed? ;-)

See previous email from Daniel.

> On Thu, Apr 11, 2024 at 02:14:34PM +0200, Anthony Harivel wrote:
> > Date: Thu, 11 Apr 2024 14:14:34 +0200
> > From: Anthony Harivel 
> > Subject: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu
> > 
> > Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
> > interface (Running Average Power Limit) for advertising the accumulated
> > energy consumption of various power domains (e.g. CPU packages, DRAM,
> > etc.).
> >
> > The consumption is reported via MSRs (model specific registers) like
> > MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
> > 64 bits registers that represent the accumulated energy consumption in
> > micro Joules. They are updated by microcode every ~1ms.
>
> What is your current target platform?
>
> On future Xeon platforms (EMR and beyond) RAPL will support TPMI (an MMIO
> interface) and the TPMI based RAPL will be preferred in the future as
> well:
> * TPMI doc: https://github.com/intel/tpmi_power_management
> * TPMI based RAPL driver: drivers/powercap/intel_rapl_tpmi.c
>
> So do you have the plan to support RAPL-TPMI?

Yes, I guess this will be inevitable in the future. But right now the 
lack of HW with this TPMI makes it hard to integrate on day 1.

> > For now, KVM always returns 0 when the guest requests the value of
> > these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle
> > these MSRs dynamically in userspace.
> > 
> > To limit the amount of system calls for every MSR call, create a new
> > thread in QEMU that updates the "virtual" MSR values asynchronously.
> > 
> > Each vCPU has its own vMSR to reflect the independence of vCPUs. The
> > thread updates the vMSR values with the ratio of energy consumed of
> > the whole physical CPU package the vCPU thread runs on and the
> > thread's utime and stime values.
> > 
> > All other non-vCPU threads are also taken into account. Their energy
> > consumption is evenly distributed among all vCPUs threads running on
> > the same physical CPU package.
>
> The package energy consumption includes core part and uncore part, where
> uncore part consumption may not be able to be scaled based on vCPU
> runtime ratio.
>
> When the uncore part consumption is small, the error in this part is
> small, but if it is large, then the error generated by scaling by vCPU
> runtime will be large.
>

So far we can only work with what Intel is giving us, i.e. the Package 
power plane and the DRAM power plane on servers, which are the main 
target of this feature. Maybe in the future Intel will expand the core 
power plane and the uncore power plane to server-class CPUs?

> May I ask what your usage scenario is? Is there significant uncore
> consumption (e.g. GPU)?
>

Same answer as above: the uncore/graphics power plane is only available 
on client-class CPUs.

> Also, I think of a generic question is whether the error in this
> calculation is measurable? Like comparing the RAPL status of the same
> workload on Guest and bare metal to check the error.
>
> IIUC, this calculation is highly affected by native/sibling Guests,
> especially in cloud scenarios where there are multiple Guests, the
> accuracy of this algorithm needs to be checked.
>

Indeed, depending on where your vCPUs are running within the package (on 
the native or sibling CPU), you might observe different power 
consumption levels. However, I don't consider this to be a problem, as 
the ratio calculation takes into account the vCPU's location.

We also need to approach the measurement differently. Due to the 
complexity of factors influencing power consumption, we must compare 
what is comparable. If you require precise power consumption data, 
use a power meter on the PSU of the server. It will provide the 
ultimate judgment. However, if you need an estimation to optimize 
software workloads in a guest, then this feature could be useful. All my 
tests have consistently shown reproducible output in terms of power 
consumption, which has convinced me that we can effectively work with 
it.

> > To overcome the problem that reading the RAPL MSR requires priviliged
> > access, a socket communication between QEMU and the qemu-vmsr-helper is
> > mandatory. You can specified the socket path in the parameter.
> > 
> > This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock
>
> Based on the above comment, I suggest to rename this option as "rapl-msr"
> to distinguish it from rapl-tpmi.

F

Re: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-04-18 Thread Anthony Harivel


Hi Zhao, Daniel,

Zhao Liu, Apr 17, 2024 at 17:13:
> Hi Daniel,
>
> On Wed, Apr 17, 2024 at 01:27:03PM +0100, Daniel P. Berrangé wrote:
> > Date: Wed, 17 Apr 2024 13:27:03 +0100
> > From: "Daniel P. Berrangé" 
> > Subject: Re: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu
> > 
> > On Wed, Apr 17, 2024 at 06:07:02PM +0800, Zhao Liu wrote:
> > > Hi Anthony,
> > > 
> > > May I ask what your usage scenario is? Is it to measure Guest's energy
> > > consumption and to charged per watt consumed? ;-)
> > > 
> > > On Thu, Apr 11, 2024 at 02:14:34PM +0200, Anthony Harivel wrote:
> > > > Date: Thu, 11 Apr 2024 14:14:34 +0200
> > > > From: Anthony Harivel 
> > > > Subject: [PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu
> > > > 
> > > > Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
> > > > interface (Running Average Power Limit) for advertising the accumulated
> > > > energy consumption of various power domains (e.g. CPU packages, DRAM,
> > > > etc.).
> > > >
> > > > The consumption is reported via MSRs (model specific registers) like
> > > > MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
> > > > 64 bits registers that represent the accumulated energy consumption in
> > > > micro Joules. They are updated by microcode every ~1ms.
> > > 
> > > What is your current target platform?
> > 
> > I think we can assume /all/ future CPUs are conceptially in scope
> > for this.
> > 
> > The use case is to allow guest owners to monitor the power consumption
> > of their workloads, so they can take steps to optimize their guest VM
> > workloads to reduce power consumed.
>
> Thanks for the explanation! 
>

Thanks Daniel for stepping in on the explanation.


> > > On future Xeon platforms (EMR and beyond) RAPL will support TPMI (an MMIO
> > > interface) and the TPMI based RAPL will be preferred in the future as
> > > well:
> > 
> > Is the MSR based interface likely to be removed in future silicon,
> > or it will be remain for back compat ?
>
> For Xeon, GNR will have both TMPI & MSR RAPL, but eventually MSR RAPL
> will be removed. Therefore, if RAPL support is desired for all future
> Xeons, then it's necessary to consider TMPI as the next plan.
>
> Alternatively, the whole RAPL scope can be split into rapl-msr and
> rapl-tpmi features.
>

I'm aware of the MSR/TPMI RAPL that will appear in the future, and 
I would like to share my perspective on that.

Firstly, we can safely assume that it will take years before all server 
hardware is transitioned to the new GNR (or future XEON without RAPL 
MSR). It may be around 2024 when these features could be integrated into 
QEMU. While the adoption of this feature might take some time, I'm 
optimistic that once it's implemented, people will finally have the 
tools to optimize workloads inside VMs and start reducing power 
consumption.

Secondly, the second-hand server market is substantial. This means that 
with the Virtual RAPL MSR, all XEON processors starting from Sandy 
Bridge (2012!) will have the potential for software optimization. Making 
the most of existing resources is essential for sustainability.

Lastly, when the TPMI becomes available in hardware in the future, the 
RAPL interface and ratio calculation will remain the same, with only the 
method of obtaining host values changing. This transition should be 
manageable.

Regards,
Anthony




Re: [PATCH v5 0/3] Add support for the RAPL MSRs series

2024-04-18 Thread Anthony Harivel


Hi Daniel,

Daniel P. Berrangé, Apr 17, 2024 at 19:23:
> On Thu, Apr 11, 2024 at 02:14:31PM +0200, Anthony Harivel wrote:
> > Dear maintainers, 
> > 
> > First of all, thank you very much for your review of my patch 
> > [1].
> > 
> > In this version (v5), I have attempted to address all the problems 
> > addressed by Daniel during the last review. I've been more careful with 
> > all the remarks made.
>
> I'm wondering if you had tips for testing this functionality ?
>
> Is there any nice app to run in the host/guest to report the
> power usage, to see that it is working as desired ?
>

Great question. Unfortunately, there isn't an easy-to-use, 
out-of-the-box app that can assist you.

The 'cpupower' tool in linux/tools/power/ or 'turbostat' in 
linux/tools/power/x86/ require some modifications as they fail the 
sanity check inside a VM. It is on my agenda to work on a proposal patch 
for these tools if the vmsr patch lands in QEMU. These are excellent 
apps that everyone should use, IMO.

So how do I test my patch?
I'm using a slightly more complex tool called Kepler [1]. Since a patch 
was merged [2], it can also report VM consumption.
The installation is easy on RPM-based distributions [3].
But indeed, this tool is a Prometheus exporter, so you need 
a Prometheus/Grafana stack for the observation, which makes the test 
more complex than the two previous tools mentioned.

Last month, I conducted a test with Kepler tools on both a host and 
within VMs. I was pleased to observe that the power graph trends were 
identical both outside and inside the VMs, albeit with a slight 
variation in terms of 1:1 Watt comparison.

If Kepler isn't the tool you're looking for, I'm open to any suggestions 
regarding cpupower/turbostat. I can work on a temporary patch that would 
enable us to utilize them.

Regards,
Anthony

[1]: https://sustainable-computing.io/
[2]: https://github.com/sustainable-computing-io/kepler/pull/931
[3]: https://sustainable-computing.io/installation/kepler-rpm/




[PATCH v5 1/3] qio: add support for SO_PEERCRED for socket channel

2024-04-11 Thread Anthony Harivel
The function qio_channel_get_peerpid() returns the PID of the peer
process connected to this socket, obtained from the SO_PEERCRED
socket credentials.

This credentials structure is defined in  as follows:

struct ucred {
pid_t pid;/* Process ID of the sending process */
uid_t uid;/* User ID of the sending process */
gid_t gid;/* Group ID of the sending process */
};

The use of this function is possible only for connected AF_UNIX stream
sockets and for AF_UNIX stream and datagram socket pairs.

On platforms other than Linux, the function returns -1.
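For reference, the SO_PEERCRED behaviour described above can be exercised with a standalone Linux-only sketch, independent of QEMU's QIOChannel types (peer_pid() is a hypothetical helper name):

```c
#define _GNU_SOURCE   /* struct ucred needs this with glibc */
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the PID of the peer of a connected AF_UNIX socket,
 * or -1 if the credentials cannot be read. */
static pid_t peer_pid(int fd)
{
    struct ucred cred;
    socklen_t len = sizeof(cred);

    if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) == -1) {
        return -1;
    }
    return cred.pid;
}
```

For a socketpair created within one process, both ends report that process's own PID, which makes the behaviour easy to check.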

Signed-off-by: Anthony Harivel 
---
 include/io/channel.h | 21 +
 io/channel-socket.c  | 28 
 io/channel.c | 13 +
 3 files changed, 62 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index 7986c49c713a..bdf0bca92ae2 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -160,6 +160,9 @@ struct QIOChannelClass {
   void *opaque);
 int (*io_flush)(QIOChannel *ioc,
 Error **errp);
+int (*io_peerpid)(QIOChannel *ioc,
+   unsigned int *pid,
+   Error **errp);
 };
 
 /* General I/O handling functions */
@@ -981,4 +984,22 @@ int coroutine_mixed_fn 
qio_channel_writev_full_all(QIOChannel *ioc,
 int qio_channel_flush(QIOChannel *ioc,
   Error **errp);
 
+/**
+ * qio_channel_get_peerpid:
+ * @ioc: the channel object
+ * @pid: pointer to pid
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Returns the pid of the peer process connected to this socket.
+ *
+ * The use of this function is possible only for connected
+ * AF_UNIX stream sockets and for AF_UNIX stream and datagram
+ * socket pairs on Linux.
+ * Returns 0 on success, -1 on error (with *pid set to -1 on non-Linux).
+ *
+ */
+int qio_channel_get_peerpid(QIOChannel *ioc,
+ unsigned int *pid,
+ Error **errp);
+
 #endif /* QIO_CHANNEL_H */
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 3a899b060858..608bcf066ecd 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -841,6 +841,33 @@ qio_channel_socket_set_cork(QIOChannel *ioc,
 socket_set_cork(sioc->fd, v);
 }
 
+static int
+qio_channel_socket_get_peerpid(QIOChannel *ioc,
+   unsigned int *pid,
+   Error **errp)
+{
+#ifdef CONFIG_LINUX
+QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+Error *err = NULL;
+socklen_t len = sizeof(struct ucred);
+
+struct ucred cred;
+if (getsockopt(sioc->fd,
+   SOL_SOCKET, SO_PEERCRED,
+   &cred, &len) == -1) {
+error_setg_errno(&err, errno, "Unable to get peer credentials");
+error_propagate(errp, err);
+*pid = -1;
+return -1;
+}
+*pid = (unsigned int)cred.pid;
+return 0;
+#else
+error_setg(errp, "Unsupported feature");
+*pid = -1;
+return -1;
+#endif
+}
 
 static int
 qio_channel_socket_close(QIOChannel *ioc,
@@ -938,6 +965,7 @@ static void qio_channel_socket_class_init(ObjectClass 
*klass,
 #ifdef QEMU_MSG_ZEROCOPY
 ioc_klass->io_flush = qio_channel_socket_flush;
 #endif
+ioc_klass->io_peerpid = qio_channel_socket_get_peerpid;
 }
 
 static const TypeInfo qio_channel_socket_info = {
diff --git a/io/channel.c b/io/channel.c
index a1f12f8e9096..e3f17c24a00f 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -548,6 +548,19 @@ void qio_channel_set_cork(QIOChannel *ioc,
 }
 }
 
+int qio_channel_get_peerpid(QIOChannel *ioc,
+ unsigned int *pid,
+ Error **errp)
+{
+QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+if (!klass->io_peerpid) {
+error_setg(errp, "Channel does not support peer pid");
+return -1;
+}
+/* Propagate the backend's result: 0 on success, -1 on error. */
+return klass->io_peerpid(ioc, pid, errp);
+}
 
 off_t qio_channel_io_seek(QIOChannel *ioc,
   off_t offset,
-- 
2.44.0




[PATCH v5 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-04-11 Thread Anthony Harivel
Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
interface (Running Average Power Limit) for advertising the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM,
etc.).

The consumption is reported via MSRs (model specific registers) like
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
64 bits registers that represent the accumulated energy consumption in
micro Joules. They are updated by microcode every ~1ms.

For now, KVM always returns 0 when the guest requests the value of
these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle
these MSRs dynamically in userspace.

To limit the amount of system calls for every MSR call, create a new
thread in QEMU that updates the "virtual" MSR values asynchronously.

Each vCPU has its own vMSR to reflect the independence of vCPUs. The
thread updates the vMSR values with the ratio of energy consumed of
the whole physical CPU package the vCPU thread runs on and the
thread's utime and stime values.

All other non-vCPU threads are also taken into account. Their energy
consumption is evenly distributed among all vCPUs threads running on
the same physical CPU package.
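The distribution described above can be sketched as follows; vcpu_energy_share_uj() is a hypothetical illustration of the ratio calculation, not the patch's actual vmsr_energy.c code:

```c
#include <stdint.h>

/* Sketch: split a package's energy delta (in microjoules) between vCPU
 * threads according to their share of CPU time, and spread the non-vCPU
 * (QEMU helper) threads' consumption evenly across the vCPUs. */
static uint64_t vcpu_energy_share_uj(uint64_t pkg_delta_uj,
                                     uint64_t vcpu_ticks,
                                     uint64_t total_ticks,
                                     uint64_t non_vcpu_ticks,
                                     unsigned n_vcpus)
{
    if (total_ticks == 0 || n_vcpus == 0) {
        return 0;
    }
    /* Energy attributed to this vCPU's own run time. */
    uint64_t own = pkg_delta_uj * vcpu_ticks / total_ticks;
    /* Non-vCPU thread energy, distributed evenly among the vCPUs. */
    uint64_t other = pkg_delta_uj * non_vcpu_ticks / total_ticks / n_vcpus;
    return own + other;
}
```

With a 1 J package delta, a vCPU that ran 50% of the ticks on a 2-vCPU guest whose helper threads ran 20% of the ticks would be credited 500000 + 100000 microjoules.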

To overcome the problem that reading the RAPL MSR requires privileged
access, a socket communication between QEMU and the qemu-vmsr-helper is
mandatory. You can specify the socket path with the rapl-helper-socket
parameter.

This feature is activated with -accel kvm,rapl=true,rapl-helper-socket=/path/sock.sock

Actual limitation:
- Works only on Intel host CPUs because AMD CPUs use different MSR
  addresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
  the moment.

Signed-off-by: Anthony Harivel 
---
 accel/kvm/kvm-all.c   |  27 +++
 docs/specs/index.rst  |   1 +
 docs/specs/rapl-msr.rst   | 155 
 include/sysemu/kvm.h  |   2 +
 include/sysemu/kvm_int.h  |  32 +++
 target/i386/cpu.h |   8 +
 target/i386/kvm/kvm-cpu.c |   9 +
 target/i386/kvm/kvm.c | 428 ++
 target/i386/kvm/meson.build   |   1 +
 target/i386/kvm/vmsr_energy.c | 335 ++
 target/i386/kvm/vmsr_energy.h |  99 
 11 files changed, 1097 insertions(+)
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index a8cecd040ebc..7649f226767a 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3613,6 +3613,21 @@ static void kvm_set_device(Object *obj,
 s->device = g_strdup(value);
 }
 
+static void kvm_set_kvm_rapl(Object *obj, bool value, Error **errp)
+{
+KVMState *s = KVM_STATE(obj);
+s->msr_energy.enable = value;
+}
+
+static void kvm_set_kvm_rapl_socket_path(Object *obj,
+ const char *str,
+ Error **errp)
+{
+KVMState *s = KVM_STATE(obj);
+g_free(s->msr_energy.socket_path);
+s->msr_energy.socket_path = g_strdup(str);
+}
+
 static void kvm_accel_instance_init(Object *obj)
 {
 KVMState *s = KVM_STATE(obj);
@@ -3632,6 +3647,7 @@ static void kvm_accel_instance_init(Object *obj)
 s->xen_gnttab_max_frames = 64;
 s->xen_evtchn_max_pirq = 256;
 s->device = NULL;
+s->msr_energy.enable = false;
 }
 
 /**
@@ -3676,6 +3692,17 @@ static void kvm_accel_class_init(ObjectClass *oc, void 
*data)
 object_class_property_set_description(oc, "device",
 "Path to the device node to use (default: /dev/kvm)");
 
+object_class_property_add_bool(oc, "rapl",
+   NULL,
+   kvm_set_kvm_rapl);
+object_class_property_set_description(oc, "rapl",
+"Allow energy related MSRs for RAPL interface in Guest");
+
+object_class_property_add_str(oc, "rapl-helper-socket", NULL,
+  kvm_set_kvm_rapl_socket_path);
+object_class_property_set_description(oc, "rapl-helper-socket",
+"Socket Path for communicating with the Virtual MSR helper daemon");
+
 kvm_arch_accel_class_init(oc);
 }
 
diff --git a/docs/specs/index.rst b/docs/specs/index.rst
index 1484e3e76077..e738ea7d102f 100644
--- a/docs/specs/index.rst
+++ b/docs/specs/index.rst
@@ -33,3 +33,4 @@ guest hardware that is specific to QEMU.
virt-ctlr
vmcoreinfo
vmgenid
+   rapl-msr
diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
new file mode 100644
index ..1202ee89bee0
--- /dev/null
+++ b/docs/specs/rapl-msr.rst
@@ -0,0 +1,155 @@
+================
+RAPL MSR support
+================
+
+The RAPL interface (Running Average Power Limit) advertises the accumulated
+energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).
+
+The consumpt

[PATCH v5 0/3] Add support for the RAPL MSRs series

2024-04-11 Thread Anthony Harivel
Dear maintainers, 

First of all, thank you very much for your review of my patch 
[1].

In this version (v5), I have attempted to address all the problems 
raised by Daniel during the last review. I've been more careful with 
all the remarks made. 

However, one question remains unanswered, concerning the location of 
"/var/local/run/qemu-vmsr-helper.sock", created by 
compute_default_paths(): QEMU is not allowed to reach the socket there.

Thank you again for your continued guidance. 

v4 -> v5


- correct qio_channel_get_peerpid: return pid = -1 in case of error
- Vmsr_helper: compile only for x86
- Vmsr_helper: use qio_channel_read/write_all
- Vmsr_helper: abandon user/group
- Vmsr_energy.c: correct all error_report
- Vmsr thread: compute default socket path only once
- Vmsr thread: open socket only once
- Pass relevant QEMU CI

v3 -> v4


- Correct memory leaks with AddressSanitizer  
- Add a sanity check in QEMU and qemu-vmsr-helper that verifies the host 
  is Intel and RAPL is activated.
- Rename poorly named variables for easier comprehension
- Move the code that checks the host before creating the VMSR thread
- Get rid of libnuma: create a function that reads the host topology 
  from sysfs instead

v2 -> v3


- Move all memory allocations from the C library to GLib
- Compiles on *BSD (functional on Linux only)
- No more limitation on the virtual package: each vCPU that belongs to 
  the same virtual package gives the same results as expected on 
  a real CPU.
  This has been tested with topologies like:
 -smp 4,sockets=2
 -smp 16,sockets=4,cores=2,threads=2

v1 -> v2


- To address CVE-2020-8694, a socket communication channel is created
  to a privileged helper
- Add the privileged helper (qemu-vmsr-helper)
- Add SO_PEERCRED in qio channel socket

RFC -> v1
-

- Add vmsr_* in front of all vmsr-specific functions
- Replace malloc()/calloc()... with their GLib equivalents
- Pre-allocate dynamic memory where possible
- Add documentation of the implementation, limitations and usage

Best regards,
Anthony

[1]: https://lists.gnu.org/archive/html/qemu-devel/2024-03/msg04417.html

Anthony Harivel (3):
  qio: add support for SO_PEERCRED for socket channel
  tools: build qemu-vmsr-helper
  Add support for RAPL MSRs in KVM/Qemu

 accel/kvm/kvm-all.c  |  27 ++
 contrib/systemd/qemu-vmsr-helper.service |  15 +
 contrib/systemd/qemu-vmsr-helper.socket  |   9 +
 docs/specs/index.rst |   1 +
 docs/specs/rapl-msr.rst  | 155 +++
 docs/tools/index.rst |   1 +
 docs/tools/qemu-vmsr-helper.rst  |  89 
 include/io/channel.h |  21 +
 include/sysemu/kvm.h |   2 +
 include/sysemu/kvm_int.h |  32 ++
 io/channel-socket.c  |  28 ++
 io/channel.c |  13 +
 meson.build  |   7 +
 target/i386/cpu.h|   8 +
 target/i386/kvm/kvm-cpu.c|   9 +
 target/i386/kvm/kvm.c| 428 ++
 target/i386/kvm/meson.build  |   1 +
 target/i386/kvm/vmsr_energy.c| 335 ++
 target/i386/kvm/vmsr_energy.h|  99 +
 tools/i386/qemu-vmsr-helper.c| 529 +++
 tools/i386/rapl-msr-index.h  |  28 ++
 21 files changed, 1837 insertions(+)
 create mode 100644 contrib/systemd/qemu-vmsr-helper.service
 create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

-- 
2.44.0




[PATCH v5 2/3] tools: build qemu-vmsr-helper

2024-04-11 Thread Anthony Harivel
Introduce a privileged helper to access RAPL MSR.

The privileged helper tool, qemu-vmsr-helper, is designed to provide
virtual machines with the ability to read specific RAPL (Running Average
Power Limit) MSRs without requiring CAP_SYS_RAWIO privileges or relying
on external, out-of-tree patches.

The helper tool leverages Unix permissions and SO_PEERCRED socket
options to enforce access control, ensuring that only processes
explicitly requesting read access via readmsr() from a valid Thread ID
can access these MSRs.

The list of RAPL MSRs that are allowed to be read by the helper tool is
defined in rapl-msr-index.h. This list corresponds to the RAPL MSRs that
will be supported in the next commit titled "Add support for RAPL MSRs
in KVM/QEMU."

The tool is intentionally designed to run on the Linux x86 platform.
This initial implementation is tailored for Intel CPUs but can be
extended to support AMD CPUs in the future.

Signed-off-by: Anthony Harivel 
---
 contrib/systemd/qemu-vmsr-helper.service |  15 +
 contrib/systemd/qemu-vmsr-helper.socket  |   9 +
 docs/tools/index.rst |   1 +
 docs/tools/qemu-vmsr-helper.rst  |  89 
 meson.build  |   7 +
 tools/i386/qemu-vmsr-helper.c| 529 +++
 tools/i386/rapl-msr-index.h  |  28 ++
 7 files changed, 678 insertions(+)
 create mode 100644 contrib/systemd/qemu-vmsr-helper.service
 create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

diff --git a/contrib/systemd/qemu-vmsr-helper.service 
b/contrib/systemd/qemu-vmsr-helper.service
new file mode 100644
index ..8fd397bf79a9
--- /dev/null
+++ b/contrib/systemd/qemu-vmsr-helper.service
@@ -0,0 +1,15 @@
+[Unit]
+Description=Virtual RAPL MSR Daemon for QEMU
+
+[Service]
+WorkingDirectory=/tmp
+Type=simple
+ExecStart=/usr/bin/qemu-vmsr-helper
+PrivateTmp=yes
+ProtectSystem=strict
+ReadWritePaths=/var/run
+RestrictAddressFamilies=AF_UNIX
+Restart=always
+RestartSec=0
+
+[Install]
diff --git a/contrib/systemd/qemu-vmsr-helper.socket 
b/contrib/systemd/qemu-vmsr-helper.socket
new file mode 100644
index ..183e8304d6e2
--- /dev/null
+++ b/contrib/systemd/qemu-vmsr-helper.socket
@@ -0,0 +1,9 @@
+[Unit]
+Description=Virtual RAPL MSR helper for QEMU
+
+[Socket]
+ListenStream=/run/qemu-vmsr-helper.sock
+SocketMode=0600
+
+[Install]
+WantedBy=multi-user.target
diff --git a/docs/tools/index.rst b/docs/tools/index.rst
index 8e65ce0dfc7b..33ad438e86f6 100644
--- a/docs/tools/index.rst
+++ b/docs/tools/index.rst
@@ -16,3 +16,4 @@ command line utilities and other standalone programs.
qemu-pr-helper
qemu-trace-stap
virtfs-proxy-helper
+   qemu-vmsr-helper
diff --git a/docs/tools/qemu-vmsr-helper.rst b/docs/tools/qemu-vmsr-helper.rst
new file mode 100644
index ..6ec87b49d962
--- /dev/null
+++ b/docs/tools/qemu-vmsr-helper.rst
@@ -0,0 +1,89 @@
+============================
+QEMU virtual RAPL MSR helper
+============================
+
+Synopsis
+========
+
+**qemu-vmsr-helper** [*OPTION*]
+
+Description
+-----------
+
+Implements the virtual RAPL MSR helper for QEMU.
+
+Accessing the RAPL (Running Average Power Limit) MSR enables the RAPL powercap
+driver to advertise and monitor the power consumption or accumulated energy
+consumption of different power domains, such as CPU packages, DRAM, and other
+components when available.
+
+However those registers are only accessible with privileged access (CAP_SYS_RAWIO).
+QEMU can use an external helper to access those privileged registers.
+
+:program:`qemu-vmsr-helper` is that external helper; it creates a listener
+socket which will accept incoming connections for communication with QEMU.
+
+If you want to run VMs in a setup like this, this helper should be started as a
+system service, and you should read the QEMU manual section on "RAPL MSR
+support" to find out how to configure QEMU to connect to the socket created by
+:program:`qemu-vmsr-helper`.
+
+After connecting to the socket, :program:`qemu-vmsr-helper` can
+optionally drop root privileges, except for those capabilities that
+are needed for its operation.
+
+:program:`qemu-vmsr-helper` can also use the systemd socket activation
+protocol.  In this case, the systemd socket unit should specify a
+Unix stream socket, like this::
+
+[Socket]
+ListenStream=/var/run/qemu-vmsr-helper.sock
+
+Options
+-------
+
+.. program:: qemu-vmsr-helper
+
+.. option:: -d, --daemon
+
+  run in the background (and create a PID file)
+
+.. option:: -q, --quiet
+
+  decrease verbosity
+
+.. option:: -v, --verbose
+
+  increase verbosity
+
+.. option:: -f, --pidfile=PATH
+
+  PID file when running as a daemon. By default the PID file
+  is created in the system runtime state directory, for example
+  :file:`/var/run/

Re: [PATCH] xen-hvm: Avoid livelock while handling buffered ioreqs

2024-04-09 Thread Anthony PERARD
On Thu, Apr 04, 2024 at 03:08:33PM +0100, Ross Lagerwall wrote:
> diff --git a/hw/xen/xen-hvm-common.c b/hw/xen/xen-hvm-common.c
> index 1627da739822..1116b3978938 100644
> --- a/hw/xen/xen-hvm-common.c
> +++ b/hw/xen/xen-hvm-common.c
> @@ -521,22 +521,30 @@ static bool handle_buffered_iopage(XenIOState *state)
[...]
>  
>  static void handle_buffered_io(void *opaque)
>  {
> +unsigned int handled;
>  XenIOState *state = opaque;
>  
> -if (handle_buffered_iopage(state)) {
> +handled = handle_buffered_iopage(state);
> +if (handled >= IOREQ_BUFFER_SLOT_NUM) {
> +/* We handled a full page of ioreqs. Schedule a timer to continue
> + * processing while giving other stuff a chance to run.
> + */

./scripts/checkpatch.pl report a style issue here:
WARNING: Block comments use a leading /* on a separate line

I can try to remember to fix that on commit.

>  timer_mod(state->buffered_io_timer,
> -BUFFER_IO_MAX_DELAY + 
> qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
> -} else {
> +qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
> +} else if (handled == 0) {

Just curious, why did you check for `handled == 0` here instead of
`handled != 0`? That would have avoided inverting the last 2 cases, and
the patch would just have introduced a new case without changing the
order of the existing ones. But not that important I guess.

>  timer_del(state->buffered_io_timer);
>  qemu_xen_evtchn_unmask(state->xce_handle, 
> state->bufioreq_local_port);
> +} else {
> +timer_mod(state->buffered_io_timer,
> +    BUFFER_IO_MAX_DELAY + 
> qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
>  }
>  }

Cheers,

-- 
Anthony PERARD



Re: [PATCH v4 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-04-05 Thread Anthony Harivel
Hi Daniel,

> > +SocketAddress saddr = {
> > +.type = SOCKET_ADDRESS_TYPE_UNIX,
> > +.u.q_unix.path = socket_path
> > +};
> > +QIOChannelSocket *sioc = qio_channel_socket_new();
> > +Error *local_err = NULL;
> > +
> > +int r;
> > +
> > +qio_channel_set_name(QIO_CHANNEL(sioc), "vmsr-helper");
> > +qio_channel_socket_connect_sync(sioc,
> > +,
> > +_err);
> > +g_free(socket_path);
> > +if (local_err) {
> > +goto out_close;
> > +}
>
> In the previous posting I suggested that connectiong to the
> helper again & again for every individual MSR read is a
> high overhead. Connect once, and then just keep the socket
> open forever.
>

Indeed, this would be way more efficient. 

Does that mean that I should create the socket during the 
initialisation of the main loop (kvm_msr_energy_thread_init), keep 
track of the context variable, and then just pass the QIOChannelSocket 
pointer as a parameter to the vmsr_read_msr() function to send the 
data?

Regards,
Anthony




Re: [PATCH v3 2/2] xen: fix stubdom PCI addr

2024-04-03 Thread Anthony PERARD
On Wed, Mar 27, 2024 at 04:05:15AM +0100, Marek Marczykowski-Górecki wrote:
> When running in a stubdomain, the config space access via sysfs needs to
> use BDF as seen inside stubdomain (connected via xen-pcifront), which is
> different from the real BDF. For other purposes (hypercall parameters
> etc), the real BDF needs to be used.
> Get the in-stubdomain BDF by looking up relevant PV PCI xenstore
> entries.
> 
> Signed-off-by: Marek Marczykowski-Górecki 

Reviewed-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH v3 1/2] hw/xen: detect when running inside stubdomain

2024-04-03 Thread Anthony PERARD
On Wed, Mar 27, 2024 at 04:05:14AM +0100, Marek Marczykowski-Górecki wrote:
> Introduce global xen_is_stubdomain variable when qemu is running inside
> a stubdomain instead of dom0. This will be relevant for subsequent
> patches, as few things like accessing PCI config space need to be done
> differently.
> 
> Signed-off-by: Marek Marczykowski-Górecki 

Reviewed-by: Anthony PERARD 

Thanks,


-- 
Anthony PERARD



Re: [PATCH-for-9.0 v2 19/19] hw/xen: Have most of Xen files become target-agnostic

2024-03-28 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:15PM +0100, Philippe Mathieu-Daudé wrote:
> Previous commits re-organized the target-specific bits
> from Xen files. We can now build the common files once
> instead of per-target.
> 
> Only 4 files call libxen API (thus its CPPFLAGS):
> - xen-hvm-common.c,
> - xen_pt.c, xen_pt_graphics.c, xen_pt_msi.c
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
> Reworked since v1 so dropping David's R-b tag.
> ---
>  accel/xen/meson.build  |  2 +-
>  hw/block/dataplane/meson.build |  2 +-
>  hw/xen/meson.build | 21 ++---
>  3 files changed, 12 insertions(+), 13 deletions(-)
> 
> diff --git a/accel/xen/meson.build b/accel/xen/meson.build
> index 002bdb03c6..455ad5d6be 100644
> --- a/accel/xen/meson.build
> +++ b/accel/xen/meson.build
> @@ -1 +1 @@
> -specific_ss.add(when: 'CONFIG_XEN', if_true: files('xen-all.c'))
> +system_ss.add(when: 'CONFIG_XEN', if_true: files('xen-all.c'))
> diff --git a/hw/block/dataplane/meson.build b/hw/block/dataplane/meson.build
> index 025b3b061b..4d8bcb0bb9 100644
> --- a/hw/block/dataplane/meson.build
> +++ b/hw/block/dataplane/meson.build
> @@ -1,2 +1,2 @@
>  system_ss.add(when: 'CONFIG_VIRTIO_BLK', if_true: files('virtio-blk.c'))
> -specific_ss.add(when: 'CONFIG_XEN_BUS', if_true: files('xen-block.c'))
> +system_ss.add(when: 'CONFIG_XEN_BUS', if_true: files('xen-block.c'))
> diff --git a/hw/xen/meson.build b/hw/xen/meson.build
> index d887fa9ba4..403cab49cf 100644
> --- a/hw/xen/meson.build
> +++ b/hw/xen/meson.build
> @@ -7,26 +7,25 @@ system_ss.add(when: ['CONFIG_XEN_BUS'], if_true: files(
>'xen_pvdev.c',
>  ))
>  
> -system_ss.add(when: ['CONFIG_XEN', xen], if_true: files(
> +system_ss.add(when: ['CONFIG_XEN'], if_true: files(
>'xen-operations.c',
> -))
> -
> -xen_specific_ss = ss.source_set()
> -xen_specific_ss.add(files(
>'xen-mapcache.c',
> +))
> +system_ss.add(when: ['CONFIG_XEN', xen], if_true: files(
>'xen-hvm-common.c',
>  ))
> +
>  if have_xen_pci_passthrough
> -  xen_specific_ss.add(files(
> +  system_ss.add(when: ['CONFIG_XEN'], if_true: files(
>  'xen-host-pci-device.c',
> -'xen_pt.c',
>  'xen_pt_config_init.c',
> -'xen_pt_graphics.c',
>  'xen_pt_load_rom.c',
> +  ))
> +  system_ss.add(when: ['CONFIG_XEN', xen], if_true: files(
> +'xen_pt.c',
> +'xen_pt_graphics.c',

How is it useful to separate those source files? In the commit
description, there's talk about "CPPFLAGS", but having `when: [xen]`
doesn't change the flags used to build those objects, so the talk about
"CPPFLAGS" is confusing.
Second, if for some reason the dependency `xen` is false, but
`CONFIG_XEN` is true, then we wouldn't be able to build QEMU. Try
linking a binary with "xen_pt_config_init.o" but without "xen_pt.o",
that's not going to work. So even if that first source file doesn't
directly depend on the Xen libraries, it depends on "xen_pt.o" which
depends on the Xen libraries. So ultimately, I think all those source
files should have the same condition: ['CONFIG_XEN', xen].

I've only checked the xen_pt* source files, I don't know if the same
applies to "xen-operations.c" or "xen-mapcache.c".

Beside this, QEMU built with Xen support still seems to work fine, so
adding the objects to `system_ss` instead of `specific_ss` seems
alright.

Thanks,

-- 
Anthony PERARD



Re: [PATCH v4 2/3] tools: build qemu-vmsr-helper

2024-03-28 Thread Anthony Harivel
Hi Daniel, 

My apologies for all the missed feedback in v2. 
I'll be more organized for my next iteration. 

For this specific comment below, I would like to make sure I'm testing 
the right way. 

> > diff --git a/meson.build b/meson.build
> > index b375248a7614..376da49b60ab 100644
> > --- a/meson.build
> > +++ b/meson.build
> > @@ -4052,6 +4052,11 @@ if have_tools
> > dependencies: [authz, crypto, io, qom, qemuutil,
> >libcap_ng, mpathpersist],
> > install: true)
> > +
> > +executable('qemu-vmsr-helper', files('tools/i386/qemu-vmsr-helper.c'),
> > +   dependencies: [authz, crypto, io, qom, qemuutil,
> > +  libcap_ng, mpathpersist],
> > +   install: true)
> >endif
>
> Missed feedback from v2 saying this must /only/ be built
> on x86 architectures. It fails to build on others due
> to the ASM usage, e.g.
>
> https://gitlab.com/berrange/qemu/-/jobs/6445384073
>

To recreate your build system, I need to, for example, compile with the 
following configuration for arm64 (aarch64):

../configure --enable-werror --disable-docs --enable-fdt=system 
--disable-user --cross-prefix=aarch64-linux-gnu- 
--target-list-exclude="arm-softmmu cris-softmmu i386-softmmu 
microblaze-softmmu mips-softmmu mipsel-softmmu mips64-softmmu 
ppc-softmmu riscv32-softmmu sh4-softmmu sparc-softmmu xtensa-softmmu"

This is cross-compiling on x86, right?
Because on my laptop I've got the following error: 

WARNING: unrecognized host CPU, proceeding with 'uname -m' output 'x86_64'
python determined to be '/usr/bin/python3'
python version: Python 3.12.2
mkvenv: Creating non-isolated virtual environment at 'pyvenv'
mkvenv: checking for meson>=0.63.0

ERROR: Unrecognized host OS (uname -s reports 'Linux')

It looks like it wants to build natively on aarch64.
Maybe I need to create a VM with aarch64 Debian and compile natively?
Might take a long time but I'm not sure this is the best way.

Regards,
Anthony




Re: [PATCH-for-9.0 v2 16/19] hw/xen/xen_pt: Add missing license

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:12PM +0100, Philippe Mathieu-Daudé wrote:
> Commit eaab4d60d3 ("Introduce Xen PCI Passthrough, qdevice")
> introduced both xen_pt.[ch], but only added the license to
> xen_pt.c. Use the same license for xen_pt.h.
> 
> Suggested-by: David Woodhouse 
> Signed-off-by: Philippe Mathieu-Daudé 

Fine by me. Looks like there was a license header before:
https://xenbits.xen.org/gitweb/?p=qemu-xen-unstable.git;a=blob;f=hw/pass-through.h;h=0b5822414e24d199a064abccc4d378dcaf569bd6;hb=HEAD
I don't know why I didn't copy it over here.

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH-for-9.0 v2 15/19] hw/xen: Reduce inclusion of 'cpu.h' to target-specific sources

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:11PM +0100, Philippe Mathieu-Daudé wrote:
> We rarely need to include "cpu.h" in headers. Including it
> 'taint' headers to be target-specific. Here only the i386/arm
> implementations requires "cpu.h", so include it there and
> remove from the "hw/xen/xen-hvm-common.h" *common* header.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> Reviewed-by: Richard Henderson 
> Reviewed-by: David Woodhouse 

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [RFC PATCH-for-9.0 v2 13/19] hw/xen: Remove use of 'target_ulong' in handle_ioreq()

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:09PM +0100, Philippe Mathieu-Daudé wrote:
> Per commit f17068c1c7 ("xen-hvm: reorganize xen-hvm and move common
> function to xen-hvm-common"), handle_ioreq() is expected to be
> target-agnostic. However it uses 'target_ulong', which is a target
> specific definition.
> 
> Per xen/include/public/hvm/ioreq.h header:
> 
>   struct ioreq {
> uint64_t addr;  /* physical address */
> uint64_t data;  /* data (or paddr of data) */
> uint32_t count; /* for rep prefixes */
> uint32_t size;  /* size in bytes */
> uint32_t vp_eport;  /* evtchn for notifications to/from device model 
> */
> uint16_t _pad0;
> uint8_t state:4;
> uint8_t data_is_ptr:1;  /* if 1, data above is the guest paddr
>  * of the real data to use. */
> uint8_t dir:1;  /* 1=read, 0=write */
> uint8_t df:1;
> uint8_t _pad1:1;
> uint8_t type;   /* I/O type */
>   };
>   typedef struct ioreq ioreq_t;
> 
> If 'data' is not a pointer, it is a u64.
> 
> - In PIO / VMWARE_PORT modes, only 32-bit are used.
> 
> - In MMIO COPY mode, memory is accessed by chunks of 64-bit

Looks like it could be 8, 16, or 32 bits as well, depending on
req->size.

> - In PCI_CONFIG mode, access is u8 or u16 or u32.
> 
> - None of TIMEOFFSET / INVALIDATE use 'req'.
> 
> - Fallback is only used in x86 for VMWARE_PORT.
> 
> Masking the upper bits of 'data' to keep 'req->size' low bits
> is irrelevant of the target word size. Remove the word size
> check and always extract the relevant bits.

When building QEMU for Xen, we tend to build the target "i386-softmmu",
which seems to have target_ulong == uint32_t. So the `data`
clamping would only apply to sizes 8 and 16. The clamping with
target_ulong was introduced a long time ago, here:
https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=b4a663b87df3954557434a2d31bff7f6b2706ec1
and there were more IOREQ types back then.
So my guess is it isn't relevant anymore, but extending the clamping to
32-bit requests should be fine when using qemu-system-i386, as it is
already done if one uses qemu-system-x86_64.

So I think the patch is fine, and the tests I've run so far worked fine.

> Signed-off-by: Philippe Mathieu-Daudé 

Reviewed-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH-for-9.0 v2 12/19] hw/xen: Merge 'hw/xen/arch_hvm.h' in 'hw/xen/xen-hvm-common.h'

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:08PM +0100, Philippe Mathieu-Daudé wrote:
> We don't need a target-specific header for common target-specific
> prototypes. Declare xen_arch_handle_ioreq() and xen_arch_set_memory()
> in "hw/xen/xen-hvm-common.h".
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> Reviewed-by: David Woodhouse 
> Reviewed-by: Richard Henderson 

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH-for-9.0 v2 11/19] hw/xen/xen_arch_hvm: Rename prototypes using 'xen_arch_' prefix

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:07PM +0100, Philippe Mathieu-Daudé wrote:
> Use a common 'xen_arch_' prefix for architecture-specific functions.
> Rename xen_arch_set_memory() and xen_arch_handle_ioreq().
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> Reviewed-by: David Woodhouse 
> Reviewed-by: Richard Henderson 

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [RFC PATCH-for-9.0 v2 09/19] hw/block/xen_blkif: Align structs with QEMU_ALIGNED() instead of #pragma

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:05PM +0100, Philippe Mathieu-Daudé wrote:
> Except imported source files, QEMU code base uses
> the QEMU_ALIGNED() macro to align its structures.

This patch only converts the alignment, but discards the packing. We
need both, or the struct layout changes.

> ---
>  hw/block/xen_blkif.h | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/block/xen_blkif.h b/hw/block/xen_blkif.h
> index 99733529c1..c1d154d502 100644
> --- a/hw/block/xen_blkif.h
> +++ b/hw/block/xen_blkif.h
> @@ -18,7 +18,6 @@ struct blkif_common_response {
>  };
>  
>  /* i386 protocol version */
> -#pragma pack(push, 4)
>  struct blkif_x86_32_request {
>  uint8_toperation;/* BLKIF_OP_??? 
> */
>  uint8_tnr_segments;  /* number of segments   
> */
> @@ -26,7 +25,7 @@ struct blkif_x86_32_request {
>  uint64_t   id;   /* private guest value, echoed in resp  
> */
>  blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  
> */
>  struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> -};
> +} QEMU_ALIGNED(4);

E.g. for this one, I've compared the output of
`pahole --class_name=blkif_x86_32_request build/qemu-system-i386`:

--- before
+++ after
@@ -1,11 +1,15 @@
 struct blkif_x86_32_request {
 	uint8_t                      operation;     /*  0   1 */
 	uint8_t                      nr_segments;   /*  1   1 */
 	uint16_t                     handle;        /*  2   2 */
-	uint64_t                     id;            /*  4   8 */
-	uint64_t                     sector_number; /* 12   8 */
-	struct blkif_request_segment seg[11];       /* 20  88 */

-	/* size: 108, cachelines: 2, members: 6 */
-	/* last cacheline: 44 bytes */
-} __attribute__((__packed__));
+	/* XXX 4 bytes hole, try to pack */
+
+	uint64_t                     id;            /*  8   8 */
+	uint64_t                     sector_number; /* 16   8 */
+	struct blkif_request_segment seg[11];       /* 24  88 */
+
+	/* size: 112, cachelines: 2, members: 6 */
+	/* sum members: 108, holes: 1, sum holes: 4 */
+	/* last cacheline: 48 bytes */
+} __attribute__((__aligned__(8)));

Thanks,

-- 
Anthony PERARD



Re: [PATCH-for-9.0 v2 08/19] hw/xen: Remove unused Xen stubs

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:04PM +0100, Philippe Mathieu-Daudé wrote:
> All these stubs are protected by a 'if (xen_enabled())' check.

Are you sure? There's still nothing that prevents a compiler from wanting
those, I don't think.

Sure, often compilers will remove dead code in `if(0){...}`, but there's
no guarantee, is there?

Cheers,

-- 
Anthony PERARD



Re: [PATCH-for-9.0 v2 05/19] hw/display: Restrict xen_register_framebuffer() call to Xen

2024-03-27 Thread Anthony PERARD
On Tue, Nov 14, 2023 at 03:38:01PM +0100, Philippe Mathieu-Daudé wrote:
> Only call xen_register_framebuffer() when Xen is enabled.
> 
> Signed-off-by: Philippe Mathieu-Daudé 

I don't think this patch is very useful but it's fine, so:
Reviewed-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH v2 2/2] xen: fix stubdom PCI addr

2024-03-26 Thread Anthony PERARD
First things first, could you fix the coding style?

Run something like `./scripts/checkpatch.pl @^..` or
`./scripts/checkpatch.pl master..`. Patchew might have run that for you
if the patch series had a cover letter.

On Tue, Mar 05, 2024 at 08:12:30PM +0100, Marek Marczykowski-Górecki wrote:
> diff --git a/hw/xen/xen-host-pci-device.c b/hw/xen/xen-host-pci-device.c
> index 8c6e9a1716..8ea2a5a4af 100644
> --- a/hw/xen/xen-host-pci-device.c
> +++ b/hw/xen/xen-host-pci-device.c
> @@ -9,6 +9,8 @@
>  #include "qemu/osdep.h"
>  #include "qapi/error.h"
>  #include "qemu/cutils.h"
> +#include "hw/xen/xen-legacy-backend.h"

I'd like to avoid this header here, but that would be complicated at the
moment, as the global variable `xenstore` would be missing. So for now,
that's fine. I guess that could be reworked if something like what
Philippe talked about at
https://lore.kernel.org/qemu-devel/429a5a27-21b9-45bd-a1a6-a1c2ccc48...@linaro.org/
materialises.


Beside the coding style, the patch looks fine.

Thanks,

-- 
Anthony PERARD



Re: [PATCH v2 1/2] hw/xen: detect when running inside stubdomain

2024-03-26 Thread Anthony PERARD
On Tue, Mar 05, 2024 at 08:12:29PM +0100, Marek Marczykowski-Górecki wrote:
> diff --git a/hw/xen/xen-legacy-backend.c b/hw/xen/xen-legacy-backend.c
> index 124dd5f3d6..6bd4e6eb2f 100644
> --- a/hw/xen/xen-legacy-backend.c
> +++ b/hw/xen/xen-legacy-backend.c
> @@ -603,6 +603,20 @@ static void xen_set_dynamic_sysbus(void)
>  machine_class_allow_dynamic_sysbus_dev(mc, TYPE_XENSYSDEV);
>  }
>  
> +static bool xen_check_stubdomain(void)
> +{
> +char *dm_path = g_strdup_printf("/local/domain/%d/image", xen_domid);
> +int32_t dm_domid;
> +bool is_stubdom = false;
> +
> +if (!xenstore_read_int(dm_path, "device-model-domid", _domid)) {
> +is_stubdom = dm_domid != 0;
> +}
> +
> +g_free(dm_path);
> +return is_stubdom;
> +}
> +
>  void xen_be_init(void)
>  {
>  xenstore = qemu_xen_xs_open();
> @@ -616,6 +630,8 @@ void xen_be_init(void)
>  exit(1);
>  }
>  
> +xen_is_stubdomain = xen_check_stubdomain();

This isn't really backend-specific information, and xen_be_init() is
all about supporting the old backend implementations. (qdisk, which was
the first to be rewritten, doesn't need xen_be_init(), or shouldn't.)
Could we move the initialisation elsewhere?

Is this relevant for PV guests? If not, we could move the initialisation
to xen_hvm_init_pc().

Also, avoid having xen_check_stubdomain() depending on
"xen-legacy-backend", if possible.

(In xen_hvm_init_pc(), a call to xen_register_ioreq() opens another
xenstore, as `state->xenstore`.)

(There's already been effort to build QEMU without legacy backends;
that stubdom check would break in this scenario.)

Thanks,

-- 
Anthony PERARD



[PATCH v4 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-03-18 Thread Anthony Harivel
Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
interface (Running Average Power Limit) for advertising the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM,
etc.).

The consumption is reported via MSRs (model specific registers) like
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
64-bit registers that represent the accumulated energy consumption in
micro Joules. They are updated by microcode every ~1ms.

For now, KVM always returns 0 when the guest requests the value of
these MSRs. Use the KVM MSR filtering mechanism to allow QEMU to handle
these MSRs dynamically in userspace.

To limit the number of system calls for every MSR read, create a new
thread in QEMU that updates the "virtual" MSR values asynchronously.

Each vCPU has its own vMSR to reflect the independence of vCPUs. The
thread updates the vMSR values with a share of the energy consumed by
the whole physical CPU package that the vCPU thread runs on,
proportional to the thread's utime and stime values.

All other non-vCPU threads are also taken into account. Their energy
consumption is evenly distributed among all vCPU threads running on
the same physical CPU package.

To overcome the problem that reading the RAPL MSR requires privileged
access, a socket communication between QEMU and the qemu-vmsr-helper is
mandatory. You can specify the socket path via a parameter.

This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock

Current limitations:
- Works only on Intel host CPUs because AMD CPUs use different MSR
  addresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
  the moment.

Signed-off-by: Anthony Harivel 
---
 accel/kvm/kvm-all.c   |  27 +++
 docs/specs/index.rst  |   1 +
 docs/specs/rapl-msr.rst   | 155 +
 include/sysemu/kvm.h  |   2 +
 include/sysemu/kvm_int.h  |  30 +++
 target/i386/cpu.h |   8 +
 target/i386/kvm/kvm-cpu.c |   7 +
 target/i386/kvm/kvm.c | 420 ++
 target/i386/kvm/meson.build   |   1 +
 target/i386/kvm/vmsr_energy.c | 381 ++
 target/i386/kvm/vmsr_energy.h |  97 
 11 files changed, 1129 insertions(+)
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index a8cecd040ebc..7649f226767a 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3613,6 +3613,21 @@ static void kvm_set_device(Object *obj,
 s->device = g_strdup(value);
 }
 
+static void kvm_set_kvm_rapl(Object *obj, bool value, Error **errp)
+{
+KVMState *s = KVM_STATE(obj);
+s->msr_energy.enable = value;
+}
+
+static void kvm_set_kvm_rapl_socket_path(Object *obj,
+ const char *str,
+ Error **errp)
+{
+KVMState *s = KVM_STATE(obj);
+g_free(s->msr_energy.socket_path);
+s->msr_energy.socket_path = g_strdup(str);
+}
+
 static void kvm_accel_instance_init(Object *obj)
 {
 KVMState *s = KVM_STATE(obj);
@@ -3632,6 +3647,7 @@ static void kvm_accel_instance_init(Object *obj)
 s->xen_gnttab_max_frames = 64;
 s->xen_evtchn_max_pirq = 256;
 s->device = NULL;
+s->msr_energy.enable = false;
 }
 
 /**
@@ -3676,6 +3692,17 @@ static void kvm_accel_class_init(ObjectClass *oc, void 
*data)
 object_class_property_set_description(oc, "device",
 "Path to the device node to use (default: /dev/kvm)");
 
+object_class_property_add_bool(oc, "rapl",
+   NULL,
+   kvm_set_kvm_rapl);
+object_class_property_set_description(oc, "rapl",
+"Allow energy related MSRs for RAPL interface in Guest");
+
+object_class_property_add_str(oc, "rapl-helper-socket", NULL,
+  kvm_set_kvm_rapl_socket_path);
+object_class_property_set_description(oc, "rapl-helper-socket",
+"Socket path for communicating with the Virtual MSR helper daemon");
+
 kvm_arch_accel_class_init(oc);
 }
 
diff --git a/docs/specs/index.rst b/docs/specs/index.rst
index 1484e3e76077..e738ea7d102f 100644
--- a/docs/specs/index.rst
+++ b/docs/specs/index.rst
@@ -33,3 +33,4 @@ guest hardware that is specific to QEMU.
virt-ctlr
vmcoreinfo
vmgenid
+   rapl-msr
diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
new file mode 100644
index ..1202ee89bee0
--- /dev/null
+++ b/docs/specs/rapl-msr.rst
@@ -0,0 +1,155 @@
+
+RAPL MSR support
+
+
+The RAPL interface (Running Average Power Limit) is advertising the accumulated
+energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).
+
+

[PATCH v4 2/3] tools: build qemu-vmsr-helper

2024-03-18 Thread Anthony Harivel
Introduce a privileged helper to access RAPL MSR.

The privileged helper tool, qemu-vmsr-helper, is designed to provide
virtual machines with the ability to read specific RAPL (Running Average
Power Limit) MSRs without requiring CAP_SYS_RAWIO privileges or relying
on external, out-of-tree patches.

The helper tool leverages Unix permissions and SO_PEERCRED socket
options to enforce access control, ensuring that only processes
explicitly requesting read access via readmsr() from a valid Thread ID
can access these MSRs.
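One plausible way to implement that "valid Thread ID" check on Linux (an
assumption for illustration; the helper's actual validation may differ) is to
probe the /proc layout, since /proc/<pid>/task/<tid> exists only for threads
of that process:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Check that thread `tid` belongs to process `pid` by probing the
 * Linux /proc layout: /proc/<pid>/task/<tid> exists only for threads
 * of that process. A sketch of the kind of validation the helper can
 * perform on the SO_PEERCRED pid; not necessarily the exact check
 * qemu-vmsr-helper uses.
 */
static int tid_belongs_to_pid(pid_t pid, pid_t tid)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/%d/task/%d", (int)pid, (int)tid);
    return access(path, F_OK) == 0;
}
```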

The list of RAPL MSRs that are allowed to be read by the helper tool is
defined in rapl-msr-index.h. This list corresponds to the RAPL MSRs that
will be supported in the next commit titled "Add support for RAPL MSRs
in KVM/QEMU."

The tool is intentionally designed to run on the Linux x86 platform.
This initial implementation is tailored for Intel CPUs but can be
extended to support AMD CPUs in the future.
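The documentation added below also mentions the systemd socket-activation
protocol. A minimal sketch of how that protocol looks from the daemon's side:
systemd passes pre-opened sockets starting at fd 3 and advertises their count
in $LISTEN_FDS. Real code should use sd_listen_fds() from libsystemd, which
additionally validates $LISTEN_PID; the names here are illustrative only.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Minimal sketch of the systemd socket-activation check. Activated
 * sockets are numbered SD_LISTEN_FDS_START..SD_LISTEN_FDS_START +
 * count - 1. This sketch skips the LISTEN_PID validation that
 * sd_listen_fds() performs.
 */
#define SD_LISTEN_FDS_START 3

static int listen_fds_count(void)
{
    const char *e = getenv("LISTEN_FDS");

    return e ? atoi(e) : 0;
}
```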

Signed-off-by: Anthony Harivel 
---
 contrib/systemd/qemu-vmsr-helper.service |  15 +
 contrib/systemd/qemu-vmsr-helper.socket  |   9 +
 docs/tools/index.rst |   1 +
 docs/tools/qemu-vmsr-helper.rst  |  89 
 meson.build  |   5 +
 tools/i386/qemu-vmsr-helper.c| 564 +++
 tools/i386/rapl-msr-index.h  |  28 ++
 7 files changed, 711 insertions(+)
 create mode 100644 contrib/systemd/qemu-vmsr-helper.service
 create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

diff --git a/contrib/systemd/qemu-vmsr-helper.service 
b/contrib/systemd/qemu-vmsr-helper.service
new file mode 100644
index ..8fd397bf79a9
--- /dev/null
+++ b/contrib/systemd/qemu-vmsr-helper.service
@@ -0,0 +1,15 @@
+[Unit]
+Description=Virtual RAPL MSR Daemon for QEMU
+
+[Service]
+WorkingDirectory=/tmp
+Type=simple
+ExecStart=/usr/bin/qemu-vmsr-helper
+PrivateTmp=yes
+ProtectSystem=strict
+ReadWritePaths=/var/run
+RestrictAddressFamilies=AF_UNIX
+Restart=always
+RestartSec=0
+
+[Install]
diff --git a/contrib/systemd/qemu-vmsr-helper.socket 
b/contrib/systemd/qemu-vmsr-helper.socket
new file mode 100644
index ..183e8304d6e2
--- /dev/null
+++ b/contrib/systemd/qemu-vmsr-helper.socket
@@ -0,0 +1,9 @@
+[Unit]
+Description=Virtual RAPL MSR helper for QEMU
+
+[Socket]
+ListenStream=/run/qemu-vmsr-helper.sock
+SocketMode=0600
+
+[Install]
+WantedBy=multi-user.target
diff --git a/docs/tools/index.rst b/docs/tools/index.rst
index 8e65ce0dfc7b..33ad438e86f6 100644
--- a/docs/tools/index.rst
+++ b/docs/tools/index.rst
@@ -16,3 +16,4 @@ command line utilities and other standalone programs.
qemu-pr-helper
qemu-trace-stap
virtfs-proxy-helper
+   qemu-vmsr-helper
diff --git a/docs/tools/qemu-vmsr-helper.rst b/docs/tools/qemu-vmsr-helper.rst
new file mode 100644
index ..6ec87b49d962
--- /dev/null
+++ b/docs/tools/qemu-vmsr-helper.rst
@@ -0,0 +1,89 @@
+==
+QEMU virtual RAPL MSR helper
+==
+
+Synopsis
+
+
+**qemu-vmsr-helper** [*OPTION*]
+
+Description
+---
+
+Implements the virtual RAPL MSR helper for QEMU.
+
+Accessing the RAPL (Running Average Power Limit) MSR enables the RAPL powercap
+driver to advertise and monitor the power consumption or accumulated energy
+consumption of different power domains, such as CPU packages, DRAM, and other
+components when available.
+
+However, those registers are only accessible with privileged access
+(CAP_SYS_RAWIO). QEMU can use an external helper to access them.
+
+:program:`qemu-vmsr-helper` is that external helper; it creates a listener
+socket which will accept incoming connections for communication with QEMU.
+
+If you want to run VMs in a setup like this, this helper should be started as a
+system service, and you should read the QEMU manual section on "RAPL MSR
+support" to find out how to configure QEMU to connect to the socket created by
+:program:`qemu-vmsr-helper`.
+
+After connecting to the socket, :program:`qemu-vmsr-helper` can
+optionally drop root privileges, except for those capabilities that
+are needed for its operation.
+
+:program:`qemu-vmsr-helper` can also use the systemd socket activation
+protocol.  In this case, the systemd socket unit should specify a
+Unix stream socket, like this::
+
+[Socket]
+ListenStream=/var/run/qemu-vmsr-helper.sock
+
+Options
+---
+
+.. program:: qemu-vmsr-helper
+
+.. option:: -d, --daemon
+
+  run in the background (and create a PID file)
+
+.. option:: -q, --quiet
+
+  decrease verbosity
+
+.. option:: -v, --verbose
+
+  increase verbosity
+
+.. option:: -f, --pidfile=PATH
+
+  PID file when running as a daemon. By default the PID file
+  is created in the system runtime state directory, for example
+  :file:`/var/run/

[PATCH v4 0/3] Add support for the RAPL MSRs series

2024-03-18 Thread Anthony Harivel
Dear maintainers, 

First of all, thank you very much for your review of my patch 
[1].

In this version (v4), I have attempted to address all the problems 
addressed during the last review. I hope I did not forget anything.

I added more than 400 lines of code; I guess it's time we review that.

However, one question remains unanswered from Friday 1 March, pointing
out the issue with the location of "/var/local/run/qemu-vmsr-helper.sock",
created by compute_default_paths(): QEMU is not allowed to reach the
socket there.

Thank you again for your continued guidance. 

v3 -> v4


- Correct memory leaks with AddressSanitizer  

- Add sanity checks in QEMU and qemu-vmsr-helper that verify the host
  is Intel and RAPL is activated.

- Rename poorly named variables for easier comprehension

- Move code that checks Host before creating the VMSR thread

- Get rid of libnuma: add a function that reads the host topology from
  sysfs instead

v2 -> v3


- Move all memory allocations from Clib to Glib

- Compiles on *BSD (though it only works on Linux)

- No more limitation on the virtual package: each vCPU that belongs to 
  the same virtual package is giving the same results like expected on 
  a real CPU.
  This has been tested topology like:
 -smp 4,sockets=2
 -smp 16,sockets=4,cores=2,threads=2

v1 -> v2


- To overcome CVE-2020-8694, socket communication to a privileged
  helper is created

- Add the privileged helper (qemu-vmsr-helper)

- Add SO_PEERCRED in qio channel socket

RFC -> v1
-

- Add vmsr_* in front of all vmsr specific function

- Change malloc()/calloc()... with all glib equivalent

- Pre-allocate all dynamic memories when possible

- Add a Documentation of implementation, limitation and usage

Best regards,
Anthony

[1]: https://lore.kernel.org/all/20240125072214.318382-1-ahari...@redhat.com/#t

Anthony Harivel (3):
  qio: add support for SO_PEERCRED for socket channel
  tools: build qemu-vmsr-helper
  Add support for RAPL MSRs in KVM/Qemu

 accel/kvm/kvm-all.c  |  27 ++
 contrib/systemd/qemu-vmsr-helper.service |  15 +
 contrib/systemd/qemu-vmsr-helper.socket  |   9 +
 docs/specs/index.rst |   1 +
 docs/specs/rapl-msr.rst  | 155 +++
 docs/tools/index.rst |   1 +
 docs/tools/qemu-vmsr-helper.rst  |  89 
 include/io/channel.h |  21 +
 include/sysemu/kvm.h |   2 +
 include/sysemu/kvm_int.h |  30 ++
 io/channel-socket.c  |  24 +
 io/channel.c |  12 +
 meson.build  |   5 +
 target/i386/cpu.h|   8 +
 target/i386/kvm/kvm-cpu.c|   7 +
 target/i386/kvm/kvm.c| 420 +
 target/i386/kvm/meson.build  |   1 +
 target/i386/kvm/vmsr_energy.c| 381 +++
 target/i386/kvm/vmsr_energy.h|  97 
 tools/i386/qemu-vmsr-helper.c| 564 +++
 tools/i386/rapl-msr-index.h  |  28 ++
 21 files changed, 1897 insertions(+)
 create mode 100644 contrib/systemd/qemu-vmsr-helper.service
 create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

-- 
2.44.0




[PATCH v4 1/3] qio: add support for SO_PEERCRED for socket channel

2024-03-18 Thread Anthony Harivel
The function qio_channel_get_peercred() returns a pointer to the
credentials of the peer process connected to this socket.

This credentials structure is defined in  as follows:

struct ucred {
pid_t pid;/* Process ID of the sending process */
uid_t uid;/* User ID of the sending process */
gid_t gid;/* Group ID of the sending process */
};

The use of this function is possible only for connected AF_UNIX stream
sockets and for AF_UNIX stream and datagram socket pairs.

On platforms other than Linux, the function reports an error and sets
the pid to -1.
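A minimal, self-contained demonstration of the underlying
getsockopt(SO_PEERCRED) call on a Linux socketpair (background for the API,
not code from the patch): both ends of the pair belong to this process, so
the reported peer pid is our own.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the peer's pid from a connected AF_UNIX socket (Linux-only). */
static pid_t peer_pid_of(int fd)
{
    struct ucred cred;
    socklen_t len = sizeof(cred);

    if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0) {
        return -1;
    }
    return cred.pid;
}

/*
 * Create an AF_UNIX stream socket pair and query one end for the
 * peer credentials; since both ends are ours, this returns getpid().
 */
static pid_t socketpair_peer_pid(void)
{
    int sv[2];
    pid_t p;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    p = peer_pid_of(sv[0]);
    close(sv[0]);
    close(sv[1]);
    return p;
}
```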

Signed-off-by: Anthony Harivel 
---
 include/io/channel.h | 21 +
 io/channel-socket.c  | 24 
 io/channel.c | 12 
 3 files changed, 57 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index 7986c49c713a..01ad7bd7e430 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -160,6 +160,9 @@ struct QIOChannelClass {
   void *opaque);
 int (*io_flush)(QIOChannel *ioc,
 Error **errp);
+void (*io_peerpid)(QIOChannel *ioc,
+   unsigned int *pid,
+   Error **errp);
 };
 
 /* General I/O handling functions */
@@ -981,4 +984,22 @@ int coroutine_mixed_fn 
qio_channel_writev_full_all(QIOChannel *ioc,
 int qio_channel_flush(QIOChannel *ioc,
   Error **errp);
 
+/**
+ * qio_channel_get_peercred:
+ * @ioc: the channel object
+ * @pid: pointer to pid
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Returns the pid of the peer process connected to this socket.
+ *
+ * The use of this function is possible only for connected
+ * AF_UNIX stream sockets and for AF_UNIX stream and datagram
+ * socket pairs on Linux.
+ * Returns an error with pid set to -1 on non-Linux OSes.
+ *
+ */
+void qio_channel_get_peerpid(QIOChannel *ioc,
+ unsigned int *pid,
+ Error **errp);
+
 #endif /* QIO_CHANNEL_H */
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 3a899b060858..fcff92ecc151 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -841,6 +841,29 @@ qio_channel_socket_set_cork(QIOChannel *ioc,
 socket_set_cork(sioc->fd, v);
 }
 
+static void
+qio_channel_socket_get_peerpid(QIOChannel *ioc,
+   unsigned int *pid,
+   Error **errp)
+{
+#ifdef CONFIG_LINUX
+QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+Error *err = NULL;
+socklen_t len = sizeof(struct ucred);
+
+struct ucred cred;
+if (getsockopt(sioc->fd,
+   SOL_SOCKET, SO_PEERCRED,
+   &cred, &len) == -1) {
+error_setg_errno(&err, errno, "Unable to get peer credentials");
+error_propagate(errp, err);
+}
+*pid = (unsigned int)cred.pid;
+#else
+error_setg(errp, "Unsupported feature");
+*pid = -1;
+#endif
+}
 
 static int
 qio_channel_socket_close(QIOChannel *ioc,
@@ -938,6 +961,7 @@ static void qio_channel_socket_class_init(ObjectClass 
*klass,
 #ifdef QEMU_MSG_ZEROCOPY
 ioc_klass->io_flush = qio_channel_socket_flush;
 #endif
+ioc_klass->io_peerpid = qio_channel_socket_get_peerpid;
 }
 
 static const TypeInfo qio_channel_socket_info = {
diff --git a/io/channel.c b/io/channel.c
index a1f12f8e9096..777989bc9a81 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -548,6 +548,18 @@ void qio_channel_set_cork(QIOChannel *ioc,
 }
 }
 
+void qio_channel_get_peerpid(QIOChannel *ioc,
+ unsigned int *pid,
+ Error **errp)
+{
+QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+if (!klass->io_peerpid) {
+error_setg(errp, "Channel does not support peer pid");
+return;
+}
+klass->io_peerpid(ioc, pid, errp);
+}
 
 off_t qio_channel_io_seek(QIOChannel *ioc,
   off_t offset,
-- 
2.44.0




Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-03-14 Thread Anthony Harivel


Hi Daniel,


> You don't need to access it via the /node/ hierarchy
>
> The canonical path for CPUs would be
>
>   /sys/devices/system/cpu/cpuNNN/topology
>
> The core_cpus_list file is giving you hyper-thread siblings within
> a core, which I don't think is what you want.
>
> If you're after discrete physical packages, then 'package_cpus_list'
> gives you all CPUs within a physical socket (package) I believe.
>

Yes, this could work.
However, on my laptop, I get:
cat package_cpus_list
0-11

Whereas on a server:
package_cpus_list
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46

I asked my teammate: always the same results. I guess this is due
either to a difference in kernel version or to how the kernel handles
the single-package case versus multiple packages.

Anyway, writing a C function that handles both cases might not be easy.
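For what it's worth, here is a sketch of such a function (hypothetical, not
from the series) that handles both output shapes of package_cpus_list — a
dash range like "0-11" and a comma list like "0,2,4,...":

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Count the CPUs named by a Linux cpulist string such as "0-11" or
 * "0,2,4,6" (the format of package_cpus_list). Ranges and
 * comma-separated entries may be mixed; a trailing newline is
 * tolerated. Returns -1 on malformed input. Illustrative sketch.
 */
static int cpulist_count(const char *s)
{
    int count = 0;

    while (*s) {
        char *end;
        long a = strtol(s, &end, 10);
        long b = a;

        if (end == s) {
            return -1;              /* expected a number */
        }
        if (*end == '-') {          /* range: a-b */
            s = end + 1;
            b = strtol(s, &end, 10);
            if (end == s || b < a) {
                return -1;
            }
        }
        count += (int)(b - a + 1);
        s = end;
        if (*s == ',') {
            s++;
        } else if (*s == '\n') {
            s++;
        } else if (*s != '\0') {
            return -1;
        }
    }
    return count;
}
```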

The other solution would be to loop through
/sys/devices/system/cpu/cpuNNN/ and update a fixed-size table of
integers (*), incrementing table[0] for package_0 when I encounter a
CPU that belongs to package_0, and so on. Reading
cpuNNN/topology/physical_package_id tells which package the CPU
belongs to.

This is a bit tedious but a safer solution, I think.
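A sketch of that loop (assuming the standard Linux sysfs layout; minimal
error handling, and it stops at the first missing CPU directory, so
offline-CPU gaps are not handled — illustrative only):

```c
#include <assert.h>
#include <stdio.h>

/*
 * Walk <root>/cpu<N>/topology/physical_package_id for each CPU and
 * track the highest package id seen. Returns the package count, or
 * -1 if no CPU directory could be read. Sketch of the approach
 * described above, not code from the series.
 */
static int count_physical_packages(const char *sysfs_cpu_root)
{
    int max_pkg = -1;

    for (int cpu = 0; ; cpu++) {
        char path[256];
        FILE *f;
        int pkg;

        snprintf(path, sizeof(path),
                 "%s/cpu%d/topology/physical_package_id",
                 sysfs_cpu_root, cpu);
        f = fopen(path, "r");
        if (!f) {
            break;                  /* no more CPUs */
        }
        if (fscanf(f, "%d", &pkg) == 1 && pkg > max_pkg) {
            max_pkg = pkg;
        }
        fclose(f);
    }
    return max_pkg >= 0 ? max_pkg + 1 : -1;
}
```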


(*): Maybe dynamic allocation is better ?

Thanks again for your guidance.

Regards,
Anthony




Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-03-13 Thread Anthony Harivel
Hi Daniel,

Daniel P. Berrangé, Mar 12, 2024 at 16:49:

> The point still stands though. NUMA node ID numbers are not
> guaranteed to be the same as socket ID numbers. Very often
> then will be the same (which makes it annoying to test as it
> is easy to not realize the difference), but we can't rely on
> that.
>
> > I'm using functions of libnuma to populate the maxpkgs of the host. 
> > I tested this on different Intel CPU with multiple packages and this 
> > has always returned the good number of packages. A false positive ?
>
> maxpkgs comes from vmsr_get_max_physical_package() which you're
> reading from sysfs, rather than libnuma.
>
> > So here I'm checking if the thread has run on the package number 'i'. 
> > I populate 'numa_node_id' with numa_node_of_cpu().
> > 
> > I did not want to reinvent the wheel, and the only lib that talked
> > about "nodes" was libnuma.
>
> I'm not actually convinced we need to use libnuma at all. IIUC, you're
> just trying to track all CPUs within the same physical socket (package).
> I don't think we need to care about NUMA nodes to do that tracking.
>

Alright, having a deeper look I'm actually using NUMA for 2 info:

- How many cpu per Package: this helps me calculate the ratio.

- To whom package the cpu belongs: to calculate the ratio with the right 
  package energy counter.

Without libnuma, I'm bit confused on how to handle this. 

Should I parse /sys/bus/node/devices/node* to know how many packages
there are? Should I parse
/sys/bus/node/devices/node0/cpu0/topology/core_cpus_list to work out
which CPU belongs to which package?

Would that be too cumbersome for the user, entering the details of how
many packages and how many CPUs per package?

i.e: 
-kvm,rapl=true,maxpkgs=2,cpupkgs=8,rapl-helper-socket=/path/sock.sock

> > Maybe I'm wrong assuming that a "node" (defined as an area where all 
> > memory has the same speed as seen from a particular CPU) could lead me 
> > to the packages number ?
>
> Historically you could have multiple sockets in the same NUMA node
> ie a m:1 mapping.
>
> These days with AMD sockets, you can have 1 socket comprising
> many NUMA nodes, as individual dies within a socket are each their
> own NUMA node. So a 1:m mapping
>
> On Intel I think it is still typical to have 1 socket per numa
> node, but again I don't think we can rely on that 1:1 mapping.
>
> Fortunately I don't think it matters, since it looks like you
> don't really need to track NUMA nodes, only sockets (physical
> package IDs)
>

Very informative, thanks !

> With regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|

Regards,
Anthony





[PULL 1/3] xen/pt: Emulate multifunction bit in header type

2024-03-12 Thread Anthony PERARD
From: Ross Lagerwall 

The intention of the code appears to have been to unconditionally set
the multifunction bit but since the emulation mask is 0x00 it has no
effect. Instead, emulate the bit and set it based on the multifunction
property of the PCIDevice (which can be set using QAPI).

This allows making passthrough devices appear as functions in a Xen
guest.

Signed-off-by: Ross Lagerwall 
Reviewed-by: Paul Durrant 
Message-Id: <20231103172601.1319375-1-ross.lagerw...@citrix.com>
Signed-off-by: Anthony PERARD 
---
 hw/xen/xen_pt_config_init.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/xen/xen_pt_config_init.c b/hw/xen/xen_pt_config_init.c
index ba4cd78238..3edaeab1e3 100644
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -292,7 +292,10 @@ static int 
xen_pt_header_type_reg_init(XenPCIPassthroughState *s,
uint32_t *data)
 {
 /* read PCI_HEADER_TYPE */
-*data = reg->init_val | 0x80;
+*data = reg->init_val;
+if ((PCI_DEVICE(s)->cap_present & QEMU_PCI_CAP_MULTIFUNCTION)) {
+*data |= PCI_HEADER_TYPE_MULTI_FUNCTION;
+}
 return 0;
 }
 
@@ -677,7 +680,7 @@ static XenPTRegInfo xen_pt_emu_reg_header0[] = {
 .size   = 1,
 .init_val   = 0x00,
 .ro_mask= 0xFF,
-.emu_mask   = 0x00,
+.emu_mask   = PCI_HEADER_TYPE_MULTI_FUNCTION,
 .init   = xen_pt_header_type_reg_init,
 .u.b.read   = xen_pt_byte_reg_read,
 .u.b.write  = xen_pt_byte_reg_write,
-- 
Anthony PERARD




[PULL 0/3] Xen queue 2024-03-12

2024-03-12 Thread Anthony PERARD
The following changes since commit 8f3f329f5e0117bd1a23a79ab751f8a7d3471e4b:

  Merge tag 'migration-20240311-pull-request' of https://gitlab.com/peterx/qemu 
into staging (2024-03-12 11:35:41 +)

are available in the Git repository at:

  https://xenbits.xen.org/git-http/people/aperard/qemu-dm.git 
tags/pull-xen-20240312

for you to fetch changes up to 918a7f706b69a8c725bac0694971d2831f688ebb:

  i386: load kernel on xen using DMA (2024-03-12 14:13:08 +)


Xen queue:

* In Xen PCI passthrough, emulate multifunction bit.
* Fix in Xen mapcache.
* Improve performance of kernel+initrd loading in an Xen HVM Direct
  Kernel Boot scenario.


Marek Marczykowski-Górecki (1):
  i386: load kernel on xen using DMA

Peng Fan (1):
  xen: Drop out of coroutine context xen_invalidate_map_cache_entry

Ross Lagerwall (1):
  xen/pt: Emulate multifunction bit in header type

 hw/i386/pc.c|  3 ++-
 hw/xen/xen-mapcache.c   | 30 --
 hw/xen/xen_pt_config_init.c |  7 +--
 3 files changed, 35 insertions(+), 5 deletions(-)



[PULL 2/3] xen: Drop out of coroutine context xen_invalidate_map_cache_entry

2024-03-12 Thread Anthony PERARD
From: Peng Fan 

xen_invalidate_map_cache_entry is not expected to run in a
coroutine. Without this fix, there is a crash:

signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
threadid=) at pthread_kill.c:78
at /usr/src/debug/glibc/2.38+git-r0/sysdeps/posix/raise.c:26
fmt=0x9e1ca8a8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0xe0d25740 "!qemu_in_coroutine()",
file=file@entry=0xe0d301a8 "../qemu-xen-dir-remote/block/graph-lock.c", 
line=line@entry=260,
function=function@entry=0xe0e522c0 <__PRETTY_FUNCTION__.3> 
"bdrv_graph_rdlock_main_loop") at assert.c:92
assertion=assertion@entry=0xe0d25740 "!qemu_in_coroutine()",
file=file@entry=0xe0d301a8 "../qemu-xen-dir-remote/block/graph-lock.c", 
line=line@entry=260,
function=function@entry=0xe0e522c0 <__PRETTY_FUNCTION__.3> 
"bdrv_graph_rdlock_main_loop") at assert.c:101
at ../qemu-xen-dir-remote/block/graph-lock.c:260
at 
/home/Freenix/work/sw-stash/xen/upstream/tools/qemu-xen-dir-remote/include/block/graph-lock.h:259
host=host@entry=0x742c8000, size=size@entry=2097152)
at ../qemu-xen-dir-remote/block/io.c:3362
host=0x742c8000, size=2097152)
at ../qemu-xen-dir-remote/block/block-backend.c:2859
host=, size=, max_size=)
at ../qemu-xen-dir-remote/block/block-ram-registrar.c:33
size=2097152, max_size=2097152)
at ../qemu-xen-dir-remote/hw/core/numa.c:883
buffer=buffer@entry=0x743c5000 "")
at ../qemu-xen-dir-remote/hw/xen/xen-mapcache.c:475
buffer=buffer@entry=0x743c5000 "")
at ../qemu-xen-dir-remote/hw/xen/xen-mapcache.c:487
as=as@entry=0xe1ca3ae8 , buffer=0x743c5000,
len=, is_write=is_write@entry=true,
access_len=access_len@entry=32768)
at ../qemu-xen-dir-remote/system/physmem.c:3199
dir=DMA_DIRECTION_FROM_DEVICE, len=,
buffer=, as=0xe1ca3ae8 )
at 
/home/Freenix/work/sw-stash/xen/upstream/tools/qemu-xen-dir-remote/include/sysemu/dma.h:236
elem=elem@entry=0xf620aa30, len=len@entry=32769)
at ../qemu-xen-dir-remote/hw/virtio/virtio.c:758
elem=elem@entry=0xf620aa30, len=len@entry=32769, idx=idx@entry=0)
at ../qemu-xen-dir-remote/hw/virtio/virtio.c:919
elem=elem@entry=0xf620aa30, len=32769)
at ../qemu-xen-dir-remote/hw/virtio/virtio.c:994
req=req@entry=0xf620aa30, status=status@entry=0 '\000')
at ../qemu-xen-dir-remote/hw/block/virtio-blk.c:67
ret=0) at ../qemu-xen-dir-remote/hw/block/virtio-blk.c:136
at ../qemu-xen-dir-remote/block/block-backend.c:1559
--Type  for more, q to quit, c to continue without paging--
at ../qemu-xen-dir-remote/block/block-backend.c:1614
i1=) at ../qemu-xen-dir-remote/util/coroutine-ucontext.c:177
at ../sysdeps/unix/sysv/linux/aarch64/setcontext.S:123

Signed-off-by: Peng Fan 
Reviewed-by: Stefano Stabellini 
Message-Id: <20240124021450.21656-1-peng@oss.nxp.com>
Signed-off-by: Anthony PERARD 
---
 hw/xen/xen-mapcache.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/hw/xen/xen-mapcache.c b/hw/xen/xen-mapcache.c
index 4f956d048e..7f59080ba7 100644
--- a/hw/xen/xen-mapcache.c
+++ b/hw/xen/xen-mapcache.c
@@ -476,11 +476,37 @@ static void 
xen_invalidate_map_cache_entry_unlocked(uint8_t *buffer)
 g_free(entry);
 }
 
-void xen_invalidate_map_cache_entry(uint8_t *buffer)
+typedef struct XenMapCacheData {
+Coroutine *co;
+uint8_t *buffer;
+} XenMapCacheData;
+
+static void xen_invalidate_map_cache_entry_bh(void *opaque)
 {
+XenMapCacheData *data = opaque;
+
 mapcache_lock();
-xen_invalidate_map_cache_entry_unlocked(buffer);
+xen_invalidate_map_cache_entry_unlocked(data->buffer);
 mapcache_unlock();
+
+aio_co_wake(data->co);
+}
+
+void coroutine_mixed_fn xen_invalidate_map_cache_entry(uint8_t *buffer)
+{
+if (qemu_in_coroutine()) {
+XenMapCacheData data = {
+.co = qemu_coroutine_self(),
+.buffer = buffer,
+};
+aio_bh_schedule_oneshot(qemu_get_current_aio_context(),
+xen_invalidate_map_cache_entry_bh, &data);
+qemu_coroutine_yield();
+} else {
+mapcache_lock();
+xen_invalidate_map_cache_entry_unlocked(buffer);
+mapcache_unlock();
+}
 }
 
 void xen_invalidate_map_cache(void)
-- 
Anthony PERARD




[PULL 3/3] i386: load kernel on xen using DMA

2024-03-12 Thread Anthony PERARD
From: Marek Marczykowski-Górecki 

Kernel on Xen is loaded via fw_cfg. Previously it used non-DMA version,
which loaded the kernel (and initramfs) byte by byte. Change this
to DMA, to load in bigger chunks.
This change alone reduces load time of a (big) kernel+initramfs from
~10s down to below 1s.

This change was suggested initially here:
https://lore.kernel.org/xen-devel/20180216204031.5...@gmail.com/
Apparently this alone is already enough to get massive speedup.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Alex Bennée 
Reviewed-by: Anthony PERARD 
Message-Id: <20210426034709.595432-1-marma...@invisiblethingslab.com>
Signed-off-by: Anthony PERARD 
---
 hw/i386/pc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index f5ff970acf..4f322e0856 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -718,7 +718,8 @@ void xen_load_linux(PCMachineState *pcms)
 
 assert(MACHINE(pcms)->kernel_filename != NULL);
 
-fw_cfg = fw_cfg_init_io(FW_CFG_IO_BASE);
+fw_cfg = fw_cfg_init_io_dma(FW_CFG_IO_BASE, FW_CFG_IO_BASE + 4,
+&address_space_memory);
 fw_cfg_add_i16(fw_cfg, FW_CFG_NB_CPUS, x86ms->boot_cpus);
 rom_set_fw(fw_cfg);
 
-- 
Anthony PERARD




Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-03-12 Thread Anthony Harivel


Hi Daniel, Paolo,

Here my last questions before wrapping up and send v4, or maybe call off
my attempt to add RAPL interface in QEMU.


Daniel P. Berrangé, Jan 30, 2024 at 10:39:
> > +rcu_register_thread();
> > +
> > +/* Get QEMU PID*/
> > +pid = getpid();
> > +
> > +/* Nb of CPUS per packages */
> > +maxcpus = vmsr_get_maxcpus(0);
> > +
> > +/* Nb of Physical Packages on the system */
> > +maxpkgs = vmsr_get_max_physical_package(maxcpus);
>
> This function can fail so this needs to be checked & reported.
>
> > +
> > +/* Those MSR values should not change as well */
> > +vmsr->msr_unit  = vmsr_read_msr(MSR_RAPL_POWER_UNIT, 0, pid,
> > +s->msr_energy.socket_path);
> > +vmsr->msr_limit = vmsr_read_msr(MSR_PKG_POWER_LIMIT, 0, pid,
> > +s->msr_energy.socket_path);
> > +vmsr->msr_info  = vmsr_read_msr(MSR_PKG_POWER_INFO, 0, pid,
> > +s->msr_energy.socket_path);
>
> This function can fail for a variety of reasons, most especially if someone
> gave an incorrect socket path, or if the daemon is not running. This is not
> getting diagnosed, and even if we try to report it here, we're in a background
> thread at this point.
>
> I think we need to connect and report errors before even starting this
> thread, so that QEMU startup gets aborted upon configuration error.
>

Fair enough. Would it be OK to do the sanity checks before
rcu_register_thread() and "return NULL;" in case of error, or would
you prefer me to check all of this before even calling
qemu_thread_create()?

> > +/* Populate all the thread stats */
> > +for (int i = 0; i < num_threads; i++) {
> > +thd_stat[i].utime = g_new0(unsigned long long, 2);
> > +thd_stat[i].stime = g_new0(unsigned long long, 2);
> > +thd_stat[i].thread_id = thread_ids[i];
> > +vmsr_read_thread_stat(&thd_stat[i], pid, 0);
>
> It is non-obvious that the 3rd parameter here is an index into
> the utime & stime array. This function would be saner to review
> if called as:
>
> vmsr_read_thread_stat(pid,
> thd_stat[i].thread_id,
> &thd_stat[i].utime[0],
> &thd_stat[i].stime[0],
> &thd_stat[i].cpu_id);
>
> so we see what are input parameters and what are output parameters.
>
> Also this method can fail, eg if the thread has exited already,
> so we need to take that into account and stop trying to get info
> for that thread in later code. eg by setting 'thread_id' to 0
> and then skipping any thread_id == 0 later.
>
>

Good point. I'll rework the function and set "thread_id" to 0 on
failure, in order to test for it later on.
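For reference, a failure-returning sketch of such a read (utime and stime are
fields 14 and 15 of /proc/<pid>/task/<tid>/stat per proc(5); hypothetical
names, not the series' actual function):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read a thread's utime/stime (in clock ticks) from
 * /proc/<pid>/task/<tid>/stat. The comm field can contain spaces and
 * parentheses, so parsing starts after the last ')'. Returns 0 on
 * success, -1 on failure (e.g. the thread has already exited), which
 * lets the caller mark dead threads and skip them.
 */
static int read_thread_stat(pid_t pid, pid_t tid,
                            unsigned long long *utime,
                            unsigned long long *stime)
{
    char path[64], buf[512];
    FILE *f;
    char *p;

    snprintf(path, sizeof(path), "/proc/%d/task/%d/stat",
             (int)pid, (int)tid);
    f = fopen(path, "r");
    if (!f) {
        return -1;
    }
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    p = strrchr(buf, ')');          /* skip "pid (comm" */
    if (!p) {
        return -1;
    }
    /* after ')': state (3), fields 4..13, then utime (14), stime (15) */
    if (sscanf(p + 2,
               "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %llu %llu",
               utime, stime) != 2) {
        return -1;
    }
    return 0;
}
```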

> > +thd_stat[i].numa_node_id = 
> > numa_node_of_cpu(thd_stat[i].cpu_id);
> > +}
> > +
> > +/* Retrieve all packages power plane energy counter */
> > +for (int i = 0; i <= maxpkgs; i++) {
> > +for (int j = 0; j < num_threads; j++) {
> > +/*
> > + * Use the first thread we found that ran on the CPU
> > + * of the package to read the packages energy counter
> > + */
> > +if (thd_stat[j].numa_node_id == i) {
>
> 'i' is a CPU ID value, while 'numa_node_id' is a NUMA node ID value.
> I don't think it is semantically valid to compare them for equality.
>
> I'm not sure the NUMA node is even relevant, since IIUC from the docs
> earlier, the power values are scoped per package, which would mean per
> CPU socket.
>

'i' here is the package number on the host. 
I'm using functions of libnuma to populate the maxpkgs of the host. 
I tested this on different Intel CPU with multiple packages and this 
has always returned the good number of packages. A false positive ?

So here I'm checking if the thread has run on the package number 'i'. 
I populate 'numa_node_id' with numa_node_of_cpu().

I did not want to reinvent the wheel, and the only lib that talked
about "nodes" was libnuma.

Maybe I'm wrong assuming that a "node" (defined as an area where all 
memory has the same speed as seen from a particular CPU) could lead me 
to the packages number ?

And this is what I see you wrote below: 
"A numa node isn't a package AFAICT."


Regards,
Anthony




[PATCH] migration: Fix format in error message

2024-03-11 Thread Anthony PERARD
From: Anthony PERARD 

In file_write_ramblock_iov(), "offset" is "uintptr_t" and not
"ram_addr_t". While the two are usually equivalent, this is not the
case with CONFIG_XEN_BACKEND.

Use the right format. This will fix the build on 32-bit.
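A reminder of the idiom in generic C (not QEMU code): <inttypes.h> provides
width-correct conversion macros, so the format string always matches
uintptr_t regardless of the build's pointer width.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/*
 * Format a uintptr_t with the matching conversion macro. Using
 * PRIxPTR keeps the format correct on both 32-bit and 64-bit
 * builds, which a fixed "%lx"-style format does not.
 */
static int format_offset(char *buf, size_t len, uintptr_t offset)
{
    return snprintf(buf, len, "offset %" PRIxPTR, offset);
}
```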

Fixes: f427d90b9898 ("migration/multifd: Support outgoing mapped-ram stream 
format")
Signed-off-by: Anthony PERARD 
---
 migration/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/file.c b/migration/file.c
index 164b079966..5054a60851 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -191,7 +191,7 @@ int file_write_ramblock_iov(QIOChannel *ioc, const struct 
iovec *iov,
  */
 offset = (uintptr_t) iov[slice_idx].iov_base - (uintptr_t) block->host;
 if (offset >= block->used_length) {
-error_setg(errp, "offset " RAM_ADDR_FMT
+error_setg(errp, "offset %" PRIxPTR
"outside of ramblock %s range", offset, block->idstr);
 ret = -1;
 break;
-- 
Anthony PERARD




Re: [PATCH v3 01/29] bulk: Access existing variables initialized to &s->F when available

2024-03-08 Thread Anthony PERARD
On Mon, Jan 29, 2024 at 05:44:43PM +0100, Philippe Mathieu-Daudé wrote:
> When a variable is initialized to &s->field, use it
> in place. Rationale: while this makes the code more concise,
> this also helps static analyzers.
> 
> Mechanical change using the following Coccinelle spatch script:
> 
>  @@
>  type S, F;
>  identifier s, m, v;
>  @@
>   S *s;
>   ...
>   F *v = &s->m;
>   <+...
>  -&s->m
>  +v
>   ...+>
> 
> Inspired-by: Zhao Liu 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
> diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
> index 36e6f93c37..10ddf6bc91 100644
> --- a/hw/xen/xen_pt.c
> +++ b/hw/xen/xen_pt.c
> @@ -710,7 +710,7 @@ static void xen_pt_destroy(PCIDevice *d) {
>  uint8_t intx;
>  int rc;
>  
> -if (machine_irq && !xen_host_pci_device_closed(&s->real_device)) {
> +if (machine_irq && !xen_host_pci_device_closed(host_dev)) {
>  intx = xen_pt_pci_intx(s);
>  rc = xc_domain_unbind_pt_irq(xen_xc, xen_domid, machine_irq,
>   PT_IRQ_TYPE_PCI,
> @@ -759,8 +759,8 @@ static void xen_pt_destroy(PCIDevice *d) {
>  memory_listener_unregister(&s->io_listener);
>  s->listener_set = false;
>  }
> -if (!xen_host_pci_device_closed(&s->real_device)) {
> -xen_host_pci_device_put(&s->real_device);
> +if (!xen_host_pci_device_closed(host_dev)) {
> +xen_host_pci_device_put(host_dev);

For the Xen part:
Reviewed-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH] i386: load kernel on xen using DMA

2024-03-08 Thread Anthony PERARD
On Fri, Jun 18, 2021 at 09:54:14AM +0100, Alex Bennée wrote:
> 
> Marek Marczykowski-Górecki  writes:
> 
> > Kernel on Xen is loaded via fw_cfg. Previously it used non-DMA version,
> > which loaded the kernel (and initramfs) byte by byte. Change this
> > to DMA, to load in bigger chunks.
> > This change alone reduces load time of a (big) kernel+initramfs from
> > ~10s down to below 1s.
> >
> > This change was suggested initially here:
> > https://lore.kernel.org/xen-devel/20180216204031.5...@gmail.com/
> > Apparently this alone is already enough to get massive speedup.
> >
> > Signed-off-by: Marek Marczykowski-Górecki 
> > ---
> >  hw/i386/pc.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> > index 8a84b25a03..14e43d4da4 100644
> > --- a/hw/i386/pc.c
> > +++ b/hw/i386/pc.c
> > @@ -839,7 +839,8 @@ void xen_load_linux(PCMachineState *pcms)
> >  
> >  assert(MACHINE(pcms)->kernel_filename != NULL);
> >  
> > -fw_cfg = fw_cfg_init_io(FW_CFG_IO_BASE);
> > +fw_cfg = fw_cfg_init_io_dma(FW_CFG_IO_BASE, FW_CFG_IO_BASE + 4,
> > +&address_space_memory);
> >  fw_cfg_add_i16(fw_cfg, FW_CFG_NB_CPUS, x86ms->boot_cpus);
> >  rom_set_fw(fw_cfg);
> 
> Gentle ping. The fix looks perfectly sane to me but I don't have any x86
> Xen HW to test this one. Are the x86 maintainers happy to take this on?

Yes. It looks like it works well with both SeaBIOS and OVMF, so the
patch is good.

> FWIW:
> 
> Reviewed-by: Alex Bennée 

Reviewed-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH 14/16] hw/char/xen_console: Fix missing ERRP_GUARD() for error_prepend()

2024-03-08 Thread Anthony PERARD
On Thu, Feb 29, 2024 at 12:37:21AM +0800, Zhao Liu wrote:
> From: Zhao Liu 
> 
> As the comment in qapi/error, passing @errp to error_prepend() requires
> ERRP_GUARD():
> 
> * = Why, when and how to use ERRP_GUARD() =
> *
> * Without ERRP_GUARD(), use of the @errp parameter is restricted:
> ...
> * - It should not be passed to error_prepend(), error_vprepend() or
> *   error_append_hint(), because that doesn't work with &error_fatal.
> * ERRP_GUARD() lifts these restrictions.
> *
> * To use ERRP_GUARD(), add it right at the beginning of the function.
> * @errp can then be used without worrying about the argument being
> * NULL or &error_fatal.
> 
> ERRP_GUARD() could avoid the case when @errp is the pointer of
> error_fatal, the user can't see this additional information, because
> exit() happens in error_setg earlier than information is added [1].
> 
> The xen_console_connect() passes @errp to error_prepend() without
> ERRP_GUARD().
> 
> There're 2 places will call xen_console_connect():
>  - xen_console_realize(): the @errp is from DeviceClass.realize()'s
> parameter.
>  - xen_console_frontend_changed(): the @errp points its caller's
>@local_err.
> 
> To avoid the issue like [1] said, add missing ERRP_GUARD() at the
> beginning of xen_console_connect().
> 
> [1]: Issue description in the commit message of commit ae7c80a7bd73
>  ("error: New macro ERRP_GUARD()").
> 
> Cc: Stefano Stabellini 
> Cc: Anthony Perard 
> Cc: Paul Durrant 
> Cc: "Marc-André Lureau" 
> Cc: Paolo Bonzini 
> Signed-off-by: Zhao Liu 

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH 02/17] hw/net/xen_nic: Fix missing ERRP_GUARD() for error_prepend()

2024-03-08 Thread Anthony PERARD
On Thu, Feb 29, 2024 at 06:25:40PM +0100, Thomas Huth wrote:
> On 29/02/2024 15.38, Zhao Liu wrote:
> > From: Zhao Liu 
> > 
> > As the comment in qapi/error, passing @errp to error_prepend() requires
> > ERRP_GUARD():
> > 
> > * = Why, when and how to use ERRP_GUARD() =
> > *
> > * Without ERRP_GUARD(), use of the @errp parameter is restricted:
> > ...
> > * - It should not be passed to error_prepend(), error_vprepend() or
> > *   error_append_hint(), because that doesn't work with &error_fatal.
> > * ERRP_GUARD() lifts these restrictions.
> > *
> > * To use ERRP_GUARD(), add it right at the beginning of the function.
> > * @errp can then be used without worrying about the argument being
> > * NULL or &error_fatal.
> > 
> > ERRP_GUARD() could avoid the case when @errp is the pointer of
> > error_fatal, the user can't see this additional information, because
> > exit() happens in error_setg earlier than information is added [1].
> > 
> > The xen_netdev_connect() passes @errp to error_prepend(), and its @errp
> > parameter is from xen_device_frontend_changed().
> > 
> > Though its @errp points to @local_err of xen_device_frontend_changed(),
> > to follow the requirement of @errp, add missing ERRP_GUARD() at the
> > beginning of this function.
> > 
> > [1]: Issue description in the commit message of commit ae7c80a7bd73
> >   ("error: New macro ERRP_GUARD()").
> > 
> > Cc: Stefano Stabellini 
> > Cc: Anthony Perard 
> > Cc: Paul Durrant 
> > Cc: Jason Wang 
> > Signed-off-by: Zhao Liu 
> > ---
> >   hw/net/xen_nic.c | 1 +
> >   1 file changed, 1 insertion(+)
> > 
> > diff --git a/hw/net/xen_nic.c b/hw/net/xen_nic.c
> > index 453fdb981983..89487b49baf9 100644
> > --- a/hw/net/xen_nic.c
> > +++ b/hw/net/xen_nic.c
> > @@ -351,6 +351,7 @@ static bool net_event(void *_xendev)
> >   static bool xen_netdev_connect(XenDevice *xendev, Error **errp)
> >   {
> > +ERRP_GUARD();
> >   XenNetDev *netdev = XEN_NET_DEVICE(xendev);
> >   unsigned int port, rx_copy;
> 
> Reviewed-by: Thomas Huth 

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-03-05 Thread Anthony Harivel
Hi Daniel,

> > +
> > +/* Retrieve all packages power plane energy counter */
> > +for (int i = 0; i <= maxpkgs; i++) {
> > +for (int j = 0; j < num_threads; j++) {
> > +/*
> > + * Use the first thread we found that ran on the CPU
> > + * of the package to read the packages energy counter
> > + */
>
> This says we're using a thread ID
>
> > +if (thd_stat[j].numa_node_id == i) {
> > +pkg_stat[i].e_start =
> > +vmsr_read_msr(MSR_PKG_ENERGY_STATUS, i, pid,
>
> but here we're using a pid ID, which is the thread ID of the initial
> thread.
>
> > +  s->msr_energy.socket_path);
> > +break;
> > +}
> > +}
> > +}
>
> This API design for vmsr_read_msr() is incredibly inefficient.
> We're making (maxpkgs * num_threads) calls to vmsr_read_msr(),
> and every one of those is opening and closing the socket.
>
> Why isn't QEMU opening the socket once and then sending all
> the requests over the same socket ?
>


The usage of pid here is a mistake, thanks for pointing this out.

However, I'm more sceptical about the loop actually being inefficient. 
The confusion could definitely be due to the poor variable naming, 
and I apologize for that.
Let me try to explain what it's supposed to do:
Imagine we are running on machine that has i packages. QEMU has 
j threads running on whichever packages. We need to get the current 
packages energy of each packages that are used by the QEMU threads. 
(could be all i packages, only 1, 2.. we don't know what we need yet) 
So it loops first on the packages "0", and look if any thread has run 
on this packages. 
If no, test the next thread. 
if yes, we need the value, we call the vmsr_read_msr() then break and 
now loop for the next package, i.e package "1". And this until all 
packages has been tested.

So in the end, we 'only' have maximum "maxpkgs" calls of vmsr_read_msr().

Hope that's ok and that it clears up the confusion!

Regards,
Anthony





Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-03-05 Thread Anthony Harivel
Daniel P. Berrangé, Mar 04, 2024 at 15:48:
> On Mon, Mar 04, 2024 at 03:41:02PM +0100, Anthony Harivel wrote:
> > 
> > Hi Daniel,
> > 
> > > > +if (s->msr_energy.enable == true) {
> > >
> > > This looks to be where we need to check that both the host CPU
> > > vendor is intel, and the guest CPU vendor is intel, and that
> > > the host CPU has the RAPL feature we're using.
> >
> > Checking for the host cpu and RAPL enable is fine and done. 
> > 
> > But checking for guest CPU is confusing me. 
> > The RAPL feature is enable only with KVM enable. 
> > This means "-cpu" can only be "host" or its derivative that essentially 
> > copy the host CPU definition, no?
>
> KVM can use any named CPU.
>
> > That means if we are already checking the host cpu we don't need to do 
> > anything for the guest, do we ?
>
> When I first wrote this I though it would be as simple as checknig a
> CPUID feature flag. That appears to not be the case, however, as Linux
> is just checking for various CPU models directly. With that in mind
> perhaps we should just check of the guest CPU model vendor
> == CPUID_VENDOR_INTEL and leave it at that.
>
> eg, create an error if running an AMD CPU such as $QEMU -cpu EPYC

The idea looks good to me. Now the hiccup with this solution is that 
I cannot find a way to reach CPUArchState at this level of code (i.e. 
kvm_arch_init()) with only the MachineState or the KVMState. 
I can only reach the topology with x86_possible_cpu_arch_ids().

The CPUArchState struct holds the cpuid_vendor variables, on which we 
can use IS_INTEL_CPU() for checking.

Maybe you know the trick that I'm missing?

Regards,
Anthony

>
> With regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH 05/17] hw/vfio/ap: Fix missing ERRP_GUARD() for error_prepend()

2024-03-04 Thread Anthony Krowiak



On 2/29/24 12:30 PM, Thomas Huth wrote:

On 29/02/2024 15.39, Zhao Liu wrote:

From: Zhao Liu 

As the comment in qapi/error, passing @errp to error_prepend() requires
ERRP_GUARD():

* = Why, when and how to use ERRP_GUARD() =
*
* Without ERRP_GUARD(), use of the @errp parameter is restricted:
...
* - It should not be passed to error_prepend(), error_vprepend() or
*   error_append_hint(), because that doesn't work with &error_fatal.
* ERRP_GUARD() lifts these restrictions.
*
* To use ERRP_GUARD(), add it right at the beginning of the function.
* @errp can then be used without worrying about the argument being
* NULL or &error_fatal.

ERRP_GUARD() could avoid the case when @errp is the pointer of
error_fatal, the user can't see this additional information, because
exit() happens in error_setg earlier than information is added [1].

The vfio_ap_realize() passes @errp to error_prepend(), and as a
DeviceClass.realize method, its @errp is so widely sourced that it is
necessary to protect it with ERRP_GUARD().

To avoid the issue like [1] said, add missing ERRP_GUARD() at the
beginning of this function.

[1]: Issue description in the commit message of commit ae7c80a7bd73
  ("error: New macro ERRP_GUARD()").

Cc: Alex Williamson 
Cc: "Cédric Le Goater" 
Cc: Thomas Huth 
Cc: Tony Krowiak 
Cc: Halil Pasic 
Cc: Jason Herne 
Signed-off-by: Zhao Liu 
---
  hw/vfio/ap.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index e157aa1ff79c..7c4caa593863 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -155,6 +155,7 @@ static void 
vfio_ap_unregister_irq_notifier(VFIOAPDevice *vapdev,

    static void vfio_ap_realize(DeviceState *dev, Error **errp)
  {
+    ERRP_GUARD();
  int ret;
  Error *err = NULL;


Now this function looks like we need both, ERRP_GUARD and the local 
"err" variable? ... patch looks ok to me, but maybe Markus has an idea 
how this could be done in a nicer way?



Correct me if I'm wrong, but my understanding from reading the prologue 
in error.h is that errp is used to pass errors back to the caller. The 
'err' variable is used to report errors set by a call to the 
vfio_ap_register_irq_notification function after which this function 
returns cleanly. It does seem, however, that this function should return 
a value (possibly a boolean?)  for the cases where errp is passed to a 
function that sets an error to be propagated to the caller.





 Thomas





Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-03-04 Thread Anthony Harivel


Hi Daniel,

> > +if (s->msr_energy.enable == true) {
>
> This looks to be where we need to check that both the host CPU
> vendor is intel, and the guest CPU vendor is intel, and that
> the host CPU has the RAPL feature we're using.
>
> With regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|

Checking for the host cpu and RAPL enable is fine and done. 

But checking for guest CPU is confusing me. 
The RAPL feature is enabled only when KVM is enabled. 
This means "-cpu" can only be "host" or a derivative that essentially 
copies the host CPU definition, no?
That means if we are already checking the host cpu we don't need to do 
anything for the guest, do we ?

Regards,
Anthony




Re: [PATCH v3 2/3] tools: build qemu-vmsr-helper

2024-03-01 Thread Anthony Harivel
Hi Paolo,

> > > +static void compute_default_paths(void)
> > > +{
> > > +socket_path = g_build_filename("/run", "qemu-vmsr-helper.sock", 
> > > NULL);
> > > +pidfile = g_build_filename("/run", "qemu-vmsr-helper.pid", NULL);
> > > +}
> >
> > We shouldn't be hardcoding /run, we need to honour --prefix and
> > --localstatedir args given to configure.  /var/run is a symlink
> > to /run so the end result ends up the same AFAIK
>
> Indeed; just copy from scsi/qemu-pr-helper.c.
>

When I copy the same compute_default_paths() as scsi/qemu-pr-helper.c, the 
helper listens on "/var/local/run/qemu-vmsr-helper.sock".

Problem is /var/local/run is 700 while /run is 755.

So I would need to adjust qemu-vmsr-helper.service to give 
the --socket=PATH of qemu-vmsr-helper.socket (i.e. 
/run/qemu-vmsr-helper.sock).

Problem is when I do that, I fall into the case: 
"Unix socket can't be set when using socket activation"

So I'm a bit confused what to do about that.

> > You never answered my question from the previous posting of this
> >
> > This check is merely validating the the thread ID in the message
> > is a child of the process ID connected to the socket. Any process
> > on the entire host can satisfy this requirement.
> >
> > I don't see what is limiting this to only QEMU as claimed by the
> > commit message, unless you're expecting the UNIX socket permissions
> > to be such that only processes under the qemu:qemu user:group pair
> > can access to the socket ? That would be a libvirt based permissions
> > assumption though.
>
> Yes, this is why the systemd socket uses 600, like
> contrib/systemd/qemu-pr-helper.socket. The socket can be passed via
> SCM_RIGHTS by libvirt, or its permissions can be changed (e.g. 660 and
> root:kvm would make sense on a Debian system), or a separate helper
> can be started by libvirt.
>
> Either way, the policy is left to the user rather than embedding it in
> the provided systemd unit.
>

During this patch test, when I run by hand my VM (without libvirt),
the vmsr helper systemd service/socket was like that: 
[Service]
ExecStart=/usr/bin/qemu-vmsr-helper
User=root
Group=root

and 

[Socket]
ListenStream=/run/qemu-vmsr-helper.sock
SocketUser=qemu
SocketGroup=qemu
SocketMode=0660

And it seems to work. So I'm not sure 100% what to do in my patch.

Should I follow the pr-helper systemd files anyway ?

Regards,
Anthony




Re: [PATCH v3 2/3] tools: build qemu-vmsr-helper

2024-02-21 Thread Anthony Harivel
Daniel P. Berrangé, Feb 21, 2024 at 14:47:
> On Wed, Feb 21, 2024 at 02:19:11PM +0100, Anthony Harivel wrote:
> > Daniel P. Berrangé, Jan 29, 2024 at 20:45:
> > > On Mon, Jan 29, 2024 at 08:33:21PM +0100, Paolo Bonzini wrote:
> > > > On Mon, Jan 29, 2024 at 7:53 PM Daniel P. Berrangé 
> > > >  wrote:
> > > > > > diff --git a/meson.build b/meson.build
> > > > > > index d0329966f1b4..93fc233b0891 100644
> > > > > > --- a/meson.build
> > > > > > +++ b/meson.build
> > > > > > @@ -4015,6 +4015,11 @@ if have_tools
> > > > > > dependencies: [authz, crypto, io, qom, qemuutil,
> > > > > >libcap_ng, mpathpersist],
> > > > > > install: true)
> > > > > > +
> > > > > > +executable('qemu-vmsr-helper', 
> > > > > > files('tools/i386/qemu-vmsr-helper.c'),
> > > > >
> > > > > I'd suggest 'tools/x86/' since this works fine on 64-bit too
> > > > 
> > > > QEMU tends to use i386 in the source to mean both 32- and 64-bit.
> > >
> > > One day we should rename that to x86 too :-)
> > >
> > > > > You never answered my question from the previous posting of this
> > > > >
> > > > > This check is merely validating that the thread ID in the message
> > > > > is a child of the process ID connected to the socket. Any process
> > > > > on the entire host can satisfy this requirement.
> > > > >
> > > > > I don't see what is limiting this to only QEMU as claimed by the
> > > > > commit message, unless you're expecting the UNIX socket permissions
> > > > > to be such that only processes under the qemu:qemu user:group pair
> > > > > can access to the socket ? That would be a libvirt based permissions
> > > > > assumption though.
> > > > 
> > > > Yes, this is why the systemd socket uses 600, like
> > > > contrib/systemd/qemu-pr-helper.socket. The socket can be passed via
> > > > SCM_RIGHTS by libvirt, or its permissions can be changed (e.g. 660 and
> > > > root:kvm would make sense on a Debian system), or a separate helper
> > > > can be started by libvirt.
> > > > 
> > > > Either way, the policy is left to the user rather than embedding it in
> > > > the provided systemd unit.
> > >
> > > Ok, this code needs a comment to explain that we're relying on
> > > socket permissions to control who/what can access the daemon,
> > > combined with this PID+TID check to validate it is not spoofing
> > > its identity, as without context the TID check looks pointless.
> > 
> > Hi Daniel,
> > 
> > would you prefer a comment in the code or a security section in the doc 
> > (i.e docs/specs/rapl-msr.rst) ?
>
> I think it is worth creating a docs/specs/rapl-msr.rst to explain the
> overall design & usage & security considerations.

It was already included in the add-support-for-RAPL-MSRs-in-KVM-Qemu.patch 
but indeed it needs now some updates for the v4 about security and 
change in design.

Regards,
Anthony

>
> With regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v3 2/3] tools: build qemu-vmsr-helper

2024-02-21 Thread Anthony Harivel
Daniel P. Berrangé, Jan 29, 2024 at 20:45:
> On Mon, Jan 29, 2024 at 08:33:21PM +0100, Paolo Bonzini wrote:
> > On Mon, Jan 29, 2024 at 7:53 PM Daniel P. Berrangé  
> > wrote:
> > > > diff --git a/meson.build b/meson.build
> > > > index d0329966f1b4..93fc233b0891 100644
> > > > --- a/meson.build
> > > > +++ b/meson.build
> > > > @@ -4015,6 +4015,11 @@ if have_tools
> > > > dependencies: [authz, crypto, io, qom, qemuutil,
> > > >libcap_ng, mpathpersist],
> > > > install: true)
> > > > +
> > > > +executable('qemu-vmsr-helper', 
> > > > files('tools/i386/qemu-vmsr-helper.c'),
> > >
> > > I'd suggest 'tools/x86/' since this works fine on 64-bit too
> > 
> > QEMU tends to use i386 in the source to mean both 32- and 64-bit.
>
> One day we should rename that to x86 too :-)
>
> > > You never answered my question from the previous posting of this
> > >
> > > This check is merely validating that the thread ID in the message
> > > is a child of the process ID connected to the socket. Any process
> > > on the entire host can satisfy this requirement.
> > >
> > > I don't see what is limiting this to only QEMU as claimed by the
> > > commit message, unless you're expecting the UNIX socket permissions
> > > to be such that only processes under the qemu:qemu user:group pair
> > > can access to the socket ? That would be a libvirt based permissions
> > > assumption though.
> > 
> > Yes, this is why the systemd socket uses 600, like
> > contrib/systemd/qemu-pr-helper.socket. The socket can be passed via
> > SCM_RIGHTS by libvirt, or its permissions can be changed (e.g. 660 and
> > root:kvm would make sense on a Debian system), or a separate helper
> > can be started by libvirt.
> > 
> > Either way, the policy is left to the user rather than embedding it in
> > the provided systemd unit.
>
> Ok, this code needs a comment to explain that we're relying on
> socket permissions to control who/what can access the daemon,
> combined with this PID+TID check to validate it is not spoofing
> its identity, as without context the TID check looks pointless.

Hi Daniel,

would you prefer a comment in the code or a security section in the doc 
(i.e docs/specs/rapl-msr.rst) ?

Regards,
Anthony

>
>
> With regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-02-20 Thread Anthony Harivel
Daniel P. Berrangé, Jan 29, 2024 at 20:29:
> On Thu, Jan 25, 2024 at 08:22:14AM +0100, Anthony Harivel wrote:
> > diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
> > new file mode 100644
> > index ..04d27c198fc0
> > --- /dev/null
> > +++ b/docs/specs/rapl-msr.rst
> > @@ -0,0 +1,133 @@
> > +
> > +RAPL MSR support
> > +
>
> > +
> > +Current Limitations
> > +---
> > +
> > +- Works only on Intel host CPUs because AMD CPUs are using different MSR
> > +  addresses.
>
> The privileged helper program is validating an allow list of MSRs.
>
> If those MSRs are only correct on Intel hosts, then the validation
> is incomplete, and it could be allowing unprivileged processes on
> AMD hosts to access forbidden MSRS whose address happen to clash
> with the Intel RAPL MSRs.
>
> IOW, the privileged helper needs to call cpuid() and validate that
> the current host vendor is Intel.
>
> I suspect we also need a feature check of some kind to validate
> that the intel processor supports this features, since old ones
> definitely didn't, and we shouldn't assume all future ones will
> either.
>

To validate that the processor supports the RAPL feature I propose
to check this on the Host:

$ cat /sys/class/powercap/intel-rapl/enabled
1


The only downside is that the Intel RAPL driver then needs to be 
loaded. We don't otherwise need it, because we read the MSRs directly.

Regards,
Anthony




[PATCH v3 0/3] Add support for the RAPL MSRs series

2024-01-24 Thread Anthony Harivel
Dear maintainers,

First of all, thank you very much for your recent review of my patch 
[1].

In this version (v3), I have attempted to address the most crucial and 
challenging aspect highlighted in your last review.

I am hopeful that we can now engage in a discussion and address the 
remaining potential points that need attention.

Thank you for your continued guidance.

v2 -> v3


- Move all memory allocations from Clib to Glib

- Compile on *BSD (working on Linux only)

- No more limitation on the virtual package: each vCPU that belongs to 
  the same virtual package gives the same results, as expected on 
  a real CPU.
  This has been tested with topologies like:
 -smp 4,sockets=2
 -smp 16,sockets=4,cores=2,threads=2

v1 -> v2


- To overcome CVE-2020-8694, a socket communication is created
  to a privileged helper

- Add the privileged helper (qemu-vmsr-helper)

- Add SO_PEERCRED in qio channel socket

RFC -> v1
-

- Add vmsr_* in front of all vmsr specific function

- Change malloc()/calloc()... with all glib equivalent

- Pre-allocate all dynamic memories when possible

- Add a Documentation of implementation, limitation and usage

Best regards,
Anthony

[1]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg1003382.html

Anthony Harivel (3):
  qio: add support for SO_PEERCRED for socket channel
  tools: build qemu-vmsr-helper
  Add support for RAPL MSRs in KVM/Qemu

 accel/kvm/kvm-all.c  |  27 ++
 contrib/systemd/qemu-vmsr-helper.service |  15 +
 contrib/systemd/qemu-vmsr-helper.socket  |   9 +
 docs/specs/index.rst |   1 +
 docs/specs/rapl-msr.rst  | 133 ++
 docs/tools/index.rst |   1 +
 docs/tools/qemu-vmsr-helper.rst  |  89 
 include/io/channel.h |  21 +
 include/sysemu/kvm_int.h |  17 +
 io/channel-socket.c  |  23 +
 io/channel.c |  12 +
 meson.build  |   5 +
 target/i386/cpu.h|   8 +
 target/i386/kvm/kvm.c| 348 
 target/i386/kvm/meson.build  |   1 +
 target/i386/kvm/vmsr_energy.c| 295 +
 target/i386/kvm/vmsr_energy.h|  87 
 tools/i386/qemu-vmsr-helper.c| 507 +++
 tools/i386/rapl-msr-index.h  |  28 ++
 19 files changed, 1627 insertions(+)
 create mode 100644 contrib/systemd/qemu-vmsr-helper.service
 create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

-- 
2.43.0




[PATCH v3 3/3] Add support for RAPL MSRs in KVM/Qemu

2024-01-24 Thread Anthony Harivel
Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
interface (Running Average Power Limit) for advertising the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM,
etc.).

The consumption is reported via MSRs (model specific registers) like
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
64 bits registers that represent the accumulated energy consumption in
micro Joules. They are updated by microcode every ~1ms.

For now, KVM always returns 0 when the guest requests the value of
these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle
these MSRs dynamically in userspace.

To limit the amount of system calls for every MSR call, create a new
thread in QEMU that updates the "virtual" MSR values asynchronously.

Each vCPU has its own vMSR to reflect the independence of vCPUs. The
thread updates the vMSR values with the ratio of energy consumed of
the whole physical CPU package the vCPU thread runs on and the
thread's utime and stime values.

All other non-vCPU threads are also taken into account. Their energy
consumption is evenly distributed among all vCPUs threads running on
the same physical CPU package.

To overcome the problem that reading the RAPL MSRs requires privileged
access, a socket communication between QEMU and qemu-vmsr-helper is
mandatory. You can specify the socket path via a parameter.

This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock

Current limitations:
- Works only on Intel host CPUs because AMD CPUs use different MSR
  addresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
  the moment.

Signed-off-by: Anthony Harivel 
---
 accel/kvm/kvm-all.c   |  27 +++
 docs/specs/index.rst  |   1 +
 docs/specs/rapl-msr.rst   | 133 +
 include/sysemu/kvm_int.h  |  17 ++
 target/i386/cpu.h |   8 +
 target/i386/kvm/kvm.c | 348 ++
 target/i386/kvm/meson.build   |   1 +
 target/i386/kvm/vmsr_energy.c | 295 
 target/i386/kvm/vmsr_energy.h |  87 +
 9 files changed, 917 insertions(+)
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 49e755ec4ad2..d63a6af91291 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3603,6 +3603,21 @@ static void kvm_set_device(Object *obj,
 s->device = g_strdup(value);
 }
 
+static void kvm_set_kvm_rapl(Object *obj, bool value, Error **errp)
+{
+KVMState *s = KVM_STATE(obj);
+s->msr_energy.enable = value;
+}
+
+static void kvm_set_kvm_rapl_socket_path(Object *obj,
+ const char *str,
+ Error **errp)
+{
+KVMState *s = KVM_STATE(obj);
+g_free(s->msr_energy.socket_path);
+s->msr_energy.socket_path = g_strdup(str);
+}
+
 static void kvm_accel_instance_init(Object *obj)
 {
 KVMState *s = KVM_STATE(obj);
@@ -3622,6 +3637,7 @@ static void kvm_accel_instance_init(Object *obj)
 s->xen_gnttab_max_frames = 64;
 s->xen_evtchn_max_pirq = 256;
 s->device = NULL;
+s->msr_energy.enable = false;
 }
 
 /**
@@ -3666,6 +3682,17 @@ static void kvm_accel_class_init(ObjectClass *oc, void 
*data)
 object_class_property_set_description(oc, "device",
 "Path to the device node to use (default: /dev/kvm)");
 
+object_class_property_add_bool(oc, "rapl",
+   NULL,
+   kvm_set_kvm_rapl);
+object_class_property_set_description(oc, "rapl",
+"Allow energy related MSRs for RAPL interface in Guest");
+
+object_class_property_add_str(oc, "rapl-helper-socket", NULL,
+  kvm_set_kvm_rapl_socket_path);
+object_class_property_set_description(oc, "rapl-helper-socket",
+"Socket Path for comminucating with the Virtual MSR helper daemon");
+
 kvm_arch_accel_class_init(oc);
 }
 
diff --git a/docs/specs/index.rst b/docs/specs/index.rst
index b3f482b0aa58..b426ebb7713c 100644
--- a/docs/specs/index.rst
+++ b/docs/specs/index.rst
@@ -32,3 +32,4 @@ guest hardware that is specific to QEMU.
virt-ctlr
vmcoreinfo
vmgenid
+   rapl-msr
diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
new file mode 100644
index ..04d27c198fc0
--- /dev/null
+++ b/docs/specs/rapl-msr.rst
@@ -0,0 +1,133 @@
+
+RAPL MSR support
+
+
+The RAPL interface (Running Average Power Limit) is advertising the accumulated
+energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).
+
+The consumption is reported via MSRs (model specific registers) like
+MSR_PKG_ENERGY_

[PATCH v3 1/3] qio: add support for SO_PEERCRED for socket channel

2024-01-24 Thread Anthony Harivel
The function qio_channel_get_peercred() returns a pointer to the
credentials of the peer process connected to this socket.

This credentials structure is defined in  as follows:

struct ucred {
pid_t pid;/* Process ID of the sending process */
uid_t uid;/* User ID of the sending process */
gid_t gid;/* Group ID of the sending process */
};

The use of this function is possible only for connected AF_UNIX stream
sockets and for AF_UNIX stream and datagram socket pairs.

On platforms other than Linux, the function returns 0.

Signed-off-by: Anthony Harivel 
---
 include/io/channel.h | 21 +
 io/channel-socket.c  | 23 +++
 io/channel.c | 12 
 3 files changed, 56 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index 5f9dbaab65b0..0413435ce011 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -149,6 +149,9 @@ struct QIOChannelClass {
   void *opaque);
 int (*io_flush)(QIOChannel *ioc,
 Error **errp);
+void (*io_peerpid)(QIOChannel *ioc,
+unsigned int *pid,
+Error **errp);
 };
 
 /* General I/O handling functions */
@@ -898,4 +901,22 @@ int coroutine_mixed_fn qio_channel_writev_full_all(QIOChannel *ioc,
 int qio_channel_flush(QIOChannel *ioc,
   Error **errp);
 
+/**
+ * qio_channel_get_peerpid:
+ * @ioc: the channel object
+ * @pid: pointer to pid
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Stores the pid of the peer process connected to this socket in @pid.
+ *
+ * The use of this function is possible only for connected
+ * AF_UNIX stream sockets and for AF_UNIX stream and datagram
+ * socket pairs on Linux.
+ * On non-Linux platforms, @pid is set to 0.
+ *
+ */
+void qio_channel_get_peerpid(QIOChannel *ioc,
+  unsigned int *pid,
+  Error **errp);
+
 #endif /* QIO_CHANNEL_H */
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 3a899b060858..e6a73592650c 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -841,6 +841,28 @@ qio_channel_socket_set_cork(QIOChannel *ioc,
 socket_set_cork(sioc->fd, v);
 }
 
+static void
+qio_channel_socket_get_peerpid(QIOChannel *ioc,
+   unsigned int *pid,
+   Error **errp)
+{
+#ifdef CONFIG_LINUX
+QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+Error *err = NULL;
+socklen_t len = sizeof(struct ucred);
+
+struct ucred cred;
+if (getsockopt(sioc->fd,
+   SOL_SOCKET, SO_PEERCRED,
+   &cred, &len) == -1) {
+error_setg_errno(&err, errno, "Unable to get peer credentials");
+error_propagate(errp, err);
+}
+*pid = (unsigned int)cred.pid;
+#else
+*pid = 0;
+#endif
+}
 
 static int
 qio_channel_socket_close(QIOChannel *ioc,
@@ -938,6 +960,7 @@ static void qio_channel_socket_class_init(ObjectClass *klass,
 #ifdef QEMU_MSG_ZEROCOPY
 ioc_klass->io_flush = qio_channel_socket_flush;
 #endif
+ioc_klass->io_peerpid = qio_channel_socket_get_peerpid;
 }
 
 static const TypeInfo qio_channel_socket_info = {
diff --git a/io/channel.c b/io/channel.c
index 86c5834510ff..a5646650cf72 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -490,6 +490,18 @@ void qio_channel_set_cork(QIOChannel *ioc,
 }
 }
 
+void qio_channel_get_peerpid(QIOChannel *ioc,
+ unsigned int *pid,
+ Error **errp)
+{
+QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+if (!klass->io_peerpid) {
+error_setg(errp, "Channel does not support peer pid");
+return;
+}
+klass->io_peerpid(ioc, pid, errp);
+}
 
 off_t qio_channel_io_seek(QIOChannel *ioc,
   off_t offset,
-- 
2.43.0




[PATCH v3 2/3] tools: build qemu-vmsr-helper

2024-01-24 Thread Anthony Harivel
Introduce a privileged helper to access RAPL MSR.

The privileged helper tool, qemu-vmsr-helper, is designed to provide
virtual machines with the ability to read specific RAPL (Running Average
Power Limit) MSRs without requiring CAP_SYS_RAWIO privileges or relying
on external, out-of-tree patches.

The helper tool leverages Unix permissions and SO_PEERCRED socket
options to enforce access control, ensuring that only processes
explicitly requesting read access via readmsr() from a valid Thread ID
can access these MSRs.

The list of RAPL MSRs that are allowed to be read by the helper tool is
defined in rapl-msr-index.h. This list corresponds to the RAPL MSRs that
will be supported in the next commit titled "Add support for RAPL MSRs
in KVM/QEMU."

The tool is intentionally designed to run on the Linux x86 platform.
This initial implementation is tailored for Intel CPUs but can be
extended to support AMD CPUs in the future.

Signed-off-by: Anthony Harivel 
---
 contrib/systemd/qemu-vmsr-helper.service |  15 +
 contrib/systemd/qemu-vmsr-helper.socket  |   9 +
 docs/tools/index.rst |   1 +
 docs/tools/qemu-vmsr-helper.rst  |  89 
 meson.build  |   5 +
 tools/i386/qemu-vmsr-helper.c| 507 +++
 tools/i386/rapl-msr-index.h  |  28 ++
 7 files changed, 654 insertions(+)
 create mode 100644 contrib/systemd/qemu-vmsr-helper.service
 create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

diff --git a/contrib/systemd/qemu-vmsr-helper.service 
b/contrib/systemd/qemu-vmsr-helper.service
new file mode 100644
index ..8fd397bf79a9
--- /dev/null
+++ b/contrib/systemd/qemu-vmsr-helper.service
@@ -0,0 +1,15 @@
+[Unit]
+Description=Virtual RAPL MSR Daemon for QEMU
+
+[Service]
+WorkingDirectory=/tmp
+Type=simple
+ExecStart=/usr/bin/qemu-vmsr-helper
+PrivateTmp=yes
+ProtectSystem=strict
+ReadWritePaths=/var/run
+RestrictAddressFamilies=AF_UNIX
+Restart=always
+RestartSec=0
+
+[Install]
diff --git a/contrib/systemd/qemu-vmsr-helper.socket 
b/contrib/systemd/qemu-vmsr-helper.socket
new file mode 100644
index ..183e8304d6e2
--- /dev/null
+++ b/contrib/systemd/qemu-vmsr-helper.socket
@@ -0,0 +1,9 @@
+[Unit]
+Description=Virtual RAPL MSR helper for QEMU
+
+[Socket]
+ListenStream=/run/qemu-vmsr-helper.sock
+SocketMode=0600
+
+[Install]
+WantedBy=multi-user.target
diff --git a/docs/tools/index.rst b/docs/tools/index.rst
index 8e65ce0dfc7b..33ad438e86f6 100644
--- a/docs/tools/index.rst
+++ b/docs/tools/index.rst
@@ -16,3 +16,4 @@ command line utilities and other standalone programs.
qemu-pr-helper
qemu-trace-stap
virtfs-proxy-helper
+   qemu-vmsr-helper
diff --git a/docs/tools/qemu-vmsr-helper.rst b/docs/tools/qemu-vmsr-helper.rst
new file mode 100644
index ..6ec87b49d962
--- /dev/null
+++ b/docs/tools/qemu-vmsr-helper.rst
@@ -0,0 +1,89 @@
+============================
+QEMU virtual RAPL MSR helper
+============================
+
+Synopsis
+========
+
+**qemu-vmsr-helper** [*OPTION*]
+
+Description
+-----------
+
+Implements the virtual RAPL MSR helper for QEMU.
+
+Accessing the RAPL (Running Average Power Limit) MSR enables the RAPL powercap
+driver to advertise and monitor the power consumption or accumulated energy
+consumption of different power domains, such as CPU packages, DRAM, and other
+components when available.
+
+However, those registers are only accessible with privileged access (CAP_SYS_RAWIO).
+QEMU can use an external helper to access those privileged registers.
+
+:program:`qemu-vmsr-helper` is that external helper; it creates a listener
+socket which will accept incoming connections for communication with QEMU.
+
+If you want to run VMs in a setup like this, this helper should be started as a
+system service, and you should read the QEMU manual section on "RAPL MSR
+support" to find out how to configure QEMU to connect to the socket created by
+:program:`qemu-vmsr-helper`.
+
+After connecting to the socket, :program:`qemu-vmsr-helper` can
+optionally drop root privileges, except for those capabilities that
+are needed for its operation.
+
+:program:`qemu-vmsr-helper` can also use the systemd socket activation
+protocol.  In this case, the systemd socket unit should specify a
+Unix stream socket, like this::
+
+[Socket]
+ListenStream=/var/run/qemu-vmsr-helper.sock
+
+Options
+-------
+
+.. program:: qemu-vmsr-helper
+
+.. option:: -d, --daemon
+
+  run in the background (and create a PID file)
+
+.. option:: -q, --quiet
+
+  decrease verbosity
+
+.. option:: -v, --verbose
+
+  increase verbosity
+
+.. option:: -f, --pidfile=PATH
+
+  PID file when running as a daemon. By default the PID file
+  is created in the system runtime state directory, for example
+  :file:`/var/run/

Re: [PATCH V8 00/12] fix migration of suspended runstate

2023-12-20 Thread Anthony PERARD
On Mon, Dec 18, 2023 at 01:14:51PM +0800, Peter Xu wrote:
> On Wed, Dec 13, 2023 at 10:35:33AM -0500, Steven Sistare wrote:
> > Hi Peter, all have RB's, with all i's dotted and t's crossed - steve
> 
> Yes this seems to be more migration related so maybe good candidate for a
> pull from migration submodule.
> 
> But since this is still solving a generic issue, I'm copying a few more
> people from get_maintainers.pl that this series touches, just in case
> they'll have something to say before dev cycle starts.

I did a quick smoke test of migrating a Xen guest. It works fine for me.

Thanks,

-- 
Anthony PERARD



Re: [PATCH] fix qemu build with xen-4.18.0

2023-12-12 Thread Anthony PERARD
On Tue, Dec 12, 2023 at 03:35:50PM +, Volodymyr Babchuk wrote:
> Hi Anthony
> 
> Anthony PERARD  writes:
> 
> > On Fri, Dec 08, 2023 at 02:49:27PM -0800, Stefano Stabellini wrote:
> >> On Fri, 8 Dec 2023, Daniel P. Berrangé wrote:
> >> > On Thu, Dec 07, 2023 at 11:12:48PM +, Michael Young wrote:
> >> > > Builds of qemu-8.2.0rc2 with xen-4.18.0 are currently failing
> >> > > with errors like
> >> > > ../hw/arm/xen_arm.c:74:5: error: ‘GUEST_VIRTIO_MMIO_SPI_LAST’ 
> >> > > undeclared (first use in this function)
> >> > >74 |(GUEST_VIRTIO_MMIO_SPI_LAST - GUEST_VIRTIO_MMIO_SPI_FIRST)
> >> > >   | ^~
> >> > > 
> >> > > as there is an incorrect comparision in include/hw/xen/xen_native.h
> >> > > which means that settings like GUEST_VIRTIO_MMIO_SPI_LAST
> >> > > aren't being defined for xen-4.18.0
> >> > 
> >> > The conditions in arch-arm.h for xen 4.18 show:
> >> > 
> >> > $ cppi arch-arm.h | grep -E '(#.*if)|MMIO'
> >> > #ifndef __XEN_PUBLIC_ARCH_ARM_H__
> >> > # if defined(__XEN__) || defined(__XEN_TOOLS__) || defined(__GNUC__)
> >> > # endif
> >> > # ifndef __ASSEMBLY__
> >> > #  if defined(__XEN__) || defined(__XEN_TOOLS__)
> >> > #   if defined(__GNUC__) && !defined(__STRICT_ANSI__)
> >> > #   endif
> >> > #  endif /* __XEN__ || __XEN_TOOLS__ */
> >> > # endif
> >> > # if defined(__XEN__) || defined(__XEN_TOOLS__)
> >> > #  define PSR_MODE_BIT  0x10U /* Set iff AArch32 */
> >> > /* Virtio MMIO mappings */
> >> > #  define GUEST_VIRTIO_MMIO_BASE   xen_mk_ullong(0x0200)
> >> > #  define GUEST_VIRTIO_MMIO_SIZE   xen_mk_ullong(0x0010)
> >> > #  define GUEST_VIRTIO_MMIO_SPI_FIRST   33
> >> > #  define GUEST_VIRTIO_MMIO_SPI_LAST43
> >> > # endif
> >> > # ifndef __ASSEMBLY__
> >> > # endif
> >> > #endif /*  __XEN_PUBLIC_ARCH_ARM_H__ */
> >> > 
> >> > So the MMIO constants are available if __XEN__ or __XEN_TOOLS__
> >> > are defined. This is no different to the condition that was
> >> > present in Xen 4.17.
> >> > 
> >> > What you didn't mention was that the Fedora build failure is
> >> > seen on an x86_64 host, when building the aarch64 target QEMU,
> >> > and I think this is the key issue.
> >> 
> >> Hi Daniel, thanks for looking into it.
> >> 
> >> - you are building on a x86_64 host
> >> - the target is aarch64
> >> - the target is the aarch64 Xen PVH machine (xen_arm.c)
> >> 
> >> But is the resulting QEMU binary expected to be an x86 binary? Or are
> >> you cross compiling ARM binaries on a x86 host?
> >> 
> >> In other word, is the resulting QEMU binary expected to run on ARM or
> >> x86?
> >> 
> >> 
> >> > Are we expecting to build Xen support for non-arch native QEMU
> >> > system binaries or not ?
> >> 
> >> The ARM xenpvh machine (xen_arm.c) is meant to work with Xen on ARM, not
> >> Xen on x86.  So this is only expected to work if you are
> >> cross-compiling. But you can cross-compile both Xen and QEMU, and I am
> >> pretty sure that Yocto is able to build Xen, Xen userspace tools, and
> >> QEMU for Xen/ARM on an x86 host today.
> >> 
> >> 
> >> > The constants are defined in arch-arm.h, which is only included
> >> > under:
> >> > 
> >> >   #if defined(__i386__) || defined(__x86_64__)
> >> >   #include "arch-x86/xen.h"
> >> >   #elif defined(__arm__) || defined (__aarch64__)
> >> >   #include "arch-arm.h"
> >> >   #else
> >> >   #error "Unsupported architecture"
> >> >   #endif
> >> > 
> >> > 
> >> > When we are building on an x86_64 host, we're not going to get
> >> > arch-arm.h included, even if we're trying to build the aarch64
> >> > system emulator.
> >> > 
> >> > I don't know how this is supposed to work ?
> >> 
> >> It looks like a host vs. target architecture mismatch: the #if defined
> >> (__aarch64__) check should pass I think.
> >
> >
> > Building qemu with something like:
> > ./configure --enable-xen --cpu=x86_64
> > used to work. Can we fix that? It still works with v8.1.0.
> > At least, it works on x86, I never really try to build qemu for arm.
> > Notice that there's no "--target-list" on the configure command line.
> > I don't know if --cpu is useful here.
> >
> > Looks like the first commit where the build doesn't work is
> > 7899f6589b78 ("xen_arm: Add virtual PCIe host bridge support").
> 
> I am currently trying to upstream this patch. It is in the QEMU mailing
> list but it was never accepted. It is not reviewed in fact. I'll take a
look at it, but I don't understand how you got it in the first place.

Sorry, I got the wrong commit pasted, I actually meant:
0c8ab1cddd6c ("xen_arm: Create virtio-mmio devices during initialization")

-- 
Anthony PERARD



Re: [RFC PATCH v4 4/6] xen: add option to disable legacy backends

2023-12-12 Thread Anthony PERARD
On Sat, Dec 02, 2023 at 01:41:22AM +, Volodymyr Babchuk wrote:
> diff --git a/hw/xenpv/xen_machine_pv.c b/hw/xenpv/xen_machine_pv.c
> index 9f9f137f99..03a55f345c 100644
> --- a/hw/xenpv/xen_machine_pv.c
> +++ b/hw/xenpv/xen_machine_pv.c
> @@ -37,7 +37,9 @@ static void xen_init_pv(MachineState *machine)
>  setup_xen_backend_ops();
>  
>  /* Initialize backend core & drivers */
> +#ifdef CONFIG_XEN_LEGACY_BACKENDS
>  xen_be_init();
> +#endif

There's more code that depends on legacy backend support in this
function: calls to xen_be_register() and xen_config_dev_nic(), the symbol
xen_config_cleanup, and the code commented with "configure framebuffer".
I've tried to build this on x86.

>  
>  switch (xen_mode) {
>  case XEN_ATTACH:
> diff --git a/meson.build b/meson.build
> index ec01f8b138..c8a43dd97d 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -2219,6 +,7 @@ config_host_data.set('CONFIG_DBUS_DISPLAY', 
> dbus_display)
>  config_host_data.set('CONFIG_CFI', get_option('cfi'))
>  config_host_data.set('CONFIG_SELINUX', selinux.found())
>  config_host_data.set('CONFIG_XEN_BACKEND', xen.found())
> +config_host_data.set('CONFIG_XEN_LEGACY_BACKENDS', have_xen_legacy_backends)

I don't know if "config_host_data" is the right place to have "#define
CONFIG_XEN_LEGACY_BACKENDS", but the alternative is probably to define a
Kconfig value, but I don't know if that would be correct as well.
I guess this is fine here, for now.


>  config_host_data.set('CONFIG_LIBDW', libdw.found())
>  if xen.found()
># protect from xen.version() having less than three components
> @@ -3049,6 +3053,7 @@ config_all += config_targetos
>  config_all += config_all_disas
>  config_all += {
>'CONFIG_XEN': xen.found(),
> +  'CONFIG_XEN_LEGACY_BACKENDS': have_xen_legacy_backends,

I don't think this is useful here, or even wanted.
I think things added to config_all are used only in "meson.build" files,
for things like "system_ss.add(when: ['CONFIG_XEN_LEGACY_BACKENDS'] ..."
But you use "if have_xen_legacy_backends" instead, which is probably ok
(because objects also depends on CONFIG_XEN_BUS).

>'CONFIG_SYSTEM_ONLY': have_system,
>'CONFIG_USER_ONLY': have_user,
>'CONFIG_ALL': true,
> diff --git a/meson_options.txt b/meson_options.txt
> index c9baeda639..91dd677257 100644
> --- a/meson_options.txt
> +++ b/meson_options.txt
> @@ -77,6 +77,8 @@ option('nvmm', type: 'feature', value: 'auto',
> description: 'NVMM acceleration support')
>  option('xen', type: 'feature', value: 'auto',
> description: 'Xen backend support')
> +option('xen-legacy-backends', type: 'feature', value: 'auto',

Every other meson option uses '_'; I haven't found a single '-'.
Shouldn't this new option follow the same trend and be named
"xen_legacy_backends" ?

> +   description: 'Xen legacy backends (9pfs, fb, qusb) support')

This description feels a bit wrong somehow. "Legacy backend" is internal
to QEMU's code, and meant that the backends are implemented using legacy
support that we want to retire. But the backends themselves, as seen by
a guest, aren't going to change, and are not legacy. Also, a few months
ago, "qnic" would have been part of the list. Maybe a description like
"Xen backends based on legacy support" might be more appropriate. I'm
not sure listing the different backends in the description is a good
idea, as we will have to remember to change it whenever one of those
backends is upgraded.


Cheers,

-- 
Anthony PERARD



Re: [PATCH] fix qemu build with xen-4.18.0

2023-12-12 Thread Anthony PERARD
On Fri, Dec 08, 2023 at 02:49:27PM -0800, Stefano Stabellini wrote:
> On Fri, 8 Dec 2023, Daniel P. Berrangé wrote:
> > On Thu, Dec 07, 2023 at 11:12:48PM +, Michael Young wrote:
> > > Builds of qemu-8.2.0rc2 with xen-4.18.0 are currently failing
> > > with errors like
> > > ../hw/arm/xen_arm.c:74:5: error: ‘GUEST_VIRTIO_MMIO_SPI_LAST’ undeclared 
> > > (first use in this function)
> > >74 |(GUEST_VIRTIO_MMIO_SPI_LAST - GUEST_VIRTIO_MMIO_SPI_FIRST)
> > >   | ^~
> > > 
> > > as there is an incorrect comparision in include/hw/xen/xen_native.h
> > > which means that settings like GUEST_VIRTIO_MMIO_SPI_LAST
> > > aren't being defined for xen-4.18.0
> > 
> > The conditions in arch-arm.h for xen 4.18 show:
> > 
> > $ cppi arch-arm.h | grep -E '(#.*if)|MMIO'
> > #ifndef __XEN_PUBLIC_ARCH_ARM_H__
> > # if defined(__XEN__) || defined(__XEN_TOOLS__) || defined(__GNUC__)
> > # endif
> > # ifndef __ASSEMBLY__
> > #  if defined(__XEN__) || defined(__XEN_TOOLS__)
> > #   if defined(__GNUC__) && !defined(__STRICT_ANSI__)
> > #   endif
> > #  endif /* __XEN__ || __XEN_TOOLS__ */
> > # endif
> > # if defined(__XEN__) || defined(__XEN_TOOLS__)
> > #  define PSR_MODE_BIT  0x10U /* Set iff AArch32 */
> > /* Virtio MMIO mappings */
> > #  define GUEST_VIRTIO_MMIO_BASE   xen_mk_ullong(0x0200)
> > #  define GUEST_VIRTIO_MMIO_SIZE   xen_mk_ullong(0x0010)
> > #  define GUEST_VIRTIO_MMIO_SPI_FIRST   33
> > #  define GUEST_VIRTIO_MMIO_SPI_LAST43
> > # endif
> > # ifndef __ASSEMBLY__
> > # endif
> > #endif /*  __XEN_PUBLIC_ARCH_ARM_H__ */
> > 
> > So the MMIO constants are available if __XEN__ or __XEN_TOOLS__
> > are defined. This is no different to the condition that was
> > present in Xen 4.17.
> > 
> > What you didn't mention was that the Fedora build failure is
> > seen on an x86_64 host, when building the aarch64 target QEMU,
> > and I think this is the key issue.
> 
> Hi Daniel, thanks for looking into it.
> 
> - you are building on a x86_64 host
> - the target is aarch64
> - the target is the aarch64 Xen PVH machine (xen_arm.c)
> 
> But is the resulting QEMU binary expected to be an x86 binary? Or are
> you cross compiling ARM binaries on a x86 host?
> 
> In other word, is the resulting QEMU binary expected to run on ARM or
> x86?
> 
> 
> > Are we expecting to build Xen support for non-arch native QEMU
> > system binaries or not ?
> 
> The ARM xenpvh machine (xen_arm.c) is meant to work with Xen on ARM, not
> Xen on x86.  So this is only expected to work if you are
> cross-compiling. But you can cross-compile both Xen and QEMU, and I am
> pretty sure that Yocto is able to build Xen, Xen userspace tools, and
> QEMU for Xen/ARM on an x86 host today.
> 
> 
> > The constants are defined in arch-arm.h, which is only included
> > under:
> > 
> >   #if defined(__i386__) || defined(__x86_64__)
> >   #include "arch-x86/xen.h"
> >   #elif defined(__arm__) || defined (__aarch64__)
> >   #include "arch-arm.h"
> >   #else
> >   #error "Unsupported architecture"
> >   #endif
> > 
> > 
> > When we are building on an x86_64 host, we're not going to get
> > arch-arm.h included, even if we're trying to build the aarch64
> > system emulator.
> > 
> > I don't know how this is supposed to work ?
> 
> It looks like a host vs. target architecture mismatch: the #if defined
> (__aarch64__) check should pass I think.


Building qemu with something like:
./configure --enable-xen --cpu=x86_64
used to work. Can we fix that? It still works with v8.1.0.
At least, it works on x86, I never really try to build qemu for arm.
Notice that there's no "--target-list" on the configure command line.
I don't know if --cpu is useful here.

Looks like the first commit where the build doesn't work is
7899f6589b78 ("xen_arm: Add virtual PCIe host bridge support").

Could we get that fixed?

I'm sure distribution will appreciate to be able to build a single qemu
package for xen and other, rather than having a dedicated qemu-xen
package.

Cheers,

-- 
Anthony PERARD



Re: [PATCH v4 2/6] xen: backends: don't overwrite XenStore nodes created by toolstack

2023-12-06 Thread Anthony PERARD
On Sat, Dec 02, 2023 at 01:41:21AM +, Volodymyr Babchuk wrote:
> Xen PV devices in QEMU can be created in two ways: either by QEMU
> itself, if they were passed via command line, or by Xen toolstack. In
> the latter case, QEMU scans XenStore entries and configures devices
> accordingly.
> 
> In the second case we don't want QEMU to write/delete front-end
> entries for two reasons: it might have no access to those entries if
> it is running in un-privileged domain and it is just incorrect to
> overwrite entries already provided by Xen toolstack, because toolstack
> manages those nodes. For example, it might read backend- or frontend-
> state to be sure that they are both disconnected and it is safe to
> destroy a domain.
> 
> This patch checks presence of xendev->backend to check if Xen PV
> device was configured by Xen toolstack to decide if it should touch
> frontend entries in XenStore. Also, we remove XenStore
> entries during device teardown only if they weren't created by the Xen
> toolstack. If they were created by toolstack, then it is toolstack's
> job to do proper clean-up.
> 
> Suggested-by: Paul Durrant 
> Suggested-by: David Woodhouse 
> Co-Authored-by: Oleksandr Tyshchenko 
> Signed-off-by: Volodymyr Babchuk 
> Reviewed-by: David Woodhouse 
> 

Hi Volodymyr,

There's something wrong with this patch. The block backend doesn't work
when creating a guest via libxl, an x86 hvm guest with qdisk.

Error from guest kernel:
"2 reading backend fields at /local/domain/0/backend/qdisk/23/768"

It seems that "sector-size" is missing for the disk.

Thanks,

-- 
Anthony PERARD



[PATCH v2 1/3] qio: add support for SO_PEERCRED for socket channel

2023-10-31 Thread Anthony Harivel
The function qio_channel_get_peercred() returns a pointer to the
credentials of the peer process connected to this socket.

This credentials structure is defined in <sys/socket.h> as follows:

struct ucred {
pid_t pid;/* Process ID of the sending process */
uid_t uid;/* User ID of the sending process */
gid_t gid;/* Group ID of the sending process */
};

The use of this function is possible only for connected AF_UNIX stream
sockets and for AF_UNIX stream and datagram socket pairs.

Signed-off-by: Anthony Harivel 
---
 include/io/channel.h | 20 
 io/channel-socket.c  | 17 +
 io/channel.c | 12 
 3 files changed, 49 insertions(+)

diff --git a/include/io/channel.h b/include/io/channel.h
index 5f9dbaab65b0..99c02d61c3d9 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -149,6 +149,9 @@ struct QIOChannelClass {
   void *opaque);
 int (*io_flush)(QIOChannel *ioc,
 Error **errp);
+void (*io_peercred)(QIOChannel *ioc,
+struct ucred *cred,
+Error **errp);
 };
 
 /* General I/O handling functions */
@@ -898,4 +901,21 @@ int coroutine_mixed_fn qio_channel_writev_full_all(QIOChannel *ioc,
 int qio_channel_flush(QIOChannel *ioc,
   Error **errp);
 
+/**
+ * qio_channel_get_peercred:
+ * @ioc: the channel object
+ * @cred: pointer to ucred struct
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Returns the credentials of the peer process connected to this socket.
+ *
+ * The use of this function is possible only for connected
+ * AF_UNIX stream sockets and for AF_UNIX stream and datagram
+ * socket pairs.
+ *
+ */
+void qio_channel_get_peercred(QIOChannel *ioc,
+  struct ucred *cred,
+  Error **errp);
+
 #endif /* QIO_CHANNEL_H */
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 02ffb51e9957..b8285eb8ae49 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -836,6 +836,22 @@ qio_channel_socket_set_cork(QIOChannel *ioc,
 socket_set_cork(sioc->fd, v);
 }
 
+static void
+qio_channel_socket_get_peercred(QIOChannel *ioc,
+struct ucred *cred,
+Error **errp)
+{
+QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+socklen_t len = sizeof(struct ucred);
+Error *err = NULL;
+
+if (getsockopt(sioc->fd,
+   SOL_SOCKET, SO_PEERCRED,
+   cred, &len) == -1) {
+error_setg_errno(&err, errno, "Unable to get peer credentials");
+error_propagate(errp, err);
+}
+}
 
 static int
 qio_channel_socket_close(QIOChannel *ioc,
@@ -933,6 +949,7 @@ static void qio_channel_socket_class_init(ObjectClass *klass,
 #ifdef QEMU_MSG_ZEROCOPY
 ioc_klass->io_flush = qio_channel_socket_flush;
 #endif
+ioc_klass->io_peercred = qio_channel_socket_get_peercred;
 }
 
 static const TypeInfo qio_channel_socket_info = {
diff --git a/io/channel.c b/io/channel.c
index 86c5834510ff..6dba5242 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -490,6 +490,18 @@ void qio_channel_set_cork(QIOChannel *ioc,
 }
 }
 
+void qio_channel_get_peercred(QIOChannel *ioc,
+  struct ucred *cred,
+  Error **errp)
+{
+QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+if (!klass->io_peercred) {
+error_setg(errp, "Channel does not support random access");
+return;
+}
+klass->io_peercred(ioc, cred, errp);
+}
 
 off_t qio_channel_io_seek(QIOChannel *ioc,
   off_t offset,
-- 
2.41.0




[PATCH v2 0/3] Add support for RAPL MSRs series

2023-10-31 Thread Anthony Harivel
Hello,

This v2 patch series tries to overcome the issue of CVE-2020-8694
[1] when reading the RAPL MSRs to populate the virtual ones on a
KVM/QEMU virtual machine.

The solution proposed here is to create a helper daemon that would run
as a privileged process, able to communicate via a socket with the QEMU
thread that deals with the ratio calculation of the energy counter.

So first it adds the SO_PEERCRED socket option in QIO CHANNEL so that
the helper daemon can check the PID of the peer (QEMU) to validate the
TID that is in the message. 

Then the daemon, called qemu-vmsr-helper, is added in the tools folder.
The daemon is very similar to the qemu-pr-helper in terms of operation.
However, the communication protocol is simpler and requires only one
coroutine to handle the peer request. Only the RAPL MSRs are allowed to
be read via the helper.

And to finish the last commit adds all the RAPL MSR in KVM/QEMU like the
v1 but, instead of reading the MSRs directly via readmsr(), reads the
values through socket communication.

This is a follow-up of the V1 sent mid-june [2].

v1 -> v2
--------

- To overcome the CVE-2020-8694 a socket communication is created
  to a priviliged helper

- Add the priviliged helper (qemu-vmsr-helper)

- Add SO_PEERCRED in qio channel socket

RFC -> v1
---------

- Add vmsr_* in front of all vmsr specific function

- Change malloc()/calloc()... with all glib equivalent

- Pre-allocate all dynamic memories when possible

- Add a Documentation of implementation, limitation and usage

Regards,
Anthony

[1]: 
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/running-average-power-limit-energy-reporting.html#cve-2020-8694
[2]: 
https://lore.kernel.org/qemu-devel/20230616140830.316655-1-ahari...@redhat.com/

Anthony Harivel (3):
  qio: add support for SO_PEERCRED for socket channel
  tools: build qemu-vmsr-helper
  Add support for RAPL MSRs in KVM/Qemu

 accel/kvm/kvm-all.c |  26 ++
 docs/specs/index.rst|   1 +
 docs/specs/rapl-msr.rst | 131 +
 docs/tools/index.rst|   1 +
 docs/tools/qemu-vmsr-helper.rst |  89 ++
 include/io/channel.h|  20 ++
 include/sysemu/kvm_int.h|  12 +
 io/channel-socket.c |  17 ++
 io/channel.c|  12 +
 meson.build |   5 +
 target/i386/cpu.h   |   8 +
 target/i386/kvm/kvm.c   | 308 +++
 target/i386/kvm/meson.build |   1 +
 target/i386/kvm/vmsr_energy.c   | 278 +
 target/i386/kvm/vmsr_energy.h   |  82 ++
 tools/i386/qemu-vmsr-helper.c   | 507 
 tools/i386/rapl-msr-index.h |  28 ++
 17 files changed, 1526 insertions(+)
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

-- 
2.41.0




[PATCH v2 2/3] tools: build qemu-vmsr-helper

2023-10-31 Thread Anthony Harivel
Introduce a privileged helper to access RAPL MSR.

The privileged helper tool, qemu-vmsr-helper, is designed to provide
virtual machines with the ability to read specific RAPL (Running Average
Power Limit) MSRs without requiring CAP_SYS_RAWIO privileges or relying
on external, out-of-tree patches.

The helper tool leverages Unix permissions and SO_PEERCRED socket
options to enforce access control, ensuring that only processes
explicitly requesting read access via readmsr() from a valid Thread ID
can access these MSRs.

The list of RAPL MSRs that are allowed to be read by the helper tool is
defined in rapl-msr-index.h. This list corresponds to the RAPL MSRs that
will be supported in the next commit titled "Add support for RAPL MSRs
in KVM/QEMU."

Signed-off-by: Anthony Harivel 
---
 docs/tools/index.rst|   1 +
 docs/tools/qemu-vmsr-helper.rst |  89 ++
 meson.build |   5 +
 tools/i386/qemu-vmsr-helper.c   | 507 
 tools/i386/rapl-msr-index.h |  28 ++
 5 files changed, 630 insertions(+)
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

diff --git a/docs/tools/index.rst b/docs/tools/index.rst
index 8e65ce0dfc7b..33ad438e86f6 100644
--- a/docs/tools/index.rst
+++ b/docs/tools/index.rst
@@ -16,3 +16,4 @@ command line utilities and other standalone programs.
qemu-pr-helper
qemu-trace-stap
virtfs-proxy-helper
+   qemu-vmsr-helper
diff --git a/docs/tools/qemu-vmsr-helper.rst b/docs/tools/qemu-vmsr-helper.rst
new file mode 100644
index ..6ec87b49d962
--- /dev/null
+++ b/docs/tools/qemu-vmsr-helper.rst
@@ -0,0 +1,89 @@
+============================
+QEMU virtual RAPL MSR helper
+============================
+
+Synopsis
+========
+
+**qemu-vmsr-helper** [*OPTION*]
+
+Description
+-----------
+
+Implements the virtual RAPL MSR helper for QEMU.
+
+Accessing the RAPL (Running Average Power Limit) MSR enables the RAPL powercap
+driver to advertise and monitor the power consumption or accumulated energy
+consumption of different power domains, such as CPU packages, DRAM, and other
+components when available.
+
+However, those registers are only accessible with privileged access (CAP_SYS_RAWIO).
+QEMU can use an external helper to access those privileged registers.
+
+:program:`qemu-vmsr-helper` is that external helper; it creates a listener
+socket which will accept incoming connections for communication with QEMU.
+
+If you want to run VMs in a setup like this, this helper should be started as a
+system service, and you should read the QEMU manual section on "RAPL MSR
+support" to find out how to configure QEMU to connect to the socket created by
+:program:`qemu-vmsr-helper`.
+
+After connecting to the socket, :program:`qemu-vmsr-helper` can
+optionally drop root privileges, except for those capabilities that
+are needed for its operation.
+
+:program:`qemu-vmsr-helper` can also use the systemd socket activation
+protocol.  In this case, the systemd socket unit should specify a
+Unix stream socket, like this::
+
+[Socket]
+ListenStream=/var/run/qemu-vmsr-helper.sock
+
+Options
+-------
+
+.. program:: qemu-vmsr-helper
+
+.. option:: -d, --daemon
+
+  run in the background (and create a PID file)
+
+.. option:: -q, --quiet
+
+  decrease verbosity
+
+.. option:: -v, --verbose
+
+  increase verbosity
+
+.. option:: -f, --pidfile=PATH
+
+  PID file when running as a daemon. By default the PID file
+  is created in the system runtime state directory, for example
+  :file:`/var/run/qemu-vmsr-helper.pid`.
+
+.. option:: -k, --socket=PATH
+
+  path to the socket. By default the socket is created in
+  the system runtime state directory, for example
+  :file:`/var/run/qemu-vmsr-helper.sock`.
+
+.. option:: -T, --trace [[enable=]PATTERN][,events=FILE][,file=FILE]
+
+  .. include:: ../qemu-option-trace.rst.inc
+
+.. option:: -u, --user=USER
+
+  user to drop privileges to
+
+.. option:: -g, --group=GROUP
+
+  group to drop privileges to
+
+.. option:: -h, --help
+
+  Display a help message and exit.
+
+.. option:: -V, --version
+
+  Display version information and exit.
diff --git a/meson.build b/meson.build
index dcef8b1e7911..d30a7a09d46f 100644
--- a/meson.build
+++ b/meson.build
@@ -3950,6 +3950,11 @@ if have_tools
dependencies: [authz, crypto, io, qom, qemuutil,
   libcap_ng, mpathpersist],
install: true)
+
+executable('qemu-vmsr-helper', files('tools/i386/qemu-vmsr-helper.c'),
+   dependencies: [authz, crypto, io, qom, qemuutil,
+  libcap_ng, mpathpersist],
+   install: true)
   endif
 
   if have_ivshmem
diff --git a/tools/i386/qemu-vmsr-helper.c b/tools/i386/qemu-vmsr-helper.c
new file mode 100644
index ..1d82a2753e44
--- /dev/null
+++ b/tools/i386/qemu-vmsr-hel

[PATCH v2 3/3] Add support for RAPL MSRs in KVM/Qemu

2023-10-31 Thread Anthony Harivel
Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
interface (Running Average Power Limit) for advertising the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM,
etc.).

The consumption is reported via MSRs (model specific registers) like
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
64 bits registers that represent the accumulated energy consumption in
micro Joules. They are updated by microcode every ~1ms.

For now, KVM always returns 0 when the guest requests the value of
these MSRs. Use the KVM MSR filtering mechanism to allow QEMU to handle
these MSRs dynamically in userspace.

To limit the amount of system calls for every MSR call, create a new
thread in QEMU that updates the "virtual" MSR values asynchronously.

Each vCPU has its own vMSR to reflect the independence of vCPUs. The
thread updates each vMSR value from the energy consumed by the whole
physical CPU package the vCPU thread runs on, scaled by the ratio of
the vCPU thread's utime and stime values.

All other non-vCPU threads are also taken into account. Their energy
consumption is evenly distributed among all vCPU threads running on
the same physical CPU package.

To overcome the problem that reading the RAPL MSRs requires privileged
access, socket communication between QEMU and the qemu-vmsr-helper is
mandatory. The socket path can be specified in the `path` parameter.

This feature is activated with -accel kvm,rapl=true,path=/path/sock.sock

Actual limitation:
- Works only on Intel host CPUs because AMD CPUs use different MSR
  addresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
  the moment.

- Since each vCPU has an independent vMSR value, the vCPU topology must
  be changed to match that reality. There must be a single vCPU per
  virtual socket (e.g.: -smp 4,sockets=4). Accessing pkg-0 energy will
  give vCPU 0 energy, pkg-1 will give vCPU 1 energy, etc.

Signed-off-by: Anthony Harivel 

 accel/kvm/kvm-all.c   |  26 +++
 docs/specs/index.rst  |   1 +
 docs/specs/rapl-msr.rst   | 131 +++
 include/sysemu/kvm_int.h  |  12 ++
 target/i386/cpu.h |   8 +
 target/i386/kvm/kvm.c | 308 ++
 target/i386/kvm/meson.build   |   1 +
 target/i386/kvm/vmsr_energy.c | 278 ++
 target/i386/kvm/vmsr_energy.h |  82 +
 9 files changed, 847 insertions(+)
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 3f7eafe08cbe..e0df75932e8e 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3782,6 +3782,21 @@ static void kvm_set_dirty_ring_size(Object *obj, Visitor 
*v,
 s->kvm_dirty_ring_size = value;
 }
 
+static void kvm_set_kvm_rapl(Object *obj, bool value, Error **errp)
+{
+KVMState *s = KVM_STATE(obj);
+s->msr_energy.enable = value;
+}
+
+static void kvm_set_kvm_rapl_socket_path(Object *obj,
+ const char *str,
+ Error **errp)
+{
+KVMState *s = KVM_STATE(obj);
+g_free(s->msr_energy.socket_path);
+s->msr_energy.socket_path = g_strdup(str);
+}
+
 static void kvm_accel_instance_init(Object *obj)
 {
 KVMState *s = KVM_STATE(obj);
@@ -3800,6 +3815,7 @@ static void kvm_accel_instance_init(Object *obj)
 s->xen_version = 0;
 s->xen_gnttab_max_frames = 64;
 s->xen_evtchn_max_pirq = 256;
+s->msr_energy.enable = false;
 }
 
 /**
@@ -3840,6 +3856,16 @@ static void kvm_accel_class_init(ObjectClass *oc, void 
*data)
 object_class_property_set_description(oc, "dirty-ring-size",
 "Size of KVM dirty page ring buffer (default: 0, i.e. use bitmap)");
 
+object_class_property_add_bool(oc, "rapl",
+   NULL,
+   kvm_set_kvm_rapl);
+object_class_property_set_description(oc, "rapl",
+"Allow energy related MSRs for RAPL interface in Guest");
+
+object_class_property_add_str(oc, "path", NULL,
+  kvm_set_kvm_rapl_socket_path);
+object_class_property_set_description(oc, "path",
+"Socket Path for communicating with the Virtual MSR helper daemon");
 kvm_arch_accel_class_init(oc);
 }
 
diff --git a/docs/specs/index.rst b/docs/specs/index.rst
index e58be38c41c7..5c2fa3d65877 100644
--- a/docs/specs/index.rst
+++ b/docs/specs/index.rst
@@ -24,3 +24,4 @@ guest hardware that is specific to QEMU.
acpi_erst
sev-guest-firmware
fw_cfg
+   rapl-msr
diff --git a/docs/specs/rapl-msr.rst b/docs/specs/rapl-msr.rst
new file mode 100644
index ..ec62a8206337
--- /dev/null
+++ b/docs/specs/rapl-msr.rst
@@ -0,0 +1,131

Re: [QEMU][PATCH v4 1/2] xen_arm: Create virtio-mmio devices during initialization

2023-10-23 Thread Anthony PERARD
On Wed, Oct 11, 2023 at 12:22:46PM -0700, Vikram Garhwal wrote:
> Hi Anthony,
> On Thu, Oct 05, 2023 at 11:40:57AM +0100, Anthony PERARD wrote:
> > Hi Vikram,
> > 
> > This patch prevents QEMU from being built with Xen 4.15. See comments.
> > 
> > Also, why didn't you CC all the maintainers of
> > include/hw/xen/xen_native.h?
> I missed it. The initial version didn't have this file change and I missed
> updating my cc list.

I use `cccmd` to never miss anyone, and I don't have to build a cc list ;-)

$ git config sendemail.cccmd
scripts/get_maintainer.pl --noroles --norolestats --nogit --nogit-fallback

> > > +static inline int xendevicemodel_set_irq_level(xendevicemodel_handle 
> > > *dmod,
> > > +   domid_t domid, uint32_t 
> > > irq,
> > > +   unsigned int level)
> > > +{
> > > +return 0;
> > 
> > Shouldn't this return something like -ENOSYS, instead of returning a
> > success?
> Changed return to -ENOSYS for older version.

Actually, at least on Linux, it looks like the function would return either
-1 or 0, and set errno. It seems that xendevicemodel_set_irq_level()
ultimately calls ioctl(), and the code in
xen.git/tools/libs/devicemodel/ also only returns -1 or 0.

So it's probably best to set errno=ENOSYS and return -1.

> > 
> > > diff --git a/hw/arm/xen_arm.c b/hw/arm/xen_arm.c
> > > index 1d3e6d481a..7393b37355 100644
> > > --- a/hw/arm/xen_arm.c
> > > +++ b/hw/arm/xen_arm.c
> > > +
> > > +static void xen_set_irq(void *opaque, int irq, int level)
> > > +{
> > > +xendevicemodel_set_irq_level(xen_dmod, xen_domid, irq, level);
> > 
> > So, you just ignore the return value here. Shouldn't there be some kind
> > of error check?
> > 
> > And is it OK to create a virtio-mmio device without an error, even when
> > we could find out that it is never going to work (e.g. on Xen 4.14)?
> This is something Oleksandr can answer better as it was written by him. But
> I think we can print an error "virtio init failed" and exit the
> machine init. Does that align with your thinking?

Something like that, yes, if possible. It would be a bit difficult
because xen_set_irq() seems to only be a handler which might only be
called after the machine has started. So I'm not sure what would be best
to do here.

Thanks,

-- 
Anthony PERARD



Re: [QEMU][PATCH v4 1/2] xen_arm: Create virtio-mmio devices during initialization

2023-10-05 Thread Anthony PERARD
Hi Vikram,

This patch prevents QEMU from being built with Xen 4.15. See comments.

Also, why didn't you CC all the maintainers of
include/hw/xen/xen_native.h?

On Tue, Aug 29, 2023 at 09:35:17PM -0700, Vikram Garhwal wrote:
> diff --git a/include/hw/xen/xen_native.h b/include/hw/xen/xen_native.h
> index 4dce905fde..a4b1aa9e5d 100644
> --- a/include/hw/xen/xen_native.h
> +++ b/include/hw/xen/xen_native.h
> @@ -523,4 +523,20 @@ static inline int xen_set_ioreq_server_state(domid_t dom,
>   enable);
>  }
>  
> +#if CONFIG_XEN_CTRL_INTERFACE_VERSION <= 41500

xendevicemodel_set_irq_level() was introduced in Xen 4.15, so this
should say '<' and not '<=', otherwise, we have:
include/hw/xen/xen_native.h:527:19: error: static declaration of 
‘xendevicemodel_set_irq_level’ follows non-static declaration

> +static inline int xendevicemodel_set_irq_level(xendevicemodel_handle *dmod,
> +   domid_t domid, uint32_t irq,
> +   unsigned int level)
> +{
> +return 0;

Shouldn't this return something like -ENOSYS, instead of returning a
success?

> diff --git a/hw/arm/xen_arm.c b/hw/arm/xen_arm.c
> index 1d3e6d481a..7393b37355 100644
> --- a/hw/arm/xen_arm.c
> +++ b/hw/arm/xen_arm.c
> +
> +static void xen_set_irq(void *opaque, int irq, int level)
> +{
> +xendevicemodel_set_irq_level(xen_dmod, xen_domid, irq, level);

So, you just ignore the return value here. Shouldn't there be some kind
of error check?

And is it OK to create a virtio-mmio device without an error, even when
we could find out that it is never going to work (e.g. on Xen 4.14)?

Cheers,

-- 
Anthony PERARD



Re: [PATCH 6/7] block: Clean up local variable shadowing

2023-09-11 Thread Anthony PERARD
On Thu, Aug 31, 2023 at 03:25:45PM +0200, Markus Armbruster wrote:
> diff --git a/hw/block/xen-block.c b/hw/block/xen-block.c
> index 3906b9058b..a07cd7eb5d 100644
> --- a/hw/block/xen-block.c
> +++ b/hw/block/xen-block.c
> @@ -369,7 +369,7 @@ static void xen_block_get_vdev(Object *obj, Visitor *v, 
> const char *name,
>  case XEN_BLOCK_VDEV_TYPE_XVD:
>  case XEN_BLOCK_VDEV_TYPE_HD:
>  case XEN_BLOCK_VDEV_TYPE_SD: {
> -char *name = disk_to_vbd_name(vdev->disk);
> +char *vbd_name = disk_to_vbd_name(vdev->disk);
>  
>  str = g_strdup_printf("%s%s%lu",
>(vdev->type == XEN_BLOCK_VDEV_TYPE_XVD) ?
> @@ -377,8 +377,8 @@ static void xen_block_get_vdev(Object *obj, Visitor *v, 
> const char *name,
>(vdev->type == XEN_BLOCK_VDEV_TYPE_HD) ?
>"hd" :
>"sd",
> -  name, vdev->partition);
> -g_free(name);
> +  vbd_name, vdev->partition);
> +g_free(vbd_name);
>  break;
>  }
>  default:

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH v1 00/23] Q35 support for Xen

2023-08-22 Thread Anthony PERARD
Hi Joel,

We had a design session about Q35 support during Xen Summit, and I think
the result of it is that some more changes are going to be needed,
right?

So, is it worth it for me to spend some time reviewing this patch series
in its current form, or should I wait until the next revision? And same
question for the xen toolstack side.

Cheers,

-- 
Anthony PERARD



[PULL 0/5] Misc fixes, for thread-pool, xen, and xen-emulate

2023-08-01 Thread Anthony PERARD via
The following changes since commit 802341823f1720511dd5cf53ae40285f7978c61b:

  Merge tag 'pull-tcg-20230731' of https://gitlab.com/rth7680/qemu into staging 
(2023-07-31 14:02:51 -0700)

are available in the Git repository at:

  https://xenbits.xen.org/git-http/people/aperard/qemu-dm.git 
tags/pull-xen-20230801

for you to fetch changes up to 856ca10f9ce1fcffeab18546b36a64f79017c905:

  xen-platform: do full PCI reset during unplug of IDE devices (2023-08-01 
10:22:33 +0100)


Misc fixes, for thread-pool, xen, and xen-emulate

* fix an access to `request_cond` QemuCond in thread-pool
* fix issue with PCI devices when unplugging IDE devices in Xen guest
* several fixes for issues pointed out by Coverity


Anthony PERARD (2):
  xen-block: Avoid leaks on new error path
  thread-pool: signal "request_cond" while locked

David Woodhouse (1):
  hw/xen: Clarify (lack of) error handling in transaction_commit()

Olaf Hering (1):
  xen-platform: do full PCI reset during unplug of IDE devices

Peter Maydell (1):
  xen: Don't pass MemoryListener around by value

 hw/arm/xen_arm.c|  4 ++--
 hw/block/xen-block.c| 11 ++-
 hw/i386/kvm/xenstore_impl.c | 12 +++-
 hw/i386/xen/xen-hvm.c   |  4 ++--
 hw/i386/xen/xen_platform.c  |  7 ---
 hw/xen/xen-hvm-common.c |  8 
 include/hw/xen/xen-hvm-common.h |  2 +-
 util/thread-pool.c  |  2 +-
 8 files changed, 31 insertions(+), 19 deletions(-)



[PULL 5/5] xen-platform: do full PCI reset during unplug of IDE devices

2023-08-01 Thread Anthony PERARD via
From: Olaf Hering 

The IDE unplug function needs to reset the entire PCI device, to make
sure all state is initialized to defaults. This is done by calling
pci_device_reset, which resets not only the chip specific registers, but
also all PCI state. This fixes "unplug" in a Xen HVM domU with the
modular legacy xenlinux PV drivers.

Commit ee358e919e38 ("hw/ide/piix: Convert reset handler to
DeviceReset") changed the way the disks are unplugged. Prior to
this commit the PCI device remained unchanged. After this change,
piix_ide_reset is exercised after the "unplug" command, which was not
the case prior to that commit. This function resets the command register.
As a result the ata_piix driver inside the domU will see a disabled PCI
device. The generic PCI code will reenable the PCI device. On the qemu
side, this runs pci_default_write_config/pci_update_mappings. Here a
changed address is returned by pci_bar_address, this is the address
which was truncated in piix_ide_reset. In case of a Xen HVM domU, the
address changes from 0xc120 to 0xc100. This truncation was a bug in
piix_ide_reset, which was fixed in commit 230dfd9257 ("hw/ide/piix:
properly initialize the BMIBA register"). If pci_xen_ide_unplug had used
pci_device_reset, the PCI registers would have been properly reset, and
commit ee358e919e38 would have not introduced a regression for this
specific domU environment.

While the unplug is supposed to hide the IDE disks, the changed BMIBA
address broke the UHCI device. In case the domU has a USB tablet
configured, to receive absolute pointer coordinates for the GUI, it will
cause a hang during device discovery of the partly discovered USB HID
device. Reading the USBSTS word size register will fail. The access ends
up in the QEMU piix-bmdma device, instead of the expected uhci device.
Here a byte size request is expected, and a value of ~0 is returned. As
a result the UCHI driver sees an error state in the register, and turns
off the UHCI controller.

Signed-off-by: Olaf Hering 
Reviewed-by: Paul Durrant 
Message-Id: <20230720072950.20198-1-o...@aepfle.de>
Signed-off-by: Anthony PERARD 
---
 hw/i386/xen/xen_platform.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/hw/i386/xen/xen_platform.c b/hw/i386/xen/xen_platform.c
index 57f1d742c1..17457ff3de 100644
--- a/hw/i386/xen/xen_platform.c
+++ b/hw/i386/xen/xen_platform.c
@@ -164,8 +164,9 @@ static void pci_unplug_nics(PCIBus *bus)
  *
  * [1] 
https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/hvm-emulated-unplug.pandoc
  */
-static void pci_xen_ide_unplug(DeviceState *dev, bool aux)
+static void pci_xen_ide_unplug(PCIDevice *d, bool aux)
 {
+DeviceState *dev = DEVICE(d);
 PCIIDEState *pci_ide;
 int i;
 IDEDevice *idedev;
@@ -195,7 +196,7 @@ static void pci_xen_ide_unplug(DeviceState *dev, bool aux)
 blk_unref(blk);
 }
 }
-device_cold_reset(dev);
+pci_device_reset(d);
 }
 
 static void unplug_disks(PCIBus *b, PCIDevice *d, void *opaque)
@@ -210,7 +211,7 @@ static void unplug_disks(PCIBus *b, PCIDevice *d, void 
*opaque)
 
 switch (pci_get_word(d->config + PCI_CLASS_DEVICE)) {
 case PCI_CLASS_STORAGE_IDE:
-pci_xen_ide_unplug(DEVICE(d), aux);
+pci_xen_ide_unplug(d, aux);
     break;
 
 case PCI_CLASS_STORAGE_SCSI:
-- 
Anthony PERARD




[PULL 1/5] hw/xen: Clarify (lack of) error handling in transaction_commit()

2023-08-01 Thread Anthony PERARD via
From: David Woodhouse 

Coverity was unhappy (CID 1508359) because we didn't check the return of
init_walk_op() in transaction_commit(), despite doing so at every other
call site.

Strictly speaking, this is a false positive since it can never fail. It
only fails for invalid user input (transaction ID or path), and both of
those are hard-coded to known sane values in this invocation.

But Coverity doesn't know that, and neither does the casual reader of the
code.

Returning an error here would be weird, since the transaction *is*
committed by this point; all the walk_op is doing is firing watches on
the newly-committed changed nodes. So make it a g_assert(!ret), since
it really should never happen.

Signed-off-by: David Woodhouse 
Reviewed-by: Paul Durrant 
Message-Id: <20076888f6bdf06a65aafc5cf954260965d45b97.ca...@infradead.org>
Signed-off-by: Anthony PERARD 
---
 hw/i386/kvm/xenstore_impl.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/hw/i386/kvm/xenstore_impl.c b/hw/i386/kvm/xenstore_impl.c
index 305fe75519..d9732b567e 100644
--- a/hw/i386/kvm/xenstore_impl.c
+++ b/hw/i386/kvm/xenstore_impl.c
@@ -1022,6 +1022,7 @@ static int transaction_commit(XenstoreImplState *s, 
XsTransaction *tx)
 {
 struct walk_op op;
 XsNode **n;
+int ret;
 
 if (s->root_tx != tx->base_tx) {
 return EAGAIN;
@@ -1032,7 +1033,16 @@ static int transaction_commit(XenstoreImplState *s, 
XsTransaction *tx)
 s->root_tx = tx->tx_id;
 s->nr_nodes = tx->nr_nodes;
 
-init_walk_op(s, , XBT_NULL, tx->dom_id, "/", );
+ret = init_walk_op(s, , XBT_NULL, tx->dom_id, "/", );
+/*
+ * There are two reasons why init_walk_op() may fail: an invalid tx_id,
+ * or an invalid path. We pass XBT_NULL and "/", and it cannot fail.
+ * If it does, the world is broken. And returning 'ret' would be weird
+ * because the transaction *was* committed, and all this tree walk is
+ * trying to do is fire the resulting watches on newly-committed nodes.
+ */
+g_assert(!ret);
+
     op.deleted_in_tx = false;
 op.mutating = true;
 
-- 
Anthony PERARD




[PULL 2/5] xen-block: Avoid leaks on new error path

2023-08-01 Thread Anthony PERARD via
From: Anthony PERARD 

Commit 189829399070 ("xen-block: Use specific blockdev driver")
introduced a new error path, without taking care of allocated
resources.

So only allocate the qdicts after the error check, and free both
`filename` and `driver` when we are about to return and thus taking
care of both success and error path.

Coverity only spotted the leak of qdicts (*_layer variables).

Reported-by: Peter Maydell 
Fixes: Coverity CID 1508722, 1398649
Fixes: 189829399070 ("xen-block: Use specific blockdev driver")
Signed-off-by: Anthony PERARD 
Reviewed-by: Paul Durrant 
Reviewed-by: Peter Maydell 
Message-Id: <20230704171819.42564-1-anthony.per...@citrix.com>
Signed-off-by: Anthony PERARD 
---
 hw/block/xen-block.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/hw/block/xen-block.c b/hw/block/xen-block.c
index f099914831..3906b9058b 100644
--- a/hw/block/xen-block.c
+++ b/hw/block/xen-block.c
@@ -781,14 +781,15 @@ static XenBlockDrive *xen_block_drive_create(const char 
*id,
 drive = g_new0(XenBlockDrive, 1);
 drive->id = g_strdup(id);
 
-file_layer = qdict_new();
-driver_layer = qdict_new();
-
 rc = stat(filename, );
 if (rc) {
 error_setg_errno(errp, errno, "Could not stat file '%s'", filename);
 goto done;
 }
+
+file_layer = qdict_new();
+driver_layer = qdict_new();
+
 if (S_ISBLK(st.st_mode)) {
 qdict_put_str(file_layer, "driver", "host_device");
 } else {
@@ -796,7 +797,6 @@ static XenBlockDrive *xen_block_drive_create(const char *id,
 }
 
 qdict_put_str(file_layer, "filename", filename);
-g_free(filename);
 
 if (mode && *mode != 'w') {
 qdict_put_bool(file_layer, "read-only", true);
@@ -831,7 +831,6 @@ static XenBlockDrive *xen_block_drive_create(const char *id,
 qdict_put_str(file_layer, "locking", "off");
 
 qdict_put_str(driver_layer, "driver", driver);
-g_free(driver);
 
 qdict_put(driver_layer, "file", file_layer);
 
@@ -842,6 +841,8 @@ static XenBlockDrive *xen_block_drive_create(const char *id,
 qobject_unref(driver_layer);
 
 done:
+g_free(filename);
+g_free(driver);
 if (*errp) {
 xen_block_drive_destroy(drive, NULL);
 return NULL;
-- 
Anthony PERARD




[PULL 3/5] thread-pool: signal "request_cond" while locked

2023-08-01 Thread Anthony PERARD via
From: Anthony PERARD 

thread_pool_free() might have been called on the `pool`, which would
be a reason for worker_thread() to quit. In this case,
`pool->request_cond` has been destroyed.

If worker_thread() didn't manage to signal `request_cond` before it
was destroyed by thread_pool_free(), we get:
util/qemu-thread-posix.c:198: qemu_cond_signal: Assertion 
`cond->initialized' failed.

One backtrace:
__GI___assert_fail (assertion=0x5614abcb "cond->initialized", 
file=0x5614ab88 "util/qemu-thread-posix.c", line=198,
function=0x5614ad80 <__PRETTY_FUNCTION__.17104> "qemu_cond_signal") 
at assert.c:101
qemu_cond_signal (cond=0x7fffb800db30) at util/qemu-thread-posix.c:198
worker_thread (opaque=0x7fffb800dab0) at util/thread-pool.c:129
qemu_thread_start (args=0x7fffb8000b20) at util/qemu-thread-posix.c:505
start_thread (arg=) at pthread_create.c:486

Reported here:
https://lore.kernel.org/all/ZJwoK50FcnTSfFZ8@MacBook-Air-de-Roger.local/T/#u

To avoid the issue, keep the lock held while signalling `request_cond`.

Fixes: 900fa208f506 ("thread-pool: replace semaphore with condition variable")
Signed-off-by: Anthony PERARD 
Reviewed-by: Stefan Hajnoczi 
Message-Id: <20230714152720.5077-1-anthony.per...@citrix.com>
Signed-off-by: Anthony PERARD 
---
 util/thread-pool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util/thread-pool.c b/util/thread-pool.c
index 0d97888df0..e3d8292d14 100644
--- a/util/thread-pool.c
+++ b/util/thread-pool.c
@@ -120,13 +120,13 @@ static void *worker_thread(void *opaque)
 
 pool->cur_threads--;
 qemu_cond_signal(>worker_stopped);
-qemu_mutex_unlock(>lock);
 
 /*
  * Wake up another thread, in case we got a wakeup but decided
  * to exit due to pool->cur_threads > pool->max_threads.
  */
 qemu_cond_signal(>request_cond);
+qemu_mutex_unlock(>lock);
 return NULL;
 }
 
-- 
Anthony PERARD




[PULL 4/5] xen: Don't pass MemoryListener around by value

2023-08-01 Thread Anthony PERARD via
From: Peter Maydell 

Coverity points out (CID 1513106, 1513107) that MemoryListener is a
192 byte struct which we are passing around by value.  Switch to
passing a const pointer into xen_register_ioreq() and then to
xen_do_ioreq_register().  We can also make the file-scope
MemoryListener variables const, since nothing changes them.

Signed-off-by: Peter Maydell 
Acked-by: Anthony PERARD 
Reviewed-by: Philippe Mathieu-Daudé 
Message-Id: <20230718101057.1110979-1-peter.mayd...@linaro.org>
Signed-off-by: Anthony PERARD 
---
 hw/arm/xen_arm.c| 4 ++--
 hw/i386/xen/xen-hvm.c   | 4 ++--
 hw/xen/xen-hvm-common.c | 8 
 include/hw/xen/xen-hvm-common.h | 2 +-
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/hw/arm/xen_arm.c b/hw/arm/xen_arm.c
index 044093fec7..1d3e6d481a 100644
--- a/hw/arm/xen_arm.c
+++ b/hw/arm/xen_arm.c
@@ -37,7 +37,7 @@
 #define TYPE_XEN_ARM  MACHINE_TYPE_NAME("xenpvh")
 OBJECT_DECLARE_SIMPLE_TYPE(XenArmState, XEN_ARM)
 
-static MemoryListener xen_memory_listener = {
+static const MemoryListener xen_memory_listener = {
 .region_add = xen_region_add,
 .region_del = xen_region_del,
 .log_start = NULL,
@@ -108,7 +108,7 @@ static void xen_arm_init(MachineState *machine)
 
 xam->state =  g_new0(XenIOState, 1);
 
-xen_register_ioreq(xam->state, machine->smp.cpus, xen_memory_listener);
+xen_register_ioreq(xam->state, machine->smp.cpus, _memory_listener);
 
 #ifdef CONFIG_TPM
 if (xam->cfg.tpm_base_addr) {
diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c
index 3da5a2b23f..f42621e674 100644
--- a/hw/i386/xen/xen-hvm.c
+++ b/hw/i386/xen/xen-hvm.c
@@ -458,7 +458,7 @@ static void xen_log_global_stop(MemoryListener *listener)
 xen_in_migration = false;
 }
 
-static MemoryListener xen_memory_listener = {
+static const MemoryListener xen_memory_listener = {
 .name = "xen-memory",
 .region_add = xen_region_add,
 .region_del = xen_region_del,
@@ -582,7 +582,7 @@ void xen_hvm_init_pc(PCMachineState *pcms, MemoryRegion 
**ram_memory)
 
 state = g_new0(XenIOState, 1);
 
-xen_register_ioreq(state, max_cpus, xen_memory_listener);
+xen_register_ioreq(state, max_cpus, _memory_listener);
 
 QLIST_INIT(_physmap);
 xen_read_physmap(state);
diff --git a/hw/xen/xen-hvm-common.c b/hw/xen/xen-hvm-common.c
index 886c3ee944..565dc39c8f 100644
--- a/hw/xen/xen-hvm-common.c
+++ b/hw/xen/xen-hvm-common.c
@@ -765,8 +765,8 @@ void xen_shutdown_fatal_error(const char *fmt, ...)
 }
 
 static void xen_do_ioreq_register(XenIOState *state,
-   unsigned int max_cpus,
-   MemoryListener xen_memory_listener)
+  unsigned int max_cpus,
+  const MemoryListener *xen_memory_listener)
 {
 int i, rc;
 
@@ -824,7 +824,7 @@ static void xen_do_ioreq_register(XenIOState *state,
 
 qemu_add_vm_change_state_handler(xen_hvm_change_state_handler, state);
 
-state->memory_listener = xen_memory_listener;
+state->memory_listener = *xen_memory_listener;
 memory_listener_register(>memory_listener, _space_memory);
 
 state->io_listener = xen_io_listener;
@@ -842,7 +842,7 @@ static void xen_do_ioreq_register(XenIOState *state,
 }
 
 void xen_register_ioreq(XenIOState *state, unsigned int max_cpus,
-MemoryListener xen_memory_listener)
+const MemoryListener *xen_memory_listener)
 {
 int rc;
 
diff --git a/include/hw/xen/xen-hvm-common.h b/include/hw/xen/xen-hvm-common.h
index f9559e2885..4e9904f1a6 100644
--- a/include/hw/xen/xen-hvm-common.h
+++ b/include/hw/xen/xen-hvm-common.h
@@ -93,7 +93,7 @@ void xen_device_unrealize(DeviceListener *listener, 
DeviceState *dev);
 
 void xen_hvm_change_state_handler(void *opaque, bool running, RunState rstate);
 void xen_register_ioreq(XenIOState *state, unsigned int max_cpus,
-MemoryListener xen_memory_listener);
+const MemoryListener *xen_memory_listener);
 
 void cpu_ioreq_pio(ioreq_t *req);
 #endif /* HW_XEN_HVM_COMMON_H */
-- 
Anthony PERARD




Re: [PATCH for-8.1] xen: Don't pass MemoryListener around by value

2023-07-18 Thread Anthony PERARD via
On Tue, Jul 18, 2023 at 11:10:57AM +0100, Peter Maydell wrote:
> Coverity points out (CID 1513106, 1513107) that MemoryListener is a
> 192 byte struct which we are passing around by value.  Switch to
> passing a const pointer into xen_register_ioreq() and then to
> xen_do_ioreq_register().  We can also make the file-scope
> MemoryListener variables const, since nothing changes them.
> 
> Signed-off-by: Peter Maydell 

Acked-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: QEMU assert (was: [xen-unstable test] 181558: regressions - FAIL)

2023-07-14 Thread Anthony PERARD via
On Tue, Jul 04, 2023 at 11:56:54AM +0200, Roger Pau Monné wrote:
> On Tue, Jul 04, 2023 at 10:37:38AM +0100, Anthony PERARD wrote:
> > On Wed, Jun 28, 2023 at 02:31:39PM +0200, Roger Pau Monné wrote:
> > > On Fri, Jun 23, 2023 at 03:04:21PM +, osstest service owner wrote:
> > > > flight 181558 xen-unstable real [real]
> > > > http://logs.test-lab.xenproject.org/osstest/logs/181558/
> > > > 
> > > > Regressions :-(
> > > > 
> > > > Tests which did not succeed and are blocking,
> > > > including tests which could not be run:
> > > >  test-amd64-amd64-xl-qcow2   21 guest-start/debian.repeat fail REGR. 
> > > > vs. 181545
> > > 
> > > The test failing here is hitting the assert in qemu_cond_signal() as
> > > called by worker_thread():
> > > 
> > > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> > > #1  0x7740b535 in __GI_abort () at abort.c:79
> > > #2  0x7740b40f in __assert_fail_base (fmt=0x7756cef0 
> > > "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5614abcb 
> > > "cond->initialized",
> > > file=0x5614ab88 
> > > "../qemu-xen-dir-remote/util/qemu-thread-posix.c", line=198, 
> > > function=) at assert.c:92
> > > #3  0x774191a2 in __GI___assert_fail (assertion=0x5614abcb 
> > > "cond->initialized", file=0x5614ab88 
> > > "../qemu-xen-dir-remote/util/qemu-thread-posix.c", line=198,
> > > function=0x5614ad80 <__PRETTY_FUNCTION__.17104> 
> > > "qemu_cond_signal") at assert.c:101
> > > #4  0x55f1c8d2 in qemu_cond_signal (cond=0x7fffb800db30) at 
> > > ../qemu-xen-dir-remote/util/qemu-thread-posix.c:198
> > > #5  0x55f36973 in worker_thread (opaque=0x7fffb800dab0) at 
> > > ../qemu-xen-dir-remote/util/thread-pool.c:129
> > > #6  0x55f1d1d2 in qemu_thread_start (args=0x7fffb8000b20) at 
> > > ../qemu-xen-dir-remote/util/qemu-thread-posix.c:505
> > > #7  0x775b0fa3 in start_thread (arg=) at 
> > > pthread_create.c:486
> > > #8  0x774e206f in clone () at 
> > > ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> > > 
> > > I've been trying to figure out how it can get in such state, but so
> > > far I had no luck.  I'm not a QEMU expert, so it's probably better if
> > > someone else could handle this.
> > > 
> > > In the failures I've seen, and the reproduction I have, the assert
> > > triggers in the QEMU dom0 instance responsible for locally-attaching
> > > the disk to dom0 in order to run pygrub.
> > > 
> > > This is also with QEMU 7.2, as testing with upstream QEMU is blocked
> > > ATM, so there's a chance it has already been fixed upstream.
> > > 
> > > Thanks, Roger.
> > 
> > So, I've run a test with the latest QEMU and I can still reproduce the
> > issue. The test also fails with QEMU 7.1.0.
> > 
> > But, QEMU 7.0 seems to pass the test, even with a start-stop loop of 200
> > iterations. So I'll try to find out if something changed in that range.
> > Or try to find out why would the thread pool be not initialised
> > properly.
> 
> Thanks for looking into this.
> 
> There are a set of changes from Paolo Bonzini:
> 
> 232e9255478f thread-pool: remove stopping variable
> 900fa208f506 thread-pool: replace semaphore with condition variable
> 3c7b72ddca9c thread-pool: optimize scheduling of completion bottom half
> 
> That landed in 7.1 that seem like possible candidates.

I think I've figured out the issue. I've sent a patch:
https://lore.kernel.org/qemu-devel/20230714152720.5077-1-anthony.per...@citrix.com/

I did run osstest with this patch, with 200 iterations of stop/start, and no
more issue of qemu for dom0 disappearing. The only issue I've found is osstest
not being able to ssh to the guest, which seems to be started. And qemu for
dom0 is still running.

While the report exists:
http://logs.test-lab.xenproject.org/osstest/logs/181785/

Cheers,

-- 
Anthony PERARD


