Re: [PATCH 1/3] hw/ufs: Initial commit for emulated Universal-Flash-Storage
On 30/05/2023 03.36, Jeuk Kim wrote: On 26/05/2023 15:37, Thomas Huth wrote: On 26/05/2023 07.05, Jeuk Kim wrote: Universal Flash Storage (UFS) is a high-performance mass storage device with a serial interface. It is primarily used as a high-performance data storage device for embedded applications. This commit contains code for UFS device to be recognized as a UFS PCI device. Patches to handle UFS logical unit and Transfer Request will follow. Signed-off-by: Jeuk Kim --- MAINTAINERS |6 + hw/Kconfig |1 + hw/meson.build |1 + hw/ufs/Kconfig |4 + hw/ufs/meson.build |1 + hw/ufs/trace-events | 33 + hw/ufs/trace.h |1 + hw/ufs/ufs.c | 305 ++ hw/ufs/ufs.h | 42 ++ include/block/ufs.h | 1251 ++ include/hw/pci/pci.h |1 + include/hw/pci/pci_ids.h |1 + meson.build |1 + Do you expect lots of additional files to be added to the hw/ufs/ folder? If the answer is no, then it's maybe a little bit overkill to introduce a separate folder for this. Wouldn't hw/block/ be a good fit for this as well? Or maybe we could introduce hw/flash/ or so and also move the contents of hw/nvme there? Yes. I plan to add more files to UFS for different functions (UICCMD, MQ, zoned, etc.) like nvme. So personally, I think it would be good to keep the hw/ufs/ directory. Ok, fair. Then start with hw/ufs/ first and we'll see how it goes. We can still move the files later if necessary. Thomas
Re: [PATCH] hw/ppc/mac_newworld: Check for the availability of pci-ohci before using it
On 26/05/2023 19.24, Mark Cave-Ayland wrote: On 26/05/2023 14:30, BALATON Zoltan wrote: On Fri, 26 May 2023, Thomas Huth wrote: pci-ohci might have been disabled in the QEMU binary (e.g. when "configure" has been run with "--without-default-devices"). Thus we should check for its availability before blindly using it. Signed-off-by: Thomas Huth --- hw/ppc/mac_newworld.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/hw/ppc/mac_newworld.c b/hw/ppc/mac_newworld.c index 535710314a..c7cca430e1 100644 --- a/hw/ppc/mac_newworld.c +++ b/hw/ppc/mac_newworld.c @@ -349,7 +349,8 @@ static void ppc_core99_init(MachineState *machine) sysbus_mmio_get_region(s, 3)); } - machine->usb |= defaults_enabled() && !machine->usb_disabled; + machine->usb |= defaults_enabled() && !machine->usb_disabled && + module_object_class_by_name("pci-ohci") != 0; Considering that PowerMacs have an OHCI controller built in, soldered to the motherboard, should this depend on it instead and not rely on pulling it in with PCI_DEVICES and --without-default-devices disabling it? Currently it's not quite emulating a real Mac but I think we should aim for going that way rather than to keep emulating random Mac hardware. Indeed that's correct: New World Macs should always have USB ports built-in. I guess the problem here is that this isn't being indicated correctly via Kconfig and/or the machine->usb_disabled logic? Yes, the other solution to the problem is to add a proper "select" statement to the Kconfig file. I can also send a patch for that instead. The other question is whether the OHCI device should always be instantiated, even if QEMU had been started with "--nodefaults"? ... otherwise you could not hot-plug USB devices to this machine during runtime... Thomas
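The "select" alternative Thomas mentions would tie the built-in OHCI controller to the machine in Kconfig rather than probing for it at runtime. A sketch of what that could look like — the exact option names in hw/ppc/Kconfig and hw/usb/Kconfig are assumptions here and would need checking against the tree:

```kconfig
config MAC_NEWWORLD
    bool
    # New World Macs have USB soldered on: always pull in the
    # PCI OHCI controller, even with --without-default-devices
    select USB_OHCI_PCI
```

With a select in place, the runtime module_object_class_by_name() check would become unnecessary for default builds, though it would still guard against deliberately stripped-down binaries.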
Re: [PATCH v4 2/2] qapi: add '@fdset' feature for BlockdevOptionsVirtioBlkVhostVdpa
Stefano Garzarella writes: > On Sat, May 27, 2023 at 07:56:13AM +0200, Markus Armbruster wrote: >>Stefano Garzarella writes: >> >>> The virtio-blk-vhost-vdpa driver in libblkio 1.3.0 supports the fd >>> passing through the new 'fd' property. >>> >>> Since now we are using qemu_open() on '@path' if the virtio-blk driver >>> supports the fd passing, let's announce it. >>> In this way, the management layer can pass the file descriptor of an >>> already opened vhost-vdpa character device. This is useful especially >>> when the device can only be accessed with certain privileges. >>> >>> Add the '@fdset' feature only when the virtio-blk-vhost-vdpa driver >>> in libblkio supports it. >>> >>> Suggested-by: Markus Armbruster >>> Signed-off-by: Stefano Garzarella [...] >>> diff --git a/qapi/block-core.json b/qapi/block-core.json >>> index 98d9116dae..1538d84ef4 100644 >>> --- a/qapi/block-core.json >>> +++ b/qapi/block-core.json >>> @@ -3955,10 +3955,16 @@ >>> # >>> # @path: path to the vhost-vdpa character device. >>> # >>> +# Features: >>> +# @fdset: Member @path supports the special "/dev/fdset/N" path (since 8.1) >> >>Slightly long line, break it like this: >> >> # @fdset: Member @path supports the special "/dev/fdset/N" path >> # (since 8.1) >> > > Sure, I'll fix it! > For the future, what is the maximum column? 70. Going over is okay if it improves legibility. However, when I reformatted the doc comments, I did not need to even once. [...]
Re: [PATCH v13 04/10] accel/tcg: add jit stats and time to TBStatistics
On 5/30/2023 12:07 PM, Richard Henderson wrote: > On 5/29/23 04:49, Fei Wu wrote: >> +/* >> + * The TCGProfile structure holds data for analysing the quality of >> + * the code generation. The data is split between stuff that is valid >> + * for the lifetime of a single translation and things that are valid >> + * for the lifetime of the translator. As the former is reset for each >> + * new translation so it should be copied elsewhere if you want to >> + * keep it. >> + * >> + * The structure is safe to access within the context of translation >> + * but accessing the data from elsewhere should be done with safe >> + * work. >> + */ >> +typedef struct TCGProfile { >> + >> + struct { >> + int nb_guest_insns; >> + int nb_spills; >> + int nb_ops_pre_opt; >> + >> + int del_op_count; >> + int temp_count; >> + } translation; >> + >> + int64_t cpu_exec_time; >> + int64_t op_count; /* total insn count */ >> + int64_t code_in_len; >> + int64_t code_out_len; >> + int64_t search_out_len; >> + >> + /* Timestamps during translation */ >> + uint64_t gen_start_time; >> + uint64_t gen_ir_done_time; >> + uint64_t gen_opt_done_time; >> + uint64_t gen_la_done_time; >> + uint64_t gen_code_done_time; >> + >> + /* Lifetime count of TCGOps per TCGContext */ >> + uint64_t table_op_count[NB_OPS]; >> +} TCGProfile; >> + > > Why have you added this back? > > The whole point of removing CONFIG_PROFILE to begin was to have all new > code. Not to remove it then reintroduce it unchanged. > > In tcg_gen_code, you have access to the TranslationBlock as s->gen_tb. > There is zero point to accumulating into TCGProfile, when you could be > accumulating into TCGStatistics directly. > TCGProfile contains global wide (per TCGContext) stats, but TBStatistics is TB specific, some info in TCGProfile such as table_op_count is not able to be summed up from TBStatistics. The per-translation stats in TCGProfile may be removed indeed. Thanks, Fei. > > r~
Re: [PATCH] tcg: Add tlb_index_and_entry() function
On 5/29/23 17:05, BALATON Zoltan wrote: The tlb index and entry are often retrieved together and tlb_entry() already calls tlb_index() so it could easily return it. Add a tlb_index_and_entry() function that does that to simplify callers and maybe avoid some duplicate calculations. Signed-off-by: BALATON Zoltan The compiler does a good job removing duplicate calculations already. What effect do you see with this? r~
Re: [PATCH v2 4/4] migration/calc-dirty-rate: tool to predict migration time
On 4/27/2023 20:43, Andrei Gudkov via wrote: > Signed-off-by: Andrei Gudkov > --- > MAINTAINERS | 1 + > scripts/predict_migration.py | 283 +++ > 2 files changed, 284 insertions(+) > create mode 100644 scripts/predict_migration.py > > diff --git a/MAINTAINERS b/MAINTAINERS > index fc225e66df..0c578446cf 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -3167,6 +3167,7 @@ F: docs/devel/migration.rst > F: qapi/migration.json > F: tests/migration/ > F: util/userfaultfd.c > +F: scripts/predict_migration.py > > D-Bus > M: Marc-André Lureau > diff --git a/scripts/predict_migration.py b/scripts/predict_migration.py > new file mode 100644 > index 00..c92a97585f > --- /dev/null > +++ b/scripts/predict_migration.py > @@ -0,0 +1,283 @@ > +#!/usr/bin/env python3 > +# > +# Predicts time required to migrate VM under given max downtime constraint. > +# > +# Copyright (c) 2023 HUAWEI TECHNOLOGIES CO.,LTD. > +# > +# Authors: > +# Andrei Gudkov > +# > +# This work is licensed under the terms of the GNU GPL, version 2 or > +# later. See the COPYING file in the top-level directory. > + > + > +# Usage: > +# > +# Step 1. Collect dirty page statistics from live VM: > +# $ scripts/predict_migration.py calc-dirty-rate > >dirty.json > +# <...takes 1 minute by default...> > +# > +# Step 2. Run predictor against collected data: > +# $ scripts/predict_migration.py predict < dirty.json > +# Downtime> |125ms |250ms |500ms | 1000ms | 5000ms | > unlim | > +# > - > +# 100 Mbps |- |- |- |- |- | > 16m45s | > +#1 Gbps |- |- |- |- |- | > 1m39s | > +#2 Gbps |- |- |- |- |1m55s | > 50s | > +# 2.5 Gbps |- |- |- |- |1m12s | > 40s | > +#5 Gbps |- |- |- | 29s | 25s | > 20s | > +# 10 Gbps | 13s | 13s | 12s | 12s | 12s | > 10s | > +# 25 Gbps | 5s | 5s | 5s | 5s | 4s | > 4s | > +# 40 Gbps | 3s | 3s | 3s | 3s | 3s | > 3s | > +# > +# The latter prints table that lists estimated time it will take to migrate > VM. > +# This time depends on the network bandwidth and max allowed downtime. 
> +# Dash indicates that migration does not converge. > +# Prediction takes care only about migrating RAM and only in pre-copy mode. > +# Other features, such as compression or local disk migration, are not > supported > + > + > +import sys > +import os > +import math > +import json > +from dataclasses import dataclass > +import asyncio > +import argparse > + > +sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'python')) > +from qemu.qmp import QMPClient > + > +async def calc_dirty_rate(host, port, calc_time, sample_pages): > +client = QMPClient() > +try: > +await client.connect((host, port)) > +args = { > +'calc-time': calc_time, > +'sample-pages': sample_pages > +} > +await client.execute('calc-dirty-rate', args) > +await asyncio.sleep(calc_time) > +while True: > +data = await client.execute('query-dirty-rate') > +if data['status'] == 'measuring': > +await asyncio.sleep(0.5) > +elif data['status'] == 'measured': > +return data > +else: > +raise ValueError(data['status']) > +finally: > +await client.disconnect() > + > + > +class MemoryModel: > +""" > +Models RAM state during pre-copy migration using calc-dirty-rate results. > +Its primary function is to estimate how many pages will be dirtied > +after given time starting from "clean" state. > +This function is non-linear and saturates at some point. 
> +""" > + > +@dataclass > +class Point: > +period_millis:float > +dirty_pages:float > + > +def __init__(self, data): > +""" > +:param data: dictionary returned by calc-dirty-rate > +""" > +self.__points = self.__make_points(data) > +self.__page_size = data['page-size'] > +self.__num_total_pages = data['n-total-pages'] > +self.__num_zero_pages = data['n-zero-pages'] / \ > +(data['n-sampled-pages'] / data['n-total-pages']) > + > +def __make_points(self, data): > +points = list() > + > +# Add observed points > +sample_ratio = data['n-sampled-pages'] / data['n-total-pages'] > +for millis,dirty_pages in zip(data['periods'], > data['n-dirty-pages']): > +millis = float(millis) > +dirty_pages = dirty_pages / sample_ratio > +points.append(MemoryModel.Point(millis, dirty_pages)) > + > +
Re: [PATCH v2 2/4] migration/calc-dirty-rate: detailed stats in sampling mode
On 4/27/2023 20:42, Andrei Gudkov via wrote: > Collect number of dirty pages for progresseively increasing time > periods starting with 125ms up to number of seconds specified with > calc-dirty-rate. Report through qmp and hmp: 1) vector of dirty page > measurements, 2) page size, 3) total number of VM pages, 4) number > of sampled pages. > > Signed-off-by: Andrei Gudkov > --- > migration/dirtyrate.c | 165 +- > migration/dirtyrate.h | 25 ++- > qapi/migration.json | 24 +- > 3 files changed, 160 insertions(+), 54 deletions(-) > > diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c > index acba3213a3..4491bbe91a 100644 > --- a/migration/dirtyrate.c > +++ b/migration/dirtyrate.c > @@ -224,6 +224,7 @@ static struct DirtyRateInfo *query_dirty_rate_info(void) > info->calc_time = DirtyStat.calc_time; > info->sample_pages = DirtyStat.sample_pages; > info->mode = dirtyrate_mode; > +info->page_size = TARGET_PAGE_SIZE; > > if (qatomic_read() == DIRTY_RATE_STATUS_MEASURED) { > info->has_dirty_rate = true; > @@ -245,6 +246,29 @@ static struct DirtyRateInfo *query_dirty_rate_info(void) > info->vcpu_dirty_rate = head; > } > > +if (dirtyrate_mode == DIRTY_RATE_MEASURE_MODE_PAGE_SAMPLING) { > +int64List *periods_head = NULL; > +int64List **periods_tail = _head; > +int64List *n_dirty_pages_head = NULL; > +int64List **n_dirty_pages_tail = _dirty_pages_head; > + > +info->n_total_pages = DirtyStat.page_sampling.n_total_pages; > +info->has_n_total_pages = true; > + > +info->n_sampled_pages = DirtyStat.page_sampling.n_sampled_pages; > +info->has_n_sampled_pages = true; > + > +for (i = 0; i < DirtyStat.page_sampling.n_readings; i++) { > +DirtyReading *dr = _sampling.readings[i]; > +QAPI_LIST_APPEND(periods_tail, dr->period); > +QAPI_LIST_APPEND(n_dirty_pages_tail, dr->n_dirty_pages); > +} > +info->n_dirty_pages = n_dirty_pages_head; > +info->periods = periods_head; > +info->has_n_dirty_pages = true; > +info->has_periods = true; > +} > + > if (dirtyrate_mode == 
DIRTY_RATE_MEASURE_MODE_DIRTY_BITMAP) { > info->sample_pages = 0; > } > @@ -265,9 +289,11 @@ static void init_dirtyrate_stat(int64_t start_time, > > switch (config.mode) { > case DIRTY_RATE_MEASURE_MODE_PAGE_SAMPLING: > -DirtyStat.page_sampling.total_dirty_samples = 0; > -DirtyStat.page_sampling.total_sample_count = 0; > -DirtyStat.page_sampling.total_block_mem_MB = 0; > +DirtyStat.page_sampling.n_total_pages = 0; > +DirtyStat.page_sampling.n_sampled_pages = 0; > +DirtyStat.page_sampling.n_readings = 0; > +DirtyStat.page_sampling.readings = > g_try_malloc0_n(MAX_DIRTY_READINGS, > + > sizeof(DirtyReading)); > break; > case DIRTY_RATE_MEASURE_MODE_DIRTY_RING: > DirtyStat.dirty_ring.nvcpu = -1; > @@ -285,28 +311,10 @@ static void cleanup_dirtyrate_stat(struct > DirtyRateConfig config) > free(DirtyStat.dirty_ring.rates); > DirtyStat.dirty_ring.rates = NULL; > } > -} > - > -static void update_dirtyrate_stat(struct RamblockDirtyInfo *info) > -{ > -DirtyStat.page_sampling.total_dirty_samples += info->sample_dirty_count; > -DirtyStat.page_sampling.total_sample_count += info->sample_pages_count; > -/* size of total pages in MB */ > -DirtyStat.page_sampling.total_block_mem_MB += (info->ramblock_pages * > - TARGET_PAGE_SIZE) >> 20; > -} > - > -static void update_dirtyrate(uint64_t msec) > -{ > -uint64_t dirtyrate; > -uint64_t total_dirty_samples = > DirtyStat.page_sampling.total_dirty_samples; > -uint64_t total_sample_count = DirtyStat.page_sampling.total_sample_count; > -uint64_t total_block_mem_MB = DirtyStat.page_sampling.total_block_mem_MB; > - > -dirtyrate = total_dirty_samples * total_block_mem_MB * > -1000 / (total_sample_count * msec); > - > -DirtyStat.dirty_rate = dirtyrate; > +if (DirtyStat.page_sampling.readings) { > +free(DirtyStat.page_sampling.readings); > +DirtyStat.page_sampling.readings = NULL; > +} > } > > /* > @@ -377,12 +385,14 @@ static bool save_ramblock_hash(struct RamblockDirtyInfo > *info) > return false; > } > > -rand = g_rand_new(); > +rand = 
g_rand_new(); > +DirtyStat.page_sampling.n_total_pages += info->ramblock_pages; > for (i = 0; i < sample_pages_count; i++) { > info->sample_page_vfn[i] = g_rand_int_range(rand, 0, > info->ramblock_pages - > 1); >
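The sampling mode in the patch above reports raw counts and leaves the extrapolation to the consumer: a dirty count measured on a random sample of pages is scaled up by the sampling ratio. A sketch of that scale-up — the argument names mirror the QMP fields discussed in the patch (n-dirty-pages, n-sampled-pages, n-total-pages) but are used here only for illustration:

```python
def estimate_dirty_pages(n_dirty_sampled, n_sampled_pages, n_total_pages):
    """Extrapolate a dirty-page count measured on a random sample of
    pages to the whole of guest RAM."""
    if n_sampled_pages == 0:
        raise ValueError("no pages sampled")
    sample_ratio = n_sampled_pages / n_total_pages
    return n_dirty_sampled / sample_ratio

def dirty_rate_mib_per_s(n_dirty_pages, page_size, period_millis):
    """Dirty rate in MiB/s over one measurement period."""
    return n_dirty_pages * page_size * 1000 / period_millis / (1 << 20)
```

This is the same extrapolation the predict_migration.py script in patch 4/4 performs when it divides observed counts by n-sampled-pages / n-total-pages.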
Re: [PATCH v10 0/7] igb: packet-split descriptors support
On 2023/05/29 23:01, Tomasz Dzieciol wrote: Purposes of this series of patches: * introduce packet-split RX descriptors support. This feature is used by Linux VF driver for MTU values from 2048. * refactor RX descriptor handling for introduction of packet-split RX descriptors support * fix descriptors flags handling Tomasz Dzieciol (7): igb: remove TCP ACK detection igb: rename E1000E_RingInfo_st igb: RX descriptors guest writting refactoring igb: RX payload guest writting refactoring igb: add IPv6 extended headers traffic detection igb: packet-split descriptors support e1000e: rename e1000e_ba_state and e1000e_write_hdr_to_rx_buffers hw/net/e1000e_core.c | 78 +++-- hw/net/igb_core.c| 730 --- hw/net/igb_regs.h| 20 +- hw/net/trace-events | 6 +- tests/qtest/libqos/igb.c | 5 + 5 files changed, 592 insertions(+), 247 deletions(-) Thanks for keeping working on this. For the entire series: Reviewed-by: Akihiko Odaki Tested-by: Akihiko Odaki
[PATCH 0/2] net: Update MemReentrancyGuard for NIC
Recently MemReentrancyGuard was added to DeviceState to record that the device is engaging in I/O. The network device backend needs to update it when delivering a packet to a device. This implementation follows what bottom half does, but it does not add a tracepoint for the case that the network device backend started delivering a packet to a device which is already engaging in I/O. This is because such reentrancy frequently happens for qemu_flush_queued_packets() and is insignificant. This series consists of two patches. The first patch makes a bulk change to add a new parameter to qemu_new_nic() and does not contain behavioral changes. The second patch actually implements MemReentrancyGuard update. Akihiko Odaki (2): net: Provide MemReentrancyGuard * to qemu_new_nic() net: Update MemReentrancyGuard for NIC include/net/net.h | 2 ++ hw/net/allwinner-sun8i-emac.c | 3 ++- hw/net/allwinner_emac.c | 3 ++- hw/net/cadence_gem.c | 3 ++- hw/net/dp8393x.c | 3 ++- hw/net/e1000.c| 3 ++- hw/net/e1000e.c | 2 +- hw/net/eepro100.c | 4 +++- hw/net/etraxfs_eth.c | 3 ++- hw/net/fsl_etsec/etsec.c | 3 ++- hw/net/ftgmac100.c| 3 ++- hw/net/i82596.c | 2 +- hw/net/igb.c | 2 +- hw/net/imx_fec.c | 2 +- hw/net/lan9118.c | 3 ++- hw/net/mcf_fec.c | 3 ++- hw/net/mipsnet.c | 3 ++- hw/net/msf2-emac.c| 3 ++- hw/net/mv88w8618_eth.c| 3 ++- hw/net/ne2000-isa.c | 3 ++- hw/net/ne2000-pci.c | 3 ++- hw/net/npcm7xx_emc.c | 3 ++- hw/net/opencores_eth.c| 3 ++- hw/net/pcnet.c| 3 ++- hw/net/rocker/rocker_fp.c | 4 ++-- hw/net/rtl8139.c | 3 ++- hw/net/smc91c111.c| 3 ++- hw/net/spapr_llan.c | 3 ++- hw/net/stellaris_enet.c | 3 ++- hw/net/sungem.c | 2 +- hw/net/sunhme.c | 3 ++- hw/net/tulip.c| 3 ++- hw/net/virtio-net.c | 6 -- hw/net/vmxnet3.c | 2 +- hw/net/xen_nic.c | 4 ++-- hw/net/xgmac.c| 3 ++- hw/net/xilinx_axienet.c | 3 ++- hw/net/xilinx_ethlite.c | 3 ++- hw/usb/dev-network.c | 3 ++- net/net.c | 15 +++ 40 files changed, 90 insertions(+), 41 deletions(-) -- 2.40.1
[PATCH 2/2] net: Update MemReentrancyGuard for NIC
Recently MemReentrancyGuard was added to DeviceState to record that the device is engaging in I/O. The network device backend needs to update it when delivering a packet to a device. This implementation follows what bottom half does, but it does not add a tracepoint for the case that the network device backend started delivering a packet to a device which is already engaging in I/O. This is because such reentrancy frequently happens for qemu_flush_queued_packets() and is insignificant. Reported-by: Alexander Bulekov Signed-off-by: Akihiko Odaki --- include/net/net.h | 1 + net/net.c | 14 ++ 2 files changed, 15 insertions(+) diff --git a/include/net/net.h b/include/net/net.h index a7d8deaccb..685ec58318 100644 --- a/include/net/net.h +++ b/include/net/net.h @@ -124,6 +124,7 @@ typedef QTAILQ_HEAD(NetClientStateList, NetClientState) NetClientStateList; typedef struct NICState { NetClientState *ncs; NICConf *conf; +MemReentrancyGuard *reentrancy_guard; void *opaque; bool peer_deleted; } NICState; diff --git a/net/net.c b/net/net.c index 982df2479f..3523cceafc 100644 --- a/net/net.c +++ b/net/net.c @@ -332,6 +332,7 @@ NICState *qemu_new_nic(NetClientInfo *info, nic = g_malloc0(info->size + sizeof(NetClientState) * queues); nic->ncs = (void *)nic + info->size; nic->conf = conf; +nic->reentrancy_guard = reentrancy_guard, nic->opaque = opaque; for (i = 0; i < queues; i++) { @@ -805,6 +806,7 @@ static ssize_t qemu_deliver_packet_iov(NetClientState *sender, int iovcnt, void *opaque) { +MemReentrancyGuard *owned_reentrancy_guard; NetClientState *nc = opaque; int ret; @@ -817,12 +819,24 @@ static ssize_t qemu_deliver_packet_iov(NetClientState *sender, return 0; } +if (nc->info->type != NET_CLIENT_DRIVER_NIC || +qemu_get_nic(nc)->reentrancy_guard->engaged_in_io) { +owned_reentrancy_guard = NULL; +} else { +owned_reentrancy_guard = qemu_get_nic(nc)->reentrancy_guard; +owned_reentrancy_guard->engaged_in_io = true; +} + if (nc->info->receive_iov && !(flags & 
QEMU_NET_PACKET_FLAG_RAW)) { ret = nc->info->receive_iov(nc, iov, iovcnt); } else { ret = nc_sendv_compat(nc, iov, iovcnt, flags); } +if (owned_reentrancy_guard) { +owned_reentrancy_guard->engaged_in_io = false; +} + if (ret == 0) { nc->receive_disabled = 1; } -- 2.40.1
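The owned_reentrancy_guard dance in qemu_deliver_packet_iov() above follows the same pattern bottom halves use: take the device's guard only if it is not already engaged, and release it only if this call took it, so reentrant deliveries (the qemu_flush_queued_packets() case mentioned in the cover letter) are left alone. A Python toy of that invariant — not QEMU's API, just the control flow:

```python
class MemReentrancyGuard:
    def __init__(self):
        self.engaged_in_io = False

def deliver(guard, receive):
    """Run one delivery, marking the device as engaged in I/O for the
    duration unless it already was (a reentrant delivery)."""
    owned = None
    if not guard.engaged_in_io:
        owned = guard
        owned.engaged_in_io = True
    try:
        return receive()
    finally:
        if owned:
            owned.engaged_in_io = False

guard = MemReentrancyGuard()
seen = []

def outer():
    seen.append(guard.engaged_in_io)  # True: this delivery owns the guard
    deliver(guard, inner)             # reentrant delivery from within I/O
    return 1

def inner():
    seen.append(guard.engaged_in_io)  # still True: guard not re-taken,
    return 2                          # and not cleared on the way out

deliver(guard, outer)
```

The key point the C code shares with this sketch: the inner delivery must not clear the flag, or the outer delivery would lose its reentrancy protection mid-flight.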
[PATCH 1/2] net: Provide MemReentrancyGuard * to qemu_new_nic()
Recently MemReentrancyGuard was added to DeviceState to record that the device is engaging in I/O. The network device backend needs to update it when delivering a packet to a device. In preparation for such a change, add MemReentrancyGuard * as a parameter of qemu_new_nic(). Signed-off-by: Akihiko Odaki --- include/net/net.h | 1 + hw/net/allwinner-sun8i-emac.c | 3 ++- hw/net/allwinner_emac.c | 3 ++- hw/net/cadence_gem.c | 3 ++- hw/net/dp8393x.c | 3 ++- hw/net/e1000.c| 3 ++- hw/net/e1000e.c | 2 +- hw/net/eepro100.c | 4 +++- hw/net/etraxfs_eth.c | 3 ++- hw/net/fsl_etsec/etsec.c | 3 ++- hw/net/ftgmac100.c| 3 ++- hw/net/i82596.c | 2 +- hw/net/igb.c | 2 +- hw/net/imx_fec.c | 2 +- hw/net/lan9118.c | 3 ++- hw/net/mcf_fec.c | 3 ++- hw/net/mipsnet.c | 3 ++- hw/net/msf2-emac.c| 3 ++- hw/net/mv88w8618_eth.c| 3 ++- hw/net/ne2000-isa.c | 3 ++- hw/net/ne2000-pci.c | 3 ++- hw/net/npcm7xx_emc.c | 3 ++- hw/net/opencores_eth.c| 3 ++- hw/net/pcnet.c| 3 ++- hw/net/rocker/rocker_fp.c | 4 ++-- hw/net/rtl8139.c | 3 ++- hw/net/smc91c111.c| 3 ++- hw/net/spapr_llan.c | 3 ++- hw/net/stellaris_enet.c | 3 ++- hw/net/sungem.c | 2 +- hw/net/sunhme.c | 3 ++- hw/net/tulip.c| 3 ++- hw/net/virtio-net.c | 6 -- hw/net/vmxnet3.c | 2 +- hw/net/xen_nic.c | 4 ++-- hw/net/xgmac.c| 3 ++- hw/net/xilinx_axienet.c | 3 ++- hw/net/xilinx_ethlite.c | 3 ++- hw/usb/dev-network.c | 3 ++- net/net.c | 1 + 40 files changed, 75 insertions(+), 41 deletions(-) diff --git a/include/net/net.h b/include/net/net.h index 1448d00afb..a7d8deaccb 100644 --- a/include/net/net.h +++ b/include/net/net.h @@ -157,6 +157,7 @@ NICState *qemu_new_nic(NetClientInfo *info, NICConf *conf, const char *model, const char *name, + MemReentrancyGuard *reentrancy_guard, void *opaque); void qemu_del_nic(NICState *nic); NetClientState *qemu_get_subqueue(NICState *nic, int queue_index); diff --git a/hw/net/allwinner-sun8i-emac.c b/hw/net/allwinner-sun8i-emac.c index fac4405f45..cc350d40e5 100644 --- a/hw/net/allwinner-sun8i-emac.c +++ 
b/hw/net/allwinner-sun8i-emac.c @@ -824,7 +824,8 @@ static void allwinner_sun8i_emac_realize(DeviceState *dev, Error **errp) qemu_macaddr_default_if_unset(>conf.macaddr); s->nic = qemu_new_nic(_allwinner_sun8i_emac_info, >conf, - object_get_typename(OBJECT(dev)), dev->id, s); + object_get_typename(OBJECT(dev)), dev->id, + >mem_reentrancy_guard, s); qemu_format_nic_info_str(qemu_get_queue(s->nic), s->conf.macaddr.a); } diff --git a/hw/net/allwinner_emac.c b/hw/net/allwinner_emac.c index 372e5b66da..e10965de14 100644 --- a/hw/net/allwinner_emac.c +++ b/hw/net/allwinner_emac.c @@ -453,7 +453,8 @@ static void aw_emac_realize(DeviceState *dev, Error **errp) qemu_macaddr_default_if_unset(>conf.macaddr); s->nic = qemu_new_nic(_aw_emac_info, >conf, - object_get_typename(OBJECT(dev)), dev->id, s); + object_get_typename(OBJECT(dev)), dev->id, + >mem_reentrancy_guard, s); qemu_format_nic_info_str(qemu_get_queue(s->nic), s->conf.macaddr.a); fifo8_create(>rx_fifo, RX_FIFO_SIZE); diff --git a/hw/net/cadence_gem.c b/hw/net/cadence_gem.c index 42ea2411a2..a7bce1c120 100644 --- a/hw/net/cadence_gem.c +++ b/hw/net/cadence_gem.c @@ -1633,7 +1633,8 @@ static void gem_realize(DeviceState *dev, Error **errp) qemu_macaddr_default_if_unset(>conf.macaddr); s->nic = qemu_new_nic(_gem_info, >conf, - object_get_typename(OBJECT(dev)), dev->id, s); + object_get_typename(OBJECT(dev)), dev->id, + >mem_reentrancy_guard, s); if (s->jumbo_max_len > MAX_FRAME_SIZE) { error_setg(errp, "jumbo-max-len is greater than %d", diff --git a/hw/net/dp8393x.c b/hw/net/dp8393x.c index 45b954e46c..abfcc6f69f 100644 --- a/hw/net/dp8393x.c +++ b/hw/net/dp8393x.c @@ -943,7 +943,8 @@ static void dp8393x_realize(DeviceState *dev, Error **errp) "dp8393x-regs", SONIC_REG_COUNT << s->it_shift); s->nic = qemu_new_nic(_dp83932_info, >conf, - object_get_typename(OBJECT(dev)), dev->id, s); + object_get_typename(OBJECT(dev)), dev->id, + >mem_reentrancy_guard, s); qemu_format_nic_info_str(qemu_get_queue(s->nic),
Re: [PATCH] igb: Add Function Level Reset to PF and VF
On 2023/05/30 0:07, Cédric Le Goater wrote: On 5/29/23 09:45, Akihiko Odaki wrote: On 2023/05/29 16:01, Cédric Le Goater wrote: On 5/29/23 04:45, Akihiko Odaki wrote: On 2023/05/28 19:50, Sriram Yagnaraman wrote: -Original Message- From: Cédric Le Goater Sent: Friday, 26 May 2023 19:31 To: qemu-devel@nongnu.org Cc: Akihiko Odaki ; Sriram Yagnaraman ; Jason Wang ; Cédric Le Goater Subject: [PATCH] igb: Add Function Level Reset to PF and VF The Intel 82576EB GbE Controller say that the Physical and Virtual Functions support Function Level Reset. Add the capability to each device model. Cc: Akihiko Odaki Fixes: 3a977deebe6b ("Intrdocue igb device emulation") Signed-off-by: Cédric Le Goater --- hw/net/igb.c | 3 +++ hw/net/igbvf.c | 3 +++ 2 files changed, 6 insertions(+) diff --git a/hw/net/igb.c b/hw/net/igb.c index 1c989d767725..08e389338dca 100644 --- a/hw/net/igb.c +++ b/hw/net/igb.c @@ -101,6 +101,7 @@ static void igb_write_config(PCIDevice *dev, uint32_t addr, trace_igb_write_config(addr, val, len); pci_default_write_config(dev, addr, val, len); + pcie_cap_flr_write_config(dev, addr, val, len); if (range_covers_byte(addr, len, PCI_COMMAND) && (dev->config[PCI_COMMAND] & PCI_COMMAND_MASTER)) { @@ -427,6 +428,8 @@ static void igb_pci_realize(PCIDevice *pci_dev, Error **errp) } /* PCIe extended capabilities (in order) */ + pcie_cap_flr_init(pci_dev); + if (pcie_aer_init(pci_dev, 1, 0x100, 0x40, errp) < 0) { hw_error("Failed to initialize AER capability"); } diff --git a/hw/net/igbvf.c b/hw/net/igbvf.c index 284ea611848b..0a58dad06802 100644 --- a/hw/net/igbvf.c +++ b/hw/net/igbvf.c @@ -204,6 +204,7 @@ static void igbvf_write_config(PCIDevice *dev, uint32_t addr, uint32_t val, { trace_igbvf_write_config(addr, val, len); pci_default_write_config(dev, addr, val, len); + pcie_cap_flr_write_config(dev, addr, val, len); } static uint64_t igbvf_mmio_read(void *opaque, hwaddr addr, unsigned size) @@ -266,6 +267,8 @@ static void igbvf_pci_realize(PCIDevice *dev, Error **errp) 
hw_error("Failed to initialize PCIe capability"); } + pcie_cap_flr_init(dev); Sorry for my naive question, and perhaps not related to your patch, IGBVF device class doesn't seem to have any reset functions registered via igbvf_class_init(). So, I am guessing an FLR will not trigger igb_vf_reset(), which is probably what we want. It does through the VTCTRL registers. You're right. Advertising FLR capability without implementing it can confuse the guest though such probability is quite a low in practice. The reset should be implemented first. I was looking at an issue from a VFIO perspective which does a FLR on a device when pass through. Software and FLR are equivalent for a VF. They should be equivalent according to the datasheet, but unfortunately current igbvf implementation does nothing when reset. What Sriram proposes is to add code to actually write VTCTRL when FLR occurred and make FLR and software reset equivalent. And I think that should be done before this change; you should advertise FLR capability after the reset is actually implemented. AFAICT, the VFs are reset correctly by the OS when created or probed and by QEMU when they are passthrough in a nested guest OS (with this patch). igb_vf_reset() is clearly called in QEMU, see routine e1000_reset_hw_vf() in Linux. I don't think this patch makes difference for e1000_reset_hw_vf() as it does not rely on FLR. I don't think a reset op is necessary because VFs are software constructs but I don't mind really. If so, then, I wouldn't mimic what the OS does by writing the RST bit in the CTRL register of the VF, I would simply install igb_vf_reset() as a reset handler. Thinking about the reason why VFIO performs FLR, probably VFIO expects the FLR clears all of states the kernel set to prevent the VF from leaking kernel addresses or addresses of other user space which the VF was assigned to in the past, for example. 
Implementing the reset operation is not necessary to make it function but to make it secure, particularly we promise the guest that we clear the VF state by advertising FLR. Regards, Akihiko Odaki Thanks, C. Regards, Akihiko Odaki I am not sure a VF needs more really, since it is all controlled by the PF. > C. + if (pcie_aer_init(dev, 1, 0x100, 0x40, errp) < 0) { hw_error("Failed to initialize AER capability"); } -- 2.40.1
[PATCH] tcg: Add tlb_index_and_entry() function
The tlb index and entry are often retrieved together and tlb_entry() already calls tlb_index() so it could easily return it. Add a tlb_index_and_entry() function that does that to simplify callers and maybe avoid some duplicate calculations.

Signed-off-by: BALATON Zoltan
---
 accel/tcg/cputlb.c      | 42 +
 include/exec/cpu_ldst.h | 17 ++---
 2 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 90c72c9940..616c68fd09 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1121,6 +1121,7 @@ void tlb_set_page_full(CPUState *cpu, int mmu_idx,
     CPUTLB *tlb = env_tlb(env);
     CPUTLBDesc *desc = &tlb->d[mmu_idx];
     MemoryRegionSection *section;
+    uintptr_t idx;
     unsigned int index;
     target_ulong address;
     target_ulong write_address;
@@ -1203,8 +1204,8 @@ void tlb_set_page_full(CPUState *cpu, int mmu_idx,
     wp_flags = cpu_watchpoint_address_matches(cpu, vaddr_page,
                                               TARGET_PAGE_SIZE);
 
-    index = tlb_index(env, mmu_idx, vaddr_page);
-    te = tlb_entry(env, mmu_idx, vaddr_page);
+    te = tlb_index_and_entry(env, mmu_idx, vaddr_page, &idx);
+    index = idx;
 
     /*
      * Hold the TLB lock for the rest of the function. We could acquire/release
@@ -1504,12 +1505,13 @@ static int probe_access_internal(CPUArchState *env, target_ulong addr,
                                  void **phost, CPUTLBEntryFull **pfull,
                                  uintptr_t retaddr)
 {
-    uintptr_t index = tlb_index(env, mmu_idx, addr);
-    CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
-    target_ulong tlb_addr = tlb_read_idx(entry, access_type);
-    target_ulong page_addr = addr & TARGET_PAGE_MASK;
+    uintptr_t index;
+    CPUTLBEntry *entry;
+    target_ulong tlb_addr, page_addr = addr & TARGET_PAGE_MASK;
     int flags = TLB_FLAGS_MASK;
 
+    entry = tlb_index_and_entry(env, mmu_idx, addr, &index);
+    tlb_addr = tlb_read_idx(entry, access_type);
     if (!tlb_hit_page(tlb_addr, page_addr)) {
         if (!victim_tlb_hit(env, mmu_idx, index, access_type, page_addr)) {
             CPUState *cs = env_cpu(env);
@@ -1523,8 +1525,7 @@ static int probe_access_internal(CPUArchState *env, target_ulong addr,
         }
 
         /* TLB resize via tlb_fill may have moved the entry. */
-        index = tlb_index(env, mmu_idx, addr);
-        entry = tlb_entry(env, mmu_idx, addr);
+        entry = tlb_index_and_entry(env, mmu_idx, addr, &index);
 
         /*
          * With PAGE_WRITE_INV, we set TLB_INVALID_MASK immediately,
@@ -1691,10 +1692,12 @@ bool tlb_plugin_lookup(CPUState *cpu, target_ulong addr, int mmu_idx,
                        bool is_store, struct qemu_plugin_hwaddr *data)
 {
     CPUArchState *env = cpu->env_ptr;
-    CPUTLBEntry *tlbe = tlb_entry(env, mmu_idx, addr);
-    uintptr_t index = tlb_index(env, mmu_idx, addr);
-    target_ulong tlb_addr = is_store ? tlb_addr_write(tlbe) : tlbe->addr_read;
+    CPUTLBEntry *tlbe;
+    uintptr_t index;
+    target_ulong tlb_addr;
 
+    tlbe = tlb_index_and_entry(env, mmu_idx, addr, &index);
+    tlb_addr = is_store ? tlb_addr_write(tlbe) : tlbe->addr_read;
     if (likely(tlb_hit(tlb_addr, addr))) {
         /* We must have an iotlb entry for MMIO */
         if (tlb_addr & TLB_MMIO) {
@@ -1756,19 +1759,20 @@ static bool mmu_lookup1(CPUArchState *env, MMULookupPageData *data,
                         int mmu_idx, MMUAccessType access_type, uintptr_t ra)
 {
     target_ulong addr = data->addr;
-    uintptr_t index = tlb_index(env, mmu_idx, addr);
-    CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
-    target_ulong tlb_addr = tlb_read_idx(entry, access_type);
+    uintptr_t index;
+    CPUTLBEntry *entry;
+    target_ulong tlb_addr;
     bool maybe_resized = false;
 
+    entry = tlb_index_and_entry(env, mmu_idx, addr, &index);
+    tlb_addr = tlb_read_idx(entry, access_type);
     /* If the TLB entry is for a different page, reload and try again. */
     if (!tlb_hit(tlb_addr, addr)) {
         if (!victim_tlb_hit(env, mmu_idx, index, access_type,
                             addr & TARGET_PAGE_MASK)) {
             tlb_fill(env_cpu(env), addr, data->size, access_type, mmu_idx, ra);
             maybe_resized = true;
-            index = tlb_index(env, mmu_idx, addr);
-            entry = tlb_entry(env, mmu_idx, addr);
+            entry = tlb_index_and_entry(env, mmu_idx, addr, &index);
         }
         tlb_addr = tlb_read_idx(entry, access_type) & ~TLB_INVALID_MASK;
     }
@@ -1930,8 +1934,7 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
         goto stop_the_world;
     }
 
-    index = tlb_index(env, mmu_idx, addr);
-    tlbe = tlb_entry(env, mmu_idx, addr);
+    tlbe = tlb_index_and_entry(env, mmu_idx, addr, &index);
 
     /* Check TLB entry and enforce page permissions. */
     tlb_addr = tlb_addr_write(tlbe);
@@ -1940,8 +1943,7 @@ static
Re: [PULL 00/10] ppc queue
On 5/28/23 09:49, Daniel Henrique Barboza wrote:

The following changes since commit ac84b57b4d74606f7f83667a0606deef32b2049d:

  Merge tag 'for-upstream' of https://gitlab.com/bonzini/qemu into staging (2023-05-26 14:40:55 -0700)

are available in the Git repository at:

  https://gitlab.com/danielhb/qemu.git tags/pull-ppc-20230528

for you to fetch changes up to 56b8bfe9bb6b94184b8bbfc4be9196404a81e450:

  ppc/pegasos2: Change default CPU to 7457 (2023-05-28 13:25:45 -0300)

ppc patch queue for 2023-05-28:

This queue includes several assorted fixes for PowerPC SPR emulation, a change in the default Pegasos2 CPU, the addition of AIL mode 3 for spapr, a PIC->CPU interrupt fix for prep and performance enhancements in fpu_helper.c.

Applied, thanks. Please update https://wiki.qemu.org/ChangeLog/8.1 as appropriate.

r~
Re: [PULL 00/19] Ui patches
On 5/28/23 06:19, marcandre.lur...@redhat.com wrote:
From: Marc-André Lureau

The following changes since commit ac84b57b4d74606f7f83667a0606deef32b2049d:

  Merge tag 'for-upstream' of https://gitlab.com/bonzini/qemu into staging (2023-05-26 14:40:55 -0700)

are available in the Git repository at:

  https://gitlab.com/marcandre.lureau/qemu.git tags/ui-pull-request

for you to fetch changes up to 5a4cb61ae1ab0068ab53535ed0ccaf41a5e97d2f:

  ui/gtk: enable backend to send multi-touch events (2023-05-28 16:25:38 +0400)

UI queue

- virtio: add virtio-multitouch device
- sdl: various keyboard grab fixes
- gtk: enable multi-touch events
- misc fixes

Applied, thanks. Please update https://wiki.qemu.org/ChangeLog/8.1 as appropriate.

r~
stable-8.0: block/export: call blk_set_dev_ops and Fix null pointer dereference in error path
Hi! For some reason I picked up this commit:

commit de79b52604e43fdeba6cee4f5af600b62169f2d2
Author: Stefan Hajnoczi
Date: Tue May 2 17:11:19 2023 -0400

    block/export: call blk_set_dev_ops(blk, NULL, NULL)

for stable-8.0.1. However, it turned out it fails iotests, up until the fix, which is

commit a184563778f2b8970eb93291f08108e66432a575
Author: Kevin Wolf
Date: Wed May 10 22:35:55 2023 +0200

    block/export: Fix null pointer dereference in error path

It's a good question why I picked the first one up to begin with, but now, especially after not realizing that iotests are not run automatically within the CI framework, I ended up with a broken v8.0.1, which I tagged earlier today.

What do you think, what is better for 8.0.1 - to revert the first commit or to add the second on top?

Thanks!

/mjt
Re: [PATCH v13 01/10] accel/tcg: remove CONFIG_PROFILER
On 5/29/23 04:49, Fei Wu wrote: TBStats will be introduced to replace CONFIG_PROFILER totally, here remove all CONFIG_PROFILER related stuffs first. Signed-off-by: Vanderson M. do Rosario Signed-off-by: Alex Bennée Signed-off-by: Fei Wu --- accel/tcg/monitor.c | 25 accel/tcg/tcg-accel-ops.c | 10 -- accel/tcg/translate-all.c | 33 -- include/qemu/timer.h | 9 -- include/tcg/tcg.h | 25 meson.build | 2 - meson_options.txt | 2 - scripts/meson-buildoptions.sh | 3 - softmmu/runstate.c| 9 -- tcg/tcg.c | 208 -- tests/qtest/qmp-cmd-test.c| 3 - 11 files changed, 329 deletions(-) Reviewed-by: Richard Henderson r~
Re: [PATCH 0/4] Add Intel Data Streaming Accelerator offloading
Hi all, this is meant to be an RFC. Sorry I didn't put that in the email subject correctly.

From: "Hao Xiang"
Date: Mon, May 29, 2023, 11:20
Subject: [PATCH 0/4] Add Intel Data Streaming Accelerator offloading

[...]
[PATCH 2/4] Add dependency idxd.
Idxd is the device driver for DSA (Intel Data Streaming Accelerator). The driver is fully functional since Linux kernel 5.19. This change adds the driver's header file used for userspace development.

Signed-off-by: Hao Xiang
---
 linux-headers/linux/idxd.h | 356 +
 1 file changed, 356 insertions(+)
 create mode 100644 linux-headers/linux/idxd.h

diff --git a/linux-headers/linux/idxd.h b/linux-headers/linux/idxd.h
new file mode 100644
index 00..1d553bedbd
--- /dev/null
+++ b/linux-headers/linux/idxd.h
@@ -0,0 +1,356 @@
+/* SPDX-License-Identifier: LGPL-2.1 WITH Linux-syscall-note */
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#ifndef _USR_IDXD_H_
+#define _USR_IDXD_H_
+
+#ifdef __KERNEL__
+#include <linux/types.h>
+#else
+#include <stdint.h>
+#endif
+
+/* Driver command error status */
+enum idxd_scmd_stat {
+        IDXD_SCMD_DEV_ENABLED = 0x8010,
+        IDXD_SCMD_DEV_NOT_ENABLED = 0x8020,
+        IDXD_SCMD_WQ_ENABLED = 0x8021,
+        IDXD_SCMD_DEV_DMA_ERR = 0x8002,
+        IDXD_SCMD_WQ_NO_GRP = 0x8003,
+        IDXD_SCMD_WQ_NO_NAME = 0x8004,
+        IDXD_SCMD_WQ_NO_SVM = 0x8005,
+        IDXD_SCMD_WQ_NO_THRESH = 0x8006,
+        IDXD_SCMD_WQ_PORTAL_ERR = 0x8007,
+        IDXD_SCMD_WQ_RES_ALLOC_ERR = 0x8008,
+        IDXD_SCMD_PERCPU_ERR = 0x8009,
+        IDXD_SCMD_DMA_CHAN_ERR = 0x800a,
+        IDXD_SCMD_CDEV_ERR = 0x800b,
+        IDXD_SCMD_WQ_NO_SWQ_SUPPORT = 0x800c,
+        IDXD_SCMD_WQ_NONE_CONFIGURED = 0x800d,
+        IDXD_SCMD_WQ_NO_SIZE = 0x800e,
+        IDXD_SCMD_WQ_NO_PRIV = 0x800f,
+        IDXD_SCMD_WQ_IRQ_ERR = 0x8010,
+        IDXD_SCMD_WQ_USER_NO_IOMMU = 0x8011,
+};
+
+#define IDXD_SCMD_SOFTERR_MASK 0x8000
+#define IDXD_SCMD_SOFTERR_SHIFT 16
+
+/* Descriptor flags */
+#define IDXD_OP_FLAG_FENCE 0x0001
+#define IDXD_OP_FLAG_BOF 0x0002
+#define IDXD_OP_FLAG_CRAV 0x0004
+#define IDXD_OP_FLAG_RCR 0x0008
+#define IDXD_OP_FLAG_RCI 0x0010
+#define IDXD_OP_FLAG_CRSTS 0x0020
+#define IDXD_OP_FLAG_CR 0x0080
+#define IDXD_OP_FLAG_CC 0x0100
+#define IDXD_OP_FLAG_ADDR1_TCS 0x0200
+#define IDXD_OP_FLAG_ADDR2_TCS 0x0400
+#define IDXD_OP_FLAG_ADDR3_TCS 0x0800
+#define IDXD_OP_FLAG_CR_TCS 0x1000
+#define IDXD_OP_FLAG_STORD 0x2000
+#define IDXD_OP_FLAG_DRDBK 0x4000
+#define IDXD_OP_FLAG_DSTS 0x8000
+
+/* IAX */
+#define IDXD_OP_FLAG_RD_SRC2_AECS 0x01
+#define IDXD_OP_FLAG_RD_SRC2_2ND 0x02
+#define IDXD_OP_FLAG_WR_SRC2_AECS_COMP 0x04
+#define IDXD_OP_FLAG_WR_SRC2_AECS_OVFL 0x08
+#define IDXD_OP_FLAG_SRC2_STS 0x10
+#define IDXD_OP_FLAG_CRC_RFC3720 0x20
+
+/* Opcode */
+enum dsa_opcode {
+        DSA_OPCODE_NOOP = 0,
+        DSA_OPCODE_BATCH,
+        DSA_OPCODE_DRAIN,
+        DSA_OPCODE_MEMMOVE,
+        DSA_OPCODE_MEMFILL,
+        DSA_OPCODE_COMPARE,
+        DSA_OPCODE_COMPVAL,
+        DSA_OPCODE_CR_DELTA,
+        DSA_OPCODE_AP_DELTA,
+        DSA_OPCODE_DUALCAST,
+        DSA_OPCODE_CRCGEN = 0x10,
+        DSA_OPCODE_COPY_CRC,
+        DSA_OPCODE_DIF_CHECK,
+        DSA_OPCODE_DIF_INS,
+        DSA_OPCODE_DIF_STRP,
+        DSA_OPCODE_DIF_UPDT,
+        DSA_OPCODE_CFLUSH = 0x20,
+};
+
+enum iax_opcode {
+        IAX_OPCODE_NOOP = 0,
+        IAX_OPCODE_DRAIN = 2,
+        IAX_OPCODE_MEMMOVE,
+        IAX_OPCODE_DECOMPRESS = 0x42,
+        IAX_OPCODE_COMPRESS,
+        IAX_OPCODE_CRC64,
+        IAX_OPCODE_ZERO_DECOMP_32 = 0x48,
+        IAX_OPCODE_ZERO_DECOMP_16,
+        IAX_OPCODE_ZERO_COMP_32 = 0x4c,
+        IAX_OPCODE_ZERO_COMP_16,
+        IAX_OPCODE_SCAN = 0x50,
+        IAX_OPCODE_SET_MEMBER,
+        IAX_OPCODE_EXTRACT,
+        IAX_OPCODE_SELECT,
+        IAX_OPCODE_RLE_BURST,
+        IAX_OPCODE_FIND_UNIQUE,
+        IAX_OPCODE_EXPAND,
+};
+
+/* Completion record status */
+enum dsa_completion_status {
+        DSA_COMP_NONE = 0,
+        DSA_COMP_SUCCESS,
+        DSA_COMP_SUCCESS_PRED,
+        DSA_COMP_PAGE_FAULT_NOBOF,
+        DSA_COMP_PAGE_FAULT_IR,
+        DSA_COMP_BATCH_FAIL,
+        DSA_COMP_BATCH_PAGE_FAULT,
+        DSA_COMP_DR_OFFSET_NOINC,
+        DSA_COMP_DR_OFFSET_ERANGE,
+        DSA_COMP_DIF_ERR,
+        DSA_COMP_BAD_OPCODE = 0x10,
+        DSA_COMP_INVALID_FLAGS,
+        DSA_COMP_NOZERO_RESERVE,
+        DSA_COMP_XFER_ERANGE,
+        DSA_COMP_DESC_CNT_ERANGE,
+        DSA_COMP_DR_ERANGE,
+        DSA_COMP_OVERLAP_BUFFERS,
+        DSA_COMP_DCAST_ERR,
+        DSA_COMP_DESCLIST_ALIGN,
+        DSA_COMP_INT_HANDLE_INVAL,
+        DSA_COMP_CRA_XLAT,
+        DSA_COMP_CRA_ALIGN,
+        DSA_COMP_ADDR_ALIGN,
+        DSA_COMP_PRIV_BAD,
+        DSA_COMP_TRAFFIC_CLASS_CONF,
+        DSA_COMP_PFAULT_RDBA,
+        DSA_COMP_HW_ERR1,
+        DSA_COMP_HW_ERR_DRB,
+        DSA_COMP_TRANSLATION_FAIL,
+};
+
+enum iax_completion_status {
+        IAX_COMP_NONE = 0,
+        IAX_COMP_SUCCESS,
+        IAX_COMP_PAGE_FAULT_IR = 0x04,
+        IAX_COMP_ANALYTICS_ERROR = 0x0a,
+
[PATCH 4/4] Add QEMU command line argument to enable DSA offloading.
This change adds a new argument --dsa-accelerate to qemu. Signed-off-by: Hao Xiang --- qemu-options.hx | 10 ++ softmmu/runstate.c | 4 softmmu/vl.c | 22 ++ storage-daemon/qemu-storage-daemon.c | 2 ++ 4 files changed, 38 insertions(+) diff --git a/qemu-options.hx b/qemu-options.hx index b37eb9662b..29491ee691 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -4890,6 +4890,16 @@ SRST otherwise the option is ignored. Default is off. ERST +DEF("dsa-accelerate", HAS_ARG, QEMU_OPTION_dsa, +"-dsa-accelerate \n" +"Use Intel Data Streaming Accelerator for certain QEMU\n" +"operations, eg, checkpoint.\n", +QEMU_ARCH_I386) +SRST +``-dsa-accelerate path`` +The device path to a DSA accelerator. +ERST + DEF("dump-vmstate", HAS_ARG, QEMU_OPTION_dump_vmstate, "-dump-vmstate \n" "Output vmstate information in JSON format to file.\n" diff --git a/softmmu/runstate.c b/softmmu/runstate.c index 2f2396c819..1f938e192f 100644 --- a/softmmu/runstate.c +++ b/softmmu/runstate.c @@ -41,6 +41,7 @@ #include "qapi/qapi-commands-run-state.h" #include "qapi/qapi-events-run-state.h" #include "qemu/accel.h" +#include "qemu/cutils.h" #include "qemu/error-report.h" #include "qemu/job.h" #include "qemu/log.h" @@ -834,6 +835,9 @@ void qemu_cleanup(void) tpm_cleanup(); net_cleanup(); audio_cleanup(); + +dsa_cleanup(); + monitor_cleanup(); qemu_chr_cleanup(); user_creatable_cleanup(); diff --git a/softmmu/vl.c b/softmmu/vl.c index b0b96f67fa..8ace491183 100644 --- a/softmmu/vl.c +++ b/softmmu/vl.c @@ -161,6 +161,7 @@ static const char *mem_path; static const char *incoming; static const char *loadvm; static const char *accelerators; +static const char *dsa_path; static bool have_custom_ram_size; static const char *ram_memdev_id; static QDict *machine_opts_dict; @@ -373,6 +374,20 @@ static QemuOptsList qemu_msg_opts = { }, }; +static QemuOptsList qemu_dsa_opts = { +.name = "dsa-accelerate", +.head = QTAILQ_HEAD_INITIALIZER(qemu_dsa_opts.head), +.desc = { +{ +.name = "device", +.type = QEMU_OPT_STRING, 
+.help = "The device path to DSA accelerator used for certain " +"QEMU operations, eg, checkpoint\n", +}, +{ /* end of list */ } +}, +}; + static QemuOptsList qemu_name_opts = { .name = "name", .implied_opt_name = "guest", @@ -2704,6 +2719,7 @@ void qemu_init(int argc, char **argv) qemu_add_opts(&qemu_semihosting_config_opts); qemu_add_opts(&qemu_fw_cfg_opts); qemu_add_opts(&qemu_action_opts); +qemu_add_opts(&qemu_dsa_opts); module_call_init(MODULE_INIT_OPTS); error_init(argv[0]); @@ -3504,6 +3520,12 @@ void qemu_init(int argc, char **argv) } configure_msg(opts); break; +case QEMU_OPTION_dsa: +dsa_path = optarg; +if (configure_dsa(dsa_path)) { +exit(1); +} +break; case QEMU_OPTION_dump_vmstate: if (vmstate_dump_file) { error_report("only one '-dump-vmstate' " diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c index 0e9354faa6..0e4375407a 100644 --- a/storage-daemon/qemu-storage-daemon.c +++ b/storage-daemon/qemu-storage-daemon.c @@ -439,6 +439,8 @@ int main(int argc, char *argv[]) job_cancel_sync_all(); bdrv_close_all(); +dsa_cleanup(); + monitor_cleanup(); qemu_chr_cleanup(); user_creatable_cleanup(); -- 2.30.2
[PATCH 1/4] Introduce new instruction set enqcmd/mmovdir64b to the build system.
1. Enable instruction set enqcmd in build. 2. Enable instruction set movdir64b in build. Signed-off-by: Hao Xiang --- meson.build | 3 +++ meson_options.txt | 4 scripts/meson-buildoptions.sh | 6 ++ 3 files changed, 13 insertions(+) diff --git a/meson.build b/meson.build index 2d48aa1e2e..46f1bb2e34 100644 --- a/meson.build +++ b/meson.build @@ -2682,6 +2682,8 @@ config_host_data.set('CONFIG_AVX512BW_OPT', get_option('avx512bw') \ int main(int argc, char *argv[]) { return bar(argv[0]); } '''), error_message: 'AVX512BW not available').allowed()) +config_host_data.set('CONFIG_DSA_OPT', get_option('enqcmd')) + have_pvrdma = get_option('pvrdma') \ .require(rdma.found(), error_message: 'PVRDMA requires OpenFabrics libraries') \ .require(cc.compiles(gnu_source_prefix + ''' @@ -4123,6 +4125,7 @@ summary_info += {'memory allocator': get_option('malloc')} summary_info += {'avx2 optimization': config_host_data.get('CONFIG_AVX2_OPT')} summary_info += {'avx512bw optimization': config_host_data.get('CONFIG_AVX512BW_OPT')} summary_info += {'avx512f optimization': config_host_data.get('CONFIG_AVX512F_OPT')} +summary_info += {'dsa acceleration': config_host_data.get('CONFIG_DSA_OPT')} if get_option('gprof') gprof_info = 'YES (deprecated)' else diff --git a/meson_options.txt b/meson_options.txt index 90237389e2..51097da56c 100644 --- a/meson_options.txt +++ b/meson_options.txt @@ -117,6 +117,10 @@ option('avx512f', type: 'feature', value: 'disabled', description: 'AVX512F optimizations') option('avx512bw', type: 'feature', value: 'auto', description: 'AVX512BW optimizations') +option('enqcmd', type: 'boolean', value: false, + description: 'MENQCMD optimizations') +option('movdir64b', type: 'boolean', value: false, + description: 'MMOVDIR64B optimizations') option('keyring', type: 'feature', value: 'auto', description: 'Linux keyring support') diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh index 5714fd93d9..5ef4ec36f4 100644 --- 
a/scripts/meson-buildoptions.sh +++ b/scripts/meson-buildoptions.sh @@ -81,6 +81,8 @@ meson_options_help() { printf "%s\n" ' avx2AVX2 optimizations' printf "%s\n" ' avx512bwAVX512BW optimizations' printf "%s\n" ' avx512f AVX512F optimizations' + printf "%s\n" ' enqcmd ENQCMD optimizations' + printf "%s\n" ' movdir64b MOVDIR64B optimizations' printf "%s\n" ' blkio libblkio block device driver' printf "%s\n" ' bochs bochs image format support' printf "%s\n" ' bpf eBPF support' @@ -221,6 +223,10 @@ _meson_option_parse() { --disable-avx512bw) printf "%s" -Davx512bw=disabled ;; --enable-avx512f) printf "%s" -Davx512f=enabled ;; --disable-avx512f) printf "%s" -Davx512f=disabled ;; +--enable-enqcmd) printf "%s" -Denqcmd=true ;; +--disable-enqcmd) printf "%s" -Denqcmd=false ;; +--enable-movdir64b) printf "%s" -Dmovdir64b=true ;; +--disable-movdir64b) printf "%s" -Dmovdir64b=false ;; --enable-gcov) printf "%s" -Db_coverage=true ;; --disable-gcov) printf "%s" -Db_coverage=false ;; --enable-lto) printf "%s" -Db_lto=true ;; -- 2.30.2
[PATCH 3/4] Implement zero page checking using DSA.
1. Adds a memory comparison function by submitting the work to the idxd driver. 2. Add interface to set bufferiszero accel function to DSA offloading. 3. Fallback to use CPU accel function if DSA offloading fails due to page fault. Signed-off-by: Hao Xiang --- include/qemu/cutils.h | 6 + migration/ram.c | 4 + util/bufferiszero.c | 14 ++ util/dsa.c| 295 ++ util/meson.build | 1 + 5 files changed, 320 insertions(+) create mode 100644 util/dsa.c diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h index 92c436d8c7..9d0286ac99 100644 --- a/include/qemu/cutils.h +++ b/include/qemu/cutils.h @@ -188,9 +188,15 @@ char *freq_to_str(uint64_t freq_hz); /* used to print char* safely */ #define STR_OR_NULL(str) ((str) ? (str) : "null") +typedef bool (*buffer_accel_fn)(const void *, size_t); +void set_accel(buffer_accel_fn, size_t len); +void get_fallback_accel(buffer_accel_fn *); bool buffer_is_zero(const void *buf, size_t len); bool test_buffer_is_zero_next_accel(void); +int configure_dsa(const char *dsa_path); +void dsa_cleanup(void); + /* * Implementation of ULEB128 (http://en.wikipedia.org/wiki/LEB128) * Input is limited to 14-bit numbers diff --git a/migration/ram.c b/migration/ram.c index 88a6c82e63..b586ac4a99 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -2280,6 +2280,10 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss) if (preempt_active) { qemu_mutex_unlock(&rs->bitmap_mutex); } +/* + * TODO: Make ram_save_target_page async to take advantage + * of DSA offloading.
+ */ tmppages = migration_ops->ram_save_target_page(rs, pss); if (tmppages >= 0) { pages += tmppages; diff --git a/util/bufferiszero.c b/util/bufferiszero.c index 3e6a5dfd63..3d089ef1fe 100644 --- a/util/bufferiszero.c +++ b/util/bufferiszero.c @@ -206,6 +206,7 @@ buffer_zero_avx512(const void *buf, size_t len) static unsigned used_accel = INIT_USED; static unsigned length_to_accel = INIT_LENGTH; static bool (*buffer_accel)(const void *, size_t) = INIT_ACCEL; +static bool (*buffer_accel_fallback)(const void *, size_t) = INIT_ACCEL; static unsigned __attribute__((noinline)) select_accel_cpuinfo(unsigned info) @@ -231,6 +232,7 @@ select_accel_cpuinfo(unsigned info) if (info & all[i].bit) { length_to_accel = all[i].len; buffer_accel = all[i].fn; +buffer_accel_fallback = all[i].fn; return all[i].bit; } } @@ -272,6 +274,17 @@ bool test_buffer_is_zero_next_accel(void) } #endif +void set_accel(buffer_accel_fn fn, size_t len) +{ +buffer_accel = fn; +length_to_accel = len; +} + +void get_fallback_accel(buffer_accel_fn *fn) +{ +*fn = buffer_accel_fallback; +} + /* * Checks if a buffer is all zeroes */ @@ -288,3 +301,4 @@ bool buffer_is_zero(const void *buf, size_t len) includes a check for an unrolled loop over 64-bit integers. */ return select_accel_fn(buf, len); } + diff --git a/util/dsa.c b/util/dsa.c new file mode 100644 index 00..2fdcdb4f49 --- /dev/null +++ b/util/dsa.c @@ -0,0 +1,295 @@ +/* + * Use Intel Data Streaming Accelerator to offload certain background + * operations. 
+ * + * Copyright (c) 2023 Hao Xiang + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ +#include "qemu/osdep.h" +#include "qemu/cutils.h" +#include "qemu/bswap.h" +#include "qemu/error-report.h" + +#ifdef CONFIG_DSA_OPT + +#pragma GCC push_options +#pragma GCC target("enqcmd") +#pragma GCC target("movdir64b") + +#include +#include "x86intrin.h" + +#define DSA_WQ_SIZE 4096 + +static bool use_simulation; +static uint64_t total_bytes_checked; +static uint64_t total_function_calls; +static uint64_t total_success_count; +static int max_retry_count; +static int top_retry_count;
[PATCH 0/4] Add Intel Data Streaming Accelerator offloading
* Idea:

Intel Data Streaming Accelerator (DSA) is introduced in Intel's 4th generation Xeon server, aka Sapphire Rapids.
https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator

One of the things DSA can do is to offload memory comparison workload from the CPU to DSA accelerator hardware. This change proposes a solution to offload QEMU's zero page checking from the CPU to DSA accelerator hardware. For this to work, I am looking into two potential improvements during QEMU background operations, e.g., VM checkpoint and VM live migration.

1. Reduce CPU usage.

I did a simple tracing of the save VM workload. I started VMs with 64GB RAM, 128GB RAM and 256GB RAM respectively and then ran the "savevm" command from QEMU's command line. During this savevm workload, I recorded the CPU cycles spent in function save_snapshot and the total CPU cycles spent in repeated calls into function buffer_is_zero, which performs zero page checking.

|----------|--------------|--------------|
|VM memory |save_snapshot |buffer_is_zero|
|capacity  |(CPU cycles)  |(CPU cycles)  |
|----------|--------------|--------------|
|64GB      |19449838924   |5022504892    |
|128GB     |36966951760   |10049561066   |
|256GB     |72744615676   |20042076440   |
|----------|--------------|--------------|

In the three scenarios, the CPU cycles spent in zero page checking account for roughly 27% of the total cycles spent in save_snapshot. I believe this is due to the fact that a large part of save_snapshot performs file IO operations writing all memory pages into the QEMU image file, and there is a linear increase in CPU cycles spent in savevm as the VM's total memory increases. If we can offload all zero page checking function calls to the DSA accelerator, we will reduce the CPU usage by 27% in the savevm workload, potentially freeing CPU resources for other work. The same savings should apply to the live VM migration workload as well.

Furthermore, if the guest VM's vcpu shares the same physical CPU core used for live migration, the guest VM will gain more underlying CPU resource, making the guest VM more responsive to its own guest workload during live migration.

2. Reduce operation latency.

I did some benchmark testing on a pure CPU memory comparison implementation and a DSA hardware offload implementation.

Testbed: Intel(R) Xeon(R) Platinum 8457C, CPU 3100MHz

Latency is measured by completing memory comparison on two memory buffers, each one GB in size. The memory comparisons are done via CPU and DSA accelerator respectively. When performing CPU memory comparison, I use a single thread. When performing DSA accelerator memory comparison, I use one DSA engine. While doing memory comparison, both the CPU and DSA based implementations use a 4k memory buffer as the granularity for comparison.

|---------------|--------|
|Memory         |Latency |
|---------------|--------|
|CPU one thread |80ms    |
|DSA one engine |89ms    |
|---------------|--------|

In our test system, we have two sockets and two DSA devices per socket. Each DSA device has four engines built in. I believe that if we leverage more DSA engine resources and a good parallelization of zero page checking, we can keep the DSA devices busy and reduce CPU usage.

* Current state:

This patch implements the DSA offloading operation for zero page checking. The user can optionally replace the zero page checking function with DSA offloading by specifying a new argument on the QEMU startup command line. There is no performance gain in this change. This is mainly because zero page checking is a synchronous operation and each page size is 4k. Offloading a single 4k memory page comparison to the DSA accelerator and waiting for the driver to complete the operation introduces overhead. Currently the overhead is bigger than the CPU cycles saved by offloading.

* Future work:

1. Need to make the zero page checking workflow asynchronous.
The idea is that we can throw lots of zero page checking operations at once to N (configurable) DSA engines. Then we wait for those operations to be completed by idxd (the DSA device driver). Currently ram_save_iterate has a loop to iterate through all the memory blocks, find the dirty pages and save them all. The loop exits when there are no more dirty pages to save. I think when we walk through all the memory blocks, we just need to identify whether there are dirty pages remaining, but we can do the actual "save page" asynchronously. We can return from ram_save_iterate when we finish walking through the memory blocks and all pages are saved. This sounds like a pretty large refactoring change and I am looking hard into this path to figure out exactly how I can tackle it. Any feedback would be really appreciated.

2. Need to implement an abstraction layer where QEMU can just throw zero page
Re: [RFC] cxl: Multi-headed device design
On Wed, May 17, 2023 at 03:18:59PM +0100, Jonathan Cameron wrote: > > > > i.e. an SLD does not require an FM-Owned LD for management, but an MHD, > > MLD, and DCD all do (at least in theory). > > DCD 'might' though I don't think anything in the spec rules that you 'must' > control the SLD/MLD via the FM-API, it's just a spec provided option. > From our point of view we don't want to get more creative so lets assume > it does. > > I can't immediately think of reason for a single head SLD to have an FM owned > LD, though it may well have an MCTP CCI for querying stuff about it from an > FM. > Before I go running off into the woods, it seems like it would be simple enough to simply make an FM-LD "device" which simply links a mhXXX device and implements its own Mailbox CCI. Maybe not "realistic", but to my mind this appears as a separate character device in /dev/cxl/*. Maybe the realism here doesn't matter, since we're just implementing for the sake of testing. This is just a straightforward way to pipe a DCD request into the device and trigger DCD event log entries. As commented early, this is done as a QEMU fed event. If that's sufficient, a hack like this feels like it would be at least mildly cleaner and easier to test against. Example: consider a user wanting to issue a DCD command to add capacity. Real world: this would be some out of band communication, and eventually this results in a DCD command to the device that results in a capacity-event showing up in the log. Maybe it happens over TCP and drills down to a Redfish event that talks to the BMC that issues a command over etc etc MTCP emulations, etc. With a simplistic /dev/cxl/memX-fmld device a user can simply issue these commands without all that, and the effect is the same. 
On the QEMU side you get something like: -device cxl-type3,volatile-memdev=mem0,bus=rp0,mhd-head=0,id=mem0,mhd-main=true -device cxl-mhsld,type3=mem0,bus=rp0,headid=0,id=mhsld1,shmid=X -device cxl-fmld,mhsld=mhsld1,bus=rp1,id=mem0-fmld,shmid=Y on the Linux side you get: /dev/cxl/mem0 /dev/cxl/mem0-fmld in this example, the shmid for mhsld is a shared memory region created with mkipc that implements the shared state (basically section bitmap tracking and the actual plumbing for DCD, etc). This limits the emulation of the mhsld to a single host for now, but that seems sufficient. The shmid for cxl-fmld implements any shared state for the fmld, including a mutex, that allows all hosts attached to the mhsld to have access to the fmld. This may or may not be realistic, but it would allow all head-attached hosts to send DCD commands over its own local fabric, rather than going out of band. This gets us to the point where, at a minimum, each host can issue its own DCD commands to add capacity to itself. That's step 1. Step 2 is allow Host A to issue a DCD command to add capacity to Host B. I suppose this could be done via a background thread that waits on a message to show up in the shared memory region? Being somewhat unfamiliar with QEMU, is it kosher to start background threads that just wait on events like this, or is that generally frowned upon? If done this way, it would simplify the creation and startup sequence at least. ~Gregory
Re: [RFC PATCH 5/6] block: Allow bdrv_get_allocated_file_size to run in bdrv context
Kevin Wolf writes: > Am 23.05.2023 um 23:39 hat Fabiano Rosas geschrieben: >> We're about to move calls to 'fstat' into the thread-pool to avoid >> blocking VCPU threads should the system call take too long. >> >> To achieve that we first need to make sure none of its callers is >> holding the aio_context lock, otherwise yielding before scheduling the >> aiocb handler would result in a deadlock when the qemu_global_mutex is >> released and another thread tries to acquire the aio_context. >> >> Signed-off-by: Fabiano Rosas >> --- >> block/qapi.c | 22 +- >> 1 file changed, 21 insertions(+), 1 deletion(-) >> >> diff --git a/block/qapi.c b/block/qapi.c >> index ae6cd1c2ff..cd197abf1f 100644 >> --- a/block/qapi.c >> +++ b/block/qapi.c >> @@ -222,6 +222,26 @@ int bdrv_query_snapshot_info_list(BlockDriverState *bs, >> return 0; >> } >> >> +static int64_t bdrv_get_actual_size(BlockDriverState *bs) >> +{ >> +int64_t size; >> +AioContext *old_ctx = NULL; >> + >> +if (qemu_in_coroutine()) { > > Hm. Why can't we make sure that it always runs in a coroutine? > > Callers: > > * bdrv_query_block_node_info(). This function seems to be completely > unused, we can remove it. > Ok, I'm already removing it in patch 1. > * bdrv_query_image_info(). Called by bdrv_block_device_info() and itself. > bdrv_block_device_info() could become a co_wrapper after swapping the > first two parameters so that it runs in the AioContext of @bs. > We cannot have bdrv_block_device_info() as co_wrapper because that would create its own coroutine and yielding from that would merely have us waiting at bdrv_poll_co. So it doesn't solve the blocking issue. What would work is to make bdrv_block_device_info() a coroutine_fn (without a wrapper). Its two callers, qmp_query_block() and qmp_query_named_block_nodes() are already being moved into the handler coroutine in this series, so it would be mostly a matter of adding the type annotation. > * bdrv_query_block_graph_info(). Only called by qemu-img. 
Could become a > co_wrapper_bdrv_rdlock. > This one works ok. >> +aio_context_release(bdrv_get_aio_context(bs)); >> +old_ctx = bdrv_co_enter(bs); > > I think this is the wrong function to do this. The caller should already > make sure that it's in the right AioContext. > The issue here is that bdrv_do_query_node_info() calls bdrv_co_get_allocated_file_size(), which is the function we want to make async and therefore needs to run outside of aio_context_acquire/release. However, bdrv_do_query_node_info() also calls bdrv_refresh_filename(), which is GLOBAL_STATE_CODE and therefore wants to be in the main thread context. So entering the bs AioContext at the caller doesn't work when giving the device its own iothread. >> +} >> + >> +size = bdrv_get_allocated_file_size(bs); >> + >> +if (qemu_in_coroutine() && old_ctx) { >> +bdrv_co_leave(bs, old_ctx); >> +aio_context_acquire(bdrv_get_aio_context(bs)); >> +} >> + >> +return size; >> +} > > Kevin
Re: [PATCH] intel_iommu: Optimize out some unnecessary UNMAP calls
On Fri, May 26, 2023 at 08:44:29AM +, Liu, Yi L wrote: > > > >> In fact, the other purpose of this patch is to eliminate noisy error > > > >> log when we work with IOMMUFD. It looks the duplicate UNMAP call will > > > >> fail with IOMMUFD while always succeed with legacy container. This > > > >> behavior difference lead to below error log for IOMMUFD: > > > > A dumb question, should IOMMUFD stick to the same behaviour as the legacy > > container? > > May need to hear from JasonG. Should IOMMU_IOAS_UNMAP return error or > success if the iova is not found? The native iommufd functions will return failure if they could not unmap anything. Otherwise they return the number of consecutive bytes unmapped. The VFIO emulation functions should do whatever VFIO does, is there a mistake there? Jason
Re: [PATCH 0/2] Vhost-vdpa Shadow Virtqueue Offloads support
On Mon, May 29, 2023 at 3:18 PM Hawkins Jiawei wrote: > > This series enables shadowed CVQ to intercept Offloads commands > through shadowed CVQ, update the virtio NIC device model so qemu > send it in a migration, and the restore of that Offloads state > in the destination. > > Hawkins Jiawei (2): > vdpa: Add vhost_vdpa_net_load_offloads > vdpa: Allow VIRTIO_NET_F_CTRL_GUEST_OFFLOADS in SVQ > > net/vhost-vdpa.c | 27 +++ > 1 file changed, 27 insertions(+) > CCing MST too. Thanks!
Re: [PATCH v1 5/9] KVM: x86: Add new hypercall to lock control registers
On 08/05/2023 23:11, Wei Liu wrote: On Fri, May 05, 2023 at 05:20:42PM +0200, Mickaël Salaün wrote: This enables guests to lock their CR0 and CR4 registers with a subset of X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP, X86_CR4_UMIP, X86_CR4_FSGSBASE and X86_CR4_CET flags. The new KVM_HC_LOCK_CR_UPDATE hypercall takes two arguments. The first is to identify the control register, and the second is a bit mask to pin (i.e. mark as read-only). These register flags should already be pinned by Linux guests, but once compromised, this self-protection mechanism could be disabled, which is not the case with this dedicated hypercall. Cc: Borislav Petkov Cc: Dave Hansen Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Kees Cook Cc: Madhavan T. Venkataraman Cc: Paolo Bonzini Cc: Sean Christopherson Cc: Thomas Gleixner Cc: Vitaly Kuznetsov Cc: Wanpeng Li Signed-off-by: Mickaël Salaün Link: https://lore.kernel.org/r/20230505152046.6575-6-...@digikod.net [...] hw_cr4 = (cr4_read_shadow() & X86_CR4_MCE) | (cr4 & ~X86_CR4_MCE); if (is_unrestricted_guest(vcpu)) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ffab64d08de3..a529455359ac 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7927,11 +7927,77 @@ static unsigned long emulator_get_cr(struct x86_emulate_ctxt *ctxt, int cr) return value; } +#ifdef CONFIG_HEKI + +extern unsigned long cr4_pinned_mask; + Can this be moved to a header file? Yep, but I'm not sure which one. Any preference Kees? +static int heki_lock_cr(struct kvm *const kvm, const unsigned long cr, + unsigned long pin) +{ + if (!pin) + return -KVM_EINVAL; + + switch (cr) { + case 0: + /* Cf. arch/x86/kernel/cpu/common.c */ + if (!(pin & X86_CR0_WP)) + return -KVM_EINVAL; + + if ((read_cr0() & pin) != pin) + return -KVM_EINVAL; + + atomic_long_or(pin, &kvm->heki_pinned_cr0); + return 0; + case 4: + /* Checks for irrelevant bits. */ + if ((pin & cr4_pinned_mask) != pin) + return -KVM_EINVAL; + It is enforcing the host mask on the guest, right? 
If the guest's set is a superset of the host's then it will get rejected. + /* Ignores bits not present in host. */ + pin &= __read_cr4(); + atomic_long_or(pin, &kvm->heki_pinned_cr4); We assume that the host's mask is a superset of the guest's mask. I guess we should check the absolute supported bits instead, even if it would be weird for the host to not support these bits. + return 0; + } + return -KVM_EINVAL; +} + +int heki_check_cr(const struct kvm *const kvm, const unsigned long cr, + const unsigned long val) +{ + unsigned long pinned; + + switch (cr) { + case 0: + pinned = atomic_long_read(&kvm->heki_pinned_cr0); + if ((val & pinned) != pinned) { + pr_warn_ratelimited( + "heki-kvm: Blocked CR0 update: 0x%lx\n", val); I think if the message contains the VM and VCPU identifier it will become more useful. Indeed, and this should be the case for all log messages, but I'd left that for future work. ;) I'll update the logs for the next series with a new kvm_warn_ratelimited() helper using VCPU's PID.
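The pin/check semantics under discussion reduce to a monotonic mask: pinning ORs bits in, and a CR update is allowed only if it keeps every pinned bit set. A toy model (plain variables instead of the kernel's atomic_long_* operations on struct kvm; the function names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Accumulated pinned bits for one control register; the kernel version
 * keeps this in struct kvm and updates it with atomic_long_or(). */
static unsigned long pinned_cr4;

/* Pinning only ever adds bits, so it cannot be undone by the guest. */
static void heki_pin(unsigned long pin)
{
    pinned_cr4 |= pin;
}

/* Mirrors heki_check_cr(): reject any value that clears a pinned bit. */
static bool cr_update_allowed(unsigned long val)
{
    return (val & pinned_cr4) == pinned_cr4;
}
```

This also shows why the order of pinning does not matter: the mask is a pure OR-accumulator.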
[PATCH 1/3] hw/smbios: Fix smbios_smp_sockets calculation
From: Zhao Liu Here're 2 mistakes: 1. 003f230e37d7 ("machine: Tweak the order of topology members in struct CpuTopology") changes the meaning of smp.cores but doesn't fix original smp.cores uses. And because of the introduction of cluster, now smp.cores means the number of cores in one cluster. So smp.cores * smp.threads just means the cpus in a cluster not in a socket. 2. smp.cpus means the number of initial online cpus, not the total number of cpus. For such topology calculation, smp.max_cpus should be considered. Since the number of sockets has already been recorded in smp structure, use smp.sockets directly. Fixes: 003f230e37d7 ("machine: Tweak the order of topology members in struct CpuTopology") Signed-off-by: Zhao Liu --- hw/smbios/smbios.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/hw/smbios/smbios.c b/hw/smbios/smbios.c index d2007e70fb05..d67415d44dd8 100644 --- a/hw/smbios/smbios.c +++ b/hw/smbios/smbios.c @@ -1088,8 +1088,7 @@ void smbios_get_tables(MachineState *ms, smbios_build_type_2_table(); smbios_build_type_3_table(); -smbios_smp_sockets = DIV_ROUND_UP(ms->smp.cpus, - ms->smp.cores * ms->smp.threads); +smbios_smp_sockets = ms->smp.sockets; assert(smbios_smp_sockets >= 1); for (i = 0; i < smbios_smp_sockets; i++) { -- 2.34.1
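To see the first mistake with concrete numbers: take a hypothetical 2 sockets x 1 die x 2 clusters x 4 cores x 2 threads topology (32 CPUs, all online). The old DIV_ROUND_UP(smp.cpus, smp.cores * smp.threads) formula divides by CPUs per cluster, not per socket, and reports 4 sockets instead of 2. The struct below only mirrors the smp fields involved:

```c
/* Mirrors the MachineState.smp fields used by the old calculation. */
typedef struct {
    unsigned cpus;     /* initially online CPUs */
    unsigned sockets;
    unsigned cores;    /* per cluster, after commit 003f230e37d7 */
    unsigned threads;  /* per core */
} SmpTopo;

/* The pre-fix calculation: DIV_ROUND_UP(cpus, cores * threads), which
 * silently treats smp.cores as cores-per-socket. */
static unsigned smbios_sockets_old(const SmpTopo *s)
{
    unsigned per = s->cores * s->threads;
    return (s->cpus + per - 1) / per;
}
```

The fix sidesteps the arithmetic entirely by reading smp.sockets, which is already computed correctly during -smp parsing.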
[PATCH 0/3] hw/smbios: Cleanup topology related variables
From: Zhao Liu Hi all, This patchset is split from my previous hybrid topology RFC [1] for easier review. There are three places for topology-related cleanup: 1. Fix the use of smp.cores. The meaning of CPU topology members in MachineState.smp has changed since 003f230e37d7 ("machine: Tweak the order of topology members in struct CpuTopology"), but the use of smp.cores based on the original semantics (the number of cores in one package) has not been updated, so here is a cleanup. 2. Consolidate the use of MachineState.smp. The socket calculation can be simplified by directly using the MachineState.smp.sockets. 3. Fix thread count in type4. I also found that the definition of thread count in type4 doesn't match the spec (smbios 3.0.0) and cleaned it up as well. [1]: https://mail.gnu.org/archive/html/qemu-devel/2023-02/msg03205.html Regards, Zhao --- Zhao Liu (3): hw/smbios: Fix smbios_smp_sockets calculation hw/smbios: Fix thread count in type4 hw/smbios: Fix core count in type4 hw/smbios/smbios.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) -- 2.34.1
[PATCH 3/3] hw/smbios: Fix core count in type4
From: Zhao Liu From SMBIOS 3.0 specification, core count field means: Core Count is the number of cores detected by the BIOS for this processor socket. [1] Before 003f230e37d7 ("machine: Tweak the order of topology members in struct CpuTopology"), MachineState.smp.cores means "the number of cores in one package", and it's correct to use smp.cores for core count. But 003f230e37d7 changes the smp.cores' meaning to "the number of cores in one die" and doesn't change the original smp.cores' use in smbios as well, which makes core count in type4 go wrong. Fix this issue with the correct "cores per socket" calculation. [1] SMBIOS 3.0.0, section 7.5.6, Processor Information - Core Count Fixes: 003f230e37d7 ("machine: Tweak the order of topology members in struct CpuTopology") Signed-off-by: Zhao Liu --- hw/smbios/smbios.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/hw/smbios/smbios.c b/hw/smbios/smbios.c index f80a701cdfc1..32e26bffa2df 100644 --- a/hw/smbios/smbios.c +++ b/hw/smbios/smbios.c @@ -714,6 +714,7 @@ static void smbios_build_type_4_table(MachineState *ms, unsigned instance) char sock_str[128]; size_t tbl_len = SMBIOS_TYPE_4_LEN_V28; unsigned cpus_per_socket = ms->smp.max_cpus / ms->smp.sockets; +unsigned cores_per_socket = cpus_per_socket / ms->smp.threads; if (smbios_ep_type == SMBIOS_ENTRY_POINT_TYPE_64) { tbl_len = SMBIOS_TYPE_4_LEN_V30; @@ -748,7 +749,7 @@ static void smbios_build_type_4_table(MachineState *ms, unsigned instance) SMBIOS_TABLE_SET_STR(4, asset_tag_number_str, type4.asset); SMBIOS_TABLE_SET_STR(4, part_number_str, type4.part); -t->core_count = (ms->smp.cores > 255) ? 0xFF : ms->smp.cores; +t->core_count = (cores_per_socket > 255) ? 0xFF : cores_per_socket; t->core_enabled = t->core_count; t->thread_count = (cpus_per_socket > 255) ? 
0xFF : cpus_per_socket; @@ -757,7 +758,7 @@ static void smbios_build_type_4_table(MachineState *ms, unsigned instance) t->processor_family2 = cpu_to_le16(0x01); /* Other */ if (tbl_len == SMBIOS_TYPE_4_LEN_V30) { -t->core_count2 = t->core_enabled2 = cpu_to_le16(ms->smp.cores); +t->core_count2 = t->core_enabled2 = cpu_to_le16(cores_per_socket); t->thread_count2 = cpu_to_le16(cpus_per_socket); } -- 2.34.1
[PATCH 2/3] hw/smbios: Fix thread count in type4
From: Zhao Liu From SMBIOS 3.0 specification, thread count field means: Thread Count is the total number of threads detected by the BIOS for this processor socket. It is a processor-wide count, not a thread-per-core count. [1] So here we should use threads per socket rather than threads per core. [1] SMBIOS 3.0.0, section 7.5.8, Processor Information - Thread Count Fixes: c97294ec1b9e ("SMBIOS: Build aggregate smbios tables and entry point") Signed-off-by: Zhao Liu --- hw/smbios/smbios.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/hw/smbios/smbios.c b/hw/smbios/smbios.c index d67415d44dd8..f80a701cdfc1 100644 --- a/hw/smbios/smbios.c +++ b/hw/smbios/smbios.c @@ -713,6 +713,7 @@ static void smbios_build_type_4_table(MachineState *ms, unsigned instance) { char sock_str[128]; size_t tbl_len = SMBIOS_TYPE_4_LEN_V28; +unsigned cpus_per_socket = ms->smp.max_cpus / ms->smp.sockets; if (smbios_ep_type == SMBIOS_ENTRY_POINT_TYPE_64) { tbl_len = SMBIOS_TYPE_4_LEN_V30; @@ -750,14 +751,14 @@ static void smbios_build_type_4_table(MachineState *ms, unsigned instance) t->core_count = (ms->smp.cores > 255) ? 0xFF : ms->smp.cores; t->core_enabled = t->core_count; -t->thread_count = (ms->smp.threads > 255) ? 0xFF : ms->smp.threads; +t->thread_count = (cpus_per_socket > 255) ? 0xFF : cpus_per_socket; t->processor_characteristics = cpu_to_le16(0x02); /* Unknown */ t->processor_family2 = cpu_to_le16(0x01); /* Other */ if (tbl_len == SMBIOS_TYPE_4_LEN_V30) { t->core_count2 = t->core_enabled2 = cpu_to_le16(ms->smp.cores); -t->thread_count2 = cpu_to_le16(ms->smp.threads); +t->thread_count2 = cpu_to_le16(cpus_per_socket); } SMBIOS_BUILD_TABLE_POST; -- 2.34.1
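The per-socket arithmetic shared by this patch and the core-count fix can be checked in isolation; the helper names are illustrative, and the 0xFF clamp mirrors the one-byte legacy fields (the 16-bit *_2 fields added in SMBIOS 3.0 carry the unclamped value):

```c
/* cpus_per_socket as computed in smbios_build_type_4_table(). */
static unsigned cpus_per_socket(unsigned max_cpus, unsigned sockets)
{
    return max_cpus / sockets;
}

/* cores_per_socket, derived by dividing the threads-per-core back out. */
static unsigned cores_per_socket(unsigned max_cpus, unsigned sockets,
                                 unsigned threads)
{
    return cpus_per_socket(max_cpus, sockets) / threads;
}

/* The one-byte SMBIOS fields saturate at 0xFF. */
static unsigned clamp_u8(unsigned v)
{
    return v > 255 ? 0xFF : v;
}
```

With the 2 x 1 x 2 x 4 x 2 example topology (32 CPUs), thread_count per socket is 16 and core_count per socket is 8, matching dies x clusters x cores = 1 x 2 x 4.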
Re: [PATCH 1/2] vdpa: Add vhost_vdpa_net_load_offloads
On Mon, May 29, 2023 at 3:18 PM Hawkins Jiawei wrote: > > This patch introduces vhost_vdpa_net_load_offloads() to > restore offloads state at device's startup. > > Signed-off-by: Hawkins Jiawei > --- > net/vhost-vdpa.c | 26 ++ > 1 file changed, 26 insertions(+) > > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c > index 37cdc84562..682c749b19 100644 > --- a/net/vhost-vdpa.c > +++ b/net/vhost-vdpa.c > @@ -680,6 +680,28 @@ static int vhost_vdpa_net_load_mq(VhostVDPAState *s, > return *s->status != VIRTIO_NET_OK; > } > > +static int vhost_vdpa_net_load_offloads(VhostVDPAState *s, > +const VirtIONet *n) > +{ > +uint64_t features, offloads; > +ssize_t dev_written; > + > +features = n->parent_obj.guest_features; > +if (!(features & BIT_ULL(VIRTIO_NET_F_CTRL_GUEST_OFFLOADS))) { > +return 0; > +} > + Maybe we can avoid sending this CVQ command if the guest already uses the default values? By default all features are enabled if I'm not wrong. I think the best way is to expose virtio_net_supported_guest_offloads or virtio_net_guest_offloads_by_features and then check if n->curr_guest_offloads is the same. We should do the same with vhost_vdpa_net_load_mq, but that is out of the scope of this series. Thanks! > +offloads = cpu_to_le64(n->curr_guest_offloads); > +dev_written = vhost_vdpa_net_load_cmd(s, VIRTIO_NET_CTRL_GUEST_OFFLOADS, > + VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET, > + &offloads, sizeof(offloads)); > +if (unlikely(dev_written < 0)) { > +return dev_written; > +} > + > +return *s->status != VIRTIO_NET_OK; > +} > + > static int vhost_vdpa_net_load(NetClientState *nc) > { > VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); > @@ -702,6 +724,10 @@ static int vhost_vdpa_net_load(NetClientState *nc) > if (unlikely(r)) { > return r; > } > +r = vhost_vdpa_net_load_offloads(s, n); > +if (unlikely(r)) { > +return r; > +} > > return 0; > } > -- > 2.25.1 >
Re: [PATCH v1 3/9] virt: Implement Heki common code
On 17/05/2023 14:47, Madhavan T. Venkataraman wrote: Sorry for the delay. See inline... On 5/8/23 12:29, Wei Liu wrote: On Fri, May 05, 2023 at 05:20:40PM +0200, Mickaël Salaün wrote: From: Madhavan T. Venkataraman Hypervisor Enforced Kernel Integrity (Heki) is a feature that will use the hypervisor to enhance guest virtual machine security. Configuration = Define the config variables for the feature. This feature depends on support from the architecture as well as the hypervisor. Enabling HEKI = Define a kernel command line parameter "heki" to turn the feature on or off. By default, Heki is on. For such a newfangled feature can we have it off by default? Especially when there are unsolved issues around dynamically loaded code. Yes. We can certainly do that. By default the Kconfig option should definitely be off. We also need to change the Kconfig option to only be set if kernel module, JIT, kprobes and other dynamic text change features are disabled at build time (see discussion with Sean). With this new Kconfig option for the static case, I think the boot option should be on by default because otherwise it would not really be possible to switch back to on later without risking silently breaking users' machines. However, we should rename this option to something like "heki_static" to be in line with the new Kconfig option. The goal of Heki is to improve and complement kernel self-protection mechanisms (which don't have boot time options), and to make it available to everyone, see https://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project/Recommended_Settings In practice, it would then be kind of useless to be required to set a boot option to enable Heki (rather than to disable it). [...] 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 3604074a878b..5cf5a7a97811 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -297,6 +297,7 @@ config X86 select FUNCTION_ALIGNMENT_4B imply IMA_SECURE_AND_OR_TRUSTED_BOOTif EFI select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE + select ARCH_SUPPORTS_HEKI if X86_64 Why is there a restriction on X86_64? We want to get the PoC working and reviewed on X64 first. We have tested this only on X64 so far. X86_64 includes Intel CPUs, which can support EPT and MBEC, which are a requirement for Heki. ARM might have similar features but we're focused on x86 for now. As a side note, I only have access to an Intel machine, which means that I cannot work on AMD support. However, I'll be pleased to implement such support if I get access to a machine with a recent AMD CPU. config INSTRUCTION_DECODER def_bool y diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h index a6e8373a5170..42ef1e33b8a5 100644 --- a/arch/x86/include/asm/sections.h +++ b/arch/x86/include/asm/sections.h [...] +#ifdef CONFIG_HEKI + +/* + * Gather all of the statically defined sections so heki_late_init() can + * protect these sections in the host page table. + * + * The sections are defined under "SECTIONS" in vmlinux.lds.S + * Keep this array in sync with SECTIONS. + */ This seems a bit fragile, because it requires constant attention from people who care about this functionality. Can this table be automatically generated? We realize that. But I don't know of a way this can be automatically generated. Also, the permissions for each section is specific to the use of that section. The developer who introduces a new section is the one who will know what the permissions should be. If any one has any ideas of how we can generate this table automatically or even just add a build time check of some sort, please let us know. 
One clean solution might be to parse the vmlinux.lds.S file, extract the sections and their permissions, and fill those into an automatically generated header file. Another way to do it would be to extract sections and associated permissions with objdump, but that could be an issue because of longer build time. A better solution would be to extract such sections and associated permissions at boot time. I guess the kernel already has such helpers used in early boot.
Re: [PATCH v4 2/2] qapi: add '@fdset' feature for BlockdevOptionsVirtioBlkVhostVdpa
On Fri, May 26, 2023 at 05:03:04PM +0200, Stefano Garzarella wrote: > The virtio-blk-vhost-vdpa driver in libblkio 1.3.0 supports the fd > passing through the new 'fd' property. > > Since now we are using qemu_open() on '@path' if the virtio-blk driver > supports the fd passing, let's announce it. > In this way, the management layer can pass the file descriptor of an > already opened vhost-vdpa character device. This is useful especially > when the device can only be accessed with certain privileges. > > Add the '@fdset' feature only when the virtio-blk-vhost-vdpa driver > in libblkio supports it. > > Suggested-by: Markus Armbruster > Signed-off-by: Stefano Garzarella > --- > > Notes: > v4: > - added this patch to allow libvirt to discover we support fdset [Markus] > > meson.build | 4 > qapi/block-core.json | 8 +++- > 2 files changed, 11 insertions(+), 1 deletion(-) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [PATCH v4 1/2] block/blkio: use qemu_open() to support fd passing for virtio-blk
On Fri, May 26, 2023 at 05:03:03PM +0200, Stefano Garzarella wrote: > Some virtio-blk drivers (e.g. virtio-blk-vhost-vdpa) supports the fd > passing. Let's expose this to the user, so the management layer > can pass the file descriptor of an already opened path. > > If the libblkio virtio-blk driver supports fd passing, let's always > use qemu_open() to open the `path`, so we can handle fd passing > from the management layer through the "/dev/fdset/N" special path. > > Signed-off-by: Stefano Garzarella > --- > > Notes: > v4: > - modified commit description > > v3: > - use qemu_open() on `path` to simplify libvirt code [Jonathon] > > block/blkio.c | 53 ++- > 1 file changed, 44 insertions(+), 9 deletions(-) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
[PATCH v5 4/5] parallels: Replace fprintf by qemu_log in check
If the check is called during normal work, tracking of the check must be present in VM logs to have some clues if something goes wrong with the user's data. Signed-off-by: Alexander Ivanov Reviewed-by: Denis V. Lunev --- block/parallels.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 9fa1f93973..d64e8007d5 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -42,6 +42,7 @@ #include "qemu/bswap.h" #include "qemu/bitmap.h" #include "qemu/memalign.h" +#include "qemu/log-for-trace.h" #include "migration/blocker.h" #include "parallels.h" @@ -436,8 +437,8 @@ static void parallels_check_unclean(BlockDriverState *bs, return; } -fprintf(stderr, "%s image was not closed correctly\n", -fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR"); +qemu_log("%s image was not closed correctly\n", + fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR"); res->corruptions++; if (fix & BDRV_FIX_ERRORS) { /* parallels_close will do the job right */ @@ -464,8 +465,8 @@ parallels_check_outside_image(BlockDriverState *bs, BdrvCheckResult *res, for (i = 0; i < s->bat_size; i++) { off = bat2sect(s, i) << BDRV_SECTOR_BITS; if (off + s->cluster_size > size) { -fprintf(stderr, "%s cluster %u is outside image\n", -fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i); +qemu_log("%s cluster %u is outside image\n", + fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i); res->corruptions++; if (fix & BDRV_FIX_ERRORS) { parallels_set_bat_entry(s, i, 0); @@ -551,8 +552,8 @@ parallels_check_leak(BlockDriverState *bs, BdrvCheckResult *res, count = DIV_ROUND_UP(leak_size, s->cluster_size); res->leaks += count; -fprintf(stderr, "%s space leaked at the end of the image %" PRId64 "\n", -fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR", leak_size); +qemu_log("%s space leaked at the end of the image %" PRId64 "\n", + fix & BDRV_FIX_LEAKS ? 
"Repairing" : "ERROR", leak_size); if (fix & BDRV_FIX_LEAKS) { res->leaks_fixed += count; @@ -603,9 +604,8 @@ static int parallels_check_duplicate(BlockDriverState *bs, cluster_index = host_cluster_index(s, off); if (test_bit(cluster_index, bitmap)) { /* this cluster duplicates another one */ -fprintf(stderr, -"%s duplicate offset in BAT entry %u\n", -*fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i); +qemu_log("%s duplicate offset in BAT entry %u\n", + *fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i); res->corruptions++; -- 2.34.1
[PATCH v5 3/5] parallels: Add checking and repairing duplicate offsets in BAT
Cluster offsets must be unique among all the BAT entries. Find duplicate offsets in the BAT and fix them by copying the content of the relevant cluster to a newly allocated cluster and setting the new cluster offset in the duplicated entry. Add host_cluster_index() helper to deduplicate the code. Move parallels_fix_leak() call to parallels_co_check() to fix both types of leak: real corruption and a leak produced by allocate_clusters() during deduplication. Signed-off-by: Alexander Ivanov --- block/parallels.c | 138 -- 1 file changed, 133 insertions(+), 5 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 64850b9655..9fa1f93973 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -136,6 +136,12 @@ static int cluster_remainder(BDRVParallelsState *s, int64_t sector_num, return MIN(nb_sectors, ret); } +static uint32_t host_cluster_index(BDRVParallelsState *s, int64_t off) +{ +off -= le32_to_cpu(s->header->data_off) << BDRV_SECTOR_BITS; +return off / s->cluster_size; +} + static int64_t block_status(BDRVParallelsState *s, int64_t sector_num, int nb_sectors, int *pnum) { @@ -533,7 +539,6 @@ parallels_check_leak(BlockDriverState *bs, BdrvCheckResult *res, { BDRVParallelsState *s = bs->opaque; int64_t count, leak_size; -int ret; leak_size = parallels_get_leak_size(bs, res); if (leak_size < 0) { @@ -550,16 +555,127 @@ parallels_check_leak(BlockDriverState *bs, BdrvCheckResult *res, fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR", leak_size); if (fix & BDRV_FIX_LEAKS) { -ret = parallels_fix_leak(bs, res); -if (ret < 0) { -return ret; -} res->leaks_fixed += count; } return 0; } +static int parallels_check_duplicate(BlockDriverState *bs, + BdrvCheckResult *res, + BdrvCheckMode *fix) +{ +BDRVParallelsState *s = bs->opaque; +QEMUIOVector qiov; +int64_t off, sector; +unsigned long *bitmap; +uint32_t i, bitmap_size, cluster_index; +int n, ret = 0; +uint64_t *buf = NULL; + +/* + * Create a bitmap of used clusters. 
+ * If a bit is set, there is a BAT entry pointing to this cluster. + * Loop through the BAT entries, check bits relevant to an entry offset. + * If bit is set, this entry is duplicated. Otherwise set the bit. + * + * We shouldn't worry about newly allocated clusters outside the image + * because they are created higher than any existing cluster pointed to by + * a BAT entry. + */ +bitmap_size = host_cluster_index(s, res->image_end_offset); +if (bitmap_size == 0) { +return 0; +} + +bitmap = bitmap_new(bitmap_size); + +buf = qemu_memalign(4096, s->cluster_size); +qemu_iovec_init(&qiov, 0); +qemu_iovec_add(&qiov, buf, s->cluster_size); + +for (i = 0; i < s->bat_size; i++) { +off = bat2sect(s, i) << BDRV_SECTOR_BITS; +if (off == 0) { +continue; +} + +cluster_index = host_cluster_index(s, off); +if (test_bit(cluster_index, bitmap)) { +/* this cluster duplicates another one */ +fprintf(stderr, +"%s duplicate offset in BAT entry %u\n", +*fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR", i); + +res->corruptions++; + +if (*fix & BDRV_FIX_ERRORS) { +/* + * Reset the entry and allocate a new cluster + * for the relevant guest offset. In this way we let + * the lower layer place the new cluster properly. + * Copy the original cluster to the allocated one. + */ +parallels_set_bat_entry(s, i, 0); + +ret = bdrv_co_pread(bs->file, off, s->cluster_size, buf, 0); +if (ret < 0) { +res->check_errors++; +goto out; +} + +sector = (i * (int64_t)s->cluster_size) >> BDRV_SECTOR_BITS; +sector = allocate_clusters(bs, sector, s->tracks, &n); +if (sector < 0) { +res->check_errors++; +ret = sector; +goto out; +} +off = sector << BDRV_SECTOR_BITS; + +ret = bdrv_co_pwritev(bs->file, off, s->cluster_size, &qiov, 0); +if (ret < 0) { +res->check_errors++; +goto out; +} + +if (off + s->cluster_size > res->image_end_offset) { +res->image_end_offset = off + s->cluster_size; +} + +/* + * In the future allocate_cluster() will reuse holed offsets + * inside the image. Keep the used clusters bitmap content +
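The bitmap scan at the core of parallels_check_duplicate() can be modeled in a few lines. BAT entries are reduced here to plain host cluster indices (the bat2sect()/data_off arithmetic is elided) and the function name is illustrative:

```c
#include <stdint.h>

/* Walk BAT entries; a host cluster referenced twice is a duplicate.
 * Returns the index of the first duplicating entry, or bat_size if the
 * BAT is clean.  'bitmap' must be zeroed and large enough to cover the
 * highest cluster index. */
static unsigned find_first_duplicate(const uint32_t *bat, unsigned bat_size,
                                     uint64_t *bitmap)
{
    for (unsigned i = 0; i < bat_size; i++) {
        uint32_t idx = bat[i];
        if (idx == 0) {
            continue;           /* unallocated entry */
        }
        if (bitmap[idx / 64] & (UINT64_C(1) << (idx % 64))) {
            return i;           /* entry i duplicates an earlier one */
        }
        bitmap[idx / 64] |= UINT64_C(1) << (idx % 64);
    }
    return bat_size;
}
```

The real repair path then resets the duplicating entry, allocates a fresh cluster and copies the data over, rather than just reporting the index.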
[PATCH v5 0/5] parallels: Add duplication check, repair at open, fix bugs
This patchset should be applied on top of [PATCH v11 00/12] parallels: Refactor the code of images checks and fix a bug. Fix incorrect data end calculation in parallels_open(). Split image leak handling to separate check and fix helpers. Add checking and repairing duplicate offsets in BAT. Replace fprintf() by qemu_log(). Image repairing in parallels_open(). v5: 3: Fixed a byteorder bug, fixed zero-length image handling and fixed uint32 truncation. v4: 2,5: Rebased. v3: 2: Added (size >= res->image_end_offset) assert and changed the comment in parallels_get_leak_size(). Changed error printing and leaks fixing order. 3: Removed highest_offset() helper, instead image_end_offset field is used. 5: Moved highest_offset() code to parallels_open() - now it is used only in this function. Fixed data_end update condition. Fixed a leak of s->migration_blocker. v2: 2: Moved outside parallels_check_leak() 2 helpers: parallels_get_leak_size() and parallels_fix_leak(). 3: Used highest_offset() helper in parallels_check_leak(). Fixed a typo. Added comments. Replaced g_malloc() call by qemu_memalign(). Replaced bdrv_pread() call by bdrv_co_pread(). Got rid of keeping bytes and sectors in the same variable. Added setting the bitmap of the used clusters for a new allocated cluster if it isn't out of the bitmap. Moved the leak fix to the end of all the checks. Removed a dependence on image format for the duplicate check. 4 (old): Merged this patch to the previous. 4 (former 5): Fixed formatting. 5 (former 6): Fixed comments. Added O_INACTIVE check in the condition. Replaced inuse detection by header_unclean checking. Replaced playing with coroutines by bdrv_check() usage. 
Alexander Ivanov (5): parallels: Incorrect data end calculation in parallels_open() parallels: Split image leak handling to separate check and fix helpers parallels: Add checking and repairing duplicate offsets in BAT parallels: Replace fprintf by qemu_log in check parallels: Image repairing in parallels_open() block/parallels.c | 295 -- 1 file changed, 232 insertions(+), 63 deletions(-) -- 2.34.1
[PATCH v5 5/5] parallels: Image repairing in parallels_open()
Repair an image at opening if the image is unclean or out-of-image corruption was detected. Signed-off-by: Alexander Ivanov --- block/parallels.c | 65 +-- 1 file changed, 35 insertions(+), 30 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index d64e8007d5..7bbd5cb112 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -962,7 +962,7 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, BDRVParallelsState *s = bs->opaque; ParallelsHeader ph; int ret, size, i; -int64_t file_nb_sectors; +int64_t file_nb_sectors, sector; QemuOpts *opts = NULL; Error *local_err = NULL; char *buf; @@ -1039,35 +1039,6 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, } s->bat_bitmap = (uint32_t *)(s->header + 1); -for (i = 0; i < s->bat_size; i++) { -int64_t off = bat2sect(s, i); -if (off >= file_nb_sectors) { -if (flags & BDRV_O_CHECK) { -continue; -} -error_setg(errp, "parallels: Offset %" PRIi64 " in BAT[%d] entry " - "is larger than file size (%" PRIi64 ")", - off << BDRV_SECTOR_BITS, i, - file_nb_sectors << BDRV_SECTOR_BITS); -ret = -EINVAL; -goto fail; -} -if (off >= s->data_end) { -s->data_end = off + s->tracks; -} -} - -if (le32_to_cpu(ph.inuse) == HEADER_INUSE_MAGIC) { -/* Image was not closed correctly. 
The check is mandatory */ -s->header_unclean = true; -if ((flags & BDRV_O_RDWR) && !(flags & BDRV_O_CHECK)) { -error_setg(errp, "parallels: Image was not closed correctly; " - "cannot be opened read/write"); -ret = -EACCES; -goto fail; -} -} - opts = qemu_opts_create(&parallels_runtime_opts, NULL, 0, errp); if (!opts) { goto fail_options; } @@ -1130,6 +1101,40 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, goto fail; } qemu_co_mutex_init(&s->lock); + +if (le32_to_cpu(ph.inuse) == HEADER_INUSE_MAGIC) { +s->header_unclean = true; +} + +for (i = 0; i < s->bat_size; i++) { +sector = bat2sect(s, i); +if (sector + s->tracks > s->data_end) { +s->data_end = sector + s->tracks; +} +} + +/* + * We don't repair the image here if it's opened for checks. Also we don't + * want to change inactive images and can't change readonly images. + */ +if ((flags & (BDRV_O_CHECK | BDRV_O_INACTIVE)) || !(flags & BDRV_O_RDWR)) { +return 0; +} + +/* + * Repair the image if it's dirty or + * out-of-image corruption was detected. + */ +if (s->data_end > file_nb_sectors || s->header_unclean) { +BdrvCheckResult res; +ret = bdrv_check(bs, &res, BDRV_FIX_ERRORS | BDRV_FIX_LEAKS); +if (ret < 0) { +error_free(s->migration_blocker); +error_setg_errno(errp, -ret, "Could not repair corrupted image"); +goto fail; +} +} + return 0; fail_format: -- 2.34.1
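The open-time decision in the hunk above boils down to a small predicate. A standalone sketch of it (the flag values here are illustrative stand-ins, not QEMU's real BDRV_O_* constants):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for QEMU's BDRV_O_* flags (values are arbitrary). */
enum {
    O_FLAG_RDWR     = 1 << 0,
    O_FLAG_CHECK    = 1 << 1,
    O_FLAG_INACTIVE = 1 << 2,
};

/*
 * Model of the patch's logic: never auto-repair when the image is opened
 * for an explicit check, is inactive (e.g. incoming migration), or is
 * read-only; otherwise repair when the header is dirty or the BAT points
 * past the end of the file.
 */
static bool should_auto_repair(int flags, bool header_unclean,
                               long data_end, long file_nb_sectors)
{
    if ((flags & (O_FLAG_CHECK | O_FLAG_INACTIVE)) || !(flags & O_FLAG_RDWR)) {
        return false;
    }
    return header_unclean || data_end > file_nb_sectors;
}
```

The ordering matters: the "may we touch the image at all" test comes first, so a read-only open of a dirty image succeeds without modification.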
[PATCH v5 1/5] parallels: Incorrect data end calculation in parallels_open()
The BDRVParallelsState structure contains a data_end field that is measured in sectors. In parallels_open(), this field is initially set from the data_off field of the parallels image header. According to the parallels format documentation, the data_off field contains an offset, in sectors, from the start of the file to the start of the data area. For "WithoutFreeSpace" images: if data_off is zero, the offset is calculated as the end of the BAT table plus some padding to ensure sector size alignment. The parallels_open() function has code for handling a zero value in data_off, but as a result data_end contains the offset in bytes. Replace the alignment to sector size by division by sector size and fix the comparison with s->header_size. Signed-off-by: Alexander Ivanov Reviewed-by: Denis V. Lunev --- block/parallels.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 7c263d5085..1ec98c722b 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -861,9 +861,9 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags, } s->data_end = le32_to_cpu(ph.data_off); if (s->data_end == 0) { -s->data_end = ROUND_UP(bat_entry_off(s->bat_size), BDRV_SECTOR_SIZE); +s->data_end = DIV_ROUND_UP(size, BDRV_SECTOR_SIZE); } -if (s->data_end < s->header_size) { +if (s->data_end < (s->header_size >> BDRV_SECTOR_BITS)) { /* there is not enough unused space to fit to block align between BAT and actual data. We can't avoid read-modify-write... */ s->header_size = size; -- 2.34.1
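The unit mix-up the patch fixes is easy to see with plain numbers: rounding a byte offset *up to* the sector size still yields bytes, while dividing by the sector size yields sectors. A standalone sketch of both computations (`bat_end` stands in for the byte offset of the BAT end; the macros mirror QEMU's ROUND_UP/DIV_ROUND_UP):

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BITS 9
#define SECTOR_SIZE (1u << SECTOR_BITS)                 /* 512 bytes */
#define ROUND_UP(n, d)     ((((n) + (d) - 1) / (d)) * (d))
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Old code: rounds the BAT end up to a sector boundary -> result in BYTES. */
static int64_t data_end_old(int64_t bat_end)
{
    return ROUND_UP(bat_end, SECTOR_SIZE);
}

/*
 * Fixed code: divides by the sector size (rounding up) -> result in
 * SECTORS, matching the unit data_end has everywhere else.
 */
static int64_t data_end_fixed(int64_t bat_end)
{
    return DIV_ROUND_UP(bat_end, SECTOR_SIZE);
}
```

With a 1000-byte BAT the old code stores 1024 (bytes) into a field every other user treats as sectors; the fixed code stores 2.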
[PATCH v5 2/5] parallels: Split image leak handling to separate check and fix helpers
We need to fix the leak after deduplication in the next patch. Move leak fixing to a separate helper parallels_fix_leak() and add a parallels_get_leak_size() helper which is used in parallels_fix_leak() and parallels_check_leak(). Signed-off-by: Alexander Ivanov --- block/parallels.c | 86 +-- 1 file changed, 61 insertions(+), 25 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index 1ec98c722b..64850b9655 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -482,43 +482,79 @@ parallels_check_outside_image(BlockDriverState *bs, BdrvCheckResult *res, return 0; } +static int64_t parallels_get_leak_size(BlockDriverState *bs, + BdrvCheckResult *res) +{ +int64_t size; + +size = bdrv_getlength(bs->file->bs); +if (size < 0) { +return size; +} + +/* + * Before any usage of this function, image_end_offset has to be set to + * the highest offset in the BAT, excluding out-of-image offsets. + */ +assert(size >= res->image_end_offset); + +return size - res->image_end_offset; +} + +static int parallels_fix_leak(BlockDriverState *bs, + BdrvCheckResult *res) +{ +Error *local_err = NULL; +int64_t size; +int ret; + +size = parallels_get_leak_size(bs, res); +if (size <= 0) { +return size; +} + +/* + * In order to really repair the image, we must shrink it. + * That means we have to pass exact=true.
+ */ +ret = bdrv_co_truncate(bs->file, res->image_end_offset, true, + PREALLOC_MODE_OFF, 0, &local_err); +if (ret < 0) { +error_report_err(local_err); +return ret; +} + +return 0; +} + static int coroutine_fn GRAPH_RDLOCK parallels_check_leak(BlockDriverState *bs, BdrvCheckResult *res, BdrvCheckMode fix) { BDRVParallelsState *s = bs->opaque; -int64_t size; +int64_t count, leak_size; int ret; -size = bdrv_getlength(bs->file->bs); -if (size < 0) { +leak_size = parallels_get_leak_size(bs, res); +if (leak_size < 0) { res->check_errors++; -return size; +return leak_size; +} +if (leak_size == 0) { +return 0; } -if (size > res->image_end_offset) { -int64_t count; -count = DIV_ROUND_UP(size - res->image_end_offset, s->cluster_size); -fprintf(stderr, "%s space leaked at the end of the image %" PRId64 "\n", -fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR", -size - res->image_end_offset); -res->leaks += count; -if (fix & BDRV_FIX_LEAKS) { -Error *local_err = NULL; +count = DIV_ROUND_UP(leak_size, s->cluster_size); +res->leaks += count; +fprintf(stderr, "%s space leaked at the end of the image %" PRId64 "\n", +fix & BDRV_FIX_LEAKS ? "Repairing" : "ERROR", leak_size); -/* - * In order to really repair the image, we must shrink it. - * That means we have to pass exact=true. - */ -ret = bdrv_co_truncate(bs->file, res->image_end_offset, true, - PREALLOC_MODE_OFF, 0, &local_err); -if (ret < 0) { -error_report_err(local_err); -res->check_errors++; -return ret; -} -res->leaks_fixed += count +if (fix & BDRV_FIX_LEAKS) { +ret = parallels_fix_leak(bs, res); +if (ret < 0) { +return ret; } +res->leaks_fixed += count; } return 0; -- 2.34.1
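The arithmetic split between the two new helpers is simple: the leak size is the file length minus the highest offset recorded in the BAT, and the leak is reported as a cluster count, rounded up. A standalone model of that arithmetic (names shortened; the real code obtains the file length via bdrv_getlength()):

```c
#include <assert.h>
#include <stdint.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* parallels_get_leak_size() analogue: bytes past the last BAT-used offset. */
static int64_t leak_size(int64_t file_size, int64_t image_end_offset)
{
    /* caller must have clamped image_end_offset to in-image offsets */
    assert(file_size >= image_end_offset);
    return file_size - image_end_offset;
}

/* parallels_check_leak() reports the leak as a number of clusters. */
static int64_t leak_clusters(int64_t leak, int64_t cluster_size)
{
    return DIV_ROUND_UP(leak, cluster_size);
}
```

The assert mirrors the one added in v3: it only holds because out-of-image BAT entries were already excluded when image_end_offset was computed.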
Re: [PATCH] igb: Add Function Level Reset to PF and VF
On 5/29/23 09:45, Akihiko Odaki wrote: On 2023/05/29 16:01, Cédric Le Goater wrote: On 5/29/23 04:45, Akihiko Odaki wrote: On 2023/05/28 19:50, Sriram Yagnaraman wrote: -Original Message- From: Cédric Le Goater Sent: Friday, 26 May 2023 19:31 To: qemu-devel@nongnu.org Cc: Akihiko Odaki ; Sriram Yagnaraman ; Jason Wang ; Cédric Le Goater Subject: [PATCH] igb: Add Function Level Reset to PF and VF The Intel 82576EB GbE Controller say that the Physical and Virtual Functions support Function Level Reset. Add the capability to each device model. Cc: Akihiko Odaki Fixes: 3a977deebe6b ("Intrdocue igb device emulation") Signed-off-by: Cédric Le Goater --- hw/net/igb.c | 3 +++ hw/net/igbvf.c | 3 +++ 2 files changed, 6 insertions(+) diff --git a/hw/net/igb.c b/hw/net/igb.c index 1c989d767725..08e389338dca 100644 --- a/hw/net/igb.c +++ b/hw/net/igb.c @@ -101,6 +101,7 @@ static void igb_write_config(PCIDevice *dev, uint32_t addr, trace_igb_write_config(addr, val, len); pci_default_write_config(dev, addr, val, len); + pcie_cap_flr_write_config(dev, addr, val, len); if (range_covers_byte(addr, len, PCI_COMMAND) && (dev->config[PCI_COMMAND] & PCI_COMMAND_MASTER)) { @@ -427,6 +428,8 @@ static void igb_pci_realize(PCIDevice *pci_dev, Error **errp) } /* PCIe extended capabilities (in order) */ + pcie_cap_flr_init(pci_dev); + if (pcie_aer_init(pci_dev, 1, 0x100, 0x40, errp) < 0) { hw_error("Failed to initialize AER capability"); } diff --git a/hw/net/igbvf.c b/hw/net/igbvf.c index 284ea611848b..0a58dad06802 100644 --- a/hw/net/igbvf.c +++ b/hw/net/igbvf.c @@ -204,6 +204,7 @@ static void igbvf_write_config(PCIDevice *dev, uint32_t addr, uint32_t val, { trace_igbvf_write_config(addr, val, len); pci_default_write_config(dev, addr, val, len); + pcie_cap_flr_write_config(dev, addr, val, len); } static uint64_t igbvf_mmio_read(void *opaque, hwaddr addr, unsigned size) @@ -266,6 +267,8 @@ static void igbvf_pci_realize(PCIDevice *dev, Error **errp) hw_error("Failed to initialize PCIe 
capability"); } + pcie_cap_flr_init(dev); Sorry for my naive question, and perhaps not related to your patch, IGBVF device class doesn't seem to have any reset functions registered via igbvf_class_init(). So, I am guessing an FLR will not trigger igb_vf_reset(), which is probably what we want. It does through the VTCTRL registers. You're right. Advertising FLR capability without implementing it can confuse the guest though such probability is quite a low in practice. The reset should be implemented first. I was looking at an issue from a VFIO perspective which does a FLR on a device when pass through. Software and FLR are equivalent for a VF. They should be equivalent according to the datasheet, but unfortunately current igbvf implementation does nothing when reset. What Sriram proposes is to add code to actually write VTCTRL when FLR occurred and make FLR and software reset equivalent. And I think that should be done before this change; you should advertise FLR capability after the reset is actually implemented. AFAICT, the VFs are reset correctly by the OS when created or probed and by QEMU when they are passthrough in a nested guest OS (with this patch). igb_vf_reset() is clearly called in QEMU, see routine e1000_reset_hw_vf() in Linux. I don't think a reset op is necessary because VFs are software constructs but I don't mind really. If so, then, I wouldn't mimic what the OS does by writing the RST bit in the CTRL register of the VF, I would simply install igb_vf_reset() as a reset handler. Thanks, C. Regards, Akihiko Odaki I am not sure a VF needs more really, since it is all controlled by the PF. > C. + if (pcie_aer_init(dev, 1, 0x100, 0x40, errp) < 0) { hw_error("Failed to initialize AER capability"); } -- 2.40.1
Re: [PATCH v2 2/2] vhost: release virtqueue objects in error path
On Mon, May 29, 2023 at 05:13:33PM +0530, P J P wrote: > From: Prasad Pandit > > vhost_dev_start function does not release virtqueue objects when > event_notifier_init() function fails. Release virtqueue objects > and log a message about function failure. > > Signed-off-by: Prasad Pandit Fixes: f9a09ca3ea ("vhost: add support for configure interrupt") Reviewed-by: Peter Xu > --- > hw/virtio/vhost.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > v2: split a single patch into two. > > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index 6be4a0626a..1de3029ae7 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -1942,7 +1942,8 @@ int vhost_dev_start(struct vhost_dev *hdev, > VirtIODevice *vdev, bool vrings) > r = event_notifier_init( > &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier, 0); > if (r < 0) { > -return r; > +VHOST_OPS_DEBUG(r, "event_notifier_init failed"); > +goto fail_vq; > } > event_notifier_test_and_clear( > &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier); > -- > 2.40.1 > -- Peter Xu
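The bug class in both vhost patches of this series is the same: an early `return` from the middle of a multi-step start function skips the unwinding that the later failure labels perform. The usual goto-cleanup idiom, sketched standalone with counters standing in for real resources (all names here are illustrative):

```c
#include <assert.h>

static int acquired_a, acquired_b;

static int init_a(void) { acquired_a++; return 0; }
static int init_b(int fail) { if (fail) { return -1; } acquired_b++; return 0; }
static void release_a(void) { acquired_a--; }

/*
 * Each failure jumps to a label that undoes everything acquired so far.
 * A bare "return r" after init_a() succeeded -- the pattern the patch
 * removes -- would leak resource A.
 */
static int dev_start(int fail_b)
{
    int r = init_a();
    if (r < 0) {
        return r;               /* nothing acquired yet, plain return is fine */
    }
    r = init_b(fail_b);
    if (r < 0) {
        goto fail_a;            /* not "return r": must release A first */
    }
    return 0;

fail_a:
    release_a();
    return r;
}
```

In vhost_dev_start() the equivalent of `fail_a` is the existing `fail_vq` label, which tears down the already-initialized virtqueues.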
Re: [PATCH v2 1/2] vhost: release memory_listener object in error path
On Mon, May 29, 2023 at 05:13:32PM +0530, P J P wrote: > From: Prasad Pandit > > vhost_dev_start function does not release memory_listener object > in case of an error. This may crash the guest when vhost is unable > to set memory table: > > stack trace of thread 125653: > Program terminated with signal SIGSEGV, Segmentation fault > #0 memory_listener_register (qemu-kvm + 0x6cda0f) > #1 vhost_dev_start (qemu-kvm + 0x699301) > #2 vhost_net_start (qemu-kvm + 0x45b03f) > #3 virtio_net_set_status (qemu-kvm + 0x665672) > #4 qmp_set_link (qemu-kvm + 0x548fd5) > #5 net_vhost_user_event (qemu-kvm + 0x552c45) > #6 tcp_chr_connect (qemu-kvm + 0x88d473) > #7 tcp_chr_new_client (qemu-kvm + 0x88cf83) > #8 tcp_chr_accept (qemu-kvm + 0x88b429) > #9 qio_net_listener_channel_func (qemu-kvm + 0x7ac07c) > #10 g_main_context_dispatch (libglib-2.0.so.0 + 0x54e2f) > > Release memory_listener objects in the error path. > > Signed-off-by: Prasad Pandit Reviewed-by: Peter Xu Maybe worthwhile too with: Fixes: c471ad0e9b ("vhost_net: device IOTLB support") > --- > hw/virtio/vhost.c | 3 +++ > 1 file changed, 3 insertions(+) > > v2: split a single patch into two. Mention about vhost set mem table failure > resulting in guest crash. > > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index 23da579ce2..6be4a0626a 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -2004,6 +2004,9 @@ fail_vq: > } > > fail_mem: > +if (vhost_dev_has_iommu(hdev)) { > +memory_listener_unregister(&hdev->iommu_listener); > +} > fail_features: > vdev->vhost_started = false; > hdev->started = false; > -- > 2.40.1 > -- Peter Xu
Re: [PATCH v4 2/9] migration: Implement switchover ack logic
On Sun, May 28, 2023 at 05:06:45PM +0300, Avihai Horon wrote: > Implement switchover ack logic. This prevents the source from stopping > the VM and completing the migration until an ACK is received from the > destination that it's OK to do so. > > To achieve this, a new SaveVMHandlers handler switchover_ack_needed() > and a new return path message MIG_RP_MSG_SWITCHOVER_ACK are added. > > The switchover_ack_needed() handler is called during migration setup in > the destination to check if switchover ack is used by the migrated > device. > > When switchover is approved by all migrated devices in the destination > that support this capability, the MIG_RP_MSG_SWITCHOVER_ACK return path > message is sent to the source to notify it that it's OK to do > switchover. > > Signed-off-by: Avihai Horon Reviewed-by: Peter Xu -- Peter Xu
Re: [PATCH v1] migration: fix migrate_params_test_apply to set the dest param correctly
On Mon, May 29, 2023 at 12:55:30PM +, Wang, Wei W wrote: > On Saturday, May 27, 2023 5:49 AM, Peter Xu wrote: > > On Wed, May 24, 2023 at 04:01:57PM +0800, Wei Wang wrote: > > > qmp_migrate_set_parameters expects to use tmp for parameters check, so > > > migrate_params_test_apply is expected to copy the related fields from > > > params to tmp. So fix migrate_params_test_apply to use the function > > > parameter, *dest, rather than the global one. The dest->has_xxx (xxx > > > is the feature name) related fields need to be set, as they will be > > > checked by migrate_params_check. > > > > I think it's fine to do as what you suggested, but I don't see much benefit > > either.. the old code IIUC will check all params even if 1 param changed, > > while after your change it only checks the modified ones. > > > > There's slight benefits but not so much, especially "22+, 2-" LOCs, because > > we don't really do this a lot; some more sanity check also makes sense to me > > even if everything is always checked, so we'll hit errors if anything > > accidentally goes wrong too. > > > > Is there a real bug somewhere? > > Yes. Please see qmp_migrate_set_parameters: > > #1 migrate_params_test_apply(params, &tmp); > > #2 if (!migrate_params_check(&tmp, errp)) { > /* Invalid parameter */ > return; > } > #3 migrate_params_apply(params, errp); > > #2 tries to do params check using tmp, which is expected to be set up > by #1, but #1 didn't use "&tmp", #1 initialized "&tmp" with current parameters, here: *dest = migrate_get_current()->parameters; ? > so "tmp" doesn't seem to store the > valid values as expected for the check (that is, #2 above isn't effectively > doing any check for the user input params) Do you have a reproducer where qmp set param will not check properly on user input? > > The alternative fix would be to remove the intermediate "tmp" params, > but this might break the usage from commit 1bda8b3c6950, so need thoughts > from Markus if we want go for this approach. Thanks, -- Peter Xu
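The pattern under discussion - copy user-supplied fields onto a scratch copy of the current parameters, validate the scratch, and only then apply - can be modeled in a few lines. This is only illustrative; the field names and the `Params` struct are invented stand-ins, not QEMU's MigrationParameters:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct Params {
    bool has_bandwidth;
    int  bandwidth;
} Params;

static Params current = { .has_bandwidth = true, .bandwidth = 100 };

/*
 * migrate_params_test_apply() analogue: start from the current values,
 * then overlay only the fields the user actually supplied - writing into
 * the dest argument, not into any global.
 */
static void test_apply(const Params *user, Params *tmp)
{
    *tmp = current;
    if (user->has_bandwidth) {
        tmp->has_bandwidth = true;   /* has_xxx must be set for the check */
        tmp->bandwidth = user->bandwidth;
    }
}

/* migrate_params_check() analogue: validate the merged result. */
static bool check(const Params *p)
{
    return !p->has_bandwidth || p->bandwidth > 0;
}

/* qmp_migrate_set_parameters() analogue: reject before touching state. */
static bool set_params(const Params *user)
{
    Params tmp;
    test_apply(user, &tmp);
    if (!check(&tmp)) {
        return false;                /* invalid input, current stays intact */
    }
    current = tmp;
    return true;
}

/* convenience wrapper used below */
static bool set_bandwidth(int bw)
{
    Params u = { .has_bandwidth = true, .bandwidth = bw };
    return set_params(&u);
}
```

The bug being debated is precisely the case where test_apply() fills something other than the `tmp` passed by the caller, so check() ends up validating stale values instead of the merged user input.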
Re: [PATCH v3 03/14] nbd/server: Prepare for alternate-size headers
On 15.05.23 22:53, Eric Blake wrote: Upstream NBD now documents[1] an extension that supports 64-bit effect lengths in requests. As part of that extension, the size of the reply headers will change in order to permit a 64-bit length in the reply for symmetry[2]. Additionally, where the reply header is currently 16 bytes for simple reply, and 20 bytes for structured reply; with the extension enabled, there will only be one structured reply type, of 32 bytes. Since we are already wired up to use iovecs, it is easiest to allow for this change in header size by splitting each structured reply across two iovecs, one for the header (which will become variable-length in a future patch according to client negotiation), and the other for the payload, and removing the header from the payload struct definitions. Interestingly, the client side code never utilized the packed types, so only the server code needs to be updated. Actually, it does, since previous patch :) But, it becomes even better now? Anyway some note somewhere is needed I think. [1] https://github.com/NetworkBlockDevice/nbd/blob/extension-ext-header/doc/proto.md as of NBD commit e6f3b94a934 [2] Note that on the surface, this is because some future server might permit a 4G+ NBD_CMD_READ and need to reply with that much data in one transaction. But even though the extended reply length is widened to 64 bits, for now the NBD spec is clear that servers will not reply with more than a maximum payload bounded by the 32-bit NBD_INFO_BLOCK_SIZE field; allowing a client and server to mutually agree to transactions larger than 4G would require yet another extension. Signed-off-by: Eric Blake Reviewed-by: Vladimir Sementsov-Ogievskiy --- include/block/nbd.h | 8 +++--- nbd/server.c| 64 - 2 files changed, 44 insertions(+), 28 deletions(-) [..] 
-static inline void set_be_chunk(NBDStructuredReplyChunk *chunk, uint16_t flags, -uint16_t type, uint64_t handle, uint32_t length) +static inline void set_be_chunk(NBDClient *client, struct iovec *iov, Suggestion: pass niov here too, and calculate "length" as a sum of iov lengths starting from second extent automatically. Also, worth documenting that set_be_chunk() expects half-initialized iov, with .iov_base pointing to NBDReply (IN parameter) and .iov_len unset (OUT parameter), as that's not usual practice +uint16_t flags, uint16_t type, +uint64_t handle, uint32_t length) { +NBDStructuredReplyChunk *chunk = iov->iov_base; + +iov->iov_len = sizeof(*chunk); stl_be_p(&chunk->magic, NBD_STRUCTURED_REPLY_MAGIC); stw_be_p(&chunk->flags, flags); stw_be_p(&chunk->type, type); [..] -- Best regards, Vladimir
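Splitting the header into its own iovec means the serializer both fills the header bytes and sets iov_len - the half-initialized-iov convention Vladimir suggests documenting. A standalone sketch of a 20-byte structured-reply chunk header going into its header iovec (the magic constant is the one the NBD protocol defines for structured replies; the big-endian stores are written out by hand rather than using QEMU's st*_be_p helpers):

```c
#include <assert.h>
#include <stdint.h>
#include <sys/uio.h>

#define NBD_STRUCTURED_REPLY_MAGIC 0x668e33efU  /* per the NBD protocol doc */

/* hand-rolled big-endian stores (QEMU uses stl_be_p/stw_be_p) */
static void st32be(uint8_t *p, uint32_t v)
{
    p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v;
}
static void st16be(uint8_t *p, uint16_t v)
{
    p[0] = v >> 8; p[1] = v;
}
static void st64be(uint8_t *p, uint64_t v)
{
    st32be(p, v >> 32); st32be(p + 4, (uint32_t)v);
}

/*
 * IN:  iov->iov_base points at a 20-byte header buffer.
 * OUT: header bytes filled, iov->iov_len set to the header size.
 * The payload travels in the *next* iovec; "length" only describes it.
 */
static void set_be_chunk(struct iovec *iov, uint16_t flags, uint16_t type,
                         uint64_t handle, uint32_t length)
{
    uint8_t *p = iov->iov_base;

    st32be(p + 0,  NBD_STRUCTURED_REPLY_MAGIC);
    st16be(p + 4,  flags);
    st16be(p + 6,  type);
    st64be(p + 8,  handle);
    st32be(p + 16, length);
    iov->iov_len = 20;
}

/* small self-check used by the tests below */
static int demo_header_ok(void)
{
    uint8_t buf[20];
    struct iovec iov = { .iov_base = buf, .iov_len = 0 };

    set_be_chunk(&iov, 0, 5, 0x1122334455667788ULL, 4096);
    return iov.iov_len == 20 && buf[0] == 0x66 && buf[1] == 0x8e &&
           buf[6] == 0 && buf[7] == 5 && buf[8] == 0x11 && buf[18] == 0x10;
}
```

Vladimir's suggestion to also pass niov would let the helper sum the payload iovecs itself instead of trusting a caller-computed length.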
Re: [PULL 00/10] ppc queue
29.05.2023 09:30, Nicholas Piggin wrote: On Mon May 29, 2023 at 4:01 PM AEST, Michael Tokarev wrote: .. They certainly fix some parts of target emulation, but what is the guidance for backporting those type of fixes? Most of the patches I sent including 2,3 were just found from inspection or new test code and not real software failing. Should just simple ones go in? 32-bit SPRs do not fix entirely the behaviour of all SPRs, just one aspect. In another fix I had (that didn't make it in this merge), was a bit more complicated and the first iteration caused a deadlock that didn't show up in basic test like booting Linux. My guess is that fixes that correct an issue with real software running on the target should be ported to stable. Perhaps "obviously correct" small fixes as well. But not sure about larger changes. This is exactly why I asked, - because I don't clearly understand how important these to have in -stable. And also to remind that -stable exist, just in case.. ;) Ah okay, makes sense. I was just clarifying myself since I wasn't too sure. We do not have strict rules for stuff which should go to -stable. In the wiki - https://www.qemu.org/docs/master/devel/stable-process.html - it is stated as vague as If you think the patch would be important for users of the current release (or for a distribution picking fixes), it is usually a good candidate for stable. so things are decided on a case-by-case basis. Sometimes even spelling fixes gets picked up, sometimes much larger and invasive changes has to be picked to fix a real bug. Once again, there's no strict rules. Myself, I haven't been in this process before, but I see it from a downstream PoV, as I maintain qemu in debian for quite some years. 
A bug might not be reported by actual users, but if it has good potential for breaking something, - *maybe* like in this case when qemu behaves differently than an actual HW (once again, I don't know what the impact of *these* fixes, or lack thereof, are), - and at the same time the fix is small, self-contained and "obviously correct" (in other words, has low potential of breaking something else), - I tend to pick it up, to bring behavior to match the specs like in this case. We had this numerous times: something is broken, but apparently no one uses that. And out of the sudden people start hitting it in real life, like even qemu itself which uses debian-provided packages in its own CI. Just to discover this bug is known for a long time and is fixed in master long ago too. Sure thing there might be changes which fix more serious issues, but which are based on some subsystem rewrite, - with these, it's more difficult to decide what to do - to keep bug or to risk breaking other stuff. I omitted a few fixes for 7.2 due to this very situation: when the infrastructure used in new code did not exist in 7.2 and quite a lot of other changes needs to be back-ported. There's always a tradeoff: the "importance" of the bug and the amount of stuff it requires to back-port. Unimportant fixes (eg, a missing free() in error path in init function) which require other code changes are obviously not for -stable. On the other end, small but important changes (eg a fix for a crash/deadlock which many users are hitting) is obviously a good fit for -stable. And anything in-between is, sure thing, grey, with all its shades. This is why we're more dependent on the subsystem maintainers or authors of the particular changes, - since they know much better when any given fix is important or not, know much better about possible impact and so on. Maybe it's just cosmetics which can safely be ignored, or maybe it's a big thing which is worth to backport other stuff for it to work. 
It is also much easier to think about this when you have the context which you just fixed, still in your mind. And this is why I'm asking. Thank you! /mjt
[PATCH v2] meson: Avoid implicit declaration of functions
While detecting the presence of a function via 'cc.links()' gives the desired result (i.e. it detects whether the function is present), it also produces a warning on systems where the function is not present (into meson-log.txt), e.g.: qemu.git/build/meson-private/tmph74x3p38/testfile.c:2:34: \ warning: implicit declaration of function 'malloc_trim' [-Wimplicit-function-declaration] We can check whether a given function exists via 'cc.has_function()' or whether STATX_* macros exist via 'cc.has_header_symbol()'. Resolves: https://bugs.gentoo.org/898810 Signed-off-by: Michal Privoznik --- v2 of: https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg07138.html diff to v1: - Drop cc.links() as it's redundant meson.build | 26 +- 1 file changed, 5 insertions(+), 21 deletions(-) diff --git a/meson.build b/meson.build index 2d48aa1e2e..21061b19d4 100644 --- a/meson.build +++ b/meson.build @@ -1797,8 +1797,7 @@ malloc = [] if get_option('malloc') == 'system' has_malloc_trim = \ get_option('malloc_trim').allowed() and \ -cc.links('''#include <malloc.h> -int main(void) { malloc_trim(0); return 0; }''') +cc.has_function('malloc_trim', prefix: '#include <malloc.h>') else has_malloc_trim = false malloc = cc.find_library(get_option('malloc'), required: true) @@ -1811,34 +1810,19 @@ if not has_malloc_trim and get_option('malloc_trim').enabled() endif endif -# Check whether the glibc provides statx() - gnu_source_prefix = ''' #ifndef _GNU_SOURCE #define _GNU_SOURCE #endif ''' -statx_test = gnu_source_prefix + ''' - #include <sys/stat.h> - int main(void) { -struct statx statxbuf; -statx(0, "", 0, STATX_BASIC_STATS, &statxbuf); -return 0; - }''' -has_statx = cc.links(statx_test) +# Check whether the glibc provides STATX_BASIC_STATS + +has_statx = cc.has_header_symbol('sys/stat.h', 'STATX_BASIC_STATS', prefix: gnu_source_prefix) # Check whether statx() provides mount ID information -statx_mnt_id_test = gnu_source_prefix + ''' - #include <sys/stat.h> - int main(void) { -struct statx statxbuf; -statx(0, "", 0, STATX_BASIC_STATS | STATX_MNT_ID, &statxbuf);
-return statxbuf.stx_mnt_id; - }''' - -has_statx_mnt_id = cc.links(statx_mnt_id_test) +has_statx_mnt_id = cc.has_header_symbol('sys/stat.h', 'STATX_MNT_ID', prefix: gnu_source_prefix) have_vhost_user_blk_server = get_option('vhost_user_blk_server') \ .require(targetos == 'linux', -- 2.39.3
Re: [PATCH v3] target: ppc: Use MSR_HVB bit to get the target endianness for memory dump
"Nicholas Piggin" writes: > On Tue May 23, 2023 at 2:02 AM AEST, Narayana Murty N wrote: >> Changes since V2: >> commit message modified as per feedbak from Nicholas Piggin. >> Changes since V1: >> https://lore.kernel.org/qemu-devel/20230420145055.10196-1-nnmli...@linux.ibm.com/ >> The approach to solve the issue was changed based on feedback from >> Fabiano Rosas on patch V1. >> --- >> target/ppc/arch_dump.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/target/ppc/arch_dump.c b/target/ppc/arch_dump.c >> index f58e6359d5..a8315659d9 100644 >> --- a/target/ppc/arch_dump.c >> +++ b/target/ppc/arch_dump.c >> @@ -237,7 +237,7 @@ int cpu_get_dump_info(ArchDumpInfo *info, >> info->d_machine = PPC_ELF_MACHINE; >> info->d_class = ELFCLASS; >> >> -if (ppc_interrupts_little_endian(cpu, cpu->env.has_hv_mode)) { >> +if (ppc_interrupts_little_endian(cpu, !!(cpu->env.msr_mask & MSR_HVB))) >> { >> info->d_endian = ELFDATA2LSB; >> } else { >> info->d_endian = ELFDATA2MSB; > > Oh, and now I see it cpu_get_dump_info just picks the first CPU to test > this! So a test that can change at runtime is surely not the right one. > If you use MSR[HV] then if you have a SMP machine that is doing a bunch > of things and you want to dump to debug the system, this will just > randomly give you a wrong-endian dump if CPU0 just happened to be > running some KVM guest. > Not sure if you are just thinking out loud about MSR_HV or if you mistook MSR_HVB for MSR_HV. But just in case: The env->msr_mask is what tells us what MSR bits are supported for this CPU, i.e. what features it contains. So MSR_HVB is not equivalent to MSR[HV], but merely informs that this CPU knows about MSR_HV. We then store that information at has_hv_mode. The MSR_HVB bit changes only once (at cpu_ppc_set_vhyp), after we decide whether to use vhyp. 
So: env->has_hv_mode == cpu supports HV mode; MSR_HVB=1 == cpu supports HV mode AND we allow the OS to run with MSR_HV=1; MSR_HVB=0 == cpu doesn't support HV mode OR cpu supports HV mode, but we don't allow the OS to run with MSR_HV=1 because QEMU is the HV (i.e. vhyp); For the arch_dump code, passing (msr_mask & MSR_HVB) ends up meaning: "can this OS ever run with MSR_HV=1?", which for emulated powernv would be Yes and for pseries (incl. nested) would be No. So for emulated powernv we always look at the emulated HILE and for pseries we always look at LPCR_ILE. > But even ignoring all of that, let's say you have all the same endian > host and guest kernels and dump format... If you dump host memory then > you need host kernel and structures to debug guest kernel/image > (assuming crash or the person debugging it is smart enough to make sense > of it). So I don't see how you can sanely use the crash dump of host > memory with the guest kernel. I must still be missing something. > You're right, the QEMU instance that receives the qmp_dump_guest_memory command should generate a memory dump of whatever OS is running in it at a direct level. So the endianness reported in the dump's ELF should match the guest OS image ELF (roughly, there's -bios which doesn't require an ELF).
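Fabiano's breakdown can be condensed into a tiny decision function. A standalone model of it (purely illustrative of the discussion; the real code reaches this via ppc_interrupts_little_endian() with the MSR_HVB test as its hv argument):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the endian choice in cpu_get_dump_info() after the patch:
 * if the OS can ever run with MSR[HV]=1 (powernv: MSR_HVB set in
 * msr_mask), the emulated HILE bit decides the interrupt endianness;
 * otherwise (pseries, incl. nested) the LPCR[ILE] bit decides.
 */
static char dump_endianness(bool msr_mask_has_hvb, bool hile, bool lpcr_ile)
{
    bool little = msr_mask_has_hvb ? hile : lpcr_ile;
    return little ? 'L' : 'B';   /* ELFDATA2LSB vs ELFDATA2MSB */
}
```

The point of the fix is that this choice depends only on static machine configuration (msr_mask), not on the runtime MSR[HV] of whichever CPU happens to be first.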
[PATCH v10 0/7] igb: packet-split descriptors support
Purposes of this series of patches: * introduce packet-split RX descriptors support. This feature is used by Linux VF driver for MTU values from 2048. * refactor RX descriptor handling for introduction of packet-split RX descriptors support * fix descriptors flags handling Tomasz Dzieciol (7): igb: remove TCP ACK detection igb: rename E1000E_RingInfo_st igb: RX descriptors guest writting refactoring igb: RX payload guest writting refactoring igb: add IPv6 extended headers traffic detection igb: packet-split descriptors support e1000e: rename e1000e_ba_state and e1000e_write_hdr_to_rx_buffers hw/net/e1000e_core.c | 78 +++-- hw/net/igb_core.c| 730 --- hw/net/igb_regs.h| 20 +- hw/net/trace-events | 6 +- tests/qtest/libqos/igb.c | 5 + 5 files changed, 592 insertions(+), 247 deletions(-) -- 2.25.1
[PATCH v10 1/7] igb: remove TCP ACK detection
TCP ACK detection is no longer present in igb. Signed-off-by: Tomasz Dzieciol --- hw/net/igb_core.c | 5 - 1 file changed, 5 deletions(-) diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c index d00b1caa6a..e927c51061 100644 --- a/hw/net/igb_core.c +++ b/hw/net/igb_core.c @@ -1327,11 +1327,6 @@ igb_build_rx_metadata(IGBCore *core, trace_e1000e_rx_metadata_ip_id(*ip_id); } -if (l4hdr_proto == ETH_L4_HDR_PROTO_TCP && net_rx_pkt_is_tcp_ack(pkt)) { -*status_flags |= E1000_RXD_STAT_ACK; -trace_e1000e_rx_metadata_ack(); -} - if (pkt_info) { *pkt_info = rss_info->enabled ? rss_info->type : 0; -- 2.25.1
[PATCH v10 6/7] igb: packet-split descriptors support
Packet-split descriptors are used by Linux VF driver for MTU values from 2048 Signed-off-by: Tomasz Dzieciol --- hw/net/igb_core.c | 348 ++-- hw/net/igb_regs.h | 9 ++ hw/net/trace-events | 2 +- 3 files changed, 316 insertions(+), 43 deletions(-) diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c index b54d7af8d8..0a56aac1eb 100644 --- a/hw/net/igb_core.c +++ b/hw/net/igb_core.c @@ -267,6 +267,29 @@ igb_rx_use_legacy_descriptor(IGBCore *core) return false; } +typedef struct E1000ERingInfo { +int dbah; +int dbal; +int dlen; +int dh; +int dt; +int idx; +} E1000ERingInfo; + +static uint32_t +igb_rx_queue_desctyp_get(IGBCore *core, const E1000ERingInfo *r) +{ +return core->mac[E1000_SRRCTL(r->idx) >> 2] & E1000_SRRCTL_DESCTYPE_MASK; +} + +static bool +igb_rx_use_ps_descriptor(IGBCore *core, const E1000ERingInfo *r) +{ +uint32_t desctyp = igb_rx_queue_desctyp_get(core, r); +return desctyp == E1000_SRRCTL_DESCTYPE_HDR_SPLIT || + desctyp == E1000_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS; +} + static inline bool igb_rss_enabled(IGBCore *core) { @@ -694,15 +717,6 @@ static uint32_t igb_rx_wb_eic(IGBCore *core, int queue_idx) return (ent & E1000_IVAR_VALID) ? 
BIT(ent & 0x1f) : 0; } -typedef struct E1000ERingInfo { -int dbah; -int dbal; -int dlen; -int dh; -int dt; -int idx; -} E1000ERingInfo; - static inline bool igb_ring_empty(IGBCore *core, const E1000ERingInfo *r) { @@ -1233,12 +1247,31 @@ igb_read_lgcy_rx_descr(IGBCore *core, struct e1000_rx_desc *desc, } static inline void -igb_read_adv_rx_descr(IGBCore *core, union e1000_adv_rx_desc *desc, - hwaddr *buff_addr) +igb_read_adv_rx_single_buf_descr(IGBCore *core, union e1000_adv_rx_desc *desc, + hwaddr *buff_addr) { *buff_addr = le64_to_cpu(desc->read.pkt_addr); } +static inline void +igb_read_adv_rx_split_buf_descr(IGBCore *core, union e1000_adv_rx_desc *desc, +hwaddr *buff_addr) +{ +buff_addr[0] = le64_to_cpu(desc->read.hdr_addr); +buff_addr[1] = le64_to_cpu(desc->read.pkt_addr); +} + +typedef struct IGBBAState { +uint16_t written[IGB_MAX_PS_BUFFERS]; +uint8_t cur_idx; +} IGBBAState; + +typedef struct IGBSplitDescriptorData { +bool sph; +bool hbo; +size_t hdr_len; +} IGBSplitDescriptorData; + typedef struct IGBPacketRxDMAState { size_t size; size_t total_size; @@ -1249,20 +1282,42 @@ typedef struct IGBPacketRxDMAState { uint32_t rx_desc_header_buf_size; struct iovec *iov; size_t iov_ofs; +bool do_ps; bool is_first; -uint16_t written; -hwaddr ba; +IGBBAState bastate; +hwaddr ba[IGB_MAX_PS_BUFFERS]; +IGBSplitDescriptorData ps_desc_data; } IGBPacketRxDMAState; static inline void -igb_read_rx_descr(IGBCore *core, union e1000_rx_desc_union *desc, - hwaddr *buff_addr) +igb_read_rx_descr(IGBCore *core, + union e1000_rx_desc_union *desc, + IGBPacketRxDMAState *pdma_st, + const E1000ERingInfo *r) { +uint32_t desc_type; + if (igb_rx_use_legacy_descriptor(core)) { -igb_read_lgcy_rx_descr(core, &desc->legacy, buff_addr); -} else { -igb_read_adv_rx_descr(core, &desc->adv, buff_addr); +igb_read_lgcy_rx_descr(core, &desc->legacy, &pdma_st->ba[1]); +pdma_st->ba[0] = 0; +return; +} + +/* advanced header split descriptor */ +if (igb_rx_use_ps_descriptor(core, r)) { +igb_read_adv_rx_split_buf_descr(core,
>adv, _st->ba[0]); +return; +} + +/* descriptor replication modes not supported */ +desc_type = igb_rx_queue_desctyp_get(core, r); +if (desc_type != E1000_SRRCTL_DESCTYPE_ADV_ONEBUF) { +trace_igb_wrn_rx_desc_modes_not_supp(desc_type); } + +/* advanced single buffer descriptor */ +igb_read_adv_rx_single_buf_descr(core, >adv, _st->ba[1]); +pdma_st->ba[0] = 0; } static void @@ -1405,6 +1460,13 @@ igb_write_lgcy_rx_descr(IGBCore *core, struct e1000_rx_desc *desc, desc->status = (uint8_t) le32_to_cpu(status_flags); } +static bool +igb_rx_ps_descriptor_split_always(IGBCore *core, const E1000ERingInfo *r) +{ +uint32_t desctyp = igb_rx_queue_desctyp_get(core, r); +return desctyp == E1000_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS; +} + static uint16_t igb_rx_desc_get_packet_type(IGBCore *core, struct NetRxPkt *pkt, uint16_t etqf) { @@ -1495,15 +1557,54 @@ igb_write_adv_rx_descr(IGBCore *core, union e1000_adv_rx_desc *desc, } static inline void -igb_write_rx_descr(IGBCore *core, union e1000_rx_desc_union *desc, - struct NetRxPkt *pkt, const E1000E_RSSInfo *rss_info, - uint16_t etqf, bool ts, uint16_t length) +igb_write_adv_ps_rx_descr(IGBCore *core, + union e1000_adv_rx_desc *desc, + struct NetRxPkt
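For reviewers skimming the SRRCTL handling above, the descriptor-type selection that igb_rx_use_ps_descriptor() performs can be condensed into a standalone sketch. The constants follow the 82576 datasheet DESCTYPE encoding; the names are illustrative, not the exact QEMU macros:

```c
/*
 * Minimal sketch of the SRRCTL.DESCTYPE decoding used to pick the RX
 * descriptor mode.  DESCTYPE lives in SRRCTL bits 27:25; the values
 * below follow the 82576 datasheet (001b = advanced one-buffer,
 * 010b = header split, 101b = header split always).
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SRRCTL_DESCTYPE_SHIFT 25
#define SRRCTL_DESCTYPE_MASK  (0x7u << SRRCTL_DESCTYPE_SHIFT)

enum {
    DESCTYPE_ADV_ONEBUF       = 1,
    DESCTYPE_HDR_SPLIT        = 2, /* split only when a header is recognized */
    DESCTYPE_HDR_SPLIT_ALWAYS = 5, /* always use both buffers */
};

static bool rx_use_ps_descriptor(uint32_t srrctl)
{
    uint32_t t = (srrctl & SRRCTL_DESCTYPE_MASK) >> SRRCTL_DESCTYPE_SHIFT;
    return t == DESCTYPE_HDR_SPLIT || t == DESCTYPE_HDR_SPLIT_ALWAYS;
}
```

Anything other than the two header-split encodings falls back to the single-buffer path (with a trace warning for the unsupported replication modes), matching the igb_read_rx_descr() flow in the patch.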
[PATCH v10 2/7] igb: rename E1000E_RingInfo_st
Rename E1000E_RingInfo_st and E1000E_RingInfo according to the QEMU typedefs guide. Signed-off-by: Tomasz Dzieciol --- hw/net/e1000e_core.c | 34 +- hw/net/igb_core.c| 42 +- 2 files changed, 38 insertions(+), 38 deletions(-) diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c index 9f185d099c..f6b628eaef 100644 --- a/hw/net/e1000e_core.c +++ b/hw/net/e1000e_core.c @@ -810,24 +810,24 @@ e1000e_txdesc_writeback(E1000ECore *core, dma_addr_t base, return e1000e_tx_wb_interrupt_cause(core, queue_idx); } -typedef struct E1000E_RingInfo_st { +typedef struct E1000ERingInfo { int dbah; int dbal; int dlen; int dh; int dt; int idx; -} E1000E_RingInfo; +} E1000ERingInfo; static inline bool -e1000e_ring_empty(E1000ECore *core, const E1000E_RingInfo *r) +e1000e_ring_empty(E1000ECore *core, const E1000ERingInfo *r) { return core->mac[r->dh] == core->mac[r->dt] || core->mac[r->dt] >= core->mac[r->dlen] / E1000_RING_DESC_LEN; } static inline uint64_t -e1000e_ring_base(E1000ECore *core, const E1000E_RingInfo *r) +e1000e_ring_base(E1000ECore *core, const E1000ERingInfo *r) { uint64_t bah = core->mac[r->dbah]; uint64_t bal = core->mac[r->dbal]; @@ -836,13 +836,13 @@ e1000e_ring_base(E1000ECore *core, const E1000E_RingInfo *r) } static inline uint64_t -e1000e_ring_head_descr(E1000ECore *core, const E1000E_RingInfo *r) +e1000e_ring_head_descr(E1000ECore *core, const E1000ERingInfo *r) { return e1000e_ring_base(core, r) + E1000_RING_DESC_LEN * core->mac[r->dh]; } static inline void -e1000e_ring_advance(E1000ECore *core, const E1000E_RingInfo *r, uint32_t count) +e1000e_ring_advance(E1000ECore *core, const E1000ERingInfo *r, uint32_t count) { core->mac[r->dh] += count; @@ -852,7 +852,7 @@ e1000e_ring_advance(E1000ECore *core, const E1000E_RingInfo *r, uint32_t count) } static inline uint32_t -e1000e_ring_free_descr_num(E1000ECore *core, const E1000E_RingInfo *r) +e1000e_ring_free_descr_num(E1000ECore *core, const E1000ERingInfo *r) { trace_e1000e_ring_free_space(r->idx, core->mac[r->dlen], 
core->mac[r->dh], core->mac[r->dt]); @@ -871,19 +871,19 @@ e1000e_ring_free_descr_num(E1000ECore *core, const E1000E_RingInfo *r) } static inline bool -e1000e_ring_enabled(E1000ECore *core, const E1000E_RingInfo *r) +e1000e_ring_enabled(E1000ECore *core, const E1000ERingInfo *r) { return core->mac[r->dlen] > 0; } static inline uint32_t -e1000e_ring_len(E1000ECore *core, const E1000E_RingInfo *r) +e1000e_ring_len(E1000ECore *core, const E1000ERingInfo *r) { return core->mac[r->dlen]; } typedef struct E1000E_TxRing_st { -const E1000E_RingInfo *i; +const E1000ERingInfo *i; struct e1000e_tx *tx; } E1000E_TxRing; @@ -896,7 +896,7 @@ e1000e_mq_queue_idx(int base_reg_idx, int reg_idx) static inline void e1000e_tx_ring_init(E1000ECore *core, E1000E_TxRing *txr, int idx) { -static const E1000E_RingInfo i[E1000E_NUM_QUEUES] = { +static const E1000ERingInfo i[E1000E_NUM_QUEUES] = { { TDBAH, TDBAL, TDLEN, TDH, TDT, 0 }, { TDBAH1, TDBAL1, TDLEN1, TDH1, TDT1, 1 } }; @@ -908,13 +908,13 @@ e1000e_tx_ring_init(E1000ECore *core, E1000E_TxRing *txr, int idx) } typedef struct E1000E_RxRing_st { -const E1000E_RingInfo *i; +const E1000ERingInfo *i; } E1000E_RxRing; static inline void e1000e_rx_ring_init(E1000ECore *core, E1000E_RxRing *rxr, int idx) { -static const E1000E_RingInfo i[E1000E_NUM_QUEUES] = { +static const E1000ERingInfo i[E1000E_NUM_QUEUES] = { { RDBAH0, RDBAL0, RDLEN0, RDH0, RDT0, 0 }, { RDBAH1, RDBAL1, RDLEN1, RDH1, RDT1, 1 } }; @@ -930,7 +930,7 @@ e1000e_start_xmit(E1000ECore *core, const E1000E_TxRing *txr) dma_addr_t base; struct e1000_tx_desc desc; bool ide = false; -const E1000E_RingInfo *txi = txr->i; +const E1000ERingInfo *txi = txr->i; uint32_t cause = E1000_ICS_TXQE; if (!(core->mac[TCTL] & E1000_TCTL_EN)) { @@ -960,7 +960,7 @@ e1000e_start_xmit(E1000ECore *core, const E1000E_TxRing *txr) } static bool -e1000e_has_rxbufs(E1000ECore *core, const E1000E_RingInfo *r, +e1000e_has_rxbufs(E1000ECore *core, const E1000ERingInfo *r, size_t total_size) { uint32_t bufs = 
e1000e_ring_free_descr_num(core, r); @@ -1460,7 +1460,7 @@ e1000e_update_rx_stats(E1000ECore *core, size_t pkt_size, size_t pkt_fcs_size) } static inline bool -e1000e_rx_descr_threshold_hit(E1000ECore *core, const E1000E_RingInfo *rxi) +e1000e_rx_descr_threshold_hit(E1000ECore *core, const E1000ERingInfo *rxi) { return e1000e_ring_free_descr_num(core, rxi) == e1000e_ring_len(core, rxi) >> core->rxbuf_min_shift; @@ -1521,7 +1521,7 @@ e1000e_write_packet_to_guest(E1000ECore *core, struct NetRxPkt *pkt, struct iovec *iov =
[PATCH v10 3/7] igb: RX descriptors guest writing refactoring
Refactoring is done in preparation for support of multiple advanced descriptors RX modes, especially packet-split modes. Signed-off-by: Tomasz Dzieciol --- hw/net/igb_core.c | 170 +++- hw/net/igb_regs.h | 10 +-- hw/net/trace-events | 4 +- 3 files changed, 96 insertions(+), 88 deletions(-) diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c index 051980b4f5..3a6df0d55e 100644 --- a/hw/net/igb_core.c +++ b/hw/net/igb_core.c @@ -1281,15 +1281,11 @@ igb_verify_csum_in_sw(IGBCore *core, } static void -igb_build_rx_metadata(IGBCore *core, - struct NetRxPkt *pkt, - bool is_eop, - const E1000E_RSSInfo *rss_info, uint16_t etqf, bool ts, - uint16_t *pkt_info, uint16_t *hdr_info, - uint32_t *rss, - uint32_t *status_flags, - uint16_t *ip_id, - uint16_t *vlan_tag) +igb_build_rx_metadata_common(IGBCore *core, + struct NetRxPkt *pkt, + bool is_eop, + uint32_t *status_flags, + uint16_t *vlan_tag) { struct virtio_net_hdr *vhdr; bool hasip4, hasip6, csum_valid; @@ -1298,7 +1294,6 @@ igb_build_rx_metadata(IGBCore *core, *status_flags = E1000_RXD_STAT_DD; /* No additional metadata needed for non-EOP descriptors */ -/* TODO: EOP apply only to status so don't skip whole function. */ if (!is_eop) { goto func_exit; } @@ -1315,59 +1310,6 @@ igb_build_rx_metadata(IGBCore *core, trace_e1000e_rx_metadata_vlan(*vlan_tag); } -/* Packet parsing results */ -if ((core->mac[RXCSUM] & E1000_RXCSUM_PCSD) != 0) { -if (rss_info->enabled) { -*rss = cpu_to_le32(rss_info->hash); -trace_igb_rx_metadata_rss(*rss); -} -} else if (hasip4) { -*status_flags |= E1000_RXD_STAT_IPIDV; -*ip_id = cpu_to_le16(net_rx_pkt_get_ip_id(pkt)); -trace_e1000e_rx_metadata_ip_id(*ip_id); -} - -if (pkt_info) { -*pkt_info = rss_info->enabled ? 
rss_info->type : 0; - -if (etqf < 8) { -*pkt_info |= (BIT(11) | etqf) << 4; -} else { -if (hasip4) { -*pkt_info |= E1000_ADVRXD_PKT_IP4; -} - -if (hasip6) { -*pkt_info |= E1000_ADVRXD_PKT_IP6; -} - -switch (l4hdr_proto) { -case ETH_L4_HDR_PROTO_TCP: -*pkt_info |= E1000_ADVRXD_PKT_TCP; -break; - -case ETH_L4_HDR_PROTO_UDP: -*pkt_info |= E1000_ADVRXD_PKT_UDP; -break; - -case ETH_L4_HDR_PROTO_SCTP: -*pkt_info |= E1000_ADVRXD_PKT_SCTP; -break; - -default: -break; -} -} -} - -if (hdr_info) { -*hdr_info = 0; -} - -if (ts) { -*status_flags |= BIT(16); -} - /* RX CSO information */ if (hasip6 && (core->mac[RFCTL] & E1000_RFCTL_IPV6_XSUM_DIS)) { trace_e1000e_rx_metadata_ipv6_sum_disabled(); @@ -1423,43 +1365,108 @@ func_exit: static inline void igb_write_lgcy_rx_descr(IGBCore *core, struct e1000_rx_desc *desc, struct NetRxPkt *pkt, -const E1000E_RSSInfo *rss_info, uint16_t etqf, bool ts, +const E1000E_RSSInfo *rss_info, uint16_t length) { -uint32_t status_flags, rss; -uint16_t ip_id; +uint32_t status_flags; assert(!rss_info->enabled); + +memset(desc, 0, sizeof(*desc)); desc->length = cpu_to_le16(length); -desc->csum = 0; +igb_build_rx_metadata_common(core, pkt, pkt != NULL, + _flags, + >special); -igb_build_rx_metadata(core, pkt, pkt != NULL, - rss_info, etqf, ts, - NULL, NULL, , - _flags, _id, - >special); desc->errors = (uint8_t) (le32_to_cpu(status_flags) >> 24); desc->status = (uint8_t) le32_to_cpu(status_flags); } +static uint16_t +igb_rx_desc_get_packet_type(IGBCore *core, struct NetRxPkt *pkt, uint16_t etqf) +{ +uint16_t pkt_type; +bool hasip4, hasip6; +EthL4HdrProto l4hdr_proto; + +if (etqf < 8) { +pkt_type = BIT(11) | etqf; +return pkt_type; +} + +net_rx_pkt_get_protocols(pkt, , , _proto); + +if (hasip6 && !(core->mac[RFCTL] & E1000_RFCTL_IPV6_DIS)) { +pkt_type = E1000_ADVRXD_PKT_IP6; +} else if (hasip4) { +pkt_type = E1000_ADVRXD_PKT_IP4; +} else { +pkt_type = 0; +} + +switch (l4hdr_proto) { +case ETH_L4_HDR_PROTO_TCP: +pkt_type |= E1000_ADVRXD_PKT_TCP; +break; 
+case
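The pkt_type encoding that igb_rx_desc_get_packet_type() factors out above can be sketched in isolation. The bit values are taken from the E1000_ADVRXD_PKT_* defines in the patch; the helper signature is simplified for illustration and is not the QEMU function:

```c
/*
 * Sketch of the advanced RX descriptor packet-type field: an ETQF
 * match (etqf < 8) overrides protocol detection; otherwise the field
 * is built from the detected L3/L4 protocols.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ADVRXD_PKT_IP4 (1u << 0)
#define ADVRXD_PKT_IP6 (1u << 2)
#define ADVRXD_PKT_TCP (1u << 4)
#define ADVRXD_PKT_UDP (1u << 5)

static uint16_t rx_pkt_type(uint16_t etqf, bool hasip4, bool hasip6,
                            bool is_tcp, bool is_udp)
{
    uint16_t t;

    if (etqf < 8) {
        return (1u << 11) | etqf;   /* ETQF-matched packet */
    }
    t = hasip6 ? ADVRXD_PKT_IP6 : (hasip4 ? ADVRXD_PKT_IP4 : 0);
    if (is_tcp) {
        t |= ADVRXD_PKT_TCP;
    } else if (is_udp) {
        t |= ADVRXD_PKT_UDP;
    }
    return t;
}
```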
[PATCH v10 4/7] igb: RX payload guest writing refactoring
Refactoring is done in preparation for support of multiple advanced descriptors RX modes, especially packet-split modes. Signed-off-by: Tomasz Dzieciol --- hw/net/e1000e_core.c | 18 ++-- hw/net/igb_core.c| 213 +-- tests/qtest/libqos/igb.c | 5 + 3 files changed, 150 insertions(+), 86 deletions(-) diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c index f6b628eaef..f29e5e2897 100644 --- a/hw/net/e1000e_core.c +++ b/hw/net/e1000e_core.c @@ -1418,11 +1418,11 @@ e1000e_write_hdr_to_rx_buffers(E1000ECore *core, } static void -e1000e_write_to_rx_buffers(E1000ECore *core, - hwaddr ba[MAX_PS_BUFFERS], - e1000e_ba_state *bastate, - const char *data, - dma_addr_t data_len) +e1000e_write_payload_frag_to_rx_buffers(E1000ECore *core, +hwaddr ba[MAX_PS_BUFFERS], +e1000e_ba_state *bastate, +const char *data, +dma_addr_t data_len) { while (data_len > 0) { uint32_t cur_buf_len = core->rxbuf_sizes[bastate->cur_idx]; @@ -1594,8 +1594,10 @@ e1000e_write_packet_to_guest(E1000ECore *core, struct NetRxPkt *pkt, while (copy_size) { iov_copy = MIN(copy_size, iov->iov_len - iov_ofs); -e1000e_write_to_rx_buffers(core, ba, , -iov->iov_base + iov_ofs, iov_copy); +e1000e_write_payload_frag_to_rx_buffers(core, ba, , +iov->iov_base + +iov_ofs, +iov_copy); copy_size -= iov_copy; iov_ofs += iov_copy; @@ -1607,7 +1609,7 @@ e1000e_write_packet_to_guest(E1000ECore *core, struct NetRxPkt *pkt, if (desc_offset + desc_size >= total_size) { /* Simulate FCS checksum presence in the last descriptor */ -e1000e_write_to_rx_buffers(core, ba, , +e1000e_write_payload_frag_to_rx_buffers(core, ba, , (const char *) _pad, e1000x_fcs_len(core->mac)); } } diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c index 3a6df0d55e..8c248683c3 100644 --- a/hw/net/igb_core.c +++ b/hw/net/igb_core.c @@ -941,6 +941,14 @@ igb_has_rxbufs(IGBCore *core, const E1000ERingInfo *r, size_t total_size) bufsize; } +static uint32_t +igb_rxhdrbufsize(IGBCore *core, const E1000ERingInfo *r) +{ +uint32_t srrctl = 
core->mac[E1000_SRRCTL(r->idx) >> 2]; +return (srrctl & E1000_SRRCTL_BSIZEHDRSIZE_MASK) >> + E1000_SRRCTL_BSIZEHDRSIZE_SHIFT; +} + void igb_start_recv(IGBCore *core) { @@ -1231,6 +1239,21 @@ igb_read_adv_rx_descr(IGBCore *core, union e1000_adv_rx_desc *desc, *buff_addr = le64_to_cpu(desc->read.pkt_addr); } +typedef struct IGBPacketRxDMAState { +size_t size; +size_t total_size; +size_t ps_hdr_len; +size_t desc_size; +size_t desc_offset; +uint32_t rx_desc_packet_buf_size; +uint32_t rx_desc_header_buf_size; +struct iovec *iov; +size_t iov_ofs; +bool is_first; +uint16_t written; +hwaddr ba; +} IGBPacketRxDMAState; + static inline void igb_read_rx_descr(IGBCore *core, union e1000_rx_desc_union *desc, hwaddr *buff_addr) @@ -1514,19 +1537,6 @@ igb_pci_dma_write_rx_desc(IGBCore *core, PCIDevice *dev, dma_addr_t addr, } } -static void -igb_write_to_rx_buffers(IGBCore *core, -PCIDevice *d, -hwaddr ba, -uint16_t *written, -const char *data, -dma_addr_t data_len) -{ -trace_igb_rx_desc_buff_write(ba, *written, data, data_len); -pci_dma_write(d, ba + *written, data, data_len); -*written += data_len; -} - static void igb_update_rx_stats(IGBCore *core, const E1000ERingInfo *rxi, size_t pkt_size, size_t pkt_fcs_size) @@ -1552,6 +1562,93 @@ igb_rx_descr_threshold_hit(IGBCore *core, const E1000ERingInfo *rxi) ((core->mac[E1000_SRRCTL(rxi->idx) >> 2] >> 20) & 31) * 16; } +static void +igb_truncate_to_descriptor_size(IGBPacketRxDMAState *pdma_st, size_t *size) +{ +if (*size > pdma_st->rx_desc_packet_buf_size) { +*size = pdma_st->rx_desc_packet_buf_size; +} +} + +static void +igb_write_payload_frag_to_rx_buffers(IGBCore *core, + PCIDevice *d, + hwaddr ba, + uint16_t *written, + uint32_t cur_buf_len, +
[PATCH v10 5/7] igb: add IPv6 extended headers traffic detection
Signed-off-by: Tomasz Dzieciol --- hw/net/igb_core.c | 4 +++- hw/net/igb_regs.h | 1 + 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/hw/net/igb_core.c b/hw/net/igb_core.c index 8c248683c3..b54d7af8d8 100644 --- a/hw/net/igb_core.c +++ b/hw/net/igb_core.c @@ -1420,7 +1420,9 @@ igb_rx_desc_get_packet_type(IGBCore *core, struct NetRxPkt *pkt, uint16_t etqf) net_rx_pkt_get_protocols(pkt, &hasip4, &hasip6, &l4hdr_proto); if (hasip6 && !(core->mac[RFCTL] & E1000_RFCTL_IPV6_DIS)) { -pkt_type = E1000_ADVRXD_PKT_IP6; +eth_ip6_hdr_info *ip6hdr_info = net_rx_pkt_get_ip6_info(pkt); +pkt_type = ip6hdr_info->has_ext_hdrs ? E1000_ADVRXD_PKT_IP6E : + E1000_ADVRXD_PKT_IP6; } else if (hasip4) { pkt_type = E1000_ADVRXD_PKT_IP4; } else { diff --git a/hw/net/igb_regs.h b/hw/net/igb_regs.h index 71a8833229..36763f2ff7 100644 --- a/hw/net/igb_regs.h +++ b/hw/net/igb_regs.h @@ -694,6 +694,7 @@ union e1000_adv_rx_desc { #define E1000_ADVRXD_PKT_IP4 BIT(0) #define E1000_ADVRXD_PKT_IP6 BIT(2) +#define E1000_ADVRXD_PKT_IP6E BIT(3) #define E1000_ADVRXD_PKT_TCP BIT(4) #define E1000_ADVRXD_PKT_UDP BIT(5) #define E1000_ADVRXD_PKT_SCTP BIT(6) -- 2.25.1
[PATCH v10 7/7] e1000e: rename e1000e_ba_state and e1000e_write_hdr_to_rx_buffers
Rename e1000e_ba_state and e1000e_write_hdr_to_rx_buffers for consistency with igb. Signed-off-by: Tomasz Dzieciol --- hw/net/e1000e_core.c | 28 +++- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c index f29e5e2897..bbeafd6fd7 100644 --- a/hw/net/e1000e_core.c +++ b/hw/net/e1000e_core.c @@ -1397,17 +1397,17 @@ e1000e_pci_dma_write_rx_desc(E1000ECore *core, dma_addr_t addr, } } -typedef struct e1000e_ba_state_st { +typedef struct E1000EBAState { uint16_t written[MAX_PS_BUFFERS]; uint8_t cur_idx; -} e1000e_ba_state; +} E1000EBAState; static inline void -e1000e_write_hdr_to_rx_buffers(E1000ECore *core, - hwaddr ba[MAX_PS_BUFFERS], - e1000e_ba_state *bastate, - const char *data, - dma_addr_t data_len) +e1000e_write_hdr_frag_to_rx_buffers(E1000ECore *core, +hwaddr ba[MAX_PS_BUFFERS], +E1000EBAState *bastate, +const char *data, +dma_addr_t data_len) { assert(data_len <= core->rxbuf_sizes[0] - bastate->written[0]); @@ -1420,7 +1420,7 @@ e1000e_write_hdr_to_rx_buffers(E1000ECore *core, static void e1000e_write_payload_frag_to_rx_buffers(E1000ECore *core, hwaddr ba[MAX_PS_BUFFERS], -e1000e_ba_state *bastate, +E1000EBAState *bastate, const char *data, dma_addr_t data_len) { @@ -1530,7 +1530,7 @@ e1000e_write_packet_to_guest(E1000ECore *core, struct NetRxPkt *pkt, do { hwaddr ba[MAX_PS_BUFFERS]; -e1000e_ba_state bastate = { { 0 } }; +E1000EBAState bastate = { { 0 } }; bool is_last = false; desc_size = total_size - desc_offset; @@ -1568,8 +1568,10 @@ e1000e_write_packet_to_guest(E1000ECore *core, struct NetRxPkt *pkt, iov_copy = MIN(ps_hdr_len - ps_hdr_copied, iov->iov_len - iov_ofs); -e1000e_write_hdr_to_rx_buffers(core, ba, &bastate, - iov->iov_base, iov_copy); +e1000e_write_hdr_frag_to_rx_buffers(core, ba, +&bastate, +iov->iov_base, +iov_copy); copy_size -= iov_copy; ps_hdr_copied += iov_copy; @@ -1585,8 +1587,8 @@ e1000e_write_packet_to_guest(E1000ECore *core, struct NetRxPkt *pkt, } else { /* Leave buffer 0 of each 
descriptor except first */ /* empty as per spec 7.1.5.1 */ -e1000e_write_hdr_to_rx_buffers(core, ba, , - NULL, 0); +e1000e_write_hdr_frag_to_rx_buffers(core, ba, , +NULL, 0); } } -- 2.25.1
[PATCH 2/2] vdpa: Allow VIRTIO_NET_F_CTRL_GUEST_OFFLOADS in SVQ
Enable SVQ with VIRTIO_NET_F_CTRL_GUEST_OFFLOADS feature. Signed-off-by: Hawkins Jiawei --- net/vhost-vdpa.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 682c749b19..cc52b7f0ad 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -85,6 +85,7 @@ const int vdpa_feature_bits[] = { static const uint64_t vdpa_svq_device_features = BIT_ULL(VIRTIO_NET_F_CSUM) | BIT_ULL(VIRTIO_NET_F_GUEST_CSUM) | +BIT_ULL(VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) | BIT_ULL(VIRTIO_NET_F_MTU) | BIT_ULL(VIRTIO_NET_F_MAC) | BIT_ULL(VIRTIO_NET_F_GUEST_TSO4) | -- 2.25.1
[PATCH 1/2] vdpa: Add vhost_vdpa_net_load_offloads
This patch introduces vhost_vdpa_net_load_offloads() to restore the offloads state at device startup. Signed-off-by: Hawkins Jiawei --- net/vhost-vdpa.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 37cdc84562..682c749b19 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -680,6 +680,28 @@ static int vhost_vdpa_net_load_mq(VhostVDPAState *s, return *s->status != VIRTIO_NET_OK; } +static int vhost_vdpa_net_load_offloads(VhostVDPAState *s, +const VirtIONet *n) +{ +uint64_t features, offloads; +ssize_t dev_written; + +features = n->parent_obj.guest_features; +if (!(features & BIT_ULL(VIRTIO_NET_F_CTRL_GUEST_OFFLOADS))) { +return 0; +} + +offloads = cpu_to_le64(n->curr_guest_offloads); +dev_written = vhost_vdpa_net_load_cmd(s, VIRTIO_NET_CTRL_GUEST_OFFLOADS, + VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET, + &offloads, sizeof(offloads)); +if (unlikely(dev_written < 0)) { +return dev_written; +} + +return *s->status != VIRTIO_NET_OK; +} + static int vhost_vdpa_net_load(NetClientState *nc) { VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); @@ -702,6 +724,10 @@ static int vhost_vdpa_net_load(NetClientState *nc) if (unlikely(r)) { return r; } +r = vhost_vdpa_net_load_offloads(s, n); +if (unlikely(r)) { +return r; +} return 0; } -- 2.25.1
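For context, the command this helper emits over the control virtqueue is just the virtio-net control header followed by a little-endian 64-bit offloads bitmap. A rough sketch of the on-wire layout follows; the class/command values mirror the virtio spec, while the packing helper is illustrative and assumes a little-endian host:

```c
/*
 * Sketch of the VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET command layout:
 * a two-byte control header (class, command) followed by an le64
 * offloads bitmap.  Illustrative only, not the QEMU code.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VIRTIO_NET_CTRL_GUEST_OFFLOADS     5
#define VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET 0

struct ctrl_hdr {
    uint8_t cls;
    uint8_t cmd;
};

static size_t build_offloads_cmd(uint8_t *buf, uint64_t offloads)
{
    struct ctrl_hdr hdr = { VIRTIO_NET_CTRL_GUEST_OFFLOADS,
                            VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET };
    uint64_t le = offloads; /* assumes a little-endian host in this sketch */

    memcpy(buf, &hdr, sizeof(hdr));
    memcpy(buf + sizeof(hdr), &le, sizeof(le));
    return sizeof(hdr) + sizeof(le);
}

/* Helper so the layout can be probed byte by byte. */
static uint8_t cmd_byte(uint64_t offloads, size_t i)
{
    uint8_t buf[10];

    build_offloads_cmd(buf, offloads);
    return buf[i];
}
```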
[PATCH 0/2] Vhost-vdpa Shadow Virtqueue Offloads support
This series enables the shadowed CVQ to intercept offloads commands, updates the virtio NIC device model so QEMU sends them in a migration, and restores that offloads state on the destination. Hawkins Jiawei (2): vdpa: Add vhost_vdpa_net_load_offloads vdpa: Allow VIRTIO_NET_F_CTRL_GUEST_OFFLOADS in SVQ net/vhost-vdpa.c | 27 +++ 1 file changed, 27 insertions(+) -- 2.25.1
RE: [PATCH v1] migration: fix migrate_params_test_apply to set the dest param correctly
On Saturday, May 27, 2023 5:49 AM, Peter Xu wrote: > On Wed, May 24, 2023 at 04:01:57PM +0800, Wei Wang wrote: > > qmp_migrate_set_parameters expects to use tmp for parameters check, so > > migrate_params_test_apply is expected to copy the related fields from > > params to tmp. So fix migrate_params_test_apply to use the function > > parameter, *dest, rather than the global one. The dest->has_xxx (xxx > > is the feature name) related fields need to be set, as they will be > > checked by migrate_params_check. > > I think it's fine to do as what you suggested, but I don't see much benefit > either.. the old code IIUC will check all params even if 1 param changed, > while after your change it only checks the modified ones. > > There's slight benefits but not so much, especially "22+, 2-" LOCs, because > we don't really do this a lot; some more sanity check also makes sense to me > even if everything is always checked, so we'll hit errors if anything > accidentally goes wrong too. > > Is there a real bug somewhere? Yes. Please see qmp_migrate_set_parameters: #1 migrate_params_test_apply(params, &tmp); #2 if (!migrate_params_check(&tmp, errp)) { /* Invalid parameter */ return; } #3 migrate_params_apply(params, errp); #2 tries to do the params check using tmp, which is expected to be set up by #1, but #1 didn't fill in "&tmp", so "tmp" doesn't store the valid values as expected for the check (that is, #2 above isn't effectively doing any check of the user input params). The alternative fix would be to remove the intermediate "tmp" params, but this might break the usage from commit 1bda8b3c6950, so we need thoughts from Markus if we want to go for this approach.
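The broken flow is easier to see in a toy model: the test-apply step must populate the scratch copy that the check step inspects, otherwise the check validates stale values. A minimal sketch of the intended pattern, with all names invented for illustration (this is not the QEMU code):

```c
/*
 * Toy version of the set-parameters flow: copy current state into a
 * scratch `tmp`, overlay the user's values, validate the *copy*, and
 * apply only if the copy passes.  If test_apply wrote to the live
 * state instead of `dest`, the check below would inspect stale data.
 */
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool has_bandwidth;
    int bandwidth;
} Params;

static Params live; /* stands in for the global migration parameters */

static void params_test_apply(const Params *in, Params *dest)
{
    *dest = live;               /* start from the current values */
    if (in->has_bandwidth) {    /* overlay only what the user set */
        dest->has_bandwidth = true;
        dest->bandwidth = in->bandwidth;
    }
}

static bool params_check(const Params *p)
{
    return !p->has_bandwidth || p->bandwidth > 0;
}

static bool set_params(const Params *in)
{
    Params tmp;

    params_test_apply(in, &tmp);    /* must target &tmp, not `live` */
    if (!params_check(&tmp)) {
        return false;               /* reject invalid input untouched */
    }
    live = tmp;
    return true;
}

static bool try_set_bandwidth(int v)
{
    Params p = { true, v };

    return set_params(&p);
}
```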
[PATCH] i386/WHPX: Fix error message when fail to set ProcessorCount
From: Zhao Liu 003f230e37d7 ("machine: Tweak the order of topology members in struct CpuTopology") changed the meaning of MachineState.smp.cores from "the number of cores in one package" to "the number of cores in one die" without fixing the other uses of MachineState.smp.cores. And because of the introduction of cluster, smp.cores now just means "the number of cores in one cluster". This clearly does not fit the semantics here. And before this error message, WHvSetPartitionProperty() is called to set prop.ProcessorCount. So the error message should show prop.ProcessorCount rather than "cores per cluster" or "cores per package". Cc: Sunil Muthuswamy Signed-off-by: Zhao Liu --- target/i386/whpx/whpx-all.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/target/i386/whpx/whpx-all.c b/target/i386/whpx/whpx-all.c index 52af81683c1e..5882bf22d0a1 100644 --- a/target/i386/whpx/whpx-all.c +++ b/target/i386/whpx/whpx-all.c @@ -2613,8 +2613,8 @@ static int whpx_accel_init(MachineState *ms) sizeof(WHV_PARTITION_PROPERTY)); if (FAILED(hr)) { -error_report("WHPX: Failed to set partition core count to %d," - " hr=%08lx", ms->smp.cores, hr); +error_report("WHPX: Failed to set partition processor count to %d," + " hr=%08lx", prop.ProcessorCount, hr); ret = -EINVAL; goto error; } -- 2.34.1
[PATCH v2 03/17] softmmu: Fix CPUState.nr_cores' calculation
From: Zhuocheng Ding From CPUState.nr_cores' comment, it represents the "number of cores within this CPU package". After 003f230e37d7 ("machine: Tweak the order of topology members in struct CpuTopology"), the meaning of smp.cores changed to "the number of cores in one die", but that commit missed updating CPUState.nr_cores' calculation, so CPUState.nr_cores became wrong and now fails to account for the numbers of clusters and dies. At present, only i386 uses CPUState.nr_cores. But for i386, which supports the die level, the uses of CPUState.nr_cores are very confusing: early uses are based on the meaning of "cores per package" (before die was introduced into i386), and later uses are based on "cores per die" (after die's introduction). This difference arises because commit a94e1428991f ("target/i386: Add CPUID.1F generation support for multi-dies PCMachine") misunderstood CPUState.nr_cores to mean "cores per die" when calculating CPUID.1FH.01H:EBX. After that, the changes in i386 all followed this wrong understanding. With the influence of 003f230e37d7 and a94e1428991f, for i386 the result of CPUState.nr_cores is currently "cores per die", so the original uses of CPUState.nr_cores based on the meaning of "cores per package" are wrong when multiple dies exist: 1. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.01H:EBX[bits 23:16] is incorrect because it expects "cpus per package" but now the result is "cpus per die". 2. In cpu_x86_cpuid() of target/i386/cpu.c, for all leaves of CPUID.04H, EAX[bits 31:26] is incorrect because they expect "cpus per package" but now the result is "cpus per die". The error not only impacts the EAX calculation in the cache_info_passthrough case, but also impacts the other cases of setting cache topology for Intel CPUs according to the cpu topology (specifically, the incoming parameter "num_cores" expects "cores per package" in encode_cache_cpuid4()). 3.
In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.0BH.01H:EBX[bits 15:00] is incorrect because the EBX of 0BH.01H (core level) expects "cpus per package", which may differ from 1FH.01H (the reason is that 1FH can support more levels; for QEMU, 1FH also supports die, and 1FH.01H:EBX[bits 15:00] expects "cpus per die"). 4. In cpu_x86_cpuid() of target/i386/cpu.c, when CPUID.8001H is calculated, "cpus per package" is expected to be checked, but in fact it now checks "cpus per die". Though "cpus per die" also works for this code logic, this isn't consistent with AMD's APM. 5. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.8008H:ECX expects "cpus per package" but obtains "cpus per die". 6. In simulate_rdmsr() of target/i386/hvf/x86_emu.c, in kvm_rdmsr_core_thread_count() of target/i386/kvm/kvm.c, and in helper_rdmsr() of target/i386/tcg/sysemu/misc_helper.c, MSR_CORE_THREAD_COUNT expects "cpus per package" and "cores per package", but in these functions it obtains "cpus per die" and "cores per die". On the other hand, these uses are correct now (they were added in/after a94e1428991f): 1. In cpu_x86_cpuid() of target/i386/cpu.c, topo_info.cores_per_die meets the actual meaning of CPUState.nr_cores ("cores per die"). 2. In cpu_x86_cpuid() of target/i386/cpu.c, vcpus_per_socket (in CPUID.04H's calculation) considers the number of dies, so it's correct. 3. In cpu_x86_cpuid() of target/i386/cpu.c, CPUID.1FH.01H:EBX[bits 15:00] needs "cpus per die" and gets the correct result, and CPUID.1FH.02H:EBX[bits 15:00] gets the correct "cpus per package". When CPUState.nr_cores is correctly changed back to "cores per package", the above errors will be fixed without extra work, but the "currently" correct cases will go wrong and need special handling to pass the correct "cpus/cores per die" they want.
Thus in this patch, we fix CPUState.nr_cores' calculation to fit the original meaning "cores per package", as well as changing the calculation of topo_info.cores_per_die, vcpus_per_socket and CPUID.1FH. In addition, in nr_threads' comment, specify that it represents the number of threads in the "core" to avoid confusion, and also add a comment for nr_dies in CPUX86State. Fixes: a94e1428991f ("target/i386: Add CPUID.1F generation support for multi-dies PCMachine") Fixes: 003f230e37d7 ("machine: Tweak the order of topology members in struct CpuTopology") Signed-off-by: Zhuocheng Ding Co-developed-by: Zhao Liu Signed-off-by: Zhao Liu --- Changes since v1: * Add comment for nr_dies in CPUX86State. (Yanan) --- include/hw/core/cpu.h | 2 +- softmmu/cpus.c| 2 +- target/i386/cpu.c | 9 - target/i386/cpu.h | 1 + 4 files changed, 7 insertions(+), 7 deletions(-) diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h index 39150cf8f835..6f9120159f20 100644 --- a/include/hw/core/cpu.h +++ b/include/hw/core/cpu.h @@ -280,7 +280,7 @@ struct qemu_work_item; * See TranslationBlock::TCG CF_CLUSTER_MASK. * @tcg_cflags: Pre-computed cflags for this cpu. *
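The arithmetic at the heart of this fix is small: once clusters and dies exist in the topology, "cores per package" must multiply in the intermediate levels instead of reading smp.cores directly. A toy sketch of the buggy versus fixed readings (field names illustrative, not the exact MachineState layout):

```c
/*
 * "cores per package" versus the post-003f230e37d7 meaning of
 * smp.cores ("cores per cluster"): the fixed calculation multiplies
 * in clusters per die and dies per package.
 */
#include <assert.h>

/* Buggy reading: treats smp.cores as if it were cores per package. */
static unsigned nr_cores_buggy(unsigned dies_per_pkg,
                               unsigned clusters_per_die,
                               unsigned cores_per_cluster)
{
    (void)dies_per_pkg;
    (void)clusters_per_die;
    return cores_per_cluster;
}

/* Fixed reading: cores per package spans clusters and dies. */
static unsigned nr_cores_fixed(unsigned dies_per_pkg,
                               unsigned clusters_per_die,
                               unsigned cores_per_cluster)
{
    return cores_per_cluster * clusters_per_die * dies_per_pkg;
}
```

The two agree only in the single-die, single-cluster case, which is why the bug went unnoticed until multi-die topologies were used.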
[PATCH v2 01/17] i386: Fix comment style in topology.h
From: Zhao Liu For function comments in this file, keep the comment style consistent with other places. Signed-off-by: Zhao Liu Reviewed-by: Philippe Mathieu-Daudé --- include/hw/i386/topology.h | 33 + 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/include/hw/i386/topology.h b/include/hw/i386/topology.h index 81573f6cfde0..5a19679f618b 100644 --- a/include/hw/i386/topology.h +++ b/include/hw/i386/topology.h @@ -24,7 +24,8 @@ #ifndef HW_I386_TOPOLOGY_H #define HW_I386_TOPOLOGY_H -/* This file implements the APIC-ID-based CPU topology enumeration logic, +/* + * This file implements the APIC-ID-based CPU topology enumeration logic, * documented at the following document: * Intel® 64 Architecture Processor Topology Enumeration * http://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/ @@ -41,7 +42,8 @@ #include "qemu/bitops.h" -/* APIC IDs can be 32-bit, but beware: APIC IDs > 255 require x2APIC support +/* + * APIC IDs can be 32-bit, but beware: APIC IDs > 255 require x2APIC support */ typedef uint32_t apic_id_t; @@ -58,8 +60,7 @@ typedef struct X86CPUTopoInfo { unsigned threads_per_core; } X86CPUTopoInfo; -/* Return the bit width needed for 'count' IDs - */ +/* Return the bit width needed for 'count' IDs */ static unsigned apicid_bitwidth_for_count(unsigned count) { g_assert(count >= 1); @@ -67,15 +68,13 @@ static unsigned apicid_bitwidth_for_count(unsigned count) return count ? 
32 - clz32(count) : 0; } -/* Bit width of the SMT_ID (thread ID) field on the APIC ID - */ +/* Bit width of the SMT_ID (thread ID) field on the APIC ID */ static inline unsigned apicid_smt_width(X86CPUTopoInfo *topo_info) { return apicid_bitwidth_for_count(topo_info->threads_per_core); } -/* Bit width of the Core_ID field - */ +/* Bit width of the Core_ID field */ static inline unsigned apicid_core_width(X86CPUTopoInfo *topo_info) { return apicid_bitwidth_for_count(topo_info->cores_per_die); @@ -87,8 +86,7 @@ static inline unsigned apicid_die_width(X86CPUTopoInfo *topo_info) return apicid_bitwidth_for_count(topo_info->dies_per_pkg); } -/* Bit offset of the Core_ID field - */ +/* Bit offset of the Core_ID field */ static inline unsigned apicid_core_offset(X86CPUTopoInfo *topo_info) { return apicid_smt_width(topo_info); @@ -100,14 +98,14 @@ static inline unsigned apicid_die_offset(X86CPUTopoInfo *topo_info) return apicid_core_offset(topo_info) + apicid_core_width(topo_info); } -/* Bit offset of the Pkg_ID (socket ID) field - */ +/* Bit offset of the Pkg_ID (socket ID) field */ static inline unsigned apicid_pkg_offset(X86CPUTopoInfo *topo_info) { return apicid_die_offset(topo_info) + apicid_die_width(topo_info); } -/* Make APIC ID for the CPU based on Pkg_ID, Core_ID, SMT_ID +/* + * Make APIC ID for the CPU based on Pkg_ID, Core_ID, SMT_ID * * The caller must make sure core_id < nr_cores and smt_id < nr_threads. 
*/ @@ -120,7 +118,8 @@ static inline apic_id_t x86_apicid_from_topo_ids(X86CPUTopoInfo *topo_info, topo_ids->smt_id; } -/* Calculate thread/core/package IDs for a specific topology, +/* + * Calculate thread/core/package IDs for a specific topology, * based on (contiguous) CPU index */ static inline void x86_topo_ids_from_idx(X86CPUTopoInfo *topo_info, @@ -137,7 +136,8 @@ static inline void x86_topo_ids_from_idx(X86CPUTopoInfo *topo_info, topo_ids->smt_id = cpu_index % nr_threads; } -/* Calculate thread/core/package IDs for a specific topology, +/* + * Calculate thread/core/package IDs for a specific topology, * based on APIC ID */ static inline void x86_topo_ids_from_apicid(apic_id_t apicid, @@ -155,7 +155,8 @@ static inline void x86_topo_ids_from_apicid(apic_id_t apicid, topo_ids->pkg_id = apicid >> apicid_pkg_offset(topo_info); } -/* Make APIC ID for the CPU 'cpu_index' +/* + * Make APIC ID for the CPU 'cpu_index' * * 'cpu_index' is a sequential, contiguous ID for the CPU. */ -- 2.34.1
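The helpers whose comments are reformatted above implement the APIC-ID layout arithmetic: each topology level is allotted ceil(log2(count)) bits, and a level's bit offset is the sum of the widths below it. A standalone sketch of the same scheme, simplified to the thread/core/die levels (illustrative, not the QEMU helpers):

```c
/*
 * Sketch of APIC-ID composition: SMT_ID in the lowest bits, then
 * Core_ID, then Die_ID, each field sized to hold its ID count.
 */
#include <assert.h>

/* Bit width needed for `count` distinct IDs (0 for a single ID). */
static unsigned bitwidth_for_count(unsigned count)
{
    unsigned w = 0;

    for (count -= 1; count; count >>= 1) {
        w++;
    }
    return w;
}

static unsigned apicid(unsigned threads_per_core, unsigned cores_per_die,
                       unsigned smt_id, unsigned core_id, unsigned die_id)
{
    unsigned smt_w  = bitwidth_for_count(threads_per_core);
    unsigned core_w = bitwidth_for_count(cores_per_die);

    return (die_id << (smt_w + core_w)) | (core_id << smt_w) | smt_id;
}
```

Note that the widths are rounded up to whole bit fields, so APIC IDs need not be contiguous (e.g. 3 cores per die still take 2 bits).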
[PATCH v2 14/17] i386: Use CPUCacheInfo.share_level to encode CPUID[4]
From: Zhao Liu CPUID[4].EAX[bits 25:14] is used to represent the cache topology for intel CPUs. After cache models have topology information, we can use CPUCacheInfo.share_level to decide which topology level to be encoded into CPUID[4].EAX[bits 25:14]. And since maximum_processor_id (original "num_apic_ids") is parsed based on cpu topology levels, which are verified when parsing smp, it's no need to check this value by "assert(num_apic_ids > 0)" again, so remove this assert. Additionally, wrap the encoding of CPUID[4].EAX[bits 31:26] into a helper to make the code more clean. Signed-off-by: Zhao Liu --- Changes since v1: * Use "enum CPUTopoLevel share_level" as the parameter in max_processor_ids_for_cache(). * Make cache_into_passthrough case also use max_processor_ids_for_cache() and max_core_ids_in_package() to encode CPUID[4]. (Yanan) * Rename the title of this patch (the original is "i386: Use CPUCacheInfo.share_level to encode CPUID[4].EAX[bits 25:14]"). --- target/i386/cpu.c | 70 +-- 1 file changed, 43 insertions(+), 27 deletions(-) diff --git a/target/i386/cpu.c b/target/i386/cpu.c index d36cea505727..962f7a5c8328 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -234,22 +234,53 @@ static uint8_t cpuid2_cache_descriptor(CPUCacheInfo *cache) ((t) == UNIFIED_CACHE) ? CACHE_TYPE_UNIFIED : \ 0 /* Invalid value */) +static uint32_t max_processor_ids_for_cache(X86CPUTopoInfo *topo_info, +enum CPUTopoLevel share_level) +{ +uint32_t num_ids = 0; + +switch (share_level) { +case CPU_TOPO_LEVEL_CORE: +num_ids = 1 << apicid_core_offset(topo_info); +break; +case CPU_TOPO_LEVEL_DIE: +num_ids = 1 << apicid_die_offset(topo_info); +break; +case CPU_TOPO_LEVEL_PACKAGE: +num_ids = 1 << apicid_pkg_offset(topo_info); +break; +default: +/* + * Currently there is no use case for SMT and MODULE, so use + * assert directly to facilitate debugging. 
 + */ +g_assert_not_reached(); +} + +return num_ids - 1; +} + +static uint32_t max_core_ids_in_package(X86CPUTopoInfo *topo_info) +{ +uint32_t num_cores = 1 << (apicid_pkg_offset(topo_info) - + apicid_core_offset(topo_info)); +return num_cores - 1; +} /* Encode cache info for CPUID[4] */ static void encode_cache_cpuid4(CPUCacheInfo *cache, -int num_apic_ids, int num_cores, +X86CPUTopoInfo *topo_info, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) { assert(cache->size == cache->line_size * cache->associativity * cache->partitions * cache->sets); -assert(num_apic_ids > 0); *eax = CACHE_TYPE(cache->type) | CACHE_LEVEL(cache->level) | (cache->self_init ? CACHE_SELF_INIT_LEVEL : 0) | - ((num_cores - 1) << 26) | - ((num_apic_ids - 1) << 14); + (max_core_ids_in_package(topo_info) << 26) | + (max_processor_ids_for_cache(topo_info, cache->share_level) << 14); assert(cache->line_size > 0); assert(cache->partitions > 0); @@ -5853,56 +5884,41 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14); if (cores_per_pkg > 1) { -int addressable_cores_offset = -apicid_pkg_offset(&topo_info) - -apicid_core_offset(&topo_info); - *eax &= ~0xFC000000; -*eax |= (1 << addressable_cores_offset - 1) << 26; +*eax |= max_core_ids_in_package(&topo_info) << 26; } if (host_vcpus_per_cache > cpus_per_pkg) { -int pkg_offset = apicid_pkg_offset(&topo_info); - *eax &= ~0x3FFC000; -*eax |= (1 << pkg_offset - 1) << 14; +*eax |= +max_processor_ids_for_cache(&topo_info, +CPU_TOPO_LEVEL_PACKAGE) << 14; } } } else if (cpu->vendor_cpuid_only && IS_AMD_CPU(env)) { *eax = *ebx = *ecx = *edx = 0; } else { *eax = 0; -int addressable_cores_offset = apicid_pkg_offset(&topo_info) - - apicid_core_offset(&topo_info); -int core_offset, die_offset; switch (count) { case 0: /* L1 dcache info */ -core_offset = apicid_core_offset(&topo_info); encode_cache_cpuid4(env->cache_info_cpuid4.l1d_cache, -
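The two helpers introduced above both return "count minus one" values derived from APIC ID bit offsets. A rough Python sketch of the arithmetic (illustrative only; the width/offset derivation mirrors include/hw/i386/topology.h, and all names here are ad hoc):

```python
def bitwidth(count):
    # apicid_bitwidth_for_count(): bits needed to encode IDs 0..count-1
    return max(count - 1, 0).bit_length()

def offsets(dies, modules, cores, threads):
    # Cumulative bit offsets of each topology field in the APIC ID.
    core_off = bitwidth(threads)
    module_off = core_off + bitwidth(cores)
    die_off = module_off + bitwidth(modules)
    pkg_off = die_off + bitwidth(dies)
    return {"core": core_off, "module": module_off, "die": die_off, "pkg": pkg_off}

def max_processor_ids_for_cache(topo, share_level):
    # CPUID[4].EAX[bits 25:14]: addressable IDs sharing the cache, minus 1.
    return (1 << offsets(**topo)[share_level]) - 1

def max_core_ids_in_package(topo):
    # CPUID[4].EAX[bits 31:26]: addressable core IDs in the package, minus 1.
    off = offsets(**topo)
    return (1 << (off["pkg"] - off["core"])) - 1
```

With 2 threads per core and 2 cores per package, a core-level cache reports 1 in bits 25:14 (two logical processors share it), and bits 31:26 report 1 (two addressable cores).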
[PATCH v2 13/17] i386: Add cache topology info in CPUCacheInfo
From: Zhao Liu Currently, by default, the cache topology is encoded as: 1. i/d cache is shared in one core. 2. L2 cache is shared in one core. 3. L3 cache is shared in one die. This default general setting has caused a misunderstanding, that is, the cache topology is completely equated with a specific cpu topology, such as the connection between L2 cache and core level, and the connection between L3 cache and die level. In fact, the settings of these topologies depend on the specific platform and are not static. For example, on Alder Lake-P, every four Atom cores share the same L2 cache. Thus, we should explicitly define the corresponding cache topology for different cache models to increase scalability. Except legacy_l2_cache_cpuid2 (its default topo level is CPU_TOPO_LEVEL_UNKNOW), explicitly set the corresponding topology level for all other cache models. In order to be compatible with the existing cache topology, set the CPU_TOPO_LEVEL_CORE level for the i/d cache, set the CPU_TOPO_LEVEL_CORE level for L2 cache, and set the CPU_TOPO_LEVEL_DIE level for L3 cache. The field for CPUID[4].EAX[bits 25:14] or CPUID[0x801D].EAX[bits 25:14] will be set based on CPUCacheInfo.share_level. Signed-off-by: Zhao Liu --- Changes since v1: * Add the prefix "CPU_TOPO_LEVEL_*" for CPU topology level names. (Yanan) * Rename the "INVALID" level to "CPU_TOPO_LEVEL_UNKNOW". 
(Yanan) --- target/i386/cpu.c | 19 +++ target/i386/cpu.h | 16 2 files changed, 35 insertions(+) diff --git a/target/i386/cpu.c b/target/i386/cpu.c index e8d156428772..d36cea505727 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -436,6 +436,7 @@ static CPUCacheInfo legacy_l1d_cache = { .sets = 64, .partitions = 1, .no_invd_sharing = true, +.share_level = CPU_TOPO_LEVEL_CORE, }; /*FIXME: CPUID leaf 0x8005 is inconsistent with leaves 2 & 4 */ @@ -450,6 +451,7 @@ static CPUCacheInfo legacy_l1d_cache_amd = { .partitions = 1, .lines_per_tag = 1, .no_invd_sharing = true, +.share_level = CPU_TOPO_LEVEL_CORE, }; /* L1 instruction cache: */ @@ -463,6 +465,7 @@ static CPUCacheInfo legacy_l1i_cache = { .sets = 64, .partitions = 1, .no_invd_sharing = true, +.share_level = CPU_TOPO_LEVEL_CORE, }; /*FIXME: CPUID leaf 0x8005 is inconsistent with leaves 2 & 4 */ @@ -477,6 +480,7 @@ static CPUCacheInfo legacy_l1i_cache_amd = { .partitions = 1, .lines_per_tag = 1, .no_invd_sharing = true, +.share_level = CPU_TOPO_LEVEL_CORE, }; /* Level 2 unified cache: */ @@ -490,6 +494,7 @@ static CPUCacheInfo legacy_l2_cache = { .sets = 4096, .partitions = 1, .no_invd_sharing = true, +.share_level = CPU_TOPO_LEVEL_CORE, }; /*FIXME: CPUID leaf 2 descriptor is inconsistent with CPUID leaf 4 */ @@ -512,6 +517,7 @@ static CPUCacheInfo legacy_l2_cache_amd = { .associativity = 16, .sets = 512, .partitions = 1, +.share_level = CPU_TOPO_LEVEL_CORE, }; /* Level 3 unified cache: */ @@ -527,6 +533,7 @@ static CPUCacheInfo legacy_l3_cache = { .self_init = true, .inclusive = true, .complex_indexing = true, +.share_level = CPU_TOPO_LEVEL_DIE, }; /* TLB definitions: */ @@ -1709,6 +1716,7 @@ static const CPUCaches epyc_cache_info = { .lines_per_tag = 1, .self_init = 1, .no_invd_sharing = true, +.share_level = CPU_TOPO_LEVEL_CORE, }, .l1i_cache = &(CPUCacheInfo) { .type = INSTRUCTION_CACHE, @@ -1721,6 +1729,7 @@ static const CPUCaches epyc_cache_info = { .lines_per_tag = 1, .self_init = 1, 
.no_invd_sharing = true, +.share_level = CPU_TOPO_LEVEL_CORE, }, .l2_cache = &(CPUCacheInfo) { .type = UNIFIED_CACHE, @@ -1731,6 +1740,7 @@ static const CPUCaches epyc_cache_info = { .partitions = 1, .sets = 1024, .lines_per_tag = 1, +.share_level = CPU_TOPO_LEVEL_CORE, }, .l3_cache = &(CPUCacheInfo) { .type = UNIFIED_CACHE, @@ -1744,6 +1754,7 @@ static const CPUCaches epyc_cache_info = { .self_init = true, .inclusive = true, .complex_indexing = true, +.share_level = CPU_TOPO_LEVEL_DIE, }, }; @@ -1809,6 +1820,7 @@ static const CPUCaches epyc_rome_cache_info = { .lines_per_tag = 1, .self_init = 1, .no_invd_sharing = true, +.share_level = CPU_TOPO_LEVEL_CORE, }, .l1i_cache = &(CPUCacheInfo) { .type = INSTRUCTION_CACHE, @@ -1821,6 +1833,7 @@ static const CPUCaches epyc_rome_cache_info = { .lines_per_tag = 1, .self_init = 1, .no_invd_sharing = true, +.share_level = CPU_TOPO_LEVEL_CORE, }, .l2_cache = &(CPUCacheInfo) { .type = UNIFIED_CACHE, @@ -1831,6 +1844,7 @@ static const CPUCaches epyc_rome_cache_info = { .partitions = 1, .sets = 1024, .lines_per_tag =
[PATCH v2 11/17] tests: Add test case of APIC ID for module level parsing
From: Zhuocheng Ding After i386 supports module level, it's time to add the test for module level's parsing. Signed-off-by: Zhuocheng Ding Co-developed-by: Zhao Liu Signed-off-by: Zhao Liu Reviewed-by: Yanan Wang --- tests/unit/test-x86-topo.c | 19 +++ 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/tests/unit/test-x86-topo.c b/tests/unit/test-x86-topo.c index f21b8a5d95c2..55b731ccae55 100644 --- a/tests/unit/test-x86-topo.c +++ b/tests/unit/test-x86-topo.c @@ -37,6 +37,7 @@ static void test_topo_bits(void) topo_info = (X86CPUTopoInfo) {1, 1, 1, 1}; g_assert_cmpuint(apicid_smt_width(&topo_info), ==, 0); g_assert_cmpuint(apicid_core_width(&topo_info), ==, 0); +g_assert_cmpuint(apicid_module_width(&topo_info), ==, 0); g_assert_cmpuint(apicid_die_width(&topo_info), ==, 0); topo_info = (X86CPUTopoInfo) {1, 1, 1, 1}; @@ -74,13 +75,22 @@ static void test_topo_bits(void) topo_info = (X86CPUTopoInfo) {1, 1, 33, 2}; g_assert_cmpuint(apicid_core_width(&topo_info), ==, 6); -topo_info = (X86CPUTopoInfo) {1, 1, 30, 2}; +topo_info = (X86CPUTopoInfo) {1, 6, 30, 2}; +g_assert_cmpuint(apicid_module_width(&topo_info), ==, 3); +topo_info = (X86CPUTopoInfo) {1, 7, 30, 2}; +g_assert_cmpuint(apicid_module_width(&topo_info), ==, 3); +topo_info = (X86CPUTopoInfo) {1, 8, 30, 2}; +g_assert_cmpuint(apicid_module_width(&topo_info), ==, 3); +topo_info = (X86CPUTopoInfo) {1, 9, 30, 2}; +g_assert_cmpuint(apicid_module_width(&topo_info), ==, 4); + +topo_info = (X86CPUTopoInfo) {1, 6, 30, 2}; g_assert_cmpuint(apicid_die_width(&topo_info), ==, 0); -topo_info = (X86CPUTopoInfo) {2, 1, 30, 2}; +topo_info = (X86CPUTopoInfo) {2, 6, 30, 2}; g_assert_cmpuint(apicid_die_width(&topo_info), ==, 1); -topo_info = (X86CPUTopoInfo) {3, 1, 30, 2}; +topo_info = (X86CPUTopoInfo) {3, 6, 30, 2}; g_assert_cmpuint(apicid_die_width(&topo_info), ==, 2); -topo_info = (X86CPUTopoInfo) {4, 1, 30, 2}; +topo_info = (X86CPUTopoInfo) {4, 6, 30, 2}; g_assert_cmpuint(apicid_die_width(&topo_info), ==, 2); /* build a weird topology and see if IDs are calculated correctly @@ -91,6 
+101,7 @@ static void test_topo_bits(void) topo_info = (X86CPUTopoInfo) {1, 1, 6, 3}; g_assert_cmpuint(apicid_smt_width(&topo_info), ==, 2); g_assert_cmpuint(apicid_core_offset(&topo_info), ==, 2); +g_assert_cmpuint(apicid_module_offset(&topo_info), ==, 5); g_assert_cmpuint(apicid_die_offset(&topo_info), ==, 5); g_assert_cmpuint(apicid_pkg_offset(&topo_info), ==, 5); -- 2.34.1
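The expected module widths in the diff above (6, 7 or 8 modules need 3 bits; 9 modules need 4) follow from the ceil(log2) rule used by apicid_bitwidth_for_count(). A one-line Python equivalent (an illustrative sketch, not the QEMU implementation):

```python
def apicid_bitwidth_for_count(count):
    # Bits needed to encode IDs 0..count-1; 0 when there is only one entity.
    return max(count - 1, 0).bit_length()
```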
[PATCH v2 12/17] hw/i386/pc: Support smp.clusters for x86 PC machine
From: Zhuocheng Ding As module-level topology support is added to X86CPU, now we can enable the support for the cluster parameter on PC machines. With this support, we can define a 5-level x86 CPU topology with "-smp": -smp cpus=*,maxcpus=*,sockets=*,dies=*,clusters=*,cores=*,threads=*. Additionally, add the 5-level topology example in description of "-smp". Signed-off-by: Zhuocheng Ding Signed-off-by: Zhao Liu Reviewed-by: Yanan Wang --- hw/i386/pc.c| 1 + qemu-options.hx | 10 +- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/hw/i386/pc.c b/hw/i386/pc.c index bb62c994fad3..e16eef1a84e1 100644 --- a/hw/i386/pc.c +++ b/hw/i386/pc.c @@ -1979,6 +1979,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data) mc->default_cpu_type = TARGET_DEFAULT_CPU_TYPE; mc->nvdimm_supported = true; mc->smp_props.dies_supported = true; +mc->smp_props.clusters_supported = true; mc->default_ram_id = "pc.ram"; object_class_property_add(oc, PC_MACHINE_MAX_RAM_BELOW_4G, "size", diff --git a/qemu-options.hx b/qemu-options.hx index b37eb9662bfe..a51fab793899 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -319,14 +319,14 @@ SRST -smp 8,sockets=2,cores=2,threads=2,maxcpus=8 The following sub-option defines a CPU topology hierarchy (2 sockets -totally on the machine, 2 dies per socket, 2 cores per die, 2 threads -per core) for PC machines which support sockets/dies/cores/threads. -Some members of the option can be omitted but their values will be -automatically computed: +totally on the machine, 2 dies per socket, 2 clusters per die, 2 cores per +cluster, 2 threads per core) for PC machines which support sockets/dies +/clusters/cores/threads. 
Some members of the option can be omitted but +their values will be automatically computed: :: --smp 16,sockets=2,dies=2,cores=2,threads=2,maxcpus=16 +-smp 32,sockets=2,dies=2,clusters=2,cores=2,threads=2,maxcpus=32 The following sub-option defines a CPU topology hierarchy (2 sockets totally on the machine, 2 clusters per socket, 2 cores per cluster, -- 2.34.1
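The totals in the two "-smp" examples are simply the product of all topology levels: 2 sockets x 2 dies x 2 clusters x 2 cores x 2 threads = 32 CPUs. As a trivial sketch (hypothetical helper name, not part of QEMU):

```python
def total_cpus(sockets, dies, clusters, cores, threads):
    # Total logical CPUs implied by a fully specified -smp hierarchy.
    return sockets * dies * clusters * cores * threads
```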
[PATCH v2 08/17] i386: Support modules_per_die in X86CPUTopoInfo
From: Zhuocheng Ding Support module level in i386 cpu topology structure "X86CPUTopoInfo". Since x86 does not yet support the "clusters" parameter in "-smp", X86CPUTopoInfo.modules_per_die is currently always 1. Therefore, the module level width in APIC ID, which can be calculated by "apicid_bitwidth_for_count(topo_info->modules_per_die)", is always 0 for now, so we can directly add APIC ID related helpers to support module level parsing. At present, we don't expose module level in CPUID.1FH because currently linux (v6.4-rc1) doesn't support module level. And exposing module and die levels at the same time in CPUID.1FH will cause linux to calculate the wrong die_id. The module level should be exposed until the real machine has the module level in CPUID.1FH. In addition, update topology structure in test-x86-topo.c. Signed-off-by: Zhuocheng Ding Co-developed-by: Zhao Liu Signed-off-by: Zhao Liu --- Changes since v1: * Include module level related helpers (apicid_module_width() and apicid_module_offset()) in this patch. 
(Yanan) --- hw/i386/x86.c | 3 ++- include/hw/i386/topology.h | 22 +++ target/i386/cpu.c | 12 ++ tests/unit/test-x86-topo.c | 45 -- 4 files changed, 52 insertions(+), 30 deletions(-) diff --git a/hw/i386/x86.c b/hw/i386/x86.c index 4efc390905ff..a552ae8bb4a8 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -72,7 +72,8 @@ static void init_topo_info(X86CPUTopoInfo *topo_info, MachineState *ms = MACHINE(x86ms); topo_info->dies_per_pkg = ms->smp.dies; -topo_info->cores_per_die = ms->smp.cores; +topo_info->modules_per_die = ms->smp.clusters; +topo_info->cores_per_module = ms->smp.cores; topo_info->threads_per_core = ms->smp.threads; } diff --git a/include/hw/i386/topology.h b/include/hw/i386/topology.h index 5a19679f618b..c807d3811dd3 100644 --- a/include/hw/i386/topology.h +++ b/include/hw/i386/topology.h @@ -56,7 +56,8 @@ typedef struct X86CPUTopoIDs { typedef struct X86CPUTopoInfo { unsigned dies_per_pkg; -unsigned cores_per_die; +unsigned modules_per_die; +unsigned cores_per_module; unsigned threads_per_core; } X86CPUTopoInfo; @@ -77,7 +78,13 @@ static inline unsigned apicid_smt_width(X86CPUTopoInfo *topo_info) /* Bit width of the Core_ID field */ static inline unsigned apicid_core_width(X86CPUTopoInfo *topo_info) { -return apicid_bitwidth_for_count(topo_info->cores_per_die); +return apicid_bitwidth_for_count(topo_info->cores_per_module); +} + +/* Bit width of the Module_ID (cluster ID) field */ +static inline unsigned apicid_module_width(X86CPUTopoInfo *topo_info) +{ +return apicid_bitwidth_for_count(topo_info->modules_per_die); } /* Bit width of the Die_ID field */ @@ -92,10 +99,16 @@ static inline unsigned apicid_core_offset(X86CPUTopoInfo *topo_info) return apicid_smt_width(topo_info); } +/* Bit offset of the Module_ID (cluster ID) field */ +static inline unsigned apicid_module_offset(X86CPUTopoInfo *topo_info) +{ +return apicid_core_offset(topo_info) + apicid_core_width(topo_info); +} + /* Bit offset of the Die_ID field */ static inline unsigned 
apicid_die_offset(X86CPUTopoInfo *topo_info) { -return apicid_core_offset(topo_info) + apicid_core_width(topo_info); +return apicid_module_offset(topo_info) + apicid_module_width(topo_info); } /* Bit offset of the Pkg_ID (socket ID) field */ @@ -127,7 +140,8 @@ static inline void x86_topo_ids_from_idx(X86CPUTopoInfo *topo_info, X86CPUTopoIDs *topo_ids) { unsigned nr_dies = topo_info->dies_per_pkg; -unsigned nr_cores = topo_info->cores_per_die; +unsigned nr_cores = topo_info->cores_per_module * +topo_info->modules_per_die; unsigned nr_threads = topo_info->threads_per_core; topo_ids->pkg_id = cpu_index / (nr_dies * nr_cores * nr_threads); diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 05a5afd42a81..8e487c20ff12 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -339,7 +339,9 @@ static void encode_cache_cpuid801d(CPUCacheInfo *cache, /* L3 is shared among multiple cores */ if (cache->level == 3) { -l3_threads = topo_info->cores_per_die * topo_info->threads_per_core; +l3_threads = topo_info->modules_per_die * + topo_info->cores_per_module * + topo_info->threads_per_core; *eax |= (l3_threads - 1) << 14; } else { *eax |= ((topo_info->threads_per_core - 1) << 14); @@ -5749,10 +5751,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, uint32_t cpus_per_pkg; topo_info.dies_per_pkg = env->nr_dies; -topo_info.cores_per_die = cs->nr_cores / env->nr_dies; +topo_info.modules_per_die = env->nr_modules; +topo_info.cores_per_module = cs->nr_cores / env->nr_dies / env->nr_modules;
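Because the bit width of a count of 1 is 0, keeping modules_per_die at 1 leaves every existing APIC ID layout unchanged: the module field occupies no bits, so die_offset stays equal to module_offset. A small Python sketch of the offset chain described above (illustrative, ad hoc names):

```python
def bitwidth(count):
    # apicid_bitwidth_for_count(): bits for IDs 0..count-1, 0 for count == 1
    return max(count - 1, 0).bit_length()

def die_offset(dies, modules, cores, threads):
    # core offset = SMT width; module offset = core offset + core width;
    # die offset = module offset + module width (0 when modules == 1).
    core_off = bitwidth(threads)
    module_off = core_off + bitwidth(cores)
    return module_off + bitwidth(modules)
```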
[PATCH v2 05/17] i386/cpu: Use APIC ID offset to encode cache topo in CPUID[4]
From: Zhao Liu Refer to the fixes of cache_info_passthrough ([1], [2]) and SDM, the CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits 31:26] should use the nearest power-of-2 integer. The nearest power-of-2 integer can be calculated by pow2ceil() or by using the APIC ID offset (like the L3 topology using 1 << die_offset [3]). But in fact, CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits 31:26] are associated with the APIC ID. For example, in the linux kernel, the field "num_threads_sharing" (Bits 25 - 14) is parsed with the APIC ID. And for another example, on Alder Lake P, the CPUID.04H:EAX[bits 31:26] does not match the actual core count and is calculated by: "(1 << (pkg_offset - core_offset)) - 1". Therefore the APIC ID offset should be preferred for calculating the nearest power-of-2 integer for CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits 31:26]: 1. d/i cache is shared in a core, so 1 << core_offset should be used instead of "cs->nr_threads" in encode_cache_cpuid4() for CPUID.04H.00H:EAX[bits 25:14] and CPUID.04H.01H:EAX[bits 25:14]. 2. L2 cache is supposed to be shared in a core for now, so 1 << core_offset should also be used instead of "cs->nr_threads" in encode_cache_cpuid4() for CPUID.04H.02H:EAX[bits 25:14]. 3. Similarly, the value for CPUID.04H:EAX[bits 31:26] should also be derived from the APIC ID offsets of the levels above SMT. In addition, use the APIC ID offset to replace "pow2ceil()" for the cache_into_passthrough case. [1]: efb3934adf9e ("x86: cpu: make sure number of addressable IDs for processor cores meets the spec") [2]: d7caf13b5fcf ("x86: cpu: fixup number of addressable IDs for logical processors sharing cache") [3]: d65af288a84d ("i386: Update new x86_apicid parsing rules with die_offset support") Fixes: 7e3482f82480 ("i386: Helpers to encode cache information consistently") Suggested-by: Robert Hoo Signed-off-by: Zhao Liu --- Changes since v1: * Use the APIC ID offset to replace "pow2ceil()" for the cache_into_passthrough case. 
(Yanan) * Split the L1 cache fix into a separate patch. * Rename the title of this patch (the original is "i386/cpu: Fix number of addressable IDs in CPUID.04H"). --- target/i386/cpu.c | 30 +++--- 1 file changed, 23 insertions(+), 7 deletions(-) diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 101161954173..92f16a152e0b 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -5742,7 +5742,6 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, { X86CPU *cpu = env_archcpu(env); CPUState *cs = env_cpu(env); -uint32_t die_offset; uint32_t limit; uint32_t signature[3]; X86CPUTopoInfo topo_info; @@ -5826,39 +5825,56 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14); int vcpus_per_socket = cs->nr_cores * cs->nr_threads; if (cs->nr_cores > 1) { +int addressable_cores_offset = +apicid_pkg_offset(&topo_info) - +apicid_core_offset(&topo_info); + *eax &= ~0xFC000000; -*eax |= (pow2ceil(cs->nr_cores) - 1) << 26; +*eax |= (1 << addressable_cores_offset - 1) << 26; } if (host_vcpus_per_cache > vcpus_per_socket) { +int pkg_offset = apicid_pkg_offset(&topo_info); + *eax &= ~0x3FFC000; -*eax |= (pow2ceil(vcpus_per_socket) - 1) << 14; +*eax |= (1 << pkg_offset - 1) << 14; } } } else if (cpu->vendor_cpuid_only && IS_AMD_CPU(env)) { *eax = *ebx = *ecx = *edx = 0; } else { *eax = 0; +int addressable_cores_offset = apicid_pkg_offset(&topo_info) - + apicid_core_offset(&topo_info); +int core_offset, die_offset; + switch (count) { case 0: /* L1 dcache info */ +core_offset = apicid_core_offset(&topo_info); encode_cache_cpuid4(env->cache_info_cpuid4.l1d_cache, -cs->nr_threads, cs->nr_cores, +(1 << core_offset), +(1 << addressable_cores_offset), eax, ebx, ecx, edx); break; case 1: /* L1 icache info */ +core_offset = apicid_core_offset(&topo_info); encode_cache_cpuid4(env->cache_info_cpuid4.l1i_cache, -cs->nr_threads, cs->nr_cores, +(1 << core_offset), +(1 << addressable_cores_offset), eax, ebx, ecx, edx); break; case 2: /* 
L2 cache info */ +
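For the counts actually configurable here, the old pow2ceil() expression and the new APIC-ID-offset expression produce the same field value, since the APIC ID field width is ceil(log2(n)). A Python sketch of the CPUID.04H:EAX[bits 25:14] encoding rule described in the commit message (illustrative only; function names are made up):

```python
def bitwidth(count):
    # Bits needed for IDs 0..count-1, i.e. ceil(log2(count)).
    return max(count - 1, 0).bit_length()

def pow2ceil(n):
    # Nearest power of two >= n, as used by the pre-patch code.
    return 1 << bitwidth(n)

def cpuid4_eax_shared_field(n_sharing):
    # Bits 25:14: (nearest power-of-2 >= number of sharing CPUs) - 1.
    return (pow2ceil(n_sharing) - 1) << 14
```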
[PATCH v2 15/17] i386: Fix NumSharingCache for CPUID[0x8000001D].EAX[bits 25:14]
From: Zhao Liu The commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information for cpuid 0x8000001D") adds the cache topology for AMD CPUs by encoding the number of sharing threads directly. From AMD's APM, NumSharingCache (CPUID[0x8000001D].EAX[bits 25:14]) means [1]: The number of logical processors sharing this cache is the value of this field incremented by 1. To determine which logical processors are sharing a cache, determine a Share Id for each processor as follows: ShareId = LocalApicId >> log2(NumSharingCache+1) Logical processors with the same ShareId then share a cache. If NumSharingCache+1 is not a power of two, round it up to the next power of two. From the description above, the calculation of this field should be the same as CPUID[4].EAX[bits 25:14] for Intel CPUs. So also use the APIC ID offsets to calculate this field. Note: I don't have the hardware available, hope someone can help me to confirm whether this calculation is correct, thanks! [1]: APM, vol.3, appendix.E.4.15 Function 8000_001Dh--Cache Topology Information Cc: Babu Moger Signed-off-by: Zhao Liu --- Changes since v1: * Rename "l3_threads" to "num_apic_ids" in encode_cache_cpuid8000001d(). (Yanan) * Add the description of the original commit and add Cc. 
--- target/i386/cpu.c | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 962f7a5c8328..7d3af82c353f 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -361,7 +361,7 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) { -uint32_t l3_threads; +uint32_t num_apic_ids; assert(cache->size == cache->line_size * cache->associativity * cache->partitions * cache->sets); @@ -370,13 +370,11 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache, /* L3 is shared among multiple cores */ if (cache->level == 3) { -l3_threads = topo_info->modules_per_die * - topo_info->cores_per_module * - topo_info->threads_per_core; -*eax |= (l3_threads - 1) << 14; +num_apic_ids = 1 << apicid_die_offset(topo_info); } else { -*eax |= ((topo_info->threads_per_core - 1) << 14); +num_apic_ids = 1 << apicid_core_offset(topo_info); } +*eax |= (num_apic_ids - 1) << 14; assert(cache->line_size > 0); assert(cache->partitions > 0); -- 2.34.1
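The APM's ShareId rule quoted above can be sketched in Python as follows (illustrative only; the "round up to the next power of two" step is folded into the shift count):

```python
def share_id(local_apic_id, num_sharing_cache):
    # ShareId = LocalApicId >> log2(NumSharingCache + 1), where
    # NumSharingCache + 1 is first rounded up to a power of two.
    n = num_sharing_cache + 1
    shift = (n - 1).bit_length()  # ceil(log2(n))
    return local_apic_id >> shift
```

Logical processors whose APIC IDs yield the same ShareId share the cache, which is why encoding an APIC ID offset in bits 25:14 is the natural inverse of this parsing rule.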
[PATCH v2 07/17] i386: Introduce module-level cpu topology to CPUX86State
From: Zhuocheng Ding smp command has the "clusters" parameter but x86 hasn't supported that level. "cluster" is a CPU topology level concept above cores, in which the cores may share some resources (L2 cache or some others like L3 cache tags, depending on the Archs) [1][2]. For x86, the resource shared by cores at the cluster level is mainly the L2 cache. However, using cluster to define x86's L2 cache topology will cause the compatibility problem: Currently, x86 defaults that the L2 cache is shared in one core, which actually implies a default setting "cores per L2 cache is 1" and therefore implicitly defaults to having as many L2 caches as cores. For example (i386 PC machine): -smp 16,sockets=2,dies=2,cores=2,threads=2,maxcpus=16 (*) Considering the topology of the L2 cache, this (*) implicitly means "1 core per L2 cache" and "2 L2 caches per die". If we use cluster to configure L2 cache topology with the new default setting "clusters per L2 cache is 1", the above semantics will change to "2 cores per cluster" and "1 cluster per L2 cache", that is, "2 cores per L2 cache". So the same command (*) will cause changes in the L2 cache topology, further affecting the performance of the virtual machine. Therefore, x86 should only treat cluster as a cpu topology level and avoid using it to change L2 cache by default for compatibility. "cluster" in smp is the CPU topology level which is between "core" and die. For x86, the "cluster" in smp is corresponding to the module level [2], which is above the core level. So use the "module" other than "cluster" in i386 code. And please note that x86 already has a cpu topology level also named "cluster" [3], this level is at the upper level of the package. Here, the cluster in x86 cpu topology is completely different from the "clusters" as the smp parameter. After the module level is introduced, the cluster as the smp parameter will actually refer to the module level of x86. 
[1]: 864c3b5c32f0 ("hw/core/machine: Introduce CPU cluster topology support") [2]: Yanan's comment about "cluster", https://lists.gnu.org/archive/html/qemu-devel/2023-02/msg04051.html [3]: SDM, vol.3, ch.9, 9.9.1 Hierarchical Mapping of Shared Resources. Signed-off-by: Zhuocheng Ding Co-developed-by: Zhao Liu Signed-off-by: Zhao Liu --- Changes since v1: * The background of the introduction of the "cluster" parameter and its exact meaning were revised according to Yanan's explanation. (Yanan) --- hw/i386/x86.c | 1 + target/i386/cpu.c | 1 + target/i386/cpu.h | 5 + 3 files changed, 7 insertions(+) diff --git a/hw/i386/x86.c b/hw/i386/x86.c index a88a126123be..4efc390905ff 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -309,6 +309,7 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, init_topo_info(_info, x86ms); env->nr_dies = ms->smp.dies; +env->nr_modules = ms->smp.clusters; /* * If APIC ID is not set, diff --git a/target/i386/cpu.c b/target/i386/cpu.c index b069b43ff999..05a5afd42a81 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -7332,6 +7332,7 @@ static void x86_cpu_initfn(Object *obj) CPUX86State *env = >env; env->nr_dies = 1; +env->nr_modules = 1; cpu_set_cpustate_pointers(cpu); object_property_add(obj, "feature-words", "X86CPUFeatureWordInfo", diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 724ffd2b1fc9..fb8f7cb24902 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -1872,6 +1872,11 @@ typedef struct CPUArchState { /* Number of dies within this CPU package. */ unsigned nr_dies; +/* + * Number of modules within this CPU package. + * Module level in x86 cpu topology is corresponding to smp.clusters. + */ +unsigned nr_modules; } CPUX86State; struct kvm_msrs; -- 2.34.1
[PATCH v2 10/17] i386/cpu: Introduce cluster-id to X86CPU
From: Zhuocheng Ding We introduce cluster-id other than module-id to be consistent with CpuInstanceProperties.cluster-id, and this avoids the confusion of parameter names when hotplugging. Following the legacy smp check rules, also add the cluster_id validity into x86_cpu_pre_plug(). Signed-off-by: Zhuocheng Ding Co-developed-by: Zhao Liu Signed-off-by: Zhao Liu --- hw/i386/x86.c | 33 + target/i386/cpu.c | 2 ++ target/i386/cpu.h | 1 + 3 files changed, 28 insertions(+), 8 deletions(-) diff --git a/hw/i386/x86.c b/hw/i386/x86.c index 0b460fd6074d..8154b86f95c7 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -328,6 +328,14 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, cpu->die_id = 0; } +/* + * cluster-id was optional in QEMU 8.0 and older, so keep it optional + * if there's only one cluster per die. + */ +if (cpu->cluster_id < 0 && ms->smp.clusters == 1) { +cpu->cluster_id = 0; +} + if (cpu->socket_id < 0) { error_setg(errp, "CPU socket-id is not set"); return; @@ -344,6 +352,14 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, cpu->die_id, ms->smp.dies - 1); return; } +if (cpu->cluster_id < 0) { +error_setg(errp, "CPU cluster-id is not set"); +return; +} else if (cpu->cluster_id > ms->smp.clusters - 1) { +error_setg(errp, "Invalid CPU cluster-id: %u must be in range 0:%u", + cpu->cluster_id, ms->smp.clusters - 1); +return; +} if (cpu->core_id < 0) { error_setg(errp, "CPU core-id is not set"); return; @@ -363,16 +379,9 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, topo_ids.pkg_id = cpu->socket_id; topo_ids.die_id = cpu->die_id; +topo_ids.module_id = cpu->cluster_id; topo_ids.core_id = cpu->core_id; topo_ids.smt_id = cpu->thread_id; - -/* - * TODO: This is the temporary initialization for topo_ids.module_id to - * avoid "maybe-uninitialized" compilation errors. Will remove when - * X86CPU supports cluster_id. 
- */ -topo_ids.module_id = 0; - cpu->apic_id = x86_apicid_from_topo_ids(_info, _ids); } @@ -419,6 +428,14 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, } cpu->die_id = topo_ids.die_id; +if (cpu->cluster_id != -1 && cpu->cluster_id != topo_ids.module_id) { +error_setg(errp, "property cluster-id: %u doesn't match set apic-id:" +" 0x%x (cluster-id: %u)", cpu->cluster_id, cpu->apic_id, +topo_ids.module_id); +return; +} +cpu->cluster_id = topo_ids.module_id; + if (cpu->core_id != -1 && cpu->core_id != topo_ids.core_id) { error_setg(errp, "property core-id: %u doesn't match set apic-id:" " 0x%x (core-id: %u)", cpu->core_id, cpu->apic_id, diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 8e487c20ff12..e8d156428772 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -7532,12 +7532,14 @@ static Property x86_cpu_properties[] = { DEFINE_PROP_UINT32("apic-id", X86CPU, apic_id, 0), DEFINE_PROP_INT32("thread-id", X86CPU, thread_id, 0), DEFINE_PROP_INT32("core-id", X86CPU, core_id, 0), +DEFINE_PROP_INT32("cluster-id", X86CPU, cluster_id, 0), DEFINE_PROP_INT32("die-id", X86CPU, die_id, 0), DEFINE_PROP_INT32("socket-id", X86CPU, socket_id, 0), #else DEFINE_PROP_UINT32("apic-id", X86CPU, apic_id, UNASSIGNED_APIC_ID), DEFINE_PROP_INT32("thread-id", X86CPU, thread_id, -1), DEFINE_PROP_INT32("core-id", X86CPU, core_id, -1), +DEFINE_PROP_INT32("cluster-id", X86CPU, cluster_id, -1), DEFINE_PROP_INT32("die-id", X86CPU, die_id, -1), DEFINE_PROP_INT32("socket-id", X86CPU, socket_id, -1), #endif diff --git a/target/i386/cpu.h b/target/i386/cpu.h index fb8f7cb24902..62230b6f7701 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -2026,6 +2026,7 @@ struct ArchCPU { int32_t node_id; /* NUMA node this CPU belongs to */ int32_t socket_id; int32_t die_id; +int32_t cluster_id; int32_t core_id; int32_t thread_id; -- 2.34.1
[PATCH v2 09/17] i386: Support module_id in X86CPUTopoIDs
From: Zhuocheng Ding Add module_id member in X86CPUTopoIDs. module_id can be parsed from APIC ID, so also update APIC ID parsing rule to support module level. With this support, the conversions with module level between X86CPUTopoIDs, X86CPUTopoInfo and APIC ID are completed. module_id can be also generated from cpu topology, and before i386 supports "clusters" in smp, the default "clusters per die" is only 1, thus the module_id generated in this way is 0, so that it will not conflict with the module_id generated by APIC ID. Signed-off-by: Zhuocheng Ding Co-developed-by: Zhao Liu Signed-off-by: Zhao Liu --- Changes since v1: * Merge the patch "i386: Update APIC ID parsing rule to support module level" into this one. (Yanan) * Move the apicid_module_width() and apicid_module_offset() support into the previous modules_per_die related patch. (Yanan) --- hw/i386/x86.c | 28 +--- include/hw/i386/topology.h | 17 + 2 files changed, 34 insertions(+), 11 deletions(-) diff --git a/hw/i386/x86.c b/hw/i386/x86.c index a552ae8bb4a8..0b460fd6074d 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -314,11 +314,11 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, /* * If APIC ID is not set, - * set it based on socket/die/core/thread properties. + * set it based on socket/die/cluster/core/thread properties. */ if (cpu->apic_id == UNASSIGNED_APIC_ID) { -int max_socket = (ms->smp.max_cpus - 1) / -smp_threads / smp_cores / ms->smp.dies; +int max_socket = (ms->smp.max_cpus - 1) / smp_threads / smp_cores / +ms->smp.clusters / ms->smp.dies; /* * die-id was optional in QEMU 4.0 and older, so keep it optional @@ -365,6 +365,14 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, topo_ids.die_id = cpu->die_id; topo_ids.core_id = cpu->core_id; topo_ids.smt_id = cpu->thread_id; + +/* + * TODO: This is the temporary initialization for topo_ids.module_id to + * avoid "maybe-uninitialized" compilation errors. Will remove when + * X86CPU supports cluster_id. 
+ */ +topo_ids.module_id = 0; + cpu->apic_id = x86_apicid_from_topo_ids(_info, _ids); } @@ -373,11 +381,13 @@ void x86_cpu_pre_plug(HotplugHandler *hotplug_dev, MachineState *ms = MACHINE(x86ms); x86_topo_ids_from_apicid(cpu->apic_id, _info, _ids); + error_setg(errp, -"Invalid CPU [socket: %u, die: %u, core: %u, thread: %u] with" -" APIC ID %" PRIu32 ", valid index range 0:%d", -topo_ids.pkg_id, topo_ids.die_id, topo_ids.core_id, topo_ids.smt_id, -cpu->apic_id, ms->possible_cpus->len - 1); +"Invalid CPU [socket: %u, die: %u, module: %u, core: %u, thread: %u]" +" with APIC ID %" PRIu32 ", valid index range 0:%d", +topo_ids.pkg_id, topo_ids.die_id, topo_ids.module_id, +topo_ids.core_id, topo_ids.smt_id, cpu->apic_id, +ms->possible_cpus->len - 1); return; } @@ -498,6 +508,10 @@ const CPUArchIdList *x86_possible_cpu_arch_ids(MachineState *ms) ms->possible_cpus->cpus[i].props.has_die_id = true; ms->possible_cpus->cpus[i].props.die_id = topo_ids.die_id; } +if (ms->smp.clusters > 1) { +ms->possible_cpus->cpus[i].props.has_cluster_id = true; +ms->possible_cpus->cpus[i].props.cluster_id = topo_ids.module_id; +} ms->possible_cpus->cpus[i].props.has_core_id = true; ms->possible_cpus->cpus[i].props.core_id = topo_ids.core_id; ms->possible_cpus->cpus[i].props.has_thread_id = true; diff --git a/include/hw/i386/topology.h b/include/hw/i386/topology.h index c807d3811dd3..3cec97b377f2 100644 --- a/include/hw/i386/topology.h +++ b/include/hw/i386/topology.h @@ -50,6 +50,7 @@ typedef uint32_t apic_id_t; typedef struct X86CPUTopoIDs { unsigned pkg_id; unsigned die_id; +unsigned module_id; unsigned core_id; unsigned smt_id; } X86CPUTopoIDs; @@ -127,6 +128,7 @@ static inline apic_id_t x86_apicid_from_topo_ids(X86CPUTopoInfo *topo_info, { return (topo_ids->pkg_id << apicid_pkg_offset(topo_info)) | (topo_ids->die_id << apicid_die_offset(topo_info)) | + (topo_ids->module_id << apicid_module_offset(topo_info)) | (topo_ids->core_id << apicid_core_offset(topo_info)) | topo_ids->smt_id; } @@ 
-140,12 +142,16 @@ static inline void x86_topo_ids_from_idx(X86CPUTopoInfo *topo_info, X86CPUTopoIDs *topo_ids) { unsigned nr_dies = topo_info->dies_per_pkg; -unsigned nr_cores = topo_info->cores_per_module * -topo_info->modules_per_die; +unsigned nr_modules = topo_info->modules_per_die; +unsigned nr_cores =
[PATCH v2 16/17] i386: Use CPUCacheInfo.share_level to encode CPUID[0x8000001D].EAX[bits 25:14]
From: Zhao Liu CPUID[0x8000001D].EAX[bits 25:14] is used to represent the cache topology for AMD CPUs. After cache models have topology information, we can use CPUCacheInfo.share_level to decide which topology level to be encoded into CPUID[0x8000001D].EAX[bits 25:14]. Signed-off-by: Zhao Liu --- Changes since v1: * Use cache->share_level as the parameter in max_processor_ids_for_cache(). --- target/i386/cpu.c | 10 +- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 7d3af82c353f..e426f852ed3a 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -361,20 +361,12 @@ static void encode_cache_cpuid8000001d(CPUCacheInfo *cache, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) { -uint32_t num_apic_ids; assert(cache->size == cache->line_size * cache->associativity * cache->partitions * cache->sets); *eax = CACHE_TYPE(cache->type) | CACHE_LEVEL(cache->level) | (cache->self_init ? CACHE_SELF_INIT_LEVEL : 0); - -/* L3 is shared among multiple cores */ -if (cache->level == 3) { -num_apic_ids = 1 << apicid_die_offset(topo_info); -} else { -num_apic_ids = 1 << apicid_core_offset(topo_info); -} -*eax |= (num_apic_ids - 1) << 14; +*eax |= max_processor_ids_for_cache(topo_info, cache->share_level) << 14; assert(cache->line_size > 0); assert(cache->partitions > 0); -- 2.34.1
[PATCH v2 17/17] i386: Add new property to control L2 cache topo in CPUID.04H
From: Zhao Liu The property x-l2-cache-topo will be used to change the L2 cache topology in CPUID.04H. It allows the user to set whether the L2 cache is shared at the core level or the cluster level. If the user passes "-cpu x-l2-cache-topo=[core|cluster]" then the default L2 cache topology will be overridden by the new setting. Here we expose "cluster" to the user instead of "module", to be consistent with the "cluster-id" naming. Since CPUID.04H is used by Intel CPUs, this property is available on Intel CPUs for now. When necessary, it can be extended to CPUID.8000001DH for AMD CPUs. Signed-off-by: Zhao Liu --- Changes since v1: * Rename MODULE branch to CPU_TOPO_LEVEL_MODULE to match the previous renaming changes. --- target/i386/cpu.c | 34 +- target/i386/cpu.h | 2 ++ 2 files changed, 35 insertions(+), 1 deletion(-) diff --git a/target/i386/cpu.c b/target/i386/cpu.c index e426f852ed3a..f679f62f0f57 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -243,6 +243,9 @@ static uint32_t max_processor_ids_for_cache(X86CPUTopoInfo *topo_info, case CPU_TOPO_LEVEL_CORE: num_ids = 1 << apicid_core_offset(topo_info); break; +case CPU_TOPO_LEVEL_MODULE: +num_ids = 1 << apicid_module_offset(topo_info); +break; case CPU_TOPO_LEVEL_DIE: num_ids = 1 << apicid_die_offset(topo_info); break; @@ -251,7 +254,7 @@ break; default: /* - * Currently there is no use case for SMT and MODULE, so use + * Currently there is no use case for SMT, so use * assert directly to facilitate debugging. */ g_assert_not_reached(); @@ -7184,6 +7187,34 @@ static void x86_cpu_realizefn(DeviceState *dev, Error **errp) env->cache_info_amd.l3_cache = &legacy_l3_cache; } +if (cpu->l2_cache_topo_level) { +/* + * FIXME: Currently only supports changing CPUID[4] (for intel), and + * will support changing CPUID[0x8000001D] when necessary.
+ */ +if (!IS_INTEL_CPU(env)) { +error_setg(errp, "only intel cpus supports x-l2-cache-topo"); +return; +} + +if (!strcmp(cpu->l2_cache_topo_level, "core")) { +env->cache_info_cpuid4.l2_cache->share_level = CPU_TOPO_LEVEL_CORE; +} else if (!strcmp(cpu->l2_cache_topo_level, "cluster")) { +/* + * We expose to users "cluster" instead of "module", to be + * consistent with "cluster-id" naming. + */ +env->cache_info_cpuid4.l2_cache->share_level = +CPU_TOPO_LEVEL_MODULE; +} else { +error_setg(errp, + "x-l2-cache-topo doesn't support '%s', " + "and it only supports 'core' or 'cluster'", + cpu->l2_cache_topo_level); +return; +} +} + #ifndef CONFIG_USER_ONLY MachineState *ms = MACHINE(qdev_get_machine()); qemu_register_reset(x86_cpu_machine_reset_cb, cpu); @@ -7687,6 +7718,7 @@ static Property x86_cpu_properties[] = { false), DEFINE_PROP_BOOL("x-intel-pt-auto-level", X86CPU, intel_pt_auto_level, true), +DEFINE_PROP_STRING("x-l2-cache-topo", X86CPU, l2_cache_topo_level), DEFINE_PROP_END_OF_LIST() }; diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 5a0b5ba72433..56616e794ab9 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -2049,6 +2049,8 @@ struct ArchCPU { int32_t hv_max_vps; bool xen_vapic; + +char *l2_cache_topo_level; }; -- 2.34.1
[PATCH v2 06/17] i386/cpu: Consolidate the use of topo_info in cpu_x86_cpuid()
From: Zhao Liu In cpu_x86_cpuid(), there are many variables representing the CPU topology, e.g., topo_info, cs->nr_cores/cs->nr_threads. Since the names of cs->nr_cores/cs->nr_threads do not accurately represent their meaning, the use of cs->nr_cores/cs->nr_threads is prone to confusion and mistakes. And the structure X86CPUTopoInfo names its members clearly, thus the variable "topo_info" should be preferred. In addition, in cpu_x86_cpuid(), to uniformly use the topology variable, replace env->dies with topo_info.dies_per_pkg as well. Suggested-by: Robert Hoo Signed-off-by: Zhao Liu --- Changes since v1: * Extract cores_per_socket from the code block and use it as a local variable for cpu_x86_cpuid(). (Yanan) * Remove vcpus_per_socket variable and use cpus_per_pkg directly. (Yanan) * Replace env->dies with topo_info.dies_per_pkg in cpu_x86_cpuid(). --- target/i386/cpu.c | 31 ++- 1 file changed, 18 insertions(+), 13 deletions(-) diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 92f16a152e0b..b069b43ff999 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -5745,11 +5745,16 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, uint32_t limit; uint32_t signature[3]; X86CPUTopoInfo topo_info; +uint32_t cores_per_pkg; +uint32_t cpus_per_pkg; topo_info.dies_per_pkg = env->nr_dies; topo_info.cores_per_die = cs->nr_cores / env->nr_dies; topo_info.threads_per_core = cs->nr_threads; +cores_per_pkg = topo_info.cores_per_die * topo_info.dies_per_pkg; +cpus_per_pkg = cores_per_pkg * topo_info.threads_per_core; + /* Calculate & apply limits for different index ranges */ if (index >= 0xC0000000) { limit = env->cpuid_xlevel2; @@ -5785,8 +5790,8 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, *ecx |= CPUID_EXT_OSXSAVE; } *edx = env->features[FEAT_1_EDX]; -if (cs->nr_cores * cs->nr_threads > 1) { -*ebx |= (cs->nr_cores * cs->nr_threads) << 16; +if (cpus_per_pkg > 1) { +*ebx |= cpus_per_pkg << 16; *edx |= CPUID_HT; } if
(!cpu->enable_pmu) { @@ -5823,8 +5828,8 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, */ if (*eax & 31) { int host_vcpus_per_cache = 1 + ((*eax & 0x3FFC000) >> 14); -int vcpus_per_socket = cs->nr_cores * cs->nr_threads; -if (cs->nr_cores > 1) { + +if (cores_per_pkg > 1) { int addressable_cores_offset = apicid_pkg_offset(&topo_info) - apicid_core_offset(&topo_info); @@ -5832,7 +5837,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, *eax &= ~0xFC000000; *eax |= (1 << addressable_cores_offset - 1) << 26; } -if (host_vcpus_per_cache > vcpus_per_socket) { +if (host_vcpus_per_cache > cpus_per_pkg) { int pkg_offset = apicid_pkg_offset(&topo_info); *eax &= ~0x3FFC000; @@ -5972,12 +5977,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, switch (count) { case 0: *eax = apicid_core_offset(&topo_info); -*ebx = cs->nr_threads; +*ebx = topo_info.threads_per_core; *ecx |= CPUID_TOPOLOGY_LEVEL_SMT; break; case 1: *eax = apicid_pkg_offset(&topo_info); -*ebx = cs->nr_cores * cs->nr_threads; +*ebx = cpus_per_pkg; *ecx |= CPUID_TOPOLOGY_LEVEL_CORE; break; default: @@ -5998,7 +6003,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, break; case 0x1F: /* V2 Extended Topology Enumeration Leaf */ -if (env->nr_dies < 2) { +if (topo_info.dies_per_pkg < 2) { *eax = *ebx = *ecx = *edx = 0; break; } @@ -6008,7 +6013,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, switch (count) { case 0: *eax = apicid_core_offset(&topo_info); -*ebx = cs->nr_threads; +*ebx = topo_info.threads_per_core; *ecx |= CPUID_TOPOLOGY_LEVEL_SMT; break; case 1: @@ -6018,7 +6023,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, break; case 2: *eax = apicid_pkg_offset(&topo_info); -*ebx = cs->nr_cores * cs->nr_threads; +*ebx = cpus_per_pkg; *ecx |= CPUID_TOPOLOGY_LEVEL_DIE; break; default: @@ -6243,7 +6248,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, * discards multiple thread
[PATCH v2 04/17] i386/cpu: Fix i/d-cache topology to core level for Intel CPU
From: Zhao Liu For i-cache and d-cache, the maximum IDs for CPUs sharing cache (CPUID.04H.00H:EAX[bits 25:14] and CPUID.04H.01H:EAX[bits 25:14]) are both 0, and this means i-cache and d-cache are shared at the SMT level. This is correct if there's a single thread per core, but is wrong for the hyper-threading case (one core contains multiple threads) since the i-cache and d-cache are shared at the core level rather than the SMT level. For AMD CPUs, commit 8f4202fb1080 ("i386: Populate AMD Processor Cache Information for cpuid 0x8000001D") has already introduced i/d cache topology as core level by default. Therefore, in order to be compatible with both multi-threaded and single-threaded situations, we should set i-cache and d-cache to be shared at the core level by default. This fix changes the default i/d cache topology from per-thread to per-core. Potentially, this change in L1 cache topology may affect the performance of the VM if the user does not specifically specify the topology or bind the vCPU. However, the way to achieve optimal performance should be to create a reasonable topology and set the appropriate vCPU affinity without relying on QEMU's default topology structure. Fixes: 7e3482f82480 ("i386: Helpers to encode cache information consistently") Suggested-by: Robert Hoo Signed-off-by: Zhao Liu --- * Split this fix from the patch named "i386/cpu: Fix number of addressable IDs in CPUID.04H". * Add the explanation of the impact on performance.
(Xiaoyao) --- target/i386/cpu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 384821ba4bd0..101161954173 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -5841,12 +5841,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, switch (count) { case 0: /* L1 dcache info */ encode_cache_cpuid4(env->cache_info_cpuid4.l1d_cache, -1, cs->nr_cores, +cs->nr_threads, cs->nr_cores, eax, ebx, ecx, edx); break; case 1: /* L1 icache info */ encode_cache_cpuid4(env->cache_info_cpuid4.l1i_cache, -1, cs->nr_cores, +cs->nr_threads, cs->nr_cores, eax, ebx, ecx, edx); break; case 2: /* L2 cache info */ -- 2.34.1
[PATCH v2 02/17] tests: Rename test-x86-cpuid.c to test-x86-topo.c
From: Zhao Liu In fact, this unit tests APIC ID other than CPUID. Rename to test-x86-topo.c to make its name more in line with its actual content. Signed-off-by: Zhao Liu --- Changes since v1: * Rename test-x86-apicid.c to test-x86-topo.c. (Yanan) --- MAINTAINERS | 2 +- tests/unit/meson.build | 4 ++-- tests/unit/{test-x86-cpuid.c => test-x86-topo.c} | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) rename tests/unit/{test-x86-cpuid.c => test-x86-topo.c} (99%) diff --git a/MAINTAINERS b/MAINTAINERS index 4b025a7b63e2..d608afbe0e2f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1713,7 +1713,7 @@ F: include/hw/southbridge/ich9.h F: include/hw/southbridge/piix.h F: hw/isa/apm.c F: include/hw/isa/apm.h -F: tests/unit/test-x86-cpuid.c +F: tests/unit/test-x86-topo.c F: tests/qtest/test-x86-cpuid-compat.c PC Chipset diff --git a/tests/unit/meson.build b/tests/unit/meson.build index 3a6314269bb4..22fcb2247e53 100644 --- a/tests/unit/meson.build +++ b/tests/unit/meson.build @@ -21,8 +21,8 @@ tests = { 'test-opts-visitor': [testqapi], 'test-visitor-serialization': [testqapi], 'test-bitmap': [], - # all code tested by test-x86-cpuid is inside topology.h - 'test-x86-cpuid': [], + # all code tested by test-x86-topo is inside topology.h + 'test-x86-topo': [], 'test-cutils': [], 'test-div128': [], 'test-shift128': [], diff --git a/tests/unit/test-x86-cpuid.c b/tests/unit/test-x86-topo.c similarity index 99% rename from tests/unit/test-x86-cpuid.c rename to tests/unit/test-x86-topo.c index bfabc0403a1a..2b104f86d7c2 100644 --- a/tests/unit/test-x86-cpuid.c +++ b/tests/unit/test-x86-topo.c @@ -1,5 +1,5 @@ /* - * Test code for x86 CPUID and Topology functions + * Test code for x86 APIC ID and Topology functions * * Copyright (c) 2012 Red Hat Inc. * -- 2.34.1
[PATCH v2 00/17] Support smp.clusters for x86
From: Zhao Liu Hi list, This is our v2 patch series, rebased on the master branch at commit ac84b57b4d74 ("Merge tag 'for-upstream' of https://gitlab.com/bonzini/qemu into staging"). Compared with v1 [1], v2 mainly reorganizes patches and does some cleanup. This series adds cluster support for the x86 PC machine, which allows x86 to use smp.clusters to configure the module-level CPU topology. And because of the compatibility issue (see section: ## Why not share L2 cache in cluster directly), this series also introduces a new command to adjust the topology of the x86 L2 cache. Welcome your comments! # Background The "clusters" parameter in "smp" was introduced by ARM [2], but x86 hasn't supported it. At present, x86 defaults to the L2 cache being shared in one core, but this is not enough. There are some platforms where multiple cores share the same L2 cache, e.g., Alder Lake-P shares an L2 cache per module of Atom cores [3], that is, every four Atom cores share one L2 cache. Therefore, we need the new CPU topology level (cluster/module). Another reason is the hybrid architecture. Cluster support not only provides another level of topology definition in x86, but also provides the code changes required for our future hybrid topology support. # Overview ## Introduction of module level for x86 "cluster" in smp is the CPU topology level between "core" and "die". For x86, the "cluster" in smp corresponds to the module level [4], which is above the core level. So we use "module" rather than "cluster" in x86 code. And please note that x86 already has a CPU topology level also named "cluster" [4]; this level is above the package level. Here, the cluster in the x86 CPU topology is completely different from the "clusters" smp parameter. After the module level is introduced, the cluster smp parameter will actually refer to the module level of x86.
## Why not share L2 cache in cluster directly Though "clusters" was introduced to help define L2 cache topology [2], using cluster to define x86's L2 cache topology will cause the compatibility problem: Currently, x86 defaults that the L2 cache is shared in one core, which actually implies a default setting "cores per L2 cache is 1" and therefore implicitly defaults to having as many L2 caches as cores. For example (i386 PC machine): -smp 16,sockets=2,dies=2,cores=2,threads=2,maxcpus=16 (*) Considering the topology of the L2 cache, this (*) implicitly means "1 core per L2 cache" and "2 L2 caches per die". If we use cluster to configure L2 cache topology with the new default setting "clusters per L2 cache is 1", the above semantics will change to "2 cores per cluster" and "1 cluster per L2 cache", that is, "2 cores per L2 cache". So the same command (*) will cause changes in the L2 cache topology, further affecting the performance of the virtual machine. Therefore, x86 should only treat cluster as a cpu topology level and avoid using it to change L2 cache by default for compatibility. ## module level in CPUID Currently, we don't expose module level in CPUID.1FH because currently linux (v6.2-rc6) doesn't support module level. And exposing module and die levels at the same time in CPUID.1FH will cause linux to calculate wrong die_id. The module level should be exposed until the real machine has the module level in CPUID.1FH. We can configure CPUID.04H.02H (L2 cache topology) with module level by a new command: "-cpu,x-l2-cache-topo=cluster" More information about this command, please see the section: "## New property: x-l2-cache-topo". ## New cache topology info in CPUCacheInfo Currently, by default, the cache topology is encoded as: 1. i/d cache is shared in one core. 2. L2 cache is shared in one core. 3. L3 cache is shared in one die. 
This default general setting has caused a misunderstanding: the cache topology is completely equated with a specific CPU topology, such as the connection between the L2 cache and the core level, and the connection between the L3 cache and the die level. In fact, these topology settings depend on the specific platform and are not static. For example, on Alder Lake-P, every four Atom cores share the same L2 cache [2]. Thus, in this patch set, we explicitly define the corresponding cache topology for different CPU models, which has two benefits: 1. It is easy to extend to new CPU models in the future that have a different cache topology. 2. It can easily support a custom cache topology via a command (e.g., x-l2-cache-topo). ## New property: x-l2-cache-topo The property x-l2-cache-topo will be used to change the L2 cache topology in CPUID.04H. It allows the user to set whether the L2 cache is shared at the core level or the cluster level. If the user passes "-cpu x-l2-cache-topo=[core|cluster]" then the default L2 cache topology will be overridden by the new setting. Since CPUID.04H is used by Intel CPUs, this property is available on Intel CPUs for now. When necessary, it can be
[PATCH 2/4] target/riscv: Remove check on mode for MPRV
Normally, MPRV can be set to 1 only in M-mode (it will be cleared when returning to a lower-privilege mode by MRET/SRET), so the explicit check for M-mode before honouring MPRV is redundant. Signed-off-by: Weiwei Li Signed-off-by: Junqiang Wang --- target/riscv/cpu_helper.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c index bd892c05d4..45baf95c77 100644 --- a/target/riscv/cpu_helper.c +++ b/target/riscv/cpu_helper.c @@ -44,7 +44,7 @@ int riscv_cpu_mmu_index(CPURISCVState *env, bool ifetch) if (!ifetch) { uint64_t status = env->mstatus; -if (mode == PRV_M && get_field(status, MSTATUS_MPRV)) { +if (get_field(status, MSTATUS_MPRV)) { mode = get_field(env->mstatus, MSTATUS_MPP); virt = get_field(env->mstatus, MSTATUS_MPV) && (mode != PRV_M); -- 2.25.1
[PATCH 4/4] target/riscv: Remove redundant assignment to SXL
SXL is initialized to env->misa_mxl, which is also the MXL value. So we can just leave it unchanged to keep it read-only. Signed-off-by: Weiwei Li Signed-off-by: Junqiang Wang --- target/riscv/csr.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/target/riscv/csr.c b/target/riscv/csr.c index 6ac11d1f11..25345f3153 100644 --- a/target/riscv/csr.c +++ b/target/riscv/csr.c @@ -1321,10 +1321,6 @@ static RISCVException write_mstatus(CPURISCVState *env, int csrno, mstatus = (mstatus & ~mask) | (val & mask); -if (xl > MXL_RV32) { -/* SXL field is for now read only */ -mstatus = set_field(mstatus, MSTATUS64_SXL, xl); -} env->mstatus = mstatus; /* -- 2.25.1
[PATCH 1/4] target/riscv: Make MPV only work when MPP != PRV_M
Upon MRET or explicit memory access with MPRV=1, MPV should be ignored when MPP=PRV_M. Signed-off-by: Weiwei Li Signed-off-by: Junqiang Wang --- target/riscv/cpu_helper.c | 3 ++- target/riscv/op_helper.c | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c index 09ea227ceb..bd892c05d4 100644 --- a/target/riscv/cpu_helper.c +++ b/target/riscv/cpu_helper.c @@ -46,7 +46,8 @@ int riscv_cpu_mmu_index(CPURISCVState *env, bool ifetch) if (mode == PRV_M && get_field(status, MSTATUS_MPRV)) { mode = get_field(env->mstatus, MSTATUS_MPP); -virt = get_field(env->mstatus, MSTATUS_MPV); +virt = get_field(env->mstatus, MSTATUS_MPV) && + (mode != PRV_M); if (virt) { status = env->vsstatus; } diff --git a/target/riscv/op_helper.c b/target/riscv/op_helper.c index f563dc3981..9cdb9cdd06 100644 --- a/target/riscv/op_helper.c +++ b/target/riscv/op_helper.c @@ -335,7 +335,8 @@ target_ulong helper_mret(CPURISCVState *env) riscv_raise_exception(env, RISCV_EXCP_INST_ACCESS_FAULT, GETPC()); } -target_ulong prev_virt = get_field(env->mstatus, MSTATUS_MPV); +target_ulong prev_virt = get_field(env->mstatus, MSTATUS_MPV) && + (prev_priv != PRV_M); mstatus = set_field(mstatus, MSTATUS_MIE, get_field(mstatus, MSTATUS_MPIE)); mstatus = set_field(mstatus, MSTATUS_MPIE, 1); -- 2.25.1
[PATCH 0/4] target/riscv: Fix mstatus related problems
This patchset tries to fix some problems in the fields of mstatus, such as making MPV only work when MPP != PRV_M. The port is available here: https://github.com/plctlab/plct-qemu/tree/plct-mpv-upstream Weiwei Li (4): target/riscv: Make MPV only work when MPP != PRV_M target/riscv: Remove check on mode for MPRV target/riscv: Support MSTATUS.MPV/GVA only when RVH is enabled target/riscv: Remove redundant assignment to SXL target/riscv/cpu_helper.c | 5 +++-- target/riscv/csr.c | 14 -- target/riscv/op_helper.c | 3 ++- 3 files changed, 9 insertions(+), 13 deletions(-) -- 2.25.1
[PATCH 3/4] target/riscv: Support MSTATUS.MPV/GVA only when RVH is enabled
MPV and GVA bits are added by hypervisor extension to mstatus and mstatush (if MXLEN=32). Signed-off-by: Weiwei Li Signed-off-by: Junqiang Wang --- target/riscv/csr.c | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/target/riscv/csr.c b/target/riscv/csr.c index 58499b5afc..6ac11d1f11 100644 --- a/target/riscv/csr.c +++ b/target/riscv/csr.c @@ -1311,11 +1311,9 @@ static RISCVException write_mstatus(CPURISCVState *env, int csrno, } if (xl != MXL_RV32 || env->debugger) { -/* - * RV32: MPV and GVA are not in mstatus. The current plan is to - * add them to mstatush. For now, we just don't support it. - */ -mask |= MSTATUS_MPV | MSTATUS_GVA; +if (riscv_has_ext(env, RVH)) { +mask |= MSTATUS_MPV | MSTATUS_GVA; +} if ((val & MSTATUS64_UXL) != 0) { mask |= MSTATUS64_UXL; } @@ -1351,7 +1349,7 @@ static RISCVException write_mstatush(CPURISCVState *env, int csrno, target_ulong val) { uint64_t valh = (uint64_t)val << 32; -uint64_t mask = MSTATUS_MPV | MSTATUS_GVA; +uint64_t mask = riscv_has_ext(env, RVH) ? MSTATUS_MPV | MSTATUS_GVA : 0; env->mstatus = (env->mstatus & ~mask) | (valh & mask); -- 2.25.1
[PATCH v3 1/2] exec/ram_addr: return nr of dirty pages in cpu_physical_memory_set_dirty_lebitmap()
In preparation for including the number of dirty pages in the vfio_get_dirty_bitmap() tracepoint, return the number of dirty pages in cpu_physical_memory_set_dirty_lebitmap() similar to cpu_physical_memory_sync_dirty_bitmap(). To avoid counting twice when GLOBAL_DIRTY_RATE is enabled, stash the number of bits set per bitmap quad in a variable (@nbits) and reuse it there. Signed-off-by: Joao Martins Reviewed-by: Peter Xu --- include/exec/ram_addr.h | 28 ++-- 1 file changed, 22 insertions(+), 6 deletions(-) diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h index 90a82692904f..9f2e3893f562 100644 --- a/include/exec/ram_addr.h +++ b/include/exec/ram_addr.h @@ -334,14 +334,23 @@ static inline void cpu_physical_memory_set_dirty_range(ram_addr_t start, } #if !defined(_WIN32) -static inline void cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap, - ram_addr_t start, - ram_addr_t pages) + +/* + * Contrary to cpu_physical_memory_sync_dirty_bitmap() this function returns + * the number of dirty pages in @bitmap passed as argument. On the other hand, + * cpu_physical_memory_sync_dirty_bitmap() returns newly dirtied pages that + * weren't set in the global migration bitmap. 
+ */ +static inline +uint64_t cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap, +ram_addr_t start, +ram_addr_t pages) { unsigned long i, j; -unsigned long page_number, c; +unsigned long page_number, c, nbits; hwaddr addr; ram_addr_t ram_addr; +uint64_t num_dirty = 0; unsigned long len = (pages + HOST_LONG_BITS - 1) / HOST_LONG_BITS; unsigned long hpratio = qemu_real_host_page_size() / TARGET_PAGE_SIZE; unsigned long page = BIT_WORD(start >> TARGET_PAGE_BITS); @@ -369,6 +378,7 @@ static inline void cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap, if (bitmap[k]) { unsigned long temp = leul_to_cpu(bitmap[k]); +nbits = ctpopl(temp); qatomic_or(&blocks[DIRTY_MEMORY_VGA][idx][offset], temp); if (global_dirty_tracking) { @@ -377,10 +387,12 @@ static inline void cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap, temp); if (unlikely( global_dirty_tracking & GLOBAL_DIRTY_DIRTY_RATE)) { -total_dirty_pages += ctpopl(temp); +total_dirty_pages += nbits; } } +num_dirty += nbits; + if (tcg_enabled()) { qatomic_or(&blocks[DIRTY_MEMORY_CODE][idx][offset], temp); @@ -409,9 +421,11 @@ static inline void cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap, for (i = 0; i < len; i++) { if (bitmap[i] != 0) { c = leul_to_cpu(bitmap[i]); +nbits = ctpopl(c); if (unlikely(global_dirty_tracking & GLOBAL_DIRTY_DIRTY_RATE)) { -total_dirty_pages += ctpopl(c); +total_dirty_pages += nbits; } +num_dirty += nbits; do { j = ctzl(c); c &= ~(1ul << j); @@ -424,6 +438,8 @@ static inline void cpu_physical_memory_set_dirty_lebitmap(unsigned long *bitmap, } } } + +return num_dirty; } #endif /* not _WIN32 */ -- 2.39.3
[PATCH v3 0/2] hw/vfio: Improve vfio_get_dirty_bitmap() tracepoint
Hey, This tiny series changes the tracepoint to include the number of dirty pages via the vfio_get_dirty_bitmap. I find it useful for observability in general to understand the number of dirty pages in an IOVA range. With dirty tracking supported by device or IOMMU it's specially relevant data to include in the tracepoint. First patch changes the return value to be the number of dirty pages in the helper function setting dirty bits and the second patch expands the VFIO tracepoint to include the dirty pages. Thanks, Joao Changes since v2[2]: * Add a comment explaining the difference of retval between cpu_physical_memory_set_dirty_lebitmap() and cpu_physical_memory_sync_dirty_bitmap() (Peter Xu) * Add Peter's Reviewed-by; * Rename dirty variable into dirty_pages (Philippe Mathieu-Daude) Changes since v1[1]: * Make the nr of dirty pages a retval similar to sync variant of helper in patch 1 (Cedric) * Stash number of bits set in bitmap quad in a variable and reuse in GLOBAL_DIRTY_RATE in patch 1 * Drop init to 0 given that we always initialize the @dirty used in the tracepoint [1] https://lore.kernel.org/qemu-devel/20230523151217.46427-1-joao.m.mart...@oracle.com/ [2] https://lore.kernel.org/qemu-devel/20230525114321.71066-1-joao.m.mart...@oracle.com/ Joao Martins (2): exec/ram_addr: return nr of dirty pages in cpu_physical_memory_set_dirty_lebitmap() hw/vfio: Add nr of dirty pages to vfio_get_dirty_bitmap tracepoint hw/vfio/common.c| 7 --- hw/vfio/trace-events| 2 +- include/exec/ram_addr.h | 28 ++-- 3 files changed, 27 insertions(+), 10 deletions(-) -- 2.39.3
[PATCH v3 2/2] hw/vfio: Add nr of dirty pages to vfio_get_dirty_bitmap tracepoint
Include the number of dirty pages on the vfio_get_dirty_bitmap tracepoint. These are fetched from the newly added return value in cpu_physical_memory_set_dirty_lebitmap(). Signed-off-by: Joao Martins --- hw/vfio/common.c | 7 --- hw/vfio/trace-events | 2 +- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 78358ede2764..fa8fd949b1cf 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1747,6 +1747,7 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, { bool all_device_dirty_tracking = vfio_devices_all_device_dirty_tracking(container); +uint64_t dirty_pages; VFIOBitmap vbmap; int ret; @@ -1772,11 +1773,11 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova, goto out; } -cpu_physical_memory_set_dirty_lebitmap(vbmap.bitmap, ram_addr, - vbmap.pages); +dirty_pages = cpu_physical_memory_set_dirty_lebitmap(vbmap.bitmap, ram_addr, + vbmap.pages); trace_vfio_get_dirty_bitmap(container->fd, iova, size, vbmap.size, -ram_addr); +ram_addr, dirty_pages); out: g_free(vbmap.bitmap); diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index 646e42fd27f9..cfb60c354de3 100644 --- a/hw/vfio/trace-events +++ b/hw/vfio/trace-events @@ -120,7 +120,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]" vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%08x" vfio_dma_unmap_overflow_workaround(void) "" -vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64 +vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start, uint64_t dirty_pages) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64"
start=0x%"PRIx64" dirty_pages=%"PRIu64 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64 # platform.c -- 2.39.3
[PATCH v13 07/10] tb-stats: reset the tracked TBs on a tb_flush
From: Alex Bennée We keep track of translations but can only do so up until the translation cache is flushed. At that point we really have no idea if we can re-create a translation because all the active tracking information has been reset. Signed-off-by: Alex Bennée Reviewed-by: Richard Henderson Signed-off-by: Fei Wu --- accel/tcg/tb-maint.c| 1 + accel/tcg/tb-stats.c| 19 +++ include/exec/tb-stats.h | 8 3 files changed, 28 insertions(+) diff --git a/accel/tcg/tb-maint.c b/accel/tcg/tb-maint.c index 0980fca358..11ff0ddd90 100644 --- a/accel/tcg/tb-maint.c +++ b/accel/tcg/tb-maint.c @@ -763,6 +763,7 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count) qht_reset_size(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE); tb_remove_all(); +tbstats_reset_tbs(); tcg_region_reset_all(); /* XXX: flush processor icache at this point if cache flush is expensive */ qatomic_inc(&tb_ctx.tb_flush_count); diff --git a/accel/tcg/tb-stats.c b/accel/tcg/tb-stats.c index 805e1fc74d..139f049ffc 100644 --- a/accel/tcg/tb-stats.c +++ b/accel/tcg/tb-stats.c @@ -267,6 +267,25 @@ void do_hmp_tbstats_safe(CPUState *cpu, run_on_cpu_data icmd) g_free(cmdinfo); } +/* + * We have to reset the tbs array on a tb_flush as those + * TranslationBlocks no longer exist and we no longer know if the + * current mapping is still valid.
+ */ +static void reset_tbs_array(void *p, uint32_t hash, void *userp) +{ +TBStatistics *tbs = p; +g_ptr_array_set_size(tbs->tbs, 0); +} + +void tbstats_reset_tbs(void) +{ +if (tb_ctx.tb_stats.map) { +qht_iter(&tb_ctx.tb_stats, reset_tbs_array, NULL); +} +} + void init_tb_stats_htable(void) { if (!tb_ctx.tb_stats.map && tb_stats_collection_enabled()) { diff --git a/include/exec/tb-stats.h b/include/exec/tb-stats.h index 4bb343870b..30b788f7b2 100644 --- a/include/exec/tb-stats.h +++ b/include/exec/tb-stats.h @@ -124,4 +124,12 @@ struct TbstatsCommand { void do_hmp_tbstats_safe(CPUState *cpu, run_on_cpu_data icmd); +/** + * tbstats_reset_tbs: reset the linked array of TBs + * + * Reset the list of tbs for a given array. Should be called from + * safe work during tb_flush. */ +void tbstats_reset_tbs(void); + #endif -- 2.25.1
[PATCH v13 02/10] accel/tcg: introduce TBStatistics structure
From: "Vanderson M. do Rosario" To store statistics for each TB, we created a TBStatistics structure which is linked with the TBs. TBStatistics can stay alive after tb_flush and be relinked to a regenerated TB. So the statistics can be accumulated even through flushes. The goal is to have all present and future qemu/tcg statistics and meta-data stored in this new structure. Reviewed-by: Alex Bennée Signed-off-by: Vanderson M. do Rosario Message-Id: <20190829173437.5926-2-vanderson...@gmail.com> [AJB: fix git author, review comments] Signed-off-by: Alex Bennée Signed-off-by: Fei Wu --- MAINTAINERS | 1 + accel/tcg/meson.build | 1 + accel/tcg/tb-context.h| 1 + accel/tcg/tb-hash.h | 7 + accel/tcg/tb-maint.c | 19 accel/tcg/tb-stats.c | 58 +++ accel/tcg/translate-all.c | 43 ++ include/exec/exec-all.h | 3 ++ include/exec/tb-stats-flags.h | 21 + include/exec/tb-stats.h | 56 + 10 files changed, 210 insertions(+) create mode 100644 accel/tcg/tb-stats.c create mode 100644 include/exec/tb-stats-flags.h create mode 100644 include/exec/tb-stats.h diff --git a/MAINTAINERS b/MAINTAINERS index 4b025a7b63..d72ecd12b4 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -153,6 +153,7 @@ F: include/exec/cpu*.h F: include/exec/exec-all.h F: include/exec/tb-flush.h F: include/exec/target_long.h +F: include/exec/tb-stats*.h F: include/exec/helper*.h F: include/sysemu/cpus.h F: include/sysemu/tcg.h diff --git a/accel/tcg/meson.build b/accel/tcg/meson.build index aeb20a6ef0..9263bdde11 100644 --- a/accel/tcg/meson.build +++ b/accel/tcg/meson.build @@ -4,6 +4,7 @@ tcg_ss.add(files( 'cpu-exec-common.c', 'cpu-exec.c', 'tb-maint.c', + 'tb-stats.c', 'tcg-runtime-gvec.c', 'tcg-runtime.c', 'translate-all.c', diff --git a/accel/tcg/tb-context.h b/accel/tcg/tb-context.h index cac62d9749..d7910d586b 100644 --- a/accel/tcg/tb-context.h +++ b/accel/tcg/tb-context.h @@ -35,6 +35,7 @@ struct TBContext { /* statistics */ unsigned tb_flush_count; unsigned tb_phys_invalidate_count; +struct qht tb_stats; }; extern 
TBContext tb_ctx; diff --git a/accel/tcg/tb-hash.h b/accel/tcg/tb-hash.h index 83dc610e4c..87d657a1c6 100644 --- a/accel/tcg/tb-hash.h +++ b/accel/tcg/tb-hash.h @@ -67,4 +67,11 @@ uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint32_t flags, return qemu_xxhash7(phys_pc, pc, flags, cf_mask, trace_vcpu_dstate); } +static inline +uint32_t tb_stats_hash_func(tb_page_addr_t phys_pc, target_ulong pc, +uint32_t flags) +{ +return qemu_xxhash5(phys_pc, pc, flags); +} + #endif diff --git a/accel/tcg/tb-maint.c b/accel/tcg/tb-maint.c index 991746f80f..0980fca358 100644 --- a/accel/tcg/tb-maint.c +++ b/accel/tcg/tb-maint.c @@ -24,6 +24,7 @@ #include "exec/log.h" #include "exec/exec-all.h" #include "exec/tb-flush.h" +#include "exec/tb-stats.h" #include "exec/translate-all.h" #include "sysemu/tcg.h" #include "tcg/tcg.h" @@ -41,6 +42,23 @@ #define TB_FOR_EACH_JMP(head_tb, tb, n) \ TB_FOR_EACH_TAGGED((head_tb)->jmp_list_head, tb, n, jmp_list_next) +/* + * This is the more or less the same compare as tb_cmp(), but the + * data persists over tb_flush. We also aggregate the various + * variations of cflags under one record and ignore the details of + * page overlap (although we can count it). + */ +bool tb_stats_cmp(const void *ap, const void *bp) +{ +const TBStatistics *a = ap; +const TBStatistics *b = bp; + +return a->phys_pc == b->phys_pc && +a->pc == b->pc && +a->cs_base == b->cs_base && +a->flags == b->flags; +} + static bool tb_cmp(const void *ap, const void *bp) { const TranslationBlock *a = ap; @@ -60,6 +78,7 @@ void tb_htable_init(void) unsigned int mode = QHT_MODE_AUTO_RESIZE; qht_init(_ctx.htable, tb_cmp, CODE_GEN_HTABLE_SIZE, mode); +init_tb_stats_htable(); } typedef struct PageDesc PageDesc; diff --git a/accel/tcg/tb-stats.c b/accel/tcg/tb-stats.c new file mode 100644 index 00..f988bd8a31 --- /dev/null +++ b/accel/tcg/tb-stats.c @@ -0,0 +1,58 @@ +/* + * QEMU System Emulator, Code Quality Monitor System + * + * Copyright (c) 2019 Vanderson M. 
do Rosario + * + * SPDX-License-Identifier: GPL-2.0-or-later + */ + +#include "qemu/osdep.h" + +#include "disas/disas.h" + +#include "exec/tb-stats.h" +#include "tb-context.h" + +/* TBStatistic collection controls */ +enum TBStatsStatus { +TB_STATS_DISABLED = 0, +TB_STATS_RUNNING, +TB_STATS_PAUSED, +TB_STATS_STOPPED +}; + +static enum TBStatsStatus tcg_collect_tb_stats; + +void init_tb_stats_htable(void) +{ +if (!tb_ctx.tb_stats.map && tb_stats_collection_enabled()) { +qht_init(_ctx.tb_stats, tb_stats_cmp, +CODE_GEN_HTABLE_SIZE,
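To make the keying scheme of this patch concrete — tb_stats_cmp() matching on (phys_pc, pc, cs_base, flags) and tb_stats_hash_func() hashing (phys_pc, pc, flags) — here is a self-contained sketch. The struct and helper names below are invented for illustration, and a simple multiply/xor mix stands in for qemu_xxhash5():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch only (invented names: TBStatsRec, stats_key_*).
 * A TBStatistics record is keyed by the block's identity -- phys_pc,
 * pc, cs_base and flags -- but deliberately not by cflags, so every
 * cflags variant of a block shares one record, and the record can be
 * re-linked to a regenerated TB after tb_flush().
 */
typedef struct TBStatsRec {
    uint64_t phys_pc;
    uint64_t pc;
    uint64_t cs_base;
    uint32_t flags;
    uint64_t exec_count;    /* accumulates across retranslations */
} TBStatsRec;

/* Same shape as tb_stats_cmp() in the patch: only key fields compared. */
static bool stats_key_equal(const TBStatsRec *a, const TBStatsRec *b)
{
    return a->phys_pc == b->phys_pc &&
           a->pc == b->pc &&
           a->cs_base == b->cs_base &&
           a->flags == b->flags;
}

/*
 * Stand-in for tb_stats_hash_func()/qemu_xxhash5(phys_pc, pc, flags):
 * any deterministic mix of the three key inputs serves the sketch.
 */
static uint32_t stats_key_hash(uint64_t phys_pc, uint64_t pc, uint32_t flags)
{
    uint64_t h = phys_pc * 0x9e3779b97f4a7c15ULL;
    h ^= pc + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
    h ^= flags + (h << 6) + (h >> 2);
    return (uint32_t)(h ^ (h >> 32));
}
```

Because cflags is excluded from the key, records inserted into the tb_stats qht survive tb_flush() and keep accumulating when the same block is retranslated.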
[PATCH v13 10/10] docs: add tb-stats how to
Signed-off-by: Fei Wu --- docs/devel/tcg-tbstats.rst | 129 + 1 file changed, 129 insertions(+) create mode 100644 docs/devel/tcg-tbstats.rst diff --git a/docs/devel/tcg-tbstats.rst b/docs/devel/tcg-tbstats.rst new file mode 100644 index 00..bfba222e96 --- /dev/null +++ b/docs/devel/tcg-tbstats.rst @@ -0,0 +1,129 @@ + +TBStatistics + + +What is TBStatistics + + +TBStatistics (tb_stats) is a tool to gather various internal information of TCG +during binary translation, this allows us to identify such as hottest TBs, +guest to host instruction translation ratio, number of spills during register +allocation and more. + + +How to use TBStatistics +=== + +1. HMP interface + + +TBStatistics provides HMP interface, you can try the following examples after +connecting to the monitor. + +* First check the help info:: + +(qemu) help tb_stats +tb_stats command [stats_level] -- Control tb statistics collection:tb_stats (start|pause|stop|filter) [all|jit_stats|exec_stats] + +(qemu) help info tb-list +info tb-list [number sortedby] -- show a [number] translated blocks sorted by [sortedby]sortedby opts: hotness hg spills + +(qemu) help info tb +info tb id [flag1,flag2,...] 
-- show information about one translated block by id.dump flags can be used to set dump code level: out_asm in_asm op + +* Enable TBStatistics:: + +(qemu) tb_stats start all +(qemu) + +* Get interested TB list:: + +(qemu) info tb-list 2 +TB id:1 | phys:0x79bca0 virt:0x8059bca0 flags:0x01024001 0 inv/1 +| exec:1464084/0 guest inst cov:0.15% +| trans:1 ints: g:3 op:16 op_opt:15 spills:0 +| h/g (host bytes / guest insts): 64.00 +| time to gen at 2.4GHz => code:607.08(ns) IR:250.83(ns) + +TB id:2 | phys:0x2adf0c virt:0x800adf0c flags:0x01024001 0 inv/1 +| exec:1033298/0 guest inst cov:0.28% +| trans:1 ints: g:8 op:35 op_opt:33 spills:0 +| h/g (host bytes / guest insts): 86.00 +| time to gen at 2.4GHz => code:1429.58(ns) IR:545.42(ns) + +* Dive into the specific TB:: + +(qemu) info tb 1 op +-- + +TB id:1 | phys:0x79bca0 virt:0x8059bca0 flags:0x01024001 7 inv/19 +| exec:2038349/0 guest inst cov:0.15% +| trans:19 ints: g:3 op:16 op_opt:15 spills:0 +| h/g (host bytes / guest insts): 64.00 +| time to gen at 2.4GHz => code:133434.17(ns) IR:176988.33(ns) + +OP: + ld_i32 loc1,env,$0xfff0 + brcond_i32 loc1,$0x0,lt,$L0 + mov_i64 loc3,$0x7f3c70b3a4e0 + call inc_exec_freq,$0x1,$0,loc3 + + 8059bca0 6422 + add_i64 loc5,x2/sp,$0x8 + qemu_ld_i64 x8/s0,loc5,un+leq,1 + + 8059bca2 + add_i64 x2/sp,x2/sp,$0x10 + + 8059bca4 + mov_i64 pc,x1/ra + and_i64 pc,pc,$0xfffe + call lookup_tb_ptr,$0x6,$1,tmp9,env + goto_ptr tmp9 + set_label $L0 + exit_tb $0x7f3e887a8043 + +-- + +* Stop TBStatistics after investigation, this will disable TBStatistics completely.:: + +(qemu) tb_stats stop +(qemu) + +* Alternatively, TBStatistics can be paused, the previous collected TBStatistics + are not cleared but there is no TBStatistics recorded for new TBs.:: + +(qemu) tb_stats pause +(qemu) + +* Definitely, TBStatistics can be restarted for another round of investigation.:: + +(qemu) tb_stats start all +(qemu) + + +2. 
Dump at exit +--- + +New command line options have been added to enable dump TB information at exit::: + +-d tb_stats[[,level=(+all+jit+exec+time)][,dump_limit=]] + +e.g. starting qemu like this::: + +-d tb_stats,dump_limit=2 + +Qemu prints the following at exit::: + +QEMU: Terminated +TB id:1 | phys:0x61be02 virt:0x8041be02 flags:0x01024001 0 inv/1 +| exec:72739176/0 guest inst cov:20.22% +| trans:1 ints: g:9 op:35 op_opt:33 spills:0 +| h/g (host bytes / guest insts): 51.57 +| time to gen at 2.4GHz => code:1065.42(ns) IR:554.17(ns) + +TB id:2 | phys:0x61bc66 virt:0x8041bc66 flags:0x01024001 0 inv/1 +| exec:25069507/0 guest inst cov:0.77% +| trans:1 ints: g:1 op:15 op_opt:14 spills:0 +| h/g (host bytes / guest insts): 128.00 +| time to gen at 2.4GHz => code:312.50(ns) IR:152.08(ns) -- 2.25.1
[PATCH v13 08/10] Adding info [tb-list|tb] commands to HMP (WIP)
From: "Vanderson M. do Rosario" These commands allow the exploration of TBs generated by the TCG. Understand which one hotter, with more guest/host instructions... and examine their guest, host and IR code. The goal of this command is to allow the dynamic exploration of TCG behavior and code quality. Therefore, for now, a corresponding QMP command is not worthwhile. [AJB: WIP - we still can't be safely sure a translation will succeed] Example of output: TB id:1 | phys:0x34d54 virt:0x00034d54 flags:0xf0 | exec:4828932/0 guest inst cov:16.38% | trans:1 ints: g:3 op:82 op_opt:34 spills:3 | h/g (host bytes / guest insts): 90.64 | time to gen at 2.4GHz => code:3150.83(ns) IR:712.08(ns) | targets: 0x00034d5e (id:11), 0x00034d0d (id:2) TB id:2 | phys:0x34d0d virt:0x00034d0d flags:0xf0 | exec:4825842/0 guest inst cov:21.82% | trans:1 ints: g:4 op:80 op_opt:38 spills:2 | h/g (host bytes / guest insts): 84.00 | time to gen at 2.4GHz => code:3362.92(ns) IR:793.75(ns) | targets: 0x00034d19 (id:12), 0x00034d54 (id:1) TB id:2 | phys:0x34d0d virt:0x00034d0d flags:0xf0 | exec:6956495/0 guest inst cov:21.82% | trans:2 ints: g:2 op:40 op_opt:19 spills:1 | h/g (host bytes / guest insts): 84.00 | time to gen at 2.4GHz => code:3130.83(ns) IR:722.50(ns) | targets: 0x00034d19 (id:12), 0x00034d54 (id:1) IN: 0x00034d0d: 89 demovl %ebx, %esi 0x00034d0f: 26 8b 0e movl %es:(%esi), %ecx 0x00034d12: 26 f6 46 08 80 testb$0x80, %es:8(%esi) 0x00034d17: 75 3bjne 0x34d54 -- TB id:1 | phys:0x34d54 virt:0x00034d54 flags:0xf0 | exec:5202686/0 guest inst cov:11.28% | trans:1 ints: g:3 op:82 op_opt:34 spills:3 | h/g (host bytes / guest insts): 90.64 | time to gen at 2.4GHz => code:2793.75(ns) IR:614.58(ns) | targets: 0x00034d5e (id:3), 0x00034d0d (id:2) TB id:2 | phys:0x34d0d virt:0x00034d0d flags:0xf0 | exec:5199468/0 guest inst cov:15.03% | trans:1 ints: g:4 op:80 op_opt:38 spills:2 | h/g (host bytes / guest insts): 84.00 | time to gen at 2.4GHz => code:2958.75(ns) IR:719.58(ns) | targets: 0x00034d19 
(id:4), 0x00034d54 (id:1) -- 2 TBs to reach 25% of guest inst exec coverage Total of guest insts exec: 138346727 -- Acked-by: Dr. David Alan Gilbert Signed-off-by: Vanderson M. do Rosario Message-Id: <20190829173437.5926-10-vanderson...@gmail.com> [AJB: fix authorship, dropped coverset] Signed-off-by: Alex Bennée Signed-off-by: Fei Wu --- accel/tcg/monitor.c | 54 ++ accel/tcg/tb-stats.c | 322 +++ disas/disas.c| 26 ++- hmp-commands-info.hx | 16 ++ include/exec/tb-stats.h | 33 +++- include/monitor/hmp.h| 2 + include/qemu/log-for-trace.h | 6 +- include/qemu/log.h | 2 + util/log.c | 66 +-- 9 files changed, 508 insertions(+), 19 deletions(-) diff --git a/accel/tcg/monitor.c b/accel/tcg/monitor.c index 2e00f10267..a1780c5920 100644 --- a/accel/tcg/monitor.c +++ b/accel/tcg/monitor.c @@ -8,6 +8,7 @@ #include "qemu/osdep.h" #include "qemu/accel.h" +#include "qemu/log.h" #include "qapi/error.h" #include "qapi/type-helpers.h" #include "qapi/qapi-commands-machine.h" @@ -18,6 +19,7 @@ #include "sysemu/cpu-timers.h" #include "sysemu/tcg.h" #include "exec/tb-stats.h" +#include "tb-context.h" #include "internal.h" @@ -132,6 +134,58 @@ void hmp_tbstats(Monitor *mon, const QDict *qdict) } +void hmp_info_tblist(Monitor *mon, const QDict *qdict) +{ +int number_int; +const char *sortedby_str = NULL; +if (!tcg_enabled()) { +error_report("TB information is only available with accel=tcg"); +return; +} +if (!tb_ctx.tb_stats.map) { +error_report("no TB information recorded"); +return; +} + +number_int = qdict_get_try_int(qdict, "number", 10); +sortedby_str = qdict_get_try_str(qdict, "sortedby"); + +int sortedby = SORT_BY_HOTNESS; +if (sortedby_str == NULL || strcmp(sortedby_str, "hotness") == 0) { +sortedby = SORT_BY_HOTNESS; +} else if (strcmp(sortedby_str, "hg") == 0) { +sortedby = SORT_BY_HG; +} else if (strcmp(sortedby_str, "spills") == 0) { +sortedby = SORT_BY_SPILLS; +} else { +error_report("valid sort options are: hotness hg spills"); +return; +} + +dump_tbs_info(number_int, 
sortedby, true); +} + +void hmp_info_tb(Monitor *mon, const QDict *qdict) +{ +const int id = qdict_get_int(qdict, "id"); +const char
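The three `sortedby` options dispatched in hmp_info_tblist() — hotness, hg, spills — come down to descending orderings over per-TB counters, so the hottest (or worst) blocks are listed first. A minimal standalone sketch; TBRow and the cmp_* names are invented here (the real code sorts TBStatistics records):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Invented illustration of the "info tb-list ... sortedby" orderings. */
typedef struct TBRow {
    uint64_t exec_count;     /* "hotness": execution count */
    double host_per_guest;   /* "hg": host bytes per guest instruction */
    unsigned spills;         /* "spills": register spills during regalloc */
} TBRow;

/* All comparators sort descending: (a < b) - (a > b). */
static int cmp_hotness(const void *pa, const void *pb)
{
    const TBRow *a = pa, *b = pb;
    return (a->exec_count < b->exec_count) - (a->exec_count > b->exec_count);
}

static int cmp_hg(const void *pa, const void *pb)
{
    const TBRow *a = pa, *b = pb;
    return (a->host_per_guest < b->host_per_guest) -
           (a->host_per_guest > b->host_per_guest);
}

static int cmp_spills(const void *pa, const void *pb)
{
    const TBRow *a = pa, *b = pb;
    return (a->spills < b->spills) - (a->spills > b->spills);
}
```

With qsort() and cmp_hotness, the first N entries are exactly what `info tb-list N hotness` would print.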
[PATCH v13 09/10] tb-stats: dump hot TBs at the end of the execution
From: "Vanderson M. do Rosario" Dump the hottest TBs if -d tb_stats,dump_limit=N is used. Example of output for the 3 hottest TBs: TB id:1 | phys:0x34d54 virt:0x00034d54 flags:0xf0 | exec:4828932/0 guest inst cov:16.38% | trans:1 ints: g:3 op:82 op_opt:34 spills:3 | h/g (host bytes / guest insts): 90.64 | time to gen at 2.4GHz => code:3150.83(ns) IR:712.08(ns) | targets: 0x00034d5e (id:11), 0x00034d0d (id:2) TB id:2 | phys:0x34d0d virt:0x00034d0d flags:0xf0 | exec:4825842/0 guest inst cov:21.82% | trans:1 ints: g:4 op:80 op_opt:38 spills:2 | h/g (host bytes / guest insts): 84.00 | time to gen at 2.4GHz => code:3362.92(ns) IR:793.75(ns) | targets: 0x00034d19 (id:12), 0x00034d54 (id:1) TB id:3 | phys:0xec1c1 virt:0x000ec1c1 flags:0xb0 | exec:872032/0 guest inst cov:1.97% | trans:1 ints: g:2 op:56 op_opt:26 spills:1 | h/g (host bytes / guest insts): 68.00 | time to gen at 2.4GHz => code:1692.08(ns) IR:473.75(ns) | targets: 0x000ec1c5 (id:4), 0x000ec1cb (id:13) Signed-off-by: Vanderson M. do Rosario Message-Id: <20190829173437.5926-12-vanderson...@gmail.com> [AJB: fix authorship, ad softmmu support] Signed-off-by: Alex Bennée Signed-off-by: Fei Wu --- accel/tcg/tb-stats.c | 21 + include/exec/tb-stats-dump.h | 21 + include/exec/tb-stats-flags.h | 1 + linux-user/exit.c | 2 ++ softmmu/runstate.c| 2 ++ stubs/tb-stats.c | 5 + util/log.c| 4 ++-- 7 files changed, 54 insertions(+), 2 deletions(-) create mode 100644 include/exec/tb-stats-dump.h diff --git a/accel/tcg/tb-stats.c b/accel/tcg/tb-stats.c index 4b358cb421..d77891492a 100644 --- a/accel/tcg/tb-stats.c +++ b/accel/tcg/tb-stats.c @@ -19,6 +19,7 @@ #include "exec/tb-stats.h" #include "exec/tb-flush.h" +#include "exec/tb-stats-dump.h" #include "tb-context.h" #include "internal.h" @@ -32,6 +33,7 @@ enum TBStatsStatus { static enum TBStatsStatus tcg_collect_tb_stats; static uint32_t default_tbstats_flag; +static int max_dump_tbs; /* only accessed in safe work */ static GList *last_search; @@ -616,6 +618,20 @@ void 
dump_tb_info(int id, int log_mask, bool use_monitor) } +/* + * Dump the final stats + */ +void tb_stats_dump(void) +{ +if (!tb_stats_collection_enabled()) { +return; +} + +dump_tbs_info(max_dump_tbs, SORT_BY_HOTNESS, false); +} + +/* TBStatistic collection controls */ + void enable_collect_tb_stats(void) { tcg_collect_tb_stats = TB_STATS_RUNNING; @@ -669,3 +685,8 @@ void set_default_tbstats_flag(uint32_t flags) { default_tbstats_flag = flags; } + +void set_tbstats_max_tbs(int max) +{ +max_dump_tbs = max; +} diff --git a/include/exec/tb-stats-dump.h b/include/exec/tb-stats-dump.h new file mode 100644 index 00..197c6148e9 --- /dev/null +++ b/include/exec/tb-stats-dump.h @@ -0,0 +1,21 @@ +/* + * TB Stats common dump functions across sysemu/linux-user + * + * Copyright (c) 2019 Linaro + * + * SPDX-License-Identifier: GPL-3.0-or-later + */ + +#ifndef _TB_STATS_DUMP_H_ +#define _TB_STATS_DUMP_H_ + +/** + * tb_stats_dump: dump final tb_stats at end of execution + */ +#ifdef CONFIG_TCG +void tb_stats_dump(void); +#else +static inline void tb_stats_dump(void) { /* do nothing */ }; +#endif + +#endif /* _TB_STATS_DUMP_H_ */ diff --git a/include/exec/tb-stats-flags.h b/include/exec/tb-stats-flags.h index a3897c99b1..484a4c9c68 100644 --- a/include/exec/tb-stats-flags.h +++ b/include/exec/tb-stats-flags.h @@ -26,6 +26,7 @@ bool tb_stats_collection_enabled(void); bool tb_stats_collection_disabled(void); bool tb_stats_collection_paused(void); +void set_tbstats_max_tbs(int max); uint32_t get_default_tbstats_flag(void); void set_default_tbstats_flag(uint32_t); void set_tbstats_flags(uint32_t flags); diff --git a/linux-user/exit.c b/linux-user/exit.c index 3017d28a3c..4fd23bcf60 100644 --- a/linux-user/exit.c +++ b/linux-user/exit.c @@ -25,6 +25,7 @@ #ifdef CONFIG_GPROF #include #endif +#include "exec/tb-stats-dump.h" #ifdef CONFIG_GCOV extern void __gcov_dump(void); @@ -41,4 +42,5 @@ void preexit_cleanup(CPUArchState *env, int code) gdb_exit(code); qemu_plugin_user_exit(); 
perf_exit(); +tb_stats_dump(); } diff --git a/softmmu/runstate.c b/softmmu/runstate.c index 37390799f1..b5ceb55ffd 100644 --- a/softmmu/runstate.c +++ b/softmmu/runstate.c @@ -30,6 +30,7 @@ #include "crypto/cipher.h" #include "crypto/init.h" #include "exec/cpu-common.h" +#include "exec/tb-stats-dump.h" #include "gdbstub/syscalls.h" #include "hw/boards.h" #include "migration/misc.h" @@ -827,6 +828,7 @@ void qemu_cleanup(void) vm_shutdown(); replay_finish(); +tb_stats_dump(); job_cancel_sync_all();

[PATCH v13 05/10] debug: add -d tb_stats to control TBStatistics collection
From: "Vanderson M. do Rosario" -d tb_stats[[,level=(+all+jit+exec+time)][,dump_limit=]] "dump_limit" is used to limit the number of dumped TBStats in linux-user mode. [all+jit+exec+time] control the profilling level used by the TBStats. Can be used as follow: -d tb_stats -d tb_stats,level=jit+time -d tb_stats,dump_limit=15 ... Signed-off-by: Vanderson M. do Rosario Message-Id: <20190829173437.5926-7-vanderson...@gmail.com> [AJB: fix authorship, reword title] Signed-off-by: Alex Bennée Signed-off-by: Fei Wu --- accel/tcg/tb-stats.c | 5 + include/exec/gen-icount.h | 1 + include/exec/tb-stats-flags.h | 3 +++ include/exec/tb-stats.h | 2 ++ include/qemu/log.h| 1 + stubs/meson.build | 1 + stubs/tb-stats.c | 27 + util/log.c| 37 +++ 8 files changed, 77 insertions(+) create mode 100644 stubs/tb-stats.c diff --git a/accel/tcg/tb-stats.c b/accel/tcg/tb-stats.c index 78a3104c7f..68ac7d3f73 100644 --- a/accel/tcg/tb-stats.c +++ b/accel/tcg/tb-stats.c @@ -208,3 +208,8 @@ uint32_t get_default_tbstats_flag(void) { return default_tbstats_flag; } + +void set_default_tbstats_flag(uint32_t flags) +{ +default_tbstats_flag = flags; +} diff --git a/include/exec/gen-icount.h b/include/exec/gen-icount.h index 20e7835ff0..4372817951 100644 --- a/include/exec/gen-icount.h +++ b/include/exec/gen-icount.h @@ -3,6 +3,7 @@ #include "exec/exec-all.h" #include "exec/tb-stats.h" +#include "tb-stats-flags.h" /* Helpers for instruction counting code generation. 
*/ diff --git a/include/exec/tb-stats-flags.h b/include/exec/tb-stats-flags.h index f29eff7576..04adaee8d9 100644 --- a/include/exec/tb-stats-flags.h +++ b/include/exec/tb-stats-flags.h @@ -15,6 +15,7 @@ #define TB_EXEC_STATS (1 << 1) #define TB_JIT_STATS (1 << 2) #define TB_JIT_TIME (1 << 3) +#define TB_ALL_STATS (TB_EXEC_STATS | TB_JIT_STATS | TB_JIT_TIME) /* TBStatistic collection controls */ void enable_collect_tb_stats(void); @@ -24,5 +25,7 @@ bool tb_stats_collection_enabled(void); bool tb_stats_collection_paused(void); uint32_t get_default_tbstats_flag(void); +void set_default_tbstats_flag(uint32_t); +void set_tbstats_flags(uint32_t flags); #endif diff --git a/include/exec/tb-stats.h b/include/exec/tb-stats.h index d93d42e085..72585c448a 100644 --- a/include/exec/tb-stats.h +++ b/include/exec/tb-stats.h @@ -31,6 +31,8 @@ #include "exec/tb-stats-flags.h" #include "tcg/tcg.h" +#include "exec/tb-stats-flags.h" + #define tb_stats_enabled(tb, JIT_STATS) \ (tb && tb->tb_stats && (tb->tb_stats->stats_enabled & JIT_STATS)) diff --git a/include/qemu/log.h b/include/qemu/log.h index c5643d8dd5..6f3b8091cd 100644 --- a/include/qemu/log.h +++ b/include/qemu/log.h @@ -35,6 +35,7 @@ bool qemu_log_separate(void); /* LOG_STRACE is used for user-mode strace logging. */ #define LOG_STRACE (1 << 19) #define LOG_PER_THREAD (1 << 20) +#define CPU_LOG_TB_STATS (1 << 21) /* Lock/unlock output. 
*/ diff --git a/stubs/meson.build b/stubs/meson.build index a56645e2f7..e926649d40 100644 --- a/stubs/meson.build +++ b/stubs/meson.build @@ -65,3 +65,4 @@ else endif stub_ss.add(files('semihost-all.c')) stub_ss.add(when: 'CONFIG_VFIO_USER_SERVER', if_false: files('vfio-user-obj.c')) +stub_ss.add(files('tb-stats.c')) diff --git a/stubs/tb-stats.c b/stubs/tb-stats.c new file mode 100644 index 00..d212c2a1fa --- /dev/null +++ b/stubs/tb-stats.c @@ -0,0 +1,27 @@ +/* + * TB Stats Stubs + * + * Copyright (c) 2019 + * Written by Alex Bennée + * + * This code is licensed under the GNU GPL v2, or later. + */ + + +#include "qemu/osdep.h" +#include "exec/tb-stats-flags.h" + +void enable_collect_tb_stats(void) +{ +return; +} + +bool tb_stats_collection_enabled(void) +{ +return false; +} + +void set_default_tbstats_flag(uint32_t flags) +{ +return; +} diff --git a/util/log.c b/util/log.c index 53b4f6c58e..7ae471d97c 100644 --- a/util/log.c +++ b/util/log.c @@ -19,6 +19,7 @@ #include "qemu/osdep.h" #include "qemu/log.h" +#include "qemu/qemu-print.h" #include "qemu/range.h" #include "qemu/error-report.h" #include "qapi/error.h" @@ -27,6 +28,7 @@ #include "qemu/thread.h" #include "qemu/lockable.h" #include "qemu/rcu.h" +#include "exec/tb-stats-flags.h" #ifdef CONFIG_LINUX #include #endif @@ -47,6 +49,7 @@ static __thread Notifier qemu_log_thread_cleanup_notifier; int qemu_loglevel; static bool log_per_thread; static GArray *debug_regions; +int32_t max_num_hot_tbs_to_dump; /* Returns true if qemu_log() will really write somewhere. */ bool qemu_log_enabled(void) @@ -495,6 +498,10 @@ const QEMULogItem qemu_log_items[] = { "log every user-mode syscall, its input, and its result" }, { LOG_PER_THREAD, "tid", "open a separate log file per thread; filename must contain '%d'" }, +{ CPU_LOG_TB_STATS, +
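The `level=(+all+jit+exec+time)` sub-option described above accumulates flag bits by OR-ing one bit per `+name` segment. The following standalone parser is an illustrative sketch of that accumulation (parse_tbstats_level is an invented name — the actual parsing lives in util/log.c), reusing the flag values from tb-stats-flags.h:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Flag bits as defined in include/exec/tb-stats-flags.h. */
#define TB_EXEC_STATS (1 << 1)
#define TB_JIT_STATS  (1 << 2)
#define TB_JIT_TIME   (1 << 3)
#define TB_ALL_STATS  (TB_EXEC_STATS | TB_JIT_STATS | TB_JIT_TIME)

/*
 * Invented sketch: walk a "+all+jit+exec+time"-style string, OR in the
 * matching bit for each "+name" segment, silently skip unknown names.
 */
static uint32_t parse_tbstats_level(const char *s)
{
    uint32_t flags = 0;

    while (*s == '+') {
        s++;
        size_t n = strcspn(s, "+");   /* length of this segment */
        if (n == 3 && strncmp(s, "all", 3) == 0) {
            flags |= TB_ALL_STATS;
        } else if (n == 3 && strncmp(s, "jit", 3) == 0) {
            flags |= TB_JIT_STATS;
        } else if (n == 4 && strncmp(s, "exec", 4) == 0) {
            flags |= TB_EXEC_STATS;
        } else if (n == 4 && strncmp(s, "time", 4) == 0) {
            flags |= TB_JIT_TIME;
        }
        s += n;
    }
    return flags;
}
```

So `-d tb_stats,level=jit+time` (with the leading `+` implied by the option syntax shown in the commit message) would resolve to TB_JIT_STATS | TB_JIT_TIME.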
[PATCH v13 06/10] monitor: adding tb_stats hmp command
From: "Vanderson M. do Rosario" Adding tb_stats [start|pause|stop|filter] command to hmp. This allows controlling the collection of statistics. It is also possible to set the level of collection: all, jit, or exec. tb_stats filter allow to only collect statistics for the TB in the last_search list. The goal of this command is to allow the dynamic exploration of the TCG behavior and quality. Therefore, for now, a corresponding QMP command is not worthwhile. Acked-by: Dr. David Alan Gilbert Signed-off-by: Vanderson M. do Rosario Message-Id: <20190829173437.5926-8-vanderson...@gmail.com> Message-Id: <20190829173437.5926-9-vanderson...@gmail.com> [AJB: fix authorship] Signed-off-by: Alex Bennée Signed-off-by: Fei Wu --- accel/tcg/monitor.c | 45 + accel/tcg/tb-stats.c | 121 +- hmp-commands.hx | 16 + include/exec/tb-stats-flags.h | 2 + include/exec/tb-stats.h | 10 +++ include/monitor/hmp.h | 1 + 6 files changed, 192 insertions(+), 3 deletions(-) diff --git a/accel/tcg/monitor.c b/accel/tcg/monitor.c index 2bc87f2642..2e00f10267 100644 --- a/accel/tcg/monitor.c +++ b/accel/tcg/monitor.c @@ -11,7 +11,9 @@ #include "qapi/error.h" #include "qapi/type-helpers.h" #include "qapi/qapi-commands-machine.h" +#include "qapi/qmp/qdict.h" #include "monitor/monitor.h" +#include "monitor/hmp.h" #include "sysemu/cpus.h" #include "sysemu/cpu-timers.h" #include "sysemu/tcg.h" @@ -87,6 +89,49 @@ HumanReadableText *qmp_x_query_opcount(Error **errp) } #ifdef CONFIG_TCG +void hmp_tbstats(Monitor *mon, const QDict *qdict) +{ +if (!tcg_enabled()) { +error_report("TB information is only available with accel=tcg"); +return; +} + +char *cmd = (char *) qdict_get_try_str(qdict, "command"); +enum TbstatsCmd icmd = -1; + +if (strcmp(cmd, "start") == 0) { +icmd = START; +} else if (strcmp(cmd, "pause") == 0) { +icmd = PAUSE; +} else if (strcmp(cmd, "stop") == 0) { +icmd = STOP; +} else if (strcmp(cmd, "filter") == 0) { +icmd = FILTER; +} else { +error_report("invalid command!"); +return; +} + +char 
*slevel = (char *) qdict_get_try_str(qdict, "level"); +uint32_t level = TB_EXEC_STATS | TB_JIT_STATS | TB_JIT_TIME; +if (slevel) { +if (strcmp(slevel, "jit") == 0) { +level = TB_JIT_STATS; +} else if (strcmp(slevel, "exec") == 0) { +level = TB_EXEC_STATS; +} else if (strcmp(slevel, "time") == 0) { +level = TB_JIT_TIME; +} +} + +struct TbstatsCommand *tbscommand = g_new0(struct TbstatsCommand, 1); +tbscommand->cmd = icmd; +tbscommand->level = level; +async_safe_run_on_cpu(first_cpu, do_hmp_tbstats_safe, + RUN_ON_CPU_HOST_PTR(tbscommand)); + +} + HumanReadableText *qmp_x_query_profile(Error **errp) { g_autoptr(GString) buf = g_string_new(""); diff --git a/accel/tcg/tb-stats.c b/accel/tcg/tb-stats.c index 68ac7d3f73..805e1fc74d 100644 --- a/accel/tcg/tb-stats.c +++ b/accel/tcg/tb-stats.c @@ -16,18 +16,20 @@ #include "qemu/timer.h" #include "exec/tb-stats.h" +#include "exec/tb-flush.h" #include "tb-context.h" /* TBStatistic collection controls */ enum TBStatsStatus { TB_STATS_DISABLED = 0, TB_STATS_RUNNING, -TB_STATS_PAUSED, -TB_STATS_STOPPED +TB_STATS_PAUSED }; static enum TBStatsStatus tcg_collect_tb_stats; static uint32_t default_tbstats_flag; +/* only accessed in safe work */ +static GList *last_search; uint64_t dev_time; @@ -170,6 +172,101 @@ void dump_jit_profile_info(TCGProfile *s, GString *buf) g_free(jpi); } +static void free_tbstats(void *p, uint32_t hash, void *userp) +{ +g_free(p); +} + +static void clean_tbstats(void) +{ +/* remove all tb_stats */ +qht_iter(_ctx.tb_stats, free_tbstats, NULL); +qht_destroy(_ctx.tb_stats); +} + +void do_hmp_tbstats_safe(CPUState *cpu, run_on_cpu_data icmd) +{ +struct TbstatsCommand *cmdinfo = icmd.host_ptr; +int cmd = cmdinfo->cmd; +uint32_t level = cmdinfo->level; + +switch (cmd) { +case START: +if (tb_stats_collection_enabled()) { +qemu_printf("TB information already being recorded"); +return; +} else if (tb_stats_collection_paused()) { +set_tbstats_flags(level); +} else { +qht_init(_ctx.tb_stats, tb_stats_cmp, 
CODE_GEN_HTABLE_SIZE, + QHT_MODE_AUTO_RESIZE); +} + +set_default_tbstats_flag(level); +enable_collect_tb_stats(); +tb_flush(cpu); +break; +case PAUSE: +if (!tb_stats_collection_enabled()) { +qemu_printf("TB information not being recorded"); +return; +} + +/* + * Continue to create TBStatistic structures but stop collecting + * statistics + */ +
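The start/pause/stop commands handled in do_hmp_tbstats_safe() form a small state machine over tcg_collect_tb_stats: start (re)enables collection, pause keeps the records already gathered but stops updating them, and stop tears everything down. Sketched standalone (apply_tbstats_cmd is an invented helper; the real transitions also re-init or destroy the qht and flush TBs):

```c
#include <assert.h>
#include <string.h>

/*
 * Invented sketch of the tb_stats collection state machine.  "filter"
 * and unknown commands leave the state unchanged; pausing only makes
 * sense while running.
 */
enum TBStatsState { STATS_DISABLED, STATS_RUNNING, STATS_PAUSED };

static enum TBStatsState apply_tbstats_cmd(enum TBStatsState s,
                                           const char *cmd)
{
    if (strcmp(cmd, "start") == 0) {
        return STATS_RUNNING;                       /* from any state */
    } else if (strcmp(cmd, "pause") == 0) {
        return s == STATS_RUNNING ? STATS_PAUSED : s;
    } else if (strcmp(cmd, "stop") == 0) {
        return STATS_DISABLED;                      /* records freed */
    }
    return s;   /* "filter" or unknown command: state unchanged */
}
```

Note that the patch drops TB_STATS_STOPPED from the enum, so "stop" simply returns to the disabled state after cleaning the records, matching the three states above.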
[PATCH v13 01/10] accel/tcg: remove CONFIG_PROFILER
TBStats will be introduced to replace CONFIG_PROFILER totally, here remove all CONFIG_PROFILER related stuffs first. Signed-off-by: Vanderson M. do Rosario Signed-off-by: Alex Bennée Signed-off-by: Fei Wu --- accel/tcg/monitor.c | 25 accel/tcg/tcg-accel-ops.c | 10 -- accel/tcg/translate-all.c | 33 -- include/qemu/timer.h | 9 -- include/tcg/tcg.h | 25 meson.build | 2 - meson_options.txt | 2 - scripts/meson-buildoptions.sh | 3 - softmmu/runstate.c| 9 -- tcg/tcg.c | 208 -- tests/qtest/qmp-cmd-test.c| 3 - 11 files changed, 329 deletions(-) diff --git a/accel/tcg/monitor.c b/accel/tcg/monitor.c index 92fce580f1..e903dd1d2e 100644 --- a/accel/tcg/monitor.c +++ b/accel/tcg/monitor.c @@ -80,36 +80,11 @@ HumanReadableText *qmp_x_query_opcount(Error **errp) return human_readable_text_from_str(buf); } -#ifdef CONFIG_PROFILER - -int64_t dev_time; - -HumanReadableText *qmp_x_query_profile(Error **errp) -{ -g_autoptr(GString) buf = g_string_new(""); -static int64_t last_cpu_exec_time; -int64_t cpu_exec_time; -int64_t delta; - -cpu_exec_time = tcg_cpu_exec_time(); -delta = cpu_exec_time - last_cpu_exec_time; - -g_string_append_printf(buf, "async time %" PRId64 " (%0.3f)\n", - dev_time, dev_time / (double)NANOSECONDS_PER_SECOND); -g_string_append_printf(buf, "qemu time %" PRId64 " (%0.3f)\n", - delta, delta / (double)NANOSECONDS_PER_SECOND); -last_cpu_exec_time = cpu_exec_time; -dev_time = 0; - -return human_readable_text_from_str(buf); -} -#else HumanReadableText *qmp_x_query_profile(Error **errp) { error_setg(errp, "Internal profiler not compiled"); return NULL; } -#endif static void hmp_tcg_register(void) { diff --git a/accel/tcg/tcg-accel-ops.c b/accel/tcg/tcg-accel-ops.c index 58c8e64096..3973591508 100644 --- a/accel/tcg/tcg-accel-ops.c +++ b/accel/tcg/tcg-accel-ops.c @@ -70,20 +70,10 @@ void tcg_cpus_destroy(CPUState *cpu) int tcg_cpus_exec(CPUState *cpu) { int ret; -#ifdef CONFIG_PROFILER -int64_t ti; -#endif assert(tcg_enabled()); -#ifdef CONFIG_PROFILER -ti = 
profile_getclock(); -#endif cpu_exec_start(cpu); ret = cpu_exec(cpu); cpu_exec_end(cpu); -#ifdef CONFIG_PROFILER -qatomic_set(_ctx->prof.cpu_exec_time, -tcg_ctx->prof.cpu_exec_time + profile_getclock() - ti); -#endif return ret; } diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c index c87648b99e..4e035e0f79 100644 --- a/accel/tcg/translate-all.c +++ b/accel/tcg/translate-all.c @@ -203,10 +203,6 @@ void cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb, uintptr_t host_pc) { uint64_t data[TARGET_INSN_START_WORDS]; -#ifdef CONFIG_PROFILER -TCGProfile *prof = _ctx->prof; -int64_t ti = profile_getclock(); -#endif int insns_left = cpu_unwind_data_from_tb(tb, host_pc, data); if (insns_left < 0) { @@ -223,12 +219,6 @@ void cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb, } cpu->cc->tcg_ops->restore_state_to_opc(cpu, tb, data); - -#ifdef CONFIG_PROFILER -qatomic_set(>restore_time, -prof->restore_time + profile_getclock() - ti); -qatomic_set(>restore_count, prof->restore_count + 1); -#endif } bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc) @@ -291,13 +281,6 @@ static int setjmp_gen_code(CPUArchState *env, TranslationBlock *tb, tcg_ctx->cpu = NULL; *max_insns = tb->icount; -#ifdef CONFIG_PROFILER -qatomic_set(_ctx->prof.tb_count, tcg_ctx->prof.tb_count + 1); -qatomic_set(_ctx->prof.interm_time, -tcg_ctx->prof.interm_time + profile_getclock() - *ti); -*ti = profile_getclock(); -#endif - return tcg_gen_code(tcg_ctx, tb, pc); } @@ -311,9 +294,6 @@ TranslationBlock *tb_gen_code(CPUState *cpu, tb_page_addr_t phys_pc; tcg_insn_unit *gen_code_buf; int gen_code_size, search_size, max_insns; -#ifdef CONFIG_PROFILER -TCGProfile *prof = _ctx->prof; -#endif int64_t ti; void *host_pc; @@ -365,12 +345,6 @@ TranslationBlock *tb_gen_code(CPUState *cpu, tb_overflow: -#ifdef CONFIG_PROFILER -/* includes aborted translations because of exceptions */ -qatomic_set(>tb_count1, prof->tb_count1 + 1); -ti = profile_getclock(); -#endif - 
trace_translate_block(tb, pc, tb->tc.ptr); gen_code_size = setjmp_gen_code(env, tb, pc, host_pc, _insns, ); @@ -425,13 +399,6 @@ TranslationBlock *tb_gen_code(CPUState *cpu, */ perf_report_code(pc, tb, tcg_splitwx_to_rx(gen_code_buf)); -#ifdef CONFIG_PROFILER -qatomic_set(>code_time, prof->code_time + profile_getclock() - ti); -qatomic_set(>code_in_len, prof->code_in_len + tb->size); -qatomic_set(>code_out_len,