from:"Dongli Zhang"

[PATCH 1/1] pci: don't skip function 0 occupancy verification for devfn auto assign

2024-07-07 Thread Dongli Zhang

When the devfn is already assigned in the command line, the
do_pci_register_device() may verify if the function 0 is already occupied.

However, when devfn < 0, the verification is skipped because it is part of
the last "else if".

For instance, suppose there is already a device at addr=00.00 of a port.

-device pcie-root-port,bus=pcie.0,chassis=115,id=port01,addr=0e.00 \
-device virtio-net-pci,bus=port01,id=vnet01,addr=00.00 \

When 'addr' is specified for the 2nd device, the hotplug is denied.

(qemu) device_add virtio-net-pci,bus=port01,id=vnet02,addr=01.00
Error: PCI: slot 0 function 0 already occupied by virtio-net-pci, new func 
virtio-net-pci cannot be exposed to guest.

When 'addr' is automatically assigned, the hotplug is not denied. This is
because the verification is skipped.

(qemu) device_add virtio-net-pci,bus=port01,id=vnet02
warning: PCI: slot 1 is not valid for virtio-net-pci, parent device only allows 
plugging into slot 0.

Fix the issue by moving the verification into an independent 'if'
statement.

Fixes: 3f1e1478db2d ("enable multi-function hot-add")
Reported-by: Aswin Unnikrishnan 
Signed-off-by: Dongli Zhang 
---
 hw/pci/pci.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 4c7be52951..82ebd243d0 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1186,14 +1186,15 @@ static PCIDevice *do_pci_register_device(PCIDevice 
*pci_dev,
PCI_SLOT(devfn), PCI_FUNC(devfn), name,
bus->devices[devfn]->name, bus->devices[devfn]->qdev.id);
 return NULL;
-} /*
-   * Populating function 0 triggers a scan from the guest that
-   * exposes other non-zero functions. Hence we need to ensure that
-   * function 0 wasn't added yet.
-   */
-else if (dev->hotplugged &&
- !pci_is_vf(pci_dev) &&
- pci_get_function_0(pci_dev)) {
+}
+
+/*
+ * Populating function 0 triggers a scan from the guest that
+ * exposes other non-zero functions. Hence we need to ensure that
+ * function 0 wasn't added yet.
+ */
+if (dev->hotplugged && !pci_is_vf(pci_dev) &&
+pci_get_function_0(pci_dev)) {
 error_setg(errp, "PCI: slot %d function 0 already occupied by %s,"
" new func %s cannot be exposed to guest.",
PCI_SLOT(pci_get_function_0(pci_dev)->devfn),
-- 
2.34.1

Re: qemu process consumes 100% host CPU after reverting snapshot

2024-03-31 Thread Dongli Zhang

To share my test result, I do NOT reproduce the issue with the below command
line and QEMU-8.2.

However, I can reproduce with QEMU-6.2. The mainloop consumes 100% CPU usage.

qemu-system-x86_64 \
--enable-kvm -cpu host -smp 2 -m 8G \
-object
throttle-group,id=limit0,x-iops-total=200,x-iops-total-max=200,x-bps-total-max-length=1,x-bps-read-max-length=1,x-bps-write-max-length=1,x-iops-total-max-length=1,x-iops-read-max-length=1,x-iops-write-max-length=1
\
-object
throttle-group,id=limit1,x-iops-total=250,x-iops-total-max=250,x-bps-total-max-length=1,x-bps-read-max-length=1,x-bps-write-max-length=1,x-iops-total-max-length=1,x-iops-read-max-length=1,x-iops-write-max-length=1
\
-object
throttle-group,id=limit2,x-iops-total=300,x-iops-total-max=300,x-bps-total-max-length=1,x-bps-read-max-length=1,x-bps-write-max-length=1,x-iops-total-max-length=1,x-iops-read-max-length=1,x-iops-write-max-length=1
\
-object
throttle-group,id=limit012,x-iops-total=400,x-iops-total-max=400,x-bps-total-max-length=1,x-bps-read-max-length=1,x-bps-write-max-length=1,x-iops-total-max-length=1,x-iops-read-max-length=1,x-iops-write-max-length=1
\
-device
virtio-blk-pci,bus=pci.0,addr=0x4,drive=libvirt-4-format,id=virtio-disk0,bootindex=1
\
-blockdev
driver=file,filename=image.qcow2,node-name=libvirt-4-storage,discard=unmap \
-blockdev node-name=libvirt-4-format,driver=qcow2,file=libvirt-4-storage \
-device 
virtio-blk-pci,bus=pci.0,addr=0x5,drive=libvirt-6-filter,id=virtio-disk1 \
-blockdev
driver=file,filename=test01.qcow2,node-name=libvirt-3-storage,discard=unmap \
-blockdev node-name=libvirt-3-format,driver=qcow2,file=libvirt-3-storage \
-blockdev
driver=throttle,node-name=libvirt-5-filter,throttle-group=limit0,file=libvirt-3-format
\
-blockdev
driver=throttle,node-name=libvirt-6-filter,throttle-group=limit012,file=libvirt-5-filter
\
-device 
virtio-blk-pci,bus=pci.0,addr=0x6,drive=libvirt-4-filter,id=virtio-disk2 \
-blockdev
driver=file,filename=test02.qcow2,node-name=libvirt-2-storage,discard=unmap \
-blockdev node-name=libvirt-2-format,driver=qcow2,file=libvirt-2-storage \
-blockdev
driver=throttle,node-name=libvirt-3-filter,throttle-group=limit1,file=libvirt-2-format
\
-blockdev
driver=throttle,node-name=libvirt-4-filter,throttle-group=limit012,file=libvirt-3-filter
\
-device 
virtio-blk-pci,bus=pci.0,addr=0x7,drive=libvirt-2-filter,id=virtio-disk3 \
-blockdev
driver=file,filename=test03.qcow2,node-name=libvirt-1-storage,discard=unmap \
-blockdev node-name=libvirt-1-format,driver=qcow2,file=libvirt-1-storage \
-blockdev
driver=throttle,node-name=libvirt-1-filter,throttle-group=limit2,file=libvirt-1-format
\
-blockdev
driver=throttle,node-name=libvirt-2-filter,throttle-group=limit012,file=libvirt-1-filter
\
-netdev user,id=user0,hostfwd=tcp::5028-:22 \
-device
virtio-net-pci,netdev=user0,id=net0,mac=52:54:00:12:34:56,bus=pci.0,addr=0x3 \
-monitor stdio

(qemu) info status
(qemu) savevm snapshot1
(qemu) loadvm snapshot1

Dongli Zhang

On 3/28/24 03:50, Chun Feng Wu wrote:
> BTW, if I download
> qemu8.0(https://urldefense.com/v3/__https://download.qemu.org/qemu-8.0.0.tar.xz__;!!ACWV5N9M2RV99hQ!ObhT3ZYGr8GQFt4p7dWc-t-CFeKff1UzyRPc9WQrAuhz1B7jDC1-VTqjoCUDcrhXsffyBQuRtnoZhfZhS8FVig$
>  ), compile and install it in my ubuntu22.04, launch vm with the same 
> command(just update --machine to be "pc"), the host cpu usage is not high, it 
> seems to be a bug in QEMU6
> 
> Also, I have another question, for disk iotune or throttle group, after
> reverting snapshot, if I login vm, and run fio, I/O performance drops a lot in
> both QEMU6 and QEMU8, do anyone know the reason? Any explanation would be
> appreciated!
> 
> 
> 
> On 2024/3/28 07:32, Chun Feng Wu wrote:
>>
>> Hi,
>>
>> I am testing throttle filter chain(multiple throttle-groups on disk) with the
>> following steps:
>> 1. start guest vm(chained throttle filters applied on disk per
>> https://urldefense.com/v3/__https://github.com/qemu/qemu/blob/master/docs/throttle.txt__;!!ACWV5N9M2RV99hQ!ObhT3ZYGr8GQFt4p7dWc-t-CFeKff1UzyRPc9WQrAuhz1B7jDC1-VTqjoCUDcrhXsffyBQuRtnoZhfaQOXc8pg$
>>  )
>> 2. take snapshot
>> 3. revert snapshot
>>
>> after step3, I noticed qemu process in host consumes 100% cpu, and after I
>> login guest vm, vm cannot(or slowly) response my cmd (it works well before
>> reverting).
>>
>> /    PID USER  PR  NI    VIRT    RES         SHR S  %CPU  %MEM
>> TIME+ COMMAND
>>    65455 root  20   0 9659924 891328  20132 R 100.3 5.4    29:39.93
>> qemu-system-x86/
>>
>> /
>> /
>>
>> Does anybody know why such issue happens?  is it a bug or I misunderstand
>> something?
>>
>> my cmd:
>>
>> /qemu-system-x86_64 \
>>   -name ubuntu-20.04-vm,debug-threads=on \
>>   -machine pc-i440fx-jammy,usb=off,dump-guest-core=off \
>>   -acce

Re: [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support

2024-03-21 Thread Dongli Zhang

Hi Jonah,

Would you mind helping explain how does VIRTIO_F_IN_ORDER improve the 
performance?

https://lore.kernel.org/all/20240321155717.1392787-1-jonah.pal...@oracle.com/#t

I tried to look for it from prior discussions but could not find why.

https://lore.kernel.org/all/byapr18mb2791df7e6c0f61e2d8698e8fa0...@byapr18mb2791.namprd18.prod.outlook.com/

Thank you very much!

Dongli Zhang

On 3/21/24 08:57, Jonah Palmer wrote:
> The goal of these patches is to add support to a variety of virtio and
> vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
> indicates that all buffers are used by the device in the same order in
> which they were made available by the driver.
> 
> These patches attempt to implement a generalized, non-device-specific
> solution to support this feature.
> 
> The core feature behind this solution is a buffer mechanism in the form
> of GLib's GHashTable. The decision behind using a hash table was to
> leverage their ability for quick lookup, insertion, and removal
> operations. Given that our keys are simply numbers of an ordered
> sequence, a hash table seemed like the best choice for a buffer
> mechanism.
> 
> -
> 
> The strategy behind this implementation is as follows:
> 
> We know that buffers that are popped from the available ring and enqueued
> for further processing will always done in the same order in which they
> were made available by the driver. Given this, we can note their order
> by assigning the resulting VirtQueueElement a key. This key is a number
> in a sequence that represents the order in which they were popped from
> the available ring, relative to the other VirtQueueElements.
> 
> For example, given 3 "elements" that were popped from the available
> ring, we assign a key value to them which represents their order (elem0
> is popped first, then elem1, then lastly elem2):
> 
>  elem2   --  elem1   --  elem0   ---> Enqueue for processing
> (key: 2)(key: 1)(key: 0)
> 
> Then these elements are enqueued for further processing by the host.
> 
> While most devices will return these completed elements in the same
> order in which they were enqueued, some devices may not (e.g.
> virtio-blk). To guarantee that these elements are put on the used ring
> in the same order in which they were enqueued, we can use a buffering
> mechanism that keeps track of the next expected sequence number of an
> element.
> 
> In other words, if the completed element does not have a key value that
> matches the next expected sequence number, then we know this element is
> not in-order and we must stash it away in a hash table until an order
> can be made. The element's key value is used as the key for placing it
> in the hash table.
> 
> If the completed element has a key value that matches the next expected
> sequence number, then we know this element is in-order and we can push
> it on the used ring. Then we increment the next expected sequence number
> and check if the hash table contains an element at this key location.
> 
> If so, we retrieve this element, push it to the used ring, delete the
> key-value pair from the hash table, increment the next expected sequence
> number, and check the hash table again for an element at this new key
> location. This process is repeated until we're unable to find an element
> in the hash table to continue the order.
> 
> So, for example, say the 3 elements we enqueued were completed in the
> following order: elem1, elem2, elem0. The next expected sequence number
> is 0:
> 
> exp-seq-num = 0:
> 
>  elem1   --> elem1.key == exp-seq-num ? --> No, stash it
> (key: 1) |
>  |
>  v
>
>|key: 1 - elem1|
>
> -
> exp-seq-num = 0:
> 
>  elem2   --> elem2.key == exp-seq-num ? --> No, stash it
> (key: 2) |
>  |
>  v
>
>|key: 1 - elem1|
>|--|
>|key: 2 - elem2|
>
> -
>

Re: [PATCH v4 1/1] oslib-posix: initialize backend memory objects in parallel

2024-02-03 Thread Dongli Zhang

  /* Avoid creating a single thread for MADV_POPULATE_WRITE */
> -if (context.num_threads == 1) {
> +/*
> + * Avoid creating a single thread for MADV_POPULATE_WRITE when
> + * preallocating synchronously.
> + */
> +if (context->num_threads == 1 && !async) {
>  if (qemu_madvise(area, hpagesize * numpages,
>   QEMU_MADV_POPULATE_WRITE)) {
>  return -errno;
> @@ -445,50 +478,86 @@ static int touch_all_pages(char *area, size_t 
> hpagesize, size_t numpages,
>  touch_fn = do_touch_pages;
>  }
>  
> -context.threads = g_new0(MemsetThread, context.num_threads);
> -numpages_per_thread = numpages / context.num_threads;
> -leftover = numpages % context.num_threads;
> -for (i = 0; i < context.num_threads; i++) {
> -context.threads[i].addr = addr;
> -context.threads[i].numpages = numpages_per_thread + (i < leftover);
> -context.threads[i].hpagesize = hpagesize;
> -context.threads[i].context = 
> +context->threads = g_new0(MemsetThread, context->num_threads);
> +numpages_per_thread = numpages / context->num_threads;
> +leftover = numpages % context->num_threads;
> +for (i = 0; i < context->num_threads; i++) {
> +context->threads[i].addr = addr;
> +context->threads[i].numpages = numpages_per_thread + (i < leftover);
> +context->threads[i].hpagesize = hpagesize;
> +context->threads[i].context = context;
>  if (tc) {
> -thread_context_create_thread(tc, [i].pgthread,
> +thread_context_create_thread(tc, >threads[i].pgthread,
>   "touch_pages",
> - touch_fn, [i],
> + touch_fn, >threads[i],
>   QEMU_THREAD_JOINABLE);
>  } else {
> -qemu_thread_create([i].pgthread, "touch_pages",
> -   touch_fn, [i],
> +qemu_thread_create(>threads[i].pgthread, "touch_pages",
> +   touch_fn, >threads[i],
> QEMU_THREAD_JOINABLE);
>  }
> -addr += context.threads[i].numpages * hpagesize;
> +addr += context->threads[i].numpages * hpagesize;
> +}
> +
> +if (async) {
> +/*
> + * async requests currently require the BQL. Add it to the list and 
> kick
> + * preallocation off during qemu_finish_async_prealloc_mem().
> + */
> +assert(bql_locked());
> +QLIST_INSERT_HEAD(_contexts, context, next);
> +return 0;
>  }
>  
>  if (!use_madv_populate_write) {
> -sigbus_memset_context = 
> +sigbus_memset_context = context;
>  }
>  
>  qemu_mutex_lock(_mutex);
> -context.all_threads_created = true;
> +context->all_threads_created = true;
>  qemu_cond_broadcast(_cond);
>  qemu_mutex_unlock(_mutex);
>  
> -for (i = 0; i < context.num_threads; i++) {
> -int tmp = (uintptr_t)qemu_thread_join([i].pgthread);
> +ret = wait_and_free_mem_prealloc_context(context);
>  
> +if (!use_madv_populate_write) {
> +sigbus_memset_context = NULL;
> +}
> +return ret;
> +}
> +
> +bool qemu_finish_async_prealloc_mem(Error **errp)
> +{
> +int ret, tmp;

The above should be initialized?

I did a build test and encounter:

In file included from ../util/oslib-posix.c:36:
../util/oslib-posix.c: In function ‘qemu_finish_async_prealloc_mem’:
/home/libvirt/vm/software/qemu/include/qapi/error.h:334:5: error: ‘ret’ may be 
used uninitialized in this function [-Werror=maybe-uninitialized]
  334 | error_setg_errno_internal((errp), __FILE__, __LINE__, __func__, 
\
  | ^
../util/oslib-posix.c:531:9: note: ‘ret’ was declared here
  531 | int ret, tmp;
  | ^~~
cc1: all warnings being treated as errors
ninja: build stopped: subcommand failed.
make: *** [Makefile:162: run-ninja] Error 1

Thank you very much!

Dongli Zhang

> +MemsetContext *context, *next_context;
> +
> +/* Waiting for preallocation requires the BQL. */
> +assert(bql_locked());
> +if (QLIST_EMPTY(_contexts)) {
> +return true;
> +}
> +
> +qemu_mutex_lock(_mutex);
> +QLIST_FOREACH(context, _contexts, next) {
> +context->all_threads_created = true;
> +}
> +qemu_cond_broadcast(_cond);
> +qemu_mutex_unlock(_mutex);
> +
> +QLIST_FOREACH_SAFE(context, _con

[PATCH 1/1] vhost-scsi: fix usage of error_reportf_err()

2023-12-13 Thread Dongli Zhang

It is required to use error_report() instead of error_reportf_err(), if the
prior function does not take local_err as the argument. As a result, the
local_err is always NULL and segment fault may happen.

vhost_scsi_start()
-> vhost_scsi_set_endpoint(s) --> does not allocate local_err
-> error_reportf_err()
   -> error_vprepend()
  -> g_string_append(newmsg, (*errp)->msg) --> (*errp) is NULL

In addition, add ": " at the end of other error_reportf_err() logs.

Fixes: 7962e432b4e4 ("vhost-user-scsi: support reconnect to backend")
Signed-off-by: Dongli Zhang 
---
 hw/scsi/vhost-scsi.c  | 4 ++--
 hw/scsi/vhost-user-scsi.c | 3 ++-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
index 3126df9e1d..9929c0d14b 100644
--- a/hw/scsi/vhost-scsi.c
+++ b/hw/scsi/vhost-scsi.c
@@ -91,13 +91,13 @@ static int vhost_scsi_start(VHostSCSI *s)
 
 ret = vhost_scsi_common_start(vsc, _err);
 if (ret < 0) {
-error_reportf_err(local_err, "Error starting vhost-scsi");
+error_reportf_err(local_err, "Error starting vhost-scsi: ");
 return ret;
 }
 
 ret = vhost_scsi_set_endpoint(s);
 if (ret < 0) {
-error_reportf_err(local_err, "Error setting vhost-scsi endpoint");
+error_report("Error setting vhost-scsi endpoint");
 vhost_scsi_common_stop(vsc);
 }
 
diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
index 780f10559d..af18c4f3d3 100644
--- a/hw/scsi/vhost-user-scsi.c
+++ b/hw/scsi/vhost-user-scsi.c
@@ -83,7 +83,8 @@ static void vhost_user_scsi_set_status(VirtIODevice *vdev, 
uint8_t status)
 if (should_start) {
 ret = vhost_user_scsi_start(s, _err);
 if (ret < 0) {
-error_reportf_err(local_err, "unable to start vhost-user-scsi: %s",
+error_reportf_err(local_err,
+  "unable to start vhost-user-scsi: %s: ",
   strerror(-ret));
 qemu_chr_fe_disconnect(>conf.chardev);
 }
-- 
2.34.1

[PATCH v3 1/1] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE

2023-11-13 Thread Dongli Zhang

The "perf stat" at the VM side still works even we set "-cpu host,-pmu" in
the QEMU command line. That is, neither "-cpu host,-pmu" nor "-cpu EPYC"
could disable the pmu virtualization in an AMD environment.

We still see below at VM kernel side ...

[0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like below.

[0.596381] Performance Events: PMU not available due to virtualization, 
using software events only.
[0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

This is because the AMD pmu does not rely on cpuid to decide if the pmu
virtualization is supported.

We introduce a new property 'pmu-cap-disabled' for KVM accel to set
KVM_PMU_CAP_DISABLE if KVM_CAP_PMU_CAPABILITY is supported. Only x86 host
is supported because currently KVM uses KVM_CAP_PMU_CAPABILITY only for
x86.

Cc: Joe Jin 
Cc: Like Xu 
Cc: Denis V. Lunev 
Signed-off-by: Dongli Zhang 
---
This is to resurrect the patch to disable PMU. I split the patchset and
send the patch to disable PMU.

Changed since v1:
  [PATCH 1/3] kvm: introduce a helper before creating the 1st vcpu
  https://lore.kernel.org/all/20221119122901.2469-2-dongli.zh...@oracle.com/
  [PATCH 2/3] i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled
  https://lore.kernel.org/all/20221119122901.2469-3-dongli.zh...@oracle.com/

- In version 1 we did not introduce the new property. We ioctl
  KVM_PMU_CAP_DISABLE only before the creation of the 1st vcpu. We had
  introduced a helpfer function to do this job before creating the 1st
  KVM vcpu in v1.

Changed since v2:
   https://lore.kernel.org/all/20230621013821.6874-2-dongli.zh...@oracle.com/
   Nothing. I split the patchset and send this as a single patch.


As a summary:

- Greg Kurz and Liang Yan suggested introduce the machine property to
disable the PMU (e.g., with the concern of live migration, or vCPU prop
theoretically be different for each vCPU).

- Denis V. Lunev and Like Xu preferred the method in v1 patch: to re-use
cpu->enable_pmu.

Would you please suggest if we may go via v1 (re-use cpu->enable_pmu) or
v2 (to introduce new machine prop)


1. The v1 is to re-use cpu->enable_pmu. It disables KVM_PMU_CAP_DISABLE
when creating the 1st vCPU. We may use the vCPU id or (current_cpu ==
first_cpu) to check when it is the 1st vCPU creation.

The benefit is that the QEMU user (e.g., libvirt will not require much
change).

2. The v2 is to introduce the new machine property as in this patch.

The benefit: the 'pmu' is to configure cpuid, while KVM_PMU_CAP_DISABLE
is a different KVM feature. They are orthogonal features.

Perhaps there is another option to sum both v1 and v2 together ...


Perhaps the maintainer can help make decision on that :)

Thank you very much!


 accel/kvm/kvm-all.c  |  1 +
 include/sysemu/kvm_int.h |  1 +
 qemu-options.hx  |  7 ++
 target/i386/kvm/kvm.c| 46 
 4 files changed, 55 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index e39a810a4e..4acc5bdcc8 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3619,6 +3619,7 @@ static void kvm_accel_instance_init(Object *obj)
 s->xen_version = 0;
 s->xen_gnttab_max_frames = 64;
 s->xen_evtchn_max_pirq = 256;
+s->pmu_cap_disabled = false;
 }
 
 /**
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index fd846394be..b7c0c6ffee 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -120,6 +120,7 @@ struct KVMState
 uint32_t xen_caps;
 uint16_t xen_gnttab_max_frames;
 uint16_t xen_evtchn_max_pirq;
+bool pmu_cap_disabled;
 };
 
 void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
diff --git a/qemu-options.hx b/qemu-options.hx
index 42fd09e4de..7fe201e41c 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -188,6 +188,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
 "dirty-ring-size=n (KVM dirty ring GFN count, default 0)\n"
 "eager-split-size=n (KVM Eager Page Split chunk size, 
default 0, disabled. ARM only)\n"
 "notify-vmexit=run|internal-error|disable,notify-window=n 
(enable notify VM exit and set notify window, x86 only)\n"
+"pmu-cap-disabled=true|false (disable 
KVM_CAP_PMU_CAPABILITY, x86 only, default false)\n"
 "thread=single|multi (enable multi-threaded TCG)\n", 
QEMU_ARCH_ALL)
 SRST
 ``-accel name[,prop=value[,...]]``
@@ -269,6 +270,12 @@ SRST
 open up for a specified of time (i.e. notify-window).
 Default: notify-vmexit=run,notify-window=0.
 
+``pmu-cap-disabled=true|false``
+When the KVM accelerator is used, it controls whether to disable the
+KVM_CAP_PMU_CAPABILITY via KVM_PMU_CAP_DISABLE. When disabled, the
+PMU vi

Re: [PATCH v3 1/1] hmp: synchronize cpu state for lapic info

2023-10-31 Thread Dongli Zhang

Hi Juan,

On 10/30/23 09:31, Juan Quintela wrote:
> Dongli Zhang  wrote:
>> While the default "info lapic" always synchronizes cpu state ...
>>
>> mon_get_cpu()
>> -> mon_get_cpu_sync(mon, true)
>>-> cpu_synchronize_state(cpu)
>>   -> ioctl KVM_GET_LAPIC (taking KVM as example)
>>
>> ... the cpu state is not synchronized when the apic-id is available as
>> argument.
>>
>> The cpu state should be synchronized when apic-id is available. Otherwise
>> the "info lapic " always returns stale data.
>>
>> Reference:
>> https://urldefense.com/v3/__https://lore.kernel.org/all/20211028155457.967291-19-berra...@redhat.com/__;!!ACWV5N9M2RV99hQ!KOLfuCesLC4T6ka9bjf4x6ncC34GPK9pVvWwOJhbwSZw2fwp3Mxlakk0fnR-NCoqRPKOX7X4SOAxozQBC7VQ$
>>  
>>
>> Cc: Joe Jin 
>> Signed-off-by: Dongli Zhang 
>> Reviewed-by: Daniel P. Berrangé 
>> Reviewed-by: David Woodhouse 
> 
> Reviewed-by: Juan Quintela 
> 
> But I wonder how I did get CC'd on this patch O:-)
> 

Thank you very much!

This component does not have a maintainer. I just blindly cc all suggested
reviewers :), in order to get it reviewed and merged.

get_maintainer.pl: No maintainers found, printing recent contributors.
get_maintainer.pl: Do not blindly cc: them on patches!  Use common sense.

Thank you very much!

Dongli Zhang

Re: [PATCH] target/i386/monitor: synchronize cpu before printing lapic state

2023-10-26 Thread Dongli Zhang

Hi David,

Thank you very much for the Reviewed-by in another thread.

I have re-based the patch and sent again.

https://lore.kernel.org/all/20231026211938.162815-1-dongli.zh...@oracle.com/

Dongli Zhang

On 10/26/23 09:39, Dongli Zhang wrote:
> Hi David,
> 
> On 10/26/23 08:39, David Woodhouse wrote:
>> From: David Woodhouse 
>>
>> Where the local APIC is emulated by KVM, we need kvm_get_apic() to pull
>> the current state into userspace before it's printed. Otherwise we get
>> stale values.
>>
>> Signed-off-by: David Woodhouse 
>> ---
>>  target/i386/monitor.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/target/i386/monitor.c b/target/i386/monitor.c
>> index 6512846327..0754d699ba 100644
>> --- a/target/i386/monitor.c
>> +++ b/target/i386/monitor.c
>> @@ -29,6 +29,7 @@
>>  #include "monitor/hmp.h"
>>  #include "qapi/qmp/qdict.h"
>>  #include "sysemu/kvm.h"
>> +#include "sysemu/hw_accel.h"
>>  #include "qapi/error.h"
>>  #include "qapi/qapi-commands-misc-target.h"
>>  #include "qapi/qapi-commands-misc.h"
>> @@ -655,6 +656,7 @@ void hmp_info_local_apic(Monitor *mon, const QDict 
>> *qdict)
>>  if (qdict_haskey(qdict, "apic-id")) {
>>  int id = qdict_get_try_int(qdict, "apic-id", 0);
>>  cs = cpu_by_arch_id(id);
>> +cpu_synchronize_state(cs);
> 
> AFAIR, there is a case that cs may be NULL here when I was sending the similar
> bugfix long time ago.
> 
> https://lore.kernel.org/qemu-devel/20210701214051.1588-1-dongli.zh...@oracle.com/
> 
> ... and resend:
> 
> https://lore.kernel.org/qemu-devel/20210908143803.29191-1-dongli.zh...@oracle.com/
> 
> ... and resent by Daniel as part of another patchset (after review):
> 
> https://lore.kernel.org/qemu-devel/20211028155457.967291-19-berra...@redhat.com/
> 
> 
> This utility is helpful for the diagnostic of loss of interrupt issue.
> 
> Dongli Zhang
> 
>>  } else {
>>  cs = mon_get_cpu(mon);
>>  }

[PATCH v3 1/1] hmp: synchronize cpu state for lapic info

2023-10-26 Thread Dongli Zhang

While the default "info lapic" always synchronizes cpu state ...

mon_get_cpu()
-> mon_get_cpu_sync(mon, true)
   -> cpu_synchronize_state(cpu)
  -> ioctl KVM_GET_LAPIC (taking KVM as example)

... the cpu state is not synchronized when the apic-id is available as
argument.

The cpu state should be synchronized when apic-id is available. Otherwise
the "info lapic " always returns stale data.

Reference:
https://lore.kernel.org/all/20211028155457.967291-19-berra...@redhat.com/

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
Reviewed-by: Daniel P. Berrangé 
Reviewed-by: David Woodhouse 
---
Changed since v1:
  - I sent out wrong patch version in v1
Changed since v2:
  - Add the Reviewed-by from Daniel and David

 target/i386/monitor.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/target/i386/monitor.c b/target/i386/monitor.c
index 6512846327..d727270fd0 100644
--- a/target/i386/monitor.c
+++ b/target/i386/monitor.c
@@ -28,6 +28,7 @@
 #include "monitor/hmp-target.h"
 #include "monitor/hmp.h"
 #include "qapi/qmp/qdict.h"
+#include "sysemu/hw_accel.h"
 #include "sysemu/kvm.h"
 #include "qapi/error.h"
 #include "qapi/qapi-commands-misc-target.h"
@@ -654,7 +655,11 @@ void hmp_info_local_apic(Monitor *mon, const QDict *qdict)
 
 if (qdict_haskey(qdict, "apic-id")) {
 int id = qdict_get_try_int(qdict, "apic-id", 0);
+
 cs = cpu_by_arch_id(id);
+if (cs) {
+cpu_synchronize_state(cs);
+}
 } else {
 cs = mon_get_cpu(mon);
 }
-- 
2.34.1

Re: [PATCH] target/i386/monitor: synchronize cpu before printing lapic state

2023-10-26 Thread Dongli Zhang

Hi David,

On 10/26/23 08:39, David Woodhouse wrote:
> From: David Woodhouse 
> 
> Where the local APIC is emulated by KVM, we need kvm_get_apic() to pull
> the current state into userspace before it's printed. Otherwise we get
> stale values.
> 
> Signed-off-by: David Woodhouse 
> ---
>  target/i386/monitor.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/target/i386/monitor.c b/target/i386/monitor.c
> index 6512846327..0754d699ba 100644
> --- a/target/i386/monitor.c
> +++ b/target/i386/monitor.c
> @@ -29,6 +29,7 @@
>  #include "monitor/hmp.h"
>  #include "qapi/qmp/qdict.h"
>  #include "sysemu/kvm.h"
> +#include "sysemu/hw_accel.h"
>  #include "qapi/error.h"
>  #include "qapi/qapi-commands-misc-target.h"
>  #include "qapi/qapi-commands-misc.h"
> @@ -655,6 +656,7 @@ void hmp_info_local_apic(Monitor *mon, const QDict *qdict)
>  if (qdict_haskey(qdict, "apic-id")) {
>  int id = qdict_get_try_int(qdict, "apic-id", 0);
>  cs = cpu_by_arch_id(id);
> +cpu_synchronize_state(cs);

AFAIR, there is a case that cs may be NULL here when I was sending the similar
bugfix long time ago.

https://lore.kernel.org/qemu-devel/20210701214051.1588-1-dongli.zh...@oracle.com/

... and resend:

https://lore.kernel.org/qemu-devel/20210908143803.29191-1-dongli.zh...@oracle.com/

... and resent by Daniel as part of another patchset (after review):

https://lore.kernel.org/qemu-devel/20211028155457.967291-19-berra...@redhat.com/


This utility is helpful for the diagnostic of loss of interrupt issue.

Dongli Zhang

>  } else {
>  cs = mon_get_cpu(mon);
>  }

Re: [PATCH 1/1] dump: kdump-zlib data pages not dumped with pvtime/aarch64

2023-07-28 Thread Dongli Zhang

Hi Marc-André,

May I get any feedback on this bugfix?

Feel free to let me know if I can help by collecting any data from the aarch64
server.

Since the pvtime is supported by default by QEMU/linux, this indicates currently
the dump does not work for aarch64 with new version of QEMU/linux.

Thank you very much!

Dongli Zhang

On 7/12/23 22:58, Dongli Zhang wrote:
> The kdump-zlib data pages are not dumped from aarch64 host when the
> 'pvtime' is involved, that is, when the block->target_end is not aligned to
> page_size. In the below example, it is expected to dump two blocks.
> 
> (qemu) info mtree -f
> ... ...
>   090a-090a0fff (prio 0, ram): pvtime KVM
> ... ...
>   4000-0001bfff (prio 0, ram): mach-virt.ram KVM
> ... ...
> 
> However, there is an issue with get_next_page() so that the pages for
> "mach-virt.ram" will not be dumped.
> 
> At line 1296, although we have reached at the end of the 'pvtime' block,
> since it is not aligned to the page_size (e.g., 0x1), it will not break
> at line 1298.
> 
> 1255 static bool get_next_page(GuestPhysBlock **blockptr, uint64_t *pfnptr,
> 1256   uint8_t **bufptr, DumpState *s)
> ... ...
> 1294 memcpy(buf + addr % page_size, hbuf, n);
> 1295 addr += n;
> 1296 if (addr % page_size == 0) {
> 1297 /* we filled up the page */
> 1298 break;
> 1299 }
> 
> As a result, get_next_page() will continue to the next
> block ("mach-virt.ram"). Finally, when get_next_page() returns to the
> caller:
> 
> - 'pfnptr' is referring to the 'pvtime'
> - but 'blockptr' is referring to the "mach-virt.ram"
> 
> When get_next_page() is called the next time, "*pfnptr += 1" still refers
> to the prior 'pvtime'. It will exit immediately because it is out of the
> range of the current "mach-virt.ram".
> 
> The fix is to break when it is time to come to the next block, so that both
> 'pfnptr' and 'blockptr' refer to the same block.
> 
> Fixes: 94d788408d2d ("dump: fix kdump to work over non-aligned blocks")
> Cc: Joe Jin 
> Signed-off-by: Dongli Zhang 
> ---
>  dump/dump.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/dump/dump.c b/dump/dump.c
> index 1f1a6edcab..c93e4c572f 100644
> --- a/dump/dump.c
> +++ b/dump/dump.c
> @@ -1293,8 +1293,8 @@ static bool get_next_page(GuestPhysBlock **blockptr, 
> uint64_t *pfnptr,
>  
>  memcpy(buf + addr % page_size, hbuf, n);
>  addr += n;
> -if (addr % page_size == 0) {
> -/* we filled up the page */
> +if (addr % page_size == 0 || addr >= block->target_end) {
> +/* we filled up the page or the current block is finished */
>  break;
>  }
>  } else {

[PATCH 1/1] dump: kdump-zlib data pages not dumped with pvtime/aarch64

2023-07-13 Thread Dongli Zhang

The kdump-zlib data pages are not dumped from aarch64 host when the
'pvtime' is involved, that is, when the block->target_end is not aligned to
page_size. In the below example, it is expected to dump two blocks.

(qemu) info mtree -f
... ...
  090a-090a0fff (prio 0, ram): pvtime KVM
... ...
  4000-0001bfff (prio 0, ram): mach-virt.ram KVM
... ...

However, there is an issue with get_next_page() so that the pages for
"mach-virt.ram" will not be dumped.

At line 1296, although we have reached at the end of the 'pvtime' block,
since it is not aligned to the page_size (e.g., 0x1), it will not break
at line 1298.

1255 static bool get_next_page(GuestPhysBlock **blockptr, uint64_t *pfnptr,
1256   uint8_t **bufptr, DumpState *s)
... ...
1294 memcpy(buf + addr % page_size, hbuf, n);
1295 addr += n;
1296 if (addr % page_size == 0) {
1297 /* we filled up the page */
1298 break;
1299 }

As a result, get_next_page() will continue to the next
block ("mach-virt.ram"). Finally, when get_next_page() returns to the
caller:

- 'pfnptr' is referring to the 'pvtime'
- but 'blockptr' is referring to the "mach-virt.ram"

When get_next_page() is called the next time, "*pfnptr += 1" still refers
to the prior 'pvtime'. It will exit immediately because it is out of the
range of the current "mach-virt.ram".

The fix is to break when it is time to come to the next block, so that both
'pfnptr' and 'blockptr' refer to the same block.

Fixes: 94d788408d2d ("dump: fix kdump to work over non-aligned blocks")
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 dump/dump.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/dump/dump.c b/dump/dump.c
index 1f1a6edcab..c93e4c572f 100644
--- a/dump/dump.c
+++ b/dump/dump.c
@@ -1293,8 +1293,8 @@ static bool get_next_page(GuestPhysBlock **blockptr, 
uint64_t *pfnptr,
 
 memcpy(buf + addr % page_size, hbuf, n);
 addr += n;
-if (addr % page_size == 0) {
-/* we filled up the page */
+if (addr % page_size == 0 || addr >= block->target_end) {
+/* we filled up the page or the current block is finished */
 break;
 }
 } else {
-- 
2.34.1

Re: [PATCH RESEND v2 2/2] target/i386/kvm: get and put AMD pmu registers

2023-07-03 Thread Dongli Zhang

Hi Like,

On 7/2/23 07:15, Like Xu wrote:
> On Wed, Jun 21, 2023 at 9:39 AM Dongli Zhang  wrote:
>>
>> The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
>> side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
>> the KVM side.
>>
>> However, only the Intel gp/fixed/global pmu registers are involved. There
>> is not any implementation for AMD pmu registers. The
>> 'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters' are
>> calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work for
>> AMD. Before AMD PerfMonV2, the number of gp registers is decided based on
>> the CPU version.
> 
> Updating the relevant documentation to clarify this part of the deficiency
> would be a good first step.

Would you mind suggesting the doc to add this TODO/deficiency?

The only place I find is to add a new TODO under docs/system/i386, but not sure
if it is worth it. This bugfix is not complex.

> 
>>
>> This patch is to add the support for AMD version=1 pmu, to get and put AMD
>> pmu registers. Otherwise, there will be a bug:
> 
> AMD version=1 ?
> 
> AMD does not have version 1, just directly has 2, perhaps because of x86
> compatibility. AMD also does not have the so-called architectural pmu.
> Maybe need to rename has_architectural_pmu_version for AMD.
> 

Thank you very much for the explanation. I will use version 2.

> It might be more helpful to add similar support for AMD PerfMonV2.

Yes. I will do that. During that time, the AMD PerfMonV2 KVM patchset (from you)
was still in progress.

I see the pull request from Paolo today. I will add that in v3.

> 
>>
>> 1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
>> is running "perf top". The pmu registers are not disabled gracefully.
>>
>> 2. Although the x86_cpu_reset() resets many registers to zero, the
>> kvm_put_msrs() does not puts AMD pmu registers to KVM side. As a result,
>> some pmu events are still enabled at the KVM side.
> 
> I agree that we should have done that, especially if guest pmu is enabled
> on the AMD platforms.
> 
>>
>> 3. The KVM pmc_speculative_in_use() always returns true so that the events
>> will not be reclaimed. The kvm_pmc->perf_event is still active.
>>
>> 4. After the reboot, the VM kernel reports below error:
>>
>> [0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS 
>> detected, complain to your hardware vendor.
>> [0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 
>> c0010200 is 530076)
>>
>> 5. In a worse case, the active kvm_pmc->perf_event is still able to
>> inject unknown NMIs randomly to the VM kernel.
>>
>> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>>
>> The patch is to fix the issue by resetting AMD pmu registers during the
>> reset.
> 
> I'm not sure if the qemu_reset or VM kexec will necessarily trigger
> kvm::amd_pmu_reset().

According to the mainline linux kernel:

kvm_vcpu_reset()
-> kvm_pmu_reset()
   -> amd_pmu_reset()

The PMU will not reset when init_event==true, that is, when processing the INIT:
line 12049.

11975 void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
11976 {
... ...
12049 if (!init_event) {
12050 kvm_pmu_reset(vcpu);
12051 vcpu->arch.smbase = 0x3;
12052
12053 vcpu->arch.msr_misc_features_enables = 0;
12054 vcpu->arch.ia32_misc_enable_msr =
MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL |
12055
MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;
12056
12057 __kvm_set_xcr(vcpu, 0, XFEATURE_MASK_FP);
12058 __kvm_set_msr(vcpu, MSR_IA32_XSS, 0, true);
12059 }

According to the below ...

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d28bc9dd25ce023270d2e039e7c98d38ecbf7758

... "INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU,
MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and BSP."

That's why initially I did not send a KVM patch to remove the 'init_event'.

> 
>>
>> Cc: Joe Jin 
>> Cc: Like Xu 
>> Signed-off-by: Dongli Zhang 
>> ---
>>  target/i386/cpu.h |  5 +++
>>  target/i386/kvm/kvm.c | 83 +--
>>  2 files changed, 86 insertions(+), 2 deletions(-)
>>
>> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
>> index cd047e0410..b8ba72e87a 100644
>> --- a/target/i386/cpu.h
>> +++ b/target/i386/cpu.h
>> @@ -471,6 +471,11 @@ typedef enum X86Seg {
>>  #define MSR_CORE_PERF_GLOBAL_CTRL   0x38f
>>  #define M

Re: [PATCH RESEND v2 1/2] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE

2023-07-03 Thread Dongli Zhang

Hi Like,

On 7/2/23 06:41, Like Xu wrote:
> On Wed, Jun 21, 2023 at 9:39 AM Dongli Zhang  wrote:
>>
>> The "perf stat" at the VM side still works even we set "-cpu host,-pmu" in
>> the QEMU command line. That is, neither "-cpu host,-pmu" nor "-cpu EPYC"
>> could disable the pmu virtualization in an AMD environment.
>>
>> We still see below at VM kernel side ...
>>
>> [0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>>
>> ... although we expect something like below.
>>
>> [0.596381] Performance Events: PMU not available due to virtualization, 
>> using software events only.
>> [0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>>
>> This is because the AMD pmu (v1) does not rely on cpuid to decide if the
>> pmu virtualization is supported.
>>
>> We introduce a new property 'pmu-cap-disabled' for KVM accel to set
>> KVM_PMU_CAP_DISABLE if KVM_CAP_PMU_CAPABILITY is supported. Only x86 host
>> is supported because currently KVM uses KVM_CAP_PMU_CAPABILITY only for
>> x86.
> 
> We may check cpu->enable_pmu when creating the first CPU or a BSP one
> (before it gets running) and then choose whether to disable guest pmu using
> vm ioctl KVM_CAP_PMU_CAPABILITY. Introducing a new property is not too
> acceptable if there are other options.

In the v1 of the implementation, we have implemented something similar: not
based on the cpu_index (or BSP), but to introduce a helper before creating the
KVM vcpu to let the further implementation decide. We did the
KVM_CAP_PMU_CAPABILITY in that helper once.

[PATCH 1/3] kvm: introduce a helper before creating the 1st vcpu
https://lore.kernel.org/all/20221119122901.2469-2-dongli.zh...@oracle.com/

[PATCH 2/3] i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled
https://lore.kernel.org/all/20221119122901.2469-3-dongli.zh...@oracle.com/


The below was the suggestion from Greg Kurz about to use per-VCPU property to
control per-VM cap:

"It doesn't seem conceptually correct to configure VM level stuff out of
a vCPU property, which could theoretically be different for each vCPU,
even if this isn't the case with the current code base.

Maybe consider controlling PMU with a machine property and this
could be done in kvm_arch_init() like other VM level stuff ?"

Would you mind comment on that?

Thank you very much!

Dongli Zhang

> 
>>
>> Cc: Joe Jin 
>> Cc: Like Xu 
>> Signed-off-by: Dongli Zhang 
>> ---
>> Changed since v1:
>> - In version 1 we did not introduce the new property. We ioctl
>>   KVM_PMU_CAP_DISABLE only before the creation of the 1st vcpu. We had
>>   introduced a helpfer function to do this job before creating the 1st
>>   KVM vcpu in v1.
>>
>>  accel/kvm/kvm-all.c  |  1 +
>>  include/sysemu/kvm_int.h |  1 +
>>  qemu-options.hx  |  7 ++
>>  target/i386/kvm/kvm.c| 46 
>>  4 files changed, 55 insertions(+)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 7679f397ae..238098e991 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -3763,6 +3763,7 @@ static void kvm_accel_instance_init(Object *obj)
>>  s->xen_version = 0;
>>  s->xen_gnttab_max_frames = 64;
>>  s->xen_evtchn_max_pirq = 256;
>> +s->pmu_cap_disabled = false;
>>  }
>>
>>  /**
>> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
>> index 511b42bde5..cbbe08ec54 100644
>> --- a/include/sysemu/kvm_int.h
>> +++ b/include/sysemu/kvm_int.h
>> @@ -123,6 +123,7 @@ struct KVMState
>>  uint32_t xen_caps;
>>  uint16_t xen_gnttab_max_frames;
>>  uint16_t xen_evtchn_max_pirq;
>> +bool pmu_cap_disabled;
>>  };
>>
>>  void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index b57489d7ca..1976c0ca3e 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -187,6 +187,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
>>  "tb-size=n (TCG translation block cache size)\n"
>>  "dirty-ring-size=n (KVM dirty ring GFN count, default 
>> 0)\n"
>>  "
>> notify-vmexit=run|internal-error|disable,notify-window=n (enable notify VM 
>> exit and set notify window, x86 only)\n"
>> +"pmu-cap-disabled=true|false (disable 
>> KVM_CAP_PMU_CAPABILITY, x86 only, default false)\n"
>>  "

[PATCH RESEND v2 1/2] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE

2023-06-20 Thread Dongli Zhang

The "perf stat" at the VM side still works even we set "-cpu host,-pmu" in
the QEMU command line. That is, neither "-cpu host,-pmu" nor "-cpu EPYC"
could disable the pmu virtualization in an AMD environment.

We still see below at VM kernel side ...

[0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like below.

[0.596381] Performance Events: PMU not available due to virtualization, 
using software events only.
[0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

This is because the AMD pmu (v1) does not rely on cpuid to decide if the
pmu virtualization is supported.

We introduce a new property 'pmu-cap-disabled' for KVM accel to set
KVM_PMU_CAP_DISABLE if KVM_CAP_PMU_CAPABILITY is supported. Only x86 host
is supported because currently KVM uses KVM_CAP_PMU_CAPABILITY only for
x86.

Cc: Joe Jin 
Cc: Like Xu 
Signed-off-by: Dongli Zhang 
---
Changed since v1:
- In version 1 we did not introduce the new property. We ioctl
  KVM_PMU_CAP_DISABLE only before the creation of the 1st vcpu. We had
  introduced a helpfer function to do this job before creating the 1st
  KVM vcpu in v1.

 accel/kvm/kvm-all.c  |  1 +
 include/sysemu/kvm_int.h |  1 +
 qemu-options.hx  |  7 ++
 target/i386/kvm/kvm.c| 46 
 4 files changed, 55 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 7679f397ae..238098e991 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3763,6 +3763,7 @@ static void kvm_accel_instance_init(Object *obj)
 s->xen_version = 0;
 s->xen_gnttab_max_frames = 64;
 s->xen_evtchn_max_pirq = 256;
+s->pmu_cap_disabled = false;
 }
 
 /**
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 511b42bde5..cbbe08ec54 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -123,6 +123,7 @@ struct KVMState
 uint32_t xen_caps;
 uint16_t xen_gnttab_max_frames;
 uint16_t xen_evtchn_max_pirq;
+bool pmu_cap_disabled;
 };
 
 void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
diff --git a/qemu-options.hx b/qemu-options.hx
index b57489d7ca..1976c0ca3e 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -187,6 +187,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
 "tb-size=n (TCG translation block cache size)\n"
 "dirty-ring-size=n (KVM dirty ring GFN count, default 0)\n"
 "notify-vmexit=run|internal-error|disable,notify-window=n 
(enable notify VM exit and set notify window, x86 only)\n"
+"pmu-cap-disabled=true|false (disable 
KVM_CAP_PMU_CAPABILITY, x86 only, default false)\n"
 "thread=single|multi (enable multi-threaded TCG)\n", 
QEMU_ARCH_ALL)
 SRST
 ``-accel name[,prop=value[,...]]``
@@ -254,6 +255,12 @@ SRST
 open up for a specified of time (i.e. notify-window).
 Default: notify-vmexit=run,notify-window=0.
 
+``pmu-cap-disabled=true|false``
+When the KVM accelerator is used, it controls whether to disable the
+KVM_CAP_PMU_CAPABILITY via KVM_PMU_CAP_DISABLE. When disabled, the
+PMU virtualization is disabled at the KVM module side. This is for
+x86 host only.
+
 ERST
 
 DEF("smp", HAS_ARG, QEMU_OPTION_smp,
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index de531842f6..bf4136fa1b 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -129,6 +129,7 @@ static bool has_msr_ucode_rev;
 static bool has_msr_vmx_procbased_ctls2;
 static bool has_msr_perf_capabs;
 static bool has_msr_pkrs;
+static bool has_pmu_cap;
 
 static uint32_t has_architectural_pmu_version;
 static uint32_t num_architectural_pmu_gp_counters;
@@ -2767,6 +2768,23 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
 }
 }
 
+has_pmu_cap = kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY);
+
+if (s->pmu_cap_disabled) {
+if (has_pmu_cap) {
+ret = kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0,
+KVM_PMU_CAP_DISABLE);
+if (ret < 0) {
+s->pmu_cap_disabled = false;
+error_report("kvm: Failed to disable pmu cap: %s",
+ strerror(-ret));
+}
+} else {
+s->pmu_cap_disabled = false;
+error_report("kvm: KVM_CAP_PMU_CAPABILITY is not supported");
+}
+}
+
 return 0;
 }
 
@@ -5951,6 +5969,28 @@ static void kvm_arch_set_xen_evtchn_max_pirq(Object 
*obj, Visitor *v,
 s->xen_evtchn_max_pirq = value;
 }
 
+static void kvm_set_pmu_cap_disabled(Object *obj, Visitor *v,
+ const char *name, void *opaque,
+ Error **errp)
+{
+KVMStat

[PATCH RESEND v2 0/2] target/i386/kvm: fix two svm pmu virtualization bugs

2023-06-20 Thread Dongli Zhang

This is to rebase the patchset on top of the most recet QEMU.

This patchset is to fix two svm pmu virtualization bugs, x86 only.

version 1:
https://lore.kernel.org/all/20221119122901.2469-1-dongli.zh...@oracle.com/

1. The 1st bug is that "-cpu,-pmu" cannot disable svm pmu virtualization.

To use "-cpu EPYC" or "-cpu host,-pmu" cannot disable the pmu
virtualization. There is still below at the VM linux side ...

[0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like below.

[0.596381] Performance Events: PMU not available due to virtualization, 
using software events only.
[0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

The 1st patch has introduced a new x86 only accel/kvm property
"pmu-cap-disabled=true" to disable the pmu virtualization via
KVM_PMU_CAP_DISABLE.

I considered 'KVM_X86_SET_MSR_FILTER' initially before patchset v1.
Since both KVM_X86_SET_MSR_FILTER and KVM_PMU_CAP_DISABLE are VM ioctl. I
finally used the latter because it is easier to use.


2. The 2nd bug is that un-reclaimed perf events (after QEMU system_reset)
at the KVM side may inject random unwanted/unknown NMIs to the VM.

The svm pmu registers are not reset during QEMU system_reset.

(1). The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
is running "perf top". The pmu registers are not disabled gracefully.

(2). Although the x86_cpu_reset() resets many registers to zero, the
kvm_put_msrs() does not puts AMD pmu registers to KVM side. As a result,
some pmu events are still enabled at the KVM side.

(3). The KVM pmc_speculative_in_use() always returns true so that the events
will not be reclaimed. The kvm_pmc->perf_event is still active.

(4). After the reboot, the VM kernel reports below error:

[0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, 
complain to your hardware vendor.
[0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 
c0010200 is 530076)

(5). In a worse case, the active kvm_pmc->perf_event is still able to
inject unknown NMIs randomly to the VM kernel.

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.

The 2nd patch is to fix the issue by resetting AMD pmu registers as well as
Intel registers.


This patchset does not cover PerfMonV2, until the below patchset is merged
into the KVM side. It has been queued to kvm-x86 next by Sean.

[PATCH v7 00/12] KVM: x86: Add AMD Guest PerfMonV2 PMU support
https://lore.kernel.org/all/168609790857.1417369.13152633386083458084.b4...@google.com/


Dongli Zhang (2):
  target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  target/i386/kvm: get and put AMD pmu registers

 accel/kvm/kvm-all.c  |   1 +
 include/sysemu/kvm_int.h |   1 +
 qemu-options.hx  |   7 +++
 target/i386/cpu.h|   5 ++
 target/i386/kvm/kvm.c| 129 +-
 5 files changed, 141 insertions(+), 2 deletions(-)


Thank you very much!

Dongli Zhang

[PATCH RESEND v2 2/2] target/i386/kvm: get and put AMD pmu registers

2023-06-20 Thread Dongli Zhang

The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
the KVM side.

However, only the Intel gp/fixed/global pmu registers are involved. There
is not any implementation for AMD pmu registers. The
'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters' are
calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work for
AMD. Before AMD PerfMonV2, the number of gp registers is decided based on
the CPU version.

This patch is to add the support for AMD version=1 pmu, to get and put AMD
pmu registers. Otherwise, there will be a bug:

1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
is running "perf top". The pmu registers are not disabled gracefully.

2. Although the x86_cpu_reset() resets many registers to zero, the
kvm_put_msrs() does not puts AMD pmu registers to KVM side. As a result,
some pmu events are still enabled at the KVM side.

3. The KVM pmc_speculative_in_use() always returns true so that the events
will not be reclaimed. The kvm_pmc->perf_event is still active.

4. After the reboot, the VM kernel reports below error:

[0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, 
complain to your hardware vendor.
[0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 
c0010200 is 530076)

5. In a worse case, the active kvm_pmc->perf_event is still able to
inject unknown NMIs randomly to the VM kernel.

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.

The patch is to fix the issue by resetting AMD pmu registers during the
reset.

Cc: Joe Jin 
Cc: Like Xu 
Signed-off-by: Dongli Zhang 
---
 target/i386/cpu.h |  5 +++
 target/i386/kvm/kvm.c | 83 +--
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index cd047e0410..b8ba72e87a 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -471,6 +471,11 @@ typedef enum X86Seg {
 #define MSR_CORE_PERF_GLOBAL_CTRL   0x38f
 #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
 
+#define MSR_K7_EVNTSEL0 0xc001
+#define MSR_K7_PERFCTR0 0xc0010004
+#define MSR_F15H_PERF_CTL0  0xc0010200
+#define MSR_F15H_PERF_CTR0  0xc0010201
+
 #define MSR_MC0_CTL 0x400
 #define MSR_MC0_STATUS  0x401
 #define MSR_MC0_ADDR0x402
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index bf4136fa1b..a0f7273dad 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2084,6 +2084,32 @@ int kvm_arch_init_vcpu(CPUState *cs)
 }
 }
 
+/*
+ * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
+ * disable the AMD pmu virtualization.
+ *
+ * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
+ * indicates the KVM side has already disabled the pmu virtualization.
+ */
+if (IS_AMD_CPU(env) && !cs->kvm_state->pmu_cap_disabled) {
+int64_t family;
+
+family = (env->cpuid_version >> 8) & 0xf;
+if (family == 0xf) {
+family += (env->cpuid_version >> 20) & 0xff;
+}
+
+if (family >= 6) {
+has_architectural_pmu_version = 1;
+
+if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) {
+num_architectural_pmu_gp_counters = 6;
+} else {
+num_architectural_pmu_gp_counters = 4;
+}
+}
+}
+
 cpu_x86_cpuid(env, 0x8000, 0, , , , );
 
 for (i = 0x8000; i <= limit; i++) {
@@ -3438,7 +3464,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 
env->poll_control_msr);
 }
 
-if (has_architectural_pmu_version > 0) {
+if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
 if (has_architectural_pmu_version > 1) {
 /* Stop the counter.  */
 kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
@@ -3469,6 +3495,26 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
   env->msr_global_ctrl);
 }
 }
+
+if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
+uint32_t sel_base = MSR_K7_EVNTSEL0;
+uint32_t ctr_base = MSR_K7_PERFCTR0;
+uint32_t step = 1;
+
+if (num_architectural_pmu_gp_counters == 6) {
+sel_base = MSR_F15H_PERF_CTL0;
+ctr_base = MSR_F15H_PERF_CTR0;
+step = 2;
+}
+
+for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
+kvm_msr_entry_add(cpu, ctr_base + i * step,
+  env->msr_gp_c

Re: [PATCH v2 0/2] target/i386/kvm: fix two svm pmu virtualization bugs

2023-06-19 Thread Dongli Zhang

Hi Like and zhenyu,

Thank you very much! That will be very helpful.

In order to help the review, I will rebase the patchset on top of the most
recent QEMU.

Thank you very much!

Dongli Zhang

On 6/19/23 01:52, Like Xu wrote:
> I think we've been stuck here too long. Sorry Dongli.
> 
> +zhenyu, could you get someone to follow up on this, or I will start working 
> on
> that.
> 
> On 9/1/2023 9:19 am, Dongli Zhang wrote:
>> Ping?
>>
>> About [PATCH v2 2/2], the bad thing is that the customer will not be able to
>> notice the issue, that is, the "Broken BIOS detected" in dmesg, immediately.
>>
>> As a result, the customer VM many panic randomly anytime in the future (once
>> issue is encountered) if "/proc/sys/kernel/unknown_nmi_panic" is enabled.
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>> On 12/19/22 06:45, Dongli Zhang wrote:
>>> Can I get feedback for this patchset, especially the [PATCH v2 2/2]?
>>>
>>> About the [PATCH v2 2/2], currently the issue impacts the usage of PMUs on 
>>> AMD
>>> VM, especially the below case:
>>>
>>> 1. Enable panic on nmi.
>>> 2. Use perf to monitor the performance of VM. Although without a test, I 
>>> think
>>> the nmi watchdog has the same effect.
>>> 3. A sudden system reset, or a kernel panic (kdump/kexec).
>>> 4. After reboot, there will be random unknown NMI.
>>> 5. Unfortunately, the "panic on nmi" may panic the VM randomly at any time.
>>>
>>> Thank you very much!
>>>
>>> Dongli Zhang
>>>
>>> On 12/1/22 16:22, Dongli Zhang wrote:
>>>> This patchset is to fix two svm pmu virtualization bugs, x86 only.
>>>>
>>>> version 1:
>>>> https://lore.kernel.org/all/20221119122901.2469-1-dongli.zh...@oracle.com/
>>>>
>>>> 1. The 1st bug is that "-cpu,-pmu" cannot disable svm pmu virtualization.
>>>>
>>>> To use "-cpu EPYC" or "-cpu host,-pmu" cannot disable the pmu
>>>> virtualization. There is still below at the VM linux side ...
>>>>
>>>> [    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>>>>
>>>> ... although we expect something like below.
>>>>
>>>> [    0.596381] Performance Events: PMU not available due to virtualization,
>>>> using software events only.
>>>> [    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>>>>
>>>> The 1st patch has introduced a new x86 only accel/kvm property
>>>> "pmu-cap-disabled=true" to disable the pmu virtualization via
>>>> KVM_PMU_CAP_DISABLE.
>>>>
>>>> I considered 'KVM_X86_SET_MSR_FILTER' initially before patchset v1.
>>>> Since both KVM_X86_SET_MSR_FILTER and KVM_PMU_CAP_DISABLE are VM ioctl. I
>>>> finally used the latter because it is easier to use.
>>>>
>>>>
>>>> 2. The 2nd bug is that un-reclaimed perf events (after QEMU system_reset)
>>>> at the KVM side may inject random unwanted/unknown NMIs to the VM.
>>>>
>>>> The svm pmu registers are not reset during QEMU system_reset.
>>>>
>>>> (1). The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
>>>> is running "perf top". The pmu registers are not disabled gracefully.
>>>>
>>>> (2). Although the x86_cpu_reset() resets many registers to zero, the
>>>> kvm_put_msrs() does not puts AMD pmu registers to KVM side. As a result,
>>>> some pmu events are still enabled at the KVM side.
>>>>
>>>> (3). The KVM pmc_speculative_in_use() always returns true so that the 
>>>> events
>>>> will not be reclaimed. The kvm_pmc->perf_event is still active.
>>>>
>>>> (4). After the reboot, the VM kernel reports below error:
>>>>
>>>> [    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS
>>>> detected, complain to your hardware vendor.
>>>> [    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR
>>>> c0010200 is 530076)
>>>>
>>>> (5). In a worse case, the active kvm_pmc->perf_event is still able to
>>>> inject unknown NMIs randomly to the VM kernel.
>>>>
>>>> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>>>>
>>>> The 2nd patch is to fix the issue by resetting AMD pmu registers as well as
>>>> Intel registers.
>>>>
>>>>
>>>> This patchset does not cover PerfMonV2, until the below patchset is merged
>>>> into the KVM side.
>>>>
>>>> [PATCH v3 0/8] KVM: x86: Add AMD Guest PerfMonV2 PMU support
>>>> https://lore.kernel.org/all/2022102645.82001-1-lik...@tencent.com/
>>>>
>>>>
>>>> Dongli Zhang (2):
>>>>    target/i386/kvm: introduce 'pmu-cap-disabled' to set 
>>>> KVM_PMU_CAP_DISABLE
>>>>    target/i386/kvm: get and put AMD pmu registers
>>>>
>>>>   accel/kvm/kvm-all.c  |   1 +
>>>>   include/sysemu/kvm_int.h |   1 +
>>>>   qemu-options.hx  |   7 +++
>>>>   target/i386/cpu.h    |   5 ++
>>>>   target/i386/kvm/kvm.c    | 129 +-
>>>>   5 files changed, 141 insertions(+), 2 deletions(-)
>>>>
>>>> Thank you very much!
>>>>
>>>> Dongli Zhang
>>>>
>>>>
>>
>>

Re: [PATCH] target/i386: Clear xsave pkru bit when KVM XCR0 not support

2023-05-17 Thread Dongli Zhang

Hi Yuchen,

On 5/17/23 03:55, Yuchen wrote:
> Migrating guest from Intel new CPU (as Gold 6230) to old CPU (as
> E5-2650 v4) will pause on the destination host. Because old CPU
> not support xsave pkru feature, and KVM KVM_SET_XSAVE ioctl
> return EINVAL.
> 
> This kernel commit introduces the problem:
> ea4d6938d4c0 x86/fpu: Replace KVMs home brewed FPU copy from user

This kernel commit issue should be resolved by the below kernel commit.

x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ad856280ddea3401e1f5060ef20e6de9f6122c76

Since the old target server does not support pkru, I assume the VM's cpu type
should not support pkru. Therefore, the pkru should never be migrated away from
source server.

Dongli Zhang

> 
> Signed-off-by: YuChen 
> ---
> target/i386/xsave_helper.c | 8 
> 1 file changed, 8 insertions(+)
> 
> diff --git a/target/i386/xsave_helper.c b/target/i386/xsave_helper.c
> index 996e9f3bfe..64e2b969fe 100644
> --- a/target/i386/xsave_helper.c
> +++ b/target/i386/xsave_helper.c
> @@ -6,6 +6,8 @@
>  #include "cpu.h"
> +static bool has_xsave_pkru;
> +
> void x86_cpu_xsave_all_areas(X86CPU *cpu, void *buf, uint32_t buflen)
> {
>  CPUX86State *env = >env;
> @@ -47,6 +49,9 @@ void x86_cpu_xsave_all_areas(X86CPU *cpu, void *buf, 
> uint32_t buflen)
>  stq_p(xmm + 8, env->xmm_regs[i].ZMM_Q(1));
>  }
> +if (!has_xsave_pkru) {
> +env->xstate_bv &= ~XSTATE_PKRU_MASK;
> +}
>  header->xstate_bv = env->xstate_bv;
>  e = _ext_save_areas[XSTATE_YMM_BIT];
> @@ -181,6 +186,9 @@ void x86_cpu_xrstor_all_areas(X86CPU *cpu, const void 
> *buf, uint32_t buflen)
>  env->xmm_regs[i].ZMM_Q(1) = ldq_p(xmm + 8);
>  }
> +if (xsave->header.xstate_bv & XSTATE_PKRU_MASK) {
> +has_xsave_pkru = true;
> +}
>  env->xstate_bv = header->xstate_bv;
>  e = _ext_save_areas[XSTATE_YMM_BIT];
> --
> 2.34.1
> -
> ?
> 
> 
> ???
> This e-mail and its attachments contain confidential information from New 
> H3C, which is
> intended only for the person or entity whose address is listed above. Any use 
> of the
> information contained herein in any way (including, but not limited to, total 
> or partial
> disclosure, reproduction, or dissemination) by persons other than the intended
> recipient(s) is prohibited. If you receive this e-mail in error, please 
> notify the sender
> by phone or email immediately and delete it!
>

Re: [PATCH 1/1] hw/core/cpu: always print cpu index with cpu state

2023-03-06 Thread Dongli Zhang

Ping?

The patch already got the Reviewed-by from Philippe Mathieu-Daudé and Alistair
Francis.

The current log does not provide much information when (1) multiple CPUs are
involved in the bug, and (2) when the "info registers -a" is not used to collect
the context from all CPUs for comparison.

We may wait for very long timer until the bug filer can reproduce the error 
again.

To print the cpu index helps a lot.

Thank you very much!

Dongli Zhang

On 2/6/23 15:42, Dongli Zhang wrote:
> The cpu_dump_state() does not print the cpu index. When the
> cpu_dump_state() is invoked due to the KVM failure, we are not able to tell
> from which CPU the state is. The below is an example.
> 
> KVM internal error. Suberror: 764064
> RAX=0002 RBX=8a9e57c38400 RCX= 
> RDX=8a9cc00ba8a0
> RSI=0003 RDI=8a9e57c38400 RBP=b6120c5b3c50 
> RSP=b6120c5b3c40
> R8 = R9 =8a9cc00ba8a0 R10=8e467350 
> R11=0007
> R12=000a R13=8f987e25 R14=8f988a01 
> R15=
> RIP=8e51bb04 RFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =   00c0
> CS =0010   00a09b00 DPL=0 CS64 [-RA]
> SS =   00c0
> DS =   00c0
> FS =   00c0
> GS = 8ac27fcc  00c0
> LDT=   00c0
> TR =0040 fe096000 206f 8b00 DPL=0 TSS64-busy
> GDT= fe094000 007f
> IDT= fe00 0fff
> CR0=80050033 CR2= CR3=0010ca40a001 CR4=003606e0
> DR0= DR1= DR2= 
> DR3=
> DR6=fffe0ff0 DR7=0400
> EFER=0d01
> Code=0f 1f ... ...
> 
> Print the cpu->cpu_index in cpu_dump_state() and remove it from the caller.
> 
> Cc: Joe Jin 
> Signed-off-by: Dongli Zhang 
> ---
>  hw/core/cpu-common.c  | 1 +
>  monitor/hmp-cmds-target.c | 2 --
>  softmmu/cpus.c| 1 -
>  3 files changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
> index 5ccc3837b6..d2503f2d09 100644
> --- a/hw/core/cpu-common.c
> +++ b/hw/core/cpu-common.c
> @@ -105,6 +105,7 @@ void cpu_dump_state(CPUState *cpu, FILE *f, int flags)
>  
>  if (cc->dump_state) {
>  cpu_synchronize_state(cpu);
> +qemu_fprintf(f, "\nCPU#%d\n", cpu->cpu_index);
>  cc->dump_state(cpu, f, flags);
>  }
>  }
> diff --git a/monitor/hmp-cmds-target.c b/monitor/hmp-cmds-target.c
> index 0d3e84d960..f7dd354d2a 100644
> --- a/monitor/hmp-cmds-target.c
> +++ b/monitor/hmp-cmds-target.c
> @@ -99,7 +99,6 @@ void hmp_info_registers(Monitor *mon, const QDict *qdict)
>  
>  if (all_cpus) {
>  CPU_FOREACH(cs) {
> -monitor_printf(mon, "\nCPU#%d\n", cs->cpu_index);
>  cpu_dump_state(cs, NULL, CPU_DUMP_FPU);
>  }
>  } else {
> @@ -114,7 +113,6 @@ void hmp_info_registers(Monitor *mon, const QDict *qdict)
>  return;
>  }
>  
> -monitor_printf(mon, "\nCPU#%d\n", cs->cpu_index);
>  cpu_dump_state(cs, NULL, CPU_DUMP_FPU);
>  }
>  }
> diff --git a/softmmu/cpus.c b/softmmu/cpus.c
> index 9cbc8172b5..f69bbe6abc 100644
> --- a/softmmu/cpus.c
> +++ b/softmmu/cpus.c
> @@ -122,7 +122,6 @@ void hw_error(const char *fmt, ...)
>  vfprintf(stderr, fmt, ap);
>  fprintf(stderr, "\n");
>  CPU_FOREACH(cpu) {
> -fprintf(stderr, "CPU #%d:\n", cpu->cpu_index);
>  cpu_dump_state(cpu, stderr, CPU_DUMP_FPU);
>  }
>  va_end(ap);

Re: [PATCH 1/1] hw/core/cpu: always print cpu index with cpu state

2023-02-07 Thread Dongli Zhang

Hi Philippe,

On 2/6/23 23:16, Philippe Mathieu-Daudé wrote:
> On 7/2/23 00:42, Dongli Zhang wrote:
>> The cpu_dump_state() does not print the cpu index. When the
>> cpu_dump_state() is invoked due to the KVM failure, we are not able to tell
>> from which CPU the state is. The below is an example.
>>
>> KVM internal error. Suberror: 764064
>> RAX=0002 RBX=8a9e57c38400 RCX=
>> RDX=8a9cc00ba8a0
>> RSI=0003 RDI=8a9e57c38400 RBP=b6120c5b3c50
>> RSP=b6120c5b3c40
>> R8 = R9 =8a9cc00ba8a0 R10=8e467350
>> R11=0007
>> R12=000a R13=8f987e25 R14=8f988a01
>> R15=
>> RIP=8e51bb04 RFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
>> ES =   00c0
>> CS =0010   00a09b00 DPL=0 CS64 [-RA]
>> SS =   00c0
>> DS =   00c0
>> FS =   00c0
>> GS = 8ac27fcc  00c0
>> LDT=   00c0
>> TR =0040 fe096000 206f 8b00 DPL=0 TSS64-busy
>> GDT= fe094000 007f
>> IDT= fe00 0fff
>> CR0=80050033 CR2= CR3=0010ca40a001 CR4=003606e0
>> DR0= DR1= DR2=
>> DR3=
>> DR6=fffe0ff0 DR7=0400
>> EFER=0d01
>> Code=0f 1f ... ...
>>
>> Print the cpu->cpu_index in cpu_dump_state() and remove it from the caller.
>>
>> Cc: Joe Jin 
>> Signed-off-by: Dongli Zhang 
>> ---
>>   hw/core/cpu-common.c  | 1 +
>>   monitor/hmp-cmds-target.c | 2 --
>>   softmmu/cpus.c    | 1 -
>>   3 files changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
>> index 5ccc3837b6..d2503f2d09 100644
>> --- a/hw/core/cpu-common.c
>> +++ b/hw/core/cpu-common.c
>> @@ -105,6 +105,7 @@ void cpu_dump_state(CPUState *cpu, FILE *f, int flags)
>>     if (cc->dump_state) {
>>   cpu_synchronize_state(cpu);
> 
> Should we check for:
> 
>   if (cpu->cpu_index != -1) {
> 
>> +    qemu_fprintf(f, "\nCPU#%d\n", cpu->cpu_index);
> 
>   }

I think you meant if (cpu->cpu_index != UNASSIGNED_CPU_INDEX).

I do not see this case may happen within my knowledge. The cpu_index is always
expected to be assigned if cpu_exec_realizefn()-->cpu_list_add() is called.

 83 void cpu_list_add(CPUState *cpu)
 84 {
 85 QEMU_LOCK_GUARD(_cpu_list_lock);
 86 if (cpu->cpu_index == UNASSIGNED_CPU_INDEX) {
 87 cpu->cpu_index = cpu_get_free_index();
 88 assert(cpu->cpu_index != UNASSIGNED_CPU_INDEX);
 89 } else {
 90 assert(!cpu_index_auto_assigned);
 91 }
 92 QTAILQ_INSERT_TAIL_RCU(, cpu, node);
 93 cpu_list_generation_id++;
 94 }


In addition, the cc->dump_state() is always invoked by cpu_dump_state(). As a
result, e.g., arm_cpu_dump_state() or x86_cpu_dump_state() may always print the
cpu state unconditionally (same for mips, s390 or riscv). I do not see a reason
to hide the cpu_index.

Would you please let me know if the above is wrong? I do not think it is
required to filter the cpu_index with UNASSIGNED_CPU_INDEX.

Thank you very much!

Dongli Zhang

> 
> ?
> 
>>   cc->dump_state(cpu, f, flags);
>>   }
>>   }
>> diff --git a/monitor/hmp-cmds-target.c b/monitor/hmp-cmds-target.c
>> index 0d3e84d960..f7dd354d2a 100644
>> --- a/monitor/hmp-cmds-target.c
>> +++ b/monitor/hmp-cmds-target.c
>> @@ -99,7 +99,6 @@ void hmp_info_registers(Monitor *mon, const QDict *qdict)
>>     if (all_cpus) {
>>   CPU_FOREACH(cs) {
>> -    monitor_printf(mon, "\nCPU#%d\n", cs->cpu_index);
>>   cpu_dump_state(cs, NULL, CPU_DUMP_FPU);
>>   }
>

[PATCH 1/1] readline: fix hmp completion issue

2023-02-06 Thread Dongli Zhang

The auto completion does not work in some cases.

Case 1.

1. (qemu) info reg
2. Press 'Tab'.
3. It does not auto complete.

Case 2.

1. (qemu) block_resize flo
2. Press 'Tab'.
3. It does not auto complete 'floppy0'.

Since the readline_add_completion_of() may add any completion when
strlen(pfx) is zero, we remove the check with (name[0] == '\0') because
strlen() always returns zero in that case.

Fixes: 52f50b1e9f8f ("readline: Extract readline_add_completion_of() from 
monitor")
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 monitor/hmp.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/monitor/hmp.c b/monitor/hmp.c
index 2aa85d3982..fee410362f 100644
--- a/monitor/hmp.c
+++ b/monitor/hmp.c
@@ -1189,9 +1189,7 @@ static void cmd_completion(MonitorHMP *mon, const char 
*name, const char *list)
 }
 memcpy(cmd, pstart, len);
 cmd[len] = '\0';
-if (name[0] == '\0') {
-readline_add_completion_of(mon->rs, name, cmd);
-}
+readline_add_completion_of(mon->rs, name, cmd);
 if (*p == '\0') {
 break;
 }
@@ -1335,9 +1333,7 @@ static void monitor_find_completion_by_table(MonitorHMP 
*mon,
 /* block device name completion */
 readline_set_completion_index(mon->rs, strlen(str));
 while ((blk = blk_next(blk)) != NULL) {
-if (str[0] == '\0') {
-readline_add_completion_of(mon->rs, str, blk_name(blk));
-}
+readline_add_completion_of(mon->rs, str, blk_name(blk));
 }
 break;
 case 's':
-- 
2.34.1

[PATCH 1/1] hw/core/cpu: always print cpu index with cpu state

2023-02-06 Thread Dongli Zhang

The cpu_dump_state() does not print the cpu index. When the
cpu_dump_state() is invoked due to the KVM failure, we are not able to tell
from which CPU the state is. The below is an example.

KVM internal error. Suberror: 764064
RAX=0002 RBX=8a9e57c38400 RCX= 
RDX=8a9cc00ba8a0
RSI=0003 RDI=8a9e57c38400 RBP=b6120c5b3c50 
RSP=b6120c5b3c40
R8 = R9 =8a9cc00ba8a0 R10=8e467350 
R11=0007
R12=000a R13=8f987e25 R14=8f988a01 
R15=
RIP=8e51bb04 RFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =   00c0
CS =0010   00a09b00 DPL=0 CS64 [-RA]
SS =   00c0
DS =   00c0
FS =   00c0
GS = 8ac27fcc  00c0
LDT=   00c0
TR =0040 fe096000 206f 8b00 DPL=0 TSS64-busy
GDT= fe094000 007f
IDT= fe00 0fff
CR0=80050033 CR2= CR3=0010ca40a001 CR4=003606e0
DR0= DR1= DR2= 
DR3=
DR6=fffe0ff0 DR7=0400
EFER=0d01
Code=0f 1f ... ...

Print the cpu->cpu_index in cpu_dump_state() and remove it from the caller.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 hw/core/cpu-common.c  | 1 +
 monitor/hmp-cmds-target.c | 2 --
 softmmu/cpus.c| 1 -
 3 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
index 5ccc3837b6..d2503f2d09 100644
--- a/hw/core/cpu-common.c
+++ b/hw/core/cpu-common.c
@@ -105,6 +105,7 @@ void cpu_dump_state(CPUState *cpu, FILE *f, int flags)
 
 if (cc->dump_state) {
 cpu_synchronize_state(cpu);
+qemu_fprintf(f, "\nCPU#%d\n", cpu->cpu_index);
 cc->dump_state(cpu, f, flags);
 }
 }
diff --git a/monitor/hmp-cmds-target.c b/monitor/hmp-cmds-target.c
index 0d3e84d960..f7dd354d2a 100644
--- a/monitor/hmp-cmds-target.c
+++ b/monitor/hmp-cmds-target.c
@@ -99,7 +99,6 @@ void hmp_info_registers(Monitor *mon, const QDict *qdict)
 
 if (all_cpus) {
 CPU_FOREACH(cs) {
-monitor_printf(mon, "\nCPU#%d\n", cs->cpu_index);
 cpu_dump_state(cs, NULL, CPU_DUMP_FPU);
 }
 } else {
@@ -114,7 +113,6 @@ void hmp_info_registers(Monitor *mon, const QDict *qdict)
 return;
 }
 
-monitor_printf(mon, "\nCPU#%d\n", cs->cpu_index);
 cpu_dump_state(cs, NULL, CPU_DUMP_FPU);
 }
 }
diff --git a/softmmu/cpus.c b/softmmu/cpus.c
index 9cbc8172b5..f69bbe6abc 100644
--- a/softmmu/cpus.c
+++ b/softmmu/cpus.c
@@ -122,7 +122,6 @@ void hw_error(const char *fmt, ...)
 vfprintf(stderr, fmt, ap);
 fprintf(stderr, "\n");
 CPU_FOREACH(cpu) {
-fprintf(stderr, "CPU #%d:\n", cpu->cpu_index);
 cpu_dump_state(cpu, stderr, CPU_DUMP_FPU);
 }
 va_end(ap);
-- 
2.34.1

Re: [PATCH v2 0/2] target/i386/kvm: fix two svm pmu virtualization bugs

2023-01-08 Thread Dongli Zhang

Ping?

About [PATCH v2 2/2], the bad thing is that the customer will not be able to
notice the issue, that is, the "Broken BIOS detected" in dmesg, immediately.

As a result, the customer VM many panic randomly anytime in the future (once
issue is encountered) if "/proc/sys/kernel/unknown_nmi_panic" is enabled.

Thank you very much!

Dongli Zhang

On 12/19/22 06:45, Dongli Zhang wrote:
> Can I get feedback for this patchset, especially the [PATCH v2 2/2]?
> 
> About the [PATCH v2 2/2], currently the issue impacts the usage of PMUs on AMD
> VM, especially the below case:
> 
> 1. Enable panic on nmi.
> 2. Use perf to monitor the performance of VM. Although without a test, I think
> the nmi watchdog has the same effect.
> 3. A sudden system reset, or a kernel panic (kdump/kexec).
> 4. After reboot, there will be random unknown NMI.
> 5. Unfortunately, the "panic on nmi" may panic the VM randomly at any time.
> 
> Thank you very much!
> 
> Dongli Zhang
> 
> On 12/1/22 16:22, Dongli Zhang wrote:
>> This patchset is to fix two svm pmu virtualization bugs, x86 only.
>>
>> version 1:
>> https://lore.kernel.org/all/20221119122901.2469-1-dongli.zh...@oracle.com/
>>
>> 1. The 1st bug is that "-cpu,-pmu" cannot disable svm pmu virtualization.
>>
>> To use "-cpu EPYC" or "-cpu host,-pmu" cannot disable the pmu
>> virtualization. There is still below at the VM linux side ...
>>
>> [0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>>
>> ... although we expect something like below.
>>
>> [0.596381] Performance Events: PMU not available due to virtualization, 
>> using software events only.
>> [0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>>
>> The 1st patch has introduced a new x86 only accel/kvm property
>> "pmu-cap-disabled=true" to disable the pmu virtualization via
>> KVM_PMU_CAP_DISABLE.
>>
>> I considered 'KVM_X86_SET_MSR_FILTER' initially before patchset v1.
>> Since both KVM_X86_SET_MSR_FILTER and KVM_PMU_CAP_DISABLE are VM ioctl. I
>> finally used the latter because it is easier to use.
>>
>>
>> 2. The 2nd bug is that un-reclaimed perf events (after QEMU system_reset)
>> at the KVM side may inject random unwanted/unknown NMIs to the VM.
>>
>> The svm pmu registers are not reset during QEMU system_reset.
>>
>> (1). The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
>> is running "perf top". The pmu registers are not disabled gracefully.
>>
>> (2). Although the x86_cpu_reset() resets many registers to zero, the
>> kvm_put_msrs() does not puts AMD pmu registers to KVM side. As a result,
>> some pmu events are still enabled at the KVM side.
>>
>> (3). The KVM pmc_speculative_in_use() always returns true so that the events
>> will not be reclaimed. The kvm_pmc->perf_event is still active.
>>
>> (4). After the reboot, the VM kernel reports below error:
>>
>> [0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS 
>> detected, complain to your hardware vendor.
>> [0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 
>> c0010200 is 530076)
>>
>> (5). In a worse case, the active kvm_pmc->perf_event is still able to
>> inject unknown NMIs randomly to the VM kernel.
>>
>> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>>
>> The 2nd patch is to fix the issue by resetting AMD pmu registers as well as
>> Intel registers.
>>
>>
>> This patchset does not cover PerfMonV2, until the below patchset is merged
>> into the KVM side.
>>
>> [PATCH v3 0/8] KVM: x86: Add AMD Guest PerfMonV2 PMU support
>> https://lore.kernel.org/all/2022102645.82001-1-lik...@tencent.com/
>>
>>
>> Dongli Zhang (2):
>>   target/i386/kvm: introduce 'pmu-cap-disabled' to set 
>> KVM_PMU_CAP_DISABLE
>>   target/i386/kvm: get and put AMD pmu registers
>>
>>  accel/kvm/kvm-all.c  |   1 +
>>  include/sysemu/kvm_int.h |   1 +
>>  qemu-options.hx  |   7 +++
>>  target/i386/cpu.h|   5 ++
>>  target/i386/kvm/kvm.c| 129 +-
>>  5 files changed, 141 insertions(+), 2 deletions(-)
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>>

[PATCH 1/1] vhost-scsi: fix memleak of vsc->inflight

2023-01-04 Thread Dongli Zhang

This is below memleak detected when to quit the qemu-system-x86_64 (with
vhost-scsi-pci).

(qemu) quit

=
==15568==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 40 byte(s) in 1 object(s) allocated from:
#0 0x7f00aec57917 in __interceptor_calloc (/lib64/libasan.so.6+0xb4917)
#1 0x7f00ada0d7b5 in g_malloc0 (/lib64/libglib-2.0.so.0+0x517b5)
#2 0x5648ffd38bac in vhost_scsi_start ../hw/scsi/vhost-scsi.c:92
#3 0x5648ffd38d52 in vhost_scsi_set_status ../hw/scsi/vhost-scsi.c:131
#4 0x5648ffda340e in virtio_set_status ../hw/virtio/virtio.c:2036
#5 0x5648ff8de281 in virtio_ioport_write ../hw/virtio/virtio-pci.c:431
#6 0x5648ff8deb29 in virtio_pci_config_write ../hw/virtio/virtio-pci.c:576
#7 0x5648ffe5c0c2 in memory_region_write_accessor ../softmmu/memory.c:493
#8 0x5648ffe5c424 in access_with_adjusted_size ../softmmu/memory.c:555
#9 0x5648ffe6428f in memory_region_dispatch_write ../softmmu/memory.c:1515
#10 0x5648ffe8613d in flatview_write_continue ../softmmu/physmem.c:2825
#11 0x5648ffe86490 in flatview_write ../softmmu/physmem.c:2867
#12 0x5648ffe86d9f in address_space_write ../softmmu/physmem.c:2963
#13 0x5648ffe86e57 in address_space_rw ../softmmu/physmem.c:2973
#14 0x5648fffbfb3d in kvm_handle_io ../accel/kvm/kvm-all.c:2639
#15 0x5648fffc0e0d in kvm_cpu_exec ../accel/kvm/kvm-all.c:2890
#16 0x5648fffc90a7 in kvm_vcpu_thread_fn ../accel/kvm/kvm-accel-ops.c:51
#17 0x56490042400a in qemu_thread_start ../util/qemu-thread-posix.c:505
#18 0x7f00ac3b6ea4 in start_thread (/lib64/libpthread.so.0+0x7ea4)

Free the vsc->inflight at the 'stop' path.

Fixes: b82526c7ee ("vhost-scsi: support inflight io track")
Cc: Joe Jin 
Cc: Li Feng 
Signed-off-by: Dongli Zhang 
---
 hw/scsi/vhost-scsi-common.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/scsi/vhost-scsi-common.c b/hw/scsi/vhost-scsi-common.c
index 18ea5dcfa1..a06f01af26 100644
--- a/hw/scsi/vhost-scsi-common.c
+++ b/hw/scsi/vhost-scsi-common.c
@@ -113,6 +113,7 @@ void vhost_scsi_common_stop(VHostSCSICommon *vsc)
 
 if (vsc->inflight) {
 vhost_dev_free_inflight(vsc->inflight);
+g_free(vsc->inflight);
 vsc->inflight = NULL;
 }
 
-- 
2.34.1

Re: [PATCH v2 0/2] target/i386/kvm: fix two svm pmu virtualization bugs

2022-12-19 Thread Dongli Zhang

Can I get feedback for this patchset, especially the [PATCH v2 2/2]?

About the [PATCH v2 2/2], currently the issue impacts the usage of PMUs on AMD
VM, especially the below case:

1. Enable panic on nmi.
2. Use perf to monitor the performance of VM. Although without a test, I think
the nmi watchdog has the same effect.
3. A sudden system reset, or a kernel panic (kdump/kexec).
4. After reboot, there will be random unknown NMI.
5. Unfortunately, the "panic on nmi" may panic the VM randomly at any time.

Thank you very much!

Dongli Zhang

On 12/1/22 16:22, Dongli Zhang wrote:
> This patchset is to fix two svm pmu virtualization bugs, x86 only.
> 
> version 1:
> https://lore.kernel.org/all/20221119122901.2469-1-dongli.zh...@oracle.com/
> 
> 1. The 1st bug is that "-cpu,-pmu" cannot disable svm pmu virtualization.
> 
> To use "-cpu EPYC" or "-cpu host,-pmu" cannot disable the pmu
> virtualization. There is still below at the VM linux side ...
> 
> [0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
> 
> ... although we expect something like below.
> 
> [0.596381] Performance Events: PMU not available due to virtualization, 
> using software events only.
> [0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
> 
> The 1st patch has introduced a new x86 only accel/kvm property
> "pmu-cap-disabled=true" to disable the pmu virtualization via
> KVM_PMU_CAP_DISABLE.
> 
> I considered 'KVM_X86_SET_MSR_FILTER' initially before patchset v1.
> Since both KVM_X86_SET_MSR_FILTER and KVM_PMU_CAP_DISABLE are VM ioctl. I
> finally used the latter because it is easier to use.
> 
> 
> 2. The 2nd bug is that un-reclaimed perf events (after QEMU system_reset)
> at the KVM side may inject random unwanted/unknown NMIs to the VM.
> 
> The svm pmu registers are not reset during QEMU system_reset.
> 
> (1). The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
> is running "perf top". The pmu registers are not disabled gracefully.
> 
> (2). Although the x86_cpu_reset() resets many registers to zero, the
> kvm_put_msrs() does not puts AMD pmu registers to KVM side. As a result,
> some pmu events are still enabled at the KVM side.
> 
> (3). The KVM pmc_speculative_in_use() always returns true so that the events
> will not be reclaimed. The kvm_pmc->perf_event is still active.
> 
> (4). After the reboot, the VM kernel reports below error:
> 
> [0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS 
> detected, complain to your hardware vendor.
> [0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 
> c0010200 is 530076)
> 
> (5). In a worse case, the active kvm_pmc->perf_event is still able to
> inject unknown NMIs randomly to the VM kernel.
> 
> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
> 
> The 2nd patch is to fix the issue by resetting AMD pmu registers as well as
> Intel registers.
> 
> 
> This patchset does not cover PerfMonV2, until the below patchset is merged
> into the KVM side.
> 
> [PATCH v3 0/8] KVM: x86: Add AMD Guest PerfMonV2 PMU support
> https://lore.kernel.org/all/2022102645.82001-1-lik...@tencent.com/
> 
> 
> Dongli Zhang (2):
>   target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
>   target/i386/kvm: get and put AMD pmu registers
> 
>  accel/kvm/kvm-all.c  |   1 +
>  include/sysemu/kvm_int.h |   1 +
>  qemu-options.hx  |   7 +++
>  target/i386/cpu.h|   5 ++
>  target/i386/kvm/kvm.c| 129 +-
>  5 files changed, 141 insertions(+), 2 deletions(-)
> 
> Thank you very much!
> 
> Dongli Zhang
> 
>

[PATCH v2 2/2] target/i386/kvm: get and put AMD pmu registers

2022-12-01 Thread Dongli Zhang

The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
the KVM side.

However, only the Intel gp/fixed/global pmu registers are involved. There
is not any implementation for AMD pmu registers. The
'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters' are
calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work for
AMD. Before AMD PerfMonV2, the number of gp registers is decided based on
the CPU version.

This patch is to add the support for AMD version=1 pmu, to get and put AMD
pmu registers. Otherwise, there will be a bug:

1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
is running "perf top". The pmu registers are not disabled gracefully.

2. Although the x86_cpu_reset() resets many registers to zero, the
kvm_put_msrs() does not puts AMD pmu registers to KVM side. As a result,
some pmu events are still enabled at the KVM side.

3. The KVM pmc_speculative_in_use() always returns true so that the events
will not be reclaimed. The kvm_pmc->perf_event is still active.

4. After the reboot, the VM kernel reports below error:

[0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, 
complain to your hardware vendor.
[0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 
c0010200 is 530076)

5. In a worse case, the active kvm_pmc->perf_event is still able to
inject unknown NMIs randomly to the VM kernel.

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.

The patch is to fix the issue by resetting AMD pmu registers during the
reset.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 target/i386/cpu.h |  5 +++
 target/i386/kvm/kvm.c | 83 +--
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index d4bc19577a..4cf0b98817 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -468,6 +468,11 @@ typedef enum X86Seg {
 #define MSR_CORE_PERF_GLOBAL_CTRL   0x38f
 #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
 
+#define MSR_K7_EVNTSEL0 0xc001
+#define MSR_K7_PERFCTR0 0xc0010004
+#define MSR_F15H_PERF_CTL0  0xc0010200
+#define MSR_F15H_PERF_CTR0  0xc0010201
+
 #define MSR_MC0_CTL 0x400
 #define MSR_MC0_STATUS  0x401
 #define MSR_MC0_ADDR0x402
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 090e4fb44d..296cd4cab7 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1987,6 +1987,32 @@ int kvm_arch_init_vcpu(CPUState *cs)
 }
 }
 
+/*
+ * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
+ * disable the AMD pmu virtualization.
+ *
+ * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
+ * indicates the KVM side has already disabled the pmu virtualization.
+ */
+if (IS_AMD_CPU(env) && !cs->kvm_state->pmu_cap_disabled) {
+int64_t family;
+
+family = (env->cpuid_version >> 8) & 0xf;
+if (family == 0xf) {
+family += (env->cpuid_version >> 20) & 0xff;
+}
+
+if (family >= 6) {
+has_architectural_pmu_version = 1;
+
+if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) {
+num_architectural_pmu_gp_counters = 6;
+} else {
+num_architectural_pmu_gp_counters = 4;
+}
+}
+}
+
 cpu_x86_cpuid(env, 0x8000, 0, , , , );
 
 for (i = 0x8000; i <= limit; i++) {
@@ -3323,7 +3349,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 
env->poll_control_msr);
 }
 
-if (has_architectural_pmu_version > 0) {
+if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
 if (has_architectural_pmu_version > 1) {
 /* Stop the counter.  */
 kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
@@ -3354,6 +3380,26 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
   env->msr_global_ctrl);
 }
 }
+
+if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
+uint32_t sel_base = MSR_K7_EVNTSEL0;
+uint32_t ctr_base = MSR_K7_PERFCTR0;
+uint32_t step = 1;
+
+if (num_architectural_pmu_gp_counters == 6) {
+sel_base = MSR_F15H_PERF_CTL0;
+ctr_base = MSR_F15H_PERF_CTR0;
+step = 2;
+}
+
+for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
+kvm_msr_entry_add(cpu, ctr_base + i * step,
+  env->msr_gp_counters[i]);
+

[PATCH v2 0/2] target/i386/kvm: fix two svm pmu virtualization bugs

2022-12-01 Thread Dongli Zhang

This patchset is to fix two svm pmu virtualization bugs, x86 only.

version 1:
https://lore.kernel.org/all/20221119122901.2469-1-dongli.zh...@oracle.com/

1. The 1st bug is that "-cpu,-pmu" cannot disable svm pmu virtualization.

To use "-cpu EPYC" or "-cpu host,-pmu" cannot disable the pmu
virtualization. There is still below at the VM linux side ...

[0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like below.

[0.596381] Performance Events: PMU not available due to virtualization, 
using software events only.
[0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

The 1st patch has introduced a new x86 only accel/kvm property
"pmu-cap-disabled=true" to disable the pmu virtualization via
KVM_PMU_CAP_DISABLE.

I considered 'KVM_X86_SET_MSR_FILTER' initially before patchset v1.
Since both KVM_X86_SET_MSR_FILTER and KVM_PMU_CAP_DISABLE are VM ioctl. I
finally used the latter because it is easier to use.


2. The 2nd bug is that un-reclaimed perf events (after QEMU system_reset)
at the KVM side may inject random unwanted/unknown NMIs to the VM.

The svm pmu registers are not reset during QEMU system_reset.

(1). The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
is running "perf top". The pmu registers are not disabled gracefully.

(2). Although the x86_cpu_reset() resets many registers to zero, the
kvm_put_msrs() does not puts AMD pmu registers to KVM side. As a result,
some pmu events are still enabled at the KVM side.

(3). The KVM pmc_speculative_in_use() always returns true so that the events
will not be reclaimed. The kvm_pmc->perf_event is still active.

(4). After the reboot, the VM kernel reports below error:

[0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, 
complain to your hardware vendor.
[0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 
c0010200 is 530076)

(5). In a worse case, the active kvm_pmc->perf_event is still able to
inject unknown NMIs randomly to the VM kernel.

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.

The 2nd patch is to fix the issue by resetting AMD pmu registers as well as
Intel registers.


This patchset does not cover PerfMonV2, until the below patchset is merged
into the KVM side.

[PATCH v3 0/8] KVM: x86: Add AMD Guest PerfMonV2 PMU support
https://lore.kernel.org/all/2022102645.82001-1-lik...@tencent.com/


Dongli Zhang (2):
  target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  target/i386/kvm: get and put AMD pmu registers

 accel/kvm/kvm-all.c  |   1 +
 include/sysemu/kvm_int.h |   1 +
 qemu-options.hx  |   7 +++
 target/i386/cpu.h|   5 ++
 target/i386/kvm/kvm.c| 129 +-
 5 files changed, 141 insertions(+), 2 deletions(-)

Thank you very much!

Dongli Zhang

[PATCH v2 1/2] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE

2022-12-01 Thread Dongli Zhang

The "perf stat" at the VM side still works even we set "-cpu host,-pmu" in
the QEMU command line. That is, neither "-cpu host,-pmu" nor "-cpu EPYC"
could disable the pmu virtualization in an AMD environment.

We still see below at VM kernel side ...

[0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like below.

[0.596381] Performance Events: PMU not available due to virtualization, 
using software events only.
[0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

This is because the AMD pmu (v1) does not rely on cpuid to decide if the
pmu virtualization is supported.

We introduce a new property 'pmu_disabled' for KVM accel to set
KVM_PMU_CAP_DISABLE if KVM_CAP_PMU_CAPABILITY is supported. Only x86 host
is supported because currently KVM uses KVM_CAP_PMU_CAPABILITY only for
x86.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
Changed since v1:
  - In version 1 we did not introduce the new property. We ioctl
KVM_PMU_CAP_DISABLE only before the creation of the 1st vcpu. We had
introduced a helpfer function to do this job before creating the 1st
KVM vcpu in v1.

 accel/kvm/kvm-all.c  |  1 +
 include/sysemu/kvm_int.h |  1 +
 qemu-options.hx  |  7 ++
 target/i386/kvm/kvm.c| 46 
 4 files changed, 55 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index f99b0becd8..5d4439ba74 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3620,6 +3620,7 @@ static void kvm_accel_instance_init(Object *obj)
 s->kvm_dirty_ring_size = 0;
 s->notify_vmexit = NOTIFY_VMEXIT_OPTION_RUN;
 s->notify_window = 0;
+s->pmu_cap_disabled = false;
 }
 
 /**
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 3b4adcdc10..e29ac5d767 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -110,6 +110,7 @@ struct KVMState
 struct KVMDirtyRingReaper reaper;
 NotifyVmexitOption notify_vmexit;
 uint32_t notify_window;
+bool pmu_cap_disabled;
 };
 
 void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
diff --git a/qemu-options.hx b/qemu-options.hx
index 7f99d15b23..15a2f717ff 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -186,6 +186,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
 "tb-size=n (TCG translation block cache size)\n"
 "dirty-ring-size=n (KVM dirty ring GFN count, default 0)\n"
 "notify-vmexit=run|internal-error|disable,notify-window=n 
(enable notify VM exit and set notify window, x86 only)\n"
+"pmu-cap-disabled=true|false (disable 
KVM_CAP_PMU_CAPABILITY, x86 only, default false)\n"
 "thread=single|multi (enable multi-threaded TCG)\n", 
QEMU_ARCH_ALL)
 SRST
 ``-accel name[,prop=value[,...]]``
@@ -247,6 +248,12 @@ SRST
 open up for a specified of time (i.e. notify-window).
 Default: notify-vmexit=run,notify-window=0.
 
+``pmu-cap-disabled=true|false``
+When the KVM accelerator is used, it controls whether to disable the
+KVM_CAP_PMU_CAPABILITY via KVM_PMU_CAP_DISABLE. When disabled, the
+PMU virtualization is disabled at the KVM module side. This is for
+x86 host only.
+
 ERST
 
 DEF("smp", HAS_ARG, QEMU_OPTION_smp,
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index a213209379..090e4fb44d 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -122,6 +122,7 @@ static bool has_msr_ucode_rev;
 static bool has_msr_vmx_procbased_ctls2;
 static bool has_msr_perf_capabs;
 static bool has_msr_pkrs;
+static bool has_pmu_cap;
 
 static uint32_t has_architectural_pmu_version;
 static uint32_t num_architectural_pmu_gp_counters;
@@ -2652,6 +2653,23 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
 }
 }
 
+has_pmu_cap = kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY);
+
+if (s->pmu_cap_disabled) {
+if (has_pmu_cap) {
+ret = kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0,
+KVM_PMU_CAP_DISABLE);
+if (ret < 0) {
+s->pmu_cap_disabled = false;
+error_report("kvm: Failed to disable pmu cap: %s",
+ strerror(-ret));
+}
+} else {
+s->pmu_cap_disabled = false;
+error_report("kvm: KVM_CAP_PMU_CAPABILITY is not supported");
+}
+}
+
 return 0;
 }
 
@@ -5706,6 +5724,28 @@ static void kvm_arch_set_notify_window(Object *obj, 
Visitor *v,
 s->notify_window = value;
 }
 
+static void kvm_set_pmu_cap_disabled(Object *obj, Visitor *v,
+ const char *name, void *opaque,
+ Error **errp)
+{
+K

Re: [PATCH 3/3] target/i386/kvm: get and put AMD pmu registers

2022-11-21 Thread Dongli Zhang

Hi Liang,

On 11/21/22 6:28 AM, Liang Yan wrote:
> A little bit more information from kernel perspective.
> 
> https://urldefense.com/v3/__https://lkml.org/lkml/2022/10/31/476__;!!ACWV5N9M2RV99hQ!NHxyuDAt7ZD4hlsoxPCIUSRsPzaii0kDx2DrS7umBMoKVD8Z6BH7IKvPu8p0EhBBTEqQkCMfTk1xoj-XzT0$
> 
> 
> I was kindly thinking of the same idea, but not sure if it is expected  from a
> bare-metal perspective, since the four legacy MSRs
> 
> are always there. Also not sure if they are used by other applications.
> 
> 
> ~Liang

Can I assume you were replying to the the patch ...

[PATCH 2/3] i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled

... but not this one which is about to save/restore PMU registers?

[PATCH 3/3] target/i386/kvm: get and put AMD pmu registers


About the legacy registers, here is my understanding just based on the linux amd
source code.

The line 1450 indicates the PMU initialization is failed only when the family is
< 6. The amd_core_pmu_init() at line 1455 will decide whether to use 
PERFCTR_CORE.

1445 __init int amd_pmu_init(void)
1446 {
1447 int ret;
1448
1449 /* Performance-monitoring supported from K7 and later: */
1450 if (boot_cpu_data.x86 < 6)
1451 return -ENODEV;
1452
1453 x86_pmu = amd_pmu;
1454
1455 ret = amd_core_pmu_init();
1456 if (ret)
1457 return ret


By default (when family < 6), the below 4 legacy registers are used.

1270 static __initconst const struct x86_pmu amd_pmu = {
1271 .name   = "AMD",
... ...
1279 .eventsel   = MSR_K7_EVNTSEL0,
1280 .perfctr= MSR_K7_PERFCTR0,

#define MSR_K7_EVNTSEL0 0xc001
#define MSR_K7_PERFCTR0 0xc0010004
#define MSR_K7_EVNTSEL1 0xc0010001
#define MSR_K7_PERFCTR1 0xc0010005
#define MSR_K7_EVNTSEL2 0xc0010002
#define MSR_K7_PERFCTR2 0xc0010006
#define MSR_K7_EVNTSEL3 0xc0010003
#define MSR_K7_PERFCTR3 0xc0010007


If PERFCTR_CORE is supported, counters like below are used, as line 1368 and 
1369.

1351 static int __init amd_core_pmu_init(void)
1352 {
1353 union cpuid_0x8022_ebx ebx;
1354 u64 even_ctr_mask = 0ULL;
1355 int i;
1356
1357 if (!boot_cpu_has(X86_FEATURE_PERFCTR_CORE))
1358 return 0;
1359
1360 /* Avoid calculating the value each time in the NMI handler */
1361 perf_nmi_window = msecs_to_jiffies(100);
1362
1363 /*
1364  * If core performance counter extensions exists, we must use
1365  * MSR_F15H_PERF_CTL/MSR_F15H_PERF_CTR msrs. See also
1366  * amd_pmu_addr_offset().
1367  */
1368 x86_pmu.eventsel= MSR_F15H_PERF_CTL;
1369 x86_pmu.perfctr = MSR_F15H_PERF_CTR;
1370 x86_pmu.num_counters= AMD64_NUM_COUNTERS_CORE;

#define MSR_F15H_PERF_CTL   0xc0010200
#define MSR_F15H_PERF_CTL0  MSR_F15H_PERF_CTL
#define MSR_F15H_PERF_CTL1  (MSR_F15H_PERF_CTL + 2)
#define MSR_F15H_PERF_CTL2  (MSR_F15H_PERF_CTL + 4)
#define MSR_F15H_PERF_CTL3  (MSR_F15H_PERF_CTL + 6)
#define MSR_F15H_PERF_CTL4  (MSR_F15H_PERF_CTL + 8)
#define MSR_F15H_PERF_CTL5  (MSR_F15H_PERF_CTL + 10)

#define MSR_F15H_PERF_CTR   0xc0010201
#define MSR_F15H_PERF_CTR0  MSR_F15H_PERF_CTR
#define MSR_F15H_PERF_CTR1  (MSR_F15H_PERF_CTR + 2)
#define MSR_F15H_PERF_CTR2  (MSR_F15H_PERF_CTR + 4)
#define MSR_F15H_PERF_CTR3  (MSR_F15H_PERF_CTR + 6)
#define MSR_F15H_PERF_CTR4  (MSR_F15H_PERF_CTR + 8)
#define MSR_F15H_PERF_CTR5  (MSR_F15H_PERF_CTR + 10)

Unfortunately, I do not have access to machine with family < 6. It might be
interesting to use QEMU to emulate a machine with family < 6.

Thank you very much for suggestion!

Dongli Zhang


> 
> 
> On 11/19/22 07:29, Dongli Zhang wrote:
>> The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
>> side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
>> the KVM side.
>>
>> However, only the Intel gp/fixed/global pmu registers are involved. There
>> is not any implementation for AMD pmu registers. The
>> 'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters' are
>> calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work for
>> AMD. Before AMD PerfMonV2, the number of gp registers is decided based on
>> the CPU version.
>>
>> This patch is to add the support for AMD version=1 pmu, to get and put AMD
>> pmu registers. Otherwise, there will be a bug:
>>
>> 1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec)

Re: [PATCH 2/3] i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled

2022-11-21 Thread Dongli Zhang

Hi Greg and Liang,

On 11/21/22 6:23 AM, Liang Yan wrote:
> 
> On 11/21/22 06:03, Greg Kurz wrote:
>> On Sat, 19 Nov 2022 04:29:00 -0800
>> Dongli Zhang  wrote:
>>
>>> The "perf stat" at the VM side still works even we set "-cpu host,-pmu" in
>>> the QEMU command line. That is, neither "-cpu host,-pmu" nor "-cpu EPYC"
>>> could disable the pmu virtualization in an AMD environment.
>>>
>>> We still see below at VM kernel side ...
>>>
>>> [    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>>>
>>> ... although we expect something like below.
>>>
>>> [    0.596381] Performance Events: PMU not available due to virtualization,
>>> using software events only.
>>> [    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>>>
>>> This is because the AMD pmu (v1) does not rely on cpuid to decide if the
>>> pmu virtualization is supported.
>>>
>>> We disable KVM_CAP_PMU_CAPABILITY if the 'pmu' is disabled in the vcpu
>>> properties.
>>>
>>> Cc: Joe Jin 
>>> Signed-off-by: Dongli Zhang 
>>> ---
>>>   target/i386/kvm/kvm.c | 17 +
>>>   1 file changed, 17 insertions(+)
>>>
>>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>>> index 8fec0bc5b5..0b1226ff7f 100644
>>> --- a/target/i386/kvm/kvm.c
>>> +++ b/target/i386/kvm/kvm.c
>>> @@ -137,6 +137,8 @@ static int has_triple_fault_event;
>>>     static bool has_msr_mcg_ext_ctl;
>>>   +static int has_pmu_cap;
>>> +
>>>   static struct kvm_cpuid2 *cpuid_cache;
>>>   static struct kvm_cpuid2 *hv_cpuid_cache;
>>>   static struct kvm_msr_list *kvm_feature_msrs;
>>> @@ -1725,6 +1727,19 @@ static void kvm_init_nested_state(CPUX86State *env)
>>>     void kvm_arch_pre_create_vcpu(CPUState *cs)
>>>   {
>>> +    X86CPU *cpu = X86_CPU(cs);
>>> +    int ret;
>>> +
>>> +    if (has_pmu_cap && !cpu->enable_pmu) {
>>> +    ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_PMU_CAPABILITY, 0,
>>> +    KVM_PMU_CAP_DISABLE);
>> It doesn't seem conceptually correct to configure VM level stuff out of
>> a vCPU property, which could theoretically be different for each vCPU,
>> even if this isn't the case with the current code base.
>>
>> Maybe consider controlling PMU with a machine property and this
>> could be done in kvm_arch_init() like other VM level stuff ?
>>
> 
> There is already a 'pmu' property for x86_cpu with variable 'enable_pmu' as we
> see the above code. It is mainly used by Intel CPU and set to off by default
> since qemu 1.5.
> 
> And, this property is spread to AMD CPU too.
> 
> I think you may need setup a machine property to disable it from current 
> machine
> model. Otherwise, it will break the Live Migration scenario.

Initially, I was thinking about to replace the CPU level property 'pmu' to a VM
level property but with a concern that reviewers/maintainers might not be happy
with it. That may break the way other Apps use the QEMU.

I will add an extra VM level property to the kvm_accel_class_init() for
"-machine accel=kvm" in addition to the existing per CPU property. The VM (kvm)
level
'pmu' will be more privileged than the CPU level 'pmu'.

Thank you very much!

Dongli Zhang

> 
> 
>>> +    if (ret < 0) {
>>> +    error_report("kvm: Failed to disable pmu cap: %s",
>>> + strerror(-ret));
>>> +    }
>>> +
>>> +    has_pmu_cap = 0;
>>> +    }
>>>   }
>>>     int kvm_arch_init_vcpu(CPUState *cs)
>>> @@ -2517,6 +2532,8 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>>>   }
>>>   }
>>>   +    has_pmu_cap = kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY);
>>> +
>>>   ret = kvm_get_supported_msrs(s);
>>>   if (ret < 0) {
>>>   return ret;
>>

Re: [PATCH 0/3] kvm: fix two svm pmu virtualization bugs

2022-11-20 Thread Dongli Zhang

Hi Like,

On 11/20/22 22:42, Like Xu wrote:
> On 19/11/2022 8:28 pm, Dongli Zhang wrote:
>> This patchset is to fix two svm pmu virtualization bugs.
>>
>> 1. The 1st bug is that "-cpu,-pmu" cannot disable svm pmu virtualization.
>>
>> To use "-cpu EPYC" or "-cpu host,-pmu" cannot disable the pmu
>> virtualization. There is still below at the VM linux side ...
> 
> Many QEMU vendor forks already have similar fixes, and
> thanks for bringing this issue back to the mainline.

Would you mind helping point to if there used to be any prior patchset for
mainline to resolve the issue?

> 
>>
>> [    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>>
>> ... although we expect something like below.
>>
>> [    0.596381] Performance Events: PMU not available due to virtualization,
>> using software events only.
>> [    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>>
>> The patch 1-2 is to disable the pmu virtualization via KVM_PMU_CAP_DISABLE
>> if the per-vcpu "pmu" property is disabled.
>>
>> I considered 'KVM_X86_SET_MSR_FILTER' initially.
>> Since both KVM_X86_SET_MSR_FILTER and KVM_PMU_CAP_DISABLE are VM ioctl. I
>> finally used the latter because it is easier to use.
>>
>>
>> 2. The 2nd bug is that un-reclaimed perf events (after QEMU system_reset)
>> at the KVM side may inject random unwanted/unknown NMIs to the VM.
>>
>> The svm pmu registers are not reset during QEMU system_reset.
>>
>> (1). The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
>> is running "perf top". The pmu registers are not disabled gracefully.
>>
>> (2). Although the x86_cpu_reset() resets many registers to zero, the
>> kvm_put_msrs() does not puts AMD pmu registers to KVM side. As a result,
>> some pmu events are still enabled at the KVM side.
>>
>> (3). The KVM pmc_speculative_in_use() always returns true so that the events
>> will not be reclaimed. The kvm_pmc->perf_event is still active.
> 
> I'm not sure if you're saying KVM doing something wrong, I don't think so
> because KVM doesn't sense the system_reset defined by QEME or other user 
> space,
> AMD's vPMC will continue to be enabled (if it was enabled before), generating 
> pmi
> injection into the guest, and the newly started guest doesn't realize the
> counter is still
> enabled and blowing up the error log.

I were not saying KVM was buggy.

I was trying to explain how the issue impacts KVM side and VM side.

> 
>>
>> (4). After the reboot, the VM kernel reports below error:
>>
>> [    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS 
>> detected,
>> complain to your hardware vendor.
>> [    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR
>> c0010200 is 530076)
>>
>> (5). In a worse case, the active kvm_pmc->perf_event is still able to
>> inject unknown NMIs randomly to the VM kernel.
>>
>> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>>
>> The patch 3 is to fix the issue by resetting AMD pmu registers as well as
>> Intel registers.
> 
> This fix idea looks good, it does require syncing the new changed device state
> of QEMU to KVM.

Thank you very much!

Dongli Zhang

> 
>>
>>
>> This patchset does cover does not cover PerfMonV2, until the below patchset
>> is merged into the KVM side.
>>
>> [PATCH v3 0/8] KVM: x86: Add AMD Guest PerfMonV2 PMU support
>> https://lore.kernel.org/all/2022102645.82001-1-lik...@tencent.com/
>>
>>
>> Dongli Zhang (3):
>>    kvm: introduce a helper before creating the 1st vcpu
>>    i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled
>>        target/i386/kvm: get and put AMD pmu registers
>>
>>   accel/kvm/kvm-all.c    |   7 ++-
>>   include/sysemu/kvm.h   |   2 +
>>   target/arm/kvm64.c |   4 ++
>>   target/i386/cpu.h  |   5 +++
>>   target/i386/kvm/kvm.c  | 104 +++-
>>   target/mips/kvm.c  |   4 ++
>>   target/ppc/kvm.c   |   4 ++
>>   target/riscv/kvm.c |   4 ++
>>   target/s390x/kvm/kvm.c |   4 ++
>>   9 files changed, 134 insertions(+), 4 deletions(-)
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>>
>>

[PATCH 1/3] kvm: introduce a helper before creating the 1st vcpu

2022-11-19 Thread Dongli Zhang

Some per-VM kvm caps (e.g., KVM_CAP_PMU_CAPABILITY) can only be
enabled/disabled before creating the 1st vcpu, that is, when
(!kvm->created_vcpus) at the KVM side.

Unfortunately, some properties are still not set during kvm_arch_init().
The values of those properties are obtained during the init of each vcpu.

This is to add a new helper to provide the last chance before creating the
1st vcpu, in order for the QEMU to set kvm caps based on the per-vcpu
properties (e.g., "pmu").

In the future patch, we may disable KVM_CAP_PMU_CAPABILITY in the helper
if the "-pmu" is set for the vcpu.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 accel/kvm/kvm-all.c| 7 +--
 include/sysemu/kvm.h   | 2 ++
 target/arm/kvm64.c | 4 
 target/i386/kvm/kvm.c  | 4 
 target/mips/kvm.c  | 4 
 target/ppc/kvm.c   | 4 
 target/riscv/kvm.c | 4 
 target/s390x/kvm/kvm.c | 4 
 8 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index f99b0becd8..335ff6ce4d 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -367,8 +367,9 @@ void kvm_destroy_vcpu(CPUState *cpu)
 }
 }
 
-static int kvm_get_vcpu(KVMState *s, unsigned long vcpu_id)
+static int kvm_get_vcpu(KVMState *s, CPUState *cs)
 {
+unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
 struct KVMParkedVcpu *cpu;
 
 QLIST_FOREACH(cpu, >kvm_parked_vcpus, node) {
@@ -382,6 +383,8 @@ static int kvm_get_vcpu(KVMState *s, unsigned long vcpu_id)
 }
 }
 
+kvm_arch_pre_create_vcpu(cs);
+
 return kvm_vm_ioctl(s, KVM_CREATE_VCPU, (void *)vcpu_id);
 }
 
@@ -393,7 +396,7 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
 
 trace_kvm_init_vcpu(cpu->cpu_index, kvm_arch_vcpu_id(cpu));
 
-ret = kvm_get_vcpu(s, kvm_arch_vcpu_id(cpu));
+ret = kvm_get_vcpu(s, cpu);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "kvm_init_vcpu: kvm_get_vcpu failed 
(%lu)",
  kvm_arch_vcpu_id(cpu));
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index e9a97eda8c..9a2e2ba012 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -371,6 +371,8 @@ int kvm_arch_put_registers(CPUState *cpu, int level);
 
 int kvm_arch_init(MachineState *ms, KVMState *s);
 
+void kvm_arch_pre_create_vcpu(CPUState *cs);
+
 int kvm_arch_init_vcpu(CPUState *cpu);
 int kvm_arch_destroy_vcpu(CPUState *cpu);
 
diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index 1197253d12..da4317ad06 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -834,6 +834,10 @@ static int kvm_arm_sve_set_vls(CPUState *cs)
 return kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, );
 }
 
+void kvm_arch_pre_create_vcpu(CPUState *cs)
+{
+}
+
 #define ARM_CPU_ID_MPIDR   3, 0, 0, 0, 5
 
 int kvm_arch_init_vcpu(CPUState *cs)
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index a213209379..8fec0bc5b5 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1723,6 +1723,10 @@ static void kvm_init_nested_state(CPUX86State *env)
 }
 }
 
+void kvm_arch_pre_create_vcpu(CPUState *cs)
+{
+}
+
 int kvm_arch_init_vcpu(CPUState *cs)
 {
 struct {
diff --git a/target/mips/kvm.c b/target/mips/kvm.c
index bcb8e06b2c..1be1695b6b 100644
--- a/target/mips/kvm.c
+++ b/target/mips/kvm.c
@@ -61,6 +61,10 @@ int kvm_arch_irqchip_create(KVMState *s)
 return 0;
 }
 
+void kvm_arch_pre_create_vcpu(CPUState *cs)
+{
+}
+
 int kvm_arch_init_vcpu(CPUState *cs)
 {
 MIPSCPU *cpu = MIPS_CPU(cs);
diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c
index 7c25348b7b..9049c6eb5e 100644
--- a/target/ppc/kvm.c
+++ b/target/ppc/kvm.c
@@ -462,6 +462,10 @@ static void kvmppc_hw_debug_points_init(CPUPPCState *cenv)
 }
 }
 
+void kvm_arch_pre_create_vcpu(CPUState *cs)
+{
+}
+
 int kvm_arch_init_vcpu(CPUState *cs)
 {
 PowerPCCPU *cpu = POWERPC_CPU(cs);
diff --git a/target/riscv/kvm.c b/target/riscv/kvm.c
index 30f21453d6..811f65d4f6 100644
--- a/target/riscv/kvm.c
+++ b/target/riscv/kvm.c
@@ -394,6 +394,10 @@ void kvm_arch_init_irq_routing(KVMState *s)
 {
 }
 
+void kvm_arch_pre_create_vcpu(CPUState *cs)
+{
+}
+
 int kvm_arch_init_vcpu(CPUState *cs)
 {
 int ret = 0;
diff --git a/target/s390x/kvm/kvm.c b/target/s390x/kvm/kvm.c
index 3ac7ec9acf..65f701894e 100644
--- a/target/s390x/kvm/kvm.c
+++ b/target/s390x/kvm/kvm.c
@@ -404,6 +404,10 @@ unsigned long kvm_arch_vcpu_id(CPUState *cpu)
 return cpu->cpu_index;
 }
 
+void kvm_arch_pre_create_vcpu(CPUState *cs)
+{
+}
+
 int kvm_arch_init_vcpu(CPUState *cs)
 {
 unsigned int max_cpus = MACHINE(qdev_get_machine())->smp.max_cpus;
-- 
2.34.1

[PATCH 0/3] kvm: fix two svm pmu virtualization bugs

2022-11-19 Thread Dongli Zhang

This patchset is to fix two svm pmu virtualization bugs.

1. The 1st bug is that "-cpu,-pmu" cannot disable svm pmu virtualization.

To use "-cpu EPYC" or "-cpu host,-pmu" cannot disable the pmu
virtualization. There is still below at the VM linux side ...

[0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like below.

[0.596381] Performance Events: PMU not available due to virtualization, 
using software events only.
[0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

The patch 1-2 is to disable the pmu virtualization via KVM_PMU_CAP_DISABLE
if the per-vcpu "pmu" property is disabled.

I considered 'KVM_X86_SET_MSR_FILTER' initially.
Since both KVM_X86_SET_MSR_FILTER and KVM_PMU_CAP_DISABLE are VM ioctl. I
finally used the latter because it is easier to use.


2. The 2nd bug is that un-reclaimed perf events (after QEMU system_reset)
at the KVM side may inject random unwanted/unknown NMIs to the VM.

The svm pmu registers are not reset during QEMU system_reset.

(1). The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
is running "perf top". The pmu registers are not disabled gracefully.

(2). Although the x86_cpu_reset() resets many registers to zero, the
kvm_put_msrs() does not puts AMD pmu registers to KVM side. As a result,
some pmu events are still enabled at the KVM side.

(3). The KVM pmc_speculative_in_use() always returns true so that the events
will not be reclaimed. The kvm_pmc->perf_event is still active.

(4). After the reboot, the VM kernel reports below error:

[0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, 
complain to your hardware vendor.
[0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 
c0010200 is 530076)

(5). In a worse case, the active kvm_pmc->perf_event is still able to
inject unknown NMIs randomly to the VM kernel.

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.

The patch 3 is to fix the issue by resetting AMD pmu registers as well as
Intel registers.


This patchset does cover does not cover PerfMonV2, until the below patchset
is merged into the KVM side.

[PATCH v3 0/8] KVM: x86: Add AMD Guest PerfMonV2 PMU support
https://lore.kernel.org/all/2022102645.82001-1-lik...@tencent.com/


Dongli Zhang (3):
  kvm: introduce a helper before creating the 1st vcpu
  i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled
  target/i386/kvm: get and put AMD pmu registers

 accel/kvm/kvm-all.c|   7 ++-
 include/sysemu/kvm.h   |   2 +
 target/arm/kvm64.c |   4 ++
 target/i386/cpu.h  |   5 +++
 target/i386/kvm/kvm.c  | 104 +++-
 target/mips/kvm.c  |   4 ++
 target/ppc/kvm.c   |   4 ++
 target/riscv/kvm.c |   4 ++
 target/s390x/kvm/kvm.c |   4 ++
 9 files changed, 134 insertions(+), 4 deletions(-)

Thank you very much!

Dongli Zhang

[PATCH 2/3] i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled

2022-11-19 Thread Dongli Zhang

The "perf stat" at the VM side still works even we set "-cpu host,-pmu" in
the QEMU command line. That is, neither "-cpu host,-pmu" nor "-cpu EPYC"
could disable the pmu virtualization in an AMD environment.

We still see below at VM kernel side ...

[0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like below.

[0.596381] Performance Events: PMU not available due to virtualization, 
using software events only.
[0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

This is because the AMD pmu (v1) does not rely on cpuid to decide if the
pmu virtualization is supported.

We disable KVM_CAP_PMU_CAPABILITY if the 'pmu' is disabled in the vcpu
properties.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 target/i386/kvm/kvm.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 8fec0bc5b5..0b1226ff7f 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -137,6 +137,8 @@ static int has_triple_fault_event;
 
 static bool has_msr_mcg_ext_ctl;
 
+static int has_pmu_cap;
+
 static struct kvm_cpuid2 *cpuid_cache;
 static struct kvm_cpuid2 *hv_cpuid_cache;
 static struct kvm_msr_list *kvm_feature_msrs;
@@ -1725,6 +1727,19 @@ static void kvm_init_nested_state(CPUX86State *env)
 
 void kvm_arch_pre_create_vcpu(CPUState *cs)
 {
+X86CPU *cpu = X86_CPU(cs);
+int ret;
+
+if (has_pmu_cap && !cpu->enable_pmu) {
+ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_PMU_CAPABILITY, 0,
+KVM_PMU_CAP_DISABLE);
+if (ret < 0) {
+error_report("kvm: Failed to disable pmu cap: %s",
+ strerror(-ret));
+}
+
+has_pmu_cap = 0;
+}
 }
 
 int kvm_arch_init_vcpu(CPUState *cs)
@@ -2517,6 +2532,8 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
 }
 }
 
+has_pmu_cap = kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY);
+
 ret = kvm_get_supported_msrs(s);
 if (ret < 0) {
 return ret;
-- 
2.34.1

[PATCH 3/3] target/i386/kvm: get and put AMD pmu registers

2022-11-19 Thread Dongli Zhang

The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
the KVM side.

However, only the Intel gp/fixed/global pmu registers are involved. There
is not any implementation for AMD pmu registers. The
'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters' are
calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work for
AMD. Before AMD PerfMonV2, the number of gp registers is decided based on
the CPU version.

This patch is to add the support for AMD version=1 pmu, to get and put AMD
pmu registers. Otherwise, there will be a bug:

1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
is running "perf top". The pmu registers are not disabled gracefully.

2. Although the x86_cpu_reset() resets many registers to zero, the
kvm_put_msrs() does not puts AMD pmu registers to KVM side. As a result,
some pmu events are still enabled at the KVM side.

3. The KVM pmc_speculative_in_use() always returns true so that the events
will not be reclaimed. The kvm_pmc->perf_event is still active.

4. After the reboot, the VM kernel reports below error:

[0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, 
complain to your hardware vendor.
[0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 
c0010200 is 530076)

5. In a worse case, the active kvm_pmc->perf_event is still able to
inject unknown NMIs randomly to the VM kernel.

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.

The patch is to fix the issue by resetting AMD pmu registers during the
reset.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 target/i386/cpu.h |  5 +++
 target/i386/kvm/kvm.c | 83 +--
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index d4bc19577a..4cf0b98817 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -468,6 +468,11 @@ typedef enum X86Seg {
 #define MSR_CORE_PERF_GLOBAL_CTRL   0x38f
 #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
 
+#define MSR_K7_EVNTSEL0 0xc001
+#define MSR_K7_PERFCTR0 0xc0010004
+#define MSR_F15H_PERF_CTL0  0xc0010200
+#define MSR_F15H_PERF_CTR0  0xc0010201
+
 #define MSR_MC0_CTL 0x400
 #define MSR_MC0_STATUS  0x401
 #define MSR_MC0_ADDR0x402
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 0b1226ff7f..023fcbce48 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2005,6 +2005,32 @@ int kvm_arch_init_vcpu(CPUState *cs)
 }
 }
 
+if (IS_AMD_CPU(env)) {
+int64_t family;
+
+family = (env->cpuid_version >> 8) & 0xf;
+if (family == 0xf) {
+family += (env->cpuid_version >> 20) & 0xff;
+}
+
+/*
+ * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
+ * disable the AMD pmu virtualization.
+ *
+ * If KVM_CAP_PMU_CAPABILITY is supported, "!has_pmu_cap" indicates
+ * the KVM side has already disabled the pmu virtualization.
+ */
+if (family >= 6 && (!has_pmu_cap || cpu->enable_pmu)) {
+has_architectural_pmu_version = 1;
+
+if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) {
+num_architectural_pmu_gp_counters = 6;
+} else {
+num_architectural_pmu_gp_counters = 4;
+}
+}
+}
+
 cpu_x86_cpuid(env, 0x8000, 0, , , , );
 
 for (i = 0x8000; i <= limit; i++) {
@@ -3326,7 +3352,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 
env->poll_control_msr);
 }
 
-if (has_architectural_pmu_version > 0) {
+if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
 if (has_architectural_pmu_version > 1) {
 /* Stop the counter.  */
 kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
@@ -3357,6 +3383,26 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
   env->msr_global_ctrl);
 }
 }
+
+if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
+uint32_t sel_base = MSR_K7_EVNTSEL0;
+uint32_t ctr_base = MSR_K7_PERFCTR0;
+uint32_t step = 1;
+
+if (num_architectural_pmu_gp_counters == 6) {
+sel_base = MSR_F15H_PERF_CTL0;
+ctr_base = MSR_F15H_PERF_CTR0;
+step = 2;
+}
+
+for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
+kvm_msr_entry_add(cpu, ctr_base + i * step,
+  env->msr_

Re: [PATCH v3 1/1] monitor/hmp: print trace as option in help for log command

2022-09-19 Thread Dongli Zhang

Hi Markus,

On 9/17/22 2:44 PM, Philippe Mathieu-Daudé via wrote:
> Hi Markus,
> 
> On 2/9/22 14:24, Markus Armbruster wrote:
>> Dongli Zhang  writes:
>>
>>> The below is printed when printing help information in qemu-system-x86_64
>>> command line, and when CONFIG_TRACE_LOG is enabled:
>>>
>>> 
>>> $ qemu-system-x86_64 -d help
>>> ... ...
>>> trace:PATTERN   enable trace events
>>>
>>> Use "-d trace:help" to get a list of trace events.
>>> 
>>>
>>> However, the options of "trace:PATTERN" are only printed by
>>> "qemu-system-x86_64 -d help", but missing in hmp "help log" command.
>>>
>>> Fixes: c84ea00dc2 ("log: add "-d trace:PATTERN"")
>>> Cc: Joe Jin 
>>> Signed-off-by: Dongli Zhang 
>>> ---
>>> Changed since v1:
>>> - change format for "none" as well.
>>> Changed since v2:
>>> - use "log trace:help" in help message.
>>> - add more clarification in commit message.
>>> - add 'Fixes' tag.
>>> ---
>>>   monitor/hmp.c | 9 +++--
>>>   1 file changed, 7 insertions(+), 2 deletions(-)
> 
>> Not this patch's fault:
>>
>> 1. "-d help" terminates with exit status 1, "-d trace:help" with 0.  The
>>     former is wrong.

May I assume it is expected to have exit status 1 when "-d help"?

According to the output of "-d", there is even not a "help" option, but only a
"-d trace:help" option. That is, "-d help" is not officially supported.

The below example use "-d hellworld" but not "help".

# qemu-system-x86_64 -d helloworld
Log items (comma separated):
out_asm show generated host assembly code for each compiled TB
in_asm  show target assembly code for each compiled TB
op  show micro ops for each compiled TB
op_opt  show micro ops after optimization
op_ind  show micro ops before indirect lowering
int show interrupts/exceptions in short format
execshow trace before each executed TB (lots of logs)
cpu show CPU registers before entering a TB (lots of logs)
fpu include FPU registers in the 'cpu' logging
mmu log MMU-related activities
pcall   x86 only: show protected mode far calls/returns/exceptions
cpu_reset   show CPU state before CPU resets
unimp   log unimplemented functionality
guest_errorslog when the guest OS does something invalid (eg accessing a
non-existent register)
pagedump pages at beginning of user mode emulation
nochain do not chain compiled TBs so that "exec" and "cpu" show
complete traces
plugin  output from TCG plugins

strace  log every user-mode syscall, its input, and its result
tid open a separate log file per thread; filename must contain '%d'
trace:PATTERN   enable trace events

Use "-d trace:help" to get a list of trace events.


According to the source code, the qemu_str_to_log_mask() expects either log
items or "trace". For any other inputs (e.g., "help" or "helloworld"),
qemu_str_to_log_mask() returns 0 (no bit set in the mask). That indicates the
input (e.g., "help") is not an expected input.

Therefore, can I assume this is not a bug? I do not think something like below
is very helpful.

diff --git a/softmmu/vl.c b/softmmu/vl.c
index 263f029a8e..54c8e624bf 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -2389,6 +2389,8 @@ static void qemu_process_early_options(void)
 mask = qemu_str_to_log_mask(log_mask);
 if (!mask) {
 qemu_print_log_usage(stdout);
+if (g_str_equal(log_mask, "help"))
+exit(0)
 exit(1);
 }
 }

Thank you very much!

Dongli Zhang


>>
>> 2. HMP "log trace:help" prints to stdout instead of the current monitor.
>>
>> 3. Output of HMP "log trace:help" sometimes is truncated for me.
>>
>> 4. Output of "log trace:help" and "info trace-events" is unwieldy.
>>     Sorted output could be a bit less unwieldy.
>>
>> 5. Could "log trace:help" and "info trace-events" share code?
> 
> Do you mind opening issue(s) on our GitLab so we don't loose your
> analysis buried within the infinite mailing list?
>

Re: [PATCH v2 1/1] monitor/hmp: print trace as option in help for log command

2022-09-02 Thread Dongli Zhang

Hi Markus,

On 8/31/22 11:47 PM, Markus Armbruster wrote:
> Dongli Zhang  writes:
> 
>> Hi Markus,
>>
>> On 8/30/22 4:04 AM, Markus Armbruster wrote:
>>> Dongli Zhang  writes:
>>>
>>>> The below is printed when printing help information in qemu-system-x86_64
>>>> command line, and when CONFIG_TRACE_LOG is enabled:
>>>>
>>>> $ qemu-system-x86_64 -d help
>>>> ... ...
>>>> trace:PATTERN   enable trace events
>>>>
>>>> Use "-d trace:help" to get a list of trace events.
>>>>
>>>> However, they are not printed in hmp "help log" command.
>>>
>>> This leaves me guessing what exactly the patch tries to do.
>>
>> I will clarify in the commit message.
>>
>>>
>>>> Cc: Joe Jin 
>>>> Signed-off-by: Dongli Zhang 
>>>> ---
>>>> Changed since v1:
>>>> - change format for "none" as well.
>>>>
>>>>  monitor/hmp.c | 9 +++--
>>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/monitor/hmp.c b/monitor/hmp.c
>>>> index 15ca047..467fc84 100644
>>>> --- a/monitor/hmp.c
>>>> +++ b/monitor/hmp.c
>>>> @@ -285,10 +285,15 @@ void help_cmd(Monitor *mon, const char *name)
>>>>  if (!strcmp(name, "log")) {
>>>>  const QEMULogItem *item;
>>>>  monitor_printf(mon, "Log items (comma separated):\n");
>>>> -monitor_printf(mon, "%-10s %s\n", "none", "remove all logs");
>>>> +monitor_printf(mon, "%-15s %s\n", "none", "remove all logs");
>>>>  for (item = qemu_log_items; item->mask != 0; item++) {
>>>> -monitor_printf(mon, "%-10s %s\n", item->name, item->help);
>>>> +monitor_printf(mon, "%-15s %s\n", item->name, item->help);
>>>>  }
>>>> +#ifdef CONFIG_TRACE_LOG
>>>> +monitor_printf(mon, "trace:PATTERN   enable trace events\n");
>>>> +monitor_printf(mon, "\nUse \"info trace-events\" to get a 
>>>> list of "
>>>> +"trace events.\n\n");
>>>
>>> Aha: it fixes help to show "log trace:PATTERN".  Was that forgotten in
>>> Paolo's commit c84ea00dc2 'log: add "-d trace:PATTERN"'?
>>
>> I will add the Fixes tag.
>>
>>>
>>> "info trace-events", hmmm... it shows trace events and their state.
>>> "log trace:help" also lists them, less their state, and in opposite
>>> order.  Why do we need both?
> 
> I guess we have both because we want an HMP command to show the state of
> trace events ("info trace-events"), and we want "-d trace" to provide
> help.
> 
> The latter also lets HMP command "log trace" help, which feels less
> important to me, since "info trace-events" exists and is easier to find
> and significantly more usable than "log trace:help": it can filter its
> output, and unfiltered output is too long to be useful without something
> like grep.
> 
> Could the two share more code?

Thank you very much for the suggestion. I will try if they can share the code.

> 
> Hmm, there seems to be something wrong with "log trace:help": I see
> truncated output.  Moreover, output goes to stdout instead of the
> monitor.  That's wrong.  Any help you can also emit from the monitor
> should be printed with qemu_printf().

Currently the "log trace:help" prints via fprintf() to stdout. I will fix this.

169 void trace_list_events(FILE *f)
170 {
171 TraceEventIter iter;
172 TraceEvent *ev;
173 trace_event_iter_init_all();
174 while ((ev = trace_event_iter_next()) != NULL) {
175 fprintf(f, "%s\n", trace_event_get_name(ev));
176 }
177 #ifdef CONFIG_TRACE_DTRACE
178 fprintf(f, "This list of names of trace points may be incomplete "
179"when using the DTrace/SystemTap backends.\n"
180"Run 'qemu-trace-stap list %s' to print the full list.\n",
181 g_get_prgname());
182 #endif
183 }

> 
>> I will print "log trace:help" in the help output.
>>
>>> What about showing them in alphabetical order?
>>
>> The order is following how they are defined in the qemu_log_items[] array. To
>> re-order them in the array may introduce more conflicts when backporting a
>> util/log patch to QEMU old version.
>>
>> Please let me know if you prefer to re-order. Otherwise, I prefer to avoid 
>> that.
> 
> I'm talking about the output of "log trace:help", not the output of "log
> help".

The order is guaranteed by each trace group. To re-order may require some works
with scripts/tracetool when the events from each group/folder are united. I will
have a confirm.

Thank you very much for suggestions!

Dongli Zhang

Re: [PATCH 2/2] util/log: add timestamp to logs via qemu_log()

2022-08-31 Thread Dongli Zhang

Hi Markus and Richard,

Thank you very much for the feedback. I agree this is not a good solution. I
will look for alternatives to add timestamp.

Thank you very much!

Dongli Zhang

On 8/30/22 8:31 AM, Richard Henderson wrote:
> On 8/30/22 04:09, Markus Armbruster wrote:
>> Dongli Zhang  writes:
>>
>>> The qemu_log is very helpful for diagnostic. Add the timestamp to the log
>>> when it is enabled (e.g., "-msg timestamp=on").
>>>
>>> While there are many other places that may print to log file, this patch is
>>> only for qemu_log(), e.g., the developer may add qemu_log/qemu_log_mask to
>>> selected locations to diagnose QEMU issue.
>>
>> Opinions on the new feature, anyone?
>>
>>> Cc: Joe Jin 
>>> Signed-off-by: Dongli Zhang 
>>> ---
>>> Please let me know if we should use 'error_with_guestname' as well.
>>>
>>>   util/log.c | 7 +++
>>>   1 file changed, 7 insertions(+)
>>>
>>> diff --git a/util/log.c b/util/log.c
>>> index d6eb037..f0a081a 100644
>>> --- a/util/log.c
>>> +++ b/util/log.c
>>> @@ -129,8 +129,15 @@ void qemu_log(const char *fmt, ...)
>>>   {
>>>   FILE *f = qemu_log_trylock();
>>>   if (f) {
>>> +    gchar *timestr;
>>>   va_list ap;
>>>   +    if (message_with_timestamp) {
>>> +    timestr = real_time_iso8601();
>>> +    fprintf(f, "%s ", timestr);
>>> +    g_free(timestr);
>>> +    }
>>> +
>>>   va_start(ap, fmt);
>>>   vfprintf(f, fmt, ap);
>>>   va_end(ap);
>>
>> This extends -msg timestamp=on to apply to log messages without
>> documenting it in -help or anywhere else.  Needs fixing.
> 
> I think this is a poor place to add the timestamp.
> 
> You'll find that qemu_log is used many times to assemble pieces, e.g.
> 
> linux-user/thunk.c:360:    qemu_log("%" PRIu64, tswap64(val));
> 
> linux-user/thunk.c:376:    qemu_log("\"");
> 
> linux-user/thunk.c:379:    qemu_log("[");
> 
> linux-user/thunk.c:384:    qemu_log(",");
> 
> linux-user/thunk.c:391:    qemu_log("\"");
> 
> linux-user/thunk.c:393:    qemu_log("]");
> 
> linux-user/thunk.c:417:    qemu_log("{");
> 
> linux-user/thunk.c:420:    qemu_log(",");
> 
> linux-user/thunk.c:424:    qemu_log("}");
> 
> 
> Not the best idea, really, but the replacement for this is to avoid qemu_log
> entirely, and use
> 
>     f = qemu_log_trylock();
>     if (f) {
>     fprintf
>     some
>     stuff
>     qemu_log_unlock(f);
>     }
> 
> at which point you don't get your timestamp either.  You'd need to explicitly
> add timestamps to individual locations.
> 
> It would probably be easier to add timestamps to tracepoints, which are always
> emitted as a unit.
> 
> 
> r~
>

[PATCH v3 1/1] monitor/hmp: print trace as option in help for log command

2022-08-31 Thread Dongli Zhang

The below is printed when printing help information in qemu-system-x86_64
command line, and when CONFIG_TRACE_LOG is enabled:


$ qemu-system-x86_64 -d help
... ...
trace:PATTERN   enable trace events

Use "-d trace:help" to get a list of trace events.


However, the options of "trace:PATTERN" are only printed by
"qemu-system-x86_64 -d help", but missing in hmp "help log" command.

Fixes: c84ea00dc2 ("log: add "-d trace:PATTERN"")
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
Changed since v1:
- change format for "none" as well.
Changed since v2:
- use "log trace:help" in help message.
- add more clarification in commit message.
- add 'Fixes' tag.
---
 monitor/hmp.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/monitor/hmp.c b/monitor/hmp.c
index 15ca04735c..a3375d0341 100644
--- a/monitor/hmp.c
+++ b/monitor/hmp.c
@@ -285,10 +285,15 @@ void help_cmd(Monitor *mon, const char *name)
 if (!strcmp(name, "log")) {
 const QEMULogItem *item;
 monitor_printf(mon, "Log items (comma separated):\n");
-monitor_printf(mon, "%-10s %s\n", "none", "remove all logs");
+monitor_printf(mon, "%-15s %s\n", "none", "remove all logs");
 for (item = qemu_log_items; item->mask != 0; item++) {
-monitor_printf(mon, "%-10s %s\n", item->name, item->help);
+monitor_printf(mon, "%-15s %s\n", item->name, item->help);
 }
+#ifdef CONFIG_TRACE_LOG
+monitor_printf(mon, "trace:PATTERN   enable trace events\n");
+monitor_printf(mon, "\nUse \"log trace:help\" to get a list of "
+   "trace events.\n\n");
+#endif
 return;
 }
 
-- 
2.17.1

Re: [PATCH v2 1/1] monitor/hmp: print trace as option in help for log command

2022-08-31 Thread Dongli Zhang

Hi Markus,

On 8/30/22 4:04 AM, Markus Armbruster wrote:
> Dongli Zhang  writes:
> 
>> The below is printed when printing help information in qemu-system-x86_64
>> command line, and when CONFIG_TRACE_LOG is enabled:
>>
>> $ qemu-system-x86_64 -d help
>> ... ...
>> trace:PATTERN   enable trace events
>>
>> Use "-d trace:help" to get a list of trace events.
>>
>> However, they are not printed in hmp "help log" command.
> 
> This leaves me guessing what exactly the patch tries to do.

I will clarify in the commit message.

> 
>> Cc: Joe Jin 
>> Signed-off-by: Dongli Zhang 
>> ---
>> Changed since v1:
>> - change format for "none" as well.
>>
>>  monitor/hmp.c | 9 +++--
>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/monitor/hmp.c b/monitor/hmp.c
>> index 15ca047..467fc84 100644
>> --- a/monitor/hmp.c
>> +++ b/monitor/hmp.c
>> @@ -285,10 +285,15 @@ void help_cmd(Monitor *mon, const char *name)
>>  if (!strcmp(name, "log")) {
>>  const QEMULogItem *item;
>>  monitor_printf(mon, "Log items (comma separated):\n");
>> -monitor_printf(mon, "%-10s %s\n", "none", "remove all logs");
>> +monitor_printf(mon, "%-15s %s\n", "none", "remove all logs");
>>  for (item = qemu_log_items; item->mask != 0; item++) {
>> -monitor_printf(mon, "%-10s %s\n", item->name, item->help);
>> +monitor_printf(mon, "%-15s %s\n", item->name, item->help);
>>  }
>> +#ifdef CONFIG_TRACE_LOG
>> +monitor_printf(mon, "trace:PATTERN   enable trace events\n");
>> +monitor_printf(mon, "\nUse \"info trace-events\" to get a list 
>> of "
>> +"trace events.\n\n");
> 
> Aha: it fixes help to show "log trace:PATTERN".  Was that forgotten in
> Paolo's commit c84ea00dc2 'log: add "-d trace:PATTERN"'?

I will add the Fixes tag.

> 
> "info trace-events", hmmm... it shows trace events and their state.
> "log trace:help" also lists them, less their state, and in opposite
> order.  Why do we need both?

I will print "log trace:help" in the help output.

> 
> What about showing them in alphabetical order?

The order is following how they are defined in the qemu_log_items[] array. To
re-order them in the array may introduce more conflicts when backporting a
util/log patch to QEMU old version.

Please let me know if you prefer to re-order. Otherwise, I prefer to avoid that.

Thank you very much for the suggestions!

Dongli Zhang

> 
>> +#endif
>>  return;
>>  }
>

Re: [PATCH 1/1] monitor/hmp: print trace as option in help for log command

2022-08-29 Thread Dongli Zhang

Sorry that the format for "none" should be changed as well.

I have sent a v2:

https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg04445.html

Thank you very much!

Dongli Zhang

On 8/29/22 3:04 AM, Dongli Zhang wrote:
> The below is printed when printing help information in qemu-system-x86_64
> command line, and when CONFIG_TRACE_LOG is enabled:
> 
> $ qemu-system-x86_64 -d help
> ... ...
> trace:PATTERN   enable trace events
> 
> Use "-d trace:help" to get a list of trace events.
> 
> However, they are not printed in hmp "help log" command.
> 
> Cc: Joe Jin 
> Signed-off-by: Dongli Zhang 
> ---
>  monitor/hmp.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/monitor/hmp.c b/monitor/hmp.c
> index 15ca047..9f48b70 100644
> --- a/monitor/hmp.c
> +++ b/monitor/hmp.c
> @@ -287,8 +287,13 @@ void help_cmd(Monitor *mon, const char *name)
>  monitor_printf(mon, "Log items (comma separated):\n");
>  monitor_printf(mon, "%-10s %s\n", "none", "remove all logs");
>  for (item = qemu_log_items; item->mask != 0; item++) {
> -monitor_printf(mon, "%-10s %s\n", item->name, item->help);
> +monitor_printf(mon, "%-15s %s\n", item->name, item->help);
>  }
> +#ifdef CONFIG_TRACE_LOG
> +monitor_printf(mon, "trace:PATTERN   enable trace events\n");
> +monitor_printf(mon, "\nUse \"info trace-events\" to get a list 
> of "
> +"trace events.\n\n");
> +#endif
>  return;
>  }
>  
>

[PATCH v2 1/1] monitor/hmp: print trace as option in help for log command

2022-08-29 Thread Dongli Zhang

The below is printed when printing help information in qemu-system-x86_64
command line, and when CONFIG_TRACE_LOG is enabled:

$ qemu-system-x86_64 -d help
... ...
trace:PATTERN   enable trace events

Use "-d trace:help" to get a list of trace events.

However, they are not printed in hmp "help log" command.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
Changed since v1:
- change format for "none" as well.

 monitor/hmp.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/monitor/hmp.c b/monitor/hmp.c
index 15ca047..467fc84 100644
--- a/monitor/hmp.c
+++ b/monitor/hmp.c
@@ -285,10 +285,15 @@ void help_cmd(Monitor *mon, const char *name)
 if (!strcmp(name, "log")) {
 const QEMULogItem *item;
 monitor_printf(mon, "Log items (comma separated):\n");
-monitor_printf(mon, "%-10s %s\n", "none", "remove all logs");
+monitor_printf(mon, "%-15s %s\n", "none", "remove all logs");
 for (item = qemu_log_items; item->mask != 0; item++) {
-monitor_printf(mon, "%-10s %s\n", item->name, item->help);
+monitor_printf(mon, "%-15s %s\n", item->name, item->help);
 }
+#ifdef CONFIG_TRACE_LOG
+monitor_printf(mon, "trace:PATTERN   enable trace events\n");
+monitor_printf(mon, "\nUse \"info trace-events\" to get a list of "
+"trace events.\n\n");
+#endif
 return;
 }
 
-- 
1.8.3.1

[PATCH 1/2] error-report: make real_time_iso8601() non-static

2022-08-29 Thread Dongli Zhang

To make real_time_iso8601() a non-static function so that it can be used by
other files later.

No functional change.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 include/qemu/error-report.h | 2 ++
 util/error-report.c | 3 +--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/qemu/error-report.h b/include/qemu/error-report.h
index 3ae2357..cc73b99 100644
--- a/include/qemu/error-report.h
+++ b/include/qemu/error-report.h
@@ -30,6 +30,8 @@ void loc_set_none(void);
 void loc_set_cmdline(char **argv, int idx, int cnt);
 void loc_set_file(const char *fname, int lno);
 
+char *real_time_iso8601(void);
+
 int error_vprintf(const char *fmt, va_list ap) G_GNUC_PRINTF(1, 0);
 int error_printf(const char *fmt, ...) G_GNUC_PRINTF(1, 2);
 
diff --git a/util/error-report.c b/util/error-report.c
index 5edb2e6..63862cd 100644
--- a/util/error-report.c
+++ b/util/error-report.c
@@ -169,8 +169,7 @@ static void print_loc(void)
 }
 }
 
-static char *
-real_time_iso8601(void)
+char *real_time_iso8601(void)
 {
 #if GLIB_CHECK_VERSION(2,62,0)
 g_autoptr(GDateTime) dt = g_date_time_new_now_utc();
-- 
1.8.3.1

[PATCH 2/2] util/log: add timestamp to logs via qemu_log()

2022-08-29 Thread Dongli Zhang

The qemu_log is very helpful for diagnostic. Add the timestamp to the log
when it is enabled (e.g., "-msg timestamp=on").

While there are many other places that may print to log file, this patch is
only for qemu_log(), e.g., the developer may add qemu_log/qemu_log_mask to
selected locations to diagnose QEMU issue.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
Please let me know if we should use 'error_with_guestname' as well.

 util/log.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/util/log.c b/util/log.c
index d6eb037..f0a081a 100644
--- a/util/log.c
+++ b/util/log.c
@@ -129,8 +129,15 @@ void qemu_log(const char *fmt, ...)
 {
 FILE *f = qemu_log_trylock();
 if (f) {
+gchar *timestr;
 va_list ap;
 
+if (message_with_timestamp) {
+timestr = real_time_iso8601();
+fprintf(f, "%s ", timestr);
+g_free(timestr);
+}
+
 va_start(ap, fmt);
 vfprintf(f, fmt, ap);
 va_end(ap);
-- 
1.8.3.1

[PATCH 1/1] monitor/hmp: print trace as option in help for log command

2022-08-29 Thread Dongli Zhang

The below is printed when printing help information in qemu-system-x86_64
command line, and when CONFIG_TRACE_LOG is enabled:

$ qemu-system-x86_64 -d help
... ...
trace:PATTERN   enable trace events

Use "-d trace:help" to get a list of trace events.

However, they are not printed in hmp "help log" command.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 monitor/hmp.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/monitor/hmp.c b/monitor/hmp.c
index 15ca047..9f48b70 100644
--- a/monitor/hmp.c
+++ b/monitor/hmp.c
@@ -287,8 +287,13 @@ void help_cmd(Monitor *mon, const char *name)
 monitor_printf(mon, "Log items (comma separated):\n");
 monitor_printf(mon, "%-10s %s\n", "none", "remove all logs");
 for (item = qemu_log_items; item->mask != 0; item++) {
-monitor_printf(mon, "%-10s %s\n", item->name, item->help);
+monitor_printf(mon, "%-15s %s\n", item->name, item->help);
 }
+#ifdef CONFIG_TRACE_LOG
+monitor_printf(mon, "trace:PATTERN   enable trace events\n");
+monitor_printf(mon, "\nUse \"info trace-events\" to get a list of "
+"trace events.\n\n");
+#endif
 return;
 }
 
-- 
1.8.3.1

Re: [PATCH 0/2] scsi: to fix issue on passing host_status to the guest kernel

2022-01-10 Thread Dongli Zhang

ping?

Thank you very much!

Dongli Zhang

On 12/10/21 6:16 AM, Dongli Zhang wrote:
> This patchset fixes the issue on passing 'host_status' to the guest kernel.
> 
> The 1st patch fixes the erroneous usage of req->host_status.
> 
> The 2nd patch is to pass the SCSI_HOST_ERROR to the guest kernel when the
> req->bus->info->fail() is not implemented. I do not add 'Fixes:' because I
> am not sure if to not pass SCSI_HOST_ERROR was on purpose, especially for
> security reason.
> 
> Thank you very much!
> 
> Dongli Zhang
> 
> 
>

[PATCH 2/2] scsi/utils: pass host_status = SCSI_HOST_ERROR to guest kernel

2021-12-10 Thread Dongli Zhang

For scsi_req_complete_failed() and when the req->bus->info->fail() is
implemented, the virtio-scsi passes SCSI_HOST_ERROR to the guest kernel as
VIRTIO_SCSI_S_FAILURE, while the pvscsi passes SCSI_HOST_ERROR to guest
kernel as BTSTAT_HASOFTWARE.

However, the scsi_req_complete_failed()->scsi_sense_from_host_status()
always returns GOOD for SCSI_HOST_ERROR, when the req->bus->info->fail() is
not implemented (e.g., megasas). As a result, the sense is not passed to
the guest kernel.

The SCSI_HOST_ERROR is reproduced on purpose by below QEMU command line:

-device megasas,id=vscsi0,bus=pci.0,addr=0x4 \
-drive file=/dev/sdc,format=raw,if=none,id=drive01 \
-device scsi-block,bus=vscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive01 \

... and when we remove the /dev/sdc from the host with:

"echo 1 > /sys/block/sdc/device/delete"

This patch passes sense_code_IO_ERROR to the guest kernel for
host_status = SCSI_HOST_ERROR.

(This issue is detected by running a testing code from Rui Loura).

Cc: Joe Jin 
Cc: Adnan Misherfi 
Cc: Rui Loura 
Signed-off-by: Dongli Zhang 
---
 scsi/utils.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/scsi/utils.c b/scsi/utils.c
index 357b036671..086a1fea66 100644
--- a/scsi/utils.c
+++ b/scsi/utils.c
@@ -638,6 +638,9 @@ int scsi_sense_from_host_status(uint8_t host_status,
 case SCSI_HOST_ABORTED:
 *sense = SENSE_CODE(COMMAND_ABORTED);
 return CHECK_CONDITION;
+case SCSI_HOST_ERROR:
+*sense = SENSE_CODE(IO_ERROR);
+return CHECK_CONDITION;
 case SCSI_HOST_RESET:
 *sense = SENSE_CODE(RESET);
 return CHECK_CONDITION;
-- 
2.17.1

[PATCH 1/2] scsi/scsi_bus: use host_status as parameter for scsi_sense_from_host_status()

2021-12-10 Thread Dongli Zhang

The scsi_sense_from_host_status() always returns GOOD since
req->host_status is 255 (-1) at this time. Change req->host_status to
host_status so that scsi_sense_from_host_status() will be able to return
the expected value.

Fixes: f3126d65b393("scsi: move host_status handling into SCSI drivers")
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 hw/scsi/scsi-bus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 77325d8cc7..d46650bd8c 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -1465,7 +1465,7 @@ void scsi_req_complete_failed(SCSIRequest *req, int 
host_status)
 assert(req->ops != _unit_attention);
 
 if (!req->bus->info->fail) {
-status = scsi_sense_from_host_status(req->host_status, );
+status = scsi_sense_from_host_status(host_status, );
 if (status == CHECK_CONDITION) {
 scsi_req_build_sense(req, sense);
 }
-- 
2.17.1

[PATCH 0/2] scsi: to fix issue on passing host_status to the guest kernel

2021-12-10 Thread Dongli Zhang

This patchset fixes the issue on passing 'host_status' to the guest kernel.

The 1st patch fixes the erroneous usage of req->host_status.

The 2nd patch is to pass the SCSI_HOST_ERROR to the guest kernel when the
req->bus->info->fail() is not implemented. I do not add 'Fixes:' because I
am not sure if to not pass SCSI_HOST_ERROR was on purpose, especially for
security reason.

Thank you very much!

Dongli Zhang

Re: [PATCH v2 38/53] qapi: introduce x-query-lapic QMP command

2021-09-20 Thread Dongli Zhang

u\n",
> +   s->divide_conf & APIC_DCR_MASK,
> +   divider_conf(s->divide_conf),
> +   s->initial_count, apic_get_current_count(s));
> +
> +g_string_append_printf(buf,
> +   "SPIV\t 0x%08x APIC %s, focus=%s, spurious vec 
> %u\n",
> +   s->spurious_vec,
> +   s->spurious_vec & APIC_SPURIO_ENABLED ?
> +   "enabled" : "disabled",
> +   s->spurious_vec & APIC_SPURIO_FOCUS ? "on" : 
> "off",
> +   s->spurious_vec & APIC_VECTOR_MASK);
> +
> +format_apic_icr(s, >env, buf);
> +
> +g_string_append_printf(buf, "ESR\t 0x%08x\n", s->esr);
> +
> +format_apic_interrupt("ISR", s->isr, s->tmr, buf);
> +format_apic_interrupt("IRR", s->irr, s->tmr, buf);
> +
> +g_string_append_printf(buf, "\nAPR 0x%02x TPR 0x%02x DFR 0x%02x LDR 
> 0x%02x",
> +   s->arb_id, s->tpr, s->dest_mode, s->log_dest);
>  if (s->dest_mode == 0) {
> -qemu_printf("(cluster %u: id %u)",
> -s->log_dest >> APIC_LOGDEST_XAPIC_SHIFT,
> -s->log_dest & APIC_LOGDEST_XAPIC_ID);
> +g_string_append_printf(buf, "(cluster %u: id %u)",
> +   s->log_dest >> APIC_LOGDEST_XAPIC_SHIFT,
> +   s->log_dest & APIC_LOGDEST_XAPIC_ID);
>  }
> -qemu_printf(" PPR 0x%02x\n", apic_get_ppr(s));
> -}
> -#else
> -void x86_cpu_dump_local_apic_state(CPUState *cs, int flags)
> -{
> +g_string_append_printf(buf, " PPR 0x%02x\n", apic_get_ppr(s));
> +
> +return g_steal_pointer();
>  }
>  #endif /* !CONFIG_USER_ONLY */
>  
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index d87c8808f6..2bcb175da8 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -2201,8 +2201,10 @@ void x86_cpu_set_default_version(X86CPUVersion 
> version);
>  #define APIC_DEFAULT_ADDRESS 0xfee0
>  #define APIC_SPACE_SIZE  0x10
>  
> +#ifndef CONFIG_USER_ONLY
>  /* cpu-dump.c */
> -void x86_cpu_dump_local_apic_state(CPUState *cs, int flags);
> +GString *x86_cpu_format_local_apic_state(CPUState *cs, int flags, Error 
> **errp);
> +#endif /* !CONFIG_USER_ONLY */
>  
>  /* cpu.c */
>  bool cpu_is_bsp(X86CPU *cpu);
> diff --git a/target/i386/monitor.c b/target/i386/monitor.c
> index 19468c4e85..fc09f90059 100644
> --- a/target/i386/monitor.c
> +++ b/target/i386/monitor.c
> @@ -33,6 +33,7 @@
>  #include "qapi/error.h"
>  #include "sev_i386.h"
>  #include "qapi/qapi-commands-misc-target.h"
> +#include "qapi/qapi-commands-machine-target.h"
>  #include "qapi/qapi-commands-misc.h"
>  #include "hw/i386/pc.h"
>  
> @@ -650,23 +651,52 @@ const MonitorDef *target_monitor_defs(void)
>  return monitor_defs;
>  }
>  
> +TargetHumanReadableText *qmp_x_query_lapic(int64_t apicid,
> +   Error **errp)
> +{
> +TargetHumanReadableText *ret;
> +g_autoptr(GString) buf = NULL;
> +CPUState *cs = cpu_by_arch_id(apicid);
> +
> +if (!cs) {
> +error_setg(errp, "No CPU with APIC ID %" PRId64 " available", 
> apicid);
> +return NULL;
> +}

Suppose the accelerator is KVM, the CPUState (cs) might not be syncrhonized with
KVM kernel space (APIC+PIR). As a result, the data is stale.

I sent below patch long time ago but it never got reviewed.

https://lore.kernel.org/qemu-devel/20210908143803.29191-1-dongli.zh...@oracle.com/

> +
> +buf = x86_cpu_format_local_apic_state(cs, CPU_DUMP_FPU, errp);
> +if (!buf) {
> +return NULL;
> +}
> +
> +ret = g_new0(TargetHumanReadableText, 1);
> +ret->human_readable_text = g_steal_pointer(>str);
> +return ret;
> +}
> +
>  void hmp_info_local_apic(Monitor *mon, const QDict *qdict)
>  {
> -CPUState *cs;
> +Error *err = NULL;
> +g_autoptr(TargetHumanReadableText) info = NULL;
> +int64_t apicid;
>  
>  if (qdict_haskey(qdict, "apic-id")) {
> -int id = qdict_get_try_int(qdict, "apic-id", 0);
> -cs = cpu_by_arch_id(id);
> +apicid = qdict_get_try_int(qdict, "apic-id", 0);

Here is where I used to fix with the patch.

Thank you very much!

Dongli Zhang

>  } else {
> -cs = mon_get_cpu(mon);
> +CPUState *cs = mon_get_cpu(mon);
> +if (!cs) {
> +monitor_printf(mon, "No CPU available\n");
> +return;
> +}
> +apicid = cpu_get_arch_id(cs);
>  }
>  
> -
> -if (!cs) {
> -monitor_printf(mon, "No CPU available\n");
> +info = qmp_x_query_lapic(apicid, );
> +if (err) {
> +error_report_err(err);
>  return;
>  }
> -x86_cpu_dump_local_apic_state(cs, CPU_DUMP_FPU);
> +
> +monitor_printf(mon, "%s", info->human_readable_text);
>  }
>  
>  SevInfo *qmp_query_sev(Error **errp)
>

[PATCH RESEND v2 1/1] hmp: synchronize cpu state for lapic info

2021-09-08 Thread Dongli Zhang

While the default "info lapic" always synchronizes cpu state ...

mon_get_cpu()
-> mon_get_cpu_sync(mon, true)
   -> cpu_synchronize_state(cpu)
  -> ioctl KVM_GET_LAPIC (taking KVM as example)

... the cpu state is not synchronized when the apic-id is available as
argument.

The cpu state should be synchronized when apic-id is available. Otherwise
the "info lapic " always returns stale data.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
Changed since v1:
  - I sent out wrong patch version in v1

 target/i386/monitor.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/target/i386/monitor.c b/target/i386/monitor.c
index 119211f0b0..d833ab5b8e 100644
--- a/target/i386/monitor.c
+++ b/target/i386/monitor.c
@@ -28,6 +28,7 @@
 #include "monitor/hmp-target.h"
 #include "monitor/hmp.h"
 #include "qapi/qmp/qdict.h"
+#include "sysemu/hw_accel.h"
 #include "sysemu/kvm.h"
 #include "sysemu/sev.h"
 #include "qapi/error.h"
@@ -656,7 +657,11 @@ void hmp_info_local_apic(Monitor *mon, const QDict *qdict)
 
 if (qdict_haskey(qdict, "apic-id")) {
 int id = qdict_get_try_int(qdict, "apic-id", 0);
+
 cs = cpu_by_arch_id(id);
+if (cs) {
+cpu_synchronize_state(cs);
+}
 } else {
 cs = mon_get_cpu(mon);
 }
-- 
2.17.1

Re: [PATCH v2 1/1] hmp: synchronize cpu state for lapic info

2021-07-22 Thread Dongli Zhang

May I get feedback for this bugfix?

So far the "info lapic " returns stale data and could not accurate reflect
the status in KVM.

Thank you very much!

Dongli Zhang

On 7/1/21 2:40 PM, Dongli Zhang wrote:
> While the default "info lapic" always synchronizes cpu state ...
> 
> mon_get_cpu()
> -> mon_get_cpu_sync(mon, true)
>-> cpu_synchronize_state(cpu)
>   -> ioctl KVM_GET_LAPIC (taking KVM as example)
> 
> ... the cpu state is not synchronized when the apic-id is available as
> argument.
> 
> The cpu state should be synchronized when apic-id is available. Otherwise
> the "info lapic " always returns stale data.
> 
> Signed-off-by: Dongli Zhang 
> ---
> Changed since v1:
>   - I sent out wrong patch version in v1
> 
>  target/i386/monitor.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/target/i386/monitor.c b/target/i386/monitor.c
> index 119211f0b0..d833ab5b8e 100644
> --- a/target/i386/monitor.c
> +++ b/target/i386/monitor.c
> @@ -28,6 +28,7 @@
>  #include "monitor/hmp-target.h"
>  #include "monitor/hmp.h"
>  #include "qapi/qmp/qdict.h"
> +#include "sysemu/hw_accel.h"
>  #include "sysemu/kvm.h"
>  #include "sysemu/sev.h"
>  #include "qapi/error.h"
> @@ -656,7 +657,11 @@ void hmp_info_local_apic(Monitor *mon, const QDict 
> *qdict)
>  
>  if (qdict_haskey(qdict, "apic-id")) {
>  int id = qdict_get_try_int(qdict, "apic-id", 0);
> +
>  cs = cpu_by_arch_id(id);
> +if (cs) {
> +cpu_synchronize_state(cs);
> +}
>  } else {
>  cs = mon_get_cpu(mon);
>  }
>

[PATCH v4 3/3] virtio-pci/hmp: implement device specific hmp interface

2021-07-14 Thread Dongli Zhang

This patch is to implement the device specific interface to dump the
mapping between virtio queues and vectors.

(qemu) info msix -d /machine/peripheral/vscsi0
Msg L.Addr Msg U.Addr Msg Data   Vect Ctrl
0xfee0 0x 0x4041 0x
0xfee0 0x 0x4051 0x
0xfee0 0x 0x4061 0x
0xfee0 0x 0x4071 0x
0xfee01000 0x 0x40b1 0x
0xfee02000 0x 0x40c1 0x
0xfee03000 0x 0x40d1 0x

MSI-X PBA
0 0 0 0 0 0 0

virtio pci vector info:
config: 0
queue 0: 1
queue 1: 2
queue 2: 3
queue 3: 4
queue 4: 5
queue 5: 6

Cc: Jason Wang 
Cc: Joe Jin 
Suggested-by: Jason Wang 
Signed-off-by: Dongli Zhang 
Acked-by: Jason Wang 
---
 hw/virtio/virtio-pci.c | 22 ++
 hw/virtio/virtio.c | 10 ++
 include/hw/virtio/virtio.h |  2 ++
 3 files changed, 34 insertions(+)

diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 433060ac02..2971e8049c 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -38,6 +38,7 @@
 #include "hw/virtio/virtio-bus.h"
 #include "qapi/visitor.h"
 #include "sysemu/replay.h"
+#include "monitor/monitor.h"
 
 #define VIRTIO_PCI_REGION_SIZE(dev) 
VIRTIO_PCI_CONFIG_OFF(msix_present(dev))
 
@@ -1587,6 +1588,26 @@ static void 
virtio_pci_modern_io_region_unmap(VirtIOPCIProxy *proxy,
 >mr);
 }
 
+static void virtio_pci_dc_msix_info(Monitor *mon, PCIDevice *dev,
+Error **errp)
+{
+DeviceState *qdev = DEVICE(dev);
+VirtIOPCIProxy *proxy = to_virtio_pci_proxy_fast(qdev);
+VirtIODevice *vdev = virtio_bus_get_device(>bus);
+int num = virtio_get_num_queues(vdev);
+int i;
+
+monitor_printf(mon, "virtio pci vector info:\n");
+
+monitor_printf(mon, "config: %d\n", virtio_get_config_vector(vdev));
+
+for (i = 0; i < num; i++)
+monitor_printf(mon, "queue %d: %u\n",
+   i, virtio_get_vector(vdev, i));
+
+monitor_printf(mon, "\n");
+}
+
 static void virtio_pci_pre_plugged(DeviceState *d, Error **errp)
 {
 VirtIOPCIProxy *proxy = VIRTIO_PCI(d);
@@ -2004,6 +2025,7 @@ static void virtio_pci_class_init(ObjectClass *klass, 
void *data)
 k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
 k->revision = VIRTIO_PCI_ABI_VERSION;
 k->class_id = PCI_CLASS_OTHERS;
+k->msix_info = virtio_pci_dc_msix_info;
 device_class_set_parent_realize(dc, virtio_pci_dc_realize,
 >parent_dc_realize);
 dc->reset = virtio_pci_reset;
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 874377f37a..ea54939e98 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -2581,6 +2581,16 @@ void virtio_notify_config(VirtIODevice *vdev)
 virtio_notify_vector(vdev, vdev->config_vector);
 }
 
+uint16_t virtio_get_vector(VirtIODevice *vdev, int n)
+{
+return vdev->vq[n].vector;
+}
+
+uint16_t virtio_get_config_vector(VirtIODevice *vdev)
+{
+return vdev->config_vector;
+}
+
 static bool virtio_device_endian_needed(void *opaque)
 {
 VirtIODevice *vdev = opaque;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 8bab9cfb75..6746227f73 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -268,6 +268,8 @@ int virtio_set_status(VirtIODevice *vdev, uint8_t val);
 void virtio_reset(void *opaque);
 void virtio_update_irq(VirtIODevice *vdev);
 int virtio_set_features(VirtIODevice *vdev, uint64_t val);
+uint16_t virtio_get_vector(VirtIODevice *vdev, int n);
+uint16_t virtio_get_config_vector(VirtIODevice *vdev);
 
 /* Base devices.  */
 typedef struct VirtIOBlkConf VirtIOBlkConf;
-- 
2.17.1

[PATCH v4 2/3] msix/hmp: add interface to dump device specific info

2021-07-14 Thread Dongli Zhang

While the previous patch is to dump the MSI-X table, sometimes we may
need to dump device specific data, e.g., to help match the vector with
the specific device queue.

This patch is to add the PCI device specific interface to help dump
those information. Any PCI device class may implement this
PCIDeviceClass->msix_info interface.

Cc: Jason Wang 
Cc: Joe Jin 
Suggested-by: Jason Wang 
Signed-off-by: Dongli Zhang 
Acked-by: Jason Wang 
---
 hmp-commands-info.hx   |  7 ---
 include/hw/pci/pci.h   |  3 +++
 softmmu/qdev-monitor.c | 11 +++
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index ce5c550d44..4e831d7ae4 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -223,9 +223,10 @@ ERST
 
 {
 .name   = "msix",
-.args_type  = "dev:s",
-.params = "dev",
-.help   = "dump MSI-X information",
+.args_type  = "info:-d,dev:s",
+.params = "[-d] dev",
+.help   = "dump MSI-X information; "
+  "(-d: show device specific info)",
 .cmd= hmp_info_msix,
 },
 
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6be4e0c460..4620b9e757 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -129,6 +129,8 @@ typedef void PCIMapIORegionFunc(PCIDevice *pci_dev, int 
region_num,
 pcibus_t addr, pcibus_t size, int type);
 typedef void PCIUnregisterFunc(PCIDevice *pci_dev);
 
+typedef void PCIMSIXInfoFunc(Monitor *mon, PCIDevice *dev, Error **errp);
+
 typedef struct PCIIORegion {
 pcibus_t addr; /* current PCI mapping address. -1 means not mapped */
 #define PCI_BAR_UNMAPPED (~(pcibus_t)0)
@@ -224,6 +226,7 @@ struct PCIDeviceClass {
 PCIUnregisterFunc *exit;
 PCIConfigReadFunc *config_read;
 PCIConfigWriteFunc *config_write;
+PCIMSIXInfoFunc *msix_info;
 
 uint16_t vendor_id;
 uint16_t device_id;
diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index 7837a17d0d..7fd3fe0ada 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -1011,7 +1011,9 @@ void hmp_info_msix(Monitor *mon, const QDict *qdict)
 {
 const char *name = qdict_get_str(qdict, "dev");
 DeviceState *dev = find_device_state(name, NULL);
+bool info = qdict_get_try_bool(qdict, "info", false);
 PCIDevice *pci_dev;
+PCIDeviceClass *pc;
 Error *err = NULL;
 
 if (!dev) {
@@ -1027,6 +1029,15 @@ void hmp_info_msix(Monitor *mon, const QDict *qdict)
 pci_dev = PCI_DEVICE(dev);
 msix_dump_info(mon, pci_dev, );
 
+if (info) {
+pc = PCI_DEVICE_GET_CLASS(pci_dev);
+if (pc->msix_info) {
+pc->msix_info(mon, pci_dev, );
+} else {
+error_setg(, "Device specific info not supported");
+}
+}
+
 exit:
 hmp_handle_error(mon, err);
 }
-- 
2.17.1

[PATCH v4 1/3] msix/hmp: add hmp interface to dump MSI-X info

2021-07-14 Thread Dongli Zhang

This patch is to add the HMP interface to dump MSI-X table and PBA, in
order to help diagnose the loss of IRQ issue in VM (e.g., if an MSI-X
vector is erroneously masked permanently). Here is the example with
vhost-scsi:

(qemu) info msix /machine/peripheral/vscsi0
Msg L.Addr Msg U.Addr Msg Data   Vect Ctrl
0xfee0 0x 0x4041 0x
0xfee0 0x 0x4051 0x
0xfee0 0x 0x4061 0x
0xfee0 0x 0x4071 0x
0xfee01000 0x 0x40b1 0x
0xfee02000 0x 0x40c1 0x
0xfee03000 0x 0x40d1 0x

MSI-X PBA
0 0 0 0 0 0 0

Since this is just for debugging by humans, provide the command only in
HMP, not in QMP.

Cc: Jason Wang 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
Acked-by: Dr. David Alan Gilbert 
Acked-by: Jason Wang 
---
Changed since v1/v2:
  - Add msix_present() to pci-stub.c to avoid build error
Changed since v3:
  - Rework commit message to explain why only HMP (suggested by Markus
Armbruster)

 hmp-commands-info.hx   | 13 +
 hw/pci/msix.c  | 63 ++
 hw/pci/pci-stub.c  |  5 
 include/hw/pci/msix.h  |  2 ++
 include/monitor/hmp.h  |  1 +
 softmmu/qdev-monitor.c | 25 +
 6 files changed, 109 insertions(+)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index 27206ac049..ce5c550d44 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -221,6 +221,19 @@ SRST
 Show PCI information.
 ERST
 
+{
+.name   = "msix",
+.args_type  = "dev:s",
+.params = "dev",
+.help   = "dump MSI-X information",
+.cmd= hmp_info_msix,
+},
+
+SRST
+  ``info msix`` *dev*
+Dump MSI-X information for device *dev*.
+ERST
+
 #if defined(TARGET_I386) || defined(TARGET_SH4) || defined(TARGET_SPARC) || \
 defined(TARGET_PPC) || defined(TARGET_XTENSA) || defined(TARGET_M68K)
 {
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index ae9331cd0b..4b4ec87eee 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -22,6 +22,7 @@
 #include "sysemu/xen.h"
 #include "migration/qemu-file-types.h"
 #include "migration/vmstate.h"
+#include "monitor/monitor.h"
 #include "qemu/range.h"
 #include "qapi/error.h"
 #include "trace.h"
@@ -669,3 +670,65 @@ const VMStateDescription vmstate_msix = {
 VMSTATE_END_OF_LIST()
 }
 };
+
+static void msix_dump_table(Monitor *mon, PCIDevice *dev)
+{
+int vector;
+uint32_t val;
+uint8_t *table_entry;
+
+monitor_printf(mon, "Msg L.Addr ");
+monitor_printf(mon, "Msg U.Addr ");
+monitor_printf(mon, "Msg Data   ");
+monitor_printf(mon, "Vect Ctrl\n");
+
+for (vector = 0; vector < dev->msix_entries_nr; vector++) {
+table_entry = dev->msix_table + vector * PCI_MSIX_ENTRY_SIZE;
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR);
+monitor_printf(mon, "0x%08x ", val);
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_UPPER_ADDR);
+monitor_printf(mon, "0x%08x ", val);
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_DATA);
+monitor_printf(mon, "0x%08x ", val);
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_VECTOR_CTRL);
+monitor_printf(mon, "0x%08x\n", val);
+}
+
+monitor_printf(mon, "\n");
+}
+
+static void msix_dump_pba(Monitor *mon, PCIDevice *dev)
+{
+int vector;
+
+monitor_printf(mon, "MSI-X PBA\n");
+
+for (vector = 0; vector < dev->msix_entries_nr; vector++) {
+monitor_printf(mon, "%d ", !!msix_is_pending(dev, vector));
+
+if (vector % 16 == 15) {
+monitor_printf(mon, "\n");
+}
+}
+
+if (vector % 16 != 15) {
+monitor_printf(mon, "\n");
+}
+
+monitor_printf(mon, "\n");
+}
+
+void msix_dump_info(Monitor *mon, PCIDevice *dev, Error **errp)
+{
+if (!msix_present(dev)) {
+error_setg(errp, "MSI-X not available");
+return;
+}
+
+msix_dump_table(mon, dev);
+msix_dump_pba(mon, dev);
+}
diff --git a/hw/pci/pci-stub.c b/hw/pci/pci-stub.c
index 3a027c42e4..8191d49d56 100644
--- a/hw/pci/pci-stub.c
+++ b/hw/pci/pci-stub.c
@@ -91,3 +91,8 @@ MSIMessage msix_get_message(PCIDevice *dev, unsigned int 
vector)
 {
 g_assert_not_reached();
 }
+
+void msix_dump_info(Monitor *mon, PCIDevice *dev, Error **errp)
+{
+ monitor_printf(mon, "PCI devices not supported\n");
+}
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 4c4a60c739..10a4500295 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -47,6 +47,8 @@ int msix_set_vector_notifiers(PCIDevice *dev,
   MSIVectorPollNo

[PATCH v4 0/3] To add HMP interface to dump PCI MSI-X table/PBA

2021-07-14 Thread Dongli Zhang

This is v4 to rework commit message to explain why only HMP.

Changed since RFC v1:
  - Add heading to MSI-X table (suggested by David Alan Gilbert)
  - Add device specific interface, e.g., to dump virtio-pci queue-to-vector
mapping (Suggested By Jason)

Changed since v2:
  - Add msix_present() to pci-stub.c to avoid build error

Changed since v3:
  - Rework commit message to explain why only HMP (suggested by Markus
Armbruster)



This patch is to introduce the new HMP command to dump the MSI-X table/PBA.

Here is the RFC v1:

https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg04673.html

The idea was inspired by below discussion:

https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg09020.html


(qemu) info msix -d /machine/peripheral/vscsi0
Msg L.Addr Msg U.Addr Msg Data   Vect Ctrl
0xfee0 0x 0x4041 0x
0xfee0 0x 0x4051 0x
0xfee0 0x 0x4061 0x
0xfee0 0x 0x4071 0x
0xfee01000 0x 0x40b1 0x
0xfee02000 0x 0x40c1 0x
0xfee03000 0x 0x40d1 0x

MSI-X PBA
0 0 0 0 0 0 0

virtio pci vector info:
config: 0
queue 0: 1
queue 1: 2
queue 2: 3
queue 3: 4
queue 4: 5
queue 5: 6


 hmp-commands-info.hx   | 14 +
 hw/pci/msix.c  | 63 +
 hw/pci/pci-stub.c  |  5 
 hw/virtio/virtio-pci.c | 22 ++
 hw/virtio/virtio.c | 10 +++
 include/hw/pci/msix.h  |  2 ++
 include/hw/pci/pci.h   |  3 ++
 include/hw/virtio/virtio.h |  2 ++
 include/monitor/hmp.h  |  1 +
 softmmu/qdev-monitor.c | 36 +++
 10 files changed, 158 insertions(+)

Thank you very much!

Dongli Zhang

Re: [PATCH v3 1/3] msix/hmp: add hmp interface to dump MSI-X info

2021-07-14 Thread Dongli Zhang

Hi Markus,

On 7/14/21 2:42 AM, Markus Armbruster wrote:
> Dongli Zhang  writes:
> 
>> Hi Markus,
>>
>> On 7/13/21 10:46 PM, Markus Armbruster wrote:
>>> Dongli Zhang  writes:
>>>
>>>> This patch is to add the HMP interface to dump MSI-X table and PBA, in
>>>> order to help diagnose the loss of IRQ issue in VM (e.g., if an MSI-X
>>>> vector is erroneously masked permanently). Here is the example with
>>>> vhost-scsi:
>>>>
>>>> (qemu) info msix /machine/peripheral/vscsi0
>>>> Msg L.Addr Msg U.Addr Msg Data   Vect Ctrl
>>>> 0xfee0 0x 0x4041 0x
>>>> 0xfee0 0x 0x4051 0x
>>>> 0xfee0 0x 0x4061 0x
>>>> 0xfee0 0x 0x4071 0x
>>>> 0xfee01000 0x 0x40b1 0x
>>>> 0xfee02000 0x 0x40c1 0x
>>>> 0xfee03000 0x 0x40d1 0x
>>>>
>>>> MSI-X PBA
>>>> 0 0 0 0 0 0 0
>>>>
>>>> Since the number of MSI-X entries is not determined and might be very
>>>> large, it is sometimes inappropriate to dump via QMP.
>>>
>>> Why?  What makes HMP different?
>>
>> Here are two reasons.
>>
>> 1. The size of MSI-X table is nondeterministic and might be very large, e.g.,
>> the PCI_MSIX_FLAGS_QSIZE is 0x07FF. The "info tlb" (which is a table and 
>> similar
>> to MSI-X) and "info lapic" also only support hmp.
>>
>> 2. The [PATCH 3/3] of this patchset support device specific data, the
>> definitional of which varies depending on each device type (so far only
>> virtio-pci supports the interface).
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>>>
>>>> Therefore, this patch dumps MSI-X information only via HMP, which is
>>>> similar to the implementation of hmp_info_mem().
> 
> I think you're mixing up valid and invalid arguments :)  Let me try to
> pick them apart.
> 
> 1a. "Command output may be too large for QMP, therefore provide only the
> HMP command" makes no sense.
> 
> Both QMP and HMP are not meant for bulk data transfer.  They are control
> plane, not data plane.  To illustrate what that means, consider a backup
> feature.  The commands to control the backup task are QMP (and HMP, if
> desired), but the actual data transfer uses some other channel, so it
> doesn't clog the QMP pipes.
> 
> If the data is too bulky for QMP, then it's too bulky for HMP, too.
> 
> 1b. "info tlb" and "info lapic" are HMP only because they are debugging
> commands for humans.  Debugging is not necessarily done by humans only.
> Sometimes, humans use programs to help them debug, and these programs
> would be better off with QMP commands.  For the HMP-only debugging
> commands, we decided providing for (largely hypothetical) debugging
> scripts wasn't worthwhile.  A similar argument could probably be made
> for "info msix".
> 
> 2. "Output is in part specific to Devices, therefore provide only the
> HMP command" is also iffy.  Yes, hacking up a bunch of monitor_printf()s
> is probably easier than specifying a QAPI schema.   "It's work" is no
> excuse.  "It's more work than it's worth" can be one.  But then we're
> back at argument 1b.

Thank you very much for the explanation!

> 
> Your commit message's first sentence suggests this is indeed just for
> debugging.  If this is the case, consider replacing
> 
> Since the number of MSI-X entries is not determined and might be very
> large, it is sometimes inappropriate to dump via QMP.
> 
> Therefore, this patch dumps MSI-X information only via HMP, which is
> similar to the implementation of hmp_info_mem().
> 
> by
> 
> Since this is just for debugging by humans, provide the command only
> in HMP, not in QMP.
> 

Yes, this is for debugging purpose. The primary objective is to confirm a
specific vector is not erroneously masked permanently, when a specific device
queue is stuck.

I will replace commit message and send v4.

Thank you very much!

Dongli Zhang

Re: [PATCH v3 1/3] msix/hmp: add hmp interface to dump MSI-X info

2021-07-14 Thread Dongli Zhang

Hi Markus,

On 7/13/21 10:46 PM, Markus Armbruster wrote:
> Dongli Zhang  writes:
> 
>> This patch is to add the HMP interface to dump MSI-X table and PBA, in
>> order to help diagnose the loss of IRQ issue in VM (e.g., if an MSI-X
>> vector is erroneously masked permanently). Here is the example with
>> vhost-scsi:
>>
>> (qemu) info msix /machine/peripheral/vscsi0
>> Msg L.Addr Msg U.Addr Msg Data   Vect Ctrl
>> 0xfee0 0x 0x4041 0x
>> 0xfee0 0x 0x4051 0x
>> 0xfee0 0x 0x4061 0x
>> 0xfee0 0x 0x4071 0x
>> 0xfee01000 0x 0x40b1 0x
>> 0xfee02000 0x 0x40c1 0x
>> 0xfee03000 0x 0x40d1 0x
>>
>> MSI-X PBA
>> 0 0 0 0 0 0 0
>>
>> Since the number of MSI-X entries is not determined and might be very
>> large, it is sometimes inappropriate to dump via QMP.
> 
> Why?  What makes HMP different?

Here are two reasons.

1. The size of MSI-X table is nondeterministic and might be very large, e.g.,
the PCI_MSIX_FLAGS_QSIZE is 0x07FF. The "info tlb" (which is a table and similar
to MSI-X) and "info lapic" also only support hmp.

2. The [PATCH 3/3] of this patchset support device specific data, the
definitional of which varies depending on each device type (so far only
virtio-pci supports the interface).

Thank you very much!

Dongli Zhang

> 
>> Therefore, this patch dumps MSI-X information only via HMP, which is
>> similar to the implementation of hmp_info_mem().
>>
>> Cc: Jason Wang 
>> Cc: Joe Jin 
>> Signed-off-by: Dongli Zhang 
>> Acked-by: Dr. David Alan Gilbert 
>> Acked-by: Jason Wang 
>

[PATCH v3 2/3] msix/hmp: add interface to dump device specific info

2021-07-13 Thread Dongli Zhang

While the previous patch is to dump the MSI-X table, sometimes we may
need to dump device specific data, e.g., to help match the vector with
the specific device queue.

This patch is to add the PCI device specific interface to help dump
those information. Any PCI device class may implement this
PCIDeviceClass->msix_info interface.

Cc: Jason Wang 
Cc: Joe Jin 
Suggested-by: Jason Wang 
Signed-off-by: Dongli Zhang 
Acked-by: Jason Wang 
---
 hmp-commands-info.hx   |  7 ---
 include/hw/pci/pci.h   |  3 +++
 softmmu/qdev-monitor.c | 11 +++
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index ce5c550d44..4e831d7ae4 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -223,9 +223,10 @@ ERST
 
 {
 .name   = "msix",
-.args_type  = "dev:s",
-.params = "dev",
-.help   = "dump MSI-X information",
+.args_type  = "info:-d,dev:s",
+.params = "[-d] dev",
+.help   = "dump MSI-X information; "
+  "(-d: show device specific info)",
 .cmd= hmp_info_msix,
 },
 
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6be4e0c460..4620b9e757 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -129,6 +129,8 @@ typedef void PCIMapIORegionFunc(PCIDevice *pci_dev, int 
region_num,
 pcibus_t addr, pcibus_t size, int type);
 typedef void PCIUnregisterFunc(PCIDevice *pci_dev);
 
+typedef void PCIMSIXInfoFunc(Monitor *mon, PCIDevice *dev, Error **errp);
+
 typedef struct PCIIORegion {
 pcibus_t addr; /* current PCI mapping address. -1 means not mapped */
 #define PCI_BAR_UNMAPPED (~(pcibus_t)0)
@@ -224,6 +226,7 @@ struct PCIDeviceClass {
 PCIUnregisterFunc *exit;
 PCIConfigReadFunc *config_read;
 PCIConfigWriteFunc *config_write;
+PCIMSIXInfoFunc *msix_info;
 
 uint16_t vendor_id;
 uint16_t device_id;
diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index 7837a17d0d..7fd3fe0ada 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -1011,7 +1011,9 @@ void hmp_info_msix(Monitor *mon, const QDict *qdict)
 {
 const char *name = qdict_get_str(qdict, "dev");
 DeviceState *dev = find_device_state(name, NULL);
+bool info = qdict_get_try_bool(qdict, "info", false);
 PCIDevice *pci_dev;
+PCIDeviceClass *pc;
 Error *err = NULL;
 
 if (!dev) {
@@ -1027,6 +1029,15 @@ void hmp_info_msix(Monitor *mon, const QDict *qdict)
 pci_dev = PCI_DEVICE(dev);
 msix_dump_info(mon, pci_dev, );
 
+if (info) {
+pc = PCI_DEVICE_GET_CLASS(pci_dev);
+if (pc->msix_info) {
+pc->msix_info(mon, pci_dev, );
+} else {
+error_setg(, "Device specific info not supported");
+}
+}
+
 exit:
 hmp_handle_error(mon, err);
 }
-- 
2.17.1

[PATCH v3 1/3] msix/hmp: add hmp interface to dump MSI-X info

2021-07-13 Thread Dongli Zhang

This patch is to add the HMP interface to dump MSI-X table and PBA, in
order to help diagnose the loss of IRQ issue in VM (e.g., if an MSI-X
vector is erroneously masked permanently). Here is the example with
vhost-scsi:

(qemu) info msix /machine/peripheral/vscsi0
Msg L.Addr Msg U.Addr Msg Data   Vect Ctrl
0xfee0 0x 0x4041 0x
0xfee0 0x 0x4051 0x
0xfee0 0x 0x4061 0x
0xfee0 0x 0x4071 0x
0xfee01000 0x 0x40b1 0x
0xfee02000 0x 0x40c1 0x
0xfee03000 0x 0x40d1 0x

MSI-X PBA
0 0 0 0 0 0 0

Since the number of MSI-X entries is not determined and might be very
large, it is sometimes inappropriate to dump via QMP.

Therefore, this patch dumps MSI-X information only via HMP, which is
similar to the implementation of hmp_info_mem().

Cc: Jason Wang 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
Acked-by: Dr. David Alan Gilbert 
Acked-by: Jason Wang 
---
Changed since v1/v2:
  - Add msix_present() to pci-stub.c to avoid build error

 hmp-commands-info.hx   | 13 +
 hw/pci/msix.c  | 63 ++
 hw/pci/pci-stub.c  |  5 
 include/hw/pci/msix.h  |  2 ++
 include/monitor/hmp.h  |  1 +
 softmmu/qdev-monitor.c | 25 +
 6 files changed, 109 insertions(+)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index 27206ac049..ce5c550d44 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -221,6 +221,19 @@ SRST
 Show PCI information.
 ERST
 
+{
+.name   = "msix",
+.args_type  = "dev:s",
+.params = "dev",
+.help   = "dump MSI-X information",
+.cmd= hmp_info_msix,
+},
+
+SRST
+  ``info msix`` *dev*
+Dump MSI-X information for device *dev*.
+ERST
+
 #if defined(TARGET_I386) || defined(TARGET_SH4) || defined(TARGET_SPARC) || \
 defined(TARGET_PPC) || defined(TARGET_XTENSA) || defined(TARGET_M68K)
 {
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index ae9331cd0b..4b4ec87eee 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -22,6 +22,7 @@
 #include "sysemu/xen.h"
 #include "migration/qemu-file-types.h"
 #include "migration/vmstate.h"
+#include "monitor/monitor.h"
 #include "qemu/range.h"
 #include "qapi/error.h"
 #include "trace.h"
@@ -669,3 +670,65 @@ const VMStateDescription vmstate_msix = {
 VMSTATE_END_OF_LIST()
 }
 };
+
+static void msix_dump_table(Monitor *mon, PCIDevice *dev)
+{
+int vector;
+uint32_t val;
+uint8_t *table_entry;
+
+monitor_printf(mon, "Msg L.Addr ");
+monitor_printf(mon, "Msg U.Addr ");
+monitor_printf(mon, "Msg Data   ");
+monitor_printf(mon, "Vect Ctrl\n");
+
+for (vector = 0; vector < dev->msix_entries_nr; vector++) {
+table_entry = dev->msix_table + vector * PCI_MSIX_ENTRY_SIZE;
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR);
+monitor_printf(mon, "0x%08x ", val);
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_UPPER_ADDR);
+monitor_printf(mon, "0x%08x ", val);
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_DATA);
+monitor_printf(mon, "0x%08x ", val);
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_VECTOR_CTRL);
+monitor_printf(mon, "0x%08x\n", val);
+}
+
+monitor_printf(mon, "\n");
+}
+
+static void msix_dump_pba(Monitor *mon, PCIDevice *dev)
+{
+int vector;
+
+monitor_printf(mon, "MSI-X PBA\n");
+
+for (vector = 0; vector < dev->msix_entries_nr; vector++) {
+monitor_printf(mon, "%d ", !!msix_is_pending(dev, vector));
+
+if (vector % 16 == 15) {
+monitor_printf(mon, "\n");
+}
+}
+
+if (vector % 16 != 15) {
+monitor_printf(mon, "\n");
+}
+
+monitor_printf(mon, "\n");
+}
+
+void msix_dump_info(Monitor *mon, PCIDevice *dev, Error **errp)
+{
+if (!msix_present(dev)) {
+error_setg(errp, "MSI-X not available");
+return;
+}
+
+msix_dump_table(mon, dev);
+msix_dump_pba(mon, dev);
+}
diff --git a/hw/pci/pci-stub.c b/hw/pci/pci-stub.c
index 3a027c42e4..8191d49d56 100644
--- a/hw/pci/pci-stub.c
+++ b/hw/pci/pci-stub.c
@@ -91,3 +91,8 @@ MSIMessage msix_get_message(PCIDevice *dev, unsigned int 
vector)
 {
 g_assert_not_reached();
 }
+
+void msix_dump_info(Monitor *mon, PCIDevice *dev, Error **errp)
+{
+ monitor_printf(mon, "PCI devices not supported\n");
+}
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 4c4a60c739..10a4500295 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -47,6 +47,8 @@ int msix_set_vector_notifiers(PCIDevice *

[PATCH v3 3/3] virtio-pci/hmp: implement device specific hmp interface

2021-07-13 Thread Dongli Zhang

This patch is to implement the device specific interface to dump the
mapping between virtio queues and vectors.

(qemu) info msix -d /machine/peripheral/vscsi0
Msg L.Addr Msg U.Addr Msg Data   Vect Ctrl
0xfee0 0x 0x4041 0x
0xfee0 0x 0x4051 0x
0xfee0 0x 0x4061 0x
0xfee0 0x 0x4071 0x
0xfee01000 0x 0x40b1 0x
0xfee02000 0x 0x40c1 0x
0xfee03000 0x 0x40d1 0x

MSI-X PBA
0 0 0 0 0 0 0

virtio pci vector info:
config: 0
queue 0: 1
queue 1: 2
queue 2: 3
queue 3: 4
queue 4: 5
queue 5: 6

Cc: Jason Wang 
Cc: Joe Jin 
Suggested-by: Jason Wang 
Signed-off-by: Dongli Zhang 
Acked-by: Jason Wang 
---
 hw/virtio/virtio-pci.c | 22 ++
 hw/virtio/virtio.c | 10 ++
 include/hw/virtio/virtio.h |  2 ++
 3 files changed, 34 insertions(+)

diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 433060ac02..2971e8049c 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -38,6 +38,7 @@
 #include "hw/virtio/virtio-bus.h"
 #include "qapi/visitor.h"
 #include "sysemu/replay.h"
+#include "monitor/monitor.h"
 
 #define VIRTIO_PCI_REGION_SIZE(dev) 
VIRTIO_PCI_CONFIG_OFF(msix_present(dev))
 
@@ -1587,6 +1588,26 @@ static void 
virtio_pci_modern_io_region_unmap(VirtIOPCIProxy *proxy,
 >mr);
 }
 
+static void virtio_pci_dc_msix_info(Monitor *mon, PCIDevice *dev,
+Error **errp)
+{
+DeviceState *qdev = DEVICE(dev);
+VirtIOPCIProxy *proxy = to_virtio_pci_proxy_fast(qdev);
+VirtIODevice *vdev = virtio_bus_get_device(>bus);
+int num = virtio_get_num_queues(vdev);
+int i;
+
+monitor_printf(mon, "virtio pci vector info:\n");
+
+monitor_printf(mon, "config: %d\n", virtio_get_config_vector(vdev));
+
+for (i = 0; i < num; i++)
+monitor_printf(mon, "queue %d: %u\n",
+   i, virtio_get_vector(vdev, i));
+
+monitor_printf(mon, "\n");
+}
+
 static void virtio_pci_pre_plugged(DeviceState *d, Error **errp)
 {
 VirtIOPCIProxy *proxy = VIRTIO_PCI(d);
@@ -2004,6 +2025,7 @@ static void virtio_pci_class_init(ObjectClass *klass, 
void *data)
 k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
 k->revision = VIRTIO_PCI_ABI_VERSION;
 k->class_id = PCI_CLASS_OTHERS;
+k->msix_info = virtio_pci_dc_msix_info;
 device_class_set_parent_realize(dc, virtio_pci_dc_realize,
 >parent_dc_realize);
 dc->reset = virtio_pci_reset;
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 874377f37a..ea54939e98 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -2581,6 +2581,16 @@ void virtio_notify_config(VirtIODevice *vdev)
 virtio_notify_vector(vdev, vdev->config_vector);
 }
 
+uint16_t virtio_get_vector(VirtIODevice *vdev, int n)
+{
+return vdev->vq[n].vector;
+}
+
+uint16_t virtio_get_config_vector(VirtIODevice *vdev)
+{
+return vdev->config_vector;
+}
+
 static bool virtio_device_endian_needed(void *opaque)
 {
 VirtIODevice *vdev = opaque;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 8bab9cfb75..6746227f73 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -268,6 +268,8 @@ int virtio_set_status(VirtIODevice *vdev, uint8_t val);
 void virtio_reset(void *opaque);
 void virtio_update_irq(VirtIODevice *vdev);
 int virtio_set_features(VirtIODevice *vdev, uint64_t val);
+uint16_t virtio_get_vector(VirtIODevice *vdev, int n);
+uint16_t virtio_get_config_vector(VirtIODevice *vdev);
 
 /* Base devices.  */
 typedef struct VirtIOBlkConf VirtIOBlkConf;
-- 
2.17.1

[PATCH v3 0/3] To add HMP interface to dump PCI MSI-X table/PBA

2021-07-13 Thread Dongli Zhang

This is v3 to add msix_present() to pci-stub.c to avoid build error.

Changed since RFC v1:
  - Add heading to MSI-X table (suggested by David Alan Gilbert)
  - Add device specific interface, e.g., to dump virtio-pci queue-to-vector
mapping (Suggested By Jason)

Changed since v2:
  - Add msix_present() to pci-stub.c to avoid build error



This patch is to introduce the new HMP command to dump the MSI-X table/PBA.

Here is the RFC v1:

https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg04673.html

The idea was inspired by below discussion:

https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg09020.html


(qemu) info msix -d /machine/peripheral/vscsi0
Msg L.Addr Msg U.Addr Msg Data   Vect Ctrl
0xfee0 0x 0x4041 0x
0xfee0 0x 0x4051 0x
0xfee0 0x 0x4061 0x
0xfee0 0x 0x4071 0x
0xfee01000 0x 0x40b1 0x
0xfee02000 0x 0x40c1 0x
0xfee03000 0x 0x40d1 0x

MSI-X PBA
0 0 0 0 0 0 0

virtio pci vector info:
config: 0
queue 0: 1
queue 1: 2
queue 2: 3
queue 3: 4
queue 4: 5
queue 5: 6


 hmp-commands-info.hx   | 14 +
 hw/pci/msix.c  | 63 +
 hw/pci/pci-stub.c  |  5 
 hw/virtio/virtio-pci.c | 22 ++
 hw/virtio/virtio.c | 10 +++
 include/hw/pci/msix.h  |  2 ++
 include/hw/pci/pci.h   |  3 ++
 include/hw/virtio/virtio.h |  2 ++
 include/monitor/hmp.h  |  1 +
 softmmu/qdev-monitor.c | 36 +++
 10 files changed, 158 insertions(+)

Thank you very much!

Dongli Zhang

Re: [PATCH RESEND v2 0/3] To add HMP interface to dump PCI MSI-X table/PBA

2021-07-13 Thread Dongli Zhang




On 7/13/21 1:59 PM, Michael S. Tsirkin wrote:
> On Mon, Jul 12, 2021 at 04:58:09PM -0700, Dongli Zhang wrote:
>> I have just rebased the v2 patchset on top of most recent qemu to resend.
>>
>> This patch is to introduce the new HMP command to dump the MSI-X table/PBA.
>>
>> Here is the RFC v1:
>>
>> https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg04673.html__;!!ACWV5N9M2RV99hQ!a_4RXoxWC3ppSNDk4TFH5-UjOMa9DsHfWF3x814Yf6urzW3rxZrwhWCU2zTFVDD57lE$
>>  
>>
>> The idea was inspired by below discussion:
>>
>> https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg09020.html__;!!ACWV5N9M2RV99hQ!a_4RXoxWC3ppSNDk4TFH5-UjOMa9DsHfWF3x814Yf6urzW3rxZrwhWCU2zTFAnnr09M$
>>  
> 
> Fails build for microblaze:
> 
> /scm/qemu/build/../softmmu/qdev-monitor.c:1030: undefined reference to 
> `msix_dump_info'
> collect2: error: ld returned 1 exit status
> [2525/8445] Compiling C object 
> libqemu-microblaze-softmmu.fa.p/meson-generated_.._qapi_qapi-visit-misc-target.c.o
> 

I am so sorry for the build error. I will add below to the hw/pci/pci-stub.c.

void msix_dump_info(Monitor *mon, PCIDevice *dev, Error **errp)
{
monitor_printf(mon, "PCI devices not supported\n");
}

I will send out v3.


I should test for all arch instead of only x86_64-softmmu in the future.

Thank you very much!

Dongli Zhang

[PATCH RESEND 1/1] multi-process: fix usage information

2021-07-12 Thread Dongli Zhang

>From source code, the 'devid' of x-remote-object should be one of devices
in remote QEMU process.

Signed-off-by: Dongli Zhang 
Reviewed-by: Jagannathan Raman 
---
Resend to be applied as trivial patch.

I have verified by reading the code and playing with below orchestrator.
https://github.com/finallyjustice/sample/blob/master/kvm/multiprocess/orchestrator.py

 docs/system/multi-process.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/system/multi-process.rst b/docs/system/multi-process.rst
index 46bb0cafc2..210531ee17 100644
--- a/docs/system/multi-process.rst
+++ b/docs/system/multi-process.rst
@@ -45,7 +45,7 @@ Following is a description of command-line used to launch 
mpqemu.
   -device lsi53c895a,id=lsi0 \
   -drive id=drive_image2,file=/build/ol7-nvme-test-1.qcow2   \
   -device scsi-hd,id=drive2,drive=drive_image2,bus=lsi0.0,scsi-id=0  \
-  -object x-remote-object,id=robj1,devid=lsi1,fd=4,
+  -object x-remote-object,id=robj1,devid=lsi0,fd=4,
 
 * QEMU:
 
-- 
2.17.1

[PATCH RESEND v2 3/3] virtio-pci/hmp: implement device specific hmp interface

2021-07-12 Thread Dongli Zhang

This patch is to implement the device specific interface to dump the
mapping between virtio queues and vectors.

(qemu) info msix -d /machine/peripheral/vscsi0
Msg L.Addr Msg U.Addr Msg Data   Vect Ctrl
0xfee0 0x 0x4041 0x
0xfee0 0x 0x4051 0x
0xfee0 0x 0x4061 0x
0xfee0 0x 0x4071 0x
0xfee01000 0x 0x40b1 0x
0xfee02000 0x 0x40c1 0x
0xfee03000 0x 0x40d1 0x

MSI-X PBA
0 0 0 0 0 0 0

virtio pci vector info:
config: 0
queue 0: 1
queue 1: 2
queue 2: 3
queue 3: 4
queue 4: 5
queue 5: 6

Cc: Jason Wang 
Cc: Joe Jin 
Suggested-by: Jason Wang 
Signed-off-by: Dongli Zhang 
---
 hw/virtio/virtio-pci.c | 22 ++
 hw/virtio/virtio.c | 10 ++
 include/hw/virtio/virtio.h |  2 ++
 3 files changed, 34 insertions(+)

diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 433060ac02..2971e8049c 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -38,6 +38,7 @@
 #include "hw/virtio/virtio-bus.h"
 #include "qapi/visitor.h"
 #include "sysemu/replay.h"
+#include "monitor/monitor.h"
 
 #define VIRTIO_PCI_REGION_SIZE(dev) 
VIRTIO_PCI_CONFIG_OFF(msix_present(dev))
 
@@ -1587,6 +1588,26 @@ static void 
virtio_pci_modern_io_region_unmap(VirtIOPCIProxy *proxy,
 >mr);
 }
 
+static void virtio_pci_dc_msix_info(Monitor *mon, PCIDevice *dev,
+Error **errp)
+{
+DeviceState *qdev = DEVICE(dev);
+VirtIOPCIProxy *proxy = to_virtio_pci_proxy_fast(qdev);
+VirtIODevice *vdev = virtio_bus_get_device(>bus);
+int num = virtio_get_num_queues(vdev);
+int i;
+
+monitor_printf(mon, "virtio pci vector info:\n");
+
+monitor_printf(mon, "config: %d\n", virtio_get_config_vector(vdev));
+
+for (i = 0; i < num; i++)
+monitor_printf(mon, "queue %d: %u\n",
+   i, virtio_get_vector(vdev, i));
+
+monitor_printf(mon, "\n");
+}
+
 static void virtio_pci_pre_plugged(DeviceState *d, Error **errp)
 {
 VirtIOPCIProxy *proxy = VIRTIO_PCI(d);
@@ -2004,6 +2025,7 @@ static void virtio_pci_class_init(ObjectClass *klass, 
void *data)
 k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
 k->revision = VIRTIO_PCI_ABI_VERSION;
 k->class_id = PCI_CLASS_OTHERS;
+k->msix_info = virtio_pci_dc_msix_info;
 device_class_set_parent_realize(dc, virtio_pci_dc_realize,
 >parent_dc_realize);
 dc->reset = virtio_pci_reset;
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 874377f37a..ea54939e98 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -2581,6 +2581,16 @@ void virtio_notify_config(VirtIODevice *vdev)
 virtio_notify_vector(vdev, vdev->config_vector);
 }
 
+uint16_t virtio_get_vector(VirtIODevice *vdev, int n)
+{
+return vdev->vq[n].vector;
+}
+
+uint16_t virtio_get_config_vector(VirtIODevice *vdev)
+{
+return vdev->config_vector;
+}
+
 static bool virtio_device_endian_needed(void *opaque)
 {
 VirtIODevice *vdev = opaque;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 8bab9cfb75..6746227f73 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -268,6 +268,8 @@ int virtio_set_status(VirtIODevice *vdev, uint8_t val);
 void virtio_reset(void *opaque);
 void virtio_update_irq(VirtIODevice *vdev);
 int virtio_set_features(VirtIODevice *vdev, uint64_t val);
+uint16_t virtio_get_vector(VirtIODevice *vdev, int n);
+uint16_t virtio_get_config_vector(VirtIODevice *vdev);
 
 /* Base devices.  */
 typedef struct VirtIOBlkConf VirtIOBlkConf;
-- 
2.17.1

[PATCH RESEND v2 1/3] msix/hmp: add hmp interface to dump MSI-X info

2021-07-12 Thread Dongli Zhang

This patch is to add the HMP interface to dump MSI-X table and PBA, in
order to help diagnose the loss of IRQ issue in VM (e.g., if an MSI-X
vector is erroneously masked permanently). Here is the example with
vhost-scsi:

(qemu) info msix /machine/peripheral/vscsi0
Msg L.Addr Msg U.Addr Msg Data   Vect Ctrl
0xfee0 0x 0x4041 0x
0xfee0 0x 0x4051 0x
0xfee0 0x 0x4061 0x
0xfee0 0x 0x4071 0x
0xfee01000 0x 0x40b1 0x
0xfee02000 0x 0x40c1 0x
0xfee03000 0x 0x40d1 0x

MSI-X PBA
0 0 0 0 0 0 0

Since the number of MSI-X entries is not determined and might be very
large, it is sometimes inappropriate to dump via QMP.

Therefore, this patch dumps MSI-X information only via HMP, which is
similar to the implementation of hmp_info_mem().

Cc: Jason Wang 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
Acked-by: Dr. David Alan Gilbert 
---
 hmp-commands-info.hx   | 13 +
 hw/pci/msix.c  | 63 ++
 include/hw/pci/msix.h  |  2 ++
 include/monitor/hmp.h  |  1 +
 softmmu/qdev-monitor.c | 25 +
 5 files changed, 104 insertions(+)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index 27206ac049..ce5c550d44 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -221,6 +221,19 @@ SRST
 Show PCI information.
 ERST
 
+{
+.name   = "msix",
+.args_type  = "dev:s",
+.params = "dev",
+.help   = "dump MSI-X information",
+.cmd= hmp_info_msix,
+},
+
+SRST
+  ``info msix`` *dev*
+Dump MSI-X information for device *dev*.
+ERST
+
 #if defined(TARGET_I386) || defined(TARGET_SH4) || defined(TARGET_SPARC) || \
 defined(TARGET_PPC) || defined(TARGET_XTENSA) || defined(TARGET_M68K)
 {
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index ae9331cd0b..4b4ec87eee 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -22,6 +22,7 @@
 #include "sysemu/xen.h"
 #include "migration/qemu-file-types.h"
 #include "migration/vmstate.h"
+#include "monitor/monitor.h"
 #include "qemu/range.h"
 #include "qapi/error.h"
 #include "trace.h"
@@ -669,3 +670,65 @@ const VMStateDescription vmstate_msix = {
 VMSTATE_END_OF_LIST()
 }
 };
+
+static void msix_dump_table(Monitor *mon, PCIDevice *dev)
+{
+int vector;
+uint32_t val;
+uint8_t *table_entry;
+
+monitor_printf(mon, "Msg L.Addr ");
+monitor_printf(mon, "Msg U.Addr ");
+monitor_printf(mon, "Msg Data   ");
+monitor_printf(mon, "Vect Ctrl\n");
+
+for (vector = 0; vector < dev->msix_entries_nr; vector++) {
+table_entry = dev->msix_table + vector * PCI_MSIX_ENTRY_SIZE;
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR);
+monitor_printf(mon, "0x%08x ", val);
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_UPPER_ADDR);
+monitor_printf(mon, "0x%08x ", val);
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_DATA);
+monitor_printf(mon, "0x%08x ", val);
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_VECTOR_CTRL);
+monitor_printf(mon, "0x%08x\n", val);
+}
+
+monitor_printf(mon, "\n");
+}
+
+static void msix_dump_pba(Monitor *mon, PCIDevice *dev)
+{
+int vector;
+
+monitor_printf(mon, "MSI-X PBA\n");
+
+for (vector = 0; vector < dev->msix_entries_nr; vector++) {
+monitor_printf(mon, "%d ", !!msix_is_pending(dev, vector));
+
+if (vector % 16 == 15) {
+monitor_printf(mon, "\n");
+}
+}
+
+if (vector % 16 != 15) {
+monitor_printf(mon, "\n");
+}
+
+monitor_printf(mon, "\n");
+}
+
+void msix_dump_info(Monitor *mon, PCIDevice *dev, Error **errp)
+{
+if (!msix_present(dev)) {
+error_setg(errp, "MSI-X not available");
+return;
+}
+
+msix_dump_table(mon, dev);
+msix_dump_pba(mon, dev);
+}
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 4c4a60c739..10a4500295 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -47,6 +47,8 @@ int msix_set_vector_notifiers(PCIDevice *dev,
   MSIVectorPollNotifier poll_notifier);
 void msix_unset_vector_notifiers(PCIDevice *dev);
 
+void msix_dump_info(Monitor *mon, PCIDevice *dev, Error **errp);
+
 extern const VMStateDescription vmstate_msix;
 
 #define VMSTATE_MSIX_TEST(_field, _state, _test) {   \
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index 3baa1058e2..97c040a3c8 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -36,6 +36,7 @@ void hmp_info_irq(Monitor *mon, const QDict *qd

[PATCH RESEND v2 0/3] To add HMP interface to dump PCI MSI-X table/PBA

2021-07-12 Thread Dongli Zhang

I have just rebased the v2 patchset on top of most recent qemu to resend.

This patch is to introduce the new HMP command to dump the MSI-X table/PBA.

Here is the RFC v1:

https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg04673.html

The idea was inspired by below discussion:

https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg09020.html


(qemu) info msix -d /machine/peripheral/vscsi0
Msg L.Addr Msg U.Addr Msg Data   Vect Ctrl
0xfee0 0x 0x4041 0x
0xfee0 0x 0x4051 0x
0xfee0 0x 0x4061 0x
0xfee0 0x 0x4071 0x
0xfee01000 0x 0x40b1 0x
0xfee02000 0x 0x40c1 0x
0xfee03000 0x 0x40d1 0x

MSI-X PBA
0 0 0 0 0 0 0

virtio pci vector info:
config: 0
queue 0: 1
queue 1: 2
queue 2: 3
queue 3: 4
queue 4: 5
queue 5: 6


Changed since RFC v1:
  - Add heading to MSI-X table (suggested by David Alan Gilbert)
  - Add device specific interface, e.g., to dump virtio-pci queue-to-vector
mapping (Suggested By Jason)


 hmp-commands-info.hx   | 14 +
 hw/pci/msix.c  | 63 +
 hw/virtio/virtio-pci.c | 22 ++
 hw/virtio/virtio.c | 10 +++
 include/hw/pci/msix.h  |  2 ++
 include/hw/pci/pci.h   |  3 ++
 include/hw/virtio/virtio.h |  2 ++
 include/monitor/hmp.h  |  1 +
 softmmu/qdev-monitor.c | 36 +++
 9 files changed, 153 insertions(+)

Thank you very much!

Dongli Zhang

[PATCH RESEND v2 2/3] msix/hmp: add interface to dump device specific info

2021-07-12 Thread Dongli Zhang

While the previous patch is to dump the MSI-X table, sometimes we may
need to dump device specific data, e.g., to help match the vector with
the specific device queue.

This patch is to add the PCI device specific interface to help dump
those information. Any PCI device class may implement this
PCIDeviceClass->msix_info interface.

Cc: Jason Wang 
Cc: Joe Jin 
Suggested-by: Jason Wang 
Signed-off-by: Dongli Zhang 
---
 hmp-commands-info.hx   |  7 ---
 include/hw/pci/pci.h   |  3 +++
 softmmu/qdev-monitor.c | 11 +++
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index ce5c550d44..4e831d7ae4 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -223,9 +223,10 @@ ERST
 
 {
 .name   = "msix",
-.args_type  = "dev:s",
-.params = "dev",
-.help   = "dump MSI-X information",
+.args_type  = "info:-d,dev:s",
+.params = "[-d] dev",
+.help   = "dump MSI-X information; "
+  "(-d: show device specific info)",
 .cmd= hmp_info_msix,
 },
 
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6be4e0c460..4620b9e757 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -129,6 +129,8 @@ typedef void PCIMapIORegionFunc(PCIDevice *pci_dev, int 
region_num,
 pcibus_t addr, pcibus_t size, int type);
 typedef void PCIUnregisterFunc(PCIDevice *pci_dev);
 
+typedef void PCIMSIXInfoFunc(Monitor *mon, PCIDevice *dev, Error **errp);
+
 typedef struct PCIIORegion {
 pcibus_t addr; /* current PCI mapping address. -1 means not mapped */
 #define PCI_BAR_UNMAPPED (~(pcibus_t)0)
@@ -224,6 +226,7 @@ struct PCIDeviceClass {
 PCIUnregisterFunc *exit;
 PCIConfigReadFunc *config_read;
 PCIConfigWriteFunc *config_write;
+PCIMSIXInfoFunc *msix_info;
 
 uint16_t vendor_id;
 uint16_t device_id;
diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index 7837a17d0d..7fd3fe0ada 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -1011,7 +1011,9 @@ void hmp_info_msix(Monitor *mon, const QDict *qdict)
 {
 const char *name = qdict_get_str(qdict, "dev");
 DeviceState *dev = find_device_state(name, NULL);
+bool info = qdict_get_try_bool(qdict, "info", false);
 PCIDevice *pci_dev;
+PCIDeviceClass *pc;
 Error *err = NULL;
 
 if (!dev) {
@@ -1027,6 +1029,15 @@ void hmp_info_msix(Monitor *mon, const QDict *qdict)
 pci_dev = PCI_DEVICE(dev);
 msix_dump_info(mon, pci_dev, );
 
+if (info) {
+pc = PCI_DEVICE_GET_CLASS(pci_dev);
+if (pc->msix_info) {
+pc->msix_info(mon, pci_dev, );
+} else {
+error_setg(, "Device specific info not supported");
+}
+}
+
 exit:
 hmp_handle_error(mon, err);
 }
-- 
2.17.1

[PATCH v2 1/1] hmp: synchronize cpu state for lapic info

2021-07-01 Thread Dongli Zhang

While the default "info lapic" always synchronizes cpu state ...

mon_get_cpu()
-> mon_get_cpu_sync(mon, true)
   -> cpu_synchronize_state(cpu)
  -> ioctl KVM_GET_LAPIC (taking KVM as example)

... the cpu state is not synchronized when the apic-id is available as
argument.

The cpu state should be synchronized when apic-id is available. Otherwise
the "info lapic " always returns stale data.

Signed-off-by: Dongli Zhang 
---
Changed since v1:
  - I sent out wrong patch version in v1

 target/i386/monitor.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/target/i386/monitor.c b/target/i386/monitor.c
index 119211f0b0..d833ab5b8e 100644
--- a/target/i386/monitor.c
+++ b/target/i386/monitor.c
@@ -28,6 +28,7 @@
 #include "monitor/hmp-target.h"
 #include "monitor/hmp.h"
 #include "qapi/qmp/qdict.h"
+#include "sysemu/hw_accel.h"
 #include "sysemu/kvm.h"
 #include "sysemu/sev.h"
 #include "qapi/error.h"
@@ -656,7 +657,11 @@ void hmp_info_local_apic(Monitor *mon, const QDict *qdict)
 
 if (qdict_haskey(qdict, "apic-id")) {
 int id = qdict_get_try_int(qdict, "apic-id", 0);
+
 cs = cpu_by_arch_id(id);
+if (cs) {
+cpu_synchronize_state(cs);
+}
 } else {
 cs = mon_get_cpu(mon);
 }
-- 
2.17.1

Re: [PATCH 1/1] hmp: synchronize cpu state for lapic info

2021-07-01 Thread Dongli Zhang

Please ignore this patch. I sent out the wrong version without memory check.

I will re-send again.

Thank you very much!

Dongli Zhang

On 7/1/21 2:26 PM, Dongli Zhang wrote:
> While the default "info lapic" always synchronizes cpu state ...
> 
> mon_get_cpu()
> -> mon_get_cpu_sync(mon, true)
>-> cpu_synchronize_state(cpu)
>   -> ioctl KVM_GET_LAPIC (taking KVM as example)
> 
> ... the cpu state is not synchronized when the apic-id is available as
> argument.
> 
> The cpu state should be synchronized when apic-id is available. Otherwise
> the "info lapic " always returns stale data.
> 
> Cc: Joe Jin 
> Signed-off-by: Dongli Zhang 
> ---
>  target/i386/monitor.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/target/i386/monitor.c b/target/i386/monitor.c
> index 119211f0b0..09b3b73955 100644
> --- a/target/i386/monitor.c
> +++ b/target/i386/monitor.c
> @@ -28,6 +28,7 @@
>  #include "monitor/hmp-target.h"
>  #include "monitor/hmp.h"
>  #include "qapi/qmp/qdict.h"
> +#include "sysemu/hw_accel.h"
>  #include "sysemu/kvm.h"
>  #include "sysemu/sev.h"
>  #include "qapi/error.h"
> @@ -656,7 +657,9 @@ void hmp_info_local_apic(Monitor *mon, const QDict *qdict)
>  
>  if (qdict_haskey(qdict, "apic-id")) {
>  int id = qdict_get_try_int(qdict, "apic-id", 0);
> +
>  cs = cpu_by_arch_id(id);
> +cpu_synchronize_state(cs);
>  } else {
>  cs = mon_get_cpu(mon);
>  }
>

[PATCH 1/1] hmp: synchronize cpu state for lapic info

2021-07-01 Thread Dongli Zhang

While the default "info lapic" always synchronizes cpu state ...

mon_get_cpu()
-> mon_get_cpu_sync(mon, true)
   -> cpu_synchronize_state(cpu)
  -> ioctl KVM_GET_LAPIC (taking KVM as example)

... the cpu state is not synchronized when the apic-id is available as
argument.

The cpu state should be synchronized when apic-id is available. Otherwise
the "info lapic " always returns stale data.

Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 target/i386/monitor.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/target/i386/monitor.c b/target/i386/monitor.c
index 119211f0b0..09b3b73955 100644
--- a/target/i386/monitor.c
+++ b/target/i386/monitor.c
@@ -28,6 +28,7 @@
 #include "monitor/hmp-target.h"
 #include "monitor/hmp.h"
 #include "qapi/qmp/qdict.h"
+#include "sysemu/hw_accel.h"
 #include "sysemu/kvm.h"
 #include "sysemu/sev.h"
 #include "qapi/error.h"
@@ -656,7 +657,9 @@ void hmp_info_local_apic(Monitor *mon, const QDict *qdict)
 
 if (qdict_haskey(qdict, "apic-id")) {
 int id = qdict_get_try_int(qdict, "apic-id", 0);
+
 cs = cpu_by_arch_id(id);
+cpu_synchronize_state(cs);
 } else {
 cs = mon_get_cpu(mon);
 }
-- 
2.17.1

[PATCH v2 0/3] To add HMP interface to dump PCI MSI-X table/PBA

2021-05-18 Thread Dongli Zhang

This patch is to introduce the new HMP command to dump the MSI-X table/PBA.

Here is the RFC v1:

https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg04673.html

The idea was inspired by below discussion:

https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg09020.html


(qemu) info msix -d /machine/peripheral/vscsi0
Msg L.Addr Msg U.Addr Msg Data   Vect Ctrl
0xfee0 0x 0x4041 0x
0xfee0 0x 0x4051 0x
0xfee0 0x 0x4061 0x
0xfee0 0x 0x4071 0x
0xfee01000 0x 0x40b1 0x
0xfee02000 0x 0x40c1 0x
0xfee03000 0x 0x40d1 0x

MSI-X PBA
0 0 0 0 0 0 0

virtio pci vector info:
config: 0
queue 0: 1
queue 1: 2
queue 2: 3
queue 3: 4
queue 4: 5
queue 5: 6


Changed since RFC v1:
  - Add heading to MSI-X table (suggested by David Alan Gilbert)
  - Add device specific interface, e.g., to dump virtio-pci queue-to-vector
mapping (Suggested By Jason)


 hmp-commands-info.hx   | 14 +
 hw/pci/msix.c  | 63 +
 hw/virtio/virtio-pci.c | 22 ++
 hw/virtio/virtio.c | 10 +++
 include/hw/pci/msix.h  |  2 ++
 include/hw/pci/pci.h   |  3 ++
 include/hw/virtio/virtio.h |  2 ++
 include/monitor/hmp.h  |  1 +
 softmmu/qdev-monitor.c | 36 +++
 9 files changed, 153 insertions(+)


Thank you very much!

Dongli Zhang

[PATCH v2 1/3] msix/hmp: add hmp interface to dump MSI-X info

2021-05-18 Thread Dongli Zhang

This patch is to add the HMP interface to dump MSI-X table and PBA, in
order to help diagnose the loss of IRQ issue in VM (e.g., if an MSI-X
vector is erroneously masked permanently). Here is the example with
vhost-scsi:

(qemu) info msix /machine/peripheral/vscsi0
Msg L.Addr Msg U.Addr Msg Data   Vect Ctrl
0xfee0 0x 0x4041 0x
0xfee0 0x 0x4051 0x
0xfee0 0x 0x4061 0x
0xfee0 0x 0x4071 0x
0xfee01000 0x 0x40b1 0x
0xfee02000 0x 0x40c1 0x
0xfee03000 0x 0x40d1 0x

MSI-X PBA
0 0 0 0 0 0 0

Since the number of MSI-X entries is not determined and might be very
large, it is sometimes inappropriate to dump via QMP.

Therefore, this patch dumps MSI-X information only via HMP, which is
similar to the implementation of hmp_info_mem().

Cc: Jason Wang 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
Acked-by: Dr. David Alan Gilbert 
---
Changed since v1:
  - Add heading to MSI-X table (suggested by David Alan Gilbert)
  - Not sure if it is fine to use Acked-by from David and there is
very few change

 hmp-commands-info.hx   | 13 +
 hw/pci/msix.c  | 63 ++
 include/hw/pci/msix.h  |  2 ++
 include/monitor/hmp.h  |  1 +
 softmmu/qdev-monitor.c | 25 +
 5 files changed, 104 insertions(+)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index ab0c7aa5ee..cbd056442b 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -221,6 +221,19 @@ SRST
 Show PCI information.
 ERST
 
+{
+.name   = "msix",
+.args_type  = "dev:s",
+.params = "dev",
+.help   = "dump MSI-X information",
+.cmd= hmp_info_msix,
+},
+
+SRST
+  ``info msix`` *dev*
+Dump MSI-X information for device *dev*.
+ERST
+
 #if defined(TARGET_I386) || defined(TARGET_SH4) || defined(TARGET_SPARC) || \
 defined(TARGET_PPC) || defined(TARGET_XTENSA) || defined(TARGET_M68K)
 {
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index ae9331cd0b..4b4ec87eee 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -22,6 +22,7 @@
 #include "sysemu/xen.h"
 #include "migration/qemu-file-types.h"
 #include "migration/vmstate.h"
+#include "monitor/monitor.h"
 #include "qemu/range.h"
 #include "qapi/error.h"
 #include "trace.h"
@@ -669,3 +670,65 @@ const VMStateDescription vmstate_msix = {
 VMSTATE_END_OF_LIST()
 }
 };
+
+static void msix_dump_table(Monitor *mon, PCIDevice *dev)
+{
+int vector;
+uint32_t val;
+uint8_t *table_entry;
+
+monitor_printf(mon, "Msg L.Addr ");
+monitor_printf(mon, "Msg U.Addr ");
+monitor_printf(mon, "Msg Data   ");
+monitor_printf(mon, "Vect Ctrl\n");
+
+for (vector = 0; vector < dev->msix_entries_nr; vector++) {
+table_entry = dev->msix_table + vector * PCI_MSIX_ENTRY_SIZE;
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR);
+monitor_printf(mon, "0x%08x ", val);
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_UPPER_ADDR);
+monitor_printf(mon, "0x%08x ", val);
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_DATA);
+monitor_printf(mon, "0x%08x ", val);
+
+val = pci_get_long(table_entry + PCI_MSIX_ENTRY_VECTOR_CTRL);
+monitor_printf(mon, "0x%08x\n", val);
+}
+
+monitor_printf(mon, "\n");
+}
+
+static void msix_dump_pba(Monitor *mon, PCIDevice *dev)
+{
+int vector;
+
+monitor_printf(mon, "MSI-X PBA\n");
+
+for (vector = 0; vector < dev->msix_entries_nr; vector++) {
+monitor_printf(mon, "%d ", !!msix_is_pending(dev, vector));
+
+if (vector % 16 == 15) {
+monitor_printf(mon, "\n");
+}
+}
+
+if (vector % 16 != 15) {
+monitor_printf(mon, "\n");
+}
+
+monitor_printf(mon, "\n");
+}
+
+void msix_dump_info(Monitor *mon, PCIDevice *dev, Error **errp)
+{
+if (!msix_present(dev)) {
+error_setg(errp, "MSI-X not available");
+return;
+}
+
+msix_dump_table(mon, dev);
+msix_dump_pba(mon, dev);
+}
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 4c4a60c739..10a4500295 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -47,6 +47,8 @@ int msix_set_vector_notifiers(PCIDevice *dev,
   MSIVectorPollNotifier poll_notifier);
 void msix_unset_vector_notifiers(PCIDevice *dev);
 
+void msix_dump_info(Monitor *mon, PCIDevice *dev, Error **errp);
+
 extern const VMStateDescription vmstate_msix;
 
 #define VMSTATE_MSIX_TEST(_field, _state, _test) {   \
diff --git a/include/monitor/hmp.h b/include/m

[PATCH v2 3/3] virtio-pci/hmp: implement device specific hmp interface

2021-05-18 Thread Dongli Zhang

This patch is to implement the device specific interface to dump the
mapping between virtio queues and vectors.

(qemu) info msix -d /machine/peripheral/vscsi0
Msg L.Addr Msg U.Addr Msg Data   Vect Ctrl
0xfee0 0x 0x4041 0x
0xfee0 0x 0x4051 0x
0xfee0 0x 0x4061 0x
0xfee0 0x 0x4071 0x
0xfee01000 0x 0x40b1 0x
0xfee02000 0x 0x40c1 0x
0xfee03000 0x 0x40d1 0x

MSI-X PBA
0 0 0 0 0 0 0

virtio pci vector info:
config: 0
queue 0: 1
queue 1: 2
queue 2: 3
queue 3: 4
queue 4: 5
queue 5: 6

Cc: Jason Wang 
Cc: Joe Jin 
Suggested-by: Jason Wang 
Signed-off-by: Dongli Zhang 
---
 hw/virtio/virtio-pci.c | 22 ++
 hw/virtio/virtio.c | 10 ++
 include/hw/virtio/virtio.h |  2 ++
 3 files changed, 34 insertions(+)

diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index b321604d9b..edc143285b 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -37,6 +37,7 @@
 #include "qemu/range.h"
 #include "hw/virtio/virtio-bus.h"
 #include "qapi/visitor.h"
+#include "monitor/monitor.h"
 
 #define VIRTIO_PCI_REGION_SIZE(dev) 
VIRTIO_PCI_CONFIG_OFF(msix_present(dev))
 
@@ -1563,6 +1564,26 @@ static void 
virtio_pci_modern_io_region_unmap(VirtIOPCIProxy *proxy,
 >mr);
 }
 
+static void virtio_pci_dc_msix_info(Monitor *mon, PCIDevice *dev,
+Error **errp)
+{
+DeviceState *qdev = DEVICE(dev);
+VirtIOPCIProxy *proxy = to_virtio_pci_proxy_fast(qdev);
+VirtIODevice *vdev = virtio_bus_get_device(>bus);
+int num = virtio_get_num_queues(vdev);
+int i;
+
+monitor_printf(mon, "virtio pci vector info:\n");
+
+monitor_printf(mon, "config: %d\n", virtio_get_config_vector(vdev));
+
+for (i = 0; i < num; i++)
+monitor_printf(mon, "queue %d: %u\n",
+   i, virtio_get_vector(vdev, i));
+
+monitor_printf(mon, "\n");
+}
+
 static void virtio_pci_pre_plugged(DeviceState *d, Error **errp)
 {
 VirtIOPCIProxy *proxy = VIRTIO_PCI(d);
@@ -1975,6 +1996,7 @@ static void virtio_pci_class_init(ObjectClass *klass, 
void *data)
 k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
 k->revision = VIRTIO_PCI_ABI_VERSION;
 k->class_id = PCI_CLASS_OTHERS;
+k->msix_info = virtio_pci_dc_msix_info;
 device_class_set_parent_realize(dc, virtio_pci_dc_realize,
 >parent_dc_realize);
 dc->reset = virtio_pci_reset;
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index e02544b2df..fd07d9deac 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -2581,6 +2581,16 @@ void virtio_notify_config(VirtIODevice *vdev)
 virtio_notify_vector(vdev, vdev->config_vector);
 }
 
+uint16_t virtio_get_vector(VirtIODevice *vdev, int n)
+{
+return vdev->vq[n].vector;
+}
+
+uint16_t virtio_get_config_vector(VirtIODevice *vdev)
+{
+return vdev->config_vector;
+}
+
 static bool virtio_device_endian_needed(void *opaque)
 {
 VirtIODevice *vdev = opaque;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 8bab9cfb75..6746227f73 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -268,6 +268,8 @@ int virtio_set_status(VirtIODevice *vdev, uint8_t val);
 void virtio_reset(void *opaque);
 void virtio_update_irq(VirtIODevice *vdev);
 int virtio_set_features(VirtIODevice *vdev, uint64_t val);
+uint16_t virtio_get_vector(VirtIODevice *vdev, int n);
+uint16_t virtio_get_config_vector(VirtIODevice *vdev);
 
 /* Base devices.  */
 typedef struct VirtIOBlkConf VirtIOBlkConf;
-- 
2.17.1

[PATCH v2 2/3] msix/hmp: add interface to dump device specific info

2021-05-18 Thread Dongli Zhang

While the previous patch is to dump the MSI-X table, sometimes we may
need to dump device specific data, e.g., to help match the vector with
the specific device queue.

This patch is to add the PCI device specific interface to help dump
those information. Any PCI device class may implement this
PCIDeviceClass->msix_info interface.

Cc: Jason Wang 
Cc: Joe Jin 
Suggested-by: Jason Wang 
Signed-off-by: Dongli Zhang 
---
 hmp-commands-info.hx   |  7 ---
 include/hw/pci/pci.h   |  3 +++
 softmmu/qdev-monitor.c | 11 +++
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index cbd056442b..df9cdc6c7e 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -223,9 +223,10 @@ ERST
 
 {
 .name   = "msix",
-.args_type  = "dev:s",
-.params = "dev",
-.help   = "dump MSI-X information",
+.args_type  = "info:-d,dev:s",
+.params = "[-d] dev",
+.help   = "dump MSI-X information; "
+  "(-d: show device specific info)",
 .cmd= hmp_info_msix,
 },
 
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6be4e0c460..4620b9e757 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -129,6 +129,8 @@ typedef void PCIMapIORegionFunc(PCIDevice *pci_dev, int 
region_num,
 pcibus_t addr, pcibus_t size, int type);
 typedef void PCIUnregisterFunc(PCIDevice *pci_dev);
 
+typedef void PCIMSIXInfoFunc(Monitor *mon, PCIDevice *dev, Error **errp);
+
 typedef struct PCIIORegion {
 pcibus_t addr; /* current PCI mapping address. -1 means not mapped */
 #define PCI_BAR_UNMAPPED (~(pcibus_t)0)
@@ -224,6 +226,7 @@ struct PCIDeviceClass {
 PCIUnregisterFunc *exit;
 PCIConfigReadFunc *config_read;
 PCIConfigWriteFunc *config_write;
+PCIMSIXInfoFunc *msix_info;
 
 uint16_t vendor_id;
 uint16_t device_id;
diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index 7837a17d0d..7fd3fe0ada 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -1011,7 +1011,9 @@ void hmp_info_msix(Monitor *mon, const QDict *qdict)
 {
 const char *name = qdict_get_str(qdict, "dev");
 DeviceState *dev = find_device_state(name, NULL);
+bool info = qdict_get_try_bool(qdict, "info", false);
 PCIDevice *pci_dev;
+PCIDeviceClass *pc;
 Error *err = NULL;
 
 if (!dev) {
@@ -1027,6 +1029,15 @@ void hmp_info_msix(Monitor *mon, const QDict *qdict)
 pci_dev = PCI_DEVICE(dev);
 msix_dump_info(mon, pci_dev, );
 
+if (info) {
+pc = PCI_DEVICE_GET_CLASS(pci_dev);
+if (pc->msix_info) {
+pc->msix_info(mon, pci_dev, );
+} else {
+error_setg(, "Device specific info not supported");
+}
+}
+
 exit:
 hmp_handle_error(mon, err);
 }
-- 
2.17.1

[PATCH 1/1] multi-process: fix usage information

2021-05-08 Thread Dongli Zhang

>From source code, the 'devid' of x-remote-object should be one of devices
in remote QEMU process.

Signed-off-by: Dongli Zhang 
---
I have verified by reading the code and playing with below orchestrator.
https://github.com/finallyjustice/sample/blob/master/kvm/multiprocess/orchestrator.py

 docs/system/multi-process.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/system/multi-process.rst b/docs/system/multi-process.rst
index 46bb0cafc2..210531ee17 100644
--- a/docs/system/multi-process.rst
+++ b/docs/system/multi-process.rst
@@ -45,7 +45,7 @@ Following is a description of command-line used to launch 
mpqemu.
   -device lsi53c895a,id=lsi0 \
   -drive id=drive_image2,file=/build/ol7-nvme-test-1.qcow2   \
   -device scsi-hd,id=drive2,drive=drive_image2,bus=lsi0.0,scsi-id=0  \
-  -object x-remote-object,id=robj1,devid=lsi1,fd=4,
+  -object x-remote-object,id=robj1,devid=lsi0,fd=4,
 
 * QEMU:
 
-- 
2.17.1

Re: [PATCH RFC 0/1] To add HMP interface to dump PCI MSI-X table/PBA

2021-04-28 Thread Dongli Zhang




On 4/28/21 1:45 AM, Dr. David Alan Gilbert wrote:
> * Dongli Zhang (dongli.zh...@oracle.com) wrote:
>> Hi Jason,
>>
>> On 4/27/21 7:31 PM, Jason Wang wrote:
>>>
>>> 在 2021/4/27 下午4:53, Dr. David Alan Gilbert 写道:
>>>> * Dongli Zhang (dongli.zh...@oracle.com) wrote:
>>>>>
>>>>> On 4/22/21 11:01 PM, Jason Wang wrote:
>>>>>> 在 2021/4/23 下午12:47, Dongli Zhang 写道:
>>>>>>> This is inspired by the discussion with Jason on below patchset.
>>>>>>>
>>>>>>> https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg09020.html__;!!GqivPVa7Brio!KbGQZW5lq3JZ60k12NuWZ6Th1lT6AwmBTF0pBgoWUKKQ4-2UhdW57PtvXUN5XQnZ2NU$
>>>>>>>
>>>>>>>
>>>>>>> The new HMP command is introduced to dump the MSI-X table and PBA.
>>>>>>>
>>>>>>> Initially, I was going to add new option to "info pci". However, as the
>>>>>>> number of entries is not determined and the output of MSI-X table is 
>>>>>>> much
>>>>>>> more similar to the output of hmp_info_tlb()/hmp_info_mem(), this patch
>>>>>>> adds interface for only HMP.
>>>>>>>
>>>>>>> The patch is tagged with RFC because I am looking for suggestions on:
>>>>>>>
>>>>>>> 1. Is it fine to add new "info msix " command?
>>>>>>
>>>>>> I wonder the reason for not simply reusing "info pci"?
>>>>> The "info pci" will show PCI data for all devices and it does not accept 
>>>>> any
>>>>> argument to print for a specific device.
>>>>>
>>>>> In addition, the "info pci" relies on qmp_query_pci(), where this patch 
>>>>> will not
>>>>> implement the interface for QMP considering the number of MSI-X entries 
>>>>> is not
>>>>> determined.
>>>>>
>>>>> Suppose we have 10 NVMe (emulated by QEMU with default number of queues), 
>>>>> we
>>>>> will have about 600+ lines of output.
>>>>  From an HMP perspective I'm happy, so:
>>>>
>>>> Acked-by: Dr. David Alan Gilbert 
>>>>
>>>> but since I don't know much about MSI I'd like to see Jason's reply.
>>>
>>>
>>> I think we'd better have more information, e.g the device can optionally 
>>> report
>>> how the MSI-X vector is used.
>>>
>>> Virtio-pci could be the first user for this.
>>
>> As discussed in another thread, you were talking about to print MSIMessage.
>>
>> However, I prefer to print the raw data as I think the user of this interface
>> should be able to understand it as MSI-X messages.
>>
>> For instance, below is the data printed by "info msix".
>>
>> 0xfee01004 0x 0x0022 0x
>> 0xfee02004 0x 0x0023 0x
>> 0xfee01004 0x0000 0x0023 0x
>> 0xfee01004 0x 0x0021 0x
>> 0xfee02004 0x 0x0022 0x
>> 0x 0x 0x 0x0001
>> 0x 0x 0x 0x0001
>>
>> The 1st column is Message Lower Address.
>>
>> The 2nd column is Message Upper Address.
>>
>> The 3rd column is Message Data.
>>
>> The 4th column is Vector Control.
> 
> So why not add a heading:
> 
>   Msg L.Addr Msg U.Addr Msg Data   Vect Ctrl
>   0xfee01004 0x 0x0022 0x
>   0xfee02004 0x 0x0023 0x
>   0xfee01004 0x 0x0023 0x
>   0xfee01004 0x 0x0021 0x
>   0xfee02004 0x 0x0022 0x
>   0x 0x 0x 0x0001
>   0x 0x 0x 0x0001
> 
> (if I'm understanding what you said).

Thank you very much! I will add that in the patch.

Dongli Zhang

> 
> Dave
> 
>> In my opinion, this is equivalent to MSIMessage.
>>
>> 26 struct MSIMessage {
>> 27 uint64_t address; --> column 1 and 2
>> 28 uint32_t data;--> column 3
>> 29 };
>>
>>
>> We use the similar way to read from Linux OS, e,g., given the address of 
>> MSI-X
>> cap, here is how we read from OS side.
>>
>> # busybox devmem 0xc1001000 32
>> 0xFEE0
>> # busybox devmem 0xc1001004 32
>> 0x
>> # busybox devmem 0xc1001008 32
>> 0x4049
>> # busybox devmem 0xc100100c 32
>> 0x
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>>>
>>>
>>>>
>>>> Adding an optional option to 'info pci' to limit to one device would be 
>>>> easy
>>>> though; that bit is probably easier than adding a new command.
>>>
>>>
>>> One interesting point is that MSI could be extended for other bus, (e.g 
>>> MMIO).
>>> So "info msi" should be better I guess.
>>>
>>>
>>>> Figuring out the QMP representation of your entries might be harder -
>>>> and if this is strictly for debug, probably not worth it?
>>>
>>>
>>> I think so.
>>>
>>> Thanks
>>>
>>>
>>>>
>>>> Dave
>>>>
>>>>
>>>>> Dongli Zhang
>>>>>
>>>>>>
>>>>>>> 2. Is there any issue with output format?
>>>>>>
>>>>>> If it's not for QMP, I guess it's not a part of ABI so it should be fine.
>>>>>>
>>>>>>
>>>>>>> 3. Is it fine to add only for HMP, but not QMP?
>>>>>>
>>>>>> I think so.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>> Thank you very much!
>>>>>>>
>>>>>>> Dongli Zhang
>>>>>>>
>>>>>>>
>>>>>>>
>>>
>>

Re: [PATCH RFC 0/1] To add HMP interface to dump PCI MSI-X table/PBA

2021-04-28 Thread Dongli Zhang




On 4/27/21 10:55 PM, Jason Wang wrote:
> 
> 在 2021/4/28 下午1:10, Dongli Zhang 写道:
>> Hi Jason,
>>
>> On 4/27/21 7:31 PM, Jason Wang wrote:
>>> 在 2021/4/27 下午4:53, Dr. David Alan Gilbert 写道:
>>>> * Dongli Zhang (dongli.zh...@oracle.com) wrote:
>>>>> On 4/22/21 11:01 PM, Jason Wang wrote:
>>>>>> 在 2021/4/23 下午12:47, Dongli Zhang 写道:
>>>>>>> This is inspired by the discussion with Jason on below patchset.
>>>>>>>
>>>>>>> https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg09020.html__;!!GqivPVa7Brio!KbGQZW5lq3JZ60k12NuWZ6Th1lT6AwmBTF0pBgoWUKKQ4-2UhdW57PtvXUN5XQnZ2NU$
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The new HMP command is introduced to dump the MSI-X table and PBA.
>>>>>>>
>>>>>>> Initially, I was going to add new option to "info pci". However, as the
>>>>>>> number of entries is not determined and the output of MSI-X table is 
>>>>>>> much
>>>>>>> more similar to the output of hmp_info_tlb()/hmp_info_mem(), this patch
>>>>>>> adds interface for only HMP.
>>>>>>>
>>>>>>> The patch is tagged with RFC because I am looking for suggestions on:
>>>>>>>
>>>>>>> 1. Is it fine to add new "info msix " command?
>>>>>> I wonder the reason for not simply reusing "info pci"?
>>>>> The "info pci" will show PCI data for all devices and it does not accept 
>>>>> any
>>>>> argument to print for a specific device.
>>>>>
>>>>> In addition, the "info pci" relies on qmp_query_pci(), where this patch
>>>>> will not
>>>>> implement the interface for QMP considering the number of MSI-X entries 
>>>>> is not
>>>>> determined.
>>>>>
>>>>> Suppose we have 10 NVMe (emulated by QEMU with default number of queues), 
>>>>> we
>>>>> will have about 600+ lines of output.
>>>>   From an HMP perspective I'm happy, so:
>>>>
>>>> Acked-by: Dr. David Alan Gilbert 
>>>>
>>>> but since I don't know much about MSI I'd like to see Jason's reply.
>>>
>>> I think we'd better have more information, e.g the device can optionally 
>>> report
>>> how the MSI-X vector is used.
>>>
>>> Virtio-pci could be the first user for this.
>> As discussed in another thread, you were talking about to print MSIMessage.
>>
>> However, I prefer to print the raw data as I think the user of this interface
>> should be able to understand it as MSI-X messages.
>>
>> For instance, below is the data printed by "info msix".
> 
> 
> Just to clarify, I meant e.g for virtio-pci device, we can let it to print the
> mapping between vq and msix vectors:
> 
> vq[0].msix_vector = 0
> vq[1].msix_vector = 1
> config.msix_vector = 2
> ...
> 
> But this could be added on top if you wish.
> 
> Does this make sense?

Yes, this makes since. For QEMU they are:

- vdev->vq[n].vector
- vdev->config_vector


I will introduce a callback and implement for virtio-pci to dump the vector 
mapping.

By default, "info msix " prints only msix table/PBA.

"info msix -d " will print device specific data.

Thank you very much!

Dongli Zhang

> 
> Thanks
> 
> 
>>
>> 0xfee01004 0x 0x0022 0x
>> 0xfee02004 0x 0x0023 0x0000
>> 0xfee01004 0x 0x0023 0x
>> 0xfee01004 0x 0x0021 0x
>> 0xfee02004 0x 0x0022 0x
>> 0x 0x 0x 0x0001
>> 0x 0x 0x 0x0001
>>
>> The 1st column is Message Lower Address.
>>
>> The 2nd column is Message Upper Address.
>>
>> The 3rd column is Message Data.
>>
>> The 4th column is Vector Control.
>>
>> In my opinion, this is equivalent to MSIMessage.
>>
>> 26 struct MSIMessage {
>> 27 uint64_t address; --> column 1 and 2
>> 28 uint32_t data;    --> column 3
>> 29 };
>>
>>
>> We use the similar way to read from Linux OS, e,g., given the address of 
>> MSI-X
>> cap, here is how we read from OS side.
>>
>> # busybox devmem 0xc1001000 32
>> 0xFEE0
>> # busybox devmem 0xc1001004 32
>> 0x
>> # busybox devmem 0xc1001008 32
>> 0x4049
>> # busybox devmem 0xc100100c 32
>> 0x
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>>>
>>>> Adding an optional option to 'info pci' to limit to one device would be 
>>>> easy
>>>> though; that bit is probably easier than adding a new command.
>>>
>>> One interesting point is that MSI could be extended for other bus, (e.g 
>>> MMIO).
>>> So "info msi" should be better I guess.
>>>
>>>
>>>> Figuring out the QMP representation of your entries might be harder -
>>>> and if this is strictly for debug, probably not worth it?
>>>
>>> I think so.
>>>
>>> Thanks
>>>
>>>
>>>> Dave
>>>>
>>>>
>>>>> Dongli Zhang
>>>>>
>>>>>>> 2. Is there any issue with output format?
>>>>>> If it's not for QMP, I guess it's not a part of ABI so it should be fine.
>>>>>>
>>>>>>
>>>>>>> 3. Is it fine to add only for HMP, but not QMP?
>>>>>> I think so.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>> Thank you very much!
>>>>>>>
>>>>>>> Dongli Zhang
>>>>>>>
>>>>>>>
>>>>>>>
>

Re: [PATCH RFC 0/1] To add HMP interface to dump PCI MSI-X table/PBA

2021-04-27 Thread Dongli Zhang

Hi Jason,

On 4/27/21 7:31 PM, Jason Wang wrote:
> 
> 在 2021/4/27 下午4:53, Dr. David Alan Gilbert 写道:
>> * Dongli Zhang (dongli.zh...@oracle.com) wrote:
>>>
>>> On 4/22/21 11:01 PM, Jason Wang wrote:
>>>> 在 2021/4/23 下午12:47, Dongli Zhang 写道:
>>>>> This is inspired by the discussion with Jason on below patchset.
>>>>>
>>>>> https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg09020.html__;!!GqivPVa7Brio!KbGQZW5lq3JZ60k12NuWZ6Th1lT6AwmBTF0pBgoWUKKQ4-2UhdW57PtvXUN5XQnZ2NU$
>>>>>
>>>>>
>>>>> The new HMP command is introduced to dump the MSI-X table and PBA.
>>>>>
>>>>> Initially, I was going to add new option to "info pci". However, as the
>>>>> number of entries is not determined and the output of MSI-X table is much
>>>>> more similar to the output of hmp_info_tlb()/hmp_info_mem(), this patch
>>>>> adds interface for only HMP.
>>>>>
>>>>> The patch is tagged with RFC because I am looking for suggestions on:
>>>>>
>>>>> 1. Is it fine to add new "info msix " command?
>>>>
>>>> I wonder the reason for not simply reusing "info pci"?
>>> The "info pci" will show PCI data for all devices and it does not accept any
>>> argument to print for a specific device.
>>>
>>> In addition, the "info pci" relies on qmp_query_pci(), where this patch 
>>> will not
>>> implement the interface for QMP considering the number of MSI-X entries is 
>>> not
>>> determined.
>>>
>>> Suppose we have 10 NVMe (emulated by QEMU with default number of queues), we
>>> will have about 600+ lines of output.
>>  From an HMP perspective I'm happy, so:
>>
>> Acked-by: Dr. David Alan Gilbert 
>>
>> but since I don't know much about MSI I'd like to see Jason's reply.
> 
> 
> I think we'd better have more information, e.g the device can optionally 
> report
> how the MSI-X vector is used.
> 
> Virtio-pci could be the first user for this.

As discussed in another thread, you were talking about to print MSIMessage.

However, I prefer to print the raw data as I think the user of this interface
should be able to understand it as MSI-X messages.

For instance, below is the data printed by "info msix".

0xfee01004 0x 0x0022 0x
0xfee02004 0x 0x0023 0x
0xfee01004 0x 0x0023 0x
0xfee01004 0x 0x0021 0x
0xfee02004 0x 0x0022 0x
0x 0x 0x 0x0001
0x 0x 0x 0x0001

The 1st column is Message Lower Address.

The 2nd column is Message Upper Address.

The 3rd column is Message Data.

The 4th column is Vector Control.

In my opinion, this is equivalent to MSIMessage.

26 struct MSIMessage {
27 uint64_t address; --> column 1 and 2
28 uint32_t data;--> column 3
29 };


We use the similar way to read from Linux OS, e,g., given the address of MSI-X
cap, here is how we read from OS side.

# busybox devmem 0xc1001000 32
0xFEE0
# busybox devmem 0xc1001004 32
0x
# busybox devmem 0xc1001008 32
0x4049
# busybox devmem 0xc100100c 32
0x

Thank you very much!

Dongli Zhang

> 
> 
>>
>> Adding an optional option to 'info pci' to limit to one device would be easy
>> though; that bit is probably easier than adding a new command.
> 
> 
> One interesting point is that MSI could be extended for other bus, (e.g MMIO).
> So "info msi" should be better I guess.
> 
> 
>> Figuring out the QMP representation of your entries might be harder -
>> and if this is strictly for debug, probably not worth it?
> 
> 
> I think so.
> 
> Thanks
> 
> 
>>
>> Dave
>>
>>
>>> Dongli Zhang
>>>
>>>>
>>>>> 2. Is there any issue with output format?
>>>>
>>>> If it's not for QMP, I guess it's not a part of ABI so it should be fine.
>>>>
>>>>
>>>>> 3. Is it fine to add only for HMP, but not QMP?
>>>>
>>>> I think so.
>>>>
>>>> Thanks
>>>>
>>>>
>>>>> Thank you very much!
>>>>>
>>>>> Dongli Zhang
>>>>>
>>>>>
>>>>>
>

Re: [PATCH RFC 1/1] msix: add hmp interface to dump MSI-X info

2021-04-25 Thread Dongli Zhang




On 4/24/21 8:36 PM, Jason Wang wrote:
> 
> 在 2021/4/24 上午1:32, Dongli Zhang 写道:
>>
>> On 4/23/21 12:59 AM, Jason Wang wrote:
>>> 在 2021/4/23 下午12:47, Dongli Zhang 写道:
>>>> This patch is to add the HMP interface to dump MSI-X table and PBA, in
>>>> order to help diagnose the loss of IRQ issue in VM (e.g., if an MSI-X
>>>> vector is erroneously masked permanently). Here is the example with
>>>> vhost-scsi:
>>>>
>>>> (qemu) info msix /machine/peripheral/vscsi0
>>>> MSI-X Table
>>>> 0xfee01004 0x 0x0022 0x
>>>> 0xfee02004 0x 0x0023 0x
>>>> 0xfee01004 0x 0x0023 0x
>>>> 0xfee01004 0x 0x0021 0x
>>>> 0xfee02004 0x 0x0022 0x
>>>> 0x 0x 0x 0x0001
>>>> 0x 0x 0x 0x0001
>>>> MSI-X PBA
>>>> 0 0 0 0 0 0 0
>>>>
>>>> Since the number of MSI-X entries is not determined and might be very
>>>> large, it is sometimes inappropriate to dump via QMP.
>>>>
>>>> Therefore, this patch dumps MSI-X information only via HMP, which is
>>>> similar to the implementation of hmp_info_mem().
>>>
>>> Besides PBA, I think it should be also useful to introduce device specifc
>>> callbacks for dump the MSI messages used by the device.
>>>
>>> Thanks
>> Would you please help confirm if you meant MSI messages or MSI-X messages?
> 
> 
> E.g for virtio-pci, you need a way to know how the MSI-X vector is used by 
> each
> virtqueue?
> 
> 
>>
>> About about MSI-X, I assume we are able to derive the message from the MSI-X
>> table, as in msix_get_message().  Therefore, the user of "info msix "
>> should be able to parse the output and convert it to messages.
>>
>> 34 MSIMessage msix_get_message(PCIDevice *dev, unsigned vector)
>> 35 {
>> 36 uint8_t *table_entry = dev->msix_table + vector * PCI_MSIX_ENTRY_SIZE;
>> 37 MSIMessage msg;
>> 38
>> 39 msg.address = pci_get_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR);
>> 40 msg.data = pci_get_long(table_entry + PCI_MSIX_ENTRY_DATA);
>> 41 return msg;
>> 42 }
> 
> 
> Something like this.

I prefer to print the raw data as I think the user of this interface is able to
understand it as MSI-X messages.

For instance, below is the data printed by "info msix".

0xfee01004 0x 0x0022 0x
0xfee02004 0x 0x0023 0x
0xfee01004 0x 0x0023 0x
0xfee01004 0x 0x0021 0x
0xfee02004 0x 0x0022 0x
0x 0x0000 0x0000 0x0001
0x 0x 0x 0x0001

The 1st column is Message Lower Address.

The 2nd column is Message Upper Address.

The 3rd column is Message Data.

The 4th column is Vector Control.


In order to verify whether a vector is masked, usually I would just check the
bit 0 of Vector Control.

Therefore, I think we can assume the user understands the MSI-X table format
well and we just prints the MSI-X table memory data.

Thank you very much!

Dongli Zhang

> 
> Thanks
> 
> 
>>
>> Perhaps I should remove the inner for-loop for MSI-X table in the patch, and
>> call pci_get_long() for 4 times, for PCI_MSIX_ENTRY_LOWER_ADDR,
>> PCI_MSIX_ENTRY_UPPER_ADDR, PCI_MSIX_ENTRY_DATA and 
>> PCI_MSIX_ENTRY_VECTOR_CTRL,
>> respectively.
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>>>
>>>> Cc: Jason Wang 
>>>> Cc: Joe Jin 
>>>> Signed-off-by: Dongli Zhang 
>>>> ---
>>>>    hmp-commands-info.hx   | 13 +++
>>>>    hw/pci/msix.c  | 49 ++
>>>>    include/hw/pci/msix.h  |  2 ++
>>>>    include/monitor/hmp.h  |  1 +
>>>>    softmmu/qdev-monitor.c | 25 +
>>>>    5 files changed, 90 insertions(+)
>>>>
>>>> diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
>>>> index ab0c7aa5ee..cbd056442b 100644
>>>> --- a/hmp-commands-info.hx
>>>> +++ b/hmp-commands-info.hx
>>>> @@ -221,6 +221,19 @@ SRST
>>>>    Show PCI information.
>>>>    ERST
>>>>    +    {
>>>> +    .name   = "msix",
>>>> +    .args_type  = "dev:s",
>>>> +    .params = "dev",
>>>> +    .

Re: [PATCH RFC 1/1] msix: add hmp interface to dump MSI-X info

2021-04-23 Thread Dongli Zhang




On 4/23/21 12:59 AM, Jason Wang wrote:
> 
> 在 2021/4/23 下午12:47, Dongli Zhang 写道:
>> This patch is to add the HMP interface to dump MSI-X table and PBA, in
>> order to help diagnose the loss of IRQ issue in VM (e.g., if an MSI-X
>> vector is erroneously masked permanently). Here is the example with
>> vhost-scsi:
>>
>> (qemu) info msix /machine/peripheral/vscsi0
>> MSI-X Table
>> 0xfee01004 0x 0x0022 0x
>> 0xfee02004 0x 0x0023 0x
>> 0xfee01004 0x 0x0023 0x
>> 0xfee01004 0x 0x0021 0x
>> 0xfee02004 0x 0x0022 0x
>> 0x 0x 0x 0x0001
>> 0x 0x 0x 0x0001
>> MSI-X PBA
>> 0 0 0 0 0 0 0
>>
>> Since the number of MSI-X entries is not determined and might be very
>> large, it is sometimes inappropriate to dump via QMP.
>>
>> Therefore, this patch dumps MSI-X information only via HMP, which is
>> similar to the implementation of hmp_info_mem().
> 
> 
> Besides PBA, I think it should be also useful to introduce device specifc
> callbacks for dump the MSI messages used by the device.
> 
> Thanks

Would you please help confirm if you meant MSI messages or MSI-X messages?

About about MSI-X, I assume we are able to derive the message from the MSI-X
table, as in msix_get_message().  Therefore, the user of "info msix "
should be able to parse the output and convert it to messages.

34 MSIMessage msix_get_message(PCIDevice *dev, unsigned vector)
35 {
36 uint8_t *table_entry = dev->msix_table + vector * PCI_MSIX_ENTRY_SIZE;
37 MSIMessage msg;
38
39 msg.address = pci_get_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR);
40 msg.data = pci_get_long(table_entry + PCI_MSIX_ENTRY_DATA);
41 return msg;
42 }


Perhaps I should remove the inner for-loop for MSI-X table in the patch, and
call pci_get_long() for 4 times, for PCI_MSIX_ENTRY_LOWER_ADDR,
PCI_MSIX_ENTRY_UPPER_ADDR, PCI_MSIX_ENTRY_DATA and PCI_MSIX_ENTRY_VECTOR_CTRL,
respectively.

Thank you very much!

Dongli Zhang

> 
> 
>>
>> Cc: Jason Wang 
>> Cc: Joe Jin 
>> Signed-off-by: Dongli Zhang 
>> ---
>>   hmp-commands-info.hx   | 13 +++
>>   hw/pci/msix.c  | 49 ++
>>   include/hw/pci/msix.h  |  2 ++
>>   include/monitor/hmp.h  |  1 +
>>   softmmu/qdev-monitor.c | 25 +
>>   5 files changed, 90 insertions(+)
>>
>> diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
>> index ab0c7aa5ee..cbd056442b 100644
>> --- a/hmp-commands-info.hx
>> +++ b/hmp-commands-info.hx
>> @@ -221,6 +221,19 @@ SRST
>>   Show PCI information.
>>   ERST
>>   +    {
>> +    .name   = "msix",
>> +    .args_type  = "dev:s",
>> +    .params = "dev",
>> +    .help   = "dump MSI-X information",
>> +    .cmd    = hmp_info_msix,
>> +    },
>> +
>> +SRST
>> +  ``info msix`` *dev*
>> +    Dump MSI-X information for device *dev*.
>> +ERST
>> +
>>   #if defined(TARGET_I386) || defined(TARGET_SH4) || defined(TARGET_SPARC) 
>> || \
>>   defined(TARGET_PPC) || defined(TARGET_XTENSA) || defined(TARGET_M68K)
>>   {
>> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
>> index ae9331cd0b..a93d31da9f 100644
>> --- a/hw/pci/msix.c
>> +++ b/hw/pci/msix.c
>> @@ -22,6 +22,7 @@
>>   #include "sysemu/xen.h"
>>   #include "migration/qemu-file-types.h"
>>   #include "migration/vmstate.h"
>> +#include "monitor/monitor.h"
>>   #include "qemu/range.h"
>>   #include "qapi/error.h"
>>   #include "trace.h"
>> @@ -669,3 +670,51 @@ const VMStateDescription vmstate_msix = {
>>   VMSTATE_END_OF_LIST()
>>   }
>>   };
>> +
>> +static void msix_dump_table(Monitor *mon, PCIDevice *dev)
>> +{
>> +    int vector, i, offset;
>> +    uint32_t val;
>> +
>> +    monitor_printf(mon, "MSI-X Table\n");
>> +
>> +    for (vector = 0; vector < dev->msix_entries_nr; vector++) {
>> +    for (i = 0; i < 4; i++) {
>> +    offset = vector * PCI_MSIX_ENTRY_SIZE + i * 4;
>> +    val = pci_get_long(dev->msix_table + offset);
>> +
>> +    monitor_printf(mon, "0x%08x ", val);
>> +    }
>> +    monitor_printf(mon, "\n");
>> +    }
>> +}
>> +
>> +static void msix_dump_pba(Monitor

Re: [PATCH RFC 0/1] To add HMP interface to dump PCI MSI-X table/PBA

2021-04-23 Thread Dongli Zhang




On 4/22/21 11:01 PM, Jason Wang wrote:
> 
> 在 2021/4/23 下午12:47, Dongli Zhang 写道:
>> This is inspired by the discussion with Jason on below patchset.
>>
>> https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg09020.html__;!!GqivPVa7Brio!KbGQZW5lq3JZ60k12NuWZ6Th1lT6AwmBTF0pBgoWUKKQ4-2UhdW57PtvXUN5XQnZ2NU$
>>
>> The new HMP command is introduced to dump the MSI-X table and PBA.
>>
>> Initially, I was going to add new option to "info pci". However, as the
>> number of entries is not determined and the output of MSI-X table is much
>> more similar to the output of hmp_info_tlb()/hmp_info_mem(), this patch
>> adds interface for only HMP.
>>
>> The patch is tagged with RFC because I am looking for suggestions on:
>>
>> 1. Is it fine to add new "info msix " command?
> 
> 
> I wonder the reason for not simply reusing "info pci"?

The "info pci" will show PCI data for all devices and it does not accept any
argument to print for a specific device.

In addition, the "info pci" relies on qmp_query_pci(), where this patch will not
implement the interface for QMP considering the number of MSI-X entries is not
determined.

Suppose we have 10 NVMe (emulated by QEMU with default number of queues), we
will have about 600+ lines of output.

Dongli Zhang

> 
> 
>>
>> 2. Is there any issue with output format?
> 
> 
> If it's not for QMP, I guess it's not a part of ABI so it should be fine.
> 
> 
>>
>> 3. Is it fine to add only for HMP, but not QMP?
> 
> 
> I think so.
> 
> Thanks
> 
> 
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>>
>>
>

[PATCH RFC 1/1] msix: add hmp interface to dump MSI-X info

2021-04-22 Thread Dongli Zhang

This patch is to add the HMP interface to dump MSI-X table and PBA, in
order to help diagnose the loss of IRQ issue in VM (e.g., if an MSI-X
vector is erroneously masked permanently). Here is the example with
vhost-scsi:

(qemu) info msix /machine/peripheral/vscsi0
MSI-X Table
0xfee01004 0x 0x0022 0x
0xfee02004 0x 0x0023 0x
0xfee01004 0x 0x0023 0x
0xfee01004 0x 0x0021 0x
0xfee02004 0x 0x0022 0x
0x 0x 0x 0x0001
0x 0x 0x 0x0001
MSI-X PBA
0 0 0 0 0 0 0

Since the number of MSI-X entries is not determined and might be very
large, it is sometimes inappropriate to dump via QMP.

Therefore, this patch dumps MSI-X information only via HMP, which is
similar to the implementation of hmp_info_mem().

Cc: Jason Wang 
Cc: Joe Jin 
Signed-off-by: Dongli Zhang 
---
 hmp-commands-info.hx   | 13 +++
 hw/pci/msix.c  | 49 ++
 include/hw/pci/msix.h  |  2 ++
 include/monitor/hmp.h  |  1 +
 softmmu/qdev-monitor.c | 25 +
 5 files changed, 90 insertions(+)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index ab0c7aa5ee..cbd056442b 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -221,6 +221,19 @@ SRST
 Show PCI information.
 ERST
 
+{
+.name   = "msix",
+.args_type  = "dev:s",
+.params = "dev",
+.help   = "dump MSI-X information",
+.cmd= hmp_info_msix,
+},
+
+SRST
+  ``info msix`` *dev*
+Dump MSI-X information for device *dev*.
+ERST
+
 #if defined(TARGET_I386) || defined(TARGET_SH4) || defined(TARGET_SPARC) || \
 defined(TARGET_PPC) || defined(TARGET_XTENSA) || defined(TARGET_M68K)
 {
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index ae9331cd0b..a93d31da9f 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -22,6 +22,7 @@
 #include "sysemu/xen.h"
 #include "migration/qemu-file-types.h"
 #include "migration/vmstate.h"
+#include "monitor/monitor.h"
 #include "qemu/range.h"
 #include "qapi/error.h"
 #include "trace.h"
@@ -669,3 +670,51 @@ const VMStateDescription vmstate_msix = {
 VMSTATE_END_OF_LIST()
 }
 };
+
+static void msix_dump_table(Monitor *mon, PCIDevice *dev)
+{
+int vector, i, offset;
+uint32_t val;
+
+monitor_printf(mon, "MSI-X Table\n");
+
+for (vector = 0; vector < dev->msix_entries_nr; vector++) {
+for (i = 0; i < 4; i++) {
+offset = vector * PCI_MSIX_ENTRY_SIZE + i * 4;
+val = pci_get_long(dev->msix_table + offset);
+
+monitor_printf(mon, "0x%08x ", val);
+}
+monitor_printf(mon, "\n");
+}
+}
+
+static void msix_dump_pba(Monitor *mon, PCIDevice *dev)
+{
+int vector;
+
+monitor_printf(mon, "MSI-X PBA\n");
+
+for (vector = 0; vector < dev->msix_entries_nr; vector++) {
+monitor_printf(mon, "%d ", !!msix_is_pending(dev, vector));
+
+if (vector % 16 == 15) {
+monitor_printf(mon, "\n");
+}
+}
+
+if (vector % 16 != 15) {
+monitor_printf(mon, "\n");
+}
+}
+
+void msix_dump_info(Monitor *mon, PCIDevice *dev, Error **errp)
+{
+if (!msix_present(dev)) {
+error_setg(errp, "MSI-X not available");
+return;
+}
+
+msix_dump_table(mon, dev);
+msix_dump_pba(mon, dev);
+}
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 4c4a60c739..10a4500295 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -47,6 +47,8 @@ int msix_set_vector_notifiers(PCIDevice *dev,
   MSIVectorPollNotifier poll_notifier);
 void msix_unset_vector_notifiers(PCIDevice *dev);
 
+void msix_dump_info(Monitor *mon, PCIDevice *dev, Error **errp);
+
 extern const VMStateDescription vmstate_msix;
 
 #define VMSTATE_MSIX_TEST(_field, _state, _test) {   \
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index 605d57287a..46e0efc213 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -36,6 +36,7 @@ void hmp_info_irq(Monitor *mon, const QDict *qdict);
 void hmp_info_pic(Monitor *mon, const QDict *qdict);
 void hmp_info_rdma(Monitor *mon, const QDict *qdict);
 void hmp_info_pci(Monitor *mon, const QDict *qdict);
+void hmp_info_msix(Monitor *mon, const QDict *qdict);
 void hmp_info_tpm(Monitor *mon, const QDict *qdict);
 void hmp_info_iothreads(Monitor *mon, const QDict *qdict);
 void hmp_quit(Monitor *mon, const QDict *qdict);
diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index a9955b97a0..2a37d03fb7 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -19,6 +19,7 @@
 
 #include "qemu/osdep.h"
 #include "hw/sy

[PATCH RFC 0/1] To add HMP interface to dump PCI MSI-X table/PBA

2021-04-22 Thread Dongli Zhang

This is inspired by the discussion with Jason on below patchset.

https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg09020.html

The new HMP command is introduced to dump the MSI-X table and PBA.

Initially, I was going to add new option to "info pci". However, as the
number of entries is not determined and the output of MSI-X table is much
more similar to the output of hmp_info_tlb()/hmp_info_mem(), this patch
adds interface for only HMP.

The patch is tagged with RFC because I am looking for suggestions on:

1. Is it fine to add new "info msix " command?

2. Is there any issue with output format?

3. Is it fine to add only for HMP, but not QMP?

Thank you very much!

Dongli Zhang

Re: [PATCH 0/6] Add debug interface to kick/call on purpose

2021-04-07 Thread Dongli Zhang




On 4/6/21 7:20 PM, Jason Wang wrote:
> 
> 在 2021/4/7 上午7:27, Dongli Zhang 写道:
>>> This will answer your question that "Can it bypass the masking?".
>>>
>>> For vhost-scsi, virtio-blk, virtio-scsi and virtio-net, to write to eventfd 
>>> is
>>> not able to bypass masking because masking is to unregister the eventfd. To
>>> write to eventfd does not take effect.
>>>
>>> However, it is possible to bypass masking for vhost-net because vhost-net
>>> registered a specific masked_notifier eventfd in order to mask irq. To 
>>> write to
>>> original eventfd still takes effect.
>>>
>>> We may leave the user to decide whether to write to 'masked_notifier' or
>>> original 'guest_notifier' for vhost-net.
>> My fault here. To write to masked_notifier will always be masked:(
> 
> 
> Only when there's no bug in the qemu.
> 
> 
>>
>> If it is EventNotifier level, we will not care whether the EventNotifier is
>> masked or not. It just provides an interface to write to EventNotifier.
> 
> 
> Yes.
> 
> 
>>
>> To dump the MSI-x table for both virtio and vfio will help confirm if the 
>> vector
>> is masked.
> 
> 
> That would be helpful as well. It's probably better to extend "info pci" 
> command.
> 
> Thanks

I will try if to add to "info pci" (introduce new arg option to "info pci"), or
to introduce new command.

About the EventNotifier, I will classify them as guest notifier or host notifier
so that it will be much more easier for user to tell if the eventfd is for
injecting IRQ or kicking the doorbell.

Thank you very much for all suggestions!

Dongli Zhang

> 
> 
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>

Re: [PATCH 1/6] qdev: introduce qapi/hmp command for kick/call event

2021-04-07 Thread Dongli Zhang




On 4/7/21 6:40 AM, Eduardo Habkost wrote:
> On Thu, Mar 25, 2021 at 10:44:28PM -0700, Dongli Zhang wrote:
>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>> the loss of doorbell kick, e.g.,
>>
>> https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html__;!!GqivPVa7Brio!NaqdV_o0gMJkUtVWaHyLRwKDa_8MsiuANAqEcM-Ooy4pYE3R1bwPmLdCTkE0gq6gywY$
>>  
>>
>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>
>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>> to help narrow down if the issue is due to loss of irq/kick. So far the new
>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
>> or legacy IRQ).
>>
>> The 'call' is to inject irq on purpose by admin for a specific device (e.g.,
>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
>> on purpose by admin at QEMU/host side for a specific device.
>>
>> Signed-off-by: Dongli Zhang 
> [...]
>> diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
>> index 605d57287a..c7795d4ba5 100644
>> --- a/include/monitor/hmp.h
>> +++ b/include/monitor/hmp.h
>> @@ -129,5 +129,6 @@ void hmp_info_replay(Monitor *mon, const QDict *qdict);
>>  void hmp_replay_break(Monitor *mon, const QDict *qdict);
>>  void hmp_replay_delete_break(Monitor *mon, const QDict *qdict);
>>  void hmp_replay_seek(Monitor *mon, const QDict *qdict);
>> +void hmp_x_debug_device_event(Monitor *mon, const QDict *qdict);
>>  
>>  #endif
>> diff --git a/qapi/qdev.json b/qapi/qdev.json
>> index b83178220b..711c4a297a 100644
>> --- a/qapi/qdev.json
>> +++ b/qapi/qdev.json
>> @@ -124,3 +124,33 @@
>>  ##
>>  { 'event': 'DEVICE_DELETED',
>>'data': { '*device': 'str', 'path': 'str' } }
>> +
>> +##
>> +# @x-debug-device-event:
>> +#
>> +# Generate device event for a specific device queue
>> +#
>> +# @dev: device path
>> +#
>> +# @event: event (e.g., kick or call) to trigger
> 
> Any specific reason to not use an enum here?
> 
> In addition to making the QAPI schema and documentation more
> descriptive, it would save you the work of manually defining the
> DEVICE_EVENT_* constants and implementing get_device_event().

Thank you very much for the suggestion!

I will use enum in json file.

Dongli Zhang

> 
> 
>> +#
>> +# @queue: queue id
>> +#
>> +# Returns: Nothing on success
>> +#
>> +# Since: 6.1
>> +#
>> +# Notes: This is used to debug VM driver hang issue. The 'kick' event is to
>> +#send notification to QEMU/vhost while the 'call' event is to
>> +#interrupt VM on purpose.
>> +#
>> +# Example:
>> +#
>> +# -> { "execute": "x-debug-device_event",
>> +#  "arguments": { "dev": "/machine/peripheral/vscsi0", "event": "kick",
>> +# "queue": 1 } }
>> +# <- { "return": {} }
>> +#
>> +##
>> +{ 'command': 'x-debug-device-event',
>> +  'data': {'dev': 'str', 'event': 'str', 'queue': 'int'} }
> [...]
>

Re: [PATCH 0/6] Add debug interface to kick/call on purpose

2021-04-06 Thread Dongli Zhang




On 4/6/21 1:43 AM, Dongli Zhang wrote:
> 
> 
> On 4/5/21 6:55 PM, Jason Wang wrote:
>>
>> 在 2021/4/6 上午4:00, Dongli Zhang 写道:
>>>
>>> On 4/1/21 8:47 PM, Jason Wang wrote:
>>>> 在 2021/3/30 下午3:29, Dongli Zhang 写道:
>>>>> On 3/28/21 8:56 PM, Jason Wang wrote:
>>>>>> 在 2021/3/27 上午5:16, Dongli Zhang 写道:
>>>>>>> Hi Jason,
>>>>>>>
>>>>>>> On 3/26/21 12:24 AM, Jason Wang wrote:
>>>>>>>> 在 2021/3/26 下午1:44, Dongli Zhang 写道:
>>>>>>>>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due 
>>>>>>>>> to
>>>>>>>>> the loss of doorbell kick, e.g.,
>>>>>>>>>
>>>>>>>>> https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html__;!!GqivPVa7Brio!KS3pAU2cKjz4wgI4QSlE-YsJPhPG71nkE5_tGhaOf7mi_xvNxbvKkfn03rk5BNDLSEU$
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>>>>>>>>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>>>>>>>>
>>>>>>>>> This patch introduces a new debug interface 'DeviceEvent' to 
>>>>>>>>> DeviceClass
>>>>>>>>> to help narrow down if the issue is due to loss of irq/kick. So far 
>>>>>>>>> the new
>>>>>>>>> interface handles only two events: 'call' and 'kick'. Any device 
>>>>>>>>> (e.g.,
>>>>>>>>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, 
>>>>>>>>> MSI-X
>>>>>>>>> or legacy IRQ).
>>>>>>>>>
>>>>>>>>> The 'call' is to inject irq on purpose by admin for a specific device
>>>>>>>>> (e.g.,
>>>>>>>>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the 
>>>>>>>>> doorbell
>>>>>>>>> on purpose by admin at QEMU/host side for a specific device.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This device can be used as a workaround if call/kick is lost due to
>>>>>>>>> virtualization software (e.g., kernel or QEMU) issue.
>>>>>>>>>
>>>>>>>>> We may also implement the interface for VFIO PCI, e.g., to write to
>>>>>>>>> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to 
>>>>>>>>> VM
>>>>>>>>> on purpose. This is considered future work once the virtio part is 
>>>>>>>>> done.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Below is from live crash analysis. Initially, the queue=2 has 
>>>>>>>>> count=15 for
>>>>>>>>> 'kick' eventfd_ctx. Suppose there is data in vring avail while there 
>>>>>>>>> is no
>>>>>>>>> used available. We suspect this is because vhost-scsi was not 
>>>>>>>>> notified by
>>>>>>>>> VM. In order to narrow down and analyze the issue, we use live crash 
>>>>>>>>> to
>>>>>>>>> dump the current counter of eventfd for queue=2.
>>>>>>>>>
>>>>>>>>> crash> eventfd_ctx 8f67f6bbe700
>>>>>>>>> struct eventfd_ctx {
>>>>>>>>>   kref = {
>>>>>>>>>     refcount = {
>>>>>>>>>   refs = {
>>>>>>>>>     counter = 4
>>>>>>>>>   }
>>>>>>>>>     }
>>>>>>>>>   },
>>>>>>>>>   wqh = {
>>>>>>>>>     lock = {
>>>>>>>>>   {
>>>>>>>>>     rlock = {
>>>>>>>>>   raw_lock = {
>>>>>>>>>     val = {
>>>>>>>>>   counter = 0
>>>>>&

Re: [PATCH 0/6] Add debug interface to kick/call on purpose

2021-04-06 Thread Dongli Zhang




On 4/5/21 6:55 PM, Jason Wang wrote:
> 
> 在 2021/4/6 上午4:00, Dongli Zhang 写道:
>>
>> On 4/1/21 8:47 PM, Jason Wang wrote:
>>> 在 2021/3/30 下午3:29, Dongli Zhang 写道:
>>>> On 3/28/21 8:56 PM, Jason Wang wrote:
>>>>> 在 2021/3/27 上午5:16, Dongli Zhang 写道:
>>>>>> Hi Jason,
>>>>>>
>>>>>> On 3/26/21 12:24 AM, Jason Wang wrote:
>>>>>>> 在 2021/3/26 下午1:44, Dongli Zhang 写道:
>>>>>>>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due 
>>>>>>>> to
>>>>>>>> the loss of doorbell kick, e.g.,
>>>>>>>>
>>>>>>>> https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html__;!!GqivPVa7Brio!KS3pAU2cKjz4wgI4QSlE-YsJPhPG71nkE5_tGhaOf7mi_xvNxbvKkfn03rk5BNDLSEU$
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>>>>>>>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>>>>>>>
>>>>>>>> This patch introduces a new debug interface 'DeviceEvent' to 
>>>>>>>> DeviceClass
>>>>>>>> to help narrow down if the issue is due to loss of irq/kick. So far 
>>>>>>>> the new
>>>>>>>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>>>>>>>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, 
>>>>>>>> MSI-X
>>>>>>>> or legacy IRQ).
>>>>>>>>
>>>>>>>> The 'call' is to inject irq on purpose by admin for a specific device
>>>>>>>> (e.g.,
>>>>>>>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the 
>>>>>>>> doorbell
>>>>>>>> on purpose by admin at QEMU/host side for a specific device.
>>>>>>>>
>>>>>>>>
>>>>>>>> This device can be used as a workaround if call/kick is lost due to
>>>>>>>> virtualization software (e.g., kernel or QEMU) issue.
>>>>>>>>
>>>>>>>> We may also implement the interface for VFIO PCI, e.g., to write to
>>>>>>>> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to 
>>>>>>>> VM
>>>>>>>> on purpose. This is considered future work once the virtio part is 
>>>>>>>> done.
>>>>>>>>
>>>>>>>>
>>>>>>>> Below is from live crash analysis. Initially, the queue=2 has count=15 
>>>>>>>> for
>>>>>>>> 'kick' eventfd_ctx. Suppose there is data in vring avail while there 
>>>>>>>> is no
>>>>>>>> used available. We suspect this is because vhost-scsi was not notified 
>>>>>>>> by
>>>>>>>> VM. In order to narrow down and analyze the issue, we use live crash to
>>>>>>>> dump the current counter of eventfd for queue=2.
>>>>>>>>
>>>>>>>> crash> eventfd_ctx 8f67f6bbe700
>>>>>>>> struct eventfd_ctx {
>>>>>>>>   kref = {
>>>>>>>>     refcount = {
>>>>>>>>   refs = {
>>>>>>>>     counter = 4
>>>>>>>>   }
>>>>>>>>     }
>>>>>>>>   },
>>>>>>>>   wqh = {
>>>>>>>>     lock = {
>>>>>>>>   {
>>>>>>>>     rlock = {
>>>>>>>>   raw_lock = {
>>>>>>>>     val = {
>>>>>>>>   counter = 0
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>     },
>>>>>>>>     head = {
>>>>>>>>   next = 0x8f841dc08e18,
>>>>>>>>   prev = 0x8f841dc08e18
>>>>

Re: [PATCH 0/6] Add debug interface to kick/call on purpose

2021-04-05 Thread Dongli Zhang




On 4/1/21 8:47 PM, Jason Wang wrote:
> 
> 在 2021/3/30 下午3:29, Dongli Zhang 写道:
>>
>> On 3/28/21 8:56 PM, Jason Wang wrote:
>>> 在 2021/3/27 上午5:16, Dongli Zhang 写道:
>>>> Hi Jason,
>>>>
>>>> On 3/26/21 12:24 AM, Jason Wang wrote:
>>>>> 在 2021/3/26 下午1:44, Dongli Zhang 写道:
>>>>>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>>>>>> the loss of doorbell kick, e.g.,
>>>>>>
>>>>>> https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html__;!!GqivPVa7Brio!KS3pAU2cKjz4wgI4QSlE-YsJPhPG71nkE5_tGhaOf7mi_xvNxbvKkfn03rk5BNDLSEU$
>>>>>>
>>>>>>
>>>>>>
>>>>>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>>>>>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>>>>>
>>>>>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>>>>>> to help narrow down if the issue is due to loss of irq/kick. So far the 
>>>>>> new
>>>>>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>>>>>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, 
>>>>>> MSI-X
>>>>>> or legacy IRQ).
>>>>>>
>>>>>> The 'call' is to inject irq on purpose by admin for a specific device 
>>>>>> (e.g.,
>>>>>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the 
>>>>>> doorbell
>>>>>> on purpose by admin at QEMU/host side for a specific device.
>>>>>>
>>>>>>
>>>>>> This device can be used as a workaround if call/kick is lost due to
>>>>>> virtualization software (e.g., kernel or QEMU) issue.
>>>>>>
>>>>>> We may also implement the interface for VFIO PCI, e.g., to write to
>>>>>> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM
>>>>>> on purpose. This is considered future work once the virtio part is done.
>>>>>>
>>>>>>
>>>>>> Below is from live crash analysis. Initially, the queue=2 has count=15 
>>>>>> for
>>>>>> 'kick' eventfd_ctx. Suppose there is data in vring avail while there is 
>>>>>> no
>>>>>> used available. We suspect this is because vhost-scsi was not notified by
>>>>>> VM. In order to narrow down and analyze the issue, we use live crash to
>>>>>> dump the current counter of eventfd for queue=2.
>>>>>>
>>>>>> crash> eventfd_ctx 8f67f6bbe700
>>>>>> struct eventfd_ctx {
>>>>>>  kref = {
>>>>>>    refcount = {
>>>>>>  refs = {
>>>>>>    counter = 4
>>>>>>  }
>>>>>>    }
>>>>>>  },
>>>>>>  wqh = {
>>>>>>    lock = {
>>>>>>  {
>>>>>>    rlock = {
>>>>>>  raw_lock = {
>>>>>>    val = {
>>>>>>  counter = 0
>>>>>>    }
>>>>>>  }
>>>>>>    }
>>>>>>  }
>>>>>>    },
>>>>>>    head = {
>>>>>>  next = 0x8f841dc08e18,
>>>>>>  prev = 0x8f841dc08e18
>>>>>>    }
>>>>>>  },
>>>>>>  count = 15, ---> eventfd is 15 !!!
>>>>>>  flags = 526336
>>>>>> }
>>>>>>
>>>>>> Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic
>>>>>> with this interface.
>>>>>>
>>>>>> { "execute": "x-debug-device-event",
>>>>>>  "arguments": { "dev": "/machine/peripheral/vscsi0",
>>>>>>     "event": "kick", "queue": 2 } }
>>>>>>
>>>>>> The counter is increased to 16. Suppose the hang issue is resolved, it
>>>>>> indicates something bad is in software that the 'kick' is los

Re: [PATCH 0/6] Add debug interface to kick/call on purpose

2021-03-30 Thread Dongli Zhang




On 3/28/21 8:56 PM, Jason Wang wrote:
> 
> 在 2021/3/27 上午5:16, Dongli Zhang 写道:
>> Hi Jason,
>>
>> On 3/26/21 12:24 AM, Jason Wang wrote:
>>> 在 2021/3/26 下午1:44, Dongli Zhang 写道:
>>>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>>>> the loss of doorbell kick, e.g.,
>>>>
>>>> https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html__;!!GqivPVa7Brio!KS3pAU2cKjz4wgI4QSlE-YsJPhPG71nkE5_tGhaOf7mi_xvNxbvKkfn03rk5BNDLSEU$
>>>>
>>>>
>>>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>>>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>>>
>>>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>>>> to help narrow down if the issue is due to loss of irq/kick. So far the new
>>>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>>>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
>>>> or legacy IRQ).
>>>>
>>>> The 'call' is to inject irq on purpose by admin for a specific device 
>>>> (e.g.,
>>>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
>>>> on purpose by admin at QEMU/host side for a specific device.
>>>>
>>>>
>>>> This device can be used as a workaround if call/kick is lost due to
>>>> virtualization software (e.g., kernel or QEMU) issue.
>>>>
>>>> We may also implement the interface for VFIO PCI, e.g., to write to
>>>> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM
>>>> on purpose. This is considered future work once the virtio part is done.
>>>>
>>>>
>>>> Below is from live crash analysis. Initially, the queue=2 has count=15 for
>>>> 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no
>>>> used available. We suspect this is because vhost-scsi was not notified by
>>>> VM. In order to narrow down and analyze the issue, we use live crash to
>>>> dump the current counter of eventfd for queue=2.
>>>>
>>>> crash> eventfd_ctx 8f67f6bbe700
>>>> struct eventfd_ctx {
>>>>     kref = {
>>>>   refcount = {
>>>>     refs = {
>>>>   counter = 4
>>>>     }
>>>>   }
>>>>     },
>>>>     wqh = {
>>>>   lock = {
>>>>     {
>>>>   rlock = {
>>>>     raw_lock = {
>>>>   val = {
>>>>     counter = 0
>>>>   }
>>>>     }
>>>>   }
>>>>     }
>>>>   },
>>>>   head = {
>>>>     next = 0x8f841dc08e18,
>>>>     prev = 0x8f841dc08e18
>>>>   }
>>>>     },
>>>>     count = 15, ---> eventfd is 15 !!!
>>>>     flags = 526336
>>>> }
>>>>
>>>> Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic
>>>> with this interface.
>>>>
>>>> { "execute": "x-debug-device-event",
>>>>     "arguments": { "dev": "/machine/peripheral/vscsi0",
>>>>    "event": "kick", "queue": 2 } }
>>>>
>>>> The counter is increased to 16. Suppose the hang issue is resolved, it
>>>> indicates something bad is in software that the 'kick' is lost.
>>> What do you mean by "software" here? And it looks to me you're testing 
>>> whether
>>> event_notifier_set() is called by virtio_queue_notify() here. If so, I'm not
>>> sure how much value could we gain from a dedicated debug interface like this
>>> consider there're a lot of exisinting general purpose debugging method like
>>> tracing or gdb. I'd say the path from virtio_queue_notify() to
>>> event_notifier_set() is only a very small fraction of the process of 
>>> virtqueue
>>> kick which is unlikey to be buggy. Consider usually the ioeventfd will be
>>> offloaded to KVM, it's more a chance that something is wrong in setuping
>>> ioeventfd instead of here. Irq is even more complicated.
>> Thank you very much!
>>
>> I am not testing whether event_notifier_set() is call

Re: [PATCH 0/6] Add debug interface to kick/call on purpose

2021-03-26 Thread Dongli Zhang

Hi Jason,

On 3/26/21 12:24 AM, Jason Wang wrote:
> 
> 在 2021/3/26 下午1:44, Dongli Zhang 写道:
>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>> the loss of doorbell kick, e.g.,
>>
>> https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html__;!!GqivPVa7Brio!KS3pAU2cKjz4wgI4QSlE-YsJPhPG71nkE5_tGhaOf7mi_xvNxbvKkfn03rk5BNDLSEU$
>>
>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>
>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>> to help narrow down if the issue is due to loss of irq/kick. So far the new
>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
>> or legacy IRQ).
>>
>> The 'call' is to inject irq on purpose by admin for a specific device (e.g.,
>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
>> on purpose by admin at QEMU/host side for a specific device.
>>
>>
>> This device can be used as a workaround if call/kick is lost due to
>> virtualization software (e.g., kernel or QEMU) issue.
>>
>> We may also implement the interface for VFIO PCI, e.g., to write to
>> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM
>> on purpose. This is considered future work once the virtio part is done.
>>
>>
>> Below is from live crash analysis. Initially, the queue=2 has count=15 for
>> 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no
>> used available. We suspect this is because vhost-scsi was not notified by
>> VM. In order to narrow down and analyze the issue, we use live crash to
>> dump the current counter of eventfd for queue=2.
>>
>> crash> eventfd_ctx 8f67f6bbe700
>> struct eventfd_ctx {
>>    kref = {
>>  refcount = {
>>    refs = {
>>  counter = 4
>>    }
>>  }
>>    },
>>    wqh = {
>>  lock = {
>>    {
>>  rlock = {
>>    raw_lock = {
>>  val = {
>>    counter = 0
>>  }
>>    }
>>  }
>>    }
>>  },
>>  head = {
>>    next = 0x8f841dc08e18,
>>    prev = 0x8f841dc08e18
>>  }
>>    },
>>    count = 15, ---> eventfd is 15 !!!
>>    flags = 526336
>> }
>>
>> Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic
>> with this interface.
>>
>> { "execute": "x-debug-device-event",
>>    "arguments": { "dev": "/machine/peripheral/vscsi0",
>>   "event": "kick", "queue": 2 } }
>>
>> The counter is increased to 16. Suppose the hang issue is resolved, it
>> indicates something bad is in software that the 'kick' is lost.
> 
> 
> What do you mean by "software" here? And it looks to me you're testing whether
> event_notifier_set() is called by virtio_queue_notify() here. If so, I'm not
> sure how much value could we gain from a dedicated debug interface like this
> consider there're a lot of exisinting general purpose debugging method like
> tracing or gdb. I'd say the path from virtio_queue_notify() to
> event_notifier_set() is only a very small fraction of the process of virtqueue
> kick which is unlikey to be buggy. Consider usually the ioeventfd will be
> offloaded to KVM, it's more a chance that something is wrong in setuping
> ioeventfd instead of here. Irq is even more complicated.

Thank you very much!

I am not testing whether event_notifier_set() is called by 
virtio_queue_notify().

The 'software' indicates the data processing and event notification mechanism
involved with virtio/vhost PV driver frontend. E.g., while VM is waiting for an
extra IRQ, vhost side did not trigger IRQ, suppose vring_need_event()
erroneously returns false due to corrupted ring buffer status.

This was initially proposed for vhost only and I was going to export
ioeventfd/irqfd from vhost to admin via sysfs. Finally, I realized I would
better implement this at QEMU.

The QEMU inits the eventfd (ioeventfd and irqfd), and offloads them to
KVM/vhost. The VM side sends requests to ring buffer and kicks the doorbell (via
ioeventfd), while the backend vhost side sends responses back and calls the IRQ
(via ioeventfd).

Unfortunately, sometimes there is issue with virtio/vhost so that kick/call was
missed/ignored, or even never triggered

[PATCH 1/6] qdev: introduce qapi/hmp command for kick/call event

2021-03-25 Thread Dongli Zhang

The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
the loss of doorbell kick, e.g.,

https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html

... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").

This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
to help narrow down if the issue is due to loss of irq/kick. So far the new
interface handles only two events: 'call' and 'kick'. Any device (e.g.,
virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
or legacy IRQ).

The 'call' is to inject irq on purpose by admin for a specific device (e.g.,
vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
on purpose by admin at QEMU/host side for a specific device.

Signed-off-by: Dongli Zhang 
---
 hmp-commands.hx| 14 +++
 include/hw/qdev-core.h |  9 +++
 include/monitor/hmp.h  |  1 +
 qapi/qdev.json | 30 ++
 softmmu/qdev-monitor.c | 56 ++
 5 files changed, 110 insertions(+)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 435c591a1c..d74b895fff 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1725,3 +1725,17 @@ ERST
 .flags  = "p",
 },
 
+{
+.name   = "x-debug-device-event",
+.args_type  = "dev:s,event:s,queue:l",
+.params = "dev event queue",
+.help   = "generate device event for a specific device queue",
+.cmd= hmp_x_debug_device_event,
+.flags  = "p",
+},
+
+SRST
+``x-debug-device-event`` *dev* *event* *queue*
+  Generate device event *event* for specific *queue* of *dev*
+ERST
+
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index bafc311bfa..1ea8bf23b9 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -29,9 +29,17 @@ typedef enum DeviceCategory {
 DEVICE_CATEGORY_MAX
 } DeviceCategory;
 
+enum {
+DEVICE_EVENT_CALL,
+DEVICE_EVENT_KICK,
+DEVICE_EVENT_MAX
+};
+
 typedef void (*DeviceRealize)(DeviceState *dev, Error **errp);
 typedef void (*DeviceUnrealize)(DeviceState *dev);
 typedef void (*DeviceReset)(DeviceState *dev);
+typedef void (*DeviceEvent)(DeviceState *dev, int event, int queue,
+Error **errp);
 typedef void (*BusRealize)(BusState *bus, Error **errp);
 typedef void (*BusUnrealize)(BusState *bus);
 
@@ -132,6 +140,7 @@ struct DeviceClass {
 DeviceReset reset;
 DeviceRealize realize;
 DeviceUnrealize unrealize;
+DeviceEvent event;
 
 /* device state */
 const VMStateDescription *vmsd;
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index 605d57287a..c7795d4ba5 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -129,5 +129,6 @@ void hmp_info_replay(Monitor *mon, const QDict *qdict);
 void hmp_replay_break(Monitor *mon, const QDict *qdict);
 void hmp_replay_delete_break(Monitor *mon, const QDict *qdict);
 void hmp_replay_seek(Monitor *mon, const QDict *qdict);
+void hmp_x_debug_device_event(Monitor *mon, const QDict *qdict);
 
 #endif
diff --git a/qapi/qdev.json b/qapi/qdev.json
index b83178220b..711c4a297a 100644
--- a/qapi/qdev.json
+++ b/qapi/qdev.json
@@ -124,3 +124,33 @@
 ##
 { 'event': 'DEVICE_DELETED',
   'data': { '*device': 'str', 'path': 'str' } }
+
+##
+# @x-debug-device-event:
+#
+# Generate device event for a specific device queue
+#
+# @dev: device path
+#
+# @event: event (e.g., kick or call) to trigger
+#
+# @queue: queue id
+#
+# Returns: Nothing on success
+#
+# Since: 6.1
+#
+# Notes: This is used to debug VM driver hang issue. The 'kick' event is to
+#send notification to QEMU/vhost while the 'call' event is to
+#interrupt VM on purpose.
+#
+# Example:
+#
+# -> { "execute": "x-debug-device_event",
+#  "arguments": { "dev": "/machine/peripheral/vscsi0", "event": "kick",
+# "queue": 1 } }
+# <- { "return": {} }
+#
+##
+{ 'command': 'x-debug-device-event',
+  'data': {'dev': 'str', 'event': 'str', 'queue': 'int'} }
diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index a9955b97a0..bca53111fb 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -924,6 +924,62 @@ void hmp_device_del(Monitor *mon, const QDict *qdict)
 hmp_handle_error(mon, err);
 }
 
+static const char * const device_events[DEVICE_EVENT_MAX] = {
+[DEVICE_EVENT_KICK] = "kick",
+[DEVICE_EVENT_CALL] = "call"
+};
+
+static int get_device_event(const char *event)
+{
+int evt;
+
+for (evt = 0; evt < ARRAY_SIZE(device_events); evt++) {
+if (!strcmp(device_events[evt], event)) {
+return evt;
+}
+}
+
+return -ENOENT;
+}
+
+void qmp_x_de

[PATCH 5/6] vhost-scsi-pci: implement device event interface for kick/call

2021-03-25 Thread Dongli Zhang

This is to implement the device event interface for vhost-scsi-pci.

Signed-off-by: Dongli Zhang 
---
 hw/scsi/vhost-scsi.c   |  6 ++
 hw/virtio/vhost-scsi-pci.c | 10 ++
 include/hw/virtio/vhost-scsi.h |  3 +++
 3 files changed, 19 insertions(+)

diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
index 4d70fa036b..11dd94ff92 100644
--- a/hw/scsi/vhost-scsi.c
+++ b/hw/scsi/vhost-scsi.c
@@ -163,6 +163,12 @@ static const VMStateDescription vmstate_virtio_vhost_scsi 
= {
 .pre_save = vhost_scsi_pre_save,
 };
 
+void vhost_scsi_device_event(DeviceState *dev, int event, int queue,
+ Error **errp)
+{
+virtio_device_event(dev, event, queue, true, errp);
+}
+
 static void vhost_scsi_realize(DeviceState *dev, Error **errp)
 {
 VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(dev);
diff --git a/hw/virtio/vhost-scsi-pci.c b/hw/virtio/vhost-scsi-pci.c
index cb71a294fa..c7a614cb11 100644
--- a/hw/virtio/vhost-scsi-pci.c
+++ b/hw/virtio/vhost-scsi-pci.c
@@ -44,6 +44,15 @@ static Property vhost_scsi_pci_properties[] = {
 DEFINE_PROP_END_OF_LIST(),
 };
 
+static void vhost_scsi_pci_event(DeviceState *dev, int event, int queue,
+ Error **errp)
+{
+VHostSCSIPCI *vscsi = VHOST_SCSI_PCI(dev);
+DeviceState *vdev = DEVICE(>vdev);
+
+vhost_scsi_device_event(vdev, event, queue, errp);
+}
+
 static void vhost_scsi_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
 {
 VHostSCSIPCI *dev = VHOST_SCSI_PCI(vpci_dev);
@@ -70,6 +79,7 @@ static void vhost_scsi_pci_class_init(ObjectClass *klass, 
void *data)
 k->realize = vhost_scsi_pci_realize;
 set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
 device_class_set_props(dc, vhost_scsi_pci_properties);
+dc->event = vhost_scsi_pci_event;
 pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
 pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_SCSI;
 pcidev_k->revision = 0x00;
diff --git a/include/hw/virtio/vhost-scsi.h b/include/hw/virtio/vhost-scsi.h
index 7dc2bdd69d..b47854a0c6 100644
--- a/include/hw/virtio/vhost-scsi.h
+++ b/include/hw/virtio/vhost-scsi.h
@@ -32,4 +32,7 @@ struct VHostSCSI {
 VHostSCSICommon parent_obj;
 };
 
+void vhost_scsi_device_event(DeviceState *dev, int event, int queue,
+ Error **errp);
+
 #endif
-- 
2.17.1

[PATCH 3/6] virtio-blk-pci: implement device event interface for kick/call

2021-03-25 Thread Dongli Zhang

This is to implement the device event interface for virtio-blk-pci.

Signed-off-by: Dongli Zhang 
---
 hw/block/virtio-blk.c  |  9 +
 hw/virtio/virtio-blk-pci.c | 10 ++
 include/hw/virtio/virtio-blk.h |  2 ++
 3 files changed, 21 insertions(+)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index d28979efb8..2b3583a913 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -1118,6 +1118,15 @@ static int virtio_blk_load_device(VirtIODevice *vdev, 
QEMUFile *f,
 return 0;
 }
 
+void virtio_blk_device_event(DeviceState *dev, int event, int queue,
+ Error **errp)
+{
+VirtIOBlock *s = VIRTIO_BLK(dev);
+bool irqfd = s->dataplane_started && !s->dataplane_disabled;
+
+virtio_device_event(dev, event, queue, irqfd, errp);
+}
+
 static void virtio_resize_cb(void *opaque)
 {
 VirtIODevice *vdev = opaque;
diff --git a/hw/virtio/virtio-blk-pci.c b/hw/virtio/virtio-blk-pci.c
index 9d5795810c..f1fc72e7f1 100644
--- a/hw/virtio/virtio-blk-pci.c
+++ b/hw/virtio/virtio-blk-pci.c
@@ -47,6 +47,15 @@ static Property virtio_blk_pci_properties[] = {
 DEFINE_PROP_END_OF_LIST(),
 };
 
+static void virtio_blk_pci_event(DeviceState *dev, int event, int queue,
+ Error **errp)
+{
+VirtIOBlkPCI *vblk = VIRTIO_BLK_PCI(dev);
+DeviceState *vdev = DEVICE(>vdev);
+
+virtio_blk_device_event(vdev, event, queue, errp);
+}
+
 static void virtio_blk_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
 {
 VirtIOBlkPCI *dev = VIRTIO_BLK_PCI(vpci_dev);
@@ -72,6 +81,7 @@ static void virtio_blk_pci_class_init(ObjectClass *klass, 
void *data)
 
 set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
 device_class_set_props(dc, virtio_blk_pci_properties);
+dc->event = virtio_blk_pci_event;
 k->realize = virtio_blk_pci_realize;
 pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
 pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_BLOCK;
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 29655a406d..500be01dff 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -92,5 +92,7 @@ typedef struct MultiReqBuffer {
 
 bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq);
 void virtio_blk_process_queued_requests(VirtIOBlock *s, bool is_bh);
+void virtio_blk_device_event(DeviceState *dev, int event, int queue,
+ Error **errp);
 
 #endif
-- 
2.17.1

[PATCH 4/6] virtio-scsi-pci: implement device event interface for kick/call

2021-03-25 Thread Dongli Zhang

This is to implement the device event interface for virtio-scsi-pci.

Signed-off-by: Dongli Zhang 
---
 hw/scsi/virtio-scsi.c   |  9 +
 hw/virtio/virtio-scsi-pci.c | 10 ++
 include/hw/virtio/virtio-scsi.h |  3 +++
 3 files changed, 22 insertions(+)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 6d80730287..f437ed1a81 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -962,6 +962,15 @@ static struct SCSIBusInfo virtio_scsi_scsi_info = {
 .load_request = virtio_scsi_load_request,
 };
 
+void virtio_scsi_device_event(DeviceState *dev, int event, int queue,
+  Error **errp)
+{
+VirtIOSCSI *s = VIRTIO_SCSI(dev);
+bool irqfd = s->dataplane_started && !s->dataplane_fenced;
+
+virtio_device_event(dev, event, queue, irqfd, errp);
+}
+
 void virtio_scsi_common_realize(DeviceState *dev,
 VirtIOHandleOutput ctrl,
 VirtIOHandleOutput evt,
diff --git a/hw/virtio/virtio-scsi-pci.c b/hw/virtio/virtio-scsi-pci.c
index 97fab74236..d5db743692 100644
--- a/hw/virtio/virtio-scsi-pci.c
+++ b/hw/virtio/virtio-scsi-pci.c
@@ -43,6 +43,15 @@ static Property virtio_scsi_pci_properties[] = {
 DEFINE_PROP_END_OF_LIST(),
 };
 
+static void virtio_scsi_pci_event(DeviceState *dev, int event, int queue,
+  Error **errp)
+{
+VirtIOSCSIPCI *vscsi = VIRTIO_SCSI_PCI(dev);
+DeviceState *vdev = DEVICE(>vdev);
+
+virtio_scsi_device_event(vdev, event, queue, errp);
+}
+
 static void virtio_scsi_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
 {
 VirtIOSCSIPCI *dev = VIRTIO_SCSI_PCI(vpci_dev);
@@ -82,6 +91,7 @@ static void virtio_scsi_pci_class_init(ObjectClass *klass, 
void *data)
 k->realize = virtio_scsi_pci_realize;
 set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
 device_class_set_props(dc, virtio_scsi_pci_properties);
+dc->event = virtio_scsi_pci_event;
 pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
 pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_SCSI;
 pcidev_k->revision = 0x00;
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index 543681bc18..d1fff0eeac 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -163,4 +163,7 @@ void virtio_scsi_dataplane_setup(VirtIOSCSI *s, Error 
**errp);
 int virtio_scsi_dataplane_start(VirtIODevice *s);
 void virtio_scsi_dataplane_stop(VirtIODevice *s);
 
+void virtio_scsi_device_event(DeviceState *dev, int event, int queue,
+  Error **errp);
+
 #endif /* QEMU_VIRTIO_SCSI_H */
-- 
2.17.1

[PATCH 6/6] virtio-net-pci: implement device event interface for kick/call

2021-03-25 Thread Dongli Zhang

This is to implement the device event interface for virtio-net-pci.

Signed-off-by: Dongli Zhang 
---
 hw/net/virtio-net.c|  9 +
 hw/virtio/virtio-net-pci.c | 10 ++
 include/hw/virtio/virtio-net.h |  3 +++
 3 files changed, 22 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 66b9ff4511..b5c3fa392c 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3195,6 +3195,15 @@ static bool failover_hide_primary_device(DeviceListener 
*listener,
 return qatomic_read(>failover_primary_hidden);
 }
 
+void virtio_net_device_event(DeviceState *dev, int event, int queue,
+ Error **errp)
+{
+VirtIONet *n = VIRTIO_NET(dev);
+bool irqfd = n->vhost_started;
+
+virtio_device_event(dev, event, queue, irqfd, errp);
+}
+
 static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 {
 VirtIODevice *vdev = VIRTIO_DEVICE(dev);
diff --git a/hw/virtio/virtio-net-pci.c b/hw/virtio/virtio-net-pci.c
index aa0b3caecb..1fa5a6fe5d 100644
--- a/hw/virtio/virtio-net-pci.c
+++ b/hw/virtio/virtio-net-pci.c
@@ -46,6 +46,15 @@ static Property virtio_net_properties[] = {
 DEFINE_PROP_END_OF_LIST(),
 };
 
+static void virtio_net_pci_event(DeviceState *dev, int event, int queue,
+ Error **errp)
+{
+VirtIONetPCI *vnet = VIRTIO_NET_PCI(dev);
+DeviceState *vdev = DEVICE(>vdev);
+
+virtio_net_device_event(vdev, event, queue, errp);
+}
+
 static void virtio_net_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
 {
 DeviceState *qdev = DEVICE(vpci_dev);
@@ -77,6 +86,7 @@ static void virtio_net_pci_class_init(ObjectClass *klass, 
void *data)
 k->class_id = PCI_CLASS_NETWORK_ETHERNET;
 set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
 device_class_set_props(dc, virtio_net_properties);
+dc->event = virtio_net_pci_event;
 vpciklass->realize = virtio_net_pci_realize;
 }
 
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index 7e96d193aa..d88c9969ea 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -214,4 +214,7 @@ struct VirtIONet {
 void virtio_net_set_netclient_name(VirtIONet *n, const char *name,
const char *type);
 
+void virtio_net_device_event(DeviceState *dev, int event, int queue,
+ Error **errp);
+
 #endif
-- 
2.17.1

[PATCH 2/6] virtio: introduce helper function for kick/call device event

2021-03-25 Thread Dongli Zhang

This is to introduce the helper function for virtio device to kick or
call.

Signed-off-by: Dongli Zhang 
---
 hw/virtio/virtio.c | 64 ++
 include/hw/virtio/virtio.h |  3 ++
 2 files changed, 67 insertions(+)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 07f4e60b30..e081041a75 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -30,6 +30,8 @@
 #include "sysemu/runstate.h"
 #include "standard-headers/linux/virtio_ids.h"
 
+/* #define DEBUG_VIRTIO_EVENT */
+
 /*
  * The alignment to use between consumer and producer parts of vring.
  * x86 pagesize again. This is the default, used by transports like PCI
@@ -2572,6 +2574,68 @@ void virtio_notify(VirtIODevice *vdev, VirtQueue *vq)
 virtio_irq(vq);
 }
 
+static void virtio_device_event_call(VirtQueue *vq, bool eventfd,
+ Error **errp)
+{
+#ifdef DEBUG_VIRTIO_EVENT
+printf("The 'call' event is triggered for path=%s, queue=%d, irqfd=%d.\n",
+   object_get_canonical_path(OBJECT(vq->vdev)),
+   vq->queue_index, eventfd);
+#endif
+
+if (eventfd) {
+virtio_set_isr(vq->vdev, 0x1);
+event_notifier_set(>guest_notifier);
+} else {
+virtio_irq(vq);
+}
+}
+
+static void virtio_device_event_kick(VirtQueue *vq, bool eventfd,
+ Error **errp)
+{
+#ifdef DEBUG_VIRTIO_EVENT
+printf("The 'kick' event is triggered for path=%s, queue=%d.\n",
+   object_get_canonical_path(OBJECT(vq->vdev)), vq->queue_index);
+#endif
+
+virtio_queue_notify(vq->vdev, virtio_get_queue_index(vq));
+}
+
+typedef void (VirtIOEvent)(VirtQueue *vq, bool eventfd, Error **errp);
+
+static VirtIOEvent *virtio_event_funcs[DEVICE_EVENT_MAX] = {
+[DEVICE_EVENT_CALL] = virtio_device_event_call,
+[DEVICE_EVENT_KICK] = virtio_device_event_kick
+};
+
+void virtio_device_event(DeviceState *dev, int event, int queue,
+ bool eventfd, Error **errp)
+{
+struct VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+int num = virtio_get_num_queues(vdev);
+VirtQueue *vq;
+
+assert(event < DEVICE_EVENT_MAX);
+
+if (vdev->broken) {
+error_setg(errp, "Broken device");
+return;
+}
+
+if (queue < 0 || queue >= num) {
+error_setg(errp, "Invalid queue %d", queue);
+return;
+}
+
+vq = >vq[queue];
+
+if (virtio_event_funcs[event])
+virtio_event_funcs[event](vq, eventfd, errp);
+else
+error_setg(errp, "The event is not supported");
+}
+
 void virtio_notify_config(VirtIODevice *vdev)
 {
 if (!(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK))
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b7ece7a6a8..21bb13ffa6 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -210,6 +210,9 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int 
*in_bytes,
 void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq);
 void virtio_notify(VirtIODevice *vdev, VirtQueue *vq);
 
+void virtio_device_event(DeviceState *dev, int event, int queue,
+ bool eventfd, Error **errp);
+
 int virtio_save(VirtIODevice *vdev, QEMUFile *f);
 
 extern const VMStateInfo virtio_vmstate_info;
-- 
2.17.1

[PATCH 0/6] Add debug interface to kick/call on purpose

2021-03-25 Thread Dongli Zhang

The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
the loss of doorbell kick, e.g.,

https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html

... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").

This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
to help narrow down if the issue is due to loss of irq/kick. So far the new
interface handles only two events: 'call' and 'kick'. Any device (e.g.,
virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
or legacy IRQ).

The 'call' is to inject irq on purpose by admin for a specific device (e.g.,
vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
on purpose by admin at QEMU/host side for a specific device.


This device can be used as a workaround if call/kick is lost due to
virtualization software (e.g., kernel or QEMU) issue.

We may also implement the interface for VFIO PCI, e.g., to write to
VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM
on purpose. This is considered future work once the virtio part is done.


Below is from live crash analysis. Initially, the queue=2 has count=15 for
'kick' eventfd_ctx. Suppose there is data in vring avail while there is no
used available. We suspect this is because vhost-scsi was not notified by
VM. In order to narrow down and analyze the issue, we use live crash to
dump the current counter of eventfd for queue=2.

crash> eventfd_ctx 8f67f6bbe700
struct eventfd_ctx {
  kref = {
refcount = {
  refs = {
counter = 4
  }
}
  }, 
  wqh = {
lock = {
  {
rlock = {
  raw_lock = {
val = {
  counter = 0
}
  }
}
  }
}, 
head = {
  next = 0x8f841dc08e18, 
  prev = 0x8f841dc08e18
}
  }, 
  count = 15, ---> eventfd is 15 !!!
  flags = 526336
}

Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic
with this interface.

{ "execute": "x-debug-device-event",
  "arguments": { "dev": "/machine/peripheral/vscsi0",
 "event": "kick", "queue": 2 } }

The counter is increased to 16. Suppose the hang issue is resolved, it
indicates something bad is in software that the 'kick' is lost.

crash> eventfd_ctx 8f67f6bbe700
struct eventfd_ctx {
  kref = {
refcount = {
  refs = {
counter = 4
  }
}
  },
  wqh = {
lock = {
  {
rlock = {
  raw_lock = {
val = {
  counter = 0
}
  }
}
  }
},
head = {
  next = 0x8f841dc08e18,
  prev = 0x8f841dc08e18
}
  },
  count = 16, ---> eventfd incremented to 16 !!!
  flags = 526336
}


Original RFC link:

https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html

Changed since RFC:
  - add support for more virtio/vhost pci devices
  - add log (toggled by DEBUG_VIRTIO_EVENT) to virtio.c to say that this 
mischeivous command had been used
  - fix grammer error (s/lost/loss/)
  - change version to 6.1
  - fix incorrect example in qapi/qdev.json
  - manage event types with enum/array, instead of hard coding


Dongli Zhang (6):
   qdev: introduce qapi/hmp command for kick/call event
   virtio: introduce helper function for kick/call device event
   virtio-blk-pci: implement device event interface for kick/call
   virtio-scsi-pci: implement device event interface for kick/call
   vhost-scsi-pci: implement device event interface for kick/call
   virtio-net-pci: implement device event interface for kick/call

 hmp-commands.hx | 14 
 hw/block/virtio-blk.c   |  9 +
 hw/net/virtio-net.c |  9 +
 hw/scsi/vhost-scsi.c|  6 
 hw/scsi/virtio-scsi.c   |  9 +
 hw/virtio/vhost-scsi-pci.c  | 10 ++
 hw/virtio/virtio-blk-pci.c  | 10 ++
 hw/virtio/virtio-net-pci.c  | 10 ++
 hw/virtio/virtio-scsi-pci.c | 10 ++
 hw/virtio/virtio.c  | 64 
 include/hw/qdev-core.h  |  9 +
 include/hw/virtio/vhost-scsi.h  |  3 ++
 include/hw/virtio/virtio-blk.h  |  2 ++
 include/hw/virtio/virtio-net.h  |  3 ++
 include/hw/virtio/virtio-scsi.h |  3 ++
 include/hw/virtio/virtio.h  |  3 ++
 include/monitor/hmp.h   |  1 +
 qapi/qdev.json  | 30 +
 softmmu/qdev-monitor.c  | 56 +++
 19 files changed, 261 insertions(+)


I did tests with below cases.

- virtio-blk-pci (ioeventfd on/off, iothread, live migration)
- virtio-scsi-pci (ioeventfd on/off)
- vhost-scsi-pci
- virtio-net-pci (ioeventfd on/off, vhost)

Thank you very much!

Dongli Zhang

Re: [PATCH RFC 0/2] Add debug interface to kick/call on purpose

2021-01-19 Thread Dongli Zhang




On 1/18/21 8:59 AM, Dr. David Alan Gilbert wrote:
> * Daniel P. BerrangÃ© (berra...@redhat.com) wrote:
>> On Thu, Jan 14, 2021 at 04:27:28PM -0800, Dongli Zhang wrote:
>>> The virtio device/driver (e.g., vhost-scsi and indeed any device including
>>> e1000e) may hang due to the lost of IRQ or the lost of doorbell register
>>> kick, e.g.,
>>>
>>> https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html__;!!GqivPVa7Brio!K_zaQzJhlvPjRZe9efEtyX8vB6fMlKQeNy_RGz7oPp9k76pC8zarG1nSs1SFSL2xI1g$
>>>  
>>>
>>> The virtio-net was in trouble in above link because the 'kick' was not
>>> taking effect (missed).
>>>
>>> This RFC adds a new debug interface 'DeviceEvent' to DeviceClass to help
>>> narrow down if the issue is due to lost of irq/kick. So far the new
>>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>>> e1000e or vhost-scsi) may implement (e.g., via eventfd, MSI-X or legacy
>>> IRQ).
>>>
>>> The 'call' is to inject irq on purpose by admin for a specific device (e.g.,
>>> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell
>>> on purpose by admin at QEMU/host side for a specific device.
>>
>> I'm really not convinced that we want to give admins the direct ability to
>> poke at internals of devices in a running QEMU. It feels like there is way
>> too much potential for the admin to make a situation far worse by doing
>> the wrong thing here,
> 
> We already do have commands to write to an iport, and to inject MCEs for
> example; is this that much different?
> 
>> and people dealing with support tickets will have
>> no idea that the admin has been poking internals of the device and broken
>> it by doing something wrong.
> 
> You could add a one time log entry to say that this mischeivous command
> had been used.
> 
>> You pointed to bug that hit where this could conceivably be useful, but
>> that's a one time issue and should not a common occurrance that justifies
>> making an official public API to poke at devices forever more IMHO.
> 
> I think where it might be practically useful is if you were debugging a
> hung customers VM and need to find a way to get it to move again.
> THat's something I'm not familiar with on the virtio side;
> mst - is this useful from a virtio side?

BTW, the linux kernel blk-mq has similar idea/interface. To run the below will
be able to 'run' the block IO queue on purpose.

echo "kick" > /sys/kernel/debug/block/sda/state

It is helpful for diagnostic if we assume the IO stall is due to an unknown race
that a 'run' of queue is missing.

Dongli Zhang

> 
> Dave
> 
>> Regards,
>> Daniel
>> --

1 2 >

1 - 100 of 138 matches

Mail list logo