Re: [PATCH 0/2] P11 support for QEMU

2024-04-01 Thread Cédric Le Goater

Hello Aditya,

On 4/2/24 08:39, Aditya Gupta wrote:

Hello Cédric,

Thanks for reviewing this.

On Mon, Apr 01, 2024 at 10:25:31AM +0200, Cédric Le Goater wrote:

Hello Aditya,

Please run ./scripts/get_maintainer.pl when sending a series. qemu-ppc should be
in Cc:


Tried it now. For some reason, get_maintainer.pl shows no maintainers:

 $ ./scripts/get_maintainer.pl -f 
0002-ppc-powernv11-add-base-support-for-P11-PowerNV.patch
 get_maintainer.pl: No maintainers found, printing recent contributors.
 get_maintainer.pl: Do not blindly cc: them on patches!  Use common sense.
 
 qemu-devel@nongnu.org (open list:All patches CC here)


Weird. I downloaded your series with b4 and ran the get_maintainer.pl script:

$ ./scripts/get_maintainer.pl 
20240401_adityag_p11_support_for_qemu.patches/0001_ppc_pseries_add_p11_cpu_type.patch
 
20240401_adityag_p11_support_for_qemu.patches/0002_ppc_powernv11_add_base_support_for_p11_powernv.patch

Nicholas Piggin  (odd fixer:sPAPR (pseries))
Daniel Henrique Barboza  (reviewer:sPAPR (pseries))
David Gibson  (reviewer:sPAPR (pseries))
Harsh Prateek Bora  (reviewer:sPAPR (pseries))
"Cédric Le Goater"  (odd fixer:PowerNV Non-Virt...)
"Frédéric Barrat"  (reviewer:PowerNV Non-Virt...)
qemu-...@nongnu.org (open list:sPAPR (pseries))
qemu-devel@nongnu.org (open list:All patches CC here)

 

I checked the MAINTAINERS file, will add maintainers in Cc, thanks.



Briefly looking at this, please separate the changes using one patch per model,
that is: first CPU (target), LPC, OCC, PSI, SBE, PnvCore, SpaprCore. Last the
PnvChip and the machines, powernv11 and pseries. A minimum commit log describing
the HW is required.


Sure, I will split the changes and improve my commit descriptions.


I don't see PHB6 or XIVE3. Why ?


The Power11 core is the same as the Power10 core, so it supports up to PHB5 and
XIVE2, same as P10. That's why I have not added any code for them.


ok. That's typically the info the commit log should have.


Also, you will need an OPAL update. The above changes are pointless without it.
The minimum for now is a git commit from the opal repo, then you will need to
update QEMU with a binary.


Agreed. I will check when we can push it to the public repo, and will update
this in the next series.

There might be a few days' delay before the next patch series.


We have entered the QEMU 9.1 cycle. There is time. I will comment more on
the next respin.

Thanks,

C.




Re: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN

2024-04-01 Thread Wang, Lei
On 4/2/2024 0:13, Peter Xu wrote:
> On Fri, Mar 29, 2024 at 08:54:07AM +0000, Wang, Wei W wrote:
>> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote:
>>> When using the post-copy preemption feature to perform post-copy live
>>> migration, the below scenario could lead to a deadlock and the migration will
>>> never finish:
>>>
>>>  - Source connect() the preemption channel in postcopy_start().
>>>  - Source and the destination side TCP stack finished the 3-way handshake
>>>thus the connection is successful.
>>>  - The destination side main thread is handling the loading of the bulk RAM
>>>pages thus it doesn't start to handle the pending connection event in the
>>>event loop. and doesn't post the semaphore postcopy_qemufile_dst_done for
>>>the preemption thread.
>>>  - The source side sends non-iterative device states, such as the virtio
>>>states.
>>>  - The destination main thread starts to receive the virtio states, this
>>>process may lead to a page fault (e.g., virtio_load()->vring_avail_idx()
>>>may trigger a page fault since the avail ring page may not be received
>>>yet).
> 
> Ouch.  Yeah I think this part got overlooked when working on the preempt
> channel.
> 
>>>  - The page request is sent back to the source side. Source sends the page
>>>content to the destination side preemption thread.
>>>  - Since the event is not arrived and the semaphore
>>>postcopy_qemufile_dst_done is not posted, the preemption thread in
>>>destination side is blocked, and cannot handle receiving the page.
>>>  - The QEMU main load thread on the destination side is stuck at the page
>>>fault, and cannot yield and handle the connect() event for the
>>>preemption channel to unblock the preemption thread.
>>>  - The postcopy will be stuck there forever since this is a deadlock.
>>>
>>> The key point to reproduce this bug is that the source side is sending pages at a
>>> rate faster than the destination handling, otherwise, the qemu_get_be64() in
>>> ram_load_precopy() will have a chance to yield since at that time there are no
>>> pending data in the buffer to get. This will make this bug harder to be
>>> reproduced.
> 
> How hard would this reproduce?

We can manually make this easier to reproduce by adding the following code to
make the destination busier when loading the pages:

diff --git a/migration/ram.c b/migration/ram.c
index 0ad9fbba48..0b42877e1f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4232,6 +4232,7 @@ static int ram_load_precopy(QEMUFile *f)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
     int flags = 0, ret = 0, invalid_flags = 0, len = 0, i = 0;
+    volatile unsigned long long a;
 
     if (!migrate_compress()) {
         invalid_flags |= RAM_SAVE_FLAG_COMPRESS_PAGE;
@@ -4347,6 +4348,7 @@ static int ram_load_precopy(QEMUFile *f)
             break;
 
         case RAM_SAVE_FLAG_PAGE:
+            for (a = 0; a < 1; a++);
             qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
             break;

> 
> I'm thinking whether this should be 9.0 material or 9.1.  It's pretty late
> for 9.0 though, but we can still discuss.
> 
>>>
>>> Fix this by yielding the load coroutine when receiving
>>> MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the
>>> connection event before loading the non-iterative devices state to avoid the
>>> deadlock condition.
>>>
>>> Signed-off-by: Lei Wang 
>>
>> This seems to be a regression issue caused by this commit:
>> 737840e2c6ea (migration: Use the number of transferred bytes directly)
>>
>> Adding qemu_fflush back to migration_rate_exceeded() or ram_save_iterate
>> seems to work (might not be a good fix though).
>>
>>> ---
>>>  migration/savevm.c | 5 +++++
>>>  1 file changed, 5 insertions(+)
>>>
>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>> index e386c5267f..8fd4dc92f2 100644
>>> --- a/migration/savevm.c
>>> +++ b/migration/savevm.c
>>> @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f)
>>>          return loadvm_postcopy_handle_advise(mis, len);
>>>
>>>      case MIG_CMD_POSTCOPY_LISTEN:
>>> +        if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
>>> +            aio_co_schedule(qemu_get_current_aio_context(),
>>> +                            qemu_coroutine_self());
>>> +            qemu_coroutine_yield();
>>> +        }
>>
>> The above could be moved to loadvm_postcopy_handle_listen().
> 
> I'm not 100% sure such a thing (no matter here or moved into it, which does
> look cleaner) would work for us.
> 
> The problem is I still don't yet see an ordering restricted on top of (1)
> accept() happens, and (2) receive LISTEN cmd here.  What happens if the
> accept() request is not yet received when reaching LISTEN?  Or is it always
> guaranteed the accept(fd) will always be polled here?
> 
> For example, the source QEMU (no matter pre-7.2 or later) will always set up
> the preempt channel asynchronously, then IIUC it can conne

Re: [PATCH] virtio-net: fix qemu set used ring flag even vhost started

2024-04-01 Thread Michael S. Tsirkin
On Tue, Apr 02, 2024 at 12:51:09PM +0800, Yajun Wu wrote:
> When vhost-user or vhost-kernel is handling virtio net datapath, qemu
> should not touch used ring.
> 
> But in the vhost-user socket reconnect scenario, in a very rare case (with a
> pending kick event), VRING_USED_F_NO_NOTIFY is set by qemu in the
> following code path:
> 
>   #0  virtio_queue_split_set_notification (vq=0x7ff5f4c920a8, enable=0) at ../hw/virtio/virtio.c:511
>   #1  0x559d6dbf033b in virtio_queue_set_notification (vq=0x7ff5f4c920a8, enable=0) at ../hw/virtio/virtio.c:576
>   #2  0x559d6dbbbdbc in virtio_net_handle_tx_bh (vdev=0x559d703a6aa0, vq=0x7ff5f4c920a8) at ../hw/net/virtio-net.c:2801
>   #3  0x559d6dbf4791 in virtio_queue_notify_vq (vq=0x7ff5f4c920a8) at ../hw/virtio/virtio.c:2248
>   #4  0x559d6dbf79da in virtio_queue_host_notifier_read (n=0x7ff5f4c9211c) at ../hw/virtio/virtio.c:3525
>   #5  0x559d6d9a5814 in virtio_bus_cleanup_host_notifier (bus=0x559d703a6a20, n=1) at ../hw/virtio/virtio-bus.c:321
>   #6  0x559d6dbf83c9 in virtio_device_stop_ioeventfd_impl (vdev=0x559d703a6aa0) at ../hw/virtio/virtio.c:3774
>   #7  0x559d6d9a55c8 in virtio_bus_stop_ioeventfd (bus=0x559d703a6a20) at ../hw/virtio/virtio-bus.c:259
>   #8  0x559d6d9a53e8 in virtio_bus_grab_ioeventfd (bus=0x559d703a6a20) at ../hw/virtio/virtio-bus.c:199
>   #9  0x559d6dbf841c in virtio_device_grab_ioeventfd (vdev=0x559d703a6aa0) at ../hw/virtio/virtio.c:3783
>   #10 0x559d6d9bde18 in vhost_dev_enable_notifiers (hdev=0x559d707edd70, vdev=0x559d703a6aa0) at ../hw/virtio/vhost.c:1592
>   #11 0x559d6d89a0b8 in vhost_net_start_one (net=0x559d707edd70, dev=0x559d703a6aa0) at ../hw/net/vhost_net.c:266
>   #12 0x559d6d89a6df in vhost_net_start (dev=0x559d703a6aa0, ncs=0x559d7048d890, data_queue_pairs=31, cvq=0) at ../hw/net/vhost_net.c:412
>   #13 0x559d6dbb5b89 in virtio_net_vhost_status (n=0x559d703a6aa0, status=15 '\017') at ../hw/net/virtio-net.c:311
>   #14 0x559d6dbb5e34 in virtio_net_set_status (vdev=0x559d703a6aa0, status=15 '\017') at ../hw/net/virtio-net.c:392
>   #15 0x559d6dbb60d8 in virtio_net_set_link_status (nc=0x559d7048d890) at ../hw/net/virtio-net.c:455
>   #16 0x559d6da64863 in qmp_set_link (name=0x559d6f0b83d0 "hostnet1", up=true, errp=0x7ffdd76569f0) at ../net/net.c:1459
>   #17 0x559d6da7226e in net_vhost_user_event (opaque=0x559d6f0b83d0, event=CHR_EVENT_OPENED) at ../net/vhost-user.c:301
>   #18 0x559d6ddc7f63 in chr_be_event (s=0x559d6f2ffea0, event=CHR_EVENT_OPENED) at ../chardev/char.c:62
>   #19 0x559d6ddc7fdc in qemu_chr_be_event (s=0x559d6f2ffea0, event=CHR_EVENT_OPENED) at ../chardev/char.c:82
> 
> This issue causes the guest kernel to stop kicking the device and traffic to stop.
> 
> Add a vhost_started check in virtio_net_handle_tx_bh to avoid this wrong
> VRING_USED_F_NO_NOTIFY setting.
> 
> Signed-off-by: Yajun Wu 
> Reviewed-by: Jiri Pirko 


Acked-by: Michael S. Tsirkin 

> ---
>  hw/net/virtio-net.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index a6ff000cd9..8035e01fdf 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -2865,6 +2865,10 @@ static void virtio_net_handle_tx_bh(VirtIODevice *vdev, VirtQueue *vq)
>      VirtIONet *n = VIRTIO_NET(vdev);
>      VirtIONetQueue *q = &n->vqs[vq2q(virtio_get_queue_index(vq))];
>  
> +    if (n->vhost_started) {
> +        return;
> +    }
> +
>      if (unlikely((n->status & VIRTIO_NET_S_LINK_UP) == 0)) {
>          virtio_net_drop_tx_queue_data(vdev, vq);
>          return;
> -- 
> 2.27.0




Re: [PATCH for-9.0] Fix some typos in documentation (found by codespell)

2024-04-01 Thread Luc Michel
On 18:15 Sun 31 Mar , Stefan Weil wrote:
> Signed-off-by: Stefan Weil 

Reviewed-by: Luc Michel 

> ---
>  docs/devel/atomics.rst | 2 +-
>  docs/devel/ci-jobs.rst.inc | 2 +-
>  docs/devel/clocks.rst  | 2 +-
>  docs/system/i386/sgx.rst   | 2 +-
>  qapi/qom.json  | 2 +-
>  5 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/docs/devel/atomics.rst b/docs/devel/atomics.rst
> index ff9b5ee30c..b77c6e13e1 100644
> --- a/docs/devel/atomics.rst
> +++ b/docs/devel/atomics.rst
> @@ -119,7 +119,7 @@ The only guarantees that you can rely upon in this case are:
>ordinary accesses instead cause data races if they are concurrent with
>other accesses of which at least one is a write.  In order to ensure this,
>   the compiler will not optimize accesses out of existence, create unsolicited
> -  accesses, or perform other similar optimzations.
> +  accesses, or perform other similar optimizations.
>  
>  - acquire operations will appear to happen, with respect to the other
>components of the system, before all the LOAD or STORE operations
> diff --git a/docs/devel/ci-jobs.rst.inc b/docs/devel/ci-jobs.rst.inc
> index ec33e6ee2b..be06322279 100644
> --- a/docs/devel/ci-jobs.rst.inc
> +++ b/docs/devel/ci-jobs.rst.inc
> @@ -115,7 +115,7 @@ CI pipeline.
>  QEMU_JOB_SKIPPED
>  
>  
> -The job is not reliably successsful in general, so is not
> +The job is not reliably successful in general, so is not
>  currently suitable to be run by default. Ideally this should
>  be a temporary marker until the problems can be addressed, or
>  the job permanently removed.
> diff --git a/docs/devel/clocks.rst b/docs/devel/clocks.rst
> index b2d1148cdb..177ee1c90d 100644
> --- a/docs/devel/clocks.rst
> +++ b/docs/devel/clocks.rst
> @@ -279,7 +279,7 @@ You can change the multiplier and divider of a clock at runtime,
>  so you can use this to model clock controller devices which
>  have guest-programmable frequency multipliers or dividers.
>  
> -Similary to ``clock_set()``, ``clock_set_mul_div()`` returns ``true`` if
> +Similarly to ``clock_set()``, ``clock_set_mul_div()`` returns ``true`` if
>  the clock state was modified; that is, if the multiplier or the diviser
>  or both were changed by the call.
>  
> diff --git a/docs/system/i386/sgx.rst b/docs/system/i386/sgx.rst
> index 0f0a73f758..c293f7f44e 100644
> --- a/docs/system/i386/sgx.rst
> +++ b/docs/system/i386/sgx.rst
> @@ -6,7 +6,7 @@ Overview
>  
>  Intel Software Guard eXtensions (SGX) is a set of instructions and mechanisms
>  for memory accesses in order to provide security accesses for sensitive
> -applications and data. SGX allows an application to use it's pariticular
> +applications and data. SGX allows an application to use its particular
>  address space as an *enclave*, which is a protected area provides confidentiality
>  and integrity even in the presence of privileged malware. Accesses to the
>  enclave memory area from any software not resident in the enclave are prevented,
> diff --git a/qapi/qom.json b/qapi/qom.json
> index 8d4ca8ed92..85e6b4f84a 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -802,7 +802,7 @@
>  #
>  # @fd: file descriptor name previously passed via 'getfd' command,
>  # which represents a pre-opened /dev/iommu.  This allows the
> -# iommufd object to be shared accross several subsystems (VFIO,
> +# iommufd object to be shared across several subsystems (VFIO,
>  # VDPA, ...), and the file descriptor to be shared with other
>  # process, e.g. DPDK.  (default: QEMU opens /dev/iommu by itself)
>  #
> -- 
> 2.39.2
> 

-- 



Re: [PATCH 0/2] P11 support for QEMU

2024-04-01 Thread Aditya Gupta
Hello Cédric,

Thanks for reviewing this.

On Mon, Apr 01, 2024 at 10:25:31AM +0200, Cédric Le Goater wrote:
> Hello Aditya,
> 
> Please run ./scripts/get_maintainer.pl when sending a series. qemu-ppc should be
> in Cc:

Tried it now. For some reason, get_maintainer.pl shows no maintainers:

$ ./scripts/get_maintainer.pl -f 
0002-ppc-powernv11-add-base-support-for-P11-PowerNV.patch
get_maintainer.pl: No maintainers found, printing recent contributors.
get_maintainer.pl: Do not blindly cc: them on patches!  Use common sense.

qemu-devel@nongnu.org (open list:All patches CC here)

I checked the MAINTAINERS file, will add maintainers in Cc, thanks.

> 
> Briefly looking at this, please separate the changes using one patch per model,
> that is: first CPU (target), LPC, OCC, PSI, SBE, PnvCore, SpaprCore. Last the
> PnvChip and the machines, powernv11 and pseries. A minimum commit log describing
> the HW is required.

Sure, I will split the changes and improve my commit descriptions.

> I don't see PHB6 or XIVE3. Why ?

The Power11 core is the same as the Power10 core, so it supports up to PHB5 and
XIVE2, same as P10. That's why I have not added any code for them.

> 
> Also, you will need an OPAL update. The above changes are pointless without it.
> The minimum for now is a git commit from the opal repo, then you will need to
> update QEMU with a binary.

Agreed. I will check when we can push it to the public repo, and will update
this in the next series.

There might be a few days' delay before the next patch series.

Thanks

> 
> Thanks,
> 
> C.
> 
> On 4/1/24 07:55, Aditya Gupta wrote:
> > This patch series adds support for Power11 pseries and powernv machine targets
> > to emulate VMs running on Power11.
> > 
> > Most of the P11 support code has been taken from P10 code in QEMU.
> > And has been tested in pseries, powernv, with and without compat mode.
> > 
> > Git Tree for Testing: https://github.com/adi-g15-ibm/qemu/tree/p11
> > 
> > Aditya Gupta (2):
> >ppc: pseries: add P11 cpu type
> >ppc: powernv11: add base support for P11 PowerNV
> > 
> >   docs/system/ppc/pseries.rst |   6 +-
> >   hw/ppc/pnv.c| 409 
> >   hw/ppc/pnv_core.c   |  94 +
> >   hw/ppc/pnv_homer.c  |  64 ++
> >   hw/ppc/pnv_lpc.c|  14 ++
> >   hw/ppc/pnv_occ.c|  14 ++
> >   hw/ppc/pnv_psi.c|  21 ++
> >   hw/ppc/pnv_sbe.c|  19 ++
> >   hw/ppc/spapr_cpu_core.c |   1 +
> >   include/hw/ppc/pnv.h|  51 +
> >   include/hw/ppc/pnv_chip.h   |  30 +++
> >   include/hw/ppc/pnv_homer.h  |   3 +
> >   include/hw/ppc/pnv_lpc.h|   4 +
> >   include/hw/ppc/pnv_occ.h|   2 +
> >   include/hw/ppc/pnv_psi.h|   2 +
> >   include/hw/ppc/pnv_sbe.h|   2 +
> >   include/hw/ppc/pnv_xscom.h  |  55 +
> >   target/ppc/compat.c |   7 +
> >   target/ppc/cpu-models.c |   2 +
> >   target/ppc/cpu-models.h |   2 +
> >   target/ppc/cpu_init.c   | 162 ++
> >   21 files changed, 961 insertions(+), 3 deletions(-)
> > 
> 



Re: [PATCH 2/3] target/hppa: mask offset bits in gva

2024-04-01 Thread Sven Schnelle
Richard Henderson  writes:

> On 4/1/24 20:01, Sven Schnelle wrote:
>> Implement dr2 and the mfdiag/mtdiag instructions. dr2 contains a bit
>> which enables/disables space id hashing. Seabios would then set
>> this bit when booting. Linux would disable it again during boot (this
>> would be the same like on real hardware), while HP-UX would leave it
>> enabled.
>
> Pointer to documentation?

There's no public documentation about that. This code has been in the Linux
kernel since the beginning of Linux on hppa (arch/parisc/kernel/pacache.S):

/* Disable Space Register Hashing for PCXL */

.word 0x141c0600  /* mfdiag %dr0, %r28 */
depwi   0,28,2, %r28  /* Clear DHASH_EN & IHASH_EN */
.word 0x141c0240  /* mtdiag %r28, %dr0 */
b,n   srdis_done

srdis_pa20:

/* Disable Space Register Hashing for PCXU,PCXU+,PCXW,PCXW+,PCXW2 */

.word 0x144008bc/* mfdiag %dr2, %r28 */
depdi 0, 54,1, %r28 /* clear DIAG_SPHASH_ENAB (bit 54) */
.word 0x145c1840/* mtdiag %r28, %dr2 */

So PCXL (32 bit) uses dr0, while 64 bit uses dr2. This is still the same
on my C8000 - I see the firmware still contains code reading dr2 to figure
out whether space id hashing is enabled. The mfdiag/mtdiag instructions
are described in the PCXL/PCXL2 ERS.

https://parisc.wiki.kernel.org/index.php/File:PCXL_ers.pdf
https://parisc.wiki.kernel.org/index.php/File:Pcxl2_ers.pdf

There was a discussion mentioning disabling Space ID hashing in Linux:

https://yhbt.net/lore/linux-parisc/199912161642.iaa11...@lucy.cup.hp.com/




Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init

2024-04-01 Thread Si-Wei Liu




On 2/14/2024 11:11 AM, Eugenio Perez Martin wrote:

On Wed, Feb 14, 2024 at 7:29 PM Si-Wei Liu  wrote:

Hi Michael,

On 2/13/2024 2:22 AM, Michael S. Tsirkin wrote:

On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote:

Hi Eugenio,

I thought this new code looks good to me and the original issue I saw with
x-svq=on should be gone. However, after rebase my tree on top of this,
there's a new failure I found around setting up guest mappings at early
boot, please see attached the specific QEMU config and corresponding event
traces. Haven't checked into the details yet, but thought you would need to
be aware of it ahead of time.

Regards,
-Siwei

Eugenio, were you able to reproduce? Siwei, did you have time to
look into this?

Didn't get a chance to look into the details yet in the past week, but
thought it may have something to do with the (internals of) iova tree
range allocation and the lookup routine. It started to fall apart at the
first vhost_vdpa_dma_unmap call showing up in the trace events, where it
should've gotten IOVA=0x201000, but an incorrect IOVA address
0x1000 ended up being returned from the iova tree lookup routine.

HVA                                         GPA                     IOVA
------------------------------------------------------------------------
Map
[0x7f7903e0, 0x7f7983e0)    [0x0, 0x8000)           [0x1000, 0x8000)
[0x7f7983e0, 0x7f9903e0)    [0x1, 0x208000)         [0x80001000, 0x201000)
[0x7f7903ea, 0x7f7903ec)    [0xfeda, 0xfedc)        [0x201000, 0x221000)

Unmap
[0x7f7903ea, 0x7f7903ec)    [0xfeda, 0xfedc)        [0x1000, 0x2) ???
                                shouldn't it be [0x201000, 0x221000) ???

It looks like the SVQ iova tree lookup routine vhost_iova_tree_find_iova(),
which is called from vhost_vdpa_listener_region_del(), can't properly
deal with overlapped regions. Specifically, q35's mch_realize() has the
following:


579 memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high",
580                          mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
581                          MCH_HOST_BRIDGE_SMRAM_C_SIZE);
582 memory_region_add_subregion_overlap(mch->system_memory, 0xfeda,
583                                     &mch->open_high_smram, 1);
584 memory_region_set_enabled(&mch->open_high_smram, false);

#0  0x564c30bf6980 in iova_tree_find_address_iterator (key=0x564c331cf8e0, value=0x564c331cf8e0, data=0x7fffb6d749b0) at ../util/iova-tree.c:96
#1  0x7f5f66479654 in g_tree_foreach () at /lib64/libglib-2.0.so.0
#2  0x564c30bf6b53 in iova_tree_find_iova (tree=<optimized out>, map=map@entry=0x7fffb6d74a00) at ../util/iova-tree.c:114
#3  0x564c309da0a9 in vhost_iova_tree_find_iova (tree=<optimized out>, map=map@entry=0x7fffb6d74a00) at ../hw/virtio/vhost-iova-tree.c:70
#4  0x564c3085e49d in vhost_vdpa_listener_region_del (listener=0x564c331024c8, section=0x7fffb6d74aa0) at ../hw/virtio/vhost-vdpa.c:444
#5  0x564c309f4931 in address_space_update_topology_pass (as=as@entry=0x564c31ab1840 <address_space_memory>, old_view=old_view@entry=0x564c33364cc0, new_view=new_view@entry=0x564c333640f0, adding=adding@entry=false) at ../system/memory.c:977
#6  0x564c309f4dcd in address_space_set_flatview (as=0x564c31ab1840 <address_space_memory>) at ../system/memory.c:1079
#7  0x564c309f86d0 in memory_region_transaction_commit () at ../system/memory.c:1132
#8  0x564c309f86d0 in memory_region_transaction_commit () at ../system/memory.c:1117
#9  0x564c307cce64 in mch_realize (d=<optimized out>, errp=<optimized out>) at ../hw/pci-host/q35.c:584


However, it looks like iova_tree_find_address_iterator() only checks whether
the translated address (HVA) falls into the range when trying to locate
the desired IOVA, causing the first DMAMap that happens to overlap in
the translated address (HVA) space to be returned prematurely:


 89 static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
 90                                                 gpointer data)
 91 {
  :
  :
 99     if (map->translated_addr + map->size < needle->translated_addr ||
100         needle->translated_addr + needle->size < map->translated_addr) {
101         return false;
102     }
103
104     args->result = map;
105     return true;
106 }

In the QEMU trace file, it reveals that the first DMAMap below gets
returned incorrectly instead of the second, the latter of which is what the
actual IOVA corresponds to:


HVA                                         GPA                     IOVA
[0x7f7903e0, 0x7f7983e0)    [0x0, 0x8000)           [0x1000, 0x80001000)
[0x7f7903ea, 0x7f7903ec)    [0xfeda, 0xfedc)        [0x201000, 0x221000)


Maybe, other than checking the HVA range, we should also match the GPA, or at
least require the size to match exactly?



Yes, I'm still not able to reproduce. In particular, I don't know
how the memory listener adds a region an

Re: [PATCH 2/3] target/hppa: mask offset bits in gva

2024-04-01 Thread Richard Henderson

On 4/1/24 20:01, Sven Schnelle wrote:

Implement dr2 and the mfdiag/mtdiag instructions. dr2 contains a bit
which enables/disables space id hashing. Seabios would then set
this bit when booting. Linux would disable it again during boot (this
would be the same like on real hardware), while HP-UX would leave it
enabled.


Pointer to documentation?


r~



Re: [PATCH 2/3] target/hppa: mask offset bits in gva

2024-04-01 Thread Sven Schnelle
Richard,

Sven Schnelle  writes:

> Richard Henderson  writes:
>
>> On 3/23/24 22:09, Sven Schnelle wrote:
>>> The CPU seems to mask a few bits in the offset when running
>>> under HP-UX. ISR/IOR register contents for an address in
>>> the processor HPA (0xfffa) on my C8000 and J6750:
>>> running on Linux: 3fff c000fffa0500
>>> running on HP-UX: 301f c000fffa0500
>>> I haven't found how this is switched (guess some diag in the
>>> firmware), but linux + seabios seems to handle that as well,
>>> so lets mask out the additional bits.
>>> Signed-off-by: Sven Schnelle 
>>> [..]
>> [..]
>> Though my argument would suggest the mask should be 0xff for the
>> 40-bit physical address, which is not what you see at all, so perhaps
>> the thing is moot.  I am at a loss to explain why or how HP-UX gets a
>> 7-bit hole in the ISR result.
>>
>> On the other hand, there are some not-well-documented shenanigans (aka
>> implementation defined behaviour) between Figure H-8 and Figure H-11,
>> where the 62-bit absolute address is expanded to a 64-bit logical
>> physical address and then compacted to a 40-bit implementation
>> physical address.
>>
>> We've already got hacks in place for this in hppa_abs_to_phys_pa2_w1,
>> which just truncates everything down to 40 bits.  But that's probably
>> not what the processor is really doing.
>
> I looked into this again, and it's caused by Space-ID hashing. HP-UX asks
> PDC/Firmware how many bits are used for the hashing. seabios returns
> zero, in which case HP-UX uses a default mask of 0xf01f.
> By modifying seabios, I can make HP-UX use the appropriate mask, but
> switching off Space-ID hashing entirely is impossible. The reason why
> the CPU doesn't strip the bits when running Linux is that Linux switches
> off Space-ID hashing early in the startup code (before mm gets
> initialized).
>
> My J6750 Firmware only returns two values: 0 when Space-ID hashing is
> off, 0xfe0 when it is enabled. This is hardcoded in the firmware - the
> only thing PDC checks is a bit in Debug Register 2, which enables
> Space-ID hashing. 0xfe0 matches the 0xf01f... mask used by HP-UX
> pretty well.
>
> So if qemu wants to run 64 Bit HP-UX the proper way, i guess it needs
> to implement Space-ID hashing.

I wonder whether it would be acceptable to implement this masking in a
switchable way?

This would mean:

Implement dr2 and the mfdiag/mtdiag instructions. dr2 contains a bit
which enables/disables space id hashing. Seabios would then set
this bit when booting. Linux would disable it again during boot (this
would be the same like on real hardware), while HP-UX would leave it
enabled.



Re: [PATCH] target/hppa: Fix IIAOQ, IIASQ for pa2.0

2024-04-01 Thread Sven Schnelle
Richard Henderson  writes:

> The contents of IIAOQ depend on PSW_W.
> Follow the text in "Interruption Instruction Address Queues",
> pages 2-13 through 2-15.
>
> Reported-by: Sven Schnelle 
> Fixes: b10700d826c ("target/hppa: Update IIAOQ, IIASQ for pa2.0")
> Signed-off-by: Richard Henderson 
> ---
>
> Sven, I looked again through IIAOQ documentation and it does seem
> like some of the bits are wrong, both on interrupt delivery and RFI.

Yes, this fixes my issues, thanks!

Tested-by: Sven Schnelle 



Re: [PATCH 1/2] CXL/cxl_type3: add first_dvsec_offset() helper

2024-04-01 Thread Zhijian Li (Fujitsu)


On 02/04/2024 12:09, fan wrote:
> On Tue, Apr 02, 2024 at 09:46:46AM +0800, Li Zhijian via wrote:
>> It helps to figure out where the first dvsec register is located. In
>> addition, replace hardcoded offsets and sizes with existing macros.
>>
>> Signed-off-by: Li Zhijian 
>> ---
>>   hw/mem/cxl_type3.c | 19 +++++++++++++------
>>   1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
>> index b0a7e9f11b64..ad2fe7d463fb 100644
>> --- a/hw/mem/cxl_type3.c
>> +++ b/hw/mem/cxl_type3.c
>> @@ -643,6 +643,16 @@ static DOEProtocol doe_cdat_prot[] = {
>>   { }
>>   };
>>   
>> +static uint16_t first_dvsec_offset(CXLType3Dev *ct3d)
>> +{
>> +    uint16_t offset = PCI_CONFIG_SPACE_SIZE;
>> +
>> +    if (ct3d->sn != UI64_NULL)
>> +        offset += PCI_EXT_CAP_DSN_SIZEOF;
>> +
>> +    return offset;
>> +}
>> +
>>   static void ct3_realize(PCIDevice *pci_dev, Error **errp)
>>   {
>>   ERRP_GUARD();
>> @@ -663,13 +673,10 @@ static void ct3_realize(PCIDevice *pci_dev, Error **errp)
>>   pci_config_set_prog_interface(pci_conf, 0x10);
>>   
>>   pcie_endpoint_cap_init(pci_dev, 0x80);
>> -    if (ct3d->sn != UI64_NULL) {
>> -        pcie_dev_ser_num_init(pci_dev, 0x100, ct3d->sn);
>> -        cxl_cstate->dvsec_offset = 0x100 + 0x0c;
>> -    } else {
>> -        cxl_cstate->dvsec_offset = 0x100;
>> -    }
>> +    if (ct3d->sn != UI64_NULL)
>> +        pcie_dev_ser_num_init(pci_dev, PCI_CONFIG_SPACE_SIZE, ct3d->sn);
>>  
>> +    cxl_cstate->dvsec_offset = first_dvsec_offset(ct3d);
>>   ct3d->cxl_cstate.pdev = pci_dev;
>>   build_dvsecs(ct3d);
>>   
>> -- 
>> 2.29.2
>>
>>
> Hi Zhijian,
> 
> Please use Qemu's checkpatch tool to make sure the patches meet the
> qemu code format requirement.


My mistake, any other input for these 2 patches?



> Also, please cc linux-...@vger.kernel.org if the code is cxl related.

Thanks for your reminder. Do you mind if I send a patch to add the
"L: linux-...@vger.kernel.org" field to the CXL entry?


Thanks
Zhijian
> 

> Fan

[PATCH] virtio-net: fix qemu set used ring flag even vhost started

2024-04-01 Thread Yajun Wu
When vhost-user or vhost-kernel is handling virtio net datapath, qemu
should not touch used ring.

But in the vhost-user socket reconnect scenario, in a very rare case (with a
pending kick event), VRING_USED_F_NO_NOTIFY is set by qemu in the
following code path:

    #0  virtio_queue_split_set_notification (vq=0x7ff5f4c920a8, enable=0) at ../hw/virtio/virtio.c:511
    #1  0x559d6dbf033b in virtio_queue_set_notification (vq=0x7ff5f4c920a8, enable=0) at ../hw/virtio/virtio.c:576
    #2  0x559d6dbbbdbc in virtio_net_handle_tx_bh (vdev=0x559d703a6aa0, vq=0x7ff5f4c920a8) at ../hw/net/virtio-net.c:2801
    #3  0x559d6dbf4791 in virtio_queue_notify_vq (vq=0x7ff5f4c920a8) at ../hw/virtio/virtio.c:2248
    #4  0x559d6dbf79da in virtio_queue_host_notifier_read (n=0x7ff5f4c9211c) at ../hw/virtio/virtio.c:3525
    #5  0x559d6d9a5814 in virtio_bus_cleanup_host_notifier (bus=0x559d703a6a20, n=1) at ../hw/virtio/virtio-bus.c:321
    #6  0x559d6dbf83c9 in virtio_device_stop_ioeventfd_impl (vdev=0x559d703a6aa0) at ../hw/virtio/virtio.c:3774
    #7  0x559d6d9a55c8 in virtio_bus_stop_ioeventfd (bus=0x559d703a6a20) at ../hw/virtio/virtio-bus.c:259
    #8  0x559d6d9a53e8 in virtio_bus_grab_ioeventfd (bus=0x559d703a6a20) at ../hw/virtio/virtio-bus.c:199
    #9  0x559d6dbf841c in virtio_device_grab_ioeventfd (vdev=0x559d703a6aa0) at ../hw/virtio/virtio.c:3783
    #10 0x559d6d9bde18 in vhost_dev_enable_notifiers (hdev=0x559d707edd70, vdev=0x559d703a6aa0) at ../hw/virtio/vhost.c:1592
    #11 0x559d6d89a0b8 in vhost_net_start_one (net=0x559d707edd70, dev=0x559d703a6aa0) at ../hw/net/vhost_net.c:266
    #12 0x559d6d89a6df in vhost_net_start (dev=0x559d703a6aa0, ncs=0x559d7048d890, data_queue_pairs=31, cvq=0) at ../hw/net/vhost_net.c:412
    #13 0x559d6dbb5b89 in virtio_net_vhost_status (n=0x559d703a6aa0, status=15 '\017') at ../hw/net/virtio-net.c:311
    #14 0x559d6dbb5e34 in virtio_net_set_status (vdev=0x559d703a6aa0, status=15 '\017') at ../hw/net/virtio-net.c:392
    #15 0x559d6dbb60d8 in virtio_net_set_link_status (nc=0x559d7048d890) at ../hw/net/virtio-net.c:455
    #16 0x559d6da64863 in qmp_set_link (name=0x559d6f0b83d0 "hostnet1", up=true, errp=0x7ffdd76569f0) at ../net/net.c:1459
    #17 0x559d6da7226e in net_vhost_user_event (opaque=0x559d6f0b83d0, event=CHR_EVENT_OPENED) at ../net/vhost-user.c:301
    #18 0x559d6ddc7f63 in chr_be_event (s=0x559d6f2ffea0, event=CHR_EVENT_OPENED) at ../chardev/char.c:62
    #19 0x559d6ddc7fdc in qemu_chr_be_event (s=0x559d6f2ffea0, event=CHR_EVENT_OPENED) at ../chardev/char.c:82

This issue causes the guest kernel to stop kicking the device, and traffic stops.

Add a vhost_started check in virtio_net_handle_tx_bh() to fix this wrong
VRING_USED_F_NO_NOTIFY setting.

Signed-off-by: Yajun Wu 
Reviewed-by: Jiri Pirko 
---
 hw/net/virtio-net.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index a6ff000cd9..8035e01fdf 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -2865,6 +2865,10 @@ static void virtio_net_handle_tx_bh(VirtIODevice *vdev, 
VirtQueue *vq)
 VirtIONet *n = VIRTIO_NET(vdev);
 VirtIONetQueue *q = &n->vqs[vq2q(virtio_get_queue_index(vq))];
 
+if (n->vhost_started) {
+return;
+}
+
 if (unlikely((n->status & VIRTIO_NET_S_LINK_UP) == 0)) {
 virtio_net_drop_tx_queue_data(vdev, vq);
 return;
-- 
2.27.0




Re: Intention to work on GSoC project

2024-04-01 Thread Sahil
Hi,

On Monday, April 1, 2024 11:53:11 PM IST daleyoung4...@gmail.com wrote:
> Hi,
> 
> On Monday, March 25, 2024 21:20:32 CST Sahil wrote:
> > Q1.
> > Section 2.7.4 of the virtio spec [3] states that in an available
> > descriptor, the "Element Length" stores the length of the buffer element.
> > In the next few lines, it also states that the "Element Length" is
> > reserved for used descriptors and is ignored by drivers. This sounds a
> > little contradictory given that drivers write available descriptors in the
> > descriptor ring.
> When VIRTQ_DESC_F_WRITE is set, the device will use "Element Length" to
> specify the length it writes. When VIRTQ_DESC_F_WRITE is not set, which
> means the buffer is read-only for the device, "Element Length" will not be
> changed by the device, so drivers just ignore it.


Thank you for the clarification. I think I misunderstood what I had read
in the virtio spec. What I have understood now is that "Element Length"
has different meanings for available and used descriptors.

Correct me if I am wrong - for available descriptors, it represents the
length of the buffer. For used descriptors, it represents the number of
bytes the device has written to the buffer if the buffer is device-writable;
otherwise it has no meaning and can hence be ignored by drivers.

> > Q2.
> > In the Red Hat article, just below the first listing ("Memory layout of a
> > packed virtqueue descriptor"), there's the following line referring to the
> > buffer id in
> > 
> > "virtq_desc":
> > > This time, the id field is not an index for the device to look for the
> > > buffer: it is an opaque value for it, only has meaning for the driver.
> > 
> > But the device returns the buffer id when it writes the used descriptor to
> > the descriptor ring. The "only has meaning for the driver" part has got me
> > a little confused. Which buffer id is this that the device returns? Is it
> > related to the buffer id in the available descriptor?
> 
> In my understanding, the buffer id is the tag the driver attaches to an
> available descriptor (or chain) so it can be identified when adding
> descriptors to the table. The device returns the buffer id from the
> processed descriptor - the last descriptor in a chain - and writes it to
> the descriptor slot that the used idx refers to (the first one in the
> chain). Then the used idx increments.
> 
> The Packed Virtqueue blog [1] is helpful, but some details in the examples
> are confusing me.
> 
> Q1.
> In the last step of the two-entry descriptor table example, it says both
> buffers #0 and #1 are available for the device. I understand descriptor[0]
> is available and descriptor[1] is not, but there is no ID #0 now. So did
> the device get buffer #0 via a notification beforehand? If so, does that
> mean buffer #0 will be lost when notifications are disabled?
> 

I have a similar question, and understanding the relation between buffer
ids in used and available descriptors might give more insight into this. For
available descriptors, the buffer id is used to associate descriptors with a
particular buffer. I am still not very sure about the ids in used descriptors.

Regarding Q1, both buffers #0 and #1 are available. In the mentioned figure,
both descriptor[0] and descriptor[1] are available. This figure follows the
figure with the caption "Using first buffer out of order". So, in the first
figure, the device reads buffer #1 and writes the used descriptor, but it
still has buffer #0 to read. Buffer #0 still belongs to the device, while
buffer #1 can now be handled by the driver once again. So, in the next figure,
the driver makes buffer #1 available again. The device can still read buffer
#0 from the previous batch of available descriptors.

Based on what I have understood, the driver can't touch the descriptor
corresponding to buffer #0 until the device acknowledges it. I did find the
figure a little confusing as well. I think once the meaning of buffer id is 
clear
from the driver's and device's perspective, it'll be easier to understand the
figure.

I am also not very sure about what happens when notifications are disabled.
I'll have to read up on that again. But I believe the driver still won't be 
able to
touch #0 until the device uses it.

I think going through the source will explain these concepts better.

Thanks,
Sahil






Re: [PATCH 1/2] CXL/cxl_type3: add first_dvsec_offset() helper

2024-04-01 Thread fan
On Tue, Apr 02, 2024 at 09:46:46AM +0800, Li Zhijian via wrote:
> It helps figure out where the first DVSEC register is located. In
> addition, replace hardcoded offsets and sizes with existing macros.
> 
> Signed-off-by: Li Zhijian 
> ---
>  hw/mem/cxl_type3.c | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index b0a7e9f11b64..ad2fe7d463fb 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -643,6 +643,16 @@ static DOEProtocol doe_cdat_prot[] = {
>  { }
>  };
>  
> +static uint16_t first_dvsec_offset(CXLType3Dev *ct3d)
> +{
> +uint16_t offset = PCI_CONFIG_SPACE_SIZE;
> +
> +if (ct3d->sn != UI64_NULL)
> +offset += PCI_EXT_CAP_DSN_SIZEOF;
> +
> +return offset;
> +}
> +
>  static void ct3_realize(PCIDevice *pci_dev, Error **errp)
>  {
>  ERRP_GUARD();
> @@ -663,13 +673,10 @@ static void ct3_realize(PCIDevice *pci_dev, Error 
> **errp)
>  pci_config_set_prog_interface(pci_conf, 0x10);
>  
>  pcie_endpoint_cap_init(pci_dev, 0x80);
> -if (ct3d->sn != UI64_NULL) {
> -pcie_dev_ser_num_init(pci_dev, 0x100, ct3d->sn);
> -cxl_cstate->dvsec_offset = 0x100 + 0x0c;
> -} else {
> -cxl_cstate->dvsec_offset = 0x100;
> -}
> +if (ct3d->sn != UI64_NULL)
> +pcie_dev_ser_num_init(pci_dev, PCI_CONFIG_SPACE_SIZE, ct3d->sn);
>  
> +cxl_cstate->dvsec_offset = first_dvsec_offset(ct3d);
>  ct3d->cxl_cstate.pdev = pci_dev;
>  build_dvsecs(ct3d);
>  
> -- 
> 2.29.2
> 
> 
Hi Zhijian,

Please use QEMU's checkpatch tool to make sure the patches meet the
QEMU coding style requirements.
Also, please cc linux-...@vger.kernel.org if the code is CXL related.

Fan



Re: [RFC QEMU PATCH v8 2/2] virtio-pci: implement No_Soft_Reset bit

2024-04-01 Thread Chen, Jiqian
On 2024/3/29 18:44, Michael S. Tsirkin wrote:
> On Fri, Mar 29, 2024 at 03:20:59PM +0800, Jason Wang wrote:
>> On Fri, Mar 29, 2024 at 3:07 PM Chen, Jiqian  wrote:
>>>
>>> On 2024/3/28 20:36, Michael S. Tsirkin wrote:
>>> +}
>>> +
>>>  static void virtio_pci_bus_reset_hold(Object *obj)
>>>  {
>>>  PCIDevice *dev = PCI_DEVICE(obj);
>>>  DeviceState *qdev = DEVICE(obj);
>>>
>>> +if (virtio_pci_no_soft_reset(dev)) {
>>> +return;
>>> +}
>>> +
>>>  virtio_pci_reset(qdev);
>>>
>>>  if (pci_is_express(dev)) {
>>> @@ -2484,6 +2511,8 @@ static Property virtio_pci_properties[] = {
>>>  VIRTIO_PCI_FLAG_INIT_LNKCTL_BIT, true),
>>>  DEFINE_PROP_BIT("x-pcie-pm-init", VirtIOPCIProxy, flags,
>>>  VIRTIO_PCI_FLAG_INIT_PM_BIT, true),
>>> +DEFINE_PROP_BIT("x-pcie-pm-no-soft-reset", VirtIOPCIProxy, flags,
>>> +VIRTIO_PCI_FLAG_PM_NO_SOFT_RESET_BIT, false),
>>
>> Why does it come with an x prefix?
>>
>>>  DEFINE_PROP_BIT("x-pcie-flr-init", VirtIOPCIProxy, flags,
>>>  VIRTIO_PCI_FLAG_INIT_FLR_BIT, true),
>>>  DEFINE_PROP_BIT("aer", VirtIOPCIProxy, flags,
>>
>> I am a bit confused about this part.
>> Do you want to make this software controllable?
> Yes, because even on real hardware, this bit is not always set.
>>
>> We are talking about emulated devices here.
>>

 So which virtio devices should and which should not set this bit?
>>> This depends on the scenario in which the virtio-device is used: if we want
>>> to trigger an internal soft reset for the virtio-device during S3, this bit
>>> shouldn't be set.
>>
>> If the device doesn't need reset, why bother the driver for this?
>>
>> Btw, no_soft_reset is insufficient for some cases, there's a proposal
>> for the virtio-spec. I think we need to wait until it is done.
> 
> That seems orthogonal or did I miss something?
Yes, I looked at the details of the proposal, and I also think they are unrelated.
I will set the default value of the No_Soft_Reset bit to true in the next
version, as you suggested.
Regarding compatibility with old machine types, which types should I consider?
Should it be the same as for x-pcie-pm-init (hw_compat_2_8)?
Forgive me for not knowing much about compatibility.
> 
>>> In my use case, in my environment, I don't want to reset virtio-gpu during
>>> S3, because once the display resources are destroyed, there is not enough
>>> information to re-create them, so this bit should be set.
>>> Making this bit software controllable is convenient for users to make their
>>> own choices.
>>
>> Thanks
>>
>>>

>> Or should this be set to true by default and then
>> changed to false for old machine types?
> How can I do so?
> Do you mean set this to true by default, so that if old machine types don't 
> need this bit, they can pass a false config to QEMU when running it?

 No, you would use the compat machinery. See how x-pcie-flr-init is handled.


>>>
>>> --
>>> Best regards,
>>> Jiqian Chen.
> 

-- 
Best regards,
Jiqian Chen.


Re: [RFC QEMU PATCH v8 2/2] virtio-pci: implement No_Soft_Reset bit

2024-04-01 Thread Chen, Jiqian
On 2024/3/29 18:38, Jason Wang wrote:
> On Fri, Mar 29, 2024 at 4:00 PM Chen, Jiqian  wrote:
>>
>> On 2024/3/29 15:20, Jason Wang wrote:
>>> On Fri, Mar 29, 2024 at 3:07 PM Chen, Jiqian  wrote:

 On 2024/3/28 20:36, Michael S. Tsirkin wrote:
 +}
 +
  static void virtio_pci_bus_reset_hold(Object *obj)
  {
  PCIDevice *dev = PCI_DEVICE(obj);
  DeviceState *qdev = DEVICE(obj);

 +if (virtio_pci_no_soft_reset(dev)) {
 +return;
 +}
 +
  virtio_pci_reset(qdev);

  if (pci_is_express(dev)) {
 @@ -2484,6 +2511,8 @@ static Property virtio_pci_properties[] = {
  VIRTIO_PCI_FLAG_INIT_LNKCTL_BIT, true),
  DEFINE_PROP_BIT("x-pcie-pm-init", VirtIOPCIProxy, flags,
  VIRTIO_PCI_FLAG_INIT_PM_BIT, true),
 +DEFINE_PROP_BIT("x-pcie-pm-no-soft-reset", VirtIOPCIProxy, flags,
 +VIRTIO_PCI_FLAG_PM_NO_SOFT_RESET_BIT, false),
>>>
>>> Why does it come with an x prefix?
>> Sorry, I misunderstood this prefix; if No_Soft_Reset doesn't need it, I
>> will delete it in the next version.
>> Does the x prefix mean compat machinery? Or something else?
>>
>>>
  DEFINE_PROP_BIT("x-pcie-flr-init", VirtIOPCIProxy, flags,
  VIRTIO_PCI_FLAG_INIT_FLR_BIT, true),
  DEFINE_PROP_BIT("aer", VirtIOPCIProxy, flags,
>>>
>>> I am a bit confused about this part.
>>> Do you want to make this software controllable?
>> Yes, because even on real hardware, this bit is not always set.
>>>
>>> We are talking about emulated devices here.
>> Yes, I just gave an example. Actually, this bit is not always set. What's
>> your opinion on when to set this bit, or on which virtio-devices should set
>> it?
> 
> If the implementation of Qemu is correct, we should set it unless we
> need compatibility.
OK, I will set it to true by default in the next version.
If we need compatibility, which machine types should we be compatible with?
Should it be the same as for x-pcie-pm-init?
> 
>>
>>>
>
> So which virtio devices should and which should not set this bit?
 This depends on the scenario in which the virtio-device is used: if we want
 to trigger an internal soft reset for the virtio-device during S3, this bit
 shouldn't be set.
>>>
>>> If the device doesn't need reset, why bother the driver for this?
>> I don't know what you mean.
>> If the device doesn't need a reset, we can configure this bit to true; then,
>> on the driver side, the driver finds this bit set and will not trigger a
>> soft reset.
> 
> I mean if the device can suspend without reset, we don't need to
> bother the driver to save and load states.
> 
>>
>>>
>>> Btw, no_soft_reset is insufficient for some cases,
>> May I know which cases?
>>
>>> there's a proposal for the virtio-spec. I think we need to wait until it is 
>>> done.
>> Can you share the proposal?
> 
> See this
> 
> https://lore.kernel.org/all/20240227015345.3614965-1-steve...@chromium.org/T/
I looked at the details of this proposal.
What the proposal does is add a new status bit to trigger the device to
suspend/resume.
But this patch adds the No_Soft_Reset bit, which decides whether the device
should be reset, so this patch doesn't depend on the proposal.
If you mean the proposal's statement that "A device MUST maintain its state
while suspended", I think that needs a new patch, implemented after the
proposal is done.

> 
> Thanks
> 
>>
>>>
 In my use case, in my environment, I don't want to reset virtio-gpu during
 S3, because once the display resources are destroyed, there is not enough
 information to re-create them, so this bit should be set.
 Making this bit software controllable is convenient for users to make
 their own choices.
>>>
>>> Thanks
>>>

>
>>> Or should this be set to true by default and then
>>> changed to false for old machine types?
>> How can I do so?
>> Do you mean set this to true by default, so that if old machine types don't 
>> need this bit, they can pass a false config to QEMU when running it?
>
> No, you would use the compat machinery. See how x-pcie-flr-init is handled.
>
>

 --
 Best regards,
 Jiqian Chen.
>>>
>>
>> --
>> Best regards,
>> Jiqian Chen.
> 

-- 
Best regards,
Jiqian Chen.


[PATCH 1/2] CXL/cxl_type3: add first_dvsec_offset() helper

2024-04-01 Thread Li Zhijian via
It helps figure out where the first DVSEC register is located. In
addition, replace hardcoded offsets and sizes with existing macros.

Signed-off-by: Li Zhijian 
---
 hw/mem/cxl_type3.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
index b0a7e9f11b64..ad2fe7d463fb 100644
--- a/hw/mem/cxl_type3.c
+++ b/hw/mem/cxl_type3.c
@@ -643,6 +643,16 @@ static DOEProtocol doe_cdat_prot[] = {
 { }
 };
 
+static uint16_t first_dvsec_offset(CXLType3Dev *ct3d)
+{
+uint16_t offset = PCI_CONFIG_SPACE_SIZE;
+
+if (ct3d->sn != UI64_NULL)
+offset += PCI_EXT_CAP_DSN_SIZEOF;
+
+return offset;
+}
+
 static void ct3_realize(PCIDevice *pci_dev, Error **errp)
 {
 ERRP_GUARD();
@@ -663,13 +673,10 @@ static void ct3_realize(PCIDevice *pci_dev, Error **errp)
 pci_config_set_prog_interface(pci_conf, 0x10);
 
 pcie_endpoint_cap_init(pci_dev, 0x80);
-if (ct3d->sn != UI64_NULL) {
-pcie_dev_ser_num_init(pci_dev, 0x100, ct3d->sn);
-cxl_cstate->dvsec_offset = 0x100 + 0x0c;
-} else {
-cxl_cstate->dvsec_offset = 0x100;
-}
+if (ct3d->sn != UI64_NULL)
+pcie_dev_ser_num_init(pci_dev, PCI_CONFIG_SPACE_SIZE, ct3d->sn);
 
+cxl_cstate->dvsec_offset = first_dvsec_offset(ct3d);
 ct3d->cxl_cstate.pdev = pci_dev;
 build_dvsecs(ct3d);
 
-- 
2.29.2




[PATCH 2/2] CXL/cxl_type3: reset DVSEC CXL Control in ct3d_reset

2024-04-01 Thread Li Zhijian via
After the kernel commit
0cab68720598 ("cxl/pci: Fix disabling memory if DVSEC CXL Range does not match 
a CFMWS window")
CXL type3 devices cannot be enabled again after a reboot because this
flag was not reset.

This flag can be changed by the firmware or OS; restore it to its
reset (default) value on reboot so that the OS can read a clean status.

Signed-off-by: Li Zhijian 
---
 hw/mem/cxl_type3.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
index ad2fe7d463fb..3fe136053390 100644
--- a/hw/mem/cxl_type3.c
+++ b/hw/mem/cxl_type3.c
@@ -305,7 +305,8 @@ static void build_dvsecs(CXLType3Dev *ct3d)
 
 dvsec = (uint8_t *)&(CXLDVSECDevice){
 .cap = 0x1e,
-.ctrl = 0x2,
+#define CT3D_DEVSEC_CXL_CTRL 0x2
+.ctrl = CT3D_DEVSEC_CXL_CTRL,
 .status2 = 0x2,
 .range1_size_hi = range1_size_hi,
 .range1_size_lo = range1_size_lo,
@@ -906,6 +907,16 @@ MemTxResult cxl_type3_write(PCIDevice *d, hwaddr 
host_addr, uint64_t data,
 return address_space_write(as, dpa_offset, attrs, &data, size);
 }
 
+/* Reset DVSEC CXL Control */
+static void ct3d_dvsec_cxl_ctrl_reset(CXLType3Dev *ct3d)
+{
+uint16_t offset = first_dvsec_offset(ct3d);
+CXLDVSECDevice *dvsec;
+
+dvsec = (CXLDVSECDevice *)(ct3d->cxl_cstate.pdev->config + offset);
+dvsec->ctrl = CT3D_DEVSEC_CXL_CTRL;
+}
+
 static void ct3d_reset(DeviceState *dev)
 {
 CXLType3Dev *ct3d = CXL_TYPE3(dev);
@@ -914,6 +925,7 @@ static void ct3d_reset(DeviceState *dev)
 
 cxl_component_register_init_common(reg_state, write_msk, 
CXL2_TYPE3_DEVICE);
 cxl_device_register_init_t3(ct3d);
+ct3d_dvsec_cxl_ctrl_reset(ct3d);
 
 /*
  * Bring up an endpoint to target with MCTP over VDM.
-- 
2.29.2




[PATCH] target/hppa: Fix IIAOQ, IIASQ for pa2.0

2024-04-01 Thread Richard Henderson
The contents of IIAOQ depend on PSW_W.
Follow the text in "Interruption Instruction Address Queues",
pages 2-13 through 2-15.

Reported-by: Sven Schnelle 
Fixes: b10700d826c ("target/hppa: Update IIAOQ, IIASQ for pa2.0")
Signed-off-by: Richard Henderson 
---

Sven, I looked through the IIAOQ documentation again, and it does seem
like some of the bits are wrong, both on interrupt delivery and on RFI.


r~

---
 target/hppa/int_helper.c | 20 +++-
 target/hppa/sys_helper.c | 18 +-
 2 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/target/hppa/int_helper.c b/target/hppa/int_helper.c
index 90437a92cd..a667ee380d 100644
--- a/target/hppa/int_helper.c
+++ b/target/hppa/int_helper.c
@@ -107,14 +107,10 @@ void hppa_cpu_do_interrupt(CPUState *cs)
 
 /* step 3 */
 /*
- * For pa1.x, IIASQ is simply a copy of IASQ.
- * For pa2.0, IIASQ is the top bits of the virtual address,
- *or zero if translation is disabled.
+ * IIASQ is the top bits of the virtual address, or zero if translation
+ * is disabled -- with PSW_W == 0, this will reduce to the space.
  */
-if (!hppa_is_pa20(env)) {
-env->cr[CR_IIASQ] = env->iasq_f >> 32;
-env->cr_back[0] = env->iasq_b >> 32;
-} else if (old_psw & PSW_C) {
+if (old_psw & PSW_C) {
 env->cr[CR_IIASQ] =
 hppa_form_gva_psw(old_psw, env->iasq_f, env->iaoq_f) >> 32;
 env->cr_back[0] =
@@ -123,8 +119,14 @@ void hppa_cpu_do_interrupt(CPUState *cs)
 env->cr[CR_IIASQ] = 0;
 env->cr_back[0] = 0;
 }
-env->cr[CR_IIAOQ] = env->iaoq_f;
-env->cr_back[1] = env->iaoq_b;
+/* IIAOQ is the full offset for wide mode, or 32 bits for narrow mode. */
+if (old_psw & PSW_W) {
+env->cr[CR_IIAOQ] = env->iaoq_f;
+env->cr_back[1] = env->iaoq_b;
+} else {
+env->cr[CR_IIAOQ] = (uint32_t)env->iaoq_f;
+env->cr_back[1] = (uint32_t)env->iaoq_b;
+}
 
 if (old_psw & PSW_Q) {
 /* step 5 */
diff --git a/target/hppa/sys_helper.c b/target/hppa/sys_helper.c
index 208e51c086..22d6c89964 100644
--- a/target/hppa/sys_helper.c
+++ b/target/hppa/sys_helper.c
@@ -78,21 +78,21 @@ target_ulong HELPER(swap_system_mask)(CPUHPPAState *env, 
target_ulong nsm)
 
 void HELPER(rfi)(CPUHPPAState *env)
 {
-env->iasq_f = (uint64_t)env->cr[CR_IIASQ] << 32;
-env->iasq_b = (uint64_t)env->cr_back[0] << 32;
-env->iaoq_f = env->cr[CR_IIAOQ];
-env->iaoq_b = env->cr_back[1];
+uint64_t mask;
+
+cpu_hppa_put_psw(env, env->cr[CR_IPSW]);
 
 /*
  * For pa2.0, IIASQ is the top bits of the virtual address.
  * To recreate the space identifier, remove the offset bits.
+ * For pa1.x, the mask reduces to no change to space.
  */
-if (hppa_is_pa20(env)) {
-env->iasq_f &= ~env->iaoq_f;
-env->iasq_b &= ~env->iaoq_b;
-}
+mask = gva_offset_mask(env->psw);
 
-cpu_hppa_put_psw(env, env->cr[CR_IPSW]);
+env->iaoq_f = env->cr[CR_IIAOQ];
+env->iaoq_b = env->cr_back[1];
+env->iasq_f = (env->cr[CR_IIASQ] << 32) & ~(env->iaoq_f & mask);
+env->iasq_b = (env->cr_back[0] << 32) & ~(env->iaoq_b & mask);
 }
 
 static void getshadowregs(CPUHPPAState *env)
-- 
2.34.1




[PATCH v10 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-01 Thread Ho-Ren (Jack) Chuang
The current implementation treats emulated memory devices, such as
CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, making it important to
distinguish them. Thus, we modify the tiered memory initialization process
to introduce a delay specifically for CPUless NUMA nodes. This delay
ensures that the memory tier initialization for these nodes is deferred
until HMAT information is obtained during the boot process. Finally,
demotion tables are recalculated at the end.

* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between
`memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
online memory nodes and configuring memory tiers. They should be excluded
in the late init.

* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT information.
If no HMAT is specified, it falls back to using the default DRAM tier.

* Introduce another new lock `default_dram_perf_lock` for adist calculation
In the current implementation, iterating through CPUless nodes requires
holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
trying to acquire the same lock, leading to a potential deadlock.
Therefore, we propose introducing a standalone `default_dram_perf_lock` to
protect `default_dram_perf_*`. This approach not only avoids the deadlock
but also avoids holding one large lock across unrelated work.

* Upgrade `set_node_memory_tier` to support additional cases, including
  default DRAM, late CPUless, and hot-plugged initializations.
To cover hot-plugged memory nodes, `mt_calc_adistance()` and
`mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
handle cases where memtype is not initialized and where HMAT information is
available.

* Introduce `default_memory_types` for those memory types that are not
  initialized by device drivers.
Because late initialized memory and default DRAM memory need to be managed,
a default memory type is created for storing all memory types that are
not initialized by device drivers and as a fallback.

Signed-off-by: Ho-Ren (Jack) Chuang 
Signed-off-by: Hao Xiang 
Reviewed-by: "Huang, Ying" 
---
 include/linux/memory-tiers.h |  5 +-
 mm/memory-tiers.c| 95 +---
 2 files changed, 81 insertions(+), 19 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index a44c03c2ba3a..16769552a338 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -140,12 +140,13 @@ static inline int mt_perf_to_adistance(struct 
access_coordinate *perf, int *adis
return -EIO;
 }
 
-struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head 
*memory_types)
+static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist,
+   struct list_head *memory_types)
 {
return NULL;
 }
 
-void mt_put_memory_types(struct list_head *memory_types)
+static inline void mt_put_memory_types(struct list_head *memory_types)
 {
 
 }
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 974af10cfdd8..44fa10980d37 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -36,6 +36,11 @@ struct node_memory_type_map {
 
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
+/*
+ * The list is used to store all memory types that are not created
+ * by a device driver.
+ */
+static LIST_HEAD(default_memory_types);
 static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
 struct memory_dev_type *default_dram_type;
 
@@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly;
 
 static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
 
+/* The lock is used to protect `default_dram_perf*` info and nid. */
+static DEFINE_MUTEX(default_dram_perf_lock);
 static bool default_dram_perf_error;
 static struct access_coordinate default_dram_perf;
 static int default_dram_perf_ref_nid = NUMA_NO_NODE;
@@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, struct 
memory_dev_type *mem
 static struct memory_tier *set_node_memory_tier(int node)
 {
struct memory_tier *memtier;
-   struct memory_dev_type *memtype;
+   struct memory_dev_type *mtype = default_dram_type;
+   int adist = MEMTIER_ADISTANCE_DRAM;
pg_data_t *pgdat = NODE_DATA(node);
 
 
@@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int node)
if (!node_state(node, N_MEMORY))
return ERR_PTR(-EINVAL);
 
-   __init_node_memory_type(node, default_dram_type);
+   mt_calc_adistance(node, &adist);
+   if (node_memory_types[node].memtype == NULL) {
+   mtype = mt_find_alloc_memory_type(adist, &default_memory_types);
+   if (IS_ERR(mtype)) {
+   mtype = defa

[PATCH v10 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-01 Thread Ho-Ren (Jack) Chuang
Since different memory devices require finding, allocating, and putting
memory types, these common steps are abstracted in this patch,
enhancing the scalability and conciseness of the code.

Signed-off-by: Ho-Ren (Jack) Chuang 
Reviewed-by: "Huang, Ying" 
---
 drivers/dax/kmem.c   | 20 ++--
 include/linux/memory-tiers.h | 13 +
 mm/memory-tiers.c| 32 
 3 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 42ee360cf4e3..01399e5b53b2 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);
 
 static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
 {
-   bool found = false;
struct memory_dev_type *mtype;
 
mutex_lock(&kmem_memory_type_lock);
-   list_for_each_entry(mtype, &kmem_memory_types, list) {
-   if (mtype->adistance == adist) {
-   found = true;
-   break;
-   }
-   }
-   if (!found) {
-   mtype = alloc_memory_type(adist);
-   if (!IS_ERR(mtype))
-   list_add(&mtype->list, &kmem_memory_types);
-   }
+   mtype = mt_find_alloc_memory_type(adist, &kmem_memory_types);
mutex_unlock(&kmem_memory_type_lock);
 
return mtype;
@@ -77,13 +66,8 @@ static struct memory_dev_type 
*kmem_find_alloc_memory_type(int adist)
 
 static void kmem_put_memory_types(void)
 {
-   struct memory_dev_type *mtype, *mtn;
-
mutex_lock(&kmem_memory_type_lock);
-   list_for_each_entry_safe(mtype, mtn, &kmem_memory_types, list) {
-   list_del(&mtype->list);
-   put_memory_type(mtype);
-   }
+   mt_put_memory_types(&kmem_memory_types);
mutex_unlock(&kmem_memory_type_lock);
 }
 
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 69e781900082..a44c03c2ba3a 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist);
 int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
 const char *source);
 int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
+struct memory_dev_type *mt_find_alloc_memory_type(int adist,
+   struct list_head 
*memory_types);
+void mt_put_memory_types(struct list_head *memory_types);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct 
access_coordinate *perf, int *adis
 {
return -EIO;
 }
+
+struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head 
*memory_types)
+{
+   return NULL;
+}
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+
+}
 #endif /* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 0537664620e5..974af10cfdd8 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -623,6 +623,38 @@ void clear_node_memory_type(int node, struct 
memory_dev_type *memtype)
 }
 EXPORT_SYMBOL_GPL(clear_node_memory_type);
 
+struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head 
*memory_types)
+{
+   bool found = false;
+   struct memory_dev_type *mtype;
+
+   list_for_each_entry(mtype, memory_types, list) {
+   if (mtype->adistance == adist) {
+   found = true;
+   break;
+   }
+   }
+   if (!found) {
+   mtype = alloc_memory_type(adist);
+   if (!IS_ERR(mtype))
+   list_add(&mtype->list, memory_types);
+   }
+
+   return mtype;
+}
+EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+   struct memory_dev_type *mtype, *mtn;
+
+   list_for_each_entry_safe(mtype, mtn, memory_types, list) {
+   list_del(&mtype->list);
+   put_memory_type(mtype);
+   }
+}
+EXPORT_SYMBOL_GPL(mt_put_memory_types);
+
 static void dump_hmem_attrs(struct access_coordinate *coord, const char 
*prefix)
 {
pr_info(
-- 
Ho-Ren (Jack) Chuang




[PATCH v10 0/2] Improved Memory Tier Creation for CPUless NUMA Nodes

2024-04-01 Thread Ho-Ren (Jack) Chuang
When a memory device, such as CXL1.1 type3 memory, is emulated as
normal memory (E820_TYPE_RAM), the memory device is indistinguishable from
normal DRAM in terms of memory tiering with the current implementation.
The current memory tiering assigns all detected normal memory nodes to
the same DRAM tier. This results in normal memory devices with different
attributions being unable to be assigned to the correct memory tier,
leading to the inability to migrate pages between different
types of memory.
https://lore.kernel.org/linux-mm/ph0pr08mb7955e9f08ccb64f23963b5c3a8...@ph0pr08mb7955.namprd08.prod.outlook.com/T/

This patchset resolves these issues automatically. It delays the
initialization of memory tiers for CPUless NUMA nodes until they obtain
HMAT information and after all devices are initialized at boot time,
eliminating the need for user intervention. If no HMAT is specified,
it falls back to using `default_dram_type`.

Example usecase:
We have CXL memory on the host, and we create VMs with a new system memory
device backed by host CXL memory. We inject CXL memory performance
attributes through QEMU, and the guest now sees memory nodes with
performance attributes in HMAT. With this change, we enable the
guest kernel to construct the correct memory tiering for the memory nodes.

- v10:
 Thanks to Andrew's and SeongJae's comments,
 * Address kunit compilation errors
 * Resolve the bug of not returning the correct error code in
   `mt_perf_to_adistance`
- v9:
 Thanks to Ying's comments,
 * Address corner cases in `memory_tier_late_init`
 * https://lore.kernel.org/lkml/20240329053353.309557-1-horenchu...@bytedance.com/T/#u
- v8:
 * Fix email format
 * https://lore.kernel.org/lkml/20240329004815.195476-1-horenchu...@bytedance.com/T/#u
- v7:
 * Add Reviewed-by: "Huang, Ying" 
- v6:
 Thanks to Ying's comments,
 * Move `default_dram_perf_lock` to the function's beginning for clarity
 * Fix double unlocking in v5
 * https://lore.kernel.org/lkml/20240327072729.3381685-1-horenchu...@bytedance.com/T/#u
- v5:
 Thanks to Ying's comments,
 * Add comments about what is protected by `default_dram_perf_lock`
 * Fix an uninitialized pointer mtype
 * Slightly shorten the time holding `default_dram_perf_lock`
 * Fix a deadlock bug in `mt_perf_to_adistance`
 * https://lore.kernel.org/lkml/20240327041646.3258110-1-horenchu...@bytedance.com/T/#u
- v4:
 Thanks to Ying's comments,
 * Remove redundant code
 * Reorganize patches accordingly
 * https://lore.kernel.org/lkml/20240322070356.315922-1-horenchu...@bytedance.com/T/#u
- v3:
 Thanks to Ying's comments,
 * Make the newly added code independent of HMAT
 * Upgrade set_node_memory_tier to support more cases
 * Put all non-driver-initialized memory types into default_memory_types
   instead of using hmat_memory_types
 * find_alloc_memory_type -> mt_find_alloc_memory_type
 * https://lore.kernel.org/lkml/20240320061041.3246828-1-horenchu...@bytedance.com/T/#u
- v2:
 Thanks to Ying's comments,
 * Rewrite cover letter & patch description
 * Rename functions, don't use _hmat
 * Abstract common functions into find_alloc_memory_type()
 * Use the expected way to use set_node_memory_tier instead of modifying it
 * https://lore.kernel.org/lkml/20240312061729.1997111-1-horenchu...@bytedance.com/T/#u
- v1:
 * https://lore.kernel.org/lkml/20240301082248.3456086-1-horenchu...@bytedance.com/T/#u

Ho-Ren (Jack) Chuang (2):
  memory tier: dax/kmem: introduce an abstract layer for finding,
allocating, and putting memory types
  memory tier: create CPUless memory tiers after obtaining HMAT info

 drivers/dax/kmem.c   |  20 +-
 include/linux/memory-tiers.h |  14 
 mm/memory-tiers.c| 127 ++-
 3 files changed, 126 insertions(+), 35 deletions(-)

-- 
Ho-Ren (Jack) Chuang




Re: [External] Re: [PATCH v9 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-01 Thread Ho-Ren (Jack) Chuang
Hi SeongJae,

On Mon, Apr 1, 2024 at 11:27 AM Ho-Ren (Jack) Chuang
 wrote:
>
> Hi SeongJae,
>
> On Sun, Mar 31, 2024 at 12:09 PM SeongJae Park  wrote:
> >
> > Hi Ho-Ren,
> >
> > On Fri, 29 Mar 2024 05:33:52 + "Ho-Ren (Jack) Chuang" 
> >  wrote:
> >
> > > Since different memory devices require finding, allocating, and putting
> > > memory types, these common steps are abstracted in this patch,
> > > enhancing the scalability and conciseness of the code.
> > >
> > > Signed-off-by: Ho-Ren (Jack) Chuang 
> > > Reviewed-by: "Huang, Ying" 
> > > ---
> > >  drivers/dax/kmem.c   | 20 ++--
> > >  include/linux/memory-tiers.h | 13 +
> > >  mm/memory-tiers.c| 32 
> > >  3 files changed, 47 insertions(+), 18 deletions(-)
> > >
> > [...]
> > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > > index 69e781900082..a44c03c2ba3a 100644
> > > --- a/include/linux/memory-tiers.h
> > > +++ b/include/linux/memory-tiers.h
> > > @@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist);
> > >  int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
> > >const char *source);
> > >  int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
> > > +struct memory_dev_type *mt_find_alloc_memory_type(int adist,
> > > + struct list_head 
> > > *memory_types);
> > > +void mt_put_memory_types(struct list_head *memory_types);
> > >  #ifdef CONFIG_MIGRATION
> > >  int next_demotion_node(int node);
> > >  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> > > @@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct 
> > > access_coordinate *perf, int *adis
> > >  {
> > >   return -EIO;
> > >  }
> > > +
> > > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct 
> > > list_head *memory_types)
> > > +{
> > > + return NULL;
> > > +}
> > > +
> > > +void mt_put_memory_types(struct list_head *memory_types)
> > > +{
> > > +
> > > +}
> >
> > I found latest mm-unstable tree is failing kunit as below, and 'git bisect'
> > says it happens from this patch.
> >
> > $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
> > [11:56:40] Configuring KUnit Kernel ...
> > [11:56:40] Building KUnit Kernel ...
> > Populating config with:
> > $ make ARCH=um O=../kunit.out/ olddefconfig
> > Building with:
> > $ make ARCH=um O=../kunit.out/ --jobs=36
> > ERROR:root:In file included from .../mm/memory.c:71:
> > .../include/linux/memory-tiers.h:143:25: warning: no previous prototype 
> > for ‘mt_find_alloc_memory_type’ [-Wmissing-prototypes]
> >   143 | struct memory_dev_type *mt_find_alloc_memory_type(int adist, 
> > struct list_head *memory_types)
> >   | ^
> > .../include/linux/memory-tiers.h:148:6: warning: no previous prototype 
> > for ‘mt_put_memory_types’ [-Wmissing-prototypes]
> >   148 | void mt_put_memory_types(struct list_head *memory_types)
> >   |  ^~~
> > [...]
> >
> > Maybe we should set these as 'static inline', like below?  I confirmed this
> > fixes the kunit error.  May I ask your opinion?
> >
>
> Thanks for catching this. I'm trying to figure out this problem. Will get 
> back.
>

These kunit compilation errors can be solved by adding `static inline`
to the two complaining functions, the same solution you mentioned
earlier.

I've also tested on my end and I will send out a V10 soon. Thank you again!

> >
> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > index a44c03c2ba3a..ee6e53144156 100644
> > --- a/include/linux/memory-tiers.h
> > +++ b/include/linux/memory-tiers.h
> > @@ -140,12 +140,12 @@ static inline int mt_perf_to_adistance(struct 
> > access_coordinate *perf, int *adis
> > return -EIO;
> >  }
> >
> > -struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct 
> > list_head *memory_types)
> > +static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist, 
> > struct list_head *memory_types)
> >  {
> > return NULL;
> >  }
> >
> > -void mt_put_memory_types(struct list_head *memory_types)
> > +static inline void mt_put_memory_types(struct list_head *memory_types)
> >  {
> >
> >  }
> >
> >
> > Thanks,
> > SJ
>
>
>
> --
> Best regards,
> Ho-Ren (Jack) Chuang
> 莊賀任



-- 
Best regards,
Ho-Ren (Jack) Chuang
莊賀任



Re: [PATCH] target/hppa: mask upper iaoq bits when returning to narrow mode

2024-04-01 Thread Sven Schnelle
Richard Henderson  writes:

> On 4/1/24 10:39, Sven Schnelle wrote:
>> Richard Henderson  writes:
 For unknown reasons, Java 1.5 on 64-bit HP-UX 11.11 does signed
 computation of the new IAOQ value in the signal handler. In the
 current code these bits are not masked when returning to narrow
 mode, causing java to crash.
 Signed-off-by: Sven Schnelle 
>> INT   3530: instruction tlb miss fault @ :c007 
>> for :4000c004
>> INT   3531: external interrupt @ :c007 for 
>> :4000c004
>> INT   3532: instruction tlb miss fault @ :c007 
>> for :4000c004
>> INT   3533: external interrupt @ :c007 for 
>> :4000c004
>> So the PSW indicates narrow mode, but IAOQ seems to contain all the
>> ... bits.
>
> I believe that the IAOQ *should* contain all of the bits.  The bits
> should only be discarded when we form the GVA -- exactly like "ldb
> 0(r2)", where r2 contains all of the offset bits.  In particular, I
> believe that "b,l .+8,r2" should copy all of those bits to r2 from
> IAOQ_Back+4 and the fact that mainline crops those bits is a bug.
>
>
>> Also interesting is that the second TLB miss (INT 3530)
>> misses the Space ID.
>
> That is a bit curious, yes.
>
>> Any thoughts? Otherwise i need to investigate and make a wrong patch
>> again :-)
>> The only patch i have on top which touches target/hppa is the space
>> id
>> hashing mask patch:
>
> Ok.  I do have an hppa 11.11 iso -- for clarity, what is your command-line?

I'm using:

./build/qemu-system-hppa -M C3700 -m 1024 \
  -cdrom '/home/svens/parisc/hpux/11.11/HP-UX 11.11 (2004-12) - TCOE - Core OS, Install and Recovery - DVD.iso' \
  -bios /home/svens/seabios-hppa/out-64/hppa-firmware64.img \
  -nographic -hda /home/svens/parisc/hpux.img -boot d


The QEMU I'm using is: https://github.com/svenschnelle/qemu/tree/devel

You also need a special seabios-hppa version, because a special console
driver is needed:

https://github.com/hdeller/seabios-hppa/tree/devel



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-01 Thread Yu Zhang
Hello Peter and Zhijian,

Thank you so much for letting me know about this. I'm also a bit surprised at
the plan for deprecating the RDMA migration subsystem.

> IMHO it's more important to know whether there are still users and whether
> they would still like to see it around.

> I admit RDMA migration lacked testing (unit/CI tests), which led to a few
> obvious bugs being noticed too late.

Yes, we are a user of this subsystem. I was unaware of the lack of test
coverage for this part. As soon as 8.2 was released, I saw that many of the
migration test cases failed and came to realize that there might be a bug
between 8.1 and 8.2, but was unable to confirm and report it quickly to you.

The maintenance of this part could be too costly or difficult from
your point of view.

My concern is that this plan will force a few QEMU users (not sure how many)
like us either to stick with RDMA migration on an increasingly old version
of QEMU, or to abandon the RDMA migration they currently use.

Best regards,
Yu Zhang

On Mon, Apr 1, 2024 at 9:56 AM Zhijian Li (Fujitsu)
 wrote:
>
> Phil,
>
> on 3/29/2024 6:28 PM, Philippe Mathieu-Daudé wrote:
> >>
> >>
> >>> IMHO it's more important to know whether there are still users and
> >>> whether
> >>> they would still like to see it around.
> >>
> >> Agree.
> >> I didn't immediately express my opinion in V1 because I'm also
> >> consulting our
> >> customers for this feature in the future.
> >>
> >> Personally, I agree with Perter's idea that "I'd slightly prefer
> >> postponing it one
> >> more release which might help a bit of our downstream maintenance"
> >
> > Do you mind posting a deprecation patch to clarify the situation?
> >
>
> No problem, i just posted a deprecation patch, please take a look.
> https://lore.kernel.org/qemu-devel/20240401035947.3310834-1-lizhij...@fujitsu.com/T/#u
>
> Thanks
> Zhijian



Re: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN

2024-04-01 Thread Fabiano Rosas
Peter Xu  writes:

> On Mon, Apr 01, 2024 at 02:17:28PM -0300, Fabiano Rosas wrote:
>> Peter Xu  writes:
>> 
>> > On Fri, Mar 29, 2024 at 08:54:07AM +, Wang, Wei W wrote:
>> >> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote:
>> >> > When using the post-copy preemption feature to perform post-copy live
>> >> > migration, the below scenario could lead to a deadlock and the 
>> >> > migration will
>> >> > never finish:
>> >> > 
>> >> >  - Source connect() the preemption channel in postcopy_start().
>> >> >  - Source and the destination side TCP stack finished the 3-way 
>> >> > handshake
>> >> >thus the connection is successful.
>> >> >  - The destination side main thread is handling the loading of the bulk 
>> >> > RAM
>> >> >pages thus it doesn't start to handle the pending connection event 
>> >> > in the
>> >> >event loop. and doesn't post the semaphore 
>> >> > postcopy_qemufile_dst_done for
>> >> >the preemption thread.
>> >> >  - The source side sends non-iterative device states, such as the virtio
>> >> >states.
>> >> >  - The destination main thread starts to receive the virtio states, this
>> >> >process may lead to a page fault (e.g., 
>> >> > virtio_load()->vring_avail_idx()
>> >> >may trigger a page fault since the avail ring page may not be 
>> >> > received
>> >> >yet).
>> >
>> > Ouch.  Yeah I think this part got overlooked when working on the preempt
>> > channel.
>> >
>> >> >  - The page request is sent back to the source side. Source sends the 
>> >> > page
>> >> >content to the destination side preemption thread.
>> >> >  - Since the event is not arrived and the semaphore
>> >> >postcopy_qemufile_dst_done is not posted, the preemption thread in
>> >> >destination side is blocked, and cannot handle receiving the page.
>> >> >  - The QEMU main load thread on the destination side is stuck at the 
>> >> > page
>> >> >fault, and cannot yield and handle the connect() event for the
>> >> >preemption channel to unblock the preemption thread.
>> >> >  - The postcopy will stuck there forever since this is a deadlock.
>> >> > 
>> >> > The key point to reproduce this bug is that the source side is sending 
>> >> > pages at a
>> >> > rate faster than the destination handling, otherwise, the 
>> >> > qemu_get_be64() in
>> >> > ram_load_precopy() will have a chance to yield since at that time there 
>> >> > are no
>> >> > pending data in the buffer to get. This will make this bug harder to be
>> >> > reproduced.
>> >
>> > How hard would this reproduce?
>> >
>> > I'm thinking whether this should be 9.0 material or 9.1.  It's pretty late
>> > for 9.0 though, but we can still discuss.
>> >
>> >> > 
>> >> > Fix this by yielding the load coroutine when receiving
>> >> > MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the
>> >> > connection event before loading the non-iterative devices state to 
>> >> > avoid the
>> >> > deadlock condition.
>> >> > 
>> >> > Signed-off-by: Lei Wang 
>> >> 
>> >> This seems to be a regression issue caused by this commit:
>> >> 737840e2c6ea (migration: Use the number of transferred bytes directly)
>> >> 
>> >> Adding qemu_fflush back to migration_rate_exceeded() or ram_save_iterate
>> >> seems to work (might not be a good fix though).
>> >> 
>> >> > ---
>> >> >  migration/savevm.c | 5 +
>> >> >  1 file changed, 5 insertions(+)
>> >> > 
>> >> > diff --git a/migration/savevm.c b/migration/savevm.c index
>> >> > e386c5267f..8fd4dc92f2 100644
>> >> > --- a/migration/savevm.c
>> >> > +++ b/migration/savevm.c
>> >> > @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f)
>> >> >  return loadvm_postcopy_handle_advise(mis, len);
>> >> > 
>> >> >  case MIG_CMD_POSTCOPY_LISTEN:
>> >> > +if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
>> >> > +aio_co_schedule(qemu_get_current_aio_context(),
>> >> > +qemu_coroutine_self());
>> >> > +qemu_coroutine_yield();
>> >> > +}
>> >> 
>> >> The above could be moved to loadvm_postcopy_handle_listen().
>> >
>> > I'm not 100% sure such thing (no matter here or moved into it, which does
>> > look cleaner) would work for us.
>> >
>> > The problem is I still don't yet see an ordering restricted on top of (1)
>> > accept() happens, and (2) receive LISTEN cmd here.  What happens if the
>> > accept() request is not yet received when reaching LISTEN?  Or is it always
>> > guaranteed the accept(fd) will always be polled here?
>> >
>> > For example, the source QEMU (no matter pre-7.2 or later) will always setup
>> > the preempt channel asynchrounously, then IIUC it can connect() after
>> > sending the whole chunk of packed data which should include this LISTEN.  I
>> > think it means it's not guaranteed this will 100% work, but maybe further
>> > reduce the possibility of the race.
>> >
>> > One right fix that I can think of is moving the sem_wait(&done) into the
>> > main thread too, so

Re: [PATCH] target/hppa: mask upper iaoq bits when returning to narrow mode

2024-04-01 Thread Richard Henderson

On 4/1/24 10:39, Sven Schnelle wrote:

Richard Henderson  writes:


On 4/1/24 04:52, Sven Schnelle wrote:

For unknown reasons, Java 1.5 on 64-bit HP-UX 11.11 does signed
computation of the new IAOQ value in the signal handler. In the
current code these bits are not masked when returning to narrow
mode, causing java to crash.
Signed-off-by: Sven Schnelle 
---
   target/hppa/sys_helper.c | 4 
   1 file changed, 4 insertions(+)
diff --git a/target/hppa/sys_helper.c b/target/hppa/sys_helper.c
index 208e51c086..3bbc2da71b 100644
--- a/target/hppa/sys_helper.c
+++ b/target/hppa/sys_helper.c
@@ -83,6 +83,10 @@ void HELPER(rfi)(CPUHPPAState *env)
   env->iaoq_f = env->cr[CR_IIAOQ];
   env->iaoq_b = env->cr_back[1];
   +if (!(env->cr[CR_IPSW] & PSW_W)) {
+env->iaoq_f &= 0xffffffff;
+env->iaoq_b &= 0xffffffff;
+}


This shouldn't be needed, because we are already masking these bits
later, in cpu_get_tb_cpu_state.  But I do have some cleanups in this
area, and perhaps one of them matters.


Ouch. Apologies, I removed that masking to make 64 bit work, but forgot
about that. Sorry. I tried your branch, but I'm not able to boot 64 bit HP-UX
(it wasn't working before without additional changes, so your branch
doesn't break it). However, I would like to get your opinion before
debugging. The code is:

IN:
0xef468002c18:   20 20 08 01   ldil L%-4000,r1
0xef468002c1c:   e4 20 e0 08   be,l 4(sr7,r1),sr0,r31
0xef468002c20:   34 16 00 72   ldi 39,r22

IA_F 0ef46800:2c23 (0ef468002c23)
IA_B 01e27c00:c007 (01e27c00c007)
PSW  0004000f CB    -CQPDI
GR00  GR01 c000 GR02 2ac3 GR03 
7a00
GR04 0002 GR05 7a34 GR06 7a40 GR07 
7a50
GR08 7ae0 GR09 800040001000 GR10 0003 GR11 
0006
GR12 001ecee8 GR13  GR14 0801 GR15 
0006
GR16 8000400020c0 GR17 0001e720 GR18 0008 GR19 

GR20 0ef46800 GR21 7a60 GR22  GR23 
7a50
GR24  GR25  GR26 7a0020f0 GR27 
40008410
GR28 0002 GR29  GR30 7a002160 GR31 
2c27
RC   7fff CR1   CR2   CR3  

CR4   CR5   CR6  0002 CR7  
fffb
PID1 256f PID2 7306 CCR  00c0 SAR  
0004
PID3  PID4  IVA  0002a000 EIEM 

ITMR 0005d637f804 ISQF 0ef46800 IOQF 2aab IIR  
6bc23fd9
ISR  04d0d000 IOR  40007a0020cc IPSW 0004000f EIRR 

TR0  00abfe68 TR1  0465d1d6 TR2  0002 TR3  

TR4   TR5  0001757973bb TR6  21eb TR7  
7a0020e0
ISQB 0ef46800 IOQB 2aaf
SR00 0ef46800 SR01  SR02  SR03 
SR04 0ef46800 SR05 04d0d000 SR06 01e27c00 SR07 01e27c00

INT   3529: instruction tlb miss fault @ 01e27c00:c007 for 
01e27c00:4000c004
INT   3530: instruction tlb miss fault @ :c007 for 
:4000c004
INT   3531: external interrupt @ :c007 for 
:4000c004
INT   3532: instruction tlb miss fault @ :c007 for 
:4000c004
INT   3533: external interrupt @ :c007 for 
:4000c004

So the PSW indicates narrow mode, but IAOQ seems to contain all the
... bits.


I believe that the IAOQ *should* contain all of the bits.  The bits should only be 
discarded when we form the GVA -- exactly like "ldb 0(r2)", where r2 contains all of the 
offset bits.  In particular, I believe that "b,l .+8,r2" should copy all of those bits to 
r2 from IAOQ_Back+4 and the fact that mainline crops those bits is a bug.




Also interesting is that the second TLB miss (INT 3530)
misses the Space ID.


That is a bit curious, yes.


Any thoughts? Otherwise i need to investigate and make a wrong patch
again :-)

The only patch i have on top which touches target/hppa is the space id
hashing mask patch:


Ok.  I do have an hppa 11.11 iso -- for clarity, what is your command-line?


r~



Re: [PATCH] target/hppa: mask upper iaoq bits when returning to narrow mode

2024-04-01 Thread Richard Henderson

On 4/1/24 10:56, Sven Schnelle wrote:

This seems to be caused by IIAOQ's containing the upper bits. With the
patch below i'm able to boot. Not sure whether it's correct though.

diff --git a/target/hppa/int_helper.c b/target/hppa/int_helper.c
index 58c13d3e61..f7c4cca8f1 100644
--- a/target/hppa/int_helper.c
+++ b/target/hppa/int_helper.c
@@ -123,8 +123,14 @@ void hppa_cpu_do_interrupt(CPUState *cs)
  env->cr[CR_IIASQ] = 0;
  env->cr_back[0] = 0;
  }
-env->cr[CR_IIAOQ] = env->iaoq_f;
-env->cr_back[1] = env->iaoq_b;
+if (old_psw & PSW_W) {
+env->cr[CR_IIAOQ] = env->iaoq_f;
+env->cr_back[1] = env->iaoq_b;
+} else {
+env->cr[CR_IIAOQ] = (env->iaoq_f & 0xffffffff);
+env->cr_back[1] = env->iaoq_b & 0xffffffff;
+}
+


I guess the interesting question is where these bits should get masked out
- I would assume that this place is too late, and it should happen
earlier in trans_be/when the iaoq value is copied. On the other hand
you had one commit that removed the masking in copy_iaoq_entry()...


I would have said this masking should not happen at all.


r~



Re: [PATCH] target/hppa: mask upper iaoq bits when returning to narrow mode

2024-04-01 Thread Sven Schnelle
Sven Schnelle  writes:

> Sven Schnelle  writes:
>
>> Richard Henderson  writes:
>>
>>> On 4/1/24 04:52, Sven Schnelle wrote:
 For unknown reasons, Java 1.5 on 64-bit HP-UX 11.11 does signed
 computation of the new IAOQ value in the signal handler. In the
 current code these bits are not masked when returning to narrow
 mode, causing java to crash.
 Signed-off-by: Sven Schnelle 
 ---
   target/hppa/sys_helper.c | 4 
   1 file changed, 4 insertions(+)
 diff --git a/target/hppa/sys_helper.c b/target/hppa/sys_helper.c
 index 208e51c086..3bbc2da71b 100644
 --- a/target/hppa/sys_helper.c
 +++ b/target/hppa/sys_helper.c
 @@ -83,6 +83,10 @@ void HELPER(rfi)(CPUHPPAState *env)
   env->iaoq_f = env->cr[CR_IIAOQ];
   env->iaoq_b = env->cr_back[1];
   +if (!(env->cr[CR_IPSW] & PSW_W)) {
 +env->iaoq_f &= 0xffffffff;
 +env->iaoq_b &= 0xffffffff;
 +}
>>>
>>> This shouldn't be needed, because we are already masking these bits
>>> later, in cpu_get_tb_cpu_state.  But I do have some cleanups in this
>>> area, and perhaps one of them matters.
>> Any thoughts? Otherwise i need to investigate and make a wrong patch
>> again :-)
>
> This seems to be caused by IIAOQ's containing the upper bits. With the
> patch below i'm able to boot. Not sure whether it's correct though.
>
> diff --git a/target/hppa/int_helper.c b/target/hppa/int_helper.c
> index 58c13d3e61..f7c4cca8f1 100644
> --- a/target/hppa/int_helper.c
> +++ b/target/hppa/int_helper.c
> @@ -123,8 +123,14 @@ void hppa_cpu_do_interrupt(CPUState *cs)
>  env->cr[CR_IIASQ] = 0;
>  env->cr_back[0] = 0;
>  }
> -env->cr[CR_IIAOQ] = env->iaoq_f;
> -env->cr_back[1] = env->iaoq_b;
> +if (old_psw & PSW_W) {
> +env->cr[CR_IIAOQ] = env->iaoq_f;
> +env->cr_back[1] = env->iaoq_b;
> +} else {
> +env->cr[CR_IIAOQ] = (env->iaoq_f & 0xffffffff);
> +env->cr_back[1] = env->iaoq_b & 0xffffffff;
> +}
> +

I guess the interesting question is where these bits should get masked out
- I would assume that this place is too late, and it should happen
earlier in trans_be/when the iaoq value is copied. On the other hand
you had one commit that removed the masking in copy_iaoq_entry()...



Re: [PATCH] target/hppa: mask upper iaoq bits when returning to narrow mode

2024-04-01 Thread Sven Schnelle
Sven Schnelle  writes:

> Richard Henderson  writes:
>
>> On 4/1/24 04:52, Sven Schnelle wrote:
>>> For unknown reasons, Java 1.5 on 64-bit HP-UX 11.11 does signed
>>> computation of the new IAOQ value in the signal handler. In the
>>> current code these bits are not masked when returning to narrow
>>> mode, causing java to crash.
>>> Signed-off-by: Sven Schnelle 
>>> ---
>>>   target/hppa/sys_helper.c | 4 
>>>   1 file changed, 4 insertions(+)
>>> diff --git a/target/hppa/sys_helper.c b/target/hppa/sys_helper.c
>>> index 208e51c086..3bbc2da71b 100644
>>> --- a/target/hppa/sys_helper.c
>>> +++ b/target/hppa/sys_helper.c
>>> @@ -83,6 +83,10 @@ void HELPER(rfi)(CPUHPPAState *env)
>>>   env->iaoq_f = env->cr[CR_IIAOQ];
>>>   env->iaoq_b = env->cr_back[1];
>>>   +if (!(env->cr[CR_IPSW] & PSW_W)) {
>>> +env->iaoq_f &= 0xffffffff;
>>> +env->iaoq_b &= 0xffffffff;
>>> +}
>>
>> This shouldn't be needed, because we are already masking these bits
>> later, in cpu_get_tb_cpu_state.  But I do have some cleanups in this
>> area, and perhaps one of them matters.
>
> Ouch. Apologies, I removed that masking to make 64 bit work, but forgot
> about that. Sorry. I tried your branch, but I'm not able to boot 64 bit HP-UX
> (it wasn't working before without additional changes, so your branch
> doesn't break it). However, I would like to get your opinion before
> debugging. The code is:
>
> IN: 
> 0xef468002c18:   20 20 08 01   ldil L%-4000,r1
> 0xef468002c1c:   e4 20 e0 08   be,l 4(sr7,r1),sr0,r31
> 0xef468002c20:   34 16 00 72   ldi 39,r22
>
> IA_F 0ef46800:2c23 (0ef468002c23)
> IA_B 01e27c00:c007 (01e27c00c007)
> PSW  0004000f CB    -CQPDI
> GR00  GR01 c000 GR02 2ac3 GR03 
> 7a00
> GR04 0002 GR05 7a34 GR06 7a40 GR07 
> 7a50
> GR08 7ae0 GR09 800040001000 GR10 0003 GR11 
> 0006
> GR12 001ecee8 GR13  GR14 0801 GR15 
> 0006
> GR16 8000400020c0 GR17 0001e720 GR18 0008 GR19 
> 
> GR20 0ef46800 GR21 7a60 GR22  GR23 
> 7a50
> GR24  GR25  GR26 7a0020f0 GR27 
> 40008410
> GR28 0002 GR29  GR30 7a002160 GR31 
> 2c27
> RC   7fff CR1   CR2   CR3  
> 
> CR4   CR5   CR6  0002 CR7  
> fffb
> PID1 256f PID2 7306 CCR  00c0 SAR  
> 0004
> PID3  PID4  IVA  0002a000 EIEM 
> 
> ITMR 0005d637f804 ISQF 0ef46800 IOQF 2aab IIR  
> 6bc23fd9
> ISR  04d0d000 IOR  40007a0020cc IPSW 0004000f EIRR 
> 
> TR0  00abfe68 TR1  0465d1d6 TR2  0002 TR3  
> 
> TR4   TR5  0001757973bb TR6  21eb TR7  
> 7a0020e0
> ISQB 0ef46800 IOQB 2aaf
> SR00 0ef46800 SR01  SR02  SR03 
> SR04 0ef46800 SR05 04d0d000 SR06 01e27c00 SR07 01e27c00
>
> INT   3529: instruction tlb miss fault @ 01e27c00:c007 
> for 01e27c00:4000c004
> INT   3530: instruction tlb miss fault @ :c007 
> for :4000c004
> INT   3531: external interrupt @ :c007 for 
> :4000c004
> INT   3532: instruction tlb miss fault @ :c007 
> for :4000c004
> INT   3533: external interrupt @ :c007 for 
> :4000c004
>
> So the PSW indicates narrow mode, but IAOQ seems to contain all the
> ... bits. Also interesting is that the second TLB miss (INT 3530)
> misses the Space ID.
>
> Any thoughts? Otherwise i need to investigate and make a wrong patch
> again :-)

This seems to be caused by IIAOQ containing the upper bits. With the
patch below I'm able to boot. Not sure whether it's correct though.

diff --git a/target/hppa/int_helper.c b/target/hppa/int_helper.c
index 58c13d3e61..f7c4cca8f1 100644
--- a/target/hppa/int_helper.c
+++ b/target/hppa/int_helper.c
@@ -123,8 +123,14 @@ void hppa_cpu_do_interrupt(CPUState *cs)
 env->cr[CR_IIASQ] = 0;
 env->cr_back[0] = 0;
 }
-env->cr[CR_IIAOQ] = env->iaoq_f;
-env->cr_back[1] = env->iaoq_b;
+if (old_psw & PSW_W) {
+env->cr[CR_IIAOQ] = env->iaoq_f;
+env->cr_back[1] = env->iaoq_b;
+} else {
+env->cr[CR_IIAOQ] = (env->iaoq_f & 0xffffffff);
+env->cr_back[1] = env->iaoq_b & 0xffffffff;
+}
+

Re: [PATCH] target/hppa: mask upper iaoq bits when returning to narrow mode

2024-04-01 Thread Sven Schnelle
Richard Henderson  writes:

> On 4/1/24 04:52, Sven Schnelle wrote:
>> For unknown reasons, Java 1.5 on 64-bit HP-UX 11.11 does signed
>> computation of the new IAOQ value in the signal handler. In the
>> current code these bits are not masked when returning to narrow
>> mode, causing java to crash.
>> Signed-off-by: Sven Schnelle 
>> ---
>>   target/hppa/sys_helper.c | 4 
>>   1 file changed, 4 insertions(+)
>> diff --git a/target/hppa/sys_helper.c b/target/hppa/sys_helper.c
>> index 208e51c086..3bbc2da71b 100644
>> --- a/target/hppa/sys_helper.c
>> +++ b/target/hppa/sys_helper.c
>> @@ -83,6 +83,10 @@ void HELPER(rfi)(CPUHPPAState *env)
>>   env->iaoq_f = env->cr[CR_IIAOQ];
>>   env->iaoq_b = env->cr_back[1];
>>   +if (!(env->cr[CR_IPSW] & PSW_W)) {
>> +env->iaoq_f &= 0xffffffff;
>> +env->iaoq_b &= 0xffffffff;
>> +}
>
> This shouldn't be needed, because we are already masking these bits
> later, in cpu_get_tb_cpu_state.  But I do have some cleanups in this
> area, and perhaps one of them matters.

Ouch. Apologies, I removed that masking to make 64 bit work, but forgot
about that. Sorry. I tried your branch, but I'm not able to boot 64 bit HP-UX
(it wasn't working before without additional changes, so your branch
doesn't break it). However, I would like to get your opinion before
debugging. The code is:

IN: 
0xef468002c18:   20 20 08 01   ldil L%-4000,r1
0xef468002c1c:   e4 20 e0 08   be,l 4(sr7,r1),sr0,r31
0xef468002c20:   34 16 00 72   ldi 39,r22

IA_F 0ef46800:2c23 (0ef468002c23)
IA_B 01e27c00:c007 (01e27c00c007)
PSW  0004000f CB    -CQPDI
GR00  GR01 c000 GR02 2ac3 GR03 
7a00
GR04 0002 GR05 7a34 GR06 7a40 GR07 
7a50
GR08 7ae0 GR09 800040001000 GR10 0003 GR11 
0006
GR12 001ecee8 GR13  GR14 0801 GR15 
0006
GR16 8000400020c0 GR17 0001e720 GR18 0008 GR19 

GR20 0ef46800 GR21 7a60 GR22  GR23 
7a50
GR24  GR25  GR26 7a0020f0 GR27 
40008410
GR28 0002 GR29  GR30 7a002160 GR31 
2c27
RC   7fff CR1   CR2   CR3  

CR4   CR5   CR6  0002 CR7  
fffb
PID1 256f PID2 7306 CCR  00c0 SAR  
0004
PID3  PID4  IVA  0002a000 EIEM 

ITMR 0005d637f804 ISQF 0ef46800 IOQF 2aab IIR  
6bc23fd9
ISR  04d0d000 IOR  40007a0020cc IPSW 0004000f EIRR 

TR0  00abfe68 TR1  0465d1d6 TR2  0002 TR3  

TR4   TR5  0001757973bb TR6  21eb TR7  
7a0020e0
ISQB 0ef46800 IOQB 2aaf
SR00 0ef46800 SR01  SR02  SR03 
SR04 0ef46800 SR05 04d0d000 SR06 01e27c00 SR07 01e27c00

INT   3529: instruction tlb miss fault @ 01e27c00:c007 for 
01e27c00:4000c004
INT   3530: instruction tlb miss fault @ :c007 for 
:4000c004
INT   3531: external interrupt @ :c007 for 
:4000c004
INT   3532: instruction tlb miss fault @ :c007 for 
:4000c004
INT   3533: external interrupt @ :c007 for 
:4000c004

So the PSW indicates narrow mode, but IAOQ seems to contain all the
... bits. Also interesting is that the second TLB miss (INT 3530)
misses the Space ID.

Any thoughts? Otherwise I need to investigate and make a wrong patch
again :-)

The only patch I have on top which touches target/hppa is the space id
hashing mask patch:

diff --git a/target/hppa/cpu.h b/target/hppa/cpu.h
index 66cae795bd..b35c7fa688 100644
--- a/target/hppa/cpu.h
+++ b/target/hppa/cpu.h
@@ -311,12 +311,13 @@ static inline int HPPA_BTLB_ENTRIES(CPUHPPAState *env)

 void hppa_translate_init(void);

+#define HPPA_GVA_OFFSET_MASK64 0x301f
 #define CPU_RESOLVING_TYPE TYPE_HPPA_CPU

 static inline uint64_t gva_offset_mask(target_ulong psw)
 {
 return (psw & PSW_W
-? MAKE_64BIT_MASK(0, 62)
+? HPPA_GVA_OFFSET_MASK64
 : MAKE_64BIT_MASK(0, 32));
 }

But that hopefully isn't causing the issue here.



Re: /util/cpuinfo-aarch64.c:58:22: error: 'HWCAP_USCAT' undeclared

2024-04-01 Thread Liviu Ionescu



> On 1 Apr 2024, at 23:04, Marcin Juszkiewicz  
> wrote:
> 
> So ask Ubuntu Pro team for support?

I did not ask for support, I just notified the community of an issue I 
encountered while building the latest sources, driven by a sincere desire to 
improve the project.

If this bothered you, I apologise.

Liviu




Re: Point where target instructions are read

2024-04-01 Thread Richard Henderson

On 4/1/24 09:50, Gautam Bhat wrote:

Hi,

Some background: I am trying to write a CPU emulator for MSP430 with
Qemu. I am loading the MSP430 program as follows using the generic
device loader:

/qemu-system-msp430 -machine msp430-launchpad -device
loader,file=simple_test -d in_asm,out_asm

I have implemented somewhat the TranslatorOps callbacks and my sample
output with some prints is as follows:

===msp430_tr_disas_log:204===

OUT: [size=51]
  -- guest addr 0x07fa + tb prologue
0x7fff6403fe00:  8b 5d f0 movl -0x10(%rbp), %ebx
0x7fff6403fe03:  85 dbtestl%ebx, %ebx
0x7fff6403fe05:  0f 8c 1c 00 00 00jl   0x7fff6403fe27
0x7fff6403fe0b:  c6 45 f4 01  movb $1, -0xc(%rbp)
0x7fff6403fe0f:  e9 00 00 00 00   jmp  0x7fff6403fe14
0x7fff6403fe14:  c7 45 00 fc 07 00 00 movl $0x7fc, (%rbp)
0x7fff6403fe1b:  48 8d 05 1e ff ff ff leaq -0xe2(%rip), %rax
0x7fff6403fe22:  e9 f1 01 fc ff   jmp  0x7fff6418
0x7fff6403fe27:  48 8d 05 15 ff ff ff leaq -0xeb(%rip), %rax
0x7fff6403fe2e:  e9 e5 01 fc ff   jmp  0x7fff6418

===gen_intermediate_code:251===
===msp430_tr_init_disas_context:84===
===msp430_tr_tb_start:99===
===msp430_tr_insn_start:107===
===msp430_tr_translate_insn:122===
CTX Dump State
==
pc_first 2044
pc_next 2044
is_jmp 0
max_insns 1
num_insns 1
TB flags: 1
TB cflags: 4278190081
TB CS base: 0
TB PC: 2044
==
Opcode: 0
is_jmp: 1
DISAS_TOO_MANY ===msp430_tr_tb_stop:170===

I was trying to find out where exactly in the Qemu code does it read
the target instructions from the file loaded (I could trace it to
load_elf(...) loading the FW file)


Yes, the contents of the file are loaded within load_elf().



and call my TranslatorOps
callbacks.  I get the above output continuously in a loop. Also when
the device generic loader is used, should I set the program counter to
a specific value?


The boot process must cooperate somehow.

When using loader, you must link the image such that it loads at the pc reset address 
defined by the architecture manual.



r~



Re: /util/cpuinfo-aarch64.c:58:22: error: 'HWCAP_USCAT' undeclared

2024-04-01 Thread Marcin Juszkiewicz

W dniu 1.04.2024 o 21:55, Liviu Ionescu pisze:

On 1 Apr 2024, at 21:48, Richard
Henderson  wrote:

You were told back in September that Ubuntu 18.04 is no longer
supported.

Sorry, I missed that.

BTW, according to ubuntu.com: "With Ubuntu Pro, the 18.04 LTS will be
fully supported until 2028.".


So ask Ubuntu Pro team for support?

QEMU team does not support Ubuntu 18.04. And several other distributions.



Re: /util/cpuinfo-aarch64.c:58:22: error: 'HWCAP_USCAT' undeclared

2024-04-01 Thread Liviu Ionescu



> On 1 Apr 2024, at 21:48, Richard Henderson  
> wrote:
> 
> You were told back in September that Ubuntu 18.04 is no longer supported.

Sorry, I missed that.

BTW, according to ubuntu.com: "With Ubuntu Pro, the 18.04 LTS will be fully 
supported until 2028.".


Regards,

Liviu




Point where target instructions are read

2024-04-01 Thread Gautam Bhat
Hi,

Some background: I am trying to write a CPU emulator for MSP430 with
Qemu. I am loading the MSP430 program as follows using the generic
device loader:

/qemu-system-msp430 -machine msp430-launchpad -device
loader,file=simple_test -d in_asm,out_asm

I have implemented somewhat the TranslatorOps callbacks and my sample
output with some prints is as follows:

===msp430_tr_disas_log:204===

OUT: [size=51]
 -- guest addr 0x07fa + tb prologue
0x7fff6403fe00:  8b 5d f0 movl -0x10(%rbp), %ebx
0x7fff6403fe03:  85 db   testl  %ebx, %ebx
0x7fff6403fe05:  0f 8c 1c 00 00 00jl   0x7fff6403fe27
0x7fff6403fe0b:  c6 45 f4 01  movb $1, -0xc(%rbp)
0x7fff6403fe0f:  e9 00 00 00 00   jmp  0x7fff6403fe14
0x7fff6403fe14:  c7 45 00 fc 07 00 00 movl $0x7fc, (%rbp)
0x7fff6403fe1b:  48 8d 05 1e ff ff ff leaq -0xe2(%rip), %rax
0x7fff6403fe22:  e9 f1 01 fc ff   jmp  0x7fff6418
0x7fff6403fe27:  48 8d 05 15 ff ff ff leaq -0xeb(%rip), %rax
0x7fff6403fe2e:  e9 e5 01 fc ff   jmp  0x7fff6418

===gen_intermediate_code:251===
===msp430_tr_init_disas_context:84===
===msp430_tr_tb_start:99===
===msp430_tr_insn_start:107===
===msp430_tr_translate_insn:122===
CTX Dump State
==
pc_first 2044
pc_next 2044
is_jmp 0
max_insns 1
num_insns 1
TB flags: 1
TB cflags: 4278190081
TB CS base: 0
TB PC: 2044
==
Opcode: 0
is_jmp: 1
DISAS_TOO_MANY ===msp430_tr_tb_stop:170===

I was trying to find out where exactly in the QEMU code the target
instructions are read from the loaded file (I could trace it to
load_elf(...) loading the FW file) and where my TranslatorOps
callbacks are called.  I get the above output continuously in a loop. Also,
when the generic device loader is used, should I set the program counter to
a specific value?

I am not able to understand how to proceed. Any help would be greatly
appreciated.



Re: /util/cpuinfo-aarch64.c:58:22: error: 'HWCAP_USCAT' undeclared

2024-04-01 Thread Richard Henderson

On 4/1/24 08:08, Liviu Ionescu wrote:

same behaviour for 8.2.2; same workaround.


On 2 Sep 2023, at 21:11, Liviu Ionescu  wrote:

When trying to build 8.1.0 on an Ubuntu 18.04 aarch64, I get the above error.


You were told back in September that Ubuntu 18.04 is no longer supported.
The passage of 6 months has not changed that.


r~



Re: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN

2024-04-01 Thread Peter Xu
On Mon, Apr 01, 2024 at 02:17:28PM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Fri, Mar 29, 2024 at 08:54:07AM +, Wang, Wei W wrote:
> >> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote:
> >> > When using the post-copy preemption feature to perform post-copy live
> >> > migration, the below scenario could lead to a deadlock and the migration 
> >> > will
> >> > never finish:
> >> > 
> >> >  - Source connect() the preemption channel in postcopy_start().
> >> >  - Source and the destination side TCP stack finished the 3-way handshake
> >> >thus the connection is successful.
> >> >  - The destination side main thread is handling the loading of the bulk 
> >> > RAM
> >> >pages thus it doesn't start to handle the pending connection event in 
> >> > the
> >> >event loop. and doesn't post the semaphore postcopy_qemufile_dst_done 
> >> > for
> >> >the preemption thread.
> >> >  - The source side sends non-iterative device states, such as the virtio
> >> >states.
> >> >  - The destination main thread starts to receive the virtio states, this
> >> >process may lead to a page fault (e.g., 
> >> > virtio_load()->vring_avail_idx()
> >> >may trigger a page fault since the avail ring page may not be received
> >> >yet).
> >
> > Ouch.  Yeah I think this part got overlooked when working on the preempt
> > channel.
> >
> >> >  - The page request is sent back to the source side. Source sends the 
> >> > page
> >> >content to the destination side preemption thread.
> >> >  - Since the event is not arrived and the semaphore
> >> >postcopy_qemufile_dst_done is not posted, the preemption thread in
> >> >destination side is blocked, and cannot handle receiving the page.
> >> >  - The QEMU main load thread on the destination side is stuck at the page
> >> >fault, and cannot yield and handle the connect() event for the
> >> >preemption channel to unblock the preemption thread.
> >> >  - The postcopy will stuck there forever since this is a deadlock.
> >> > 
> >> > The key point to reproduce this bug is that the source side is sending 
> >> > pages at a
> >> > rate faster than the destination handling, otherwise, the 
> >> > qemu_get_be64() in
> >> > ram_load_precopy() will have a chance to yield since at that time there 
> >> > are no
> >> > pending data in the buffer to get. This will make this bug harder to be
> >> > reproduced.
> >
> > How hard would this reproduce?
> >
> > I'm thinking whether this should be 9.0 material or 9.1.  It's pretty late
> > for 9.0 though, but we can still discuss.
> >
> >> > 
> >> > Fix this by yielding the load coroutine when receiving
> >> > MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the
> >> > connection event before loading the non-iterative devices state to avoid 
> >> > the
> >> > deadlock condition.
> >> > 
> >> > Signed-off-by: Lei Wang 
> >> 
> >> This seems to be a regression issue caused by this commit:
> >> 737840e2c6ea (migration: Use the number of transferred bytes directly)
> >> 
> >> Adding qemu_fflush back to migration_rate_exceeded() or ram_save_iterate
> >> seems to work (might not be a good fix though).
> >> 
> >> > ---
> >> >  migration/savevm.c | 5 +
> >> >  1 file changed, 5 insertions(+)
> >> > 
> >> > diff --git a/migration/savevm.c b/migration/savevm.c index
> >> > e386c5267f..8fd4dc92f2 100644
> >> > --- a/migration/savevm.c
> >> > +++ b/migration/savevm.c
> >> > @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f)
> >> >  return loadvm_postcopy_handle_advise(mis, len);
> >> > 
> >> >  case MIG_CMD_POSTCOPY_LISTEN:
> >> > +if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
> >> > +aio_co_schedule(qemu_get_current_aio_context(),
> >> > +qemu_coroutine_self());
> >> > +qemu_coroutine_yield();
> >> > +}
> >> 
> >> The above could be moved to loadvm_postcopy_handle_listen().
> >
> > I'm not 100% sure such thing (no matter here or moved into it, which does
> > look cleaner) would work for us.
> >
> > The problem is I still don't yet see an ordering restricted on top of (1)
> > accept() happens, and (2) receive LISTEN cmd here.  What happens if the
> > accept() request is not yet received when reaching LISTEN?  Or is it always
> > guaranteed the accept(fd) will always be polled here?
> >
> > For example, the source QEMU (no matter pre-7.2 or later) will always setup
> > the preempt channel asynchronously, then IIUC it can connect() after
> > sending the whole chunk of packed data which should include this LISTEN.  I
> > think it means it's not guaranteed this will 100% work, but maybe further
> > reduce the possibility of the race.
> >
> > One right fix that I can think of is moving the sem_wait(&done) into the
> > main thread too, so we wait for the sem _before_ reading the packed data,
> > so there's no chance of fault.  However I don't think sem_wait() will be
> > smart enough to yield when in a coroutine..

Re: [PATCH v2 3/4] hw/cxl/mbox: replace sanitize_running() with cxl_dev_media_disabled()

2024-04-01 Thread Jonathan Cameron via
On Sun, 21 Jan 2024 21:50:00 -0500
Hyeonggon Yoo <42.hye...@gmail.com> wrote:

> On Tue, Jan 9, 2024 at 12:54 PM Jonathan Cameron
>  wrote:
> >
> > On Fri, 22 Dec 2023 18:00:50 +0900
> > Hyeonggon Yoo <42.hye...@gmail.com> wrote:
> >  
> > > The spec states that reads/writes should have no effect and a part of
> > > commands should be ignored when the media is disabled, not when the
> > > sanitize command is running.
> > >
> > > Introduce cxl_dev_media_disabled() to check if the media is disabled and
> > > replace sanitize_running() with it.
> > >
> > > Make sure that the media has been correctly disabled during sanitation
> > > by adding an assert to __toggle_media(). Now, enabling when already
> > > enabled or vice versa results in an assert() failure.
> > >
> > > Suggested-by: Davidlohr Bueso 
> > > Signed-off-by: Hyeonggon Yoo <42.hye...@gmail.com>  
> >
> > This applies to
> >
> > hw/cxl: Add get scan media capabilities cmd support.
> >
> > Should I just squash it with that patch in my tree?
> > For now I'm holding it immediately on top of that, but I'm not keen to
> > send messy code upstream unless there is a good reason to retain the
> > history.  
> 
> Oh, while the diff makes it look like the patch touches scan_media_running(),
> it doesn't.
> 
> The proper Fixes: tag will be:
> Fixes: d77176724422 ("hw/cxl: Add support for device sanitation")
> 
> > If you are doing this sort of fix series in future, please call out
> > what they fix explicitly.  Can't use fixes tags as the commit ids
> > are unstable, but can mention the patch to make my life easier!  
> 
> Okay, next time I will either add the Fixes tag or add a comment on
> what it fixes.
> 
> By the way I guess your latest, public branch is still cxl-2023-11-02, right?
> https://gitlab.com/jic23/qemu/-/tree/cxl-2023-11-02
> 
> I assume you adjusted my v2 series, but please let me know if you prefer
> sending v3 against your latest tree.
> 
> Thanks,
> Hyeonggon

Side note, in its current form this breaks the switch-cci support in upstream
QEMU.  I've finally gotten back to getting ready to look at MMPT support and
ran into a crash as a result.  Needs protection with checked 
object_dynamic_cast()
to make sure we have a type3 device.  I'll update the patch in my tree.

Thanks,

Jonathan

> 




Re: [PATCH] target/hppa: mask upper iaoq bits when returning to narrow mode

2024-04-01 Thread Richard Henderson

On 4/1/24 04:52, Sven Schnelle wrote:

For unknown reasons, Java 1.5 on 64-bit HP-UX 11.11 does signed
computation of the new IAOQ value in the signal handler. In the
current code these bits are not masked when returning to narrow
mode, causing java to crash.

Signed-off-by: Sven Schnelle 
---
  target/hppa/sys_helper.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/target/hppa/sys_helper.c b/target/hppa/sys_helper.c
index 208e51c086..3bbc2da71b 100644
--- a/target/hppa/sys_helper.c
+++ b/target/hppa/sys_helper.c
@@ -83,6 +83,10 @@ void HELPER(rfi)(CPUHPPAState *env)
  env->iaoq_f = env->cr[CR_IIAOQ];
  env->iaoq_b = env->cr_back[1];
  
+if (!(env->cr[CR_IPSW] & PSW_W)) {

+env->iaoq_f &= 0x;
+env->iaoq_b &= 0x;
+}


This shouldn't be needed, because we are already masking these bits later, in 
cpu_get_tb_cpu_state.  But I do have some cleanups in this area, and perhaps one of them 
matters.



Please try

https://gitlab.com/rth7680/qemu/-/commit/d06e0303595a63565593ab2a5d42f312135b9ded


r~



Advice on block QMP command coroutines

2024-04-01 Thread Fabiano Rosas
Hi, I'm working on a v3 of my query-block series [1] and I'm a bit
confused about how to convert a QMP command into a coroutine.

In case you miss the context:

 In that series I'm turning query-block into a coroutine so we can avoid
 holding the BQL for too long in the case of a misbehaving (slow)
 syscall at the end of the call chain (get_allocated_file_size -> fstat
 in my case).

The issue:

After converting qmp_query_block into a coroutine, I'm hitting the
assert(false) bug at qcow2_get_specific_info() which was already fixed
for non-coroutines [2]. The bug was caused by qmp_query_block() running
during bdrv_activate_all():

bdrv_activate_all
  ...
  bdrv_invalidate_cache
bdrv_poll_co
|-> aio_co_enter
|   ...
|   qcow2_co_invalidate_cache
| memset(s, 0, ...)
| qcow2_do_open
|   blk_co_pread
|   ...
|   qemu_coroutine_yield
|-> AIO_WAIT_WHILE
|   aio_poll
| reschedule of qmp_dispatch
| qmp_query_block
| ...
| qcow2_get_specific_info
|   sees s->qcow_version == 0
|   assert(false)
  
So my question is: how do we expect to be able to convert a QMP command
into a coroutine if we're rescheduling all coroutines into
qemu_aio_context (at qmp_dispatch)? I don't see how to avoid some
random aio_poll dispatching the coroutine in the middle of
something else.

If I keep the QMP command in the iohandler context, then the bug never
happens. Rescheduling back into the iohandler would also work, were it
not for the HMP path which only polls on qemu_aio_context and causes a
deadlock.

What's the recommended approach here?

Thank you

1- https://lore.kernel.org/r/20230609201910.12100-1-faro...@suse.de
2- https://gitlab.com/qemu-project/qemu/-/issues/1933



Re: [External] Re: [PATCH v9 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-01 Thread Ho-Ren (Jack) Chuang
Hi SeongJae,

On Sun, Mar 31, 2024 at 12:09 PM SeongJae Park  wrote:
>
> Hi Ho-Ren,
>
> On Fri, 29 Mar 2024 05:33:52 + "Ho-Ren (Jack) Chuang" 
>  wrote:
>
> > Since different memory devices require finding, allocating, and putting
> > memory types, these common steps are abstracted in this patch,
> > enhancing the scalability and conciseness of the code.
> >
> > Signed-off-by: Ho-Ren (Jack) Chuang 
> > Reviewed-by: "Huang, Ying" 
> > ---
> >  drivers/dax/kmem.c   | 20 ++--
> >  include/linux/memory-tiers.h | 13 +
> >  mm/memory-tiers.c| 32 
> >  3 files changed, 47 insertions(+), 18 deletions(-)
> >
> [...]
> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > index 69e781900082..a44c03c2ba3a 100644
> > --- a/include/linux/memory-tiers.h
> > +++ b/include/linux/memory-tiers.h
> > @@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist);
> >  int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
> >const char *source);
> >  int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
> > +struct memory_dev_type *mt_find_alloc_memory_type(int adist,
> > + struct list_head 
> > *memory_types);
> > +void mt_put_memory_types(struct list_head *memory_types);
> >  #ifdef CONFIG_MIGRATION
> >  int next_demotion_node(int node);
> >  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> > @@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct 
> > access_coordinate *perf, int *adis
> >  {
> >   return -EIO;
> >  }
> > +
> > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct 
> > list_head *memory_types)
> > +{
> > + return NULL;
> > +}
> > +
> > +void mt_put_memory_types(struct list_head *memory_types)
> > +{
> > +
> > +}
>
> I found latest mm-unstable tree is failing kunit as below, and 'git bisect'
> says it happens from this patch.
>
> $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
> [11:56:40] Configuring KUnit Kernel ...
> [11:56:40] Building KUnit Kernel ...
> Populating config with:
> $ make ARCH=um O=../kunit.out/ olddefconfig
> Building with:
> $ make ARCH=um O=../kunit.out/ --jobs=36
> ERROR:root:In file included from .../mm/memory.c:71:
> .../include/linux/memory-tiers.h:143:25: warning: no previous prototype 
> for ‘mt_find_alloc_memory_type’ [-Wmissing-prototypes]
>   143 | struct memory_dev_type *mt_find_alloc_memory_type(int adist, 
> struct list_head *memory_types)
>   | ^
> .../include/linux/memory-tiers.h:148:6: warning: no previous prototype 
> for ‘mt_put_memory_types’ [-Wmissing-prototypes]
>   148 | void mt_put_memory_types(struct list_head *memory_types)
>   |  ^~~
> [...]
>
> Maybe we should set these as 'static inline', like below?  I confirmed this
> fixes the kunit error.  May I ask your opinion?
>

Thanks for catching this. I'm trying to figure out this problem. Will get back.

>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index a44c03c2ba3a..ee6e53144156 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -140,12 +140,12 @@ static inline int mt_perf_to_adistance(struct 
> access_coordinate *perf, int *adis
> return -EIO;
>  }
>
> -struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct 
> list_head *memory_types)
> +static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist, 
> struct list_head *memory_types)
>  {
> return NULL;
>  }
>
> -void mt_put_memory_types(struct list_head *memory_types)
> +static inline void mt_put_memory_types(struct list_head *memory_types)
>  {
>
>  }
>
>
> Thanks,
> SJ



-- 
Best regards,
Ho-Ren (Jack) Chuang
莊賀任



Re: Intention to work on GSoC project

2024-04-01 Thread daleyoung4242
Hi,

On Monday, March 25, 2024 21:20:32 CST Sahil wrote:
> Q1.
> Section 2.7.4 of the virtio spec [3] states that in an available descriptor,
> the "Element Length" stores the length of the buffer element. In the next
> few lines, it also states that the "Element Length" is reserved for used
> descriptors and is ignored by drivers. This sounds a little contradictory
> given that drivers write available descriptors in the descriptor ring.

When VIRTQ_DESC_F_WRITE is set, the device will use "Element Length" to 
specify the length it writes. When VIRTQ_DESC_F_WRITE is not set, which means 
the buffer is read-only for the device, "Element Length" will not be changed by 
the device, so drivers just ignore it.

> Q2.
> In the Red Hat article, just below the first listing ("Memory layout of a
> packed virtqueue descriptor"), there's the following line referring to the
> buffer id in
> "virtq_desc":
> > This time, the id field is not an index for the device to look for the
> > buffer: it is an opaque value for it, only has meaning for the driver.
> 
> But the device returns the buffer id when it writes the used descriptor to
> the descriptor ring. The "only has meaning for the driver" part has got me
> a little confused. Which buffer id is this that the device returns? Is it
> related to the buffer id in the available descriptor?

In my understanding, the buffer id is the value the driver places in the
available descriptor to identify the buffer when adding descriptors to the
table. The device returns the buffer id in the processed descriptor (for a
chain, in the last descriptor), writing it to the descriptor slot that the
used idx refers to (the first one in the chain). Then the used idx increments.

The Packed Virtqueue blog [1] is helpful, but some details in the examples are 
making me confused. 

Q1.
In the last step of the two-entries descriptor table example, it says both
buffers #0 and #1 are available for the device. I understand descriptor[0] is
available and descriptor[1] is not, but there is no ID #0 now. So did the
device get buffer #0 via a notification beforehand? If so, does that mean
buffer #0 would be lost when notifications are disabled?

Q2.
If I'm right, the buffer id of the first descriptor in a chain will be
overwritten with the last id. But in "Figure: Using another descriptor chain"
in the last step of the chain example, the buffer id of the first descriptor
(4th in the table) in the second chain doesn't change to 1. Is this a typo,
or am I wrong?

Thanks,
Dale Young

[1] https://www.redhat.com/en/blog/packed-virtqueue-how-reduce-overhead-virtio






Re: /util/cpuinfo-aarch64.c:58:22: error: 'HWCAP_USCAT' undeclared

2024-04-01 Thread Liviu Ionescu
same behaviour for 8.2.2; same workaround.

> On 2 Sep 2023, at 21:11, Liviu Ionescu  wrote:
> 
> When trying to build 8.1.0 on an Ubuntu 18.04 aarch64, I get the above error.
> 
> The offending code in `/util/cpuinfo-aarch64.c` is:
> 
> ```c
> #ifdef CONFIG_LINUX
>unsigned long hwcap = qemu_getauxval(AT_HWCAP);
>info |= (hwcap & HWCAP_ATOMICS ? CPUINFO_LSE : 0);
>info |= (hwcap & HWCAP_USCAT ? CPUINFO_LSE2 : 0);
>info |= (hwcap & HWCAP_AES ? CPUINFO_AES: 0);
> #endif
> ```
> 
> The reason is that on this distribution the <bits/hwcap.h> header file does
> not define HWCAP_USCAT:
> 
> ```
> root@9c7ad90af4f8:/# cat /usr/include/aarch64-linux-gnu/bits/hwcap.h
> /* Defines for bits in AT_HWCAP.  AArch64 Linux version.
>   Copyright (C) 2016-2018 Free Software Foundation, Inc.
>   This file is part of the GNU C Library.
> 
>   The GNU C Library is free software; you can redistribute it and/or
>   modify it under the terms of the GNU Lesser General Public
>   License as published by the Free Software Foundation; either
>   version 2.1 of the License, or (at your option) any later version.
> 
>   The GNU C Library is distributed in the hope that it will be useful,
>   but WITHOUT ANY WARRANTY; without even the implied warranty of
>   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>   Lesser General Public License for more details.
> 
>   You should have received a copy of the GNU Lesser General Public
>   License along with the GNU C Library; if not, see
> <http://www.gnu.org/licenses/>.  */
> 
> #if !defined (_SYS_AUXV_H)
> # error "Never include <bits/hwcap.h> directly; use <sys/auxv.h> instead."
> #endif
> 
> /* The following must match the kernel's <asm/hwcap.h>.  */
> #define HWCAP_FP (1 << 0)
> #define HWCAP_ASIMD (1 << 1)
> #define HWCAP_EVTSTRM (1 << 2)
> #define HWCAP_AES (1 << 3)
> #define HWCAP_PMULL (1 << 4)
> #define HWCAP_SHA1 (1 << 5)
> #define HWCAP_SHA2 (1 << 6)
> #define HWCAP_CRC32 (1 << 7)
> #define HWCAP_ATOMICS (1 << 8)
> #define HWCAP_FPHP (1 << 9)
> #define HWCAP_ASIMDHP (1 << 10)
> #define HWCAP_CPUID (1 << 11)
> #define HWCAP_ASIMDRDM (1 << 12)
> #define HWCAP_JSCVT (1 << 13)
> #define HWCAP_FCMA (1 << 14)
> #define HWCAP_LRCPC (1 << 15)
> #define HWCAP_DCPOP (1 << 16)
> #define HWCAP_SHA3 (1 << 17)
> #define HWCAP_SM3 (1 << 18)
> #define HWCAP_SM4 (1 << 19)
> #define HWCAP_ASIMDDP (1 << 20)
> #define HWCAP_SHA512 (1 << 21)
> #define HWCAP_SVE (1 << 22)
> root@9c7ad90af4f8:/# 
> ```
> 
> The full list of definitions should include:
> 
> ```
> #define HWCAP_ASIMDFHM (1 << 23)
> #define HWCAP_DIT (1 << 24)
> #define HWCAP_USCAT (1 << 25)
> #define HWCAP_ILRCPC (1 << 26)
> #define HWCAP_FLAGM (1 << 27)
> #define HWCAP_SSBS (1 << 28)
> ```
> 
> I don't know the meaning behind these bits, and how important is for QEMU to 
> correctly identify them all.
> 
> Since I know my build environment, my quick and dirty workaround was to pass 
> the definition via the preprocessor options:
> 
> ```
> CPPFLAGS+=" -DHWCAP_USCAT=(1<<25)"
> ```
> 
> However, for QEMU this is not a solution.
> 
> A possible solution would be to compile the code conditionally:
> 
> ```
> #ifdef CONFIG_LINUX
>unsigned long hwcap = qemu_getauxval(AT_HWCAP);
>info |= (hwcap & HWCAP_ATOMICS ? CPUINFO_LSE : 0);
> #ifdef HWCAP_USCAT
>info |= (hwcap & HWCAP_USCAT ? CPUINFO_LSE2 : 0);
> #endif
>info |= (hwcap & HWCAP_AES ? CPUINFO_AES: 0);
> #endif
> ```
> 
> I don't know if other distributions are also affected, my build platform for 
> all xPack standalone binaries is Ubuntu 18.04 LTS.
> 
> I know that 18.04 is an old version, but I use the xPack QEMU mainly to run 
> unit tests, and in some enterprise environments the machines used for testing 
> are sometimes pretty outdated, thus 18.04 will remain the base build platform 
> for a while.
> 
> It would be very nice if QEMU would still compile on Ubuntu 18.04, as it did 
> before 8.1.0.
> 
> 
> 
> Regards,
> 
> Liviu
> 




Re: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN

2024-04-01 Thread Fabiano Rosas
Peter Xu  writes:

> On Fri, Mar 29, 2024 at 08:54:07AM +, Wang, Wei W wrote:
>> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote:
>> > When using the post-copy preemption feature to perform post-copy live
>> > migration, the below scenario could lead to a deadlock and the migration 
>> > will
>> > never finish:
>> > 
>> >  - Source connect() the preemption channel in postcopy_start().
>> >  - Source and the destination side TCP stack finished the 3-way handshake
>> >thus the connection is successful.
>> >  - The destination side main thread is handling the loading of the bulk RAM
>> >pages thus it doesn't start to handle the pending connection event in 
>> > the
>> >event loop. and doesn't post the semaphore postcopy_qemufile_dst_done 
>> > for
>> >the preemption thread.
>> >  - The source side sends non-iterative device states, such as the virtio
>> >states.
>> >  - The destination main thread starts to receive the virtio states, this
>> >process may lead to a page fault (e.g., virtio_load()->vring_avail_idx()
>> >may trigger a page fault since the avail ring page may not be received
>> >yet).
>
> Ouch.  Yeah I think this part got overlooked when working on the preempt
> channel.
>
>> >  - The page request is sent back to the source side. Source sends the page
>> >content to the destination side preemption thread.
>> >  - Since the event is not arrived and the semaphore
>> >postcopy_qemufile_dst_done is not posted, the preemption thread in
>> >destination side is blocked, and cannot handle receiving the page.
>> >  - The QEMU main load thread on the destination side is stuck at the page
>> >fault, and cannot yield and handle the connect() event for the
>> >preemption channel to unblock the preemption thread.
>> >  - The postcopy will stuck there forever since this is a deadlock.
>> > 
>> > The key point to reproduce this bug is that the source side is sending 
>> > pages at a
>> > rate faster than the destination handling, otherwise, the qemu_get_be64() 
>> > in
>> > ram_load_precopy() will have a chance to yield since at that time there 
>> > are no
>> > pending data in the buffer to get. This will make this bug harder to be
>> > reproduced.
>
> How hard would this reproduce?
>
> I'm thinking whether this should be 9.0 material or 9.1.  It's pretty late
> for 9.0 though, but we can still discuss.
>
>> > 
>> > Fix this by yielding the load coroutine when receiving
>> > MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the
>> > connection event before loading the non-iterative devices state to avoid 
>> > the
>> > deadlock condition.
>> > 
>> > Signed-off-by: Lei Wang 
>> 
>> This seems to be a regression issue caused by this commit:
>> 737840e2c6ea (migration: Use the number of transferred bytes directly)
>> 
>> Adding qemu_fflush back to migration_rate_exceeded() or ram_save_iterate
>> seems to work (might not be a good fix though).
>> 
>> > ---
>> >  migration/savevm.c | 5 +
>> >  1 file changed, 5 insertions(+)
>> > 
>> > diff --git a/migration/savevm.c b/migration/savevm.c index
>> > e386c5267f..8fd4dc92f2 100644
>> > --- a/migration/savevm.c
>> > +++ b/migration/savevm.c
>> > @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f)
>> >  return loadvm_postcopy_handle_advise(mis, len);
>> > 
>> >  case MIG_CMD_POSTCOPY_LISTEN:
>> > +if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
>> > +aio_co_schedule(qemu_get_current_aio_context(),
>> > +qemu_coroutine_self());
>> > +qemu_coroutine_yield();
>> > +}
>> 
>> The above could be moved to loadvm_postcopy_handle_listen().
>
> I'm not 100% sure such thing (no matter here or moved into it, which does
> look cleaner) would work for us.
>
> The problem is I still don't yet see an ordering restricted on top of (1)
> accept() happens, and (2) receive LISTEN cmd here.  What happens if the
> accept() request is not yet received when reaching LISTEN?  Or is it always
> guaranteed the accept(fd) will always be polled here?
>
> For example, the source QEMU (no matter pre-7.2 or later) will always setup
> the preempt channel asynchronously, then IIUC it can connect() after
> sending the whole chunk of packed data which should include this LISTEN.  I
> think it means it's not guaranteed this will 100% work, but maybe further
> reduce the possibility of the race.
>
> One right fix that I can think of is moving the sem_wait(&done) into the
> main thread too, so we wait for the sem _before_ reading the packed data,
> so there's no chance of fault.  However I don't think sem_wait() will be
> smart enough to yield when in a coroutine..  In the long term run I think
> we should really make migration loadvm to do work in the thread rather than
> the main thread.  I think it means we have one more example to be listed in
> this todo so that's preferred..
>
> https://wiki.qemu.org/ToDo/LiveMigration#Create

Re: Patch for qemu-project/qemu#2247 issue

2024-04-01 Thread Michael Tokarev

01.04.2024 12:43, liu.d...@zte.com.cn wrote:

hmp: Add help information for watchdog action: inject-nmi

virsh qemu-monitor-command --hmp help information of watchdog_action missing 
inject-nmi which already supported in Commit 795dc6e4

Signed-off-by: Dayu Liu 


Applied to trivial-patches tree, in the following form:

Author: Dayu Liu 
Date:   Mon Apr 1 17:43:55 2024 +0800

hmp: Add help information for watchdog action: inject-nmi

virsh qemu-monitor-command --hmp help information of
watchdog_action missing inject-nmi which already supported
in Commit 795dc6e4

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2247
Signed-off-by: Dayu Liu 
Reviewed-by: Michael Tokarev 
Signed-off-by: Michael Tokarev 
[Mjt: decode and word-wrap commit message and add Resolves: tag]

Thanks,

/mjt



Re: [PATCH v2 1/1] nbd/server: do not poll within a coroutine context

2024-04-01 Thread Eric Blake
On Mon, Apr 01, 2024 at 08:41:20PM +0800, Zhu Yangyang wrote:
> Coroutines are not supposed to block. Instead, they should yield.
> 
> Fixes: f95910f ("nbd: implement TLS support in the protocol negotiation")
> Signed-off-by: Zhu Yangyang 
> ---
>  nbd/client.c   |  7 ---
>  nbd/common.c   | 19 ---
>  nbd/nbd-internal.h |  6 +++---
>  nbd/server.c   | 10 +-
>  4 files changed, 28 insertions(+), 14 deletions(-)
> 
> diff --git a/nbd/client.c b/nbd/client.c
> index 29ffc609a4..1ab91ed205 100644
> --- a/nbd/client.c
> +++ b/nbd/client.c
> @@ -619,18 +619,19 @@ static QIOChannel *nbd_receive_starttls(QIOChannel *ioc,
>  return NULL;
>  }
>  qio_channel_set_name(QIO_CHANNEL(tioc), "nbd-client-tls");
> -data.loop = g_main_loop_new(g_main_context_default(), FALSE);
>  trace_nbd_receive_starttls_tls_handshake();
>  qio_channel_tls_handshake(tioc,
> -  nbd_tls_handshake,
> +  nbd_client_tls_handshake,
>&data,
>NULL,
>NULL);
>  
>  if (!data.complete) {
> +data.loop = g_main_loop_new(g_main_context_default(), FALSE);
>  g_main_loop_run(data.loop);
> +g_main_loop_unref(data.loop);
>  }
> -g_main_loop_unref(data.loop);
> +

Aha - you figured out an elegant way around the client code not being
in coroutine context that had been stumping me, at least for the sake
of a minimal patch.

>  if (data.error) {
>  error_propagate(errp, data.error);
>  object_unref(OBJECT(tioc));
> diff --git a/nbd/common.c b/nbd/common.c
> index 3247c1d618..01ca30a5c4 100644
> --- a/nbd/common.c
> +++ b/nbd/common.c
> @@ -47,14 +47,27 @@ int nbd_drop(QIOChannel *ioc, size_t size, Error **errp)
>  }
>  
>  
> -void nbd_tls_handshake(QIOTask *task,
> -   void *opaque)
> +void nbd_client_tls_handshake(QIOTask *task, void *opaque)
>  {
>  struct NBDTLSHandshakeData *data = opaque;
>  
>  qio_task_propagate_error(task, &data->error);
>  data->complete = true;
> -g_main_loop_quit(data->loop);
> +if (data->loop) {
> +g_main_loop_quit(data->loop);
> +}
> +}
> +
> +
> +void nbd_server_tls_handshake(QIOTask *task, void *opaque)
> +{
> +struct NBDTLSHandshakeData *data = opaque;
> +
> +qio_task_propagate_error(task, &data->error);
> +data->complete = true;
> +if (!qemu_coroutine_entered(data->co)) {
> +aio_co_wake(data->co);
> +}
>  }

This matches up with what I was experimenting with last week, in that
server is in coroutine context while client is not.  However, it means
we no longer have a common implementation between the two, so keeping
the code isolated in nbd/common.c makes less sense than just placing
the two specific callbacks in the two specific files right where their
only use case exists (and even making them static at that point).

Or, can we still merge it into one helper?  It looks like we now have
3 viable possibilities:

data->loop   data->co
non-NULL     non-NULL    impossible
NULL         NULL        client, qio_task completed right away
non-NULL     NULL        client, qio_task did not complete right away
NULL         non-NULL    server, waking the coroutine depends on if we are in one

With that, we can still get by with one function, but need good
documentation.  I'll post a v3 along those lines, to see what you
think.
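To make the table above concrete, here is a minimal self-contained model of the wake-up decision such a merged helper would make. The names are purely illustrative (they are not the actual QEMU types or APIs); the real callback would call g_main_loop_quit() or aio_co_wake() where this sketch returns an action:

```c
#include <assert.h>
#include <stdbool.h>

/* Which wake-up applies for each (loop, co) combination from the table.
 * Illustrative model only, not the actual QEMU implementation. */
typedef enum { WAKE_NONE, WAKE_LOOP, WAKE_CORO } WakeAction;

static WakeAction tls_handshake_wake(bool has_loop, bool has_co,
                                     bool co_entered)
{
    assert(!(has_loop && has_co));    /* first row: impossible */
    if (has_loop) {
        return WAKE_LOOP;             /* client, task did not complete inline */
    }
    if (has_co && !co_entered) {
        return WAKE_CORO;             /* server, wake the waiting coroutine */
    }
    return WAKE_NONE;                 /* client, task completed right away */
}
```

The point of the model is that a single callback can dispatch on which field is non-NULL, which is what makes one shared helper viable.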

>  
>  
> diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
> index dfa02f77ee..99cca9382c 100644
> --- a/nbd/nbd-internal.h
> +++ b/nbd/nbd-internal.h
> @@ -74,13 +74,13 @@ static inline int nbd_write(QIOChannel *ioc, const void *buffer, size_t size,
>  
>  struct NBDTLSHandshakeData {
>  GMainLoop *loop;
> +Coroutine *co;
>  bool complete;
>  Error *error;
>  };

I had tried to get rid of the GMainLoop *loop member altogether, but
your change has the benefit of a smaller diff than what I was facing
(I got lost in the weeds trying to see if I could convert all of the
client into running in coroutine context).

>  
> -
> -void nbd_tls_handshake(QIOTask *task,
> -   void *opaque);
> +void nbd_server_tls_handshake(QIOTask *task, void *opaque);
> +void nbd_client_tls_handshake(QIOTask *task, void *opaque);
>  
>  int nbd_drop(QIOChannel *ioc, size_t size, Error **errp);
>  
> diff --git a/nbd/server.c b/nbd/server.c
> index c3484cc1eb..b218512ced 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -777,17 +777,17 @@ static QIOChannel *nbd_negotiate_handle_starttls(NBDClient *client,
>  
>  qio_channel_set_name(QIO_CHANNEL(tioc), "nbd-server-tls");
>  trace_nbd_negotiate_handle_starttls_handshake();
> -data.loop = g_main_loop_new(g_main_context_default(), FALSE);
> +data.co = qemu_coroutine_self();
>  qio_channel_tls_handshake(tioc,
> -  nbd_tls_handshake,
> +

Re: [PATCH] migration, docs: mark RDMA migration as deprecated

2024-04-01 Thread Peter Xu
On Mon, Apr 01, 2024 at 11:59:47AM +0800, Li Zhijian wrote:
> Except for RDMA migration, other parts of the RDMA subsystem have been
> removed since 9.1.
> 
> Due to the lack of unit tests and CI tests for RDMA migration, in the
> past development cycles a few fatal errors were introduced that broke
> RDMA migration, and these issues[1][2] were not fixed until some time later.
> 
> Modern network cards (TCP/IP) can also provide high bandwidth
> (similar to RDMA) to handle the large amount of data generated during
> migration.
> 
> Issue a warning to inform the end users of the RDMA migration status.
> 
> [1] https://lore.kernel.org/r/20230920090412.726725-1-lizhij...@fujitsu.com
> [2] 
> https://lore.kernel.org/r/cahecvy7hxswn4ow_kog+q+tn6f_kmeichevz1qgm-fbxbpp...@mail.gmail.com
> 
> CC: Peter Xu 
> CC: Philippe Mathieu-Daudé 
> CC: Fabiano Rosas 
> CC: Thomas Huth 
> CC: Daniel P. Berrangé 
> CC: Yu Zhang 
> Signed-off-by: Li Zhijian 

Thanks, Zhijian and everyone.

Acked-by: Peter Xu 

I'll keep this around for more days for a better exposure, and this will be
included in the 1st 9.1 pull if no objections.

-- 
Peter Xu




Re: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN

2024-04-01 Thread Peter Xu
On Fri, Mar 29, 2024 at 08:54:07AM +, Wang, Wei W wrote:
> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote:
> > When using the post-copy preemption feature to perform post-copy live
> > migration, the below scenario could lead to a deadlock and the migration 
> > will
> > never finish:
> > 
> >  - Source connect() the preemption channel in postcopy_start().
> >  - Source and the destination side TCP stack finished the 3-way handshake
> >thus the connection is successful.
> >  - The destination side main thread is handling the loading of the bulk RAM
> >pages, thus it doesn't start to handle the pending connection event in the
> >event loop, and doesn't post the semaphore postcopy_qemufile_dst_done for
> >the preemption thread.
> >  - The source side sends non-iterative device states, such as the virtio
> >states.
> >  - The destination main thread starts to receive the virtio states, this
> >process may lead to a page fault (e.g., virtio_load()->vring_avail_idx()
> >may trigger a page fault since the avail ring page may not be received
> >yet).

Ouch.  Yeah I think this part got overlooked when working on the preempt
channel.

> >  - The page request is sent back to the source side. Source sends the page
> >content to the destination side preemption thread.
> >  - Since the event has not arrived and the semaphore
> >postcopy_qemufile_dst_done is not posted, the preemption thread on the
> >destination side is blocked and cannot handle receiving the page.
> >  - The QEMU main load thread on the destination side is stuck at the page
> >fault, and cannot yield and handle the connect() event for the
> >preemption channel to unblock the preemption thread.
> >  - The postcopy will be stuck there forever since this is a deadlock.
> > 
> > The key point to reproduce this bug is that the source side is sending
> > pages at a rate faster than the destination can handle them; otherwise,
> > the qemu_get_be64() in ram_load_precopy() will have a chance to yield
> > since at that time there is no pending data in the buffer to get.  This
> > makes the bug harder to reproduce.

How hard is this to reproduce?

I'm thinking whether this should be 9.0 material or 9.1.  It's pretty late
for 9.0, but we can still discuss.

> > 
> > Fix this by yielding the load coroutine when receiving
> > MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the
> > connection event before loading the non-iterative devices state to avoid the
> > deadlock condition.
> > 
> > Signed-off-by: Lei Wang 
> 
> This seems to be a regression issue caused by this commit:
> 737840e2c6ea (migration: Use the number of transferred bytes directly)
> 
> Adding qemu_fflush back to migration_rate_exceeded() or ram_save_iterate
> seems to work (might not be a good fix though).
> 
> > ---
> >  migration/savevm.c | 5 +
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/migration/savevm.c b/migration/savevm.c index
> > e386c5267f..8fd4dc92f2 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f)
> >  return loadvm_postcopy_handle_advise(mis, len);
> > 
> >  case MIG_CMD_POSTCOPY_LISTEN:
> > +if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
> > +aio_co_schedule(qemu_get_current_aio_context(),
> > +qemu_coroutine_self());
> > +qemu_coroutine_yield();
> > +}
> 
> The above could be moved to loadvm_postcopy_handle_listen().

I'm not 100% sure such thing (no matter here or moved into it, which does
look cleaner) would work for us.

The problem is I still don't see an ordering restriction between (1) the
accept() happening, and (2) receiving the LISTEN cmd here.  What happens if
the accept() request has not yet been received when reaching LISTEN?  Or is
it always guaranteed the accept(fd) will have been polled by then?

For example, the source QEMU (no matter pre-7.2 or later) will always set up
the preempt channel asynchronously, so IIUC it can connect() after sending
the whole chunk of packed data, which should include this LISTEN.  I think
that means this fix is not guaranteed to work 100% of the time; it may only
further reduce the possibility of the race.

One right fix that I can think of is moving the sem_wait(&done) into the
main thread too, so we wait for the sem _before_ reading the packed data
and there's no chance of a fault.  However I don't think sem_wait() will be
smart enough to yield when in a coroutine.  In the long run I think we
should really make migration loadvm do its work in a thread rather than in
the main thread; it means we have one more example to be listed in this
todo, so that's preferred:

https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration_destination

I attached such draft patch below, but I'm not sure it'll work.  Let me
know how both of you think about it.

> 
> Another option is to

Re: [RFC PATCH-for-9.1 08/29] hw/i386/pc: Move CXLState to PcPciMachineState

2024-04-01 Thread Jonathan Cameron via
On Thu, 28 Mar 2024 16:54:16 +0100
Philippe Mathieu-Daudé  wrote:

> CXL depends on PCIe, which isn't available on non-PCI
> machines such as the ISA-only PC one.
> Move CXLState to PcPciMachineState, and move the CXL
> specific calls to pc_pci_machine_initfn() and
> pc_pci_machine_done().
> 
> Signed-off-by: Philippe Mathieu-Daudé 

LGTM as a change on its own - I've not reviewed the series
in general though, hence just an ack, as an R-b feels too strong.

Acked-by: Jonathan Cameron 




Re: [PATCH-for-9.0] hw/i386/pc: Restrict CXL to PCI-based machines

2024-04-01 Thread Jonathan Cameron via
On Wed, 27 Mar 2024 17:16:42 +0100
Philippe Mathieu-Daudé  wrote:

> CXL is based on PCIe. It is pointless to initialize
> its context on non-PCI machines.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
Seems a reasonable restriction.

Acked-by: Jonathan Cameron 

Jonathan

> ---
>  hw/i386/pc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index e80f02bef4..5c21b0c4db 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1738,7 +1738,9 @@ static void pc_machine_initfn(Object *obj)
>  pcms->pcspk = isa_new(TYPE_PC_SPEAKER);
>  object_property_add_alias(OBJECT(pcms), "pcspk-audiodev",
>OBJECT(pcms->pcspk), "audiodev");
> -cxl_machine_init(obj, &pcms->cxl_devices_state);
> +if (pcmc->pci_enabled) {
> +cxl_machine_init(obj, &pcms->cxl_devices_state);
> +}
>  
>  pcms->machine_done.notify = pc_machine_done;
>  qemu_add_machine_init_done_notifier(&pcms->machine_done);




Re: [PATCH] mem/cxl_type3: fix hpa to dpa logic

2024-04-01 Thread Jonathan Cameron via
On Thu, 28 Mar 2024 06:24:24 +
"Xingtao Yao (Fujitsu)"  wrote:

> Jonathan
> 
> thanks for your reply!
> 
> > -Original Message-
> > From: Jonathan Cameron 
> > Sent: Wednesday, March 27, 2024 9:28 PM
> > To: Yao, Xingtao/姚 幸涛 
> > Cc: fan...@samsung.com; qemu-devel@nongnu.org; Cao, Quanquan/曹 全全
> > 
> > Subject: Re: [PATCH] mem/cxl_type3: fix hpa to dpa logic
> > 
> > On Tue, 26 Mar 2024 21:46:53 -0400
> > Yao Xingtao  wrote:
> >   
> > > In 3, 6, 12 interleave ways, we could not access cxl memory properly,
> > > and when the process is running on it, a 'segmentation fault' error will
> > > occur.
> > >
> > > According to the CXL specification '8.2.4.20.13 Decoder Protection',
> > > there are two branches to convert HPA to DPA:
> > > b1: Decoder[m].IW < 8 (for 1, 2, 4, 8, 16 interleave ways)
> > > b2: Decoder[m].IW >= 8 (for 3, 6, 12 interleave ways)
> > >
> > > but only b1 has been implemented.
> > >
> > > To solve this issue, we should implement b2:
> > >   DPAOffset[51:IG+8]=HPAOffset[51:IG+IW] / 3
> > >   DPAOffset[IG+7:0]=HPAOffset[IG+7:0]
> > >   DPA=DPAOffset + Decoder[n].DPABase
> > >
> > > Links: https://lore.kernel.org/linux-cxl/3e84b919-7631-d1db-3e1d-33000f3f3868@fujitsu.com/
> > > Signed-off-by: Yao Xingtao   
> > 
> > Not implementing this was intentional (shouldn't seg fault obviously) but
> > I thought we were not advertising EP support for 3, 6, 12?  The HDM Decoder
> > configuration checking is currently terrible so we don't prevent
> > the bits being set (adding device side sanity checks for those decoders
> > has been on the todo list for a long time).  There are a lot of ways of
> > programming those that will blow up.
> > 
> > Can you confirm that the emulation reports they are supported.
> > https://elixir.bootlin.com/qemu/v9.0.0-rc1/source/hw/cxl/cxl-component-utils.c#L246
> > implies it shouldn't and so any software using them is broken.  
> yes, the feature is not supported by QEMU, but I can still create a
> 6-interleave-way region at the kernel layer.
> 
> I checked the kernel source code and found that the kernel does not check
> this bit when committing the decoder.
> we may add some check on the kernel side.

Ouch.  We definitely want that check!  The decoder commit will fail
anyway (well, it doesn't in QEMU yet, because we don't do all the sanity
checks we should).  However failing on commit is nasty, as the reason
should have been detected earlier.

> 
> > 
> > The non power of 2 decodes always made me nervous as the maths is more
> > complex and any changes to that decode will need careful checking.
> > For the power of 2 cases it was a bunch of writes to edge conditions etc
> > and checking the right data landed in the backing stores.  
> After applying this modification, I tested some commands using this
> memory, like 'ls' and 'top', and they executed normally; maybe there are
> some other problems I haven't hit yet.

I usually run a bunch of manual tests with devmem2 to ensure the edge cases
are handled correctly, but I've not really seen any errors that didn't also
show up in running stressors (e.g. stress-ng) or just memhog on the memory.

Jonathan

> 
> > 
> > Jonathan
> > 
> >   
> > > ---
> > >  hw/mem/cxl_type3.c | 15 +++
> > >  1 file changed, 11 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> > > index b0a7e9f11b..2c1218fb12 100644
> > > --- a/hw/mem/cxl_type3.c
> > > +++ b/hw/mem/cxl_type3.c
> > > @@ -805,10 +805,17 @@ static bool cxl_type3_dpa(CXLType3Dev *ct3d, hwaddr 
> > >  
> > host_addr, uint64_t *dpa)  
> > >  continue;
> > >  }
> > >
> > > -*dpa = dpa_base +
> > > -((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > > - ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & hpa_offset)
> > > -  >> iw));
> > > +if (iw < 8) {
> > > +*dpa = dpa_base +
> > > +((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > > + ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & hpa_offset)
> > > +  >> iw));
> > > +} else {
> > > +*dpa = dpa_base +
> > > +((MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
> > > + ((((MAKE_64BIT_MASK(ig + iw, 64 - ig - iw) & hpa_offset)
> > > +>> (ig + iw)) / 3) << (ig + 8)));
> > > +}
> > >
> > >  return true;
> > >  }  
> 
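For reference, both decode branches from the spec text quoted above can be sketched standalone. This is a hedged sketch, not the QEMU code: MAKE_64BIT_MASK is redefined locally, and ig/iw are assumed to be the encoded granularity and ways fields (granularity = 2^(ig+8) bytes; iw < 8 means 2^iw ways, iw >= 8 means 3 * 2^(iw-8) ways, i.e. the 3/6/12-way case):

```c
#include <stdint.h>

#define MAKE_64BIT_MASK(shift, length) \
    (((~0ULL) >> (64 - (length))) << (shift))

/* Sketch of CXL '8.2.4.20.13 Decoder Protection' HPA->DPA offset decode. */
static uint64_t hpa_to_dpa_offset(uint64_t hpa_offset, unsigned ig, unsigned iw)
{
    if (iw < 8) {
        /* b1: power-of-2 ways; drop the iw selector bits above IG+8 */
        return (MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
               ((MAKE_64BIT_MASK(8 + ig + iw, 64 - 8 - ig - iw) & hpa_offset)
                >> iw);
    }
    /* b2: 3/6/12 ways; DPAOffset[51:IG+8] = HPAOffset[51:IG+IW] / 3 */
    return (MAKE_64BIT_MASK(0, 8 + ig) & hpa_offset) |
           ((((MAKE_64BIT_MASK(ig + iw, 64 - ig - iw) & hpa_offset)
              >> (ig + iw)) / 3) << (ig + 8));
}
```

With ig=0, iw=1 (2-way), an HPA offset of 0x1234 keeps its low byte and drops selector bit 8, giving 0x934; with ig=0, iw=8 (3-way), 0x612 decodes to 0x212 (0x6 / 3 = 0x2 in the upper field).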




Re: [PATCH v4 1/7] qga: guest-get-fsinfo: add optional 'total-bytes-privileged' field

2024-04-01 Thread Konstantin Kostiuk
Hi Andrey,

I am ok with these changes.
No need new iteration with the version change, I will update before merging.

I will merge it after release.

Best Regards,
Konstantin Kostiuk.


On Mon, Apr 1, 2024 at 5:49 PM Andrey Drobyshev <
andrey.drobys...@virtuozzo.com> wrote:

> On 3/22/24 15:17, Andrey Drobyshev wrote:
> > On 3/22/24 12:39, Daniel P. Berrangé wrote:
> >> On Wed, Mar 20, 2024 at 06:16:42PM +0200, Andrey Drobyshev wrote:
> >>> Since the commit 25b5ff1a86 ("qga: add mountpoint usage info to
> >>> GuestFilesystemInfo") we have 2 values reported in guest-get-fsinfo:
> >>> used = (f_blocks - f_bfree), total = (f_blocks - f_bfree + f_bavail) as
> >>> returned by statvfs(3).  While on Windows guests that's all we can get
> >>> with GetDiskFreeSpaceExA(), on POSIX guests we might also be interested in
> >>> total file system size, as it's visible for root user.  Let's add an
> >>> optional field 'total-bytes-privileged' to GuestFilesystemInfo struct,
> >>> which'd only be reported on POSIX and represent f_blocks value as returned
> >>> by statvfs(3).
> >>>
> >>> While here, also tweak the docs to reflect better where those values
> >>> come from.
> >>>
> >>> Signed-off-by: Andrey Drobyshev 
> >>> ---
> >>>  qga/commands-posix.c | 2 ++
> >>>  qga/commands-win32.c | 1 +
> >>>  qga/qapi-schema.json | 7 +--
> >>>  3 files changed, 8 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/qga/commands-posix.c b/qga/commands-posix.c
> >>> index 26008db497..7df2d72e9f 100644
> >>> --- a/qga/commands-posix.c
> >>> +++ b/qga/commands-posix.c
> >>> @@ -1569,8 +1569,10 @@ static GuestFilesystemInfo *build_guest_fsinfo(struct FsMount *mount,
> >>>  nonroot_total = used + buf.f_bavail;
> >>>  fs->used_bytes = used * fr_size;
> >>>  fs->total_bytes = nonroot_total * fr_size;
> >>> +fs->total_bytes_privileged = buf.f_blocks * fr_size;
> >>>
> >>>  fs->has_total_bytes = true;
> >>> +fs->has_total_bytes_privileged = true;
> >>>  fs->has_used_bytes = true;
> >>>  }
> >>>
> >>> diff --git a/qga/commands-win32.c b/qga/commands-win32.c
> >>> index 6242737b00..6fee0e1e6f 100644
> >>> --- a/qga/commands-win32.c
> >>> +++ b/qga/commands-win32.c
> >>> @@ -1143,6 +1143,7 @@ static GuestFilesystemInfo *build_guest_fsinfo(char *guid, Error **errp)
> >>>  fs = g_malloc(sizeof(*fs));
> >>>  fs->name = g_strdup(guid);
> >>>  fs->has_total_bytes = false;
> >>> +fs->has_total_bytes_privileged = false;
> >>>  fs->has_used_bytes = false;
> >>>  if (len == 0) {
> >>>  fs->mountpoint = g_strdup("System Reserved");
> >>> diff --git a/qga/qapi-schema.json b/qga/qapi-schema.json
> >>> index 9554b566a7..dcc469b268 100644
> >>> --- a/qga/qapi-schema.json
> >>> +++ b/qga/qapi-schema.json
> >>> @@ -1026,7 +1026,10 @@
> >>>  #
> >>>  # @used-bytes: file system used bytes (since 3.0)
> >>>  #
> >>> -# @total-bytes: non-root file system total bytes (since 3.0)
> >>> +# @total-bytes: filesystem capacity in bytes for unprivileged users (since 3.0)
> >>> +#
> >>> +# @total-bytes-privileged: filesystem capacity in bytes for privileged users
> >>> +# (since 9.0)
> >>
> >> Will need to be '9.1', not '9.0', since we don't accept new features
> >> during freeze periods.
> >>
> >>>  #
> >>>  # @disk: an array of disk hardware information that the volume lies
> >>>  # on, which may be empty if the disk type is not supported
> >>> @@ -1036,7 +1039,7 @@
> >>>  { 'struct': 'GuestFilesystemInfo',
> >>>'data': {'name': 'str', 'mountpoint': 'str', 'type': 'str',
> >>> '*used-bytes': 'uint64', '*total-bytes': 'uint64',
> >>> -   'disk': ['GuestDiskAddress']} }
> >>> +   '*total-bytes-privileged': 'uint64', 'disk': ['GuestDiskAddress']} }
> >>>
> >>>  ##
> >>>  # @guest-get-fsinfo:
> >>
> >> Assuming the version is changed:
> >>
> >>   Reviewed-by: Daniel P. Berrangé 
> >>
> >> With regards,
> >> Daniel
> >
> > Thanks, Daniel.
> >
> > @kkostiuk do I need to send one more iteration with the version change?
> > Whatever is most convenient for you.
> >
> > Andrey
>
> Ping
>
>


Re: [PATCH v2 4/5] migration: Implement 'qatzip' methods using QAT

2024-04-01 Thread Fabiano Rosas
Bryan Zhang  writes:

> Uses QAT to offload deflate compression and decompression in the
> 'qatzip' compression method for multifd migration.

Please merge this patch with the previous. It makes no sense to have a
commit that adds nocomp code to qatzip only to have the next commit
replace it all.



Re: [PATCH v2 5/5] tests/migration: Add integration test for 'qatzip' compression method

2024-04-01 Thread Fabiano Rosas
Bryan Zhang  writes:

> Adds an integration test for 'qatzip'.
>
> Signed-off-by: Bryan Zhang 
> Signed-off-by: Hao Xiang 

Reviewed-by: Fabiano Rosas 



Re: [PATCH v2 2/5] migration: Add migration parameters for QATzip

2024-04-01 Thread Fabiano Rosas
Bryan Zhang  writes:

> Adds support for migration parameters to control QATzip compression
> level and to enable/disable software fallback when QAT hardware is
> unavailable. This is a preparatory commit for a subsequent commit that
> will actually use QATzip compression.
>
> Signed-off-by: Bryan Zhang 
> Signed-off-by: Hao Xiang 
> ---
> Revision: This commit now includes a parameter for controlling software
> fallback. Fallback is generally intended to be disabled, but having this
> option available enables using software fallback for testing.
>
> This commit also now has some glue code to properly set parameters.
>
>  migration/migration-hmp-cmds.c |  8 +
>  migration/options.c| 57 ++
>  migration/options.h|  2 ++
>  qapi/migration.json| 35 +
>  4 files changed, 102 insertions(+)
>
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index 99b49df5dd..4bd23ba14d 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -630,6 +630,14 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict 
> *qdict)
>  p->has_multifd_zlib_level = true;
>  visit_type_uint8(v, param, &p->multifd_zlib_level, &err);
>  break;
> +case MIGRATION_PARAMETER_MULTIFD_QATZIP_LEVEL:
> +p->has_multifd_qatzip_level = true;
> +visit_type_uint8(v, param, &p->multifd_qatzip_level, &err);
> +break;
> +case MIGRATION_PARAMETER_MULTIFD_QATZIP_SW_FALLBACK:
> +p->has_multifd_qatzip_sw_fallback = true;
> +visit_type_bool(v, param, &p->multifd_qatzip_sw_fallback, &err);
> +break;
>  case MIGRATION_PARAMETER_MULTIFD_ZSTD_LEVEL:
>  p->has_multifd_zstd_level = true;
>  visit_type_uint8(v, param, &p->multifd_zstd_level, &err);
> diff --git a/migration/options.c b/migration/options.c
> index 3e3e0b93b4..1316ea605a 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -62,6 +62,15 @@
>  #define DEFAULT_MIGRATE_MULTIFD_COMPRESSION MULTIFD_COMPRESSION_NONE
>  /* 0: means nocompress, 1: best speed, ... 9: best compress ratio */
>  #define DEFAULT_MIGRATE_MULTIFD_ZLIB_LEVEL 1
> +/*
> + * 1: best speed, ... 9: best compress ratio
> + * There is some nuance here. Refer to QATzip documentation to understand
> + * the mapping of QATzip levels to standard deflate levels.
> + */
> +#define DEFAULT_MIGRATE_MULTIFD_QATZIP_LEVEL 1
> +/* QATzip's SW fallback implementation is extremely slow, so avoid fallback 
> */
> +#define DEFAULT_MIGRATE_MULTIFD_QATZIP_SW_FALLBACK false
> +
>  /* 0: means nocompress, 1: best speed, ... 20: best compress ratio */
>  #define DEFAULT_MIGRATE_MULTIFD_ZSTD_LEVEL 1
>  
> @@ -143,6 +152,12 @@ Property migration_properties[] = {
>  DEFINE_PROP_UINT8("multifd-zlib-level", MigrationState,
>parameters.multifd_zlib_level,
>DEFAULT_MIGRATE_MULTIFD_ZLIB_LEVEL),
> +DEFINE_PROP_UINT8("multifd-qatzip-level", MigrationState,
> +  parameters.multifd_qatzip_level,
> +  DEFAULT_MIGRATE_MULTIFD_QATZIP_LEVEL),
> +DEFINE_PROP_BOOL("multifd-qatzip-sw-fallback", MigrationState,
> +  parameters.multifd_qatzip_sw_fallback,
> +  DEFAULT_MIGRATE_MULTIFD_QATZIP_SW_FALLBACK),
>  DEFINE_PROP_UINT8("multifd-zstd-level", MigrationState,
>parameters.multifd_zstd_level,
>DEFAULT_MIGRATE_MULTIFD_ZSTD_LEVEL),
> @@ -861,6 +876,20 @@ int migrate_multifd_zlib_level(void)
>  return s->parameters.multifd_zlib_level;
>  }
>  
> +int migrate_multifd_qatzip_level(void)
> +{
> +MigrationState *s = migrate_get_current();
> +
> +return s->parameters.multifd_qatzip_level;
> +}
> +
> +bool migrate_multifd_qatzip_sw_fallback(void)
> +{
> +MigrationState *s = migrate_get_current();
> +
> +return s->parameters.multifd_qatzip_sw_fallback;
> +}
> +
>  int migrate_multifd_zstd_level(void)
>  {
>  MigrationState *s = migrate_get_current();
> @@ -983,6 +1012,11 @@ MigrationParameters *qmp_query_migrate_parameters(Error 
> **errp)
>  params->multifd_compression = s->parameters.multifd_compression;
>  params->has_multifd_zlib_level = true;
>  params->multifd_zlib_level = s->parameters.multifd_zlib_level;
> +params->has_multifd_qatzip_level = true;
> +params->multifd_qatzip_level = s->parameters.multifd_qatzip_level;
> +params->has_multifd_qatzip_sw_fallback = true;
> +params->multifd_qatzip_sw_fallback =
> +s->parameters.multifd_qatzip_sw_fallback;
>  params->has_multifd_zstd_level = true;
>  params->multifd_zstd_level = s->parameters.multifd_zstd_level;
>  params->has_xbzrle_cache_size = true;
> @@ -1038,6 +1072,8 @@ void migrate_params_init(MigrationParameters *params)
>  params->has_multifd_channels = true;
>  params->has_multifd_compression = true;
>

Re: [PATCH net v3] virtio_net: Do not send RSS key if it is not supported

2024-04-01 Thread Jakub Kicinski
On Sun, 31 Mar 2024 16:20:30 -0400 Michael S. Tsirkin wrote:
> > Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
> > Cc: sta...@vger.kernel.org  
> 
> net has its own stable process, don't CC stable on net patches.

Not any more, FWIW:

  1.5.7. Stable tree

  While it used to be the case that netdev submissions were not
  supposed to carry explicit CC: sta...@vger.kernel.org tags that is no
  longer the case today. Please follow the standard stable rules in
  Documentation/process/stable-kernel-rules.rst, and make sure you
  include appropriate Fixes tags!

https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#stable-tree



Re: [PATCH 26/26] i386/kvm: Move architectural CPUID leaf generation to separate helper

2024-04-01 Thread Xiaoyao Li

On 3/23/2024 2:11 AM, Paolo Bonzini wrote:

From: Sean Christopherson 

Move the architectural (for lack of a better term) CPUID leaf generation
to a separate helper so that the generation code can be reused by TDX,
which needs to generate a canonical VM-scoped configuration.

For now this is just a cleanup, so keep the function static.

Signed-off-by: Sean Christopherson 
Signed-off-by: Xiaoyao Li 
Message-ID: <20240229063726.610065-23-xiaoyao...@intel.com>
[Unify error reporting, rename function. - Paolo]
Signed-off-by: Paolo Bonzini 
---
  target/i386/kvm/kvm.c | 446 +-
  1 file changed, 224 insertions(+), 222 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 2577e345502..eab6261e1f5 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1752,6 +1752,228 @@ static void kvm_init_nested_state(CPUX86State *env)
  }
  }
  
+static uint32_t kvm_x86_build_cpuid(CPUX86State *env,
+struct kvm_cpuid_entry2 *entries,
+uint32_t cpuid_i)
+{
+uint32_t limit, i, j;
+uint32_t unused;
+struct kvm_cpuid_entry2 *c;
+
+cpu_x86_cpuid(env, 0, 0, &limit, &unused, &unused, &unused);
+
+for (i = 0; i <= limit; i++) {
+j = 0;
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = &entries[cpuid_i++];
+switch (i) {
+case 2: {
+/* Keep reading function 2 till all the input is received */
+int times;
+
+c->function = i;
+c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC |
+   KVM_CPUID_FLAG_STATE_READ_NEXT;
+cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
+times = c->eax & 0xff;
+
+for (j = 1; j < times; ++j) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = &entries[cpuid_i++];
+c->function = i;
+c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC;
+cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
+}
+break;
+}
+case 0x1f:
+if (env->nr_dies < 2) {
+cpuid_i--;
+break;
+}
+/* fallthrough */
+case 4:
+case 0xb:
+case 0xd:
+for (j = 0; ; j++) {
+if (i == 0xd && j == 64) {
+break;
+}
+
+c->function = i;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+c->index = j;
+cpu_x86_cpuid(env, i, j, &c->eax, &c->ebx, &c->ecx, &c->edx);
+
+if (i == 4 && c->eax == 0) {
+break;
+}
+if (i == 0xb && !(c->ecx & 0xff00)) {
+break;
+}
+if (i == 0x1f && !(c->ecx & 0xff00)) {
+break;
+}
+if (i == 0xd && c->eax == 0) {
+continue;
+}
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = &entries[cpuid_i++];
+}
+break;
+case 0x12:
+for (j = 0; ; j++) {
+c->function = i;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+c->index = j;
+cpu_x86_cpuid(env, i, j, &c->eax, &c->ebx, &c->ecx, &c->edx);
+
+if (j > 1 && (c->eax & 0xf) != 1) {
+break;
+}
+
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = &entries[cpuid_i++];
+}
+break;
+case 0x7:
+case 0x14:
+case 0x1d:
+case 0x1e: {
+uint32_t times;
+
+c->function = i;
+c->index = 0;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
+times = c->eax;
+
+for (j = 1; j <= times; ++j) {
+if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+goto full;
+}
+c = &entries[cpuid_i++];
+c->function = i;
+c->index = j;
+c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+cpu_x86_cpuid(env, i, j, &c->eax, &c->ebx, &c->ecx, &c->edx);
+}
+break;
+}
+default:
+c->function = i;
+c->flags = 0;
+cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
+if (!c->eax && !c->ebx && !c->ecx && !c->edx) {
+/*
+ * KVM already returns all zeroes if a CPUID entry is missing,
+ 

Re: [PATCH v4 1/7] qga: guest-get-fsinfo: add optional 'total-bytes-privileged' field

2024-04-01 Thread Andrey Drobyshev
On 3/22/24 15:17, Andrey Drobyshev wrote:
> On 3/22/24 12:39, Daniel P. Berrangé wrote:
>> On Wed, Mar 20, 2024 at 06:16:42PM +0200, Andrey Drobyshev wrote:
>>> Since the commit 25b5ff1a86 ("qga: add mountpoint usage info to
>>> GuestFilesystemInfo") we have 2 values reported in guest-get-fsinfo:
>>> used = (f_blocks - f_bfree), total = (f_blocks - f_bfree + f_bavail) as
>>> returned by statvfs(3).  While on Windows guests that's all we can get
>>> with GetDiskFreeSpaceExA(), on POSIX guests we might also be interested in
>>> total file system size, as it's visible for root user.  Let's add an
>>> optional field 'total-bytes-privileged' to GuestFilesystemInfo struct,
>>> which'd only be reported on POSIX and represent f_blocks value as returned
>>> by statvfs(3).
>>>
>>> While here, also tweak the docs to reflect better where those values
>>> come from.
>>>
>>> Signed-off-by: Andrey Drobyshev 
>>> ---
>>>  qga/commands-posix.c | 2 ++
>>>  qga/commands-win32.c | 1 +
>>>  qga/qapi-schema.json | 7 +--
>>>  3 files changed, 8 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/qga/commands-posix.c b/qga/commands-posix.c
>>> index 26008db497..7df2d72e9f 100644
>>> --- a/qga/commands-posix.c
>>> +++ b/qga/commands-posix.c
>>> @@ -1569,8 +1569,10 @@ static GuestFilesystemInfo 
>>> *build_guest_fsinfo(struct FsMount *mount,
>>>  nonroot_total = used + buf.f_bavail;
>>>  fs->used_bytes = used * fr_size;
>>>  fs->total_bytes = nonroot_total * fr_size;
>>> +fs->total_bytes_privileged = buf.f_blocks * fr_size;
>>>  
>>>  fs->has_total_bytes = true;
>>> +fs->has_total_bytes_privileged = true;
>>>  fs->has_used_bytes = true;
>>>  }
>>>  
>>> diff --git a/qga/commands-win32.c b/qga/commands-win32.c
>>> index 6242737b00..6fee0e1e6f 100644
>>> --- a/qga/commands-win32.c
>>> +++ b/qga/commands-win32.c
>>> @@ -1143,6 +1143,7 @@ static GuestFilesystemInfo *build_guest_fsinfo(char 
>>> *guid, Error **errp)
>>>  fs = g_malloc(sizeof(*fs));
>>>  fs->name = g_strdup(guid);
>>>  fs->has_total_bytes = false;
>>> +fs->has_total_bytes_privileged = false;
>>>  fs->has_used_bytes = false;
>>>  if (len == 0) {
>>>  fs->mountpoint = g_strdup("System Reserved");
>>> diff --git a/qga/qapi-schema.json b/qga/qapi-schema.json
>>> index 9554b566a7..dcc469b268 100644
>>> --- a/qga/qapi-schema.json
>>> +++ b/qga/qapi-schema.json
>>> @@ -1026,7 +1026,10 @@
>>>  #
>>>  # @used-bytes: file system used bytes (since 3.0)
>>>  #
>>> -# @total-bytes: non-root file system total bytes (since 3.0)
>>> +# @total-bytes: filesystem capacity in bytes for unprivileged users (since 
>>> 3.0)
>>> +#
>>> +# @total-bytes-privileged: filesystem capacity in bytes for privileged 
>>> users
>>> +# (since 9.0)
>>
>> Will need to be '9.1', not '9.0', since we don't accept new features
>> during freeze periods.
>>
>>>  #
>>>  # @disk: an array of disk hardware information that the volume lies
>>>  # on, which may be empty if the disk type is not supported
>>> @@ -1036,7 +1039,7 @@
>>>  { 'struct': 'GuestFilesystemInfo',
>>>'data': {'name': 'str', 'mountpoint': 'str', 'type': 'str',
>>> '*used-bytes': 'uint64', '*total-bytes': 'uint64',
>>> -   'disk': ['GuestDiskAddress']} }
>>> +   '*total-bytes-privileged': 'uint64', 'disk': 
>>> ['GuestDiskAddress']} }
>>>  
>>>  ##
>>>  # @guest-get-fsinfo:
>>
>> Assuming the version is changed:
>>
>>   Reviewed-by: Daniel P. Berrangé 
>>
>> With regards,
>> Daniel
> 
> Thanks, Daniel.
> 
> @kkostiuk do I need to send one more iteration with the version change?
> Whatever is most convenient for you.
> 
> Andrey

Ping
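As an aside, the statvfs(3)-based accounting discussed in this thread can be sketched standalone. The helper name and struct below are hypothetical; the three values mirror what the commands-posix.c hunk above computes (used, unprivileged total, and the new privileged total from f_blocks):

```c
#include <stdint.h>
#include <sys/statvfs.h>

/* Hypothetical helper mirroring build_guest_fsinfo()'s accounting. */
struct fs_totals {
    uint64_t used_bytes;
    uint64_t total_bytes;            /* '*total-bytes': non-root view */
    uint64_t total_bytes_privileged; /* new '*total-bytes-privileged' */
};

static int guest_fs_totals(const char *path, struct fs_totals *out)
{
    struct statvfs buf;

    if (statvfs(path, &buf) != 0) {
        return -1;
    }
    uint64_t fr_size = buf.f_frsize;
    uint64_t used = buf.f_blocks - buf.f_bfree;   /* blocks in use */
    uint64_t nonroot_total = used + buf.f_bavail; /* visible to non-root */

    out->used_bytes = used * fr_size;
    out->total_bytes = nonroot_total * fr_size;
    out->total_bytes_privileged = (uint64_t)buf.f_blocks * fr_size;
    return 0;
}
```

Since f_bfree >= f_bavail, the privileged total is always at least the unprivileged one, which is the asymmetry the new field exposes.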



[PATCH] target/hppa: mask upper iaoq bits when returning to narrow mode

2024-04-01 Thread Sven Schnelle
For unknown reasons, Java 1.5 on 64-bit HP-UX 11.11 does signed
computation of the new IAOQ value in the signal handler. In the
current code the upper bits are not masked when returning to narrow
mode, causing Java to crash.

Signed-off-by: Sven Schnelle 
---
 target/hppa/sys_helper.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/target/hppa/sys_helper.c b/target/hppa/sys_helper.c
index 208e51c086..3bbc2da71b 100644
--- a/target/hppa/sys_helper.c
+++ b/target/hppa/sys_helper.c
@@ -83,6 +83,10 @@ void HELPER(rfi)(CPUHPPAState *env)
 env->iaoq_f = env->cr[CR_IIAOQ];
 env->iaoq_b = env->cr_back[1];
 
+if (!(env->cr[CR_IPSW] & PSW_W)) {
+env->iaoq_f &= 0xffffffff;
+env->iaoq_b &= 0xffffffff;
+}
 /*
  * For pa2.0, IIASQ is the top bits of the virtual address.
  * To recreate the space identifier, remove the offset bits.
-- 
2.43.2




Re: [PATCH] gitlab-ci/cirrus: switch from 'master' to 'latest'

2024-04-01 Thread Peter Maydell
On Mon, 1 Apr 2024 at 06:17, Michael Tokarev  wrote:
>
> Commit ab72522797 "gitlab: switch from 'stable' to
> 'latest' docker container tags" switched most tags
> to 'latest' but missed cirrus image.  Fix this now.
>
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2256
> Signed-off-by: Michael Tokarev 

Thanks; applied to git to fix the CI jobs.

-- PMM



Re: [PULL 0/2] Migration 20240331 patches

2024-04-01 Thread Peter Maydell
On Sun, 31 Mar 2024 at 19:32,  wrote:
>
> From: Peter Xu 
>
> The following changes since commit b9dbf6f9bf533564f6a4277d03906fcd32bb0245:
>
>   Merge tag 'pull-tcg-20240329' of https://gitlab.com/rth7680/qemu into 
> staging (2024-03-30 14:54:57 +)
>
> are available in the Git repository at:
>
>   https://gitlab.com/peterx/qemu.git tags/migration-20240331-pull-request
>
> for you to fetch changes up to d0ad271a7613459bd0a3397c8071a4ad06f3f7eb:
>
>   migration/postcopy: Ensure postcopy_start() sets errp if it fails 
> (2024-03-31 14:30:03 -0400)
>
> 
> Migration pull for 9.0-rc2
>
> - Avihai's two fixes on error paths


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/9.0
for any user-visible changes.

-- PMM



Re: [PATCH for-9.1] migration: Add Error** argument to add_bitmaps_to_list()

2024-04-01 Thread Peter Xu
On Fri, Mar 29, 2024 at 11:56:27AM +0100, Cédric Le Goater wrote:
> This allows to report more precise errors in the migration handler
> dirty_bitmap_save_setup().
> 
> Suggested-by Vladimir Sementsov-Ogievskiy  
> Signed-off-by: Cédric Le Goater 
> ---
> 
>  To apply on top of : 
>  https://lore.kernel.org/qemu-devel/20240320064911.545001-1-...@redhat.com/

queued, thanks.

-- 
Peter Xu




Re: [PATCH v2 1/4] qapi/block-core: avoid the re-use of MirrorSyncMode for backup

2024-04-01 Thread Vladimir Sementsov-Ogievskiy

On 07.03.24 16:47, Fiona Ebner wrote:

Backup supports all modes listed in MirrorSyncMode, while mirror does
not. Introduce BackupSyncMode by copying the current MirrorSyncMode
and drop the variants mirror does not support from MirrorSyncMode as
well as the corresponding manual check in mirror_start().

Suggested-by: Vladimir Sementsov-Ogievskiy
Signed-off-by: Fiona Ebner


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir




Patch for qemu-project/qemu#2247 issue

2024-04-01 Thread liu.dayu
hmp: Add help information for watchdog action: inject-nmi

The help information for watchdog_action shown by `virsh qemu-monitor-command --hmp` 
is missing inject-nmi, which has already been supported since commit 795dc6e4.

Signed-off-by: Dayu Liu 

---
 hmp-commands.hx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 772ab996a..bc7e6d1dc 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1412,7 +1412,7 @@ ETEXI
 {
 .name   = "watchdog_action",
 .args_type  = "action:s",
-.params = "[reset|shutdown|poweroff|pause|debug|none]",
+.params = "[reset|shutdown|poweroff|pause|debug|none|inject-nmi]",
 .help   = "change watchdog action",
 .cmd= hmp_watchdog_action,
 .command_completion = watchdog_action_completion,
--



[PATCH v1 1/1] hw/riscv/boot.c: Support 64-bit address for initrd

2024-04-01 Thread Cheng Yang
Use qemu_fdt_setprop_u64() instead of qemu_fdt_setprop_cell()
to set the address of initrd in FDT to support 64-bit address.

Signed-off-by: Cheng Yang 
---
 hw/riscv/boot.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/riscv/boot.c b/hw/riscv/boot.c
index 09878e722c..47281ca853 100644
--- a/hw/riscv/boot.c
+++ b/hw/riscv/boot.c
@@ -209,8 +209,8 @@ static void riscv_load_initrd(MachineState *machine, 
uint64_t kernel_entry)
 /* Some RISC-V machines (e.g. opentitan) don't have a fdt. */
 if (fdt) {
 end = start + size;
-qemu_fdt_setprop_cell(fdt, "/chosen", "linux,initrd-start", start);
-qemu_fdt_setprop_cell(fdt, "/chosen", "linux,initrd-end", end);
+qemu_fdt_setprop_u64(fdt, "/chosen", "linux,initrd-start", start);
+qemu_fdt_setprop_u64(fdt, "/chosen", "linux,initrd-end", end);
 }
 }
 
-- 
2.34.1




[PATCH v2 1/1] nbd/server: do not poll within a coroutine context

2024-04-01 Thread Zhu Yangyang via
Coroutines are not supposed to block. Instead, they should yield.

Fixes: f95910f ("nbd: implement TLS support in the protocol negotiation")
Signed-off-by: Zhu Yangyang 
---
 nbd/client.c   |  7 ---
 nbd/common.c   | 19 ---
 nbd/nbd-internal.h |  6 +++---
 nbd/server.c   | 10 +-
 4 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/nbd/client.c b/nbd/client.c
index 29ffc609a4..1ab91ed205 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -619,18 +619,19 @@ static QIOChannel *nbd_receive_starttls(QIOChannel *ioc,
 return NULL;
 }
 qio_channel_set_name(QIO_CHANNEL(tioc), "nbd-client-tls");
-data.loop = g_main_loop_new(g_main_context_default(), FALSE);
 trace_nbd_receive_starttls_tls_handshake();
 qio_channel_tls_handshake(tioc,
-  nbd_tls_handshake,
+  nbd_client_tls_handshake,
   &data,
   NULL,
   NULL);
 
 if (!data.complete) {
+data.loop = g_main_loop_new(g_main_context_default(), FALSE);
 g_main_loop_run(data.loop);
+g_main_loop_unref(data.loop);
 }
-g_main_loop_unref(data.loop);
+
 if (data.error) {
 error_propagate(errp, data.error);
 object_unref(OBJECT(tioc));
diff --git a/nbd/common.c b/nbd/common.c
index 3247c1d618..01ca30a5c4 100644
--- a/nbd/common.c
+++ b/nbd/common.c
@@ -47,14 +47,27 @@ int nbd_drop(QIOChannel *ioc, size_t size, Error **errp)
 }
 
 
-void nbd_tls_handshake(QIOTask *task,
-   void *opaque)
+void nbd_client_tls_handshake(QIOTask *task, void *opaque)
 {
 struct NBDTLSHandshakeData *data = opaque;
 
 qio_task_propagate_error(task, &data->error);
 data->complete = true;
-g_main_loop_quit(data->loop);
+if (data->loop) {
+g_main_loop_quit(data->loop);
+}
+}
+
+
+void nbd_server_tls_handshake(QIOTask *task, void *opaque)
+{
+struct NBDTLSHandshakeData *data = opaque;
+
+qio_task_propagate_error(task, &data->error);
+data->complete = true;
+if (!qemu_coroutine_entered(data->co)) {
+aio_co_wake(data->co);
+}
 }
 
 
diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
index dfa02f77ee..99cca9382c 100644
--- a/nbd/nbd-internal.h
+++ b/nbd/nbd-internal.h
@@ -74,13 +74,13 @@ static inline int nbd_write(QIOChannel *ioc, const void 
*buffer, size_t size,
 
 struct NBDTLSHandshakeData {
 GMainLoop *loop;
+Coroutine *co;
 bool complete;
 Error *error;
 };
 
-
-void nbd_tls_handshake(QIOTask *task,
-   void *opaque);
+void nbd_server_tls_handshake(QIOTask *task, void *opaque);
+void nbd_client_tls_handshake(QIOTask *task, void *opaque);
 
 int nbd_drop(QIOChannel *ioc, size_t size, Error **errp);
 
diff --git a/nbd/server.c b/nbd/server.c
index c3484cc1eb..b218512ced 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -777,17 +777,17 @@ static QIOChannel 
*nbd_negotiate_handle_starttls(NBDClient *client,
 
 qio_channel_set_name(QIO_CHANNEL(tioc), "nbd-server-tls");
 trace_nbd_negotiate_handle_starttls_handshake();
-data.loop = g_main_loop_new(g_main_context_default(), FALSE);
+data.co = qemu_coroutine_self();
 qio_channel_tls_handshake(tioc,
-  nbd_tls_handshake,
+  nbd_server_tls_handshake,
   &data,
   NULL,
   NULL);
 
-if (!data.complete) {
-g_main_loop_run(data.loop);
+while (!data.complete) {
+qemu_coroutine_yield();
 }
-g_main_loop_unref(data.loop);
+
 if (data.error) {
 object_unref(OBJECT(tioc));
 error_propagate(errp, data.error);
-- 
2.33.0




[PATCH v2 0/1] coroutine: avoid inserting duplicate coroutine to co_queue_wakeup

2024-04-01 Thread Zhu Yangyang via
The problem of inserting a duplicate coroutine into co_queue_wakeup has been
resolved by 7c1f51bf38 ("nbd/server: Fix drained_poll to wake coroutine
in right AioContext"), which avoids repeatedly waking up the same coroutine.

The key modifications are as follows:

static void qio_channel_restart_read(void *opaque)
{
QIOChannel *ioc = opaque;
-   Coroutine *co = ioc->read_coroutine;
+   Coroutine *co = qatomic_xchg(&ioc->read_coroutine, NULL);
+
+   if (!co) {
+   return;
+   }

/* Assert that aio_co_wake() reenters the coroutine directly */
assert(qemu_get_current_aio_context() ==
   qemu_coroutine_get_aio_context(co));
aio_co_wake(co);
}

The root cause is that poll() is invoked in coroutine context; this series fixes that.

Changes in v2:
Drop the changes to aio_co_enter and instead fix the poll() call in the 
NBD server.

Zhu Yangyang (1):
  nbd/server: do not poll within a coroutine context

 nbd/client.c   |  7 ---
 nbd/common.c   | 19 ---
 nbd/nbd-internal.h |  6 +++---
 nbd/server.c   | 10 +-
 4 files changed, 28 insertions(+), 14 deletions(-)

-- 
2.33.0




Re: [PULL 00/18] target/hppa patch queue

2024-04-01 Thread Peter Maydell
On Fri, 29 Mar 2024 at 22:32, Richard Henderson
 wrote:
>
> The following changes since commit 5012e522aca161be5c141596c66e5cc6082538a9:
>
>   Update version for v9.0.0-rc1 release (2024-03-26 19:46:55 +)
>
> are available in the Git repository at:
>
>   https://gitlab.com/rth7680/qemu.git tags/pull-pa-20240329
>
> for you to fetch changes up to 4a3aa11e1fb25c28c24a43fd2835c429b00a463d:
>
>   target/hppa: Clear psw_n for BE on use_nullify_skip path (2024-03-29 
> 08:15:01 -1000)
>
> 
> target/hppa: Fix BE,L set of sr0
> target/hppa: Fix B,GATE for wide mode
> target/hppa: Mark interval timer write as io
> target/hppa: Fix EIRR, EIEM versus icount
> target/hppa: Fix DCOR reconstruction of carry bits
> target/hppa: Fix unit carry conditions
> target/hppa: Fix overflow computation for shladd
> target/hppa: Add diag instructions to set/restore shadow registers
> target/hppa: Clear psw_n for BE on use_nullify_skip path
>
> 


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/9.0
for any user-visible changes.

-- PMM



Re: [PULL 0/8] ppc-for-9.0-3 queue

2024-04-01 Thread Peter Maydell
On Sun, 31 Mar 2024 at 08:34, Nicholas Piggin  wrote:
>
> The following changes since commit 5012e522aca161be5c141596c66e5cc6082538a9:
>
>   Update version for v9.0.0-rc1 release (2024-03-26 19:46:55 +)
>
> are available in the Git repository at:
>
>   https://gitlab.com/npiggin/qemu.git tags/pull-ppc-for-9.0-3-20240331
>
> for you to fetch changes up to b07a5bb736ca08d55cc3ada8ca309943b55d4b70:
>
>   tests/avocado: ppc_hv_tests.py set alpine time before setup-alpine 
> (2024-03-30 18:50:26 +1000)
>
> 
> * Various fixes for recent regressions and new code.
>
> 


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/9.0
for any user-visible changes.

-- PMM



Re: [PULL for-9.0 0/2] 9p queue 2024-03-29

2024-04-01 Thread Peter Maydell
On Sat, 30 Mar 2024 at 13:39, Christian Schoenebeck
 wrote:
>
> The following changes since commit 5012e522aca161be5c141596c66e5cc6082538a9:
>
>   Update version for v9.0.0-rc1 release (2024-03-26 19:46:55 +)
>
> are available in the Git repository at:
>
>   https://github.com/cschoenebeck/qemu.git tags/pull-9p-20240329
>
> for you to fetch changes up to dcae75fba1084823d0fc87caa13f0ba6f32155f3:
>
>   qtest/virtio-9p-test.c: remove g_test_slow() gate (2024-03-28 09:54:47 
> +0100)
>
> 
> Changes for 9p tests only:
>
> * Fix 9p tests for riscv.
>
> * Re-enable 9p 'local' tests for running in CI pipelines.
>
> 


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/9.0
for any user-visible changes.

-- PMM



Re: [RFC PATCH v2 1/6] cxl/core: correct length of DPA field masks

2024-04-01 Thread Shiyang Ruan via




在 2024/3/30 9:37, Dan Williams 写道:

Shiyang Ruan wrote:

The length of the Physical Address field in the General Media Event
Record/DRAM Event Record is 64 bits, so the field masks should be
defined with that width. Otherwise, the cxl_general_media and cxl_dram
tracepoints mask off the upper 32 bits of DPA addresses. The cxl_poison
event is unaffected.

If userspace was doing its own DPA-to-HPA translation this could lead to
incorrect page retirement decisions, but there is no known consumer
(like rasdaemon) of this event today.

Fixes: d54a531a430b ("cxl/mem: Trace General Media Event Record")
Cc: 
Cc: Dan Williams 
Cc: Davidlohr Bueso 
Cc: Jonathan Cameron 
Cc: Ira Weiny 
Signed-off-by: Shiyang Ruan 
---
  drivers/cxl/core/trace.h | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index e5f13260fc52..e2d1f296df97 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -253,11 +253,11 @@ TRACE_EVENT(cxl_generic_event,
   * DRAM Event Record
   * CXL rev 3.0 section 8.2.9.2.1.2; Table 8-44
   */
-#define CXL_DPA_FLAGS_MASK 0x3F
+#define CXL_DPA_FLAGS_MASK 0x3FULL


This change makes sense...


  #define CXL_DPA_MASK  (~CXL_DPA_FLAGS_MASK)
  
-#define CXL_DPA_VOLATILE			BIT(0)
-#define CXL_DPA_NOT_REPAIRABLE		BIT(1)
+#define CXL_DPA_VOLATILE			BIT_ULL(0)
+#define CXL_DPA_NOT_REPAIRABLE		BIT_ULL(1)


...these do not. The argument to __print_flags() is an unsigned long, so
they will be cast down to (unsigned long), and they are never used as a
mask so the generated code should not change.


They are only used to check whether such a flag is set, not as a mask.
So yes, I'll remove these changes.



--
Thanks,
Ruan.



Re: vhost-user-blk reconnect issue

2024-04-01 Thread Yajun Wu


On 4/1/2024 4:34 PM, Li Feng wrote:

*External email: Use caution opening links or attachments*


Hi yajun,

I have submitted a patch to fix this problem a few months ago, but in 
the end this solution was not accepted and other solutions

were adopted to fix it.

[PATCH 1/2] vhost-user: fix lost reconnect - Li Feng (lore.kernel.org)

I think this fix is valid.


This is the merged fix:

[PULL 76/83] vhost-user: fix lost reconnect - Michael S. Tsirkin (lore.kernel.org)

My tests are with this fix applied; they failed in the two scenarios I mentioned.



Thanks,
Li


On Apr 1, 2024, at 10:08, Yajun Wu wrote:


On 3/27/2024 6:47 PM, Stefano Garzarella wrote:



Hi Yajun,

On Mon, Mar 25, 2024 at 10:54:13AM +, Yajun Wu wrote:

Hi experts,

With latest QEMU (8.2.90), we find two vhost-user-blk backend reconnect
failure scenarios:

Do you know if it has ever worked (so this is a regression), or have we
always had this problem?


I am afraid this commit: "71e076a07d (2022-12-01 02:30:13 -0500) 
hw/virtio: generalise CHR_EVENT_CLOSED handling"  caused both 
failures. Previous hash is good.


I suspect the "if (vhost->vdev)" in vhost_user_async_close_bh is the 
cause, previous code doesn't have this check?




Thanks,
Stefano

1. Disconnect vhost-user-blk backend before guest driver probe vblk 
device, then reconnect backend after guest driver probe device. 
QEMU won't send out any vhost messages to restore backend.
This is because vhost->vdev is NULL before the guest driver probes the 
vblk device, so vhost_user_blk_disconnect won't be called and 
s->connected is still true. The next vhost_user_blk_connect will 
simply return without doing anything.


2. modprobe -r virtio-blk inside VM, then disconnect backend, then 
reconnect backend, then modprobe virtio-blk. QEMU won't send 
messages in vhost_dev_init.
This is because rmmod makes QEMU call vhost_user_blk_stop, and 
vhost->vdev also becomes NULL (in vhost_dev_stop), so 
vhost_user_blk_disconnect won't be called. Again s->connected is 
still true, even though the chr connection is closed.


I think that even when vhost->vdev is NULL, vhost_user_blk_disconnect 
should be called when the chr connection closes?

Hope we can have a fix soon.


Thanks,
Yajun



Re: vhost-user-blk reconnect issue

2024-04-01 Thread Li Feng
Hi yajun,

I have submitted a patch to fix this problem a few months ago, but in the end 
this solution was not accepted and other solutions
were adopted to fix it.

https://lore.kernel.org/all/20230804052954.2918915-2-fen...@smartx.com/

This is the merged fix:


https://lore.kernel.org/all/a68c0148e9bf105f9e83ff5e763b8fcb6f7ba9be.1697644299.git@redhat.com/

Thanks,
Li

> On Apr 1, 2024, at 10:08, Yajun Wu wrote:
> 
> 
> On 3/27/2024 6:47 PM, Stefano Garzarella wrote:
>> 
>> 
>> Hi Yajun,
>> 
>> On Mon, Mar 25, 2024 at 10:54:13AM +, Yajun Wu wrote:
>>> Hi experts,
>>> 
>>> With latest QEMU (8.2.90), we find two vhost-user-blk backend reconnect
>>> failure scenarios:
>> Do you know if it has ever worked (so this is a regression), or have we
>> always had this problem?
> 
> I am afraid this commit: "71e076a07d (2022-12-01 02:30:13 -0500) hw/virtio: 
> generalise CHR_EVENT_CLOSED handling"  caused both failures. Previous hash is 
> good.
> 
> I suspect the "if (vhost->vdev)" in vhost_user_async_close_bh is the cause, 
> previous code doesn't have this check?
> 
>> 
>> Thanks,
>> Stefano
>> 
>>> 1. Disconnect vhost-user-blk backend before guest driver probe vblk device, 
>>> then reconnect backend after guest driver probe device. QEMU won't send out 
>>> any vhost messages to restore backend.
>>> This is because vhost->vdev is NULL before guest driver probe vblk device, 
>>> so vhost_user_blk_disconnect won't be called, s->connected is still true. 
>>> Next vhost_user_blk_connect will simply return without doing anything.
>>> 
>>> 2. modprobe -r virtio-blk inside VM, then disconnect backend, then 
>>> reconnect backend, then modprobe virtio-blk. QEMU won't send messages in 
>>> vhost_dev_init.
>>> This is because rmmod makes QEMU call vhost_user_blk_stop, and vhost->vdev 
>>> also becomes NULL (in vhost_dev_stop), so vhost_user_blk_disconnect won't be 
>>> called. Again s->connected is still true, even though the chr connection is closed.
>>> 
>>> I think that even when vhost->vdev is NULL, vhost_user_blk_disconnect should be 
>>> called when the chr connection closes?
>>> Hope we can have a fix soon.
>>> 
>>> 
>>> Thanks,
>>> Yajun
>>> 



Re: [PATCH 0/2] P11 support for QEMU

2024-04-01 Thread Cédric Le Goater

Hello Aditya,

Please run ./scripts/get_maintainer.pl when sending a series. qemu-ppc should be
in Cc:

Briefly looking at this, please separate the changes using one patch per model,
that is : first CPU (target), LPC, OCC, PSI, SBE, PnvCore, SpaprCore. Last the
PnvChip and the machines, powernv11 and pseries. A minimum commit log describing
the HW is required. I don't see PHB6 or XIVE3. Why ?

Also, you will need an OPAL update. The above changes are pointless without it.
The minimum for now is a git commit from the opal repo, then you will need to
update QEMU with a binary.

Thanks,

C.

On 4/1/24 07:55, Aditya Gupta wrote:

This patch series adds support for Power11 pseries and powernv machine targets
to emulate VMs running on Power11.

Most of the P11 support code has been taken from the P10 code in QEMU,
and it has been tested with pseries and powernv, with and without compat mode.

Git Tree for Testing: https://github.com/adi-g15-ibm/qemu/tree/p11

Aditya Gupta (2):
   ppc: pseries: add P11 cpu type
   ppc: powernv11: add base support for P11 PowerNV

  docs/system/ppc/pseries.rst |   6 +-
  hw/ppc/pnv.c| 409 
  hw/ppc/pnv_core.c   |  94 +
  hw/ppc/pnv_homer.c  |  64 ++
  hw/ppc/pnv_lpc.c|  14 ++
  hw/ppc/pnv_occ.c|  14 ++
  hw/ppc/pnv_psi.c|  21 ++
  hw/ppc/pnv_sbe.c|  19 ++
  hw/ppc/spapr_cpu_core.c |   1 +
  include/hw/ppc/pnv.h|  51 +
  include/hw/ppc/pnv_chip.h   |  30 +++
  include/hw/ppc/pnv_homer.h  |   3 +
  include/hw/ppc/pnv_lpc.h|   4 +
  include/hw/ppc/pnv_occ.h|   2 +
  include/hw/ppc/pnv_psi.h|   2 +
  include/hw/ppc/pnv_sbe.h|   2 +
  include/hw/ppc/pnv_xscom.h  |  55 +
  target/ppc/compat.c |   7 +
  target/ppc/cpu-models.c |   2 +
  target/ppc/cpu-models.h |   2 +
  target/ppc/cpu_init.c   | 162 ++
  21 files changed, 961 insertions(+), 3 deletions(-)






Re: [PATCH v1] coroutine: avoid inserting duplicate coroutine to co_queue_wakeup

2024-04-01 Thread Zhu Yangyang via
On Thu, 28 Mar 2024 07:40:14 -0500, Eric Blake wrote:
> On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang via wrote:
> > If g_main_loop_run()/aio_poll() is called in the coroutine context,
> > the pending coroutine may be woken up repeatedly, and the co_queue_wakeup
> > may be disordered.
> > 
> > When the poll() syscall exited in g_main_loop_run()/aio_poll(), it means
> > some listened events is completed. Therefore, the completion callback
> > function is dispatched.
> > 
> > If this callback function needs to invoke aio_co_enter(), it will only
> > wake up the coroutine (because we are already in coroutine context),
> > which may cause that the data on this listening event_fd/socket_fd
> > is not read/cleared. When the next poll () exits, it will be woken up again
> > and inserted into the wakeup queue again.
> > 
> > For example, if TLS is enabled in NBD, the server will call 
> > g_main_loop_run()
> > in the coroutine, and repeatedly wake up the io_read event on a socket.
> > The call stack is as follows:
> > 
> > aio_co_enter()
> > aio_co_wake()
> > qio_channel_restart_read()
> > aio_dispatch_handler()
> > aio_dispatch_handlers()
> > aio_dispatch()
> > aio_ctx_dispatch()
> > g_main_context_dispatch()
> > g_main_loop_run()
> > nbd_negotiate_handle_starttls()
> > nbd_negotiate_options()
> > nbd_negotiate()
> > nbd_co_client_start()
> > coroutine_trampoline()
> 
> zhuyangyang, do you have a reliable reproduction setup for how you
> were able to trigger this?  Obviously, it only happens when TLS is
> enabled (we aren't creating a g_main_loop_run for any other NBD
> command), and only when the server is first starting to serve a
> client; is this a case where you were hammering a long-running qemu
> process running an NBD server with multiple clients trying to
> reconnect to the server all near the same time?

This problem cannot be reproduced after 
7c1f51bf38 ("nbd/server: Fix drained_poll to wake coroutine in right 
AioContext"), which avoids repeatedly waking up the same coroutine.

Invoking g_main_loop_run() in the coroutine causes the event completion 
callback qio_channel_restart_read() to be called repeatedly, 
but the coroutine is woken up only once.

The key modifications are as follows:

static void qio_channel_restart_read(void *opaque)
{
QIOChannel *ioc = opaque;
-   Coroutine *co = ioc->read_coroutine;
+   Coroutine *co = qatomic_xchg(&ioc->read_coroutine, NULL);
+
+   if (!co) {
+   return;
+   }

/* Assert that aio_co_wake() reenters the coroutine directly */
assert(qemu_get_current_aio_context() ==
   qemu_coroutine_get_aio_context(co));
aio_co_wake(co);
}

The root cause is that poll() is invoked in coroutine context.

> 
> If we can come up with a reliable formula for reproducing the
> corrupted coroutine list, it would make a great iotest addition
> alongside the existing qemu-iotests 233 for ensuring that NBD TLS
> traffic is handled correctly in both server and client.
> 
> > 
> > Signed-off-by: zhuyangyang 
> 
> Side note: this appears to be your first qemu contribution (based on
> 'git shortlog --author zhuyangyang').  While I am not in a position to
> presume how you would like your name Anglicized, I will point out that
> the prevailing style is to separate given name from family name (just
> because your username at work has no spaces does not mean that your
> S-o-b has to follow suit).  It is also permissible to list your name
> in native characters alongside or in place of the Anglicized version;
> for example, 'git log --author="Stefano Dong"' shows this technique.


-- 
Best Regards,
Zhu Yangyang




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-01 Thread Zhijian Li (Fujitsu)
Phil,

on 3/29/2024 6:28 PM, Philippe Mathieu-Daudé wrote:
>>
>>
>>> IMHO it's more important to know whether there are still users and 
>>> whether
>>> they would still like to see it around.
>>
>> Agree.
>> I didn't immediately express my opinion in V1 because I'm also 
>> consulting our
>> customers for this feature in the future.
>>
>> Personally, I agree with Perter's idea that "I'd slightly prefer 
>> postponing it one
>> more release which might help a bit of our downstream maintenance"
>
> Do you mind posting a deprecation patch to clarify the situation?
>

No problem, I just posted a deprecation patch, please take a look.
https://lore.kernel.org/qemu-devel/20240401035947.3310834-1-lizhij...@fujitsu.com/T/#u

Thanks
Zhijian


Re: vhost-user-blk reconnect issue

2024-04-01 Thread Michael S. Tsirkin
On Mon, Apr 01, 2024 at 10:08:10AM +0800, Yajun Wu wrote:
> 
> On 3/27/2024 6:47 PM, Stefano Garzarella wrote:
> > 
> > 
> > Hi Yajun,
> > 
> > On Mon, Mar 25, 2024 at 10:54:13AM +, Yajun Wu wrote:
> > > Hi experts,
> > > 
> > > With latest QEMU (8.2.90), we find two vhost-user-blk backend reconnect
> > > failure scenarios:
> > Do you know if it has ever worked (so this is a regression), or have we
> > always had this problem?
> 
> I am afraid this commit: "71e076a07d (2022-12-01 02:30:13 -0500) hw/virtio:
> generalise CHR_EVENT_CLOSED handling"  caused both failures. Previous hash
> is good.

CC Alex who wrote that commit.

> I suspect the "if (vhost->vdev)" in vhost_user_async_close_bh is the cause,
> previous code doesn't have this check?
> 
> > 
> > Thanks,
> > Stefano
> > 
> > > 1. Disconnect vhost-user-blk backend before guest driver probe vblk 
> > > device, then reconnect backend after guest driver probe device. QEMU 
> > > won't send out any vhost messages to restore backend.
> > > This is because vhost->vdev is NULL before guest driver probe vblk 
> > > device, so vhost_user_blk_disconnect won't be called, s->connected is 
> > > still true. Next vhost_user_blk_connect will simply return without doing 
> > > anything.
> > > 
> > > 2. modprobe -r virtio-blk inside VM, then disconnect backend, then 
> > > reconnect backend, then modprobe virtio-blk. QEMU won't send messages in 
> > > vhost_dev_init.
> > > This is because rmmod makes QEMU call vhost_user_blk_stop, and vhost->vdev 
> > > also becomes NULL (in vhost_dev_stop), so vhost_user_blk_disconnect won't be 
> > > called. Again s->connected is still true, even though the chr connection is closed.
> > > 
> > > I think that even when vhost->vdev is NULL, vhost_user_blk_disconnect should be 
> > > called when the chr connection closes?
> > > Hope we can have a fix soon.
> > > 
> > > 
> > > Thanks,
> > > Yajun
> > >