Re: [PATCH 0/5] acpi/ghes: Error object handling improvement

2025-12-02 Thread Peter Maydell
On Mon, 1 Dec 2025 at 14:38, Gavin Shan  wrote:
>
> Hi Mauro,
>
> On 12/2/25 12:31 AM, Mauro Carvalho Chehab wrote:
> > On Tue, 2 Dec 2025 00:13:06 +1000
> > Gavin Shan  wrote:
> >> On 12/1/25 10:17 PM, Mauro Carvalho Chehab wrote:
> >>> On Thu, 27 Nov 2025 10:44:30 +1000
> >>> Gavin Shan  wrote:
>
> [...]
>
> >>>
> >>> Btw, what setup are you using to test memory errors? It would be
> >>> nice to have it documented somewhere, maybe at
> >>> docs/specs/acpi_hest_ghes.rst.
> >>>
> >>
> >> I don't think docs/specs/acpi_hest_ghes.rst is the right place for that
> >> as it's for specifications.
> >
> > Perhaps not, but it would be nice to have it documented somewhere,
> > either there or at QEMU wiki.
> >
>
> QEMU wiki may be the best place for it. I have never updated the QEMU wiki;
> are there any guiding steps on how to do that?

I think in general we should prefer to document things in docs/
if we think users would want to know them. If it's just a
test setup then perhaps docs/devel, or if feasible actually
make it a test in tests/. The wiki is largely unused except
for the changelog and planning docs.

(In an ideal world we'd check for parts of the wiki that still
have useful-to-users up to date information, and fold them into
our manuals.)

thanks
-- PMM



Re: [PATCH 0/5] acpi/ghes: Error object handling improvement

2025-12-02 Thread Igor Mammedov
On Tue, 2 Dec 2025 00:37:53 +1000
Gavin Shan  wrote:

> Hi Mauro,
> 
> On 12/2/25 12:31 AM, Mauro Carvalho Chehab wrote:
> > On Tue, 2 Dec 2025 00:13:06 +1000
> > Gavin Shan  wrote:  
> >> On 12/1/25 10:17 PM, Mauro Carvalho Chehab wrote:  
> >>> On Thu, 27 Nov 2025 10:44:30 +1000
> >>> Gavin Shan  wrote:  
> 
> [...]
> 
> >>>
> >>> Btw, what setup are you using to test memory errors? It would be
> >>> nice to have it documented somewhere, maybe at
> >>> docs/specs/acpi_hest_ghes.rst.
> >>>  
> >>
> >> I don't think docs/specs/acpi_hest_ghes.rst is the right place for that
> >> as it's for specifications.  
> > 
> > Perhaps not, but it would be nice to have it documented somewhere,
> > either there or at QEMU wiki.
> >   
> 
> QEMU wiki may be the best place for it. I have never updated the QEMU wiki;
> are there any guiding steps on how to do that?

do you have an account already?

> 
> >> I'm sharing how this is tested here to make the thread complete.  
> > 
> > Thanks!
> >   
> >>
> >> - Both host and guest have a 4KB page size
> >>
> >> - Start the guest with the following command line
> >>
> >> /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> >> -accel kvm -machine virt,gic-version=host,nvdimm=on,ras=on \
> >> -cpu host -smp maxcpus=8,cpus=8,sockets=2,clusters=2,cores=2,threads=1 \
> >> -m 4096M,slots=16,maxmem=128G \
> >> -object memory-backend-ram,id=mem0,size=4096M \
> >> -numa node,nodeid=0,cpus=0-7,memdev=mem0 \
> >> -L /home/gavin/sandbox/qemu.main/build/pc-bios \
> >> -monitor none -serial mon:stdio -nographic \
> >> -gdb tcp:: -qmp tcp:localhost:,server,wait=off \
> >> -bios /home/gavin/sandbox/qemu.main/build/pc-bios/edk2-aarch64-code.fd \
> >> -boot c \
> >> -device pcie-root-port,bus=pcie.0,chassis=1,id=pcie.1 \
> >> -device pcie-root-port,bus=pcie.0,chassis=2,id=pcie.2 \
> >> -device pcie-root-port,bus=pcie.0,chassis=3,id=pcie.3 \
> >>    : \
> >> -device pcie-root-port,bus=pcie.0,chassis=16,id=pcie.16 \
> >> -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=drive0 \
> >> -device virtio-blk-pci,id=virtblk0,bus=pcie.1,drive=drive0,num-queues=4 \
> >> -netdev tap,id=tap1,vhost=true,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
> >> -device virtio-net-pci,bus=pcie.8,netdev=tap1,mac=52:54:00:f1:26:b0
> >>
> >> - Trigger 'victim -d' in the guest  
> > 
> > Hmm... where can I get victim?
> >   
> 
> https://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
> 
> > Regards,
> > Mauro
> >   
> 
> Thanks,
> Gavin
> 




Re: [PATCH 0/5] acpi/ghes: Error object handling improvement

2025-12-01 Thread Gavin Shan

Hi Mauro,

On 12/2/25 12:31 AM, Mauro Carvalho Chehab wrote:

On Tue, 2 Dec 2025 00:13:06 +1000
Gavin Shan  wrote:

On 12/1/25 10:17 PM, Mauro Carvalho Chehab wrote:

On Thu, 27 Nov 2025 10:44:30 +1000
Gavin Shan  wrote:


[...]



Btw, what setup are you using to test memory errors? It would be
nice to have it documented somewhere, maybe at
docs/specs/acpi_hest_ghes.rst.
   


I don't think docs/specs/acpi_hest_ghes.rst is the right place for that
as it's for specifications.


Perhaps not, but it would be nice to have it documented somewhere,
either there or at QEMU wiki.



QEMU wiki may be the best place for it. I have never updated the QEMU wiki;
are there any guiding steps on how to do that?


I'm sharing how this is tested here to make the thread complete.


Thanks!



- Both host and guest have a 4KB page size

- Start the guest with the following command line

/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
-accel kvm -machine virt,gic-version=host,nvdimm=on,ras=on \
-cpu host -smp maxcpus=8,cpus=8,sockets=2,clusters=2,cores=2,threads=1 \
-m 4096M,slots=16,maxmem=128G \
-object memory-backend-ram,id=mem0,size=4096M \
-numa node,nodeid=0,cpus=0-7,memdev=mem0 \
-L /home/gavin/sandbox/qemu.main/build/pc-bios \
-monitor none -serial mon:stdio -nographic \
-gdb tcp:: -qmp tcp:localhost:,server,wait=off \
-bios /home/gavin/sandbox/qemu.main/build/pc-bios/edk2-aarch64-code.fd \
-boot c \
-device pcie-root-port,bus=pcie.0,chassis=1,id=pcie.1 \
-device pcie-root-port,bus=pcie.0,chassis=2,id=pcie.2 \
-device pcie-root-port,bus=pcie.0,chassis=3,id=pcie.3 \
   : \
-device pcie-root-port,bus=pcie.0,chassis=16,id=pcie.16 \
-drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=drive0 \
-device virtio-blk-pci,id=virtblk0,bus=pcie.1,drive=drive0,num-queues=4 \
-netdev tap,id=tap1,vhost=true,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-device virtio-net-pci,bus=pcie.8,netdev=tap1,mac=52:54:00:f1:26:b0

- Trigger 'victim -d' in the guest


Hmm... where can I get victim?



https://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git


Regards,
Mauro



Thanks,
Gavin




Re: [PATCH 0/5] acpi/ghes: Error object handling improvement

2025-12-01 Thread Mauro Carvalho Chehab
On Tue, 2 Dec 2025 00:13:06 +1000
Gavin Shan  wrote:

> Hi Mauro,
> 
> On 12/1/25 10:17 PM, Mauro Carvalho Chehab wrote:
> > On Thu, 27 Nov 2025 10:44:30 +1000
> > Gavin Shan  wrote:
> >   
> >> This series is curved from that for memory error handling improvement
> >> [1] based on the received comments, to improve the error object handling
> >> in various aspects.
> >>
> >> [1] https://lists.nongnu.org/archive/html/qemu-arm/2025-11/msg00534.html
> >>
> >> Gavin Shan (5):
> >>acpi/ghes: Automate data block cleanup in acpi_ghes_memory_errors()
> >>acpi/ghes: Abort in acpi_ghes_memory_errors() if necessary
> >>target/arm/kvm: Exit on error from acpi_ghes_memory_errors()
> >>acpi/ghes: Bail early on error from get_ghes_source_offsets()
> >>acpi/ghes: Use error_fatal in acpi_ghes_memory_errors()  
> > 
> > Patch series looks OK to my eyes.
> > 
> > Reviewed-by: Mauro Carvalho Chehab 
> >   
> 
> Thanks.
> 
> > -
> > 
> > Btw, what setup are you using to test memory errors? It would be
> > nice to have it documented somewhere, maybe at
> > docs/specs/acpi_hest_ghes.rst.
> >   
> 
> I don't think docs/specs/acpi_hest_ghes.rst is the right place for that
> as it's for specifications. 

Perhaps not, but it would be nice to have it documented somewhere,
either there or at QEMU wiki.

> I'm sharing how this is tested here to make the thread complete.

Thanks!

> 
> - Both host and guest have a 4KB page size
> 
> - Start the guest with the following command line
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>   -accel kvm -machine virt,gic-version=host,nvdimm=on,ras=on \
>   -cpu host -smp maxcpus=8,cpus=8,sockets=2,clusters=2,cores=2,threads=1 \
>   -m 4096M,slots=16,maxmem=128G \
>   -object memory-backend-ram,id=mem0,size=4096M \
>   -numa node,nodeid=0,cpus=0-7,memdev=mem0 \
>   -L /home/gavin/sandbox/qemu.main/build/pc-bios \
>   -monitor none -serial mon:stdio -nographic \
>   -gdb tcp:: -qmp tcp:localhost:,server,wait=off \
>   -bios /home/gavin/sandbox/qemu.main/build/pc-bios/edk2-aarch64-code.fd \
>   -boot c \
>   -device pcie-root-port,bus=pcie.0,chassis=1,id=pcie.1 \
>   -device pcie-root-port,bus=pcie.0,chassis=2,id=pcie.2 \
>   -device pcie-root-port,bus=pcie.0,chassis=3,id=pcie.3 \
>      : \
>   -device pcie-root-port,bus=pcie.0,chassis=16,id=pcie.16 \
>   -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=drive0 \
>   -device virtio-blk-pci,id=virtblk0,bus=pcie.1,drive=drive0,num-queues=4 \
>   -netdev tap,id=tap1,vhost=true,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
>   -device virtio-net-pci,bus=pcie.8,netdev=tap1,mac=52:54:00:f1:26:b0
> 
> - Trigger 'victim -d' in the guest

Hmm... where can I get victim?

Regards,
Mauro



Re: [PATCH 0/5] acpi/ghes: Error object handling improvement

2025-12-01 Thread Gavin Shan

Hi Mauro,

On 12/1/25 10:17 PM, Mauro Carvalho Chehab wrote:

On Thu, 27 Nov 2025 10:44:30 +1000
Gavin Shan  wrote:


This series is curved from that for memory error handling improvement
[1] based on the received comments, to improve the error object handling
in various aspects.

[1] https://lists.nongnu.org/archive/html/qemu-arm/2025-11/msg00534.html

Gavin Shan (5):
   acpi/ghes: Automate data block cleanup in acpi_ghes_memory_errors()
   acpi/ghes: Abort in acpi_ghes_memory_errors() if necessary
   target/arm/kvm: Exit on error from acpi_ghes_memory_errors()
   acpi/ghes: Bail early on error from get_ghes_source_offsets()
   acpi/ghes: Use error_fatal in acpi_ghes_memory_errors()


Patch series looks OK to my eyes.

Reviewed-by: Mauro Carvalho Chehab 



Thanks.


-

Btw, what setup are you using to test memory errors? It would be
nice to have it documented somewhere, maybe at
docs/specs/acpi_hest_ghes.rst.



I don't think docs/specs/acpi_hest_ghes.rst is the right place for that
as it's for specifications. I'm sharing how this is tested here to make
the thread complete.

- Both host and guest have a 4KB page size

- Start the guest with the following command line

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
  -accel kvm -machine virt,gic-version=host,nvdimm=on,ras=on \
  -cpu host -smp maxcpus=8,cpus=8,sockets=2,clusters=2,cores=2,threads=1 \
  -m 4096M,slots=16,maxmem=128G \
  -object memory-backend-ram,id=mem0,size=4096M \
  -numa node,nodeid=0,cpus=0-7,memdev=mem0 \
  -L /home/gavin/sandbox/qemu.main/build/pc-bios \
  -monitor none -serial mon:stdio -nographic \
  -gdb tcp:: -qmp tcp:localhost:,server,wait=off \
  -bios /home/gavin/sandbox/qemu.main/build/pc-bios/edk2-aarch64-code.fd \
  -boot c \
  -device pcie-root-port,bus=pcie.0,chassis=1,id=pcie.1 \
  -device pcie-root-port,bus=pcie.0,chassis=2,id=pcie.2 \
  -device pcie-root-port,bus=pcie.0,chassis=3,id=pcie.3 \
     : \
  -device pcie-root-port,bus=pcie.0,chassis=16,id=pcie.16 \
  -drive file=/home/gavin/sandbox/images/disk.qcow2,if=none,id=drive0 \
  -device virtio-blk-pci,id=virtblk0,bus=pcie.1,drive=drive0,num-queues=4 \
  -netdev tap,id=tap1,vhost=true,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
  -device virtio-net-pci,bus=pcie.8,netdev=tap1,mac=52:54:00:f1:26:b0

- Trigger 'victim -d' in the guest

  guest$ ./victim -d
  physical address of (0x8d9b7000) = 0x1251d6000
  Hit any key to trigger error:

- Inject an error at the GPA ("test.c" is attached)

  host$ ./test 0x1251d6000

- Press enter on the guest so that 'victim' continues its execution

  [  435.467481] EDAC MC0: 1 UE unknown on unknown memory ( page:0x1251d6 offset:0x0 grain:1 - APEI location: )
  [  435.467542] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
  [  435.467543] {1}[Hardware Error]: event severity: recoverable
  [  435.467544] {1}[Hardware Error]:  Error 0, type: recoverable
  [  435.467545] {1}[Hardware Error]:   section_type: memory error
  [  435.467546] {1}[Hardware Error]:   physical_address: 0x0001251d6000
  [  435.467547] {1}[Hardware Error]:   error_type: 0, unknown
  [  435.468380] Memory failure: 0x1251d6: recovery action for dirty LRU page: Recovered
  Bus error (core dumped)

Thanks,
Gavin






Thanks,
Mauro

// SPDX-License-Identifier: GPL-2.0+
/*
 * This test program runs on the host and receives the GPA output by 'victim'
 * in the guest. The GPA is translated to an HPA, and a recoverable error
 * is injected at that HPA automatically.
 *
 * NOTE: We have the assumption that the guest has only one NUMA node and
 * the memory capacity is 4GB. The test program won't work if the assumption
 * is broken.
 *
 * Author: Gavin Shan 
 */

#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define TEST_GUEST_MEM_SIZE	0x1	/* 4GB */
#define TEST_GUEST_MEM_START	0x04000	/* 1GB */
#define TEST_INJECT_ERROR_TYPE	0x10

struct test_struct {
	int pid;
	unsigned long	guest_mem_size;
	unsigned long	gpa;
	unsigned long	hva;
	unsigned long	hpa;
};

static void usage(void)
{
	fprintf(stdout, "\n");
	fprintf(stdout, "./test <gpa>\n");
	fprintf(stdout, "gpa  The GPA (Guest Physical Address) where the error is injected\n");
	fprintf(stdout, "\n");
}

static void init_test_struct(struct test_struct *test)
{
	test->pid		= -1;
	test->guest_mem_size	= TEST_GUEST_MEM_SIZE;
	test->gpa		= -1UL;
	test->hpa		= -1UL;
}

static int fetch_gpa(struct test_struct *test, int argc, char **argv)
{
	if (argc != 2) {
		usa

Re: [PATCH 0/5] acpi/ghes: Error object handling improvement

2025-12-01 Thread Mauro Carvalho Chehab
On Thu, 27 Nov 2025 10:44:30 +1000
Gavin Shan  wrote:

> This series is curved from that for memory error handling improvement
> [1] based on the received comments, to improve the error object handling
> in various aspects.
> 
> [1] https://lists.nongnu.org/archive/html/qemu-arm/2025-11/msg00534.html
> 
> Gavin Shan (5):
>   acpi/ghes: Automate data block cleanup in acpi_ghes_memory_errors()
>   acpi/ghes: Abort in acpi_ghes_memory_errors() if necessary
>   target/arm/kvm: Exit on error from acpi_ghes_memory_errors()
>   acpi/ghes: Bail early on error from get_ghes_source_offsets()
>   acpi/ghes: Use error_fatal in acpi_ghes_memory_errors()

Patch series looks OK to my eyes.

Reviewed-by: Mauro Carvalho Chehab 

-

Btw, what setup are you using to test memory errors? It would be
nice to have it documented somewhere, maybe at
docs/specs/acpi_hest_ghes.rst.

Thanks,
Mauro



Re: [PATCH 0/5] acpi/ghes: Error object handling improvement

2025-12-01 Thread Igor Mammedov
On Sat, 29 Nov 2025 11:21:55 +1000
Gavin Shan  wrote:

> Hi Igor,
> 
> On 11/29/25 12:09 AM, Igor Mammedov wrote:
> > On Thu, 27 Nov 2025 10:44:30 +1000
> > Gavin Shan  wrote:
> >   
> >> This series is curved from that for memory error handling improvement  
> >   ^^^ confusing
> > based on above I'm not sure if it depends on [1] and should be applied on top
> > or it can be merged on its own
> >   
> 
> The current series is a standalone series and is expected to be merged on its own.
> 
> For (v4) series of memory error improvement [1], Jonathan wants to extend
> the handlers in the guest kernel so that the granularity in CPER record
> will be used to isolate the corresponding memory address range. With this,
> the patches in the (v4) series to send 16x continuous errors become useless.
> However, those patches in (v4) series to improve the Error (object) handling
> are still useful. So I pulled those patches for the Error (object) handling
> improvement from (v4) series to form this series.

ok, then I'll review this series and skip v4 for now

> 
> >> [1] based on the received comments, to improve the error object handling
> >> in various aspects.
> >>
> >> [1] https://lists.nongnu.org/archive/html/qemu-arm/2025-11/msg00534.html
> >>  
> 
> Thanks,
> Gavin
> 
> >> Gavin Shan (5):
> >>acpi/ghes: Automate data block cleanup in acpi_ghes_memory_errors()
> >>acpi/ghes: Abort in acpi_ghes_memory_errors() if necessary
> >>target/arm/kvm: Exit on error from acpi_ghes_memory_errors()
> >>acpi/ghes: Bail early on error from get_ghes_source_offsets()
> >>acpi/ghes: Use error_fatal in acpi_ghes_memory_errors()
> >>
> >>   hw/acpi/ghes-stub.c|  6 +++---
> >>   hw/acpi/ghes.c | 45 ++
> >>   include/hw/acpi/ghes.h |  6 +++---
> >>   target/arm/kvm.c   | 10 +++---
> >>   4 files changed, 28 insertions(+), 39 deletions(-)
> >>  
> >   
> 




Re: [PATCH 0/5] acpi/ghes: Error object handling improvement

2025-11-28 Thread Gavin Shan

Hi Igor,

On 11/29/25 12:09 AM, Igor Mammedov wrote:

On Thu, 27 Nov 2025 10:44:30 +1000
Gavin Shan  wrote:


This series is curved from that for memory error handling improvement

  ^^^ confusing
based on above I'm not sure if it depends on [1] and should be applied on top
or it can be merged on its own



The current series is a standalone series and is expected to be merged on its own.

For (v4) series of memory error improvement [1], Jonathan wants to extend
the handlers in the guest kernel so that the granularity in CPER record
will be used to isolate the corresponding memory address range. With this,
the patches in the (v4) series to send 16x continuous errors become useless.
However, those patches in (v4) series to improve the Error (object) handling
are still useful. So I pulled those patches for the Error (object) handling
improvement from (v4) series to form this series.


[1] based on the received comments, to improve the error object handling
in various aspects.

[1] https://lists.nongnu.org/archive/html/qemu-arm/2025-11/msg00534.html



Thanks,
Gavin


Gavin Shan (5):
   acpi/ghes: Automate data block cleanup in acpi_ghes_memory_errors()
   acpi/ghes: Abort in acpi_ghes_memory_errors() if necessary
   target/arm/kvm: Exit on error from acpi_ghes_memory_errors()
   acpi/ghes: Bail early on error from get_ghes_source_offsets()
   acpi/ghes: Use error_fatal in acpi_ghes_memory_errors()

  hw/acpi/ghes-stub.c|  6 +++---
  hw/acpi/ghes.c | 45 ++
  include/hw/acpi/ghes.h |  6 +++---
  target/arm/kvm.c   | 10 +++---
  4 files changed, 28 insertions(+), 39 deletions(-)








Re: [PATCH 0/5] acpi/ghes: Error object handling improvement

2025-11-28 Thread Igor Mammedov
On Thu, 27 Nov 2025 10:44:30 +1000
Gavin Shan  wrote:

> This series is curved from that for memory error handling improvement
 ^^^ confusing
based on above I'm not sure if it depends on [1] and should be applied on top
or it can be merged on its own

> [1] based on the received comments, to improve the error object handling
> in various aspects.
> 
> [1] https://lists.nongnu.org/archive/html/qemu-arm/2025-11/msg00534.html
> 
> Gavin Shan (5):
>   acpi/ghes: Automate data block cleanup in acpi_ghes_memory_errors()
>   acpi/ghes: Abort in acpi_ghes_memory_errors() if necessary
>   target/arm/kvm: Exit on error from acpi_ghes_memory_errors()
>   acpi/ghes: Bail early on error from get_ghes_source_offsets()
>   acpi/ghes: Use error_fatal in acpi_ghes_memory_errors()
> 
>  hw/acpi/ghes-stub.c|  6 +++---
>  hw/acpi/ghes.c | 45 ++
>  include/hw/acpi/ghes.h |  6 +++---
>  target/arm/kvm.c   | 10 +++---
>  4 files changed, 28 insertions(+), 39 deletions(-)
>