Re: [PATCH V14 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64

2017-04-06 Thread Catalin Marinas
Hi Rafael,

On Tue, Mar 28, 2017 at 01:30:30PM -0600, Tyler Baicar wrote:
> Tyler Baicar (9):
>   acpi: apei: read ack upon ghes record consumption
>   ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
>   efi: parse ARM processor error
>   arm64: exception: handle Synchronous External Abort
>   acpi: apei: handle SEA notification type for ARMv8
>   efi: print unrecognized CPER section
>   ras: acpi / apei: generate trace event for unrecognized CPER section
>   trace, ras: add ARM processor error trace event
>   arm/arm64: KVM: add guest SEA support

It looks like the KVM parts are getting acked and the arm64 and efi bits
have been acked as well. Are you ok to take this series through the ACPI
tree?

Thanks.

-- 
Catalin
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH V14 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64

2017-03-28 Thread Tyler Baicar
When a memory error, CPU error, PCIe error, or other type of hardware error
that's covered by RAS occurs, firmware should populate the shared GHES memory
location with the proper GHES structures to notify the OS of the error.
For example, platforms that implement firmware-first handling may implement
separate GHES sources for corrected errors and uncorrected errors. If the
error is an uncorrectable error, then the firmware will notify the OS
immediately since the error needs to be handled ASAP. The OS will then be able
to take the appropriate action needed such as offlining a page. If the error
is a corrected error, then the firmware will not interrupt the OS immediately.
Instead, the OS will see and report the error the next time its GHES timer
expires. The kernel will first parse the GHES structures and report the errors
through the kernel logs and then notify user space through RAS trace
events. This allows user space applications such as RAS Daemon to see the
errors and report them however the user desires. This patchset extends the
kernel functionality for RAS errors based on updates in the UEFI 2.6 and
ACPI 6.1 specifications.

An example flow from firmware to user space could be:

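                     +---------------+
           +-------->|               |
           |         |  GHES polling |--+
+-------------+      |    source     |  |   +---------------+   +------------+
|             |      +---------------+  |   |  Kernel GHES  |   |            |
|  Firmware   |                         +-->|  CPER AER and |-->|  RAS trace |
|             |      +---------------+  |   |  EDAC drivers |   |   event    |
+-------------+      |               |  |   +---------------+   +------------+
           |         |  GHES sci     |--+
           +-------->|    source     |
                     +---------------+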

Add support for Generic Hardware Error Source (GHES) v2, which introduces the
capability for the OS to acknowledge the consumption of the error record
generated by the Reliability, Availability and Serviceability (RAS) controller.
This eliminates potential race conditions between the OS and the RAS controller.
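
As a rough illustration of the acknowledgment step, here is a minimal
user-space sketch. It assumes the GHESv2 entry exposes a Read Ack Register
together with Read Ack Preserve and Read Ack Write masks; the struct and
function names are hypothetical and do not come from the patches, and the
register is modeled as a plain 64-bit variable rather than an ACPI Generic
Address Structure.

```c
#include <stdint.h>

/* Hypothetical, simplified model of the GHESv2 Read Ack fields. */
struct ghes_v2_ack {
    volatile uint64_t *reg;  /* stands in for the Read Ack Register */
    unsigned bit_offset;     /* bit offset of the field within the register */
    uint64_t preserve;       /* Read Ack Preserve mask: bits to keep */
    uint64_t write;          /* Read Ack Write mask: bits to set */
};

/*
 * Acknowledge consumption of an error record: read the register, keep only
 * the bits selected by the preserve mask, OR in the write mask, and write
 * the result back.
 */
void ghes_v2_ack_record(struct ghes_v2_ack *ack)
{
    uint64_t val = *ack->reg;

    val &= ack->preserve << ack->bit_offset;
    val |= ack->write << ack->bit_offset;
    *ack->reg = val;
}
```

With this handshake the firmware can wait for the ack before reusing the
error status block, so the OS never reads a record that is being rewritten.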

Add support for the timestamp field added to the Generic Error Data Entry v3,
allowing the OS to log the time that the error is generated by the firmware,
rather than the time the error is consumed. This improves the correctness of
event sequences when analyzing error logs. The timestamp is added in
ACPI 6.1, reference Table 18-343 Generic Error Data Entry.
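
To make the timestamp format concrete, the sketch below decodes the 8-byte
field in user space. It assumes the BCD byte layout as I read it from the
Generic Error Data Entry table (byte 0 seconds, 1 minutes, 2 hours, 3 bit 0
"precise" flag, 4 day, 5 month, 6 year, 7 century). The struct and function
names are made up for illustration and are not the ones used in the patches.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for a decoded Generic Error Data Entry timestamp. */
struct cper_tstamp {
    unsigned century, year, mon, day, hour, min, sec;
    int precise;  /* nonzero if the time correlates to the error event */
};

/* Convert one packed BCD byte (two decimal digits) to binary. */
static unsigned bcd_to_bin(uint8_t v)
{
    return (v & 0x0f) + (v >> 4) * 10;
}

/* Decode the 8 timestamp bytes, assuming the layout described above. */
void decode_tstamp(const uint8_t ts[8], struct cper_tstamp *out)
{
    out->sec     = bcd_to_bin(ts[0]);
    out->min     = bcd_to_bin(ts[1]);
    out->hour    = bcd_to_bin(ts[2]);
    out->precise = ts[3] & 0x1;
    out->day     = bcd_to_bin(ts[4]);
    out->mon     = bcd_to_bin(ts[5]);
    out->year    = bcd_to_bin(ts[6]);
    out->century = bcd_to_bin(ts[7]);
}

/* Format as "CCYY-MM-DD hh:mm:ss"; returns what snprintf returns. */
int format_tstamp(const struct cper_tstamp *t, char *buf, size_t len)
{
    return snprintf(buf, len, "%02u%02u-%02u-%02u %02u:%02u:%02u",
                    t->century, t->year, t->mon, t->day,
                    t->hour, t->min, t->sec);
}
```

The timestamp is only meaningful when the corresponding validation bit is
set in the entry, so a real consumer would check that bit before decoding.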

Add support for ARMv8 Common Platform Error Record (CPER) per the UEFI 2.6
specification. ARMv8-specific processor error information is reported as part of
the CPER records. This provides more detail for processor error logs and
can help describe ARMv8 cache, TLB, and bus errors.

Synchronous External Abort (SEA) represents a specific processor error condition
in ARM systems. A handler is added to recognize SEA errors, and a notifier is
added to parse and report the errors before the process is killed. Refer to
section N.2.1.1 in the Common Platform Error Record appendix of the UEFI 2.6
specification.

Currently the kernel ignores CPER records that are unrecognized.
On the other hand, the UEFI spec allows for non-standard (e.g. vendor
proprietary) error section types in CPER (Common Platform Error Record),
as defined in section N2.3 of UEFI version 2.5. As a result, the user
is not able to see the hardware error data of a non-standard section.

If the Section Type field of a Generic Error Data Entry is unrecognized,
print the raw data to the dmesg buffer and add a tracepoint for reporting
such hardware errors.
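
For readers unfamiliar with this kind of raw dump, here is a small
user-space sketch of one row of a hex-plus-ASCII dump. It only illustrates
the output style; the function name, row width, and buffer size are
assumptions of this example, and the kernel side would use its own hex dump
helpers instead.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define ROW_BYTES 16
#define ROW_BUF   80  /* enough for offset, 16 hex bytes, ASCII, and NUL */

/*
 * Format up to 16 bytes of data starting at off into out as
 * "OFFSET: hex bytes  ascii". Returns the number of bytes consumed,
 * or 0 if off is past the end or the buffer is too small.
 */
size_t dump_row(const uint8_t *data, size_t len, size_t off,
                char *out, size_t outlen)
{
    size_t n, i, pos;

    if (off >= len || outlen < ROW_BUF)
        return 0;

    n = (len - off < ROW_BYTES) ? len - off : ROW_BYTES;
    pos = (size_t)snprintf(out, outlen, "%08zx: ", off);

    /* Hex column, padded so the ASCII column always lines up. */
    for (i = 0; i < ROW_BYTES; i++) {
        if (i < n)
            pos += (size_t)snprintf(out + pos, outlen - pos, "%02x ",
                                    data[off + i]);
        else
            pos += (size_t)snprintf(out + pos, outlen - pos, "   ");
    }
    out[pos++] = ' ';

    /* ASCII column: non-printable bytes are shown as '.'. */
    for (i = 0; i < n; i++)
        out[pos++] = isprint(data[off + i]) ? (char)data[off + i] : '.';
    out[pos] = '\0';

    return n;
}
```

Calling dump_row() in a loop, advancing off by the return value until it
returns 0, prints a whole section body.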

Currently, even if an error status block's severity is fatal, the kernel
does not honor that severity level by panicking. With the firmware-first
model, the platform could inform the OS about a fatal hardware error
through the non-NMI GHES notification type. The OS should panic when a
hardware error record is received with this severity.
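
The intended policy can be sketched as a simple severity-to-action mapping.
The CPER severity values below follow the UEFI encoding (0 recoverable,
1 fatal, 2 corrected, 3 informational); the action enum and function name
are hypothetical, and treating unknown severities as fatal is an assumption
of this sketch rather than something stated above.

```c
/* CPER error severity values as encoded in the UEFI specification. */
enum cper_sev {
    CPER_SEV_RECOVERABLE   = 0,
    CPER_SEV_FATAL         = 1,
    CPER_SEV_CORRECTED     = 2,
    CPER_SEV_INFORMATIONAL = 3,
};

/* Hypothetical actions the OS could take for an error status block. */
enum ghes_action {
    GHES_ACT_LOG,      /* report only */
    GHES_ACT_RECOVER,  /* report and attempt recovery, e.g. offline a page */
    GHES_ACT_PANIC,    /* report and panic */
};

/* Map a reported severity to an action, regardless of notification type. */
enum ghes_action ghes_action_for(unsigned int sev)
{
    switch (sev) {
    case CPER_SEV_FATAL:
        return GHES_ACT_PANIC;
    case CPER_SEV_RECOVERABLE:
        return GHES_ACT_RECOVER;
    case CPER_SEV_CORRECTED:
    case CPER_SEV_INFORMATIONAL:
        return GHES_ACT_LOG;
    default:
        /* Assumption: treat an unknown severity as the most severe. */
        return GHES_ACT_PANIC;
    }
}
```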

Add support to handle SEAs that occur while a KVM guest kernel is
running. Currently these are unsupported by the guest abort handling.

V14: Make sure function prototypes are in the __ASSEMBLY__ block
     Change is_abort_synchronous to is_abort_sea
     Use phys_addr_t for SEA address
     Return after successful SEA handling in handle_guest_abort()

V13: Rebase on 4.11-rc2
     Print decimal and hex sizes for unknown CPER section errors
     Use proper CONFIG_* when using IS_ENABLED
     Move handle_guest_sea call prior to SEI check
     Add a return value to handle_guest_sea
     Move RCU locking into ghes_notify_sea
     Add valid bit checks to ARM trace event
     Remove GPIO, SEI, and GSIV cases in GHES
     Add ARCH_HAVE_NMI_SAFE_CMPXCHG since we added NMI usage

V12: Remove double quotes from CPER code
     Add helper function to check all SEA cases in KVM patch
     Replace nmi_enter/exit with rcu_read_lock/unlock for KVM SEA
     Change HAVE_ACPI_APEI_SEA to ACPI_APEI_SEA in KVM SEA case

V11: Change print_hex_dump calls to include ASCII output
     Change