On Fri, 26 Jun 2026 08:33:14 -0700 Breno Leitao <[email protected]> wrote:

> A multi-bit ECC error on a kernel-owned page that the memory failure
> handler cannot recover is currently swallowed: PG_hwpoison is set, the
> event is logged, and the kernel keeps running.  The corrupted memory
> remains accessible to the kernel and either drives silent data
> corruption or surfaces seconds-to-minutes later as an apparently
> unrelated crash.  In a large fleet that delayed, unattributable crash
> turns into significant engineering effort to root-cause; in a kdump
> configuration, by the time the crash happens the original error
> context (faulting PFN, MCE/GHES record, page state) is long gone.
> 
> This series adds an opt-in sysctl,
> vm.panic_on_unrecoverable_memory_failure, that converts an
> unrecoverable kernel-page hwpoison event into an immediate panic with
> a clean dmesg/vmcore that still contains the original failure
> context.  The default is disabled so existing workloads see no
> change.

Cool, thanks.  I added this to mm.git's mm-new branch.  Next week I'll
move it into the mm-unstable branch, where it will receive linux-next
exposure.

Sashiko identified a few possible things, some pre-existing:

https://sashiko.dev/#/patchset/[email protected]
