On Fri, 26 Jun 2026 08:33:14 -0700 Breno Leitao <[email protected]> wrote:
> A multi-bit ECC error on a kernel-owned page that the memory failure
> handler cannot recover is currently swallowed: PG_hwpoison is set, the
> event is logged, and the kernel keeps running. The corrupted memory
> remains accessible to the kernel and either drives silent data
> corruption or surfaces seconds-to-minutes later as an apparently
> unrelated crash. In a large fleet that delayed, unattributable crash
> turns into significant engineering effort to root-cause; in a kdump
> configuration, by the time the crash happens the original error
> context (faulting PFN, MCE/GHES record, page state) is long gone.
>
> This series adds an opt-in sysctl,
> vm.panic_on_unrecoverable_memory_failure, that converts an
> unrecoverable kernel-page hwpoison event into an immediate panic with
> a clean dmesg/vmcore that still contains the original failure
> context. The default is disabled so existing workloads see no
> change.
Cool, thanks. I added this to mm.git's mm-new branch. Next week I'll
move it into the mm-unstable branch, where it will receive linux-next
exposure.
Sashiko identified a few possible things, some pre-existing:
https://sashiko.dev/#/patchset/[email protected]