On Fri, 24 Apr 2026 05:23:58 -0700 Breno Leitao <[email protected]> wrote:
> When the memory failure handler encounters an in-use kernel page that it
> cannot recover (slab, page tables, kernel stacks, vmalloc, etc.), it
> currently logs the error as "Ignored" and continues operation.
>
> This leaves corrupted data accessible to the kernel, which will inevitably
> cause either silent data corruption or a delayed crash when the poisoned
> memory is next accessed.
>
> This is a common problem on large fleets. We frequently observe multi-bit ECC
> errors hitting kernel slab pages, where memory_failure() fails to recover them
> and the system crashes later at an unrelated code path, making root cause
> analysis unnecessarily difficult.
>
> Here is one specific example from production on an arm64 server: a multi-bit
> ECC error hit a dentry cache slab page, memory_failure() failed to recover it
> (slab pages are not supported by the hwpoison recovery mechanism), and 67
> seconds later d_lookup() accessed the poisoned cache line causing
> a synchronous external abort:
>
> [88690.479680] [Hardware Error]: error_type: 3, multi-bit ECC
> [88690.498473] Memory failure: 0x40272d: unhandlable page.
> [88690.498619] Memory failure: 0x40272d: recovery action for get hwpoison page: Ignored
> ...
> [88757.847126] Internal error: synchronous external abort: 0000000096000410 [#1] SMP
> [88758.061075] pc : d_lookup+0x5c/0x220
>
> This series adds a new sysctl vm.panic_on_unrecoverable_memory_failure
> (default 0) that, when enabled, panics immediately on unrecoverable
> memory failures. This provides a clean crash dump at the time of the
> error, which is far more useful for diagnosis than a random crash later
> at an unrelated code path.
Sashiko is asking things:
https://sashiko.dev/#/patchset/[email protected]