On Thu, Feb 29, 2024 at 10:55:14AM -0000, Michael van Elst wrote:
> The OS could be smart, lock out bad memory regions, recover some
> errors by e.g. paging in text data again or even use mirrored RAM
> (with motherboard support).

IIRC Intel Ice Lake introduced mechanisms to enable kernels to recover
from poisoned-data situations, but I don't know how far this has been
implemented.  Ideally an app could be given some sort of notification
about poisoned data instead of the kernel blindly panicking.

> >A lot of fragile chipset specific code to get that.
>
> Indeed.

There's an expectation that the platform-specific bits would be
abstracted for now through ACPI, and eventually codified into a hardware
RAS controller with a standardized driver attached either as a PCIe
function or ACPI-discovered MMIO space.

Part of EDAC is not only getting notifications of the errors, but also
being able to map physical addresses back to physical components
(DIMMs or CXL devices) so you know what to replace or block.

-- 
Aaron J. Grier | "Not your ordinary poofy goof." | agr...@poofygoof.com
"The price of reliability is the pursuit of the utmost simplicity.  It is
a price which the very rich find most hard to pay."  -- Tony Hoare