Source: linux
Version: 6.5~rc7-1~exp1
Severity: wishlist
Tags: patch
X-Debbugs-Cc: miguel.bernal.ma...@linux.intel.com, jair.gonza...@linux.intel.com

Dear Maintainer,

Please enable the Reliability, Availability and Serviceability (RAS) Correctable Errors Collector (RAS_CEC) feature on the amd64 (x86_64) architecture for Debian Trixie.

RAS_CEC introduces a simple data structure for collecting correctable errors, along with accessors. It is a small cache that collects correctable memory errors per 4K page PFN and counts their repeated occurrence. Once the counter for a PFN overflows, the kernel tries to soft-offline that page, on the assumption that it has reached a relatively high error count and is probably best not used anymore.

Error decoding is done through the decoding chain: mce_first_notifier() sees the error first, and the CEC decides either to log it, in which case the rest of the chain does not hear about it (this is the main reason for the CE collector), or to continue running the notifiers. When the CEC hits the action threshold, it tries to soft-offline the page containing the ECC error, and then the whole decoding chain gets to see the error.

To disable the Correctable Errors Collector at boot, use the kernel parameter:

> ras=cec_disable

A merge request with this proposal is at:
https://salsa.debian.org/kernel-team/linux/-/merge_requests/827

Thanks,
Miguel Bernal Marin
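
For reference, the collector behaviour described above can be modeled roughly as follows. This is a toy Python sketch, not the kernel implementation: the class name, the threshold of 3, the cache size, and the eviction policy are all assumptions chosen for illustration, and the soft-offline step is only simulated by recording the PFN.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # 4K pages, as in the description above


class CorrectableErrorCollector:
    """Toy model of a per-PFN correctable-error cache (not the kernel code)."""

    def __init__(self, action_threshold=3, max_entries=4):
        # Both limits are illustrative values, not the kernel's defaults.
        self.action_threshold = action_threshold
        self.max_entries = max_entries
        self.counts = OrderedDict()  # pfn -> correctable errors seen so far
        self.offlined = []           # PFNs handed to the simulated soft-offline

    def add_error(self, phys_addr):
        """Record one correctable error at a physical address.

        Returns False if the collector absorbed the error (the rest of the
        chain would not hear about it), True if the threshold was reached
        and the error should be passed on to the remaining notifiers.
        """
        pfn = phys_addr >> PAGE_SHIFT
        count = self.counts.pop(pfn, 0) + 1
        if count >= self.action_threshold:
            # Threshold reached: stop tracking the page and try to retire it.
            self.offlined.append(pfn)
            return True
        if len(self.counts) >= self.max_entries:
            # Cache full: drop the least recently hit page (a simplification).
            self.counts.popitem(last=False)
        self.counts[pfn] = count
        return False


cec = CorrectableErrorCollector(action_threshold=3)
results = [cec.add_error(0x12345678) for _ in range(3)]
```

With these assumed settings, the first two errors on the same page are absorbed by the collector, and the third one triggers the simulated soft-offline and is passed on to the rest of the chain.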