Ben Gordy writes:
> EDAC is a kernel level driver, and it's talking directly to the chipset,
> reading registers, and then just dumping out raw register values. When it
> accesses these read-once registers they get cleared so no information will be
> collected/logged by the Dell ESM. Without this information being obtained by
> the Dell ESM, there will never be any [LCD -- hardware level] alerts if a
> warning or failure threshold is reached. Also, there are no 'screens'
> available that will clearly identify the component logged by EDAC whereas
> Dell ESM already has the ability to log and identify a "problematic"
> component.
This is the first time I have heard of this. When you refer to "Dell ESM", are you talking about OMSA, or the onboard firmware (ESM = embedded system management?) of the BMC/DRAC? Our systems are not running OMSA, but we are using IPMI to monitor the SEL and sensor data, and we also monitor both EDAC information (via /sys/devices/system/edac/) and MCE (via /dev/mcelog -> /var/log/mcelog).

I would like to better understand what mechanism would generate reports for ECC or PCI bus parity errors if I follow your instructions to disable EDAC.

Also, how does this relate to the kernel MCE driver, which also reports ECC errors? Red Hat bug 501906 (https://bugzilla.redhat.com/show_bug.cgi?id=501906), despite being closed as "notabug", seems to indicate that MCE will also poll the chipset for ECC errors. If MCE clears these registers when it reads them, that would seem to be the same kind of problem. Of course, MCE also handles issues other than ECC, and I don't know whether all of these are covered by the BMC SEL monitoring.

In a thread on LKML (http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-04/msg04981.html) about these sorts of conflicts between OS and firmware monitoring of hardware errors, Eric Biederman writes:

"If the system event log actually captures all of the error events that are reported by the hardware, so we can write an equivalent driver by reading the SEL there may be a reasonable alternative route. Otherwise as it appears likely the SEL filters the data it appears to be yet another case of reducing the value of the hardware by putting an unreliable firmware interface in front of it."

I'm not saying this is accurate, or agreeing with it. The immediate cause of the problem in that thread was AMIBIOS hiding ECC registers, and in general I have better expectations of the more integrated Dell BIOS/BMC firmware. But the general issue is certainly germane.
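For readers unfamiliar with the EDAC side of this, the sysfs monitoring mentioned above can be sketched roughly as follows. This is only an illustration, not the actual monitoring script in use here: it assumes the usual EDAC layout of one mc* directory per memory controller under /sys/devices/system/edac/mc/, each with ce_count and ue_count files, and the function name read_edac_counts is made up for the example.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} for each mc* directory.

    Assumes the usual EDAC sysfs layout (mcN/ce_count, mcN/ue_count).
    The function name and return shape are illustrative only.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    # Prints nothing if EDAC is not loaded or no controllers are present.
    for name, c in read_edac_counts().items():
        print(f"{name}: CE={c['ce']} UE={c['ue']}")
```

These counters are cumulative since the driver was loaded, so a poller would normally compare successive readings rather than alert on the absolute value.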
With the Dell BMC/DRAC I may have a better chance of having a hardware error mapped to an identifiable, and possibly field-replaceable, component, but are there other important details that I might be missing there? For example, our current EDAC monitoring looks at both UE (uncorrectable errors) and CE (correctable errors). A high rate of CE may indicate a potential problem, even though a single CE may not be important enough to justify a SEL alert and LCD notification.

If I disable EDAC entirely and rely on SEL events, I would have to modify the platform event filter (PEF) to report CE as well as UE (if that is even possible) in order to get this information, and doing so would risk overflowing the limited SEL buffer in the case of a serious memory problem.

@alex
--
mailto:alex.du...@mac.com

_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge@dell.com
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq