EDAC on Dell servers

Alexander Dupuy Wed, 20 Oct 2010 11:13:58 -0700

Ben Gordy writes:

> EDAC is a kernel level driver, and it's talking directly to the chipset, 
> reading registers, and then just dumping out raw register values.  When it 
> accesses these read-once registers they get cleared so no information will be 
> collected/logged by the Dell ESM.  Without this information being obtained by 
> the Dell ESM, there will never be any [LCD -- hardware level] alerts if a 
> warning or failure threshold is reached.  Also, there are no 'screens' 
> available that will clearly identify the component logged by EDAC whereas 
> Dell ESM already has the ability to log and identify a "problematic" 
> component.



This is the first time I have heard of this.  When you refer to "Dell
ESM" are you talking about OMSA, or the onboard firmware (ESM = embedded
system management?) of the BMC/DRAC?


Our systems are not running OMSA, but we are using IPMI to monitor the
SEL and sensor data, and also monitoring both EDAC information (via
/sys/devices/system/edac/) and MCE (via /dev/mcelog ->
/var/log/mcelog).  I would like to better understand what mechanism
would generate reports for ECC or PCI bus parity errors if I follow your
instructions to disable EDAC.  Also, how does this relate to the kernel
MCE driver, which also reports ECC errors?
https://bugzilla.redhat.com/show_bug.cgi?id=501906, despite being closed
as "notabug", seems to indicate that MCE will also poll the chipset for
ECC errors, and if these are being cleared by MCE that would seem to be
the same kind of problem.  Of course, MCE also handles issues other
than ECC, and I don't know if all of these are covered by the BMC SEL
monitoring.


In a thread on LKML
http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-04/msg04981.html
about these sorts of conflicts between OS and firmware monitoring of
hardware errors, Eric Biederman writes:


"If the system event log actually captures all of the error events
that are reported by the hardware, so we can write an equivalent
driver by reading the SEL there may be a reasonable alternative
route. Otherwise as it appears likely the SEL filters the data
it appears to be yet another case of reducing the value of the
hardware by putting an unreliable firmware interface in front
of it."


I'm not saying this is accurate or agreeing with it (the immediate cause
of the problem in that thread was AMIBIOS hiding ECC registers; in
general I have higher expectations of the more integrated Dell
BIOS/BMC firmware) but the general issue is certainly germane.  With the
Dell BMC/DRAC I may have a better chance of having a hardware error
mapped to an identifiable, and possibly field-replaceable, component,
but are there other important details that I might be missing there?


For example, our current EDAC monitoring looks at both uncorrectable
errors (UE) and correctable errors (CE).  A high rate of CEs may
indicate a potential problem, even though a single CE may not be
important enough to justify a SEL alert and LCD notification.  If I
disable EDAC entirely and rely on SEL events, I would have to modify the
platform event filter (PEF) to report CE as well as UE (if that is even
possible) in order to get this information, and doing so would risk
overflowing the limited SEL buffer in the case of a serious memory
problem.
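
To make the CE/UE distinction concrete, here is a minimal sketch of this
kind of monitoring, not the actual script in use.  It assumes the usual
EDAC sysfs layout (per-controller ce_count and ue_count files under
/sys/devices/system/edac/mc/mcN/).  The function names and the
ce_rate_limit threshold are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (ce_count, ue_count)} for each mcN directory."""
    counts = {}
    for mc in sorted(Path(edac_root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = (ce, ue)
    return counts


def classify(prev, curr, ce_rate_limit=10):
    """Compare two samples taken one polling interval apart.

    Any new UE is reported; CEs are reported only when the increase
    reaches ce_rate_limit (a hypothetical threshold, tune per site).
    """
    alerts = []
    for name, (ce, ue) in curr.items():
        prev_ce, prev_ue = prev.get(name, (0, 0))
        if ue > prev_ue:
            alerts.append((name, "UE", ue - prev_ue))
        if ce - prev_ce >= ce_rate_limit:
            alerts.append((name, "CE", ce - prev_ce))
    return alerts
```

A poller would call read_edac_counts() at a fixed interval and pass the
previous and current samples to classify().  The point is that isolated
CEs stay quiet while a burst of them does not, which is the information
that would be lost if only UEs reached the SEL.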


@alex

-- 
mailto:alex.du...@mac.com

_______________________________________________
Linux-PowerEdge mailing list
Linux-PowerEdge@dell.com
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq
