> Also, I suspect that, if an error happens to affect more than one DIMM
> (e. g. part of the location is not available for a given error),
> that the DIMM label will also not be properly shown.

There are a couple of cases here:

1) A number of DIMMs sit behind some flaky h/w that introduces errors
that then get blamed on each of those DIMMs.

  All we can do here is statistical correlation: each error is reported
  independently, and it is up to some entity to notice the higher-level
  topology connection. There is enough information in the UEFI error record
  to do that (assuming that BIOS filled out the necessary fields).
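
  As a rough illustration of that correlation step, here is a minimal sketch.
  It assumes each error has already been decoded into a dict with "node",
  "card" and "module" keys; those key names, and the choice of (node, card)
  as the shared parent, are hypothetical stand-ins for whatever location
  fields the BIOS actually marked valid in the record. It is not how any
  existing tool does this.

```python
from collections import defaultdict

def suspect_shared_parents(records, min_dimms=2):
    """Group errors by (node, card); return parents with errors on several DIMMs."""
    dimms_by_parent = defaultdict(set)
    for rec in records:
        node, card, module = rec.get("node"), rec.get("card"), rec.get("module")
        if node is None or card is None or module is None:
            # Location incomplete (field not filled out): cannot correlate.
            continue
        dimms_by_parent[(node, card)].add(module)
    return {
        parent: sorted(dimms)
        for parent, dimms in dimms_by_parent.items()
        if len(dimms) >= min_dimms
    }

# Hypothetical decoded records, one per independently reported error.
records = [
    {"node": 0, "card": 1, "module": 0},
    {"node": 0, "card": 1, "module": 1},
    {"node": 0, "card": 1, "module": 0},
    {"node": 1, "card": 0, "module": 2},
    {"node": 1, "card": None, "module": 3},
]
print(suspect_shared_parents(records))
```

  A parent that shows up here is only a hint that the common h/w is worth a
  look; a real check would also want error counts and a time window.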

2) There is a single reported error that spans more than one DIMM.

  This can happen with a UC error in a pair of lock-step DIMMs.  Since the
  error is UC we know that two (or more) bits are bad.  But we have no way to
  tell whether the bad bits came from the same DIMM, or one bit from each
  (because we don't know which bits are bad - if we knew that, we could fix
  them :-).  eMCA should log two subsections in this case - one for each of
  the lockstep DIMMs involved.  A user seeing this should probably just
  replace both DIMMs to be safe.  If they wanted to diagnose further they
  should swap DIMMs around so this pair is no longer lockstepped and see if
  they start seeing correctable errors from each of the split pair - or if
  the UC errors move with one or the other of the DIMMs.
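
  The swap-and-watch diagnosis could be written down roughly as below. This
  is only a sketch: the event format (DIMM label plus "CE" or "UC"), the
  function name and the DIMM labels are all made up for illustration, and
  the conclusions it prints are the likely reading of each outcome, not a
  guarantee.

```python
def interpret_after_split(events, pair):
    """events: iterable of (dimm_label, severity), severity is "CE" or "UC".

    pair: the two DIMM labels that used to be lockstepped together.
    """
    a, b = pair
    uc = {dimm for dimm, sev in events if sev == "UC" and dimm in pair}
    ce = {dimm for dimm, sev in events if sev == "CE" and dimm in pair}
    if len(uc) == 1:
        # The UC errors moved with one DIMM of the old pair.
        return f"UC errors follow {next(iter(uc))}: suspect that DIMM"
    if ce == {a, b} and not uc:
        # Each half of the split pair now shows correctable errors.
        return "CE on both: each DIMM likely contributed a bad bit"
    return "inconclusive: keep monitoring or replace both"

# Hypothetical labels for the formerly lockstepped pair.
pair = ("DIMM_A1", "DIMM_B1")
print(interpret_after_split([("DIMM_A1", "UC"), ("DIMM_A1", "UC")], pair))
print(interpret_after_split([("DIMM_A1", "CE"), ("DIMM_B1", "CE")], pair))
```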

-Tony