On 11.10.19 07:40:31, Mauro Carvalho Chehab wrote: > Em Thu, 10 Oct 2019 20:25:16 +0000 > Robert Richter <rrich...@marvell.com> escreveu: > > > Looking at how mci->{ue,ce}_per_layer[EDAC_MAX_LAYERS] is used, it > > turns out that only the leaves in the memory hierarchy are consumed > > (in sysfs), but not the intermediate layers, e.g.: > > > > count = dimm->mci->ce_per_layer[dimm->mci->n_layers-1][dimm->idx]; > > > > These unused counters only add complexity, remove them. The error > > counter values are directly stored in struct dimm_info now. > > Hmm... not sure if this patch is correct. I remember that there are some > border cases on some drivers (maybe the 3-layer drivers used together > with RDIMM memory controllers?) where some errors are not associated > to an specific dimm, but, instead, are related to a problem at the memory > bus. > > Also, depending on how the memory controllers are organized[1], the ECC > logic groups memory on DIMM pairs. So, when an error occur, it may be > either at DIMM1 or DIMM2. > > [1] On Intel, this happens with pre-Nehalem memory controllers. > > Due to that, storing errors at the dimm struct sounds wrong, as the > error may affect multiple DIMMs or even the entire layer.
That was my first thought too, but the counter values are not used at all. The only exception is to provide *per-dimm* counters here: {ce,ue}_per_layer[n_layers-1][dimm->idx]. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/edac_mc_sysfs.c?id=4f5cafb5cb8471e54afdc9054d973535614f7675#n567 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/edac_mc_sysfs.c?id=4f5cafb5cb8471e54afdc9054d973535614f7675#n584 The case you mentioned above is if the mc only sends parts of the error location (with a top, mid or low layer missing). The dimm cannot be identified then. In this case edac_mc_handle_error() tries to find a unique row (+ channel infomation if available and lists all possible dimm labels in e->label. See: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/edac_mc.c?id=4f5cafb5cb8471e54afdc9054d973535614f7675#n1153 Thus, we see a counter increment for row (and also channel if it can be identified), but this is counted in mci->csrows array only that is not removed by this patch. That said, the {ue,ce}_per_layer[] arrays can be removed by keeping the same driver functionality, esp. the case you mentioned above. -Robert