On Aug 22, 2011, at 2:04 PM, <[email protected]> <[email protected]> wrote:
> I would think that memory errors are far more likely than cache errors. If a
> CPU gets cache errors, it is very badly broken.

Probably true, but...

> I'm not sure it's worth doing anything other than panic for cache errors.

Specifically, for uncorrected cache errors on a dirty line. If the cache line
was clean, you could just clear it and keep going. You might also want to
keep a bitmap of cache lines to see whether cache errors keep happening for
the same cache line.

> For memory errors, if you can get the failing address (which some CPUs can do
> and some cannot) and you can associate that address with some process, then
> you might kill that process instead of panicking. Again, I'm not sure how
> valuable that would be. For highly fault tolerant control systems, perhaps.
> For anything else, not clear. Also, a highly fault tolerant system may well
> use replicated CPUs, in which case having one CPU panic simply means the
> other one takes over.

If the ECC error was in a page backed by the vnode-pager, you could just unmap
the errant page, refill it with zeros (fixing the ECC), return it to a free
list, and let whoever wanted the page fault the contents back in.

> In short, is there a reason to change anything?

I don't know. Which is why I'm asking.
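
To make the options above concrete, here is a rough sketch of the decision
logic in C. It is not NetBSD code: every name, the enum values, and the cache
geometry (CACHE_NLINES) are made up for illustration. It also assumes the
vnode-backed page is clean, so the backing file still holds a good copy, and
it leaves open what to actually do with a cache line that keeps failing
(here it is only reported as a repeat).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache geometry; a real port would get this from the CPU. */
#define CACHE_NLINES 1024u
#define WORD_BITS    (sizeof(unsigned long) * CHAR_BIT)

/* One bit per cache line, set once that line has reported an error. */
static unsigned long cache_err_seen[(CACHE_NLINES + WORD_BITS - 1) / WORD_BITS];

enum cache_action {
    CACHE_ERR_CONTINUE,   /* clean line: invalidate it and keep going */
    CACHE_ERR_REPEAT,     /* clean line, but it has failed before */
    CACHE_ERR_PANIC       /* dirty line: the only copy of the data is gone */
};

/* Record an error on a line; return true if that line had failed before. */
static bool cache_err_note(unsigned line)
{
    assert(line < CACHE_NLINES);
    unsigned long mask = 1UL << (line % WORD_BITS);
    unsigned long *word = &cache_err_seen[line / WORD_BITS];
    bool repeat = (*word & mask) != 0;

    *word |= mask;
    return repeat;
}

/* Uncorrected cache error: panic only if the line was dirty. */
static enum cache_action cache_err_policy(unsigned line, bool dirty)
{
    bool repeat = cache_err_note(line);

    if (dirty)
        return CACHE_ERR_PANIC;
    return repeat ? CACHE_ERR_REPEAT : CACHE_ERR_CONTINUE;
}

/* Hypothetical classification of the page that took the memory error. */
enum page_kind {
    PAGE_VNODE_CLEAN,       /* assumed: file-backed and unmodified */
    PAGE_PROCESS_PRIVATE,   /* can be tied to a single process */
    PAGE_KERNEL_OR_UNKNOWN
};

enum mem_action {
    MEM_DISCARD_AND_REFAULT, /* unmap, zero (rewrites ECC), free, refault */
    MEM_KILL_PROCESS,
    MEM_PANIC
};

/* Uncorrected memory error: recover only when the address is known. */
static enum mem_action mem_err_policy(bool have_addr, enum page_kind kind)
{
    if (!have_addr)          /* some CPUs cannot report the failing address */
        return MEM_PANIC;

    switch (kind) {
    case PAGE_VNODE_CLEAN:
        return MEM_DISCARD_AND_REFAULT;
    case PAGE_PROCESS_PRIVATE:
        return MEM_KILL_PROCESS;
    default:
        return MEM_PANIC;
    }
}
```

With this, the first clean-line error on a given line yields
CACHE_ERR_CONTINUE and a second one on the same line yields CACHE_ERR_REPEAT;
anything without a usable failing address still ends in a panic, which is
what happens today.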
