Re: what to do on memory or cache errors?

Matt Thomas Mon, 22 Aug 2011 14:13:35 -0700

On Aug 22, 2011, at 2:04 PM, <[email protected]> <[email protected]> 
wrote:


> I would think that memory errors are far more likely than cache errors.  If a 
> CPU gets cache errors, it is very badly broken. 

Probably true, but...

> I'm not sure it's worth doing anything other than panic for cache errors.  

Specifically, uncorrected cache errors on a dirty line.  If the cache line was 
clean, you could just invalidate it and keep going.  You might also want to 
keep a bitmap of cache lines to see whether cache errors keep happening for 
the same cache line.
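
To make that concrete, here is a rough sketch of the decision in C.  It is an 
illustration only, not code from any existing kernel: the names 
(cache_err_decide, cache_err_seen, NCACHELINES) and the three-way result are 
made up for this example, and a real handler would get the line index and 
dirty state from CPU-specific error registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cache geometry; a real port would probe this. */
#define NCACHELINES 1024u

enum cache_err_action {
	CACHE_ERR_INVALIDATE,	/* clean line, first failure: drop it, go on */
	CACHE_ERR_REPEAT,	/* clean line that has failed before */
	CACHE_ERR_PANIC		/* dirty line (or bad index): data is lost */
};

/* One bit per cache line, set once that line has reported an error. */
static uint32_t cache_err_seen[NCACHELINES / 32];

/*
 * Decide what to do about an uncorrected error on cache line `line`.
 * A dirty line holds the only copy of its data, so nothing can be
 * recovered.  A clean line can be invalidated and refetched; the
 * bitmap tells the caller whether this line has failed before.
 */
static enum cache_err_action
cache_err_decide(unsigned int line, bool dirty)
{
	uint32_t mask;
	bool seen;

	if (dirty || line >= NCACHELINES)
		return CACHE_ERR_PANIC;

	mask = UINT32_C(1) << (line % 32);
	seen = (cache_err_seen[line / 32] & mask) != 0;
	cache_err_seen[line / 32] |= mask;

	return seen ? CACHE_ERR_REPEAT : CACHE_ERR_INVALIDATE;
}
```

What to do on CACHE_ERR_REPEAT (log it, disable the line or way, or panic 
after all) is a policy question this sketch leaves open.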

> For memory errors, if you can get the failing address (which some CPUs can do 
> and some cannot) and you can associate that address with some process, then 
> you might kill that process instead of panicking.  Again, I'm not sure how 
> valuable that would be.  For highly fault tolerant control systems, perhaps.  
> For anything else, not clear.  Also, a highly fault tolerant system may well 
> use replicated CPUs, in which case having one CPU panic simply means the 
> other one takes over.

If the ECC error was in a page backed by the vnode-pager, you could just unmap 
the errant page, refill it with zeros (fixing the ECC), return it to a free 
list, and let whoever wanted the page fault the contents back in.
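
A minimal sketch of that recovery path, as a userland simulation rather than 
real VM code.  Everything here is hypothetical (struct page, its flags, 
ecc_page_recover, SKETCH_PAGE_SIZE), and it adds one assumption not stated 
above: the page must be clean, since a dirty page's contents cannot be 
re-read from the backing vnode.

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* hypothetical page size */

enum page_backing { BACKING_VNODE, BACKING_ANON, BACKING_KERNEL };

enum ecc_action {
	ECC_RECYCLED,		/* page scrubbed and put on the free list */
	ECC_UNRECOVERABLE	/* caller must kill the owner or panic */
};

/* Toy stand-in for a VM page; a real kernel tracks far more state. */
struct page {
	unsigned char data[SKETCH_PAGE_SIZE];
	enum page_backing backing;
	bool dirty;
	bool mapped;
	bool on_freelist;
};

/*
 * Handle an uncorrectable ECC error in `pg`.  A clean vnode-backed
 * page is unmapped, rewritten with zeros (the writes regenerate valid
 * ECC), and freed; the next access faults the data back in from the
 * file.  Anything else is reported as unrecoverable.
 */
static enum ecc_action
ecc_page_recover(struct page *pg)
{
	if (pg->backing != BACKING_VNODE || pg->dirty)
		return ECC_UNRECOVERABLE;

	pg->mapped = false;
	memset(pg->data, 0, sizeof(pg->data));
	pg->on_freelist = true;
	return ECC_RECYCLED;
}
```

Whether ECC_UNRECOVERABLE should mean killing the owning process or a panic 
is exactly the open question in this thread.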

> In short, is there a reason to change anything?

I don't know.  Which is why I'm asking.
