Dear All,

As is customary in any datacenter environment, we use ECC RAM on all of our 
machines.  Therefore, on the rare occasions when we had data corruption issues 
or sudden crashes, I used to think that RAM couldn't be the culprit (we've got 
ECC, right?), until I discovered this article:

https://www.neuhalfen.name/2013/09/05/your-data-is-corrupted-and-you-dont-know-it/

In a nutshell, the author proposes a software solution (a kernel module) to 
enhance hardware error correction, i.e. to catch and correct a number of errors 
that the hardware alone cannot fix.  This concept surprised me, as I had quite 
naively assumed that hardware ECC was enough to catch and correct _100%_ of 
errors.
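
One common reason hardware ECC falls short of 100% can be shown with a toy 
example.  The sketch below is not taken from the article; it assumes a 
SECDED-style code (single error correction, double error detection), shrunk 
to an 8-bit extended Hamming code over 4 data bits instead of the wider codes 
real memory controllers use.  The function names are made up for 
illustration.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); nibble is None if the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flipped bits: the decoder ASSUMES exactly one flip
        # and "repairs" the position named by the syndrome (0 = parity bit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: detected only.
        return None, "uncorrectable"
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


def flip(code, *positions):
    """Return a copy of code with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out
```

With this toy code, one flipped bit is repaired and two flipped bits are 
reported, but three flipped bits (for example positions 3, 5 and 6) look 
like a single-bit error to the decoder, which then hands back wrong data 
marked as "corrected".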

What's more interesting is that RAM sticks tend to exhibit more errors as 
they age, so there's little point in running a burn-in test on your new 
server:  the problem will most likely happen once you've got valuable data on 
your machines, maybe two years later.  Hence the need for an online detection 
and correction system.

Such a system has been developed already.  It's called RAMpage; it's based on 
the work of Jens Neuhalfen (the author of the article linked above) and is 
currently available as a beta release:

https://github.com/schirmeier/rampage

Given the delicate area in which RAMpage operates, I would never use its beta 
version on a production server, but my colleagues and I agreed that the idea is 
absolutely terrific.  If RAMpage were merged into the VZ kernel, several 
engineers that I know would be very interested.

What's your opinion?

Best,
Corrado Fiore
_______________________________________________
Users mailing list
Users@openvz.org
https://lists.openvz.org/mailman/listinfo/users
