On Sun, 12 Aug 2007 18:51:31 +0200, Folkert van Heusden said: > a question and an idea: Q: is ecc guaranteed to detect all bitflips?
It depends on the exact ECC function the hardware implements. Usually it provides performance such as: "Correct all 1-bit errors. Detect all 2-bit errors, and most 3 and higher, but not correct". (Of course, "correct all 1 or 2 bit and detect all 3 bit" can be done, it just takes more bits of ECC.) > Idea: what about a multicore system (3 or more) that runs the same > processes on 2 cores and a third core verifying that they both do the > same? As I think it is not only ram that can become faulty. This is actually done for high-reliability systems (Google for "tell me twice" and "tell me three times"). The problem is that it takes a lot of extra hardware. The G5 and later IBM Z-series mainframe chipsets (not to be confused with the PowerPC G5) implemented dual computation units and a comparator that signals a 'Machine Check' condition if the two CPUs don't end up in the same exact state (as an added bonus, at the end of each instruction that both *do* compare good, it latches the *entire* state of the CPU out, and then does the following: 1) Retry the instruction on the same CPU - if it compares correctly, keep going and flag a "soft" error. 2) If it still fails, read out the last "known good" status latch, and load it into a spare CPU, and fire it up, and flag the failing one as bad. http://www.research.ibm.com/journal/rd/435/spainhower.pdf http://www.research.ibm.com/journal/rd/435/mueller.pdf These guys have forgotten more about designing highly reliable systems than most of us will ever know. ;) Needless to say, not everybody is willing to pay the costs of the hardware overhead of this approach.
pgpCmbxDYMQib.pgp
Description: PGP signature