Hi,

I am trying to find out why some applications crash on my laptop. I mostly use Python, which I built with "configure --with-pydebug" so that it pads allocated memory regions with 0xfb bytes. That helps to detect when something has overwritten the padding. So far it has twice reported a 0xfb to 0xfa transition at logical position 5. I was told Python cannot print the physical hardware address, but let's assume this is a memory error and one bit was flipped (0xfb and 0xfa differ only in the lowest bit). However, other applications sometimes crash as well, and although I have run plenty of core dumps through gdb to find out where they crashed, that has not led to an answer.

I ran memtest86+ for hours, but it never found anything wrong. In my experience the errors appear when the CPU is loaded, which is not the case under memtest86+ started from a boot CD. I think another reason why memtest86+ may not find the problematic bit is that it would have to fill the whole RAM with e.g. 0xfb and then keep re-reading those values for hours to check that they still read as 0xfb. It seems that all the write and read tests done by memtest86+ follow each other too quickly. I am missing a test where the data is written into memory and kept there for a long while (hours, days).
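To make the missing test concrete, here is a minimal user-space sketch in Python of the long-dwell check I have in mind. It is not a replacement for memtest86+. The names fill, scan and dwell_test are made up for illustration. It assumes the buffer stays in the same physical pages the whole time, which an ordinary process cannot guarantee (pages can be swapped or migrated, and reads may be served from the CPU cache), so a clean run proves little.

```python
import time

PATTERN = 0xFB  # the byte value mentioned above; any fixed value would do


def fill(size_bytes, pattern=PATTERN):
    """Allocate a buffer and set every byte to the pattern."""
    return bytearray([pattern]) * size_bytes


def scan(buf, pattern=PATTERN, chunk=1 << 20):
    """Return a list of (offset, value) for every byte that differs."""
    expected = bytes([pattern]) * chunk
    view = memoryview(buf)
    bad = []
    for start in range(0, len(buf), chunk):
        block = view[start:start + chunk]
        # Compare a whole chunk first; only walk it byte by byte on mismatch.
        if block != expected[:len(block)]:
            bad.extend((start + i, b) for i, b in enumerate(block)
                       if b != pattern)
    return bad


def dwell_test(size_bytes, dwell_seconds, rounds):
    """Fill once, then rescan after each dwell period and report changes."""
    buf = fill(size_bytes)
    for n in range(rounds):
        time.sleep(dwell_seconds)
        for offset, value in scan(buf):
            print(f"round {n}: offset {offset:#x} reads {value:#04x}, "
                  f"flipped bits {value ^ PATTERN:#04x}")
            buf[offset] = PATTERN  # restore so a repeat shows up again
    return buf
```

For example, dwell_test(1 << 30, 600, 144) would hold 1 GiB for about a day and rescan it every ten minutes, ideally while something else keeps the CPU busy.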
Finally, I got the idea that the Linux kernel could emulate ECC RAM by keeping checksums in another region of memory. This would help to find not only a flipped memory bit but also larger corrupted regions of memory. I don't need speed (running apps under valgrind/DUMA is not fast either) and I don't need memory hotplug. Let's say this is for diagnostic purposes. I don't mind if somebody says I have to sacrifice 1/2 of my precious RAM to do software memory mirroring. Even that would be a cool trick to find out where the bug is hiding.

I speculate that it could be just a slightly overheated memory controller after high CPU usage, or that the CPU or its cache gets upset, and that it has nothing to do with RAM. When the machine is cold, it works. But first I need proof that the RAM is not at fault.

I think somebody must have already thought about this, so I am just asking what you think. Maybe this is already available in some Linux source tree as a proof-of-concept patch. ;) That would be great.

https://www.usenix.org/legacy/event/atc10/tech/full_papers/Li.pdf

Thank you,
Martin
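P.S. A rough illustration of the checksum idea, written in Python in user space rather than in the kernel. The function names and the 4096-byte page size are assumptions for the example only. It uses CRC32, so it can only detect corruption, not correct it the way real ECC does, and it assumes nobody legitimately writes to the buffer between taking the checksums and verifying them. Keeping the checksums up to date on every legitimate write is the part a kernel implementation would have to solve.

```python
import zlib

PAGE = 4096  # assumed page size, for illustration only


def checksum_pages(buf, page=PAGE):
    """Return one CRC32 per page of buf, kept in a separate list."""
    view = memoryview(buf)
    return [zlib.crc32(view[i:i + page]) for i in range(0, len(buf), page)]


def corrupted_pages(buf, sums, page=PAGE):
    """Return the indices of pages whose current CRC32 differs from sums."""
    current = checksum_pages(buf, page)
    return [n for n, crc in enumerate(current) if crc != sums[n]]


# Example: three pages of 0xfb, then a single bit flip in the second page.
data = bytearray(b"\xfb" * (3 * PAGE))
saved = checksum_pages(data)
data[PAGE + 5] ^= 0x01          # 0xfb becomes 0xfa
print(corrupted_pages(data, saved))
```

The checksum list itself lives in the same possibly faulty RAM, so a mismatch only says that either the page or its checksum changed.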