On Mon, May 16, 2016 at 5:44 PM, Richard A. Lochner <loch...@clone1.com> wrote:
> Chris,
>
> It has actually happened to me three times that I know of in ~7mos.,
> but your point about the "larger footprint" for data corruption is a
> good one. No doubt I have silently experienced that too.
I dunno, three is a lot for the exact same corruption to happen only in memory and then get written out into two copies with valid node checksums, and yet not cause other problems with a node item, uuid, xattr, or any number of other item or object types, all of which get checksummed. I suppose if the file system contains large files, the percentage of metadata that's csums could be the 2nd largest footprint. But still. Three times in 7 months, if it's really the same vector, is just short of reproducible. Ha.

It seems like if you merely balanced this file system a few times, you'd eventually stumble on this. And if that's true, then it's time to enable debug options, see if it can be caught in action, and find out whether there's a hardware or software explanation for it.

> And, as you
> suggest, there is no way to prevent those errors. If the memory to be
> written to disk gets corrupted before its checksum is calculated, the
> data will be silently corrupted, period.

Well, no way in the present design, maybe.

> Clearly, I won't rely on this machine to produce any data directly that
> I would consider important at this point.
>
> One odd thing to me is that if this is really due to undetected memory
> errors, I'd think this system would crash fairly often due to detected
> "parity errors."  This system rarely crashes. It often runs for
> several months without an indication of problems.

I think you'd have other problems. Only data csums are getting corrupted after they're read in, but before the node csum is computed? Three times? Pretty wonky.

-- 
Chris Murphy
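P.S. The write-time window described above can be sketched like this. This is an illustrative Python toy, not btrfs code: zlib.crc32 stands in for the crc32c btrfs actually uses, and `corrupt()` is a made-up helper that flips one bit to mimic an undetected memory error.

```python
# Toy model of the silent-corruption window: if a buffer is corrupted
# in RAM before its csum is computed, the csum matches the bad data
# and later verification (e.g. scrub) has nothing to catch.
import zlib

def corrupt(data: bytes, bit: int = 3) -> bytes:
    """Flip a single bit in the first byte, mimicking a RAM error."""
    buf = bytearray(data)
    buf[0] ^= 1 << bit
    return bytes(buf)

data = b"file extent contents"

# Case 1: corruption happens BEFORE the csum is computed.
# The csum is calculated over the already-bad data, so it verifies.
bad = corrupt(data)
csum_after = zlib.crc32(bad)
print(zlib.crc32(bad) == csum_after)             # True -> silent

# Case 2: corruption happens AFTER the csum is computed.
# Verification sees the mismatch and flags it.
csum_before = zlib.crc32(data)
print(zlib.crc32(corrupt(data)) == csum_before)  # False -> detected
```

Only case 2 is the kind of error a balance or scrub would stumble on; case 1 is the "no way to prevent, period" scenario without end-to-end protection like ECC memory.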