Martin Steigerwald posted on Fri, 26 Dec 2014 16:59:09 +0100 as excerpted:

> Dec 26 16:17:57 merkaba kernel: [ 8102.029438] mce: [Hardware Error]: Machine check events logged
> Dec 26 16:20:27 merkaba kernel: [ 8252.054015] mce: [Hardware Error]: Machine check events logged

Have you checked these MCEs? What are they?

MCEs are hardware errors. These are *NOT* kernel errors, tho of course they may /trigger/ kernel errors. The reported event codes can be looked up and translated into English.

From shortly after the first one until a bit before the second one here, you had hardware thermal throttling, so the CPUs, on-chip cache, and possibly the memory were working pretty hard.

FWIW, I had an AMD machine that would MCE with memory-related errors some time (about a decade) ago. I had ECC RAM, but it was cheap and apparently not quite up to the speeds it was actually rated for. MemTest checked out the memory fine, but under high stress especially, it would sometimes have bus/transit-related corruption, which would sometimes (not always) trigger those MCEs.

Eventually a BIOS update gave me the ability to turn down the memory timings, and turning them down just one notch made everything rock-stable -- I was even able to decrease some of the wait-states to get a bit of the memory speed back. It just so happened that it was borderline stable at the rated clock, and turning the memory clock down just one notch was all it took. Later, I upgraded the RAM (the bad RAM was two half-gig sticks, back when they were $100+ apiece; I upgraded to four 2-gig sticks), and the new RAM didn't have the problem at all -- the bad RAM sticks simply weren't /quite/ stable at the rated speed, that was it.

I run gentoo so of course do a lot of building from sources, and interestingly enough, the thing that turned out to detect the corruption the most often was bzip2 compression checksums. I'd get errors decompressing the sources before the build rather more often than actual build failures, altho those would happen occasionally as well. Redoing the decompression would work fine, with the checksums passing, and I never had a build that actually finished fail to run due to a bad build.

Now here's the thing: what caught the corruption back then was a checksum.

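A minimal Python sketch of the kind of round-trip check described above: compress a buffer with bzip2, decompress it again, and compare hashes. It uses Python's standard bz2 and hashlib modules; the helper names (make_payload, roundtrip_ok, stress) and the buffer sizes are made up for illustration, and the single-bit flip only simulates a bad transfer. It is not a replacement for a real memory tester or for decoding the MCEs themselves, since a buffer this small may never leave the CPU cache.

```python
import bz2
import hashlib
import random


def make_payload(size, seed):
    """Build a deterministic, compressible test buffer (hypothetical helper)."""
    rng = random.Random(seed)
    return bytes(rng.choices(b"abcdefgh", k=size))


def roundtrip_ok(payload, corrupt=False):
    """Compress, optionally flip one bit, decompress, and compare hashes."""
    expected = hashlib.sha256(payload).digest()
    packed = bytearray(bz2.compress(payload))
    if corrupt:
        # Simulated single-bit error in the middle of the compressed stream.
        packed[len(packed) // 2] ^= 0x01
    try:
        unpacked = bz2.decompress(bytes(packed))
    except (OSError, ValueError, EOFError):
        # The bzip2 stream or checksum validation rejected the data.
        return False
    return hashlib.sha256(unpacked).digest() == expected


def stress(rounds=20, size=1 << 16):
    """Count failed round trips; healthy hardware should report 0."""
    return sum(not roundtrip_ok(make_payload(size, seed)) for seed in range(rounds))


if __name__ == "__main__":
    print("failed round trips:", stress())
    detected = not roundtrip_ok(make_payload(1 << 16, 0), corrupt=True)
    print("simulated bit flip detected:", detected)
```

On hardware that is only borderline stable, the interesting signature is the one from the anecdote: a round trip that fails once and then passes when repeated on the same input.
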
Of course, a decade ago was well before I was running btrfs (FWIW I was running reiserfs at the time, and it seemed pretty resilient given the bad RAM I had), so it was the bzip2 checksums it failed on.

But guess what btrfs uses for file integrity? Checksums. If your MCEs are either like my memory-related MCEs were, or are CPU-cache or CPU related but still something that would affect checksumming, btrfs may well be fighting bad checksums due to the same issues, and that would of course throw all sorts of wrenches into things.

Another thing I've seen reported as triggering MCEs is bad power. In that case it was a UPS that was either underpowered or going bad; once it was out of the picture, the MCEs and problems stopped.

Now I think you're having other btrfs issues as well, some of which are likely legit bugs. However, your MCEs certainly aren't helping things, and I'd definitely recommend checking up on them to see what's actually happening to your hardware. It may well be that without whatever hardware issues are triggering those MCEs, you end up with fewer btrfs problems as well.

Or maybe not, but it's something to look into, because right now, regardless of whether they're making things worse physically, they're at minimum obscuring a troubleshooting picture that would be clearer without them.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman