On 2016-05-09 12:29, Zygo Blaxell wrote:
On Mon, May 09, 2016 at 04:53:13PM +0200, Niccolò Belli wrote:
While trying to find a common denominator for my issue I did lots of backups
of /dev/mapper/cryptroot and I restored them into /dev/mapper/cryptroot
dozens of times (triggering a 150GB+ random data write every time), without
any issue (after restoring the backup I always check the partition with btrfs
check). So the disk doesn't seem to be the culprit.

Did you also check the data matches the backup?  btrfs check will only
look at the metadata, which is 0.1% of what you've copied.  From what
you've written, there should be a lot of errors in the data too.  If you
have incorrect data but btrfs scrub finds no incorrect checksums, then
your storage layer is probably fine and we have to look at CPU, host RAM,
and software as possible culprits.
This is a good point.
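
For example (a rough sketch; adjust the device and image paths to match
your setup), right after restoring and before mounting you can compare the
raw device against the backup image, then mount it and let scrub verify
every checksum btrfs has:

  # compare the restored device with the backup image, byte for byte
  cmp /path/to/backup.img /dev/mapper/cryptroot

  # then mount it and have btrfs verify all data and metadata checksums
  mount /dev/mapper/cryptroot /mnt
  btrfs scrub start -B /mnt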

The logs you've posted so far indicate that bad metadata (e.g. negative
item lengths, nonsense transids in metadata references but sane transids
in the referred pages) is getting into otherwise valid and well-formed
btrfs metadata pages.  Since these pages are protected by checksums,
the corruption can't be originating in the storage layer--if it was, the
pages should be rejected as they are read from disk, before btrfs even
looks at them, and the insane transid should be the "found" one not the
"expected" one.  That suggests there is either RAM corruption happening
_after_ the data is read from disk (i.e. while the pages are cached in
RAM), or a severe software bug in the kernel you're running.
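
For what it's worth, the two cases look different in the kernel log:
checksum failures are reported as the page is read from disk, while transid
verify failures come from pages that already passed their checksum.
Something along these lines should pull both kinds out of the log, assuming
the errors are being logged at all:

  dmesg | grep -iE 'btrfs.*(csum|transid|corrupt)'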

Try different kernel versions (e.g. 4.4.9 or 4.1.23) in case whoever
maintains your kernel had a bad day and merged a patch they should
not have.

Try a minimal configuration with as few drivers as possible loaded,
especially GPU drivers and anything from the staging subdirectory--when
these drivers have bugs, they ruin everything.
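
As a concrete starting point (the module names here are only examples --
check lsmod for what your system actually loads), you can keep suspect
drivers out of the picture with a modprobe blacklist or on the kernel
command line:

  # see what is currently loaded
  lsmod

  # stop suspect modules from auto-loading (example names)
  echo "blacklist nouveau" >> /etc/modprobe.d/minimal-test.conf
  echo "blacklist r8712u"  >> /etc/modprobe.d/minimal-test.conf

  # or do it one-off on the kernel command line:
  #   modprobe.blacklist=nouveau,r8712u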

Try memtest86+ which has a few more/different tests than memtest86.
I have encountered RAM modules that pass memtest86 but fail memtest86+
and vice versa.

Try memtester, a memory tester that runs as a Linux process, so it can
detect corruption caused when device drivers spray data randomly into RAM,
or when the CPU thermal controls are influenced by Linux (an overheating
CPU-to-RAM bridge can really ruin your day, and some of the dumber laptop
designs rely on the OS for thermal management).
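
memtester just takes an amount of memory to lock and a pass count, so on an
8GB machine something like this would do (the numbers are only an example;
leave some headroom so the rest of the system stays usable):

  # as root: lock and test 6GB of RAM, 10 passes
  memtester 6G 10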

Try running more than one memory testing process, in case there is a bug
in your hardware that affects interactions between multiple cores (memtest
is single-threaded).  You can run memtest86 inside a kvm (e.g. kvm
-m 3072 -kernel /boot/memtest86.bin) to detect these kinds of issues.
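
A simple way to get the multi-core variant of this is several memtester
processes in parallel, splitting the RAM between them (the sizes below are
just an example for an 8GB machine):

  # four concurrent testers, ~1.5GB each, 5 passes
  for i in 1 2 3 4; do
      memtester 1536M 5 > memtester-$i.log 2>&1 &
  done
  wait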

Kernel compiles are a bad way to test RAM.  I've successfully built
kernels on hosts with known RAM failures.  The kernels don't always work
properly, but it's quite rare to see a build fail outright.
My original suggestion that prompted that part of the comment was to run a
bunch of concurrent kernel builds, each with as many jobs as there are CPUs
(so on a quad-core system, run a dozen or so builds concurrently, each with
make -j4). I only use kernel builds myself because the kernel is a big
project with essentially zero build dependencies; if I had the patience and
space (and a LiveCD with the right tools and packages installed), I'd
probably use something like LibreOffice or Chromium instead. I don't use
this as my sole test (I use multiple other tools as well), but I find it
does a particularly good job of exercising things that memtest doesn't. I
also don't just check that the builds succeed, but that the compiled kernel
images all match, because with bad RAM the resulting images will often
differ in some way (I had forgotten to mention this bit).
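
For anyone who wants to try the same thing, roughly this sketch is what I
mean (the paths and counts are placeholders, and you may need to pin
KBUILD_BUILD_TIMESTAMP / KBUILD_BUILD_USER / KBUILD_BUILD_HOST so the
images come out byte-for-byte comparable):

  # 12 concurrent builds of the same tree, 4 jobs each
  for i in $(seq 1 12); do
      (
          cp -a linux-src build-$i &&
          cd build-$i &&
          make defconfig &&
          KBUILD_BUILD_TIMESTAMP='2016-05-09' \
          KBUILD_BUILD_USER=test KBUILD_BUILD_HOST=test \
              make -j4
      ) > build-$i.log 2>&1 &
  done
  wait

  # on good hardware the images should all hash identically
  md5sum build-*/arch/x86/boot/bzImage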

This practice evolved out of the fact that the only bad RAM I've ever dealt
with either failed to POST outright (which can have all kinds of
interesting symptoms if it's just one module: some motherboards refuse to
boot, some report the error, others just disable the module and act like
nothing happened), or passed every memory testing tool I threw at it
(memtest86, memtest86+, memtester, concurrent memtest86 invocations from
Xen domains, inventive acrobatics with tmpfs and FIO, etc.) but failed
under heavy concurrent random access. That kind of access pattern can be
reliably produced by running a bunch of big software builds at the same
time with the CPU insanely over-committed. I could probably produce a
similar workload with tmpfs and FIO, but it's a lot quicker and easier to
remember how to do a kernel build than it is to remember the complex
incantations needed to get FIO to do anything interesting.
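
For reference, the FIO version of that would be something roughly like the
following (the sizes and job counts are placeholders, not a tuned recipe):

  # put a scratch tmpfs somewhere and hammer it with verified random writes
  mount -t tmpfs -o size=8G tmpfs /mnt/ramtest
  fio --name=ramtest --directory=/mnt/ramtest \
      --rw=randwrite --bs=4k --size=1G --numjobs=8 \
      --verify=crc32c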