On Sunday 8 May 2016 20:27:55 CEST, Patrik Lundquist wrote:
Are you using any power management tweaks?

Yes, as stated in my very first post, I use TLP with SATA_LINKPWR_ON_BAT=max_performance, but I managed to reproduce the bug even without TLP. Also, for the past week I've always been on AC.

On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
Memtest doesn't replicate typical usage patterns very well. My usual testing for RAM involves not just memtest, but also booting into a LiveCD (usually SystemRescueCD), pulling down a copy of the kernel source, and then running as many concurrent kernel builds as cores, each with as many make jobs as cores (so if you've got a quad core CPU (or a dual core with hyperthreading), it would be running 4 builds with -j4 passed to make). GCC seems to have memory usage patterns that reliably trigger memory errors that aren't caught by memtest, so this generally gives good results.

Building the kernel with 4 concurrent threads is not an issue for my system; in fact I compile a lot and have never had any issue.
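
For reference, the kind of stress run you describe boils down to roughly the following; just a sketch, with the source directory, core count and log paths as placeholders:

    # run N kernel builds in parallel, each itself using N make jobs
    # (N=4 stands in for a quad-core machine; linux-src is an unpacked source tree)
    N=4
    for i in $(seq 1 $N); do
        cp -a linux-src "build-$i"
        ( cd "build-$i" && make defconfig && make -j"$N" ) > "build-$i.log" 2>&1 &
    done
    wait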

On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
On a similar note, badblocks doesn't replicate filesystem like access patterns, it just runs sequentially through the entire disk. This isn't as likely to give bad results, but it's still important to know. In particular, try running it over a dmcrypt volume a couple of times (preferably with a different key each time, pulling keys from /dev/urandom works well for this), as that will result in writing different data. For what it's worth, when I'm doing initial testing of new disks, I always use ddrescue to copy /dev/zero over the whole disk, then do it twice through dmcrypt with different keys, copying from the disk to /dev/null after each pass. This gives random data on disk as a starting point (which is good if you're going to use dmcrypt), and usually triggers reallocation of any bad sectors as early as possible.

While trying to find a common denominator for my issue I made lots of backups of /dev/mapper/cryptroot and restored them into /dev/mapper/cryptroot dozens of times (triggering a 150GB+ random data write every time), without any issue (after restoring the backup I always check the partition with btrfs check). So the disk doesn't seem to be the culprit.
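
Each pass of that test amounted to roughly the following, after taking the image with dd in the first place (a sketch; the image path and the pass count are placeholders):

    # write the saved image back over the device and verify it; every pass pushes
    # 150GB+ through dmcrypt, so the raw disk gets rewritten with random-looking
    # ciphertext
    for i in $(seq 1 10); do        # "dozens" of passes in practice
        dd if=/mnt/external/cryptroot.img of=/dev/mapper/cryptroot bs=64M
        btrfs check /dev/mapper/cryptroot || break   # never reported any error for me
    done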

On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
1. If you have an eSATA port, try plugging your hard disk in there and see if things work. If that works but having the hard drive plugged in internally doesn't, then the issue is probably either that specific SATA port (in which case your chip-set is bad and you should get a new system), or the SATA connector itself (or the wiring, but that's not as likely when it's traces on a PCB). Normally I'd suggest just swapping cables and SATA ports, but that's not really possible with a laptop.
2. If you have access to a reasonably large flash drive, or to a USB to SATA adapter, try that as well; if it works on that but not internally (or on an eSATA port), you've probably got a bad SATA controller, and should get a new system.

My laptop doesn't have an eSATA port, and the only external drive I have that is big enough is currently used for daily backups, since I fear losing data.

On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
3. Try things without dmcrypt. Adding extra layers makes it harder to determine what is actually wrong. If it works without dmcrypt, try using different parameters for the encryption (different ciphers is what I would try first). If it works reliably without dmcrypt, then it's either a bug in dmcrypt (which I don't think is very likely), or it's bad interaction between dmcrypt and BTRFS. If it works with some encryption parameters but not others, then that will help narrow down where the issue is.
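
If I end up re-testing with different encryption parameters as suggested, each run would be set up roughly like this (only a sketch; the partition, the mapping name and the cipher choices are placeholders):

    # format the test partition with a given cipher, open it, put btrfs on top
    cryptsetup luksFormat --cipher aes-xts-plain64 --key-size 512 /dev/sdXn
    cryptsetup open /dev/sdXn testcrypt
    mkfs.btrfs /dev/mapper/testcrypt
    # then repeat with e.g. --cipher serpent-xts-plain64 or twofish-xts-plain64
    # to see whether the corruption follows a particular cipher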

On Sunday 8 May 2016 01:35:16 CEST, Chris Murphy wrote:
You're making the troubleshooting unnecessarily difficult by
continuing to use non-default options. *shrug*

Every single layer you add complicates the setup and troubleshooting.
Of course all of it should work together, many people do. But you're
the one having the problem so in order to demonstrate whether this is
a software bug or hardware problem, you need to test it with the most
basic setup possible --> btrfs on plain partitions and default mount
options.

I will try to recap, because you obviously missed my previous e-mail: I managed to replicate the irrecoverable corruption even with default mount options and no dmcrypt at all. It was somewhat harder to replicate with default options, so I started playing with different combinations to see whether anything increased the chances of corruption. I have the feeling that "autodefrag" makes corruption more likely, but I'm not 100% sure about it.

Anyway, triggering a reinstall of all explicitly installed packages with "pacaur -S $(pacman -Qe)" gives a high chance of irrecoverable corruption. That command simply extracts the package tarballs from the cache and overwrites the already installed files. It doesn't write a lot of data (after reinstallation my system is still quite small, just a few GBs), but it seems to be enough to displease the filesystem.
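
Concretely, the reproduction amounts to no more than this (a sketch; the fstab line is only there to show where autodefrag comes in):

    # the mount option I suspect raises the odds (root line in /etc/fstab):
    # /dev/mapper/cryptroot  /  btrfs  defaults,autodefrag  0  0

    # the trigger: reinstall every explicitly installed package over itself
    pacaur -S $(pacman -Qe)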

To avoid losing my data, every time I power on or reboot my laptop I first boot into an external drive and run btrfs check on /dev/mapper/cryptroot; if it's still sane I back up /dev/mapper/cryptroot to an external SSD with dd, otherwise I restore the previous copy from the SSD into /dev/mapper/cryptroot. I can't keep up such an annoying workflow for long, so I really hope someone manages to track the bug down soon.
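
In script form, what I do from the external system at each boot is roughly this (a sketch; the image path on the SSD is a placeholder):

    # with the LUKS volume opened but not mounted
    if btrfs check /dev/mapper/cryptroot; then
        # still sane: refresh the known-good image on the external SSD
        dd if=/dev/mapper/cryptroot of=/mnt/ssd/cryptroot.img bs=64M
    else
        # corrupted again: roll the device back to the last known-good image
        dd if=/mnt/ssd/cryptroot.img of=/dev/mapper/cryptroot bs=64M
    fi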

Thanks for your help, I really appreciate it.
Niccolò