On Tue, Sep 3, 2019 at 2:20 PM Edmund Urbani <edmund.urb...@liland.com> wrote:
>
> Hi all,
>
> two days ago my btrfs filesystem became quite slow and the logs showed a
> lot of I/O errors on one of the HDDs. I ordered a replacement drive and
> tried to remove the failing drive from the filesystem (btrfs device
> remove). That removal command did not finish but just sat there without
> any output.

What exact commands? I expect 'btrfs device del missing' causes
reconstruction from parity as well as a balance to create the new
9 device stripe width (well, 7 data + 2 parity). This is not an
inherently bad thing to do; it should work and should be COW. And
there's still one extra copy available, so in case of an unrecoverable
read error it can still do additional reconstruction. Because it's a
balance, though, it might be really, really slow, and I don't think
there is a way to cancel device removal. I don't think it's possible
to cancel it with btrfs balance stop.

How many subvolumes and snapshots? Are quotas enabled?

> Today the new drive arrived. Device removal still had not finished, but
> the filesystem had entered read-only mode last night.

Likely a pre-existing problem was discovered during the balance, or a
bug was triggered, or both, and the file system went read-only to avoid
further corruption. Do you have kernel messages for the entire time,
starting at 'device delete' until the file system goes read-only?

> Linux phoenix 4.14.78-gentoo #1 SMP Mon Dec 3 09:25:24 CET 2018 x86_64

Kernel 4.14.141 is the current LTS version for that series, and there
are hundreds of bug fix insertions/removals between just those two
versions:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/?id=v4.14.141&id2=v4.14.78&dt=2

Between kernel 4.14.141 and 5.2.11, there are thousands of changes just
in Btrfs... thousands:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/?id=v5.2.11&id2=v4.14.141&dt=2

And quite a few in raid56.c, which isn't that big to begin with, but
there are a lot of simplifications and improvements from what I can
tell:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/raid56.c?id=v5.2.11&id2=v4.14.141

Anyway, it's worth trying to mount with 5.2.11 using '-o ro,degraded'
to at least see if it will mount. But it gives you some idea why
there's a strong bias toward using newer kernels.
It's too hard to remember all the changes, even for developers.

> AMD Opteron(tm) Processor 6174 AuthenticAMD GNU/Linux
>
> *****
> btrfs --version
>
> btrfs-progs v4.19

This is OK, but the change log will show lots of bug fixes here too. I
wouldn't make changes (no repair attempts at all, including chunk
recover or --repair) until you get some dev advice about the next step.

> [ 8904.358084] BTRFS info (device sda1): turning on discard

Unexpected.

> [ 8904.358088] BTRFS info (device sda1): allowing degraded mounts
> [ 8904.358089] BTRFS info (device sda1): disk space caching is enabled
> [ 8904.358091] BTRFS info (device sda1): has skinny extents
> [ 8904.361743] BTRFS warning (device sda1): devid 8 uuid
> 0e8b4aff-6d64-4d31-a135-705421928f94 is missing
> [ 8905.705036] BTRFS info (device sda1): bdev (null) errs: wr 0, rd
> 14809, flush 0, corrupt 4, gen 0
> [ 8905.705041] BTRFS info (device sda1): bdev /dev/sda1 errs: wr 0, rd
> 4, flush 0, corrupt 0, gen 0
> [ 8905.705052] BTRFS info (device sda1): bdev /dev/sdf1 errs: wr 0, rd
> 10543, flush 0, corrupt 0, gen 0
> [ 8905.705062] BTRFS info (device sda1): bdev /dev/sdc1 errs: wr 0, rd
> 8, flush 0, corrupt 0, gen 0

Four devices with read errors. When was the last time the volume was
scrubbed? Do you know for sure these errors have not gone up at all
since the last successful scrub? And were any errors reported for that
last scrub?

> I have tried all the mount / restore options listed here:
> https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/?tab=comments#comment-543490

Good. Stick with ro attempts for now, including if you want to try a
newer kernel. If it mounts ro successfully, my advice is to update
backups so at least critical information isn't lost. Back up while you
can. Any repair attempt makes changes that risk permanent data loss, so
it's important to be really deliberate about any changes.

> ... and all I keep getting is "bad tree block" errors.
> Superblocks seem fine (btrfs rescue super-reecover found no problem).
> I am considering trying "btrfs rescue chunk-recover" at this point.
>
> Could this help in my situation? What do you think?

I'm not sure if chunk recover can work on degraded volumes. Your best
bet is to not make any further changes to the volume itself. Preserve
all logs.

--
Chris Murphy
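
The question above about whether the error counters have gone up since
the last scrub is easier to answer if the "errs:" lines from each mount
are tallied per device and kept alongside the preserved logs. What
follows is only a minimal sketch of that idea, not something from the
original thread: the function names are hypothetical, and it assumes
the counter lines have the exact layout shown in the quoted dmesg
output, with the mail line wrapping undone. It only reads text and
touches nothing on the volume.

```python
import re

# Matches the per-device counter lines BTRFS prints at mount time, e.g.
# "bdev /dev/sdf1 errs: wr 0, rd 10543, flush 0, corrupt 0, gen 0".
# Assumption: the layout is exactly as in the quoted log.
ERRS_RE = re.compile(
    r"bdev (?P<dev>.+?) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_error_counters(log_text):
    """Return {device: {counter: value}} for every 'errs:' line found."""
    counters = {}
    for match in ERRS_RE.finditer(log_text):
        fields = match.groupdict()
        dev = fields.pop("dev")
        counters[dev] = {name: int(value) for name, value in fields.items()}
    return counters


def devices_with_errors(counters):
    """Return the sorted devices that have at least one non-zero counter."""
    return sorted(dev for dev, c in counters.items() if any(c.values()))


# Counter lines copied from the quoted log, joined back onto single lines.
SAMPLE_LOG = """\
[ 8905.705036] BTRFS info (device sda1): bdev (null) errs: wr 0, rd 14809, flush 0, corrupt 4, gen 0
[ 8905.705041] BTRFS info (device sda1): bdev /dev/sda1 errs: wr 0, rd 4, flush 0, corrupt 0, gen 0
[ 8905.705052] BTRFS info (device sda1): bdev /dev/sdf1 errs: wr 0, rd 10543, flush 0, corrupt 0, gen 0
[ 8905.705062] BTRFS info (device sda1): bdev /dev/sdc1 errs: wr 0, rd 8, flush 0, corrupt 0, gen 0
"""

if __name__ == "__main__":
    counters = parse_error_counters(SAMPLE_LOG)
    for dev in devices_with_errors(counters):
        print(dev, counters[dev])
```

Running the same parser over the kernel log saved from a later ro mount
and comparing the two dictionaries would show whether any counter has
moved in between.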