On 2014-09-16 16:57, Chris Murphy wrote:
>
> On Sep 16, 2014, at 8:40 AM, Austin S Hemmelgarn <ahferro...@gmail.com> wrote:
>
>> Based on the kernel messages, the primary issue is log corruption, and
>> in theory btrfs-zero-log should fix it.
>
> Can you provide a complete dmesg somewhere for this initial failure, just for
> reference? I'm curious what this indication looks like compared to other
> problems.
>
Okay, I can't really get a 'complete' dmesg, because the system panics on the mount failure (the filesystem in question is the system's root filesystem), the system has no serial ports, and I didn't think to build in support for console on ttyUSB0. I can however get what the recovery environment (locally compiled based on buildroot) shows when I try to mount the filesystem:
[ 30.871036] BTRFS: device label gentoo devid 1 transid 160615 /dev/sda3
[ 30.875225] BTRFS info (device sda3): disk space caching is enabled
[ 30.917091] BTRFS: detected SSD devices, enabling SSD mode
[ 30.920536] BTRFS: bad tree block start 0 130402254848
[ 30.924018] BTRFS: bad tree block start 0 130402254848
[ 30.926234] BTRFS: failed to read log tree
[ 30.953055] BTRFS: open_ctree failed
>> The actual issue however, is
>> that the primary superblock appears to be pointing at a corrupted root
>> tree, which causes pretty much everything that does anything other than
>> just read the sb to fail. The first backup sb does point to a good
>> tree, but only btrfs check and btrfs restore have any option to ignore
>> the first sb and use one of the backups instead.
>
> Maybe use wipefs -a on this volume, which removes the magic from only the
> first superblock by default (you can specify another location). And then try
> btrfs-show-super -F which "dumps" supers with bad magic.
>
Thanks for the suggestion, I hadn't thought of that...
> I just tried this:
> # wipefs -a /dev/sdb
> /dev/sdb: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48 52 66 53 5f 4d
> # btrfs-show-super -F /dev/sdb
> superblock: bytenr=65536, device=/dev/sdb
> ---------------------------------------------------------
> csum 0x5c1196d7 [DON'T MATCH]
> bytenr 65536
> flags 0x1
> magic ........ [DON'T MATCH]
> […]
> # btrfs-show-super -i1 /dev/sdb
> superblock: bytenr=67108864, device=/dev/sdb
> ---------------------------------------------------------
> csum 0xfc70be19 [match]
> bytenr 67108864
> flags 0x1
> magic _BHRfS_M [match]
>
> So the mirror is definitely there and valid.
> # btrfs rescue super-recover -yv /dev/sdb
> No valid Btrfs found on /dev/sdb
> Usage or syntax errors
>
> Not expected at all, man page says "Recover bad superblocks from good
> copies." There's a good copy, it's not being found by btrfs rescue
> super-recover. Seems like a bug.
>
>
> # btrfs check /dev/sdb
> No valid Btrfs found on /dev/sdb
> Couldn't open file system
>
> # btrfs check -s1 /dev/sdb
> using SB copy 1, bytenr 67108864
> Checking filesystem on /dev/sdb
> UUID: 9acf13de-5b98-4f28-9992-533e4a99d348
> [snip]
> OK it finds it, maybe a --repair will fix the bad first one?
> # btrfs check -s1 /dev/sdb
> using SB copy 1, bytenr 67108864
> enabling repair mode
> Checking filesystem on /dev/sdb
> UUID: 9acf13de-5b98-4f28-9992-533e4a99d348
> [snip]
> No indication of repair
> # btrfs check /dev/sdb
> No valid Btrfs found on /dev/sdb
> Couldn't open file system
> # btrfs check /dev/sdb
> No valid Btrfs found on /dev/sdb
> Couldn't open file system
> [root@f21v ~]# btrfs-show-super -F /dev/sdb
> superblock: bytenr=65536, device=/dev/sdb
> ---------------------------------------------------------
> csum 0x5c1196d7 [DON'T MATCH]
> bytenr 65536
> flags 0x1
> magic ........ [DON'T MATCH]
>
>
> Still not fixed.
> Maybe I needed to corrupt something else in the superblock
> other than the magic and this behavior is intentional, otherwise wipefs -a,
> followed by btrfsck would resurrect an intentionally wiped btrfs fs,
> potentially wiping out some newer file system in the process.
>
...though maybe it's a good thing I didn't.
>
>
>> I'm fine using dd to replace the primary sb with one of the
>> backups, but don't know the exact parameters that would be needed.
>
> Here's an idea:
>
> # btrfs-show-super /dev/sdb
> superblock: bytenr=65536, device=/dev/sdb
> ---------------------------------------------------------
> csum 0x92aa51ab [match]
> [snip]
> So I know what I'm looking for starts at LBA 65536/512
>
> # dd if=/dev/sdb skip=128 count=4 2>/dev/null | hexdump -C
> 00000000  92 aa 51 ab 00 00 00 00  00 00 00 00 00 00 00 00  |..Q.............|
> [snip]
>
> And as it turns out the csum is right at the beginning, 4 bytes. So use bs of
> 4 bytes, seek 65536/4, count of 1. This should zero just 4 bytes starting at
> 65536 bytes in.
>
> # dd if=/dev/zero of=/dev/sdb bs=4 seek=16384 count=1
>
> Checked it with the earlier skip=128 command and it looks like everything
> else is intact.
>
> # btrfs-show-super -F /dev/sdb
> superblock: bytenr=65536, device=/dev/sdb
> ---------------------------------------------------------
> csum 0x00000000 [DON'T MATCH]
> bytenr 65536
> flags 0x1
> magic _BHRfS_M [match]
> [snip]
> OK so the csum is bad, the magic is good. Now see if btrfs rescue
> super-recover does anything
> # btrfs rescue super-recover /dev/sdb
> Make sure this is a btrfs disk otherwise the tool will destroy other fs, Are
> you sure?
> [y/N]: Y
> Recovered bad superblocks successful
> *** Error in `btrfs': corrupted double-linked list: 0x0000000002289e40 ***
> ======= Backtrace: =========
> /lib64/libc.so.6(+0x7a77e)[0x7f388663977e]
> /lib64/libc.so.6(+0x80b03)[0x7f388663fb03]
> /lib64/libc.so.6(+0x81c88)[0x7f3886640c88]
> /lib64/libc.so.6(cfree+0x4c)[0x7f38866456ec]
> btrfs[0x425ec6]
> btrfs[0x406902]
> /lib64/libc.so.6(__libc_start_main+0xf0)[0x7f38865df0e0]
> btrfs[0x406a04]
> ======= Memory m
> [snip]
>
> kaboom!
>
> But was it really successful?
> # btrfs-show-super -F /dev/sdb
> superblock: bytenr=65536, device=/dev/sdb
> ---------------------------------------------------------
> csum 0x92aa51ab [match]
> [skip]
> Looks fixed. And it mounts.
>
> NOW, I didn't actually have my first superblock pointing to a corrupt root
> tree. So it's possible that while the csum was fixed in my case, that the
> subsequent crash has not properly copied all good parts of superblock1 to
> superblock0. *shrug*
>
> And since it crashes, looks like I found a bug.
>
>> I'm using btrfs-progs 3.16 and
>> kernel 3.16.1.
>
> So did I for all of the above.
>
>
Since posting this, I realized that the recovery environment I'm working from is actually btrfs-progs 3.14.1 and kernel 3.14.5; I need to make a point of updating that once I get the system working again.
I've also discovered, when trying to use btrfs restore to copy the data out to a different system, that restore from 3.14.1 apparently chokes on filesystems that have lzo compression turned on. It reports errors trying to inflate compressed files, and I know for a fact that none of those files were even open, let alone being written to, when the system crashed. I don't know if this is a known bug, or whether it is still the case with btrfs-progs 3.16, but I figured I'd mention it because I haven't seen anything about it anywhere. Interestingly, I also didn't get the crash you saw above with btrfs rescue super-recover, so that might be a regression in btrfs-progs 3.16.

Thanks for all the help.
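
For reference, the offset arithmetic behind the dd commands quoted above can be checked without touching a real device. What follows is only a rough sketch on a scratch file: the temporary image, the 0xff fill pattern, and the variable names are assumptions made for illustration and do not come from the thread. It reproduces the three numbers that appear in the quoted output (65536/512 = 128 for skip, 65536/4 = 16384 for seek, and 65536 + 0x40 = 0x10040 for the magic that wipefs erased), then zeroes four bytes at the position where the primary superblock's csum would start.

```shell
set -eu

# Offset of the primary superblock, as printed by btrfs-show-super above.
SB_OFFSET=65536
# wipefs reported the magic at 0x00010040, i.e. 0x40 bytes into the superblock.
MAGIC_OFFSET=$((SB_OFFSET + 0x40))

# Hypothetical scratch image; this is NOT a btrfs filesystem, just filler bytes.
img=$(mktemp)
trap 'rm -f "$img"' EXIT

# 128 KiB of 0xff, so that any zeroed bytes stand out when read back.
dd if=/dev/zero bs=1024 count=128 2>/dev/null | LC_ALL=C tr '\000' '\377' > "$img"

# Same parameters as the quoted command: bs=4, seek=65536/4, count=1.
# conv=notrunc is added here because the scratch target is a regular file,
# which dd would otherwise truncate at the seek offset.
seek=$((SB_OFFSET / 4))
dd if=/dev/zero of="$img" bs=4 seek="$seek" count=1 conv=notrunc 2>/dev/null

# Read back with 512-byte blocks, as in the quoted skip=128 command,
# and keep the first 8 bytes as hex.
skip=$((SB_OFFSET / 512))
first8=$(dd if="$img" bs=512 skip="$skip" count=1 2>/dev/null \
  | od -An -tx1 -N8 | tr -d ' \n')

printf 'seek=%s skip=%s magic_offset=0x%x first8=%s\n' \
  "$seek" "$skip" "$MAGIC_OFFSET" "$first8"
```

Only the first four bytes read back at that offset should be zero, with the filler pattern following them, which matches the "everything else is intact" check in the quoted text. None of this says anything about whether a repaired superblock is consistent; it only confirms where the bytes land.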