Hi btrfs devs,

I recently updated Linux (4.15.x) and rebooted a machine with a 12x4TB-disk btrfs volume, and it hung on boot. I did some initial troubleshooting and eventually saw in `btrfs dev stats` that one disk had a ton of errors. I settled on a theory that either the disk or the SAS backplane got wedged in a state where operations on that disk failed; since power cycling the system, the disk has behaved perfectly and smartctl doesn't see any evidence of trouble.
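(For reference, this is roughly what I was looking at; /mnt/array and /dev/sdl below are placeholders for the real mount point and the suspect disk:)

  # per-device error counters for the filesystem
  btrfs device stats /mnt/array

  # the counters persist across reboots; -z would zero them once the hardware is sorted out
  btrfs device stats -z /mnt/array

  # SMART health and attribute dump for the suspect disk
  smartctl -a /dev/sdl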
When I tried to boot again, there was a flurry of messages like "csum mismatch on free space cache" and "parent transid verify failed on 64508387606528 wanted 1555425 found 1548963", culminating in a panic and hang whose middle looks like this (unfortunately I couldn't scroll up in the console due to the hang, so I only have the end of it, and only as a picture):

  bvec_alloc+0x86/0xe0
  bio_alloc_bioset+0x132/0x1e0
  btrfs_bio_alloc+0x23/0x90 [btrfs]
  submit_extent_page+0x191/0x250 [btrfs]
  ? btrfs_create_repair_bio+0xf0/0xf0 [btrfs]
  ...

I ran a scrub, which finished successfully like this:

  scrub started at Wed Mar 21 18:33:14 2018 and finished after 05:07:53
  total bytes scrubbed: 27.05TiB with 1463058 errors
  error details: verify=12491 csum=1450567
  corrected errors: 1463058, uncorrectable errors: 0, unverified errors: 0

…which looks super promising. However, the machine still panicked on boot, or if I tried to mount the filesystem from a livecd and perform more than a few operations on it. I did some research and ended up going through roughly this set of recovery attempts (sketched as commands in the P.S. below):

1. Attempt to clear the space_cache.
2. btrfs-zero-log (maybe; I actually can't remember whether I did this, and I now see https://btrfs.wiki.kernel.org/index.php/Btrfs-zero-log asking people *not* to run it if the FS mounts).
3. `btrfs check --init-extent-tree`

I realize that, in reality, the right set of steps more likely involved "btrfs-image" and "send an email to this mailing list". Unfortunately I exist in a state of mild impatience and high optimism, and as a result I didn't. My loss :/

The current state of things is that `btrfs check --init-extent-tree` has been running for 119 hours and has generated about 50MB of log output that looks like this:

  ref mismatch on [51614823280640 4096] extent item 0, found 1
  data backref 51614823280640 parent 57412861165568 owner 0 offset 0 num_refs 0 not found in extent tree
  incorrect local backref count on 51614823280640 parent 57412861165568 owner 0 offset 0 found 1 wanted 0 back 0x5652a69bde00
  backpointer mismatch on [51614823280640 4096]
  adding new data backref on 51614823280640 parent 57412861165568 owner 0 offset 0 found 1
  Repaired extent references for 51614823280640

…which seems scary. I've made the KVM recordings and screenshots I have, along with the full `btrfs check --init-extent-tree` output (only near the end did I get the system set up with a proper livecd and network access), available here in case anyone wants to look over my shoulder:

https://ipfs.io/ipfs/QmXTgYgA4fQs4BSM8GFXrRdufF39ZVxMTV77z8ANjmZvzk

At this point, I'm genuinely not sure whether `init-extent-tree` is moving in a useful direction (i.e. whether it will ever finish), whether stopping and trying something else would be better, or whether it's time to salvage some `btrfs-image`s if they're useful and then start restoring from offsite backups. I'd appreciate any guidance, but I realize that my own troubleshooting so far was *far* from ideal and would accept "It's time to restore from backups, but please do X next time instead of all that". Happy to answer any other questions, too.

Sidney
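P.S. For anyone reading along, here is roughly what the recovery attempts above look like as commands. This is a sketch from memory: /dev/sdX and /mnt/array are placeholders rather than my real device and mount point, and the exact invocations I used may have differed slightly.

  # 1. force the free space cache to be rebuilt on the next mount
  mount -o clear_cache /dev/sdX /mnt/array

  # 2. zero the log tree (the wiki says only to do this if the FS won't mount)
  btrfs-zero-log /dev/sdX

  # 3. rebuild the extent tree (the step that is still running)
  btrfs check --init-extent-tree /dev/sdX

And the step I should have taken before any of that:

  # capture a compressed metadata image for the developers before attempting repairs
  btrfs-image -c9 -t4 /dev/sdX /some/other/disk/metadata.img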