In April 2014, I reported a btrfs corruption on the linux-btrfs
mailing list (http://www.spinics.net/lists/linux-btrfs/msg33318.html).
Eight months later, I am happy to say that I was able to recover
the data with a combination of persistence and luck. I want to share
some of my insights with this list in the hope that they may be useful
in future cases.

I also did some work on the btrfs tools to be able to better
understand what was wrong; I will submit the additions and changes I
made for review later.

1. The history

I had created this file system in late 2012 when I installed OpenSUSE
12.2 on a friend's laptop. "btrfs was still unstable at that time", I
can hear you say.  That's easy to say in hindsight. OpenSUSE's
installer offered btrfs as a tier-1 choice, as far as I
remember. Articles written at the time (e.g.
http://rainbowtux.blogspot.de/2012/09/to-btrfs-or-not-to-btrfs.html)
suggest that I wasn't the only person considering it worth a serious
try. Today I wish I hadn't incautiously put my friend's /home on that
FS, too - I've certainly paid for that carelessness. So, /home was
subvolume 263 in this file system.  Complicating matters further, I
had created encrypted home file systems using ecryptfs on top of
btrfs.

2. The disaster

It all went well until April 14, 2014. On that day, the laptop
suddenly crashed.  OpenSUSE Kernel 3.4.11-2.16 was running at the time
of the crash.  Subsequent reboot attempts failed. I described the
phenomena in my posting to linux-btrfs, desperately hoping someone
would give me an easy recipe for recovery. It didn't happen. I got the
recommendation to use a newer version of the kernel and btrfs tools,
but they didn't get me any further. Whatever tool I tried, /home
appeared to be completely empty. I had to dig deeper.

3. The quest

After quite some time, I found the decisive hint while looking at the
root of the /home subvolume, which was a level 2 node:

# ./btrfs-debug-tree -b 980717568 /dev/XX
node 980717568 level 2 items 78 free 43 generation 39637 owner 263
   key (256 INODE_ITEM 0) block 1012207616 (247121) gen 35754

Looking at the supposed level-1 child node at 1012207616, I found that
it contained data with the wrong level (0), owner (2 - the extent
tree), and generation:

leaf 1012207616 items 26 free space 1967 generation 39622 owner 2
   item 0 key (8266870784 EXTENT_ITEM 12288) itemoff 3942 itemsize 53

So, the tree was massively corrupted at this crucial point; the top
inode of the subvolume couldn't be found, explaining why /home had
appeared empty on every recovery attempt. I looked at the other
children of the children of the tree root, and was pleasantly
surprised that these didn't look bad; I saw inodes and directory
entries of ecryptfs-encrypted home directories, as I had expected.
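
For readers less familiar with the on-disk format: every tree block
starts with a header that records, among other things, its generation,
its owner (the id of the tree it belongs to), and its level, and the
parent node stores the expected generation next to each child pointer.
A broken link violates one or more of these expectations. The following
is only a rough Python sketch of that check (the header offsets are
those of struct btrfs_header as I understand the format; the helper
names are mine, not btrfs-progs functions):

import struct

# Offsets within struct btrfs_header, the first 101 bytes of every tree
# block (as I understand the on-disk format):
#   48  bytenr     (u64)  logical address the block believes it lives at
#   80  generation (u64)
#   88  owner      (u64)  id of the tree the block belongs to
#   96  nritems    (u32)
#  100  level      (u8)   0 for leaves
def parse_header(block):
    bytenr, = struct.unpack_from("<Q", block, 48)
    generation, owner = struct.unpack_from("<QQ", block, 80)
    nritems, = struct.unpack_from("<I", block, 96)
    return dict(bytenr=bytenr, generation=generation, owner=owner,
                nritems=nritems, level=block[100])

def check_link(child_block, parent_level, tree_id, ptr_generation):
    # a healthy child sits one level below its parent, belongs to the
    # same tree, and carries exactly the generation recorded in the
    # parent's key pointer
    hdr = parse_header(child_block)
    problems = []
    if hdr["level"] != parent_level - 1:
        problems.append("level %d, expected %d"
                        % (hdr["level"], parent_level - 1))
    if hdr["owner"] != tree_id:
        problems.append("owner %d, expected %d" % (hdr["owner"], tree_id))
    if hdr["generation"] != ptr_generation:
        problems.append("generation %d, expected %d"
                        % (hdr["generation"], ptr_generation))
    return problems

In the dump above, the block at 1012207616 fails all three tests: level
0 instead of 1, owner 2 instead of 263, and generation 39622 instead of
the 35754 recorded in the parent's pointer.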

The obvious next thing to try was to look for previous generations of
the root of the /home subvolume, hoping they weren't corrupted. I
started with the super block root backups, with no luck. Later I went
back all the way from generation 39637 to 38081 (the oldest copy of
this root node I could find), but every copy was just as corrupted as
the latest one - they all pointed to the same wrong level-1 block
1012207616.

I began to wonder whether the all-important level-1 and leaf metadata
of this part of the file system had survived anywhere at all. I
hacked together a tool to search for a specific btrfs key in all of
the metadata, and used it to search for the key 256-1-0 of
subvolume 263 (the first inode of the /home file system).  Luckily, I
found exactly one copy of a leaf containing this key, and a handful of
level 1 nodes referring to it.
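
For illustration, here is a rough sketch of such a search in Python. It
is not the tool I actually used (that one was hacked on top of the
btrfs-progs C sources); the 4 KiB node size, the fsid filter, and all
names are assumptions made for the sketch. It scans an image in
node-sized steps, parses each candidate block's header, and prints
blocks whose items or key pointers contain the wanted key:

import struct, sys

NODESIZE = 4096        # assumption: the old default node/leaf size of this FS
HEADER_SIZE = 101      # sizeof(struct btrfs_header)
KEY_FMT = "<QBQ"       # struct btrfs_disk_key: objectid, type, offset

def blocks(path):
    # yield (file offset, raw block) for every node-sized block in the image
    with open(path, "rb") as f:
        pos = 0
        while True:
            block = f.read(NODESIZE)
            if len(block) < NODESIZE:
                return
            yield pos, block
            pos += NODESIZE

def keys_in_block(block):
    # yield every btrfs_disk_key stored in a (presumed) leaf or node block
    nritems, = struct.unpack_from("<I", block, 96)
    level = block[100]
    # leaves hold struct btrfs_item (key + offset + size = 25 bytes),
    # interior nodes hold struct btrfs_key_ptr (key + blockptr + gen = 33 bytes)
    item_size = 25 if level == 0 else 33
    for i in range(nritems):
        off = HEADER_SIZE + i * item_size
        if off + 17 > len(block):
            break                      # nritems was garbage; not a tree block
        yield struct.unpack_from(KEY_FMT, block, off)

def search(path, fsid, wanted):
    # a real tool would also verify the block checksum and the bytenr field
    # to weed out false positives; this sketch only filters on the fsid
    for pos, block in blocks(path):
        if block[32:48] != fsid:
            continue
        if wanted in keys_in_block(block):
            bytenr, = struct.unpack_from("<Q", block, 48)
            gen, owner = struct.unpack_from("<QQ", block, 80)
            print("physical %d (logical %d): level %d owner %d generation %d"
                  % (pos, bytenr, block[100], owner, gen))

if __name__ == "__main__":
    # usage: keysearch.py <image> <fsid-as-hex>; then look for owner 263
    search(sys.argv[1], bytes.fromhex(sys.argv[2]), (256, 1, 0))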

At this point I didn't yet dare to even think of repairing the file
system.  Rather, I took further debugging steps. One strange thing I
noticed was that among the 603 top (level 2) copies of /home's root
node, several instances shared the same generation number:

node 1037123584 level 2 items 78 free 43 generation 39636 owner 263
node 1041215488 level 2 items 78 free 43 generation 39636 owner 263
node 980566016 level 2 items 78 free 43 generation 39636 owner 263
node 980717568 level 2 items 78 free 43 generation 39637 owner 263

Looking at the details of these blocks, I found that the various
level-2, gen-39636, owner-263 nodes actually differed from each other.
I have no idea whether this can happen under any circumstances, but it
gave me another hint
towards the final solution. Out of the generation 39636 roots listed
above, only the last one showed the original corruption I described -
the others had reasonable data in slot 1. My first hope that
these root copies might actually be healthy was quickly destroyed - a
tree dump showed other errors. But, and this was key, these other
corruptions were at different points of the tree. Taking the three
gen-39636 roots together, I was able to find sane data for every part
of the tree.  I was lucky in that the total number of corruptions I
needed to fix turned out to be low enough to handle by hand.

4. The recovery

So I came up with a plan to fix the problem: for each broken link in
one tree, identify a healthy substitute in another one and fix the
link manually. For that purpose, I hacked together another tool
allowing me to do low-level editing of btrfs metadata and insert a
correct checksum at writeback. I verified manually that the metadata
items in the leaves remained well-ordered with the changes I had in
mind.
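
To make "low-level editing with a correct checksum at writeback" more
concrete, here is a rough Python sketch of the kind of edit involved:
overwrite one key pointer slot of an interior node (block pointer and
expected generation) and regenerate the header checksum, which - as far
as I understand the format - is the CRC-32C of everything after the
32-byte checksum field, stored in the first four bytes of the block.
The function names are mine; my actual tool was built on the
btrfs-progs code, and something like this should only ever be run on a
copy of the image:

import struct

NODESIZE = 4096        # assumption: node size of this file system
HEADER_SIZE = 101      # sizeof(struct btrfs_header)
KEY_PTR_SIZE = 33      # struct btrfs_key_ptr: disk_key (17) + blockptr (8) + generation (8)

def crc32c(data, crc=0xFFFFFFFF):
    # bitwise CRC-32C (Castagnoli), the checksum btrfs uses for metadata
    for b in data:
        crc ^= b
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0x82F63B78
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF

def patch_key_ptr(image, block_offset, slot, new_blockptr, new_generation):
    # block_offset is the physical offset of the node inside the image;
    # mapping btrfs logical addresses to physical offsets goes through the
    # chunk tree, and with DUP metadata there are two copies to keep in sync
    with open(image, "r+b") as f:
        f.seek(block_offset)
        block = bytearray(f.read(NODESIZE))

        # leave the 17-byte disk_key of the slot alone, replace only the
        # block pointer (a logical address) and the expected child generation
        ptr_off = HEADER_SIZE + slot * KEY_PTR_SIZE
        struct.pack_into("<QQ", block, ptr_off + 17,
                         new_blockptr, new_generation)

        # the header checksum covers everything after the 32-byte csum field
        struct.pack_into("<I", block, 0, crc32c(bytes(block[32:])))

        f.seek(block_offset)
        f.write(block)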

Eventually, I just needed to fix broken links at three points in the
tree.  I crossed my fingers and ran "btrfs restore" on the hacked tree
- and it extracted the complete /home tree. After that, I still needed
to mess around with the ecryptfs tools to recover the passphrase and
make the plaintext data visible again. It certainly felt good when that
finally succeeded!

The encrypted home directories had been a burden in the first place,
because they impeded every debug technique based on searching for
known data. At this stage, they were a big benefit - I could be fairly
sure that there was no remaining "hidden" data corruption from any FS
problems I might have missed, because any such corruption would have
made files undecryptable.

5. Can this be generalized?

The repair technique I used could be generalized for a file
system repair tool. If a corrupted link is found in the tree
(unexpected level/owner/generation), look for a suitable candidate to
substitute for the broken link. Try first by walking earlier
generations of the broken tree. If this fails, do a brute-force search
through all metadata. If candidate nodes are found, run a sanity
check (make sure all keys in the leaves after the repair are still
well ordered), and then pick the best (latest) node for which the
sanity check succeeded. I am leaving the implementation of this
technique to other interested parties.
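
The sanity check at the end of this recipe can be fairly simple. Here
is a sketch, with keys represented as (objectid, type, offset) tuples
(their natural tuple ordering matches the btrfs key order, since all
components are unsigned); the function is hypothetical, not part of any
existing tool:

def candidate_fits(parent_keys, slot, child_keys):
    # Check that splicing a candidate child block into `slot` of a parent
    # node keeps the tree well-ordered.
    # keys inside the candidate must be strictly increasing
    if not child_keys or any(a >= b for a, b in zip(child_keys, child_keys[1:])):
        return False
    # in a healthy tree, the parent's key for a slot equals the child's first key
    if child_keys[0] != parent_keys[slot]:
        return False
    # everything in the child must stay below the key of the next slot
    if slot + 1 < len(parent_keys) and child_keys[-1] >= parent_keys[slot + 1]:
        return False
    return True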

6. Further remarks

After having recovered my friend's data, my motivation for further
debugging decreased. However, I did some more research. As noted
above, the metadata contained several copies of the root of subvol
263 with the same generation.  The same applies to other trees as
well, in particular the root tree and the extent tree. In fact, two
distinct instances of both trees seem to have existed since at least
generation 39610 (the last was 39637).

The two extent trees had different ideas of which blocks were in use,
and of where metadata was stored (for example, block 1012207616
mentioned above was listed as a level-1 block of subvol 263 in one of
the extent trees, and as part of the extent tree itself in the
other). Both instances had coexisted through at least 27 generations
(when this mess actually started is hard to tell). Clearly, this could
easily lead to metadata corruption. I can't be sure that this was
actually the root cause, though - some other corruption may have
caused it in the first place.

I have uploaded a sparse file with the system and metadata chunks of
the file system to Dropbox
(https://www.dropbox.com/sh/utv8b3qd0do6a04/zTwGQCrN9x; file
img-metadat-sparse.tar.gz; unpack with tar xfzS), just in case anyone
with more btrfs insight than myself wants to take a closer look.

Regards and thanks for reading this far,
Martin