----- Original Message -----
> From: "Qu Wenruo" <quwenruo.bt...@gmx.com>
> To: "STEVE LEUNG" <sjle...@shaw.ca>, linux-btrfs@vger.kernel.org
> Sent: Sunday, February 10, 2019 6:52:23 AM
> Subject: Re: corruption with multi-device btrfs + single bcache, won't mount
> On 2019/2/10 2:56 PM, STEVE LEUNG wrote:
>> Hi all,
>>
>> I decided to try something a bit crazy, and try multi-device raid1 btrfs on
>> top of dm-crypt and bcache. That is:
>>
>> btrfs -> dm-crypt -> bcache -> physical disks
>>
>> I have a single cache device in front of 4 disks. Maybe this wasn't
>> that good of an idea, because the filesystem went read-only a few
>> days after setting it up, and now it won't mount. I'd been running
>> btrfs on top of 4 dm-crypt-ed disks for some time without any
>> problems, and only added bcache (taking one device out at a time,
>> converting it over, adding it back) recently.
>>
>> This was on Arch Linux x86-64, kernel 4.20.1.
>>
>> dmesg from a mount attempt (using -o
>> usebackuproot,nospace_cache,clear_cache):
>>
>> [ 267.355024] BTRFS info (device dm-5): trying to use backup root at
>> mount time
>> [ 267.355027] BTRFS info (device dm-5): force clearing of disk cache
>> [ 267.355030] BTRFS info (device dm-5): disabling disk space caching
>> [ 267.355032] BTRFS info (device dm-5): has skinny extents
>> [ 271.446808] BTRFS error (device dm-5): parent transid verify failed on
>> 13069706166272 wanted 4196588 found 4196585
>> [ 271.447485] BTRFS error (device dm-5): parent transid verify failed on
>> 13069706166272 wanted 4196588 found 4196585
>
> When this happens, there is no good way to completely recover the fs
> (that is, to have btrfs check pass after the recovery).
>
> We should enhance btrfs-progs to handle it, but it will take some time.
>
>> [ 271.447491] BTRFS error (device dm-5): failed to read block groups: -5
>> [ 271.455868] BTRFS error (device dm-5): open_ctree failed
>>
>> btrfs check:
>>
>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>> parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
>> Ignoring transid failure
>> ERROR: child eb corrupted: parent bytenr=13069708722176 item=7 parent
>> level=2
>> child level=0
>> ERROR: cannot open file system
>>
>> Any simple fix for the filesystem? It'd be nice to recover the data
>> that's hopefully still intact. I have some backups that I can dust
>> off if it really comes down to it, but it's more convenient to
>> recover the data in-place.
>
> However there is a patch to address this kinda "common" corruption scenario.
>
> https://lwn.net/Articles/777265/
>
> In that patchset, there is a new rescue=bg_skip mount option (needs to
> be used with ro), which should allow you to access whatever you still
> have from the fs.
>
> According to other reporters, such corruption is mainly related to the
> extent tree, so data damage should be pretty small.

Ok I think I spoke too soon. Some files are recoverable, but many
cannot be read. Userspace gets back an I/O error, and the kernel log
reports similar parent transid verify failed errors, with what seem
to be similar generation numbers to what I saw in my original mount
error.

i.e. wants 4196588, found something that's off by usually 2 or 3.
Occasionally there's one that's off by about 1300.
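
For anyone who wants to tally those gaps rather than eyeball them, here
is a minimal Python sketch. It assumes the kernel log has been saved as
plain text and that the messages look like the ones quoted above; the
function name transid_gaps is made up for illustration and is not part
of any btrfs tool.

```python
import re
from collections import Counter

# Matches messages such as:
#   parent transid verify failed on 13069706166272 wanted 4196588 found 4196585
TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)


def transid_gaps(log_text):
    """Count distinct blocks per generation gap (wanted - found)."""
    # Collapse whitespace first, in case the log was pasted through
    # something that wrapped long lines.
    flat = " ".join(log_text.split())
    gap_by_block = {}
    for bytenr, wanted, found in TRANSID_RE.findall(flat):
        # The same block is often reported several times; keep one entry.
        gap_by_block[int(bytenr)] = int(wanted) - int(found)
    return Counter(gap_by_block.values())


sample = (
    "BTRFS error (device dm-5): parent transid verify failed on\n"
    "13069706166272 wanted 4196588 found 4196585\n"
)
for gap, blocks in sorted(transid_gaps(sample).items()):
    print(f"off by {gap}: {blocks} block(s)")
```

Feeding it the full log should show whether the failures cluster around
a few generations or are spread out.
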

There are multiple snapshots on this filesystem (going back a few
days), and the same file in each snapshot seems to be equally
affected, even if the file hasn't changed in many months.

Metadata seems to be intact - I can stat every file in one of the
snapshots and I don't get any errors back.
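
Since stat succeeds but reads fail, one way to get a list of the
affected files is to walk a snapshot and read every regular file to the
end. The following is only a rough Python sketch under that assumption;
find_unreadable is a hypothetical helper name, and the snapshot path in
the usage note is a placeholder.

```python
import os


def find_unreadable(root, chunk_size=1 << 20):
    """Walk `root` and return (path, error) pairs for files that fail to read."""
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file
            # (FIFOs, sockets, device nodes), which could block or mislead.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    # Read to EOF so every data extent is actually touched.
                    while f.read(chunk_size):
                        pass
            except OSError as exc:
                failures.append((path, exc.strerror))
    return failures
```

Calling find_unreadable("/path/to/snapshot") on the read-only mount
would return the files that hit the I/O error, which could then be
pulled from backups instead.
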

Any other ideas? It kind of seems like "btrfs restore" would be
suitable here, but it sounds like it would need to be taught about
rescue=bg_skip first.

Thanks for all the help. Even a partial recovery is a lot better
than what I was facing before.

Steve