Re: Unmountable degraded BTRFS RAID6 filesystem

Chris Murphy Thu, 05 Sep 2019 15:33:56 -0700

On Thu, Sep 5, 2019 at 2:44 PM Edmund Urbani <edmund.urb...@liland.com> wrote:
>
> I did not need the degraded option. And so far I see no HW I/O errors in
> dmesg. I have encountered a few errors while copying files and found
> these in the log:
>
> [ 3560.273634] btrfs_print_data_csum_error: 50 callbacks suppressed
> [ 3560.273639] BTRFS warning (device sdg1): csum failed root 262 ino
> 1838364 off 14467072 csum 0x98f94189 expected csum 0xcb3af09a mirror 1


Not a bit flip; the two values differ in many bit positions:
0x98f94189
10011000111110010100000110001001
0xcb3af09a
11001011001110101111000010011010


> [ 3560.825942] BTRFS warning (device sdg1): csum failed root 262 ino
> 1838364 off 14467072 csum 0xc0248289 expected csum 0xcb3af09a mirror 2
> [ 3560.826588] BTRFS warning (device sdg1): csum failed root 262 ino
> 1838364 off 14467072 csum 0xc0248289 expected csum 0xcb3af09a mirror 3
> [ 3560.827813] BTRFS warning (device sdg1): csum failed root 262 ino
> 1838364 off 14467072 csum 0xc0248289 expected csum 0xcb3af09a mirror 4
> [ 3560.829063] BTRFS warning (device sdg1): csum failed root 262 ino
> 1838364 off 14467072 csum 0xc0248289 expected csum 0xcb3af09a mirror 5
> [ 3560.830366] BTRFS warning (device sdg1): csum failed root 262 ino
> 1838364 off 14467072 csum 0xc0248289 expected csum 0xcb3af09a mirror 6
> [ 3560.831559] BTRFS warning (device sdg1): csum failed root 262 ino
> 1838364 off 14467072 csum 0xc0248289 expected csum 0xcb3af09a mirror 7
> [ 3560.832998] BTRFS warning (device sdg1): csum failed root 262 ino
> 1838364 off 14467072 csum 0xc0248289 expected csum 0xcb3af09a mirror 8
> [ 3560.834649] BTRFS warning (device sdg1): csum failed root 262 ino
> 1838364 off 14467072 csum 0xc0248289 expected csum 0xcb3af09a mirror 9
> [ 3560.836188] BTRFS warning (device sdg1): csum failed root 262 ino
> 1838364 off 14467072 csum 0xc0248289 expected csum 0xcb3af09a mirror 10

Also not a bit flip.
0xc0248289
11000000001001001000001010001001
0xcb3af09a
11001011001110101111000010011010
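
For anyone who wants to repeat the comparison, here is a minimal Python
sketch. The function name is only illustrative. It XORs the computed and
expected checksum values from the log and counts the bits that differ; a
single flipped bit in the checksum value itself would show a difference
of exactly 1. It says nothing about how many bits differ in the
underlying data.

```python
def bit_difference(computed: int, expected: int) -> int:
    """Count the bit positions in which two 32-bit checksum values differ."""
    return bin((computed ^ expected) & 0xFFFFFFFF).count("1")


# Values copied from the kernel log quoted above.
expected = 0xCB3AF09A
observed = (
    ("mirror 1", 0x98F94189),
    ("mirrors 2-10", 0xC0248289),
)

for label, computed in observed:
    diff = bit_difference(computed, expected)
    print(f"{label}: {computed:032b} differs from expected in {diff} bits")
```

Both values differ from the expected checksum in well over one bit
position, which is what rules out a lone bit flip in the checksum.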

I'm not sure what it means, or what it suggests has happened, that all
the copies are wrong. That's plausible with raid5 metadata, but it
seems unlikely with raid6 metadata, and with all devices accounted for.

The file itself is probably fine; these look like metadata
complaints. If you find the file this inode belongs to, either
duplicating it or deleting it is fine, and should cause this bad leaf
to just go away. Make sure you delete the correct file: each subvolume
has its own list of inodes, and this one is in subvol id 262.
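
One way to find the file is 'btrfs inspect-internal inode-resolve',
pointed at the subvolume that holds the inode. As a rough sketch, the
Python below pulls the root (subvolume id), inode and offset out of one
of the warnings and builds that command. The helper names and the path
/mnt/pool/subvol262 are made up for illustration; substitute wherever
subvolume 262 is actually mounted.

```python
import re

# Matches the fields printed by the "csum failed" warnings quoted above.
CSUM_RE = re.compile(r"csum failed root (\d+) ino (\d+) off (\d+)")


def parse_csum_warning(line: str) -> dict:
    """Extract subvolume id (root), inode number and file offset."""
    match = CSUM_RE.search(line)
    if match is None:
        raise ValueError("not a csum failure warning")
    root, ino, off = (int(group) for group in match.groups())
    return {"root": root, "ino": ino, "off": off}


def resolve_command(ino: int, subvol_path: str) -> list[str]:
    """Build the command that maps an inode number to a path.

    subvol_path must point inside the subvolume named by 'root',
    because inode numbers are only unique per subvolume.
    """
    return ["btrfs", "inspect-internal", "inode-resolve", str(ino), subvol_path]


# The log line from the report, rejoined onto a single line.
line = ("BTRFS warning (device sdg1): csum failed root 262 ino 1838364 "
        "off 14467072 csum 0x98f94189 expected csum 0xcb3af09a mirror 1")
info = parse_csum_warning(line)

# /mnt/pool/subvol262 is a hypothetical mount path for subvolume 262.
print(" ".join(resolve_command(info["ino"], "/mnt/pool/subvol262")))
```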

>
> and also:
>
> [ 3889.813300] btree_readpage_end_io_hook: 1860 callbacks suppressed
> [ 3889.813304] BTRFS error (device sdg1): bad tree block start, want
> 34958548107264 have 0
> [ 3889.825732] BTRFS error (device sdg1): bad tree block start, want
> 34958548107264 have 12157064991241308972
> [ 3889.826375] BTRFS error (device sdg1): bad tree block start, want
> 34958548107264 have 12157064991241308972
> [ 3889.828149] BTRFS error (device sdg1): bad tree block start, want
> 34958548107264 have 12157064991241308972
> [ 3889.829649] BTRFS error (device sdg1): bad tree block start, want
> 34958548107264 have 12157064991241308972
> [ 3889.831592] BTRFS error (device sdg1): bad tree block start, want
> 34958548107264 have 12157064991241308972
> [ 3889.833436] BTRFS error (device sdg1): bad tree block start, want
> 34958548107264 have 12157064991241308972
> [ 3889.835458] BTRFS error (device sdg1): bad tree block start, want
> 34958548107264 have 12157064991241308972
> [ 3889.836968] BTRFS error (device sdg1): bad tree block start, want
> 34958548107264 have 12157064991241308972
> [ 3889.848545] BTRFS error (device sdg1): bad tree block start, want
> 34958548107264 have 12157064991241308972

I'm skeptical that a scrub will fix these things, because Btrfs is
passively scrubbing on reads. Any checksum mismatch that can be fixed
from reconstruction should get fixed up on the fly during normal reads,
the same as during a scrub. This is a different problem, and I'm not
sure how serious it is.

I would still do the full scrub, and then unmount the file system and
run 'btrfs check --mode=lowmem'. On a file system of this size it will
take a long time, so maybe do it over a weekend.
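
Spelled out, the sequence would look roughly like the sketch below. It
only builds and prints the commands; it does not run them. /dev/sdg1 is
the device named in the log, while the mount point /mnt/pool and the
function name are assumptions for illustration. The -B option keeps the
scrub in the foreground so the later steps don't start before it
finishes, and 'btrfs check' without --repair does not modify the file
system.

```python
def scrub_then_check(mount_point: str, device: str) -> list[list[str]]:
    """Return the commands, in order, for a full scrub and offline check."""
    return [
        # Full scrub of the mounted file system, run in the foreground.
        ["btrfs", "scrub", "start", "-B", mount_point],
        # btrfs check must be run on an unmounted file system.
        ["umount", mount_point],
        # Read-only check using the slower, low-memory mode.
        ["btrfs", "check", "--mode=lowmem", device],
    ]


# /mnt/pool is a hypothetical mount point; /dev/sdg1 is from the log.
for command in scrub_then_check("/mnt/pool", "/dev/sdg1"):
    print(" ".join(command))
```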

>
> I think that Input/output error btrfsck is showing is actually a
> filesystem checksum error and not triggered by faulty hardware (not
> anymore, I hope). If there actually are any more failing drives here, I
> will most likely do the ddrescue thing again. Currently there are no
> free SATA ports in that system to connect an additional drive, so I
> cannot simply add one (at least not without also installing an
> additional SATA controller).

I suggest you start planning how to migrate the data to a new Btrfs
volume. If the problems can't be repaired, this becomes inevitable. A
reasonable strategy is to take read-only snapshots of each subvolume
you want to preserve, and then use either 'btrfs send/receive' or
'rsync' to copy them to new storage. That way you can keep using the
volume rw in the meantime. Once that completes, take another read-only
snapshot of each subvolume, and do an incremental 'send -p' or rsync to
migrate the much smaller changes.
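
As a rough sketch of the send/receive variant, the Python below builds
the shell commands for one subvolume: a first full transfer, then an
incremental one against the earlier snapshot. All paths (/mnt/old,
/mnt/new, the snap directory and the .1/.2 suffixes) and the helper
names are hypothetical. The commands are printed, not executed.

```python
import shlex
from typing import Optional


def snapshot_cmd(subvol: str, snapshot: str) -> str:
    """Command to take a read-only snapshot of a subvolume."""
    return (f"btrfs subvolume snapshot -r "
            f"{shlex.quote(subvol)} {shlex.quote(snapshot)}")


def send_cmd(snapshot: str, dest_dir: str, parent: Optional[str] = None) -> str:
    """Command to stream a snapshot to the new volume.

    With 'parent' set, only the changes since that snapshot are sent;
    the parent must already exist on the receiving side.
    """
    parent_opt = f"-p {shlex.quote(parent)} " if parent else ""
    return (f"btrfs send {parent_opt}{shlex.quote(snapshot)} "
            f"| btrfs receive {shlex.quote(dest_dir)}")


# Hypothetical layout: old volume at /mnt/old, new volume at /mnt/new.
first = "/mnt/old/snap/home.1"
second = "/mnt/old/snap/home.2"

# Pass 1: full copy while the old volume stays in use read-write.
print(snapshot_cmd("/mnt/old/home", first))
print(send_cmd(first, "/mnt/new"))

# Pass 2: a later snapshot, sent incrementally against the first.
print(snapshot_cmd("/mnt/old/home", second))
print(send_cmd(second, "/mnt/new", parent=first))
```

Keep the first snapshot on both sides until the incremental pass is
done, since 'send -p' needs it as the common parent.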


-- 
Chris Murphy
