Re: Fwd: "BTRFS critical: ... corrupt leaf" due to defective RAM

Qu Wenruo Tue, 22 Dec 2020 15:04:53 -0800



On 2020/12/22 11:59 PM, Nik. wrote:

Hi,

Thank you very much for the quick reply.
Ok, I am going to use the backups (as you suggested).

Just a quick question for understanding the background better:
   -given a btrfs with many intact subvolumes and
   -say, one defective sector within the subvolume "@" (Ubuntu specific),
    which causes this subvolume to be (automatically) remounted as RO
   -am I getting it right that none of the other subvolumes can be
mounted properly (i.e., RW)?


Unfortunately, it is not the subvolume tree itself that got corrupted, but
the extent tree.

The extent tree is shared across the whole fs, thus you may still be unable
to mount other subvolumes as long as mounting involves reading the extent tree.

Wouldn't it be interesting to have an
option, allowing this to work?


We have new rescue= mount options; IIRC there is rescue=all, which will
try to ignore any non-critical trees.

In that case, you may be able to mount the subvolume RO, as long as
there are no bitflips in that subvolume.

Thanks,
Qu

There will be, of course, a processing
overhead, but probably not as expensive as with RAID 1?

Thank you in advance and I wish you all to be happy and healthy!

Nik.
--
21.12.2020 12:44, Qu Wenruo:



On 2020/12/21 6:08 PM, Nik. wrote:

Dear all,

the forwarded mail below came back yesterday with the error
"Diagnostic-Code: X-Postfix; TLS is required, but was not offered by
host vger.kernel.org[23.128.96.18]".

Is it really intended that your mail server does not offer TLS?


Can't help with that; I'm not a vger manager and don't know anything about
it. (Most if not all kernel mailing lists are hosted by vger, so an
individual list can't do much.)

But I can definitely help with some of your btrfs problem.


Kind regards,

Nik.

--

15.12.2020 18:40, Nik.:

Dear all,

after almost a year without problems I again need your advice about
the same computer, but this time it is (hopefully only) the root FS
that failed. I have backups of everything except a couple of files in
/etc, so nothing critical, but it would probably be interesting for
somebody to see how btrfs behaved in such a situation.

The story in short:

- the FS switched to ro mode. Initially I thought that it was due to
insufficient free space (I have already had similar situations) and
deleted some old snapshots. Within half a day it happened 3 more
times, though.


Any detailed report on that RO?
We should have it addressed upstream; if you still hit that, I guess we
need more investigation (if it's not caused by memory corruption).


- so I booted into memtest86 and it gave me a lot of errors! This NAS is
9 years old and I was already looking for a replacement, but it is not
easy to find an 8-bay NAS for 2.5" drives...

- took the drive out from the failed system and tried to mount it on
another (healthy?) PC. I am getting:

root@ubrun:~# mount -t btrfs -o subvol=@ /dev/sdb1 /mnt/sd
mount: /mnt/sd: wrong fs type, bad option, bad superblock on
/dev/sdb1, missing codepage or helper program, or other error.
root@ubrun:~# dmesg |tail
[   50.672561] Policy zone: Normal
[  185.190764] BTRFS info (device sdb1): disk space caching is enabled
[  185.190767] BTRFS info (device sdb1): has skinny extents
[  185.199331] BTRFS info (device sdb1): bdev /dev/sdb1 errs: wr 0, rd
0, flush 0, corrupt 65, gen 0
[  185.246051] BTRFS critical (device sdb1): corrupt leaf:
block=50850988032 slot=79 extent bytenr=50496929792 len=16384 unknown
inline ref type: 54


This is indeed a memory bitflip, and your initial kernel is not new
enough to detect it at write time.

With a new enough kernel, such corrupted metadata shouldn't even
reach disk. (Although it still means the fs will go RO.)

There are only 4 valid types for extent refs:

TREE_BLOCK_REF    176 (0xb0)
EXTENT_DATA_REF   178 (0xb2)
SHARED_BLOCK_REF  182 (0xb6)
SHARED_DATA_REF   184 (0xb8)

The invalid type is:

                   54 (0x36)

The diff to SHARED_BLOCK_REF is 0x80, so indeed exactly one bit flipped.
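
To double-check the arithmetic above, here is a minimal Python sketch that XORs the observed type against each valid ref type listed in this mail and reports the ones that differ by exactly one bit. The function name single_bit_candidates is made up for illustration; it is not part of btrfs-progs or the kernel.

```python
# Extent ref type values, copied from the list in this mail.
VALID_REF_TYPES = {
    "TREE_BLOCK_REF": 0xB0,
    "EXTENT_DATA_REF": 0xB2,
    "SHARED_BLOCK_REF": 0xB6,
    "SHARED_DATA_REF": 0xB8,
}


def single_bit_candidates(observed):
    """Return {name: flipped_bit_mask} for valid types one bit away from observed.

    Hypothetical helper for illustration only.
    """
    candidates = {}
    for name, value in VALID_REF_TYPES.items():
        diff = observed ^ value
        # A non-zero value with exactly one bit set is a power of two,
        # and for those diff & (diff - 1) is zero.
        if diff != 0 and diff & (diff - 1) == 0:
            candidates[name] = diff
    return candidates


if __name__ == "__main__":
    # 54 is the "unknown inline ref type" from the dmesg output.
    for name, mask in single_bit_candidates(54).items():
        print(f"{name}: flipped bit mask {mask:#04x}")
```

For the value 54 from the log, only SHARED_BLOCK_REF comes back, with mask 0x80, which matches the diff stated above.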

[  185.246055] BTRFS error (device sdb1): block=50850988032 read time
tree block corruption detected
[  185.247070] BTRFS critical (device sdb1): corrupt leaf:
block=50850988032 slot=79 extent bytenr=50496929792 len=16384 unknown
inline ref type: 54
[  185.247073] BTRFS error (device sdb1): block=50850988032 read time
tree block corruption detected
[  185.247093] BTRFS error (device sdb1): failed to read block
groups: -5
[  185.281382] BTRFS error (device sdb1): open_ctree failed
root@ubrun:~#

How should one proceed?


Since it's caused by a bitflip and you mentioned the system has tons of
memory errors, I believe there will be tons of similar problems
scattered around your fs.

As for repair, I don't really believe btrfs-check can or will be able to
fix any bitflip, not to mention the many more bitflips that are possible.

It's better just to use your backup.

BTW, detection of extent tree bitflips was introduced in v5.4.
Next time you can at least catch the faulty hardware before it screws up
your data.
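
A rough way to see whether a machine is at or past that version is to compare the running kernel release against 5.4. The Python sketch below assumes the release string starts with "major.minor" (as in "5.4.0-58-generic"); the helper names are made up for illustration, and distro kernels may carry backports, so treat the result as a hint only.

```python
import platform
import re

# Per this mail, extent tree bitflip detection was introduced in v5.4.
MIN_VERSION = (5, 4)


def parse_kernel_release(release):
    """Extract (major, minor) from a string such as '5.4.0-58-generic'.

    Hypothetical helper; assumes the string starts with 'major.minor'.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def has_extent_tree_check(release):
    """True if the release is at least v5.4 (ignores any backports)."""
    return parse_kernel_release(release) >= MIN_VERSION


if __name__ == "__main__":
    release = platform.release()
    verdict = "at least" if has_extent_tree_check(release) else "older than"
    print(f"kernel {release} is {verdict} v5.4")
```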

Thanks,
Qu


Kind regards

Nik.
