Re: Unocorrectable errors with RAID1

Goldwyn Rodrigues Mon, 16 Jan 2017 14:46:14 -0800


On 01/16/2017 05:10 AM, Christoph Groth wrote:
> Hi,
> 
> I’ve been using a btrfs RAID1 of two hard disks since early 2012 on my
> home server.  The machine has been working well overall, but recently
> some problems with the file system surfaced.  Since I do have backups, I
> do not worry about the data, but I post here to better understand what
> happened.  Also I cannot exclude that my case is useful in some way to
> btrfs development.
> 
> First some information about the system:
> 
> root@mim:~# uname -a
> Linux mim 4.6.0-1-amd64 #1 SMP Debian 4.6.3-1 (2016-07-04) x86_64 GNU/Linux
> root@mim:~# btrfs --version
> btrfs-progs v4.7.3
> root@mim:~# btrfs fi show
> Label: none  uuid: 2da00153-f9ea-4d6c-a6cc-10c913d22686
>     Total devices 2 FS bytes used 345.97GiB
>     devid    1 size 465.29GiB used 420.06GiB path /dev/sda2
>     devid    2 size 465.29GiB used 420.04GiB path /dev/sdb2
> 
> root@mim:~# btrfs fi df /
> Data, RAID1: total=417.00GiB, used=344.62GiB
> Data, single: total=8.00MiB, used=0.00B
> System, RAID1: total=40.00MiB, used=68.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=3.00GiB, used=1.35GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=464.00MiB, used=0.00B
> root@mim:~# dmesg | grep -i btrfs
> [    4.165859] Btrfs loaded
> [    4.481712] BTRFS: device fsid 2da00153-f9ea-4d6c-a6cc-10c913d22686 devid 1 transid 2075354 /dev/sda2
> [    4.482025] BTRFS: device fsid 2da00153-f9ea-4d6c-a6cc-10c913d22686 devid 2 transid 2075354 /dev/sdb2
> [    4.521090] BTRFS info (device sdb2): disk space caching is enabled
> [    4.628506] BTRFS info (device sdb2): bdev /dev/sdb2 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
> [    4.628521] BTRFS info (device sdb2): bdev /dev/sda2 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
> [   18.315694] BTRFS info (device sdb2): disk space caching is enabled
> 
> The disks themselves have been turning for almost 5 years by now, but
> their SMART health is still fully satisfactory.
> 
> I noticed that something was wrong because printing stopped working.  So
> I did a scrub that detected 0 "correctable errors" and 6 "uncorrectable"
> errors.  The relevant bits from kern.log are:
> 
> Jan 11 11:05:56 mim kernel: [159873.938579] BTRFS warning (device sdb2):
> checksum error at logical 180829634560 on dev /dev/sdb2, sector
> 353143968, root 5, inode 10014144, offset 221184, length 4096, links 1
> (path: usr/lib/x86_64-linux-gnu/libcups.so.2)
> Jan 11 11:05:57 mim kernel: [159874.857132] BTRFS warning (device sdb2):
> checksum error at logical 180829634560 on dev /dev/sda2, sector
> 353182880, root 5, inode 10014144, offset 221184, length 4096, links 1
> (path: usr/lib/x86_64-linux-gnu/libcups.so.2)
> Jan 11 11:28:42 mim kernel: [161240.083721] BTRFS warning (device sdb2):
> checksum error at logical 260254629888 on dev /dev/sda2, sector
> 508309824, root 5, inode 9990924, offset 6676480, length 4096, links 1
> (path:
> var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
> 
> Jan 11 11:28:42 mim kernel: [161240.235837] BTRFS warning (device sdb2):
> checksum error at logical 260254638080 on dev /dev/sda2, sector
> 508309840, root 5, inode 9990924, offset 6684672, length 4096, links 1
> (path:
> var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
> 
> Jan 11 11:37:21 mim kernel: [161759.725120] BTRFS warning (device sdb2):
> checksum error at logical 260254629888 on dev /dev/sdb2, sector
> 508270912, root 5, inode 9990924, offset 6676480, length 4096, links 1
> (path:
> var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
> 
> Jan 11 11:37:21 mim kernel: [161759.750251] BTRFS warning (device sdb2):
> checksum error at logical 260254638080 on dev /dev/sdb2, sector
> 508270928, root 5, inode 9990924, offset 6684672, length 4096, links 1
> (path:
> var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
> 
> 
> As you can see each disk has the same three errors, and there are no
> other errors.  Random bad blocks cannot explain this situation. I asked
> on #btrfs and someone suggested that these errors are likely due to RAM
> problems.  This may indeed be the case, since the machine has no ECC.  I
> managed to fix these errors by replacing the broken files with good
> copies.  Scrubbing shows no errors now:
> 
> root@mim:~# btrfs scrub status /
> scrub status for 2da00153-f9ea-4d6c-a6cc-10c913d22686
>     scrub started at Sat Jan 14 12:52:03 2017 and finished after 01:49:10
>     total bytes scrubbed: 699.17GiB with 0 errors
> 
> However, there are further problems.  When trying to archive the full
> filesystem I noticed that some files/directories cannot be read.  (The
> problem is localized to some ".git" directory that I don’t need.)  Any
> attempt to read the broken files (or to delete them) does not work:
> 
> $ du -sh .git
> du: cannot access
> '.git/objects/28/ea2aae3fe57ab4328adaa8b79f3c1cf005dd8d': No such file
> or directory
> du: cannot access
> '.git/objects/28/fd95a5e9d08b6684819ce6e3d39d99e2ecccd5': Stale file handle
> du: cannot access
> '.git/objects/28/52e887ed436ed2c549b20d4f389589b7b58e09': Stale file handle
> du: cannot access '.git/objects/info': Stale file handle
> du: cannot access '.git/objects/pack': Stale file handle
> 
> During the above command the following lines were added to kern.log:
> 
> Jan 16 09:41:34 mim kernel: [132206.957566] BTRFS critical (device
> sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
> Jan 16 09:41:34 mim kernel: [132206.957924] BTRFS critical (device
> sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
> Jan 16 09:41:34 mim kernel: [132206.958505] BTRFS critical (device
> sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
> Jan 16 09:41:34 mim kernel: [132206.958971] BTRFS critical (device
> sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
> Jan 16 09:41:34 mim kernel: [132206.959534] BTRFS critical (device
> sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
> Jan 16 09:41:34 mim kernel: [132206.959874] BTRFS critical (device
> sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
> Jan 16 09:41:34 mim kernel: [132206.960523] BTRFS critical (device
> sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
> Jan 16 09:41:34 mim kernel: [132206.960943] BTRFS critical (device
> sda2): corrupt leaf, slot offset bad: block=192561152,root=1, slot=15
> 
> So I tried to repair the file system by running "btrfs check --repair",
> but this doesn’t work:
> 
> (initramfs) btrfs --version
> btrfs-progs v4.7.3
> (initramfs) btrfs check --repair /dev/sda2
> UUID: ...
> checking extents
> incorrect offsets 2527 2543
> items overlap, can't fix
> cmds-check.c:4297: fix_item_offset: Assertion `ret` failed.
> btrfs[0x41a8b4]
> btrfs[0x41a8db]
> btrfs[0x42428b]
> btrfs[0x424f83]
> btrfs[0x4259cd]
> btrfs(cmd_check+0x1111)[0x427d6d]
> btrfs(main+0x12f)[0x40a341]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fd98859d2b1]
> btrfs(_start+0x2a)[0x40a37a]
>


Would you be able to upload a btrfs-image for me to examine? This is a
core ctree error where, most probably, an item size is incorrectly recorded.
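
For reference, a minimal sketch of how such a metadata-only image might
be captured. The device and output path are placeholders taken from the
report above, and the option choices (-c9 for maximum compression, -t4
for four threads, -s to sanitize file names) are assumptions rather than
requirements. The snippet only prints the command so that it can be
reviewed before being run as root, ideally with the filesystem unmounted.

```shell
#!/bin/sh
# Placeholders: adjust the device and output path for your system.
DEV="${DEV:-/dev/sda2}"
OUT="${OUT:-/tmp/sda2.btrfs-image}"

# Build the btrfs-image command line without executing it.
#   -c9  compress the image at the highest level
#   -t4  use four threads
#   -s   sanitize file names, so the image does not reveal them
image_cmd() {
    printf 'btrfs-image -c9 -t4 -s %s %s\n' "$1" "$2"
}

# Print the command for review; nothing is read from the device here.
image_cmd "$DEV" "$OUT"
```

The image contains only metadata, not file contents, so it is usually
small enough to upload.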

Thanks,

-- 
Goldwyn