On Tue, Sep 3, 2019 at 10:41 PM Edmund Urbani <edmund.urb...@liland.com> wrote:
>
> Also there are a few of these:
> Sep  1 21:10:17 phoenix kernel: ata6.00: exception Emask 0x0 SAct
> 0x10000020 SErr 0x0 action 0x0
> Sep  1 21:10:17 phoenix kernel: ata6.00: irq_stat 0x40000008
> Sep  1 21:10:17 phoenix kernel: ata6.00: failed command: READ FPDMA QUEUED
> Sep  1 21:10:17 phoenix kernel: ata6.00: cmd
> 60/20:28:80:66:09/00:00:50:01:00/40 tag 5 ncq dma 16384 in
> res 41/40:00:88:66:09/00:00:50:01:00/40 Emask 0x409 (media error) <F>
> Sep  1 21:10:17 phoenix kernel: ata6.00: status: { DRDY ERR }
> Sep  1 21:10:17 phoenix kernel: ata6.00: error: { UNC }
> Sep  1 21:10:17 phoenix kernel: ata6.00: configured for UDMA/133
> Sep  1 21:10:17 phoenix kernel: sd 5:0:0:0: [sdf] tag#5 UNKNOWN(0x2003)
> Result: hostbyte=0x00 driverbyte=0x08
> Sep  1 21:10:17 phoenix kernel: sd 5:0:0:0: [sdf] tag#5 Sense Key : 0x3
> [current]
> Sep  1 21:10:17 phoenix kernel: sd 5:0:0:0: [sdf] tag#5 ASC=0x11 ASCQ=0x4
> Sep  1 21:10:17 phoenix kernel: sd 5:0:0:0: [sdf] tag#5 CDB: opcode=0x88
> 88 00 00 00 00 01 50 09 66 80 00 00 00 20 00 00
> Sep  1 21:10:17 phoenix kernel: print_req_error: I/O error, dev sdf,
> sector 5637760640
> Sep  1 21:10:17 phoenix kernel: BTRFS error (device sdb1): bdev
> /dev/sdf1 errs: wr 0, rd 289, flush 0, corrupt 0, gen 0
> Sep  1 21:10:17 phoenix kernel: ata6: EH complete
> Sep  1 21:10:17 phoenix kernel: BTRFS warning (device sdb1): sdb1
> checksum verify failed on 70943861833728 wanted 49137758 found 776101D6
> level 0
> Sep  1 21:10:17 phoenix kernel: BTRFS warning (device sdb1): sdb1
> checksum verify failed on 70943861833728 wanted 49137758 found 776101D6
> level 0
> Sep  1 21:10:17 phoenix kernel: BTRFS warning (device sdb1): sdb1
> checksum verify failed on 70943861833728 wanted 49137758 found 776101D6
> level 0
> Sep  1 21:10:17 phoenix kernel: BTRFS warning (device sdb1): sdb1
> checksum verify failed on 70943861833728 wanted 49137758 found 776101D6
> level 0
> Sep  1 21:10:17 phoenix kernel: BTRFS warning (device sdb1): sdb1
> checksum verify failed on 70943861833728 wanted 49137758 found 776101D6
> level 0
> Sep  1 21:10:17 phoenix kernel: BTRFS warning (device sdb1): sdb1
> checksum verify failed on 70943861833728 wanted 49137758 found 776101D6
> level 0
> Sep  1 21:10:17 phoenix kernel: BTRFS warning (device sdb1): sdb1
> checksum verify failed on 70943861833728 wanted 49137758 found 776101D6
> level 0
> Sep  1 21:10:17 phoenix kernel: BTRFS warning (device sdb1): sdb1
> checksum verify failed on 70943861833728 wanted 49137758 found 776101D6
> level 0
> Sep  1 21:10:17 phoenix kernel: BTRFS warning (device sdb1): sdb1
> checksum verify failed on 70943861833728 wanted 49137758 found 776101D6
> level 0

OK so the file system is not degraded, but sdb1 is giving you
problems, so you've deleted it and it's in the process of being removed
(fs shrink, move chunks, and restripe).
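
If it helps, you can watch the removal progress - just a sketch,
assuming the filesystem is mounted at /mnt (adjust the path for your
setup):

    # per-device allocation; the device being removed shrinks over time
    btrfs device usage /mnt
    # device list with bytes used
    btrfs filesystem show /mnt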

Here /dev/sdf has issued an uncorrectable read error - a classic case
of a bad sector. Btrfs then tries to get the data off sdb1 to fix it,
but that fails with checksum errors multiple times. So it is
effectively a two-device failure for the stripe currently being read.
It should still be possible to recover the stripe unless there is one
more error from another drive - but the included dmesg doesn't go on
far enough to tell us how this event turned out.
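
It's also worth checking the cumulative per-device error counters -
a sketch, assuming the volume is mounted at /mnt:

    # write/read/flush/corruption/generation error counters per device
    btrfs device stats /mnt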


> I am still looking for log entries related to the filesystem going
> read-only. Not sure when exactly that happened and the logs are spammed
> with plenty of the above...

They're relevant because if there's a third failure at the same time,
and it affects metadata, reconstruction isn't possible - the metadata
is simply missing. Then the question becomes what exactly is missing
and whether it can be manually reconstructed, which is super tedious.





> >> [ 8904.358088] BTRFS info (device sda1): allowing degraded mounts
> >> [ 8904.358089] BTRFS info (device sda1): disk space caching is enabled
> >> [ 8904.358091] BTRFS info (device sda1): has skinny extents
> >> [ 8904.361743] BTRFS warning (device sda1): devid 8 uuid
> >> 0e8b4aff-6d64-4d31-a135-705421928f94 is missing
> >> [ 8905.705036] BTRFS info (device sda1): bdev (null) errs: wr 0, rd
> >> 14809, flush 0, corrupt 4, gen 0
> >> [ 8905.705041] BTRFS info (device sda1): bdev /dev/sda1 errs: wr 0, rd
> >> 4, flush 0, corrupt 0, gen 0
> >> [ 8905.705052] BTRFS info (device sda1): bdev /dev/sdf1 errs: wr 0, rd
> >> 10543, flush 0, corrupt 0, gen 0
> >> [ 8905.705062] BTRFS info (device sda1): bdev /dev/sdc1 errs: wr 0, rd
> >> 8, flush 0, corrupt 0, gen 0
> > four devices with read errors
> >
> > When was the last time the volume was scrubbed? Do you know for sure
> > these errors have not gone up at all since the last successful scrub?
> > And were any errors reported for that last scrub?
> Oh, that must have been quite a while ago. Sometime in 2018? Maybe? All
> these drives have been up and running for several years now. sda and sdc
> should still be fine, the replaced drive is sdb and sdf is next in line.

There's evidence of four drives with problems at some point in time.
And there's evidence in the kernel messages above of at least two
problems at the same time with the same stripe. So all it takes is
one more problem with that stripe, and then that stripe can't be
recovered - and if it's a metadata stripe? That's 512KiB of metadata
lost, which is quite a lot and probably kills the file system,
depending on where it happens. If it's data - no big deal. Btrfs won't
even care; it will just report EIO and the path to the bad file, and
continue on.
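
One way to find out how many such stripes are already unrecoverable,
without writing anything, is a read-only scrub - a sketch, assuming
the volume mounts at /mnt (whether scrub will run against a read-only
mount depends on the kernel):

    # read-only scrub: reports errors per device, repairs nothing
    btrfs scrub start -Bdr /mnt
    btrfs scrub status -d /mnt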

The whole point of regular scrubs is to catch and fix single sector
corruptions and read failures before they accumulate. If you don't do
that, they build up over time and become a huge problem the moment
just one drive dies. So when did you last do a scrub? Are they all the
same make and model of drive? Do they all have the same SCT ERC value?
And is that value, for all drives, less than the value found at
/sys/block/sdN/device/timeout ?
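
For reference, a quick way to check both - substitute your actual
drive letters for sdX:

    # drive's SCT ERC (error recovery) limit, in deciseconds
    smartctl -l scterc /dev/sdX
    # kernel's SCSI command timer for that drive, in seconds
    cat /sys/block/sdX/device/timeout

If a drive supports SCT ERC, 'smartctl -l scterc,70,70 /dev/sdX' sets
a 7 second limit; if it doesn't, the usual workaround is to raise the
command timer well above the drive's internal recovery time, e.g.
'echo 180 > /sys/block/sdX/device/timeout'.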



> >
> >
> >> I have tried all the mount / restore options listed here:
> >> https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/?tab=comments#comment-543490
> > Good. Stick with ro attempts for now. Including if you want to try a
> > newer kernel. If it succeeds to mount ro, my advice is to update
> > backups so at least critical information isn't lost. Back up while you
> > can. Any repair attempt makes changes that will risk the data being
> > permanently lost. So it's important to be really deliberate about any
> > changes.
> I'll let you know, when I have the new kernel up and running.

I think you should have all the original drives installed, and try to
mount -o ro first. And if that doesn't work, try -o ro,degraded, and
then we'll just have to see which drive it doesn't like.
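
Something like (any member device can be named; /mnt is just an
example mount point):

    mount -o ro /dev/sda1 /mnt
    # only if that fails:
    mount -o ro,degraded /dev/sda1 /mnt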



--
Chris Murphy
