Thanks for the quick response, Chris and Hugo!

After some testing, there *was* a RAM issue, which has now been resolved, so 
it shouldn't be a factor going forward, but it could definitely have been 
related. The high number of lifetime errors on the filesystem is expected 
and isn't related to this issue: a bad power supply took one disk completely 
offline during a balance operation. That was fully recovered via scrub, and 
the error counters didn't increase again until this new issue (several 
months and several TB written without an error).
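
For reference, this is roughly how I checked (memtester's size and loop 
count here are just what I happened to use, and /mnt/array is a placeholder 
for my real mount point; "btrfs device stats" needs the filesystem mounted, 
so the lifetime counters above are from before the failure):

# memtester 4G 1
# btrfs device stats /mnt/array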

I've attached the full output from Chris's recommendations; here are a 
couple of excerpts:

# btrfs rescue super-recover -v /dev/sdb
...
All supers are valid, no need to recover
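
For what it's worth, the superblocks can also be dumped directly to 
cross-check generation numbers between devices (dump-super is a stock 
btrfs-progs subcommand; sdb is just one of the six disks):

# btrfs inspect-internal dump-super /dev/sdb | grep -E 'generation|fsid'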

# journalctl | grep -A 15 exception
...
Jan 23 01:06:37 localhost kernel: ata3.00: status: { DRDY }
Jan 23 01:06:37 localhost kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Jan 23 01:06:37 localhost kernel: ata3.00: cmd 61/b0:98:ea:7a:48/00:00:0a:00:00/40 tag 19 ncq dma 90112 out
                                           res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
--
Jan 31 19:24:32 localhost kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 31 19:24:32 localhost kernel: ata5.00: failed command: READ DMA EXT
Jan 31 19:24:32 localhost kernel: ata5.00: cmd 25/00:08:a8:2a:81/00:00:a3:03:00/e0 tag 0 dma 4096 in
                                           res 40/00:01:00:00:00/00:00:00:00:00/10 Emask 0x4 (timeout)
Jan 31 19:24:32 localhost kernel: ata5.00: status: { DRDY }
Jan 31 19:24:32 localhost kernel: ata5: link is slow to respond, please be patient (ready=0)
Jan 31 19:24:32 localhost kernel: ata5: device not ready (errno=-16), forcing hardreset
Jan 31 19:24:32 localhost kernel: ata5: soft resetting link
Jan 31 19:24:32 localhost kernel: ata5.00: configured for UDMA/33
Jan 31 19:24:32 localhost kernel: ata5.01: configured for UDMA/33
Jan 31 19:24:32 localhost kernel: sd 4:0:0:0: [sde] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 31 19:24:32 localhost kernel: sd 4:0:0:0: [sde] tag#0 Sense Key : Illegal Request [current]
Jan 31 19:24:32 localhost kernel: sd 4:0:0:0: [sde] tag#0 Add. Sense: Unaligned write command
Jan 31 19:24:32 localhost kernel: sd 4:0:0:0: [sde] tag#0 CDB: Read(16) 88 00 00 00 00 03 a3 81 2a a8 00 00 00 08 00 00
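
Given the ata timeouts above, SMART data for the affected drives seems worth 
a look as well; something like the following (sde matches the excerpt, the 
other device names would differ):

# smartctl -a /dev/sde
# smartctl -l scterc /dev/sde

The second command reports the drive's SCT error recovery control settings; 
a drive that retries internally for a long time can look like a link 
timeout to the kernel.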

This last journalctl excerpt is from the first boot on which the filesystem 
stopped being mountable. The filesystem had been remounted read-only 
automatically after a few errors (see btrfs-journal.log in the archive). 
None of my other system logs looked relevant as far as I could tell, so I 
limited this to journalctl's output.
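
(In case it helps to reproduce the filtering, kernel messages from a single 
boot can be pulled with something like the following; the -b offset is 
illustrative, since the machine has been rebooted since then:)

# journalctl -k -b -1 | grep -Ei 'btrfs|ata'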

I have been able to recover files successfully via "btrfs restore ...", and 
nothing essential appears to be missing from its full output with -D. So if 
the filesystem can't be recovered directly, using restore to offload the 
entire filesystem at least seems feasible.
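
For reference, the invocations were roughly shaped like this (/mnt/scratch 
is a placeholder for a directory on a separate, healthy filesystem):

# btrfs restore -D -v /dev/sdb /mnt/scratch
# btrfs restore -v -i -m -x /dev/sdb /mnt/scratch

The first is the dry run (-D) that only lists what would be recovered; the 
second ignores errors where possible (-i) and also restores ownership, mode, 
and timestamps (-m) plus extended attributes (-x).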

Thanks for the help!

On Sat, Feb 2, 2019, at 17:26, Chris Murphy wrote:
> On Sat, Feb 2, 2019 at 5:02 AM Hugo Mills <h...@carfax.org.uk> wrote:
> >
> > On Fri, Feb 01, 2019 at 11:28:27PM -0500, Alan Hardman wrote:
> > > I have a Btrfs filesystem using 6 partitionless disks in RAID1 that's 
> > > failing to mount. I've tried the common recommended safe check options, 
> > > but I haven't gotten the disk to mount at all, even with -o ro,recovery. 
> > > If necessary, I can try to use the recovery to another filesystem, but I 
> > > have around 18 TB of data on the filesystem that won't mount, so I'd like 
> > > to avoid that if there's some other way of recovering it.
> > >
> > > Versions:
> > > btrfs-progs v4.19.1
> > > Linux localhost 4.20.6-arch1-1-ARCH #1 SMP PREEMPT Thu Jan 31 08:22:01 
> > > UTC 2019 x86_64 GNU/Linux
> > >
> > > Based on my understanding of how RAID1 works with Btrfs, I would expect a 
> > > single disk failure to not prevent the volume from mounting entirely, but 
> > > I'm only seeing one disk with errors according to dmesg output, maybe I'm 
> > > misinterpreting it:
> > >
> > > [  534.519437] BTRFS warning (device sdd): 'recovery' is deprecated, use 
> > > 'usebackuproot' instead
> > > [  534.519441] BTRFS info (device sdd): trying to use backup root at 
> > > mount time
> > > [  534.519443] BTRFS info (device sdd): disk space caching is enabled
> > > [  534.519446] BTRFS info (device sdd): has skinny extents
> > > [  536.306194] BTRFS info (device sdd): bdev /dev/sdc errs: wr 23038942, 
> > > rd 22208378, flush 1, corrupt 29486730, gen 2933
> > > [  556.126928] BTRFS critical (device sdd): corrupt leaf: root=2 
> > > block=25540634836992 slot=45, unexpected item end, have 13882 expect 13898
> >
> >    It's worth noting that 13898-13882 = 16, which is a power of
> > two. This means that you most likely have a single-bit error in your
> > metadata. That, plus the checksum not being warned about, would
> > strongly suggest that you have bad RAM. I would recommend that you
> > check your RAM first before trying anything else that would write to
> > your filesystem (including btrfs check --repair).
> 
> Good catch!
> 
> I think that can account for the corrupt and generation errors. I
> don't know that memory errors can account for the large number of read
> and write errors, however. So there may be more than one problem.
> 
> 
> -- 
> Chris Murphy
>

Attachment: btrfs.tar.gz
Description: GNU Zip compressed data
