On 2015-10-20 14:54, Duncan wrote:
But tho I'm a user not a dev, and thus haven't actually checked the source
code itself, my belief here is with Russ and disagrees with Austin.  Based
on what I've read on the wiki and seen here previously, btrfs at normal
runtime (that is, not just during scrub) actually repairs the problem on-
hardware as well, rewriting from that second copy, not just fetching it
for use without the repair.  The distinction between normal runtime error
detection and scrub is thus that scrub systematically checks everything,
while normal runtime on most systems will only check the stuff it reads
in normal usage, thus catching the stuff that's regularly used, but not
the stuff that's only stored and never read.

*WARNING*:  In my experience, at least on initial mount, btrfs isn't
particularly robust when the number of read errors on one device starts
to go up dramatically.  Despite never seeing an error in scrub that it
couldn't fix, twice I had enough reads fail on a mount that the mount
itself failed, and I couldn't mount successfully despite repeated
attempts.  In both cases, I was able to use btrfs restore to copy the
contents of the filesystem to some other place (as it happens, the
reiserfs on spinning rust I use for my media filesystem; being meant for
big media files, it had enough space to hold the reasonably small btrfs
I mentioned above), and ultimately recreated the filesystem using
mkfs.btrfs.
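
For anyone needing to do the same, the basic invocation is simple
(device and target paths here are examples, not my actual ones):

    # copy everything btrfs restore can reach from the unmountable fs
    btrfs restore -v /dev/sdb1 /mnt/media/btrfs-rescue

Note that restore works on the unmounted device directly, which is
exactly why it can still get data off a filesystem that refuses to
mount.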

But given that, despite the filesystem not being mountable, neither SMART
nor dmesg ever mentioned anything about the "good" device having errors,
I'm left to conclude that btrfs itself ultimately crashed on the attempt
to mount the filesystem, even tho only the one copy was bad.  After a
couple of those events I started scrubbing much more frequently, thus
fixing the errors while btrfs could still mount the filesystem and /let/
me run a scrub.  It was actually those more frequent scrubs that quickly
became a hassle and led me to give up on the device.  If btrfs had been
able to fall back to the second/valid copy even in that case, as it
really should have done, I would quite possibly have waited a good bit
longer to replace the dying device.
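
(For reference, the scrub routine itself was just the usual pair of
commands run against the mountpoint; the path here is an example:

    # kick off a scrub, then poll for progress and error counts
    btrfs scrub start /mnt/btrfs
    btrfs scrub status /mnt/btrfs

The hassle wasn't the commands, it was having to run them often enough
to stay ahead of the accumulating errors.)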

So on that one I'd say to be sure, get confirmation either directly from
the code (if you can read it) or from a dev who has actually looked at it
and is basing his post on that, tho I still /believe/ btrfs runtime-
corrects checksum failures actually on-device, if there's a validating
second copy it can use to do so.

FWIW, my assessment is based on some testing I did a while back (kernel
3.14 IIRC) using a VM.  The (significantly summarized, of course)
procedure I used was:

1. Create a basic, minimal Linux system in a VM (in my case, I just used
   a stage3 tarball for Gentoo in a paravirtualized Xen domain) with
   BTRFS as the root filesystem in a raid1 setup, and verify that it
   actually boots.
2. Shut down the VM, use btrfs-progs on the host to find the physical
   location of an arbitrary file (ideally one that is not touched at all
   during the boot process; IIRC I used one of the e2fsprogs binaries),
   and then intentionally clear the CRC of one of the copies of a block
   from the file.  (A rough sketch of how to locate and hit one copy
   follows the list.)
3. Boot the VM, read the file.
4. Shut down the VM again.
5. Verify whether the file block you cleared the checksum on has a valid
   checksum now.
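
The locate-and-corrupt step looked roughly like the following (device
names, paths, and the target file are examples, not what I actually
used; and note this variant clobbers one mirror of the data block
directly rather than the stored csum, which is the complementary test I
mention below):

    # with the image mounted read-only on the host, get the file's
    # btrfs logical address (on btrfs, filefrag's "physical" column
    # is the btrfs logical address, not a device offset)
    filefrag -v /mnt/guest/sbin/dumpe2fs

    # map that logical address to the physical offset of each mirror
    btrfs-map-logical -l <logical> /dev/vdb

    # unmount, then overwrite one 4K copy on one device only, where
    # PHYSICAL is the mirror offset reported by btrfs-map-logical
    dd if=/dev/urandom of=/dev/vdb bs=4096 count=1 \
        seek=$(( PHYSICAL / 4096 )) conv=notrunc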

I repeated this more than a dozen times using different files and different methods of reading the file, and each time the CRC I had cleared was untouched.  Based on this, unless BTRFS does some kind of deferred re-write that doesn't get forced even by a clean unmount of the FS, I felt it was relatively safe to conclude that it does not automatically fix corrupted blocks.  I did not, however, test corrupting the block itself instead of the checksum, but I doubt that would change anything in this case.
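
(If anyone wants to reproduce this without poking at on-disk structures,
the per-device error counters give a coarser but easier signal, assuming
they behave as documented; the mountpoint here is an example:

    # corruption_errs should increment when the bad copy is read
    btrfs device stats /mnt/guest

    # a foreground scrub reports corrected vs. uncorrectable errors
    btrfs scrub start -B /mnt/guest

If the runtime read had fixed the copy, a subsequent scrub should find
nothing left to correct.)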

As I mentioned, many veteran sysadmins would not want the FS driver fixing this automatically without some kind of notification.  This preference largely dates back to traditional RAID1, where the system has no way to know for certain which copy is correct in the case of a mismatch, and therefore the admin needs to intervene to fix mismatches safely.  While BTRFS's design makes it possible to fix this safely in most cases, there is still the possibility of it getting things wrong.

There was one time I had a BTRFS raid1 filesystem where one copy of a block got corrupted but miraculously still matched its CRC (which should be statistically near-impossible: with a 32-bit checksum, a random corruption passes roughly one time in 2^32, about 4 billion), while the other copy of the block was correct but had a wrong CRC (which, while unlikely, is very much possible).  In such a case (which was a serious pain to debug), automatically 'fixing' the supposedly bad block would have resulted in data loss.  Of course, the chance of that happening more than once in a lifetime is astronomically small, but it is still possible.

It's also worth noting that ZFS has been considered mature for more than a decade now, and the ZFS developers _still_ aren't willing to risk their users' data on something like this; that reluctance should give pause to anyone developing a filesystem with a similar feature set.
