On Wed, Oct 12, 2016 at 09:32:17AM +0800, Qu Wenruo wrote:
> >But consider the identical scenario with md or LVM raid5, or any
> >conventional hardware raid5. A scrub check simply reports a mismatch.
> >It's unknown whether data or parity is bad, so the bad data strip is
> >propagated upward to user space without error. On a scrub repair, the
> >data strip is assumed to be good, and good parity is overwritten with
> >bad.
> 
> Totally true.
> 
> Original RAID5/6 design is only to handle missing device, not rotted bits.

A missing device is the _only_ thing the current design handles, i.e. you
umount the filesystem cleanly, remove a disk, mount it again degraded,
and then the only thing you can safely do with the filesystem is delete
or replace a device.  There is also some chance of being able to repair
bitrot under favorable circumstances.

If your disk failure looks any different from this, btrfs can't handle it.
If a disk fails while the array is running and the filesystem is writing,
the filesystem is likely to be severely damaged, possibly unrecoverably.

A btrfs -dsingle -mdup filesystem on an mdadm raid[56] device might have
a snowball's chance in hell of surviving a disk failure on a live array
with nothing worse than data losses.  That would work only if mdadm and
btrfs successfully arrange to have each dup copy of the metadata updated
separately, so that one of the copies survives the raid5 write hole.
I've never tested this configuration, and I'd test the heck out of it
before considering using it.
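
To make the dup-copy argument concrete, here's a toy Python model of it
(my own illustration, nothing like the real on-disk code; the block
contents and the crc32 checksum are stand-ins):

    import zlib

    def write_copy(stripes, n, block):
        # store the block together with its checksum, as btrfs does
        # for metadata
        stripes[n] = (block, zlib.crc32(block))

    def read_metadata(stripes, copies):
        # dup metadata: return the first copy whose checksum still
        # matches its contents
        for n in copies:
            block, csum = stripes[n]
            if zlib.crc32(block) == csum:
                return block
        raise IOError("both dup copies corrupted")

    stripes = {}
    write_copy(stripes, 0, b"metadata v2")   # dup copy #1, stripe 0
    write_copy(stripes, 1, b"metadata v2")   # dup copy #2, stripe 1

    # A crash tears stripe 0 mid-update (the raid5 write hole in the
    # layer below), leaving garbage under the old checksum; stripe 1
    # was written in a separate update and is intact.
    _, csum = stripes[0]
    stripes[0] = (b"garbage....", csum)

    print(read_metadata(stripes, [0, 1]))    # b'metadata v2' from copy #2

The scheme falls apart if both dup copies can end up in the same torn
mdadm stripe update, which is exactly the part I haven't verified.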

> >So while I agree in total that Btrfs raid56 isn't mature or tested
> >enough to consider it production ready, I think that's because of the
> >UNKNOWN causes for problems we've seen with raid56. Not the parity
> >scrub bug which - yeah NOT good, not least of which is the data
> >integrity guarantees Btrfs is purported to make are substantially
> >negated by this bug. I think the bark is worse than the bite. It is
> >not the bark we'd like Btrfs to have though, for sure.
> >
> 
> Current btrfs RAID5/6 scrub problem is, we don't take full usage of tree and
> data checksum.
[snip]

This leads directly to a variety of problems in the diagnostic tools:
scrub reports errors against more or less random devices, and it cannot
report the paths of files containing corrupted blocks when it is the
parity block that got corrupted.

btrfs also doesn't properly avoid the raid5 write hole.  After a crash,
a btrfs filesystem (like mdadm raid[56]) _must_ be scrubbed (resynced)
to reconstruct any parity left inconsistent by an incomplete stripe
update.  As long as all disks are working, the parity can be recomputed
from the data disks.  If a disk fails before that scrub completes, any
stripes that were being updated at the time of an earlier crash may be
destroyed.  And all of that assumes the scrub bugs are fixed first.
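
Here is that failure sequence as a toy two-data-disk raid5 stripe in
Python (my simplification; real stripes have more members and much
bigger blocks):

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    d0, d1 = b"AAAA", b"BBBB"
    parity = xor(d0, d1)                 # consistent stripe

    # Crash in the middle of a stripe update: the new d0 reached the
    # disk, the matching parity write did not.
    d0 = b"CCCC"
    stale_parity = parity

    # Scrub/resync while every disk is still present: parity is simply
    # recomputed from the data members, nothing is lost.
    parity = xor(d0, d1)

    # If d1's disk instead dies *before* that resync, reconstruction
    # has to use the stale parity and returns garbage, not b"BBBB":
    print(xor(d0, stale_parity) == d1)   # False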

If writes occur after a disk has failed, each of them temporarily
corrupts small amounts of data in the filesystem.  btrfs cannot tolerate
any metadata corruption (it relies on redundant metadata to self-repair),
so when a write to metadata is interrupted, the filesystem is instantly
doomed (damaged beyond the current tools' ability to repair and mount
read-write).

Currently the upper layers of the filesystem assume that once data
blocks are written to disk, they are stable.  This is not true on
raid5/6 because the parity and data blocks within a stripe cannot be
updated atomically.  btrfs doesn't avoid writing new data into the same
RAID stripe as old data (it provides an rmw (read-modify-write) function
for raid56, which is simply a bug in a CoW filesystem), so previously
committed data can be lost.  If the previously committed data is part of
the metadata tree, the filesystem is doomed; if it is ordinary data,
there are just a few dozen to a few thousand corrupted files for the
admin to clean up after each crash.

It might be possible to hack up the allocator to pack writes into empty
stripes to avoid the write hole, but every time I think about this it
looks insanely hard to do (or insanely wasteful of space) for data
stripes.
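
Roughly, the difference between the two write paths looks like this
(again just an illustration in Python, not the btrfs allocator):

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def full_stripe_write(new_blocks):
        # pack a transaction's writes into an empty stripe: parity is
        # computed from the new blocks alone, nothing already on disk
        # is read or overwritten, so a torn write can only hurt data
        # that was never committed anyway.
        parity = new_blocks[0]
        for b in new_blocks[1:]:
            parity = xor(parity, b)
        return new_blocks, parity

    def rmw_write(old_block, old_parity, new_block):
        # what raid56 does today: recompute parity from the old
        # contents, then rewrite data and parity in place -- two
        # device writes that cannot be made atomic, with committed
        # data elsewhere in the stripe depending on that parity.
        new_parity = xor(xor(old_parity, old_block), new_block)
        return new_block, new_parity

    print(full_stripe_write([b"new1", b"new2"]))
    print(rmw_write(b"old1", xor(b"old1", b"OLD!"), b"new1"))

The hard part is arranging for data writes to always take the first
path, which is where the space waste comes in.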
