On Sat, Mar 31, 2018 at 04:34:58PM -0600, Chris Murphy wrote:
> On Sat, Mar 31, 2018 at 12:57 AM, Goffredo Baroncelli
> <kreij...@inwind.it> wrote:
> > On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
> >>>> btrfs has no optimization like mdadm write-intent bitmaps; recovery
> >>>> is always a full-device operation.  In theory btrfs could track
> >>>> modifications at the chunk level but this isn't even specified in the
> >>>> on-disk format, much less implemented.
> >>> It could go even further; it would be sufficient to track which
> >>> *partial* stripes update will be performed before a commit, in one
> >>> of the btrfs logs. Then in case of a mount of an unclean filesystem,
> >>> a scrub on these stripes would be sufficient.
> >
> >> A scrub cannot fix a raid56 write hole--the data is already lost.
> >> The damaged stripe updates must be replayed from the log.
> >
> > Your statement is correct, but it doesn't consider the COW nature of btrfs.
> >
> > The key is that if a data write is interrupted, the whole transaction is
> > interrupted and aborted. And due to the COW nature of btrfs, the "old
> > state" is restored at the next reboot.
> >
> > What is needed in any case is rebuild of parity to avoid the "write-hole" 
> > bug.
> 
> Write hole happens on disk in Btrfs, but the ensuing corruption on
> rebuild is detected. Corrupt data never propagates. 

Data written with nodatasum or nodatacow is corrupted without detection
(same as running ext3/ext4/xfs on top of mdadm raid5 without a parity
journal device).

Metadata always has csums, and file data has csums if the files are
created with default attributes and mount options.  Those cases are
covered:  any corrupted data will give EIO on reads (except roughly once
per 4 billion blocks, where the CRC of the corrupted data happens to
match the stored csum).
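
A toy Python sketch of that check, with zlib.crc32 standing in for the
crc32c btrfs actually uses (none of this is real btrfs code):

    import os, zlib

    block = os.urandom(4096)
    stored_csum = zlib.crc32(block)        # csum recorded at write time

    corrupted = bytearray(block)
    corrupted[0] ^= 0xff                   # simulate on-disk corruption

    if zlib.crc32(bytes(corrupted)) != stored_csum:
        print("csum mismatch -> EIO")      # the overwhelmingly common case
    else:
        print("undetected corruption")     # ~1 in 2**32 random corruptions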

> The problem is that Btrfs gives up when it's detected.

Before recent kernels (4.14 or 4.15) btrfs would not attempt all possible
combinations of recovery blocks for raid6, and kernels earlier than
those would not recover correctly for raid5 either.  I think this has
all been fixed in recent kernels, but I haven't tested it myself, so
don't quote me on that.

Other than that, btrfs doesn't give up in the write hole case.
It rebuilds the data according to the raid5/6 parity algorithm, but
the algorithm doesn't produce correct data for interrupted RMW writes
when there is no stripe update journal.  There is nothing else to try
at that point.  By the time the error is detected the opportunity to
recover the data has long passed.

The data that comes out of the recovery algorithm is a mixture of old
and new data from the filesystem.  The "new" data is something that
was written just before a failure, but the "old" data could be data
of any age, even a block of free space, that previously existed on the
filesystem.  If you bypass the EIO from the failing csums (e.g. by using
btrfs rescue) it will appear as though someone took the XOR of pairs of
random blocks from the disk and wrote it over one of the data blocks
at random.  When this happens to btrfs metadata, it is effectively a
fuzz tester for tools like 'btrfs check' which will often splat after
a write hole failure happens.
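
Here is a toy Python sketch of that failure mode, assuming a minimal
2-data + 1-parity raid5 stripe with 8-byte "blocks" (made-up names,
not btrfs code):

    import os

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    old_d0, old_d1 = os.urandom(8), os.urandom(8)
    parity = xor(old_d0, old_d1)        # consistent, committed stripe

    d0 = os.urandom(8)                  # RMW update of d0 reaches disk...
    # ...but the crash happens before parity is rewritten, so parity
    # still matches old_d0, not the new d0.

    rebuilt_d1 = xor(d0, parity)        # disk with d1 fails; rebuild it
    print(rebuilt_d1 == old_d1)         # False: it is new_d0^old_d0^old_d1,
                                        # i.e. an XOR of unrelated blocks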

> If it assumes just a bit flip - not always a correct assumption but
> might be reasonable most of the time, it could iterate very quickly.

That is not how write hole works (or csum recovery for that matter).
Write hole producing a single bit flip would occur extremely rarely
outside of contrived test cases.

Recall that in a write hole, one or more 4K blocks are updated on some
of the disks in a stripe, but other blocks retain their original values
from prior to the update.  This is OK as long as all disks are online,
since the parity can be ignored or recomputed from the data blocks.  It is
also OK if the writes on all disks are completed without interruption,
since the data and parity eventually become consistent when all writes
complete as intended.  It is also OK if the entire stripe is written at
once, since then there is only one transaction referring to the stripe,
and if that transaction is not committed then the content of the stripe
is irrelevant.
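
The RMW shortcut itself is easy to sketch in toy Python (names are made
up; this is the generic raid5 update, not the kernel code):

    import os

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    old_data   = os.urandom(4096)    # block being overwritten
    old_parity = os.urandom(4096)    # current parity for the stripe
    new_data   = os.urandom(4096)    # block from the new transaction

    # Partial-stripe RMW: no need to read the other data blocks.
    new_parity = xor(xor(old_parity, old_data), new_data)

    # Two device writes are then issued: new_data to the data disk and
    # new_parity to the parity disk.  The write hole is the window in
    # which only one of them has reached stable storage.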

The write hole error event is when all of the following occur:

        - a stripe containing committed data from one or more btrfs
        transactions is modified by raid5/6 RMW update in a new
        transaction.  This is the usual case on a btrfs filesystem
        with the default, 'nossd' or 'ssd' mount options.

        - the write is not completed (due to crash, power failure, disk
        failure, bad sector, SCSI timeout, bad cable, firmware bug, etc),
        so the parity block is out of sync with modified data blocks
        (before or after, order doesn't matter).

        - the array is already degraded, or later becomes degraded before
        the parity block can be recomputed by a scrub.

Users can run scrub immediately after _every_ unclean shutdown to
reduce the risk of inconsistent parity and unrecoverable data should
a disk fail later, but this can only prevent future write hole events,
not recover data lost during past events.
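
A sketch of what that scrub pass does while all disks are still present
(toy Python; xor_all is a made-up helper):

    import os

    def xor_all(blocks):
        acc = bytes(len(blocks[0]))          # all-zero block
        for b in blocks:
            acc = bytes(x ^ y for x, y in zip(acc, b))
        return acc

    data_blocks = [os.urandom(4096) for _ in range(3)]   # csums all verify
    parity_on_disk = os.urandom(4096)                     # stale after crash

    expected = xor_all(data_blocks)
    if parity_on_disk != expected:
        parity_on_disk = expected    # scrub rewrites the parity block;
                                     # nothing here can recreate lost data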

If one of the data blocks is not available, its content cannot be
recomputed from parity due to the inconsistency within the stripe.
This will likely be detected as a csum failure (unless the data block
is part of a nodatacow/nodatasum file, in which case corruption occurs
but is not detected), except for the roughly one time in 4 billion when
the CRC32 of the corrupted block happens to match the stored csum.

If a damaged block contains btrfs metadata, the filesystem will be
severely affected:  read-only, up to 100% of data inaccessible, only
recovery methods involving brute force search will work.

> Flip bit, and recompute and compare checksum. It doesn't have to
> iterate across 64KiB times the number of devices. It really only has
> to iterate bit flips on the particular 4KiB block that has failed csum
> (or in the case of metadata, 16KiB for the default leaf size, up to a
> max of 64KiB).

A write hole is effectively up to 32768 bit flips in a 4K block--assuming
only one block is affected, which is not very likely.  Each disk in an
array can have dozens of block updates in flight when an interruption
occurs, so there can be millions of bits corrupted in a single write
interruption event (and dozens of opportunities to encounter the nominally
rare write hole itself).

An experienced forensic analyst armed with specialized tools, a database
of file formats, and a recent backup of the filesystem might be able to
recover the damaged data or deduce what it was.  btrfs, being mere
software running in the kernel, cannot.

There are two ways to solve the write hole problem and this is not one
of them.

> That's a maximum of 4096 iterations and comparisons. It'd be quite
> fast. And going for two bit flips while a lot slower is probably not
> all that bad either.

You could use that approach to fix a corrupted parity or data block
on a degraded array, but not a stripe that has data blocks destroyed
by an update with a write hole event.  Also this approach assumes that
whatever is flipping bits in RAM is not in and of itself corrupting data
or damaging the filesystem in unrecoverable ways, but most RAM-corrupting
agents in the real world do not limit themselves only to detectable and
recoverable mischief.
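
For what it's worth, the search being proposed is easy to sketch (toy
Python, zlib.crc32 standing in for crc32c, try_single_bit_repair is a
made-up name).  It costs at most 32768 CRCs per 4K block, but it only
repairs damage that really is a single flipped bit:

    import zlib

    def try_single_bit_repair(block, stored_csum):
        buf = bytearray(block)
        for i in range(len(buf) * 8):
            buf[i // 8] ^= 1 << (i % 8)            # flip one bit
            if zlib.crc32(bytes(buf)) == stored_csum:
                return bytes(buf)                   # candidate repair
            buf[i // 8] ^= 1 << (i % 8)            # flip it back
        return None                                 # not a one-bit error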

Aside:  As a best practice, if you see one-bit corruptions on your
btrfs filesystem, it is time to start replacing hardware, possibly also
finding a new hardware vendor or model (assuming the corruption is coming
from hardware, not a kernel memory corruption bug in some random device
driver).  Healthy hardware doesn't do bit flips.  So many things can go
wrong on unhealthy hardware, and they aren't all detectable or fixable.
It's one of the few IT risks that can be mitigated by merely spending
money until the problem goes away.

> Now if it's the kind of corruption you get from a torn or misdirected
> write, there's enough corruption that now you're trying to find a
> collision on crc32c with a partial match as a guide. That'd take a
> while and who knows you might actually get corrupted data anyway since
> crc32c isn't cryptographically secure.

All the CRC32 does is reduce the search space for data recovery
from 32768 bits to 32736 bits per 4K block.  It is not possible to
brute-force search a 32736-bit space (that's two to the power of 32736
possible combinations), and even if it was, there would be no way to
distinguish which of billions of billions of billions of billions...[over
4000 "billions of" deleted]...of billions of possible data blocks that
have a matching CRC is the right one.  A SHA256 as block csum would only
reduce the search space to 32512 bits.
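
The arithmetic, for the record:

    BITS_PER_4K_BLOCK = 4096 * 8                 # 32768
    print(BITS_PER_4K_BLOCK - 32)                # 32736 bits left by CRC32
    print(BITS_PER_4K_BLOCK - 256)               # 32512 bits left by SHA256
    # i.e. 2**32736 candidate blocks still share the stored CRC32.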

Our forensic analyst above could reduce the search space to a manageable
size for a data-specific recovery tool, but we can't put one of those
in the kernel.

Getting corrupted data out of a brute force search of multiple bit
flips against a checksum is not just likely--it's certain, if you can
even run the search long enough to get a result.  The number of corrupt
4K blocks with correct CRC outnumbers the number of correct blocks by 
ten thousand orders of magnitude.
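
Back-of-the-envelope for that figure:

    import math
    # 2**32736 corrupt-but-CRC-matching blocks vs at most a few billion
    # correct blocks on any real filesystem: roughly 9800+ decimal orders
    # of magnitude apart.
    print(round(32736 * math.log10(2)))          # ~9854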

It would work with a small number of bit flips, because one of the
properties of the CRC32 function is that it reliably detects error
bursts shorter than the polynomial.

