On Wed, Oct 12, 2016 at 01:48:58PM +0800, Qu Wenruo wrote:
> >btrfs also doesn't avoid the raid5 write hole properly.  After a crash,
> >a btrfs filesystem (like mdadm raid[56]) _must_ be scrubbed (resynced)
> >to reconstruct any parity that was damaged by an incomplete data stripe
> >update.
> > As long as all disks are working, the parity can be reconstructed
> >from the data disks.  If a disk fails prior to the completion of the
> >scrub, any data stripes that were written during previous crashes may
> >be destroyed.  And all that assumes the scrub bugs are fixed first.
> 
> This is true.
> I didn't take this into account.
> 
> But this is not a *single* problem; it's 2 problems:
> 1) Power loss
> 2) Device crash
> 
> Before making things complex, why not focus on a single problem?

Solve one problem at a time--but don't lose sight of the whole list of
problems either, especially when they are interdependent.

> Not to mention the possibility is much smaller than that of a single problem.

Having field experience with both problems, I disagree with that.
The power loss/system crash problem is much more common than the device
failure/scrub problems.  More data is lost when a disk fails, but the
amount of data lost in a power failure isn't zero.  Before I gave up
on btrfs raid5, it worked out to about equal amounts of admin time
recovering from the two different failure modes.

> >If writes occur after a disk fails, they all temporarily corrupt small
> >amounts of data in the filesystem.  btrfs cannot tolerate any metadata
> >corruption (it relies on redundant metadata to self-repair), so when a
> >write to metadata is interrupted, the filesystem is instantly doomed
> >(damaged beyond the current tools' ability to repair and mount
> >read-write).
> 
> That's why we used a higher duplication level for metadata by default.
> And considering metadata size, it's much more acceptable to use RAID1 for
> metadata rather than RAID5/6.

Data RAID5 with metadata RAID1 makes a limited amount of sense.  Small
amounts of data are still lost on power failures due to RMW on the data
stripes.  It just doesn't break the entire filesystem, because the
metadata is on RAID1 and RAID1 doesn't use RMW.
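
For what it's worth, that combination is just the usual profile split at
mkfs time (device names below are only placeholders), e.g.:

        mkfs.btrfs -d raid5 -m raid1 \
                /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde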

Data RAID6 does not make sense, unless we also have a way to have RAID1
make more than one mirror copy.  With one mirror copy an array is not
able to tolerate two disk failures, so the Q stripe for RAID6 is wasted
CPU and space.

> >Currently the upper layers of the filesystem assume that once data
> >blocks are written to disk, they are stable.  This is not true in raid5/6
> >because the parity and data blocks within each stripe cannot be updated
> >atomically.
> 
> True, but if we ignore parity, we'd find that RAID5 is just RAID0.

Degraded RAID5 is not RAID0.  RAID5 has strict constraints that RAID0
does not.  The way a RAID5 implementation behaves in degraded mode is
the thing that usually matters after a disk fails.

> COW ensures (cowed) data and metadata are all safe, and checksums ensure
> they are OK, so even for RAID0 it's not a problem for cases like power loss.

This is not true.  btrfs does not use stripes correctly to get CoW to
work on RAID5/6.  This is why power failures result in small amounts of
data loss, if not a filesystem-destroying disaster.

For CoW to work you have to make sure that you never modify a RAID stripe
that already contains committed data.  Let's consider a 5-disk array
and look at what we get when we try to reconstruct disk 2:

        Disk1  Disk2  Disk3  Disk4  Disk5
        Data1  Data2  Parity Data3  Data4

Suppose one transaction writes Data1-Data4 and Parity.  This is OK
because no metadata reference would point to this stripe before it
was committed to disk.  Here's some data as an example:

        Disk1  Disk2  Disk3  Disk4  Disk5  Reconstructed Disk2
        1111   2222   ffff   4444   8888   2222

(to keep things simpler I'm just using Parity = Data1 ^ Data2 ^ Data3 ^
Data4 here)

Later, a transaction deletes Data3 and Data4.  Still OK, because
we didn't modify any data in the stripe, so we may still be able to
reconstruct the data from missing disks.  The checksums for Data3 and
Data4 are gone now, though, so if there is any bitrot we lose the whole
stripe (we can't tell whether the data or the parity is wrong, we can't
ignore the rotted data because it's included in the parity, and we didn't
update the parity because deleting an extent doesn't modify its data
stripe).

        Disk1  Disk2  Disk3  Disk4  Disk5  Reconstructed Disk2
        1111   2222   ffff   4444   8888   2222

Now a third transaction allocates Data3 and Data4.  Bad.  First, Disk4
is written and existing data is temporarily corrupted:

        Disk1  Disk2  Disk3  Disk4  Disk5  Reconstructed Disk2
        1111   2222   ffff   1234   8888   7452

then Disk5 is written, and the data is still corrupted:

        Disk1  Disk2  Disk3  Disk4  Disk5  Reconstructed Disk2
        1111   2222   ffff   1234   5678   aaa2

then parity is written, and the data isn't corrupted any more:

        Disk1  Disk2  Disk3  Disk4  Disk5  Reconstructed Disk2
        1111   2222   777f   1234   5678   2222

If we are interrupted at any point after the first of these writes and
before the last (whatever order they are issued in), previously committed
data is corrupted.  If the drives have deep queues we could have multiple
RAID stripes corrupted at once.  csum verification will not help us
because there are no csums for Disk4 and Disk5 yet--regardless of what
else happens, the csum tree is not committed until after Parity is
written.
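
To make the tables above easy to double-check, here's a minimal C sketch
of the plain XOR parity and reconstruction I'm using in this example.
It's an illustration only (the values are just the ones from the tables),
not btrfs's actual raid56 code:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* Committed stripe from the first table. */
        uint16_t d1 = 0x1111, d2 = 0x2222, d3 = 0x4444, d4 = 0x8888;
        uint16_t parity = d1 ^ d2 ^ d3 ^ d4;            /* 0xffff */

        /* Reconstruct Disk2 from the other members: 0x2222, good. */
        printf("%04x\n", d1 ^ parity ^ d3 ^ d4);

        /* Third transaction rewrites Disk4; parity is stale, so
         * reconstruction returns garbage: 0x7452. */
        d3 = 0x1234;
        printf("%04x\n", d1 ^ parity ^ d3 ^ d4);

        /* Disk5 rewritten next, parity still stale: 0xaaa2. */
        d4 = 0x5678;
        printf("%04x\n", d1 ^ parity ^ d3 ^ d4);

        /* Only after parity is rewritten does Disk2 come back: 0x2222. */
        parity = d1 ^ d2 ^ d3 ^ d4;                     /* 0x777f */
        printf("%04x\n", d1 ^ parity ^ d3 ^ d4);
        return 0;
}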

This is not CoW at all.  It's not a rare event because the btrfs allocator
intentionally tries to maintain locality, and therefore prefers to reuse
partially-filled RAID stripes instead of spreading allocations out over
empty stripes.  The location of damaged data changes rapidly over time,
but _something_ is damaged much of the time.

It's a filesystem-eating disaster if the stripe contains metadata and
Disk2 fails.

> So we should follow csum first and then parity.
> 
> If we follow this principle, RAID5 would be a RAID0 with a slightly
> higher possibility of recovering in some cases, like one missing device.

That is much less than what most users expect from the name "raid5", and
much less than what should be possible from a CoW filesystem as well.

> So, I'd like to fix RAID5 scrub to make it at least better than RAID0, not
> worse than RAID0.

Do please continue fixing scrub!  I meant to point out only that there
are several other problems to fix before we can consider using btrfs
raid5 and especially raid6 for the purposes that users expect.

> > btrfs doesn't avoid writing new data in the same RAID stripe
> >as old data (it provides a rmw function for raid56, which is simply a bug
> >in a CoW filesystem), so previously committed data can be lost.  If the
> >previously committed data is part of the metadata tree, the filesystem
> >is doomed; for ordinary data blocks there are just a few dozen to a few
> >thousand corrupted files for the admin to clean up after each crash.
> 
> In fact, the _concept_ to solve such RMW behavior is quite simple:
> 
> Make sector size equal to stripe length. (Or vice versa if you like)
> 
> Although the implementation will be more complex, people like Chandan are
> already working on sub-page-size sector size support.

So...metadata blocks would be 256K on the 5-disk RAID5 example above,
and any file smaller than 256K would be stored inline?  Ouch.  That would
also imply the compressed extent size limit (currently 128K) has to become
much larger.
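
For reference, assuming btrfs keeps its usual 64K per-device stripe
element, the forced block size in that scheme scales with the number of
data disks:

        sector size = (total disks - parity disks) * 64K
        5-disk RAID5: (5 - 1) * 64K = 256K
        5-disk RAID6: (5 - 2) * 64K = 192K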

I had been thinking that we could inject "plug" extents to fill up
RAID5 stripes.  This lets us keep the 4K block size for allocations,
but at commit (or delalloc) time we would fill up any gaps in new RAID
stripes to prevent them from being modified.  As the real data is deleted
from the RAID stripes, it would be replaced by "plug" extents to keep any
new data from being allocated in the stripe.  When the stripe consists
entirely of "plug" extents, the plug extent would be deleted, allowing
the stripe to be allocated again.  The "plug" data would be zero for
the purposes of parity reconstruction, regardless of what's on the disk.
Balance would just throw the plug extents away (no need to relocate them).
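
Here's a rough toy model of that plug-extent bookkeeping, just to make
the idea concrete.  All names are made up and none of this is meant as
actual btrfs code; a RAID stripe is simply modelled as an array of 4K
slots:

#include <stdio.h>

enum slot { FREE, DATA, PLUG };
#define SLOTS 64    /* e.g. 4 data disks * 64K / 4K blocks */

/* At commit (or delalloc) time, fill any gaps in a newly written
 * stripe with plugs so the stripe is never RMW'd later. */
static void commit_stripe(enum slot s[SLOTS])
{
        for (int i = 0; i < SLOTS; i++)
                if (s[i] == FREE)
                        s[i] = PLUG;
}

/* When real data is deleted, replace it with a plug so the slot is
 * not reallocated while other data still lives in the stripe. */
static void delete_extent(enum slot s[SLOTS], int i)
{
        s[i] = PLUG;
}

/* A stripe becomes allocatable again only when nothing but plugs
 * remain; then the plugs are dropped and the whole stripe is free. */
static int try_release_stripe(enum slot s[SLOTS])
{
        for (int i = 0; i < SLOTS; i++)
                if (s[i] == DATA)
                        return 0;
        for (int i = 0; i < SLOTS; i++)
                s[i] = FREE;
        return 1;
}

int main(void)
{
        enum slot stripe[SLOTS] = { DATA, DATA };  /* rest start FREE */

        commit_stripe(stripe);            /* gaps become plugs */
        delete_extent(stripe, 0);         /* data -> plug */
        printf("reusable: %d\n", try_release_stripe(stripe));  /* 0 */
        delete_extent(stripe, 1);
        printf("reusable: %d\n", try_release_stripe(stripe));  /* 1 */
        return 0;
}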

For nodatacow files, btrfs would need to ensure that a nodatacow extent
never appears in the same RAID stripe as a CoW extent.  nodatacow files
would use RMW writes and not preserve data integrity in the case of any
failure.

I'm not sure how to fit prealloc extents into this idea, since there would
be no way to safely write to them in sub-stripe-sized units.

> I think sector sizes larger than page size are already on the TODO list,
> and when that's done, we can do real COW RAID5/6.
> 
> Thanks,
> Qu
> 
> >
> >It might be possible to hack up the allocator to pack writes into empty
> >stripes to avoid the write hole, but every time I think about this it
> >looks insanely hard to do (or insanely wasteful of space) for data
> >stripes.
> >
